Overview
- Write a crawler with the Scrapy framework that scrapes articles from the New Yorker website, including each article's URL, title, author, publication date, body text and images.
- Save the scraped data into a MySQL database.
- Display the stored articles in a paginated web page.
- For each article body, count the total words, paragraphs, sentences and vocabulary size, and compute the average word length (letters per word), average sentence length (words per sentence) and average paragraph length (sentences per paragraph).
Development environment
Python 2.7
PyCharm 2016.2.3
Windows 7 64-bit
Python 2.7 installation
PyCharm installation
Download PyCharm from the official site (http://www.jetbrains.com/pycharm/download/) and run the installer.
Installing the Scrapy framework
Scrapy pulls in a number of dependencies such as lxml, pywin32, twisted, pyOpenSSL and zope.interface, and these are where installation most often fails; the simplest remedy is to install the latest Python 2.7 release first. Then run pip install scrapy at the cmd prompt. Once it finishes, type scrapy in cmd to verify the installation. The output looks like this:
Next, run python in cmd to enter the Python console and check each dependency:
- Run import lxml; if no error is raised, lxml is installed correctly.
- Run import twisted; if no error is raised, twisted is installed correctly.
- Run import OpenSSL; if no error is raised, pyOpenSSL is installed correctly.
- Run import zope.interface; if no error is raised, zope.interface is installed correctly.
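If you prefer to check everything in one pass, the following snippet (run in the same Python console) imports all four dependencies at once; any ImportError tells you exactly which package is still missing:

```python
# Sanity check for Scrapy's dependencies: an ImportError here
# pinpoints the package that still needs to be installed.
import lxml
import twisted
import OpenSSL
import zope.interface

print "lxml, twisted, pyOpenSSL and zope.interface all import fine"
```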
Installing Flask
- Create a folder named myvir on the D drive (any folder will do), then open cmd and cd into it.
- Install virtualenv: pip install virtualenv
- Install Flask: pip install Flask
Development
Implementing the crawler
Open cmd and run scrapy startproject newyorker to create a project named newyorker. Scrapy generates the following directory layout:
```
newyorker/
    scrapy.cfg
    newyorker/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```
These files are:
scrapy.cfg: the project's configuration file
newyorker/: the project's Python module; this is where the code goes.
newyorker/items.py: the project's item definitions.
newyorker/pipelines.py: the project's item pipelines.
newyorker/settings.py: the project's settings.
newyorker/spiders/: the directory that holds the spider code.
Inside the project's spiders directory we create a new_spider.py file; it contains the code that crawls the articles and computes the text statistics. Its content is as follows:
```python
import re
import requests
from scrapy.http import Request
from scrapy.spiders import CrawlSpider
from bs4 import BeautifulSoup
from scrapy.selector import Selector
from newyorker.items import NewyorkerItem


class newyorker(CrawlSpider):
    name = "newyorker"
    allowed_domains = ["newyorker.com"]
    start_urls = ['http://www.newyorker.com/news/daily-comment/']

    def parse(self, response):
        # collect the article links on the current listing page
        sel = Selector(response)
        infos = sel.xpath("//main/div/ul/li")
        for info in infos:
            article_url_part = info.xpath("div/h4/a/@href").extract()[0]
            article_url = 'http://www.newyorker.com/' + article_url_part
            yield Request(article_url, meta={'article_url': article_url},
                          callback=self.parse_item)
        # follow the next listing pages
        urls = ['http://www.newyorker.com/news/daily-comment/page/{}'.format(str(i))
                for i in range(1, 10)]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse_item(self, response):
        item = NewyorkerItem()
        item['article_url'] = response.meta['article_url']
        data = requests.get(response.meta['article_url'])
        sel = Selector(response)
        title = sel.xpath("//h1/text()").extract()[0]
        author = sel.xpath("//div/div/div[2]/p/a/text()").extract()[0]
        time = sel.xpath("//hgroup/div[2]/p/text()").extract()[0]
        soup = BeautifulSoup(data.text, 'lxml')
        image_urls = (soup.select('figure > div > picture > img')[0].get('srcset')
                      if soup.find_all('picture', 'component-responsive-image')
                      else None)
        articles = soup.select('#articleBody p')
        article = [i.text + '<br />' for i in articles]
        article_process = (str(article).replace("', '", " ")
                           .strip("['").strip("']").strip(" ?")
                           .replace('\\xa0', ''))
        # word / sentence / paragraph / vocabulary / letter counts
        w_sum = len(re.findall('[a-zA-Z]+', article_process))
        s_sum = len(re.findall('([.!?].\s?[A-Z"(])', article_process))
        p_sum = len(article)
        v_sum = len(set(re.findall('[a-zA-Z]+', article_process.lower())))
        a_sum = len(re.findall('[a-zA-Z]', article_process))
        # use float division so the averages are not truncated under Python 2
        avg_w = round(float(a_sum) / w_sum, 2)   # letters per word
        avg_s = round(float(w_sum) / s_sum, 2)   # words per sentence
        avg_p = round(float(s_sum) / p_sum, 2)   # sentences per paragraph
        item['title'] = title
        item['author'] = author
        item['time'] = time
        item['article'] = article_process
        item['image_urls'] = image_urls
        item['w_sum'] = w_sum
        item['s_sum'] = s_sum
        item['p_sum'] = p_sum
        item['v_sum'] = v_sum
        item['a_sum'] = a_sum
        item['avg_w'] = avg_w
        item['avg_s'] = avg_s
        item['avg_p'] = avg_p
        yield item
```
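The spider imports NewyorkerItem from newyorker/items.py. That class simply declares one Field per value the spider fills in; a minimal sketch based on the assignments above:

```python
# items.py -- one scrapy.Field per value assigned in new_spider.py
import scrapy


class NewyorkerItem(scrapy.Item):
    article_url = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    article = scrapy.Field()
    image_urls = scrapy.Field()
    w_sum = scrapy.Field()   # total words
    s_sum = scrapy.Field()   # total sentences
    p_sum = scrapy.Field()   # total paragraphs
    v_sum = scrapy.Field()   # vocabulary size (distinct words)
    a_sum = scrapy.Field()   # total letters
    avg_w = scrapy.Field()   # average word length
    avg_s = scrapy.Field()   # average sentence length
    avg_p = scrapy.Field()   # average paragraph length
```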
We also need to connect to the database. Before doing so, make sure pymysql, the driver that lets Python talk to MySQL, is installed. The database code lives in pipelines.py:
```python
import pymysql


def dbHandle():
    # open a connection to the local MySQL database "text"
    conn = pymysql.connect(
        host="localhost",
        user="root",
        passwd="root",
        db="text",
        port=3306,
        charset="utf8",
        use_unicode=False
    )
    return conn


class newyorkerPipeline(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()
        cursor = dbObject.cursor()
        cursor.execute("USE text")
        sql = ("INSERT INTO newtext(title,author,time,article,image_urls,"
               "w_sum,s_sum,p_sum,v_sum,a_sum,avg_w,avg_s,avg_p) "
               "VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
        try:
            cursor.execute(sql, (item['title'], item['author'], item['time'],
                                 item['article'], item['image_urls'],
                                 item['w_sum'], item['s_sum'], item['p_sum'],
                                 item['v_sum'], item['a_sum'],
                                 item['avg_w'], item['avg_s'], item['avg_p']))
            dbObject.commit()
        except BaseException as e:
            print e
            dbObject.rollback()
        return item
```
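For Scrapy to actually call this pipeline it also has to be enabled in settings.py (the number is the pipeline's priority); a minimal sketch:

```python
# settings.py -- register the MySQL pipeline so process_item() is called
ITEM_PIPELINES = {
    'newyorker.pipelines.newyorkerPipeline': 300,
}
```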
We start the crawl with scrapy crawl newyorker; the output looks like this:
Opening the database, we can see that the scraped data and the computed statistics have been stored. Of course, the database and table have to be created before running the crawl.
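The column names follow from the INSERT statement in the pipeline; the column types below are my own assumptions. A one-off helper to create the database and table might look like this:

```python
# Create the "text" database and "newtext" table used by the pipeline.
# Column names match the INSERT above; the types are assumptions.
import pymysql

conn = pymysql.connect(host="localhost", user="root", passwd="root",
                       port=3306, charset="utf8")
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS text DEFAULT CHARACTER SET utf8")
cur.execute("USE text")
cur.execute("""
    CREATE TABLE IF NOT EXISTS newtext (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        author VARCHAR(255),
        time VARCHAR(64),
        article TEXT,
        image_urls TEXT,
        w_sum INT, s_sum INT, p_sum INT, v_sum INT, a_sum INT,
        avg_w FLOAT, avg_s FLOAT, avg_p FLOAT
    ) DEFAULT CHARSET=utf8
""")
conn.commit()
conn.close()
```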
Displaying the data
We create a Flask project with the following structure:
```
D:.
├─.idea
│  └─dictionaries
├─static
│  └─web.css
├─templates
│  ├─base.html
│  └─index.html
├─app.cfg
└─app.py
```
The code in app.py is as follows:
```python
from __future__ import unicode_literals

import pymysql
from flask import Flask, render_template, g, current_app
from flask_paginate import Pagination, get_page_args
import click

click.disable_unicode_literals_warning = True

app = Flask(__name__)
app.config.from_pyfile('app.cfg')


@app.before_request
def before_request():
    # open a database connection for each request
    g.conn = pymysql.connect(
        host="localhost",
        user="root",
        passwd="root",
        db="text",
        port=3306,
        charset="utf8",
        use_unicode=False
    )
    g.cur = g.conn.cursor()


@app.teardown_request
def teardown(error):
    if hasattr(g, 'conn'):
        g.conn.close()


@app.route('/')
def index():
    g.cur.execute('select count(*) from newtext')
    total = g.cur.fetchone()[0]
    page, per_page, offset = get_page_args(page_parameter='page',
                                           per_page_parameter='per_page')
    sql = 'select title from newtext order by title limit {}, {}'\
        .format(offset, per_page)
    g.cur.execute(sql)
    users = g.cur.fetchall()
    pagination = get_pagination(page=page,
                                per_page=per_page,
                                total=total,
                                record_name='users',
                                format_total=True,
                                format_number=True,
                                )
    return render_template('index.html',
                           users=users,
                           page=page,
                           per_page=per_page,
                           pagination=pagination,
                           )


@app.route('/users/', defaults={'page': 1})
@app.route('/users', defaults={'page': 1})
@app.route('/users/page/<int:page>/')
@app.route('/users/page/<int:page>')
def users(page):
    g.cur.execute('select count(*) from newtext')
    total = g.cur.fetchone()[0]
    page, per_page, offset = get_page_args()
    sql = 'select title from newtext order by title limit {}, {}'\
        .format(offset, per_page)
    g.cur.execute(sql)
    users = g.cur.fetchall()
    pagination = get_pagination(page=page,
                                per_page=per_page,
                                total=total,
                                record_name='users',
                                format_total=True,
                                format_number=True,
                                )
    return render_template('index.html',
                           users=users,
                           page=page,
                           per_page=per_page,
                           pagination=pagination,
                           active_url='users-page-url',
                           )


@app.route('/search/<name>/')
@app.route('/search/<name>')
def search(name):
    """Search articles whose title contains the given keyword."""
    page, per_page, offset = get_page_args()
    # pymysql uses %s placeholders for query parameters
    args = ('%{}%'.format(name),)
    g.cur.execute('select count(*) from newtext where title like %s', args)
    total = g.cur.fetchone()[0]
    sql = ('select title from newtext where title like %s '
           'order by title limit {}, {}').format(offset, per_page)
    g.cur.execute(sql, args)
    users = g.cur.fetchall()
    pagination = get_pagination(page=page,
                                per_page=per_page,
                                total=total,
                                record_name='users',
                                )
    return render_template('index.html',
                           users=users,
                           page=page,
                           per_page=per_page,
                           pagination=pagination,
                           )


def get_css_framework():
    return current_app.config.get('CSS_FRAMEWORK', 'bootstrap3')


def get_link_size():
    return current_app.config.get('LINK_SIZE', 'sm')


def show_single_page_or_not():
    return current_app.config.get('SHOW_SINGLE_PAGE', False)


def get_pagination(**kwargs):
    kwargs.setdefault('record_name', 'records')
    return Pagination(css_framework=get_css_framework(),
                      link_size=get_link_size(),
                      show_single_page=show_single_page_or_not(),
                      **kwargs
                      )


@click.command()
@click.option('--port', '-p', default=5000, help='listening port')
def run(port):
    app.run(debug=True, port=port)


if __name__ == '__main__':
    run()
```
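app.config.from_pyfile('app.cfg') loads a Python-syntax config file. A minimal app.cfg covering the keys that get_css_framework(), get_link_size() and show_single_page_or_not() read might look like this (the values shown are simply the defaults the code falls back to):

```python
# app.cfg -- loaded with app.config.from_pyfile(); plain Python assignments
CSS_FRAMEWORK = 'bootstrap3'   # pagination markup style used by flask_paginate
LINK_SIZE = 'sm'               # bootstrap link size for the pager
SHOW_SINGLE_PAGE = False       # hide the pager when there is only one page
```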
We run the project and open 127.0.0.1:5000 in a browser; the result looks like this:
The web app successfully reads the data from the MySQL database and displays it with pagination.
The full source code for this project is on my GitHub: link