An Introduction to Using Scrapy
1. What Scrapy is
2. Installing Scrapy
3. Scrapy modules
4. A Scrapy spider example
1. What Scrapy is
Scrapy is a crawling framework written in pure Python on top of Twisted. You only need to implement a few custom modules to get a working spider that scrapes web pages and images, which makes it very convenient.
Scrapy uses the Twisted asynchronous networking library to handle network communication. Its architecture is clean, and it exposes a variety of middleware hooks, so it can be adapted flexibly to all kinds of requirements.
2. Installing Scrapy
Installation is simple: just run pip install scrapy. One limitation (at the time of writing) is that Scrapy supports Python 2.7 but not Python 3. To start a new project, run scrapy startproject projectname.
3. Scrapy modules
After generating a Scrapy project you will see the following files and Python packages:
projectname/
    pipelines
    middlewares
    spiders
    settings.py
    items.py
    scrapy.cfg
Each module is introduced below:
1) pipelines
The pipeline module processes the structured data that the spider module has extracted; this is where you do whatever you want with the data, such as filtering and saving it. Creating a pipeline takes two steps: define a pipeline class inside this Python package, then register it in settings.py. Here is one of my examples. Create a Python file down_image.py inside the package and define a class:
import os
import codecs
import json

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class MyImagePipeline(ImagesPipeline):
    def __init__(self, store_uri, download_func=None):
        # store_uri is filled in automatically from the IMAGES_STORE setting in settings.py
        self.store_path = store_uri
        super(MyImagePipeline, self).__init__(store_uri, download_func=download_func)

    def get_media_requests(self, item, info):
        # only schedule a download when the item actually carries a URL
        url = item.get('url', None)
        if url is not None:
            yield scrapy.Request(url)

    def item_completed(self, results, item, info):
        # called once the scrapy.Request has finished downloading
        image_paths = [x['path'] for ok, x in results if ok]
        if image_paths:
            image_path = image_paths[0]
            item['image_path'] = os.path.join(os.path.abspath(self.store_path),
                                              image_path) if image_path else ''
        return item


class JsonWriterPipeline(object):
    def __init__(self):
        self.file = codecs.open('store.html', 'w', encoding='utf8')

    def process_item(self, item, spider):
        line = json.dumps(item['url'], ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # called when the spider is closed
        self.file.close()
This pipeline first checks whether the url field is empty; if it is not, the image is downloaded, and once the download finishes the resulting file path is saved back onto the item.
Then register the pipelines in settings.py:
ITEM_PIPELINES = [
    'house.pipelines.down_image.MyImagePipeline',
    'house.pipelines.store_original.JsonWriterPipeline',
]
2) middlewares
Middlewares act like hooks: they let you pre- and post-process the crawl, for example to modify request headers or filter URLs.
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class MyUserAgentDownloaderMiddleware(UserAgentMiddleware):
    user_agent_list = [
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',

        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.2; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
        'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)',
        'Mozilla/5.0 (Windows; U; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)',

        'Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1',
        'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:15.0) Gecko/20120910144328 Firefox/15.0.2',

        'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a3pre) Gecko/20070330',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203',

        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
        'Opera/9.80 (X11; Linux x86_64; U; fr) Presto/2.9.168 Version/11.50',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; de) Presto/2.9.168 Version/11.52',

        'Mozilla/5.0 (Windows; U; Win 9x 4.90; SG; rv:1.9.2.4) Gecko/20101104 Netscape/9.1.0285',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
    ]

    @property
    def user_agent(self):
        return self._user_agent

    @user_agent.setter
    def user_agent(self, value):
        self._user_agent = value

    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        self.user_agent = random.choice(self.user_agent_list)
        request.headers.setdefault('User-Agent', self.user_agent)
This middleware adds a randomly chosen User-Agent header to every request.
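For the middleware to actually run, it also has to be enabled in settings.py. A minimal sketch, assuming the class lives in a middlewares package under the house project (the dotted path and the priority value are assumptions to adjust to your own layout):
# settings.py -- enable the custom downloader middleware
DOWNLOADER_MIDDLEWARES = {
    # dotted path is an assumption about where the class lives in this project
    'house.middlewares.MyUserAgentDownloaderMiddleware': 400,
    # disable the stock UserAgentMiddleware so the two do not compete
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}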
3) settings.py
This one is easy to understand: it holds all of the project configuration. There are many options; the basic ones appear in the generated file, and the full list is documented on the Scrapy website. Below is just a small excerpt:
IMAGES_STORE = os.path.join(path_dir, 'multimedia/images')
IMAGES_EXPIRES = 30
IMAGES_THUMBS = {
    'small': (70, 70),
    'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
FILES_STORE = os.path.join(path_dir, 'multimedia/files')
FILES_EXPIRES = 50

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = [
    'house.pipelines.down_image.MyImagePipeline',
    'house.pipelines.store_original.JsonWriterPipeline',
]

LOG_LEVEL = 'INFO'
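One side note: recent Scrapy versions expect ITEM_PIPELINES to be a dict that maps each pipeline's dotted path to an integer priority (lower numbers run first). The same registration in dict form would look roughly like this (the priority values here are just example numbers):
ITEM_PIPELINES = {
    'house.pipelines.down_image.MyImagePipeline': 300,
    'house.pipelines.store_original.JsonWriterPipeline': 800,
}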
4) spiders
This is the soul of the crawler: it fetches the required pages, defines which data to extract and the crawling rules, and parses the responses into structured data.
5) items.py
This is where the fields of the scraped data are defined. An Item is used much like a dict, as the usage example after the class definition below shows.
import scrapy


class HouseItem(scrapy.Item):
    # define the fields for your item here, e.g.:
    name = scrapy.Field()
    url = scrapy.Field()
    image_path = scrapy.Field()
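As a quick illustration of the dict-like behaviour (the values below are made up):
item = HouseItem()
item['name'] = 'example house'          # set a declared field
item['url'] = 'http://example.com/1'    # behaves like a normal dict key
print(item.get('image_path', ''))       # .get() with a default also works
Note that any field you assign in a spider or pipeline (for example title, content and author_info in the spider example later on) has to be declared as a scrapy.Field() this way first; assigning an undeclared field raises a KeyError.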
6) scrapy.cfg
The project-level configuration file; it tells Scrapy where the settings module lives.
[settings]
default = house.settings
[deploy]
project = house
4. A Scrapy spider example
I have already covered writing pipelines, middlewares and settings above, so this part focuses on the spiders.
As an example I scraped articles from 九九文章网, and I will walk through it in a few pieces:
- name: the spider's name, which must be unique.
- start_urls: the URLs the spider starts crawling from; the links to the actual data are extracted from these pages.
- start_requests: used when no start_urls are given; this method must return (or yield) the first requests the spider will crawl.
- parse: the default callback that is invoked when a request does not specify its own callback.
Here is the code I wrote:
import requests
from bs4 import BeautifulSoup
from StringIO import StringIO  # Python 2, matching the environment described above

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector

from house.items import HouseItem
# JiJinpin and JiSanwen are site-config classes (request_url, list id) from elsewhere
# in the author's project; their import is not shown in the original post.


class JjJdpider(CrawlSpider):
    name = 'jjjd'

    def start_requests(self):
        # collect every list page of the two target sections, then schedule them
        sites = [JiJinpin(), JiSanwen()]
        urls = []
        for site in sites:
            response = requests.get(site.request_url).content
            response = BeautifulSoup(StringIO(response), 'html5lib')
            pages = response.find('span', attrs={'class': 'pageinfo'}).strong.string
            pages = int(str(pages))
            for page in range(1, pages + 1):
                url = site.request_url + 'list_%d_%d.html' % (site.list, page)
                urls.append(url)
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        content = response.body
        if not content:
            self.log('the content is empty')
            return
        sel = Selector(response)
        urls = sel.xpath('//div[@class="list_box"]/ul[@class="e2"]/li/h3/a/@href').extract()
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_items)

    def parse_items(self, response):
        """
        This method, as well as any other Request callback,
        must return an iterable of Requests and/or dicts or Item objects.
        :param response:
        :return: item
        """
        sel = Selector(response)
        item = HouseItem()
        title = sel.xpath('//div[@class="art_viewbox bd"]/div[@class="title"]/h1/text()').extract_first()
        authorInfo = sel.xpath('//div[@class="art_viewbox bd"]/div[@class="info"]').extract_first()
        content = sel.xpath('//div[@class="art_viewbox bd"]/div[@class="content"]/p').extract()
        if content:
            article_content = ' '.join(content) if len(content) > 1 else content[0]
            # title, content and author_info must be declared as Fields on HouseItem
            item['title'] = title
            item['content'] = article_content
            item['author_info'] = authorInfo
            yield item
The flow is easy to follow: when the spider runs, start_requests is executed first, then parse is called for each list page, and the final yield scrapy.Request(url=url, callback=self.parse_items) in parse hands each article link over to parse_items, which extracts the actual data.
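To run the spider, use the scrapy crawl command with the spider's name from inside the project directory (the -o flag for exporting items to a file is optional):
scrapy crawl jjjd
scrapy crawl jjjd -o articles.json   # optionally dump the scraped items to a file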
Summary
This post only covers the basics of Scrapy; in follow-up posts I will look at topics such as using proxy IPs and simulating logins.