Introduction to Using Scrapy

1. Scrapy introduction
2. Installing Scrapy
3. Scrapy modules
4. A Scrapy spider example

1. Scrapy introduction

Scrapy is a crawler framework built on Twisted and implemented in pure Python. You only need to write a few custom modules to put together a spider that scrapes web pages and images, which makes it very convenient.

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its architecture is clean, and it exposes a variety of middleware hooks, so it can be adapted flexibly to all kinds of requirements.

2. Installing Scrapy

Installation is simple: just run pip install scrapy. One limitation at the time of writing is that Scrapy supports Python 2.7+ but not Python 3 (newer releases have since added Python 3 support). To start a project, run the command scrapy startproject projectname.

3. Scrapy modules

After generating a Scrapy project you will see a handful of files and Python packages:

projectname/

  • pipelines/
  • middlewares/
  • spiders/
  • settings.py
  • items.py
  • scrapy.cfg

Let me introduce each of these modules:

1) pipelines

The pipeline module processes the structured data that the spider has extracted; this is where you do whatever you want with that data, such as filtering or persisting it. Creating a pipeline takes two steps: define a pipeline class inside this package, then register it in settings.py. Here is one of my examples: create a Python file down_image.py in the package and define a class:

import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class MyImagePipeline(ImagesPipeline):

    def __init__(self, store_uri, download_func=None):
        # store_uri is filled in automatically from the IMAGES_STORE setting in settings.py
        self.store_path = store_uri
        super(MyImagePipeline, self).__init__(store_uri, download_func=download_func)

    def get_media_requests(self, item, info):
        # only request the image if the item actually carries a url
        url = item.get('url', None)
        if url is not None:
            yield scrapy.Request(url)

    def item_completed(self, results, item, info):
        # called once all requests issued by get_media_requests have finished downloading
        image_paths = [x['path'] for ok, x in results if ok]
        if image_paths:
            image_path = image_paths[0]
            item['image_path'] = (os.path.join(os.path.abspath(self.store_path), image_path)
                                  if image_path else '')
        return item


# store_original.py -- a separate file in the same pipelines package
import codecs
import json


class JsonWriterPipeline(object):

    def __init__(self):
        self.file = codecs.open('store.html', 'w', encoding='utf8')

    def process_item(self, item, spider):
        line = json.dumps(item['url'], ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # called when the spider is closed
        self.file.close()

This pipeline first checks whether the item's url is empty; if it is not, the image is fetched, and once the download finishes the local path of the saved file is written back into the item.

Then register the pipelines in settings.py:

ITEM_PIPELINES = [
    'house.pipelines.down_image.MyImagePipeline',
    'house.pipelines.store_original.JsonWriterPipeline',
    ]
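
Note that in more recent Scrapy releases ITEM_PIPELINES is expected to be a dict that maps each pipeline's dotted path to an order value (lower values run first); the list form above comes from an older Scrapy version. An equivalent dict configuration would look like this:

ITEM_PIPELINES = {
    'house.pipelines.down_image.MyImagePipeline': 300,
    'house.pipelines.store_original.JsonWriterPipeline': 800,
}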

2) middlewares

Middlewares act like hooks around the crawl: they let you pre-process and post-process requests, for example rewriting request headers or filtering URLs.

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class MyUserAgentDownloaderMiddleware(UserAgentMiddleware):

    user_agent_list = [
        'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.43 Safari/537.31',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',

        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.2; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
        'Mozilla/5.0 (Windows; U; MSIE 7.0; Windows NT 6.0; en-US)',
        'Mozilla/5.0 (Windows; U; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)',

        'Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1',
        'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:15.0) Gecko/20120910144328 Firefox/15.0.2',

        'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a3pre) Gecko/20070330',
        'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203',

        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
        'Opera/9.80 (X11; Linux x86_64; U; fr) Presto/2.9.168 Version/11.50',
        'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; de) Presto/2.9.168 Version/11.52',

        'Mozilla/5.0 (Windows; U; Win 9x 4.90; SG; rv:1.9.2.4) Gecko/20101104 Netscape/9.1.0285',
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.12) Gecko/20080219 Firefox/2.0.0.12 Navigator/9.0.0.6',
    ]

    @property
    def user_agent(self):
        return self._user_agent

    @user_agent.setter
    def user_agent(self, value):
        self._user_agent = value

    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        self.user_agent = random.choice(self.user_agent_list)
        request.headers.setdefault('User-Agent', self.user_agent)

What this middleware does is attach a randomly chosen User-Agent header to every request.
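
Like a pipeline, a downloader middleware has to be enabled in settings.py before Scrapy will use it. Here is a minimal sketch, assuming the class lives in a module importable as house.middlewares.user_agent (the exact path depends on where you put the file):

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in UserAgentMiddleware so the custom one takes over
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'house.middlewares.user_agent.MyUserAgentDownloaderMiddleware': 400,
}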

3) settings.py

This one is easy to understand: settings.py holds all the necessary configuration. There are a lot of options; you can look at the basic ones in the generated file, and look up anything else you want to configure on the Scrapy website. Below is just a small excerpt:

import os

# path_dir is a base directory defined earlier in my settings.py (not shown here)
IMAGES_STORE = os.path.join(path_dir, 'multimedia/images')
IMAGES_EXPIRES = 30
IMAGES_THUMBS = {
    'small': (70, 70),
    'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110

FILES_STORE = os.path.join(path_dir, 'multimedia/files')
FILES_EXPIRES = 50

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = [
    'house.pipelines.down_image.MyImagePipeline',
    'house.pipelines.store_original.JsonWriterPipeline',
]

LOG_LEVEL = 'INFO'

4) spiders

This is the soul of the crawler. The spiders package defines what data to scrape and how to scrape it, i.e. the start URLs, the crawl rules, and the parsing of responses into structured data.
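
Before the full example in section 4, here is a minimal sketch of what a spider can look like (the class name, spider name and URL are placeholders, not part of my project):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'                      # unique spider name, used by scrapy crawl example
    start_urls = ['http://example.com/']  # where the crawl starts

    def parse(self, response):
        # default callback: follow every link found on the page
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)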

5) items.py

items.py defines the fields of the data you scrape. An Item is used much like a dict.

import scrapy


class HouseItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    url = scrapy.Field()
    image_path = scrapy.Field()
    # fields filled in by the article spider in section 4
    title = scrapy.Field()
    content = scrapy.Field()
    author_info = scrapy.Field()
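
As a quick illustration of the dict-like behaviour (the values here are made up):

item = HouseItem(name='example')
item['url'] = 'http://example.com/1.html'
print(item['url'])   # 'http://example.com/1.html'
print(dict(item))    # a plain dict containing the populated fields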

6) scrapy.cfg

This is the project-wide configuration; it mainly records where the settings module (settings.py) lives.

[settings]
default = house.settings

[deploy]
project = house

4. A Scrapy spider example

Writing pipelines, middlewares and settings has already been covered above, so here I will mainly talk about the spiders part.

I crawled articles from 九九文章网 (a Chinese article site). Let me explain it in a few parts:

  1. name: the spider's name, which must be unique.

  2. start_urls: the URLs the crawl starts from; on these pages the spider finds the links to the data it actually wants.

  3. start_requests: used when no start_urls are given; this method must return (or yield) the requests the spider crawls first.

  4. parse: the default callback, invoked whenever a request does not specify one.

Here is the code I wrote:

# -*- coding: utf-8 -*-
from StringIO import StringIO

import requests
import scrapy
from bs4 import BeautifulSoup
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider

from house.items import HouseItem
# JiJinpin and JiSanwen are my own site-descriptor classes (defined elsewhere in
# the project); each one carries a request_url and a list id for its site section.


class JjJdSpider(CrawlSpider):
    name = 'jjjd'

    def start_requests(self):
        sites = [JiJinpin(), JiSanwen()]
        urls = []
        for site in sites:
            # fetch each section's first page to find out how many list pages it has
            response = requests.get(site.request_url).content
            response = BeautifulSoup(StringIO(response), 'html5lib')
            pages = response.find('span', attrs={'class': 'pageinfo'}).strong.string
            pages = int(str(pages))
            for page in range(1, pages + 1):
                url = site.request_url + 'list_%d_%d.html' % (site.list, page)
                urls.append(url)

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        content = response.body
        if not content:
            self.log('the content is empty')
            return
        sel = Selector(response)
        # collect the article links on each list page
        urls = sel.xpath('//div[@class="list_box"]/ul[@class="e2"]/li/h3/a/@href').extract()
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_items)

    def parse_items(self, response):
        """
        This method, as well as any other Request callback,
        must return an iterable of Requests and/or dicts or Item objects.
        :param response:
        :return: item
        """
        sel = Selector(response)
        item = HouseItem()
        title = sel.xpath('//div[@class="art_viewbox bd"]/div[@class="title"]/h1/text()').extract_first()
        author_info = sel.xpath('//div[@class="art_viewbox bd"]/div[@class="info"]').extract_first()
        content = sel.xpath('//div[@class="art_viewbox bd"]/div[@class="content"]/p').extract()
        if content:
            article_content = ' '.join(content) if len(content) > 1 else content[0]
            item['title'] = title
            item['content'] = article_content
            item['author_info'] = author_info
            yield item

The flow is easy to follow: when the spider runs, start_requests executes first and its requests call back into parse; the last statement of parse, yield scrapy.Request(url=url, callback=self.parse_items), issues a request for each link that holds the actual data, and those responses are handled by parse_items.
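To launch the spider, run scrapy crawl jjjd from the project root; every item yielded by parse_items is then pushed through the pipelines registered in ITEM_PIPELINES.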

Summary

This post only covers basic Scrapy usage; later posts will look at topics such as using proxy IPs and simulating logins.

--------EOF---------