Scrapy Learning Notes

Scrapy Architecture

  1. The Engine gets the initial Requests to crawl from the Spider. (get the initial URLs to crawl)
  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl. (the Requests are scheduled into the Scheduler, building a queue of URLs to crawl)
  3. The Scheduler returns the next Requests to the Engine. (once scheduled, the Scheduler hands the next Request back to the Engine)
  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()). (the Engine sends that Request to the Downloader)
  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()). (the Downloader produces the Response for that Request and returns it to the Engine)
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()). (the Engine passes the Response to the Spider for parsing)
  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()). (the Spider extracts the data it needs plus any new URL requests, and sends both back to the Engine)
  8. The Engine sends processed items to Item Pipelines, then send processed Requests to the Scheduler and asks for possible next Requests to crawl. (the Engine forwards the scraped items to the Item Pipelines and the new Requests to the Scheduler)
  9. The process repeats (from step 1) until there are no more requests from the Scheduler.

Source: https://docs.scrapy.org/en/latest/topics/architecture.html
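
The middleware hooks mentioned in steps 4–7 are methods on downloader/spider middleware classes. A minimal sketch, just to show where each hook sits in the data flow; the class names and bodies are illustrative (a middleware must also be enabled via DOWNLOADER_MIDDLEWARES / SPIDER_MIDDLEWARES in settings):

# middlewares.py -- illustrative sketch of the hooks named above
class MyDownloaderMiddleware:
    def process_request(self, request, spider):
        # step 4: called for every Request the Engine sends to the Downloader;
        # return None to continue normally
        return None

    def process_response(self, request, response, spider):
        # step 5: called for every Response coming back from the Downloader
        return response


class MySpiderMiddleware:
    def process_spider_input(self, response, spider):
        # step 6: called before the Response reaches the Spider
        return None

    def process_spider_output(self, response, result, spider):
        # step 7: called on the items/Requests the Spider yields
        for r in result:
            yield r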

Scrapy Usage

Step 1: Create a crawler project

$ scrapy startproject project_name
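
For reference, the generated project layout looks roughly like this (project_name is whatever name you passed to startproject):

project_name/
    scrapy.cfg            # deploy / project configuration
    project_name/
        __init__.py
        items.py          # Item definitions (step 3)
        middlewares.py
        pipelines.py
        settings.py       # project settings (step 2)
        spiders/          # your spiders live here (step 4)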

Step 2: Configure settings

Disable robots.txt compliance in settings => ROBOTSTXT_OBEY = False
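
A minimal settings.py sketch; ROBOTSTXT_OBEY is the only change needed for this step, the other settings are just commonly tuned examples with illustrative values:

# settings.py (excerpt)
ROBOTSTXT_OBEY = False           # do not fetch/obey robots.txt

# commonly adjusted as well (values are illustrative):
# USER_AGENT = "Mozilla/5.0 ..."
# DOWNLOAD_DELAY = 1             # seconds between requests to the same site
# CONCURRENT_REQUESTS = 16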

Step 3: Define the Item


# items.py
import scrapy
from scrapy import Field
from itemloaders.processors import MapCompose, TakeFirst  # scrapy.loader.processors in older versions


def remove_quotes(text):
    # illustrative pre-processor: strip surrounding quote marks from a value
    return text.strip('"')


class Test1Item(scrapy.Item):  # ==>>> rename the item class to suit your project
    # define the fields for your item here, e.g.:
    # name = scrapy.Field()

    # Items clearly define the common output data format in a separate file.
    # Sometimes the incoming data needs further processing, e.g.:
    quote_content = Field(
        input_processor=MapCompose(remove_quotes),
        output_processor=TakeFirst(),
    )

Step 4: Write the spider logic


First, generate the spider file:

$ cd project_name
$ scrapy genspider spider_name url

def parse(self, response):
    # write the parsing logic here
    self.logger.info('test...')

    # typical flow: extract the data you need
    res = response.xpath("xxx").getall()
    for i in res:
        yield {
            # fill in the extracted fields here; for more complex cases see Item / ItemLoader below
        }

    # if you need to crawl deeper, e.g. the page has a "next page" link
    next_page = response.xpath(xxx).get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        # the callback can also be a separate method, e.g. scrapy.Request(next_page, callback=self.parse2)
        yield scrapy.Request(next_page, callback=self.parse)
    # without joining the URL manually, response.follow() accepts a relative URL (or an <a> selector) directly:
    # yield response.follow(next_page, callback=self.parse)


def parse2(self, response):
    pass


# Item / ItemLoader  (from scrapy.loader import ItemLoader)
for i in res:
    loader = ItemLoader(item=Test1Item(), selector=i)
    ...
    item = loader.load_item()  # the item now holds the extracted data
    yield item
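
Putting steps 3 and 4 together, a minimal end-to-end spider sketch. It assumes the quotes.toscrape.com demo site used by the Scrapy tutorial and the Test1Item / remove_quotes defined in step 3; the XPath expressions and the import path are illustrative and should be adapted to your own project:

import scrapy
from scrapy.loader import ItemLoader
from project_name.items import Test1Item  # adjust the import to your project name


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            loader = ItemLoader(item=Test1Item(), selector=quote)
            # add_xpath extracts the value and runs the field's input/output processors
            loader.add_xpath("quote_content", './/span[@class="text"]/text()')
            yield loader.load_item()

        # follow the "next page" link, if any
        next_page = response.xpath('//li[@class="next"]/a/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Run it with scrapy crawl quotes, as in step 5 below.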

Step 5: Debug and output

$ scrapy shell URL
response.xpath(xxx).getall()

$ scrapy crawl quotes -o xx.json
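
The feed format is inferred from the output file's extension, so besides JSON you can also export, for example:

$ scrapy crawl quotes -o quotes.csv
$ scrapy crawl quotes -o quotes.jl    # JSON lines, one item per line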

