Python Crawler Steps (Setting Up the Python Crawler Framework Before Running It)
Previously we looked at the main components of the Scrapy crawler framework and its data-scraping workflow. With those basics in place, we can now learn how to actually use the framework.
Installing the Scrapy crawler framework
Although Scrapy is installed the same way as the other libraries we installed before, there are a few differences.
Open CMD, type pip3 install scrapy, and press Enter.
The installation output looks a little different from what we have seen before: because Scrapy is a framework, pip also downloads the other components the framework depends on.
Once the installation finishes, we can set up a crawler project. Scrapy generates the skeleton for us: running scrapy startproject baidu in CMD creates a project named baidu.
If we want to scrape image information from Baidu Tieba, the project directory can be laid out like this:
|____ scrapy.cfg
|____ baidu
| |____ spiders
| | |____ __init__.py
| | |____ __pycache__
| |____ __init__.py
| |____ __pycache__
| |____ middlewares.py
| |____ settings.py
| |____ items.py
| |____ pipelines.py
This layout is a tree diagram; you can print it in CMD with the tree command (tree /F also lists the files).
In the tree above, the files ending in .py are the ones we fill in ourselves.
Before the data-scraping workflow can run, we first have to write some definitions in these .py files.
1. Define fields in items.py; these fields hold the scraped data.
The code is as follows:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class baiduItem(scrapy.Item):
    name = scrapy.Field()
    year = scrapy.Field()
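An Item behaves like a dictionary that only accepts its declared fields. A quick way to get a feel for it is the following minimal sketch (run it from the project directory so the baidu package is importable; the values are made up for illustration):

from baidu.items import baiduItem

item = baiduItem()
item['name'] = ['Example title']  # a list, just like an XPath extract() result
item['year'] = ['2012']
print(item['name'][0])
# item['author'] = 'x' would raise KeyError: 'author' is not a declared field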
2. Write your own spider in the spiders folder.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from baidu.items import baiduItem


class ImageSpider(CrawlSpider):
    name = 'image'
    allowed_domains = ['tieba.baidu.com']  # must match the crawled site, or the links get filtered
    start_urls = ['https://tieba.baidu.com/']
    rules = (
        # Follow the listing pages
        Rule(LinkExtractor(allow=(r'https://tieba.baidu.com/\?start=\d+.*',))),
        # Hand the detail pages to parse_item
        Rule(LinkExtractor(allow=(r'https://tieba.baidu.com/subject/\d+',)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = baiduItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        return item
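Besides the scrapy crawl command, a spider can also be started from a plain Python script with Scrapy's CrawlerProcess. Below is a minimal sketch; the file name run.py and the module path baidu.spiders.image are assumptions that depend on what you named the spider file, not part of the project template:

# run.py, placed next to scrapy.cfg
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from baidu.spiders.image import ImageSpider  # assumes the spider lives in spiders/image.py

process = CrawlerProcess(get_project_settings())  # loads settings.py automatically
process.crawl(ImageSpider)
process.start()  # blocks until the crawl finishes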
3. Persist (that is, store) the data in pipelines.py.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


class baiduPipeline(object):
    def __init__(self, settings):
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'],
                                         settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]
        self.location = '%s/%s' % (settings['MONGODB_DB'],
                                   settings['MONGODB_COLLECTION'])

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.conf is deprecated, so read the settings through the crawler
        return cls(crawler.settings)

    def process_item(self, item, spider):
        # Drop items that are missing any field
        for field in item:
            if not item[field]:
                raise DropItem("Missing %s in %s" % (field, item))
        # Insert the data into the database
        new_image = {
            "name": item['name'][0],
            "year": item['year'][0],
        }
        self.collection.insert_one(new_image)
        spider.logger.debug("Item written to MongoDB database %s", self.location)
        return item
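After a crawl has run, you can check that the pipeline really wrote the documents by querying MongoDB directly with pymongo. A minimal sketch, assuming the same server, port, database, and collection values configured in settings.py below:

import pymongo

client = pymongo.MongoClient('14.215.177.221', 45654)  # MONGODB_SERVER / MONGODB_PORT from settings.py
collection = client['baidu']['image']                  # MONGODB_DB / MONGODB_COLLECTION
for doc in collection.find().limit(5):
    print(doc['name'], doc['year'])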
4. Modify settings.py to configure the project.
# -*- coding: utf-8 -*-
# Scrapy settings for baidu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'baidu'
SPIDER_MODULES = ['baidu.spiders']
NEWSPIDER_MODULE = 'baidu.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = True
MONGODB_SERVER = '14.215.177.221'
MONGODB_PORT = 45654
MONGODB_DB = 'baidu'
MONGODB_COLLECTION = 'image'
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'baidu.middlewares.baiduSpiderMiddleware': 543
# }
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
# 'baidu.middlewares.baiduDownloaderMiddleware': 543
# }
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None
# }
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'baidu.pipelines.baiduPipeline': 400
}
LOG_LEVEL = 'DEBUG'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
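Note that MONGODB_SERVER, MONGODB_PORT, MONGODB_DB, and MONGODB_COLLECTION are not built-in Scrapy settings; they are custom keys that our pipeline reads back through the Settings object. A minimal sketch of how any script inside the project can access them:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()        # finds settings.py through scrapy.cfg
print(settings['MONGODB_SERVER'])        # '14.215.177.221'
print(settings.getint('MONGODB_PORT'))   # 45654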
That completes the preparation before the crawler framework starts working. From here, running scrapy crawl image in the project directory sets the whole workflow in motion.