
Python Web Scraping Steps (Setting Up the Scrapy Framework Before Running It)



In the previous article, we looked at the main components of the Scrapy framework and its data-crawling workflow. With those basics covered, let's learn how to actually use the framework.


Installing the Scrapy framework


Although installing Scrapy works much like installing the other libraries we have used before, there are a few differences.


Open CMD, type pip3 install scrapy, and press Enter.


The installation output looks a little different from what we have seen before: because Scrapy is a framework, pip also downloads the other components it depends on.


Once the installation finishes, we can set up a scraping project.
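
Scrapy itself provides a command for generating the project skeleton. A minimal sequence in CMD might look like this (a sketch of the workflow; the project name baidu is chosen here only to match the directory tree shown next):

pip3 install scrapy
scrapy version              # quick check that the install worked
scrapy startproject baidu   # creates the project skeleton described below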


If we want to scrape image information from Baidu Tieba, we can lay out the project directory like this:


|____ scrapy.cfg
|____ baidu
| |____ spiders
| | |____ __init__.py
| | |____ __pycache__
| |____ __init__.py
| |____ __pycache__
| |____ middlewares.py
| |____ settings.py
| |____ items.py
| |____ pipelines.py


This is a tree view, which you can print in CMD with the tree command.


In the tree above, the files ending in .py are the ones we fill in ourselves.


Before the data-crawling workflow can run, we first need to add some definitions to these .py files.


1. Define fields in items.py; these fields hold the scraped data.


The code is as follows:


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class baiduItem(scrapy.Item):
    # Fields that hold the scraped data
    name = scrapy.Field()
    year = scrapy.Field()

2. Write your own spider in the spiders folder.


# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from baidu.items import baiduItem


class ImageSpider(CrawlSpider):
    name = 'image'
    # Domain matching the start URL and the link-extraction rules below
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/']
    # Follow list pages; send detail pages to parse_item
    rules = (
        Rule(LinkExtractor(allow=(r'https://tieba.baidu.com\?start=\d+.*'))),
        Rule(LinkExtractor(allow=(r'https://tieba.baidu.com/subject/\d+')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = baiduItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        return item
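
Before running the whole crawl, the XPath expressions above can be checked interactively with Scrapy's shell (a quick sanity check; the URL here is simply the start page used above):

scrapy shell 'https://tieba.baidu.com/'
>>> response.xpath('//*[@id="content"]/h1/span[1]/text()').extract()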


3. In pipelines.py, persist (i.e. store) the data.


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class baiduPipeline(object):

    def __init__(self):
        # Connect to MongoDB using the values defined in settings.py
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Remove invalid data: drop the item if any field is empty
        valid = True
        for data in item:
            if not item[data]:
                valid = False
                raise DropItem("Missing %s in %s" % (data, item))
        if valid:
            # Insert data into the database
            new_image = [{
                "name": item['name'][0],
                "year": item['year'][0],
            }]
            self.collection.insert(new_image)
            log.msg("Item wrote to MongoDB database %s/%s" %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
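
Note that scrapy.conf and scrapy.log in the code above come from older Scrapy releases and are no longer available in current versions. On a recent Scrapy/pymongo install, an equivalent pipeline (a sketch with the same behaviour, not the article's original code) could read the settings through from_crawler and log through the spider:

import pymongo
from scrapy.exceptions import DropItem


class baiduMongoPipeline(object):

    def __init__(self, settings):
        # Same MongoDB settings as before, passed in from the crawler
        client = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = client[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    @classmethod
    def from_crawler(cls, crawler):
        # Replaces the removed "from scrapy.conf import settings" import
        return cls(crawler.settings)

    def process_item(self, item, spider):
        # Drop items with any empty field
        for field in item:
            if not item[field]:
                raise DropItem("Missing %s in %s" % (field, item))
        # insert_one replaces the older pymongo insert()
        self.collection.insert_one({
            "name": item['name'][0],
            "year": item['year'][0],
        })
        spider.logger.debug("Item written to MongoDB")
        return item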


4. Modify settings.py to configure the project.


# -*- coding: utf-8 -*-

# Scrapy settings for baidu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'baidu'

SPIDER_MODULES = ['baidu.spiders']
NEWSPIDER_MODULE = 'baidu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# MongoDB connection settings used by baiduPipeline
MONGODB_SERVER = '14.215.177.221'
MONGODB_PORT = 45654
MONGODB_DB = 'baidu'
MONGODB_COLLECTION = 'image'

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'baidu.middlewares.baiduSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'baidu.middlewares.baiduDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'baidu.pipelines.baiduPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


That is all the preparation needed before the crawler framework starts working.
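
With these files in place, the crawl is started from the project's root directory with Scrapy's crawl command, passing the spider's name attribute (image in the spider above):

scrapy crawl image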


If you want to learn more about technology, click follow.


If anything in this article is unclear, leave your question in the comments. I will discuss it with everyone, work through the problems, and improve together.


青年学记, keeping young people company.



Author: 青年学记, a programmer who keeps improving.


Learn together, progress together.


Toward self-reliance.
