
Python Web Scraping Steps (Setting Up the Scrapy Framework Before Running It)



In the previous article, we looked at the main components of the Scrapy framework and its data-crawling workflow. With those basics covered, let's learn how to actually use the framework.


Installing the Scrapy framework


Although installing Scrapy works much like installing the other libraries we have used before, there are a few differences.


Open CMD, type pip3 install scrapy, and press Enter.


The installation output looks a little different from what we have seen before: because Scrapy is a framework, pip also downloads the other components it depends on.


Once the installation finishes, we can set up a scraping project.
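
Scrapy itself provides a command for generating the project skeleton. A minimal sequence in CMD might look like this (a sketch of the workflow; the project name baidu is chosen here only to match the directory tree shown next):

pip3 install scrapy
scrapy version              # quick check that the install worked
scrapy startproject baidu   # creates the project skeleton described below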


If we want to scrape image information from Baidu Tieba, we can lay out the project directory like this:


|____ scrapy.cfg
|____ baidu
| |____ spiders
| | |____ __init__.py
| | |____ __pycache__
| |____ __init__.py
| |____ __pycache__
| |____ middlewares.py
| |____ settings.py
| |____ items.py
| |____ pipelines.py


This is a tree view, which you can print in CMD with the tree command.


In the tree above, the files ending in .py are the ones we fill in ourselves.


Before the data-crawling workflow can run, we first need to add some definitions to these .py files.


1. Define fields in items.py; these fields hold the scraped data.


The code is as follows:


# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class baiduItem(scrapy.Item):
    # Fields that hold the scraped data
    name = scrapy.Field()
    year = scrapy.Field()

2. Write your own spider in the spiders folder.


# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from baidu.items import baiduItem


class ImageSpider(CrawlSpider):
    name = 'image'
    # Domain matching the start URL and the link-extraction rules below
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/']
    # Follow list pages; send detail pages to parse_item
    rules = (
        Rule(LinkExtractor(allow=(r'https://tieba.baidu.com\?start=\d+.*'))),
        Rule(LinkExtractor(allow=(r'https://tieba.baidu.com/subject/\d+')), callback='parse_item'),
    )

    def parse_item(self, response):
        sel = Selector(response)
        item = baiduItem()
        item['name'] = sel.xpath('//*[@id="content"]/h1/span[1]/text()').extract()
        item['year'] = sel.xpath('//*[@id="content"]/h1/span[2]/text()').re(r'\((\d+)\)')
        return item
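
Before running the whole crawl, the XPath expressions above can be checked interactively with Scrapy's shell (a quick sanity check; the URL here is simply the start page used above):

scrapy shell 'https://tieba.baidu.com/'
>>> response.xpath('//*[@id="content"]/h1/span[1]/text()').extract()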


3. In pipelines.py, persist (i.e. store) the data.


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.exceptions import DropItem
from scrapy.conf import settings
from scrapy import log


class baiduPipeline(object):

    def __init__(self):
        # Connect to MongoDB using the values defined in settings.py
        connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Remove invalid data: drop the item if any field is empty
        valid = True
        for data in item:
            if not item[data]:
                valid = False
                raise DropItem("Missing %s in %s" % (data, item))
        if valid:
            # Insert data into the database
            new_image = [{
                "name": item['name'][0],
                "year": item['year'][0],
            }]
            self.collection.insert(new_image)
            log.msg("Item wrote to MongoDB database %s/%s" %
                    (settings['MONGODB_DB'], settings['MONGODB_COLLECTION']),
                    level=log.DEBUG, spider=spider)
        return item
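
Note that scrapy.conf and scrapy.log in the code above come from older Scrapy releases and are no longer available in current versions. On a recent Scrapy/pymongo install, an equivalent pipeline (a sketch with the same behaviour, not the article's original code) could read the settings through from_crawler and log through the spider:

import pymongo
from scrapy.exceptions import DropItem


class baiduMongoPipeline(object):

    def __init__(self, settings):
        # Same MongoDB settings as before, passed in from the crawler
        client = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        db = client[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    @classmethod
    def from_crawler(cls, crawler):
        # Replaces the removed "from scrapy.conf import settings" import
        return cls(crawler.settings)

    def process_item(self, item, spider):
        # Drop items with any empty field
        for field in item:
            if not item[field]:
                raise DropItem("Missing %s in %s" % (field, item))
        # insert_one replaces the older pymongo insert()
        self.collection.insert_one({
            "name": item['name'][0],
            "year": item['year'][0],
        })
        spider.logger.debug("Item written to MongoDB")
        return item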


4. Modify settings.py to configure the project.


# -*- coding: utf-8 -*-

# Scrapy settings for baidu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'baidu'

SPIDER_MODULES = ['baidu.spiders']
NEWSPIDER_MODULE = 'baidu.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# MongoDB connection settings used by baiduPipeline
MONGODB_SERVER = '14.215.177.221'
MONGODB_PORT = 45654
MONGODB_DB = 'baidu'
MONGODB_COLLECTION = 'image'

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'baidu.middlewares.baiduSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'baidu.middlewares.baiduDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'baidu.pipelines.baiduPipeline': 400,
}

LOG_LEVEL = 'DEBUG'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


That is all the preparation needed before the crawler framework starts working.
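
With these files in place, the crawl is started from the project's root directory with Scrapy's crawl command, passing the spider's name attribute (image in the spider above):

scrapy crawl image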


If you want to learn more about technology, click follow.


If anything in this article is unclear, leave your question in the comments. I will discuss it with everyone, work through the problems, and improve together.


青年学记, keeping young people company.



Author: 青年学记, a programmer who keeps improving.


Learn together, progress together.


Toward self-reliance.
