Scrapy Framework Installation Tutorial (Installation and Project Creation)
- OS: Windows
- Tool: PyCharm Professional 20.3
- Python version: 3.x
1. Create a new project in PyCharm
File > New Project
2. Install Scrapy
pip install scrapy
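After the install finishes, one way to confirm that the package is visible to the interpreter the project uses is to query its installed version from Python. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper name `installed_version` is illustrative and not part of Scrapy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("Scrapy is not installed in this interpreter")
    else:
        print("Scrapy version:", version)
```

If the script reports that Scrapy is missing, the pip command was probably run against a different interpreter than the one configured for the project.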
3. Create a new crawler project
scrapy startproject code_space_spider
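Project structure
scrapy_demo
└─code_space_spider
   ├─scrapy.cfg              project configuration file
   └─code_space_spider
      ├─spiders              directory for spider code
      ├─items.py             item definitions for the scraped data
      ├─middlewares.py       request-handling middlewares
      ├─pipelines.py         item pipelines (persistence)
      └─settings.py          project settings file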
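To double-check that the generated project matches the layout described here, the expected paths can be tested from Python. This is a rough sketch under the assumption that `scrapy.cfg` sits in the outer project directory and the Python modules sit in the inner package of the same name; `EXPECTED_PATHS` and `missing_project_paths` are hypothetical names chosen for this example.

```python
from pathlib import Path

# Paths that "scrapy startproject <name>" is expected to create,
# relative to the outer project directory (the one holding scrapy.cfg).
EXPECTED_PATHS = [
    "scrapy.cfg",
    "{name}/items.py",
    "{name}/middlewares.py",
    "{name}/pipelines.py",
    "{name}/settings.py",
    "{name}/spiders",
]


def missing_project_paths(project_dir, name):
    """Return the expected paths that do not exist under project_dir."""
    root = Path(project_dir)
    expected = [template.format(name=name) for template in EXPECTED_PATHS]
    return [path for path in expected if not (root / path).exists()]


if __name__ == "__main__":
    # Run from the directory in which "scrapy startproject" was executed.
    missing = missing_project_paths("code_space_spider", "code_space_spider")
    print("Missing paths:", missing or "none")
```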
4. Test demo
- Core code
items.py
import scrapy


class DmozItem(scrapy.Item):
    # Fields collected for each scraped entry
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
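dmoz_spider.py (placed in the spiders directory)
import scrapy


class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"  # the name passed to "scrapy crawl"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Use the second-to-last URL segment ("Books" or "Resources") as the filename
        filename = response.url.split("/")[-2]
        # Write the raw response body (bytes) to that file
        with open(filename, "wb") as f:
            f.write(response.body)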
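Two details of the spider are easy to get wrong, and both can be tried in plain Python without Scrapy: how the output filename is derived from the URL, and why the comma between the two start URLs matters. The sketch below is illustrative only; `page_filename` is a hypothetical helper that mirrors the expression used in `parse`.

```python
def page_filename(url):
    """Mirror the spider's rule: take the second-to-last "/"-separated segment."""
    return url.split("/")[-2]


books = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
resources = "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"

# The trailing "/" leaves an empty last segment, so index -2 is the page name.
print(page_filename(books))      # Books
print(page_filename(resources))  # Resources

# With a comma, the list holds two separate URLs.
with_comma = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

# Without a comma, Python concatenates adjacent string literals,
# so the list holds a single, invalid URL.
without_comma = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]

print(len(with_comma), len(without_comma))  # 2 1
```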
- Change into the project directory
cd code_space_spider/code_space_spider
- Run the command that starts the spider
scrapy crawl dmoz
2022-02-15 21:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2022-02-15 21:13:07-0400 [scrapy] INFO: Optional features available: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Overridden settings: {}
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2022-02-15 21:13:07-0400 [dmoz] INFO: Spider opened
2022-02-15 21:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] INFO: Closing spider (finished)
The Crawled (200) lines show that both test pages were fetched successfully.
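For longer runs, reading the log by eye becomes tedious, so the status codes can be pulled out with a small script. This is a sketch that assumes the "Crawled (status) <GET url>" wording seen in the output above; the names `CRAWLED` and `crawled_pages` are made up for this example, and the sample text is copied from that output.

```python
import re

# Matches the part of a log line that looks like:
#   Crawled (200) <GET http://example/page/>
CRAWLED = re.compile(r"Crawled \((\d{3})\) <GET ([^>\s]+)>")


def crawled_pages(log_text):
    """Return (status, url) pairs for every crawled GET request found in the log."""
    return [(int(status), url) for status, url in CRAWLED.findall(log_text)]


sample_log = """\
2022-02-15 21:13:07-0400 [dmoz] INFO: Spider opened
2022-02-15 21:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] INFO: Closing spider (finished)
"""

for status, url in crawled_pages(sample_log):
    # A 200 status means the page was downloaded successfully.
    print(status, url)
```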
Follow me and pick up one new technique every day; if we keep at it over the long run, we will keep improving.