
Scrapy Framework Tutorial (Installation and Project Creation)


  • System: Windows
  • Tool: PyCharm Professional 20.3
  • Python version: 3.x

1. Create a new project

In PyCharm, choose File > New Project.


2. Install the dependency

pip install scrapy

3. Create the crawler project

scrapy startproject code_space_spider
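After the command finishes, one way to confirm that the project was generated is to look for the files listed in the project structure in this tutorial. The following is a minimal sketch, not part of Scrapy itself: the helper name `missing_project_files` is hypothetical, and it assumes the layout this tutorial describes (a `scrapy.cfg` at the project root and a package directory of the same name).

```python
from pathlib import Path

# File names taken from the project structure described in this tutorial.
EXPECTED_FILES = ["items.py", "middlewares.py", "pipelines.py", "settings.py"]


def missing_project_files(project_root, package_name):
    """Return the expected paths (relative to project_root) that do not exist."""
    root = Path(project_root)
    package_dir = root / package_name
    expected = [root / "scrapy.cfg", package_dir / "spiders"]
    expected += [package_dir / name for name in EXPECTED_FILES]
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


if __name__ == "__main__":
    # Assumes this is run from the directory where `scrapy startproject` was executed.
    missing = missing_project_files("code_space_spider", "code_space_spider")
    print("missing:", missing if missing else "nothing")
```

An empty list means every expected file and directory was found; anything listed suggests the command was run somewhere else or did not complete.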


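Project structure

scrapy_demo
├─code_space_spider
│  ├─code_space_spider
│  │  ├─spiders          directory that holds the spider code
│  │  ├─items.py         data entities to collect
│  │  ├─middlewares.py   request-handling middleware
│  │  ├─pipelines.py     persistence (item pipeline) code
│  │  ├─settings.py      project settings
│  ├─scrapy.cfg          project configuration file

4. Test demo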

  • Core code

items.py

import scrapy


class DmozItem(scrapy.Item):
    # Fields that a scraped entry would carry.
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

dmoz_spider.py

import scrapy


class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"  # the name passed to `scrapy crawl`
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL.
        filename = response.url.split("/")[-2]
        # response.body is bytes, so the file is opened in binary mode.
        with open(filename, "wb") as f:
            f.write(response.body)
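The spider's `parse` method names each saved file with `response.url.split("/")[-2]`. The sketch below reproduces only that string operation in plain Python so the rule can be checked without running a crawl; the helper name `filename_from_url` is made up for illustration.

```python
def filename_from_url(url):
    """Apply the spider's naming rule: the second-to-last piece after splitting on '/'."""
    return url.split("/")[-2]


urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

for url in urls:
    print(filename_from_url(url))  # prints Books, then Resources
```

The rule relies on the trailing slash: the final piece of the split is then an empty string, so index `-2` is the last path segment. For a URL without a trailing slash, the same expression would return the parent segment instead (`Python` for `.../Python/Books`).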

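  • Change into the project directory (scrapy crawl can be run from any directory inside the project, that is, at or below the one containing scrapy.cfg)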

cd code_space_spider/code_space_spider

  • Run the command that starts the spider

scrapy crawl dmoz

2022-02-15 21:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2022-02-15 21:13:07-0400 [scrapy] INFO: Optional features available: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Overridden settings: {}
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2022-02-15 21:13:07-0400 [dmoz] INFO: Spider opened
2022-02-15 21:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] INFO: Closing spider (finished)

The two Crawled (200) lines show that the test site was fetched successfully.
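When the log is long, the `Crawled` lines can be pulled out programmatically. The following is a rough sketch that assumes the log format shown in this article; other Scrapy versions may format these lines differently, and the names `CRAWLED` and `crawled_pages` are hypothetical.

```python
import re

# Matches the "Crawled (<status>) <GET <url>>" part of a log line,
# assuming the format shown in this article.
CRAWLED = re.compile(r"Crawled \((\d+)\) <GET ([^>\s]+)>")


def crawled_pages(log_text):
    """Return (status_code, url) pairs for every 'Crawled' entry in the log text."""
    return [(int(status), url) for status, url in CRAWLED.findall(log_text)]


sample_log = """\
2022-02-15 21:13:07-0400 [dmoz] INFO: Spider opened
2022-02-15 21:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] INFO: Closing spider (finished)
"""

for status, url in crawled_pages(sample_log):
    print(status, url)
```

A status other than 200 in the first column would point to a page that was requested but not fetched normally.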
