Scrapy Framework Tutorial (Installation and Project Creation)
- OS: Windows
- Tool: PyCharm Professional 20.3
- Python version: 3.x
Create a new project in PyCharm: File > New Project

Install Scrapy:
pip install scrapy

3. Create the crawler project
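Before creating the project, it can help to confirm that the installation actually succeeded in the interpreter PyCharm is using. The following is a minimal sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the helper name `installed_version` is made up for illustration.

```python
import importlib.metadata as metadata


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if Scrapy is installed, otherwise None
print(installed_version("scrapy"))
```

If this prints None, the pip command probably ran against a different interpreter than the one selected for the project.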
    
scrapy startproject code_space_spider
    


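Project structure
scrapy_demo
└─code_space_spider
   ├─code_space_spider
   │  ├─spiders		directory for spider code
   │  ├─items.py		item (data entity) definitions
   │  ├─middlewares.py	request-handling middlewares
   │  ├─pipelines.py		pipelines that persist scraped data
   │  └─settings.py		project settings
   └─scrapy.cfg		project configuration file

4. Test demo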
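Before writing the demo code, it may be worth checking that `scrapy startproject` produced the layout listed above. This is only a sketch: the helper `missing_entries` and the list `EXPECTED_ENTRIES` are illustrative names, and the list covers just the entries named in this article.

```python
from pathlib import Path

# Entries from the project structure above, relative to the outer
# code_space_spider directory (the one that contains scrapy.cfg).
EXPECTED_ENTRIES = [
    "scrapy.cfg",
    "code_space_spider/spiders",
    "code_space_spider/items.py",
    "code_space_spider/middlewares.py",
    "code_space_spider/pipelines.py",
    "code_space_spider/settings.py",
]


def missing_entries(project_root):
    """Return the expected entries that do not exist under project_root."""
    root = Path(project_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# An empty list means every expected file and directory was found
print(missing_entries("code_space_spider"))
```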
- Core code
items.py
import scrapy

# Container for the fields scraped from each page
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    
dmoz_spider.py
import scrapy

class DmozSpider(scrapy.spiders.Spider):
    name = "dmoz"  # the name passed to "scrapy crawl"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        # The comma matters: without it Python joins the two strings into one URL
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Save each page body to a file named after the last path segment
        filename = response.url.split("/")[-2]
        with open(filename, "wb") as f:
            f.write(response.body)
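The file name in `parse` comes from index -2 of the split URL, not -1. Both start URLs end with a slash, so the last element after splitting is an empty string. The sketch below isolates that step in plain Python; the function name `page_name` is made up for illustration.

```python
def page_name(url):
    # "http://.../Python/Books/".split("/") ends with [..., "Books", ""],
    # so index -2 is the last non-empty path segment.
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

for url in start_urls:
    print(page_name(url))  # prints "Books", then "Resources"
```

For a URL without a trailing slash, index -2 would pick the parent segment instead, so this naming scheme only suits URLs shaped like the two above.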
- Enter the project directory
cd code_space_spider/code_space_spider
- Run the crawl command
scrapy crawl dmoz
    
2022-02-15 21:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2022-02-15 21:13:07-0400 [scrapy] INFO: Optional features available: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Overridden settings: {}
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2022-02-15 21:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2022-02-15 21:13:07-0400 [dmoz] INFO: Spider opened
2022-02-15 21:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] INFO: Closing spider (finished)
    
The test site was crawled successfully: both start URLs returned status 200.
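Rather than reading the log by eye, the "Crawled" lines can be extracted with a regular expression. This is a rough sketch that assumes the log format shown above; `crawled_pages` and `CRAWLED_LINE` are illustrative names, and the sample text is copied from the log in this article. One way to capture a real log would be Scrapy's `--logfile` option, with a file name of your choosing.

```python
import re

# Matches e.g. "Crawled (200) <GET http://example.com/page/>"
CRAWLED_LINE = re.compile(r"Crawled \((\d{3})\) <GET (\S+)>")


def crawled_pages(log_text):
    """Return (status, url) pairs for every 'Crawled' line in the log."""
    return [(int(status), url) for status, url in CRAWLED_LINE.findall(log_text)]


sample_log = """\
2022-02-15 21:13:07-0400 [dmoz] INFO: Spider opened
2022-02-15 21:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2022-02-15 21:13:09-0400 [dmoz] INFO: Closing spider (finished)
"""

pages = crawled_pages(sample_log)
# True only if at least one page was crawled and every status is 200
print(bool(pages) and all(status == 200 for status, _ in pages))
```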




