This unit covers how to extend Scrapy's capabilities via Item Pipelines or Middlewares.
- Scrapy Architecture
- How to extend Scrapy
- Item Pipelines
- Spider Middlewares
- Downloader Middlewares
- Signals
Check out the slides for this unit.
- A project including a Pipeline that drops items that don't have tags: p1_pipeline
- A project including a Pipeline that stores scraped data in MongoDB: p2_pipeline
- A project with 2 spider middlewares: p3_spider_middleware
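
For orientation, here is a minimal sketch of what a "drop items without tags" pipeline can look like. It is not the code from the sample project: the class name `RequireTagsPipeline` and the `tags` field name are assumptions made for illustration, and the `try`/`except` around the import is only there so the snippet can be tried without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch can be exercised without Scrapy installed.
    class DropItem(Exception):
        pass


class RequireTagsPipeline:
    """Hypothetical pipeline: discard items whose 'tags' field is missing or empty."""

    def process_item(self, item, spider):
        # Assumes dict-like items with an optional 'tags' field.
        if not item.get("tags"):
            # Raising DropItem tells Scrapy to stop processing this item.
            raise DropItem(f"Missing tags in {item!r}")
        # Returning the item hands it to the next pipeline in the chain.
        return item
```

A pipeline like this is enabled through the `ITEM_PIPELINES` setting, for example `ITEM_PIPELINES = {"myproject.pipelines.RequireTagsPipeline": 300}`, where the module path is a placeholder for your own project.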
Build an item pipeline that stores the quotes from each author on http://quotes.toscrape.com in a separate JSON Lines (.jl) file:
- Albert Einstein → albert_einstein.jl
- Jane Austen → jane_austen.jl
- etc.
Check out the project once you're done.
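
If you get stuck, one possible shape for this pipeline is sketched below. It is not the reference project's solution. The class name, the `output_dir` parameter, the helper `author_to_filename`, and the `author` field name are all assumptions made for illustration.

```python
import json
import os
import re
import unicodedata


def author_to_filename(author):
    """Turn an author name into a file name, e.g. 'Jane Austen' -> 'jane_austen.jl'."""
    # Strip accents so names such as 'André Gide' stay readable in the file name.
    ascii_name = unicodedata.normalize("NFKD", author).encode("ascii", "ignore").decode("ascii")
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")
    return f"{slug}.jl"


class QuotesPerAuthorPipeline:
    """Hypothetical pipeline: write each author's quotes to their own JSON Lines file."""

    def __init__(self, output_dir="."):
        self.output_dir = output_dir
        self.files = {}

    def open_spider(self, spider):
        # Start each crawl with no open file handles.
        self.files = {}

    def close_spider(self, spider):
        # Close every file that was opened during the crawl.
        for handle in self.files.values():
            handle.close()

    def process_item(self, item, spider):
        # Assumes every item carries an 'author' field.
        filename = author_to_filename(item["author"])
        if filename not in self.files:
            path = os.path.join(self.output_dir, filename)
            self.files[filename] = open(path, "w", encoding="utf-8")
        # One JSON document per line is what makes the file JSON Lines.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.files[filename].write(line + "\n")
        return item
```

Keeping one open handle per author avoids reopening files for every item, at the cost of holding as many file descriptors as there are authors. That is fine for a small site, but worth revisiting for larger crawls.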
Build a downloader middleware to fetch and render pages using Selenium + PhantomJS instead of the Scrapy downloader.
- Make sure users can disable it either via settings or on a per-request basis.
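
As a starting point, the sketch below shows one way the two switches could fit together. It is an illustration under several assumptions, not a finished solution: the setting name `SELENIUM_ENABLED`, the meta key `use_selenium`, the class name, and the `driver_factory` parameter are all invented here, and `webdriver.PhantomJS` is only available in Selenium versions before 4.

```python
def make_phantomjs_driver():
    """Create a PhantomJS driver (assumes Selenium < 4 and a phantomjs binary on PATH)."""
    from selenium import webdriver  # local import: Selenium is only needed when rendering

    return webdriver.PhantomJS()


class SeleniumMiddleware:
    """Hypothetical downloader middleware that renders pages with a Selenium driver."""

    def __init__(self, driver_factory=make_phantomjs_driver):
        self.driver_factory = driver_factory
        self.driver = None  # created lazily on the first rendered request

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy imports are kept local so the toggle logic can be tried in isolation.
        from scrapy import signals
        from scrapy.exceptions import NotConfigured

        # Project-wide switch; SELENIUM_ENABLED is a made-up setting name.
        if not crawler.settings.getbool("SELENIUM_ENABLED", True):
            raise NotConfigured  # Scrapy then leaves this middleware out
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Per-request switch: Request(url, meta={"use_selenium": False}) opts out.
        if not request.meta.get("use_selenium", True):
            return None  # None lets the regular Scrapy downloader handle it
        from scrapy.http import HtmlResponse

        if self.driver is None:
            self.driver = self.driver_factory()
        self.driver.get(request.url)
        # Returning a Response here short-circuits the built-in downloader.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        # Shut the browser down when the crawl ends.
        if self.driver is not None:
            self.driver.quit()
            self.driver = None
```

Such a middleware would be enabled through `DOWNLOADER_MIDDLEWARES`, for example `{"myproject.middlewares.SeleniumMiddleware": 543}`, with the module path standing in for your own. Note that `driver.get` blocks, so every rendered page stalls Scrapy's event loop while it loads. That is acceptable for an exercise, but it limits concurrency.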