Unit 6: Extending Scrapy

This unit covers how to extend Scrapy's capabilities through Item Pipelines, Middlewares, and Signals.

Topics

Scrapy Architecture
How to extend Scrapy
- Item Pipelines
- Spider Middlewares
- Downloader Middlewares
- Signals

Check out the slides for this unit

Sample Spiders

A project including a Pipeline that drops items that don't have tags: p1_pipeline
A project including a Pipeline that stores scraped data in MongoDB: p2_pipeline
A project with two spider middlewares: p3_spider_middleware

Hands-on

1. Pipeline

Build an item pipeline that stores the quotes from http://quotes.toscrape.com in a separate JSON Lines file for each author:

Albert Einstein → albert_einstein.jl
Jane Austen → jane_austen.jl
etc
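
One possible shape for this pipeline is sketched below. It is only a starting point, not the reference solution: it assumes each item carries an `author` field, and the `author_filename` helper and the `QuotesPerAuthorPipeline` class name are made up for illustration. Files are written to the current working directory.

```python
import json
import re


def author_filename(author):
    """Turn an author name into a file name, e.g. 'Jane Austen' -> 'jane_austen.jl'."""
    # Collapse every run of non-word characters into a single underscore.
    slug = re.sub(r"\W+", "_", author.lower()).strip("_")
    return f"{slug}.jl"


class QuotesPerAuthorPipeline:
    """Item pipeline sketch: one JSON Lines file per author."""

    def open_spider(self, spider):
        # Maps file name -> open file handle, filled lazily as authors appear.
        self.files = {}

    def close_spider(self, spider):
        for handle in self.files.values():
            handle.close()

    def process_item(self, item, spider):
        # Assumes the item exposes an 'author' field (hypothetical field name).
        filename = author_filename(item["author"])
        if filename not in self.files:
            self.files[filename] = open(filename, "w", encoding="utf-8")
        line = json.dumps(dict(item), ensure_ascii=False)
        self.files[filename].write(line + "\n")
        # Return the item so later pipelines still receive it.
        return item
```

To try it, the class would need to be registered under `ITEM_PIPELINES` in the project's `settings.py`, using whatever module path the project gives it.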

Check out the project once you're done.

2. Downloader Middleware

Build a downloader middleware to fetch and render pages using Selenium + PhantomJS instead of the Scrapy downloader.

Make sure users can disable it either via settings or on a per-request basis.
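
A rough sketch of such a middleware follows, under several assumptions: the `SELENIUM_ENABLED` setting and the `use_selenium` meta key are hypothetical names chosen for this example, and `webdriver.PhantomJS` is only available in Selenium 3.x and earlier (it was removed in Selenium 4). Scrapy and Selenium are imported where they are used so that the per-request switch can be exercised without either package installed.

```python
class SeleniumMiddleware:
    """Downloader middleware sketch: render pages with a Selenium WebDriver."""

    def __init__(self, driver):
        # Any WebDriver instance works here; the exercise asks for PhantomJS.
        self.driver = driver

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals
        from scrapy.exceptions import NotConfigured
        from selenium import webdriver

        # Project-wide switch (hypothetical setting name, on by default).
        if not crawler.settings.getbool("SELENIUM_ENABLED", True):
            raise NotConfigured

        # Assumption: Selenium 3.x or earlier, where webdriver.PhantomJS exists.
        middleware = cls(webdriver.PhantomJS())
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Per-request switch (hypothetical meta key): returning None hands
        # the request back to Scrapy's regular downloader.
        if not request.meta.get("use_selenium", True):
            return None

        from scrapy.http import HtmlResponse

        self.driver.get(request.url)
        # Returning a Response here short-circuits the normal download.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        # Shut the browser down when the spider finishes.
        self.driver.quit()
```

With this sketch, a spider could opt out for a single request with `scrapy.Request(url, meta={"use_selenium": False})`, while setting `SELENIUM_ENABLED = False` would turn the middleware off for the whole project. The class would also need an entry in `DOWNLOADER_MIDDLEWARES`.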

Check out the project once you're done.

References