Unit 6: Extending Scrapy

This unit covers how to extend Scrapy's capabilities through Item Pipelines, Spider Middlewares, Downloader Middlewares, and Signals.

Topics

  • Scrapy Architecture
  • How to extend Scrapy
    • Item Pipelines
    • Spider Middlewares
    • Downloader Middlewares
    • Signals
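
Pipelines and middlewares are enabled through the project settings. The fragment below is a minimal sketch of a `settings.py`, assuming a project called `myproject`; every module path and class name in it is hypothetical. The integer assigned to each component determines its position in the processing order, and setting a value to `None` disables that component.

```python
# settings.py (sketch; all module paths and class names are hypothetical)

# Item pipelines run in order from the lowest number to the highest.
ITEM_PIPELINES = {
    "myproject.pipelines.MyPipeline": 300,
}

# Spider middlewares sit between the engine and the spiders.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.MySpiderMiddleware": 543,
}

# Downloader middlewares sit between the engine and the downloader.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyDownloaderMiddleware": 543,
}
```

Signals are not listed in the settings; a component typically connects its handlers in code through `crawler.signals.connect(...)`.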

Check out the slides for this unit

Sample Spiders

  1. A project including a Pipeline that drops items that don't have tags: p1_pipeline
  2. A project including a Pipeline that stores scraped data in MongoDB: p2_pipeline
  3. A project with 2 spider middlewares: p3_spider_middleware

Hands-on

1. Pipeline

Build an item pipeline that stores each author's quotes from http://quotes.toscrape.com in a separate JSON Lines file.

  • Albert Einstein → albert_einstein.jl
  • Jane Austen → jane_austen.jl
  • etc
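
If you need a starting point, here is one possible sketch of such a pipeline. It assumes each item has an `author` field; the class name `QuotesByAuthorPipeline`, the helper `author_filename`, and the `output_dir` parameter are made up for illustration and are not taken from the reference project.

```python
import json
import re
from pathlib import Path


def author_filename(author):
    """Turn an author name such as 'Albert Einstein' into 'albert_einstein.jl'."""
    slug = re.sub(r"\W+", "_", author.lower()).strip("_")
    return f"{slug}.jl"


class QuotesByAuthorPipeline:
    """Write every item to a JSON Lines file named after its author."""

    def __init__(self, output_dir="."):
        self.output_dir = Path(output_dir)

    def open_spider(self, spider):
        # One open file handle per author, created lazily in process_item.
        self.files = {}

    def close_spider(self, spider):
        for handle in self.files.values():
            handle.close()

    def process_item(self, item, spider):
        # Assumption: the item exposes the author name under the "author" key.
        name = author_filename(item["author"])
        if name not in self.files:
            self.files[name] = open(self.output_dir / name, "w", encoding="utf-8")
        # One JSON document per line is what makes the file JSON Lines.
        self.files[name].write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Keeping the file handles open for the whole crawl avoids reopening a file for every quote. The pipeline still has to be enabled in `ITEM_PIPELINES` before Scrapy will call it.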

Check out the project once you're done.

2. Downloader Middleware

Build a downloader middleware to fetch and render pages using Selenium + PhantomJS instead of the Scrapy downloader.

  • Make sure users can disable it either via settings or on a per-request basis.
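
The sketch below shows one way the two switches could fit together; it is not the reference solution. The setting name `SELENIUM_ENABLED` and the `selenium` key in `request.meta` are hypothetical names chosen for this example. Note that `webdriver.PhantomJS()` is only available in older Selenium releases, so a newer installation would need a different headless browser.

```python
class SeleniumMiddleware:
    """Downloader middleware sketch: render pages in a browser unless disabled."""

    def __init__(self, driver, enabled=True):
        self.driver = driver
        self.enabled = enabled

    @classmethod
    def from_crawler(cls, crawler):
        # Imports are deferred so the on/off logic below can be exercised
        # without Scrapy or Selenium installed.
        from scrapy import signals

        # Hypothetical project-wide switch, read from settings.py.
        enabled = crawler.settings.getbool("SELENIUM_ENABLED", True)
        driver = None
        if enabled:
            from selenium import webdriver
            driver = webdriver.PhantomJS()  # older Selenium releases only
        middleware = cls(driver, enabled)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Returning None hands the request back to the regular Scrapy downloader.
        if not self.enabled:
            return None
        # Hypothetical per-request switch: meta={"selenium": False} skips rendering.
        if not request.meta.get("selenium", True):
            return None

        from scrapy.http import HtmlResponse

        self.driver.get(request.url)
        # Returning a Response here short-circuits the download.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        if self.driver is not None:
            self.driver.quit()
```

With this layout, a spider would opt out for a single page with `scrapy.Request(url, meta={"selenium": False})`, while `SELENIUM_ENABLED = False` in the settings turns rendering off for the whole project.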

Check out the project once you're done.

References