Unit 4: Handling HTML Forms

This unit covers how to post data to web servers so that our spiders can run searches and log in to websites that require authentication.

Topics

  • Handling HTML forms
  • Authenticating your spider via login forms
  • Dealing with validation tokens

Check out the slides for this unit

Sample Spiders

  1. A simple spider to demonstrate how FormRequest works: spider_1_basic_form.py
  2. A spider that authenticates into quotes.toscrape.com: spider_2_login.py
  3. Same as #2, but using the FormRequest.from_response() method: spider_3_login.py

Hands-on

1. Hacker News spider

Build a spider that logs in to news.ycombinator.com and then extracts your own username and number of points from the top of the news page (fake user/pass: scrape1123/scrape1123).

Check out the spider once you're done.

2. Quotes filtering crawler

Build a spider that scrapes all the quotes from every author listed in quotes.toscrape.com/search.aspx.

Check out the spider once you're done.

References