Unit 5: Scraping JavaScript-based pages

In this unit, we will see how to extract data from pages that rely on JavaScript to deliver their content.

Topics

  • Scraping JavaScript-based websites
  • Scraping AJAX-based websites

Check out the slides for this unit

Sample Spiders

  1. A spider using js2xml to extract alternate colors from products: spider_1_hm_js2xml.py

  2. A crawler that renders a JS-based page via Splash: p2_quotes_splash/

  3. Same as the previous one, but now using Selenium and PhantomJS: spider_3_quotes_selenium.py

  4. A spider built with Selenium + Python (not using Scrapy): spider_4_standalone_selenium.py

  5. A spider that scrapes data using AJAX calls to simulate infinite scrolling: spider_5_ajax_quotes.py
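The AJAX approach in the last item can be sketched without Scrapy. This is a minimal illustration, not the code in spider_5_ajax_quotes.py: it assumes the scrolling page is backed by a JSON endpoint at http://quotes.toscrape.com/api/quotes?page=N whose responses carry a `quotes` list and a `has_next` flag. The function names `iter_quotes` and `fetch_page_http` are hypothetical.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen

# Assumed endpoint behind the infinite-scroll page (not stated in this README).
API_URL = "http://quotes.toscrape.com/api/quotes?page={page}"


def fetch_page_http(page: int) -> dict:
    """Download one page of results and parse the JSON body."""
    with urlopen(API_URL.format(page=page), timeout=10) as resp:
        return json.load(resp)


def iter_quotes(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield one flat record per quote, requesting pages until has_next is false.

    fetch_page is any callable that maps a page number to the parsed JSON
    payload, so the paging logic can be exercised without network access.
    """
    page = 1
    while True:
        payload = fetch_page(page)
        for quote in payload.get("quotes", []):
            yield {
                "text": quote["text"],
                "author": quote["author"]["name"],  # assumed nested author object
                "tags": quote.get("tags", []),
            }
        if not payload.get("has_next"):
            break
        page += 1


# Usage against the live site (performs HTTP requests):
#     for item in iter_quotes(fetch_page_http):
#         print(item["author"], item["text"])
```

Passing the fetcher in as a parameter keeps the loop independent of the HTTP client; in a Scrapy spider the same idea becomes a callback that yields the items and then a request for the next page while `has_next` is true.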

Hands-on

1. Infinite Scrolling (AJAX)

Build a spider to fetch all the posts from http://pythonhelp.wordpress.com.

Check out the project once you're done.

2. JavaScript

Build a spider to fetch all quotes from http://quotes.toscrape.com/js using js2xml.
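As a warm-up for this exercise, the sketch below shows the underlying idea of pulling data out of inline JavaScript. It is a standard-library stand-in, not js2xml: it assumes the page embeds the quotes in a script as a JSON-compatible array literal assigned with `var data = [...]`, and the helper name `extract_js_array` is hypothetical. js2xml is the better fit when the literal is not valid JSON (unquoted keys, single quotes, function calls).

```python
import json
import re


def extract_js_array(html: str, var_name: str = "data"):
    """Return the value assigned to `var <var_name> = ...` in an inline script.

    Works only when the assigned literal is valid JSON. The decoder reads
    exactly one JSON value starting at the assignment, so a "];" inside a
    quote's text does not cut the result short.
    """
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"no 'var {var_name} = ...' assignment found")
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Hand-written sample that mimics the assumed page structure.
SAMPLE_HTML = """
<script>
    var data = [
        {"text": "First quote];", "author": {"name": "Author One"}, "tags": ["a"]},
        {"text": "Second quote", "author": {"name": "Author Two"}, "tags": []}
    ];
    for (var i in data) { /* page renders the quotes here */ }
</script>
"""

quotes = extract_js_array(SAMPLE_HTML)
```

In a spider, the `html` argument would be the response body, and each dictionary in the returned list would become one scraped item.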

Check out the project once you're done.
