Unit 5: Scraping JavaScript-based pages

In this unit, we will see how to extract data from pages that build their content with JavaScript or load it through AJAX calls.

Topics

Scraping JavaScript-based websites
Scraping AJAX-based websites

Check out the slides for this unit

Sample Spiders

A spider using js2xml to extract alternate product colors: spider_1_hm_js2xml.py
A crawler rendering a JavaScript-based page via Splash: p2_quotes_splash/
Same as the previous one, but using Selenium and PhantomJS instead of Splash: spider_3_quotes_selenium.py
A spider built with Selenium + Python (not using Scrapy): spider_4_standalone_selenium.py
A spider that scrapes data using AJAX calls to simulate infinite scrolling: spider_5_ajax_quotes.py
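
The AJAX approach in the last item can be sketched without any rendering engine: instead of scrolling, the spider calls the JSON endpoint that the page itself requests and follows it page by page. The sketch below uses only the standard library and is not the code from spider_5_ajax_quotes.py. The endpoint URL and the `quotes`, `page` and `has_next` field names are assumptions about the quotes.toscrape.com API, and `page_url`, `parse_page` and `crawl` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the JSON API called by the infinite-scroll page.
API_URL = "http://quotes.toscrape.com/api/quotes"


def page_url(page):
    """Build the URL for one page of the (assumed) JSON API."""
    return f"{API_URL}?{urlencode({'page': page})}"


def parse_page(body):
    """Return (quotes, next_url) for one JSON response body.

    Assumes the payload carries 'quotes', 'page' and 'has_next' keys.
    """
    data = json.loads(body)
    quotes = [
        {"text": q["text"], "author": q["author"]["name"]}
        for q in data.get("quotes", [])
    ]
    # Only build a next URL while the API says more pages exist.
    next_url = page_url(data["page"] + 1) if data.get("has_next") else None
    return quotes, next_url


def crawl(fetch, start_page=1):
    """Yield quotes page by page until the API reports no next page.

    `fetch` is any callable that maps a URL to a response body (str).
    """
    url = page_url(start_page)
    while url:
        quotes, url = parse_page(fetch(url))
        yield from quotes
```

In a Scrapy spider, the body of `parse_page` would live in the `parse` callback, which yields the items and then a new request for `next_url` when it is not `None`.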

Hands-on

1. Infinite Scrolling (AJAX)

Build a spider to fetch all the posts from http://pythonhelp.wordpress.com

Check out the project once you're done.

2. JavaScript

Build a spider to fetch all quotes from http://quotes.toscrape.com/js using js2xml.
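
As a starting hint, it helps to confirm that the quotes are embedded in an inline script before writing the js2xml version. The sketch below is a standard-library stand-in, not js2xml: it assumes the page assigns its quotes to a `var data = [...]` literal that happens to be valid JSON, and `extract_inline_data` is a hypothetical helper name. js2xml remains the better tool when the literal is not valid JSON (unquoted keys, single-quoted strings).

```python
import json
import re

# Assumption: the inline script assigns the quotes to `var data = [...]`.
DATA_ASSIGNMENT = re.compile(r"var\s+data\s*=\s*")


def extract_inline_data(html):
    """Return the value assigned to `var data` in the page, or []."""
    match = DATA_ASSIGNMENT.search(html)
    if match is None:
        return []
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever JavaScript follows it.
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value
```

If the literal is not valid JSON, `raw_decode` raises `json.JSONDecodeError`, which is a good signal to switch to parsing the script with js2xml.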

Check out the project once you're done.

References