In this unit, we will see how to extract data from JavaScript-based pages.
- Scraping JavaScript-based websites
- Scraping AJAX-based websites
Check out the slides for this unit
- A spider using js2xml to extract alternate colors from products: spider_1_hm_js2xml.py
- A crawler rendering a JS-based page via Splash: p2_quotes_splash/
- Same as the previous one, but using Selenium and PhantomJS: spider_3_quotes_selenium.py
- A spider built with Selenium + Python (not using Scrapy): spider_4_standalone_selenium.py
- A spider that scrapes data using AJAX calls to simulate infinite scrolling: spider_5_ajax_quotes.py
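
To make the infinite-scrolling idea concrete, here is a minimal standard-library sketch. It is not the code in spider_5_ajax_quotes.py. It assumes the page loads its data from a JSON endpoint shaped like `http://quotes.toscrape.com/api/quotes?page=N`, returning a `quotes` list and a `has_next` flag. The names `API_URL`, `fetch_json`, and `iter_quotes` are made up for illustration.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen

# Assumed endpoint: the URL pattern and the "quotes"/"has_next" keys are
# assumptions about the API behind the infinite-scroll page.
API_URL = "http://quotes.toscrape.com/api/quotes?page={page}"


def fetch_json(url: str) -> dict:
    """Download a URL and decode its JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_quotes(
    fetch: Callable[[str], dict] = fetch_json,
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield quotes page by page until the API reports no further page."""
    page = 1
    while page <= max_pages:  # hard cap so a misbehaving API cannot loop forever
        data = fetch(API_URL.format(page=page))
        yield from data.get("quotes", [])
        if not data.get("has_next"):
            break
        page += 1
```

Calling `list(iter_quotes())` would request the pages over the network one after another. In a Scrapy spider, the same loop usually becomes a `parse` callback that yields the items and then a new request for the next page while `has_next` is true.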
Build a spider to fetch all the posts from http://pythonhelp.wordpress.com
Check out the project once you're done.
Build a spider to fetch all quotes from http://quotes.toscrape.com/js using js2xml.
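
Before starting on this exercise, it may help to see what "data embedded in JavaScript" looks like. The sketch below deliberately does not use js2xml. It is a standard-library stand-in (regex plus `json`) that only works when the embedded literal happens to be valid JSON. js2xml instead parses the JavaScript into an XML tree you can query with XPath, which also copes with literals that are not valid JSON. The `SAMPLE_HTML` fragment, the `var data = [...]` layout, and the function name are assumptions for illustration, not the real page contents.

```python
import json
import re

# Hypothetical page fragment: we assume the page assigns its records to a
# variable named `data` inside an inline <script> tag.
SAMPLE_HTML = """
<script>
    var data = [
        {"text": "Sample quote one.", "author": {"name": "Author A"}, "tags": ["x", "y"]},
        {"text": "Sample quote two.", "author": {"name": "Author B"}, "tags": []}
    ];
    for (var i in data) { /* rendering code */ }
</script>
"""

# Capture the array literal assigned to `data`, stopping at the first "]"
# that is followed by a semicolon. This is fragile: a "];" inside a string
# value would cut the match short.
DATA_RE = re.compile(r"var\s+data\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_embedded_quotes(html: str) -> list:
    """Return the records embedded in the page's script, or [] if none are found."""
    match = DATA_RE.search(html)
    if match is None:
        return []
    return json.loads(match.group(1))


quotes = extract_embedded_quotes(SAMPLE_HTML)
```

Here `quotes` ends up as a list of two dictionaries. The regex approach breaks as soon as the script uses single quotes, trailing commas, or unquoted keys, which is one reason to parse the JavaScript with js2xml instead.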