Web Crawlers of JAW

JAW includes two JavaScript-enabled, Chrome-based web crawlers. This wiki page describes the crawlers' inputs, configuration options, and outputs.

The crawler folder in the root directory contains the source code for the web crawlers of JAW. As of now, JAW supports the following:

  • a Puppeteer-based crawler enhanced with the Chrome DevTools Protocol (CDP)
  • a Selenium-based crawler enhanced with custom Chrome extensions

CLI Usage (Puppeteer)

To start the crawler, run:

$ node crawler.js --seedurl=https://google.com --maxurls=100 --browser=chrome --headless=true

Please see here for more information.
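
The command above crawls a single seed URL. As a minimal sketch, assuming you want to crawl several seeds one after another, a small wrapper script can invoke the crawler once per URL. The file name `seeds.txt`, the `DRY_RUN` and `SEEDS_FILE` variables, and the `crawl_cmd` helper are illustrative assumptions and not part of JAW; only the flags shown in the command above are used.

```shell
#!/usr/bin/env bash
# Sketch: run the Puppeteer crawler once per seed URL listed in a file.
# Assumptions (not from the JAW docs): seeds.txt, SEEDS_FILE, DRY_RUN, and
# crawl_cmd are illustrative names; run this from the folder with crawler.js.

MAXURLS="${MAXURLS:-100}"              # per-seed URL budget, as in the example above
SEEDS_FILE="${SEEDS_FILE:-seeds.txt}"  # hypothetical file, one seed URL per line

# Print the crawler command line for one seed URL (does not execute it).
crawl_cmd() {
  printf 'node crawler.js --seedurl=%s --maxurls=%s --browser=chrome --headless=true' "$1" "$MAXURLS"
}

if [ -f "$SEEDS_FILE" ]; then
  while IFS= read -r url; do
    [ -n "$url" ] || continue            # skip blank lines
    echo "+ $(crawl_cmd "$url")"
    if [ "${DRY_RUN:-1}" != "1" ]; then  # dry run by default: only print the command
      # Redirect stdin so node cannot consume the remaining lines of the seeds file.
      node crawler.js --seedurl="$url" --maxurls="$MAXURLS" \
        --browser=chrome --headless=true < /dev/null
    fi
  done < "$SEEDS_FILE"
fi
```

By default the script only prints the commands it would run; setting `DRY_RUN=0` executes them sequentially.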

CLI Usage (Selenium)

To crawl a single site:

$ python3 hpg_crawler/driver.py <site-id>

To crawl a range of websites:

$ python3 hpg_crawler/driver.py <from-site-id> <to-site-id>
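
The two forms above cover one site or a contiguous range. As a rough sketch, assuming you want a non-contiguous set of sites, you can loop over the single-site form. The site IDs listed here, the `DRY_RUN` variable, and the `site_cmd` helper are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: crawl a non-contiguous set of sites with the single-site invocation.
# Assumptions (not from the JAW docs): the IDs below, DRY_RUN, and site_cmd
# are illustrative; replace the IDs with ones that exist in your site list.

SITE_IDS="3 17 42"   # hypothetical site IDs

# Print the driver command line for one site ID (does not execute it).
site_cmd() {
  printf 'python3 hpg_crawler/driver.py %s' "$1"
}

for id in $SITE_IDS; do
  echo "+ $(site_cmd "$id")"
  if [ "${DRY_RUN:-1}" != "1" ]; then  # dry run by default: only print the command
    python3 hpg_crawler/driver.py "$id"
  fi
done
```

Run it with `DRY_RUN=0` to actually start the crawls; each site is crawled only after the previous one finishes.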

Running with Docker: In docker-compose.yml, specify the website you want to crawl under the command field. Then spawn an instance of the crawler with:

$ ./run.docker.sh

For more information, please refer to the documentation of the hpg_crawler here.