Web Crawlers of JAW

JAW includes two JavaScript-enabled, Chrome-based web crawlers. This wiki page describes the crawlers' inputs, configuration options, and outputs.

The crawler folder in the root directory contains the source code for the web crawlers of JAW. As of now, JAW supports the following:

a Puppeteer-based crawler enhanced with the Chrome DevTools Protocol (CDP)
a Selenium-based crawler enhanced with custom Chrome extensions

CLI Usage (Puppeteer)

To start the crawler, run:

$ node crawler.js --seedurl=https://google.com --maxurls=100 --browser=chrome --headless=true

Please see here for more information.
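
When the Puppeteer crawler is launched from a script rather than typed by hand, it can help to assemble the command programmatically. The following is a minimal sketch, not part of JAW: the helper name `build_puppeteer_command` is hypothetical, it uses only the four flags shown in the command above, and it assumes that `--headless` also accepts the literal `false`.

```python
import shlex


def build_puppeteer_command(seed_url, max_urls=100, browser="chrome", headless=True):
    """Assemble the argument list for `node crawler.js`.

    Hypothetical helper: only the flags from the example invocation are used.
    Assumption: `--headless=false` is a valid value alongside `--headless=true`.
    """
    return [
        "node",
        "crawler.js",
        f"--seedurl={seed_url}",
        f"--maxurls={max_urls}",
        f"--browser={browser}",
        f"--headless={'true' if headless else 'false'}",
    ]


# Reproduce the example invocation as a single shell-safe string.
cmd = build_puppeteer_command("https://google.com")
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the directory that contains `crawler.js`.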

CLI Usage (Selenium)

If you want to crawl a particular site:

$ python3 hpg_crawler/driver.py <site-id>

If you want to crawl a range of websites:

$ python3 hpg_crawler/driver.py <from-site-id> <to-site-id>
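
The two invocation forms above differ only in whether a second positional argument is given. As a rough sketch, a wrapper script might build either form as follows. The helper name `build_driver_command` is hypothetical, and the ordering check assumes that site IDs are integers, which the usage above suggests but does not state.

```python
def build_driver_command(from_site_id, to_site_id=None):
    """Build the argument list for hpg_crawler/driver.py.

    Hypothetical helper. With one ID it yields the single-site form;
    with two IDs it yields the range form.
    Assumption: site IDs are integers, so a reversed range can be rejected early.
    """
    args = ["python3", "hpg_crawler/driver.py", str(from_site_id)]
    if to_site_id is not None:
        if to_site_id < from_site_id:
            raise ValueError("to_site_id must not be smaller than from_site_id")
        args.append(str(to_site_id))
    return args


# Single site, then a range of sites.
print(build_driver_command(7))
print(build_driver_command(1, 50))
```

Either list can be handed to `subprocess.run(args, check=True)` from the crawler folder.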

Running with Docker: Specify which website you want to crawl in docker-compose.yml under the command field. Then spawn an instance of the crawler by running:

$ ./run.docker.sh

For more information, please refer to the documentation of the hpg_crawler here.
