JAW includes two JavaScript-enabled, Chrome-based web crawlers. This wiki describes the crawlers' inputs, configuration options, and outputs.
The crawler folder in the root directory contains the source code for the web crawlers of JAW. As of now, JAW supports the following:
- a Puppeteer-based crawler enhanced with the Chrome DevTools Protocol (CDP)
- a Selenium-based crawler enhanced with custom Chrome extensions
To start the Puppeteer-based crawler, run:
$ node crawler.js --seedurl=https://google.com --maxurls=100 --browser=chrome --headless=true
Please see here for more information.
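When many crawls are launched from a script, the command above can also be assembled programmatically. The following is a minimal sketch, not part of JAW: the helper names build_crawler_command and run_crawler are hypothetical, only the four flags shown in the command are used, and the value false for --headless is an assumed counterpart of the documented true.

```python
import subprocess


def build_crawler_command(seed_url, max_urls=100, browser="chrome", headless=True):
    """Return the argv list for the crawler.js invocation shown above."""
    return [
        "node",
        "crawler.js",
        f"--seedurl={seed_url}",
        f"--maxurls={max_urls}",
        f"--browser={browser}",
        # The documented example uses the lowercase value "true";
        # "false" is assumed to be its counterpart.
        f"--headless={'true' if headless else 'false'}",
    ]


def run_crawler(seed_url, **options):
    """Run the crawler; raises CalledProcessError on a non-zero exit code."""
    # Assumption: the working directory is the one that contains crawler.js.
    return subprocess.run(build_crawler_command(seed_url, **options), check=True)
```

Calling build_crawler_command("https://google.com") reproduces the argument list of the command shown above; run_crawler additionally requires node and the crawler's dependencies to be installed.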
To crawl a particular site with the Selenium-based crawler:
$ python3 hpg_crawler/driver.py <site-id>
If you want to crawl a range of websites:
$ python3 hpg_crawler/driver.py <from-site-id> <to-site-id>
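The two invocations above cover a single site and a contiguous range. For a list of non-contiguous site IDs, one option is to call the driver once per ID. The sketch below assumes only the positional arguments shown above; the helper names build_driver_command and crawl_sites, and the dry_run switch, are hypothetical and not part of JAW.

```python
import subprocess


def build_driver_command(from_site_id, to_site_id=None):
    """Return the argv list for a single-site or range invocation of driver.py."""
    argv = ["python3", "hpg_crawler/driver.py", str(from_site_id)]
    if to_site_id is not None:
        argv.append(str(to_site_id))
    return argv


def crawl_sites(site_ids, dry_run=False):
    """Invoke driver.py once per site ID and return the commands that were built."""
    commands = [build_driver_command(site_id) for site_id in site_ids]
    if not dry_run:
        for argv in commands:
            # Assumption: run from the repository root, where hpg_crawler/ lives.
            subprocess.run(argv, check=True)
    return commands
```

Passing dry_run=True returns the commands without executing them, which is a cheap way to check the arguments before starting a long crawl.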
Running with Docker: specify which website you want to crawl in docker-compose.yml under the command field. Then, you can spawn an instance of the crawler by running:
$ ./run.docker.sh
For more information, please refer to the documentation of the hpg_crawler here.