Puppeteer Crawler

JAW features a JavaScript-enabled crawler built on Puppeteer and the Chrome DevTools Protocol (CDP).

The crawler visits webpages following a depth-first strategy and stops when it finds no new URLs or when the maximum number of URLs is reached. During each visit, it collects, among other data, the URLs, the scripts as they are parsed by the browser (via CDP), and a snapshot of the page's HTML.
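The traversal described above can be illustrated with a minimal sketch. It is not taken from crawler.js: the function name `crawlOrder` and the `getLinks` callback are hypothetical, and the page visit is modeled as a synchronous lookup, whereas a real visit through Puppeteer would be asynchronous.

```javascript
// Hypothetical sketch of a depth-first crawl with a visit cap.
// `getLinks(url)` stands in for the real page visit and returns the URLs
// found on that page; it is an assumption, not part of JAW's API.
function crawlOrder(seedUrl, maxUrls, getLinks) {
  const visited = new Set();
  const stack = [seedUrl]; // LIFO frontier yields depth-first order

  // Stop when the frontier is empty (no new URLs) or the cap is reached.
  while (stack.length > 0 && visited.size < maxUrls) {
    const url = stack.pop();
    if (visited.has(url)) continue; // a URL may be pushed more than once
    visited.add(url);

    const links = getLinks(url);
    // Push in reverse so the first link on the page is visited next.
    for (let i = links.length - 1; i >= 0; i--) {
      if (!visited.has(links[i])) stack.push(links[i]);
    }
  }
  return [...visited]; // URLs in the order they were visited
}
```

Both termination conditions from the description appear in the loop guard: an empty stack means no new URLs were found, and `visited.size < maxUrls` enforces the cap.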

CLI Usage

To start the crawler, run:

$ node crawler.js --seedurl=https://google.com --maxurls=100 --browser=chrome --headless=true

Crawler Configuration

  • seedurl:

    Specifies the seed URL for the crawler.

  • maxurls*:

    Specifies the termination criterion, i.e., the maximum number of URLs to visit.

  • browser*:

    Specifies the browser to use. Currently, the only supported option is chrome.

  • headless:

    Specifies whether the browser should be instantiated in headless mode.
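
To make the option format concrete, here is a small sketch of how `--key=value` arguments like the ones listed above could be turned into a configuration object. The helper `parseCrawlerArgs` is hypothetical and does not reflect how crawler.js actually parses its arguments; it only mirrors the four documented options and assumes no default values.

```javascript
// Hypothetical helper: parse `--key=value` arguments into a config object.
// Not taken from crawler.js; it only mirrors the options documented above.
function parseCrawlerArgs(argv) {
  const config = {};
  for (const arg of argv) {
    // Split on the first "=" only, so values such as URLs may contain "=".
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    if (match) config[match[1]] = match[2];
  }
  if (config.maxurls !== undefined) {
    config.maxurls = Number.parseInt(config.maxurls, 10);
  }
  if (config.headless !== undefined) {
    config.headless = config.headless === 'true';
  }
  // The documentation lists chrome as the only supported browser.
  if (config.browser !== undefined && config.browser !== 'chrome') {
    throw new Error(`Unsupported browser: ${config.browser}`);
  }
  return config;
}
```

In a Node.js script, this could be called as `parseCrawlerArgs(process.argv.slice(2))`.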