Distributed Scraper

This is a scraper function that automatically pulls in metadata from the page and supports simple HTML querying using cheerio.

It's built on top of stdlib, which makes it highly distributed and scalable.

Usage

You can either use the ready-made service that's deployed on stdlib here, or fork this repository and launch your own version on stdlib.

Example

For example, a simple scrape that picks up my own email address from GitHub (along with a bunch of extra metadata):

lib nemo.scrape --url https://github.com/nemo --query "li[itemprop='email'] a"

{ metadata:
   { general:
      { description: 'nemo has 36 repositories available. Follow their code on GitHub.',
        title: 'nemo (Nima Gardideh) · GitHub',
        lang: 'en' },
     openGraph:
      { app_id: '1401488693436528',
        image: [Object],
        site_name: 'GitHub',
        type: 'profile',
        title: 'nemo (Nima Gardideh)',
        url: 'https://github.com/nemo',
        description: 'nemo has 36 repositories available. Follow their code on GitHub.',
        username: 'nemo' },
     schemaOrg: { items: [Object] },
     twitter:
      { image: [Object],
        site: '@github',
        card: 'summary',
        title: 'nemo (Nima Gardideh)',
        description: 'nemo has 36 repositories available. Follow their code on GitHub.' } },
  url: 'https://github.com/nemo',
  query: 'li[itemprop=\'email\'] a',
  query_value: 'nima@halfmoon.ws'
}
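The `openGraph` and `twitter` groups in the output above correspond to `<meta>` tags whose `property` or `name` starts with `og:` or `twitter:`. As a rough illustration of that grouping, here is a dependency-free sketch. It is not taken from this repository's source: `groupMetaTags` is a hypothetical name, and it uses regular expressions rather than cheerio, so it will miss edge cases (unquoted attributes, a `>` inside a `content` value) that a real HTML parser handles.

```javascript
// Hypothetical helper for illustration only; not part of the nemo/scrape source.
// Groups <meta> tags by prefix: "og:*" goes to openGraph, "twitter:*" to twitter.
function groupMetaTags(html) {
  const result = { openGraph: {}, twitter: {} };
  const tagPattern = /<meta\b[^>]*>/gi;
  // Matches name="value" or name='value' attribute pairs.
  const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;

  for (const tag of html.match(tagPattern) || []) {
    const attrs = {};
    for (const m of tag.matchAll(attrPattern)) {
      attrs[m[1].toLowerCase()] = m[2] !== undefined ? m[2] : m[3];
    }

    // Open Graph tags normally use "property"; Twitter cards normally use "name".
    const key = attrs.property || attrs.name;
    if (!key || attrs.content === undefined) continue;

    if (key.startsWith('og:')) {
      result.openGraph[key.slice('og:'.length)] = attrs.content;
    } else if (key.startsWith('twitter:')) {
      result.twitter[key.slice('twitter:'.length)] = attrs.content;
    }
  }
  return result;
}

const sample =
  '<meta property="og:site_name" content="GitHub">' +
  '<meta name="twitter:card" content="summary">';
console.log(groupMetaTags(sample));
```

For anything beyond a quick experiment, a proper parser such as cheerio is the safer choice.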

You can view the function specification here.

Notes

Note that this scraper does not support sites that are single-page JavaScript applications. You should also follow robots.txt rules when you're scraping websites. Use responsibly.
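If you want to check robots.txt yourself before calling the scraper, a minimal sketch might look like the following. This is an assumption-laden illustration, not something the scraper does for you: `isPathAllowed` is a hypothetical name, it handles only plain prefix rules (no `*` or `$` wildcards), and you have to fetch the robots.txt text yourself.

```javascript
// Hypothetical helper for illustration only; simplified robots.txt matching.
// Supports User-agent, Allow and Disallow with plain prefix rules (no wildcards).
function isPathAllowed(robotsTxt, userAgent, path) {
  const groups = [];        // each group: { agents: [...], rules: [{ allow, prefix }] }
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim();   // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      lastWasAgent = false;
      // An empty Disallow means "nothing is disallowed", so it adds no rule.
      if (value !== '') current.rules.push({ allow: field === 'allow', prefix: value });
    }
    // Other fields (Sitemap, Crawl-delay, ...) are ignored.
  }

  // Prefer groups that name this agent; otherwise fall back to the "*" groups.
  const ua = userAgent.toLowerCase();
  const specific = groups.filter(g => g.agents.some(a => a && a !== '*' && ua.includes(a)));
  const chosen = specific.length ? specific : groups.filter(g => g.agents.includes('*'));

  // The longest matching prefix wins; Allow wins a tie.
  let best = null;
  for (const group of chosen) {
    for (const rule of group.rules) {
      if (!path.startsWith(rule.prefix)) continue;
      if (!best || rule.prefix.length > best.prefix.length ||
          (rule.prefix.length === best.prefix.length && rule.allow)) {
        best = rule;
      }
    }
  }
  return best ? best.allow : true;   // no matching rule means the path is allowed
}

const robots = 'User-agent: *\nDisallow: /private/\n';
console.log(isPathAllowed(robots, 'my-scraper', '/private/data'));
```

A site can also rely on wildcard rules, which this sketch ignores, so treat a `true` result as a hint rather than a guarantee.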

License

MIT
