codeGUST crawler

Description

codeGUST crawler uses the spidr gem to crawl the following websites:

  • Stack Overflow
  • GitHub
  • Tutorials Point
  • GeeksforGeeks
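
For illustration only, a crawl over one of these sites with spidr might look roughly like the sketch below. The URL, ignore patterns, and XPath are placeholders, not values taken from this repository:

require 'spidr'

# Placeholder values; in this project they come from a crawlable file.
start_url  = 'https://stackoverflow.com/questions'
ignored    = [/\/users\//, /\/tags\//]
main_xpath = '//div[@id="question"]'

Spidr.site(start_url, ignore_links: ignored) do |agent|
  agent.every_html_page do |page|
    next unless page.doc                # skip pages Nokogiri could not parse
    nodes = page.doc.xpath(main_xpath)  # pull out the configured elements
    puts "#{page.url}: #{nodes.size} node(s)"
  end
end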

Getting started

Install the dependencies:

$ bundle install

Decide what to crawl

Crawlables are read from input files in the ./crawlables/ directory.

For example, ./crawlables/crawl_github.txt contains a hash with the following keys:

  • :url => the starting URL to crawl
  • :ignored_url => a list of regexes matching URLs that will be ignored
  • :main_divs => a list of XPaths to extract from the HTML of each crawled page
  • :score_divs => a hash mapping score names to XPaths to extract from the page's HTML for various scoring purposes

Note that lines at the beginning of the file starting with # are ignored; the rest of the file is evaluated into a hash using eval().
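
Here is a hypothetical example of such a file; the values are invented for illustration, not copied from the repository:

# crawlables/crawl_github.txt (hypothetical contents)
# Leading lines starting with # are skipped by the crawler.
{
  :url => 'https://github.com/trending',
  :ignored_url => [/\/login/, /\/signup/],
  :main_divs => ['//article[@class="markdown-body"]'],
  :score_divs => { :stars => '//a[contains(@href, "stargazers")]' }
}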

Start crawling!

Command line arguments:

-i, --input_file INPUT_FILE      [REQUIRED] Input filename in ./crawlables/
-l, --limit LIMIT                [REQUIRED] Crawling limit
-u, --url URL                    Alternative starting URL
--prod                           Production environment if set, development if not set

For example:

$ ruby main.rb -i crawl_github.txt -l 10
$ ruby main.rb -i crawl_github.txt -l 5 -u https://github.com/codeGUST-SE/crawler/
$ ruby main.rb -i crawl_stackoverflow.txt -l 100 --prod
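
The input-file handling described under "Decide what to crawl" could be implemented roughly as in the sketch below. This is an illustration of the documented format (load_crawlable is an invented name), not the repository's actual code:

# Hypothetical loader for the crawlable file format described above.
def load_crawlable(filename)
  lines = File.readlines(File.join('crawlables', filename))
  lines = lines.drop_while { |line| line.start_with?('#') }  # skip leading # comments
  eval(lines.join)                                           # evaluate the rest into a hash
end

crawlable = load_crawlable('crawl_github.txt')
start_url = crawlable[:url]  # the -u flag would override this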