WikiCrawler

A sample web crawler written in Java, using the JSoup library. You can cap a crawl either by the number of files downloaded or by the total amount of data downloaded (I was on a 2G connection at the time).

It also supports proxy authentication. NOTE: it does not download images or other resources.
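For readers who have not set up proxy authentication in Java before, here is a minimal sketch of one common approach using the standard `java.net.Authenticator` class. It is not taken from `ProxyAuthenticator.java`; the class name `ProxyAuthSketch`, the `install` helper, and the host, port, and credentials are all hypothetical.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

// Hypothetical sketch: supplies credentials only when a proxy asks for them.
class ProxyAuthSketch extends Authenticator {
    private final String user;
    private final char[] password;

    ProxyAuthSketch(String user, String password) {
        this.user = user;
        this.password = password.toCharArray();
    }

    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
        // Ignore challenges from origin servers; answer proxy challenges only.
        if (getRequestorType() != RequestorType.PROXY) {
            return null;
        }
        return new PasswordAuthentication(user, password);
    }

    // Assumed helper: points the JVM's HTTP stack at the proxy and
    // registers this authenticator as the JVM-wide default.
    static void install(String host, int port, String user, String password) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", String.valueOf(port));
        Authenticator.setDefault(new ProxyAuthSketch(user, password));
    }
}
```

Calling something like `ProxyAuthSketch.install("proxy.example", 8080, "user", "secret")` once, before the first download, should be enough for plain `http://` requests made through `java.net` classes.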

Algorithm

Download a base URL.
Scrape all 'http://' strings (the links) out of it.
Recursively download the links until a stopping condition is met (file count or total download size).
Maintain a list of already-loaded links to prevent re-downloading.
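The steps above can be sketched roughly as follows. This is an illustration, not the code in `Crawl.java`: to stay self-contained it extracts links with a plain regular expression instead of JSoup, and the class name `CrawlSketch` and the pluggable `fetcher` function (URL in, page text out, `null` on failure) are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the recursive crawl with two stopping conditions.
class CrawlSketch {
    // An 'http://' string runs until whitespace, a quote, or an angle bracket.
    private static final Pattern LINK = Pattern.compile("http://[^\\s\"'<>]+");

    private final Function<String, String> fetcher; // assumed: returns null on failure
    private final int maxFiles;
    private final long maxBytes;
    private final Set<String> visited = new LinkedHashSet<>();
    private int fileCount = 0;
    private long bytesDownloaded = 0;

    CrawlSketch(Function<String, String> fetcher, int maxFiles, long maxBytes) {
        this.fetcher = fetcher;
        this.maxFiles = maxFiles;
        this.maxBytes = maxBytes;
    }

    void crawl(String url) {
        // Stop once a limit is hit, and skip links that were already loaded.
        if (limitReached() || !visited.add(url)) {
            return;
        }
        String page = fetcher.apply(url);
        if (page == null) {
            return;
        }
        fileCount++;
        bytesDownloaded += page.getBytes(StandardCharsets.UTF_8).length;

        // Scrape the links out of the page and follow each one recursively.
        Matcher m = LINK.matcher(page);
        while (m.find()) {
            crawl(m.group());
        }
    }

    private boolean limitReached() {
        return fileCount >= maxFiles || bytesDownloaded >= maxBytes;
    }

    Set<String> visited() { return visited; }
    int fileCount() { return fileCount; }
    long bytesDownloaded() { return bytesDownloaded; }
}
```

To limit by file count only, pass `Long.MAX_VALUE` as `maxBytes`; to limit by download size only, pass `Integer.MAX_VALUE` as `maxFiles`. Because the fetcher is injected, the sketch can be exercised against an in-memory map of pages before pointing it at a real downloader. Deep link chains could exhaust the call stack with this recursive form, so a queue-based loop may be a safer variant for large crawls.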
