Open Source Wikipedia Crawler with Proxy Authentication

akshay326/WikiCrawler

WikiCrawler

A sample Java web crawler built on the JSoup library. You can cap a crawl either by the number of files downloaded or by the total amount of data downloaded (I had a 2G connection at the time).

It also supports proxy authentication. NOTE: It does not download images or other resources.
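
The README does not say how the proxy credentials are supplied, so the following is only a minimal sketch of one common approach in Java: set the JVM-wide proxy properties and register a default `Authenticator`. It assumes the HTTP client honors these JVM settings, as `java.net.HttpURLConnection` does. The class name `ProxyAuthSketch`, the method `configureProxy`, and all host and credential values are hypothetical.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

public class ProxyAuthSketch {

    /**
     * Hypothetical helper: routes HTTP(S) traffic through the given proxy
     * and answers the proxy's authentication challenge with the given credentials.
     */
    static void configureProxy(String host, int port, String user, String password) {
        // Standard JVM networking properties read by java.net.HttpURLConnection.
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", String.valueOf(port));
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", String.valueOf(port));

        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                // Only answer challenges that come from a proxy, not from origin servers.
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password.toCharArray());
                }
                return null;
            }
        });
    }
}
```

Calling `ProxyAuthSketch.configureProxy("proxy.example.com", 8080, "user", "secret")` once at startup, before the first request, would be enough under these assumptions.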

Algorithm

  1. Download a base URL
  2. Scrape all 'http://' strings (the links) out of it
  3. Recursively download the links until a stopping condition is met (file count or total download size)
  4. Maintain a list of already-loaded links to prevent re-downloading
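
The steps above can be sketched roughly as follows. This is not the repository's code: it is a simplified illustration that uses a plain regular expression instead of JSoup to stay dependency-free, and an explicit queue instead of recursion to avoid deep call stacks. The names `CrawlSketch`, `extractLinks`, `crawl`, and the `fetcher` parameter (a function that returns a page body for a URL, or `null` on failure) are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawlSketch {

    // "http://" followed by characters that are not whitespace, quotes, or angle brackets.
    private static final Pattern LINK = Pattern.compile("http://[^\\s\"'<>]+");

    /** Step 2: collect every 'http://' string found in a page body. */
    static List<String> extractLinks(String body) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(body);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    /**
     * Steps 1, 3 and 4: start from baseUrl and keep following links until
     * maxFiles pages have been downloaded or maxBytes of data have been read.
     * Returns the downloaded URLs in download order.
     */
    static List<String> crawl(String baseUrl, Function<String, String> fetcher,
                              int maxFiles, long maxBytes) {
        Set<String> seen = new HashSet<>();        // step 4: links already queued or loaded
        Deque<String> pending = new ArrayDeque<>();
        List<String> downloaded = new ArrayList<>();
        long totalBytes = 0;

        pending.add(baseUrl);
        seen.add(baseUrl);
        while (!pending.isEmpty() && downloaded.size() < maxFiles && totalBytes < maxBytes) {
            String url = pending.poll();
            String body = fetcher.apply(url);      // step 1 (and step 3 for later pages)
            if (body == null) {
                continue;                          // fetch failed; skip this link
            }
            downloaded.add(url);
            totalBytes += body.getBytes(StandardCharsets.UTF_8).length;
            for (String link : extractLinks(body)) {
                if (seen.add(link)) {              // add() returns false for known links
                    pending.add(link);
                }
            }
        }
        return downloaded;
    }
}
```

Passing the fetcher in as a function keeps the stopping logic separate from the network code, so a JSoup-based download or an in-memory map of test pages can be plugged in without changing the loop.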
