A sample Java web crawler built on the JSoup library. The stopping condition can be either a maximum number of files downloaded or a maximum amount of data downloaded (I was on a 2G connection at the time).
Also supports authenticating through a proxy. NOTE: does not download images or other resources, only HTML pages.
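One way to wire up proxy authentication is through the standard `java.net` system properties and a default `Authenticator`; a minimal sketch (the host, port, and credentials below are placeholders, and the project itself may configure this differently):

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;

public class ProxyConfig {
    // Route all plain-HTTP traffic through an authenticating proxy.
    // All argument values passed in are assumptions, not real endpoints.
    static void configureProxy(String host, int port, String user, String password) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(user, password.toCharArray());
            }
        });
    }
}
```

Note that since Java 8u111 the JDK disables Basic authentication for HTTPS proxy tunneling by default; for plain `http://` crawling, as described here, the setup above is sufficient.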
- Download a base URL
- Scrape all 'http://' strings (the links) out of the page
- Recursively download the links until a stopping condition is met (file count or total download size)
- Maintain a list of already-loaded links to prevent re-downloading
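The steps above can be sketched roughly as follows. This is a self-contained illustration using plain `java.net` and a regex for link extraction (matching the "scrape all 'http://' strings" approach), rather than the project's actual JSoup-based code; the class name, limits, and helper methods are all hypothetical:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniCrawler {
    // Stopping limits: placeholder values, tune to taste.
    static final int MAX_FILES = 100;
    static final long MAX_BYTES = 50L * 1024 * 1024;

    // Matches http:// strings up to a quote, whitespace, or angle bracket.
    static final Pattern LINK = Pattern.compile("http://[^\"'\\s<>]+");

    final Set<String> visited = new HashSet<>();   // prevents re-downloading
    int filesDownloaded = 0;
    long bytesDownloaded = 0;

    // True once either stopping condition is met.
    boolean shouldStop() {
        return filesDownloaded >= MAX_FILES || bytesDownloaded >= MAX_BYTES;
    }

    // Pull every 'http://' string out of a page, as described above.
    static Set<String> extractLinks(String html) {
        Set<String> links = new HashSet<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) links.add(m.group());
        return links;
    }

    // Crawl outward from a base URL, skipping anything already visited.
    void crawl(String baseUrl) throws IOException {
        Deque<String> queue = new ArrayDeque<>();
        queue.add(baseUrl);
        while (!queue.isEmpty() && !shouldStop()) {
            String url = queue.poll();
            if (!visited.add(url)) continue;   // already downloaded
            String html = fetch(url);
            filesDownloaded++;
            bytesDownloaded += html.getBytes(StandardCharsets.UTF_8).length;
            queue.addAll(extractLinks(html));  // follow discovered links
        }
    }

    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (var in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

The sketch uses an explicit queue instead of literal recursion, which behaves the same for this purpose but cannot blow the stack on deep link chains.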