codecakes/CrawlDance


CrawlDance

A web crawler using an efficient linked-list ADT to store deep page links in an N-ary tree.
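The README does not show the data structures themselves, but the idea can be sketched as follows. All names here (LinkedNode, NTreeNode, add_child, scan_size) are assumptions for illustration, not the repository's actual API:

```python
class LinkedNode:
    """One cell of a child linked list (hypothetical sketch)."""
    def __init__(self, tree_node):
        self.tree_node = tree_node
        self.next = None


class NTreeNode:
    """A crawled page in the N-ary tree; children form a linked list."""
    def __init__(self, url):
        self.url = url
        self.head = None  # first child cell

    def add_child(self, url):
        # Prepend to the child list: O(1), no array resizing.
        child = NTreeNode(url)
        cell = LinkedNode(child)
        cell.next = self.head
        self.head = cell
        return child

    def scan_size(self):
        """Count this node plus all descendants by walking the tree."""
        total = 1
        cell = self.head
        while cell is not None:
            total += cell.tree_node.scan_size()
            cell = cell.next
        return total
```

A node per page keeps insertion cheap while the crawler descends, and a full scan only touches each node once.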

Run it like:

python dance.py http://www.facebook.com <depth level of tree> <max links>
python dance.py http://www.facebook.com 2 40
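The three positional arguments presumably map to the start URL, the tree depth, and the maximum number of links. A minimal sketch of how dance.py might read them (parse_args is a hypothetical helper, not the repository's code):

```python
import sys


def parse_args(argv):
    # dance.py <start_url> <depth> <max_links> -- argument names assumed
    url = argv[1]
    depth = int(argv[2])
    max_links = int(argv[3])
    return url, depth, max_links


if __name__ == "__main__":
    url, depth, max_links = parse_args(sys.argv)
    print(url, depth, max_links)
```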

Comparing len(result) against nTree.scanSize() shows that scanning the tree is fast:

import time

start = time.time()
print(len(result))
print("len time: {}".format(time.time() - start))

start = time.time()
print(root.scanSize())
print("scanSize time: {}".format(time.time() - start))


This is all without optimization. Compiling the hot paths with Cython and C-style typed variables typically yields speed gains of around 10x, and storing links in contiguous NumPy arrays could further improve retrieval speed and overall throughput.

TODOs:

  • Clean up the code
  • Add inline documentation
  • Replace result with a NumPy array
  • Cythonize the helper libs
  • Use something like joblib to parallelize parsing in parseurl
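For the last TODO, the parallel parsing can also be sketched with the standard library's concurrent.futures instead of joblib. parse_url below is a hypothetical stand-in for the repository's parseurl, not its real implementation:

```python
from concurrent.futures import ThreadPoolExecutor


def parse_url(url):
    # Placeholder: the real parseurl would fetch the page and
    # extract its outgoing links.
    return [url + "/a", url + "/b"]


def parse_many(urls, workers=4):
    # Threads suit this workload because fetching pages is I/O-bound.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of urls in its results.
        return list(pool.map(parse_url, urls))
```

Each URL's links come back in submission order, so results can be attached to the right tree nodes afterwards.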
