Motivation

Few spreadsheet-style datasets (Excel, Numbers, etc.) are readily available for data science applications. My hypothesis is that the web holds data in a similar format (rows and columns, with numeric relations between the cells) inside HTML table tags. Given the sheer volume of HTML on the web, we should be able to extract a very large dataset from these pages.

Purpose

The table-collector library collects data from inside table tags on the public web. It fires off a spider that crawls the web and looks for these tags. To skip tables that are used only for layout or styling, it applies a heuristic: a certain percentage of a table's td cells must hold solely numeric values. The data is written to a text file as the crawl runs.
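The numeric-cell heuristic described above can be sketched roughly as follows. This is an illustration only, not the code in validator.py: the function names, the regular expression used to decide what counts as "numeric", and the 0.5 threshold are all assumptions made for the example.

```python
import re

# Assumed definition of "solely numeric": optional sign, digits with optional
# thousands separators, optional decimal part. The project may use another rule.
_NUMERIC = re.compile(r"^[+-]?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?$")


def is_numeric(cell_text):
    """Return True if the cell text is a plain number after trimming whitespace."""
    return bool(_NUMERIC.match(cell_text.strip()))


def numeric_fraction(cells):
    """Return the share of cells that are solely numeric (0.0 for no cells)."""
    if not cells:
        return 0.0
    return sum(is_numeric(c) for c in cells) / len(cells)


def looks_like_data_table(cells, threshold=0.5):
    """Keep a table only if enough of its td cells are numeric.

    The threshold value is hypothetical; the README only says
    "a certain percentage".
    """
    return numeric_fraction(cells) >= threshold
```

Under these assumptions, a navigation table such as ["Home", "About"] would be rejected, while a table whose cells are mostly numbers would be kept.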

Installation

You will probably want to create a virtualenv first, then install the dependencies:

$ pip install -r requirements.txt

Run

$ scrapy runspider spider.py

or, to suppress Scrapy's log output:

$ scrapy runspider --nolog spider.py

TXT File Format

In the output text file, each HTML table entry is separated by a single newline character.
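Reading the output back could look something like the sketch below. It assumes that each table entry occupies exactly one line, which is one reading of the format described above; the function name read_table_entries and the file path passed to it are hypothetical, since the README does not name the output file.

```python
def read_table_entries(path):
    """Yield one table entry per non-empty line of the output file.

    Assumes entries are separated by a single newline, as described above.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            entry = line.rstrip("\n")
            if entry:  # skip blank lines, e.g. a trailing newline at end of file
                yield entry
```

Because the function is a generator, a large crawl output can be processed one entry at a time without loading the whole file into memory.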
