
Common Crawl Downloader

Languages: English | 中文

Supports Python 3.7, 3.8, and 3.9 · MIT License

Distributed download scripts for Common Crawl data.

Dependencies

Python >= 3.7 is required.

Install the Python dependencies with:

pip install -r requirements.txt

libmysqlclient-dev (or an equivalent package) is also required on Linux distributions:

sudo apt install libmysqlclient-dev

Run

Configurations

The default config file is located at configs/default.conf and lists all modifiable entries. Their descriptions and default values are shown below:

[database]
drivername = mysql
username = user
password = password
host = localhost
port = 3306
database = common_crawl

[worker]
; The name of this worker
name = unknown
; The interval between retries, in seconds
retry_interval = 5
; The number of retries before giving up
retries = 10
; The timeout for network connections, in seconds
socket_timeout = 30
; The download root path
download_path = downloaded

[schedule]
; Whether to restrict download time
enabled = false
; The start of the allowed download time
start_time = 20:00:00
; The end of the allowed download time
end_time = 07:59:59
; The interval between retries, in seconds, when downloads are restricted
retry_interval = 300
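
Note that the default window (20:00:00 to 07:59:59) wraps past midnight. Below is a minimal sketch of how such a window can be checked; the function name and logic are illustrative, not the project's actual implementation:

```python
from datetime import datetime, time
from typing import Optional

def download_allowed(start: time, end: time, now: Optional[time] = None) -> bool:
    # Illustrative check for a daily download window. When the window
    # wraps past midnight (start > end), the two halves are tested
    # separately.
    if now is None:
        now = datetime.now().time()
    if start <= end:
        return start <= now <= end
    return now >= start or now <= end

# The default window above: downloads allowed from 20:00:00 through 07:59:59.
print(download_allowed(time(20, 0, 0), time(7, 59, 59)))
```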

Do not modify the default config file directly. Instead, create your own local.conf under the configs folder and override only the entries you want to change.

An example of a valid local config file:

[database]
username = common_crawl
password = &WcKLEsX!
host = 10.10.1.217

[schedule]
enabled = true
start_time = 20:00:00
end_time = 07:59:59
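
The [database] keys match the components of a SQLAlchemy connection URL, so the layered override could plausibly work along these lines. This is a hedged sketch, assuming SQLAlchemy >= 1.4; only the file names come from this README:

```python
import configparser
from sqlalchemy.engine import URL  # assumes SQLAlchemy >= 1.4

# Read the defaults first, then let local.conf override individual keys.
config = configparser.ConfigParser()
config.read(["configs/default.conf", "configs/local.conf"])

db = config["database"]
url = URL.create(
    drivername=db["drivername"],
    username=db["username"],
    password=db["password"],
    host=db["host"],
    port=db.getint("port"),
    database=db["database"],
)
# With the local.conf above this yields something like:
# mysql://common_crawl:...@10.10.1.217:3306/common_crawl
```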

Execute the download script

Run the following command from the root of the project:

python src/main.py

Always press CTRL-C to exit the download process. Killing it directly can cause data loss and leave the database in an inconsistent state.
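
The reason CTRL-C is safe while a hard kill is not: SIGINT gives the process a chance to run cleanup before exiting, whereas SIGKILL does not. The sketch below is illustrative; the actual cleanup logic lives in the project's own code:

```python
import signal
import sys

def reset_unfinished_rows():
    # Illustrative cleanup: rows left in state 1 (downloading) would be
    # reset to state 0 (pending) so another worker can pick them up.
    print("Resetting unfinished rows before exit...")

def handle_sigint(signum, frame):
    reset_unfinished_rows()
    sys.exit(0)

signal.signal(signal.SIGINT, handle_sigint)
```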

Database Structure

data

| Field | Type | Description |
| --- | --- | --- |
| id | int | Primary key. Data ID |
| uri | varchar(256) | The URI of the data, which determines both the download URL and the folder structure |
| size | int | The size of the data in bytes |
| started_at | datetime | Download start time (CST) |
| finished_at | datetime | Download end time (CST) |
| download_state | tinyint | Download state: 0 = pending, 1 = downloading, 2 = finished, 3 = failed |
| id_worker | int | Foreign key. The ID of the worker that downloads this data |
| archive | varchar(30) | The year and month of the data on Common Crawl |

URIs can be obtained from the wet.paths files on the Common Crawl website.

An example of a URI:

crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/wet/CC-MAIN-20210224165708-20210224195708-00000.warc.wet.gz
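
Each line of a wet.paths file is one such URI. Here is a sketch of how a URI plausibly maps to a download URL and a local path under download_path; the base URL is Common Crawl's public data endpoint, and the project may resolve it differently:

```python
import os
from urllib.parse import urljoin

CC_BASE = "https://data.commoncrawl.org/"  # assumed public endpoint
DOWNLOAD_PATH = "downloaded"               # the download_path default

uri = ("crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/wet/"
       "CC-MAIN-20210224165708-20210224195708-00000.warc.wet.gz")

url = urljoin(CC_BASE, uri)                                # download URL
local_path = os.path.join(DOWNLOAD_PATH, *uri.split("/"))  # folder structure
print(url)
print(local_path)
```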

worker

| Field | Type | Description |
| --- | --- | --- |
| id | int | Primary key. Worker ID |
| name | varchar(128) | The name of the worker |
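
For orientation, the two tables could be expressed as SQLAlchemy models roughly like this. This is a hypothetical sketch, not the project's actual declarations:

```python
from sqlalchemy import (Column, DateTime, ForeignKey, Integer,
                        SmallInteger, String)
from sqlalchemy.orm import declarative_base  # SQLAlchemy >= 1.4

Base = declarative_base()

class Worker(Base):
    __tablename__ = "worker"
    id = Column(Integer, primary_key=True)
    name = Column(String(128))

class Data(Base):
    __tablename__ = "data"
    id = Column(Integer, primary_key=True)
    uri = Column(String(256))
    size = Column(Integer)                 # size in bytes
    started_at = Column(DateTime)
    finished_at = Column(DateTime)
    download_state = Column(SmallInteger)  # 0 pending, 1 downloading, 2 finished, 3 failed
    id_worker = Column(Integer, ForeignKey("worker.id"))
    archive = Column(String(30))
```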
