
Common Crawl Downloader

Languages: English | 中文

Supports Python 3.7, 3.8, and 3.9 · MIT License

Distributed download scripts for Common Crawl data.

Dependencies

Python >= 3.7 is required.

Install the Python dependencies with:

pip install -r requirements.txt

libmysqlclient-dev (or an equivalent package) is also required on Linux distributions:

sudo apt install libmysqlclient-dev

Run

Configurations

The default config file is located at configs/default.conf and lists all modifiable entries. Their descriptions and default values are shown below:

[database]
drivername = mysql
username = user
password = password
host = localhost
port = 3306
database = common_crawl

[worker]
; The name of this worker
name = unknown
; The interval between retries, in seconds
retry_interval = 5
; The number of retries before giving up
retries = 10
; The timeout for network connections, in seconds
socket_timeout = 30
; The download root path
download_path = downloaded

[schedule]
; Whether to restrict download time
enabled = false
; The start of the allowed download time
start_time = 20:00:00
; The end of the allowed download time
end_time = 07:59:59
; The interval between retries, in seconds, when downloads are restricted
retry_interval = 300
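
Note that the default window (20:00:00 to 07:59:59) wraps past midnight. Below is a minimal sketch of how such a window can be checked; the function name and logic are illustrative, not the project's actual implementation:

```python
from datetime import datetime, time
from typing import Optional

def download_allowed(start: time, end: time, now: Optional[time] = None) -> bool:
    # Illustrative check for a daily download window. When the window
    # wraps past midnight (start > end), the two halves are tested
    # separately.
    if now is None:
        now = datetime.now().time()
    if start <= end:
        return start <= now <= end
    return now >= start or now <= end

# The default window above: downloads allowed from 20:00:00 through 07:59:59.
print(download_allowed(time(20, 0, 0), time(7, 59, 59)))
```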

Do not modify the default config file directly. Instead, create your own local.conf under the configs folder and override only the entries you want to change.

An example of a valid local config file:

[database]
username = common_crawl
password = &WcKLEsX!
host = 10.10.1.217

[schedule]
enabled = true
start_time = 20:00:00
end_time = 07:59:59
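
The [database] keys match the components of a SQLAlchemy connection URL, so the layered override could plausibly work along these lines. This is a hedged sketch, assuming SQLAlchemy >= 1.4; only the file names come from this README:

```python
import configparser
from sqlalchemy.engine import URL  # assumes SQLAlchemy >= 1.4

# Read the defaults first, then let local.conf override individual keys.
config = configparser.ConfigParser()
config.read(["configs/default.conf", "configs/local.conf"])

db = config["database"]
url = URL.create(
    drivername=db["drivername"],
    username=db["username"],
    password=db["password"],
    host=db["host"],
    port=db.getint("port"),
    database=db["database"],
)
# With the local.conf above this yields something like:
# mysql://common_crawl:...@10.10.1.217:3306/common_crawl
```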

Execute the download script

Run the following command from the root of the project:

python src/main.py

Always press CTRL-C to exit the download process. Killing it directly can cause data loss and leave the database in an inconsistent state.
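
The reason CTRL-C is safe while a hard kill is not: SIGINT gives the process a chance to run cleanup before exiting, whereas SIGKILL does not. The sketch below is illustrative; the actual cleanup logic lives in the project's own code:

```python
import signal
import sys

def reset_unfinished_rows():
    # Illustrative cleanup: rows left in state 1 (downloading) would be
    # reset to state 0 (pending) so another worker can pick them up.
    print("Resetting unfinished rows before exit...")

def handle_sigint(signum, frame):
    reset_unfinished_rows()
    sys.exit(0)

signal.signal(signal.SIGINT, handle_sigint)
```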

Database Structure

data

| Field | Type | Description |
| --- | --- | --- |
| id | int | Primary key. Data ID |
| uri | varchar(256) | The URI of the data, which determines both the download URL and the folder structure |
| size | int | The size of the data in bytes |
| started_at | datetime | Download start time (CST) |
| finished_at | datetime | Download end time (CST) |
| download_state | tinyint | Download state: 0 = pending, 1 = downloading, 2 = finished, 3 = failed |
| id_worker | int | Foreign key. The ID of the worker that downloads this data |
| archive | varchar(30) | The year and month of the data on Common Crawl |

URIs can be obtained from the wet.paths files on the Common Crawl website.

An example of a URI:

crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/wet/CC-MAIN-20210224165708-20210224195708-00000.warc.wet.gz
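
Each line of a wet.paths file is one such URI. Here is a sketch of how a URI plausibly maps to a download URL and a local path under download_path; the base URL is Common Crawl's public data endpoint, and the project may resolve it differently:

```python
import os
from urllib.parse import urljoin

CC_BASE = "https://data.commoncrawl.org/"  # assumed public endpoint
DOWNLOAD_PATH = "downloaded"               # the download_path default

uri = ("crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/wet/"
       "CC-MAIN-20210224165708-20210224195708-00000.warc.wet.gz")

url = urljoin(CC_BASE, uri)                                # download URL
local_path = os.path.join(DOWNLOAD_PATH, *uri.split("/"))  # folder structure
print(url)
print(local_path)
```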

worker

| Field | Type | Description |
| --- | --- | --- |
| id | int | Primary key. Worker ID |
| name | varchar(128) | The name of the worker |
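
For orientation, the two tables could be expressed as SQLAlchemy models roughly like this. This is a hypothetical sketch, not the project's actual declarations:

```python
from sqlalchemy import (Column, DateTime, ForeignKey, Integer,
                        SmallInteger, String)
from sqlalchemy.orm import declarative_base  # SQLAlchemy >= 1.4

Base = declarative_base()

class Worker(Base):
    __tablename__ = "worker"
    id = Column(Integer, primary_key=True)
    name = Column(String(128))

class Data(Base):
    __tablename__ = "data"
    id = Column(Integer, primary_key=True)
    uri = Column(String(256))
    size = Column(Integer)                 # size in bytes
    started_at = Column(DateTime)
    finished_at = Column(DateTime)
    download_state = Column(SmallInteger)  # 0 pending, 1 downloading, 2 finished, 3 failed
    id_worker = Column(Integer, ForeignKey("worker.id"))
    archive = Column(String(30))
```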
