Process Common Crawl data with Python and Spark
Updated Apr 8, 2024 - Python
Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖
A python utility for downloading Common Crawl data
News crawling with StormCrawler - stores content as WARC
A dataset for knowledge base population research using Common Crawl and DBpedia.
🕷️ The pipeline for the OSCAR corpus
The website of the Oscar Project
Drill into WARC web archives
Statistics of Common Crawl monthly archives mined from URL index files
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
Various Jupyter notebooks about Common Crawl data
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
Parse huge web archive files from the Common Crawl data index to fetch any required domain's data concurrently with Python and Scrapy.
Tools to construct and process webgraphs from Common Crawl data
We explore data using Big Data analysis and visualization skills. To do this, we perform three main operations: i) data aggregation from different sources, ii) Big Data analysis using MapReduce, and iii) visualization through Tableau. Data analysis is critical to understanding the data and what we can do with it. For small d…
Various Common Crawl utilities in Clojure.
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types, for mass-testing frameworks like Apache POI and Apache Tika
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/