Hadoop streaming EMR job
Updated Dec 18, 2018 - Python
Parsing the common crawl database using Scala and Spark
ES6 class to read .warc or .warc.gz files member by member in Node.js
Analyzing Common Crawl data to classify content as fake or real using trained deep learning models (LSTM, CNN)
Eventually a search engine; currently a filtering pipeline for HTML files, with WARC support planned.
Identification of discourse markers in the French language
This repository contains MapReduce extractors to preprocess and extract websites from the Common Crawl corpus.
This library is a very lightweight client to Common Crawl's WARC files.
An application that crawls the Common Crawl corpus for URLs with the specified file extensions.
A small tool that uses the CommonCrawl URL Index to download documents with certain file types or MIME types, for mass-testing frameworks such as Apache POI and Apache Tika
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/
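Several of the entries above query the Common Crawl Index API at http://index.commoncrawl.org/. A minimal sketch of such a query in Python, using only the standard library: the collection name ("CC-MAIN-2018-51") and all helper names are illustrative assumptions, not the API of any specific repository listed here.

```python
# Hedged sketch of querying the Common Crawl Index API.
# Assumptions: the "<collection>-index?url=..." CDX endpoint shape and
# the collection name below are illustrative; check index.commoncrawl.org
# for the current list of collections.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_HOST = "http://index.commoncrawl.org"

def build_query_url(collection, url_pattern, output="json"):
    """Build a CDX query URL for one index collection."""
    params = urlencode({"url": url_pattern, "output": output})
    return f"{INDEX_HOST}/{collection}-index?{params}"

def parse_cdx_lines(text):
    """Each response line is a standalone JSON object describing one capture."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def lookup(collection, url_pattern):
    """Fetch and parse all index records matching url_pattern (network call)."""
    with urlopen(build_query_url(collection, url_pattern)) as resp:
        return parse_cdx_lines(resp.read().decode("utf-8"))
```

Each returned record typically carries the WARC filename, byte offset, and length of a capture, which is what the mass-download tools above use to fetch individual documents.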
A Common Crawl client example for scraping specific websites.
Perform big data analysis on the New York Times, Twitter, and Common Crawl APIs
This library gets URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl. Inspired by Corben Leo's gau.
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
Application of topic models for information retrieval and search engine optimization.
CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl
Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
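A stdlib-only Python sketch in the spirit of the WARC-grepping samples above. It is not the code from any repository listed here; it assumes WARC 1.0 record framing (a header block terminated by a blank line, a body of Content-Length bytes, then two CRLFs) and illustrative function names.

```python
# Hedged sketch: scan WARC records and report which captured URLs match
# a regex. Record framing per WARC 1.0; helper names are assumptions.
import gzip
import re

def iter_warc_records(stream):
    """Yield (headers, body) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return
        if line.strip() == b"":
            continue  # skip the CRLF separators between records
        _version = line.strip()  # e.g. b"WARC/1.0"
        headers = {}
        while True:
            line = stream.readline()
            if line.strip() == b"":
                break  # blank line ends the header block
            key, _, value = line.decode("utf-8", "replace").partition(":")
            headers[key.strip()] = value.strip()
        body = stream.read(int(headers.get("Content-Length", "0")))
        yield headers, body

def grep_warc(stream, pattern):
    """Return WARC-Target-URIs whose record body matches the regex pattern."""
    rx = re.compile(pattern)
    hits = []
    for headers, body in iter_warc_records(stream):
        if rx.search(body.decode("utf-8", "replace")):
            hits.append(headers.get("WARC-Target-URI", ""))
    return hits

def grep_warc_gz(path, pattern):
    """Same, for .warc.gz files; gzip handles the multi-member compression."""
    with gzip.open(path, "rb") as f:
        return grep_warc(f, pattern)
```

Common Crawl's .warc.gz files are gzipped member by member (one member per record), which is why the Node.js reader above works that way; Python's gzip module transparently concatenates the members.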