code402/warc-benchmark

wat-benchmark

This repository acts as a Hello World for working with WARC files.

Its subfolders contain implementations that fetch a WARC file and search all captures from .com domains for a regex that detects YouTube links.
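As a rough illustration of that task (not one of the repository's implementations), the following standard-library Python sketch walks the records of a WARC stream, keeps `response` captures whose target host ends in `.com`, and scans each payload with a regex. The pattern `YOUTUBE_RE`, the function names, and the assumption that the file has already been downloaded are all hypothetical; the repository's actual regex is not reproduced here.

```python
import gzip
import re
from urllib.parse import urlsplit

# Hypothetical pattern: the repository's real regex may differ.
YOUTUBE_RE = re.compile(rb"https?://(?:www\.)?youtube\.com/watch\?v=[\w-]{11}")


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the exact size of the record body in bytes.
        body = stream.read(int(headers.get("content-length", 0)))
        yield headers, body


def find_youtube_links(stream):
    """Return (capture URI, matched link) pairs for .com response captures."""
    hits = []
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        uri = headers.get("warc-target-uri", "")
        host = urlsplit(uri).hostname or ""
        if not host.endswith(".com"):
            continue
        # Search the raw bytes; character decoding is deliberately skipped.
        for match in YOUTUBE_RE.finditer(body):
            hits.append((uri, match.group().decode("ascii")))
    return hits


def scan_file(path):
    """Scan a local .warc.gz file (gzip handles multi-member archives)."""
    with gzip.open(path, "rb") as stream:
        return find_youtube_links(stream)
```

Calling `scan_file("example.warc.gz")` on a locally saved archive (the file name is a placeholder) returns a list of `(capture URI, link)` tuples. Matching on raw bytes avoids decoding every payload, at the cost of missing links in pages that use a non-ASCII-compatible encoding.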

See also the blog post.

This is not bulletproof, production-ready code: I/O retries, resource cleanup, and robust character decoding are omitted to keep the focus on the WARC-handling aspect of the code.