Awesome digital preservation

Awesome list of digital preservation tools

Web archiving

Crawlers

Wget - a free software package for retrieving files using HTTP, HTTPS, FTP and FTPS, the most widely used Internet protocols.
WPull - Wget-compatible web downloader and crawler.
Conifer - collect and revisit web pages
grab-site - The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Heritrix3 - Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
WAIL - Web Archiving Integration Layer: One-Click User Instigated Preservation

Browsertrix Crawler - run a high-fidelity browser-based crawler in a single Docker container

Replay tools

ArchiveWeb.page - A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers
ReplayWeb.page - Serverless Web Archive Replay directly in the browser
pywb - Core Python Web Archiving Toolkit for replay and recording of web archives
webrecorder-player - Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
ipwb - InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

Analysis and data processing

AUT - The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
AUT Notebooks - Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.
WARCIO - Streaming WARC/ARC library for fast web archive IO
Metawarc - Metadata extractor from WARC files
WarcDB - Web crawl data as SQLite databases
ArchiveSpark - An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
CDX Toolkit - A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Page pushers

ArchiveBox - Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more
Wayback - A self-hosted toolkit for archiving webpages to the Internet Archive, archive.today, IPFS, and local file systems
Archivenow - A Tool To Push Web Resources Into Web Archives
iagitup - A command line tool to archive a git repository from GitHub to the Internet Archive.

Online services

Archive-It - subscription web archiving service from the Internet Archive

Social Networks

Twitter

twarc - A command line tool (and Python library) for archiving Twitter JSON

Instagram

instaloader - Download pictures (or videos) along with their captions and other metadata from Instagram.

Universal

sfm-ui - Social Feed Manager user interface application.
Media downloader - Download Instagram Reels, Stories, and posts, view Instagram profiles, and download public Facebook videos, YouTube videos (with YouTube-to-MP3 conversion), SoundCloud MP3s, and Dailymotion videos. Built with Node.js, Express.js, React, and RapidAPI.

Other digital objects

Online storage

ydiskarc - A command-line tool to back up public resources from the Yandex.Disk (disk.yandex.ru / yadi.sk) file storage service
filegetter - A command-line tool to collect files from public data sources using URL patterns and config files

Messengers and chats

tgarc - A command line tool for archiving Telegram JSON

Specific CMS

wparc - A command-line tool for archiving WordPress API data and files
spcrawler - A command-line tool to back up data from public SharePoint installations via their open API endpoints

Public Data API

apibackuper - Python library and command-line tool to back up API calls

Standards and specifications

The WARC Format 1.1 - The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information.
CDX file format - format of CDX index files, which list the contents of WARC files
WARC Specifications - collection of WARC related specifications and formats
The WACZ Format 1.1.1 - Web Archive Collection Zipped. WACZ is a media type that allows web archive collections to be packaged and shared on the web as a discrete file.
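
As a rough illustration of what a CDX index looks like in practice, the sketch below parses a space-delimited CDX file using only the Python standard library. It assumes the common layout in which the first line is a header (a delimiter character, the word CDX, then one letter per field) and every following line is one capture. The function name `parse_cdx` and the sample lines are hypothetical and exist only for this example; real CDX files vary in field order and count, so treat this as a starting point rather than a reference implementation.

```python
from typing import Dict, Iterable, Iterator


def parse_cdx(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per CDX line, keyed by the field letters in the header.

    Assumes the first line is a header such as " CDX N b a m s k r M S V g",
    where the first character is the field delimiter.
    """
    it = iter(lines)
    header = next(it).rstrip("\n")
    delimiter = header[0]
    parts = header[1:].split(delimiter)
    if not parts or parts[0] != "CDX":
        raise ValueError("first line is not a CDX header")
    fields = parts[1:]
    for line in it:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        values = line.split(delimiter)
        yield dict(zip(fields, values))


# Hypothetical sample data, made up for illustration only.
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20200101000000 http://example.org/ text/html 200 AAAA - - 1234 5678 example.warc.gz",
]

records = list(parse_cdx(SAMPLE))
for record in records:
    # "a" is the original URL, "b" the capture timestamp, "g" the WARC file name.
    print(record["a"], record["b"], record["g"])
```

Reading the field letters from the header instead of hard-coding positions keeps the sketch usable for indexes that carry a different set of columns.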

Organizations

Digital Preservation Coalition - The DPC is a not-for-profit company dedicated to digital preservation initiatives
International Internet Preservation Consortium - Leading consortium for web archiving

Knowledge bases

Archiveteam Wiki - Wiki about various archival topics and file formats

Major digital archives

Internet Archive - one of the largest digital archives, with extensive web archives
Common Crawl - open repository of web crawl data collected from across the public web

Related lists

Awesome Web Archiving - An Awesome List for getting started with web archiving
Awesome data takeout - An Awesome Data Takeout list of services to take out your personal data from major online services and providers
