@toimik

Toimik

Open source projects for an upcoming web search engine
  • Singapore

Pinned

  1. WarcProtocol Public

    Parser for WARC (aka WebArchive) files

    C# · 8 stars · 3 forks

  2. CommonCrawl Public

    Common Crawl's processing tools

    C# · 5 stars

  3. UrlNormalization Public

    URL normalizer to canonicalize (standardize) the text representation of a URL to determine if differently-formatted URLs are identical

    C# · 4 stars

  4. SitemapsProtocol Public

    Parsers for sitemap / sitemap index (aka Sitemaps Protocol)

    C#

  5. RobotsProtocol Public

    Parsers for robots.txt (aka Robots Exclusion Standard / Robots Exclusion Protocol), Robots Meta Tag, and X-Robots-Tag

    C#

Repositories

Showing 7 of 7 repositories
  • WarcProtocol Public

    Parser for WARC (aka WebArchive) files

    C# · 8 stars · Apache-2.0 · 3 forks · 1 issue · 0 pull requests · Updated May 22, 2024
  • Wikimedia Public

    Wikimedia Downloads' processing tools

    C# · 0 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated May 2, 2024
  • UrlNormalization Public

    URL normalizer to canonicalize (standardize) the text representation of a URL to determine if differently-formatted URLs are identical

    C# · 4 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated May 2, 2024
  • SitemapsProtocol Public

    Parsers for sitemap / sitemap index (aka Sitemaps Protocol)

    C# · 0 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated May 2, 2024
  • RobotsProtocol Public

    Parsers for robots.txt (aka Robots Exclusion Standard / Robots Exclusion Protocol), Robots Meta Tag, and X-Robots-Tag

    C# · 0 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated May 2, 2024
  • IpAddressEnumeration Public

    IP address enumerators

    C# · 0 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated May 2, 2024
  • CommonCrawl Public

    Common Crawl's processing tools

    C# · 5 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated May 2, 2024
