Parsers for robots.txt (aka Robots Exclusion Standard / Robots Exclusion Protocol), Robots Meta Tag, and X-Robots-Tag

Toimik.RobotsProtocol

.NET 8 C# robots.txt parser and a C# Robots Meta Tag / X-Robots-Tag parser.

Features

RobotsTxt.cs

  • Creates instance via string or stream
  • Parses standard, extended, and custom fields:
    • User-agent
    • Disallow
    • Crawl-delay
    • Sitemap
    • Allow (toggleable; can be ignored if needed)
    • Others (e.g. Host)
  • Supports misspellings of fields
  • Matches wildcards in paths (* for any sequence of characters, $ for end of path)
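
To illustrate what wildcard matching in paths means, here is a minimal sketch that translates a robots.txt path pattern into a regular expression. `PathMatches` is a hypothetical helper written for this example. It is not part of Toimik.RobotsProtocol and does not claim to reflect how the library implements matching.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not the library's API.
static bool PathMatches(string pattern, string path)
{
    // A trailing '$' anchors the pattern to the end of the path.
    var anchored = pattern.EndsWith('$');
    var body = anchored ? pattern[..^1] : pattern;

    // Escape regex metacharacters, then turn each escaped '*' into ".*".
    var regex = "^" + Regex.Escape(body).Replace("\\*", ".*") + (anchored ? "\\z" : "");
    return Regex.IsMatch(path, regex, RegexOptions.CultureInvariant);
}

Console.WriteLine(PathMatches("/folder/*.htm$", "/folder/file.htm"));   // True
Console.WriteLine(PathMatches("/folder/*.htm$", "/folder/file.html"));  // False
Console.WriteLine(PathMatches("/folder/", "/folder/file.htm"));         // True (prefix match)
```

Without a trailing `$`, the pattern behaves as a prefix match, which is why the last call succeeds.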

RobotsTag.cs

  • Parses custom fields

Quick Start

Installation

Package Manager

PM> Install-Package Toimik.RobotsProtocol

.NET CLI

> dotnet add package Toimik.RobotsProtocol

Usage

The snippets below show typical usage.

Refer to the demo programs in the samples folder for complete source code.

RobotsTxt.cs (for parsing robots.txt)

var robotsTxt = new RobotsTxt();

// Load content of a robots.txt from a String
var content = "...";
_ = robotsTxt.Load(content);

// Alternatively, load content of a robots.txt from a Stream (asynchronous)
// Stream stream = ...;
// _ = await robotsTxt.Load(stream);

// Check whether the given user agent may crawl the given path
var isAllowed = robotsTxt.IsAllowed("autobot", "/folder/file.htm");
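
The stream variant above leaves open where the stream comes from. As a minimal sketch using only the .NET base class library, the following wraps in-memory text in a `MemoryStream`. The sample content and the UTF-8 encoding are assumptions made for illustration. A file or HTTP response stream would serve equally well.

```csharp
using System;
using System.IO;
using System.Text;

// Sample robots.txt text (assumed for illustration).
var sampleContent = "User-agent: *\nDisallow: /private/\n";

// Wrap the text in a read-only, in-memory stream; UTF-8 is an assumption.
using var stream = new MemoryStream(Encoding.UTF8.GetBytes(sampleContent), writable: false);

// The stream could then be handed to the parser, as in the snippet above:
// _ = await robotsTxt.Load(stream);

Console.WriteLine($"Stream ready: {stream.Length} bytes, readable = {stream.CanRead}");
```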

RobotsTag.cs (for parsing robots meta tag / x-robots-tag)

var robotsTag = new RobotsTag();

// This data is retrieved from either a Robots Meta Tag (e.g. <meta name="badbot"
// content="none">) or an X-Robots-Tag HTTP response header (e.g. X-Robots-Tag: otherbot:
// index, nofollow).
var data = ...;

// Words treated as the name of directives with values (e.g. max-snippet: 10).
var specialWords = new HashSet<string>
{
    "max-snippet",
    "max-image-preview",

    // ... Add accordingly
};

// Load the data to parse. This extracts each directive into its own Tag instance.
_ = robotsTag.Load(data, specialWords);

var hasNone = robotsTag.HasTag("autobot", "none");
var hasNoIndex = robotsTag.HasTag("autobot", "noindex");
var isIndexable = !hasNone && !hasNoIndex;