Robots Exclusion Tools

A "robots.txt" parsing and querying library for .NET

Closely following the NoRobots RFC, the Robots Exclusion Protocol RFC, and other details on robotstxt.org.


📋 Features

  • Load robots.txt data from a string, a URI (async), or a stream (async)
  • Supports multiple user-agents and wildcard user-agent (*)
  • Supports Allow and Disallow
  • Supports Crawl-delay entries
  • Supports Sitemap entries
  • Supports wildcard paths (*) as well as must-end-with declarations ($), as shown in the sample file below
  • Dedicated parser for the data from the <meta name="robots" /> tag and the X-Robots-Tag header
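
As a quick illustration, a hypothetical "robots.txt" file exercising these features might look like the following (the domain and rules are made up for the example):

User-agent: *
Disallow: /private/
Allow: /private/welcome.html
Crawl-delay: 10

User-agent: SpecificBot
Disallow: /*.json$

Sitemap: http://www.example.org/sitemap.xml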

🤝 Licensing and Support

Robots Exclusion Tools is licensed under the MIT license. It is free to use in personal and commercial projects.

There are support plans available that cover all active Turner Software OSS projects. Support plans provide private email support, expert usage advice for our projects, priority bug fixes and more. These support plans help fund our OSS commitments to provide better software for everyone.

NoRobots RFC vs Robots Exclusion Protocol RFC

The NoRobots RFC was released in 1996 and describes the core syntax that makes up a typical robots.txt file. A newer standard, the Robots Exclusion Protocol RFC (in draft as of August 2022), has been proposed to effectively replace it.

The two RFCs overlap substantially in their core rules. Generally though, the Robots Exclusion Protocol RFC is more flexible about allowed characters (full UTF-8) and spacing.

The Robots Exclusion Tools library attempts to strike a compatibility balance between the two, allowing some specific quirks of the NoRobots RFC alongside the expanded character set from the Robots Exclusion Protocol RFC.

Parsing in-request robots rules (metatags and header)

Similar to the rules from a "robots.txt" file, there can be in-request rules that decide whether a page allows indexing or following links. Extracting this data from a request isn't currently part of this library, which avoids taking a dependency on an HTML parser.

If you extract the raw rules from the meta tags and X-Robots-Tag header, you can pass those into the parser. The parser takes an array of rules and returns a RobotsPageDefinition object which allows querying of the rules by user agent.
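
As a rough sketch of that extraction step, the snippet below pulls rules from the X-Robots-Tag response headers and from robots meta tags via a naive regular expression. The HTTP handling, page URL, and regex are illustrative assumptions and not part of this library; only FromRules comes from Robots Exclusion Tools, and a real HTML parser would be more robust for the meta tags.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using TurnerSoftware.RobotsExclusionTools;

var httpClient = new HttpClient();
var response = await httpClient.GetAsync(new Uri("http://www.example.org/some-page"));

var pageRules = new List<string>();

//The X-Robots-Tag header may appear multiple times; each value is one rule line
if (response.Headers.TryGetValues("X-Robots-Tag", out var headerValues))
{
	pageRules.AddRange(headerValues);
}

//Naive extraction of <meta name="robots" content="..."> values
//A real implementation should handle attribute order and quoting variations
var html = await response.Content.ReadAsStringAsync();
foreach (Match match in Regex.Matches(html, "<meta\\s+name=[\"']robots[\"']\\s+content=[\"']([^\"']*)[\"']", RegexOptions.IgnoreCase))
{
	pageRules.Add(match.Groups[1].Value);
}

var robotsPageDefinition = new RobotsPageParser().FromRules(pageRules.ToArray());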

There is no RFC that defines the format of meta tag or X-Robots-Tag data. The parser follows the base formatting rules for fields described in the NoRobots and Robots Exclusion Protocol RFCs, combined with rules from Google's documentation on the robots meta tag. There are ambiguities in the rules described there (like whether rules are inherited from the global scope) which may differ from what other implementations use.

Example Usage

Parsing a "robots.txt" file from URI

using TurnerSoftware.RobotsExclusionTools;

var robotsFileParser = new RobotsFileParser();
RobotsFile robotsFile = await robotsFileParser.FromUriAsync(new Uri("http://www.example.org/robots.txt"));

var allowedAccess = robotsFile.IsAllowedAccess(
	new Uri("http://www.example.org/some/url/i-want-to/check"),
	"MyUserAgent"
);
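
Building on that, a minimal crawler guard might look like the sketch below. The candidate URLs and user agent string are placeholders; only FromUriAsync and IsAllowedAccess come from this library.

using System;
using System.Net.Http;
using TurnerSoftware.RobotsExclusionTools;

const string userAgent = "MyUserAgent";
var httpClient = new HttpClient();

var robotsFile = await new RobotsFileParser().FromUriAsync(new Uri("http://www.example.org/robots.txt"));

var candidateUris = new[]
{
	new Uri("http://www.example.org/articles/1"),
	new Uri("http://www.example.org/private/admin")
};

foreach (var uri in candidateUris)
{
	if (robotsFile.IsAllowedAccess(uri, userAgent))
	{
		//Only fetch pages the robots.txt rules permit for our user agent
		var content = await httpClient.GetStringAsync(uri);
		Console.WriteLine($"Fetched {uri} ({content.Length} chars)");
	}
	else
	{
		Console.WriteLine($"Skipped {uri} (disallowed)");
	}
}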

Parsing robots data from metatags or the X-Robots-Tag

using TurnerSoftware.RobotsExclusionTools;

//These rules are gathered by you from the robots meta tag and `X-Robots-Tag` header
var pageRules = new[] {
	"noindex, notranslate",
	"googlebot: none",
	"otherbot: nofollow",
	"superbot: all"
};

var robotsPageParser = new RobotsPageParser();
RobotsPageDefinition robotsPageDefinition = robotsPageParser.FromRules(pageRules);

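//Unlisted user agents fall back to the global rules ("noindex, notranslate"); "nofollow" was never set globally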
robotsPageDefinition.CanIndex("SomeNotListedBot/1.0"); //False
robotsPageDefinition.CanFollowLinks("SomeNotListedBot/1.0"); //True
robotsPageDefinition.HasRule("notranslate", "SomeNotListedBot/1.0"); //True

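//"none" is equivalent to "noindex, nofollow"; the global "notranslate" is also inherited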
robotsPageDefinition.CanIndex("GoogleBot/1.0"); //False
robotsPageDefinition.CanFollowLinks("GoogleBot/1.0"); //False
robotsPageDefinition.HasRule("notranslate", "GoogleBot/1.0"); //True

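//"nofollow" for otherbot combines with the inherited global "noindex" and "notranslate"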
robotsPageDefinition.CanIndex("OtherBot/1.0"); //False
robotsPageDefinition.CanFollowLinks("OtherBot/1.0"); //False
robotsPageDefinition.HasRule("notranslate", "OtherBot/1.0"); //True

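//"all" explicitly permits indexing and following links, though the global "notranslate" is still inherited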
robotsPageDefinition.CanIndex("superbot/1.0"); //True
robotsPageDefinition.CanFollowLinks("superbot/1.0"); //True
robotsPageDefinition.HasRule("notranslate", "superbot/1.0"); //True