Suggestion: robots.txt shouldn't be reparsed every time #198

Open
panthony opened this issue Apr 4, 2018 · 0 comments
panthony commented Apr 4, 2018

What is the current behavior?

The robots.txt file is re-parsed for every request, but those files can be large.

Today Google only reads the first 500 KB and ignores the rest.

What is the expected behavior?

Maybe the crawler could cache up to N parsed robots.txt instances. That would give a strong cache hit rate without letting the cache grow forever.
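
To make this concrete, here is a minimal sketch of such a bounded cache, keyed by host and evicting the least recently used entry. The names (`RobotsCache`, `parseRobotsTxt`, `ParsedRobots`) are placeholders, not the crawler's actual API; the real parser would be reused instead of the stand-in shown here.

```typescript
interface ParsedRobots {
  disallowed: string[];
}

// Stand-in parser: just collects Disallow rules. The crawler's real parser would be reused.
function parseRobotsTxt(raw: string): ParsedRobots {
  const disallowed = raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim());
  return { disallowed };
}

class RobotsCache {
  // Map preserves insertion order, which is enough for a simple LRU policy.
  private cache = new Map<string, ParsedRobots>();

  constructor(private maxEntries = 100) {}

  get(host: string, fetchRaw: () => string): ParsedRobots {
    const hit = this.cache.get(host);
    if (hit) {
      // Refresh recency: delete and re-insert so this host moves to the "newest" end.
      this.cache.delete(host);
      this.cache.set(host, hit);
      return hit;
    }
    // Parse once per host (until evicted) instead of once per request.
    const parsed = parseRobotsTxt(fetchRaw());
    if (this.cache.size >= this.maxEntries) {
      // Evict the least recently used entry so the cache cannot grow forever.
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) this.cache.delete(oldest);
    }
    this.cache.set(host, parsed);
    return parsed;
  }
}

// Usage: the raw robots.txt is only fetched and parsed on a cache miss.
const robots = new RobotsCache(50);
const rules = robots.get("example.com", () => "User-agent: *\nDisallow: /private");
console.log(rules.disallowed); // ["/private"]
```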

What is the motivation / use case for changing the behavior?

Although I couldn't find the robots.txt in question again, I have already seen ones that were easily > 1 MB.

Overall performance could take a serious hit if such a file were re-parsed for every single request.
