Suggestion: robots.txt shouldn't be reparsed every time #198

Open
panthony opened this issue Apr 4, 2018 · 0 comments
panthony commented Apr 4, 2018

What is the current behavior?

The robots.txt file is re-parsed for every request, but those files can be large.

Today Google only reads the first 500 KB and ignores the rest.

What is the expected behavior?

Maybe the crawler could cache up to N parsed robots.txt instances. That would give a strong cache hit rate without letting the cache grow forever.
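
To make this concrete, here is a minimal sketch of such a bounded cache, keyed by host and evicting the least recently used entry. The names (`RobotsCache`, `parseRobotsTxt`, `ParsedRobots`) are placeholders, not the crawler's actual API; the real parser would be reused instead of the stand-in shown here.

```typescript
interface ParsedRobots {
  disallowed: string[];
}

// Stand-in parser: just collects Disallow rules. The crawler's real parser would be reused.
function parseRobotsTxt(raw: string): ParsedRobots {
  const disallowed = raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim());
  return { disallowed };
}

class RobotsCache {
  // Map preserves insertion order, which is enough for a simple LRU policy.
  private cache = new Map<string, ParsedRobots>();

  constructor(private maxEntries = 100) {}

  get(host: string, fetchRaw: () => string): ParsedRobots {
    const hit = this.cache.get(host);
    if (hit) {
      // Refresh recency: delete and re-insert so this host moves to the "newest" end.
      this.cache.delete(host);
      this.cache.set(host, hit);
      return hit;
    }
    // Parse once per host (until evicted) instead of once per request.
    const parsed = parseRobotsTxt(fetchRaw());
    if (this.cache.size >= this.maxEntries) {
      // Evict the least recently used entry so the cache cannot grow forever.
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) this.cache.delete(oldest);
    }
    this.cache.set(host, parsed);
    return parsed;
  }
}

// Usage: the raw robots.txt is only fetched and parsed on a cache miss.
const robots = new RobotsCache(50);
const rules = robots.get("example.com", () => "User-agent: *\nDisallow: /private");
console.log(rules.disallowed); // ["/private"]
```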

What is the motivation / use case for changing the behavior?

Although I couldn't find the robots.txt in question again, I have already seen ones that were easily > 1 MB.

Overall performance could take a serious hit if such a file were re-parsed for every single request.
