
Non-encoding of URL queries causing Regex robots.txt rule matches to be missed #28

Open
jack-kaiasm opened this issue Aug 12, 2022 · 3 comments


jack-kaiasm commented Aug 12, 2022

Hi, we have recently switched from Reppy to Protego as our robots.txt parser. All seemed fine, but we noticed a few differences between Reppy and Protego in the URLs we were crawling: essentially, Protego appeared to allow access to URLs that should be blocked. Protego follows the Google specification and Reppy does not, so some differences are expected. However, the official Google Robots Tester also blocked access to these URLs, so there seems to be an error here.

The rule in the robots.txt file that appears to be ignored is /en-uk/*q=*relevance*, and an example of a URL that was not filtered by this rule is /en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
Here is the output from Google Robots Tester showing that this URL should be blocked by the aforementioned string:
[Screenshot: Google Robots Tester result showing the URL blocked by /en-uk/*q=*relevance*]

Having looked at the Protego code, we believe that we have found where this apparent error comes from. We also think we have a fix for it, and will happily submit the fix for your scrutiny as we'd like to know if there are unforeseen consequences from this change.

The problem involves the percent-encoding of the URL string. Protego splits the URL into parts, e.g.:

scheme="http://", netloc='www.website.com', path='/en-uk/new-wildlife-range/c/4018', params='', query='q=%3Arelevance%3Atype%3AHedgehog%2BFood', fragment=''

It then percent-encodes symbols in the "path" part, drops the "scheme" and "netloc" parts, and reassembles the URL to compare against the rules in robots.txt. The issue we're seeing is that only the "path" part is encoded; the "query" part is left alone.
We end up with this as the URL to be checked:
/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
When a regex search is applied to this using the pattern /en-uk/.*?q%3D.*?relevance.*? (the rule after encoding, with each * translated to .*?), no match is found, because the = in the URL's query has not been encoded to %3D.
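
To make the failure concrete, here is a minimal standard-library sketch of the behaviour we believe we are seeing (an approximation of Protego's logic, not its actual code):

```python
import re
from urllib.parse import urlparse, quote, unquote

url = ("http://www.website.com/en-uk/new-wildlife-range/c/4018"
       "?q=%3Arelevance%3Atype%3AHedgehog%2BFood")
parts = urlparse(url)

# Current behaviour: only the path is decoded and re-encoded; the scheme
# and netloc are dropped, and the query is reattached untouched.
path = quote(unquote(parts.path))
target = path + ("?" + parts.query if parts.query else "")
print(target)
# /en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood

# The rule /en-uk/*q=*relevance* is encoded the same way ('=' -> '%3D')
# and its wildcards become '.*?':
pattern = re.compile(r"/en-uk/.*?q%3D.*?relevance.*?")
print(pattern.search(target) is not None)  # False: the URL slips past the rule
```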
The fix we have is simple: it encodes the "query" part in the same way as the "path" part. We instead end up with this URL:
/en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood
which is matched by the regex pattern correctly, and crawler access is blocked.
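
A sketch of the proposed fix, applying the same decode-then-encode step to the query (again an approximation of the change, not Protego's actual code):

```python
import re
from urllib.parse import urlparse, quote, unquote

url = "/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood"
parts = urlparse(url)

# Proposed fix: decode and re-encode the query exactly like the path.
path = quote(unquote(parts.path))
query = quote(unquote(parts.query))
target = path + ("?" + query if query else "")
print(target)
# /en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood

pattern = re.compile(r"/en-uk/.*?q%3D.*?relevance.*?")
print(pattern.search(target) is not None)  # True: the Disallow rule now matches
```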

Is this likely to cause any unforeseen issues?
Thanks


Gallaecio commented Aug 12, 2022

Sounds like a good fix.

I do wonder what the right behavior should be if the rule is /en-uk/*q%3D*relevance* instead of /en-uk/*q=*relevance*. We should find out, in case we need to adjust the fix accordingly (i.e. we may need to decode rules as well).

jack-kaiasm (Author) commented

I think as long as there is consistency in how rules and URLs are encoded, then decoding won't be necessary. I couldn't find any official word on the right behaviour, but I'd expect (for example) = and %3D to be treated as the same thing. So if we encode the rule and the URL, it won't matter how each is specified. Right?
Protego already decodes rules and URLs before encoding them, presumably to ensure consistency in formatting.
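
For instance, a decode-then-encode pass maps both spellings to the same canonical form (a quick standard-library sketch; Protego's own helper may use a different safe-character set):

```python
from urllib.parse import quote, unquote

# '=' and '%3D' normalise to the same form once both rule and URL
# go through the same decode-then-encode pass.
for s in ("q=relevance", "q%3Drelevance"):
    print(quote(unquote(s)))
# q%3Drelevance
# q%3Drelevance
```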

Your point made me look into the code a little further, and I notice that the "query" part is also not touched for a rule string. E.g.:

rule = Disallow: /en-uk/*?*specialFeatures=*|*|*
parts = ParseResult(scheme='', netloc='', path='/en-uk/*', params='', query='*specialFeatures=*|*|*', fragment='')
encoded rule = /en-uk/*?*specialFeatures=*|*|*

Asterisks (as well as a few other special characters) are deliberately left unencoded, but | should in theory be encoded, and it would be if it appeared in the "path" part. If we modify the code to encode "query" as we did for URLs, we get this:

rule = Disallow: /en-uk/*?*specialFeatures=*|*|*
parts = ParseResult(scheme='', netloc='', path='/en-uk/*', params='', query='*specialFeatures%3D*%7C*%7C*', fragment='')
encoded rule = /en-uk/*?*specialFeatures%3D*%7C*%7C* 

In this way, any URL should match its corresponding rule regardless of how either is written, because both are encoded in the same way.
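
For reference, here is roughly how that rule-query encoding could work; the safe-character set (/ plus the * and $ wildcards) is my assumption, not necessarily the set Protego actually uses:

```python
from urllib.parse import urlparse, quote, unquote

rule = "/en-uk/*?*specialFeatures=*|*|*"
parts = urlparse(rule)
print(parts.query)  # *specialFeatures=*|*|*

# Percent-encode the query, but keep '/' and the robots.txt wildcards
# '*' and '$' unescaped (assumed safe set, for illustration).
encoded_query = quote(unquote(parts.query), safe="/*$")
print(parts.path + "?" + encoded_query)
# /en-uk/*?*specialFeatures%3D*%7C*%7C*
```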

parts is created with six.moves.urllib.parse.urlparse(), and there is a quirk in how rules get parsed: the rule in my previous post contains no literal ?, so q=*relevance* ends up in the "path" part when it should conceptually be in "query". The rule's = is therefore encoded to %3D while the URL's is not, which is likely the source of the original problem.
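
This is easy to demonstrate with urlparse directly:

```python
from urllib.parse import urlparse

# The rule contains no literal '?', so urlparse leaves everything in
# 'path' and nothing in 'query'; Protego would then encode the '='
# along with the rest of the path.
print(urlparse("/en-uk/*q=*relevance*"))
# ParseResult(scheme='', netloc='', path='/en-uk/*q=*relevance*',
#             params='', query='', fragment='')
```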

So perhaps it's also worth encoding the "params" and "fragment" parts in the same way. They probably wouldn't appear in real rules, but urlparse may split a rule unexpectedly and populate these parts too.

Gallaecio commented

It looks like Google published a standard last September (RFC 9309, the Robots Exclusion Protocol), which seems to detail how the encoding/decoding should be handled. Now it is a matter of actually following it.
