Latest allowed_domains behavior breaks middlewares that rewrite urls #6366
Comments
As a quick workaround, you can disable the new downloader middleware and enable the old spider middleware.
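A minimal sketch of that workaround in settings.py, assuming the middleware paths that ship with Scrapy 2.11.2 (the spider middleware priority shown is the usual default; adjust it if your project overrides the base settings):

```python
# settings.py -- sketch of the workaround described above:
# disable the new (2.11.2) offsite *downloader* middleware and rely on the
# older offsite *spider* middleware instead.

DOWNLOADER_MIDDLEWARES = {
    # Setting a middleware to None disables it.
    "scrapy.downloadermiddlewares.offsite.OffsiteMiddleware": None,
}

SPIDER_MIDDLEWARES = {
    # 500 is the usual default priority for the offsite spider middleware.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": 500,
}
```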
This sounds good to me, allowing this to be set.
For the record, this is a breaking change in a patch version, which is something we are willing to do for security issues like this one. We have done it in the past, and we will probably do it again in the future. Python does the same. That said, had we realized this could be an issue, we would have provided a workaround like the one you suggest as part of the fix. We did not think of scenarios like yours, even though it sounds very fair. We’ll try to address this in 2.11.3.
Related: #6151
This will still require updating middlewares that reschedule requests. I think we need to add a couple of suggestions about this problem to the docs, probably to the 2.11.2 release notes.
Description
The change to the behavior of `Spider.allowed_domains` in 2.11.2 broke several of our crawls because it does not play well with downloader middlewares that replace the original request in order to modify the URL. Modifying the URL (such as for proxies that only have APIs, without a transparent proxy mode) requires using `request.replace`, but that generates a "new" request, which Scrapy then reschedules. Rescheduling replays all middlewares, including the new offsite middleware, so the proxied request gets filtered (unless the proxy domain is also included in `allowed_domains`).

The documented workaround of using `dont_filter` is a non-starter, since it won't fix published middlewares that you don't own, and it has other side effects (e.g. disabling the duplicate filter) which may be undesired.

This is a breaking change on a minor version, which really shouldn't happen.
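For context, here is a minimal sketch of the kind of downloader middleware described above; the class name, proxy endpoint, and meta key are hypothetical, but the `request.replace` pattern is what triggers the rescheduling:

```python
from urllib.parse import quote


class ProxyApiMiddleware:
    """Hypothetical downloader middleware for a proxy that only exposes an
    HTTP API endpoint (no transparent proxy mode)."""

    PROXY_API = "https://proxy.example.com/fetch?url={url}"  # assumed endpoint

    def process_request(self, request, spider):
        if request.meta.get("proxied"):
            # Already rewritten on a previous pass; let it through.
            return None
        proxied_url = self.PROXY_API.format(url=quote(request.url, safe=""))
        # replace() creates a *new* request. Returning it makes Scrapy
        # reschedule it through all middlewares again, where the 2.11.2
        # offsite downloader middleware drops it unless proxy.example.com
        # is also listed in allowed_domains.
        return request.replace(url=proxied_url, meta={**request.meta, "proxied": True})
```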
Steps to Reproduce
1. Use a downloader middleware that rewrites request URLs via `request.replace` (e.g. a proxy middleware, as described above).
2. Set `allowed_domains` to the desired domain, e.g. books.toscrape.com.
3. Run the crawl with Scrapy 2.11.2.

Expected behavior:
As in versions prior to 2.11.2, the desired domain is still crawled.
Actual behavior:
Nothing gets crawled
Reproduces how often:
100% in the given scenario
Additional context
I can think of the following workarounds, but all of them have major drawbacks:

- Remove `allowed_domains` entirely (but now you don't get its intended benefits).
- Add the proxy domain(s) to `allowed_domains` (this breaks encapsulation, makes it harder to switch proxies, and might make this a very large list for large rotating proxies).
- Have the middleware set `dont_filter=True` on the replaced request, as sketched below (but what if you're using a 3rd-party middleware like the one above, so you can't modify it, and you also lose dupe filtering).
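As an example of that last workaround, here is the same hypothetical middleware with `dont_filter=True` set on the replaced request; this keeps the rewritten request from being dropped, but it also bypasses the duplicate filter:

```python
from urllib.parse import quote


class ProxyApiMiddleware:
    PROXY_API = "https://proxy.example.com/fetch?url={url}"  # assumed endpoint

    def process_request(self, request, spider):
        if request.meta.get("proxied"):
            return None
        proxied_url = self.PROXY_API.format(url=quote(request.url, safe=""))
        # dont_filter=True stops the offsite downloader middleware (and the
        # dupe filter) from dropping the rewritten request -- only possible
        # when you control the middleware's source.
        return request.replace(
            url=proxied_url,
            dont_filter=True,
            meta={**request.meta, "proxied": True},
        )
```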
Suggested fixes

Perhaps the new behavior could be off by default and toggled with a new setting, with the old behavior then going through a deprecation cycle in future major versions. At least then the breaking change would be expected.
It might also help to separate out a `dont_offsite_filter` flag, so that a well-designed middleware could intentionally avoid this dilemma without also triggering the rest of the `dont_filter` behavior.
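Purely to illustrate that suggestion, usage might look like this (`dont_offsite_filter` does not exist in Scrapy; it is the flag proposed above):

```python
# Hypothetical usage of the proposed flag; dont_offsite_filter is NOT an
# existing Scrapy API. The idea is to skip only the offsite check while
# leaving duplicate filtering intact.
new_request = request.replace(url=proxied_url, dont_offsite_filter=True)
```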