mohmad-null changed the title from "LinkExctractor changing case of URL (but didn't used to)" to "LinkExtractor changing case of URL (but didn't used to)" on May 3, 2024
On it.
There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive.
After investigating, I found that the above case is due to the use of canonicalize_url. This is an important function that helps with finding duplicates, etc. We can definitely document this to help users.
There is a canonicalize parameter that is False by default, so I’m not so sure this is about canonicalize_url. Maybe it is lxml’s behavior? It may be worth looking into, and adding a note about it to the reference docs for the canonicalize parameter.
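For anyone who needs to match extracted links against the original markup in the meantime, here is a minimal sketch using only the standard library. It relies on the fact that the scheme and host of a URL are case-insensitive (RFC 3986), while the path and query are not. The helper name `same_url_ignoring_host_case` is hypothetical and not part of Scrapy, and the sketch says nothing about which component is lowercasing the host in the report.

```python
from urllib.parse import urlsplit


def same_url_ignoring_host_case(a: str, b: str) -> bool:
    """Compare two URLs, ignoring case in the scheme and host only.

    Hypothetical helper for illustration; userinfo and fragments are
    not compared.
    """
    pa, pb = urlsplit(a), urlsplit(b)
    return (
        pa.scheme.lower() == pb.scheme.lower()
        # SplitResult.hostname is always returned in lower case.
        and (pa.hostname or "") == (pb.hostname or "")
        and pa.port == pb.port
        # Path and query are compared exactly: they are case-sensitive.
        and pa.path == pb.path
        and pa.query == pb.query
    )


original = "http://MYURL/SomePath/services/words/MorePath?abc"
extracted = "http://myurl/SomePath/services/words/MorePath?abc"

print(urlsplit(original).netloc)    # keeps the case as written
print(urlsplit(original).hostname)  # lower-cased host
print(same_url_ignoring_host_case(original, extracted))
```

With the two URLs from the report, only the host differs in case, so the comparison treats them as the same URL; a difference in the path or query would still count as a different URL.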
Regression?
I have an HTML file that contains a link like:
<a target="_blank" href="http://MYURL/SomePath/services/words/MorePath?abc">Words</a>
I'm extracting with code that looks like this:
But my URL comes back as:
http://myurl/SomePath/services/words/MorePath?abc
Note that MYURL has become myurl.
I've just upgraded from Scrapy 1.7.x to 2.11.1. In 1.7 and previously it would come out as MYURL. There's nothing in the LinkExtractor docs about changing case, nor can I see anything in the changelogs (but I may be missing that).
May or may not be intentional behaviour, but if it is intended, the docs should probably be updated to say the case will change.