LinkExtractor changing case of URL (but didn't used to) #6329

Open
mohmad-null opened this issue May 3, 2024 · 3 comments

@mohmad-null

Regression?
I have an HTML file that contains a link like:

<a target="_blank" href="http://MYURL/SomePath/services/words/MorePath?abc">Words</a>

I'm extracting with code that looks like this:

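	from scrapy.linkextractors import LinkExtractor

	# xpath and response are defined elsewhere in the spider
	link_extractor = LinkExtractor(restrict_xpaths=xpath)

	tmp_links = link_extractor.extract_links(response)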

But my URL comes back as:
http://myurl/SomePath/services/words/MorePath?abc

Note that MYURL has become myurl.
I've just upgraded from Scrapy 1.7.x to 2.11.1. In 1.7 and earlier, the URL came out as MYURL. There's nothing in the LinkExtractor docs about changing case, and I can't see anything in the changelogs (though I may have missed it).

This may or may not be intentional behaviour, but if it is intended, the docs should probably be updated to say that the case will change.
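
To pin down exactly what changed between the two URLs above, a small standard-library check can help. This is only a sketch: the helper name `differs_only_in_host_case` is hypothetical, and it uses `urllib.parse` rather than anything from Scrapy, so it describes the symptom and says nothing about which component causes it.

```python
from urllib.parse import urlsplit


def differs_only_in_host_case(a: str, b: str) -> bool:
    """Return True if the two URLs are identical except for the case of the netloc.

    Hypothetical helper for illustration. Note that urlsplit() lowercases
    the scheme, so scheme case differences are ignored here.
    """
    pa, pb = urlsplit(a), urlsplit(b)
    same_rest = (pa.scheme, pa.path, pa.query, pa.fragment) == (
        pb.scheme,
        pb.path,
        pb.query,
        pb.fragment,
    )
    # netloc keeps the case as written, so it can be compared directly.
    return same_rest and pa.netloc != pb.netloc and pa.netloc.lower() == pb.netloc.lower()


original = "http://MYURL/SomePath/services/words/MorePath?abc"   # href in the HTML
extracted = "http://myurl/SomePath/services/words/MorePath?abc"  # URL returned by extract_links

print(differs_only_in_host_case(original, extracted))
```

For the pair reported here, the path and query keep their original case and only the host part differs.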

@mohmad-null mohmad-null changed the title LinkExctractor changing case of URL (but didn't used to) LinkExtractor changing case of URL (but didn't used to) May 3, 2024
@kumar-sanchay
Contributor

kumar-sanchay commented May 4, 2024

On it.
There may be URLs, or parts of URLs, where case doesn't matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive.

@kumar-sanchay
Contributor

After investigating, I found that the behaviour above is due to the use of canonicalize_url. This is an important function that helps with finding duplicates, among other things. We can definitely document this to help users.
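
As a rough illustration of why canonicalization helps with duplicate detection, here is a minimal standard-library sketch. It is not canonicalize_url and does not reproduce its rules: the helper name `dedup_key` is hypothetical, and lowercasing only the host while dropping the fragment is an assumption made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def dedup_key(url: str) -> str:
    """Build a comparison key that ignores host case (hypothetical helper).

    Only the host (and port) is lowercased. Userinfo, path, and query are
    left untouched because they may be case-sensitive.
    """
    parts = urlsplit(url)  # urlsplit() already lowercases the scheme
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = f"{userinfo}{at}{hostport.lower()}"
    # Assumption for this sketch: the fragment is dropped from the key.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, ""))


seen = set()
for url in (
    "http://MYURL/SomePath/services/words/MorePath?abc",
    "http://myurl/SomePath/services/words/MorePath?abc",
):
    seen.add(dedup_key(url))

print(len(seen))
```

Both spellings of the host collapse to one key, while a URL whose path differs in case would still be treated as a separate page.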

@Gallaecio
Member

There is a canonicalize parameter that is False by default, so I’m not so sure this is about canonicalize_url. Maybe it is lxml’s behavior? It may be worth looking into, and adding a note about it to the reference docs for the canonicalize parameter.

@Gallaecio Gallaecio added the docs label May 6, 2024