canonicalize_url breaks certain url(s) #131

Open
markbaas opened this issue Mar 6, 2015 · 5 comments

markbaas commented Mar 6, 2015

The URL /cmp/Supermercados-Dia%25 is incorrectly unquoted into /cmp/Supermercados-Dia%

The problem happens in:
def _unquotepath(path):
    for reserved in ('2f', '2F', '3f', '3F'):
        path = path.replace('%' + reserved, '%25' + reserved.upper())
    return urllib.unquote(path)

@alisufian

This also happens with URLs containing "%26" (&).

@redapple
Contributor

From my unscientific tests,
with this page,

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<base href="  http://www.example.com/">
<title>No title</title>
</head>

<body>

<a href="/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F">"/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F", relative to base http://www.example.com/</a><br />
<a href="/%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,%2E,%2F">"/%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,%2E,%2F", relative to base http://www.example.com/</a><br />
<a href="/%30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F">"/%30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F", relative to base http://www.example.com/</a><br />
<a href="/%40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F">"/%40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F", relative to base http://www.example.com/</a><br />
<a href="/%50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F">"/%50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F", relative to base http://www.example.com/</a><br />
<a href="/%60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F">"/%60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F", relative to base http://www.example.com/</a><br />
<a href="/%70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F">"/%70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F", relative to base http://www.example.com/</a><br />

</body>

</html>

these are the URLs that my Chrome browser (Version 53.0.2785.113 (64-bit) on Ubuntu) fetches, as seen in the network tab:

http://www.example.com/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
http://www.example.com/%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,-,.,%2F
http://www.example.com/0,1,2,3,4,5,6,7,8,9,%3A,%3B,%3C,%3D,%3E,%3F
http://www.example.com/%40,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
http://www.example.com/P,Q,R,S,T,U,V,W,X,Y,Z,%5B,%5C,%5D,%5E,_
http://www.example.com/%60,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
http://www.example.com/p,q,r,s,t,u,v,w,x,y,z,%7B,%7C,%7D,~,%7F

redapple commented Sep 14, 2016

Summary for Chrome vs. canonicalize_url:

>>> from w3lib.url import canonicalize_url
>>> 
>>> chrome_normalized = '''%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
... %20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,-,.,%2F
... 0,1,2,3,4,5,6,7,8,9,%3A,%3B,%3C,%3D,%3E,%3F
... %40,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
... P,Q,R,S,T,U,V,W,X,Y,Z,%5B,%5C,%5D,%5E,_
... %60,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
... p,q,r,s,t,u,v,w,x,y,z,%7B,%7C,%7D,~,%7F'''
>>> 
>>> raw_in_html = '''%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
... %20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,%2E,%2F
... %30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F
... %40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F
... %50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F
... %60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F
... %70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F'''
>>> 
>>> raw_lines = raw_in_html.splitlines()
>>> norm_lines = chrome_normalized.splitlines()
>>> 
>>> for i, line in enumerate(raw_lines):
...     raw_chars = line.split(',')
...     norm_chars = norm_lines[i].split(',')
...     for pos, c in enumerate(raw_chars):
...         canonicalized = canonicalize_url(c)
...         if c == norm_chars[pos]:
...             if c != canonicalized:
...                 print('{0} was preserved by Chrome, but canonicalize_url("{0}") unquoted it to {1}'.format(c, canonicalized))
...         
... 
%21 was preserved by Chrome, but canonicalize_url("%21") unquoted it to !
%23 was preserved by Chrome, but canonicalize_url("%23") unquoted it to #
%24 was preserved by Chrome, but canonicalize_url("%24") unquoted it to $
%25 was preserved by Chrome, but canonicalize_url("%25") unquoted it to %
%26 was preserved by Chrome, but canonicalize_url("%26") unquoted it to &
%27 was preserved by Chrome, but canonicalize_url("%27") unquoted it to '
%28 was preserved by Chrome, but canonicalize_url("%28") unquoted it to (
%29 was preserved by Chrome, but canonicalize_url("%29") unquoted it to )
%2A was preserved by Chrome, but canonicalize_url("%2A") unquoted it to *
%2B was preserved by Chrome, but canonicalize_url("%2B") unquoted it to +
%2C was preserved by Chrome, but canonicalize_url("%2C") unquoted it to ,
%3A was preserved by Chrome, but canonicalize_url("%3A") unquoted it to :
%3B was preserved by Chrome, but canonicalize_url("%3B") unquoted it to ;
%3D was preserved by Chrome, but canonicalize_url("%3D") unquoted it to =
%40 was preserved by Chrome, but canonicalize_url("%40") unquoted it to @
%7C was preserved by Chrome, but canonicalize_url("%7C") unquoted it to |
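
In the Chrome data above, the only escapes that get decoded are those for letters, digits, `-`, `.`, `_` and `~`, which is the "unreserved" set of RFC 3986. As a rough sketch of what a more conservative unquoting step could look like, the hypothetical helper below (`unquote_unreserved` is a made-up name, not part of w3lib or Scrapy) decodes only those escapes and leaves everything else percent-encoded. It is an illustration under that assumption, not a proposed patch.

```python
import re
import string

# RFC 3986 section 2.3 "unreserved" characters
UNRESERVED = set(string.ascii_letters + string.digits + '-._~')

_PERCENT_ESCAPE = re.compile(r'%([0-9A-Fa-f]{2})')


def unquote_unreserved(path):
    """Decode only the percent-escapes that encode unreserved characters."""
    def _decode(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Keep the escape as it is, normalising hex digits to upper case
        return '%' + match.group(1).upper()
    return _PERCENT_ESCAPE.sub(_decode, path)


print(unquote_unreserved('/cmp/Supermercados-Dia%25'))  # /cmp/Supermercados-Dia%25
print(unquote_unreserved('%41,%7E,%26,%2f'))            # A,~,%26,%2F
```

Applied to the rows of `raw_in_html`, this helper gives the same strings as the rows of `chrome_normalized`. It does not match the Firefox output reported in this thread.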

redapple commented Sep 14, 2016

For Firefox (48.0 Mozilla Firefox for Ubuntu) it's a bit different:

"on the wire" as copied from the network panel:

http://www.example.com/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
http://www.example.com/%20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,.,%2F
http://www.example.com/%30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F
http://www.example.com/%40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F
http://www.example.com/%50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F
http://www.example.com/%60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F
http://www.example.com/%70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F

as displayed in the address bar:

www.example.com/%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
www.example.com/ ,!,",%23,%24,%25,%26,',(,),*,%2B,%2C,-,.,%2F
www.example.com/0,1,2,3,4,5,6,7,8,9,%3A,%3B,<,%3D,>,%3F
www.example.com/%40,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
www.example.com/P,Q,R,S,T,U,V,W,X,Y,Z,[,\,],^,_
www.example.com/`,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
www.example.com/p,q,r,s,t,u,v,w,x,y,z,{,|,},~,%7F

Summary using the URL bar data as output:

>>> from w3lib.url import canonicalize_url
>>> 
>>> raw_in_html = '''%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
... %20,%21,%22,%23,%24,%25,%26,%27,%28,%29,%2A,%2B,%2C,%2D,%2E,%2F
... %30,%31,%32,%33,%34,%35,%36,%37,%38,%39,%3A,%3B,%3C,%3D,%3E,%3F
... %40,%41,%42,%43,%44,%45,%46,%47,%48,%49,%4A,%4B,%4C,%4D,%4E,%4F
... %50,%51,%52,%53,%54,%55,%56,%57,%58,%59,%5A,%5B,%5C,%5D,%5E,%5F
... %60,%61,%62,%63,%64,%65,%66,%67,%68,%69,%6A,%6B,%6C,%6D,%6E,%6F
... %70,%71,%72,%73,%74,%75,%76,%77,%78,%79,%7A,%7B,%7C,%7D,%7E,%7F'''
>>> 
>>> firefox_normalized = '''%10,%11,%12,%13,%14,%15,%16,%17,%18,%19,%1A,%1B,%1C,%1D,%1E,%1F
...  ,!,",%23,%24,%25,%26,',(,),*,%2B,%2C,-,.,%2F
... 0,1,2,3,4,5,6,7,8,9,%3A,%3B,<,%3D,>,%3F
... %40,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
... P,Q,R,S,T,U,V,W,X,Y,Z,[,\,],^,_
... `,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o
... p,q,r,s,t,u,v,w,x,y,z,{,|,},~,%7F'''
>>> 
>>> raw_lines = raw_in_html.splitlines()
>>> norm_lines = firefox_normalized.splitlines()
>>> 
>>> for i, line in enumerate(raw_lines):
...     raw_chars = line.split(',')
...     norm_chars = norm_lines[i].split(',')
...     for pos, c in enumerate(raw_chars):
...         canonicalized = canonicalize_url(c)
...         if c == norm_chars[pos]:
...             if c != canonicalized:
...                 print('{0} was preserved by Firefox, but canonicalize_url("{0}") unquoted it to {1}'.format(c, canonicalized))
...         
... 
%23 was preserved by Firefox, but canonicalize_url("%23") unquoted it to #
%24 was preserved by Firefox, but canonicalize_url("%24") unquoted it to $
%25 was preserved by Firefox, but canonicalize_url("%25") unquoted it to %
%26 was preserved by Firefox, but canonicalize_url("%26") unquoted it to &
%2B was preserved by Firefox, but canonicalize_url("%2B") unquoted it to +
%2C was preserved by Firefox, but canonicalize_url("%2C") unquoted it to ,
%3A was preserved by Firefox, but canonicalize_url("%3A") unquoted it to :
%3B was preserved by Firefox, but canonicalize_url("%3B") unquoted it to ;
%3D was preserved by Firefox, but canonicalize_url("%3D") unquoted it to =
%40 was preserved by Firefox, but canonicalize_url("%40") unquoted it to @
>>> 

@redapple
Contributor

This needs to be moved to w3lib, the new home of canonicalize_url.

Gallaecio transferred this issue from scrapy/scrapy on Jul 8, 2019
Gallaecio added the bug label on Jul 9, 2019