I had a problem with some URLs that we found on someone's website. They looked like this:
<a href="http://example.com/../../../path/page.html">. Here is the same link: test. Notice that when you mouse over it, Firefox normalizes the URL so it looks correct.
Using urlparse in Python, the dot segments come through untouched:
'http://site.com/path/../path/.././path/./'
How poopy.
So, we need to do better than that. The os module has a path normalizer in it (os.path.normpath), but this would go squiffy when run on a Windows box because it will translate all the / to \. But there is a posixpath.py module, which always uses POSIX rules, that we can use:
import posixpath
import urlparse

def join(base, url):
    # resolve url against base, then clean up the path with POSIX rules
    joined = urlparse.urljoin(base, url)
    parts = urlparse.urlparse(joined)
    path = posixpath.normpath(parts.path)
    return urlparse.urlunparse(
        (parts.scheme, parts.netloc, path, parts.params, parts.query, parts.fragment)
    )

# strange paths
print join('http://site.com/', '/path/../path/.././path/./')
print join('http://site.com/path/x.html', '/path/../path/.././path/./y.html')
# paths that .. up too far
print join('http://site.com/', '../../../../path/')
print join('http://site.com/x/x.html', '../../../../path/moo.html')
# how path and base path combine
print join('http://site.com/99/x.html', '1/2/3/moo.html')
print join('http://site.com/99/x.html', '../1/2/3/moo.html')
This prints:
http://site.com/path
http://site.com/path/y.html
http://site.com/path
http://site.com/path/moo.html
http://site.com/99/1/2/3/moo.html
http://site.com/1/2/3/moo.html
Cool, huh?
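For anyone on Python 3, here is a rough sketch of the same trick. It assumes the functions now live in urllib.parse (the old urlparse module is gone there), and the guard for an empty path is my own addition rather than part of the code above, because posixpath.normpath('') returns '.'. The two normpath calls at the bottom show the slash problem directly: ntpath is the module os.path points at on Windows, and it can be imported on any platform.

```python
import ntpath
import posixpath
from urllib.parse import urljoin, urlparse, urlunparse


def join(base, url):
    """Resolve url against base, then normalize the path with POSIX rules."""
    parts = urlparse(urljoin(base, url))
    path = parts.path
    if path:
        # Collapse "." and ".." segments; posixpath always uses "/",
        # whatever operating system the script runs on.
        # Note that normpath also drops a trailing slash.
        path = posixpath.normpath(path)
    return urlunparse(parts._replace(path=path))


# Windows path rules swap the slashes, POSIX rules keep them:
print(ntpath.normpath("/path/../path/./"))     # \path
print(posixpath.normpath("/path/../path/./"))  # /path

print(join("http://site.com/", "/path/../path/.././path/./"))
# http://site.com/path
print(join("http://site.com/99/x.html", "../1/2/3/moo.html"))
# http://site.com/1/2/3/moo.html
```

Only the path is touched; the query string and fragment pass through as they were.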