Normalize URL path python – Teeth Grinder UK

I had a problem with some URLs that we found on someone's website. They looked like this:

<a href="http://example.com/../../../path/page.html">

Notice that when you mouse over a link like this, Firefox normalizes the URL so it looks correct.

Using urlparse in Python 2:

print urlparse.urljoin('http://site.com/', '/path/../path/.././path/./')
http://site.com/path/../path/.././path/./

How poopy.

So, we need to do better than that. The os module has a path normalizer in it (os.path.normpath), but this would go squiffy when run on a Windows box because it will translate all the / to \. But there is a posixpath module, which always uses POSIX rules and which we can use:
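To see the difference without needing a Windows box, here is a small sketch. It is written for Python 3 (an assumption, the rest of this post is Python 2) and uses ntpath, the module that os.path points to on Windows; the variable names are made up for the illustration.

```python
import ntpath      # the Windows flavour of os.path, importable on any platform
import posixpath   # the POSIX flavour of os.path

# A messy URL path of the kind shown above (illustrative value).
messy = "/path/../path/.././path/./page.html"

# What os.path.normpath would do on Windows: separators become backslashes.
windows_style = ntpath.normpath(messy)

# What posixpath.normpath does everywhere: forward slashes are kept.
posix_style = posixpath.normpath(messy)

print(windows_style)
print(posix_style)
```

Both calls collapse the . and .. segments, but only posixpath leaves the forward slashes alone, which is what a URL path needs.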

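import urlparse
import posixpath

def join(base, url):
    # Resolve url against base, then split the result into its components.
    joined = urlparse.urljoin(base, url)
    parts = urlparse.urlparse(joined)
    # Collapse "." and ".." segments using POSIX rules, whatever the host OS.
    path = posixpath.normpath(parts.path)
    return urlparse.urlunparse(
        (parts.scheme, parts.netloc, path, parts.params, parts.query, parts.fragment)
    )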

# strange paths
print join('http://site.com/', '/path/../path/.././path/./')
print join('http://site.com/path/x.html', '/path/../path/.././path/./y.html')

# paths that .. up too far
print join('http://site.com/', '../../../../path/')
print join('http://site.com/x/x.html', '../../../../path/moo.html')

# how path and base path combine
print join('http://site.com/99/x.html', '1/2/3/moo.html')
print join('http://site.com/99/x.html', '../1/2/3/moo.html')

This prints:

http://site.com/path
http://site.com/path/y.html
http://site.com/path
http://site.com/path/moo.html
http://site.com/99/1/2/3/moo.html
http://site.com/1/2/3/moo.html
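
One thing to notice in that output: the first and third results have lost the trailing slash that the input had, because posixpath.normpath strips it. If that matters to you, something like the following might do. It is only a sketch, written for Python 3 (where urlparse lives in urllib.parse), and join_keep_slash is a hypothetical name, not part of any library.

```python
import posixpath
from urllib.parse import urljoin, urlparse, urlunparse

def join_keep_slash(base, url):
    """Join url onto base, normalize the path, keep a trailing slash."""
    parts = urlparse(urljoin(base, url))
    # normpath("") returns ".", so treat an empty path as the root instead.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath drops a trailing slash; restore it if the joined path had one.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    # ParseResult is a namedtuple, so _replace swaps in the cleaned path.
    return urlunparse(parts._replace(path=path))

print(join_keep_slash("http://site.com/", "/path/../path/.././path/./"))
print(join_keep_slash("http://site.com/x/x.html", "../../../../path/moo.html"))
```

Directory-style URLs keep their final slash this way, while file-style URLs such as moo.html come out the same as before.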

Cool, huh?