BeautifulSoup is a Python class that takes HTML and returns a tree of objects. You can then search that tree to find HTML tags (which is pretty mega easy).

Here is a little example that downloads the msn.com web page, parses it using BS and grabs all the LINK tags that link to a CSS file:

from BeautifulSoup import BeautifulSoup
import urllib2

url = "http://www.msn.com"

# Fetch the page
request = urllib2.Request(url)
opener = urllib2.build_opener()
f = opener.open(request)

# Print the HTTP response headers
print '-'*5, 'URL Info:', '-'*5
print f.info()
print '-'*15

# Parse the HTML and grab every <link rel="stylesheet"> tag
html = f.read()
soup = BeautifulSoup(html)
css_files = soup.findAll('link', {'rel': 'stylesheet'})

for css in css_files:
    print str(css)

This outputs a lot of page header stuff, which I have removed here. The header stuff is interesting (f.info() gives you the HTTP response headers, so you see cookies, cache info, content type etc.), so I thought I'd keep it in my little example. Anyway, BS finds one CSS file:

<link type="text/css" rel="stylesheet" id="csslink" href="http://stc.msn.com/br/hp/en-us/css/48/blu.css" />

The magic method in BS that does the work is findAll(). It takes a tag name as the first parameter; the second parameter is a dictionary of attribute name and value pairs. So in the example above, we are looking for link tags that have rel="stylesheet".
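To make that matching rule a bit more concrete, here is a rough sketch of the same idea using only the standard library. It is not how BS does it internally. It assumes Python 3 and its html.parser module, and the names TagFinder, find_all and sample are made up for illustration. It only does exact matches on attribute values, which is all the example above needs.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Collects start tags by name and attribute values.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """

    def __init__(self, name, attrs):
        super().__init__()
        self.name = name        # tag name to look for, e.g. 'link'
        self.wanted = attrs     # attribute name/value pairs that must all match
        self.matches = []       # attribute dicts of every matching tag

    def handle_starttag(self, tag, attrs):
        # html.parser hands us the attributes as a list of (name, value) pairs
        found = dict(attrs)
        if tag == self.name and all(found.get(k) == v for k, v in self.wanted.items()):
            self.matches.append(found)


def find_all(html, name, attrs):
    """Return the attributes of every `name` tag whose attributes include `attrs`."""
    finder = TagFinder(name, attrs)
    finder.feed(html)
    finder.close()
    return finder.matches


# Made-up sample page so the sketch runs without a network connection
sample = """
<html><head>
<link rel="stylesheet" type="text/css" href="/css/main.css" />
<link rel="icon" href="/favicon.ico" />
<link rel="stylesheet" type="text/css" href="/css/print.css" />
</head><body><a rel="stylesheet" href="/not-a-link-tag">x</a></body></html>
"""

for link in find_all(sample, 'link', {'rel': 'stylesheet'}):
    print(link['href'])
```

Both the tag name and every attribute pair have to match, so the favicon link and the a tag are skipped and only the two stylesheet hrefs are printed.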

Running this script on myspace.com returns:

<link rel="stylesheet" type="text/css" href="http://x.myspace.com/modules/common/static/css/header002.css" />
<link rel="stylesheet" type="text/css" href="http://x.myspace.com/modules/common/static/css/global004.css" />
<link rel="stylesheet" type="text/css" href="http://x.myspace.com/modules/common/static/css/master.css" />
<link rel="stylesheet" type="text/css" href="http://x.myspace.com/modules/common/static/css/google-003.css" />
<link rel="stylesheet" type="text/css" href="http://x.myspace.com/modules/splash/static/css/splash003.css" />
<link rel="stylesheet" type="text/css" href="http://x.myspace.com/modules/common/static/css/abipromo.css" />
<link rel="stylesheet" type="text/css" href="http://x.myspace.com/js/css/SignUpHelpOverlay-001.css" />

Hope you like 🙂