By Jake Edge
June 27, 2012
Parsing HTML is sometimes a surprisingly complicated task. Even constructs that
look fairly trivial tend to have a bunch of "fiddly bits" that
one needs to account for. For most of those kinds of tasks, then, one generally
turns to a full-blown HTML parser. While Python includes one in its
standard library, it can be somewhat painful to use. Some recent
experiments with Beautiful Soup, in
particular version 4 released earlier this year, have shown a
parser that is well-designed and easy to use.
Beautiful Soup is available
as a tarball that can be installed in the usual Python way (using
setup.py). It can also be installed
using pip or easy_install from the PyPI repository. It
is also packaged as python-bs4 for Debian and Ubuntu;
packages
for other
distributions will presumably be coming along as well. It
supports both Python 2.7 and Python 3, with few external dependencies.
Beautiful Soup can use the parser in the Python standard library, but it can
also use third-party parsers if they are installed: lxml, which is generally
faster, or html5lib, which is slower but more lenient.
Getting started with Beautiful Soup is pretty straightforward:
from bs4 import BeautifulSoup
soup = BeautifulSoup(string_or_filehandle)
From that point on, the
soup object is used to query and manipulate
the HTML contained in the argument. The input data is converted
to Unicode, parsed, and cleaned up so that it is valid. At that
point, simply outputting the object (e.g.
print soup) will produce
the cleaned-up HTML.
soup.prettify() will do even more, indenting
and matching up tags to create pretty output.
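As a rough illustration of the difference between the two outputs, here is a minimal sketch. It assumes Python 3 (so print() is a function, unlike the Python 2 snippets in this article) and names the standard library parser explicitly; the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; "html.parser" selects the standard library parser.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

compact = str(soup)        # the document as a single string
pretty = soup.prettify()   # one tag or string per line, indented by depth

print(compact)
print(pretty)
```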
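But, there's lots more available than simply cleaner HTML. It can also take
HTML fragments, which will be transformed into full HTML documents once
parsed (at least when lxml or html5lib is doing the parsing; the standard
library parser leaves a fragment without the enclosing tags). Beautiful Soup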
breaks the HTML down into objects which correspond to the tags contained in
the input. For example:
print soup.head
print soup.body
will print the
<head> and
<body> sections.
For an HTML fragment,
soup.body will contain the parsed contents
of that fragment, and each
piece in the HTML can be accessed via the
.children iterator or
the
.contents list. So:
html_frag = '<p>foo bar</p><p>baz<p>yet another graf</p>'
soup = BeautifulSoup(html_frag)
print soup.body.contents
for c in soup.body.children:
    print c
# output
[<p>foo bar</p>, <p>baz</p>, <p>yet another graf</p>]
<p>foo bar</p>
<p>baz</p>
<p>yet another graf</p>
Note that the unclosed paragraph tag was fixed.
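Exactly how a fragment gets repaired depends on the parser that Beautiful Soup ends up using. The following is a minimal sketch, assuming Python 3, that asks for the standard library parser by name; passing 'lxml' or 'html5lib' instead (if installed) should give a result wrapped in <html> and <body> tags, as in the output above.

```python
from bs4 import BeautifulSoup

html_frag = '<p>foo bar</p><p>baz<p>yet another graf</p>'

# Naming the parser makes the result independent of what else is installed.
soup = BeautifulSoup(html_frag, 'html.parser')

# The standard library parser does not wrap a fragment in <html> or <body>,
# so soup.body is None here and the paragraphs are reached from soup itself.
print(soup.body)
paragraphs = soup.find_all('p')
print(len(paragraphs))
print(soup)
```

Naming the parser in scripts that will run on more than one machine avoids surprises when the set of installed parsers differs.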
Each tag in the HTML gets turned into a Tag object by Beautiful Soup. Tag
objects
have attributes like .name, .children, .parent,
and so on, which can be used to distinguish various tags and to navigate
the tree. In addition, the HTML attributes of a particular tag are
available by accessing the tag as a dictionary:
soup = BeautifulSoup('<b class="boldest">foo</b> bar <b class="bolder">baz</b>')
print soup.b['class']
# output
['boldest']
The first tag of a given type can be referred to using the dot notation, so
soup.b is the first "b" tag in the object. One can access the
other tags by navigating in the HTML tree or by searching.
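To make "navigating" and "searching" a little more concrete, here is a small sketch that reaches the second "b" tag both ways. It assumes Python 3 and the standard library parser; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">foo</b> bar <b class="bolder">baz</b>',
                     'html.parser')

first = soup.b                           # dot notation: the first <b> only
second = first.find_next_sibling('b')    # navigate sideways to the next <b>
all_bold = soup.find_all('b')            # or search for every <b>

print(second['class'])
print([str(b.string) for b in all_bold])
print(first.get('id'))   # .get() returns None for a missing attribute
```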
But wait, there's more:
soup.b['class'] = 'waylessbold'
print soup.body
# output
<body><b class="waylessbold">foo</b> bar <b class="bolder">baz</b></body>
So, changing an attribute is reflected in the output of the soup object. The same
dictionary-style access can be used to add new attributes (
tag['newattr']='foo') or to
remove existing attributes (
del tag['class']).
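Adding and deleting attributes might look like the following sketch, which assumes Python 3 and the standard library parser; the 'id' value is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">foo</b>', 'html.parser')
tag = soup.b

tag['id'] = 'first'   # add an attribute that was not there (made-up value)
del tag['class']      # remove one that was

print(tag.attrs)
print(tag)
```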
For many transformation tasks, though, one might rather not step through
the whole HTML tree, and would, instead, want to search for tags of
interest. Beautiful Soup has some powerful capabilities in that area too.
Using the above example:
for b in soup.find_all('b'):
    b['class'] = 'justbold'
would change the class of all "b" tags in the object to "justbold"
(adding a "class" attribute to any that don't have it).
More complicated things can be done using regular expressions as well:
import re
for a in soup.find_all('a', href=re.compile(r'^/')):
    a['href'] = 'http://lwn.net%s' % a['href']
would turn relative links into full URLs for example. Using keyword
arguments (like
href above) will search the HTML attributes of the
tag based on a string or regular expression. There are also ways to search
based on the presence of an attribute (
href=True), to limit the number
of results returned, or to change the default recursive searching so that
only direct children are searched.
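Those three search options can be sketched as follows, assuming Python 3 and the standard library parser. The markup is made up: three links plus one named anchor that has no href.

```python
from bs4 import BeautifulSoup

# Made-up markup: three links, one named anchor without an href.
html = ('<div><a href="/a">A</a><a name="x">anchor</a>'
        '<p><a href="/b">B</a></p><a href="/c">C</a></div>')
soup = BeautifulSoup(html, 'html.parser')

with_href = soup.find_all('a', href=True)            # attribute must be present
first_two = soup.find_all('a', href=True, limit=2)   # stop after two matches
direct = soup.div.find_all('a', recursive=False)     # direct children of <div> only

print([a['href'] for a in with_href])
print([a['href'] for a in first_two])
print([str(a.string) for a in direct])
```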
One can also create
a dictionary with just the attributes needed (or wanted) on a particular
tag type and simply assign it to each of those tags using the .attrs attribute:
for img in soup.find_all('img'):
    idict = { 'height' : 22, 'width' : 42, 'src' : img['src'] }
    img.attrs = idict
Something like that might be useful to remove HTML attributes like
align= that
aren't acceptable in some forms of HTML (e.g. EPUB).
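A variation on the same idea is to filter each tag's existing attributes against a whitelist rather than building a fixed dictionary. This is only a sketch, assuming Python 3 and the standard library parser; the ALLOWED set and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

ALLOWED = {'src', 'alt', 'height', 'width'}   # hypothetical whitelist

soup = BeautifulSoup('<img src="a.png" align="left" border="0">'
                     '<img alt="no source">', 'html.parser')

for img in soup.find_all('img'):
    # Keep only whitelisted attributes; nothing is looked up by key, so a
    # missing src does not raise KeyError.
    img.attrs = {k: v for k, v in img.attrs.items() if k in ALLOWED}

cleaned = [img.attrs for img in soup.find_all('img')]
print(cleaned)
```

Unlike img['src'], which raises KeyError when a tag has no src attribute, the filter simply keeps whatever allowed attributes are present.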
One can also create and insert entirely new tags into the soup object (and
thus the HTML). A tag can be created with new_tag() then
inserted before or after any other tag:
itag = soup.new_tag('i')
itag.string = 'italicized'
soup.b.insert_before(itag)
print soup.body
# output
<body><i>italicized</i><b class="justbold">foo</b> bar <b class="justbold">baz</b></body>
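Attributes can be given when the tag is created, and insert_after() works the same way as insert_before(). A minimal sketch, assuming Python 3 and the standard library parser, with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>See <b>LWN</b></p>', 'html.parser')

# Keyword arguments to new_tag() become HTML attributes on the new tag.
link = soup.new_tag('a', href='http://lwn.net/')
link.string = 'link'

soup.b.insert_after(link)   # place the new tag right after the first <b>
result = str(soup)
print(result)
```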
There is one seeming oddity with Beautiful Soup in that there is another
type of object called a NavigableString.
Any string in the input (even those that are the
child of a Tag, e.g. the "foo" in <b>foo</b>) will be represented as a NavigableString.
These can be encountered while moving around in the HTML tree:
soup = BeautifulSoup('<i>italicized</i><b class="justbold">foo</b> bar <b class="justbold">baz</b>')
for t in soup.body.children:
    print t.name
# output
i
b
Traceback (most recent call last):
...
AttributeError: 'NavigableString' object has no attribute 'name'
The unadorned "bar" in the fragment above is clearly not a tag, but because it
is a different kind of object, without the "standard"
.name attribute, it
is a bit of a pain to deal with. The problem occurs mostly in
experimentation and debugging, but one needs to do something like:
from bs4 import NavigableString
if isinstance(child, NavigableString):
    ...
to detect it, which is somewhat annoying. Perhaps I don't understand the
ramifications of synthesizing a Tag object to hold those strings
(or at least providing things like
.name on those objects),
but at first glance it seems like it would be a better way to
handle them.
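One way to paper over the difference while walking the tree is a small helper that falls back to None when there is no .name. The sketch below assumes Python 3 and the standard library parser; node_name() is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup('<i>italicized</i><b class="justbold">foo</b> bar '
                     '<b class="justbold">baz</b>', 'html.parser')

def node_name(node):
    # Hypothetical helper: the tag name for tags, None for bare strings.
    return getattr(node, 'name', None)

names = [node_name(child) for child in soup.children]
strings = [str(child) for child in soup.children
           if isinstance(child, NavigableString)]
print(names)
print(strings)
```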
Overall, Beautiful Soup is rather impressive. There is quite a bit more to it
than this brief overview shows, but the documentation
is excellent and provides lots of examples. If you have some need to parse
or transform HTML in Python, Beautiful Soup 4 is surely worth a look.