By Jake Edge
June 27, 2012
Parsing HTML can be a surprisingly complicated task. Even constructs that
look fairly trivial tend to have a bunch of "fiddly bits" that
one needs to account for. For most of those kinds of tasks, then, one generally
turns to a full-blown HTML parser. While Python includes one in its
standard library, it can be somewhat painful to use. Some recent
experiments with Beautiful Soup, in
particular version 4 released earlier this year, have shown a
parser that is well-designed and easy to use.
Beautiful Soup is available
as a tarball that can be installed in the usual Python way (using
setup.py). It can also be installed
using pip or easy_install from the PyPI repository. It
is also packaged as python-bs4 for Debian and Ubuntu;
packages for other distributions will presumably be coming along as well. It
supports both Python 2.7 and Python 3, with few external dependencies.
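Beautiful Soup does not parse HTML itself; it builds its tree on top of a
parser. It can use the one in the Python standard library, but will also
work with lxml (which is faster) or html5lib (which is slower, but parses
pages the way a browser does) if they are installed.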
Getting started with Beautiful Soup is pretty straightforward:
from bs4 import BeautifulSoup
soup = BeautifulSoup(string_or_filehandle)
From that point on, the soup object is used to query and manipulate
the HTML contained in the argument. The input data is converted
to Unicode, parsed, and cleaned up so that it is valid. At that
point, simply outputting the object (e.g. print(soup)) will produce
the cleaned-up HTML. soup.prettify() will do even more, indenting
and matching up tags to create pretty output.
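As a small sketch of the difference between the two kinds of output, the
snippet below prints the same soup both ways. The sample markup is made up
for illustration, and the second argument names the standard library parser
explicitly so that the result does not depend on which optional parsers
happen to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup; 'html.parser' pins the standard library parser.
markup = '<p>foo <b>bar</b></p>'
soup = BeautifulSoup(markup, 'html.parser')

compact = str(soup)       # the HTML as print(soup) would show it
pretty = soup.prettify()  # one tag or string per line, indented by depth

print(compact)
print(pretty)
```

The compact form is what one would normally write back out; the prettified
form is mostly useful for reading and debugging.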
But there's lots more available than simply cleaner HTML. Beautiful Soup can
also take HTML fragments; with parsers that do so (lxml and html5lib, but not
the standard library parser), those fragments are transformed into full HTML
documents once parsed. Beautiful Soup
breaks the HTML down into objects which correspond to the tags contained in
the input. For example:
print(soup.head)
print(soup.body)
will print the <head> and <body> sections.
For an HTML fragment, soup.body will contain the parsed contents
of that fragment, and each piece in the HTML can be accessed via the
.children iterator or the .contents list. So:
html_frag = '<p>foo bar</p><p>baz<p>yet another graf</p>'
soup = BeautifulSoup(html_frag)
print(soup.body.contents)
for c in soup.body.children:
    print(c)
# output
[<p>foo bar</p>, <p>baz</p>, <p>yet another graf</p>]
<p>foo bar</p>
<p>baz</p>
<p>yet another graf</p>
Note that the unclosed paragraph tag was fixed.
Each tag in the HTML gets turned into a Tag object by Beautiful Soup. Tag
objects have attributes like .name, .children, .parent,
and so on, which can be used to distinguish various tags and to navigate
the tree. In addition, the HTML attributes of a particular tag are
available by accessing the tag as a dictionary:
soup = BeautifulSoup('<b class="boldest">foo</b> bar <b class="bolder">baz</b>')
print(soup.b['class'])
# output
['boldest']
The first tag of a given type can be referred to using the dot notation, so
soup.b is the first "b" tag in the object. One can access the
other tags by navigating in the HTML tree or by searching.
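To make that concrete, here is a minimal sketch that reaches the second "b"
tag both ways, by sibling navigation and by searching. It assumes the
standard library parser (named explicitly), which leaves the fragment as it
is rather than wrapping it in <html> and <body> tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">foo</b> bar <b class="bolder">baz</b>',
                     'html.parser')

first = soup.b                             # dot notation: the first <b> only
by_sibling = first.find_next_sibling('b')  # navigate: next <b> at the same level
by_search = soup.find_all('b')[1]          # search: second <b> in document order

print(first.name, first['class'])
print(by_sibling['class'], by_search['class'])
```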
But wait, there's more:
soup.b['class'] = 'waylessbold'
print(soup.body)
# output
<body><b class="waylessbold">foo</b> bar <b class="bolder">baz</b></body>
So, changing an attribute is reflected in the output of the soup object. The
same dictionary-style access can be used to add new attributes
(tag['newattr']='foo') or to remove existing attributes (del tag['class']).
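A short sketch of both operations on a single tag follows; the "id"
attribute and its value are made up for illustration, and the standard
library parser is named explicitly so that the output is just the fragment.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">foo</b>', 'html.parser')
tag = soup.b

tag['id'] = 'intro'  # add an attribute that was not in the input
del tag['class']     # remove one that was

result = str(tag)
print(result)
```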
For many transformation tasks, though, one might rather not step through
the whole HTML tree, and would, instead, want to search for tags of
interest. Beautiful Soup has some powerful capabilities in that area too.
Using the above example:
for b in soup.find_all('b'):
    b['class'] = 'justbold'
would change the class of all "b" tags in the object to "justbold"
(adding a "class" attribute to any that don't have it).
More complicated things can be done using regular expressions as well:
import re
for a in soup.find_all('a', href=re.compile(r'^/')):
    a['href'] = 'http://lwn.net%s' % a['href']
would turn site-relative links (those starting with "/") into full URLs, for
example. Using keyword arguments (like href above) will search the HTML
attributes of the
tag based on a string or regular expression. There are also ways to search
based on the presence of an attribute (href=True), to limit the number
of results returned, or to change the default recursive searching so that
only direct children are searched.
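Those three variations might look something like the following sketch. The
markup is invented for the example; limit and recursive are keyword
arguments to find_all(), and the standard library parser is named
explicitly.

```python
from bs4 import BeautifulSoup

# Made-up markup: two links with href, one anchor without, one link inside <p>.
html = ('<div><a href="/a">A</a><a name="x">B</a>'
        '<p><a href="http://example.com/">C</a></p></div>')
soup = BeautifulSoup(html, 'html.parser')

with_href = soup.find_all('a', href=True)         # attribute present, any value
first_only = soup.find_all('a', limit=1)          # stop after the first match
direct = soup.div.find_all('a', recursive=False)  # direct children of <div> only

print([str(a.string) for a in with_href])
print([str(a.string) for a in first_only])
print([str(a.string) for a in direct])
```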
One can also create
a dictionary with just the attributes needed (or wanted) on a particular
tag type and simply assign that dictionary to each tag's .attrs attribute:
for img in soup.find_all('img'):
    idict = {'height': 22, 'width': 42, 'src': img['src']}
    img.attrs = idict
Something like that might be useful to remove HTML attributes like align= that
aren't acceptable in some forms of HTML (e.g. EPUB).
One can also create and insert entirely new tags into the soup object (and
thus the HTML). A tag can be created with new_tag() then
inserted before or after any other tag:
itag = soup.new_tag('i')
itag.string = 'italicized'
soup.b.insert_before(itag)
print(soup.body)
# output
<body><i>italicized</i><b class="justbold">foo</b> bar <b class="justbold">baz</b></body>
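Insertion after a tag works the same way. In this sketch, which again names
the standard library parser explicitly, the link target and text are
placeholders chosen for the example; keyword arguments passed to new_tag()
become attributes of the new tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>foo</b>', 'html.parser')

# Keyword arguments to new_tag() become attributes of the new tag.
link = soup.new_tag('a', href='http://lwn.net/')
link.string = 'LWN'
soup.b.insert_after(link)  # place the new tag right after the <b> tag

result = str(soup)
print(result)
```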
There is one seeming oddity with Beautiful Soup in that there is another
type of object called a NavigableString. Any string in the input (even those
that are the child of a Tag, e.g. the "foo" in <b>foo</b>) will be
represented as a NavigableString.
These can be encountered while moving around in the HTML tree:
soup = BeautifulSoup('<i>italicized</i><b class="justbold">foo</b> bar <b class="justbold">baz</b>')
for t in soup.body.children:
    print(t.name)
# output
i
b
Traceback (most recent call last):
...
AttributeError: 'NavigableString' object has no attribute 'name'
The unadorned "bar" in the fragment above is clearly not a tag, but because
it is represented by a different kind of object, one without the "standard"
.name attribute, it is a bit of a pain to deal with. The problem occurs
mostly in experimentation and debugging, but one needs to do something like:
from bs4 import NavigableString
if isinstance(child, NavigableString):
    ...
to detect it, which is somewhat annoying. Perhaps I don't understand the
ramifications of synthesizing a Tag object to hold those strings
(or at least providing things like .name on those objects),
but at first glance it seems like it would be a better way to
handle them.
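One way to live with the two object types, sketched here on the assumption
that the tags and the bare strings need to be handled separately, is to
test for Tag rather than for NavigableString. The fragment is parsed with
the standard library parser, so its pieces are direct children of the soup
object rather than of soup.body.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup('<i>italicized</i><b class="justbold">foo</b> bar '
                     '<b class="justbold">baz</b>', 'html.parser')

tag_names = []
strings = []
for child in soup.children:
    if isinstance(child, Tag):
        tag_names.append(child.name)  # safe: every Tag has a .name
    else:
        strings.append(str(child))    # bare text between the tags

print(tag_names)
print(strings)
```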
Overall, Beautiful Soup is rather impressive. There is quite a bit more to it
than this brief overview shows, but the documentation
is excellent and provides lots of examples. If you have some need to parse
or transform HTML in Python, Beautiful Soup 4 is surely worth a look.
Brief items
I was asked a few weeks ago, "What was the biggest surprise you
encountered rolling out Go?" I knew the answer instantly: Although
we expected C++ programmers to see Go as an alternative, instead
most Go programmers come from languages like Python and Ruby. Very
few come from C++.
—
Rob Pike
It's fair enough to say that I wouldn't be a programmer today if it weren't for an interest in game programming, and that is true of several of my friends as well. But if that is true, why then do we have so few finished and polished free software games? Answering that question actually deserves of a post of its own (and indeed, solving that riddle is a good portion of the motive behind Liberated Pixel Cup), but it's enough to say for now that we are missing opportunities of encouraging future hackers by not making free software a welcoming playground for game development.
—
Chris Webber
Enlightenment developer Bruno Dilly
announced that EFL has merged in EPhysics, a wrapper for the
Bullet Physics library, making it "pretty simple for an EFL programmer and we expect them to adopt EPhysics to create their next splash screen, transition effects and even more games." The library allows Evas objects to "have physical attributes such as mass, friction and restitution and shape. They may receive impulses and collide between them."
The first release candidate for the KDE 4.9 desktop environment has landed.
"With API, dependency and feature
freezes in place, the KDE team's focus is now on fixing bugs and further
polishing new and old functionality." Highlights include support for Qt Quick in Plasma, deeper integration of the "Activities" scheme for organizing workspaces, and improved metadata sorting-and-searching within the Dolphin file manager.
Version 1.2.99.1 of the
SyncEvolution PIM-synchronization framework has been released. This is the first pre-release version of the upcoming 1.3 series, and it brings several new features, including KDE/Akonadi support, ActiveSync support, and rewritten D-Bus and CalDAV components. Despite the pre-release status, upgrading is still recommended for several reasons: "for example,
SyncEvolution 1.3 is required for Evolution 3.4, otherwise photos are
not exported properly. Further workarounds for recent changes in
Google CalDAV were added."
Newsletters and articles
Ars Technica has posted a review of the Android version of Firefox.
"One of the key
features of Firefox for Android is its support for Mozilla's
synchronization service. It works seamlessly with the desktop version of
the browser, allowing the user to access their bookmarks and other browser
data. This capability works as expected and will likely be a major draw for
existing Firefox users."
Jasper St. Pierre has posted a lengthy
overview of the Linux graphics stack. It's a good starting point for
anybody who is not clear on what all those acronyms mean. "The X
server needs to know what’s happening here, though, so it can do things
like synchronization. This synchronization between your glxgears, the
kernel, and the X server is called DRI, or more accurately, DRI2. 'DRI'
stands for 'Direct Rendering Infrastructure', but it’s sort of a strange
acronym. 'DRI' refers to both the project that glued mesa and Xorg together
(introducing DRM and a bunch of the things I talk about in this article),
as well as the DRI protocol and library. DRI 1 wasn’t really that good, so
we threw it out and replaced it with DRI 2."
Page editor: Nathan Willis