Development

Making some Beautiful Soup

By Jake Edge
June 27, 2012

Parsing HTML is often a surprisingly complicated task. Even seemingly trivial constructs turn out to have a bunch of "fiddly bits" that one needs to account for, so most such tasks call for a full-blown HTML parser. Python includes one in its standard library, but it can be somewhat painful to use. Some recent experiments with Beautiful Soup, in particular version 4, which was released earlier this year, have shown it to be well-designed and easy to use.

Beautiful Soup is available as a tarball that can be installed in the usual Python way (using setup.py). It can also be installed using pip or easy_install from the PyPI repository. It is also packaged as python-bs4 for Debian and Ubuntu; packages for other distributions will presumably be coming along as well. It supports both Python 2.7 and Python 3, with few external dependencies. By default, Beautiful Soup uses the parser from the Python standard library, but it can also use third-party parsers if they are installed: lxml, which is faster, and html5lib, which is slower but parses pages the way a web browser does.
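
The choice of parser can also be made explicit by passing its name as the second argument to the constructor. What follows is a minimal sketch, written for Python 3, that tries the parsers in order of preference. Note that make_soup() is a hypothetical helper invented for illustration, not part of Beautiful Soup, and that only "html.parser" can be assumed to be present everywhere:

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse markup with the first available parser."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # That parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")

soup, parser_name = make_soup("<p>foo bar</p><p>baz</p>")
texts = [p.get_text() for p in soup.find_all("p")]
print(parser_name)
print(texts)
```

Which name gets printed depends on what is installed on the system; the paragraph text should be the same either way.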

Getting started with Beautiful Soup is pretty straightforward:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(string_or_filehandle)
From that point on, the soup object is used to query and manipulate the HTML contained in the argument. The input data is converted to Unicode, parsed, and cleaned up so that tags are properly closed and nested. At that point, simply outputting the object (e.g. print soup) will produce the cleaned-up HTML. soup.prettify() goes further, returning a string with each tag and string on its own line, indented to show the nesting.
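
As a rough illustration of the difference between the two kinds of output, here is a small sketch for Python 3. The input string is made up for this example, and the standard library parser is named explicitly so that no optional packages are assumed:

```python
from bs4 import BeautifulSoup

# Neither the <b> nor the <p> tag is closed in the input
soup = BeautifulSoup("<p>foo <b>bar", "html.parser")

compact = str(soup)        # cleaned-up HTML on a single line
pretty = soup.prettify()   # one tag or string per line, indented

print(compact)
print(pretty)
```

The exact whitespace produced by prettify() may differ between releases, so it is best treated as output for humans to read rather than something to compare against.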

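But there is a lot more available than simply cleaner HTML. Beautiful Soup can also take HTML fragments; depending on the underlying parser, they will be turned into full HTML documents once parsed (lxml and html5lib add the missing <html> and <body> tags, while the standard library parser does not). Beautiful Soup breaks the HTML down into objects which correspond to the tags contained in the input. For example: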

    print soup.head
    print soup.body
will print the <head> and <body> sections. For an HTML fragment, soup.body will contain the parsed contents of that fragment, and each piece in the HTML can be accessed via the .children iterator or the .contents list. So:

    html_frag = '<p>foo bar</p><p>baz<p>yet another graf</p>'
    soup = BeautifulSoup(html_frag)
    print soup.body.contents
    for c in soup.body.children:
        print c

    # output
    [<p>foo bar</p>, <p>baz</p>, <p>yet another graf</p>]
    <p>foo bar</p>
    <p>baz</p>
    <p>yet another graf</p>
Note that the unclosed paragraph tag was fixed.
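
The soup.body access in the example above relies on a parser that wraps fragments in a full document. Here is a hedged sketch, for Python 3, of how the same walk might be written so that it also works with the standard library parser (named explicitly as "html.parser"), which leaves the fragment unwrapped. The input has been simplified for this example so that every paragraph is closed:

```python
from bs4 import BeautifulSoup

frag = "<p>foo bar</p><p>baz</p>"
soup = BeautifulSoup(frag, "html.parser")

# html.parser does not wrap the fragment in <html> and <body>
print(soup.body)

# Fall back to the soup object itself when there is no <body>
root = soup.body if soup.body is not None else soup
items = [str(child) for child in root.children]
print(items)
```

With this parser, the first print() shows None, and the list holds the two paragraphs.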

Each tag in the HTML gets turned into a Tag object by Beautiful Soup. Tag objects have attributes like .name, .children, .parent, and so on, which can be used to distinguish various tags and to navigate the tree. In addition, the HTML attributes of a particular tag are available by accessing the tag as a dictionary:

    soup = BeautifulSoup('<b class="boldest">foo</b> bar <b class="bolder">baz</b>')
    print soup.b['class']

    # output
    ['boldest']
The first tag of a given type can be referred to using the dot notation, so soup.b is the first "b" tag in the object. One can access the other tags by navigating in the HTML tree or by searching.
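
Both approaches to reaching the second "b" tag might look something like the following sketch for Python 3. It names the standard library parser explicitly; find_next_sibling() and find_all() are existing Beautiful Soup methods, while the variable names are just for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<b class="boldest">foo</b> bar <b class="bolder">baz</b>',
    "html.parser")

first = soup.b                          # first <b> in the document
second = first.find_next_sibling("b")   # navigate to the next <b> sibling
all_b = soup.find_all("b")              # or search for every <b>

print(second["class"])
print([str(b.string) for b in all_b])
print(first.get("id"))   # .get() returns None for a missing attribute
```

Using .get() rather than the square-bracket syntax avoids a KeyError when an attribute may or may not be present.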

But wait, there's more:

    soup.b['class'] = 'waylessbold'
    print soup.body

    # output
    <body><b class="waylessbold">foo</b> bar <b class="bolder">baz</b></body>
So, changing an attribute is reflected in the output of the soup object. The same dictionary syntax can be used to add new attributes (tag['newattr']='foo') or to remove existing attributes (del tag['class']).
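
Put together, adding and removing attributes could look like this short sketch (Python 3, with a cut-down input string and the standard library parser named explicitly):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">foo</b> bar', "html.parser")
tag = soup.b

tag["newattr"] = "foo"   # add an attribute that was not in the input
del tag["class"]         # remove one that was

result = str(soup)
print(result)
```

The output should be the "b" tag with only the newattr attribute remaining.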

For many transformation tasks, though, one might rather not step through the whole HTML tree, and would, instead, want to search for tags of interest. Beautiful Soup has some powerful capabilities in that area too. Using the above example:

    for b in soup.find_all('b'):
        b['class'] = 'justbold'
would change the class of all "b" tags in the object to "justbold" (adding a "class" attribute to any that don't have it). More complicated things can be done using regular expressions as well:

    import re

    for a in soup.find_all('a', href=re.compile(r'^/')):
        a['href'] = 'http://lwn.net%s' % a['href']
would turn links that begin with a slash into full URLs, for example. Keyword arguments (like href above) search the HTML attributes of the tag based on a string or regular expression. There are also ways to search based on the presence of an attribute (href=True), to limit the number of results returned (limit=N), or to change the default recursive searching so that only direct children are searched (recursive=False).
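
Those search variations can be sketched as follows (Python 3). The markup is invented for this example; href=True, limit, and recursive are existing find_all() arguments:

```python
from bs4 import BeautifulSoup

html = ('<div><a href="/Articles/">one</a><a name="top">anchor</a>'
        '<p><a href="/Archives/">two</a></p>'
        '<a href="http://lwn.net/">three</a></div>')
soup = BeautifulSoup(html, "html.parser")

# Only <a> tags that have an href attribute, whatever its value
with_href = [a.get_text() for a in soup.find_all("a", href=True)]

# Stop after the first two matches
first_two = [a.get_text() for a in soup.find_all("a", href=True, limit=2)]

# Only direct children of the <div>, so the link inside <p> is skipped
direct = [a.get_text() for a in soup.div.find_all("a", recursive=False)]

print(with_href)
print(first_two)
print(direct)
```

Matches come back in document order, so the link nested inside the paragraph shows up between its neighbors in the first two lists.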

One can also create a dictionary with just the attributes needed (or wanted) on a particular tag type and simply assign it to those tags using the .attrs attribute:

    for img in soup.find_all('img'):
        idict = { 'height' : 22, 'width' : 42, 'src' : img['src'] }
        img.attrs = idict
Something like that might be useful to remove HTML attributes like align= that aren't acceptable in some forms of HTML (e.g. EPUB).
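
A variation on that idea is to keep whatever attributes a tag already has, as long as they appear on a list of allowed names. The sketch below (Python 3) assumes a made-up ALLOWED set and sample markup; which attributes a given HTML flavor actually accepts would need to be checked separately:

```python
from bs4 import BeautifulSoup

# Hypothetical list of attributes to keep on <img> tags
ALLOWED = {"src", "alt", "height", "width"}

html = ('<p><img src="a.png" align="left" border="0" alt="A">'
        '<img src="b.png" align="right"></p>')
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    # Rebuild the attribute dictionary without the disallowed names
    img.attrs = {k: v for k, v in img.attrs.items() if k in ALLOWED}

cleaned = [img.attrs for img in soup.find_all("img")]
print(cleaned)
```

Unlike the fixed dictionary above, this version does not fail on an "img" tag that lacks a src attribute.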

One can also create and insert entirely new tags into the soup object (and thus the HTML). A tag can be created with soup.new_tag(), then inserted before or after any other tag using insert_before() or insert_after():

    itag = soup.new_tag('i')
    itag.string = 'italicized'
    soup.b.insert_before(itag)
    print soup.body

    # output
    <body><i>italicized</i><b class="justbold">foo</b> bar <b class="justbold">baz</b></body>
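
Inserting after a tag works the same way, and attributes for the new tag can be handed to new_tag() as keyword arguments. A small sketch for Python 3, using a shortened input and the standard library parser (so there is no "body" wrapper in the result):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>foo</b> bar", "html.parser")

itag = soup.new_tag("i")
itag.string = "italicized"
soup.b.insert_after(itag)        # lands between </b> and " bar"

# Attributes can be given as keyword arguments to new_tag()
link = soup.new_tag("a", href="http://lwn.net/")
link.string = "LWN"
soup.b.insert_before(link)

result = str(soup)
print(result)
```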

One seeming oddity in Beautiful Soup is that there is a second type of object, called a NavigableString. Any string in the input (even one that is the child of a Tag, e.g. the "foo" in <b>foo</b>) will be represented as a NavigableString. These can be encountered while moving around in the HTML tree:

    soup = BeautifulSoup('<i>italicized</i><b class="justbold">foo</b> bar <b class="justbold">baz</b>')

    for t in soup.body.children:
        print t.name

    # output
    i
    b
    Traceback (most recent call last):
    ...
    AttributeError: 'NavigableString' object has no attribute 'name'
The unadorned "bar" in the fragment above is clearly not a tag, but because it is a different kind of object, one without the "standard" .name attribute, it is a bit of a pain to deal with. The problem occurs mostly in experimentation and debugging, but one needs to do something like:

    from bs4 import NavigableString

    if isinstance(child, NavigableString):
        ...
to detect it, which is somewhat annoying. Perhaps I don't understand the ramifications of synthesizing a Tag object to hold those strings (or at least providing things like .name on those objects), but at first glance it seems like it would be a better way to handle them.
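
For what it is worth, a loop that copes with both kinds of object might be sketched like this (Python 3). It uses the standard library parser, which adds no "body" tag, so it walks the top level of the soup object instead. The use of getattr() with a default is simply one way to sidestep the exception, not something the Beautiful Soup documentation prescribes:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup(
    '<i>italicized</i><b class="justbold">foo</b> bar '
    '<b class="justbold">baz</b>', "html.parser")

# Separate the children by type before touching tag-only attributes
tag_names = [c.name for c in soup.children if isinstance(c, Tag)]
strings = [str(c) for c in soup.children if isinstance(c, NavigableString)]

# getattr() with a default never raises, whatever kind of child it is given
names_or_none = [getattr(c, "name", None) for c in soup.children]

print(tag_names)
print(strings)
print(names_or_none)
```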

Overall, Beautiful Soup is rather impressive. There is quite a bit more to it than this brief overview shows, but the documentation is excellent and provides lots of examples. If you have some need to parse or transform HTML in Python, Beautiful Soup 4 is surely worth a look.

Brief items

Quotes of the week

I was asked a few weeks ago, "What was the biggest surprise you encountered rolling out Go?" I knew the answer instantly: Although we expected C++ programmers to see Go as an alternative, instead most Go programmers come from languages like Python and Ruby. Very few come from C++.
Rob Pike

It's fair enough to say that I wouldn't be a programmer today if it weren't for an interest in game programming, and that is true of several of my friends as well. But if that is true, why then do we have so few finished and polished free software games? Answering that question actually deserves of a post of its own (and indeed, solving that riddle is a good portion of the motive behind Liberated Pixel Cup), but it's enough to say for now that we are missing opportunities of encouraging future hackers by not making free software a welcoming playground for game development.
Chris Webber

Enlightenment (EFL) gains physics support

Enlightenment developer Bruno Dilly announced that EFL has merged in EPhysics, a wrapper for the Bullet Physics library, making it "pretty simple for an EFL programmer and we expect them to adopt EPhysics to create their next splash screen, transition effects and even more games." The library allows Evas objects to "have physical attributes such as mass, friction and restitution and shape. They may receive impulses and collide between them."

KDE Announces 4.9 Release Candidate 1

The first release candidate for the KDE 4.9 desktop environment has landed. "With API, dependency and feature freezes in place, the KDE team's focus is now on fixing bugs and further polishing new and old functionality." Highlights include support for Qt Quick in Plasma, deeper integration of the "Activities" scheme for organizing workspaces, and improved metadata sorting and searching within the Dolphin file manager.

SyncEvolution 1.2.99.1 released

Version 1.2.99.1 of the SyncEvolution PIM-synchronization framework has been released. This is the first pre-release version of the upcoming 1.3 series, and includes several new features, including KDE/Akonadi support, ActiveSync support, and rewritten D-Bus and CalDAV components. Despite the pre-release status, upgrading is still recommended for several reasons: "for example, SyncEvolution 1.3 is required for Evolution 3.4, otherwise photos are not exported properly. Further workarounds for recent changes in Google CalDAV were added."

Newsletters and articles

Development newsletters from the last week

Firefox for Android may become your favorite mobile browser (ars technica)

Ars technica has posted a review of the Android version of Firefox. "One of the key features of Firefox for Android is its support for Mozilla's synchronization service. It works seamlessly with the desktop version of the browser, allowing the user to access their bookmarks and other browser data. This capability works as expected and will likely be a major draw for existing Firefox users."

St. Pierre: The Linux Graphics Stack

Jasper St. Pierre has posted a lengthy overview of the Linux graphics stack. It's a good starting point for anybody who is not clear on what all those acronyms mean. "The X server needs to know what’s happening here, though, so it can do things like synchronization. This synchronization between your glxgears, the kernel, and the X server is called DRI, or more accurately, DRI2. 'DRI' stands for 'Direct Rendering Infrastructure', but it’s sort of a strange acronym. 'DRI' refers to both the project that glued mesa and Xorg together (introducing DRM and a bunch of the things I talk about in this article), as well as the DRI protocol and library. DRI 1 wasn’t really that good, so we threw it out and replaced it with DRI 2."

Page editor: Nathan Willis

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds