Making some Beautiful Soup

By Jake Edge
June 27, 2012

Parsing HTML can be a surprisingly complicated task. Even seemingly trivial constructs tend to have a bunch of "fiddly bits" that one needs to account for, so for most such tasks one turns to a full-blown HTML parser. Python includes one in its standard library, but it can be somewhat painful to use. Some recent experiments with Beautiful Soup, in particular version 4, which was released earlier this year, have shown it to be well-designed and easy to use.

Beautiful Soup is available as a tarball that can be installed in the usual Python way (using setup.py). It can also be installed from the Python Package Index (PyPI) using pip or easy_install, where the package is named beautifulsoup4. It is also packaged as python-bs4 for Debian and Ubuntu; packages for other distributions will presumably be coming along as well. It supports both Python 2.7 and Python 3 and has few external dependencies. Beautiful Soup can use the parser from the Python standard library, but it will use lxml (which is faster) or html5lib (which is slower, but parses the way a web browser does) if one of them is installed.

Getting started with Beautiful Soup is pretty straightforward:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(string_or_filehandle)

From that point on, the soup object is used to query and manipulate the HTML contained in the argument. The input data is converted to Unicode, parsed, and cleaned up so that tags are properly closed and nested. At that point, simply outputting the object (e.g. print(soup)) will produce the cleaned-up HTML. soup.prettify() goes further, putting each tag and string on its own line, indented by nesting depth.
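
As a small sketch of that round trip (written for Python 3; the sample markup is made up for illustration, and the standard library's "html.parser" is named explicitly so that the result does not depend on which optional parsers happen to be installed):

```python
from bs4 import BeautifulSoup

# Hypothetical input: neither the <p> nor the <b> tag is ever closed.
markup = '<p>Hello, <b>world'

# Name the parser explicitly; otherwise Beautiful Soup picks the "best" one installed.
soup = BeautifulSoup(markup, 'html.parser')

cleaned = str(soup)        # the same text that print(soup) would show
pretty = soup.prettify()   # the same tree, spread over several indented lines

print(cleaned)
print(pretty)
```

The first print shows the markup with both tags closed on a single line; the second shows the same tree in its multi-line, indented form.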

But there is a lot more available than simply cleaner HTML. Beautiful Soup can also take HTML fragments; with the lxml or html5lib parsers, those are transformed into full HTML documents (with <html> and <body> elements) once parsed. Beautiful Soup breaks the HTML down into objects that correspond to the tags contained in the input. For example:

    print(soup.head)
    print(soup.body)

will print the <head> and <body> sections. For an HTML fragment, soup.body will contain the parsed contents of that fragment, and each piece of the HTML can be accessed via the .children iterator or the .contents list. So:

    html_frag = '<p>foo bar</p><p>baz<p>yet another graf</p>'
    soup = BeautifulSoup(html_frag)
    print(soup.body.contents)
    for c in soup.body.children:
        print(c)

    # output
    [<p>foo bar</p>, <p>baz</p>, <p>yet another graf</p>]
    <p>foo bar</p>
    <p>baz</p>
    <p>yet another graf</p>

Note that the unclosed paragraph tag was fixed.
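
The difference between .contents and .children can be sketched as follows. This is an illustration for Python 3 under one stated assumption: it names the standard library's "html.parser", which does not synthesize <html> or <body> elements, so the paragraphs are walked from the soup object itself rather than from soup.body.

```python
from bs4 import BeautifulSoup

# A well-formed fragment, parsed with the standard library parser.
soup = BeautifulSoup('<p>foo bar</p><p>baz</p>', 'html.parser')

# .contents is a real list, so it supports len() and indexing.
as_list = soup.contents

# .children is an iterator over the same nodes.
texts = [p.get_text() for p in soup.children]

print(len(as_list), as_list[1])
print(texts)
```

With this parser soup.body is None, which is worth remembering when moving code between machines that have different parsers installed.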

Each tag in the HTML gets turned into a Tag object by Beautiful Soup. Tag objects have attributes like .name, .children, .parent, and so on, which can be used to distinguish various tags and to navigate the tree. In addition, the HTML attributes of a particular tag are available by accessing the tag as a dictionary:

    soup = BeautifulSoup('<b class="boldest">foo</b> bar <b class="bolder">baz</b>')
    print(soup.b['class'])

    # output
    ['boldest']

The first tag of a given type can be referred to using dot notation, so soup.b is the first "b" tag in the object. The other tags can be reached by navigating the HTML tree or by searching.
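
A slightly fuller sketch of attribute access, for Python 3 with the standard library parser named explicitly (the markup and the id and title attributes are invented for the example). It shows why the output above is a list: class can hold several values, while most other attributes come back as plain strings.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest big" id="first">foo</b>', 'html.parser')
tag = soup.b                 # first <b> tag in the document

classes = tag['class']       # multi-valued attribute: a list of strings
ident = tag['id']            # single-valued attribute: a plain string
missing = tag.get('title')   # .get() returns None instead of raising KeyError

print(tag.name, classes, ident, missing)
print(tag.attrs)             # all of the attributes as one dictionary
```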

But wait, there's more:

    soup.b['class'] = 'waylessbold'
    print(soup.body)

    # output
    <body><b class="waylessbold">foo</b> bar <b class="bolder">baz</b></body>

So, changing an attribute is reflected in the output of the soup object. The same dictionary syntax can be used to add new attributes (tag['newattr'] = 'foo') or to remove existing ones (del tag['class']).
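
Adding and deleting attributes might look like this in practice. It is a minimal sketch for Python 3 with the standard library parser; the id value is arbitrary.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">foo</b>', 'html.parser')
tag = soup.b

tag['id'] = 'intro'    # add a new attribute
del tag['class']       # remove an existing one

result = str(soup)
print(result)
print(tag.has_attr('class'))
```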

For many transformation tasks, though, one might rather not step through the whole HTML tree, and would, instead, want to search for tags of interest. Beautiful Soup has some powerful capabilities in that area too. Using the above example:

    for b in soup.find_all('b'):
        b['class'] = 'justbold'

would change the class of all "b" tags in the object to "justbold" (adding a "class" attribute to any that don't have one). More complicated things can be done using regular expressions as well:

    import re

    for a in soup.find_all('a', href=re.compile(r'^/')):
        a['href'] = 'http://lwn.net%s' % a['href']

would turn site-relative links (those starting with "/") into full URLs, for example. Keyword arguments (like href above) search the HTML attributes of the tag based on a string or regular expression. There are also ways to search based on the presence of an attribute (href=True), to limit the number of results returned (limit=N), or to turn off the default recursive searching so that only direct children are searched (recursive=False).
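
Those search options can be seen side by side in a short sketch. The markup here is invented for the purpose, and the code assumes Python 3 with the standard library parser.

```python
import re
from bs4 import BeautifulSoup

html = ('<div><a href="/Articles/1/">one</a>'
        '<p><a href="http://example.com/">two</a> <a name="top">three</a></p></div>')
soup = BeautifulSoup(html, 'html.parser')

# Attribute present, whatever its value.
with_href = soup.find_all('a', href=True)

# Attribute value matches a regular expression.
rooted = soup.find_all('a', href=re.compile(r'^/'))

# Stop after the first match.
first_only = soup.find_all('a', limit=1)

# Only look at direct children of <div>, not at deeper descendants.
direct = soup.div.find_all('a', recursive=False)

for name, found in [('href=True', with_href), ('regex', rooted),
                    ('limit=1', first_only), ('recursive=False', direct)]:
    print(name, [a.get_text() for a in found])
```

Only the first link is a direct child of the <div>, so the non-recursive search skips the two that sit inside the <p>.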

One can also create a dictionary with just the attributes needed (or wanted) on a particular tag type and assign it to each such tag using the .attrs attribute:

    for img in soup.find_all('img'):
        idict = { 'height' : 22, 'width' : 42, 'src' : img['src'] }
        img.attrs = idict

Something like that might be useful for removing HTML attributes like align= that aren't acceptable in some forms of HTML (e.g. EPUB). Note that img['src'] will raise a KeyError for any <img> tag that lacks a src attribute.
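
A variation on the same idea is to filter the existing attributes against a whitelist rather than building a fixed dictionary. This is only a sketch: the ALLOWED tuple and the sample markup are hypothetical, and the code assumes Python 3 with the standard library parser.

```python
from bs4 import BeautifulSoup

# Hypothetical whitelist for this sketch; adjust to the target format.
ALLOWED = ('src', 'alt')

soup = BeautifulSoup('<img src="a.png" align="left" border="0" alt="A">',
                     'html.parser')

for img in soup.find_all('img'):
    # Keep only the whitelisted attributes; missing ones are simply absent.
    img.attrs = {k: v for k, v in img.attrs.items() if k in ALLOWED}

print(soup)
```

Because it never indexes an attribute directly, this version does not fail on tags that lack one of the listed attributes.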

One can also create and insert entirely new tags into the soup object (and thus the HTML). A tag can be created with new_tag() then inserted before or after any other tag:

    itag = soup.new_tag('i')
    itag.string = 'italicized'
    soup.b.insert_before(itag)
    print(soup.body)

    # output
    <body><i>italicized</i><b class="justbold">foo</b> bar <b class="justbold">baz</b></body>
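
The other insertion points work the same way. The following sketch (Python 3, standard library parser, made-up markup) creates a tag with an attribute, places it after an existing tag, and appends another tag as the last child of the paragraph:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>foo</b> bar</p>', 'html.parser')

# Attributes can be passed to new_tag() as keyword arguments.
link = soup.new_tag('a', href='http://lwn.net/')
link.string = 'LWN'
soup.b.insert_after(link)   # goes between </b> and the " bar" string

em = soup.new_tag('em')
em.string = 'end'
soup.p.append(em)           # becomes the last child of <p>

result = str(soup)
print(result)
```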

There is one seeming oddity with Beautiful Soup: there is another type of object called a NavigableString. Any string in the input (even one that is the child of a Tag, e.g. the "foo" in <b>foo</b>) will be represented as a NavigableString. These can be encountered while moving around in the HTML tree:

    soup = BeautifulSoup('<i>italicized</i><b class="justbold">foo</b> bar <b class="justbold">baz</b>')

    for t in soup.body.children:
        print(t.name)

    # output
    i
    b
    Traceback (most recent call last):
    ...
    AttributeError: 'NavigableString' object has no attribute 'name'

The unadorned "bar" in the fragment above is clearly not a tag, but because it is represented by a different kind of object, one without the "standard" .name attribute, it is a bit of a pain to deal with. The problem occurs mostly in experimentation and debugging, but one needs to do something like:

    from bs4 import NavigableString

    if isinstance(child, NavigableString):
        ...

to detect it, which is somewhat annoying. Perhaps I don't understand the ramifications of synthesizing a Tag object to hold those strings (or at least providing things like .name on those objects), but at first glance it seems like it would be a better way to handle them.
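
One way to cope with mixed children is to wrap the type check in a small helper. The describe() function below is hypothetical, written for this sketch rather than taken from Beautiful Soup, and the code assumes Python 3 with the standard library parser.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup('<p><i>italicized</i><b>foo</b> bar <b>baz</b></p>',
                     'html.parser')

def describe(node):
    """Return a short label for a child that may be a Tag or a bare string."""
    if isinstance(node, Tag):
        return 'tag:' + node.name
    if isinstance(node, NavigableString):
        return 'text:' + node.strip()
    return 'other'

labels = [describe(child) for child in soup.p.children]
print(labels)

# When the strings are not wanted at all, searching for True matches every
# tag and skips strings; recursive=False keeps it to direct children.
tags_only = soup.p.find_all(True, recursive=False)
print([t.name for t in tags_only])
```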

Overall, Beautiful Soup is rather impressive. There is quite a bit more to it than this brief overview shows, but the documentation is excellent and provides lots of examples. If you have some need to parse or transform HTML in Python, Beautiful Soup 4 is surely worth a look.



Posted Jun 28, 2012 8:59 UTC (Thu) by faassen (subscriber, #1676)

For an alternative to Beautiful Soup for parsing messy HTML in Python also check out lxml's html parser. lxml also offers Beautiful Soup integration.

http://lxml.de/lxmlhtml.html

Posted Jun 28, 2012 17:17 UTC (Thu) by shlomif (guest, #11299)

Thanks for the nice write-up.

Posted Jun 28, 2012 20:25 UTC (Thu) by geofft (subscriber, #59789)

So, my impression of the developer community consensus was that BeautifulSoup was pretty awesome four or five years ago, but lxml has caught up with it, and BeautifulSoup has in fact regressed in its ability to competently parse malformed HTML. There's some discussion in this StackOverflow question from 2009 about issues with it, notably quoting the changelog from 3.1:

> Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't

It's possible version 4 has improved things. I guess from their website, they've changed parser backends: "Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility." Well, I'm a little unsure why you wouldn't just use lxml (or html5lib) directly; I've been happy with lxml.

Posted Jul 3, 2012 11:36 UTC (Tue) by robbe (guest, #16131)

> Note that the unclosed paragraph tag was fixed.

One may omit the end tag of a P element in all versions of HTML¹. Thus there is no need to "fix" anything. Maybe you were thinking of XHTML, which requires end tags?

¹ In all the common cases, including the example given. See the specs for details.

Posted Jul 3, 2012 12:17 UTC (Tue) by hummassa (subscriber, #307)

One must always know where the (implicit) end tags are in order to construct the DOM.

Oh, and some of us have OCD and will burn your house if you forget the </p>.

Posted Jul 5, 2012 14:05 UTC (Thu) by leonard.richardson (guest, #85452)

Hi, I'm the author of Beautiful Soup. I really appreciate this writeup. I wanted to briefly respond to the article and some of the comments:

* A NavigableString doesn't define the .name attribute because as you point out, it's a string, not a tag. The underlying problem is that BeautifulSoup doesn't provide a good idiom for iterating over mixed lists of strings and tags. I have thought about this but haven't come up with anything good enough to add to the API.

* NavigableString is a subclass of Python's `unicode` class (`str` in Python 3), so you can write isinstance(child, unicode) instead of isinstance(child, NavigableString). This makes checking a little less annoying.

* Thanks largely to Ezio Melotti, the problems with Python's built-in HTMLParser have been fixed as of Python 2.7.3 and Python 3.2.2. HTMLParser's ability to handle bad HTML is now on par with lxml and html5lib, and Beautiful Soup 4 can use any of those parsers.

* If you're happy with lxml's interface, you should use lxml. Beautiful Soup 4 is like pyquery: a library that sits on top of a parser and provides an alternate interface that some people find easier to use.

Posted Oct 13, 2012 14:41 UTC (Sat) by demarchi (subscriber, #67492)

But as far as I could check, if your software supports both python 2 and 3, you will need to use NavigableString, right?

Posted Jul 12, 2012 19:48 UTC (Thu) by philomath (guest, #84172)

Finally something not about SQL and databases...
nice article.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.