Making some Beautiful Soup

Posted Jul 5, 2012 14:05 UTC (Thu) by leonard.richardson (guest, #85452)
Parent article: Making some Beautiful Soup

Hi, I'm the author of Beautiful Soup. I really appreciate this writeup. I wanted to briefly respond to the article and some of the comments:

* A NavigableString doesn't define the `.name` attribute because, as you point out, it's a string, not a tag. The underlying problem is that Beautiful Soup doesn't provide a good idiom for iterating over mixed lists of strings and tags. I have thought about this but haven't come up with anything good enough to add to the API.

* NavigableString is a subclass of Python's `unicode` class (`str` in Python 3), so you can write `isinstance(child, unicode)` instead of `isinstance(child, NavigableString)`. This makes checking a little less annoying.

* Thanks largely to Ezio Melotti, the problems with Python's built-in HTMLParser have been fixed as of Python 2.7.3 and Python 3.2.2. HTMLParser's ability to handle bad HTML is now on par with lxml and html5lib, and Beautiful Soup 4 can use any of those parsers.

* If you're happy with lxml's interface, you should use lxml. Beautiful Soup 4 is like pyquery: a library that sits on top of a parser and provides an alternate interface that some people find easier to use.
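
The first two points can be sketched roughly as follows. This is only an illustration, assuming Python 3 with Beautiful Soup 4 installed as `bs4`; the sample markup and the helper name `describe_children` are made up for the example, and `html.parser` is picked only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this sketch.
HTML = "<p>Hello <b>bold</b> world<i>!</i></p>"


def describe_children(tag):
    """Return (kind, value) pairs for a tag's direct children.

    Text nodes are NavigableString instances, which subclass str on
    Python 3, so a plain isinstance(child, str) check is enough to
    tell them apart from tags.
    """
    result = []
    for child in tag.children:
        if isinstance(child, str):
            # A string: it has text, but no tag name to report.
            result.append(("text", str(child)))
        else:
            # A tag: report its name.
            result.append(("tag", child.name))
    return result


# Use the parser from the standard library (an arbitrary choice here).
soup = BeautifulSoup(HTML, "html.parser")
pairs = describe_children(soup.p)
```

Swapping `"html.parser"` for `"lxml"` or `"html5lib"` selects one of the other parsers mentioned above, provided that package is installed.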


Making some Beautiful Soup

Posted Oct 13, 2012 14:41 UTC (Sat) by demarchi (subscriber, #67492)

But as far as I could check, if your software supports both Python 2 and 3, you will need to use NavigableString, right?
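
Checking against NavigableString should work unchanged on both versions. If the builtin-type check is preferred anyway, one possible sketch is to pick the text type once at import time. The names `text_type` and `is_text` and the stand-in class `FakeNavigableString` are hypothetical, used only so the example runs without Beautiful Soup installed.

```python
# Pick the builtin text type once: `unicode` on Python 2, `str` on Python 3.
try:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)
except NameError:
    text_type = str


class FakeNavigableString(text_type):
    """Hypothetical stand-in for NavigableString: a text-type subclass."""


def is_text(node):
    """Return True if node is an instance of the builtin text type."""
    return isinstance(node, text_type)


sample = FakeNavigableString("some text")
```

Calling `is_text(child)` then reads the same under either interpreter, at the cost of one extra name to maintain.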
