Posted Jul 5, 2012 14:05 UTC (Thu) by leonard.richardson (guest, #85452)
Parent article: Making some Beautiful Soup
Hi, I'm the author of Beautiful Soup. I really appreciate this writeup. I wanted to briefly respond to the article and some of the comments:
* A NavigableString doesn't define the .name attribute because, as you point out, it's a string, not a tag. The underlying problem is that Beautiful Soup doesn't provide a good idiom for iterating over mixed lists of strings and tags. I have thought about this but haven't come up with anything good enough to add to the API.
* NavigableString is a subclass of Python's `unicode` class (`str` in Python 3), so you can write `isinstance(child, unicode)` instead of `isinstance(child, NavigableString)`. This makes checking a little less annoying.
* Thanks largely to Ezio Melotti, the problems with Python's built-in HTMLParser have been fixed as of Python 2.7.3 and Python 3.2.2. HTMLParser's ability to handle bad HTML is now on par with lxml and html5lib, and Beautiful Soup 4 can use any of those parsers.
* If you're happy with lxml's interface, you should use lxml. Beautiful Soup 4 is like pyquery: a library that sits on top of a parser and provides an alternate interface that some people find easier to use.
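To make the isinstance point concrete, here is a minimal sketch, assuming Python 3 (so `str` rather than `unicode`) and Beautiful Soup 4 installed as the `bs4` package. The function name `describe_children` and the sample markup are made up for illustration; they are not part of the Beautiful Soup API. The second argument to the constructor picks the parser; "html.parser" is the built-in one, and "lxml" or "html5lib" should work the same way if those packages are installed.

```python
from bs4 import BeautifulSoup


def describe_children(markup, parser="html.parser"):
    """Return (kind, value) pairs for the direct children of the first <p> tag.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, parser)
    result = []
    for child in soup.p.children:
        if isinstance(child, str):
            # NavigableString subclasses str, so this catches text nodes
            # without importing NavigableString.
            result.append(("string", str(child)))
        else:
            # The remaining children are Tag objects, which do define .name.
            result.append(("tag", child.name))
    return result


pairs = describe_children("<p>Hello <b>world</b>!</p>")
print(pairs)
```

This is still the "check the type of every child" pattern, not a nicer idiom for mixed lists; it only shows that the check can be written against the built-in string type.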