Making some Beautiful Soup

Posted Jun 28, 2012 20:25 UTC (Thu) by geofft (subscriber, #59789)
Parent article: Making some Beautiful Soup

So, my impression of the developer community consensus was that BeautifulSoup was pretty awesome four or five years ago, but lxml has caught up with it, and BeautifulSoup has in fact regressed in its ability to competently parse malformed HTML. There's some discussion in this StackOverflow question from 2009 about issues with it, notably quoting the changelog from 3.1:

"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't."

It's possible version 4 has improved things. I guess from their website, they've changed parser backends: "Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility." Well, I'm a little unsure why you wouldn't just use lxml (or html5lib) directly; I've been happy with lxml.
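
For anyone who hasn't used it directly, the HTMLParser named in that changelog is an event-driven tokenizer from the standard library, not a tree builder. Here is a minimal sketch of pulling links out of sloppy markup with the Python 3 version of that module (html.parser). The names LinkCollector and collect_links are made up for this example, and current Python 3 releases are fairly tolerant of bad input, so this is not meant to reproduce the failures the 3.1 changelog describes.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags as the parser reports them.

    Hypothetical helper for illustration; not part of Beautiful Soup or lxml.
    """

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names
        # arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def collect_links(markup):
    """Feed markup to a fresh LinkCollector and return the hrefs it saw."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered input
    return parser.hrefs


# Deliberately sloppy markup: unclosed <p> and <a>, an unquoted attribute
# value, and mixed-case tag names.
messy = '<p>One <a href="/first">first<p>Two <A HREF=/second>second'
print(collect_links(messy))
```

Note that the parser only reports tags in the order it meets them; it makes no attempt to close the dangling elements or build a tree, which is the part Beautiful Soup, lxml, or html5lib would add on top.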


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds