By Jonathan Corbet
March 2, 2011
Regular expressions are a pain. Their power cannot be doubted; a regular
expression can describe complicated text patterns in an exceedingly concise
manner. Using regular expressions, a program can perform all kinds of
string parsing and recognition tasks. But they are also difficult to
write, difficult to read, difficult to understand, and difficult to debug.
Any but the most trivial of regular expressions are quite likely to contain
errors. So it is not surprising that developers would think about replacing
them with something better. But, as a recent discussion in the Python
community shows, that replacement, like regular expressions themselves, may
be difficult.
The compactness of the regular expression syntax is part of its power,
but also part of the problem. Consider even a very simple expression:
<A\b[^>]*>(.*)</A>
A reader familiar with this syntax will recognize that this expression
matches the HTML <A> tag and sets aside the anchor text for
later processing. But even experienced regular expression developers must
look at that expression for a moment and think about how the various
metacharacters affect each other before being able to say for sure what it
does. It takes even longer to notice the subtle bug: this expression will
be confused by the presence of multiple <A> tags in the text
being searched.
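A short Python sketch makes the bug visible. The sample text and the non-greedy variant below are illustrative assumptions, not something taken from the discussion; the lazy ".*?" is only one possible workaround, and it does not turn a regular expression into a proper HTML parser.

```python
import re

# The expression from the text, unchanged: ".*" is greedy.
greedy = re.compile(r"<A\b[^>]*>(.*)</A>")

# One possible workaround (an assumption for illustration): a lazy ".*?".
lazy = re.compile(r"<A\b[^>]*>(.*?)</A>")

# Hypothetical sample input containing two anchor tags on one line.
text = '<A HREF="one.html">first</A> and <A HREF="two.html">second</A>'

# The greedy version runs from the first opening tag to the last closing tag.
greedy_result = greedy.findall(text)

# The lazy version stops at the nearest closing tag each time.
lazy_result = lazy.findall(text)

print(greedy_result)  # ['first</A> and <A HREF="two.html">second']
print(lazy_result)    # ['first', 'second']
```

The greedy expression captures everything between the first opening tag and the final closing tag as a single "anchor text," which is almost certainly not what its author intended.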
So how might one do better? That was Mike
Meyer's question as he sought a more "pythonic" way of doing text
matching. Needless to say, he is not the first to ask that kind of
question; there are a number of attempts at better string matching out
there. The first of those is arguably not "pythonic" at all: it is SnoPy, a port of
the venerable SNOBOL language to Python.
SNOBOL was developed during the 1960s; it included pattern matching as a
core feature of the language. Unlike regular expressions, SNOBOL was
anything but concise. Concatenation of strings was explicit,
"[abc]" was "Any("abc")", and so on. Nonetheless, SNOBOL
was highly influential in this area, and one can see echoes of the language
in current regular expressions. That said, SNOBOL is not heavily used now,
and the Python SNOBOL module seems to have suffered the same fate; its last
release was in 2002.
Another approach is the rxb.py module by
Ka-Ping Yee. This module, posted in 2005, creates a new, relatively
verbose but relatively readable language for the creation of patterns.
Using this language, the regular expression shown above would look
something like:
<A + any(wordchars + whitespace)> + label(1, anychars) + </A>
(Note to readers: the above is totally untested and should not be relied
upon for production use.) This module, too, has not seen a great deal of
use.
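To give a feel for the general idea of a verbose, composable pattern language, here is a minimal sketch layered on top of Python's own re module. Every helper name in it (literal, none_of, zero_or_more, label) is hypothetical and invented for this illustration; none of it is the rxb.py API, which may look quite different.

```python
import re

# Hypothetical helpers for illustration only; these are NOT the rxb.py API.

def literal(text):
    """Match the given text exactly, escaping any metacharacters."""
    return re.escape(text)

def none_of(chars):
    """Match any single character that is not in chars."""
    return "[^" + re.escape(chars) + "]"

def zero_or_more(pattern, greedy=True):
    """Repeat a pattern zero or more times, greedily unless told otherwise."""
    return "(?:" + pattern + ")*" + ("" if greedy else "?")

def label(name, pattern):
    """Capture a sub-pattern under the given group name."""
    return "(?P<" + name + ">" + pattern + ")"

ANY_CHAR = "."
WORD_BOUNDARY = r"\b"

# The anchor-tag expression from earlier, spelled out piece by piece
# (with a lazy match for the anchor text).
anchor = re.compile(
    literal("<A") + WORD_BOUNDARY
    + zero_or_more(none_of(">")) + literal(">")
    + label("text", zero_or_more(ANY_CHAR, greedy=False))
    + literal("</A>")
)

match = anchor.search('<A HREF="x.html">anchor text</A>')
anchor_text = match.group("text")
print(anchor_text)
```

The result is several times longer than the one-line regular expression, which is the trade-off these verbose approaches make: more typing in exchange for pieces that can be named and read one at a time.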
Various other packages are out there. For example, one can try to use
Icon-style pattern matching with Python. For something completely
different, there is the eGenix mxTextTools module, which allows the
creation of text-matching programs
in an assembly-like language, complete with goto constructs. mxTextTools
is intimidating and not necessarily any easier to read than regular
expressions, but it is said to be powerful and fast, and there are a number
of real users.
Still, none of these seem likely to replace regular expressions as the
first tool Python programmers reach for when they need to perform string
matching. Python creator Guido van Rossum thinks things will stay that way:
"I fear that regular expressions have this market cornered, and
there isn't anything possible that is so much better that it'll
drive them out."
Pushing aside an established incumbent is always hard, and regular
expressions are well established indeed. It is never enough to simply be
better in this situation; the proposed replacement has to be a lot better.
As Guido noted, nothing seems to have come along which is that much better,
and it may be that nothing ever will. For some medium-term value of
"ever," anyway.
But, then, one also should not underestimate the ingenuity of free software
developers. Or their persistence. People will almost certainly continue
to throw themselves against this problem, and, maybe, somebody will come up
with something interesting. Until then, we'll have to continue beating our
heads against our desks as we try to figure out why our expressions don't
work as intended.