LWN.net Logo

Google's RE2 regular expression library

Google's RE2 regular expression library

Posted Mar 13, 2010 10:23 UTC (Sat) by PO8 (guest, #41661)
Parent article: Google's RE2 regular expression library

This all seems a little weird to me. The things chosen as reference points are mostly strange; most were never particularly touted as examples of speed. GNU grep is pretty fast, though; indeed, if you look at their own graph you might wonder why they cut it off where they did—grep certainly looks like it's almost constant-time in this example, and about to cross the y axis. Anyhow, the RE and target given in that second paper are a lousy benchmark for normal use; they are only intended to illustrate the thesis that backtracking loses on some REs and targets. In short, I'm not sure why backtracking matchers are any more than a strawman when building a fast modern RE engine. The benchmarking in the first paper is only against PCRE.

It doesn't appear that any kind of Boyer-Moore is being used; this can be a major performance penalty for fixed strings. If I recall correctly, one can do Boyer-Moore-like things with general REs these days also. This is a performance no-brainer as far as I know.

I'd be willing to wager a small amount of money that ripping Mike Haertel's RE engine out of GNU grep and clipping out the backreference support (which wouldn't be very hard) would give a much faster engine than re2, with an equally good memory footprint. The problem with this approach for Google, I suspect, is the GPL. Reimplementing a more sophisticated DFA-based RE engine doesn't seem that hard either, though. Certainly one could pay Mike to do it on a contract basis.

Here's some stories about Mike and grep and RE matching: (1) (and note the comments), (2).


(Log in to post comments)

Google's RE2 regular expression library

Posted Mar 14, 2010 10:05 UTC (Sun) by tzafrir (subscriber, #11501) [Link]

So this is yet another part of Project UNG (UNG is Not GNU)?

Google's RE2 regular expression library

Posted Mar 14, 2010 19:38 UTC (Sun) by droundy (subscriber, #4559) [Link]

Somehow it seems a bit odd to argue that comparison with perl, python and
ruby and pcre is irrelevant. True, they are slow, but regexp matching is
supposed to be one of their strengths (esp of perl). And most people think
of them as being only maybe 10 or 1000 times slower than C, precisely when
the computation is dominated by library calls such as regular expression
matching, not 1e6 times slower than C. Of course, the point is that the
algorithm is at issue, not the language.

Admittedly, it seems like pretty weird regular expressions are needed in
order to trigger this exponential behavior. But then again, taking a minute
to match a 29 character string with a 29 character regular expression does
seem pretty excessive...

Google's RE2 regular expression library

Posted Mar 15, 2010 9:47 UTC (Mon) by PO8 (guest, #41661) [Link]

I guess my point was just that this work seems to be presented as a significant improvement on the state of the art in RE-matching performance. As far as I can tell it's only an improvement over RE implementations that have never been held up as the state of the art for performance. Further, its performance seems to have been measured mostly on a particularly pathological microbenchmark.

Just grabbed the source code and stared at it for a bit. It's big; 14K lines of C++ code, really well commented. It does do lazy DFA compilation with a state cache, which is nice. Couldn't find any evidence of Boyer-Moore, though. This is really a major omission if performance is an issue.

Don't really have time right now to do more benchmarking of this code. There's some benchmarks there already, but I haven't figured out how to play with them yet. More if and when I get some time.

Google's RE2 regular expression library

Posted Mar 15, 2010 17:31 UTC (Mon) by droundy (subscriber, #4559) [Link]

I guess I saw the articles more as a denouncement of the poor
implementations in common use than the claim of a drastically new and better
implementations, and the choice of a pathological regular expression made
sense as such, since the point is that such a beast exists with backtracking
implementations, and doesn't exist with good implementations.

Google's RE2 regular expression library

Posted Mar 17, 2010 21:39 UTC (Wed) by dthurston (subscriber, #4603) [Link]

I think it's more that they're giving the standard, fast performance of a proper RegExp engine to almost all of the various convenient extensions that Perl, Python, etc. have. So it's an advance in speed over Perl et al, and an advance in convenience over awk, etc.

GNU Grep has its bad days

Posted Mar 19, 2010 12:16 UTC (Fri) by walles (guest, #954) [Link]

My experience isn't that grep is "pretty fast"; here's a bug report where both Ruby and Perl runs circles around Grep:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=445215

This performance issue is one that I'd really like to see fixed, so if somebody would step up to fix it that would be great!

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds