Google's RE2 regular expression library
Posted Mar 13, 2010 10:23 UTC (Sat) by PO8
Parent article: Google's RE2 regular expression library
This all seems a little weird to me. The things chosen as reference points are mostly strange; most were never particularly touted as examples of speed. GNU grep is pretty fast, though; indeed, if you look at their own graph you might wonder why they cut it off where they did—grep certainly looks like it's almost constant-time in this example, and about to cross the y axis. Anyhow, the RE and target given in that second paper are a lousy benchmark for normal use; they are only intended to illustrate the thesis that backtracking loses on some REs and targets. In short, I'm not sure why backtracking matchers are any more than a strawman when building a fast modern RE engine. The benchmarking in the first paper is only against PCRE.
It doesn't appear that any kind of Boyer-Moore is being used; this can be a major performance penalty for fixed strings. If I recall correctly, one can do Boyer-Moore-like things with general REs these days also. This is a performance no-brainer as far as I know.
I'd be willing to wager a small amount of money that ripping Mike Haertel's RE engine out of GNU grep and clipping out the backreference support (which wouldn't be very hard) would give a much faster engine than re2, with an equally good memory footprint. The problem with this approach for Google, I suspect, is the GPL. Reimplementing a more sophisticated DFA-based RE engine doesn't seem that hard either, though. Certainly one could pay Mike to do it on a contract basis.
Here's some stories about Mike and grep and RE matching: (1) (and note the comments), (2).
to post comments)