LWN.net Logo

Google's RE2 regular expression library

Google's RE2 regular expression library

Posted Mar 12, 2010 18:34 UTC (Fri) by robert_s (subscriber, #42402)
Parent article: Google's RE2 regular expression library

You know, ever since LLVM appeared, I've been expecting someone to come along and write a regex engine that dynamically compiles its matchers to native code.

Wonder why that's never happened.


(Log in to post comments)

Google's RE2 regular expression library

Posted Mar 12, 2010 19:17 UTC (Fri) by ajross (subscriber, #4563) [Link]

Mostly because constant factor execution speed (not the algorithmic stuff the linked article is talking about) doesn't matter for regexes. Anything doing performance critical work on text data in the modern world is going to be I/O bound at some level (even if it's only DRAM latency). Cycle counting doesn't help in that regime.

Google's RE2 regular expression library

Posted Mar 12, 2010 22:57 UTC (Fri) by tzafrir (subscriber, #11501) [Link]

Something like perl's qr// operator?

Google's RE2 regular expression library

Posted Mar 13, 2010 19:24 UTC (Sat) by tkil (subscriber, #1787) [Link]

Something like perl's qr// operator?

Not really; qr// simply lets the programmer control when the double-quotish regex string is turned into a regex object. It's more like generating an opcode tree for the regex, as opposed to compiling it down to bytecode or machine instructions.

This is still a win, since the regex object creation step is often the most expensive part of using a regex: it involves interpolation, compilation to opcodes, and various optimizations. Doing that only once vs. every time through a loop is a huge help.

Being able to compile down to machine instruction (with all the optimizations along the way) is just going further along the scale of "more effort up front" vs. "better speed on each match". When you're working with long-lived C++ apps and the regexes are static, it makes perfect sense to try to compile them down as far as possible; when working with short- lived dynamic scripts, it isn't always a win.

Google's RE2 regular expression library

Posted Mar 14, 2010 10:02 UTC (Sun) by tzafrir (subscriber, #11501) [Link]

What's so special about a regex, then?

If the VM has the code half-compiled, there are probably many other optimizations to do under the cover.

Google's RE2 regular expression library

Posted Mar 15, 2010 2:37 UTC (Mon) by tkil (subscriber, #1787) [Link]

What's so special about a regex, then?

If the VM has the code half-compiled, there are probably many other optimizations to do under the cover.

That's pretty much the reason: it's a tradeoff between upfront regex parsing / compilation / optimizing, and the ongoing cost of matching against that regular expression.

If you are only going to match against a regex once or twice, then there's no point in spending a huge amount of time optimizing that regex (and certainly not in doing it every time it's seen!). If, on the other hand, you are going to match many things against that regex, it makes sense to spend more time up front so that each match is a bit faster.

It's a continuum: on the one end, the regex engine can just parse the regex text as it goes, and do no compilation or optimization whatsoever (this would be analogous to a true interpreter). As the next step, the interpreter could process the regex into some internal representation, but not do any optimizations. A step further, that internal representation could be optimized in various ways. Finally, the optimized internal representation could be compiled down to machine code (with possible further optimizations applied). Each of those steps takes longer up front, though, so you have to amortize the up-front regex processing time across the number of matches.

Google's RE2 regular expression library

Posted Mar 13, 2010 2:26 UTC (Sat) by emk (guest, #1128) [Link]

Doesn't .NET compile regular expressions using System.Reflection.Emit,
which is basically Microsoft's closest approximation of LLVM?

In any case, you might not need all of LLVM—that kind of simplified control
flow can occasionally be handled quite nicely by specialized code generators.

Google's RE2 regular expression library

Posted Mar 13, 2010 17:10 UTC (Sat) by andikleen (subscriber, #39006) [Link]

It already happened multiple times of course.

You don't really need all the infrastructure in a full compiler like LLVM
for it either, regular expressions are rather simple.

One old style version (non JIT) is re2c that simply compiles regexprs to C.
Then the V8 java script engine in google chrome which uses an own compiler for their JS regexprs. Undoubtedly there are others too. e.g. in a language with built in compiler like Common Lisp it's rather easy to do.

Google's RE2 regular expression library

Posted Mar 16, 2010 16:21 UTC (Tue) by wahern (subscriber, #37304) [Link]

Ragel, which is amazing. I use it all over the place. I've written C parsers for HTTP, RTSP, SMTP, SDP, MySQL wire protocol, and many more w/ the help of Ragel.

http://www.complang.org/ragel/

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds