LWN.net Logo

It handles only regular expressions, while Perl regexps give more

It handles only regular expressions, while Perl regexps give more

Posted Mar 12, 2010 15:57 UTC (Fri) by epa (subscriber, #39769)
Parent article: Google's RE2 regular expression library

The project page notes that

Unlike most automata-based engines, RE2 implements almost all the common Perl and PCRE features and syntactic sugars. It also finds the leftmost-first match, the same match that Perl would, and can return submatch information. The one significant exception is that RE2 drops support for backreferences and generalized zero-width assertions, because they cannot be implemented efficiently.
This is quite correct. Once you allow backreferences, you no longer have a regular language but some more powerful class of language, which can't make the same runtime guarantees.

However it would be nice if the library could have two modes: a safe, regular-expression-only mode, and an unsafe mode where it will also accept backreferences and other funky Perl-like features, at the cost of fewer guarantees on run time. It might then be more widely adopted.


(Log in to post comments)

It handles only regular expressions, while Perl regexps give more

Posted Mar 12, 2010 17:51 UTC (Fri) by rriggs (subscriber, #11598) [Link]

Actually, I think this will be widely adopted precisely because it can always make those guarantees. I would really like to see this as an algorithm selection option in the Boost Regex library.

Full regular expressions?

Posted Mar 12, 2010 18:18 UTC (Fri) by vonbrand (subscriber, #4458) [Link]

There are regular expressions that require an exponentially large deterministic automaton to recognize (alternatively, if they simulate a nondeterministic one directly, an exponentially large set of states to remember), so they'll cut it short somewhere.

Full regular expressions?

Posted Mar 12, 2010 18:42 UTC (Fri) by ballombe (subscriber, #9523) [Link]

Well, they only say 'bounded stack space', not 'bounded heap space'.

Full regular expressions?

Posted Mar 12, 2010 19:13 UTC (Fri) by Pc5Y9sbv (guest, #41328) [Link]

Don't some of the back-referencing extensions actually take it from regular deep into LR(1) languages?

If I remember way back from school, in practice, many human-comprehensible non-determinstic REs can be emulated in closer to linear time. Running a set of DFAs, growing on each non-deterministic branch point, doesn't grow as wildly as it could in theory, because so many of the DFAs are culled relatively soon on dead-end input transitions. Is that no longer considered a valid heuristic strategy?

Full regular expressions?

Posted Mar 14, 2010 10:50 UTC (Sun) by ballombe (subscriber, #9523) [Link]

> Don't some of the back-referencing extensions actually take it from regular deep into LR(1) languages?

This is much much worse. Languages defined with general back-references are undecidable, and even the simplest kind of back reference can cause languages not to be algebraic, for example ([ab]*)\1 and (a*)(b*)\1\2 are not.

Now, all true regular expressions can be matched in linear time and bounded memory. However both the memory bound and the setup time (before starting the match) are in the worse case exponential in the size of the expression.

Full regular expressions?

Posted Mar 13, 2010 4:29 UTC (Sat) by greenfie (subscriber, #6488) [Link]

RE2 will fail to compile certain regular expressions due to its memory
limitations, but during matching (when it builds up its DFA) it has bounded
memory. It'll just run slower on perverse regexes.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds