|
|
Log in / Subscribe / Register

Development quotes of the week - regular expression edition

I *thought* I had a reasonable understanding of regexes, and now I have learned that I don't, and that the regexes I've been writing don't do what I thought they did, and presumably the only reason they haven't blown up in my face (either performance-wise, or the wrong output) is blind luck.

Now I have *three* problems.

Steven D'Aprano

Pretending that a regex matches in a simpler way than it actually does is like pretending that the earth is a sphere: technically wrong, but almost always close enough.
Chris Angelico

to post comments

Development quotes of the week - regular expression edition

Posted Feb 17, 2022 2:30 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (3 responses)

I always find these discussions a bit bizarre. At Google, we happily use RE2[1] for ~everything, and while it's not literally the fastest engine in the world, performance is quite acceptable and we basically don't have problems with exponential blowup at all. The only downside is that you can't use lookaround assertions, which is a bit of a problem if, say, you're trying to build a parser that operates on whole lines of input, but the correct solution to that problem is "lex, then parse," not "write a crazy regex that matches the exact combination of keywords and syntax that you wanted to match."

[1]: https://github.com/google/re2

Development quotes of the week - regular expression edition

Posted Feb 20, 2022 23:00 UTC (Sun) by ras (subscriber, #33059) [Link] (2 responses)

It's not completely bizarre. It made perfect to use an NFA back when Thompson first did it. CPU's were slow, RAM was hideously expensive.

At some point, sometime between 10 and 20 years ago, that decision made it's way across the wide blurry line delineating "good engineering decision" to "actively harmful". Using an NFA instead of a DFA is now so far into "actively harmful" territory it's bloody annoying - I get bitten by non-deterministically lurking exponential time blowup bugs at least once a year now. Well maybe it's not non-deterministic, but you would have to be superman to figure out when a given NFA engine take until until the heat death of the universe to process some random real world input, while breezing though all the unit tests you give it. And it will do that despite you give it a well formed RE a DFA could easily represent in sane amount of space.

If I had a choice, I would use any DFA based engine (like RE) every time. But the regex engines most languages put in their "comes with batteries" standard libraries are NFA's, because some bright spark decided to allow back reference matching donkey's years ago. I presume they added it because it's something NFA's can do easily, and that fact that it turned an O(1) problem into an NP one didn't really matter because hey, NFA's already can't detect NP regex's so why not add a few more cases? The answer turned out to be because when the time came to move from NFA's to DFA's, back references would prevent it.

It's going to take a major change mindset to change this situation. It's not that languages like Python or PHP couldn't add an RE like regex engine to their batteries. It's that they would have to acknowledge it's more important to have a safe regex engine than to have the most features. In fact there was some discussion about replacing Python's regex engine not so long ago. I don't know what happened - maybe it's still ongoing. I got all excited and looked up what was being proposed. But no, the one thing I wanted - guaranteed O(1) execution time was not on the list. In fact as far as I could tell, it hadn't even occurred to anyone the current situation might be a problem. They just wanted more features.

Development quotes of the week - regular expression edition

Posted Feb 22, 2022 23:30 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

I believe you're mixing up NFAs with Turing machines; every NFA can be translated into an equivalent DFA (possibly at the cost of having a larger number of states), and NFAs and DFAs can match exactly the same sets of languages (neither can handle lookaround assertions). Other than that, I entirely agree with your assessment of the situation. But at least they added atomic/possessive matches to band-aid over the situation - which is still a lot worse than just using a properly-designed engine in the first place.

Development quotes of the week - regular expression edition

Posted Feb 23, 2022 2:13 UTC (Wed) by ras (subscriber, #33059) [Link]

> I believe you're mixing up NFAs with Turing machines

Well, perhaps. In fact in language theory terms I definitely am. I guess I was loose with my language. Current regexp recognises started life as Thompson's NFA's, and are still called NFA engines today.

Consider this "regular expression":

^.?|(..+?)\\1+$

That regexp can't be represented as an NFA, which I guess is your point. But it *is* handled by regexp libraries in found most languages today and we generally call them NFA engines. That was how I was using the term NFA.

I don't think these "NFA engines" are as powerful as Turning machines, but they do have some extra features like prime number recognition. Those features seem to have a lot of cheerleaders who apparently don't mind that they come at the cost non-deterministic execution time. cf. quote from today's LWN front page story:

> Jonathan Slenders replied that the regex module on the Python Package Index (PyPI) has support for timeouts.

So the fix is to replace one form of non-deterministic behaviour with another. Words fail me, as in my attempt to find polite words to express my opinion of the "fix" failed.

Development quotes of the week - regular expression edition

Posted Feb 17, 2022 12:51 UTC (Thu) by flussence (guest, #85566) [Link] (2 responses)

Oh, here's a fun regex story I had last week:

I had a bunch of syslog data in wildly varying shapes (httpd, postfix, various other servers) and wanted to pull out badly-behaving remote actors' IPs and stuff them into an ipset. Easy enough to do those one at a time, but wouldn't it be nice to do them in a single pass? (Before you run away screaming, I had no intention of using a single regex for the whole thing)

So I used ripgrep's handy --file flag, which allegedly reads one regex per line (and more importantly, supports comments between them). And then I put the IP address part of each line in a capture group, intending to ask for "$1" as output. And... it broke horribly because it turns out something internally rewrites it into a single regex for the whole thing, and the capture groups become sequentially numbered. What about named (?<foo>) groups? That's a no-go too because libpcre loudly complains about duplicates and errors out.

I ended up doing it one at a time.

Development quotes of the week - regular expression edition

Posted Feb 21, 2022 8:54 UTC (Mon) by tlamp (subscriber, #108540) [Link] (1 responses)

> That's a no-go too because libpcre loudly complains about duplicates and errors out.

ripgrep doesn't use libpcre, but the native rust regex crate[0] which is inspired by Google's re2 and has similar pro/cons that NYKevin mentioned above w.r.t. re2.

Maybe it could be worth opening an issue at ripgrep regarding this behavior, but IMO it's a bit odd to use grep/rg/ag/... for this, awk (and derivates, like frawk[1]) sounds like it could be a better option. I use awk quite successfully for searching nginx access logs for unusual patterns.

[0]: https://github.com/rust-lang/regex
[1]: https://github.com/ezrosent/frawk

Development quotes of the week - regular expression edition

Posted Feb 26, 2022 3:02 UTC (Sat) by flussence (guest, #85566) [Link]

> ripgrep doesn't use libpcre, but the native rust regex crate

It uses both, see `ldd`. I was using the --pcre2 flag because there were lookbehinds in a few of the patterns.


Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds