|
|
Log in / Subscribe / Register

Development quotes of the week - regular expression edition

Development quotes of the week - regular expression edition

Posted Feb 20, 2022 23:00 UTC (Sun) by ras (subscriber, #33059)
In reply to: Development quotes of the week - regular expression edition by NYKevin
Parent article: Development quotes of the week - regular expression edition

It's not completely bizarre. It made perfect to use an NFA back when Thompson first did it. CPU's were slow, RAM was hideously expensive.

At some point, sometime between 10 and 20 years ago, that decision made it's way across the wide blurry line delineating "good engineering decision" to "actively harmful". Using an NFA instead of a DFA is now so far into "actively harmful" territory it's bloody annoying - I get bitten by non-deterministically lurking exponential time blowup bugs at least once a year now. Well maybe it's not non-deterministic, but you would have to be superman to figure out when a given NFA engine take until until the heat death of the universe to process some random real world input, while breezing though all the unit tests you give it. And it will do that despite you give it a well formed RE a DFA could easily represent in sane amount of space.

If I had a choice, I would use any DFA based engine (like RE) every time. But the regex engines most languages put in their "comes with batteries" standard libraries are NFA's, because some bright spark decided to allow back reference matching donkey's years ago. I presume they added it because it's something NFA's can do easily, and that fact that it turned an O(1) problem into an NP one didn't really matter because hey, NFA's already can't detect NP regex's so why not add a few more cases? The answer turned out to be because when the time came to move from NFA's to DFA's, back references would prevent it.

It's going to take a major change mindset to change this situation. It's not that languages like Python or PHP couldn't add an RE like regex engine to their batteries. It's that they would have to acknowledge it's more important to have a safe regex engine than to have the most features. In fact there was some discussion about replacing Python's regex engine not so long ago. I don't know what happened - maybe it's still ongoing. I got all excited and looked up what was being proposed. But no, the one thing I wanted - guaranteed O(1) execution time was not on the list. In fact as far as I could tell, it hadn't even occurred to anyone the current situation might be a problem. They just wanted more features.


to post comments

Development quotes of the week - regular expression edition

Posted Feb 22, 2022 23:30 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

I believe you're mixing up NFAs with Turing machines; every NFA can be translated into an equivalent DFA (possibly at the cost of having a larger number of states), and NFAs and DFAs can match exactly the same sets of languages (neither can handle lookaround assertions). Other than that, I entirely agree with your assessment of the situation. But at least they added atomic/possessive matches to band-aid over the situation - which is still a lot worse than just using a properly-designed engine in the first place.

Development quotes of the week - regular expression edition

Posted Feb 23, 2022 2:13 UTC (Wed) by ras (subscriber, #33059) [Link]

> I believe you're mixing up NFAs with Turing machines

Well, perhaps. In fact in language theory terms I definitely am. I guess I was loose with my language. Current regexp recognises started life as Thompson's NFA's, and are still called NFA engines today.

Consider this "regular expression":

^.?|(..+?)\\1+$

That regexp can't be represented as an NFA, which I guess is your point. But it *is* handled by regexp libraries in found most languages today and we generally call them NFA engines. That was how I was using the term NFA.

I don't think these "NFA engines" are as powerful as Turning machines, but they do have some extra features like prime number recognition. Those features seem to have a lot of cheerleaders who apparently don't mind that they come at the cost non-deterministic execution time. cf. quote from today's LWN front page story:

> Jonathan Slenders replied that the regex module on the Python Package Index (PyPI) has support for timeouts.

So the fix is to replace one form of non-deterministic behaviour with another. Words fail me, as in my attempt to find polite words to express my opinion of the "fix" failed.


Copyright © 2026, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds