So where did end up for earlier verisons of Python?

Posted Nov 28, 2013 8:24 UTC (Thu) by kleptog (subscriber, #1183)
In reply to: So where did end up for earlier verisons of Python? by ras
Parent article: Python adopts SipHash

Is that (fix the non-deterministic behaviour of regexs) even possible? Most regexes are fairly simple, but as the quantifiers increase it becomes more likely that you can make a pathelogical input for it. Can you do anything other that just flagging it?

So where did end up for earlier verisons of Python?

Posted Nov 28, 2013 10:34 UTC (Thu) by epa (subscriber, #39769) [Link] (4 responses)

Provided you limit yourself to true regular expressions (forgoing non-regular extensions such as backreferences), I believe the worst-case performance can be made tractable. This paper http://swtch.com/~rsc/regexp/regexp1.html has more details.

So where did end up for earlier verisons of Python?

Posted Nov 29, 2013 12:23 UTC (Fri) by k8to (guest, #15413) [Link] (3 responses)

Sadly the modern regex implementations in use all do not limit themselves in this way.

Morover, the fancy regex engines like pcre are actually significantly faster than this implementation for the typical cases.

So it's really a hard sell. You have to convince the users of your regex interface to accept losing all the functionality they now rely on, and also accept a notable speed penalty for your expected cases, in the name of achieving more expectable behavior for exceptional cases.

For some software that's reasonable. For other software it's better to just limit the recursion and handle the failures.

So where did end up for earlier verisons of Python?

Posted Nov 29, 2013 17:18 UTC (Fri) by epa (subscriber, #39769) [Link] (2 responses)

Actually, I think it is backreferences and other non-regular features that are the 'exceptional case' - 90% of regexps do not use them. And most regexp matches are so fast that they are not the performance bottleneck in a program that uses them. The cases where people care about speed are obviously the ones that run particularly slowly, which is often (though not always) due to poor worst-case performance.

It would therefore be better to use the safer (as in more predictable) NFA implementation by default, with DFA available as an option if you want to do non-regular matching, or if you've made a conscious decision to sacrifice worst-case performance for better performance on your typical workload.

But in perl, at least, the regexp engine is so tied up with the guts of the language it would be a Herculean task to replace it.

Limiting the allowed recursion is a good idea and it would be nice if languages offered a friendly way to set this.

So where did end up for earlier verisons of Python?

Posted Nov 30, 2013 2:25 UTC (Sat) by mchapman (subscriber, #66589) [Link] (1 responses)

> But in perl, at least, the regexp engine is so tied up with the guts of the language it would be a Herculean task to replace it.

That used to be the case, however since Perl 5.9.5 different RE engines can be used in place of the native one. See http://search.cpan.org/search?query=re%3A%3Aengine for some examples.

So where did end up for earlier verisons of Python?

Posted Nov 30, 2013 8:24 UTC (Sat) by epa (subscriber, #39769) [Link]

Yes, you can plug in other engines, but all of those (even PCRE I expect) have different semantics to the builtin regexp engine. I was talking of replacing the standard engine used in all code, with all the same behaviour including Unicode handling and probably even 'experimental' features like (?{}). That would be a huge effort.

After all, the idea is not that the programmer can specially choose a different regexp implementation with more predictable performance - that is already the case - but to change the default so that all programmers, even those who don't read LWN and didn't study finite automaton theory, can write code which doesn't have hidden DoS performance bugs.

So where did end up for earlier verisons of Python?

Posted Nov 29, 2013 1:18 UTC (Fri) by ras (subscriber, #33059) [Link]

> Is that (fix the non-deterministic behaviour of regexs) even possible?

Yes and no.

We use finite automata to recognise regex's. A automata is just a fancy name for a directed graph. The lines on the graph that connect one state to the next are labelled with the input character. If there is no line from the current state labelled with the next input character the regex didn't match the input. If there is a line, you move onto the next state and input character. If this eventually gets you to the end of the graph, you have a match. The re.compile() call compiles your pattern into the said graph. At least that is the theory. In practice Python's re module is derived from a the C PCRE library, which looks to work differently again. It compiles to a mini program - but it turns out that program is equivalent to a automata.

There are two distinct types of automata - Deterministic and Non-Deterministic. (Actually in theory there is also infinite and infinite, which refers to the number of states in the graph. For obvious reasons practical computers only deal with finite graphs.) The sole difference between a NFA (Non-deterministic finite automata) and a DFA (deterministic finite automata) is an NFA state is allowed to have several lines labelled with the same input character. Which is where the "non-deterministic" comes from - when confronted that input character an NFA has no deterministic way of knowing which line to choose. In practice a real computer must execute all of them, either in parallel or using backtracking (which is effectively what PCRE does).

That difference between NFA and DFA oddly doesn't effect the their power to recognise complex regular expressions overly. What it does effect is their performance characteristics. The space used by compiled graph on an NFA is of the same order as the size of the regular expression it is compiled from (ie the relationship between the size of the regular expression and the size of the compiled graph is linear). However, for some inputs it's execution time can be exponential to the number of characters in the input.

In contrast a DFA's execution time is always linear to the size of it's input. In practice the execution time is always faster than that implies because an NFA has to do substantially more bookkeeping. Studies I've seen say around 20 times faster. However, the space used by its compiled graph is in the worst case an exponential function of the size of the pattern it is executing. Worse, notice I used the term "execution time" when describing their relative speed. That is because in an interpreted language like Python always has to compile the regular expression at run time. That compiling can also take an exponential amount of time. So this is where the "no" in my "Yes and No" answer comes from. Both can take exponential amounts of time.

A backtracking recogniser like PCRE gets a lot of it's speed by delaying most of the work until the it recognises the input. This can save a lot of time (think a lot of '|' conditions that aren't executed.) That's were PCRE gets it's speed from in the "usual case". So if you are recognising a regular expression just once PCRE is usually faster than other methods, when you include compile time. But it is also the worst in execution time if it strikes a pathological input.

So now we get the "Yes" part of my "Yes and No" answer. When we are taking about protecting against a DOS, the nice thing about a DFA is it's predictable. Predictable in the sense that while the compile step incurs a large upfront cost, that cost is always the same in every run of the program - including testing runs. If it is a problem you change your regex during testing. So the user will never see a program fail because of some input the programmer hadn't tested for. PCRE is at the opposite end of the spectrum. You will never see an issue from the speed of PCRE compiles. Unless you are very, very good you won't generate a pathological input during testing. But your program can mysteriously fail (well, hang) during production.

That had never happened to me for 20 years. And now suddenly, it's happened every couple of months. Having programs fail on inputs I am not and never will be smart enough to generate test cases for is not good. Looking at this from the standpoint of a Software Engineer trying to build reliable things, I'll wear the upfront cost of DFA's every time. Hell, the upfront cost of DFA's needn't even be incurred at runtime even in languages like Python. All the re module need do it allow you to save the compiled form somewhere so you can reuse it at runtime. And guess what? If you eliminate the time it takes to compile a DFA, it's execution time is always linear based on the size of the input. That is how my "Yes" answer arises.