Python support for regular expressions
Regular expressions are a common feature of computer languages, especially higher-level languages like Ruby, Perl, Python, and others, for doing fairly sophisticated text-pattern matching. Some languages, including Perl, incorporate regular expressions into the language itself, while others have classes or libraries that come with the language installation. Python's standard library has the re module, which provides facilities for working with regular expressions; as a recent discussion on the python-ideas mailing list shows, though, that module has somewhat fallen by the wayside in recent times.
Timeouts?
J.B. Langston posted to the list on February 14; he had filed a bug about a problem he encountered using re, which was closed with a suggestion that he bring it to the mailing list. Langston had a regular expression (which is often abbreviated as "regex" or "regexp") that seemed to hang his program when applied to a rarely occurring log message; it turned out that there was a flaw in his regular expression, which caused an enormous amount of backtracking that, effectively, never completed. In the bug report, he was asking for a way to specify a timeout:
I will try to rewrite my regex to address this specific issue, but it's hard to anticipate every possible input and craft a bulletproof regex, so something like this kind of thing can be used for a denial of service attack (intentional or not). In this case the regex was used in an automated import process and caused the process to back up for many hours before someone noticed. Maybe a solution could be to add a timeout option to the regex engine so it will give up and throw an exception if the regex executes for longer than the configured timeout.
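The failure mode Langston hit is classic catastrophic backtracking; it is easy to demonstrate with a textbook pathological pattern (his actual regex is not shown in the thread):

```python
import re
import time

# Nested quantifiers like (a+)+ force the backtracking engine to try
# exponentially many ways to partition the run of 'a's before the overall
# match finally fails (the trailing 'b' guarantees failure).
pattern = re.compile(r'(a+)+$')

for n in (14, 16, 18):
    text = 'a' * n + 'b'
    start = time.perf_counter()
    assert pattern.search(text) is None
    # The elapsed time roughly quadruples with each step of two in n.
    print(f"n={n}: {time.perf_counter() - start:.4f}s")
```

Extending n by only a dozen more characters pushes the running time from milliseconds into hours, which is exactly the "never completes" behavior described in the bug report.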
He elaborated his use case further in his post:
My use case is log parsing and I have a large number of regexes that run over many different log lines. With the volume of regexes I have, it's hard to make sure every regex has no potential problems, especially when the pathological behavior only occurs on certain inputs that may not have been anticipated when developing the regex.

Also because of the volume of data these regexes are parsing, I would never want to allow a regex to run longer than a few milliseconds because if it did, that would kill my log processing throughput. I'd rather that it just raise an exception and move on to the next log entry.
Jonathan Slenders replied that the regex module on the Python Package Index (PyPI) has support for timeouts. regex is meant to be both a drop-in replacement for re (using its "version 0") and to provide support for additional regular-expression features when using "version 1". Those features include nested sets, full Unicode case-folding by default, dropping the global interpreter lock (GIL) during matching for concurrency, and more. The regex home page lists a whole raft of differences when using version 1, including a timeout parameter for the matching functions.
Musings on regular expressions
Tim Peters also mentioned regex, though not because it implements timeouts (which is something he only learned via the thread), but because it is "a terrific module" with many features from newer regular-expression implementations. It is "also harder to provoke into exponential-time bad cases". He discussed some of the tradeoffs that come with regular expressions and recommended the classic book, Mastering Regular Expressions, for those struggling to use them well. He noted that SNOBOL, which is a string-processing language from the 1960s, did not have regular expressions, though there were tasks where it (and its successor of sorts, Icon) could do matching in more natural ways:
Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial. 75% of regexp users will write ".*X", with scant understanding that it may match waaaay more than they intended. Another 20% will write ".*?X", with scant understanding that may extend beyond _just_ "the next" X in some cases. That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.
Those who are not well-versed in regular-expression syntax may find some of that a bit puzzling. While this article cannot be an introduction to regular expressions—there are countless web sites, books, and other resources for that—we will try to give readers a bit of a leg up. In regular-expression syntax, "." represents any single character; adding "*" says to match zero or more occurrences of the previous term, so ".*" matches any string, while ".*X" matches any string up to a literal "X". But, as the following example shows, that may not be exactly what the programmer had in mind:
>>> import re
>>> re.search(r'.*X', 'abcXdefXg')
<re.Match object; span=(0, 8), match='abcXdefX'>
The match value in the object returned from re.search() shows that it matched all the way up to the second occurrence of "X" in the string. That is because the "*" operator is "greedy"—it matches as much as it can. The 20% case in Peters's message uses the non-greedy quantifier "?" to get closer to the result that was asked for:
>>> re.search(r'.*?X', 'abcXdefXg')
<re.Match object; span=(0, 4), match='abcX'>
The 5% case, "[^X]*X", uses a negated character set, "[^X]", which means to match anything but "X". So that regular expression can be read as "match any characters that are not 'X', followed by 'X'", but it may well not be the first thing that comes to mind. Steven D'Aprano was surprised that ".*?X" might match beyond the next "X", but Chris Angelico pointed out that non-greedy does not mean it will only match to the next "X":
Nongreedy means it'll prefer the next X, but it has to be open to checking others.

>>> re.search("a.*?X[^X]*?Y", "zzzabbbXcccXdddYzzz")
<re.Match object; span=(3, 16), match='abbbXcccXdddY'>

The X between bbb and ccc won't result in a match, so the .*? has to capture more.
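Putting Peters's three variants side by side makes the distinction concrete:

```python
import re

s = 'abcXdefXg'
# Greedy: runs to the last 'X' it can reach.
assert re.search(r'.*X', s).group() == 'abcXdefX'
# Non-greedy: prefers the next 'X', but can be forced past it by the
# rest of the pattern, as in Angelico's example.
assert re.search(r'.*?X', s).group() == 'abcX'
# Negated character set: can never cross an 'X', so it always stops
# at the first one.
assert re.search(r'[^X]*X', s).group() == 'abcX'
print("all three behave as described")
```

On this simple input the non-greedy and negated-set forms agree; only the negated set guarantees that agreement on every input.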
The sub-thread extended a ways, looking into deeper aspects of matching "up to the next 'string'", which shows the complications that can arise—complete with mistaken "solutions". As can be seen, regular expressions are a powerful mechanism, but they are complex, can be used inappropriately, and are prone to bugs of various sorts; they are also difficult to test fully and to debug when problems are encountered. But they have become ubiquitous in today's computing landscape. Peters said:
As to why regexps prevailed, traction! They are useful tools, and _started_ life as pretty simple things, with small, elegant, and efficient implementations.

Feature creep and "faster! faster! faster!" turned the implementations more into bottomless pits now ;-)
Matthew Barnett ("MRAB"), who is the developer of regex, agreed with that assessment:
Regexes were simple to start with, so only a few metacharacters were needed, the remaining characters being treated as literals.

As new features were added, the existing metacharacters were used in new ways that had been illegal until then in order to remain backwards-compatible.
Add to that that there are multiple implementations with differing (and sometimes only slightly differing) features and behaviours.
It's a good example of evolution: often messy, and resulting in clunky designs.
Back to timeouts and regex
Langston "quite enjoyed reading" the thread that resulted from his post, but wanted to see if there was support "for adding a timeout feature to the Python re library". He said that he would be investigating regex but still thought re could benefit from a way to stop runaway regular expressions. Angelico was opposed to the idea, suggesting that some of the other matching techniques explored in the thread should be pursued instead. "It would add overhead to common cases in order to put a shield around pathological ones, and it's difficult to impossible to usefully define the cutoff."
As he noted in his first reply, though, Peters thinks that no work is likely to be done on features for re:
Buried in the fun discussion was my guess: no way. Python's re is effectively dead legacy code, with no current "owner". Its commit history shows very little activity for some years already. Most commits are due to generic "code cleanup" crusades that have nothing specific to do with the algorithms. None required non-trivial knowledge of the implementation.
Peters said that the code that makes up the regex module was originally targeted at becoming part of core Python back in 2008 by its original author, Jeffrey C. Jacobs; the work was picked up and carried forward by Barnett, and eventually turned into regex. The request to switch re over to using it was closed in 2021 because of the existence of regex in PyPI; "If someone wants to move it into the Python stdlib, I suggest to start on the python-ideas list first."
The problem is that regex has "_dozens_ of features that would be valuable to have in the standard library", Peters said, so:
[N]o core dev I know of is going to devote their limited time to reproducing a tiny subset of regex's many improvements in Python's legacy engine. In fact, "install regex!" is such an obvious choice at this point that I wouldn't even give time to just reviewing a patch that added timeouts.
Barnett said that he eventually decided against having the regex code added to the standard library, at least in part because "that would tie fixes and additions to Python's release cycle". Python is known for being "batteries included", "but not nuclear reactors", so having regex in PyPI makes more sense, he said. Peters thought that some features in regex might have been nuclear reactors back in 2008, but are being used more commonly today:
[...] Python's re module is frozen in an ever-receding past. Nobody wants to work on it because, well, "regex already does that! In fact, it's been doing it for 15 years already".

Your module is _too_ successful for Python's good ;-)
Peters also pointed out that trying out regex may be easier than Langston realized. In fact, because of the version 0 compatibility mode, he could perhaps add a simple import to give it a whirl:
import regex as re

In another message, Peters further described what he meant by that:
What I wrote here is more elaboration on that _trying_ this is easier than they might be thinking: They don't have to, e.g., rewrite their regexps, or invoke different function or method names, or worry that they'll get different results. The packages are highly compatible in syntax and semantics and APIs so long as you stick to the things re does. That in no way suggests they _should_ stick to what re does. It's assuring them that _getting started_ is close to trivial. Their current code should continue to work unchanged, apart from just changing "re" to "regex".
It turns out that Langston's program already uses regex "via some transitive dependency", but he was not impressed with its performance when compared to re. A test data set of 700MB was processed in 77 seconds with re but it took 92 seconds with regex. Peters was "mildly surprised" by that, since: "Most times people report that regex is at least modestly faster than re".
Whither re?
The performance of regex seems like something that might need attention, especially if it is becoming the de facto regular-expression module for Python. Peters's analysis of the status of re is somewhat disheartening; it is apparently permanently stuck in the past. That may be sufficient for many, but it seems suboptimal to have two separate pieces of the Python ecosystem that both support the more limited re subset; most enhancements are likely to go only into regex, so re will fall further and further behind the state of the art.
Supplanting re with regex in the standard library (using version 0 by default) would seem attractive, though Barnett seems at least somewhat cautious about doing so. A similar thing occurred with the popular Requests HTTP module, which was considered as a possible addition to the standard library at the 2015 Python Language Summit. The conclusion was that it made more sense for Requests to stay out of the standard library because it moves faster than the normal Python release cadence (then 18 months, but now yearly), especially for security updates (which are done more frequently for the language, but not as quickly as Requests can move on its own).
The "batteries included" story for Python has been a major part of its success over the years, but it is starting to fray in various ways. For one thing, the large number of said batteries is straining the maintenance ability of the Python core developers. That has led to discussions, a Python Enhancement Proposal (PEP), and further discussion about removing some parts of the library over the years. Most of the modules in the standard library were added long ago and, once something is added, it is difficult to remove it—even if the reason for its inclusion and the existence of maintainers for it have gone away.
Meanwhile, regular expressions are clearly something that Python programmers use—a lot—so having the best support for them, in one place if possible, seems like the right approach. As Peters noted, they are "a tool with a hyper-concise notation, where a correct expression is pretty much indistinguishable from line noise, and a typo is rarely detectable as a syntax error". Adding risks of incompatibility (or differing performance) into the mix may not lead to much joy either. It is all a bit of a pickle, but not that pickle, of course.
On the other hand, re, which was originally developed by Fredrik Lundh (who sadly died back in November), does what it needs to do for lots of different use cases. Those who need timeouts, atomic groups, nested character sets, possessive quantifiers, and other advanced features have a place to turn. Barnett seems keen to maintain the compatibility with re, so it may turn out to be a situation, like with Requests, where alternatives to the standard library should be recommended, perhaps even in the Python documentation. There is no clear and obvious "right" solution here it seems.
Index entries for this article
Python | Regular expressions
Python | Standard library
Posted Feb 22, 2022 21:53 UTC (Tue)
by brenns10 (subscriber, #112114)
[Link] (5 responses)
Probably as a result of my attitude toward regex matching, I haven't really found any issues with the default "re" module, and didn't even know of the existence of "regex", nor that it was a de-facto standard for many Python users -- I suppose I've never felt that re lacked features.
In any case, I wonder if the original use case would have been solved by a truly "regular" engine, maybe with the option to switch to the more complex (and slower) engine for the rare cases where non-regular features are needed.
Posted Feb 23, 2022 8:23 UTC (Wed)
by marcH (subscriber, #57642)
[Link] (2 responses)
> As can be seen, regular expressions are a powerful mechanism, but they are complex, can be used inappropriately, and are prone to bugs of various sorts; they are also difficult to test fully and to debug when problems are encountered. [...] They are useful tools, and _started_ life as pretty simple things, with small, elegant, and efficient implementations. Feature creep and "faster! faster! faster!" turned the implementations more into bottomless pits now ;-)
and:
> Those who need timeouts, atomic groups, nested character sets, possessive quantifiers, and other advanced footguns ...
Although "timeouts" look more like a security fix than an advanced feature.
Posted Feb 26, 2022 7:22 UTC (Sat)
by flussence (guest, #85566)
[Link] (1 responses)
More of a lesson everyone keeps relearning the hard way; PHP got it right 25 years ago, and it stings every time another system for processing arbitrary code+data on the fly gets it wrong (regex, SQL, shaders, CSS, ...)
Posted Feb 27, 2022 21:06 UTC (Sun)
by marcH (subscriber, #57642)
[Link]
Posted Feb 23, 2022 13:16 UTC (Wed)
by fman (subscriber, #121579)
[Link] (1 responses)
Thanks for that link. That is certainly an enlightening read.
Posted Mar 1, 2022 0:22 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
1. Modify one or both of re/regex to use an NFA/DFA implementation if possible (i.e. if there are no lookarounds/possessives/atomic groups in the expression).
This is 100% backwards compatible, would significantly improve the performance of existing regular expressions, and the only downside is a more complicated implementation.
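Detecting the non-regular constructs NYKevin mentions is straightforward in principle; a rough, hypothetical sketch (the helper names and the exact detection pattern are illustrative, not from the thread, and a real implementation would walk the parsed pattern instead of scanning its text):

```python
import re

# Rough, illustrative scan for constructs that rule out a plain NFA/DFA
# engine: lookarounds, atomic groups, and possessive quantifiers.
NON_REGULAR = re.compile(
    r'''\(\?[=!]        # lookahead  (?=...) / (?!...)
      | \(\?<[=!]       # lookbehind (?<=...) / (?<!...)
      | \(\?>           # atomic group (?>...)
      | [*+?]\+         # possessive quantifiers a*+ / a++ / a?+
    ''',
    re.VERBOSE,
)

def is_regular(pattern: str) -> bool:
    """True if no NFA/DFA-hostile feature appears in the pattern text."""
    return NON_REGULAR.search(pattern) is None

assert is_regular(r'[^X]*X')
assert not is_regular(r'(?=foo)bar')
assert not is_regular(r'a*+b')
```

A naive text scan like this can be fooled by escaped parentheses, which is why the hypothetical flag would belong inside compile(), where the pattern has already been parsed.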
Posted Feb 22, 2022 22:01 UTC (Tue)
by atnot (subscriber, #124910)
[Link] (4 responses)
Posted Feb 23, 2022 1:56 UTC (Wed)
by fsateler (subscriber, #65497)
[Link] (3 responses)
Posted Feb 23, 2022 15:00 UTC (Wed)
by simcop2387 (subscriber, #101710)
[Link] (2 responses)
Posted Feb 26, 2022 0:58 UTC (Sat)
by rra (subscriber, #99804)
[Link] (1 responses)
As the module maintainer, this is the best of both worlds for me. I can release more quickly than Perl itself and test changes and new features, Perl will pick up the stable version during its release cycles, and people who desperately need the new version for some reason can depend on it directly with a version bound, which will trigger the correct behavior from various module managers.
Posted Mar 10, 2022 1:26 UTC (Thu)
by bartoc (guest, #124262)
[Link]
The re module is fine for a lot of stuff, we use it in some python scripts related to our build system and test harness, and we wouldn't want to use regex from pip since re works well enough for us, and regex doesn't come pre-installed. That said if it _did_ come pre-installed I'd be tempted to go through and upgrade all our python scripts, even if it was maintained and released on a schedule separate from the standard library proper.
Sidenote: I really like raku/perl6's alternate regex syntax, it's nice.
Posted Feb 22, 2022 23:38 UTC (Tue)
by iabervon (subscriber, #722)
[Link] (6 responses)
AFAICT, "requests" hasn't removed any documented features since 2015; on the other hand, "construct" has not been as stable, even over a shorter period of time.
Posted Feb 23, 2022 0:03 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
* On the one hand, import statements use a relatively straightforward, easy to understand set of semantics (i.e. "just stick a bunch of py files in a directory structure, and you're done!"). This is good for scripting purposes, because you don't have to faff about with something like CMake just to build an entirely self-contained app.
Perhaps it would've been less bad if Python had used a slightly more elaborate syntax instead:
from python import re: 3.11+ # Must include Python version for standard library modules.
Unfortunately, that would be egregiously incompatible with existing usage, so it's probably too late now. Oh well.
Posted Feb 23, 2022 8:14 UTC (Wed)
by marcH (subscriber, #57642)
[Link]
Posted Feb 23, 2022 8:28 UTC (Wed)
by interalia (subscriber, #26615)
[Link] (1 responses)
But anyway, do you think what you said about every language doing packaging/dependencies poorly is partly due to every approach having advantages/disadvantages, that being the nature of software development?
Posted Feb 24, 2022 18:45 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
Posted Feb 23, 2022 10:24 UTC (Wed)
by MrWim (subscriber, #47432)
[Link]
I realise this is a bit of a tangent, the subject is Python and not rust, but here goes anyway:
> On the one hand, import statements use a relatively straightforward, easy to understand set of semantics (i.e. "just stick a bunch of py files in a directory structure, and you're done!").
This is where rust isn't so good. It can be confusing how `mod` and `use` and `Cargo.toml` interrelate.
>This is good for scripting purposes, because you don't have to faff about with something like CMake just to build an entirely self-contained app.
Cargo helps here as it's the single blessed build system, and comes bundled with rust.
> On the other, those semantics are perhaps *too* simple, because there is no way to specify "I need version X or greater" within the import statement itself,
With Cargo these versions are specified in `Cargo.toml` - so still external.
> nor where the module actually comes from.
Cargo does restrict where the module comes from. You'll get the version you specified in your Cargo.toml, and you can't `use` things that you haven't specified there - even if they're included in a transitive dependency.
> To add insult to injury, you can't have two different versions of the same module in the same process
Cargo and rust fix this with clever symbol mangling. It could still be an issue with C dependencies, but they are relatively rare in rust land vs. Python
> And, of course, you have the common beginner mistake of accidentally naming a Python script after a stdlib module
This isn't a problem with rust - you can use from your local crate with `use crate::mymod`. And anyway - if you get it wrong you'll get a compile error, it's not going to silently give you the wrong behaviour at runtime.
Thanks for your comment, it encouraged me to finally publish that blog post. It had been sitting almost finished for about a year now.
Posted Feb 23, 2022 18:17 UTC (Wed)
by smoogen (subscriber, #97)
[Link]
I remember all the OS packaging and dependency fights about how they all did it terribly (and thus resulting in a new OS which would use some new method which would be fine for a simple case and then fall over horribly when faced with 'the real world'). Then it seemed like those conversations went dormant, but instead they had all moved to ${SCRIPT_LANGUAGE} which would again seem to come up with some method that people thought would be better/faster/cleaner than say .deb/.rpm/.etc. First the lines would be 'this is better', then they go to 'it does a good enough job', and finally 'it is horrible but we have too large of an ecosystem to change it.'
Posted Feb 23, 2022 1:34 UTC (Wed)
by da4089 (subscriber, #1195)
[Link]
Switching my brain to "no, you should use 'regex' again now" is going to be fun ...
Posted Feb 23, 2022 2:20 UTC (Wed)
by klossner (subscriber, #30046)
[Link] (2 responses)
Posted Feb 23, 2022 8:58 UTC (Wed)
by lkundrak (subscriber, #43452)
[Link]
Posted Feb 23, 2022 17:47 UTC (Wed)
by salewski (subscriber, #121521)
[Link]
That makes we want to learn SNOBOL.
Posted Feb 24, 2022 13:02 UTC (Thu)
by dveeden (subscriber, #120424)
[Link] (2 responses)
This would also work in other situations.
Posted Feb 24, 2022 14:13 UTC (Thu)
by kleptog (subscriber, #1183)
[Link] (1 responses)
The "recommended" approach is to spawn a process/thread, run the regex there, and implement the timeout by having the main process kill the child on timeout.
Not exactly an approach that scales or performs, especially if you're already using threads for something else, because only the main thread handles signals. The regex module solves it directly.
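That spawn-and-kill workaround can be sketched with only the standard library; this is a hypothetical helper, not code from the thread, and it assumes a platform where killing the child mid-match is safe (the child's state is isolated from the parent's):

```python
import re
from multiprocessing import Process, Queue

def _worker(pattern, text, q):
    # Runs in the child; the parent can kill us without corrupting its state.
    q.put(re.search(pattern, text) is not None)

def search_with_timeout(pattern, text, timeout):
    """Run re.search() in a child process, killing it after `timeout` seconds."""
    q = Queue()
    p = Process(target=_worker, args=(pattern, text, q))
    p.start()
    p.join(timeout)
    if p.is_alive():            # still churning: assume runaway backtracking
        p.terminate()
        p.join()
        raise TimeoutError(f"regex took longer than {timeout}s")
    return q.get()

if __name__ == '__main__':
    assert search_with_timeout(r'abc', 'xxabcxx', 5.0) is True
    try:
        search_with_timeout(r'(a+)+$', 'a' * 64 + 'b', 0.2)
    except TimeoutError as exc:
        print(exc)
```

The per-match cost of forking a process is exactly the overhead njs describes below; it only makes sense when matches are rare and large, not in a tight log-parsing loop.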
Posted Mar 4, 2022 2:02 UTC (Fri)
by njs (subscriber, #40338)
[Link]
You can spawn a child process and then kill it – that's safe because the child's state is isolated from the parent's. But that's way too expensive and fragile to use for all regex matches. Much better to just teach your regex engine to check a timeout/abort flag on a regular basis.
Posted Feb 24, 2022 17:49 UTC (Thu)
by esemwy (guest, #83963)
[Link]
Posted Feb 26, 2022 6:31 UTC (Sat)
by Otus (subscriber, #67685)
[Link]
Posted Nov 3, 2022 16:01 UTC (Thu)
by swilmet (subscriber, #98424)
[Link]
Great article and a good summary of the status quo of Python's regex support. Searching for PCRE, GLib or GRegex gave no result on this page, so I would like to mention them.
PCRE (Perl Compatible Regular Expressions) is a widely-used library. GLib (part of the GTK and GNOME project) has the GRegex API which is or should be entirely usable from Python, and uses PCRE (GRegex has been ported to PCRE 2 recently). It is usable from Python thanks to GObject Introspection automatic language bindings (with PyGObject).
So GRegex is another alternative for Python developers.
> [1]: https://swtch.com/~rsc/regexp/regexp1.html
Makes me wonder if what is *really* needed isn't a "simple-re" module with a "Thompson NFA" regex engine. A six-digit speedup should be worth aiming for, after all.
2. Add a flag to re/regex.compile() that throws if it sees any of those features in the expression to be compiled. Off by default, must be explicitly passed.
2½. For bonus points, make a locked_down_regex module (preferably with a better name) that is exactly like re/regex, except the flag in (2) is always passed for you automatically and cannot be turned off by any means. This is analogous to the use of the secrets module in lieu of random. Since it's a whole new module, it won't break anything and must be opted-into, but OTOH it's easy to audit whether you are using the "right" module if your org cares.
* On the other, those semantics are perhaps *too* simple, because there is no way to specify "I need version X or greater" within the import statement itself, nor where the module actually comes from. So now that information needs to live in metadata somewhere, and get tracked and managed separately by a tool like Pip.
* To add insult to injury, you can't have two different versions of the same module in the same process, without doing all sorts of nasty hacks that may or may not break something depending on how the underlying module works (e.g. Does it check __name__? Does it fiddle around with sys.modules? etc.).
* And, of course, you have the common beginner mistake of accidentally naming a Python script after a stdlib module (Python will prefer to re-import the script a second time, rather than using the stdlib module, and then everything breaks because it probably won't implement the stdlib module's API). This can break backcompat if a new stdlib module is introduced with the same name as one of your existing modules, but for some reason nobody seems to care about that failure mode.
from pypi import regex: 3.9+ # Also need version for PyPI modules - if it's not installed or too old, then error out.
from local import regex # "local" means "don't search sys.path, just check the __main__ module's containing directory for regex.py." No version.
from my_custom_namespace import regex: 1.2.3+ # You can install custom hooks to handle imports in whatever crazy way you want.
:(