Python support for regular expressions
Regular expressions are a common feature of computer languages, especially higher-level languages like Ruby, Perl, Python, and others, for doing fairly sophisticated text-pattern matching. Some languages, including Perl, incorporate regular expressions into the language itself, while others have classes or libraries that come with the language installation. Python's standard library has the re module, which provides facilities for working with regular expressions; as a recent discussion on the python-ideas mailing list shows, though, that module has somewhat fallen by the wayside in recent times.
Timeouts?
J.B. Langston posted to the list on February 14; he had filed a bug about a problem he encountered using re, which was closed with a suggestion that he bring it to the mailing list. Langston had a regular expression (which is often abbreviated as "regex" or "regexp") that seemed to hang his program when applied to a rarely occurring log message; it turned out that there was a flaw in his regular expression, which caused an enormous amount of backtracking that, effectively, never completed. In the bug report, he was asking for a way to specify a timeout:
I will try to rewrite my regex to address this specific issue, but it's hard to anticipate every possible input and craft a bulletproof regex, so something like this kind of thing can be used for a denial of service attack (intentional or not). In this case the regex was used in an automated import process and caused the process to back up for many hours before someone noticed. Maybe a solution could be to add a timeout option to the regex engine so it will give up and throw an exception if the regex executes for longer than the configured timeout.
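The failure mode Langston hit is classic catastrophic backtracking; it is easy to demonstrate with a textbook pathological pattern (his actual regex is not shown in the thread):

```python
import re
import time

# Nested quantifiers like (a+)+ force the backtracking engine to try
# exponentially many ways to partition the run of 'a's before the overall
# match finally fails (the trailing 'b' guarantees failure).
pattern = re.compile(r'(a+)+$')

for n in (14, 16, 18):
    text = 'a' * n + 'b'
    start = time.perf_counter()
    assert pattern.search(text) is None
    # The elapsed time roughly quadruples with each step of two in n.
    print(f"n={n}: {time.perf_counter() - start:.4f}s")
```

Extending n by only a dozen more characters pushes the running time from milliseconds into hours, which is exactly the "never completes" behavior described in the bug report.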
He elaborated his use case further in his post:
My use case is log parsing and I have a large number of regexes that run over many different log lines. With the volume of regexes I have, it's hard to make sure every regex has no potential problems, especially when the pathological behavior only occurs on certain inputs that may not have been anticipated when developing the regex.

Also because of the volume of data these regexes are parsing, I would never want to allow a regex to run longer than a few milliseconds because if it did, that would kill my log processing throughput. I'd rather that it just raise an exception and move on to the next log entry.
Jonathan Slenders replied that the regex module on the Python Package Index (PyPI) has support for timeouts. regex is meant to be both a drop-in replacement for re (using its "version 0") and to provide support for additional regular-expression features when using "version 1". Those features include nested sets, full Unicode case-folding by default, dropping the global interpreter lock (GIL) during matching for concurrency, and more. The regex home page lists a whole raft of differences when using version 1, including a timeout parameter for the matching functions.
Musings on regular expressions
Tim Peters also mentioned regex, though not because it implements timeouts (which is something he only learned via the thread), but because it is "a terrific module" with many features from newer regular-expression implementations. It is "also harder to provoke into exponential-time bad cases". He discussed some of the tradeoffs that come with regular expressions and recommended the classic book, Mastering Regular Expressions, for those struggling to use them well. He noted that SNOBOL, which is a string-processing language from the 1960s, did not have regular expressions, though there were tasks where it (and its successor of sorts, Icon) could do matching in more natural ways:
Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial. 75% of regexp users will write ".*X", with scant understanding that it may match waaaay more than they intended. Another 20% will write ".*?X", with scant understanding that may extend beyond _just_ "the next" X in some cases. That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start.
Those who are not well-versed in regular-expression syntax may find some of that a bit puzzling. While this article cannot be an introduction to regular expressions—there are countless web sites, books, and other resources for that—we will try to give readers a bit of a leg up. In regular-expression syntax, "." represents any single character; adding "*" says to match zero or more occurrences of the previous term, so ".*" matches any string, while ".*X" matches any string up to a literal "X". But, as the following example shows, that may not be exactly what the programmer had in mind:
>>> import re
>>> re.search(r'.*X', 'abcXdefXg')
<re.Match object; span=(0, 8), match='abcXdefX'>
The match value in the object returned from re.search() shows that it matched all the way up to the second occurrence of "X" in the string. That is because the "*" operator is "greedy"—it matches as much as it can. The 20% case in Peters's message uses the non-greedy quantifier "?" to get closer to the result that was asked for:
>>> re.search(r'.*?X', 'abcXdefXg')
<re.Match object; span=(0, 4), match='abcX'>
The 5% case, "[^X]*X", uses a negated character set, "[^X]", which means to match anything but "X". So that regular expression can be read as "match any characters that are not 'X', followed by 'X'", but it may well not be the first thing that comes to mind. Steven D'Aprano was surprised that ".*?X" might match beyond the next "X", but Chris Angelico pointed out that non-greedy does not mean it will only match to the next "X":
Nongreedy means it'll prefer the next X, but it has to be open to checking others.

>>> re.search("a.*?X[^X]*?Y", "zzzabbbXcccXdddYzzz")
<re.Match object; span=(3, 16), match='abbbXcccXdddY'>

The X between bbb and ccc won't result in a match, so the .*? has to capture more.
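Putting Peters's three variants side by side makes the distinction concrete:

```python
import re

s = 'abcXdefXg'
# Greedy: runs to the last 'X' it can reach.
assert re.search(r'.*X', s).group() == 'abcXdefX'
# Non-greedy: prefers the next 'X', but can be forced past it by the
# rest of the pattern, as in Angelico's example.
assert re.search(r'.*?X', s).group() == 'abcX'
# Negated character set: can never cross an 'X', so it always stops
# at the first one.
assert re.search(r'[^X]*X', s).group() == 'abcX'
print("all three behave as described")
```

On this simple input the non-greedy and negated-set forms agree; only the negated set guarantees that agreement on every input.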
The sub-thread extended a ways, looking into deeper aspects of matching "up to the next 'string'", which shows the complications that can arise—complete with mistaken "solutions". As can be seen, regular expressions are a powerful mechanism, but they are complex, can be used inappropriately, and are prone to bugs of various sorts; they are also difficult to test fully and to debug when problems are encountered. But they have become ubiquitous in today's computing landscape. Peters said:
As to why regexps prevailed, traction! They are useful tools, and _started_ life as pretty simple things, with small, elegant, and efficient implementations.

Feature creep and "faster! faster! faster!" turned the implementations more into bottomless pits now ;-)
Matthew Barnett ("MRAB"), who is the developer of regex, agreed with that assessment:
Regexes were simple to start with, so only a few metacharacters were needed, the remaining characters being treated as literals.

As new features were added, the existing metacharacters were used in new ways that had been illegal until then in order to remain backwards-compatible.
Add to that that there are multiple implementations with differing (and sometimes only slightly differing) features and behaviours.
It's a good example of evolution: often messy, and resulting in clunky designs.
Back to timeouts and regex
Langston "quite enjoyed reading" the thread that resulted from his post, but wanted to see if there was support "for adding a timeout feature to the Python re library". He said that he would be investigating regex but still thought re could benefit from a way to stop runaway regular expressions. Angelico was opposed to the idea, suggesting that some of the other matching techniques explored in the thread should be pursued instead. "It would add overhead to common cases in order to put a shield around pathological ones, and it's difficult to impossible to usefully define the cutoff."
As he noted in his first reply, though, Peters thinks that no work is likely to be done on features for re:
Buried in the fun discussion was my guess: no way. Python's re is effectively dead legacy code, with no current "owner". Its commit history shows very little activity for some years already. Most commits are due to generic "code cleanup" crusades that have nothing specific to do with the algorithms. None required non-trivial knowledge of the implementation.
Peters said that the code that makes up the regex module was originally targeted at becoming part of core Python back in 2008 by its original author, Jeffrey C. Jacobs; the work was picked up and carried forward by Barnett, and eventually turned into regex. The request to switch re over to using it was closed in 2021 because of the existence of regex in PyPI; "If someone wants to move it into the Python stdlib, I suggest to start on the python-ideas list first."
The problem is that regex has "_dozens_ of features that would be valuable to have in the standard library", Peters said, so:
[N]o core dev I know of is going to devote their limited time to reproducing a tiny subset of regex's many improvements in Python's legacy engine. In fact, "install regex!" is such an obvious choice at this point that I wouldn't even give time to just reviewing a patch that added timeouts.
Barnett said that he eventually decided against having the regex code added to the standard library, at least in part because "that would tie fixes and additions to Python's release cycle". Python is known for being "batteries included", "but not nuclear reactors", so having regex in PyPI makes more sense, he said. Peters thought that some features in regex might have been nuclear reactors back in 2008, but are being used more commonly today:
[...] Python's re module is frozen in an ever-receding past. Nobody wants to work on it because, well, "regex already does that! In fact, it's been doing it for 15 years already".

Your module is _too_ successful for Python's good ;-)
Peters also pointed out that trying out regex may be easier than Langston realized. In fact, because of the version 0 compatibility mode, he could perhaps add a simple import to give it a whirl:
import regex as re

In another message, Peters further described what he meant by that:
What I wrote here is more elaboration on that _trying_ this is easier than they might be thinking: They don't have to, e.g., rewrite their regexps, or invoke different function or method names, or worry that they'll get different results. The packages are highly compatible in syntax and semantics and APIs so long as you stick to the things re does. That in no way suggests they _should_ stick to what re does. It's assuring them that _getting started_ is close to trivial. Their current code should continue to work unchanged, apart from just changing "re" to "regex".
It turns out that Langston's program already uses regex "via some transitive dependency", but he was not impressed with its performance when compared to re. A test data set of 700MB was processed in 77 seconds with re but it took 92 seconds with regex. Peters was "mildly surprised" by that, since: "Most times people report that regex is at least modestly faster than re".
Whither re?
The performance of regex seems like something that might need attention, especially if it is becoming the de facto regular-expression module for Python. Peters's analysis of the status of re is somewhat disheartening; it is apparently permanently stuck in the past. That may be sufficient for many, but it seems suboptimal to have two separate pieces of the Python ecosystem that both support the more limited re subset; most enhancements are likely to go only into regex, so re will fall further and further behind the state of the art.
Supplanting re with regex in the standard library (using version 0 by default) would seem attractive, though Barnett seems at least somewhat cautious about doing so. A similar thing occurred with the popular Requests HTTP module, which was considered as a possible addition to the standard library at the 2015 Python Language Summit. The conclusion was that it made more sense for Requests to stay out of the standard library because it moves faster than the normal Python release cadence (then 18 months, but now yearly), especially for security updates (which are done more frequently for the language, but not as quickly as Requests can move on its own).
The "batteries included" story for Python has been a major part of its success over the years, but it is starting to fray in various ways. For one thing, the large number of said batteries is straining the maintenance ability of the Python core developers. That has led to discussions, a Python Enhancement Proposal (PEP), and further discussion about removing some parts of the library over the years. Most of the modules in the standard library were added long ago and, once something is added, it is difficult to remove it—even if the reason for its inclusion and the existence of maintainers for it have gone away.
Meanwhile, regular expressions are clearly something that Python programmers use—a lot—so having the best support for them, in one place if possible, seems like the right approach. As Peters noted, they are "a tool with a hyper-concise notation, where a correct expression is pretty much indistinguishable from line noise, and a typo is rarely detectable as a syntax error". Adding risks of incompatibility (or differing performance) into the mix may not lead to much joy either. It is all a bit of a pickle, but not that pickle, of course.
On the other hand, re, which was originally developed by Fredrik Lundh (who sadly died back in November), does what it needs to do for lots of different use cases. Those who need timeouts, atomic groups, nested character sets, possessive quantifiers, and other advanced features have a place to turn. Barnett seems keen to maintain the compatibility with re, so it may turn out to be a situation, like with Requests, where alternatives to the standard library should be recommended, perhaps even in the Python documentation. There is no clear and obvious "right" solution here it seems.
Index entries for this article
Python | Regular expressions
Python | Standard library
Posted Feb 22, 2022 21:53 UTC (Tue)
by brenns10 (subscriber, #112114)
[Link] (5 responses)
Probably as a result of my attitude toward regex matching, I haven't really found any issues with the default "re" module, and didn't even know of the existence of "regex", nor that it was a de-facto standard for many Python users -- I suppose I've never felt that re lacked features.
In any case, I wonder if the original use case would have been solved by a truly "regular" engine, maybe with the option to switch to the more complex (and slower) engine for the rare cases where non-regular features are needed.
Posted Feb 23, 2022 8:23 UTC (Wed)
by marcH (subscriber, #57642)
[Link] (2 responses)
> As can be seen, regular expressions are a powerful mechanism, but they are complex, can be used inappropriately, and are prone to bugs of various sorts; they are also difficult to test fully and to debug when problems are encountered. [...] They are useful tools, and _started_ life as pretty simple things, with small, elegant, and efficient implementations. Feature creep and "faster! faster! faster!" turned the implementations more into bottomless pits now ;-)
and:
> Those who need timeouts, atomic groups, nested character sets, possessive quantifiers, and other advanced footguns ...
Although "timeouts" look more like a security fix than an advanced feature.
Posted Feb 26, 2022 7:22 UTC (Sat)
by flussence (guest, #85566)
[Link] (1 responses)
More of a lesson everyone keeps relearning the hard way; PHP got it right 25 years ago, and it stings every time another system for processing arbitrary code+data on the fly gets it wrong (regex, SQL, shaders, CSS, ...)
Posted Feb 27, 2022 21:06 UTC (Sun)
by marcH (subscriber, #57642)
[Link]
Posted Feb 23, 2022 13:16 UTC (Wed)
by fman (subscriber, #121579)
[Link] (1 responses)
Thanks for that link. That is certainly an enlightening read.
Posted Mar 1, 2022 0:22 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link]
1. Modify one or both of re/regex to use an NFA/DFA implementation if possible (i.e. if there are no lookarounds/possessives/atomic groups in the expression).
This is 100% backwards compatible, would significantly improve the performance of existing regular expressions, and the only downside is a more complicated implementation.
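Detecting the non-regular constructs NYKevin mentions is straightforward in principle; a rough, hypothetical sketch (the helper names and the exact detection pattern are illustrative, not from the thread, and a real implementation would walk the parsed pattern instead of scanning its text):

```python
import re

# Rough, illustrative scan for constructs that rule out a plain NFA/DFA
# engine: lookarounds, atomic groups, and possessive quantifiers.
NON_REGULAR = re.compile(
    r'''\(\?[=!]        # lookahead  (?=...) / (?!...)
      | \(\?<[=!]       # lookbehind (?<=...) / (?<!...)
      | \(\?>           # atomic group (?>...)
      | [*+?]\+         # possessive quantifiers a*+ / a++ / a?+
    ''',
    re.VERBOSE,
)

def is_regular(pattern: str) -> bool:
    """True if no NFA/DFA-hostile feature appears in the pattern text."""
    return NON_REGULAR.search(pattern) is None

assert is_regular(r'[^X]*X')
assert not is_regular(r'(?=foo)bar')
assert not is_regular(r'a*+b')
```

A naive text scan like this can be fooled by escaped parentheses, which is why the hypothetical flag would belong inside compile(), where the pattern has already been parsed.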
Posted Feb 22, 2022 22:01 UTC (Tue)
by atnot (subscriber, #124910)
[Link] (4 responses)
Posted Feb 23, 2022 1:56 UTC (Wed)
by fsateler (subscriber, #65497)
[Link] (3 responses)
Posted Feb 23, 2022 15:00 UTC (Wed)
by simcop2387 (subscriber, #101710)
[Link] (2 responses)
Posted Feb 26, 2022 0:58 UTC (Sat)
by rra (subscriber, #99804)
[Link] (1 responses)
As the module maintainer, this is the best of both worlds for me. I can release more quickly than Perl itself and test changes and new features, Perl will pick up the stable version during its release cycles, and people who desperately need the new version for some reason can depend on it directly with a version bound, which will trigger the correct behavior from various module managers.
Posted Mar 10, 2022 1:26 UTC (Thu)
by bartoc (guest, #124262)
[Link]
The re module is fine for a lot of stuff, we use it in some python scripts related to our build system and test harness, and we wouldn't want to use regex from pip since re works well enough for us, and regex doesn't come pre-installed. That said if it _did_ come pre-installed I'd be tempted to go through and upgrade all our python scripts, even if it was maintained and released on a schedule separate from the standard library proper.
Sidenote: I really like raku/perl6's alternate regex syntax, it's nice.
Posted Feb 22, 2022 23:38 UTC (Tue)
by iabervon (subscriber, #722)
[Link] (6 responses)
AFAICT, "requests" hasn't removed any documented features since 2015; on the other hand, "construct" has not been as stable, even over a shorter period of time.
Posted Feb 23, 2022 0:03 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (5 responses)
* On the one hand, import statements use a relatively straightforward, easy to understand set of semantics (i.e. "just stick a bunch of py files in a directory structure, and you're done!"). This is good for scripting purposes, because you don't have to faff about with something like CMake just to build an entirely self-contained app.
Perhaps it would've been less bad if Python had used a slightly more elaborate syntax instead:
from python import re: 3.11+ # Must include Python version for standard library modules.
Unfortunately, that would be egregiously incompatible with existing usage, so it's probably too late now. Oh well.
Posted Feb 23, 2022 8:14 UTC (Wed)
by marcH (subscriber, #57642)
[Link]
Posted Feb 23, 2022 8:28 UTC (Wed)
by interalia (subscriber, #26615)
[Link] (1 responses)
But anyway, do you think what you said about every language doing packaging/dependencies poorly is partly due to every approach having advantages/disadvantages, that being the nature of software development?
Posted Feb 24, 2022 18:45 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
Posted Feb 23, 2022 10:24 UTC (Wed)
by MrWim (subscriber, #47432)
[Link]
I realise this is a bit of a tangent, the subject is Python and not rust, but here goes anyway:
> On the one hand, import statements use a relatively straightforward, easy to understand set of semantics (i.e. "just stick a bunch of py files in a directory structure, and you're done!").
This is where rust isn't so good. It can be confusing how `mod` and `use` and `Cargo.toml` interrelate.
>This is good for scripting purposes, because you don't have to faff about with something like CMake just to build an entirely self-contained app.
Cargo helps here as it's the single blessed build system, and comes bundled with rust.
> On the other, those semantics are perhaps *too* simple, because there is no way to specify "I need version X or greater" within the import statement itself,
With Cargo these versions are specified in `Cargo.toml` - so still external.
> nor where the module actually comes from.
Cargo does restrict where the module comes from. You'll get the version you specified in your Cargo.toml, and you can't `use` things that you haven't specified there - even if they're included in a transitive dependency.
> To add insult to injury, you can't have two different versions of the same module in the same process
Cargo and rust fix this with clever symbol mangling. It could still be an issue with C dependencies, but they are relatively rare in rust land vs. Python
> And, of course, you have the common beginner mistake of accidentally naming a Python script after a stdlib module
This isn't a problem with rust - you can use from your local crate with `use crate::mymod`. And anyway - if you get it wrong you'll get a compile error, it's not going to silently give you the wrong behaviour at runtime.
Thanks for your comment, it encouraged me to finally publish that blog post. It had been sitting almost finished for about a year now.
Posted Feb 23, 2022 18:17 UTC (Wed)
by smoogen (subscriber, #97)
[Link]
I remember all the OS packaging and dependency fights about how they all did it terribly (and thus resulting in a new OS which would use some new method which would be fine for a simple case and then fall over horribly when faced with 'the real world'). Then it seemed like those conversations went dormant, but instead they had all moved to ${SCRIPT_LANGUAGE} which would again seem to come up with some method that people thought would be better/faster/cleaner than say .deb/.rpm/.etc. First the lines would be 'this is better', then they go to 'it does a good enough job', and finally 'it is horrible but we have too large of an ecosystem to change it.'
Posted Feb 23, 2022 1:34 UTC (Wed)
by da4089 (subscriber, #1195)
[Link]
Switching my brain to "no, you should use 'regex' again now" is going to be fun ...
Posted Feb 23, 2022 2:20 UTC (Wed)
by klossner (subscriber, #30046)
[Link] (2 responses)
Posted Feb 23, 2022 8:58 UTC (Wed)
by lkundrak (subscriber, #43452)
[Link]
Posted Feb 23, 2022 17:47 UTC (Wed)
by salewski (subscriber, #121521)
[Link]
That makes we want to learn SNOBOL.
Posted Feb 24, 2022 13:02 UTC (Thu)
by dveeden (subscriber, #120424)
[Link] (2 responses)
This would also work in other situations.
Posted Feb 24, 2022 14:13 UTC (Thu)
by kleptog (subscriber, #1183)
[Link] (1 responses)
The "recommended" approach is to spawn a process/thread, run the regex there, and implement the timeout by having the main process kill the child on timeout.
Not exactly an approach that scales or performs, especially if you're already using threads for something else, because only the main thread handles signals. The regex module solves it directly.
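That spawn-and-kill workaround can be sketched with only the standard library; this is a hypothetical helper, not code from the thread, and it assumes a platform where killing the child mid-match is safe (the child's state is isolated from the parent's):

```python
import re
from multiprocessing import Process, Queue

def _worker(pattern, text, q):
    # Runs in the child; the parent can kill us without corrupting its state.
    q.put(re.search(pattern, text) is not None)

def search_with_timeout(pattern, text, timeout):
    """Run re.search() in a child process, killing it after `timeout` seconds."""
    q = Queue()
    p = Process(target=_worker, args=(pattern, text, q))
    p.start()
    p.join(timeout)
    if p.is_alive():            # still churning: assume runaway backtracking
        p.terminate()
        p.join()
        raise TimeoutError(f"regex took longer than {timeout}s")
    return q.get()

if __name__ == '__main__':
    assert search_with_timeout(r'abc', 'xxabcxx', 5.0) is True
    try:
        search_with_timeout(r'(a+)+$', 'a' * 64 + 'b', 0.2)
    except TimeoutError as exc:
        print(exc)
```

The per-match cost of forking a process is exactly the overhead njs describes below; it only makes sense when matches are rare and large, not in a tight log-parsing loop.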
Posted Mar 4, 2022 2:02 UTC (Fri)
by njs (subscriber, #40338)
[Link]
You can spawn a child process and then kill it – that's safe because the child's state is isolated from the parent's. But that's way too expensive and fragile to use for all regex matches. Much better to just teach your regex engine to check a timeout/abort flag on a regular basis.
Posted Feb 24, 2022 17:49 UTC (Thu)
by esemwy (guest, #83963)
[Link]
Posted Feb 26, 2022 6:31 UTC (Sat)
by Otus (subscriber, #67685)
[Link]
Posted Nov 3, 2022 16:01 UTC (Thu)
by swilmet (subscriber, #98424)
[Link]
Great article and a good summary of the status quo of Python's regex support. Searching for PCRE, GLib or GRegex gave no result on this page, so I would like to mention them.
PCRE (Perl Compatible Regular Expressions) is a widely-used library. GLib (part of the GTK and GNOME project) has the GRegex API which is or should be entirely usable from Python, and uses PCRE (GRegex has been ported to PCRE 2 recently). It is usable from Python thanks to GObject Introspection automatic language bindings (with PyGObject).
So GRegex is another alternative for Python developers.
> [1]: https://swtch.com/~rsc/regexp/regexp1.html
Makes me wonder if what is *really* needed isn't a "simple-re" module with a "Thompson NFA" regex engine. A six-digit speedup should be worth aiming for, after all.
2. Add a flag to re/regex.compile() that throws if it sees any of those features in the expression to be compiled. Off by default, must be explicitly passed.
2½. For bonus points, make a locked_down_regex module (preferably with a better name) that is exactly like re/regex, except the flag in (2) is always passed for you automatically and cannot be turned off by any means. This is analogous to the use of the secrets module in lieu of random. Since it's a whole new module, it won't break anything and must be opted-into, but OTOH it's easy to audit whether you are using the "right" module if your org cares.
* On the other, those semantics are perhaps *too* simple, because there is no way to specify "I need version X or greater" within the import statement itself, nor where the module actually comes from. So now that information needs to live in metadata somewhere, and get tracked and managed separately by a tool like Pip.
* To add insult to injury, you can't have two different versions of the same module in the same process, without doing all sorts of nasty hacks that may or may not break something depending on how the underlying module works (e.g. Does it check __name__? Does it fiddle around with sys.modules? etc.).
* And, of course, you have the common beginner mistake of accidentally naming a Python script after a stdlib module (Python will prefer to re-import the script a second time, rather than using the stdlib module, and then everything breaks because it probably won't implement the stdlib module's API). This can break backcompat if a new stdlib module is introduced with the same name as one of your existing modules, but for some reason nobody seems to care about that failure mode.
from pypi import regex: 3.9+ # Also need version for PyPI modules - if it's not installed or too old, then error out.
from local import regex # "local" means "don't search sys.path, just check the __main__ module's containing directory for regex.py." No version.
from my_custom_namespace import regex: 1.2.3+ # You can install custom hooks to handle imports in whatever crazy way you want.
:(