Python first()
Python prides itself on being a newbie-friendly language; its developers have gone out of their way to try to ensure that easy tasks are straightforward to program. A recent discussion on the python-ideas mailing list looked at a use case that is common, but often implemented in an inefficient, incorrect fashion, with an eye toward making it easier to do correctly. Finding the first match for a regular expression in a body of text is where the conversation started, but it went in some other interesting directions as well.
findfirst()
Juancarlo Añez started things off with a post
about the use case, noting that there are "many ugly recipes
"
for getting the first match for a regular expression (regex), but that a
lot of programmers resort to:
matched = re.findall(regex, text)[0]
That formulation will fail with an IndexError if there are no matches. Even if it succeeds, it does a lot of extra work that gets thrown away to produce the first match; the entire list of matches will be created by re.findall() and all but the first will be wasted effort. He suggested:
def findfirst(regex, text, default=None, flags=0): return next(finditer(regex, text, flags=flags), default=default)
That would use the next() function with a default value (None by default) on the iterator returned by re.finditer(). That would produce the first match or the default value without causing an exception if the iterator raised StopIteration because it had no more values. As pointed out in the thread, though, re.finditer() returns Match objects, rather than a list of strings or tuples like re.findall(). The "obvious" switch to re.findall() will not work because a list is not an iterator, so next() will raise a TypeError.
There was some additional confusion about the two functions and their return values in the thread. Serhiy Storchaka pointed out that re.search() could be used in place of the re.finditer() in the example as follows:
re.search(regex, text, flags) or defaultHowever that returns a Match object (or the default), which Guido van Rossum said may not be what is wanted:
Storchaka was not convinced that a good case had been made for adding such a function, however. Añez tried to fill in the use case:
He pointed out that using the iter() function would make an iterator out of the list returned from re.findall(); that could be used to make his version of findfirst() behave as he wanted. But he also noted that a common extension to the itertools module is a function called first():
def first(seq, default=None): return next(iter(seq), default= default)
He suggested that first() would make a good addition as well:
Note that the case for findfirst() is weaker if first() is available. Yet findfirst() solves the bigger problem.
Andrew Barnert pointed out that using re.findall() is not particularly efficient, though:
To Van Rossum, that indicates
even more need for a findfirst(), presumably one that is optimized
for its task, rather than using re.findall(). But in analyzing some
examples posted
by Kyle Stanley, Storchaka found that they could generally be rewritten to
avoid the troublesome re.findall(...)[0] pattern, often using
re.search()—which is optimized to find the first
occurrence. "It seems that in most cases the author just [does] not know
about
re.search(). Adding re.findfirst() will not fix this.
"
But first()
Añez seemed to agree that basing findfirst() on re.search() made sense, but he also started a new thread to argue that first() should be added, perhaps as a builtin in the standard library. The example he gave in that message was unconvincing to a number of participants, but the general idea of adding a first() sparked a long discussion down several paths.
Steven D'Aprano thought that adding a builtin as a thin wrapper around next() was not providing much of an advantage. He also showed that there are some potential "gotchas" in first() as described:
# non-iterator iterable py> obj = [1, 2, 3, 4] py> [first(obj) for __ in range(5)] [1, 1, 1, 1, 1] # iterator py> obj = iter([1, 2, 3, 4]) py> [first(obj) for __ in range(5)] [1, 2, 3, 4, None]
That could be fixed by adding a way to peek ahead in an iterator, though
there are some oddities to that as well.
But Oscar Benjamin is concerned
that the StopIteration exception essentially leaks an
implementation detail to the users of a function that uses the one-argument
form of next(). It would make more sense to catch that exception
and return a ValueError instead. He pointed to PEP 479
("Change StopIteration handling inside generators
") as one
proposal that changed the behavior for generators, but did not do so for
iterators, which he sees as a problem.
It is interesting to note that there are some corners of Python that even
Van Rossum has lost track of.
Tim Peters said
that the simple one-line implementation of first() is equivalent
to a more complicated version that Van Rossum had posted. Peters
suggested that there should be two bars to pass before something gets added
as a builtin: "In other words, in that context, the bar for 'building it
in' consists far more of 'will it be used?' than 'is it difficult or
tricky or long-winded?'.
"
It turns out that the existence of the two-argument form of next(), which will not raise an exception and simply return the default (i.e. the second argument) if the iterator is exhausted, is something that Van Rossum only learned about in the thread. He thinks others may be in the same boat, which puts first() in the "tricky" category:
But Barnert asked:
"Are people who never find 2-arg next going to find
itertools.first?
"
Van Rossum acknowledged
the problem but thought that adding first() to itertools
would likely result in people finding out about it since that module tends
to attract some attention in blogs and tutorials. Storchaka wondered if
simply adding it to the itertools
recipes section of the documentation would be sufficient. Van Rossum
thought
otherwise: "Not nearly as much as just adding the function. We'll be
doing humanity a favor if we just add it.
"
The difference in behavior for first() on an iterator versus on an iterable (e.g. a list) that D'Aprano had pointed out showed up again later in the thread. Because next() changes the state of an iterator, first() will give surprising results. D'Aprano showed that first() is not idempotent so that calling it twice with the same iterator argument will not give the same results—which is not true for an iterable. He also said that this behavior makes the name "first" ambiguous:
return the first element of a sequence or container (in standard iteration order), OR the *next* element of an iterator
That set off some discussion of the right spelling for first() and whether if should raise an exception (something other than StopIteration) if there is no default passed to it. The latter would effectively remove one of the main features of first() that many proponents preferred, at least early on. Like the get() method for dictionaries, first(), as it was originally proposed, would never raise an exception. As Van Rossum put it:
Similarly, there is already a spelling of first() (or close enough) that raises: 1-arg next(). If 1-arg first() would also raise, it would fail the rule "similar things should be spelled similarly, and different things should be spelled differently".
I am not making this rule up -- it's an old design principle that Lambert Meertens used for ABC, Python's predecessor. It just didn't make it in the Zen of Python, unless you want to interpret "there should be one [...] way to do it" as its spiritual descendant.
Peters, who is in favor of a first() that can raise an exception,
suggested
that first(iterator) and next(iterator) "look
darned near identical
" but act quite differently if the iterator is
exhausted. He sees get() differently because it is a method on
the dict object, but Van Rossum was unconvinced
by that argument. He was not particularly interested in prolonging the
discussion, however: "But if
everyone thinks that first() should raise, fine, this thread is way too
long already (I should have kept it muted :-).
"
Others were not hesitant to keep the discussion going, though. In the end,
Añez came
to the conclusion that a one-argument first() should raise an
exception "so it can be applied to iterables that may yield None as
the legit, first value
". He also noted that the
the more-itertools
module from the Python Package Index (PyPI) has a first()
that raises ValueError on exhaustion if no default is given. Even
if first() is not "promoted" into itertools itself, he is
in favor of updating the itertools documentation and recipes to better address the
proper way to get the first entry in iterators and iterables.
Getting back to the original problem, Añez posted a findalliter() that would return strings or tuples like re.findall(), but do so with an iterator, rather than creating a whole list that might just be thrown away. He then defined findfirst() in terms of that (and first()).
It is not entirely clear where things go from here. In order to add either function to the standard library, it would seem that a PEP is needed; there doesn't seem to be a lot of opposition to the overall idea. A documentation change is likely even easier. But the discussion provided an interesting look into the Python development process, which tends to start out with postings to python-ideas to be fully hashed out before moving over to python-dev once a PEP has been drafted. The discussion also seems to have brought some semi-obscure corners of the language out of the darkness; much was learned in the process of working through it.
Index entries for this article | |
---|---|
Python | Simplifying |
Posted Jan 2, 2020 17:54 UTC (Thu)
by iabervon (subscriber, #722)
[Link]
Posted Jan 4, 2020 12:15 UTC (Sat)
by intgr (subscriber, #39733)
[Link] (3 responses)
Posted Jan 5, 2020 1:26 UTC (Sun)
by excors (subscriber, #95769)
[Link] (1 responses)
Posted Jan 9, 2020 12:49 UTC (Thu)
by intgr (subscriber, #39733)
[Link]
Posted Jan 9, 2020 12:42 UTC (Thu)
by LtWorf (subscriber, #124958)
[Link]
Python first()
Python first()
Python first()
Python first()
Thanks, I hadn't thought of it! :)
Python first()