|
|
Subscribe / Log in / New account

Python first()

By Jake Edge
January 1, 2020

Python prides itself on being a newbie-friendly language; its developers have gone out of their way to try to ensure that easy tasks are straightforward to program. A recent discussion on the python-ideas mailing list looked at a use case that is common, but often implemented in an inefficient, incorrect fashion, with an eye toward making it easier to do correctly. Finding the first match for a regular expression in a body of text is where the conversation started, but it went in some other interesting directions as well.

findfirst()

Juancarlo Añez started things off with a post about the use case, noting that there are "many ugly recipes" for getting the first match for a regular expression (regex), but that a lot of programmers resort to:

matched = re.findall(regex, text)[0]

That formulation will fail with an IndexError if there are no matches. Even if it succeeds, it does a lot of extra work that gets thrown away to produce the first match; the entire list of matches will be created by re.findall() and all but the first will be wasted effort. He suggested:

def findfirst(regex, text, default=None, flags=0):

    return next(finditer(regex, text, flags=flags), default=default)

That would use the next() function with a default value (None by default) on the iterator returned by re.finditer(). That would produce the first match or the default value without causing an exception if the iterator raised StopIteration because it had no more values. As pointed out in the thread, though, re.finditer() returns Match objects, rather than a list of strings or tuples like re.findall(). The "obvious" switch to re.findall() will not work because a list is not an iterator, so next() will raise a TypeError.

There was some additional confusion about the two functions and their return values in the thread. Serhiy Storchaka pointed out that re.search() could be used in place of the re.finditer() in the example as follows:

    re.search(regex, text, flags) or default
However that returns a Match object (or the default), which Guido van Rossum said may not be what is wanted:

For people who are just interested in strings, the Match object is just a distraction. I think I am +1 on adding re.findfirst() as proposed by the OP.

Storchaka was not convinced that a good case had been made for adding such a function, however. Añez tried to fill in the use case:

The OP thinks that the case for wanting just the string for a first regex match, or a verifiable default if there is no match, is way too common, that the advice on the web is not very good (it should be "write a findfirst() using next() over finditer()", and that novices default to using findall(..)[0], which is troublesome.

He pointed out that using the iter() function would make an iterator out of the list returned from re.findall(); that could be used to make his version of findfirst() behave as he wanted. But he also noted that a common extension to the itertools module is a function called first():

def first(seq, default=None):
    return next(iter(seq), default= default)

He suggested that first() would make a good addition as well:

That function, first(), would also be a nice addition in itertools, and findfirst() could be implemented using it. first() avoids most use cases needing to check if a sequence or iterator is empty before using a default value. MHO is that first() deals with so many common cases that it should be a builtin.

Note that the case for findfirst() is weaker if first() is available. Yet findfirst() solves the bigger problem.

Andrew Barnert pointed out that using re.findall() is not particularly efficient, though:

The problem with using findall instead of finditer or search is that it scans the whole document rather than just until the first match, and it builds a potentially huge list just to throw it away. It’s pretty common that one or both of those will be a serious performance issue. Imagine asking to find the first double consonant in the OED [Oxford English Dictionary] and it takes a minute to run and pins a gigabyte of memory.

To Van Rossum, that indicates even more need for a findfirst(), presumably one that is optimized for its task, rather than using re.findall(). But in analyzing some examples posted by Kyle Stanley, Storchaka found that they could generally be rewritten to avoid the troublesome re.findall(...)[0] pattern, often using re.search()—which is optimized to find the first occurrence. "It seems that in most cases the author just [does] not know about re.search(). Adding re.findfirst() will not fix this."

But first()

Añez seemed to agree that basing findfirst() on re.search() made sense, but he also started a new thread to argue that first() should be added, perhaps as a builtin in the standard library. The example he gave in that message was unconvincing to a number of participants, but the general idea of adding a first() sparked a long discussion down several paths.

Steven D'Aprano thought that adding a builtin as a thin wrapper around next() was not providing much of an advantage. He also showed that there are some potential "gotchas" in first() as described:

But there's a major difference in behaviour depending on your input, and one which is surely going to lead to bugs from people who didn't realise that iterator arguments and iterable arguments will behave differently:
    # non-iterator iterable
    py> obj = [1, 2, 3, 4]
    py> [first(obj) for __ in range(5)]
    [1, 1, 1, 1, 1]

    # iterator
    py> obj = iter([1, 2, 3, 4])
    py> [first(obj) for __ in range(5)]
    [1, 2, 3, 4, None]

That could be fixed by adding a way to peek ahead in an iterator, though there are some oddities to that as well. But Oscar Benjamin is concerned that the StopIteration exception essentially leaks an implementation detail to the users of a function that uses the one-argument form of next(). It would make more sense to catch that exception and return a ValueError instead. He pointed to PEP 479 ("Change StopIteration handling inside generators") as one proposal that changed the behavior for generators, but did not do so for iterators, which he sees as a problem.

It is interesting to note that there are some corners of Python that even Van Rossum has lost track of. Tim Peters said that the simple one-line implementation of first() is equivalent to a more complicated version that Van Rossum had posted. Peters suggested that there should be two bars to pass before something gets added as a builtin: "In other words, in that context, the bar for 'building it in' consists far more of 'will it be used?' than 'is it difficult or tricky or long-winded?'."

It turns out that the existence of the two-argument form of next(), which will not raise an exception and simply return the default (i.e. the second argument) if the iterator is exhausted, is something that Van Rossum only learned about in the thread. He thinks others may be in the same boat, which puts first() in the "tricky" category:

But even if you know about 2-arg next(), the next(iter(it), default) version is not quite trivial to come up with -- you have to remember to put the iter() call in -- but IMO the main problem is that not enough people know about 2-arg next(), and that makes it pass the second bar.

But Barnert asked: "Are people who never find 2-arg next going to find itertools.first?" Van Rossum acknowledged the problem but thought that adding first() to itertools would likely result in people finding out about it since that module tends to attract some attention in blogs and tutorials. Storchaka wondered if simply adding it to the itertools recipes section of the documentation would be sufficient. Van Rossum thought otherwise: "Not nearly as much as just adding the function. We'll be doing humanity a favor if we just add it."

The difference in behavior for first() on an iterator versus on an iterable (e.g. a list) that D'Aprano had pointed out showed up again later in the thread. Because next() changes the state of an iterator, first() will give surprising results. D'Aprano showed that first() is not idempotent so that calling it twice with the same iterator argument will not give the same results—which is not true for an iterable. He also said that this behavior makes the name "first" ambiguous:

The meaning of `first` is:
return the first element of a sequence or container (in standard iteration order), OR the *next* element of an iterator

That set off some discussion of the right spelling for first() and whether if should raise an exception (something other than StopIteration) if there is no default passed to it. The latter would effectively remove one of the main features of first() that many proponents preferred, at least early on. Like the get() method for dictionaries, first(), as it was originally proposed, would never raise an exception. As Van Rossum put it:

dict.get() does *not* require you to specify a default -- in fact its *sole* purpose in life is to not raise an exception. And it has the sensible default default of None. There is already a spelling that does raise: dict[key].

Similarly, there is already a spelling of first() (or close enough) that raises: 1-arg next(). If 1-arg first() would also raise, it would fail the rule "similar things should be spelled similarly, and different things should be spelled differently".

I am not making this rule up -- it's an old design principle that Lambert Meertens used for ABC, Python's predecessor. It just didn't make it in the Zen of Python, unless you want to interpret "there should be one [...] way to do it" as its spiritual descendant.

Peters, who is in favor of a first() that can raise an exception, suggested that first(iterator) and next(iterator) "look darned near identical" but act quite differently if the iterator is exhausted. He sees get() differently because it is a method on the dict object, but Van Rossum was unconvinced by that argument. He was not particularly interested in prolonging the discussion, however: "But if everyone thinks that first() should raise, fine, this thread is way too long already (I should have kept it muted :-)."

Others were not hesitant to keep the discussion going, though. In the end, Añez came to the conclusion that a one-argument first() should raise an exception "so it can be applied to iterables that may yield None as the legit, first value". He also noted that the the more-itertools module from the Python Package Index (PyPI) has a first() that raises ValueError on exhaustion if no default is given. Even if first() is not "promoted" into itertools itself, he is in favor of updating the itertools documentation and recipes to better address the proper way to get the first entry in iterators and iterables.

Getting back to the original problem, Añez posted a findalliter() that would return strings or tuples like re.findall(), but do so with an iterator, rather than creating a whole list that might just be thrown away. He then defined findfirst() in terms of that (and first()).

It is not entirely clear where things go from here. In order to add either function to the standard library, it would seem that a PEP is needed; there doesn't seem to be a lot of opposition to the overall idea. A documentation change is likely even easier. But the discussion provided an interesting look into the Python development process, which tends to start out with postings to python-ideas to be fully hashed out before moving over to python-dev once a PEP has been drafted. The discussion also seems to have brought some semi-obscure corners of the language out of the darkness; much was learned in the process of working through it.


Index entries for this article
PythonSimplifying


to post comments

Python first()

Posted Jan 2, 2020 17:54 UTC (Thu) by iabervon (subscriber, #722) [Link]

I think that the right answer might be to enshrine in the lore: "the first item is next() at the beginning", in much the way that "slices are (the first index that's included):(the first index that's not included)" is something that newcomers don't know beforehand but can remember once they find out. It's like how, to get the first card from a deck of cards, you draw a card, which is, in general, how you get the next card; you get the first card just by being the first to get a card.

Python first()

Posted Jan 4, 2020 12:15 UTC (Sat) by intgr (subscriber, #39733) [Link] (3 responses)

Something similar that I've wanted on multiple occasions would be a `single()` function, that gets the first item from an iterable and throws an exception if there are no values, or multiple values. Like "make sure there's just a single item and give it to me".

Python first()

Posted Jan 5, 2020 1:26 UTC (Sun) by excors (subscriber, #95769) [Link] (1 responses)

Isn't that simply "(item,) = iterable" (or even "item, = iterable"), which unpacks into a tuple of size 1 and throws ValueError if there are too few/many values?

Python first()

Posted Jan 9, 2020 12:49 UTC (Thu) by intgr (subscriber, #39733) [Link]

That's a very nice short and idiomatic way of implementing that.
Thanks, I hadn't thought of it! :)

Python first()

Posted Jan 9, 2020 12:42 UTC (Thu) by LtWorf (subscriber, #124958) [Link]

You could even implement such a function yourself, put it in a module and then import the module!


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds