Revisiting NaNs in Python
Back in January 2020, we looked at some oddities in Python's handling of Not a Number (NaN) values in its statistics module. The conversation went quiet after that, but it has been revived recently with an eye toward fixing the problems that were reported. As detailed in that earlier article, NaNs are rather strange beasts in the floating-point universe, so figuring out how best to deal with their presence is less straightforward than it might seem.
Not a number
NaNs are defined in the IEEE 754 floating-point standard, which Python follows, as special values for quantities that cannot be represented by the hardware. This can happen as a result of operations like division by zero or taking the square root of a negative number, though Python raises an exception in those cases. Sometimes NaNs are used as a kind of special value marking missing data, especially in scientific or big-data contexts, but doing so can cause some odd behavior.
The earlier discussion centered around the statistics.median() function, which returns the "middle" value in a list of numbers; the list is sorted and either the actual middle or the average of the two middle values is returned. But median() (and others) run afoul of the perhaps counterintuitive nature of NaN comparisons. In Python (and IEEE 754), all comparisons involving NaNs are false, except for not-equal, which is always true, even when a NaN is compared with itself.
>>> NaN = float('nan')
>>> NaN < 0
False
>>> NaN >= 0
False
>>> NaN == NaN
False
>>> NaN != NaN
True
Steven D'Aprano, who developed the statistics module for the standard library, returned to the problem with median() in an August 24 post to the python-ideas mailing list. He noted that NaN handling in the statistics module is implementation dependent, which "*usually* means that if your data has a NAN in it, the result you get will probably be a NAN".
But median() behaves differently:
>>> statistics.median([1, 2, float('nan'), 4])
nan
>>> statistics.median([float('nan'), 1, 2, 4])
1.5
I've spoken to users of other statistics packages and languages, such as R, and I cannot find any consensus on what the "right" behaviour should be for NANs except "not that!".
He proposed adding a keyword argument to functions in the module to indicate what should be done when NaNs are found in their input. He saw three possible choices: raise an exception, return NaN, or filter out the NaNs and ignore them. He was looking for opinions on or objections to that plan, as well as wondering what the default should be.
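To make the three choices concrete, here is a minimal sketch of how such a keyword argument might behave. It is not the statistics module's API: the wrapper name median_with_policy(), the parameter name nan_policy, and the string values are all assumptions made for illustration, and the input is assumed to consist of real numbers that math.isnan() accepts.

```python
import math
import statistics

# Hypothetical policy names, chosen only for this illustration.
_POLICIES = ("propagate", "raise", "omit")


def median_with_policy(data, nan_policy="propagate"):
    """Median with explicit NaN handling (hypothetical helper, not a stdlib API)."""
    if nan_policy not in _POLICIES:
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    values = list(data)
    if any(math.isnan(x) for x in values):
        if nan_policy == "raise":
            raise ValueError("NaN found in input data")
        if nan_policy == "propagate":
            return float("nan")
        # "omit": drop the NaNs and compute the median of what is left
        values = [x for x in values if not math.isnan(x)]
    return statistics.median(values)


print(median_with_policy([1, 2, float("nan"), 4], nan_policy="omit"))
print(median_with_policy([float("nan"), 1, 2, 4]))
```

With "omit", both orderings of the input from the example above give the same answer, since the NaN never reaches the sort.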
Christopher Barker noted that NumPy has a set of functions (nan*()) that explicitly ignore NaNs, but:
Filtering [out] NaNs should *not* be the default. Often NaN means missing data, but could also be the result of an error of some sort. Incorrect results are much worse than errors — NaNs should never be ignored unless explicitly asked for.
Beyond that, I’d prefer returning NaN to raising an exception, but either is OK.
Marc-Andre Lemburg concurred that filtering should be an explicit choice. He likened the situation to that of the errors parameter for the codecs module; it allows the programmer to choose various types of behavior when errors are encountered in encoding or decoding the input. Finn Mason suggested having an option to warn the user if NaN values were ignored; he also thought that NaNs could perhaps be treated as zeroes in that case: "This allows calculations to still quickly and easily be made with or without NaNs, but an alternative course of action can be taken in the presence of a NaN value if desired." Treating NaNs as zeroes was a non-starter, but D'Aprano did see value in the idea of a warning.
More surprises
There are other ways that NaNs exhibit surprising behavior, as highlighted by a suggestion made by Lemburg. He said that codecs defaults to raising an exception to expose the need to make an explicit decision, but that might not be the right choice for a long-running calculation that encounters a NaN. He said that a simple test, similar to the one below, could be used to check for NaNs before doing said calculation:
>>> a = [ 1, 2, 3, NaN ]
>>> NaN in a
True
While that particular test works, Peter Otten pointed out that it is not actually a solution to the problem: the containment test checks object identity before falling back to an equality comparison, so it only succeeds when the list holds the very same NaN object, as the following shows:
>>> float('nan') in a
False
Unless all of the NaNs are the same object, math.isnan() must be used instead. Lemburg thought that NaNs were treated as a singleton, but, as D'Aprano pointed out, there are an enormous number of NaN values defined by the standard; making them into a singleton would lose extra information that might be useful:
The IEEE-754 standard doesn't mandate that NANs preserve the payload, but it does recommend it. We shouldn't gratuitously discard that information. It could be meaningful to whoever is generating the data.
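The membership-test pitfall described above, and the math.isnan() alternative, can be sketched as follows. The helper name contains_nan() is an assumption made for this example; it is not part of the standard library.

```python
import math

nan = float("nan")
data = [1, 2, 3, nan]

# "in" checks identity before equality, so only the very same object is found.
print(nan in data)            # True: the same object is in the list
print(float("nan") in data)   # False: a different NaN object, and NaN != NaN


def contains_nan(values):
    """Hypothetical helper: True if any value is a NaN, whichever object it is."""
    return any(math.isnan(x) for x in values)


print(contains_nan(data))                     # True
print(contains_nan([1, 2, float("nan")]))     # True
print(contains_nan([1, 2, 3]))                # False
```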
Lemburg thinks that the current median() behavior is a bug, but D'Aprano reiterated at some length that it was consistent with the standard. He also pointed out that there are further consequences of the comparison rules for NaNs:
So when you sort a list containing NANs, they end up in some arbitrary position that depends on the sort implementation, the other values in the list, and their initial position. NANs can even throw out the order of other values:
>>> sorted([3, nan, 4, 2, nan, 1])
[3, nan, 1, 2, 4, nan]
and *that* violates `median`'s assumption that sorting values actually puts them in sorted order, which is why median returns the wrong value.
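One way to see that the problem lies in the comparisons rather than in the sort itself is to give sorted() a key that never compares a NaN with anything. The sketch below uses a hypothetical nan_last_key() function that places all NaNs at the end; it is just one simple, consistent ordering chosen for illustration, not something specified by IEEE 754 or provided by Python.

```python
import math


def nan_last_key(x):
    """Hypothetical sort key: real values first in numeric order, NaNs last."""
    is_nan = math.isnan(x)
    # Replace the NaN with a placeholder so it is never compared directly.
    return (is_nan, 0.0 if is_nan else x)


nan = float("nan")
data = [3, nan, 4, 2, nan, 1]

ordered = sorted(data, key=nan_last_key)
print(ordered)  # the numbers in ascending order, followed by the two NaNs
```

With a key like this, the non-NaN values always come out in order, so a caller could then decide explicitly what to do with the NaNs at the tail.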
Adding the totalOrder predicate, which provides a consistent definition of where to sort NaNs (and other special values), would perhaps be helpful, but could not change the existing NaN handling in Python. As Richard Damon noted: "Asking for the median value of a list that doesn't have a proper total order is a nonsense question, so you get a nonsense answer."
Defaulting to returning NaN would be consistent with other Python math functions, Mark Dickinson said. Brendan Barnwell also thought that was probably the best default, but he raised another issue:
One important thing we should think about is whether to add similar handling to `max` and `min`. These are builtin functions, not in the statistics module, but they have similarly confusing behavior with NAN: compare `max(1, 2, float('nan'))` with `max(float('nan'), 1, 2)`. As long as we're handling this for median and so on, it would be nice to have the ability to do NAN-aware max and min as well.
As might be guessed from earlier examples, the first max() call returns 2, while the second returns a NaN. As with the existing behavior of NaN comparisons, though, no change to the default behavior of those built-in functions can be made for backward-compatibility reasons. If desired, though, a new keyword argument could be added as D'Aprano proposed for median().
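The order dependence of max() is easy to reproduce, and a NaN-propagating variant takes only a few lines. The function name max_propagating_nan() below is hypothetical; it only illustrates what an opt-in behavior might look like and is not a proposal from the thread.

```python
import math

nan = float("nan")

# max() keeps the current candidate unless a later item compares greater;
# every comparison with a NaN is false, so the result depends on position.
print(max(1, 2, nan))   # 2
print(max(nan, 1, 2))   # nan


def max_propagating_nan(values):
    """Hypothetical helper: return a NaN if any input is NaN, else max()."""
    values = list(values)
    for x in values:
        if math.isnan(x):
            return x
    return max(values)


print(max_propagating_nan([1, 2, nan]))  # nan
print(max_propagating_nan([nan, 1, 2]))  # nan
print(max_propagating_nan([1, 2, 3]))    # 3
```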
Naming
After a few days of discussion, D'Aprano opened the bikeshed floodgates when he asked for opinions on the name and type of the parameter. "I'm leaning towards "nans=..." with an enum." Sebastian Berg surveyed a few other Python libraries, two of which use string types to specify what to do about NaNs. Barker liked the SciPy version, which has a nan_policy parameter with three values: 'propagate' (return NaN), 'raise' (an exception), or 'omit' (filter NaNs). In any case, he preferred a string flag rather than an Enum.
Lemburg agreed, noting that codecs uses string values; others in the thread were largely in favor of that approach as well. But Ronald Oussoren wondered what the objection to using an enum is. Barker does not see any real advantage to them over string flags, except for static typing, and noted that they are somewhat more painful to use.
But for flags that can be combined in various ways (e.g. with bitwise-or, "|"), enums have a "*huge* advantage", Chris Angelico said:
It's easy to accept a flags argument that is the bitwise Or of a collection of flags, and then ascertain whether or not a specific flag was included. The repr of such a combination is useful and readable, too.
It does not make sense to combine the NaN flags, though. Oussoren pointed out that enums provide more advantages than just for static typing; static analysis and things like auto-completion in IDEs are enabled by enums. He said that he had no opinion on what should be used for statistics, but that he has been moving string flags to enums in his own code, in part to avoid bugs from typos in the strings. For Barker, though, that is more evidence of a shift in Python over the years:
Features to support static analysis are a good example -- far less important for "scripting" [than] "systems programming".
Personally, I find it odd -- I've spent literally a couple decades telling people that Python's dynamic nature is a net plus, and the bugs that static typing (and static analysis I suppose) catch for you are generally shallow bugs. (for example: misspelling a string flag is a shallow bug).
But here we are.
Anyway, if you are writing a quick script to calculate a few statistics, I think a string flag is easier. If you are writing a big system with some statistical calculations built in, you will appreciate the additional safety of an Enum. It's hard to optimize for different use cases.
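The two styles are not mutually exclusive. As a rough sketch of the trade-off, an enum whose values are the corresponding strings can accept either spelling. The NanPolicy class and the parse_policy() helper are hypothetical names invented for this example; nothing like them was agreed on in the thread.

```python
import enum


class NanPolicy(enum.Enum):
    """Hypothetical enum whose values mirror the string flags."""
    PROPAGATE = "propagate"
    RAISE = "raise"
    OMIT = "omit"


def parse_policy(policy):
    """Accept either a NanPolicy member or its string value."""
    # Enum lookup by value returns the member; passing a member returns it as-is.
    return NanPolicy(policy)


print(parse_policy("omit"))            # NanPolicy.OMIT
print(parse_policy(NanPolicy.RAISE))   # NanPolicy.RAISE

# A misspelled string is only caught when the call is made ...
try:
    parse_policy("omitt")
except ValueError as exc:
    print("rejected:", exc)

# ... while a misspelled member such as NanPolicy.OMITT raises AttributeError
# and can also be flagged ahead of time by static-analysis tools.
```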
The discussion largely wound down after that. D'Aprano thanked participants for their opinions along the way, so he is clearly trying to find a middle-ground position to solve the problems in the statistics module. Based on the discussion, one might guess that the default will be to return a NaN, the parameter will be called nans or maybe nan_policy, and that string flag values will be used. In any case, the open nature of Python development once again gives us a nice look inside the kinds of problems that are grappled with on the way to fixing corner cases and surprising behavior in the language.
