
Revisiting NaNs in Python

By Jake Edge
September 15, 2021

Back in January 2020, we looked at some oddities in Python's handling of Not a Number (NaN) values in its statistics module. The conversation went quiet after that, but it has been revived recently with an eye toward fixing the problems that were reported. As detailed in that earlier article, NaNs are rather strange beasts in the floating-point universe, so figuring out how best to deal with their presence is less straightforward than it might seem.

Not a number

NaNs are defined in the IEEE 754 floating-point standard, which Python follows, as special values for quantities that cannot be represented by the hardware. This can happen as a result of operations like division by zero or taking the square root of a negative number, though Python raises an exception in those cases. Sometimes NaNs are used as a kind of special value marking missing data, especially in scientific or big-data contexts, but doing so can cause some odd behavior.

The earlier discussion centered around the statistics.median() function, which returns the "middle" value in a list of numbers; the list is sorted and either the actual middle or the average of the two middle values is returned. But median() (and others) run afoul of the perhaps counterintuitive nature of NaN comparisons. In Python (and IEEE 754), all comparisons involving NaNs are false, except for not-equal, which is always true, even when comparing the same two NaN values.

    >>> NaN = float('nan')
    >>> NaN < 0
    False
    >>> NaN >= 0
    False
    >>> NaN == NaN
    False
    >>> NaN != NaN
    True

Steven D'Aprano, who developed the statistics module for the standard library, returned to the problem with median() in an August 24 post to the python-ideas mailing list. He noted that NaN handling in the statistics module is implementation dependent, which "*usually* means that if your data has a NAN in it, the result you get will probably be a NAN". But median() behaves differently:

    >>> statistics.median([1, 2, float('nan'), 4])
    nan
    >>> statistics.median([float('nan'), 1, 2, 4])
    1.5

I've spoken to users of other statistics packages and languages, such as R, and I cannot find any consensus on what the "right" behaviour should be for NANs except "not that!".

He proposed adding a keyword argument to functions in the module to indicate what should be done when NaNs are found in their input. He saw three possible choices: raise an exception, return NaN, or filter out the NaNs and ignore them. He was looking for opinions on or objections to that plan, as well as wondering what the default should be.
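The three choices can be sketched as a thin wrapper around statistics.median(). The function name median_with_nans, the nans parameter, and its string values are hypothetical, picked only for illustration; this is not the interface that was proposed in the thread.

```python
import math
import statistics


def median_with_nans(data, nans="propagate"):
    """Hypothetical wrapper: 'raise', 'propagate', or 'omit' NaNs."""
    if nans not in ("raise", "propagate", "omit"):
        raise ValueError(f"unknown nans policy: {nans!r}")
    values = list(data)
    has_nan = any(math.isnan(x) for x in values)
    if has_nan and nans == "raise":
        raise ValueError("NaN found in data")
    if has_nan and nans == "propagate":
        return math.nan
    # Either there are no NaNs, or the caller asked for them to be dropped.
    values = [x for x in values if not math.isnan(x)]
    return statistics.median(values)


data = [float("nan"), 1, 2, 4]
print(median_with_nans(data))               # nan
print(median_with_nans(data, nans="omit"))  # 2
```

With a wrapper like this, the result no longer depends on where the NaN happens to sit in the input.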

Christopher Barker noted that NumPy has a set of functions (nan*()) that explicitly ignore NaNs, but:

Filtering [out] NaNs should *not* be the default. Often NaN means missing data, but could also be the result of an error of some sort. Incorrect results are much worse than errors — NaNs should never be ignored unless explicitly asked for.

Beyond that, I’d prefer returning NaN to raising an exception, but either is OK.

Marc-Andre Lemburg concurred that filtering should be an explicit choice. He likened the situation to that of the errors parameter for the codecs module; it allows the programmer to choose various types of behavior when errors are encountered in encoding or decoding the input. Finn Mason suggested having an option to warn the user if NaN values were ignored; he also thought that NaNs could perhaps be treated as zeroes in that case: "This allows calculations to still quickly and easily be made with or without NaNs, but an alternative course of action can be taken in the presence of a NaN value if desired." Treating NaNs as zeroes was a non-starter, but D'Aprano did see value in the idea of a warning.

More surprises

There are other ways that NaNs exhibit surprising behavior, as highlighted by a suggestion made by Lemburg. He said that codecs defaults to raising an exception to expose the need to make an explicit decision, but that might not be the right choice for a long-running calculation that encounters a NaN. He said that a simple test, similar to the one below, could be used to check for NaNs before doing said calculation:

    >>> a = [ 1, 2, 3, NaN ]
    >>> NaN in a
    True

While that particular test works, Peter Otten pointed out that it is not actually a solution to the problem because the container test skips the equality comparison if the objects are the same, as the following shows:

    >>> float('nan') in a
    False

Unless all of the NaNs are the same object, math.isnan() must be used instead. Lemburg thought that NaNs were treated as a singleton, but, as D'Aprano pointed out, there are an enormous number of NaN values defined by the standard; making them into a singleton would lose extra information that might be useful:
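A small sketch of the difference between the two tests, using only the standard library (the variable names are illustrative):

```python
import math

nan = float("nan")
a = [1, 2, 3, nan]

# The containment test finds the NaN only because it is the same object.
same_object_found = nan in a
# A distinct NaN object is not found, since NaN == NaN is False.
other_object_found = float("nan") in a
# math.isnan() inspects the value, so object identity does not matter.
any_nan = any(math.isnan(x) for x in a)

print(same_object_found, other_object_found, any_nan)  # True False True
```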

The IEEE-754 standard doesn't mandate that NANs preserve the payload, but it does recommend it. We shouldn't gratuitously discard that information. It could be meaningful to whoever is generating the data.
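To make the idea of a payload concrete, the sketch below builds a NaN from a chosen bit pattern with the struct module. The helper names float_bits and float_from_bits are hypothetical, and whether a given payload survives arithmetic or conversion is platform dependent, so no particular output is claimed for the bit patterns.

```python
import math
import struct


def float_bits(x):
    """Return the 64-bit IEEE 754 pattern of a float as an int."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def float_from_bits(bits):
    """Build a float from a 64-bit pattern."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


default_nan = float("nan")
# Exponent all ones, quiet bit set, and an arbitrary payload of 0x123.
payload_nan = float_from_bits(0x7FF8000000000123)

print(math.isnan(default_nan), math.isnan(payload_nan))  # True True
# The two patterns may differ; the exact values depend on the platform.
print(hex(float_bits(default_nan)), hex(float_bits(payload_nan)))
```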

Lemburg thinks that the current median() behavior is a bug, but D'Aprano reiterated at some length that it was consistent with the standard. He also pointed out that there are further consequences of the comparison rules for NaNs:

So when you sort a list containing NANs, they end up in some arbitrary position that depends on the sort implementation, the other values in the list, and their initial position. NANs can even throw out the order of other values:

    >>> sorted([3, nan, 4, 2, nan, 1])
    [3, nan, 1, 2, 4, nan]

and *that* violates `median`'s assumption that sorting values actually puts them in sorted order, which is why median returns the wrong value.

Adding the totalOrder predicate, which provides a consistent definition of where to sort NaNs (and other special values), would perhaps be helpful, but could not change the existing NaN handling in Python. As Richard Damon noted: "Asking for the median value of a list that doesn't have a proper total order is a nonsense question, so you get a nonsense answer."
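For illustration, a sort key that pushes NaNs to the end gives a consistent ordering. This is only a rough stand-in for the idea, not the IEEE 754 totalOrder predicate itself (which also distinguishes signs and payloads); nan_last_key is a hypothetical name.

```python
import math


def nan_last_key(x):
    """Hypothetical sort key: every NaN sorts after all other values."""
    isnan = math.isnan(x)
    # Replace the NaN with a dummy so that the sort never compares NaNs.
    return (isnan, 0.0 if isnan else x)


nan = float("nan")
data = [3, nan, 4, 2, nan, 1]
ordered = sorted(data, key=nan_last_key)
print(ordered)  # [1, 2, 3, 4, nan, nan]
```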

Defaulting to returning NaN would be consistent with other Python math functions, Mark Dickinson said. Brendan Barnwell also thought that was probably the best default, but he raised another issue:

One important thing we should think about is whether to add similar handling to `max` and `min`. These are builtin functions, not in the statistics module, but they have similarly confusing behavior with NAN: compare `max(1, 2, float('nan'))` with `max(float('nan'), 1, 2)`. As long as we're handling this for median and so on, it would be nice to have the ability to do NAN-aware max and min as well.

As might be guessed from earlier examples, the first max() call returns 2, while the second returns a NaN. As with the existing behavior of NaN comparisons, though, no change to the default behavior of those built-in functions can be made for backward-compatibility reasons. If desired, though, a new keyword argument could be added as D'Aprano proposed for median().
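The order dependence, and one possible shape for a NaN-aware variant, can be sketched as follows. nan_max and its nans keyword are hypothetical; nothing like them exists as a builtin.

```python
import math

nan = float("nan")

# The builtin keeps its first argument unless a later one compares greater,
# and every comparison against NaN is False.
print(max(1, 2, nan))  # 2
print(max(nan, 1, 2))  # nan


def nan_max(*values, nans="propagate"):
    """Hypothetical NaN-aware max(): 'propagate', 'raise', or 'omit'."""
    if nans not in ("propagate", "raise", "omit"):
        raise ValueError(f"unknown nans policy: {nans!r}")
    has_nan = any(math.isnan(v) for v in values)
    if has_nan and nans == "raise":
        raise ValueError("NaN found in arguments")
    if has_nan and nans == "propagate":
        return math.nan
    # Raises ValueError, like max(), if nothing is left after filtering.
    return max(v for v in values if not math.isnan(v))


print(nan_max(1, 2, nan))               # nan
print(nan_max(nan, 1, 2, nans="omit"))  # 2
```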

Naming

After a few days of discussion, D'Aprano opened the bikeshed floodgates when he asked for opinions on the name and type of the parameter. "I'm leaning towards 'nans=...' with an enum." Sebastian Berg surveyed a few other Python libraries, two of which use string types to specify what to do about NaNs. Barker liked the SciPy version, which has a nan_policy parameter with three values: 'propagate' (return NaN), 'raise' (an exception), or 'omit' (filter NaNs). In any case, he preferred a string flag rather than an Enum.

Lemburg agreed, noting that codecs uses string values; others in the thread were largely in favor of that approach as well. But Ronald Oussoren wondered what the objection to using an enum is. Barker does not see any real advantage to them over string flags, except for static typing, and noted that they are somewhat more painful to use. But for flags that can be combined in various ways (e.g. with bitwise-or, "|"), enums have a "*huge* advantage", Chris Angelico said:

It's easy to accept a flags argument that is the bitwise Or of a collection of flags, and then ascertain whether or not a specific flag was included. The repr of such a combination is useful and readable, too.
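A minimal sketch of the combinable flags Angelico describes, using enum.Flag from the standard library. The Style class and its members are made up for the example.

```python
import enum


class Style(enum.Flag):
    """Hypothetical set of flags that can be combined."""
    BOLD = enum.auto()
    ITALIC = enum.auto()
    UNDERLINE = enum.auto()


combined = Style.BOLD | Style.UNDERLINE

print(Style.BOLD in combined)    # True
print(Style.ITALIC in combined)  # False
# The repr names the members; its exact text varies between Python versions.
print(repr(combined))
```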

It does not make sense to combine the NaN flags, though. Oussoren pointed out that enums provide more advantages than just for static typing; static analysis and things like auto-completion in IDEs are enabled by enums. He said that he had no opinion on what should be used for statistics, but that he has been moving string flags to enums in his own code, in part to avoid bugs from typos in the strings. For Barker, though, that is more evidence of a shift in Python over the years:

Features to support static analysis are a good example -- far less important for "scripting" [than] "systems programming".

Personally, I find it odd -- I've spent literally a couple decades telling people that Python's dynamic nature is a net plus, and the bugs that static typing (and static analysis I suppose) catch for you are generally shallow bugs. (for example: misspelling a string flag is a shallow bug).

But here we are.

Anyway, if you are writing a quick script to calculate a few statistics, I think a string flag is easier. If you are writing a big system with some statistical calculations built in, you will appreciate the additional safety of an Enum. It's hard to optimize for different use cases.
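The two styles are not necessarily exclusive. As a sketch, and assuming hypothetical names (NanPolicy and parse_policy are not part of any library), an enum with string values can accept either spelling and turn a misspelled string into an immediate error:

```python
import enum


class NanPolicy(str, enum.Enum):
    """Hypothetical policy values, modeled on SciPy's nan_policy strings."""
    PROPAGATE = "propagate"
    RAISE = "raise"
    OMIT = "omit"


def parse_policy(value):
    """Accept a NanPolicy member or its string value; reject anything else."""
    try:
        return NanPolicy(value)
    except ValueError:
        allowed = ", ".join(p.value for p in NanPolicy)
        raise ValueError(f"nan policy must be one of: {allowed}") from None


print(parse_policy("omit") is NanPolicy.OMIT)          # True
print(parse_policy(NanPolicy.OMIT) is NanPolicy.OMIT)  # True
```

A typo such as parse_policy("omitt") raises ValueError at the call rather than silently selecting some other behavior.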

The discussion largely wound down after that. D'Aprano thanked participants for their opinions along the way, so he is clearly trying to find a middle-ground position to solve the problems in the statistics module. Based on the discussion, one might guess that the default will be to return a NaN, the parameter will be called nans or maybe nan_policy, and that string flag values will be used. In any case, the open nature of Python development once again gives us a nice look inside the kinds of problems that are grappled with on the way to fixing corner cases and surprising behavior in the language.



Missing values in Julia

Posted Sep 15, 2021 23:25 UTC (Wed) by garrison (subscriber, #39220) [Link] (1 responses)

Sometimes NaNs are used as a kind of special value marking missing data

There is an explicit missing singleton in Julia to avoid NaNs having this double meaning. (And yes, this is distinct from nothing, Julia's "null" singleton.)

Missing values in Julia

Posted Sep 16, 2021 15:45 UTC (Thu) by southey (guest, #9466) [Link]

There is not a double meaning but a misuse of the standard because using NaN as missing value indicators relies on quiet NaN behavior (in that an invalid operation exception is not raised). Like with missing value codes, it is adequate provided that the functions are written to properly handle it and people understand the limitations. A missing value code does not avoid this type of problem unless that value gets set (or replaced) in advance (such as a function that sets it when dividing by zero) or the user understands the problem. At least with a missing value code it is easier to define appropriate rules as to include or exclude values based on that code (which essentially is this scenario).

Revisiting NaNs in Python

Posted Sep 16, 2021 1:50 UTC (Thu) by ncm (guest, #165) [Link] (12 responses)

> "Python's dynamic nature is a net plus..."

(Head in hands.) This is why we can't have nice things.

"Shallow", of a bug, describes only how much must be changed to fix it. Otherwise, it has unlimited potential for impact, and is also the overwhelmingly most common sort, particularly in dynamically typed languages. Thus, if you can do something simple to cut your "shallow" bug count, you have cut your total bug count by almost as much.

Revisiting NaNs in Python

Posted Sep 16, 2021 2:17 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (6 responses)

This is (presumably) written from the perspective of someone who would otherwise have written a multi-line awk or perl -e monstrosity. Python is arguably less bad than either of those things... but I very much doubt such people are meticulously annotating the static types of all of their variables and functions.

Revisiting NaNs in Python

Posted Sep 16, 2021 7:14 UTC (Thu) by ncm (guest, #165) [Link] (5 responses)

Hint: Perl is also dynamically-typed.

The topic here is the possibility of static error checking, and a choice between two ways to specify an argument, *both* in Python. The fail is in choosing not to minimize "shallow" bugs, just because they are shallow.

If passing an enumeration is, at present, substantially less convenient than passing a string, that is fixable by improving the language.

Or, by improving the library. E.g. the library could permit either a string or an enumerator in that position. Or, it could treat a misspelled string argument as a fatal runtime, or even compile-to-pyc-time, error. It would be inherently wrong to vary the behavior of a particular call site at runtime.

Choices are available to somebody who cares.

Revisiting NaNs in Python

Posted Sep 16, 2021 16:38 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (4 responses)

> Hint: Perl is also dynamically-typed.

I'm aware of that. I'm also rather confused. How exactly did you read my comment and come to the conclusion that I wasn't?

> The topic here is the possibility of static error checking, and a choice between two ways to specify an argument, *both* in Python. The fail is in choosing not to minimize "shallow" bugs, just because they are shallow.

The argument (with which I disagree) is that less typing = people will choose to use Python instead of perl or awk, more typing = people won't want to bother with Python because it's too heavy. Realistically, it is impossible to simultaneously satisfy the "less typing" crowd and the "static typing" crowd. I'm firmly in the "static typing" camp, but it's important to understand that this is, in fact, a tradeoff.

Revisiting NaNs in Python

Posted Sep 16, 2021 16:43 UTC (Thu) by Wol (subscriber, #4433) [Link]

What would be the best of both worlds is

default to var
explicit typing is enforced (with co-ercion?)
there's an option to disable the default

Working in a language where everything is varchar (including numbers, where the normal way of multiplying by 10 is to concatenate a zero!) I love it but really would appreciate the ability to strictly type stuff on occasion!

Cheers,
Wol

Revisiting NaNs in Python

Posted Sep 16, 2021 19:47 UTC (Thu) by ncm (guest, #165) [Link] (2 responses)

Did you miss that I spelled out how the Python library could, in fact, support both?

Revisiting NaNs in Python

Posted Sep 16, 2021 21:31 UTC (Thu) by NYKevin (subscriber, #129325) [Link] (1 responses)

Well, the library *certainly* does not have the ability to fail at compile-to-pyc time. That is not a thing that you can do, without modifying the interpreter (which, I charitably assume, you are not seriously proposing for a simple pure-Python library that could just as easily live as a standalone statistics.py file). The other idea (accept both a string and an enum) violates "There should be one-- and preferably only one --way to do it" from the Zen of Python. As for failing at runtime, that's exactly what they are proposing to do.

In short: Your ideas continue to elude my understanding.

Revisiting NaNs in Python

Posted Sep 17, 2021 23:47 UTC (Fri) by ncm (guest, #165) [Link]

Improving the interpreter doesn't seem like such a bad idea.

But it doesn't seem like my place to tell Python, in detail, how to make their language better. I just observe that they appear to be making bad choices for bad reasons, and I would like to encourage them to listen to their better selves.

Revisiting NaNs in Python

Posted Sep 16, 2021 9:52 UTC (Thu) by aragilar (subscriber, #122569) [Link] (4 responses)

While I am a strong fan of static analysis, I think *for the kind of problems where you're going to use the statistics library* (or the Python scientific stack), standard static analysis/type systems are less useful than almost any other situation, because none of them (excluding a few niche ones like https://herbie.uwplse.org/) are able to help with the issues that worry people, such as numerical stability, reproducibility or incorrect use of statistics. The comment from Christopher Barker about Python dynamism makes sense in that context.

Revisiting NaNs in Python

Posted Sep 16, 2021 10:48 UTC (Thu) by roc (subscriber, #30627) [Link] (3 responses)

What we need is a static type system that prevents incorrect use of statistics. That'd be cool.

Revisiting NaNs in Python

Posted Sep 19, 2021 20:39 UTC (Sun) by JanC_ (guest, #34940) [Link]

Even better: one that can prevent the use of incorrect data sets in statistics (just saw an OECD statistician correct someone on Twitter about that—to be fair the used UN data set was confusingly named, at least to a lay person).

Revisiting NaNs in Python

Posted Sep 23, 2021 14:07 UTC (Thu) by welinder (guest, #4699) [Link] (1 responses)

Beautiful idea in theory.

The Devil's Advocate in me, however, thinks that such a type system would just push people to use Excel. And if you are worried over NaNs and consistency in general for Python, rest assured that Excel will rob you of your sanity.

Revisiting NaNs in Python

Posted Sep 28, 2021 23:23 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

I'm pretty sure it's undecidable, or at least AI-complete. We have three different classes of error that I can think of:

  1. Dimensional analysis errors. This is detectable using libraries which already exist, such as pint. It's also got relatively little to do with statistics, and arguably isn't a very interesting problem in the first place.
  2. Incorrect accounting for bias, which you can further subdivide into sampling bias and systematic bias.
    • Systematic bias is a methodological issue, well outside the scope of Python and indeed statistical analysis in general. If you know you have a systematic bias, and can measure it, you may be able to correct for it, but this is a hard problem, which mostly reduces to "are you really sure you've measured and accounted for all of your systematic biases?" (see for example pollsters weighting their respondents to make their samples more representative of census numbers - and then consider the error bars on the average poll).
    • Sampling bias is a very complicated issue, but it is well studied and can be properly accounted for if you know what you are doing. The problem is that doing so requires a lot of information about the nature and source of your data, which may be difficult for a non-expert to express in a standardized way. For example, if you have data which is *almost* statistically significant, it is tempting to go collect more data and re-run the analysis, but that introduces hard-to-measure and impossible-to-detect sampling bias. Correcting for this is possible, but only if the software knows that you have done it. OTOH, if you use Bayesian statistics instead of frequentist statistics, then you just have to keep the posterior distribution from the previous analysis and use it as your new prior, but the Bayesian approach has its own pitfalls ("So is it statistically significant or not?" "Well, that depends on your priors. And we don't use p-values here, so it also depends on what you mean by 'significant.'").
  3. Incorrect interpretation of statistical results, especially the p-value and confidence intervals. I struggle to see what more the computer can do here, because these values are generally the final output of a statistical analysis. Humans misusing that information after the fact is not something which Python can fix.

Revisiting NaNs in Python

Posted Sep 16, 2021 10:17 UTC (Thu) by amck (subscriber, #7270) [Link] (2 responses)

Were context managers considered as an option? I'm of the opinion that the *default* must not be to ignore NaNs, but to explicitly handle them we could use:

    with ignore_nans():
        a = some_math(b)

where

    def some_math(b):
        # problematic routine buried deep
        return median(b)

It's explicit, doesn't require passing parameters deep everywhere, and copes with situations where new NaNs may be produced.

Revisiting NaNs in Python

Posted Sep 16, 2021 10:41 UTC (Thu) by rgb (guest, #57129) [Link]

With this setup it is easy to change external libraries' code behaviour, which might be a good or a bad thing.

Revisiting NaNs in Python

Posted Sep 17, 2021 0:07 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

My 2¢: Wrapping global state in a context manager is only marginally less bad than *not* wrapping it in a context manager. The better option is to not use global state in the first place.

Revisiting NaNs in Python

Posted Sep 23, 2021 22:06 UTC (Thu) by jreiser (subscriber, #11027) [Link]

Please actually read IEEE-754. (Yes, it costs some money to obtain a copy from IEEE.) Some NaNs are signalling NaNs which require that arithmetic operations generate an exception.

After that, amck has a good idea to use context managers. In the default context, an operation should classify each input (zero, denormal, normal, infinity, NaN) and act accordingly, probably generating an exception for each NaN, but otherwise yielding a result that has "no surprises". A maximum speed context might blindly assume that the hardware handles naive code appropriately, which will yield "surprises" like those mentioned in the article. Then there can be careful contexts such as treat NaN as missing data, treat NaN as infinity, etc. For the median subroutine an extra careful context could report additional information: Find the median of the non-NaNs. Find a second median of the original set, but acting as if each NaN is below the smallest non-NaN. Find a third median of the original set, but acting as if each NaN is above the largest non-NaN. Report a result which is an ordered list of all three found values.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds