
Some median Python NaNsense

By Jonathan Corbet
January 3, 2020
Anybody who has ever taken a numerical analysis course understands that floating-point arithmetic on computers is a messy affair. Even so, it is easy to underestimate just how messy things can be. This topic came to the fore in an initially unrelated python-ideas mailing-list thread; what should the Python statistics module do with floating-point values that are explicitly not numbers?

Kemal Diri doubtless did not mean to start a massive thread with this request to add a built-in function to the language to calculate the average of the values in a list. That request was quickly dismissed, but the developers went on to examine the seemingly strange behavior of the statistics module's median() function when presented with floating-point not-a-number values.

What's not-a-number?

Python integer values have no specific internal representation and are defined to cover an arbitrary range. The float type, instead, is restricted to what the underlying CPU implements; on almost all current hardware, floating-point numbers are described by IEEE 754. That format looks like this:

[Figure: the IEEE 754 floating-point format]

The interpretation of these numbers is relatively straightforward — except for the quirks. Since this is essentially a sign/magnitude format, it is capable of representing both positive and negative values of zero; both are treated as zero during mathematical operations and (most) comparisons, but they are distinct numbers. The exponent is a biased value that makes comparisons easier. And, importantly for this discussion, an exponent value that is all ones is interpreted in special ways.

In particular, if the exponent is all ones and the mantissa is all zeroes, then the value represents infinity (usually written as "inf"). The sign bit matters, so there can be both positive and negative inf values. Any other value of the mantissa, instead, creates a special value called "not-a-number" or NaN. These values can be created by the hardware when an operation results in a value that cannot be represented — arithmetic overflow, for example. NaN values are also often used in scientific code to represent missing data, though that was apparently not a part of the original intent for NaNs.
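
A quick interpreter session shows how these values arise in practice; this is ordinary CPython behavior on IEEE 754 hardware:

   >>> 1e308 * 10                     # arithmetic overflow yields infinity
   inf
   >>> float('inf') - float('inf')    # no representable answer: NaN
   nan
   >>> float('nan') + 1               # quiet NaNs propagate through arithmetic
   nan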

There is more. With 52 bits to play with (plus the sign bit), it is obviously possible to create a lot of unique NaN values. Some code uses the mantissa bits to record additional information. But the most significant bit of the mantissa is special: if that bit is set, the result is a "quiet NaN"; if it is clear, instead, the value is a "signaling NaN". Quiet NaNs will quietly propagate through computations; adding a number to a quiet NaN yields another quiet NaN as a result. Operations on signaling NaNs are supposed to generate an immediate error.
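
The bit-level structure can be inspected from Python with the struct module; a small illustration (the exact bit pattern of float('nan') can vary between platforms, but this is the usual one):

   >>> import struct
   >>> bits = struct.unpack('<Q', struct.pack('<d', float('nan')))[0]
   >>> hex(bits)                 # sign 0, exponent all ones, quiet bit set
   '0x7ff8000000000000'
   >>> bool(bits & (1 << 51))    # most-significant mantissa bit: a quiet NaN
   True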

(And, yes, some processors invert the meaning of the "quiet" bit, but that's more pain than we need to get into here.)

The problem with median()

The diversion of the initial thread came about when David Mertz pointed out that median() can yield some interesting results when the input data includes NaNs. To summarize his examples:

   >>> import statistics
   >>>
   >>> NaN = float('nan')
   >>> statistics.median([NaN, 1, 2, 3, 4])
   2
   >>> statistics.median([1, 2, 3, 4, NaN])
   3

This strikes Mertz as a nonsensical result: the return value from a function like median() should not depend on the order of the data passed to it, but that is exactly what happens if the input data contains NaN values.
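
The underlying cause is that every ordering comparison involving a NaN is false, so sorting simply leaves NaN values wherever the algorithm happens to drop them, and median() then picks the middle element of that "sorted" list:

   >>> NaN = float('nan')
   >>> NaN < 1, NaN > 1, NaN == NaN
   (False, False, False)
   >>> sorted([NaN, 1, 2, 3, 4])
   [nan, 1, 2, 3, 4]
   >>> sorted([1, 2, 3, 4, NaN])
   [1, 2, 3, 4, nan]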

There was no immediate consensus on whether this behavior is indeed a problem or not. Richard Damon asserted: "Getting garbage answers for garbage input isn't THAT unreasonable". Steven D'Aprano, who added the statistics module to the standard library, agreed: "this is a case of garbage in, garbage out". In both cases, the basis of the argument is that median() is documented to require orderable input values, and that is not what is provided in the example above.

Naturally, others see things differently. It turns out that IEEE 754 defines a "total order" that can be used to order all floating-point values. In this order, signaling NaNs are bigger than infinity (in both the positive and negative directions), and quiet NaNs are bigger still. So some felt that this total order should be used in calculating the median of a list of floating-point values. Brendan Barnwell went a bit further by arguing that NaN values should be treated like other numbers: "The things that computers work with are floats, and NaN is a float, so in any relevant sense it is a number". D'Aprano disagreed strongly:

If something doesn't quack like a duck, doesn't swim like a duck, and doesn't walk like a duck, and is explicitly called Not A Duck, would we insist that it's actually a duck?

Despite insisting that median() is within its rights to return strange results in this situation, D'Aprano also acknowledged that silently returning those results might not be the best course of action. All that is needed is to decide what that best course would actually be.

What should median() do?

Much of the discussion was thus understandably focused on what the correct behavior for median() (and for other functions in the statistics module that have similar issues) should be. There were two core concerns here: what the proper behavior is, and the performance cost of implementing a different solution.

The cost is the easier concern to get a handle on. median() currently works by sorting the supplied values into an ordered list, then taking the value in the middle. Dealing with NaN values would require adding extra tests to the sort, slowing it down. D'Aprano figured that this change would slow median() by a factor of four to eight. Christopher Barker did some tests and measured a slowdown of between two and ten times, depending on how the checking for NaN values is done. (And, for the curious, testing for NaN is not as easy as one might hope.) Some of this slowdown could be addressed by switching to a smarter way of calculating the median.
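
As for testing, the obvious checks work for floats, but the statistics module also accepts other numeric types, such as Decimal, which carry their own NaNs; a quick illustration of the usual tests:

   >>> import math
   >>> from decimal import Decimal
   >>> x = float('nan')
   >>> x == x                    # NaN never compares equal, even to itself
   False
   >>> math.isnan(x)             # the usual test for floats
   True
   >>> Decimal('NaN').is_nan()   # Decimal NaNs have their own test
   True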

Once those little problems have been dealt with, there is still the issue of what median() should actually do. A number of alternatives surfaced during the discussion:

  • Keep the current behavior, which works well enough for most users (who may never encounter a NaN value in their work) and doesn't hurt performance. Users who want specialized NaN handling could use a library like NumPy instead.
  • Simply ignore NaNs.
  • Raise an exception if the input data contains NaNs. This applies to quiet NaNs; if a signaling NaN is encountered, D'Aprano said, an exception should always be raised.
  • Return NaN if the input data contains NaNs.
  • Move the NaN values to one end of the list.
  • Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits); a sketch of such an ordering appears after this list.
  • Probably one or two others as well.
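
To make the total-order option concrete, here is one way such an ordering could be written in pure Python, using the standard trick of remapping each float's bit pattern onto an integer sort key. This is purely illustrative: the function name is made up, it handles only floats, and a real implementation would also have to order ints, Fractions, and Decimals:

   import struct

   def total_order_key(x):
       # Map a float's bits onto an integer that sorts in IEEE 754 totalOrder:
       # flip all bits for negative values, set the top bit for positive ones.
       bits = struct.unpack('<Q', struct.pack('<d', x))[0]
       if bits & (1 << 63):
           return bits ^ (2**64 - 1)
       return bits | (1 << 63)

   values = [float('nan'), 1.0, float('-inf'), -0.0, 0.0, float('inf')]
   print(sorted(values, key=total_order_key))
   # [-inf, -0.0, 0.0, 1.0, inf, nan]

With this key, NaNs with the sign bit set sort before -inf and positive NaNs sort after +inf, as the totalOrder predicate specifies.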

D'Aprano quickly came to the conclusion that the community is unlikely to come to a consensus on what the proper handling for NaN values is. So he has proposed adding a keyword argument (called nan_policy) to median() that would allow the user to pick between a subset of the options above. The default behavior would be to raise an exception.
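
As a rough sketch of what such an interface might look like, consider a simple wrapper; the policy names below are invented for illustration and need not match whatever the statistics module eventually adopts:

   import math
   import statistics

   def median(data, nan_policy='raise'):
       # Illustrative wrapper only, not the statistics module's design.
       data = list(data)

       def is_nan(x):
           return isinstance(x, float) and math.isnan(x)

       if any(is_nan(x) for x in data):
           if nan_policy == 'raise':
               raise ValueError('NaN value in input data')
           elif nan_policy == 'return-nan':
               return float('nan')
           elif nan_policy == 'ignore':
               data = [x for x in data if not is_nan(x)]
       return statistics.median(data)

   print(median([1, 2, 3, 4, float('nan')], nan_policy='ignore'))   # 2.5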

This proposal appears to have brought the discussion to an orderly close; nobody seems to strongly object to handling the problem in this way. All that is left is to actually write the code to implement this behavior. After that, one would hope, not-a-number handling in the statistics module would be not-a-problem.

Index entries for this article
Python
Floating point



Some median Python NaNsense

Posted Jan 4, 2020 6:10 UTC (Sat) by sethkush (subscriber, #107552) [Link]

After doing work with other statistics packages (I've never used Python), I would be mildly annoyed if I was using something that didn't ignore NaN values. The nan_policy solution does seem like the best way to keep everyone happy though.

Some median Python NaNsense

Posted Jan 4, 2020 20:40 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (4 responses)

Going through those options and my reactions to them:

> Keep the current behavior, which works well enough for most users (who may never encounter a NaN value in their work) and doesn't hurt performance. Users who want specialized NaN handling could use a library like NumPy instead.

That seems quite backwards to me. If you wanted high performance, you'd already be using NumPy anyway. Standard library modules should favor correctness over speed.

> Simply ignore NaNs.

I don't like this. NaN is not the same thing as SQL NULL; it means "something went wrong," not "no data available." Ignoring it is unlikely to produce the answer you wanted.

> Raise an exception if the input data contains NaNs. This applies to quiet NaNs; if a signaling NaN is encountered, D'Aprano said, an exception should always be raised.

I've never been a fan of the quiet/signaling distinction, TBH. Python doesn't provide any portable means of distinguishing between signaling and quiet NaNs, or at least none that I'm aware of,* and I've never heard of "regular" Python operations such as x + y raising an exception on signalling NaNs. So I'd prefer not to try and treat them differently here. It could cause much confusion for (IMHO) little benefit. But I would be OK with always raising an exception, regardless of NaN type.

> Return NaN if the input data contains NaNs.

That's what every other operation on NaNs is supposed to do, and seems obviously correct (or at least, not wrong) to me.

> Move the NaN values to one end of the list.

> Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits).

I could accept either of these as "not too ridiculous" if it were documented to work that way. But if you're going to try and sort the NaNs, you might as well sort them in the way the IEEE says you're supposed to, instead of making up a new standard and https://xkcd.com/927/

* You can take them apart using one of struct, ctypes, etc. and inspect the byte representation. That does not count; I was thinking of something more along the lines of math.is_quiet_nan(), which does not exist.

Some median Python NaNsense

Posted Jan 5, 2020 19:15 UTC (Sun) by khim (subscriber, #9252) [Link] (3 responses)

> But if you're going to try and sort the NaNs, you might as well sort them in the way the IEEE says you're supposed to, instead of making up a new standard and https://xkcd.com/927/

Python would need to invent something new anyway. I mean: it's really easy to sort floats in total order - just use the same bit sequence as an integer and compare them as a signed number. But... Python also supports integers which couldn't fit in the float range... how are THESE supposed to be ordered WRT infs and NaNs? IEEE 754 doesn't answer that question - which means you need a new standard anyway.

Some median Python NaNsense

Posted Jan 6, 2020 1:39 UTC (Mon) by NYKevin (subscriber, #129325) [Link] (2 responses)

IEEE already pushes all the "weird" numbers to the ends. All you really have to do is ensure that large integers are sorted into the "finite regular numbers" bucket with all the smaller floats, and then compare them numerically.

Some median Python NaNsense

Posted Jan 6, 2020 13:51 UTC (Mon) by rbanffy (guest, #103898) [Link] (1 responses)

Sorting infinity and -infinity isn't much of a problem, but where does a number go when you have an integer that is in the middle of the range of possible NaNs?

Some median Python NaNsense

Posted Jan 7, 2020 0:49 UTC (Tue) by Baughn (subscriber, #124425) [Link]

NaNs are larger than infinity, so no such integer exists.

Some median Python NaNsense

Posted Jan 4, 2020 21:41 UTC (Sat) by jtaylor (subscriber, #91739) [Link] (3 responses)

Not returning NaN for this input in median is not a case of garbage in, garbage out. What the statistics module is doing is taking garbage and pretending in the output that it is not garbage anymore.
Garbage in, garbage out in the case of NaN numbers is: NaN in, NaN out.
Doing otherwise, you lose the information that something went wrong (unless you pass through the floating-point exceptions that were raised as a second channel).

That is why numpy.median returns NaN when there is a NaN in the input.
This does come with some computational cost, but it is not so bad: NaNs sort to the end of the array (the sign bit is ignored in all NaN operations, including sorting), so you only have to check the values after the median for a NaN, which can be done very efficiently (easily vectorized with SIMD) compared to the median selection operation itself.
You can also use a quickselect/introselect with pivot storage to reduce the amount of data you have to search to, on average, 25% for uniformly distributed data (this is what NumPy does: https://github.com/numpy/numpy/blob/master/numpy/lib/func...).

Though for Python's statistics module this may be more costly, as it has to consider mixed-type input lists.

That said, it is very common for missing values to be represented as NaN in practical numerical data. For this case NumPy has dedicated functions (nanmedian, nanpercentile, ...) which treat NaNs as missing data and ignore them.
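
For readers who have not seen the NumPy behavior being described, it looks roughly like this (depending on the NumPy version, np.median may also emit a RuntimeWarning here):

   >>> import numpy as np
   >>> a = np.array([1.0, 2.0, 3.0, 4.0, np.nan])
   >>> np.median(a)           # NaN in, NaN out
   nan
   >>> np.nanmedian(a)        # NaN treated as missing data and ignored
   2.5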

Some median Python NaNsense

Posted Jan 6, 2020 4:02 UTC (Mon) by areilly (subscriber, #87829) [Link] (2 responses)

It shouldn't be necessary to do a vectorised search after the median to find a NaN: they sort to the end, so just check the last (sorted) element. If it's a NaN, return NaN.

If you wanted to support the idea that NaNs represented missing data, and return a median of the rest, then yes you need to search to find the first one. Since your data is sorted by that point, a binary search may (or may not!) be faster than a vectorised linear search. There are probably machine-dependent length breakpoints.

Some median Python NaNsense

Posted Jan 6, 2020 11:34 UTC (Mon) by jtaylor (subscriber, #91739) [Link] (1 responses)

Yes, if you are sorting to find the median it can be done by checking the last element.
But finding the median does not require a full sort; for example, the quickselect (https://en.wikipedia.org/wiki/Quickselect) algorithm (which is only the partition step of a quicksort) finds the median (or any quantile) in O(N) time, while sorting requires O(N log N) time.
With quickselect, only certain pivot points in the array carry an ordering guarantee; e.g., for the median it is guaranteed that all positions after the median are larger and all before it are smaller, but the ordering within these two partitions is undefined. So the NaNs may be anywhere in the top 50% of the array (likely confined even higher, as you will usually need more than one partition to find the median).
The lower complexity more than makes up for the need to search part of the array linearly to handle NaN.
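
For reference, a minimal pure-Python quickselect looks roughly like this; it is an illustrative sketch rather than NumPy's introselect and, like any comparison-based selection, it is not NaN-safe:

   import random

   def quickselect(data, k):
       # Return the k-th smallest element (0-based) in expected O(N) time.
       data = list(data)
       lo, hi = 0, len(data) - 1
       while lo < hi:
           pivot = data[random.randint(lo, hi)]
           # Three-way partition of the current range around the pivot.
           lt = [x for x in data[lo:hi + 1] if x < pivot]
           eq = [x for x in data[lo:hi + 1] if x == pivot]
           gt = [x for x in data[lo:hi + 1] if x > pivot]
           data[lo:hi + 1] = lt + eq + gt
           if k < lo + len(lt):
               hi = lo + len(lt) - 1      # answer is in the "less than" part
           elif k < lo + len(lt) + len(eq):
               return pivot               # answer equals the pivot
           else:
               lo = lo + len(lt) + len(eq)
       return data[lo]

   # Median of an odd-length list: the element at index len(data) // 2.
   print(quickselect([3, 1, 4, 1, 5, 9, 2, 6, 5], 4))   # prints 4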

Some median Python NaNsense

Posted Jan 7, 2020 22:03 UTC (Tue) by nivedita76 (subscriber, #121790) [Link]

It was a little weird to find people worrying about performance of median when the implementation actually sorts the input rather than using quickselect.

Some median Python NaNsense

Posted Jan 5, 2020 16:39 UTC (Sun) by scientes (guest, #83068) [Link] (2 responses)

Please use "significand" instead of "mantissa".

Some median Python NaNsense

Posted Jan 6, 2020 13:41 UTC (Mon) by Paf (subscriber, #91811) [Link] (1 responses)

Why should that be preferred?

Some median Python NaNsense

Posted Jan 6, 2020 17:45 UTC (Mon) by rgmoore (✭ supporter ✭, #75) [Link]

As I understand it, "significand" is the preferred terminology of IEEE 754. Mantissa is used for logarithms. Because logarithms treat the part after the decimal differently from floating point numbers, they wanted to use a different term.

Some median Python NaNsense

Posted Jan 5, 2020 16:48 UTC (Sun) by scientes (guest, #83068) [Link] (4 responses)

> Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits).

This is not a viable solution, as there is no such thing as negative NaN in the IEEE 754 standard, and as such ARM processors normalize negative NaN to positive NaN, clearing the sign bit, which differs from the behavior of x86_64 processors and would mean you would get different results on different CPUs.

Some median Python NaNsense

Posted Jan 5, 2020 19:19 UTC (Sun) by khim (subscriber, #9252) [Link] (3 responses)

That's if you do a stupid thing and try to compare them as floats. Treat the same bit sequence as an integer and use the regular comparison operator - you always get the same correct result on all CPUs.

Trouble comes when you need to compare a float and a non-float (e.g. a large integer). IEEE 754 doesn't even define a "total order" for cases like these.

Some median Python NaNsense

Posted Jan 7, 2020 18:39 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (2 responses)

That's too late. The architecture has already normalized them by the time you call median().

Basically, the problem is not in median() at all. The problem is that the architecture changed the numbers out from under you, so you are passing a different list of numbers under ARM than under x86. "Most people" won't notice that, because it's "just" replacing one NaN-valued bit pattern with another, and "most people" don't care about which NaN is which. But if you're casting to int, then you'll certainly notice differing bit patterns.

Some median Python NaNsense

Posted Jan 10, 2020 19:57 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

That's nonsense. Both NEON and SSE use THE SAME registers for both integer and floating point values. And instructions like movhlps don't even care whether they move integer or floating point values.

Now, certain floating point operations may normalize NaNs... but then, they are supposed to change values anyway; it's unclear why you think NaNs shouldn't be affected at that point. I mean: should "1.0 + NaN" give you the same NaN or a different NaN? IEEE 754 doesn't say, and the solution is simple: just don't do it.

P.S. Now, if you would say that PYTHON destroys these NaN values... this could happen; I don't work with Python at that level all that often. But I happen to DO work with ARM⇔x86 translation - which means I deal with all these changes regularly... neither x86 nor ARM changes NaNs unless they are supposed to.

Some median Python NaNsense

Posted Jan 10, 2020 20:03 UTC (Fri) by anselm (subscriber, #2796) [Link]

Python doesn't destroy NaN values if the underlying C implementation doesn't. At the end of the day, leaving aside unusual implementations such as JPython, the Python interpreter is just another program written in C, and for floating-point arithmetic it does whatever the C implementation on the system does (which in this day and age is presumably to rely on the floating-point instructions of the host processor).

What makes the NaN check expensive?

Posted Jan 5, 2020 17:51 UTC (Sun) by jansson (subscriber, #2227) [Link] (4 responses)

Thanks for an interesting article! :)

If all input numbers, including NaNs, are sorted, and all NaNs end up first or last after sorting, then we need to check only the first and last element after the sort. If either of those two values is a NaN, then we have a NaN in the input; otherwise there is no NaN in the input. How can these two checks for NaN be expensive?

What did I miss?

What makes the NaN check expensive?

Posted Jan 5, 2020 19:21 UTC (Sun) by khim (subscriber, #9252) [Link]

You miss the fact that right now the Python sorting algorithm WOULDN'T put them first or last. That means, before you can check these two elements, you need to implement a brand new way to compare numbers (and a brand new sort on top of that) which would lead to a total order when applied to floats.

What makes the NaN check expensive?

Posted Jan 5, 2020 20:12 UTC (Sun) by epa (subscriber, #39769) [Link] (2 responses)

Moreover, if you are sorting the list then you are already doing an O(N log N) operation, so a linear scan to find any NaN values would be a tiny fraction of the total work on large lists, and for small lists it would be fast enough anyway not to matter. As others said you would not be using vanilla Python if you needed high-performance numeric operations. So I don’t understand the objections based on performance. Perhaps if you have vast numbers of short lists and need the median of each... but even then you’d surely use a different tool.

What makes the NaN check expensive?

Posted Jan 7, 2020 18:43 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (1 responses)

You don't sort to get median, you do quickselect (or another algorithm), which is O(N) (in the average case).

(Granted, a linear scan for NaN is also O(N), and probably has a better cache hit rate to boot. So I agree with you that performance really shouldn't be a problem here.)

What makes the NaN check expensive?

Posted Jan 7, 2020 22:06 UTC (Tue) by nivedita76 (subscriber, #121790) [Link]

You /should/, but the python implementation that's broken actually does a full sort.

Some median Python NaNsense

Posted Jan 6, 2020 14:10 UTC (Mon) by gdt (subscriber, #6284) [Link] (7 responses)

The article possibly could have mentioned the other half of the tension: non-trivial statistical processing requires the concept of a "missing value"; that is, data which was asked of the respondent but which was not supplied.

Missing values can't be ignored for many statistical operations, as they increase the error (handwavingly: we can be less certain of a statistic when less of the population responds).

Missing values aren't that relevant to calculating the median, but they are relevant to calculating more advanced statistics. Having some statistical functions not accept missing values and other statistical functions accept missing values is error-prone. Each dataset would have two data structures: one with missing values and one without. Inevitably, the data structure without missing values would be passed to a statistical function which can process missing values, and that function will silently give an incorrect result. Stats processing generally takes a dim view of "hospital pass" APIs and of "optimised" functions, on the grounds that "fast but sometimes wrong = wrong". Usually, once the processing to clean up the dataset is done, the dataset is passed to the analysis functions unchanged.

Missing values themselves are useful as data: a pattern of missing values can uncover non-statistical bias in data, which in turn has consequences for the choice of analysis.

The trouble comes when statistical packages look at the IEEE NaN as a way to encode the missing value.

Some median Python NaNsense

Posted Jan 7, 2020 13:42 UTC (Tue) by cpitrat (subscriber, #116459) [Link] (4 responses)

If only there was a way to represent a missing value in Python. Some keyword that would inform that instead of containing a value, this variable contains none.

Some median Python NaNsense

Posted Jan 8, 2020 0:02 UTC (Wed) by KaiRo (subscriber, #1987) [Link] (1 responses)

Good idea! Let's call that "No Number Encountered" or short "NoNE". What do you think?

Some median Python NaNsense

Posted Jan 8, 2020 16:02 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

"No Actual Number Indicated" would go well with "nani" and the associated meme (https://www.urbandictionary.com/define.php?term=omae%20wa...)

Some median Python NaNsense

Posted Jan 10, 2020 2:29 UTC (Fri) by gdt (subscriber, #6284) [Link] (1 responses)

None isn't that useful for processing large datasets where memory efficiency matters, as typically generated by scientific instruments. Thus SciPy's overloading of NaN. Putting that another way:

   >>> import array
   >>> a = array.array('d', [1.0, 2.0, 3.0, None])
   TypeError: must be real number, not NoneType

Note that I am not arguing for overloading NaN -- I don't have a dog in this fight -- I'm just using my background as a statistics professional to explain why choices your tone suggests are unreasonable have been made by people acting reasonably.

Some median Python NaNsense

Posted Jan 10, 2020 11:19 UTC (Fri) by cpitrat (subscriber, #116459) [Link]

Yes, but here we're talking about the statistics package, not NumPy, and the statistics package already takes mixed-type input lists IIUC.

Some median Python NaNsense

Posted Jan 8, 2020 17:04 UTC (Wed) by Hattifnattar (subscriber, #93737) [Link] (1 responses)

But what if someone needs higher granularity?

Say you have a survey, and you did not get numbers from some people because they cannot be reached, and from others, because they refuse to answer. Suppose you want to deal with these two types of "no value" differently.
Should Python support it too?

It seems to me it's a task for a particular library (in an object-oriented language!) to design a type that accurately represents its domain, rather than blindly using (and abusing) a native type that mostly, but not completely, does the job.
Then under the hood optimize it to your heart's content...

Some median Python NaNsense

Posted Jan 10, 2020 2:31 UTC (Fri) by gdt (subscriber, #6284) [Link]

No higher granularity is needed for a statistics library: either the number is contributing both a value and to the population or the number is contributing solely to the population. Reason codes for non-response are certainly useful for the application to maintain, but they are irrelevant to a statistics library.

Some median Python NaNsense

Posted Jan 9, 2020 14:47 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

This is Python! One advantage of this language is that it's possible to update an API without breaking it, by introducing a new keyword argument with a default value.

So why can't all of the sensible alternatives be implemented and made selectable with a keyword argument? The default would be to carry on doing exactly what it does at the moment, so as not to break anything that (for good or ill) currently relies on that behaviour.

Some median Python NaNsense

Posted Jan 19, 2020 11:07 UTC (Sun) by jondo (guest, #69852) [Link]

In this context it's interesting to see that Pandas is also trying to move away from using numpy.nan (i.e. the IEEE quiet NaN) as the "missing" indicator. The goal is to deal with "missing" uniformly across all data types, instead of numpy.nan for floats, pandas.NaT for "not-a-time", and None for everything else (integers, strings, ...). Internally, the "missing" information is kept as a separate binary mask.

See the documentation of the upcoming 1.0.0 release: https://pandas.pydata.org/pandas-docs/version/1.0.0/user_...

I am looking forward to it, because in my work I currently have to represent integer data as float in order to deal efficiently with missing values.
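
As a small illustration of the nullable dtypes being described (pandas 1.0; the exact repr formatting may vary between versions):

   >>> import pandas as pd
   >>> pd.array([1, 2, None], dtype="Int64")
   <IntegerArray>
   [1, 2, <NA>]
   Length: 3, dtype: Int64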


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds