Some median Python NaNsense
Kemal Diri doubtless did not mean to start a massive thread with this request to add a built-in function to the language to calculate the average of the values in a list. That request was quickly dismissed, but the developers went on to discuss the seemingly strange behavior of the statistics module's median() function when presented with floating-point not-a-number values.
What's not-a-number?
Python integer values have no specific internal representation and are defined to cover an arbitrary range. The float type, instead, is restricted to what the underlying CPU implements; on almost all current hardware, floating-point numbers are described by IEEE 754. A 64-bit IEEE 754 value consists of a sign bit, an 11-bit exponent, and a 52-bit mantissa.
The interpretation of these numbers is relatively straightforward — except for the quirks. Since this is essentially a sign/magnitude format, it is capable of representing both positive and negative values of zero; both are treated as zero during mathematical operations and (most) comparisons, but they are distinct numbers. The exponent is a biased value that makes comparisons easier. And, importantly for this discussion, an exponent value that is all ones is interpreted in special ways.
In particular, if the exponent is all ones and the mantissa is all zeroes, then the value represents infinity (usually written as "inf"). The sign bit matters, so there can be both positive and negative inf values. Any other value of the mantissa, instead, creates a special value called "not-a-number" or NaN. These values can be created by the hardware when an operation results in a value that cannot be represented — arithmetic overflow, for example. NaN values are also often used in scientific code to represent missing data, though that was apparently not a part of the original intent for NaNs.
There is more. With 52 bits to play with (plus the sign bit), it is obviously possible to create a lot of unique NaN values. Some code uses the mantissa bits to record additional information. But the most significant bit of the mantissa is special: if that bit is set, the result is a "quiet NaN"; if it is clear, instead, the value is a "signaling NaN". Quiet NaNs will quietly propagate through computations; adding a number to a quiet NaN yields another quiet NaN as a result. Operations on signaling NaNs are supposed to generate an immediate error.
(And, yes, some processors invert the meaning of the "quiet" bit, but that's more pain than we need to get into here.)
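Those bit patterns can be inspected from within Python using only the standard library. A quick sketch (the exact NaN that float('nan') produces is platform-dependent; the values shown in the comments are what x86 hardware generates):

    import math
    import struct

    def float_bits(x):
        # Reinterpret the 64 bits of a double as an unsigned integer.
        return struct.unpack('<Q', struct.pack('<d', x))[0]

    print(hex(float_bits(float('inf'))))  # 0x7ff0000000000000: exponent all ones, mantissa zero
    print(hex(float_bits(float('nan'))))  # 0x7ff8000000000000: the quiet bit is set
    print(math.isnan(float('inf') - float('inf')))  # True: an unrepresentable result yields a NaN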
The problem with median()
The diversion of the initial thread came about when David Mertz pointed out that median() can yield some interesting results when the input data includes NaNs. To summarize his examples:
    >>> import statistics
    >>> NaN = float('nan')
    >>> statistics.median([NaN, 1, 2, 3, 4])
    2
    >>> statistics.median([1, 2, 3, 4, NaN])
    3
This strikes Mertz as a nonsensical result: the return value from a function like median() should not depend on the order of the data passed to it, but that is exactly what happens if the input data contains NaN values.
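The mechanism behind this order dependence is that every ordering comparison involving a NaN evaluates to false, so sorting a list that contains NaNs does not reliably move them anywhere; they tend to stay where they started, shifting which element ends up in the middle:

    >>> NaN = float('nan')
    >>> NaN < 1, NaN > 1, NaN == NaN    # all comparisons involving NaN are false
    (False, False, False)
    >>> sorted([NaN, 1, 2, 3, 4])       # the NaN does not move...
    [nan, 1, 2, 3, 4]
    >>> sorted([1, 2, 3, 4, NaN])       # ...so the "middle" element differs
    [1, 2, 3, 4, nan]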
There was no immediate consensus on whether this behavior is indeed a problem or not. Richard Damon asserted: "Getting garbage answers for garbage input isn't THAT unreasonable". Steven D'Aprano, who added the statistics module to the standard library, agreed: "this is a case of garbage in, garbage out". In both cases, the basis of the argument is that median() is documented to require orderable input values, and that is not what is provided in the example above.
Naturally, others see things differently. It turns out that IEEE 754 defines a "total order" that can be used to order all floating-point values. In this order, signaling NaNs are bigger than infinity (in both the positive and negative directions), and quiet NaNs are even bigger than that. So some felt that this total order should be used in calculating the median of a list of floating-point values.
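For the curious, the usual bit-twiddling trick for sorting floats in totalOrder is to reinterpret each value's bits as an integer, flipping all of the bits of negative values and setting the sign bit of positive ones; ordinary integer comparison then does the right thing. Here is a minimal sketch (total_order_key() is an illustrative name, not anything proposed on the list):

    import struct

    def total_order_key(x: float) -> int:
        # Reinterpret the double's 64 bits as an unsigned integer.
        bits = struct.unpack('<Q', struct.pack('<d', x))[0]
        if bits & (1 << 63):                  # negative: flip all bits
            return bits ^ 0xFFFFFFFFFFFFFFFF
        return bits | (1 << 63)               # positive: set the sign bit

    values = [float('nan'), float('-inf'), -0.0, 0.0, float('inf'), 1.5]
    print(sorted(values, key=total_order_key))
    # [-inf, -0.0, 0.0, 1.5, inf, nan] (a NaN with its sign bit set would sort first)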
Brendan Barnwell went a bit further by arguing that NaN values should be treated like other numbers: "The things that computers work with are floats, and NaN is a float, so in any relevant sense it is a number". D'Aprano disagreed strongly.
Despite insisting that median() is within its rights to return strange results in this situation, D'Aprano also acknowledged that silently returning those results might not be the best course of action. All that is needed is to decide what that best course would actually be.
What should median() do?
Much of the discussion was thus understandably focused on what the correct behavior for median() (and for other functions in the statistics module that have similar issues) should be. There were two core concerns here: what the proper behavior is, and the performance cost of implementing a different solution.
The cost is the easier concern to get a handle on. median() currently works by sorting the supplied values into an ordered list, then taking the value in the middle. Dealing with NaN values would require adding extra tests to the sort, slowing it down. D'Aprano figured that this change would slow median() down by a factor of four to eight. Christopher Barker did some tests and measured slowdowns ranging from two to ten times, depending on how the checking for NaN values is done. (And, for the curious, testing for NaN is not as easy as one might hope.) Some of this slowdown could be addressed by switching to a smarter way of calculating the median.
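For plain floats, at least, the test itself is simple enough; the complications come from the other numeric types that the statistics module accepts. The standard approaches:

    >>> import math
    >>> x = float('nan')
    >>> math.isnan(x)      # the explicit test
    True
    >>> x != x             # a NaN is the only value that is unequal to itself
    True
    >>> from decimal import Decimal
    >>> math.isnan(Decimal('nan'))   # Decimal has (quiet) NaNs of its own
    True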
Once those little problems have been dealt with, there is still the issue of what median() should actually do. A number of alternatives surfaced during the discussion:
- Keep the current behavior, which works well enough for most users (who may never encounter a NaN value in their work) and doesn't hurt performance. Users who want specialized NaN handling could use a library like NumPy instead.
- Simply ignore NaNs.
- Raise an exception if the input data contains NaNs. This applies to quiet NaNs; if a signaling NaN is encountered, D'Aprano said, an exception should always be raised.
- Return NaN if the input data contains NaNs.
- Move the NaN values to one end of the list.
- Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits).
- Probably one or two others as well.
D'Aprano quickly concluded that the community is unlikely to reach a consensus on the proper handling of NaN values. So he has proposed adding a keyword argument (called nan_policy) to median() that would allow the user to pick from a subset of the options above; the default behavior would be to raise an exception.
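The option names have not been settled, so the following is only a sketch of how such an interface might behave, with made-up policy values; it is emphatically not D'Aprano's actual patch:

    import math
    import statistics

    def median_with_policy(data, nan_policy='raise'):
        """Illustrative wrapper only; the real work would happen inside median()."""
        def is_nan(x):
            return isinstance(x, float) and math.isnan(x)
        values = list(data)
        if any(is_nan(x) for x in values):
            if nan_policy == 'raise':          # the proposed default
                raise ValueError('NaN in input data')
            if nan_policy == 'return-nan':
                return float('nan')
            if nan_policy == 'ignore':
                values = [x for x in values if not is_nan(x)]
        return statistics.median(values)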
This proposal appears to have brought the discussion to an orderly close; nobody seems to strongly object to handling the problem in this way. All that is left is to actually write the code to implement this behavior. After that, one would hope, not-a-number handling in the statistics module would be not-a-problem.
Index entries for this article: Python; Floating point
Posted Jan 4, 2020 6:10 UTC (Sat)
by sethkush (subscriber, #107552)
[Link]
Posted Jan 4, 2020 20:40 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (4 responses)
> Keep the current behavior, which works well enough for most users (who may never encounter a NaN value in their work) and doesn't hurt performance. Users who want specialized NaN handling could use a library like NumPy instead.
That seems quite backwards to me. If you wanted high performance, you'd already be using NumPy anyway. Standard library modules should favor correctness over speed.
> Simply ignore NaNs.
I don't like this. NaN is not the same thing as SQL NULL; it means "something went wrong," not "no data available." Ignoring it is unlikely to produce the answer you wanted.
> Raise an exception if the input data contains NaNs. This applies to quiet NaNs; if a signaling NaN is encountered, D'Aprano said, an exception should always be raised.
I've never been a fan of the quiet/signaling distinction, TBH. Python doesn't provide any portable means of distinguishing between signaling and quiet NaNs, or at least none that I'm aware of,* and I've never heard of "regular" Python operations such as x + y raising an exception on signaling NaNs. So I'd prefer not to try and treat them differently here. It could cause much confusion for (IMHO) little benefit. But I would be OK with always raising an exception, regardless of NaN type.
> Return NaN if the input data contains NaNs.
That's what every other operation on NaNs is supposed to do, and seems obviously correct (or at least, not wrong) to me.
> Move the NaN values to one end of the list.
> Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits).
I could accept either of these as "not too ridiculous" if it were documented to work that way. But if you're going to try and sort the NaNs, you might as well sort them in the way the IEEE says you're supposed to, instead of making up a new standard and https://xkcd.com/927/
* You can take them apart using one of struct, ctypes, etc. and inspect the byte representation. That does not count; I was thinking of something more along the lines of math.is_quiet_nan(), which does not exist.
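The struct-based approach the footnote alludes to might look something like this; the function name echoes the wished-for math.is_quiet_nan(), which does not actually exist:

    import math
    import struct

    def is_quiet_nan(x: float) -> bool:
        if not math.isnan(x):
            return False
        bits = struct.unpack('<Q', struct.pack('<d', x))[0]
        # Bit 51, the most significant mantissa bit, is the quiet bit
        # on hardware that follows the usual IEEE 754 convention.
        return bool(bits & (1 << 51))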
Posted Jan 5, 2020 19:15 UTC (Sun)
by khim (subscriber, #9252)
[Link] (3 responses)
Python would need to invent something new anyway. I mean: it's really easy to sort floats in total order - just reinterpret the bit sequence as an integer and compare as a signed number. But... Python also supports integers that don't fit in the float range... how are THESE supposed to be ordered WRT infs and NaNs? IEEE 754 doesn't answer that question - which means you need a new standard anyway.
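Python, for what it's worth, already defines exact comparisons between ints and floats, even for ints too large to represent as a float; it is only NaN that remains unordered against everything:

    >>> big = 2**1024               # too large for a 64-bit double
    >>> float(big)
    Traceback (most recent call last):
      ...
    OverflowError: int too large to convert to float
    >>> big < float('inf')          # yet ordering comparisons are exact
    True
    >>> big < float('nan'), big > float('nan')
    (False, False)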
Posted Jan 6, 2020 1:39 UTC (Mon)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Posted Jan 6, 2020 13:51 UTC (Mon)
by rbanffy (guest, #103898)
[Link] (1 responses)
Posted Jan 7, 2020 0:49 UTC (Tue)
by Baughn (subscriber, #124425)
[Link]
Posted Jan 4, 2020 21:41 UTC (Sat)
by jtaylor (subscriber, #91739)
[Link] (3 responses)
In the case of NaN inputs, "garbage in, garbage out" is: NaN in, NaN out. Doing otherwise, you lose the information that something went wrong (unless you use the floating-point exceptions that can be raised as a second channel). That is why numpy.median returns NaN when there is a NaN in the input.
This does come with some computational cost, but it is not so bad: NaNs sort to the end of the array (the sign bit is ignored in all NaN operations, including sorting), so you only have to check the values after the median for a NaN, which can be done very efficiently (easily vectorized with SIMD) compared to the median selection operation itself. You can also use a quickselect/introselect with pivot storage to reduce the amount of data you have to search to, on average, 25% for uniformly distributed data (this is what numpy does: https://github.com/numpy/numpy/blob/master/numpy/lib/func...).
Though for Python's statistics module this may be more costly, as it has to consider mixed-type input lists.
That said, it is very common for missing values to be represented as NaN in practical numerical data. For this case numpy has dedicated functions (nanmedian, nanpercentile, ....) which treat NaNs as missing data and ignore them.
Posted Jan 6, 2020 4:02 UTC (Mon)
by areilly (subscriber, #87829)
[Link] (2 responses)
If you wanted to support the idea that NaNs represented missing data, and return a median of the rest, then yes you need to search to find the first one. Since your data is sorted by that point, a binary search may (or may not!) be faster than a vectorised linear search. There are probably machine-dependent length breakpoints.
Posted Jan 6, 2020 11:34 UTC (Mon)
by jtaylor (subscriber, #91739)
[Link] (1 responses)
But finding the median does not require a full sort; for example, the quickselect (https://en.wikipedia.org/wiki/Quickselect) algorithm (which is just the partition step of a quicksort) finds the median (or any quantile) in O(N) time, while sorting requires O(N log N) time.
With quickselect, only certain pivot points in the array have an ordering guarantee; e.g. for the median it is guaranteed that all positions after the median are larger, and all before smaller, than the median, but the ordering within these two partitions is undefined. So the NaNs may be anywhere in the top 50% of the array (likely higher, as you will usually need more than one partition to find the median).
The lower complexity more than makes up for the need to search part of the array linearly to handle NaN.
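For readers who have not seen the partition idea, here is a bare-bones quickselect sketch; it assumes NaN-free input (with NaNs present, the three-way partition below would silently drop them, since every comparison is false), and it is a far cry from NumPy's introselect:

    import random

    def quickselect(data, k):
        """Return the k-th smallest element in O(N) average time."""
        data = list(data)
        lo, hi = 0, len(data) - 1
        while True:
            pivot = data[random.randint(lo, hi)]
            # Three-way partition of the active slice around the pivot.
            section = data[lo:hi + 1]
            less = [x for x in section if x < pivot]
            equal = [x for x in section if x == pivot]
            greater = [x for x in section if x > pivot]
            data[lo:hi + 1] = less + equal + greater
            if k < lo + len(less):                 # answer lies in 'less'
                hi = lo + len(less) - 1
            elif k < lo + len(less) + len(equal):  # the pivot is the answer
                return data[k]
            else:                                  # answer lies in 'greater'
                lo += len(less) + len(equal)

    print(quickselect([9, 1, 8, 2, 7, 3], 2))  # 3, the third-smallest value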
Posted Jan 7, 2020 22:03 UTC (Tue)
by nivedita76 (subscriber, #121790)
[Link]
Posted Jan 5, 2020 16:39 UTC (Sun)
by scientes (guest, #83068)
[Link] (2 responses)
Posted Jan 6, 2020 13:41 UTC (Mon)
by Paf (subscriber, #91811)
[Link] (1 responses)
Posted Jan 6, 2020 17:45 UTC (Mon)
by rgmoore (✭ supporter ✭, #75)
[Link]
As I understand it, "significand" is the preferred terminology of IEEE 754. Mantissa is used for logarithms. Because logarithms treat the part after the decimal differently from floating point numbers, they wanted to use a different term.
Posted Jan 5, 2020 16:48 UTC (Sun)
by scientes (guest, #83068)
[Link] (4 responses)
This is not a viable solution, as there is no such thing as negative NaN in the IEEE 754 standard, and as such ARM processors normalize negative NaN to positive NaN, unsetting the sign bit, which differs from the behavior of x86_64 processors, and would mean you would get different results on different CPUs.
Posted Jan 5, 2020 19:19 UTC (Sun)
by khim (subscriber, #9252)
[Link] (3 responses)
Trouble comes when you need to compare float and non-float (e.g. large integer). IEEE 754 doesn't even define "total order" for cases like these.
Posted Jan 7, 2020 18:39 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Basically, the problem is not in median() at all. The problem is that the architecture changed the numbers out from under you, so you are passing a different list of numbers under ARM than under x86. "Most people" won't notice that, because it's "just" replacing one NaN-valued bit pattern with another, and "most people" don't care about which NaN is which. But if you're casting to int, then you'll certainly notice differing bit patterns.
Posted Jan 10, 2020 19:57 UTC (Fri)
by khim (subscriber, #9252)
[Link] (1 responses)
That's nonsense. Both NEON and SSE use THE SAME registers for both integer and floating-point values, and instructions like movhlps don't even care whether they move integer or floating-point values. Now, certain floating-point operations may normalize NaNs... but then, they are supposed to change values anyway; it's unclear why you think NaNs shouldn't be affected at that point. I mean: should "1.0 + NaN" give you the same NaN or a different NaN? IEEE 754 doesn't say, and the solution is simple: just don't do it.
P.S. Now, if you would say that PYTHON destroys these NaN values... this could happen, I don't work with Python at that level all that often. But I happen to DO work with ARM⇔x86 translation - which means I deal with all these changes regularly... neither x86 nor ARM changes NaNs unless they are supposed to.
Posted Jan 10, 2020 20:03 UTC (Fri)
by anselm (subscriber, #2796)
[Link]
Python doesn't destroy NaN values if the underlying C implementation doesn't. At the end of the day, leaving aside unusual implementations such as JPython, the Python interpreter is just another program written in C, and for floating-point arithmetic it does whatever the C implementation on the system does (which in this day and age is presumably to rely on the floating-point instructions of the host processor).
Posted Jan 5, 2020 17:51 UTC (Sun)
by jansson (subscriber, #2227)
[Link] (4 responses)
If all input numbers, including NaNs, are sorted, and if all NaNs end up first or last after sorting, then we need to check only the first and last elements after the sort. If either of those two values is a NaN, then we have a NaN in the input; otherwise, there is no NaN in the input! How can these two checks for NaN be expensive?
What did I miss?
Posted Jan 5, 2020 19:21 UTC (Sun)
by khim (subscriber, #9252)
[Link]
Posted Jan 5, 2020 20:12 UTC (Sun)
by epa (subscriber, #39769)
[Link] (2 responses)
Posted Jan 7, 2020 18:43 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
(Granted, a linear scan for NaN is also O(N), and probably has a better cache hit rate to boot. So I agree with you that performance really shouldn't be a problem here.)
Posted Jan 7, 2020 22:06 UTC (Tue)
by nivedita76 (subscriber, #121790)
[Link]
Posted Jan 6, 2020 14:10 UTC (Mon)
by gdt (subscriber, #6284)
[Link] (7 responses)
Missing values can't be ignored for many statistical operations, as they increase the error (handwavingly: we can be less certain of a statistic when less of the population responds).
Missing values aren't that relevant to calculating the median, but they are relevant to calculating more advanced statistics. Having some statistical functions that do not accept missing values while others do is error-prone: each dataset would need two data structures, one with missing values and one without. Inevitably the data structure without missing values would be passed to a statistical function which can process missing values, and that function will silently give an incorrect result. Stats processing generally takes a dim view of such "hospital pass" APIs and of "optimised" functions, on the grounds that fast but sometimes wrong = wrong. Usually, once the processing to clean up the dataset is done, the dataset is passed to analysis functions unchanged.
Missing values themselves are useful as data: a pattern of missing values can uncover non-statistical bias in data, which in turn has consequences for the choice of analysis.
The trouble comes when statistical packages look at the IEEE NaN as a way to encode the missing value.
Posted Jan 7, 2020 13:42 UTC (Tue)
by cpitrat (subscriber, #116459)
[Link] (4 responses)
Posted Jan 8, 2020 0:02 UTC (Wed)
by KaiRo (subscriber, #1987)
[Link] (1 responses)
Posted Jan 8, 2020 16:02 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link]
Posted Jan 10, 2020 2:29 UTC (Fri)
by gdt (subscriber, #6284)
[Link] (1 responses)
None isn't that useful for processing large datasets where memory efficiency matters, as typically generated by scientific instruments; thus SciPy's overloading of NaN. Putting that another way:

    import array
    a = array.array('d', [1.0, 2.0, 3.0, None])
    TypeError: must be real number, not NoneType

Note that I am not arguing for overloading NaN -- I don't have a dog in this fight -- I'm just using my background as a statistics professional to explain why choices your tone suggests are unreasonable have been made by people acting reasonably.
Posted Jan 10, 2020 11:19 UTC (Fri)
by cpitrat (subscriber, #116459)
[Link]
Posted Jan 8, 2020 17:04 UTC (Wed)
by Hattifnattar (subscriber, #93737)
[Link] (1 responses)
Say you have a survey, and you did not get numbers from some people because they could not be reached, and from others because they refused to answer. Suppose you want to deal with these two types of "no value" differently.
It seems to me it's a task for a particular library (in an object-oriented language!) to design a type that accurately represents its domain, rather than blindly use (and abuse) a native type that mostly, but not completely, does the job.
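A sketch of what such a domain type might look like; all of the names here are invented for illustration:

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Missing(Enum):
        UNREACHABLE = 'unreachable'   # respondent could not be contacted
        REFUSED = 'refused'           # respondent declined to answer

    @dataclass
    class Response:
        value: Optional[float] = None
        missing: Optional[Missing] = None

    data = [Response(value=3.5), Response(missing=Missing.REFUSED)]
    answered = [r.value for r in data if r.missing is None]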
Posted Jan 10, 2020 2:31 UTC (Fri)
by gdt (subscriber, #6284)
[Link]
Posted Jan 9, 2020 14:47 UTC (Thu)
by NRArnot (subscriber, #3033)
[Link]
So why can't all of the sensible alternatives be implemented and made selectable with a keyword argument? The default one is to carry on doing exactly what it does at the moment, so as to not break anything that (for good or ill) currently relies on that behaviour. Then, under the hood, optimize it to your heart's content...
Posted Jan 19, 2020 11:07 UTC (Sun)
by jondo (guest, #69852)
[Link]
Should Python support it too? See the documentation of the upcoming 1.0.0 release: https://pandas.pydata.org/pandas-docs/version/1.0.0/user_...
I am looking forward to it, because in my work I currently have to represent integer data as float to efficiently deal with missings.
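For reference, the nullable integer support being referred to looks like this in pandas 1.0; the "Int64" extension dtype (note the capital "I") carries a validity mask alongside the integers, so no conversion to float is needed:

    >>> import pandas as pd
    >>> s = pd.Series([1, 2, None], dtype="Int64")
    >>> s
    0       1
    1       2
    2    <NA>
    dtype: Int64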