Some median Python NaNsense
Kemal Diri doubtless did not mean to start a massive thread with this request to add a built-in function to the language to calculate the average of the values in a list. That request was quickly dismissed, but the developers went on to the seemingly strange behavior of the statistics module's median() function when presented with floating-point not-a-number values.
What's not-a-number?
Python integer values have no specific internal representation and are defined to cover an arbitrary range. The float type, instead, is restricted to what the underlying CPU implements; on almost all current hardware, floating-point numbers are described by IEEE 754. That format looks like this:
The interpretation of these numbers is relatively straightforward — except for the quirks. Since this is essentially a sign/magnitude format, it is capable of representing both positive and negative values of zero; both are treated as zero during mathematical operations and (most) comparisons, but they are distinct numbers. The exponent is a biased value that makes comparisons easier. And, importantly for this discussion, an exponent value that is all ones is interpreted in special ways.
In particular, if the exponent is all ones and the mantissa is all zeroes, then the value represents infinity (usually written as "inf"). The sign bit matters, so there can be both positive and negative inf values. Any other value of the mantissa, instead, creates a special value called "not-a-number" or NaN. These values can be created by the hardware when an operation results in a value that cannot be represented — arithmetic overflow, for example. NaN values are also often used in scientific code to represent missing data, though that was apparently not a part of the original intent for NaNs.
There is more. With 52 bits to play with (plus the sign bit), it is obviously possible to create a lot of unique NaN values. Some code uses the mantissa bits to record additional information. But the most significant bit of the mantissa is special: if that bit is set, the result is a "quiet NaN"; if it is clear, instead, the value is a "signaling NaN". Quiet NaNs will quietly propagate through computations; adding a number to a quiet NaN yields another quiet NaN as a result. Operations on signaling NaNs are supposed to generate an immediate error.
(And, yes, some processors invert the meaning of the "quiet" bit, but that's more pain than we need to get into here.)
The problem with median()
The diversion of the initial thread came about when David Mertz pointed out that median() can yield some interesting results when the input data includes NaNs. To summarize his examples:
>>> import statistics
>>>
>>> NaN = float('nan')
>>> statistics.median([NaN, 1, 2, 3, 4])
2
>>> statistics.median([1, 2, 3, 4, NaN])
3
This strikes Mertz as a nonsensical result: the return value from a function like median() should not depend on the order of the data passed to it, but that is exactly what happens if the input data contains NaN values.
There was no immediate consensus on whether this behavior is indeed a
problem or not. Richard Damon asserted:
"Getting garbage answers for garbage input isn't THAT
unreasonable
". Steven D'Aprano, who added the statistics
module to the standard library, agreed: "this is a case of
garbage in, garbage out
". In both cases, the basis of the argument
is that median() is documented to require orderable input values,
and that is not what is provided in the example above.
Naturally, others see things differently. It turns out that IEEE 754
defines a "total order" that can be used to order all floating-point
values. In this order, it turns out that signaling NaNs are bigger
than infinity (in both the positive and negative directions), and quiet
NaNs are even bigger than that. So some felt that this total order should
be used in calculating the median of a list of floating-point values.
Brendan Barnwell went a bit
further by arguing that NaN values should be treated like other
numbers: "The things that computers work with
are floats, and NaN is a float, so in any relevant sense it is a
number
". D'Aprano disagreed
strongly:
Despite insisting that median() is within its rights to return strange results in this situation, D'Aprano also acknowledged that silently returning those results might not be the best course of action. All that is needed is to decide what that best course would actually be.
What should median() do
Much of the discussion was thus understandably focused on what the correct behavior for median() (and for other functions in the statistics module that have similar issues) should be. There were two core concerns here: what the proper behavior is, and the performance cost of implementing a different solution.
The cost is the easier concern to get a handle on. median() currently works by sorting the supplied values into an ordered list, then taking the value in the middle. Dealing with NaN values would require adding extra tests to the sort, slowing it down. D'Aprano figured that this change would slow median() by a factor between four and eight times. Christopher Barker did some tests and measured a slowdown range between two and ten times, depending on how the checking for NaN values is done. (And, for the curious, testing for NaN is not as easy as one might hope.) Some of this slowdown could be addressed by switching to a smarter way of calculating the median.
Once those little problems have been dealt with, there is still the issue of what median() should actually do. A number of alternatives surfaced during the discussion:
- Keep the current behavior, which works well enough for most users (who may never encounter a NaN value in their work) and doesn't hurt performance. Users who want specialized NaN handling could use a library like NumPy instead.
- Simply ignore NaNs.
- Raise an exception if the input data contains NaNs. This applies to quiet NaNs; if a signaling NaN is encountered, D'Aprano said, an exception should always be raised.
- Return NaN if the input data contains NaNs.
- Move the NaN values to one end of the list.
- Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits).
- Probably one or two others as well.
D'Aprano quickly came to the conclusion that the community is unlikely to come to a consensus on what the proper handling for NaN values is. So he has proposed adding a keyword argument (called nan_policy) to median() that would allow the user to pick between a subset of the options above. The default behavior would be to raise an exception.
This proposal appears to have brought the discussion to an orderly close;
nobody seems to strongly object to handling the problem in this way. All
that is left is to actually write the code to implement this behavior.
After that, one would hope, not-a-number handling in the statistics module
would be not-a-problem.
| Index entries for this article | |
|---|---|
| Python | Floating point |
