LWN: Comments on "Some median Python NaNsense" https://lwn.net/Articles/808595/ This is a special feed containing comments posted to the individual LWN article titled "Some median Python NaNsense". Some median Python NaNsense https://lwn.net/Articles/809994/ https://lwn.net/Articles/809994/ jondo <div class="FormattedComment"> In this context it's interesting to see that Pandas also tries to move away from using numpy.nan (i.e. the IEEE quiet NaN) as a "missing" indicator. The target is to deal with "missing" uniformly across all data types, instead of numpy.nan for floats, pandas.NaT for "not-a-time", and None for everything else (integers, strings, ...). Internally, the "missing" info is kept as a separate, binary mask.<br> <p> See the documentation of the upcoming 1.0.0 release: <a href="https://pandas.pydata.org/pandas-docs/version/1.0.0/user_guide/missing_data.html#missing-data-na">https://pandas.pydata.org/pandas-docs/version/1.0.0/user_...</a><br> <p> I am looking forward to it, because in my work I currently have to represent integer data as float to deal with missing values efficiently.<br> <p> </div> Sun, 19 Jan 2020 11:07:44 +0000 Some median Python NaNsense https://lwn.net/Articles/809201/ https://lwn.net/Articles/809201/ anselm <p> Python doesn't destroy NaN values if the underlying C implementation doesn't. At the end of the day, leaving aside unusual implementations such as Jython, the Python interpreter is just another program written in C, and for floating-point arithmetic it does whatever the C implementation on the system does (which in this day and age is presumably to rely on the floating-point instructions of the host processor).</p> Fri, 10 Jan 2020 20:03:38 +0000 Some median Python NaNsense https://lwn.net/Articles/809200/ https://lwn.net/Articles/809200/ khim <p>That's nonsense. Both NEON and SSE use THE SAME registers for both integer and floating point values. And instructions like movhlps don't even care whether they move integer or floating point values.</p> <p>Now, certain floating point operations may normalize NaNs... but then - they are supposed to change values anyway; it's unclear why you think NaNs shouldn't be affected at that point. I mean: should "1.0 + NaN" give you the same NaN or a different NaN? IEEE 754 doesn't say, and the solution is simple: just don't do it.</p> <p>P.S. Now, if you were to say that PYTHON destroys these NaN values... this could happen; I don't work with Python at that level all that often. But I happen to DO work with ARM⇔x86 translation - which means I deal with all these changes regularly... neither x86 nor ARM changes NaNs unless they are supposed to.</p> Fri, 10 Jan 2020 19:57:48 +0000 Some median Python NaNsense https://lwn.net/Articles/809144/ https://lwn.net/Articles/809144/ cpitrat <div class="FormattedComment"> Yes, but here we're talking about the statistics package (which already takes mixed input lists, IIUC), not numpy.<br> </div> Fri, 10 Jan 2020 11:19:31 +0000 Some median Python NaNsense https://lwn.net/Articles/809139/ https://lwn.net/Articles/809139/ gdt <div class="FormattedComment"> No higher granularity is needed for a statistics library: either the number contributes both a value and to the population, or it contributes solely to the population.
Reason codes for non-response are certainly useful for the application to maintain, but they are irrelevant to a statistics library.<br> </div> Fri, 10 Jan 2020 02:31:00 +0000 Some median Python NaNsense https://lwn.net/Articles/809138/ https://lwn.net/Articles/809138/ gdt <p>None isn't that useful for processing large datasets where memory efficiency matters, as typically generated by scientific instruments. Thus SciPy's overloading of NaN. Putting that another way:</p> <pre> <b>import array a = array.array('d', [1.0, 2.0, 3.0, None])</b> TypeError: must be real number, not NoneType </pre> <p>Note that I am not arguing for overloading NaN -- I don't have a dog in this fight -- I'm just using my background as a statistics professional to explain why choices your tone suggests are unreasonable have been made by people acting reasonably.</p> Fri, 10 Jan 2020 02:29:24 +0000 Some median Python NaNsense https://lwn.net/Articles/809045/ https://lwn.net/Articles/809045/ NRArnot <div class="FormattedComment"> This is Python! One advantage of this language, is that it's possible to update an API without breaking it, by introducing a new keyword argument with a default value.<br> <p> So why can't all of the sensible alternatives be implemented and made selectable with a keyword argument? The default one, is to carry on doing exactly what it does at the moment, so as to not break anything that (for good or ill) currently relies on that behaviour.<br> </div> Thu, 09 Jan 2020 14:47:28 +0000 Some median Python NaNsense https://lwn.net/Articles/808984/ https://lwn.net/Articles/808984/ Hattifnattar <div class="FormattedComment"> But what if someone needs higher granularity?<br> <p> Say you have a survey, and you did not get numbers from some people because they cannot be reached, and from others, because they refuse to answer. Suppose you want to deal with these two types of "no value" differently. <br> Should Python support it too?<br> <p> It seems to me it's a task for a particular library (in an object-oriented language!) to design a type that accurately represents its domain, rather than blindly use (and abuse) a native type that mostly, but not completely, does the job.<br> Then under the hood optimize it to your heart's content...<br> </div> Wed, 08 Jan 2020 17:04:26 +0000 Some median Python NaNsense https://lwn.net/Articles/808976/ https://lwn.net/Articles/808976/ mathstuf <div class="FormattedComment"> "No Actual Number Indicated" would go well with "nani" and the associated meme (<a href="https://www.urbandictionary.com/define.php?term=omae%20wa%20mou%20shindeiru">https://www.urbandictionary.com/define.php?term=omae%20wa...</a>)<br> </div> Wed, 08 Jan 2020 16:02:05 +0000 Some median Python NaNsense https://lwn.net/Articles/808919/ https://lwn.net/Articles/808919/ KaiRo <div class="FormattedComment"> Good idea! Let's call that "No Number Encountered" or short "NoNE". What do you think?<br> </div> Wed, 08 Jan 2020 00:02:41 +0000 What makes the NaN check expensive? 
https://lwn.net/Articles/808916/ https://lwn.net/Articles/808916/ nivedita76 <div class="FormattedComment"> You /should/, but the python implementation that's broken actually does a full sort.<br> </div> Tue, 07 Jan 2020 22:06:04 +0000 Some median Python NaNsense https://lwn.net/Articles/808915/ https://lwn.net/Articles/808915/ nivedita76 <div class="FormattedComment"> It was a little weird to find people worrying about performance of median when the implementation actually sorts the input rather than using quickselect.<br> </div> Tue, 07 Jan 2020 22:03:03 +0000 What makes the NaN check expensive? https://lwn.net/Articles/808901/ https://lwn.net/Articles/808901/ NYKevin <div class="FormattedComment"> You don't sort to get median, you do quickselect (or another algorithm), which is O(N) (in the average case).<br> <p> (Granted, a linear scan for NaN is also O(N), and probably has a better cache hit rate to boot. So I agree with you that performance really shouldn't be a problem here.)<br> </div> Tue, 07 Jan 2020 18:43:13 +0000 Some median Python NaNsense https://lwn.net/Articles/808900/ https://lwn.net/Articles/808900/ NYKevin <div class="FormattedComment"> That's too late. The architecture has already normalized them by the time you call median().<br> <p> Basically, the problem is not in median() at all. The problem is that the architecture changed the numbers out from under you, so you are passing a different list of numbers under ARM than under x86. "Most people" won't notice that, because it's "just" replacing one NaN-valued bit pattern with another, and "most people" don't care about which NaN is which. But if you're casting to int, then you'll certainly notice differing bit patterns.<br> </div> Tue, 07 Jan 2020 18:39:58 +0000 Some median Python NaNsense https://lwn.net/Articles/808857/ https://lwn.net/Articles/808857/ cpitrat <div class="FormattedComment"> If only there was a way to represent a missing value in Python. Some keyword that would inform that instead of containing a value, this variable contains none.<br> </div> Tue, 07 Jan 2020 13:42:53 +0000 Some median Python NaNsense https://lwn.net/Articles/808838/ https://lwn.net/Articles/808838/ Baughn <div class="FormattedComment"> NaNs are larger than infinity, so no such integer exists.<br> </div> Tue, 07 Jan 2020 00:49:43 +0000 Some median Python NaNsense https://lwn.net/Articles/808823/ https://lwn.net/Articles/808823/ rgmoore <p>As I understand it, "significand" is the preferred terminology of IEEE 754. Mantissa is used for logarithms. Because logarithms treat the part after the decimal differently from floating point numbers, they wanted to use a different term. Mon, 06 Jan 2020 17:45:49 +0000 Some median Python NaNsense https://lwn.net/Articles/808754/ https://lwn.net/Articles/808754/ gdt <div class="FormattedComment"> The article possibly could have mentioned the other half of the tension: non-trivial statistical processing requires the concept of a "missing value"; that is, data which was asked of the respondent but which was not supplied.<br> <p> Missing values can't be ignored for many statistical operations, as they increase the error (handwavingly: we can be less certain of a statistic when less of the population responds).<br> <p> Missing values aren't that relevant to calculating the median, but they are relevant to calculating more advanced statistics. Having some statistical functions not accept missing values and other statistical functions accept missing values is error-prone. 
Each dataset would have two data structures: one with missing values and one without. Inevitably the data structure without missing values would be passed to a statistical function which can process missing values, and that function will silently give an incorrect result. Stats processing generally regards such "optimised" functions as "hospital pass" APIs: "fast but sometimes wrong = wrong". Usually, once the processing to clean up the dataset is done, the dataset is passed to the analysis functions unchanged.<br> <p> Missing values themselves are useful as data: a pattern of missing values can uncover non-statistical bias in data, which in turn has consequences for the choice of analysis.<br> <p> The trouble comes when statistical packages look at the IEEE NaN as a way to encode the missing value.<br> </div> Mon, 06 Jan 2020 14:10:07 +0000 Some median Python NaNsense https://lwn.net/Articles/808757/ https://lwn.net/Articles/808757/ rbanffy <div class="FormattedComment"> Sorting infinity and -infinity isn't much of a problem, but where does a number go when you have an integer that falls in the middle of the range of possible NaNs?<br> </div> Mon, 06 Jan 2020 13:51:33 +0000 Some median Python NaNsense https://lwn.net/Articles/808756/ https://lwn.net/Articles/808756/ Paf <div class="FormattedComment"> Why should that be preferred?<br> </div> Mon, 06 Jan 2020 13:41:38 +0000 Some median Python NaNsense https://lwn.net/Articles/808751/ https://lwn.net/Articles/808751/ jtaylor <div class="FormattedComment"> Yes, if you are sorting to find the median it can be done by checking the last element.<br> But finding the median does not require a full sort; for example the quickselect (<a href="https://en.wikipedia.org/wiki/Quickselect">https://en.wikipedia.org/wiki/Quickselect</a>) algorithm (which is only the partition step of a quicksort) finds the median (or any quantile) in O(N) time, while sorting requires O(N log N) time.<br> With quickselect, only certain pivot points in the array carry an ordering guarantee; e.g. for the median it is guaranteed that all elements after the median are larger and all elements before it are smaller, but the ordering within these two partitions is undefined. So the NaNs may be anywhere in the top 50% of the array (likely a smaller region near the top, as you will usually need more than one partition step to find the median).<br> The lower complexity more than makes up for the need to search part of the array linearly to handle NaN.<br> </div> Mon, 06 Jan 2020 11:34:42 +0000 Some median Python NaNsense https://lwn.net/Articles/808743/ https://lwn.net/Articles/808743/ areilly <div class="FormattedComment"> It shouldn't be necessary to do a vectorised search after the median to find a NaN: they sort to the end, so just check the last (sorted) element. If it's a NaN, return NaN.<br> <p> If you wanted to support the idea that NaNs represent missing data, and return a median of the rest, then yes, you need to search to find the first one. Since your data is sorted by that point, a binary search may (or may not!) be faster than a vectorised linear search. There are probably machine-dependent length breakpoints.<br> <p> </div> Mon, 06 Jan 2020 04:02:04 +0000 Some median Python NaNsense https://lwn.net/Articles/808741/ https://lwn.net/Articles/808741/ NYKevin <div class="FormattedComment"> IEEE already pushes all the "weird" numbers to the ends. All you really have to do is ensure that large integers are sorted into the "finite regular numbers" bucket with all the smaller floats, and then compare them numerically.<br> </div> Mon, 06 Jan 2020 01:39:55 +0000
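To make the comment above concrete, here is a minimal sketch of such a sort key for Python's sorted(). It is only an illustration of the idea, not anything the statistics module actually does; where the NaNs land, and how they order among themselves, is an arbitrary policy choice in this sketch.

<pre>
<b>import math

def total_order_key(x):
    # NaNs get their own buckets at the two ends; everything else -
    # ints of any size, finite floats, and the infinities - compares
    # numerically, which Python's exact int/float comparison handles.
    if isinstance(x, float) and math.isnan(x):
        # copysign() reads the sign bit even of a NaN; sending
        # "negative" NaNs to the front is a choice, not IEEE mandate.
        return (1, 0) if math.copysign(1.0, x) > 0 else (-1, 0)
    return (0, x)

big = 10**400                      # too large to represent as a float
data = [float('inf'), big, 2.5, float('nan'), -float('inf'), 3]
s = sorted(data, key=total_order_key)
print(s[0], s[-1])                 # big lands between 3 and inf</b>
-inf nan
</pre>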
What makes the NaN check expensive? https://lwn.net/Articles/808724/ https://lwn.net/Articles/808724/ epa <div class="FormattedComment"> Moreover, if you are sorting the list then you are already doing an O(N log N) operation, so a linear scan to find any NaN values would be a tiny fraction of the total work on large lists, and for small lists it would be fast enough anyway not to matter. As others said, you would not be using vanilla Python if you needed high-performance numeric operations. So I don’t understand the objections based on performance. Perhaps if you have vast numbers of short lists and need the median of each... but even then you’d surely use a different tool. <br> </div> Sun, 05 Jan 2020 20:12:31 +0000 What makes the NaN check expensive? https://lwn.net/Articles/808722/ https://lwn.net/Articles/808722/ khim <div class="FormattedComment"> You miss the fact that right now Python's sorting algorithm WOULDN'T put them first or last. That means that before you could check those two elements you would need to implement a brand new way to compare numbers (and a brand new sort on top of that) - one which yields a total order when applied to floats.<br> </div> Sun, 05 Jan 2020 19:21:32 +0000 Some median Python NaNsense https://lwn.net/Articles/808721/ https://lwn.net/Articles/808721/ khim <div class="FormattedComment"> That's if you do a stupid thing and try to compare them as floats. Treat the same bit sequence as an integer, use the regular comparison operator - and you always get the same correct result on all CPUs.<br> <p> Trouble comes when you need to compare a float and a non-float (e.g. a large integer). IEEE 754 doesn't even define "total order" for cases like these.<br> </div> Sun, 05 Jan 2020 19:19:09 +0000 Some median Python NaNsense https://lwn.net/Articles/808720/ https://lwn.net/Articles/808720/ khim <div class="FormattedComment"> <font class="QuotedText">&gt; But if you're going to try and sort the NaNs, you might as well sort them in the way the IEEE says you're supposed to, instead of making up a new standard and <a href="https://xkcd.com/927/">https://xkcd.com/927/</a></font><br> <p> Python would need to invent something new anyway. I mean: it's really easy to sort floats in total order - just use the same bit sequence as an integer and compare them as signed numbers. But... Python also supports integers which couldn't fit in the float range... how are THESE supposed to be ordered WRT infs and NaNs? IEEE 754 doesn't answer that question - which means you need a new standard anyway.<br> </div> Sun, 05 Jan 2020 19:15:46 +0000 What makes the NaN check expensive? https://lwn.net/Articles/808719/ https://lwn.net/Articles/808719/ jansson <div class="FormattedComment"> Thanks for an interesting article! :)<br> <p> If all input numbers, including NaNs, are sorted, and if all NaNs end up first or last after sorting, then we need to check only the first and last elements after the sort. If either of those two values is a NaN then we have a NaN in the input; otherwise there is no NaN in the input! How can these two checks for NaN be expensive?<br> <p> What did I miss?<br> </div> Sun, 05 Jan 2020 17:51:47 +0000
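The catch khim describes above can be seen directly: every ordering comparison against NaN is False, so Python's built-in sort makes no promise of moving NaNs to either end of the list. A small illustration follows; the placements shown come from one CPython run and depend on the input order, so they are an implementation detail rather than a guarantee.

<pre>
<b>nan = float('nan')
print(sorted([3.0, nan, 1.0, 2.0]))   # NaN stranded in the middle
print(sorted([1.0, 2.0, 3.0, nan]))   # NaN happens to end up last</b>
[3.0, nan, 1.0, 2.0]
[1.0, 2.0, 3.0, nan]
</pre>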
Some median Python NaNsense https://lwn.net/Articles/808715/ https://lwn.net/Articles/808715/ scientes <div class="FormattedComment"> <font class="QuotedText">&gt; Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits). </font><br> <p> This is not a viable solution, as there is no such thing as a negative NaN in the IEEE 754 standard, and as such ARM processors normalize negative NaN to positive NaN, unsetting the sign bit, which differs from the behavior of x86_64 processors, and would mean you would get different results on different CPUs.<br> </div> Sun, 05 Jan 2020 16:48:04 +0000 Some median Python NaNsense https://lwn.net/Articles/808714/ https://lwn.net/Articles/808714/ scientes <div class="FormattedComment"> Please use "significand" instead of "mantissa".<br> </div> Sun, 05 Jan 2020 16:39:34 +0000 Some median Python NaNsense https://lwn.net/Articles/808687/ https://lwn.net/Articles/808687/ jtaylor <div class="FormattedComment"> Not returning NaN for this input in median is not a case of garbage in, garbage out. What the statistics module is doing is taking garbage and pretending it is not garbage anymore in the output.<br> Garbage in, garbage out in the case of NaN numbers is: NaN in, NaN out.<br> Doing otherwise, you lose the information that something went wrong (unless you expose the floating-point exceptions that were raised as a second channel).<br> <p> That is why numpy.median returns NaN when there is a NaN in the input.<br> This does come with some computational cost, but it is not so bad: NaNs sort to the end of the array (the sign bit is ignored in all NaN operations, including sorting), so you only have to check the values after the median for a NaN, which can be done very efficiently (easily vectorized with SIMD) compared to the median selection operation itself.<br> You can also use a quickselect/introselect with pivot storage to reduce the amount of data you have to search to, on average, 25% for uniformly distributed data (this is what numpy does: <a href="https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L3514">https://github.com/numpy/numpy/blob/master/numpy/lib/func...</a>).<br> <p> Though for Python's statistics module this may be more costly, as it has to consider mixed-type input lists.<br> <p> That said, it is very common for missing values to be represented as NaN in practical numerical data. For this case numpy has dedicated functions (nanmedian, nanpercentile, ...) which treat NaNs as missing data and ignore them.<br> </div> Sat, 04 Jan 2020 21:41:34 +0000 Some median Python NaNsense https://lwn.net/Articles/808682/ https://lwn.net/Articles/808682/ NYKevin <div class="FormattedComment"> Going through those options and my reactions to them:<br> <p> <font class="QuotedText">&gt; Keep the current behavior, which works well enough for most users (who may never encounter a NaN value in their work) and doesn't hurt performance. Users who want specialized NaN handling could use a library like NumPy instead.</font><br> <p> That seems quite backwards to me. If you wanted high performance, you'd already be using NumPy anyway. Standard library modules should favor correctness over speed.<br> <p> <font class="QuotedText">&gt; Simply ignore NaNs.</font><br> <p> I don't like this. NaN is not the same thing as SQL NULL; it means "something went wrong," not "no data available."
Ignoring it is unlikely to produce the answer you wanted.<br> <p> <font class="QuotedText">&gt; Raise an exception if the input data contains NaNs. This applies to quiet NaNs; if a signaling NaN is encountered, D'Aprano said, an exception should always be raised.</font><br> <p> I've never been a fan of the quiet/signaling distinction, TBH. Python doesn't provide any portable means of distinguishing between signaling and quiet NaNs, or at least none that I'm aware of,* and I've never heard of "regular" Python operations such as x + y raising an exception on signalling NaNs. So I'd prefer not to try and treat them differently here. It could cause much confusion for (IMHO) little benefit. But I would be OK with always raising an exception, regardless of NaN type.<br> <p> <font class="QuotedText">&gt; Return NaN if the input data contains NaNs.</font><br> <p> That's what every other operation on NaNs is supposed to do, and seems obviously correct (or at least, not wrong) to me.<br> <p> <font class="QuotedText">&gt; Move the NaN values to one end of the list.</font><br> <p> <font class="QuotedText">&gt; Sort the list using total order (which would move NaNs to both ends of the list, depending on their sign bits).</font><br> <p> I could accept either of these as "not too ridiculous" if it were documented to work that way. But if you're going to try and sort the NaNs, you might as well sort them in the way the IEEE says you're supposed to, instead of making up a new standard and <a href="https://xkcd.com/927/">https://xkcd.com/927/</a><br> <p> * You can take them apart using one of struct, ctypes, etc. and inspect the byte representation. That does not count; I was thinking of something more along the lines of math.is_quiet_nan(), which does not exist.<br> </div> Sat, 04 Jan 2020 20:40:07 +0000 Some median Python NaNsense https://lwn.net/Articles/808649/ https://lwn.net/Articles/808649/ sethkush <div class="FormattedComment"> After doing work with other statistics packages (I've never used Python), I would be mildly annoyed if I was using something that didn't ignore NaN values. The nan_policy solution does seem like the best way to keep everyone happy though. <br> </div> Sat, 04 Jan 2020 06:10:47 +0000
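For concreteness, here is a rough sketch of the keyword-argument approach discussed in the comments above. Nothing like this exists in the statistics module today: the name nan_policy is borrowed from SciPy's convention, the option names merely mirror the alternatives from the article, and the default chosen here is arbitrary.

<pre>
<b>import math
import statistics

def median(data, nan_policy='propagate'):
    # Hypothetical wrapper, not the statistics-module API.
    # (Decimal NaNs and signaling NaNs are ignored for brevity.)
    values = list(data)
    has_nan = any(isinstance(x, float) and math.isnan(x) for x in values)
    if has_nan:
        if nan_policy == 'raise':
            raise ValueError('NaN in input data')
        if nan_policy == 'propagate':
            return math.nan
        if nan_policy == 'ignore':
            values = [x for x in values
                      if not (isinstance(x, float) and math.isnan(x))]
    return statistics.median(values)

print(median([1.0, float('nan'), 3.0], nan_policy='ignore'))</b>
2.0
</pre>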