Some median Python NaNsense

Posted Jan 4, 2020 21:41 UTC (Sat) by jtaylor (subscriber, #91739)
Parent article: Some median Python NaNsense

Not returning NaN for this input in median is not a case of garbage in, garbage out. What the statistics module is doing is taking garbage pretending it is not garbage anymore in the output.
garbage in garbage out in the case if NaN numbers ist: NaN in NaN out.
Doing otherwise you lose the information that something went wrong (if you do not provide the floating point exceptions that will be thrown through as second channel).

That is why numpy.median returns NaN when there is a NaN in the input.
This does come with some computational costs but it is not so bad, NaNs sort to the end of array (the sign bit is ignored in all nan operations including sorting), so you only have to check the values after the median for a NaN which can be done very efficiently (easily vectorized with SIMD) compared to the median selection operation itself.
You can also use a quickselect/introselect with pivot storage to reduce the amount of data you have to search to on average 25% of uniformly distributed data (this is was numpy does https://github.com/numpy/numpy/blob/master/numpy/lib/func...).

Though for pythons statistics module this may be more costly as it has to consider mixed type input lists.

That said it is very common that missing values are represented as NaN in practical numerical data. For this case numpy has dedicated functions (nanmedian, nanpercentile, ....) which will consider NaNs as missing data and ignore them.

Some median Python NaNsense

Posted Jan 6, 2020 4:02 UTC (Mon) by areilly (subscriber, #87829) [Link] (2 responses)

It shouldn't be necessary to do a vectorised search after the median to find a NaN: they sort to the end, so just check the last (sorted) element. If it's a NaN, return NaN.

If you wanted to support the idea that NaNs represented missing data, and return a median of the rest, then yes you need to search to find the first one. Since your data is sorted by that point, a binary search may (or may not!) be faster than a vectorised linear search. There are probably machine-dependent length breakpoints.

Some median Python NaNsense

Posted Jan 6, 2020 11:34 UTC (Mon) by jtaylor (subscriber, #91739) [Link] (1 responses)

yes, if you are sorting to find the median it can be done by checking the last element.
But finding the median does not require a full sort, for example the quickselect (https://en.wikipedia.org/wiki/Quickselect) algorithm (which is only the partition step of a quicksort) finds the median (or any quantile) in O(N) time while sorting requires O(NlogN) time.
With quickselect only certain pivot points in the array are guaranteed to have an ordering guarantee, e.g. for the median it is guaranteed that all positions after the median are larger and all before smaller than the median, but the ordering within these two partitions is undefined. So the NaNs may be anywhere in the top 50% of the array (likely higher as you will mostly need more than one partition to find the median).
The lower complexity more than makes up for the need of searching part of the array linearly to handle NaN.

Some median Python NaNsense

Posted Jan 7, 2020 22:03 UTC (Tue) by nivedita76 (subscriber, #121790) [Link]

It was a little weird to find people worrying about performance of median when the implementation actually sorts the input rather than using quickselect.