I have not yet read the code, but I would suspect that mergesort would not be an in-place sorting algorithm (meaning that it would cost a lot of memory, besides the recursion stack).
Are there are obvious reasons why Perl has not chosen Heapsort instead of Mergesort if we are considering "worst case performance is a security bug"?
Frankly, as someone coming from a theoretical background, I find this "security bug" to be a non-bug at all...
Worst case performance and "impossibly unlikely" conditions
Posted Jan 12, 2012 0:07 UTC (Thu) by wahern (subscriber, #37304)
[Link]
Perl didn't select merge sort because of security concerns, AFAIK. Merge sort was made the default with 5.8.0, which was released about a year before the paper on attacking Perl's hash.
And of course coming from a theoretical background algorithmic complexity attacks don't matter to you. You don't even think languages need interfaces like read() or write(). The data you operate on just mysteriously exists inside your program. And on your really good days, you don't even need complex data structures; a list is all you need. :P
Worst case performance and "impossibly unlikely" conditions
Posted Jan 12, 2012 1:49 UTC (Thu) by rgmoore (✭ supporter ✭, #75)
[Link]
Perl switched sort implementations because they saw bad performance as a performance problem, not because they saw poor performance as a possible DOS problem. It turns out that the oddball pathological data isn't so oddball after all, so performance regressions because of their sort implementation were more common than you'd expect if you assume that you're always sorting random data. They also let the user tweak the sort algorithm with a pragma, so people who really know what they're doing can override the default and use quicksort if they know it won't cause problems.
Worst case performance and "impossibly unlikely" conditions
Posted Jan 12, 2012 2:08 UTC (Thu) by rbrito (subscriber, #66188)
[Link]
> Perl switched sort implementations because they saw bad performance as a performance problem, not because they saw poor performance as a possible DOS problem.
Thanks for pointing it out.
The article does (at least to my non-native English-speaker interpretation) give the impression that the "bad performance is DOS security problem" was the case for the change.
Worst case performance and "impossibly unlikely" conditions
Posted Jan 12, 2012 3:42 UTC (Thu) by rgmoore (✭ supporter ✭, #75)
[Link]
I think I may have inadvertently confused you. The sort implementation I'm talking about is a different example than the hash implementation mentioned in the parent article, where the performance regression is only likely to be important in the context of a deliberate DOS attack. Perl sorting is just a real world example of an algorithm encountering pathological data far more often than you might naively expect. Practical experience with pathological data forced them to start considering worst-case performance rather than just average case.
FWIW, the Perl documentation also points out that merge sort uses fewer comparisons than quicksort in the typical case (and no more in the worst case), so it may be faster if the comparisons themselves are expensive. One more example of why it's important to have several algorithms with different and well understood performance characteristics for important tasks.
Heapsort is useless in real world...
Posted Jan 12, 2012 9:32 UTC (Thu) by khim (subscriber, #9252)
[Link]
Heapsort is only good for things like priority_queue. The fact that it does not use additional memory looks good on paper but in reality it's red herring. The problem with heapsort is that it uses pathological memory access patters which make it literally hundreds of times slower once your heap is larger then your L3 cache (end it's significantly slower then merge sort when you overflow L1).
This means that there are only two cases where heapsort is good:
1. When data is small enough to fit in L1 cache.
2. When you have just enough data to fit in memory (L2 cache, L3 cache, etc), but there are not enough memory to keep second copy.
In first case mergesort is worse but still acceptable and second case is extremely rare (unless your program is designed to behave this way) thus heapsort is obviously very poor default.
Heapsort is useless in real world...
Posted Jan 20, 2012 1:45 UTC (Fri) by Wol (guest, #4433)
[Link]
I'd say heapsort is also a very good choice if you want to process data in sorted order. Ie "get first item, process, get next item, process, rinse, repeat".
But there you don't care too much about time and "pathological" cases, because sorting is a minor part of your costs.
Cheers,
Wol
Heapsort is useless in real world...
Posted Jan 20, 2012 7:55 UTC (Fri) by ekj (guest, #1524)
[Link]
If the "process data" step is constant time - i.e. the work performed to process one item is independent of how many items exist, then that's O(n).
If the pathological sort-case is O(n^2), then no matter how large the constant separating these two cases is, for some amounts of items, the sort will dominate.
You might have a "process data" step that takes 50ms, and a sort that you expect to take 0.2ms * n * log2(n) and conclude that processing a million items, should take ~21 minutes. (about 20 minutes for processing, 1 minute for sorting)
If pathological data causes that sort to instead take 0.2ms * n * n time, then you're looking at 20 minutes of processing, and 6 years of sorting, which is clearly likely to be a DOS.
Forcing n^2 where n*log2(n) is expected is a huge deal for largish data-sets (a million items is large, but not gargantuan), it ups the workload by a factor of 50000. Thus even if your sort *was* a small fraction of the expected work, it's now dominant.
Heapsort is useless in real world...
Posted Jan 20, 2012 22:45 UTC (Fri) by Wol (guest, #4433)
[Link]
Except that, is there a pathological case for heap sort? It's actually a very well behaved sort. Setting up the heap is the most unpredictable bit, and the most pathological case (an already-sorted array) isn't that much worse than a random array.
And once the heap is created, it's completely predictable. For a heap of 2^x items, it will require x comparisons and x swaps to get the next value. (Give or take an off-by-one error on my part :-)
Cheers,
Wol
It was already explained...
Posted Jan 22, 2012 0:54 UTC (Sun) by khim (subscriber, #9252)
[Link]
Except that, is there a pathological case for heap sort?
Sometimes it's good idea to actually look up further upthread.
It's actually a very well behaved sort.
No, it's absolutely not. It's quite good WRT to comparisons and swaps, but it's absolute disaster WRT to memory access. If you data fit in L1 cache (16K-32K) it works just fine, if you hit L2 and/or L3 cache you start doing worse then merge sort or quick sort, but difference is tolerable (typically less then 2x). The real disaster happens once you hit main memory: to read one number from random address in memory you need about 250-300 ticks (and this is what happens with heapsort because prefetch can not cope with it's memory access patterns), while sustained sequential access gives you next 32bit number in 2-3 ticks. Since some part of heap will be in cache even for large arrays heapsort does not suddenly become 50-100 times slower then mergesort when you pass L3 border, to get to this level of difference you need hundreds of gigabytes of data.