|
|

# Where this hubris comes from ?

## Where this hubris comes from ?

Posted Jan 14, 2012 0:08 UTC (Sat) by nybble41 (subscriber, #55106)
In reply to: Where this hubris comes from ? by khim
Parent article: Denial of service via hash collisions

> The first result gives you link to Median of Medians algorithm - and yes, this algorithm can be used to find median in linear time.

The result of the algorithm is, as the name implies, the median of a list of subgroup medians. It is not the same as the median of the entire list. Given the group size of five used in the article, the result will split the list somewhere between 30% and 70%, rather than at 50% as with a true median. So thus far the GP is entirely correct.

Despite that, however, it can be used to guarantee O(n) worst-case performance for the quickselect algorithm, and thus O(n log n) for quicksort.

Where this hubris comes from ?

Posted Jan 14, 2012 1:13 UTC (Sat) by HelloWorld (guest, #56129) [Link]

> The result of the algorithm is, as the name implies, the median of a list of subgroup medians.
No dude, it's not.

What happened?

Posted Jan 14, 2012 11:32 UTC (Sat) by khim (subscriber, #9252) [Link]

This is just sad. LWN was a place where you were able to come and discuss things easily. Without trying to do something with mass of twitter generation guys who have eyes connected directly to hands (perhaps via spinal cord but definitely bypassing brain) and/or who reply without thinking (this makes brain pointless: it's there but it does something unrelated to commenting process).

You both (cmccabe and nybble41) say that algorithm picks Median of Medians - and this is true. Now the question: how can it do that if that's the end result? If it uses some another algorithm to select true median from medians - then what's the point? Answer (obvious and true): yes, it does select "Median of Medians" (hence the name), but no, this is not the end result. The end result is true median.

I honestly don't know what to do with people of "twitter generation". When you are employer you can just fire them and refuse to hire new ones. When you are open site... they tend to clog all discussions and make them hard to follow.

So far LWN was place which happily avoided this fate (perhaps because it used subscription model), but cmccabe and nybble41 are subscribers thus it's obviously not a panacea.

Guys. Please, please, please don't write on LWN if you don't have time to think. Everyone does mistakes (phrase errare humanum est, … is well-known and loved), but somehow people start forgetting the second part of that some phrase (perseverare diabolicum). Be human! PLEASE!!!

What happened?

Posted Jan 14, 2012 21:45 UTC (Sat) by cmccabe (guest, #60281) [Link]

Hmm. I didn't expect this to be so controversial.

Here is a quick example of median of medians.

```#!/usr/bin/python

def find_median(list):
n = len(list)
list2 = sorted(list)
if ((n % 2) == 0):
return (list2[int(n/2)] + list2[int(n/2) + 1]) / 2
else:
return list2[int(n/2)]
def median_of_medians(k, list):
n = len(list)
if (n == k):
return find_median(list)
m = []
for i in range(0, n, k):
m.append(median_of_medians(k, list[i:i+k]))
return find_median(m)
list = [3, 1, 4, 4, 5, 9, 9, 8, 2]
print find_median(list)
print median_of_medians(3, list)
```
You can see that the median of our example list is 4, but the "median of medians" (with k=3) is 5. The median of medians is not the same as the median.

However, you can use median-of-medians as the pivot selection heuristic for quickselect. If you do this, you are guaranteed O(n) worst case running time for quickselect.

So yes, in theory, you could use quicksort with quickselect as the pivot selection algorithm, and median-of-medians as the pivot selection algorithm for quickselect. And it would all be N log N. However, the constant factors would be horrendous. You'd basically be doing 3x as much work as the guy who just used merge sort. That's why the Java guys went with merge sort.

If you want to know even more, check out: http://martinsjava.blogspot.com/2009/03/test-this-is-code-more-code-end-test.html.

TL;DR: it's easy to code an in-place quicksort, and it's still useful for some things. However, it can't be made adversary-proof without losing a lot of its real-world performance advantage over other N log N sorts.

Are you moron or just play one on TV?

Posted Jan 14, 2012 23:39 UTC (Sat) by khim (subscriber, #9252) [Link]

Hmm. I didn't expect this to be so controversial.

There are no controversy. Just pure obstinacy.

Here is a quick example of median of medians.

Let's see.

...
`list2 = sorted(list)`
...

Brilliant idea. Not. This single line guarantees that your median_of_medians function will need at least O(N log N) operations in some cases. This, in turn, will probably mean that your “outer sort algorithm” will have at least O(N log²N) complexity, not desired O(N log N). If you'll use it recursively then situation can be even worse (not sure if you can push it all the way to O(N²): for that to happen you'll need to hit “worst case” often enough and I'm not 100% sure it's possible to organize).

Let me remind you the story of whole so called “controversy”:
HelloWorld: there are algorithms that can chose the median (and hence the optimal pivot element) from a list in O(n) ⇨ TRUTH
cmccabe: You're thinking of quickselect ⇨ NONSENSE¹)
HelloWorld: There are other algorithms (such as the "median of medians" algorithm) that have a worst-case performance of O(n) ⇨ TRUTH
cmccabe: The median of medians, though possibly a useful thing, is not the same thing as the median ⇨ IRRELEVANT²)
khim: The first result gives you link to Median of Medians algorithm - and yes, this algorithm can be used to find median in linear time ⇨ TRUTH
nybble41: The result of the algorithm is, as the name implies, the median of a list of subgroup medians ⇨ NONSENSE³)
khim: The end result is true median ⇨ TRUTH⁴)
cmccabe: However, you can use median-of-medians as the pivot selection heuristic for quickselect. If you do this, you are guaranteed O(n) worst case running time for quickselect ⇨ NONSENSE⁵)

As you can see the whole controversy is kind of trivial: one side postulates true (and relevant!) facts while other side either says incorrect assertions or correct yet entirely irrelevant facts.

I'm not sure why it's so hard for you to accept the fact that Median of Medians algorithm's goal is not to find median of medians (it uses median of medians as intermediate step) - but, well, that's the root of the whole controversy. If you'll stop feeling indignant for just a moment and spent five minutes to actually take a look on Median of Medians algorithm (Wikipedia does pretty good job explaining it) then suddenly the while discussion will stop being controversial.

If you want to know even more, check out: http://martinsjava.blogspot.com/2009/03/test-this-is-code-more-code-end-test.html

AFAICS this article tells nothing about median selection - it just points out that “median of three” can not save quicksort. Which was kind of expected from the beginning but it's still nice to have formal proof.

That's why the Java guys went with merge sort.

Another fail. Java guys have not “went with merge sort”. They actually use both tuned quicksort and merge sort. Apparently they felt it'll be good idea to spice life of a programmer. Need to sort Int[] ? Sure, be my guest - O(N log N) is guaranteed. Managed to switch int[] to save on allocations? Have a nice surprise of O(N²) sort! Otherwise life will be dull, you know...

──────────
¹) Obviously here HelloWorld does not talk about quickselect because quickselect does not guarantee O(N) complexity.
²) Here Median of Medians algorithm (100% relevant because it can be used to find median in O(N) time) somehow transformed to median of medians (which is of course entirely different thing).
³) The same story again: Median of Medians algorithm phrase can fit in twitterbrain, linear algorithm for the general case of selecting the kth largest element was published (which is explanation if what the said algorithm does) overflows it. We are Twitter generation! 100 character, 14 words, 1 click? NOOOOOO! That's too long! That's too hard!
⁴) Here I've tried to simplify algorithm's goal description to fit them in twitterbrain - apparently without any success
⁵) This particular “find median of medians” algorithm requires at least O(N log N) operations which gives us at least O(N log²N) complexity. Close, but no cigar.

Are you moron or just play one on TV?

Posted Jan 15, 2012 3:30 UTC (Sun) by nybble41 (subscriber, #55106) [Link]

People might tend to take you more seriously if you refrained from resorting to personal insults. Yes, the cited section of the Wikipedia article does start out with the assertion that the algorithm can give the median in linear time. However, without already knowing the answer, that is not at all apparent from the remainder of the description. Your opponents are not "twitterbrains" just because they expect more authoritative citations than a single, rather dense, Wikipedia article without so much as a single hyperlink for further reading.

However, after rereading this thread, the Wikipedia article, several other descriptions of the Median of Medians selection algorithm (several of which explicitly supported the 30%/70% argument), and finally the original article which I was able to track down via Google Scholar[1], I am forced to agree that this algorithm can be used (in combination with the quickselect algorithm) to find the true median in linear time.

Note that the Median of Medians algorithm, on its own, does in fact return a result between 30% and 70% of the list, not the true median at 50%. Only by using it to select the pivot point for the quickselect algorithm do you get the true median of the list in linear time.

Posted Jan 15, 2012 11:42 UTC (Sun) by khim (subscriber, #9252) [Link]

People might tend to take you more seriously if you refrained from resorting to personal insults.

If people expect to be spoon-feed then it's their problem, not mine. I remember ages-old principle errare humanum est, perseverare diabolicum very well, thank you very much. I'm human, after all - and sometimes make mistakes. This is why I try to give hints without insults at first. But after few rounds it's time to recall the second part of principle and recall that what we are, unapologetically, is hostile to people who seem to be unwilling to think or to do their own homework before asking questions.

Yes, the cited section of the Wikipedia article does start out with the assertion that the algorithm can give the median in linear time. However, without already knowing the answer, that is not at all apparent from the remainder of the description.

Well, you actually need to read Wikipedia's article (in particular it's good idea to not skip Proof of O(n) running time and Important notes parts which state quite bluntly that the worst-case algorithm can construct a worst-case O(n log n) quicksort algorithm, by using it to find the median at every step) and think. Is it such a big problem?

Your opponents are not "twitterbrains" just because they expect more authoritative citations…

Yes, they are. This is the very definition of twitterbrain: the person who expect citations. They don't seek hints, they are not interested in proof, they just want authoritative citations to spread them around without thinking.

than a single, rather dense, Wikipedia article without so much as a single hyperlink for further reading.

This “single, rather dense, Wikipedia article” includes:
1. Explanation of the algorithm.
2. Proof of the O(N) complexity.
3. Direct explanation of how the algorithm can be used in quicksort.
More then enough for anyone with more than half a brain. If you don't trust wikipedia then you can just check the included proof - but apparently this is beyond abilities of “twitter generation”.

Note that the Median of Medians algorithm, on its own, does in fact return a result between 30% and 70% of the list, not the true median at 50%.

And now we are back to square one. I know, I know, the temptation to stop thinking and start clicking and copy-pasting is almost irresistible. But please stop for a moment and think: just how the h*ll median of medians algorithm works? It needs to find true median in a smaller (N/5) array. How can it do it if it does not refer any other algorithm and if it can not find median itself? Answer is simple: "median of medians" is intermediate step which guarantees that recursive call will have less then 70% of elements to deal with. This is where T(n/5) + O(n) in T(n) ≤ T(n/5) + T(7n/10) + O(n) formula comes from. But this is not the end result! The end result is kth largest element (and median is, of course, N/2th largest element thus it can be produced by algorithm directly). This is where T(7n/10) in the aforementioned formula comes from.

Is it so hard to understand?

Lowering the bar and hitting people with it

Posted Jan 15, 2012 14:13 UTC (Sun) by man_ls (guest, #15091) [Link]

I'm so glad khim plonked me...

Are you moron or just play one on TV?

Posted Jan 15, 2012 8:18 UTC (Sun) by cmccabe (guest, #60281) [Link]

> Let's see.
>
> > ...
> > list2 = sorted(list)
> > ...
> Brilliant idea. Not. This single line guarantees that your
> median_of_medians function will need at least O(N log N) operations in
> some cases. This, in turn, will probably mean that your "outer sort
> algorithm" will have at least O(N log²N) complexity, not desired O(N log
> N). If you'll use it recursively then situation can be even worse (not
> sure if you can push it all the way to O(N²): for that to happen you'll
> need to hit “worst case” often enough and I'm not 100% sure it's possible
> to organize).

It is true that I could have implemented find_median more efficiently. However, this does not affect the overall time complexity of the algorithm. We never find the median of more than 3 numbers at a time.

In other words, we are doing O(3 log 3) work instead of O(3) work at each recursion. But O(3 log 3) and O(3) are both still O(1).

As for the rest of what you wrote-- yes, it looks like the Java situation is more complex than I made it out to be. They do still use some form of quicksort for primitive data types-- probably to get the benefits of in-place sorting.

I don't understand the other stuff you wrote at all. As far as I know, we all agree on the fact that O(N log N) worst-case quicksort is possible, O(N) quickselect is possible, and we all understand how median-of-medians works. What exactly are we in disagreement about, if anything?

By the way, I am probably not going to continue posting to this thread unless someone posts some math that's clearly wrong-- like the above confusion about constant factors.

And the saga continiues!

Posted Jan 15, 2012 11:42 UTC (Sun) by khim (subscriber, #9252) [Link]

It is true that I could have implemented find_median more efficiently. However, this does not affect the overall time complexity of the algorithm. We never find the median of more than 3 numbers at a time.

Just how many times can you say gibberish without checking facts? Here is your call which processes more then 3 numbers at a time:

```    return find_median(m)
```
Here size of m is N/3 and O(N/3 log(N/3)) is O(N log N), not O(N) or O(1), sorry.

As far as I know, we all agree on the fact that O(N log N) worst-case quicksort is possible,

True.

O(N) quickselect is possible

This remains to be seen. We can use median of medians algorithm to find pivot element for quickselect - but this is kinda pointless because median of medians algorithm can produce result directly. Formally it'll be quickselect which produces result in O(N) but on practice it's be just a useless addendum to another algorithm which solves the task just fine on it's own. If someone can offer useful way to find “good” pivot for quickselect (which does not use another algorithm capable of solving the task on it's own) with guaranteed complexity O(N) - it'll be interesting. So far I've not seen such algorithms.

Note that while median of medians algorithm is based on quickselect it's quite distinct from quickselect. For example quickselect recursively calls itself once on each step while median of medians algorithm calls itself twice on each step.

and we all understand how median-of-medians works.

And this is yet another NONSENSE
0. Yes, we (as in: HelloWorld, me, and now even nybble41) understand how it works. You still refuse to accept it.
1. Your algorithm produces median of medians in O(N log N), not in O(N).
2. Apparently it's still not obvious for you that you can only find median of medians in O(N) if you can find just median in O(N) - and then you can just use said median as pivot point!

When/if you'll understand how median of medians algorithm works you'll understand why you was wrong all along - from your first post in this thread. The fact that you groups include 3 elements strongly suggests that you still don't understand how median of medians algorithm works.

By the way, I am probably not going to continue posting to this thread unless someone posts some math that's clearly wrong-- like the above confusion about constant factors.

Fine with me. The only one who posts “some math that's clearly wrong” in this thread is you, anyway.

And the saga continiues!

Posted Jan 16, 2012 22:05 UTC (Mon) by cmccabe (guest, #60281) [Link]

Let's derive the running time of the median of medians algorithm. Since you can use Google, you already know that the answer is O(n). But do you know why?

Median of medians is a divide-and-conquer algorithm. In each stage of the recursion, we split the array into K subarrays and recurse on them. To combine the results of those recursions, we do a constant amount of work.

So the running time for an array of length N is
T(n) = kT(n/k) + C
where k and C are constants. Usually K=5.

Luckily, we can solve this recurrence with case one of the master theorem. This gives a running time of O(n).

What if, instead, we did O(n) work to combine the results of the recursions? This is essentially what you are claiming.

Then the recurrence would be
T(n) = kT(n/k) + Cn + D
where k, C, and D are constants.
By case 2 of the master theorem, the running time would be O(n log n).

Incidentally, this is the reason why quicksort's running time is O(n log n)-- because it does O(n) work before doing each recursion. In quicksort's case, k = 2.

Anyone can use Google and find the running time of an algorithm. But unless you can derive it, you do not truly understand it. Perhaps you need to do a little bit less talking and a little bit more listening.

Is ignorance bliss?

Posted Jan 17, 2012 8:39 UTC (Tue) by khim (subscriber, #9252) [Link]

Let's derive the running time of the median of medians algorithm.

Let's do.

Since you can use Google, you already know that the answer is O(n). But do you know why?

Yes, I do. I also know that your contraption has nothing to do with medians of medians algorithm - that's why I was confused.

Median of medians is a divide-and-conquer algorithm. In each stage of the recursion, we split the array into K subarrays and recurse on them. To combine the results of those recursions, we do a constant amount of work.

Oops. Ah, now I see. Sorry, I missed the fact that you call median_of_medians recursively. Very embarrassing: I did the same mistake you did - have looked on the name of the algorithm and assumed it just picks medians of pieces and then selects median from these.

Well... this algorithm is linear, all right. The only problem: it does not guarantee linear complexity of quicksort! You basically split array in two uneven pieces, then combine six (if k == 5 then ten) such arrays to organize bigger array and you gurantee that at least two pieces go to the left and at least two pieces go to the right. This means that each recursion step potentially amplifies the disproportion. In the end you can have two pieces of quite disproportionate sizes. It's not clear if you can organize array in such a bad fashion as to push complexity of quicksort back to O(N²) but this looks highly probable.

The property of pivot produced by Median of Medians algorithm is quite different: it's always between 30% and 70% elements and these percentages do not depend on the number of recursive calls. Why? Median of Medians algorithm also introduces disproportions at each step, right? Yes, but it includes mechanism which fixes these disproportions. This is what guarantees O(N) complexity for finding true median and this is what guarantees O(N log N) complexity for quicksort.

Do you have any proof that your “median of median of median…” algorithm can not produce bad results at each step of quicksort? If not then this will put the whole excercise in the same bucket as “median of three” and not in the bucket of Median of Medians algorithm which guarantees O(N) complexity and guarantees that quicksort will not go to recursion lever deeper then log₂N. I've assumed that your code at least keeps the second property, but apparently you were more concerned with first. My bad.

Is ignorance bliss?

Posted Jan 19, 2012 8:02 UTC (Thu) by cmccabe (guest, #60281) [Link]

* Some people choose to call quickselect + the clever pivot selection algorithm "median of medians." Yes, this is confusing, given that finding the median of medians is only one step of the process. But it's not worth getting excited about.

* The clever pivot selection algorithm is still recursive. Yes, it's a recursive algorithm within another recursive algorithm. We are very clever, aren't we.

* When choosing a pivot for quickselect with the method I described, you need to have k=5 rather than k=3; otherwise the quickselect can still go n^2.

* Your prose reminds me "time cube." But that's probably because I was "educated stupid and evil."

Yet another small correction...

Posted Jan 19, 2012 8:42 UTC (Thu) by khim (subscriber, #9252) [Link]

When choosing a pivot for quickselect with the method I described, you need to have k=5 rather than k=3; otherwise the quickselect can still go n^2.

Sadly this is not true. Your argorithm will still introduce disbalance on each step - even with "k == 5". Disbalance will be smaller (2+0.3N/3+0.7N instead of 1+⅓N/2+⅔N), but recursive calls will still compound it thus the whole algorithm will have larger then O(N log N) complexity (most probably O(N²) with "evil source").

Median of Medians algorithm uses two recursive calls to battle this phenomenon: it finds "more-or-less Ok median" (30%/70%) using first recursive call with 0.2N elements and then it "fixes it" using another recursive call with no more then 0.7N elements. Disbalance is fixed at each step thus it does not grow beyond certain point (30%/70%) no matter how many steps there are - and the whole thing needs O(N) operations: T(N) ≤ c*N*(1 + (9/10) + (9/10)² + …) = O(N). If you'll use "k == 3" then your first pass will use ⅓N elements and second pass will use ⅔N elements and this will mean T(N) ≤ c*N*(1 + 1 + …) ≄ O(N).

Another note...

Posted Jan 19, 2012 9:23 UTC (Thu) by khim (subscriber, #9252) [Link]

There are another interesting fact related to Median of Medians algorithm: k must be at least 5 only because algorithm looks for kth largest element. In this case “k == 3” just does not work (as I've noted above). However if want to find median then not only "k == 3" works, it usually works better then “k == 5”. This is true because in the "find median" task you can throw away not just “top left” xor “bottom right” elements (as pictured in Wikipedia's illustration) but you can throw away both “top left” and “bottom right” elements. This will lead to complexity of T(N) ≤ C₁*N*(1 + (⅗) + (⅗)² + …) = O(N) for “k == 5” and to complexity of T(N) ≤ C₂*N*(1 + (⅔) + (⅔)² + …) = O(N) for “k == 3”, but ⅗ (for “k == 5”) comes from “⅕ + ⅖” (for two recursive calls) while ⅔ (for “k == 3”) comes from “⅓ + ⅓” (for two recursive calls). Thus in most cases version with “k == 3” is faster (because recursion depth is smaller), but difference is small and you must correctly handle case of N=3M+1...