|
|

# And the saga continiues!

## And the saga continiues!

Posted Jan 16, 2012 22:05 UTC (Mon) by cmccabe (guest, #60281)
In reply to: And the saga continiues! by khim
Parent article: Denial of service via hash collisions

Let's derive the running time of the median of medians algorithm. Since you can use Google, you already know that the answer is O(n). But do you know why?

Median of medians is a divide-and-conquer algorithm. In each stage of the recursion, we split the array into K subarrays and recurse on them. To combine the results of those recursions, we do a constant amount of work.

So the running time for an array of length N is
T(n) = kT(n/k) + C
where k and C are constants. Usually K=5.

Luckily, we can solve this recurrence with case one of the master theorem. This gives a running time of O(n).

What if, instead, we did O(n) work to combine the results of the recursions? This is essentially what you are claiming.

Then the recurrence would be
T(n) = kT(n/k) + Cn + D
where k, C, and D are constants.
By case 2 of the master theorem, the running time would be O(n log n).

Incidentally, this is the reason why quicksort's running time is O(n log n)-- because it does O(n) work before doing each recursion. In quicksort's case, k = 2.

Anyone can use Google and find the running time of an algorithm. But unless you can derive it, you do not truly understand it. Perhaps you need to do a little bit less talking and a little bit more listening.

Is ignorance bliss?

Posted Jan 17, 2012 8:39 UTC (Tue) by khim (subscriber, #9252) [Link]

Let's derive the running time of the median of medians algorithm.

Let's do.

Since you can use Google, you already know that the answer is O(n). But do you know why?

Yes, I do. I also know that your contraption has nothing to do with medians of medians algorithm - that's why I was confused.

Median of medians is a divide-and-conquer algorithm. In each stage of the recursion, we split the array into K subarrays and recurse on them. To combine the results of those recursions, we do a constant amount of work.

Oops. Ah, now I see. Sorry, I missed the fact that you call median_of_medians recursively. Very embarrassing: I did the same mistake you did - have looked on the name of the algorithm and assumed it just picks medians of pieces and then selects median from these.

Well... this algorithm is linear, all right. The only problem: it does not guarantee linear complexity of quicksort! You basically split array in two uneven pieces, then combine six (if k == 5 then ten) such arrays to organize bigger array and you gurantee that at least two pieces go to the left and at least two pieces go to the right. This means that each recursion step potentially amplifies the disproportion. In the end you can have two pieces of quite disproportionate sizes. It's not clear if you can organize array in such a bad fashion as to push complexity of quicksort back to O(N²) but this looks highly probable.

The property of pivot produced by Median of Medians algorithm is quite different: it's always between 30% and 70% elements and these percentages do not depend on the number of recursive calls. Why? Median of Medians algorithm also introduces disproportions at each step, right? Yes, but it includes mechanism which fixes these disproportions. This is what guarantees O(N) complexity for finding true median and this is what guarantees O(N log N) complexity for quicksort.

Do you have any proof that your “median of median of median…” algorithm can not produce bad results at each step of quicksort? If not then this will put the whole excercise in the same bucket as “median of three” and not in the bucket of Median of Medians algorithm which guarantees O(N) complexity and guarantees that quicksort will not go to recursion lever deeper then log₂N. I've assumed that your code at least keeps the second property, but apparently you were more concerned with first. My bad.

Is ignorance bliss?

Posted Jan 19, 2012 8:02 UTC (Thu) by cmccabe (guest, #60281) [Link]

* Some people choose to call quickselect + the clever pivot selection algorithm "median of medians." Yes, this is confusing, given that finding the median of medians is only one step of the process. But it's not worth getting excited about.

* The clever pivot selection algorithm is still recursive. Yes, it's a recursive algorithm within another recursive algorithm. We are very clever, aren't we.

* When choosing a pivot for quickselect with the method I described, you need to have k=5 rather than k=3; otherwise the quickselect can still go n^2.

* Your prose reminds me "time cube." But that's probably because I was "educated stupid and evil."

Yet another small correction...

Posted Jan 19, 2012 8:42 UTC (Thu) by khim (subscriber, #9252) [Link]

When choosing a pivot for quickselect with the method I described, you need to have k=5 rather than k=3; otherwise the quickselect can still go n^2.

Sadly this is not true. Your argorithm will still introduce disbalance on each step - even with "k == 5". Disbalance will be smaller (2+0.3N/3+0.7N instead of 1+⅓N/2+⅔N), but recursive calls will still compound it thus the whole algorithm will have larger then O(N log N) complexity (most probably O(N²) with "evil source").

Median of Medians algorithm uses two recursive calls to battle this phenomenon: it finds "more-or-less Ok median" (30%/70%) using first recursive call with 0.2N elements and then it "fixes it" using another recursive call with no more then 0.7N elements. Disbalance is fixed at each step thus it does not grow beyond certain point (30%/70%) no matter how many steps there are - and the whole thing needs O(N) operations: T(N) ≤ c*N*(1 + (9/10) + (9/10)² + …) = O(N). If you'll use "k == 3" then your first pass will use ⅓N elements and second pass will use ⅔N elements and this will mean T(N) ≤ c*N*(1 + 1 + …) ≄ O(N).

Another note...

Posted Jan 19, 2012 9:23 UTC (Thu) by khim (subscriber, #9252) [Link]

There are another interesting fact related to Median of Medians algorithm: k must be at least 5 only because algorithm looks for kth largest element. In this case “k == 3” just does not work (as I've noted above). However if want to find median then not only "k == 3" works, it usually works better then “k == 5”. This is true because in the "find median" task you can throw away not just “top left” xor “bottom right” elements (as pictured in Wikipedia's illustration) but you can throw away both “top left” and “bottom right” elements. This will lead to complexity of T(N) ≤ C₁*N*(1 + (⅗) + (⅗)² + …) = O(N) for “k == 5” and to complexity of T(N) ≤ C₂*N*(1 + (⅔) + (⅔)² + …) = O(N) for “k == 3”, but ⅗ (for “k == 5”) comes from “⅕ + ⅖” (for two recursive calls) while ⅔ (for “k == 3”) comes from “⅓ + ⅓” (for two recursive calls). Thus in most cases version with “k == 3” is faster (because recursion depth is smaller), but difference is small and you must correctly handle case of N=3M+1...