See? Patents do help foster innovation!

Posted May 8, 2009 9:34 UTC (Fri) by muwlgr (guest, #35359)
In reply to: See? Patents do help foster innovation! by lemmings
Parent article: KSM tries again

To find identical pages by comparing their hashes, you first have to calculate those hashes, reading the full contents of each page into CPU registers (and maybe even evicting all the useful data from the CPU cache). Your memory traffic is 100% of the pages you scan, every time.

In contrast, when you compare pages by their contents right from the start, in most cases you can spot differing pages quite quickly, by reading only their first few bytes. You have to read two whole pages only to find out that they are identical. And this is done without the mathematical overhead of hash calculation, remember. So your memory traffic is 100% only for those pages that happen to be identical. For pages that differ pairwise, you usually have to read only a small part of them to find that out. With random page contents, you stop comparing at the first byte in 255/256 cases, and within the first 32-bit word in (2^32-1)/2^32 cases. On Linux there can be differing pages with longer common prefixes, but I think the prefix will be shorter than half the page length often enough. So for differing memory pages you most probably don't pay 100% of the memory traffic anyway. I can clearly see why the simplified approach of KSM could have its benefits.
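
A minimal sketch of that early-exit behaviour, assuming 4 KiB pages and plain libc memcmp() (an illustration, not KSM's actual code):

    #include <stdbool.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* memcmp() stops at the first differing byte, so for random contents
     * the expected read is barely more than one byte per differing pair;
     * only truly identical pages force a read of all PAGE_SIZE bytes. */
    static bool pages_identical(const unsigned char *a, const unsigned char *b)
    {
        return memcmp(a, b, PAGE_SIZE) == 0;
    }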



See? Patents do help foster innovation!

Posted May 8, 2009 10:50 UTC (Fri) by efexis (guest, #26355) (3 responses)

If you want to compare just two pages then yes, that may be true. It starts to break down, though, when you want to compare a page with, say, ten thousand other pages. Comparing a hash with ten thousand other hashes is going to be quicker than comparing a page with ten thousand other pages (your argument of only needing to scan the first x bytes of a page before the likelihood of finding a difference holds up when comparing hashes too). If the speed gained comparing hashes outweighs the time spent hashing the pages to begin with, then you are losing speed by not hashing. Of course it's not this simple; optimised sorting/indexing algorithms mean you don't have to compare every page with every other page to rule out matches (just as you wouldn't have to compare every pair of hashes). For example, what are the effects on the CPU memory cache of reading from all these pages many times, as opposed to reading much smaller hashes?
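
As a rough sketch of that scaling argument (bucket count, hash choice, and all names here are invented for illustration, not KSM's design): with a hash table, matching one page against N previously seen pages costs one full read to hash it plus O(1) expected bucket probes, rather than up to N partial comparisons.

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define NBUCKETS  (1 << 16)

    struct page_node {
        struct page_node *next;
        const unsigned char *data;
    };

    static struct page_node *buckets[NBUCKETS];

    /* 32-bit FNV-1a over the whole page: one full read per page, done once. */
    static uint32_t page_hash(const unsigned char *p)
    {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            h ^= p[i];
            h *= 16777619u;
        }
        return h;
    }

    /* Return a previously seen identical page, or NULL if there is none. */
    static const unsigned char *find_match(const unsigned char *page)
    {
        struct page_node *n = buckets[page_hash(page) % NBUCKETS];
        for (; n; n = n->next)
            if (memcmp(n->data, page, PAGE_SIZE) == 0) /* hashes can collide */
                return n->data;
        return NULL;
    }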

I think in this case testing and observation are going to be important; it's near impossible to speculate, given the dynamic nature of potentially millions of memory pages spread across anything from similar to disparate virtual systems, what the comparative results of the two methods will be.

See? Patents do help foster innovation!

Posted May 8, 2009 14:19 UTC (Fri) by nix (subscriber, #2304) (2 responses)

What I'd do there is hash the first 32/64/128 bytes or thereabouts of a page (one cacheline) and then use that partial hash to eliminate the majority of pages from comparison. (But this obvious solution, which took less than five seconds to think of, may be patented. Bleah.)
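
A minimal sketch of that partial-hash filter, assuming a 64-byte cacheline and using FNV-1a (the function name is invented for illustration):

    #include <stdint.h>

    /* Hash only the first cacheline; pages whose partial hashes differ
     * cannot be identical, so only colliding pages need a full memcmp(). */
    static uint64_t cacheline_hash(const unsigned char *page)
    {
        uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
        for (int i = 0; i < 64; i++) {
            h ^= page[i];
            h *= 1099511628211ULL;              /* FNV-1a prime */
        }
        return h;
    }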

See? Patents do help foster innovation!

Posted May 15, 2009 19:43 UTC (Fri) by nevyn (guest, #33129)

Speed improvement: don't bother hashing it, just use the first 8 bytes of the page as the first-pass value. I'd assume that couldn't be patented (given that it's a trivial optimization of what the memcmp() is doing) ... but who knows.
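
In code, that first pass might look like this (a sketch; the memcpy() avoids alignment and strict-aliasing problems):

    #include <stdint.h>
    #include <string.h>

    /* Use the page's first 8 bytes as a first-pass key: unequal keys mean
     * the pages cannot be identical; equal keys still need a full memcmp(). */
    static uint64_t first_pass_key(const void *page)
    {
        uint64_t key;
        memcpy(&key, page, sizeof(key));
        return key;
    }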

We can avoid reading the whole page for each comparison, even in the worst case.

Posted Nov 17, 2009 16:03 UTC (Tue) by gmatht (guest, #58961)

Say that we have a binary tree. Each left edge is labelled 0, and each right edge is labelled 1. Whenever we read a bit from the page we follow the edge with the label of that bit. So, for example, if we look for the zero page in the tree we navigate to the leftmost leaf node.

This is clearly very inefficient, as this simple algorithm needs exactly 4096*8 branches to find any page. However, we read the page exactly once for all comparisons. And we can optimize it; e.g. if we have a chain of nodes with only single children, we can merge them into a single edge.

Another approach would be to make use of the fact that if, say, we know all child pages of the current node start with "Foo", we don't need to compare the first 3 characters any more. As such, we'd usually only have to access each cache line once.
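
A sketch of the naive bit-trie described above (all names hypothetical; allocation-failure handling omitted): each page is read exactly once, bit by bit, no matter how many stored pages it is implicitly compared against. Merging single-child chains, as suggested, would turn this into a radix tree.

    #include <stdlib.h>

    #define PAGE_SIZE 4096

    struct trie_node {
        struct trie_node *child[2];  /* child[0] for bit 0, child[1] for bit 1 */
        const unsigned char *page;   /* set on leaves, after 4096*8 bits */
    };

    /* Read bit i of the page, most-significant bit of each byte first. */
    static int get_bit(const unsigned char *page, int i)
    {
        return (page[i / 8] >> (7 - (i % 8))) & 1;
    }

    /* Insert a page; return the already-stored duplicate if one exists. */
    static const unsigned char *trie_insert(struct trie_node *root,
                                            const unsigned char *page)
    {
        struct trie_node *n = root;
        for (int i = 0; i < PAGE_SIZE * 8; i++) {
            int b = get_bit(page, i);
            if (!n->child[b])
                n->child[b] = calloc(1, sizeof(*n));
            n = n->child[b];
        }
        if (n->page)
            return n->page;   /* identical page already present */
        n->page = page;
        return NULL;
    }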

