/dev/ksm: dynamic memory sharing

By Jonathan Corbet
November 12, 2008
The kernel generally goes out of its way to share identical memory pages between processes. Program text is always shared, for example. But writable pages will also be shared between processes when the kernel knows that the contents of the memory are the same for all processes involved. When a process calls fork(), all writable pages are turned into copy-on-write (COW) pages and shared between the parent and child. As long as neither process modifies the contents of any given page, that sharing can continue, with a corresponding reduction in memory use.

Copy-on-write with fork() works because the kernel knows that each process expects to find the same contents in those pages. When the kernel lacks that knowledge, though, it will generally be unable to arrange sharing of identical pages. One might not think that this would ordinarily be a problem, but the KVM developers have come up with a couple of situations where this kind of sharing opportunity might come about. Your editor cannot resist quoting this case proposed by Avi Kivity:

Consider the typical multiuser gnome minicomputer with all 150 users reading lwn.net at the same time instead of working. You could share the firefox rendered page cache, reducing memory utilization drastically.

Beyond such typical systems, though, consider the case of a host running a number of virtualized guests. Those guests do not share the sort of process-tree relationship which makes the sharing of pages between them easy, but they may well be using a substantial portion of their memory to hold identical contents. If that host could find a way to force the sharing of pages with identical contents, it should be able to make much better use of its memory and, as a result, run more guests. This is the kind of thing which gets the attention of virtualization developers. So the hackers at Qumranet (now part of Red Hat) — in particular Izik Eidus, Andrea Arcangeli, and Chris Wright — have put together a mechanism to make that kind of sharing happen. The resulting code, called KSM, was recently posted for wider review.

KSM takes the form of a device driver for a single, virtual device: /dev/ksm. A process which wants to take part in the page sharing regime can open that device and register (with an ioctl() call) a portion of its address space with the KSM driver. Once the page sharing mechanism is turned on (via another ioctl()), the kernel will start looking for pages to share.

The algorithm is relatively simple. The KSM driver, inside a kernel thread, picks one of the memory regions registered with it and starts scanning over it. For each page which is resident in memory, KSM will generate an SHA1 hash of the page's contents. That hash will then be used to look up other pages with the same hash value. If a subsequent memcmp() call shows that the contents of the pages are truly identical, all processes with a reference to the scanned page will be pointed (in COW mode) to the other one, and the redundant page will be returned to the system. As long as nobody modifies the page, the sharing can continue; once a write operation happens, the page will be copied and the sharing will end.
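The core of that merging logic can be illustrated with a small user-space toy. This is a simplified sketch of the approach described above, not code from the patch; it uses a cheap hash in place of SHA1 and a private buffer in place of live process memory, but the hash-as-a-hint-then-memcmp() structure is the same.

/*
 * Toy illustration of the KSM merging logic: hash each page, use the
 * hash only as a hint, and confirm with memcmp() before treating two
 * pages as identical.  Not from the patch; names are invented.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    16
#define NBUCKETS  64

/* cheap stand-in for SHA1 (64-bit FNV-1a), good enough for a demo */
static uint64_t hash_page(const unsigned char *page)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	size_t i;

	for (i = 0; i < PAGE_SIZE; i++)
		h = (h ^ page[i]) * 0x100000001b3ULL;
	return h;
}

int main(void)
{
	static unsigned char pages[NPAGES][PAGE_SIZE];
	int bucket_page[NBUCKETS];
	uint64_t bucket_hash[NBUCKETS];
	int i, merged = 0;

	for (i = 0; i < NBUCKETS; i++)
		bucket_page[i] = -1;

	/* half the "pages" share identical contents */
	for (i = 0; i < NPAGES; i++)
		memset(pages[i], i % 2 ? 0xaa : i, PAGE_SIZE);

	for (i = 0; i < NPAGES; i++) {
		uint64_t h = hash_page(pages[i]);
		int b = (int)(h % NBUCKETS);

		if (bucket_page[b] >= 0 && bucket_hash[b] == h &&
		    !memcmp(pages[i], pages[bucket_page[b]], PAGE_SIZE)) {
			/* the real driver would now remap this page,
			 * COW-protected, onto the earlier copy */
			merged++;
			continue;
		}
		/* remember this page; a real implementation would
		 * chain hash collisions rather than overwrite */
		bucket_page[b] = i;
		bucket_hash[b] = h;
	}
	printf("%d of %d pages could be shared\n", merged, NPAGES);
	return 0;
}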

The kernel thread will scan up to a maximum number of pages before going to sleep for a while. Both the number of pages to scan and the sleep period are passed in as parameters to the ioctl() call which starts scanning. A user-space control process can also pause scanning via another ioctl() call.
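In outline, a control process might drive the interface along these lines. The ioctl names and structure layouts below are hypothetical placeholders invented for illustration; only the overall sequence — open the device, register a region, start the scanner with a page count and sleep time — comes from the description above.

/*
 * Sketch of a user-space KSM control process.  The ioctl numbers and
 * structures are illustrative placeholders, not taken from the patch.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

struct ksm_region_req {			/* hypothetical */
	unsigned long addr;
	unsigned long npages;
};

struct ksm_scan_req {			/* hypothetical */
	unsigned int pages_to_scan;	/* pages scanned per pass */
	unsigned int sleep_msecs;	/* pause between passes */
	unsigned int running;		/* 1 = start, 0 = pause */
};

#define KSM_REGISTER_REGION	_IOW('k', 1, struct ksm_region_req)	/* hypothetical */
#define KSM_START_SCAN		_IOW('k', 2, struct ksm_scan_req)	/* hypothetical */

int main(void)
{
	struct ksm_scan_req scan = { .pages_to_scan = 256,
				     .sleep_msecs = 1000, .running = 1 };
	struct ksm_region_req region;
	void *buf;
	int fd = open("/dev/ksm", O_RDWR);

	if (fd < 0) {
		perror("/dev/ksm");
		return 1;
	}

	/* some memory we would like the kernel to try to deduplicate */
	buf = aligned_alloc(4096, 1024 * 4096);
	if (!buf)
		return 1;
	region.addr = (unsigned long)buf;
	region.npages = 1024;

	if (ioctl(fd, KSM_REGISTER_REGION, &region) < 0 ||
	    ioctl(fd, KSM_START_SCAN, &scan) < 0) {
		perror("ioctl");
		return 1;
	}

	/* ... do real work; the KSM thread scans in the background ... */
	close(fd);
	return 0;
}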

The initial response to the patch from Andrew Morton was not entirely enthusiastic:

The whole approach seems wrong to me. The kernel lost track of these pages and then we run around post-facto trying to fix that up again. Please explain (for the changelog) why the kernel cannot get this right via the usual sharing, refcounting and COWing approaches.

The answer from Avi Kivity was reasonably clear:

For kvm, the kernel never knew those pages were shared. They are loaded from independent (possibly compressed and encrypted) disk images. These images are different; but some pages happen to be the same because they came from the same installation media.

Izik Eidus adds that, with this patch, a host running a bunch of Windows guests is able to overcommit its memory by 300% without terribly ill effects. This technique, it seems, is especially effective with Windows guests: Windows apparently zeroes all freed memory, so each guest's list of free pages can be coalesced down to a single, shared page full of zeroes.

What has not been done (or, at least, not posted) is any sort of benchmarking of the impact KSM has on a running system. The scanning, hashing, and comparing of pages will require some CPU time, and it is likely to have noticeable cache effects as well. If you are trying to run dozens of Windows guests, cache effects may well be relatively low on your list of problems. But that cost may be sufficient to prevent the more general use of KSM, even though systems which are not using virtualization at all may still have a lot of pages with identical contents.



Zeroing freed memory

Posted Nov 12, 2008 23:56 UTC (Wed) by Felix_the_Mac (guest, #32242) [Link]

"Windows apparently zeroes all freed memory"

On the face of it that sounds pretty sensible (from a security perspective).
Why doesn't Linux do it?

Zeroing freed memory

Posted Nov 13, 2008 0:48 UTC (Thu) by nix (subscriber, #2304) [Link]

Because it pointlessly blows the dcache (a precious resource), generally
to very little gain, because a lot of freed userspace pages are reused for
something other than userspace pages and are filled with something else,
and those pages which *are* reused for other userspace pages are zeroed at
*that* point.

Zeroing freed memory

Posted Nov 13, 2008 22:45 UTC (Thu) by bdauvergne (guest, #6989) [Link]

and there is still madvise(MADV_DONTNEED) to release the physical pages to the system.

Zeroing freed memory

Posted Nov 13, 2008 9:19 UTC (Thu) by dlang (✭ supporter ✭, #313) [Link]

zeroing the memory when it's freed is better from a security point of view, but it's expensive to do.

Since it may sometimes not need to be zeroed (besides the kernel uses noted in the post above, if the page is going to be used to hold executable code, the appropriate code can just be loaded into it; there's no benefit to zeroing it out first), and at other times it can be zeroed when the system is idle, Linux does the more efficient thing and zeros the page with as little impact to the rest of the system as possible.

Linux can do it

Posted Nov 13, 2008 15:07 UTC (Thu) by wtogami (subscriber, #32325) [Link]

http://udrepper.livejournal.com/11429.html
If you set MALLOC_PERTURB_=$NUMBER, it sets all malloc'ed bytes to $NUMBER, and the bitwise inverse upon free. It is great to expose otherwise difficult to detect bugs, at the expense of speed. It might also be useful for /dev/ksm.
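A minimal demonstration (my own toy program, not from the glibc documentation); build it and run it with the variable set, e.g. MALLOC_PERTURB_=165 ./perturb-demo:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	unsigned char *p = malloc(16);

	if (!p)
		return 1;

	/* with MALLOC_PERTURB_ set, this prints the fill pattern
	 * rather than whatever happened to be on the heap */
	printf("fresh allocation: 0x%02x\n", p[0]);

	p[0] = 0x42;
	free(p);

	/* deliberately reading freed memory (undefined behaviour,
	 * done here only to show the free-time fill pattern) */
	printf("after free():     0x%02x\n", p[0]);
	return 0;
}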

Linux can do it

Posted Nov 13, 2008 19:52 UTC (Thu) by nix (subscriber, #2304) [Link]

It's a really cool feature, but, well, Ulrich says that it 'Seems like the
number of people who know this feature is still almost zero'. Yes, that's
because it was never documented, as with pretty much everything glibc can
do above POSIX. (e.g., quick, how does LD_AUDIT work? How do you use it?
Good luck finding out without reading the source, and it's tricky to
understand even then.)

Zeroing freed memory

Posted Nov 14, 2008 15:10 UTC (Fri) by PaXTeam (subscriber, #24616) [Link]

> On the face of it that sounds pretty sensible (from a security perspective).
> Why doesn't Linux do it?

PaX has had such a feature for some time, but its performance impact isn't negligible. I only have numbers for an early, naive implementation (pages were effectively zeroed twice): the kernel time of a kernel compilation went up by some 40%, IIRC, so even with the current implementation it's probably no better than 20%. Now, this is kernel time only; if your workload is mostly userland then you will care a lot less. Otherwise you'll have to find out where on the user/kernel scale you fall and decide accordingly whether it's worth it.

Blank pages

Posted Nov 13, 2008 10:56 UTC (Thu) by rvfh (subscriber, #31018) [Link]

"Windows apparently zeroes all freed memory"

Could not the system just release zeroed pages altogether until someone writes something in them, rather than sharing them amongst KVM instances? It seems easier and less costly to do than the SHA1 thing, and it seems (in Windows' case) to yield some memory economies.

Especially if it's freed memory...

Or did I miss something?

Blank pages

Posted Nov 13, 2008 14:20 UTC (Thu) by avik (guest, #704) [Link]

There are also non-zero pages which can be shared, such as text and read-only data.

Blank pages

Posted Nov 14, 2008 16:34 UTC (Fri) by jzbiciak (✭ supporter ✭, #5246) [Link]

I'm pretty sure Linux has hypervisor hooks to allow the hypervisor to tell a hosted OS to give back pages, and for a hosted OS to tell the hypervisor when it may have pages back. I don't know what Windows offers here.

I get the sense, though, that such changes in page ownership are coarse-grained, not fine-grained, since there's a cost in passing ownership back and forth.

(Note that I'm not actually super familiar with the paravirt_ops interface. I just remember reading about it here.)

Why not just use the SHA1 only?

Posted Nov 15, 2008 0:06 UTC (Sat) by nevets (subscriber, #11875) [Link]

KSM will generate an SHA1 hash of the page's contents. That hash will then be used to look up other pages with the same hash value. If a subsequent memcmp() call shows that the contents of the pages are truly identical,

I've been informed by the git Gods that the chances of a SHA1 collision are the same as the Earth being sucked up into the black hole created by the CERN accelerator. If this is indeed true, then there's no need for that extra memcmp().

I brought this up to the git God himself about git's use of SHA1. He did agree with me that if there were to be a collision of SHA1s in git, the database would be corrupted. But he blew it off as a snowball's chance in hell. He did show some concern that there might be a way to crack the algorithm.

But who am I to question a git God?

Why not just use the SHA1 only?

Posted Nov 15, 2008 8:25 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

Actually, the statement is that you can't deliberately come up with a conflicting SHA1.

There are databases that hold the SHA1s of various files, and there are a lot of known conflicts in them.

Why not just use the SHA1 only?

Posted Nov 15, 2008 14:46 UTC (Sat) by jbh (subscriber, #494) [Link]

Are you sure? According to Wikipedia, none have been found (although it is known that a collision can be found with complexity 2^63, less than the expected 2^80).

Why not just use the SHA1 only?

Posted Nov 15, 2008 14:56 UTC (Sat) by jbh (subscriber, #494) [Link]

Just to be clear: if you restrict yourself to "collision-prone" SHA1s, there's a 1/2^63 chance of conflict. With normal (random) SHA1s, the chance is 1/2^80. Deliberately creating a conflict with a given SHA1 (a second preimage attack) is still 1/2^160, and the chance of that second preimage being non-gibberish is substantially lower.

Why not just use the SHA1 only?

Posted Nov 15, 2008 19:45 UTC (Sat) by scarabaeus (subscriber, #7142) [Link]

IMHO nevets's suggestion is the wrong way round. For better performance, the initial checksum should be a very fast 64-bit checksum - possibly even simpler than CRC. (The style of weak+fast checksum used by rsync springs to mind...)

Even on systems with huge numbers of pages, the likelihood of hash collisions will be far too low to affect performance, also because memcmp() will abort early if it detects differences between the compared pages.

This way, you save memory for the checksums (which also improves hash lookup performance), and the checking will be faster.
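For reference, the rsync-style weak checksum alluded to above boils down to two running byte sums packed into one word. A minimal (non-rolling) sketch, not taken from any patch:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

static uint32_t weak_checksum(const unsigned char *buf, size_t len)
{
	uint32_t a = 0, b = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		a += buf[i];	/* plain byte sum */
		b += a;		/* position-weighted sum */
	}
	return (a & 0xffff) | (b << 16);
}

int main(void)
{
	static unsigned char page[PAGE_SIZE];

	memset(page, 0xaa, sizeof(page));
	printf("checksum: 0x%08x\n", weak_checksum(page, sizeof(page)));
	return 0;
}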

Why not just use the SHA1 only?

Posted Nov 18, 2008 14:41 UTC (Tue) by nevets (subscriber, #11875) [Link]

Actually, my comment was a little facetious. My point was not a way to fix this algorithm, but a comment against what git is doing. The git repo really relies on there being absolutely no conflicts. If one happens, then the database is corrupted. I keep hearing that the chances of this happening are astronomically low, but the fact that it can happen at all bothers me.

Why not just use the SHA1 only?

Posted Nov 18, 2008 16:07 UTC (Tue) by nix (subscriber, #2304) [Link]

Jean-Luc Herren did the maths recently on the git list, in
<48E4ABC0.80100@gmx.ch>:

In case it's interesting to someone, I once calculated (and wrote
down) the math for the following scenario:

- 10 billion humans are programming
- They *each* produce 5000 git objects every day
- They all push to the same huge repository
- They keep this up for 50 years

With those highly exaggerated assumptions, the probability of
getting a hash collision in that huge git object database is
6e-13. Provided I got the math right.

So, mathematically speaking you have to say "yes, it *is*
possible". But math aside it's perfectly correct to say "no, it
won't happen, ever". (Speaking about the *accidental* case.)
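As a back-of-the-envelope check (my arithmetic, not from the quoted message), the standard birthday bound for n random 160-bit hashes lands in the same ballpark:

\[
n \approx 10^{10} \times 5000 \times 365 \times 50 \approx 9 \times 10^{17},
\qquad
P(\mathrm{collision}) \approx \frac{n^{2}}{2 \cdot 2^{160}} \approx 3 \times 10^{-13},
\]

which is within a small factor of the 6e-13 figure quoted above.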

Why not just use the SHA1 only?

Posted Nov 18, 2008 18:37 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

git will never overwrite an object that it thinks it has.

So git could get corrupted, but it would not be corrupted by overwriting old data and losing it; it would be corrupted by not saving new data (which is much easier to detect, as that is the data people would be trying to use).

There is an option in git (I don't remember if it's compile-time or not) to do additional checking when saving data: even when the hash matches, it checks that the data really is the same and gives an error if it's not.

Why not just use the SHA1 only?

Posted Nov 16, 2008 13:39 UTC (Sun) by nix (subscriber, #2304) [Link]

It only generates a hash of the first, what, 128 bytes of the page, so any
pages with the same leading 128 bytes will 'collide' (in the sense that
the first 128 bytes are identical, but the rest are not).

Why not just use the SHA1 only?

Posted Nov 18, 2008 4:30 UTC (Tue) by nevets (subscriber, #11875) [Link]

From your friends at Wikipedia, there is an article on SHA-1

SHA-1 (as well as SHA-0) produces a 160-bit digest from a message with a maximum length of (2^64 − 1) bits

This looks like it can be much bigger than 128 bytes.

Why not just use the SHA1 only?

Posted Nov 18, 2008 16:01 UTC (Tue) by nix (subscriber, #2304) [Link]

Yeah, but for speed reasons they're only hashing the first N bytes (I
think it's 128), rather than the whole page. It's a sensible tradeoff, I
think.

Why not just use the SHA1 only?

Posted May 23, 2010 9:47 UTC (Sun) by rafal.maj (guest, #66508) [Link]

This doesn't make sense; they are surely hashing the entire page.

Why not just use the SHA1 only?

Posted May 23, 2010 10:57 UTC (Sun) by johill (subscriber, #25196) [Link]

Well, but you clearly didn't read the code:
+#define PAGECMP_OFFSET 128
+#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
+/* hash the page */
+static void page_hash(struct page *page, unsigned char *digest)
+{
+	struct scatterlist sg;
+	struct hash_desc desc;
+
+	sg_init_table(&sg, 1);
+	sg_set_page(&sg, page, PAGEHASH_SIZE, 0);
+	desc.tfm = tfm;
+	desc.flags = 0;
+	crypto_hash_digest(&desc, &sg, PAGEHASH_SIZE, digest);
+}
and it does "for sure" make sense since it's just a way to speed up matching, and just using 128 bytes means much less loading from memory. Tuning the 128 might make sense, potentially.

ksm and mergemem

Posted Nov 25, 2008 21:46 UTC (Tue) by anton (guest, #25547) [Link]

KSM is essentially the same idea as mergemem, which was developed about 10 years ago. This is not just useful for virtual machines, but also for ordinary multi-user machines (as explained by the lwn-reading example), and that's what mergemem was developed for. Unfortunately for mergemem, RAM was cheap, and few people in the Linux community seemed to be interested in multi-user systems. Maybe with the current interest in virtual machines KSM will fare better.

Hash?

Posted May 23, 2010 14:42 UTC (Sun) by dmarti (subscriber, #11625) [Link]

Didn't they have to avoid hashing pages because of a software patent problem? (Following these software patent stories is like reading comic books and trying to remember who's really dead.)

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds