The kernel generally goes out of its way to share identical memory pages between processes. Program text is always shared, for example. But writable pages will also be shared between processes when the kernel knows that the contents of the memory are the same for all processes involved. When a process calls fork(), all writable pages are turned into copy-on-write (COW) pages and shared between the parent and child. As long as neither process has modified the contents of any given page, that sharing can continue, with a corresponding reduction in memory use.
Copy-on-write with fork() works because the kernel knows that each process expects to find the same contents in those pages. When the kernel lacks that knowledge, though, it will generally be unable to arrange sharing of identical pages. One might not think that this would ordinarily be a problem, but the KVM developers have come up with a couple of situations where this kind of sharing opportunity might come about. Your editor cannot resist this case proposed by Avi Kivity:
Beyond such typical systems, though, consider the case of a host running a number of virtualized guests. Those guests will not share a process-tree relationship which makes the sharing of pages between them easy, but they may well be using a substantial portion of their memory to hold identical contents. If that host could find a way to force the sharing of pages with identical contents, it should be able to make much better use of its memory and, as a result, run more guests.
This is the kind of thing which gets the attention of virtualization developers. So the hackers at Qumranet, now part of Red Hat (Izik Eidus, Andrea Arcangeli, and Chris Wright in particular), have put together a mechanism to make that kind of sharing happen. The resulting code, called KSM, was recently posted for wider review.
KSM takes the form of a device driver for a single, virtual device: /dev/ksm. A process which wants to take part in the page sharing regime can open that device and register (with an ioctl() call) a portion of its address space with the KSM driver. Once the page sharing mechanism is turned on (via another ioctl()), the kernel will start looking for pages to share.
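As a rough illustration of that sequence, a user-space sketch might look like the one below. The ioctl name, request number, and structure layout are placeholders of my own invention for the sake of a self-contained example; the posted patch defines its own interface.

/* Hypothetical sketch of registering a region with /dev/ksm; the ioctl
 * name, number, and structure layout are illustrative, not the patch's ABI. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

struct ksm_memory_region {
    unsigned int npages;            /* length of the region, in pages */
    unsigned int pad;
    unsigned long long addr;        /* start of the region */
};

#define KSM_REGISTER_MEMORY_REGION _IOW(0xab, 0x20, struct ksm_memory_region)

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    size_t len = 1024 * page_size;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    int ksm_fd = open("/dev/ksm", O_RDWR);
    if (ksm_fd < 0) {
        perror("open /dev/ksm");
        return 1;
    }

    /* Ask KSM to consider these 1024 pages for sharing. */
    struct ksm_memory_region region = {
        .npages = 1024,
        .addr = (unsigned long long)(uintptr_t)buf,
    };
    if (ioctl(ksm_fd, KSM_REGISTER_MEMORY_REGION, &region) < 0)
        perror("register region");

    /* ... do work; scanning is started separately (see below) ... */
    close(ksm_fd);
    return 0;
}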
The algorithm is relatively simple. The KSM driver, inside a kernel thread, picks one of the memory regions registered with it and starts scanning over it. For each page which is resident in memory, KSM will generate an SHA1 hash of the page's contents. That hash will then be used to look up other pages with the same hash value. If a subsequent memcmp() call shows that the contents of the pages are truly identical, all processes with a reference to the scanned page will be pointed (in COW mode) to the other one, and the redundant page will be returned to the system. As long as nobody modifies the page, the sharing can continue; once a write operation happens, the page will be copied and the sharing will end.
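The same nominate-then-confirm pattern can be shown entirely in user space. The sketch below is my own illustration rather than code from the patch: it hashes each "page", uses the hash only to find candidate matches, and lets memcmp() make the final decision. A trivial FNV-1a hash stands in for SHA1 to keep the example self-contained.

/* Illustration of hash-as-candidate, memcmp()-as-proof page matching.
 * FNV-1a stands in for SHA1; only the overall pattern matters. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    8

static uint64_t hash_page(const unsigned char *page)
{
    uint64_t h = 0xcbf29ce484222325ULL;        /* FNV-1a offset basis */
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        h ^= page[i];
        h *= 0x100000001b3ULL;                 /* FNV-1a prime */
    }
    return h;
}

int main(void)
{
    static unsigned char pages[NPAGES][PAGE_SIZE];   /* zero-filled by default */
    uint64_t hashes[NPAGES];

    /* Give two pages identical, non-zero contents. */
    memset(pages[1], 0xAA, PAGE_SIZE);
    memset(pages[4], 0xAA, PAGE_SIZE);

    for (int i = 0; i < NPAGES; i++)
        hashes[i] = hash_page(pages[i]);

    /* A hash match only nominates a candidate; memcmp() confirms it. */
    for (int i = 0; i < NPAGES; i++)
        for (int j = i + 1; j < NPAGES; j++)
            if (hashes[i] == hashes[j] &&
                memcmp(pages[i], pages[j], PAGE_SIZE) == 0)
                printf("pages %d and %d could be merged\n", i, j);

    return 0;
}

Note that the untouched pages, which are all zero-filled, also report as mergeable; that previews the observation about Windows guests below.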
The kernel thread will scan up to a maximum number of pages before going to sleep for a while. Both the number of pages to scan and the sleep period are passed in as parameters to the ioctl() call which starts scanning. A user-space control process can also pause scanning via another ioctl() call.
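Continuing the hypothetical sketch above, starting the scanner would be a second ioctl() on the same descriptor; the structure fields and ioctl name here are again my own stand-ins that mirror the parameters described in the text, not the patch's actual ABI.

/* Hypothetical: start the KSM kernel thread with its scan parameters. */
struct ksm_scan_params {
    unsigned int pages_to_scan;   /* pages examined per wakeup */
    unsigned int sleep_msec;      /* pause between scan batches */
    unsigned int running;         /* 1 = scan, 0 = pause */
};

#define KSM_START_STOP_KTHREAD _IOW(0xab, 0x21, struct ksm_scan_params)

static int ksm_start_scanning(int ksm_fd)
{
    struct ksm_scan_params params = {
        .pages_to_scan = 256,
        .sleep_msec    = 5000,
        .running       = 1,
    };
    return ioctl(ksm_fd, KSM_START_STOP_KTHREAD, &params);
}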
The initial response to the patch from Andrew Morton was not entirely enthusiastic:
The answer from Avi Kivity was reasonably clear:
Izik Eidus adds that, with this patch, a host running a bunch of Windows guests is able to overcommit its memory by 300% without any terribly ill effects. This technique, it seems, is especially effective with Windows guests: Windows apparently zeroes all freed memory, so each guest's list of free pages can be coalesced down to a single, shared page full of zeroes.
What has not been done (or, at least, not posted) is any sort of benchmarking of the impact KSM has on a running system. The scanning, hashing, and comparing of pages will require some CPU time, and it is likely to have noticeable cache effects as well. If you are trying to run dozens of Windows guests, cache effects may well be relatively low on your list of problems. But that cost may be sufficient to prevent the more general use of KSM, even though systems which are not using virtualization at all may still have a lot of pages with identical contents.
Zeroing freed memory
Posted Nov 12, 2008 23:56 UTC (Wed) by Felix_the_Mac (guest, #32242) [Link]
On the face of it that sounds pretty sensible (from a security perspective).
Why doesn't Linux do it?
Zeroing freed memory
Posted Nov 13, 2008 0:48 UTC (Thu) by nix (subscriber, #2304) [Link]
Zeroing freed memory
Posted Nov 13, 2008 22:45 UTC (Thu) by bdauvergne (subscriber, #6989) [Link]
Zeroing freed memory
Posted Nov 13, 2008 9:19 UTC (Thu) by dlang (guest, #313) [Link]
Since it may sometimes not need to be zeroed (besides the kernel uses noted in the post above, if the page is going to be used to hold executable code, the appropriate code can just be loaded into the page; there's no benefit to zeroing it out first), and other times it can be zeroed when the system is idle, Linux does the more efficient thing and zeros the page with as little impact to the rest of the system as possible.
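For ordinary anonymous memory that laziness is easy to observe from user space: an untouched anonymous mapping reads back as zeroes because the kernel maps it, copy-on-write, to a single shared zero page, and a private zero-filled page is only allocated when a page is first written. A small demonstration, using nothing beyond standard mmap():

/* Demonstration of zero-fill-on-demand for anonymous memory. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;    /* 1 MB of anonymous memory */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Reads of untouched pages return zeroes; no per-page zeroing has
     * been done on behalf of this process yet. */
    printf("first byte: %d, last byte: %d\n", p[0], p[len - 1]);

    /* Only now, on the first write, does the kernel give this process
     * its own zero-filled copy of the one page being modified. */
    p[0] = 42;
    printf("after write: %d\n", p[0]);

    munmap(p, len);
    return 0;
}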
Linux can do it
Posted Nov 13, 2008 15:07 UTC (Thu) by wtogami (subscriber, #32325) [Link]
Linux can do it
Posted Nov 13, 2008 19:52 UTC (Thu) by nix (subscriber, #2304) [Link]
Zeroing freed memory
Posted Nov 14, 2008 15:10 UTC (Fri) by PaXTeam (guest, #24616) [Link]
PaX has had such a feature for some time, but its performance impact isn't negligible. I only have numbers for an early, naive implementation (pages were effectively zeroed twice): the kernel time of a kernel compilation went up by some 40%, IIRC, so even with the current implementation it's probably not better than 20%. Now, this is kernel time only; if your workload is mostly userland you will care a lot less. Otherwise you'll have to find out where on the user/kernel scale you fall and decide accordingly whether it's worth it.
Blank pages
Posted Nov 13, 2008 10:56 UTC (Thu) by rvfh (guest, #31018) [Link]
Could the system not just release zeroed pages altogether until someone writes something to them, rather than sharing them amongst KVM instances? That seems easier and less costly than the SHA1 approach, and (in Windows' case) it seems it would yield some memory economies.
Especially if it's freed memory...
Or did I miss something?
Blank pages
Posted Nov 13, 2008 14:20 UTC (Thu) by avik (guest, #704) [Link]
Blank pages
Posted Nov 14, 2008 16:34 UTC (Fri) by jzbiciak (guest, #5246) [Link]
I'm pretty sure Linux has hypervisor hooks to allow the hypervisor to tell a hosted OS to give back pages, and for a hosted OS to tell the hypervisor when it may have pages back. I don't know what Windows offers here.
I get the sense though that such changes in page ownership are coarse grain, not fine grain, since there's cost in passing the ownership back and forth.
(Note that I'm not actually super familiar with the paravirt_ops interface. I just remember reading about it here.)
Why not just use the SHA1 only?
Posted Nov 15, 2008 0:06 UTC (Sat) by nevets (subscriber, #11875) [Link]
"KSM will generate an SHA1 hash of the page's contents. That hash will then be used to look up other pages with the same hash value. If a subsequent memcmp() call shows that the contents of the pages are truly identical..."

I've been informed by the git Gods that the chances of a SHA1 collision are the same as the Earth being sucked up into the black hole created by the CERN accelerator. If this is indeed true, then there's no need for that extra memcmp().
I brought this up with the git God himself regarding git's use of SHA1. He did agree with me that if there were to be a collision of SHA1s in git, the database would be corrupted. But he blew it off as a snowball's chance in hell. He did show some concern that there might be a way to crack the algorithm.
But who am I to question a git God?
Why not just use the SHA1 only?
Posted Nov 15, 2008 8:25 UTC (Sat) by dlang (guest, #313) [Link]
there are databases that hold the SHA1 of various files, and there are a lot of known conflicts in them
Why not just use the SHA1 only?
Posted Nov 15, 2008 14:46 UTC (Sat) by jbh (subscriber, #494) [Link]
Why not just use the SHA1 only?
Posted Nov 15, 2008 14:56 UTC (Sat) by jbh (subscriber, #494) [Link]
Why not just use the SHA1 only?
Posted Nov 15, 2008 19:45 UTC (Sat) by scarabaeus (guest, #7142) [Link]
Even on systems with huge numbers of pages, the likelihood of hash collisions will be far too low to affect performance - also because memcmp() will abort early if it detects differences between the compared pages.
This way, you save memory for the checksums (which also improves hash lookup performance), and the checking will be faster.
Why not just use the SHA1 only?
Posted Nov 18, 2008 14:41 UTC (Tue) by nevets (subscriber, #11875) [Link]
Actually, my comment was a little facetious. My point was not a way to fix this algorithm, but a comment against what git is doing. The git repo really relies on there being absolutely no conflicts. If one happens, then the database is corrupted. I keep hearing that the chances of this happening are astronomically low, but the fact that it can happen at all bothers me.
Why not just use the SHA1 only?
Posted Nov 18, 2008 16:07 UTC (Tue) by nix (subscriber, #2304) [Link]
In case it's interesting to someone, I once calculated (and wrote down) the math for the following scenario:
- 10 billion humans are programming
- They *each* produce 5000 git objects every day
- They all push to the same huge repository
- They keep this up for 50 years

With those highly exaggerated assumptions, the probability of getting a hash collision in that huge git object database is 6e-13. Provided I got the math right.

So, mathematically speaking you have to say "yes, it *is* possible". But math aside, it's perfectly correct to say "no, it won't happen, ever". (Speaking about the *accidental* case.)
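A back-of-the-envelope check of that figure (my own arithmetic, not taken from the comment): the scenario produces roughly n ≈ 10^10 × 5000 × 365 × 50 ≈ 9×10^17 objects, and the birthday approximation for a 160-bit hash gives p ≈ n²/2^160 ≈ (9×10^17)² / 1.5×10^48 ≈ 6×10^-13, consistent with the number quoted above.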
Why not just use the SHA1 only?
Posted Nov 18, 2008 18:37 UTC (Tue) by dlang (guest, #313) [Link]
So git could get corrupted, but it would not be corrupted by overwriting old data and losing it; it would be corrupted by not saving new data (much easier to detect, as that is the data that people would be trying to use).

There is an option in git (I don't remember if it's compile-time or not) to do additional checking when saving data: if new data has the same hash as existing data, check that the contents really are the same, and give an error if they're not.
Why not just use the SHA1 only?
Posted Nov 16, 2008 13:39 UTC (Sun) by nix (subscriber, #2304) [Link]
Why not just use the SHA1 only?
Posted Nov 18, 2008 4:30 UTC (Tue) by nevets (subscriber, #11875) [Link]
From your friends at Wikipedia, there is an article on SHA-1: "SHA-1 (as well as SHA-0) produces a 160-bit digest from a message with a maximum length of (2^64 − 1) bits."
This looks like it can be much bigger than 128 bytes.
Why not just use the SHA1 only?
Posted Nov 18, 2008 16:01 UTC (Tue) by nix (subscriber, #2304) [Link]
Why not just use the SHA1 only?
Posted May 23, 2010 9:47 UTC (Sun) by rafal.maj (guest, #66508) [Link]
Why not just use the SHA1 only?
Posted May 23, 2010 10:57 UTC (Sun) by johill (subscriber, #25196) [Link]
Well, but you clearly didn't read the code:
+#define PAGECMP_OFFSET 128
+#define PAGEHASH_SIZE (PAGECMP_OFFSET ? PAGECMP_OFFSET : PAGE_SIZE)
+/* hash the page */
+static void page_hash(struct page *page, unsigned char *digest)
+{
+ struct scatterlist sg;
+ struct hash_desc desc;
+
+ sg_init_table(&sg, 1);
+ sg_set_page(&sg, page, PAGEHASH_SIZE, 0);
+ desc.tfm = tfm;
+ desc.flags = 0;
+ crypto_hash_digest(&desc, &sg, PAGEHASH_SIZE, digest);
+}
and it does "for sure" make sense since it's just a way to speed up matching, and just using 128 bytes means much less loading from memory. Tuning the 128 might make sense, potentially.
ksm and mergemem
Posted Nov 25, 2008 21:46 UTC (Tue) by anton (subscriber, #25547) [Link]
KSM is essentially the same idea as mergemem, which was developed about 10 years ago. This is not just useful for virtual machines, but also for ordinary multi-user machines (as explained by the LWN-reading example), and that's what mergemem was developed for. Unfortunately for mergemem, RAM was cheap, and few people in the Linux community seemed to be interested in multi-user systems. Maybe with the current interest in virtual machines KSM will fare better.
Hash?
Posted May 23, 2010 14:42 UTC (Sun) by dmarti (subscriber, #11625) [Link]
Didn't they have to avoid hashing pages because of a software patent problem? (Following these software patent stories is like reading comic books and trying to remember who's really dead.)
Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds