Adding a huge zero page
Transparent huge pages are mainly used for anonymous pages — pages that are not backed by a specific file on disk. These are the pages forming the data areas of processes. When an anonymous memory area is created or extended, no actual pages of memory are allocated (whether transparent huge pages are enabled or not). That is because a typical program will never touch many of the pages that are part of its address space; allocating pages before there is a demonstrated need would waste a considerable amount of time and memory. So the kernel will wait until the process tries to access a specific page, generating a page fault, before allocating memory for that page.
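As a rough illustration of that demand-paging behavior (a sketch added for this writeup, not code from the article or from Kirill's patch), the following user-space program maps a large anonymous region and shows that resident memory grows only once pages are actually written; the 256MB size and the use of getrusage() are arbitrary choices for the example.

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    #define SIZE (256UL << 20)    /* 256MB of anonymous memory */

    /* Peak resident set size in kilobytes (Linux reports ru_maxrss in KB). */
    static long max_rss_kb(void)
    {
        struct rusage ru;

        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_maxrss;
    }

    int main(void)
    {
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        printf("after mmap():  max RSS %ld KB\n", max_rss_kb());

        /* Only these writes cause real page frames to be allocated. */
        for (unsigned long i = 0; i < SIZE; i += 4096)
            p[i] = 1;

        printf("after writes:  max RSS %ld KB\n", max_rss_kb());
        return 0;
    }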
But, even then, there is an optimization that can be made. New anonymous pages must be filled with zeroes; to do anything else would be to risk exposing whatever data was left in the page by its previous user. Programs often depend on the initialization of their memory; since they know that memory starts zero-filled, there is no need to initialize that memory themselves. As it turns out, a lot of those pages may never be written to; they stay zero-filled for the life of the process that owns them. Once that is understood, it does not take long to see that there is an opportunity to save a lot of memory by sharing those zero-filled pages. One zero-filled page looks a lot like another, so there is little value in making too many of them.
So, if a process instantiates a new (non-huge) page by trying to read from it, the kernel still will not allocate a new memory page. Instead, it maps a special page, called simply the "zero page," into the process's address space. Thus, all unwritten anonymous pages, across all processes in the system, are, in fact, sharing one special page. Needless to say, the zero page is always mapped read-only; it would not do to have some process changing the value of zero for everybody else. Whenever a process attempts to write to the zero page, it will generate a write-protection fault; the kernel will then (finally) get around to allocating a real page of memory and substituting it into the process's address space at the right spot.
This behavior is easy to observe. As Kirill Shutemov described, a process executing a bit of code like this:
    posix_memalign((void **)&p, 2 * MB, 200 * MB);
    for (i = 0; i < 200 * MB; i += 4096)
        assert(p[i] == 0);
    pause();
will have a surprisingly small resident set at the time of the pause() call. It has just worked through 200MB of memory, but that memory is all represented by a single zero page. The system works as intended.
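The article does not say how that resident set was measured; one straightforward way (an illustrative suggestion, not part of Kirill's test) is to read VmRSS from /proc/<pid>/status while the program sits in pause(), or to have the program print it itself with a small helper like this:

    #include <stdio.h>
    #include <string.h>

    /* Print this process's VmRSS line from /proc/self/status;
       call it just before pause() to see the resident set size. */
    static void print_rss(void)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f)
            return;
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "VmRSS:", 6) == 0)
                fputs(line, stdout);
        fclose(f);
    }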
Or, it does until the transparent huge pages feature is enabled; then that process will show the full 200MB of allocated memory. A growth of memory usage by two orders of magnitude is not the sort of result users are typically looking for when they enable a performance-enhancing feature. So, Kirill says, some sites are finding themselves forced to disable transparent huge pages in self defense.
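For readers who want to check which policy a system is running, the setting is exposed in /sys/kernel/mm/transparent_hugepage/enabled on kernels with THP support; the active value is shown in brackets, and writing "never" there (as root) disables the feature. A minimal check, added here purely as an illustration, might look like:

    #include <stdio.h>

    int main(void)
    {
        char buf[128];
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

        /* Typical contents: "[always] madvise never"; the bracketed
           value is the policy currently in effect. */
        if (f && fgets(buf, sizeof(buf), f))
            fputs(buf, stdout);
        if (f)
            fclose(f);
        return 0;
    }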
The problem is simple enough: there is no huge zero page. The transparent huge pages feature tries to use huge pages whenever possible; when a process faults in a new page, the kernel will try to put a huge page there. Since there is no huge zero page, the kernel will simply allocate a real zero page instead. This behavior leads to correct execution, but it also causes the allocation of a lot of memory that would otherwise not have been needed. Transparent huge page support, in other words, has turned off another important optimization that has been part of the kernel's memory management subsystem for many years.
Once the problem is understood, the solution isn't that hard. Kirill's patch adds a special, zero-filled huge page to function as the huge zero page. Only one such page is needed, since the transparent huge pages feature only uses one size of huge page. With this page in place and used for read faults, the expansion of memory use simply goes away.
As always, there are complications: the page is large enough that it would be nice to avoid allocating it if transparent huge pages are not in use. So there is a lazy allocation scheme; Kirill also added a reference count so that the huge zero page can be returned if there is no longer a need for it. That reference counting slows a read-faulting benchmark by 1%, so it is not clear that it is worthwhile; in the end, the developers might conclude that it is better to just keep the huge zero page around once it has been allocated and avoid the reference-counting cost altogether. This is, after all, the approach that was eventually settled on for the (small) zero page.
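To make the lazy-allocation and reference-counting idea more concrete, here is a loose user-space analogy (a sketch written for this piece, not code from Kirill's patch): a single zero-filled buffer is created on first demand, handed out to all users, and freed again only when the last reference is dropped.

    #include <pthread.h>
    #include <stdlib.h>

    #define HUGE_SIZE (2UL << 20)    /* stand-in for one 2MB huge zero page */

    static void *huge_zero;                  /* the single shared buffer */
    static unsigned long huge_zero_refcount;
    static pthread_mutex_t huge_zero_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Get a reference to the shared zero buffer, allocating it lazily. */
    static void *get_huge_zero(void)
    {
        void *p;

        pthread_mutex_lock(&huge_zero_lock);
        if (!huge_zero)
            huge_zero = calloc(1, HUGE_SIZE);
        if (huge_zero)
            huge_zero_refcount++;
        p = huge_zero;
        pthread_mutex_unlock(&huge_zero_lock);
        return p;
    }

    /* Drop a reference; free the buffer once nobody is using it. */
    static void put_huge_zero(void)
    {
        pthread_mutex_lock(&huge_zero_lock);
        if (huge_zero_refcount > 0 && --huge_zero_refcount == 0) {
            free(huge_zero);
            huge_zero = NULL;
        }
        pthread_mutex_unlock(&huge_zero_lock);
    }

Keeping the page around permanently, as the developers may ultimately decide to do, would amount to never taking the free path at all, which is exactly what removes the reference-counting cost.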
There have not been a lot of comments on this patch; the implementation is relatively straightforward and, presumably, does not need a lot in the way of changes. Given the obvious and measurable benefits from the addition of a huge zero page, it should be added to the kernel sometime in the fairly near future; the 3.8 development cycle seems like a reasonable target.
Index entries for this article:
    Kernel: Huge pages
    Kernel: Memory management/Huge pages
    Kernel: Zero page
Posted Sep 27, 2012 7:32 UTC (Thu) by mti (subscriber, #5390)
My thinking is that this uninitialized memory is not really read much until the first write, so it is not performance-critical. That assumption may of course be wrong. By the way, what programs are reading uninitialized memory, and why?
But on the other hand, if there is a lot of reading of this zeroed memory, wouldn't a single small page fit better in the cache, thus improving performance?
Posted Sep 27, 2012 9:33 UTC (Thu) by alankila (guest, #47141)
I observed this issue on my virtual-machine test server, which runs 6 virtual machines of various sizes using a total of 8 of the 16 GB of system memory. After bootup, almost all of the kvm memory was hugepage'd, but the next day only some 10% of the memory still was. The problem was that the machines were shut down during the night for backup and then brought back up. My guess is that the backup process filled memory with pages, some of which were dirty, and this defeated the hugepages optimization.
Posted Sep 27, 2012 13:51 UTC (Thu) by ejr (subscriber, #51652)
And some of these HPC codes are old and/or expect to run on more than Linux. Conditionally changing all the user-specified and compiler-generated memory allocations is a painful task.
Posted Sep 28, 2012 9:11 UTC (Fri) by alankila (guest, #47141)
I was wondering if there shouldn't be a memory-defragmenting task that goes through the running processes' heaps periodically and moves the 4k pages around until coalescing them into a hugepage becomes possible. I mean, if using these pages really gives around a 5% performance benefit, it would seem reasonable to spend up to a few percent of CPU to do it for tasks that seem long-lived enough.
Posted Sep 28, 2012 11:13 UTC (Fri) by nix (subscriber, #2304)
AnonHugePages: 788480 kB
The latter machine is running a single virtual machine, but the former is running no VMs of any kind and has still turned a gigabyte into transpages (probably largely inside monsters like Chromium). That's not insignificant. (For that matter, I routinely see compile jobs getting hugepaged up, and a TLB saving in a pointer-mad monster like GCC really does speed it up. Sure, it's only a few percent, but that's better than nothing, right?)
Posted Sep 28, 2012 23:19 UTC (Fri) by alankila (guest, #47141)
That being said, out of 1.5 GB of other services on the server:
AnonHugePages: 71680 kB
*sigh*
Posted Sep 28, 2012 23:37 UTC (Fri) by khc (guest, #45209)
MemTotal: 16327088 kB
Of course, this box has a fairly specialized daemon that allocates 8GB of memory as 2 separate pools, so it's not surprising that auto huge pages work well (although I've never measured the performance impact of that).
Posted Oct 5, 2012 16:10 UTC (Fri) by cpasqualini (guest, #69417)
Did you test any of these?
    echo always >/sys/kernel/mm/transparent_hugepage/defrag
    echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
    echo never >/sys/kernel/mm/transparent_hugepage/defrag
That would kill more performance than you gain by huge pages in the first place.
Posted Oct 12, 2012 11:51 UTC (Fri) by etienne (guest, #25256)
Shortening a huge zero page is good from the cache point of view, but maybe bad from the DMA point of view: a
    dd if=/dev/zero of=/dev/sda18
may be forced to use 4-Kbyte pages because the huge zero page is not contiguous in physical memory...
Just my £0.02