Adding a huge zero page

By Jonathan Corbet
September 26, 2012

The transparent huge pages feature allows applications to take advantage of the larger page sizes supported by most contemporary processors without the need for explicit configuration by administrators, developers, or users. It is mostly a performance-enhancing feature: huge pages reduce the pressure on the system's translation lookaside buffer (TLB), making memory accesses faster. It can also save a bit of memory, though, as the result of the elimination of a layer of page tables. But, as it turns out, transparent huge pages can actually increase the memory usage of an application significantly under certain conditions. The good news is that a solution is at hand; it is as easy as a page full of zeroes.

Transparent huge pages are mainly used for anonymous pages — pages that are not backed by a specific file on disk. These are the pages forming the data areas of processes. When an anonymous memory area is created or extended, no actual pages of memory are allocated (whether transparent huge pages are enabled or not). That is because a typical program will never touch many of the pages that are part of its address space; allocating pages before there is a demonstrated need would waste a considerable amount of time and memory. So the kernel will wait until the process tries to access a specific page, generating a page fault, before allocating memory for that page.

But, even then, there is an optimization that can be made. New anonymous pages must be filled with zeroes; to do anything else would be to risk exposing whatever data was left in the page by its previous user. Programs often depend on the initialization of their memory; since they know that memory starts zero-filled, there is no need to initialize that memory themselves. As it turns out, a lot of those pages may never be written to; they stay zero-filled for the life of the process that owns them. Once that is understood, it does not take long to see that there is an opportunity to save a lot of memory by sharing those zero-filled pages. One zero-filled page looks a lot like another, so there is little value in making too many of them.

So, if a process instantiates a new (non-huge) page by trying to read from it, the kernel still will not allocate a new memory page. Instead, it maps a special page, called simply the "zero page," into the process's address space instead. Thus, all unwritten anonymous pages, across all processes in the system, are, in fact, sharing one special page. Needless to say, the zero page is always mapped read-only; it would not do to have some process changing the value of zero for everybody else. Whenever a process attempts to write to the zero page, it will generate a write-protection fault; the kernel will then (finally) get around to allocating a real page of memory and substitute it into the process's address space at the right spot.

This behavior is easy to observe. As Kirill Shutemov described, a process executing a bit of code like this:

    posix_memalign((void **)&p, 2 * MB, 200 * MB);
    for (i = 0; i < 200 * MB; i+= 4096)
        assert(p[i] == 0);
    pause();

will have a surprisingly small resident set at the time of the pause() call. It has just worked through 200MB of memory, but that memory is all represented by a single zero page. The system works as intended.

Or, it does until the transparent huge pages feature is enabled; then that process will show the full 200MB of allocated memory. A growth of memory usage by two orders of magnitude is not the sort of result users are typically looking for when they enable a performance-enhancing feature. So, Kirill says, some sites are finding themselves forced to disable transparent huge pages in self defense.

The problem is simple enough: there is no huge zero page. The transparent huge pages feature tries to use huge pages whenever possible; when a process faults in a new page, the kernel will try to put a huge page there. Since there is no huge zero page, the kernel will simply allocate a real zero page instead. This behavior leads to correct execution, but it also causes the allocation of a lot of memory that would otherwise not have been needed. Transparent huge page support, in other words, has turned off another important optimization that has been part of the kernel's memory management subsystem for many years.

Once the problem is understood, the solution isn't that hard. Kirill's patch adds a special, zero-filled huge page to function as the huge zero page. Only one such page is needed, since the transparent huge pages feature only uses one size of huge page. With this page in place and used for read faults, the expansion of memory use simply goes away.

As always, there are complications: the page is large enough that it would be nice to avoid allocating it if transparent huge pages are not in use. So there's a lazy allocation scheme; Kirill also added a reference count so that the huge zero page can be returned if there is no longer a need for it. That reference counting slows a read-faulting benchmark by 1%, so it's not clear that it is worthwhile; in the end, the developers might conclude that it's better to just keep the zero huge page around once it has been allocated and not pay the reference counting cost. This is, after all, a situation that has come about before with the (small) zero page.

There have not been a lot of comments on this patch; the implementation is relatively straightforward and, presumably, does not need a lot in the way of changes. Given the obvious and measurable benefits from the addition of a huge zero page, it should be added to the kernel sometime in the fairly near future; the 3.8 development cycle seems like a reasonable target.

Index entries for this article
Kernel	Huge pages
Kernel	Memory management/Huge pages
Kernel	Zero page

Adding a huge zero page

Posted Sep 27, 2012 4:37 UTC (Thu) by ikm (guest, #493) [Link] (3 responses)

How huge is that page, anyway?

Adding a huge zero page

Posted Sep 27, 2012 6:08 UTC (Thu) by mgedmin (subscriber, #34497) [Link] (1 responses)

4 megabytes, IIRC.

Adding a huge zero page

Posted Sep 28, 2012 8:14 UTC (Fri) by rvfh (guest, #31018) [Link]

Ouch! You mean the whole 0.1% of my RAM? That's a lot, especially compared to Firefox using only 704 MB, Thunderbird 337 MB (after restarting it, otherwise it was at 700+) and Chromium 1GB+... ;-)

Adding a huge zero page

Posted Sep 27, 2012 8:57 UTC (Thu) by kiryl (subscriber, #41516) [Link]

It depends on architecture obviously. On x86_64 it's 2M.

Adding a huge zero page

Posted Sep 27, 2012 6:44 UTC (Thu) by mti (subscriber, #5390) [Link] (5 responses)

Why not just keep the small zero page mapping in this case and switch to a huge (no-zero) page on the first write?

Adding a huge zero page

Posted Sep 27, 2012 7:00 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

Because the whole point is to save a layer of page translations. You can't, when you do get a page fault, go back a layer and check whether all of the other pages referenced by that layer all point to the (small) zero page.
That would kill more performance than you gain by huge pages in the first plale.

Adding a huge zero page

Posted Sep 27, 2012 7:32 UTC (Thu) by mti (subscriber, #5390) [Link] (1 responses)

It would be easy mark this small zero page as beeing part of what really should have been a huge zero page. (One way would be to have two small zero pages, the old one and a new special one that is only used as a substitute for the huge zero page.)

My thinking is that this unintialized memory is not really read much until the first write so it is not performance critical. That assumption may of course be wrong. Btw, what programs are reading unintialized memory and why?

But on the other hand, if there is a lot of reading of this zeroed memory, wouldn't a single small page fit better in the cache, thus improving preformance?

Adding a huge zero page

Posted Sep 27, 2012 8:07 UTC (Thu) by justincormack (subscriber, #70439) [Link]

Given everyone knows it is zero, probably not that much reading, other than for programs using it as a sparse data structure I guess.

Adding a huge zero page

Posted Sep 27, 2012 9:07 UTC (Thu) by kiryl (subscriber, #41516) [Link] (1 responses)

H. Peter Anvin asked to evaluate other implementation: virtual huge zero page. The idea to have pmd table where all entires point to normal zero page. This way should be more effective from cache usage point of view, but it will increase pressure on TLB. I'm going to implement it and compare with current implementation.

Adding a huge zero page

Posted Oct 12, 2012 11:51 UTC (Fri) by etienne (guest, #25256) [Link]

Sorry, late comment.
Shortening a huge zero page is good from the cache point of view, but maybe bad from DMA point of view, a:
dd if=/dev/zero of=/dev/sda18
may be forced to use 4 Kbytes pages because the huge zero page is not contiguous in physical memory...
Just my £0.02

Adding a huge zero page

Posted Sep 27, 2012 9:33 UTC (Thu) by alankila (guest, #47141) [Link] (8 responses)

Transparent hugepages are nice in theory, but probably not as reliable as mounting hugetlbfs and using that instead. It seems that whether hugepages can be used depends to a degree on how much unused memory you have at the point of doing the memory allocation. If you are unlucky, that allocation doesn't use hugepages and it will not get converted into hugepages later on, either.

I observed this issue on my virtual machines test server which runs 6 virtual machines of various sizes with a total of 8 of the 16 GB of system memory. After bootup, almost all of the kvm memory was hugepage'd, but the next day only some 10 % of the memory still was. The problem was that the machines were shutdown during night for backup, and then brought back up. My guess is that the backup process filled memory with pages, some which were dirty, and this defeated the hugepages optimization.

Adding a huge zero page

Posted Sep 27, 2012 13:51 UTC (Thu) by ejr (subscriber, #51652) [Link] (7 responses)

Ah, but transparent huge pages are portable. They don't change your code. In HPC land, your task pretty much own the node(s) on which you're running, so there is little danger of the fragmentation you encountered. THP often is a big performance win with *no* code change. Fixing the zero page issue will fix the remaining decently-sized gotcha for HPC-style uses.

And some of these HPC codes are old and/or expect to run on more than Linux. Conditionally changing all the user-specified and compiler-generated memory allocations is a painful task.

Adding a huge zero page

Posted Sep 28, 2012 9:11 UTC (Fri) by alankila (guest, #47141) [Link] (6 responses)

Right. Well, I'm just saying that there are cases where it doesn't work, so it's a bit akin a voodoo feature you enable and then convince yourself that there is a speed benefit. It is only when you read AnonHugePages line from /proc/meminfo and see, for instance, that only 64 MB is actually in hugepages that you realize it isn't all it's cracked up to be. But hey, it's better than nothing, right?

I was wondering if there shouldn't be a memory defragmenting task that goes through the running process' heap periodically and would move the 4k pages around until coalescing them to a hugepage becomes possible. I mean, if using these pages really gives you around 5 % performance benefit, it would seem reasonable to spend up to few % of CPU to do it for tasks that seem long-lived enough.

Adding a huge zero page

Posted Sep 28, 2012 11:13 UTC (Fri) by nix (subscriber, #2304) [Link] (3 responses)

Working quite well here:

AnonHugePages: 788480 kB
AnonHugePages: 2553856 kB

The latter machine is running a single virtual machine, but the former is running no VMs of any kind and has still turned a gigabyte into transpages (probably largely inside monsters like Chromium). That's not insignificant. (For that matter, I routinely see compile jobs getting hugepaged up, and a TLB saving in a pointer-mad monster like GCC really does speed it up. Sure, it's only a few percent, but that's better than nothing, right?)

Adding a huge zero page

Posted Sep 28, 2012 23:19 UTC (Fri) by alankila (guest, #47141) [Link] (2 responses)

Sure. I'm not saying it never works, I just wish it worked for my use case. Anyway, explicit hugepages are not too huge a pain for now, you just have to calculate/measure how many you need and then hack some apparmor rules for kvm to allow the hugepages mount region to be accessible for writing.

That being said, out of 1.5 GB of other services on the server:

AnonHugePages: 71680 kB

*sigh*

Adding a huge zero page

Posted Sep 28, 2012 23:37 UTC (Fri) by khc (guest, #45209) [Link] (1 responses)

are we doing some kind of poll? :-)

MemTotal: 16327088 kB
AnonHugePages: 3102720 kB

Of course, this box has a fairly specialized daemon that allocates 8GB of memory as 2 separate pools, so it's not surprising that auto huge pages work well (although I've never measured the performance impact of that).

Adding a huge zero page

Posted Sep 29, 2012 10:28 UTC (Sat) by nix (subscriber, #2304) [Link]

Yeah, exactly. If you run things with big heaps composed of lots of little pieces (so malloc uses arena allocation and allocates >>2Mb), you'll probably do well with transparent hugepages. If instead you have lots of little programs with small heaps, you won't see any benefit: if you have programs that make lots of medium-big allocations between 512Kb and 2Mb, you'll probably see glibc malloc falling back to mmap() of regions a bit too small to be converted into a transparent hugepage.

Adding a huge zero page

Posted Sep 28, 2012 14:09 UTC (Fri) by ejr (subscriber, #51652) [Link]

Look up Mel Gorman's memory compaction work. IIRC, there was a somewhat recent (months ago?) bit here on the painful interactions between memory compaction and VFAT on removable devices.

Adding a huge zero page

Posted Oct 5, 2012 16:10 UTC (Fri) by cpasqualini (guest, #69417) [Link]

Looking at this article and it's comments adn started to read about THP and found an interesting file in the Docs: http://lwn.net/Articles/423592/

Did you test any of these?

echo always >/sys/kernel/mm/transparent_hugepage/defrag
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo never >/sys/kernel/mm/transparent_hugepage/defrag