
Some ado about zero

By Jonathan Corbet
July 7, 2009
Computers use a lot of zeroes. Early in your editor's programming career, he worked on a machine that provided a special hardware register containing zero; programmers on this system knew they could use all the zeroes they needed with no fear of running out. Meanwhile, in this century, the Linux kernel sets aside a page full of zeroes. It's called empty_zero_page on the x86 architecture, and it's even exported to modules. Interestingly, this special page is not used as heavily as it was prior to the 2.6.24 kernel, but that may be about to change.

In the good old days, the kernel would use the zero page in situations where it knew it needed a page full of zeroes. So, for example, if a process incurred a read fault on a page it had never used, the kernel would simply map the zero page into that address. A copy-on-write mapping would be used, of course; if the process subsequently modified the page, it would end up with its own copy. But deferring the creation of a new, zero-filled page helped to conserve zeroes, keeping the kernel from running out. Incidentally, it also saved memory, reduced cache pressure, and eliminated the need to clear the new page.
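In schematic form, the old fault-handling logic looked something like the fragment below. This is a simplified sketch rather than the actual mm/memory.c code; fault_is_write() and alloc_zeroed_user_page() are illustrative stand-ins for the real helpers.

    /* Sketch of pre-2.6.24 anonymous fault handling (not the real code). */
    if (!fault_is_write(flags)) {
            /* Read fault: map the shared zero page, write-protected so
               that a later store triggers a copy-on-write fault. */
            page = ZERO_PAGE(address);
            entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
    } else {
            /* Write fault: allocate and map a private, zeroed page. */
            page = alloc_zeroed_user_page(vma, address);
            entry = pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
    }
    set_pte_at(mm, address, page_table, entry);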

Memory management changes made back in 2007 had the effect of adding reference counting to the zero page. And that turned out to be a problem on multiprocessor machines. Since all processors shared the same zero page (per-CPU differences being unlikely), they also all manipulated the same reference count. That led to serious problems with cache line bouncing, with a measurable performance impact. In response, Nick Piggin evaluated a number of possible fixes, including special hacks to avoid reference-counting the zero page or adding per-CPU zero pages. The patch that got merged, though, simply eliminated most use of the zero page altogether. The change was justified this way:
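The cost of that shared count is easy to demonstrate outside the kernel. The user-space sketch below (illustrative, not kernel code) has several threads hammer a single shared atomic counter, which is roughly what refcounting one zero page from every CPU amounted to; timing it against the padded per-thread counters with time(1) shows the penalty of the bouncing cache line.

    /* Cache-line bouncing in miniature: one shared atomic counter versus
       per-thread counters padded onto separate cache lines.
       Build with: cc -O2 -pthread bounce.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdatomic.h>

    #define NTHREADS 4
    #define ITERS    100000000UL

    static atomic_ulong shared_count;  /* one cache line shared by all CPUs */
    static struct { atomic_ulong n; char pad[56]; } private[NTHREADS];

    static void *hit_shared(void *arg)
    {
            for (unsigned long i = 0; i < ITERS; i++)
                    atomic_fetch_add(&shared_count, 1);  /* line ping-pongs between CPUs */
            return NULL;
    }

    static void *hit_private(void *arg)
    {
            atomic_ulong *mine = &private[(long)arg].n;
            for (unsigned long i = 0; i < ITERS; i++)
                    atomic_fetch_add(mine, 1);           /* line stays in one CPU's cache */
            return NULL;
    }

    static void run(void *(*fn)(void *), const char *label)
    {
            pthread_t threads[NTHREADS];
            for (long i = 0; i < NTHREADS; i++)
                    pthread_create(&threads[i], NULL, fn, (void *)i);
            for (long i = 0; i < NTHREADS; i++)
                    pthread_join(threads[i], NULL);
            printf("%s done\n", label);
    }

    int main(void)
    {
            run(hit_shared, "shared counter");    /* noticeably slower on SMP */
            run(hit_private, "private counters");
            return 0;
    }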

Inserting a ZERO_PAGE for anonymous read faults appears to be a false optimisation: if an application is performance critical, it would not be doing many read faults of new memory, or at least it could be expected to write to that memory soon afterwards. If cache or memory use is critical, it should not be working with a significant number of ZERO_PAGEs anyway (a more compact representation of zeroes should be used).

There was some nervousness about the patch at the time; Linus grumbled about the changes which created the problem in the first place, and worried:

The kernel has *always* (since pretty much day 1) done that ZERO_PAGE thing. This means that I would not be at all surprised if some application basically depends on it. I've written test-programs that depends on it - maybe people have written other code that basically has been written for and tested with a kernel that has basically always made read-only zero pages extra cheap.

Despite his misgivings, Linus merged the patch for 2.6.24 to see what sort of problems might come to the surface. For the next 18 months, it appeared that such problems were scarce indeed; most people forgot about the zero page altogether. In early June, though, Julian Phillips reported a problem he had observed:

I have a program which creates a reasonably large private anonymous map. The program then writes into a few places in the map, but ends up reading from all of them.

When I run this program on a system running 2.6.20.7 the process only ever seems to use enough memory to hold the data that has actually been written (well - in units of PAGE_SIZE). When I run the program on a system running 2.6.24.5 then as it reads the map the amount of memory used continues to increase until the complete map has actually been allocated (and since the total size is greater than the physically available RAM causes swapping). Basically I seem to be seeing copy-on-read instead of copy-on-write type behaviour.

What Julian was seeing, of course, was the effects from the removal of the zero page. On older kernels, all of the unwritten pages in the data structure would be mapped to the zero page, using no additional physical memory at all. As of 2.6.24, each of those pages gets an actual physical page - containing nothing but zeroes - assigned to it, increasing memory use significantly.
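Julian's observation can be reproduced with a few lines of C. The sketch below (which assumes 4096-byte pages and a mounted /proc) writes to a handful of pages in a large private anonymous mapping, reads every page, and prints VmRSS at each stage; on a pre-2.6.24 kernel the resident size stays near the handful of written pages, while on later kernels it grows to cover the whole mapping.

    /* Write a few pages of a large anonymous map, read all of it, and
       watch resident memory.  Assumes 4096-byte pages and /proc. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define MAP_BYTES (512UL * 1024 * 1024)  /* 512MB private anonymous map */
    #define PAGE      4096UL

    static void print_rss(const char *when)
    {
            char line[128];
            FILE *f = fopen("/proc/self/status", "r");

            while (f && fgets(line, sizeof(line), f))
                    if (!strncmp(line, "VmRSS:", 6))
                            printf("%-12s %s", when, line);
            if (f)
                    fclose(f);
    }

    int main(void)
    {
            char *map = mmap(NULL, MAP_BYTES, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            unsigned long i, sum = 0;

            if (map == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Write into just a few places in the map... */
            for (i = 0; i < MAP_BYTES; i += MAP_BYTES / 8)
                    map[i] = 1;
            print_rss("after writes");

            /* ...then read from all of them. */
            for (i = 0; i < MAP_BYTES; i += PAGE)
                    sum += map[i];
            print_rss("after reads");

            return (int)(sum & 1);  /* keep the read loop from being optimized away */
    }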

Hiroyuki Kamezawa reports that he has seen zero-page-dependent workloads at other sites. Many of those sites, he says, are running enterprise Linux distributions which have not yet shipped kernels new enough to lack zero page support. He worries that these users will encounter the same sort of unpleasant surprise Julian found when they upgrade to newer kernels. In response, he has posted a patch which restores zero page support to the kernel.

Hiroyuki's zero page support isn't quite the same as what came before, though. It avoids reference counting for the zero page, a change which should eliminate the worst of the performance problems. It does, however, add some interesting special cases where virtual memory code has to be careful to test for zero pages; the bulk of those cases are handled with the addition of a get_user_pages_nonzero() function which removes any zero pages from the indicated range. Linus dislikes the special cases, thinking that they are unnecessary. Instead, he has proposed an alternative implementation using the relatively new PTE_SPECIAL flag to mark zero pages. As of this writing, an updated version of the patch using this approach has not yet been posted.
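For a flavor of what those special cases look like, consider the purely illustrative helper below; this is not Kamezawa's actual patch, and page_is_zero_page() is a made-up name. The point is that any code handing out pages it has looked up must notice when the "page" is really the shared zero page, which must never be pinned or written.

    /* Illustrative only; not the actual patch.  ZERO_PAGE() is the real
       kernel macro returning the shared zero page for an address. */
    static inline int page_is_zero_page(struct page *page, unsigned long addr)
    {
            return page == ZERO_PAGE(addr);
    }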

Nick Piggin, who wrote the patch removing zero page support in the first place, would rather not see it return. With regard to the affected users, he asks:

Can we just try to wean them off it? Using zero page for huge sparse matrices is probably not ideal anyway because it needs to still be faulted in and it occupies TLB space. They might see better performance by using a better algorithm.

Linus, however, would like to see this feature restored if it can be done in a clean way. So the return of zero page support seems fairly likely, assuming the patch can be worked into sufficiently good shape. Whether that will bring comfort to enterprise kernel users remains to be seen, though; the next generation of enterprise Linux releases looks set to use kernels around 2.6.27. Unless distributors backport the zero page patch, enterprise Linux users will still be stuck with the current, zero-wasting behavior.



Some ado about zero

Posted Jul 9, 2009 2:36 UTC (Thu) by qg6te2 (guest, #52587) [Link]

the next generation of enterprise Linux releases looks set to use kernels around 2.6.27.

Why 2.6.27? I really don't see the upcoming RHEL 6 using 2.6.27, given the amount of recent work that Red Hat has put into the kernel (e.g. FS-Cache, KVM, ...).

Some ado about zero

Posted Jul 12, 2009 9:20 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

Red Hat will continue to work on new features in the upstream kernel, since they are bound to backport features over the release cycle anyway (cf. KVM in the RHEL 5.4 beta), and the work that cannot or need not be backported will end up in the next release. 2.6.27 seems to be an arbitrary number, however.

Some ado about zero

Posted Jul 23, 2009 19:20 UTC (Thu) by SEMW (guest, #52697) [Link]

2.6.27, like 2.6.16 before it, will be a long-term supported kernel; distros probably find this useful in their own kernel support endeavours.

Problems caused by removing ZERO_PAGE

Posted Jul 9, 2009 3:07 UTC (Thu) by abatters (✭ supporter ✭, #6932) [Link]

> Despite his misgivings, Linus merged the patch for 2.6.24 to see what sort of problems might come to the surface. For the next 18 months, it appeared that such problems were scarce indeed; most people forgot about the zero page altogether.

The removal of ZERO_PAGE did cause some problems for me, but they were resolved by follow-up patches:

http://marc.info/?t=120939558800001&r=1&w=2
http://lwn.net/Articles/287339/
http://lwn.net/Articles/287342/

Some ado about zero

Posted Jul 9, 2009 3:45 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I'm curious as to why the "per-CPU zero pages" approach wasn't pursued. It should preserve most of the benefits of a single zero page, while significantly reducing the contention on the refcount.... or would it just give us the worst of both worlds?

Some ado about zero

Posted Jul 9, 2009 10:56 UTC (Thu) by johill (subscriber, #25196) [Link]

Well, imagine having 4096 zero pages. I guess you could get away with just a per-CPU refcounter instead. But then again, why do you need to refcount the zero page anyway? It's always there to start with, and never writable, isn't it?

Some ado about zero

Posted Jul 9, 2009 12:35 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

That's just 16MB. That's close to 'nothing' for a computer with 4096 CPUs.

Some ado about zero

Posted Jul 9, 2009 14:51 UTC (Thu) by ejr (subscriber, #51652) [Link]

There are single chips with dozens of CPUs from the OS's view. Soon these will be in the hundreds. And they'll end up being used in smaller devices relatively soon, simply because of economies of scale. The extra space can add up pretty quickly.

Some ado about zero

Posted Jul 9, 2009 7:40 UTC (Thu) by Frej (subscriber, #4165) [Link]

The interesting datapoint is that it takes 18 months for a stable kernel to get enough widespread use for this kind of problem to emerge.
Of course it's not a statistically valid sample (a single datapoint...) but still interesting :).

But what's the actual delay between release and distribution? Nobody outside the distros knows how long it takes before fixes (x.y.z+1 releases) reach users; in many cases even the maintainer is left in the dark. What's the average? I guess availability of the release in a stable distribution is a good enough measure, rather than actual penetration among users.

Maybe it's worth an article? (Although I guess it's hard to get data from Red Hat or Novell, they would know when actual customers upgrade.)

I guess I'm asking for others to do the work ;)

Some ado about zero

Posted Jul 9, 2009 9:59 UTC (Thu) by modernjazz (guest, #4185) [Link]

Jon, the introduction to this article really made my morning---thanks!

Some ado about zero

Posted Jul 9, 2009 13:10 UTC (Thu) by mlawren (subscriber, #10136) [Link]

> But deferring the creation of a new, zero-filled page helped to
> conserve zeroes, keeping the kernel from running out.

When the kernel runs out of zeroes, can't it just start using ones & twos? :-)

Our editor's early computer

Posted Jul 9, 2009 19:29 UTC (Thu) by felixfix (subscriber, #242) [Link]

I worked on a CDC 6x00 which had a read-zero write-bitbucket register (B0). I wonder if this was our editor's machine?

Our editor's early computer

Posted Jul 9, 2009 19:38 UTC (Thu) by corbet (editor, #1) [Link]

Indeed, it was a 6600, already somewhat obsolete by the time I was punching cards for it. I did some work on a 7600 as well. Very interesting systems to program at the assembly level.

Our editor's early computer

Posted Jul 9, 2009 19:46 UTC (Thu) by felixfix (subscriber, #242) [Link]

I once was a bit too idle and created a program which died with all memory and registers zeroed and all three possible errors. The only flaw was that the PC ended up past the end of memory rather than at 0. That was on the 6400, which had almost no pipelining.

The 6600 had a bug which the 7600 might have had too. Instructions were 15 or 30 bits, with a 6-bit opcode and three 3-bit register numbers. The conditional instructions used one of those register fields as the condition number (zero, non-zero, negative, positive, NaN, etc.), and if that number happened to match a busy register, the instruction would stall until that register was done, even though the register wasn't actually involved in the instruction.

Thanks for bringing back some fun memories.
