Fixing kmap_atomic()

By Jonathan Corbet
October 13, 2009

Once upon a time, Linux was limited to less than 1GB of physical memory on 32-bit systems. This limit was imposed by two technical decisions: processes run with the same page tables in both kernel and user mode, and all physical memory had to be directly addressable by the kernel. Not changing page tables at every transition between kernel and user space is a significant performance win, but it forces the two modes to share the same 4GB address space. The directly-addressable requirement meant that total physical memory could not exceed the amount of virtual memory address space assigned to the kernel. Indeed, not even the full kernel space was available, due to the need to leave some space for I/O memory, vmalloc(), and so on. The normal split is 3GB for user space and 1GB for kernel space; that limited systems to a bit less than 1GB of physical memory.
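The arithmetic behind that limit can be sketched in a few lines of C. The function names are made up for illustration, and the 128MB reserve used in the note below is an assumption (a commonly cited x86 figure for the vmalloc()/I/O area), not something stated in this article.

```c
#include <assert.h>
#include <stdint.h>

/* All sizes are in megabytes; these names are illustrative, not kernel symbols. */
#define TOY_ADDRESS_SPACE_MB 4096u   /* a 32-bit virtual address space */

/* Kernel's share of the address space once user space has taken its part. */
static uint32_t toy_kernel_space_mb(uint32_t user_space_mb)
{
    assert(user_space_mb <= TOY_ADDRESS_SPACE_MB);
    return TOY_ADDRESS_SPACE_MB - user_space_mb;
}

/* Directly-addressable physical memory: kernel space minus the region
 * held back for I/O memory, vmalloc(), and so on. */
static uint32_t toy_direct_map_limit_mb(uint32_t user_space_mb,
                                        uint32_t reserved_mb)
{
    uint32_t kernel_mb = toy_kernel_space_mb(user_space_mb);

    assert(reserved_mb <= kernel_mb);
    return kernel_mb - reserved_mb;
}
```

With the usual 3GB/1GB split and an assumed 128MB reserve, toy_direct_map_limit_mb(3072, 128) gives 896, which is the "bit less than 1GB" mentioned above.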

The way this problem was fixed was to create the concept of "high memory": memory which is not directly addressable by the kernel. Most of the time, the kernel does not need to directly manipulate much of the memory on the system; almost all user-space pages, for example, are usually only accessed in user mode. But, occasionally, the kernel must be able to reach into any page in the system. Zeroing new pages is one example; reading system call arguments from a user-space page is another. Since high-memory pages cannot live permanently in the kernel's virtual address space, the kernel needs a mechanism by which it can temporarily create a kernel-space address for specific high-memory pages.

That mechanism is called kmap(); it takes a pointer to a struct page and returns a kernel-space virtual address for the page. When the kernel is done with the page, it must use kunmap() to unmap the page and make the address available for other mappings. kmap() works, but it can be slow; it requires translation lookaside buffer flushes and, potentially, cross-CPU interrupts for every mapping. Linus recently commented on the costs of high memory:

HIGHMEM accesses really are very slow. You don't see that in user space, but I really have seen 25% performance differences between non-highmem builds and CONFIG_HIGHMEM4G enabled for things that try to put a lot of data in highmem (and the 64G one is even more expensive). And that was just with 2GB of RAM.

All that costly work is done to keep the kernel-space mapping consistent across all processors in the system, even though many of these mappings are used only briefly, and only on a single CPU.
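For reference, the kmap()/kunmap() calling pattern described above looks roughly like the following. This is a userspace toy model, not kernel code: toy_kmap(), toy_kunmap(), and struct toy_page are stand-ins invented for illustration, and the "mapping" is just a pointer into an ordinary buffer. What it does preserve is the shape of the API: the map call returns an address, while the unmap call takes the page.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Stand-in for struct page; the data lives inline so the model can run anywhere. */
struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    int map_count;                      /* how many mappings are outstanding */
};

/* Models kmap(): hand back an address through which the page can be reached. */
static void *toy_kmap(struct toy_page *page)
{
    page->map_count++;
    return page->data;
}

/* Models kunmap(): note that it takes the page, not the address. */
static void toy_kunmap(struct toy_page *page)
{
    assert(page->map_count > 0);
    page->map_count--;
}

/* Zeroing a page, one of the cases where the kernel must reach into it. */
static void toy_clear_page(struct toy_page *page)
{
    void *vaddr = toy_kmap(page);

    memset(vaddr, 0, TOY_PAGE_SIZE);
    toy_kunmap(page);
}
```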

To improve performance, the kernel developers introduced a special version:

    void *kmap_atomic(struct page *page, enum km_type idx);

Atomic kmap slots
`KM_BOUNCE_READ`
`KM_SKB_SUNRPC_DATA`
`KM_SKB_DATA_SOFTIRQ`
`KM_USER0`
`KM_USER1`
`KM_BIO_SRC_IRQ`
`KM_BIO_DST_IRQ`
`KM_PTE0`
`KM_PTE1`
`KM_IRQ0`
`KM_IRQ1`
`KM_SOFTIRQ0`
`KM_SOFTIRQ1`
`KM_SYNC_ICACHE`
`KM_SYNC_DCACHE`
`KM_UML_USERCOPY`
`KM_IRQ_PTE`
`KM_NMI`
`KM_NMI_PTE`

This function differs from kmap() in some important ways. It only creates a mapping on the current CPU, so there is no need to bother other processors with it. It also creates the mapping using one of a very small set of kernel-space addresses. The caller must specify which address to use by way of the idx argument; these addresses are specified by a set of "slot" constants. For example, KM_USER0 and KM_USER1 are set aside for code called directly from user context - system call implementations, generally. KM_PTE0 is used for page table operations, KM_SOFTIRQ0 is used in software interrupt mode, etc. There are about twenty of these slots defined in current kernels; see the list above for the 2.6.32 slots.
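One way to picture "a very small set of kernel-space addresses" is as a per-CPU array of page-sized windows, indexed by slot. The sketch below is a toy model under stated assumptions: the base address TOY_KMAP_TOP is invented, the layout (one block of slots per CPU, growing downward) is only a plausible arrangement, and none of the names are kernel symbols. The slot count of 19 simply matches the 2.6.32 list above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT  12             /* 4KB pages */
#define TOY_KM_TYPE_NR  19u            /* number of slots in the list above */
#define TOY_KMAP_TOP    0xfff00000u    /* assumed top of the atomic-kmap window */

/* Each (cpu, slot) pair owns exactly one fixed virtual address. */
static uint32_t toy_slot_address(unsigned int cpu, unsigned int slot)
{
    unsigned int idx;

    assert(slot < TOY_KM_TYPE_NR);
    idx = slot + TOY_KM_TYPE_NR * cpu;
    return TOY_KMAP_TOP - ((uint32_t)idx << TOY_PAGE_SHIFT);
}
```

Because the address depends only on the CPU and the slot, two CPUs using the same slot never collide, but two users of the same slot on one CPU always get the same address.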

The use of fixed slots requires that the code using these mappings be atomic - hence the name kmap_atomic(). If code holding an atomic kmap could be preempted, the thread which takes its place could use the same slots, with unfortunate results. The per-CPU nature of atomic mappings means that any cross-CPU migration would be disastrous. It's worth noting that there is no other protection against multiple use of specific slots; if two functions in a given call chain disagree about the use of KM_USER0, bad things are going to happen. In practice, this problem does not seem to actually bite people, though.
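The slot-collision hazard can be shown with a small userspace model. Everything here is hypothetical (the toy_ names, the two-slot enum, the use of the slot index as a stand-in for the mapped address); the only behavior taken from the text above is that nothing checks whether a slot is already in use.

```c
#include <assert.h>
#include <stddef.h>

enum toy_km_type { TOY_KM_USER0, TOY_KM_USER1, TOY_KM_NR_SLOTS };

struct toy_page { int id; };

/* One CPU's slot table: which page each fixed slot currently maps. */
struct toy_cpu_slots {
    const struct toy_page *mapped[TOY_KM_NR_SLOTS];
};

/* Map a page into the caller-chosen slot. The slot index stands in for the
 * returned address. No check is made for an existing mapping. */
static enum toy_km_type toy_kmap_atomic(struct toy_cpu_slots *cpu,
                                        const struct toy_page *page,
                                        enum toy_km_type idx)
{
    assert(idx < TOY_KM_NR_SLOTS);
    cpu->mapped[idx] = page;
    return idx;
}

/* What a caller would see when dereferencing its "address". */
static const struct toy_page *toy_slot_contents(const struct toy_cpu_slots *cpu,
                                                enum toy_km_type addr)
{
    return cpu->mapped[addr];
}

/* An outer function maps page 1 in USER0, then a helper further down the
 * call chain maps page 2 in USER0 as well. Returns the page id the outer
 * function now sees through its own mapping. */
static int toy_collision_demo(void)
{
    struct toy_cpu_slots cpu = { { NULL } };
    struct toy_page a = { 1 }, b = { 2 };
    enum toy_km_type outer = toy_kmap_atomic(&cpu, &a, TOY_KM_USER0);

    toy_kmap_atomic(&cpu, &b, TOY_KM_USER0);    /* helper picks the same slot */
    return toy_slot_contents(&cpu, outer)->id;
}
```

In this model toy_collision_demo() returns 2: the outer function's mapping has silently been replaced by the helper's.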

This API has seen little change for years, but Peter Zijlstra has recently decided that it could use a face lift. The result is a patch series changing this fundamental interface and fixing the resulting compilation problems in over 200 files. The change is conceptually simple: the slots disappear, and the range of addresses is managed as a stack instead. After all, users of kmap_atomic() don't really care about which address they get; they just want an address that nobody else is using. The new API does force map and unmap operations to nest properly, but the atomic nature of these mappings means that usage generally fits that pattern anyway.
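A minimal sketch of the stack idea, again as a userspace toy rather than the actual patch: the depth limit, the toy_ names, and the use of a stack index as the "address" are all assumptions made for illustration. Callers no longer name a slot; they get whatever entry is next, and must release entries in reverse order.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_KMAP_DEPTH 8    /* assumed per-CPU stack depth */

struct toy_page { int id; };

/* One CPU's stack of atomic mappings. */
struct toy_kmap_stack {
    const struct toy_page *mapped[TOY_KMAP_DEPTH];
    int depth;
};

/* Map: take the next free entry; its index stands in for the address. */
static int toy_kmap_atomic(struct toy_kmap_stack *cpu,
                           const struct toy_page *page)
{
    assert(cpu->depth < TOY_KMAP_DEPTH);
    cpu->mapped[cpu->depth] = page;
    return cpu->depth++;
}

/* Unmap: only the most recent mapping may be released.
 * Returns 0 on success, -1 if the call is out of order. */
static int toy_kunmap_atomic(struct toy_kmap_stack *cpu, int addr)
{
    if (cpu->depth == 0 || addr != cpu->depth - 1)
        return -1;
    cpu->mapped[--cpu->depth] = NULL;
    return 0;
}
```

Two nested mappings always receive distinct entries here, so the "who owns KM_USER0" question goes away; the price is the nesting rule enforced in toy_kunmap_atomic().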

There seems to be little question of this change being merged; Linus welcomed it, saying "I think this is how we should have done it originally." There were some quibbles about the naming in the first version of the patch (kmap_atomic() had become kmap_atomic_push()), but that was easily fixed for the second iteration.

It is also interesting to look at how this patch series was reworked. The first version was a single patch which did all of the changes at once. In response to reviewers, Peter broke the second version down into four steps:

1. Make sure that all atomic kmaps are created and destroyed in a strictly nested manner. There were a few places in the code where that did not happen; fixing it was usually just a matter of reordering a couple of kunmap_atomic() calls.
2. Switch to the stack-based mode without changing the kmap_atomic() prototype. So, after this patch, kmap_atomic() simply ignores the idx argument.
3. The kmap_atomic() prototype loses the idx argument; this is, by far, the largest patch of the series.
4. Various final details are fixed up.
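The intermediate state after the second step can be modeled as follows. This is a hypothetical userspace sketch, not code from the patch series: the toy_ names, the depth limit, and the integer page ids are invented. It shows a caller still written against the old prototype, passing a slot name that the implementation now ignores, and unmapping in reverse order as the first step requires.

```c
#include <assert.h>

enum toy_km_type { TOY_KM_USER0, TOY_KM_USER1 };

#define TOY_KMAP_DEPTH 8    /* assumed per-CPU stack depth */

struct toy_cpu {
    int page_id[TOY_KMAP_DEPTH];
    int depth;
};

/* Old prototype kept, but idx no longer selects anything. */
static int toy_kmap_atomic(struct toy_cpu *cpu, int page_id,
                           enum toy_km_type idx)
{
    (void)idx;
    assert(cpu->depth < TOY_KMAP_DEPTH);
    cpu->page_id[cpu->depth] = page_id;
    return cpu->depth++;
}

/* Unmaps must come in reverse order of the maps. */
static void toy_kunmap_atomic(struct toy_cpu *cpu, int addr,
                              enum toy_km_type idx)
{
    (void)idx;
    assert(cpu->depth > 0 && addr == cpu->depth - 1);
    cpu->depth--;
}

/* An unconverted caller. Reusing the same slot name twice is now harmless.
 * Returns 1 if the two mappings got distinct addresses. */
static int toy_copy_like_caller(struct toy_cpu *cpu)
{
    int src = toy_kmap_atomic(cpu, 10, TOY_KM_USER0);
    int dst = toy_kmap_atomic(cpu, 20, TOY_KM_USER0);
    int distinct = (src != dst);

    toy_kunmap_atomic(cpu, dst, TOY_KM_USER0);  /* last mapped, first unmapped */
    toy_kunmap_atomic(cpu, src, TOY_KM_USER0);
    return distinct;
}
```

Dropping the idx parameter afterward (the third step) is then a purely mechanical edit to every call site, which is why it can be large without changing behavior.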

Doing things this way will make it a lot easier to debug any strange problems which result from the changes. The most significant change in terms of how the kernel works is step 2, so that's the patch which is most likely to create problems. But this organization makes that patch relatively small, so tracking down any residual bugs should be relatively easy. The really huge patch (step 3), by contrast, should not really change the binary kernel at all, so the chances of it being problem-free are quite high.

All that remains is getting this change merged. It's too late for 2.6.32, but putting it into linux-next is likely to create large numbers of patch conflicts. That is a common problem with wide-ranging patches like this, though; developers have gotten better over the years at maintaining them against a rapidly-changing kernel.


How expensive is highmem

Posted Oct 18, 2009 19:41 UTC (Sun) by kleptog (subscriber, #1183)

I work with machines that have 8GB+ of memory but are only 32-bit builds. It seems to me that highmem would indeed cost something, but it's entirely unclear to me how much. How could you measure it?

I've been thinking of just installing a 64-bit kernel but leaving the userspace unchanged. ISTM this should cut out the cost of highmem. But I have no way of measuring any difference and I haven't found any place discussing this combination.

How expensive is highmem

Posted Oct 19, 2009 4:44 UTC (Mon) by dlang (guest, #313)

a 64bit kernel with 32 bit userspace is reasonably common.

I've seen significant (double digit percentage) benefits from going from 32 bit to 64 bit kernels, even without highmem involved (probably due to the fact that in 64 bit mode the x86 CPUs have twice as many registers available to use)

when you add the benefits of eliminating highmem overhead, you should really find it a win.

several years ago there were problems with 32 bit userspace and 64 bit kernels, but at this point I have not heard of any interface compatibility problems for a year or two.

How expensive is highmem

Posted Oct 22, 2009 18:28 UTC (Thu) by jengelh (guest, #33263)

>a 64bit kernel with 32 bit userspace is reasonably common.

[citation needed]. It is common on arches where 64-bit is considered expensive - such as sparc64. But on x86, people have obviously made sure 64-bit mode runs at least at the same effective speed, because otherwise it would be harder to sell. Plus, in precompiled distros, the 32-bit objects often do not contain SSE/SSE2 because it is not guaranteed to be available in all cases (and in fact, libvorbis/oggenc will speed up by 17% when adding -msse -mfpmath=sse on ye olde 2003ish AMD Athlon XP); on 64-bit however, SSE is always available, so naturally it comes out faster compared to precompiled 32 bit objects. (IOW, Gentoo is exempt because you can add -mfpmath=sse at your leisure.)

How expensive is highmem

Posted Oct 22, 2009 18:49 UTC (Thu) by dlang (guest, #313)

there is a bunch of software that is not available in 64 bit mode. As a result a lot of people on x86 use a 64 bit kernel so that they can efficiently use all of their ram, but then use 32 bit userspace so that all their apps still work.

How expensive is highmem

Posted Oct 23, 2009 7:36 UTC (Fri) by Cato (guest, #7643)

Does anyone know if a 32 bit userland with 64 bit kernel is supported in Debian or Ubuntu? Sounds like this would be a better option than a PAE kernel for some uses.

How expensive is highmem

Posted Oct 23, 2009 7:53 UTC (Fri) by dlang (guest, #313)

how do you define 'supported'?

32 bit userspace with a 64 bit kernel is absolutely supported by the kernel developers

I don't think either debian or ubuntu include the option to do this in their installers, but since the packages are available if you force overriding the checks you can install the 64 bit kernel on an otherwise 32 bit install (and it will even leave your old kernel available to boot from)

will you get some people who question why you are doing this when you ask on mailing lists? yes. will you get people who are doing this on their systems when you ask on mailing lists? yes (not always at the same time)

are you talking about paid support for either of these? if so you would have to ask the support organization. if they are any good (and are charging you enough to really offer support) I would expect them to do so.

How expensive is highmem

Posted Oct 23, 2009 11:52 UTC (Fri) by mjg59 (subscriber, #23239)

It's supported in the sense that anything that doesn't work is a bug, but not supported in the sense that it's basically untested. There are certainly missing ioctl translations and suchlike, and random applications may fail as a result.

How expensive is highmem

Posted Oct 23, 2009 22:02 UTC (Fri) by paulj (subscriber, #341)

Now we just need distros to support 64bit/primarily-32bit kernel/userspace.

(Fedora is tantalisingly close, but yum updates don't quite work right)

How expensive is highmem

Posted Oct 24, 2009 10:54 UTC (Sat) by nix (subscriber, #2304)

In practice it works well enough for normal userspace apps. It might not
work for things that do Linux-specific stuff like iptables, but if you're
running a 64-bit userspace with some 32-bit apps that you can't get 64-bit
equivalents for (like World of Goo ;) ) then it should just work.

The POSIX subset of what Linux can do absolutely does work in 32-bit
compat mode.

How expensive is highmem

Posted Oct 24, 2009 20:23 UTC (Sat) by dlang (guest, #313)

actually, no.

one thing I ran into is trying to get the citrix client working (yes, a binary-only app), it needs various other libraries, including X libraries. these are not part of the stuff supported by the 32 bit compatibility libraries.

I don't run many 32 bit apps, but I've run across a half dozen of them that have required that I manually download and install some 32 bit versions of packages on my 64 bit machine before they work.

however, I have been able to get every one of them to work.

How expensive is highmem

Posted Oct 24, 2009 2:23 UTC (Sat) by ccurtis (guest, #49713)

I find this question so strange ... as in, why is it even a concern?

But supported or not, my Debian servers have no issues running 32-bit userland apps. 'apt-get install ia32-libs' should be all you need.

My desktop is Ubuntu, and it appears that this package is also required for flash (nonfree). Now, as for stability I can't say I have _too_ many problems, but I wouldn't attribute any I do have to 32bit-ness.

Audio on the other hand...

How expensive is highmem

Posted Oct 24, 2009 2:28 UTC (Sat) by dlang (guest, #313)

installing the ia32-libs package does not take care of everything (unfortunately)

it makes most things run, but I have run across many things that require additional 32 bit packages be installed to make them work.

Fixing kmap_atomic()

Posted Oct 19, 2009 16:04 UTC (Mon) by pflugstad (subscriber, #224)

So, let's see if I can explain this in my terms.

When the kernel is running on a machine with <1GB of physical RAM, and CONFIG_HIGHMEM (of any variety) is not enabled, then the kernel just maps all of physical ram to its virtual memory entries. This kernel is effectively limited to 1GB physical RAM (or so). Additionally, for each userspace process, the top 1GB of its virtual address space is mapped to this same 1GB slice, so that it effectively shares a virtual address space with the kernel.

Then when some userspace process transitions into kernel space, nothing needs to change w.r.t. virtual memory - the kernel can access all of userspace memory directly (the lower 3GB of virtual RAM), since it's already in the same virtual address space, without mucking about with address space mappings which would cause a TLB flush.

But, when you enable CONFIG_HIGHMEM4G, things change a little bit. The kernel still maps the lowest 1GB of physical RAM to its virtual address space, but if you have, say, 2GB of RAM, then that other 1GB of RAM is not mapped to the kernel's virtual address space. This RAM is still accessible directly to the CPU, and user space processes can run from it just fine.

However, when the user space process transitions into the kernel (system call, etc), any pointers the user space process may pass may point to memory that is not currently mapped in the kernel's virtual memory setup.

So this is where kmap_atomic comes into play: it grabs some chunk of unused virtual memory in the kernel (I assume some is set aside up front for this?) and sets up a temporary virtual<->physical mapping to the chunk of physical memory that is not currently mapped. So now the kernel can use that virtual memory address to access the chunk of RAM above 1GB that is not permanently mapped. But since you changed the virtual<->physical mapping, now you have to flush your TLB, which is relatively expensive to rebuild, in addition to the overhead in managing these mappings.

Now, prior to this patch, the chunks of unused virtual memory in the kernel were divided into "slots" dedicated to specific uses. The change discussed in this article is to treat those temporary mappings all the same, and just do a stack of available virtual memory - grab the next chunk of available virtual address space and hand it out, then when it's done, that is "popped" off again.

Actually, it seems like you don't even need to treat the available virtual addresses as a stack - just manage it like you do the heap: hand out how ever much is requested, and when it's "free'd" you put it back into the heap? I guess you could get fragmentation that way. Maybe use one of the SLxB allocators on it?

Now, to just carry this a little bit farther - when you have >4GB on a 32-bit machine, this is where PAE comes into play? Is PAE basically just another extension to the above process - only instead of mapping a 32-bit physical address into the kernel, you map a 36-bit physical address, which is a chunk of RAM somewhere above 4GB, into the kernel's 32-bit virtual address space? So again there's extra overhead in changing virtual to physical mappings, so you get a TLB flush and so on...

Thanks!