Fixing kmap_atomic()
The way this problem was fixed was to create the concept of "high memory": memory which is not directly addressable by the kernel. Most of the time, the kernel does not need to directly manipulate much of the memory on the system; almost all user-space pages, for example, are usually only accessed in user mode. But, occasionally, the kernel must be able to reach into any page in the system. Zeroing new pages is one example; reading system call arguments from a user-space page is another. Since high-memory pages cannot live permanently in the kernel's virtual address space, the kernel needs a mechanism by which it can temporarily create a kernel-space address for specific high-memory pages.
That mechanism is called kmap(); it takes a pointer to a struct page and returns a kernel-space virtual address for the page. When the kernel is done with the page, it must use kunmap() to unmap the page and make the address available for other mappings. kmap() works, but it can be slow; it requires translation lookaside buffer flushes and, potentially, cross-CPU interrupts for every mapping. Linus recently commented on the costs of high memory:
All that costly work is done to keep the kernel-space mapping consistent across all processors in the system, even though many of these mappings are used only briefly, and only on a single CPU.
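The map, use, unmap discipline that kmap() and kunmap() impose can be shown with a small user-space model. This is only a sketch: `kmap_model()`, `kunmap_model()`, `zero_page_model()`, `kmap_model_live()` and `MODEL_PAGE_SIZE` are hypothetical names invented for the example, the `struct page` here is a stand-in that simply carries its own contents, and none of the page-table or TLB work the real functions perform is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096        /* assumed page size for the model */

/* User-space stand-in for the kernel's struct page; in this model the
 * page contents simply live inside the structure. */
struct page {
    unsigned char data[MODEL_PAGE_SIZE];
};

static int live_mappings;           /* mappings handed out, not yet released */

/* Model of kmap(): return an address through which the page can be reached. */
void *kmap_model(struct page *page)
{
    live_mappings++;
    return page->data;
}

/* Model of kunmap(): give the address back so it can serve other mappings. */
void kunmap_model(struct page *page)
{
    (void)page;
    assert(live_mappings > 0);      /* every unmap must pair with a map */
    live_mappings--;
}

/* Zero a page the way kernel code has to: map it, touch it, unmap it. */
void zero_page_model(struct page *page)
{
    void *addr = kmap_model(page);

    memset(addr, 0, MODEL_PAGE_SIZE);
    kunmap_model(page);
}

/* Number of mappings currently outstanding. */
int kmap_model_live(void)
{
    return live_mappings;
}
```

The point of the model is only the pairing: every address obtained from the mapping call is handed back before the function returns, so the small pool of kernel-space addresses stays available to other users.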
To improve performance, the kernel developers introduced a special version:
void *kmap_atomic(struct page *page, enum km_type idx);
Atomic kmap slots |
---|
KM_BOUNCE_READ |
KM_SKB_SUNRPC_DATA |
KM_SKB_DATA_SOFTIRQ |
KM_USER0 |
KM_USER1 |
KM_BIO_SRC_IRQ |
KM_BIO_DST_IRQ |
KM_PTE0 |
KM_PTE1 |
KM_IRQ0 |
KM_IRQ1 |
KM_SOFTIRQ0 |
KM_SOFTIRQ1 |
KM_SYNC_ICACHE |
KM_SYNC_DCACHE |
KM_UML_USERCOPY |
KM_IRQ_PTE |
KM_NMI |
KM_NMI_PTE |
The use of fixed slots requires that the code using these mappings be atomic - hence the name kmap_atomic(). If code holding an atomic kmap could be preempted, the thread which takes its place could use the same slots, with unfortunate results. The per-CPU nature of atomic mappings means that any cross-CPU migration would be disastrous. It's worth noting that there is no other protection against multiple use of specific slots; if two functions in a given call chain disagree about the use of KM_USER0, bad things are going to happen. In practice, this problem does not seem to actually bite people, though.
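The slot-collision hazard described above can be reproduced in a few lines of user-space C. This is a toy model under stated assumptions, not kernel code: `slot_kmap_atomic()`, `slot_kunmap_atomic()`, `helper()` and `caller_mapping_survives()` are hypothetical names, the enum lists only two of the real slots, and a single array stands in for one CPU's set of fixed addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Two of the real slot names, plus an end marker; the full list is longer. */
enum km_type { KM_USER0, KM_USER1, KM_TYPE_NR };

struct page { int id; };                    /* stand-in for struct page */

static struct page *slot_map[KM_TYPE_NR];   /* page currently mapped in each slot */

/* Model of the slot-based interface: the caller names the slot, and nothing
 * checks whether that slot is already in use. */
struct page **slot_kmap_atomic(struct page *page, enum km_type idx)
{
    assert(idx < KM_TYPE_NR);
    slot_map[idx] = page;           /* silently replaces any earlier mapping */
    return &slot_map[idx];          /* the "address" is determined by idx alone */
}

void slot_kunmap_atomic(enum km_type idx)
{
    assert(idx < KM_TYPE_NR);
    slot_map[idx] = NULL;
}

/* A callee which, unknown to its caller, also uses KM_USER0. */
void helper(struct page *other)
{
    slot_kmap_atomic(other, KM_USER0);
    slot_kunmap_atomic(KM_USER0);
}

/* Returns 1 if the caller's mapping is still intact after calling helper(),
 * 0 if the callee clobbered it. */
int caller_mapping_survives(struct page *mine, struct page *other)
{
    struct page **addr = slot_kmap_atomic(mine, KM_USER0);
    int ok;

    helper(other);
    ok = (*addr == mine);
    slot_kunmap_atomic(KM_USER0);
    return ok;
}
```

In this model `caller_mapping_survives()` returns 0: by the time `helper()` returns, the slot the caller was relying on no longer refers to the caller's page, and nothing in the interface reported the conflict.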
This API has seen little change for years, but Peter Zijlstra has recently decided that it could use a face lift. The result is a patch series changing this fundamental interface and fixing the resulting compilation problems in over 200 files. The change is conceptually simple: the slots disappear, and the range of addresses is managed as a stack instead. After all, users of kmap_atomic() don't really care about which address they get; they just want an address that nobody else is using. The new API does force map and unmap operations to nest properly, but the atomic nature of these mappings means that usage generally fits that pattern anyway.
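A stack-managed range of addresses, as described above, might look roughly like the following user-space sketch. The names `stack_kmap_atomic()` and `stack_kunmap_atomic()`, the depth of 8, and the error return on a mis-nested unmap are all assumptions made for illustration; the source does not say how the actual patch sizes the stack or reports nesting violations.

```c
#include <assert.h>
#include <stddef.h>

#define KM_STACK_DEPTH 8            /* hypothetical depth, chosen for the example */

struct page { int id; };            /* stand-in for struct page */

static struct page *km_stack[KM_STACK_DEPTH];   /* one CPU's mapping addresses */
static int km_top;                              /* number of live mappings */

/* Push: hand out the next free address. The caller no longer names a slot. */
struct page **stack_kmap_atomic(struct page *page)
{
    assert(page != NULL);
    if (km_top == KM_STACK_DEPTH)
        return NULL;                /* no addresses left */
    km_stack[km_top] = page;
    return &km_stack[km_top++];
}

/* Pop: only the most recently created mapping may be released.
 * Returns 0 on success, -1 if the caller broke the nesting rule. */
int stack_kunmap_atomic(struct page **addr)
{
    if (km_top == 0 || addr != &km_stack[km_top - 1])
        return -1;
    km_stack[--km_top] = NULL;
    return 0;
}
```

Two nested mappings always receive distinct addresses here, so the disagreement over a shared slot cannot arise; the price is that unmapping out of order is refused, which is the nesting requirement the new API imposes.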
There seems to be little question of this change being merged; Linus welcomed it, saying "I think this is how we should have done it originally." There were some quibbles about the naming in the first version of the patch (kmap_atomic() had become kmap_atomic_push()), but that was easily fixed for the second iteration.
It is also interesting to look at how this patch series was reworked. The first version was a single patch which did all of the changes at once. In response to reviewers, Peter broke the second version down into four steps:
1. Make sure that all atomic kmaps are created and destroyed in a strictly nested manner. There were a few places in the code where that did not happen; fixing it was usually just a matter of reordering a couple of kunmap_atomic() calls.
2. Switch to the stack-based mode without changing the kmap_atomic() prototype. So, after this patch, kmap_atomic() simply ignores the idx argument.
3. The kmap_atomic() prototype loses the idx argument; this is, by far, the largest patch of the series.
4. Various final details are fixed up.
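The intermediate state left by the second step, where the old prototype is kept but the slot argument no longer matters, can be sketched in user-space C as follows. Everything here is a model built on assumptions: `kmap_atomic_step2()`, `kunmap_atomic_step2()` and `km_live_mappings()` are hypothetical names, and the real patch's internals are not shown in the article.

```c
#include <assert.h>
#include <stddef.h>

/* The slot names survive only so that existing callers still compile. */
enum km_type { KM_USER0, KM_USER1 };

struct page { int id; };            /* stand-in for struct page */

#define KM_STACK_DEPTH 8            /* hypothetical depth */
static struct page *km_stack[KM_STACK_DEPTH];
static int km_top;

/* Old prototype, new behaviour: idx is accepted but ignored. */
struct page **kmap_atomic_step2(struct page *page, enum km_type idx)
{
    (void)idx;
    if (km_top == KM_STACK_DEPTH)
        return NULL;
    km_stack[km_top] = page;
    return &km_stack[km_top++];
}

/* Releases the most recent mapping; idx is again ignored. */
void kunmap_atomic_step2(struct page **addr, enum km_type idx)
{
    (void)idx;
    assert(km_top > 0 && addr == &km_stack[km_top - 1]);   /* strict nesting */
    km_stack[--km_top] = NULL;
}

int km_live_mappings(void)
{
    return km_top;
}
```

With this shape, two nested callers that both pass KM_USER0 receive different addresses, which is why the behavioural change is concentrated in this step; dropping the now-dead argument from every call site afterwards is a mechanical edit.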
Doing things this way will make it a lot easier to debug any strange problems which result from the changes. The most significant change in terms of how the kernel works is step 2, so that's the patch which is most likely to create problems. But this organization makes that patch relatively small, so tracking down any residual bugs should be relatively easy. By contrast, the really huge patch (step 3) should not really change the binary kernel at all, so the chances of it being problem-free are quite high.
All that remains is getting this change merged. It's too late for 2.6.32, but putting it into linux-next is likely to create large numbers of patch conflicts. That is a common problem with wide-ranging patches like this, though; developers have gotten better over the years at maintaining them against a rapidly-changing kernel.
Posted Oct 18, 2009 19:41 UTC (Sun) by kleptog (subscriber, #1183)
I've been thinking of just installing a 64-bit kernel but leaving the userspace unchanged. ISTM this should cut out the cost of highmem. But I have no way of measuring any difference and I haven't found any place discussing this combination.
Posted Oct 19, 2009 4:44 UTC (Mon) by dlang (guest, #313)
I've seen significant (double digit percentage) benefits from going from 32 bit to 64 bit kernels, even without highmem involved (probably due to the fact that in 64 bit mode the x86 CPUs have twice as many registers available to use)
when you add the benefits of eliminating highmem overhead, you should really find it a win.
several years ago there were problems with 32 bit userspace and 64 bit kernels, but at this point I have not heard of any interface compatibility problems for a year or two.
Posted Oct 22, 2009 18:28 UTC (Thu) by jengelh (guest, #33263)
[citation needed]. It is common on arches where 64-bit is considered expensive - such as sparc64. But on x86, people have obviously made sure 64-bit mode runs at least at the same effective speed because otherwise it would be harder to sell. Plus, in precompiled distros, the 32-bit objects often do not contain SSE/SSE2 because it is not guaranteed to be available in all cases (and in fact, libvorbis/oggenc will speed up by 17% when adding -msse -mfpmath=sse on ye olde 2003ish AMD Athlon XP); on 64-bit however, SSE is always available, so naturally it comes out faster compared to precompiled 32-bit objects. (IOW, Gentoo is exempt because you can add -mfpmath=sse at your leisure.)
Posted Oct 22, 2009 18:49 UTC (Thu) by dlang (guest, #313)
Posted Oct 23, 2009 7:36 UTC (Fri) by Cato (guest, #7643)
Posted Oct 23, 2009 7:53 UTC (Fri) by dlang (guest, #313)
32 bit userspace with a 64 bit kernel is absolutely supported by the kernel developers
I don't think either debian or ubuntu include the option to do this in their installers, but since the packages are available if you force overriding the checks you can install the 64 bit kernel on an otherwise 32 bit install (and it will even leave your old kernel available to boot from)
will you get some people who question why you are doing this when you ask on mailing lists? yes. will you get people who are doing this on their systems when you ask on mailing lists? yes (not always at the same time)
are you talking about paid support for either of these? if so you would have to ask the support organization. if they are any good (and are charging you enough to really offer support) I would expect them to do so.
Posted Oct 23, 2009 11:52 UTC (Fri) by mjg59 (subscriber, #23239)
Posted Oct 23, 2009 22:02 UTC (Fri) by paulj (subscriber, #341)
(Fedora is tantalisingly close, but yum updates don't quite work right)
Posted Oct 24, 2009 10:54 UTC (Sat) by nix (subscriber, #2304)
The POSIX subset of what Linux can do absolutely does work in 32-bit compat mode.
Posted Oct 24, 2009 20:23 UTC (Sat) by dlang (guest, #313)
one thing I ran into is trying to get the citrix client working (yes, a binary-only app), it needs various other libraries, including X libraries. these are not part of the stuff supported by the 32 bit compatibility libraries.
I don't run many 32 bit apps, but I've run across a half dozen of them that have required that I manually download and install some 32 bit versions of packages on my 64 bit machine before they work.
however, I have been able to get every one of them to work.
Posted Oct 24, 2009 2:23 UTC (Sat) by ccurtis (guest, #49713)
But supported or not, my Debian servers have no issues running 32-bit userland apps. 'apt-get install ia32-libs' should be all you need.
My desktop is Ubuntu, and it appears that this package is also required for flash (nonfree). Now, as for stability I can't say I have _too_ many problems, but I wouldn't attribute any I do have to 32bit-ness.
Audio on the other hand...
Posted Oct 24, 2009 2:28 UTC (Sat) by dlang (guest, #313)
it makes most things run, but I have run across many things that require additional 32 bit packages be installed to make them work.
Posted Oct 19, 2009 16:04 UTC (Mon) by pflugstad (subscriber, #224)
When the kernel is running on a machine with <1GB of physical RAM, and CONFIG_HIGHMEM (of any variety) is not enabled, then the kernel just maps all of physical RAM to its virtual memory entries. This kernel is effectively limited to 1GB physical RAM (or so). Additionally, for each userspace process, the top 1GB of its virtual address space is mapped to this same 1GB slice, so that it effectively shares a virtual address space with the kernel.
Then when some userspace process transitions into kernel space, nothing needs to change w.r.t. virtual memory - the kernel can access all of userspace memory directly (the lower 3GB of virtual RAM), since it's already in the same virtual address space, without mucking about with address space mappings which would cause a TLB flush.
But, when you enable CONFIG_HIGHMEM4G, things change a little bit. The kernel still maps the lowest 1GB of physical RAM to its virtual address space, but if you have, say, 2GB of RAM, then that other 1GB of RAM is not mapped to the kernel's virtual address space. This RAM is still accessible directly to the CPU, and user space processes can run from it just fine.
However, when the user space process transitions into the kernel (system call, etc), any pointers the user space process may pass may point to memory that is not currently mapped in the kernel's virtual memory setup.
So this is where kmap_atomic comes into play: it grabs some chunk of unused virtual memory in the kernel (I assume some is set aside up front for this?) and sets up a temporary virtual<->physical mapping to the chunk of physical memory that is not currently mapped. So now the kernel can use that virtual memory address to access the chunk of RAM above 1GB that is not permanently mapped. But since you changed the virtual<->physical mapping, now you have to flush your TLB, which is relatively expensive to rebuild, in addition to the overhead in managing these mappings.
Now, prior to this patch, the chunks of unused virtual memory in the kernel were divided into "slots" dedicated to specific uses. The change discussed in this article is to treat those temporary mappings all the same, and just do a stack of available virtual memory - grab the next chunk of available virtual address space and hand it out, then when it's done, that is "popped" off again.
Actually, it seems like you don't even need to treat the available virtual addresses as a stack - just manage it like you do the heap: hand out however much is requested, and when it's "free'd" you put it back into the heap? I guess you could get fragmentation that way. Maybe use one of the SLxB allocators on it?
Now, to just carry this a little bit farther - when you have >4GB on a 32-bit machine, this is where PAE comes into play? Is PAE basically just another extension to the above process - only instead of mapping a 32-bit physical address into the kernel, you map a 36-bit physical address, which is a chunk of RAM somewhere above 4GB, into the kernel's 32-bit virtual address space? So again there's extra overhead in changing virtual to physical mappings, so you get a TLB flush and so on...
Thanks!
How expensive is highmem
work for things that do Linux-specific stuff like iptables, but if you're
running a 64-bit userspace with some 32-bit apps that you can't get 64-bit
equivalents for (like World of Goo ;) ) then it should just work.