Posted Oct 19, 2009 16:04 UTC (Mon) by pflugstad (subscriber, #224)
Parent article: Fixing kmap_atomic()
So, let's see if I can explain this in my terms.
When the kernel is running on a machine with <1GB of physical RAM, and CONFIG_HIGHMEM (of any variety) is not enabled, the kernel just maps all of physical RAM into its virtual address space. Such a kernel is effectively limited to about 1GB of physical RAM (896MB in practice, since part of the kernel's 1GB is reserved for other uses). Additionally, for each userspace process, the top 1GB of its virtual address space is mapped to this same slice, so that every process effectively shares a virtual address space with the kernel.
Then when some userspace process transitions into kernel space, nothing needs to change w.r.t. virtual memory - the kernel can access all of userspace memory directly (the lower 3GB of the virtual address space), since it's already in the same address space, without mucking about with address space mappings, which would cause a TLB flush.
But when you enable CONFIG_HIGHMEM4G, things change a little bit. The kernel still maps the lowest ~1GB of physical RAM into its virtual address space, but if you have, say, 2GB of RAM, then the other 1GB is not mapped into the kernel's virtual address space. This RAM is still directly accessible to the CPU, and userspace processes can run from it just fine.
However, when a userspace process transitions into the kernel (system call, etc.), any pointers it passes may point to memory that is not currently mapped in the kernel's virtual memory setup.
So this is where kmap_atomic() comes into play: it grabs a chunk of unused kernel virtual address space (some is set aside up front for this) and sets up a temporary virtual<->physical mapping to the chunk of physical memory that is not permanently mapped. Now the kernel can use that virtual address to access the RAM above 1GB. But since you changed a virtual<->physical mapping, the corresponding TLB entry has to be invalidated when the mapping is torn down - a single-page invalidation rather than a full TLB flush, but still overhead on top of managing these mappings.
Now, prior to this patch, the chunks of unused kernel virtual address space were divided into "slots" dedicated to specific uses. The change discussed in this article is to treat those temporary mappings all the same and manage the available virtual addresses as a stack: grab the next available chunk of virtual address space and hand it out; when it's done, it gets "popped" off again.
Actually, it seems like you don't even need to treat the available virtual addresses as a stack - you could manage them like a heap: hand out however much is requested, and when it's freed, put it back. I guess you could get fragmentation that way. Maybe use one of the SLxB allocators on it?
Now, to carry this a little bit farther - when you have >4GB on a 32-bit machine, this is where PAE comes into play? Is PAE basically just another extension of the above process - only instead of mapping a 32-bit physical address into the kernel, you map a 36-bit physical address (a chunk of RAM somewhere above 4GB) into the kernel's 32-bit virtual address space? So again there's extra overhead in changing virtual-to-physical mappings, TLB invalidations, and so on...