64GB on 32-bit systems
- All physical memory was directly reachable via a kernel virtual address. When the kernel has direct access to all memory, manipulating that memory is easy. But, to operate in this mode, the system cannot have more memory than the kernel is able to address.
- The virtual address space was split into two large pieces: the bottommost 3GB for user-space addresses, and the top 1GB for kernel addresses.
The 3/1 split was not imposed by any particular external factor; instead, it was a compromise chosen to balance two limits. The portion of the address space given over to user addresses limits the maximum size of any individual process on the system, while the kernel's portion limits the maximum amount of physical memory which can be supported. Allowing the kernel to address more memory would reduce the maximum size of every process in the system, to the chagrin of Lisp programmers and Mozilla users worldwide. There were, however, patches in circulation to change the address space split for specific needs.
The 2.3 development series added the concept of "high memory," which is not directly addressable by the kernel. High memory complicated kernel programming a bit - kernel code cannot access an arbitrary page in the system without setting up an explicit page-table mapping first. But the payoff that comes with high memory is that much larger amounts of physical memory can now be supported. Multi-gigabyte Linux systems are now common.
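The constraint described above can be illustrated with a toy model (the names below are hypothetical; the real kernel uses the kmap() and kunmap() functions, in C). Low-memory pages are always addressable, but a high page must first be mapped into one of a small pool of kernel virtual slots, and those slots must be released for reuse:

```python
class ToyKernel:
    """Toy model of high-memory access; not real kernel code."""

    def __init__(self, num_slots=4):
        # A fixed, small pool of kernel virtual addresses reserved
        # for temporarily mapping high pages.
        self.free_slots = list(range(num_slots))
        self.mapping = {}  # page -> slot currently holding it

    def kmap(self, page):
        """Map a high page into a virtual slot; fails if none are free."""
        if page in self.mapping:
            return self.mapping[page]
        if not self.free_slots:
            raise RuntimeError("no mapping slots free - kunmap() something first")
        slot = self.free_slots.pop()
        self.mapping[page] = slot
        return slot

    def kunmap(self, page):
        """Release the slot so another high page can be mapped."""
        self.free_slots.append(self.mapping.pop(page))

kernel = ToyKernel(num_slots=2)
kernel.kmap("highpage-A")
kernel.kmap("highpage-B")
kernel.kunmap("highpage-A")   # without this, mapping a third page fails
kernel.kmap("highpage-C")
```

The point of the model is the bookkeeping cost: every access to an arbitrary high page carries a map/unmap cycle that simply does not exist for low memory.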
High memory has not solved the problem entirely, however. The kernel is still limited to 1GB of directly-addressable low memory. Any kernel data structure which is frequently accessed must live in low memory, or system performance will be hurt. Increasingly, low memory is becoming the new limiting factor on system scalability.
Consider, for example, the system memory map, which consists of a struct page structure for every page of physical memory in the system. The memory map is a fundamental kernel data structure which must be placed in low memory. It takes up 40 bytes for every (4096-byte) page in the system; that overhead may seem small until you consider that, if you want to put 64GB of memory into an x86 box, the memory map will grow to some 640 megabytes. This structure thus takes most of low memory by itself. Low memory must also be used for every other important data structure, free memory, and the kernel code itself. For a 64GB system, 1GB of low memory is insufficient to even allow the system to boot, much less do the sort of serious processing that such machines are bought for.
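The 640MB figure is easy to verify with back-of-the-envelope arithmetic (the 40-byte size of struct page is approximate and configuration-dependent):

```python
PAGE_SIZE = 4096           # bytes per x86 page
STRUCT_PAGE = 40           # approximate bytes per struct page
physical = 64 * 2**30      # 64GB of physical memory

pages = physical // PAGE_SIZE       # number of struct page entries
mem_map_bytes = pages * STRUCT_PAGE # low memory consumed by the memory map

print(pages)                  # 16,777,216 pages
print(mem_map_bytes // 2**20) # 640MB - most of the 1GB of low memory
```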
One approach to solving this problem is page clustering - grouping physical pages into larger virtual pages. Among other things, this technique reduces the size of the memory map. Page clustering was covered here back in February.
Recently, Ingo Molnar posted a patch which takes a very different approach. Rather than try to squeeze more into 1GB of low memory, Ingo's patch makes low memory bigger. This is done by creating separate page tables to be used by user-space and kernel code, eliminating the need to split the virtual address space between the two realms. With this patch, a user-space process has a page table which gives it access to (almost) the full 4GB virtual address space. When the system goes into kernel mode (via a system call or interrupt), it switches over to the kernel page tables. Since none of the kernel page table space must be given to user processes, the kernel, too, can use the full 4GB address space. The maximum amount of addressable low memory thus quadruples.
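The address-space arithmetic behind that quadrupling can be sketched as follows (the figures are rough; in both schemes the kernel reserves some of its space for other purposes, such as vmalloc):

```python
GB = 2**30
VADDR = 4 * GB   # total 32-bit virtual address space

# Traditional 3/1 split: user and kernel share one page table,
# so the 4GB must be divided between them.
user_3_1, kernel_3_1 = 3 * GB, 1 * GB

# Separate page tables (Ingo's patch): each side gets (almost)
# the whole 4GB to itself.
user_4_4, kernel_4_4 = VADDR, VADDR

print(kernel_4_4 // kernel_3_1)   # addressable low memory roughly quadruples
```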
There are, of course, costs to this approach, or it would have been adopted a long time ago. The biggest problem is that the processor's translation buffer (a hardware cache which stores the results of page table lookups) must be flushed when the page tables are changed. Flushing the TLB hurts because subsequent memory accesses will be slowed by the need to do a full, multi-level page table lookup. And, as it turns out, the TLB flush is, itself, a slow operation on x86 processors. The additional overhead is enough to cause a significant slowdown, especially for certain kinds of loads.
The cost imposed by the separated page tables is more than most users will want to pay. For those who have applications requiring large amounts of memory - and who, for whatever reason, cannot just get a 64-bit system - this patch may well be the piece that makes everything work. Of course, the chances of such a patch getting into the mainline kernel before 2.7 are about zero. But it would not be surprising to see it show up in certain vendors' distributions as an option.
