Once upon a time - not that long ago - the Linux kernel was unable to work
with more than 1GB of physical memory (actually, just a little bit less).
This limit was imposed by a couple of fundamental design decisions in the
kernel:
- All physical memory was directly reachable via a kernel virtual
address. When the kernel has direct access to all memory,
manipulating that memory is easy. But, to operate in this mode, the
system cannot have more memory than the kernel is able to address.
- The virtual address space was split into two large pieces: the
bottommost 3GB for user-space addresses, and the top 1GB for the kernel.
The 3/1 split was not imposed by any particular external factor; instead,
it was a compromise chosen to balance two limits. The portion of the address space
given over to user addresses limits the maximum size of any individual
process on the system, while the kernel's portion limits the maximum
amount of physical memory which can be supported. Allowing the kernel to
address more memory would reduce the maximum size of every process in the
system, to the chagrin of Lisp programmers and Mozilla users worldwide.
There were, however, patches in
circulation to change the address space split for specific needs.
The 2.3 development series added the concept of "high memory," which is not
directly addressable by the kernel. High memory complicated kernel
programming a bit - kernel code cannot access an arbitrary page in the
system without setting up an explicit page-table mapping first. But the
payoff that comes with high memory is that much larger amounts of physical
memory can now be supported. Multi-gigabyte Linux systems are now common.
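That explicit mapping step is what the kmap() interface provides. As a rough illustration (the function, its arguments, and the copy it performs are hypothetical; only the kmap()/kunmap() calls are the real interface), kernel code of that era touching a high-memory page would look something like:

```c
#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Sketch only: copy a buffer into a page that may live in high memory.
 * kmap() sets up a temporary low-memory mapping for the page; kunmap()
 * tears it down again.  Pages in low memory need no such step.
 */
static void copy_to_high_page(struct page *page, const void *data, size_t len)
{
	void *vaddr = kmap(page);	/* map the page into kernel space */

	memcpy(vaddr, data, len);
	kunmap(page);			/* drop the temporary mapping */
}
```

The cost of that map/unmap pair on every access is the "complication" the article refers to.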
High memory has not solved the problem entirely, however. The kernel is
still limited to 1GB of directly-addressable low memory. Any kernel data
structure which is frequently accessed must live in low memory, or system
performance will be hurt. Increasingly, low memory is becoming the new
limiting factor on system scalability.
Consider, for example, the system memory map, which consists of a
struct page structure for every page of physical memory in the
system. The memory map is a fundamental kernel data structure which must
be placed in low memory. It takes up 40 bytes for every (4096-byte) page
in the system; that overhead may seem small until you consider that, if you
want to put 64GB of memory into an x86 box, the memory map will grow to
some 640 megabytes. This structure thus takes most of low memory by
itself. Low memory must also be used for every other important data
structure, free memory, and the kernel code itself. For a 64GB system, 1GB
of low memory is insufficient to even allow the system to boot, much less
do the sort of serious processing that such machines are bought for.
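The 640MB figure is easy to check; a quick back-of-the-envelope calculation (the 40-byte struct page size and 4096-byte page size are the article's numbers):

```python
# Memory-map overhead: one 40-byte struct page per 4096-byte physical page.
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 40  # bytes, per the article (x86, circa 2.5)

def mem_map_bytes(phys_mem_bytes):
    """Size of the kernel memory map for a given amount of physical RAM."""
    return (phys_mem_bytes // PAGE_SIZE) * STRUCT_PAGE_SIZE

GB = 1 << 30
MB = 1 << 20
print(mem_map_bytes(64 * GB) // MB)  # -> 640 (MB of low memory consumed)
```

So a 64GB box spends about two thirds of its 1GB of low memory on the memory map alone.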
One approach to solving this problem is page clustering - grouping physical
pages into larger virtual pages. Among other things, this technique
reduces the size of the memory map. Page clustering was covered here back in February.
Recently, Ingo Molnar posted a patch which
takes a very different approach. Rather than try to squeeze more into 1GB
of low memory, Ingo's patch makes low memory bigger. This is done by
creating separate page tables to be used by user-space and kernel code,
eliminating the need to split the virtual address space between the two realms.
With this patch, a user-space process has a page table which gives it
access to (almost) the full 4GB virtual address space. When the system
goes into kernel mode (via a system call or interrupt), it switches over to
the kernel page tables. Since none of the kernel page table space must be
given to user processes, the kernel, too, can use the full 4GB address
space. The maximum amount of addressable low memory thus quadruples.
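The quadrupling is simple address-space arithmetic, using the article's round numbers:

```python
GB = 1 << 30

# Traditional split: user and kernel share one 4GB virtual address space.
user_split, kernel_split = 3 * GB, 1 * GB

# Separate per-mode page tables: each side gets (almost) the full 4GB.
separate = 4 * GB

print(separate // kernel_split)  # -> 4: directly addressable low memory quadruples
```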
There are, of course, costs to this approach, or it would have been adopted
a long time ago. The biggest problem is that the processor's translation
buffer (a hardware cache which stores the results of page table lookups)
must be flushed when the page tables are changed. Flushing the TLB hurts
because subsequent memory accesses will be slowed by the need to do a full,
multi-level page table lookup. And, as it turns out, the TLB flush is,
itself, a slow operation on x86 processors. The additional overhead is
enough to cause a significant slowdown, especially for certain kinds of
workloads. The cost of the separated page tables is more than
most users will want to pay. For those who have applications requiring
large amounts of memory - and who, for whatever reason, cannot just get a
64-bit system - this patch may well be the piece that makes everything
work. Of course, the chances of such a patch getting into the mainline
kernel before 2.7 are about zero. But it would not be surprising to see it
show up in certain vendors' distributions as an option.