The 3/1 split was not imposed by any particular external factor; instead, it was a compromise chosen to balance two limits. The portion of the address space given over to user addresses limits the maximum size of any individual process on the system, while the kernel's portion limits the maximum amount of physical memory which can be supported. Allowing the kernel to address more memory would reduce the maximum size of every process in the system, to the chagrin of Lisp programmers and Mozilla users worldwide. There are, however, patches in circulation that change the address space split for specific needs.
The 2.3 development series added the concept of "high memory," which is not directly addressable by the kernel. High memory complicated kernel programming a bit - kernel code cannot access an arbitrary page in the system without setting up an explicit page-table mapping first. But the payoff that comes with high memory is that much larger amounts of physical memory can now be supported. Multi-gigabyte Linux systems are now common.
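As an illustration of that extra step, here is a minimal sketch (not taken from any real driver; the zero_page_contents() helper is invented for this example) of how kernel code typically touches a page that may live in high memory. kmap() and kunmap() are the interfaces used to create and tear down the temporary mapping:

    /* Sketch only: a page that may be in high memory cannot simply be
     * dereferenced; a temporary kernel mapping must be set up first. */
    #include <linux/highmem.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    static void zero_page_contents(struct page *page)
    {
            /* For low-memory pages kmap() is nearly free; for high-memory
             * pages it installs a page-table entry in a reserved window. */
            void *addr = kmap(page);

            memset(addr, 0, PAGE_SIZE);

            /* The mapping window is a scarce resource; release it promptly. */
            kunmap(page);
    }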
High memory has not solved the problem entirely, however. The kernel is still limited to 1GB of directly-addressable low memory. Any kernel data structure which is frequently accessed must live in low memory, or system performance will be hurt. Increasingly, low memory is becoming the new limiting factor on system scalability.
Consider, for example, the system memory map, which consists of a struct page structure for every page of physical memory in the system. The memory map is a fundamental kernel data structure which must be placed in low memory. It takes up 40 bytes for every (4096-byte) page in the system; that overhead may seem small until you consider that, if you want to put 64GB of memory into an x86 box, the memory map will grow to some 640 megabytes. This structure thus takes most of low memory by itself. Low memory must also be used for every other important data structure, free memory, and the kernel code itself. For a 64GB system, 1GB of low memory is insufficient to even allow the system to boot, much less do the sort of serious processing that such machines are bought for.
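To make that arithmetic concrete, here is a trivial stand-alone calculation (using the same figures as above: 4096-byte pages and a 40-byte struct page) of the memory-map overhead on a 64GB machine:

    #include <stdio.h>

    int main(void)
    {
        unsigned long long ram       = 64ULL << 30; /* 64GB of physical memory */
        unsigned long long page_size = 4096;        /* x86 page size */
        unsigned long long page_desc = 40;          /* bytes per struct page */

        unsigned long long pages  = ram / page_size;   /* ~16 million pages */
        unsigned long long memmap = pages * page_desc; /* total struct page overhead */

        printf("memory map size: %llu MB\n", memmap >> 20);  /* prints 640 */
        return 0;
    }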
One approach to solving this problem is page clustering - grouping physical pages into larger virtual pages. Among other things, this technique reduces the size of the memory map. Page clustering was covered here back in February.
Recently, Ingo Molnar posted a patch which takes a very different approach. Rather than try to squeeze more into 1GB of low memory, Ingo's patch makes low memory bigger. This is done by creating separate page tables to be used by user-space and kernel code, eliminating the need to split the virtual address space between the two realms. With this patch, a user-space process has a page table which gives it access to (almost) the full 4GB virtual address space. When the system goes into kernel mode (via a system call or interrupt), it switches over to the kernel page tables. Since none of the kernel page table space must be given to user processes, the kernel, too, can use the full 4GB address space. The maximum amount of addressable low memory thus quadruples.
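The following is a deliberately simplified, hypothetical sketch of the idea; it is not code from Ingo's patch, and the kernel_pgd_phys and user_pgd_phys names are invented here for illustration. The real work happens in the low-level entry and exit paths, but the essential operation is the same: reload CR3 on every crossing between user and kernel mode.

    /* Hypothetical illustration of the 4G/4G scheme -- not the actual patch. */
    static unsigned long kernel_pgd_phys;  /* kernel page directory (physical address) */
    static unsigned long user_pgd_phys;    /* current task's page directory */

    /* Writing CR3 switches page tables and throws away the (non-global)
     * TLB entries -- the source of the cost described below. */
    static inline void load_cr3(unsigned long pgd_phys)
    {
            asm volatile("movl %0, %%cr3" : : "r" (pgd_phys) : "memory");
    }

    void on_kernel_entry(void)          /* system call or interrupt */
    {
            load_cr3(kernel_pgd_phys);  /* kernel sees (almost) the full 4GB */
    }

    void on_return_to_user(void)
    {
            load_cr3(user_pgd_phys);    /* the process sees (almost) the full 4GB */
    }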
There are, of course, costs to this approach, or it would have been adopted a long time ago. The biggest problem is that the processor's translation buffer (a hardware cache which stores the results of page table lookups) must be flushed when the page tables are changed. Flushing the TLB hurts because subsequent memory accesses will be slowed by the need to do a full, multi-level page table lookup. And, as it turns out, the TLB flush is, itself, a slow operation on x86 processors. The additional overhead is enough to cause a significant slowdown, especially for certain kinds of loads.
The cost of the separated page tables is more than most users will want to pay. For those who have applications requiring large amounts of memory - and who, for whatever reason, cannot just get a 64-bit system - this patch may well be the piece that makes everything work. Of course, the chances of such a patch getting into the mainline kernel before 2.7 are about zero. But it would not be surprising to see it show up in certain vendors' distributions as an option.
Solution in silicon?
Posted Jul 17, 2003 5:59 UTC (Thu) by proski (subscriber, #104)
It looks like a deficiency of the x86 architecture (I don't know how other architectures do it, I'm just saying it can be done better). While segmentation is separate for different processes, the page mapping is common and cannot be changed without a performance impact. If the processor had separate pagetables for the user code and the kernel, the problem would be solved. The problem doesn't seem to be Linux specific.
I wonder if any chip maker would consider implementing separate pagetables in silicon. IMHO it should be easier than implementing 64-bit registers or hyperthreading. A more conservative approach is known to work well. I think a "64G ready" Xeon would be at least as popular as the Itanium.
Solution in silicon?
Posted Jul 17, 2003 19:44 UTC (Thu) by melauer (guest, #2438)
> It looks like a deficiency of the x86 architecture (I don't know how

I'm pretty sure the problem is that 32-bit x86 CPUs can only address 4GB of memory, of which Linux currently allows 1GB to be actual physical memory. The problem isn't x86 per se, but rather the fact that 32-bit x86 CPUs are so powerful and cost-effective nowadays that people want to use them in situations requiring multiple gigabytes of memory and where previously a 64-bit CPU (e.g. SPARC) was used.
The funny thing about this whole situation is that 64-bit x86 CPUs are out now (AMD's Opteron). Even if you _must_ have x86, rather than using a SPARC or the like, there is currently an option for doing so without running into this memory limitation. Granted, the Opteron platform needs some time to mature. For example, I seem to recall that there are few (no?) multi-processor Opteron boards out there right now. But I bet that by the time 2.6 comes out (and certainly by the time 2.8/3.0 comes out) this sort of problem will be best solved by using more appropriate hardware, rather than by using software hacks.
Solution in silicon?
Posted Jul 18, 2003 1:54 UTC (Fri) by tjc (guest, #137)
> I'm pretty sure the problem is that 32-bit x86 CPUs can only address 4GB of memory [snip]

IIRC, the P2 (and up) chop 32-bit addressing into 12 bits for a 4K page, 10 bits for the "inner" page table, and 10 bits for the "outer" page table. I don't know the details, though, living further down the food chain as an application programmer. :-)
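For illustration, here is a toy stand-alone decomposition of a 32-bit virtual address under that 10/10/12 split (the example address is chosen arbitrarily):

    #include <stdio.h>

    int main(void)
    {
        unsigned long vaddr = 0xb75c3fa4;                 /* arbitrary example address */

        unsigned long pde_index = (vaddr >> 22) & 0x3ff;  /* top 10 bits: "outer" table */
        unsigned long pte_index = (vaddr >> 12) & 0x3ff;  /* next 10 bits: "inner" table */
        unsigned long offset    = vaddr & 0xfff;          /* low 12 bits: offset in 4K page */

        printf("pde=%lu pte=%lu offset=0x%lx\n", pde_index, pte_index, offset);
        return 0;
    }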
I don't follow the kernel lists at all (unless there's a spectacular flame war happening ;-), but has an inverted page table been discussed? Is this possible on a P[234]? That would take a performance hit right off the top, but at least it solves the huge page table problem.
Solution in silicon?
Posted Jul 19, 2003 6:34 UTC (Sat) by giraffedata (subscriber, #1954)
There's a more robust solution. No need to make the address translation sensitive to kernel vs user space. Address spaces are just address spaces. The TLB should include an address space ID in its translations. Then you wouldn't purge the TLB just because you switch address spaces. You could switch between user and kernel space and between processes, all with the TLB still warm.
Solution in silicon?
Posted Jul 19, 2003 23:25 UTC (Sat) by tjc (guest, #137)
That sounds like a good idea. I can't find much info on it though. My OS book has about two paragraphs on address space identifiers, and googling for "ASID" brings up a page for the American Society of Interior Designers. ;-)