Virtual Memory I: the problem

[Posted March 10, 2004 by corbet]

This article serves mostly as background to help understand why the kernel developers are considering making fundamental virtual memory changes at this point in the development cycle. It can probably be skipped by readers who understand how high and low memory work on 32-bit systems.

A 32-bit processor can address a maximum of 4GB of memory. One could, in theory, extend the instruction set to allow for larger pointers, but, in practice, nobody does that; the effects on performance and compatibility would be too strong. So the limitation remains: no process on a 32-bit system can have an address space larger than 4GB, and the kernel cannot directly address more than 4GB.

In fact, the limitations are more severe than that. Linux kernels split the 4GB address space between user processes and the kernel; under the most common configuration, the first 3GB of the 32-bit range are given over to user space, and the kernel gets the final 1GB starting at 0xc0000000. Sharing the address space gives a number of performance benefits; in particular, the hardware's address translation buffer can be shared between the kernel and user space.

If the kernel wishes to be able to access the system's physical memory directly, however, it must set up page tables which map that memory into the kernel's part of the address space. With the default 3GB/1GB mapping, the amount of physical memory which can be addressed in this way is somewhat less than 1GB - part of the kernel's space must be set aside for the kernel itself, for memory allocated with vmalloc(), and various other purposes. That is why, until a few years ago, Linux could not even fully handle 1GB of memory on 32-bit systems. In fact, back in 1999, Linus decreed that 32-bit Linux would never, ever support more than 2GB of memory. "This is not negotiable."

Linus's views notwithstanding, the rest of the world continued on with the strange notion that 32-bit systems should be able to support massive amounts of memory. The processor vendors added paging modes which could use physical addresses which exceed 32 bits in length, thus ending the 4GB limit for physical memory. The internal addressing limitations in the Linux kernel remained, however. Happily for users of large systems, Linus can acknowledge an error and change his mind; he did eventually allow large memory support into the 2.3 kernel. That support came with its own costs and limitations, however.

On 32-bit systems, memory is now divided into "high" and "low" memory. Low memory continues to be mapped directly into the kernel's address space, and is thus always reachable via a kernel-space pointer. High memory, instead, has no direct kernel mapping. When the kernel needs to work with a page in high memory, it must explicitly set up a special page table to map it into the kernel's address space first. This operation can be expensive, and there are limits on the number of high-memory pages which can be mapped at any particular time.

For the most part, the kernel's own data structures must live in low memory. Memory which is not permanently mapped cannot appear in linked lists (because its virtual address is transient and variable), and the performance costs of mapping and unmapping kernel memory are too high. High memory is useful for process pages and some kernel tasks (I/O buffers, for example), but the core of the kernel stays in low memory.

Some 32-bit processors can now address 64GB of physical memory, but the Linux kernel is still not able to deal effectively with that much; the current limit is around 8GB to 16GB, depending on the load. The problem now is that larger systems simply run out of low memory. As the system gets larger, it requires more kernel data structures to manage, and eventually room for those structures can run out. On a very large system, the system memory map (an array of struct page structures which represents physical memory) alone can occupy half of the available low memory.

There are users out there wanting to scale 32-bit Linux systems up to 32GB or more of main memory, so the enterprise-oriented Linux distributors have been scrambling to make that possible. One approach is the 4G/4G patch written by Ingo Molnar. This patch separates the kernel and user address spaces, allowing user processes to have 4GB of virtual memory while simultaneously expanding the kernel's low memory to 4GB. There is a cost, however: the translation buffer is no longer shared and must be flushed for every transition between kernel and user space. Estimates of the magnitude of the performance hit vary greatly, but numbers as high as 30% have been thrown around. This option makes some systems work, however, so Red Hat ships a 4G/4G kernel with its enterprise offerings.

The 4G/4G patch extends the capabilities of the Linux kernel, but it remains unpopular. It is widely seen as an ugly solution, and nobody likes the performance cost. So there are efforts afoot to extend the scalability of the Linux kernel via other means. Some of these efforts will likely go forward - in 2.6, even - but the kernel developers seem increasingly unwilling to distort the kernel's memory management systems to meet the needs of a small number of users who are trying to stretch 32-bit systems far beyond where they should go. There will come a time where they will all answer as Linus did back in 1999: go get a 64-bit system.

Index entries for this article
Kernel	Memory management/Object-based reverse mapping

Virtual Memory I: the problem

Posted Mar 11, 2004 5:27 UTC (Thu) by oconnorcjo (guest, #2605) [Link]

I agree with Linus on this one. Opterons and AMD64 are out now and Intel will have an x86-64 in the next year or so. If you really need the RAM then one might as well get a 64 bit chip to go with it. Racking up the RAM on a 32 bit system is like stuffing 10 pounds of potatoes in a five pound bag.

Virtual Memory I: the problem

Posted Mar 11, 2004 8:20 UTC (Thu) by dlang (guest, #313) [Link] (2 responses)

I think you have the 3:1 split backwards. as I understand it the kernel gets the 3G portion and userspace gets 1G.

there are patches to allow you to change it to 2G:2G and I believe I've seen either a 3:1 or a 2.5:1.5 (I don't remember which at the moment) but as you cut down the amount of address space available to the kernel other problems become more common.

Virtual Memory I: the problem

Posted Mar 11, 2004 10:34 UTC (Thu) by axboe (subscriber, #904) [Link]

Nah, it is you who gets it backward.

Virtual Memory I: the problem

Posted Mar 11, 2004 11:51 UTC (Thu) by nix (subscriber, #2304) [Link]

If that were true any given process would only be able to address a third of the physical RAM in the system (on a fully-populated non-highmem box).

This is considered silly, since process virtual memory requirements can be far higher than their physical requirements. :)

Virtual Memory I: the problem

Posted Mar 11, 2004 9:06 UTC (Thu) by dale77 (guest, #1490) [Link]

Yep, buy yourself an AMD64.

Perhaps one of these:

http://www.gamepc.com/shop/systemfamily.asp?family=gpdev

Dale

Virtual Memory I: the problem

Posted Mar 11, 2004 17:23 UTC (Thu) by parimi (guest, #5773) [Link] (2 responses)

Jon, Thanks for such an informative article!

I thought SGI's Altix was able to handle a huge amount of memory. Does anyone know if SGI's kernel also uses the 4G/4G patch?

Altix

Posted Mar 11, 2004 17:30 UTC (Thu) by corbet (editor, #1) [Link]

"I thought SGI's Altix was able to handle a huge amount of memory. Does anyone know if SGI's kernel also uses the 4G/4G patch?"

I do believe that Altix systems are Itanium-based, so they don't have to deal with all this obnoxious stuff.

Virtual Memory I: the problem

Posted Mar 20, 2004 1:35 UTC (Sat) by mysticalreaper (guest, #20326) [Link]

As the other reply stated, SGI's altix runs on Intel's Itanium 2 processors, which are 64 bit, and thus, can address a silly amount of memory (2^64 bytes) and so do not suffer these silly problems.

Virtual Memory I: the problem

Posted Mar 11, 2004 18:36 UTC (Thu) by mmarkov (guest, #4978) [Link] (2 responses)

If the kernel wishes to be able to access the system's physical memory directly, however, it must set up page tables which map that memory into the kernel's part of the address space. With the default 3GB/1GB mapping, the amount of physical memory which can be addressed in this way is somewhat less than 1GB - part of the kernel's space must be set aside for the kernel itself, for memory allocated with vmalloc(), and various other purposes.

Honestly, I don't understand here why only 1GB is accessible under these premises.

PS Great article, Jon. In fact, great articles, both part I and part II.

Virtual Memory I: the problem

Posted Mar 11, 2004 22:17 UTC (Thu) by jmshh (guest, #8257) [Link]

The keyword here is "directly", i.e. without any manipulation of page
tables. So all physical RAM has to live inside the 1GB virtual address
space of the kernel, together with some other stuff, like video buffers.

Why only 1 G is directly accessible

Posted Mar 12, 2004 9:51 UTC (Fri) by Duncan (guest, #6647) [Link]

First, keep in mind that we are talking about a less than 4 gig address
space, the physical limit of the "flat" memory model, 32-bits of address,
with each address serving one byte of memory. One can of course play with
the byte-per-address model and make it, say, two bytes or a full 32-bit
4-bytes, but there again, we get into serious compatibility problems with
current software that assumes one-byte handling. The implications of that
would be HUGE, and NOBODY wants to tackle the task of ensuring
4-byte-per-address clean code, since the assumption has been
byte-per-address virtually forever and virtually ALL programs have that
axiom written so deep into their code you might as well start over again
(which is sort of what Intel argued should be the case with Itanic, clean
start approach, anyway, taking the opportunity to move cleanly to 64-bit,
which is why it never really took off, but that's an entirely different
topic). It's simply easier to move to 64-bit address space than to tinker
with the byte-per-address assumption. Thus, 32-bit is limited to 4-gig of
directly addressable memory in any practical case.

Another solution, as generally used back in the 16-bit era, is called
"segmented" memory. The address back then consisted of a 16-bit "near"
address, and a 16-bit "segment" address. The issue, as one would expect,
ammounted to one of performance. It was comparatively fast to access
anything within the same segment, much slower to address anything OUT of
the segment. As it happened, 64k was the segment size, and if you
remember anything from that era, it might be that editors, for instance,
quite commonly had a limit on the size of the editable file of somewhat
less than 64k, so they could access both their own operational memory AND
the datafile being edited, all within the same 64k segment. However, the
benefits of "flat" memory are such that few want to go back to a segmented
memory model, if at all possible to stay away from it. (That said, the
various high memory models do essentially that, but try to manage it at
the system level so at least individual applications don't have to worry
about it, as they did back in the 16-bit era.)

That still doesn't "address" (play on words intentional) the lower 1-gig
kernel-space, 3-gig user-space, "soft" limit. As you mention, yes, in
theory the kernel /can/ address the full 4-gig. The problem, however, is
hinted at elsewhere in the article where it talks about the 4G/4G patch --
use all available direct address space for the kernel, and switching
between usermode and kernelmode becomes even MORE tremendously expensive
than it already is, in performance terms, because if they use the same
address space, the entire 4-gig "picture" has to be flushed (more on that
below), so the new "picture" of the other mode can be substituted without
losing data. As explained in the article each mode then has to manage its
own memory picture, and the performance issues of flushing that picture so
another one can replace it at each context switch are enormous.

As already mentioned in other replies, there are a number of solutions,
each with their own advantages and disadvantages. One is the 2G/2G split,
which BTW is what MSWormOS uses. This symmetric approach allows both the
kernel and userspace to access the same four gig maximum "picture", each
from their own context, but sharing the picture, so the performance issues
in flushing it don't come into play. It does give the kernel more
comfortable room to work in, but at the expense of that extra gig for
userspace. While few applications need more than their two-gig share of
memory to work in, the very types of applications that do, huge database
applications and other such things, happen to be run on the same sorts of
systems that need that extra room for the kernel.. huge enterprise systems
with well over eight gig of physical memory. Thus, the 2G/2G solution is
a niche solution that will fit only a very limited subset of those running
into the problem in the first place. The 4G/4G solution is more practical
-- EXCEPT that it carries those huge performance issues. Well, there's
also the fact that even a 4G/4G solution only doubles the space available
to work with, and thus is only a temporary solution at best, perhaps two
years worth, maybe 3-4 by implimenting other "tricks" with their own
problems, even if the base performance issue didn't apply. That's where
the next article comes in.

The loose end left to deal with is that flushing, mentioned above. I must
admit to not fully understanding this myself, but a very simplistic view
of things would be to imagine a system with 8 gig of physical memory,
dealt with using the previously mentioned "segments", of which there would
be two, one each for userspace and kernel space. A mode switch would then
simply mean changing the segment reference, ensuring all cache memory is
flushed out to the appropriate segment before one does so, of course.

Practice of course doesn't match that concept perfectly very often at all,
however, and even if a system DID happen to have exactly eight gig of
memory, such a simplistic model wouldn't work in real life because of
/another/ caveat.. that being that each application has its own virtual
address space map, and few actually use the entire thing, so one would be
writing to swap (a 100 to 1000 times slower solution than actual memory,
so a generally poor solution if not absolutely necessary) entirely
unnecessarily with only one application being runnable at once.

That of course is where vm=virtual memory comes in. as it allows all the
space unused by one app or the kernel itself to be used by another, with
its own remapping solutions. However, that's the part I don't really
understand, so won't attempt to explain it. Besides, this post is long
enough already. <g> Just understand that flushing is a necessary process
of low enough performance that it should be avoided if possible, and that
the concept is one of clearing the slate so it can be used for the new
memory picture, while retaining the data of the first one so it can be
used again.

Duncan

Virtual Memory I: the problem

Posted Mar 21, 2004 1:02 UTC (Sun) by alpharomeo (guest, #20341) [Link]

Some questions/comments:

1) Why not cause the kernel to manage memory in de fact pages larger than 4K? A larger page is how other OS's manager large memory efficiently, whether we are talking 32 bit or 64 bit addressing. To avoid breakage, why not keep the current 4K page size but have the kernel always allocate/free pages in blocks of, say, 16 pages, or even 256 pages? Then the page table would be vastly smaller.

2) It may be convenient to say "use a 64 bit machine". The fact is that 64 bit addressing is inefficient and overkill for many situations. Experience with several architectures that support simultaneous 32-bit and 64-bit applications has shown that the 64-bit builds run at least 25-30% slower than the corresponding 32-bit builds. Bigger addresses means bigger programs, bigger stacks and heaps, etc.. Some applications may require nearly twice as much memory when run in 64-bit mode. So, why not optimize the kernel to provide better support for 32-bit addressing? In particular, what is so wrong with supporting infinite physical memory but limiting process address space to 4 GB (or 3 GB)?

3) Why is shared memory such a big problem? We have never been able to get Linux to allow shared memory segments larger than slightly less than 1 GB. Is there some trick to it? Linux reports that there is insufficient swap space, but it does not matter how many swap partitions you allocate - you always get the same error.

Thanks!