Brief items
The current 2.6 prepatch is 2.6.4-rc3, which was
announced by Linus on March 9.
Changes this time include more cleanups from Al Viro, an
R128 DRI driver security fix, an ARC4 crypto module, an ACPI update, some
preparatory work for the hotplug CPU patch (but not that patch itself), an
IrDA update, and various other fixes. See
the
long-format changelog for the details.
2.6.4-rc2 was announced on March 3.
It included a number of parallel port
fixes, various architecture updates, the reversion of a patch which had
removed threads from /proc (and broke gdb), an XFS update, a FireWire
update (including a patch noting that IEEE1394 support is no longer
experimental), and numerous fixes. See the
long-format changelog for the details.
Linus's BitKeeper tree contains just a handful of fixes as of this writing.
The current prepatch from Andrew Morton is 2.6.4-rc1-mm1, released on March 7.
Recent additions to the -mm tree include DMA for IDE CDROM ripping,
per-page access permissions with remap_file_pages(), more
scheduler tweaks, and various other fixes. The next -mm release is likely
to be most interesting; see the rest of this week's Kernel Page for
details.
The current 2.4 kernel is 2.4.25; Marcelo released 2.4.26-pre2 on March 6.
This prepatch contains an ACPI update, an XFS update, and a number of
networking patches.
Kernel development news
This article serves mostly as background to help understand why the kernel
developers are considering making fundamental virtual memory changes at
this point in the development cycle. It can probably be skipped by readers
who understand how high and low memory work on 32-bit systems.
A 32-bit processor can address a maximum of 4GB of memory. One could, in
theory, extend the instruction set to allow for larger pointers, but, in
practice, nobody does that; the performance and compatibility costs would
be prohibitive. So the limitation remains: no process on a 32-bit
system can have an address space larger than 4GB, and the kernel cannot
directly address more than 4GB.
In fact, the limitations are more severe than that. Linux kernels split
the 4GB address space between user processes and the kernel; under the most common
configuration, the first 3GB of the 32-bit range are given over to user
space, and the kernel gets the final 1GB starting at 0xc0000000.
Sharing the address space gives a number of performance benefits; in
particular, the hardware's address translation buffer can be shared between
the kernel and user space.
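In concrete terms, the default split comes down to a single constant, and (as described below) physical memory which the kernel maps directly simply appears at a fixed offset from it. A simplified sketch follows; PAGE_OFFSET and TASK_SIZE are the real i386 names, while the two helpers are illustrative stand-ins for the kernel's own __pa() and __va() macros:

    /*
     * Simplified sketch of the default i386 address space split; the
     * real (and more configurable) definitions live in the
     * architecture headers.
     */
    #define PAGE_OFFSET   0xc0000000UL   /* kernel space begins here */
    #define TASK_SIZE     PAGE_OFFSET    /* user space gets 0 through 3GB */

    /* Directly-mapped ("low") memory sits at PAGE_OFFSET, so address
       translation for it is simple arithmetic. */
    static inline unsigned long virt_to_phys_sketch(void *addr)
    {
            return (unsigned long) addr - PAGE_OFFSET;
    }

    static inline void *phys_to_virt_sketch(unsigned long paddr)
    {
            return (void *) (paddr + PAGE_OFFSET);
    }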
If the kernel wishes to be able to access the system's physical memory
directly, however, it must set up page tables which map that memory into
the kernel's part of the address space. With the default 3GB/1GB mapping,
the amount of physical memory which can be addressed in this way is
somewhat less than 1GB - part of the kernel's space must be set aside for
the kernel itself, for memory allocated with vmalloc(), and
various other purposes. That is why, until a few years ago, Linux could
not even fully handle 1GB of memory on 32-bit systems. In fact, back in
1999, Linus decreed
that 32-bit Linux would never, ever support more than 2GB of memory.
"This is not negotiable."
Linus's views notwithstanding, the rest of the world continued on with the
strange notion that 32-bit
systems should be able to support massive amounts of memory. The processor
vendors added paging modes which could use physical addresses which exceed
32 bits in length, thus ending the 4GB limit for physical memory. The
internal addressing limitations in the Linux kernel remained, however.
Happily for users of large systems, Linus can acknowledge an error and
change his mind; he did eventually allow large memory support into the 2.3
kernel. That support came with its own costs and limitations, however.
On 32-bit systems, memory is now divided into "high" and "low" memory. Low
memory continues to be mapped directly into the kernel's address space, and
is thus always reachable via a kernel-space pointer. High memory, by
contrast, has no permanent kernel mapping. When the kernel needs to work
with a page in high memory, it must first create a temporary mapping for
that page in the kernel's address space. This operation can be expensive, and
there are limits on the number of high-memory pages which can be mapped at
any particular time.
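The interface the kernel uses for those temporary mappings is the kmap() family, found in <linux/highmem.h>. A minimal illustration (the function below is an invented example; kmap() and kunmap() are the real calls):

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Clear a page which may live in high memory.  kmap() creates a
       temporary kernel mapping (for a low-memory page it simply returns
       the existing address); kunmap() tears the mapping down again. */
    static void clear_possibly_high_page(struct page *page)
    {
            void *addr = kmap(page);    /* may sleep waiting for a free slot */

            memset(addr, 0, PAGE_SIZE);
            kunmap(page);
    }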
For the most part, the kernel's own data structures must live in low
memory. Memory which is not permanently mapped cannot appear in linked
lists (because its virtual address is transient and variable), and the
performance costs of mapping and unmapping kernel memory are too high.
High memory is useful for process pages and some kernel tasks (I/O buffers,
for example), but the core of the kernel stays in low memory.
Some 32-bit processors can now address 64GB of physical memory, but the
Linux kernel is still not able to deal effectively with that much; the
current limit is around 8GB to 16GB, depending on the load. The problem
now is that larger systems simply run out of low memory. As the system
gets larger, it requires more kernel data structures to manage, and
eventually room for those structures can run out. On a very large system,
the system memory map (an array of struct page structures which
represents physical memory) alone can occupy half of the available low
memory.
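Some rough, illustrative arithmetic shows the scale of the problem: with 4KB pages, a 64GB machine has some 16 million pages to describe. Even assuming only 32 bytes per struct page, that works out to over 500MB of page structures alone, all of which must be squeezed into well under 1GB of low memory.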
There are users out there wanting to scale 32-bit Linux systems up to 32GB
or more of main memory, so the enterprise-oriented Linux distributors have
been scrambling to make that possible. One approach is the 4G/4G patch written by Ingo
Molnar. This patch separates the kernel and user address spaces, allowing
user processes to have 4GB of virtual memory while simultaneously expanding
the kernel's low memory to 4GB. There is a cost, however: the translation
buffer is no longer shared and must be flushed for every transition between
kernel and user space. Estimates of the magnitude of the performance hit
vary greatly, but numbers as high as 30% have been thrown around. This
option makes some systems work, however, so Red Hat ships a 4G/4G kernel
with its enterprise offerings.
The 4G/4G patch extends the capabilities of the Linux kernel, but it
remains unpopular. It is widely seen as an ugly solution, and nobody likes
the performance cost. So there are efforts afoot to extend the scalability
of the Linux kernel via other means. Some of these efforts will likely go
forward - in 2.6, even - but the kernel developers seem increasingly unwilling to distort
the kernel's memory management systems to meet the needs of a small number
of users who are trying to stretch 32-bit systems far beyond where they
should go. There will come a time when they will all answer as Linus did
back in 1999: go get a 64-bit system.
Andrea Arcangeli not only wants to make the Linux kernel scale to and
beyond 32GB of memory on 32-bit processors; he seems to be in a real
hurry. There are, it would seem, customers waiting for a 2.6-based
distribution which can run in such environments.
For Andrea, the real culprit in the exhaustion of low memory is clear: it's
the reverse-mapping virtual memory ("rmap") code. The rmap code was first
described on this page in
January, 2002; its purpose is to make it easier for the kernel to free
memory when swapping is required. To that end, rmap maintains, for each
physical page in the system, a chain of reverse pointers; each pointer
identifies a page table entry which refers to that page. By following
the rmap chains, the kernel can quickly find all mappings for a given page,
unmap them, and swap the page out.
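In rough C terms, the arrangement looks something like the sketch below; the structure is purely illustrative, and the real code in mm/rmap.c packs several entries into each chain node to save space:

    #include <linux/mm.h>

    /* Illustrative sketch of a reverse-mapping chain node; not the
       actual 2.6 data structure. */
    struct pte_chain_sketch {
            struct pte_chain_sketch *next;   /* next mapping of this page */
            pte_t *ptep;                     /* one page table entry which
                                                maps the page */
    };

Every struct page then carries the head of such a chain.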
The rmap code solved some real performance problems in the kernel's virtual
memory subsystem, but it, too, has a cost. Every one of those reverse
mapping entries consumes memory - low memory in particular. Much effort has gone into
reducing the memory cost of the rmap chains, but the simple fact remains:
as the amount of memory (and the number of processes using that memory)
goes up, the rmap chains will consume larger amounts of low memory.
Eliminating the rmap overhead would go a long way toward allowing the
kernel to scale to larger systems. Of course, one wants to eliminate this
overhead while not losing the benefits that rmap brings.
Andrea's approach is to bring back and extend the object-based reverse
mapping patches. The initial object-based patch was created by Dave
McCracken; LWN covered this
patch a year ago. Essentially, this patch eliminates the rmap chains
for memory which maps a file by following pointers "the long way around"
and searching candidate virtual memory areas (VMAs). Andrea has updated this patch and fixed some bugs, but the
core of the patch remains the same; see last year's description for the
details.
Last week, we raised the possibility that
the virtual memory subsystem could see fundamental changes in the course of
the 2.6 "stable" series. This week, Linus confirmed that possibility in response to
Andrea's object-based reverse mapping patch:
I certainly prefer this to the 4:4 horrors. So it sounds worth it
to put it into -mm if everybody else is ok with it.
Assuming this work goes forward, it has the usual implications for the
stable kernel. Even assuming that it stays in the -mm tree for some time,
its inclusion into 2.6 is likely to destabilize things for a few releases
until all of the obscure bugs are shaken out.
Dave McCracken's original patch, in any case, only solves part of the
problem. It gets rid of the rmap chains for file-backed memory, but it
does nothing for anonymous memory (basic process data - stacks, memory
obtained with malloc(), etc.), which has no "object" behind it.
File-backed memory is a large portion of the total, especially on systems
which are running large Oracle servers and use big, shared file mappings.
But anonymous memory is also a large part of the mix; it would be nice to
take care of the rmap overhead for that as well.
To that end, Andrea has posted another patch
(in preliminary form) which provides object-based reverse mapping for
anonymous memory as well. It works, essentially, by replacing the rmap
chain with a pointer to a chain of virtual memory area (VMA) structures.
Anonymous pages are always created in response to a request for memory from
a single process; as a result, they are never shared at creation time.
Given that, there is no need for a new anonymous page to have a chain of
reverse mappings; we know that there can be only a single mapping. Andrea's
patch adds a union to struct page which holds the existing
mapping pointer (for non-anonymous memory) along with a couple of new
ones. One of those is simply called vma, and it points to the (single)
VMA structure pointing to the page. So if a process has several
non-shared, anonymous pages in the same virtual memory area, each of
those pages simply points back to that one VMA.
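Rendered as rough C (this is a sketch based on the description above, not the literal patch; struct page_sketch and its layout are invented for illustration), the non-shared case looks something like:

    struct page_sketch {
            union {
                    struct address_space *mapping;  /* file-backed memory */
                    struct vm_area_struct *vma;     /* non-shared anonymous memory */
                    struct anon_vma *anon_vma;      /* shared anonymous memory,
                                                       described below */
            };
            /* ... the rest of struct page ... */
    };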
With this structure, the kernel can find the page table which maps a given
page by following the pointers through the VMA structure.
Life gets a bit more complicated when the process forks, however. Once
that happens, there will be multiple page tables pointing to the same anonymous
pages and a single VMA pointer will no longer be adequate. To deal with this
case, Andrea has created a new "anon_vma" structure which
implements a linked list of VMAs. The third member of the new struct
page union is a pointer to this structure which, in turn, points to
all VMAs which might contain the page.
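Again in illustrative C, with field names invented for the sketch:

    #include <linux/list.h>

    /* Sketch of the shared (post-fork) case; not the literal patch. */
    struct anon_vma {
            struct list_head vma_head;   /* every VMA which might map
                                            the pages in question */
    };

    /* Each VMA gains a list_head (called anon_vma_node in these
       sketches) linking it into the anon_vma's list; the page's
       anon_vma pointer leads to the structure above. */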
If the kernel needs to unmap a page in this scenario, it must follow the
linked list and examine every VMA it finds. Once the page is unmapped from
every page table found, it can be freed.
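A sketch of that walk, reusing the illustrative structures above; unmap_page_in_vma() is a hypothetical helper standing in for the real page table work, and locking is ignored entirely:

    /* Illustrative only: unmap an anonymous page by visiting every
       VMA on its anon_vma list. */
    static void try_to_unmap_anon_sketch(struct page_sketch *page)
    {
            struct vm_area_struct *vma;

            list_for_each_entry(vma, &page->anon_vma->vma_head, anon_vma_node)
                    unmap_page_in_vma(vma, page);
    }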
There are some memory costs to this scheme: the VMA structure requires a
new list_head structure, and the anon_vma structure must
be allocated whenever a chain must be formed. One VMA can refer to
thousands of pages, however, so a per-VMA cost will be far less than the
per-page costs incurred by the existing rmap code.
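To put numbers on that: a single 4MB anonymous region covers 1024 4KB pages, so one list_head and one small anon_vma structure take the place of what would otherwise be on the order of a thousand per-page chain entries.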
This approach does incur a greater computational cost. Freeing a page
requires scanning multiple VMAs which may or may not contain references to
the page under consideration. This cost will increase with the number of
processes sharing a memory region. Ingo Molnar, who is fond of O(1)
solutions, is nervous about object-based
schemes for this reason. According to Ingo, losing the possibility of
creating an O(1) page unmapping scheme is a heavy cost to pay for the prize
of making large amounts of memory work on obsolete hardware.
The solution that Ingo would like to see, instead, is to reduce the
per-page memory overhead by reducing the number of pages. The means to
that end is page clustering - grouping
adjacent hardware pages into larger virtual pages. Page clustering would
reduce rmap overhead, and reduce the size of the main kernel memory map as
well. The available page clustering patch is even more intrusive than
object-based reverse mapping, however; it seems seriously unlikely to be
considered for 2.6.
The block layer supports the notion of "plugging" a request queue for a
block device. A plugged queue passes no requests to the underlying device;
it allows them to accumulate, instead, so that the I/O scheduler has a
chance to reorder them and optimize performance. There comes a time,
however, when the plug has to be pulled and the device restarted. Often,
code within the filesystem or virtual memory layers decides that, for
whatever reason, it's time to get block I/O moving again. In the current
2.6 kernel, there is a function (
blk_run_queues()) which performs
this task.
The problem is that blk_run_queues() has turned out to be a bit of
a performance and scalability problem. It has a single, global lock which
keeps multiple processors from trying to restart the queues at the same
time; this lock has become a bit of a contention point on some systems. A
call to blk_run_queues() also restarts all block devices on the
system, even though there is typically only one queue that truly needs to
be unplugged.
To address these problems, Jens Axboe has posted a patch which does away with
blk_run_queues() altogether. This change is a result of a
fundamental realization: there is always one specific queue which needs to
be kickstarted. So blk_run_queues() has been replaced with
blk_run_queue() (which takes the specific queue to start as a
parameter) and blk_run_address_space() (which takes a pointer to an
address_space structure). With these functions, higher-level code
can fire up the request queue which belongs to a specific device or which
ultimately underlies a particular non-anonymous mapping.
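A rough illustration of what callers look like under the new scheme; the two kick_io_*() wrappers are invented for the example, while blk_run_queue() and blk_run_address_space() are the interfaces described above:

    #include <linux/blkdev.h>
    #include <linux/fs.h>

    /* Instead of the global blk_run_queues(), a caller now restarts
       only the queue it actually cares about. */
    static void kick_io_for_queue(request_queue_t *q)
    {
            blk_run_queue(q);               /* unplug and run this one queue */
    }

    static void kick_io_for_mapping(struct address_space *mapping)
    {
            blk_run_address_space(mapping); /* run the queue backing this mapping */
    }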
This patch is going straight into the -mm tree; Andrew Morton commented "This is such an improvement over
what we have now it isn't funny." He also noted that "...the next -mm is
starting to look like linux-3.1.0..." The 2.6 kernel looks to be
interesting for a while.
Patches and updates
Kernel trees
Build system
Core kernel code
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet