LWN.net Logo

Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.4-rc3, which was announced by Linus on March 9. Changes this time include more cleanups from Al Viro, an R128 DRI driver security fix, an ARC4 crypto module, an ACPI update, some preparatory work for the hotplug CPU patch (but not that patch itself), an IrDA update, and various other fixes. See the long-format changelog for the details.

2.6.4-rc2 was announced on March 3. It included a number of parallel port fixes, various architecture updates, the reversion of a patch which had removed threads from /proc (and broke gdb), an XFS update, a FireWire update (including one which notes that IEEE1394 support is no longer experimental), and numerous fixes. See the long-format changelog for the details.

Linus's BitKeeper tree contains just a handful of fixes as of this writing.

The current prepatch from Andrew Morton is 2.6.4-rc1-mm1, released on March 7. Recent additions to the -mm tree include DMA for IDE CDROM ripping, per-page access permissions with remap_file_pages(), more scheduler tweaks, and various other fixes. The next -mm release is likely to be most interesting; see the rest of this week's Kernel Page for details.

The current 2.4 kernel is 2.4.25; Marcelo released 2.4.26-pre2 on March 6. This prepatch contains an ACPI update, an XFS update, and a number of networking patches.

Comments (3 posted)

Kernel development news

Virtual Memory I: the problem

This article serves mostly as background to help understand why the kernel developers are considering making fundamental virtual memory changes at this point in the development cycle. It can probably be skipped by readers who understand how high and low memory work on 32-bit systems.

A 32-bit processor can address a maximum of 4GB of memory. One could, in theory, extend the instruction set to allow for larger pointers, but, in practice, nobody does that; the effects on performance and compatibility would be too strong. So the limitation remains: no process on a 32-bit system can have an address space larger than 4GB, and the kernel cannot directly address more than 4GB.

In fact, the limitations are more severe than that. Linux kernels split the 4GB address space between user processes and the kernel; under the most common configuration, the first 3GB of the 32-bit range are given over to user space, and the kernel gets the final 1GB starting at 0xc0000000. Sharing the address space gives a number of performance benefits; in particular, the hardware's address translation buffer can be shared between the kernel and user space.

If the kernel wishes to be able to access the system's physical memory directly, however, it must set up page tables which map that memory into the kernel's part of the address space. With the default 3GB/1GB mapping, the amount of physical memory which can be addressed in this way is somewhat less than 1GB - part of the kernel's space must be set aside for the kernel itself, for memory allocated with vmalloc(), and various other purposes. That is why, until a few years ago, Linux could not even fully handle 1GB of memory on 32-bit systems. In fact, back in 1999, Linus decreed that 32-bit Linux would never, ever support more than 2GB of memory. "This is not negotiable."

Linus's views notwithstanding, the rest of the world continued on with the strange notion that 32-bit systems should be able to support massive amounts of memory. The processor vendors added paging modes which could use physical addresses which exceed 32 bits in length, thus ending the 4GB limit for physical memory. The internal addressing limitations in the Linux kernel remained, however. Happily for users of large systems, Linus can acknowledge an error and change his mind; he did eventually allow large memory support into the 2.3 kernel. That support came with its own costs and limitations, however.

On 32-bit systems, memory is now divided into "high" and "low" memory. Low memory continues to be mapped directly into the kernel's address space, and is thus always reachable via a kernel-space pointer. High memory, instead, has no direct kernel mapping. When the kernel needs to work with a page in high memory, it must explicitly set up a special page table to map it into the kernel's address space first. This operation can be expensive, and there are limits on the number of high-memory pages which can be mapped at any particular time.

For the most part, the kernel's own data structures must live in low memory. Memory which is not permanently mapped cannot appear in linked lists (because its virtual address is transient and variable), and the performance costs of mapping and unmapping kernel memory are too high. High memory is useful for process pages and some kernel tasks (I/O buffers, for example), but the core of the kernel stays in low memory.

Some 32-bit processors can now address 64GB of physical memory, but the Linux kernel is still not able to deal effectively with that much; the current limit is around 8GB to 16GB, depending on the load. The problem now is that larger systems simply run out of low memory. As the system gets larger, it requires more kernel data structures to manage, and eventually room for those structures can run out. On a very large system, the system memory map (an array of struct page structures which represents physical memory) alone can occupy half of the available low memory.

There are users out there wanting to scale 32-bit Linux systems up to 32GB or more of main memory, so the enterprise-oriented Linux distributors have been scrambling to make that possible. One approach is the 4G/4G patch written by Ingo Molnar. This patch separates the kernel and user address spaces, allowing user processes to have 4GB of virtual memory while simultaneously expanding the kernel's low memory to 4GB. There is a cost, however: the translation buffer is no longer shared and must be flushed for every transition between kernel and user space. Estimates of the magnitude of the performance hit vary greatly, but numbers as high as 30% have been thrown around. This option makes some systems work, however, so Red Hat ships a 4G/4G kernel with its enterprise offerings.

The 4G/4G patch extends the capabilities of the Linux kernel, but it remains unpopular. It is widely seen as an ugly solution, and nobody likes the performance cost. So there are efforts afoot to extend the scalability of the Linux kernel via other means. Some of these efforts will likely go forward - in 2.6, even - but the kernel developers seem increasingly unwilling to distort the kernel's memory management systems to meet the needs of a small number of users who are trying to stretch 32-bit systems far beyond where they should go. There will come a time where they will all answer as Linus did back in 1999: go get a 64-bit system.

Comments (12 posted)

Virtual Memory II: the return of objrmap

Andrea Arcangeli not only wants to make the Linux kernel scale to and beyond 32GB of memory on 32-bit processors; he seems to be in a real hurry. There are, it would seem, customers waiting for a 2.6-based distribution which can run in such environments.

For Andrea, the real culprit in the exhaustion of low memory is clear: it's the reverse-mapping virtual memory ("rmap") code. The rmap code was first described on this page in January, 2002; its purpose is to make it easier for the kernel to free memory when swapping is required. To that end, rmap maintains, for each physical page in the system, a chain of reverse pointers; each pointer indicates a page table which has a reference for that page. By following the rmap chains, the kernel can quickly find all mappings for a given page, unmap them, and swap the page out.

The rmap code solved some real performance problems in the kernel's virtual memory subsystem, but it, too has a cost. Every one of those reverse mapping entries consumes memory - low memory in particular. Much effort has gone into reducing the memory cost of the rmap chains, but the simple fact remains: as the amount of memory (and the number of processes using that memory) goes up, the rmap chains will consume larger amounts of low memory. Eliminating the rmap overhead would go a long way toward allowing the kernel to scale to larger systems. Of course, one wants to eliminate this overhead while not losing the benefits that rmap brings.

Andrea's approach is to bring back and extend the object-based reverse mapping patches. The initial object-based patch was created by Dave McCracken; LWN covered this patch a year ago. Essentially, this patch eliminates the rmap chains for memory which maps a file by following pointers "the long way around" and searching candidate virtual memory areas (VMAs). Andrea has updated this patch and fixed some bugs, but the core of the patch remains the same; see last year's description for the details.

Last week, we raised the possibility that the virtual memory subsystem could see fundamental changes in the course of the 2.6 "stable" series. This week, Linus confirmed that possibility in response to Andrea's object-based reverse mapping patch:

I certainly prefer this to the 4:4 horrors. So it sounds worth it to put it into -mm if everybody else is ok with it.

Assuming this work goes forward, it has the usual implications for the stable kernel. Even assuming that it stays in the -mm tree for some time, its inclusion into 2.6 is likely to destabilize things for a few releases until all of the obscure bugs are shaken out.

Dave McCracken's original patch, in any case, only solves part of the problem. It gets rid of the rmap chains for file-backed memory, but it does nothing for anonymous memory (basic process data - stacks, memory obtained with malloc(), etc.), which has no "object" behind it. File-backed memory is a large portion of the total, especially on systems which are running large Oracle servers and use big, shared file mappings. But anonymous memory is also a large part of the mix; it would be nice to take care of the rmap overhead for that as well.

To that end, Andrea has posted another patch (in preliminary form) which provides object-based reverse mapping for anonymous memory as well. It works, essentially, by replacing the rmap chain with a pointer to a chain of virtual memory area (VMA) structures.

Anonymous pages are always created in response to a request for memory from a single process; as a result, they are never shared at creation time. Given that, there is no need for a new anonymous page to have a chain of reverse mappings; we know that there can be only a single mapping. Andrea's patch adds a union to struct page which includes the existing mapping pointer (for non-anonymous memory) and adds a couple of new ones. One of those is simply called vma, and it points to the (single) VMA structure pointing to the page. So if a process has several non-shared, anonymous pages in the same virtual memory area, the structure looks somewhat like this:

[Anonymous reverse mapping]

With this structure, the kernel can find the page table which maps a given page by following the pointers through the VMA structure.

Life gets a bit more complicated when the process forks, however. Once that happens, there will be multiple page tables pointing to the same anonymous pages and a single VMA pointer will no longer be adequate. To deal with this case, Andrea has created a new "anon_vma" structure which implements a linked list of VMAs. The third member of the new struct page union is a pointer to this structure which, in turn, points to all VMAs which might contain the page. The structure now looks like:

[anonvma]

If the kernel needs to unmap a page in this scenario, it must follow the linked list and examine every VMA it finds. Once the page is unmapped from every page table found, it can be freed.

There are some memory costs to this scheme: the VMA structure requires a new list_head structure, and the anon_vma structure must be allocated whenever a chain must be formed. One VMA can refer to thousands of pages, however, so a per-VMA cost will be far less than the per-page costs incurred by the existing rmap code.

This approach does incur a greater computational cost. Freeing a page requires scanning multiple VMAs which may or may not contain references to the page under consideration. This cost will increase with the number of processes sharing a memory region. Ingo Molnar, who is fond of O(1) solutions, is nervous about object-based schemes for this reason. According to Ingo, losing the possibility of creating an O(1) page unmapping scheme is a heavy cost to pay for the prize of making large amounts of memory work on obsolete hardware.

The solution that Ingo would like to see, instead, is to reduce the per-page memory overhead by reducing the number of pages. The means to that end is page clustering - grouping adjacent hardware pages into larger virtual pages. Page clustering would reduce rmap overhead, and reduce the size of the main kernel memory map as well. The available page clustering patch is even more intrusive than object-based reverse mapping, however; it seems seriously unlikely to be considered for 2.6.

Comments (6 posted)

No more global unplugging

The block layer supports the notion of "plugging" a request queue for a block device. A plugged queue passes no requests to the underlying device; it allows them to accumulate, instead, so that the I/O scheduler has a chance to reorder them and optimize performance. There comes a time, however, when the plug has to be pulled and the device restarted. Often, code within the filesystem or virtual memory layers decides that, for whatever reason, it's time to get block I/O moving again. In the current 2.6 kernel, there is a function (blk_run_queues()) which performs this task.

The problem is that blk_run_queues() has turned out to be a bit of a performance and scalability problem. It has a single, global lock which keeps multiple processors from trying to restart the queues at the same time; this lock has become a bit of a contention point on some systems. A call to blk_run_queues() also restarts all block devices on the system, even though there is typically only one queue that truly needs to be unplugged.

To address these problems, Jens Axboe has posted a patch which does away with blk_run_queues() altogether. This change is a result of a fundamental realization: there is always one specific queue which needs to be kickstarted. So blk_run_queues() has been replaced with blk_run_queue() (which takes the specific queue to start as a parameter) and blk_run_address_space() (which takes a pointer to a address_space structure). With these functions, higher-level code can fire up the request queue which belongs to a specific device or which ultimately underlies a particular non-anonymous mapping.

This patch is going straight into the -mm tree; Andrew Morton commented "This is such an improvement over what we have now it isn't funny." He also noted that "...the next -mm is starting to look like linux-3.1.0..." The 2.6 kernel looks to be interesting for a while.

Comments (1 posted)

Patches and updates

Kernel trees

Build system

Core kernel code

Device drivers

Filesystems and block I/O

Memory management

Networking

Security-related

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds