
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.23-rc7, released by Linus on September 19. It contains a fair number of fixes and the return of the sk98lin driver. This should be the last prepatch before the final 2.6.23 release. See the long-format changelog for the details.

The current -mm release is 2.6.23-rc6-mm1. Recent changes to -mm include a patch to disable the timerfd() system call (to give time to work out what the API should actually be), randomization of the brk() system call on i386 and x86_64 systems, and lots of fixes.


Kernel development news

Quotes of the week

It took me over two solid days to get this lot compiling and booting on a few boxes. This required around ninety fixup patches and patch droppings. There are several bugs in here which I know of (details below) and presumably many more which I don't know of. I have to say that this just isn't working any more.
-- Andrew Morton launches 2.6.23-rc6-mm1

Linux developers firmly guard their independence and don't often follow our advice.
-- Richard Stallman

Yes, I realize that there's a lot of insane people out there. However, we generally don't do kernel design decisions based on them. But we can pat the insane users on the head and say "we won't guarantee it works, but if you eat your prozac, and don't bother us, go do your stupid things".
-- Linus Torvalds


Large pages, large blocks, and large problems

By Jonathan Corbet
September 19, 2007
Most of the core virtual memory subsystem developers met for a mini-summit just before the 2007 Kernel Summit in Cambridge. They came away feeling that they had resolved a number of VM scalability problems. Subsequent discussions have made it clear that, perhaps, this conclusion was a bit premature. They may well have resolved things, but it is not clear that everybody came to the same resolution.

All of the issues at hand relate to scalability in one way or another. While the virtual memory subsystem has been through a great many changes aimed at making it work well on contemporary systems, one key aspect of how it works has remained essentially unchanged since the beginning: the 4096-byte (on most architectures) page size. Over that time, the amount of memory installed on a typical system has grown by about three orders of magnitude - that's 1000 times more pages that the kernel must manage and 1000 times more page faults which must be handled. Since it does not appear that this trend will stop soon, there is a clear scalability problem which must be managed.

This problem is complicated by the way that Linux tends to fragment its memory. Almost all memory allocations are done in units of a single page, with the result that system RAM tends to get scattered into large numbers of single-page chunks. The kernel's memory allocator tries to keep larger groups of pages together, but there are limits to how successful it can be. The file /proc/buddyinfo can be illustrative here; on a system which has been running and busy for a while, the number of higher-order (larger) pages, as shown in the rightmost columns, will be very small.
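As a rough illustration of how to read that file, the sketch below parses one line of /proc/buddyinfo and totals the free pages held in blocks of a given order or larger. It assumes the usual layout ("Node N, zone NAME" followed by one count per order, where column k counts free blocks of 2^k pages); the function names and the MAX_ORDERS bound are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDERS 16   /* illustrative upper bound on columns accepted */

/*
 * Parse one /proc/buddyinfo line such as
 *   "Node 0, zone   Normal   1046  527  128 ..."
 * Column k (counting from 0) is the number of free blocks of 2^k pages.
 * Fills counts[] and returns the number of columns read, or -1 if the
 * line does not start with the expected header.
 */
int parse_buddyinfo_line(const char *line, unsigned long counts[MAX_ORDERS])
{
    int node, consumed = 0, n = 0;
    char zone[32];

    assert(line != NULL);
    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
        return -1;

    const char *p = line + consumed;
    while (n < MAX_ORDERS) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);
        if (end == p)           /* no more numbers on this line */
            break;
        counts[n++] = v;
        p = end;
    }
    return n;
}

/* Total free pages held in blocks of order min_order or higher. */
unsigned long free_pages_at_or_above(const unsigned long *counts, int n,
                                     int min_order)
{
    unsigned long total = 0;

    for (int k = min_order; k < n; k++)
        total += counts[k] << k;
    return total;
}
```

Given the invented line "Node 0, zone Normal 100 50 10 2 0 1", the parser reads six columns, and free_pages_at_or_above(counts, 6, 3) yields 48 pages (2 blocks of 8 pages plus 1 block of 32).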

The main response to memory fragmentation has been to avoid higher-order allocations at almost any cost. There are very few places in the kernel where allocations of multiple contiguous pages are done. This approach has worked for some time, but avoiding larger allocations does not always make the need for such allocations go away. In fact, there are many things which could benefit from larger contiguous memory areas, including:

  • Applications which use large amounts of memory will be working with large numbers of pages. The translation lookaside buffer (TLB) in the CPU, which speeds virtual address lookups, is generally relatively small, to the point that large applications run up a lot of time-consuming TLB misses. Larger pages require fewer TLB entries, and will thus result in faster execution. The hugetlbfs extension was created for just this purpose, but it is a specialized mechanism used by few applications, and it does not do anything special to make large contiguous memory regions easier for the kernel to find.

  • I/O operations can work better with larger contiguous chunks of data to work with. Users trying to use "jumbo frames" (extra-large packets) on high-performance network adapters have been experiencing problems for a while. Many devices are limited in the number of scatter/gather entries they support for a single operation, so small buffers limit the overall I/O operation size. Disk devices are pushing toward larger sector sizes which would best be supported by larger contiguous buffers within the kernel.

  • Filesystems are feeling pressure to use larger block sizes for a number of performance reasons. This message from David Chinner provides an excellent explanation of why filesystems benefit from larger blocks. But it is hard (on Linux) for a filesystem to work with block sizes larger than the page size; XFS does it, but the resulting code is seen as non-optimal and is not as fast as it could be. Most other filesystems do not even try; as a result, an ext3 filesystem created on a system with 8192-byte pages cannot be mounted on a system with smaller pages.
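The first two items come down to simple multiplication, which the sketch below makes explicit. The entry counts used in the note that follows (a 64-entry TLB, a device limited to 128 scatter/gather entries) are hypothetical figures chosen for the arithmetic, not numbers from any particular hardware, and the scatter/gather bound assumes the worst case where each entry describes exactly one buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Memory reachable without a TLB miss: each entry maps one page. */
uint64_t tlb_reach_bytes(unsigned int entries, uint64_t page_size)
{
    assert(page_size > 0);
    return (uint64_t)entries * page_size;
}

/*
 * Largest single I/O operation when every scatter/gather entry
 * describes one physically contiguous buffer of buf_size bytes.
 */
uint64_t max_io_bytes(unsigned int sg_entries, uint64_t buf_size)
{
    assert(buf_size > 0);
    return (uint64_t)sg_entries * buf_size;
}
```

With 64 TLB entries, 4KB pages cover 256KB of address space while 2MB pages cover 128MB. With 128 scatter/gather entries, 4KB buffers cap a single operation at 512KB, where 64KB buffers would allow 8MB.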

None of these issues are a surprise; developers have seen them coming for some time. So there are a number of potential solutions waiting in the wings. What is lacking is a consensus on which solution is the best way to go.

One piece of the puzzle may be Mel Gorman's fragmentation avoidance work, which has been discussed here more than once. Mel's patches seek to separate allocations which can be moved in physical memory from those which cannot. When movable allocations are grouped together, the kernel can, when necessary, create higher-order groups of pages by relocating allocations which are in the way. Some of Mel's work is in 2.6.23; more may be merged for 2.6.24. The lumpy reclaim patches, also in 2.6.23, encourage the creation of large blocks by targeting adjacent pages when memory is being reclaimed.

The immediate cause for the current discussion is a new version of Christoph Lameter's large block size patches. Christoph has filled in the largest remaining gap in that patch set by implementing mmap() support. This code enables the page cache to manage chunks of file data larger than a single page which, in turn, addresses many of the I/O and filesystem issues. Christoph has given a long list of reasons why this patch should be merged, but agreement is not universal.

At the top of the list of objections would appear to be the fact that the large block size patches require the availability of higher-order pages to work; there is no fallback if memory becomes sufficiently fragmented that those allocations are not available. So a system which has filesystems using larger block sizes will fall apart in the absence of large, contiguous blocks of memory - and, as we have seen, that is not an uncommon situation on Linux systems. The fragmentation avoidance patches can improve the situation quite a bit, but there is no guarantee that fragmentation will not occur, either as a result of the wrong workload or a deliberate attack. So, if this patch set is merged, some developers want it to include a loud warning to discourage users (and distributors) from actually expecting it to work.

An alternative is Nick Piggin's fsblock work. People like to complain about the buffer head layer in current kernels, but that layer has a purpose: it tracks the mapping between page cache blocks and the associated physical disk sectors. The fsblock patch replaces the buffer head code with a new implementation with the goals of better performance and cleaner abstractions.

One of the things fsblock can do is support large blocks for filesystems. The current patch does not use higher-order allocations to implement this support; instead, large blocks are made virtually contiguous in the vmalloc() space through a call to vmap() - a technique used by XFS now. The advantage of using vmap() is that the filesystem code can see large, contiguous blocks without the need for physical adjacency, so fragmentation is not an issue.

On the other hand, using vmap() is quite slow, the address space available for vmap() on 32-bit systems is small enough to cause problems, and using vmap() does nothing to help at the I/O level. So Nick plans to extend fsblock to implement large blocks with contiguous allocations, but with a fallback to vmap() when large allocations are not available. In theory, this approach should be the best of both worlds, giving the benefits of large blocks without unseemly explosions in the presence of fragmentation. Says Nick:

However fsblock can do everything that higher order pagecache can do in terms of avoiding vmap and giving contiguous memory to block devices by opportunistically allocating higher orders of pages, and falling back to vmap if they cannot be satisfied.

From the conversation, it seems that a number of developers see fsblock as the future. But it is not something for the near future. The patch is big, intrusive, and scary, which will slow its progress (and memory management patches have a tendency to merge at a glacial pace to begin with). It lacks the opportunistic large block feature. Only the Minix filesystem has been updated to use fsblock, and that patch was rather large. Everybody (including Nick) anticipates that more complex filesystems - those with features like journaling - will present surprises and require changes of unknown size. Fsblock is not a near-term solution.

One recently-posted patch from Christoph could help fill in some of the gaps. His "virtual compound page" patch allows kernel code to request a large, contiguous allocation; that request will be satisfied with physically contiguous memory if possible. If that memory is not available, virtually contiguous memory will be returned instead. Beyond providing opportunistic large block allocation for fsblock, this feature could conceivably be used in a number of places where vmalloc() is called now, resulting in better performance when memory is not overly fragmented.

Meanwhile, Andrea Arcangeli has been relatively quiet for some time, but one should not forget that he is the author of much of the VM code in the kernel now. He advocates a different approach entirely:

From my part I am really convinced the only sane way to approach the VM scalability and larger-physically contiguous pages problem is the CONFIG_PAGE_SHIFT patch (aka large PAGE_SIZE from Hugh for 2.4).

The CONFIG_PAGE_SHIFT patch is a rework of an old idea: separate the size of a page as seen by the operating system from the hardware's notion of the page size. Hardware pages can be clustered together to create larger software pages which, in turn, become the basic unit of memory management. If all pages in the system were, say, 64KB in length, a 64KB buffer would be a single-page allocation with no fragmentation issues at all.

If the system is to go to larger pages, creating them in software is about the only option. Most processors support more than one hardware page size, but the smallest of the larger page sizes tend to be too large for general use. For example, i386 processors have no page sizes between 4KB and 2MB. Clustering pages in software enables the use of more reasonable page sizes and creates the flexibility needed to optimize the page size for the expected load on the system. This approach will make large block support easy, and it will help with the I/O performance issues as well. Page clustering is not helpful for TLB pressure problems, but there is little to be done there in any sort of general way.

The biggest problem, perhaps, with page clustering is that it replaces external fragmentation with internal fragmentation. A 64KB page will, when used as the page cache for a 1KB file, waste 63KB of memory. There are provisions in Andrea's patch for splitting large pages to handle this situation; Andrea claims that this splitting will not lead to the same sort of fragmentation seen on current systems, but he has not, yet, convinced the others of this fact.
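The waste figure is easy to reproduce. The following sketch (helper names invented for illustration) computes how many bytes go unused when a file is cached in whole pages of a given size; it models only the arithmetic and says nothing about how Andrea's patch actually splits pages.

```c
#include <assert.h>
#include <stddef.h>

/* Pages needed to hold file_size bytes: ceiling division. */
size_t pages_needed(size_t file_size, size_t page_size)
{
    assert(page_size > 0);
    return (file_size + page_size - 1) / page_size;
}

/* Bytes allocated but unused when the file is cached in whole pages. */
size_t wasted_bytes(size_t file_size, size_t page_size)
{
    return pages_needed(file_size, page_size) * page_size - file_size;
}
```

For the 1KB file above, wasted_bytes(1024, 65536) gives 64512 bytes (63KB), against 3072 bytes (3KB) with 4KB pages.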

Conclusions from this discussion are hard to come by; at one point Mel Gorman asked: "Are we going to agree on some sort of plan or are we just going to handwave ourselves to death?" Linus has just called the whole discussion "idiotic". What may happen is that the large block size patches go in - with warnings - as a way of keeping a small subset of users happy and providing more information about the problem space. Memory management hacking requires a certain amount of black-magic handwaving in the best of times; there is no reason to believe that the waving of hands is going to slow down anytime soon this time around.


A generic tracing interface

By Jonathan Corbet
September 19, 2007
Dynamic kernel tracing remains high on the wishlists presented by many Linux users. While much work has been done to create a powerful tracing capability, very little of that work has found its way into the mainline. The recent posting of one small piece of infrastructure may help to change that situation, though.

The piece in question is the trace layer posted by David Wilder. Its purpose is to make it easy for a tracing application to get things set up in the kernel and allow the user to control the tracing process. To that end, it provides an internal kernel API and a set of control files in the debugfs filesystem.

On the kernel side, a tracing module would set things up with a call to:

    #include <linux/trace.h>

    struct trace_info *trace_setup(const char *root, const char *name,
                                   u32 buf_size, u32 buf_nr, u32 flags);

Here, root is the name of the root directory which will appear in debugfs, name is the name of the control directory within root, buf_size and buf_nr describe the size and number of relay buffers to be created, and flags controls various channel options. The TRACE_GLOBAL_CHANNEL flag says that a single set of relay channels (as opposed to per-CPU channels) should be used; TRACE_FLIGHT_CHANNEL turns on the "flight recorder" mode where relay buffer overruns result in the overwriting of old data, and TRACE_DISABLE_STATE disables control of the channel via debugfs.

The return value (if all goes well) will be a pointer to a trace_info structure for the channel. This structure has a number of fields, but the one which will be of most interest outside of the trace code itself will be rchan, which is a pointer to the relay channel associated with this trace point.

When actual tracing is to begin, the kernel module should make a call to:

    int trace_start(struct trace_info *trace);

The return value follows the "zero or a negative error value" convention. Tracing is turned off with:

    int trace_stop(struct trace_info *trace);

When the tracing module is done, it should shut down the trace with:

    void trace_cleanup(struct trace_info *trace);

Note that none of these entry points have anything to do with the placement or activation of trace points or the creation of trace data. All of that must be done separately by the trace module. So a typical module will, after calling trace_start(), set up one or more kprobes or activate a static kernel marker. The probe function attached to the trace points should do something like this:

    if (trace_running(trace)) {
        /* Format trace data and output via relay */
    }

Additionally, if the TRACE_GLOBAL_CHANNEL flag has been set, the probe function should protect access to the relay channel with a spinlock. This protection may also be necessary in situations where an interrupt handler might be traced.

In user space, the trace information will show up under /debug/root/name, where debug is the debugfs mount point, and root and name are the directory names passed to trace_setup(). The file state can be read to get the current tracing state; an application can write start or stop to this file to turn tracing on or off. The file trace0 is the relay channel where tracing data can be read; on SMP systems with per-CPU channels there will be additional files (trace1...) for additional processors. The file dropped can be read to see how many trace records (if any) have been dropped due to buffer-full conditions.
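A user-space controller for these files might look something like the sketch below. It relies only on the file layout described above; the helper names are invented, and any directory names passed in (such as "mytracer" and "probe0") are placeholders for whatever the tracing module gave to trace_setup().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<debug>/<root>/<name>/<file>" into buf.
 * Returns 0 on success, -1 if the result would not fit.
 */
int trace_path(char *buf, size_t len, const char *debug,
               const char *root, const char *name, const char *file)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "%s/%s/%s/%s", debug, root, name, file);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "start" or "stop" to the state file. Returns 0 on success. */
int trace_set_state(const char *state_path, const char *cmd)
{
    FILE *f = fopen(state_path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(cmd, f) >= 0;
    if (fclose(f) != 0)         /* the write may only fail at close time */
        ok = 0;
    return ok ? 0 : -1;
}
```

A tool would call trace_path(buf, sizeof(buf), "/debug", "mytracer", "probe0", "state"), pass the result to trace_set_state() with "start", and then read trace0 (and trace1... on SMP systems) like any other file.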

All told, it is not a particularly complicated bit of code. Perhaps the most significant feature of this patch is that it is part of the infrastructure created and used by the SystemTap project. Getting this code into the mainline will make it that much easier for distributors to provide well-supported tracing facilities to their users. And that, in turn, should make users happy and give analysts one less thing to complain about.


A summary of 2.6.23 internal API changes

By Jonathan Corbet
September 17, 2007
The final 2.6.23 kernel release is getting closer. At this point, it would be more than surprising to see any additional API changes find their way into this release, so it should be safe to summarize the changes which have been made.

  • The UIO interface for the creation of user-space drivers has been merged. While UIO is aimed at user space, there is a kernel-space component for driver registration and interrupt handling.

  • unregister_chrdev() now returns void.

  • There is a new notifier chain which can be used (by calling register_pm_notifier()) to obtain notification before and after suspend and hibernate operations.

  • The new "lockstat" infrastructure provides statistics on the amount of time threads spend waiting for and holding locks.

  • The new fault() VMA operation replaces nopage() and populate(). See this article for a description of the current fault() API.

  • The generic netlink API now has the ability to register (and unregister) multicast groups on the fly.

  • The destructor argument has been removed from kmem_cache_create(), as destructors are no longer supported. All in-kernel callers have been updated.

  • There is a new clone() flag - CLONE_NEWUSER - which creates a new user namespace for the process; it is intended for use with container systems.

  • There is a new rtnetlink API for managing software network devices.

  • The networking core can now work with devices which have more than one transmit queue. This is a feature which was needed to properly support some wireless devices.

  • The sysfs core has been significantly rewritten to weaken the connection between sysfs entries and internal kobjects. The new code should make life easier for driver writers who will have fewer object lifecycle issues to worry about.

  • The never-used enable_wake() PCI driver method has been removed.

  • Drivers wanting to get the revision ID from the PCI config space should now just use the value found in the new revision member of the pci_dev structure. All in-tree drivers have been changed to use this new approach.

  • The SCSI layer has picked up a couple of scatter/gather accessor functions - scsi_dma_map() and scsi_dma_unmap() - in preparation for chained scatter/gather lists and bidirectional requests. Most drivers in the kernel have been updated to use these functions.

  • The idr code has a couple of new helper functions: idr_for_each() and idr_remove_all().

  • sys_ioctl() is no longer exported to modules.

  • The page table helper functions ptep_establish(), ptep_test_and_clear_dirty() and ptep_clear_flush_dirty() have been removed - they had no in-kernel users.

  • Kernel threads are non-freezable by default; any kernel thread which should be frozen for a suspend-to-disk operation must now call set_freezable() to arrange for that to happen.

  • The SLUB allocator is now the default.

  • The new function is_owner_or_cap(inode) tests for access permission based on the current fsuid and capabilities; it replaces the open-coded test previously found in several filesystems.

  • There is a new utility function:
        char *kstrndup(const char *s, size_t max, gfp_t gfp);
    This function duplicates a string along the lines of the user-space strndup().
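For readers unfamiliar with strndup() semantics, here is a user-space analogue of what kstrndup() is described as doing. It is a sketch rather than the kernel implementation: the name dup_at_most() is invented, malloc() stands in for the kernel allocator, and the gfp argument has no user-space counterpart.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy at most max bytes of s into freshly allocated memory and
 * always NUL-terminate the result. Returns NULL if allocation
 * fails; the caller is responsible for freeing the copy.
 */
char *dup_at_most(const char *s, size_t max)
{
    size_t len = 0;
    char *copy;

    assert(s != NULL);
    while (len < max && s[len] != '\0')   /* never read past max bytes */
        len++;

    copy = malloc(len + 1);
    if (!copy)
        return NULL;
    memcpy(copy, s, len);
    copy[len] = '\0';
    return copy;
}
```

So dup_at_most("hello world", 5) returns a new string "hello", while a max larger than the source length simply copies the whole string.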

As always, a cumulative record of API changes can be found in the LWN 2.6 API changes page.


Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds