User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.25-rc1, released by Linus on February 10. It is a huge patch. Among many other things, 2.6.25 will have realtime group scheduling, preemptible RCU, LatencyTop support, a bunch of ext4 filesystem enhancements, the controller area network protocol, Atheros wireless support, the reworked timerfd() system call, the page map patches, the SMACK security module, the container memory use controller, the ACPI thermal regulation API, and support for the MN10300/AM33 architecture. See the short-form changelog for lots of details, or the long changelog for more detail than anybody can cope with.

As of this writing, a few dozen small fixes have gone into the mainline git repository since the -rc1 release.

The current stable 2.6 kernel is, released on February 10. This update contains a single patch fixing the vmsplice() vulnerability. was released - with a rather longer list of fixes - on February 8.

For older kernels: and both come out on February 10; they, too, contain the vmsplice() fix. was released on February 8 with a few dozen fixes. And, also with quite a few fixes, came out on February 6.

Comments (1 posted)

Kernel development news

Quotes of the week

Remember, we are currently clocking along at the steady rate of:
   4000 lines added every day
   1900 lines removed every day
   1300 lines modified every day
-- Greg Kroah-Hartman

  ???? lines reviewed every day.
-- Al Viro

Comments (none posted)

Before the 2.6.25 merge window closed...

By Jonathan Corbet
February 12, 2008
The 2.6.25 merge window closed on February 10, after the merging of an eye-opening 9450 non-merge changesets. Most of the changes merged for 2.6.25 were covered in the first and second "what got merged" articles. This, the third in the series, covers the final 1900 patches merged before the window closed.

User-visible changes include:

  • There are new drivers for SC2681/SC2691-based serial ports, Dallas DS1511 timekeeping chips, AT91sam9 realtime clock devices, Compaq ASIC3 multi-function chips, Cell Broadband Engine memory controllers, Marvell MV64x60 memory controllers, PA Semi PWRficient NAND flash interfaces, Marvell Orion NAND flash controllers, Freescale eLBC NAND flash controllers, Sharp Zaurus SL-6000x keyboards, Fujitsu Lifebook Application Panel buttons, IPWireless 3G UMTS PCMCIA cards, intelligent storage device enclosures, Winbond W83L786NG and W83L786NR sensor chips, Texas Instruments ADS7828 12-bit 8-channel ADC devices, and Sony MemoryStick cards.

  • Also added are updated video drivers for Radeon R500 chipsets (2D acceleration is now supported) and Intel i915 chipsets (suspend and resume now work properly).

  • Several more obsolete OSS audio drivers have been removed. The old mxser driver has also been removed in favor of mxser_new, now called simply "mxser."

  • File descriptors returned by inotify_init() now support signal-based (using SIGIO) I/O. There is also a new notification event (IN_ATTRIB) sent when the link count of a watched file changes.

  • The mac80211 (formerly Devicescape) wireless subsystem is no longer marked "experimental."

  • The memory use controller for containers has been merged. This controller was described in this LWN article, but the patch has evolved somewhat since then and the details have changed. Some documentation can be found in Documentation/controllers/memory.txt.

  • ACPI thermal regulation support has been added; see Documentation/thermal/sysfs-api.txt for details on how it works. The ACPI code also now supports the Windows Management Instrumentation interface, and uses that support to make recent Acer laptops work.

  • ACPI now provides support for users who want to override their system's Differentiated System Description Table (DSDT).

  • The XFS filesystem now supports the fallocate() system call.

  • ATA-over-Ethernet (AoE) now properly supports devices with multiple network interfaces (and, thus, multiple paths to the host).

  • Support for the MN10300 architecture (little-endian mode only) has been added.

  • Support for a.out binaries has been removed from the ELF loader. Pure a.out systems will still work, though.

  • Disk I/O statistics (as seen in /proc/diskstats and under /sys/block) have been augmented with more information about request merging and I/O wait time.

  • The S390 architecture now implements dynamic page tables - processes will use 2-, 3-, or 4-level page tables depending on the size of their address space.

  • The ext4 "in development" flag has been added; mounting an ext4 filesystem will now require an explicit "I know this might explode" option.

Changes visible to kernel developers include:

  • Many nopage() methods have been replaced by the newer fault() API; the near-term plan is to remove nopage() altogether. See this article for a description of the new way of "page not present" handling.

  • This cycle has also seen a bit of a reinvigoration of the long-stalled project to eliminate the big kernel lock. A number of BKL-removal patches have been merged, with more certainly to come.

  • A generic resource counter mechanism was merged as part of the memory controller patch set; see <linux/res_counter.h> for the details.

  • reserve_bootmem() has a new flags parameter. Most callers will set it to BOOTMEM_DEFAULT; the kdump code, though, uses BOOTMEM_EXCLUSIVE to ensure that it is the only one to touch the memory.

  • Most architectures now have support for cmpxchg64() and cmpxchg_local().

  • There is a new set of string functions:

        extern int strict_strtoul(const char *string, unsigned int base, 
                                  unsigned long *result);
        extern int strict_strtol(const char *string, unsigned int base,
        	       		     long *result);
        extern int strict_strtoull(const char *string, unsigned int base,
                                   unsigned long long *result);
        extern int strict_strtoll(const char *string, unsigned int base,
                                  long long *result);

    These functions convert the given strings to various forms of long values, but they will return an error status if the given string value, as a whole, does not represent a proper integer value. These functions are now used in the parsing of kernel parameters.

At this point, the merging of features is done (though there has been a bit of pushing for one or two things to slip in) and the stabilization period begins. With luck, that process will go a little more quickly than it did with 2.6.24.

Comments (7 posted)

linux-next and patch management process

By Jonathan Corbet
February 13, 2008
The kernel development process operates at a furious pace, merging on the order of 10,000 changesets over the course of a 2-3 month release cycle. There have been many changes over the last few years which have helped to make this level of patch flow possible, and the process has been optimized significantly. An ongoing discussion on the kernel mailing list has made it clear, though, that a truly optimal solution has not yet been found.

It started with the announcement of the linux-next tree. This tree, to be maintained by Stephen Rothwell, is intended to be a gathering point for the patches which are planned to be merged in the next development cycle. So, since we are currently in the 2.6.25 cycle, linux-next will accumulate patches for 2.6.26. The idea is to solve the patch integration issues there and reduce the demands on Andrew Morton's time.

The question which was immediately raised was this: how do we deal with big API changes which require changes in multiple subsystems? These changes are already problematic, often requiring maintainers to rework their trees in the middle of the merge window. Trying to integrate such changes earlier, in a separate tree, could bring a new set of problems. There will be a lot of conflicts between patches done before and after the API change, and somebody is going to have to put the pieces back together again. Andrew does some of that now, but the problem is big enough that not even Andrew can solve it all the time. The bidirectional SCSI patches merged for 2.6.25 were held up as an example; that change required coordinated SCSI and block layer patches, and it never was possible to get the whole thing working in -mm.

Arjan van de Ven asserted that the only way to make large API changes work is to merge them first, at the beginning of the merge window. The merged patch would fix all in-tree users of the changed API, as is the usual rule. Maintainers of all other trees could then merge with the updated mainline, fixing any new code which might be affected by the API change. This is, essentially, the approach which was taken for the big device model changes in 2.6.25; they hit the mainline at the beginning of the merge window, then everybody else got to adapt to the new way of doing things.

Greg Kroah-Hartman worries that this approach is not sufficient, especially when live trees are being merged. If an API change in one tree forces a change to a separate tree, the coordination issues just get hard. Keeping the secondary changes in the primary tree risks conflicts with patches in the proper subsystem tree. Patches which reach across trees are also, increasingly, being discouraged as making life harder for everybody. But the fixup patch will not apply to its nominal subsystem tree as long as the API change itself is not there. In the -mm tree, this sort of problem is glued together by a series of fixup patches maintained by Andrew; Greg says that the linux-next tree would need something similar.

David Miller's suggestion was to resolve this sort of conflict through frequent rebasing of the -next tree. Rebasing is an operation (supported by git and other code management tools) which takes a set of patches against one tree and does what's required to make them apply to a different version of the tree. It can be quite useful for maintaining patches against a moving target - which kernel trees tend to be. David talked about how he rebases his (networking subsystem) trees frequently as a way of eliminating conflicts with the mainline and, in the process, cleaning some cruft out of the development history.

It turns out, though, that this frequent rebasing is not popular with the developers who are downstream of David. Rebasing the tree forces all downstream contributors to do the same thing, and to deal with any merge conflicts that result. It makes it much harder to prepare trees which can be pulled upstream and creates extra work.

This was where Linus jumped into the conversation and expressed his dislike of rebasing. He echoed the complaints from downstream developers that a constantly-rebased tree is hard to prepare patches against. It also confuses the development history, making changes to other developers' patches in silent ways. After somebody's patch set has been rebased, it is no longer the patches that were sent. So, says Linus:

So there's a real reason why we strive to *not* rewrite history. Rewriting history silently turns tested code into totally untested code, with absolutely no indication left to say that it now is untested.

It is about here that Andrew Morton commented that git does not appear to be matching entirely well with the way that kernel developers work. Some of the solution may be found in tools more oriented toward the management of patch queues - such as quilt. There may be a renewed push to get more quilt-like functionality built into git (along the lines of the stacked git project) in the near future.

Linus is also not entirely pleased with how the integration of patches only happens in the mainline:

I'm also a bit unhappy about the fact you think all merging has to go through my tree and has to be visible during the two-week merge period. Quite frankly, I think that you guys could - and should - just try to sort API changes out more actively against each other, and if you can't, then that's a problem too.

His suggestion is that a separate git tree should be created to contain a large API change - and nothing else. Affected subsystem maintainers could then merge that tree and develop against the result. In the end, all of the pieces should merge nicely in the mainline.

This approach raises a number of interesting issues. The API-change tree has to be agreed upon by everybody, and it must be quite stable - lots of changes at that level will create downstream trouble. There must also be a high degree of confidence that this API-change tree will, in fact, get merged into the mainline; should Linus balk, everybody else's trees will no longer be applicable to the mainline. Replacing the current "tree of trees" patch flow with something messier could create a number of coordination issues. And there are fears that a mainline tree built from this process would fail to build in many of its intermediate states, which would make tools like "git bisect" much harder to use. Even so, it could be part of the long-term solution.

Linus also took the opportunity to complain about large-scale API changes in general:

Really. I do agree that we need to fix up bad designs, but I disagree violently with the notion that this should be seen as some ongoing thing. The API churn should absolutely *not* be seen as a constant pain, and if it is (and it clearly is) then I think the people involved should start off not by asking "how can we synchronize", but looking a bit deeper and saying "what are we doing wrong?"

He also stated that the costs of big API changes are high enough that we should, more often, stay with older interfaces, even if they are not as good as they could be. Others disagreed, claiming that Linux must continue to evolve if it is to stay alive and relevant.

The rate of change seems unlikely to fall in the near future. There may be some changes to how big changes are done, though. As suggested by Ted Ts'o, more changes could be done by creating entirely new interfaces rather than breaking old ones. With Ted's scheme, the old interface would be marked "deprecated" at the beginning of the merge window. Developers would then have the entire development cycle to adjust to the change, and the deprecated interface would be removed before the final release.

There is resistance to this approach, based on the observation that getting rid of deprecated interfaces tends to be harder than one would expect. But, still, it is a relatively painless way of making changes. The current transition (in the memory management area) from the nopage() VMA operation to fault() is an example of how it can work. Nick Piggin has been slowly changing in-tree users with the eventual goal of removing nopage() altogether. For now, though, both interfaces coexist in the tree and nothing has been broken.

Like the kernel itself, its development process is undergoing constant change and (hopefully) improvement. As the development community and the rate of change continues to grow, the process will have to adjust accordingly. What changes come out of this discussion remain to be seen. But it's worth noting that Andrew Morton fears that the biggest problem - regressions and bugs - will be relatively unaffected.

Comments (none posted)

vmsplice(): the making of a local root exploit

By Jonathan Corbet
February 12, 2008
As this is being written, distributors are working quickly to ship kernel updates fixing the local root vulnerabilities in the vmsplice() system call. Unlike a number of other recent vulnerabilities which have required special situations (such as the presence of specific hardware) to exploit, these vulnerabilities are trivially exploited and the code to do so is circulating on the net. Your editor found himself wondering how such a wide hole could find its way into the core kernel code, so he set himself the task of figuring out just what was going on - a task which took rather longer than he had expected.

The splice() system call, remember, is a mechanism for creating data flow plumbing within the kernel. It can be used to join two file descriptors; the kernel will then read data from one of those descriptors and write it to the other in the most efficient way possible. So one can write a trivial file copy program which opens the source and destination files, then splices the two together. The vmsplice() variant connects a file descriptor (which must be a pipe) to a region of user memory; it is in this system call that the problems came to be.

The first step in understanding this vulnerability is that, in fact, it is three separate bugs. When the word of this problem first came out, it was thought to only affect 2.6.23 and 2.6.24 kernels. Changes to the vmsplice() code had caused the omission of a couple of important permissions checks. In particular, if the application had requested that vmsplice() move the contents of a pipe into a range of memory, the kernel didn't check whether that application had the right to write to that memory. So the exploit could simply write a code snippet of its choice into a pipe, then ask the kernel to copy it into a piece of kernel memory. Think of it as a quick-and-easy rootkit installation mechanism.

If the application is, instead, splicing a memory range into a pipe, the kernel must, first, read in one or more iovec structures describing that memory range. The 2.6.23 vmsplice() changes omitted a check on whether the purported iovec structures were in readable memory. This looks more like an information disclosure vulnerability than anything else - though, as we will see, it can be hard to tell sometimes.

These two vulnerabilities (CVE-2008-0009 and CVE-2008-0010) were patched in the and kernel updates, released on February 8.

On February 10, Niki Denev pointed out that the kernel appeared to be still vulnerable after the fix. In fact, the vulnerability was the result of a different problem - and it is a much worse one, in that kernels all the way back to 2.6.17 are affected. At this point, a large proportion of running Linux systems are vulnerable. This one has been fixed in the,, and kernels, also released on the 10th. At this point, with luck, all of these bugs have been firmly stomped - though, now, we need to see a lot of distributor updates.

The problem, once again, is in the memory-to-pipe implementation. The function get_iovec_page_array() is charged with finding a set of struct page pointers corresponding to the array of iovec structures passed in by the calling application. Those pointers are stored in this array:

    struct page *pages[PIPE_BUFFERS];

Where PIPE_BUFFERS happens to be 16. In order to avoid overflowing this array, get_iovec_page_array() does the following check:

    npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (npages > PIPE_BUFFERS - buffers)
	npages = PIPE_BUFFERS - buffers;

Here, off is the offset into the first page of the memory to be transferred, len is the length passed in by the application, and buffers is the current index into the pages array.

Now, if we turn our attention to the exploit code for a moment, we see it setting up a number of memory areas with mmap(); some of that setup is not necessary for the exploit to work, as it turns out. At the end, the code does this (edited slightly):

    iov.iov_base = map_addr;
    iov.iov_len  = ULONG_MAX;
    vmsplice(pi[1], &iov, 1, 0);

The map_addr address points to one of the areas created with mmap() which, crucially, is significantly more than PIPE_BUFFERS pages long. And the length is passed through as the largest possible unsigned long value.

Now let's go back to fs/splice.c, where the vmsplice() implementation lives. We note that, prior to the fix, the kernel did not check whether the memory area pointed to by the iovec structure was readable by the calling process. Once again, this looks like an information disclosure vulnerability - the process could cause any bit of kernel memory to be written to the pipe, from which it could be read. But the exploit code is, in fact, passing in a valid pointer - it's just the length which is clearly absurd.

Looking back at the code which calculates npages, we see something interesting:

    npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (npages > PIPE_BUFFERS - buffers)
	npages = PIPE_BUFFERS - buffers;

Since len will be ULONG_MAX when the exploit runs, the addition will cause an integer overflow - with the effect that npages is calculated to be zero. Which, one would think, would cause no pages to be examined at all. Except that there is an unfortunate interaction with another part of the kernel.

Once npages has been calculated, the next line of code looks like this:

    error = get_user_pages(current, current->mm,
		       	   (unsigned long) base, npages, 0, 0,
		       	   &pages[buffers], NULL);

get_user_pages() is the core memory management function used to pin a set of user-space pages into memory and locate their struct page pointers. While the npages variable passed as an argument is an unsigned quantity, the prototype for get_user_pages() declares it as a simple int called len. And, to complete the evil, this function processes pages in a do {} while(); loop which ends thusly:

    } while (len && start < vma->vm_end);

So, if get_user_pages() is passed with a len argument of zero, it will pass through the mapping loop once, decrement len to a negative number, then continue faulting in pages until it hits an address which lacks a valid mapping. At that point it will stop and return. But, by then, it may have stored far more entries into the pages array than the caller had allocated space for.

The practical result in this case is that get_user_pages() faults in (and stores struct page pointers for) the entire region mapped by the exploit code. That region (by design) has more than PIPE_BUFFERS pages - in fact, it has three times that many, so 48 pointers get stored into a 16-pointer array. And this turns the failure to read-verify the source array into a buffer overflow vulnerability within the kernel. Once that is in place, it is a relatively straightforward exercise for any suitably 31337 hacker to cause the kernel to jump into the code of his or her choice. Game over. (Update: as a linux-kernel reader pointed out, the story is a little more complicated still at this point; this is an unusual sort of buffer overflow attack).

The fix which was applied simply checks the address range that the application is trying to splice into the pipe. Since a range of length ULONG_MAX is unlikely to be valid, the vulnerability is closed - as are any potential information disclosure problems.

This vulnerability is a clear example of how a seemingly read-only vulnerability can be escalated into something rather more severe. It also shows what can happen when certain types of sloppiness find their way into the code - if get_user_pages() is asked to get zero pages, that's how many it should do. Your editor is working on a patch to clean that up a bit. Meanwhile, everybody should ensure that they are running current kernels with the vulnerability closed.

Comments (91 posted)

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers


Filesystems and block I/O

Memory management


Virtualization and containers


Page editor: Jake Edge
Next page: Distributions>>

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds