
Kernel development

Brief items

Kernel release status

The 2.6.26 kernel is out, released by Linus on July 13. For those just tuning in, some of the bigger changes in 2.6.26 include PAT support in the x86 architecture, read-only bind mounts, the KGDB debugger, a lot of virtualization work, and more. See the KernelNewbies 2.6.26 page for lots of details.

The 2.6.27 merge window has opened, with some 3000 changesets incorporated as of this writing. See the separate article below for a summary of what has been merged (so far) for the next development cycle.

The stable kernel update was released on July 13. It contains a single fix for a locally-exploitable vulnerability (limited to x86_64 systems).


Kernel development news

Quotes of the week

That said, I didn't actually _test_ my patch. That's what users are for!
-- Linus Torvalds

6924d1ab8b7bbe5ab416713f5701b3316b2df85b is a work of art. Is it ascii-art tetris? a magic eye picture? you decide! It looks even more spectacular in gitk.
-- Dave Jones

[art of Ingo Molnar]
-- Ingo Molnar

But I could also see the second number as being the "year", and 2008 would get 2.8, and then next year I'd make the first release of 2009 be 2.9.1 (and probably avoid the ".0" just because it again has the connotations of a "big new untested release", which is not true in a date-based numbering scheme). And then 2010 would be 3.0.1 etc..

Anyway, I have to say that I personally don't have any hugely strong opinions on the numbering. I suspect others do, though, and I'm almost certain that this is an absolutely _perfect_ "bikeshed-painting" subject where thousands of people will be very passionate and send me their opinions on why _their_ particular shed color is so much better.

-- Linus Torvalds opens the can of worms

Indeed, I apologise for reviewing the code on a monitor that is wider than yours. If only we could make sure that all Linux developers used smaller monitors then the code quality would surely improve!
-- Herbert Xu

And we should obviously have _a_ version of the firmware available with the kernel when that is possible. But I'd hate for it to be 1:1 with a particular driver version - because at that point it smells of being a single work, and if it is more than mere aggregation it's no longer viable with most of our firmware (I don't think we have source for more than one or two cases).
-- Linus Torvalds


2.6.27: what's coming (part 1)

By Jonathan Corbet
July 16, 2008
Linus wasted no time after the 2.6.26 release; he opened the 2.6.27 merge window less than 24 hours later. As of this writing, the process has barely begun with a mere 3000 changesets merged. So we do not have a complete picture of what will be in the next kernel release. But we can look at what has been merged so far.

User-visible changes include:

  • New drivers for CompuLab EM-x270 audio devices (as found on the Toshiba e800 PDA), Philips UDA1380 codecs, Wolfson Micro WM8510 and WM8990 codecs, Atmel AT32 audio devices, AK4535 codecs, SGI HAL2 audio devices (as found in Indy and Indigo2 workstations), SGI O2 audio boards, crypto engines found in Intel IXP4xx processors, Freescale Security Engine processors, AMD I/O memory management units, Marvell Loki (88RC8480), Kirkwood (88F6000), and Discovery Duo (MV78xx0) system-on-chip processors, IBM Power Virtual Fibre Channel Adapters, and GEFanuc C2K cPCI single-board computers.

  • The old "ppc" architecture has been removed; all platforms are now supported by the integrated "powerpc" architecture code.

  • The SCSI command filter - which controls which SCSI commands can be sent to a device by which kind of user - is now per-device and can be changed via sysfs.

  • The block subsystem now has support for hardware which can perform data integrity checking; this will allow some kinds of errors to be caught before the associated data is lost forever. See this article for more information on the block-layer integrity feature.

  • The "dummy" Linux security module has been removed; the default module is now the capabilities module.

  • The crypto code has gained support for the RIPEMD-128, RIPEMD-160, RIPEMD-256, and RIPEMD-320 hash algorithms. Asynchronous hashing is now supported and is implemented by the "cryptd" software crypto daemon.

  • Xen now has support for the saving and restoring of virtual machines - possibly migrating them to different hosts in between.

  • The new virtual file /sys/firmware/memmap shows the memory map as it was configured by the system BIOS before the kernel booted.

  • The ftrace lightweight tracing framework has been merged. See Documentation/ftrace.txt for more information on ftrace.

  • The mmiotrace tool has been merged. Mmiotrace will capture and print out memory-mapped I/O accesses, making it a useful tool for the reverse-engineering of binary drivers.

  • The ARM and powerpc architectures now support the latencytop tool.

  • The RDMA code has acquired support for the InfiniBand "base memory management extension" operations. The IP-over-InfiniBand code can now perform large receive offload (LRO).

  • Delayed allocation support has been added to the ext4 filesystem, which is getting quite close to its target feature set.

  • The SATA layer now has enclosure management support; this allows the system to do things like blink an LED to indicate a specific drive in a large enclosure.

  • The SGI IRIX binary compatibility layer has been removed.
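The /sys/firmware/memmap entry in the list above can be read with a few lines of user-space code. The sketch below assumes the layout described in the kernel's sysfs ABI documentation: one numbered subdirectory per memory region, each holding "start", "end" (inclusive), and "type" files, with the addresses written in hexadecimal. The function name and its "root" parameter are illustrative and do not come from the article.

```python
import os
from typing import List, Tuple


def read_firmware_memmap(root: str = "/sys/firmware/memmap") -> List[Tuple[int, int, str]]:
    """Return (start, end, type) for each firmware-provided memory region.

    Assumed layout: <root>/<N>/start, <root>/<N>/end, <root>/<N>/type.
    """
    regions = []
    for entry in os.listdir(root):
        if not entry.isdigit():
            continue  # skip anything that is not a numbered region directory

        def read(name: str, entry: str = entry) -> str:
            with open(os.path.join(root, entry, name)) as f:
                return f.read().strip()

        # start/end are hexadecimal strings such as "0x100000"
        regions.append((int(read("start"), 16), int(read("end"), 16), read("type")))
    return sorted(regions)
```

Calling read_firmware_memmap() with no arguments on a kernel that provides this directory should return the regions sorted by start address; passing a different root makes the parser easy to exercise against a fake directory tree.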

Changes visible to kernel developers include:

  • The register_security() function has been removed. Security modules which wish to implement stacking must now do so explicitly.

  • The request_queue_t type is gone at last; block drivers should use struct request_queue instead.

  • Quite a bit of big kernel lock removal work has been merged. For char devices, the open() method from struct file_operations is no longer protected by the BKL. Calls to fasync() have also lost BKL protection.

  • Many drivers have been converted to use the firmware loader, making it possible to strip the firmware from the kernel for those who are inclined to do so. See this article for more information on the firmware work.

  • The API work in the i2c layer continues; there is now an autodetection capability which allows new-style drivers to detect devices on their buses automatically.

  • The SCSI layer has gained new support for "device handlers," which are mostly concerned with multipath management. Some of this code has been moved over from the device mapper.

Come back next week for the next episode in the "what's coming in 2.6.27" series.


Block layer: integrity checking and lots of partitions

By Jonathan Corbet
July 15, 2008
One likes to think of disk drives as being a reliable store of data. As long as nothing goes so wrong as to let the smoke out of the device, blocks written to the disk really should come back with the same bits set in the same places. The reality of the situation is a bit less encouraging, especially when one is dealing with the sort of hardware which is available at the local computer store. Stories of blocks which have been corrupted, or which have been written to a location other than the one which was intended, are common.

For this reason, there is steady interest in filesystems which use checksums on data stored to block devices. Rather than take the device's word that it successfully stored and retrieved a block, the filesystem can compare checksums and be sure. A certain amount of checksumming is also done by paranoid applications in user space. The checksums used by BitKeeper are said to have caught a number of corruption problems; successor tools like git have checksums wired deeply into their data structures. If a disk drive corrupts a git repository, users will know about it sooner rather than later.

Checksums are a useful tool, but they have one minor problem: checksum failures tend to come when they are too late to be useful. By the time a filesystem or application notices that a disk block isn't quite what it once was, the original data may be long-gone and unrecoverable. But disk block corruption often happens in the process of getting the data to the disk; it would sure be nice if the disk itself could use a checksum to ensure that (1) the data got to the disk intact, and (2) the disk itself hasn't mangled it.

To that end, a few standards groups have put together schemes for the incorporation of data integrity checking into the hardware itself. These mechanisms generally take the form of an additional eight-byte checksum attached to each 512-byte block. The host system generates the checksum when it prepares a block for writing to the drive; that checksum will follow the data through the series of host controllers, RAID controllers, network fabrics, etc., with the hardware verifying the checksum along each step of the way. The checksum is stored with the data, and, when the data is read in the future, the checksum travels back with it, once again being verified at each step. The end result should be that data corruption problems are caught immediately, and in a way which identifies which component of the system is at fault.
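To make the eight-bytes-per-block idea concrete, here is a minimal user-space sketch. The article does not spell out how the eight bytes are laid out; the sketch assumes the arrangement used by the T10 DIF standard: a two-byte guard CRC over the data, a two-byte application tag, and a four-byte reference tag holding the low 32 bits of the target block address. All function names are illustrative.

```python
import struct

SECTOR_SIZE = 512
T10_DIF_POLY = 0x8BB7  # CRC-16 polynomial of the T10 DIF guard tag (assumed layout)


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_tuple(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte integrity tuple that travels with one 512-byte sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected a 512-byte sector")
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify_tuple(sector: bytes, lba: int, integrity: bytes) -> bool:
    """Check both the data (guard) and the destination address (reference tag)."""
    guard, _app_tag, ref = struct.unpack(">HHI", integrity)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)
```

The guard catches corrupted data, while the reference tag catches the other failure described earlier: a block written to a location other than the intended one. Any component along the I/O path can run the equivalent of verify_tuple(), which is how a failure can be pinned to a specific stage.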

Needless to say, this integrity mechanism requires operating system support. As of the 2.6.27 kernel, Linux will have such support, at least for SCSI and SATA drives, thanks to Martin Petersen. The well-written documentation file included with the data integrity patches envisions three places where checksum generation and verification can be performed: in the block layer, in the filesystem, and in user space. Truly end-to-end protection seems to need user-space verification, but, for now, the emphasis is on doing this work in the block layer or filesystem - though, as of this writing, no integrity-aware filesystems exist in the mainline repository.

Drivers for block devices which can manage integrity data need to register some information with the block layer. This is done by filling in a blk_integrity structure and passing it to blk_integrity_register(). See the document for the full details; in short, this structure contains two function pointers. generate_fn() generates a checksum for a block of data, and verify_fn() will verify a checksum. There are also functions for attaching a tag to a block - a feature supported by some drives. The data stored in the tag can be used by filesystem-level code to, for example, ensure that the block is really part of the file it is supposed to belong to.
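The registration pattern described above can be modeled in a few lines. This is a user-space illustration only, not the kernel API: the IntegrityProfile class, the integrity_register() and integrity_profile() helpers, and the use of CRC32 as the checksum are all assumptions made for the sketch. Only the generate_fn and verify_fn names come from the article.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class IntegrityProfile:
    """Stand-in for the pair of callbacks a driver hands to the block layer."""
    name: str
    generate_fn: Callable[[bytes], bytes]       # data block -> checksum
    verify_fn: Callable[[bytes, bytes], bool]   # (data block, checksum) -> ok?


_registered: Dict[str, IntegrityProfile] = {}


def integrity_register(disk: str, profile: IntegrityProfile) -> None:
    """Record the profile for a disk (hypothetical analogue of registration)."""
    _registered[disk] = profile


def integrity_profile(disk: str) -> Optional[IntegrityProfile]:
    """Return the disk's profile, or None if the disk is not integrity-capable."""
    return _registered.get(disk)


def crc32_generate(block: bytes) -> bytes:
    # CRC32 is used here purely as a convenient stand-in checksum.
    return zlib.crc32(block).to_bytes(4, "big")


def crc32_verify(block: bytes, checksum: bytes) -> bool:
    return crc32_generate(block) == checksum


integrity_register("sda", IntegrityProfile("crc32-demo", crc32_generate, crc32_verify))
```

Keeping the checksum behind two function pointers means the generic layer never needs to know which algorithm a given device expects; it looks up the profile and calls through it.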

The block layer will, in the absence of an integrity-aware filesystem, prepare and verify checksum data itself. To that end, the bio structure has been extended with a new bi_integrity field, pointing to a bio_vec structure describing the checksum information and some additional housekeeping. Happily, the integrity standards were written to allow the checksum information to be stored separately from the actual data; the alternative would have been to modify the entire Linux memory management system to accommodate that information. The bi_integrity area is where that information goes; scatter/gather DMA operations are used to transfer the checksum and data to and from the drive together.
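The value of keeping the checksums in a separate buffer is easiest to see in a small model. In the sketch below, the Bio class, its field names, and both helper functions are hypothetical; the eight-byte tuple is a CRC32 stand-in plus a sector index rather than any real on-the-wire format. The point is only that the data buffer is never resized or interleaved with metadata.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional

SECTOR = 512
TUPLE = 8


def _tuple_for(sector: bytes, index: int) -> bytes:
    """8 bytes per sector: 4-byte CRC32 stand-in plus 4-byte sector index."""
    return zlib.crc32(sector).to_bytes(4, "big") + index.to_bytes(4, "big")


@dataclass
class Bio:
    data: bytearray                        # the payload, laid out as before
    integrity: Optional[bytearray] = None  # separate buffer, like bi_integrity


def integrity_prep_write(bio: Bio) -> None:
    """Fill the side buffer with one tuple per sector before a write."""
    bio.integrity = bytearray()
    for i in range(len(bio.data) // SECTOR):
        bio.integrity += _tuple_for(bytes(bio.data[i * SECTOR:(i + 1) * SECTOR]), i)


def integrity_check_read(bio: Bio) -> List[int]:
    """Return the indexes of sectors whose tuple no longer matches the data."""
    bad = []
    for i in range(len(bio.data) // SECTOR):
        expected = _tuple_for(bytes(bio.data[i * SECTOR:(i + 1) * SECTOR]), i)
        if bio.integrity[i * TUPLE:(i + 1) * TUPLE] != expected:
            bad.append(i)
    return bad
```

Because the payload stays a plain run of 512-byte sectors, code that only cares about the data (the page cache, for instance) needs no changes; only the transfer path has to gather the two buffers together.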

Integrity-aware filesystems, when they exist, will be able to take over the generation and verification of checksum data from the block layer. A call to bio_integrity_prep() will prepare a given bio structure for integrity verification; it's then up to the filesystem to generate the checksum (for writes) or check it (for reads). There's also a set of functions for managing the tag data; again, see the document for the details.

Extended partitions

One of the more irritating and long-lived limitations of the Linux block layer has been the cap on the number of partitions which can be created on any one device. IDE devices can handle up to 64 partitions, which is usually enough, but SCSI devices can only manage 16 - including one reserved for the full device. As these devices get larger, and as applications which benefit from filesystem isolation (virtualization, for example) become more popular, this limit only becomes more irksome.

The interesting thing is that the work needed to circumvent this problem was done some years ago when device numbers were extended to 32 bits. Some complicated schemes were proposed back in 2004 as a way of extending the number of partitions while not changing any existing device numbers, but that approach was never adopted. In the meantime, increasing use of tools like udev has pretty much eliminated the need for device number compatibility; on most distributions, there are no persistent device files anymore.
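The arithmetic behind the 16-partition limit is simple enough to sketch. The code below assumes the kernel-internal device number encoding (a 12-bit major number above a 20-bit minor number) and the traditional SCSI disk numbering, in which major 8 gives each disk a fixed block of 16 minors. The function names are illustrative, and only the first SCSI disk major is modeled.

```python
MINORBITS = 20                      # kernel-internal split: 12-bit major, 20-bit minor
MINORMASK = (1 << MINORBITS) - 1


def mkdev(major: int, minor: int) -> int:
    return (major << MINORBITS) | minor


def major(dev: int) -> int:
    return dev >> MINORBITS


def minor(dev: int) -> int:
    return dev & MINORMASK


SD_MAJOR = 8
SD_MINORS_PER_DISK = 16             # whole device plus 15 partitions


def sd_devt(disk_index: int, partno: int) -> int:
    """Static device number for partition `partno` (0 = whole disk)."""
    if not 0 <= partno < SD_MINORS_PER_DISK:
        raise ValueError("no minor number reserved for this partition")
    if not 0 <= disk_index < 256 // SD_MINORS_PER_DISK:
        raise ValueError("only the first SCSI disk major is modeled here")
    return mkdev(SD_MAJOR, disk_index * SD_MINORS_PER_DISK + partno)
```

With a million minor numbers available per major, the space itself is no longer the constraint; what remains is the fixed stride of 16 baked into the static numbering, which is why partition 16 has nowhere to go.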

So when Tejun Heo revisited the partition limit problem, he didn't bother with obscure bit-shuffling schemes. Instead, with his patch set, block devices simply move to a new major device number and have all minor numbers dynamically assigned. That means that no block device has a stable (across boots) number; it also means that the minor numbers for partitions on the same device are not necessarily grouped together. But, since nobody really ever sees the device numbers on a contemporary distribution, none of this should matter.

Tejun's patch series is an interesting exercise in slowly evolving an interface toward a final goal, with a number of intermediate states. In the end, the API as seen by block drivers changes very little. There is a new flag (GENHD_FL_EXT_DEVT) which allows the disk to use extended partition numbers; once the number of minor numbers given to alloc_disk() is exhausted, any additional partitions will be numbered in the extended space. The intended use, though, would appear to be to allocate no traditional minor numbers at all - allocating disks with alloc_disk(0) - and creating all partitions in that extended space. Tejun's patch causes both the IDE and sd drivers to allocate gendisk structures in that way, moving all disks on most systems into the (shared) extended number space.
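The allocation behavior described above can be modeled as follows. This is a sketch of the idea rather than of the patch itself: the class and method names are made up, and the value 259 for the shared extended major is an assumption, not something stated in the article.

```python
import itertools
from typing import Dict, Tuple

BLOCK_EXT_MAJOR = 259  # assumed number of the shared "extended" major


class ExtDevtPool:
    """Hands out minors in the shared extended space, first come, first served."""

    def __init__(self) -> None:
        self._next_minor = itertools.count()

    def alloc(self) -> Tuple[int, int]:
        return (BLOCK_EXT_MAJOR, next(self._next_minor))


class Disk:
    def __init__(self, name: str, major: int, first_minor: int,
                 minors: int, pool: ExtDevtPool) -> None:
        self.name = name
        self.major = major
        self.first_minor = first_minor
        self.minors = minors          # traditional minors; 0 models alloc_disk(0)
        self.pool = pool
        self._ext: Dict[int, Tuple[int, int]] = {}

    def add_partition(self, partno: int) -> Tuple[int, int]:
        """Return (major, minor) for a partition; 0 is the whole disk."""
        if partno < self.minors:
            return (self.major, self.first_minor + partno)
        # Traditional minors exhausted: fall back to the extended space.
        if partno not in self._ext:
            self._ext[partno] = self.pool.alloc()
        return self._ext[partno]
```

Interleaving partition creation on two disks that share a pool shows both consequences noted earlier: the numbers depend on the order of discovery, so they are not stable across boots, and the minors belonging to one disk are not contiguous.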

Even though modern distributions are comfortable with dynamic device numbers (and names, for that matter), it seems hard to imagine that a change like this would be entirely free of systems management problems across the full Linux user base. Distributors may still be a little nervous from the grief they took after the shift to the PATA drivers changed drive names on installed systems. So it's not really clear when Tejun's patches might make it into the mainline, or when distributors would make use of that functionality. The pressure for more partitions is unlikely to go away, though, so these patches may find their way in before too long.


Handling kernel security problems

By Jonathan Corbet
July 16, 2008
Even the most casual observer of the linux-kernel mailing list must have noticed that, in the shadow of the firmware flame war, there is also a heated discussion over the management of security issues. There have also been some attempts to turn this local battle into a multi-list, regional conflict. Finding the right way to deal with security problems is difficult for any project, and the kernel is no exception. Whether this discussion will lead to any changes remains to be seen, but it does at least provide a clear view of where the disagreements are.

Things flared up this time in response to the stable kernel update. The announcement stated that "any users of the 2.6.25 kernel series are STRONGLY encouraged to upgrade to this release," but did not say why; none of the patches found in this release were marked as security problems. As it happens, there were security-related fixes in that update; some users are upset that they were not explicitly called out as such. They have reached the point of accusing the kernel developers of hiding security problems.

These problems, it is said, are fixed with relatively benign-sounding commit messages ("x86_64 ptrace: fix sys32_ptrace task_struct leak," for example) and users are not told that a security fix has been made. This, in turn, is thought to put users at risk because (1) they do not know when they need to apply an update, and (2) there is no clear picture of how many security problems are surfacing in the kernel code. So, as "pageexec" (or "PaX Team") put it:

the problem i raised was that there's one declared policy in Documentation/SecurityBugs (full disclosure) yet actual actions are completely different and now Linus even admitted it. the problem arising from such inconsistency is that people relying on the declared disclosure policy will make bad decisions and potentially endanger their users. there're two ways out of this sitution: either follow full disclosure in practice or let the world at large know that you (well, Linus) don't want to. in either case people will adjust their security bug handling processes and everyone will be better off.

There are two aspects to the charge that the kernel is not following a full disclosure policy: commit messages are said to obscure security fixes, and kernel releases do not highlight the fact that security problems have been fixed. There is an aspect of truth to the first charge, in that Linus will freely admit to changing commit logs which discuss security problems too explicitly:

I literally draw the line at anything that is simply greppable for. If it's not a very public security issue already, I don't want a simple "git log + grep" to help find it.

That said, I don't _plan_ messages or obfuscate them, so "overflow" might well be part of the message just because it simply describes the fix. So I'm not claiming that the messages can never help somebody pinpoint interesting commits to look at, I'm just also not at all interested in doing so reliably.

His goal here is clear: make life just a little harder for people who are searching the commit logs for vulnerabilities to exploit. One may argue over whether this policy amounts to hiding security problems, or whether it will be effective in reducing exploits (and plenty of people have shown their willingness to do such arguing), but the fact remains that it is the policy followed by Linus at this time. In his view, the committing of a fix is the disclosure of the problem, and there is no need to be more explicit than that.

That view extends to the whole security update process found in much of the community. He has no respect for embargo policies or delayed disclosure, and he criticizes the "whole security circus" which, in his opinion, emphasizes the wrong thing:

It makes "heroes" out of security people, as if the people who don't just fix normal bugs aren't as important.

In fact, all the boring normal bugs are _way_ more important, just because there's a lot more of them. I don't think some spectacular security hole should be glorified or cared about as being any more "special" than a random spectacular crash due to bad locking.

Beyond that, it is often hard to know which patches are truly security fixes. It has been argued at times that all bugs have security relevance; it's mostly just a matter of figuring out how to exploit them. So explicitly marking security fixes risks taking attention away from all of the other fixes, many of which may also, in fact, fix security issues. Thus, Linus says:

If people think that they are safer for only applying (or upgrading to) certain patches that are marked as being security-specific, they are missing all the ones that weren't marked as such. Making them even _believe_ that the magic security marking is meaningful is simply a lie. It's not going to be.

So why would I add some marking that I most emphatically do not believe in myself, and think is just mostly security theater?

That said, the stable kernel updates go out with patches which are known to be security fixes. Some people clearly believe that being STRONGLY encouraged to update is not sufficient notification of that fact. It does seem that there has been a trend away from explicit recognition of security issues in the stable releases. The inclusion of CVE numbers was once common; in the 2.6.25 series, only,, and had such numbers in the changelogs. It is, indeed, true that a straightforward reading of the stable release changelogs will not tell users whether those releases fix relevant security issues.

There are a number of answers to that complaint too, of course. The real information is in the source code, and that is always public. The fixes in the stable series are unlikely to be all that relevant to most users anyway; they are running distributor kernels which are many months behind even the -stable series and which may (or may not) be affected by a specific problem. In the end, users who are concerned about security issues in their kernels have somebody to turn to: their distributors. Linux distributors follow disclosure rules and tend to do a pretty thorough job of fixing the known security problems and propagating those fixes to users. For users who need a high level of long-term support, there are distributors who are more than willing to provide that kind of service for a fee.

As is often the case, what it really comes down to here is resources. It would be nice if somebody were to follow the patch stream (well over 100 patches/day into the mainline) and identify each one which has security implications. For each patch, this person could then figure out which kernel version was first affected by the vulnerability, obtain a CVE number, and issue a nicely-formatted advisory. But this is a huge job, one which nobody is likely to do in an uncompensated mode for any period of time. So somebody would have to pay for this work. And, to a great extent, that is just what the distributors are doing now - with the nice addition that they backport the fixes into the kernels they support.

It is worth noting that those distributors have not been doing a whole lot of complaining about how security fixes are handled now. Instead, the complaining has come, primarily, from the maintainers of the out-of-tree grsecurity project which, from a suitably cynical point of view, could be seen to benefit from raising the profile of Linux kernel security problems.

But, regardless of the validity of any such charge, there may be some value in what they are asking. It is good to have a clear sense for what the security problems in a piece of code are. If nothing else, it helps the project itself to understand where it stands with regard to security and whether things are getting better or worse. So it would be nice if the kernel developers could be a bit more diligent and organized in how they track security issues, much like the tracking of regressions has improved over the last couple of years. But this kind of improvement will not happen until somebody decides to put the work into it. Actually putting some time into documenting kernel security issues will accomplish far more than complaining on mailing lists.


Kernel security problems: a response

July 16, 2008

This article was contributed by Greg Kroah-Hartman.

I would like to try to clarify a few points in the article, "Handling kernel security problems" by Jonathan Corbet.

First off, I speak only for myself, not for the other half of the Linux -stable team, Chris Wright, who might totally disagree with me, nor for the other kernel developers who help out with the alias, nor for my current employer Novell. Also note that all of my -stable development is done on my own time, and is not part of my role at my current job.

All of that out of the way, I object to a few things stated in the original article:

It does seem that there has been a trend away from explicit recognition of security issues in the stable releases. The inclusion of CVE numbers was once common; in the 2.6.25 series, only,, and had such numbers in the changelogs. It is, indeed, true that a straightforward reading of the stable release changelogs will not tell users whether those releases fix relevant security issues.

A number of times, when we do -stable releases, there are no CVE numbers issued for the "security" related issues that are fixed in there. This happens when the fix is first made in Linus's tree, and is either forwarded to the alias saying, "we need to get this out now", or just by the fact that it is only later that people realize that a CVE number should be allocated.

And yes, the trend is away from explicit recognition of security issues, exactly following Linus's statement that you quote from.

It comes down to who the users of the -stable kernel series are. I personally see these kernels as serving two different groups of people:

  • Those who want to follow the latest releases and not rely on a distribution for their kernel versions.

  • For distributions to base releases on, and to pick and choose patches from.

The first group should always update to the latest -stable kernel update, as they are relying on the -stable team to always provide them with the latest fixes that are known to be needed. Simply marking things as "security related" can be misguided, as Linus points out. The change log entries should show all users what was fixed, and if they run a machine where this code is used, then they should upgrade. It's as simple as that.

In fact, in the release I tried to say exactly that:

It contains one bugfix, any user of the 2.6.25 kernel on x86-64 with untrusted local users is very STRONGLY recommended to upgrade.

How much clearer can I be? Does a user of the -stable tree, who has to be technically competent to be able to do such a thing in the first place, need to know more to decide if they need to upgrade their machines or not? It seems people are upset that I am no longer using the magic words "security fix", and that is true, I am not saying that anymore. As Linus and others have noted, marking some bugs as being "security-related" is not helpful, especially as not everyone can even agree - or sometimes even know at release time - whether a bug has security implications or not.

Also note that this release does not refer to a CVE number. This is because, as of this moment, there still is not a number assigned, despite asking the relevant groups for such an assignment. I never want to hold up a release by waiting for any such number, so I personally will just not use them in the future in -stable releases unless they are already contained in the original changelog entry in Linus's tree.

The second group, the distributions, all seem very happy with how the -stable releases are conducted. They have the capability to pick and choose from the fixes and apply them to their older kernel versions and ship them to their customers as they see fit. The distros all know what things are security related by the fact that they know and understand the code and the threat model as they have developers assigned to handle such security issues, and have done so for years.

In your summary, you state:

It is good to have a clear sense for what the security problems in a piece of code are. If nothing else, it helps the project itself to understand where it stands with regard to security and whether things are getting better or worse. So it would be nice if the kernel developers could be a bit more diligent and organized in how they track security issues, much like the tracking of regressions has improved over the last couple of years.

I think the individual developers of the kernel all know quite well what the security problems for their code are. This is backed up by the fact that these developers are the ones usually making the fix and telling the -stable team that a specific patch needs to be added.

What you seem to be asking for is a way to somehow classify bugs and fixes in the kernel tree as "security related" or not. And that goes back to Linus's original point. To try to do so marginalizes bugs which are somehow not so designated as not worth fixing. However, if someone wants to do this work for the kernel community, and it proves to be useful over time, I'll be the first in line to say that I was wrong.


Page editor: Jonathan Corbet

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds