
Kernel development

Brief items

Kernel release status

The current development kernel is 3.2-rc3, released on November 23. "Anyway, whether you will be stuffing yourself with turkey tomorrow or not, there's a new -rc out. I'd love to say that things have been calming down, and that the number of commits just keep shrinking, but I'd be lying. -rc3 is actually bigger than -rc2, mainly due to a network update (none in -rc2) and with Greg doing his normal usb/driver-core/tty/staging thing." Linus appears to be back on the Wednesday release schedule, so expect -rc4 sometime shortly after this page is published.

Stable updates: the, 3.0.11, and 3.1.3 stable kernel updates were released on November 28; they contained a long list of fixes and (for 3.x) one bit of USB driver breakage. The 3.0.12 and 3.1.4 updates came out shortly thereafter with one patch to fix that problem.

Quotes of the week

Modules are evil. They are a security issue, and they encourage a "distro kernel" approach that takes forever to compile. Just say no. Build a lean and mean kernel that actually has what you need, and nothing more. And don't spend stupid time compiling modules you won't need.
-- Linus Torvalds

I listen to songs like "Believe" by Cher sometimes. In my head, it's a song about the lonely life of a NMI watchdog handler (seriously). That is, if an NMI watchdog handler could express the sheer and utter loneliness of its existence.
-- Jon Masters


DM-Steg is a kernel module that adds steganographic encryption to the device mapper. "Steganographic" means that the encrypted data is hidden to the point that its very existence can be denied. "Steg works with substrates (devices containing ciphertext) to export plaintext-containing block devices, known as aspects, to the user. Without having the key(s), there is no way of determining how many aspects a substrate contains, or if it contains any aspects at all." The initial release of this module has just been announced. "The code has only ever been tested on my PC, but it works very nicely for me and has stopped eating my data, so I figure it's ready for public consumption!" See this document [PDF] for details.

Kernel development news

Routing Open vSwitch into the mainline

By Jonathan Corbet
November 30, 2011
Visitors to the features page on the Open vSwitch web site may be forgiven if they do not immediately come away with a good understanding of what this package does. The feature list is full of enlightening bullet points like "LACP (IEEE 802.1AX-2008)", "802.1ag link monitoring", and "Multi-table forwarding pipeline with flow-caching engine". Behind the acronyms, Open vSwitch is a virtual switch that has already seen a lot of use in the Xen community and which is applicable to most other virtualization schemes as well. After some years as an out-of-tree project, Open vSwitch has recently made a push for inclusion into the mainline kernel.

Open vSwitch is a network switch; at its lowest level, it is concerned with routing packets between interfaces. It is aimed at virtualization users, so, naturally, it is used in the creation of virtual networks. A switch can be set up with a number of virtual network interfaces, most of which are used by virtual machines to communicate with each other and the wider world. These virtual networks can be connected across hosts and across physical networks. One of the key features of Open vSwitch appears to be the ability to easily migrate virtual machines between physical hosts and have their network configuration (addresses, firewall rules, open connections, etc.) seamlessly follow.

Needless to say, there is no shortage of features beyond making it easier to move guests around. Open vSwitch offers a long list of options for access control, quality-of-service control, network bridging, traffic monitoring, and more. The OpenFlow protocol is supported, allowing the integration of interesting protocols and controllers into the network. Open vSwitch has been shipped as part of a number of products and it shows; it has the look of a polished, finished offering.

Most of Open vSwitch is implemented in user space, but there is one kernel module that makes the whole thing work; that module was submitted for review in mid-November. Open vSwitch tries to make use of existing networking features to the greatest extent possible; the kernel module mostly implements a control interface allowing the user-space code to make routing decisions. Routing packets through user space would slow things down considerably, so the interface is set up to avoid the user-space round trip whenever possible.

When the Open vSwitch module receives a packet on one of its interfaces, it generates a "flow key" describing the packet in general terms. An example key from the submission is:

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=, dst=, proto=17, tos=0,
    frag=no), tcp(src=49163, dst=80)

Most of the fields should be fairly self-explanatory; this key describes a packet that arrived on port (interface) 1, aimed at TCP port 80 on the destination host. If Open vSwitch does not know how to process the packet, it will pass it to the user-space daemon, along with the generated flow key. The daemon can then decide what should be done; it will also, normally, pass a rule back to the kernel describing how to handle related packets in the future. These rules start with the flow key, which may be generalized somewhat, and include a set of associated actions. Possible actions include:

  • Output the packet to a specific port, forwarding it on its way to its final destination.

  • Send the packet to user space for further consideration. The destination process may or may not be the main Open vSwitch control daemon.

  • Make changes to the packet header on its way through; network address translation could be implemented this way, for example.

  • Add an 802.1Q virtual LAN header in preparation for tunneling the packet to another host; there is also an action for stripping such headers at the receiving end.

  • Record attributes of the packet for statistics generation.

Once a rule for a given type of packet has been installed into the kernel, future packets can be routed quickly without the need for further user-space intervention. If the switch is working properly, most packets should never need to go through the control daemon.
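The fast-path/slow-path split described above can be illustrated with a toy model. This is only a sketch: the FlowKey fields, the FlowTable class, and the daemon() policy are hypothetical names invented for illustration, the lookup is an exact match (the generalized keys mentioned above are not modeled), and none of it reflects the actual Open vSwitch data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    """Hypothetical, much-reduced flow key; frozen so it can be a dict key."""
    in_port: int
    eth_src: str
    eth_dst: str
    eth_type: int
    tp_dst: int


class FlowTable:
    """Toy stand-in for the kernel-side rule cache."""

    def __init__(self, slow_path):
        self.rules = {}             # FlowKey -> list of (action, argument) tuples
        self.slow_path = slow_path  # stands in for the user-space daemon
        self.misses = 0

    def handle(self, key):
        actions = self.rules.get(key)
        if actions is None:
            # No rule installed yet: ask "user space" and cache its answer
            # so that later packets with the same key skip the round trip.
            self.misses += 1
            actions = self.slow_path(key)
            self.rules[key] = actions
        return actions


def daemon(key):
    """Hypothetical policy: send web traffic out port 2, drop the rest."""
    if key.tp_dst == 80:
        return [("output", 2)]
    return []
```

Calling handle() twice with the same key consults daemon() only once; the second packet is served from the cached rule, which is the behavior the kernel module is meant to provide.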

Open vSwitch, by all appearances, is a useful and powerful mechanism; the networking developers seem to agree that it would be a good addition to the kernel. There is, however, some disagreement over the implementation. In particular, the patch adds a new packet classification and control mechanism, but the kernel already has a traffic control system of its own; duplicating that infrastructure is not a popular idea. As Jamal Hadi Salim put it:

You are replicating a lot of code and semantic that exist (not just on classifier actions). You could improve the existing infrastructure instead.

Jamal suggested that Open vSwitch could add a special-purpose classifier for its own needs, but that classifier should fit into the existing traffic control subsystem.

That said, there seems to be some awareness within the networking community that the kernel's traffic controller may not quite be up to the task. Eric Dumazet noted that its scalability is not what it could be and that the code reflects its age; he said: "Maybe its time to redesign a new model, based on modern techniques." Others seemed to agree with this assessment. The traffic controller, it appears, is in need of serious improvements or replacement regardless of what happens with Open vSwitch.

The fact that the traffic controller is not everything Open vSwitch needs will not normally be considered an adequate justification for duplicating its infrastructure, though. The obvious options available to the Open vSwitch developers will be to (1) improve the traffic controller to the point that it does work, or (2) position the Open vSwitch controller as a plausible long-term replacement. Neither task is likely to be easy. The outcome of this discussion may well be that developers who were hoping to merge their existing code will find themselves tasked with a fair amount of infrastructural work.

That can be the point where those developers take option (3): go away and continue to maintain their code out of tree. Requiring extra work from contributors can cause them to simply give up. But if the networking maintainers accept duplicated subsystems, the likely outcome is a lot of wasted work and multiple implementations of the same functionality, none of which is as good as it should be. There are solid reasons behind the maintainers' tendency to push back against that kind of contribution; without that pushback, the long-term maintainability of the kernel will suffer.

How things will be resolved in the case of Open vSwitch remains to be seen; the discussion is ongoing as of this writing. Open vSwitch is a healthy and active project; it may well have the resources and the desire to perform the necessary work to get into the mainline and ease its own long-term maintenance burden. Meanwhile, as was discussed at the 2011 Kernel Summit, code that is being shipped and used has value; sometimes it is best to get it into the mainline and focus on improving it afterward. Some developers (such as Herbert Xu) seem to think that may be the best approach to take in this case. So Open vSwitch may yet find its way into the mainline in the near future with the idea that its internals can be fixed up thereafter.

Hardware face detection

By Jonathan Corbet
November 29, 2011
Once upon a time, a "system on chip" (SOC) was a package containing a processor and some number of I/O controllers. While SOCs still have all that, manufacturers have been busy adding hardware support for all kinds of interesting functionality. For example, OMAP4 processors have an onboard face detection module that can be used for camera focus control, "face unlock" features, and more. Naturally, there is interest in making use of such features in Linux; a recent driver submission shows that the question of just how to do that has not yet been answered, though.

The OMAP4 face detection driver was submitted by Tom Leiming, but was apparently written by Ming Lei. Upon initialization, the driver allocates a memory area which is made available to an application via mmap(). The application places an image in that area (it seems that a 320x240 grayscale PGM image is the only supported option), then uses a number of ioctl() operations to specify the area of interest and to start and stop the detection process. A read() on the device will, once detection is complete, yield a number of structures describing the locations of the faces in the image as rectangles.
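To make the data flow a little more concrete, here is a rough sketch of the two ends of that exchange as an application might see them: building the 320x240 grayscale PGM buffer, and decoding the bytes returned by read(). The record layout used here (four little-endian 16-bit values: x, y, width, height) is purely an assumption for illustration; the real driver's structures, ioctl() numbers, and device name are not given here and are not modeled.

```python
import struct

WIDTH, HEIGHT = 320, 240

# Hypothetical result record: x, y, width, height as little-endian u16.
RECT = struct.Struct("<4H")


def make_pgm(pixels: bytes) -> bytes:
    """Wrap raw 8-bit grayscale pixels in a binary PGM (P5) header."""
    if len(pixels) != WIDTH * HEIGHT:
        raise ValueError("expected exactly 320x240 one-byte pixels")
    header = f"P5\n{WIDTH} {HEIGHT}\n255\n".encode("ascii")
    return header + pixels


def parse_faces(buf: bytes):
    """Decode a buffer of back-to-back rectangle records.

    Any trailing partial record is ignored.
    """
    usable = len(buf) - len(buf) % RECT.size
    return [RECT.unpack_from(buf, off) for off in range(0, usable, RECT.size)]
```

In a real application the PGM bytes would be copied into the mmap()ed area and parse_faces() would be fed whatever read() returned; both steps are left out because they depend on driver details that are not described above.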

Face detection functionality is clearly welcome, but this particular driver has a lot of problems and will not get into the mainline in anything resembling its current state. The most significant criticism, though, came from Alan Cox, who asked that, rather than being implemented as a standalone device, face detection be integrated into the Video4Linux2 framework.

In truth, V4L2 is probably the right place for this feature. Face detection is generally meant to be used with the camera controller integrated into the same SOC and the face detection hardware may be tightly tied to that controller. The media controller subsystem was designed for just this kind of functionality; it provides a mechanism by which camera data may (or may not) be routed to the face detection module as needed. Integration into V4L2 would bring the face detection module under the same umbrella as the rest of the video processing hardware and export the necessary data routing capabilities to user space.

The design of the user-space interface for this functionality seems likely to pose challenges of its own, though. The OMAP4 hardware is relatively simple in its operation; it appears to even lack the ability to work with multiple image formats, even moderately high-resolution images, or color data. Future hardware will certainly not be so limited. It is also not hard to imagine a shift from detection of any face to recognition of specific faces - or, at least, the generation of metrics to ease the association of faces and the identities of their owners. The hardware could become capable of blink detection, distinguishing real faces from pictures of faces, or determining when a face belongs to a poker player who is bluffing. Designing an API that can handle this kind of functionality is going to be an interesting task.

But it does not stop there. There is a discouragingly large market out there for devices capable of reading automobile license plates, for example. There is money in meeting the needs of the contemporary surveillance state, so manufacturers will certainly compete to provide the needed capabilities. In general, the world is filled with interesting things that are not faces; it is not hard to imagine that people will be able to do useful things with devices that can pick all kinds of high-level objects out of image data.

In general, we may be seeing a shift in what kinds of peripherals are attached to our processors. There will always be plenty of devices that serve essentially (from the CPU's point of view) as channels moving chunks of data in one direction or the other. But there will be more and more devices that offload some type of processing, and that is going to present some interesting ABI challenges. Hardware-based offload engines are nothing new, of course. But, once upon a time, offload devices mostly performed tasks otherwise handled by the operating system kernel. Integrated controllers and network protocol offload functionality are a couple of obvious examples. More recently, though, hardware has provided functionality that needs to be made available to user space. And that changes the game somewhat.

If one looks for examples of this kind of functionality, one almost certainly needs to start at the GPU found in most graphics cards. Creating a workable (and stable) user-space ABI providing access to the GPU has taken many years, and it is not clear that the job is done yet. The media controller ABI controls routing of data among the numerous interesting functional units in contemporary video processors, but writing a hardware-independent application using the media controller is hard. Creating a workable interface for the wide variety of available industrial sensors has also been a multi-year project.

Trying to anticipate where this kind of hardware will go in an attempt to create the perfect ABI from the outset seems like an exercise in futility. Most likely it will have to be done the way we've always done it: come up with something that seems reasonable, learn (the hard way) what its shortcomings are, then begin the long process of replacing it with something better. It is not an ideal way to create an operating system, but it seems to be better than the alternatives. Figuring out the best way to support face detection will just be another step in this ongoing process.

Improving ext4: bigalloc, inline data, and metadata checksums

By Jonathan Corbet
November 29, 2011
It may be tempting to see ext4 as last year's filesystem. It is solid and reliable, but it is based on an old design; all the action is to be found in next-generation filesystems like Btrfs. But it may be a while until Btrfs earns the necessary level of confidence in the wider user community; meanwhile, ext4's growing user base has not lost its appetite for improvement. A few recently-posted patch sets show that the addition of new features to ext4 has not stopped, even as that filesystem settles in for a long period of stable deployments.

Bigalloc

In the early days of Linux, disk drives were still measured in megabytes and filesystems worked with blocks of 1KB to 4KB in size. As this article is being written, terabyte disk drives are not quite as cheap as they recently were, but the fact remains: disk drives have gotten a lot larger, as have the files stored on them. But the ext4 filesystem still deals in 4KB blocks of data. As a result, there are a lot of blocks to keep track of, the associated allocation bitmaps have grown, and the overhead of managing all those blocks is significant.

Raising the filesystem block size in the kernel is a dauntingly difficult task involving major changes to memory management, the page cache, and more. It is not something anybody expects to see happen anytime soon. But there is nothing preventing filesystem implementations from using larger blocks on disk. As of the 3.2 kernel, ext4 will be capable of doing exactly that. The "bigalloc" patch set adds the concept of "block clusters" to the filesystem; rather than allocate single blocks, a filesystem using clusters will allocate them in larger groups. Mapping between these larger blocks and the 4KB blocks seen by the core kernel is handled entirely within the filesystem.

The cluster size to use is set by the system administrator at filesystem creation time (using a development version of e2fsprogs), but it must be a power of two. A 64KB cluster size may make sense in a lot of situations; for a filesystem that holds only very large files, a 1MB cluster size might be the right choice. Needless to say, selecting a large cluster size for a filesystem dominated by small files may lead to a substantial amount of wasted space.

Clustering reduces the space overhead of the block bitmaps and other management data structures. But, as Ted Ts'o documented back in July, it can also increase performance in situations where large files are in use. Block allocation times drop significantly, but file I/O performance also improves in general as the result of reduced on-disk fragmentation. Expect this feature to attract a lot of interest once the 3.2 kernel (and e2fsprogs 1.42) make their way to users.
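The tradeoff can be put into numbers with a little arithmetic. The sketch below assumes one bitmap bit per allocation unit and ignores how ext4 actually splits bitmaps across block groups; the function names are made up for illustration.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 ** 4


def bitmap_bytes(fs_bytes: int, unit_bytes: int) -> int:
    """Bytes of allocation bitmap needed at one bit per allocation unit."""
    units = fs_bytes // unit_bytes
    return (units + 7) // 8


def allocated_bytes(file_bytes: int, unit_bytes: int) -> int:
    """Space consumed by a file when rounded up to whole allocation units."""
    units = -(-file_bytes // unit_bytes)  # ceiling division
    return units * unit_bytes


if __name__ == "__main__":
    for unit in (4 * KB, 64 * KB, 1 * MB):
        print(f"{unit // KB:5d}KB units: "
              f"bitmap for 1TB = {bitmap_bytes(TB, unit) // KB}KB, "
              f"a 1000-byte file occupies {allocated_bytes(1000, unit)} bytes")
```

Under these assumptions a 1TB filesystem needs 32MB of bitmap with 4KB blocks, 2MB with 64KB clusters, and 128KB with 1MB clusters, while a 1000-byte file grows from 4KB of allocated space to 64KB or 1MB respectively.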

Inline data

An inode is a data structure describing a single file within a filesystem. For most filesystems, there are actually two types of inode: the filesystem-independent in-kernel variety (represented by struct inode), and the filesystem-specific on-disk version. As a general rule, the kernel cannot manipulate a file in any way until it has a copy of the inode, so inodes, naturally, are the focal point for a lot of block I/O.

In the ext4 filesystem, the size of on-disk inodes can be set when a filesystem is created. The default size is 256 bytes, but the on-disk structure (struct ext4_inode) only requires about half of that space. The remaining space after the ext4_inode structure is normally used to hold extended attributes. Thus, for example, SELinux labels can be found there. On systems where extended attributes are not heavily used, the space between on-disk inode structures may simply go to waste.

Meanwhile, space for file data is allocated in units of blocks, separately from the inode. If a file is very small (and, even on current systems, there are a lot of small files), much of the block used to hold that file will be wasted. If the filesystem is using clustering, the amount of lost space will grow even further, to the point that users may start to complain.

Tao Ma's ext4 inline data patches may change that situation. The idea is quite simple: very small files can be stored directly in the space between inodes without the need to allocate a separate data block at all. On filesystems with 256-byte on-disk inodes, the entire remaining space will be given over to the storage of small files. If the filesystem is built with larger on-disk inodes, only half of the leftover space will be used in this way, leaving space for late-arriving extended attributes that would otherwise be forced out of the inode.
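The sizing rule just described can be written down as a small sketch. It assumes the fixed part of the on-disk inode takes 128 bytes (the text only says "about half" of 256) and ignores whatever bookkeeping the real patches need inside the leftover space; inline_capacity() and fits_inline() are hypothetical helpers, not functions from the patch set.

```python
# Assumption: fixed on-disk inode fields occupy 128 bytes.
INODE_STRUCT_BYTES = 128


def inline_capacity(inode_size: int, struct_size: int = INODE_STRUCT_BYTES) -> int:
    """Bytes of file data that could live inside the on-disk inode."""
    leftover = inode_size - struct_size
    if leftover <= 0:
        return 0
    if inode_size <= 256:
        # 256-byte inodes: all of the leftover space holds file data.
        return leftover
    # Larger inodes: keep half of the leftover for extended attributes.
    return leftover // 2


def fits_inline(file_size: int, inode_size: int = 256) -> bool:
    """True if a file of this size would need no separate data block."""
    return file_size <= inline_capacity(inode_size)
```

With these assumptions a 256-byte inode could hold up to 128 bytes of file data and a 512-byte inode up to 192 bytes.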

Tao says that, with this patch set applied, the space required to store a kernel tree drops by about 1%, and /usr gets about 3% smaller. The savings on filesystems where clustering is enabled should be somewhat larger, but those have not yet been quantified. There are a number of details to be worked out yet - including e2fsck support and the potential cost of forcing extended attributes to be stored outside of the inode - so this feature is unlikely to be ready for inclusion before 3.4 at the earliest.

Metadata checksumming

Storage devices are not always as reliable as we would like them to be; stories of data corrupted by the hardware are not uncommon. For this reason, people who care about their data make use of technologies like RAID and/or filesystems like Btrfs which can maintain checksums of data and metadata and ensure that nothing has been mangled by the drive. The ext4 filesystem, though, lacks this capability.

Darrick Wong's checksumming patch set does not address the entire problem. Indeed, it risks reinforcing the old jest that filesystem developers don't really care about the data they store as long as the filesystem metadata is correct. This patch set seeks to achieve that latter goal by attaching checksums to the various data structures found on an ext4 filesystem - superblocks, bitmaps, inodes, directory indexes, extent trees, etc. - and verifying that the checksums match the data read from the filesystem later on. A checksum failure can cause the filesystem to fail to mount or, if it happens on a mounted filesystem, remount it read-only and issue pleas for help to the system log.
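The general technique (store a checksum alongside each metadata structure, recompute it on every read, and refuse to trust the structure on a mismatch) can be sketched in a few lines. This example uses zlib's plain CRC32 and appends the checksum to the end of the buffer; both choices are assumptions made for the sake of a self-contained illustration and say nothing about the algorithm or on-disk layout used by the actual patches.

```python
import struct
import zlib


def seal(payload: bytes) -> bytes:
    """Return the payload with a little-endian CRC32 appended."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def verify(block: bytes) -> bytes:
    """Check the trailing CRC32 and return the payload.

    Raises ValueError on a mismatch, standing in for the filesystem
    refusing to mount or remounting read-only.
    """
    if len(block) < 4:
        raise ValueError("block too short to contain a checksum")
    payload = block[:-4]
    (stored,) = struct.unpack("<I", block[-4:])
    if zlib.crc32(payload) != stored:
        raise ValueError("metadata checksum mismatch")
    return payload
```

A structure written with seal() and read back with verify() round-trips unchanged, while a single flipped bit anywhere in the payload causes verify() to raise.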

Darrick makes no mention of any plans to add checksums for data as well. In a number of ways, that would be a bigger set of changes; checksums are relatively easy to add to existing metadata structures, but an entirely new data structure would have to be added to the filesystem to hold data block checksums. The performance impact of full-data checksumming would also be higher. So, while somebody might attack that problem in the future, it does not appear to be on anybody's list at the moment.

The changes to the filesystem are significant, even for metadata-only checksums, but the bulk of the work actually went into e2fsprogs. In particular, e2fsck gains the ability to check all of those checksums and, in some cases, fix things when the checksum indicates that there is a problem. Checksumming can be enabled with mke2fs and toggled with tune2fs. All told, it is a lot of work, but it should help to improve confidence in the filesystem's structure. According to Darrick, the overhead of the checksum calculation and verification is not measurable in most situations. This feature has not drawn a lot of comments this time around, and may be close to ready for inclusion, but nobody has yet said when that might happen.


Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds