
Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.32-rc5; no 2.6.32 prepatches have been released over the last week.

The current stable kernel is 2.6.31.5, released (along with 2.6.27.38) on October 22. The 2.6.27 update is relatively small and focused on SCSI and USB serial devices; the 2.6.31 update, instead, addresses a much wider range of problems.

Comments (none posted)

Quotes of the week

It would be possible for us to rescan the RMRR tables when we take a device out of the si_domain, if we _really_ have to. But I'm going to want a strand of hair from the engineer responsible for that design, for my voodoo doll.
-- David Woodhouse

If a software system is so complex that its quirks and pitfalls cannot easily be located and avoided (witness the ondemand scheduler problem on Pentium IV's message I recently filed) then is it not *effectively* open source. I am qualified to read hardware manuals, I am qualified to rewrite C code (having written code generators for several C compilers) but the LKML is like the windmill and I feel like Don Quixote tilting back and forth in front of it. One could even argue that the lack of an open bug reporting system (and "current state" online reports) effectively makes Linux a non-open-source system. Should not Linux be the one of the first systems to make all knowledge completely available? Or is it doomed to be replaced by systems which might provide such capabilities (Android perhaps???)
-- Robert Bradbury

A real git tree will contain fixes for brown paperbag bugs, it will contain reverts, it will contain the occasional messy changelog. It is also, because it's more real life, far more trustable to pull from. The thing is, nothing improves a workflow more than public embarrassment - but rebasing takes away much of that public embarrassment factor.
-- Ingo Molnar

Comments (3 posted)

A Tokyo moment

The release of Windows 7 happened to coincide with the Japan Linux Symposium in Tokyo. Linus Torvalds was clearly quite impressed - and Chris Schlaeger was there to capture the moment. The original picture is available over here.

See also: Len Brown's photos from the kernel summit and JLS.

Comments (21 posted)

Tracefs

By Jonathan Corbet
October 28, 2009
In-kernel tracing is rapidly becoming a feature that developers and users count on. In current kernels, though, the virtual files used to control tracing and access data are all found in the debugfs filesystem, in the tracing directory. That is not seen as a long-term solution; debugfs is meant for volatile debugging information, but tracing users want to see a stable ABI in a non-debugging location.

Following up on some conference discussions, Greg Kroah-Hartman decided to regularize the tracing file hierarchy through the creation of a new tracefs virtual filesystem. Tracefs looks a lot like .../debug/tracing in that the files have simply been moved from one location to the other. Tracefs has a simpler internal API, though, since it does not require all of the features supported by debugfs.

The idea of tracefs is universally supported, but this particular patch looks like it will not be going in anytime soon. The concern is that anything moved out of debugfs and into something more stable will instantly become part of the kernel ABI. Much of the current tracing interface has been thrown together to meet immediate needs; the sort of longer-term thinking which is needed to define an interface which can remain stable for years is just beginning to happen.

Ingo Molnar thinks that the virtual files which describe the available events could be exported now, but not much else. That still leaves most of the interface in an unstable state. So Greg has withdrawn the patch for now; expect it to come back once the tracing developers are more ready to commit to their ABI. At that point, we can expect the debate to begin on the truly important question: /tracing or /sys/kernel/tracing?

Comments (1 posted)

Staging drivers out

By Jonathan Corbet
October 28, 2009
The staging tree was conceived as a way for substandard drivers to get into the kernel tree. Recently, though, there has been talk of using staging to ease drivers out as well. The idea is that apparently unused and unloved drivers would be moved to the staging tree, where they would languish for three development cycles. If nobody steps up to maintain a driver during that time, it will be removed from the tree. This idea was discussed at the 2009 Kernel Summit with no serious dissent.

Since then, John Linville has decided to test the system with a series of ancient wireless drivers. These include the "strip" driver ("STRIP is a radio protocol developed for the MosquitoNet project - to send Internet traffic using Metricom radios."), along with the arlan, netwave, and wavelan drivers. Nobody seems to care about this code, and it is unlikely that any users remain. If that is true, then there should be no down side to removing the code.

That hasn't stopped the complaints, though, mostly from people who believe that staging drivers out of the tree is an abuse of the process which may hurt unsuspecting users. It is true that users may have a hard time noticing this change until the drivers are actually gone - though their distributors may drop them before the mainline does. So the potential for an unpleasant surprise is there; mistaken removals are easily reverted, but that is only partially comforting for a user whose system has just broken.

The problem here is that there is no other way to get old code out of the tree. Once upon a time, API changes would cause unmaintained code to fail to compile; after an extended period of brokenness, a driver could be safely removed. Contemporary mores require developers to fix all in-tree users of an API they change, though, so this particular indicator no longer exists. That means the tree can fill up with code which is unused and which has long since ceased to work, but which still compiles flawlessly. Somehow a way needs to be found to remove that code. The "staging out" process may not be perfect, but nobody has posted a better idea yet.

Comments (16 posted)

/proc and directory permissions

By Jake Edge
October 28, 2009

In a discussion of the O_NODE open flag patch, an interesting, though obscure, security hole came to light. Jamie Lokier noticed the problem, and Pavel Machek eventually posted it to the Bugtraq security mailing list.

Normally, one would expect that a file in a directory with 700 permissions would be inaccessible to all but the owner of the directory (and root, of course). Lokier and Machek showed that there is a way around that restriction by using an entry in an attacking process's fd directory in the /proc filesystem.

If the directory is accessible to the attacker at some point while the file is present, the attacker can open the file for reading and hold that descriptor open even after the victim changes the directory permissions. Any normal write to the open file descriptor will fail because it was opened read-only, but re-opening and writing to /proc/$$/fd/N, where N is the open file descriptor number, will succeed based on the permissions of the file itself. If the file allows the attacking process to write to it, the write via /proc will succeed regardless of the permissions of the parent directory. This is rather counter-intuitive and, even though the example is somewhat contrived, seems to constitute a security hole.
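The mechanism is easier to see in a short sketch. The code below is purely illustrative: the /victim/dir path is hypothetical, the first open() is assumed to happen while the directory is still accessible, and the file itself is assumed to be writable by the attacker. It uses /proc/self/fd, which is equivalent to the /proc/$$/fd form above.

    /* Illustrative sketch of the /proc re-open trick described above. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Step 1: open the file read-only while the directory is
             * still accessible, and keep the descriptor. */
            int ro_fd = open("/victim/dir/notes", O_RDONLY);
            if (ro_fd < 0)
                    return 1;

            /* ... the victim now chmods /victim/dir to 700 ... */

            /* A write through ro_fd would fail: it is read-only.  But
             * re-opening the descriptor via /proc succeeds if the file
             * itself is writable, no matter what the parent directory's
             * permissions now are. */
            char path[64];
            snprintf(path, sizeof(path), "/proc/self/fd/%d", ro_fd);
            int rw_fd = open(path, O_WRONLY);
            if (rw_fd >= 0)
                    write(rw_fd, "still writable\n", 15);
            return 0;
    }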

The Bugtraq thread got off course quickly by noting that a similar effect could be achieved by creating a hardlink to the file before the directory permissions were changed. While that is true, Machek's example looked for that case by checking the link count on the file after the directory permissions had been changed; the hardlink scenario would be detected at that point.

One can imagine situations where programs do not put the right permissions on the files they use and administrators attempt to work around that problem by restricting access to the parent directory. Using this technique, an attacker could still access those files, in a way that was difficult to detect. As Machek noted, unmounting the /proc filesystem removes the problem, but "I do not think mounting /proc should change access control semantics."

There is currently some discussion of how, and to some extent whether, to address the problem, but a consensus (and patch) has not yet emerged.

Comments (12 posted)

Kernel development news

JLS2009: Generic receive offload

By Jonathan Corbet
October 27, 2009
Your editor still remembers installing his first Ethernet adapter. Through the expenditure of massive engineering resources, DEC was able to squeeze this device onto a linked pair of UNIBUS boards - the better part of a square meter of board space in total - so that a VAX system could be put onto a modern network. Supporting 10Mb/sec was a bit of a challenge in those days. In the intervening years, leading-edge network adapters have sped up to 10Gb/sec - a full three orders of magnitude. Supporting them is still a challenge, though for different reasons. At the 2009 Japan Linux Symposium, Herbert Xu discussed those challenges and how Linux has evolved to meet them.

Part of the problem is that 10G Ethernet is still Ethernet underneath. There is value in that; it minimizes the changes required in other parts of the system. But it's an old technology which brings some heavy baggage with it, with the heaviest bag of all being the 1500-byte maximum transfer unit (MTU) limit. With packet size capped at 1500 bytes, a 10G network link running at full speed will be transferring over 800,000 packets per second. Again, that's an increase of three orders of magnitude from the 10Mb days, but CPUs have not kept pace. So the amount of CPU time available to process a single Ethernet packet is less than it was in the early days. Needless to say, that is putting some pressure on the networking subsystem; the amount of CPU time required to process each packet must be squeezed wherever possible.

(Some may quibble that, while individual CPU speeds have not kept pace, the number of cores has grown to make up the difference. That is true, but the focus of Herbert's talk was single-CPU performance for a couple of reasons: any performance work must benefit uniprocessor systems, and distributing a single adapter's work across multiple CPUs has its own challenges.)

Given the importance of per-packet overhead, one might well ask whether it makes sense to raise the MTU. That can be done; the "jumbo frames" mechanism can handle packets up to 9KB in size. The problem, according to Herbert, is that "the Internet happened." Most connections of interest go across the Internet, and those are all bound by the lowest MTU in the entire path. Sometimes that MTU is even less than 1500 bytes. Protocol-based mechanisms exist for discovering that path MTU, but they don't work well on the Internet; in particular, a lot of firewall setups break them. So, while jumbo frames might work well for local networks, the sad fact is that we're stuck with 1500 bytes on the wider Internet.

If we can't use a larger MTU, we can go for the next-best thing: pretend that we're using a larger MTU. For a few years now Linux has supported network adapters which perform "TCP segmentation offload," or TSO. With a TSO-capable adapter, the kernel can prepare much larger packets (64KB, say) for outgoing data; the adapter will then re-segment the data into smaller packets as the data hits the wire. That cuts the kernel's per-packet overhead by a factor of 40. TSO is well supported in Linux; for systems which are engaged mainly in the sending of data, it is sufficient to make 10G links run at full speed.

The kernel actually has a generic segmentation offload mechanism (called GSO) which is not limited to TCP. It turns out that performance improves even if the feature is emulated in the driver. But GSO only works for data transmission, not reception. That limitation is entirely fine for broad classes of users; sites providing content to the net, for example, send far more data than they receive. But other sites have different workloads, and, for them, packet reception overhead is just as important as transmission overhead.

Solutions on the receive side have been a little slower in coming, and not just because the first users were more interested in transmission performance. Optimizing the receive side is harder because packet reception is, in general, harder. When it is transmitting data, the kernel is in complete control and able to throttle sending processes if necessary. But incoming packets are entirely asynchronous events, under somebody else's control, and the kernel just has to cope with what it gets.

Still, a solution has emerged in the form of "large receive offload" (LRO), which takes a very similar approach: incoming packets are merged at reception time so that the operating system sees far fewer of them. This merging can be done either in the driver or in the hardware; even LRO emulation in the driver has performance benefits. LRO is widely supported by 10G drivers under Linux.

But LRO is a bit of a flawed solution, according to Herbert; the real problem is that it "merges everything in sight." This transformation is lossy; if there are important differences between the headers in incoming packets, those differences will be lost. And that breaks things. If a system is serving as a router, it really should not be changing the headers on packets as they pass through. LRO can totally break satellite-based connections, where some very strange header tricks are done by providers to make the whole thing work. And bridging breaks, which is a serious problem: most virtualization setups use a virtual network bridge between the host and its clients. One might simply avoid using LRO in such situations, but these also tend to be the workloads that one really wants to optimize. Virtualized networking, in particular, is already slower; any possible optimization in this area is much needed.

The solution is generic receive offload (GRO). In GRO, the criteria for which packets can be merged are greatly restricted; the MAC headers must be identical and only a few TCP or IP headers can differ. In fact, the set of headers which can differ is severely restricted: checksums are necessarily different, and the IP ID field is allowed to increment. Even the TCP timestamps must be identical, which is less of a restriction than it may seem; the timestamp is a relatively low-resolution field, so it's not uncommon for lots of packets to have the same timestamp. As a result of these restrictions, merged packets can be resegmented losslessly; as an added benefit, the GSO code can be used to perform resegmentation.

One other nice thing about GRO is that, unlike LRO, it is not limited to TCP/IPv4.

The GRO code was merged for 2.6.29, and it is supported by a number of 10G drivers. The conversion of drivers to GRO is quite simple. The biggest problem, perhaps, is with new drivers which are written to use the LRO API instead. To head this off, the LRO API may eventually be removed, once the networking developers are convinced that GRO is fully functional with no remaining performance regressions.
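For the curious, the conversion looks roughly like the sketch below. The mydev_priv structure and the mydev_rx_next() helper are hypothetical stand-ins for a real driver's receive machinery; the essential change is that packets which used to go to netif_receive_skb() are handed to napi_gro_receive() instead, from within the normal NAPI poll routine.

    #include <linux/netdevice.h>
    #include <linux/etherdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical per-device state; a real driver has its own. */
    struct mydev_priv {
            struct napi_struct napi;
            struct net_device *netdev;
    };

    /* Stand-in for whatever pulls the next completed receive buffer off
     * the hardware ring; not a real API. */
    struct sk_buff *mydev_rx_next(struct mydev_priv *priv);

    static int mydev_poll(struct napi_struct *napi, int budget)
    {
            struct mydev_priv *priv = container_of(napi, struct mydev_priv, napi);
            struct sk_buff *skb;
            int done = 0;

            while (done < budget && (skb = mydev_rx_next(priv)) != NULL) {
                    skb->protocol = eth_type_trans(skb, priv->netdev);
                    /* The whole conversion: call napi_gro_receive() where
                     * the driver used to call netif_receive_skb(). */
                    napi_gro_receive(napi, skb);
                    done++;
            }
            if (done < budget)
                    napi_complete(napi);
            return done;
    }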

In response to questions, Herbert said that there has not been a lot of effort toward using LRO in 1G drivers. In general, current CPUs can keep up with a 1G data stream without too much trouble. There might be a benefit, though, in embedded systems which typically have slower processors. How does the kernel decide how long to wait for incoming packets before merging them? It turns out that there is no real need for any special waiting code: the NAPI API already has the driver polling for new packets occasionally and processing them in batches. GRO can simply be performed at NAPI poll time.

The next step may be toward "generic flow-based merging"; it may also be possible to start merging unrelated packets headed to the same destination to make larger routing units. UDP merging is on the list of things to do. There may even be a benefit in merging TCP ACK packets. Those packets are small, but there are a lot of them - typically one for every two data packets going the other direction. This technology may go in surprising directions, but one thing is clear: the networking developers are not short of ideas for enabling Linux to keep up with ever-faster hardware.

Comments (24 posted)

JLS2009: A Btrfs update

By Jonathan Corbet
October 27, 2009
Conferences can be a good opportunity to catch up with the state of ongoing projects. Even a detailed reading of the relevant mailing lists will not always shed light on what the developers are planning to do next, but a public presentation can inspire them to set out what they have in mind. Chris Mason's Btrfs talk at the Japan Linux Symposium was a good example of such a talk.

The Btrfs filesystem was merged for the 2.6.29 kernel, mostly as a way to encourage wider testing and development. It is certainly not meant for production use at this time. That said, there are people doing serious work on top of Btrfs; it is getting to where it is stable enough for daring users. Current Btrfs includes an all-caps warning in the Kconfig file stating that the disk format has not yet been stabilized; Chris is planning to remove that warning, perhaps for the 2.6.33 release. Btrfs, in other words, is progressing quickly.

One relatively recent addition is full use of zlib compression. Online resizing and defragmentation are coming along nicely. There has also been some work aimed at making synchronous I/O operations work well.

Defragmentation in Btrfs is easy: any specific file can be defragmented by simply reading it and writing it back. Since Btrfs is a copy-on-write filesystem, this rewrite will create a new copy of the file's data which will be as contiguous as the filesystem is able to make it. This approach can also be used to control the layout of files on the filesystem. As an experiment, Chris took a bunch of boot-tracing data from a Moblin system and analyzed it to figure out which files were accessed, and in which order. He then rewrote the files in question to put them all in the same part of the disk. The result was a halving of the I/O time during boot, resulting in faster system initialization and smiles all around.
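In its simplest form, that rewrite needs nothing beyond the ordinary file API; the sketch below (error handling abbreviated, whole file assumed to fit in memory, short reads and writes not handled) shows one way a small defragmentation tool could do it.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /* Rewrite a file in place so that the copy-on-write allocator lays its
     * data out in fresh, contiguous extents. */
    int rewrite_file(const char *path)
    {
            struct stat st;
            int fd = open(path, O_RDWR);

            if (fd < 0 || fstat(fd, &st) < 0)
                    return -1;

            char *buf = malloc(st.st_size);
            if (!buf || read(fd, buf, st.st_size) != st.st_size)
                    return -1;

            if (lseek(fd, 0, SEEK_SET) < 0 ||
                write(fd, buf, st.st_size) != st.st_size)
                    return -1;

            fsync(fd);           /* push the new copy out to disk */
            free(buf);
            return close(fd);
    }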

Performance of synchronous operations has been an important issue over the last year. On filesystems like ext3, an fsync() call will flush out a lot of data which is not related to the actual file involved; that adds a significant performance penalty for fsync() use and discourages careful programming. Btrfs has improved the situation by creating an entirely separate Btree on each filesystem which is used for synchronous I/O operations. That tree is managed identically to, but separately from, the regular filesystem tree. When an fsync() call comes along, Btrfs can use this tree to only force out operations for the specific file involved. That gives a major performance win over ext3 and ext4.

A further improvement would be the ability to write a set of files, then flush them all out in a single operation. Btrfs could do that, but there's no way in POSIX to tell the kernel to flush multiple files at once. Fixing that is likely to involve a new system call.

Btrfs provides a number of features which are also available via the device mapper and MD subsystems; some people have wondered if this duplication of features makes sense. But there are some good reasons for it; Chris gave a couple of examples:

  • Doing snapshots at the device mapper/LVM layer involves making a lot more copies of the relevant data. Chris ran an experiment where he created a 400MB file, created a bunch of snapshots, then overwrote the file. Btrfs is able to just write the new version, while allowing all of the snapshots to share the old copy. LVM, instead, copies the data once for each snapshot. So this test, which ran in less than two seconds on Btrfs, took about ten minutes with LVM.

  • Anybody who has had to replace a drive in a RAID array knows that the rebuild process can be long and painful. While all of that data is being copied, the array runs slowly and does not provide the usual protections. The advantage of running RAID within Btrfs is that the filesystem knows which blocks contain useful data and which do not. So, while an MD-based RAID array must copy an entire drive's worth of data, Btrfs can get by without copying unused blocks.

So what does the future hold? Chris says that the 2.6.32 kernel will include a version of Btrfs which is stable enough for early adopters to play with. In 2.6.33, with any luck, the filesystem will have RAID4 and RAID5 support. Things will then stabilize further for 2.6.34. Chris was typically cagey when talking about production use, though, pointing out that it always takes a number of years to develop complete confidence in a new filesystem. So, while those of us with curiosity, courage, and good backups could maybe be making regular use of Btrfs within a year, widespread adoption is likely to be rather farther away than that.

Comments (54 posted)

Transparent hugepages

By Jonathan Corbet
October 28, 2009
Most Linux systems divide memory into 4096-byte pages; for the bulk of the memory management code, that is the smallest unit of memory which can be manipulated. 4KB is an increase over what early virtual memory systems used; 512 bytes was once common. But it is still small relative to both the amount of physical memory available on contemporary systems and the working set size of applications running on those systems. That means that the operating system has more pages to manage than it did some years back.

Most current processors can work with pages larger than 4KB. There are advantages to using larger pages: the size of page tables decreases, as does the number of page faults required to get an application into RAM. There is also a significant performance advantage that derives from the fact that large pages require fewer translation lookaside buffer (TLB) slots. These slots are a highly contended resource on most systems; reducing TLB misses can improve performance considerably for a number of large-memory workloads.

There are also disadvantages to using larger pages. The amount of wasted memory will increase as a result of internal fragmentation; extra data dragged around with sparsely-accessed memory can also be costly. Larger pages take longer to transfer from secondary storage, increasing page fault latency (while decreasing page fault counts). The time required to simply clear very large pages can create significant kernel latencies. For all of these reasons, operating systems have generally stuck to smaller pages. Besides, having a single, small page size simply works and has the benefit of many years of experience.

There are exceptions, though. The mapping of kernel virtual memory is done with huge pages. And, for user space, there is "hugetlbfs," which can be used to create and use large pages for anonymous data. Hugetlbfs was added to satisfy an immediate need felt by large database management systems, which use large memory arrays. It is narrowly aimed at a small number of use cases, and comes with significant limitations: huge pages must be reserved ahead of time, cannot transparently fall back to smaller pages, are locked into memory, and must be set up via a special API. That worked well as long as the only user was a certain proprietary database manager. But there is increasing interest in using large pages elsewhere; virtualization, in particular, seems to be creating a new set of demands for this feature.
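For reference, the hugetlbfs API looks roughly like the sketch below: huge pages must already have been reserved (for example via /proc/sys/vm/nr_hugepages), hugetlbfs must be mounted somewhere, and the /mnt/huge mount point and file name here are assumptions made for the example.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HUGE_LEN (2UL * 1024 * 1024)   /* one 2MB huge page */

    /* Map a single huge page by creating and mapping a file on a mounted
     * hugetlbfs filesystem. */
    void *get_huge_page(void)
    {
            int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
            void *p;

            if (fd < 0)
                    return NULL;
            p = mmap(NULL, HUGE_LEN, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
            close(fd);          /* the mapping keeps the page alive */
            return p == MAP_FAILED ? NULL : p;
    }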

A host setting up memory ranges for virtualized guests would like to be able to use large pages for that purpose. But if large pages are not available, the system should simply fall back to using lots of smaller pages. It should be possible to swap large pages when needed. And the virtualized guest should not need to know anything about the use of large pages by the host. In other words, it would be nice if the Linux memory management code handled large pages just like normal pages. But that is not how things happen now; hugetlbfs is, for all practical purposes, a separate, parallel memory management subsystem.

Andrea Arcangeli has posted a transparent hugepage patch which attempts to remedy this situation by removing the disconnect between large pages and the regular Linux virtual memory subsystem. His goals are fairly ambitious: he would like an application to be able to request large pages with a simple madvise() system call. If large pages are available, the system will provide them to the application in response to page faults; if not, smaller pages will be used.
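In code, the interface Andrea has in mind would look something like the following sketch. MADV_HUGEPAGE is the advice value used by later revisions of this work and is assumed here; the fallback definition below is illustrative, not something an application should rely on.

    #include <sys/mman.h>
    #include <stddef.h>

    #ifndef MADV_HUGEPAGE
    #define MADV_HUGEPAGE 14        /* value from later kernels; assumption */
    #endif

    /* Allocate anonymous memory and hint that huge pages are welcome; if
     * none are available, the mapping silently stays with 4KB pages. */
    void *alloc_hinted(size_t len)
    {
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return NULL;
            madvise(p, len, MADV_HUGEPAGE);   /* advisory only */
            return p;
    }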

Beyond that, the patch makes large pages swappable. That is not as easy as it sounds; the swap subsystem is not currently able to deal with memory in anything other than PAGE_SIZE units. So swapping out a large page requires splitting it into its component parts first. This feature works, but not everybody agrees that it's worthwhile. Christoph Lameter commented that workloads which are performance-sensitive go out of their way to avoid swapping anyway, but that may become less true on a host filling up with virtualized guests.

A future feature is transparent reassembly of large pages. If such a page has been split (or simply could not be allocated in the first place), the application will have a number of smaller pages scattered in memory. Should a large page become available, it would be nice if the memory management code would notice and migrate those small pages into one large page. This could, potentially, even happen for applications which have never requested large pages at all; the kernel would just provide them by default whenever it seemed to make sense. That would make large pages truly transparent and, perhaps, decrease system memory fragmentation at the same time.

This is an ambitious patch to the core of the Linux kernel, so it is perhaps amusing that the chief complaint seems to be that it does not go far enough. Modern x86 processors can support a number of page sizes, up to a massive 1GB. Andrea's patch is currently aiming for the use of 2MB pages, though - quite a bit smaller. The reasoning is simple: 1GB pages are an unwieldy unit of memory to work with. No Linux system that has been running for any period of time will have that much contiguous memory lying around, and the latency involved with operations like clearing pages would be severe. But Andi Kleen thinks this approach is short-sighted; today's massive chunk of memory is tomorrow's brief email. Andi would rather that the system not be designed around today's limitations; for the moment, no agreement has been reached on that point.

In any case, this patch is an early RFC; it's not headed toward the mainline in the near future. It's clearly something that Linux needs, though; making full use of the processor's capabilities requires treating large pages as first-class memory-management objects. Eventually we should all be using large pages - though we may not know it.

Comments (12 posted)

Patches and updates

Kernel trees

Greg KH: Linux 2.6.31.5
Greg KH: Linux 2.6.27.38

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Gilad Ben-Yossef: Per route TCP options

Virtualization and containers

Gregory Haskins: irqfd enhancements

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds