Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.30-rc5, released by Linus on May 8. "Driver updates (SCSI being the bulk of it, but there are input layer, networking, DRI and MD changes too). Arch updates (mostly ARM "davinci" support, but some x86 and even alpha). And various random stuff (fairly big cifs update, but some smaller ocfs2 and xfs updates, and a fair amount of small one-liners all over)." See the long-format changelog for all the details.

The current stable 2.6 kernel is, released with a long list of fixes on May 8. was released at the same time; as promised, updates for the 2.6.28 kernel have ended.


Kernel development news

Quotes of the week

If you are getting no feedback just submit it next merge window. Either its offended nobody or they've forgotten to notice - in both cases submitting it will have the desired effect.
-- Alan Cox

I'm thinking of an app which prepares pages full of scurrilous rumour, then waits around looking at its /proc/self/smaps to see if anyone else is writing stories like that!
-- Hugh Dickins ponders security threats

Well I'm sorry I hardcoded a lack of beer into the serial layer to save a microsecond, you'll have to go without.... It works for me so clearly your usage pattern isn't interesting.
-- Alan Cox

The inventor of copy-n-paste has a lot to answer for.
-- Andrew Morton


In brief

By Jonathan Corbet
May 13, 2009
Editor's note: it's no secret that far more happens on the kernel mailing lists than can ever be reported on this page. As a result, interesting discussions and developments often slip by without a mention here. This article is the beginning of an experimental attempt to improve that situation. The idea is to briefly mention important topics which have not, yet, been developed into a full Kernel Page article. Some items will be followups from previous discussions; others may foreshadow full articles to come.

The "In brief" article will probably not appear every week. But, if it works out, it should become a semi-regular feature filling out LWN's kernel coverage. Comments are welcome.

reflink(): the proposed reflink() system call was covered last week. Since then, there have been some followup postings. reflink() v2, posted on May 7, maintained the reflink-as-snapshot semantics. When asked about that decision, Joel Becker responded "reflink() is a snapshotting call, not a kitchen sink." It seemed like there was to be no comfort for those wanting reflink-as-copy semantics.

reflink() v4, posted on the 11th, changed that tune somewhat. In this version, a process which either (1) owns the target file, or (2) has sufficient capabilities will create a link which copies the original security information - reflink-as-snapshot, essentially. A process lacking ownership and privilege, but having read access to the target file, will get a reflink with "new file" security information - reflink-as-copy. The idea is to do the right thing in all situations, but some developers are now concerned about a system call which has different semantics for processes running as root. This conversation has a while to go yet.
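The v4 rules amount to a small decision table, which may be easier to see written out. What follows is only a user-space model of the behavior described above, not code from the patch; the `struct caller` fields, the enum names, and the "denied" outcome for a caller with no read access are all assumptions made for the illustration.

```c
#include <stdbool.h>

/* Possible outcomes of a reflink() request under the v4 rules. */
enum reflink_result {
    REFLINK_DENIED,    /* assumed outcome: no ownership, privilege, or read access */
    REFLINK_SNAPSHOT,  /* security information copied from the original */
    REFLINK_COPY       /* link gets "new file" security information */
};

/* Hypothetical description of the calling process; not a kernel structure. */
struct caller {
    bool owns_file;       /* caller owns the target file */
    bool has_capability;  /* caller has sufficient capabilities */
    bool can_read;        /* caller has read access to the target file */
};

/* Model of the v4 semantics: owners and privileged callers get a
 * snapshot, mere readers get a copy. */
static enum reflink_result reflink_v4_semantics(const struct caller *c)
{
    if (c->owns_file || c->has_capability)
        return REFLINK_SNAPSHOT;
    if (c->can_read)
        return REFLINK_COPY;
    return REFLINK_DENIED;
}
```

The first branch is what worries some developers: the same call on the same file yields a different kind of link depending on who is asking.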

devtmpfs was also covered last week. This patch, too, has been reposted; the resulting conversation, again, looks to go on for a while. The return of devfs was always going to be controversial; the first version, after all, inspired flame wars for years before being merged. The devtmpfs developers feel that they need this feature to provide distributions which boot quickly and reliably in a number of situations; others think that there are better solutions to the problem. There is no consensus on merging this code at this time, but it is worth noting that the discussion has slowly shifted away from general opposition and toward fixing problems with the code.

Wakelocks are back, but now the facility has been rebranded suspend block. The core idea is the same: it allows code in kernel or user space to keep the system from suspending for a brief period of time. The user-space API has changed; there is now a /dev/suspend_blocker device which provides a couple of ioctl() calls. Closing the device releases the block, eliminating a potential problem with the wakelock API where a failed process could leave a block in place indefinitely.

There has been relatively little discussion of the new code; either everybody is happy with it now, or nobody has really noticed the new posting yet.

Doctor, it HZ. Much of the kernel is now tickless and equipped with high-resolution timers. So, says Alok Kataria, there is really no need to run x86 systems with a 1ms clock tick anymore. Running with HZ=1000 measurably slows the execution of a CPU-bound loop. So why not lower it?

There are problems with a lower HZ value, though, many of which have, at their source, the same problem which makes HZ=1000 more expensive: the kernel is still not truly tickless. Yes, the periodic clock interrupt is turned off when the processor is idle. But, when the CPU is busy, the clock ticks away as usual. Making the system fully tickless is a harder job than just making the idle state tickless; among other things, it pretty much requires doing away with the jiffies variable and all that depends on it. But, until that happens, lowering HZ will have costs of its own.

Wu Fengguang has been trying for a while to extend /proc/kpageflags; his patch adds a great deal of information about the usage of memory in the system. One might think that adding more useful information would be uncontroversial, but Ingo Molnar continues to oppose its inclusion. Ingo does not like the interface or the fact that it lives in /proc; his preferred solution looks more like an extension to ftrace. More thought toward the creation of uniform instrumentation interfaces is probably a good idea, but the current /proc/kpageflags interface has proved useful. It's also an established kernel ABI, so it's not going away anytime soon. But whether /proc/kpageflags will be extended further remains to be seen.


Seccomp and sandboxing

By Jonathan Corbet
May 13, 2009
Back in 2005, Andrea Arcangeli, mostly known for memory management work in those days, wandered into the security field with the "secure computing" (or "seccomp") feature. Seccomp was meant to support a side business of his which would enable owners of Linux systems to rent out their CPUs to people doing serious processing work. Allowing strangers to run arbitrary code is something that people tend to be nervous about; they require some pretty strong assurance that this code will not have general access to their systems.

Seccomp solves this problem by putting a strict sandbox around processes running code from others. A process running in seccomp mode is severely limited in what it can do; there are only four system calls - read(), write(), exit(), and sigreturn() - available to it. Attempts to call any other system call result in immediate termination of the process. The idea is that a control process could obtain the code to be run and load it into memory. After setting up its file descriptors appropriately, this process would call:

    prctl(PR_SET_SECCOMP, 1);

to enable seccomp mode. Once straitjacketed in this way, it would jump into the guest code, knowing that no real harm could be done. The guest code can run in the CPU and communicate over the file descriptors given to it, but it has no other access to the system.
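For the curious, here is a minimal sketch of that sequence, assuming a Linux system with seccomp built in. The helper name `run_in_seccomp()` and its arguments are inventions for the example; `SECCOMP_MODE_STRICT` is simply the symbolic name for the value 1 used above. One detail that tends to surprise people: the C library's `_exit()` uses `exit_group()`, which is not on the allowed list, so the child must invoke the plain `exit` system call directly.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, put it into seccomp mode 1, and let
 * it write a short message to out_fd.  If misbehave is nonzero, the child
 * first makes a system call outside the allowed set.  Returns 0 and stores
 * the child's wait status in *status, or returns -1 on setup failure.
 */
static int run_in_seccomp(int misbehave, int out_fd, int *status)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);                 /* seccomp not available here */
        if (misbehave)
            syscall(SYS_getpid);      /* not allowed: the kernel kills us */
        if (write(out_fd, "ok\n", 3) != 3)
            syscall(SYS_exit, 1);
        syscall(SYS_exit, 0);         /* raw exit(), not exit_group() */
    }
    if (waitpid(pid, status, 0) < 0)
        return -1;
    return 0;
}
```

A well-behaved child exits normally after its `write()`; a child that strays outside the four permitted calls should show up in the parent as having been killed by SIGKILL.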

Andrea's CPUShare never quite took off, but seccomp remained in the kernel. Last February, when a security hole was found in the seccomp code, Linus wondered whether it was being used at all. It seems likely that there were, in fact, no users at that time, but there was one significant prospective user: Google.

Google is not looking to use seccomp to create a distributed computing network; one assumes that, by now, they have developed other solutions to that problem. Instead, Google is looking for secure ways to run plugins in its Chrome browser. The Chrome sandbox is described this way:

Sandbox leverages the OS-provided security to allow code execution that cannot make persistent changes to the computer or access information that is confidential. The architecture and exact assurances that the sandbox provides are dependent on the operating system. Currently the only finished implementation is for Windows.

It seems that the Google developers thought that seccomp would make a good platform on which to create a "finished implementation" for Linux. Google developer Markus Gutschke said:

Simplicity is really the beauty of seccomp. It is very easy to verify that it does the right thing from a security point of view, because any attempt to call unsafe system calls results in the kernel terminating the program. This is much preferable over most ptrace solutions which is more difficult to audit for correctness.

The downside is that the sandbox'd code needs to delegate execution of most of its system calls to a monitor process. This is slow and rather awkward. Although due to the magic of clone(), (almost) all system calls can in fact be serialized, sent to the monitor process, have their arguments safely inspected, and then executed on behalf of the sandbox'd process. Details are tedious but we believe they are solvable with current kernel APIs.

There is, however, the little problem that sandboxed code can usefully (and safely) invoke more than the four allowed system calls. That limitation can be worked around ("tedious details"), but performance suffers. What the Chrome developers would like is a more flexible way of specifying which system calls can be run directly by code inside the sandbox.

One suggestion that came out was to add a new "mode" to seccomp. The API was designed with the idea that different applications might have different security requirements; it includes a "mode" value which specifies the restrictions that should be put in place. Only the original mode has ever been implemented, but others can certainly be added. Creating a new mode which allowed the initiating process to specify which system calls would be allowed would make the facility more useful for situations like the Chrome sandbox.

Adam Langley (also of Google) has posted a patch which does just that. The new "mode 2" implementation accepts a bitmask describing which system calls are accessible. If one of those is prctl(), then the sandboxed code can further restrict its own system calls (but it cannot restore access to system calls which have been denied). All told, it looks like a reasonable solution which could make life easier for sandbox developers.
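The "restrict but never restore" property falls out naturally if a new mask is always ANDed with the one already in force. The sketch below models that idea in user space; it is not taken from the patch, and `MAX_SYSCALLS`, `struct syscall_mask`, and the function names are all made up for the illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SYSCALLS  512   /* assumed upper bound, for the sketch only */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS    ((MAX_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per system call number; a set bit means "allowed". */
struct syscall_mask {
    unsigned long bits[MASK_WORDS];
};

/* Mark system call number nr as allowed. */
static void mask_allow(struct syscall_mask *m, unsigned int nr)
{
    if (nr < MAX_SYSCALLS)
        m->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Is system call number nr allowed by this mask? */
static bool mask_allows(const struct syscall_mask *m, unsigned int nr)
{
    return nr < MAX_SYSCALLS &&
           ((m->bits[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL) != 0;
}

/* Installing a new mask can only narrow the current one: a bit which is
 * already clear stays clear no matter what the request says. */
static void mask_restrict(struct syscall_mask *current,
                          const struct syscall_mask *requested)
{
    for (size_t i = 0; i < MASK_WORDS; i++)
        current->bits[i] &= requested->bits[i];
}
```

With this rule, a sandboxed process which still has access to prctl() can drop more system calls, but asking for one that was already denied has no effect. How the kernel side actually stores and checks the mask is a matter for Adam's patch.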

That said, this code may never be merged because the discussion has since moved on to other possibilities. Ingo Molnar, who has been arguing for the use of the ftrace framework in a number of situations, thinks that ftrace is a perfect fit for the Chrome sandbox problem as well. He might be right, but only for a version of ftrace which is not, yet, generally available.

Using ftrace for sandboxing may seem a little strange; a tracing framework is supposed to report on what is happening while perturbing the situation as little as possible. But ftrace has a couple of tools which may be useful in this situation. The system call tracer is there now, making it easy to hook into every system call made by a given process. In addition, the current development tree (perhaps destined for 2.6.31) includes an event filter mechanism which can be used to filter out events based on an arbitrary boolean expression. By using ftrace's event filters, the sandbox could go beyond just restricting system calls; it could also place limits on the arguments to those system calls. An example supplied by Ingo looks like this:

    { "sys_read",		"fd == 0" },
    { "sys_write",		"fd == 1" },
    { "sys_sigreturn",		"1" },
    { "sys_gettimeofday",	"tz == NULL" },

These expressions implement something similar to mode 1 seccomp. But, additionally, read() is limited to the standard input and write() to the standard output. The sandboxed process is also allowed to call gettimeofday(), but it is not given access to the time zone information.
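To make the semantics concrete, the four rules can be modeled as a table of predicates evaluated against each intercepted call. This is a user-space illustration only; it does not use the ftrace filter engine, and `struct sys_event`, `event_allowed()`, and the deny-by-default behavior for calls with no rule are assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one intercepted call, holding only the
 * arguments which the example rules look at. */
struct sys_event {
    const char *name;
    int fd;
    const void *tz;
};

typedef bool (*filter_fn)(const struct sys_event *);

static bool fd_is_stdin(const struct sys_event *e)  { return e->fd == 0; }
static bool fd_is_stdout(const struct sys_event *e) { return e->fd == 1; }
static bool always(const struct sys_event *e)       { (void)e; return true; }
static bool tz_is_null(const struct sys_event *e)   { return e->tz == NULL; }

/* The same four rules as in the example, expressed as C predicates. */
static const struct filter_rule {
    const char *name;
    filter_fn pass;
} rules[] = {
    { "sys_read",         fd_is_stdin },
    { "sys_write",        fd_is_stdout },
    { "sys_sigreturn",    always },
    { "sys_gettimeofday", tz_is_null },
};

/* A call is allowed only if a rule exists for it and the rule's
 * predicate holds; anything without a rule is denied (an assumption). */
static bool event_allowed(const struct sys_event *e)
{
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
        if (strcmp(rules[i].name, e->name) == 0)
            return rules[i].pass(e);
    return false;
}
```

So a read() on file descriptor 0 passes, a read() on descriptor 3 does not, and a call like open(), which has no rule at all, is refused outright.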

The expressions can be arbitrarily complex. They are also claimed to be very fast; Ingo claims that they are quicker than the evaluation of security module hooks. And, if straight system call filtering is not enough, arbitrary tracepoints can be placed elsewhere. All told, it does seem like a fairly general mechanism for restricting what a given process can do.

The problem cannot really be seen as solved yet, though. The event tracing code is very new and mostly unused so far. It is out of the mainline still, meaning that it could easily be a year or so until it shows up in kernels shipped by distributions. The code allowing this mechanism to be used to control execution is yet to be written. So Chrome will not have a sandbox based on anything other than mode 1 seccomp for some time (though the Chrome developers are also evaluating using SELinux for this purpose).

Beyond that, there are some real doubts about whether system call interception is the right way to sandbox a process. There are well-known difficulties with trying to verify parameters if they are stored in user space; a hostile process can attempt to change them between the execution of security checks and the actual use of the data. There are also interesting interactions between system calls and multiple ways to do a number of things, all of which can lead to a leaky sandbox. All of this has led James Morris to complain:

I'm concerned that we're seeing yet another security scheme being designed on the fly, without a well-formed threat model, and without taking into account lessons learned from the seemingly endless parade of similar, failed schemes.

Ingo is not worried, though; he notes that the ability to place arbitrary tracepoints allows filtering at any spot, not just at system call entry. So the problems associated with system call interception are not necessarily an issue with the ftrace-based scheme. Beyond that, this is a specific sort of security problem:

Your argument really pertains to full-system security solutions - while maximising compatibility and capability and minimizing user inconvenience. _That_ is an extremely hard problem with many pitfalls and snake-oil merchants flooding the roads. But that is not our goal here: the goal is to restrict execution in very brutal but still performant ways.

This has the look of a discussion which will take some time to play out. There is sure to be opposition to turning the event filtering code into another in-kernel security policy language. It may turn out that the simple seccomp extension is more generally palatable. Or something completely different could come along. What is clear is that the sandboxing problem is hard; many smart people have tried to implement it in a number of different ways with varying levels of success. There is no assurance that the solution will be easier this time around.


TuxOnIce: in from the cold?

By Jake Edge
May 13, 2009

As flamewars go, the recent linux-kernel thread about TuxOnIce was pretty tame. Likely weary of heated discussions in the past, the participants mostly swore off the flames with a bid to work together on Linux hibernation (i.e. suspend to disk). But, there still seems to be an impediment to that collaboration. The long out-of-tree history for TuxOnIce, combined with lead developer Nigel Cunningham's inability or unwillingness to work with the community means that TuxOnIce could have a bumpy road into the kernel—if it ever gets there at all.

TuxOnIce, formerly known as suspend2 and swsusp2, is a longstanding out-of-tree solution for hibernation. It has an enthusiastic user community along with some features not available in swsusp, which is the current mainline hibernation code. Some of the advantages claimed by TuxOnIce are support for multiple swap devices or regular files as the suspend image destination, better performance via compressed images and other techniques, saving nearly all of the contents of memory including caches, etc. But its vocal users say that the biggest advantage is that TuxOnIce just works for many—some of whom cannot get the current mainline mechanisms to work.

Much of the recent mainline hibernation work, generally done by Rafael Wysocki and Pavel Machek, has focused on uswsusp, which moves the bulk of the suspend work to user space. So, the kernel already contains two mechanisms for doing hibernation, leaving no real chance for a third to be added.

There are clear disagreements about how much and which parts should be in the kernel versus in user space. Machek seems to think that nearly all of the task can be handled in user space, while Cunningham is in favor of the advantages—performance and being able to take advantage of in-kernel interfaces—of an all kernel approach. Wysocki is somewhere in the middle, outlining some of the advantages he sees in the in-kernel solution:

One benefit is that we need not anything in the initrd for hibernation to work. Another one is that we can get superior performance, for obvious reasons (less copying of data, faster I/O). Yet another is simpler configuration and no need to maintain a separate set of user space tools. I probably could find more.

A bigger disconnect, though, is how to proceed. Cunningham would like to see TuxOnIce merged whole as a parallel alternative to swsusp, with an eye to eventually replacing and removing swsusp. Machek and Wysocki are not terribly interested in replacing swsusp; they would rather see incremental improvements, many coming from the TuxOnIce code, proposed and merged. On the one hand, Cunningham has an entire subsystem that he would like to see merged; on the other, the swsusp folks have a subsystem, used by most distributions for hibernation, to maintain.

Cunningham recently posted an RFC for merging TuxOnIce "with a view to seeking to get it merged, perhaps in 2.6.31 or .32 (depending upon what needs work before it can be merged) and the willingness of those who matter". That was met with a somewhat heated reply by Machek. But Wysocki was quick to step in to try to avoid the flames:

Actually, I see advantages of working together versus fighting flame wars. Please stop that, I'm not going to take part in it this time.

After Cunningham agreed, the discussion turned to how to work together, which is where it seems to have hit an impasse. Wysocki and Cunningham, at least, see some clear advantages in the TuxOnIce code, but, contrary to Cunningham's wishes, having it merged wholesale is likely not in the cards. Cunningham describes his plan as follows:

I'd like to see use have all three [swsusp, uswsusp, and TuxOnIce] for one or two releases of vanilla, just to give time to work out any issues that haven't been foreseen. Once we're all that there are confident there are no regressions with TuxOnIce, I'd remove swsusp. That's my ideal plan of attack.

Not surprisingly, Wysocki and Machek see things differently. Machek is not opposed to bringing some of TuxOnIce into the mainline: "If we are talking about improving mainline to allow tuxonice functionality... then yes, that sounds reasonable." Wysocki lays out an alternative plan that is much more in keeping with traditional kernel development strategies:

So this is an idea to replace our current hibernation implementation with TuxOnIce.

Which unfortunately I don't agree with.

I think we can get _one_ implementation out of the three, presumably keeping the user space interface that will keep the current s2disk binaries happy, by merging TuxOnIce code _gradually_. No "all at once" approach, please.

And by "merging" I mean _exactly_ that. Not adding new code and throwing away the old one.

But, as Cunningham continues pushing for help in getting TuxOnIce merged alongside swsusp, Wysocki points out that it requires a great deal of review to get a huge (10,000+ lines of code) set of patches accepted: "That would take lot of work and we'd also have to ask many other busy people to do a lot of work for us". Cunningham seems to be under the misapprehension that kernel hackers will be willing to merge a subsystem that duplicates another without a clear overriding reason. Easing what he sees as a necessary transition from swsusp to TuxOnIce is not likely to be that compelling.

It is clearly frustrating for Cunningham to have a working solution but be unable to get it into the kernel. But it is a direct result of working out of the tree and then trying to present a solution when the kernel has gone in a different direction. It is a common mistake that folks make when dealing with the kernel community. Ray Lee provides a nice answer to Cunningham's frustrations, which points to IBM's device mapper contribution that suffered from a similar reaction. Lee notes that Wysocki has offered extremely valuable assistance:

He's offering to be the social glue between your code and the forms that are accepted and followed here on LKML. Taking things apart from the whole, finding the pieces that are non-controversial or easily argued for, getting them merged upstream with a user, and then moving on to the rest.

This way, the external TuxOnIce patch set shrinks and shrinks, until it's eventually gone, with all functionality merged into the kernel in one form or another.

Is your code better than uswsusp? Almost certainly. This isn't about that. This is about making your code better than what it is today, by going through the existing review-and-merge process.

At one point, Cunningham pointed to the SL*B memory allocators as an example of parallel implementations that are all available in the mainline. Various folks responded that memory allocators are fairly self-contained, unlike TuxOnIce. Furthermore, as Pekka Enberg notes: "Yes, so please don't make the same mistake we did. Once you have multiple implementations in the kernel, it's extremely hard to get rid of them."

There has been a bit of discussion about the technical aspects of the TuxOnIce patch, mostly centering on the way that it frees up memory to allow enough space to create a suspend image, while still adding the contents of that memory to the suspend image. By relying on existing kernel behavior, which is not necessarily guaranteed for the future, TuxOnIce can save nearly all of the memory contents, whereas swsusp dumps caches and the like to create enough memory to build the suspend image. That means that performance after a resume operation may be impacted as those caches are refilled. Overall, though, the main focus of the discussion has been the way forward; so far, there has been little progress on that front.

This is not the first time that TuxOnIce has gotten to this point. When it was still known as swsusp2, Cunningham made several attempts to get it into the mainline. In March of 2004, Andrew Morton asked that it be broken down into smaller, more easily digested, chunks. The same thing happened again near the end of 2004 when Cunningham proposed adding swsusp2 in one big code ball. It doesn't end there, either: between then and now the same request has been made; at this point one might guess that Cunningham simply isn't willing to do things that way.

There is a real danger that the TuxOnIce features that its users like could be lost—or remain out-of-tree—if something doesn't give. Either Cunningham has to recognize that the only plausible way to get TuxOnIce into the kernel is via the normal kernel development path, or someone else has to pick it up and start that process themselves. With no one (other than Cunningham) pushing for its inclusion, there simply is no other way for it to get into the mainline.


Which I/O controller is the fairest of them all?

By Jonathan Corbet
May 12, 2009
An I/O controller is a system component intended to arbitrate access to block storage devices; it should ensure that different groups of processes get specific levels of access according to a policy defined by the system administrator. In other words, it prevents I/O-intensive processes from hogging the disk. This feature can be useful on just about any kind of system which experiences disk contention; it becomes a necessity on systems running a number of virtualized (or containerized) guests. At the moment, Linux lacks an I/O controller in the mainline kernel. There is, however, no shortage of options out there. This article will look at some of the I/O controller projects currently pushing for inclusion into the mainline.

[Figure: Block layer structure]

For the purposes of this discussion, it may be helpful to refer to your editor's bad artwork, as seen on the right, for a simplistic look at how block I/O happens in a Linux system. At the top, we have several sources of I/O activity. Some requests come from the virtual memory layer, which is cleaning out dirty pages and trying to make room for new allocations. Others come from filesystem code, and others yet will originate directly from user space. It's worth noting that only user-space requests are handled in the context of the originating process; that creates complications that we'll get back to. Regardless of the source, I/O requests eventually find themselves at the block layer, represented by the large blue box in the diagram.

Within the block layer, I/O requests may first be handled by one or more virtual block drivers. These include the device mapper code, the MD RAID layer, etc. Eventually a (perhaps modified) request heads toward a physical device, but first it goes into the I/O scheduler, which tries to optimize I/O activity according to a policy of its own. The I/O scheduler works to minimize seeks on rotating storage, but it may also implement I/O priorities or other policy-related features. When it deems that the time is right, the I/O scheduler passes requests to the physical block driver, which eventually causes them to be executed by the hardware.

All of this is relevant because it is possible to hook an I/O controller into any level of this diagram - and the various controller developers have done exactly that. There are advantages and disadvantages to doing things at each layer, as we will see.


The dm-ioband patch by Ryo Tsuruta (and others) operates at the virtual block driver layer. It implements a new device mapper target (called "ioband") which prioritizes requests passing through. The policy is a simple proportional weighting system; requests are divided up into groups, each of which gets bandwidth according to the weight assigned by the system administrator. Groups can be determined by user ID, group ID, process ID, or process group. Administration is done with the dmsetup tool.

dm-ioband works by assigning a pile of "tokens" to each group. If I/O traffic is low, the controller just stays out of the way. Once traffic gets high enough, though, it will charge each group for every I/O request on its way through. Once a group runs out of tokens, its I/O will be put onto a list where it will languish, unloved, while other groups continue to have their requests serviced. Once all groups which are actively generating I/O have exhausted their tokens, everybody gets a new set and the process starts anew.
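The token cycle is easy to model. The sketch below is a toy version of the scheme as described here, not code from dm-ioband: the structure and function names are invented, each request is assumed to cost exactly one token, a group's refill is assumed to equal its weight, and the "stay out of the way when traffic is low" case is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group state for the toy model. */
struct io_group {
    unsigned int weight;   /* tokens handed out at each refill */
    unsigned int tokens;   /* tokens left in the current round */
    bool active;           /* has issued I/O during the current round */
};

/* Start a new round: everybody gets a fresh set of tokens. */
static void refill(struct io_group *groups, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        groups[i].tokens = groups[i].weight;
        groups[i].active = false;
    }
}

/*
 * Charge group idx for one request.  Returns true if the request may go
 * through now, false if it must wait on the group's list.
 */
static bool charge(struct io_group *groups, size_t n, size_t idx)
{
    struct io_group *g = &groups[idx];

    g->active = true;
    if (g->tokens == 0) {
        /* A new round starts only when every active group is broke. */
        for (size_t i = 0; i < n; i++)
            if (groups[i].active && groups[i].tokens > 0)
                return false;
        refill(groups, n);
        g->active = true;
        if (g->tokens == 0)
            return false;   /* a zero-weight group never gets budget */
    }
    g->tokens--;
    return true;
}
```

With two busy groups weighted 2 and 1, the first gets two requests through for every one from the second, which is the proportional behavior the weights are meant to produce.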

The basic dm-ioband code has a couple of interesting limitations. One is that it does not use the control group mechanism, as would normally be expected for a resource controller. It also has a real problem with I/O operations initiated asynchronously by the kernel. In many cases - perhaps the majority of cases - I/O requests are created by kernel subsystems (memory management, for example) which are trying to free up resources and which are not executing in the context of any specific process. These requests do not have a readily-accessible return label saying who they belong to, so dm-ioband does not know how to account for them. So they run under the radar, substantially reducing the value of the whole I/O controller exercise.

The good news is that there's a solution to both problems in the form of the blkio-cgroup patch, also by Ryo. This patch interfaces between dm-ioband and the control group mechanism, allowing bandwidth control to be applied to arbitrary control groups. Unlike some other solutions, dm-ioband still does not use control groups for bandwidth control policy; control groups are really only used to define the groups of processes to operate on.

The other feature added by blkio-cgroup is a mechanism by which the owner of arbitrary I/O requests can be identified. To this end, it adds some fields to the array of page_cgroup structures in the kernel. This array is maintained by the memory usage controller subsystem; one can think of struct page_cgroup as a bunch of extra stuff added into struct page. Unlike the latter, though, struct page_cgroup is normally not used in the kernel's memory management hot paths, and it's generally out of sight, so people tend not to notice when it grows. But, there is one struct page_cgroup for every page of memory in the system, so this is a large array.

This array already has the means to identify the owner for any given page in the system. Or, at least, it will identify an owner; there's no real attempt to track multiple owners of shared pages. The blkio-cgroup patch adds some fields to this array to make it easy to identify which control group is associated with a given page. Given that, and given that block I/O requests include the address of the memory pages involved, it is not too hard to look up a control group to associate with each request. Modules like dm-ioband can then use this information to control the bandwidth used by all requests, not just those initiated directly from user space.

The advantages of dm-ioband include device-mapper integration (for those who use the device mapper), and a relatively small and well-contained code base - at least until blkio-cgroup is added into the mix. On the other hand, one must use the device mapper to use dm-ioband, and the scheduling decisions made there are unlikely to help the lower-level I/O scheduler implement its policy correctly. Finally, dm-ioband does not provide any sort of quality-of-service guarantees; it simply ensures that each group gets something close to a given percentage of the available I/O bandwidth.


The io-throttle patches by Andrea Righi take a different approach. This controller uses the control group mechanism from the outset, so all of the policy parameters are set via the control group virtual filesystem. The main parameter for each control group is the maximum bandwidth that group can consume; thus, io-throttle enforces absolute bandwidth numbers, rather than dividing up the available bandwidth proportionally as is done with dm-ioband. (Incidentally, both controllers can also place limits on the number of I/O operations rather than bandwidth). There is a "watermark" value; it sets a level of utilization below which throttling will not be performed. Each control group has its own watermark, so it is possible to specify that some groups are throttled before others.

Each control group is associated with a specific block device. If the administrator wants to set identical policies for three different devices, three control groups must still be created. But this approach does make it possible to set different policies for different devices.

One of the more interesting design decisions with io-throttle is its placement in the I/O structure: it operates at the top, where I/O requests are initiated. This approach necessitates the placement of calls to cgroup_io_throttle() wherever block I/O requests might be created. So they show up in various parts of the memory management subsystem, in the filesystem readahead and writeback code, in the asynchronous I/O layer, and, of course, in the main block layer I/O submission code. This makes the io-throttle patch a bit more invasive than some others.

There is an advantage to doing throttling at this level, though: it allows io-throttle to slow down I/O by simply causing the submitting process to sleep for a while; this is generally preferable to filling memory with queued BIO structures. Sleeping is not always possible - it's considered poor form in large parts of the virtual memory subsystem, for example - so io-throttle still has to queue I/O requests at times.
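The sleep-or-queue choice described above might look something like the following sketch. It is a user-space illustration only; the class name, the may_sleep flag, and the deferred list are assumptions standing in for whatever the real patch does in contexts where sleeping is not allowed.

```python
import time
from collections import deque


class SubmissionThrottle:
    """Hypothetical helper: delay a submitter by sleeping when that is allowed,
    and fall back to queuing the request when it is not."""

    def __init__(self, sleep=time.sleep):
        self._sleep = sleep       # injectable so the logic can be exercised without waiting
        self.deferred = deque()   # requests held back because the caller could not sleep

    def admit(self, request, delay_secs, may_sleep):
        """Return "submit" if the request may go ahead now, "queued" otherwise."""
        if delay_secs <= 0:
            return "submit"
        if may_sleep:
            # Preferred path: the submitting process simply waits,
            # so no memory is consumed by held-back requests.
            self._sleep(delay_secs)
            return "submit"
        # Sleeping is not acceptable in this context; remember the
        # request and the delay it still owes.
        self.deferred.append((request, delay_secs))
        return "queued"
```

The point of the design is visible in the two branches: the sleeping path holds no state at all, while the fallback path accumulates requests in memory, which is exactly the cost that throttling at submission time tries to avoid.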

The io-throttle code does not provide true quality of service, but it gets a little closer. If the system administrator does not over-subscribe the block device, then each group should be able to get the amount of bandwidth which has been allocated to it. This controller handles the problem of asynchronously-generated I/O requests in the same way dm-ioband does: it uses the blkio-cgroup code.

The advantages of the io-throttle approach include relatively simple code and the ability to throttle I/O by causing processes to sleep. On the down side, operating at the I/O creation level means that hooks must be placed into a number of kernel subsystems - and maintained over time. Throttling I/O at this level may also interfere with I/O priority policies implemented at the I/O scheduler level.


Both dm-ioband and io-throttle suffer from a significant problem: they can defeat the policies (such as I/O priority) being implemented by the I/O scheduler. Given that a bandwidth control module is, for all practical purposes, an I/O scheduler in its own right, one might think that it would make sense to do bandwidth control at the I/O scheduler level. The io-controller patches by Vivek Goyal do just that.

Io-controller provides a conceptually simple, control-group-based mechanism. Each control group is given a weight which determines its access to I/O bandwidth. Control groups are not bound to specific devices in io-controller, so the same weights apply for access to every device in the system. Once a process has been placed within a control group, it will have bandwidth allocated out of that group's weight, with no further intervention needed - at least, for any block device which uses one of the standard I/O schedulers.
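As a rough illustration of how weights translate into bandwidth, consider the arithmetic below. It assumes that a group's share is its weight divided by the total weight of the groups which currently have I/O pending; that is a common convention for proportional schedulers, but it is an assumption here rather than something taken from the patch, and the function name is made up.

```python
from fractions import Fraction


def bandwidth_shares(weights, active):
    """Map each active group to its fraction of the device's bandwidth.

    weights: dict of group name -> integer weight (hypothetical values)
    active:  iterable of group names that currently have I/O pending
    """
    active = list(active)
    total = sum(weights[g] for g in active)
    # Each active group receives bandwidth in proportion to its weight.
    return {g: Fraction(weights[g], total) for g in active}
```

With weights of 100, 200, and 100 for groups a, b, and c, group b gets half of the bandwidth while all three are busy; if c goes idle, the split between a and b becomes one third and two thirds. Since the groups are not bound to devices, the same calculation would apply to every device in the system.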

The io-controller code has been designed to work with all of the mainline I/O schedulers: CFQ, Deadline, Anticipatory, and no-op. Making that work requires significant changes to those schedulers; they all need to have a hierarchical, fair-scheduling mechanism to implement the bandwidth allocation policy. The CFQ scheduler already has a single level of fair scheduling, but the io-controller code needs a second level. Essentially, one level implements the current CFQ fair queuing algorithm - including I/O priorities - while the other applies the group bandwidth limits. What this means is that bandwidth limits can be applied in a way which does not distort the other I/O scheduling decisions made by CFQ. The other I/O schedulers lack multiple queues (even at a single level), so the io-controller patch needs to add them.
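The two-level idea can be sketched in a few lines. What follows is a toy model, not the algorithm used by the io-controller patches: the group level is a simple virtual-time weighted fair queue, the inner level just picks the request with the best I/O priority (lower number first), and every class and method name is hypothetical.

```python
import heapq
from itertools import count


class Group:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.vtime = 0.0   # service received so far, scaled by weight
        self.queue = []    # heap of (ioprio, seq, size, label)


class TwoLevelScheduler:
    """Toy model: the outer level shares bandwidth between groups by weight,
    the inner level orders requests within a group by I/O priority."""

    def __init__(self):
        self.groups = {}
        self._seq = count()  # tie-breaker that keeps FIFO order within a priority

    def add_group(self, name, weight):
        self.groups[name] = Group(name, weight)

    def submit(self, group, ioprio, size, label):
        entry = (ioprio, next(self._seq), size, label)
        heapq.heappush(self.groups[group].queue, entry)

    def dispatch(self):
        """Return (group name, request label) for the next request, or None."""
        busy = [g for g in self.groups.values() if g.queue]
        if not busy:
            return None
        # Outer level: the group which has received the least weighted service.
        g = min(busy, key=lambda grp: (grp.vtime, grp.name))
        # Inner level: the best-priority request within that group.
        _ioprio, _seq, size, label = heapq.heappop(g.queue)
        # Charge the group; a larger weight makes its virtual time advance more slowly.
        g.vtime += size / g.weight
        return g.name, label
```

Because the priority decision is made only after a group has been chosen, the group-level bandwidth split never reorders requests within a group, which is the property the article describes.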

Vivek's patch starts by stripping the current multi-queue code out of CFQ, adding multiple levels to it, and making it part of the generic elevator code. That allows all of the I/O schedulers to make use of it with (relatively) little code churn. The CFQ code shrinks considerably, but the other schedulers do not grow much. Vivek, too, solves the asynchronous request problem with the blkio-cgroup code.

This approach has the clear advantage of performing bandwidth throttling in ways consistent with the other policies implemented by the I/O scheduler. It is well contained, in that it does not require the placement of hooks in other parts of the kernel, and it does not require the use of the device mapper. On the other hand, it is by far the largest of the bandwidth controller patches, it cannot implement different policies for different devices, and it doesn't yet work reliably with all I/O schedulers.

Choosing one

The proliferation of bandwidth controllers has been seen as a problem for at least the last year. There is no interest in merging multiple controllers, so, at some point, it will become necessary to pick one of them to put into the mainline. It has been hoped that the various developers involved would get together and settle on one implementation, but that has not yet happened, leading Andrew Morton to proclaim recently:

I'm thinking we need to lock you guys in a room and come back in 15 minutes.

Seriously, how are we to resolve this? We could lock me in a room and come back in 15 days, but there's no reason to believe that I'd emerge with the best answer.

At the Storage and Filesystem Workshop in April, the storage track participants appear to have been leaning heavily toward a solution at the I/O scheduler level - and, thus, io-controller. The cynical among us might be tempted to point out that Vivek was in the room, while the developers of the competing offerings were not. But such people should also ask why an I/O scheduling problem should be solved at any other level.

In any case, the developers of dm-ioband and io-throttle have not stopped their work since this workshop was held, and the wider kernel community has not yet made a decision in this area. So the picture remains only slightly less murky than before. About the only clear area of consensus would appear to be the use of blkio-cgroup for the tracking of asynchronously-generated requests. For the rest, the locked-room solution may yet prove necessary.


Page editor: Jonathan Corbet

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds