Brief items
The current development kernel is 2.6.30-rc5,
released by Linus on May 8.
"
Driver updates (SCSI being the bulk of it, but there are input
layer, networking, DRI and MD changes too). Arch updates (mostly ARM
"davinci" support, but some x86 and even alpha). And various random stuff
(fairly big cifs update, but some smaller ocfs2 and xfs updates, and a fair
amount of small one-liners all over)." See the long-format changelog for
all the details.
The current stable 2.6 kernel is 2.6.29.3, released with a long list of
fixes on May 8. 2.6.27.23 was released at the
same time; as promised, updates for the 2.6.28 kernel have ended.
Kernel development news
If you are getting no feedback just submit it next merge
window. Either its offended nobody or they've forgotten to notice -
in both cases submitting it will have the desired effect.
--
Alan Cox
I'm thinking of an app which prepares pages full of scurrilous
rumour, then waits around looking at its /proc/self/smaps to see if
anyone else is writing stories like that!
--
Hugh Dickins ponders security threats
Well I'm sorry I hardcoded a lack of beer into the serial layer to
save a microsecond, you'll have to go without.... It works for me
so clearly your usage pattern isn't interesting.
--
Alan Cox
The inventor of copy-n-paste has a lot to answer for.
--
Andrew Morton
By Jonathan Corbet
May 13, 2009
Editor's note: it's no secret that far more happens on the kernel
mailing lists than can ever be reported on this page. As a result,
interesting discussions and developments often slip by without a mention
here. This article is the beginning of an experimental attempt to improve
that situation. The idea is to briefly mention important topics which have
not yet been developed into a full Kernel Page article. Some items will
be followups from previous discussions; others may foreshadow full articles
to come.
The "In brief" article will probably not appear every week. But, if it
works out, it should become a semi-regular feature filling out LWN's kernel
coverage. Comments are welcome.
reflink(): the proposed reflink() system call was covered last week. Since then,
there have been some followup postings. reflink() v2, posted on
May 7, maintained the reflink-as-snapshot semantics. When asked about
that decision, Joel Becker responded:
"reflink() is a snapshotting call, not a kitchen sink." It seemed
like there was to be no comfort for those wanting reflink-as-copy
semantics.
reflink() v4, posted on the 11th, changed
that tune somewhat. In this version, a process which either (1) owns the
target file, or (2) has sufficient capabilities will create a link which
copies the original security information - reflink-as-snapshot,
essentially. A process lacking ownership and privilege, but having read
access to the target file, will get a reflink with "new file" security
information - reflink-as-copy. The idea is to do the right thing in all
situations, but some developers are now concerned about a system call which
has different semantics for processes running as root. This conversation
has a while to go yet.
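The v4 decision rule can be modeled in a few lines of Python. This is only a sketch of the behavior described above, not code from the patch: the function name, the enum, and the treatment of a caller with no read access at all (modeled here as a failure) are assumptions made for illustration.

```python
from enum import Enum


class ReflinkKind(Enum):
    SNAPSHOT = "snapshot"  # security information copied from the original
    COPY = "copy"          # "new file" security information


def reflink_kind(is_owner, is_privileged, can_read):
    """Pick the reflink() v4 semantics for a caller (illustrative model only)."""
    if is_owner or is_privileged:
        return ReflinkKind.SNAPSHOT
    if can_read:
        return ReflinkKind.COPY
    # Assumption: without ownership, privilege, or read access the call fails.
    raise PermissionError("no access to the target file")
```

The model makes the objection visible: the same call yields a different kind of link depending only on who makes it.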
devtmpfs was also covered
last week. This patch, too, has been reposted; the resulting
conversation, again, looks to go on for a while. The return of devfs was
always going to be controversial; the first version, after all, inspired
flame wars for years before being merged. The devtmpfs developers feel
that they need this feature to provide distributions which boot quickly and
reliably in a number of situations; others think that there are better
solutions to the problem. There is no consensus on merging this code at
this time, but it is worth noting that the discussion has slowly shifted
away from general opposition and toward fixing problems with the code.
Wakelocks are back, but now the facility has been rebranded as "suspend block." The core idea is
the same: it allows code in kernel or user space to keep the system from
suspending for a brief period of time. The user-space API has changed;
there is now a /dev/suspend_blocker device which provides a couple
of ioctl() calls. Closing the device releases the block,
eliminating a potential problem with the wakelock API where a failed
process could leave a block in place indefinitely.
There has been relatively little discussion of the new code; either
everybody is happy with it now, or nobody has really noticed the new
posting yet.
Doctor, it HZ. Much of the kernel is now tickless and equipped with
high-resolution timers. So, says
Alok Kataria, there is really no need to run x86 systems with a 1ms
clock tick anymore. Running with HZ=1000 measurably slows the execution of
a CPU-bound loop. So why not lower it?
There are problems with a lower HZ value, though, many of which have, at
their source, the same problem which makes HZ=1000 more expensive: the
kernel is still not truly tickless. Yes, the periodic clock interrupt is
turned off when the processor is idle. But, when the CPU is busy, the
clock ticks away as usual. Making the system fully tickless is a harder
job than just making the idle state tickless; among other things, it pretty
much requires doing away with the jiffies variable and all that
depends on it. But, until that happens, lowering HZ will have costs of its
own.
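One cost commonly associated with a lower HZ, offered here as an illustration rather than as something taken from the discussion, is coarser granularity for anything still driven by jiffies. The Python arithmetic below assumes that a timeout given in whole milliseconds must be rounded up to a whole number of ticks; the function names are made up for the example.

```python
def tick_period_ms(hz):
    """Length of one clock tick in milliseconds."""
    return 1000 / hz


def timeout_ticks(timeout_ms, hz):
    """Round a whole-millisecond timeout up to whole ticks."""
    return (timeout_ms * hz + 999) // 1000


def effective_timeout_ms(timeout_ms, hz):
    """How long a tick-driven timeout lasts once rounded up to ticks."""
    return timeout_ticks(timeout_ms, hz) * tick_period_ms(hz)
```

Under these assumptions a 3ms timeout lasts 3ms with HZ=1000, but stretches to a full 10ms tick with HZ=100.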
Wu Fengguang has been trying for a while to extend /proc/kpageflags;
his patch adds a great deal
of information about the usage of memory in the system. One might think
that adding more useful information would be uncontroversial, but Ingo
Molnar continues to oppose its inclusion.
Ingo does not like the interface or the fact that it lives in
/proc; his preferred solution looks more like an extension to ftrace. More
thought toward the creation of uniform instrumentation interfaces is
probably a good idea, but the current /proc/kpageflags interface
has proved useful. It's also an established kernel ABI, so it's not going
away anytime soon. But whether /proc/kpageflags will be extended
further remains to be seen.
By Jonathan Corbet
May 13, 2009
Back in 2005, Andrea Arcangeli, mostly known for memory management work in
those days, wandered into the security field with the "secure computing"
(or "seccomp") feature. Seccomp was meant to support a side business
of his which would enable owners of Linux systems to rent out their
CPUs to people doing serious processing work. Allowing strangers to run
arbitrary code is something that people tend to be nervous about; they
require some pretty strong assurance that this code will not have general
access to their systems.
Seccomp solves this problem by putting a strict sandbox around processes
running code from others. A process running in seccomp
mode is severely limited in what it can do; there are only four system
calls - read(), write(), exit(), and
sigreturn() - available to it. Attempts to call any other system
call result in immediate termination of the process.
The idea is that a control process could obtain the code to be run and load
it into memory. After setting up its file descriptors appropriately, this
process would call:
prctl(PR_SET_SECCOMP, 1);
to enable seccomp mode. Once straitjacketed in this way, it would jump
into the guest code, knowing that no real harm could be done. The guest
code can run on the CPU and communicate over the file descriptors given to
it, but it has no other access to the system.
Andrea's CPUShare never quite took off, but seccomp remained in the
kernel. Last February, when a security hole was found in the seccomp code,
Linus wondered whether it was being used at
all. It seems likely that there were, in fact, no users at that time, but
there was one significant prospective user: Google.
Google is not looking to use seccomp to create a distributed computing
network; one assumes that, by now, they have developed other solutions to
that problem. Instead, Google is looking for secure ways to run plugins in
its Chrome browser. The Chrome sandbox is described this way:
Sandbox leverages the OS-provided security to allow code execution
that cannot make persistent changes to the computer or access
information that is confidential. The architecture and exact
assurances that the sandbox provides are dependent on the operating
system. Currently the only finished implementation is for Windows.
It seems that the Google developers thought that seccomp would make a good
platform on which to create a "finished implementation" for Linux. Google
developer Markus Gutschke said:
Simplicity is really the beauty of seccomp. It is very easy to
verify that it does the right thing from a security point of view,
because any attempt to call unsafe system calls results in the
kernel terminating the program. This is much preferable over most
ptrace solutions which is more difficult to audit for correctness.
The downside is that the sandbox'd code needs to delegate execution
of most of its system calls to a monitor process. This is slow and
rather awkward. Although due to the magic of clone(), (almost) all
system calls can in fact be serialized, sent to the monitor
process, have their arguments safely inspected, and then executed
on behalf of the sandbox'd process. Details are tedious but we
believe they are solvable with current kernel APIs.
There is, however, the little problem that sandboxed code
can usefully (and safely) invoke more than the four allowed system calls. That limitation
can be worked around ("tedious details"), but performance suffers. What
the Chrome developers would like is a more flexible way of specifying which
system calls can be run directly by code inside the sandbox.
One suggestion that came out was to add a new "mode" to seccomp. The API
was designed with the idea that different applications might have different
security requirements; it includes a "mode" value which specifies the
restrictions that should be put in place. Only the original mode has ever been
implemented, but others can certainly be added. Creating a new mode which
allowed the initiating process to specify which system calls would be
allowed would make the facility more useful for situations like the Chrome
sandbox.
Adam Langley (also of Google) has posted a patch which does just that.
The new "mode 2" implementation accepts a bitmask describing which
system calls are accessible. If one of those is prctl(), then the
sandboxed code can further restrict its own system calls (but it cannot
restore access to system calls which have been denied). All told, it looks
like a reasonable solution which could make life easier for sandbox
developers.
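The bitmask idea can be sketched as a small user-space model in Python. It is not the patch's interface: the class, its method names, and the use of x86-64 system call numbers are assumptions chosen for illustration. What it shows is the one-way nature of the scheme, where a sandbox that retains prctl() can clear bits in its mask but never set them again.

```python
# Illustrative system call numbers (these are the x86-64 values).
SYS_READ, SYS_WRITE, SYS_RT_SIGRETURN = 0, 1, 15
SYS_EXIT, SYS_GETTIMEOFDAY, SYS_PRCTL = 60, 96, 157


def mask_of(*syscalls):
    """Build a bitmask with one bit set per allowed system call number."""
    mask = 0
    for nr in syscalls:
        mask |= 1 << nr
    return mask


class Sandbox:
    """Hypothetical model of a 'mode 2' sandbox; not a kernel API."""

    def __init__(self, allowed_mask):
        self.allowed = allowed_mask

    def permits(self, nr):
        return bool((self.allowed >> nr) & 1)

    def restrict(self, new_mask):
        """Tighten the mask; only possible while prctl() is still allowed."""
        if not self.permits(SYS_PRCTL):
            raise PermissionError("prctl() is not in the allowed set")
        # AND-ing can only clear bits, so denied calls stay denied.
        self.allowed &= new_mask
```

In this model, asking for a system call that was never granted has no effect, and dropping prctl() itself makes the mask final.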
That said, this code may never be merged because the discussion has since
moved on to other possibilities. Ingo Molnar, who has been arguing for the
use of the ftrace framework in a number of situations, thinks that ftrace is a perfect fit for the
Chrome sandbox problem as well. He might be right, but only for a version
of ftrace which is not yet generally available.
Using ftrace for sandboxing may seem a little strange; a tracing framework
is supposed to report on what is happening while perturbing the situation
as little as possible. But ftrace has a couple of tools which may be
useful in this situation. The system call tracer is there now, making it
easy to hook into every system call made by a given process. In addition, the current
development tree (perhaps destined for 2.6.31) includes an event filter
mechanism which can be used to filter out events based on an arbitrary
boolean expression. By using ftrace's event filters, the sandbox could go beyond
just restricting system calls; it could also place limits on the arguments
to those system calls. An example supplied
by Ingo looks like this:
{ "sys_read", "fd == 0" },
{ "sys_write", "fd == 1" },
{ "sys_sigreturn", "1" },
{ "sys_gettimeofday", "tz == NULL" },
These expressions implement something similar to mode 1 seccomp. But,
additionally, read() is limited to the standard input and
write() to the standard output. The sandboxed process is also
allowed to call gettimeofday(), but it is not given access to the
time zone information.
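To make the semantics of that table concrete, here is a minimal Python model of the same four rules. It does not use the ftrace filter engine or its expression syntax; each expression is simply hand-translated into a predicate over a dictionary of named arguments, and the allowed() helper is a name invented for this sketch. Any system call without an entry is treated as denied.

```python
# Each entry mirrors one line of the example: system call name -> condition.
FILTERS = {
    "sys_read": lambda args: args["fd"] == 0,
    "sys_write": lambda args: args["fd"] == 1,
    "sys_sigreturn": lambda args: True,
    "sys_gettimeofday": lambda args: args["tz"] is None,
}


def allowed(syscall, args):
    """Return True if the call passes its filter; unknown calls are denied."""
    predicate = FILTERS.get(syscall)
    return predicate is not None and predicate(args)
```

For example, allowed("sys_read", {"fd": 0}) is accepted, while a read from any other descriptor or a call to an unlisted system call is refused.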
The expressions can be arbitrarily complex. They are also claimed to be
very fast; Ingo claims that they are quicker than the evaluation of
security module hooks. And, if straight system call filtering is not
enough, arbitrary tracepoints can be placed elsewhere. All told, it does
seem like a fairly general mechanism for restricting what a given process
can do.
The problem cannot really be seen as solved yet, though. The event tracing
code is very new and mostly unused so far. It is out of the mainline
still, meaning that it could easily be a year or so until it shows up in
kernels shipped by distributions. The code allowing this mechanism to be
used to control execution is yet to be written. So Chrome will not have a
sandbox based on anything other than mode 1 seccomp for some time
(though the Chrome developers are also evaluating using SELinux for this
purpose).
Beyond that, there are some real doubts about whether system call
interception is the right way to sandbox a process. There are well-known
difficulties with trying to verify parameters if they are stored in user
space; a hostile process can attempt to change them between the execution
of security checks and the actual use of the data. There are also
interesting interactions between system calls and multiple ways to do a
number of things, all of which can lead to a leaky sandbox. All of this
has led James Morris to complain:
I'm concerned that we're seeing yet another security scheme being
designed on the fly, without a well-formed threat model, and
without taking into account lessons learned from the seemingly
endless parade of similar, failed schemes.
Ingo is not worried, though; he notes that the ability to place arbitrary
tracepoints allows filtering at any spot, not just at system call entry.
So the problems associated with system call interception are not
necessarily an issue with the ftrace-based scheme.
Beyond that, this is a specific sort of security problem:
Your argument really pertains to full-system security solutions -
while maximising compatibility and capability and minimizing user
inconvenience. _That_ is an extremely hard problem with many pitfalls
and snake-oil merchants flooding the roads. But that is not our
goal here: the goal is to restrict execution in very brutal but
still performant ways.
This has the look of a discussion which will take some time to play out.
There is sure to be opposition to turning the event filtering code into
another in-kernel security policy language. It may turn out that the
simple seccomp extension is more generally palatable. Or something
completely different could come along. What is clear is that the
sandboxing problem is hard; many smart people have tried to implement it in
a number of different ways with varying levels of success. There is no
assurance that the solution will be easier this time around.
By Jake Edge
May 13, 2009
As flamewars go, the recent linux-kernel thread about TuxOnIce was pretty tame. Likely weary of
heated discussions in the past, the participants
mostly swore off the flames with a bid to work together on Linux
hibernation (i.e. suspend to disk). But, there still seems to be an
impediment to that collaboration. The long out-of-tree history for
TuxOnIce, combined with lead developer Nigel Cunningham's inability or
unwillingness to work with the community means that TuxOnIce could have a
bumpy road into the kernel—if it ever gets there at all.
TuxOnIce, formerly known as suspend2 and swsusp2, is a longstanding out-of-tree
solution for hibernation. It has an enthusiastic user community along with
some features not available in swsusp, which is the current mainline
hibernation code. Some of the advantages claimed by TuxOnIce are support
for multiple swap devices or regular files as the suspend image
destination, better performance via compressed images and other techniques,
saving nearly all of the contents of memory including caches, etc. But its
vocal users say that the biggest advantage is that TuxOnIce just works for
many—some of
whom cannot get the current mainline mechanisms to work.
Much of the recent mainline hibernation work, generally done by Rafael
Wysocki and Pavel Machek, has focused on uswsusp, which moves the bulk of
the suspend work to user space. So, the kernel already contains two
mechanisms for doing hibernation, leaving no real chance for a third to be
added.
There are clear disagreements about how much and which parts should be in
the kernel versus in user space. Machek seems to think that nearly all of
the task can be handled in user space, while Cunningham is in favor of the
advantages—performance and being able to take advantage of in-kernel
interfaces—of an all kernel approach. Wysocki is somewhere in the
middle, outlining some of the advantages
he sees in the in-kernel solution:
One benefit is that we need not anything in the initrd for hibernation to work.
Another one is that we can get superior performance, for obvious reasons
(less copying of data, faster I/O). Yet another is simpler configuration and
no need to maintain a separate set of user space tools. I probably could
find more.
A bigger disconnect, though, is how to
proceed. Cunningham would like to see TuxOnIce merged whole as a parallel
alternative to swsusp, with an eye to eventually replacing and removing swsusp.
Machek and Wysocki are not terribly interested in replacing swsusp; they
would rather see incremental improvements—many coming from the
TuxOnIce code—proposed and merged. On the one hand,
Cunningham has an entire subsystem that he would like to see merged, while
the swsusp folks have a subsystem—used by most distributions for
hibernation—to maintain.
Cunningham recently posted an RFC for
merging TuxOnIce "with a view to seeking to get it
merged, perhaps in 2.6.31 or .32 (depending upon what needs work before
it can be merged) and the willingness of those who matter". That
was met with a somewhat heated reply by
Machek. But Wysocki was quick to step in to
try to avoid the flames:
Actually, I see advantages of working together versus fighting flame wars.
Please stop that, I'm not going to take part in it this time.
After Cunningham agreed, the discussion turned to how to work together,
which is where it seems to have hit an impasse. Wysocki and Cunningham, at
least, see some clear advantages in the TuxOnIce code, but, contrary to
Cunningham's wishes, having it merged wholesale is likely not in the
cards. Cunningham describes his plan as
follows:
I'd like to see use have all three [swsusp, uswsusp, and TuxOnIce] for one
or two releases of vanilla,
just to give time to work out any issues that haven't been foreseen.
Once we're all that there are confident there are no regressions with
TuxOnIce, I'd remove swsusp. That's my ideal plan of attack.
Not surprisingly, Wysocki and Machek see things differently. Machek is not
opposed to bringing some of TuxOnIce into the mainline: "If we are
talking about improving mainline to allow tuxonice
functionality... then yes, that sounds reasonable." Wysocki lays
out an alternative plan that is much more
in keeping with traditional kernel development strategies:
So this is an idea to replace our current hibernation implementation with
TuxOnIce.
Which unfortunately I don't agree with.
I think we can get _one_ implementation out of the three, presumably keeping
the user space interface that will keep the current s2disk binaries happy, by
merging TuxOnIce code _gradually_. No "all at once" approach, please.
And by "merging" I mean _exactly_ that. Not adding new code and throwing
away the old one.
But, as Cunningham continues pushing for help in getting TuxOnIce merged
alongside swsusp, Wysocki points out that it requires a great deal of
review to get a huge (10,000+ lines of code) set of patches accepted:
"That would take lot of work and we'd also have to ask many other
busy people to do a lot of work for us". Cunningham seems to be
under the misapprehension that kernel hackers will be willing to merge a
subsystem that duplicates another without a clear overriding reason.
Easing what he sees as a necessary
transition from swsusp to TuxOnIce is not likely to be that compelling.
It is clearly frustrating for Cunningham to have a working solution but be
unable to get it into the kernel. But it is a direct result of working out
of the tree and then trying to present a solution when the kernel has gone
in a different direction. It is a common mistake that folks make when
dealing with the kernel community. Ray Lee provides a nice answer to Cunningham's frustrations, which
points to IBM's device mapper contribution that suffered from a similar
reaction. Lee notes that Wysocki has offered extremely valuable
assistance:
He's offering to be the social glue between your code and the forms
that are accepted and followed here on LKML. Taking things apart from
the whole, finding the pieces that are non-controversial or easily
argued for, getting them merged upstream with a user, and then moving
on to the rest.
This way, the external TuxOnIce patch set shrinks and shrinks, until
it's eventually gone, with all functionality merged into the kernel in
one form or another.
Is your code better than uswsusp? Almost certainly. This isn't about
that. This is about making your code better than what it is today, by
going through the existing review-and-merge process.
At one point, Cunningham pointed to the SL*B memory allocators as an
example of parallel implementations that are all available in the
mainline. Various folks
responded that memory allocators are fairly self-contained, unlike
TuxOnIce. Furthermore,
as Pekka Enberg notes: "Yes, so
please don't make the same mistake we did. Once you have
multiple implementations in the kernel, it's extremely hard to get rid
of them."
There has been a bit of discussion about the technical aspects of the
TuxOnIce patch, mostly centering on the way that it frees up memory to
allow enough space to create a suspend image, while still adding the
contents of that memory to the suspend image. By relying on existing
kernel behavior, which is
not necessarily guaranteed for the future, TuxOnIce can save nearly all of
the memory contents, whereas swsusp dumps caches and the like to create
enough memory to build the suspend image. That means that performance after a
resume operation may be impacted as those caches are refilled. Overall,
though, the main focus of the discussion has been the way forward; so far,
there has been little progress on that front.
This is not the first time that TuxOnIce has gotten to this point. In its
earlier guise as swsusp2, Cunningham made several attempts to get it into
the mainline. In March of 2004, Andrew Morton asked that it be broken down into smaller, more
easily digested, chunks. The same thing
happened again near the end of 2004 when Cunningham proposed adding swsusp2
in one big code ball. It doesn't end there, either: between then and now,
the same request has been made; at this point one might guess that
Cunningham simply isn't willing to do things that way.
There is a real danger that the TuxOnIce features that its users like could
be lost—or
remain out-of-tree—if something doesn't give. Either Cunningham has
to recognize that the only plausible way to get TuxOnIce into the kernel is
via the normal kernel development path, or someone else has to pick it up
and start that process themselves. With no one (other than Cunningham)
pushing for its inclusion, there simply is no other way for it to get into
the mainline.
By Jonathan Corbet
May 12, 2009
An I/O controller is a system component intended to arbitrate access to
block storage devices; it should ensure that different groups of processes
get specific levels of access according to a policy defined by the system
administrator. In other words, it prevents I/O-intensive processes from
hogging the disk.
This feature can be useful on just about any kind of system
which experiences disk contention; it becomes a necessity on systems
running a number of virtualized (or containerized) guests. At the moment,
Linux lacks an I/O controller in the mainline kernel. There is, however,
no shortage of options out there. This article will look at some of the
I/O controller projects currently pushing for inclusion into the mainline.
For the purposes of this discussion, it may be helpful to refer to your
editor's bad artwork, as seen on the right, for a simplistic look at how
block I/O happens in a Linux system. At the top, we have several sources
of I/O activity. Some requests come from the virtual memory layer, which
is cleaning out dirty pages and trying to make room for new allocations.
Others come from filesystem code, and others yet will originate directly
from user space. It's worth noting that only user-space requests are
handled in the context of the originating process; that creates
complications that we'll get back to. Regardless of the source, I/O
requests eventually find themselves at the block layer, represented by the
large blue box in the diagram.
Within the block layer, I/O requests may first be handled by one or more
virtual block drivers. These include the device mapper code, the MD RAID
layer, etc. Eventually a (perhaps modified) request heads toward a
physical device, but first it goes into the I/O scheduler, which tries to
optimize I/O activity according to a policy of its own. The I/O scheduler
works to minimize seeks on rotating storage, but it may also implement I/O
priorities or other policy-related features. When it deems that
the time is right, the I/O scheduler passes requests to the physical block driver,
which eventually causes them to be executed by the hardware.
All of this is relevant because it is possible to hook an I/O controller
into any level of this diagram - and the various controller developers have
done exactly that. There are advantages and disadvantages to doing things
at each layer, as we will see.
dm-ioband
The dm-ioband
patch by Ryo Tsuruta (and others) operates at the virtual block
driver layer. It implements a new device mapper target (called "ioband")
which prioritizes requests passing through. The policy is a simple proportional
weighting system; requests are divided up into groups, each of which gets
bandwidth according to the weight assigned by the system administrator.
Groups can be determined by user ID, group ID, process ID, or process
group. Administration is done with the dmsetup tool.
dm-ioband works by assigning a pile of "tokens" to each group. If I/O
traffic is low, the controller just stays out of the way. Once traffic
gets high enough, though, it will charge each group for every I/O request
on its way through. Once a group runs out of tokens, its I/O will be put
onto a list where it will languish, unloved, while other groups continue to
have their requests serviced. Once all groups which are actively
generating I/O have exhausted their tokens, everybody gets a new set and
the process starts anew.
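The token scheme is easy to simulate. The Python sketch below is a toy model, not dm-ioband code: it assumes every group always has I/O waiting (the saturated case), charges one token per request, and hands out tokens in proportion to weight. The function name and the tokens_per_weight parameter are invented for the example.

```python
from collections import Counter


def simulate_ioband(weights, requests, tokens_per_weight=1):
    """Serve `requests` I/Os from always-busy groups; return per-group counts."""
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("every group needs a positive weight")
    if tokens_per_weight < 1:
        raise ValueError("tokens_per_weight must be at least 1")

    def refill():
        return {group: w * tokens_per_weight for group, w in weights.items()}

    tokens = refill()
    served = Counter()
    done = 0
    while done < requests:
        progressed = False
        for group in weights:
            if done == requests:
                break
            if tokens[group] > 0:
                tokens[group] -= 1   # charge the group for this request
                served[group] += 1
                done += 1
                progressed = True
            # A group with no tokens left just waits this round.
        if not progressed:
            tokens = refill()        # all groups exhausted: everybody gets a new set
    return served
```

With weights of 3 and 1, the model services three requests from the first group for every one from the second, which is the proportional behavior described above.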
The basic dm-ioband code has a couple of interesting limitations. One is
that it does not use the control group mechanism, as would normally be
expected for a resource controller. It also has a real problem with I/O
operations initiated asynchronously by the kernel. In many cases - perhaps
the majority of cases - I/O requests are created by kernel subsystems
(memory management, for example) which are trying to free up resources and
which are not executing in the context of any specific process. These
requests do not have a readily-accessible return label saying who they
belong to, so dm-ioband does not know how to account for them. So they run
under the radar, substantially reducing the value of the whole I/O
controller exercise.
The good news is that there's a solution to both problems in the form of the blkio-cgroup patch, also by
Ryo. This patch interfaces between dm-ioband and the control group
mechanism, allowing bandwidth control to be applied to arbitrary control
groups. Unlike some other solutions, dm-ioband still does not use control
groups for bandwidth control policy; control groups are really only used to
define the groups of processes to operate on.
The other feature added by blkio-cgroup is a mechanism by which the owner
of arbitrary I/O requests can be identified. To this end, it adds some
fields to the array of page_cgroup structures in the
kernel. This array is maintained by the memory usage controller subsystem;
one can think of struct page_cgroup as a bunch of extra stuff
added into struct page. Unlike the latter, though, struct
page_cgroup is normally not used in the kernel's memory management hot
paths, and it's generally out of sight, so people tend not to notice when
it grows. But, there is one struct page_cgroup for every page of
memory in the system, so this is a large array.
This array already has the means to identify the owner for any given page
in the system. Or, at least, it will identify an owner; there's no
real attempt to track multiple owners of shared pages. The blkio-cgroup
patch adds some fields to this array to make it easy to identify which
control group is associated with a given page. Given that, and given that
block I/O requests include the address of the memory pages involved, it is
not too hard to look up a control group to associate with each request. Modules
like dm-ioband can then use this information to control the bandwidth used
by all requests, not just those initiated directly from user space.
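As a rough picture of that lookup, consider the following Python model. The class and field names are hypothetical and do not correspond to the kernel's structure layout, and charging a whole request to the owner of its first page is an assumption made to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class PageCgroup:
    cgroup_id: int        # hypothetical field: control group owning the page


@dataclass
class BlockRequest:
    page_frames: list     # page frame numbers of the pages being transferred
    nbytes: int


def request_owner(page_cgroups, request):
    """Find the control group to charge for a request."""
    # Assumption: the owner of the first page pays for the whole request.
    return page_cgroups[request.page_frames[0]].cgroup_id


def account(page_cgroups, requests):
    """Total the bytes charged to each control group."""
    usage = {}
    for req in requests:
        owner = request_owner(page_cgroups, req)
        usage[owner] = usage.get(owner, 0) + req.nbytes
    return usage
```

Because the lookup starts from the page rather than from the submitting process, it works equally well for writeback issued by the kernel on somebody else's behalf.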
The advantages of dm-ioband include device-mapper integration (for those
who use the device mapper), and a relatively small and well-contained code base - at least
until blkio-cgroup is added into the mix. On the other hand, one must use
the device mapper to use dm-ioband, and the scheduling decisions made there
are unlikely to help the lower-level I/O scheduler implement its policy
correctly. Finally, dm-ioband does not provide any sort of
quality-of-service guarantees; it simply ensures that each group gets
something close to a given percentage of the available I/O bandwidth.
io-throttle
The io-throttle patches by
Andrea Righi take a different approach. This controller uses the control
group mechanism from the outset, so all of the policy parameters are set
via the control group virtual filesystem. The main parameter for each
control group is the maximum bandwidth that group can consume; thus,
io-throttle enforces absolute bandwidth numbers, rather than dividing up
the available bandwidth proportionally as is done with dm-ioband.
(Incidentally, both
controllers can also place limits on the number of I/O operations rather
than bandwidth.) There is a "watermark" value; it sets a
level of utilization below which throttling will not be performed. Each
control group has its own watermark, so it is possible to specify that some
groups are throttled before others.
Each control group is associated with a specific block device. If the
administrator wants to set identical policies for three different devices,
three control groups must still be created. But this approach does make it
possible to set different policies for different devices.
One of the more interesting design decisions with io-throttle is its
placement in the I/O structure: it operates at the top, where I/O requests
are initiated. This approach necessitates the placement of calls to
cgroup_io_throttle() wherever block I/O requests might be
created. So they show up in various parts of the memory management
subsystem, in the filesystem readahead and writeback code, in the
asynchronous I/O layer, and, of course, in the main block layer I/O
submission code. This makes the io-throttle patch a bit more invasive than
some others.
There is an advantage to doing throttling at this level, though: it allows
io-throttle to slow down I/O by simply causing the submitting process to
sleep for a while; this is generally preferable to filling memory with
queued BIO structures. Sleeping is not always possible - it's considered
poor form in large parts of the virtual memory subsystem, for example - so
io-throttle still has to queue I/O requests at times.
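The choice between sleeping and queueing might be modeled, very roughly, as follows. The class and its method are hypothetical and bear no relation to the actual patch; in particular, the "sleep" is only recorded rather than performed so that the example stays self-contained.

```python
from collections import deque


class Throttler:
    """Toy model: sleep the submitter when allowed, queue otherwise."""

    def __init__(self):
        self.deferred = deque()  # requests held back for later dispatch
        self.slept = 0.0         # total time submitters were put to sleep

    def submit(self, request, delay, can_sleep):
        if delay <= 0:
            return "dispatch"    # group is under its limit
        if can_sleep:
            # Stand-in for actually sleeping; the request is then
            # dispatched without occupying memory in a queue.
            self.slept += delay
            return "dispatch"
        # Contexts which cannot sleep must fall back to queueing.
        self.deferred.append(request)
        return "queued"
```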
The io-throttle code does not provide true quality of service, but it
gets a little closer. If the system administrator does not over-subscribe
the block device, then each group should be able to get the amount of
bandwidth which has been allocated to it. This controller handles the problem of
asynchronously-generated I/O requests in the same way dm-ioband does: it
uses the blkio-cgroup code.
The advantages of the io-throttle approach include relatively simple code and the
ability to throttle I/O by causing processes to sleep. On the down side,
operating at the I/O creation level means that hooks must be placed into a
number of kernel subsystems - and maintained over time. Throttling I/O at
this level may also interfere with I/O priority policies implemented at the
I/O scheduler level.
io-controller
Both dm-ioband and io-throttle suffer from a significant problem: they can
defeat the policies (such as I/O priority) being implemented by the I/O
scheduler. Given that a bandwidth control module is, for all practical
purposes, an I/O scheduler in its own right, one might think that it would
make sense to do bandwidth control at the I/O scheduler level. The io-controller patches by Vivek
Goyal do just that.
Io-controller provides a conceptually simple, control-group-based mechanism.
Each control group is given a weight which determines its access to I/O
bandwidth. Control groups are not bound to specific devices in
io-controller, so the same weights apply for access to every device in the
system. Once a process has been placed within a control group, it will
have bandwidth allocated out of that group's weight, with no further
intervention needed - at least, for any block device which uses one of the
standard I/O schedulers.
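As a simple illustration of weight-based allocation, the following sketch turns a set of group weights into bandwidth shares for a device. The group names, the weight values, and the device throughput are invented for the example; the patches themselves may well do the arithmetic differently.

```python
def bandwidth_shares(weights, device_bps):
    """Split a device's throughput among groups in proportion to weight."""
    total = sum(weights.values())
    return {name: device_bps * w / total for name, w in weights.items()}


# The same weights apply to every device, so only the throughput differs.
weights = {"db": 300, "backup": 100}
fast_disk = bandwidth_shares(weights, 80.0)  # MB/s, hypothetical device
slow_disk = bandwidth_shares(weights, 20.0)
```

Note that, unlike the absolute limits used by io-throttle, the resulting numbers scale with whatever the device can actually deliver.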
The io-controller code has been designed to work with all of the mainline
I/O schedulers: CFQ, Deadline, Anticipatory, and no-op. Making that work
requires significant changes to those schedulers; they all need to have a
hierarchical, fair-scheduling mechanism to implement the bandwidth
allocation policy. The CFQ scheduler already has a single level of fair
scheduling, but the io-controller
code needs a second level. Essentially, one level implements the current
CFQ fair queuing algorithm - including I/O priorities - while the other
applies the group bandwidth limits. What this means is that bandwidth
limits can be applied in a way which does not distort the other I/O
scheduling decisions made by CFQ. The other I/O schedulers lack multiple
queues (even at a single level), so the io-controller patch needs to add them.
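The two-level idea can be sketched in a few lines. What follows is a heavily simplified model, not Vivek's code: the group level picks the group with the smallest weighted virtual time, and the inner level uses strict priority ordering as a stand-in for CFQ's real (time-slice-based) algorithm. All names here are hypothetical.

```python
class Group:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight
        self.vtime = 0.0   # virtual time consumed; grows slower for heavy groups
        self.queues = {}   # priority -> FIFO list (lower number runs first)

    def add(self, priority, request):
        self.queues.setdefault(priority, []).append(request)

    def has_work(self):
        return any(self.queues.values())

    def pop(self):
        # Inner level: choose within the group by I/O priority.
        prio = min(p for p, q in self.queues.items() if q)
        return self.queues[prio].pop(0)


def dispatch(groups, count):
    """Outer level: choose between groups according to their weights."""
    order = []
    for _ in range(count):
        ready = [g for g in groups if g.has_work()]
        if not ready:
            break
        group = min(ready, key=lambda g: g.vtime)
        order.append((group.name, group.pop()))
        group.vtime += 1.0 / group.weight
    return order
```

Since the priority decision is made only after a group has been chosen, the group weights never reorder requests within a group; that separation is the property which matters here.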
Vivek's patch starts by stripping the current multi-queue code out of CFQ,
adding multiple levels to it, and making it part of the generic elevator
code. That allows all of the I/O schedulers to make use of it with
(relatively) little code churn. The CFQ code shrinks considerably, but the
other schedulers do not grow much. Vivek, too, solves the asynchronous
request problem with the blkio-cgroup code.
This approach has the clear advantage of performing bandwidth throttling
in ways consistent with the other policies implemented by the I/O
scheduler. It is well contained, in that it does not require the placement
of hooks in other parts of the kernel, and it does not require the use of
the device mapper. On the other hand, it is by far the largest of the
bandwidth controller patches, it cannot implement different policies for
different devices, and it doesn't yet work reliably with all I/O schedulers.
Choosing one
The proliferation of bandwidth controllers has been seen as a problem for at least the last year. There is no interest in merging multiple controllers,
so, at some point, it will become necessary to pick one of them
to put into the mainline. It
has been hoped that the various developers involved would get together and
settle on one implementation, but that has not yet happened, leading Andrew
Morton to proclaim recently:
I'm thinking we need to lock you guys in a room and come back in 15 minutes.
Seriously, how are we to resolve this? We could lock me in a room
and come back in 15 days, but there's no reason to believe that I'd
emerge with the best answer.
At the Storage and Filesystem Workshop in April, the storage track participants
appear to have been leaning heavily toward a solution at the I/O scheduler
level - and, thus, io-controller. The cynical among us might be tempted to
point out that Vivek was in the room, while the developers of the competing
offerings were not. But such people should also ask why an I/O scheduling
problem should be solved at any other level.
In any case, the developers of dm-ioband and io-throttle have not stopped
their work since this workshop was held, and the wider kernel community has
not yet made a decision in this area. So the picture remains only slightly
less murky than before. About the only clear area of consensus would
appear to be the use of blkio-cgroup for the tracking of
asynchronously-generated requests. For the rest, the locked-room solution
may yet prove necessary.
Page editor: Jonathan Corbet