The current 2.6 development kernel is 2.6.28-rc7, released on December 1.
Linus says: "I was gone for a week, and it wasn't quite as quiet as I
was hoping for, but there's a new -rc out there now with the merges of
the [...]" Along with the usual fixes, 2.6.28-rc7 includes a new set of
resource limits intended to prevent excessive kernel memory usage via
the epoll_wait() system call. Details, as usual, can be found in the
long-format changelog.
There have been no stable kernel updates over the last week. The 2.6.27.8
update is in the review process as of this writing; this 104-patch
monster can be expected soon.
Comments (none posted)
Kernel development news
Too many people seem to think that documentation is the "final"
argument. It's not. Not even close. It's a hint and a help, but
it's _secondary_ to code. Anybody who doesn't understand that
should never be allowed to write code (or documentation, for that
matter).
-- Linus Torvalds
[...] is an act of insane vandalism, punishable by spending five additional
years coding in fortran.
-- Andrew Morton
Meanwhile, 10 years and counting, the Linux kernel still generates
a stupid write IO for every file read that apps do. Fortunately
hardware designers will get rid of rotating disks faster than we
can fix our glaring process problems in this space - but it's still
a bit sad.
-- Ingo Molnar
Comments (none posted)
Sam Leffler, maintainer of the Atheros hardware abstraction layer (HAL), has released the source
under an ISC license. Previously, this piece of the MadWifi driver for Atheros wireless chipsets was only available in binary form. In his announcement, Sam states:
Coincident with the release of this code I have concluded my agreement with Atheros whereby I had access to information about their devices. This means that in the future all fixes, updates for new chips, etc. will need to be a community effort. Atheros states the Linux platform will be the reference public code base so folks wanting to add support for other platforms will have to scrape the information from there.
Comments (11 posted)
There is a great deal of activity around Linux filesystems currently. Of
the many ongoing efforts, two receive the most attention: ext4, the
extension of ext3 expected to keep that filesystem design going for a few
more years, and btrfs, which is seen by many as the long-term filesystem of
the future. But there is another project out there which is moving quickly
and is worth a look: Daniel Phillips's Tux3 filesystem.
Daniel is not a newcomer to filesystem development. His Tux2 filesystem was
announced in 2000; it attracted a fair amount of interest until it turned out that
Network Appliance, Inc. held patents on a number of techniques used in
Tux2. There was some talk of filing for defensive patents, and Jeff Merkey
showed up for long enough to claim to have hired a patent attorney to help
with the situation. What really happened is that Tux2 simply faded from
view.
Tux3 is built on some of the same ideas as Tux2, but many of those ideas
have evolved over the eight intervening years. The new filesystem, one
hopes, has changed enough to avoid the attention of NetApp, which has shown
a willingness to use software patents to defend its filesystem turf.
Like any self-respecting contemporary filesystem, Tux3 is based on
B-trees. The inode table is such a tree; each file stored within is also a
B-tree of blocks. Blocks are mapped using extents, of course - another
obligatory feature for new filesystems. Most of the expected features are
present. In many ways, Tux3 looks like yet another POSIX-style filesystem,
but there are some interesting differences.
Tux3 implements transactions through a forward-logging mechanism. A set of
changes to the filesystem will be batched together into a "phase," which is
then written to the journal. Once the phase is committed to the journal,
the transaction is considered to be safely completed. At some future time,
the filesystem code will "roll up" the journal changes and write them back
to the static version of the filesystem.
The logging implementation is interesting. Tux3 uses a variant of the
copy-on-write mechanism employed by Btrfs; it will not allow any filesystem
block to be overwritten in place. So writing to a block within a file will
cause a new block to be allocated, with the new data written there. That,
in turn, will require that the filesystem data structure which maps
file-logical blocks to physical blocks (the extent) will need to be changed
to reflect the new block location. Tux3
handles this by writing the new blocks directly to their final location,
then putting a "promise"
to update the metadata block into the log. At roll-up time, that promise
will be fulfilled through the allocation of a new block and, if necessary,
the logging of a promise to change the next-higher block in the tree. In
this way, changes to files propagate up through the filesystem one step at
a time, without the need to make a recursive, all-at-once change.
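The one-step-at-a-time propagation described above can be sketched in a few lines of Python. This is a toy model only; the names and structures here are invented for illustration, not Tux3's real on-disk format:

```python
# Toy model of Tux3-style forward logging: data blocks are written to
# their final location immediately, while metadata updates are deferred
# as "promises" in a log and applied one tree level per roll-up.

class Node:
    def __init__(self, level, pointers=None):
        self.level = level          # 0 = leaf metadata (extent map)
        self.pointers = pointers or {}

log = []  # pending promises: (node, key, new_block)

def write_block(leaf, key, new_block):
    """Write new data to its final location; log a metadata promise."""
    log.append((leaf, key, new_block))

def roll_up(parents):
    """Fulfill each promise; relocating a metadata block logs a new
    promise against the next-higher node instead of recursing."""
    global log
    pending, log = log, []
    for node, key, new_block in pending:
        node.pointers[key] = new_block
        relocated = object()          # copy-on-write: node gets a new home
        parent = parents.get(id(node))
        if parent is not None:
            log.append((parent, id(node), relocated))

leaf = Node(0)
root = Node(1)
write_block(leaf, "file-block-7", "phys-1234")
roll_up({id(leaf): root})
print(leaf.pointers["file-block-7"])  # promise fulfilled at the leaf
print(len(log))                       # one new promise, now aimed at the root
```

Each roll-up fulfills the promises logged so far and generates at most one new promise per relocated block, one level higher in the tree, which is what avoids the recursive all-at-once update.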
The end result is that the results of a specific change can remain in the
log for some time. In Tux3, the log can be thought of as an integral part
of the filesystem's metadata. This is true to the point that Tux3 doesn't
even bother to roll up the log when the filesystem is unmounted; it just
initializes its state from the log when the next mount happens. Among
other things, Daniel says, this approach ensures that the journal recovery
code will be well-tested and robust - it will be exercised at every mount.
In most filesystems, on-disk inodes are fixed-size objects. In Tux3,
instead, their size will be variable. Inodes are essentially containers
for attributes; in Tux3, normal filesystem data and extended attributes are
treated in almost the same way. So an inode with more attributes will be
larger. Extended attributes are compressed through the use of an "atom
table" which remaps attribute names onto small integers. Filesystems with
extended attributes tend to have large numbers of files using attributes
with a small number of names, so the space savings across an entire
filesystem could be significant.
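The space saving is easy to see with a toy atom table. The class below is illustrative only and bears no resemblance to the on-disk format:

```python
# Toy "atom table": map repeated extended-attribute names to small
# integers so each inode stores an atom number instead of a full name.

class AtomTable:
    def __init__(self):
        self.atoms = {}    # name -> small integer
        self.names = []    # integer -> name

    def atom_for(self, name):
        if name not in self.atoms:
            self.atoms[name] = len(self.names)
            self.names.append(name)
        return self.atoms[name]

table = AtomTable()
# Thousands of files typically reuse a handful of attribute names:
for _ in range(1000):
    table.atom_for("security.selinux")
    table.atom_for("user.mime_type")

print(table.atom_for("security.selinux"))  # 0: the same atom every time
print(len(table.names))                    # 2: each name stored once, filesystem-wide
```

A thousand files carrying two attributes each store two thousand small integers, while the full name strings appear exactly once.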
Also counted among a file's attributes are the blocks where the data is
stored. The Tux3 design envisions a number of different ways in which file
blocks can be tracked. A B-tree of extents is a common solution to this
problem, but its benefits are generally seen with larger files. For
smaller files - still the majority of files on a typical Linux system - data can be
stored either directly in the inode or at the other end of a simple block
pointer. Those representations are more compact for small files, and they
provide quicker data access as well. For the moment, though, only the
extent-based representation is implemented.
Another interesting - but unimplemented - idea for Tux3 is the concept of
versioned pointers. The
btrfs filesystem implements snapshots by retaining a copy of the entire
filesystem tree; one of these copies exists for every snapshot. The
copy-on-write mechanism in btrfs ensures that those snapshots share data
which has not been changed, so it is not as bad as it sounds. Tux3 plans
to take a different approach to the problem; it will keep a single copy of
the filesystem tree, but keep track of different versions of blocks (or
extents, really) within that tree. So the versioning information is stored
in the leaves of the tree, rather than at the top.
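A versioned-pointer lookup can be sketched as follows. This is a simplification which assumes a linear version history; the function name and data layout are hypothetical:

```python
# Toy versioned extent: a leaf entry carries a list of (version, data)
# pairs; a lookup at snapshot V returns the entry with the highest
# version number not exceeding V.

def lookup(versioned_entries, snapshot):
    """versioned_entries: list of (version, data), sorted by version."""
    best = None
    for version, data in versioned_entries:
        if version <= snapshot:
            best = data
    return best

extent = [(1, "blocks 100-107"), (4, "blocks 220-227"), (9, "blocks 300-307")]
print(lookup(extent, 4))   # snapshot 4 sees the version-4 rewrite
print(lookup(extent, 7))   # snapshots 4 through 8 share the same blocks
```

The point of the structure is visible even in the toy: snapshots which did not change a block share a single entry, so the cost of a snapshot is proportional to what changed, not to the size of the tree.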
But the versioned extents idea has been deferred for now, in favor of getting
a working filesystem together.
Also removed from the initial feature list is support for subvolumes. This
feature initially seemed like an easy thing to do, but interaction with
fsync() proved hard. So Daniel finally concluded that volume management was best left
to volume managers and dropped the subvolume feature from Tux3.
One feature which has never been on the list is checksumming of data.
Daniel once commented:
Having been checksumming filesystem data during continuous
replication for two years now on multiple machines, and having
caught exactly zero blocks of bad data passed as good in that time,
I consider the spectre of disks passing bad data as good to be
largely vendor FUD. That said, checksumming will likely appear in
the feature list at some point, I just consider it a decoration,
not an essential feature.
Tux3 development is far from the point where the developers can worry about
"decorations"; it remains, at this point, an embryonic project being pushed
by a developer with a bit of a reputation for bright ideas which never
quite reach completion. The code, thus far, has been developed in user
space using FUSE. There is,
however, an in-kernel version
which is now ready for further development. According to Daniel:
The functionality we have today is roughly like a buggy Ext2 with
missing features. While it is very definitely not something you
want to store your files on, this undeniably is Tux3 and
demonstrates a lot of new design elements that I have described in
some detail over the last few months. The variable length inodes,
the attribute packing, the btree design, the compact extent
encoding and deduplication of extended attribute names are all
working out really well.
The potential user community for a stripped-down ext2 with bugs is likely
to be relatively small. But the Tux3 design just might have enough to
offer to make it a contender eventually.
First, though, there are a few little
problems to solve. At the top of the list, arguably, is the complete lack
of locking - locking being the rocks upon which other filesystem projects
have run badly aground. The code needs some cleanups - little problems
like the almost complete lack of comments and the use of macros as formal
function parameters are likely to raise red flags on wider review. Work
on an fsck utility does not appear to have begun. There has been no real
benchmarking work done; it will be interesting to see how Daniel can manage
the "never overwrite a block" policy in a way which does not fragment files
(and thus hurt performance) over time. And so on.
That said, a lot of these problems could end up being resolved rather
quickly. Daniel has put the code out there and appears to have attracted an
energetic (if small) community of contributors. Tux3 represents the core
of a new filesystem with some interesting ideas. Code comments may be
scarce, but Daniel - never known as a tight-lipped developer - has posted a
wealth of information which can be found in the Tux3
mailing list archives. Potential contributors should be aware of Daniel's licensing scheme - GPLv3 with a
reserved unilateral right to relicense the code to anything else - but
developers who are comfortable with that are likely to find an interesting
and fast-moving project to play in.
Comments (53 posted)
Remi Colinet recently proposed the addition of a new virtual file,
/proc/mempool, which would display the usage of memory pools within the
kernel. Nobody really disagreed with the idea of making this information
available, but there were some grumbles about putting it into /proc.
Once upon a time, just about anything could go into that directory, but,
in recent years, there has been a real attempt to confine /proc to its
original intent: providing information about processes. /proc/mempool is
not about processes, so it was considered procfile-non-grata. It was
suggested that another home should be found for this file.
Where that other home should be is not obvious, though. Somewhere like
/sys/kernel might seem to make sense, but sysfs has rules of its
own. In particular, the one-value-per-file rule makes it hard to create a
single file where developers can simply query the state of a kernel
subsystem, so sysfs is not a suitable home for this file either.
The next option is debugfs, which was created in December, 2004.
Debugfs is meant to be an aid for kernel developers; it explicitly
disclaims any rules on the types of files that can be put there. All rules
except for one: debugfs is not a mandatory part of any kernel installation,
and nothing found therein should be considered to be a part of the stable
user-space ABI. It is, instead, a dumping ground where kernel developers
can quickly export information which is useful to them.
Since debugfs is not a part of the user-space ABI, it seems like a poor
place to put things that users might depend on. When this was pointed out,
it became clear that the non-ABI status of debugfs is not as well
established as one might think. Quoting Matt:
The problem with debugfs is that it claims to not be an ABI but it
is lying. Distributions ship tools that depend on portions of
debugfs. And they also ship debugfs in their kernel. So it is
effectively the same as /proc, except with the 1.0-era
everything-goes attitude rather than the 2.6-era [...]
Pushing stuff from procfs to debugfs is thus just setting us up for
pain down the road. Don't do it. In five years, we'll discover we
can't turn debugfs off or even clean it up because too much relies on it.
As an example, Matt pointed out the extensively-documented usbmon interface which
provides a great deal of information about what's happening on a USB bus.
If it is not an ABI, he says, nobody should be upset if he submits a patch
which breaks it.
That is a perennial problem with interfaces between the kernel and user
space; changing them causes
pain for users. That is why incompatible changes to user-space interfaces
are almost never allowed;
an important goal for the kernel development process is to avoid breaking
user-space programs. One might think that this problem could be avoided
for a specific interface by explicitly documenting it as an unstable
interface. The files in Documentation/ABI/testing are meant to serve that
role; anything found there should be considered to be unstable. But, as
soon as people start using programs which depend on a specific interface,
it has, for all practical purposes, hardened into part of the kernel ABI.
Linus put it this way:
The fact that something is documented (whether correctly or not)
has absolutely _zero_ impact on anything at all. What makes
something an ABI is that it's useful and available. The only way
something isn't an ABI is by _explicitly_ making sure that it's not
available even by mistake in a stable form for binary use.
Example: kernel internal data structures and function calls. We
make sure that you simply _cannot_ make a binary that works across
kernel versions. That is the only way for an ABI to not form.
So a given kernel interface can be kept away from ABI status if it is so
hard to get to, and so unstable, that nothing ever comes to depend on it.
The kernel module interface certainly fits this bill. Modules must
generally be built for the exact kernel they are intended to work with, and
they must often be built with the same configuration options and the same
compiler. Anybody who has gotten into the dark business of distributing
binary-only modules has learned what a challenge it can be.
Debugfs is different, though. It is enabled in a number of distributor
kernels, even if, perhaps, it is not mounted by default. Once a set of
files gets placed there, their format tends to change rarely. So it is
possible for people to write programs which depend on debugfs files. And
the end result of that is that debugfs files can become part of the stable
kernel ABI. That is generally not a result that was intended by anybody
involved, but it happens anyway. The only way to avoid it would be to
deliberately shake up debugfs every kernel cycle - and few developers have
much desire to do that.
This is a discussion without a whole lot in the way of useful conclusions;
it leaves /proc/mempool without a home. ABI design, it turns out,
is still hard. In the longer term, dealing with an ABI which was never
really designed, but which just sort of settled into being, is even
harder. There does not appear to be any substitute for thinking seriously
about every interface between kernel and user space, even if it's just for
a developer's debugging tool.
Comments (8 posted)
An I/O scheduler is a subsystem of the kernel which schedules I/O
requests to the various storage devices to get the best possible
throughput from those devices. The algorithm is often reminiscent of the
one used by elevators when dealing with requests coming from different
floors to go up or down; this is the reason I/O scheduling algorithms
are also called "elevators." I/O requests are submitted in an order
designed to minimize disk head movement (thus minimizing disk seek
times) while still guaranteeing good I/O rates. The next request chosen
depends on the current disk head position, in order to service requests
quickly and spend less time seeking. Scheduling algorithms may also
consider other aspects, such as fairness or time guarantees.
The Completely Fair Queuing (CFQ) I/O scheduler is one of the most
popular I/O scheduling algorithms; it is used as the default scheduler
in most distributions. As the name suggests, the CFQ scheduler tries to
maintain fairness in its distribution of bandwidth to processes without
compromising much on throughput. The elevator's fairness is accomplished
by servicing all processes and not penalizing those whose requests are
far from the current disk head position. It grants a time slice to every
process; once a task has consumed its slice, the slice is recomputed and
the task is added to the end of the queue. The I/O priority is used to
compute the time slice granted and the offset in the request queue.
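The general shape of that priority-to-slice mapping can be illustrated with a few lines of Python. The constants and formula here are invented for illustration; CFQ's real computation lives in the kernel's cfq-iosched.c and differs in detail:

```python
# Illustrative mapping of I/O priority (0 = highest, 7 = lowest) to a
# time slice: higher-priority tasks get longer slices, so they receive
# a larger share of disk time per scheduling round.

BASE_SLICE_MS = 100  # made-up base value, not CFQ's actual tunable

def time_slice(ioprio):
    assert 0 <= ioprio <= 7
    return BASE_SLICE_MS * (8 - ioprio) // 8

for prio in (0, 3, 7):
    print(prio, time_slice(prio))   # 100, 62, and 12 ms respectively
```

Whatever the exact formula, the key property is monotonicity: a lower ioprio number (higher priority) never yields a shorter slice.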
The Budget Fair Queuing scheduler
The time-based allocation of the disk service in CFQ, while having
the desirable effect of implicitly charging each application for
the seek time it incurs, still suffers from fairness problems, especially
towards processes which make the best possible use of the disk bandwidth.
If the same time slice is assigned to two processes,
they may each get different throughput, as a function of the
positions on the disk of their requests. Moreover, due
to its round robin policy, CFQ is characterized by an O(N) worst-case
delay (jitter) in request completion time, where N is the number
of tasks competing for the disk.
The Budget Fair Queuing (BFQ)
scheduler, developed by Fabio Checconi and Paolo Valente,
changes the CFQ round-robin scheduling policy based on time slices into a
fair queuing policy based on sector budgets. Each task is assigned a budget
measured in number of sectors instead of amount of time, and budgets
are scheduled using a slightly modified version of the Worst-case Fair
Weighted Fair Queuing+ (WF2Q+) algorithm (described in this paper
[compressed PS]), which
guarantees a worst-case complexity of O(log N) and boils down to O(1)
in most cases. The budget assigned to each task varies over time as a
function of its behavior. However, one can set the maximum value of
the budget that BFQ can assign to any task.
BFQ can provide strong guarantees on bandwidth distribution because the
assigned budgets are measured in sectors. There are limits, though:
processes taking too much time to exhaust their budget are penalized,
and the scheduler selects the next process to dispatch I/O. The next
budget is calculated from the feedback provided by the requests serviced.
BFQ also introduces I/O scheduling within control groups. Queues are collected
into a tree of groups, and there is a distinct B-WF2Q+ scheduler on each
non-leaf node. Leaf nodes are request queues as in the
non-hierarchical case. BFQ supports I/O priority classes at each hierarchy
level, enforcing a strict priority ordering among classes. This means
that idle queues or groups are served only if there are no best effort
queues or groups in the same control group, and best effort queues and groups are
served only if there are no real-time queues or groups. As compared to
cfq-cgroups (explained later), it lacks per-device priorities. The
developers, however, claim that this feature could be incorporated easily.
Requests coming to an I/O scheduler fall into two categories,
synchronous and asynchronous. Synchronous requests are those for which
the application must wait before continuing to send further
requests - typically read requests. On the other hand, asynchronous
requests - typically writes - do not block the application's progress while
they are executed.
In BFQ, as in CFQ, synchronous requests are collected in per-task queues, while
asynchronous requests are collected in per-device (or, in the case of
hierarchical scheduling, per group) queues.
When the underlying device
driver asks for the next request to serve and there is no queue being
served, BFQ uses B-WF2Q+, a modified version of WF2Q+, to choose a
queue. It then selects the first request from that queue in C-LOOK order
and returns it to the driver. C-LOOK is a disk scheduling algorithm in
which the next request picked is the one whose sector number is closest
above the current disk head position. Once the request with the highest
sector number in the queue has been serviced, the head moves back to the
pending request with the lowest sector number and the sweep starts again.
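The C-LOOK ordering can be sketched in a few lines (a simplified model; the real implementation works incrementally as requests arrive rather than sorting a static list):

```python
# Toy C-LOOK: service requests with sector numbers at or above the head
# position in ascending order; when the highest is reached, jump back to
# the lowest outstanding request and sweep upward again.

def c_look_order(head, requests):
    ahead = sorted(r for r in requests if r >= head)
    behind = sorted(r for r in requests if r < head)
    return ahead + behind

print(c_look_order(50, [10, 95, 60, 20, 75]))
# [60, 75, 95, 10, 20]
```

Compared with a plain elevator (SCAN), the jump back to the lowest sector avoids sweeping through empty regions at the edges of the disk.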
When a new queue is selected it is assigned a budget, in disk sector
units, decremented each time a request from the same queue is served.
When the device driver asks for new requests and there is a queue
under service, they are chosen from that queue until one of the
following conditions is met: (1) the queue exhausts its budget,
(2) the queue is spending too much time to consume its budget, or
(3) the queue has no more requests to serve.
On termination of a request, the scheduler recalculates the
budget allocated to each process depending on the feedback it gets.
For example, a greedy process which has exhausted its budget gets a
larger budget, whereas a process which has been idle for long gets a
smaller one. The maximum budget a process can get is a
configurable system parameter (max_budget).
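That feedback rule can be sketched as follows. The thresholds and step sizes here are invented for illustration; only the max_budget cap corresponds to a real BFQ tunable:

```python
# Toy BFQ-style budget feedback: a queue that exhausts its budget is
# assumed greedy and gets a bigger one next time; a queue that went idle
# before using its budget gets a smaller one. The result is capped,
# mirroring BFQ's max_budget parameter.

MAX_BUDGET = 4096  # sectors; configurable in the real scheduler

def next_budget(budget, exhausted, went_idle):
    if exhausted:
        budget = min(budget * 2, MAX_BUDGET)   # reward throughput
    elif went_idle:
        budget = max(budget // 2, 64)          # shrink for idle queues
    return budget

print(next_budget(1024, exhausted=True, went_idle=False))   # 2048
print(next_budget(1024, exhausted=False, went_idle=True))   # 512
```

The self-tuning loop converges toward budgets that match each process's actual appetite for disk bandwidth.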
Two other parameters, timeout_sync and timeout_async, control the
timeout for consuming the budget of the synchronous and asynchronous
queues, respectively. In addition, max_budget_async_rq limits the
maximum number of requests serviced from an asynchronous queue.
If a synchronous queue has no more requests to serve, but it has
some budget left, the scheduler idles (i.e., it tells the device
driver that it has no requests to serve even if there are other active
queues) for a short period, in anticipation of a new request from the task
owning the queue.
The developers compared BFQ against several other I/O scheduling
algorithms, including SCAN-EDF, CFQ, the Linux anticipatory scheduler,
and C-LOOK. They tested a multitude of scenarios analogous to real-life
workloads, measuring throughput, bandwidth distribution, latency, and
short-term time guarantees. With respect to bandwidth distribution, BFQ
came out on top, and it proved a good algorithm for most scenarios.
There were also extensive tests comparing BFQ against CFQ, and the
results are available here.
The throughput of BFQ is more or less the same as CFQ, but it scores well in
distributing I/O bandwidth fairly among the processes, and displays
lower latency with streaming data.
Using sector budgets, instead of time, as the basis for granting slices
is an interesting approach to fair bandwidth distribution. The algorithm
also employs timeouts to penalize "seeky" processes which take too much
time to consume their budget. The feedback from current requests helps
determine future budgets, making the algorithm self-tuning. Such tight
bandwidth distribution could be a requirement for systems running
virtual machines or container classes. Whether BFQ displaces the
tried-and-tested CFQ, however, depends on how it stands the test of
time. See the BFQ technical report [PDF] for (much) more information.
Control Groups provide a mechanism for aggregating sets of tasks, and
all their future children, into hierarchical groups. These groups can
be allocated dedicated portions of the available resources, or
resource sharing can be prioritized within these groups. Control
groups are controlled by the cgroups pseudo-filesystem. Once mounted,
the top level directory shows the complete set of existing control
groups. Each directory made under the root creates a new group, and
resources can be allocated to the tasks listed in the tasks file in the
individual directory.
Control groups can be used to regulate access to CPU time, memory, and
more. There are also several projects working toward the creation of I/O
bandwidth controllers for control groups.
One of those is
the expanded CFQ scheduler
patch for cgroups by Satoshi Uchida.
This patch set introduces a new I/O scheduler called cfq-cgroups, which
adds control group awareness to the I/O scheduling subsystem.
This scheduler, as
the name suggests, is based on Completely Fair Queuing I/O scheduler.
It can take advantage of hierarchical scheduling of
processes, with respect to the cgroup they belong to, each cgroup
having its own CFQ scheduler.
I/O devices in a control group can be prioritized. The time slice
given to each hierarchical group per device is a function of the device
priority. This helps shape I/O bandwidth per group, per device.
To use cfq-cgroups, select it as the default scheduler at
boot by passing elevator=cfq-cgroups as a boot parameter.
This can also be dynamically changed for individual devices by writing
cfq-cgroups to /sys/block/<device>/queue/scheduler.
There are two levels of control:
through the cgroups filesystem, for individual groups, and
through sysfs, for individual devices.
Like any other control group, cfq-cgroup is managed through the cgroup
filesystem. To access the cgroups, mount the pseudo cgroup filesystem:
# mount -t cgroup -o cfq cfq /mnt/cgroup
The cgroup directory, by default, will have a file called
cfq.ioprio, which contains the
individual priority on a per-device basis. The time slice received per
device per group is a function of the I/O priority listed in cfq.ioprio.
The tasks file represents the list of tasks in the particular group.
To make more groups, create a directory in the mounted cgroup filesystem:
# mkdir /mnt/cgroup/group1
The new directories are automatically populated with files
(cfq.ioprio, tasks, etc.) which are used to control the
resources in this
group. To add tasks to a group, write the process ID of the task to the
tasks file:
# echo <pid> > /mnt/cgroup/group1/tasks
The cfq.ioprio file contains the list of devices and their respective
priorities. Each device in the cgroup has a default I/O priority of 3,
while the valid values are 0 to 7. To change the priority of a device for
the cgroup group1, run:
# echo 2 > /mnt/cgroup/group1/cfq.ioprio
This would change the priority of the entire group. To change the I/O
priority of a specific device:
# echo 2 sda > /mnt/cgroup/group1/cfq.ioprio
To change the default priority while keeping the existing per-device priorities:
# echo 4 defaults > /mnt/cgroup/group1/cfq.ioprio
The device view shows the list of cgroups and their respective
priorities on a per-group basis. This can be changed by:
# echo 2 group1 > /sys/block/sda/queue/iosched/ioprio
The device view contains other parameters similar to those of the CFQ
scheduler, such as back_seek_max or back_seek_penalty, which are
specific to the control of the individual device, just as in the
traditional CFQ scheduler.
The patch introduces a new data structure, cfq_driver_data, for the
control of I/O bandwidth for cgroups. All driver-related data has been
moved from the traditional cfq_data structure to
cfq_driver_data structure. Similarly, cfq_cgroups is a new data
structure to control
the cgroup parameters. The organization of the data can be pictured as
a matrix with cfq_cgroups as rows and cfq_driver_data as columns, as
shown in the diagram below.
At each intersection, there is a cfq_data structure which is responsible
for all CFQ-related queue handling, so that each cfq_data corresponds to
one cfq_cgroup and one cfq_driver_data. When a new cgroup is created,
the cfq_data from the parent cgroup is copied into the new group. While
inserting new nodes of cfq_data into the cgroup, the cfq_data structure
is initialized with the priority of the parent.
This way all data of the parent is inherited by the child cgroup, and
shows up in the respective files per group in the cgroup filesystem.
Scheduling of cfq_data within the CFQ scheduler is similar to that
of the native CFQ scheduler. Each node is assigned a time slice.
This slice is calculated according to the I/O priority of the device, using
the per-device base time slice. The time slice offset forms the key of
the red-black node to be inserted in the service tree. One
cfq_data entry is
picked from the start of the red-black tree and scheduled. Once its
time slice expires it is added to the tree again, after recalculation
of its time slice offset. So, each cfq_data structure acts as a
queue node per
device, and, within each CFQ data structure, requests are queued as with a
regular CFQ queue.
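The row-and-column arrangement can be modeled with nested dictionaries. This is a structural sketch only, with invented names, not the kernel's actual data layout:

```python
# Toy model of the cfq-cgroups data layout: one per-(cgroup, device)
# object at each intersection, indexed by group "rows" and device
# "columns". New groups inherit per-device settings from their parent.

class CfqData:
    def __init__(self, ioprio):
        self.ioprio = ioprio   # drives the time slice on this device

matrix = {}   # matrix[group][device] -> CfqData

def add_group(group, devices, parent=None):
    matrix[group] = {
        dev: CfqData(matrix[parent][dev].ioprio if parent else 3)
        for dev in devices
    }

add_group("root", ["sda", "sdb"])
matrix["root"]["sda"].ioprio = 1          # raise root's priority on sda
add_group("group1", ["sda", "sdb"], parent="root")
print(matrix["group1"]["sda"].ioprio)     # 1, inherited from the parent
print(matrix["group1"]["sdb"].ioprio)     # 3, the parent's default
```

The same two-axis indexing is what lets cfq-cgroups set a priority per group on one device without disturbing the group's settings on any other device.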
Both BFQ and cfq-cgroups are attempts to bring a higher degree of fairness
to I/O scheduling, with "fairness" being tempered by the desire to add more
administrative control via the control groups mechanism. They both appear
to be useful solutions, but they must contend with the wealth of other I/O
bandwidth control implementations out there. Coming to some sort of
consensus on which approach is the right one could prove to be a rather
longer process than simply implementing these algorithms in the first place.
Comments (6 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Benchmarks and bugs
Page editor: Jake Edge
Next page: Distributions>>