The current development kernel is 2.6.37-rc2, released on November 15.
"And it all looks the way I like to see my -rc2's: nothing really
interesting." It's mostly fixes, but there's also some residual big kernel
lock removal work, the final removal of hard barrier support from the block
layer, and a couple of new LED drivers. See the announcement for the
details.
A significant driver API change was merged after the -rc2 release: the SCSI
midlayer queuecommand() function is now invoked without the host
lock; the function's prototype has changed as well.
Stable updates: no stable updates have been released in the past week.
Comments (none posted)
We *all* want to build infrastructure; when other coders are forced
to use it we rise up the kernel dominance hierarchy. Ook ook!
(Every Unix app has its own config language for the same reason:
the author distills the mental sweat of the users into some kind of
Elixir of Coder Hubris).
Yet abstractions obfuscate: let's resist our primal urges to add
another speed hump on the lengthening road to kernel expertise.
-- Rusty Russell
Finally, the whole "user space is more flexible" is just a lie. It
simply doesn't end up being true. It will be _harder_ to configure
some user-space daemon than it is to just set a flag in /sys or
whatever. The "flexibility" tends to be more a flexibility to get
things wrong than any actual advantage.
-- Linus Torvalds
Our real problem with tracing is lack of relevance, lack of
utility, lack of punch-through analytical power.
-- Ingo Molnar
Comments (6 posted)
Julia Lawall has announced that a Coccinelle workshop will be held in
Copenhagen on January 26, 2011. "I expect that the program will consist
of some presentations about Coccinelle and associated tools, as well as
some time for discussions and practical experiments." Anybody who
is interested in attending should drop her a note.
Full Story (comments: none)
A group of Linux tracing developers has announced the creation of a new
top-level command, called simply "trace." "After years of efforts we
have not succeeded in meeting (let alone exceeding) the utility of
decades-old user-space tracing tools such as strace - except for a few new
good tools such as PowerTop and LatencyTop. 'trace' is our shot at
improving the situation: it aims at providing a simple to use and
straightforward tracing tool based on the perf infrastructure and on the
well-known perf profiling workflow." Obtaining the tool requires
fetching a git tree for now.
Full Story (comments: 37)
One gets the sense that an extended tracing hacking session has been going
on. Ingo Molnar has posted a simple patch to support user-space tracing.
It is currently implemented as an extension to the prctl() system call
which allows an application
to inject tracing data into the kernel, where it will be properly mixed
with kernel events. With some suitable user-space work (making DTrace
tracepoints use this facility, for example), Linux may finally be on a path
toward having proper integrated user- and kernel-space tracing.
Comments (1 posted)
The XFS and OCFS2 filesystems currently have the ability to "punch a hole"
in a file - a portion of the file can be marked as unwanted and the
associated storage released. Josef Bacik, noting that this capability may
be added to other filesystems in the near future, came to the conclusion
that the kernel should offer a standard interface for hole punching. The
result is an extension to the fallocate() system call adding that
ability.
In particular, this patch adds a new flag (FALLOC_FL_PUNCH_HOLE)
which is recognized by the system call. If the underlying filesystem is
able to perform the operation, the indicated range of data will be removed
from the file; otherwise ENOTSUPP will be returned. The current
implementation will not change the size of the file; if the final blocks of
the file are "punched" out, the file will retain the same length. There
has been some discussion of whether changing the size of the file should be
supported, but the consensus seems to be that, for now, changing the file
size would create more problems than it would solve.
Comments (18 posted)
Kernel development news
As long as we have desktop systems, there will almost certainly be concerns
about desktop interactivity. Many complex schemes for improving
interactivity have come and gone over the years; most of them seem to leave
at least a subset of users unsatisfied. Miracle cures are hard to come by,
but it seems that a recent patch has come close, at least for some users.
Interestingly, it is a conceptually simple solution that may not
need to be in the kernel at all.
The core idea behind the completely fair scheduler is its complete
fairness: if there are N processes competing for the CPU, each with equal
priority, then each will get 1/N of the available CPU time. This policy
replaced the rather complicated "interactivity" heuristics found in the
O(1) scheduler; it yields better desktop response in most situations.
There are places where this approach falls down, though. If a user is
running ten instances of the compiler with make -j 10
along with one video playback application, each process will get a "fair" 9% of
the CPU. That 9% may not be enough to provide the video experience that
the user was hoping for. So it is not surprising that many users see
"fairness" differently; wouldn't it
be nice if the compilation job as a whole got 50%, while the video
application got the other half?
The kernel has been able to implement that kind of fairness for years
through a feature known as group
scheduling. A set of processes placed within a group will each get a
fair share of the CPU time allocated to the group as a whole, but groups
will, themselves, compete for a fair share of the CPU. So, if the video
player were to be placed in one group and the compilation in another, each
group would get half of the available processor time. The various
processes doing the compilation would then get a fair share of their
group's allocation; they will compete with each other, but not with the
video player. This
arrangement will ensure that the video player gets enough CPU time to keep
up with the stream and any interactivity requirements.
Groups are thus a nice feature, but they have not seen heavy use since they
were merged for the 2.6.24 release. The reasons for that are clear: groups
require administrative work and root privileges to set up; most users do
not know how to tweak the knobs and would really rather not learn. What
has been missing all these years is a way to make group scheduling "just
work" for ordinary users. That is the goal of Mike Galbraith's per-TTY task groups patch.
In short, this patch automatically creates a group attached to each TTY
in the system. All processes with a given TTY as their controlling
terminal will be placed in the appropriate group; the group scheduling code
can then share time between groups of processes as determined by their
controlling terminals. A compilation job is typically started by typing
"make" in a terminal emulator window; that job will have a
different controlling TTY than the video player, which may not have a
controlling terminal at all. So the end result is that per-TTY grouping
automatically separates tasks run in terminals from those run from the
graphical interface.
This behavior makes Linus happy; Linus,
after all, is just the sort of person who might try to sneak in a quick video
while waiting for a highly-parallel kernel compilation. He said:
So I think this is firmly one of those "real improvement" patches.
Good job. Group scheduling goes from "useful for some specific
server loads" to "that's a killer feature".
Others have also reported significant improvements in desktop response, so
this feature looks like one which has a better-than-average chance of
getting into the mainline in the next merge window. There are, however, a
few voices of dissent, most of whom think that the TTY is the wrong marker
to use when placing processes in groups.
Most outspoken - as he often is - is Lennart Poettering, who asserted that "Binding something like
this to TTYs is just backwards"; he would rather see something which
is based on sessions. And, he said, all of this could better be done in
user space. Linus was, to put it politely, unimpressed, but Lennart came back with a few lines of bash scripting
which achieves the same result as Mike's patch - with no kernel patching
required at all.
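The user-space approach can be sketched in a few lines. This is not
Lennart's exact script; the mount point and hierarchy layout are
assumptions based on the 2010-era cgroup-v1 "cpu" controller. Root
prepares a writable "user" hierarchy once, and each interactive shell then
drops itself into its own group:

```shell
# One-time setup (as root), e.g. from an init script:
#   mount -t cgroup -o cpu cgroup /sys/fs/cgroup/cpu
#   mkdir -m 0777 /sys/fs/cgroup/cpu/user

# In each user's ~/.bashrc - interactive shells only:
if [ -n "$PS1" ] && [ -d /sys/fs/cgroup/cpu/user ]; then
    mkdir -m 0700 "/sys/fs/cgroup/cpu/user/$$" 2>/dev/null
    echo $$ > "/sys/fs/cgroup/cpu/user/$$/tasks"
fi
```

Every process started from such a shell inherits its group, so a
"make -j 10" run there competes with other groups as a single unit - the
same effect the kernel patch achieves with per-TTY grouping.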
It turns out that working with control groups is not necessarily that hard.
Linus, however, still likes the kernel
version, mainly because it can be made to "just work" with no user
intervention required at all:
Put another way: if we find a better way to do something, we should
_not_ say "well, if users want it, they can do this <technical
thing here>". If it really is a better way to do something, we
should just do it. Requiring user setup is _not_ a feature.
In other words, an improvement that just comes with a new kernel is
likely to be available to more users than something which requires each
user to make a (one-time) manual change.
Lennart isn't buying it. A real user-space
solution, he says, would not come in the form of a requirement that users
edit their .bashrc files; it, too, would be in a form that "just
works." It should come as little surprise that the form he envisions is
systemd; it seems that future plans involve systemd taking over session
management, at which time per-session group scheduling will be easy to
achieve. He believes that this solution will be more flexible; it will be
able to group processes in ways which make more sense for "normal desktop
users" than TTY-based grouping. It also
will not require a kernel upgrade to take effect.
Another idea which has
been raised is to add a "run in separate group" option to desktop
application launchers, giving users an easy way to control how the
partitioning is done.
Linus seems to be holding his line on the
kernel version of the patch:
Anyway, I find it depressing that now that this is solved, people
come out of the woodwork and say "hey you could do this". Where
were you guys a year ago or more?
Tough. I found out that I can solve it using cgroups, I asked
people to comment and help, and I think the kernel approach is
wonderful and _way_ simpler than the scripts I've seen. Yes, I'm
biased ("kernels are easy - user space maintenance is a big pain").
The next merge window is not due until January, though; that is a fair
amount of time for people to demonstrate other approaches. If a solution
based in user space turns out to be more flexible and effective in the long
run, it may yet prevail. That is especially true because merging Mike's
patch does not in any way inhibit user-space solutions; if a systemd-based
approach shows better results, that may be what the distributors decide to
enable. One way or the other, it seems like better interactive response is
coming in the near future.
Comments (41 posted)
Over the course of the last decade, video acquisition hardware has evolved
from relatively rare, bulky, external devices to being a standard feature
in a large variety of
gadgets. Increasingly, chipsets intended for embedded use have video
support as a standard feature. This support is becoming more
complex; contemporary video devices are not just frame grabbers anymore.
That complexity is revealing limitations in the kernel's device model,
prompting the proposal of a new "media controller" abstraction. This
article will provide an overview and mild critical review of this new
subsystem.
Video acquisition devices have never been entirely simple. Even a minimal
camera device will usually be a composite of at least three distinct
components: a sensor, a DMA bridge to move frames between the sensor and
main memory, and
an I2C bus dedicated to controlling the sensor. Most devices coming onto
the market now are more sophisticated than that. For example, the
integrated controller in current VIA chipsets (still a very simple device)
adds a "high-quality video" (HQV) unit which can perform image rotation and
format conversions; that unit can be configured into or out of the
processing pipeline depending
on the application's needs. For a more complex example, consider the OMAP
3430, which is found in N900 phones; it has multiple video inputs, a white
balance processor, a lens shading compensation processor, a resizer, and
more.
Each of these components can be thought of as a separate device which can
be powered up or down independently, and which, in some cases, can be
configured in or out at any given time. The current V4L2 system wasn't
designed with this kind of device structure in mind, and neither was the
current Linux device model. An added problem is that these devices can be
tied with devices managed by other subsystems - audio devices in
particular - making it hard for applications to grasp the whole picture.
The media controller is an attempt to rectify that situation.
The most recent version of the media controller patch was posted by Laurent Pinchart back in September;
if all goes according to plan, it will be merged for 2.6.38. The patch
creates a new media_device type which has the responsibility of
managing the various components which make up a media-related device.
These components are called "entities," and they can take many forms.
Sensors, DMA engines, video processing units, focus controllers, audio
devices, and more are all considered to be "entities" in this scheme.
Most entities will have at least one "pad," being a logical connection
point where data can
flow into or out of the device. "Data" in this sense can be multimedia
data, but it might also be a control stream. Pads are exclusively input
("sink") or output ("source") ports, and an entity can have an arbitrary
number of each. The final piece is called a "link"; it is a directional
connection from a source pad to a sink. Links are created by the media
device driver, but they can, in some cases, be enabled or disabled from
user space.
Using this scheme, the simple VIA device described above could be
represented with three entities and three links.
The "sensor" entity has a single source pad which can be connected, via
links, to the HQV unit or directly to the DMA controller. Only one of
those paths can be active at once. The HQV unit has two pads - one sink,
one source - allowing it to be slotted into the video pipeline if need be.
The DMA controller has a single sink pad.
As an aside: entities also have a "group" number assigned to them; groups
are intended to indicate hardware which is meant to function together. All
of the units described above would probably be placed into the same group
by the driver. If there were a microphone attached to the camera, then the
associated audio entity would also be placed in the same group. This
mechanism is intended to make it easier for applications to associate
related devices with each other.
On the application side, there is a device (probably /dev/media0
or some such) which can be opened to gain access to this device. From
there, the interface looks very much like the rest of V4L2 - lots of
ioctl() calls to discover what is available and configure it.
These calls include:
- MEDIA_IOC_DEVICE_INFO to get overall information about the
device: driver name, device model, etc.
- MEDIA_IOC_ENUM_ENTITIES is used to iterate through all of the
entities contained within the device. Information returned includes
an ID number, a coarse entity type (e.g. V4L or ALSA), a subtype (few
of these are defined in the patch; "sensor" is one of them), the group
ID, the device number, and the numbers of pads and links.
- MEDIA_IOC_ENUM_LINKS iterates through all of the links
attached to source pads on a given entity. Thus, it is only possible
to discover the outbound links from any entity; obtaining the whole
graph requires iterating through all entities.
- MEDIA_IOC_SETUP_LINK changes the properties of a specific
link; in particular, it can enable or disable the link (though
links can be marked "immutable" by the driver). Enabling a link will
have the side
effect of powering up all components reachable via that link, while
disabling the last link to an entity will cause that entity to be
powered down. Thus, changing the status of a link affects both the
data path and the power configuration of a device.
Thus far, there have been no applications posted which actually use this
framework (though a gstreamer source element is in the works). One can
certainly see the utility of being able to discover and
modify the configuration of a complex media device in this manner. But, at
the Linux Plumbers Conference, your editor heard some concerns that the
complexity of this interface could prove daunting to application
developers. An application which is intended to work with a specific
device (the camera application on a mobile handset, say) can be written
with a relatively high level of awareness of that device and make good use
of this interface. Writing an application which can make full use of any
device - without requiring the developer to know about specific hardware -
could be more challenging.
One other concern raised at LPC was that this functionality should really
be exported via sysfs rather than through an ioctl()-based API.
The information contained here would fit well within a sysfs hierarchy,
with links represented by symbolic links in the filesystem. Given that the
configuration interface (in its current form) changes
a single bit at a time, there is no need
for the sort of transactional functionality that can make ioctl()
preferable to sysfs. On the other hand, V4L2 applications are already a
big mass of ioctl() calls; the media controller API will be a
natural fit while rooting through sysfs would be a new experience for V4L2
developers.
Something else is worth thinking about here: the problem may be bigger than
just media devices. More complex devices are the norm, and it is becoming
clear that the kernel's hierarchical device model is not up to the task of
representing the structure of our systems. Back in 2009, Rafael Wysocki proposed a mechanism for
representing power-management dependencies with explicit links. The media
controller mechanism looks quite similar; it is even being used for power
management purposes. That suggests that we should be looking for a data
structure which can represent device connections and dependencies across
the kernel, not just in one subsystem. Otherwise we run the risk of
creating duplicated structures and multiple user-space ABIs, all of which
must be supported indefinitely.
The media controller subsystem is aimed at solving a real problem, and it
is certainly a credible solution. It is also a significant new user-space
ABI, one which does not necessarily conform to current ideas of how
interfaces should be done. The work done here may also be applicable well
beyond the V4L2 and ALSA subsystems, but any attempt at a bigger-picture
solution should probably be made before the code is merged and the ABI is
set in stone. All of this suggests that the media controller code could
benefit from review outside of the V4L mailing list, which tends to be
inhabited by relatively focused developers.
(Thanks to Andy Walls, Hans Verkuil, and Laurent Pinchart for their
comments on this article).
Comments (1 posted)
Regardless of whether one believes that the security of the Linux kernel is
as good as it should be, it is hard to disagree with the idea that it could
be made more secure. For some years, it has seemed like much of the
security-related work on the kernel has been directed toward the creation
of new access control mechanisms. But access control is only so helpful if
the kernel itself is vulnerable, allowing any access control system to be
bypassed. Recently we have begun to see more work aimed at making small
improvements to the
security of the kernel itself; this article will survey some of that work.
One key to hardening a system against attackers is to make it harder for
them to obtain information which could be used to compromise the kernel.
So it is not surprising to see an increase in patches which lock down
access to information. It turns out, though, that there is not universal
agreement on the value of restricting any kind of information about the
kernel.
Marcus Meissner started things off with a
simple patch removing world-read access from /proc/kallsyms.
It is difficult to subvert the kernel without knowledge of how the kernel's
memory is laid out, so, Marcus thought, there is no point in providing that
information to anybody who asks. The problem with this change, as Ingo
Molnar pointed out, is that there are many
sources of that information. For example, the System.map file shipped by
most distributors also has the locations of all symbols built into the
kernel.
Now, one can certainly read-protect System.map as well, but that may not be
particularly helpful. Most systems out there are running
distributor-supplied kernels, and the packages for those kernels are widely
available. So an attacker does not need to read /proc/kallsyms or
System.map if the target system is running a stock kernel; they need only
dig up a package file containing the needed information. For this reason,
Ingo suggested that a complete solution would require restricting access to
the running kernel version as well. Removing all of the globally-readable
kernel version information from a system would be hard, but, if it could be
done, attackers would no longer have easy access to the locations of
functions and data structures within the kernel.
Suffice it to say that this idea was not received with universal acclaim.
Critics claim that there are plenty of ways to determine which kernel
version is running; hiding version information would just make life harder
for legitimate applications (which may need that information to know which
features are available) without appreciably slowing attackers. Ingo talked
some about instrumenting the kernel to detect an attacker's attempts to
determine the running kernel version, thus giving an early alarm, but this
idea did not seem to gain a great deal of traction. So, chances are,
kernel versions will not be hidden in any near-future release (the
/proc/kallsyms patch has been merged for 2.6.37, though).
Dan Rosenberg has a similar concern: when the kernel exposes pointer values
to user space, it gives information to potential attackers. These values
can be found in a number of places, including the system log and numerous
places in /proc. Keeping pointer values out of the system log
seems like a hopeless task, but it is possible to better restrict access to
that log. To that end, Dan has posted a patch adding a new sysctl
knob controlling access to the syslog() system call. Later
versions of the patch include a configuration option for the default
setting of this knob; with that, distributors can make the system log off
limits for unprivileged users starting at boot.
Kernel addresses also show up in other places, though; for example,
/proc/net/tcp contains the address of the sock structure
associated with each open TCP connection. Dan worries about exposing the
address of these structures, especially since many of them contain function
pointers; if an attacker is somehow able to change the contents of kernel
memory, this kind of address might facilitate the task of taking over the
system. To raise the bar a bit, Dan posted a series of patches which
replaces the pointer value with an integer value (often zero) if the
process reading the associated /proc file is not suitably
privileged.
Unlike the syslog patch, which has made it into the mainline, the
/proc modification ran into some stiff opposition. It was
described as "security theater," and developers worried that it would break
applications which are legitimately using the pointer values. There were
suggestions that, perhaps, pointer values could be hashed, or that a more
general solution could be had by modifying the behavior of "%p" in
format strings. We might see the "%p" patch at some point, but
Dan has given up on the /proc
patches for now, saying "It's clear that there's too much resistance
to this effort for it to ever succeed, so I'm ceasing attempts to get this
patch series through."
Making it difficult to find structures containing function pointers may
make life harder for an attacker, but it still seems better to block the
modification of those structures whenever possible, regardless of who knows
their location. To that end, Kees Cook has announced his intent to try to lock down more
of the kernel:
The proposal is simple: as much of the kernel should be read-only
as possible, most especially function pointers and other execution
control points, which are the easiest target to exploit when an
arbitrary kernel memory write becomes available to an attacker.
Getting various structures marked const is an obvious starting
point; "constification" patches have been produced by many developers over
the years, but many structures still can be modified at run time. Beyond
that, though, Kees would like to have working read-only and no-execute
memory in loadable modules, "set once" pointers for things like the
security module operations vector, and more; many of the changes he would
like to see merged can currently be found in the grsecurity tree. It could
be a long process, but Kees says that it would be a security win for
everybody and that he would appreciate cooperation from subsystem
maintainers.
Not all kernel vulnerabilities are in the core code; many, instead, are
found in loadable modules. An attacker wishing to exploit a vulnerability
in a module must first ensure that the module is loaded. Module loading is a
privileged operation, but there are a number of ways in which an
unprivileged user can cause the kernel to load a module anyway; the kernel
normally goes out of its way to autoload modules on demand so that things
"just work." It seems clear that a kernel which never allows users to
trigger the loading of modules is less likely to be affected by any
vulnerability which is found in a loadable module.
Dan has posted another patch (again, based
on work done in the grsecurity tree) which makes the demand loading of
modules harder. It replaces the existing modules_disabled sysctl
knob with a more flexible version; if it is set to one, only root can
trigger the loading of modules. Setting it to two disables module loading
entirely until the next boot. The changing of the existing ABI was not
well received, so a future version of the patch will keep the existing
switch and its semantics. Beyond that, doubts have been expressed
regarding whether administrators will enable this option, since demand
loading is a convenient feature.
Hardening the kernel to make the exploiting of vulnerabilities more
difficult seems like a good thing, but it would also be nice if we could
find those vulnerabilities before anybody even tries to exploit them. One
technique which can help in this regard is "fuzzing," the process of
passing random values into system calls and looking for unexpected
behavior. Some attackers certainly have good fuzzing tools, but the
development community seems to be rather less well equipped. So it is good
to see some recent
work by Dave Jones aimed at the creation of a more intelligent fuzzer.
It turns out that, by making system call parameters a bit less fuzzy, the
tool is more likely to get past the trivial checks and turn up real
problems; the improved fuzzer has already turned up one real bug.
The value of all this work may not be clear to everybody, and it probably
will not all make it into the mainline kernel. But it does seem that we
are seeing the beginning of a more focused effort to improve the security
of the kernel and to make it harder to exploit the inevitable bugs. A more
secure kernel may make it harder to gain true ownership of our gadgets in
the future, but it still is generally a good thing.
Comments (7 posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Virtualization and containers
Page editor: Jonathan Corbet