Brief items
The current 2.6 development kernel is 2.6.28-rc6,
released by Linus on
November 20, just before he fled town for a scuba diving trip. (At
least one assumes he fled town; it is not the best season for ocean sports
in Portland.) It includes a number of fixes, including one for the
high-profile
vmalloc() regression. The
long-format
changelog has the details.
The current stable 2.6 kernel is 2.6.27.7, also released on November 20.
It includes a fair number of fixes, including one with a CVE number
attached.
Kernel development news
+/*
+ * "Define 'is'", Bill Clinton
+ * "Define 'if'", Steven Rostedt
+ */
+#define if(cond) if (__builtin_constant_p((cond)) ? !!(cond) : \
+    ({ \
+        int ______r; \
+        static struct ftrace_branch_data \
+            __attribute__((__aligned__(4))) \
+            __attribute__((section("_ftrace_branch"))) \
+            ______f = { \
+                .func = __func__, \
+                .file = __FILE__, \
+                .line = __LINE__, \
+            }; \
+        ______r = !!(cond); \
+        if (______r) \
+            ______f.hit++; \
+        else \
+            ______f.miss++; \
+        ______r; \
+    }))
--
Steven Rostedt debuts the new "if"
Working on lkml often sounds like everyone is screaming NO,
channeling nothing but stop energy. Sometimes people are, but more
often what they really mean is you just have to take your time and
do things right. Admittedly it is a lot of iteration, but Linux is
a noble pursuit.
--
Robert Love
But let's look at the problem which we're actually trying to solve.
Developer A wishes to write some kernel monitoring/controlling
code, so he is forced to stick it on his website, keep reminding
people to download updates, act as an independent target of other
people's patches, etc, etc. It's all a pain and horror, so
developer A gives up and implements his userspace code in the
kernel instead. It is, as a result, technically inferior and
English-only, but at least it got there.
--
Andrew Morton
By Jonathan Corbet
November 24, 2008
Rebooting a system to apply a security update is a pain. In some
situations, it's more than a pain; for various reasons, many systems cannot
be taken down at all without compromising the work they are supposed to be
doing. Back in April, LWN
looked
at Ksplice, a mechanism designed to enable the installation of kernel
updates without the need to reboot the system. Since then, work has
continued on Ksplice,
a new
version has been posted, and the project is starting to push toward
mainline inclusion. So another look is called for.
The core idea behind Ksplice remains the same: when given a source tree and
a patch, it builds the kernel both with and without the patch and looks at
the differences. To that end, the compilation procedure is modified to
put every function and data structure into its own executable section.
That makes life a little harder for the compiler and the linker, but
developers are notably insensitive to the difficulties faced by those
tools. With things split up this way, it is relatively easy to identify a
minimal set of changes in the binary kernel image which result from the
patch. Ksplice can then, with some care, patch the new code into the
running kernel. Once this work is done, the old kernel is running the new
code without ever having been rebooted.
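The comparison step can be pictured with a small user-space model. This is only a sketch under stated assumptions: struct section, find_section(), and changed_sections() are hypothetical names, and a real tool must also deal with relocations and symbol resolution, which are ignored here. GCC's -ffunction-sections and -fdata-sections options produce a one-section-per-object layout of this general kind; whether Ksplice relies on exactly those options is not stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one per-function section of a compiled object. */
struct section {
    const char *name;          /* for example ".text.do_thing" */
    const unsigned char *code; /* section contents */
    size_t len;
};

/* Look a section up by name; returns NULL if the build lacks it. */
static const struct section *find_section(const struct section *set, size_t n,
                                          const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i].name, name) == 0)
            return &set[i];
    return NULL;
}

/*
 * Compare the patched build against the original one and collect the names
 * of sections that are new or whose bytes differ. Returns the number of
 * names written to 'changed' (at most 'max').
 */
static size_t changed_sections(const struct section *pre, size_t npre,
                               const struct section *post, size_t npost,
                               const char **changed, size_t max)
{
    size_t count = 0;

    assert(changed != NULL || max == 0);
    for (size_t i = 0; i < npost && count < max; i++) {
        const struct section *old = find_section(pre, npre, post[i].name);

        if (!old || old->len != post[i].len ||
            memcmp(old->code, post[i].code, old->len) != 0)
            changed[count++] = post[i].name;
    }
    return count;
}
```

Only the sections reported by changed_sections() would need to be loaded into the running kernel; everything else is byte-for-byte identical in the two builds.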
This technique works well for code changes, but different challenges come
with changes to data structures. Back in April, Ksplice could not handle
that kind of change. Even so, the project's developers claimed to be able
to apply the bulk of the kernel's security updates using Ksplice. Since
then, though, the developers have applied some energy to this problem.
With the addition of a couple of new techniques - which require extra
effort on the part of the person preparing the patch for Ksplice - it is
now possible to apply 100% of the 65 non-DoS security patches released for
the kernel since 2005.
In some cases, a kernel patch will simply require that a data structure be
initialized differently. The way to handle this change in an update
through Ksplice is to modify the relevant data structures on the fly. To
effect such changes, a patch can be modified to include code like the following:
#include <ksplice-patch.h>
ksplice_apply(void (*func)());
While Ksplice is applying the changes - and while the rest of the system is
still stopped - the given func will be called. It can then go
rooting through the kernel's data structures, changing things as needed.
For example, CVE-2008-0007
came about as a result of a failure by some drivers to set the
VM_DONTEXPAND flag on certain vm_area_struct structures.
Ksplice is able to apply the fix to the drivers without trouble, but that
is not helpful for any incorrectly-initialized VMAs present on the running
system. So the
modifications to the patch add some functions which set
VM_DONTEXPAND on existing VMAs, then use ksplice_apply()
to cause those functions to be executed. The result is a fully-fixed
system.
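A user-space model may make the shape of such a fixup clearer. Everything here is an assumption made for illustration: struct model_vma, model_vma_list, the owned_by_fixed_driver marker, the MODEL_VM_DONTEXPAND value, and model_apply() are all hypothetical stand-ins, not kernel or Ksplice definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in flag value for this model; not taken from kernel headers. */
#define MODEL_VM_DONTEXPAND 0x1UL

/* Minimal model of a VMA: only the fields this sketch needs. */
struct model_vma {
    unsigned long flags;
    int owned_by_fixed_driver; /* hypothetical: created by an affected driver */
    struct model_vma *next;
};

/* Hypothetical global list of the VMAs that exist on the running system. */
static struct model_vma *model_vma_list;

/*
 * Fixup hook: walk the existing VMAs and set the flag that the unpatched
 * driver forgot. It takes no arguments, like the func passed to
 * ksplice_apply().
 */
static void fix_existing_vmas(void)
{
    for (struct model_vma *v = model_vma_list; v; v = v->next)
        if (v->owned_by_fixed_driver)
            v->flags |= MODEL_VM_DONTEXPAND;
}

/* Stand-in for the update machinery: run the hook while all else is stopped. */
static void model_apply(void (*func)(void))
{
    assert(func != NULL);
    func();
}
```

Calling model_apply(fix_existing_vmas) repairs the VMAs that were created before the update, which the code change alone cannot reach.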
Changes to data structure definitions are harder. If a structure field is
removed, the Ksplice version of the patch can just leave it in place. But
the addition of a new field requires more complicated measures. Simply
replacing the allocated structures on the fly seems impractical; finding
and fixing all pointers to those structures would be difficult at best. So
something else is needed.
For Ksplice, that something else is a "shadow" mechanism which allocates a
separate structure to hold the new fields. Using shadow structures is a
fair amount of additional work; the original patch must be changed in a
number of places. Code which allocates the affected structure must be
modified to allocate the shadow as well, and code which frees the structure
must be changed in similar ways. Any reference to the new field(s) must,
instead, look up the shadow structure and use that version of the field.
All told, it looks like a tiresome procedure which has a significant chance
of introducing new bugs. There is also the potential for performance
issues caused by the linear linked list search performed to find the shadow
structures. The good news is that it is only rarely necessary to modify a
patch in this way.
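The bookkeeping involved can be sketched in user space as follows. This is not the Ksplice interface: struct shadow, shadow_attach(), shadow_find(), and shadow_detach() are hypothetical names, and new_field stands in for whatever field the original patch wanted to add.

```c
#include <assert.h>
#include <stdlib.h>

/* One shadow record: pairs an existing object with storage for its new field. */
struct shadow {
    const void *obj;     /* address of the original, unmodified structure */
    int new_field;       /* the field the patch wanted to add */
    struct shadow *next;
};

static struct shadow *shadow_list;

/* Called where the patched code allocates the original structure. */
static struct shadow *shadow_attach(const void *obj)
{
    struct shadow *s;

    assert(obj != NULL);
    s = calloc(1, sizeof(*s));
    if (!s)
        return NULL;
    s->obj = obj;
    s->next = shadow_list;
    shadow_list = s;
    return s;
}

/* Linear search: the cost grows with the number of live shadows. */
static struct shadow *shadow_find(const void *obj)
{
    for (struct shadow *s = shadow_list; s; s = s->next)
        if (s->obj == obj)
            return s;
    return NULL;
}

/* Called where the patched code frees the original structure. */
static void shadow_detach(const void *obj)
{
    for (struct shadow **p = &shadow_list; *p; p = &(*p)->next) {
        if ((*p)->obj == obj) {
            struct shadow *dead = *p;

            *p = dead->next;
            free(dead);
            return;
        }
    }
}
```

Every access that the original patch wrote as obj->new_field becomes shadow_find(obj)->new_field in the hot-update version, which is where both the extra work and the lookup cost come from.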
The Ksplice developers do not appear to be done yet; from the latest patch
posting:
We're currently working on the problem of making it feasible to
apply the entire stable tree using Ksplice. Although Ksplice's
original evaluation focused on patches for CVEs, we understand the
idea that "security bugs are just 'normal bugs'" (i.e.,
tracking security bugs separately from normal bugs can be difficult
and isn't necessarily advisable). We ultimately want to provide to
long-running machines hot updates for all of the bug fixes that go
into the corresponding stable tree.
This is an ambitious goal; a single stable series can add up to hundreds of
changes, some of which can be reasonably large. It will be interesting to
see how many users are really interested in this particular sort of update;
sites running critical systems tend to have older "enterprise" kernels
which are no longer receiving stable tree updates. But a Ksplice which is
flexible enough to handle that kind of update stream should also be useful
for distributors wanting to provide no-reboot patches to their customers.
Meanwhile, Nikanth Karthikesan has posted a facility called kreplace. On the surface, it
looks similar to Ksplice, but the goal is a little different: its purpose
is to allow a developer to quickly try out a change on a running kernel.
Kreplace works by simply patching out and replacing one or more functions
in the kernel. Kreplace may have its value, but the initial reaction has
not been greatly enthusiastic. Among other things, it has been pointed out that Ksplice also has a facility
to allow for quick experimentation with changes - though it will be quick
only if the developer is already set up to use Ksplice with the running
kernel.
A final concern with either of these solutions is that they are, for all
practical purposes, employing rootkit techniques. A mechanism which can be
used by distributors to patch running systems can also be (mis)used by others.
Vendors of binary-only modules could, for example, use Ksplice or kreplace
to get around GPL-only exports and other inconvenient features of
contemporary kernels. Crackers could also use it, of course, but they
already have their own rootkit tools and gain no real benefit from an
officially-supported runtime patching mechanism. Whether this aspect of
Ksplice is of concern to the development community may be seen in the
coming months as this code gets closer to mainline inclusion.
By Jake Edge
November 25, 2008
There is a lot of functionality (things like filesystems and device
drivers) that is normally considered to be a kernel task but has,
over time, been allowed to move into user space. The UIO user space driver framework
came along in 2.6.23, while filesystems in user space (FUSE) have been
around since 2.6.14. Tejun Heo would like to see this idea broadened even
further with the character
devices in user space (CUSE) patches.
At first blush, the uses for a character device implemented in user space
are not obvious. Looking a bit deeper, though, one finds numerous
programs—both open and closed source—that rely on legacy
character drivers. Those drivers are currently in the kernel, but need not
be if there were a way to implement them in user space. In addition,
older, deprecated interfaces, such as Open Sound System (OSS), can be better
supported without constantly fiddling with the in-kernel emulation.
Providing better OSS support is one of the prime motivators for CUSE, as
Heo announced in a linux-kernel posting
introducing the OSS
proxy. The proxy uses CUSE to implement the /dev/dsp,
/dev/adsp, and /dev/mixer devices that programs using OSS
expect. Adrian Bunk didn't necessarily see
this as a good thing:
Sorry for being
destructive, but 6 years after ALSA went into the kernel we are slightly
approaching the point where all applications support ALSA.
The
application you list on your webpage is UML host sound support, and I'm
wondering why you don't fix that instead of working on a better OSS
emulation?
But Heo sees the current state of OSS emulation as a rather complicated
mess that, for better or worse, needs cleaning
up:
We now have in-kernel OSS emulation which can't mux with other streams,
aoss [ALSA OSS emulation] with its own supported and broken list and can
also be routed
through PA [PulseAudio] by configuring ALSA right and then padsp [PA OSS
emulation] with its own
supported and broken list and nothing works good enough. So, if we have
one thing which just works, we can in time put all those to rest.
But there are other uses for CUSE too. Greg Kroah-Hartman notes that legacy
software for talking to Palm Pilots, much of which is binary-only, expects
to talk to a /dev/pilot serial port. The kernel carries around a
driver, but "a libusb userspace program can handle all of the data to
the USB device instead". So CUSE could be used to eventually remove
another crufty driver from the kernel, while still maintaining
compatibility with old user space code.
CUSE is implemented on top of FUSE as there is a fair amount of overlap
between them. Character devices and filesystems implement many of the same
file operations—things like open(), close(),
read(), and write()—which makes them a good match.
Heo has a separate patchset for
FUSE that implements additional operations for filesystems, some of
which will be used by CUSE.
The additional FUSE operations include an implementation of
ioctl() that is necessarily rather ugly. Because an
ioctl implementation can access memory in unpredictable
ways—and those data structures can be arbitrarily deep—there
needs to be a mechanism for user-space CUSE devices to read and write that
memory. The CUSE server does not have direct access to the caller's
memory, so a multi-step
ioctl() with retries must be implemented. This particular bit of
ugliness is only allowed for in-kernel use, so that CUSE (or other
things like it) can allow "unrestricted" ioctl() implementations.
All FUSE filesystems are still required to have "restricted"
ioctls where the kernel can determine the direction and amount of
data that is transferred.
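The retry idea can be illustrated with a toy user-space model. It is not the FUSE wire protocol: the caller's memory is a plain byte array, "addresses" are offsets into it, and struct range, struct fetched, server_ioctl(), and do_ioctl() are all hypothetical names. The server is handed only what it asked for on the previous round and asks again until it has everything it needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_RANGES 2
#define MAX_DATA   64
#define MAX_ROUNDS 4

/* A span of the caller's memory; "addresses" are offsets into a byte array. */
struct range {
    uint32_t addr;
    uint32_t len;
};

/* What the kernel side hands to the server on each round. */
struct fetched {
    struct range where;
    unsigned char data[MAX_DATA];
};

enum verdict { IOCTL_DONE, IOCTL_RETRY };

/*
 * Hypothetical server-side handler. The argument points at a header
 * { uint32_t len; uint32_t buf_addr; } which in turn points at a buffer, so
 * two rounds of fetching are needed before the result can be computed.
 */
static enum verdict server_ioctl(uint32_t arg, const struct fetched *in,
                                 size_t nin, struct range *want, size_t *nwant,
                                 long *result)
{
    uint32_t hdr[2];

    if (nin == 0) {                 /* round 1: ask for the header */
        want[0] = (struct range){ arg, (uint32_t)sizeof(hdr) };
        *nwant = 1;
        return IOCTL_RETRY;
    }
    memcpy(hdr, in[0].data, sizeof(hdr));
    if (nin == 1) {                 /* round 2: ask for the buffer as well */
        want[0] = in[0].where;
        want[1] = (struct range){ hdr[1], hdr[0] };
        *nwant = 2;
        return IOCTL_RETRY;
    }
    *result = 0;                    /* round 3: everything has been fetched */
    for (uint32_t i = 0; i < in[1].where.len; i++)
        *result += in[1].data[i];
    return IOCTL_DONE;
}

/* Kernel-side model: copy what the server asked for, then call it again. */
static int do_ioctl(const unsigned char *mem, size_t memlen, uint32_t arg,
                    long *result)
{
    struct fetched in[MAX_RANGES];
    struct range want[MAX_RANGES];
    size_t nin = 0, nwant = 0;

    assert(mem != NULL && result != NULL);
    memset(in, 0, sizeof(in));
    for (int round = 0; round < MAX_ROUNDS; round++) {
        if (server_ioctl(arg, in, nin, want, &nwant, result) == IOCTL_DONE)
            return 0;
        for (size_t i = 0; i < nwant; i++) {
            if (want[i].len > MAX_DATA || want[i].addr > memlen ||
                want[i].len > memlen - want[i].addr)
                return -1;          /* request falls outside caller memory */
            in[i].where = want[i];
            memcpy(in[i].data, mem + want[i].addr, want[i].len);
        }
        nin = nwant;
    }
    return -1;                      /* the server never settled */
}
```

The side doing the copying never learns the layout of the argument in advance; it only follows the server's requests, which is why this form is described as "unrestricted".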
poll() support has also been added to FUSE, which, in turn,
requires a separate patch that allows poll() callbacks to sleep
(described in this article).
Once the FUSE changes are in place, the actual implementation of CUSE is
relatively small, weighing in at around 1000 lines plus some housekeeping to
rename and export FUSE symbols. At its core, it pairs a FUSE-mounted
filesystem, which connects to the device implementation in user space, with
the character device exported by the kernel, binding the two together. FUSE
handles the interaction with the user-space code, in the same way that it
does for a filesystem.
CUSE creates a device for commands, /dev/cuse, which is opened by
a program that wants to implement a particular character device. CUSE
queries the opener to determine which device it is implementing and then
creates the device node. For most operations, CUSE just hands off to FUSE,
but for open() it, instead, opens a file from the FUSE mount,
storing the file handle for use by later operations.
In many ways, CUSE is a kind of impedance matching layer that creates
something that acts like a character device, but has no hardware directly
behind it. This allows CUSE to ignore things like hardware interrupts;
those would need to be handled by something else, typically a downstream
driver—the soundcard driver in the OSS proxy case. This is one of
the big differences between UIO and CUSE. UIO is much more like a regular
kernel device driver that requires kernel code to handle interrupts. CUSE
drivers, on the other hand, can be created without ever touching kernel
space.
The only objection so far seems to be Bunk's complaint about supporting
OSS when it has been deprecated for so long. As Heo points out, though,
there are still many applications that only support OSS. In addition, all
of the code that has been submitted is "way smaller than the
in-kernel ALSA OSS emulation which is somewhat painful to use these
days", Heo says. Since there are
other potential users of CUSE, not just the OSS proxy, it would seem that,
absent any major objections, CUSE could make it into 2.6.29.
By Jonathan Corbet
November 24, 2008
There are currently a number of proposed driver API changes being discussed
on the lists. None of them are major, but they are worth being aware of.
poll()
Most of the functions in the file_operations structure are
concerned with I/O. So it is not surprising that these functions are
allowed to sleep. Except that, as it turns out, one of them -
poll() - cannot. There is nothing inherent in the poll()
or select() system calls which would require the driver
poll() callback to be nonblocking; this requirement is, instead, a
result of the implementation. In essence, the core poll()
implementation looks like this:
for (;;)
    set_current_state(TASK_INTERRUPTIBLE)
    for each fd to poll
        ask driver if I/O can happen
        add current process to driver wait queue
    if one or more fds are ready
        break
    schedule_timeout_range(...)
The problem is relatively straightforward: if a specific driver chooses to
sleep in its poll() callback, the current task state will get set
back to TASK_RUNNING and schedule_timeout_range() will return
immediately. So a sleeping driver turns the main loop into a busy-wait.
The solution, as developed by
Tejun Heo, is also straightforward. His patch causes
sys_poll() to define a custom wakeup function which, in turn, sets
a new triggered flag when called. That eliminates the need to put
the process into TASK_INTERRUPTIBLE for the duration of the main
loop; that can be done, instead, right before actually sleeping.
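The difference between the two orderings shows up in a single-threaded toy model. This is an illustration only, not the patch itself: struct model, model_schedule(), and the two *_poll_pass() functions are hypothetical, and real wakeups arrive asynchronously rather than through a function argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum task_state { RUNNING, INTERRUPTIBLE };

struct model {
    enum task_state state;
    bool triggered;       /* set by the custom wakeup function */
    int real_sleeps;      /* times the scheduler really put the task to sleep */
    int spurious_returns; /* times it returned immediately instead */
};

/* A driver poll callback that sleeps: on return the task is RUNNING again. */
static void sleeping_driver_poll(struct model *m)
{
    m->state = RUNNING;
}

/* Scheduler model: only a task that is not RUNNING can really sleep. */
static void model_schedule(struct model *m)
{
    assert(m != NULL);
    if (m->state == RUNNING) {
        m->spurious_returns++;
    } else {
        m->real_sleeps++;
        m->state = RUNNING; /* woken later by a wakeup or the timeout */
    }
}

/* Old ordering: state is set before the callbacks, so a sleeper clobbers it. */
static void old_poll_pass(struct model *m)
{
    m->state = INTERRUPTIBLE;
    sleeping_driver_poll(m);
    model_schedule(m);
}

/* New ordering: state is set just before sleeping, guarded by 'triggered'. */
static void new_poll_pass(struct model *m, bool wake_during_poll)
{
    sleeping_driver_poll(m);
    if (wake_during_poll)
        m->triggered = true; /* what the custom wakeup function would do */
    m->state = INTERRUPTIBLE;
    if (!m->triggered)
        model_schedule(m);
    m->state = RUNNING;
    m->triggered = false;    /* ready for the next pass */
}
```

In the old ordering every pass ends in a spurious return, which is the busy-wait. In the new ordering the task either sleeps for real or, if a wakeup arrived while the callbacks ran, skips the sleep without losing the event.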
Most driver writers can remain unaware of this change, which looks highly
likely to be merged for 2.6.29. But, for those who need it, there will be
one more degree of flexibility in the implementation of poll()
callbacks.
Exclusive I/O memory
For a while, developers involved in the hunt for the e1000e corruption
bug thought that the X server might be the problem. The real bug
turned out to be elsewhere, but the suspicion cast upon X led to the
development of a new API designed to make it harder for user-space programs
to interfere with the operation of an in-kernel driver.
In particular, it seemed sensible to prevent user space from manipulating
I/O memory which has been allocated by device drivers. This can be
achieved by not allowing an mmap() call on /dev/mem to
map regions already given to drivers. If the STRICT_DEVMEM
configuration option is set, the kernel will protect its own memory from
mapping by user space; protecting I/O memory is really just a matter of
extending that mechanism.
Arjan van de Ven has implemented that feature in his MMIO exclusivity patch. He
chose, however, not to make this protection the default. Instead, drivers
which want exclusive access to an I/O memory region should call one of
these new functions:
int pci_request_region_exclusive(struct pci_dev *pdev, int bar,
                                 const char *res_name);
int pci_request_regions_exclusive(struct pci_dev *pdev,
                                  const char *res_name);
int pci_request_selected_regions_exclusive(struct pci_dev *pdev,
                                           int bars,
                                           const char *res_name);
There is also a new, low-level allocation macro:
request_mem_region_exclusive(start, n, name);
In each case, these functions are equivalent to their non-exclusive
cousins, except for the changed name and the resulting exclusive
allocation.
There may be cases where a developer wants to be able to map a region from
user space on a development system, regardless of what the driver thinks.
For such situations, there is a new iomem=relaxed boot parameter.
When relaxed is selected, exclusive allocations are not enforced.
Clearly this is not an option which one would want to set on a production
system, but it may be useful in development environments.
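The resulting policy can be summarized in a small user-space model. The names struct model_region and model_may_map() are hypothetical, and the real check lives in the kernel's resource and /dev/mem code rather than in a flat array; this sketch only captures the decision described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Model of a claimed I/O memory region; the fields are illustrative only. */
struct model_region {
    unsigned long start;
    unsigned long len;
    bool exclusive; /* claimed through one of the *_exclusive() calls */
};

/*
 * Decide whether a user-space mapping of [addr, addr + len) may proceed.
 * 'relaxed' models booting with iomem=relaxed.
 */
static bool model_may_map(const struct model_region *regions, size_t n,
                          unsigned long addr, unsigned long len, bool relaxed)
{
    assert(regions != NULL || n == 0);
    if (relaxed)
        return true;
    for (size_t i = 0; i < n; i++) {
        const struct model_region *r = &regions[i];
        bool overlaps = addr < r->start + r->len && r->start < addr + len;

        if (r->exclusive && overlaps)
            return false;
    }
    return true;
}
```

Regions claimed with the ordinary, non-exclusive calls remain mappable, which is why existing drivers and user-space tools are unaffected unless a driver opts in.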
DMA API debugging
The last topic is not actually an API change, but it's worth a look
anyway. The kernel provides a nice API for setting up DMA operations. In
many cases, the associated functions do little or no work; the system they
are running on does not require any additional effort. The result is that
a lot of "tested" driver code may, in fact, have serious errors in its use
of the DMA API. When those drivers are run on a different system - one
with an I/O memory management unit (IOMMU) in particular - those errors
could lead to no end of unpleasant behavior.
Kernel developers like the idea of finding bugs before they bite users on
remote systems. To help make that happen with the DMA API, Joerg Roedel
has posted a new DMA API
debugging facility. This feature, when built into the kernel, should
make it possible to find a number of previously-hidden bugs in device
drivers. It has, in fact, already turned up a few problems with in-tree
drivers, mostly in the networking subsystem.
Use of this facility simply requires enabling a configuration option; the
API itself does not change. Once it's enabled, this code will check for a
number of problems, including freeing DMA buffers with a different size
than was given at allocation time, freeing buffers which were never
allocated at all, mixing coherent and non-coherent functions on the same
buffer, confusion over I/O directions, and more. Each of these problems
might slip by on a developer's test system, but might create havoc where an
IOMMU is being used. When a problem is found, a warning and stack
traceback are logged.
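A few of these checks can be modeled in user space with a simple table of live mappings. This is a sketch of the idea, not the posted code: debug_map(), debug_unmap(), the fixed-size table, and the error codes are all hypothetical, and the real facility hooks into the DMA API itself rather than asking drivers to call anything new.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum dma_dir { TO_DEVICE, FROM_DEVICE };
enum dma_err {
    DMA_OK, DMA_ERR_NOT_MAPPED, DMA_ERR_SIZE, DMA_ERR_DIRECTION, DMA_ERR_FULL
};

#define MAX_MAPPINGS 16

struct mapping {
    unsigned long addr;
    size_t size;
    enum dma_dir dir;
    int in_use;
};

static struct mapping table[MAX_MAPPINGS];

/* Record a mapping at the point where a driver would map a buffer. */
static enum dma_err debug_map(unsigned long addr, size_t size, enum dma_dir dir)
{
    assert(size > 0);
    for (size_t i = 0; i < MAX_MAPPINGS; i++) {
        if (!table[i].in_use) {
            table[i] = (struct mapping){ addr, size, dir, 1 };
            return DMA_OK;
        }
    }
    return DMA_ERR_FULL;
}

/* Check an unmap against the record and report any mismatch. */
static enum dma_err debug_unmap(unsigned long addr, size_t size, enum dma_dir dir)
{
    for (size_t i = 0; i < MAX_MAPPINGS; i++) {
        struct mapping *m = &table[i];

        if (!m->in_use || m->addr != addr)
            continue;
        m->in_use = 0;
        if (m->size != size) {
            fprintf(stderr, "DMA debug: unmap size %zu, mapped %zu\n",
                    size, m->size);
            return DMA_ERR_SIZE;
        }
        if (m->dir != dir) {
            fprintf(stderr, "DMA debug: unmap direction differs from map\n");
            return DMA_ERR_DIRECTION;
        }
        return DMA_OK;
    }
    fprintf(stderr, "DMA debug: unmap of an address that was never mapped\n");
    return DMA_ERR_NOT_MAPPED;
}
```

On a system where mapping and unmapping are no-ops, none of these mistakes has a visible effect; keeping an independent record is what lets the checker notice them anyway.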
The response to this facility has been positive. The biggest complaint seems to
be about the fact that it is implemented as an x86-specific feature.
So it will probably have to be made generic before merging - after all,
developers on other platforms are entirely capable of introducing
DMA-related bugs too. Once it goes in, this feature should probably be
enabled on any system used for driver development.
Page editor: Jake Edge