Brief items
Many more changes have been merged into Linus's kernel tree since last
week, about 2800 changes to be exact. There has been merges with ALSA,
i2c, hwmon, PCI, USB, XFS, Driver core, Power PC architecture, ARM
architecture, SPARC 64 architecture, m68k architecture, x86-64 architecture,
network drivers, SATA, ACPI, networking core, V4L core and drivers, NFS,
Infiniband, DM, MD, the build system, OCFS2, XFS, CIFS, JFFS2, and just about
every other part of the kernel has been updated. In short, a huge merge of
bugfixes and updates, showing there is nothing slowing down the pace of kernel development.
(Thanks to Greg Kroah-Hartman for this update.)
Comments (none posted)
Kernel development news
No, the inevitable flame war here is the abusive way of telling people
not to extract the kernel sources as root. This argument boils down to
a fundamental disjunct: trust people to handle security of their own box
their own way, with full knowledge of how their tools work, or assume
that they aren't intelligent enough to use tools sanely and securely,
and handicap so they don't have to. The latter, much as it is not seen
this way, is the abusive philosophy. The former trusts the user.
Yes, there's a learning curve. There is always a learning curve. Never
expect there not to be a learning curve.
-- Matthew Frost <artusemrys -at- sbcglobal.net>
Comments (4 posted)
This article is the somewhat delayed followup to
Trees I, which looked at
the radix tree interface. Along with radix trees, the kernel contains an
implementation of a data structure known as a "red-black tree." These
trees (known as "rbtrees" within the kernel) are a form of semi-balanced
binary tree. Each node in the tree contains a value and up to two
children; the node's value will be greater than that of all children in the
![[Red-black tree]](/images/ns/kernel/wikipedia-rbtree.png)
"left" child branch, and less than that of all children in the "right"
branch. Thus, it is possible to serialize a red-black tree by performing a
depth-first, left-to-right traversal.
Every node in a red-black tree is considered to be colored either red or
black, with the root always being black. There is a somewhat complicated
set of rules on how nodes should be colored, and, in particular, how the
colors of the nodes should be used to make decisions on when and how to
rebalance the tree. This article will not go into the details of the
red-black tree mechanism, especially since that mechanism is well described
by the Wikipedia
red-black tree article (which is also the source of the image used
here). Instead, we'll focus on how red-black trees are used in the Linux
kernel.
The complex rules for red-black trees do bring some advantages. Since it
is a binary tree, a red-black tree can perform lookups in logarithmic
time. If the tree is properly maintained, the longest path to a leaf node
in the tree will never be more than twice as long as the shortest path - in
other words, the tree is always in approximate balance. But the property
which is arguably most useful in the kernel context is the fact that
insertions and deletions are (1) fast, and (2) provably bounded
in time. All the work that the kernel developers have put into reducing
latencies would be wasted if a data structure were to simply go off for an
indeterminate period of time rebalancing itself. Users of red-black trees
pay a small lookup cost because the tree is not perfectly balanced, but, in
return, they get fast, bounded insertion and deletion operations. A
red-black tree can, thus, be indicated in situations where nodes come and
go frequently.
There are a number of red-black trees in use in the kernel. The
anticipatory, deadline, and CFQ I/O schedulers all employ rbtrees to track
requests; the packet CD/DVD driver does the same. The high-resolution
timer code uses an rbtree to organize outstanding timer requests. The ext3
filesystem tracks directory entries in a red-black tree. Virtual memory
areas (VMAs) are tracked with red-black trees, as are epoll file
descriptors, cryptographic keys, and network packets in the "hierarchical
token bucket" scheduler.
The process of using a red-black tree starts by including
<linux/rbtree.h>. This is one of the trickier kernel data
structures to use, however. When designing a general data structure for a
language like C, the developer must always decide how to include arbitrary
types within the structure, and how to make comparisons between them. The
person who implemented Linux rbtrees (the copyright in the code is to
Andrea Arcangeli) made these decisions:
- Structures which are to be part of an rbtree must include a struct
rb_node within them; there are no void * pointers
to separate objects. This is a common way of implementing kernel data
structures, and so will not surprise too many people.
- There is no "compare two objects" callback used in the rbtree code.
Instead, users of rbtrees must, for all practical purposes, write the
top-level search and insertion functions
themselves, using lower-level rbtree primitives. As a result, using
an rbtree is a bit more work, and the data structure is rather less
opaque than our computer science teachers would have liked. What is
gained in return, however, is a faster overall implementation without
a bunch of indirect function calls in the hottest part of the tree
traversal loops.
It should also be remembered that an rbtree, like many other kernel data
structures, implements no locking of its own. Any code which uses an
rbtree must implement its own mutual exclusion to keep the tree from being
corrupted. Usually, that locking will fit well with the scheme already
being used by that code anyway, so there is no need for an independent
locking mechanism.
The root of a red-black tree has the type struct rb_root; a tree
can be initialized to the empty state with a line like:
struct rb_root the_root = RB_ROOT;
Assume, for a moment, that we have a red-black tree which is already full
of interesting data. Traversal of that tree (which does not involve
searching) is straightforward:
struct rb_node *rb_first(struct rb_root *tree);
struct rb_node *rb_last(struct rb_root *tree);
struct rb_node *rb_next(struct rb_node *node);
struct rb_node *rb_prev(struct rb_node *node);
A call to rb_first() will return a pointer to the first entry in
the tree, while rb_last() returns the final entry. Moving forward
and backward through the tree is a simple matter of calling
rb_next() and rb_prev(). In all of these cases, a return
value of NULL indicates that the requested node does not exist.
Since the rb_node structures are embedded within some other
structure of interest, finding the rb_node is a simple matter of
using the right structure field. A call to one of the above functions will
return a pointer to the embedded rb_node structure, however, and
not the containing structure which is, normally, what the programmer really
wants. This is the situation that the container_of() macro was
created for, but, in this case, there is no need to use
container_of() directly. Instead, use rb_entry():
rb_entry(pointer, type, member);
Where pointer is a pointer to an rb_node structure,
type is the type of the containing structure, and member
is the name of the rb_node structure within the container.
Searching an existing tree for a value is simply a matter of starting at
the root, then, for each node, comparing the value of that node against the
target and following the left or right branch as necessary. So all rbtree
search code tends to look like the following:
struct my_stuff *my_rb_search(struct rb_root *root, int value)
{
struct rb_node *node = root->rb_node; /* top of the tree */
while (node)
{
struct my_stuff *stuff = rb_entry(node, struct my_stuff, node);
if (stuff->coolness > value)
node = node->rb_left;
else if (stuff->coolness < value)
node = node->rb_right;
else
return stuff; /* Found it */
}
return NULL;
}
Here, we are searching for a struct my_stuff whose
coolness field matches the given value. An integer value
is used for simplicity, but not all uses need be so simple. If the
coolness of the root node is greater than the target value, then
that value must be found in the left branch of the tree (if it is in the
tree at all), so the search follows the rb_left branch and starts
over. A search value greater than the current node's value indicates that
the right branch should be used instead. Eventually this function will
either find an exact match, or hit the bottom of the tree.
The insertion case is a little trickier. The code must traverse the tree
until it finds the leaf node where the insertion should take place. Once
it has found that spot, the new node is inserted as a "red" node, and the
tree is rebalanced if need be. Insertion code tends to have this form:
void my_rb_insert(struct rb_root *root, struct my_stuff *new)
{
struct rb_node **link = &root->rb_node, *parent;
int value = new->coolness;
/* Go to the bottom of the tree */
while (*link)
{
parent = *link;
struct my_stuff *stuff = rb_entry(parent, struct my_stuff, parent);
if (stuff->coolness > value)
link = &(*link)->rb_left;
else
link = &(*link)->rb_right;
}
/* Put the new node there */
rb_link_node(new, parent, link);
rb_insert_color(new, root);
}
In this case, the traversal of the tree looks similar to the search case.
However, the link pointer is doubly indirected; in the end, it
will be used to tell the rbtree code which branch pointer (rb_left
or rb_right) should be set to point to the new entry. The code
follows the tree all the way to the bottom, at which point the
parent pointer identifies the parent of the new node, and
link points to the appropriate field within parent.
Then, a call is made to:
void rb_link_node(struct rb_node *new_node,
struct rb_node *parent,
struct rb_node **link);
This call will link the new node into the tree as a red node. After this
call, however, the tree may no longer meet all the requirements for a
red-black tree, and may thus need to be rebalanced. That work is done by
calling:
void rb_insert_color(struct rb_node *new_node, struct rb_root *tree);
Once that step is complete, the tree will be in consistent form.
There is an important assumption built into the above example: the new
value being inserted into the tree is not already present there. If that
assumption is not warranted, a corrupted tree could result. If the
possibility of a duplicated insertion exists, the code must be careful to
test for an exact match (as is done in the search case) and stop (without
inserting the node) if that match is found.
Removal of a node from a tree is simpler; simply call:
void rb_erase(struct rb_node *victim, struct rb_root *tree);
After the call, victim will no longer be part of tree,
which may have been rebalanced as part of the operation.
If one tree entry is being replaced by another with the same value,
however, there is no need to go through the removal and insertion process.
Instead, use:
void rb_replace_node(struct rb_node *old,
struct rb_node *new,
struct rb_root *tree);
This call will quickly remove old from the tree, substituting
new in its place. If new does not have the same value as
old, however, the tree will be corrupted.
Comments (14 posted)
The past two weeks has seen a
huge long email thread about the future of suspend in Linux. No, not
that other type of suspend, this is
all about what users really want, a working suspend to RAM.
It all started out with a few simple patches from Linus that implemented a
framework for allowing a way to debug problems during suspend, but quickly
spiraled out of control into rants about how badly the kernel handles
suspend issues today:
> I think you are trying to change a model that is not broken...
Bzzt. Thank you for playing.
The fact is, this thing has been broken for years. At some point,
we have to just accept the fact that it's not just "drivers".
There's something else that is broken, and I bet it's the model.
To how wrong everyone has been over the years in how suspend should really
work:
See? WE DO NOT DO THIS. I told people we needed to do this _years_
ago. I tried to push through the two-phase suspend. I tried to
explain why. I clearly failed, because we do _nothing_of_the_sort_
right now.
Instead, the "please suspend" thing to the devices is a
single-phase "put yourself into D3", with no support for a
separate "please save your state" call. Crap.
After arguing this last point over and over for many emails, Linus did
what anyone should do who wants to prove that their point is correct, he
wrote up a working patch that implements his proposed changes.
To fully understand the problem, let us look at the interface that the
kernel provides drivers today to handle suspend. When the kernel wants to
shut devices down (for some kind of suspend action), the whole device tree
is walked, and the suspend callback is called.
For PCI devices, this callback looks like:
int (*suspend) (struct pci_dev *dev, pm_message_t state);
The pointer to the PCI device that is about to be suspended is passed to
the driver, along with the state that the kernel wants to go into. Within
this single function, the driver is responsible for doing all suspend
tasks needed for the device.
The big problem with this is that if a device can not be suspended at that
point in time, it has to go through great lengths to try to let the core
know that it should be called back again (it does this by returning
-EAGAIN to the core and hoping that it will be called back.)
But the big issue is that the driver is responsible for shutting the
device down entirely in this function. This prevents the kernel from
doing things like system snapshots easily, or what to do if the driver
simply does not have enough memory available to it in order to properly
save the device state off in order to suspend.
Also the big issue is that the "class" cores should be handling most of
the suspend process, instead of the individual drivers. For example, the
network core should be shutting down the transmit queues and making stuff
go quiet for the drivers, so that they do not need to individually do this
in each and every driver. This last point is the biggest change in
Linus's model, and (in this author's opinion) the most important issue.
So, Linus changes the suspend process to a series of different steps:
- All devices start out on a list called dpm_active and are, as
indicated, "active" and up and running.
- A new callback is called for every device in the global device tree. This
callback is called suspend_prepare and has the same arguments
that the current suspend callback has for each individual bus
type. In this function, the devices are not allowed to disconnect
themselves from the kernel (like USB devices disconnecting themselves to
shut down), and the drivers for these devices need to do everything
necessary to be ready to suspend the device some time in the future. This
usually entails allocating any needed memory to save the device state, or
other kinds of housekeeping. Anything that might possibly fail should be
done here, and if something bad happens, the error should be reported.
Drivers can call functions that might sleep here, as interrupts are not
disabled.
- The kernel then iterates over all of the dpm_active list and
moves it to the dpm_off list and calls the suspend
callback for the different subsystems (which is new). Followed by the
subsystem suspend, the bus suspend callback is made.
- Interrupts are now disabled in the system.
- Then the kernel iterates over all of the devices on the dpm_off
list and moves them to the dpm_off_irq list, while calling a new
callback called suspend_late().
- After this is complete, the system can be suspended by shutting
down the CPU by putting it into any sleep level that is desired.
To resume the system, the kernel reverses the order of manipulating the
device lists and does the following steps:
- The kernel iterates over the dpm_off_irq list and moves the
devices to the dpm_off list while calling a new callback called
resume_early.
- Interrupts are enabled.
- The kernel iterates over all of the devices on the dpm_off list
and moves them to the dpm_active list, while calling the
resume callback (first the bus specific resume function,
followed by the class specific resume.)
This new scheme allows the kernel to properly handle error conditions if
anything bad happens while the suspend process was happening. For
example, if an error is caused during the suspend_late process,
then only the devices on the dpm_off_irq list will be called with
the resume_early callback in order to resume the system in the
proper procedure and recover from the error properly.
Linus's patch is a small patch, not over 400 lines, and generated some
good feedback with other kernel developers who seem to be coming around to
this new scheme. The patch has not shown up in any public kernel trees
yet, but hopefully soon Linux will be able to handle suspend issues in a
much more robust and correct manner.
Comments (12 posted)
June 26, 2006
This article was contributed by Valerie Henson
[
Editors note: this is the second in the Kernel Hacker's Bookshelf
series by Valerie Henson; if you missed it, the first article is over here.]
Computer programs have bugs. As programmers, we know that this is
inevitable, given the trade-off in time and money against creating a
perfect system. Systems with nearly-zero bug counts exist (e.g., the
Shuttle software, only 17 bugs in 420,000 lines of code over the last
11 releases) but they require vast amounts of work to achieve this
level of correctness, work that is completely unjustifiable for most
programs (such as desktop operating systems). But we're programmers,
it's our job to replace time and money with smart ideas.
What would happen if when a program had a memory error - and it
detected that error, ignored it, and drove happily on, oblivious to
the failure? You would expect that this would result in horrible
errors and obscure crashes. But what if it worked - or even made
things better? For example, failing to check the size of a memory
copy operation can result in a buffer overflow attack. Could we do
something clever that would both paper over the memory error and keep
the application running, more or less on track?
A Solution
Martin Rinard and a few of his colleagues got to wondering about this
question and decided to test it - and found that the answer was yes,
you can automatically handle memory bugs in a better, safer way than
either ignoring the bug or terminating the program. I first heard of
their technique,
Failure-Oblivious
Computing, at their talk at
OSDI 2004. The talk
was quite lively; if there was a "Most Laughs per Minute" award,
Martin Rinard would have won it.
The explanation of how failure-oblivious computing is implemented
might seem utterly crazy, but stick with me. Remember, the amazing
thing about failure-oblivious computing is that when you implement it,
it works! (At least for quite a few useful applications.) The basic
idea is to detect memory errors - out-of-bound reads, out-of-bound
writes - and instead of killing the program, handle otherwise fatal
errors by turning them into relatively benign bugs. Detecting the
memory errors requires a "safe-C compiler" - a C compiler that adds
run-time memory access checks.
Safe-C compilers (and languages that always check memory accesses)
have been around for a long time. When they detect a memory error,
the process gets a segmentation fault, and usually exits shortly
thereafter. In failure-oblivious computing, the application never
even knows the memory error happened. In the case of an out-of-bounds
write, the write is silently thrown away and execution continues.
Handling out-of-bounds reads is slightly harder. In this case, a
made-up value is manufactured and returned.
How do you pick which value to return? Two observations lie behind
the answer. First, 0 and 1 are the most common values in computation.
Second, sometimes the program is looking for a particular value before
returning, such as searching for a particular ASCII character in a
string, or iterating through a loop 100 times. The result is a series
of return values that looks something like this:
0, 1, 2, 0, 1, 3, 0, 1, 4,...
So you throw away invalid writes, and make up stuff to return for
invalid reads. Crazy, right? But crazy like a fox.
Why does it work?
Failure-oblivious computing is targeted at a particular class of
applications, ones with short error-propagation distances - in other
words, applications that have relatively short execution paths which
return without affecting much global state. This includes a rather
useful class of applications, such as web servers, mail servers, and
mail readers. It does not include applications like scientific
modeling software, in which one wrong value can fatally corrupt the
final answer. Software programs which handle incoming requests and
return to a waiting state, or have many independent threads of
execution are good candidates for failure-oblivious computing.
Another reason failure-oblivious computing works is because memory
errors are transformed into input errors. Since the programs have to
deal with invalid or malicious input already, often the result is an
anticipated error, one the program knows how to deal with cleanly.
For example, a buffer overflow attack on Sendmail uses a malformed,
too-long, illegal email address to overwrite some other part of the
program's memory. This technique silently discards the writes that go
beyond the buffer, and Sendmail continues on to check the validity of
the input - whether or not it's a correctly formed email address.
Answer: No, so throw it away and go on to the next request. At this
point, Sendmail is back in known territory and the error has stopped
propagating.
A limitation of this technique is the cost of memory bounds checking.
Applications that need to access memory frequently will probably not be
good candidates for this technique. However, applications that are
limited by I/O time, or only need to complete before the human user
notices a delay, won't be much impacted by the cost. Indeed, humans
can't detect delays below about 100 milliseconds - an eternity in
computational time.
Failure-oblivious computing in practice
Rinard and his co-authors evaluated failure-oblivious computing with
versions of several commonly used open source applications with known
buffer overflow attacks: Sendmail, Pine, Apache, and Midnight
Commander. They ran three versions of each program: an unaltered
version, one using just safe-C compilation, and one transformed into a
failure-oblivious program. In each case, the failure-oblivious
version performed acceptably (sometimes better), did not create any
new bugs, and did not suffer any security breaches.
One example was the Pine mail reader. It had a bug in processing the
"From" field for display in the message index. It needed to add a '\'
character in front of certain characters, but allocated a too-small
buffer to copy it into. Some "From" fields could overflow the buffer
and cause the program to segfault and die. The safe-C version of the
program dies as well, because all it can do is detect the buffer
overflow. The failure-oblivious version threw away the writes beyond
the end of the buffer, and then went on to behave exactly correctly!
The length of the "From" field displayed in the index is shorter than
the length of the buffer, so the fact that it was truncated too early
is unobservable. When the user reads a particular message, a
different code path correctly displays the "From" field. Now an email
message that would cause Pine to die every time it was started could
be correctly displayed and handled.
The performance of failure-oblivious Pine was 1.3 to 8 times slower
times on certain tasks, but the total elapsed time to respond to user
input was still in the low milliseconds range. For interactive use,
the slowdown is acceptable. In the case of the Apache server bug, the
performance of the failure-oblivious server was actually better than
either of the other two versions. The higher performance was due to
the fact that the bug would kill an Apache thread each time it was
encountered, incurring the overhead of creating a replacement thread.
The failure-oblivious version did not have the overhead of constantly
killing and restarting threads and could server requests much faster.
Especially exciting is the use of failure-oblivious computing for
widely used network servers, such as Apache and Sendmail. The paper
has in-depth examinations of how buffer overflow bugs are prevented
and indeed ignored by the failure-oblivious versions of these and
other programs.
What failure-oblivious computing means for Linux
Linux has a huge variety of techniques for improving system security
in the face of bugs. SELinux, various stack protection schemes,
capabilities - all these techniques help cut down but don't eliminate
security problems. Failure-oblivious computing would fill one niche,
and in some cases will be the best solution due to the ability to
continue running after a normally-fatal memory error. Wouldn't it be
nice if, when everyone else is suffering from some brand-new zero-day
attack, your system is not only secure but still up and running?
More importantly, this paper teaches the value of experimentation with
obviously crazy ideas. Even after seeing the talk and reading the
paper and talking to the author, I still find it a little
mind-boggling that failure-oblivious computing works. Even more fun
is understanding why it works - a good reason to read the full paper
yourself. I am certain that computers (and computer science) will
continue to surprise us for many years to come.
[Do you have a favorite textbook or systems paper? Of course you do.
Send your suggestions to:
val dot henson at gmail dot com
Valerie Henson is a Linux kernel
developer working for Intel. Her interests include file systems,
networking, women in computing, and walking up and down large
mountains. She is always looking for good systems programmers, so
send her some email and introduce yourself.]
Comments (20 posted)
Version 1.0.12 rc 1 of the
ALSA
sound driver is out. See the
change log for details.
Comments (none posted)
Patches and updates
Kernel trees
Core kernel code
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
Security-related
Page editor: Forrest Cook
Next page: Distributions>>