The current development kernel is 2.6.37-rc3, released on November 21. Linus said:
And I have to say, I'm pretty happy with how quiet things have
been. Of course, maybe people are just lying low, waiting to
ambush me next week with a flood of patches when I'm gone in Japan,
all in order to try to be inconvenient. Because that's the kind of
people kernel developers are.
One notable change is that the attempt to make
/proc/kallsyms unreadable by default has been reverted because it
broke an older distribution (Ubuntu Jaunty).
The short-form changelog
is in the announcement; see the
full changelog for all the details.
Stable updates: the 2.6.36.1 and
2.6.35.9 updates were released on
November 22; each contains a long list of important fixes. Note that
2.6.35.9 is the last update for the 2.6.35 series.
Now, I do understand that everybody idolizes us software
people. Yes, we really are better, smarter, and more good-looking
than hardware engineers. Life is not fair, and the adoration of the
masses can be unbearable at times. When I go to the mall, I'm
covered in women's underwear in minutes - it's just embarrassing.
So yes, we're the Tom Jones of the engineering world.
So I can see how architecture designers could get some
complexes. I understand. But even if you're a total failure
in life, and you got your degree in EE rather than CompSci,
stand up for yourself, man!
Repeat after me: "Yes, I too can make a difference! I'm
not just a useless lump of meat! I can design hardware that
is wondrous and that I don't need to be ashamed of! I can
help those beautiful software people run their code better!
My life has meaning!"
Doesn't that feel good? Now, look down at your keyboard,
and look back at me. Look down. Look back. You may never
be as beautiful and smart as a software engineer, but with
Old Spice, you can at least smell like one.
Hardware and software should work together. And that does
not mean that hardware should just lay there like
a dead fish, while software does all the work. It should
be actively participating in the action, getting all
excited about its own body and about its own capabilities.
(thanks to George Spelvin)
Operating systems written by normal people rarely end up with
desirable performance characteristics.
-- Matthew Garrett
Like I know the goal here is to create the perfect kernel for
hardware Linus owns, but I'd like to be able to fix bugs urgently
on hardware users have that aren't so privileged, think of it as
some sort of outreach program.
-- Dave Airlie
Device drivers - especially those dealing with low-end hardware - sometimes
need to allocate large, physically-contiguous memory buffers. As the
system runs and memory fragments, those allocations are increasingly likely
to fail. That has led to a lot of schemes based around techniques like
setting aside memory at boot time; the contiguous memory allocator (CMA)
patch set covered
here in July is one example. There is an alternative approach out there,
though, in the form of Hiroyuki Kamezawa's big chunk memory allocator.
The big chunk allocator provides a new allocation function for large,
physically-contiguous regions of memory:
struct page *alloc_contig_pages(unsigned long base, unsigned long end,
unsigned long nr_pages, int align_order);
Unlike CMA, the big chunk allocator does not rely on setting aside memory
at boot time. Instead, it will attempt to organize a suitable chunk of
memory at allocation time by moving other pages around. Over time, the
memory compaction and page migration mechanisms in the kernel have improved
and memory sizes have grown, so this kind of large allocation is more
feasible than it once was.
There are some advantages to the big chunk approach. Since it does not
require that memory be set aside, there is no impact on the system when
there is no need for large buffers. There is also more runtime flexibility
and no need for the system administrator to figure out how much memory to
reserve at boot time. The down sides will be that memory allocation
becomes more expensive and the chances of failure will be higher.
Which approach will work better in practice is entirely unknown; answering
that question will require some significant testing by the people who need
the large allocation capability.
Kernel development news
For a long time, tracing was seen as one of the weaker points of the
Linux system. Things have changed dramatically over the last few years, to
the point that Linux has a number of interesting tracing interfaces. The
job is far from done, though, and there is not always agreement on how this
work should proceed. There have been a number of conversations related to
tracing recently; this article will survey some in an attempt to highlight
where the remaining challenges are.
The tracing ABI
Once upon a time, Linux had no tracing-oriented interfaces at all. Now,
instead, we have two: ftrace and perf events. Some types of information
are only available via the ftrace interface, others are only available from
perf, and some sorts of events can be obtained in either way. From the
discussions that have been happening for some time it's clear that neither
interface satisfies everybody's needs. In addition, there are other
subsystems waiting in the wings - LTTng and
a recently proposed system health
subsystem, for example - which bring requirements of their own. The
last thing that the system needs is an even wider variety of tracing
interfaces; it would be nice, instead, to pull everything together into a
single, unified interface.
Almost everybody involved agrees on that point, but that is about where the
agreement stops. Your editor, unfortunately, missed the tempestuous
session at the Linux Plumbers Conference where a number of tracing
developers came to an agreement of sorts: a new ABI would be developed with
the explicit goal of being a unified tracing and event interface for the
system as a whole. This ABI would be kept out of the mainline until a
number of tools had been written to use it; only when it became clear that
everybody's needs are met would it be merged. Your editor talked to a
number of the people involved in that discussion; all seemed pleased with
the outcome.
Ftrace developer Steven Rostedt interpreted the
discussion as a mandate to develop an entirely new ABI for tracing:
I think if we take a step back, we can come up with a new
buffering/ABI system that can satisfy everyone. We will still
support the current method now, but I really don't think it is
designed with everything we had in mind. I do not envision that we
can "evolve" to where we want to be. We may have to bite the
bullet, just like iptables did when they saw the failures of
ipchains, and redesign something new now that we understand what
the requirements are.
LTTng developer Mathieu Desnoyers took things even further, posting a "tracing ABI work plan" for discussion. That
posting was poorly received, being seen as a document better suited to
managerial conference rooms - a perception which was not helped by
Mathieu's subsequent posting of a massive common trace format document which would make
a standards committee proud. Kernel developers, as always, would rather see
code than extensive design documents.
When the code comes, though, it seems that there will be resistance to the
idea of creating an entirely new tracing ABI. Thomas Gleixner has expressed his dislike for the current state of
affairs and attempts to create complex replacements; he is calling for a
gradual move toward a better interface. Ingo Molnar has said similar things:
Fact is that we have an ABI, happy users, happy tools and happy
developers, so going incrementally is important and allows us to
validate and measure every step while still having a full
tool-space in place - and it will help everyone.
We'll need to embark on this incremental path instead of a
rewrite-the-world thing. As a maintainer my task is to say 'no' to
rewrite-the-world approaches - and we can and will do better here.
The existing ABI that Ingo likes, of course, is the perf interface. He
would clearly like to see all tracing and event reporting move to the perf
side of the house. The perf ABI, he says, is sufficiently extendable to
accommodate everybody's needs; there does not seem to be a lot of room for
negotiation on this point.
One of the conclusions reached at the 2010 Kernel Summit was that a small
set of system tracepoints would be designated "stable" and moved to a
separate location in the filesystem hierarchy. Tools using these
tracepoints would have a high level of assurance that things would not
change in future kernel releases; meanwhile, kernel developers could feel
free to add and use tracepoints elsewhere without worrying that they could
end up maintaining them forever. It seemed like an outcome that everybody
could live with.
Steven recently posted an implementation of
stable tracepoints reflecting that decision. His patch adds another
tricky macro (STABLE_EVENT()) which creates a stable tracepoint;
all such tracepoints are essentially a second, restricted view of an
existing "raw" tracepoint. That allows development-oriented tracepoints to
provide more information than is deemed suitable for a stable interface and
does not require cluttering the code with multiple tracepoint invocations.
There is also a new "eventfs" filesystem to host stable tracepoints which
is expected to be mounted on /sys/kernel/events. A small number
of core tracepoints have been marked as stable - just enough to show how
the scheme would work.
There were a number of complaints about eventfs, not the least of which
was Greg Kroah-Hartman's gripe that he had already written tracefs for
just this purpose. Ingo had a different complaint, though: he is pushing
an effort to distribute tracepoints throughout the sysfs hierarchy. The
current /sys/kernel/debug/tracing/events directory would not go
away (there are tools which depend on it), but future users of, say,
ext4-related tracepoints would be expected to look for them in
/sys/fs/ext4. It is an interesting idea which possibly makes good
sense, but it is somewhat orthogonal to Steven's stable tracepoint posting;
it doesn't address the stable/development distinction at all.
It eventually became clear that Ingo is
opposed to the
concept of marking some tracepoints as stable. He is, instead, taking the
position that anything which is used by tools becomes part of the ABI, and
that an excess of tools using too many tracepoints is a problem we wish we
had. This opposition, needless to say, could make it hard to get the
stable tracepoint concept into the kernel.
Here we see one of the hazards of skipping important developer meetings.
The stable tracepoint discussion was expected to be one of the more
contentious sessions at the kernel summit; in the end, though, everybody
present seemed happy with the conclusion that was reached. But Ingo was
not present. His point of view was not heard there, and the community
believes it has reached consensus on something he apparently disagrees
with. If Ingo succeeds in overriding that consensus, then Steven might not
be the only person to express thoughts like these:
Hmm, seems that every decision that we came to agreement with at
Kernel Summit has been declined in practice. Makes me think that
Kernel Summit is pointless, and was a waste of my time.
That conversation has quieted for now, but it will almost certainly
return. If nothing else, some developers are determined to change tracepoints when the need
arises, so this issue can be expected to come up again at some point.
One possible source of conflict is the recently-announced trace utility which, according to Ingo, has "no conceptual
restrictions" and will use tracepoints without regard for any sort
of "stable" designation.
One useful, but little-used, tracing-related tool is
trace_printk(). It can be called like printk() (though
without a logging level), but its output does not go to the system log;
instead, everything printed via this path goes into the tracing stream as
seen by ftrace. When tracing is off, trace_printk() calls have no
effect. When tracing is enabled, instead, trace_printk() data can
be made available to a developer with far less overhead than normal
printk() output. That overhead can matter - the slowdown caused
by printk() calls is often enough to change timing-related
behavior, leading to "heisenbugs" which are difficult to track down.
Output from trace_printk() does not look like a normal kernel
event, though, so it is not available to the perf interface. Steven has
posted a patch to rectify that, at the cost
of potentially creating large numbers of new trace events. With this
patch, every trace_printk() call will create a new event under
...events/printk/ based on the file name. So, to use Steven's
example, a trace_printk() call on line 2180 in kernel/sched.c would show
up in the events hierarchy as
.../events/printk/kernel/sched.c/2180. Each call could then be
enabled and disabled independently, just like ordinary tracepoints. It's a
convenient and understandable interface, but, if use of
trace_printk() ever takes off, it could lead to the creation of
large numbers of events.
That idea drew a grumble from Peter
Zijlstra, who said that it would be painful to use in perf. One of the
reasons for that has to do with how the perf API works: every event must be
opened separately with a perf_event_open() call and managed as a
separate file descriptor. If the number of events gets large, so does the
number of open files which must be juggled.
A potential solution also came from Peter,
in the form of a new "tracepoint collection" event for perf. This special
event will, when opened, collect no data at all, but it supports an
ioctl() call allowing tracepoints to be added to it. All
tracepoints associated with the collection event will report through the
same file descriptor, allowing tools to deal with multiple tracepoints in a
single stream of data. Peter says that the patch "is lightly tested
and wants some serious testing/review before merging," but we may
see this ABI addition become ready in time for 2.6.38.
Finally: access to tracepoints is currently limited to privileged users.
Tracepoints provide a great deal of information about what is going on
inside the kernel, so allowing anybody to watch them does not seem secure.
There is a desire, though, to make some tracepoints generally available so
that tools like trace can work in a non-privileged mode. Frederic
Weisbecker has posted a patch which makes that possible.
Frederic's patch adds an optional TRACE_EVENT_FLAGS() declaration
for tracepoints; currently, the only defined flag is
TRACE_EVENT_FL_CAP_ANY, which grants access to unprivileged
users. This flag has been applied to the system call tracepoints, allowing
anybody to trace system calls - at least, when tracing is focused on a
process they own.
An obvious conclusion from all of the above is that there are still a lot
of problems to be solved in the tracing area. The nature of the task is
shifting, though. We now have significant tracing capabilities in place,
and the developers involved have learned a lot about how the problem should
(and should not) be solved. So we're no longer in the position of
wondering how tracing can be done at all, and there no longer seems to be
any trouble selling the concept of kernel visibility to developers. What
needs to be done now is to develop the existing capability into something
which is truly useful for the development community and beyond; that looks
like a task which will keep developers busy for some time.
If you have been following Linux kernel development over the past few
months, it has been hard to overlook the massive thread on the Linux
Kernel Mailing List (LKML) resulting from an attempt to merge Google
Android's suspend blockers framework into the main kernel tree. Arguably,
the presentation of the patches might have been better and the explanation
of the problems they addressed
might have been more straightforward [PDF], but in the end it appears that
merging them wouldn't be the smartest thing from the technical point of
view. Unfortunately, though, it is difficult to explain that without
diving into the technical issues behind the suspend blockers patchset, so I
wrote a paper, Technical Background of the Android Suspend
Blockers Controversy [PDF], discussing them in a detailed way, which is
summarized in this article.
Suspend blockers, or wakelocks in the original Android
terminology, are a part of a specific approach to power management, which
is based on aggressive utilization of full system suspend to save as much
energy as reasonably possible. In this approach the natural state of the
system is a sleep
state [PDF], in which energy is only used for refreshing memory and providing
power to a few devices that can generate wakeup signals. The working
state, in which the CPUs are executing instructions and the system is
generally doing some useful work, is only entered in response to a wakeup
signal from one of the selected devices. The system stays in that state
only as long as necessary to do certain work requested by the user. When
the work has been completed, the system automatically goes back to the
sleep state.
This approach can be referred to as opportunistic suspend to
emphasize the fact that it causes the system to suspend every time there is
an opportunity to do so. To implement it effectively one has to address a
number of issues, including possible race conditions between system suspend
and wakeup events (i.e. events that cause the system to wake up from sleep
states). Namely, one of the first things done during system suspend is to
freeze user space processes (except for the suspend process itself) and
after that's been completed user space cannot react to any events signaled
by the kernel. In consequence, if a wakeup event occurs exactly at the
time the suspend process is started, user space may be frozen before it
will have a chance to consume the event, which will be delivered to it only
after the system is woken up from the sleep state as a result of
another wakeup event. Unfortunately, on a cell phone the
"deferred" wakeup event may be a very important incoming call, so
the above scenario is hardly acceptable for this type of device.
On Android this issue has been addressed with the help of wakelocks.
Essentially, a wakelock is an object that can be in one of two states,
active or inactive, and the system cannot be suspended if at least one
wakelock is active. Thus, if the kernel subsystem handling a wakeup
event activates a wakelock right after the event has been signaled and
deactivates it after the event has been passed to user space, the race
condition described in the previous paragraph can be avoided. Moreover, on
Android, the suspend process is started from kernel space whenever there are
no active wakelocks, which addresses the problem of deciding when to
suspend, and user space is allowed to manipulate wakelocks. Unfortunately, that requires every
user space process doing important work to use wakelocks, which creates
unusual and cumbersome issues for application developers to deal with.
Of course, processes using wakelocks can impact the system's battery
life quite significantly, so the ability to use them has to be regarded as
a privilege that should not be given unwittingly to all applications.
Unfortunately, however, there is no general principle the system designer
can rely on to figure out what applications will be important enough to the
system user to allow them to use wakelocks by default. Therefore,
ultimately the decision is left to the user, which, naturally, will only
really work if the user is qualified to make it.
Moreover, a user who is expected to make such a decision
should be informed precisely of its possible consequences,
and should be able to revoke a given application's use of
wakelocks at any time. On Android, though, at least up to and including
version 2.2, that simply doesn't happen.
Apart from this, some advertised features of applications don't really
work on Android because of its use of opportunistic suspend. Namely, some
applications are supposed to periodically check things on remote Internet
servers. For this purpose they need to run whenever it is time to make
their checks, but they obviously aren't running while the system is in a
sleep state, so those periodic checks simply aren't made. In fact, they
are only made when the system happens to be in the working state for some
other reason at the right moment. This most likely is not what the
users of the affected applications would have expected.
There is one more problem with full system suspend that is related to
time measurements, although it is not limited to the opportunistic suspend
initiated from kernel space. Namely, every suspend-resume cycle,
regardless of the way it is initiated, introduces inaccuracies into the
kernel's timekeeping subsystem. Usually, when the system goes into a sleep
state, the hardware that the kernel's timekeeping subsystem relies on is powered
off, so it has to be reinitialized during a subsequent system resume. Then,
among other things, the global kernel variables representing the current
time need to be readjusted to keep track of the time spent in the sleep
state. This involves reading the current time value from a persistent
clock which typically is much less accurate than the clock sources used by
the kernel in the system's working state. So that introduces a random shift
of the kernel's representation of current time, depending on the resolution
of the persistent clock, during every suspend-resume cycle. Moreover,
kernel timers used for scheduling the future execution of work inside of
the kernel also are affected by this issue in a similar way. In
consequence, the timing of some events in a suspending and resuming system
is different from their analogous timing without a suspend-resume cycle.
If system suspend is initiated by user space, the kernel may assume that
user space is ready for it and is somehow prepared to cope with the
consequences. For example, user space may want to use settimeofday() to
set the system time from a value taken from an NTP server right after the
subsequent resume (the kernel's monotonic clock cannot be set this way).
On the other hand, if
system suspend is started by the kernel in an opportunistic fashion, user
space doesn't really have a chance to do anything like that.
For this reason, one may think that it's better not to suspend the
system at all and use the cpuidle framework for the entire system
power management. This approach appears to allow some systems to be put
into a low-power state
resembling a sleep state. However, it may not guarantee that the
system will be put into that state sufficiently often, because of
applications that use busy loops to excess and because of kernel
timers. PM quality
of service (QoS)
requests [PDF] may also prevent cpuidle from using the deep
low-power states of the CPUs. Moreover, while only a few selected devices are enabled
to signal wakeup during system suspend, the runtime power management
routines that may be used by cpuidle for suspending I/O devices
tend to enable all of them to signal wakeup. Thus the system wakes up from
low-power states entered as a result of cpuidle transitions
relatively more often than from "real" sleep states, so its
ability to save energy is limited. This basically means that
cpuidle-based system power management may not be sufficient to
save as much energy as opportunistic suspend on the same system.
The alternative implementation
Even if opportunistic suspend is not going to be used on a
given system, it generally makes sense to suspend the system sometimes, for
example when its user knows in advance that it will not need to be in the
working state in the near future. However, the problem of possible races
between the suspend process and wakeup events, addressed on Android with
the help of the wakelocks framework, affects all forms of system suspend,
not only the opportunistic one. Thus this problem should be addressed in
general and it is not really convenient to simply use Android's
wakelocks for this purpose, because that would require all of user space to be
modified to use wakelocks. While that may be good for Android,
whose user space already is designed this way at least to some extent, it
wouldn't be very practical for other Linux-based systems, whose user space
is not aware of the wakelocks interface. This observation led to the kernel patch that introduced the
wakeup events framework, which was shipped in the 2.6.36 kernel.
This patch introduced a running counter of signaled wakeup events,
event_count, and a counter of wakeup events whose data is being
processed by the kernel at the moment, events_in_progress. Two
interfaces have been added to allow kernel subsystems to modify these
counters in a consistent way. pm_stay_awake() is meant to keep
the system from suspending, while pm_wakeup_event() ensures that
the system stays awake during the processing of a wakeup event.
In order to do that, pm_stay_awake()
increments events_in_progress and the complementary function
pm_relax() decrements it and increments event_count at
the same time. pm_wakeup_event() increments
events_in_progress and sets up a timer to decrement it and
increment event_count in the future.
The current value of event_count can be read from the new
sysfs file /sys/power/wakeup_count. In turn, writing to
it causes the current value of event_count to be stored in the
auxiliary variable saved_count, so that it can be compared with
event_count in the future. However, the write operation will only
succeed if the written number is already equal to event_count. If
that happens, another auxiliary variable events_check_enabled is
set, which tells the PM core to check whether event_count has
changed or events_in_progress is different from zero while
suspending the system.
This relatively simple mechanism allows the PM core to react to wakeup
events signaled during system suspend if it is asked to do so by user space
and if the kernel subsystems detecting wakeup events use either
pm_stay_awake() or pm_wakeup_event(). Still, its support
for collecting device statistics related to wakeup events is not comparable
to the one provided by the wakelocks framework. Moreover, it assumes that
wakeup events will always be associated with devices, or at least with
entities represented by device objects, which need not be the case in all
situations. The need to address these shortcomings led to a kernel patch introducing wakeup
source objects and adding some flexibility to the existing framework.
Most importantly, the new patch introduces objects of type struct
wakeup_source to represent entities that can generate wakeup events.
Those objects are created automatically for devices enabled to signal
wakeup and are used internally by pm_wakeup_event(),
pm_stay_awake(), and pm_relax(). Although the
highest-level interfaces are still designed to report wakeup events
relative to devices, which is particularly convenient to device drivers and
subsystems that generally deal with device objects, the new framework makes
it possible to use wakeup source objects directly.
A "standalone" wakeup source object is created by
wakeup_source_create() and added to the kernel's list of wakeup
sources by wakeup_source_add(). Afterward one can use three new
interfaces, __pm_wakeup_event(), __pm_stay_awake() and
__pm_relax(), to manipulate it and, when it is not necessary any
more, it may be removed from the global list of wakeup sources by calling
wakeup_source_remove(). It can then be deleted with the help of
wakeup_source_destroy(). Thus reported wakeup events need not be
associated with device objects any more. Also, at the kernel level, wakeup
source objects may be used to replace Android's wakelocks on a one-for-one
basis because the above interfaces are completely analogous to the ones
introduced by the wakelocks framework.
The infrastructure described above ought to make it easier to port
device drivers from Android to the mainline kernel. It hasn't been
designed with opportunistic suspend in mind, but in theory it may be used
for implementing a very similar power management technique. Namely, in
principle, all wakelocks in the Android kernel can be replaced with wakeup
source objects. Then, if the /sys/power/wakeup_count interface is
used correctly, the resulting kernel will be able to abort suspend in
progress in reaction to wakeup events in the same circumstances in which
the original Android kernel would do that. Yet, user space cannot access
wakeup source objects, so the part of the wakelocks framework allowing user
space to manipulate them has to be replaced with a different mechanism
implemented entirely in user space, involving a power manager process and a
suitable IPC interface for the processes that would otherwise use wakelocks
on Android.
The IPC interface in question may be implemented using three components,
a shared memory location containing a counter variable, referred to as the
"suspend counter" in what follows; a mutex; and a condition variable
associated with that mutex. Then, a process wanting to prevent the
system from suspending will acquire the mutex, increment the suspend
counter, and release the mutex. In turn, a process wanting to permit the
system to suspend will acquire the mutex and decrement the suspend counter.
If the suspend counter happens to be equal to zero at that point, the
processes waiting on the condition variable will be unblocked. The mutex
will be released afterward.
With the above IPC interface in place the power manager process can
perform the following steps in a loop:
- Read from /sys/power/wakeup_count (this will block until the
events_in_progress kernel variable is equal to zero).
- Acquire the mutex.
- Check if the suspend counter is equal to zero. If that's not the case, block
on the condition variable (which releases the mutex automatically) and,
when unblocked, repeat this step (the mutex is reacquired automatically).
- Release the mutex.
- Write the value read from /sys/power/wakeup_count in step 1 back to
this file. If the write fails, go to step 1.
- Start suspend or hibernation and go to step 1 when it returns.
Of course, this design will cause the system to be suspended very
aggressively. Although it is not entirely equivalent to Android's
opportunistic suspend, it appears to be close enough to yield the same
level of energy savings. However, it also suffers from a number of
problems affecting Android's approach. Some of them may be addressed
by adding complexity to the power manager and the IPC interface between it
and the processes permitted to block and unblock suspend, but the others
are not really avoidable. Thus it may be better to use system suspend less
aggressively, in combination with some of the other techniques described
above.
Overall, while the idea of suspending the system extremely aggressively
may be controversial, it doesn't seem reasonable to entirely dismiss
automatic suspending of the system as a valid power management measure. Many
different operating systems do that and they achieve
good battery life [PDF] with the help of it. There don't seem to be
any valid reasons why Linux-based systems shouldn't do that, especially if
they are battery-powered. As far as desktop and similar (e.g. laptop or
netbook) systems are concerned, it makes sense to configure them to suspend
automatically in specific situations so long as system suspend is known to
work reliably on the given configuration of hardware. The new interfaces
and ideas presented above may be used to this end.
The bible portrays
the road to destruction as wide, while the road to life is narrow and hard
to find. This illustration has many applications in the more temporal
sphere in which we make many of our decisions.
It is often the case that there are many ways to approach
a problem that are unproductive and comparatively few which lead to
success. So it should be no surprise that, as we have been looking for
patterns in the design of Unix and their development in both Unix and
Linux, we find fewer patterns of success than we do patterns of failure.
Our final pattern in this series continues the theme of different ways to go wrong,
and turns out to have a lot in common with the previous pattern of trying to
"fix the unfixable". However it has a crucial difference which very
much changes the way the pattern might be recognized and, so, the ways
we must be on the look-out for it. This pattern we will refer to as a
"high maintenance" design. Alternatively: "It seemed like a good idea at
the time, but was it worth the cost?".
While "unfixable" designs were soon discovered to be insufficient and
attempts were made (arguably wrongly) to fix them, "high maintenance"
designs work perfectly well and do exactly what is required.
However they do not fit seamlessly into their surroundings and, while
they may not actually leave disaster in their wake, they do impose a
high cost on other parts of the system as a whole. The effort of fixing things
is expended not on the centerpiece of the problem, but on everything that
surrounds it.
The first of two examples we will use to illuminate this pattern is the
"setuid" and "setgid" permission bits and the related
functionality. In itself, the setuid bit works quite well, allowing
non-privileged users to perform privileged operations in a carefully
controlled way.
In fact this is such a clever and original idea that the inventor,
Dennis Ritchie, was granted a patent for the invention. The patent
has since been placed in the public domain. Though ultimately pointless,
it is amusing to speculate what might have happened had the patent rights
been asserted, leading to that aspect of Unix being invented around.
Could a whole host of setuid vulnerabilities have been avoided?
The problem with this design is that programs which are running setuid
exist in two realms at once and must attempt to be both a privileged
service provider, and a tool available to users - much like the confused
deputy recently pointed out by LWN
reader "cmccabe." This creates a
number of conflicts which require special handling in various places.
The most obvious problem comes from the inherited environment. Like
any tool, the programs inherit an environment of name=value
assignments which are often used by library routines to allow fine
control of certain behaviors. This is great for tools but
potentially quite dangerous for privileged service providers as there
is a risk that the environment will change the behavior of the
library and so give away some sort of access that was not intended.
All libraries and all setuid programs need to be particularly
suspicious of anything in the environment, and often need to
explicitly ignore the environment when running setuid. The many
vulnerabilities that setuid programs have suffered through untrusted
environment variables are a perfect example of the difficulty of
guarding against this sort of problem.
An example of a more general conflict comes from the combination of
setuid with executable shell scripts. This did not apply at the time
that setuid was first invented, but once Unix gained the
#!/bin/interpreter (or "shebang") method of running scripts
it became possible for scripts to run setuid. This is almost always
insecure, though various different interpreters have made various
attempts to make it secure, such as the "-b" option to
csh and the "taint mode" in perl. Whether they
succeed or not, it is clear that the setuid mechanism has imposed a
real burden on these interpreters.
Permission checking for signal delivery is normally a fairly simple
matter of matching the UID of the sending process with the
UID of the receiving process, with special exceptions for
UID==0 (root) as the sender. However, the existence of setuid adds
a further complication. As a setuid program runs just like a regular
tool, it must respond to job-control signals and, in particular, must
stop when the controlling terminal sends it a SIGTSTP. This
requires that the owner of the controlling terminal must be able to
request that the process continues by sending SIGCONT. So
the signal delivery mechanism needs special handling for SIGCONT,
simply because of the existence of setuid.
When writing to a file, Linux (like various flavors of Unix) checks if
the file is setuid and, if so, clears the setuid flag. This is not
absolutely essential for security, but has been found to be a valuable
extra barrier to prevent exploits and is a good example of the wide
ranging intrusion of setuid.
Each of these issues can be addressed, and largely has been. However
they are issues that must be fixed not in the setuid mechanism itself,
but in surrounding code. Because of that it is quite possible for new
problems to arise as new code is developed, and only eternal vigilance
can protect us from these new problems. Either that, or removing
setuid functionality and replacing it with something different and better.
It was recently announced
that Fedora 15 would be released with a substantially reduced set of
setuid programs. Superficially this seems like it might be "removing
setuid functionality" as suggested, but a closer look shows that this
isn't the case. The plan for Fedora is to use filesystem capabilities
instead of full setuid. This isn't really a different mechanism, just
a slightly reworked form of the original. Setuid stores just one
bit per file which (together with the UID) determines the capabilities
that the program will have. In the case of setuid to root, this is
an all or nothing approach. Filesystem capabilities store more bits
per file and allow different capabilities to be individually
selected, so a program that does not need all of the capabilities of
root will not be given them.
This certainly goes some way to increasing security by decreasing the
attack surface. However it doesn't address the main problem that
the setuid programs exist in an uncertain world between being tools
and being service providers. It is unclear whether libraries which
make use of environment variables only after checking that setuid is not in
force will also correctly check that capabilities are not in force.
Only a comprehensive audit would be able to tell for sure.
Meanwhile, by placing extra capabilities in the filesystem we impose
extra requirements on filesystem implementations, on copy and backup tools, and on
tools for examining and manipulating filesystems. Thus we achieve an
uncertain increase in security at the price of imposing a further
maintenance burden on surrounding subsystems. It is not clear to this
author that forward progress is being achieved.
Our second example, completing the story of high maintenance designs, is
the idea of "hard links", known simply as links before symbolic links
were invented. In the design of the Unix filesystem, the name of a
file is an entity separate from the file itself. Each name is treated as
a link to the file, and a file can have multiple links, or even none -
though of course when the last link is removed the file will soon be
deleted.
This separation does have a certain elegance and there are certainly
uses that it can be put to with real value. However the vast majority
of files still only have one link, and there are plenty of cases where
the use of links is a tempting but ultimately sub-optimal option, and
where symbolic links or other mechanisms turn out to be much more
effective. In some ways this is reminiscent of the Unix permission
model where most of the time the subtlety it provides isn't needed,
and much of the rest of the time it isn't sufficient.
Against this uncertain value, we find that:
Archiving programs such as tar need extra complexity to look out
for hard links, and to archive the file the first time it is seen,
but not any subsequent time.
Similar care is needed in du, which calculates disk usage,
and in other programs which walk the filesystem hierarchy.
Anyone who can read a file can create a link to that file which
the owner of the file may not be able to remove. This can lead to users
having charges against their storage quota that they cannot do anything
about.
Editors need to take special care of linked files. It is generally
safer to create a new file and rename it over the original rather
than to update the file in place. When a file has multiple hard
links it is not possible to do this without breaking that linkage,
which may not always be desired.
The Linux kernel's internals have an awkward distinction between
the "dentry" which refers to the name of a file, and the "inode",
which refers to the file itself. In many cases
we find that a dentry is needed even when you would think that only
the file is being accessed. This distinction would be irrelevant
if hard links were not possible, and may well relate to the choice
made by the developers of Plan 9 to not support hard links at all.
Hard links would also make it awkward to reason about any
name-based access control approach (as discussed in part 3) as a
given file can have many names and so multiple access permissions.
While hard links are certainly a lesser evil than setuid, and there
is little motivation to rid ourselves of them, they do serve to
illustrate how a seemingly clever and useful design can have a range
of side effects which can weigh heavily against the value that the
design tries to bring.
Avoiding high maintenance designs
The concept described here as "high maintenance" is certainly not
unique to software engineering. It is simply a specific manifestation
of the so-called
law of unintended consequences
which can appear in many disciplines.
As with any consequences, determining the root cause can be a real
challenge, and finding an alternate approach which does not result in
worse consequences is even harder. There are no magical solutions on
offer by which we can avoid high maintenance designs and their
associated unintended consequences. Rather, here are three thoughts
that might go some small way to reining in the worst such designs.
- Studying history is the best way to avoid repeating it, and so
taking a broad and critical look at our past has some hope of
directing us well for the future. It is partly for this reason that
"patterns" were devised, to help encapsulate history.
- Building on known successes is likely to have fewer unintended
consequences than devising new ideas. So following the pattern that
started this series of "full exploitation" is, where possible, most
likely to yield valuable results.
- An effective way to understand the consequences of a design is
to document it thoroughly, particularly explaining how it should be
used to someone with little background knowledge. Often writing
such documentation will highlight irregularities which make it
easier to fix the design than to document all the corner cases of
it. This is certainly the experience
of Michael Kerrisk, who maintains the man pages for Linux, and,
apparently, of our Grumpy Editor, who found that fixing the cdev
interface made him less grumpy than trying to document it unchanged.
When documenting the behavior of the Unix filesystem, it is
desirable to describe it as a hierarchical structure, as that was the overall
intent. However, honesty requires us to call it a directed acyclic
graph (DAG) because that is what the presence of hard links turns it
into. It is possible that having to write DAG instead of
hierarchy several times might have been enough to raise the question of
whether hard links are such a good idea after all.
Harken to the ghosts
In his classic novella "A Christmas Carol", Charles Dickens uses
three "ghosts" to challenge Ebenezer Scrooge about his ideology and
ethics. They reminded him of his past, presented him with a clear
picture of the present, warned him about future consequences, but
ultimately left the decision of how to respond to him.
We, as designers and engineers, can similarly be challenged as we
reflect on these "Ghosts of Unix Past" that we have been exploring.
And again, the response is up to us.
It can be tempting to throw our hands up in disgust and build
something new and better. Unfortunately, mere technical excellence is
no guarantee of success. As Paul McKenney
astutely observed at the
2010 Kernel Summit,
economic opportunity is at least an equal reason for success, and is
much harder to come by. Plan 9 from Bell Labs attempted to learn from
the mistakes of Unix and build something better; many of the mistakes
explored in this series are addressed quite effectively in Plan 9.
However while Plan 9 is an important research operating system, it
does not come close to the user or developer base that Linux has,
despite all the faults of the latter. So, while starting from scratch can be
tempting, it is rare that it has a long-term successful outcome.
The alternative is to live with our mistakes and attempt to minimize
their ongoing impact, deprecating that which cannot be discarded.
The x86 CPU architecture seems to be a good example of this. Modern
64-bit processors still support the original 8086 16-bit instruction
set and addressing modes. They do this with minimal optimization and
using only a small fraction of the total transistor count. But they
continue to support it as there has been no economic opportunity to
break with the past. Similarly Linux must live with its past mistakes.
Our hope for the future is to avoid making the same sort of mistakes
again, and to create such compelling new designs that the mistakes,
while still being supported, can go largely unnoticed.
It is to this end that it is important to study our past mistakes,
collect them into patterns, and be always alert against the repetition
of these patterns, or at least to learn how best to respond when the
patterns inevitably recur.
So, to conclude, we have a succinct restatement of the patterns
discovered on this journey, certainly not a complete set of patterns
to be alert for, but a useful collection nonetheless.
Firstly there was "Full exploitation": a pattern hinted at in that
early paper on Unix and which continues to provide strength today. It
involves taking one idea and applying it again and again to diverse
aspects of a system to bring unity and cohesiveness. As we saw with
signal handlers, not all designs benefit from full exploitation, but
those that do can bring significant value. It is usually best to try
to further exploit an existing design before creating something new.
"Conflated" designs happen when two related but distinct ideas are
combined in a way that they cannot easily be separated. It can often
be appropriate to combine related functionality, whether for
convenience or efficiency, but it is rarely appropriate to tie aspects
of functionality together in such a way that they cannot be separated.
This is an error which can be recognized as the design is being
created, though a bit of perspective often makes it a lot clearer.
"Unfixable" designs are particularly hard to recognize until the
investment of time in them makes replacing them unpalatable. They are
not clearly seen until repeated attempts to fix the original have
resulted in repeated failures to produce something good. Their inertia
can further be exacerbated by a stubbornness to "fix it if it kills
me", or an aversion to replacement because "it is better the devil you
know". It can take substantial maturity to know when it is time to
learn from past mistakes, give up on failure, and build something new
and better. The earlier we can make that determination, the easier it
will be in the long run.
Finally "high maintenance" designs can be the hardest for early
detection as the costs are usually someone else's problem. To
some extent these are the antithesis of "fully exploitable" designs as,
rather than serving as a unifying force to bring multiple aspects of a
system together, they serve as an irritant which keeps other parts
unsettled yet doesn't even produce a pearl. Possibly the best way to
avoid high maintenance designs is to place more emphasis on full
exploitation and to be very wary of including anything new and different.
If identifying, describing, and naming these patterns makes it easier
to detect defective designs early and serves to guide and encourage
effective design then they will certainly have fulfilled their purpose.
Exercises for the interested reader
- Identify a design element in the IP protocol suite which could be
described as "high maintenance" or as having "unintended consequences".
- Choose a recent extension to Linux and write some comprehensive
documentation, complete with justification and examples. See if that
suggests any possible improvements in the design which would simplify
the documentation.
- Research and enumerate uses of "hard links" which are not
adequately served by using symbolic links instead. Suggest
technologies that might effectively replace these other uses.
- Describe your "favorite" failings in Unix or Linux and describe a
pattern which would help with early detection and correction of
such failings.
Comments (180 posted)
Page editor: Jonathan Corbet