
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.37-rc3, released on November 21. Linus said:

And I have to say, I'm pretty happy with how quiet things have been. Of course, maybe people are just lying low, waiting to ambush me next week with a flood of patches when I'm gone in Japan, all in order to try to be inconvenient. Because that's the kind of people kernel developers are.

One notable change is that the attempt to make /proc/kallsyms unreadable by default has been reverted because it broke an older distribution (Ubuntu Jaunty).

The short-form changelog is in the announcement; see the full changelog for all the details.

Stable updates: the 2.6.27.56, 2.6.32.26, 2.6.35.9, and 2.6.36.1 updates were released on November 22; each contains a long list of important fixes. Note that 2.6.35.9 is the last update for the 2.6.35 series.

Comments (none posted)

Quotes of the week

Now, I do understand that everybody idolizes us software people. Yes, we really are better, smarter, and more good-looking than hardware engineers. Life is not fair, and the adoration of the masses can be unbearable at times. When I go to the mall, I'm covered in womens underwear in minutes - it's just embarrassing.

So yes, we're the Tom Jones of the engineering world.

So I can see how architecture designers could get some complexes. I understand. But even if you're a total failure in life, and you got your degree in EE rather than CompSci, stand up for yourself, man!

Repeat after me: "Yes, I too can make a difference! I'm not just a useless lump of meat! I can design hardware that is wondrous and that I don't need to be ashamed of! I can help those beautiful software people run their code better! My life has meaning!"

Doesn't that feel good? Now, look down at your keyboard, and look back at me. Look down. Look back. You may never be as beautiful and smart as a software engineer, but with Old Spice, you can at least smell like one.

Hardware and software should work together. And that does not mean that hardware should just lay there like a dead fish, while software does all the work. It should be actively participating in the action, getting all excited about its own body and about its own capabilities.

-- Linus Torvalds (thanks to George Spelvin)

Operating systems written by normal people rarely end up with desirable performance characteristics.
-- Matthew Garrett

Like I know the goal here is to create the perfect kernel for hardware Linus owns, but I'd like to be able to fix bugs urgently on hardware users have that aren't so privileged, think of it as some sort of outreach program.
-- Dave Airlie

Comments (5 posted)

The big chunk memory allocator

By Jonathan Corbet
November 24, 2010
Device drivers - especially those dealing with low-end hardware - sometimes need to allocate large, physically-contiguous memory buffers. As the system runs and memory fragments, those allocations are increasingly likely to fail. That has led to a lot of schemes based around techniques like setting aside memory at boot time; the contiguous memory allocator (CMA) patch set covered here in July is one example. There is an alternative approach out there, though, in the form of Hiroyuki Kamezawa's big chunk memory allocator.

The big chunk allocator provides a new allocation function for large contiguous chunks:

    struct page *alloc_contig_pages(unsigned long base, unsigned long end,
				    unsigned long nr_pages, int align_order);

Unlike CMA, the big chunk allocator does not rely on setting aside memory at boot time. Instead, it will attempt to organize a suitable chunk of memory at allocation time by moving other pages around. Over time, the kernel's memory compaction and page migration mechanisms have improved and memory sizes have grown, so it is now more feasible that this kind of large allocation will succeed than it once was.

There are some advantages to the big chunk approach. Since it does not require that memory be set aside, there is no impact on the system when there is no need for large buffers. There is also more runtime flexibility and no need for the system administrator to figure out how much memory to reserve at boot time. The downsides are that memory allocation becomes more expensive and the chances of failure are higher. Which approach will work better in practice is entirely unknown; answering that question will require significant testing by the people who need the large allocation capability.

Comments (none posted)

Kernel development news

A collection of tracing topics

By Jonathan Corbet
November 23, 2010
For a long time, tracing was seen as one of the weaker points of the Linux system. Things have changed dramatically over the last few years, to the point that Linux has a number of interesting tracing interfaces. The job is far from done, though, and there is not always agreement on how this work should proceed. There have been a number of conversations related to tracing recently; this article will survey some of them in an attempt to highlight where the remaining challenges lie.

The tracing ABI

Once upon a time, Linux had no tracing-oriented interfaces at all. Now, instead, we have two: ftrace and perf events. Some types of information are only available via the ftrace interface, others are only available from perf, and some sorts of events can be obtained in either way. From the discussions that have been happening for some time it's clear that neither interface satisfies everybody's needs. In addition, there are other subsystems waiting in the wings - LTTng and a recently proposed system health subsystem, for example - which bring requirements of their own. The last thing that the system needs is an even wider variety of tracing interfaces; it would be nice, instead, to pull everything together into a single, unified interface.

Almost everybody involved agrees on that point, but that is about where the agreement stops. Your editor, unfortunately, missed the tempestuous session at the Linux Plumbers Conference where a number of tracing developers came to an agreement of sorts: a new ABI would be developed with the explicit goal of being a unified tracing and event interface for the system as a whole. This ABI would be kept out of the mainline until a number of tools had been written to use it; only when it became clear that everybody's needs are met would it be merged. Your editor talked to a number of the people involved in that discussion; all seemed pleased with the outcome.

Ftrace developer Steven Rostedt interpreted the discussion as a mandate to develop an entirely new ABI for tracing purposes:

I think if we take a step back, we can come up with a new buffering/ABI system that can satisfy everyone. We will still support the current method now, but I really don't think it is designed with everything we had in mind. I do not envision that we can "evolve" to where we want to be. We may have to bite the bullet, just like iptables did when they saw the failures of ipchains, and redesign something new now that we understand what the requirements are.

LTTng developer Mathieu Desnoyers took things even further, posting a "tracing ABI work plan" for discussion. That posting was poorly received, being seen as a document better suited to managerial conference rooms - a perception which was not helped by Mathieu's subsequent posting of a massive common trace format document which would make a standards committee proud. Kernel developers, as always, would rather see code than extensive design documents.

When the code comes, though, it seems that there will be resistance to the idea of creating an entirely new tracing ABI. Thomas Gleixner has expressed his dislike for the current state of affairs and attempts to create complex replacements; he is calling for a gradual move toward a better interface. Ingo Molnar has said similar things:

Fact is that we have an ABI, happy users, happy tools and happy developers, so going incrementally is important and allows us to validate and measure every step while still having a full tool-space in place - and it will help everyone, in addition to the ftrace/lttng usecases.

We'll need to embark on this incremental path instead of a rewrite-the-world thing. As a maintainer my task is to say 'no' to rewrite-the-world approaches - and we can and will do better here.

The existing ABI that Ingo likes, of course, is the perf interface. He would clearly like to see all tracing and event reporting move to the perf side of the house. The perf ABI, he says, is sufficiently extendable to accommodate everybody's needs; there does not seem to be a lot of room for negotiation on this point.

Stable tracepoints

One of the conclusions reached at the 2010 Kernel Summit was that a small set of system tracepoints would be designated "stable" and moved to a separate location in the filesystem hierarchy. Tools using these tracepoints would have a high level of assurance that things would not change in future kernel releases; meanwhile, kernel developers could feel free to add and use tracepoints elsewhere without worrying that they could end up maintaining them forever. It seemed like an outcome that everybody could live with.

Steven recently posted an implementation of stable tracepoints based on that decision. His patch adds another tricky macro (STABLE_EVENT()) which creates a stable tracepoint; all such tracepoints are essentially a second, restricted view of an existing "raw" tracepoint. That allows development-oriented tracepoints to provide more information than is deemed suitable for a stable interface and does not require cluttering the code with multiple tracepoint invocations. There is also a new "eventfs" filesystem to host stable tracepoints which is expected to be mounted on /sys/kernel/events. A small number of core tracepoints have been marked as stable - just enough to show how it's done.

There were a number of complaints about eventfs, not least of which was Greg Kroah-Hartman's gripe that he had already written tracefs for just this purpose. Ingo had a different complaint, though: he is pushing an effort to distribute tracepoints throughout the sysfs hierarchy. The current /sys/kernel/debug/tracing/events directory would not go away (there are tools which depend on it), but future users of, say, ext4-related tracepoints would be expected to look for them in /sys/fs/ext4. It is an interesting idea which possibly makes good sense, but it is somewhat orthogonal to Steven's stable tracepoint posting; it doesn't address the stable/development distinction at all.

It eventually became clear that Ingo is opposed to the concept of marking some tracepoints as stable. He is, instead, taking the position that anything which is used by tools becomes part of the ABI, and that an excess of tools using too many tracepoints is a problem we wish we had. This opposition, needless to say, could make it hard to get the stable tracepoint concept into the kernel.

Here we see one of the hazards of skipping important developer meetings. The stable tracepoint discussion was expected to be one of the more contentious sessions at the kernel summit; in the end, though, everybody present seemed happy with the conclusion that was reached. But Ingo was not present. His point of view was not heard there, and the community believes it has reached consensus on something he apparently disagrees with. If Ingo succeeds in overriding that consensus, then Steven might not be the only person to express thoughts like:

Hmm, seems that every decision that we came to agreement with at Kernel Summit has been declined in practice. Makes me think that Kernel Summit is pointless, and was a waste of my time.

That conversation has quieted for now, but it will almost certainly return. If nothing else, some developers are determined to change tracepoints when the need arises, so this issue can be expected to come up again at some point. One possible source of conflict is the recently-announced trace utility which, according to Ingo, has "no conceptual restrictions" and will use tracepoints without regard for any sort of "stable" designation.

trace_printk()

One useful but little-used tracing-related tool is trace_printk(). It can be called like printk() (though without a logging level), but its output does not go to the system log; instead, everything printed via this path goes into the tracing stream as seen by ftrace. When tracing is off, trace_printk() calls have no effect. When tracing is enabled, trace_printk() data can be made available to a developer with far less overhead than normal printk() output. That overhead can matter - the slowdown caused by printk() calls is often enough to change timing-related behavior, leading to "heisenbugs" which are difficult to track down.

Output from trace_printk() does not look like a normal kernel event, though, so it is not available to the perf interface. Steven has posted a patch to rectify that, at the cost of potentially creating large numbers of new trace events. With this patch, every trace_printk() call will create a new event under ...events/printk/ based on the file name and line number. So, to use Steven's example, a trace_printk() on line 2180 in kernel/sched.c would show up in the events hierarchy as .../events/printk/kernel/sched.c/2180. Each call could then be enabled and disabled independently, just like ordinary tracepoints. It's a convenient and understandable interface, but, if use of trace_printk() ever takes off, it could lead to the creation of large numbers of events.

That idea drew a grumble from Peter Zijlstra, who said that it would be painful to use in perf. One of the reasons for that has to do with how the perf API works: every event must be opened separately with a perf_event_open() call and managed as a separate file descriptor. If the number of events gets large, so does the number of open files which must be juggled.

A potential solution also came from Peter, in the form of a new "tracepoint collection" event for perf. This special event will, when opened, collect no data at all, but it supports an ioctl() call allowing tracepoints to be added to it. All tracepoints associated with the collection event will report through the same file descriptor, allowing tools to deal with multiple tracepoints in a single stream of data. Peter says that the patch "is lightly tested and wants some serious testing/review before merging," but we may see this ABI addition become ready in time for 2.6.38.

Unprivileged tracepoints

Finally: access to tracepoints is currently limited to privileged users. Tracepoints provide a great deal of information about what is going on inside the kernel, so allowing anybody to watch them does not seem secure. There is a desire, though, to make some tracepoints generally available so that tools like trace can work in a non-privileged mode. Frederic Weisbecker has posted a patch which makes that possible.

Frederic's patch adds an optional TRACE_EVENT_FLAGS() declaration for tracepoints; currently, the only defined flag is TRACE_EVENT_FL_CAP_ANY, which grants access to unprivileged users. This flag has been applied to the system call tracepoints, allowing anybody to trace system calls - at least, when tracing is focused on a process they own.

In conclusion...

An obvious conclusion from all of the above is that there are still a lot of problems to be solved in the tracing area. The nature of the task is shifting, though. We now have significant tracing capabilities in place, and the developers involved have learned a lot about how the problem should (and should not) be solved. So we're no longer in the position of wondering how tracing can be done at all, and there no longer seems to be any trouble selling the concept of kernel visibility to developers. What needs to be done now is to develop the existing capability into something which is truly useful for the development community and beyond; that looks like a task which will keep developers busy for some time.

Comments (1 posted)

An alternative to suspend blockers

November 24, 2010

This article was contributed by Rafael J. Wysocki

If you have been following Linux kernel development over the past few months, it has been hard to overlook the massive thread on the Linux Kernel Mailing List (LKML) resulting from an attempt to merge Google's Android suspend blockers framework into the main kernel tree. Arguably, the presentation of the patches might have been better and the explanation of the problems they addressed might have been more straightforward [PDF], but in the end it appears that merging them wouldn't be the smartest thing from the technical point of view. Unfortunately, though, it is difficult to explain that without diving into the technical issues behind the suspend blockers patchset, so I wrote a paper, Technical Background of the Android Suspend Blockers Controversy [PDF], discussing them in a detailed way, which is summarized in this article.

Suspend blockers, or wakelocks in the original Android terminology, are a part of a specific approach to power management, which is based on aggressive utilization of full system suspend to save as much energy as reasonably possible. In this approach the natural state of the system is a sleep state [PDF], in which energy is only used for refreshing memory and providing power to a few devices that can generate wakeup signals. The working state, in which the CPUs are executing instructions and the system is generally doing some useful work, is only entered in response to a wakeup signal from one of the selected devices. The system stays in that state only as long as necessary to do certain work requested by the user. When the work has been completed, the system automatically goes back to the sleep state.

This approach can be referred to as opportunistic suspend to emphasize the fact that it causes the system to suspend every time there is an opportunity to do so. To implement it effectively one has to address a number of issues, including possible race conditions between system suspend and wakeup events (i.e. events that cause the system to wake up from sleep states). Namely, one of the first things done during system suspend is to freeze user space processes (except for the suspend process itself) and after that's been completed user space cannot react to any events signaled by the kernel. In consequence, if a wakeup event occurs exactly at the time the suspend process is started, user space may be frozen before it has a chance to consume the event; the event will then only be delivered after another wakeup event brings the system out of the sleep state. Unfortunately, on a cell phone the "deferred" wakeup event may be a very important incoming call, so the above scenario is hardly acceptable for this type of device.

Wakelocks

On Android this issue has been addressed with the help of wakelocks. Essentially, a wakelock is an object that can be in one of two states, active or inactive, and the system cannot be suspended if at least one wakelock is active. Thus, if the kernel subsystem handling a wakeup event activates a wakelock right after the event has been signaled and deactivates it after the event has been passed to user space, the race condition described in the previous paragraph can be avoided. Moreover, on Android, the suspend process is started from kernel space whenever there are no active wakelocks, which addresses the problem of deciding when to suspend, and user space is allowed to manipulate wakelocks. Unfortunately, that requires every user space process doing important work to use wakelocks, which creates unusual and cumbersome issues for application developers to deal with.

Of course, processes using wakelocks can impact the system's battery life quite significantly, so the ability to use them has to be regarded as a privilege that should not be given unwittingly to all applications. Unfortunately, there is no general principle a system designer can rely on to figure out which applications will be important enough to the user to deserve wakelock access by default. Ultimately the decision is left to the user which, naturally, only works if the user is qualified to make it. A user expected to make such a decision should also be informed of its exact consequences, and should be able to withdraw the use of wakelocks from chosen applications at any time. On Android, though, at least up to and including version 2.2, that simply doesn't happen.

Apart from this, some advertised features of applications don't really work on Android because of its use of opportunistic suspend. Some applications are supposed to periodically check things on remote Internet servers; to do so they need to run when it is time to make their checks, but they obviously aren't running while the system is in a sleep state, so those periodic checks simply aren't made then. In fact, they are only made when the system happens to be in the working state for some other reason at the right time. This most likely is not what the users of the affected applications would expect.

Timekeeping issues

There is one more problem with full system suspend that is related to time measurements, although it is not limited to the opportunistic suspend initiated from kernel space. Namely, every suspend-resume cycle, regardless of the way it is initiated, introduces inaccuracies into the kernel's timekeeping subsystem. Usually, when the system goes into a sleep state, the hardware that the kernel's timekeeping subsystem relies on is powered off, so it has to be reinitialized during a subsequent system resume. Then, among other things, the global kernel variables representing the current time need to be readjusted to keep track of the time spent in the sleep state. This involves reading the current time value from a persistent clock which typically is much less accurate than the clock sources used by the kernel in the system's working state. So that introduces a random shift of the kernel's representation of current time, depending on the resolution of the persistent clock, during every suspend-resume cycle. Moreover, kernel timers used for scheduling the future execution of work inside of the kernel also are affected by this issue in a similar way. In consequence, the timing of some events in a suspending and resuming system is different from their analogous timing without a suspend-resume cycle.

If system suspend is initiated by user space, the kernel may assume that user space is ready for it and is somehow prepared to cope with the consequences. For example, user space may use settimeofday() to reset the system clock from a time value taken from an NTP server right after the subsequent resume. On the other hand, if system suspend is started by the kernel in an opportunistic fashion, user space doesn't really have a chance to do anything like that.

For this reason, one may think that it's better not to suspend the system at all and use the cpuidle framework for the entire system power management. This approach appears to allow some systems to be put into a low-power state resembling a sleep state. However, it may not guarantee that the system will be put into that state sufficiently often, because of applications using busy loops to excess and because of kernel timers. PM quality of service (QoS) requests [PDF] may also prevent cpuidle from using deep low-power states of the CPUs. Moreover, while only a few selected devices are enabled to signal wakeup during system suspend, the runtime power management routines that may be used by cpuidle for suspending I/O devices tend to enable all of them to signal wakeup. Thus the system wakes up from low-power states entered as a result of cpuidle transitions relatively more often than from "real" sleep states, so its ability to save energy is limited. This basically means that cpuidle-based system power management may not be sufficient to save as much energy as opportunistic suspend on the same system.

The alternative implementation

Even if opportunistic suspend is not going to be used on a given system, it generally makes sense to suspend the system sometimes, for example when its user knows in advance that it will not need to be in the working state in the near future. However, the problem of possible races between the suspend process and wakeup events, addressed on Android with the help of the wakelocks framework, affects all forms of system suspend, not only the opportunistic one. Thus this problem should be addressed in general, and it is not really convenient to simply use Android's wakelocks for this purpose, because that would require all of user space to be modified to use wakelocks. While that may be fine for Android, whose user space already is designed this way at least to some extent, it wouldn't be very practical for other Linux-based systems, whose user space is not aware of the wakelocks interface. This observation led to the kernel patch that introduced the wakeup events framework, which was shipped in the 2.6.36 kernel.

This patch introduced a running counter of signaled wakeup events, event_count, and a counter of wakeup events whose data is being processed by the kernel at the moment, events_in_progress. Two interfaces have been added to allow kernel subsystems to modify these counters in a consistent way. pm_stay_awake() is meant to keep the system from suspending, while pm_wakeup_event() ensures that the system stays awake during the processing of a wakeup event.

In order to do that, pm_stay_awake() increments events_in_progress and the complementary function pm_relax() decrements it and increments event_count at the same time. pm_wakeup_event() increments events_in_progress and sets up a timer to decrement it and increment event_count in the future.

The current value of event_count can be read from the new sysfs file /sys/power/wakeup_count. In turn, writing to it causes the current value of event_count to be stored in the auxiliary variable saved_count, so that it can be compared with event_count in the future. However, the write operation will only succeed if the written number matches the current value of event_count. If that happens, another auxiliary variable events_check_enabled is set, which tells the PM core to check whether event_count has changed or events_in_progress is different from zero while suspending the system.

This relatively simple mechanism allows the PM core to react to wakeup events signaled during system suspend if it is asked to do so by user space and if the kernel subsystems detecting wakeup events use either pm_stay_awake() or pm_wakeup_event(). Still, its support for collecting device statistics related to wakeup events is not comparable to the one provided by the wakelocks framework. Moreover, it assumes that wakeup events will always be associated with devices, or at least with entities represented by device objects, which need not be the case in all situations. The need to address these shortcomings led to a kernel patch introducing wakeup source objects and adding some flexibility to the existing framework.

Most importantly, the new patch introduces objects of type struct wakeup_source to represent entities that can generate wakeup events. Those objects are created automatically for devices enabled to signal wakeup and are used internally by pm_wakeup_event(), pm_stay_awake(), and pm_relax(). Although the highest-level interfaces are still designed to report wakeup events relative to devices, which is particularly convenient to device drivers and subsystems that generally deal with device objects, the new framework makes it possible to use wakeup source objects directly.

A "standalone" wakeup source object is created by wakeup_source_create() and added to the kernel's list of wakeup sources by wakeup_source_add(). Afterward one can use three new interfaces, __pm_wakeup_event(), __pm_stay_awake() and __pm_relax(), to manipulate it and, when it is no longer necessary, it may be removed from the global list of wakeup sources by calling wakeup_source_remove(). It can then be deleted with the help of wakeup_source_destroy(). Thus reported wakeup events need not be associated with device objects any more. Also, at the kernel level, wakeup source objects may be used to replace Android's wakelocks on a one-for-one basis because the above interfaces are completely analogous to the ones introduced by the wakelocks framework.

The infrastructure described above ought to make it easier to port device drivers from Android to the mainline kernel. It hasn't been designed with opportunistic suspend in mind, but in theory it may be used for implementing a very similar power management technique. Namely, in principle, all wakelocks in the Android kernel can be replaced with wakeup source objects. Then, if the /sys/power/wakeup_count interface is used correctly, the resulting kernel will be able to abort a suspend in progress in reaction to wakeup events in the same circumstances in which the original Android kernel would do that. Yet, user space cannot access wakeup source objects, so the part of the wakelocks framework allowing user space to manipulate them has to be replaced with a different mechanism implemented entirely in user space, involving a power manager process and a suitable IPC interface for the processes that would use wakelocks on Android.

The IPC interface in question may be implemented using three components: a shared memory location containing a counter variable, referred to as the "suspend counter" in what follows; a mutex; and a condition variable associated with that mutex. A process wanting to prevent the system from suspending will acquire the mutex, increment the suspend counter, and release the mutex. In turn, a process wanting to permit the system to suspend will acquire the mutex and decrement the suspend counter; if the counter reaches zero at that point, the processes waiting on the condition variable will be unblocked. The mutex is released afterward.

With the above IPC interface in place the power manager process can perform the following steps in a loop:

  1. Read from /sys/power/wakeup_count (this will block until the events_in_progress kernel variable is equal to zero).
  2. Acquire the mutex.
  3. Check if the suspend counter is equal to zero. If that's not the case, block on the condition variable (which releases the mutex automatically) and go to step 2 when unblocked.
  4. Release the mutex.
  5. Write the value read from /sys/power/wakeup_count in step 1 back to this file. If the write fails, go to step 1.
  6. Start suspend or hibernation and go to step 1 when it returns.

Of course, this design will cause the system to be suspended very aggressively. Although it is not entirely equivalent to Android's opportunistic suspend, it appears to be close enough to yield the same level of energy savings. However, it also suffers from a number of the problems affecting Android's approach. Some of them may be addressed by adding complexity to the power manager and to the IPC interface between it and the processes permitted to block and unblock suspend, but others are not really avoidable. Thus it may be better to use system suspend less aggressively, in combination with some of the other techniques described above.

Overall, while the idea of suspending the system extremely aggressively may be controversial, it doesn't seem reasonable to dismiss automatic suspend entirely as a valid power management measure. Many other operating systems do it and achieve good battery life [PDF] with its help. There don't seem to be any valid reasons why Linux-based systems shouldn't do the same, especially if they are battery-powered. As far as desktop and similar (e.g. laptop or netbook) systems are concerned, it makes sense to configure them to suspend automatically in specific situations, so long as system suspend is known to work reliably on the given hardware configuration. The new interfaces and ideas presented above may be used to this end.

Comments (5 posted)

Ghosts of Unix past, part 4: High-maintenance designs

November 23, 2010

This article was contributed by Neil Brown

The Bible portrays the road to destruction as wide, while the road to life is narrow and hard to find. This illustration has many applications in the more temporal sphere in which we make many of our decisions. It is often the case that there are many ways to approach a problem that are unproductive and comparatively few which lead to success. So it should be no surprise that, as we have been looking for patterns in the design of Unix and their development in both Unix and Linux, we find fewer patterns of success than we do of failure.

Our final pattern in this series continues the theme of different ways to go wrong, and turns out to have a lot in common with the previous pattern of trying to "fix the unfixable". However it has a crucial difference which very much changes the way the pattern might be recognized and, so, the ways we must be on the look-out for it. This pattern we will refer to as a "high maintenance" design. Alternatively: "It seemed like a good idea at the time, but was it worth the cost?".

While "unfixable" designs were soon discovered to be insufficient and attempts were made (arguably wrongly) to fix them, "high maintenance" designs work perfectly well and do exactly what is required. However they do not fit seamlessly into their surroundings and, while they may not actually leave disaster in their wake, they do impose a high cost on other parts of the system as a whole. The effort of fixing things is expended not on the center-piece of the problem, but on all that surrounds it.

Setuid

The first of two examples we will use to illuminate this pattern is the "setuid" and "setgid" permission bits and the related functionality. In itself, the setuid bit works quite well, allowing non-privileged users to perform privileged operations in a very controlled way. In fact this is such a clever and original idea that the inventor, Dennis Ritchie, was granted a patent for it; the patent has since been placed in the public domain. Though ultimately pointless, it is amusing to speculate what might have happened had the patent rights been asserted and that aspect of Unix been invented around. Could a whole host of setuid vulnerabilities have been avoided?

The problem with this design is that programs which are running setuid exist in two realms at once and must attempt to be both a privileged service provider and a tool available to users - much like the confused deputy recently pointed out by LWN reader "cmccabe." This creates a number of conflicts which require special handling in various different places.

The most obvious problem comes from the inherited environment. Like any tool, the programs inherit an environment of name=value assignments which are often used by library routines to allow fine control of certain behaviors. This is great for tools but potentially quite dangerous for privileged service providers as there is a risk that the environment will change the behavior of the library and so give away some sort of access that was not intended. All libraries and all setuid programs need to be particularly suspicious of anything in the environment, and often need to explicitly ignore the environment when running setuid. The recent glibc vulnerabilities are a perfect example of the difficulty of guarding against this sort of problem.
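The kind of filtering this forces on setuid programs (and on the C library acting on their behalf, as glibc's secure-execution mode does at startup) can be sketched as follows. The variable list is illustrative, not glibc's actual one, and `scrubbed_env` is a hypothetical helper.

```python
# A handful of variables that let a caller redirect the dynamic loader or
# the shell; the real lists maintained by libc are considerably longer.
DANGEROUS = ("LD_PRELOAD", "LD_LIBRARY_PATH", "IFS")

def scrubbed_env(environ, ruid, euid):
    """Drop loader-controlling variables when the real and effective UIDs
    differ - i.e. when the program is running setuid and the environment
    was supplied by a less-privileged caller."""
    if ruid == euid:
        # Not running setuid: the environment belongs to the same user,
        # so there is nothing to defend against.
        return dict(environ)
    return {k: v for k, v in environ.items() if k not in DANGEROUS}
```

A setuid program would apply something like this before any library routine consults the environment; glibc's `secure_getenv()` takes the complementary approach of simply returning nothing in that situation.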

An example of a more general conflict comes from the combination of setuid with executable shell scripts. This did not apply at the time that setuid was first invented, but once Unix gained the #!/bin/interpreter (or "shebang") method of running scripts it became possible for scripts to run setuid. This is almost always insecure, though various different interpreters have made various attempts to make it secure, such as the "-b" option to csh and the "taint mode" in perl. Whether they succeed or not, it is clear that the setuid mechanism has imposed a real burden on these interpreters.

Permission checking for signal delivery is normally a fairly straightforward matching of the UID of the sending process with the UID of the receiving process, with special exceptions for UID==0 (root) as the sender. However, the existence of setuid adds a further complication. As a setuid program runs just like a regular tool, it must respond to job-control signals and, in particular, must stop when the controlling terminal sends it a SIGTSTP. This requires that the owner of the controlling terminal be able to request that the process continue by sending SIGCONT. So the signal delivery mechanism needs special handling for SIGCONT, simply because of the existence of setuid.
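A much-simplified model of that check might look like this. `may_signal` is a hypothetical function, and the real kernel logic (check_kill_permission()) also consults capabilities and further UID combinations; the point here is only the SIGCONT carve-out.

```python
import signal

def may_signal(sig, sender_euid, target_ruid, target_suid, same_session):
    """Simplified sketch of signal-delivery permission checking."""
    if sender_euid == 0:
        return True                     # root may signal anyone
    if sender_euid in (target_ruid, target_suid):
        return True                     # ordinary UID match
    # The setuid-driven exception: SIGCONT may cross UID boundaries within
    # a session, so job control keeps working for stopped setuid programs.
    if sig == signal.SIGCONT and same_session:
        return True
    return False
```

Without the SIGCONT clause, a user who stopped a setuid-root program from their own terminal would be unable to resume it.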

When writing to a file, Linux (like various flavors of Unix) checks if the file is setuid and, if so, clears the setuid flag. This is not absolutely essential for security, but has been found to be a valuable extra barrier to prevent exploits and is a good example of the wide ranging intrusion of setuid.

Each of these issues can be addressed and largely have been. However they are issues that must be fixed not in the setuid mechanism itself, but in surrounding code. Because of that it is quite possible for new problems to arise as new code is developed, and only eternal vigilance can protect us from these new problems. Either that, or removing setuid functionality and replacing it with something different and less intrusive.

It was recently announced that Fedora 15 would be released with a substantially reduced set of setuid programs. Superficially this seems like it might be "removing setuid functionality" as suggested, but a closer look shows that this isn't the case. The plan for Fedora is to use filesystem capabilities instead of full setuid. This isn't really a different mechanism, just a slightly reworked form of the original. Setuid stores just one bit per file which (together with the UID) determines the capabilities that the program will have. In the case of setuid to root, this is an all or nothing approach. Filesystem capabilities store more bits per file and allow different capabilities to be individually selected, so a program that does not need all of the capabilities of root will not be given them.

This certainly goes some way to increasing security by decreasing the attack surface. However it doesn't address the main problem: that setuid programs exist in an uncertain world between being tools and being service providers. It is unclear whether libraries which make use of environment variables after checking that setuid is not in force will also correctly check that capabilities are not in force. Only a comprehensive audit would be able to tell for sure.

Meanwhile, by placing extra capabilities in the filesystem we impose extra requirements on filesystem implementations, on copy and backup tools, and on tools for examining and manipulating filesystems. Thus we achieve an uncertain increase in security at the price of imposing a further maintenance burden on surrounding subsystems. It is not clear to this author that forward progress is being achieved.

Filesystem links

Our second example, completing the story of high maintenance designs, is the idea of "hard links", known simply as links before symbolic links were invented. In the design of the Unix filesystem, the name of a file is an entity separate from the file itself. Each name is treated as a link to the file, and a file can have multiple links, or even none - though of course when the last link is removed the file will soon be deleted.

This separation does have a certain elegance and there are certainly uses that it can be put to with real value. However the vast majority of files still only have one link, and there are plenty of cases where the use of links is a tempting but ultimately sub-optimal option, and where symbolic links or other mechanisms turn out to be much more effective. In some ways this is reminiscent of the Unix permission model where most of the time the subtlety it provides isn't needed, and much of the rest of the time it isn't sufficient.
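The separation of name and file is easy to demonstrate; here is a minimal sketch using Python's os module (the directory and file names are temporary and arbitrary):

```python
import os
import tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "a")
b = os.path.join(d, "b")

with open(a, "w") as f:
    f.write("payload")

os.link(a, b)                  # a second name for the same file
print(os.stat(a).st_nlink)     # 2: both names are links to one inode

os.unlink(a)                   # remove the original name...
print(open(b).read())          # ...but the file survives via "b"
```

Only when the last link is removed (and the file is no longer open anywhere) is the storage actually freed.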

Against this uncertain value, we find that:

  • Archiving programs such as tar need extra complexity to look out for hard links, and to archive the file the first time it is seen, but not any subsequent time.

  • Similar care is needed in du, which calculates disk usage, and in other programs which walk the filesystem hierarchy.

  • Anyone who can read a file can create a link to that file which the owner of the file may not be able to remove. This can lead to users having charges against their storage quota that they cannot do anything about.

  • Editors need to take special care of linked files. It is generally safer to create a new file and rename it over the original rather than to update the file in place. When a file has multiple hard links it is not possible to do this without breaking that linkage, which may not always be desired.

  • The Linux kernel's internals have an awkward distinction between the "dentry" which refers to the name of a file, and the "inode", which refers to the file itself. In many cases we find that a dentry is needed even when you would think that only the file is being accessed. This distinction would be irrelevant if hard links were not possible, and may well relate to the choice made by the developers of Plan 9 to not support hard links at all.

  • Hard links would also make it awkward to reason about any name-based access control approach (as discussed in part 3) as a given file can have many names and so multiple access permissions.
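The bookkeeping that tar, du, and other tree-walkers are forced into reduces to remembering (device, inode) pairs. A minimal sketch, with `unique_file_sizes` as a hypothetical helper:

```python
import os

def unique_file_sizes(paths):
    """Sum file sizes the way du must: count each (device, inode) pair
    once, however many hard links to it appear in the walk."""
    seen = set()
    total = 0
    for p in paths:
        st = os.stat(p)
        key = (st.st_dev, st.st_ino)   # identifies the file, not the name
        if key in seen:
            continue                   # a hard link to an already-counted file
        seen.add(key)
        total += st.st_size
    return total
```

If hard links did not exist, each name would denote a distinct file and none of this state would be needed.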

While hard links are certainly a lesser evil than setuid, and there is little motivation to rid ourselves of them, they do serve to illustrate how a seemingly clever and useful design can have a range of side effects which can weigh heavily against the value that the design tries to bring.

Avoiding high maintenance designs

The concept described here as "high maintenance" is certainly not unique to software engineering. It is simply a specific manifestation of the so-called law of unintended consequences which can appear in many disciplines.

As with any consequences, determining the root cause can be a real challenge, and finding an alternate approach which does not result in worse consequences is even harder. There are no magical solutions on offer by which we can avoid high maintenance designs and their associated unintended consequences. Rather, here are three thoughts that might go some small way to reining in the worst such designs.

  1. Studying history is the best way to avoid repeating it, and so taking a broad and critical look at our past has some hope of directing us well for the future. It is partly for this reason that "patterns" were devised, to help encapsulate history.

  2. Building on known successes is likely to have fewer unintended consequences than devising new ideas. So following the pattern of "full exploitation" that started this series is, where possible, most likely to yield valuable results.

  3. An effective way to understand the consequences of a design is to document it thoroughly, particularly explaining how it should be used to someone with little background knowledge. Often writing such documentation will highlight irregularities which make it easier to fix the design than to document all the corner cases of it. This is certainly the experience of Michael Kerrisk who maintains the man pages for Linux, and, apparently, of our Grumpy Editor who found that fixing the cdev interface made him less grumpy than trying to document it, unchanged, for LDD3.

    When documenting the behavior of the Unix filesystem, it is desirable to describe it as a hierarchical structure, as that was the overall intent. However, honesty requires us to call it a directed acyclic graph (DAG), because that is what the presence of hard links turns it into. It is possible that having to write DAG instead of hierarchy several times might have been enough to raise the question of whether hard links are such a good idea after all.

Harken to the ghosts

In his classic novella "A Christmas Carol", Charles Dickens uses three "ghosts" to challenge Ebenezer Scrooge about his ideology and ethics. They reminded him of his past, presented him with a clear picture of the present, warned him about future consequences, but ultimately left the decision of how to respond to him. We, as designers and engineers, can similarly be challenged as we reflect on these "Ghosts of Unix Past" that we have been exploring. And again, the response is up to us.

It can be tempting to throw our hands up in disgust and build something new and better. Unfortunately, mere technical excellence is no guarantee of success. As Paul McKenney astutely observed at the 2010 Kernel Summit, economic opportunity is at least an equal reason for success, and is much harder to come by. Plan 9 from Bell Labs attempted to learn from the mistakes of Unix and build something better; many of the mistakes explored in this series are addressed quite effectively in Plan 9. However while Plan 9 is an important research operating system, it does not come close to the user or developer base that Linux has, despite all the faults of the latter. So, while starting from scratch can be tempting, it is rare that it has a long-term successful outcome.

The alternative is to live with our mistakes and attempt to minimize their ongoing impact, deprecating that which cannot be discarded. The x86 CPU architecture seems to be a good example of this. Modern 64-bit processors still support the original 8086 16-bit instruction set and addressing modes. They do this with minimal optimization and using only a small fraction of the total transistor count. But they continue to support it as there has been no economic opportunity to break with the past. Similarly Linux must live with its past mistakes.

Our hope for the future is to avoid making the same sort of mistakes again, and to create such compelling new designs that the mistakes, while still being supported, can go largely unnoticed. It is to this end that it is important to study our past mistakes, collect them into patterns, and be always alert against the repetition of these patterns, or at least to learn how best to respond when the patterns inevitably recur.

So, to conclude, we have a succinct restatement of the patterns discovered on this journey, certainly not a complete set of patterns to be alert for, but a useful collection nonetheless.

Firstly there was "Full exploitation": a pattern hinted at in that early paper on Unix and which continues to provide strength today. It involves taking one idea and applying it again and again to diverse aspects of a system to bring unity and cohesiveness. As we saw with signal handlers, not all designs benefit from full exploitation, but those that do can bring significant value. It is usually best to try to further exploit an existing design before creating something new and untried.

"Conflated" designs happen when two related but distinct ideas are combined in a way that they cannot easily be separated. It can often be appropriate to combine related functionality, whether for convenience or efficiency, but it is rarely appropriate to tie aspects of functionality together in such a way that they cannot be separated. This is an error which can be recognized as the design is being created, though a bit of perspective often makes it a lot clearer.

"Unfixable" designs are particularly hard to recognize until the investment of time in them makes replacing them unpalatable. They are not clearly seen until repeated attempts to fix the original have resulted in repeated failures to produce something good. Their inertia can further be exacerbated by a stubbornness to "fix it if it kills me", or an aversion to replacement because "it is better the devil you know". It can take substantial maturity to know when it is time to learn from past mistakes, give up on failure, and build something new and better. The earlier we can make that determination, the easier it will be in the long run.

Finally "high maintenance" designs can be the hardest for early detection as the costs are usually someone else's problem. To some extent these are the antithesis of "fully exploitable" designs as, rather than serving as a unifying force to bring multiple aspects of a system together, they serve as an irritant which keeps other parts unsettled yet doesn't even produce a pearl. Possibly the best way to avoid high maintenance designs is to place more emphasis on full exploitation and to be very wary of including anything new and different.

If identifying, describing, and naming these patterns makes it easier to detect defective designs early and serves to guide and encourage effective design, then they will certainly have fulfilled their purpose.

Exercises for the interested reader

  1. Identify a design element in the IP protocol suite which could be described as "high maintenance" or as having "unintended consequences".

  2. Choose a recent extension to Linux and write some comprehensive documentation, complete with justification and examples. See if that suggests any possible improvements in the design which would simplify the documentation.

  3. Research and enumerate uses of "hard links" which are not adequately served by using symbolic links instead. Suggest technologies that might effectively replace these other uses.

  4. Describe your "favorite" failings in Unix or Linux and describe a pattern which would help with early detection and correction of similar failings.

Comments (180 posted)

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Security-related

Virtualization and containers

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds