Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.35-rc3; Linus is on vacation so the flow of patches into the mainline is stalled for now.

There have been no stable updates since June 1.

Quotes of the week

So V4L1 will finally disappear from the kernel in 2.6.37! There should be an LWN article as well so everyone (we hope) will be informed of this in time.
-- Hans Verkuil (does this count?)

If you decide to base your file system on some algorithms then please use the original ones from proper academic papers. DO NOT modify the algorithms in solitude: this is very fragile thing! All such modifications must be reviewed by specialists in the theory of algorithms. Such review can be done in various scientific magazines of proper level.
-- Edward Shishkin

I considered it a while ago when trying to work out removing the BKL from this path. After much head banging and an overwhelming desire to go and get drunk instead I concluded it wasn't possible to tell by analysis.

So I ack this patch - it's the only way to find out.

-- Alan Cox

A single power preference knob

By Jonathan Corbet
June 23, 2010
Power management under Linux is getting more complex as the kernel's capabilities grow. It is now possible to control power use through scheduling policies, idle state management, device states, and so on. Unfortunately, some power management choices have performance consequences; depending on the use to which the system is being put, those consequences may not be welcome. So there must be a way for system administrators to control how power management decisions are made.

Currently, that control is exercised through a number of individual system parameters. One controls whether the scheduler tries to coalesce processes onto a subset of the system's CPUs in the hope of letting others sleep. Another knob tells the idle governor which sleep states it is able to use. Yet another controls CPU frequency and voltage response. Simply knowing about all of the available parameters is hard; keeping them all tuned properly can be harder yet.

Len Brown has proposed the addition of an overall control parameter for power management, to be found in /sys/power/policy_preference. This knob would have five settings, ranging from "maximum performance at all times" to "save as much power as possible without actually turning the system off." With a control like this, system administrators could control system power policy without having to learn about all of the individual parameters involved; policy choices would also be applied to any new power-management parameters added in the future.

The idea was not universally loved, though. Some commenters asked for more than five settings, but Len argued that anybody needing more complex configurations should just continue to use the individual parameters. Others fear that the single policy might be interpreted differently by different drivers, leading to inconsistent results; they would rather see the continued use of individual parameters which exactly describe the desired behavior. The real discussion, though, cannot happen until some actual code has been posted, if and when that happens.

Constraining concurrent core dumps

By Jonathan Corbet
June 23, 2010
Systems running PHP are naturally beset with more than the usual number of challenges from the outset. In some cases, though, it can get even worse; consider this story from Edward Allcutt:

For example, a common configuration for PHP web-servers includes apache's prefork MPM, mod_php and a PHP opcode cache utilizing shared memory. In certain failure modes, all requests serviced by PHP result in a segfault. Enabling coredumps might lead to 10-20 coredumps per second, all attempting to write a 150-200MB core file. This leads to the whole system becoming entirely unresponsive for many minutes.

Edward's response to this non-fun situation was a patch limiting the number of core dumps which can be underway simultaneously; any dumps which would exceed the limit would simply be skipped.

It was generally agreed that a better approach would be to limit the I/O bandwidth of offending processes when contention gets too high. But that approach is not entirely straightforward to implement, especially since core dumps are considered to be special and not subject to normal bandwidth control. So what's likely to happen instead is a variant of Edward's patch where processes trying to dump core simply wait if too many others are already doing the same.
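
The difference between the two policies (skip the dump, or wait for a slot) can be sketched in user-space C as a simple counting gate. This is only an illustration under stated assumptions: the names dump_gate, dump_gate_try_enter(), and dump_gate_leave() are hypothetical and are not taken from Edward's patch, which lives in the kernel's core-dump path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical gate bounding how many core dumps may run at once. */
struct dump_gate {
    atomic_int active;  /* dumps currently in progress */
    int limit;          /* maximum allowed concurrently */
};

void dump_gate_init(struct dump_gate *g, int limit)
{
    assert(limit > 0);
    atomic_init(&g->active, 0);
    g->limit = limit;
}

/*
 * Try to claim a slot. Returns false when the limit is already reached.
 * A "skip" policy would then drop the dump; a "wait" policy would sleep
 * and call this function again later.
 */
bool dump_gate_try_enter(struct dump_gate *g)
{
    int cur = atomic_load(&g->active);

    while (cur < g->limit) {
        /* On failure, cur is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&g->active, &cur, cur + 1))
            return true;
    }
    return false;
}

/* Release a slot once the dump has been written. */
void dump_gate_leave(struct dump_gate *g)
{
    atomic_fetch_sub(&g->active, 1);
}
```

With a limit of two, a third simultaneous caller is refused until one of the first two calls dump_gate_leave().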

Another wakeup event mechanism

By Jonathan Corbet
June 23, 2010
The suspend blocker discussion has faded away for the time being - a situation which has drawn few complaints. Developers are still thinking about the underlying problems addressed by suspend blockers, though, as can be seen from this patch by Rafael Wysocki. Rafael is trying to solve the problem of "wakeup events" (events requiring action which would wake a suspended device) being lost if they show up while the system is suspending.

In Rafael's approach, there would be a new sysfs attribute called /sys/power/wakeup_count; it would contain the number of wakeup events seen by the system so far. Any process can read this attribute at any time to obtain this count; privileged processes can also write a count back to the file. There is a twist, though: if the count written to the file does not match the count which would be read from it, the write will fail. A write also triggers a mechanism whereby any subsequent wakeup events will cause an attempted suspend operation to abort.

As with some other scenarios which have been posted, Rafael is assuming the existence of a user-space power management daemon which would decide when to suspend the system. This decision would be made when the daemon knows that no important user-space program has work to do. Without extra help, though, there will be a window between the time that the daemon decides to suspend the system and when that suspend actually happens; a wakeup event which arrives within that window could be lost, or at least improperly delayed until after the system is resumed again. But, with the wakeup_count mechanism, the kernel would notice when this kind of race had happened and abort the suspend process, allowing user space to process the new event.
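
The read, write-back, and abort rules described above can be modeled in a few lines of user-space C. What follows is a toy model of the semantics as this article describes them, not Rafael's code; the structure and every function name in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the state behind /sys/power/wakeup_count. */
struct wakeup_state {
    unsigned long count;  /* wakeup events seen so far */
    unsigned long saved;  /* count accepted by the last successful write */
    bool armed;           /* true once a write has succeeded */
};

/* Stand-in for a driver reporting a wakeup event. */
void wakeup_event(struct wakeup_state *s)
{
    assert(s != NULL);
    s->count++;
}

/* Reading the attribute returns the current count. */
unsigned long wakeup_count_read(const struct wakeup_state *s)
{
    return s->count;
}

/* A write succeeds only if the value matches the current count. */
bool wakeup_count_write(struct wakeup_state *s, unsigned long value)
{
    if (value != s->count)
        return false;
    s->saved = value;
    s->armed = true;
    return true;
}

/* A suspend attempt aborts if events arrived after the accepted write. */
bool suspend_may_proceed(const struct wakeup_state *s)
{
    return s->armed && s->count == s->saved;
}
```

A daemon following this protocol would read the count, decide that nothing needs to run, write the count back, and start the suspend only if the write succeeded; an event arriving at any point after the read causes either the write or the suspend to fail.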

For this mechanism to work, the kernel must be able to count wakeup events; that, in turn, requires sprinkling calls to a new pm_wakeup_event() function into drivers which can generate such events. So internally it doesn't look all that different from suspend blockers. Some of the comments have suggested that the scheme is quite similar to suspend blockers on the user-space side too, though Rafael believes that it avoids the aspects of that API which generated the most criticism. Regardless, reviews were mixed, and Android developer Arve Hjønnevåg thinks that this approach will not meet that project's needs. So this discussion probably has more rounds to go in the future.

Kernel development news

LSM stacking (again)

By Jake Edge
June 23, 2010

Kees Cook is back with another proposal for a kernel change that would, at least in his mind, provide more security, this time by restricting the ptrace() system call. But, like his earlier symbolic link patch, this one is not being particularly well-received on linux-kernel. It has, however, sparked some discussion of a topic that seems to recur with some frequency in that venue: stacking Linux security modules (LSMs).

Cook's patch is fairly straightforward; it creates a sysctl called ptrace_scope that defaults to zero, which chooses the existing behavior. If it is set to one, though, it only allows ptrace() to be called on descendants of the tracing process. The idea is to stop a vulnerability in one program (Pidgin, say) from being used to trace another program (like Firefox or GPG-agent), which would allow extracting credentials or other sensitive information. Like the previous symlink patch, it is based on a patch that has long been in the grsecurity kernel.
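
The descendant test at the heart of this restriction can be sketched as a walk up the parent chain. The code below is a user-space illustration with a hypothetical struct proc; the actual patch operates on the kernel's task structures and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy process record; the kernel's task_struct is far richer. */
struct proc {
    int pid;
    const struct proc *parent;  /* NULL at the root of the tree */
};

/* True if target is a strict descendant of tracer. */
bool is_descendant(const struct proc *tracer, const struct proc *target)
{
    assert(tracer != NULL && target != NULL);
    for (const struct proc *p = target->parent; p != NULL; p = p->parent) {
        if (p == tracer)
            return true;
    }
    return false;
}

/*
 * scope 0: no extra restriction (the usual checks would still apply);
 * scope 1: only descendants of the tracer may be traced.
 */
bool ptrace_scope_allows(int scope, const struct proc *tracer,
                         const struct proc *target)
{
    if (scope == 0)
        return true;
    return is_descendant(tracer, target);
}
```

Under this model, with the scope set to one, a compromised Pidgin could still trace its own children but not a sibling process such as Firefox.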

As with the previous proposal, Alan Cox was quick to suggest that it be put into an LSM:

So NAK. If you want to use bits of grsecurity then please just write yourselves a grsecurity kernel module that uses the security hooks properly and stop messing up the core code. It's all really quite simple, the [infrastructure] is there, so use it.

But, one problem with that plan is that LSMs do not stack. One can have SELinux, Smack, and TOMOYO enabled in a kernel, but only one—chosen at boot time—can be active. There have been discussions and proposals for LSM stacking (or chaining) along the way, but nothing has ever been merged. So, two "specialized" LSMs cannot do their separate jobs in the kernel and users will have to choose between them.

For "full-featured" solutions, like SELinux, that isn't really a problem, as users can find or create policies to handle their security requirements. In addition, James Morris points out that SELinux has a boolean, allow_ptrace, to do what Cook is trying to do: "You don't need to write any policy, just set it [allow_ptrace] to 0". But, for those that don't want to use SELinux, that's no solution. As Ted Ts'o puts it:

i think we really need to have stacked LSM's, because there is a large set of people who will never use SELinux. Every few years, I take another look at SELinux, my head explodes with the (IMHO unneeded complexity), and I go away again...

Yet I would really like a number of features such as this ptrace scope idea --- which I think is a useful feature, and it may be that stacking is the only way we can resolve this debate. The SELinux people will never believe that their system is too complicated, and I don't like using things that are impossible for me to understand or configure, and that doesn't seem likely to change anytime in the near future.

Others were also favorably disposed toward combining LSMs, though the consensus seems to be for chaining LSMs in the security core rather than stacking, as was done with SELinux and Linux capabilities (i.e. security/commoncap.c). In the stacking model, each LSM is responsible for calling out to any other secondary LSMs for each security operation, whereas chaining is "just a walk over a list of security_operations" calling each LSM's version from the core, as Eric W. Biederman described. But it's not as easy as it might seem at first glance, as Serge E. Hallyn, who proposed a stacking mechanism in 2004, points out:

The general answer tends to be "generic stacking doesn't work, LSMs need to know about each other." But even for that (as evidenced by the selinux+commoncap experience with stacking) is hairy, and more to the point it probably does not scale when we have 5-10 small LSMs. I.e. LSM 1 wants to prevent some action while LSM 2 requires that action to succeed so that it can properly prevent another action. Concrete examples are buried in the stacker discussions on the lsm list from 2004-2005.
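
To make the chaining idea concrete, here is a minimal user-space sketch of "a walk over a list of security_operations" for a single hook. The toy_security_ops structure, the two toy modules, and the first-denial-wins rule are assumptions made for illustration; the real security_operations structure has many hooks, and, as Hallyn notes, a simple rule like this does not resolve conflicts between modules.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for one hook of the kernel's security_operations. */
struct toy_security_ops {
    const char *name;
    int (*ptrace_access_check)(int tracer_pid, int target_pid);
};

/*
 * Chaining: the core walks the list and calls every module's hook.
 * The first nonzero (denying) result ends the walk; zero means allowed.
 */
int chain_ptrace_access_check(const struct toy_security_ops *const *mods,
                              size_t n, int tracer_pid, int target_pid)
{
    assert(mods != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int ret = mods[i]->ptrace_access_check(tracer_pid, target_pid);
        if (ret != 0)
            return ret;
    }
    return 0;
}

/* Toy module that allows everything. */
int allow_all(int tracer_pid, int target_pid)
{
    (void)tracer_pid;
    (void)target_pid;
    return 0;
}

/* Toy module that refuses any attempt to trace pid 1. */
int protect_init(int tracer_pid, int target_pid)
{
    (void)tracer_pid;
    return target_pid == 1 ? -EPERM : 0;
}
```

In the stacking model, by contrast, the first module would itself have to call the second one from inside its own hook.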

It seems that there may be some discussion of LSM stacking/chaining at the Linux security summit, as part of Cook's presentation on "widely used, but out-of-tree" security solutions, but perhaps also in a "beer BOF" that Hallyn is proposing.

The way forward for both of Cook's recent proposals looks to be as an LSM and, to that end, he has posted the Yama LSM, which incorporates the symlink protections and ptrace() limitations that he previously posted. In addition, it adds the ability to restrict hard links such that they cannot be created for files that are either sensitive (e.g. setuid) or those that are not readable and writable by the link creator. Each of these measures can be enabled separately by sysctls in /proc/sys/kernel/yama/.

While "Yama" might make one start looking for completions of an acronym ("Yet Another ..."), it is actually named for a deity: "Yama is roughly the 'lord of death/underworld' in Buddhist and Hindu tradition, kind of over-seeing the rehabilitation of impure entities", Cook said. Given the number of NAKs that his recent patch proposals have received, calling Yama the "NAKed Access Control system" shows a bit of a sense of humor about the situation. DAC, MAC, RBAC, and others would now be joined by NAC if Yama gets merged.

So far, discussion of Yama has been fairly light, and without any major complaints. While some are rather skeptical of the protections that Cook has been proposing, they are much less likely to care if they live in an LSM, rather than "junk randomly spewed [all] over the tree", as Cox put it.

Once these simpler security tasks are encapsulated into an LSM, Morris said, the kernel hackers "can evaluate the possible need for some form of stacking or a security library API" to allow these measures to coexist with SELinux, AppArmor, and others. Given the fairly broad support for the LSM approach, it would seem that Yama, or some descendant, could make it into the mainline. Whether that translates to some kind of general mechanism for combining LSMs in interesting ways remains to be seen; it should be worth watching, so stay tuned.

Btrfs: broken by design?

By Jonathan Corbet
June 22, 2010
The Btrfs filesystem is seen by many as the primary Linux filesystem for the next decade or so. It brings a next-generation design and a wide range of features (snapshots, data checksums, internal RAID, etc.) that users have been waiting for. Despite being merged for 2.6.29, Btrfs remains an experimental development, but some of the more adventurous distributions are beginning to offer Btrfs installation options and MeeGo has chosen Btrfs as its default filesystem. So when a filesystem developer started calling Btrfs "broken by design," people took notice.

Edward Shishkin, perhaps better known for his efforts to keep reiser4 development alive, first posted some concerns on June 3. It seems he ran a simple test: create a new Btrfs filesystem, then create 2048-byte files until space runs out. Others have talked about suboptimal space efficiency in Btrfs before, but Edward was still surprised that he was only able to use 17% of the nominal space in the filesystem before it was reported as being full. Such poor efficiency was, according to Edward, evidence that Btrfs was "broken by design" and should not be used:

The first obvious point here is that we *can not* put such file system to production. Just because it doesn't provide any guarantees for our users regarding disk space utilization.... As to current Btrfs code: *NOT ACK*!!! I don't think we need such "file systems".

Part of the problem comes down to the use of "inline extents" in Btrfs. The core data structure on a Btrfs filesystem is a B-tree which provides access to all of the objects stored in the filesystem. For larger files, the actual file data is stored in extents, which are pointed to from within the tree. Small extents, though, can be stored in the tree itself, hopefully yielding both better space efficiency and better performance. If these extents are sized inconveniently, though, they can cause a lot of wasted space. There's only room for one 2048-byte inline extent in a B-tree node, leaving 1800 bytes or so of unused space. That is a lot of internal fragmentation - a lot of wasted space.
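
The arithmetic behind those numbers is easy to check. The sketch below assumes a 4096-byte tree node, a size this article does not state but which is consistent with the figures given; the helper names are made up for the example.

```c
#include <assert.h>

/*
 * Percentage of a tree node holding file data when the node contains
 * `items` inline extents of `extent_bytes` bytes each.
 */
unsigned node_data_percent(unsigned node_bytes, unsigned extent_bytes,
                           unsigned items)
{
    assert(node_bytes > 0 && items * extent_bytes <= node_bytes);
    return (items * extent_bytes * 100u) / node_bytes;
}

/* Bytes in the node that are neither file data nor unused space. */
unsigned node_overhead_bytes(unsigned node_bytes, unsigned data_bytes,
                             unsigned unused_bytes)
{
    assert(data_bytes + unused_bytes <= node_bytes);
    return node_bytes - data_bytes - unused_bytes;
}
```

Under that assumption, node_data_percent(4096, 2048, 1) comes to 50 and node_overhead_bytes(4096, 2048, 1800) comes to 248: one inline extent fills half the node, headers and item metadata take roughly 250 bytes, and what remains is too small for a second 2048-byte extent. Inline extents of this size thus cap utilization at about 50%, which is still well above the 17% that Edward measured.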

As noted in Chris Mason's response, there are two approaches which can be taken to mitigate this kind of problem. One is to turn off inline extents altogether; Btrfs has a max_inline= mount option which can be used for just that purpose. The other approach would be to allow inline extents to be split between tree nodes so that the pieces could be sized to fill those nodes exactly. Btrfs cannot do that, and probably will not be able to anytime soon:

I didn't put in the splitting simply because the complexity was high while the benefits were low (in comparison with just turning off the inline extents).

Chris also noted that most of the other variable-size items stored in B-tree nodes - extended attributes, for example - can be split between nodes if need be. So these items should not cause fragmentation problems; it's mainly the inline extents which are at fault there.

But, as Edward pointed out, there's more to the problem than inline extents. In his investigations, he's found numerous places where groups of nearly-empty nodes exist; some were less than 1% utilized. That, in all likelihood, is the real source of the worst space utilization problems. To Edward, this behavior is another sign that the algorithms used in Btrfs are all wrong and in need of a redesign.

Chris sees it a little differently, though:

The current code is clearly choosing not to merge two leaves that it should have merged, which is just a plain old bug.

He has promised to track it down and post a fix. Between the bug fix and turning off inline extents (or, at least, reducing their maximum size), it is hoped that the worst space utilization problems in Btrfs will be no more.

That fix has not been posted as of this writing, so its effectiveness cannot yet be judged. But, chances are, this is not a case of a filesystem needing a fundamental redesign. Instead, all it needs is more extensive testing, some performance tuning, and, inevitably, some bug fixes. The good news is that the process seems to be working as it should be: these problems have been found before any sort of wide-scale deployment of this very new filesystem.

Concurrency-managed workqueues and thread priorities

By Jonathan Corbet
June 22, 2010
The original workqueue code found its way into the mainline without a great deal of discussion or debate; it was a clear improvement over what came before. Tejun Heo's concurrency-managed workqueues (CMWQ) rework has the potential to be a significant improvement as well, but its path toward merging has not been so smooth. The fifth iteration of the patch set is currently under discussion. While a number of concerns have been addressed, others have come out of the woodwork to replace them.

The CMWQ work is intended to address a number of problems with current kernel workqueues. At the top of the list is the proliferation of kernel threads; current workqueues can, on a large system, run the kernel out of process IDs before user space ever gets a chance to run. Despite all these threads, current workqueues are not particularly good at keeping the system busy; workqueues may contain a backlog of work while the CPU sits idle. Workqueues can also be subject to deadlocks if locking is not handled very carefully. As a result, the kernel has grown a number of workarounds and some competing deferred-work mechanisms.

To resolve these problems, the CMWQ code maintains a set of worker threads on each processor; these threads are shared between workqueues, so the system is not overrun with workqueue-specific threads. The special scheduler class once used by CMWQ is long gone, but the code still has hooks into the scheduler which it can use to track which worker threads are actually executing at any given time. If all workqueue threads on a CPU have blocked waiting on some resource, and if there is queued work to do, the CMWQ code will kick off a new thread to work on it. The CMWQ code can run multiple jobs from the same CPU concurrently - something the current workqueue code will not do. In this way, the CPU is always kept busy as long as there is work to be done.
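
The rule for starting a new worker can be written down as a small predicate. This is a toy model of the behavior described above and not Tejun's implementation: the cpu_pool structure and its fields are hypothetical, and idle workers are left out for simplicity.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-CPU worker pool state; field names are illustrative. */
struct cpu_pool {
    unsigned workers;  /* worker threads that have picked up work */
    unsigned blocked;  /* of those, how many are sleeping on a resource */
    unsigned queued;   /* work items still waiting to run */
};

/* Workers actually executing on the CPU right now. */
unsigned pool_running(const struct cpu_pool *p)
{
    assert(p->blocked <= p->workers);
    return p->workers - p->blocked;
}

/* Start another worker only when work is pending and nobody is running. */
bool pool_needs_new_worker(const struct cpu_pool *p)
{
    return p->queued > 0 && pool_running(p) == 0;
}
```

The scheduler hooks mentioned above are what would keep a count like blocked up to date as workers go to sleep and wake up.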

The first complaint that came back this time was that many developers had long since forgotten what CMWQ was all about, and Tejun had not put that information into the patch series introduction. He made up for that with an overview document explaining the basics of the code. That led quickly to a new complaint: the lack of dedicated worker threads means that it is no longer possible to change the scheduling behavior of specific workqueues.

There were two variants of this complaint. Daniel Walker lamented the loss of the ability to change the priority of workqueue threads from user space. Tejun has firmly denied that this is a useful thing to be able to do, and Daniel has not, yet, shown an example of where it would be desirable. Andrew Morton, instead, worries about being able to change scheduling behavior from within the kernel; that is something that at least one driver does now. He might be willing to let this capability go, but he's not happy about it:

Oh well. Kernel threads should not be running with RT policy anyway. RT is a userspace feature, and whenever a kernel thread uses RT it degrades userspace RT qos. But I expect that using RT in kernel threads is sometimes the best tradeoff, so let's not pretend that we're getting something for nothing here!

Tejun's reply to this concern takes a couple of forms. One is that workqueues are intended for general-purpose asynchronous work, and that is how almost all callers use them. It would be better, he says, to make special mechanisms for situations where they are really needed. To that end, he has posted a simple kthread_worker API which can be used for the creation of special-purpose worker threads. Essentially, one starts by setting up a kthread_worker structure:

    DEFINE_KTHREAD_WORKER(worker);
    /* ... or ... */
    struct kthread_worker worker;
    init_kthread_worker(&worker);

Then, a kernel thread should be set up using the (existing) kthread_create() or kthread_run() utilities, but passing a pointer to kthread_worker_fn() as the actual function to run:

    struct task_struct *thread;

    thread = kthread_run(kthread_worker_fn, &worker, "name" ...);

Thereafter, it's just a matter of filling in kthread_work structures with actual work to be done and queueing them with:

    bool queue_kthread_work(struct kthread_worker *worker,
                            struct kthread_work *work);

So far, there has been no real commentary on this patch.

The other thing which could be done is to associate attributes like priority and CPU affinity with the work to be done instead of with the thread doing the work. That would require expanding the workqueue API to allow this information to be specified; the CMWQ code would then tweak worker threads accordingly when passing jobs to them. At this point, though, it's not clear that there is enough need for this feature to justify the added complexity that it would require.

The CMWQ code certainly adds a bit of complexity already, though it makes up for some of that by replacing the slow work and asynchronous function call mechanisms. Tejun is hoping to drop it into linux-next shortly, and, presumably, to get it merged for 2.6.36. Whether that will happen remains to be seen; core kernel changes can be hard, and this one may not, yet, have cleared its last hurdle.

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds