
Kernel development

Brief items

Kernel release status

The current development kernel is 3.4-rc4, released on April 21. Linus says: "But none of it really looks all that scary. It looks like the 3.4 release is all on track, but please do holler if you see regressions." There are some reports of ongoing difficulties with the Nouveau graphics driver; hopefully those will be sorted out soon.

Stable updates: the 3.0.29, 3.2.16, and 3.3.3 stable kernel updates were released on April 22. With 3.2.16, Greg Kroah-Hartman will stop producing 3.2 updates, but Ben Hutchings has stated his intent to pick up and continue maintenance for this kernel for some time yet.

The 3.0.30 and 3.3.4 updates are in the review process as of this writing; they can be expected on or after April 26.

Comments (none posted)

Quotes of the week

Pretty much the FIRST thing you add after "hello world" is a device tree parser, because that tells you how much memory you've got and where it is.
-- Rob Landley

So I think the industry's well within range of making single package Linux-capable devices with sufficient RAM, flash, CPU, and basic peripherals like oscillator, USB, and I²C/SPI in a 8mm² package for $5 today. In fact, the engineering effort for an ARM licensee to do that is significantly less than you'd spend trying to cut Linux's memory footprint in half. Ergo, the days of projects like Linux-tiny are behind us.
-- Matt Mackall

Some patches scare me when they successfully boot.
-- Tejun Heo

Comments (none posted)

An Interview With Linus Torvalds (TechCrunch)

TechCrunch has an interview with Linus Torvalds. "Would some kernel people prefer getting tucked in at night instead of being cursed at? I’m sure that would be appreciated. I don’t think I have it in me, though."

Comments (54 posted)

A kernel.org mirror at Google

Google has announced that it has put up a (read-only) kernel.org mirror, which "is served out of multiple Google data centers, utilizing facilities in Asia, the United States and Europe to provide speedy access from almost anywhere in the world." (Thanks to several LWN readers.)

Comments (20 posted)

Kernel development news

O_HOT and O_COLD

By Jonathan Corbet
April 24, 2012
While storage devices are billed as being "random access" in nature, the truth of the matter is that operations to some parts of the device can be faster than operations to others. Rotating storage has a larger speed differential than flash, while hybrid devices may show a large difference indeed. Given that differences exist, it is natural to want to place more frequently-accessed data on the faster part of the device. But a recent proposal to allow applications to influence this placement has met with mixed reviews; the problem, it seems, is a bit more complicated than it appears.

The idea, as posted by Ted Ts'o, is to create a couple of new flags to be provided by applications at the time a file is created. A file expected to be accessed frequently would be created with O_HOT, while a file that will see traffic only rarely would be marked with O_COLD. It is assumed that the filesystem would, if possible, place O_HOT files in the fastest part of the underlying device.

The implementation requires a change to the create() inode operation: a new parameter allows the VFS layer to pass down the flags provided by the application. That change is the most intrusive part of the patch, requiring tweaks to most filesystems—43 files changed in all. The only filesystem actually implementing these flags at the outset is, naturally, ext4. In that implementation, O_HOT files will be placed in low-numbered blocks, while O_COLD files occupy the high-numbered blocks—but only if the filesystem is stored on a rotating device. Requesting O_HOT placement requires the CAP_SYS_RESOURCE capability or the ability to dip into the reserved block pool.

A lot of people seem to like the core idea, but there were a lot of questions about the specifics. What happens when the storage device is an array of rotating devices? Why assume that a file is all "hot" or all "cold"; some parts of a given file may be rather hotter than others. If an application is using both hot and cold files, will the (long) seeks between them reduce performance overall? What about files whose "hotness" varies over time? Should this concept be tied into the memory management subsystem's notion of hot and cold pages? And what do "hot" and "cold" really mean, anyway?

With regard to the more general question, Ted responded that, while it would be possible to rigorously define the meanings of "hot" and "cold" in this context, it's not what he would prefer to do:

The other approach is to leave things roughly undefined, and accept the fact that applications which use this will probably be specialized applications that are very much aware of what file system they are using, and just need to pass minimal hints to the file system in a general way, and that's the approach I went with in this O_HOT/O_COLD proposal.

In other words, this proposal seems well suited to the needs of, say, a large search engine company that is trying to get the most out of its massive array of compute nodes. That is certainly a valid use case, but a focus on that case may make it hard to generalize the feature for wider use.

Generalizing the feature may also not be helped by placing the decision on who can mark files as "hot" at the individual filesystem level. That design could lead to different policies provided by different filesystems; indeed, Ted expects that to happen. Filesystem-level policy will allow for experimentation, but it will push the feature further into an area where it is only useful for specific applications where the developers have full control over the underlying system. One would not expect to see O_HOT showing up in random applications, since developers would have no real way to know what using that flag would do for them. And that, arguably, is just as well; otherwise, it would not be surprising to see the majority of files eventually designated as "hot."

Interestingly, there is an alternative approach which was not discussed here. In 2010, a set of "data temperature" patches was posted to the btrfs list. This code watched accesses to files and determined, on the fly, which blocks were most in demand. The idea was that btrfs could then migrate the "hot" data to the faster parts of the storage device, improving overall performance. That work would appear to have stalled; no new versions of those patches have appeared for some time. But, for the general case, it would stand to reason that actual observations of access patterns would be likely to be more accurate than various developers' ideas of which files might be "hot."

In summary, it seems that, while there is apparent value in the concept of preferential treatment for frequently-accessed data, figuring out how to request and implement that treatment will take some more time. Among other things, any sort of explicit marker (like O_HOT) will quickly become part of the kernel ABI, so it will be difficult to change once people start using it. So it is probably worthwhile to ponder for a while on how this feature can be suitably designed for the long haul, even if some hot files will have to languish in cold storage in the meantime.

Comments (12 posted)

Toward a safer fput()

By Jonathan Corbet
April 24, 2012
Locking and the associated possibility of deadlocks are a challenge for developers working anywhere in the kernel. But that challenge appears to be especially acute in the virtual filesystem (VFS) layer, where the needs of many collaborating subsystems all come together in one place. The difficulties inherent in VFS locking were highlighted recently when the proposed IMA appraisal extension ran into review problems. The proposed fix shows that, while these issues can be worked around, the solution is not necessarily simple.

The management of file structure reference counts is done with calls to fget() and fput(). A file structure, which represents an open file, can depend on a lot of resources: as long as a file is open, the kernel must maintain its underlying storage device, filesystem, network protocol information, security-related information, user-space notification requests, and more. An fget() call will ensure that all of those resources stay around as long as they are needed. A call to fput(), instead, might result in the destruction of any of those resources. For example, closing the last file on an unmounted filesystem will cause that filesystem to truly go away.

What all this means is that a call to fput() can do a lot of work, and that work may require the acquisition of a number of locks. The problem is that fput() can also be called from any number of contexts; there are a few hundred fput() and fput_light() calls in the kernel. Each of those call sites has its own locking environment and, usually, no knowledge of what code in other subsystems may be called from fput(). So the potential for problems like locking-order violations is real.

The IMA developers ran into exactly that sort of problem. The IMA appraisal cleanup code is one of those functions that can be invoked from an arbitrary fput() call. That code, it seems, sometimes needs to acquire the associated inode's i_mutex lock. But the locking rules say that, if both i_mutex and the task's mmap_sem are to be acquired, i_mutex must be taken first. If somebody calls fput() with mmap_sem held—something that happens in current kernels—the ordering rule will be violated, possibly deadlocking the system. A deadlocked system is arguably secure, but IMA users might be forgiven for complaining about this situation anyway.

To get around this problem, IMA tried to check for the possibility of deadlock inside fput(), and, in that case, defer the underlying __fput() call (which does the real work) to a later and safer context. This idea did not impress VFS maintainer Al Viro, who pointed out that there is no way to encode all of the kernel's locking rules into fput(). In such situations, it can be common for core kernel developers to say "NAK" and get back to what they were doing before, but Al continued to ponder the problem, saying:

If it had been IMA alone, I would've cheerfully told them where to stuff the whole thing. Unfortunately, fput() *is* lock-heavy even without them.

After thinking for a bit, he came up with a plan that offered a way out. Like the scheme used by IMA, Al's idea involves turning risky fput() calls into an asynchronous operation running in a separate thread. But there is no knowledge of locking rules added to fput(); instead, the situation is avoided altogether whenever possible, and all remaining calls are done asynchronously.

In particular, Al is looking at all callers of fget() and fput() to see if they can be replaced with fget_light() and fput_light() instead. The "light" versions have a number of additional requirements: they come close to requiring that the calling code run in atomic context while the reference to the file structure is held. For a lot of situations - many system calls, for example - these rules don't get in the way. As the name suggests, the "light" versions are less expensive, so switching to them whenever possible makes sense regardless of any other issues.

Then, fput() in its current form is renamed to fput_nodefer(). A new fput() is added that, when the final reference to a file is released, queues the real work to be done asynchronously later on. The "no defer" version will obviously be faster—the deferral mechanism will have a cost of its own—so its use will be preferred whenever possible. In this case, "whenever possible" means whenever the caller does not hold any locks. That is a constraint that can be independently verified for each call site; the "no defer" name should also hopefully serve as a warning for any developer who might change the locking environment in the future.

With luck, all of the performance-critical calls can be moved to the "no defer" version, minimizing the performance hit that comes from the deferral of the fput() call. So it seems like a workable solution—except for one little problem pointed out by Linus: deferral can change the behavior seen by user space. In particular, the actual work of closing a file may not be complete by the time control returns to user space, causing the process's environment to differ in subtle and timing-dependent ways. Any program that expects that the cleanup work will be fully done by the time a close() call returns might just break.

The "totally asynchronous deferral" literally *breaks* *semantics*.

Sure, it won't be noticeable in 99.99% of all cases, and I doubt you can trigger much of a test for it. But it's potential real breakage, and it's going to be hard to ever see. And then when it *does* happen, it's going to be totally impossible to debug.

That does not seem like a good outcome either. The good news is that there is a potential solution out there in the form of Oleg Nesterov's task_work_add() patch set. This patch adds a functionality similar to workqueues, but with a fundamental difference: the work is run in the context of the process that was active at the time the work is added.

In brief, the interface defines work to be done this way:

    #include <linux/task_work.h>

    typedef void (*task_work_func_t)(struct task_work *);

    struct task_work {
	struct hlist_node hlist;
	task_work_func_t func;
	void *data;
    };

The task_work structure can be initialized with:

    void init_task_work(struct task_work *twork, task_work_func_t func, 
		        void *data);

The work can be queued for execution with:

    int task_work_add(struct task_struct *task, struct task_work *twork, 
		      bool notify);

A key aspect of this interface is that it will run any queued work before returning to user space from the kernel. So that work is guaranteed to be done before user space can run again; in the case of a function like close(), that guarantee means that user space will see the same semantics it did before, without subtle timing issues. So, Linus suggested, this API may be just what is needed to make the new fput() scheme work.

There is just one final little problem: about a half-dozen architectures lack the proper infrastructure to support task_work_add() properly. That makes it unsuitable for use in the core VFS layer. Unless, of course, you're Al Viro; in that case it's just a matter of quickly reviewing all the architectures and coming up with a proposed fix—perhaps in assembly language—for each one. Assuming Al's work passes muster with the architecture maintainers, all of this work is likely to be merged for 3.5 - and the IMA appraisal work should be able to go in with it.

Comments (6 posted)

A library for seccomp filters

By Jake Edge
April 25, 2012

Now that it is looking like Linux will be getting an enhanced "secure computing" (seccomp) facility, some are starting to turn toward actually using the new feature in applications. To that end, Paul Moore has introduced libseccomp, which is meant to make it easier for applications to take advantage of the packet-filter-based seccomp mode. That will lead to more secure applications that can permanently reduce their ability to make "unsafe" system calls, which can only be a good thing for Linux application security overall.

Enhanced seccomp has taken a somewhat tortuous path toward the mainline—and it's not done yet. Will Drewry's BPF-based solution (aka seccomp filter or seccomp mode 2) is currently in linux-next, and recent complaints about it have been few and far between, so it would seem likely that it will appear in the 3.5 kernel. It will provide fine-grained control over the system calls that the process (and its children) can make.

What libseccomp does is make it easier for applications to add support for sandboxing themselves by providing a simpler API to use the new seccomp mode. By way of contrast, Kees Cook posted a seccomp filter tutorial that describes how to build an application using the filters directly. It is also interesting to see that the recent OpenSSH 6.0 release contains support for seccomp filtering, using a (pre-libseccomp) patch from Drewry. The patch limits the privilege-separated OpenSSH child process to a handful of legal system calls, while setting up open() to fail with an EACCES error.

As described in the man pages that accompany the libseccomp code, the starting point is to include seccomp.h, then an application must call:

    int seccomp_init(uint32_t def_action);

The def_action parameter governs the default action that is taken when a system call is rejected by the filter. SCMP_ACT_KILL will kill the process, while SCMP_ACT_TRAP will cause a SIGSYS signal to be issued. There are also options to force rejected system calls to return a certain error (SCMP_ACT_ERRNO(errno)), to generate a ptrace() event (SCMP_ACT_TRACE(msg_num)), or to simply allow the system call to proceed (SCMP_ACT_ALLOW).

Next, the application will want to add its filter rules. Those rules can apply to any invocation of a particular system call, or it can restrict calls to only use certain values for the system call arguments. So, a rule could specify that write() can only be used on file descriptor 1, or that open() is forbidden, for example. The interface for adding rules is:

    int seccomp_rule_add(uint32_t action,
                         int syscall, unsigned int arg_cnt, ...);

The action parameter uses the same action macros as seccomp_init(). The syscall argument is the system call number of interest for this rule; it could be specified using __NR_syscall values, but the SCMP_SYS() macro is recommended to properly handle multiple architectures. The arg_cnt parameter gives the number of argument comparison rules that follow.

In the simplest case, where the rule is just allowing a system call for example, there are no argument rules. So, if the default action is to kill the process, adding a rule to allow close() would look like:

    seccomp_rule_add(SCMP_ACT_ALLOW, SCMP_SYS(close), 0);

Doing filtering based on the system call arguments relies on a set of macros that specify the argument of interest by number (SCMP_A0() through SCMP_A5()) and the comparison to be done (SCMP_CMP_EQ, SCMP_CMP_GT, and so on). So, adding a rule that allows writing to stderr would look like:

    seccomp_rule_add(SCMP_ACT_ALLOW, SCMP_SYS(write), 1, 
                     SCMP_A0(SCMP_CMP_EQ, STDERR_FILENO));

Once all the rules have been added, the filter is loaded into the kernel (and activated) with:

    int seccomp_load(void);

The internal library state that was used to build the filter is no longer needed after the call to seccomp_load(), so it can be released with a call to:

    void seccomp_release(void);

There are a handful of other functions that libseccomp provides, including two ways to extract the filter code from the library:

    int seccomp_gen_bpf(int fd);
    int seccomp_gen_pfc(int fd);

Those functions will write the filter code in either kernel-readable BPF or human-readable Pseudo Filter Code (PFC) to fd. One can also set the priority of system calls in the filter. The priority is used as a hint by the filter-generation code to place higher-priority calls earlier in the filter list, reducing the overhead of checking those calls (at the expense of the others in the rules):

    int seccomp_syscall_priority(int syscall, uint8_t priority);

In addition, there are a few attributes of the seccomp filter that can be set or queried using:

    int seccomp_attr_set(enum scmp_filter_attr attr, uint32_t value);
    int seccomp_attr_get(enum scmp_filter_attr attr, uint32_t *value);

The available attributes are the filter's default action (SCMP_FLTATR_ACT_DEFAULT, which is read-only), the action taken when the loaded filter does not match the architecture it is running on (SCMP_FLTATR_ACT_BADARCH, which defaults to SCMP_ACT_KILL), and whether PR_SET_NO_NEW_PRIVS is turned on before activating the filter (SCMP_FLTATR_CTL_NNP, which defaults to NO_NEW_PRIVS being turned on). The NO_NEW_PRIVS flag is a recent kernel addition that stops a process and its children from ever gaining new privileges (via setuid() or file capabilities, for example).

The last attribute came about after some discussion in the announcement thread. The consensus on the list was that it was desirable to set NO_NEW_PRIVS by default, but allow libseccomp users to override that if desired. Other than some kudos from other developers about the project, the only other messages in the thread concerned the GPLv2 license. Moore said that the GPL was really just his default license; since it made more sense for a library to use the LGPL, he was able to get the other contributors to agree to switch to the LGPLv2.1.

While it is by no means a panacea, the seccomp filter will provide a way for applications to make themselves more secure. In particular, programs that handle untrusted user input, like the Chromium browser which was the original impetus for the feature, will be able to limit the ways in which damage can be done through a security hole in their code. One would guess we will see more applications using the feature via libseccomp. Seccomp mode 2 is currently available in Ubuntu kernels, and is slated for inclusion in ChromeOS—with luck we'll see it in the mainline soon too.

Comments (6 posted)

Patches and updates

  • Lucas De Marchi: kmod 8 (April 19, 2012)

Page editor: Jonathan Corbet

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds