Kernel development

Brief items

Kernel release status

The current stable 2.6 release is 2.6.16.9, announced on April 19; it contains a fix for an information leak vulnerability on some AMD processors. Of the prior releases, 2.6.16.6 contains a fairly long list of fixes, while 2.6.16.7 and 2.6.16.8 are single-patch security fixes.

The current 2.6 prepatch is 2.6.17-rc2, announced by Linus on April 18. There are a lot of fixes in this release, but it also contains a simplified form of the scheduler starvation avoidance patch, some tweaks to the memory overcommit algorithm, the removal of the obsolete blkmtd and qlogicfc drivers, the removal of the unmaintained Sangoma WAN drivers, the splice() and tee() system calls, and pollable sysfs attributes. See the long-format changelog for the details.

For the record, it is worth noting that the prototypes for the splice() methods in the file_operations structure have changed again. This week's version:

    ssize_t (*splice_write)(struct pipe_inode_info *pipe, struct file *out,
                            loff_t *offset, size_t len, unsigned int flags);
    ssize_t (*splice_read)(struct file *in, loff_t *offset,
                           struct pipe_inode_info *pipe, size_t len,
                           unsigned int flags);

The offset parameter, describing where in the stream I/O should start, is new.

A few dozen patches (all fixes) have been merged into the mainline after the -rc2 release.

The current -mm tree is 2.6.17-rc1-mm3. Recent changes to -mm include an ACPI dock driver, i2c virtual adapter support, a number of memory management tweaks, a trusted platform module (TPM) driver update, and a new version of the zlib library.

Kernel development news

Quotes of the week

I don't think anyone is smart enough to configure Apache with SELinux. I've installed Apache maybe 20 times in my life, which is plenty, and I eventually realized it was SELinux and just turned the damn thing off after an hour of trying to fix it.

-- Dave Aitel

Keep in mind as well that SELinux "complexity" is purely a reflection of complexity in Linux; SELinux just exposes the existing interactions and provides a way to control them. The SELinux mechanism itself is fairly simple.

-- Stephen Smalley

Virtual time

The developers interested in containers and virtualization have discussed interfaces to virtualize access to a number of system resources. None, however, have talked about virtualizing access to the system time. Until now, that is. With Jeff Dike's time virtualization patches any process tree can have its own idea of what time it is.

Jeff's patch adds a new "time namespace" structure to the task structure. By default, all processes share the normal host system's idea of time. But a new option (CLONE_TIME) to the unshare() system call allows a process to disconnect from the system time. After such a call, that process - and any children it creates - will be able to keep its own time value. Setting a virtualized time value is, unlike changing the normal system time, an unprivileged operation.

Internally, a virtualized time is stored as a simple offset; whenever a process requests the current time, the offset is added to the current system time and the sum is returned. This approach has the advantages of being simple and fast; a process running with virtualized time also does not give up time adjustments made, for example, by NTP. On the other hand, this implementation does not support the ability to confuse processes by messing deeply with their idea of time - running time at a different rate, for example, or even backward. Chances are that this omission will not upset more than a small percentage of potential users of virtualized time, however.

Jeff's purpose is to speed up the gettimeofday() system call in User-mode Linux instances. If the kernel allows process subtrees to have their own time values, then User-mode Linux can simply use the host's gettimeofday() call, rather than intercepting that call and implementing it itself. Since gettimeofday() is one of the most frequently-used system calls, this optimization can make a significant difference.

One other change is required, however, for User-mode Linux to get the benefit from this change. UML performs much of its process control using ptrace(); in particular, it intercepts and interprets system calls with the PTRACE_SYSCALL operation. What is really needed for a fast gettimeofday() is the ability to not intercept that particular call. So Jeff's patch also extends ptrace() by adding a PTRACE_SYSCALL_MASK operation. This new operation can set a bitmask indicating which system calls should be intercepted, and which should be executed without stopping.

The result, with a suitably patched UML, is a gettimeofday() call which runs at about 99% of the native process speed. That may well be good enough to make this patch a piece of the growing set of interfaces supporting virtualization and containers.

write(), thread safety, and POSIX

Dan Bonachea recently reported a problem. It seems that he has a program where multiple threads are simultaneously writing to the same file descriptor. Occasionally, some of that output disappears - overwritten by other threads. Random loss of output data is not generally considered to be a desirable sort of behavior, and, says Dan, POSIX requires that write() calls be thread-safe. So he would like to see this behavior fixed.

Andrew Morton quickly pointed out the source of this behavior. Consider how write() is currently implemented:

    asmlinkage ssize_t sys_write(unsigned int fd, const char __user *buf,
                                 size_t count)
    {
        struct file *file;
        ssize_t ret = -EBADF;
        int fput_needed;

        file = fget_light(fd, &fput_needed);
        if (file) {
            loff_t pos = file_pos_read(file);
            ret = vfs_write(file, buf, count, &pos);
            file_pos_write(file, pos);
            fput_light(file, fput_needed);
        }

        return ret;
    }

There is no locking around this function, so it is possible for two (or more) threads performing simultaneous writes to obtain the same value for pos. They will each then write their data to the same file position, and the thread which writes last wins.

Putting some sort of lock (using the inode lock, perhaps) around the entire function would solve the problem and make write() calls thread-safe. The cost of this solution would be high, however: an extra layer of locking when almost no application actually needs it. Serializing write() operations in this way would also rule out simultaneous writes to the same file - a capability which can be useful to some applications.

So some developers have questioned whether this behavior should be fixed at all. It is not something which causes problems for over 99.9% of applications, and, for those which need to be able to perform this sort of simultaneous write, there are other options available. These include user-space locking or using the O_APPEND option. So, it is asked, why add unnecessary overhead to the kernel?

Linus responds that it is a "quality of implementation" issue, and that if there is a low-cost way of getting the system to behave the way users would like, it might as well be done. His proposal is to apply a lock to the file position in particular. His patch adds a f_pos_lock mutex to the file structure and uses that lock to serialize uses of and changes to the file position. This change will have the effect of serializing calls to write(), while leaving other forms (asynchronous I/O, pwrite()) unserialized.

The patch has not drawn a lot of comments, and it has not been merged as of this writing. Its ultimate fate will probably depend on whether avoiding races in this obscure case is truly seen to be worth the additional cost imposed on all users.

The future of the Linux Security Module API

Back in 2001, the very first Linux kernel summit included a discussion on security policies. At that meeting, it was decided that there was no interest in patching in the several competing implementations which were available at that time. Instead, developers interested in security were asked to create a generic interface which could be used by any security policy. The result was the Linux Security Modules (LSM) API - a long list of hooks which can be used to intercept almost any operation of interest within the kernel.

Last year, some developers were heard to mumble that perhaps LSM should be removed from the kernel. Since LSM was merged, there has been only one serious security mechanism using it to emerge: SELinux. Since there is only one LSM user, and since SELinux can be thought of as a fairly generic security framework in its own right, it is not clear that there is a need for the LSM interface. The discussion died down last year, however, and there has been little talk of yanking out LSM.

Until now. In response to a current discussion on LSM hooks, James Morris has posted a patch adding LSM to the "feature removal" schedule. The end of LSM is not a distant event either: the proposed date is this coming June - the 2.6.18 kernel, in other words. If this patch goes through, LSM will be gone in the very near future.

The early indications suggested that it could go through: several kernel developers have argued in favor of the removal of LSM, while none asked for it to be retained. The only disagreement - mild - was over the removal date, with some arguing that 2.6.18 is too soon. Those in favor of an early removal, however, claim that last year's discussion should count as the usual one-year warning for this sort of change, and that there is no need to wait any longer.

One might well wonder what the hurry is to remove this API from the kernel. There is, in fact, more than just the "only one user" argument in circulation. James's patch includes this text:

[LSM] also attracts a regular stream of misconceived and broken security module submissions to mainline, such as BSD Security Levels, and developers are seeing LSM as the answer to everything rather than really thinking about what they need and how to architect the code properly and generally.

So LSM becomes a general temptation to solve problems in the wrong way. Beyond the security levels module (which, among other things, is seen as having open vulnerabilities and no maintainer interest), the developers may be thinking of past episodes like the debate over the realtime security module or the Integrity Measurement Architecture, neither of which is best implemented as a security module.

The real issue, however, may be this one:

There is also a growing number of proprietary modules hooking into LSM in unsafe ways, not necessarily even for security purposes. The LSM interface semantics are too weak and such an API does not belong in the mainline kernel.

The 2.6 kernel - intentionally - does not give loadable modules access to the system call table. But the LSM interface is almost as good - it gives a loadable module the opportunity to intercept almost any operation that the kernel may attempt to perform. The LSM hooks are supposed to limit themselves to internal record keeping and returning an allow/deny status to the kernel - but there is no way to enforce that sort of restriction. The GPL-only status of the LSM API does not help much either.

The people involved are wary of publicly pointing fingers at companies suspected of misusing the LSM interface. One example which can be found, however, is the kernel generalized event management module which was posted to the kernel-mentors list last year. When KGEM was loaded, it would shove aside any currently-loaded security policy and install itself in its place. It would then feed security-related events through to a (proprietary) user-space application, which would make decisions aimed at protecting Linux users from the pressing threat of virus attacks. There were a lot of issues over how this module was implemented, but using LSM to override existing security policies and provide hooks for proprietary code was considered especially distasteful.

These reasons and strong developer pressure notwithstanding, it is not clear that LSM will actually go away anytime soon. There is not yet a consensus that SELinux should be seen as the One True Security Policy; many potential users find its complexity hard to deal with and often simply turn it off. The power of SELinux is unquestioned, but its usability is another story.

There are other users of the LSM API out there; they just have not been submitted for inclusion into the mainline. These include:

  • Novell's AppArmor, which is the security policy shipped with current SUSE releases. AppArmor is free software, but has never been submitted for review. The discussion of removing the LSM interface appears to have lit a fire under some rear ends at Novell, and the first AppArmor submission is said to be imminent. (In fact, it was posted just after this article was published.)

    Some of the early discussion, however, suggests that AppArmor could have a hard path into the mainline. In particular, its use of file pathnames as the core of its security policy has been strongly questioned. In a system capable of hard and soft links, multiple namespaces, shared subtrees, and more, the meaning of any specific pathname is far from clear. That is why SELinux uses extended attributes to apply labels directly to files, rather than relying on their pathnames.

  • The Linux Intrusion Detection System (LIDS) is an LSM user. The LIDS developers have asked that LSM not be removed, but have not made any statements regarding if and when they might submit their module for merging.

  • The Dazuko module is used by tools like ClamAV. Dazuko seems somewhat like KGEM, in that it exports an interface for user-space programs to make decisions. It is not clear that such an interface can ever make it through the review process.

  • Multiadm is a module which allows privileges to be handed out to non-root users.

Given that security is something other than a completely solved problem, it would be surprising if there were any single approach which was suitable for all users. So something may well emerge and qualify as the second user which keeps the LSM API in place.

Or, at least, which keeps some sort of API in place. If LSM stays around, the kernel developers will probably make changes which make the API harder to abuse. These might include finding ways to restrict what LSM hooks can do and providing compile-time options to wire in a single security policy at kernel build time. So, while there is a reasonable chance that future kernels will include an LSM interface, it might be a rather different interface than the one there today. Any security module developers who want to have a say in how the interface evolves would be well advised to join the discussion soon.

Page editor: Jonathan Corbet

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds