
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.27-rc4, released on August 20. Along with lots of fixes, it includes support for the multitouch trackpad on new Apple laptops, more reshuffling of architecture-specific include files, a number of XFS improvements, interrupt stacks for the SPARC64 architecture, the removal of the obsolete Auerswald USB driver, and new drivers for TI TUSB 6010 USB controllers, Inventra HDRC USB controllers, and National Semiconductor adcxx8sxxx analog-to-digital converters. The short-form changelog is in the announcement, or see the long-form changelog for all the details.

Patches continue to accumulate in the mainline repository, and 2.6.27-rc5 may be released by the time you read this. Along with the fixes, -rc5 will include Japanese translations of more kernel documents and generic interrupt handling for some UIO devices.


Kernel development news

Quotes of the week

NOTE: When tempted to use multibyte bitfields on fixed-layout data, you need to look in the mirror, ask yourself "what will they do to me during code review for that?", shudder and decide that some temptations are just not worth the pain.
-- Al Viro

The longer you queue stuff up, the more painful having to change stuff at the beginning of that queue becomes. It can invalidate everything else you worked on.

The only sane way to operate is to post your work early and often, or else you'll live in a world of hurt, and it will be nobody's fault but your own.

-- David Miller

linux-next is not a tree that you can track. It's a tree that you can fetch _once_ and then throw away.

So what you can do is to "fetch" linux-next, and test it. But you MUST NEVER EVER use it for anything else. You can't do development on it, you cannot rebase onto it, you can't do _anything_ with it.

-- Linus Torvalds

We get better regression tests from getting stuff upstream. However upstreaming stuff to Linus is not how to find regressions, it helps but its suboptimal in that he will eventually ignore us if the regression rate gets too high.
-- Dave Airlie


AXFS: a compressed, execute-in-place filesystem

By Jonathan Corbet
August 26, 2008
Filesystems are clearly an area of high development interest at the moment; hardly a week goes by without a new filesystem release for Linux popping up on some list or other. All of this development is motivated by a number of factors, including the increasing size of storage devices and the increasing capability of solid-state storage. Beyond that, though, there is the simple fact that there is no single filesystem which is optimal for all applications. The recently-announced AXFS filesystem is a clear example of what can be done if one targets a specific use case and optimizes for that case only.

At a first impression, AXFS seems like a simple and limited filesystem. It is, for example, read-only; the AXFS developers have made no provision for changing the filesystem after it is created. Some filesystems have a great deal of code dedicated to the creation of the optimal layout of file blocks on disk; AXFS has none of that. Instead, it has a simple format which divides the media into "regions" and, almost certainly, spreads accesses across the device. There is no journaling, no logging, no snapshots, and no multi-device volume management.

What AXFS does provide is compressed storage using zlib. It is, clearly, aimed at embedded systems using flash-based storage. For such devices, a compressed filesystem can be built using the provided tools, then loaded into a minimal amount of flash on each device. It thus joins a number of other compressed filesystems - cramfs and squashfs, for example - provided for this sort of application. One interesting aspect of compressed, flash-oriented filesystems is their apparent ability to stay out of the mainline kernel. By posting AXFS for review on linux-kernel, developer Jared Hulbert may be trying to avoid a similar fate.

The feature which makes AXFS different from squashfs and cramfs is its support for execute-in-place (XIP) files. Some types of flash can be mapped directly into the processor's address space. When running programs stored on that flash, copying pages of executable code from flash into main memory seems like a bit of a waste: since that code is already addressable by the processor, why not run it from the flash? Executing code directly from flash saves RAM; it also makes things faster by eliminating the need to copy those pages into RAM at page fault time. As a result, systems using XIP tend to boot more quickly, a feature which designers (and users) of embedded systems appreciate.

Linux has had an execute-in-place mechanism for a few years now, but relatively few filesystems make use of it. AXFS has been designed from the beginning to facilitate XIP operation - that's its reason for existence (and the origin of the "X" in its name).

There is an additional twist, though. One would ordinarily consider compressed storage and XIP to be mutually exclusive - there is little value in mapping compressed executable code into a process's address space. To be executed in place, a page of code must be stored uncompressed. What makes AXFS unique is its ability to mix compressed and uncompressed pages in the same executable file. So pages which will be frequently accessed can be stored uncompressed and executed in place. Pages with infrequently-needed code or which contain initialized data can be stored compressed to save space and decompressed at fault time.

This is a slick feature, but it is not of great use if one does not know which pages of an executable file are heavily enough used to justify storing them without compression. Trying to determine this information and manually pick the representation of each page seems like an error-prone exercise - not to mention one which would tend to create high employee turnover. So another method is needed.

To that end, AXFS provides a built-in profiling mechanism. Each AXFS filesystem is represented by a virtual file under /proc/axfs; writing "on" to that file will cause AXFS to make a note of every page fault within the filesystem. Reading that file then yields spreadsheet-like output showing, for each file, how many times each page was faulted into the page cache. Using this data, it is possible to generate an AXFS filesystem image with an optimal number of compressed pages for the target system.
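In practice, the profiling workflow might look like the following. The file name under /proc/axfs depends on the mounted filesystem, and the image-builder invocation is hypothetical - both are shown only to illustrate the shape of the process:

```sh
# Turn on page-fault profiling for a mounted AXFS image
# (the "mtd0" name is illustrative)
echo on > /proc/axfs/mtd0

# Exercise the target workload so the hot pages actually get faulted in
# (for example, boot the device's normal application set)

# Read back the spreadsheet-like per-file, per-page fault counts
cat /proc/axfs/mtd0 > profile.csv

# Rebuild the image with frequently-faulted pages stored uncompressed
# (tool name and options are hypothetical)
mkfs.axfs -p profile.csv -o rootfs.axfs rootfs/
```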

Filesystems normally need a few rounds of review before they can make it into the mainline; some filesystems need rather more than that. AXFS is sufficiently simple, though, that it may find a quicker path into the kernel. So far, the comments have mostly been positive, with the biggest complaint being, perhaps, that its name is too close to that of the existing XFS filesystem. So a 2.6.28 merge for AXFS, while far from guaranteed, would appear to be not entirely out of the question.


Sysfs and namespaces

By Jonathan Corbet
August 26, 2008
Support for network namespaces - allowing different groups of processes to have a different view of the system's network interfaces, routes, firewall rules, etc. - is nearing completion in recent kernels. A look at net/Kconfig turns up something interesting, though: network namespaces can only be enabled in kernels which do not support sysfs - the two are mutually exclusive. Since most system configurations need sysfs to work properly, this limitation has made it harder than it would otherwise be to use, or even test, the network namespace feature.

The problem is simple: the network subsystem creates a sysfs directory for each network interface on the system. For example, eth0 is represented by /sys/class/net/eth0; therein one can find out just about anything about how eth0 is configured, query its packet statistics, and more. But, when network namespaces are in use, one group of processes may have a different eth0 than another, so they cannot share a globally-accessible sysfs tree. One solution might be to add the network namespace as an explicit level in the sysfs tree, but that would break user-space tools and fails to properly isolate namespaces from each other. The real solution is to build namespace awareness more deeply into sysfs.

Eric Biederman has been working on a set of sysfs namespace patches for the last year or so; those patches now appear to be getting close to ready for inclusion into the mainline. Network namespaces will be the first user of this feature, but it has been written in a way that makes it possible for any system namespace to provide differing views of parts of the sysfs hierarchy.

The core concept is that of a "tagged" directory in sysfs. Any sysfs directory can be associated with (at most) one type of tag, where that type identifies which type of namespace controls how that directory is viewed. Thus, for example, /sys/class/net would have a tag type identifying the network namespace subsystem as the one which is in control there. The /sys/kernel/uids directory, instead, will be managed by the user namespace subsystem. Once a directory is given a tag type, all subdirectories and attribute files inherit the same type.

Namespace code makes use of tagged sysfs directories by adding an entry to enum sysfs_tag_type, defined in <linux/sysfs.h>, to identify its specific tag type. The namespace must also create an operations structure:

    struct sysfs_tag_type_operations {
	const void *(*mount_tag)(void);
    };

The purpose of the mount_tag() method is to return a specific tag (represented by a void * pointer) for the current process. This tag, normally, will just be a pointer to the structure describing the relevant namespace; for example, network namespaces implement this method as follows:

    static const void *net_sysfs_mount_tag(void)
    {
	return current->nsproxy->net_ns;
    }

The tag operations must be registered with sysfs using:

    int sysfs_register_tag_type(enum sysfs_tag_type type, 
                                struct sysfs_tag_type_operations *ops);

Thereafter, there are two ways of associating tags with a sysfs hierarchy. One of those is to make a tagged directory directly with:

    int sysfs_make_tagged_dir(struct kobject *kobj, 
                              enum sysfs_tag_type type);

The directory associated with kobj will have differing contents depending on the value of the tag of the given type. The actual tag associated with the contents of this directory will be determined (at creation time) by calling a new function added to the kobj_type structure:

    const void *(*sysfs_tag)(struct kobject *kobj);

The sysfs_tag() function is usually a short series of container_of() calls which, eventually, locates the appropriate namespace for the given kobj.
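To see what such a sysfs_tag() implementation boils down to, the container_of() pattern can be reproduced in user space. The sketch below re-creates the macro and uses simplified stand-ins - struct net_device and struct net_namespace here are illustrative, not the kernel's real definitions:

```c
#include <assert.h>
#include <stddef.h>

/* User-space re-creation of the kernel's container_of() macro:
 * given a pointer to a member, recover the enclosing structure. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct kobject { const char *name; };

/* Simplified stand-ins for the real kernel structures. */
struct net_namespace { int id; };
struct net_device {
    struct net_namespace *ns;   /* owning namespace */
    struct kobject kobj;        /* embedded kobject */
};

/* A sysfs_tag() implementation: walk from the kobject back to its
 * containing structure, then return the namespace pointer as the tag. */
static const void *net_sysfs_tag(struct kobject *kobj)
{
    struct net_device *dev = container_of(kobj, struct net_device, kobj);
    return dev->ns;
}
```

For a real device the chain may involve more than one container_of() step, but the idea is the same: the kobject's position within its containing structure is known at compile time, so the tag can be found without any extra per-directory state.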

An alternative way to attach tags to a directory tree is to associate it directly with the class structure. To that end, struct class has two new members:

    enum sysfs_tag_type tag_type;
    const void *(*sysfs_tag)(struct device *dev);

When the class is instantiated, it will have tags of the given tag_type; the specific tag for a given class will be found by calling the sysfs_tag() function.

Finally, if a specific tag ceases to be valid (because the associated namespace is destroyed, normally), a call should be made to:

    void sysfs_exit_tag(enum sysfs_tag_type type, const void *tag);

This call will cause all sysfs directories with the given tag to become invisible, and to be deleted when it is safe to do so.

Adding tagged directory support requires some significant changes to the sysfs code. But the interface has been designed to make it very easy for other subsystems to make use of tagged directories; it's a simple matter of providing functions to return the specific tag values which should be used. At this point, the biggest challenge might be making sense of sysfs when its contents may be different for each observer. But that is a challenge associated with namespaces in general.


TALPA strides forward

By Jake Edge
August 27, 2008

When last we left TALPA, it was still floundering around without a solid threat model, but over the last several weeks that part has changed. Eric Paris proposed a fairly straightforward—though still somewhat controversial—model for the threats that TALPA is supposed to handle. With that in place, there is at least a framework for kernel hackers to evaluate different ways to solve the problem, while also keeping in mind other potential uses.

It seems almost tautological, but anti-virus scanners are supposed to, well, scan files. In particular, they scan for known malware and block access to files which are found to be infected. For better or worse, scanning files is seen as an essential security mechanism by many, so TALPA is trying to provide a means to that end. Paris describes it this way:

This is a file scanner. There may be all sorts of marketing material or general beliefs that they provide security against all sorts of wide and varied threats (and they do), but in all reality the only threats they provide any help against are those that can be found by scanning files. Simple as that. Some may argue this isn't "good" security and I'm not going to make a strong argument to the contrary, I can stand here for days and show cases where this is highly useful but no one can provide a threat model more than to say, "we attempt to locate files which may be harmful somewhere in the digital ecosystem and try to deny access to that data."

All of the various scenarios where active processes can infect files with malware or actively try to avoid scanning can be ignored under this model. While this looks like "security theater" to some, it avoids the endless what-ifs that were bogging down earlier discussions. It may not be a threat model that appeals to many of the kernel hackers, but it is one that they can work with.

To many kernel developers—used to efficiency at nearly any cost—time-consuming filesystem scans seem ludicrous, especially since they only "solve" a subset of the malware problem. But the fact remains that Linux users, particularly in "enterprise" environments, believe they need this kind of scanning and are willing to pay for products that provide it. The current methods used by anti-virus vendors to do the scanning are problematic at best, causing users to run kernels tainted with binary modules. With a threat model—however limited—in place, work can proceed to find the right way to add this functionality to the kernel.

Paris is homing in on a design that calls out to user space, both synchronously and asynchronously, depending on the operation. File access might go something like this:

  • open() - causes interested user-space programs to be notified asynchronously; anti-virus scanners might kick off a scan if needed
  • read()/mmap() - causes a synchronous user-space notification, which allows anti-virus scanners to block access until scanning is complete; if malware is found, cause the read/mmap to return an error
  • write() - whenever the modification time (mtime) of a file is updated, asynchronously notify user space; this would allow anti-virus scanners to re-scan the data as desired
  • close() - asynchronous user-space notification; another place where anti-virus scanners could re-scan if the file has been dirtied

Where and how to store the current scanning status of a file is still an open question. Various proposals have been discussed, starting with a non-persistent flag in the in-memory inode of a file. While simple, that would cause a lot of unnecessary additional scanning as inodes drop out of the cache. Persistent storage of the scanned status of a file alleviates that problem, but runs into another: how do you handle multiple anti-virus products (or, more generally, scanners of various sorts); whose status gets stored with the inode?

For this reason, user-space scanners will need to keep their own database of information about which inodes have been scanned. Anti-virus scanners will also want to record which version of the virus database was used. Depending on the application, that information could be stored in extended attributes (xattrs) of the file or in some other application-specific database. In any case, it is not a problem for the kernel, as Ted Ts'o points out:

I'm just arguing that there should be absolutely *no* support in the kernel for solving this particular problem, since the question of whether a file has been scanned with a particular version of the virus DB is purely a userspace problem.

It is important to keep the scanned status out of kernel purview in order to ensure that policy decisions are not handled by the kernel itself. This is in keeping with the longstanding kernel development principle that user space should make all policy decisions. This allows new applications to come along, ones that were perhaps never envisioned when the feature was being designed. For example, Alan Cox describes another reason that the state of the file with respect to scanning should be kept in user space:

This is another application layer matter. At the end of the day why does the kernel care where this data is kept. For all we know someone might want to centralise it or even distribute it between nodes on a clustered file system.

The latest TALPA design includes an in-memory clean/dirty flag that can short-circuit the blocking read notification (when clean). That flag gets set to dirty whenever there is an mtime modification. This optimizes the common case of reading a file that hasn't changed. Further optimizations are possible down the line, as Paris mentions:

If some general xattr namespace is agreed upon for such a thing someday a patch may be acceptable to clear that namespace on mtime update, but I don't plan to do that at this time since comparing the timestamp in the xattr vs mtime should be good enough.

Various other uses for the kinds of hooks proposed for TALPA have also come up in the discussion. Hierarchical storage management, where data is transparently moved between different kinds of media, might be able to use the blocking read mechanism. File indexing applications and intrusion detection systems could use the mtime change notification as well. This is a perfect example of kernel development in action; after a rough start, the TALPA folks have done a much better job working with the community.

Some might argue that the kernel development process is somehow suboptimal, but it is the only way to get things into Linux. As the earlier adventures of TALPA would indicate, flouting kernel tradition is likely to go nowhere. While it is still a long way from being included—pesky things like working code are still needed—it is clearly on a path to get there some day, in one form or another.


Patches and updates

Kernel trees

Linus Torvalds Linux v2.6.27-rc4 ?
Steven Rostedt 2.6.26.3-rt2 ?
Steven Rostedt 2.6.26.3-rt3 ?

Core kernel code

Development tools

Roland McGrath utrace ?

Device drivers

Documentation

Michael Kerrisk man-pages-3.08 is released ?

Filesystems and block I/O

Security-related

Virtualization and containers

Eric W. Biederman sysfs namespace support ?
sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org : Enable multiple mounts of devpts ?
Serge E. Hallyn User namespaces: introduction ?

Benchmarks and bugs

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds