Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.31-rc3, released by Linus on July 13. There are lots of fixes, but this prepatch also includes a new driver for GPIO-based matrix keypads, the "osdblk" driver (which presents an object on an object storage device as a Linux block device), support for the loading of kernel module symbols in the performance counter tools, and the removal of the Intel Langwell USB OTG driver (because it depends on infrastructure which is not yet present in the mainline). The short-form changelog is in the announcement; full details can be found in the full changelog.

There have been no stable kernel updates over the last week.


Kernel development news

Quotes of the week

To many kernel developers, the most scary ideas/patches are not the ones which are totally outlandish and crazy (such as making the kernel be able to parse and execute SQL-like statements as arguments to a modified readdir system call, in the grand but quixotic goal of unifying filesystems and databases), but the ones which are just sane enough to be initially appealing, but which ends up being a millstone making future development and backwards compatibility efforts extremely difficult. At least, this is the fear which I suspect drives the often harsh reception some have received (regardless of whether that reception was justified or not).

-- Ted Ts'o

This is why changing core MM is so worrisome - there's so much secret and subtle history to it, and performance dependencies are unobvious and quite indirect and the lag time to discover regressions is long.

-- Andrew Morton

Having this name is very convenient, people review my drivers like crazy.

-- Linus Walleij (Thanks to Alejandro Riveira Fernández)


In brief

Listmanager. It appears that vger.kernel.org, the busy system which handles the bulk of the kernel-oriented mailing lists, is about to get a new mailing list manager. The new system is called "listmanager"; it is meant to be somewhat more efficient than the existing majordomo installation. The interface should stay about the same, though.

An early-stage version of the software has been posted for those who would like to play with it. Matti Aarnio, the author of the code, says: "Somebody will want to know if the sources will be available. Yes, but it is only my 3rd day of hacking at it yet..."

Btrfs and RAID. Speaking of early-stage software, David Woodhouse has posted an initial implementation of RAID5 and RAID6 support for Btrfs. It's not exactly functional yet, but a number of the pieces are in place. Some additional good news is that this work does not involve the addition of another RAID implementation to the kernel; instead, David has moved the MD implementation into common code so that it can be used in both places.

VFAT. Andrew Tridgell continues to work on the VFAT patent workaround patch. He has set up a directory on kernel.org which contains the latest version of the patch; there is also a README file which describes the interoperability problems which have been identified so far. Meanwhile, some developers are pushing for a return to the previous version of the patch, which simply took away the ability to create long file names. That patch removes some useful functionality, but it is also pretty well guaranteed not to cause interoperability problems.

Tridge does not appear to have given up on the long-name-only approach, though. Stay tuned; he may yet come up with a version which interoperates more universally.

Kmemleak. The kmemleak code was merged for 2.6.31; that means that testers are now beginning to post about memory leak reports. At this stage, it seems that kmemleak is still putting out a fair number of false reports - but it is also turning up real leaks. People who are interested in playing with kmemleak should (1) run 2.6.31-rc3 or later, and (2) look at these suggestions for evaluating leak reports posted by kmemleak author Catalin Marinas.


Communicating requirements to kernel developers

By Jonathan Corbet
July 14, 2009

The 2009 kernel summit is planned for October in Tokyo. Over the years, your editor has observed that the discussion on what to discuss at the summit can sometimes be as interesting as the summit itself. Recently, the question of how user-space programmers can communicate requirements to the kernel community was raised. The ensuing discussion was short on definitive answers, but it did begin to clarify a problem in an interesting way.

For the curious, the entire thread can be found in the ksummit-2009-discuss archives. Matthew Garrett started things this way:

I've just run a session at the desktop summit in Gran Canaria on the functionality that userspace developers would like to see from the kernel. Some interesting things came out of it, but one of the major points was that people seemed generally unclear on how they could communicate those requirements to kernel developers. Worth discussing?

Dave Jones's response was instructive:

What exactly is the problem ? They know where linux-kernel@vger.kernel.org is, so why aren't they talking to us there?

To developers who are used to the ways of linux-kernel, and who are well established in that community, a question like this might make sense. If one were to poll developers who do not normally hang out in the kernel community, though, one might get an answer something like this:

The volume on linux-kernel is far too high for ordinary people to cope with.
Even if we could keep up with linux-kernel, the volume is still likely to bury anything we might post there.
Kernel people speak their own language, making it hard to follow discussions, much less participate in them.
If somebody does notice our request, they will probably flame it to a cinder without necessarily taking the time to understand it first.
If they don't flame it, they will probably tell us to send a patch, but we're not kernel developers and thus not in a position to do that.

There can be no doubt that some communications problems can be easily blamed on the requesting side. If a feature request is phrased as a demand, it is unlikely to be received well. Kernel developers are beholden to demands from their employers, but from nobody else; like most other developers, they take a dim view of people who feel entitled to free work just because they want it. A classic example here would be the early Carrier Grade Linux specifications produced by OSDL; they read like a "to do" list handed to the kernel community, even if OSDL eventually claimed that it was not intended that way.

Another problem can be poorly-expressed requirements. Consider the early TALPA proposal, which posted a very clear set of low-level requirements. Unfortunately, they were too low-level, requiring features like the ability to intercept file close operations. Instead, TALPA (now fanotify) needed to express requirements like "we need a clean way to support proprietary malware-scanning software, and this is why." That disconnect set the project back significantly, and could well have killed a project whose developers showed less persistence or less willingness to learn. Clearly expressing requirements at the right level is never an easy task, but it's crucial.

Finally, some ideas just don't make sense in the kernel. Perhaps they cannot be implemented in a way which avoids security problems, does not break other features, and does not create long-term maintenance problems. Or perhaps there are better solutions in user space. A developer who goes to linux-kernel with this kind of request is likely to go away feeling like kernel developers are completely unwilling to listen to reason.

All of the above notwithstanding, there is some recognition that user-space developers have real difficulties in bringing requirements to the kernel community. Kernel developers tend to be busy, focused on their own projects, and not always entirely open to requests from outside the community. There is no mechanism for tracking feature requests, so it is very easy for them to be buried in the flood of email. The tone of discussions can be harsh, even though it truly has improved over the years. And so on.

This is not good. The kernel exists to provide for the needs of user space; if the kernel development community is not hearing what those needs are, it can only fail to satisfy them. So thinking about how to make it easier for user-space developers to communicate their requirements would seem to be worthwhile; chances are that space will be made at the summit for that topic. But there is no need to wait for the summit to start talking about how things could be improved.

Matthew Wilcox suggested the creation of a document on how user-space developers can interact with the kernel community. The idea makes sense (your editor may just try to help there), but this is not a problem which can be solved by documents alone.

James Bottomley described three broad categories of users needing changes to the kernel:

1. Sophisticated developers who can write their own kernel extensions.
2. Users who can get a kernel developer interested in their desired additions.
3. Users who want features that no developers are interested in.

James points out that categories 1 and 2 can be helped with documentation and general outreach. He worries, though, that we have no way to help the third category of users, who are generally left with no way to get the kernel changed to meet their needs.

Ted Ts'o had a different taxonomy which he has put forward as a way to help understand the problem:

Core kernel developers (or those who have access to such people). Core developers have the advantage of a high degree of trust in the community; that allows them to get features into the kernel with a relatively small amount of trouble. They are able to merge code which might well not pass muster if it came from a different source.
Competent, but non-core kernel developers (and, again, people who have access to them). These developers have to work harder to justify their changes, but they are generally in a position to get changes merged as long as the work is good.
Potentially competent developers with "patently bad design taste." Ted suggests that the frank nature of the kernel review process is intended mainly to weed out bad patches from this source.
Users with no access to kernel development expertise, who must thus try to convince somebody else in the community to implement their desired feature for them. Ted divided this category into two subcategories, depending on whether there is an active kernel developer working in the user's area of interest or not.

Ted's thought is that this taxonomy can help users to understand why certain patches and ideas are treated the way they are. It can also be used to help develop ways to reach out to each specific group of users. Certainly the different groups need to hear different messages. One could argue that the existing documentation should be sufficient for people with kernel development skills, but there is relatively little help out there for those who must find a developer to do their work for them.

It is that last group which is most likely to be intimidated by the prospect of walking into linux-kernel and asking for features. The kernel community could really use a person who would take on the task of working with these users, helping them to clarify their requirements, connecting them with the appropriate developers, and tracking requests. The good news is that we do have such a person; the bad news is that it's Andrew Morton, who has one or two other things to do as well. The community would benefit from a person who worked something close to full time on the task of helping users and user-space developers get what they need from the kernel. That sort of position tends to be very hard to fund, though; as a result, it tends to stay vacant.

As was noted at the outset, this conversation did not produce much in the way of concrete conclusions. It is far from complete, though. If not before, it will be resumed in October at the full summit. Needless to say, your editor plans to be there; stay tuned.


Rootless X

By Jonathan Corbet
July 15, 2009

Complexity has long been known to be the enemy of security. A code base which is small and straightforward can be verified, with a reasonable degree of trust, to do what it is supposed to do and only that. As the code becomes more complex and harder to understand, that sort of verification becomes increasingly difficult. For this reason, developers try to separate code requiring privilege from that which doesn't. Any code which can be run in a nonprivileged mode is code which is relatively unlikely to create security problems, and the code which is left can, hopefully, be reviewed to a level sufficient to give the required degree of confidence in its security.

Now consider the X window system. It is a massive body of code with a relatively small development community. Some of the code is truly ancient and hasn't seen much real developer attention in years. As with most projects, some of the code is of higher quality than the rest. The X server performs a complex task - based on a complicated protocol - for almost every Linux user. Sometimes it is even exposed to the full Internet. And the X server runs as root, with full privileges. The actual number of security problems found in X over the years has been relatively small, but it is not hard to believe that it is more a matter of luck and a lack of attackers than the inherent quality of the X code base.

Worries about the X server are not new; that is why there has been discussion of running it in a nonprivileged mode for several years. Much of the work which has been done on display graphics has had this goal in mind. All that notwithstanding, Linux distributions still install X as a setuid program. It just has not been possible to enable a nonprivileged X server to get its job done without opening up the system as a whole.

Those days are just about done. The Moblin project is now claiming that it will be the first distribution to ship with a nonprivileged X server. Where Moblin goes, others will certainly follow.

Some details of how this work was done were posted to the xorg-devel list by Jesse Barnes at the beginning of July. According to Jesse, finishing out this multi-year job was "pretty easy." It seems that the pieces are in place now - at least for some graphics hardware - to the point that a few hours of work got the job done. It seems like an almost anti-climactic end to such a long-term challenge. But, of course, the work which has made this result possible has been ongoing for a long time.

The biggest piece is the kernel mode setting (KMS) code which was merged for 2.6.29. Prior to KMS, the X.org server was charged with finding the hardware and driving it directly from user space. Needless to say, this sort of access requires root privilege, since it can easily be used to compromise the system. KMS (and the associated graphical memory management code) turns graphical hardware into something closer to a normal device with a normal kernel driver - albeit a rather complex and specialized sort of "normal" device. The hardware manipulations requiring privilege have been isolated into a relatively small piece of kernel code; they are now separated from the rather larger body of code implementing the X protocol.

That means that X code accessing the hardware can now be run unprivileged as long as it has the ability to open the appropriate device file. The server must also have the ability to open related device files: input devices, the virtual console, and so on. But that is a problem that has been solved for years; the login process can easily change the ownership of those files so that an unprivileged server can access them but the world as a whole cannot.

What's left is some detail work. Some of the ioctl() calls for direct rendering are currently root-only; Jesse thinks that they can be made generally available in a safe way. There may be some small additions to the driver ABI to allow the final root-only operations to be pushed onto the kernel side. But that's about it.

Of course, user-mode X will currently only work with Intel chipsets, since those are the only ones with full KMS support at this time. Radeon drivers are acquiring that support quickly, though, and may be able to support no-root operation in the relatively near future. That leaves NVIDIA as the usual odd chipset out of the big three; the current Nouveau feature matrix suggests that it will be some time yet before the requisite features are available there.

It may also be a little while until we see no-root support in more general-purpose distributions. Moblin has a relatively narrow focus on Intel hardware, by virtue of the fact that it's still mostly Intel people who are doing the work. Distributors who need to make things work on whatever hardware happens to be present may approach a change of this magnitude with a bit more caution. Still, X without root is clearly in the future, and the near future at that.


A new way to truncate() files

July 15, 2009

This article was contributed by Goldwyn Rodrigues

Changes are happening in the way the virtual filesystem and virtual memory subsystems interact. One of the goals of this work is to close a race existing in page_mkwrite() (called when a previously read-only page table entry (PTE) is about to become writable), namely to make sure that file blocks are properly zero-filled if a truncate operation increases the file size. As a part of improving page_mkwrite(), Jan Kara posted a set of patches. However, these patches introduced a new lock to resolve the problem. Nick Piggin thinks this can be done by using the page lock instead of a new lock. As a first step toward a resolution of the problem, he posted a set of patches to improve the truncate sequence. The new truncate sequence is simpler to understand, flexible in usage, and most important of all, handles errors gracefully.

The truncate() and ftruncate() system calls are used to set a file to the specified size. If the file is larger than the argument passed, the file size is truncated and the part of the file greater than the passed file size is lost. If the current file size is smaller than the passed file size argument, the file size is increased and the area greater than the previous file size is filled with zeros. The file size argument passed cannot be greater than the maximum possible file size on the filesystem.

A user space truncate() call is handled, inside the kernel, by do_sys_truncate(), which is responsible for weeding out all error cases (such as "the inode is a directory" or permission errors). It breaks leases of files locked with flock() and calls do_truncate(). do_truncate(), in turn, creates a new attribute structure with the new length of the file and calls notify_change() with the dentry and the new attributes under inode->i_mutex. notify_change() calls the generic inode_setattr(), either explicitly, or through the filesystem implementation of setattr(). Then, inode_setattr() calls vmtruncate() to set the inode size and unmap the pages mapped beyond the new file size. After unmapping the pages, the associated filesystem's truncate() operation is called to free the disk blocks associated with the file.

According to Nick, this approach has problems:

Big problem with the previous calling sequence: the filesystem is not called until i_size has already changed. This means it is not allowed to fail the call, and also it does not know what the previous i_size was. Also, generic code calling vmtruncate to truncate allocated blocks in case of error had no good way to return a meaningful error (or, for example, atomically handle block deallocation).

Nick's new truncate sequence introduces a way to better communicate error conditions and consolidates the checks which most filesystems currently perform individually. The original intention was to add a new truncate() operation in struct inode_operations which would be called directly for a truncate operation in inode_setattr(). Christoph Hellwig disagreed with the call sequence, stating that the new truncate function should be called from notify_change(), and not from inode_setattr(), which is the default implementation for inode_operations.setattr. Nick felt that clearing ATTR_SIZE before calling the generic setattr is not unusual (discussed later), so he decided to introduce his changes with a flag called new_truncate in struct inode_operations, rather than with a new truncate function. The new_truncate flag indicates that the truncate() function in the inode operations handles the new format. Nick admits that this is a nasty hack when he introduces the variable in inode_operations. However, it will be required until all filesystems transition to the new truncate sequence. Filesystem code which does not implement the new convention will automatically initialize new_truncate to zero, indicating that it has not transitioned yet.

The first patch in the series introduces new functions to facilitate the change. inode_newsize_ok() checks that the intended new file size is within the limits defined by the filesystem and that the file is not a swap file:

    int inode_newsize_ok(struct inode *inode, loff_t offset);

These checks are currently done by individual filesystems. Using this function results in cleanups in individual filesystem code.

The truncate_pagecache() function truncates the inode pages and unmaps the pages in the range beyond the new file size:

    void truncate_pagecache(struct inode *inode, loff_t old, loff_t new);

truncate_pagecache() should ideally be called before the filesystem releases the data blocks associated with the inode. This way the page cache will always be in sync with the on-disk format and the filesystem will not have to deal with situations such as writepage() being called for a page that has had its underlying blocks deallocated.

The vmtruncate() function is consolidated for MMU and no-MMU architectures in mm/truncate.c. However, vmtruncate() is deprecated. Instead, truncate_pagecache() and inode_newsize_ok() introduced in the first patch should be used.

The third patch is the main patch of the series which uses the new truncate operation. It introduces simple_setsize(), which performs the equivalent of vmtruncate(). simple_setsize() is called by inode_setattr() when ATTR_SIZE is passed. So filesystems implementing their own truncate code in setattr must clear ATTR_SIZE before calling the generic inode_setattr().

To follow the new standards of the truncate operation, individual filesystems must implement their own setsize() function, which performs the file size validation checks, truncates the page cache, and truncates the data blocks associated with the inode. Filesystems must not trim off blocks past i_size using vmtruncate(). Instead, they must handle the truncate in the filesystem code using truncate_pagecache(). This creates a better opportunity to catch errors. The inode_operations.new_truncate and inode_operations.truncate fields will go away once all filesystems are converted.

To demonstrate the change, the final patch in the series modifies the ext2 filesystem to use the new truncate interface. The patch introduces ext2_setsize() to set the inode size of the file, truncate the pagecache, and, finally, trim the data blocks on the filesystem. If ATTR_SIZE is set, ext2_setattr() calls ext2_setsize() to perform the truncate; ATTR_SIZE is then cleared so that inode_setattr() does not perform the operations again.

The new truncate patchset has gone through a fair share of review and is pretty likely to get merged. However, it would require the "nasty hack" until all filesystems have transitioned to the new way of truncating files, after which the hack will be removed. The patches are part of the improvements Nick wants to see in the VM layer. Based on the new truncate patches, Nick posted an RFC on how he would close a race condition in page_mkwrite when a file is truncated beyond the current file size. Closing races is a good thing, so expect this work to proceed apace.
