Kernel development

Release status

Kernel release status

The current stable 2.6 release remains 2.6.11.7; it dates back to April 7. A set of patches has been proposed for the .8 release, but there is some debate over a couple of them.

The current 2.6 prepatch remains 2.6.12-rc3; Linus has released no prepatches over the last week. About 100 patches have found their way into his git repository, however; they include a tg3 driver update, a "simple action" capability for the packet scheduler, and various fixes.

There has not been a -mm release since 2.6.12-rc2-mm3 on April 11. Andrew is still getting caught up from his travels and the SCM changes.

The current 2.4 prepatch is 2.4.31-pre1, released by Marcelo on April 25. It consists of a very small set of patches, most of which are x86-64 fixes.

Kernel development news

Andrew Morton at linux.conf.au

The Friday morning linux.conf.au keynote was delivered by Australian expatriate Andrew Morton; his wide-ranging talk touched on many aspects of the kernel development process.

Andrew has brought a different approach to kernel development, and it showed early in the talk. He noted that Linus has often characterized his job as rejecting patches rather than accepting them. Andrew disagrees with that approach. If somebody has gone to the trouble to put together a patch, even a really poorly-done one, there was probably some sort of underlying need which motivated that work. A patch identifies a problem, at least for some users; you can't just reject it or the kernel as a whole will lose out. So Andrew sees his role as helping to get patches into the kernel, rather than taking pride in rejecting them. According to Andrew, anybody who goes to the trouble of submitting a patch deserves a response. If the patch is not merged, the developer is entitled to an explanation of why.

He does not want to have to understand all of those patches himself, however. It's up to the subsystem maintainers to evaluate patches and, eventually, merge them. Andrew's job is to get the maintainers involved. Techniques he can employ include the "troll merge": simply adding the patch to -mm to force the maintainer to react. Asking "dumb questions" on the mailing lists can also help. One way or another, Andrew works to get a response from the relevant maintainers.

Andrew's goal is to bring more professionalism to the kernel development process. He believes that is happening; among other things, he notes, patch traffic now slows down significantly on weekends - that was not always the case. He'd like to settle down the process, and, eventually, hand off pieces of it to others. One such piece, most likely, would be bug tracking. He cautioned, however, that these kernel maintenance tasks are not part-time jobs.

The new development model was revisited; much of what was said will be familiar to LWN readers. He noted that the older process failed one of the kernel's most important customers: the distributors. By getting features merged, tested, and ready for deployment quickly, the new process serves the distributors better. There has, perhaps, been some cost to another set of customers: those who run the mainline kernel on their systems. Andrew will be working hard to increase the stability of the mainline releases to make life easier for that group of users.

Meanwhile, he notes, the developers are shoveling about 10MB of patches into the kernel every month.

The stable 2.6 series (currently at 2.6.11.7) is, according to Andrew, not sure to succeed. He believes that it does not get enough developer attention, and that the bar for patches has been set too high. And it does not address the real problem: that mainline releases have regressions that cause breakage for some users. Really fixing the problems, he says, requires getting the developers to be more careful and to focus more on fixing known bugs. He says the process might yet move to an even/odd release scheme, where even-numbered releases (2.6.14, say) would be limited to bug fixes.

On testing: Andrew notes that, while the development process is highly dependent on a large community of testers, it has no real way of rewarding them for their work. He will look into acknowledging testers in the kernel changelogs; if you helped to find a bug, your name can appear alongside that of the developer who fixed it.

On the BitKeeper front, Andrew stated that he was never entirely happy with the decision to use that tool. It imposed an opportunity cost: had the kernel hackers gone off three years ago to build the source code management system they really needed, they would have something quite nice by now. He noted that version control appears to be one of those problems which drives developers crazy, and that's a problem. If you depend on a tool with insane developers, things will "end in tears." Now he's keeping his head down and waiting to see how the whole thing settles out.

Finally, he noted that many developers who think they need a source code management system really don't. If your real purpose is to keep a set of patches in sync with an evolving mainline kernel - which is the case for many developers - then a tool like quilt makes more sense.

Supporting RDMA on Linux

RDMA (remote direct memory access) is an attempt to extend the DMA mechanism to a networked environment. Using RDMA, an application can quickly transfer the contents of a memory buffer to a buffer on a remote system. On high-speed, local-area networks, RDMA transfers are intended to be significantly faster than transfers done with the regular socket interface. Not everybody likes the RDMA way of doing things, but it exists regardless, and some users expect to see it supported by Linux. Implementations exist for InfiniBand and a number of high-speed Ethernet adaptors.

Since the goals of RDMA include speed and low CPU overhead, implementations attempt to bypass as much kernel processing as possible. Typically, they simply pass the address of a user-space buffer directly to the hardware, and expect that hardware to do the rest. Drivers which need to make user-space memory available to their hardware will call get_user_pages(), which achieves two useful things: it pins the pages into physical memory, and it fills in an array of page structure pointers from which the driver can obtain the physical addresses it needs. The current RDMA implementations use this approach, but they have run into a problem: get_user_pages() was never designed for the usage patterns seen with RDMA.

The typical driver which calls get_user_pages() keeps the pages pinned for a very short period of time. Often, the pages will be released before the driver returns to user space. Sometimes, usually when asynchronous I/O is used, the release of the pages will be delayed for a short period, but only as long as it takes the I/O operation to complete. The problem is that RDMA operations do not "complete" in this manner. An RDMA user can reasonably set up a buffer, pass a descriptor to a remote system, and expect data to show up in the buffer sometime next week. The whole idea is to do the relatively expensive buffer setup once, then be able to transfer the (changing) contents of that buffer an arbitrary number of times. So pages pinned by the driver can remain pinned for a very long time.

Several problems come up in this scenario. get_user_pages() does not do any sort of privilege checking or resource accounting for the pages it pins; it's supposed to be a short-term operation. So a hostile application could use an RDMA interface to lock down large amounts of memory indefinitely, effectively shutting down the system. There is no mechanism for notifying the driver if the process owning the pages exits, so cleanup can be a problem. There are also interactions with the virtual memory system to worry about: if the process forks (causing its data pages to be marked copy-on-write) and writes to a pinned page, it will get a new copy of that page and will become disconnected from its pinned buffer.
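The copy-on-write hazard can be observed from user space with a small experiment. The sketch below is only an analogy: the function name cow_split_after_fork() is invented for illustration, and the child process stands in for a device that still refers to the original page, since ordinary programs cannot see physical pages directly. After fork(), the parent writes to a private buffer and the child checks whether it still sees the old contents.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if the child still sees the old contents after the parent
 * writes to the buffer (the two processes no longer share the page),
 * 0 if the child sees the parent's write, and -1 on error.
 */
int cow_split_after_fork(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int go_pipe[2];
    int status, wrote, result = -1;
    char *buf;
    pid_t pid;

    assert(page > 0);
    buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    strcpy(buf, "before fork");

    if (pipe(go_pipe) != 0) {
        munmap(buf, (size_t)page);
        return -1;
    }

    pid = fork();
    if (pid == 0) {
        /* Child: wait until the parent has written, then look. */
        char go;

        close(go_pipe[1]);
        if (read(go_pipe[0], &go, 1) != 1)
            _exit(2);
        _exit(strcmp(buf, "before fork") == 0 ? 1 : 0);
    }

    close(go_pipe[0]);
    if (pid > 0) {
        /* Parent: this write is what triggers the copy. */
        strcpy(buf, "after fork");
        wrote = (write(go_pipe[1], "x", 1) == 1);
        close(go_pipe[1]);
        if (waitpid(pid, &status, 0) == pid && wrote &&
            WIFEXITED(status) && WEXITSTATUS(status) <= 1)
            result = WEXITSTATUS(status);
    } else {
        close(go_pipe[1]);
    }
    munmap(buf, (size_t)page);
    return result;
}
```

The child keeps seeing the pre-fork data, which is the position an RDMA adapter would be in if it kept using the page that was pinned before the fork.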

Various approaches to solving these problems have been discussed. The resource accounting issues can be partially solved by requiring the process to lock the pages itself (using mlock()) before setting them up for RDMA; that will bring the normal kernel resource limits into play. There are still potential problems if the process is allowed to unlock the pages while the RDMA buffer still exists, however, so some changes would have to be made to prevent that case. Current implementations have dealt with the process exit issue by setting up a char device as the control interface for the RDMA buffer; when the device is closed, all RDMA structures are torn down. The copy-on-write problem can be addressed by forcing RDMA buffers to be in their own virtual memory area (VMA) and setting the VM_DONTCOPY flag on that VMA, preventing the pages from being made available to any child processes. This approach would require that RDMA buffers occupy whole pages by themselves. Then there are little issues like what happens when the process creates overlapping RDMA buffers. The whole thing gets a little complicated.
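From the application's side, the mlock() requirement and the whole-page restriction might look something like the following sketch. The structure and function names (locked_buf, locked_buf_alloc(), and so on) are made up for this example and do not come from any RDMA implementation; only mmap(), mlock(), munlock(), and munmap() are real interfaces. Because the buffer is locked with mlock(), the process's locked-memory resource limit applies, so the call can fail for an unprivileged process asking for too much.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for a buffer intended for RDMA use. */
struct locked_buf {
    void *addr;
    size_t len;     /* always a whole number of pages */
};

/* Round a length up so that the buffer occupies whole pages. */
size_t round_up_to_pages(size_t len, size_t page_size)
{
    assert(page_size > 0);
    return (len + page_size - 1) / page_size * page_size;
}

/*
 * Allocate a page-aligned buffer and lock it into memory.
 * Returns 0 on success or a negative errno value on failure
 * (for example -ENOMEM or -EPERM when the locked-memory limit
 * is exceeded).
 */
int locked_buf_alloc(struct locked_buf *b, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    int err;

    b->len = round_up_to_pages(len, page);
    b->addr = mmap(NULL, b->len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (b->addr == MAP_FAILED) {
        err = errno;
        b->addr = NULL;
        return -err;
    }
    if (mlock(b->addr, b->len) != 0) {
        err = errno;
        munmap(b->addr, b->len);
        b->addr = NULL;
        return -err;
    }
    return 0;
}

/* Unlock and release the buffer. */
void locked_buf_free(struct locked_buf *b)
{
    if (b->addr != NULL) {
        munlock(b->addr, b->len);
        munmap(b->addr, b->len);
        b->addr = NULL;
    }
}
```

Nothing in this sketch stops the process from calling munlock() while the hardware is still using the buffer, which is exactly the gap noted above.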

All of this can clearly be patched together, but the result is inelegant at best and grows more complicated with each fix. So an entirely different approach has been proposed by David Addison. This technique does away with the need to pin RDMA buffers entirely, but would, instead, require network drivers to become rather more aware of how the virtual memory subsystem works.

David's patch assumes that the network interface device contains a simple memory management unit of its own, and can deal with its own paging details. This assumption turns out to be true for a number of contemporary high-speed cards. These cards can translate addresses and properly ask for help if they need to access a page which is not currently resident in memory. Thus, when using this sort of card, RDMA buffers can be set up without the need to pin them in memory; the hardware will cause them to be faulted in when the time comes.

Needless to say, the hardware will need a considerable amount of help in this process; it cannot be expected to work with the host system's page tables, cause page faults to happen on its own, etc. So the card's MMU must be loaded with a minimal set of page mappings which describe the RDMA buffer(s), and those mappings must be kept in sync as things change on the system. With that in place, the card can perform DMA to resident pages, and ask the driver for help with the rest.

The device driver can load the initial page tables, but it will need help from the kernel to know when the host system's page tables change. To that end, David's patch defines a structure with a new set of hooks into the virtual memory subsystem:

typedef struct ioproc_ops {
    struct ioproc_ops *next;
    void *arg;

    void (*release)(void *arg, struct mm_struct *mm);
    void (*sync_range)(void *arg, struct vm_area_struct *vma, 
                       unsigned long start, unsigned long end);
    void (*invalidate_range)(void *arg, struct vm_area_struct *vma, 
                             unsigned long start, unsigned long end);
    void (*update_range)(void *arg, struct vm_area_struct *vma, 
                         unsigned long start, unsigned long end);
    void (*change_protection)(void *arg, struct vm_area_struct *vma, 
                              unsigned long start, unsigned long end, 
                              pgprot_t newprot);
    void (*sync_page)(void *arg, struct vm_area_struct *vma, 
                      unsigned long address);
    void (*invalidate_page)(void *arg, struct vm_area_struct *vma, 
                            unsigned long address);
    void (*update_page)(void *arg, struct vm_area_struct *vma, 
                        unsigned long address);
} ioproc_ops_t;

An interested driver can fill in one of these structures with its methods, then attach it to a given process's mm_struct structure with a call to ioproc_register_ops(). Thereafter, calls to those functions will be made whenever things change.

The release() method will be called when the process exits; it allows the driver to perform a full cleanup. The sync_range() and sync_page() methods indicate that the given page(s) have been flushed to disk; this tells the driver that, should the interface modify those pages, they must be marked dirty again. invalidate_range() and invalidate_page() inform the driver that the given page(s) are no longer valid: they have been swapped out or unmapped. Calls to update_range() and update_page() happen when a valid page table entry is written, that is, when a page is brought in, mapped, etc. The change_protection() function is called when page protections are changed.
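Since the patch itself contains no user of these hooks, the following user-space mock is one guess at how a driver might keep a card's page table in sync. Almost everything here is an assumption made for illustration: the stand-in mm_struct and vm_area_struct, the trimmed-down hook structure, the signature of ioproc_register_ops() (the article does not show it), the mock_* functions that play the role of the VM, the treatment of "end" as exclusive, and the one-byte-per-page "card MMU" table.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types; this is a user-space mock. */
struct ioproc_ops;
struct mm_struct { struct ioproc_ops *ioproc_ops; };
struct vm_area_struct { struct mm_struct *vm_mm; };

/* Trimmed copy of the hook structure: only three of the methods. */
struct ioproc_ops {
    struct ioproc_ops *next;
    void *arg;
    void (*release)(void *arg, struct mm_struct *mm);
    void (*invalidate_range)(void *arg, struct vm_area_struct *vma,
                             unsigned long start, unsigned long end);
    void (*update_range)(void *arg, struct vm_area_struct *vma,
                         unsigned long start, unsigned long end);
};

/* Assumed signature: push the ops onto a per-mm list. */
void ioproc_register_ops(struct mm_struct *mm, struct ioproc_ops *ops)
{
    assert(mm != NULL && ops != NULL);
    ops->next = mm->ioproc_ops;
    mm->ioproc_ops = ops;
}

/* What the VM might do when pages are unmapped or swapped out. */
void mock_vm_invalidate_range(struct vm_area_struct *vma,
                              unsigned long start, unsigned long end)
{
    struct ioproc_ops *op;

    for (op = vma->vm_mm->ioproc_ops; op != NULL; op = op->next)
        if (op->invalidate_range)
            op->invalidate_range(op->arg, vma, start, end);
}

/* What the VM might do when valid page table entries are written. */
void mock_vm_update_range(struct vm_area_struct *vma,
                          unsigned long start, unsigned long end)
{
    struct ioproc_ops *op;

    for (op = vma->vm_mm->ioproc_ops; op != NULL; op = op->next)
        if (op->update_range)
            op->update_range(op->arg, vma, start, end);
}

/* Hypothetical card: one "mapping loaded" flag per buffer page. */
#define CARD_PAGES      16
#define MOCK_PAGE_SHIFT 12

struct mock_card {
    unsigned long base;               /* start address of the RDMA buffer */
    unsigned char valid[CARD_PAGES];  /* 1 if the card may DMA to the page */
};

/* Set the flag for every buffer page in [start, end). */
static void card_set(struct mock_card *card, unsigned long start,
                     unsigned long end, unsigned char value)
{
    unsigned long addr, idx;

    for (addr = start; addr < end; addr += 1UL << MOCK_PAGE_SHIFT) {
        if (addr < card->base)
            continue;
        idx = (addr - card->base) >> MOCK_PAGE_SHIFT;
        if (idx < CARD_PAGES)
            card->valid[idx] = value;
    }
}

static void card_invalidate_range(void *arg, struct vm_area_struct *vma,
                                  unsigned long start, unsigned long end)
{
    (void)vma;
    card_set(arg, start, end, 0);
}

static void card_update_range(void *arg, struct vm_area_struct *vma,
                              unsigned long start, unsigned long end)
{
    (void)vma;
    card_set(arg, start, end, 1);
}

/* Fill in the methods and attach them to the process's mm. */
void mock_card_attach(struct mock_card *card, struct ioproc_ops *ops,
                      struct mm_struct *mm)
{
    ops->arg = card;
    ops->release = NULL;
    ops->invalidate_range = card_invalidate_range;
    ops->update_range = card_update_range;
    ioproc_register_ops(mm, ops);
}
```

In this mock, a page whose flag is clear is one the card would have to ask the driver about before touching it; a real driver would also need the sync, protection-change, and release methods.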

The patch has already, apparently, been looked over by Andrew Morton and Andrea Arcangeli, so one might assume that there would not be a great many show stoppers there. The comments posted so far have had to do mostly with coding style, though one poster noted that it might make more sense to attach the hooks to the VMA structure, rather than the top-level memory management structure. Unfortunately, the patch does not include any code which actually uses the proposed hooks, making it harder to see how a driver might employ them. Meanwhile, conversations continue on how an interface using page pinning could be made to work. A real solution may be some time yet in coming.

FUSE and private namespaces

Two weeks ago, we looked at the opposition to FUSE, or, more specifically, to the strange filesystem semantics it implements. FUSE overrides the VFS permission checking code to establish its own set of rules; the intent is to keep users (even root) from accessing each other's private filesystems. Few people dispute the goal, but the approach that was used failed to please.

FUSE hacker Miklos Szeredi has tried to address the concerns with a new patch implementing "private mounts." The patch creates a new mount flag (MNT_PRIVATE); if that flag is set, then only processes belonging to the owner of the mount can see the mounted filesystem at all. To all other processes on the system, these private mounts would be entirely invisible. With this change in place, the permission checking change is no longer needed.

Unfortunately, nobody likes this idea either. This patch creates a different set of filesystem semantics; in this case, setuid programs run by a user who has private mounts will see a different filesystem than any other process. The filesystem hackers do not wish to see namespaces which change in surprising ways.

So what is the solution here? Linux does allow for different processes to have different views of the filesystem ("namespaces"). The namespace mechanism could be brought into play to hide FUSE mounts. The problem is that namespaces were never really meant to be shared across the system. A namespace is a process attribute, like the controlling terminal; it is inherited by child processes, but there is no mechanism for passing a namespace to a process which has not inherited it. Users would like to mount their private filesystems and have them available to all of their processes on the system, so having those filesystems in a namespace which is only available to one process tree does not solve the problem.

As it turns out, there is one way to access namespaces outside of the creating process tree. Jamie Lokier noticed that each process's root directory is accessible via /proc/pid/root. A new process can be put into another process's namespace simply by setting its root with chroot(). If all works as it seems it should, a user-space solution can be envisioned: write a privileged daemon process which can create namespaces and, using file descriptor passing, hand them to interested processes. Those processes can then chroot() into that namespace. chroot() is a privileged operation, but the code to handle the user side of this operation could be hidden within a PAM module and made completely invisible.
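The process side of that idea might be sketched roughly as follows. This is not code from the discussion: the helper names proc_root_path() and enter_root_of() are invented here, and the sketch takes the simpler route of naming /proc/pid/root by path rather than receiving a file descriptor from a daemon. Whether this really lands the caller in the other namespace is, as noted above, still to be confirmed.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build "/proc/<pid>/root" in buf.
 * Returns the length of the path, or -1 if buf is too small.
 */
int proc_root_path(pid_t pid, char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "/proc/%ld/root", (long)pid);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}

/*
 * Hypothetical helper: make the root directory of process "pid" our own.
 * chroot() is a privileged operation, so this returns a negative errno
 * value (typically -EPERM) for an ordinary user. Returns 0 on success.
 */
int enter_root_of(pid_t pid)
{
    char path[64];

    if (proc_root_path(pid, path, sizeof(path)) < 0)
        return -ENAMETOOLONG;
    if (chroot(path) != 0)
        return -errno;
    /* Move the working directory inside the new root as well. */
    if (chdir("/") != 0)
        return -errno;
    return 0;
}
```

A PAM module along the lines described above would presumably perform a call like enter_root_of() with privilege during login, then drop back to the user's identity.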

All that's left is for somebody to actually code this solution. At that point, a glitch or two could come up, but they should be easily fixed with small patches. So there might just be an answer to the FUSE problem after all.

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds