Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.11-rc1, announced by Linus on January 11. This massive patch set includes a new CPU time abstraction, AMD dual-core support, a memory technology device/JFFS update, an ALSA update, some CPU scheduler tweaks, a number of latency-reduction patches, a buddy allocator rework (removal of the bitmap to make life easier for hotplug memory implementations), the unified spinlock initialization patch, SMP support for the ARM architecture, debugfs (which, it seems, is meant to be mounted on /sys/kernel/debug), a big USB update, an ATA-over-Ethernet driver, mmap() support for binary sysfs attributes, some power management work, the big kernel semaphore patch, the four-level page table patch, a VIA PadLock crypto engine driver, a new SKB allocation function, ACPI hotplug support, the full InfiniBand patch set (covered here last November), a big direct rendering manager (DRM) rework, a new and simplified file readahead mechanism, a set of user-mode Linux patches, a big set of input patches, a new set of "sparse" annotations, an NFS update, an iptables update, support for the Fujitsu FR-V architecture, in-inode extended attribute support for ext3, some SELinux scalability improvements, and lots of fixes. See the long-format changelog for the details.

Note that 2.6.11-rc1 breaks on x86-64 NUMA systems.

Linus's BitKeeper repository contains, as of this writing, a fix for the page fault handler security hole, a fix for the x86-64 NUMA problem, and a few other small patches.

The current prepatch from Andrew Morton is 2.6.10-mm2. Recent changes to -mm include multiple AGP support and a number of fixes.

The current 2.4 prepatch is 2.4.29-rc2, released by Marcelo on January 12. The -rc releases include a number of new security fixes and some driver updates.

For 2.2 users, Marc-Christian Petersen has released 2.2.27-rc1 with the latest security fixes.

Kernel development news

Quote of the week

Unfortunately, the stabilization you're talking about was essentially too late; distros had long-since wildly diverged, they had frozen on older releases, and the damage to Linux' reputation was already done. I'm also unaware of major commercial distros (e.g. Red Hat, SuSE) using 2.4.x more recent than 2.4.21 as a baseline, and it's also notable that one of the largest segments of the commercial userbase I see is using a distro kernel based on 2.4.9.

-- William Lee Irwin III

Circular pipes

One of the many changes that slipped quietly into BitKeeper over the last week was a patch from Linus changing how pipes are implemented internally. For a long time, pipes have used a single page to buffer data between the reader and the writer. If a process writes more than one page, it will block until the reader has consumed enough data to allow the rest to fit within the buffer. The 2.6.11 pipe implementation will be rather different.

Pipes now use a circular buffer, as inexpertly shown in the diagram below:

[Circular pipe diagram]

The curbuf pointer (it's an integer index, actually) indicates the first buffer in the array which contains data; nrbufs tells how many buffers contain data. The page structures are allocated when needed, and do not hang around when not in use. Since both readers and writers can manipulate nrbufs, some sort of locking (the pipe semaphore, in this case) is needed to serialize access. The pipe_buffer structure includes length and offset fields, so each entry in the circular buffer can contain less than a full page of data.
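The bookkeeping described above can be sketched in a few lines of C. This is a simplified user-space model, not the kernel's code: the curbuf, nrbufs, offset, and length fields come from the description above, while the struct pipe_ring wrapper, the ring_push() and ring_pop() helpers, and the use of a plain pointer in place of a page structure are assumptions made for illustration. Locking is left out entirely; in the kernel the pipe semaphore would be held around both operations.

```c
#include <assert.h>

#define PIPE_BUFFERS 16  /* assumed ring size; 16 is the default cited in this article */

/* One entry in the ring; it may hold less than a full page of data. */
struct pipe_buffer {
    void *page;           /* stand-in for the page holding the data */
    unsigned int offset;  /* where the valid data starts within the page */
    unsigned int len;     /* number of valid bytes */
};

/* Hypothetical wrapper holding the ring state described in the text. */
struct pipe_ring {
    unsigned int curbuf;  /* index of the first entry that contains data */
    unsigned int nrbufs;  /* number of entries that contain data */
    struct pipe_buffer bufs[PIPE_BUFFERS];
};

/* Writer side: fill the next free entry. Returns 1 on success, 0 if full. */
int ring_push(struct pipe_ring *r, void *page,
              unsigned int offset, unsigned int len)
{
    unsigned int slot;

    assert(r->nrbufs <= PIPE_BUFFERS);
    if (r->nrbufs == PIPE_BUFFERS)
        return 0;  /* a real writer would block here */

    /* The first free entry sits just past the last used one, modulo the ring size. */
    slot = (r->curbuf + r->nrbufs) % PIPE_BUFFERS;
    r->bufs[slot].page = page;
    r->bufs[slot].offset = offset;
    r->bufs[slot].len = len;
    r->nrbufs++;
    return 1;
}

/* Reader side: take the oldest entry. Returns 1 on success, 0 if empty. */
int ring_pop(struct pipe_ring *r, struct pipe_buffer *out)
{
    if (r->nrbufs == 0)
        return 0;  /* a real reader would block here */

    *out = r->bufs[r->curbuf];
    r->curbuf = (r->curbuf + 1) % PIPE_BUFFERS;
    r->nrbufs--;
    return 1;
}
```

Because the writer's slot is always computed from curbuf and nrbufs, both sides modify nrbufs, which is why some form of serialization is needed.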

Linus says that the new implementation gives a "30-90%" improvement in pipe bandwidth, with only a small cost in latency (since pages must be allocated when data passes through the pipe). The performance improvements are entirely attributable to the larger amount of buffering; readers and writers will block less often when passing data through the pipe. It is a way of speeding things up by throwing memory at the problem.

Better pipe performance was not Linus's main purpose in making this change, however; he has a longer-term plan in mind. The mechanism used to implement circular pipes will evolve into a general mechanism for passing data streams through the kernel. Quite a few changes will be required to get there, and there seems to be no hurry.

Among other things, the buffers within the circular structure will gain a reference count, allowing there to be multiple readers or writers. The idea here is to implement a sort of in-kernel tee operation which would let data streams be split without additional copying. The example given by Linus is some sort of video capture device which would feed its data into one of these buffers. A process could obtain data from the buffer and display it in an on-screen window; meanwhile, another process would be capturing the stream and writing it to a file somewhere - perhaps with little or no user-space intervention.

The circular buffers will also gain the usual structure full of method pointers which would allow specific users to change how the basic operations are performed. Once that is in place, two new system calls would be added:

splice(int infd, int outfd);
This call would use a circular buffer to transfer data from infd to outfd, possibly in a zero-copy manner.

tee(int infd, int out1, int out2);
This call would arrange for data from infd to go to both out1 and out2.
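Both calls are proposals only, so there is nothing to call yet. As a rough illustration of the semantics that tee() is meant to provide, the user-space sketch below reads from one descriptor and writes everything to two others. The copy_tee() and write_all() names are hypothetical, and the sketch copies every byte through a user-space buffer, which is exactly the overhead the in-kernel version is meant to avoid.

```c
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t w = write(fd, buf, len);
        if (w < 0)
            return -1;
        buf += w;
        len -= (size_t)w;
    }
    return 0;
}

/*
 * Copy everything readable from infd to both out1 and out2 until
 * end-of-file. Returns the number of bytes duplicated, or -1 on error.
 */
ssize_t copy_tee(int infd, int out1, int out2)
{
    char buf[4096];
    ssize_t total = 0;
    ssize_t n;

    while ((n = read(infd, buf, sizeof buf)) > 0) {
        if (write_all(out1, buf, (size_t)n) < 0 ||
            write_all(out2, buf, (size_t)n) < 0)
            return -1;
        total += n;
    }
    return n < 0 ? -1 : total;
}
```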

Longtime followers of Linux kernel discussions will notice a strong similarity between all of the above and Larry McVoy's splice proposal. Linus's implementation works at a lower level, however, and avoids many of the problems he saw with Larry's approach. Those who are curious about where all this is going may want to look at this explanation from Linus, where he goes into detail and concludes:

I'm clearly enamoured with this concept. I think it's one of those few "RightThing(tm)" that doesn't come along all that often. I don't know of anybody else doing this, and I think it's both useful and clever. If you now prove me wrong, I'll hate you forever ;)

There is a remaining practical issue with the current implementation. No coalescing of data written into a circular buffer is performed. Linus did things that way because he wants to make life easy for high-bandwidth, zero-copy streams using these buffers. To that end, nothing touches a page once it has been added to a buffer. The problem is that, in the worst case, a process writing a single byte at a time to a pipe can consume 16 pages of memory (with the default configuration) to hold 16 bytes worth of data. Linus initially noted that nobody doing single-byte I/O should expect good performance, and suggested that people not do that. It turns out, however, that this behavior breaks a crucial application - highly parallel kernel compiles. So coalescing of writes is likely to be added in the near future.
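The worst case is easy to reproduce in a toy model. The sketch below tracks only how many bytes sit in each slot. The struct ring type, the write_plain() and write_coalesced() helpers, and the 4096-byte page size are all assumptions for illustration, and the coalescing rule shown (append to the newest slot when the data fits) is one plausible approach rather than the patch that will eventually be merged. Readers are left out, so the oldest slot is always slot 0.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 16    /* default ring size cited in this article */
#define PAGE_BYTES 4096  /* assumed page size */

/* Hypothetical model: only the byte count of each used slot is tracked. */
struct ring {
    unsigned int nrbufs;     /* slots in use */
    size_t len[RING_SLOTS];  /* bytes held by each used slot */
};

/* Current behavior: every write takes a fresh slot, however small. */
int write_plain(struct ring *r, size_t n)
{
    assert(r->nrbufs <= RING_SLOTS);
    if (n == 0 || n > PAGE_BYTES || r->nrbufs == RING_SLOTS)
        return -1;  /* a real writer would block when the ring is full */
    r->len[r->nrbufs++] = n;
    return 0;
}

/* With coalescing: append to the newest slot when the data still fits. */
int write_coalesced(struct ring *r, size_t n)
{
    if (n == 0 || n > PAGE_BYTES)
        return -1;
    if (r->nrbufs > 0 && r->len[r->nrbufs - 1] + n <= PAGE_BYTES) {
        r->len[r->nrbufs - 1] += n;
        return 0;
    }
    return write_plain(r, n);
}
```

Sixteen one-byte calls to write_plain() fill all 16 slots, so the seventeenth fails; the same sequence through write_coalesced() leaves a single slot holding 16 bytes. The price is that a page is modified after it has been added to the ring, which is the property Linus wanted to preserve for zero-copy streams.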

Merging the realtime security module

The Linux audio development community has a longstanding problem: many audio applications require very short latencies to avoid losing data, but the Linux kernel makes it hard to get the sort of response times needed. Over time, the audio hackers have developed a solution which works reasonably well for them, and which they would like to see merged into the mainline kernel. There has been strong opposition, however, leaving the audio community feeling, once again, that its needs are being passed over by the kernel developers.

The code in question is the realtime security module, which was covered briefly here last September. This module, when loaded, makes a simple change to the Linux protection mechanism: any process running with a designated group ID is given the CAP_SYS_NICE, CAP_IPC_LOCK, and CAP_SYS_RESOURCE capabilities. Thus, any user who has membership in the special group can raise priorities, lock pages into physical memory, and exceed resource limits. With these capabilities, a suitably aware audio application can ensure that it will be able to respond to events within the required time.
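The policy itself is small enough to sketch. The function below is a user-space model of the decision only, not the module's code: the realtime_caps() name and its argument list are hypothetical, and the real module makes this check from within the security-module hooks rather than on a caller-supplied group list. The capability numbers are copied from the kernel's <linux/capability.h> and given an RT_ prefix to avoid clashing with that header.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Capability numbers as found in <linux/capability.h>. */
#define RT_CAP_IPC_LOCK     14  /* lock pages into physical memory */
#define RT_CAP_SYS_NICE     23  /* raise scheduling priority */
#define RT_CAP_SYS_RESOURCE 24  /* exceed resource limits */

#define RT_CAP_MASK ((1UL << RT_CAP_IPC_LOCK) | \
                     (1UL << RT_CAP_SYS_NICE) | \
                     (1UL << RT_CAP_SYS_RESOURCE))

/*
 * Return the capability bits a process would be granted: all three
 * if it belongs to the designated group, none otherwise.
 */
unsigned long realtime_caps(gid_t rt_gid, const gid_t *groups, size_t ngroups)
{
    size_t i;

    assert(groups != NULL || ngroups == 0);
    for (i = 0; i < ngroups; i++)
        if (groups[i] == rt_gid)
            return RT_CAP_MASK;
    return 0;
}
```

The all-or-nothing grant keyed on a single group ID is the part that critics find ugly: the policy lives in the kernel, and there is no way to hand out only one of the three capabilities.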

A couple of objections have been raised to the inclusion of the realtime module. One is that it is a specialized hack for a specific set of users which has no place in a general-purpose kernel. The GID-based mechanism is seen as being ugly and hard to administer in the long term. A few kernel hackers have been quite vocal in their opinion that, until these issues have been addressed, this module should not be merged. They have been less vocal, however, on just how audio users should satisfy their needs without offending the sensibilities of the kernel community.

Nonetheless, some progress has been made. The memory locking issue has been solved via the new resource limits which were added in 2.6.9. By setting the limits appropriately, a system administrator can allow otherwise unprivileged users to lock a bounded number of pages into physical memory. A bit of PAM configuration work should suffice to deal with that part of the problem.
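An application can check the limit it has been given before trying to lock its buffers. The sketch below uses the standard getrlimit() call with RLIMIT_MEMLOCK; the fits_memlock_limit() helper name is hypothetical, and a real audio application would go on to call mlock() or mlockall() and handle their failure as well.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Report whether `bytes` of locked memory fits under the current
 * soft RLIMIT_MEMLOCK limit.
 * Returns 1 if it fits, 0 if it does not, -1 if the limit cannot be read.
 */
int fits_memlock_limit(size_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;  /* no limit has been set */
    return bytes <= rl.rlim_cur ? 1 : 0;
}
```

On the administration side, the limit would typically be raised for a group of audio users at login time; pam_limits accepts a memlock item in its limits.conf file for this purpose, though the group name and value are site-specific choices.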

The other issue, however, is response time from the CPU scheduler. Ingo Molnar has noted that the kernel's handling of regular "nice" levels is much improved in 2.6.10. Audio hacker Jack O'Quin checked it out and found that things had improved, though the maximum response time was still far worse than can be had by running in the SCHED_FIFO class. The reasons for this behavior are still being investigated; interference from high-priority kernel threads may be part of the problem. Even if the response were adequate, however, raising priorities is still a privileged operation.

That issue could, perhaps, be addressed via yet another resource limit which would allow individual users to raise their priorities within administrator-set bounds. If the remaining response time issues could be addressed, this new limit could be part of the overall solution, though it would take some time for updated utilities to get into the hands of the users who need them.

Another approach which has been mentioned would be to generalize the realtime module to address a wider range of needs. If it could be set up to hand out any set of capabilities to given users or groups, it would at least be useful to more people. It could, for example, replace the current group-based hack which gives access to the "hugetlb" mechanism. It would still be setting policies in the kernel by way of user and group IDs, which is not a popular idea, but it would not be quite the niche tool that it is now. A first pass at such a module has been posted by Olaf Dietsche; it takes an interesting approach by having much of the relevant information stored in the form of group ownership on sysfs attributes.

A more comprehensive solution would be to make capabilities work properly. After all, that is what capabilities are supposed to be for: to allow precisely-defined bits of privilege to be granted in the situations where they are needed. The problems there are that Linux capabilities are currently broken, fixing them is a tricky job that nobody seems to want to take on at the moment, and, in any case, administering a truly capability-based system is an exercise in complexity. Capabilities seem unlikely to be part of the solution anytime soon.

One interesting aspect of the discussion is what has not been mentioned. SELinux should be able to solve this problem; it exists to provide ultimate control over what every user and program can do. Nobody, it would seem, has wanted to see what happens when musicians attempt to administer SELinux. The realtime preemption work has also been strangely absent from the discussion - and from the kernel mailing lists in general.

As of this writing, no real solution seems to have been found. There are enough kernel hackers sympathetic to the needs of audio hackers, however, that some sort of resolution should be possible. Linux should be the ultimate playground for audio developers; it would be a shame if the kernel continued to get in their way. (For more background, see this history of the realtime LSM by Jack O'Quin.)

The abrupt un-exporting of symbols

This seems like a conversation we have seen before: Paul McKenney is asking to have an exported symbol restored for use by a proprietary IBM module. This time around, Paul has submitted a patch requesting that two symbols (files_lock and set_fs_root()) be exported to all modules. It is proving to be a hard sell.

files_lock is a spinlock used within the VFS layer; set_fs_root() is used to change the root directory for (one process's view of) a filesystem. They were used by IBM's MVFS to a novel end: MVFS implements a revision control system internally, and allows each process to see a different revision of the file tree. By using these symbols, MVFS was able to make the filesystem behave differently for each process. With 2.6.9, that worked great, but those symbols are no longer exported in 2.6.10. Paul has asked that they be restored so that the MVFS module can work again.

The exports were removed because the kernel developers feel that no code outside of the VFS layer should be making changes in the filesystem namespace. The tricks that MVFS is performing with set_fs_root() would be better done with bind mounts - in user space. It is also felt that any code using set_fs_root() or files_lock can only be a fundamental part of the kernel, and thus a derived product; there is no way, according to the relevant kernel developers, that a proprietary module can legally use them. For these reasons, the exports were removed, and there is strong resistance to restoring them.

Nobody disagrees with the reasoning behind the change. Not everybody thinks that it was appropriate to remove the symbols with no notice, however. In particular, Linus thinks there was no reason to break things so abruptly:

I'm known for happily breaking binary modules, but I think we should do it only if we have a reason _other_ than "let's break a module".

Andrew Morton also thinks the exports should be restored for a period of time - a position which gained him an accusation of supporting IBM's position as a payback for IBM's funding of OSDL. Despite Linus's and Andrew's position, as of this writing, the exports of those symbols have not been restored.

This whole episode restarted the discussion of what the proper way is to remove deprecated features when there is no unstable kernel series in sight. Andrew proposed the creation of a file (feature-removal-schedule.txt) in the Documentation directory which would list things slated for removal, and the relevant dates. That file has been created; as of this writing it lists devfs and some CPU frequency files in /proc. This file will be helpful for some users, but it probably will not make life easier for people maintaining out-of-tree code; Christoph Hellwig and others have made it clear that they will continue to remove "unneeded" exports without notice as they are identified. Life will continue to be difficult, it seems, for code maintained outside of the mainline tree.

Patches and updates

Filesystems and block I/O

  • Robert Love: inotify.. (January 7, 2005)

Page editor: Jonathan Corbet

Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds