Brief items
The current 2.6 prepatch is 2.6.11-rc1,
announced by Linus on
January 11. This massive patch set includes a new CPU time
abstraction, AMD dual-core support, a memory technology device/JFFS update,
an ALSA update, some CPU scheduler tweaks, a number of latency-reduction
patches, a buddy allocator rework (removal of the bitmap to make life
easier for hotplug memory implementations), the
unified spinlock initialization patch, SMP
support for the ARM architecture,
debugfs
(which, it seems, is meant to be mounted on
/sys/kernel/debug), a
big USB update, an ATA-over-Ethernet driver,
mmap() support for
binary sysfs attributes, some power management work, the
big kernel semaphore patch, the
four-level page table patch, a VIA PadLock
crypto engine driver, a new SKB allocation function, ACPI hotplug support,
the full InfiniBand patch set (covered here
last November), a big direct rendering manager
(DRM) rework, a new and simplified file readahead mechanism, a set of
user-mode Linux patches, a big set of input patches, a new set of "sparse"
annotations, an NFS update, an iptables update, support for the Fujitsu
FR-V architecture, in-inode extended attribute support for ext3, some
SELinux scalability improvements, and lots of fixes. See the
long-format changelog for the details.
Note that 2.6.11-rc1 breaks on x86-64 NUMA systems.
Linus's BitKeeper repository contains, as of this writing, a fix for the page fault handler security hole, a fix for
the x86-64 NUMA problem, and a few other small patches.
The current prepatch from Andrew Morton is 2.6.10-mm2. Recent changes to -mm include
multiple AGP support and a number of fixes.
The current 2.4 prepatch is 2.4.29-rc2, released by Marcelo on January 12. The
-rc releases include a number of new security fixes and some driver
updates.
For 2.2 users, Marc-Christian Petersen has released 2.2.27-rc1 with the latest security fixes.
Kernel development news
Unfortunately, the stabilization you're talking about was
essentially too late; distros had long-since wildly diverged, they
had frozen on older releases, and the damage to Linux' reputation
was already done. I'm also unaware of major commercial distros
(e.g. Red Hat, SuSE) using 2.4.x more recent than 2.4.21 as a
baseline, and it's also notable that one of the largest segments of
the commercial userbase I see is using a distro kernel based on
2.4.9.
-- William Lee Irwin III
One of the many changes slipped quietly into BitKeeper over the last week
was
this patch from Linus changing how
pipes are implemented internally. For a long time, pipes have used a
single page to buffer data between the reader and the writer. If a process
writes more than one page, it will block until the reader has consumed
enough data to allow the rest to fit within the buffer. The 2.6.11 pipe
implementation will be rather different.
Pipes now use a circular buffer, as inexpertly shown in the diagram below:
The curbuf pointer (it's an integer index, actually) indicates
the first buffer in the array which contains data; nrbufs tells
how many buffers contain data. The page structures are allocated
when needed, and do not hang around when not in use. Since both readers
and writers can manipulate nrbufs, some sort of locking (the pipe
semaphore, in this case) is needed to serialize access. The
pipe_buffer structure includes length and offset fields, so each
entry in the circular buffer can contain less than a full page of data.
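The bookkeeping described above can be modeled in user space. What follows is a minimal sketch, not the kernel's code: the names `pipe_ring`, `ring_write()`, `ring_read()`, `RING_SLOTS`, and `RING_PAGE_SIZE` are invented for illustration, the 4096-byte page size is an assumption, the pages are embedded in the structure rather than allocated on demand, and the locking done by the pipe semaphore is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS     16    /* hypothetical name; 16 slots matches the default */
#define RING_PAGE_SIZE 4096  /* assumed page size */

/* One slot: a stand-in "page" plus the offset and length of its valid data. */
struct pipe_buffer {
    unsigned char page[RING_PAGE_SIZE];
    unsigned int offset;
    unsigned int len;
};

struct pipe_ring {
    unsigned int curbuf;   /* index of the first slot that contains data */
    unsigned int nrbufs;   /* number of slots that contain data */
    struct pipe_buffer bufs[RING_SLOTS];
};

/* Store one write in a fresh slot (no coalescing).
 * Returns the bytes stored, or 0 if the ring is full or len is 0;
 * a real pipe would block the writer instead of returning 0. */
static size_t ring_write(struct pipe_ring *r, const void *data, size_t len)
{
    assert(r->nrbufs <= RING_SLOTS);
    if (len == 0 || r->nrbufs == RING_SLOTS)
        return 0;
    if (len > RING_PAGE_SIZE)
        len = RING_PAGE_SIZE;

    struct pipe_buffer *b = &r->bufs[(r->curbuf + r->nrbufs) % RING_SLOTS];
    memcpy(b->page, data, len);
    b->offset = 0;
    b->len = (unsigned int)len;
    r->nrbufs++;
    return len;
}

/* Read up to len bytes from the first slot holding data.
 * The slot is released once it has been fully consumed. */
static size_t ring_read(struct pipe_ring *r, void *out, size_t len)
{
    if (r->nrbufs == 0)
        return 0;

    struct pipe_buffer *b = &r->bufs[r->curbuf];
    if (len > b->len)
        len = b->len;
    memcpy(out, b->page + b->offset, len);
    b->offset += (unsigned int)len;
    b->len -= (unsigned int)len;

    if (b->len == 0) {
        r->curbuf = (r->curbuf + 1) % RING_SLOTS;
        r->nrbufs--;
    }
    return len;
}
```

A partial read leaves the slot in place with an advanced offset, which is why each entry carries its own offset and length rather than assuming a full page.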
Linus says that the new implementation
gives a "30-90%" improvement in pipe bandwidth, with only a small cost in
latency (since pages must be allocated when data passes through the pipe).
The performance improvements are entirely attributable to the larger amount
of buffering; readers and writers will block less often when passing data
through the pipe. It is a way of speeding things up by throwing memory at
the problem.
Better pipe performance was not Linus's main purpose in making this change,
however; he has a longer-term plan in mind. The mechanism used to
implement circular pipes will evolve into a general mechanism for passing
data streams through the kernel. Quite a few changes will be required to
get there, and there seems to be no hurry, but there is clearly a long-term
goal in mind.
Among other things, the buffers within the circular structure will gain a
reference count, allowing there to be multiple readers or writers. The
idea here is to implement a sort of in-kernel tee operation which
would let data streams be split without additional copying. The example
given by Linus is some sort of video capture device which would feed its
data into one of these buffers. A process could obtain data from the
buffer and display it in an on-screen window; meanwhile, another process
would be capturing the stream and writing it to a file somewhere - perhaps
with little or no user-space intervention.
The circular buffers will also gain the usual structure full of method
pointers which would allow specific users to change how the basic
operations are performed. Once that is in place, two new system calls
would be added:
- splice(int infd, int outfd);
  This call would use a circular buffer to transfer data from
  infd to outfd, possibly in a zero-copy manner.
- tee(int infd, int out1, int out2);
  Arranges for data from infd to go to both out1 and
  out2.
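To make the intended semantics of the tee operation concrete, here is a user-space emulation built on ordinary read() and write(). It is only a sketch: `emulated_tee()` and `write_all()` are hypothetical helpers, and every byte is copied through a user-space buffer, which is exactly the overhead the proposed in-kernel calls are meant to avoid.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical copy-based stand-in for the proposed tee():
 * duplicate everything read from infd onto out1 and out2 until EOF.
 * Returns the number of bytes duplicated, or -1 on error. */
static long emulated_tee(int infd, int out1, int out2)
{
    char buf[4096];
    long total = 0;

    assert(infd >= 0 && out1 >= 0 && out2 >= 0);
    for (;;) {
        ssize_t n = read(infd, buf, sizeof buf);
        if (n == 0)
            return total;          /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (write_all(out1, buf, (size_t)n) < 0 ||
            write_all(out2, buf, (size_t)n) < 0)
            return -1;
        total += n;
    }
}
```

Note that this version stalls both outputs whenever one of them blocks; how the real interface would handle a slow consumer is not something the proposal above spells out.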
Longtime followers of Linux kernel discussions will notice a strong
similarity between all of the above and Larry McVoy's splice proposal. Linus's
implementation works at a lower level,
however, and avoids many of the problems he saw with Larry's approach.
Those who are curious about where all this is going may want to look at this explanation from Linus, where he goes
into detail and concludes:
I'm clearly enamoured with this concept. I think it's one of those
few "RightThing(tm)" that doesn't come along all that often. I
don't know of anybody else doing this, and I think it's both useful
and clever. If you now prove me wrong, I'll hate you forever ;)
There is a remaining practical issue with the current implementation. No
coalescing of data written into a circular buffer is performed. Linus did
things that way because he wants to make life easy for high-bandwidth,
zero-copy streams using these buffers. To that end, nothing touches a page
once it has been added to a buffer. The problem is that, in the worst case, a
process writing a single byte at a time to a pipe can consume 16 pages of
memory (with the default configuration) to hold 16 bytes worth of data.
Linus initially noted that nobody doing single-byte I/O should expect good
performance, and suggested that people not do that. It turns out, however,
that this behavior breaks a crucial
application - highly parallel kernel compiles. So coalescing of writes
is likely to be added in the near future.
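The worst case is easy to quantify. The sketch below compares the pages pinned when every write takes its own buffer against a hypothetical coalescing scheme in which small writes are packed into partly filled pages. The function names and the 4096-byte page size are assumptions made for illustration, and the coalescing model is deliberately simplistic (it lets a write span two pages); it is not a description of whatever scheme eventually gets merged.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS     16    /* default number of buffers in the pipe */
#define RING_PAGE_SIZE 4096  /* assumed page size */

/* Every write consumes a whole slot, so the pipe fills after 16 writes. */
static size_t pages_without_coalescing(size_t nwrites)
{
    return nwrites < RING_SLOTS ? nwrites : RING_SLOTS;
}

/* Hypothetical coalescing: bytes are packed densely into pages. */
static size_t pages_with_coalescing(size_t nwrites, size_t write_size)
{
    assert(write_size > 0 && write_size <= RING_PAGE_SIZE);
    size_t bytes = nwrites * write_size;
    size_t pages = (bytes + RING_PAGE_SIZE - 1) / RING_PAGE_SIZE;
    return pages < RING_SLOTS ? pages : RING_SLOTS;
}
```

With these assumptions, sixteen one-byte writes pin 16 pages (65536 bytes) without coalescing and a single page with it.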
The Linux audio development community has a longstanding problem: many
audio applications require very short latencies to avoid losing data, but
the Linux kernel makes it hard to get the sort of response times needed.
Over time, the audio hackers have developed a solution which works
reasonably well for them, and which they would like to see merged into the
mainline kernel. There has been strong opposition, however, leaving the
audio community feeling, once again, that its needs are being passed over by
the kernel developers.
The code in question is the realtime security module, which was covered briefly here last September. This
module, when loaded, makes a simple change to the Linux protection
mechanism: any process running with a designated group ID is given the
CAP_SYS_NICE, CAP_IPC_LOCK, and CAP_SYS_RESOURCE
capabilities. Thus, any user who has membership in the special group can
raise priorities, lock pages into physical memory, and exceed resource
limits. With these capabilities, a suitably aware audio application can
ensure that it will be able to respond to events within the required time.
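For readers who have not written such an application, this is roughly what "suitably aware" means in practice. The sketch uses the standard sched_setscheduler() and mlockall() calls; the wrapper name `enter_realtime()` is hypothetical, and nothing here is taken from the realtime module itself, which only decides whether these calls are permitted.

```c
#include <errno.h>
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: switch the calling process to SCHED_FIFO at the
 * given priority and lock its memory. Returns 0 on success or an errno
 * value (typically EPERM when the needed capabilities are missing). */
static int enter_realtime(int priority)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);
    struct sched_param sp;

    /* Reject out-of-range priorities before asking the kernel. */
    if (lo < 0 || hi < 0 || priority < lo || priority > hi)
        return EINVAL;

    memset(&sp, 0, sizeof sp);
    sp.sched_priority = priority;

    /* Requires CAP_SYS_NICE. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return errno;

    /* Requires CAP_IPC_LOCK, or a sufficiently large memory-lock limit. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return errno;

    return 0;
}
```

An unprivileged process that is not in the designated group would normally see EPERM from the first call.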
A couple of objections have been raised to the inclusion of the realtime
module. One is that it is a specialized hack for a specific set of users
which has no place in a general-purpose kernel. The GID-based mechanism is
seen as being ugly and hard to administer in the long term. A few kernel
hackers have been quite vocal in their opinion that, until these issues
have been addressed, this module should not be merged. They have been less
vocal, however, on just how audio users should satisfy their needs without
offending the sensibilities of the kernel community.
Nonetheless, some progress has been made. The memory locking issue has
been solved via the new resource limits which were added in 2.6.9. By
setting the limits appropriately, a system administrator can allow
otherwise unprivileged users to lock a bounded number of pages into
physical memory. A bit of PAM configuration work should suffice to deal
with that part of the problem.
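From the application's side, the limit is visible through getrlimit() with RLIMIT_MEMLOCK. The following sketch checks the soft limit before calling mlock(); the helper names are hypothetical, and the up-front check is a conservative choice made for this example (a process holding CAP_IPC_LOCK is not bound by the limit, but the sketch refuses anyway).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Fetch the soft memory-lock limit. Returns 0 on success, errno otherwise. */
static int memlock_limit(rlim_t *out)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return errno;
    *out = rl.rlim_cur;
    return 0;
}

/* Hypothetical helper: lock len bytes at buf into physical memory,
 * refusing up front if the request exceeds the soft limit.
 * Returns 0 on success or an errno value. */
static int lock_audio_buffer(void *buf, size_t len)
{
    rlim_t limit;
    int err;

    assert(buf != NULL && len > 0);
    err = memlock_limit(&limit);
    if (err != 0)
        return err;
    if (limit != RLIM_INFINITY && (rlim_t)len > limit)
        return ENOMEM;
    if (mlock(buf, len) != 0)
        return errno;
    return 0;
}
```

The administrator's side of the arrangement, raising the limit for the relevant users, is the PAM configuration work mentioned above.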
The other issue, however, is response time from the CPU scheduler. Ingo
Molnar has noted that the kernel's handling
of regular "nice" levels is much improved in 2.6.10. Audio hacker Jack
O'Quin checked it out and found that things
had improved, though the maximum response time was still far worse than can
be had by running in the SCHED_FIFO class. The reasons for this
behavior are still being investigated; interference from high-priority
kernel threads may be part of the problem. Even if the response
were adequate, however, raising priorities is still a privileged operation.
That issue could, perhaps, be addressed via yet another resource limit
which would allow individual users to raise their priorities within an
administrator-set range. If the remaining response time issues could be
addressed, this new limit could be part of the overall solution, though it
would take some time for updated utilities to get into the hands of the
users who need them.
Another approach which has been mentioned would be to generalize the
realtime module to address a wider range of needs. If it could be set up
to hand out any set of capabilities to given users or groups, it would at
least be useful to more people. It could, for example, replace the current group-based hack which gives access
to the "hugetlb" mechanism. It would still be setting policies in the
kernel by way of user and group IDs, which is not a popular idea, but it
would not be quite the niche tool that it is now. A first pass at such a
module has been posted by Olaf Dietsche; it
takes an interesting approach by having much of the relevant information
stored in the form of group ownership on sysfs attributes.
A more comprehensive solution would be to make capabilities work properly.
After all, that is what capabilities are supposed to be for: to allow
precisely-defined bits of privilege to be granted in the situations where
they are needed. The problems there are that Linux capabilities are currently
broken, fixing them is a tricky job that nobody seems to want to take
on at the moment, and, in any case, administering a truly capability-based
system is an exercise in complexity. Capabilities seem unlikely to be part
of the solution anytime soon.
One interesting aspect of the discussion is what has not been
mentioned. SELinux should be able to solve this problem; it exists to
provide ultimate control over what every user and program can do. Nobody,
however, has wanted to see what happens when musicians attempt to
administer SELinux, it would seem. The realtime preemption work has also
been strangely absent from the discussion - and from the kernel mailing
lists in general.
As of this writing, no real solution seems to have been found. There are
enough kernel hackers sympathetic to the needs of audio hackers, however,
that some sort of resolution should be possible. Linux should be the
ultimate playground for audio developers; it would be a shame if the kernel
continued to get in their way. (For more background, see this history of the realtime LSM by Jack
O'Quin).
This seems like a conversation we have seen before: Paul McKenney is
asking to have an exported symbol restored for use by a proprietary IBM
module. This time around, Paul has submitted a patch requesting that two
symbols (files_lock and set_fs_root()) be exported to all modules. It is
proving to be a hard sell.
files_lock is a spinlock used within the VFS layer;
set_fs_root() is used to change the root directory for (one
process's view of) a filesystem. They were used by IBM's MVFS to a novel
end: MVFS implements a revision control system internally, and allows each
process to see a different revision of the file tree. By using these
symbols, MVFS was able to make the filesystem behave differently for each
process. With 2.6.9, that worked great, but those symbols are no longer
exported in 2.6.10. Paul has asked that they be restored so that the MVFS
module can work again.
The export was removed because the kernel developers feel that no code
outside of the VFS layer should be making changes in the filesystem
namespace. The tricks that MVFS is performing with set_fs_root()
would be better done with bind mounts - in user space. It is also felt
that any code using set_fs_root() or files_lock can only
be a fundamental part of the kernel, and thus a derived product; there is
no way, according to the relevant kernel developers, that a
proprietary module can legally use them. For these reasons, the exports
were removed, and there is strong resistance to restoring them.
Nobody disagrees with the reasoning behind the change. Not everybody
thinks that it was appropriate to remove the symbols with no notice,
however. In particular, Linus thinks there was
no reason to break things so abruptly:
I'm known for happily breaking binary modules, but I think we
should do it only if we have a reason _other_ than "let's break a
module".
Andrew Morton also thinks the exports should be
restored for a period of time - a position which gained him an accusation of supporting IBM's position as a
payback for IBM's funding of OSDL. Despite Linus's and Andrew's position,
as of this writing, the exports of those symbols have not been restored.
This whole episode restarted the discussion of what the proper way is to
remove deprecated features when there is no unstable kernel series in
sight. Andrew proposed the creation of a
file (feature-removal-schedule.txt) in the Documentation
directory which would list things slated for removal, and the relevant
dates. That file has been created; as of
this writing it lists devfs and some CPU frequency files in
/proc. This file will be helpful for some users, but it probably
will not make life easier for people maintaining out-of-tree code;
Christoph Hellwig and others have made it clear that they will continue to
remove "unneeded" exports without notice as they are identified. Life will
continue to be difficult, it seems, for code maintained outside of the
mainline tree.
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
- Robert Love: inotify..
(January 7, 2005)
Janitorial
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet