Kernel release status
The current 2.6 prepatch is 2.6.11-rc2, released by Linus on January 21.
Changes this time around include some networking updates (including a "fix"
for a NETIF_F_LLTX race condition which was subsequently withdrawn), an ALSA
update (to version 1.0.8), some enhancements to the "circular pipe buffers"
code introduced in -rc1, the ioctl() method rework, in-inode extended
attributes for ext3, some additions to the completion API, some spinlock
changes, and fixes for the latest round of security problems. The
long-format changelog has the details.
The flow of patches into Linus's BitKeeper repository has slowed as things
begin to stabilize for the 2.6.11 release. Changes merged since -rc2
include some architecture updates, the removal of bcopy(), a fix
for writable module parameters in sysfs (it never actually worked before),
and various fixes.
The current -mm tree is 2.6.11-rc2-mm1.
Recent changes to -mm include some random driver reworking, the POSIX
high-resolution timers patch set, ACL support for the NFS client, the
isochronous CPU scheduler (see below), and some crypto API work.
The current 2.4 kernel is 2.4.29; no 2.4.30 prepatches have been
released.
Kernel development news
Low latency for audio applications
Two weeks ago, this page
looked at the realtime security module, an addition requested by Linux
users who need to be able to ensure that certain applications are able to
respond quickly to external events. Musicians working with Linux, in
particular, want a system which can keep up with audio streams - a task
which requires sub-millisecond response in many cases. Unpatched Linux
kernels have generally not been able to provide latencies that low in any
sort of reliable way.
The idea of merging the realtime module appears to have been dropped for
now; the opposition was too strong. There are a couple of other approaches
which are being worked on, however, to meet the audio developers' needs.
In particular, Con Kolivas and Ingo Molnar have been creating patches, and
audio hacker Jack O'Quin has been tirelessly testing them out. Two
approaches which look like they could solve the problem have emerged from
this work.
The approach taken by Con Kolivas is the isochronous scheduler patch. This patch, in
its current form, creates two new scheduling classes: SCHED_ISO_RR
and SCHED_ISO_FIFO. These classes function much like the realtime
scheduling classes in that they provide a higher scheduling priority than
any SCHED_NORMAL process enjoys. They differ from the true
realtime classes, however, in a couple of ways. No privilege is required
to enter one of the isochronous classes, so audio applications need not run
as root. The scheduler will also automatically select an isochronous class
if an unprivileged application attempts to enter a true realtime class,
with the result that many audio applications can use the new classes
without modification.
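For context, an audio application normally asks for a realtime class through the standard sched_setscheduler() call. The sketch below is ordinary user-space code and is not taken from either patch; the helper name request_realtime() is made up for illustration. On an unpatched kernel an unprivileged caller should see the request fail (typically with EPERM); with the isochronous patch as described above, the same unmodified call would presumably place the process in an isochronous class instead.

```c
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical helper: ask for SCHED_FIFO at the lowest realtime priority.
 * Returns 0 on success, or the errno value from the failed call
 * (typically EPERM when the caller lacks the needed privilege).
 */
int request_realtime(void)
{
	struct sched_param param;
	int prio = sched_get_priority_min(SCHED_FIFO);

	if (prio < 0)
		return errno;
	param.sched_priority = prio;

	/* A pid of 0 means "the calling process". */
	if (sched_setscheduler(0, SCHED_FIFO, &param) < 0)
		return errno;
	return 0;
}
```

An application that gets a nonzero result back can report the problem and carry on under SCHED_NORMAL.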
The isochronous classes give high-priority access to the CPU, but only to a
point. If isochronous processes use more than an administrator-defined
percentage of the processor (70% by default), they get dropped back to the
SCHED_NORMAL class for a while. This feature prevents
high-priority, unprivileged tasks from taking over the system entirely.
This is an important feature - the lack of any such protection was the
reason for many of the objections to the realtime security module.
Ingo Molnar's approach, instead, is the
creation of a new resource limit (initially called
RLIMIT_RT_CPU_RATIO, later changed to RLIMIT_RT_CPU).
This limit controls what percentage of the processor's time may be taken by
all unprivileged realtime processes. If the limit is in effect, the patch
also allows any process to enter the realtime scheduling classes. So the
end result is similar to that obtained with Con's patch: unprivileged tasks
can get realtime access to the processor, but they are prevented from
taking over entirely. The difference is that Ingo's patch is somewhat
smaller and simpler, and does not require the introduction of new
scheduling classes.
The rlimit-based patch is also interesting in that it allows each process
to have a different maximum CPU utilization limit. Imagine a system
running a set of audio applications where some have their limit set at 60%,
and others at 80%. If 70% of the available processor time is actually
being used by realtime tasks, processes with the 60% limit will lose their
realtime access, but the 80% processes will not. This scheme, thus, allows
a smart supervisor (such as the jack server) to arrange for a
(relatively) graceful degradation as contention for the CPU increases.
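The 60%/80% example can be restated as a toy model. This is not code from Ingo's patch; the function names are invented, and the choice to let a task keep realtime access when usage exactly equals its limit is an assumption made only for this sketch.

```c
#include <stdbool.h>

/*
 * Toy model, not kernel code: a task keeps realtime scheduling only while
 * the CPU share consumed by all realtime tasks stays within that task's
 * own limit. Both values are percentages.
 */
bool keeps_realtime(unsigned int rt_usage_pct, unsigned int task_limit_pct)
{
	return rt_usage_pct <= task_limit_pct;
}

/* Count how many of the given per-task limits still allow realtime access. */
unsigned int count_realtime(unsigned int rt_usage_pct,
			    const unsigned int *limits, unsigned int n)
{
	unsigned int i, kept = 0;

	for (i = 0; i < n; i++)
		if (keeps_realtime(rt_usage_pct, limits[i]))
			kept++;
	return kept;
}
```

With realtime usage at 70%, a task limited to 60% drops out while a task limited to 80% stays in, so a supervisor can decide which tasks are shed first by handing out lower limits to the less critical ones.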
Jack O'Quin's benchmarking suggests that either patch, in its most recent
form, has the potential to solve the problem (though the realtime
preemption work may also be required for a complete solution). He appears
to favor Ingo's version, and its relative simplicity could well
argue for taking that path. No decision seems to have been
made yet, though; it may be that nothing is merged until the 2.6.12 process
starts. Either way, life looks set to get a little easier
for Linux audio users, which is a good thing. It can be worthwhile to be
noisy about your needs, especially if you are willing to put time into
helping in the development of the solution.
A new core time subsystem
Keeping track of the current time is one of the kernel's many jobs. In the
Linux kernel, this task is handled in a very architecture-dependent way.
Each architecture has its own sources of high-resolution time, and each
performs its own calculations. This system works, but it results in quite
a bit of code being duplicated across architectures, and it can be
brittle. Patches which change time-related code often do not manage to
correctly update all architectures.
John Stultz has been working for some months on a cleaner alternative. The
result is a new time subsystem which, he
hopes, will improve the situation.
Much of the patch can be seen as a refactoring of the time code. Common
calculations are now performed in the timeofday core, rather than in
architecture-specific code. The code for implementing the network time
protocol (NTP), an interesting exercise in complexity itself, has been
separated from the rest of the time code and hidden in its own file. Most
of the core time code has been reworked to deal with time in nanoseconds, a
format which gives adequate time resolution but which, in a 64-bit
variable, is still good for centuries. The timeofday code no longer
depends on the jiffies variable, meaning that it can work
independently of the timer interrupt, which may be disabled in some
situations. The overall result is kernel timing code which is much easier
to read and understand.
In the end, however, the timing code must go to the hardware to actually
get high-resolution time values. John made a couple of observations here.
One is that, while time sources are architecture-dependent, many
architectures share the same types of timing hardware. The other is that
the code which deals with a time source is really just another device
driver. So he isolated the time source information into its own structure:
    struct timesource_t {
        char* name;
        int priority;
        enum {
            TIMESOURCE_FUNCTION,
            TIMESOURCE_CYCLES,
            TIMESOURCE_MMIO_32,
            TIMESOURCE_MMIO_64
        } type;
        cycle_t (*read_fnct)(void);
        void __iomem* mmio_ptr;
        cycle_t mask;
        u32 mult;
        u32 shift;
        void (*update_callback)(void);
    };
Here, name is just a name for the source, and priority is
used to choose between multiple available sources. The type field
tells how this source can be read. If type is
TIMESOURCE_FUNCTION, the read_fnct() will be called to
read the source. The two _MMIO_ variants are for hardware which
can be read directly from I/O memory; in that case, the time code can just
obtain a value from the location indicated by mmio_ptr with no
need to call any outside functions. TIMESOURCE_CYCLES indicates
that the processor's time stamp counter (TSC) is being used, so
get_cycles() is called to get the actual value.
In any of the above cases, the value returned by the time source is assumed
to be some sort of counter. The mask, mult, and
shift values are applied to turn a delta between two such values
into a number of nanoseconds for the rest of the timekeeping code.
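One plausible reading of how mask, mult, and shift combine is sketched below. The formula (mask the difference, multiply, then shift right) is an assumption based on the field names, not a copy of the patch, and cycles_to_ns() is a made-up name; cycle_t is defined here as a 64-bit stand-in.

```c
#include <stdint.h>

typedef uint64_t cycle_t;	/* stand-in for the kernel's cycle_t */

/*
 * Convert the distance between two counter readings into nanoseconds.
 * The mask discards bits the counter does not implement, which also makes
 * the subtraction come out right when the counter has wrapped around.
 * The scaling step assumes ns = (delta * mult) >> shift.
 */
uint64_t cycles_to_ns(cycle_t now, cycle_t last, cycle_t mask,
		      uint32_t mult, uint32_t shift)
{
	cycle_t delta = (now - last) & mask;

	return (delta * mult) >> shift;
}
```

For a 32-bit counter ticking at 1 MHz (1000 ns per tick), one workable pair would be shift = 10 and mult = 1000 * 1024 = 1024000. Using a multiply and a shift avoids a division on every time read.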
With this structure in place, architecture-specific code need only fill in
a timesource_t structure (possibly implementing a read function in
the process) and pass it to register_timesource(). All the rest
is then handled in the common code. John has provided a set of time source
drivers for a few architectures which demonstrate how they can be written.
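To make the driver idea concrete, here is a user-space mock of the registration flow. Only the structure layout comes from the text above. The typedefs are stand-ins, register_timesource() here is a toy replacement rather than John's implementation, the rule that the higher priority value wins is an assumption, and the "fake" source is invented.

```c
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for kernel types; assumptions for this sketch. */
typedef uint64_t cycle_t;
typedef uint32_t u32;
#define __iomem

struct timesource_t {
	char *name;
	int priority;
	enum {
		TIMESOURCE_FUNCTION,
		TIMESOURCE_CYCLES,
		TIMESOURCE_MMIO_32,
		TIMESOURCE_MMIO_64
	} type;
	cycle_t (*read_fnct)(void);
	void __iomem *mmio_ptr;
	cycle_t mask;
	u32 mult;
	u32 shift;
	void (*update_callback)(void);
};

static struct timesource_t *current_source;

/* Mock registration: remember the source with the highest priority. */
void register_timesource(struct timesource_t *ts)
{
	if (!current_source || ts->priority > current_source->priority)
		current_source = ts;
}

/* Read the selected source according to its type field. */
cycle_t read_current_source(void)
{
	struct timesource_t *ts = current_source;

	switch (ts->type) {
	case TIMESOURCE_FUNCTION:
		return ts->read_fnct();
	case TIMESOURCE_MMIO_32:
		return *(volatile uint32_t *)ts->mmio_ptr;
	case TIMESOURCE_MMIO_64:
		return *(volatile uint64_t *)ts->mmio_ptr;
	default:
		return 0;	/* TIMESOURCE_CYCLES needs get_cycles(); not modeled */
	}
}

/* A made-up "driver": a counter that advances on every read. */
static cycle_t fake_counter;

static cycle_t fake_read(void)
{
	return ++fake_counter;
}

struct timesource_t fake_timesource = {
	.name = "fake",
	.priority = 100,
	.type = TIMESOURCE_FUNCTION,
	.read_fnct = fake_read,
	.mask = 0xffffffffULL,
	.mult = 1024000,	/* 1000 ns per tick, scaled by 2^10 */
	.shift = 10,
};
```

A driver in this model is little more than a filled-in structure plus, at most, one read function; calling register_timesource(&fake_timesource) is the whole hookup.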
The discussion of the patches suggests that, while developers like the
general intent, there are some remaining concerns - especially among the
architecture maintainers. In some architectures, the
gettimeofday() system call can be handled entirely in user space,
but the current patches do not yet support that. The current NTP
implementation is also seen as being too expensive. Finding a way to cut
the cost of NTP while maintaining accuracy could be a bit of a challenge,
but John is working at it. Expect to see some more iterations on this
one.
Some 2.6.11 API changes
A few small internal API changes have been merged for 2.6.11. For the
record, here's what they are.
The completion mechanism allows a thread in
the kernel to block until a specific event happens. Three new functions,
some of which appear to be aiming for the "longest name in the kernel"
prize, have been added:
    int wait_for_completion_interruptible(struct completion *c);
    unsigned long wait_for_completion_timeout(struct completion *c,
                                              unsigned long timeout);
    unsigned long wait_for_completion_interruptible_timeout(struct completion *c,
                                              unsigned long timeout);
Each of these functions should be relatively straightforward to understand:
they add interruptible and timeout variants to the basic
wait_for_completion() function. They were added to make it easier
to convert more semaphore users over to the completion API, which is more
appropriate for cases where a one-shot operation is being waited for. This
change is another small bit of fallout from the realtime preemption work.
The kernel has long had an implementation of bcopy():
void bcopy(const char *src, char *dest, int size);
Arjan van de Ven and Adrian Bunk recently noticed a couple of things:
(1) nothing in the kernel was actually using bcopy(), and
(2) the implementation was broken. bcopy() is supposed to be
able to handle overlapping source and destination areas, but, for a number
of architectures, the kernel implementation would not do the right thing
with such areas. So a patch was merged
which removes bcopy(). No other in-kernel changes were needed,
but out-of-tree modules which use bcopy() will need to be
changed.
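For out-of-tree code that still wants a bcopy()-style call, one option is a thin wrapper over memmove(), which is specified to handle overlapping regions. The wrapper name compat_bcopy() is hypothetical, and the sketch is written as user-space C so that it can be compiled and tried on its own; in a module the same one-liner would use the kernel's own memmove().

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical compatibility wrapper with the argument order of the
 * removed bcopy(): source first, then destination. memmove() is defined
 * to handle overlapping regions correctly.
 */
static inline void compat_bcopy(const void *src, void *dest, size_t size)
{
	memmove(dest, src, size);
}
```

Note the swapped argument order relative to memmove() and memcpy(); that swap is an easy place to introduce a bug when converting callers by hand.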
Chip Salzenberg (and others) noticed that a couple of networking functions
- skb_copy_datagram() and sock_alloc_send_pskb() - are
no longer exported to modules in the 2.6.11 prepatches. This change breaks
the out-of-tree VMWare modules. Fixes for VMWare have already been merged.
On the PCI front, a patch from Pavel Machek
which changes the prototype of the suspend() method in
struct pci_driver was merged. The new prototype is:
int (*suspend)(struct pci_dev *dev, pm_message_t state);
By changing the type of the state parameter, the patch allows the
removal of some translation code and lets PCI drivers know what is really
going on at the higher power management levels. Pavel is looking for help in fixing PCI drivers to use
the new interface.
A few spinlock primitives have seen changes. For starters, the macro
rwlock_is_locked() has been removed. It was never clear whether
the macro referred to read or write locking, so Linus dealt with the
confusion by just taking it out altogether. Then a new set of primitives
was added:
int read_can_lock(rwlock_t *rw);
int write_can_lock(rwlock_t *rw);
These test whether an attempt to obtain a read or write lock at that time
would have succeeded. In addition, there is a version for regular
spinlocks:
int spin_can_lock(spinlock_t *lock);
This function returns a nonzero value if an attempt to obtain lock
would have succeeded, but does not actually modify the lock.
Finally, the internal lock field in the spinlock
structure was renamed to slock. This change was made to force
the compiler to complain when rwlock primitives are used on a regular
spinlock (and vice versa). This sort of type safety could also have been
achieved by using inline functions, rather than macros, but some
performance problems with gcc prevented that approach from being
used.
Patches and updates
Security-related
- Fruhwirth Clemens: Add LRW.
(January 24, 2005)
Page editor: Jonathan Corbet