There is no mainline 2.6 prepatch
as of this writing; the 2.6.24
merge window remains open. Patches are flowing into the mainline
repository at a high rate; see the article below for the highlights.
The current -mm tree is 2.6.23-mm1. Recent changes to
-mm include a new power management "quality of service" infrastructure, a
number of ext4 updates, and the kernel markers patch.
The current stable 2.6 kernel is 2.6.23.1, released on October 12.
This update contains a single fix for a corruption problem in the
For older kernels: 184.108.40.206 was released on
October 14 with several security fixes.
Kernel development news
Not everybody is a "doer". It's important to get input from people
who are just plain users, or hope to be.
-- Linus Torvalds
-The limit on the length of lines is 80 columns and this is a hard limit.
+The limit on the length of lines is 80 columns and this is a strongly
-- Alan Cox relaxes the rules in 2.6.24
The 2.6.24 merge window is open and, as of this writing, some 5,600 patches
have found their way into the mainline. As usual, the list of changes is
extensive. Some of the highlights among user-visible changes are:
- New drivers have been added for
Toshiba TCM825x cameras (as found in the Nokia N800),
Conexant cx23415 MPEG encoders (in framebuffer mode),
DiBcom DiB0070 tuners,
Microtune MT2131 tuners,
Samsung S5H1409 demodulators,
Conexant CX23885/CX23887 PCIe bridge devices,
Samsung LTV350QV LCD panel backlights,
Kingsun KS-959 and "Dazzle" IrDA USB dongles,
ADMtek ADM8211-based wireless network adapters,
Marvell Libertas 8385 CF wireless adapters,
Intel 82598-based 10GbE network cards,
Intel PRO/Wireless 3945ABG/BG and Link AGN adapters (finally),
IP1000 Gigabit Ethernet cards,
ICH9 on-board Ethernet adapters,
Tehuti Networks 10G Ethernet adapters,
Sonics Silicon Backplane busses,
several Ralink wireless adapters,
"EMAC" built-in PowerPC ethernet controllers,
Sun Neptune Ethernet adapters,
Winchiphead CH341 USB-RS232 converters,
Atmel AT32AP7000 USB device controllers,
Blackfin bf548 ATAPI controllers,
Atmel AVR32 parallel ATA controllers,
National Semiconductor NS87415 parallel ATA controllers,
Olympus MAUSB-10 and Fujifilm DPC-R1 flash card readers (which have
the nice feature of allowing direct flash access without an
intervening translation layer),
TI DaVinci I2C controllers,
Analog Devices ADT7470 temperature monitoring chips,
IBM PowerExecutive power/temperature sensors,
TI AR7 CPMAC Ethernet controllers,
Siemens SX1 phones,
Dallas Semiconductor DS1374 realtime clock chips,
Atmel AT73C213 external sound devices,
Cirrus Logic CS4270 codec devices,
Gallant SC-6000 Audio Excel DSPs, and
Atmel AT32AP and AT91 on-chip synchronous serial controllers.
- The Broadcom BCM43xx driver has been replaced by a new version which
uses the mac80211 layer. Actually, there are two drivers: "b43" for
newer adapters, and "b43legacy" for older 802.11b and 802.11g
- The "dgrs" Digi RightSwitch driver has been removed from the kernel.
This product, evidently, was never actually sold, so there should not
be a whole lot of users inconvenienced by this change.
- The kernel now has basic support for SDIO peripherals. There is also
now driver support for MMC/SD cards accessed via SPI controllers.
- The CFS group scheduling
code has been merged. As of this writing, though, the feature cannot
actually be turned on because the control groups code has not yet been
merged. There is also a per-UID fair scheduling option which does work now.
- There is now support for RPC-based RDMA and the ability to mount NFS
filesystems using RDMA.
- The traffic shaper, which limits bandwidth usage on network links, has
been marked obsolete and scheduled for removal in 2.6.25. The much
more flexible qdisc subsystem should be used instead.
- Allocation of UDP port numbers is now randomized.
- The netconsole code can now support multiple logging targets.
- Support for network namespaces has been added, enabling the
virtualization of network-related resources in containers. Also
merged is a virtual Ethernet driver which can be used to create
network tunnels into (and out of) containers.
- The Authenticated chunks
protocol for the stream control transmission protocol (SCTP) is supported.
- A new "stateless NAT" implementation performs IPv4 network address
translation in a much more resource-efficient manner.
- Support for the SEED
encryption algorithm has been added.
- Dynamic tick and clockevent support for the x86_64 architecture (now
the 64-bit version of the new x86 architecture, see below) has been merged.
- Support for serial ATA port multipliers has been added.
- LZO compression is now supported in the JFFS2 filesystem.
- The USB device
authorization code - a prerequisite to wireless USB support - has been merged.
- A new hidraw device provides access to a stream of
unprocessed input device events for applications which have special
needs in this area.
- The per-device write
throttling patches have been merged; these patches should help the
system keep heavy traffic on one block device from starving other
devices. The floating
proportions patch, needed to support per-device throttling, has
also gone in.
- There is a new sysctl flag for the out-of-memory killer
(oom_kill_allocating_task). If this flag is set, the OOM
killer will simply kill the process whose allocation brings about the
out-of-memory situation instead of scanning through the system looking
for better targets.
- Disk quota messages can now be delivered via a netlink socket. This
should make it easier for graphical environments to inform the user
when disk quota problems are encountered.
- The new F_DUPFD_CLOEXEC command causes fcntl() to
duplicate a file descriptor and set the close-on-exec flag from the
- Block reservations have been added to the ext2 filesystem.
- The Linux security module interface is now a non-module interface: the
ability to load security modules on the fly has been removed.
- File-based capability
masks are now supported.
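Of the user-visible changes listed above, F_DUPFD_CLOEXEC is the one most easily tried from user space. What follows is a minimal sketch, assuming a kernel and C library which expose the command; the helper names dup_cloexec() and has_cloexec() are invented for this illustration and are not part of any library.

```c
#define _GNU_SOURCE        /* exposes F_DUPFD_CLOEXEC on older glibc versions */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: duplicate fd onto the lowest free descriptor that is
 * >= min_fd, with the close-on-exec flag already set on the copy.
 * Returns the new descriptor, or -1 with errno set.
 */
int dup_cloexec(int fd, int min_fd)
{
    assert(min_fd >= 0);
    return fcntl(fd, F_DUPFD_CLOEXEC, min_fd);
}

/* Hypothetical helper: 1 if close-on-exec is set on fd, 0 if not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

The point of doing both steps in one call is that a multithreaded program has no window, between a plain F_DUPFD and a later F_SETFD, in which another thread could fork and exec with the new descriptor still inheritable.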
Important changes visible to kernel developers include:
- As expected, the i386/x86_64
architecture merger has happened. The result is a single
architecture, called "x86," which can be built for 32-bit and 64-bit
- The Video4Linux layer has some new internal support for composite
devices involving more than one driver (many V4L2 devices involve, at
a minimum, separate drivers for the controller and the sensor).
- Also in Video4Linux: the video-buf layer has been replaced with a more
generic implementation which works with a wider range of devices
(including USB devices and those which do not support scatter/gather
- The large receive offload
(LRO) support layer has been merged into the networking
- The NAPI interface used in network drivers has been reworked to better
support devices with multiple transmit queues.
- The networking layer has a new function for printing MAC addresses:
char *print_mac(char *buf, const u8 *addr);
The buf buffer should be declared with
DECLARE_MAC_BUF(); the output is suitable for formatting in
printk() with "%s".
- The NETIF_F_LLTX (lockless transmit) flag for network devices
has been deprecated and should not be used in new code.
- The functions ktime_sub_us() and ktime_sub_ns() have
been added; they subtract the given number of microseconds or
nanoseconds from a ktime_t value.
- The hard_header() method has been removed from struct
net_device; it has been replaced by a per-protocol
header_ops structure pointer.
- The debugfs filesystem has some new functions
(including debugfs_create_x32()) which make it easy to export files
containing hexadecimal numbers.
- Various small sysfs-related API changes have been made. The
name field has been removed from the kobject
structure. The prototypes of the user-event callbacks have been
changed. Many of the subsystem-related calls have been removed -
subsystems never really did much of anything anyway;
get_bus() and put_bus() are also gone.
- A new value DMA_MASK_NONE can be stored in the
device structure dma_mask field to indicate that the
device is incapable of performing DMA.
- The VFS has a couple of new address space operations
(write_begin() and write_end()) aimed at fixing some
deadlock scenarios; see this article
for more information.
- The scatterlist chaining
patches have been merged and many parts of the kernel have been
updated to use this feature.
- The CFLAGS= and CPPFLAGS= options now work with the
kernel build system in the expected way: they add flags to be passed
to the C compiler and preprocessor, respectively.
- The prototype for slab constructor callbacks has changed to:
void (*ctor)(struct kmem_cache *cache, void *object);
The unused flags argument has been removed and the order of
the other two arguments has been reversed to match other slab
- The DECLARE_MUTEX_LOCKED() macro has been removed.
- The long-deprecated SA_* interrupt flags have been removed in
favor of the IRQF_* equivalents.
- A number of block layer utilities have seen prototype changes. The
most evident change, perhaps, is bio_endio() and the
associated bio_end_io_t callback:
void bio_endio(struct bio *bio, int error);
typedef void (bio_end_io_t) (struct bio *, int);
These functions now always complete the entire BIO, so the size
argument has been removed.
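The calling convention of print_mac() from the list above (the caller provides the buffer and gets the same pointer back, ready for a "%s" format) can be tried out in user space. This is only a stand-in: format_mac() and MAC_STR_LEN are names made up for this sketch, and the colon-separated lowercase hex layout is an assumption about the output format rather than something stated above.

```c
#include <assert.h>
#include <stdio.h>

#define MAC_STR_LEN 18   /* "xx:xx:xx:xx:xx:xx" plus the terminating NUL */

/*
 * User-space stand-in (hypothetical name) with the same shape as the
 * kernel's print_mac(): the caller supplies the buffer and receives that
 * same buffer back, so the call can sit directly in a printf() argument
 * list. buf must hold at least MAC_STR_LEN bytes; addr must hold 6.
 */
char *format_mac(char *buf, const unsigned char *addr)
{
    assert(buf != NULL && addr != NULL);
    snprintf(buf, MAC_STR_LEN, "%02x:%02x:%02x:%02x:%02x:%02x",
             addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
    return buf;
}
```

A caller would declare char mac[MAC_STR_LEN]; and then write printf("link up: %s\n", format_mac(mac, addr)); which mirrors the DECLARE_MAC_BUF() and printk() pairing described for the kernel version.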
As of this writing, the 2.6.24 merge window can be expected to remain open
for up to another week. So expect more changes to go into the mainline
before this development cycle goes into the stabilization phase.
The Completely Fair Scheduler (CFS) was merged for the 2.6.23 kernel. One
CFS feature which did not get in, though, was the group scheduling
Group scheduling makes the CFS fairness algorithm operate in a hierarchical
fashion: processes are divided into groups, and, within each group,
processes are scheduled fairly against one another. At the higher level,
each group as a whole is given a fair share of the processor. The grouping
of processes is done in user space in a highly flexible manner; the control
groups (formerly "process containers") mechanism allows a management daemon
to classify processes according to almost any policy.
One of the reasons why group scheduling did not get into 2.6.23 is that the
control groups patch was not ready for merging. Your editor had expected
control groups to go in for 2.6.24, but, as of this writing, it is looking
like that patch might still be under too much active development to get
into the mainline. The group scheduling feature is not waiting, though; it
has been merged for the 2.6.24 release. In the absence of control groups,
the general group scheduling mechanism will not be available. Over the
last few months, though, the group scheduler has evolved a new feature which will
allow it to be used without control groups, and which implements what is
likely to be the most common use case.
That feature is per-user scheduling: creating a separate group for each
user running on the system and using those groups to give each user a fair share of the
processor. Since the groups are created implicitly by the scheduler, there
is no separate need for the control groups interface. Instead, if the
"fair user" configuration option is selected, the per-user group scheduling
will go into effect with no further intervention by the administrator.
Of course, once the system provides fair per-user scheduling,
administrators will immediately want to make it unfair by arranging for
some users to get more CPU time than others. The age-old technique of
raising the priority of that crucial administrative wesnoth process still
works, but it is a crude and transparent tool. It would be much nicer to
be able to tweak the scheduler so that certain users get a higher share of
the CPU for the running of their crucial
To achieve such ends with the 2.6.24 scheduler, it will only be necessary
to go to the new sysfs directory /sys/kernel/uids. There will be
a subdirectory there for every active user ID on the system, and each
subdirectory will contain a file called cpu_share. The integer
value found in that file defaults to 1024. For the purposes of adjusting
scheduling, all that really matters with the cpu_share value is
its ratio between two users. If one user's cpu_share is set to
2048, that user will get twice as much CPU time as any one user whose value
remains at the default 1024. The end result is that adjusting the
scheduling of the CPU between users is quite easy for the administrator to
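The arithmetic behind cpu_share can be worked through in a few lines. The sketch below assumes that every listed user always has something to run and that CPU time is divided strictly in proportion to the cpu_share values; cpu_fraction() is a name invented for this example, not a kernel interface.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fraction of the CPU that the user at index `who` would receive when every
 * listed user is continuously runnable: that user's cpu_share divided by
 * the sum of all cpu_share values. Only the ratios matter, not the
 * absolute numbers.
 */
double cpu_fraction(const unsigned long *shares, size_t nusers, size_t who)
{
    unsigned long total = 0;

    assert(nusers > 0 && who < nusers);
    for (size_t i = 0; i < nusers; i++)
        total += shares[i];
    assert(total > 0);
    return (double)shares[who] / (double)total;
}
```

With three busy users holding shares of 2048, 1024, and 1024, the first gets half of the processor and the others a quarter each: twice as much as either default user, as described above.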
A rather large number of other patches was also merged for 2.6.24. Most of
those are cleanups and small improvements. Some of the math within the
scheduler has been made less intensive, and fairness has been improved in a
number of ways. There is also a new facility for performing guest CPU
accounting for virtualized systems running under KVM. It's a lot of
patches, but the rate of change in the core CPU scheduler should be
beginning to slow down again.
There are some other scheduler-related patches in the works, though. A
couple of them address the problem of getting realtime tasks into a CPU
promptly. Normally, the CPU scheduler will make a significant effort to
avoid moving processes between CPUs because the cost of that migration
(resulting from lost memory cache contents) is high. If a realtime process
wants to run, though, the system is obligated to give it a processor even
if there is a price to be paid in terms of overall throughput. The current
CPU scheduler, however, will cause a realtime process to languish if a
higher-priority process is running on the same CPU, even if other
processors are available in the system.
Fixing this problem involves a couple of different patches. This one from Steven Rostedt
addresses the situation where the scheduling of one realtime task causes a
lower-priority (but still realtime) task to be pushed out of the CPU.
Rather than leave that luckless task in the run queue, Steven's patch
searches through the other processors on the system to find the one running
the lowest-priority process. If a processor running a sufficiently
low-priority process is found, the displaced realtime process is moved over
to that processor.
Gregory Haskins has posted a
similar patch which addresses a slightly different situation: a
realtime task has just been awakened, but the CPU it is on is already
running a higher-priority process. Once again, a search of the system to
find the lowest-priority CPU is performed, with the realtime process being
moved if a suitable home is found. In either case, the moved process will
suffer a small performance hit as it finds a completely cold cache waiting
for it. But it will still be able to respond much more quickly to the real
world than it would if it were sitting on a run queue somewhere; that, of
course, is what realtime scheduling is all about.
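Both patches come down to the same search: find the processor running the lowest-priority task and move the waiting realtime task there if it can preempt what it finds. A toy model of that search might look like the following; the array representation, the "larger number means higher priority" convention, and the name find_push_target() are all assumptions of this sketch and do not reflect either patch's actual code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model: cpu_prio[i] is the priority of the task now running on CPU i,
 * with a larger number meaning higher priority (an assumption of this
 * sketch, not the kernel's internal representation).
 *
 * Returns the index of the CPU running the lowest-priority task, provided
 * that task has lower priority than the realtime task looking for a home.
 * Returns -1 if no CPU qualifies, in which case the task stays queued.
 */
int find_push_target(const int *cpu_prio, size_t ncpus,
                     size_t cur_cpu, int task_prio)
{
    int best = -1;

    assert(cur_cpu < ncpus);
    for (size_t i = 0; i < ncpus; i++) {
        if (i == cur_cpu)
            continue;               /* the task cannot run here right now */
        if (cpu_prio[i] >= task_prio)
            continue;               /* the task could not preempt this CPU */
        if (best < 0 || cpu_prio[i] < cpu_prio[best])
            best = (int)i;
    }
    return best;
}
```

The linear scan is the simplest possible form; how the real patches keep this search cheap on large systems is outside what this sketch tries to show.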
Another issue which has come up in some situations is that the accuracy of
fair scheduling decisions is constrained by the scheduler tick frequency.
In the absence of external events (such as I/O completions), one process
can only preempt another when the periodic timer tick comes in. As a
result, processes might run longer than their time slices would otherwise
allow. The scheduler will compensate for the extra time used by that process
by causing it to wait longer than it otherwise would for its next time
slice. The result is fair scheduling, but higher latencies than one might
Peter Zijlstra has posted a
solution to this problem: a patch which uses the high-resolution timer
mechanism to preempt processes exactly at the end of their time slices.
When the scheduler notes that a time slice will run out between timer
ticks, it arranges for a special one-time timer interrupt at the time slice
expiration time. When that interrupt arrives, the running process can be
turfed out right on schedule. As a result, the process will not overrun
its time slice and will not have to face a longer-than-usual wait before it
is able to run again.
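The difference between tick-driven and timer-driven preemption is easy to put into numbers. The sketch below assumes a single nanosecond clock on which the periodic tick fires at every multiple of the tick period; the function names are invented for the illustration and nothing here is taken from the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/* First periodic tick at or after time t_ns, with ticks at multiples of tick_ns. */
uint64_t next_tick_at_or_after(uint64_t t_ns, uint64_t tick_ns)
{
    assert(tick_ns > 0);
    return ((t_ns + tick_ns - 1) / tick_ns) * tick_ns;
}

/*
 * Time at which a process whose slice ends at slice_end_ns actually loses
 * the CPU (ignoring external events such as I/O completions).
 *
 * Without a high-resolution timer the scheduler only notices the expired
 * slice at the next periodic tick; with a one-shot timer armed for the
 * slice expiration time, preemption happens right on schedule.
 */
uint64_t preempt_time(uint64_t slice_end_ns, uint64_t tick_ns, int use_hrtimer)
{
    if (use_hrtimer)
        return slice_end_ns;
    return next_tick_at_or_after(slice_end_ns, tick_ns);
}
```

With a 4ms tick and a slice ending at 6.5ms, the tick-only case preempts at 8ms, an overrun of 1.5ms which the process must later pay back by waiting longer; the one-shot timer removes that overrun.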
Mike Galbraith has reported that this patch
results in reduced context switching on his system, and higher throughput
as well. So it looks like the right solution to the problem, at least in
the absence of a true dynamic tick mechanism. The current dynamic tick
code turns off the periodic clock interrupt when the processor is idle, but
that interrupt continues to run when the processor is busy. In a fully
dynamic environment, periodic ticks would never be used and special
interrupts at the end of time slices would be the normal way of doing
business. Implementing full dynamic tick is a big job, though; in the
meantime the addition of an occasional extra tick can help the scheduler to
do a quick and accurate job.
Deeply buried in the 2.6.24 patch stream is a set of significant changes to
the VFS layer internal API. The core motivation behind this work is to
prevent some deadlock problems which, with the old API, could not be avoided
without taking a significant performance hit. Anybody maintaining an
out-of-tree filesystem will want to have a look and be prepared to start
fixing up their code.
In the older VFS API, two address space operations are provided by
filesystems to support writes to files:
int (*prepare_write)(struct file *file, struct page *page,
unsigned begin, unsigned end);
int (*commit_write)(struct file *file, struct page *page,
unsigned begin, unsigned end);
A call to prepare_write() notifies the filesystem that the VFS
intends to write bytes begin..end of file into the given
page. It is then the filesystem's responsibility to make sure
that the write will work (allocating blocks if need be) and, if a partial
block is to be written, the filesystem should populate page with
the full block's data. Later on, the call to commit_write() tells
the filesystem that the data has been copied into page and can be
committed to disk.
The problem with this API is that the VFS is expected to pass a locked page
into prepare_write(). There are a number of scenarios which can
lead to attempts to lock that page twice, bringing the system to a halt.
To avoid this problem, Nick Piggin has created replacements for
prepare_write() and commit_write():
int (*write_begin)(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
int (*write_end)(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
There are a number of changes, but the key is the fact that a page is no
longer passed into write_begin(). Instead, that function should
allocate the page itself and return it (locked) to the VFS. The call to
write_end() indicates that the write is complete; it should unlock
the page and update the inode's i_size field.
The new copied parameter is also important: it is the number of
bytes which were actually copied into the page, which might be smaller than
Some of the possible deadlock scenarios involve the handling
of page faults while the destination page is locked; a trivial example is
when the data being written to the page is also being read from that page.
With the new API, a page fault terminates the copying of the data, allowing
the page to be unlocked. The fault can be handled while the destination
page is unlocked, avoiding the deadlock problems.
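The short-copy protocol described above can be modeled in user space. What follows is a toy simulation, not kernel code: every toy_* name is invented here, the "page" is an ordinary buffer with a flag standing in for the page lock, and a fault is simulated by a source buffer which is only partly "resident". The sketch shows the caller's loop: lock via the begin step, copy as much as possible, report the copied count to the end step (which unlocks), then handle the fault with the page unlocked and retry the remainder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

struct toy_page {
    char data[TOY_PAGE_SIZE];
    int locked;                 /* stands in for the page lock */
};

struct toy_source {
    const char *bytes;
    size_t resident;            /* bytes [0, resident) copy without faulting */
};

/* Begin step: hands back the page in the locked state. */
static struct toy_page *toy_write_begin(struct toy_page *pg)
{
    assert(!pg->locked);        /* locking twice is the deadlock being avoided */
    pg->locked = 1;
    return pg;
}

/* End step: told how many bytes really arrived; unlocks the page. */
static size_t toy_write_end(struct toy_page *pg, size_t copied)
{
    assert(pg->locked);
    pg->locked = 0;
    return copied;
}

/* Copy that stops short at the first non-resident source byte. */
static size_t toy_copy(struct toy_page *pg, size_t pos,
                       const struct toy_source *src, size_t off, size_t len)
{
    size_t avail = src->resident > off ? src->resident - off : 0;
    size_t n = len < avail ? len : avail;

    memcpy(pg->data + pos, src->bytes + off, n);
    return n;
}

/* Simulated fault handling; only legal while the page is unlocked. */
static void toy_handle_fault(const struct toy_page *pg,
                             struct toy_source *src, size_t total)
{
    assert(!pg->locked);
    src->resident = total;
}

/* Caller's loop: retries after each short copy. *rounds counts the passes. */
size_t toy_write(struct toy_page *pg, size_t pos,
                 struct toy_source *src, size_t len, int *rounds)
{
    size_t done = 0;

    assert(pos + len <= TOY_PAGE_SIZE);
    *rounds = 0;
    while (done < len) {
        struct toy_page *locked = toy_write_begin(pg);
        size_t copied = toy_copy(locked, pos + done, src, done, len - done);

        done += toy_write_end(locked, copied);
        (*rounds)++;
        if (done < len)
            toy_handle_fault(pg, src, len);
    }
    return done;
}
```

The assertion in toy_handle_fault() is the whole point of the exercise: in this model the fault is never serviced while the destination page is still locked.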
The possibility of short writes does impose an extra cost on filesystems:
any data which may be overwritten must be read in regardless, just in case
the operation ends prematurely. There are times, however, when the VFS knows
that writes will go the full length; in particular, writes from buffers
which are in kernel space must succeed. When such a write is executed, the
VFS will pass the AOP_FLAG_UNINTERRUPTIBLE flag to
write_begin() to let the filesystem know that short writes are not
For now, the prepare_write() and commit_write() VFS
methods are still supported in the kernel. If a filesystem does not
provide the newer functions, the older ones will be used. The long-term
plan almost certainly involves the removal of those methods, though; they
cannot be supported in a way which is simultaneously safe and fast.
Page editor: Jonathan Corbet