LWN Weekly Edition Front pageSecurity Kernel development Distributions Development Linux in the news Announcements ->One big page
This page Previous weekFollowing week |
Kernel developmentRelease status Kernel release status There is no mainline 2.6 prepatch as of this writing; the 2.6.24 merge window remains open. Patches are flowing into the mainline repository at a high rate; see the article below for the highlights.The current -mm tree is 2.6.23-mm1. Recent changes to -mm include a new power management "quality of service" infrastructure, a number of ext4 updates, and the kernel markers patch. The current stable 2.6 kernel is 2.6.23.1, released on October 12. This update contains a single fix for a corruption problem in the sata_mv driver. For older kernels: 2.6.16.55 was released on October 14 with several security fixes.
Kernel development news Quotes of the week
Not everybody is a "doer". It's important to get input from people
who are just plain users, or hope to be.
-- Linus Torvalds
-The limit on the length of lines is 80 columns and this is a hard limit. +The limit on the length of lines is 80 columns and this is a strongly +preferred limit.
Merged for 2.6.24 The 2.6.24 merge window is open and, as of this writing, some 5,600 patches have found their way into the mainline. As usual, the list of changes is extensive. Some of the highlights among user-visible changes are:
Important changes visible to kernel developers include:
As of this writing, the 2.6.24 merge window can be expected to remain open for up to another week. So expect more changes to go into the mainline before this development cycle goes into the stabilization phase.
Fair user scheduling and other scheduler patches The Completely Fair Scheduler (CFS) was merged for the 2.6.23 kernel. One CFS feature which did not get in, though, was the group scheduling facility. Group scheduling makes the CFS fairness algorithm operate in a hierarchical fashion: processes are divided into groups, and, within each group, processes are scheduled fairly against one another. At the higher level, each group as a whole is given a fair share of the processor. The grouping of processes is done in user space in a highly flexible manner; the control groups (formerly "process containers") mechanism allows a management daemon to classify processes according to almost any policy.One of the reasons why group scheduling did not get into 2.6.23 is that the control groups patch was not ready for merging. Your editor had expected control groups to go in for 2.6.24, but, as of this writing, it is looking like that patch might still be under too much active development to get into the mainline. The group scheduling feature is not waiting, though; it has been merged for the 2.6.24 release. In the absence of control groups, the general group scheduling mechanism will not be available. Over the last few months, though, the group scheduler has evolved a new feature which will allow it to be used without control groups, and which implements what is likely to be the most common use case. That feature is per-user scheduling: creating a separate group for each user running on the system and using those groups to give each user a fair share of the processor. Since the groups are created implicitly by the scheduler, there is no separate need for the control groups interface. Instead, if the "fair user" configuration option is selected, the per-user group scheduling will go into effect with no further intervention by the administrator required.
Of course, once the system provides fair per-user scheduling,
administrators will immediately want to make it unfair by arranging for
some users to get more CPU time than others. The age-old technique of
raising the priority of that crucial administrative wesnoth process still
works, but it is a crude and transparent tool. It would be much nicer to
be able to tweak the scheduler so that certain users get a higher share of
the CPU for the running of their crucial To achieve such ends with the 2.6.24 scheduler, it will only be necessary to go to the new sysfs directory /sys/kernel/uids. There will be a subdirectory there for every active user ID on the system, and each subdirectory will contain a file called cpu_share. The integer value found in that file defaults to 1024. For the purposes of adjusting scheduling, all that really matters with the cpu_share value is its ratio between two users. If one user's cpu_share is set to 2048, that user will get twice as much CPU time as any one user whose value remains at the default 1024. The end result is that adjusting the scheduling of the CPU between users is quite easy for the administrator to do. A rather large number of other patches was also merged for 2.6.24. Most of those are cleanups and small improvements. Some of the math within the scheduler has been made less intensive, and fairness has been improved in a number of ways. There is also a new facility for performing guest CPU accounting for virtualized systems running under KVM. It's a lot of patches, but the rate of change in the core CPU scheduler should be beginning to slow down again. There are some other scheduler-related patches in the works, though. A couple of them address the problem of getting realtime tasks into a CPU promptly. Normally, the CPU scheduler will make a significant effort to avoid moving processes between CPUs because the cost of that migration (resulting from lost memory cache contents) is high. If a realtime process wants to run, though, the system is obligated to give it a processor even if there is a price to be paid in terms of overall throughput. The current CPU scheduler, however, will cause a realtime process to languish if a higher-priority process is running on the same CPU, even if other processors are available in the system. Fixing this problem involves a couple of different patches. This one from Steven Rostedt addresses the situation where the scheduling of one realtime task causes a lower-priority (but still realtime) task to be pushed out of the CPU. Rather than leave that luckless task in the run queue, Steven's patch searches through the other processors on the system to find the one running the lowest-priority process. If a processor running a sufficiently low-priority process is found, the displaced realtime process is moved over to that processor. Gregory Haskins has posted a similar patch which addresses a slightly different situation: a realtime task has just been awakened, but the CPU it is on is already running a higher-priority process. Once again, a search of the system to find the lowest-priority CPU is performed, with the realtime process being moved if a suitable home is found. In either case, the moved process will suffer a small performance hit as it finds a completely cold cache waiting for it. But it will still be able to respond much more quickly to the real world than it would if it were sitting on a run queue somewhere; that, of course, is what realtime scheduling is all about. Another issue which has come up in some situations is that the accuracy of fair scheduling decisions is constrained by the scheduler tick frequency. In the absence of external events (such as I/O completions), one process can only preempt another when the periodic timer tick comes in. As a result, processes might run longer than their time slices would otherwise allow. The scheduler will compensate for the extra time used by that process by causing it to wait longer than it otherwise would for its next time slice. The result is fair scheduling, but higher latencies than one might like. Peter Zijlstra has posted a solution to this problem: a patch which uses the high-resolution timer mechanism to preempt processes exactly at the end of their time slices. When the scheduler notes that a time slice will run out between timer ticks, it arranges for a special one-time timer interrupt at the time slice expiration time. When that interrupt arrives, the running process can be turfed out right on schedule. As a result, the process will not overrun its time slice and will not have to face a longer-than-usual wait before it is able to run again. Mike Galbraith has reported that this patch results in reduced context switching on his system, and higher throughput as well. So it looks like the right solution to the problem, at least in the absence of a true dynamic tick mechanism. The current dynamic tick code turns off the periodic clock interrupt when the processor is idle, but that interrupt continues to run when the processor is busy. In a fully dynamic environment, periodic ticks would never be used and special interrupts at the end of time slices would be the normal way of doing business. Implementing full dynamic tick is a big job, though; in the meantime the addition of an occasional extra tick can help the scheduler to do a quick and accurate job.
Some VFS address space operations changes Deeply buried in the 2.6.24 patch stream is a set of significant changes to the VFS layer internal API. The core motivation behind this work is to prevent some deadlock problem which, with the old API, could not be avoided without taking a significant performance hit. Anybody maintaining an out-of-tree filesystem will want to have a look and be prepared to start fixing up their code.In the older VFS API, two address space operations are provided by filesystems to support writes to files:
int (*prepare_write)(struct file *file, struct page *page,
unsigned begin, unsigned end);
int (*commit_write)(struct file *file, struct page *page,
unsigned begin, unsigned end);
A call to prepare_write() notifies the filesystem that the VFS intends to write bytes begin..end of file into the given page. It is then the filesystem's responsibility to make sure that the write will work (allocating blocks if need be) and, if a partial block is to be written, the filesystem should populate page with the full block's data. Later on, the call to commit_write() tells the filesystem that the data has been copied into page and can be committed to disk. The problem with this API is that the VFS is expected to pass a locked page into prepare_write(). There are a number of scenarios which can lead to attempts to lock that page twice, bringing the system to a halt. To avoid this problem, Nick Piggin has created replacements for prepare_write() and commit_write():
int (*write_begin)(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
int (*write_end)(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
There are a number of changes, but the key is the fact that a page is no longer passed into write_begin(). Instead, that function should allocate the page itself and return it (locked) to the VFS. The call to write_end() indicates that the write is complete; it should unlock the page and update the inode's i_size field. The new copied parameter is also important: it is the number of bytes which were actually copied into the page, which might be smaller than len predicted. Some of the possible deadlock scenarios involve the handling of page faults while the destination page is locked; a trivial example is when the data being written to the page is also being read from that page. With the new API, a page fault terminates the copying of the data, allowing the page to be unlocked. The fault can be handled while the destination page is unlocked, avoiding the deadlock problems. The possibility of short writes does impose an extra cost on filesystems: any data which may be overwritten must be read in regardless, just in case the write operation ends prematurely. There are times, however, when the VFS knows that writes will go the full length; in particular, writes from buffers which are in kernel space must succeed. When such a write is executed, the VFS will pass the AOP_FLAG_UNINTERRUPTIBLE flag to write_begin() to let the filesystem know that short writes are not a possibility. For now, the prepare_write() and commit_write() VFS methods are still supported in the kernel. If a filesystem does not provide the newer functions, the older ones will be used. The long-term plan almost certainly involves the removal of those methods, though; they cannot be supported in a way which is simultaneously safe and fast.
Patches and updates Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet |
Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
Powered by Rackspace Managed Hosting.