Kernel development
Brief items
Kernel release status
The current 2.6 development kernel remains 2.6.26-rc8; no 2.6 prepatches have been released over the last week.The current stable 2.6 kernel remains 2.6.25.9. The 2.6.25.10 update, with about a dozen fixes, is currently in the review process; it will probably be released on July 3.
Kernel development news
Quotes of the week
Ext4 hacker Ted Ts'o converts his laptop
A big step in the development of a new filesystem is when the developers feel confident enough to start trusting their data to it. For ext4, it appears we have reached that point as Ted Ts'o has switched his laptop to use it. "So far Ive found one bug as a result of my using ext4 in production (if delayed allocation is enabled, i_blocks doesnt get updated until the block allocation takes place, so files can appear to have 0k blocksize right after they are created, which is confusing/unfortunate), but nothing super serious yet. I will be doing backups a bit more frequently until Im absolutely sure things are rock solid, though!"
Making power policy just work
The sched_mc_power_savings parameter (cleverly hidden under /sys/devices/system/cpu) was introduced in the 2.6.18 kernel. If this parameter is set to one (the default is zero), it changes the scheduler load balancing code in an interesting way: it makes an ongoing effort to gather together processes on the smallest number of CPUs. If the system is not heavily loaded, this policy will result in some processors being entirely idle; those processors can then be put into a deep sleep and left there for some time. And that, of course, results in lower power consumption, which is a good thing.Vaidyanathan Srinivasan recently noted that, while this policy works well in a number of situations, there are others where things could be better. The sched_mc_power_savings policy is relatively conservative in how it loads processes onto CPUs, taking care to not overload those CPUs and create excessive latency for applications. As a result, the workload on a large system can still end up spread out more widely than might be optimal, especially if the workload is bursty. In response, Vaidyanathan suggests making the power savings policy more flexible, with the system administrator being able to select a combination of power savings and latency which works well for the workload. On systems where power savings matters a lot, a more aggressive mode (which would pack processes more tightly into CPUs) could be chosen.
This suggestion was controversial. Nobody disputes the idea that smarter power savings policy would be a good idea. But there is resistance to the idea of creating more tuning knobs to control this policy; instead, it is felt, the kernel should work out the optimal policy on its own. As Andi Kleen puts it:
There are a couple of answers to that objection. One is that the system cannot know, on its own, what priorities the users and/or administrators have. Those priorities could even change over time, with performance being emphasized during peak times and low power usage otherwise. Additionally, not all users see "performance" the same way; some want responsiveness and low latency, while others place a higher priority on throughput. If the system cannot simultaneously optimize all of those parameters, it will need guidance from somewhere to choose the best policy.
And that's where the other answer comes in: that guidance could come from user space. Special-purpose software running on large installations can monitor the performance of important applications and adjust resources (and policies) to get the desired results. Or, in a somewhat different vision, individual applications could register their performance needs and expected behavior. In this case, the kernel is charged with somehow mediating between applications with different expectations and coming up with a reasonable set of policies.
In the middle of all this, it was pointed out that a mechanism by which expectations can be communicated to the kernel already exists: the nice level (priority) associated with each process. In a simple view of the world, a process's nice level would tell the kernel how to manage it with regard to power savings; on a system with a number of niced processes, those processes would be gathered onto a subset of processors during period of relatively low activity. In essence, this policy says that it is not worthwhile to power up more processors just to give better throughput to low-priority processes.
It does not take long, though, to come up with situations where the use of nice levels leads to the wrong sort of results. Peter Zijlstra observed that he has niced processes (created with distcc) which should have access to all of the CPU power available, but which should not contend with interactive processes on the same system. In such cases, those processes should have a high nice value with regard to CPU usage, but that should not interfere with their ability to move onto idle CPUs, if any exist. So the answer may take the form of a separate "powernice" command which would regulate a process's priority when it comes to causing the system to draw more power.
Nice levels may (or may not) prove to be sufficient information to let the system choose an optimal power policy. But it will be some time before anybody really knows that; work on optimizing power usage - especially on server systems - is not in an advanced state. So pressure to add tuning knobs for power policies may continue, for one simple reason: people want ways of experimenting with different policies and seeing what the results are. Until we really know what the effects of different policies are - on both power usage and system performance - it will be hard to build a system which can choose an optimal policy on its own.
TASK_KILLABLE
Like most versions of Unix, Linux has two fundamental ways in which a process can be put to sleep. A process which is placed in the TASK_INTERRUPTIBLE state will sleep until either (1) something explicitly wakes it up, or (2) a non-masked signal is received. The TASK_UNINTERRUPTIBLE state, instead, ignores signals; processes in that state will require an explicit wakeup before they can run again.There are advantages and disadvantages to each type of sleep. Interruptible sleeps enable faster response to signals, but they make the programming harder. Kernel code which uses interruptible sleeps must always check to see whether it woke up as a result of a signal, and, if so, clean up whatever it was doing and return -EINTR back to user space. The user-space side, too, must realize that a system call was interrupted and respond accordingly; not all user-space programmers are known for their diligence in this regard. Making a sleep uninterruptible eliminates these problems, but at the cost of being, well, uninterruptible. If the expected wakeup event does not materialize, the process will wait forever and there is usually nothing that anybody can do about it short of rebooting the system. This is the source of the dreaded, unkillable process which is shown to be in the "D" state by ps.
Given the highly obnoxious nature of unkillable processes, one would think that interruptible sleeps should be used whenever possible. The problem with that idea is that, in many cases, the introduction of interruptible sleeps is likely to lead to application bugs. As recently noted by Alan Cox:
So it would seem that we are stuck with the occasional blocked-and-immortal process forever.
Or maybe not. A while back, Matthew Wilcox realized that many of these concerns about application bugs do not really apply if the application is about to be killed anyway. It does not matter if the developer thought about the possibility of an interrupted system call if said system call is doomed to never return to user space. So Matthew created a new sleeping state, called TASK_KILLABLE; it behaves like TASK_UNINTERRUPTIBLE with the exception that fatal signals will interrupt the sleep.
With TASK_KILLABLE comes a new set of primitives for waiting for events and acquiring locks:
int wait_event_killable(wait_queue_t queue, condition); long schedule_timeout_killable(signed long timeout); int mutex_lock_killable(struct mutex *lock); int wait_for_completion_killable(struct completion *comp); int down_killable(struct semaphore *sem);
For each of these functions, the return value will be zero for a normal, successful return, or a negative error code in case of a fatal signal. In the latter case, kernel code should clean up and return, enabling the process to be killed.
The TASK_KILLABLE patch was merged for the 2.6.25 kernel, but that does not mean that the unkillable process problem has gone away. The number of places in the kernel (as of 2.6.26-rc8) which are actually using this new state is quite small - as in, one need not worry about running out of fingers while counting them. The NFS client code has been converted, which can only be a welcome development. But there are very few other uses of TASK_KILLABLE, and none at all in device drivers, which is often where processes get wedged.
It can take time for a new API to enter widespread use in the kernel, especially when it supplements an existing functionality which works well enough most of the time. Additionally, the benefits of a mass conversion of existing code to killable sleeps are not entirely clear. But there are almost certainly places in the kernel which could be improved by this change, if users and developers could identify the spots where processes get hung. It also makes sense to use killable sleeps in new code unless there is some pressing reason to disallow interruptions altogether.
Some development statistics for 2.6.26 - and beyond
When 2.6.26-rc1 was released, your editor noted that, at a mere 7500 commits, it looked like 2.6.26 would be a smaller than usual development cycle. Interestingly, though, 2.6.26 has caught up. As of this writing (waiting for 2.6.26-rc9), this development cycle has incorporated 10,102 changesets for a net addition of 169,439 lines of code to the kernel. That makes it still significantly smaller than 2.6.25, but it is, by no means small. The developer base remains as broad as ever: 1065 developers (representing some 150 companies) have contributed to 2.6.26; just over 1/3 of those developers contributed one single changeset.The 2.6 development model says that the bulk of the changes should be merged during the merge window (before the -rc1 release), with only fixes coming thereafter. Here's how things break down for recent releases:
Release Changesets merged For -rc1 after -rc1 2.6.23 4505 2570 2.6.24 7132 3221 2.6.25 9629 3078 2.6.26 7555 2577
So, while the bulk of the big patches enter the kernel during the merge window, at least 25% of the total - and often more - come thereafter. That's a lot of fixes.
So who were the most active developers this time around? Here's the top 20:
Most active 2.6.26 developers
By changesets Harvey Harrison 218 2.2% Bartlomiej Zolnierkiewicz 197 1.9% Glauber Costa 195 1.9% Adrian Bunk 180 1.8% Joe Perches 160 1.6% Pavel Emelyanov 148 1.5% Ingo Molnar 144 1.4% Denis V. Lunev 140 1.4% Michael Krufky 130 1.3% Mauro Carvalho Chehab 116 1.1% Al Viro 114 1.1% David S. Miller 103 1.0% Tejun Heo 96 0.9% Johannes Berg 96 0.9% Alan Cox 91 0.9% Takashi Iwai 88 0.9% YOSHIFUJI Hideaki 85 0.8% Alexey Starikovskiy 84 0.8% Ivo van Doorn 80 0.8% Bjorn Helgaas 77 0.8%
By changed lines Stephen Hemminger 41762 5.9% Adrian Bunk 28523 4.0% David S. Miller 19178 2.7% Steven Toth 18681 2.6% Ben Hutchings 15535 2.2% Frank Blaschka 14527 2.0% Xiantao Zhang 12935 1.8% Hans Verkuil 12393 1.7% Tejun Heo 10462 1.5% Sebastian Siewior 9519 1.3% Harvey Harrison 9161 1.3% Peter Tiedemann 8483 1.2% Matthew Wilcox 8059 1.1% Paul Walmsley 7635 1.1% Kumar Gala 7152 1.0% Andrew Victor 7062 1.0% Johannes Berg 6544 0.9% Glauber Costa 6260 0.9% Mike Frysinger 6177 0.9% Joe Perches 5773 0.8%
In terms of the number of changesets merged, Harvey Harrison got to the top of the list with a wide variety of of janitorial fixes. Bartlomiej Zolnierkiewicz continues to put significant effort into cleaning up the IDE subsystem, even though most distributors have moved away from that code and are using the newer PATA layer instead. Glauber Costa has been tirelessly working in the x86 architecture code; in particular, he continues to work toward the goal of unifying the 32-bit and 64-bit code to the greatest extent possible. Adrian Bunk has made a career of cleaning up the code base and eliminating unneeded code. And Joe Perches dedicated much time to eliminating warnings from the checkpatch.pl script.
There have been complaints from the developers that the volume of "cleanup" patches is reaching a point that it is drowning out the rest and interfering with "real work." We're seeing some of that volume here, with three of the top five changeset contributors doing cleanup work - some of which is seen to be more valuable than the rest.
On the lines changed side, we see a mostly different set of developers. In this case, the top slots were earned by deleting code. Stephen Hemminger finally succeeded in getting rid of the old sk98lin driver. Adrian Bunk tore out the bcm43xx driver, the ieee80311 software MAC layer, the xircom_tulip_cb driver, and various other bits and pieces. David Miller removed a bunch of old SPARC code, but replaced it with various other facilities; he also took the PowerPC low-level memory manager and made it generic. Steven Toth works in the Video4Linux layer; he added some new drivers and a bunch of cleanups. Ben Hutchings added the Solarstorm SFC4000 driver.
When one thinks about 2.6.26 features, the things that come to mind include KGDB, almost-ready network namespaces, almost-ready mesh networking support, a working (shall we say "almost ready"?) realtime group scheduler, read-only bind mounts, page attribute table support, the object debugging infrastructure, and, of course, the vast pile of new drivers. One has to look hard to find the developers behind that work in the lists above (some of them are certainly there). Which just reinforces an important point: there is interest and information in counting changesets and lines changed, but the correlation between those numbers and serious accomplishments in kernel programming is weak at best. Unfortunately, "real work" is awfully hard to measure in any sort of automated way.
So what the heck; we'll go back to the numbers we can measure. Here's the most active companies for 2.6.26:
Most active 2.6.26 employers
By changesets (None) 2085 20.6% Red Hat 1130 11.2% (Unknown) 906 8.9% IBM 609 6.0% Novell 597 5.9% Intel 469 4.6% Parallels 312 3.1% SGI 211 2.1% Movial 180 1.8% Oracle 142 1.4% Analog Devices 134 1.3% HP 124 1.2% MontaVista 122 1.2% (Consultant) 116 1.1% Freescale 109 1.1% QLogic 97 1.0% Fujitsu 95 0.9% 94 0.9% (Academia) 89 0.9% Marvell 88 0.9%
By lines changed (None) 111703 15.7% IBM 73601 10.3% Red Hat 56331 7.9% Intel 50297 7.1% (Unknown) 44699 6.3% Vyatta 41835 5.9% Novell 33745 4.7% Movial 28632 4.0% Hauppauge 20234 2.8% Analog Devices 18363 2.6% (Consultant) 16397 2.3% Solarflare 15585 2.2% Freescale 15090 2.1% MontaVista 14013 2.0% QLogic 13327 1.9% SGI 10351 1.5% Marvell 7881 1.1% Wind River 7770 1.1% Oracle 7680 1.1% Pengutronix 7334 1.0%
This list tends not to change too much from one release to the next; in particular, the top companies are always the same.
If we look at who is attaching Signed-off-by tags to code they didn't write, we get a sense for who the gatekeepers to the kernel are. These are the developers and companies who are herding code into the mainline:
Sign-offs in the 2.6.26 kernel
By developer Andrew Morton 1377 14.1% Ingo Molnar 961 9.8% David S. Miller 667 6.8% John W. Linville 551 5.6% Mauro Carvalho Chehab 543 5.6% Jeff Garzik 471 4.8% Thomas Gleixner 279 2.9% Greg KH 267 2.7% Linus Torvalds 256 2.6% Paul Mackerras 220 2.2% Takashi Iwai 208 2.1% James Bottomley 203 2.1% Len Brown 200 2.0% Russell King 167 1.7% Avi Kivity 160 1.6% Bryan Wu 140 1.4% Roland Dreier 130 1.3% Lachlan McIlroy 108 1.1% Bartlomiej Zolnierkiewicz 94 1.0% Ralf Baechle 93 1.0%
By employer Red Hat 3010 30.8% 1378 14.1% (None) 1000 10.2% Novell 731 7.5% IBM 577 5.9% Intel 497 5.1% linutronix 283 2.9% Linux Foundation 256 2.6% (Unknown) 206 2.1% (Consultant) 206 2.1% Hansen Partnership 203 2.1% SGI 166 1.7% Qumranet 160 1.6% Analog Devices 149 1.5% Cisco 130 1.3% MIPS Technologies 93 1.0% Oracle 57 0.6% Freescale 55 0.6% Renesas Technology 54 0.6% Univ. of Michigan CITI 47 0.5%
Once again, these numbers tend not to change that much from one development cycle to the next. Subsystem maintainers do not change often.
What's next?
This is the first full development cycle where the linux-next tree was in operation. At this stage in the cycle, linux-next should look very much like 2.6.27 - or, at least, 2.6.27-rc1. Your editor pulled the July 2 linux-next tree and ran some statistics; this tree contains 6527 changesets from 619 developers. Just over 400,000 lines of code are touched, with a net addition of 38,000 lines.
If linux-next is to be believed, the most active 2.6.27 developers will be:
Most active pre-2.6.27 developers
By changesets Avi Kivity 499 7.6% Artem Bityutskiy 292 4.5% Bartlomiej Zolnierkiewicz 150 2.3% Ingo Molnar 142 2.2% Yinghai Lu 139 2.1% Adrian Hunter 121 1.9% Alan Cox 101 1.5% Xiantao Zhang 100 1.5% Tomas Winkler 91 1.4% Rusty Russell 89 1.4% David Woodhouse 86 1.3% Adrian Bunk 84 1.3% Steven Rostedt 83 1.3% Jonathan Corbet 74 1.1% Arnd Bergmann 73 1.1% Jean Delvare 67 1.0% Harvey Harrison 64 1.0% David Chinner 63 1.0% Lennert Buytenhek 61 0.9% Thomas Gleixner 61 0.9%
By changed lines David Woodhouse 44833 6.7% Artem Bityutskiy 41891 6.3% Eilon Greenstein 18614 2.8% Xiantao Zhang 17223 2.6% Alan Cox 14850 2.2% Jaswinder Singh 10805 1.6% David Brownell 9618 1.4% Stephen Rothwell 9043 1.4% Lennert Buytenhek 9029 1.3% Avi Kivity 8593 1.3% Steven Rostedt 7923 1.2% Adrian Bunk 7424 1.1% Laurent Pinchart 7200 1.1% Yinghai Lu 6850 1.0% Yaniv Rosner 6512 1.0% Carsten Otte 6442 1.0% Tomas Winkler 6250 0.9% Josh Boyer 5292 0.8% Adrian Hunter 5155 0.8% Michael Chan 5133 0.8%
These numbers reflect a number of the larger developments which can be expected for 2.6.27: incredible amounts of KVM work, the merging of the UBIFS filesystem, the ftrace tracing framework, a lot of reworking of the TTY layer, a lot of firmware thrashing, and ongoing big kernel lock removal work.
It will be most interesting to see how these numbers compare with what actually shows up in 2.6.27-rc1. Recent numbers suggest that quite a few patches will hit the mainline without having been in the linux-next tree - either that, or 2.6.27 will be a relatively small release. If nothing else, we will see which developers do not yet get their work into linux-next for integration testing ahead of the merge window.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Janitorial
Memory management
Networking
Security-related
Virtualization and containers
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
Next page:
Distributions>>
