Release status
Kernel release status
The current 2.6 development kernel remains 2.6.26-rc8; no 2.6
prepatches have been released over the last week.
The current stable 2.6 kernel remains 2.6.25.9. The 2.6.25.10
update, with about a dozen fixes, is currently in the review process; it
will probably be released on July 3.
Kernel development news
Quotes of the week
Open source is rapid at progressing towards common goals ... it's
when the goals aren't common that progress gets bogged down.
-- James Bottomley
If we put stuff in sysfs then people WILL use it and we WILL need
to support it for ever. Pointing at some document and saying "call
my lawyer" just won't cut it. sysfs is part of the kernel ABI. We
should design our interfaces there as carefully as we design any
others.
-- Andrew Morton
I hope that nothing I ever say holds back our developers or
community from doing what is right. I did not realize that the GNU
and Linux kernel hackers were such dutiful slaves.
-- Theo de Raadt
Ext4 hacker Ted Ts'o converts his laptop
A big step in the development of a new filesystem is when the developers feel confident enough to start trusting their data to it. For ext4, it appears we have reached that point, as Ted Ts'o has switched his laptop to use it. "So far I’ve found one bug as a result of my using ext4 in production (if delayed allocation is enabled, i_blocks doesn’t get updated until the block allocation takes place, so files can appear to have 0k blocksize right after they are created, which is confusing/unfortunate), but nothing super serious yet. I will be doing backups a bit more frequently until I’m absolutely sure things are rock solid, though!"
Making power policy just work
By Jonathan Corbet
June 30, 2008
The sched_mc_power_savings parameter (cleverly hidden under
/sys/devices/system/cpu) was introduced in the 2.6.18 kernel. If
this parameter is set to one (the default is zero), it changes the scheduler
load balancing code in an interesting way: it makes an ongoing effort to
gather together processes on the smallest number of CPUs. If the system is
not heavily loaded, this policy will result in some processors being
entirely idle; those processors can then be put into a deep sleep and left
there for some time. And that, of course, results in lower power
consumption, which is a good thing.
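For readers who want to experiment, the knob can be flipped from user space by writing a number into the sysfs file. What follows is only a minimal sketch: the helper names write_power_savings() and read_power_savings() are made up for this example, and the caller is assumed to pass the full path of the file (on a real system, sched_mc_power_savings under /sys/devices/system/cpu, which normally requires root access to write).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write an integer policy value to a sysfs-style file.
 * Returns 0 on success, -1 on failure. */
static int write_power_savings(const char *path, int value)
{
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fprintf(f, "%d\n", value) < 0) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer; a write error may only show up here. */
	return fclose(f) == 0 ? 0 : -1;
}

/* Hypothetical helper: read the current value back, or -1 on failure. */
static int read_power_savings(const char *path)
{
	FILE *f;
	int value;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

Calling write_power_savings() with the path of the sysfs file and a value of 1 would request the power-saving behavior described above; writing 0 restores the default.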
Vaidyanathan Srinivasan recently noted that, while this policy
works well in a number of situations, there are others where things could
be better. The sched_mc_power_savings policy is relatively conservative in
how it loads processes onto CPUs, taking care to not overload those CPUs
and create excessive latency for applications. As a result, the workload
on a large system can still end up spread out more widely than might be
optimal, especially if the workload is bursty. In response, Vaidyanathan
suggests making the power savings policy more flexible, with the system
administrator being able to select a combination of power savings and
latency which works well for the workload. On systems where power savings
matters a lot, a more aggressive mode (which would pack processes more
tightly into CPUs) could be chosen.
This suggestion was controversial. Nobody disputes the idea that
smarter power savings policy would be a good idea. But there is resistance
to the idea of creating more tuning knobs to control this policy; instead,
it is felt, the kernel should work out the optimal policy on its own. As
Andi Kleen puts it:
Tunables are basically "we give up, let's push the problem to the
user" which is not nice. I suspect a lot of users won't even know
if their workloads are bursty or not. Or they might have workloads
which are both bursty and not bursty.
There are a couple of answers to that objection. One is that the system
cannot know, on its own, what priorities the users and/or administrators
have. Those priorities could even change over time, with performance being
emphasized during peak times and low power usage otherwise. Additionally,
not all users see "performance" the same way; some want responsiveness and
low latency, while others place a higher priority on throughput. If the
system cannot simultaneously optimize all of those parameters, it will need
guidance from somewhere to choose the best policy.
And that's where the other answer comes in: that guidance could come from
user space. Special-purpose software running on large installations can
monitor the performance of important applications and adjust resources (and
policies) to get the desired results. Or, in a somewhat different vision,
individual applications could register their performance needs and expected
behavior. In this case, the kernel is charged with somehow mediating
between applications with different expectations and coming up with a
reasonable set of policies.
In the middle of all this, it was pointed out that a mechanism by which
expectations can be communicated to the kernel already exists: the nice
level (priority) associated with each process. In a simple view of the
world, a process's nice level would tell the kernel how to manage it with
regard to power savings; on a system with a number of niced processes,
those processes would be gathered onto a subset of processors during periods
of relatively low activity. In essence, this policy says that it is not
worthwhile to power up more processors just to give better throughput to
low-priority processes.
It does not take long, though, to come up with situations where the use of
nice levels leads to the wrong sort of results. Peter Zijlstra observed that he has niced processes (created
with distcc) which should have access to all of the CPU power available,
but which should not contend with interactive processes on the same
system. In such cases, those processes should have a high nice value with
regard to CPU usage, but that should not interfere with their ability to
move onto idle CPUs, if any exist. So the answer may take the form of a
separate "powernice" command which would regulate a process's priority when
it comes to causing the system to draw more power.
Nice levels may (or may not) prove to be sufficient information to let the
system choose an optimal power policy. But it will be some time before
anybody really knows that; work on optimizing power usage - especially on
server systems - is not in an advanced state. So pressure to add tuning
knobs for power policies may continue, for one simple reason: people want
ways of experimenting with different policies and seeing what the results
are. Until we really know what the effects of different policies are - on
both power usage and system performance - it will be hard to build a system
which can choose an optimal policy on its own.
TASK_KILLABLE
By Jonathan Corbet
July 1, 2008
Like most versions of Unix, Linux has two fundamental ways in which a
process can be put to sleep. A process which is placed in the
TASK_INTERRUPTIBLE state will sleep until either
(1) something explicitly wakes it up, or (2) a non-masked signal
is received. The TASK_UNINTERRUPTIBLE state, instead, ignores
signals; processes in that state will require an explicit wakeup before
they can run again.
There are advantages and disadvantages to each type of sleep.
Interruptible sleeps enable faster response to signals, but they make the
programming harder. Kernel code which uses interruptible sleeps must
always check to see whether it woke up as a result of a signal, and, if so,
clean up whatever it was doing and return -EINTR back to user
space. The user-space side, too, must realize that a system call was
interrupted and respond accordingly; not all user-space programmers are
known for their diligence in this regard. Making a sleep uninterruptible
eliminates these problems, but at the cost of being, well,
uninterruptible. If the expected wakeup event does not materialize, the
process will wait forever and there is usually nothing that anybody can do
about it short of rebooting the system. This is the source of the dreaded,
unkillable process which is shown to be in the "D" state by ps.
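On the user-space side, "responding accordingly" often means nothing more than restarting the call. The following is a minimal sketch of that idiom using the POSIX read() call and the EINTR error code; the wrapper name read_retry() is invented for this example, and real applications may prefer to check a flag set by their signal handler before retrying.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Read up to count bytes, restarting the call if a signal interrupts it
 * before any data has been transferred. Returns whatever read() returns. */
static ssize_t read_retry(int fd, void *buf, size_t count)
{
	ssize_t n;

	assert(buf != NULL);
	do {
		n = read(fd, buf, count);
	} while (n < 0 && errno == EINTR);
	return n;
}
```

Code which calls read() directly and treats every negative return as a hard failure is exactly the sort of application bug that interruptible sleeps tend to expose.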
Given the highly obnoxious nature of unkillable processes, one would think
that interruptible sleeps should be used whenever possible. The problem
with that idea is that, in many cases, the introduction of interruptible
sleeps is likely to lead to application bugs. As recently noted by Alan Cox:
Unix tradition (and thus almost all applications) believe file
store writes to be non signal interruptible. It would not be safe
or practical to change that guarantee.
So it would seem that we are stuck with the occasional blocked-and-immortal
process forever.
Or maybe not. A while back, Matthew Wilcox realized that many of these
concerns about application bugs do not really apply if the application is
about to be killed anyway. It does not matter if the developer thought
about the possibility of an interrupted system call if said system call is
doomed to never return to user space. So Matthew created a new sleeping
state, called TASK_KILLABLE; it behaves like
TASK_UNINTERRUPTIBLE with the exception that fatal signals will
interrupt the sleep.
With TASK_KILLABLE comes a new set of primitives for waiting for
events and acquiring locks:
int wait_event_killable(wait_queue_head_t queue, condition);
long schedule_timeout_killable(signed long timeout);
int mutex_lock_killable(struct mutex *lock);
int wait_for_completion_killable(struct completion *comp);
int down_killable(struct semaphore *sem);
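For each of these functions except schedule_timeout_killable(), the return
value will be zero for a normal, successful return, or a negative error code
in case of a fatal signal. In the latter case, kernel code should clean up
and return, enabling the process to be killed. schedule_timeout_killable(),
like schedule_timeout(), returns the time remaining instead.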
The TASK_KILLABLE patch was merged for the 2.6.25 kernel, but that
does not mean that the unkillable process problem has gone away. The
number of places in the kernel (as of 2.6.26-rc8) which are actually using
this new state is quite small - as in, one need not worry about running out
of fingers while counting them. The NFS client code has been converted,
which can only be a welcome development. But there are very few other
uses of TASK_KILLABLE, and none at all in device drivers, which is
often where processes get wedged.
It can take time for a new API to enter widespread use in the kernel,
especially when it supplements an existing functionality which works well
enough most of the time. Additionally, the benefits of a mass conversion
of existing code to killable sleeps are not entirely clear. But there are
almost certainly places in the kernel which could be improved by this
change, if users and developers could identify the spots where processes
get hung. It also makes sense to use killable sleeps in new code unless
there is some pressing reason to disallow interruptions altogether.
Some development statistics for 2.6.26 - and beyond
By Jonathan Corbet
July 2, 2008
When 2.6.26-rc1 was released, your editor noted that, at a mere 7500
commits, it looked like 2.6.26 would be a smaller than usual development
cycle. Interestingly, though, 2.6.26 has caught up. As of this writing
(waiting for 2.6.26-rc9), this development cycle has incorporated 10,102
changesets for a net addition of 169,439 lines of code to the kernel. That
makes it still significantly smaller than 2.6.25, but it is by no means
small. The developer base remains as broad as ever: 1065 developers
(representing some 150 companies) have contributed to 2.6.26; just over 1/3
of those developers contributed one single changeset.
The 2.6 development model says that the bulk of the changes should be
merged during the merge window (before the -rc1 release), with only fixes
coming thereafter. Here's how things break down for recent releases:
| Release | Changesets merged for -rc1 | Changesets merged after -rc1 |
| 2.6.23 | 4505 | 2570 |
| 2.6.24 | 7132 | 3221 |
| 2.6.25 | 9629 | 3078 |
| 2.6.26 | 7555 | 2577 |
So, while the bulk of the big patches enter the kernel during the merge
window, roughly a quarter of the total - and often more - comes thereafter.
That's a lot of fixes.
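The post-merge-window share is simple arithmetic on the two columns of the table: the changesets merged after -rc1 divided by the sum of both columns. A small sketch of the calculation (the function name post_rc1_share() is made up for this example):

```c
#include <assert.h>

/* Share of a release's changesets merged after -rc1, in percent. */
static double post_rc1_share(unsigned for_rc1, unsigned after_rc1)
{
	unsigned total = for_rc1 + after_rc1;

	assert(total > 0);
	return 100.0 * after_rc1 / total;
}
```

Applied to the table, this works out to roughly 36% for 2.6.23, 31% for 2.6.24, 24% for 2.6.25, and 25% for 2.6.26.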
So who were the most active developers this time around? Here's the top
20:
Most active 2.6.26 developers, by changesets
| Developer | Changesets | Share |
| Harvey Harrison | 218 | 2.2% |
| Bartlomiej Zolnierkiewicz | 197 | 1.9% |
| Glauber Costa | 195 | 1.9% |
| Adrian Bunk | 180 | 1.8% |
| Joe Perches | 160 | 1.6% |
| Pavel Emelyanov | 148 | 1.5% |
| Ingo Molnar | 144 | 1.4% |
| Denis V. Lunev | 140 | 1.4% |
| Michael Krufky | 130 | 1.3% |
| Mauro Carvalho Chehab | 116 | 1.1% |
| Al Viro | 114 | 1.1% |
| David S. Miller | 103 | 1.0% |
| Tejun Heo | 96 | 0.9% |
| Johannes Berg | 96 | 0.9% |
| Alan Cox | 91 | 0.9% |
| Takashi Iwai | 88 | 0.9% |
| YOSHIFUJI Hideaki | 85 | 0.8% |
| Alexey Starikovskiy | 84 | 0.8% |
| Ivo van Doorn | 80 | 0.8% |
| Bjorn Helgaas | 77 | 0.8% |
Most active 2.6.26 developers, by changed lines
| Developer | Lines changed | Share |
| Stephen Hemminger | 41762 | 5.9% |
| Adrian Bunk | 28523 | 4.0% |
| David S. Miller | 19178 | 2.7% |
| Steven Toth | 18681 | 2.6% |
| Ben Hutchings | 15535 | 2.2% |
| Frank Blaschka | 14527 | 2.0% |
| Xiantao Zhang | 12935 | 1.8% |
| Hans Verkuil | 12393 | 1.7% |
| Tejun Heo | 10462 | 1.5% |
| Sebastian Siewior | 9519 | 1.3% |
| Harvey Harrison | 9161 | 1.3% |
| Peter Tiedemann | 8483 | 1.2% |
| Matthew Wilcox | 8059 | 1.1% |
| Paul Walmsley | 7635 | 1.1% |
| Kumar Gala | 7152 | 1.0% |
| Andrew Victor | 7062 | 1.0% |
| Johannes Berg | 6544 | 0.9% |
| Glauber Costa | 6260 | 0.9% |
| Mike Frysinger | 6177 | 0.9% |
| Joe Perches | 5773 | 0.8% |
In terms of the number of changesets merged, Harvey Harrison got to the
top of the list with a wide variety of janitorial fixes. Bartlomiej
Zolnierkiewicz continues to put significant effort into cleaning up the IDE
subsystem, even though most distributors have moved away from that code and
are using the newer PATA layer instead. Glauber Costa has been tirelessly
working in the x86 architecture code; in particular, he continues to work
toward the goal of unifying the 32-bit and 64-bit code to the greatest
extent possible. Adrian Bunk has made a career of cleaning up the code
base and eliminating unneeded code. And Joe Perches dedicated much time to
eliminating warnings from the checkpatch.pl script.
There have been complaints from the developers that the volume of "cleanup"
patches is reaching the point where it is drowning out the rest and
interfering with "real work." We're seeing some of that volume here, with
three of the top five changeset contributors doing cleanup work - some of
which is seen to be more valuable than the rest.
On the lines changed side, we see a mostly different set of developers. In
this case, the top slots were earned by deleting code. Stephen Hemminger
finally succeeded in getting rid of the old sk98lin driver. Adrian Bunk
tore out the bcm43xx driver, the ieee80211 software MAC layer, the
xircom_tulip_cb driver, and various other bits and pieces. David Miller
removed a bunch of old SPARC code, but replaced it with various other
facilities; he also took the PowerPC low-level memory manager and made it
generic. Steven Toth works in the Video4Linux layer; he added some new
drivers and a bunch of cleanups. Ben Hutchings added the Solarstorm
SFC4000 driver.
When one thinks about 2.6.26 features, the things that come to mind include
KGDB, almost-ready network namespaces, almost-ready mesh networking
support, a working (shall we say "almost ready"?) realtime group scheduler,
read-only bind mounts, page
attribute table support, the object debugging infrastructure, and, of
course, the vast pile of new drivers. One has to look hard to find the
developers behind that work in the lists above (some of them are certainly
there). Which just reinforces an important point: there is interest and
information in counting changesets and lines changed, but the correlation
between those numbers and serious accomplishments in kernel programming is
weak at best. Unfortunately, "real work" is awfully hard to measure in any
sort of automated way.
So what the heck; we'll go back to the numbers we can measure. Here are the
most active employers for 2.6.26:
Most active 2.6.26 employers, by changesets
| Employer | Changesets | Share |
| (None) | 2085 | 20.6% |
| Red Hat | 1130 | 11.2% |
| (Unknown) | 906 | 8.9% |
| IBM | 609 | 6.0% |
| Novell | 597 | 5.9% |
| Intel | 469 | 4.6% |
| Parallels | 312 | 3.1% |
| SGI | 211 | 2.1% |
| Movial | 180 | 1.8% |
| Oracle | 142 | 1.4% |
| Analog Devices | 134 | 1.3% |
| HP | 124 | 1.2% |
| MontaVista | 122 | 1.2% |
| (Consultant) | 116 | 1.1% |
| Freescale | 109 | 1.1% |
| QLogic | 97 | 1.0% |
| Fujitsu | 95 | 0.9% |
| Google | 94 | 0.9% |
| (Academia) | 89 | 0.9% |
| Marvell | 88 | 0.9% |
Most active 2.6.26 employers, by lines changed
| Employer | Lines changed | Share |
| (None) | 111703 | 15.7% |
| IBM | 73601 | 10.3% |
| Red Hat | 56331 | 7.9% |
| Intel | 50297 | 7.1% |
| (Unknown) | 44699 | 6.3% |
| Vyatta | 41835 | 5.9% |
| Novell | 33745 | 4.7% |
| Movial | 28632 | 4.0% |
| Hauppauge | 20234 | 2.8% |
| Analog Devices | 18363 | 2.6% |
| (Consultant) | 16397 | 2.3% |
| Solarflare | 15585 | 2.2% |
| Freescale | 15090 | 2.1% |
| MontaVista | 14013 | 2.0% |
| QLogic | 13327 | 1.9% |
| SGI | 10351 | 1.5% |
| Marvell | 7881 | 1.1% |
| Wind River | 7770 | 1.1% |
| Oracle | 7680 | 1.1% |
| Pengutronix | 7334 | 1.0% |
This list tends not to change too much from one release to the next; in
particular, the top companies are always the same.
If we look at who is attaching Signed-off-by tags to code they didn't
write, we get a sense for who the gatekeepers to the kernel are. These are
the developers and companies who are herding code into the mainline:
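As a rough idea of how such a count could be made, the sketch below scans one commit message for Signed-off-by lines which do not mention the commit's author. It is not the tool used to produce the numbers here; the function names are invented, the author is assumed to be identified by a string (such as an email address) appearing on the tag line, and the names and example.org addresses in the usage note are placeholders.

```c
#include <assert.h>
#include <string.h>

#define SOB_TAG "Signed-off-by:"

/* Return 1 if needle occurs within the first len bytes of line. */
static int line_contains(const char *line, size_t len, const char *needle)
{
	size_t nlen = strlen(needle);
	size_t i;

	if (nlen == 0 || nlen > len)
		return 0;
	for (i = 0; i + nlen <= len; i++)
		if (memcmp(line + i, needle, nlen) == 0)
			return 1;
	return 0;
}

/* Count Signed-off-by lines in msg which do not mention author. */
static int count_foreign_signoffs(const char *msg, const char *author)
{
	const size_t taglen = strlen(SOB_TAG);
	const char *line = msg;
	int count = 0;

	assert(msg != NULL && author != NULL);
	while (*line) {
		const char *end = strchr(line, '\n');
		size_t len = end ? (size_t)(end - line) : strlen(line);
		const char *p = line;
		size_t rem = len;

		/* Tags are sometimes indented; skip leading blanks. */
		while (rem > 0 && (*p == ' ' || *p == '\t')) {
			p++;
			rem--;
		}
		if (rem >= taglen && strncmp(p, SOB_TAG, taglen) == 0 &&
		    !line_contains(p, rem, author))
			count++;

		line = end ? end + 1 : line + len;
	}
	return count;
}
```

Given a message signed off by both "Alice Author <alice@example.org>" and "Bob Maintainer <bob@example.org>", with alice@example.org as the author, the function counts one sign-off: Bob's. Summing that count per signer over every commit in a release would yield a table like the one which follows.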
Sign-offs in the 2.6.26 kernel, by developer
| Developer | Sign-offs | Share |
| Andrew Morton | 1377 | 14.1% |
| Ingo Molnar | 961 | 9.8% |
| David S. Miller | 667 | 6.8% |
| John W. Linville | 551 | 5.6% |
| Mauro Carvalho Chehab | 543 | 5.6% |
| Jeff Garzik | 471 | 4.8% |
| Thomas Gleixner | 279 | 2.9% |
| Greg KH | 267 | 2.7% |
| Linus Torvalds | 256 | 2.6% |
| Paul Mackerras | 220 | 2.2% |
| Takashi Iwai | 208 | 2.1% |
| James Bottomley | 203 | 2.1% |
| Len Brown | 200 | 2.0% |
| Russell King | 167 | 1.7% |
| Avi Kivity | 160 | 1.6% |
| Bryan Wu | 140 | 1.4% |
| Roland Dreier | 130 | 1.3% |
| Lachlan McIlroy | 108 | 1.1% |
| Bartlomiej Zolnierkiewicz | 94 | 1.0% |
| Ralf Baechle | 93 | 1.0% |
Sign-offs in the 2.6.26 kernel, by employer
| Employer | Sign-offs | Share |
| Red Hat | 3010 | 30.8% |
| Google | 1378 | 14.1% |
| (None) | 1000 | 10.2% |
| Novell | 731 | 7.5% |
| IBM | 577 | 5.9% |
| Intel | 497 | 5.1% |
| linutronix | 283 | 2.9% |
| Linux Foundation | 256 | 2.6% |
| (Unknown) | 206 | 2.1% |
| (Consultant) | 206 | 2.1% |
| Hansen Partnership | 203 | 2.1% |
| SGI | 166 | 1.7% |
| Qumranet | 160 | 1.6% |
| Analog Devices | 149 | 1.5% |
| Cisco | 130 | 1.3% |
| MIPS Technologies | 93 | 1.0% |
| Oracle | 57 | 0.6% |
| Freescale | 55 | 0.6% |
| Renesas Technology | 54 | 0.6% |
| Univ. of Michigan CITI | 47 | 0.5% |
Once again, these numbers tend not to change that much from one development
cycle to the next. Subsystem maintainers do not change often.
What's next?
This is the first full development cycle where the linux-next tree was in
operation. At this stage in the cycle, linux-next should look very much
like 2.6.27 - or, at least, 2.6.27-rc1. Your editor pulled the July 2
linux-next tree and ran some statistics; this tree contains 6527 changesets
from 619 developers. Just over 400,000 lines of code are touched, with a
net addition of 38,000 lines.
If linux-next is to be believed, the most active 2.6.27 developers will be:
Most active pre-2.6.27 developers, by changesets
| Developer | Changesets | Share |
| Avi Kivity | 499 | 7.6% |
| Artem Bityutskiy | 292 | 4.5% |
| Bartlomiej Zolnierkiewicz | 150 | 2.3% |
| Ingo Molnar | 142 | 2.2% |
| Yinghai Lu | 139 | 2.1% |
| Adrian Hunter | 121 | 1.9% |
| Alan Cox | 101 | 1.5% |
| Xiantao Zhang | 100 | 1.5% |
| Tomas Winkler | 91 | 1.4% |
| Rusty Russell | 89 | 1.4% |
| David Woodhouse | 86 | 1.3% |
| Adrian Bunk | 84 | 1.3% |
| Steven Rostedt | 83 | 1.3% |
| Jonathan Corbet | 74 | 1.1% |
| Arnd Bergmann | 73 | 1.1% |
| Jean Delvare | 67 | 1.0% |
| Harvey Harrison | 64 | 1.0% |
| David Chinner | 63 | 1.0% |
| Lennert Buytenhek | 61 | 0.9% |
| Thomas Gleixner | 61 | 0.9% |
Most active pre-2.6.27 developers, by changed lines
| Developer | Lines changed | Share |
| David Woodhouse | 44833 | 6.7% |
| Artem Bityutskiy | 41891 | 6.3% |
| Eilon Greenstein | 18614 | 2.8% |
| Xiantao Zhang | 17223 | 2.6% |
| Alan Cox | 14850 | 2.2% |
| Jaswinder Singh | 10805 | 1.6% |
| David Brownell | 9618 | 1.4% |
| Stephen Rothwell | 9043 | 1.4% |
| Lennert Buytenhek | 9029 | 1.3% |
| Avi Kivity | 8593 | 1.3% |
| Steven Rostedt | 7923 | 1.2% |
| Adrian Bunk | 7424 | 1.1% |
| Laurent Pinchart | 7200 | 1.1% |
| Yinghai Lu | 6850 | 1.0% |
| Yaniv Rosner | 6512 | 1.0% |
| Carsten Otte | 6442 | 1.0% |
| Tomas Winkler | 6250 | 0.9% |
| Josh Boyer | 5292 | 0.8% |
| Adrian Hunter | 5155 | 0.8% |
| Michael Chan | 5133 | 0.8% |
These numbers reflect a number of the larger developments which can be
expected for 2.6.27: incredible amounts of KVM work, the merging of the
UBIFS filesystem, the ftrace tracing framework, a lot of reworking of the
TTY layer, a lot of firmware thrashing, and ongoing big kernel lock removal
work.
It will be most interesting to see how these numbers compare with what
actually shows up in 2.6.27-rc1. Recent numbers suggest that quite a few
patches will hit the mainline without having been in the linux-next tree -
either that, or 2.6.27 will be a relatively small release. If nothing
else, we will see which developers do not yet get their work into
linux-next for integration testing ahead of the merge window.
Page editor: Jonathan Corbet