Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.39-rc7, released on May 9. Says Linus: "So things have been pretty quiet, and unless something major comes up I believe that this will be the last -rc." Full details can be found in the long-form changelog.
Stable updates: the 2.6.32.40, 2.6.33.13, and 2.6.38.6 stable updates were released on May 9. Each contains a long list of important fixes.
Quotes of the week
AMD and Coreboot
Coreboot (formerly LinuxBIOS) is a free BIOS implementation; it offers escape from a long list of woes stemming from poorly-written BIOSes, but it has always suffered from limited hardware support. AMD has now announced support for Coreboot on a new set of processors, and more going forward: "Finally, AMD is now committed to support coreboot for all future products on the roadmap starting next with support for the upcoming 'Llano' APU. AMD has come to realize that coreboot is useful in a myriad of applications and markets, even beyond what was originally considered. Consequently, AMD plans to continue building its support of coreboot in both features and roadmap for the foreseeable future."
Kernel development news
Ftrace, perf, and the tracing ABI
Arjan van de Ven recently reported that a 2.6.39 change in how tracepoint data is reported by the kernel broke powertop; he requested that the change be partially reverted. The resulting discussion covered the familiar problem of how tracepoints mix with the kernel ABI. But it also revealed some serious disagreements on how tracing data should be provided by the kernel and, perhaps, the direction that this interface will take in the future.
Each tracepoint defined in the kernel includes a number of fields containing values relevant to the specific event being documented. For example, the sched_switch tracepoint, which fires when the scheduler is switching between processes, includes the IDs of both processes, their priorities, and so on. Every tracepoint also has a few "common" fields, including the process ID, its flags, and the value of the preempt_count variable; if trace data is read in binary form, those values will appear at the beginning of the structure read from the kernel.
Prior to the 2.6.32 development cycle, those common fields also included the thread group ID; that value was removed in September, 2009. A look at the powertop source shows that the program expects that field to still be there (though it does not use it); its internally-defined structure for trace data includes a tgid field. So this change should have broken powertop, and it would have except for one other change: on the very same day, Steve Rostedt added the lock_depth common field to report whether the current process held the big kernel lock (BKL). The addition of this field was never meant to be permanent: its whole purpose, after all, was to help with the removal of the BKL from the kernel entirely.
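Why the swap of one field for the other went unnoticed can be seen by comparing header sizes. The sketch below uses Python's struct module; the widths of the first four common fields come from the sched_switch format listing, while the names common_tgid and common_lock_depth and their four-byte width are assumptions made for illustration (the width is consistent with the four-byte shrinkage described for 2.6.39).

```python
import struct

# Common fields that are always present: (name, struct code).
COMMON = [
    ("common_type", "H"),           # unsigned short, 2 bytes
    ("common_flags", "B"),          # unsigned char, 1 byte
    ("common_preempt_count", "B"),  # unsigned char, 1 byte
    ("common_pid", "i"),            # int, 4 bytes
]

def header_size(extra_fields):
    """Size in bytes of the common header plus any extra common fields."""
    # "=" selects standard sizes with no implicit padding.
    fmt = "=" + "".join(code for _, code in COMMON + extra_fields)
    return struct.calcsize(fmt)

# Assumed names and widths (4-byte ints) for the two fields in question.
old = header_size([("common_tgid", "i")])        # before the tgid removal
new = header_size([("common_lock_depth", "i")])  # after lock_depth was added
bare = header_size([])                           # with neither field present

print(old, new, bare)  # 12 12 8
```

Since both variants put the event-specific data at byte 12, a program with a hard-coded 12-byte header keeps reading the right bytes; only a header with neither field moves the payload.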
For 2.6.39, the lock_depth common field was removed, and powertop broke. Arjan subsequently complained; he also supplied a patch which put a zero-filled padding field where lock_depth used to be. Steve opposed the patch, on the grounds that, had powertop used the tracing ABI properly, it would not have broken. The kernel exports information about each tracepoint; for the above-mentioned sched_switch, that information can be examined from the command line:
    # cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
    name: sched_switch
    ID: 51
    format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;
        field:char prev_comm[16]; offset:8; size:16; signed:1;
        field:pid_t prev_pid; offset:24; size:4; signed:1;
        field:int prev_prio; offset:28; size:4; signed:1;
        field:long prev_state; offset:32; size:8; signed:1;
        field:char next_comm[16]; offset:40; size:16; signed:1;
        field:pid_t next_pid; offset:56; size:4; signed:1;
        field:int next_prio; offset:60; size:4; signed:1;
A properly-written program, Steve says, should read this file and use the offset values found there to obtain the data it is interested in. Linus seemed to agree that it would have been nice if things worked out that way, but that's not what happened. Instead, at least one program became dependent on the binary format of the trace data exported from the kernel. That is enough to make that format part of the kernel ABI; breaking that program counts as a regression. So Arjan's patch was merged.
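The approach Steve describes can be sketched in a few lines of Python. This is an illustration rather than anything powertop or the kernel ships: the format text is copied from the listing above, the sample record is fabricated, little-endian byte order is assumed, and the hard-coded offset stands in for a reader built around the older, four-bytes-longer header.

```python
import re
import struct

FORMAT_TEXT = """
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[16]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[16]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);"
    r"\s*size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)

def parse_format(text):
    """Map each field name to an (offset, size, signed) tuple."""
    fields = {}
    for m in FIELD_RE.finditer(text):
        # "char prev_comm[16]" -> "prev_comm"
        name = m.group("decl").split()[-1].split("[")[0]
        fields[name] = (int(m.group("offset")), int(m.group("size")),
                        m.group("signed") == "1")
    return fields

def read_int(record, fields, name):
    """Extract an integer field using the offset and size the kernel reports."""
    offset, size, signed = fields[name]
    return int.from_bytes(record[offset:offset + size], "little", signed=signed)

fields = parse_format(FORMAT_TEXT)

# Fabricated 64-byte record laid out exactly as the format text describes.
record = struct.pack("<HBBi16siiq16sii", 51, 0, 0, 1234,
                     b"bash", 1234, 120, 0, b"kworker/0:1", 42, 120)

next_pid = read_int(record, fields, "next_pid")

# A reader that assumes a 12-byte common header looks four bytes too far
# and picks up next_prio instead of next_pid.
HARDCODED_NEXT_PID_OFFSET = 60
stale = int.from_bytes(
    record[HARDCODED_NEXT_PID_OFFSET:HARDCODED_NEXT_PID_OFFSET + 4],
    "little", signed=True)

print(next_pid, stale)  # 42 120
```

The parsed offsets keep working if a common field is added or dropped, as long as the field names survive; the hard-coded offset silently returns the wrong value.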
Steve did not like this outcome; it went against all the effort which had gone into creating a means by which tracepoints could change without breaking applications. The alternative, he said, was to bury the kernel in compatibility cruft:
What happens if someone designs a tool that analyzes the XFS filesystem's 200+ tracepoints? Will all those tracepoints now become ABI?
The notion that XFS tracepoints could become part of the ABI was dismissed as "crazy talk" by Dave Chinner, but there is nothing inherently different about those tracepoints. They could, indeed, end up as part of the kernel ABI.
Steve was also concerned about the size of events; removal of lock_depth, beyond eliminating a (now) meaningless bit of data, also served to make each event four bytes smaller. There is always pressure to reduce the overhead of tracing, and reducing the bandwidth of the data copied to user space is part of that; adding the pad field goes against that goal. David Sharp (of Google) chimed in to note that data size matters a lot to them:
Steve had hoped to remove some of the other common fields as well (a change that Google has already made internally); that idea has gone by the wayside for now. Tracepoints, it seems, are ABI, even when the information they report no longer makes sense in the kernel.
The remainder of this discussion became a sort of bunfight between Steve and Ingo Molnar as they sought to place the blame for this problem and to determine how things will go in the future. Ingo attacked Steve for resisting the idea of unchanging tracepoints, accused him of maintaining ftrace as a fork of perf in the kernel (despite the fact that ftrace was there first), and said that perf needed to take over:
He also threatened to stop pulling tracing changes from Steve.
Steve, in return, blamed perf for bolting itself onto the ftrace infrastructure, then exporting ftrace's binary structures directly to user space. He blamed Ingo for blocking changes intended to improve the situation (for example, the creation of a separate directory for stable tracepoints agreed to at the 2010 Kernel Summit) and complained that Ingo was ignoring his attempts to create tracing infrastructure which works for everybody. He also worried, again, that set-in-stone tracepoint formats would impede progress in the kernel.
Despite all of this, Steve is willing to work toward the unification of ftrace and perf, as long as it doesn't mean leaving ftrace behind:
So it seems that, while there are clearly disagreements and tension between the developers in this area, there should also be room for a solution that works for everybody. Development emphasis will clearly continue to move toward perf, but, despite Ingo's desire to the contrary, ftrace will likely continue to be improved. We may see efforts to push applications toward libraries that can shield them from tracepoint changes, but, for now, every tracepoint added to the kernel will probably have to be considered to be part of its ABI; given that, developers should probably be reviewing new tracepoints more closely than they have been. And, with luck, instrumentation in Linux - which has improved considerably in the last few years - will continue to get better.
2.6.39 development statistics
As of this writing, the 2.6.39-rc7 prepatch has just been released and Linus has announced that it may be the last one before the final release. Being a traditional sort of operation, LWN.net would not let that release go by without looking at the statistics for this development cycle. It has been a busy cycle, but with some interesting changes.
There have been just over 10,000 non-merge changesets merged for 2.6.39; with the sole exception of 2.6.37 (11,446 changesets), that's the highest since 2.6.33. Those changes came from 1,236 developers; only 2.6.37 (with 1,276 developers) has ever exceeded that number. Those developers added 670,000 lines of code while deleting 346,000 lines, for a net growth of 324,000 lines. The most active contributors this time around were:
Most active 2.6.39 developers
By changesets
Thomas Gleixner              442    4.4%
David S. Miller              201    2.0%
Mike McCormack               138    1.4%
Mark Brown                   127    1.3%
Tejun Heo                    119    1.2%
Russell King                  89    0.9%
Arnaldo Carvalho de Melo      86    0.9%
Arend van Spriel              77    0.8%
Al Viro                       73    0.7%
Aaro Koskinen                 72    0.7%
Tomas Winkler                 70    0.7%
Greg Kroah-Hartman            69    0.7%
Chris Wilson                  65    0.6%
Joe Perches                   60    0.6%
Mauro Carvalho Chehab         60    0.6%
Borislav Petkov               60    0.6%
Eric Dumazet                  59    0.6%
Uwe Kleine-König              59    0.6%
Dan Carpenter                 59    0.6%
Artem Bityutskiy              58    0.6%
By changed lines
Wey-Yi Guy                 45680    5.6%
Wei Wang                   25224    3.1%
Alan Cox                   20880    2.6%
Laurent Pinchart           20459    2.5%
Guan Xuetao                20167    2.5%
Larry Finger               14763    1.8%
Tomas Winkler              14095    1.7%
Arnd Bergmann              13748    1.7%
Igor M. Liplianin          13491    1.7%
Aaro Koskinen              13274    1.6%
Russell King               12862    1.6%
Mike McCormack             11582    1.4%
Jozsef Kadlecsik           10374    1.3%
George                     10353    1.3%
Bhanu Gollapudi             9925    1.2%
Thomas Gleixner             8869    1.1%
Olivier Grenie              8167    1.0%
Greg Ungerer                8105    1.0%
Sakari Ailus                7513    0.9%
Joe Perches                 7048    0.9%
Thomas Gleixner got to the top of the per-changesets list with a massive reworking of how interrupts are managed in the kernel - a job which required significant changes in almost every architecture. David Miller did a great deal of work cleaning up, reworking, and optimizing the networking stack. Mike McCormack did a lot of cleanup work on the rtl8192e driver in the staging tree, Mark Brown contributed the usual large pile of changes concentrated in the sound driver subsystem, and Tejun Heo improved things all over the tree, primarily in the x86 architecture code.
On the lines-changed side, Wey-Yi Guy reworked some Intel network drivers, Wei Wang worked on the Realtek card reader driver in the staging tree, Alan Cox added the GMA500 driver to staging, Laurent Pinchart did a bunch of Video4Linux work including the addition of the media controller subsystem, and Guan Xuetao added the unicore32 architecture.
There were just over 200 known employers supporting work on 2.6.39, the most active of which were:
Most active 2.6.39 employers
By changesets
(None)                      1374   13.7%
Red Hat                     1260   12.6%
(Unknown)                    690    6.9%
Intel                        571    5.7%
Novell                       376    3.7%
Texas Instruments            372    3.7%
IBM                          305    3.0%
Nokia                        297    3.0%
linutronix                   276    2.8%
(Consultant)                 203    2.0%
                             180    1.8%
Broadcom                     180    1.8%
Atheros                      151    1.5%
Samsung                      150    1.5%
Wolfson Micro                146    1.5%
AMD                          133    1.3%
Pengutronix                  123    1.2%
ST Ericsson                  116    1.2%
LINBIT                       111    1.1%
Oracle                        99    1.0%
By lines changed
Intel                     117903   14.6%
(None)                     94093   11.6%
Red Hat                    52140    6.4%
Nokia                      46063    5.7%
Texas Instruments          39536    4.9%
(Unknown)                  37755    4.7%
Realsil Micro              25370    3.1%
IBM                        24121    3.0%
(Consultant)               23999    3.0%
Broadcom                   23330    2.9%
Peking University          20487    2.5%
Novell                     19024    2.3%
Samsung                    17275    2.1%
NetUP                      13683    1.7%
                           11201    1.4%
Realtek                    10457    1.3%
KFKI Research Inst         10430    1.3%
Ericsson                    9199    1.1%
ST Ericsson                 8611    1.1%
Freescale                   8457    1.0%
The percentage of changes coming from developers known to be working on their own time is at the lowest level seen since we started generating these statistics. Whether that means that volunteers are slowly losing interest in working with the kernel or that everybody who can do kernel work has been hired is hard to say.
Red Hat, as always, generates large numbers of patches; Texas Instruments continues the steady increase we have seen over the last few years, while Oracle continues to decline. New entries this time around include Realsil (the Realtek card reader work), the Peking University Microprocessor R&D Laboratory (the unicore32 architecture), NetUP (various drivers), and the KFKI Research Institute (ipset).
Occasionally it is interesting to look at the list of non-author signoffs - Signed-off-by tags added by developers who are not the authors of the patches involved. For 2.6.39, that list looks like this:
Developers with the most signoffs (total 8766)
Greg Kroah-Hartman          1162   13.3%
David S. Miller              546    6.2%
John W. Linville             437    5.0%
Mauro Carvalho Chehab        434    5.0%
Andrew Morton                317    3.6%
James Bottomley              220    2.5%
Ingo Molnar                  186    2.1%
Mark Brown                   158    1.8%
Sascha Hauer                 135    1.5%
Tony Lindgren                129    1.5%
Takashi Iwai                 124    1.4%
Samuel Ortiz                 106    1.2%
Paul Mundt                   100    1.1%
Matthew Garrett               99    1.1%
Russell King                  98    1.1%
Jeff Kirsher                  97    1.1%
Jiri Kosina                   95    1.1%
Linus Torvalds                94    1.1%
Patrick McHardy               90    1.0%
Konrad Rzeszutek Wilk         89    1.0%
Greg Kroah-Hartman contributed "only" 69 patches to 2.6.39, but another 1,162 - over 13% of the total - passed through his hands on their way into the kernel. The bulk of those changes applied to the staging tree, but they were certainly not limited to staging. Linus Torvalds directly merged only 94 changes from others; everything else came in by way of a subsystem maintainer's tree.
Despite being one of the more active development cycles in recent years, 2.6.39 has also been one of the smoothest. The number of difficult regressions has been small, and, if Linus's current plan holds, the cycle could complete in just over 60 days, which would make it the shortest development cycle since the beginning of the git era. Kernel development is not without its glitches, but the process would appear to be working quite smoothly.
(As always, thanks are due to Greg Kroah-Hartman for his help in the creation of these statistics.)
Stable pages
When a process writes to a file-backed page in memory (through either a memory mapping or with the write() system call), that page is marked dirty and must eventually be written to its backing store. The writeback code, when it gets around to that page, will mark the page read-only, set the "under writeback" page flag, and queue the I/O operation. The write-protection of the page is not there to prevent changes to the page; its purpose is to detect further writes which would require that another writeback be done. Current kernels will, in most situations, allow a process to modify a page while the writeback operation is in progress.
Most of the time, that works just fine. In the worst case, the second write to the page will happen before the first writeback I/O operation begins; in that case, the more recently written data will also be written to disk in the first I/O operation and a second, redundant disk write will be queued later. Either way, the data gets to its backing store, which is the real intent.
There are cases where modifying a page that is under writeback is a bad idea, though. Some devices can perform integrity checking, meaning that the data written to disk is checksummed by the hardware and compared against a pre-write checksum provided by the kernel. If the data changes after the kernel calculates its checksum, that check will fail, causing a spurious write error. Software RAID implementations can be tripped up by changing data as well. As a result of problems like this, developers working in the filesystem area have been convinced for a while that the kernel needs to support "stable pages" which are guaranteed not to change while they are under writeback.
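The failure mode is easy to reproduce in miniature. The sketch below is a user-space illustration only: zlib's CRC32 stands in for whatever checksum an integrity-capable device actually uses, and the 4096-byte page size is an assumption.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size

page = bytearray(b"A" * PAGE_SIZE)

# "Kernel" side: the checksum is computed when the write is queued.
queued_checksum = zlib.crc32(page)

# A process modifies the page before the device has read the data.
page[0] = ord("B")

# "Device" side: the checksum of the data that actually arrived.
device_checksum = zlib.crc32(page)

write_ok = (device_checksum == queued_checksum)
print(write_ok)  # False
```

Nothing is wrong with the storage in this scenario; the mismatch comes entirely from the page changing between the two checksum calculations.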
When LWN looked at stable pages in February, Darrick Wong had just posted a patch aimed at solving this problem. In situations where integrity checking was in use, the kernel would make a copy of each page before beginning a writeback operation. Since nobody in user space knew about the copy, it was guaranteed to remain unmolested for the duration of the write operation. This patch solved the problem for the integrity checking case, but all of those copy operations were expensive. Given that providing stable pages in all situations was seen as desirable, that cost was considered to be too high.
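The copy-based approach can be modeled the same way. In this sketch (an illustration only; the function name is hypothetical and CRC32 again stands in for the real checksum), the checksum and the I/O both use a private snapshot, so a later write to the original page cannot invalidate either.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size

def start_writeback_with_copy(page):
    """Snapshot the page and checksum the snapshot; return both."""
    snapshot = bytes(page)  # private copy that user space never sees
    return snapshot, zlib.crc32(snapshot)

page = bytearray(b"A" * PAGE_SIZE)
snapshot, checksum = start_writeback_with_copy(page)

# A later write from user space only touches the original page.
page[0] = ord("B")

still_valid = (zlib.crc32(snapshot) == checksum)
print(still_valid)  # True
```

The price is visible in the first line of the function: every page written back costs a full page copy, whether or not anybody tries to modify it in the meantime.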
So Darrick has come back with a new patch set which takes a different - and simpler - approach. In short, with this patch, any attempt to write to a page which is under writeback will simply wait until the writeback completes. There is no need to copy pages or engage in other tricks, but there may be a cost to this approach as well.
As noted above, a page will be marked read-only when it is written back; there is also a page flag which indicates that writeback is in progress. So all of the pieces are there to trap writes to pages under writeback. To make it even easier, the VFS layer already has a callback (page_mkwrite()) to notify filesystems that a read-only page is being made writable; all Darrick really needed to do was to change how those page_mkwrite() callbacks operate in presence of writeback.
Some filesystems do not provide page_mkwrite() at all; for those, Darrick created a generic empty_page_mkwrite() function which locks the page, waits for any writeback to complete, then returns the locked page. More complicated filesystems do have page_mkwrite() handlers, though, so Darrick had to add similar functionality for ext2, ext4, and FAT. Btrfs has implemented stable pages internally for some time, so no changes were required there. Ext3 turns out to have some complicated interactions with the journal layer which make a stable page implementation hard; since invasive changes to ext3 are not welcomed at this point, that filesystem may never get stable page support.
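The lock, wait, and return sequence can be modeled with a toy user-space sketch. This is not kernel code: the Page class, its methods, and toy_page_mkwrite() are hypothetical names, with a Python lock and condition variable standing in for the page lock and the writeback flag.

```python
import threading

class Page:
    """Toy stand-in for a page with a lock and an under-writeback flag."""

    def __init__(self):
        self.lock = threading.Lock()
        self._cond = threading.Condition()
        self.writeback = False
        self.data = bytearray(16)

    def start_writeback(self):
        with self._cond:
            self.writeback = True

    def end_writeback(self):
        # I/O completion clears the flag and wakes any waiting writers.
        with self._cond:
            self.writeback = False
            self._cond.notify_all()

    def wait_on_writeback(self):
        with self._cond:
            self._cond.wait_for(lambda: not self.writeback)

def toy_page_mkwrite(page):
    """Lock the page, wait for writeback to finish, return it still locked."""
    page.lock.acquire()
    page.wait_on_writeback()
    return page

events = []
page = Page()
page.start_writeback()

def writer():
    locked = toy_page_mkwrite(page)  # blocks while writeback is in progress
    events.append("write")
    locked.data[0] = 1
    locked.lock.release()

t = threading.Thread(target=writer)
t.start()

events.append("io_done")  # simulated completion of the writeback I/O
page.end_writeback()
t.join()

print(events)  # ['io_done', 'write']
```

The writer can never get ahead of the I/O completion, which is the stable-page guarantee; it is also where the cost comes from, since the writer sits idle for as long as the writeback takes.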
There have been concerns expressed that this approach could slow down applications which repeatedly write to the same part of a file. Before this change, writeback would not slow down subsequent writes; afterward, those writes will wait for writeback to complete. Darrick ran some benchmarks to test this case and found a performance degradation of up to 12%. This slowdown is unwelcome, but there also seems to be a consensus that there are very few applications which would actually run into this problem. Repetitively rewriting data is a relatively rare pattern; indeed, the developers involved are saying that they don't even know of a real-world case they can test.
Lack of awareness of applications which would be adversely affected by this change does not mean that they don't exist, of course. This is the kind of change which can create real problems a few years down the line when the code is finally shipped by distributors and deployed by users; by then, it's far too late to go back. If there are applications which would react poorly to this change, it would be good to get the word out now. Otherwise the benefits of stable pages are likely to cause them to be adopted in most settings.
Page editor: Jonathan Corbet