Kernel development [LWN.net]

Kernel release status

The current development kernel is 2.6.39-rc7, released on May 9. Says Linus: "So things have been pretty quiet, and unless something major comes up I believe that this will be the last -rc." Full details can be found in the long-form changelog.

Stable updates: the 2.6.32.40, 2.6.33.13, and 2.6.38.6 stable updates were released on May 9. Each contains a long list of important fixes.

Quotes of the week

Rebuilding the kernel enables end users to make modifications to their devices that are normally not intended by the device manufacturer, such as theming the device by changing system icons and removing/modifying system components. Please note that Sony Ericsson is not recommending this.

-- But they do tell you how

I can easily handle such people, being a bit bigger than that, and lots of experience being a bouncer at a punk-rock bar for a number of years.

-- The source of Greg Kroah-Hartman's kernel skills

For the life of me I can't understand why you distro guys need to keep patching the kernel when you could just add a line to your initscripts. I'm suspecting that lameness is involved.

-- Andrew Morton

AMD and Coreboot

Coreboot (formerly LinuxBIOS) is a free BIOS implementation; it offers escape from a long list of woes stemming from poorly-written BIOSes, but it has always suffered from limited hardware support. AMD has now announced support for Coreboot on a new set of processors, and more going forward: "Finally, AMD is now committed to support coreboot for all future products on the roadmap starting next with support for the upcoming 'Llano' APU. AMD has come to realize that coreboot is useful in a myriad of applications and markets, even beyond what was originally considered. Consequently, AMD plans to continue building its support of coreboot in both features and roadmap for the foreseeable future."

Ftrace, perf, and the tracing ABI

By Jonathan Corbet
May 11, 2011

Arjan van de Ven recently reported that a 2.6.39 change in how tracepoint data is reported by the kernel broke powertop; he requested that the change be partially reverted. The resulting discussion covered the familiar problem of how tracepoints mix with the kernel ABI. But it also revealed some serious disagreements on how tracing data should be provided by the kernel and, perhaps, the direction that this interface will take in the future.

Each tracepoint defined in the kernel includes a number of fields containing values relevant to the specific event being documented. For example, the sched_switch tracepoint, which fires when the scheduler is switching between processes, includes the IDs of both processes, their priorities, and so on. Every tracepoint also has a few "common" fields, including the process ID, its flags, and the value of the preempt_count variable; if trace data is read in binary form, those values will appear at the beginning of the structure read from the kernel.

Prior to the 2.6.32 development cycle, those common fields also included the thread group ID; that value was removed in September, 2009. A look at the powertop source shows that the program expects that field to still be there (though it does not use it); its internally-defined structure for trace data includes a tgid field. So this change should have broken powertop, and it would have except for one other change: on the very same day, Steve Rostedt added the lock_depth common field to report whether the current process held the big kernel lock (BKL). The addition of this field was never meant to be permanent: its whole purpose, after all, was to help with the removal of the BKL from the kernel entirely.

For 2.6.39, the lock_depth common field was removed, and powertop broke. Arjan subsequently complained; he also supplied a patch which put a zero-filled padding field where lock_depth used to be. Steve opposed the patch, on the grounds that, had powertop used the tracing ABI properly, it would not have broken. The kernel exports information about each tracepoint; for the above-mentioned sched_switch, that information can be examined from the command line:

    # cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
    name: sched_switch
    ID: 51
    format:
	field:unsigned short common_type; offset:0; size:2; signed:0;
	field:unsigned char common_flags; offset:2; size:1; signed:0;
	field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
	field:int common_pid; offset:4; size:4; signed:1;

	field:char prev_comm[16]; offset:8; size:16; signed:1;
	field:pid_t prev_pid; offset:24; size:4; signed:1;
	field:int prev_prio; offset:28; size:4; signed:1;
	field:long prev_state; offset:32; size:8; signed:1;
	field:char next_comm[16]; offset:40; size:16; signed:1;
	field:pid_t next_pid; offset:56; size:4; signed:1;
	field:int next_prio; offset:60; size:4; signed:1;

A properly-written program, Steve says, should read this file and use the offset values found there to obtain the data it is interested in. Linus seemed to agree that it would have been nice if things worked out that way, but that's not what happened. Instead, at least one program became dependent on the binary format of the trace data exported from the kernel. That is enough to make that format part of the kernel ABI; breaking that program counts as a regression. So Arjan's patch was merged.
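The approach Steve describes might look roughly like the following sketch. It parses an abbreviated copy of the format listing and then pulls fields out of a binary record by the advertised offsets rather than by a hard-coded structure. The helper names (parse_format(), read_field(), make_sample_record()) are invented for illustration, the sample record is synthetic, and little-endian byte order is assumed; a real tool would read the format file from debugfs and use the host's byte order.

```python
import re

# Abbreviated copy of the sched_switch format listing shown above.
FORMAT_TEXT = """
name: sched_switch
ID: 51
format:
    field:unsigned short common_type; offset:0; size:2; signed:0;
    field:int common_pid; offset:4; size:4; signed:1;

    field:char prev_comm[16]; offset:8; size:16; signed:1;
    field:pid_t prev_pid; offset:24; size:4; signed:1;
    field:pid_t next_pid; offset:56; size:4; signed:1;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*"
    r"size:(?P<size>\d+);\s*signed:(?P<signed>\d+);"
)


def parse_format(text):
    """Map each field name to (offset, size, signed, is_array)."""
    fields = {}
    for match in FIELD_RE.finditer(text):
        name = match.group("decl").split()[-1]   # e.g. "prev_comm[16]"
        is_array = "[" in name
        name = name.split("[")[0]
        fields[name] = (
            int(match.group("offset")),
            int(match.group("size")),
            match.group("signed") == "1",
            is_array,
        )
    return fields


def read_field(fields, record, name, byteorder="little"):
    """Extract one field from a binary record using the parsed offsets."""
    offset, size, signed, is_array = fields[name]
    raw = record[offset:offset + size]
    if is_array:
        # Character arrays are treated as NUL-terminated strings.
        return raw.split(b"\0", 1)[0].decode()
    return int.from_bytes(raw, byteorder, signed=signed)


def make_sample_record():
    """Build a synthetic 64-byte record; real data comes from the kernel."""
    record = bytearray(64)
    record[4:8] = (1234).to_bytes(4, "little", signed=True)    # common_pid
    record[8:12] = b"bash"                                      # prev_comm
    record[24:28] = (1234).to_bytes(4, "little", signed=True)  # prev_pid
    record[56:60] = (42).to_bytes(4, "little", signed=True)    # next_pid
    return bytes(record)


fields = parse_format(FORMAT_TEXT)
record = make_sample_record()
print(read_field(fields, record, "prev_comm"), read_field(fields, record, "next_pid"))
```

A reader written this way keeps working when a common field is added or removed, since the offsets it uses are looked up at run time.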

Steve did not like this outcome; it went against all the effort which had gone into creating a means by which tracepoints could change without breaking applications. The alternative, he said, was to bury the kernel in compatibility cruft:

The reason tracepoints have currently been stable is that kernel design changes do not happen often. But they do happen, and I foresee that in the future, the kernel will have a large number of "legacy tracepoints", and we will be stuck maintaining them forever.

What happens if someone designs a tool that analyzes the XFS filesystem's 200+ tracepoints? Will all those tracepoints now become ABI?

The notion that XFS tracepoints could become part of the ABI was dismissed as "crazy talk" by Dave Chinner, but there is nothing inherently different about those tracepoints. They could, indeed, end up as part of the kernel ABI.

Steve was also concerned about the size of events; removal of lock_depth, beyond eliminating a (now) meaningless bit of data, also served to make each event four bytes smaller. There is always pressure to reduce the overhead of tracing, and reducing the bandwidth of the data copied to user space is part of that; adding the pad field goes against that goal. David Sharp (of Google) chimed in to note that data size matters a lot to them:

The size of events is a *huge* issue for us. Please look at the patches we have been sending out for tracing: A lot of them are about reducing the size of events. Most of the patches we carry internally are about reducing the size of events. Memory is the most scarce resource on our systems, so we *cannot* afford to use large trace buffers.
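To put the four-byte difference in concrete terms, here is a small sketch that computes the size of the common fields with and without lock_depth. The field order is taken from the format listing; placing lock_depth directly after common_pid is an assumption made for illustration, as are the function names.

```python
import struct

# Common-field layouts, packed with standard sizes and no padding ("=").
# Assumption: lock_depth was a 4-byte int following common_pid.
WITH_LOCK_DEPTH = "=HBBii"     # type, flags, preempt_count, pid, lock_depth
WITHOUT_LOCK_DEPTH = "=HBBi"   # type, flags, preempt_count, pid


def header_size(layout):
    """Size in bytes of the common fields for a given layout."""
    return struct.calcsize(layout)


def bytes_saved(events):
    """Buffer space saved over `events` events by dropping lock_depth."""
    per_event = header_size(WITH_LOCK_DEPTH) - header_size(WITHOUT_LOCK_DEPTH)
    return events * per_event


print(header_size(WITH_LOCK_DEPTH), header_size(WITHOUT_LOCK_DEPTH))
print(bytes_saved(1_000_000))
```

The same arithmetic shows why powertop broke: a reader that assumes the larger header will look for the first event-specific field four bytes past where the kernel now puts it.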

Steve had hoped to remove some of the other common fields as well (a change that Google has already made internally); that idea has gone by the wayside for now. Tracepoints, it seems, are ABI, even when the information they report no longer makes sense in the kernel.

The remainder of this discussion became a sort of bunfight between Steve and Ingo Molnar as they sought to place the blame for this problem and to determine how things will go in the future. Ingo attacked Steve for resisting the idea of unchanging tracepoints, accused him of maintaining ftrace as a fork of perf in the kernel (despite the fact that ftrace was there first), and said that perf needed to take over:

perf is basically the ftrace UI and APIs done better, cleaner and more robustly. Look at all the tooling that sprang up around that ABI, almost overnight. ftrace evolved through many iterations in the past and perf was simply the next logical step.

He also threatened to stop pulling tracing changes from Steve.

Steve, in return, blamed perf for bolting itself onto the ftrace infrastructure, then exporting ftrace's binary structures directly to user space. He blamed Ingo for blocking changes intended to improve the situation (for example, the creation of a separate directory for stable tracepoints agreed to at the 2010 Kernel Summit) and complained that Ingo was ignoring his attempts to create tracing infrastructure which works for everybody. He also worried, again, that set-in-stone tracepoint formats would impede progress in the kernel.

Despite all of this, Steve is willing to work toward the unification of ftrace and perf, as long as it doesn't mean leaving ftrace behind:

Now that perf has entered the tracing field, I would be happy to bring the two together. But we disagree on how to do that. I will not drop ftrace totally just to work on perf. There's too many users of ftrace that want enhancements, and I will still support that. The reason being is that I honestly do not believe that perf can do what these users want anytime in the near future (if at all). I will not abandon a successful project just because you feel that it is a fork.

So it seems that, while there are clearly disagreements and tension between the developers in this area, there should also be room for a solution that works for everybody. Development emphasis will clearly continue to move toward perf, but, despite Ingo's desire to the contrary, ftrace will likely continue to be improved. We may see efforts to push applications toward libraries that can shield them from tracepoint changes, but, for now, every tracepoint added to the kernel will probably have to be considered to be part of its ABI; given that, developers should probably be reviewing new tracepoints more closely than they have been. And, with luck, instrumentation in Linux - which has improved considerably in the last few years - will continue to get better.

2.6.39 development statistics

By Jonathan Corbet
May 10, 2011

As of this writing, the 2.6.39-rc7 prepatch has just been released and Linus has announced that it may be the last one before the final release. Being a traditional sort of operation, LWN.net would not let that release go by without looking at the statistics for this development cycle. It has been a busy cycle, but with some interesting changes.

There have been just over 10,000 non-merge changesets merged for 2.6.39; with the sole exception of 2.6.37 (11,446 changesets), that's the highest since 2.6.33. Those changes came from 1,236 developers; only 2.6.37 (with 1,276 developers) has ever exceeded that number. Those developers added 670,000 lines of code while deleting 346,000 lines, for a net growth of 324,000 lines. The most active contributors this time around were:

Most active 2.6.39 developers

By changesets
Thomas Gleixner	442	4.4%
David S. Miller	201	2.0%
Mike McCormack	138	1.4%
Mark Brown	127	1.3%
Tejun Heo	119	1.2%
Russell King	89	0.9%
Arnaldo Carvalho de Melo	86	0.9%
Arend van Spriel	77	0.8%
Al Viro	73	0.7%
Aaro Koskinen	72	0.7%
Tomas Winkler	70	0.7%
Greg Kroah-Hartman	69	0.7%
Chris Wilson	65	0.6%
Joe Perches	60	0.6%
Mauro Carvalho Chehab	60	0.6%
Borislav Petkov	60	0.6%
Eric Dumazet	59	0.6%
Uwe Kleine-König	59	0.6%
Dan Carpenter	59	0.6%
Artem Bityutskiy	58	0.6%

By changed lines
Wey-Yi Guy	45680	5.6%
Wei Wang	25224	3.1%
Alan Cox	20880	2.6%
Laurent Pinchart	20459	2.5%
Guan Xuetao	20167	2.5%
Larry Finger	14763	1.8%
Tomas Winkler	14095	1.7%
Arnd Bergmann	13748	1.7%
Igor M. Liplianin	13491	1.7%
Aaro Koskinen	13274	1.6%
Russell King	12862	1.6%
Mike McCormack	11582	1.4%
Jozsef Kadlecsik	10374	1.3%
George	10353	1.3%
Bhanu Gollapudi	9925	1.2%
Thomas Gleixner	8869	1.1%
Olivier Grenie	8167	1.0%
Greg Ungerer	8105	1.0%
Sakari Ailus	7513	0.9%
Joe Perches	7048	0.9%

Thomas Gleixner got to the top of the per-changesets list with a massive reworking of how interrupts are managed in the kernel - a job which required significant changes in almost every architecture. David Miller did a great deal of work cleaning up, reworking, and optimizing the networking stack. Mike McCormack did a lot of cleanup work on the rtl8192e driver in the staging tree, Mark Brown contributed the usual large pile of changes concentrated in the sound driver subsystem, and Tejun Heo improved things all over the tree, primarily in the x86 architecture code.

On the lines-changed side, Wey-Yi Guy reworked some Intel network drivers, Wei Wang worked on the Realtek card reader driver in the staging tree, Alan Cox added the GMA500 driver to staging, Laurent Pinchart did a bunch of Video4Linux work including the addition of the media controller subsystem, and Guan Xuetao added the unicore32 architecture.

There were just over 200 known employers supporting work on the 2.6.39 kernel, the most active of which were:

Most active 2.6.39 employers

By changesets
(None)	1374	13.7%
Red Hat	1260	12.6%
(Unknown)	690	6.9%
Intel	571	5.7%
Novell	376	3.7%
Texas Instruments	372	3.7%
IBM	305	3.0%
Nokia	297	3.0%
linutronix	276	2.8%
(Consultant)	203	2.0%
Google	180	1.8%
Broadcom	180	1.8%
Atheros	151	1.5%
Samsung	150	1.5%
Wolfson Micro	146	1.5%
AMD	133	1.3%
Pengutronix	123	1.2%
ST Ericsson	116	1.2%
LINBIT	111	1.1%
Oracle	99	1.0%

By lines changed
Intel	117903	14.6%
(None)	94093	11.6%
Red Hat	52140	6.4%
Nokia	46063	5.7%
Texas Instruments	39536	4.9%
(Unknown)	37755	4.7%
Realsil Micro	25370	3.1%
IBM	24121	3.0%
(Consultant)	23999	3.0%
Broadcom	23330	2.9%
Peking University	20487	2.5%
Novell	19024	2.3%
Samsung	17275	2.1%
NetUP	13683	1.7%
Google	11201	1.4%
Realtek	10457	1.3%
KFKI Research Inst	10430	1.3%
Ericsson	9199	1.1%
ST Ericsson	8611	1.1%
Freescale	8457	1.0%

The percentage of changes coming from developers known to be working on their own time is at the lowest level seen since we started generating these statistics. Whether that means that volunteers are slowly losing interest in working with the kernel or that everybody who can do kernel work has been hired is hard to say.

Red Hat, as always, generates large numbers of patches; Texas Instruments continues the steady increase we have seen over the last few years, while Oracle continues to decline. New entries this time around include Realsil (the Realtek card reader work), the Peking University Microprocessor R&D Laboratory (the unicore32 architecture), NetUP (various drivers), and the KFKI Research Institute (ipset).

Occasionally it is interesting to look at the list of non-author signoffs - Signed-off-by tags added by developers who are not the authors of the patches involved. For 2.6.39, that list looks like this:

Developers with the most signoffs (total 8766)
Greg Kroah-Hartman	1162	13.3%
David S. Miller	546	6.2%
John W. Linville	437	5.0%
Mauro Carvalho Chehab	434	5.0%
Andrew Morton	317	3.6%
James Bottomley	220	2.5%
Ingo Molnar	186	2.1%
Mark Brown	158	1.8%
Sascha Hauer	135	1.5%
Tony Lindgren	129	1.5%
Takashi Iwai	124	1.4%
Samuel Ortiz	106	1.2%
Paul Mundt	100	1.1%
Matthew Garrett	99	1.1%
Russell King	98	1.1%
Jeff Kirsher	97	1.1%
Jiri Kosina	95	1.1%
Linus Torvalds	94	1.1%
Patrick McHardy	90	1.0%
Konrad Rzeszutek Wilk	89	1.0%

Greg Kroah-Hartman contributed "only" 69 patches to 2.6.39, but another 1,162 - over 13% of the total - passed through his hands on their way into the kernel. The bulk of those changes applied to the staging tree, but they were certainly not limited to staging. Linus Torvalds directly merged only 94 changes from others; everything else came in by way of a subsystem maintainer's tree.
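For readers curious how a non-author signoff count can be derived, here is a minimal sketch; it is not the tooling used to generate the numbers above. It assumes each commit is available as an (author, message) pair, compares identities by exact string match, and uses made-up names in its sample data.

```python
from collections import Counter


def non_author_signoffs(commits):
    """Count Signed-off-by tags added by someone other than the author.

    `commits` is an iterable of (author, message) pairs, where author
    has the form "Name <email>".
    """
    counts = Counter()
    for author, message in commits:
        for line in message.splitlines():
            tag, sep, who = line.strip().partition(":")
            if sep and tag.lower() == "signed-off-by":
                who = who.strip()
                if who and who != author:
                    counts[who] += 1
    return counts


# Hypothetical sample data; the names and addresses are invented.
SAMPLE = [
    ("Alice Author <alice@example.com>",
     "fix thing\n\n"
     "Signed-off-by: Alice Author <alice@example.com>\n"
     "Signed-off-by: Mary Maintainer <mary@example.com>\n"),
    ("Bob Builder <bob@example.com>",
     "add thing\n\n"
     "Signed-off-by: Bob Builder <bob@example.com>\n"
     "Signed-off-by: Mary Maintainer <mary@example.com>\n"),
    ("Mary Maintainer <mary@example.com>",
     "own patch\n\n"
     "Signed-off-by: Mary Maintainer <mary@example.com>\n"),
]

for who, count in non_author_signoffs(SAMPLE).most_common():
    print(who, count)
```

Exact string matching is the main simplification here; developers who use more than one name or address would need to be mapped to a single identity first.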

Despite being one of the more active development cycles in recent years, 2.6.39 has also been one of the smoothest. The number of difficult regressions has been small, and, if Linus's current plan holds, the cycle could complete in just over 60 days, which would make it the shortest development cycle since the beginning of the git era. Kernel development is not without its glitches, but the process would appear to be working quite smoothly.

(As always, thanks are due to Greg Kroah-Hartman for his help in the creation of these statistics.)

Stable pages

By Jonathan Corbet
May 11, 2011

When a process writes to a file-backed page in memory (through either a memory mapping or with the write() system call), that page is marked dirty and must eventually be written to its backing store. The writeback code, when it gets around to that page, will mark the page read-only, set the "under writeback" page flag, and queue the I/O operation. The write-protection of the page is not there to prevent changes to the page; its purpose is to detect further writes which would require that another writeback be done. Current kernels will, in most situations, allow a process to modify a page while the writeback operation is in progress.

Most of the time, that works just fine. In the worst case, the second write to the page will happen before the first writeback I/O operation begins; in that case, the more recently written data will also be written to disk in the first I/O operation and a second, redundant disk write will be queued later. Either way, the data gets to its backing store, which is the real intent.

There are cases where modifying a page that is under writeback is a bad idea, though. Some devices can perform integrity checking, meaning that the data written to disk is checksummed by the hardware and compared against a pre-write checksum provided by the kernel. If the data changes after the kernel calculates its checksum, that check will fail, causing a spurious write error. Software RAID implementations can be tripped up by changing data as well. As a result of problems like this, developers working in the filesystem area have been convinced for a while that the kernel needs to support "stable pages" which are guaranteed not to change while they are under writeback.
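The integrity-checking failure can be illustrated with a toy model. In this sketch CRC32 stands in for whatever checksum the hardware actually uses, and device_write() is a hypothetical stand-in for the device's verification step; none of this is kernel code.

```python
import zlib


def checksum(data):
    """CRC32 as a stand-in for the real integrity checksum."""
    return zlib.crc32(bytes(data))


def device_write(data, guard):
    """Hypothetical device model: verify the data before accepting it."""
    if checksum(data) != guard:
        raise OSError("integrity check failed")
    return len(data)


def racy_writeback():
    """Modify a page after its checksum was computed, then write it."""
    page = bytearray(b"A" * 4096)
    guard = checksum(page)    # computed when writeback starts
    page[0] ^= 0xFF           # a process scribbles on the page mid-writeback
    try:
        device_write(page, guard)
    except OSError as exc:
        return str(exc)
    return "ok"


print(racy_writeback())
```

The data on its way to disk is perfectly valid from the application's point of view; the error arises only because the checksum describes an earlier version of the page.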

When LWN looked at stable pages in February, Darrick Wong had just posted a patch aimed at solving this problem. In situations where integrity checking was in use, the kernel would make a copy of each page before beginning a writeback operation. Since nobody in user space knew about the copy, it was guaranteed to remain unmolested for the duration of the write operation. This patch solved the problem for the integrity checking case, but all of those copy operations were expensive. Given that providing stable pages in all situations was seen as desirable, that cost was considered to be too high.
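The copying approach can be sketched in the same toy style. Here writeback_with_copy() is an invented name, CRC32 again stands in for the real checksum, and the `modify` callback plays the part of a process writing to the page while the I/O is in flight.

```python
import zlib


def writeback_with_copy(page, modify):
    """Write back a private snapshot so later changes cannot affect it.

    Returns (verified, still_dirty): whether the snapshot still matches
    its checksum, and whether the live page now differs from what was
    written and so needs another writeback.
    """
    snapshot = bytes(page)              # the per-writeback copy (the cost)
    guard = zlib.crc32(snapshot)
    modify(page)                        # concurrent write hits the live page
    verified = zlib.crc32(snapshot) == guard
    still_dirty = bytes(page) != snapshot
    return verified, still_dirty


def scribble(page):
    page[0] = 1


print(writeback_with_copy(bytearray(4096), scribble))
```

The snapshot always verifies, but every writeback pays for a full page copy, which is the expense that led to the search for another approach.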

So Darrick has come back with a new patch set which takes a different - and simpler - approach. In short, with this patch, any attempt to write to a page which is under writeback will simply wait until the writeback completes. There is no need to copy pages or engage in other tricks, but there may be a cost to this approach as well.

As noted above, a page will be marked read-only when it is written back; there is also a page flag which indicates that writeback is in progress. So all of the pieces are there to trap writes to pages under writeback. To make it even easier, the VFS layer already has a callback (page_mkwrite()) to notify filesystems that a read-only page is being made writable; all Darrick really needed to do was to change how those page_mkwrite() callbacks operate in the presence of writeback.

Some filesystems do not provide page_mkwrite() at all; for those, Darrick created a generic empty_page_mkwrite() function which locks the page, waits for any writeback to complete, then returns the locked page. More complicated filesystems do have page_mkwrite() handlers, though, so Darrick had to add similar functionality for ext2, ext4, and FAT. Btrfs has implemented stable pages internally for some time, so no changes were required there. Ext3 turns out to have some complicated interactions with the journal layer which make a stable page implementation hard; since invasive changes to ext3 are not welcomed at this point, that filesystem may never get stable page support.
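The lock, wait, and return sequence can be modeled in user space with threads. The following is a toy simulation, not kernel code: the Page class, stable_page_mkwrite(), and write_to_page() are invented for illustration, and a threading.Event plays the role of the "under writeback" page flag.

```python
import threading


class Page:
    """Toy model of a page-cache page."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.lock = threading.Lock()
        self._writeback_done = threading.Event()
        self._writeback_done.set()          # no writeback in progress

    def start_writeback(self):
        self._writeback_done.clear()        # set the "under writeback" flag

    def end_writeback(self):
        self._writeback_done.set()          # I/O finished; wake any waiters

    def under_writeback(self):
        return not self._writeback_done.is_set()

    def wait_on_writeback(self):
        self._writeback_done.wait()


def stable_page_mkwrite(page):
    """Lock the page, wait for writeback to finish, return it locked."""
    page.lock.acquire()
    page.wait_on_writeback()
    return page


def write_to_page(page, offset, payload, log):
    locked = stable_page_mkwrite(page)
    try:
        log.append("write")
        locked.data[offset:offset + len(payload)] = payload
    finally:
        locked.lock.release()


page = Page()
log = []
page.start_writeback()
writer = threading.Thread(target=write_to_page, args=(page, 0, b"new", log))
writer.start()
writer.join(timeout=0.1)            # the writer is blocked, so this times out
blocked = writer.is_alive()
log.append("writeback-complete")
page.end_writeback()
writer.join()
print(blocked, log)
```

The writer makes no progress until the simulated I/O completes, which is exactly the delay that the performance concerns are about.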

There have been concerns expressed that this approach could slow down applications which repeatedly write to the same part of a file. Before this change, writeback would not slow down subsequent writes; afterward, those writes will wait for writeback to complete. Darrick ran some benchmarks to test this case and found a performance degradation of up to 12%. This slowdown is unwelcome, but there also seems to be a consensus that there are very few applications which would actually run into this problem. Repetitively rewriting data is a relatively rare pattern; indeed, the developers involved are saying that they don't even know of a real-world case they can test.

Lack of awareness of applications which would be adversely affected by this change does not mean that they don't exist, of course. This is the kind of change which can create real problems a few years down the line when the code is finally shipped by distributors and deployed by users; by then, it's far too late to go back. If there are applications which would react poorly to this change, it would be good to get the word out now. Otherwise the benefits of stable pages are likely to cause them to be adopted in most settings.

Patches and updates

Linus Torvalds Linux 2.6.39-rc7
Greg KH Linux 2.6.38.6
Greg KH Linux 2.6.33.13
Greg KH Linux 2.6.32.40
David Daney MIPS: Octeon: Use Device Tree.
Catalin Marinas ARM: Add support for the Large Physical Address Extensions
Chris Metcalf tile: add an RTC driver for the Tilera hypervisor
Chris Metcalf arch/tile: add arch/tile/drivers/ directory with SROM driver
Mark Salter arch/c6x: new architecture port for linux
Paul Turner CFS Bandwidth Control V6
Andi Kleen RFC: Lower broadcast timer idle path lock contention
Tejun Heo ptrace: implement PTRACE_SEIZE/INTERRUPT and group stop notification
Lucian Adrian Grijincu faster tree-based sysctl implementation
Steven Rostedt ftrace: Allow multiple users to pick and choose functions to trace
Hui Zhu KGTP (Linux Kernel debugger and tracer) 201100507 release
Avi Kivity KVM in-guest performance monitoring
Anirudh Ghayal drivers: rtc: Add support for Qualcomm PMIC8xxx RTC
Luis R. Rodriguez x86: Add support for Atheros AR1520 GPS devices
Rhyland Klein add support for rfkill gpio devices
Rafael J. Wysocki PM: Support for generic I/O power domains (v2)
Rafael J. Wysocki PM / Hibernate: Add sysfs knob to control size of memory for drivers
Jianyun Li Add Marvell UMI driver
Tomasz Stanislawski V4L: Extended crop/compose API
Jamie Iles Support for MMIO based Denali NAND controller
Linus Walleij drivers: create a pinmux subsystem v2
shaohua.li@intel.com block: optimize flush for non-queueable flush drive
Josef Bacik fs: add SEEK_HOLE and SEEK_DATA flags
Changli Gao fs: add FD_CLOFORK and O_CLOFORK
amir73il@users.sourceforge.net Ext4 snapshots - core patches
Christoph Lameter SLUB: Lockless freelists for objects V4
Wu Fengguang writeback fixes and cleanups for 2.6.40
Darrick J. Wong [PATCHSET v3.1 0/7] data integrity: Stabilize pages during writeback for various fses
KAMEZAWA Hiroyuki memcg async reclaim
Mel Gorman Reduce impact to overall system of SLUB using high-order allocations
Minchan Kim Prevent LRU churning
y@vger.kernel.org snet: Security for NETwork syscalls
Konrad Rzeszutek Wilk xen block backend.
Eric W. Biederman Network namespace manipulation with file descriptors
Michael S. Tsirkin virtio-net: 64 bit features, event index
Douglas Gilbert lsscsi-0.25 released
Jesper Dangaard Brouer IPTV-Analyzer project released v0.9.0
