
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.39-rc7, released on May 9. Says Linus: "So things have been pretty quiet, and unless something major comes up I believe that this will be the last -rc." Full details can be found in the long-form changelog.

Stable updates: the 2.6.32.40, 2.6.33.13, and 2.6.38.6 stable updates were released on May 9. Each contains a long list of important fixes.

Quotes of the week

Rebuilding the kernel enables end users to make modifications to their devices that are normally not intended by the device manufacturer, such as theming the device by changing system icons and removing/modifying system components. Please note that Sony Ericsson is not recommending this.
-- But they do tell you how

I can easily handle such people, being a bit bigger than that, and lots of experience being a bouncer at a punk-rock bar for a number of years.
-- The source of Greg Kroah-Hartman's kernel skills

For the life of me I can't understand why you distro guys need to keep patching the kernel when you could just add a line to your initscripts. I'm suspecting that lameness is involved.
-- Andrew Morton

AMD and Coreboot

Coreboot (formerly LinuxBIOS) is a free BIOS implementation; it offers an escape from a long list of woes stemming from poorly-written BIOSes, but it has always suffered from limited hardware support. AMD has now announced support for Coreboot on a new set of processors, with more to come: "Finally, AMD is now committed to support coreboot for all future products on the roadmap starting next with support for the upcoming 'Llano' APU. AMD has come to realize that coreboot is useful in a myriad of applications and markets, even beyond what was originally considered. Consequently, AMD plans to continue building its support of coreboot in both features and roadmap for the foreseeable future."

Kernel development news

Ftrace, perf, and the tracing ABI

By Jonathan Corbet
May 11, 2011
Arjan van de Ven recently reported that a 2.6.39 change in how tracepoint data is reported by the kernel broke powertop; he requested that the change be partially reverted. The resulting discussion covered the familiar problem of how tracepoints mix with the kernel ABI. But it also revealed some serious disagreements on how tracing data should be provided by the kernel and, perhaps, the direction that this interface will take in the future.

Each tracepoint defined in the kernel includes a number of fields containing values relevant to the specific event being documented. For example, the sched_switch tracepoint, which fires when the scheduler is switching between processes, includes the IDs of both processes, their priorities, and so on. Every tracepoint also has a few "common" fields, including the process ID, its flags, and the value of the preempt_count variable; if trace data is read in binary form, those values will appear at the beginning of the structure read from the kernel.

Prior to the 2.6.32 development cycle, those common fields also included the thread group ID; that value was removed in September, 2009. A look at the powertop source shows that the program expects that field to still be there (though it does not use it); its internally-defined structure for trace data includes a tgid field. So this change should have broken powertop, and it would have except for one other change: on the very same day, Steve Rostedt added the lock_depth common field to report whether the current process held the big kernel lock (BKL). The addition of this field was never meant to be permanent: its whole purpose, after all, was to help with the removal of the BKL from the kernel entirely.
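The coincidence described above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes that both the removed thread group ID and the added lock_depth field were 4-byte integers sitting directly after the process ID, and the names common_tgid and common_lock_depth are used here as labels rather than taken from the kernel source. The sizes of the other common fields come from the sched_switch format listing in this article.

```python
import struct

def layout(fields):
    """Return ({name: offset}, total_size) for a packed list of (name, code) fields."""
    offsets, position = {}, 0
    for name, code in fields:
        offsets[name] = position
        position += struct.calcsize("=" + code)  # "=" means no alignment padding
    return offsets, position

# Common fields and sizes as given in the sched_switch format listing.
COMMON = [
    ("common_type", "H"),           # unsigned short, 2 bytes
    ("common_flags", "B"),          # unsigned char, 1 byte
    ("common_preempt_count", "B"),  # unsigned char, 1 byte
    ("common_pid", "i"),            # int, 4 bytes
]

# Assumption: each trailing field is a 4-byte int placed right after common_pid.
header_with_tgid = COMMON + [("common_tgid", "i")]
header_with_lock_depth = COMMON + [("common_lock_depth", "i")]

tgid_offsets, tgid_size = layout(header_with_tgid)
bkl_offsets, bkl_size = layout(header_with_lock_depth)

print(tgid_size, bkl_size)
```

Under those assumptions both headers come out the same size, so a program with a hard-coded structure would find the event-specific data at the offset it expected even though the meaning of one common field had changed underneath it.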

For 2.6.39, the lock_depth common field was removed, and powertop broke. Arjan subsequently complained; he also supplied a patch which put a zero-filled padding field where lock_depth used to be. Steve opposed the patch, on the grounds that, had powertop used the tracing ABI properly, it would not have broken. The kernel exports information about each tracepoint; for the above-mentioned sched_switch, that information can be examined from the command line:

    # cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
    name: sched_switch
    ID: 51
    format:
        field:unsigned short common_type; offset:0; size:2; signed:0;
        field:unsigned char common_flags; offset:2; size:1; signed:0;
        field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
        field:int common_pid; offset:4; size:4; signed:1;

        field:char prev_comm[16]; offset:8; size:16; signed:1;
        field:pid_t prev_pid; offset:24; size:4; signed:1;
        field:int prev_prio; offset:28; size:4; signed:1;
        field:long prev_state; offset:32; size:8; signed:1;
        field:char next_comm[16]; offset:40; size:16; signed:1;
        field:pid_t next_pid; offset:56; size:4; signed:1;
        field:int next_prio; offset:60; size:4; signed:1;

A properly-written program, Steve says, should read this file and use the offset values found there to obtain the data it is interested in. Linus seemed to agree that it would have been nice if things worked out that way, but that's not what happened. Instead, at least one program became dependent on the binary format of the trace data exported from the kernel. That is enough to make that format part of the kernel ABI; breaking that program counts as a regression. So Arjan's patch was merged.
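What Steve describes might look roughly like the following sketch, which parses a subset of the format text shown above and uses the reported offsets, rather than a hard-coded structure, to pull values out of a raw event. The helper names (parse_format, read_int), the little-endian byte order, and the fabricated record are all assumptions made for illustration; this is not how powertop or any existing tool is written.

```python
import re
import struct

# A subset of the sched_switch format listing from this article.
FORMAT_TEXT = """
name: sched_switch
ID: 51
format:
    field:unsigned short common_type; offset:0; size:2; signed:0;
    field:int common_pid; offset:4; size:4; signed:1;

    field:char prev_comm[16]; offset:8; size:16; signed:1;
    field:pid_t prev_pid; offset:24; size:4; signed:1;
    field:pid_t next_pid; offset:56; size:4; signed:1;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);"
    r"\s*size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)

def parse_format(text):
    """Map each field name to (offset, size, signed) as reported by the kernel."""
    fields = {}
    for match in FIELD_RE.finditer(text):
        # "char prev_comm[16]" -> "prev_comm"
        name = match.group("decl").split()[-1].split("[")[0]
        fields[name] = (
            int(match.group("offset")),
            int(match.group("size")),
            match.group("signed") == "1",
        )
    return fields

def read_int(record, fields, name, byteorder="little"):
    """Extract one integer field from a raw event using the parsed layout."""
    offset, size, signed = fields[name]
    return int.from_bytes(record[offset:offset + size], byteorder, signed=signed)

fields = parse_format(FORMAT_TEXT)

# Fabricated 64-byte record for illustration only; this is not real trace data.
record = bytearray(64)
struct.pack_into("<i", record, 24, 17)    # prev_pid
struct.pack_into("<i", record, 56, 4242)  # next_pid

print(read_int(record, fields, "prev_pid"), read_int(record, fields, "next_pid"))
```

A reader built this way never mentions an offset in its own source, so a common field could come or go without the program noticing, as long as the fields it asks for by name still exist.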

Steve did not like this outcome; it went against all the effort which had gone into creating a means by which tracepoints could change without breaking applications. The alternative, he said, was to bury the kernel in compatibility cruft:

The reason tracepoints have currently been stable is that kernel design changes do not happen often. But they do happen, and I foresee that in the future, the kernel will have a large number of "legacy tracepoints", and we will be stuck maintaining them forever.

What happens if someone designs a tool that analyzes the XFS filesystem's 200+ tracepoints? Will all those tracepoints now become ABI?

The notion that XFS tracepoints could become part of the ABI was dismissed as "crazy talk" by Dave Chinner, but there is nothing inherently different about those tracepoints. They could, indeed, end up as part of the kernel ABI.

Steve was also concerned about the size of events; removal of lock_depth, beyond eliminating a (now) meaningless bit of data, also served to make each event four bytes smaller. There is always pressure to reduce the overhead of tracing, and reducing the bandwidth of the data copied to user space is part of that; adding the pad field goes against that goal. David Sharp (of Google) chimed in to note that data size matters a lot to them:

The size of events is a *huge* issue for us. Please look at the patches we have been sending out for tracing: A lot of them are about reducing the size of events. Most of the patches we carry internally are about reducing the size of events. Memory is the most scarce resource on our systems, so we *cannot* afford to use large trace buffers.
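To put the four bytes in perspective, here is a rough back-of-the-envelope calculation for the sched_switch event. The header and payload sizes are read off the format listing earlier in this article; the 1 MiB buffer size is a made-up figure, and the calculation ignores whatever per-event bookkeeping the ring buffer itself adds, so the numbers are indicative only.

```python
HEADER_BYTES = 8     # common fields, per the sched_switch format listing
EVENT_END = 64       # the last field, next_prio, ends at offset 60 + size 4
PAYLOAD_BYTES = EVENT_END - HEADER_BYTES
PAD_BYTES = 4        # the zero-filled field that replaces lock_depth

BUFFER_BYTES = 1 << 20   # hypothetical 1 MiB trace buffer

def events_that_fit(event_bytes, buffer_bytes=BUFFER_BYTES):
    """Whole events that fit in the buffer, ignoring ring-buffer overhead."""
    return buffer_bytes // event_bytes

unpadded = HEADER_BYTES + PAYLOAD_BYTES
padded = unpadded + PAD_BYTES
growth_percent = 100.0 * PAD_BYTES / unpadded

print(unpadded, padded, growth_percent)
print(events_that_fit(unpadded), events_that_fit(padded))
```

In this simplified model the pad makes a sched_switch event about 6% larger; events with smaller payloads would see a proportionally bigger increase.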

Steve had hoped to remove some of the other common fields as well (a change that Google has already made internally); that idea has gone by the wayside for now. Tracepoints, it seems, are ABI, even when the information they report no longer makes sense in the kernel.

The remainder of this discussion became a sort of bunfight between Steve and Ingo Molnar as they sought to place the blame for this problem and to determine how things will go in the future. Ingo attacked Steve for resisting the idea of unchanging tracepoints, accused him of maintaining ftrace as a fork of perf in the kernel (despite the fact that ftrace was there first), and said that perf needed to take over:

perf is basically the ftrace UI and APIs done better, cleaner and more robustly. Look at all the tooling that sprang up around that ABI, almost overnight. ftrace evolved through many iterations in the past and perf was simply the next logical step.

He also threatened to stop pulling tracing changes from Steve.

Steve, in return, blamed perf for bolting itself onto the ftrace infrastructure, then exporting ftrace's binary structures directly to user space. He blamed Ingo for blocking changes intended to improve the situation (for example, the creation of a separate directory for stable tracepoints agreed to at the 2010 Kernel Summit) and complained that Ingo was ignoring his attempts to create tracing infrastructure which works for everybody. He also worried, again, that set-in-stone tracepoint formats would impede progress in the kernel.

Despite all of this, Steve is willing to work toward the unification of ftrace and perf, as long as it doesn't mean leaving ftrace behind:

Now that perf has entered the tracing field, I would be happy to bring the two together. But we disagree on how to do that. I will not drop ftrace totally just to work on perf. There's too many users of ftrace that want enhancements, and I will still support that. The reason being is that I honestly do not believe that perf can do what these users want anytime in the near future (if at all). I will not abandon a successful project just because you feel that it is a fork.

So it seems that, while there are clearly disagreements and tension between the developers in this area, there should also be room for a solution that works for everybody. Development emphasis will clearly continue to move toward perf, but, despite Ingo's desire to the contrary, ftrace will likely continue to be improved. We may see efforts to push applications toward libraries that can shield them from tracepoint changes, but, for now, every tracepoint added to the kernel will probably have to be considered to be part of its ABI; given that, developers should probably be reviewing new tracepoints more closely than they have been. And, with luck, instrumentation in Linux - which has improved considerably in the last few years - will continue to get better.

2.6.39 development statistics

By Jonathan Corbet
May 10, 2011
As of this writing, the 2.6.39-rc7 prepatch has just been released and Linus has announced that it may be the last one before the final release. Being a traditional sort of operation, LWN.net would not let that release go by without looking at the statistics for this development cycle. It has been a busy cycle, but with some interesting changes.

There have been just over 10,000 non-merge changesets merged for 2.6.39; with the sole exception of 2.6.37 (11,446 changesets), that's the highest since 2.6.33. Those changes came from 1,236 developers; only 2.6.37 (with 1,276 developers) has ever exceeded that number. Those developers added 670,000 lines of code while deleting 346,000 lines, for a net growth of 324,000 lines. The most active contributors this time around were:

Most active 2.6.39 developers

By changesets

    Thomas Gleixner              442   4.4%
    David S. Miller              201   2.0%
    Mike McCormack               138   1.4%
    Mark Brown                   127   1.3%
    Tejun Heo                    119   1.2%
    Russell King                  89   0.9%
    Arnaldo Carvalho de Melo      86   0.9%
    Arend van Spriel              77   0.8%
    Al Viro                       73   0.7%
    Aaro Koskinen                 72   0.7%
    Tomas Winkler                 70   0.7%
    Greg Kroah-Hartman            69   0.7%
    Chris Wilson                  65   0.6%
    Joe Perches                   60   0.6%
    Mauro Carvalho Chehab         60   0.6%
    Borislav Petkov               60   0.6%
    Eric Dumazet                  59   0.6%
    Uwe Kleine-König              59   0.6%
    Dan Carpenter                 59   0.6%
    Artem Bityutskiy              58   0.6%

By changed lines

    Wey-Yi Guy                 45680   5.6%
    Wei Wang                   25224   3.1%
    Alan Cox                   20880   2.6%
    Laurent Pinchart           20459   2.5%
    Guan Xuetao                20167   2.5%
    Larry Finger               14763   1.8%
    Tomas Winkler              14095   1.7%
    Arnd Bergmann              13748   1.7%
    Igor M. Liplianin          13491   1.7%
    Aaro Koskinen              13274   1.6%
    Russell King               12862   1.6%
    Mike McCormack             11582   1.4%
    Jozsef Kadlecsik           10374   1.3%
    George                     10353   1.3%
    Bhanu Gollapudi             9925   1.2%
    Thomas Gleixner             8869   1.1%
    Olivier Grenie              8167   1.0%
    Greg Ungerer                8105   1.0%
    Sakari Ailus                7513   0.9%
    Joe Perches                 7048   0.9%

Thomas Gleixner got to the top of the per-changesets list with a massive reworking of how interrupts are managed in the kernel - a job which required significant changes in almost every architecture. David Miller did a great deal of work cleaning up, reworking, and optimizing the networking stack. Mike McCormack did a lot of cleanup work on the rtl8192e driver in the staging tree, Mark Brown contributed the usual large pile of changes concentrated in the sound driver subsystem, and Tejun Heo improved things all over the tree, primarily in the x86 architecture code.

On the lines-changed side, Wey-Yi Guy reworked some Intel network drivers, Wei Wang worked on the Realtek card reader driver in the staging tree, Alan Cox added the GMA500 driver to staging, Laurent Pinchart did a bunch of Video4Linux work including the addition of the media controller subsystem, and Guan Xuetao added the unicore32 architecture.

Just over 200 known employers supported work on 2.6.39; the most active were:

Most active 2.6.39 employers

By changesets

    (None)                      1374  13.7%
    Red Hat                     1260  12.6%
    (Unknown)                    690   6.9%
    Intel                        571   5.7%
    Novell                       376   3.7%
    Texas Instruments            372   3.7%
    IBM                          305   3.0%
    Nokia                        297   3.0%
    linutronix                   276   2.8%
    (Consultant)                 203   2.0%
    Google                       180   1.8%
    Broadcom                     180   1.8%
    Atheros                      151   1.5%
    Samsung                      150   1.5%
    Wolfson Micro                146   1.5%
    AMD                          133   1.3%
    Pengutronix                  123   1.2%
    ST Ericsson                  116   1.2%
    LINBIT                       111   1.1%
    Oracle                        99   1.0%

By lines changed

    Intel                     117903  14.6%
    (None)                     94093  11.6%
    Red Hat                    52140   6.4%
    Nokia                      46063   5.7%
    Texas Instruments          39536   4.9%
    (Unknown)                  37755   4.7%
    Realsil Micro              25370   3.1%
    IBM                        24121   3.0%
    (Consultant)               23999   3.0%
    Broadcom                   23330   2.9%
    Peking University          20487   2.5%
    Novell                     19024   2.3%
    Samsung                    17275   2.1%
    NetUP                      13683   1.7%
    Google                     11201   1.4%
    Realtek                    10457   1.3%
    KFKI Research Inst         10430   1.3%
    Ericsson                    9199   1.1%
    ST Ericsson                 8611   1.1%
    Freescale                   8457   1.0%

The percentage of changes coming from developers known to be working on their own time is at the lowest level seen since we started generating these statistics. Whether that means that volunteers are slowly losing interest in working with the kernel or that everybody who can do kernel work has been hired is hard to say.

Red Hat, as always, generates large numbers of patches; Texas Instruments continues the steady increase we have seen over the last few years, while Oracle continues to decline. New entries this time around include Realsil (the Realtek card reader work), the Peking University Microprocessor R&D Laboratory (the unicore32 architecture), NetUP (various drivers), and the KFKI Research Institute (ipset).

Occasionally it is interesting to look at the list of non-author signoffs - Signed-off-by tags added by developers who are not the authors of the patches involved. For 2.6.39, that list looks like this:

Developers with the most signoffs (total 8766)

    Greg Kroah-Hartman          1162  13.3%
    David S. Miller              546   6.2%
    John W. Linville             437   5.0%
    Mauro Carvalho Chehab        434   5.0%
    Andrew Morton                317   3.6%
    James Bottomley              220   2.5%
    Ingo Molnar                  186   2.1%
    Mark Brown                   158   1.8%
    Sascha Hauer                 135   1.5%
    Tony Lindgren                129   1.5%
    Takashi Iwai                 124   1.4%
    Samuel Ortiz                 106   1.2%
    Paul Mundt                   100   1.1%
    Matthew Garrett               99   1.1%
    Russell King                  98   1.1%
    Jeff Kirsher                  97   1.1%
    Jiri Kosina                   95   1.1%
    Linus Torvalds                94   1.1%
    Patrick McHardy               90   1.0%
    Konrad Rzeszutek Wilk         89   1.0%

Greg Kroah-Hartman contributed "only" 69 patches to 2.6.39, but another 1,162 - over 13% of the total - passed through his hands on their way into the kernel. The bulk of those changes applied to the staging tree, but they were certainly not limited to staging. Linus Torvalds directly merged only 94 changes from others; everything else came in by way of a subsystem maintainer's tree.

Despite being one of the more active development cycles in recent years, 2.6.39 has also been one of the smoothest. The number of difficult regressions has been small, and, if Linus's current plan holds, the cycle could complete in just over 60 days, which would make it the shortest development cycle since the beginning of the git era. Kernel development is not without its glitches, but the process would appear to be working quite smoothly.

(As always, thanks are due to Greg Kroah-Hartman for his help in the creation of these statistics.)

Stable pages

By Jonathan Corbet
May 11, 2011
When a process writes to a file-backed page in memory (either through a memory mapping or with the write() system call), that page is marked dirty and must eventually be written to its backing store. The writeback code, when it gets around to that page, will mark the page read-only, set the "under writeback" page flag, and queue the I/O operation. The write-protection of the page is not there to prevent changes to the page; its purpose is to detect further writes which would require that another writeback be done. Current kernels will, in most situations, allow a process to modify a page while the writeback operation is in progress.

Most of the time, that works just fine. In the worst case, the second write to the page will happen before the first writeback I/O operation begins; in that case, the more recently written data will also be written to disk in the first I/O operation and a second, redundant disk write will be queued later. Either way, the data gets to its backing store, which is the real intent.

There are cases where modifying a page that is under writeback is a bad idea, though. Some devices can perform integrity checking, meaning that the data written to disk is checksummed by the hardware and compared against a pre-write checksum provided by the kernel. If the data changes after the kernel calculates its checksum, that check will fail, causing a spurious write error. Software RAID implementations can be tripped up by changing data as well. As a result of problems like this, developers working in the filesystem area have been convinced for a while that the kernel needs to support "stable pages" which are guaranteed not to change while they are under writeback.
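The integrity-checking failure can be reproduced in miniature. In this toy model the "kernel" computes a checksum when the write is queued and the "device" recomputes it over the data it actually receives. CRC32 from Python's zlib module is only a stand-in for whatever checksum real hardware uses, and the function names are hypothetical.

```python
import zlib

PAGE_SIZE = 4096

def queue_write(page):
    """Kernel side (toy): checksum the page at the moment the I/O is queued."""
    return zlib.crc32(page)

def device_verify(page, expected):
    """Device side (toy): recompute the checksum over the data transferred."""
    return zlib.crc32(page) == expected

page = bytearray(PAGE_SIZE)
page[:5] = b"hello"

guard = queue_write(page)
ok_if_untouched = device_verify(page, guard)

# The process modifies the page while the write is still in flight.
page[0:1] = b"j"
ok_after_modification = device_verify(page, guard)

print(ok_if_untouched, ok_after_modification)
```

The data that reaches the device in the second case is perfectly valid from the application's point of view; the check fails only because the page changed between the checksum calculation and the transfer.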

When LWN looked at stable pages in February, Darrick Wong had just posted a patch aimed at solving this problem. In situations where integrity checking was in use, the kernel would make a copy of each page before beginning a writeback operation. Since nobody in user space knew about the copy, it was guaranteed to remain unmolested for the duration of the write operation. This patch solved the problem for the integrity checking case, but all of those copy operations were expensive. Given that providing stable pages in all situations was seen as desirable, that cost was considered to be too high.

So Darrick has come back with a new patch set which takes a different - and simpler - approach. In short, with this patch, any attempt to write to a page which is under writeback will simply wait until the writeback completes. There is no need to copy pages or engage in other tricks, but there may be a cost to this approach as well.

As noted above, a page will be marked read-only when it is written back; there is also a page flag which indicates that writeback is in progress. So all of the pieces are there to trap writes to pages under writeback. To make it even easier, the VFS layer already has a callback (page_mkwrite()) to notify filesystems that a read-only page is being made writable; all Darrick really needed to do was to change how those page_mkwrite() callbacks operate in the presence of writeback.

Some filesystems do not provide page_mkwrite() at all; for those, Darrick created a generic empty_page_mkwrite() function which locks the page, waits for any writeback to complete, then returns the locked page. More complicated filesystems do have page_mkwrite() handlers, though, so Darrick had to add similar functionality for ext2, ext4, and FAT. Btrfs has implemented stable pages internally for some time, so no changes were required there. Ext3 turns out to have some complicated interactions with the journal layer which make a stable page implementation hard; since invasive changes to ext3 are not welcomed at this point, that filesystem may never get stable page support.
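The wait-for-writeback behavior can be modeled in user space with a condition variable. The sketch below is a toy analogue, not kernel code: the Page class and its methods are invented for illustration, and a Python thread stands in for a process that faults on a write-protected page and lands in something like the generic handler described above.

```python
import threading

class Page:
    """Toy page whose writers must wait while writeback is in progress."""

    def __init__(self, data):
        self._cond = threading.Condition()
        self.under_writeback = False
        self.data = bytearray(data)

    def start_writeback(self):
        with self._cond:
            self.under_writeback = True

    def end_writeback(self):
        with self._cond:
            self.under_writeback = False
            self._cond.notify_all()

    def write(self, new_data, log):
        """Analogue of the make-writable path: lock, wait for writeback, modify."""
        with self._cond:
            self._cond.wait_for(lambda: not self.under_writeback)
            self.data[:] = new_data
            log.append("write")

def demo():
    page, log = Page(b"old"), []
    page.start_writeback()

    writer = threading.Thread(target=page.write, args=(b"new", log))
    writer.start()
    writer.join(timeout=0.1)          # give the writer a chance to run
    blocked = writer.is_alive()       # it should still be waiting
    seen_by_device = bytes(page.data) # page contents stayed stable

    log.append("writeback done")
    page.end_writeback()
    writer.join()
    return blocked, seen_by_device, log, bytes(page.data)

blocked, seen_by_device, log, final_data = demo()
print(blocked, seen_by_device, log, final_data)
```

The writer makes no progress until end_writeback() is called, which is both the point of the approach and the source of the performance concern: a process that rewrites the same page repeatedly now spends time waiting for I/O.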

There have been concerns expressed that this approach could slow down applications which repeatedly write to the same part of a file. Before this change, writeback would not slow down subsequent writes; afterward, those writes will wait for writeback to complete. Darrick ran some benchmarks to test this case and found a performance degradation of up to 12%. This slowdown is unwelcome, but there also seems to be a consensus that there are very few applications which would actually run into this problem. Repetitively rewriting data is a relatively rare pattern; indeed, the developers involved are saying that they don't even know of a real-world case they can test.

Lack of awareness of applications which would be adversely affected by this change does not mean that they don't exist, of course. This is the kind of change which can create real problems a few years down the line when the code is finally shipped by distributors and deployed by users; by then, it's far too late to go back. If there are applications which would react poorly to this change, it would be good to get the word out now. Otherwise the benefits of stable pages are likely to cause them to be adopted in most settings.

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.39-rc7
Greg KH: Linux 2.6.38.6
Greg KH: Linux 2.6.33.13
Greg KH: Linux 2.6.32.40

Miscellaneous

Douglas Gilbert: lsscsi-0.25 released
Jesper Dangaard Brouer: IPTV-Analyzer project released v0.9.0

Page editor: Jonathan Corbet


Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds