Brief items
The current development kernel is 2.6.39-rc7, released on May 9. Says
Linus: "So things have been pretty quiet, and unless something major
comes up I believe that this will be the last -rc." Full details can be
found in the long-form changelog.
Stable updates: the 2.6.32.40, 2.6.33.13, and 2.6.38.6 stable updates were released on
May 9. Each contains a long list of important fixes.
"Rebuilding the kernel enables end users to make modifications to
their devices that are normally not intended by the device
manufacturer, such as theming the device by changing system icons
and removing/modifying system components. Please note that Sony
Ericsson is not recommending this."
-- But they do tell you how
"I can easily handle such people, being a bit bigger than that, and
lots of experience being a bouncer at a punk-rock bar for a number
of years."
-- The source of Greg Kroah-Hartman's kernel skills
"For the life of me I can't understand why you distro guys need to
keep patching the kernel when you could just add a line to your
initscripts. I'm suspecting that lameness is involved."
-- Andrew Morton
Coreboot (formerly LinuxBIOS) is a free BIOS implementation; it offers
escape from a long list of woes stemming from poorly-written BIOSes, but
it has always suffered from limited hardware support. AMD has now
announced support for Coreboot on a new set of processors, and more going
forward: "Finally, AMD is now committed to support coreboot for all future
products on the roadmap starting next with support for the upcoming 'Llano'
APU. AMD has come to realize that coreboot is useful in a myriad of
applications and markets, even beyond what was originally considered.
Consequently, AMD plans to continue building its support of coreboot in
both features and roadmap for the foreseeable future."
Kernel development news
By Jonathan Corbet
May 11, 2011
Arjan van de Ven recently
reported that a
2.6.39 change in how tracepoint data is reported by the kernel broke
powertop; he requested that the change be partially reverted. The
resulting discussion covered the familiar problem of how tracepoints mix
with the kernel ABI. But it also revealed some serious disagreements on
how tracing data should be provided by the kernel and, perhaps, the
direction that this interface will take in the future.
Each tracepoint defined in the kernel includes a number of fields
containing values relevant to the specific event being documented. For
example, the sched_switch tracepoint, which fires when the
scheduler is switching between processes, includes the IDs of both
processes, their priorities, and so on. Every tracepoint also has a few
"common" fields, including the process ID, its flags, and the value of the
preempt_count variable; if trace data is read in binary form,
those values will appear at the beginning of the structure read from the
kernel.
Prior to the 2.6.32 development cycle, those common fields also included
the thread group ID; that value was removed in September 2009. A look at
the powertop
source shows that the program expects that field to still be there
(though it does not use it); its
internally-defined structure for trace data includes a tgid
field. So this change should have broken powertop, and it would have
except for one other change: on the very same day, Steve Rostedt added the
lock_depth common field to report whether the current process held
the big kernel lock (BKL). The addition of this field was never meant to
be permanent: its whole purpose, after all, was to help with the removal of
the BKL from the kernel entirely.
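To make the layout arithmetic concrete, the following Python sketch compares the size of the common header in three configurations. It is only an illustration under stated assumptions: the field order, the 4-byte width assumed for tgid and lock_depth, and the names COMMON_WITH_TGID, COMMON_WITH_LOCK_DEPTH, COMMON_BARE, and header_size() are all invented for this example and are not taken from the kernel or powertop sources.

```python
import struct

# Common header fields as (name, struct format code).
# Assumption: tgid and lock_depth are 4-byte ints placed right after pid.
COMMON_WITH_TGID = [("type", "H"), ("flags", "B"), ("preempt_count", "B"),
                    ("pid", "i"), ("tgid", "i")]
COMMON_WITH_LOCK_DEPTH = [("type", "H"), ("flags", "B"), ("preempt_count", "B"),
                          ("pid", "i"), ("lock_depth", "i")]
COMMON_BARE = [("type", "H"), ("flags", "B"), ("preempt_count", "B"),
               ("pid", "i")]


def header_size(fields):
    """Return the common header size in bytes, which is also the offset
    of the first event-specific field. "=" means no implicit padding."""
    return struct.calcsize("=" + "".join(code for _, code in fields))


for label, layout in [("with tgid", COMMON_WITH_TGID),
                      ("with lock_depth", COMMON_WITH_LOCK_DEPTH),
                      ("neither field", COMMON_BARE)]:
    print(f"{label}: event-specific data starts at offset {header_size(layout)}")
```

Under these assumptions, swapping one 4-byte field for another leaves every later offset unchanged, which is why a program with a hardcoded structure would not notice the first change but would misread every event once the field disappeared without a replacement.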
For 2.6.39, the lock_depth common field was removed, and powertop
broke. Arjan subsequently complained; he also supplied a patch which put a
zero-filled padding field where lock_depth used to be. Steve opposed the patch, on the grounds that, had
powertop used the tracing ABI properly, it would not have broken. The
kernel exports information about each tracepoint; for the above-mentioned
sched_switch, that information can be examined from the command
line:
# cat /sys/kernel/debug/tracing/events/sched/sched_switch/format
name: sched_switch
ID: 51
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[16]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[16]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
A properly-written program, Steve says, should read this file and use the
offset values found there to obtain the data it is interested in. Linus seemed to agree that it would have been nice
if things worked out that way, but that's not what happened. Instead, at
least one program became dependent on the binary format of the trace data
exported from the kernel. That is enough to make that format part of the
kernel ABI; breaking that program counts as a regression. So Arjan's patch
was merged.
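The approach Steve describes, reading the format file and using its offsets rather than a hardcoded structure, might look something like the Python sketch below. It is a minimal illustration, not code from powertop or any real tracing library: parse_format() and read_int() are hypothetical helpers, the format text is copied from the listing above rather than read from debugfs, and the binary record is a synthetic one built just for the demonstration.

```python
import re
import struct
import sys

# Format description copied from the listing above; a real tool would read
# it from the "format" file for the event it cares about.
FORMAT_TEXT = """\
name: sched_switch
ID: 51
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char prev_comm[16]; offset:8; size:16; signed:1;
field:pid_t prev_pid; offset:24; size:4; signed:1;
field:int prev_prio; offset:28; size:4; signed:1;
field:long prev_state; offset:32; size:8; signed:1;
field:char next_comm[16]; offset:40; size:16; signed:1;
field:pid_t next_pid; offset:56; size:4; signed:1;
field:int next_prio; offset:60; size:4; signed:1;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);"
    r"\s*size:(?P<size>\d+);\s*signed:(?P<signed>[01]);"
)


def parse_format(text):
    """Map each field name to its (offset, size, signed) tuple."""
    fields = {}
    for match in FIELD_RE.finditer(text):
        # The name is the last token of the C declaration, minus any
        # array suffix such as "[16]".
        name = match["decl"].split()[-1].split("[")[0]
        fields[name] = (int(match["offset"]), int(match["size"]),
                        match["signed"] == "1")
    return fields


def read_int(record, fields, name, byteorder=sys.byteorder):
    """Extract an integer field from a raw record using parsed offsets."""
    offset, size, signed = fields[name]
    return int.from_bytes(record[offset:offset + size], byteorder,
                          signed=signed)


fields = parse_format(FORMAT_TEXT)

# Synthetic 64-byte record with next_pid set to 1234 (illustration only).
record = bytearray(64)
struct.pack_into("=i", record, fields["next_pid"][0], 1234)
next_pid = read_int(bytes(record), fields, "next_pid")
print("next_pid =", next_pid)
```

Because the offsets come from the kernel's own description, a tool written this way would keep working if a common field were added or removed, at the cost of parsing the format file once per event type.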
Steve did not like this outcome; it went against all the effort which had
gone into creating a means by which tracepoints could change without
breaking applications. The alternative, he said, was to bury the kernel in compatibility
cruft:
"The reason tracepoints have currently been stable is that kernel
design changes do not happen often. But they do happen, and I
foresee that in the future, the kernel will have a large number of
'legacy tracepoints', and we will be stuck maintaining them
forever.
What happens if someone designs a tool that analyzes the XFS
filesystem's 200+ tracepoints? Will all those tracepoints now
become ABI?"
The notion that XFS tracepoints could become part of the ABI was dismissed as "crazy talk" by Dave
Chinner, but there is nothing inherently different about those
tracepoints. They could, indeed, end up as part of the kernel ABI.
Steve was also concerned about the size of events; removal of
lock_depth, beyond eliminating a (now) meaningless bit of data,
also served to make each event four bytes smaller.
There is always pressure to reduce the overhead of
tracing, and reducing the bandwidth of the data copied to user space is
part of that; adding the pad field goes against that goal. David Sharp (of
Google) chimed in to note that data size
matters a lot to them:
"The size of events is a *huge* issue for us. Please look at the
patches we have been sending out for tracing: A lot of them are
about reducing the size of events. Most of the patches we carry
internally are about reducing the size of events. Memory is the
most scarce resource on our systems, so we *cannot* afford to use
large trace buffers."
Steve had hoped to remove some of the other common fields as well (a change
that Google has already made internally); that idea has gone by the wayside
for now. Tracepoints, it seems, are ABI, even when the information they
report no longer makes sense in the kernel.
The remainder of this discussion became a sort of bunfight between Steve
and Ingo Molnar as they sought to place the blame for this problem and to
determine how things will go in the future. Ingo attacked Steve for resisting the idea of
unchanging tracepoints, accused him of
maintaining ftrace as a fork of perf in the kernel (despite the fact that
ftrace was there first), and said that perf
needed to take over:
"perf is basically the ftrace UI and APIs done better, cleaner and
more robustly. Look at all the tooling that sprang up around that
ABI, almost overnight. ftrace evolved through many iterations in
the past and perf was simply the next logical step."
He also threatened to stop pulling tracing changes from Steve.
Steve, in
return, blamed perf for bolting itself onto the ftrace infrastructure, then
exporting ftrace's binary structures directly to user space. He blamed Ingo for
blocking changes intended to improve the situation (for example, the
creation of a separate directory for stable tracepoints agreed to at the 2010 Kernel Summit) and complained
that Ingo was ignoring his attempts to create tracing infrastructure which
works for everybody. He also worried, again, that set-in-stone tracepoint
formats would impede progress in the kernel.
Despite all of this, Steve is willing to
work toward the unification of ftrace and perf, as long as it doesn't mean
leaving ftrace behind:
"Now that perf has entered the tracing field, I would be happy to
bring the two together. But we disagree on how to do that. I will
not drop ftrace totally just to work on perf. There's too many
users of ftrace that want enhancements, and I will still support
that. The reason being is that I honestly do not believe that perf
can do what these users want anytime in the near future (if at
all). I will not abandon a successful project just because you feel
that it is a fork."
So it seems that, while there are clearly disagreements and tension between
the developers in this area, there should also be room for a solution that
works for everybody. Development emphasis will clearly continue to move
toward perf, but, despite Ingo's desire to the contrary, ftrace will likely
continue to be improved. We may see efforts to push applications toward
libraries that can shield them from tracepoint changes, but, for now,
every tracepoint added to the kernel will probably have to
be considered to be part of its ABI; given that, developers should probably
be reviewing new tracepoints more closely than they have been. And, with
luck, instrumentation in Linux - which has improved considerably in the
last few years - will continue to get better.
By Jonathan Corbet
May 10, 2011
As of this writing, the 2.6.39-rc7 prepatch has just been released and
Linus has announced that it may be the last one before the final
release. Being a traditional sort of operation, LWN.net would not let that
release go by without looking at the statistics for this development
cycle. It has been a busy cycle, but with some interesting changes.
There have been just over 10,000 non-merge changesets merged for 2.6.39;
with the sole exception of 2.6.37 (11,446 changesets), that's the highest
since 2.6.33. Those changes came from 1,236 developers; only 2.6.37 (with
1,276 developers) has ever exceeded that number. Those developers added
670,000 lines of code while deleting 346,000 lines, for a net growth of
324,000 lines. The most active contributors this time around were:
Most active 2.6.39 developers, by changesets:
| Developer | Changesets | Percent |
| --- | --- | --- |
| Thomas Gleixner | 442 | 4.4% |
| David S. Miller | 201 | 2.0% |
| Mike McCormack | 138 | 1.4% |
| Mark Brown | 127 | 1.3% |
| Tejun Heo | 119 | 1.2% |
| Russell King | 89 | 0.9% |
| Arnaldo Carvalho de Melo | 86 | 0.9% |
| Arend van Spriel | 77 | 0.8% |
| Al Viro | 73 | 0.7% |
| Aaro Koskinen | 72 | 0.7% |
| Tomas Winkler | 70 | 0.7% |
| Greg Kroah-Hartman | 69 | 0.7% |
| Chris Wilson | 65 | 0.6% |
| Joe Perches | 60 | 0.6% |
| Mauro Carvalho Chehab | 60 | 0.6% |
| Borislav Petkov | 60 | 0.6% |
| Eric Dumazet | 59 | 0.6% |
| Uwe Kleine-König | 59 | 0.6% |
| Dan Carpenter | 59 | 0.6% |
| Artem Bityutskiy | 58 | 0.6% |
Most active 2.6.39 developers, by changed lines:
| Developer | Lines changed | Percent |
| --- | --- | --- |
| Wey-Yi Guy | 45680 | 5.6% |
| Wei Wang | 25224 | 3.1% |
| Alan Cox | 20880 | 2.6% |
| Laurent Pinchart | 20459 | 2.5% |
| Guan Xuetao | 20167 | 2.5% |
| Larry Finger | 14763 | 1.8% |
| Tomas Winkler | 14095 | 1.7% |
| Arnd Bergmann | 13748 | 1.7% |
| Igor M. Liplianin | 13491 | 1.7% |
| Aaro Koskinen | 13274 | 1.6% |
| Russell King | 12862 | 1.6% |
| Mike McCormack | 11582 | 1.4% |
| Jozsef Kadlecsik | 10374 | 1.3% |
| George | 10353 | 1.3% |
| Bhanu Gollapudi | 9925 | 1.2% |
| Thomas Gleixner | 8869 | 1.1% |
| Olivier Grenie | 8167 | 1.0% |
| Greg Ungerer | 8105 | 1.0% |
| Sakari Ailus | 7513 | 0.9% |
| Joe Perches | 7048 | 0.9% |
Thomas Gleixner got to the top of the per-changesets list with a massive
reworking of how interrupts are managed in the kernel - a job which
required significant changes in almost every architecture. David Miller
did a great deal of work cleaning up, reworking, and optimizing the
networking stack. Mike McCormack did a lot of cleanup work on the rtl8192e
driver in the staging tree, Mark Brown contributed the usual large pile of
changes concentrated in the sound driver subsystem, and Tejun Heo improved
things all over the tree, primarily in the x86 architecture code.
On the lines-changed side, Wey-Yi Guy reworked some Intel network drivers,
Wei Wang worked on the Realtek card reader driver in the staging tree, Alan
Cox added the GMA500 driver to staging, Laurent Pinchart did a bunch of
Video4Linux work including the addition of the media controller subsystem, and Guan Xuetao
added the unicore32 architecture.
There were just over 200 known employers supporting work on 2.6.39, the
most active of which were:
Most active 2.6.39 employers, by changesets:
| Employer | Changesets | Percent |
| --- | --- | --- |
| (None) | 1374 | 13.7% |
| Red Hat | 1260 | 12.6% |
| (Unknown) | 690 | 6.9% |
| Intel | 571 | 5.7% |
| Novell | 376 | 3.7% |
| Texas Instruments | 372 | 3.7% |
| IBM | 305 | 3.0% |
| Nokia | 297 | 3.0% |
| linutronix | 276 | 2.8% |
| (Consultant) | 203 | 2.0% |
| Google | 180 | 1.8% |
| Broadcom | 180 | 1.8% |
| Atheros | 151 | 1.5% |
| Samsung | 150 | 1.5% |
| Wolfson Micro | 146 | 1.5% |
| AMD | 133 | 1.3% |
| Pengutronix | 123 | 1.2% |
| ST Ericsson | 116 | 1.2% |
| LINBIT | 111 | 1.1% |
| Oracle | 99 | 1.0% |
Most active 2.6.39 employers, by lines changed:
| Employer | Lines changed | Percent |
| --- | --- | --- |
| Intel | 117903 | 14.6% |
| (None) | 94093 | 11.6% |
| Red Hat | 52140 | 6.4% |
| Nokia | 46063 | 5.7% |
| Texas Instruments | 39536 | 4.9% |
| (Unknown) | 37755 | 4.7% |
| Realsil Micro | 25370 | 3.1% |
| IBM | 24121 | 3.0% |
| (Consultant) | 23999 | 3.0% |
| Broadcom | 23330 | 2.9% |
| Peking University | 20487 | 2.5% |
| Novell | 19024 | 2.3% |
| Samsung | 17275 | 2.1% |
| NetUP | 13683 | 1.7% |
| Google | 11201 | 1.4% |
| Realtek | 10457 | 1.3% |
| KFKI Research Inst | 10430 | 1.3% |
| Ericsson | 9199 | 1.1% |
| ST Ericsson | 8611 | 1.1% |
| Freescale | 8457 | 1.0% |
The percentage of changes coming from developers known to be working on
their own time is at the lowest level seen since we started generating
these statistics. Whether that means that volunteers are slowly losing
interest in
working with the kernel or that everybody who can do kernel work has been
hired is hard to say.
Red Hat, as always, generates large numbers of
patches; Texas Instruments continues the steady increase we have seen over
the last few years, while Oracle continues to decline. New entries this
time around include Realsil (the Realtek card reader work), the Peking
University Microprocessor R&D Laboratory (the unicore32 architecture),
NetUP (various drivers), and the KFKI Research Institute (ipset).
Occasionally it is interesting to look at the list of non-author signoffs -
Signed-off-by tags added by developers who are not the authors of the
patches involved. For 2.6.39, that list looks like this:
Developers with the most signoffs (total 8766):
| Developer | Signoffs | Percent |
| --- | --- | --- |
| Greg Kroah-Hartman | 1162 | 13.3% |
| David S. Miller | 546 | 6.2% |
| John W. Linville | 437 | 5.0% |
| Mauro Carvalho Chehab | 434 | 5.0% |
| Andrew Morton | 317 | 3.6% |
| James Bottomley | 220 | 2.5% |
| Ingo Molnar | 186 | 2.1% |
| Mark Brown | 158 | 1.8% |
| Sascha Hauer | 135 | 1.5% |
| Tony Lindgren | 129 | 1.5% |
| Takashi Iwai | 124 | 1.4% |
| Samuel Ortiz | 106 | 1.2% |
| Paul Mundt | 100 | 1.1% |
| Matthew Garrett | 99 | 1.1% |
| Russell King | 98 | 1.1% |
| Jeff Kirsher | 97 | 1.1% |
| Jiri Kosina | 95 | 1.1% |
| Linus Torvalds | 94 | 1.1% |
| Patrick McHardy | 90 | 1.0% |
| Konrad Rzeszutek Wilk | 89 | 1.0% |
Greg Kroah-Hartman contributed "only" 69 patches to 2.6.39, but another
1,162 - over 13% of the total - passed through his hands on their way into
the kernel. The bulk of those changes applied to the staging tree, but
they were certainly not limited to staging.
Linus Torvalds directly merged only 94 changes from others;
everything else came in by way of a subsystem maintainer's tree.
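For readers who want to see how a non-author signoff count could be derived, here is a rough Python sketch. It is not the tool used to generate the statistics above; the non_author_signoffs() helper is hypothetical and the three sample commits are invented purely for illustration. Only the final percentage check uses numbers taken from the table.

```python
import re
from collections import Counter

# Matches "Signed-off-by: Name <address>" lines and captures the name.
SIGNOFF_RE = re.compile(r"^Signed-off-by:[ \t]*(.+?)[ \t]*<[^>]*>[ \t]*$",
                        re.MULTILINE)


def non_author_signoffs(commits):
    """Count Signed-off-by tags added by someone other than the author.

    `commits` is an iterable of (author_name, commit_message) pairs.
    """
    counts = Counter()
    for author, message in commits:
        for signer in SIGNOFF_RE.findall(message):
            if signer != author:
                counts[signer] += 1
    return counts


# Invented sample data, not real kernel commits.
commits = [
    ("Alice Author",
     "fix foo\n\n"
     "Signed-off-by: Alice Author <alice@example.org>\n"
     "Signed-off-by: Mary Maintainer <mary@example.org>\n"),
    ("Bob Builder",
     "add bar\n\n"
     "Signed-off-by: Bob Builder <bob@example.org>\n"
     "Signed-off-by: Mary Maintainer <mary@example.org>\n"),
    ("Mary Maintainer",
     "tidy baz\n\n"
     "Signed-off-by: Mary Maintainer <mary@example.org>\n"),
]

counts = non_author_signoffs(commits)
for name, count in counts.most_common():
    print(f"{name}: {count}")

# Percentage for the top entry in the table: 1162 of 8766 signoffs.
top_share = round(100 * 1162 / 8766, 1)
print(f"{top_share}%")
```

In the sample, a maintainer's signoff on her own patch is not counted, which matches the definition given above: only tags added by developers who are not the patch author contribute to the totals.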
Despite being one of the more active development cycles in recent years,
2.6.39 has also been one of the smoothest. The number of difficult
regressions has been small, and, if Linus's current plan holds, the cycle
could complete in just over 60 days, which would make it the shortest
development cycle since the beginning of the git era. Kernel development
is not without its glitches, but the process would appear to be working
quite smoothly.
(As always, thanks are due to Greg Kroah-Hartman for his help in the
creation of these statistics.)
By Jonathan Corbet
May 11, 2011
When a process writes to a file-backed page in memory (through either a
memory mapping or with the
write() system call), that page is
marked dirty and must eventually be written to its backing store. The
writeback code, when it gets around to that page, will mark the page
read-only,
set the "under writeback" page flag, and queue the I/O operation. The
write-protection of the page is not there to prevent changes to the page;
its purpose is to detect further writes which would require that another
writeback be done. Current kernels will, in most situations, allow a
process to modify a page while the writeback operation is in progress.
Most of the time, that works just fine. In the worst case, the second
write to the page will happen before the first writeback I/O operation begins;
in that case, the more recently written data will also be written to disk
in the first I/O operation and a
second, redundant disk write will be queued later. Either way, the data
gets to its backing store, which is the real intent.
There are cases where modifying a page that is under writeback is a bad
idea, though. Some devices can perform integrity checking, meaning that the data
written to disk is checksummed by the hardware and compared against a
pre-write checksum
provided by the kernel. If the data changes after the kernel calculates
its checksum, that check will fail, causing a spurious write error.
Software RAID implementations can be tripped up by changing data as well.
As a result of problems like this, developers working in the filesystem
area have been convinced for a while that the kernel needs to support
"stable pages" which are guaranteed not to change while they are under
writeback.
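The integrity-checking failure described above can be reproduced in miniature. The Python sketch below is a toy model only: CRC32 from zlib stands in for whatever checksum real hardware uses, and submit_write() and device_verify() are hypothetical names invented for this example, not kernel or device interfaces.

```python
import zlib

PAGE_SIZE = 4096


def submit_write(page):
    """Toy kernel side: compute the pre-write checksum over the page."""
    return zlib.crc32(page)


def device_verify(page, expected):
    """Toy device side: checksum the data actually transferred and
    compare it with the checksum supplied at submission time."""
    return zlib.crc32(page) == expected


page = bytearray(PAGE_SIZE)
page[0:5] = b"hello"

checksum = submit_write(page)

# Case 1: the page is left alone while the write is in flight.
stable_ok = device_verify(page, checksum)

# Case 2: a process changes one byte after the checksum was computed
# but before the device has read the data.
page[0] = ord("H")
racy_ok = device_verify(page, checksum)

print("unmodified page verifies:", stable_ok)
print("modified page verifies:", racy_ok)
```

The second check fails even though nothing is wrong with the storage path, which is the spurious write error that stable pages are meant to prevent.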
When LWN looked at stable pages in
February, Darrick Wong had just posted a patch aimed at solving this
problem. In situations where integrity checking was in use, the kernel
would make a copy of each page before beginning a writeback operation.
Since nobody in user space knew about the copy, it was guaranteed to remain
unmolested for the duration of the write operation. This patch solved the
problem for the integrity checking case, but all of those copy operations
were expensive. Given that providing stable pages in all situations was
seen as desirable, that cost was considered to be too high.
So Darrick has come back with a new patch
set which takes a different - and simpler - approach. In short, with
this patch, any attempt to write to a page which is under writeback will
simply wait until the writeback completes. There is no need to copy pages
or engage in other tricks, but there may be a cost to this approach as
well.
As noted above, a page will be marked read-only when it is written back;
there is also a page flag which indicates that writeback is in progress.
So all of the pieces are there to trap writes to pages under writeback. To
make it even easier, the VFS layer already has a callback
(page_mkwrite()) to notify filesystems that a read-only page is
being made writable; all Darrick really needed to do was to change how
those page_mkwrite() callbacks operate in the presence of writeback.
Some filesystems do not provide page_mkwrite() at all; for those,
Darrick created a generic empty_page_mkwrite() function which
locks the page, waits for any writeback to complete, then returns the
locked page. More complicated filesystems do have page_mkwrite()
handlers, though, so Darrick had to add similar functionality for ext2,
ext4, and FAT. Btrfs has implemented stable pages internally for some
time, so no changes were required there. Ext3 turns out to have some
complicated interactions with the journal layer which make a stable page
implementation hard; since invasive changes to ext3 are not welcomed at
this point, that filesystem may never get stable page support.
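The wait-for-writeback behavior can be modeled in user space with a condition variable. The Python sketch below is a toy simulation under loud assumptions: the Page class, its flags, and do_io() are invented for this example and only mimic the idea described above; they are not the kernel's data structures, and the real page_mkwrite() handlers differ from filesystem to filesystem.

```python
import threading
import time


class Page:
    """Toy stand-in for a page-cache page; not kernel code."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.dirty = True        # modified, needs writeback
        self.writable = True     # False models the read-only mapping
        self.writeback = False   # models the "under writeback" flag
        self._cond = threading.Condition()

    def start_writeback(self):
        with self._cond:
            self.writable = False
            self.writeback = True
            self.dirty = False

    def end_writeback(self):
        with self._cond:
            self.writeback = False
            self._cond.notify_all()

    def write(self, offset, payload):
        with self._cond:
            if not self.writable:
                # Models the write-protection fault: wait until any
                # writeback in progress has finished, then allow writes.
                self._cond.wait_for(lambda: not self.writeback)
                self.writable = True
            self.data[offset:offset + len(payload)] = payload
            self.dirty = True


def do_io(page, disk):
    """Toy I/O path: a slow device reads the page, then completes."""
    time.sleep(0.05)
    disk.append(bytes(page.data))
    page.end_writeback()


page = Page(b"old data")
disk = []

page.start_writeback()
io_thread = threading.Thread(target=do_io, args=(page, disk))
io_thread.start()

page.write(0, b"new")    # blocks until do_io() calls end_writeback()
io_thread.join()

print("written to disk:", disk[0])
print("page now holds:", bytes(page.data), "dirty:", page.dirty)
```

In this model the device always sees the contents the page had when writeback started, and the later write leaves the page dirty so that a second writeback will pick it up. The price is visible in the write() call, which stalls for the full duration of the simulated I/O.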
There have been concerns expressed that this approach could slow down
applications which repeatedly write to the same part of a file. Before
this change, writeback would not slow down subsequent writes; afterward,
those writes will wait for writeback to complete. Darrick ran some
benchmarks to test this case and found a performance degradation of up to
12%. This slowdown is unwelcome, but there also seems to be a consensus
that there are very few applications which would actually run into this
problem. Repetitively rewriting data is a relatively rare pattern; indeed,
the developers involved are saying that
they don't even know of a real-world case they can test.
Lack of awareness of applications which would be adversely affected by this
change does not mean that they don't exist, of course. This is the kind of
change which can create real problems a few years down the line when the
code is finally shipped by distributors and deployed by users; by then,
it's far too late to go back. If there are applications which would react
poorly to this change, it would be good to get the word out now. Otherwise
the benefits of stable pages are likely to cause them to be adopted in most
settings.
Page editor: Jonathan Corbet