Brief items
3.10-rc7 is the current development kernel. It was released on June 22 and is likely the last -rc
for 3.10. "rc7 contains a fairly mixed collection of fixes all over, with (as
usual) drivers and some arch updates being most noticeable. This time
we had media updates in particular.
But there's also core fixes for things like fallout from the common
cpu-idle routines and some timer accounting for the new full-NOHZ
modes etc. So we have stuff all over, most of it happily fairly small."
Stable updates: The 3.9.7, 3.4.50, and 3.0.83 stable kernels were released
on June 20.
The 3.9.8, 3.4.51, and 3.0.84 stable kernels are under review and
should be expected on June 27.
All your cgroups are belong to us!
—
Lennart Poettering
So I settled in and read the whole spec. It was fun reading (side note, it seems that BIOS engineers think Windows kernel developers are lower on the evolutionary scale than they are, and for all I know, they might be right...), and I'll summarize the whole super-secret, NDA-restricted specification, when it comes to how an operating system is supposed to deal with Thunderbolt, shhh, don't tell anyone that I'm doing this:
Thunderbolt is PCI Express hotplug, the BIOS handles all the hard work.
—
Greg Kroah-Hartman
Tracing should be treated as a drug. Just like not allowing smoking
in public places, only punish the users, not those that want to
breathe fresh air.
—
Steven Rostedt
Here's an interesting post from Arjan van de Ven on how power management
works in contemporary Intel processors. In short, it's complicated. "The key
thing here is that Core A gets a very variable behavior, independent of
what it asked for, due to what Core B is doing. Or in other words, the
forward predictive value of a P state selection on a logical CPU is rather
limited."
Kernel development news
By Jonathan Corbet
June 26, 2013
The 3.10 kernel development cycle is nearing its completion; as of this
writing, the 3.10-rc7 prepatch is out and the kernel appears to be
stabilizing as expected. As predicted, 3.10 has turned out to be the
busiest development cycle ever, with almost 13,500 non-merge changesets
pulled into the mainline repository (so far). What follows is LWN's
traditional look at where those changes came from.
3.9 set a record of its own, with 1,388 developers contributing changes.
So far, with a mere 1,374 contributors, 3.10 falls short of that record,
but that situation clearly could change before the final release is
made. The size of our development community, it seems, continues to
increase.
The most active 3.10 developers were:
| Most active 3.10 developers |
| By changesets |
| H Hartley Sweeten | 392 | 2.9% |
| Jingoo Han | 299 | 2.2% |
| Hans Verkuil | 293 | 2.2% |
| Alex Elder | 268 | 2.0% |
| Al Viro | 205 | 1.5% |
| Felipe Balbi | 202 | 1.5% |
| Sachin Kamat | 192 | 1.4% |
| Laurent Pinchart | 174 | 1.3% |
| Johan Hovold | 159 | 1.2% |
| Mauro Carvalho Chehab | 158 | 1.2% |
| Wei Yongjun | 139 | 1.0% |
| Arnd Bergmann | 138 | 1.0% |
| Eduardo Valentin | 138 | 1.0% |
| Axel Lin | 112 | 0.8% |
| Lee Jones | 111 | 0.8% |
| Lars-Peter Clausen | 99 | 0.7% |
| Kuninori Morimoto | 98 | 0.7% |
| Tejun Heo | 97 | 0.7% |
| Mark Brown | 97 | 0.7% |
| Johannes Berg | 96 | 0.7% |
|
| By changed lines |
| Joe Perches | 34561 | 4.5% |
| Hans Verkuil | 18739 | 2.4% |
| Kent Overstreet | 18690 | 2.4% |
| Larry Finger | 17222 | 2.2% |
| Greg Kroah-Hartman | 16610 | 2.2% |
| Shawn Guo | 12879 | 1.7% |
| Dave Chinner | 12838 | 1.7% |
| Paul Zimmerman | 12637 | 1.6% |
| H Hartley Sweeten | 12518 | 1.6% |
| Al Viro | 11116 | 1.4% |
| Andrey Smirnov | 11107 | 1.4% |
| Mauro Carvalho Chehab | 9726 | 1.3% |
| Laurent Pinchart | 9258 | 1.2% |
| Jussi Kivilinna | 8960 | 1.2% |
| Lee Jones | 8598 | 1.1% |
| Sylwester Nawrocki | 8305 | 1.1% |
| Artem Bityutskiy | 8094 | 1.0% |
| Dave Airlie | 7546 | 1.0% |
| Guenter Roeck | 7510 | 1.0% |
| Sanjay Lal | 7428 | 1.0% |
|
H. Hartley Sweeten's position at the top of the list seems like a permanent
aspect of these reports as he continues his work on the endless task of
cleaning up the Comedi drivers in the staging tree. Jingoo Han contributed
a long list of driver cleanup patches, moving the code toward the use of
standard helper functions and the "managed" resource allocation API. Hans
Verkuil improved a number of video acquisition drivers as part of his
new(ish) role as the maintainer of the Video4Linux subsystem. Alex
Elder's work is focused on the Ceph filesystem and associated "RADOS" block
device, and Al Viro implemented a large number of core kernel improvements
and API changes. Together, these five developers accounted for nearly 11%
of all the changes going into the kernel.
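The "nearly 11%" figure can be checked against the table with a little arithmetic. The sketch below sums the five changeset counts and divides by 13,500, which is the rounded total quoted above rather than the exact changeset count, so the result is only approximate. The helper names are made up for this illustration.

```shell
#!/bin/sh
# Sketch: check the top-five share against the changeset table.
# share_percent and top_five_total are illustrative names.

# share_percent PART TOTAL
# Prints PART as a percentage of TOTAL, to one decimal place.
share_percent() {
    awk -v part="$1" -v total="$2" \
        'BEGIN { printf "%.1f\n", part / total * 100 }'
}

# Changeset counts for Sweeten, Han, Verkuil, Elder, and Viro,
# copied from the "By changesets" table.
top_five_total() {
    echo $(( 392 + 299 + 293 + 268 + 205 ))
}

# Usage (13500 is the article's rounded total, an approximation):
#   share_percent "$(top_five_total)" 13500
```

The same helper applied to a single row (392 changesets) lands on the 2.9% shown in the table, which suggests the rounded total is close enough for this purpose.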
In the "lines changed" column, Joe Perches topped the list with a set of
patches effecting whitespace cleanups, printk() format changes,
checkpatch.pl tweaks, and more. Kent Overstreet added the bcache block caching subsystem and a number of
asynchronous I/O improvements. Larry Finger's 17 patches added new
features and device support to the rtlwifi driver, and Greg Kroah-Hartman
removed the Android "CCG" USB gadget driver from the staging tree.
Just over 200 employers are known to have supported work on the 3.10
kernel. The most active of these were:
| Most active 3.10 employers |
| By changesets |
| (None) | 1495 | 11.1% |
| Red Hat | 1269 | 9.4% |
| Intel | 912 | 6.8% |
| Linaro | 877 | 6.5% |
| Texas Instruments | 765 | 5.7% |
| (Unknown) | 746 | 5.5% |
| Samsung | 615 | 4.6% |
| IBM | 402 | 3.0% |
| Vision Engraving Systems | 392 | 2.9% |
| Google | 350 | 2.6% |
| SUSE | 332 | 2.5% |
| Renesas Electronics | 331 | 2.5% |
| Cisco | 300 | 2.2% |
| Inktank Storage | 277 | 2.1% |
| Broadcom | 182 | 1.3% |
| NVidia | 180 | 1.3% |
| Freescale | 175 | 1.3% |
| Oracle | 175 | 1.3% |
| Trend Micro | 139 | 1.0% |
| Fujitsu | 138 | 1.0% |
|
| By lines changed |
| (None) | 118326 | 15.3% |
| Red Hat | 88080 | 11.4% |
| Linaro | 64697 | 8.4% |
| Intel | 50641 | 6.6% |
| Google | 33342 | 4.3% |
| Cisco | 24109 | 3.1% |
| (Unknown) | 24033 | 3.1% |
| Samsung | 20893 | 2.7% |
| Texas Instruments | 20289 | 2.6% |
| NVidia | 18470 | 2.4% |
| Linux Foundation | 16759 | 2.2% |
| Renesas Electronics | 15777 | 2.0% |
| IBM | 14385 | 1.9% |
| QLogic | 14165 | 1.8% |
| Synopsys | 13698 | 1.8% |
| Vision Engraving Systems | 13111 | 1.7% |
| Broadcom | 12770 | 1.7% |
| Synapse Product Development | 11107 | 1.4% |
| OpenSource AB | 9584 | 1.2% |
| SUSE | 9479 | 1.2% |
|
With 3.10, Red Hat regained its usual place as the company with the most
contributions, though even Red Hat, once again, falls short of the
contributions from volunteers. Contributions from the mobile and embedded
community continue their impressive growth; Linaro, in particular, keeps
expanding, with 42 developers contributing code under its name to 3.10.
In summary, the kernel's busiest development cycle ever shows the
continuation of a number of patterns that have been observed for a while:
increasing participation from the mobile and embedded worlds, more
developers, and more companies. There was a slight uptick in volunteer
contributors this time around, but it is not at all clear that the
long-term decline in that area has been interrupted. As a whole,
the kernel development machine continues to operate in its familiar,
predictable, and productive manner.
By Jake Edge
June 26, 2013
The Ftrace kernel tracing facility has long had the ability to turn tracing
on or off based on hitting a certain function (the trigger) during the
trace. For 3.10, several new trigger commands that allow even more control
over the tracing process were added, but they can only be triggered by
function entry. A recent patch set from Tom Zanussi would extend the Ftrace
trigger feature to apply more widely so that triggers could be specified
for any tracepoint. Triggers allow the user to reduce the amount of tracing
information gathered, so that narrowing in on the behavior of interest is
easier.
In 3.10, patches by
Steven Rostedt add several trigger actions to Ftrace
that will allow triggers to create a
snapshot of the tracing buffers, add a stack trace to the buffers, or
to disable and enable other trace events when the trigger is hit. Any of
those (and the long-available traceon and traceoff trigger
actions) can be associated with a particular function call in the kernel,
but not with an arbitrary trace event. The latter is what Zanussi's
patches would allow.
Tracepoints can be placed by kernel developers in locations of interest
for debugging or other kinds of monitoring; when an active tracepoint is
hit while
tracing is active, it will generate a trace event that gets stored in the
tracing buffers.
Some tracepoints have been
added permanently to the kernel, but others can be placed ad hoc to
provide extra information, particularly when tracking down a bug.
In contrast, the function
entry events are placed automatically at build time when Ftrace is enabled
in the
kernel. In both cases, the tracepoints are disabled until a user enables them
through the debugfs interface.
Disabling or enabling a tracepoint is a fairly intrusive process that involves
modifying the code at the tracepoint site. That, in turn, requires the use
of stop_machine(), which will stop execution on all other CPUs in
the system so that the modified code properly propagates to the entire
system. But a trigger can fire on any function call (or, with Zanussi's
patches, any tracepoint), so there needs to be a way to enable and disable
tracepoints that doesn't require taking any locks or sleeping. That
constraint led Rostedt to add a way to "soft disable" trace events. A
soft-disabled tracepoint is simply marked as disabled with a flag; the code
itself is not modified, which means that the function handling the trace
event will still be called but will immediately return. Any tracepoints
that might be enabled by a trigger are initialized at the time the trigger
is set, but marked as soft disabled until the triggering event is hit.
All of the trigger capabilities are aimed at narrowing in on a particular
behavior (typically a bug) of interest. The pre-3.10 triggers only allowed
using the "big hammer" of disabling and enabling tracing itself, while
these new trigger actions provide more fine-grained control. For example,
using the "snapshot" trigger action will create a snapshot of the existing
tracing buffer in a separate file while continuing the trace. Rostedt
shows an example in his update
to Documentation/trace/ftrace.txt:
# in the tracing debugfs directory, typically /sys/kernel/debug/tracing
echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
That command will trigger a snapshot, once (that's what the ":1" does), when
the native_flush_tlb_others() function is called.
Or to enable a specific event:
echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > set_ftrace_filter
That will enable the sched_switch event in the sched subsystem when the
try_to_wake_up() function is called. The event will be enabled the first
two times that try_to_wake_up() is called; presumably some other trigger is
disabling the event in between.
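Both commands follow the same pattern: a function name, an action, and an optional count, joined by colons. A small wrapper can make that structure explicit. What follows is only a sketch; the helper names and the TRACING_DIR variable are invented here and are not part of Ftrace. It appends with ">>" rather than ">" so that anything already in set_ftrace_filter is left alone, which is a choice made for this example.

```shell
#!/bin/sh
# Sketch: build and install Ftrace function triggers.
# TRACING_DIR and both helper names are illustrative, not part of Ftrace.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"

# function_trigger FUNC ACTION [COUNT]
# Prints "FUNC:ACTION" or "FUNC:ACTION:COUNT". ACTION may itself contain
# colons, as in "enable_event:sched:sched_switch".
function_trigger() {
    if [ -n "${3:-}" ]; then
        printf '%s:%s:%s\n' "$1" "$2" "$3"
    else
        printf '%s:%s\n' "$1" "$2"
    fi
}

# install_function_trigger FUNC ACTION [COUNT]
# Appends the command to set_ftrace_filter; on a real system this
# needs root and a mounted debugfs.
install_function_trigger() {
    function_trigger "$@" >> "$TRACING_DIR/set_ftrace_filter"
}

# Usage, mirroring the two examples above:
#   install_function_trigger native_flush_tlb_others snapshot 1
#   install_function_trigger try_to_wake_up enable_event:sched:sched_switch 2
```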
Zanussi's patches (which come with good documentation and examples, both in
the introductory patch and an update to
Documentation/trace/events.txt) apply those same triggers to
the events interface. Thus the kmem:kmalloc event can be enabled
when the read() system call is entered as follows:
# again from /sys/kernel/debug/tracing
echo 'enable_event:kmem:kmalloc:1' > events/syscalls/sys_enter_read/trigger
That will also cause the event to be soft disabled until the trigger is
hit, which can be verified by:
cat events/kmem/kmalloc/enable
That will show a "0*" as output, which means that the event is soft disabled.
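Those two steps, setting a trigger on one event and then checking the state of another, can be wrapped up as shown below. This is a rough sketch: the helper names and TRACING_DIR are invented for illustration, and the per-event trigger file only exists on a kernel carrying Zanussi's patches. The only state string taken from the text above is "0*"; treating a bare "0" or "1" as disabled or enabled matches the usual behavior of the enable file, and anything else is simply reported as-is.

```shell
#!/bin/sh
# Sketch: set an event trigger and report an event's enable state.
# TRACING_DIR and the helper names are illustrative assumptions.
TRACING_DIR="${TRACING_DIR:-/sys/kernel/debug/tracing}"

# set_event_trigger SUBSYSTEM EVENT COMMAND
# Writes COMMAND into the trigger file of SUBSYSTEM:EVENT.
set_event_trigger() {
    printf '%s\n' "$3" > "$TRACING_DIR/events/$1/$2/trigger"
}

# event_state SUBSYSTEM EVENT
# Interprets the contents of the event's enable file.
event_state() {
    state=$(cat "$TRACING_DIR/events/$1/$2/enable") || return 1
    case "$state" in
        '0*') echo "soft-disabled" ;;    # literal "0*", as described above
        0)    echo "disabled" ;;
        1)    echo "enabled" ;;
        *)    echo "other: $state" ;;    # anything this sketch does not know
    esac
}

# Usage:
#   set_event_trigger syscalls sys_enter_read 'enable_event:kmem:kmalloc:1'
#   event_state kmem kmalloc
```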
But there is more that can be done using triggers. The trace events filter
interface allows tests to be performed to filter the tracing output, and
Zanussi's patches extend that to the trigger feature. So adding a stack
backtrace to the tracing buffer for the first five kmem:kmalloc events
where the bytes requested value is 512 or more could be done this way:
echo 'stacktrace:5 if bytes_req >= 512' > events/kmem/kmalloc/trigger
One can check on the status of the trigger using:
cat events/kmem/kmalloc/trigger
which would show:
stacktrace:count=2 if bytes_req >= 512
if three of the five stack backtraces had been emitted into the tracing
buffer (count=2 shows how many remain).
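Since the trigger file reports the remaining count rather than the number of times the trigger has fired, a little subtraction is needed to get the latter. The sketch below assumes the output format shown above ("action:count=N", optionally followed by a filter); the function names are made up for this example.

```shell
#!/bin/sh
# Sketch: work out how often a counted trigger has fired.
# Assumes lines shaped like "stacktrace:count=2 if bytes_req >= 512".

# trigger_remaining LINE
# Prints the N from ":count=N"; fails if the line has no count.
trigger_remaining() {
    case "$1" in
        *:count=*) ;;
        *) return 1 ;;
    esac
    rest=${1#*:count=}             # drop everything through ":count="
    printf '%s\n' "${rest%% *}"    # keep what precedes the first space
}

# trigger_fired INITIAL LINE
# Prints INITIAL minus the remaining count reported in LINE.
trigger_fired() {
    remaining=$(trigger_remaining "$2") || return 1
    echo $(( $1 - remaining ))
}

# Usage:
#   trigger_fired 5 "$(cat events/kmem/kmalloc/trigger)"
```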
The triggers and tests can be mixed and matched in various ways, but
there are some restrictions. There can only be one
stacktrace or snapshot trigger per event and each event can only have one
traceon and traceoff trigger. A triggering event can have multiple
enable_event and
disable_event triggers, but each must refer to a different event to be enabled
or disabled. So sys_enter_read could enable two different events (with two different commands echoed into its trigger
file) but the events so enabled must be distinct.
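One way to stay within those restrictions is to check a planned set of triggers before echoing any of them into a trigger file. The following is a minimal sketch of such a check, not anything provided by the patches. It reads the rules above as allowing one each of stacktrace, snapshot, traceon, and traceoff per triggering event, and it treats enable_event and disable_event as conflicting whenever they name the same target event; that last reading is an assumption.

```shell
#!/bin/sh
# Sketch: sanity-check a list of triggers meant for one triggering event.
# Reads one trigger command per line on stdin; prints "ok" or the first
# conflicting line, and returns non-zero on a conflict.
check_triggers() {
    awk '
    NF == 0 { next }                      # skip blank lines
    {
        cmd = $1                          # ignore any " if <filter>" part
        n = split(cmd, part, ":")
        action = part[1]
        if (action == "enable_event" || action == "disable_event") {
            if (n < 3) { print "malformed: " $0; bad = 1; exit }
            # Assumption: the rule is keyed on the target event alone.
            key = "event " part[2] ":" part[3]
        } else {
            key = action                  # one trigger per other action
        }
        if (key in seen) { print "conflict: " $0; bad = 1; exit }
        seen[key] = 1
    }
    END { if (bad) exit 1; print "ok" }
    '
}

# Usage:
#   printf '%s\n' 'enable_event:kmem:kmalloc:1' 'stacktrace:5' | check_triggers
```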
The Ftrace triggers will be released soon with 3.10. Zanussi's patches are
still in the review stage. Given that the 3.11 merge window will likely
open soon, triggers for all events will probably have to wait for 3.12.
By Jonathan Corbet
June 26, 2013
The number of latency-sensitive applications running on Linux seems to be
increasing, with the result that more latency-related changes are finding
their way into the kernel. Recently LWN looked at the Ethernet device
polling patch set, which implements polling to provide minimal latency to
critical networking tasks. But what happens if you want the lowest possible
latency for block I/O requests instead? Matthew Wilcox's block driver
polling patch is an attempt to answer that question.
As Matthew put it, there are users who are
willing to go to great lengths to lower the latencies they experience with
block I/O requests:
The problem is that some of the people who are looking at those
technologies are crazy. They want to "bypass the kernel" and "do
user space I/O" because "the kernel is too slow". This patch is
part of an effort to show them how crazy they are.
The patch works by adding a new driver callback to struct backing_dev_info:
int (*io_poll)(struct backing_dev_info *bdi);
This function, if present, should poll the given device for completed I/O
operations. If any are found, they should be signaled to the block layer;
the return value is the number of operations found (or a negative error
code).
Within the block layer, the io_poll() function will be called
whenever a process is about to sleep waiting for an outstanding operation.
By placing the poll calls there, Matthew hopes to avoid going into polling when there
is other work to be done; it allows, for example, the submission of
multiple operations without invoking the poll loop. But, once a process
actually needs the result of a submitted operation, it begins polling rather
than sleeping.
Polling continues until one of a number of conditions comes about. One of
those, of course, is that an operation that the current process is waiting
for completes. In the absence of a completed operation, the process will
continue polling until it receives a signal or the scheduler
indicates that it would like to switch to a different process. So, in
other words, polling will stop if a higher-priority process becomes
runnable or if the current process exhausts its time slice. Thus, while
the polling happens in the kernel, it is limited by the relevant process's
available CPU time.
Linus didn't like this approach, saying
that the polling still wastes CPU time even if there is no higher-priority
process currently contending for the CPU. That said, he's not necessarily
opposed to polling; he just does not want it to happen if there might be other
runnable processes. So, he suggested, the polling should be moved to the
idle thread. Then polling would only happen when the CPU was about to go
completely idle, guaranteeing that it would not get in the way of any other
process that had work to do.
But Linus might actually lose in this case. Block maintainer Jens Axboe responded that an idle-thread solution would
not work. "If you need to take the context
switch, then you've negated pretty much all of the gain of the polled
approach." He also noted that the
current patch does the polling in (almost) the right place, just where the
necessary information is available. So Jens appears to be disposed toward
merging something that looks like the current patch; at that point, Linus
will likely accept it.
But Jens did ask for a bit more smarts when it comes to deciding when the
polling should be done; in the current patch, it happens unconditionally
for any device that provides an io_poll() function. A better
approach, he said, would be to provide a way for specific processes to opt
in to the polling, since, even on latency-sensitive systems, polling will
not be needed by all processes. Those processes that do not need extremely
low latency should not have to give up some of their allotted CPU time for
I/O polling.
So the patch will certainly see some work before it is ready for merging.
But the benefits are real: in a test run by Matthew on an NVMe device, I/O
latencies dropped from about 8.0µs to about 5.5µs — a significant
reduction. The benefit will only become more pronounced as the speed of
solid-state storage devices increases; as the time required for an I/O
operation approaches 1µs, an extra 2.5µs of overhead will come to dominate
the picture. Latency-sensitive users will seek to eliminate that overhead
somehow; addressing it in the kernel is a good way to ensure that all users
are able to take advantage of this work.
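The numbers make the point on their own. The sketch below works through them under a simplifying assumption that is not stated in Matthew's results: that the 2.5µs difference is a fixed overhead added on top of the device's own time. The function names are invented for this example.

```shell
#!/bin/sh
# Sketch: arithmetic behind the latency figures quoted above.
# Assumes (for illustration only) a fixed per-request overhead.

# latency_reduction BEFORE_US AFTER_US
# Prints the reduction as a whole-number percentage of BEFORE_US.
latency_reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (before - after) / before * 100 }'
}

# overhead_share DEVICE_US OVERHEAD_US
# Prints the overhead as a whole-number percentage of the total time.
overhead_share() {
    awk -v dev="$1" -v ovh="$2" \
        'BEGIN { printf "%.0f\n", ovh / (dev + ovh) * 100 }'
}

# Usage:
#   latency_reduction 8.0 5.5   # the measured NVMe case
#   overhead_share 1 2.5        # a hypothetical 1µs device
```

On these assumptions, polling trims roughly a third off the measured latency today, while the same overhead on a 1µs device would account for well over half of each request.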
Patches and updates
Kernel trees
- Sebastian Andrzej Siewior: 3.8.13-rt12 (June 21, 2013)
Page editor: Jake Edge