The current development kernel is 2.6.38-rc3, released on February 1. "Nothing
hugely special in here, and I'm happy to say that most of the pull requests
have been nice clear bug-fixes and fixing regressions. Thanks to most of
you for that.
" As always, see the
for all the details.
Stable updates: no stable updates have been released in the last
week. The next stable update is in the review
process as of this writing and could be released soon.
If you ever have a function with the string "check" in its name,
it's a good sign that you did something wrong.
-- Andrew Morton
In GNOME 3.0, we're defaulting to suspending the computer when the
user shuts the lid, and not providing any preferences combobox to
change this. This is what the UI designers for GNOME 3.0 want, and
is probably a step in the right direction. We really can't keep
working around bugs in the kernel with extra UI controls.
I don't normally do acked-by's. I think it's my way of avoiding
getting blamed when it all blows up.
-- Andrew Morton
As anybody who has ever sat through an "allyesconfig" build - or a
build using a distributor configuration - understands, there is a
lot of code in the kernel. Most of the time, creating the perfect
kernel is a matter of excluding code until the size becomes reasonable. So
it's a rare kernel build that actually compiles a majority of the code
found in the tree.
The kernel configuration mechanism makes it possible to perform this
selection. Part of this mechanism is wired into the build system; it
allows source files to be passed over entirely if they contain nothing of
interest. The other half, though, is implemented with preprocessor symbols
and conditional compilation. Kernel developers may be discouraged from
using #ifdef, but there are still a lot of conditional blocks in the code.
Sometimes, the logic which leads to the inclusion or exclusion of a
specific block is complex and not at all clear. There are many
configuration options in the kernel, and they can depend on each other in
complicated ways. As a result, dead code - code which will not be compiled
regardless of the selected configuration - may escape notice for years.
Dead code adds noise to the
source tree and, since nobody ever runs it, it is more than likely to
contain bugs. If that code is re-enabled or copied, those bugs could
spread through the tree in surprising ways.
So it would be good to be able to identify dead code and get it out of the
tree. The newly-released undertaker tool
was designed to do a number of types of static analysis, including dead
code identification. Developers can run it on their own to find dead
blocks in specific files; there is also a
web interface which allows anybody to browse through the tree and find
the dead sections. That should lead to patches hauling away the bodies and
cleaning up the tree, which is a good thing.
The kernel configuration system is a complex bit of code in its own right;
many people who have no trouble hacking on kernel code find reasons to
avoid going into the configuration subsystem. There is value in being able
to work with the complicated data structure that is a kernel configuration,
though. Ulf Magnusson has recently posted a library, Kconfiglib,
which, he hopes, will make that task easier.
Kconfiglib is a Python library which is able to load, analyze, and output
kernel configurations; care has been taken to ensure that any configuration
it creates is identical to what comes out of the existing kernel
configuration system. With Kconfiglib, it becomes straightforward to write
simple tools like "allnoconfig"; it also is possible to ask questions about
a given configuration. One possible tool, for example, would answer the
"why can't I select CONFIG_FOO" question - a useful feature
There are currently no Python dependencies in the kernel build system;
trying to add one could well run into opposition. But Kconfiglib could
find a role in the creation of ancillary tools which are not required to
configure and build a kernel as it's always been done. For the curious,
there's a set of examples available.
Kernel development news
Monitoring a system for "reliability, availability, and serviceability"
(RAS) is an important part of keeping that system, or a cluster of such
computers, up and running. There is a
wide variety of things that could be monitored for RAS
purposes—memory errors, CPU
temperature, RAID and filesystem health, and so on—but Borislav
Petkov's RAS daemon is just targeted at
gathering information on any machine check exceptions (MCEs) that occur.
The daemon uses trace events and the perf infrastructure, which
requires a fair amount of restructuring of that code to make it available
not only to the RAS daemon, but also for other kinds of tools.
The first step is to create persistent perf events, which are events that
are always enabled, and will have event buffers allocated, even if there is
no process currently looking at the data. That allows the MCE trace event to be
enabled at boot time,
before there is any task monitoring the perf buffer. Once the boot has
completed, the RAS daemon (or some other program) can mmap() the
event buffers and start
monitoring the event. This will allow the RAS daemon to pick up any MCE
events that happened during the boot process.
To do that, the struct perf_event_attr gets a new
persistent bitfield that is used to determine whether or not to
destroy the event buffers when they are unmapped. In addition, persistent
events can be shared by multiple monitoring programs because they can be
mapped as shared and read-only. Once the persistent events are added, the
next patch then changes the MCE event to become a persistent event.
With the stage set, Petkov then starts to rearrange the perf code so that the
RAS daemon and other tools can access some of the code that is currently
buried in the tools/perf directory. That includes things like the
trace event utilities, which move from tools/perf/util to
tools/lib/trace and some helper functions for debugfs that move to
tools/lib/lk. These were obviously things that were needed when
creating the RAS daemon, but not easily accessible.
A similar patch moves the mmap() helper functions from the
tools/perf directory to another new library:
tools/lib/perf. These functions handle things like reading the
head of the event buffer queue, writing at the tail of the queue, and
reading and summing all of the per-cpu event counters for a given event.
In response to the patch moving the
mmap() helpers, Arnaldo Carvalho de Melo pointed out that he had
already done some work to rework that code, and that it would reduce the
size of Petkov's patch set once it gets merged into the -tip tree. He also
noted that he had created a set of Python bindings and a simple
twatch daemon using those bindings. While Petkov had some reasons for writing the RAS daemon in C
rather than Python, mostly so that it would work on systems without Python
or with outdated versions, he did seem impressed: "twatch looks
almost like a child's play and even my grandma can profile her system now."
But the Python bindings aren't necessarily meant for production code, as
Carvalho de Melo describes. Because the
Python bindings are quite similar to their C counterparts, they can be used
to ensure that the kernel interfaces are right:
I.e. one can go on introducing the kernel interfaces and testing them
using python, where you can, for instance, from the python interpreter
command line, create counters, read its values, i.e. test the kernel
stuff quickly and easily.
Moving to a C version then becomes easy after the testing phase is over
and the kernel bits are set in stone.
There are some additional patches that move things around within the
tools tree before the final patch actually adds the RAS daemon.
The daemon is fairly straightforward, with the bulk of it being
boilerplate daemon-izing code. The rest parses the MCE event format (from
the mce/mce_record/format file in debugfs), then opens
and maps the debugfs mce/mce_recordN files (where N is the CPU
number). The main program sits in a loop checking for MCE events every
30 seconds, printing the CPU, MCE status, and address for any events that
have occurred to a log file. Petkov mentions decoding of the MCE status as
something that he is currently working on.
Obviously, the RAS daemon itself is not the end result Petkov is aiming
for. Rather, it is just a proof-of-concept for persistent events and
demonstrates one way to rearrange the perf code so that other tools can use
it. There may be disagreements about the way the libraries were arranged,
or the specific location of various helpers, but the overall goal seems
like a good one. Whether tools like ras actually end up in the
kernel tree is, perhaps, questionable—the kernel hackers may not want to
maintain a bunch of tools of this kind—but by making the utility code
more accessible, it becomes much easier for others to build these tools
on their own.
Wacom tablets are often the tool of choice for those who need accurate and
flexible input devices; they seem to be especially favored by artists.
Like a mouse, these tablets can report position and movement, but they can
also present multiple input devices to the system (one for each of several
different types of pens, for
example) and report variables like pen angle, pressure, and more. Support
in Linux for these
devices has not been as good as one might like, but, as Peter Hutterer
described in his talk at the linux.conf.au Libre Graphics Day miniconf, it
is getting better quickly. How that came to be is a classic example of how
to (or how not
to) manage kernel driver development.
Peter is the maintainer for the bulk of the graphical input drivers. He
has, he says, rewritten most of that subsystem, so he is to blame for the
bugs which can be found there. Most input devices are easily handled
through the evdev abstraction, but the Wacom driver is an exception. The
things which are unique to these tablets (multiple input "devices," one
associated with each pen, the pressure, tilt, and rotation axes, and the
relatively high resolution) require a separate driver for their support.
Thus, Wacom users must have the linuxwacom driver in their systems.
There is some confusion about the linuxwacom driver, because there are
multiple versions of it, all of which can be found on SourceForge. One version
(0.8.8) is created by Wacom itself; it is a classic vendor driver, Peter
said, with everything that usually implies about the development process
(code dumps) and the quality of the code itself. This driver ships as a
tarball containing a wild set of permutations of kernel and X.org versions;
it's a mess. But it's Wacom's mess, and the company has been resistant to
efforts to clean it up.
Peter got fed up with this situation in 2009 and forked the driver. His
version is now the default driver in a number of distributions, and is the
only one which supports newer versions of the X server. Looking at the
repositories, Peter found 78 commits total before the fork, all from
Wacom. After the fork, there are 788 commits, 65% from Red Hat, and 12%
from Wacom. Extracting the driver from its vendor-dominated situation has
definitely helped to increase its rate of development.
Surprisingly, the original vendor driver is still under development by
Wacom, despite the fact that it does not support current X servers and is
not shipped by any distributors. The original mailing list is still in
business, but, Peter warned, one should not ask questions about the new
driver there. Kernel development, he said, should be done on the
linux-kernel mailing list. There is also little point in talking to him
about problems with the older driver; Wacom insists on keeping control over it.
Update: Peter tells us that there are three mailing lists (linuxwacom-announce, linuxwacom-discuss and linuxwacom-devel) which are still the place to go for general questions, including hardware-specific questions. X driver development for the forked driver happens exclusively on linuxwacom-devel and all patches are sent there.
So the mailing lists are definitely the place to ask questions, at least in
regards to the X driver. The kernel driver is the exception here. Kernel driver development should happen on LKML, not on linuxwacom lists.
Much of the work Peter has done so far has been toward the goal of cleaning
up the driver. That has involved throwing out a number of features. Some
of those needed to go - the original driver tries to track the resolution
of the screen, for example, which it has no business knowing. Support for
the "twinview" approach to dual monitors has also been taken out. In some
cases, the removed features are things that people want; support should
eventually be restored once it can be done in the right way. Sometimes,
Peter said, things have to get worse before they can get better.
Also gone is the wacomcpl
configuration tool. It is, Peter said, some of the worst code that he has ever seen.
Peter did this talk to update the graphics community on the state of
support for this driver, but he was also looking for input. His attitude
toward development was described as "if it doesn't crash the server,
it works." In other words, he is not a graphic artist, so he has no
deep understanding of how this hardware is used. To get that
understanding, he needs input from
the user community regarding development priorities and what does not work
as well as it should.
So artists making use of Wacom tablets should make sure that their needs
are known; the developer in charge of the driver is ready to listen.
Meanwhile, bringing a more open development process to the driver has
increased the pace of development and is improving the quality of the
code. If the usual pattern holds, before long Linux should have support
for these tablets which is second to none.
The last LWN article on
Ftrace described trace-cmd, which is a front end tool to
interface with Ftrace. trace-cmd is all command line based, and
works well for embedded devices or tracing on a remote system. But reading
the output in a text format may be a bit overwhelming and may make it hard
see the bigger picture. To be able to understand how
processes interact, a GUI can help humans see what is happening at a global
scale. KernelShark has been written to fulfill this requirement. It is a
GUI front end to trace-cmd.
KernelShark is distributed in the same repository as trace-cmd.
To build it, just type make gui, as typing make by itself
will only build trace-cmd. These two tools have been kept separate
since a lot
of embedded devices do not have the libraries needed to build KernelShark.
A full HTML help document is included in the repository and is installed
with make install_doc. After installing the documentation, you can
access the help directly from KernelShark's Help menu.
This article is not a tutorial on using KernelShark, as everything you need to
know about the tool is kept up-to-date in the KernelShark
repository. Instead, this
article will describe a use case that KernelShark was instrumental in helping to solve.
Analyzing the real-time scheduler
Some time ago, when the push/pull algorithm of the real-time scheduler
in Linux was being developed, a decision had to be made about what to do
when a high priority process wakes up on a CPU running a lower real-time
priority process, where both processes have multiple CPU affinity, and both
can be migrated to a CPU running a non-real-time task. One would think that
the proper thing to do would be to simply wake up the high priority process
on that CPU which would cause the lower priority process to be pushed off
the running CPU. But a theory was that by doing so, we move a cache hot
real-time process onto a cache cold CPU and possibly replace it with a
cache cold process.
After some debate, the decision was made to migrate the
high priority process to the CPU running the lowest priority task (or no
task at all) and wake it there. Some time later, after the code was
incorporated into mainline, I started to question this decision even though
I was the one that fought for it. With the introduction of Ftrace, we now
have a utility to truly examine the impact that decision has made.
The decision to move the higher priority task was based on the assumption
that a task that is waking up is more likely to be cache cold
than a task that is already running. But one
must think about what would cause a high priority task to wake up in the first place.
If it is woken up periodically to do some work, then it can very well be
the case that it will be cache cold. Any task that was scheduled
in between can easily push out the cache of this high priority task.
But what if the high priority task was blocked on a mutex? If the
task was blocked on a mutex and another RT task was scheduled in its place
then when the high priority task wakes up again, there is a good chance that
the task will be cache hot.
A mutex in most real-time programs will usually be held for a short
period of time. The PREEMPT_RT patch, which this code was developed from,
converts spinlocks into mutexes, and those mutexes are held for very small
code segments, as all spinlocks should be. Migrating a task simply because
it blocked on a mutex increases the impact these locks have on the
throughput. Why punish the high priority task even more because it blocked
and had to wait for another task to run?
Before making any decision to change the code, I needed a test
case showing that migrating a high priority task, instead of letting it preempt
the lower priority task, causes the high priority task to ping-pong around
the CPUs when there is lock contention. A high priority task should not be
punished (migrated) if it simply encounters lock contention with lower priority real-time
tasks. It would also be helpful to know how changing this decision affects the total
number of migrations for all the tasks under lock contention.
Having a 4 processor box to play with,
I started writing a test case that would possibly cause this scenario, and
use Ftrace to analyze the result. The first test case to try was to
create five threads (one more than CPUs) and four pthread mutex locks.
All threads wake up from a barrier wait and then loop 50 times,
grabbing each lock in sequence and doing a small busy loop. This
test is called migrate.c.
The test application uses trace_marker as explained in
previous articles to write what is happening
inside the application to synchronize with kernel tracing.
Running the following with trace-cmd:
# trace-cmd record -e 'sched_wakeup*' -e sched_switch -e 'sched_migrate*' migrate
Like trace-cmd report, KernelShark will, by default, read
the file trace.dat. You can specify another file by naming it as
the first argument to KernelShark. While the KernelShark display images may be
difficult to read fully in the article, clicking any of them will bring
up a full-resolution version.
Since all tasks have been recorded, even trace-cmd itself, we want to
filter out any tasks that we do not care about.
Selecting Filter->Tasks from the KernelShark menu, and then
choosing only the migrate threads will remove the extraneous events.
Note that events that involve two tasks, like sched_switch
or sched_wakeup, will not be filtered out if one of the tasks
should be displayed.
In the default graph view, each on-line CPU is represented by a plot line.
Each task is represented by a different color. The color is determined by running the
process ID through a hash function and then parsing that number into a
color.
- The purple colored bar represents thread 4, the highest priority task.
- The orange(ish) colored bar represents thread 3.
- The turquoise colored bar represents thread 2.
- The brown colored bar represents thread 1.
- The light blue colored bar represents thread 0, the lowest priority task.
The lines sticking out of the top of the bars represent
events that appear in the list below the graph.
By examining the graph we can see that the test case was quite naive.
The lowest priority task, thread 0, never got to run until the other
four tasks were finished. This makes sense as the machine only had four
CPUs and there were four higher priority tasks running. The four running
tasks were running in lock step, taking the locks in sequence. From this
view it looks like the tasks went out of sequence, but if we zoom in to
where the migrations happened, we see something different.
To zoom into the graph, press and hold the left mouse button. A line will appear, then
drag the mouse to the right. As the mouse moves off the line, another line will
appear that follows the mouse. When you let go of the mouse button, the view will
zoom in making the locations of the two lines the width of the new window.
Repeating the above procedure, we can get down to the details of the
migration of thread 3.
Double clicking on the graph brings the list view to the event that was
clicked on. A green line appears at the location that was clicked.
On CPU 0, thread 3 was preempted by the watchdog/0 kernel thread.
Because we filtered out all threads but the migration test tasks, we see a
small blank on the CPU 0 line. This would have been filled in with a colored
bar representing the watchdog/0 thread if the filters were not enabled.
The watchdog/0 thread runs at priority 99, which we can see
from the sched_switch event as the priority of the tasks is between
the two colons. The priority shown is the kernel's view
of priority, which is the inverse of what user space uses (user-space
priority 99 is kernel priority zero).
When the watchdog/0 thread preempted thread 3, the push/pull algorithm
of the scheduler pushed it off to CPU 3, which had the lowest priority running
task. Zooming into the other migrations that happened on the other CPUs shows
that the watchdog kernel threads were responsible for them as well. Were it not for the
watchdog kernel threads, this test would not have had any migrations.
Test two, second failure
The first test took the naive approach of just setting up four locks
and having the tasks grab them in order. But this just kept the tasks in
sync. The next approach is to try to
mix things up a little more. The concern about the real-time scheduler is
how it affects the highest priority task. The next test creates the four
locks again (as there are four CPUs) and five tasks each of increasing
priority. This time, only the highest priority task grabs all the locks in
sequence. The other four tasks will each grab a single lock, so each lock will
be contended by one low priority task and the highest priority task. To try
to force contention, pthread barriers are used. For those unfamiliar with
pthread barriers, they are synchronization methods to serialize threads.
A barrier is first initialized with a number and all threads that hit the
barrier will block until that number of threads have hit the barrier,
then all the threads are released.
This test case creates two barriers for each lock (lock_wait and
lock_go) each initialized with the number 2, for the two tasks (the unique low
priority task and the high priority task) that will take the lock.
The low priority task will take the lock and wait on a barrier (lock_wait). The
high priority task will hit that barrier before it takes the corresponding lock.
Because the low priority task is already waiting on the barrier, the high
priority task will trigger the barrier to release both tasks because the barrier
has a task limit of two. The high priority task will most likely try to take
the mutex while the low priority task already has it. The low priority task will
release the mutex and then wait on the other barrier (lock_go) letting the
high priority task take the mutex.
Running this test under trace-cmd yields the following from KernelShark
after filtering out all but the migrate test tasks.
As the colors of the tasks are determined by their process IDs, this run
has the following:
- The initial green bar is the main thread that is
setting up all the locks and barriers.
- The light purple bar is the lowest priority thread 0.
- The red bar is the next-higher priority thread 1.
- The yellow(ish) bar is the next-higher priority thread 2.
- The blue bar is the next-higher priority thread 3.
- The light turquoise bar is the highest priority thread 4.
Looking at the graph it seems that the highest priority thread stayed on the
same CPU, and was not affected by the contention. Considering that the scheduler
is set to migrate a waking real-time task if it is woken on a CPU running
another real-time task, regardless of the priorities, one would think the
high priority task would have migrated a bit more. Zooming in on the graph
brings to light a bit more detail about what is occurring.
What we can see from the graph, and from the list, is that the high priority
thread did have contention on the lock. But because all threads are waiting for
the high priority process to come around to its lock, the other threads are sleeping
when the high priority process wakes up. The high priority process is only contending
with a single thread at a time. Threads 0 and 2 share CPU 2 without issue, while
threads 1 and 3 each still have a CPU for themselves.
The test to force migration
The second test was on the right track. It was able to produce contention
but failed to have the CPUs busy enough to cause the highest priority task
to wake up on a CPU running another real-time task. What is needed is to have
more tasks. The final test adds twice as many running threads as there are
CPUs. This test goes back to all tasks grabbing
all locks in sequence. To prevent the synchronization that has happened
before, each thread will hold a lock a different amount of time.
The higher the priority of a thread, the shorter time it will hold the lock.
Not only that, but the threads will now sleep after they release a lock.
The higher the priority of a task, the longer it will sleep:
lock_held = 1 ms * ((nr_threads - thread_id) + 1)
sleep_time = 1 ms * thread_id
The lowest priority thread will never sleep and it will hold the lock for
the longest time. To make things even more interesting, the mutexes have
been given the PTHREAD_PRIO_INHERIT
attribute. When a higher priority
thread blocks on a mutex held by a lower priority thread, the lower priority
thread will inherit the priority of the thread it blocks.
The test records the number of times each task
voluntarily schedules, the number of times it is preempted, the number
of times it migrates, and the number of times it successfully acquired
all locks. When the test finishes, it gives an output of these for
each thread. The higher the task number the higher the priority of
the thread it represents.
Task     vol   nonvol   migrated   iterations
0         43     3007       1571          108
1        621     1334       1247          108
2        777      769       1072          108
3        775       17        701          108
4        783       50        699          108
5        788        2        610          109
6        801       89        680          109
7        813        0        693          110
Total   5401     5268       7273          868
Running this test under trace-cmd and viewing it with KernelShark
yields a graph with lots of pretty colors, which means we likely succeeded
in our goal.
To prove that the highest priority thread did indeed migrate, we can plot
the thread itself.
Using the "Plots" menu and choosing "Tasks" brings up the same type of dialog as the task filter that was described earlier.
I selected the highest priority thread (migrate-2158), and zoomed
in to get a better view. The colors on a task plot are determined by the CPU number it
was running on. When a task migrates, the color of the plot changes.
This test now demonstrates how a high priority task can migrate substantially
when other RT tasks are running on the system. Changes to the real-time
scheduler can now be tested.
The commit being tested changes the decision on which thread
migrates when a real-time task wakes up on a CPU running another
real-time task. The original way was to always move the task that is
waking up if there is a CPU available that is running a task that is
lower in priority than both tasks. Instead, the commit changes this to
just wake up the real-time task on its CPU if it is higher priority
than the real-time task that is currently running.
The migrate test now shows:
Task     vol   nonvol   migrated   iterations
0         52     2923       2268          108
1        569     1529       1457          109
2        801     1961       2194          109
3        808      789       1274          109
4        810       61        155          109
5        813       10         57          109
6        827       35         81          110
7        824        0          4          110
Total   5504     7308       7490          873
The total number of migrations has stayed around the same (several runs
will yield a fluctuation of a few hundred), but the number of migrations for the
highest priority task has dropped substantially, as it will not migrate
simply because it woke up on a CPU running another real-time task. Note
that the only reason the highest priority task migrated at all is that it woke up
on a CPU that was running the task that owned the mutex it was blocked on.
As these are priority inheritance mutexes, the owner would have the same priority
as the highest priority process that it is blocking. The wake up will not preempt
a real-time task of equal priority. Perhaps that can be the
next change to the real-time scheduler: have the wake up be aware of priority
inheritance.
The highest priority thread (migrate-21412) was woken on
CPU 3, which was running thread 1 (migrate-21406) which is the task
that thread 7 originally blocked on. CPU 2 happened
to be running thread 0 (migrate-21405) which was the lowest priority
thread running at the time.
Note that the empty green box at the start of the task plot represents the
time between when a task was woken and the time it was actually scheduled in.
Using KernelShark allowed me to analyze each of my tests to see if
they were doing what I expected them to do. The final test was able to force
a common scenario where a high priority process is woken on a CPU running
another real-time task, forcing the decision to be made whether to migrate
the waking task or not. This test allowed me to see how changes to that
decision affected the results.
This article demonstrates a simple use case for KernelShark, but there are
a lot more features that aren't explained here. To find out more, download
KernelShark and try it out. It is still in beta and is constantly being worked on.
Soon there will be plugins that will allow it to read other file formats and
even change the way it displays the graph. All the code is available and
under the GPL, so you can add your own features as well (hint hint).
Patches and updates
Page editor: Jonathan Corbet