Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.38-rc3, released on February 1. "Nothing hugely special in here, and I'm happy to say that most of the pull requests have been nice clear bug-fixes and fixing regressions. Thanks to most of you for that." As always, see the full changelog for all the details.
Stable updates: no stable updates have been released in the last week. The 2.6.35.11 update is in the review process as of this writing; it could be released as early as February 3.
Undertaker 1.0
As anybody who has ever sat through an "allyesconfig" build - or a build using a distributor configuration - understands, there is a lot of code in the kernel. Most of the time, creating the perfect kernel is a matter of excluding code until the size becomes reasonable. So it's a rare kernel build that actually compiles a majority of the code found in the tree.

The kernel configuration mechanism makes it possible to perform this selection. Part of this mechanism is wired into the build system; it allows source files to be passed over entirely if they contain nothing of interest. The other half, though, is implemented with preprocessor symbols and conditional compilation. Kernel developers may be discouraged from using #ifdef, but there are still a lot of conditional blocks in the code.
Sometimes, the logic which leads to the inclusion or exclusion of a specific block is complex and not at all clear. There are many configuration options in the kernel, and they can depend on each other in complicated ways. As a result, dead code - code which will not be compiled regardless of the selected configuration - may escape notice for years. Dead code adds noise to the source tree and, since nobody ever runs it, it is more than likely to contain bugs. If that code is re-enabled or copied, those bugs could spread through the tree in surprising ways.
So it would be good to be able to identify dead code and get it out of the tree. The newly released undertaker tool was designed to perform several types of static analysis, including dead code identification. Developers can run it on their own to find dead blocks in specific files; there is also a web interface which allows anybody to browse through the tree and find the dead sections. That should lead to patches hauling away the bodies and cleaning up the tree, which is a good thing.
Kconfiglib
The kernel configuration system is a complex bit of code in its own right; many people who have no trouble hacking on kernel code find reasons to avoid going into the configuration subsystem. There is value in being able to work with the complicated data structure that is a kernel configuration, though. Ulf Magnusson has recently posted a library, Kconfiglib, which, he hopes, will make that easier.

Kconfiglib is a Python library which is able to load, analyze, and output kernel configurations; care has been taken to ensure that any configuration it creates is identical to what comes out of the existing kernel configuration system. With Kconfiglib, it becomes straightforward to write simple tools like "allnoconfig"; it also is possible to ask questions about a given configuration. One possible tool, for example, would answer the "why can't I select CONFIG_FOO" question - a useful feature indeed.
There are currently no Python dependencies in the kernel build system; trying to add one could well run into opposition. But Kconfiglib could find a role in the creation of ancillary tools which are not required to configure and build a kernel as it's always been done. For the curious, there's a set of examples available.
Kernel development news
Using the perf code to create a RAS daemon
Monitoring a system for "reliability, availability, and serviceability" (RAS) is an important part of keeping that system, or a cluster of such computers, up and running. There is a wide variety of things that could be monitored for RAS purposes—memory errors, CPU temperature, RAID and filesystem health, and so on—but Borislav Petkov's RAS daemon is just targeted at gathering information on any machine check exceptions (MCEs) that occur. The daemon uses trace events and the perf infrastructure, which required a fair amount of restructuring of that code to make it available not only to the RAS daemon, but also to other kinds of tools.
The first step is to create persistent perf events, which are events that are always enabled, and will have event buffers allocated, even if there is no process currently looking at the data. That allows the MCE trace event to be enabled at boot time, before there is any task monitoring the perf buffer. Once the boot has completed, the RAS daemon (or some other program) can mmap() the event buffers and start monitoring the event. This will allow the RAS daemon to pick up any MCE events that happened during the boot process.
To do that, the struct perf_event_attr gets a new persistent bitfield that is used to determine whether or not to destroy the event buffers when they are unmapped. In addition, persistent events can be shared by multiple monitoring programs because they can be mapped as shared and read-only. Once the persistent events are added, the next patch then changes the MCE event to become a persistent event.
With the stage set, Petkov then starts to rearrange the perf code so that the RAS daemon and other tools can access code that is currently buried in the tools/perf directory. That includes the trace event utilities, which move from tools/perf/util to tools/lib/trace, and some helper functions for debugfs, which move to tools/lib/lk. These were things that were needed when creating the RAS daemon, but were not easily accessible there.
A similar patch moves the mmap() helper functions from the tools/perf directory to another new library: tools/lib/perf. These functions handle things like reading the head of the event buffer queue, writing at the tail of the queue, and reading and summing all of the per-cpu event counters for a given event.
In response to the patch moving the mmap() helpers, Arnaldo Carvalho de Melo pointed out that he had already done some work to rework that code, and that it would reduce the size of Petkov's patch set once it gets merged into the -tip tree. He also noted that he had created a set of Python bindings and a simple perf-event-consuming twatch daemon using those bindings. While Petkov had some reasons for writing the RAS daemon in C rather than Python, mostly so that it would work on systems without Python or with outdated versions, he did seem impressed: "twatch looks almost like a child's play and even my grandma can profile her system now :)."
But the Python bindings aren't necessarily meant for production code, as Carvalho de Melo describes. Because the Python bindings are quite similar to their C counterparts, they can be used to ensure that the kernel interfaces are right:
Moving to a C version then becomes easy after the testing phase is over and the kernel bits are set in stone.
There are some additional patches that move things around within the tools tree before the final patch actually adds the RAS daemon. The daemon is fairly straightforward, with the bulk of it being boilerplate daemon-izing code. The rest parses the MCE event format (from the mce/mce_record/format file in debugfs), then opens and maps the debugfs mce/mce_recordN files (where N is the CPU number). The main program sits in a loop checking for MCE events every 30 seconds, printing the CPU, MCE status, and address to a log file for any events that have occurred. Petkov mentions decoding of the MCE status as something that he is currently working on.
Obviously, the RAS daemon itself is not the end result Petkov is aiming for. Rather, it is just a proof-of-concept for persistent events and demonstrates one way to rearrange the perf code so that other tools can use it. There may be disagreements about the way the libraries were arranged, or the specific location of various helpers, but the overall goal seems like a good one. Whether tools like ras actually end up in the kernel tree is, perhaps, questionable—the kernel hackers may not want to maintain a bunch of tools of this kind—but by making the utility code more accessible, it will be much easier for others to build these tools on their own.
LCA: Rationalizing the wacom driver
Wacom tablets are often the tool of choice for those who need accurate and flexible input devices; they seem to be especially favored by artists. Like a mouse, these tablets can report position and movement, but they can also present multiple input devices to the system (one for each of several different types of pens, for example) and report variables like pen angle, pressure, and more. Support in Linux for these devices has not been as good as one might like, but, as Peter Hutterer described in his talk at the linux.conf.au Libre Graphics Day miniconf, it is getting better quickly. How that came to be is a classic example of how to (or how not to) manage kernel driver development.

Peter is the maintainer for the bulk of the graphical input drivers. He has, he says, rewritten most of that subsystem, so he is to blame for the bugs which can be found there. Most input devices are easily handled through the evdev abstraction, but the Wacom driver is an exception. The things which are unique to these tablets (multiple input "devices," one associated with each pen, the pressure, tilt, and rotation axes, and the relatively high resolution) require a separate driver for their support. Thus, Wacom users must have the linuxwacom driver in their systems.
There is some confusion about the linuxwacom driver, because there are
multiple versions of it, all of which can be found on SourceForge. One version
(0.8.8) is created by Wacom itself; it is a classic vendor driver, Peter
said, with everything that usually implies about the development process
(code dumps) and the quality of the code itself. This driver ships as a
tarball containing a wild set of permutations of kernel and X.org versions;
it's a mess. But it's Wacom's mess, and the company has been resistant to
efforts to clean it up.
Peter got fed up with this situation in 2009 and forked the driver. His version is now the default driver in a number of distributions, and is the only one which supports newer versions of the X server. Looking at the repositories, Peter found 78 commits total before the fork, all from Wacom. After the fork, there are 788 commits, 65% from Red Hat, and 12% from Wacom. Extracting the driver from its vendor-dominated situation has definitely helped to increase its rate of development.
Surprisingly, the original vendor driver is still under development by Wacom, despite the fact that it does not support current X servers and is not shipped by any distributors. The original mailing list is still in business, but, Peter warned, one should not ask questions about the new driver there. Kernel development, he said, should be done on the linux-kernel mailing list. There is also little point in talking to him about problems with the older driver; Wacom insists on keeping control over that code.
Update: Peter tells us that there are three mailing lists (linuxwacom-announce, linuxwacom-discuss and linuxwacom-devel) which are still the place to go for general questions, including hardware-specific questions. X driver development for the forked driver happens exclusively on linuxwacom-devel and all patches are sent there. So the mailing lists are definitely the place to ask questions, at least in regards to the X driver. The kernel driver is the exception here. Kernel driver development should happen on LKML, not on linuxwacom lists.
Much of the work Peter has done so far has been toward the goal of cleaning up the driver. That has involved throwing out a number of features. Some of those needed to go - the original driver tries to track the resolution of the screen, for example, which it has no business knowing. Support for the "twinview" approach to dual monitors has also been taken out. In some cases, the removed features are things that people want; support should eventually be restored once it can be done in the right way. Sometimes, Peter said, things have to get worse before they can get better.
Also gone is the wacomcpl configuration tool. It is, Peter said, some of the worst code that he has ever seen.
Peter did this talk to update the graphics community on the state of support for this driver, but he was also looking for input. His attitude toward development was described as "if it doesn't crash the server, it works". In other words, he is not a graphic artist, so he has no deep understanding of how this hardware is used. To get that understanding, he needs input from the user community regarding development priorities and what does not work as well as it should.
So artists making use of Wacom tablets should make sure that their needs are known; the developer in charge of the driver is ready to listen. Meanwhile, bringing a more open development process to the driver has increased the pace of development and is improving the quality of the code. If the usual pattern holds, before long Linux should have support for these tablets which is second to none.
Using KernelShark to analyze the real-time scheduler
The last LWN article on Ftrace described trace-cmd, a command-line front end to Ftrace that works well for embedded devices or for tracing on a remote system. But reading its output in text form can be overwhelming and makes it hard to see the bigger picture; a GUI makes it much easier to see, at a global scale, how processes interact. KernelShark was written to fill this need: it is a GUI front end to trace-cmd.
KernelShark is distributed in the same repository as trace-cmd:

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

To build it, just type make gui; typing make alone will only build trace-cmd. The two tools are kept separate because many embedded devices lack the libraries needed to build KernelShark. Full HTML documentation is included in the repository and is installed with make install_doc; once installed, it can be accessed from KernelShark's "Help" menu.
This article is not a tutorial on using KernelShark, as everything you need to know about the tool is kept up-to-date in the KernelShark repository. Instead, this article will describe a use case that KernelShark was instrumental in helping to solve.
Analyzing the real-time scheduler
Some time ago, when the push/pull algorithm of the real-time scheduler in Linux was being developed, a decision had to be made about what to do when a high priority process wakes up on a CPU running a lower real-time priority process, where both processes have multiple CPU affinity, and both can be migrated to a CPU running a non-real-time task. One would think that the proper thing to do would be to simply wake up the high priority process on that CPU which would cause the lower priority process to be pushed off the running CPU. But a theory was that by doing so, we move a cache hot real-time process onto a cache cold CPU and possibly replace it with a cache cold process.
After some debate, the decision was made to migrate the high priority process to the CPU running the lowest priority task (or no task at all) and wake it there. Some time later, after the code was incorporated into mainline, I started to question this decision even though I was the one that fought for it. With the introduction of Ftrace, we now have a utility to truly examine the impact that decision has made.
The decision to move the higher priority task was based on the assumption that a task that is waking up is more likely to be cache cold than a task that is already running. Thinking more about this case, though, one must consider what would cause a high priority task to wake up in the first place. If it is woken periodically to do some work, then it may well be cache cold; any task scheduled in the meantime can easily push out its cache contents. But what if the high priority task was blocked on a mutex? If it blocked on a mutex and another RT task was scheduled in its place, then, when the high priority task wakes up again, there is a good chance that it will be cache hot.
A mutex in most real-time programs will usually be held for a short period of time. The PREEMPT_RT patch, which this code was developed from, converts spinlocks into mutexes, and those mutexes are held for very small code segments, as all spinlocks should be. Migrating a task simply because it blocked on a mutex increases the impact these locks have on the throughput. Why punish the high priority task even more because it blocked and had to wait for another task to run?
Before making any decision to change the code, I needed to have a test case that can show that the moving of a high priority task instead of preempting the lower priority task will cause the high priority task to ping pong around the CPUs when there is lock contention. A high priority task should not be punished (migrated) if it simply encounters lock contention with lower priority real-time tasks. It would also be helpful to know how changing this decision affects the total number of migrations for all the tasks under lock contention.
First try
Having a four-processor box to play with, I started writing a test case that might create this scenario, using Ftrace to analyze the result. The first attempt created five threads (one more than the number of CPUs) and four pthread mutexes. All threads wake up from a barrier wait, then loop 50 times, grabbing each lock in sequence and running a small busy loop. This test is called migrate.c.
The test application uses trace_marker as explained in previous articles to write what is happening inside the application to synchronize with kernel tracing.
Running the following with trace-cmd:
    # trace-cmd record -e 'sched_wakeup*' -e sched_switch -e 'sched_migrate*' migrate
    # kernelshark
![KernelShark](https://static.lwn.net/images/2011/ks-fail1-open-sm.png)
Since all tasks have been recorded, even trace-cmd itself, we want to filter out any tasks that we do not care about. Selecting Filter->Tasks from the KernelShark menu, and then choosing only the migrate threads will remove the extraneous tasks. Note that events that involve two tasks, like sched_switch or sched_wakeup, will not be filtered out if one of the tasks should be displayed.
![KernelShark post-filtering](https://static.lwn.net/images/2011/ks-fail1-filtered-sm.png)
In the default graph view, each online CPU is represented by a plot line, and each task by a different color. The color is determined by running the process ID through a hash function and converting the result into an RGB value.
- The purple colored bar represents thread 4, the highest priority task.
- The orange(ish) colored bar represents thread 3.
- The turquoise colored bar represents thread 2.
- The brown colored bar represents thread 1.
- The light blue colored bar represents thread 0, the lowest priority task.
The lines sticking out of the top of the bars represent events that appear in the list below the graph.
By examining the graph we can see that the test case was quite naive. The lowest priority task, thread 0, never got to run until the other four tasks were finished. This makes sense as the machine only had four CPUs and there were four higher priority tasks running. The four running tasks were running in lock step, taking the locks in sequence. From this view it looks like the tasks went out of sequence, but if we zoom in to where the migrations happened, we see something different.
![KernelShark zoom](https://static.lwn.net/images/2011/ks-fail1-zoom-in-sm.png)
To zoom in on the graph, press and hold the left mouse button; a line will appear. Drag the mouse to the right, and a second line will follow the pointer. When the mouse button is released, the view zooms in, making the region between the two lines the width of the new window.
Repeating the above procedure, we can get down to the details of the migration of thread 3. Double-clicking on the graph brings the list view to the event that was clicked on; a green line appears at the location that was clicked.
![KernelShark event selection](https://static.lwn.net/images/2011/ks-fail1-select-event-sm.png)
On CPU 0, thread 3 was preempted by the watchdog/0 kernel thread. Because we filtered out all threads but the migration test tasks, we see a small blank on the CPU 0 line; it would have been filled in with a colored bar representing the watchdog/0 thread if the filters were not enabled. The watchdog/0 thread runs at priority 99, which can be seen in the sched_switch event, where each task's priority appears between the two colons. The priority shown is the kernel's view of priority, which is the inverse of what user space uses (user-space priority 99 is kernel priority zero).
When the watchdog/0 thread preempted thread 3, the push/pull algorithm of the scheduler pushed it off to CPU 3, which had the lowest-priority running task. Zooming in on the other migrations that happened on the other CPUs shows that the watchdog kernel threads were responsible for them as well; without the watchdog threads, this test would not have had any migrations.
Test two, second failure
The first test took the naive approach of just setting up four locks and having the tasks grab them in order, which simply kept the tasks in sync. The next approach tries to mix things up a little more. The concern about the real-time scheduler is how it affects the highest priority task, so the next test creates the four locks again (as there are four CPUs) and five tasks of increasing priority. This time, only the highest priority task grabs all the locks in sequence; each of the other four tasks grabs a single lock, so each lock is shared between one lower-priority task and the highest priority task. To try to force contention, pthread barriers are used. For those unfamiliar with them, barriers are a synchronization method for serializing threads: a barrier is initialized with a count, and all threads that hit the barrier block until that number of threads have arrived, at which point they are all released.
This test case creates two barriers for each lock (lock_wait and lock_go), each initialized with the count 2, for the two tasks (the unique low priority task and the high priority task) that will take the lock. The low priority task takes the lock and waits on a barrier (lock_wait). The high priority task hits that barrier before taking the corresponding lock; because the low priority task is already waiting there, the high priority task trips the barrier, releasing both tasks. The high priority task will then most likely try to take the mutex while the low priority task still holds it. The low priority task releases the mutex and then waits on the other barrier (lock_go), letting the high priority task take the mutex.
Running this test under trace-cmd yields the following from KernelShark after filtering out all but the migrate test tasks.
As the colors of the tasks are determined by their process IDs, this run
has the following:
- The initial green bar is the main thread that is setting up all the locks and barriers.
- The light purple bar is the lowest priority thread 0.
- The red bar is the next-higher priority thread 1.
- The yellow(ish) bar is the next-higher priority thread 2.
- The blue bar is the next-higher priority thread 3.
- The light turquoise bar is the highest priority thread 4.
Looking at the graph, it seems that the highest priority thread stayed on the same CPU and was not affected by the contention. Considering that the scheduler is set to migrate a waking real-time task if it is woken on a CPU that is running another real-time task, regardless of the priorities, one would think the high priority task would have migrated a bit more. Zooming in on the graph brings to light more detail about what is occurring.
What we can see from the graph, and from the list, is that the high priority thread did have contention on the lock. But because all threads are waiting for the high priority process to come around to its lock, the other threads are sleeping when the high priority process wakes up. The high priority process is only contending with a single thread at a time. Threads 0 and 2 share CPU 2 without issue, while threads 1 and 3 each still have a CPU for themselves.
The test to force migration
The second test was on the right track. It was able to produce a contention but failed to have the CPUs busy enough to cause the highest priority task to wake up on a CPU running another real-time task. What is needed is to have more tasks. The final test adds twice as many running threads as there are CPUs.
This test goes back to all tasks grabbing all locks in sequence. To prevent the synchronization that has happened before, each thread will hold a lock a different amount of time. The higher the priority of a thread, the shorter time it will hold the lock. Not only that, but the threads will now sleep after they release a lock. The higher the priority of a task, the longer it will sleep:
    lock_held  = 1ms * ((nr_threads - thread_id) + 1)
    sleep_time = 1ms * thread_id

The lowest priority thread will never sleep, and it will hold the lock for the longest time. To make things even more interesting, the mutexes have been given the PTHREAD_PRIO_INHERIT attribute: when a higher priority thread blocks on a mutex held by a lower priority thread, the lower priority thread will inherit the priority of the thread it is blocking.
The test records the number of times each task voluntarily schedules, the number of times it is preempted, the number of times it migrates, and the number of times it successfully acquired all locks. When the test finishes, it outputs these counts for each thread; the higher the task number, the higher the priority of the thread it represents.
    Task   vol  nonvol  migrated  iterations
       0    43    3007      1571         108
       1   621    1334      1247         108
       2   777     769      1072         108
       3   775      17       701         108
       4   783      50       699         108
       5   788       2       610         109
       6   801      89       680         109
       7   813       0       693         110
   Total  5401    5268      7273         868
Running this test under trace-cmd and viewing it with KernelShark
yields a graph with lots of pretty colors, which means we likely succeeded
in our goal.
To prove that the highest priority thread did indeed migrate, we can plot
the thread itself.
Using the "Plots" menu and choosing "Tasks" brings up the same type of dialog as the task filter described earlier. I selected the highest priority thread (migrate-2158) and zoomed in to get a better view. The colors on a task plot are determined by the CPU the task was running on; when a task migrates, the color of the plot changes.
![KernelShark task plot](https://static.lwn.net/images/2011/ks-success-plot-task-sm.png)
This test now demonstrates how a high priority task can migrate substantially when other RT tasks are running on the system; changes to the real-time scheduler can now be tested. The commit changes the decision about which thread migrates when a real-time task wakes up on a CPU running another real-time task. The original approach was to always move the waking task if there is a CPU available that is running a task lower in priority than both tasks. The commit changes this to simply wake the real-time task on its own CPU if it is higher in priority than the real-time task that is currently running there.
The migrate test now shows:
    Task    vol  nonvol  migrated  iterations
      0:     52    2923      2268         108
      1:    569    1529      1457         109
      2:    801    1961      2194         109
      3:    808     789      1274         109
      4:    810      61       155         109
      5:    813      10        57         109
      6:    827      35        81         110
      7:    824       0         4         110
  total:   5504    7308      7490         873
The total number of migrations has stayed about the same (several runs will show a fluctuation of a few hundred), but the number of migrations for the highest priority task has dropped substantially, as it no longer migrates simply because it woke up on a CPU running another real-time task. Note that the highest priority task migrated at all only because it woke up on a CPU that was running the task owning the mutex it was blocked on. Since these are priority-inheritance mutexes, the owner has the same priority as the highest priority process that it is blocking, and the wakeup will not preempt a real-time task of equal priority. Perhaps that can be the next change to the real-time scheduler: make the wakeup code aware of priority-inheritance mutexes.
![KernelShark after change](https://static.lwn.net/images/2011/ks-success-zoom-task-sm.png)
The highest priority thread (migrate-21412) was woken on CPU 3, which was running thread 1 (migrate-21406), the task that thread 7 was originally blocked on. CPU 2 happened to be running thread 0 (migrate-21405), which was the lowest priority thread running at the time. Note that the empty green box at the start of the task plot represents the time between when the task was woken and when it was actually scheduled in.
Using KernelShark allowed me to analyze each of my tests to see if they were doing what I expected them to do. The final test was able to force a common scenario where a high priority process is woken on a CPU running another real-time task, and cause the decision to be made, whether to migrate the waking task or not. This test allowed me to see how the changes to that decision affected the results.
This article demonstrates a simple use case for KernelShark, but there are a lot more features that aren't explained here. To find out more, download KernelShark and try it out. It is still in beta and is constantly being worked on. Soon there will be plugins that will allow it to read other file formats and even change the way it displays the graph. All the code is available and under the GPL, so you can add your own features as well (hint hint).