Kernel development
Brief items
Kernel release status
The current development kernel is 3.5-rc5, released on June 30. Linus says: "So nothing really worrisome in here. Despite the networking merge (which tends to be fairly big), -rc5 is a smaller patch than -rc4 was, even if there are a couple more commits in there. So things seem to be going in the right direction."
Stable updates: no stable updates have been released in the last week. The 3.2.22 update is in the review process as of this writing; it can be expected at any time.
Kernel patchwork returns
Kernel.org administrator John Hawley has announced that the kernel patchwork system is finally back on the air. "All the old user accounts still exist, though it is *HIGHLY* recommended that once you log in you change your password."
A UEFI secure boot and TianoCore info page
James Bottomley has distilled his hard-earned knowledge of how to set up UEFI secure boot with QEMU and the TianoCore system and placed it into a web page. It has a lot of information for anybody needing to work in this area. "Intel has produced a project called TianoCore as an open firmware reference implementation of UEFI. One of the sub projects within TianoCore is OVMF which stands for Open Virtual Machine Firmware. It is OVMF that we are using to produce the virtual machine image for qemu that will run the UEFI secure boot environment. TianoCore secure boot is only really working as of version r13466 of the svn repository. This version has not yet been released as a downloadable zip file."
Kernel development news
Missing the AF_BUS
The D-Bus interprocess communication mechanism has, over the years, become a standard component of the Linux desktop. For almost as long, developers have been trying to find ways to make D-Bus faster. The latest attempt comes in the form of a kernel patch set adding a new socket address family (called AF_BUS) to the networking layer. Significant performance improvements are claimed, but, like previous attempts, this one may have a hard time getting into the mainline kernel.

D-Bus implements a mechanism by which processes can send messages to each other. Multicast functionality is inherently a part of the protocol; one message can be sent to multiple recipients. D-Bus promises reliable delivery, where "reliable" means that messages arrive in the order in which they were sent and multicast messages will either be delivered to all recipients or, if that is not possible, to none. There is a security model built into the protocol whereby messages can be limited to specific recipients. All of these features are used by contemporary systems, which expect the message bus to be robust and secure, with as little latency and overhead as possible.
The current D-Bus implementation uses Unix-domain sockets and a central routing daemon. It works, but the routing daemon adds context switches, overhead, and latency to each message it handles. The kernel is unable to help get high-priority messages delivered first, so all messages cause wakeups that slow down the processing of the most important ones; see this message for a description of how these problems can affect a running system. It has been evident for some time to the developers involved that a better solution must be found.
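To make the cost concrete, here is a minimal sketch of emitting a multicast D-Bus signal with libdbus; the object path, interface, and signal name are illustrative, not from any real service. Every message shown here takes a round trip through the routing daemon, which must wake up and forward a copy to each matching recipient:

    /* Minimal libdbus sketch: emit a broadcast signal. Names are
     * illustrative. Every message is routed through dbus-daemon. */
    #include <stdio.h>
    #include <dbus/dbus.h>

    int main(void)
    {
            DBusError err;
            DBusConnection *conn;
            DBusMessage *msg;

            dbus_error_init(&err);
            conn = dbus_bus_get(DBUS_BUS_SESSION, &err);
            if (conn == NULL) {
                    fprintf(stderr, "connect failed: %s\n", err.message);
                    return 1;
            }

            /* A signal is inherently multicast: the daemon delivers it
             * to every client whose match rules select it. */
            msg = dbus_message_new_signal("/org/example/Object",
                                          "org.example.Interface", "Ping");
            dbus_connection_send(conn, msg, NULL);
            dbus_connection_flush(conn);
            dbus_message_unref(msg);
            return 0;
    }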
There have been a number of attempts in that direction. The previous time this topic came up, it was around a set of patches adding multicast capabilities to Unix-domain sockets. This idea was rejected with the claim that the Unix-domain socket code is already too complicated and there was not enough justification to make things worse by adding multicast capabilities. The D-Bus developers were told to simply use IPv4 sockets, which already have multicast support, instead.
What those developers actually did was to implement AF_BUS, a new address family designed to meet the needs of D-Bus. It provides the reliable delivery that D-Bus requires; it also has the ability to pass file descriptors and credentials from one process to another. The security mechanism is built in, with the netfilter code (augmented with a new D-Bus message parser) used to control which messages can actually be delivered to any specific process. The end result, it is claimed, is a significant reduction in D-Bus overhead due to reduced system calls; submitter Vincent Sanders claims "a doubling in throughput and better than halving of latency." See the associated documentation for details on how this address family works.
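For illustration only, a client using such an address family might look roughly like the sketch below. AF_BUS was never merged, so the address-family constant, the sockaddr_bus layout, the path, and the use of SOCK_SEQPACKET are all assumptions based on the patch set's description, not a real kernel interface:

    /* Hypothetical sketch: AF_BUS is not in the mainline kernel, and
     * the definitions below are assumptions, not a real API. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    #define AF_BUS 40                       /* assumed value */

    struct sockaddr_bus {                   /* assumed layout */
            sa_family_t     sbus_family;    /* AF_BUS */
            char            sbus_path[108]; /* bus pathname, as with AF_UNIX */
    };

    int main(void)
    {
            struct sockaddr_bus addr = { .sbus_family = AF_BUS };
            /* SOCK_SEQPACKET would provide the ordered, reliable
             * datagram semantics that D-Bus requires. */
            int fd = socket(AF_BUS, SOCK_SEQPACKET, 0);

            if (fd < 0) {
                    perror("socket");       /* expected on mainline kernels */
                    return 1;
            }
            strncpy(addr.sbus_path, "/var/run/dbus/system_bus",
                    sizeof(addr.sbus_path) - 1);
            connect(fd, (struct sockaddr *)&addr, sizeof(addr));
            return 0;
    }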
A factor-of-two improvement in a component that is widely used in Linux systems would certainly be welcome. The patch set, however, was not; networking maintainer David Miller immediately stated his intention to simply ignore the patch set entirely. His objections seem to be that IPv4 sockets are sufficient for the task and that reliable delivery of multicast messages cannot be done, even in the limited manner needed by D-Bus. He expressed doubts that the IPv4 approach had even been tried, and decreed: "We are not creating a full address family in the kernel which exists for one, and only one, specific and difficult user."
Vincent responded that a number of approaches have been tried and found wanting. IPv4 sockets cannot provide the needed delivery guarantees and do not allow for the passing of file descriptors and credentials. It is also important, he said, for D-Bus to be up and running before the networking subsystem has been configured; setting up IP interfaces on a contemporary system often requires communication over D-Bus. There really is no better solution, he said.
He found support from a few other developers, including Alan Cox, who pointed out that there is no shortage of interprocess communication systems out there with requirements similar to D-Bus:
Everybody at the application level has been using these 'receiver reliable' multicast services for years (Websphere MQ, TIBCO, RTPGM, OpenPGM, MS-PGM, you name it). There are even accelerators for PGM based protocols in things like Cisco routers and Solarflare can do much of it on the card for 10Gbit.
He added that latency concerns are paramount on contemporary systems and that one of the best ways of reducing latency is to cut back on context switches and middleman processes. Chris Friesen added that his company uses "an out-of-tree datagram multicast messaging protocol family based on AF_UNIX" that could almost certainly be replaced by something like AF_BUS, were AF_BUS to be added to the mainline kernel.
There have been various other local messaging patch sets posted over the years. So it seems clear that there is a significant level of interest in having this sort of capability built into the Linux kernel. But interest alone is not sufficient justification for the merging of a large patch set; there must also be agreement from the developers who are charged with ensuring that Linux has a top-quality networking stack in the long term. That agreement is not yet there, so there may be a significant amount of multicast interpersonal messaging required before we have multicast interprocess messaging in the kernel.
Better documentation: the window of naive interest
Sometimes a casual comment can capture your imagination and not let go until you do something with it. So it was for me with a comment made by Heikki Orsila on some observations that Greg Kroah-Hartman made about documentation in the Linux kernel tree:
The specific documentation that he or Greg was thinking of may well be very different from the specific documentation that my thoughts turned to, but as both a producer and a consumer of some parts of linux/Documentation I can at least agree that some of it isn't very good.
Heikki continued: "Linux is badly documented, but so what?" and often that would have been the end of it. But we are an open development community where putting up with mediocrity is neither necessary nor encouraged. If things are broken then there is always the possibility of fixing them, if only we know how. How, then, can we fix the documentation?
"Documentation" is a broad category and I would like to start by narrowing our focus a little and excluding reference documentation from consideration. By this I mean documents used by a person knowledgeable in the subject who needs to clarify some detail such as the arguments to some function or the required ordering between two locks. For these details the source code is by far the best resource - as it cannot get out-of-date - and, when the code itself is not sufficient, placing the documentation in the source code will provide the greatest likelihood of it being found, read, and kept up-to-date. It doesn't really belong in a separate Documentation directory.
The class of documentation that is of interest is documentation for the new developer, not necessarily new to development but new to a particular project or subsystem. Such a developer combines a lack of knowledge with a genuine interest and this is a combination that is not stable: if one component does not disappear soon, the other is likely to. The task of good documentation is to ensure the lack of knowledge disappears before the interest.
I was exposed to this instability when trying to understand some details of power management in Linux. The documentation simply didn't help and I had to look elsewhere. However, when I went back to assess the documentation while preparing for this article, I discovered that it wasn't as bad as I remembered. I now had enough experience that it all made sense. The paucity of the documentation was now only in my memory and I couldn't be sure I had given that documentation a fair trial. The temptation to just move on might have won had Heikki Orsila's observation not spurred me on.
To understand what makes good documentation we need to mine the experiences from that short window of naive interest to find out what works and what doesn't. A question that seems most suited for digging is "What were you looking for that you didn't find?" I'm sure my kind reader will have their own answers to offer, but here are three that I have found on my travels.
Wire-frame outlines
When I first went to the Linux power management documentation I was after a "big picture" understanding. I wanted more detail than "this code manages power" but not quite "these are the entry points that a driver must provide". I wanted to know what the important parts were and, significantly, how they connected together and impacted each other. I picture this as a collection of key concepts together with the linkage between them. These are nodes and edges in a graph, entities and relationships, or for the more spatially oriented, vertices and edges of a wire-frame polyhedron. This gives the shape of the project without getting bogged down in details.
For me it is vital to have this framework first as I can only take in and retain new details if I have something to attach them to and a place to attach them. Without it I'll either attach new ideas to the wrong place, or forget them completely - which is probably the safer of the two.
The image of a wire-frame is a little misleading as it presents all vertices as of equal value and this is rarely the case. Some concepts are bigger and should be named and described first. Others can come later. So maybe a ball-and-stick model might be a better picture, with big and small balls, joined by thin and fat sticks.
In the case of Linux power management, one key concept that gives shape to the whole is the number of multiplicities: there are multiple sequencing states when moving away from or towards full functionality, multiple power saving approaches such as runtime, suspend, and hibernate, and many different sorts of devices that need to fit into the frameworks. Another concept, already hinted at but often recurring, is that there is generally one "fully functional" state but several "low power" states, where moving between two low power states involves returning to fully-functional and then reducing power a different way.
Why, not what.
"Swap over NFS" is a set of functionality that some people find valuable, but is not at all straightforward to implement. There is a need to avoid deadlocks in memory management, and to do so without slowing down either the networking code or the memory allocation code, both of which are quite performance sensitive. There is a set of patches which provides this functionality but getting it ready for mainline inclusion has been a slow process.
Andrew Morton was recently good enough to provide some review of these patches and, while reading the commit-log entries and code comments is a little different from seeking out more coherent documentation, it does provide a good window into the thoughts of someone who, while generally knowledgeable, is both new to the project specifics and still interested. It can thus answer the question "what were you looking for that you didn't find?".
One observation that he made repeatedly is most clearly embodied in one of his review comments or, more humorously, in another: "s/"what"/"why"/ !".
Documenting what a function does is very important in closed-source projects, but less so in open source, where the code can be directly read. Of course, if the code is long and complex, it might be easier to read some documentation; however, the effort of writing that documentation might be better spent breaking up the code and making it more readable.
Documenting why is much more valuable, whether it is "why do it this way" or "why even do this at all". The "why" of a project is rarely explicit in just one place of the code. Rather it permeates throughout and can touch various fragments in different ways. Sometimes the "why" is not technical at all but is historical, cultural, or simply subjective. In these cases it really cannot be extracted by reading the code and must be documented, or lost.
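A contrived fragment makes the difference concrete; the hardware erratum mentioned is invented purely for illustration:

    /* Contrived example contrasting "what" and "why" comments. */
    static int send_request(int fd);        /* hypothetical helper */

    static int send_with_retries(int fd)
    {
            /* A "what" comment restates the code: set retries to three. */
            int retries = 3;

            /*
             * A "why" comment preserves reasoning the code cannot express:
             * the controller drops the first request submitted after a
             * resume (invented erratum, for illustration), so one of
             * these retries is always consumed.
             */
            while (retries--)
                    if (send_request(fd) == 0)
                            return 0;
            return -1;
    }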
Were I to properly document the Linux "md" driver, for which I get occasional requests, I would need to explain its relationship with "dm" - for it isn't only internal edges of our wire-frame that are interesting, but also external edges. The "why"s here are mostly historical accident, though there would be value in observing that "md" focuses on reliability through redundancy, while "dm" focuses more on flexibility by hiding all the other restrictions imposed by storage hardware. This, I think, gives the "why" for continuing to have two separate frameworks, even if it isn't a strong technical justification.
To continue with the analogy of the wire-frame model, if the concepts and relationships provide the shape of the model, then the "Why"s provide the fabric that they give shape to. They are the substance that gives purpose and the force that gives direction. They may not always be visible, especially once we put some skin on our model, but understanding them is key to understanding the whole.
Examples, examples, examples.
One of the documents that I maintain is the set of manual pages for mdadm. I recall some years ago being challenged that there weren't enough examples in that documentation. At the time I didn't really know what to do with the challenge as, after all, there was an "Examples" section at the end of the man page and there was plenty of explanatory material from which you can deduce your own examples. Though I didn't give it much attention then, this challenge clearly stuck in my mind even to today and on reflection I now think quite differently to how I thought then. Examples matter.
For those of us who enjoy binary taxonomies, there are two sorts of reasoning processes: deductive and inductive. These are described in various ways in the literature; one particularly succinct and helpful description, from Naked Science, is that deductive reasoning works from the general toward the specific, while inductive reasoning works from specific observations toward broader generalizations.
In the context of documentation, reasoning is the process of turning the words in the document into a model in your mind. Different people appear to vary in which style of reasoning they are most comfortable with, so good documentation must attempt to play to both styles.
Documentation that plays to deductive reasoning will be filled with generalizations. This doesn't mean that it avoids details (as generalities would), but that it attempts to describe exactly - in complete generality - what each interface does, or how each concept applies, or what role each interaction plays. Such documentation can be very useful, but it can also lead to a feeling that you are drowning in detail. It can be a challenge to extract meaning and importance from such details. A lot of technical documentation seems to tend to this extreme.
Documentation that plays to inductive reasoning will be full of examples of specific cases. It may explain each case very well, but the coverage of the cases can never be perfect and it will inevitably leave out some information, typically the particular information that the reader is looking for. "How-tos" are a good example of this sort of document, with maybe the extreme case being recipe books for cooking - they are full of sample recipes with very little space dedicated to explaining what makes a good recipe. These are very good if they choose just the right example, fairly good and quite accessible if they have chosen a good variety of examples, but usually lacking when you want to get down to the nitty-gritty.
Documentation that plays to both types of reasoning will mix examples in with the generalizations, using them to embellish and explain those generalizations and as an excuse to make diversions into tangentially related topics. Examples are particularly good at highlighting contrasts, which are themselves an important part of describing key concepts and clarifying why choices are made. The various multiplicities noted for Linux power management can doubtless provide lots of contrasts, such as that between a "UART" serial driver, which must be ready to receive full-rate data whenever it is not off, and a "USB" serial driver, which only needs to be able to respond to a wake-up signal and has plenty of time to prepare itself for full data-rate messages. These would necessarily make different decisions about allowable power states.
Returning to our wire-frame model which gives shape to some substance, it hopefully is not too much of a stretch to see examples as the skin on the model. These are the bits we can directly see; they reveal the texture or taste of the whole, but only hint at the bigger picture behind them. They are nonetheless an important part of closing the gaps left by the big-picture descriptions.
A worked example?
Having all these goals for introductory documentation may be nice, but are they actually useful? Can they lead to truly "good" documentation? Clearly they are not enough by themselves, but when combined with enough knowledge and experience, with some story-telling ability and an occasional touch of humor I believe that they can. To put this to the test, I've used them as a guide to producing some introductory documentation on Linux power management. The results will be presented next week when you, dear reader, can be the judge of whether the resulting documentation is actually "good".
Leaping seconds and looping servers
As most of the net is likely to have heard by now, Linux servers displayed a notable tendency to misbehave during the leap second event at the end of the day on June 30. The problem often presented itself as abrupt and sustained load spikes on the affected machines. The bug that caused this behavior has been tracked down (thanks to a determined effort by John Stultz); a look at what happened shines an interesting light on the trickiness of dealing with time in software systems.

The earth's rotation is slowing over time; contrary to some public claims, this slowing is not caused by Republican administrations, government spending, or proprietary software. In an attempt to keep the official Coordinated Universal Time (UTC) in sync with the earth's behavior, the powers that be occasionally insert an additional second (a "leap second") into a day; 25 such seconds have been inserted since the practice began in 1972. This habit is not without its detractors, and there are constant calls for its abolition, but, for now, leap seconds are a reality that the world (and the kernel) must deal with. For the curious, the Wikipedia leap second page has more detail than almost anybody could want.
The kernel's core time is kept in a timespec structure:
    struct timespec {
            __kernel_time_t tv_sec;         /* seconds */
            long            tv_nsec;        /* nanoseconds */
    };
It is, in essence, a count of seconds since the beginning of the epoch. Unfortunately, that count is defined to not include leap seconds. So when a leap second happens, the system time must be explicitly corrected; that is done by setting the system clock back one second at the end of that leap second. The code that handles this change is quite old and works pretty much as advertised. It is the source of this message that most Linux systems should have (in some form) in their logs:
Jun 30 19:59:59 dt kernel: Clock: inserting leap second 23:59:60 UTC
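The pending insertion can also be observed from user space through the adjtimex() interface; a quick sketch using the standard return codes from adjtimex(2):

    /* Query the kernel's leap-second state with adjtimex(2). TIME_INS
     * means a second will be inserted at the end of the day; TIME_OOP
     * is returned during the leap second itself. */
    #include <stdio.h>
    #include <sys/timex.h>

    int main(void)
    {
            struct timex tx = { .modes = 0 };       /* read-only query */
            int state = adjtimex(&tx);

            switch (state) {
            case TIME_INS:  printf("leap second pending (insert)\n"); break;
            case TIME_DEL:  printf("leap second pending (delete)\n"); break;
            case TIME_OOP:  printf("leap second in progress\n"); break;
            case TIME_WAIT: printf("leap second has just occurred\n"); break;
            case TIME_OK:   printf("no leap second pending\n"); break;
            default:        printf("clock unsynchronized\n"); break;
            }
            return 0;
    }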
The kernel's high-resolution timer (hrtimer) code does not use this version of the system time, though — at least, not directly. Instead, hrtimers have a couple of internal time bases that are offset from the system time. These time bases allow the implementation of different clocks; the "realtime" clock should adjust with the time, while the "monotonic" clock must always move forward, for example. Importantly, these timer bases are CPU-specific, since realtime clocks can differ between one CPU and the next in the same system. The hrtimer offsets allow the timer subsystem to quickly turn a system time into a time value appropriate for a specific processor's realtime clock.
If the system time changes, those offsets must be adjusted accordingly. There is a function called clock_was_set() that handles this task. As long as any system time change is followed by a call to clock_was_set(), all will be well. The problem, naturally, is that the kernel failed to call clock_was_set() after the leap second adjustment, which certainly qualifies as a system time change. So the hrtimer subsystem's idea of the current time moved forward while the system time was held back for a second; hrtimers were thereafter operating one second in the future. The result of that offset is that timers started expiring one second sooner than they should have; that is not quite what the timer developers had in mind when they used the term "high resolution."
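In simplified form, the relationship looks like the illustration below; this is a sketch of the arithmetic, not the kernel's actual data structures, though the offset it models corresponds to the hrtimer code's realtime offset:

    /* Illustration only: how an hrtimer base offset relates the
     * monotonic and realtime clocks. */
    typedef long long ktime;        /* nanoseconds */

    ktime offs_real;                /* per-base offset: realtime - monotonic */

    ktime base_realtime(ktime monotonic_now)
    {
            return monotonic_now + offs_real;
    }

    /* What clock_was_set() must ensure: whenever the system time is
     * stepped (as at a leap second), recompute the offset so that
     * timers queued against the realtime base expire at the right
     * moment. Skipping this step leaves base_realtime() one second
     * in the future, which is exactly the 2012 bug. */
    void on_system_clock_step(ktime new_realtime, ktime monotonic_now)
    {
            offs_real = new_realtime - monotonic_now;
    }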
For many applications, having a timer go off one second early is not a big problem. But there are plenty of situations where timers are set for less than one second in the future; all such timers will naturally expire immediately if the timer subsystem is operating one second ahead of the system time. Many of these timers are also recurring timers; they will be re-set immediately after expiration, at which point they will immediately expire again — and so on. The resulting loop is the source of the load spikes reported by victims of this bug across the net.
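The pattern is easy to reproduce in miniature with a timerfd; the interval below is illustrative. Normally each read() blocks for 100ms; with the realtime hrtimer base running one second ahead, every expiration the timer is armed for is already in the past, so the loop spins instead:

    /* A recurring 100ms CLOCK_REALTIME timer of the sort that looped
     * under the leap-second bug. */
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/timerfd.h>

    int main(void)
    {
            int fd = timerfd_create(CLOCK_REALTIME, 0);
            struct itimerspec its = {
                    .it_interval = { .tv_sec = 0, .tv_nsec = 100000000 },
                    .it_value    = { .tv_sec = 0, .tv_nsec = 100000000 },
            };
            uint64_t expirations;

            timerfd_settime(fd, 0, &its, NULL);
            for (;;) {
                    /* Blocks ~100ms normally; returns at once if the
                     * timer subsystem is running ahead of system time. */
                    read(fd, &expirations, sizeof(expirations));
                    printf("timer fired %llu time(s)\n",
                           (unsigned long long)expirations);
            }
    }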
The fix is to call clock_was_set() in the leap second code—a call that had been removed in 2007. But it's not quite that simple. The work done by clock_was_set() must happen on every CPU, since each CPU has its own set of timer bases. That's not something that can be done in atomic context. So John's patch detects a call in atomic context and defers the work to a workqueue in that case. With this patch in place, the kernel's leap second handling should work again.
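The shape of the fix looks something like the sketch below, simplified from the actual patch, which added a clock_was_set_delayed() helper to the hrtimer code:

    /* Sketch of the deferral approach (simplified): clock_was_set()
     * must touch every CPU's timer bases and cannot run in the atomic
     * context where the leap second is applied, so the work is pushed
     * to a workqueue. */
    #include <linux/hrtimer.h>
    #include <linux/workqueue.h>

    static void clock_was_set_work(struct work_struct *work)
    {
            clock_was_set();        /* safe here: process context */
    }

    static DECLARE_WORK(hrtimer_work, clock_was_set_work);

    /* Called from the (atomic) leap-second code in place of a direct
     * clock_was_set() call. */
    void clock_was_set_delayed(void)
    {
            schedule_work(&hrtimer_work);
    }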
How could such a bug come about? Time-related code is notoriously tricky in general; bugs are common. But the situation is far worse when the code in question is almost never executed. Prior to June 30, 2012, the last leap second was at the end of 2008. That is 3½ years in which the leap second code could have been broken without anybody noticing. If the kernel had a regularly-run regression test that verified the correct functioning of hrtimers in the presence of leap second adjustments, this problem might just have been caught before it affected production systems, but nobody has made a habit of running such tests thus far.
Perhaps that will change in the future; if nothing else, distributors with support obligations are likely to run some tests ahead of the next scheduled leap second adjustment. Hopefully, that will catch any problems in this particular little piece of code, should they happen to slip in again. Beyond that, one can always hope for an end to leap seconds. The kernel could also contemplate a switch to international atomic time (TAI), which does not have leap seconds, for its internal representation. Using TAI internally has its own challenges, though, including a need to avoid changing the time representation as seen by user space—meaning that the kernel would still have to track leap seconds internally. So it seems that, one way or another, leap seconds will continue to be a source of irritation and bugs in the future.