
Kernel development

Brief items

Kernel release status

The current development kernel remains 3.6-rc1; Linus, who has been on vacation, has not made a new 3.6-rc release since August 2. There has been a slow flow of fixes into the mainline repository, though; a new release should happen in the near future.

Stable updates: 3.0.40, 3.4.8, and 3.5.1 were released on August 9; 3.2.27 came out on August 11; and 3.0.41, 3.4.9 and 3.5.2 were released on August 15. That final set included, along with the usual fixes, the recent random number generator improvements.

Comments (none posted)

Quotes of the week

I'm trying to appear to be an incompetent maintainer so that someone will offer to take over. It isn't working yet.
Neil Brown

Anyway. This was my attempt to spend a few days doing something more relaxing than secure boot, and all I ended up with was eczema and liver pain. Lesson learned, hardware vendors hate you even more than firmware vendors do.
Matthew Garrett (thanks to Cesar Eduardo Barros)

Comments (2 posted)

Making workqueues non-reentrant

By Jonathan Corbet
August 15, 2012
Workqueues are the primary mechanism for deferring work within the Linux kernel. The maintainer of this facility is Tejun Heo, who recently posted a patch series changing an aspect of workqueue behavior that, perhaps, few kernel developers know about. Most workqueues are reentrant, meaning that the same work item can be running on multiple CPUs at the same time. That can result in concurrency that developers are not expecting; it also significantly complicates the various "flush" operations in surprising ways. In summary, Tejun says:

In the end, workqueue is a pain in the ass to get completely correct and the breakages are very subtle. Depending on queueing pattern, assuming non-reentrancy might work fine most of the time. The effect of using flush_work() where flush_work_sync() should be used could be a lot more subtle. flush_work() becoming noop would happen extremely rarely for most users but it definitely is there.

A tool which is used as widely as workqueue shouldn't be this difficult and error-prone. This is just silly.

Tejun's patch set is intended to make the workqueue interface work more like the timer interface, which is rather more careful about allowing concurrent operations on the same timer. All workqueues become non-reentrant, and aspects of the API related to reentrant behavior have been simplified. There is, evidently, a performance cost to the change, but Tejun says it is small, and the cost of flush operations should go down. It seems like a worthwhile trade-off, overall, but anybody who maintains code that depends on concurrent work item execution will want to be aware of the change.
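
As a purely hypothetical illustration of the behavior being changed (this sketch is not taken from Tejun's series), consider a driver that re-queues a single work item from its interrupt handler; under the old reentrant behavior, the handler function could end up running on two CPUs at once. The helper process_device_events() is invented for the example:

    #include <linux/workqueue.h>
    #include <linux/interrupt.h>

    static void my_dev_work_fn(struct work_struct *work)
    {
        /* With reentrant workqueues, this function can run on two CPUs
         * simultaneously if the work item is re-queued from another CPU
         * after this execution has started; any shared state touched
         * here needs its own locking.  With Tejun's change, at most one
         * CPU runs it at any given time. */
        process_device_events();        /* hypothetical helper */
    }

    static DECLARE_WORK(my_dev_work, my_dev_work_fn);

    static irqreturn_t my_dev_irq(int irq, void *data)
    {
        /* Interrupts may arrive on different CPUs in quick succession. */
        schedule_work(&my_dev_work);
        return IRQ_HANDLED;
    }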

Comments (none posted)

Kernel development news

Firmware loading and suspend/resume

By Jonathan Corbet
August 15, 2012
Many devices are unable to function until the host system has loaded them with their operating firmware. Runtime-loadable firmware has some real advantages: the hardware can be a little cheaper to make, and the firmware is easily upgraded after the hardware has been sold. But it also poses some problems, especially when combined with other features. Properly handling firmware loading over suspend/resume cycles has been a challenge for the kernel for some time, but a new set of patches may be poised to make things work better with little or no need for changes to drivers.

The obvious issue with suspend/resume is that any given device may lose its firmware while the system is suspended. The whole point of suspending the system is to reduce its power consumption to a minimum, so that operation may well power down peripheral devices entirely. Loss of firmware during suspend doesn't seem like it should be a big problem; the driver can just load the firmware again at resume time. But firmware tends to live on disk, and the actual firmware loading operation involves the running of a helper process in user space. Neither the disk nor user space is guaranteed to be available at the point in the resume process when a given device wants its firmware back; drivers that attempt to obtain firmware at such times may fail badly. The result is resume failures, which may be of the intermittent, developer-never-sees-it variety that can be so frustrating to track down. So the search has been on for a more robust solution for some time.
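
For reference, the loading step being discussed normally looks something like the following sketch (the firmware name and my_chip_upload() are invented for illustration); the request_firmware() call is the point that depends on the filesystem and user space being available:

    #include <linux/firmware.h>
    #include <linux/device.h>

    static int my_driver_load_firmware(struct device *dev)
    {
        const struct firmware *fw;
        int ret;

        /* Blocks while the blob is located on disk, possibly with the
         * help of a user-space process; this is the step that can fail
         * if it is attempted in the middle of a resume. */
        ret = request_firmware(&fw, "my_device.bin", dev);
        if (ret)
            return ret;

        ret = my_chip_upload(dev, fw->data, fw->size);  /* hypothetical */

        /* The in-kernel copy is no longer needed once the device has it. */
        release_firmware(fw);
        return ret;
    }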

In July, Ming Lei tried to address this problem with a patch integrating firmware loading with the deferred driver probing mechanism. In short, if a firmware load failed, the whole driver initialization process would be put on the deferred queue to be retried later. So a driver that was unable to load its firmware at resume time would be put on hold and retried at a later point when, with luck, the resources required to complete the firmware load would be available. That, Ming hoped, would resolve a lot of resume-time failures without requiring changes to lots of drivers.

Linus, however, disagreed:

Sure, for a lot of devices it's fine to load the firmware later. But some devices may be part of the resume sequence in very critical ways, and deferring the firmware loading will just mean that the resume will fail.

Deferring firmware loading in this manner, he thought, would just hide problems from developers while leaving them to burn users later on; much better to force driver writers to deal with the problem explicitly.

The classic way for a driver writer to handle this problem is to just keep the firmware around after it is loaded at system boot time. Permanently cached firmware will always be available when it is needed, so firmware loading at resume time should be robust. The problem with that approach is that the firmware blobs loaded into some devices can be quite large; keeping them around forever can waste a fair amount of kernel-space memory. To make things worse, these blobs are loaded into vmalloc() memory (so that they appear contiguous in the kernel's address space), and vmalloc() address space can be in short supply on 32-bit systems. Permanently caching the firmware is, thus, not an ideal solution, but that is what a number of drivers do now.

After the discussion with Linus, Ming thought for a while and came back with a new proposal: cache firmware blobs, but only during the actual suspend/resume cycle. Drivers can, of course, do that now; they can request a copy of the firmware while suspending their devices, and release that copy once it's no longer needed at resume time. But that is a chunk of boilerplate code that would need to be added to each driver. Ming's patch, instead, makes this process automatic and transparent.
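
The boilerplate in question would look roughly like this sketch (struct my_device, the firmware name, and my_chip_upload() are placeholders, not code from any real driver):

    #include <linux/firmware.h>
    #include <linux/device.h>

    struct my_device {
        const struct firmware *fw;
        /* ... the rest of the driver's per-device state ... */
    };

    /* Grab a copy of the firmware while the filesystem is still usable. */
    static int my_driver_suspend(struct device *dev)
    {
        struct my_device *my_dev = dev_get_drvdata(dev);

        if (request_firmware(&my_dev->fw, "my_device.bin", dev))
            my_dev->fw = NULL;
        return 0;
    }

    /* Reload the device from the cached copy, then release it. */
    static int my_driver_resume(struct device *dev)
    {
        struct my_device *my_dev = dev_get_drvdata(dev);

        if (my_dev->fw) {
            my_chip_upload(dev, my_dev->fw->data, my_dev->fw->size);
            release_firmware(my_dev->fw);
            my_dev->fw = NULL;
        }
        return 0;
    }

Multiply that by the number of firmware-loading drivers in the tree, and the appeal of doing the job once, in the firmware loader itself, becomes clear.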

In particular, request_firmware() is changed to make a note of the name of every firmware blob it is asked to load. This information is reference-counted and tied to the devices that needed the firmware; it can thus be discarded if all such devices disappear. The result is a simple data structure tracking all of the firmware blobs that may be needed by the hardware currently present in the system.

At system suspend time, the code simply goes and loads every piece of firmware that it thinks may be needed. That data then sits in memory while the system is suspended. At resume time, those cached blobs are available to any driver, with no need for filesystem access or user-space involvement, via the usual request_firmware() interface. Once the resume process is complete, the firmware loader will, after a small delay, release all of those cached firmware images, freeing the associated memory and address space for other uses.

The patch seems close to an ideal solution. Firmware loading at resume time becomes more robust, there is no need for drivers to be concerned with how it works, and wasted memory is minimized. Even Linus said "Nothing in this patchset made me go 'Eww'", which, from him, can be seen as reasonably high praise. It doesn't solve every problem; there are, for example, some strange devices that retain firmware over a reboot but not over suspend, so the system may not know that a specific firmware image is needed until resume time, when it's too late. But such hardware is probably best handled as a special case. For the rest, we may be close to a solution that simply works—and that brings an end to the recurring "firmware at resume time" discussions on the mailing lists.

Comments (7 posted)

TCP friends

By Jonathan Corbet
August 15, 2012
One of the many advantages of the TCP network protocol is that the process at one end of a connection need not have any idea of where the other side is. A process could be talking with a peer on the other side of the world, in the same town, or, indeed, on the same machine. That last case may be irrelevant to the processes involved, but it can be important for performance-sensitive users. A new patch from Google seems likely to speed that case up in the near future.

A buffer full of data sent on the network does not travel alone. Instead, the TCP layer must split that buffer into reasonably-sized packets, prepend a set of TCP headers to it, and, possibly, calculate a checksum. The packets are then passed to the IP layer, which throws its own headers onto the beginning of the buffer, finds a suitable network interface, and hands the result off to that interface for transmission. At the receiving end the process is reversed: the IP and TCP headers are stripped, checksums are compared, and the data is merged back into a seamless stream for the receiving process.

It is all a fair amount of work, but it allows the two processes to communicate without having to worry about all that happens in between. But, if the two processes are on the same physical machine, much of that work is not really necessary. The bulk of the overhead in the network stack is there to ensure that packets do not get lost on their way to the destination, that the data does not get corrupted in transit, and that nothing gets forgotten or reordered. Most of these perils do not threaten data that never leaves the originating system, so much of the work done by the networking stack is entirely wasted in this case.

That much has been understood by developers for many years, of course. That is why many programs have been written specifically to use Unix-domain sockets when communicating with local peers. Unix-domain sockets provide the same sort of stream abstraction but, since they never carry data between systems, they avoid the overhead added by a full network stack. So faster communication between local processes is possible now, but it must be coded explicitly in any program that wishes to use it.
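
The explicit coding is visible at connection-setup time; the following sketch (with an arbitrary port number and socket path) shows the two variants a program must currently choose between:

    #include <sys/socket.h>
    #include <sys/un.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>
    #include <unistd.h>

    /* TCP to a local peer: the data still traverses the full TCP/IP
     * stack and the loopback device. */
    static int connect_tcp_local(void)
    {
        struct sockaddr_in sin = { .sin_family = AF_INET,
                                   .sin_port = htons(5000) };
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        inet_pton(AF_INET, "127.0.0.1", &sin.sin_addr);
        if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    /* Unix-domain socket: bypasses the network stack entirely, but
     * requires different setup code and a rendezvous path on disk. */
    static int connect_unix_local(void)
    {
        struct sockaddr_un sun = { .sun_family = AF_UNIX };
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0)
            return -1;
        strncpy(sun.sun_path, "/tmp/my-server.sock", sizeof(sun.sun_path) - 1);
        if (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

With TCP friends in place, the first variant would get the performance of the second with no source changes at all.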

What if local TCP communications could be accelerated to the point that they are competitive with Unix-domain sockets? That is the objective of this patch from Bruce Curtis. The idea is simple enough to explain: when both endpoints of a TCP connection are on the same machine, the two sockets are marked as being "friends" in the kernel. Data written to such a socket will be immediately queued for reading on the friend socket, bypassing the network stack entirely. The TCP, IP, and loopback device layers are simply shorted out. The actual patch, naturally enough, is rather more complicated than this simple description would suggest; friend sockets must still behave like TCP sockets to the point that applications cannot tell the difference, so friend-handling tweaks must be applied to many places in the TCP stack.

One would hope that this approach would yield local networking speeds that are at least close to competitive with those achieved using Unix-domain sockets. Interestingly, Bruce's patch not only achieves that, but it actually does better than Unix-domain sockets in almost every benchmark he ran. "Better" means both higher data transmission rates and lower latencies on round-trip tests. Bruce does not go into why that is; perhaps the amount of attention that has gone into scalability in the networking stack pays off in his 16-core testing environment.

There is one important test for which Bruce posted no results: does the TCP friends patch make things any slower for non-local connections where the stack bypass cannot be used? Some of the network stack hot paths can be sensitive to even small changes, so one can imagine that the networking developers will want some assurance that the non-bypass case will not be penalized if this patch goes in. There are various other little issues that need to be dealt with, but this patch looks like it is on track for merging in the relatively near future.

If it is merged, the result should be faster local communications between processes without the need for special-case code using Unix-domain sockets. It could also be most useful on systems hosting containerized guests where cross-container communications are needed; one suspects that Google's use case looks somewhat like that. In the end, it is hard to argue against a patch that can speed local communications by as much as a factor of five, so chances are this change will go into the mainline before too long.

Comments (22 posted)

Signed overflow optimization hazards in the kernel

August 15, 2012

This article was contributed by Paul McKenney

A recent LWN article described a couple of the hazards that compiler optimizations can pose for multithreaded code. This article takes a different approach, looking at a compiler-optimization hazard that can also strike sequential code. This hazard stems from an annoying aspect of the C11 standard, namely that signed-integer overflow is undefined (Section 3.4.3).

Overflow is a consequence of the fact that a computer's native arithmetic capability is quite limited. For example, if a C program running on a typical Linux system tries adding one int variable with value 2,147,483,647 to another int with value 1, the result will be -2,147,483,648—which might surprise people who naively expect the mathematically correct value of +2,147,483,648. This deviation from mathematical correctness occurs because the machine cannot represent the correct value of 2,147,483,648 in a 32-bit twos-complement integer. Therefore, any attempt to compute this number without help from software will result in overflow.
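
A trivial demonstration of that wrap, hedged with the obvious caveat that the behavior is undefined and therefore not guaranteed by the standard:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int x = INT_MAX;    /* 2,147,483,647 with 32-bit ints */

        /* Signed overflow is undefined; on typical twos-complement
         * Linux systems, built without aggressive optimization, this
         * prints -2147483648, but nothing obliges the compiler to
         * produce that result. */
        printf("%d\n", x + 1);
        return 0;
    }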

Quick Quiz 1: Yecch!!! Why can't CPU designers come up with something better?
Answer

The number -2,147,483,648 is “unusual” in that adding it to itself (again, using twos-complement 32-bit integers) results in zero. Furthermore, it is its own negative: negating -2,147,483,648 results in -2,147,483,648. Therefore, this weird number is (1) equal to half of zero and (2) both positive and negative but (3) not equal to zero. This weirdness earned this number a special mention in the C standard, which says that its handling is implementation-defined (Section 6.2.6.2, Paragraph 2).

Unfortunately, relying on signed integer overflow for both normal and weird values is extremely convenient when working with free-running counters. For example, suppose that our program is dealing with a succession of work items, each designated by an integer. Suppose further that this code might be called upon to report on the past and future 25 work items. This situation will likely require a fast way to distinguish between past, current, and future work. Signed twos-complement integers make this easy. If current is the integer corresponding to the current work item,

    (current - other > 0)

will evaluate to true if the other work item is from the past. This works, even if the counter hits the maximum value and wraps around, due to the circular nature of twos-complement integers: adding one to the largest positive number that can be represented results in the smallest negative number. As a result, there is no need to write special-case code to handle counter overflows.
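
A concrete (and, again, technically undefined) illustration of the wrap-around case, using hypothetical counter values:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int current = INT_MIN;  /* the counter has just wrapped      */
        int other   = INT_MAX;  /* a work item issued just before it */

        /* In 32-bit twos-complement arithmetic, current - other wraps
         * around to 1, so the test still classifies "other" as past
         * work even though current < other numerically.  This is, of
         * course, exactly the signed overflow that the standard
         * declines to define. */
        if (current - other > 0)
            printf("other is in the past\n");
        return 0;
    }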

The simplicity of this approach has caused coders to willfully ignore the undefined nature of C signed-integer overflow since well before the dawn of Linux. For example, I was happily relying on twos-complement semantics from C signed-integer overflow in the early 1980s, and the only reason I wasn't doing so earlier was that I wasn't using C any earlier. Nor am I alone. Here are a couple of representative code fragments from version 3.5 of the Linux kernel:

    if (auth_vnode->acl_order - acl_order > 0) {

    return (int)(tcmp - __raw_readl(timer_base + MX1_2_TCN)) < 0 ?  -ETIME : 0;

The first is from afs_cache_permit() in the AFS filesystem, which is using this pattern to sort out the order of events in a distributed filesystem. The second example is from mx1_2_set_next_event() in the ARM architecture, which is using a variation on this theme to determine whether the requested event time really is in the future. Here the actual subtraction is unsigned, but the result is cast to a signed integer. Because unsigned longs are always positive, the only way that the result can be negative (when interpreted as a signed value) is overflow, which the compiler is permitted to assume never happens. The compiler is therefore within its rights to unconditionally evaluate the test as false and return zero, which might fatally disappoint the caller.

In addition, there used to be several instances of this pattern in the Linux kernel's RCU implementation, where it was used to figure out whether a given request had been implicitly satisfied due to the efforts undertaken to fulfill concurrent requests of the same type. These have since been converted to use unsigned arithmetic using the technique described below.

One might well ask: is there really a problem here? All systems running Linux are twos complement, so we really should not worry about clauses in the C standard designed to handle the wider variety of arithmetic that was available a few decades ago, right?

Unfortunately, wrong. The C compiler can and does make use of undefined behavior when optimizing. To see this, consider the following code:

     1 void long_cmp_opt(const int a, const int b)
     2 {
     3   if (a > 0) {
     4     do_something();
     5     if (b < 0) {
     6       do_something_else();
     7       if ((a - b) > 0)
     8         do_another_thing();
     9     }
    10   }
    11 }

At line 7 the compiler knows that the variable a is positive and the variable b is negative. Therefore, ignoring the possibility of integer overflow, the compiler knows that this “if” condition will always evaluate to true, meaning that the compiler is within its rights to invoke do_another_thing() unconditionally, without actually doing the subtraction and comparison. In contrast, if a is (say) 2,147,483,647 and b is -2,147,483,648, the unoptimized code would avoid invoking do_another_thing(). Therefore, this optimization has significantly changed the program's behavior.

Quick Quiz 2: But just how often is the compiler going to know the sign of both the values???
Answer

Of course, in real life, overflow really can occur. But because the C standard says that signed overflow is undefined, the compiler is permitted to do whatever it wishes in the overflow case. And GCC 4.6.1 really does omit the subtraction and comparison when compiling this example for x86 at optimization levels of -O2 or higher.

Fortunately for the Linux kernel, GCC will generate the subtraction and comparison for -O1 or less. But optimizations can migrate to lower optimization levels over time, and there may come a time when either performance or energy-efficiency considerations motivate the Linux kernel to move to higher optimization levels. If that happens, what can be done?

Quick Quiz 3: First you were talking about overflowing, now about wrapping. Consistent terminology, please?
Answer

One approach is to move to unsigned integers for free-running counters. The C standard defines unsigned integers to use modular arithmetic, so that wrapping the counter is fully defined (Section 6.2.5 Paragraph 9).

Of course, checking for counter wrap must be done differently. For purposes of comparison, here is the (undefined) signed version:

    if ((a - b) < 0)

And here is the corresponding version for unsigned long types:

    if (ULONG_MAX / 2 < a - b)

This version relies on the fact that, bit for bit, twos-complement addition and subtraction are identical to their unsigned counterparts. Now, the bitwise representation of the constant ULONG_MAX / 2 is a zero bit followed by all one-bits, which is the largest value that does not have the most-significant bit set. Therefore, if the result of computing a - b is greater than this constant, we know that this result has its uppermost bit set. Because the uppermost bit is the sign bit for twos-complement numbers, we are guaranteed that the signed and unsigned versions compute identical results.

Of course, the unsigned version requires a bit more typing, but that is what inline functions and C-preprocessor macros are for. But what about the code that the compiler generates? After all, the Linux kernel absolutely does not need extra instructions loading large constants for each comparison!

The good news is that GCC actually generates exactly the same code for both of the above versions when compiled with -O1 on both x86 and PowerPC:

   /* x86 code. */
      e:	8b 44 24 04          	mov    0x4(%esp),%eax
     12:	2b 44 24 08          	sub    0x8(%esp),%eax
     16:	c1 e8 1f             	shr    $0x1f,%eax
   /* PowerPC code. */
     1c:   7c 64 18 50     subf    r3,r4,r3
     20:   78 63 0f e0     rldicl  r3,r3,1,63
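
Wrapped in a helper, the unsigned comparison is no more trouble to use than the signed one. The kernel's time_after() family of jiffies macros packages up a related (cast-based) comparison in just this way; a stand-alone sketch, with counter_before() being an invented name, might look like:

    #include <limits.h>
    #include <stdbool.h>

    /* True if counter value a is "before" b, even if the counter has
     * wrapped in between; equivalent to the signed test (a - b) < 0,
     * but fully defined by the C standard. */
    static inline bool counter_before(unsigned long a, unsigned long b)
    {
        return ULONG_MAX / 2 < a - b;
    }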

Of course, there will be times when the Linux kernel absolutely must rely on undefined behavior. However, this is not one of those times: As shown above, there are straightforward ways to avoid relying on signed integer overflow. Removing the kernel's reliance on signed integer overflow could avoid our getting burned by increasingly aggressive optimization, and might further allow use of higher optimization levels to improve performance and battery lifetime. So it is not too early to start future-proofing the Linux kernel by removing its reliance on signed integer overflow!

Answers to Quick Quizzes

Quick Quiz 1: Yecch!!! Why can't CPU designers come up with something better?

Answer: CPU designers have come up with a variety of schemes over the decades. However, in my experience, each scheme has its own peculiarities. I have used ones complement and twos complement, and dealing with the peculiarities of twos complement proved easier for me than those of ones complement.

That said, I suspect that the dominance of twos complement was not due to ease of use, but rather due to the fact that it allows a single hardware adder to perform both signed and unsigned computations.

Back to Quick Quiz 1.

Quick Quiz 2: But just how often is the compiler going to know the sign of both the values???

Answer: The more inline functions we add, the higher the probability that the compiler will be able to infer all sorts of things about the values in question, including their sign. And it only takes one unwanted optimization for the Linux kernel to fail.

Back to Quick Quiz 2.

Quick Quiz 3: First you were talking about overflowing, now about wrapping. Consistent terminology, please?

Answer: Interestingly enough, the C standard does not define overflow for unsigned integers. Instead, it defines the unsigned integral types to use modular arithmetic so as to eliminate the possibility of overflow. Aside from things like division by zero, that is. The term “wrap” works regardless.

Back to Quick Quiz 3.

Comments (55 posted)

Patches and updates

Kernel trees

Greg KH Linux 3.5.2
Greg KH Linux 3.5.1
Greg KH Linux 3.4.9
Greg KH Linux 3.4.8
Steven Rostedt 3.4.8-rt16
Ben Hutchings Linux 3.2.27
Steven Rostedt 3.2.27-rt40
Greg KH Linux 3.0.41
Greg KH Linux 3.0.40
Steven Rostedt 3.0.40-rt60

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Miscellaneous

Ben Hutchings ethtool 3.5 released

Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds