Kernel development
Brief items
Kernel release status
The current development kernel remains 3.6-rc1; Linus, who has been on vacation, has not made a new 3.6-rc release since August 2. There has been a slow flow of fixes into the mainline repository, though; a new release should happen in the near future.

Stable updates: 3.0.40, 3.4.8, and 3.5.1 were released on August 9; 3.2.27 came out on August 11; and 3.0.41, 3.4.9, and 3.5.2 were released on August 15. That final set included, along with the usual fixes, the recent random number generator improvements.
Quotes of the week
Making workqueues non-reentrant
Workqueues are the primary mechanism for deferring work within the Linux kernel. The maintainer of this facility is Tejun Heo, who recently posted a patch series changing an aspect of workqueue behavior that, perhaps, few kernel developers know about. Most workqueues are reentrant, meaning that the same work item can be running on multiple CPUs at the same time. That can result in concurrency that developers are not expecting; it also significantly complicates the various "flush" operations in surprising ways. In summary, Tejun says:
A tool which is used as widely as workqueue shouldn't be this difficult and error-prone. This is just silly.
Tejun's patch set is intended to make the workqueue interface work more like the timer interface, which is rather more careful about allowing concurrent operations on the same timer. All workqueues become non-reentrant, and aspects of the API related to reentrant behavior have been simplified. There is, evidently, a slight performance cost to the change, but Tejun says it is small, and the cost of flush operations should go down. It seems like a worthwhile trade-off, overall, but anybody who maintains code that depends on concurrent work item execution will want to be aware of the change.
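As a rough illustration (the names here are invented, not taken from any real driver), consider a work item that can be queued from interrupt context on any CPU. Under the old reentrant behavior, the handler had to assume it might be running on two CPUs at once for the same work item; with Tejun's change, a given item runs on at most one CPU at a time, though locking against the interrupt path is still required:

    #include <linux/workqueue.h>
    #include <linux/spinlock.h>

    static struct work_struct stats_work;
    static DEFINE_SPINLOCK(stats_lock);
    static unsigned long stats_events;

    static void stats_work_fn(struct work_struct *work)
    {
            unsigned long flags;

            /* Under reentrant workqueues, this function could run on two
             * CPUs simultaneously for the same work item if it was requeued
             * while still executing; after the change, a given item executes
             * on only one CPU at a time.  The lock is still needed to
             * protect against the interrupt path below. */
            spin_lock_irqsave(&stats_lock, flags);
            /* ... process and reset the accumulated count ... */
            stats_events = 0;
            spin_unlock_irqrestore(&stats_lock, flags);
    }

    /* Called from an interrupt handler, potentially on any CPU. */
    static void note_event(void)
    {
            spin_lock(&stats_lock);
            stats_events++;
            spin_unlock(&stats_lock);
            schedule_work(&stats_work);  /* may requeue the item while it runs */
    }

    /* Run once during device or module setup. */
    static void stats_setup(void)
    {
            INIT_WORK(&stats_work, stats_work_fn);
    }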
Kernel development news
Firmware loading and suspend/resume
Many devices are unable to function until the host system has loaded them with their operating firmware. Runtime-loadable firmware has some real advantages: the hardware can be a little cheaper to make, and the firmware is easily upgraded after the hardware has been sold. But it also poses some problems, especially when combined with other features. Properly handling firmware loading over suspend/resume cycles has been a challenge for the kernel for some time, but a new set of patches may be poised to make things work better with little or no need for changes to drivers.

The obvious issue with suspend/resume is that any given device may lose its firmware while the system is suspended. The whole point of suspending the system is to reduce its power consumption to a minimum, so that operation may well power down peripheral devices entirely. Loss of firmware during suspend doesn't seem like it should be a big problem; the driver can just load the firmware again at resume time. But firmware tends to live on disk, and the actual firmware loading operation involves the running of a helper process in user space. Neither the disk nor user space is guaranteed to be available at the point in the resume process when a given device wants its firmware back; drivers that attempt to obtain firmware at such times may fail badly. The result is resume failures, often of the intermittent, developer-never-sees-it variety that can be so frustrating to track down. So the search has been on for a more robust solution for some time.
In July, Ming Lei tried to address this problem with a patch integrating firmware loading with the deferred driver probing mechanism. In short, if a firmware load failed, the whole driver initialization process would be put on the deferred queue to be retried later on. So, a driver that is unable to load its firmware at resume time would be put on hold and retried at a later point when, hopefully, the resources required to complete the firmware load would be available. That, Ming hoped, would resolve a lot of resume-time failures without requiring changes to lots of drivers.
Linus, however, disagreed. Deferring firmware loading in this manner, he thought, would just serve to hide problems from developers but leave them to burn users later on. It is much better, he argued, to force driver writers to deal with the problem explicitly.
The classic way for a driver writer to handle this problem is to just keep the firmware around after it is loaded at system boot time. Permanently cached firmware will always be available when it is needed, so firmware loading at resume time should be robust. The problem with that approach is that the firmware blobs loaded into some devices can be quite large; keeping them around forever can waste a fair amount of kernel-space memory. To make things worse, these blobs are loaded into vmalloc() memory (so that they appear to be contiguous in memory); that memory can be in short supply on 32-bit systems. Permanently caching the firmware is, thus, not an ideal solution, but that is what a number of drivers do now.
After the discussion with Linus, Ming thought for a while and came back with a new proposal: cache firmware blobs, but only during the actual suspend/resume cycle. Drivers can, of course, do that now; they can request a copy of the firmware while suspending their devices, and release that copy once it's no longer needed at resume time. But that is a chunk of boilerplate code that would need to be added to each driver. Ming's patch, instead, makes this process automatic and transparent.
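That boilerplate might look something like the following minimal sketch of a hypothetical driver's power-management callbacks; the "foo" driver structure and firmware name are invented for illustration, while request_firmware() and release_firmware() are the real interfaces:

    #include <linux/firmware.h>
    #include <linux/device.h>

    /* Hypothetical per-device state for an imaginary "foo" driver. */
    struct foo_device {
            struct device *dev;
            const struct firmware *cached_fw;  /* held only across suspend/resume */
    };

    static int foo_suspend(struct device *dev)
    {
            struct foo_device *foo = dev_get_drvdata(dev);

            /* Grab a copy of the firmware now, while the disk and user space
             * are still available; it will be needed again at resume time. */
            if (request_firmware(&foo->cached_fw, "foo/fw.bin", dev))
                    foo->cached_fw = NULL;
            return 0;
    }

    static int foo_resume(struct device *dev)
    {
            struct foo_device *foo = dev_get_drvdata(dev);

            if (foo->cached_fw) {
                    /* ... load foo->cached_fw->data into the hardware ... */
                    release_firmware(foo->cached_fw);  /* drop the cached copy */
                    foo->cached_fw = NULL;
            }
            return 0;
    }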
In particular, request_firmware() is changed to make a note of the name of every firmware blob it is asked to load. This information is reference-counted and tied to the devices that needed the firmware; it can thus be discarded if all such devices disappear. The result is a simple data structure tracking all of the firmware blobs that may be needed by the hardware currently present in the system.
At system suspend time, the code simply goes and loads every piece of firmware that it thinks may be needed. That data then sits in memory while the system is suspended. At resume time, those cached blobs are available to any driver, with no need for filesystem access or user-space involvement, via the usual request_firmware() interface. Once the resume process is complete, the firmware loader will, after a small delay, release all of those cached firmware images, freeing the associated memory and address space for other uses.
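The bookkeeping described above could be imagined as something like the sketch below; this is purely an illustration of the idea, not the actual data structure used in Ming's patch:

    #include <linux/list.h>
    #include <linux/kref.h>

    /*
     * One entry per firmware name that request_firmware() has seen for a
     * still-present device.  At suspend time the firmware loader walks a
     * list of these, loads each blob into memory, and keeps it until the
     * resume process has finished.  (Illustrative only.)
     */
    struct fw_cache_entry {
            struct list_head list;       /* links all known firmware names */
            struct kref ref;             /* dropped when the last user goes away */
            const struct firmware *fw;   /* cached blob, non-NULL only around suspend */
            char name[];                 /* firmware file name, e.g. "foo/fw.bin" */
    };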
The patch seems close to an ideal solution. Firmware loading at resume time becomes more robust, there is no need for drivers to be concerned with how it works, and wasted memory is minimized. Even Linus said "Nothing in this patchset made me go 'Eww'", which, from him, can be seen as reasonably high praise. It doesn't solve every problem; there are, for example, some strange devices that retain firmware over a reboot but not over suspend, so the system may not know that a specific firmware image is needed until resume time, when it's too late. But such hardware is probably best handled as a special case. For the rest, we may be close to a solution that simply works—and that brings an end to the recurring "firmware at resume time" discussions on the mailing lists.

TCP friends

One of the many advantages of the TCP network protocol is that the process at one end of a connection need not have any idea of where the other side is. A process could be talking with a peer on the other side of the world, in the same town, or, indeed, on the same machine. That last case may be irrelevant to the processes involved, but it can be important for performance-sensitive users. A new patch from Google seems likely to speed that case up in the near future.

A buffer full of data sent on the network does not travel alone. Instead, the TCP layer must split that buffer into reasonably-sized packets, prepend a set of TCP headers, and, possibly, calculate checksums. The packets are then passed to the IP layer, which adds its own headers, finds a suitable network interface, and hands the result off to that interface for transmission. At the receiving end the process is reversed: the IP and TCP headers are stripped, checksums are compared, and the data is merged back into a seamless stream for the receiving process.
It is all a fair amount of work, but it allows the two processes to
communicate without having to worry about all that happens in between.
But, if the two processes are on the same physical machine, much of that
work is not really necessary. The bulk of the overhead in the network
stack is there to ensure that packets do not get lost on their way to the
destination, that the data does not get corrupted in transit, and that
nothing gets forgotten or reordered. Most of these perils do not threaten
data that never leaves the originating system, so much of the work done by
the networking stack is entirely wasted in this case.
That much has been understood by developers for many years, of course.
That is why many programs have been written specifically to use Unix-domain
sockets when communicating with local peers. Unix-domain sockets
provide the same sort of stream abstraction but, since they do not
communicate between systems, they avoid all of the overhead added by a full
network stack. So faster communication between local processes is
possible now, but it must be coded explicitly into any program that wishes to
use it.
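The kind of explicit special-casing that applications must do today might look like this user-space sketch (the socket path is invented); with a kernel-side bypass like TCP friends, the TCP branch alone would suffice:

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /*
     * Connect to a service, taking the Unix-domain fast path when the peer
     * is known to be on the same machine and ordinary TCP otherwise.  This
     * is exactly the per-application special-casing that a kernel-side
     * bypass would make unnecessary.
     */
    static int connect_to_service(int peer_is_local, const struct sockaddr_in *remote)
    {
            int fd = -1;

            if (peer_is_local) {
                    struct sockaddr_un sun;

                    memset(&sun, 0, sizeof(sun));
                    sun.sun_family = AF_UNIX;
                    strncpy(sun.sun_path, "/run/example.sock", sizeof(sun.sun_path) - 1);
                    fd = socket(AF_UNIX, SOCK_STREAM, 0);
                    if (fd >= 0 && connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == 0)
                            return fd;
            } else {
                    fd = socket(AF_INET, SOCK_STREAM, 0);
                    if (fd >= 0 && connect(fd, (const struct sockaddr *)remote, sizeof(*remote)) == 0)
                            return fd;
            }
            if (fd >= 0)
                    close(fd);
            return -1;
    }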
What if local TCP communications could be accelerated to the point that
they are competitive with Unix-domain sockets? That is the objective of this patch from Bruce Curtis. The idea is
simple enough to explain: when both endpoints of a TCP connection are on
the same machine, the two sockets are marked as being "friends" in the
kernel. Data written to such a socket will be immediately queued for
reading on the friend socket, bypassing the network stack entirely. The
TCP, IP, and loopback device layers are simply shorted out. The actual
patch, naturally enough, is rather more complicated than this simple
description would suggest; friend sockets must still behave like TCP
sockets to the point that applications cannot tell the difference, so
friend-handling tweaks must be applied to many places in the TCP stack.
One would hope that this approach would yield local networking speeds that
are at least close to competitive with those achieved using Unix-domain
sockets. Interestingly, Bruce's patch not only achieves that, but it
actually does better than Unix-domain sockets in almost every benchmark he
ran. "Better" means both higher data transmission rates and lower
latencies on round-trip tests. Bruce does not go into why that is; perhaps
the amount of attention that has gone into scalability in the networking
stack pays off in his 16-core testing environment.
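For those curious how such round-trip numbers are typically gathered, here is a minimal sketch (not Bruce's benchmark) that measures average latency over any already-connected stream socket whose peer echoes what it reads; running it over a loopback TCP connection and over an AF_UNIX socket pair gives the sort of comparison discussed above:

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    /* Bounce a single byte back and forth "iterations" times and return
     * the average round-trip time in microseconds, or -1.0 on error. */
    static double round_trip_us(int fd, int iterations)
    {
            char byte = 'x';
            struct timespec start, end;
            int i;

            clock_gettime(CLOCK_MONOTONIC, &start);
            for (i = 0; i < iterations; i++) {
                    if (write(fd, &byte, 1) != 1 || read(fd, &byte, 1) != 1)
                            return -1.0;
            }
            clock_gettime(CLOCK_MONOTONIC, &end);

            return ((end.tv_sec - start.tv_sec) * 1e6 +
                    (end.tv_nsec - start.tv_nsec) / 1e3) / iterations;
    }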
There is one important test for which Bruce posted no results: does the TCP
friends patch make things any slower for non-local connections where the
stack bypass cannot be used? Some of the network stack hot paths can be
sensitive to even small changes, so one can imagine that the networking
developers will want some assurance that the non-bypass case will not be
penalized if this patch goes in. There are various other little issues that
need to be dealt with as well.

If it is merged, the result should be faster local communications between
processes without the need for special-case code using Unix-domain
sockets. It could also be most useful on systems hosting containerized
guests where cross-container communications are needed; one suspects that
Google's use case looks somewhat like that. In the end, it is hard to
argue against a patch that can speed local communications by as much as a
factor of five, so chances are this change will go into the mainline before
too long.
Signed overflow optimization hazards in the kernel

A recent LWN article described a couple of the hazards that compiler optimizations can pose for multithreaded code. This article takes a different approach, looking at a compiler-optimization hazard that can also strike sequential code. This hazard stems from an annoying aspect of the C11 standard, namely that signed-integer overflow is undefined (Section 3.4.3).
Overflow is a consequence of the fact that a computer's native arithmetic capability is quite limited. For example, if a C program running on a typical Linux system tries adding one int variable with value 2,147,483,647 to another int with value 1, the result will be -2,147,483,648—which might surprise people who naively expect the mathematically correct value of +2,147,483,648. This deviation from mathematical correctness occurs because the machine cannot represent the correct value of 2,147,483,648 in a 32-bit twos-complement integer. Therefore, any attempt to compute this number without help from software will result in overflow.

Quick Quiz 1: Yecch!!! Why can't CPU designers come up with something better? (See the answer at the end of this article.)

The number -2,147,483,648 is "unusual" in that adding it to itself (again, using twos-complement 32-bit integers) results in zero. Furthermore, it is its own negative: negating -2,147,483,648 results in -2,147,483,648. Therefore, this weird number is (1) equal to half of zero and (2) both positive and negative, but (3) not equal to zero. This weirdness earned this number a special mention in the C standard, which says that its handling is implementation-defined (Section 6.2.6.2, Paragraph 2).
Unfortunately, relying on signed integer overflow for both normal and weird values is extremely convenient when working with free-running counters. For example, suppose that our program is dealing with a succession of work items, each designated by an integer. Suppose further that this code might be called upon to report on the past and future 25 work items. This situation will likely require a fast way to distinguish between past, current, and future work. Signed twos-complement integers make this easy. If current is the integer corresponding to the current work item, (current - other > 0) will evaluate to true if the other work item is from the past. This works, even if the counter hits the maximum value and wraps around, due to the circular nature of twos-complement integers: adding one to the largest positive number that can be represented results in the smallest negative number. As a result, there is no need to write special-case code to handle counter overflows.

The simplicity of this approach has caused coders to willfully ignore the undefined nature of C signed-integer overflow since well before the dawn of Linux. For example, I was happily relying on twos-complement semantics from C signed-integer overflow in the early 1980s, and the only reason I wasn't doing so earlier was that I wasn't using C any earlier. Nor am I alone. Here are a couple of representative code fragments from version 3.5 of the Linux kernel:

    if (auth_vnode->acl_order - acl_order > 0) {

    return (int)(tcmp - __raw_readl(timer_base + MX1_2_TCN)) < 0 ? -ETIME : 0;

The first is from afs_cache_permit() in the AFS filesystem, which is using this pattern to sort out the order of events in a distributed filesystem. The second example is from mx1_2_set_next_event() in the ARM architecture, which is using a variation on this theme to determine whether the requested event time really is in the future. Here the actual subtraction is unsigned, but the result is cast to a signed integer. Because unsigned longs are always positive, the only way that the result can be negative (when interpreted as a signed value) is overflow, which the compiler is permitted to assume never happens. The compiler is therefore within its rights to unconditionally evaluate the test as false and return zero, which might fatally disappoint the caller.

In addition, there used to be several instances of this pattern in the Linux kernel's RCU implementation, where it was used to figure out whether a given request had been implicitly satisfied due to the efforts undertaken to fulfill concurrent requests of the same type. These have since been converted to use unsigned arithmetic using the technique described below.
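To make the pattern concrete, here is a minimal, self-contained illustration of the free-running-counter comparison described above; it is not taken from the kernel, and the helper name is invented:

    #include <limits.h>
    #include <stdio.h>

    /*
     * The (current - other > 0) idiom from the text: report whether the
     * "other" work item lies in the past relative to "current".  The
     * subtraction can overflow, which is exactly the undefined behavior
     * discussed here; kernel code has historically assumed that it simply
     * wraps, twos-complement style.
     */
    static int work_is_past(int current, int other)
    {
            return current - other > 0;
    }

    int main(void)
    {
            int old_item = INT_MAX;                          /* counter just before the wrap */
            int new_item = (int)((unsigned int)INT_MAX + 1); /* counter just after the wrap */

            /* On typical twos-complement systems at low optimization levels,
             * the older item is still correctly reported as past even though
             * the counter has wrapped in between. */
            printf("old relative to new: %d\n", work_is_past(new_item, old_item));
            printf("new relative to old: %d\n", work_is_past(old_item, new_item));
            return 0;
    }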
One might well ask: is there really a problem here? All systems running Linux are twos complement, so we really should not worry about clauses in the C standard designed to handle the wider variety of arithmetic that was available a few decades ago, right? Unfortunately, wrong. The C compiler can and does make use of undefined behavior when optimizing. To see this, consider the following code:

     1 long long_cmp_opt(const int a, const int b)
     2 {
     3   if (a > 0) {
     4     do_something();
     5     if (b < 0) {
     6       do_something_else();
     7       if ((a - b) > 0)
     8         do_another_thing();
     9     }
    10   }
    11 }

At line 7, the compiler knows that the variable a is positive and the variable b is negative. Therefore, ignoring the possibility of integer overflow, the compiler knows that this "if" condition will always evaluate to true, meaning that the compiler is within its rights to invoke do_another_thing() unconditionally, without actually doing the subtraction and comparison. In contrast, if a is (say) 2,147,483,647 and b is -2,147,483,648, the unoptimized code would avoid invoking do_another_thing(). Therefore, this optimization has significantly changed the program's behavior.

Of course, in real life, overflow really can occur. But because the C standard says that signed overflow is undefined, the compiler is permitted to do whatever it wishes in the overflow case. And GCC 4.6.1 really does omit the subtraction and comparison when compiling this example for x86 at optimization levels of -O2 or higher.

Fortunately for the Linux kernel, GCC will generate the subtraction and comparison for -O1 or less. But optimizations can migrate to lower optimization levels over time, and there may come a time when either performance or energy-efficiency considerations motivate the Linux kernel to move to higher optimization levels. If that happens, what can be done?

One approach is to move to unsigned integers for free-running counters. The C standard defines unsigned integers to use modular arithmetic, so that wrapping the counter is fully defined (Section 6.2.5 Paragraph 9). Of course, checking for counter wrap must be done differently. For purposes of comparison, here is the (undefined) signed version:

    if ((a - b) < 0)

And here is the corresponding version for unsigned long types:

    if (ULONG_MAX / 2 < a - b)

This version relies on the fact that, bit for bit, twos-complement addition and subtraction are identical to their unsigned counterparts. Now, the bitwise representation of the constant ULONG_MAX / 2 is a zero bit followed by all one-bits, which is the largest value that does not have the most-significant bit set. Therefore, if the result of computing a - b is greater than this constant, we know that this result has its uppermost bit set. Because the uppermost bit is the sign bit for twos-complement numbers, we are guaranteed that the signed and unsigned versions compute identical results.

Of course, the unsigned version is more characters to type, but that is what inline functions and C-preprocessor macros are for. But what about the code that the compiler generates? After all, the Linux kernel absolutely does not need extra instructions loading large constants for each comparison! The good news is that GCC actually generates exactly the same code for both of the above versions when compiled with -O1 on both x86 and PowerPC:

    /* x86 code. */
     e:  8b 44 24 04   mov 0x4(%esp),%eax
    12:  2b 44 24 08   sub 0x8(%esp),%eax
    16:  c1 e8 1f      shr $0x1f,%eax

    /* PowerPC code. */
    1c:  7c 64 18 50   subf r3,r4,r3
    20:  78 63 0f e0   rldicl r3,r3,1,63

Of course, there will be times when the Linux kernel absolutely must rely on undefined behavior. However, this is not one of those times: as shown above, there are straightforward ways to avoid relying on signed integer overflow. Removing the kernel's reliance on signed integer overflow could avoid our getting burned by increasingly aggressive optimization, and might further allow use of higher optimization levels to improve performance and battery lifetime. So it is not too early to start future-proofing the Linux kernel by removing its reliance on signed integer overflow!
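The inline-function suggestion above might be rendered as something like the following minimal helper; the name counter_before() is invented for this sketch and is not an existing kernel interface:

    #include <limits.h>

    /*
     * Returns nonzero if free-running counter value "a" is behind (older
     * than) counter value "b", where both counters are allowed to wrap.
     * This is the ULONG_MAX / 2 test described in the text, equivalent to
     * the signed (a - b) < 0 idiom but with fully defined behavior,
     * because unsigned wraparound is modular arithmetic.
     */
    static inline int counter_before(unsigned long a, unsigned long b)
    {
            return ULONG_MAX / 2 < a - b;
    }

Hiding the constant behind an inline helper keeps call sites as readable as the signed idiom while avoiding the undefined behavior.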
Answers to Quick Quizzes

Quick Quiz 1: Yecch!!! Why can't CPU designers come up with something better?

Answer: CPU designers have come up with a variety of schemes over the decades. However, in my experience, each scheme has its own peculiarities. I have used ones complement and twos complement, and dealing with the peculiarities of twos complement proved easier for me than those of ones complement. That said, I suspect that the dominance of twos complement was not due to ease of use, but rather due to the fact that it allows a single hardware adder to perform both signed and unsigned computations.

Answer: The more inline functions we add, the higher the probability that the compiler will be able to infer all sorts of things about the values in question, including their sign. And it only takes one unwanted optimization for the Linux kernel to fail.

Answer: Interestingly enough, the C standard does not define overflow for unsigned integers. Instead, it defines the unsigned integral types to use modular arithmetic so as to eliminate the possibility of overflow. Aside from things like division by zero, that is. The term "wrap" works regardless.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet