
Kernel development

Brief items

Kernel release status

The current development kernel is 3.11-rc5, released on August 11. Linus said: "Sadly, the numerology doesn't quite work out, and while releasing the final 3.11 today would be a lovely coincidence (Windows 3.11 was released twenty years ago today), it is not to be. Instead, we have 3.11-rc5." Along with the usual fixes, this prepatch contains the linkat() permissions change discussed in the August 8 Kernel Page.

Stable updates: 3.10.6, 3.4.57, and 3.0.90 were released on August 11.

The 3.10.7, 3.4.58, and 3.0.91 updates are in the review process as of this writing; they can be expected sometime on or after August 15.


Quotes of the week

All companies end up in the Open Source Internet Beam Of Hate at some point or another, not always for good reason. I've felt that heat myself a few times in the last few years, I know all too well what it's like to be hated by the people you're trying to help.
Jean-Baptiste Quéru

One of the properties that π is conjectured to have is that it is normal, which is to say that its digits are all distributed evenly, with the implication that it is a disjunctive sequence, meaning that all possible finite sequences of digits will be present somewhere in it. If we consider π in base 16 (hexadecimal), it is trivial to see that if this conjecture is true, then all possible finite files must exist within π. The first record of this observation dates back to 2001.

From here, it is a small leap to see that if π contains all possible files, why are we wasting exabytes of space storing those files, when we could just look them up in π!

The π filesystem

We seem to have reached the point in kernel development where "security" is the magic word to escape from any kind of due process (it is, in fact, starting to be used in much the same way the phrase "war on terror" is used to abrogate due process usually required by the US constitution).
James Bottomley

It's disturbing to me that there are almost as many addresses from people like Lockheed Martin, Raytheon Missile, various govt agencies from various countries with access to the coverity db as there are people who actually have contributed something to the kernel in the past. (The mix is even more skewed when you factor in other non-contrib companies like anti-virus vendors).

There's a whole industry of buying/selling vulnerabilities, and our response is basically "oh well, we'll figure it out when an exploit goes public".

Dave Jones


Siemon: Queueing in the Linux Network Stack

Dan Siemon has posted a detailed overview of how the Linux network stack queues packets. "As of Linux 3.6.0 (2012-09-30), the Linux kernel has a new feature called TCP Small Queues which aims to solve this problem for TCP. TCP Small Queues adds a per TCP flow limit on the number of bytes which can be queued in the QDisc and driver queue at any one time. This has the interesting side effect of causing the kernel to push back on the application earlier which allows the application to more effectively prioritize writes to the socket."


Kernel development news

Pondering 2038

By Jonathan Corbet
August 14, 2013
Many LWN readers have been in the field long enough to remember the year-2000 problem, caused by widespread use of two decimal digits to store the year. Said problem was certainly overhyped, but the frantic effort to fix it was also not entirely wasted; plenty of systems would, indeed, have misbehaved had all those COBOL programmers not come out of retirement to fix things up. Part of the problem was that the owners of the affected systems waited until almost too late to address the issue, despite the fact that it was highly predictable and had been well understood decades ahead of time. One would hope that, in the free software world, we would not repeat this history with another, equally predictable problem.

We'll have the opportunity to find out, since one such problem lurks over the horizon. The classic Unix representation for time is a signed 32-bit integer containing the number of seconds since January 1, 1970. This value will overflow on January 19, 2038, less than 25 years from now. One might think that the time remaining is enough to approach a fix in a relaxed manner, and one would be right. But, given the longevity of many installed systems, including hard-to-update embedded systems, there may be less time for a truly relaxed fix than one might think.

It is thus interesting to note that, on August 12, OpenBSD developer Philip Guenther checked in a patch to the OpenBSD system changing the types of most time values to 64-bit quantities. With 64 bits, there is more than enough room to store time values far past the foreseeable future, even if high-resolution (nanosecond-based) time values are used. Once the issues are shaken out, OpenBSD will likely have left the year-2038 problem behind; one could thus argue that they are well ahead of Linux on this score. And perhaps that is true, but there are some good reasons for Linux to proceed relatively slowly with regard to this problem.

The OpenBSD patch changes types like time_t and clock_t to 64-bit quantities. Such changes ripple outward quickly; for example, standard types like struct timeval and struct timespec contain time_t fields, so those structures change as well. The struct stat passed to the stat() system call also contains a set of time_t values. In other words, the changes made by OpenBSD add up to one huge, incompatible ABI change. As a result, OpenBSD kernels with this change will generally not run binaries that predate the change; anybody updating to the new code is advised to do so with a great deal of care.

OpenBSD can do this because it is a self-contained system, with the kernel and user space built together out of a single repository. There is little concern for users with outside binaries; one is expected to update the system as a whole and rebuild programs from source if need be. As a result, OpenBSD developers are much less reluctant to break the kernel ABI than Linux developers are. Indeed, Philip went ahead and expanded ino_t (used to represent inode numbers) as well while he was at it, even though that type is not affected by this problem. As long as users testing this code follow the recommendations and start fresh with a full snapshot, everything will still work. Users attempting to update an installed system will need to be a bit more careful.

In the Linux world, we are unable to simply drag all of user space forward with the kernel, so we cannot make incompatible ABI changes in this way. That is going to complicate the year-2038 transition considerably — all the more reason why it needs to be thought out ahead of time. That said, not all systems are at risk. As a general rule, users of 64-bit systems will not have problems in 2038, since 64-bit values are already the norm on such machines. The 32-bit x32 ABI was also designed with 64-bit time values. So many Linux users are already well taken care of.

But users of the pure 32-bit ABI will run into trouble. Of course, there is a possibility that there will be no 32-bit systems in the wild 25 years from now, but history argues otherwise. Even with its memory addressing limitations (a 32-bit processor with the physical address extension feature will struggle to work with 16GB of memory which, one assumes, will barely be enough to hold a "hello world" program in 2038), a 32-bit system can perform a lot of useful tasks. There may well be large numbers of embedded 32-bit systems running in 2038 that were deployed many years prior. There will almost certainly be 32-bit systems running in 2038 that will need to be made to work properly.

During a brief discussion on the topic last June, Thomas Gleixner described a possible approach to the problem:

If we really want to survive 2038, then we need to get rid of the timespec based representation of time in the kernel altogether and switch all related code over to a scalar nsec 64bit storage. [...]

Though even if we fix that we still need to twist our brains around the timespec/timeval based user space interfaces. That's going to be the way more interesting challenge.

In other words, if a new ABI needs to be created anyway, it would make sense to get rid of structures like timespec (which split times into two fields, representing seconds and nanoseconds) and use a simple nanosecond count. Software could then migrate over to the new system calls at leisure. Thomas suggested keeping the older system call infrastructure in place for five years, meaning that operations using the older time formats would continue to be directly implemented by the kernel; that would prevent unconverted code from suffering performance regressions. After that period passed, the compatibility code would be replaced by wrappers around the new system calls, possibly slowing the emulated calls down and providing an incentive for developers to update their code. Then, after about ten years, the old system calls could be deprecated.
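A scalar 64-bit nanosecond count is essentially what the kernel's internal ktime_t representation already uses on most configurations; the conversion Thomas alludes to is simple arithmetic. A user-space sketch with hypothetical type and helper names:

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Stand-in for a 64-bit timespec: seconds plus nanoseconds. */
struct ts64 { int64_t tv_sec; long tv_nsec; };

/* Collapse the two-field form into one scalar nanosecond count. */
static int64_t ts_to_ns(struct ts64 ts)
{
    return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/* Split a scalar count back out for interfaces that want a timespec. */
static struct ts64 ns_to_ts(int64_t ns)
{
    struct ts64 ts = { ns / NSEC_PER_SEC, (long)(ns % NSEC_PER_SEC) };
    return ts;
}
```

A signed 64-bit nanosecond count covers roughly ±292 years around the epoch, which is why it comfortably outlives any foreseeable deployment.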

Removal of those system calls could be an interesting challenge, though; even Thomas suggested keeping them for 100 years to avoid making Linus grumpy. If the system calls are to be kept up to (and past) 2038, some way will need to be found to make them work in some fashion. John Stultz had an interesting suggestion toward that end: turn time_t into an unsigned value, sacrificing the ability to represent dates before 1970 to gain some breathing room in the future. There are some interesting challenges to deal with, and some software would surely break, but, without a change, all software using 32-bit time_t values will break in 2038. So this change may well be worth considering.

Even without legacy applications to worry about, making 32-bit Linux year-2038 safe would be a significant challenge. The ABI constraints make the job harder yet. Given that some parts of any migration simply cannot be rushed, and given that some deployed systems run for many years, it would make sense to be thinking about a solution to this problem now. Then, perhaps, we'll all be able to enjoy our retirement without having to respond to a long-predicted time_t mess.


KPortReserve and the multi-LSM problem

By Jake Edge
August 14, 2013

Network port numbers are a finite resource, and each port number can only be used by one application at a time. Ensuring that the "right" application gets a particular port number is important because that number is required by remote programs trying to connect to the program. Various methods exist to reserve specific ports, but there are still ways for an application to lose "its" port. Enter KPortReserve, a Linux Security Module (LSM) that allows an administrator to ensure that a program gets its reservation.

One could argue that KPortReserve does not really make sense as an LSM—in fact, Tetsuo Handa asked just that question in his RFC post proposing it. So far, no one has argued that way, and Casey Schaufler took the opposite view, but the RFC has only been posted to the LSM and kernel hardening mailing lists. The level of opposition might rise if and when the patch set heads toward the mainline.

But KPortReserve does solve a real problem. Administrators can ensure that automatic port assignments (i.e. those chosen when the bind() port number is zero) adhere to specific ranges by setting a range or ranges of ports in the /proc/sys/net/ipv4/ip_local_reserved_ports file. But that solution only works for applications that do not choose a specific port number. Programs that do choose a particular port will be allowed to grab it—possibly at the expense of the administrator's choice. Furthermore, if the port number is not in the privileged range (<= 1024), even unprivileged programs can allocate it.

There is at least one existing user-space solution using portreserve, but it still suffers from race conditions. Systemd has a race-free way to reserve ports, but it requires changes to programs that will listen on those ports and is not available everywhere, which is why Handa turned to a kernel-based solution.

The solution itself is fairly straightforward. It provides a socket_bind() method in its struct security_operations to intercept bind() calls, which checks the reserved list. An administrator can write some values to a control file (where, exactly, that control file would live and the syntax it would use were being discussed in the thread) to determine which ports are reserved and what program should be allowed to allocate them. For example:

    echo '10000 /path/to/server' >/path/to/control/file
That would reserve port 10,000 for the server program indicated by the path. A special "<kernel>" string could be used to specify that the port is reserved for kernel threads.

Vasily Kulikov objected to specifying that certain programs could bind the port, rather than a user ID or some LSM security context, but Schaufler disagreed, calling it "very 21st century thinking". His argument is that using unrelated attributes to govern port reservation could interfere with the normal uses of those attributes:

[...] Android used (co-opted, hijacked) the UID to accomplish this. Some (but not all) aspects of SELinux policy in Fedora identify the program and its standing within the system. Both of these systems abuse security attributes that are not intended to identify programs to do just that. This limits the legitimate use of those attributes for their original purpose.

What Tetsuo is proposing is using the information he really cares about (the program) rather than an attribute (UID, SELinux context, Smack label) that can be associated with the program. Further, he is using it in a way that does not interfere with the intended use of UIDs, labels or any other existing security attribute.

Beyond that, Handa noted that all of the programs he is interested in for this feature are likely running as root. While it would seem that root-controlled processes could be coordinated so that they didn't step on each other's ports, there are, evidently, situations where that is not so easy to arrange.

In his initial RFC, Handa wondered if the KPortReserve functionality should simply be added to the Yama LSM. At the 2011 Linux Security Summit, Yama was targeted as an LSM to hold discretionary access control (DAC) enhancements, which port reservations might be shoehorned into—maybe. But, then and since, there has been a concern that Yama not become a "dumping ground" for unrelated security patches. Thus, Schaufler argued, Yama is not the right place for KPortReserve.

However, there is the well-known problem for smaller, targeted LSMs: there is currently no way to have more than one LSM active on any given boot of the system. Handa's interest in Yama may partly be because it has, over time, changed from a "normal" LSM to one that can be unconditionally stacked, which means that it will be called regardless of which LSM is currently active. Obviously, if KPortReserve were added to Yama, it would likewise evade the single-LSM restriction.

But, of course, Schaufler has been working on another way around that restriction for some time now. There have been attempts to stack (or chain or compose) LSMs for nearly as long as they have existed, but none has ever reached the mainline. The latest entrant, Schaufler's "multiple concurrent LSMs" patch set, is now up to version 14. Unlike some earlier versions, any of the existing LSMs (SELinux, AppArmor, TOMOYO, or Smack) can now be arbitrarily combined using the technique. One would guess it wouldn't be difficult to incorporate a single-hook LSM like KPortReserve into the mix.

While there was some discussion of Schaufler's patches when they were posted at the end of July—and no objections to the idea—it still is unclear when (or if) we will see this capability in a mainline kernel. One senses that we are getting closer to that point, and new single-purpose LSM ideas crop up fairly regularly, but we aren't there yet. Schaufler will be presenting his ideas at the Linux Security Summit in September. Perhaps the discussion there will help clarify the future of this feature.


Optimizing preemption

By Jonathan Corbet
August 14, 2013
The kernel's lowest-level primitives can be called thousands of times (or more) every second, so, as one might expect, they have been ruthlessly optimized over the years. To do otherwise would be to sacrifice some of the system's performance needlessly. But, as it happens, hard-won performance can slip away over the years as the code is changed and gains new features. Often, such performance loss goes unnoticed until a developer decides to take a closer look at a specific kernel subsystem. That would appear to have just happened with regard to how the kernel handles preemption.

User-space access and voluntary preemption

In this case, things got started when Andi Kleen decided to make the user-space data access routines — copy_from_user() and friends — go a little faster. As he explained in the resulting patch set, those functions were once precisely tuned for performance on x86 systems. But then they were augmented with calls to functions like might_sleep() and might_fault(). These functions initially served in a debugging role; they scream loudly if they are called in a situation where sleeping or page faults are not welcome. Since these checks are for debugging, they can be turned off in a production kernel, so the addition of these calls should not affect performance in situations where performance really matters.

But, then, in 2004, core kernel developers started to take latency issues a bit more seriously, and that led to an interest in preempting execution of kernel code if a higher-priority process needed the CPU. The problem was that, at that time, it was not exactly clear when it would be safe to preempt a thread in kernel space. But, as Ingo Molnar and Arjan van de Ven noticed, calls to might_sleep() were, by definition, placed in locations where the code was prepared to sleep. So a might_sleep() call had to be a safe place to preempt a thread running in kernel mode. The result was the voluntary preemption patch set, adding a limited preemption mode that is still in use today.

The problem, as Andi saw it, is that this change turned might_sleep() and might_fault() into a part of the scheduler; it is no longer compiled out of a kernel if voluntary preemption is enabled. That, in turn, has slowed down user-space access functions by (on his system) about 2.5µs for each call. His patch set does a few things to try to make the situation better. Some functions (should_resched(), which is called from might_sleep(), for example) are marked __always_inline to remove the function calling overhead. A new might_fault_debug_only() function goes back to the original intent of might_fault(); it disappears entirely when it is not needed. And so on.

Linus had no real objection to these patches, but they clearly raised a couple of questions in his mind. One of his first comments was a suggestion that, rather than optimizing the might_fault() call in functions like copy_from_user(), it would be better to omit the check altogether. Voluntary preemption points are normally used to switch between kernel threads when an expensive operation is being performed. If a user-space access succeeds without faulting, it is not expensive at all; it is really just another memory fetch. If, instead, it causes a page fault, there will already be opportunities for preemption. So, Linus reasoned, there is little point in slowing down user-space accesses with additional preemption checks.

The problem with full preemption

To this point, the discussion was mostly concerned with voluntary preemption, where a thread running in the kernel can lose access to the processor, but only at specific spots. But the kernel also supports "full preemption," which allows preemption almost anywhere that preemption has not been explicitly disabled. In the early days of kernel preemption, many users shied away from the full preemption option, fearing subtle bugs. They may have been right at the time, but, in the intervening years, the fully preemptible kernel has become much more solid. Years of experience, helped by tools like the locking validator, can work wonders that way. So there is little reason to be afraid to enable full preemption at this point.

With that history presumably in mind, H. Peter Anvin entered the conversation with a question: should voluntary preemption be phased out entirely in favor of full kernel preemption? It turns out that there is still one reason to avoid turning on full preemption: as Mike Galbraith put it, "PREEMPT munches throughput." Complaints about the cost of full preemption have been scarce over the years, but, evidently, it does hurt in some cases. As long as there is a performance penalty to the use of full preemption, it is going to be hard to convince throughput-oriented users to switch to it.

There would not seem to be any fundamental reason why full preemption should adversely affect throughput. If the rate of preemption were high, there could be some associated cache effects, but preemption should be a relatively rare event in a throughput-sensitive system. That suggests that something else is going on. A clue about that "something else" can be found in Linus's observation that the testing of the preemption count — which happens far more often in a fully preemptible kernel — is causing the compiler to generate slower code.

The thing is, even if that is almost never taken, just the fact that there is a conditional function call very often makes code generation *much* worse. A function that is a leaf function with no stack frame with no preemption often turns into a non-leaf function with stack frames when you enable preemption, just because it had a RCU read region which disabled preemption.

So configuring full preemption into the kernel can make performance-sensitive code slower. Users who are concerned about latency may well be willing to make that tradeoff, but those who want throughput will not be so agreeable. The good news is that it might be possible to do something about this problem and keep both camps happy.

Optimizing full preemption

The root of the problem is accesses to the variable known as the "preemption count," which can be found in the thread_info structure, which, in turn, lives at the bottom of the kernel stack. It is not just a counter, though; instead it is a 32-bit quantity that has been divided up into several subfields:

  • The actual preemption count, indicating how many times kernel code has disabled preemption. This counter allows calls like preempt_disable() to be nested and still do the right thing (eight bits).

  • The software interrupt count, indicating how many nested software interrupts are being handled at the moment (eight bits).

  • The hardware interrupt count (ten bits on most architectures).

  • The PREEMPT_ACTIVE bit indicating that the current thread is being (or just has been) preempted.

This may seem like a complicated combination of fields, but it has one useful feature: the preemptability of the currently-running thread can be tested by comparing the entire preemption count against zero. If any of the counters has been incremented (or the PREEMPT_ACTIVE bit set), preemption will be disabled.
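The layout can be sketched as shift-and-mask arithmetic; the widths below follow the article, though the exact constants vary by kernel version and architecture:

```c
#include <stdint.h>

#define PREEMPT_SHIFT  0
#define PREEMPT_BITS   8          /* nested preempt_disable() calls */
#define SOFTIRQ_SHIFT  (PREEMPT_SHIFT + PREEMPT_BITS)   /* 8  */
#define SOFTIRQ_BITS   8          /* nested software interrupts     */
#define HARDIRQ_SHIFT  (SOFTIRQ_SHIFT + SOFTIRQ_BITS)   /* 16 */
#define HARDIRQ_BITS   10         /* nested hardware interrupts     */
#define PREEMPT_ACTIVE (1u << 30) /* thread is being preempted      */

/* Increments for each subfield. */
#define PREEMPT_ONE    (1u << PREEMPT_SHIFT)
#define SOFTIRQ_ONE    (1u << SOFTIRQ_SHIFT)
#define HARDIRQ_ONE    (1u << HARDIRQ_SHIFT)

/* The single useful test: everything zero means preemption is allowed. */
static int preemptible(uint32_t count)
{
    return count == 0;
}
```

For example, two nested preempt_disable() calls plus one software interrupt yield a count of 0x102, and preemptible() stays false until every increment is undone.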

It seems that the cost of testing this count might be reduced significantly with some tricky assembly language work; that is being hashed out as of this writing. But there's another aspect of the preemption count that turns out to be costly: its placement in the thread_info structure. The location of that structure must be derived from the kernel stack pointer, making the whole test significantly more expensive.
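The derivation in question is the classic mask-the-stack-pointer trick: thread_info sits at the base of the current kernel stack, so any stack address can be rounded down to find it. A sketch with a made-up structure and the 8KB stack size traditional on 32-bit x86:

```c
#include <stdint.h>

#define THREAD_SIZE 8192u   /* kernel stack size; must be a power of two */

/* Made-up stand-in for the real struct thread_info. */
struct fake_thread_info { int preempt_count; };

/* thread_info lives at the base of the stack, so masking off the low
 * bits of any stack address inside it recovers the structure's address. */
static uintptr_t thread_info_base(uintptr_t sp)
{
    return sp & ~((uintptr_t)THREAD_SIZE - 1);
}
```

Every preemption-count test must first perform this mask and an extra dereference; a per-CPU variable is reachable in a single addressing step, which is the saving Peter's patch is after.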

The important realization here is that there is (almost) nothing about the preemption count that is specific to any given thread. It will be zero for every non-executing thread; and no executing thread will be preempted if the count is nonzero. It is, in truth, more of an attribute of the CPU than of the running process. And that suggests that it would be naturally stored as a per-CPU variable. Peter Zijlstra has posted a patch that changes things in just that way. The patch turned out to be relatively straightforward; the only twist is that the PREEMPT_ACTIVE flag, being a true per-thread attribute, must be saved in the thread_info structure when preemption occurs.

Peter's first patch didn't quite solve the entire problem, though: there is still the matter of the TIF_NEED_RESCHED flag that is set in the thread_info structure when kernel code (possibly running in an interrupt handler or on another CPU) determines that the currently-running task should be preempted. That flag must be tested whenever the preemption count returns to zero, and in a number of other situations as well; as long as that test must be done, there will still be a cost to enabling full preemption.

Naturally enough, Linus has a solution to this problem in mind as well. The "need rescheduling" flag would move to the per-CPU preemption count as well, probably in the uppermost bit. That raises an interesting problem, though. The preemption count, as a per-CPU variable, can be manipulated without locks or the use of expensive atomic operations. This new flag, though, could well be set by another CPU entirely; putting it into the preemption count would thus wreck that count's per-CPU nature. But Linus has a scheme for dancing around this problem. The "need rescheduling" flag would only be changed using atomic operations, but the remainder of the preemption count would be updated locklessly as before.

Mixing atomic and non-atomic operations is normally a way to generate headaches for everybody involved. In this case, though, things might just work out. The use of atomic operations for the "need rescheduling" bit means that any CPU can set that bit without corrupting the counters. On the other hand, when a CPU changes its preemption count, there is a small chance that it will race with another CPU that is trying to set the "need rescheduling" flag, causing that flag to be lost. That, in turn, means that the currently executing thread will not be preempted when it should be. That result is unfortunate, in that it will increase latency for the higher-priority task that is trying to run, but it will not generate incorrect results. It is a minor bit of sloppiness that the kernel can get away with if the performance benefits are large enough.
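The mixed scheme can be sketched with C11 atomics; NEED_RESCHED_BIT and the helper names are illustrative, not the kernel's:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NEED_RESCHED_BIT 0x80000000u

/* One CPU's preemption word; only that CPU updates the counter bits. */
static _Atomic uint32_t preempt_word;

/* Remote CPU: set the flag with an atomic RMW, preserving the counters. */
static void set_need_resched(void)
{
    atomic_fetch_or_explicit(&preempt_word, NEED_RESCHED_BIT,
                             memory_order_relaxed);
}

/* Owning CPU: a plain load/store pair, cheap but racy -- a flag set by
 * another CPU between the load and the store is silently overwritten. */
static void local_preempt_disable(void)
{
    uint32_t v = atomic_load_explicit(&preempt_word, memory_order_relaxed);
    atomic_store_explicit(&preempt_word, v + 1, memory_order_relaxed);
}

static uint32_t read_word(void)
{
    return atomic_load_explicit(&preempt_word, memory_order_relaxed);
}
```

The comment in local_preempt_disable() marks exactly the lost-flag window described above: losing the flag delays a preemption rather than corrupting anything, which is why the scheme was considered tolerable at all.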

In this case, though, there appears to be a better solution to the problem. Peter came back with an alternative approach that keeps the TIF_NEED_RESCHED flag in the thread_info structure, but also adds a copy of that flag in the preemption count. In current kernels, when the kernel sets TIF_NEED_RESCHED, it also signals an inter-processor interrupt (IPI) to inform the relevant CPU that preemption is required. Peter's patch makes the IPI handler copy the flag from the thread_info structure to the per-CPU preemption count; since that copy is done by the processor that owns the count variable, the per-CPU nature of that count is preserved and the race conditions go away. As of this writing, that approach seems like the best of all worlds — fast testing of the "need rescheduling" flag without race conditions.

Needless to say, this kind of low-level tweaking needs to be done carefully and well benchmarked. It could be that, once all the details are taken care of, the performance gained does not justify the trickiness and complexity of the changes. So this work is almost certainly not 3.12 material. But, if it works out, it may be that much of the throughput cost associated with enabling full preemption will go away, with the eventual result that the voluntary preemption mode could be phased out.


Patches and updates

Kernel trees

  • Sebastian Andrzej Siewior: 3.10.6-rt3. (August 12, 2013)

Build system

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds