Brief items
The current development kernel is 3.11-rc5, released on August 11.
Linus said: "Sadly, the numerology doesn't quite work out, and while
releasing the final 3.11 today would be a lovely coincidence (Windows
3.11 was released twenty years ago today), it is not to be. Instead, we
have 3.11-rc5." Along with the usual fixes, this prepatch contains the
linkat() permissions change discussed in the August 8 Kernel Page.
Stable updates: 3.10.6,
3.4.57, and 3.0.90 were released on August 11.
The 3.10.7,
3.4.58, and
3.0.91 updates are in the review process as
of this writing; they can be expected sometime on or after August 15.
All companies end up in the Open Source Internet Beam Of Hate at
some point or another, not always for good reason. I've felt that
heat myself a few times in the last few years, I know all too well
what it's like to be hated by the people you're trying to help.
— Jean-Baptiste Quéru
One of the properties that π is conjectured to have is that it is
normal, which is to say that its digits are all distributed evenly,
with the implication that it is a disjunctive sequence, meaning
that all possible finite sequences of digits will be present
somewhere in it. If we consider π in base 16 (hexadecimal), it is
trivial to see that if this conjecture is true, then all possible
finite files must exist within π. The first record of this
observation dates back to 2001.
From here, it is a small leap to see that if π contains all
possible files, why are we wasting exabytes of space storing those
files, when we could just look them up in π!
— The π filesystem
We seem to have reached the point in kernel development where
"security" is the magic word to escape from any kind of due process
(it is, in fact, starting to be used in much the same way the
phrase "war on terror" is used to abrogate due process usually
required by the US constitution).
— James Bottomley
It's disturbing to me that there are almost as many addresses from
people like Lockheed Martin, Raytheon Missile, various govt
agencies from various countries with access to the coverity db as
there are people who actually have contributed something to the
kernel in the past. (The mix is even more skewed when you factor in
other non-contrib companies like anti-virus vendors).
There's a whole industry of buying/selling vulnerabilities, and our
response is basically "oh well, we'll figure it out when an exploit
goes public".
— Dave Jones
Dan Siemon has posted a
detailed overview of how the Linux network stack queues packets.
"As of Linux 3.6.0 (2012-09-30), the Linux kernel has a new feature
called TCP Small Queues which aims to solve this problem for TCP. TCP Small
Queues adds a per TCP flow limit on the number of bytes which can be queued
in the QDisc and driver queue at any one time. This has the interesting
side effect of causing the kernel to push back on the application earlier
which allows the application to more effectively prioritize writes to the
socket."
Kernel development news
By Jonathan Corbet
August 14, 2013
Many LWN readers have been in the field long enough to remember the
year-2000 problem, caused by widespread use of two decimal digits to store
the year. Said problem was certainly overhyped, but the frantic effort to
fix it was also not entirely wasted; plenty of systems would, indeed, have
misbehaved had all those COBOL programmers not come out of retirement to
fix things up. Part of the problem was that the owners of the affected systems
waited until almost too late to address the issue, despite the fact that it
was highly predictable and had been well understood decades ahead of time.
One would hope that, in the free software world, we would not repeat this
history with another, equally predictable problem.
We'll have the opportunity to find out, since one such problem lurks over
the horizon. The classic Unix representation for time is a signed 32-bit
integer containing the number of seconds since January 1, 1970. This
value will overflow on January 19, 2038, less than 25 years from now.
One might think that the time remaining is enough to approach a fix in a
relaxed manner, and one would be right. But, given the longevity of many
installed systems, including hard-to-update embedded systems, there may be
less time for a truly relaxed fix than one might think.
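The arithmetic is easy to check. Here is a minimal C illustration (our
own, not from any of the discussions covered here) that, when run on a
system whose native time_t is already 64 bits wide, prints the last
second a signed 32-bit counter can represent:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        /* The largest value a signed 32-bit time counter can hold. */
        time_t last = INT32_MAX;                /* 2147483647 seconds */
        /* One second later, a 32-bit time_t wraps negative. */
        int32_t wrapped = (int32_t)((int64_t)INT32_MAX + 1);

        /* Prints "Tue Jan 19 03:14:07 2038" on a 64-bit time_t system. */
        printf("last representable second: %s", asctime(gmtime(&last)));
        printf("one second later, 32 bits hold: %" PRId32 "\n", wrapped);
        return 0;
    }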
It is thus interesting to note that, on August 12, OpenBSD developer Philip
Guenther checked in a patch to the OpenBSD
system changing the types of most time values to 64-bit quantities. With
64 bits, there is more than enough room to store time values far past
the foreseeable future, even if high-resolution (nanosecond-based) time
values are used. Once the issues are shaken out, OpenBSD will likely have
left the year-2038 problem behind; one could thus argue that they are well
ahead of Linux on this score. And perhaps that is true, but there are some
good reasons for Linux to proceed relatively slowly with regard to this
problem.
The OpenBSD patch changes types like time_t and clock_t
to 64-bit quantities. Such changes ripple outward quickly; for example,
standard types like struct timeval and
struct timespec contain time_t fields, so those
structures change as well. The struct stat passed to the
stat() system call also contains a set of time_t values.
In other words, the changes made by OpenBSD add up to one huge,
incompatible ABI change. As a
result, OpenBSD kernels with this change will generally not run binaries
that predate the change; anybody updating to the new code is advised to do
so with a great deal of care.
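A simple sketch shows why such changes are ABI-breaking. These are not
OpenBSD's actual definitions, just two illustrative versions of the same
structure; once the time field is widened, the two sides no longer even
agree on its size:

    #include <stdint.h>
    #include <stdio.h>

    /* Old ABI: 32-bit time field (the real structures use long for
       tv_nsec; int32_t keeps this illustration simple). */
    struct timespec_old {
        int32_t tv_sec;
        int32_t tv_nsec;
    };

    /* New ABI: the time field widened to 64 bits. */
    struct timespec_new {
        int64_t tv_sec;
        int32_t tv_nsec;
    };

    int main(void)
    {
        /* Prints 8 for the old layout and 12 or 16 (depending on
           alignment rules) for the new one: a binary built against one
           layout cannot exchange these with a kernel using the other. */
        printf("old: %zu bytes, new: %zu bytes\n",
               sizeof(struct timespec_old), sizeof(struct timespec_new));
        return 0;
    }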
OpenBSD can do this because it is a self-contained system, with the kernel
and user space built together out of a single repository. There is little
concern for users with outside binaries; one is expected to update the
system as a whole and rebuild programs from source if need be. As a
result, OpenBSD developers are much less reluctant to break the kernel ABI
than Linux developers are. Indeed, Philip went ahead and expanded
ino_t (used to represent inode numbers) as well while he was at
it, even though that type is not affected by this problem. As long as
users testing this code follow the recommendations and start fresh with a full
snapshot, everything will still work. Users attempting to update an
installed system will need
to be a bit more careful.
In the Linux world, we are unable to simply drag all of user space forward
with the kernel, so we cannot make incompatible ABI changes in this way.
That is going to complicate the year-2038 transition considerably — all the
more reason why it needs to be thought out ahead of time. That said, not
all systems are at risk. As a general
rule, users of 64-bit systems will not have problems in 2038, since 64-bit
values are already the norm on such machines. The 32-bit x32 ABI was
also designed to use 64-bit time values. So many Linux users are
already well taken care of.
But users of the pure 32-bit ABI will run into trouble. Of course, there
is a possibility that there
will be no 32-bit systems in the wild 25 years from now, but history argues
otherwise. Even with its memory addressing limitations (a 32-bit processor
with the physical address extension feature will struggle to work with 16GB of
memory which, one assumes, will barely be enough to hold a "hello world"
program in 2038), a 32-bit system can perform a lot of useful tasks. There
may well be large numbers of embedded 32-bit systems running in 2038 that
were deployed many years prior. There will almost certainly be 32-bit
systems running in 2038 that will need to be made to work properly.
During a brief discussion on the topic last June, Thomas Gleixner described a possible approach to the problem:
If we really want to survive 2038, then we need to get rid of the
timespec based representation of time in the kernel altogether and
switch all related code over to a scalar nsec 64bit storage. [...]
Though even if we fix that we still need to twist our brains around
the timespec/timeval based user space interfaces. That's going to
be the way more interesting challenge.
In other words, if a new ABI needs to be created anyway, it would make
sense to get rid of structures like timespec (which split times
into two fields, representing seconds and nanoseconds) and use a simple
nanosecond count. Software could then migrate over to the new system calls
at leisure. Thomas suggested keeping the
older system call infrastructure in place for five years, meaning that
operations using the older time formats would continue to be directly
implemented by the kernel; that would prevent unconverted code from
suffering performance regressions. After that period passed, the
compatibility code would be replaced by wrappers around the new system
calls, possibly slowing the emulated calls down and providing an incentive for
developers to update their code. Then, after about ten years, the old
system calls could be deprecated.
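The kernel already uses a scalar nanosecond count internally in the form
of ktime_t. The sketch below, with a hypothetical conversion helper,
shows the shape of the idea and why 64 bits suffice: the representable
range is roughly 292 years on either side of the epoch.

    #include <stdint.h>
    #include <stdio.h>

    #define NSEC_PER_SEC 1000000000LL

    /* Hypothetical scalar time type in the spirit of the kernel's
       ktime_t. */
    typedef int64_t nsec_t;

    /* Collapse a split seconds/nanoseconds pair into one scalar. */
    static nsec_t timespec_to_nsec(int64_t sec, int32_t nsec)
    {
        return sec * NSEC_PER_SEC + nsec;
    }

    int main(void)
    {
        /* 2^63 - 1 nanoseconds is about 292 years. */
        long long years = INT64_MAX / NSEC_PER_SEC / 60 / 60 / 24 / 365;
        printf("64-bit nanoseconds cover about %lld years\n", years);
        printf("the 2038 rollover moment: %lld ns\n",
               (long long)timespec_to_nsec(2147483647LL, 999999999));
        return 0;
    }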
Removal of those system calls could be an interesting challenge, though;
even Thomas suggested keeping them for 100 years to avoid making Linus
grumpy. If the system calls are to be kept up to (and past) 2038, some way
will need to be found to make them work in some fashion. John Stultz had
an interesting suggestion toward that end:
turn time_t into an unsigned value, sacrificing the ability to
represent dates before 1970 to gain some breathing room in the future.
There are some interesting challenges to deal with, and some software would
surely break, but, without a change, all software using 32-bit
time_t values will break in 2038. So this change may well be
worth considering.
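The gain from that change is easily quantified: the same 32 bits,
treated as unsigned, reach another 68 years into the future. A quick
illustration along the lines of the earlier example:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        /* Reinterpret the 32 bits as unsigned; compute in a 64-bit
           time_t so that gmtime() can represent the result. */
        time_t last = (time_t)UINT32_MAX;       /* 4294967295 seconds */

        /* Prints a date early in 2106 on a 64-bit time_t system. */
        printf("unsigned 32-bit time lasts until: %s",
               asctime(gmtime(&last)));
        return 0;
    }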
Even without legacy applications to worry about, making 32-bit Linux
year-2038 safe would be a significant challenge. The ABI constraints make
the job harder yet. Given that some parts of any migration simply cannot
be rushed, and given that some deployed systems run for many years, it
would make sense to be thinking about a solution to this problem now.
Then, perhaps, we'll all be able to enjoy our retirement without having to
respond to a long-predicted time_t mess.
By Jake Edge
August 14, 2013
Network port numbers are a finite resource, and each port number can only
be used by one application at a time. Ensuring that the "right"
application gets a particular port number is important because that
number is required by remote programs trying to connect to the program.
Various methods exist to reserve specific ports, but there are still ways
for an application to lose "its" port. Enter KPortReserve, a Linux Security Module (LSM)
that allows an administrator to ensure that a program gets its reservation.
One could argue that KPortReserve does not really make sense as an LSM—in
fact, Tetsuo Handa asked just that question in his RFC post proposing it.
So far, no one has argued that way, and Casey Schaufler took the opposite view, but the RFC has only
been posted to the LSM and kernel hardening mailing lists. The level of
opposition might rise if and when the patch set heads toward the mainline.
But KPortReserve does solve a real problem. Administrators can keep
automatic port assignments (i.e. those made when the bind() port
number is zero) away from specific ports by writing a range or ranges of
ports to the /proc/sys/net/ipv4/ip_local_reserved_ports file. But
that solution only works for
applications that do not choose a specific port number. Programs that do
choose a particular port will be allowed to grab it—possibly at the expense
of the
administrator's choice. Furthermore, if the port number is not in the
privileged range (< 1024), even unprivileged programs can allocate it.
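The hole is easy to demonstrate. An ordinary, unprivileged process that
names a specific port in its bind() call sails right past the reserved
ranges, along these lines:

    #include <arpa/inet.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        /* A specific, unprivileged port: ip_local_reserved_ports only
           governs the port == 0 (automatic assignment) case, so it
           does not affect this request at all. */
        addr.sin_port = htons(10000);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            printf("port 10000 grabbed, no privilege required\n");
        else
            perror("bind");
        close(fd);
        return 0;
    }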
There is at least one existing
user-space solution using portreserve, but it still suffers from race
conditions. Systemd has a race-free way to reserve ports, but it requires
changes to programs that will listen on those ports and is not
available everywhere, which is why Handa turned to a kernel-based solution.
The solution itself is fairly straightforward. It provides a
socket_bind() method in its struct security_operations that
intercepts bind() calls and checks them against the reserved list. An
administrator can write
some values to a control file (where, exactly, that control file
would live and the syntax it would use were being discussed in the thread) to
determine which ports are reserved and what program should be allowed to
allocate them. For example:
echo '10000 /path/to/server' >/path/to/control/file
That would restrict port 10,000 to only being used by the server program
indicated by the path. A special "<kernel>" string could be used to
specify that the port is reserved for kernel threads.
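In LSM terms, the mechanism amounts to a single hook. The sketch below
uses the socket_bind() hook signature from struct security_operations;
kpr_find_owner() and kpr_current_exe_is() are hypothetical helpers
standing in for the patch's actual table lookup and program-path check:

    /* Sketch of a socket_bind() hook.  kpr_find_owner() and
     * kpr_current_exe_is() are hypothetical helpers, not taken from
     * the actual patch set. */
    static int kportreserve_socket_bind(struct socket *sock,
                                        struct sockaddr *address,
                                        int addrlen)
    {
        struct sockaddr_in *sin = (struct sockaddr_in *)address;
        const char *owner;

        if (address->sa_family != AF_INET || addrlen < (int)sizeof(*sin))
            return 0;                       /* not our concern */

        owner = kpr_find_owner(ntohs(sin->sin_port));
        if (owner == NULL)
            return 0;                       /* port is not reserved */

        /* Deny the bind unless the caller is the reserved program. */
        return kpr_current_exe_is(owner) ? 0 : -EACCES;
    }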
Vasily Kulikov
objected to
specifying that certain programs could bind the port, rather than a user ID
or some LSM security context, but Schaufler disagreed, calling it "very 21st century
thinking". His argument is that using unrelated attributes to
govern port reservation could interfere with the normal uses of those
attributes:
[...] Android used (co-opted, hijacked) the
UID to accomplish this. Some (but not all) aspects of SELinux policy
in Fedora identify the program and its standing within the system.
Both of these systems abuse security attributes that are not intended
to identify programs to do just that. This limits the legitimate use
of those attributes for their original purpose.
What Tetsuo is proposing is using the information he really cares
about (the program) rather than an attribute (UID, SELinux context,
Smack label) that can be associated with the program. Further, he
is using it in a way that does not interfere with the intended use
of UIDs, labels or any other existing security attribute.
Beyond that, Handa noted that all of the
programs he is interested in for this feature are likely running as root.
While it would seem that root-controlled processes could be coordinated so
that they didn't step on each other's ports, there are, evidently,
situations where that is not so easy to arrange.
In his initial RFC, Handa wondered if the KPortReserve functionality should
simply be added to the Yama LSM. At the 2011 Linux Security Summit, Yama
was targeted as an LSM to hold
discretionary access control (DAC) enhancements, which port reservations
might be shoehorned into—maybe. But, then and since, there has been a
concern that Yama not become a "dumping ground" for unrelated
security patches. Thus, Schaufler argued, Yama is not the right place for
KPortReserve.
However, there is the well-known problem
for smaller, targeted LSMs: there is
currently no way to have more than one LSM active on any given boot of
the system. Handa's interest in Yama may partly be because it has, over
time, changed from a "normal" LSM to one that can be unconditionally
stacked, which means that it will be called regardless of which LSM is
currently active. Obviously, if KPortReserve were added to Yama, it would
likewise evade the single-LSM restriction.
But, of course, Schaufler has been working on another way around that
restriction for some time now. There have been attempts to stack (or chain
or compose) LSMs for nearly as long as they have existed, but none has ever
reached the mainline. The latest entrant, Schaufler's "multiple
concurrent LSMs" patch set, is now up to version 14. Unlike some
earlier versions, any of the existing LSMs (SELinux, AppArmor, TOMOYO, or
Smack) can now be arbitrarily combined using the technique. One would
guess it wouldn't be difficult to incorporate a single-hook LSM like
KPortReserve into the mix.
While there was some discussion of Schaufler's patches when they were
posted at the end of July—and no objections to the idea—it still is unclear
when (or if) we will see this capability in a mainline kernel. One senses
that we are getting closer to that point, and new single-purpose LSM ideas
crop up fairly regularly, but we aren't there yet. Schaufler will be
presenting his ideas at the Linux
Security Summit in September. Perhaps the discussion there will help
clarify the future of this feature.
By Jonathan Corbet
August 14, 2013
The kernel's lowest-level primitives can be called thousands of times (or
more) every second, so, as one might expect, they have been ruthlessly
optimized over the years. To do otherwise would be to sacrifice some of
the system's performance needlessly. But, as it happens, hard-won
performance can slip away over the years as the code is changed and gains
new features. Often, such performance loss goes unnoticed until a
developer decides to take a closer look at a specific kernel subsystem.
That would appear to have just happened with regard to how the kernel
handles preemption.
User-space access and voluntary preemption
In this case, things got started when Andi Kleen decided to make the
user-space data access routines — copy_from_user() and friends —
go a little faster. As he explained in the
resulting patch set, those functions were once precisely tuned
for performance on x86 systems. But then they were augmented with calls to
functions like might_sleep() and might_fault(). These
functions initially served in a debugging role; they scream loudly if they are
called in a situation where sleeping or page faults are not welcome. Since
these checks are for debugging, they can be turned off in a production kernel,
so the addition of these calls should not affect performance in situations
where performance really matters.
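In their pure debugging role, the checks compile down to nothing in a
production build; a condensed rendition of the arrangement (the real
macros carry a bit more information) looks like:

    #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
    /* Debug build: complain loudly if we might sleep in atomic context. */
    # define might_sleep() __might_sleep(__FILE__, __LINE__, 0)
    #else
    /* Production build: no code is generated at all. */
    # define might_sleep() do { } while (0)
    #endif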
But, then, in 2004, core kernel developers started to take latency issues a
bit more seriously, and that led to an interest in preempting execution of
kernel code if a higher-priority process needed the CPU. The problem was
that, at that time, it was not exactly clear when it would be safe to preempt
a thread in kernel space. But, as Ingo Molnar and Arjan van de Ven
noticed, calls to might_sleep() were, by definition,
placed in locations where the code was prepared to sleep. So any
might_sleep() call site had to be a safe place to preempt a thread
running in kernel mode. The result was the voluntary preemption patch set, adding a
limited preemption mode that is still in use today.
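After that change, the macro is no longer a pure debugging aid. Roughly
following include/linux/kernel.h of this period, a voluntary-preemption
kernel expands every might_sleep() into a potential scheduling point:

    #ifdef CONFIG_PREEMPT_VOLUNTARY
    /* Voluntary preemption: might_sleep() sites double as
     * rescheduling opportunities. */
    # define might_resched() _cond_resched()
    #else
    # define might_resched() do { } while (0)
    #endif

    # define might_sleep() \
        do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)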
The problem, as Andi saw it, is that this change turned
might_sleep() and might_fault() into a part of the
scheduler; it is no longer
compiled out of a kernel if voluntary preemption is enabled. That, in
turn, has slowed down user-space access functions by (on his system) about
2.5µs for each call. His patch set does a few things to try to make the
situation better. Some functions (should_resched(), which is
called from might_sleep(), for example) are
marked __always_inline to remove the function calling overhead.
A new might_fault_debug_only() function goes back to the original
intent of might_fault(); it disappears entirely when it is not
needed. And so on.
Linus had no real objection to these patches, but they clearly raised a
couple of questions in his mind. One of his first comments was a suggestion that, rather than optimizing the
might_fault() call in functions like copy_from_user(), it
would be better to omit the check altogether. Voluntary preemption points are
normally used to switch between kernel threads when an expensive operation
is being performed. If a user-space access succeeds without faulting, it
is not expensive at all; it is really just another memory fetch. If,
instead, it causes a page fault, there will already be opportunities for
preemption. So, Linus reasoned, there is little point in slowing down
user-space accesses with additional preemption checks.
The problem with full preemption
To this point, the discussion was mostly concerned about voluntary
preemption, where a thread running in the kernel can lose access to the
processor, but only at specific spots. But the kernel also supports "full
preemption," which allows preemption almost anywhere that preemption has
not been explicitly disabled.
In the early days of kernel preemption, many users shied away
from the full preemption option, fearing subtle bugs. They may have been
right at the time, but, in the intervening years, the fully preemptible
kernel has become much more solid. Years of experience, helped by tools
like the locking validator, can work wonders that way. So there is little
reason to be afraid to enable full preemption at this point.
With that history presumably in mind, H. Peter Anvin entered the
conversation with a question: should
voluntary preemption be phased out entirely in favor of full kernel
preemption?
It turns out that there is still one reason to avoid turning on full
preemption: as Mike Galbraith put it, "PREEMPT munches
throughput." Complaints about the cost of full preemption have been
scarce over the years, but, evidently, it does hurt in some cases. As long
as there is a performance penalty to the use of full preemption, it is
going to be hard to convince throughput-oriented users to switch to it.
There would not seem to be any fundamental reason why full preemption
should adversely affect throughput. If the rate of preemption were high, there
could be some associated cache effects, but preemption should be a
relatively rare event in
a throughput-sensitive system. That suggests that something else is going
on. A clue about that "something else" can be found in Linus's observation that the testing of
the preemption count — which happens far more often in a fully preemptible
kernel — is causing the compiler to generate slower code.
The thing is, even if that is almost never taken, just the fact
that there is a conditional function call very often makes code
generation *much* worse. A function that is a leaf function with no
stack frame with no preemption often turns into a non-leaf function
with stack frames when you enable preemption, just because it had a
RCU read region which disabled preemption.
So configuring full preemption into the kernel can make
performance-sensitive code slower. Users who are concerned about latency may
well be willing to make that tradeoff, but those who want throughput will
not be so agreeable. The
good news is that it might be possible to do something about this problem
and keep both camps happy.
Optimizing full preemption
The root of the problem is accesses to the variable known as the
"preemption count," which can be found in the
thread_info structure, which, in turn
lives at the bottom of the kernel stack. It is not just a counter, though;
instead it is a 32-bit quantity that has been divided up into several
subfields:
- The actual preemption count, indicating how many times kernel code has
disabled preemption. This counter allows calls like
preempt_disable() to be nested and still do the right thing
(eight bits).
- The software interrupt count, indicating how many nested software
interrupts are being handled at the moment (eight bits).
- The hardware interrupt count (ten bits on most architectures).
- The PREEMPT_ACTIVE bit indicating that the current thread
is being (or just has been) preempted.
This may seem like a complicated combination of fields, but it has one
useful feature: the preemptability of the currently-running thread can be
tested by comparing the entire preemption count against zero. If any of
the counters has been incremented (or the PREEMPT_ACTIVE bit set),
preemption will be disabled.
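In C terms, the layout and the test look roughly like this (loosely
patterned on include/linux/hardirq.h of this era; the PREEMPT_ACTIVE
value varies by architecture):

    /* Subfields of the 32-bit preemption count (sketch). */
    #define PREEMPT_BITS    8       /* preempt_disable() nesting depth */
    #define SOFTIRQ_BITS    8       /* nested software interrupts      */
    #define HARDIRQ_BITS    10      /* nested hardware interrupts      */

    #define PREEMPT_SHIFT   0
    #define SOFTIRQ_SHIFT   (PREEMPT_SHIFT + PREEMPT_BITS)
    #define HARDIRQ_SHIFT   (SOFTIRQ_SHIFT + SOFTIRQ_BITS)

    #define PREEMPT_ACTIVE  0x10000000  /* thread is being preempted */

    /* The payoff: every reason to forbid preemption is visible in a
     * single comparison against zero. */
    static inline int preemption_allowed(unsigned int count)
    {
        return count == 0;
    }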
It seems that the cost of testing this count might be reduced significantly
with some tricky assembly language work; that is being hashed out as of
this writing. But there's another aspect of the preemption count that
turns out to be costly: its placement in the thread_info
structure. The location of that structure must be derived from the kernel
stack pointer, making the whole test significantly more expensive.
The important realization here is that there is (almost) nothing about the
preemption count that is specific to any given thread. It will be zero for
every non-executing thread, and no executing thread will be preempted if
the count is nonzero. It is, in truth, more of an attribute of the CPU
than of the running process. And that suggests that it would be naturally
stored as a per-CPU variable. Peter Zijlstra has posted a patch that changes things in just that way.
The patch turned out to be relatively straightforward; the only twist is
that the PREEMPT_ACTIVE flag, being a true per-thread attribute,
must be saved in the thread_info structure when preemption occurs.
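The heart of the change is small. A sketch in the spirit of Peter's
patch (names approximate) replaces the stack-derived thread_info field
with a per-CPU variable reached through fast this_cpu operations:

    /* Per-CPU preemption count; no need to derive thread_info from
     * the stack pointer to reach it (sketch, names approximate). */
    DEFINE_PER_CPU(int, __preempt_count);

    static inline int preempt_count(void)
    {
        return __this_cpu_read(__preempt_count);
    }

    static inline void __preempt_count_add(int val)
    {
        __this_cpu_add(__preempt_count, val);
    }

    /* PREEMPT_ACTIVE is truly per-thread, so it must be saved into
     * thread_info when the thread is preempted and restored later. */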
Peter's first patch didn't quite solve the entire problem, though: there is
still the matter of the TIF_NEED_RESCHED flag that is set in the
thread_info structure when kernel code (possibly running in an
interrupt handler or on another CPU) determines that the currently-running
task should be preempted. That flag must be tested whenever the preemption
count returns to zero, and in a number of other situations as well; as long
as that test must be done, there will still be a cost to enabling full
preemption.
Naturally enough, Linus has a solution to this
problem in mind as well. The "need rescheduling" flag would move to
the per-CPU preemption count as well, probably in the uppermost bit. That
raises an interesting problem, though. The preemption count, as a per-CPU
variable, can be manipulated without locks or the use of expensive atomic
operations. This new flag, though, could well be set by another CPU
entirely; putting it into the preemption count would thus wreck that
count's per-CPU
nature. But Linus has a scheme for dancing around this problem. The "need
rescheduling" flag would only be changed using atomic operations,
but the remainder of the preemption count
would be updated locklessly as before.
Mixing atomic and non-atomic operations is normally a way to generate
headaches for everybody involved. In this case, though, things might just
work out. The use of atomic operations for the "need rescheduling" bit
means that any CPU can set that bit without corrupting the counters. On
the other hand, when a CPU changes its preemption count, there is a small
chance that it will race with another CPU that is trying to set the "need
rescheduling" flag, causing that flag to be lost.
That, in turn, means that the currently executing thread will
not be preempted when it should be. That result is unfortunate, in that it
will increase latency for the higher-priority task that is trying to run,
but it will not generate incorrect results. It is a minor bit of
sloppiness that the kernel can get away with if the performance benefits
are large enough.
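A user-space analogy, using C11 atomics in place of the kernel's
primitives and assuming a hypothetical top-bit placement for the flag,
shows exactly where the update can be lost:

    #include <stdatomic.h>
    #include <stdint.h>

    #define NEED_RESCHED (1u << 31)     /* hypothetical flag placement */

    _Atomic uint32_t preempt_count;

    /* Any CPU: set the flag atomically, preserving the counters. */
    void set_need_resched(void)
    {
        atomic_fetch_or(&preempt_count, NEED_RESCHED);
    }

    /* Owning CPU: the counters are updated with a plain (non-atomic)
     * read-modify-write, as Linus's scheme intends. */
    void sloppy_preempt_disable(void)
    {
        uint32_t old = atomic_load_explicit(&preempt_count,
                                            memory_order_relaxed);
        /* If set_need_resched() runs on another CPU right here, the
         * store below overwrites its flag and the preemption request
         * is lost: extra latency, but no incorrect results. */
        atomic_store_explicit(&preempt_count, old + 1,
                              memory_order_relaxed);
    }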
In this case, though, there appears to be a better solution to the problem.
Peter came back with an alternative
approach that keeps the TIF_NEED_RESCHED flag in the
thread_info structure, but also adds a copy of that flag in the
preemption count. In current kernels, when the kernel sets
TIF_NEED_RESCHED, it also
signals an inter-processor interrupt (IPI) to inform the relevant CPU that
preemption is required. Peter's patch makes the IPI handler copy the flag
from the thread_info structure to the per-CPU
preemption count; since that copy is done by the processor that owns
the count variable, the per-CPU nature of that count is preserved and the
race conditions go away. As of this writing, that approach seems like the
best of all worlds — fast testing of the "need rescheduling" flag without
race conditions.
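The crucial detail is a few lines in the reschedule-IPI path; a sketch
(helper and flag names approximate, continuing the per-CPU sketch
above):

    /* Runs in the IPI handler, i.e. on the CPU that owns the per-CPU
     * count, so a plain this_cpu operation cannot race with the
     * owner's other updates (sketch; names approximate). */
    void scheduler_ipi(void)
    {
        if (test_thread_flag(TIF_NEED_RESCHED))
            __this_cpu_or(__preempt_count, NEED_RESCHED);
        /* ... the rest of the reschedule IPI handling ... */
    }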
Needless to say, this kind of low-level tweaking needs to be done carefully
and well benchmarked. It could be that, once all the details are taken
care of, the performance gained does not justify the trickiness and
complexity of the changes. So this work is almost certainly not 3.12
material. But, if it works out, it may be that much of the throughput cost
associated with enabling full preemption will go away, with the eventual
result that the voluntary preemption mode could be phased out.
Patches and updates
Kernel trees
- Sebastian Andrzej Siewior: 3.10.6-rt3 (August 12, 2013)