The current development kernel is 3.1-rc9, released
on October 4. The kernel
repository is back in its old location on kernel.org; Linus also has
started using a new signing key with this release. "On the kernel
front, not a huge amount of changes. That said, by now, there had better
not be - and I definitely wouldn't have minded having even fewer
changes. But the fixes that are here are generally pretty small, and the
diffstat really doesn't look all that scary - there really aren't *big*
Stable updates: the 3.0.5 update was
released on October 3 with well over 200 important fixes; it was
immediately followed by 3.0.6 to fix a
build problem in the Radeon driver.
Some day someone will write something that supersedes the Linux
kernel. Probably before that someone should either fix or replace C
so that it has an inbuilt ability to describe locking and
statically enforce some of the locking rules at compile time. I'm
sure my generation of programmers will despise the resulting
language but in time we will all thank whoever does it.
Inside of Oracle, we've decided to make btrfs the default
filesystem for Oracle Linux. This is going into beta now and we'll
increase our usage of btrfs in production over the next four to six
months... What this means is that [we] absolutely cannot move forward
without btrfsck. RH, Fujitsu, SUSE and others have spent a huge
amount of time on the filesystem and it is clearly time to start
putting it into customer hands.
-- Chris Mason
Note that if your laptop allows incoming ssh connections, and you
logged into master.kernel.org with ssh forwarding enabled, your
laptop may not be safe. So be very, very careful before you assume
that your laptop is safe. At least one kernel developer, after he
got past the belief, "surely I could have never had my machine be
compromised", looked carefully and found rootkits on his machines.
-- Ted Ts'o
The more important point is that as far as the linux-kernel
community is concerned, the guy we've all seen show up at
conferences and present stuff all these times *is* Andrew Morton,
even if his real name is George Q. Smith and he's been on the run
for the last 27 years for an embarrassing incident involving an
ostrich, the mayor's daughter, and 17 gallons of mineral oil in the
atrium of the museum.
-- Valdis Kletnieks
Last week's article on GCC plugins
drew a comment
from "PaXTeam" regarding the PaX
project's use of plugins. It seems they have developed a set of plugins to
help with the task of creating a more secure kernel. Said plugins can be
hard to find; your editor located them in the grsecurity "test" patch.
There appear to be four of them, each of which tweaks the
compilation process in a different way:
- Structures containing only function pointers are made const,
regardless of whether they are declared that way. Of course, it turns
out that this is the wrong thing to do in a number of cases, so the
developers had to create a no_const attribute and use it some
180 places in their patch.
- A histogram of the distribution of sizes passed to kmalloc()
is generated; it's not clear (to your editor) what use is made of that
information.
- Some fairly sophisticated tweaks to the generated assembly are made for
AMD processors to improve the prevention of the execution of kernel
- Instrumentation is inserted to track kernel stack usage.
Use of plugins in this way allows significant changes to be made to the
kernel without actually having to change the code:
To give you some numbers, an allmodconfig 2.6.39 i386 kernel loses
over 7000 static (i.e., not runtime allocated) writable function
pointers (a reduction of about 16%). creating an equivalent source
patch would be thousands of lines of code and have virtually no
chance to be accepted in any reasonable amount of time.
On the other hand, plugins of this type can increase the distance between
the code one sees and what is actually run in the kernel; it is easy to
imagine that leading to some real developer confusion at some point.
Still, says PaXTeam, "the cost/benefit ratio of the plugin approach
is excellent and there's a lot more in the pipeline." It is not too
hard to imagine other uses that are not necessarily tied to security.
(Amusingly, the plugins are licensed under GPLv2, meaning that they do not
qualify for the GCC runtime library exemption. The kernel does not need
that library, though, so all is well.)
Kernel development news
On August 31, the world was informed that
kernel.org, the primary repository for kernel code in various stages of
development, had been compromised - though developers with access to the
site had been informed a few days prior. The site was shut down for
"maintenance" when that notice went out, leaving the community without an
important hosting and distribution point. Kernel development has slowed as
a result; the 3.1 kernel, which would have been expected by now, remains
unreleased. Kernel.org is on its way back, but it will almost certainly
never be quite the same.
On October 3, a basic kernel.org returned to the net. Git hosting is back,
but only for a very small number of trees: mainline, stable, and
linux-next. The return of the other trees is waiting for the relevant
developers to reestablish their access to the site - a process that
involves developers verifying the integrity
of their own systems, then generating a new
PGP/GPG key, integrating it into the web of trust, and forwarding the
public key to the kernel.org maintainers. This procedure could take a
while; it is not clear how many developers will be able to regain their
access to kernel.org before the 3.2 merge window opens.
The front-page web interface is back although, as of this writing, it is
not being updated to reflect the state of the git trees. Most other
kernel.org services remain down; some could stay that way for some time.
It is worth remembering that kernel.org only has one full-time system
administrator, a position that has been funded by the Linux Foundation
since 2008. That administrator, along with a number of volunteers, is
likely to be quite busy; some of the less-important services may not return.
A full understanding of what happened is also likely to take some time.
Even in the absence of a report on this intrusion, though, there are some
conclusions that can be made. The first is obvious: the threat is real.
There are attackers out there with time, resources, motivation, and
skills. Given the potential value of either putting a back door into the
kernel or adding a trojan that would run on developers' machines, we have
to assume that there will be more attacks in the future. If the restored
kernel.org is not run in a more secure manner, it will be compromised again
in short order.
The site's administrators have already announced that shell accounts will
not be returning to the systems where git trees are hosted. Prior to the
break-in, there were on the order of 450 of those accounts; that is a lot of
keys to the front door to have handed out. No matter how careful all those
developers may be - and some are more careful than others - the chances of
one of them having a compromised machine approach 100%. Keeping all those
shell accounts off the system is clearly an important step toward a higher
level of security.
Kernel.org has its roots in the community and was run the way kernel
developers often run their machines. So, for example, kernel.org tended to
run mainline -rc kernels - a good exercise in dogfooding, perhaps, but it
also exposed the system to bleeding-edge bugs, and, perhaps more
importantly, obscured the real cause of kernel
panics experienced last August, delaying the realization that the
system had been compromised. The kernel currently running on the new systems
has not been announced; one assumes it is something a little better tested,
better supported, and stable. (No criticism is intended by pointing this
out, incidentally. Kernel.org has been run very well for a long time; the
point here is that the environment has changed, so practices need to change.)
At this point it seems clear that a single administrator for such a
high-profile site is not an adequate level of staffing. Given the
resources available in our community, it seems like it should be possible
to increase the amount of support available to kernel.org. There are
rumors that this is being worked on, but nothing has been announced.
Developers are going to have to learn to pay more attention to the
security of their systems. There are scattered reports of kernel
developers turning up
compromised systems; in some cases, they may have been infected as the
result of excessive trust in kernel.org. Certain practices will have to
change; for that reason, the Fedora project's announcement of a zero-tolerance policy toward
private keys on Fedora systems is welcome. Developers are on the front
line here: everybody is depending on them to keep their code - and the
infrastructure that distributes that code - secure.
There is an interesting question related to that: will kernel developers
move back to kernel.org? These developers have had to find new homes for
their git repositories during the outage; some of them are likely to decide
that leaving those repositories in their new location is easier than
establishing identities in the web of trust and getting back into
kernel.org. Linus has said in the past
that he sees the presence of a kernel.org-hosted tree in a pull request as
a sign that the request is more likely to be genuine. Requiring
that repositories be hosted at kernel.org seems like an unlikely step for
this community, though. It is not entirely clear whether trees distributed
around the net increase the security risk to the kernel, or whether putting
all the eggs into the kernel.org basket would be worse.
One other conclusion would seem to jump out at this point: kernel.org got hit
this time, but there are a lot of other important projects and hosting
sites out there. Any of those projects is just as likely to be a target as
the kernel. If we are not to have a long series of embarrassing compromises,
some with seriously unfortunate consequences, we're going to have to take
security more seriously everywhere. Doing so without ruining our
community's openness is going to be a challenge, to say the least, but it
is one we need to take on. Security is a pain, but being broken into and
used to attack your users and developers is even more so.
High-resolution timers (hrtimers) can be used to invoke kernel code after a
precisely-specified time interval; unlike regular kernel timers, hrtimers
can be reliably used with periods of microseconds or
nanoseconds. Even hrtimer users can usually accept a wakeup within a
specific range of times, though. To take advantage of that fact, the
kernel offers "range hrtimers" with both soft (earliest) and hard (latest)
deadlines. With range hrtimers, the kernel can coalesce wakeup events,
minimizing the number of interrupts and reducing power usage. These are
good things, so it is not surprising that the use of range timers has
increased since they were introduced.
One would think that, once the hrtimer code starts running in response to a
timer interrupt, it would make sense to run every timer event whose soft
expiration time has passed. But that is not what current kernels do. It
is an interesting exercise to look at why that is, and how a recent patch from Venkatesh Pallipadi
changes that behavior.
For the sake of simplicity, let us imagine a set of timers that we'll call
"A" through "G", each expiring 10µs after its predecessor. The
hard expiration times are regular, but the timers have wildly differing
soft expiration times; plotted on a timeline, the example timers look like
this:
[Figure: timeline of the example timers "A" through "G"]
As can be seen here, timer "A" has a hard expiration 10µs in the
future, but it could expire any time
after 5µs. Timer "B" can be expired anytime from 7.5µs to
20µs in the future; the kernel can thus expire them both at 10µs and
eliminate the need to schedule a timer interrupt at 20µs. Further in
the future, timer "D" has a hard expiration 40µs
ahead, but it is quite flexible and could, like timer "B", legitimately be
expired 7.5µs from now.
If the kernel is interrupted by a hardware timer in 10µs, it might be
expected to call the expiration function for timers "A", "B", and "D". In
reality, though, the expiration function for "D" will not be called at that
time. To understand why, consider that hrtimers, within the kernel, are
stored in a red-black tree with the hard
expiration time as the key. The resulting tree will look something like
this:
[Figure: red-black tree of the example timers, keyed by hard expiration time]
When the timer interrupt happens, the timer code performs a depth-first
traversal of this tree for as long as it finds timers whose soft expiration
time has passed. In this case, it will encounter "A" and "B" but, once it
hits "C", the soft expiration time is in the future and the traversal
stops. The organization of the data structure is such that the code cannot
find the other events whose soft expiration time has passed without
searching the whole tree.
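The early stop described above can be illustrated with a minimal Python sketch. It is not the kernel's code: a list sorted by hard expiration time stands in for the red-black tree, and the soft expiration times given to "C", "E", "F", and "G" are assumed values, since the text only pins down the figures for "A", "B", and "D".

```python
from dataclasses import dataclass


@dataclass
class Timer:
    name: str
    soft: float  # earliest acceptable expiration, in microseconds
    hard: float  # latest acceptable expiration, in microseconds


# Soft times for C, E, F, and G are assumptions made for illustration.
TIMERS = [
    Timer("A", 5.0, 10.0),
    Timer("B", 7.5, 20.0),
    Timer("C", 25.0, 30.0),
    Timer("D", 7.5, 40.0),
    Timer("E", 35.0, 50.0),
    Timer("F", 50.0, 60.0),
    Timer("G", 60.0, 70.0),
]


def current_walk(timers, now):
    """Visit timers in hard-expiration order; stop at the first one
    whose soft expiration time is still in the future."""
    expired = []
    for timer in sorted(timers, key=lambda t: t.hard):
        if timer.soft > now:
            break  # everything after this point is never examined
        expired.append(timer.name)
    return expired


print(current_walk(TIMERS, 10.0))
```

With these numbers the walk at 10µs stops at "C", so "D" is not seen even though its soft expiration time has passed.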
When the hrtimer code was extended to support range timers, searching for
all soft-expired timers looked like it would require the addition of a
second tree over the existing tree. That was deemed to be too expensive,
especially since it may not actually save any wakeups. With the current
code, "D" will be expired after 30µs, when "C" hits its hard
expiration. Expiring "D" sooner will not eliminate the need for a wakeup
at 30µs, so it didn't seem worth the effort to expire "D" sooner.
Venkatesh thought this through and decided that he could come up with a
couple of exceptions to that reasoning. It may well be that, at
10µs, the system will be less heavily loaded than at 30µs; in
that case, it makes sense to get more work done sooner. Running the
timer sooner also could save a wakeup if "C" is deleted prior to
expiration. So he wrote up a patch to implement a "greedy hrtimer walk"
that would run all soft-expired hrtimers on a timer interrupt.
He was helped by the addition of augmented
red-black trees (also done by Venkatesh) in 2010. Essentially, an
augmented tree allows the
addition of a bit of extra metadata to each node; when a change is made to
the tree, that extra information can be percolated upward. The greedy
hrtimer walk patch turns the hrtimer tree into an augmented red-black tree;
each node then stores the earliest soft expiration time to be found at that
level of the tree or below. With the timer example given above, the new
tree would look like this:
[Figure: the augmented timer tree, with each subtree's earliest soft expiration time shown in red]
The new numbers in red tell the tree-traversal logic what the soonest
soft-expiration time is in each subtree. Using those numbers, a search of
the tree 10µs in the future could prune the search at "F", since all
soft expiration times will be known to be at least 25µs further in
the future at that time. That takes away much of the cost of searching the
tree for soft-expired timers that are not on the left side.
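The pruning idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain balanced binary search tree stands in for the kernel's augmented red-black tree, the names Node, build(), and greedy_walk() are hypothetical, and the soft expiration times for "C", "E", "F", and "G" are invented (chosen so that nothing under "F" soft-expires before 35µs, as the text implies).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    soft: float
    hard: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    min_soft: float = 0.0  # earliest soft expiration in this subtree


# (name, soft, hard), sorted by hard expiration; several soft values are assumed.
TIMERS = [
    ("A", 5.0, 10.0), ("B", 7.5, 20.0), ("C", 25.0, 30.0), ("D", 7.5, 40.0),
    ("E", 35.0, 50.0), ("F", 50.0, 60.0), ("G", 60.0, 70.0),
]


def build(items):
    """Build a balanced tree keyed by hard expiration and fill in min_soft."""
    if not items:
        return None
    mid = len(items) // 2
    name, soft, hard = items[mid]
    node = Node(name, soft, hard, build(items[:mid]), build(items[mid + 1:]))
    children = [c for c in (node.left, node.right) if c is not None]
    node.min_soft = min([soft] + [c.min_soft for c in children])
    return node


def greedy_walk(node, now, expired, visited):
    """Collect every soft-expired timer, skipping subtrees that hold none."""
    if node is None or node.min_soft > now:
        return  # pruned: nothing below here has soft-expired yet
    visited.append(node.name)
    greedy_walk(node.left, now, expired, visited)
    if node.soft <= now:
        expired.append(node.name)
    greedy_walk(node.right, now, expired, visited)


root = build(TIMERS)
expired, visited = [], []
greedy_walk(root, 10.0, expired, visited)
print(expired, visited)
```

In this sketch the walk at 10µs finds "A", "B", and "D" while never descending into the subtrees rooted at "C" or "F".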
One might still wonder if that extra work is worthwhile on the off-chance
that running timer events sooner will be advantageous. After all, in the
absence of specific knowledge or a crystal ball, it is just as likely that
the system will be less loaded at the later expiration time; in that
case, expiring the timer sooner would make things worse. Venkatesh's patch
avoids that issue by only performing the greedy hrtimer walk if the CPU is
idle when the timer interrupt happens. If work is being done, soft-expired
timers that are not immediately accessible are left in the tree, but, if
the CPU has nothing better to do, it performs the full search.
Venkatesh benchmarked this work by looking at the number of times the
scheduler migrated tasks between CPUs on a given workload. Migrations are
a sign of contention for the processor; they can also be expensive since
processes can leave their memory cache behind when they move. Given the
right workload (80% busy with a number of threads), the number of
migrations was cut to just over half its previous value; other workloads
gave less impressive results, but the patch never seemed to hurt. Given
that, the comments on the patch were mostly focused on the details - like
whether the greedy behavior should be controlled by a sysctl knob or not.
Chances are this feature will show up in the 3.2 kernel.
Google's requirements for systems running in its cluster have been
discussed in public a number of times at this point; the recent Linux Plumbers Conference session on control
groups is an
example. The company does everything it can to pack as much work onto each
system as possible to ensure that its hardware is fully utilized. One
aspect of this packing is the need to make the best use possible of system
memory. Michel Lespinasse's recently posted idle page tracking patch set is one piece of
Google's solution to this problem.
The "fake NUMA" mechanism is currently used to control memory use within a
single system, but Google is trying to move to the control-group memory
controller instead. The memory controller can put limits on how much
memory each group of processes can use, but it is unable to automatically
adjust those limits in response to the actual need shown by those groups. So some
control groups may have a lot of idle memory sitting around while others
are starved. Google would like to get a better handle on how much memory
each group actually needs so that the limits can be adjusted on the fly -
responding to changes in load - and more jobs can be crammed onto each box.
Determining a process's true memory needs can be hard, but one fairly clear
clue is the existence of pages in the process's working set that have not
been touched in some time. If there are a lot of idle pages around, it is
safe to say that the process is not starved for memory; this idea is based, of
course, on the notion that the kernel's page replacement algorithm is
working reasonably well. It follows that, if you would like to know how
memory usage limits can be tweaked to optimize the use of memory, it makes
sense to track the number of idle pages in each control group. The kernel
does not currently provide that information - a gap that Michel's patch set
tries to fill.
The memory management code has a function (page_referenced() and a
number of variants) that can be used to determine whether a given page has
been referenced since the last time it was checked. It is used in a number
of memory management decisions, such as the quick aging of pagecache pages
that are only referenced once. Michel's patch makes use of this mechanism
to find idle pages, but this use has some slightly different needs: Michel
needs to know more about the pages in question, and he needs to not
interfere with other users of page_referenced(). To meet these
needs, Michel has to make some changes to the core memory management code.
For the first
problem, the page_referenced() interface is changed to take a new
structure (page_referenced_info) where the additional information can be
returned.
Avoiding interference with existing users of page_referenced(),
instead, requires adding a couple of new page flags. Since page flags are
in short supply on 32-bit architectures,
using more of them is strongly discouraged. This patch set gets around
that problem by disabling the feature altogether on 32-bit machines;
anybody wanting idle page tracking will need to run in 64-bit mode.
Systems where idle page tracking is in use will have a new kernel thread
running under the name kstaled. Its job is to scan through all of
memory (once every two minutes by default) and count the number of pages
that have not been referenced since the previous scan. Such pages are
deemed to be idle; each one is traced back to its owning control group and
that group's statistics are adjusted. The patch adds a new "page age"
data structure - an array containing one byte for every page in the system
- to track how long each page has been idle, up to 255 scan cycles. The
results are boiled down to counters showing how many pages have been idle
for 1, 2, 5, 15, 30, 60, 120, and 240 cycles. Idle pages are further
broken down into a few categories: clean, dirty and file-backed, and dirty
anonymous pages. These counters, which are
only updated at the end of each scan, can be found in the memory
controller's control directory for each group.
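The bookkeeping described above might be modeled roughly as follows. This is a sketch, not the patch's code: the function names are hypothetical, and it assumes the counters are cumulative (a page idle for 5 cycles is also counted as idle for 1 and 2), which the text does not state.

```python
THRESHOLDS = (1, 2, 5, 15, 30, 60, 120, 240)  # scan cycles, as listed in the text
MAX_AGE = 255  # one byte per page, so the age saturates here


def scan(ages, referenced):
    """One scan pass: reset the age of referenced pages, bump the rest."""
    return [0 if ref else min(age + 1, MAX_AGE)
            for age, ref in zip(ages, referenced)]


def idle_counters(ages):
    """Count pages idle for at least each threshold (assumed cumulative)."""
    return {t: sum(1 for age in ages if age >= t) for t in THRESHOLDS}


ages = scan([0, 1, 4, 254, 255], [False, True, False, False, False])
print(ages)
print(idle_counters(ages))
```

The real patch additionally splits these counts by page type and by owning control group; that is left out here to keep the example short.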
Since the statistics are only updated at the end of each scan, and since
the scans are two minutes apart, the resulting numbers are likely to lag
reality by some time. Imagine that a given page is scanned toward the
beginning of a cycle and seen to be in use; clearly it will not be counted
as idle. If it is referenced one last time just after the scan, it will
still appear to be in use at the next scan, nearly two minutes later, when
the "referenced" bit will be reset. It is only after another two minutes
that kstaled will decide that the page is unused - nearly four minutes
after its last reference. That is not necessarily a problem; a decision to
shrink a group of processes because they are not using all of their memory
probably should not be made in haste.
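The lag can be worked through with a small calculation. The sketch below assumes that a page is visited at exact multiples of the scan period, which is a simplification; the function name is hypothetical.

```python
import math


def idle_detection_delay(last_ref, period=120.0):
    """Seconds between a page's last reference and the scan that first
    counts it as idle, assuming the page is visited at 0, period, 2*period...

    The first visit after the reference sees the "referenced" bit and
    clears it; only the visit after that finds the page idle.
    """
    first_visit_after = (math.floor(last_ref / period) + 1) * period
    return first_visit_after + period - last_ref


print(idle_detection_delay(1.0))    # referenced just after a visit
print(idle_detection_delay(119.0))  # referenced just before a visit
```

Under this model the delay always falls between one and two scan periods, so with the default two-minute scan it ranges from just over two minutes to nearly four.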
There are times when more current information is useful, though. In
particular, Google's management code would like to know when a group of
processes suddenly start making heavier use of their memory so that their
limits can be expanded before they begin to thrash. To handle this case,
the patch introduces the notion of "stale" pages: a page is stale if it is
clean and if it has been idle for more than a given (administrator-defined)
number of scan cycles. The presence of stale pages indicates that a
control group is not under serious memory pressure. If that control
group's memory needs suddenly increase, though, the kernel will start
reclaiming those stale pages. So a sudden drop in the number of stale
pages is a good indication that something has changed.
When kstaled determines that a given page is stale, one of the new
page flags (PG_stale) will be used to mark it. Tests have been
sprinkled throughout the memory management code to notice when a stale page
is dirtied, referenced, locked, or reclaimed; when that happens, the owning
control group's count of stale pages will be decremented on the spot.
Stale pages are not detected any more quickly than idle
pages, but a reduction in the number of stale pages can be noticed
immediately. That provides an early-warning system that can flag control
groups whose memory use is on the increase.
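The asymmetry between slow detection and immediate decrement can be shown with a toy model. Everything here is an assumption made for illustration: the Group class, its method names, and the simplification that a "touch" stands for any of the events (dirtying, referencing, locking, reclaiming) that clear the stale state.

```python
class Group:
    """Toy model of one control group's stale-page accounting."""

    def __init__(self, n_pages, stale_after, dirty=()):
        self.stale_after = stale_after      # administrator-defined cycle count
        self.age = [0] * n_pages
        self.referenced = [False] * n_pages
        self.stale = [False] * n_pages      # stands in for PG_stale
        self.dirty = set(dirty)             # dirty pages never become stale
        self.stale_count = 0

    def scan(self):
        """Periodic pass: the only place where pages are marked stale."""
        for i in range(len(self.age)):
            if self.referenced[i]:
                self.age[i] = 0
                self.referenced[i] = False
            else:
                self.age[i] = min(self.age[i] + 1, 255)
            if (not self.stale[i] and i not in self.dirty
                    and self.age[i] > self.stale_after):
                self.stale[i] = True
                self.stale_count += 1

    def touch(self, i):
        """A page is used again: the stale count drops on the spot."""
        self.referenced[i] = True
        if self.stale[i]:
            self.stale[i] = False
            self.stale_count -= 1


group = Group(n_pages=4, stale_after=2)
for _ in range(3):
    group.scan()
print(group.stale_count)  # all four pages are now stale
group.touch(0)
print(group.stale_count)  # drops without waiting for the next scan
```

A monitor watching stale_count would see the drop as soon as touch() runs, which is the early-warning property the text describes.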
The patch has been through a couple of iterations; there have been comments
pointing out things to fix but no fundamental opposition to the idea. That
said, memory management patches are not known for their speed getting into
the mainline; if and when we'll see this feature in mainline kernels
remains to be seen.
Page editor: Jonathan Corbet