3.2 is the most recent kernel, released on January 4. The 3.3 merge window
is still open, see the article below for what's been merged in the last
week. 3.3-rc1 can probably be expected sometime soon.
Stable releases: The 3.1.10 stable kernel was released on
January 18. This is the last stable kernel in the 3.1.x series, so users
should upgrade to the 3.2 series. In addition, the 3.2.1 stable kernel
was released.
Comments (none posted)
Don't think for a minute that something won't get done just
because it's obviously inappropriate.
-- Casey Schaufler
So I think that line should go away entirely. It doesn't have any
meaning.... I realize that I wrote it, and that it as such must be
bug-free, but I suspect that removing that line is even *more* bug-free.
-- Linus Torvalds
Tracepoints are being added like the US deficit. We need to set
some rules somewhere. Either by making a library that can handle
small changes (like the one we are discussing, even though a memcpy
should cope), or we need to put a kabosh to adding new tracepoints
like they are the new fad app. Perhaps we should put the same
requirements on new tracepoints as we do with new syscalls.
-- Steven Rostedt
Work on dma-buf was originally started with the goal of unifying
several competing "memory management" systems developed with
different ARM SoCs in mind. It would be unfortunate if restricting
its use to only GPL-licensed modules caused dma-buf adoption to be
-- NVIDIA's Robert Morell
Comments (2 posted)
Kernel development news
As of this writing, almost 8,800 non-merge changesets have been pulled into
the mainline kernel for the 3.3 development cycle - 2,900 since last
week's summary. The pace of the merge
window clearly slowed in its second week, but there were still a number of
interesting changes merged.
User-visible changes merged since last week include:
- The kernel has gained the ability to verify RSA digital signatures.
The extended verification module (EVM) makes use of this capability.
- The slab allocator supports a new slab_max_order= boot
parameter controlling the maximum size of a slab. Setting it to a
larger number may increase memory efficiency at the cost of increasing
the probability of allocation failures.
- The ALSA core has gained support for compressed audio on devices that
are able to handle it.
- There have been some significant changes made to the memory compaction
code to avoid the lengthy stalls experienced by some users when
writing data to slow devices (USB keys, for example). This problem
was described in this article, but the
solution has evolved considerably. By making a number of changes to
how compaction works, the memory management hackers (and Mel Gorman in
particular) were able to avoid disabling synchronous compaction, which
had the unfortunate effect of reducing huge page usage. See this
commit for a lot of information on how this problem was addressed.
- There is a new "charger manager" subsystem intended for use with
batteries that must be monitored occasionally, even when the system is
suspended. The charger manager can partially resume the system as
needed to poll the battery, then immediately re-suspend afterward.
- The Btrfs balancing/restriping code has been reworked to allow a lot
more flexibility in how a volume is rearranged. Restriping operations
can now be paused, canceled, or resumed after a crash.
- The audit subsystem is now supported on the ARM architecture.
- New device drivers include:
- Systems and processors:
Renesas R8A7740 CPUs,
R-Car H1 (R8A77790) processors,
NetLogic DB1300 boards,
Ubiquiti Networks XM (rev 1.0) boards,
Atheros AP121 reference boards, and
Netlogic XLP SoC and systems.
- Audio: Realtek ALC5632 codecs and
Cirrus Logic CS42L73 codecs.
- Block: Micron PCIe SSD cards and solid-state drives
supporting the NVM Express standard.
- Miscellaneous: TI TWL4030 battery chargers,
Dialog DA9052 battery chargers,
Maxim MAX8997 MUIC devices,
Samsung Electronics S5M multifunction devices, and
CSR SiRFprimaII DMA engines.
- Video4Linux: Samsung S5P and EXYNOS4 G2D 2D graphics accelerators,
remote controls using the Sanyo protocol,
Austria Microsystems AS3645A and LM3555 flash controllers,
Microtune MT2063 silicon IF tuners,
Jellin JL2005B, JL2005C, or JL2005D-based cameras,
HDIC HD29L2 demodulators, and
Samsung S5P/Exynos4 JPEG codecs.
Changes visible to kernel developers include:
- The memory control group naturalization
patches have been merged. These patches eliminate the
double-tracking of memory and, thus, substantially reduce the overhead
associated with the memory controller.
- The framebuffer device subsystem has a new FOURCC-based configuration
API; see Documentation/fb/api.txt for details.
- The Btrfs filesystem has gained an integrity checking tool that
monitors traffic to the storage device and looks for operations that
could leave the filesystem corrupted if the system fails at the wrong
time. See the comments at the top of fs/btrfs/check-integrity.c for
more information.
The 3.3-rc1 release can be expected at almost any point; after that, the
stabilization process begins for the 3.3 development
cycle. If the usual timing holds (as it almost always does), the
final 3.3 kernel release can be expected in the second half of March.
Comments (none posted)
We briefly covered a proposal for
restricting system calls using the kernel packet filtering mechanism on the
January 12 Kernel page, but, at that time,
there hadn't been any comments on the proposal. Since then there have been
several rounds of comments and revisions of the patch set, along with a
revival of an older idea to let a process limit itself and its children
to its current privilege level. So far, both sets of patches have received
generally positive feedback, to the point where it seems like
general-purpose system call filtering just might make it into the mainline
in the not-too-distant future.
For some time now, Will Drewry has been trying to find an acceptable way
to extend the seccomp ("secure computing") facility in the kernel so that
more flexible system call filtering can be done. His primary target for
the feature is the web browser, which could use it to sandbox untrusted
code, but
other projects (including QEMU, openssh, vsftpd, and others) have expressed
interest in the feature as well. He (and others) have tried various
approaches over the last few years without finding one that passed muster.
His latest attempt, which uses the BPF (Berkeley
Packet Filter) engine to filter the system calls, seems like it avoids
many of the problems that were noted in the earlier attempts.
The basic idea is that, instead of examining packet contents, the filters
will examine system calls and any arguments passed in registers (pointers
are not followed, in order to avoid time-of-check-to-time-of-use races).
The code will only allow calls that pass the filter tests to be executed.
The filtering fails "closed", so that any calls not listed in the filter,
or whose arguments don't correspond to the filter rules, will return an
EACCES
error. The syntax for creating a filter, as described in the documentation file, is fairly painful, but
Eric Paris has already started on a translator to turn a more readable form into
the BPF rules needed.
In order to avoid a longstanding problem
with the interactions between
binaries that can change their privileges (e.g. setuid or file-based
capabilities) and mechanisms to reduce privileges for a process, Drewry's
initial patch would restrict the ability of a process to make an
execve() call once a filter had been installed. The problem
is that privilege-changing binaries can get confused when faced with an
environment with fewer privileges than are expected. That confusion can
lead to privilege
escalation or other security holes. This is why things like
chroot(), bind mounts, and, eventually, user namespaces are
restricted to root-privileged processes.
If a filtered process can't successfully call execve(), though,
all of the concerns about confusing those binaries are gone. It does make using
the system call filtering a little clunky, however. One would expect that
a parent could set up filters and then spawn a child that would be bound by
those filters, but, without a way to exec, that won't work. That can be
worked around for most existing programs with some
LD_PRELOAD trickery, but in the discussion another potential
solution was proposed.
Andrew Lutomirski pointed to his execve_nosecurity proposal as a possible
solution. That would allow processes to set a flag so that they (and their
children) would be unable to call execve() and would add a new
variant (called, somewhat confusingly, execve_nosecurity()) that
could be used instead but would not allow any security transitions for the
executed program. That
means that setuid, LSM context changes, changing capabilities, and so on
would not be allowed. Linus Torvalds agreed that
adding a way to restrict privilege changes would be useful:
We could easily introduce a per-process flag that just says "cannot
escalate privileges". Which basically just disables execve() of
suid/sgid programs (and possibly other things too), and locks the
process to the current privileges. And then make the rule be that *if*
that flag is set, you can then filter across an execve, or chroot as a
normal user, or whatever.
That led Lutomirski to propose a flag in
struct task_struct called no_new_privs that would be set
via the PR_SET_NO_NEW_PRIVS flag to prctl(). It would be
a one-way gate as there would be no way to unset the flag. If set, the flag
would restrict executing binaries in much the same way that the
nosuid mount flag works. In addition, it would prevent processes from
gaining capabilities on exec or making SELinux security context
transitions.
But, Lutomirski's patch does not implement a sandbox, as it can
still be subverted via ptrace() as Alan Cox points out. Cox was also concerned that
preventing SELinux, AppArmor, or other LSMs from changing privileges could
lead to other problems because those transitions may actually be changing
the context to a less privileged state. Simply keeping the previous
context, as Lutomirski's patch does, could lead to executing programs in a
more-privileged context. But Eric Paris clarifies that SELinux, at least, will still
make the same policy decision even without the transition (as it does for
nosuid mounts), so that the execution will still fail if the
process has the wrong context.
Lutomirski also notes that a sandbox will
be much less useful if execve() has to fail when there is any kind
of security transition, as Cox suggested. The presence of a policy on a
particular binary would make that binary unusable from within a sandbox, no
matter what the policy is. A better solution, Lutomirski said, is to set the
no_new_privs bit, then set up a sandbox (using Drewry's seccomp
system call filtering for example), then execute the binary, which will
succeed or fail based on the actual mandatory access control (MAC) policy.
That solves the problem of ptrace() and other circumvention
methods as well
because a sandbox requires both the no_new_privs patch and some
other mechanism to filter system calls:
no_new_privs is not intended to be a sandbox at all -- it's a way to
make it safe for a task to manipulate itself in a way that would allow
it to subvert its own children (or itself after execve). So ptrace
isn't a problem at all -- PR_SET_NO_NEW_PRIVS + chroot + ptrace is
exactly as unsafe as ptrace without PR_SET_NO_NEW_PRIVS. Neither one
allows privilege escalation beyond what you started with.
If you want a sandbox, call PR_SET_NO_NEW_PRIVS, then enable seccomp
(or whatever) to disable ptrace, evil file access, connections on unix
sockets that authenticate via uid, etc.
Meanwhile, Drewry has been revising his patches to take advantage of
no_new_privs. One of those revisions brought about some other
concerns regarding whether dropping privileges should be allowed
after the bit is set. Torvalds is worried
that allowing privilege dropping will
somehow lead to confusing other programs:
"We've had security bugs that were *due* to dropped capabilities -
people dropped one capability but not another, and fooled code into
doing things they weren't expecting it to do." Lutomirski's patches
do not restrict things like calls to setuid() because they are not
meant to implement a sandbox—that's what the existing seccomp, or an
enhanced version from Drewry's patches (aka seccomp mode 2) will do. As Lutomirski explains:
Another way of saying this is: no_new_privs is not a sandbox. It's
just a way to make sandboxes and other such weird things that
processes can do to themselves safe across execve. If you want a
sandbox, use seccomp mode 2, which will require you to set
no_new_privs first.
It's clear that Lutomirski, at least, thinks the no_new_privs
changes cannot lead to the problems that Torvalds and others (notably Smack
developer Casey Schaufler) are concerned
about. But, any program that uses no_new_privs needs to be aware
of what it does (and doesn't) do. Coupling it with a system call filtering
mechanism seems like it could only increase the security of the system.
But, interactions between security mechanisms often have unforeseen
effects, typically resulting in security holes, so it makes sense to be
cautious.
So far, these changes are still being discussed, and no subsystem
maintainer has volunteered to take them, but the two proposals seem to have
support that other similar ideas have lacked. Whether Lutomirski can
convince the other kernel hackers that no_new_privs can't lead to
other problems, or whether he needs to figure out how to stop the dropping
of privileges is unclear. But it does seem like there may now be a path
for an enhanced seccomp to reach the mainline.
Comments (none posted)
Over the last fifty years, thousands of very bright system
software engineers, including many of you reading this today,
have invested large parts of their careers trying to solve
a single problem: How to divide up a fixed amount of physical
RAM to maximize a machine's performance across a wide variety
of workloads. We can call this "the MM problem."
Because RAM has become incredibly large and inexpensive, because
the ratios and topologies of CPU speeds, disk access times,
and memory controllers have grown ever more complex, and
because the workloads have changed dramatically and have
become ever more diverse, this single MM problem has continued
to offer fresh challenges and excite the imagination of kernel
MM developers. But at the same time, the measure
of success in solving the problem has become increasingly
difficult to define.
So, although this problem has never been considered "solved",
it is about to become much more complex, because those same
industry changes have also brought new business computing models.
Gone are the days when optimizing a single machine and a single
workload was a winning outcome. Instead, dozens, hundreds,
thousands, perhaps millions of machines run an even larger
number of workloads. The "winners" in the future industry
are those that figure out how to get the most work done at
the lowest cost in this ever-growing environment. And that
means resource optimization. No matter how inexpensive a
resource is, a million times that small expense is a large
expense. Anything that can be done to reduce that large
expense, without a corresponding reduction in throughput,
results in greater profit for the winners.
Some call this (disdainfully or otherwise) "cloud computing",
but no matter what you call it, the trend is impossible to
ignore. Assuming it is both possible and prudent to consolidate
workloads, it is increasingly possible to execute those workloads
more cost effectively in certain data center environments
where the time-varying demands of the work can be statistically
load-balanced to reduce the maximum number of resources required.
A decade ago, studies showed that, on average, only 10% of the
CPU in an average pizza box server was being utilized... wouldn't
it be nice, they said, if we could consolidate and buy 10x fewer
servers? This would not only save money on servers, but
would save a lot on power, cooling, and space too. While many
organizations had some success in consolidating some workloads
"manually", many other workloads broke or became organizationally
unmanageable when they were combined onto the same system and/or OS.
As a result, scale-out has continued and different virtualization
and partitioning technologies have rapidly grown in popularity
to optimize CPU resources.
But let's get back to "MM", memory management. The management
of RAM has not changed much to track this trend toward optimizing
resources. Since "RAM is cheap", the common response to performance
problems is "buy more RAM". Sadly, in this evolving world where
workloads may run on different machines at different times, this
classic response results in harried IT organizations all
buying more RAM on most or all of the machines in a data center.
A further result is that the ratio of total RAM in a data center
vs. the sum of the "working set" of the workloads, is often at
least 2x and sometimes as much as 10x. This means that somewhere
between half and 90% of the RAM in an average data center is
wasted, which is decidedly not cheap. So the question arises: is it
possible to apply similar resource optimization techniques to RAM?
A thought experiment
Bear with me and open your imagination for the following thought
experiment.
Let's assume that the next generation processors have two new
instructions: PageOffline and PageOnline. When PageOffline is
executed (with a physical address as a parameter), that (4K)
page of memory is marked by the hardware as inaccessible and
any attempts to load/store from/to that location result in an
exception until a corresponding PageOnline is executed.
And through some performance registers, it is possible to
measure which pages are in the offline state and which are not.
Let's further assume that John and Joe are kernel MM developers
and their employer "GreenCloud" is "green" and
enlightened. The employer offers the following bargain to John
and Joe and the thousands of other software engineers working at
GreenCloud:
"RAM is cheap but not free. We'd like to encourage you
to use only the RAM necessary to do your job. So, for every page,
on "average" over the course of the year, that you have offline,
we will add one-hundredth of one cent to your end-of-year bonus.
Of course, if you turn off too much RAM, you will be less efficient
at getting your job done, which will reflect negatively on your
year-end bonus. So it is up to you to find the right balance."
John and Joe quickly do some arithmetic: so, since my
machine has 8GB RAM, if I on average keep 4GB offline, I will
be $100 richer. They quickly start scheming about how
to dynamically measure their working set and optimize page offlining.
But the employer goes on: "And for any individual page that you
have offline for the entire year, we will double that to two-hundredths
of a cent. But once you've chosen the "permanent offline"
option on a page, you are stuck with that decision until the
next calendar year."
John, anticipating the extra $200, decides immediately to try to
shut off 4GB for the whole year. Sure, there will be some workload
peaks where his machine will get into a swapstorm and he won't get
any work done at all, but that will happen rarely and he can
pretend he is on a coffee break when it happens. Maybe the
boss won't notice.
Joe starts crafting a grander vision; he realizes that, if he
can come up with a way to efficiently allow others' machines that
are short on RAM capacity to utilize the RAM capacity on his machine,
then the "spread" between temporary offlining and permanent offlining
could create a nice RAM market that he could exploit. He could
ensure that he always has enough RAM to get his job done, but
dynamically "sell" excess RAM capacity to those, like John, who
have underestimated their RAM needs ... at say fifteen thousandths
of a cent per page-year. If he can implement this "RAM capacity
sharing capability" into the kernel MM subsystem, he may be
able to turn his machine into a "RAM server" and make a tidy
profit. If he can do this ...
In the GreenCloud story, we have: (1) a mechanism for offlining
and onlining RAM one page at a time; (2) an incentive for using
less RAM than is physically available; and, (3) a market for load-
balancing RAM capacity dynamically. If Joe successfully figures
out a way to make his excess RAM capacity available to others
and get it back when he needs it for his own work, we may have
solved (at least in theory) the resource optimization problem for
RAM for the cloud.
While the specifics of the GreenCloud story may not be realistic
or accurate, there do exist some of the same factors in the real
world. In a virtual environment, "ballooning" allows individual
pages to be onlined/offlined in one VM and made available to other
VMs; in a bare-metal environment, the RAMster project provides a similar capability. So, though primitive and
not available in all environments, we do have a mechanism. By
substantially reducing the total amount
of RAM across a huge number of machines in a data center, both
capital outlay and power/cooling would be reduced, improving resource
efficiency and thus potential profit. So we have an incentive
and the foundation for a market.
Interestingly, the missing piece, and where this article started, is
that most OS MM developers are laser-focused on the existing problem
from the classic single machine world which is, you recall: how to
divide up a fixed amount of physical RAM to maximize a single
machine's performance across a wide variety of workloads.
The future version of this problem is this: how to vary the amount
of physical RAM provided by the kernel and divide it up to maximize
the performance of a workload. In the past, this was irrelevant:
you own the RAM, you paid for it, it's always on, so just use it.
But in this different and future world with virtualization, containers,
and/or RAMster, it's an intriguing problem. It will ultimately allow
us to optimize the utilization of RAM, as a resource, across a data
center.
It's also a hard problem, for three reasons. The first is that we can't
predict, but can only estimate, the future RAM demands of any workload.
But this is true today; the only difference is whether the result
is "buy more RAM" or not. The second is that we need to understand
the instantaneous benefit (performance) of each additional page of
RAM (cost); my math is very rusty, but this reminds me of differential
calculus, where "dy" is performance and "dx" is RAM size. At every
point in time, increasing dx past a certain size will have no
corresponding increase in dy. Perhaps this suggests control theory
more than calculus but the needed result is a true dynamic
representation of "working set" size. Third, there is some cost for
moving capacity efficiently; this cost (and impact on performance) must
be somehow measured and taken into account as well.
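That differential intuition can be written down a bit more precisely; the
notation below is introduced purely for illustration and does not come
from any existing MM code:

```latex
% Let P(m, t) be workload performance with m pages of RAM at time t.
% The "dy/dx" above is the marginal value of one more page:
v(m, t) = \frac{\partial P}{\partial m}
% A dynamic working-set size is then the smallest allocation beyond
% which another page is no longer worth its cost \epsilon (which must
% include the cost of moving capacity between machines):
W(t) = \min\left\{\, m : \frac{\partial P}{\partial m}(m, t) < \epsilon \,\right\}
```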
But, in my opinion, this "calculus" is the future of memory management.
I have no answers and only a few ideas, but there are a lot of bright
people who know memory management far better than I do. My hope is to
stimulate discussion about this very-possible future and how the kernel
MM subsystem should deal with it.
Comments (35 posted)
Patches and updates
- Con Kolivas: 3.2-ck1.
(January 16, 2012)
Core kernel code
Filesystems and block I/O
Virtualization and containers
- Lucas De Marchi: kmod 4.
(January 18, 2012)
Page editor: Jonathan Corbet