
Kernel development

Brief items

Kernel release status

3.2 is the most recent kernel, released on January 4. The 3.3 merge window is still open, see the article below for what's been merged in the last week. 3.3-rc1 can probably be expected sometime soon.

Stable releases: The 3.1.10 stable kernel was released on January 18. This is the last stable kernel in the 3.1.x series, so users should upgrade to the 3.2 series.

In addition, the 2.6.32.54, 3.0.17, 3.1.9, and 3.2.1 stable kernels were released on January 12.

Comments (none posted)

Quotes of the week

Don't think for a minute that something won't get done just because its obviously inappropriate.
-- Casey Schaufler

So I think that line should go away entirely. It doesn't have any meaning.... I realize that I wrote it, and that it as such must be bug-free, but I suspect that removing that line is even *more* bug-free.
-- Linus Torvalds

Tracepoints are being added like the US deficit. We need to set some rules somewhere. Either by making a library that can handle small changes (like the one we are discussing, even though a memcpy should cope), or we need to put a kabosh to adding new tracepoints like they are the new fad app. Perhaps we should put the same requirements on new tracepoints as we do with new syscalls.
-- Steven Rostedt

Work on dma-buf was originally started with the goal of unifying several competing "memory management" systems developed with different ARM SoCs in mind. It would be unfortunate if restricting its use to only GPL-licensed modules caused dma-buf adoption to be limited.
-- NVIDIA's Robert Morell

Comments (2 posted)

Kernel development news

3.3 merge window part 2

By Jonathan Corbet
January 18, 2012
As of this writing, almost 8,800 non-merge changesets have been pulled into the mainline kernel for the 3.3 development cycle - 2,900 since last week's summary. The pace of the merge window clearly slowed in its second week, but there were still a number of interesting changes merged.

User-visible changes merged since last week include:

  • The kernel has gained the ability to verify RSA digital signatures. The extended verification module (EVM) makes use of this capability.

  • The slab allocator supports a new slab_max_order= boot parameter controlling the maximum size of a slab. Setting it to a larger value may increase memory efficiency at the cost of a higher probability of allocation failures. (A usage example appears after this list.)

  • The ALSA core has gained support for compressed audio on devices that are able to handle it.

  • There have been some significant changes made to the memory compaction code to avoid the lengthy stalls experienced by some users when writing data to slow devices (USB keys, for example). This problem was described in this article, but the solution has evolved considerably. By making a number of changes to how compaction works, the memory management hackers (and Mel Gorman in particular) were able to avoid disabling synchronous compaction, which had the unfortunate effect of reducing huge page usage. See this commit for a lot of information on how this problem was addressed.

  • There is a new "charger manager" subsystem intended for use with batteries that must be monitored occasionally, even when the system is suspended. The charger manager can partially resume the system as needed to poll the battery, then immediately re-suspend afterward. See Documentation/power/charger-manager.txt for more information.

  • The Btrfs balancing/restriping code has been reworked to allow a lot more flexibility in how a volume is rearranged. Restriping operations can now be paused, canceled, or resumed after a crash.

  • The audit subsystem is now supported on the ARM architecture.

  • New device drivers include:

    • Systems and processors: Renesas R8A7740 CPUs, R-Car H1 (R8A7779) processors, NetLogic DB1300 boards, Ubiquiti Networks XM (rev 1.0) boards, Atheros AP121 reference boards, and NetLogic XLP SoCs and systems.

    • Audio: Realtek ALC5632 codecs and Cirrus Logic CS42L73 codecs.

    • Block: Micron PCIe SSD cards and solid-state drives supporting the NVM Express standard.

    • Miscellaneous: TI TWL4030 battery chargers, Dialog DA9052 battery chargers, Maxim MAX8997 MUIC devices, Samsung Electronics S5M multifunction devices, and CSR SiRFprimaII DMA engines.

    • Video4Linux: Samsung S5P and EXYNOS4 G2D 2D graphics accelerators, remote controls using the Sanyo protocol, Austria Microsystems AS3645A and LM3555 flash controllers, Microtune MT2063 silicon IF tuners, Jeilin JL2005B, JL2005C, or JL2005D-based cameras, HDIC HD29L2 demodulators, and Samsung S5P/Exynos4 JPEG codecs.
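
As promised above, here is what using the new slab allocator parameter might look like. The value is an allocation order, so raising the limit to order 3 allows slabs of up to eight pages; it simply goes on the kernel command line (the kernel image name and other parameters below are placeholders, and note that the SLUB allocator has long had an analogous slub_max_order= parameter):

    linux /boot/vmlinuz-3.3 root=/dev/sda1 ro slab_max_order=3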

Changes visible to kernel developers include:

  • The memory control group naturalization patches have been merged. These patches eliminate the double-tracking of memory and, thus, substantially reduce the overhead associated with the memory controller.

  • The framebuffer device subsystem has a new FOURCC-based configuration API; see Documentation/fb/api.txt for details.

  • The Btrfs filesystem has gained an integrity checking tool that monitors traffic to the storage device and looks for operations that could leave the filesystem corrupted if the system fails at the wrong time. See the comments at the top of fs/btrfs/check-integrity.c for more information.

The 3.3-rc1 release can be expected at almost any point; after that, the stabilization process begins for the 3.3 development cycle. If the usual timing holds (and it almost always does these days), the final 3.3 kernel release can be expected in the second half of March.

Comments (none posted)

System call filtering and no_new_privs

By Jake Edge
January 18, 2012

We briefly covered a proposal for restricting system calls using the kernel packet filtering mechanism on the January 12 Kernel page, but, at that time, there hadn't been any comments on the proposal. Since then there have been several rounds of comments and revisions of the patch set, along with a revival of an older idea to let a process limit itself and its children to its current privilege level. So far, both sets of patches have received generally positive feedback, to the point where it seems like general-purpose system call filtering just might make it into the mainline sometime in the not-too-distant future.

For some time now, Will Drewry has been trying to find an acceptable way to enhance the seccomp ("secure computing") facility in the kernel so that more flexible system call filtering can be done. His target for the feature is the Chrome/Chromium web browser in order to sandbox untrusted code, but other projects (including QEMU, openssh, vsftpd, and others) have expressed interest in the feature as well. He and others have tried various approaches over the last few years without finding one that passed muster. His latest attempt, which uses the BPF (Berkeley Packet Filter) engine to filter the system calls, seems to avoid many of the problems that were noted in the earlier attempts.

The basic idea is that, instead of examining packet contents, the filters will examine system calls and any arguments passed in registers (pointers are not followed, to avoid time-of-check-to-time-of-use races). The code will only allow those calls that pass the filter tests to be executed. The filtering fails "closed", so any calls not listed in the filter, or whose arguments don't correspond to the filter rules, will return an EACCES error. The syntax for creating a filter, as described in the documentation file, is fairly painful, but Eric Paris has already started on a translator to turn a more readable form into the BPF rules needed.
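
To make that concrete, here is a rough sketch of what installing a filter might look like from user space. The prctl() interface and the constant names (SECCOMP_MODE_FILTER, SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO, struct seccomp_data) are taken from the current revision of the patch set and will only exist on a patched kernel; since the patches are still being revised, treat the details as provisional rather than as a stable API:

    #include <errno.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/syscall.h>
    #include <linux/filter.h>
    #include <linux/seccomp.h>   /* from the patched kernel tree */

    /* Each filter is a small BPF program run against a struct
       seccomp_data holding the system call number, the architecture,
       and the register-passed arguments. */
    static struct sock_filter filter[] = {
        /* Load the system call number... */
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* ...allow read(), write(), and exit_group()... */
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_read,       3, 0),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write,      2, 0),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_exit_group, 1, 0),
        /* ...and fail "closed": everything else returns EACCES.
           (A real filter would also check seccomp_data.arch.) */
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ERRNO | EACCES),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    };

    static struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    int install_filter(void)
    {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
            perror("prctl(PR_SET_SECCOMP)");
            return -1;
        }
        return 0;   /* disallowed system calls now return EACCES */
    }

In the revisions discussed below, installing a filter this way is expected to require that the process first set the no_new_privs flag (or hold a privilege like CAP_SYS_ADMIN).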

In order to avoid a longstanding problem with the interactions between binaries that can change their privileges (e.g. setuid or file-based capabilities) and mechanisms to reduce privileges for a process, Drewry's initial patch would restrict the ability of a process to make an execve() call once a filter had been installed. The problem is that privilege-changing binaries can get confused when faced with an environment with fewer privileges than are expected. That confusion can lead to privilege escalation or other security holes. This is why things like chroot(), bind mounts, and, eventually, user namespaces are restricted to root-privileged processes.

If a filtered process can't successfully call execve(), though, all of the concerns about confusing those binaries are gone. It does make using the system call filtering a little clunky, however. One would expect that a parent could set up filters and then spawn a child that would be bound by those filters, but, without a way to exec, that won't work. That can be worked around for most existing programs with some LD_PRELOAD trickery, but in the discussion another potential solution was proposed.

Andrew Lutomirski pointed to his execve_nosecurity proposal as a possible solution. That would allow processes to set a flag so that they (and their children) would be unable to call execve() and would add a new variant (called, somewhat confusingly, execve_nosecurity()) that could be used instead but would not allow any security transitions for the executed program. That means that setuid, LSM context changes, changing capabilities, and so on would not be allowed. Linus Torvalds agreed that adding a way to restrict privilege changes would be useful:

We could easily introduce a per-process flag that just says "cannot escalate privileges". Which basically just disables execve() of suid/sgid programs (and possibly other things too), and locks the process to the current privileges. And then make the rule be that *if* that flag is set, you can then filter across an execve, or chroot as a normal user, or whatever.

That led Lutomirski to propose a flag in struct task_struct called no_new_privs that would be set via the PR_SET_NO_NEW_PRIVS flag to prctl(). It would be a one-way gate as there would be no way to unset the flag. If set, the flag would restrict executing binaries in much the same way that the nosuid mount flag works. In addition, it would disallow processes changing capabilities on exec or SELinux security context transitions.
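
In code, the entire interface would be a single call. The sketch below assumes a <linux/prctl.h> that defines PR_SET_NO_NEW_PRIVS, which will only be true on a kernel carrying Lutomirski's patch:

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>  /* PR_SET_NO_NEW_PRIVS, on a patched kernel */

    int main(void)
    {
        /* A one-way gate: the flag cannot be cleared, and it is
           inherited across fork() and execve(). */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
            perror("prctl(PR_SET_NO_NEW_PRIVS)");
            return 1;
        }
        /* From here on, execve() behaves much as if every filesystem
           were mounted nosuid: no setuid/setgid, no file capabilities,
           no privilege-raising LSM transitions. */
        return 0;
    }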

But Lutomirski's patch does not implement a sandbox, as it can still be subverted via ptrace(), as Alan Cox points out. Cox was also concerned that preventing SELinux, AppArmor, or other LSMs from changing privileges could lead to other problems because those transitions may actually be changing the context to a less privileged state. Simply keeping the previous context, as Lutomirski's patch does, could lead to executing programs in a more-privileged context. But Eric Paris clarifies that SELinux, at least, will still make the same policy decision even without the transition (as it does for nosuid mounts), so that the execution will still fail if the process has the wrong context.

Lutomirski also notes that a sandbox will be much less useful if execve() has to fail when there is any kind of security transition, as Cox suggested. The presence of a policy on a particular binary would make that binary unusable from within a sandbox, no matter what the policy is. A better solution, Lutomirski said, is to set the no_new_privs bit, then set up a sandbox (using Drewry's seccomp system call filtering for example), then execute the binary, which will succeed or fail based on the actual mandatory access control (MAC) policy. That solves the problem of ptrace() and other circumvention methods as well because a sandbox requires both the no_new_privs patch and some other mechanism to filter system calls:

no_new_privs is not intended to be a sandbox at all -- it's a way to make it safe for a task to manipulate itself in a way that would allow it to subvert its own children (or itself after execve). So ptrace isn't a problem at all -- PR_SET_NO_NEW_PRIVS + chroot + ptrace is exactly as unsafe as ptrace without PR_SET_NO_NEW_PRIVS. Neither one allows privilege escalation beyond what you started with.

If you want a sandbox, call PR_SET_NO_NEW_PRIVS, then enable seccomp (or whatever) to disable ptrace, evil file access, connections on unix sockets that authenticate via uid, etc.
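
Putting the two pieces together in the order Lutomirski describes, a sandbox launcher might look something like the following sketch. It reuses the hypothetical filter program ("prog") from the earlier example and needs the same headers; path, argv, and envp describe whatever untrusted binary the sandbox wants to run:

    static int launch_sandboxed(const char *path, char *const argv[],
                                char *const envp[])
    {
        /* Order matters: set the one-way no_new_privs gate first... */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
            return -1;
        /* ...then install the system call filter... */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
            return -1;
        /* ...then run the untrusted code, now confined by the filter
           and unable to regain privileges via setuid binaries, file
           capabilities, or LSM transitions. */
        execve(path, argv, envp);
        return -1;   /* only reached if execve() fails */
    }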

Meanwhile, Drewry has been revising his patches to take advantage of no_new_privs. One of those revisions brought about some other concerns regarding whether dropping privileges should be allowed after the bit is set. Torvalds is worried that allowing privilege dropping will somehow lead to confusing other programs: "We've had security bugs that were *due* to dropped capabilities - people dropped one capability but not another, and fooled code into doing things they weren't expecting it to do." Lutomirski's patches do not restrict things like calls to setuid() because they are not meant to implement a sandbox—that's what the existing seccomp, or an enhanced version from Drewry's patches (aka seccomp mode 2) will do. As Lutomirski explains:

Another way of saying this is: no_new_privs is not a sandbox. It's just a way to make it safe for sandboxes and other such weird things processes can do to themselves safe across execve. If you want a sandbox, use seccomp mode 2, which will require you to set no_new_privs.

It's clear that Lutomirski, at least, thinks the no_new_privs changes cannot lead to the problems that Torvalds and others (notably Smack developer Casey Schaufler) are concerned about. But, any program that uses no_new_privs needs to be aware of what it does (and doesn't) do. Coupling it with a system call filtering mechanism seems like it could only increase the security of the system. But, interactions between security mechanisms often have unforeseen effects, typically resulting in security holes, so it makes sense to be cautious.

So far, these changes are still being discussed, and no subsystem maintainer has volunteered to take them, but the two proposals seem to have support that other similar ideas have lacked. Whether Lutomirski can convince the other kernel hackers that no_new_privs can't lead to other problems, or whether he needs to figure out how to stop the dropping of privileges is unclear. But it does seem like there may now be a path for an enhanced seccomp to reach the mainline.

Comments (none posted)

The future calculus of memory management

January 18, 2012

This article was contributed by Dan Magenheimer

Over the last fifty years, thousands of very bright system software engineers, including many of you reading this today, have invested large parts of their careers trying to solve a single problem: How to divide up a fixed amount of physical RAM to maximize a machine's performance across a wide variety of workloads. We can call this "the MM problem."

Because RAM has become incredibly large and inexpensive, because the ratios and topologies of CPU speeds, disk access times, and memory controllers have grown ever more complex, and because the workloads have changed dramatically and have become ever more diverse, this single MM problem has continued to offer fresh challenges and excite the imagination of kernel MM developers. But at the same time, the measure of success in solving the problem has become increasingly difficult to define.

So, although this problem has never been considered "solved", it is about to become much more complex, because those same industry changes have also brought new business computing models. Gone are the days when optimizing a single machine and a single workload was a winning outcome. Instead, dozens, hundreds, thousands, perhaps millions of machines run an even larger number of workloads. The "winners" in the future industry are those that figure out how to get the most work done at the lowest cost in this ever-growing environment. And that means resource optimization. No matter how inexpensive a resource is, a million times that small expense is a large expense. Anything that can be done to reduce that large expense, without a corresponding reduction in throughput, results in greater profit for the winners.

Some call this (disdainfully or otherwise) "cloud computing", but no matter what you call it, the trend is impossible to ignore. Assuming it is both possible and prudent to consolidate workloads, it is increasingly possible to execute those workloads more cost-effectively in certain data center environments, where the time-varying demands of the work can be statistically load-balanced to reduce the maximum number of resources required. A decade ago, studies showed that only 10% of the CPU in a typical pizza-box server was being utilized... wouldn't it be nice, they said, if we could consolidate and buy 10x fewer servers? That would not only save money on servers, but would also save a lot on power, cooling, and space. While many organizations had some success in consolidating some workloads "manually", many other workloads broke or became organizationally unmanageable when they were combined onto the same system and/or OS. As a result, scale-out has continued, and various virtualization and partitioning technologies have rapidly grown in popularity as a way to optimize CPU resources.

But let's get back to "MM", memory management. The management of RAM has not changed much to track this trend toward optimizing resources. Since "RAM is cheap", the common response to performance problems is "buy more RAM". Sadly, in this evolving world where workloads may run on different machines at different times, this classic response results in harried IT organizations buying more RAM for most or all of the machines in a data center. A further result is that the ratio of total RAM in a data center to the sum of the "working sets" of the workloads is often at least 2x and sometimes as much as 10x. This means that somewhere between half and 90% of the RAM in an average data center is wasted, which is decidedly not cheap. So the question arises: Is it possible to apply similar resource optimization techniques to RAM?

A thought experiment

Bear with me and open your imagination for the following thought experiment:

Let's assume that the next generation processors have two new instructions: PageOffline and PageOnline. When PageOffline is executed (with a physical address as a parameter), that (4K) page of memory is marked by the hardware as inaccessible and any attempts to load/store from/to that location result in an exception until a corresponding PageOnline is executed. And through some performance registers, it is possible to measure which pages are in the offline state and which are not.

Let's further assume that John and Joe are kernel MM developers and their employer "GreenCloud" is "green" and enlightened. The employer offers the following bargain to John and Joe and the thousands of other software engineers working at GreenCloud: "RAM is cheap but not free. We'd like to encourage you to use only the RAM necessary to do your job. So, for every page, on "average" over the course of the year, that you have offline, we will add one-hundredth of one cent to your end-of-year bonus. Of course, if you turn off too much RAM, you will be less efficient at getting your job done, which will reflect negatively on your year-end bonus. So it is up to you to find the right balance."

John and Joe quickly do some arithmetic: so, since my machine has 8GB of RAM, if I keep 4GB offline on average, I will be $100 richer. They quickly start scheming about how to dynamically measure their working sets and optimize page offlining.
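
(The arithmetic checks out, give or take rounding:

    4GB offline = 4 × 2^20 KB ÷ 4KB per page = 2^20 ≈ 1,048,576 pages
    2^20 pages × $0.0001 per page-year ≈ $105 per year

which John and Joe round down to an even $100.)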

But the employer goes on: "And for any individual page that you have offline for the entire year, we will double that to two-hundredths of a cent. But once you've chosen the "permanent offline" option on a page, you are stuck with that decision until the next calendar year."

John, anticipating the extra $200, decides immediately to try to shut off 4GB for the whole year. Sure, there will be some workload peaks where his machine will get into a swapstorm and he won't get any work done at all, but that will happen rarely and he can pretend he is on a coffee break when it happens. Maybe the boss won't notice.

Joe starts crafting a grander vision; he realizes that, if he can come up with a way to efficiently allow others' machines that are short on RAM capacity to utilize the RAM capacity on his machine, then the "spread" between temporary offlining and permanent offlining could create a nice RAM market that he could exploit. He could ensure that he always has enough RAM to get his job done, but dynamically "sell" excess RAM capacity to those, like John, who have underestimated their RAM needs ... at say fifteen thousandths of a cent per page-year. If he can implement this "RAM capacity sharing capability" into the kernel MM subsystem, he may be able to turn his machine into a "RAM server" and make a tidy profit. If he can do this ...

Analysis

In the GreenCloud story, we have: (1) a mechanism for offlining and onlining RAM one page at a time; (2) an incentive for using less RAM than is physically available; and (3) a market for load-balancing RAM capacity dynamically. If Joe successfully figures out a way to make his excess RAM capacity available to others and get it back when he needs it for his own work, we may have solved (at least in theory) the resource optimization problem for RAM for the cloud.

While the specifics of the GreenCloud story may not be realistic or accurate, some of the same factors do exist in the real world. In a virtual environment, "ballooning" allows individual pages to be onlined and offlined in one VM and made available to other VMs; in a bare-metal environment, the RAMster project provides a similar capability. So, though primitive and not available in all environments, we do have a mechanism. By substantially reducing the total amount of RAM across a huge number of machines in a data center, both capital outlay and power/cooling costs would be reduced, improving resource efficiency and thus potential profit. So we have an incentive and the foundation for a market.

Interestingly, the missing piece, and where this article started, is that most OS MM developers are laser-focused on the existing problem from the classic single-machine world, which is, you will recall: how to divide up a fixed amount of physical RAM to maximize a single machine's performance across a wide variety of workloads.

The future version of this problem is this: how to vary the amount of physical RAM provided by the kernel and divide it up to maximize the performance of a workload. In the past, this was irrelevant: you own the RAM, you paid for it, it's always on, so just use it. But in this different and future world with virtualization, containers, and/or RAMster, it's an intriguing problem. It will ultimately allow us to optimize the utilization of RAM, as a resource, across a data center.

It's also a hard problem, for three reasons. The first is that we can't predict, but only estimate, the future RAM demands of any workload; this is equally true today, the only difference being whether the response is "buy more RAM" or not. The second is that we need to understand the instantaneous benefit (performance) of each additional page of RAM (cost); my math is very rusty, but this reminds me of differential calculus, where "dy" is performance and "dx" is RAM size. At every point in time, increasing dx past a certain size will yield no corresponding increase in dy. Perhaps this suggests control theory more than calculus, but the needed result is a true dynamic representation of "working set" size. The third is that there is some cost for moving capacity between machines; this cost (and its impact on performance) must somehow be measured and taken into account as well.
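
One loose way to write down the second of those (my notation, nothing standard from the memory-management literature): let P(M) be the throughput of a workload given M bytes of RAM, and W its true working-set size. Then the marginal benefit of memory behaves roughly like:

    dP/dM > 0   for M < W
    dP/dM ≈ 0   for M ≥ W

The data-center-wide problem then becomes choosing each machine's M to sit just past the knee of its curve, net of the third factor above: the cost of moving capacity between machines.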

But, in my opinion, this "calculus" is the future of memory management. I have no answers and only a few ideas, but there's a lot of bright people who know memory management a lot better than I. My hope is to stimulate discussion about this very-possible future and how the kernel MM subsystem should deal with it.

Comments (35 posted)

Patches and updates

Kernel trees

Greg KH Linux 3.2.1
Con Kolivas 3.2-ck1
Greg KH Linux 3.1.9
Greg KH Linux 3.1.10
Greg KH Linux 3.0.17
Greg KH Linux 2.6.32.54

Architecture-specific

Core kernel code

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Lucas De Marchi kmod 4

Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds