
Kernel development

Brief items

Kernel release status

The current development kernel is 3.9-rc1, released on March 3. Linus said: "I don't know if it's just me, but this merge window had more 'Uhhuh' moments than I'm used to. I stopped merging a couple of times, because we had bugs that looked really scary, but thankfully each time people were on them like paparazzi on Justin Bieber." See the article below for a summary of the final changes merged during the 3.9 merge window.

Stable updates: 3.8.1, 3.4.34, and 3.0.67 were released on February 28; 3.8.2, 3.4.35, and 3.0.68 followed on March 4. The 3.2.40 update was released on March 6. All of them contain the usual mix of important fixes. Also released on March 4 was 3.5.7.7.


Kernel development news

The conclusion of the 3.9 merge window

By Jonathan Corbet
March 5, 2013
By the time that Linus released the 3.9-rc1 kernel prepatch and closed the merge window for this cycle, he had pulled a total of 10,265 non-merge changesets into the mainline repository. That is just over 2,000 changes since last week's summary. The most significant user-visible changes merged at the end of the merge window include:

  • The block I/O controller now has full hierarchical control group support.

  • The NFS code has gained network namespace support, allowing the operation of per-container NFS servers.

  • The Intel PowerClamp driver has been merged; PowerClamp allows the regulation of a CPU's power consumption through the injection of forced idle states.

  • The device mapper has gained support for a new "dm-cache" target that is able to use a fast drive (like a solid-state device) as a cache in front of slower storage devices. See Documentation/device-mapper/cache.txt for details.

  • RAID 5 and 6 support for the Btrfs filesystem has been merged at last.

  • Btrfs defragmentation code has gained snapshot awareness, meaning that sharing of data between snapshots will no longer be lost when defragmentation runs.

  • Architecture support for the Synopsys ARC and ImgTec Meta architectures has been added.

  • New hardware support includes:

    • Systems and processors: Marvell Armada XP development boards, Ralink MIPS-based system-on-chip processors, Atheros AP136 reference boards, and Google Pixel laptops.

    • Block: IBM RamSan PCIe Flash SSD devices and Broadcom BCM2835 SD/MMC controllers.

    • Display: TI LP8788 backlight controllers.

    • Miscellaneous: Kirkwood 88F6282 and 88F6283 thermal sensors, Marvell Dove thermal sensors, and Nokia "Retu" watchdog devices.

Changes visible to kernel developers include:

  • The menuconfig configuration tool now has proper "save" and "load" buttons.

  • The rework of the IDR API has been merged, simplifying code that uses IDR to generate unique integer identifiers. Users throughout the kernel tree have been updated to the new API.

  • The hlist_for_each_entry() iterator has lost the unused "pos" parameter.

At this point, the stabilization period for the 3.9 kernel has begun. If the usual pattern holds, the final 3.9 release can be expected sometime around the beginning of May.


LC-Asia: A big LITTLE MP update

By Jonathan Corbet
March 6, 2013
The ARM "big.LITTLE" architecture pairs two types of CPU — fast, power-hungry processors and slow, efficient processors — into a single package. The result is a system that can efficiently run a wide variety of workloads, but there is one little problem: the Linux kernel currently lacks a scheduler that is able to properly spread a workload across multiple types of processors. Two approaches to a solution to that problem are being pursued; a session at the 2013 Linaro Connect Asia event reviewed the current status of the more ambitious of the two.

LWN recently looked at the big.LITTLE switcher, which pairs fast and slow processors and uses the CPU frequency subsystem to switch between them. The switcher approach has the advantage of being relatively straightforward to get working, but it also has a disadvantage: only half of the CPUs in the system can be doing useful work at any given time. The switcher code has also not yet been posted for review or merging into the mainline, though a posting is said to be planned for the near future, after products using this code begin to ship.

The alternative approach has gone by the name "big LITTLE MP". Rather than play CPU frequency governor games, big LITTLE MP aims to solve the problem directly by teaching the scheduler about the differences between processor types and how to distribute tasks between them. The big.LITTLE switcher patch touches almost no files outside of the ARM architecture subtree; the big LITTLE MP patch set, instead, is focused almost entirely on the core scheduler code. At Linaro Connect Asia, developers Vincent Guittot and Morten Rasmussen described the current state of the patch set and the plans for getting it merged in the (hopefully) not-too-distant future.

The big LITTLE MP patch set has recently seen a major refactoring effort. The first version was strongly focused on the heterogeneous multiprocessing (HMP) problem but, among other things, it is hard to get developers for the rest of the kernel interested in HMP. So the new patch set aims to improve scheduling results on all systems, even traditional SMP systems where all CPUs are the same. There is a patch set that is in internal review and available on the Linaro git server. Some parts have been publicly posted recently; soon the rest should be more widely circulated as well.

[Morten and Vincent]

The new patches are working well; for almost all workloads, their performance is similar to that achieved with the old patch set. The patches were developed with a view toward simplicity: they affect a critical kernel path, so they must be both simple and fast. Some of the patches, fixes for the existing scheduler, have already been posted to the mailing lists. The rest try to augment the kernel's scheduler with three simple rules:

  • Small tasks (those that only use small amounts of CPU time for brief periods) are not worth the trouble to schedule in any sophisticated way. Instead, they should just be packed onto a single, slow core whenever they wake up, and kept there if at all possible.

  • Load balancing should be concerned with the disposition of long-running tasks only; it should simply pass over the small tasks.

  • Long-running tasks are best placed on the faster cores.
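The three rules above can be condensed into a few lines of C. This is only an illustrative sketch, not code from the patch set: the structure, the function names, and the 20% "small task" cutoff are all assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical types and names; none of this comes from the patch set. */
enum core_type { CORE_LITTLE, CORE_BIG };

struct task_info {
    unsigned int load_pct;   /* tracked CPU usage, 0..100 (assumed metric) */
};

/* Assumed cutoff below which a task is treated as "small". */
#define SMALL_TASK_PCT 20u

bool task_is_small(const struct task_info *t)
{
    return t->load_pct < SMALL_TASK_PCT;
}

/* Rules 1 and 3: pack small tasks on a slow core, run the rest on a fast one. */
enum core_type pick_core(const struct task_info *t)
{
    return task_is_small(t) ? CORE_LITTLE : CORE_BIG;
}

/* Rule 2: the load balancer passes over small tasks. */
bool balancer_should_consider(const struct task_info *t)
{
    return !task_is_small(t);
}
```

A task using 5% of a CPU would be packed onto a LITTLE core and ignored by the balancer; one using 80% would be placed on a big core and remain a candidate for migration.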

Implementing these policies requires a set of a half-dozen patches. One of them is the "small-task packing" patch that was covered here in October, 2012. Another works to expand the use of per-entity load tracking (which is currently only enabled when control groups and the CPU controller are being used) so that the per-task load values are always available to the scheduler. A further patch ensures that the "LB_MIN" scheduler feature is turned on; LB_MIN (which defaults to "off" in mainline kernels) causes the load balancer to pass over small tasks when working to redistribute the computing load on the system, essentially implementing the second policy objective above.

After that, the patch set augments the scheduler with the concept of the "capacity" of each CPU; the unloaded capacity is essentially the clock speed of the processor. The load balancer is tweaked to migrate processes to the CPU with the largest available capacity. This task is complicated by the fact that a CPU's capacity may not be a constant value; realtime scheduling, in particular, can "steal" capacity away from a CPU to give to realtime-priority tasks. Scheduler domains also need to be tuned for the big.LITTLE environment with an eye toward reducing the periodic load balancing work that needs to be done.
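As a rough illustration of the capacity idea, the sketch below picks the CPU with the most capacity left over after realtime and ordinary load are subtracted. The structure layout, field names, and units are assumptions for the example; the real patches work inside the scheduler's own data structures.

```c
#include <stddef.h>

/* Hypothetical per-CPU bookkeeping, all in the same arbitrary units. */
struct cpu_state {
    unsigned long capacity;    /* unloaded capacity (roughly clock speed) */
    unsigned long rt_stolen;   /* capacity taken by realtime tasks */
    unsigned long load;        /* capacity used by ordinary tasks */
};

/* Capacity still available on a CPU, clamped at zero. */
unsigned long available_capacity(const struct cpu_state *c)
{
    unsigned long used = c->rt_stolen + c->load;

    return used >= c->capacity ? 0 : c->capacity - used;
}

/* Index of the CPU with the largest available capacity, or -1 if n is 0. */
int pick_target_cpu(const struct cpu_state *cpus, size_t n)
{
    int best = -1;
    unsigned long best_avail = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long avail = available_capacity(&cpus[i]);

        if (best < 0 || avail > best_avail) {
            best = (int)i;
            best_avail = avail;
        }
    }
    return best;
}
```

Note that a nominally fast CPU can lose out to a slow one here if realtime tasks have taken most of its capacity, which is the complication described above.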

The final piece is not yet complete; it is called "scale invariance." Currently, the "load" put on the system by a process is a function of the amount of time that process spends running on the CPU. But if some CPUs are faster than others, the same process could end up with radically different load values depending on which CPU it is actually running on. That is suboptimal; the actual amount of work the process needs to do is the same in either case, and varying load values can cause the scheduler to make poor decisions. For now, the problem is likely to be solved by scaling the scheduler's load calculations by a constant value associated with each processor. Processes running on a CPU that is ten times faster than another will accumulate load ten times more quickly.
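The constant-factor scaling described above might look something like the following sketch. The function name and the convention that the fastest CPU has a capacity of 1024 are assumptions for illustration, not details taken from the patches.

```c
#include <stdint.h>

/* Assumed convention: the fastest CPU in the system has capacity 1024. */
#define CAPACITY_SCALE 1024u

/*
 * Load contributed by runtime_ns of execution on a CPU with the given
 * relative capacity.  Time spent on a slower CPU counts for less.
 */
uint64_t scaled_load(uint64_t runtime_ns, uint32_t cpu_capacity)
{
    return runtime_ns * cpu_capacity / CAPACITY_SCALE;
}
```

With this scaling, a job that needs 10ms on a full-speed CPU (capacity 1024) and 20ms on a half-speed CPU (capacity 512) yields the same load value in both cases, so the scheduler sees the same amount of work regardless of where the task happened to run.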

Even then, the load calculations are not perfect for the HMP scheduling problem because they are scaled by the process's priority. A high-priority task that runs briefly can look like it is generating as much load as a low-priority task that runs for long periods, but the scheduler may want to place those processes in different ways. The best solution to this problem is not yet clear.

A question from the audience had to do with testing: how were the developers testing their scheduling decisions? In particular, was the Linsched testing framework being used? The answer is that no, Linsched is not being used. It has not seen much development work since it was posted for the 3.3 kernel, so it does not work with current kernels. Perhaps more importantly, its task representation is relatively simple; it is hard to present it with something resembling a real-world Android workload. It is easier, in the end, to simply monitor a real kernel with an actual Android workload and see how well it performs.

The plan seems to be to post a new set of big LITTLE MP patches in the near future with an eye toward getting them upstream. The developers are a little concerned about that; getting reviewer attention for these patches has proved to be difficult thus far. Perhaps persistence and a more general focus will help them to get over that obstruction, clearing the way for proper scheduling on heterogeneous multiprocessor systems in the not-too-distant future.

[Your editor would like to thank Linaro for travel assistance to attend this event.]


Simplifying RCU

March 6, 2013

This article was contributed by Paul McKenney

Read-copy update (RCU) is a synchronization mechanism in the Linux kernel that allows extremely efficient and scalable handling of read-mostly data. Although RCU is quite effective where it applies, there have been some concerns about its complexity. One way to simplify something is to eliminate part of it, which is what is being proposed for RCU.

One source of RCU's complexity is that the kernel contains no fewer than four RCU implementations, not counting the three other special-purpose RCU flavors (sleepable RCU (SRCU), RCU-bh, and RCU-sched, which are covered here). The four vanilla implementations are selected by the SMP and PREEMPT kernel configuration parameters:

  1. !SMP && !PREEMPT: TINY_RCU, which is used for embedded systems with tiny memories (tens of megabytes).
  2. !SMP && PREEMPT: TINY_PREEMPT_RCU, for deep sub-millisecond realtime response on small-memory systems.
  3. SMP && !PREEMPT: TREE_RCU, which is used for high performance and scalability on server-class systems where scheduling latencies in milliseconds are acceptable.
  4. SMP && PREEMPT: TREE_PREEMPT_RCU, which is used for systems requiring high performance, scalability, and deep sub-millisecond response.
Quick Quiz 1: Since when is ten megabytes of memory small???
Answer

The purpose of these four implementations is to cover Linux's wide range of hardware configurations and workloads. However, although TINY_RCU, TREE_RCU, and TREE_PREEMPT_RCU are heavily used for their respective use cases, TINY_PREEMPT_RCU's memory footprint is not all that much smaller than that of TREE_PREEMPT_RCU, especially when you consider that PREEMPT itself expands the kernel's memory footprint. All of those preempt_disable() and preempt_enable() invocations now generate real code.

The size for TREE_PREEMPT_RCU compiled for x86_64 is as follows:

   text    data     bss     dec     hex filename
   1541     385       0    1926     786 /tmp/b/kernel/rcupdate.o
  18060    2787      24   20871    5187 /tmp/b/kernel/rcutree.o

That for TINY_PREEMPT_RCU is as follows:

   text    data     bss     dec     hex filename
   1205     337       0    1542     606 /tmp/b/kernel/rcupdate.o
   3499     212       8    3719     e87 /tmp/b/kernel/rcutiny.o

If you really have limited memory, you will instead want TINY_RCU:

   text    data     bss     dec     hex filename
    963     337       0    1300     514 /tmp/b/kernel/rcupdate.o
   1869      90       0    1959     7a7 /tmp/b/kernel/rcutiny.o

This points to the possibility of dispensing with TINY_PREEMPT_RCU because the difference in size is not enough to justify its existence.

Quick Quiz 2: Hey!!! I use TINY_PREEMPT_RCU! What about me???
Answer

Of course, this needs to be done in a safe and sane way. Until someone comes up with that, I am taking the following approach:

  1. Poll LKML for objections (done: the smallest TINY_PREEMPT_RCU system had 128 megabytes of memory, which is enough that the difference between TREE_PREEMPT_RCU and TINY_PREEMPT_RCU is 0.01% of memory, namely, down in the noise).
  2. Update RCU's Kconfig to once again allow TREE_PREEMPT_RCU to be built on !SMP systems (available in 3.9-rc1 or by applying this patch for older versions).
  3. Alert LWN's readers to this change (you are reading it!).
  4. Allow time for testing and for addressing any issues that might be uncovered.
  5. If no critical problems are uncovered, remove TINY_PREEMPT_RCU, which is currently planned for 3.11.
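The 0.01% figure in step 1 can be checked against the "dec" columns of the size listings above. The helper below is just a convenience for the arithmetic; its name is made up for this example.

```c
/* Totals of the "dec" columns from the size listings (x86_64 builds). */
enum {
    TREE_PREEMPT_BYTES = 1926 + 20871,
    TINY_PREEMPT_BYTES = 1542 + 3719,
    TINY_BYTES         = 1300 + 1959
};

/* Size difference between two builds as a percentage of mem_mib MiB. */
double footprint_delta_pct(long bigger, long smaller, long mem_mib)
{
    double mem_bytes = (double)mem_mib * 1024.0 * 1024.0;

    return 100.0 * (double)(bigger - smaller) / mem_bytes;
}
```

The difference between TREE_PREEMPT_RCU and TINY_PREEMPT_RCU comes to 17,536 bytes, which is roughly 0.013% of a 128-megabyte system.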

Note that the current state of Linus's tree once again allows a choice of RCU implementation in the !SMP && PREEMPT case: either TINY_PREEMPT_RCU or TREE_PREEMPT_RCU. This is a transitional state whose purpose is to allow an easy workaround should there be a bug in TREE_PREEMPT_RCU on uniprocessor systems. From 3.11 forward, the choice of RCU implementation will be forced by the values selected for SMP and PREEMPT, once again adhering to the dictum of No Unnecessary Knobs.
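The mapping from configuration to implementation, before and after the planned removal, can be written out as a small function. This is a user-space model for illustration only; the real selection is made in Kconfig, and the enum and function names here are invented.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the four Kconfig choices. */
enum rcu_impl {
    RCU_TINY,
    RCU_TINY_PREEMPT,
    RCU_TREE,
    RCU_TREE_PREEMPT
};

/*
 * Model of the forced choice.  With tiny_preempt_removed set, the
 * !SMP && PREEMPT case falls through to TREE_PREEMPT_RCU, as planned
 * for 3.11.
 */
enum rcu_impl select_rcu(bool smp, bool preempt, bool tiny_preempt_removed)
{
    if (smp)
        return preempt ? RCU_TREE_PREEMPT : RCU_TREE;
    if (!preempt)
        return RCU_TINY;
    return tiny_preempt_removed ? RCU_TREE_PREEMPT : RCU_TINY_PREEMPT;
}
```

The model does not capture the transitional state, in which either implementation may be chosen by hand for !SMP && PREEMPT builds.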

If all goes well, this change will remove about 1,000 lines of code from the Linux kernel, which is a worthwhile reduction in complexity. So, if you currently use TINY_PREEMPT_RCU, please go forth and test TREE_PREEMPT_RCU on your hardware and workloads.

Acknowledgments

I owe thanks to Josh Triplett for suggesting this approach, and to Jon Corbet and Linus Torvalds for further motivating it. I am grateful to Jim Wasko for his support of this effort.

Answers to Quick Quizzes

Quick Quiz 1: Since when is ten megabytes of memory small???

Answer: As near as I can remember, Rip, since some time in the early 1990s.


Quick Quiz 2: Hey!!! I use TINY_PREEMPT_RCU! What about me???

Answer: Please download Linus's current git tree (or 3.9-rc1 or later) and test TREE_PREEMPT_RCU, reporting any problems you encounter. Alternatively, try disabling PREEMPT, thus switching to TINY_RCU for an even smaller memory footprint, relying on improvements in the non-realtime kernel's latencies. Either way, silence will be interpreted as assent!



Namespaces in operation, part 6: more on user namespaces

By Michael Kerrisk
March 6, 2013

In this article, we continue last week's discussion of user namespaces. In particular, we look in more detail at the interaction of user namespaces and capabilities as well as the combination of user namespaces with other types of namespaces. For the moment at least, this article will conclude our series on namespaces.

User namespaces and capabilities

Each process is associated with a particular user namespace. A process created by a call to fork() or a call to clone() without the CLONE_NEWUSER flag is placed in the same user namespace as its parent process. A process can change its user-namespace membership using setns(), if it has the CAP_SYS_ADMIN capability in the target namespace; in that case, it obtains a full set of capabilities upon entering the target namespace.

On the other hand, a clone(CLONE_NEWUSER) call creates a new user namespace and places the new child process in that namespace. This call also establishes a parental relationship between the two namespaces: each user namespace (other than the initial namespace) has a parent—the user namespace of the process that created it using clone(CLONE_NEWUSER). A parental relationship between user namespaces is also established when a process calls unshare(CLONE_NEWUSER). The difference is that unshare() places the caller in the new user namespace, and the parent of that namespace is the caller's previous user namespace. As we'll see in a moment, the parental relationship between user namespaces is important because it defines the capabilities that a process may have in a child namespace.
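The fork() case is easy to check from user space by comparing the /proc/self/ns/user link in a parent and its child. The following is a minimal sketch, assuming a kernel that provides that file (Linux 3.8 or later); the function names are invented for this example and are not part of the programs in this series.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Copy the caller's user-namespace identifier into buf; 0 on success. */
int read_userns_id(char *buf, size_t len)
{
    ssize_t n = readlink("/proc/self/ns/user", buf, len - 1);

    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/*
 * Returns 1 if a fork()ed child reports the same user namespace as its
 * parent, 0 if it does not, and -1 if the check could not be performed.
 */
int fork_keeps_userns(void)
{
    char parent_id[64], child_id[64];
    int status;
    pid_t pid;

    if (read_userns_id(parent_id, sizeof(parent_id)) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: compare its own identifier with the inherited copy. */
        int same = read_userns_id(child_id, sizeof(child_id)) == 0 &&
                   strcmp(parent_id, child_id) == 0;
        _exit(same ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Had the child been created with clone(CLONE_NEWUSER) instead, the two identifiers would be expected to differ.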

Each process also has three associated sets of capabilities: permitted, effective, and inheritable. The capabilities(7) manual page describes these three sets in some detail. In this article, it is mainly the effective capability set that is of interest to us. This set determines a process's ability to perform privileged operations.

User namespaces change the way in which (effective) capabilities are interpreted. First, having a capability inside a particular user namespace allows a process to perform operations only on resources governed by that namespace; we say more on this point below, when we talk about the interaction of user namespaces with other types of namespaces. In addition, whether or not a process has capabilities in a particular user namespace depends on its namespace membership and the parental relationship between user namespaces. The rules are as follows:

  1. A process has a capability inside a user namespace if it is a member of the namespace and that capability is present in its effective capability set. A process may obtain capabilities in its effective set in a number of ways. The most common reasons are that it executed a program that conferred capabilities (a set-user-ID program or a program that has associated file capabilities) or it is the child of a call to clone(CLONE_NEWUSER), which automatically obtains a full set of capabilities.
  2. If a process has a capability in a user namespace, then it has that capability in all child (and further removed descendant) namespaces as well. Put another way: creating a new user namespace does not isolate the members of that namespace from the effects of privileged processes in a parent namespace.
  3. When a user namespace is created, the kernel records the effective user ID of the creating process as being the "owner" of the namespace. A process whose effective user ID matches that of the owner of a user namespace and which is a member of the parent namespace has all capabilities in the namespace. By virtue of the previous rule, those capabilities propagate down into all descendant namespaces as well. This means that after creation of a new user namespace, other processes owned by the same user in the parent namespace have all capabilities in the new namespace.
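The three rules can be modeled in a few lines of C by walking from the target namespace up through its ancestors. This is a simplified user-space model written for this discussion, not kernel code: the structures track only a single capability, and all of the names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a user namespace. */
struct user_ns {
    const struct user_ns *parent;  /* NULL for the initial namespace */
    unsigned int owner_euid;       /* effective UID of the creator */
};

/* Simplified model of a process, tracking one capability only. */
struct process {
    const struct user_ns *ns;      /* namespace the process belongs to */
    unsigned int euid;             /* effective user ID */
    bool cap_effective;            /* capability is in the effective set */
};

/* Does process p have the capability in the target namespace? */
bool has_cap_in(const struct process *p, const struct user_ns *target)
{
    const struct user_ns *ns;

    for (ns = target; ns != NULL; ns = ns->parent) {
        /* Rule 1; reached via an ancestor, this is also rule 2. */
        if (p->ns == ns)
            return p->cap_effective;
        /* Rule 3: p lives in the parent of ns and owns ns. */
        if (p->ns == ns->parent && p->euid == ns->owner_euid)
            return true;
    }
    return false;
}
```

A process in a sibling namespace never matches either test during the walk, so it ends up with no capabilities in the target, however many it holds in its own namespace.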

We can demonstrate the third rule with the help of a small program, userns_setns_test.c. This program takes one command-line argument: the pathname of a /proc/PID/ns/user file that identifies a user namespace. It creates a child in a new user namespace and then both the parent (which remains in the same user namespace as the shell that was used to invoke the program) and the child attempt to join the namespace specified on the command line using setns(); as noted above, setns() requires that the caller have the CAP_SYS_ADMIN capability in the target namespace.

For our demonstration, we use this program in conjunction with the userns_child_exec.c program developed in the previous article in this series. First, we use that program to start a shell (we use ksh, simply to create a distinctively named process) running in a new user namespace:

    $ id -u
    1000
    $ readlink /proc/$$/ns/user       # Obtain ID for initial namespace
    user:[4026531837]
    $ ./userns_child_exec -U -M '0 1000 1' -G '0 1000 1' ksh
    ksh$ echo $$                      # Obtain PID of shell
    528
    ksh$ readlink /proc/$$/ns/user    # This shell is in a new namespace
    user:[4026532318]

Now, we switch to a separate terminal window, to a shell running in the initial namespace, and run our test program:

    $ readlink /proc/$$/ns/user       # Verify that we are in parent namespace
    user:[4026531837]
    $ ./userns_setns_test /proc/528/ns/user
    parent: readlink("/proc/self/ns/user") ==> user:[4026531837]
    parent: setns() succeeded

    child:  readlink("/proc/self/ns/user") ==> user:[4026532319]
    child:  setns() failed: Operation not permitted

The following diagram shows the parental relationships between the various processes (black arrows) and namespaces (blue arrows) that have been created:

[A user namespace hierarchy]

Looking at the output of the readlink commands at the start of each shell session, we can see that the parent process created when the userns_setns_test program was run is in the initial user namespace (4026531837). (As noted in an earlier article in this series, these numbers are i-node numbers for symbolic links in the /proc/PID/ns directory.) As such, by rule three above, since the parent process had the same effective user ID (1000) as the process that created the new user namespace (4026532318), it had all capabilities in that namespace, including CAP_SYS_ADMIN; thus the setns() call in the parent succeeds.

On the other hand, the child process created by userns_setns_test is in a different namespace (4026532319)—in effect, a sibling namespace of the namespace where the ksh process is running. As such, the second of the rules described above does not apply, because that namespace is not an ancestor of namespace 4026532318. Thus, the child process does not have the CAP_SYS_ADMIN capability in that namespace and the setns() call fails.

Combining user namespaces with other types of namespaces

Creating namespaces other than user namespaces requires the CAP_SYS_ADMIN capability. On the other hand, creating a user namespace requires (since Linux 3.8) no capabilities, and the first process in the namespace gains a full set of capabilities (in the new user namespace). This means that that process can now create any other type of namespace using a second call to clone().

However, this two-step process is not necessary. It is also possible to include additional CLONE_NEW* flags in the same clone() (or unshare()) call that employs CLONE_NEWUSER to create the new user namespace. In this case, the kernel guarantees that the CLONE_NEWUSER flag is acted upon first, creating a new user namespace in which the to-be-created child has all capabilities. The kernel then acts on all of the remaining CLONE_NEW* flags, creating corresponding new namespaces and making the child a member of all of those namespaces.

Thus, for example, an unprivileged process can make a call of the following form to create a child process that is a member of both a new user namespace and a new UTS namespace:

    clone(child_func, stackp, CLONE_NEWUSER | CLONE_NEWUTS, arg);

We can use our userns_child_exec program to perform a clone() call equivalent to the above and execute a shell in the child process. The following command specifies the creation of a new UTS namespace (-u), and a new user namespace (-U) in which both user and group ID 1000 are mapped to 0:

    $ uname -n           # Display hostname for later reference
    antero
    $ ./userns_child_exec -u -U -M '0 1000 1' -G '0 1000 1' bash

As expected, the shell process has a full set of permitted and effective capabilities:

    $ id -u              # Show effective user and group ID of shell
    0
    $ id -g
    0
    $ cat /proc/$$/status | egrep 'Cap(Inh|Prm|Eff)'
    CapInh: 0000000000000000
    CapPrm: 0000001fffffffff
    CapEff: 0000001fffffffff

In the above output, the hexadecimal value 1fffffffff represents a capability set in which all 37 of the currently available Linux capabilities are enabled.
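The count is easy to verify by parsing the hexadecimal mask and counting the bits that are set. The helper below is a small sketch written for this purpose; its name is invented for the example.

```c
#include <stdlib.h>

/* Count the capabilities enabled in a hexadecimal mask such as CapEff. */
int count_caps(const char *hex_mask)
{
    unsigned long long mask = strtoull(hex_mask, NULL, 16);
    int n = 0;

    while (mask) {
        mask &= mask - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

Passing the CapEff value shown above, "0000001fffffffff", gives 37, while the all-zero CapInh value gives 0.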

We can now go on to modify the hostname—one of the global resources isolated by UTS namespaces—using the standard hostname command; that operation requires the CAP_SYS_ADMIN capability. First, we set the hostname to a new value, and then we review that value with the uname command:

    $ hostname bizarro     # Update hostname in this UTS namespace
    $ uname -n             # Verify the change
    bizarro

Switching to another terminal window—one that is running in the initial UTS namespace—we then check the hostname in that UTS namespace:

    $ uname -n             # Hostname in original UTS namespace is unchanged
    antero

From the above output, we can see that the change of hostname in the child UTS namespace is not visible in the parent UTS namespace.

Capabilities revisited

Although the kernel grants all capabilities to the initial process in a user namespace, this does not mean that process then has superuser privileges within the wider system. (It may, however, mean that unprivileged users now have access to exploits in kernel code that was formerly accessible only to root, as this mail on a vulnerability in tmpfs mounts notes.) When a new IPC, mount, network, PID, or UTS namespace is created via clone() or unshare(), the kernel records the user namespace of the creating process against the new namespace. Whenever a process operates on global resources governed by a namespace, permission checks are performed according to the process's capabilities in the user namespace that the kernel associated with that namespace.

For example, suppose that we create a new user namespace using clone(CLONE_NEWUSER). The resulting child process will have a full set of capabilities in the new user namespace, which means that it will, for example, be able to create other types of namespaces and be able to change its user and group IDs to other IDs that are mapped in the namespace. (In the previous article in this series, we saw that only a privileged process in the parent user namespace can create mappings to IDs other than the effective user and group ID of the process that created the namespace, so there is no security loophole here.)

On the other hand, the child process would not be able to mount a filesystem. The child process is still in the initial mount namespace, and in order to mount a filesystem in that namespace, it would need to have capabilities in the user namespace associated with that mount namespace (i.e., it would need capabilities in the initial user namespace), which it does not have. Analogous statements apply for the global resources isolated by IPC, network, PID, and UTS namespaces.

Furthermore, the child process would not be able to perform privileged operations that require capabilities that are not (currently) governed by namespaces. Thus, for example, the child could not do things such as raising its hard resource limits, setting the system time, setting process priorities, loading kernel modules, or rebooting the system. All of those operations require capabilities that sit outside the user namespace hierarchy; in effect, those operations require that the caller have capabilities in the initial user namespace.

By isolating the effect of capabilities to namespaces, user namespaces thus deliver on the promise of safely allowing unprivileged users access to functionality that was formerly limited to the root user. This in turn creates interesting possibilities for new kinds of user-space applications. For example, it now becomes possible for unprivileged users to run Linux containers without root privileges, to construct Chrome-style sandboxes without the use of set-user-ID-root helpers, to implement fakeroot-type applications without employing dynamic-linking tricks, and to implement chroot()-based applications for process isolation. Barring kernel bugs, applications that employ user namespaces to access privileged kernel functionality are more secure than traditional applications based on set-user-ID-root: with a user-namespace-based approach, even if an application is compromised, it does not have any privileges that can be used to do damage in the wider system.

The author would like to thank Eric Biederman for answering many questions that came up as he experimented with namespaces during the course of writing this article series.



Page editor: Jonathan Corbet

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds