
Kernel development

Brief items

Kernel release status

The current development kernel is 3.16-rc6, released on July 20. Linus is starting to think that things are still too active:

Anyway, rc6 still isn't all *that* big, so I'm not exactly worried, but I am getting to the point where I'm going to start calling people names and shouting at you if you send me stuff that isn't appropriate for the late rc releases. Which is not to say that people did: while rc6 is bigger than I wished for, I don't think there's too much obviously frivolous in there. But I'll be keeping an eye out, and I'll be starting to get grumpy (or grumpiER) if I notice that people aren't being serious about trying to calm things down.

Stable updates: 3.15.6, 3.14.13, 3.10.49, and 3.4.99 were released on July 17.

Comments (none posted)

Quotes of the week

If protected thingy cannot possibly be diddled concurrently, you'll know what to do, if it can be, you'll know what to do. If you can't figure it out, move on to the next windmill until the last one on the planet has been tilted or bounced off of.
Mike Galbraith

The Maintainer of 'All Evil' has an interesting ring to it.
Konrad Rzeszutek Wilk

Comments (none posted)

Faults in Linux 2.6

Six researchers (including Julia Lawall of the Coccinelle project) have just released a paper [PDF] (abstract) that looks at the faults in the 2.6 kernel. "In August 2011, Linux entered its third decade. Ten years before, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1. A major result of their work was that the drivers directory contained up to 7 times more of certain kinds of faults than other directories. This result inspired numerous efforts on improving the reliability of driver code. Today, Linux is used in a wider range of environments, provides a wider range of services, and has adopted a new development and release model. What has been the impact of these changes on code quality? To answer this question, we have transported Chou et al.'s experiments to all versions of Linux 2.6, released between 2003 and 2011. We find that Linux has more than doubled in size during this period, but the number of faults per line of code has been decreasing. Moreover, the fault rate of drivers is now below that of other directories, such as arch. These results can guide further development and research efforts for the decade to come. To allow updating these results as Linux evolves, we define our experimental protocol and make our checkers available." (Thanks to Asger Alstrup Palm.)

Comments (30 posted)

Kernel development news

A reworked BPF API

By Jonathan Corbet
July 23, 2014
Regular LWN readers will be, by now, well aware of the fact that the kernel's Berkeley Packet Filter (BPF) virtual machine is in the middle of a rapid development phase, moving beyond packet filtering into a number of other roles. "Extending extended BPF," published at the beginning of July, covered many of the changes that are in the works for an upcoming kernel release. The patch set covered there has evolved considerably since the article was written; the basic functionality is the same, but the API is not. So another look seems warranted.

The version 2 patch set posted by Alexei Starovoitov retains many of the features of the first version. It still adds a single bpf() system call providing a number of new functions. Among those are the ability to load BPF programs, of course, though there is still no way to directly run these programs from user space. In the old patch set, though, BPF programs were represented by numeric IDs in a global namespace. That feature is now gone. Instead, the new interface to load a program looks like this:

    int bpf(BPF_PROG_LOAD, enum bpf_prog_type prog_type, struct nlattr *attr,
            int attr_len);

As before, there is only one "program type" defined: BPF_PROG_TYPE_UNSPEC, and the actual program is to be found in the attr array. That array, as before, must also contain an attribute describing the license that applies to the loaded program. Unlike the previous version, version 2 does not prohibit the loading of non-GPL-compatible programs. It does, however, allow functions "exported" to BPF programs from the kernel to be marked GPL-only; non-GPL-compatible programs that attempt to call such a function will fail to load.

The attr array can also contain a special "fixup" section; this feature will be discussed momentarily.

What's missing from the above call, relative to the first version, is the prog_id parameter specifying the global ID number to use. There is no longer any need for an application to specify such an ID; instead, the kernel tracks programs internally and, whenever a program is loaded, an associated file descriptor is allocated and returned to user space. The application can then use that descriptor to refer to the loaded BPF program, which will remain in the kernel for as long as the file descriptor is held open. There is, thus, no longer any need for an explicit "unload program" operation; instead, the application need only close the file descriptor.

The "maps" feature from version 1 has also been retained, but, again, the global ID numbers are gone. When a map is created (using the BPF_MAP_CREATE command to the bpf() system call), a file descriptor is once again returned to the calling process. That descriptor can be used to store values in the map, query values, iterate through the map, and so on. Once again, the map will continue to exist for as long as the file descriptor remains open.

The removal of the global program and map ID namespaces eliminates a whole set of potential problems, including excessive resource usage if programs leave loaded BPF resources lying around after they exit and possible ID number conflicts. In the end, global IDs are reminiscent of the System V IPC API, and that is not something that everybody wants to be reminded of. But it does raise an interesting question: how do loaded BPF programs refer to maps?

In the previous version of the patch set, using a map to communicate between a BPF program and user space was straightforward; the two sides just had to agree on the proper ID number(s). In the absence of a global ID, an application can refer to a BPF map using the file descriptor passed back from the kernel. But file descriptors only have a meaning in the context of a specific process, and BPF programs do not run in any sort of process context. So the file descriptor cannot be used on the BPF side.

Alexei's solution is to add a "fixup array" to the process of loading a BPF program. This array contains one or more instances of this structure:

    struct bpf_map_fixup {
	int insn_idx;
	int fd;
    };

The array is passed to the BPF_PROG_LOAD operation in the attr argument. As part of the loading process, the kernel will iterate through the array. For each entry, insn_idx is expected to be the offset within the program of a function call instruction that makes use of a BPF map; the actual map to be passed to that function is represented by fd. The program loader will convert fd into an internal representation that is available to BPF programs, then modify the indicated instruction accordingly. In this way, the process-specific file descriptor numbers are removed from the program, replaced by internal IDs.
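The loader's fixup pass can be illustrated with a small Python model. The instruction encoding, function names, and internal map IDs below are invented for demonstration; they are not the kernel's actual representation:

```python
def apply_fixups(insns, fixups, fd_to_map_id):
    """Rewrite map-using call instructions: replace the process-local
    file-descriptor operand with a kernel-internal map ID."""
    for fix in fixups:
        idx, fd = fix["insn_idx"], fix["fd"]
        op, _fd_operand = insns[idx]
        if op != "call_map_fn":
            raise ValueError("fixup entry does not point at a map call")
        insns[idx] = (op, fd_to_map_id[fd])
    return insns

# A tiny "program": instruction 1 calls a map function with fd 5 as operand.
prog = [("mov", 0), ("call_map_fn", 5), ("exit", 0)]
fixed = apply_fixups(prog, [{"insn_idx": 1, "fd": 5}], {5: 1001})
print(fixed[1])  # ('call_map_fn', 1001)
```

After the pass, the program contains no file-descriptor numbers at all, which is why it can later run outside any process context.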

This solution may strike some readers as being a bit inelegant. For the most part, the BPF virtual machine knows nothing about maps; they are implemented using the "external function" mechanism. Indeed, for the map functionality to be available in any specific context (when running socket filter programs, for example), the kernel code setting up that context must include a fair amount of boilerplate code exporting the map functions to BPF programs. This design allows maps to be implemented with no direct support from the virtual machine; there are no map-specific BPF instructions, for example.

The addition of the fixup array wrecks that separation, pushing an awareness of maps (and how they are represented) into the core of the BPF program loader. This solution works, but one can't help but wonder if it might not be better just to implement map operations directly as BPF instructions. Then the program loader could recognize those instructions and replace the file-descriptor numbers automatically; user-space programs would not have to track the index of every operation that uses maps and set up a proper fixup array operation.

As of this writing, though, nobody else has raised such an objection; commentary on this version of the patch set has been quiet in general. That silence suggests that, as a whole, reviewers are relatively happy with what is there. Unless something changes in the near future, it seems likely that a version of this patch set will be put forward for the 3.17 merge window.

Comments (4 posted)

A system call for random numbers: getrandom()

By Jake Edge
July 23, 2014

The Linux kernel already provides several ways to get random numbers, each with its own set of constraints. But those constraints may make it impossible for a process to get random data when it needs it. The LibreSSL project has recently been making some noise about the lack of a "safe" way to get random data under Linux. That has led Ted Ts'o to propose a new getrandom() system call that would provide LibreSSL with what it needs, while also solving other kernel random number problems along the way.

The kernel maintains random number "pools" that get fed data that comes from sampling unpredictable events (e.g. inter-key timing from the keyboard). The amount of entropy contributed by each of these events is estimated and tracked. A cryptographically secure pseudo-random number generator (PRNG) is used on the data in the pools, which then feed two separate devices: /dev/urandom and /dev/random.

The standard way to get random numbers from the kernel is by reading from the /dev/urandom device. But there is also the /dev/random device that will block until enough entropy has been collected to satisfy the read() request. /dev/urandom should be used for essentially all random numbers required, but /dev/random is sometimes used for things like extremely sensitive, long-lived keys (e.g. GPG) or one-time pads. In order to use either one, though, an application has to be able to open() a file, which requires that there be file descriptors available. It also means that the application has to be able to see the device files, which may not be the case in some containers or chroot() environments.
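The file-descriptor requirement is visible in the usual read path; here is the traditional approach, shown in Python for illustration:

```python
# Reading random bytes the traditional way requires opening a device
# file: the step that fails when descriptors are exhausted or when the
# device node is not visible inside a container or chroot.
with open("/dev/urandom", "rb") as f:   # needs a free fd and a visible node
    key = f.read(16)
print(len(key))  # 16
```

If either `open()` precondition fails, an application has no kernel-provided fallback, which is exactly the gap LibreSSL has been complaining about.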

LibreSSL has been written to use /dev/urandom, but also to have a fallback if there is an exhaustion of file descriptors (which an attacker might try to arrange) or there is some other reason that the library can't open the file. The fallback is to use the deprecated sysctl() system call to retrieve the /proc/sys/kernel/random/uuid value, but without actually having to open that file (since LibreSSL already knows that /dev/urandom could not be opened). But sysctl() may disappear someday—some distribution kernels have already removed it—and, sometimes, using it puts a warning into the kernel log. If the sysctl() fails, LibreSSL falls further back to a scary-looking function that tries to generate its own random numbers from various (hopefully) unpredictable values available to user space (e.g. timestamps, PID numbers, etc.).

All of that can be seen in a well-commented chunk of code in LibreSSL's getentropy_linux.c file. The final comment in that section makes a request:

* We hope this demonstrates that Linux should either retain their
* sysctl ABI, or consider providing a new failsafe API which
* works in a chroot or when file descriptors are exhausted.
*/

That new API is precisely what Ts'o has proposed. The getrandom() system call is well-described in his patch (now up to version 4). It is declared as follows:

    #include <linux/random.h>

    int getrandom(void *buf, size_t buflen, unsigned int flags);

A call will fill buf with up to buflen bytes of random data that can be used for cryptographic purposes, returning the number of bytes stored. As might be guessed, the flags parameter will alter the behavior of the call. In the case where flags == 0, getrandom() will block until the /dev/urandom pool has been initialized. If flags is set to GRND_NONBLOCK, then getrandom() will return -1 with an error number of EAGAIN if the pool is not initialized.

The GRND_RANDOM flag bit can be used to switch to the /dev/random pool, subject to the entropy requirements of that pool. That means the call will block until the pool has the required entropy, unless the GRND_NONBLOCK bit is also present in flags, in which case it will return as many bytes as it can; it will return -1 for an error with errno set to EAGAIN if there is no entropy available at all. The call returns the number of bytes it placed into buf (or -1 for an error). Short reads can occur due to a lack of entropy for the /dev/random pool or because the call was interrupted by a signal, but reads of 256 bytes or less from /dev/urandom are guaranteed to return the full request once that device has been initialized.
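Assuming a system where the call is available (Python's os module exposes it on Linux kernels that have it), usage looks like this; the sketch assumes the urandom pool is already initialized:

```python
import os

# With flags == 0, getrandom() blocks only until the urandom pool has
# been initialized, then returns the requested bytes.
data = os.getrandom(16)
assert len(data) == 16

# GRND_NONBLOCK turns "pool not ready" into an error rather than
# blocking; on a long-running system it succeeds immediately.
data = os.getrandom(16, os.GRND_NONBLOCK)
print(len(data))  # 16
```

Note that no file descriptor is involved at any point, so neither descriptor exhaustion nor a missing device node can get in the way.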

In the proposed man page that accompanies the patch, Ts'o shows sample code that could be used to emulate the OpenBSD getentropy() system call using getrandom(). One complaint about the patch came from Christoph Hellwig, who was concerned that Ts'o was not just implementing "exactly the same system call" as OpenBSD. He continued: "Having slightly different names and semantics for the same functionality is highly annoying." But Ts'o is trying to solve more than just the LibreSSL problem, he said. getrandom() is meant to be a superset of OpenBSD's getentropy()—glibc can easily create a compatible getentropy(), as he showed in the patch.

The requirement that /dev/urandom be initialized before getrandom() will return any data from that pool is one of the new features that the proposed system call delivers. Currently, there is no way for an application to know that at least 128 bits of entropy have been gathered since the system was booted (which is the requirement to properly initialize /dev/urandom). Now, an application can either block to wait for that to occur, or test for the condition using GRND_NONBLOCK and looking for EAGAIN. Since the behavior of /dev/urandom is part of the kernel ABI, it could not change, but adding this blocking to the new system call is perfectly reasonable.

The system call also provides a way to do a non-blocking read of /dev/random to get a partial buffer in the event of a lack of entropy. It is a bit hard to see any real application for that—if you don't need a full buffer of high-estimated-entropy random numbers, why ask for one? In fact, the new call provides a number of ways to abuse the kernel's random number facility (requesting INT_MAX bytes, for example), but that isn't really any different than the existing interfaces.

There have been lots of comments of various sorts on Ts'o's patches, but few complaints. The overall idea seems to make sense to those participating in the thread, anyway. Some changes have been made based on the comments, most notably switching to blocking by default. But the latest revision generated only comments about typos. Unless that changes, it would seem that we could see getrandom() in the kernel rather soon, perhaps as early as 3.17.

Comments (44 posted)

Control groups, part 4: On accounting

July 23, 2014

This article was contributed by Neil Brown


Control groups

In our ongoing quest to understand Linux control groups, at least enough to enjoy the debates that inevitably arise when they are mentioned, it is time to explore the role that accounting plays and to consider how the needs of the accountant are met by the design and structure of cgroups.

Linux and Unix are not strangers to the idea of the accounting of resource usage. Even in V6 Unix, the CPU time for each process was accounted and the running total could be accessed with the times() system call. To a limited extent, this extended to groups of processes too. One of the process groupings we found when we first looked at V6 Unix was the group of all descendants of a given process. When all processes in that group have exited, the total CPU time used by them (or at least those that haven't escaped) was available from times() as well. Before a process has exited and been waited for, its CPU time was known only to itself.

In 2.10BSD, the set of resources that were accounted for grew to include memory usage, page faults, disk I/O, and other statistics. Like CPU time, these were gathered into the parent when a child was waited for. They can be accessed with the getrusage() system call, which is still available in modern Linux, largely unchanged.

With getrusage() came setrlimit(), which could impose limits on the use of some of these resources, such as CPU time and memory usage. These limits were only imposed on individual processes, never on a group: a group's statistics are only accumulated as processes exit, and that is rather too late to be imposing limits.

Last time, we looked at various cgroups subsystems that did not need to keep any accounting across the group to provide their services, though perf_event was in a bit of a grey area. This week, we look at the remaining five subsystems, each of which involves whole-group accounting in some way, and so can impose limits of a sort that setrlimit() never could.

cpuacct — accounting for the sake of accounting

cpuacct is the simplest of the accounting subsystems, in part because all it does is accounting; it doesn't exert control at all.

cpuacct appears to have originally been written largely as a demonstration of the capabilities of cgroups, with no expectation of mainline inclusion. It slipped into mainline with other cgroups code in 2.6.24-rc1, was removed with an explanation of this original intent shortly afterward, and then re-added well before 2.6.24-final because it was actually seen as quite useful. Given this history, we probably shouldn't expect cpuacct to fit neatly into any overall design.

Two separate sorts of accounting information are kept by cpuacct. First, there is the total CPU time used by all processes in the group, which is measured by the scheduler as precisely as it can and is recorded in nanoseconds. This information is gathered on a per-CPU basis and can be reported per-CPU or as a sum across all CPUs.

Second, there is (since 2.6.30) a breakdown into "user" and "system" time of the total CPU time used by all processes in the group. These times are accounted in the same way as the times returned by the times() system call and are recorded in clock ticks or "jiffies". Thus, they may not be as precise as the CPU times accounted in nanoseconds.

Since 2.6.29, these statistics are collected hierarchically. Whenever some usage is added to one group, it is also added to all of the ancestors of that group. So, the usage accounted in each group is the sum of the usage of all processes in that extended group including processes that are members of sub-groups.
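This hierarchical charging can be modeled in a few lines of Python (the names are illustrative, not the kernel's):

```python
# A minimal model of hierarchical accounting: charging a group also
# charges every ancestor, so each group's total covers all processes
# in its entire subtree.
class Group:
    def __init__(self, parent=None):
        self.parent, self.usage = parent, 0

def charge(group, amount):
    while group is not None:
        group.usage += amount
        group = group.parent

root = Group()
child = Group(root)
grandchild = Group(child)
charge(grandchild, 5)
charge(child, 3)
print(root.usage, child.usage, grandchild.usage)  # 8 8 5
```

The walk up the parent chain on every charge is the cost being weighed in the discussion that follows.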

This is the key characteristic of all the subsystems that we will look at this time: they account hierarchically. While perf_event does do some accounting, the accounting for each process stays in the cgroup that the process is an immediate member of and does not propagate up into ancestor groups.

For these two subsystems (cpuacct and perf_event), it is not clear that hierarchical accounting is really necessary. The totals collected are never used within the kernel, but are only made available to user-space applications, which are unlikely to read the data at a particularly high rate. This implies it would be quite effective for an application that needs whole-group accounting information to walk the tree of descendants from a given group and add up the totals in each sub-group. When a cgroup is removed it could certainly be useful to accumulate its usage into its parent, much like process times are accumulated on exit. Earlier accumulation brings no obvious benefit.

Even if it were important to present the totals directly in the cgroups filesystem, it would be quite practical for the kernel to add the numbers up when required, rather than on every change. This is how the sum across all CPUs is managed, but it is not done for the sum across sub-groups.

There is an obvious tradeoff between the complexity for an application to walk the tree on those occasions when data is needed and the cost for the kernel in walking up the tree to add a new charge to every ancestor of the target process on every single accounting event. A proper cost/benefit analysis of this tradeoff would need to take into account the depth of the tree and the frequency of updates. For cpuacct, updates only happen on a scheduler event or timer tick, which would normally be every millisecond or more on a busy machine. While this may seem a fairly high frequency, there are other events that can happen much more often, as we shall see.

Whether the accounting approach in cpuacct and perf_event involves sensible or questionable choices is not really important for understanding cgroups — what is worth noting is the fact that there are choices to be made and tradeoffs to be considered. These subsystems have freedom to choose because the accounting data is not used within the kernel. The remaining subsystems all keep account of something in order to exert control and, thus, need fully correct accounting details.

Sharing out the memory

Two cgroup subsystems are involved in tracking and restricting memory usage: memory and hugetlb. These two use a common data structure and support library for tracking usage and imposing limits: the "resource counter" or res_counter.

A res_counter, which is declared in include/linux/res_counter.h and implemented in kernel/res_counter.c, stores a usage of some arbitrary resource together with a limit and a "soft limit". It also includes a high-water mark that records the highest level the usage has ever reached, and a failure count that tracks how many times a request for extra resources was denied.

Finally, a res_counter contains a spinlock to manage concurrent access and a parent pointer. This pointer demonstrates a recurring theme among the accounting cgroup subsystems in that they often create a parallel tree-like data structure that exactly copies the tree data structure provided directly by cgroups.
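A res_counter and its charge path can be sketched in illustrative Python. The real code takes a spinlock at each level and charges as it walks, uncharging on failure; this simplified model checks all levels first:

```python
# A simplified res_counter model (hypothetical Python, not kernel code):
# charging walks the parent chain, failing, and counting the failure,
# if any ancestor's limit would be exceeded.
class ResCounter:
    def __init__(self, limit, parent=None):
        self.usage, self.limit = 0, limit
        self.max_usage, self.failcnt = 0, 0   # high-water mark, denials
        self.parent = parent

def charge(counter, amount):
    c = counter
    while c is not None:                      # check every level first
        if c.usage + amount > c.limit:
            c.failcnt += 1
            return False
        c = c.parent
    c = counter
    while c is not None:                      # then commit at every level
        c.usage += amount
        c.max_usage = max(c.max_usage, c.usage)
        c = c.parent
    return True

parent = ResCounter(limit=100)
child = ResCounter(limit=80, parent=parent)
assert charge(child, 60)
print(charge(child, 30), parent.usage)  # False 60
```

Even in this toy form, the per-level work on every charge makes it clear why the memory controller goes to some lengths, described next, to avoid calling it too often.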

The memory cgroup subsystem allocates three res_counters, one for user-process memory usage, one for the sum of memory and swap usage, and one for usage of memory by the kernel on behalf of the process. Together with the one res_counter allocated by hugetlb (which accounts the memory allocated in huge pages), this means there are four extra parent pointers when the memory and hugetlb subsystems are both enabled. This seems to suggest that the implementation of the hierarchy structure provided by cgroups doesn't really meet the needs of its users, though why that might be is not immediately obvious.

When one of the various memory resources is requested by a process, the res_counter code will walk up the parent pointers, checking if limits are reached and updating the usage at each ancestor. This requires taking a spinlock at each level, so it is not a cheap operation, particularly if the hierarchy is at all deep. Memory allocation is generally a highly optimized operation in Linux, with per-CPU free lists along with batched allocation and freeing to try to minimize the per-allocation cost. Allocating memory isn't always a high-frequency operation, but sometimes it is; those times should still perform well if possible. So, taking a series of spinlocks for multiple nested cgroups to update accounting on every single memory allocation doesn't sound like a good idea. Fortunately, this is not something that the memory subsystem does.

When authorizing a memory allocation request of less than 32 pages (most requests are for one page), the memory controller will request that a full 32 be approved by the res_counter. If that request is granted, the excess above what was actually required is recorded in a per-CPU "stock" that notes which cgroup last made an allocation on each CPU and how much excess has been approved. If the request is not granted, it requests the actual number of pages allocated.

Subsequent allocations by the same process on the same CPU will use what remains in the stock until that drops to zero, at which point another 32-page authorization will be requested. If a scheduling change causes another process from a different cgroup to allocate memory on that CPU, then the old stock will be returned and a new stock for the new cgroup will be requested.
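The effect of the stock on the charge rate can be seen in a toy Python model; the names and structure here are invented for illustration:

```python
# A toy model of the per-CPU "stock": the first small charge asks the
# res_counter for a batch of 32 pages; later charges by the same group
# on the same CPU consume the stock without touching the counter again.
BATCH = 32

class Stock:
    def __init__(self):
        self.group, self.pages = None, 0

counter_calls = []   # records each (expensive) res_counter authorization

def charge_pages(stock, group, npages):
    if stock.group is group and stock.pages >= npages:
        stock.pages -= npages          # fast path: no counter access
        return
    counter_calls.append(BATCH)        # slow path: authorize a full batch
    stock.group = group
    stock.pages = BATCH - npages

s = Stock()
for _ in range(32):
    charge_pages(s, "groupA", 1)       # 32 single-page allocations
print(len(counter_calls))  # 1
```

Thirty-two one-page allocations cost a single trip through the res_counter hierarchy rather than thirty-two.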

Deallocation is also batched, though with a quite different mechanism, presumably because deallocations often happen in much larger batches and because deallocations can never fail. The batching for deallocation uses a per-process (rather than per-CPU) counter that must be explicitly enabled by the code that is freeing memory. So a sequence of calls:

    mem_cgroup_uncharge_start()
    repeat mem_cgroup_uncharge_page()
    mem_cgroup_uncharge_end()

will use the uncharge batching, while a lone mem_cgroup_uncharge_page() will not.
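The batching pattern can be sketched in Python; the class and counts here are illustrative, not the kernel's data structures:

```python
# A sketch of uncharge batching: between start() and end(), freed pages
# are only tallied, and the expensive counter update happens once, at
# end(). A lone uncharge outside a batch hits the counter directly.
class UnchargeBatch:
    def __init__(self):
        self.active, self.pending = False, 0
        self.counter_updates = 0

    def start(self):
        self.active = True

    def uncharge_page(self):
        if self.active:
            self.pending += 1          # just accumulate
        else:
            self.counter_updates += 1  # un-batched: update immediately

    def end(self):
        if self.pending:
            self.counter_updates += 1  # one update for the whole batch
        self.active, self.pending = False, 0

b = UnchargeBatch()
b.start()
for _ in range(1000):
    b.uncharge_page()
b.end()
b.uncharge_page()                      # a lone, un-batched free
print(b.counter_updates)  # 2
```

A thousand batched frees and one lone free produce just two counter updates between them.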

The key observation here is that, while a naive accounting of resource usage can be quite expensive, there are various ways to minimize the cost and different approaches will be more or less suitable in different circumstances. So it seems proper for cgroups to take a neutral stance on the issue and allow each subsystem to solve the problem in the way best suited to its needs.

Another cgroup subsystem for the CPU

As the CPU is so central to a computer it is not surprising that there are several cgroup subsystems that relate to it. Last time, we met the cpuset subsystem that limits which CPUs a process can run on, and the cpuacct subsystem, above, which accounts for how much time is spent on the CPU by processes in a cgroup. The third and last CPU-related subsystem is simply called cpu. It is used to control how the scheduler shares CPU time among different processes and different cgroups.

The Linux scheduler has a surprisingly simple design. It is modeled on a hypothetical, ideal multi-tasking CPU that can run an arbitrary number of threads simultaneously, though at a proportionally reduced speed. Using this model, the scheduler can calculate how much effective CPU time each thread "should" have ideally received. The scheduler simply selects the thread for which the actual CPU time used is furthest behind the ideal and lets it run for a while to catch up.

If all processes were equal, then the proportionality would mean that if there are N runnable processes, each should get one-Nth of real time. Of course, processes often aren't all equal, as scheduling priority or nice values can assign different weights to each process so they get different proportions of real time. The sum of these proportions must, of course, add up to one (or to the number of active CPUs).
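The model described above can be captured in a toy Python scheduler; this is purely illustrative, and the real CFS implementation is far more involved:

```python
# A toy weighted fair scheduler: each thread's virtual runtime advances
# by (time run / weight), and the scheduler always runs the thread whose
# virtual runtime is lowest, i.e. the one furthest behind its ideal.
def schedule(threads, slices):
    vruntime = {t: 0.0 for t in threads}
    actual = {t: 0 for t in threads}
    for _ in range(slices):
        t = min(threads, key=lambda name: vruntime[name])
        actual[t] += 1
        vruntime[t] += 1.0 / threads[t]      # weight-scaled advance
    return actual

# Thread "b" has twice the weight of "a", so it gets two thirds of the CPU.
usage = schedule({"a": 1, "b": 2}, 3000)
print(usage["b"] / usage["a"])  # 2.0
```

The weights play exactly the role of the proportions described above: doubling a thread's weight halves the rate at which its virtual runtime advances, so it is selected twice as often.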

When the cpu cgroup subsystem is used to request group scheduling, these proportions must be calculated based on the group hierarchy, so some proportion will be given to a top-level group, and that is then shared among the processes and sub-groups in that group.

Further, the "virtual runtime", which tracks the discrepancy between ideal and actual CPU time, needs to be accounted for each group as well as for each process. One might expect the virtual runtime of a group to be exactly the sum of the times from the processes. However, when a process exits, any excess or deficit it has is lost. To prevent this loss from leading to unfairness between groups, the scheduler keeps account of the time used by each group as well as by each process.

To manage these different values across the hierarchy, the cpu subsystem creates a parallel hierarchy of struct sched_entity structures, which is what the scheduler uses to store proportional weights and virtual runtime. There are actually a multitude of these hierarchies, one for each CPU. This means that runtime values can be propagated up the tree without locking, so it is much more efficient than the res_counter used by the memory controller.

In accord with the recurring theme that one subsystem will often have two or more distinct (though related) functions, the cpu subsystem allows a maximum CPU bandwidth to be imposed on each group. This is quite separate from the scheduling priority discussed so far.

The bandwidth is measured in CPU time per real time. Both the CPU time limit, referred to as the quota, and the real time period during which that quota can be used must be specified. When setting the quota and period for each group, the subsystem checks that the limit imposed on any parent is always enough to allow all children to make full use of their quota. If that is not met, then the change will be rejected.
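That validity check can be expressed as a simple rule; the Python below is an illustrative simplification, not the kernel's exact test:

```python
# Bandwidth is quota (CPU time) per period (real time). A configuration
# is accepted only if the parent's bandwidth is at least the sum of its
# children's, so every child can use its full quota. (Simplified model.)
def bandwidth(quota_us, period_us):
    return quota_us / period_us

def config_ok(parent, children):
    return bandwidth(*parent) >= sum(bandwidth(*c) for c in children)

parent = (50_000, 100_000)              # half a CPU
print(config_ok(parent, [(20_000, 100_000), (25_000, 100_000)]))  # True
print(config_ok(parent, [(40_000, 100_000), (30_000, 100_000)]))  # False
```

In the second case the children could together demand 70% of a CPU while the parent is limited to 50%, so the change would be rejected.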

The actual implementation of the bandwidth limits is done largely in the context of the sched_entity. As the scheduler updates how much virtual time each sched_entity has used, it also updates the bandwidth usage and checks if throttling is appropriate.

To some extent, this case study simply reinforces some ideas we have already seen, that restrictions are often pushed down the hierarchy while accounting data is often propagated up a parallel hierarchy. We do see here one convincing reason why a parallel hierarchy might be needed. In this case, the parallel hierarchies are per-CPU so they can be updated without taking any locks.

blkio - a final pair

As we have repeatedly observed, some cgroup subsystems manage multiple, though related, aspects of the contained processes. With blkio, this idea becomes more formalized. blkio allows for various "policies" to be registered that act much like cgroup subsystems in that they are advised of changes to the cgroup hierarchy and they can add files to the cgroup virtual filesystem. They are not able to disallow the movement of processes or get told about fork() or exit().

There are two blkio policies in Linux 3.15: "throttle" and "cfq-iosched". These have a remarkable similarity to the two functions of the cpu subsystem (bandwidth and scheduling priority), though the details are quite different. Many of the ideas we find in these two have already been seen in other subsystems, so there is little point going over similar details in a different guise. But there are two ideas worth mentioning.

One is that the blkio subsystem adds a new ID number to each cgroup. We saw last time that the cgroup core provides an ID number for each group and this is used by net_prio to identify groups. The new ID added by blkio fills a similar role but with one important difference. blkio ID numbers use 64 bits and are never reused, while cgroup-core ID numbers are just an int (typically 32 bits) and are reused. A unique ID number seems like it could be a generally useful feature that the cgroups core could provide. A little over a year after the blkio ID was added, a remarkably similar serial_nr was indeed added to the cgroup core, though blkio hasn't been revised to use it. Note when reading the code: blkio is known internally as blkcg.

The other new idea we find in blkio, and in the cfq-iosched policy in particular, is possibly more interesting. Each cgroup can be assigned a different weight, similar to the weights calculated by the CPU scheduler, to balance scheduling of requests from this group against requests from sibling groups. Uniquely to blkio, each cgroup can also have a leaf_weight that is used to balance requests from processes in this group against requests from processes in child groups.

The net effect of this is that when non-leaf cgroups contain processes, the cfq-iosched policy pretends that those processes are really in a virtual child group and uses leaf_weight to assign a weight to that virtual child. If we consider this against the different aspects of hierarchy that we explored earlier, it seems to be a clear statement from cfq-iosched that "organization" hierarchies are not something that it wants to deal with and that everything should be, or will be treated as, "classification" hierarchies.
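The virtual-child arrangement can be modeled with a small Python helper; the names are hypothetical and the weight values are just examples:

```python
# leaf_weight, modeled: processes directly in a group are treated as one
# virtual child whose weight is the group's leaf_weight, competing
# against the real child groups' weights.
def shares(leaf_weight, child_weights):
    total = leaf_weight + sum(child_weights)
    own = leaf_weight / total               # share for direct processes
    kids = [w / total for w in child_weights]
    return own, kids

# Direct processes (leaf_weight 500) vs. one child group (weight 1000):
own, kids = shares(500, [1000])
print(round(own, 3), round(kids[0], 3))  # 0.333 0.667
```

Here the group's own processes collectively get one third of the group's bandwidth, exactly as if they lived in a child group with weight 500.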

The CPU scheduler doesn't appear to be concerned about this issue. The processes in an internal group are scheduled against each other and against any child group as a whole. It isn't really clear which approach to scheduling is right, but it would be nice if they were consistent. One way to achieve consistency would be to forbid non-leaf cgroups from containing processes. There is work underway to exactly this end, as we will see later in this series.

What can we learn?

If we combine all that we learned in this analysis with what we found with the first set of subsystems last time, it is easy to get lost in the details. Some differences may be deep conceptual differences, others may be important, but shallow, implementation differences, while still others could be pointless differences due to historical accident.

In the first class, I see a strong distinction between control elements that share out resources (CPU or block I/O bandwidth), those that impose limits on resources (CPU, block I/O, and memory), and the rest that are largely involved with identifying processes. The first set need a view of the whole hierarchy, as each branch competes with all others. The second set need to see only the local branch, as limits in one branch can not affect usage in another. The third set don't really need the hierarchy at all — it might be useful, but its presence isn't intrinsic to the functionality.

The fact that several controllers create parallel hierarchies seems to be largely an implementation detail, though, as we will see next time, there may be a deeper conceptual issue underlying that.

The almost chaotic relationship between functionality and subsystems is most likely pointless historical accident. There is no clear policy statement concerning what justifies a new subsystem, so sometimes new functionality is added to an existing subsystem and sometimes it is provided as a brand new subsystem. The key issues that could inform such a policy statement are things we can be looking out for as our journey continues.

In the next installment, we will step back from all this fine detail and try to assemble a high level view of cgroups and their relationships. We will look at the hierarchical structures provided by cgroups and how those interact with the three core needs: sharing resources, limiting resource usage, and identifying processes.

Comments (none posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.16-rc6
Greg KH Linux 3.15.6
Greg KH Linux 3.14.13
Kamal Mostafa Linux 3.13.11.5
Steven Rostedt 3.12.24-rt38
Greg KH Linux 3.10.49
Steven Rostedt 3.10.47-rt50
Greg KH Linux 3.4.99
Steven Rostedt 3.4.97-rt121
Steven Rostedt 3.2.60-rt89

Architecture-specific

Core kernel code

Device drivers

Device driver infrastructure

Christoph Hellwig scsi-mq V4

Filesystems and block I/O

Janitorial

Richard Weinberger Global signal cleanup, take 4

Memory management

Networking

Andy Zhou Add Geneve

Security-related

Miscellaneous

Karel Zak util-linux v2.25

Page editor: Jonathan Corbet


Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds