Kernel development
Brief items
Kernel release status
The current development kernel remains 3.8-rc2; no 3.8 prepatches have been released in the last week.

Stable updates: 3.2.36 was released on January 4.
As of this writing, the 2.6.34.14, 3.0.58, 3.4.25, and 3.7.2 updates are in the review process; they can be expected on or after January 11.
Quotes of the week
Kernel development news
Per-entity load tracking
The Linux kernel's CPU scheduler has a challenging task: it must allocate access to the system's processors in a way that is fair and responsive while maximizing system throughput and minimizing power consumption. Users expect these results regardless of the characteristics of their own workloads, and regardless of the fact that those objectives are often in conflict with each other. So it is not surprising that the kernel has been through a few CPU schedulers over the years. That said, things have seemed relatively stable in recent times; the current "completely fair scheduler" (CFS) was merged for 2.6.23 in 2007. But, behind that apparent stability, a lot has changed, and one of the most significant changes in some time was merged for the 3.8 release.

Perfect scheduling requires a crystal ball; when the kernel knows exactly what demands every process will make on the system and when, it can schedule those processes optimally. Unfortunately, hardware manufacturers continue to push affordable prediction-offload devices back in their roadmaps, so the scheduler has to be able to muddle through in their absence. Said muddling tends to be based on the information that is actually available, with each process's past performance being at the top of the list. But, interestingly, while the kernel closely tracks how much time each process actually spends running, it does not have a clear idea of how much each process is contributing to the load on the system.
One might well ask whether there is a difference between "CPU time consumed" and "load." The answer, at least as embodied in Paul Turner's per-entity load tracking patch set, which was merged for 3.8, would appear to be "yes." A process can contribute to load even if it is not actually running at the moment; a process waiting for its turn in the CPU is an example. "Load" is also meant to be an instantaneous quantity — how much is a process loading the system right now? — as opposed to a cumulative property like CPU usage. A long-running process that consumed vast amounts of processor time last week may have very modest needs at the moment; such a process is contributing very little to load now, despite its rather more demanding behavior in the past.
The CFS scheduler (in 3.7 and prior kernels) tracks load on a per-run-queue basis. It's worth noting that "the" run queue in CFS is actually a long list of queues; at a minimum, there is one for each CPU. When group scheduling is in use, though, each control group has its own per-CPU run queue array. Occasionally the scheduler will account for how much each run queue is contributing to the load on the system as a whole. Accounting at that level is sufficient to help the group scheduler allocate CPU time between control groups, but it leaves the system as a whole unaware of exactly where the current load is coming from. Tracking load at the run queue level also tends to yield widely varying estimates even when the workload is relatively stable.
Toward better load tracking
Per-entity load tracking addresses these problems by pushing this tracking down to the level of individual "scheduling entities" — a process or a control group full of processes. To that end, (wall clock) time is viewed as a sequence of 1ms (actually, 1024µs) periods. An entity's contribution to the system load in a period p_i is just the portion of that period that the entity was runnable — either actually running, or waiting for an available CPU. The trick, though, is to get an idea of contributed load that covers more than 1ms of real time; this is managed by adding in a decayed version of the entity's previous contribution to system load. If we let L_i designate the entity's load contribution in period p_i, then an entity's total contribution can be expressed as:

    L = L_0 + L_1*y + L_2*y^2 + L_3*y^3 + ...

where y is the chosen decay factor. This formula gives the most weight to the most recent load, but allows past load to influence the calculation in a decreasing manner. The nice thing about this geometric series is that it is not actually necessary to keep an array of past load contributions; simply multiplying the previous period's total load contribution by y and adding the new L_0 is sufficient.

In the current code, y has been chosen so that y^32 is equal to 0.5 (though, of course, the calculation is done with integer arithmetic in the kernel). Thus, an entity's load contribution from 32ms in the past is weighted half as strongly as its current contribution.
Once we have an idea of the load contributed by runnable processes, that load can be propagated upward to any containing control groups with a simple sum. But, naturally, there are some complications. Calculating the load contribution of runnable entities is easy, since the scheduler has to deal with those entities on a regular basis anyway. But non-runnable entities can also contribute to load; the fact that a password cracker is currently waiting on a page fault does not change the fact that it may be loading the system heavily. So there needs to be a way of tracking the load contribution of processes that, by virtue of being blocked, are not currently the scheduler's concern.
One could, of course, just iterate through all those processes, decay their load contribution as usual, and add it to the total. But that would be a prohibitively expensive thing to do. So, instead, the 3.8 scheduler will simply maintain a separate sum of the "blocked load" contained in each cfs_rq (control-group run queue) structure. When a process blocks, its load is subtracted from the total runnable load value and added to the blocked load instead. That load can be decayed in the same manner (by multiplying it by y each period). When a blocked process becomes runnable again, its (suitably decayed) load is transferred back to the runnable load. Thus, with a bit of accounting during process state transitions, the scheduler can track load without having to worry about walking through a long list of blocked processes.
Another complication is throttled processes — those that are running under the CFS bandwidth controller and have used all of the CPU time available to them in the current period. Even if those processes wish to run, and even if the CPU is otherwise idle, the scheduler will pass them over. Throttled processes thus cannot contribute to load, so removing their contribution while they languish makes sense. But allowing their load contribution to decay while they are waiting to be allowed to run again would tend to skew their contribution downward. So, in the throttled case, time simply stops for the affected processes until they emerge from the throttled state.
What it is good for
The end result of all this work is that the scheduler now has a much clearer idea of how much each process and scheduler control group is contributing to the load on the system — and it has all been achieved without an increase in scheduler overhead. Better statistics are usually good, but one might wonder whether this information is truly useful for the scheduler.
It does seem that some useful things can be done with a better idea of an entity's load contribution. The most obvious target is likely to be load balancing: distributing the processes on the system so that each CPU is carrying roughly the same load. If the kernel knows how much each process is contributing to system load, it can easily calculate the effect of migrating that process to another CPU. The result should be more accurate, less error-prone load balancing. There are some patches in circulation that make use of load tracking to improve the scheduler's load balancer; something will almost certainly make its way toward the mainline in the near future.
Another feature needing per-entity load tracking is the small-task packing patch. The goal here is to gather "small" processes onto a small number of CPUs, allowing other processors in the system to be powered down. Clearly, this kind of gathering requires a reliable indicator of which processes are "small"; otherwise, the system is likely to end up in a highly unbalanced state.
Other subsystems may also be able to use this information; CPU frequency and power governors should be able to make better guesses about how much computing power will be needed in the near future, for example. Now that the infrastructure is in place, we are likely to see a number of developers using per-entity load information to optimize the behavior of the system. It is still not a crystal ball with a view into the future, but, at least, we now have a better understanding of the present.
Checkpoint/restore and signals
Checkpoint/restore is a mechanism that permits taking a snapshot of the state of an application (which may consist of multiple processes) and then later restoring the application to a running state. One use of checkpoint/restore is for live migration, which allows a running application to be moved between host systems without loss of service. Another use is incremental snapshotting, whereby periodic snapshots are made of a long-running application so that it can be restarted from a recent snapshot in the event of a system outage, thus avoiding the loss of days of calculation. There are also many other uses for the feature.
Checkpoint/restore has a long history, which we covered in November. The initial approach, starting in 2005, was to provide a kernel-space implementation. However, the patches implementing this approach were ultimately rejected as being too complex, invasive, and difficult to maintain. This led to an alternate approach: checkpoint/restore in user space (CRIU), an implementation that performs most of the work in user space, with some support from the kernel. The benefit of the CRIU approach is that, by comparison with a kernel-space implementation, it requires fewer and less invasive changes in the kernel code.
To correctly handle the widest possible range of applications, CRIU needs to be able to checkpoint and restore as much of a process's state as possible. This is a large task, since there are very many pieces of process state that need to be handled, including process ID, parent process ID, credentials, current working directory, resource limits, timers, open file descriptors, and so on. Furthermore, some resources may be shared across multiple processes (for example, multiple processes may hold open file descriptors referring to the same open file), so that successfully restoring application state also requires reproducing shared aspects of process state.
For each piece of process state, CRIU requires two pieces of support from the kernel: a mechanism for retrieving the state (used during checkpoint) and a mechanism to set the state (used during restore). In some cases, the kernel provides most or all of the necessary support. In other cases, however, the kernel does not provide a mechanism to retrieve the (complete) value of the state during a checkpoint or does not provide a mechanism to set the state during restore. Thus, one of the ongoing pieces of work for the implementation of CRIU is to add support to the kernel for these missing pieces.
Andrey Vagin's recent patches to the signalfd() system call are an example of this ongoing work and illustrate the complexity of the task of saving and restoring process state. Before looking at these patches closely, we need to consider the general problem that CRIU is trying to solve with respect to signals, and consider some of the details that make the solution complicated.
The problem and its complexities
The overall problem that the CRIU developers want to solve is checkpointing and restoring a process's set of pending signals—the set of signals that have been queued for delivery to the process but not yet delivered. The idea is that when a process is checkpointed, all of the process's pending signals should be fetched and saved, and when the process is restored, all of the signals should be requeued to the process. As things stand, the kernel does not quite provide sufficient support for CRIU to perform either of these tasks.
At first glance, it might seem that the task is as simple as fetching the list of pending signal numbers during a checkpoint and then requeueing those signals during the restore. However, there's rather more to the story than that. First, each signal has an associated siginfo structure that provides additional information about the signal. That information is available when a process receives a signal. If a signal handler is installed using sigaction() with the SA_SIGINFO flag, then the additional information is available as the second argument of the signal handler, which is prototyped as:
void handler(int sig, siginfo_t *siginfo, void *ucontext);
The siginfo structure contains a number of fields. One of these, si_code, provides further information about the origin of the signal. A positive number in this field indicates that the signal was generated by the kernel; a negative number indicates that the signal was generated by user space (typically by a library function such as sigqueue()). For example, if the signal was generated because of the expiration of a POSIX timer, then si_code will be set to the value SI_TIMER. On the other hand, if a SIGCHLD signal was delivered because a child process changed state, then si_code is set to one of a range of values indicating that the process terminated, was killed by a signal, was stopped, and so on.
Other siginfo fields provide further information about the signal. For example, if the signal was sent using the kill() system call, then the si_pid field contains the PID and the si_uid field contains the real user ID of the sending process. Various other fields in the siginfo structure provide information about specific signals.
There are other factors that make checkpoint/restore of signals complicated. One of these is that multiple instances of the so-called real-time signals can be queued. This means that the CRIU mechanism must ensure that all of the queued signals are gathered up during a checkpoint.
One final detail about signals must also be handled by CRIU. Signals can be queued either to a specific thread or to a process as a whole (meaning that the signal can be delivered to any of the threads in the process). CRIU needs a mechanism to distinguish these two queues during a checkpoint operation, so that it can later restore them.
Limitations of the existing system call API
At first glance it might seem that the signalfd() system call could solve the problem of gathering all pending signals during a CRIU checkpoint:
int signalfd(int fd, const sigset_t *mask, int flags);
This system call creates a file descriptor from which signals can be "read." Reads from the file descriptor return signalfd_siginfo structures containing much of the same information that is passed in the siginfo argument of a signal handler.
However, it turns out that using signalfd() to read all pending signals in preparation for a checkpoint has a couple of limitations. The first of these is that signalfd() is unaware of the distinction between thread-specific and process-wide signals: it simply returns all pending signals, intermingling those that are process-wide with those that are directed to the calling thread. Thus, signalfd() loses information that is required for a CRIU restore operation.
A second limitation is less obvious but just as important. As we noted above, the siginfo structure contains many fields. However, only some of those fields are filled in for each signal. (Similar statements hold true of the signalfd_siginfo structure used by signalfd().) To simplify the task of deciding which fields need to be copied to user space when a kernel-generated signal is delivered (or read via a signalfd() file descriptor), the kernel encodes a value in the two most significant bytes of the si_code field. Elsewhere, the kernel uses a switch statement based on this value to select the code that copies values from appropriate fields in the kernel-internal siginfo structure to the user-space siginfo structure. For example, for signals generated by POSIX timers, the kernel encodes the value __SI_TIMER in the high bytes of si_code, which indicates that various timer-related fields must be copied to the user-space siginfo structure.
Encoding a value in the high bytes of the kernel-internal siginfo.si_code field serves the kernel's requirements when it comes to implementing signal handlers and signalfd(). However, one piece of information is not copied to user space. For kernel-generated signals (i.e., those signals with a positive si_code value), the value encoded in the high bytes of the si_code field is discarded before that field is copied to user space, and it is not possible for CRIU to unambiguously reconstruct the discarded value based only on the signal number and the remaining bits that are passed in the si_code field. This means that CRIU can't determine which other fields in the siginfo structure are valid; in other words, information that is essential to perform a restore of pending signals has been lost.
A related limitation in the system call API affects CRIU restore. The obvious candidates for restoring pending signals are two low-level system calls, rt_sigqueueinfo() and rt_tgsigqueueinfo(), which queue signals for a process and a thread, respectively. These system calls are rarely used outside of the C library (where, for example, they are used to implement the sigqueue() and pthread_sigqueue() library functions). Aside from the thread-versus-process difference, these two system calls are quite similar. For example, rt_sigqueueinfo() has the following prototype:
int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *siginfo);
The system call sends the signal sig, whose attributes are provided in siginfo, to the process with the ID tgid. This seems perfect, except that the kernel imposes one limitation: siginfo.si_code must be less than 0. (This restriction exists to prevent a process from spoofing as the kernel when sending signals to other processes.) This means that even if we could use signalfd() to retrieve the two most significant bytes of si_code, we could not use rt_sigqueueinfo() to restore those bytes during a CRIU restore.
Progress towards a solution
Andrey's first attempt to add support for checkpoint/restore of pending signals took the form of an extension that added three new flags to the signalfd() system call. The first of these flags, SFD_RAW, changed the behavior of subsequent reads from the signalfd file descriptor: instead of returning a signalfd_siginfo structure, reads returned a "raw" siginfo structure that contains some information not returned via signalfd_siginfo and whose si_code field includes the two most significant bytes. The other flags, SFD_PRIVATE and SFD_GROUP, controlled whether reads should return signals from the per-thread queue or the process-wide queue.

One other piece of the patch set relaxed the restrictions in rt_sigqueueinfo() and rt_tgsigqueueinfo() so that a positive value can be specified in si_code, so long as the caller is sending a signal to itself. (It seems safe to allow a process to spoof as the kernel when sending signals to itself.)
A discussion on the design of the interface ensued between Andrey and Oleg Nesterov. Andrey noted that, for backward compatibility reasons, the signalfd_siginfo structure could not be fixed to supply the information required by CRIU, so a new message format really was required. Oleg noted that nondestructive reads that employed a positional interface (i.e., the ability to read message N from the queue) would probably be preferable.
In response to Oleg's feedback, Andrey has now produced a second version of his patches with a revised API. The SFD_RAW flag and the use of a "raw" siginfo structure remain, as do the changes to rt_sigqueueinfo() and rt_tgsigqueueinfo(). However, the new patch set provides a rather different interface for reading signals, via the pread() system call:
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
In normal use, pread() reads count bytes from the file referred to by the descriptor fd, starting at byte offset in the file. Andrey's patch repurposes the interface somewhat in order to read from signalfd file descriptors: offset is used both to select which queue to read from and to specify an ordinal position in that queue. The caller calculates the offset argument using the formula:

    queue + pos

queue is either SFD_SHARED_QUEUE_OFFSET to read from the process-wide signal queue, or SFD_PER_THREAD_QUEUE_OFFSET to read from the per-thread signal queue. pos specifies an ordinal position (not a byte offset) in the queue; the first signal in the queue is at position 0. For example, the following call reads the fourth signal in the process-wide signal queue:
n = pread(fd, &buf, sizeof(buf), SFD_SHARED_QUEUE_OFFSET + 3);
If there is no signal at position pos (i.e., an attempt was made to read past the end of the queue), pread() returns zero.
Using pread() to read signals from a signalfd file descriptor is nondestructive: the signal remains in the queue to be read again if desired.
Andrey's second round of patches has so far received little comment. Although Oleg proposed the revised API, he is unsure whether it will pass muster with Linus:
This patch adds the hack and it makes signalfd even more strange.
Yes, this hack was suggested by me because I can't suggest something better. But if Linus dislikes this user-visible API it would be better to get his nack right now.
To date, however, a version of the patches that copies Linus does not seem to have gone out. In the meantime, Andrey's work serves as a good example of the complexities involved in getting CRIU to successfully handle checkpoint and restore of each piece of process state. And one way or another, checkpoint/restore of pending signals seems like a useful enough feature that it will make it into the kernel in some form, though possibly with a better API.
Xtables2 vs. nftables
The Linux kernel's firewall and packet filtering support has seen quite a few changes over the years. Back in 2009, it looked like a new packet filtering mechanism, nftables, was set to be the next generation solution for Linux. It was mentioned at the 2010 Kernel Summit as a solution that might apply more widely than just for netfilter. But nftables development stalled, until it was resurrected in 2012 by netfilter maintainer Pablo Neira Ayuso. During the lull, however, another, more incremental change to the existing netfilter code had been developed; Xtables2 was proposed for merging by Jan Engelhardt in mid-December. Both solutions target many of the same problems in the existing code, so it seems likely that only one or the other will be merged; which one is the subject of some debate.
Xtables2 has been under development since 2010 or so; Engelhardt gave a presentation [PDF] on it at the 2010 Netfilter workshop. Over the last few years, he has occasionally posted progress reports, but the pace of those (and development itself) seems to have picked up after Neira posted his intention to restart nftables development back in October. Given that it will be difficult—impossible, really—to sell two new packet filtering schemes, either the two will need to come together somehow, or one will have to win out.
At least so far, Neira and Engelhardt don't agree about the direction that netfilter should take. After the October nftables announcement, Engelhardt pointed out that one of the missing nftables features noted by Neira was already available in Xtables2: "atomic table replace and atomic dump". Neira's suggestion that Engelhardt look at adding the feature to nftables was rebuffed.
Beyond that, though, Neira also said that it would be "*extremely hard* to justify its [Xtables2's] inclusion into mainline". To Engelhardt, and perhaps others, that sounded like a pre-judgment against merging Xtables2, which "would be really sad", he said. He continued by listing a number of useful features already available in Xtables2, including network namespace support, a netlink interface, read-copy update (RCU) support, atomic chain and table replacement, and more.
Much of Neira's announcement concerned the "compatibility layer" that will be needed for any replacement of the existing netfilter code. There are far too many users of iptables to leave behind—not to mention the "no ABI breakage" kernel rule. So, for some period of time, both the existing code that supports iptables and any new solution will have to coexist in the kernel. Eventually, the older code can be removed.
One of the main problems that both nftables and Xtables2 address is the code duplication in the existing netfilter implementation (which is often referred to as "xtables"). Because that code is protocol-aware, it was essentially necessary to make four copies of much of it in the kernel to handle the different use cases: IPv4, IPv6, ARP, and Ethernet bridging. That is clearly sub-optimal; the difference between the two proposals is in how they address it. With Xtables2, a single protocol-independent table is used per network namespace, while nftables defines a new virtual machine to process packets. Essentially, Xtables2 (as its name would imply) is an evolution of the existing code, while nftables is a more fundamental rework of Linux packet filtering.
That difference in approaches is evident in the discussion over Engelhardt's merge request. Neira is not impressed with the feature set, but he also complains that Xtables2 "inherits many of the design decisions that were taken while designing iptables back in the late nineties". As might be guessed, Engelhardt saw things differently:
In my opinion, there is nothing wrong with keeping some concepts. A developer is not required to reevaluate and reinnovate every concept there has been just for the heck of it. (The old "evolution, not revolution" credo.) Throwing everything overboard generally does not turn out to work these days.
Nftables is hardly a revolution, Neira said, because it implements backward compatibility: "revolutions are never backward compatible". Further discussion noted a number of conceptual similarities between the two approaches, with each side predictably arguing that its solution could do most or all of what the other could do.
There are some differences between the two, though. For one thing, Xtables2 seems further along, both with its code and with things like documentation and a roadmap. As Neira noted, the development hiatus for nftables clearly set the project back a ways, but he is not ready to provide more details quite yet:
So I have to request some patience from you. We promise to deliver as much information as possible once we are done with the core features.
So there are two competing proposals for how to move forward with netfilter, one that is largely ready (though certainly not complete), according to Engelhardt, and one that is still under active "core" development. While once seen as the likely successor, nftables certainly suffered from lack of attention for a few years, while Xtables2 was seemingly chugging along for much of that time.
Clearly, both Engelhardt and Neira favor their chosen solutions, and neither seems likely to join forces with the other. Engelhardt indicated that he isn't necessarily advocating dropping nftables, but is instead focused on getting Xtables2 features into the hands of users.
Neira proposed a discussion at this year's Netfilter workshop (which should happen by mid-year) to try to resolve the issue. While Engelhardt expressed some concern over the wait, a few months may be needed, as Jozsef Kadlecsik pointed out: "Both nftables and xtables2 have got nice features, so it's not a simple question." Network maintainer David Miller concurred with Neira's proposal.
While it may be technically possible to merge Xtables2 and start down the path toward removing the older interface, then switch to nftables when it is ready, it seems an unlikely outcome. If the netfilter developers (and maintainers) are convinced that nftables is the right way forward, skipping the Xtables2 step makes sense. That may mean a longer delay before some of the longstanding problems in the existing code are addressed, but trying to maintain three different packet filtering schemes in the kernel is simply not going to happen.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
