Brief items
The current development kernel remains 3.8-rc2; no 3.8 prepatches
have been released in the last week.
Stable updates: 3.2.36 was released
on January 4.
As of this writing, the
2.6.34.14,
3.0.58,
3.4.25, and
3.7.2 updates are in the review process;
they can be expected on or after January 11.
I used to believe in a single, integrated security module that
addressed all the issues. Now that Linux is supporting everything
from real time tire pressure gauges in tricycles to the global
no-fly list that just doesn't seem reasonable. We need better turn
around on supplemental mechanisms. That means collections of
smaller, simpler LSMs instead of monoliths that only a few select
individuals or organizations have any hope of configuring properly.
—
Casey Schaufler
Well at least it crashes safely.
—
Frederic Weisbecker
Kernel development news
By Jonathan Corbet
January 9, 2013
The Linux kernel's CPU scheduler has a challenging task: it must allocate access
to the system's processors in a way that is fair and responsive while
maximizing system throughput and minimizing power consumption. Users
expect these results regardless of the characteristics of their own
workloads, and regardless of the fact that those objectives are often in
conflict with each other. So it is not surprising that the kernel has been
through a few CPU schedulers over the years. That said, things have
seemed relatively stable in recent times; the current "completely
fair scheduler" (CFS) was merged for 2.6.23 in 2007. But, behind that
apparent stability, a lot has changed, and one of the most significant
changes in some time was merged for the 3.8 release.
Perfect scheduling requires a crystal ball; when the kernel knows exactly
what demands every process will make on the system and when, it can schedule
those processes optimally. Unfortunately, hardware manufacturers continue
to push affordable prediction-offload devices back in their roadmaps, so
the scheduler has to be able to muddle through in their absence. Said
muddling tends to be based on the information that is actually available,
with each process's past performance being at the top of the list. But,
interestingly, while the kernel closely tracks how much time each process
actually spends running, it does not have a clear idea of how much each
process is contributing to the load on the system.
One might well ask whether there is a difference between "CPU time
consumed" and "load." The answer, at least as embodied in Paul Turner's
per-entity load tracking patch set, which
was merged for 3.8, would
appear to be "yes." A process can contribute to load even if it is not
actually running at the moment; a process waiting for its turn in the CPU
is an example. "Load" is also meant to be an instantaneous quantity — how
much is a process loading the system right now? — as opposed to a
cumulative property like CPU usage. A long-running process that consumed
vast amounts of processor time last week may have very modest needs at the
moment; such a process is contributing very little to load now, despite its
rather more demanding behavior in the past.
The CFS scheduler (in 3.7 and prior kernels) tracks load on a per-run-queue
basis. It's worth noting that "the" run queue in CFS is actually a long
list of queues; at a minimum, there is one for each CPU. When group
scheduling is in use, though, each control group has its own per-CPU run
queue array. Occasionally the scheduler will account for how much each run
queue is contributing to the load on the system as a whole. Accounting at
that level is sufficient to help the group scheduler allocate CPU time
between control groups, but it leaves the system as a whole unaware of
exactly where the current load is coming from. Tracking load at the run queue
level also tends to yield widely varying estimates even when the workload is
relatively stable.
Toward better load tracking
Per-entity load tracking addresses these problems by pushing this tracking
down to the level of
individual "scheduling entities" — a process or a control group full of
processes. To that end, (wall clock) time is viewed as a sequence of 1ms
(actually, 1024µs) periods. An entity's contribution to the system load in
a period p_i is just the portion of that period that the
entity was runnable — either actually running, or waiting for an available
CPU. The trick, though, is to get an idea of contributed load that covers
more than 1ms of real time; this is managed by adding in a decayed version
of the entity's previous contribution to system load. If we let
L_i designate the entity's load contribution in period
p_i, then an entity's total contribution can be expressed
as:

    L = L_0 + L_1*y + L_2*y^2 + L_3*y^3 + ...
where y is the chosen decay factor. This formula gives the most
weight to the most recent load, but allows past load to influence the
calculation in a decreasing manner. The nice thing about this series is
that it is not actually necessary to keep an array of past load
contributions; simply multiplying the previous period's total load
contribution by y and adding the new L_0 is
sufficient.
In the current code, y has been chosen so that y^32
is equal to 0.5 (though, of course, the calculation is done with integer
arithmetic in the kernel). Thus, an entity's load contribution 32ms in the
past is weighted half as strongly as its current contribution.
Once we have an idea of the load contributed by runnable processes, that
load can be propagated upward to any containing control groups with a
simple sum. But, naturally, there are some complications. Calculating the
load contribution of runnable entities is easy, since the scheduler has to
deal with those entities on a regular basis anyway. But non-runnable
entities can also contribute to load; that a password cracker is
currently waiting on a page fault does not change the fact that it may be
loading the system heavily. So there needs to be a way of tracking the
load contribution of processes that, by virtue of being blocked, are not
currently the scheduler's concern.
One could, of course, just iterate through all those processes, decay their
load contribution as usual, and add it to the total. But that would be a
prohibitively expensive thing to do. So, instead, the 3.8 scheduler will
simply maintain a separate sum of the "blocked load" contained in each
cfs_rq (control-group run queue) structure. When a process
blocks, its load is subtracted from the total runnable load value and
added to the blocked load instead. That load can be decayed in the same
manner (by multiplying it by y each period). When a blocked process
becomes runnable again, its (suitably decayed) load is transferred back to
the runnable load. Thus, with a bit of accounting during process state
transitions, the scheduler can track load without having to worry about
walking through a long list of blocked processes.
Another complication is throttled processes — those that are running under
the CFS bandwidth controller and have used
all of the CPU time available to them in the current period. Even if those
processes wish to run, and even if the CPU is otherwise idle, the scheduler
will pass them over. Throttled processes thus
cannot contribute to load, so removing their contribution while they
languish makes sense. But allowing their load contribution to decay while
they are waiting to be allowed to run again would tend to skew their
contribution downward. So, in the throttled case, time simply stops for
the affected processes until they emerge from the throttled state.
What it is good for
The end result of all this work is that the scheduler now has a much
clearer idea of how much each process and scheduler control group is
contributing to the load on the system — and it has all been achieved
without an increase in scheduler overhead. Better statistics are usually
good, but one might wonder whether this information is truly useful for the
scheduler.
It does seem that some useful things can be done with a better idea of an
entity's load contribution. The most obvious target is likely to be load
balancing: distributing the processes on the system so that each CPU is
carrying roughly the same load. If the kernel knows how much each process
is contributing to system load, it can easily calculate the effect of
migrating that process to another CPU. The result should be more accurate,
less error-prone load balancing. There are some patches in circulation that make use of
load tracking to improve the scheduler's load balancer; something will
almost certainly make its way toward the mainline in the near future.
Another feature needing per-entity load tracking is the small-task packing patch. The goal here is to
gather "small" processes onto a small number of CPUs, allowing other
processors in the system to be powered down. Clearly, this kind of
gathering requires a reliable indicator of which processes are "small";
otherwise, the system is likely to end up in a highly unbalanced state.
Other subsystems may also be able to use this information; CPU frequency
and power governors should be able to make better guesses about how much
computing power will be needed in the near future, for example. Now that
the infrastructure is in place, we are likely to see a number of developers
using per-entity load information to optimize the behavior of the system.
It is still not a crystal ball with a view into the future, but, at least,
we now have a better understanding of the present.
By Michael Kerrisk
January 9, 2013
Checkpoint/restore is a mechanism that permits taking a snapshot of the
state of an application (which may consist of multiple processes) and then
later restoring the application to a running state. One use of
checkpoint/restore is for live migration, which allows a running
application to be moved between host systems without loss of
service. Another use is incremental snapshotting, whereby periodic
snapshots are made of a long-running application so that it can be
restarted from a recent snapshot in the event of a system outage, thus
avoiding the loss of days of calculation. There are also many other uses
for the feature.
Checkpoint/restore has a long history, which we covered in November. The initial approach,
starting in 2005, was to provide a kernel-space implementation. However,
the patches implementing this approach were ultimately rejected as being
too complex, invasive, and difficult to maintain. This led to an alternate approach:
checkpoint/restore in user space (CRIU), an implementation that performs
most of the work in user space, with some support from the kernel. The
benefit of the CRIU approach is that, by comparison with a kernel-space
implementation, it requires fewer and less invasive changes in the kernel
code.
To correctly handle the widest possible range of applications, CRIU
needs to be able to checkpoint and restore as much of a process's state as
possible. This is a large task, since there are very many pieces of process
state that need to be handled, including process ID, parent process ID,
credentials, current working directory, resource limits, timers, open file
descriptors, and so on. Furthermore, some resources may be shared across
multiple processes (for example, multiple processes may hold open file
descriptors referring to the same open file), so that successfully
restoring application state also requires reproducing shared aspects of
process state.
For each piece of process state, CRIU requires two pieces of support
from the kernel: a mechanism for retrieving the state (used during
checkpoint) and a mechanism to set the state (used during restore). In some
cases, the kernel provides most or all of the necessary support. In other
cases, however, the kernel does not provide a mechanism to retrieve the
(complete) value of the state during a checkpoint or does not provide a
mechanism to set the state during restore. Thus, one of the ongoing pieces
of work for the implementation of CRIU is to add support to the kernel for
these missing pieces.
Andrey Vagin's recent patches to the signalfd() system call are
an example of this ongoing work and illustrate the complexity of the task of
saving and restoring process state. Before looking at these patches
closely, we need to consider the general problem that CRIU is trying to
solve with respect to signals, and consider some of the details that make
the solution complicated.
The problem and its complexities
The overall problem that the CRIU developers want to solve is
checkpointing and restoring a process's set of pending signals—the
set of signals that have been queued for delivery to the process but not
yet delivered. The idea is that when a process is checkpointed, all of the
process's pending signals should be fetched and saved, and when the process
is restored, all of the signals should be requeued to the process. As
things stand, the kernel does not quite provide sufficient support for CRIU
to perform either of these tasks.
At first glance, it might seem that the task is as simple as fetching
the list of pending signal numbers during a checkpoint and then requeueing
those signals during the restore. However, there's rather more to the story
than that. First, each signal has an associated siginfo structure
that provides additional information about the signal. That information is
available when a process receives a signal. If a signal handler is
installed using sigaction() with the SA_SIGINFO flag,
then the additional information is available as the second argument of the
signal handler, which is prototyped as:
void handler(int sig, siginfo_t *siginfo, void *ucontext);
The siginfo structure contains a number of fields. One of
these, si_code, provides further information about the origin of
the signal. A positive number in this field indicates that the signal was
generated by the kernel; a negative number indicates that the signal was
generated by user space (typically by a library function such as
sigqueue()). For example, if the signal was generated because of the
expiration of a POSIX timer, then si_code will be set to the value
SI_TIMER. On the other hand, if a SIGCHLD signal was
delivered because a child process changed state, then si_code is
set to one of a range of values indicating that the process terminated, was
killed by a signal, was stopped, and so on.
Other siginfo fields provide further information about the
signal. For example, if the signal was sent using the kill()
system call, then the si_pid field contains the PID and the
si_uid field contains the real user ID of the sending
process. Various other fields in the siginfo structure provide
information about specific signals.
There are other factors that make checkpoint/restore of signals
complicated. One of these is that multiple instances of the so-called
real-time signals can be queued. This means that the CRIU mechanism must
ensure that all of the queued signals are gathered up during a checkpoint.
One final detail about signals must also be handled by CRIU. Signals
can be queued either to a specific thread or to a process as a whole
(meaning that the signal can be delivered to any of the threads in the
process). CRIU needs a mechanism to distinguish these two queues during a
checkpoint operation, so that it can later restore them.
Limitations of the existing system call API
At first glance it might seem that the signalfd()
system call could solve the problem of gathering all pending signals during
a CRIU checkpoint:
int signalfd(int fd, const sigset_t *mask, int flags);
This system call creates a file descriptor from which signals can be
"read." Reads from the file descriptor return signalfd_siginfo
structures containing much of the same information that is passed in the
siginfo argument of a signal handler.
However, it turns out that using signalfd() to read all
pending signals in preparation for a checkpoint has a couple of
limitations. The first of these is that signalfd() is unaware of
the distinction between thread-specific and process-wide signals: it simply
returns all pending signals, intermingling those that are process-wide with
those that are directed to the calling thread. Thus, signalfd()
loses information that is required for a CRIU restore operation.
A second limitation is less obvious but just as important. As we noted
above, the siginfo structure contains many fields. However, only
some of those fields are filled in for each signal. (Similar statements
hold true of the signalfd_siginfo structure used by
signalfd().) To simplify the task of deciding which fields need
to be copied to user space when a kernel-generated signal is delivered (or read via a
signalfd() file descriptor), the kernel encodes a value in the
two most significant bytes of the si_code field. The kernel then
elsewhere uses a switch statement based on this value to select
the code that copies values from appropriate fields in the kernel-internal
siginfo structure to the user-space siginfo
structure. For example, for signals generated by POSIX timers, the kernel
encodes the value __SI_TIMER in the high bytes of si_code,
which indicates that various timer-related fields must be copied to the
user-space siginfo structure.
Encoding a value in the high bytes of the kernel-internal
siginfo.si_code field serves the kernel's requirements when it
comes to implementing signal handlers and signalfd(). However, one
piece of information is not copied to user space. For
kernel-generated signals (i.e., those signals with a positive
si_code value), the value encoded in the high bytes of the
si_code field is discarded before that field is copied to user
space, and it is not possible for CRIU to unambiguously reconstruct the
discarded value based only on the signal number and the remaining bits that
are passed in the si_code field. This means that CRIU can't
determine which other fields in the siginfo structure are valid;
in other words, information that is essential to perform a restore of
pending signals has been lost.
A related limitation in the system call API affects CRIU restore. The
obvious candidates for restoring pending signals are two low-level system
calls, rt_sigqueueinfo() and rt_tgsigqueueinfo(), which
queue signals for a process and a thread, respectively. These system calls
are rarely used outside of the C library (where, for example, they
are used to implement the sigqueue() and
pthread_sigqueue() library functions). Aside from the thread-versus-process
difference, the two system calls are quite similar. For example,
rt_sigqueueinfo() has the following prototype:
int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *siginfo);
The system call sends the signal sig, whose attributes are
provided in siginfo, to the process with the ID
tgid. This seems perfect, except that the kernel imposes one
limitation: siginfo.si_code must be less than 0. (This restriction
exists to prevent a process from spoofing as the kernel when sending
signals to other processes.) This means that even if we could use
signalfd() to retrieve the two most significant bytes of
si_code, we could not use rt_sigqueueinfo() to restore
those bytes during a CRIU restore.
Progress towards a solution
Andrey's
first attempt to add support for
checkpoint/restore of pending signals took the form of an extension that
added three new flags to the
signalfd() system call. The first of
these flags,
SFD_RAW, changed the behavior of subsequent reads
from the signalfd file descriptor: instead of returning a
signalfd_siginfo structure, reads returned a "raw"
siginfo structure that contains some information not returned via
signalfd_siginfo and whose
si_code field includes the two
most significant bytes. The other flags,
SFD_PRIVATE and
SFD_GROUP, controlled whether reads should return signals from the
per-thread queue or the process-wide queue.
One other piece of the patch set relaxed the restrictions in
rt_sigqueueinfo() and rt_tgsigqueueinfo() so that a
positive value can be specified in si_code, so long as the caller
is sending a signal to itself. (It seems safe to allow a process to spoof as
the kernel when sending signals to itself.)
A discussion on the design of the interface
ensued between Andrey and Oleg Nesterov. Andrey noted that, for backward
compatibility reasons, the signalfd_siginfo structure could not be fixed to
supply the information required by CRIU, so a new message format really was
required. Oleg noted that nondestructive reads that employed a positional
interface (i.e., the ability to read message N from the queue) would
probably be preferable.
In response to Oleg's feedback, Andrey has now produced a second version of his patches with a
revised API. The SFD_RAW flag and the use of a "raw"
siginfo structure remain, as do the changes to
rt_sigqueueinfo() and rt_tgsigqueueinfo(). However, the
new patch set provides a rather different interface for reading signals,
via the pread() system call:
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
In normal use,
pread() reads
count bytes from the file
referred to by the descriptor
fd, starting at byte
offset in
the file. Andrey's patch repurposes the interface somewhat in order to read
from signalfd file descriptors:
offset is used to both select
which queue to read from and to specify an ordinal position in that
queue. The caller calculates the
offset argument using the formula:

    queue + pos

where queue is either
SFD_SHARED_QUEUE_OFFSET, to read from the
process-wide signal queue, or
SFD_PER_THREAD_QUEUE_OFFSET, to read
from the per-thread signal queue, and
pos specifies an ordinal
position (not a byte offset) in the queue; the first signal in the queue is
at position 0. For example, the following call reads the fourth signal in
the process-wide signal queue:
n = pread(fd, &buf, sizeof(buf), SFD_SHARED_QUEUE_OFFSET + 3);
If there is no signal at position pos (i.e., an attempt was made to
read past the end of the queue), pread() returns zero.
Using pread() to read signals from a signalfd file descriptor
is nondestructive: the signal remains in the queue to be read again if
desired.
Andrey's second round of patches has so far received little
comment. Although Oleg proposed the revised API, he is unsure whether it will pass muster with
Linus:
I think we should cc Linus.
This patch adds the hack and it makes signalfd even more strange.
Yes, this hack was suggested by me because I can't suggest something
better. But if Linus dislikes this user-visible API it would be better to
get his nack right now.
To date, however, a version of the patches that copies Linus does not
seem to have gone out. In the meantime, Andrey's work serves as a good
example of the complexities involved in getting CRIU to successfully handle
checkpoint and restore of each piece of process state. And one way or
another, checkpoint/restore of pending signals seems like a useful enough
feature that it will make it into the kernel in some form, though possibly with a
better API.
By Jake Edge
January 9, 2013
The Linux kernel's firewall and packet filtering support has seen quite a few
changes over the years. Back in 2009, it looked like a new packet filtering
mechanism, nftables, was set to be the next
generation solution for Linux.
It was mentioned at the 2010 Kernel Summit
as a solution that might apply more widely than just for netfilter.
But nftables development
stalled, until it was resurrected in 2012
by netfilter maintainer Pablo Neira Ayuso. During the lull, however,
another, more incremental change to the existing netfilter code had been
developed; Xtables2 was proposed for
merging by Jan Engelhardt in mid-December. Many of the same problems
in the existing code are targeted by both solutions, so it seems likely that only
one or the other will be merged—the decision on which is the
subject of some debate.
Xtables2 has been under development since 2010 or so; Engelhardt gave a presentation [PDF] on
it at the 2010 Netfilter workshop. Over the last few years, he has
occasionally posted progress reports, but the pace of those (and
development itself) seems to have picked up after Neira posted his
intention to restart nftables development back in October. Given that it
will be difficult—impossible, really—to sell two new packet
filtering schemes, either the two will need to come together somehow, or
one will have to win out.
At least so far, Neira and Engelhardt don't agree about the
direction that netfilter should take. After the October nftables announcement,
Engelhardt pointed out that one of the
missing nftables features noted by Neira was already available in Xtables2: "atomic table replace and atomic
dump".
Neira's suggestion that Engelhardt look at adding the
feature to nftables was rebuffed.
Beyond that, though, Neira also said that
it would be "*extremely hard* to justify its [Xtables2's] inclusion
into mainline". To Engelhardt, and perhaps others, that sounded
like a pre-judgment against merging Xtables2, which "would be really
sad", he said. He continued by
listing a number
of useful features already available in Xtables2, including network
namespace support, a netlink interface, read-copy update (RCU) support,
atomic chain and table replacement, and more.
Much of Neira's announcement concerned the "compatibility layer" that will
be needed for any
replacement of the existing netfilter code. There are far too many users
of iptables to leave behind—not to mention the "no ABI
breakage" kernel rule. So, for some period of time, both the existing code
that supports iptables and any new
solution will have to coexist in the kernel. Eventually, the older code
can be removed.
One of the main problems that both nftables and Xtables2 address is the code
duplication in the existing netfilter implementation
(which is often referred to as "xtables"). Because that code is
protocol-aware, it
was essentially necessary to make four copies of much of it in the
kernel to handle each different use case: IPv4, IPv6, ARP, and ethernet
bridging. That duplication is clearly sub-optimal; the difference between
the two projects lies in how they eliminate it. With Xtables2, a single
protocol-independent table is used per network namespace, while nftables
defines a new virtual machine to process packets. Essentially, Xtables2
(as its name would imply) is an evolution of the existing code, while
nftables is a more fundamental rework of Linux packet filtering.
That difference in approaches is evident in the discussion over
Engelhardt's merge request. Neira is not
impressed with the feature set, but he also complains that Xtables2 "inherits many of the
design decisions that were taken while designing iptables back in the
late nineties". As might be guessed, Engelhardt saw things differently:
nf_tables itself retains some "late nineties" design decisions as
well.
In my opinion, there is nothing wrong with keeping some concepts. A
developer is not required to reevaluate and reinnovate every concept
there has been just for the heck of it. (The old "evolution, not
revolution" credo.) Throwing everything overboard generally does not
turn out to work these days.
Nftables is hardly a revolution, Neira said, because it implements backward
compatibility: "revolutions are never
backward compatible". Further discussion noted a number of
conceptual similarities between the two approaches, with each side
predictably arguing that their solution could do most or all of what the
other could do.
There are some differences between the two, though. For one thing,
Xtables2 seems further along, both with its code and with things like documentation and a roadmap. As Neira noted, the development hiatus for nftables
clearly set the project back a ways, but he is not ready to provide more
details quite yet:
I understand you want to know more on the future of nftables, but
because the way I am, I prefer to skip "hot air" wording by now and
talk on code done anytime soon.
So I have to request some patience from you. We promise to deliver as
much information as possible once we are done with the core features.
So there are two competing proposals for how to move forward with
netfilter, one that is largely ready (though certainly not complete),
according to Engelhardt, and one that is still under active "core"
development. While once seen as the likely successor, nftables certainly
suffered from lack of attention for a few years, while Xtables2 was
seemingly chugging along for much of that time.
Clearly, both Engelhardt and Neira favor their chosen solutions, and
neither seems likely to join forces with the other. Engelhardt indicated that he isn't advocating dropping
nftables, necessarily, but is instead focused on getting Xtables2 features
into the hands of users:
In all fairness, I have never said anything about dropping nft.
I am focused on xt2, its inclusion and subsequent maintenance, because
it resolves the ipt shortcomings in a way that I think appeals most to
the userspace crowd.
Neira proposed a discussion at this year's
Netfilter workshop (which should happen by mid-year) to try to resolve the
issue. While Engelhardt expressed some concern over the wait, a few months
may be needed as Jozsef Kadlecsik pointed
out: "Both
nftables and xtables2 have got nice features, so it's not a simple
question." Network maintainer David Miller concurred with Neira's
proposal.
While it may be technically possible to merge Xtables2 and start down the
path toward removing the older interface, then switch to nftables when it
is ready, it seems an unlikely outcome.
If the netfilter developers (and maintainers) are convinced that nftables
is the right way forward, skipping the Xtables2 step makes sense. That may
mean a longer delay before some of the
longstanding problems in the existing code are addressed, but trying to
maintain three different packet filtering schemes in the kernel is simply
not going to happen.
Page editor: Jonathan Corbet