Brief items
The current development kernel is 3.8-rc2,
released on January 2. Linus says: "It's a new year,
people are getting back to work, and trying
desperately to forget the over-eating that has been going on for the last
two weeks. And hey, to celebrate, here's -rc2!" Perhaps as a result
of the aforementioned over-eating, this patch is relatively small; things
might be expected to pick up a bit by the time -rc3 comes out.
The 3.8-rc1 release happened on
December 21. The most significant changes pulled between the last edition's summary and the closing of
the merge window include the f2fs
filesystem and a driver for Dialog Semiconductor DA9055 watchdog
devices.
Stable updates: no stable updates have been released since
December 17. The 3.2.36 update is in
the review process as of this writing; its release can be expected at any
time.
So after I released 3.7-nohz1, I shut down the light then sat down
in front of an empty wall in my flat and waited in the darkness
with a black tea for december 21th's apocalypse.
But then after a few days, I've been thinking I should have taken a
second cup of tea with me.
— Frederic Weisbecker
For those who are going to vanish today due to their firm believe
in the Mayan Apocalypse it's the last release ever and I can assure
you it's the best one as well. You just won't have enough time to
verify that.
For all others this is just the last release of 2012.
— Thomas Gleixner
Nothing says "classy" quite like drinking champagne with a straw.
With a shark on it.
— Linus Torvalds
The problem is that driver authors
_ARE_ stupid, lazy morons who don't bother to read documentation.
— Russell King
The "Tux3" next-generation filesystem project generated a lot of discussion
and a fair amount of code before fading into obscurity; LWN last
covered this work in 2008. Tux3 developer
Daniel Phillips has resurfaced with a new-year posting suggesting that work
on this code is resuming. "In brief, the first time Hirofumi ever put
together all the kernel pieces in his
magical lab over in Tokyo, our Tux3 rocket took off and made it straight to
orbit. Or in less metaphorical terms, our first meaningful benchmarks turned in
numbers that meet or even slightly beat the illustrious incumbent,
Ext4." The code can be found in
git://github.com/OGAWAHirofumi/tux3.git for those who want to play with it.
Kernel development news
By Jonathan Corbet
January 3, 2013
Spinlocks, being the lowest-level synchronization mechanism in the kernel,
are the target of seemingly endless attempts at performance enhancement.
The
ticket spinlock mechanism used in the
mainline has resisted such attempts for a few years. Now, though, some
developers have identified a performance bottleneck associated with these
locks and are busily trying to come up with an improved version.
A spinlock is so-named because a CPU waiting for a contended lock will
"spin" in a tight loop, repeatedly querying the lock until it becomes
available. Ticket spinlocks adjust this algorithm by having each waiting
CPU take a "ticket" so that each CPU obtains the lock in the order in which
it arrived. These locks thus resemble the "take a number" mechanisms found
at deli counters or motor vehicle division offices worldwide — though, with
luck, the wait is rather shorter than is required to renew a driver's
license in your editor's part of the world. Without the ticket mechanism,
which was added for the 2.6.25 release, the kernel's spinlocks were unfair;
in some situations, some waiters could be starved for an extended period of
time.
It has long been understood that lock contention reduces system performance
considerably. The simple act of spinning for a lock clearly is not going
to be good for performance, but there are also caching issues to take into
account. If two CPUs are repeatedly acquiring a spinlock, the memory
location representing that lock will bounce back and forth between those
CPUs' caches. Even if neither CPU ever has to wait for the lock, the
process of moving it between caches will slow things down considerably.
For that reason, interest in lockless algorithms has been growing for many
years.
In the case of a contended lock, though, cache contention would appear to
be less of an issue. A CPU spinning on a lock will cache its contents in a
shared mode; no cache bouncing should occur until the CPU owning the lock
releases it. Releasing the lock (and its acquisition by another CPU)
requires writing to the lock, and that requires exclusive cache access.
The cache line movement at that time hurts, but probably not as much as
waiting for the lock in the first place. So it would seem that trying to
optimize cache behavior in the contended case is not likely to produce much
in the way of useful results.
That picture is not complete, though; one must take a couple of other facts
into account. Processors do not cache a single value; they cache a "line"
of (typically) 64 or 128 consecutive bytes as a single unit. In other words, the
cache lines in any contemporary processor are
almost certainly significantly larger than what is required to hold a
spinlock. So when a CPU needs
exclusive access to a spinlock's cache line, it also gains exclusive access
to a significant chunk of surrounding data. And that is where the other
important detail comes into play: spinlocks tend to be embedded within the
data structures that they protect, so that surrounding data is typically
data of immediate interest to the CPU holding the lock.
Kernel code will acquire a lock to work with (and, usually, modify) a
structure's contents. Often, changing a field within the protected
structure will require access to the same cache line that holds the
structure's spinlock. If the lock is uncontended, that access is not a
problem; the CPU owning the lock probably owns the cache line as well. But
if the lock is contended, there will be one or more other CPUs constantly
querying its value, obtaining shared access to that same cache line and
depriving the lock holder of the exclusive access it needs. A subsequent
modification of data within the affected cache line will thus incur a cache
miss. So
CPUs querying a contended lock can slow the lock owner considerably, even
though that owner is not accessing the lock directly.
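To make that concrete, consider a purely illustrative structure of the
kind described; the names here are invented for the example and not
taken from the kernel:

    /*
     * Illustrative only: the lock and the data it protects share a
     * cache line.
     */
    struct protected_thing {
        int           lock;     /* waiters poll this word... */
        unsigned long counter;  /* ...while the holder updates this one */
        /* more fields filling out the same 64- or 128-byte line */
    };

Every poll of lock by a waiter pulls the whole line back into the
shared state, so the holder's next store to counter must regain
exclusive access: a cache miss incurred on behalf of data the waiters
never asked for.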
How badly can throughput be impacted? In the description of his patch adding proportional backoff to ticket
spinlocks, Rik van Riel describes a microbenchmark that is slowed by a
factor of two when there is a single contending CPU, and by as much as a
factor of ten with many CPUs in the mix. That is not just a slowdown; that
is a catastrophic loss of performance. Needless to say, that is not the
sort of behavior that kernel developers like to see.
Rik's solution is simple enough. Rather than spinning tightly and querying a
contended lock's status, a waiting CPU should wait a bit more patiently,
only querying the lock occasionally. So his patch causes a waiting CPU to
loop a number of times doing nothing at all before it gets impatient and
checks the lock again. It goes without saying that picking that "number of
times" correctly is the key to good performance with this algorithm. While
a CPU is looping without querying the lock it cannot be bouncing cache
lines around, so
the lock holder should be able to make faster progress. But too much
looping will cause the lock to sit idle before the owner of the next ticket
notices that its turn has come; that, too, will hurt performance.
The first step in Rik's patch series calculates how many CPUs must release
the lock before the current CPU can claim it (by subtracting the current
CPU's ticket number from the number currently being served) and loops 50
times for every CPU that is ahead in the queue. That is where the
"proportional backoff" comes in; the further back in line the CPU is, the
longer it will wait between queries of the lock. The result should be a
minimizing of idle looping while also minimizing cache traffic.
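As a rough sketch of the idea (and only that: the type and constant
names here are invented, GCC's atomic builtins stand in for the
kernel's locking primitives, and the x86 pause instruction replaces
cpu_relax()), a ticket lock with proportional backoff might look like
this:

    #define DELAY_PER_WAITER  50    /* empirically chosen, per the patch */

    struct ticket_lock {
        unsigned short head;        /* ticket now being served */
        unsigned short tail;        /* next ticket to hand out */
    };

    static void ticket_lock_acquire(struct ticket_lock *lock)
    {
        /* Atomically take the next ticket in line */
        unsigned short me = __atomic_fetch_add(&lock->tail, 1,
                                               __ATOMIC_RELAXED);
        for (;;) {
            /* One query of the lock's cache line */
            unsigned short ahead = me -
                __atomic_load_n(&lock->head, __ATOMIC_ACQUIRE);

            if (ahead == 0)
                return;             /* our turn has come */

            /*
             * Proportional backoff: loop without touching the lock's
             * cache line, longer the further back in the queue we are.
             */
            for (unsigned int i = ahead * DELAY_PER_WAITER; i > 0; i--)
                __builtin_ia32_pause();  /* the kernel uses cpu_relax() */
        }
    }

    static void ticket_lock_release(struct ticket_lock *lock)
    {
        __atomic_fetch_add(&lock->head, 1, __ATOMIC_RELEASE);
    }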
The number 50 was determined empirically, but it seems unlikely that it
will be optimal for all situations. So the final part of Rik's patch set
attempts to tune that number dynamically. The dynamic delay factor is
increased when the lock is found to be unavailable and decreased when the
lock is obtained. The goal is to have a CPU query the lock an average of
2.7 times before obtaining it. The number 2.7, once again, was obtained by
running lots of tests and seeing what worked best; subsequent versions of
the patch have tweaked this heuristic somewhat.
Details aside, the core idea is that the delay factor (a per-CPU value that
applies to all
contended locks equally) will increase for workloads experiencing more
contention, tuning the system appropriately.
That said, the notion of a single delay
for all locks is likely to be causing a severe case of raised eyebrows for
some readers, and, indeed, it turned out to be inadequate; some locks are
rather more contended than others, after all. So the January 3 version of Rik's patch
keeps a small hash table of delay values, indexed by spinlock address, instead.
Michel Lespinasse ran some
experiments of his own to see how well the proportional backoff
algorithm worked. In particular, he wanted to figure out whether it was
truly necessary to calculate a dynamic delay factor, or whether an optimal
static value could be found. His conclusion was that, in fact, a static
value is good enough; it might be possible to do a little better with a
dynamic value, he said, but the improvement is not enough to justify the
added complexity of the tuning mechanism. There is just one little
difficulty:
Of course, one major downside in my proposal is that I haven't
figured out an automatic way to find the most appropriate
spinlock_delay system tunable. So there is clearly some more
experimentation needed there. However, IMO the important result
here is that our goal of avoiding performance cliffs seems to be
reachable without the complexity (and IMO, risk) of per-spinlock
tuning values.
If these results stand, and an appropriate way of picking the static value
can be found, then there is probably not a case for adding dynamic backoff
to the kernel's spinlock implementation. But the backoff idea in general
would appear to be a significant improvement for some workloads. So the
chances are good that we will see it added in some form in an upcoming
development cycle.
January 3, 2013
This article was contributed by Bartlomiej Zolnierkiewicz
Last November, LWN described
the vmpressure_fd() work which
implemented a new system call making it possible for user-space
applications to be
informed when system memory is tight. Those applications could then
respond by freeing memory, easing the crunch. That patch set has since
evolved considerably.
Based on the feedback that author Anton Vorontsov received, the concept has
changed from a new system call to a new, control-group-based subsystem.
The controller implementation allows for integration with the memory
controller, meaning that applications can receive notifications when their
specific control group is running low on memory, even if the system as a
whole is not under memory pressure.
As with previous versions of the patch, applications can receive
notifications for three levels of memory pressure: "low" (memory reclaim is
happening at a low level), "medium" (some swapping is happening), and "oom"
(memory pressure is severe). But these notifications may no longer be the
primary way in which applications interact with the controller, thanks to
the most significant change in comparison to the previous
vmpressure_fd() solution:
the addition of a
user-space "shrinker"
interface allowing the kernel to ask user space to
free specific amounts of memory when needed. This API was inspired by Andrew
Morton's feedback
on the first revision of the mempressure control group subsystem patch:
It blurs the question of "who is in control". We tell userspace
"hey, we're getting a bit tight here, please do something". And
userspace makes the decision about what "something" is. So
userspace is in control of part of the reclaim function and the
kernel is in control of another part. Strange interactions are
likely.
Andrew also worried that application developers may tune their programs
against a particular kernel version; subtle behavioral changes in new
kernel releases might then cause regressions. In short, Andrew complained,
the behavior of the system as a whole was not testable, so there would be
no way to know if subsequent kernel changes made performance worse.
Andrew's suggestion was to give more control to the kernel and introduce some
kind of interface for user-space memory scanning and freeing (similar in
its main concept to the shrink_slab() kernel shrinkers). This
interface would control user-space reclaim behavior; if something goes
wrong, it will be up to the kernel to resolve the issue. It would also give
kernel developers the ability to test and tune whole system behavior by
writing a compliant user-space test application and running it.
The user-space shrinker implementation by Anton operates on the concept of
chunks of an application-defined size. The interface assumes that the
application allocates memory with a specific granularity (the "chunk
size"); that assumption need not be entirely accurate, but the more
accurate it is, the better.
So if the application caches data in chunks of 1MB, that is the size it
will provide to the shrinker interface. That is done through a sequence
like this:
- The application opens the control interface, which is found as the
file cgroup.event_control in the controller directory.
- The shrinker interface
(mempressure.shrinker in the controller directory) must also
be opened.
- The eventfd() system call is used to obtain a third file
descriptor for notifications.
- The application then writes a string containing the eventfd()
file descriptor number, the mempressure.shrinker file
descriptor number, and the chunk size to the control interface.
Occasionally, the application should write a string to the shrinker file
indicating how many chunks have been allocated or (using a negative count)
freed. The kernel uses this information to maintain an internal count of
how many reclaimable chunks the application is currently holding on to.
If the kernel wants the application to free some memory, the notification
will come through the eventfd() file descriptor in the form of an
integer count of the number of chunks that should be freed. The kernel
assumes that the application will free the specified number of chunks before
reading from the eventfd() file descriptor again. If the
application isn't able to reclaim all chunks for some reason, it
should re-add the number of chunks that were not freed by writing to
the mempressure.shrinker file.
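Putting those steps together, a minimal registration might look like
the sketch below. Since the mempressure controller was never merged,
the controller directory used here is an assumption, and the control
string follows the format given in the patch posting:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    #define CHUNK_SIZE (1024 * 1024)  /* application caches in 1MB chunks */

    int main(void)
    {
        /* Assumed mount point for the mempressure controller */
        const char *dir = "/sys/fs/cgroup/mempressure";
        char path[256], buf[64];
        uint64_t nr_chunks;
        int cfd, sfd, efd;

        snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
        cfd = open(path, O_WRONLY);
        snprintf(path, sizeof(path), "%s/mempressure.shrinker", dir);
        sfd = open(path, O_RDWR);
        efd = eventfd(0, 0);          /* notification channel */
        if (cfd < 0 || sfd < 0 || efd < 0) {
            perror("shrinker setup");
            return 1;
        }

        /* Register: eventfd number, shrinker fd number, chunk size */
        snprintf(buf, sizeof(buf), "%d %d %d", efd, sfd, CHUNK_SIZE);
        write(cfd, buf, strlen(buf));

        /* Tell the kernel we currently hold 16 reclaimable chunks */
        write(sfd, "16", 2);

        /* Block until the kernel asks us to free some chunks */
        read(efd, &nr_chunks, sizeof(nr_chunks));
        printf("asked to free %llu chunks\n",
               (unsigned long long)nr_chunks);

        /* ... free them here; any shortfall should be written back
           to the mempressure.shrinker file ... */
        return 0;
    }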
The patchset also includes an example
application (slightly buggy in the current version) for testing the new
interface. It creates two threads; the first thread initializes the
user-space shrinker mechanism notifications and then tries to allocate
memory (more than physically available) in an infinite loop. The second
thread listens for user-space shrinker notifications and frees the requested
number of chunks (also in an infinite loop). Ideally, during the run of the
test application the system shouldn't get into an out-of-memory condition
and it also shouldn't use much swap space (if any is available, of
course).
Various comments were received on the patch set, so at least one more round
of changes will be required before this interface can be considered for
merging into the mainline. There is also an open question of how this
feature interacts with volatile ranges and
whether both mechanisms (neither of which has yet been merged) are truly
required. So this discussion may
continue well into the new year before we end up with reclaimable
user-space memory caches in their final form.
By Michael Kerrisk
January 4, 2013
The Linux 3.8 merge window saw the acceptance of Eric Biederman's
sizeable series of user namespace and related
patches. Although there remain some details to finish—for
example, a number of Linux filesystems are not yet user-namespace
aware—the implementation of user namespaces is now functionally
complete.
The completion of the user namespaces work is something of a milestone,
for a number of reasons. First, this work represents the completion of
one of the most complex namespace implementations to date, as evidenced by the fact
that it has been around five years since the first steps in the
implementation of user namespaces (in Linux 2.6.23). Second, the namespace
work is currently at something of a "stable point", with the implementation
of most of the existing namespaces being more or less complete. This does
not mean that work on namespaces has finished: other namespaces may be
added in the future, and there will probably be further extensions to
existing namespaces, such as the addition of namespace isolation for the kernel log.
Finally, the recent changes in the implementation of user namespaces are
something of a game changer in terms of how namespaces can be used:
starting with Linux 3.8, unprivileged processes can create user namespaces
in which they have full privileges, which in turn allows any other type of
namespace to be created inside a user namespace.
Thus, the present moment seems a good point to take an overview of
namespaces and a practical look at the namespace API. This is the first of
a series of articles that does so: in this article, we provide an overview
of the currently available namespaces; in the follow-on articles, we'll
show how the namespace APIs can be used in programs.
The namespaces
Currently, Linux implements six different types of namespaces. The purpose of each
namespace is to wrap a particular global system resource in an abstraction
that makes it appear to the processes within the namespace that they have
their own isolated instance of the global resource. One of the overall
goals of namespaces is to support the implementation of containers, a tool for lightweight
virtualization (as well as other purposes) that provides a group of
processes with the illusion that they are the only processes on the
system.
In the discussion below, we present the namespaces in the order that
they were implemented (or at least, the order in which the implementations
were completed). The CLONE_NEW* identifiers listed in parentheses
are the names of the constants used to identify namespace types when
employing the namespace-related APIs (clone(), unshare(),
and setns()) that we will describe in our follow-on articles.
Mount
namespaces (CLONE_NEWNS, Linux 2.4.19) isolate the set of filesystem mount points seen by a group of
processes. Thus, processes in different mount namespaces can have different
views of the filesystem hierarchy. With the addition of mount namespaces,
the mount()
and umount()
system calls ceased operating on a global set of mount points visible to
all processes on the system and instead performed operations that affected
just the mount namespace associated with the calling process.
One use of mount namespaces is to create environments that are similar
to chroot jails. However, by contrast with
the use of the chroot() system call, mount namespaces are a more
secure and flexible tool for this task. Other more
sophisticated uses of mount namespaces are also possible. For example,
separate mount namespaces can be set up in a master-slave relationship, so
that the mount events are automatically propagated from one namespace to
another; this allows, for example, an optical disk device that is mounted
in one namespace to automatically appear in other namespaces.
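As a simple illustration of per-namespace mounting, the following
minimal sketch (which requires root prior to Linux 3.8) mounts a
tmpfs that no other process on the system can see:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int main(void)
    {
        if (unshare(CLONE_NEWNS) == -1) {   /* new mount namespace */
            perror("unshare");
            return 1;
        }

        /* Make our copy of the mount tree private, so the mount below
           cannot propagate back to the parent namespace on systems
           where / is a shared mount. */
        mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);

        if (mount("none", "/mnt", "tmpfs", 0, NULL) == -1) {
            perror("mount");
            return 1;
        }
        printf("tmpfs on /mnt is visible only in this namespace\n");
        return 0;
    }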
Mount namespaces were the first type of namespace to be implemented on
Linux, appearing in 2002. This fact accounts for the rather generic "NEWNS"
moniker (short for "new namespace"): at that time no one seems to have been
thinking that other, different types of namespace might be needed in the
future.
UTS namespaces (CLONE_NEWUTS,
Linux 2.6.19) isolate two system identifiers—nodename and
domainname—returned by the uname()
system call; the names are set using the sethostname() and
setdomainname() system calls.
In the context of containers, the UTS namespaces feature allows each
container to have its own hostname and NIS domain name. This can be useful
for initialization and configuration scripts that tailor their actions
based on these names.
The term "UTS" derives from the name of the structure passed to the
uname() system call: struct utsname. The name of that
structure in turn derives from "UNIX Time-sharing System".
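As a small taste of the API that the follow-on articles will cover in
detail, here is a minimal sketch (with most error handling omitted;
the hostname string is arbitrary) in which a child created in a new
UTS namespace changes its hostname without affecting its parent:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int child_fn(void *arg)
    {
        char name[64];

        /* Affects only the new UTS namespace, not the parent's */
        sethostname(arg, strlen(arg));
        gethostname(name, sizeof(name));
        printf("child sees hostname:  %s\n", name);
        return 0;
    }

    int main(void)
    {
        char name[64];

        /* Requires privilege before Linux 3.8 */
        pid_t pid = clone(child_fn, child_stack + sizeof(child_stack),
                          CLONE_NEWUTS | SIGCHLD, "inside-uts-ns");
        if (pid == -1) {
            perror("clone");
            return 1;
        }
        waitpid(pid, NULL, 0);
        gethostname(name, sizeof(name));
        printf("parent sees hostname: %s\n", name);  /* unchanged */
        return 0;
    }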
IPC namespaces (CLONE_NEWIPC,
Linux 2.6.19) isolate certain interprocess communication (IPC) resources,
namely, System
V IPC objects and (since Linux 2.6.30) POSIX
message queues. The common characteristic of these IPC mechanisms is
that IPC objects are identified by mechanisms other than filesystem
pathnames. Each IPC namespace has its own set of System V IPC identifiers
and its own POSIX message queue filesystem.
PID namespaces (CLONE_NEWPID,
Linux 2.6.24) isolate the process ID number space. In other words,
processes in different PID namespaces can have the same PID. One of the
main benefits of PID namespaces is that containers can be migrated
between hosts while keeping the same process IDs for the processes inside
the container. PID namespaces also allow each container to have its own
init (PID 1), the "ancestor of all processes" that manages various
system initialization tasks and reaps orphaned child processes when they
terminate.
From the point of view of a particular PID namespace instance, a
process has two PIDs: the PID inside the namespace, and the PID outside the
namespace on the host system. PID namespaces can be nested: a process will
have one PID for each of the layers of the hierarchy starting from the PID
namespace in which it resides through to the root PID namespace. A process
can see (e.g., view via /proc/PID and send
signals with kill()) only processes contained in its own PID
namespace and the namespaces nested below that PID namespace.
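The two views of a PID are easy to demonstrate. In this minimal
sketch (again requiring privilege before Linux 3.8), the child sees
itself as PID 1, while the parent sees whatever PID was assigned in
its own namespace:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int child_fn(void *arg)
    {
        /* Inside the new PID namespace, this process is init */
        printf("child's view of its PID:  %ld\n", (long)getpid());
        return 0;
    }

    int main(void)
    {
        pid_t pid = clone(child_fn, child_stack + sizeof(child_stack),
                          CLONE_NEWPID | SIGCHLD, NULL);
        if (pid == -1) {
            perror("clone");
            return 1;
        }
        printf("parent's view of the PID: %ld\n", (long)pid);
        waitpid(pid, NULL, 0);
        return 0;
    }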
Network namespaces
(CLONE_NEWNET, started in Linux 2.6.24 and largely completed by
about Linux 2.6.29) provide isolation of the system resources associated
with networking. Thus, each network namespace has its own network devices, IP
addresses, IP routing tables, /proc/net directory, port numbers,
and so on.
Network namespaces make containers useful from a networking perspective:
each container can have its own (virtual) network device and its own
applications that bind to the per-namespace port number space; suitable
routing rules in the host system can direct network packets to the network
device associated with a specific container. Thus, for example, it is
possible to have multiple containerized web servers on the same host
system, with each server bound to port 80 in its (per-container) network
namespace.
User namespaces
(CLONE_NEWUSER, started in Linux 2.6.23 and completed in Linux
3.8) isolate the user and group ID number spaces. In other words, a
process's user and group IDs can be different inside and outside a user
namespace. The most interesting case here is that a process can have a
normal unprivileged user ID outside a user namespace while at the same time
having a user ID of 0 inside the namespace. This means that the process has
full root privileges for operations inside the user namespace, but is
unprivileged for operations outside the namespace.
Starting in Linux 3.8, unprivileged processes can
create user namespaces, which opens up a raft of interesting new
possibilities for applications: since an otherwise unprivileged process can
hold root privileges inside the user namespace, unprivileged applications
now have access to functionality that was formerly limited to
root. Eric Biederman has put a lot of effort into making the user
namespaces implementation safe and correct. However, the changes wrought by
this work are subtle and wide ranging. Thus, it may happen that user
namespaces have some as-yet unknown security issues that remain to be found
and fixed in the future.
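A minimal sketch of that scenario on a 3.8 kernel follows. The
single-line write to /proc/self/uid_map maps root inside the
namespace to the process's own unprivileged user ID, which is the one
mapping an unprivileged process is permitted to establish for itself:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char map[64];
        uid_t outer_uid = getuid();
        int fd;

        printf("outside the namespace: uid=%d\n", (int)outer_uid);

        /* No privilege required as of Linux 3.8 */
        if (unshare(CLONE_NEWUSER) == -1) {
            perror("unshare");
            return 1;
        }

        /* Map UID 0 in the namespace to our original, unprivileged UID */
        snprintf(map, sizeof(map), "0 %d 1", (int)outer_uid);
        fd = open("/proc/self/uid_map", O_WRONLY);
        write(fd, map, strlen(map));
        close(fd);

        printf("inside the namespace:  uid=%d\n", (int)getuid());  /* 0 */
        return 0;
    }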
Concluding remarks
It's now around a decade since the implementation of the first Linux
namespace. Since that time, the namespace concept has expanded into a more
general framework for isolating a range of global resources whose scope was
formerly system-wide. As a result, namespaces now provide the basis for a
complete lightweight virtualization system, in the form of containers. As
the namespace concept has expanded, the associated API has grown—from
a single system call (clone()) and one or two /proc
files—to include a number of other system calls and many more files
under /proc. The details of that API will form the subject of the
follow-ups to this article.
Patches and updates
Core kernel code
- Frederic Weisbecker: 3.7-nohz1 (December 21, 2012)
Page editor: Jonathan Corbet