Kernel development
Brief items
Kernel release status
The current development kernel is 4.2-rc3, released on July 19. Linus said: "Normal Sunday release schedule, and a fairly normal rc release. There was some fallout from the x86 FPU cleanups, but that only hit CPU's with the xsaves instruction, and it should be all good now."
Stable updates: 4.1.3 and 4.0.9 were released on July 21.
Quotes of the week
/*
 * Well, when the cobbler got mad like this, he would go into hiding. He
 * would not make or sell any boots. He would not go out at all. Pretty
 * soon, the coffee shop would have to close because the cobbler wasn't
 * coming by twice a day any more. Then the grocery store would have to
 * close because he wouldn't eat much. After a while, everyone would panic
 * and have to move from the village and go live with all their relatives
 * (usually the ones they didn't like very much).
 *
 * Eventually, the cobbler would work his way out of his bad mood, and
 * open up his boot business again. Then, everyone else could move back
 * to the village and restart their lives, too.
 *
 * Fortunately, we have been able to collect up all the cobbler's careful
 * notes (and we wrote them down below). We'll have to keep checking these
 * notes over time, too, just as the cobbler does. But, in the meantime,
 * we can avoid the panic and the reboot since we can make sure that each
 * subtable is doing okay. And that's what bad_madt_entry() does.
 */
Gorman: Continual testing of mainline kernels
Mel Gorman introduces SUSE's kernel performance-testing system. "Marvin is a system that continually runs performance-related tests and is named after another robot doomed with repetitive tasks. When tests are complete it generates a performance comparison report that is publicly available but rarely linked. The primary responsibility of this system is to check SUSE Linux for Enterprise kernels for performance regressions but it is also configured to run tests against mainline releases."
Kernel development news
rm -r fs/ext3
The kernel development community is quite good at adding code to the kernel; its record on removing code is not always quite so bright. There are all kinds of reasons why removing code can be difficult; often, even code that appears to be without use stays around just in case somebody, somewhere, still needs it. Removal can be hard even when there is a known replacement that should work for all users; that can be seen in the case of the ext3 filesystem.
A few eyebrows went up when Jan Kara posted a patch removing the ext3 filesystem recently. Some users clearly thought the move represented a forced upgrade to ext4; Randy Dunlap remarked that "this looks like an April 1 joke to me". In truth, it is neither a joke nor a forced upgrade; it is, however, an interesting story to look back at.
Nine years ago, in the middle of 2006, the premier filesystem for most users was ext3, but that filesystem was showing its age in a few ways. Its 32-bit block pointers limited maximum filesystem size to 8TB, a limit that was not too restrictive for most users at the time, but which would be highly problematic today. The filesystem tracks blocks in files with individual pointers, leading to large amounts of metadata overhead and poor performance on larger files. These problems, along with a number of missing features, had long since convinced developers that something newer and better was required.
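The arithmetic behind that 8TB ceiling is easy to check. The assumptions below are mine, not the article's: 4KiB filesystem blocks, and only 2**31 usable block numbers out of the 32-bit block pointer.

```python
# Back-of-the-envelope check of the ext3 size ceiling cited above.
# Assumptions (not from the article): 4 KiB blocks, and 2**31 usable
# block numbers from the 32-bit block pointer.
block_size = 4096            # bytes per filesystem block
usable_blocks = 2 ** 31      # effectively signed 32-bit block numbers
max_fs_size = block_size * usable_blocks
print(max_fs_size // 2 ** 40, "TiB")  # -> 8 TiB
```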
For a while, some thought that might be a filesystem called reiser4, but that story failed to work out well even before that filesystem's primary developer left the development community.
The ext3 developers came up with a number of patches aimed at easing its scalability problems. These patches were made directly against the ext3 filesystem, with the idea that ext3 would evolve in the direction that was needed. There was, however, some resistance to the idea of making major changes to ext3 from developers who valued that filesystem in its current, stable form. One of those developers, it turned out, was Linus who, as we all know, has a relatively strong voice in such decisions.
And so it came to be that the ext3 developers announced their intent to create a new filesystem called "ext4"; all new-feature development would be done there. Actually, the new filesystem was first called "ext4dev" to emphasize its experimental nature; the plan was to rename it to "ext4" once things were stable, "probably in 6-9 months". In the real world, that renaming happened nearly 28 months later and was merged for the 2.6.28 kernel.
Since then, of course, ext4 has become the primary Linux filesystem for many users. It has seen many new features added, and it is not clear that this process will stop, even though ext4 is now in the same position that ext3 was nine years ago. Through this entire history, though, ext4 has retained the ability to mount and manage ext2 and ext3 filesystems; it can be configured to do so transparently in the absence of the older ext2 and ext3 modules. And, indeed, many distributions now don't bother to build the older filesystem modules, relying on ext4 to manage all three versions of the filesystem.
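One reason a single driver can serve all three formats is that ext2, ext3, and ext4 share the same on-disk superblock layout and magic number; the variants are told apart by feature flags. The sketch below (my own illustration, not from the article; the offsets come from the published ext2/ext4 disk layout, with the superblock at byte 1024 and s_magic at byte 56 within it) reads that shared magic from an image file:

```python
import struct
import tempfile

EXT_MAGIC = 0xEF53  # shared by ext2, ext3, and ext4

def read_ext_magic(image_path):
    """Return the filesystem magic number from an ext* image."""
    # The superblock begins 1024 bytes into the device; s_magic is a
    # little-endian u16 at offset 56 within the superblock.
    with open(image_path, "rb") as f:
        f.seek(1024 + 56)
        (magic,) = struct.unpack("<H", f.read(2))
    return magic

# Demo on a synthetic image: only the magic field is populated here.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1024 + 56) + struct.pack("<H", EXT_MAGIC))
    path = f.name
print(hex(read_ext_magic(path)))  # -> 0xef53
```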
Back when ext4 was created, it was envisioned that the older filesystem code would eventually become unnecessary. The plan was that when this happened, "perhaps 12-18 months out", the ext3 code would be removed. Once again, reality had something different to say, and the ext3 code endured for over nine years. Unless something surprising happens, though, that record is about to come to an end; ext3 could be removed as soon as the 4.3 development cycle, taking some 28,000 lines of code with it. And most users, even those with ext3 filesystems, will not even notice.
One might well wonder whether we will see a similar story in the future and the addition of an ext5 filesystem. For the time being, that does not seem to be in the works. Ext4 has picked up a number of features in recent years, with encryption as the most recent example, but there has been no talk of moving development to a new source base. Over the years, perhaps, the ext4 developers have done well enough at not breaking things that users are less worried about new development than they once might have been.
At the other end, there is the question of the ext2 filesystem. That code, too, could be replaced by ext4, but there seems to be no pressure to do so. Ext2 is small, weighing in at less than 10,000 lines of code; ext3 and the associated JBD journaling code come in at 28,000, while ext4 and JBD2 add up to closer to 60,000 lines. The simplicity of ext2 makes it a good filesystem for developers to experiment with, and its maintenance cost is nearly zero. So there is no real reason to take it out anytime soon.
Ext3, being rather larger than ext2, is a more promising target for removal, though Jan said that its maintenance cost was pretty low. The fact that this code has been so thoroughly replaced makes the removal decision relatively easy, but that decision still took nine years to come about. Even so, if all old kernel code were this easy to get rid of, the kernel would be quite a bit smaller than it is today.
Atomic additions
Atomic variables have a long history as part of the kernel's concurrency-management toolkit. These variables enable the execution of simple arithmetic (and related) operations in an all-or-nothing manner; other CPUs will never see partially-executed operations. As systems grow more complex, though, atomic variables are having to become more complex as well, as seen by a couple of recently proposed additions to the atomic_t repertoire.
Atomic logical operations
The simpler addition is the atomic logical operations patch set from Peter Zijlstra. Peter noted that there was no set of logical operations on atomic_t variables that was the same across all architectures. Some of them have related operations called atomic_set_mask() and atomic_clear_mask(), but those operations are defined inconsistently across architectures when they are present at all.
To clean this situation up a bit, Peter introduced these new operations:
void atomic_and(int mask, atomic_t *value);
void atomic_or(int mask, atomic_t *value);
void atomic_xor(int mask, atomic_t *value);
void atomic64_and(int mask, atomic64_t *value);
void atomic64_or(int mask, atomic64_t *value);
void atomic64_xor(int mask, atomic64_t *value);
There is also a pair of simple wrappers (atomic_andnot() and atomic64_andnot()) that simply flip the bits of the mask argument.
All of these functions have a void type; there are no _return variants (e.g. atomic_and_return()) that return the result of the operation at the same time. Uses of atomic_set_mask() and atomic_clear_mask() in the tree are changed to use the new functions, and the old ones have been deprecated.
Relaxed atomics
Atomic operations do not normally function as memory barriers; in other words, the processor and the compiler are both free to reorder atomic operations relative to other operations in ways that could create confusion in concurrent situations. The exception to that rule is the _return operations; for example, atomic_add_return() will add a value to an atomic_t, return the resulting value, and function as a full memory barrier.
Those rules are looking increasingly inadequate when faced with the growing complexity and concurrency of contemporary systems. All-or-nothing memory barriers are an overly blunt tool for developers who are working to maximize concurrency and minimize the cost of the associated operations. What developers would like to see instead is the ability to explicitly control barriers with "acquire" and "release" semantics.
For those who don't want to do a quick read through the increasingly scary memory-barriers.txt file, here is a quick refresher. An "acquire" operation (usually a read) contains a barrier guaranteeing that the operation will complete before any subsequent reads or writes. A "release" operation (normally a write) guarantees that any reads or writes issued prior to the release will complete before the release operation itself completes. Acquire and release operations are thus only partial barriers. In many situations, though, they are all that is needed, and they can be less expensive than full barriers; developers seeking to maximize performance thus want to use them whenever possible.
Will Deacon set out to provide that control with atomic operations. The result was a new set of atomic operations:
int atomic_read_acquire(atomic_t *value);
void atomic_set_release(atomic_t *value, int newvalue);
int atomic_add_return_relaxed(int i, atomic_t *value);
int atomic_add_return_acquire(int i, atomic_t *value);
int atomic_add_return_release(int i, atomic_t *value);
int atomic_sub_return_relaxed(int i, atomic_t *value);
int atomic_sub_return_acquire(int i, atomic_t *value);
int atomic_sub_return_release(int i, atomic_t *value);
/*
 * And so on for atomic_xchg(), atomic_cmpxchg(),
 * xchg(), and cmpxchg().
 */
Will's patch also defines the 64-bit and atomic_long_t versions of the above functions. In each case, the "bare" version of the name (e.g. atomic_add_return()) gives full-barrier semantics, while the _relaxed version provides no barrier at all. In between are the versions that include barriers with acquire or release semantics.
The first use of these new primitives is with the queued reader/writer lock code. Assuming they are merged, they will likely find their way into other performance-sensitive parts of the kernel in short order. That should be good for the speed of the system (though no benchmark numbers have been posted), but it comes at the cost of requiring more developers to understand the details of how the barrier semantics work. It is becoming increasingly hard to hide these details in architecture-specific code over time. As the complexity of our systems grows, the complexity of the software will have to increase as well.
Domesticating applications, OpenBSD style
One of the many approaches to improving system security consists of reducing the attack surface of a given program by restricting the range of system calls available to it. If an application has no need for access to the network, say, then removing its ability to use the socket() system call should cause no loss in functionality while reducing the scope of the mischief that can be made should that application be compromised. In the Linux world, this kind of sandboxing can be done using a security module or the seccomp() system call. OpenBSD has lacked this capability so far, but it may soon gain it via a somewhat different approach than has been seen in Linux.
It is fair to characterize the sandboxing features in Linux as being relatively complex. The complexity of the security module options, and SELinux in particular, is legendary. The seccomp() system call has two modes: very simple (in which case almost nothing but read() and write() is allowed), or rather complex (a program written in the Berkeley packet filter (BPF) language makes decisions on system call availability). There is a great deal of flexibility available with both security modules and seccomp(), but it comes at a cost.
OpenBSD leader Theo de Raadt is particularly scornful of the BPF-based approach. His posting contains a work-in-progress implementation of a simpler approach to sandboxing (mostly written by Nicholas Marriott, it seems) in the form of a system call named tame().
The core idea behind tame() is that most applications run in two phases: initialization and steady-state execution. The initialization phase typically involves opening files, establishing network connections, and more; after initialization is complete, the program may not need to do any of those things. So there is often an opportunity to reduce an application's privilege level as it moves out of the initialization phase. tame() performs that privilege reduction; it is thus meant to be placed within an application, rather than (as with SELinux) imposed on it from the outside.
The system call itself is simple enough:
int tame(int flags);
If flags is passed as zero, the only system call available to the process thereafter will be _exit(). This mode is thus suitable for a process cranking on data stored in shared memory, but not much else. For most real-world applications, the reduction in privilege will need to be a bit less heavy-handed. That is what the flags are for. If any flags at all are present, a base set of system calls, with read-only functionality like getpid(), is available. For additional privilege, specific flags must be used.
A process may make multiple calls to tame(), but it can only restrict its current capabilities. Once a particular flag has been cleared, it cannot be set again.
The patch includes changes to a number of OpenBSD utilities. The cat command is restricted to TAME_MALLOC and TAME_RPATH, for example; never again will cat be able to run amok on the net. The ping command gets access to the net, instead, but loses the ability to access the filesystem. And so on.
This system call has a number of features that may look a bit strange to developers used to Linux. It encodes quite a bit of policy in the kernel, including where the password database is stored and the use of Yellow Pages/NIS; one would grep in vain for ypbind.lock in the Linux kernel source. tame() may seem limited in the range of restrictions that it can apply to a process; it will almost certainly allow more than what is strictly needed in most cases. It thus lacks the flexibility that Linux developers typically like to see.
On the other hand, using tame(), it was evidently possible to add restrictions to a fair number of system commands with a relatively small amount of work and little code. Writing ad hoc BPF programs or SELinux policies to accomplish the same thing would have taken quite a bit longer and would have been more error-prone. tame(), thus, looks like a way to add another layer of defense to a program in a quick and standardized way; as such, it may, in the end, be used more than something like seccomp().
If the tame() interface proves to be successful in the BSD world, there is an interesting possibility on the Linux side: it should be possible to completely implement that functionality in user space using the seccomp() feature (though it would probably be necessary to merge one of the patches adding extended BPF functionality to seccomp()). We would then have the simple interface for situations where it is adequate while still being able to write more flexible filter policies where they are indicated. It could be the best of both worlds.
The first step, though, would probably be to let the OpenBSD project explore this space and see what kind of results it gets. The ability to try out different models is one of the strengths that comes from having competing kernels out there. The ability to quickly copy that work is, instead, an advantage that comes from free software. If this approach to attack-surface reduction works out, we in the Linux world may, too, be able to tame() our cat in the future.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet