Kernel development
Brief items
Kernel release status
The current development kernel is 4.2-rc3, released on July 19. Linus said: "Normal Sunday release schedule, and a fairly normal rc release. There was some fallout from the x86 FPU cleanups, but that only hit CPU's with the xsaves instruction, and it should be all good now."
Stable updates: 4.1.3 and 4.0.9 were released on July 21.
Quotes of the week
/*
 * Well, when the cobbler got mad like this, he would go into hiding. He
 * would not make or sell any boots. He would not go out at all. Pretty
 * soon, the coffee shop would have to close because the cobbler wasn't
 * coming by twice a day any more. Then the grocery store would have to
 * close because he wouldn't eat much. After a while, everyone would panic
 * and have to move from the village and go live with all their relatives
 * (usually the ones they didn't like very much).
 *
 * Eventually, the cobbler would work his way out of his bad mood, and
 * open up his boot business again. Then, everyone else could move back
 * to the village and restart their lives, too.
 *
 * Fortunately, we have been able to collect up all the cobbler's careful
 * notes (and we wrote them down below). We'll have to keep checking these
 * notes over time, too, just as the cobbler does. But, in the meantime,
 * we can avoid the panic and the reboot since we can make sure that each
 * subtable is doing okay. And that's what bad_madt_entry() does.
 */
Gorman: Continual testing of mainline kernels
Mel Gorman introduces SUSE's kernel performance-testing system. "Marvin is a system that continually runs performance-related tests and is named after another robot doomed with repetitive tasks. When tests are complete it generates a performance comparison report that is publicly available but rarely linked. The primary responsibility of this system is to check SUSE Linux for Enterprise kernels for performance regressions but it is also configured to run tests against mainline releases."
Kernel development news
rm -r fs/ext3
The kernel development community is quite good at adding code to the kernel; its record on removing code is not always quite so bright. There are all kinds of reasons why removing code can be difficult; often, even code that appears to be without use stays around just in case somebody, somewhere, still needs it. Removal can be hard even when there is a known replacement that should work for all users; that can be seen in the case of the ext3 filesystem.
A few eyebrows went up when Jan Kara posted a patch removing the ext3 filesystem recently. Some users clearly thought the move represented a forced upgrade to ext4; Randy Dunlap remarked that "this looks like an April 1 joke to me". In truth, it is neither a joke nor a forced upgrade; it is, however, an interesting story to look back at.
Nine years ago, in the middle of 2006, the premier filesystem for most users was ext3, but that filesystem was showing its age in a few ways. Its 32-bit block pointers limited maximum filesystem size to 8TB, a limit that was not too restrictive for most users at the time, but which would be highly problematic today. The filesystem tracks blocks in files with individual pointers, leading to large amounts of metadata overhead and poor performance on larger files. These problems, along with a number of missing features, had long since convinced developers that something newer and better was required.
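The arithmetic behind that 8TB ceiling is easy to check. The assumptions below are mine, not the article's: 4KiB filesystem blocks, and only 2**31 usable block numbers out of the 32-bit block pointer.

```python
# Back-of-the-envelope check of the ext3 size ceiling cited above.
# Assumptions (not from the article): 4 KiB blocks, and 2**31 usable
# block numbers from the 32-bit block pointer.
block_size = 4096            # bytes per filesystem block
usable_blocks = 2 ** 31      # effectively signed 32-bit block numbers
max_fs_size = block_size * usable_blocks
print(max_fs_size // 2 ** 40, "TiB")  # -> 8 TiB
```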
For a while, some thought that might be a filesystem called reiser4, but that story failed to work out well even before that filesystem's primary developer left the development community.
The ext3 developers came up with a number of patches aimed at easing its scalability problems. These patches were made directly against the ext3 filesystem, with the idea that ext3 would evolve in the direction that was needed. There was, however, some resistance to the idea of making major changes to ext3 from developers who valued that filesystem in its current, stable form. One of those developers, it turned out, was Linus who, as we all know, has a relatively strong voice in such decisions.
And so it came to be that the ext3 developers announced their intent to create a new filesystem called "ext4"; all new-feature development would be done there. Actually, the new filesystem was first called "ext4dev" to emphasize its experimental nature; the plan was to rename it to "ext4" once things were stable, "probably in 6-9 months". In the real world, that renaming happened nearly 28 months later and was merged for the 2.6.28 kernel.
Since then, of course, ext4 has become the primary Linux filesystem for many users. It has seen many new features added, and it is not clear that this process will stop, even though ext4 is now in the same position that ext3 was nine years ago. Through this entire history, though, ext4 has retained the ability to mount and manage ext2 and ext3 filesystems; it can be configured to do so transparently in the absence of the older ext2 and ext3 modules. And, indeed, many distributions now don't bother to build the older filesystem modules, relying on ext4 to manage all three versions of the filesystem.
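One reason a single driver can serve all three formats is that ext2, ext3, and ext4 share the same on-disk superblock layout and magic number; the variants are told apart by feature flags. The sketch below (my own illustration, not from the article; the offsets come from the published ext2/ext4 disk layout, with the superblock at byte 1024 and s_magic at byte 56 within it) reads that shared magic from an image file:

```python
import struct
import tempfile

EXT_MAGIC = 0xEF53  # shared by ext2, ext3, and ext4

def read_ext_magic(image_path):
    """Return the filesystem magic number from an ext* image."""
    # The superblock begins 1024 bytes into the device; s_magic is a
    # little-endian u16 at offset 56 within the superblock.
    with open(image_path, "rb") as f:
        f.seek(1024 + 56)
        (magic,) = struct.unpack("<H", f.read(2))
    return magic

# Demo on a synthetic image: only the magic field is populated here.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1024 + 56) + struct.pack("<H", EXT_MAGIC))
    path = f.name
print(hex(read_ext_magic(path)))  # -> 0xef53
```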
Back when ext4 was created, it was envisioned that the older filesystem code would eventually become unnecessary. The plan was that when this happened, "perhaps 12-18 months out", the ext3 code would be removed. Once again, reality had something different to say, and the ext3 code endured for over nine years. Unless something surprising happens, though, that record is about to come to an end; ext3 could be removed as soon as the 4.3 development cycle, taking some 28,000 lines of code with it. And most users, even those with ext3 filesystems, will not even notice.
One might well wonder whether we will see a similar story in the future and the addition of an ext5 filesystem. For the time being, that does not seem to be in the works. Ext4 has picked up a number of features in recent years, with encryption as the most recent example, but there has been no talk of moving development to a new source base. Over the years, perhaps, the ext4 developers have done well enough at not breaking things that users are less worried about new development than they once might have been.
At the other end, there is the question of the ext2 filesystem. That code, too, could be replaced by ext4, but there seems to be no pressure to do so. Ext2 is small, weighing in at less than 10,000 lines of code; ext3 and the associated JBD journaling code come in at 28,000, while ext4 and JBD2 add up to closer to 60,000 lines. The simplicity of ext2 makes it a good filesystem for developers to experiment with, and its maintenance cost is nearly zero. So there is no real reason to take it out anytime soon.
Ext3, being rather larger than ext2, is a more promising target for removal, though Jan said that its maintenance cost was pretty low. The fact that this code has been so thoroughly replaced makes the removal decision relatively easy, but that decision still took nine years to come about. Even so, if all old kernel code were this easy to get rid of, the kernel would be quite a bit smaller than it is today.
Atomic additions
Atomic variables have a long history as part of the kernel's concurrency-management toolkit. These variables enable the execution of simple arithmetic (and related) operations in an all-or-nothing manner; other CPUs will never see partially-executed operations. As systems grow more complex, though, atomic variables are having to become more complex as well, as seen by a couple of recently proposed additions to the atomic_t repertoire.
Atomic logical operations
The simpler addition is the atomic logical operations patch set from Peter Zijlstra. Peter noted that there was no set of logical operations on atomic_t variables that was the same across all architectures. Some of them have related operations called atomic_set_mask() and atomic_clear_mask(), but those operations are defined inconsistently across architectures when they are present at all.
To clean this situation up a bit, Peter introduced these new operations:
void atomic_and(int mask, atomic_t *value);
void atomic_or(int mask, atomic_t *value);
void atomic_xor(int mask, atomic_t *value);
void atomic64_and(int mask, atomic64_t *value);
void atomic64_or(int mask, atomic64_t *value);
void atomic64_xor(int mask, atomic64_t *value);
There is also a pair of simple wrappers (atomic_andnot() and atomic64_andnot()) that simply flip the bits of the mask argument.
All of these functions have a void type; there are no _return variants (e.g. atomic_and_return()) that return the result of the operation at the same time. Uses of atomic_set_mask() and atomic_clear_mask() in the tree are changed to use the new functions, and the old ones have been deprecated.
Relaxed atomics
Atomic operations do not normally function as memory barriers; in other words, the processor and the compiler are both free to reorder atomic operations relative to other operations in ways that could create confusion in concurrent situations. The exception to that rule is the _return operations; for example, atomic_add_return() will add a value to an atomic_t, return the resulting value, and function as a full memory barrier.
Those rules are looking increasingly inadequate when faced with the growing complexity and concurrency of contemporary systems. All-or-nothing memory barriers are an overly blunt tool for developers who are working to maximize concurrency and minimize the cost of the associated operations. What developers would like to see instead is the ability to explicitly control barriers with "acquire" and "release" semantics.
For those who don't want to do a quick read through the increasingly scary memory-barriers.txt file, here is a quick refresher. An "acquire" operation (usually a read) contains a barrier guaranteeing that the operation will complete before any subsequent reads or writes. A "release" operation (normally a write) guarantees that any reads or writes issued prior to the release will complete before the release operation itself completes. Acquire and release operations are thus only partial barriers. In many situations, though, they are all that is needed, and they can be less expensive than full barriers; developers seeking to maximize performance thus want to use them whenever possible.
Will Deacon set out to provide that control with atomic operations. The result was a new set of atomic operations:
int atomic_read_acquire(atomic_t *value);
void atomic_set_release(atomic_t *value, int newvalue);
int atomic_add_return_relaxed(int i, atomic_t *value);
int atomic_add_return_acquire(int i, atomic_t *value);
int atomic_add_return_release(int i, atomic_t *value);
int atomic_sub_return_relaxed(int i, atomic_t *value);
int atomic_sub_return_acquire(int i, atomic_t *value);
int atomic_sub_return_release(int i, atomic_t *value);
/*
 * And so on for atomic_xchg(), atomic_cmpxchg(),
 * xchg(), and cmpxchg().
 */
Will's patch also defines the 64-bit and atomic_long_t versions of the above functions. In each case, the "bare" version of the name (e.g. atomic_add_return()) gives full-barrier semantics, while the _relaxed version provides no barrier at all. In between are the versions that include barriers with acquire or release semantics.
The first use of these new primitives is with the queued reader/writer lock code. Assuming they are merged, they will likely find their way into other performance-sensitive parts of the kernel in short order. That should be good for the speed of the system (though no benchmark numbers have been posted), but it comes at the cost of requiring more developers to understand the details of how the barrier semantics work. It is becoming increasingly hard to hide these details in architecture-specific code over time. As the complexity of our systems grows, the complexity of the software will have to increase as well.
Domesticating applications, OpenBSD style
One of the many approaches to improving system security consists of reducing the attack surface of a given program by restricting the range of system calls available to it. If an application has no need for access to the network, say, then removing its ability to use the socket() system call should cause no loss in functionality while reducing the scope of the mischief that can be made should that application be compromised. In the Linux world, this kind of sandboxing can be done using a security module or the seccomp() system call. OpenBSD has lacked this capability so far, but it may soon gain it via a somewhat different approach than has been seen in Linux.
It is fair to characterize the sandboxing features in Linux as being relatively complex. The complexity of the security module options, and SELinux in particular, is legendary. The seccomp() system call has two modes: very simple (in which case almost nothing but read() and write() is allowed), or rather complex (a program written in the Berkeley packet filter (BPF) language makes decisions on system call availability). There is a great deal of flexibility available with both security modules and seccomp(), but it comes at a cost.
OpenBSD leader Theo de Raadt is particularly scornful of the BPF-based approach. His posting contains a work-in-progress implementation of a simpler approach to sandboxing (mostly written by Nicholas Marriott, it seems) in the form of a system call named tame().
The core idea behind tame() is that most applications run in two phases: initialization and steady-state execution. The initialization phase typically involves opening files, establishing network connections, and more; after initialization is complete, the program may not need to do any of those things. So there is often an opportunity to reduce an application's privilege level as it moves out of the initialization phase. tame() performs that privilege reduction; it is thus meant to be placed within an application, rather than (as with SELinux) imposed on it from the outside.
The system call itself is simple enough:
int tame(int flags);
If flags is passed as zero, the only system call available to the process thereafter will be _exit(). This mode is thus suitable for a process cranking on data stored in shared memory, but not much else. For most real-world applications, the reduction in privilege will need to be a bit less heavy-handed. That is what the flags are for. If any flags at all are present, a base set of system calls, with read-only functionality like getpid(), is available. For additional privilege, specific flags must be used.
A process may make multiple calls to tame(), but it can only restrict its current capabilities. Once a particular flag has been cleared, it cannot be set again.
The patch includes changes to a number of OpenBSD utilities. The cat command is restricted to TAME_MALLOC and TAME_RPATH, for example; never again will cat be able to run amok on the net. The ping command gets access to the net, instead, but loses the ability to access the filesystem. And so on.
This system call has a number of features that may look a bit strange to developers used to Linux. It encodes quite a bit of policy in the kernel, including where the password database is stored and the use of Yellow Pages/NIS; one would grep in vain for ypbind.lock in the Linux kernel source. tame() may seem limited in the range of restrictions that it can apply to a process; it will almost certainly allow more than what is strictly needed in most cases. It thus lacks the flexibility that Linux developers typically like to see.
On the other hand, using tame(), it was evidently possible to add restrictions to a fair number of system commands with a relatively small amount of work and little code. Writing ad hoc BPF programs or SELinux policies to accomplish the same thing would have taken quite a bit longer and would have been more error-prone. tame(), thus, looks like a way to add another layer of defense to a program in a quick and standardized way; as such, it may, in the end, be used more than something like seccomp().
If the tame() interface proves to be successful in the BSD world, there is an interesting possibility on the Linux side: it should be possible to completely implement that functionality in user space using the seccomp() feature (though it would probably be necessary to merge one of the patches adding extended BPF functionality to seccomp()). We would then have the simple interface for situations where it is adequate while still being able to write more flexible filter policies where they are indicated. It could be the best of both worlds.
The first step, though, would probably be to let the OpenBSD project explore this space and see what kind of results it gets. The ability to try out different models is one of the strengths that comes from having competing kernels out there. The ability to quickly copy that work is, instead, an advantage that comes from free software. If this approach to attack-surface reduction works out, we in the Linux world may, too, be able to tame() our cat in the future.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet