Kernel development

Brief items

Kernel release status

The current development kernel is 4.8-rc5, released on September 4. Linus said: "So rc5 is noticeably bigger than rc4 was, and my hope last week that we were starting to calm down and shrink the releases seems to have been premature. [...] Not that any of this looks worrisome per se, but if things don't start calming down from now, this may be one of those releases that will need an rc8. We'll see."

Stable updates: 4.7.3, 4.4.20, and 3.14.78 were released on September 7. Note that the 3.14 long-term series is finally coming to an end, with only one more update planned.

Quotes of the week

The fact that there is demand for a collaborative project on a common kernel tree to carry features for the embedded folks suggests they are already feeling the pain themselves.

What is missing is the realization that we already have such a tree, where everybody (not just the embedded folks) are collaborating on features.

The upstream kernel.

Rik van Riel

Actually what you did with SoC vendors from Chrome OS and stating clearly that upstream presence is a factor in procurement was the *only* thing I have ever seen that actually works to change the behaviour of an entire company, apart from dedicated individuals on the inside of the companies. [...]

The day the Android people say that for a Nexus(-ish) device it's gonna be all upstream kernel and they will pick the SoC that delivers that, then things will happen.

Linus Walleij

Kernel development news

Audit, namespaces, and containers

By Jake Edge
September 8, 2016

Linux Security Summit

Richard Guy Briggs works on the kernel's audit subsystem for Red Hat and has run into some problems with the interaction between the audit daemon (auditd) and namespaces. He gave a report on those difficulties to the Linux Security Summit in Toronto. In the talk, he also looked at containers and what they might mean in the context of audit.

Audit was started in 2004, around the same time that the kernel started using Git. It is a "syslog on steroids", he said. Syslog is used a lot for debugging, but audit is meant as a secure audit trail to log kernel and other events in a way that could be used in court. There are configurable filters in the kernel for what events should be logged and it has the auditd user-space daemon that can log to disk or the network.

Audit only reports on behavior; it does not actively interfere with what is going on in the system. There is one exception, though: audit can panic the system if it cannot log its data.

Briggs then went through a bit of an introduction to namespaces in Linux, noting that they are a kernel-enforced user-space view of the resource being namespaced. There are seven different namespaces in Linux; three are hierarchical in nature (PID, user, and control groups), which means their permissions and configuration are inherited from the parent namespace, while the other four are made up of peer namespaces (mount, UTS, IPC, and net).

He is not sure that anyone actually uses IPC namespaces, he said, but the net namespace is one of the easier ones to understand. Network devices can be assigned to a net namespace from the initial net namespace, so each namespace can have its own firewall rules, routing, and so on. If two namespaces need to talk, a virtual ethernet pair can be set up between them.

The user namespace has been "the most contentious one so far", as there are a number of "security traps" in allowing unprivileged users to be root within the namespace. Many distributions don't enable the feature by default at this point. The control groups (cgroups) namespace, which is the most recent namespace (added in 4.6), is meant to hide system-resource limits so that processes only see what resources have been allocated to their cgroup.

Namespaces are one component of the concept of containers, but there really is no hard definition of containers, Briggs said. In fact, there are almost as many definitions as there are users of containers. There is some general agreement that containers use a combination of namespaces, cgroups, and seccomp to partition some portion of the system into its own world.

But the kernel has no concept of a container at all. Managing containers is left up to a user-space container-orchestration system of some kind. From an audit perspective, though, there is interest in having some knowledge of containers in the kernel. That might be through some form of "container ID" or simply by collecting up the namespace IDs that correspond to the container.

Problems

With that introductory material out of the way, he turned to the problems, which boil down to a Highlander quote: "There can be only one." The "one" in this case is auditd, which runs in the initial namespace but must be reachable from the other namespaces. For the mount, UTS, and IPC namespaces, there have been no problems, but others do have a variety of issues.

For example, the net namespace partitions netlink sockets (which processes use to talk to the audit subsystem) so that only processes in the initial network namespace can send their audit messages. That broke various container implementations because things like pluggable authentication modules (PAM) would try to write an audit message and get an unexpected error return. Instead of the ECONNREFUSED error that it expected when auditd cannot be reached, PAM-using programs (e.g. the login process) would get EPERM and fail such that users could not log in. The short-term solution for that was to simply "lie" in non-initial namespaces and return the expected error message so that user-space programs do not break.
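The failure mode can be illustrated with a small user-space sketch. The helper name audit_error_is_fatal() is hypothetical and is not taken from PAM or libaudit; it only models a caller that treats ECONNREFUSED as "no audit daemon is listening" and any other error as a real failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical policy, not real PAM or libaudit code: a refused
 * connection just means auditd is not running, so the caller carries
 * on; any other error is treated as a genuine failure.
 */
static bool audit_error_is_fatal(int err)
{
	return err != 0 && err != ECONNREFUSED;
}
```

Under a policy like this one, an unexpected EPERM from a non-initial network namespace aborts the login, which is why returning the expected error keeps such callers working.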

For PID namespaces, the problem cropped up with vsftpd authentication that wanted to write a log message to auditd. Until 3.15, that could only be done from the initial PID namespace, where processes could see the PID of auditd. Some distributions put vsftpd in its own PID namespace, however, which meant that vsftpd could not talk to auditd. By adding the CAP_AUDIT_WRITE capability to the program and adding some code in 3.15, though, that could be worked around.

PID namespaces also present another problem for audit: the PIDs that get reported are not the "real" PIDs in the system. Processes within a PID namespace get their own PID range that is separate from the PIDs in the parent namespace (which might be the initial namespace where the real system PIDs are used). So audit needed to do a translation of the PID reported from non-initial PID namespaces. Someday, when CAP_AUDIT_CONTROL is allowed in PID namespaces (so that processes with that capability can configure the audit filters), there will need to be more cleanup done on the PID handling in the kernel, he said.

Allowing multiple auditd processes in the system would be reasonable if they are tied to user namespaces. There was an idea "thrown around" about creating a new audit namespace, but it became clear that yet another namespace was not a particularly popular idea. Having one auditd per user namespace still requires some process having CAP_AUDIT_CONTROL within the namespace. He wondered if the process creating the user namespace also needed that capability.

Beyond that, the configuration of audit running in the initial namespace cannot be changed from inside user namespaces, even with the capability. In particular, only the audit instance in the initial namespace can panic the system; an audit instance in a user namespace that cannot log (and thus wants to panic) might instead kill off the user namespace and all of its children. So each user namespace will get its own set of audit rules (a "rulespace") and its own event queue. Originally it was thought that the event queue might be shared by all of the auditd processes, but a single one that overflowed the queue would affect the rest of the system, which is unacceptable, Briggs said.

There is interest in being able to track containers by some kind of ID. There was a proposal in 2013 to use the /proc inode number that uniquely identifies each namespace in the audit log messages. He felt that was harder to use, so he prototyped a simple incrementing serial number for each namespace. The checkpoint/restore in user space (CRIU) developers were not happy with that, since those numbers would not easily translate during a migration.

Eventually, Briggs reworked the inode-number-based scheme to work with the namespace filesystem (nsfs). Each event then has a set of namespace IDs along with a device ID for the nsfs. That allows container orchestration systems to track the information, even across migrations, which allows them to aggregate logs from multiple hosts.
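From user space, the same pair of numbers can be read by calling stat() on the links under /proc/self/ns. The sketch below is only an illustration of where the identifiers come from; the helper name ns_id() is made up for this example and has nothing to do with the audit patches themselves.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Look up the (device, inode) pair for one of the calling process's
 * namespaces, e.g. "net" or "pid". Returns 0 on success, -1 on failure.
 * ns_id() is a hypothetical helper for this sketch.
 */
static int ns_id(const char *name, dev_t *dev, ino_t *ino)
{
	char path[64];
	struct stat st;
	int n = snprintf(path, sizeof(path), "/proc/self/ns/%s", name);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	if (stat(path, &st) != 0)	/* follows the link to the nsfs inode */
		return -1;
	*dev = st.st_dev;	/* device ID of the nsfs instance */
	*ino = st.st_ino;	/* inode number identifying the namespace */
	return 0;
}
```

Two processes are in the same namespace of a given type when both the device ID and the inode number match.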

An alternative would be to add a "container ID" that would be set by the orchestration system and tracked in the task structure. The container ID would be inherited by children and audit events would contain the ID. There is precedent for this kind of ID, he said; session IDs are not something the kernel itself knows anything about, but it helps user space manage those values.

In conclusion, he said that namespace support for audit is largely working at this point, though changes for net and PID namespaces will be needed down the road. There is work to be done to allow multiple auditd processes anchored to the user namespace, as well. As far as IDs go, there is a decision to be made between the list of namespace IDs and a single kernel-managed container ID. He favors the former, even though eight separate numbers are harder to work with. Either solution will require higher-level tools to map, track, and aggregate information about the containers across multiple hosts.

[I would like to thank the Linux Foundation for travel support to attend the Linux Security Summit in Toronto.]

Atomic patterns 2: coupled atomics

September 7, 2016

This article was contributed by Neil Brown

Our recent survey of the use of atomic operations in the Linux kernel covered the use of simple flags and counters, along with various approaches to gaining exclusive access to some resource or other. On reaching the topic of shared access we took a break, in part because reference counts, which are the tool for managing shared access, have been covered before. Much of that earlier content requires no more than a brief recap, but the use of biases, then described as an anti-pattern, is worthy of further examination as it is a stepping stone toward understanding a range of other patterns for the use of atomics in the kernel.

Recap: three styles of reference counters

I previously identified three styles of reference counters used in Linux; my recent explorations have found no reason to adjust that list. The distinction between the three involves what happens when the count reaches zero.

When a "plain" reference count reaches zero, nothing particular happens beyond the obvious. Some code somewhere might occasionally check if the counter is zero and behave differently if it is, but the moment of transition from non-zero to zero has no significance. A good example is child_count used by the runtime power management code. This allows a "child" device to hold a reference on its parent to keep it active. Unless it has been configured to ignore_children, the parent will be kept active as long as any child still holds a reference.

When a "kref" reference count reaches zero, some finalization operation happens on the object; typically it is freed. Code requiring that pattern should use the struct kref data type, though an atomic_t counter and atomic_dec_and_test() can be used if there is a good reason to avoid kref.
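As a rough user-space illustration of the "kref" style, the sketch below uses C11 atomics in place of the kernel's atomic_t; struct thing and its helpers are invented names, and the release callback only sets a flag where real code would typically free the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object; not a kernel structure. */
struct thing {
	atomic_int refs;
	bool released;	/* set by the release step in this sketch */
};

static void thing_release(struct thing *t)
{
	t->released = true;	/* real code would typically free the object */
}

static void thing_get(struct thing *t)
{
	atomic_fetch_add_explicit(&t->refs, 1, memory_order_relaxed);
}

/* Drop a reference; whoever takes the count to zero finalizes the object. */
static bool thing_put(struct thing *t)
{
	/* fetch_sub returns the old value, so 1 means we just hit zero. */
	if (atomic_fetch_sub_explicit(&t->refs, 1, memory_order_acq_rel) == 1) {
		thing_release(t);
		return true;
	}
	return false;
}
```

The decrement and the test for zero happen in one atomic step, so exactly one caller sees the transition to zero and runs the finalization.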

Finally, the "kcref" counter is not allowed to reach zero unless a lock is held. Code implementing this pattern can use atomic_dec_and_lock(), which takes a spinlock only if it is likely to be needed. A more general approach that can work with any sort of lock is to have a fast path that uses atomic_add_unless() to decrement the counter as long as its value is not one. If this fails, the lock can be taken and atomic_dec_and_test() or similar can be used. hw_perf_event_destroy() in the perf code displays this quite nicely.
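The fast-path/slow-path split for a "kcref" can be sketched in user space as follows. This is an approximation under stated assumptions: add_unless() is a hand-written C11 stand-in for the kernel's atomic_add_unless(), the tiny spinlock stands in for whatever lock protects the object, and struct entry is a made-up type.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Tiny spinlock standing in for whichever lock protects the object. */
static atomic_flag table_lock = ATOMIC_FLAG_INIT;

static void table_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&table_lock, memory_order_acquire))
		;	/* spin */
}

static void table_lock_release(void)
{
	atomic_flag_clear_explicit(&table_lock, memory_order_release);
}

struct entry {			/* hypothetical object */
	atomic_int refs;
	bool in_table;		/* protected by table_lock */
};

/* Add 'a' to *v unless it currently equals 'u'; true if the add happened. */
static bool add_unless(atomic_int *v, int a, int u)
{
	int cur = atomic_load(v);

	while (cur != u) {
		/* On failure, 'cur' is refreshed with the value found. */
		if (atomic_compare_exchange_weak(v, &cur, cur + a))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool entry_put(struct entry *e)
{
	bool last;

	/* Fast path: not the final reference, so no lock is needed. */
	if (add_unless(&e->refs, -1, 1))
		return false;

	/* Slow path: the count may reach zero, so take the lock first. */
	table_lock_acquire();
	last = (atomic_fetch_sub(&e->refs, 1) == 1);
	if (last)
		e->in_table = false;	/* unlink while the lock is held */
	table_lock_release();
	return last;
}
```

Note that the slow path rechecks the result of the decrement: another thread may have taken a new reference between the failed fast path and the lock acquisition.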

Counter bias: multiple values in the one atomic

A number of reference counters in Linux (e.g. in procfs and kernfs) have a "bias" added to the value. This bias is a large value (larger than the normal range of the counter) that can be added to the counter's value. The presence or absence of the bias can easily be detected even as the counter itself moves up or down. This allows a boolean value to be stored in the same variable as the counter. I previously described this as an anti-pattern; a proper solution would instead use a separate variable (or bit in a bitmap) to store the boolean value. When the counter and the boolean are changed independently, I stand by that assessment, but sometimes there is value in being able to control them together in a single operation.

A particularly simple example is found in the function-tracing (ftrace) code for the SuperH architecture. The nmi_running counter sometimes has its most significant bit set, effectively using a bias of 2^31. This flag, which is used to provide synchronization between ftrace and non-maskable interrupt handlers, may be cleared at any time, but may only be set when the value of the counter is zero. Normally, when there is a need to synchronize the change in one value with some other value, it is simplest to hold a spinlock whenever either value is changed, but that is not necessarily the fastest way. If the two values of interest can be stored in the same machine word, then an atomic compare-exchange operation, often in a loop to handle races, can achieve the same end more efficiently.
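A compare-exchange loop of that kind might look like the following C11 sketch. It is modeled loosely on the description above rather than copied from the SuperH code; the names MOD_FLAG, running, and the helper functions are all illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MOD_FLAG	(UINT32_C(1) << 31)	/* the "bias": top bit */

static _Atomic uint32_t running;	/* bit 31: flag, bits 0-30: counter */

/* Set the flag, but only if the counter part is currently zero. */
static bool try_set_flag(void)
{
	uint32_t old = atomic_load(&running);

	do {
		if (old & ~MOD_FLAG)
			return false;	/* counter is non-zero: refuse */
		/* A failed exchange reloads 'old', and the check repeats. */
	} while (!atomic_compare_exchange_weak(&running, &old, old | MOD_FLAG));
	return true;
}

/* The flag may be cleared at any time without disturbing the counter. */
static void clear_flag(void)
{
	atomic_fetch_and(&running, ~MOD_FLAG);
}

static void counter_inc(void) { atomic_fetch_add(&running, 1); }
static void counter_dec(void) { atomic_fetch_sub(&running, 1); }
static bool flag_is_set(void) { return atomic_load(&running) & MOD_FLAG; }
```

Because the check and the update are a single compare-exchange, there is no window in which the counter can become non-zero between the test and the setting of the flag.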

Having identified this pattern of two values being managed with a single atomic operation, we need a name for it; "coupled atomics" seems a good choice as the interdependence between the two values could be seen as a coupling. Other examples of this pattern are easy to find. The "lockref" type that was introduced in Linux 3.12 follows exactly this pattern, storing a 32-bit spinlock and a 32-bit reference count in a single 64-bit word that, on several popular architectures, can be updated atomically. Even this 32-bit spinlock itself is sometimes a multi-part atomic, as is the case for both ticket spinlocks and MCS locks.

The previous article mentioned two uses for the new atomic_fetch*() operations; we can now add a third. This one involves an atomic_t variable that contains a counter and a couple of flags, only this time the flags are in the least significant bits and the counter is in the higher-order bits. This atomic_t is used to implement a queued reader/writer spinlock. The flags record if a write lock is held, or if a writer is waiting for the lock. The counter, which is incremented by adding 256 (using the defined name _QR_BIAS) records the number of active readers. A new reader attempts to get a read lock using an atomic operation to add _QR_BIAS and then see if either of the flags were set in the result. If they were set, the read lock was not acquired; the failed reader subtracts the bias and tries again. Interestingly, the fast path code uses atomic_add_return(), while the slow path code uses the new atomic_fetch_add_acquire(). Either is quite suitable for the task, but a little more consistency would be nice.
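The reader side of such a lock can be approximated in user space as shown below. Only the bias of 256 comes from the description above; the flag values, the type name, and the trylock-style interface (where the caller retries on failure) are assumptions made for the sketch, and the writer path is reduced to the bare minimum needed to exercise the readers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed layout: low 8 bits hold writer flags, the rest count readers. */
#define WRITER_WAITING	0x01u
#define WRITER_LOCKED	0x02u
#define WRITER_MASK	0xffu
#define READER_BIAS	0x100u	/* 256: one active reader */

struct rwcount {
	atomic_uint cnts;
};

static bool read_trylock(struct rwcount *l)
{
	/* Optimistically announce ourselves, then look at what was there. */
	unsigned int old = atomic_fetch_add_explicit(&l->cnts, READER_BIAS,
						     memory_order_acquire);

	if (!(old & WRITER_MASK))
		return true;
	/* A writer holds or wants the lock: back out again. */
	atomic_fetch_sub_explicit(&l->cnts, READER_BIAS, memory_order_relaxed);
	return false;
}

static void read_unlock(struct rwcount *l)
{
	atomic_fetch_sub_explicit(&l->cnts, READER_BIAS, memory_order_release);
}

/* Minimal writer: succeeds only with no readers and no writer flags. */
static bool write_trylock(struct rwcount *l)
{
	unsigned int expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->cnts, &expected,
			WRITER_LOCKED, memory_order_acquire, memory_order_relaxed);
}

static void write_unlock(struct rwcount *l)
{
	atomic_fetch_and_explicit(&l->cnts, ~WRITER_LOCKED, memory_order_release);
}

static unsigned int reader_count(struct rwcount *l)
{
	return atomic_load(&l->cnts) / READER_BIAS;
}
```

The fetch-and-add form returns the value from before the addition, which is all the reader needs since only the flag bits are examined.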

Another example is the helpfully named combined_event_count counter in the system suspend code. This variable stores two counters: the number of in-progress wakeup events and the total number of completed wakeup events. When the in-progress counter is decremented, the total needs to be incremented; by combining the two counters in the one atomic value, the two can be updated in a single race-free operation.
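One way to get both updates out of a single atomic addition is to keep the in-progress count in the low half of the word and the completed count in the high half. The sketch below assumes a 32-bit word split 16/16 and uses invented function names; it illustrates the arithmetic and is not a copy of the suspend code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define IN_PROGRESS_BITS	16
#define MAX_IN_PROGRESS		((UINT32_C(1) << IN_PROGRESS_BITS) - 1)

/* High 16 bits: completed events; low 16 bits: events in progress. */
static _Atomic uint32_t combined;

static void event_start(void)
{
	atomic_fetch_add(&combined, 1);
}

/*
 * Adding 0xffff subtracts one from the low half (modulo 2^16) and, as
 * long as the low half was non-zero, carries one into the high half.
 */
static void event_finish(void)
{
	atomic_fetch_add(&combined, MAX_IN_PROGRESS);
}

static void split_counters(uint32_t *completed, uint32_t *in_progress)
{
	uint32_t comb = atomic_load(&combined);

	*completed = comb >> IN_PROGRESS_BITS;
	*in_progress = comb & MAX_IN_PROGRESS;
}
```

For instance, with two events in progress the word is 0x00002; adding 0xffff yields 0x10001, which is one completed event and one still in progress.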

More coupled atomics, big and small

Examples so far could be seen as mid-range examples, combining a counter with some other modestly sized value, typically another counter or a flag, into the one atomic value. To finish off we will look at two extremes in size, the largest and smallest.

Most atomics are 32 bits in size, though 64-bit values, whether pointers manipulated with cmpxchg() or the atomic_long_t type, are not exactly uncommon. What is uncommon is 128-bit atomic types. These are limited to three architectures (arm64, x86_64, and s390) and to a small number of users, mainly the SLUB memory allocator.

SLUB owns several fields in the page description structure: a pointer to a list of free space, some counters of allocated objects, and a "frozen" flag. Sometimes it wants to access or update several of these atomically. On a 32-bit host, these values all fit inside a 64-bit value. On a 64-bit machine, they don't, so a larger operation is needed; cmpxchg_double() is available on the listed architectures to allow this. It is given two pointers to 64-bit memory locations that must be consecutive, two values for comparison, and two values for replacement. Unlike the single-word cmpxchg() that always returns the value that was fetched, cmpxchg_double() returns a success status, rather than trying to squeeze 128 bits into the return value.

On 64-bit architectures without this 128-bit atomic option, SLUB will use a spinlock to gain the required exclusive access — effective, but not quite as fast. cmpxchg_double() seems to me to be an eloquent example of the lengths some kernel developers will go to in order to squeeze out that last drop of performance.

The other extreme in size is to combine two of the smallest possible data types into a single atomic: two bits. A simple example in the Xen events_fifo code clears one bit, EVTCHN_FIFO_MASKED, but only when the other bit, EVTCHN_FIFO_BUSY, is also clear. Manipulating multiple bits at once is another place where the new atomic_fetch*() operations could be useful. They do not support any dependency between bits as we see in the Xen example, but they could, for example, clear a collection of bits atomically and report which bits were cleared, by using atomic_fetch_and(). Similarly, if an atomic_t contained a counter in some of the bits, that counter could be extracted and zeroed without affecting other accesses. Whether these are actually useful I cannot say as there are no concrete examples to refer to. But the pattern of multiple values in the one atomic_t does seem to raise more possible uses for these new operations.
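Both ideas can be shown side by side in a short C11 sketch. The bit names and helpers below are hypothetical; the first function gives up when the busy bit is set and leaves any retry to the caller, which is a simplification chosen for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BIT_MASKED	0x1u	/* illustrative bit assignments */
#define BIT_BUSY	0x2u

/* Clear BIT_MASKED, but only while BIT_BUSY is clear; false if busy. */
static bool clear_masked_unless_busy(atomic_uint *word)
{
	unsigned int old = atomic_load(word);

	do {
		if (old & BIT_BUSY)
			return false;
		/* A failed exchange reloads 'old', and the check repeats. */
	} while (!atomic_compare_exchange_weak(word, &old, old & ~BIT_MASKED));
	return true;
}

/* Clear every bit in 'mask' and report which of them had been set. */
static unsigned int clear_and_report(atomic_uint *word, unsigned int mask)
{
	return atomic_fetch_and(word, ~mask) & mask;
}
```

The dependent case needs a compare-exchange loop because the decision rests on the current value, while the unconditional case is a single fetch-and-AND.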

Both a strength and a weakness

Having found these various patterns, several of which I did not expect, the overall impression I am left with is the tension between the strength and the weakness of C for implementing these patterns. On the one hand, C, together with the GCC extensions for inline assembly language code, provides easy access to low-level details that make it possible to implement the various atomic accesses in the most efficient way possible. On the other hand, the lack of a rich type system means that we tend to use the one type, atomic_t, for a wide range of different use cases. Some improvements might be possible there, as we saw with the introduction of the kref type, but I'm not sure how far we could take that. I contemplate the cmpxchg_double() usage in SLUB and wonder what sort of high-level language construct would make that more transparent and easy to read, and yet keep it as performant on all hardware as it currently is. It certainly would be nice if some of these patterns were more explicit in the code, rather than requiring careful analysis to find.

Reimplementing mutexes with a coupled lock

By Jonathan Corbet
September 8, 2016

Oscar Wilde once famously observed that fashion "is usually a form of ugliness so intolerable that we have to alter it every six months". Perhaps the same holds true of locking primitives in the kernel; basic mechanisms like the mutex have been through many incarnations over the years. This season, it would appear that coupled atomic locks are all the rage with the trendiest kernel developers, so it should not be surprising that a new mutex implementation using those locks is making the rounds. This code may be glittering and shiny, but it also has the potential to greatly simplify the mutex implementation.

A mutex is a sleeping lock, meaning that kernel code that tries to acquire a contended mutex may go to sleep to wait until that mutex becomes available. Early mutex implementations would always put a waiter to sleep, but, following the scalability trends of the day, mutexes soon gained a glamorous accessory: optimistic spinning. Waking a sleeping thread can take a long time and, once that thread gets going, it may find that the processor cache contains none of its data, leading to unfashionable cache misses. A thread that spins waiting for a mutex, instead, will be able to grab it quickly once it becomes available and will likely still be cache-hot. Enabling optimistic spinning can improve performance considerably. There is a cost, in that mutexes are no longer fair (they can be "stolen" from a thread that has been waiting for longer), but being properly à la mode is never free.

Optimistic spinning brings with it an interesting complication, though, in that it requires tracking the current owner of the mutex. If that owner sleeps, or if the owner changes while a thread is spinning, it doesn't make any sense to continue spinning, since the wait is likely to be long. As a field within the mutex, the owner information is best protected by the mutex itself. But, by its nature, this information must be accessed by threads that do not own the mutex. The result is some tricky code that is trying to juggle the lock itself and the owner information at the same time.

Peter Zijlstra has sent an alternative mechanism down the runway; it takes care of this problem by combining the owner information and lock status into a single field within the mutex. In current kernels, the count field, an atomic_t value, holds the status of the lock itself, while owner, a pointer to struct task_struct, indicates which thread owns the mutex. Peter's patch removes both of those fields, replacing them with a single atomic_long_t value called "owner".

This value is 64 bits wide, large enough to hold a pointer value. If the mutex is available, there is no owner, so the new owner field contains zero. When the mutex is taken, the acquiring thread's task_struct pointer is placed there, simultaneously indicating that the mutex is unavailable and which thread owns it. The task_struct structure must be properly aligned, though, meaning that the bottom bits of a pointer to it will always be zero, so those bits are available for other locking-related purposes. Following this season's coupled-lock trend, two of those bits are so used, in ways that will be described shortly.

With the new organization, the code to attempt to acquire a mutex now looks like this:

    static inline bool __mutex_trylock(struct mutex *lock)
    {
    	unsigned long owner, curr = (unsigned long)current;
    
    	owner = atomic_long_read(&lock->owner);
    	for (;;) { /* must loop, can race against a flag */
    		unsigned long old;
    
    		/* Some task already owns the mutex; give up. */
    		if (__owner_task(owner))
    			return false;
    		/* Install ourselves as owner, preserving any flag bits. */
    		old = atomic_long_cmpxchg_acquire(&lock->owner, owner,
    						  curr | __owner_flags(owner));
    		if (old == owner)
    			return true;
    		/* Lost a race (a flag or the owner changed); retry. */
    		owner = old;
    	}
    }

The __owner_task() and __owner_flags() macros simply mask out the appropriate parts of the owner field. The key is the atomic_long_cmpxchg_acquire() call, which attempts to store the current thread as the owner of the mutex on the assumption that it is available. Should some other thread own the mutex, that call will fail, and the mutex code will know that it will have to work harder.
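To make the masking concrete, here is a user-space approximation using C11 atomics and uintptr_t. Everything in it is an assumption made for illustration: struct task stands in for struct task_struct, the flag values and helper names are invented, and the code is not taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag layout: the two lowest bits of the owner word. */
#define FLAG_WAITERS	((uintptr_t)0x01)
#define FLAG_HANDOFF	((uintptr_t)0x02)
#define FLAG_MASK	((uintptr_t)0x03)

/* Stand-in for struct task_struct; alignment keeps the low bits zero. */
struct task {
	_Alignas(8) int id;
};

struct sketch_mutex {
	atomic_uintptr_t owner;	/* task pointer ORed with flag bits */
};

static struct task *owner_task(uintptr_t owner)
{
	return (struct task *)(owner & ~FLAG_MASK);
}

static uintptr_t owner_flags(uintptr_t owner)
{
	return owner & FLAG_MASK;
}

static bool sketch_trylock(struct sketch_mutex *m, struct task *self)
{
	uintptr_t owner = atomic_load_explicit(&m->owner, memory_order_relaxed);

	for (;;) {
		if (owner_task(owner))
			return false;	/* somebody else holds it */
		/* Store our pointer while carrying the flag bits over. */
		if (atomic_compare_exchange_weak_explicit(&m->owner, &owner,
				(uintptr_t)self | owner_flags(owner),
				memory_order_acquire, memory_order_relaxed))
			return true;
		/* 'owner' now holds the fresh value; go around again. */
	}
}
```

The loop is needed because a flag bit can change between the read and the exchange even when the mutex itself stays free.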

There are currently two flags that can be stored in the least significant bits of owner. If a thread finds it must sleep while waiting for a contended mutex, it will set MUTEX_FLAG_WAITERS; the thread currently holding the mutex will then know it must wake the waiters when the mutex is freed. Most of the time, it is hoped, there will be no waiters; maintaining this bit allows for a bit of unnecessary work to be skipped.

As mentioned above, optimistic spinning, while good for performance, is unfair; in the worst case, an unlucky thread contending for a highly contended mutex could be starved for a long time. In an attempt to prevent that problem, the second owner bit, MUTEX_FLAG_HANDOFF, can be used to change how a contended mutex changes ownership.

If a thread tries and fails to obtain a mutex after having already slept waiting for it to become available, it can set MUTEX_FLAG_HANDOFF prior to returning to sleep. Later on, when the mutex is freed, the freeing thread will notice the flag and behave differently. In particular, it must avoid clearing the owner field as it normally would, lest some other thread, spinning on the mutex, steal it away. Instead, it finds the first thread in the wait queue for the mutex and transfers ownership directly, waking that thread once the job is done. This dance restores some fairness, at the cost of making everybody wait for the sleeping thread to wake up and get its work done.
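The release side can be sketched in the same user-space style. This is a simplified model under several assumptions, not the code from the patch: the wait queue is reduced to a first_waiter argument, waking is left to the caller, and the names and flag values are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG_WAITERS	((uintptr_t)0x01)	/* assumed flag values */
#define FLAG_HANDOFF	((uintptr_t)0x02)
#define FLAG_MASK	((uintptr_t)0x03)

struct task {
	_Alignas(8) int id;	/* stand-in for struct task_struct */
};

struct sketch_mutex {
	atomic_uintptr_t owner;
};

enum unlock_action { UNLOCK_DONE, UNLOCK_WAKE_WAITER, UNLOCK_HANDED_OFF };

/*
 * Release the mutex. 'first_waiter' models the head of the wait queue
 * (NULL if empty). The return value tells the caller what to do next.
 */
static enum unlock_action sketch_unlock(struct sketch_mutex *m,
					struct task *first_waiter)
{
	uintptr_t owner = atomic_load_explicit(&m->owner, memory_order_relaxed);

	for (;;) {
		uintptr_t flags = owner & FLAG_MASK;
		int handoff = (flags & FLAG_HANDOFF) && first_waiter;
		uintptr_t next;

		if (handoff)	/* never expose a free mutex to spinners */
			next = (uintptr_t)first_waiter | (flags & ~FLAG_HANDOFF);
		else		/* clear the owner but keep the flags */
			next = flags;

		if (atomic_compare_exchange_weak_explicit(&m->owner, &owner, next,
				memory_order_release, memory_order_relaxed)) {
			if (handoff)
				return UNLOCK_HANDED_OFF;
			return (flags & FLAG_WAITERS) ? UNLOCK_WAKE_WAITER
						      : UNLOCK_DONE;
		}
	}
}
```

In the handoff case the owner word goes directly from one task pointer to another, so a spinning thread never observes a zero owner that it could claim.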

The new code simplifies the mutex implementation considerably by getting rid of a number of strange cases involving the separate count and owner fields. But it gets a bit better than that, since the new code is also architecture-independent; all of the old, architecture-specific mutex code can go away. So the bottom line of Peter's cover letter reads:

    49 files changed, 382 insertions(+), 1407 deletions(-)

Removing code, as it happens, is always in fashion, and removing 1000 lines of tricky assembly-language locking code is especially chic. Assuming that this code manages to avoid introducing performance regressions, it could be a must-have item at a near-future merge-window ball.

Patches and updates

Kernel trees

Linus Torvalds Linux 4.8-rc5 Sep 04
Greg KH Linux 4.7.3 Sep 07
Sebastian Andrzej Siewior 4.6.7-rt12 Sep 08
Greg KH Linux 4.4.20 Sep 07
Greg KH Linux 3.14.78 Sep 07
Jiri Slaby Linux 3.12.63 Sep 06

Architecture-specific

Core kernel code

Development tools

Shuah Khan kobject tracepoints Sep 06

Device drivers

Stanimir Varbanov Qualcomm video decoder/encoder driver Sep 07
Quentin Schulz add support for Allwinner SoCs ADC Sep 01
YT Shen MT2701 DRM support Sep 02
Minghsiu Tsai Add MT8173 MDP Driver Sep 08
Chunfeng Yun Add MediaTek USB3 DRD Driver Sep 05
vadimp@mellanox.com leds: add driver for Mellanox systems leds Sep 07
Raghu Vatsavayi liquidio CN23XX support Sep 01
Martin Blumenstingl meson: Meson8b and GXBB DWMAC glue driver Sep 04
David Herrmann drm: add simpledrm driver Sep 02
William Breathitt Gray Add IIO support for counter devices Sep 07
Todor Tomov OV5645 camera sensor driver Sep 08
Quentin Schulz add support for Allwinner SoCs ADC Sep 08

Device driver infrastructure

Heikki Krogerus USB Type-C Connector class Sep 01
Bjorn Andersson Make rpmsg a framework Sep 01
Srinivas Pandruvada User space governor enhancements Aug 26

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Miscellaneous

Theodore Ts'o Release of e2fsprogs 1.43.2 Sep 01
Karel Zak util-linux stable v2.28.2 Sep 07

Page editor: Jonathan Corbet


Copyright © 2016, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds