Kernel development
Brief items
Kernel release status
The current development kernel is 4.3-rc6, released on October 18. "Things continue to be calm, and in fact have gotten progressively calmer. All of which makes me really happy, although my suspicious nature looks for things to blame. Are people just on their best behavior because the Kernel Summit is imminent, and everybody is putting their best foot forward?"
Stable updates: none have been released since October 3. The large 3.10.91, 3.14.55, 4.1.11, and 4.2.4 updates are in the review process as of this writing; they can be expected at any time.
Quotes of the week
Linux-next takes a break
Linux-next maintainer Stephen Rothwell has announced that there will be no new linux-next releases until November 2. As a result, code added to subsystem maintainer trees after October 22 will not show up in linux-next before the (probable) opening of the 4.4 merge window. There are, he said, about 8500 commits in linux-next now, so he expects there is a fair amount of 4.4 work that hasn't shown up there yet.
Kernel development news
The return of simple wait queues
A "wait queue" in the kernel is a data structure that allows one or more processes to wait (sleep) until something of interest happens. They are used throughout the kernel to wait for available memory, I/O completion, message arrival, and many other things. In the early days of Linux, a wait queue was a simple list of waiting processes, but various scalability problems (including the thundering herd problem highlighted by the infamous Mindcraft report in 1999) have led to the addition of a fair amount of complexity since then. The simple wait queue patch set is an attempt to push the pendulum back in the other direction.

Simple wait queues are not new; we looked at them in 2013. The API has not really changed since then, so that discussion will not be repeated here. For those who don't want to go back to the previous article, the executive summary is that simple wait queues provide an interface quite similar to that of regular wait queues, but with a lot of the bells and whistles removed. Said bells (or perhaps they are whistles) include exclusive wakeups (an important feature motivated by the aforementioned Mindcraft report), "killable" waits, high-resolution timeouts, and more.
There is value in simplicity, of course, and the memory saved by switching to a simple wait queue is welcome, even if it's small. But that, alone, would not be justification for the addition of another wait-queue mechanism to the kernel. Adding another low-level scheduling primitive like this increases the complexity of the kernel as a whole and makes ongoing maintenance of the scheduler harder. It is unlikely to happen without a strong and convincing argument in its favor.
In this case, the drive for simple wait queues is (as is the code itself) coming from the realtime project. The realtime developers seek determinism at all times, and, as it turns out, current mainline wait queues get in the way.
The most problematic aspect of ordinary wait queues appears to be the ability to add custom wakeup callbacks. By default, if one of the various wake_up() functions is called to wake processes sleeping on a wait queue, the kernel will call default_wake_function(), which simply wakes these waiting processes. But there is a mechanism provided to allow specialized users to change the wake-up behavior of wait queues:
    typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode,
                                     int flags, void *key);

    void init_waitqueue_func_entry(wait_queue_t *q, wait_queue_func_t func);
This feature is only used in a handful of places in the kernel, but they are important uses. The I/O multiplexing system calls (poll(), select(), and epoll_wait()) use it to turn specific device events into poll events for waiting processes. The userfaultfd() code (added for the 4.3 release) has a wake function that only does a wakeup for events in the address range of interest. The exit() code similarly uses a custom wake function to only wake processes that have an interest in the exiting process. And so on. It is a feature that cannot be removed.
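The callback arrangement can be modeled in user space. The sketch below uses hypothetical names (wait_entry, wake_up_key(), and so on) rather than the kernel's real types, but it follows the same shape: every queue entry carries a function pointer, and the wakeup path calls that function rather than waking the task directly.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space model of a wait-queue entry with a custom
 * wake function; the names are illustrative, not the kernel's own. */
struct wait_entry {
    int woken;                      /* set when this waiter is woken */
    int key_of_interest;            /* what this waiter cares about  */
    int (*func)(struct wait_entry *we, void *key);
    struct wait_entry *next;
};

/* Default behavior: wake unconditionally (cf. default_wake_function()). */
static int default_wake(struct wait_entry *we, void *key)
{
    (void)key;
    we->woken = 1;
    return 1;
}

/* Custom behavior: wake only on a matching event, the way the
 * userfaultfd() code wakes only waiters in the affected range. */
static int filtered_wake(struct wait_entry *we, void *key)
{
    if (*(int *)key != we->key_of_interest)
        return 0;
    we->woken = 1;
    return 1;
}

/* Walk the queue, invoking each entry's wake function with the key;
 * returns the number of waiters actually woken. */
static int wake_up_key(struct wait_entry *head, void *key)
{
    int nr = 0;
    for (struct wait_entry *we = head; we; we = we->next)
        nr += we->func(we, key);
    return nr;
}
```

The point the realtime developers object to is visible here: wake_up_key() has no idea how long each func() call will run.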
The problem with this feature, from the realtime developers' point of view, is that they have no control over how long the custom wake function will take to run. This feature thus makes it harder for them to provide response-time guarantees. Beyond that, these callbacks require that the wait-queue structure be protected by an ordinary spinlock, which is a sleeping lock in the realtime tree. That, too, gets in the way in the realtime world; it prevents, for example, the use of wake_up() in hard (as opposed to threaded) interrupt handlers.
Simple wait queues dispense with custom callbacks and many other wait-queue features, allowing the entire data structure to be reduced to:
    struct swait_queue_head {
        raw_spinlock_t      lock;
        struct list_head    task_list;
    };

    struct swait_queue {
        struct task_struct  *task;
        struct list_head    task_list;
    };
The swait_queue_head structure represents the wait queue as a whole, while struct swait_queue represents a process waiting in the queue. Waiting is just a matter of adding a new swait_queue entry to the list, and wakeups are a simple traversal of that list. Regular wait queues, instead, may have to search the list for specific processes to wake. The lack of custom wakeup callbacks means that the time required to wake any individual process on the list is known (and short), so a raw spinlock can be used to protect the whole thing.
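A user-space sketch (hypothetical names; the real code uses raw spinlocks and the scheduler's wakeup machinery) shows just how little is left: waiting is a list insertion, and a wakeup is a pop with a fixed, known cost.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of a simple wait queue: a bare list of tasks.
 * The names mirror the kernel's swait structures, but this is only
 * a sketch; locking and the real scheduler hooks are omitted. */
struct task { int state; };              /* 1 = running, 0 = sleeping */

struct swaiter {
    struct task *task;
    struct swaiter *next;
};

struct swait_head { struct swaiter *list; };

/* Enqueue: the would-be sleeper links itself onto the queue and
 * marks itself as sleeping. */
static void swait_prepare(struct swait_head *q, struct swaiter *w,
                          struct task *t)
{
    t->state = 0;
    w->task = t;
    w->next = q->list;
    q->list = w;
}

/* Wake one waiter: pop an entry and mark its task runnable.
 * No callbacks, so the cost per wakeup is known and short. */
static int swake_up_one(struct swait_head *q)
{
    struct swaiter *w = q->list;
    if (!w)
        return 0;
    q->list = w->next;
    w->task->state = 1;
    return 1;
}
```

Bounded per-wakeup cost is exactly what lets the kernel version protect the list with a raw spinlock.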
This patch set has been posted by Daniel Wagner, who has taken on the challenge of getting it into the mainline, but the core wait-queue work was done by Peter Zijlstra. It has seen a few revisions in the last few months, but comments appear to be slowing down. One never knows with such things (the patches looked mostly ready in 2013 as well), but it seems like there is not much keeping this work from going into the 4.4 kernel.
Other approaches to random number scalability
Back in late September, we looked at a patch to improve the scalability of random number generation on Linux systems—large NUMA systems, in particular. While the proposed change solved the immediate scalability problem, there were some downsides to that approach, in terms of both complexity and security. Some more recent discussion has come up with other possibilities for solving the problem.
The original idea came from Andi Kleen; it changed the kernel's non-blocking random number pool into a set of pools, one per NUMA node. That would prevent a spinlock on a single pool from becoming a bottleneck. But it also made the kernel's random number subsystem more complex. In addition, it spread the available entropy over all of the pools, effectively dividing the amount available to users on any given node by the number of pools.
But, as George Spelvin noted, the entropy in a pool is "not located in any particular bit position", but is distributed throughout the pool—entropy is a "holographic property of the pool", as he put it. That means that multiple readers do not need to be serialized by a spinlock as long as each gets a unique salt value that ensures that the random numbers produced are different. Spelvin suggested using the CPU ID for the salt; each reader hashes the salt in with the pool to provide a unique random number even if the pool is in the same state for each read.
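The idea can be sketched in user space with a toy mixing function; toy_hash() below is an illustrative FNV-style stand-in, not the SHA-1-based extraction the kernel actually performs.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for hashing the pool contents together with a salt.
 * The kernel uses SHA-1 over the pool; this FNV-style mix is only
 * an illustration of the salting idea. */
static uint32_t toy_hash(const uint32_t *pool, int words, uint32_t salt)
{
    uint32_t h = 2166136261u ^ salt;    /* salt perturbs the start state */
    for (int i = 0; i < words; i++) {
        h ^= pool[i];
        h *= 16777619u;
    }
    return h;
}

/* Each reader salts with its CPU ID, so concurrent readers of the
 * *same* pool state still obtain different outputs, with no lock. */
static uint32_t read_random(const uint32_t *pool, int words, int cpu_id)
{
    return toy_hash(pool, words, (uint32_t)cpu_id);
}
```

Since the xor and odd-multiply steps are each invertible, two different salts can never collapse to the same output for the same pool state, which is the property Spelvin is relying on.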
Spelvin provided a patch using that approach along with his comments. Random number subsystem maintainer Ted Ts'o agreed with Spelvin about how the entropy is distributed, but had some different ideas on how to handle mixing the random numbers generated back into the pool. He also provided a patch and asked Kleen to benchmark his approach. "I really hope it will be good enough, since besides using less memory, there are security advantages in not spreading the entropy across N pools."
Either approach would eliminate the lock contention (and cache-line bouncing of the lock), but there still may be performance penalties for sharing the pool among multiple cores due to cache coherency. The non-blocking pool changes frequently, either as data gets mixed in from the input pool (which is shared with the blocking pool) or as data that is read from the pool gets mixed back in to make it harder to predict its state. The cache lines of the pool will be bounced around between the cores, which may well be less than desirable.
As it turned out, when Kleen ran his micro-benchmark, both patch sets performed poorly in comparison to the multi-pool approach. In fact, for reasons unknown, Spelvin's was worse than the existing implementation.
Meanwhile, while the benchmarking was taking place, Ts'o pointed out that it may just make sense to recognize when a process is "abusing" getrandom() or /dev/urandom and to switch it to using its own cryptographic-strength random number generator (CSRNG or CRNG) seeded from the non-blocking pool. That way, uncommon—or, more likely, extremely rare—workloads won't force changes to the core of the Linux random number generator. Ts'o is hoping to not add any more complexity into the random subsystem:
The CRNG would be initialized from the non-blocking pool, and is reseeded after, say, 2**24 cranks or five minutes. It's essentially an OpenBSD-style arc4random in the kernel.
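That scheme can be sketched as a generator that counts its "cranks" and replaces its state from the pool when the threshold is hit. The xorshift generator below is a toy stand-in for a real stream cipher, the fresh_seed parameter stands in for a read of the non-blocking pool, and the five-minute timer is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define RESEED_INTERVAL (1u << 24)   /* the "2**24 cranks" from the quote */

/* Toy per-process CRNG: xorshift32 state plus a crank counter.
 * A real implementation would use a cryptographic stream cipher
 * and also reseed on a timer. */
struct crng {
    uint32_t state;
    uint32_t cranks;
};

static uint32_t crng_next(struct crng *c, uint32_t fresh_seed)
{
    if (c->cranks >= RESEED_INTERVAL) {
        c->state = fresh_seed;       /* pull new state from the pool */
        c->cranks = 0;
    }
    c->cranks++;

    uint32_t x = c->state;           /* xorshift32 step (toy only) */
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    c->state = x;
    return x;
}
```

The pool is touched only once per RESEED_INTERVAL outputs, which is what removes the contention for "abusers".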
Spelvin was concerned that the CSRNG solution would make long-running servers susceptible to backtracking: using the current state of the generator to determine random numbers that have been produced earlier. If backtracking protection can be discarded, there can be even simpler solutions, he said, including: "just have *one* key for the kernel, reseeded more often, and a per-thread nonce and stream position." But Ts'o said that anti-backtracking was not being completely abandoned, just relaxed: "We are discarding backtracking protection between successive reads from a single process, and even there we would be reseeding every five minutes (and this could be tuned), so there is *some* anti-backtracking protection."
Furthermore, he suggested that perhaps real abusers could get their own CSRNG output, while non-abusers would still get output from the non-blocking pool.
Spelvin had suggested adding another random "device" (perhaps /dev/frandom) to provide the output of a CSRNG directly to user space, because he was concerned about changing the semantics of /dev/urandom and getrandom() by introducing the possibility of backtracking. But he agreed that changing the behavior for frequent heavy readers/callers would not change the semantics, since the random(4) man page explicitly warns against that kind of usage.
Spelvin posted another patch set that pursues his ideas on improving the scalability of generating random numbers. It focuses on reducing the lock contention when the output of the pool is mixed back into the pool to thwart backtracking (known as a mixback operation). If there are multiple concurrent readers for the non-blocking pool, Spelvin's patch set ensures that one of them causes a mixback operation; others that come along while a mixback lock is held simply write their data into a global mixback buffer, which then gets incorporated into the mixback operation that is done by the lock holder when releasing the lock.
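The pattern can be sketched with POSIX threads. Everything here is a deliberate simplification of what the patch actually does: the "pool" is a single word, xor stands in for the real mixing function, and the names are invented for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the mixback arrangement: readers that cannot take the
 * lock deposit their contribution in a shared buffer, which the
 * current lock holder folds into the pool before unlocking. */
static pthread_mutex_t mixback_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint mixback_buf;   /* deposits from readers that lost the race */
static uint32_t pool_word;        /* toy one-word "pool"; xor stands in
                                   * for the real mixing function */

static void mixback(uint32_t data)
{
    if (pthread_mutex_trylock(&mixback_lock) == 0) {
        /* Lock holder: mix in our data, plus anything other
         * readers deposited while the lock was busy. */
        pool_word ^= data;
        pool_word ^= atomic_exchange(&mixback_buf, 0);
        pthread_mutex_unlock(&mixback_lock);
    } else {
        /* Contended: leave the data for the holder rather than
         * spinning on the lock. */
        atomic_fetch_xor(&mixback_buf, data);
    }
}
```

No reader ever waits for the mixback lock; contention degrades into a single atomic xor, which is the scalability win being claimed.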
There has been no comment on those patches so far, but one gets the sense that Ts'o (or someone) will try to route around the whole scalability problem with a separate CSRNG for abusers. That would leave the current approach intact, while still providing a scalable solution for those who are, effectively, inappropriately using the non-blocking pool. Ts'o seemed strongly in favor of that approach, so it seems likely to prevail.

Kleen has asked that his multi-pool approach be merged, since "it works and is actually scalable and does not require any new 'cryptographic research' or other risks". But it is not clear that the complexity and (slightly) reduced security of that approach will pass muster.
Rich access control lists
Linux has had support for POSIX access control lists (ACLs) since the 2.5.46 development kernel was released in 2002 — despite the fact that POSIX has never formally adopted the ACL specification. Over time, POSIX ACLs have been superseded by other ACL mechanisms, notably the ACL scheme adopted with the NFSv4 protocol. Linux support for NFSv4 ACLs is minimal, though; there is no support for them at all in the virtual filesystem layer. The Linux NFS server supports NFSv4 ACLs by mapping them, as closely as possible, to POSIX ACLs. Chances are, that will end with the merging of Andreas Gruenbacher's RichACLs patch set for the 4.4 kernel.

The mode ("permissions") bits attached to every file and directory on a Linux system describe how that object may be accessed by its owner, by members of the file's group, and by the world as a whole. Each class of user has three bits regulating write, read, and execute access. For many uses, that is all the control a system administrator needs, but there are times where finer-grained access control is useful. That is where ACLs come in; they allow the specification of access-control policies that don't fit into the nine mode bits. There are different types of ACLs, reflecting their evolution over time.
POSIX ACLs
POSIX ACLs are clearly designed to fit in with traditional Unix-style permissions. They start by implementing the mode bits as a set of implicit ACL entries (ACEs), so a file with permissions like:
    $ ls -l foo
    -rw-rw-r-- 1 linus penguins 804 Oct 18 09:40 foo
It has a set of implicit ACEs that looks like:
    $ getfacl foo
    user::rw-
    group::rw-
    other::r--
The user and group ACEs that contain empty name fields ("::") apply to the owner and group of the file itself. The administrator can add other user or group ACEs to give additional permissions to named users and groups. The actual access control is implemented in a way similar to how the mode bits are handled. If one of the user entries matches, the associated permissions are applied. Otherwise, if one of the group entries matches, that entry is used; failing that, the other permissions are applied.
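That evaluation order can be sketched as follows. The types and the bit values are invented for illustration, and the sketch takes the first matching group entry for brevity, where a real implementation checks every matching group entry for one that grants the requested access.

```c
#include <assert.h>

/* Toy model of POSIX ACL evaluation order: owner first, then named
 * users, then group entries, then "other".  Perms use the familiar
 * bits: 4 = read, 2 = write, 1 = execute. */
enum tag { OWNER, NAMED_USER, OWNING_GROUP, NAMED_GROUP, OTHER };

struct ace  { enum tag tag; int id; int perms; };
struct cred { int uid; int gid; };

static int acl_perms(const struct ace *acl, int n, const struct cred *c,
                     int owner_uid, int owner_gid)
{
    int group_perms = -1, other_perms = 0;

    for (int i = 0; i < n; i++) {
        const struct ace *a = &acl[i];
        switch (a->tag) {
        case OWNER:                      /* user entries win outright */
            if (c->uid == owner_uid)
                return a->perms;
            break;
        case NAMED_USER:
            if (c->uid == a->id)
                return a->perms;
            break;
        case OWNING_GROUP:               /* simplification: first match */
            if (c->gid == owner_gid && group_perms < 0)
                group_perms = a->perms;
            break;
        case NAMED_GROUP:
            if (c->gid == a->id && group_perms < 0)
                group_perms = a->perms;
            break;
        case OTHER:                      /* used only if nothing matched */
            other_perms = a->perms;
            break;
        }
    }
    return group_perms >= 0 ? group_perms : other_perms;
}
```

Note that a matching group entry is used even if it grants less than the "other" entry would; the classes are strictly ordered.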
There is one little twist: the traditional mode bits still apply as well. When ACLs are in use, the mode bits define the maximum permissions that may be allowed. In other words, ACLs cannot grant permissions that would not be allowed by the mode bits. The reason for this behavior is to avoid unpleasant surprises for applications (and users) that do not understand ACLs. So a file with mode 0640 (rw-r-----) would not allow group-write access, even if it had an ACE like:
group::rw-
If a particular process matches a named ACE (by either user or group name), that process is in the group class and is regulated by the group mode bits on the file. The owning group itself can be given fewer permissions than the mode bits would otherwise allow. See this article for a detailed description of how it all works.
NFSv4 ACLs
When the NFS community took on the task of defining an ACL mechanism for the NFS protocol, they chose not to start with POSIX ACLs; instead, they started with something that looks a lot more like Windows ACLs. The result is a rather more expressive and flexible ACL mechanism. With one obscure exception, all POSIX ACLs can be mapped onto NFSv4 ACLs, but the reverse is not true.
NFSv4 ACLs do away with the hardwired evaluation order used by POSIX ACLs. Instead, ACEs are evaluated in the order they are defined. Thus, for example, a group ACE can override an owner ACE if the group ACE appears first in the list. NFSv4 ACEs can explicitly deny access to a class of users. Permissions bits are also additive in NFSv4 ACLs. As an example of this, consider a file with these ACLs:
    group1:READ_DATA:ALLOW
    group2:WRITE_DATA:ALLOW
These ACEs allow read access to members of group1 and write access to members of group2. If a process that is a member of both groups attempts to open this file for read-write access, the operation will succeed. When POSIX ACLs are in use, instead, the requested permissions must all be allowed by a single ACE.
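Ordered, additive evaluation can be sketched like this; the encoding (integer group IDs, -1 for EVERYONE@, two permission bits) is invented for illustration and covers only a sliver of what the protocol defines.

```c
#include <assert.h>
#include <stdbool.h>

#define READ_DATA  0x1
#define WRITE_DATA 0x2

enum ace_type { ALLOW, DENY };

struct nfs4_ace {
    enum ace_type type;
    int who;                 /* toy: a group ID; -1 means EVERYONE@ */
    int mask;
};

/* ACEs are walked in order; each matching ALLOW contributes bits,
 * and a matching DENY on a still-undecided bit blocks it for good.
 * Access is granted only if every requested bit was allowed. */
static bool access_allowed(const struct nfs4_ace *acl, int n,
                           const int *groups, int ngroups, int wanted)
{
    int allowed = 0, denied = 0;

    for (int i = 0; i < n && (allowed | denied) != wanted; i++) {
        bool match = (acl[i].who == -1);
        for (int g = 0; g < ngroups && !match; g++)
            match = (groups[g] == acl[i].who);
        if (!match)
            continue;

        int bits = acl[i].mask & wanted & ~(allowed | denied);
        if (acl[i].type == ALLOW)
            allowed |= bits;
        else
            denied |= bits;
    }
    return (allowed & wanted) == wanted;
}
```

The member-of-both-groups example from the text falls out directly: the two ALLOW entries each contribute one of the requested bits.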
NFSv4 ACLs have a lot more permissions that can be granted and denied. Along with "read data," "write data," and "execute," there are independent permissions bits allowing append-only access, deleting the file (regardless of permissions in the containing directory), deleting any file contained within a directory, reading a file's metadata, writing the metadata (changing the timestamps, essentially), taking ownership of the file, and reading and writing a file's ACLs.
There is a set of bits controlling how ACLs are inherited from their containing directories. ACEs on directories can be marked as being inheritable by files within those directories; there is also a bit to mark an ACE that should only propagate a single level down the hierarchy. When a file is created within the directory, it will be given the ACLs that are marked as being inheritable in its parent directory. This behavior conflicts with POSIX, which requires that any "supplemental security mechanisms" be disabled for new files.
ACLs can have an "automatic inheritance" flag set. When an ACL change is made to a directory, that change will be propagated to any files or directories underneath that have automatic inheritance enabled — unless the "protected" flag is also set. Setting the "protected" flag happens whenever the ACL or mode of the file has been set explicitly; that keeps inheritance from overriding permissions that have been intentionally set to something else. The interesting twist here is that there is no way in Linux for user space to create a file without explicitly setting its mode, so the "protected" bit will always be set on new files and automatic inheritance simply won't work. NFS does have a way to create files without specifying the permissions to use, though, so automatic inheritance will work in that case.
NFSv4 ACLs also differ in how permissions are applied to the world as a whole. The "other" class is called "EVERYONE@", and it means truly everyone. In normal POSIX semantics, if a process is in the "user" or "group" class, the "other" permissions will not even be consulted; that allows, for example, a specific group to be blocked from a file that is otherwise world accessible. If a file is made available to everyone in an NFSv4 ACL, though, it is truly available to everyone unless a specific "deny" ACE earlier in the list says otherwise.
RichACLs
The RichACLs work tries to square NFSv4 ACLs with normal POSIX expectations. To do so, it applies the mode bits in the same way that POSIX ACLs do — the mode specifies the maximum access that is allowed. Since there are far more access types in NFSv4 ACLs than there are mode bits, a certain amount of mapping must be done. So, for example, if the mode denies write access, that will be translated to a denial of related capabilities like "create file," "append data," "delete child," and more.
The actual relationship between the ACEs and the mode is handled via a set of three masks, corresponding to owner, group, and other access. If a file's mode is set to deny group-write access, for example, the corresponding bits will be cleared from the group mask in the ACL. Thereafter, no ACE will be able to grant write access to a group member. The original ACEs are preserved when the mode is changed, though; that means that any additional access rights will be returned if the mode is made more permissive again. The masks can be manipulated directly, giving more fine-grained control over the maximum access allowed to each class; tweaking the masks can cause the file's mode to be adjusted to match.
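The mask mechanism reduces to a bitwise AND. The sketch below invents a three-bit toy layout; RichACLs really track a much larger set of permission bits per mask, and tie mask changes back to the visible mode.

```c
#include <assert.h>

#define R 0x4
#define W 0x2
#define X 0x1

/* Toy RichACL masks: effective access for a class is whatever the
 * ACEs grant, ANDed with that class's mask.  A chmod() rewrites the
 * masks; the ACEs themselves are preserved untouched. */
struct richacl {
    int owner_mask, group_mask, other_mask;
    int ace_grants;        /* what the ACEs would allow the group class */
};

/* Derive the three masks from a traditional nine-bit mode. */
static void apply_mode(struct richacl *a, int mode)
{
    a->owner_mask = (mode >> 6) & 7;
    a->group_mask = (mode >> 3) & 7;
    a->other_mask = mode & 7;
}

static int effective_group_access(const struct richacl *a)
{
    return a->ace_grants & a->group_mask;
}
```

Because ace_grants is never modified, loosening the mode later restores whatever the ACEs originally granted, which is the behavior described above.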
There are some interesting complications in the relationship between the ACEs, the masks, and the actual file mode. Consider an example (from this document) where a file has this ACL:
    OWNER@:READ_DATA::ALLOW
    EVERYONE@:READ_DATA/WRITE_DATA::ALLOW
This ACL gives both read and write access to the owner. If, however, the file's mode is set to 0640, the mask for EVERYONE@ will be cleared, denying owner-write access even though there is nothing in the permissions that requires that. Fixing this issue requires a special pass through the ACL to grant the EVERYONE@ flags to other classes where the mode allows it.
A similar problem comes up when an EVERYONE@ ACE grants access that is denied by the owner or group mode bits. Handling this case requires inserting explicit DENY ACEs for OWNER@ (or GROUP@) ahead of the EVERYONE@ ACE.
The RichACLs patch set implements all of this and more. See this page and the richacl.7 man page for more details. As of this writing, though, there is still an open question: how to handle RichACL support in Linux filesystems.
At the implementation level that question is easily answered; RichACLs are stored as extended attributes, just like POSIX ACLs or SELinux labels. The problem is one of backward compatibility: what happens when a filesystem containing RichACLs is mounted by a kernel that does not implement them? Older kernels will not corrupt the filesystem (or the ACLs) in this case, but neither will they honor the ACLs. That can result in access being granted that would have been denied by an ACL; it also means that ACL inheritance will not be applied to new files.
To prevent such problems, Andreas requested that a feature flag be added to the ext4 filesystem; that flag would prevent the filesystem from being mounted by kernels that do not implement RichACLs. There was some discussion about whether this made sense; ext4 maintainer Ted Ts'o felt that the feature flags were there to mark metadata changes that the e2fsck utility needed to know about to avoid corrupting the filesystem. RichACLs do not apply, since filesystems don't pay attention to the contents of extended attributes.
Over the course of the conversation, though, a consensus seemed to form around the idea that the use of RichACLs is a fundamental filesystem feature. So it appears that once they are enabled for an ext4 filesystem (either at creation time, or via tune2fs), that filesystem will be marked as being incompatible with kernels that don't implement RichACLs. Something similar will likely be done for XFS.
If things go as planned, this work will be mainlined during the 4.4 merge window. At that point, NFS servers should be able to implement the full semantics of NFSv4 ACLs; the feature should also be of use to people running Samba servers. This patch set, the culmination of several years' work, should provide a useful capability to server administrators who need fully supported access control lists on Linux.
Patches and updates
Page editor: Jonathan Corbet