Kernel development
Brief items
Kernel release status
The current development kernel is 4.3-rc6, released on October 18. "Things continue to be calm, and in fact have gotten progressively calmer. All of which makes me really happy, although my suspicious nature looks for things to blame. Are people just on their best behavior because the Kernel Summit is imminent, and everybody is putting their best foot forward?"
Stable updates: none have been released since October 3. The large 3.10.91, 3.14.55, 4.1.11, and 4.2.4 updates are in the review process as of this writing; they can be expected at any time.
Quotes of the week
Linux-next takes a break
Linux-next maintainer Stephen Rothwell has announced that there will be no new linux-next releases until November 2. As a result, code added to subsystem maintainer trees after October 22 will not show up in linux-next before the (probable) opening of the 4.4 merge window. There are, he said, about 8500 commits in linux-next now, so he expects there is a fair amount of 4.4 work that hasn't shown up there yet.
Kernel development news
The return of simple wait queues
A "wait queue" in the kernel is a data structure that allows one or more processes to wait (sleep) until something of interest happens. They are used throughout the kernel to wait for available memory, I/O completion, message arrival, and many other things. In the early days of Linux, a wait queue was a simple list of waiting processes, but various scalability problems (including the thundering herd problem highlighted by the infamous Mindcraft report in 1999) have led to the addition of a fair amount of complexity since then. The simple wait queue patch set is an attempt to push the pendulum back in the other direction.

Simple wait queues are not new; we looked at them in 2013. The API has not really changed since then, so that discussion will not be repeated here. For those who don't want to go back to the previous article, the executive summary is that simple wait queues provide an interface quite similar to that of regular wait queues, but with a lot of the bells and whistles removed. Said bells (or perhaps they are whistles) include exclusive wakeups (an important feature motivated by the aforementioned Mindcraft report), "killable" waits, high-resolution timeouts, and more.
There is value in simplicity, of course, and the memory saved by switching to a simple wait queue is welcome, even if it's small. But that, alone, would not be justification for the addition of another wait-queue mechanism to the kernel. Adding another low-level scheduling primitive like this increases the complexity of the kernel as a whole and makes ongoing maintenance of the scheduler harder. It is unlikely to happen without a strong and convincing argument in its favor.
In this case, the drive for simple wait queues is (as is the code itself) coming from the realtime project. The realtime developers seek determinism at all times, and, as it turns out, current mainline wait queues get in the way.
The most problematic aspect of ordinary wait queues appears to be the ability to add custom wakeup callbacks. By default, if one of the various wake_up() functions is called to wake processes sleeping on a wait queue, the kernel will call default_wake_function(), which simply wakes these waiting processes. But there is a mechanism provided to allow specialized users to change the wake-up behavior of wait queues:
    typedef int (*wait_queue_func_t)(wait_queue_t *wait, unsigned mode,
                                     int flags, void *key);

    void init_waitqueue_func_entry(wait_queue_t *q, wait_queue_func_t func);
This feature is only used in a handful of places in the kernel, but they are important uses. The I/O multiplexing system calls (poll(), select(), and epoll_wait()) use it to turn specific device events into poll events for waiting processes. The userfaultfd() code (added for the 4.3 release) has a wake function that only does a wakeup for events in the address range of interest. The exit() code similarly uses a custom wake function to only wake processes that have an interest in the exiting process. And so on. It is a feature that cannot be removed.
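The callback arrangement can be modeled in user space. The sketch below uses hypothetical names (wait_entry, wake_up_key(), and so on) rather than the kernel's real types, but it follows the same shape: every queue entry carries a function pointer, and the wakeup path calls that function rather than waking the task directly.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space model of a wait-queue entry with a custom
 * wake function; the names are illustrative, not the kernel's own. */
struct wait_entry {
    int woken;                      /* set when this waiter is woken */
    int key_of_interest;            /* what this waiter cares about  */
    int (*func)(struct wait_entry *we, void *key);
    struct wait_entry *next;
};

/* Default behavior: wake unconditionally (cf. default_wake_function()). */
static int default_wake(struct wait_entry *we, void *key)
{
    (void)key;
    we->woken = 1;
    return 1;
}

/* Custom behavior: wake only on a matching event, the way the
 * userfaultfd() code wakes only waiters in the affected range. */
static int filtered_wake(struct wait_entry *we, void *key)
{
    if (*(int *)key != we->key_of_interest)
        return 0;
    we->woken = 1;
    return 1;
}

/* Walk the queue, invoking each entry's wake function with the key;
 * returns the number of waiters actually woken. */
static int wake_up_key(struct wait_entry *head, void *key)
{
    int nr = 0;
    for (struct wait_entry *we = head; we; we = we->next)
        nr += we->func(we, key);
    return nr;
}
```

The point the realtime developers object to is visible here: wake_up_key() has no idea how long each func() call will run.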
The problem with this feature, from the realtime developers' point of view, is that they have no control over how long the custom wake function will take to run. This feature thus makes it harder for them to provide response-time guarantees. Beyond that, these callbacks require that the wait-queue structure be protected by an ordinary spinlock, which is a sleeping lock in the realtime tree. That, too, gets in the way in the realtime world; it prevents, for example, the use of wake_up() in hard (as opposed to threaded) interrupt handlers.
Simple wait queues dispense with custom callbacks and many other wait-queue features, allowing the entire data structure to be reduced to:
    struct swait_queue_head {
        raw_spinlock_t      lock;
        struct list_head    task_list;
    };

    struct swait_queue {
        struct task_struct  *task;
        struct list_head    task_list;
    };
The swait_queue_head structure represents the wait queue as a whole, while struct swait_queue represents a process waiting in the queue. Waiting is just a matter of adding a new swait_queue entry to the list, and wakeups are a simple traversal of that list. Regular wait queues, instead, may have to search the list for specific processes to wake. The lack of custom wakeup callbacks means that the time required to wake any individual process on the list is known (and short), so a raw spinlock can be used to protect the whole thing.
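A user-space sketch (hypothetical names; the real code uses raw spinlocks and the scheduler's wakeup machinery) shows just how little is left: waiting is a list insertion, and a wakeup is a pop with a fixed, known cost.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of a simple wait queue: a bare list of tasks.
 * The names mirror the kernel's swait structures, but this is only
 * a sketch; locking and the real scheduler hooks are omitted. */
struct task { int state; };              /* 1 = running, 0 = sleeping */

struct swaiter {
    struct task *task;
    struct swaiter *next;
};

struct swait_head { struct swaiter *list; };

/* Enqueue: the would-be sleeper links itself onto the queue and
 * marks itself as sleeping. */
static void swait_prepare(struct swait_head *q, struct swaiter *w,
                          struct task *t)
{
    t->state = 0;
    w->task = t;
    w->next = q->list;
    q->list = w;
}

/* Wake one waiter: pop an entry and mark its task runnable.
 * No callbacks, so the cost per wakeup is known and short. */
static int swake_up_one(struct swait_head *q)
{
    struct swaiter *w = q->list;
    if (!w)
        return 0;
    q->list = w->next;
    w->task->state = 1;
    return 1;
}
```

Bounded per-wakeup cost is exactly what lets the kernel version protect the list with a raw spinlock.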
This patch set has been posted by Daniel Wagner, who has taken on the challenge of getting it into the mainline, but the core wait-queue work was done by Peter Zijlstra. It has seen a few revisions in the last few months, but comments appear to be slowing down. One never knows with such things (the patches looked mostly ready in 2013 as well), but it seems like there is not much keeping this work from going into the 4.4 kernel.
Other approaches to random number scalability
Back in late September, we looked at a patch to improve the scalability of random number generation on Linux systems—large NUMA systems, in particular. While the proposed change solved the immediate scalability problem, there were some downsides to that approach, in terms of both complexity and security. Some more recent discussion has come up with other possibilities for solving the problem.
The original idea came from Andi Kleen; it changed the kernel's non-blocking random number pool into a set of pools, one per NUMA node. That would prevent a spinlock on a single pool from becoming a bottleneck. But it also made the kernel's random number subsystem more complex. In addition, it spread the available entropy over all of the pools, effectively dividing the amount available to users on any given node by the number of pools.
But, as George Spelvin noted, the entropy in a pool is "not located in any particular bit position", but is distributed throughout the pool—entropy is a "holographic property of the pool", as he put it. That means that multiple readers do not need to be serialized by a spinlock as long as each gets a unique salt value that ensures that the random numbers produced are different. Spelvin suggested using the CPU ID for the salt; each reader hashes the salt in with the pool to provide a unique random number even if the pool is in the same state for each read.
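The idea can be sketched in user space with a toy mixing function; toy_hash() below is an illustrative FNV-style stand-in, not the SHA-1-based extraction the kernel actually performs.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for hashing the pool contents together with a salt.
 * The kernel uses SHA-1 over the pool; this FNV-style mix is only
 * an illustration of the salting idea. */
static uint32_t toy_hash(const uint32_t *pool, int words, uint32_t salt)
{
    uint32_t h = 2166136261u ^ salt;    /* salt perturbs the start state */
    for (int i = 0; i < words; i++) {
        h ^= pool[i];
        h *= 16777619u;
    }
    return h;
}

/* Each reader salts with its CPU ID, so concurrent readers of the
 * *same* pool state still obtain different outputs, with no lock. */
static uint32_t read_random(const uint32_t *pool, int words, int cpu_id)
{
    return toy_hash(pool, words, (uint32_t)cpu_id);
}
```

Since the xor and odd-multiply steps are each invertible, two different salts can never collapse to the same output for the same pool state, which is the property Spelvin is relying on.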
Spelvin provided a patch using that approach along with his comments. Random number subsystem maintainer Ted Ts'o agreed with Spelvin about how the entropy is distributed, but had some different ideas on how to handle mixing the random numbers generated back into the pool. He also provided a patch and asked Kleen to benchmark his approach. "I really hope it will be good enough, since besides using less memory, there are security advantages in not spreading the entropy across N pools."
Either approach would eliminate the lock contention (and cache-line bouncing of the lock), but there still may be performance penalties for sharing the pool among multiple cores due to cache coherency. The non-blocking pool changes frequently, either as data gets mixed in from the input pool (which is shared with the blocking pool) or as data that is read from the pool gets mixed back in to make it harder to predict its state. The cache lines of the pool will be bounced around between the cores, which may well be less than desirable.
As it turned out, when Kleen ran his micro-benchmark, both patch sets performed poorly in comparison to the multi-pool approach. In fact, for reasons unknown, Spelvin's was worse than the existing implementation.
Meanwhile, while the benchmarking was taking place, Ts'o pointed out that it may just make sense to recognize when a process is "abusing" getrandom() or /dev/urandom and to switch it to using its own cryptographic-strength random number generator (CSRNG or CRNG) seeded from the non-blocking pool. That way, uncommon—or, more likely, extremely rare—workloads won't force changes to the core of the Linux random number generator. Ts'o is hoping to not add any more complexity into the random subsystem:
The CRNG would be initialized from the non-blocking pool, and is reseeded after, say, 2**24 cranks or five minutes. It's essentially an OpenBSD-style arc4random in the kernel.
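That scheme can be sketched as a generator that counts its "cranks" and replaces its state from the pool when the threshold is hit. The xorshift generator below is a toy stand-in for a real stream cipher, the fresh_seed parameter stands in for a read of the non-blocking pool, and the five-minute timer is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define RESEED_INTERVAL (1u << 24)   /* the "2**24 cranks" from the quote */

/* Toy per-process CRNG: xorshift32 state plus a crank counter.
 * A real implementation would use a cryptographic stream cipher
 * and also reseed on a timer. */
struct crng {
    uint32_t state;
    uint32_t cranks;
};

static uint32_t crng_next(struct crng *c, uint32_t fresh_seed)
{
    if (c->cranks >= RESEED_INTERVAL) {
        c->state = fresh_seed;       /* pull new state from the pool */
        c->cranks = 0;
    }
    c->cranks++;

    uint32_t x = c->state;           /* xorshift32 step (toy only) */
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    c->state = x;
    return x;
}
```

The pool is touched only once per RESEED_INTERVAL outputs, which is what removes the contention for "abusers".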
Spelvin was concerned that the CSRNG solution would make long-running servers susceptible to backtracking: using the current state of the generator to determine random numbers that have been produced earlier. If backtracking protection can be discarded, there can be even simpler solutions, he said, including: "just have *one* key for the kernel, reseeded more often, and a per-thread nonce and stream position." But Ts'o said that anti-backtracking was not being completely abandoned, just relaxed: "We are discarding backtracking protection between successive reads from a single process, and even there we would be reseeding every five minutes (and this could be tuned), so there is *some* anti-backtracking protection."
Furthermore, he suggested that perhaps real abusers could get their own CSRNG output, while non-abusers would still get output from the non-blocking pool.
Spelvin had suggested adding another random "device" (perhaps /dev/frandom) to provide the output of a CSRNG directly to user space, because he was concerned about changing the semantics of /dev/urandom and getrandom() by introducing the possibility of backtracking. But he agreed that changing the behavior for frequent heavy readers/callers would not change the semantics, since the random(4) man page explicitly warns against that kind of usage.
Spelvin posted another patch set that pursues his ideas on improving the scalability of generating random numbers. It focuses on reducing the lock contention when the output of the pool is mixed back into the pool to thwart backtracking (known as a mixback operation). If there are multiple concurrent readers for the non-blocking pool, Spelvin's patch set ensures that one of them causes a mixback operation; others that come along while a mixback lock is held simply write their data into a global mixback buffer, which then gets incorporated into the mixback operation that is done by the lock holder when releasing the lock.
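The pattern can be sketched with POSIX threads. Everything here is a deliberate simplification of what the patch actually does: the "pool" is a single word, xor stands in for the real mixing function, and the names are invented for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the mixback arrangement: readers that cannot take the
 * lock deposit their contribution in a shared buffer, which the
 * current lock holder folds into the pool before unlocking. */
static pthread_mutex_t mixback_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_uint mixback_buf;   /* deposits from readers that lost the race */
static uint32_t pool_word;        /* toy one-word "pool"; xor stands in
                                   * for the real mixing function */

static void mixback(uint32_t data)
{
    if (pthread_mutex_trylock(&mixback_lock) == 0) {
        /* Lock holder: mix in our data, plus anything other
         * readers deposited while the lock was busy. */
        pool_word ^= data;
        pool_word ^= atomic_exchange(&mixback_buf, 0);
        pthread_mutex_unlock(&mixback_lock);
    } else {
        /* Contended: leave the data for the holder rather than
         * spinning on the lock. */
        atomic_fetch_xor(&mixback_buf, data);
    }
}
```

No reader ever waits for the mixback lock; contention degrades into a single atomic xor, which is the scalability win being claimed.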
There has been no comment on those patches so far, but one gets the sense that Ts'o (or someone) will try to route around the whole scalability problem with a separate CSRNG for abusers. That would leave the current approach intact, while still providing a scalable solution for those who are, effectively, inappropriately using the non-blocking pool. Ts'o seemed strongly in favor of that approach, so it seems likely to prevail.

Kleen has asked that his multi-pool approach be merged, since "it works and is actually scalable and does not require any new 'cryptographic research' or other risks". But it is not clear that the complexity and (slightly) reduced security of that approach will pass muster.
Rich access control lists
Linux has had support for POSIX access control lists (ACLs) since the 2.5.46 development kernel was released in 2002 — despite the fact that POSIX has never formally adopted the ACL specification. Over time, POSIX ACLs have been superseded by other ACL mechanisms, notably the ACL scheme adopted with the NFSv4 protocol. Linux support for NFSv4 ACLs is minimal, though; there is no support for them at all in the virtual filesystem layer. The Linux NFS server supports NFSv4 ACLs by mapping them, as closely as possible, to POSIX ACLs. Chances are, that will end with the merging of Andreas Gruenbacher's RichACLs patch set for the 4.4 kernel.

The mode ("permissions") bits attached to every file and directory on a Linux system describe how that object may be accessed by its owner, by members of the file's group, and by the world as a whole. Each class of user has three bits regulating write, read, and execute access. For many uses, that is all the control a system administrator needs, but there are times where finer-grained access control is useful. That is where ACLs come in; they allow the specification of access-control policies that don't fit into the nine mode bits. There are different types of ACLs, reflecting their evolution over time.
POSIX ACLs
POSIX ACLs are clearly designed to fit in with traditional Unix-style permissions. They start by implementing the mode bits as a set of implicit ACL entries (ACEs), so a file with permissions like:
    $ ls -l foo
    -rw-rw-r-- 1 linus penguins 804 Oct 18 09:40 foo
It has a set of implicit ACEs that looks like:
    $ getfacl foo
    user::rw-
    group::rw-
    other::r--
The user and group ACEs that contain empty name fields ("::") apply to the owner and group of the file itself. The administrator can add other user or group ACEs to give additional permissions to named users and groups. The actual access control is implemented in a way similar to how the mode bits are handled. If one of the user entries matches, the associated permissions are applied. Otherwise, if one of the group entries matches, that entry is used; failing that, the other permissions are applied.
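That evaluation order can be sketched as follows. The types and the bit values are invented for illustration, and the sketch takes the first matching group entry for brevity, where a real implementation checks every matching group entry for one that grants the requested access.

```c
#include <assert.h>

/* Toy model of POSIX ACL evaluation order: owner first, then named
 * users, then group entries, then "other".  Perms use the familiar
 * bits: 4 = read, 2 = write, 1 = execute. */
enum tag { OWNER, NAMED_USER, OWNING_GROUP, NAMED_GROUP, OTHER };

struct ace  { enum tag tag; int id; int perms; };
struct cred { int uid; int gid; };

static int acl_perms(const struct ace *acl, int n, const struct cred *c,
                     int owner_uid, int owner_gid)
{
    int group_perms = -1, other_perms = 0;

    for (int i = 0; i < n; i++) {
        const struct ace *a = &acl[i];
        switch (a->tag) {
        case OWNER:                      /* user entries win outright */
            if (c->uid == owner_uid)
                return a->perms;
            break;
        case NAMED_USER:
            if (c->uid == a->id)
                return a->perms;
            break;
        case OWNING_GROUP:               /* simplification: first match */
            if (c->gid == owner_gid && group_perms < 0)
                group_perms = a->perms;
            break;
        case NAMED_GROUP:
            if (c->gid == a->id && group_perms < 0)
                group_perms = a->perms;
            break;
        case OTHER:                      /* used only if nothing matched */
            other_perms = a->perms;
            break;
        }
    }
    return group_perms >= 0 ? group_perms : other_perms;
}
```

Note that a matching group entry is used even if it grants less than the "other" entry would; the classes are strictly ordered.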
There is one little twist: the traditional mode bits still apply as well. When ACLs are in use, the mode bits define the maximum permissions that may be allowed. In other words, ACLs cannot grant permissions that would not be allowed by the mode bits. The reason for this behavior is to avoid unpleasant surprises for applications (and users) that do not understand ACLs. So a file with mode 0640 (rw-r-----) would not allow group-write access, even if it had an ACE like:
group::rw-
If a particular process matches a named ACE (by either user or group name), that process is in the group class and is regulated by the group mode bits on the file. The owning group itself can be given fewer permissions than the mode bits would otherwise allow. See this article for a detailed description of how it all works.
NFSv4 ACLs
When the NFS community took on the task of defining an ACL mechanism for the NFS protocol, they chose not to start with POSIX ACLs; instead, they started with something that looks a lot more like Windows ACLs. The result is a rather more expressive and flexible ACL mechanism. With one obscure exception, all POSIX ACLs can be mapped onto NFSv4 ACLs, but the reverse is not true.
NFSv4 ACLs do away with the hardwired evaluation order used by POSIX ACLs. Instead, ACEs are evaluated in the order they are defined. Thus, for example, a group ACE can override an owner ACE if the group ACE appears first in the list. NFSv4 ACEs can explicitly deny access to a class of users. Permissions bits are also additive in NFSv4 ACLs. As an example of this, consider a file with these ACLs:
    group1:READ_DATA:ALLOW
    group2:WRITE_DATA:ALLOW
These ACEs allow read access to members of group1 and write access to members of group2. If a process that is a member of both groups attempts to open this file for read-write access, the operation will succeed. When POSIX ACLs are in use, instead, the requested permissions must all be allowed by a single ACE.
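Ordered, additive evaluation can be sketched like this; the encoding (integer group IDs, -1 for EVERYONE@, two permission bits) is invented for illustration and covers only a sliver of what the protocol defines.

```c
#include <assert.h>
#include <stdbool.h>

#define READ_DATA  0x1
#define WRITE_DATA 0x2

enum ace_type { ALLOW, DENY };

struct nfs4_ace {
    enum ace_type type;
    int who;                 /* toy: a group ID; -1 means EVERYONE@ */
    int mask;
};

/* ACEs are walked in order; each matching ALLOW contributes bits,
 * and a matching DENY on a still-undecided bit blocks it for good.
 * Access is granted only if every requested bit was allowed. */
static bool access_allowed(const struct nfs4_ace *acl, int n,
                           const int *groups, int ngroups, int wanted)
{
    int allowed = 0, denied = 0;

    for (int i = 0; i < n && (allowed | denied) != wanted; i++) {
        bool match = (acl[i].who == -1);
        for (int g = 0; g < ngroups && !match; g++)
            match = (groups[g] == acl[i].who);
        if (!match)
            continue;

        int bits = acl[i].mask & wanted & ~(allowed | denied);
        if (acl[i].type == ALLOW)
            allowed |= bits;
        else
            denied |= bits;
    }
    return (allowed & wanted) == wanted;
}
```

The member-of-both-groups example from the text falls out directly: the two ALLOW entries each contribute one of the requested bits.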
NFSv4 ACLs have a lot more permissions that can be granted and denied. Along with "read data," "write data," and "execute," there are independent permissions bits allowing append-only access, deleting the file (regardless of permissions in the containing directory), deleting any file contained within a directory, reading a file's metadata, writing the metadata (changing the timestamps, essentially), taking ownership of the file, and reading and writing a file's ACLs.
There is a set of bits controlling how ACLs are inherited from their containing directories. ACEs on directories can be marked as being inheritable by files within those directories; there is also a bit to mark an ACE that should only propagate a single level down the hierarchy. When a file is created within the directory, it will be given the ACLs that are marked as being inheritable in its parent directory. This behavior conflicts with POSIX, which requires that any "supplemental security mechanisms" be disabled for new files.
ACLs can have an "automatic inheritance" flag set. When an ACL change is made to a directory, that change will be propagated to any files or directories underneath that have automatic inheritance enabled — unless the "protected" flag is also set. Setting the "protected" flag happens whenever the ACL or mode of the file has been set explicitly; that keeps inheritance from overriding permissions that have been intentionally set to something else. The interesting twist here is that there is no way in Linux for user space to create a file without explicitly setting its mode, so the "protected" bit will always be set on new files and automatic inheritance simply won't work. NFS does have a way to create files without specifying the permissions to use, though, so automatic inheritance will work in that case.
NFSv4 ACLs also differ in how permissions are applied to the world as a whole. The "other" class is called "EVERYONE@", and it means truly everyone. In normal POSIX semantics, if a process is in the "user" or "group" class, the "other" permissions will not even be consulted; that allows, for example, a specific group to be blocked from a file that is otherwise world accessible. If a file is made available to everyone in an NFSv4 ACL, though, it is truly available to everyone unless a specific "deny" ACE earlier in the list says otherwise.
RichACLs
The RichACLs work tries to square NFSv4 ACLs with normal POSIX expectations. To do so, it applies the mode bits in the same way that POSIX ACLs do — the mode specifies the maximum access that is allowed. Since there are far more access types in NFSv4 ACLs than there are mode bits, a certain amount of mapping must be done. So, for example, if the mode denies write access, that will be translated to a denial of related capabilities like "create file," "append data," "delete child," and more.
The actual relationship between the ACEs and the mode is handled via a set of three masks, corresponding to owner, group, and other access. If a file's mode is set to deny group-write access, for example, the corresponding bits will be cleared from the group mask in the ACL. Thereafter, no ACE will be able to grant write access to a group member. The original ACEs are preserved when the mode is changed, though; that means that any additional access rights will be returned if the mode is made more permissive again. The masks can be manipulated directly, giving more fine-grained control over the maximum access allowed to each class; tweaking the masks can cause the file's mode to be adjusted to match.
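The mask mechanism reduces to a bitwise AND. The sketch below invents a three-bit toy layout; RichACLs really track a much larger set of permission bits per mask, and tie mask changes back to the visible mode.

```c
#include <assert.h>

#define R 0x4
#define W 0x2
#define X 0x1

/* Toy RichACL masks: effective access for a class is whatever the
 * ACEs grant, ANDed with that class's mask.  A chmod() rewrites the
 * masks; the ACEs themselves are preserved untouched. */
struct richacl {
    int owner_mask, group_mask, other_mask;
    int ace_grants;        /* what the ACEs would allow the group class */
};

/* Derive the three masks from a traditional nine-bit mode. */
static void apply_mode(struct richacl *a, int mode)
{
    a->owner_mask = (mode >> 6) & 7;
    a->group_mask = (mode >> 3) & 7;
    a->other_mask = mode & 7;
}

static int effective_group_access(const struct richacl *a)
{
    return a->ace_grants & a->group_mask;
}
```

Because ace_grants is never modified, loosening the mode later restores whatever the ACEs originally granted, which is the behavior described above.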
There are some interesting complications in the relationship between the ACEs, the masks, and the actual file mode. Consider an example (from this document) where a file has this ACL:
    OWNER@:READ_DATA::ALLOW
    EVERYONE@:READ_DATA/WRITE_DATA::ALLOW
This ACL gives both read and write access to the owner. If, however, the file's mode is set to 0640, the mask for EVERYONE@ will be cleared, denying owner-write access even though there is nothing in the permissions that requires that. Fixing this issue requires a special pass through the ACL to grant the EVERYONE@ flags to other classes where the mode allows it.
A similar problem comes up when an EVERYONE@ ACE grants access that is denied by the owner or group mode bits. Handling this case requires inserting explicit DENY ACEs for OWNER@ (or GROUP@) ahead of the EVERYONE@ ACE.
The RichACLs patch set implements all of this and more. See this page and the richacl.7 man page for more details. As of this writing, though, there is still an open question: how to handle RichACL support in Linux filesystems.
At the implementation level that question is easily answered; RichACLs are stored as extended attributes, just like POSIX ACLs or SELinux labels. The problem is one of backward compatibility: what happens when a filesystem containing RichACLs is mounted by a kernel that does not implement them? Older kernels will not corrupt the filesystem (or the ACLs) in this case, but neither will they honor the ACLs. That can result in access being granted that would have been denied by an ACL; it also means that ACL inheritance will not be applied to new files.
To prevent such problems, Andreas requested that a feature flag be added to the ext4 filesystem; that flag would prevent the filesystem from being mounted by kernels that do not implement RichACLs. There was some discussion about whether this made sense; ext4 maintainer Ted Ts'o felt that the feature flags were there to mark metadata changes that the e2fsck utility needed to know about to avoid corrupting the filesystem. RichACLs do not apply, since filesystems don't pay attention to the contents of extended attributes.
Over the course of the conversation, though, a consensus seemed to form around the idea that the use of RichACLs is a fundamental filesystem feature. So it appears that once they are enabled for an ext4 filesystem (either at creation time, or via tune2fs), that filesystem will be marked as being incompatible with kernels that don't implement RichACLs. Something similar will likely be done for XFS.
If things go as planned, this work will be mainlined during the 4.4 merge window. At that point, NFS servers should be able to implement the full semantics of NFSv4 ACLs; the feature should also be of use to people running Samba servers. This patch set, the culmination of several years' work, should provide a useful capability to server administrators who need fully supported access control lists on Linux.
Patches and updates
Page editor: Jonathan Corbet