Kernel development
Brief items
Kernel release status
The current development kernel is 3.18-rc5, released on November 16. Linus said: "So we still have a few pending issues, but things look fairly normal. We've still got a few weeks to go before final, and the more you can test, the better off we'll be."
Stable updates: 3.17.3, 3.14.24, and 3.10.60 were released on November 14.
Quotes of the week
"The racer-boys who run $FASTEST_DISTRO_OF_THE_MONTH and infest racer-boy tech forums always tune atime for maximum warp speed. One of them will make the connection between atime updates and lazytime and so turn on lazytime. They will then make a forum post saying how their bonnie++ test runs 3.4% faster but only when they stand on their head, go cross-eyed and stick their tongue out and so everyone should do that it because it's Clearly Better.
Me Too posts will rapidly spread across racer forums around the world and Google will then start returning them as the first hits when someone searches for lazytime. We'll see a cargo cult of users doing headstands and using "-o noatime,nodiratime,lazytime" because that's what the first guy did and everyone who has tried it since has agreed that it was Clearly Better.
In ten years time we'll still be telling people that bonnie++ numbers are meaningless, nodiratime is redundant when you specific noatime, that both are redundant when you specific lazytime and that standing on your head making faces just makes you look like an idiot and gives you a headache.
Meanwhile Eric will be sitting in the corner muttering quietly to himself about how there *still* isn't a Sed For Google API that would let him revise history to stop people finding old posts about the performance benefits of standing on your head, making faces and using noatime,nodiratime,lazytime..."
Kernel development news
Introducing lazytime
POSIX-compliant filesystems maintain three timestamps for every file, corresponding to the times of the last change in the file's metadata or contents (known as its "ctime"), modification of the file's contents ("mtime"), and access of the file ("atime"). The first two timestamps are generally considered to be useful, but "atime" has long been seen as being too expensive for the benefits it provides. In current systems, there is a mount option (called "relatime") that mitigates the worst problems caused by atime, but it has a few issues of its own. Now a new "lazytime" option might replace relatime and provide the best of all worlds.
The problem with atime is that it is supposed to be updated every time the associated file is accessed. Updating atime requires writing the file's inode back to disk, so atime tracking essentially turns every read operation into a write. For many workloads, the effect on performance can be severe. On top of that, there are few programs out there that make use of atime or depend on it being updated. So, ten years ago, it was common to mount filesystems with the "noatime" option, which disabled the tracking of access times entirely.
The problem, of course, is that "few programs" is not the same as "no programs"; it turns out that there are indeed utilities that break in the absence of atime tracking. A classic example is mail clients that use atime to determine whether a mailbox has been read since mail was last delivered to it. After some discussion, the kernel community added the "relatime" mount option in the 2.6.20 development cycle. Relatime will cause most atime updates to be suppressed, but it will allow an atime update if the current recorded atime is prior to the current ctime or mtime. Later on, relatime was tweaked to update atime once every 24 hours regardless (but only if the file is accessed, of course).
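The relatime rule described above can be sketched in a few lines of Python. This is an illustration of the decision only, not the kernel's code; the function name, the use of plain integer seconds for timestamps, and the treatment of equal timestamps as "needs update" are all assumptions made for the example.

```python
DAY_SECONDS = 24 * 60 * 60


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if an access at time `now` should update atime.

    Hypothetical helper: all arguments are timestamps in seconds.
    """
    # The recorded atime is not newer than the last content change.
    if atime <= mtime:
        return True
    # The recorded atime is not newer than the last metadata change.
    if atime <= ctime:
        return True
    # The later tweak: refresh an atime that is at least a day old.
    if now - atime >= DAY_SECONDS:
        return True
    # Otherwise the update is suppressed.
    return False
```

A file that was modified after its last recorded access gets an update; one that is merely read again a few minutes later does not, which is where the savings come from.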
Relatime works well enough for most systems, but there are still those who would like better atime tracking without paying the performance penalty for it. Some users also dislike the fact that relatime, for all its value, causes the system to not be fully compliant with the POSIX specification. For the most part, people have put up with the minimal deficiencies in relatime (or put up with the cost of atime updates), but there is now an alternative on the horizon.
That alternative takes the form of the lazytime mount option, posted as an ext4-specific patch by Ted Ts'o. With lazytime enabled, a filesystem will keep atime current in a file's in-memory inode. But that inode will not be written to disk until there is some other reason to do so, or until the inode itself is being pushed out of memory. The effect will be to have an atime that is always correct from the point of view of any program running on the system. The version of atime stored on disk may well lag significantly behind reality, though, and the current atime could be lost if the system were to crash.
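A toy model may help make the in-memory versus on-disk distinction concrete. The class below is purely illustrative: its name, fields, and methods are invented for this sketch and do not correspond to anything in the patch.

```python
class LazyInode:
    """Toy model of an inode under lazytime (hypothetical, not kernel code)."""

    def __init__(self, atime=0):
        self.mem_atime = atime    # what programs on the running system see
        self.disk_atime = atime   # what would survive a crash
        self.disk_writes = 0      # number of inode writes issued

    def read(self, now):
        # An access only touches the in-memory copy; no I/O is generated.
        self.mem_atime = now

    def write_back(self):
        # Stands in for "some other reason" to write the inode,
        # or for the inode being pushed out of memory.
        self.disk_atime = self.mem_atime
        self.disk_writes += 1

    def crash(self):
        # After a crash, only the on-disk value remains.
        self.mem_atime = self.disk_atime


inode = LazyInode()
for t in (1, 2, 3):
    inode.read(now=t)         # three reads, zero disk writes so far
inode.write_back()            # one write carries the latest atime
inode.read(now=10)
inode.crash()                 # the access at t=10 is lost
```

In this model, any number of reads costs at most one inode write, and the price is that accesses since the last write-back vanish in a crash.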
Dave Chinner was quick to point out that, while the option looked like it could be useful, the ext4 filesystem might not be the best place to implement it. If lazytime were to be implemented in the virtual filesystem (VFS) layer, then it would be available for all filesystems, not just ext4 and, perhaps most importantly, it would work the same way on all of them. Ted agreed that a VFS implementation might make sense; the next version of this patch seems almost certain to be implemented at that level.
Dave also suggested that delaying the writing of atime updates indefinitely might not be advisable:
Once again, Ted was amenable to this idea, so the next version will probably write out updated atime values at least once every 24 hours. Without that change, atime updates could be held in memory for months at a time on a system like a database server (which keeps its files open indefinitely).
Finally, there is the question of whether lazytime should become the default mount option. It satisfies POSIX (or, at least, will after another fix or two) without incurring the cost of normal atime updates, so it does seem like a better option than relatime, which is the current default. Ted, seemingly, would like to change the default in the near future, while Dave is a bit more concerned about regressions and would like to wait a couple of years to see how things work out. That led to a question of whether the feature will see enough testing in the meantime, but, as Dave noted, there will probably be enough interest in the feature to ensure that people will try it out.
Whether that is true remains to be seen; relatime works well enough for most users, so there isn't necessarily a crowd of people looking to try a new filesystem mount option. But eventually some of the more adventurous distributions are likely to pick it up; at that point, any latent problems should probably come out before too long. So, when lazytime becomes the default in 2016 or so, it should indeed be well tested and shown to work without problems.
Control group namespaces
Containers in Linux use both control groups (cgroups) and namespaces to isolate a set of processes into a virtual system at the operating system level (as opposed to at the hardware level as with KVM). But, currently, cgroups themselves are not virtualized. That leads to a number of problems for container managers (e.g. LXC, Docker), since processes inside the containers can see the global cgroup landscape. A recent patch set seeks to fix those problems by creating a new namespace for cgroups.
Aditya Kali posted v2 of the cgroup namespace patch set at the end of October. It is based on Tejun Heo's unified cgroup hierarchy work and is meant to solve several problems for containers. For example, when a task consults the /proc/self/cgroup file, it currently sees the full cgroup path from the global cgroup hierarchy, which leaks information about the host system. That information makes it difficult to do container migration across systems (using checkpoint/restore in user space, aka CRIU) since all of the names would need to be unique across all systems so that there were no collisions with names on the new system. In addition, running container-management tools inside of containers (to nest them) is difficult because the information available is not relative to the existing container.
The basic idea in the patch set is that a process can call unshare() using the CLONE_NEWCGROUP flag to enter a new cgroup namespace. Once it does that, it will no longer see the global cgroup hierarchy, but will instead see itself in the root cgroup. In the first patch, Kali described how that would look:
    $ cat /proc/self/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
    $ ~/unshare -c  # calls unshare(CLONE_NEWCGROUP) and exec's /bin/bash
    # From within new cgroupns, process sees that its in the root cgroup
    [ns]$ cat /proc/self/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
Similarly, /proc/<PID>/cgroup will return a path that is relative to the cgroup namespace root (known as cgroupns-root). In addition, mounting the control group filesystem (cgroupfs) within the namespace would make cgroupns-root be the root of the mounted cgroupfs. In effect, it would be like bind-mounting the cgroup namespace's subtree in cgroupfs (i.e. starting at cgroupns-root) at the mount point. Currently, mounting cgroupfs exposes the full hierarchy of existing cgroups, which leaks unnecessary (and confusing) information.
The main area of discussion on the patch set (and its v1 predecessor) has been about which processes can be moved into cgroup namespaces at various levels in the hierarchy (e.g. below, above, or into sibling hierarchies). The original patches only allowed processes to be moved into namespaces below the root of the cgroupns they are in, but that was deemed too restrictive (it could lead to a situation where the root user could not move a process to a particular namespace, for example). The current patches allow suitably privileged processes to move processes to any cgroup namespace in the hierarchy, though they do not do any implicit movement of the process into a different cgroup; that must be handled by the process doing the moving. That can lead to relative paths in /proc/<PID>/cgroup depending on the namespace of the process looking and that of the PID in question:
    # ns is at '/groups/a', PID 4567 is in a cgroupns at '/groups/b'
    [ns]$ cat /proc/4567/cgroup
    0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../b
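The path arithmetic behind both transcripts can be approximated with ordinary relative-path computation. The sketch below is not taken from the patch set; the function name is hypothetical, and it assumes that the reported path is simply the target's global cgroup path expressed relative to the reader's cgroupns-root, with a leading slash added.

```python
import posixpath


def cgroup_path_as_seen(target_cgroup, cgroupns_root):
    """Return the cgroup path a reader rooted at `cgroupns_root` would see.

    Both arguments are absolute paths in the global cgroup hierarchy.
    Hypothetical helper that mimics the output format shown in the article.
    """
    rel = posixpath.relpath(target_cgroup, start=cgroupns_root)
    # relpath() returns "." when the two paths are the same cgroup.
    return "/" if rel == "." else "/" + rel


# A process looking at itself from inside its own namespace:
own_view = cgroup_path_as_seen("/batchjobs/c_job_id1", "/batchjobs/c_job_id1")

# A reader rooted at /groups/a looking at a task in /groups/b:
sibling_view = cgroup_path_as_seen("/groups/b", "/groups/a")
```

With these inputs, own_view is "/" and sibling_view is "/../b", matching the two transcripts; a cgroup below the namespace root would show up as an ordinary path such as "/child".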
With those changes in place, container managers can treat nested containers the same as they do the top level. Tools for container migration can also do their job without having to be concerned about name collisions on the new system.
So far, the reception has been fairly positive. There has been discussion about various aspects of the patch set, but no one seems to be putting the brakes on the idea. In fact, namespaces developer Eric W. Biederman noted that the patch set "definitely looks like the right direction to go, and something that in some form or another I had been asking for since cgroups were merged". There is certainly more work to do, but it would seem likely that a new namespace for cgroups is in the kernel's future.
The trouble with dropping groups
Linux, like all but the earliest of Unix-like systems, allows a process to be a member of multiple groups at any given time. Any of those group memberships can be used to make access-control decisions. Changing the list of groups a process belongs to is a privileged operation for obvious reasons. Recently, though, a developer started to think about letting processes drop some groups from their lists. On its face, allowing a process to discard credentials seems like it should be a safe thing to do. In practice, it turns out that there are some surprises waiting for anybody wanting to give the idea a try.
Josh Triplett would like to make it easy for unprivileged users to run programs in a sandboxed mode. As part of that work, he put together a patch allowing an unprivileged process to drop its membership in one or more groups. The idea is that a user could fire off a sandboxed process with minimal credentials, including a reduced set of group memberships, without having to resort to privileged helper commands. That should reduce the level of privilege in the system overall and, hopefully, make things more secure.
But it seems that no Unix-like system anywhere has made it possible for unprivileged processes to drop group membership. The reason may be surprising, but it's there nonetheless: some systems use group membership as a way of reducing privilege; such schemes would no longer work if users could discard groups at will.
Casey Schaufler jumped in early with the assertion that Tizen is one of the systems using groups in this way. The specific mechanism he is worried about is access control lists (ACLs). An ACL can make either positive or negative access decisions, so one can write an ACL to deny access to a resource if the accessing process is a member of a given group. It makes sense in a way; user applications could be run with membership in a special "untrusted" group that would be denied access to most system resources. Casey was pretty clear in his opinion that this change should not be merged:
Ted Ts'o pointed out that there is no need to have ACLs to use groups as a negative access indicator. Imagine you had a directory called games that you wanted to make available to all users except those in the games-abusers group. All that is needed is a command sequence like:
    chown bin.games-abusers games
    chmod 705 games
This will have the (possibly surprising) effect of blocking access to the directory for anybody in the games-abusers group — the "no access" group permission bits override the more open permissions for the whole world. Once again, being able to drop group membership would defeat this kind of mechanism.
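The reason the group bits win is that the traditional Unix permission check picks exactly one class (owner, group, or other) and consults only that class's bits. A simplified sketch of that check follows; it ignores root, capabilities, and ACLs, and the function name and the numeric IDs used below are made up for illustration.

```python
READ, WRITE, SEARCH = 4, 2, 1


def may_access(mode, want, uid, groups, owner_uid, owner_gid):
    """Simplified owner/group/other check (hypothetical helper).

    `want` is a bitmask of READ, WRITE, and SEARCH; `groups` is the set
    of group IDs the calling process belongs to.
    """
    if uid == owner_uid:
        bits = (mode >> 6) & 7      # owner class
    elif owner_gid in groups:
        bits = (mode >> 3) & 7      # group class; "other" is never consulted
    else:
        bits = mode & 7             # other class
    return (bits & want) == want


BIN_UID, ABUSERS_GID = 2, 1001      # illustrative IDs only
mode = 0o705                        # rwx for owner, nothing for group, r-x for other

in_group = may_access(mode, READ | SEARCH, uid=1000,
                      groups={1000, ABUSERS_GID},
                      owner_uid=BIN_UID, owner_gid=ABUSERS_GID)

after_drop = may_access(mode, READ | SEARCH, uid=1000,
                        groups={1000},
                        owner_uid=BIN_UID, owner_gid=ABUSERS_GID)
```

Here in_group is False and after_drop is True: the same user regains access to the directory just by shedding the group.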
Finally, the sudo utility can also make decisions based on group membership or the lack thereof. Being able to drop group membership could thus enable a user to get privileges via sudo that would normally have been denied.
As it happens, on a Linux system one does not actually need Josh's patch to be able to drop group membership. As Andy Lutomirski noted, user namespaces already make that possible. In a user namespace, normal users can have root access and can thus call setgroups(). In theory, that access does not enable any expanded privilege outside of the namespace, but, as can be seen here, a few surprises still lurk in that code. Beyond setgroups(), though, user namespaces can be used to drop groups by simply neglecting to map them to groups outside of the namespace; see this article for details on how this mapping works. Andy sees the problem as being serious enough that he reported it to the oss-security mailing list as a vulnerability with no fix available.
A few of the participants in the discussion seemed to feel that the idea of using credentials to reduce privilege was a bit backward. But it appears to be something that people do, so breaking it is not an option. For the case of user namespaces, some sort of fix will have to be applied; it may become impossible to drop group memberships from within a user namespace. The sudo problem can be addressed by only allowing groups to be dropped if the "no new privileges" flag (originally introduced for system call filtering) is set, but Eric Biederman worries about the additional complexity that would bring in.
There was talk of adding a sysctl knob to control the unprivileged dropping of group membership. Such a flag would default to "off"; system administrators could turn it on if they were confident that it would not subvert the security models in use on their systems. But Casey is not confident that this option makes sense; just because a system does not use groups to restrict privilege now doesn't mean that somebody won't install a package using that approach tomorrow. Also, as he pointed out: "The developers of user namespaces didn't notice it might be a problem. You can't count on sysadmins or distro developers to do better."
So, in the end, unprivileged dropping of group membership may turn out to be one of those ideas that just can't quite be shoehorned into the decades-old Unix privilege model. There is a lot of history there and no end of systems that might see surprising results coming from a change like this. If this work does go forward, expect to hear some loud complaints before it makes it into the mainline.
(Those interested in this work may also want to have a look at Josh's other patch. It allows a process to have multiple user IDs in the same way that multiple group IDs are possible now. The idea is that the process could use those supplemental IDs to run sandboxed processes each with their own user ID.)
Page editor: Jonathan Corbet