User namespaces are created by passing the CLONE_NEWUSER flag to the clone() or unshare() system calls. Administrators who are nervous about allowing access to this feature currently have only one option: configure out support at kernel build time. That option is not easily available to the many systems running distribution-built kernels, though. Kees Cook set out to create an easier way with this patch set, which adds a new sysctl knob to control access to the user-namespace feature, saying:
In particular, the patch adds a knob called /proc/sys/kernel/userns_restrict. When it is set to the default value (zero), user namespaces are unrestricted. Setting it to one allows only privileged users to create user namespaces; a setting of two disables user namespaces altogether. In that final case, it is not possible to re-enable user namespaces without rebooting the system.
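The three settings and the one-way nature of the final one can be modeled in a few lines of Python. This is only a toy sketch of the behavior described above, not the patch's code; the class and method names are invented for illustration, and "privileged" is left as a plain boolean because the exact capability check is not spelled out here.

```python
from dataclasses import dataclass

# Values as described for the proposed knob.
UNRESTRICTED = 0     # anybody may create user namespaces
PRIVILEGED_ONLY = 1  # only privileged users may create them
DISABLED = 2         # nobody may, and the setting cannot be undone


@dataclass
class UsernsRestrict:
    """Toy model of the proposed userns_restrict knob (hypothetical, not kernel code)."""

    level: int = UNRESTRICTED

    def write(self, new_level: int) -> None:
        """Model a write to the knob, enforcing the one-way 'disabled' state."""
        if new_level not in (UNRESTRICTED, PRIVILEGED_ONLY, DISABLED):
            raise ValueError("level must be 0, 1, or 2")
        # Once fully disabled, the setting sticks until the next reboot.
        if self.level == DISABLED and new_level != DISABLED:
            raise PermissionError("user namespaces are disabled until reboot")
        self.level = new_level

    def may_create(self, privileged: bool) -> bool:
        """Would a CLONE_NEWUSER request be allowed at the current level?"""
        if self.level == UNRESTRICTED:
            return True
        if self.level == PRIVILEGED_ONLY:
            return privileged
        return False
```

In this model, a fresh `UsernsRestrict()` allows everybody; after `write(1)` only callers passing `privileged=True` succeed, and after `write(2)` any attempt to write a lower value raises an error.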
One of the first issues to be aired had to do with naming: it turns out that Debian currently carries a similar patch, but, on Debian systems, the knob is called unprivileged_userns_clone and doesn't support the "privileged users only" setting. Ben Hutchings agreed that the new naming was probably better and said that, should Kees's patch go upstream, Debian would slowly move over to it.
Some developers worried that allowing user namespaces to be turned off would slow the process of finding and fixing any remaining security issues. Additionally, Serge Hallyn suggested that, if application developers could not count on the availability of user namespaces, they wouldn't use them at all. He suggested that, if the knob is accepted, it be marked as a short-term workaround that would eventually be removed.
The strongest opposition, though, came from Eric Biederman, the creator of user namespaces and also the developer who has done the most work on the sysctl code in recent times. He stated flat out that "the code is buggy, and poorly thought through" and would not be merged. In another message he described his objections in detail, starting with a challenge to the idea that user namespaces are a security risk at all:
Others, though, seem to think that, if problems elsewhere are being "amplified," there is indeed a security exposure. Andy Lutomirski described some concerns of his own:
Eric echoed the point that making it possible to disable user namespaces would be a net loss in security, since the feature would not be available on all systems. He cited web browsing with Chrome as a use case; Kees responded that this patch wasn't really aimed at desktop systems in the first place.
Next on Eric's list was a complaint that a system-wide knob was too coarse; he suggested that perhaps the seccomp() mechanism should be used instead if access to user namespaces must really be restricted. Kees's answer here is that it's not really possible to set a global seccomp() policy, that performance would suffer in any case, and that seccomp() is meant for developers to use rather than system administrators. "It's an extraordinarily big hammer for wanting to turn off a single area of the kernel with a long history of problems." He noted that trying to use a Linux security module to achieve this end would have a number of similar problems.
Then, Eric said, the sysctl knob could create "a false sense of security" since it would have no effect on processes that are already running in a user namespace. If a security issue comes to light, just turning off the knob will not be enough to protect a system; a reboot will also be necessary. Eric returned to this point later, calling the patch "fatally flawed" as a result of the "subtlety and nuance" involved in using it.
Kees acknowledged the "corner case" in the sysctl implementation, one that, he said, applies to a number of other, existing knobs as well. But, he said, it really does not matter to an administrator who simply wants to disable the feature outright as a way of reducing the attack surface of a system. Even so, he allowed: "I'm open to having this sysctl kill all CLONE_NEWUSERed process trees," without noting that having a sysctl knob kill off processes might pose some interesting "subtlety and nuance" of its own.
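An administrator who wants to know which processes would be untouched by flipping such a knob could look for those already running in a non-initial user namespace. The following is a rough sketch of one way to do that, comparing each process's `/proc/<pid>/ns/user` link against that of PID 1. The function names are invented for illustration. It assumes that PID 1 lives in the initial user namespace (which is not true inside a container) and that the caller has enough privilege to read other processes' namespace links.

```python
import os


def userns_id(pid):
    """Return the user-namespace link text for pid, or None if unreadable."""
    try:
        return os.readlink(f"/proc/{pid}/ns/user")
    except OSError:
        # Process exited, or we lack permission to inspect it.
        return None


def outside_initial_userns(ns_by_pid, init_pid=1):
    """Given {pid: namespace id}, return the sorted pids whose user
    namespace differs from that of init_pid."""
    init_ns = ns_by_pid.get(init_pid)
    if init_ns is None:
        return []  # cannot tell without a reference namespace
    return sorted(
        pid
        for pid, ns in ns_by_pid.items()
        if ns is not None and ns != init_ns
    )


def scan():
    """Scan /proc on a live Linux system (assumption: /proc is mounted)."""
    pids = [int(name) for name in os.listdir("/proc") if name.isdigit()]
    return outside_initial_userns({pid: userns_id(pid) for pid in pids})
```

Processes whose links cannot be read are skipped rather than reported, so an unprivileged run of `scan()` may undercount.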
As a sort of postscript, Eric suggested that, perhaps, the desired restriction could be implemented as a resource limit controlling the number of user namespaces that any user would be allowed to create. Setting that number to zero would effectively disable the feature. Kees indicated a willingness to look at this idea; it is the end result he wants, rather than the sysctl knob itself.
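The resource-limit idea can be sketched as a simple per-user counter. This is a hypothetical model of the suggestion as described above, not an existing kernel interface; the class and its methods are invented for illustration, and `None` is used here to stand for "no limit".

```python
from collections import Counter


class UsernsLimiter:
    """Toy per-user cap on the number of live user namespaces (hypothetical)."""

    def __init__(self, max_per_user):
        self.max_per_user = max_per_user  # None means unlimited
        self.active = Counter()           # uid -> namespaces currently alive

    def create(self, uid):
        """Return True if uid may create one more namespace, and count it."""
        limit = self.max_per_user
        if limit is not None and self.active[uid] >= limit:
            return False
        self.active[uid] += 1
        return True

    def destroy(self, uid):
        """Record that one of uid's namespaces has gone away."""
        if self.active[uid] > 0:
            self.active[uid] -= 1
```

With `UsernsLimiter(0)`, every `create()` call fails, which is the "effectively disabled" case; any positive limit allows at least one namespace per user.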
There is an evident desire for the ability to turn off access to user namespaces; various other developers spoke in its favor over the course of the discussion. But this desire is clearly not universal and, as a result, the current patches do not appear to have an easy path into the mainline. It is entirely possible that the concerns blocking this feature may eventually be addressed and overcome, but it also seems possible that, in the end, this knob ends up being part of the patch set carried by distributors and users. It seems that getting security-related changes into the kernel is still a difficult task.
Controlling access to user namespaces
Posted Jan 29, 2016 1:56 UTC (Fri) by zuki (subscriber, #41808)
Also, I don't really buy the argument that setting the sysctl does not work retroactively and this is terrrrrrible. The same is true for most settings... If I had a setuid binary, dropping the bit only affects the future, running instances are not killed. If I change the permissions on a file, processes which had it open just continue. Etc, etc. For example kernel.modules_disabled=1 follows a similar pattern.
It seems that EB doesn't like that people want to disable some feature which he deeply cares about and loses objectivity. The "shortcomings" of the patch seem like things made up post factum to justify the initial emotional response.
Also a global per-user limit doesn't seem very useful. If there's a vulnerability, just one namespace is enough to exploit it. And otherwise, why would we care how many namespaced processes are running? So only two values of the limit make sense: 0 and infinity. So we're back to the original sysctl patch.
Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds