Minijail

By Jake Edge
September 14, 2016

Sandboxing services and applications running on the Linux kernel is a way to help mitigate problems they can cause when they have bugs or are compromised. While there are multiple technologies in the kernel to help with creating the sandbox, it is easy for programmers to get it wrong. Jorge Lucangeli Obes gave a presentation on minijail, which is a tool and library that Google uses in multiple systems for sandboxing. In fact, he said, Google uses it everywhere: on Android, Chrome OS, on its servers, and beyond.

He started the talk by showing a portion of a ps listing from his laptop that showed multiple root-owned processes running. Each of those processes is "one bug away" from an attacker getting root privileges. For example, the Bluetooth daemon is running as root and listening on the air even on a "super modern kernel". He could have set up a Bluetooth beacon in the room to try to exploit the Bluetooth stacks in the laptops present, which would have given him complete control of them if it was successful; he didn't do that, but it is certainly possible.

Part of the reason that so many processes run as root is that there are misaligned incentives, Lucangeli said. Administrators don't know what permissions are needed by the software and developers don't know where their software is running. Even when the developers do try to reduce the privileges their programs need, they make mistakes as there are a lot of pitfalls in doing so correctly.

So instead of reinventing the wheel for each program and expecting the developers to be experts in security hardening, Google developed minijail. That way, those who are writing Android or Chrome OS system programs do not have to be security experts; there is simply a library they can use to handle these sandboxing chores. That library will be regularly tested to ensure that it always works and there will be one place to fix bugs when it doesn't.

Minijail is also part of what allows Android apps to run on Chrome OS, he said. It is effectively creating a container for programs that use it. So minijail is a "containment helper" for Android, Chrome OS, Brillo, and more.

The goal is to eliminate as many of the services running as root as possible. For one thing, minijail uses Linux capabilities to reduce the privileges a process needs. For example, the Bluetooth daemon needs the ability to administer network interfaces and to open sockets, but it does not need to be able to remount filesystems or reboot the system. So it is given the appropriate capabilities that allow it to do its job, and no others.
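As a rough illustration of what "the appropriate capabilities and no others" means in practice, capabilities are numbered bits that get folded into a mask. The sketch below uses the capability numbers from <linux/capability.h>; the particular set chosen for a Bluetooth-like daemon (CAP_NET_ADMIN plus CAP_NET_RAW) is an assumption made for the example, not something stated in the talk.

```python
# Capability numbers as defined in <linux/capability.h>.
CAP_NET_ADMIN = 12   # administer network interfaces
CAP_NET_RAW = 13     # open raw and packet sockets
CAP_SYS_ADMIN = 21   # needed for mount(), among much else
CAP_SYS_BOOT = 22    # needed for reboot()


def cap_mask(*caps):
    """Fold capability numbers into the bitmask form the kernel uses."""
    mask = 0
    for cap in caps:
        mask |= 1 << cap
    return mask


def has_cap(mask, cap):
    """Return True if capability number `cap` is set in `mask`."""
    return bool(mask & (1 << cap))


# Hypothetical set for a Bluetooth-like daemon; the real daemon may differ.
daemon_caps = cap_mask(CAP_NET_ADMIN, CAP_NET_RAW)

print(hex(daemon_caps))
print("can administer network:", has_cap(daemon_caps, CAP_NET_ADMIN))
print("can mount filesystems: ", has_cap(daemon_caps, CAP_SYS_ADMIN))
print("can reboot:            ", has_cap(daemon_caps, CAP_SYS_BOOT))
```

A process holding only this mask would pass the kernel's checks for network administration but fail those for mounting or rebooting.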

In Chrome OS, for example, no network-facing services are running as root. They are not completely unprivileged, of course, but instead try to follow the principle of least privilege.

There's more to minijail than just capabilities, though. Processes with a restricted set of capabilities can still access the entire kernel API. It really doesn't make sense for a process that doesn't have the capability needed to mount a filesystem to still have access to the mount() system call, Lucangeli said.

So minijail uses seccomp to restrict the system calls that processes can make. For example, cat needs only nine system calls to function, instead of the 350 or so that are available in the kernel API. The idea is that even if the process gets subverted, it can't really do anything more than it is meant to do. The Chrome rendering process only needs around half of the available system calls to do its job; with seccomp protections, malicious content still can't cause it to make any of those other calls.
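The allowlist idea can be modeled in a few lines. The sketch below does not install a real seccomp filter; it only mimics the decision one would make. The "name: 1" policy syntax is loosely modeled on one-line-per-syscall policy files and should be treated as an assumption, and the nine system calls listed are a guess for illustration, since the talk as reported did not enumerate the ones cat needs.

```python
# Illustrative allowlist only; not the actual list from the talk.
CAT_POLICY = """\
# syscall: 1 means "allow"
read: 1
write: 1
openat: 1
close: 1
fstat: 1
mmap: 1
munmap: 1
brk: 1
exit_group: 1
"""


def parse_policy(text):
    """Return the set of syscall names marked as allowed ("name: 1")."""
    allowed = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, rule = line.partition(":")
        if rule.strip() == "1":
            allowed.add(name.strip())
    return allowed


def check(allowed, syscall):
    """Model the filter's decision: permit listed calls, refuse the rest."""
    return "allow" if syscall in allowed else "deny"


allowed = parse_policy(CAT_POLICY)
print(len(allowed), "system calls allowed")
for call in ("read", "mount", "ptrace"):
    print(call, "->", check(allowed, call))
```

Anything not on the list is refused, so a subverted process cannot reach mount() or ptrace() no matter what code the attacker manages to run.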

Minijail uses LD_PRELOAD to ensure that the mini-jail is entered before the program's main() function is called. This has the advantage that the system calls used by glibc initialization do not have to be added to the seccomp rules, since glibc is loaded and initialized before the jail.

There is another reason that LD_PRELOAD is needed, he said. Ostensibly, capabilities are inherited over execve(), so you can have a launcher that sets up the sandbox and runs the program in it, but there is a hitch. Unless filesystem capabilities are enabled, it is impossible to actually pass the capabilities on to the new program. There are good reasons not to enable the file-based capabilities, however, because they allow processes to gain capabilities at runtime, which makes reasoning about them more difficult. "Everyone who tried to use capabilities to do something useful" has seen the problem, he said. The solution was ambient capabilities, which allow processes to pass their capabilities across an execve() call without using filesystem capabilities.
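The hitch is easier to see by working through the transformation rules that capabilities(7) gives for execve(). The sketch below applies those rules for a non-root process; it ignores set-user-ID and set-group-ID binaries and the special handling of root, and the class and function names are made up for the example.

```python
from dataclasses import dataclass

CAP_NET_ADMIN = 1 << 12   # mask bit for capability number 12
ALL_CAPS = (1 << 64) - 1  # stand-in for an unrestricted bounding set


@dataclass
class ProcCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: int = 0
    ambient: int = 0
    bounding: int = ALL_CAPS


@dataclass
class FileCaps:
    permitted: int = 0
    inheritable: int = 0
    effective: bool = False  # the file's single "effective" bit


def after_execve(p, f):
    """Apply the capabilities(7) execve() rules (non-root, no setuid/setgid)."""
    file_has_caps = bool(f.permitted or f.inheritable)
    # Ambient capabilities are dropped when the file carries its own.
    ambient = 0 if file_has_caps else p.ambient
    permitted = ((p.inheritable & f.inheritable)
                 | (f.permitted & p.bounding)
                 | ambient)
    effective = permitted if f.effective else ambient
    return ProcCaps(permitted=permitted, inheritable=p.inheritable,
                    effective=effective, ambient=ambient, bounding=p.bounding)


plain_binary = FileCaps()  # an ordinary program with no file capabilities

# A launcher that holds CAP_NET_ADMIN and marks it inheritable...
launcher = ProcCaps(permitted=CAP_NET_ADMIN, inheritable=CAP_NET_ADMIN,
                    effective=CAP_NET_ADMIN)
print(hex(after_execve(launcher, plain_binary).permitted))

# ...and the same launcher after also raising it in the ambient set.
launcher_amb = ProcCaps(permitted=CAP_NET_ADMIN, inheritable=CAP_NET_ADMIN,
                        effective=CAP_NET_ADMIN, ambient=CAP_NET_ADMIN)
print(hex(after_execve(launcher_amb, plain_binary).permitted))
```

In the first case the inheritable set is ANDed with the file's (empty) inheritable set, so nothing survives the execve(). In the second, the ambient set carries the capability into both the permitted and effective sets of the new program.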

Sometimes code is not prepared to deal with the errors returned from a capability check or a seccomp rule, so there is another option in that case: return a dummy object. That is the way he thinks of namespaces in some contexts. They allow the kernel to return "fake" objects for some resources. Namespaces make it easier to port code from elsewhere without having to do major surgery on the code, Lucangeli said.

All seven of the Linux namespaces are supported in minijail at this point. He showed an example using process ID (PID) namespaces, which can be used to prevent "exploiting horizontally"—attacking other processes rather than the kernel. Separating processes into their own PID namespace prevents compromised programs from even seeing the other processes. Over the years, there have been several bugs in the code checking for ptrace() access, but they can't be exploited if the target PID cannot even be seen.
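For reference, each of the seven namespaces corresponds to a CLONE_* flag that is passed to clone() or unshare(). The sketch below only computes the flag word, since actually creating namespaces generally requires privileges; the flag values come from <linux/sched.h>, while the helper name is invented for the example.

```python
# CLONE_* flag values from <linux/sched.h>.
NAMESPACE_FLAGS = {
    "mount":  0x00020000,  # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts":    0x04000000,  # CLONE_NEWUTS
    "ipc":    0x08000000,  # CLONE_NEWIPC
    "user":   0x10000000,  # CLONE_NEWUSER
    "pid":    0x20000000,  # CLONE_NEWPID
    "net":    0x40000000,  # CLONE_NEWNET
}


def unshare_flags(*names):
    """Combine namespace names into the flag word for unshare()/clone()."""
    flags = 0
    for name in names:
        flags |= NAMESPACE_FLAGS[name]  # KeyError on an unknown namespace
    return flags


print(len(NAMESPACE_FLAGS), "namespaces")
# PID and mount namespaces are commonly paired so /proc can be remounted.
print(hex(unshare_flags("pid", "mount")))
```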

The minijail0 binary wraps all of these techniques up together into a single program that can start and enter namespaces, apply seccomp rules, manage capabilities, and so on. It provides access to all of the Linux sandbox features in that one binary. When starting a PID namespace, it will launch a small program that knows how to act like init in the namespace. It will also use a mount namespace to remount /proc inside the mini-jail.
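A launcher script might assemble a minijail0 command line along the following lines. This is only a sketch: the flag letters (-u, -g, -c, -S, -p) are assumed to match the minijail0 manual page and should be verified against the installed version, and the user name, policy path, and daemon path are hypothetical. The code builds the argument list without running anything.

```python
import shlex


def minijail_argv(program, user=None, group=None, caps=None,
                  seccomp_policy=None, pid_namespace=False):
    """Build an argv list for minijail0 (flag letters are assumptions)."""
    argv = ["minijail0"]
    if user:
        argv += ["-u", user]             # run as this user
    if group:
        argv += ["-g", group]            # run as this group
    if caps is not None:
        argv += ["-c", hex(caps)]        # capability bitmask to keep
    if seccomp_policy:
        argv += ["-S", seccomp_policy]   # seccomp policy file
    if pid_namespace:
        argv.append("-p")                # enter a new PID namespace
    argv.append("--")                    # end of options
    return argv + list(program)


# Hypothetical names and paths, for illustration only.
argv = minijail_argv(
    ["/usr/sbin/example-daemon", "--foreground"],
    user="example", group="example",
    caps=(1 << 12) | (1 << 13),          # CAP_NET_ADMIN | CAP_NET_RAW
    seccomp_policy="/etc/example-daemon.policy",
    pid_namespace=True,
)
print(" ".join(shlex.quote(a) for a in argv))
```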

While there may be security concerns about user namespaces, they are the thing that "ties everything together" for minijail. Up until user namespace support was added to minijail, minijail0 had to be run as the root user. The team got requests from within Google to be able to run minijail on systems where root access was not available. Now it can be run as a regular user, which has opened up new applications for minijail, such as on build systems or in the fuzzing infrastructure.

There are some processes that need to run as root, such as the Android init process. So, for the Android container on Chrome OS, the team put the Android system into a user namespace where it was root; some parts of the filesystem were bind-mounted into the container so that init could find things where it expected them. Everything "pretty much just worked". Input events were plumbed into the container and graphics textures are sent out to Chrome OS over a file descriptor; those were the two main changes to Android to make it work. Minijail allowed most of Android to run unmodified on Chrome OS and it also solved many other problems in Chrome OS, Lucangeli said.
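Being "root" inside a user namespace comes down to an ID mapping. The kernel exposes it in /proc/<pid>/uid_map as lines of three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. The sketch below parses that format and translates IDs; the mapping values are hypothetical and are not the ones Chrome OS uses.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside, outside, count) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append(tuple(int(x) for x in fields))
    return entries


def to_outside(entries, inside_id):
    """Translate an ID inside the namespace to the ID seen outside it."""
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None  # unmapped IDs have no identity outside the namespace


# Hypothetical mapping: 65536 IDs starting at 0 map to 100000 and up.
container_map = parse_id_map("0 100000 65536\n")

print(to_outside(container_map, 0))      # "root" inside the container
print(to_outside(container_map, 1000))
print(to_outside(container_map, 70000))  # outside the mapped range
```

With a mapping like this, a process that is UID 0 inside the container is an ordinary unprivileged UID as far as the rest of the system is concerned.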

Many people were involved in developing minijail. It is used in Chrome OS and will be in Android 7.0 (Nougat), mostly for the seccomp support. It is available under the BSD license in the Android repositories.

[I would like to thank the Linux Foundation for travel support to attend the Linux Security Summit in Toronto.]

Index entries for this article
Conference	Linux Security Summit/2016

Minijail

Posted Sep 15, 2016 15:16 UTC (Thu) by JanC_ (guest, #34940)

So, if I understand this well, 'minijail' is similar to 'firejail'? Any other similar tools? And how do they all compare in features & security?

Minijail

Posted Sep 15, 2016 15:53 UTC (Thu) by sjj (guest, #2020)

This would be a great LWN article topic. Hint, hint.

Minijail

Posted Sep 15, 2016 21:28 UTC (Thu) by dw (guest, #12017)

There are more than a few of these around, they suffer from a bunch of problems:

* Static binaries (of course)
* Reliance on LD_PRELOAD, which is essentially a user-tunable knob. Use of LD_PRELOAD should always be considered a hack, and because it is a hack, few of the hacky vendor scripts that set LD_PRELOAD (as they occasionally do) preserve the existing value with something like "LD_PRELOAD=$LD_PRELOAD:...". On top of that, stacking LD_PRELOADs is an analytical nightmare; good luck debugging a SEGV.
* Building essentially a system call interface emulator. The wrapper must ensure it catches every interesting syscall and regularly be audited to ensure the situation hasn't changed. Ensuring seccomp is enabled early is a nice feature here, but it doesn't avoid the architectural travesty of trying to mirror an interface that is a moving target
* Endless bizarre interactions due to messing with every program's runtime image. LD_PRELOAD causes process-visible changes during e.g. calls to dlopen()

I can't say I'd rely on any tool for general purpose use that is effectively exploiting a debug interface to work at all

Minijail

Posted Oct 3, 2016 11:59 UTC (Mon) by mgedmin (subscriber, #34497)

There's also Bubblewrap (https://github.com/projectatomic/bubblewrap), which is affiliated with Flatpak (nee xdg-app).

Minijail

Posted Sep 19, 2016 5:01 UTC (Mon) by Qwertii (guest, #110709)

Seems interesting. It has always bothered me how processes usually run either with minimal permissions as a regular user or with full system access, and nothing in between. What is the difference between this and AppArmor or SELinux?