Architecture emulation containers with binfmt_misc

By Jonathan Corbet
March 9, 2016

Containers bring a lot of advantages in the areas of security and systems administration; they can be used to run lightweight virtual "systems" in isolation from each other and the host system. Since containers lack their own kernel — they run directly on the host-system kernel — one does not ordinarily expect them to be built for an architecture other than the host they are running on. But it turns out that there are use cases for such containers, and that they can be run using the somewhat obscure "binfmt_misc" kernel mechanism if a small patch set is applied.

binfmt_misc

When an executable file is provided to one of the exec*() system calls, the kernel normally expects to find a native binary for the system it is running on. The kernel has long had a mechanism by which it can recognize other executable-file formats and run them, though. The classic example is the module that looks for a file that begins with "#!" — the marker for a shell script. When such a file is recognized, the name of the interpreter for the script will be read from the first line of the file; the interpreter will then be run with the file as its standard input.

But one can imagine many other possible formats for executable files. These could be binaries built for a different operating system (DOS binaries that could be run with DOSEMU, for example) or byte-code binaries that need to run on a specific machine (such as Java byte code). One could try to code awareness of all these formats into the kernel, but that gets unwieldy after a while. It also lacks flexibility, which is unfortunate; the kernel developers are never going to know about all of the possible executable formats that might be of interest.

The obvious solution is to allow user space to describe new executable formats to the kernel; that is the role of the binfmt_misc mechanism. If this feature is configured into the kernel (as it usually is), a system administrator can add a new executable format by writing a special string to /proc/sys/fs/binfmt_misc/register. That string includes:

- A way for the kernel to recognize the new format. It can either be a particular file extension, or a "magic number" found near the beginning of the file.
- The name of the interpreter that is to be run to execute files with this format.
- Some flags that control how the argv array is created and, essentially, whether files in this format can be setuid or not.

The full details of how it all works can be found in Documentation/binfmt_misc.txt in the kernel source tree.
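
As a rough illustration of the "special string" described above, the sketch below assembles a registration line in the colon-separated layout given in the kernel's binfmt_misc documentation (:name:type:offset:magic:mask:interpreter:flags). The format name, extension, and interpreter path are made-up examples, and the helper function is hypothetical. It only builds the string and does not write to /proc/sys/fs/binfmt_misc/register.

```python
def build_binfmt_entry(name, kind, interpreter, magic, offset="", mask="", flags=""):
    """Assemble a binfmt_misc registration string (illustrative helper).

    kind is "E" to match on a file extension or "M" to match on a magic
    number; for "E" the magic field holds the extension itself.
    """
    if kind not in ("E", "M"):
        raise ValueError("kind must be 'E' (extension) or 'M' (magic)")
    fields = [name, kind, str(offset), magic, mask, interpreter, flags]
    # The first character of the string selects the field separator, so no
    # field may contain it; ":" is the conventional choice.
    if any(":" in field for field in fields):
        raise ValueError("fields must not contain ':'")
    return ":" + ":".join(fields)


# Hypothetical example: run *.jar files with a made-up wrapper script.
entry = build_binfmt_entry("ExampleJar", "E", "/usr/local/bin/jarwrapper", "jar")
print(entry)  # :ExampleJar:E::jar::/usr/local/bin/jarwrapper:
```

An administrator would write a string of this shape to the register file as root. Empty fields (offset, mask, and flags here) are simply left blank between their separators.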

It is not hard to see how binfmt_misc can be used to run binaries built for a different architecture. It is a simple matter of describing those binaries and naming an emulator (QEMU, for example) that is able to run the binaries. That works well for binaries to be run directly on the host system, but it can be a bit more challenging to run a container that is built for another architecture.
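
To make "describing those binaries" a little more concrete, here is a user-space sketch of a magic-and-mask comparison against the start of an ELF file. It is an illustration under stated assumptions, not the kernel's code: the helper name is hypothetical, the two headers are fabricated, and the pattern is a simplified one that checks only the four-byte ELF signature and the e_machine field (183 for AArch64) of a little-endian header. Real registrations, such as those shipped with QEMU, match more of the header.

```python
def matches_magic(header: bytes, magic: bytes, mask: bytes, offset: int = 0) -> bool:
    """Return True if header matches magic under mask, starting at offset.

    A mask byte of 0xff means "this byte must match"; 0x00 means "ignore".
    """
    window = header[offset:offset + len(magic)]
    if len(window) < len(magic) or len(mask) != len(magic):
        return False
    return all(((h ^ m) & k) == 0 for h, m, k in zip(window, magic, mask))


EM_AARCH64 = 183  # e_machine value for 64-bit ARM in the ELF specification

# Simplified pattern: bytes 0-3 hold the ELF signature and bytes 18-19 hold
# e_machine (little-endian); everything in between is masked out.
magic = b"\x7fELF" + bytes(14) + EM_AARCH64.to_bytes(2, "little")
mask = b"\xff" * 4 + bytes(14) + b"\xff" * 2

# Two fabricated 20-byte headers, for demonstration only.
arm64_header = b"\x7fELF\x02\x01\x01" + bytes(9) + b"\x02\x00\xb7\x00"
x86_64_header = b"\x7fELF\x02\x01\x01" + bytes(9) + b"\x02\x00\x3e\x00"

print(matches_magic(arm64_header, magic, mask))   # True
print(matches_magic(x86_64_header, magic, mask))  # False
```

The mask is what lets one pattern skip over header fields that vary between otherwise similar binaries while still pinning down the target architecture.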

Architecture emulation in containers

The problem, as James Bottomley pointed out in this brief patch set, is that binfmt_misc has to locate and run the interpreter binary at the time that the foreign binary is invoked. This invocation happens within the container, so the interpreter has to be visible in the container as well, but, likely as not, the container is running within a namespace intended to keep it from seeing the rest of the system. As a result, the interpreter must be placed inside the container itself. That complicates what would otherwise be a containerized system built entirely for the emulated architecture. It also forces any orchestration system to be aware of the emulation within the container and set things up accordingly, making emulated containers less transparent than they would otherwise be.

The solution is to add a new mode for binfmt_misc wherein the interpreter binary is opened by the kernel when the new format is initially set up. When a binary in that format is encountered, the already-opened interpreter can be run, rather than seeking out and opening the interpreter at that time. This mechanism will work inside a container that otherwise has no access to the interpreter; the kernel already has the interpreter open, so it can run it directly.

This mode is set up by using the new "F" flag when describing the format to binfmt_misc. Once the kernel has opened the interpreter file, it will keep it open until the format is removed. That means that updates to the interpreter binary will not take effect unless the format is removed and reestablished. That should not ordinarily be a problem, but it could be a surprise for system administrators who are not aware of this behavior.
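
The stale-interpreter behavior follows from ordinary file-descriptor semantics, which can be observed from user space. The following sketch is only an analogy, not the kernel's implementation: it opens a stand-in "interpreter" file, installs a new file under the same name, and shows that the already-open descriptor still refers to the original contents. The file name and contents are made up for the demonstration, and a Linux-like system is assumed.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "emulator")  # stand-in for the interpreter
    with open(path, "wb") as f:
        f.write(b"version 1")

    # Analogous to registering a format with "F": open once, keep it open.
    held_fd = os.open(path, os.O_RDONLY)

    # An "upgrade" that installs a new file under the same name.
    new_path = path + ".new"
    with open(new_path, "wb") as f:
        f.write(b"version 2")
    os.replace(new_path, path)

    via_held_fd = os.read(held_fd, 64)  # still reads the old file
    with open(path, "rb") as f:
        via_path = f.read()             # a fresh lookup sees the new file

    os.close(held_fd)

print(via_held_fd)  # b'version 1'
print(via_path)     # b'version 2'
```

In this analogy, removing and re-registering the format corresponds to closing the held descriptor and opening the path again, which is the point at which the new file would be picked up.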

The patch set received a small number of generally favorable reviews. If it is merged, as seems likely, it will make it easier to run containers built for a number of machine architectures on the same host, making Linux containers more flexible in general.

Posted Mar 10, 2016 4:14 UTC (Thu) by raven667 (subscriber, #5198) [Link] (2 responses)

I'm sure James did good work, but I worry that somewhere along the line the interpreter would retain access to some resource from outside the contained environment (such as the mmap of the interpreter binary outside the container, as James notes) that would allow privilege escalation. He doesn't think that is possible, but it would be great if someone who understands this better than either of us could comment on it.

Posted Mar 10, 2016 12:34 UTC (Thu) by jejb (subscriber, #6654) [Link] (1 responses)

Actually, that's not possible. The emulator runs inside the container, not outside of it. What is poked through into the container from the outside is a file descriptor, opened in the host OS, which is then mapped and executed inside the container, so any fault in the emulator faults inside the container, not outside of it. It also means, except for the fd of the emulator binary, the emulator has no access to any resources outside of the container (that's why, as I explained in the 0/3 patch, the emulator has to be static ... it can't resolve dynamic libraries outside of the container)

Posted Mar 10, 2016 15:39 UTC (Thu) by raven667 (subscriber, #5198) [Link]

I'm sure you are probably right; I'm a sysadmin and not much of a developer. But I just have an unformed suspicion that there is some kernel syscall or resource commonly presented inside containers that would treat the open fd from outside the contained environment as an access token proving the program should be allowed to perform operations outside the container, and that this could be leveraged to exit the container. I don't know of a mechanism to do this, so you are probably right and it's not possible; my lack of confidence is more my lack of deep knowledge of this area than any real problem.

Posted Mar 10, 2016 12:40 UTC (Thu) by RobSeace (subscriber, #4435) [Link] (2 responses)

> When such a file is recognized, the name of the interpreter for the script will be read from the first line of the file; the interpreter will then be run with the file as its standard input.

Actually, it just passes the script as a command-line parameter not as stdin, doesn't it?

Posted Mar 10, 2016 17:20 UTC (Thu) by itvirta (guest, #49997) [Link]

The filename of the script, yes. It couldn't be through stdin, since the script might need that for user input.

Posted Mar 10, 2016 21:01 UTC (Thu) by eternaleye (guest, #67051) [Link]

It depends on your flags - in particular, the "O" flag says to pass a pre-opened FD, and pass the FD number as an argument. There's also "C", which implies "O" and calculates credentials according to the binary rather than the interpreter.

However, you are correct that no mode of operation seems to pass it on stdin.

Posted Mar 10, 2016 20:47 UTC (Thu) by smurf (subscriber, #17840) [Link] (2 responses)

Unfortunately this solution still requires the emulator to be statically linked.
Hard-linking or bind-mounting a single emulator binary isn't that difficult, so I don't think this feature helps much.

Posted Mar 10, 2016 21:05 UTC (Thu) by eternaleye (guest, #67051) [Link]

Well, there's also that because it's executed from a pre-opened FD, these may be relevant:

- Possibly less overhead, as it doesn't need to do FS traversals to get to the binary
- Doesn't break if the user accidentally unlinks it (as was called out as a potential failure mode of what you suggest)
- Reduces the divergence between an emulated container and a native container on that arch (as far as the emulated container can see)
- Avoids any need to make changes _within_ the container to boot it on another host arch
- Likely more I haven't thought of

Posted Mar 10, 2016 22:51 UTC (Thu) by jejb (subscriber, #6654) [Link]

The problem with both of those is that they contaminate the container image, which makes it hard to handle pure non-native images. Secondly, with hard linking, the emulator has to be on the same mount as the image, which usually isn't the case for Docker-style images; for bind mounting, you require the support of the container orchestration system to perform the bind mount. None of this is unsolvable, but it's certainly a lot easier to have the emulation just work.