Architecture emulation containers with binfmt_misc
binfmt_misc
When an executable file is provided to one of the exec*() system calls, the kernel normally expects to find a native binary for the system it is running on. The kernel has long had a mechanism by which it can recognize other executable-file formats and run them, though. The classic example is the module that looks for a file that begins with "#!", the marker for a shell script. When such a file is recognized, the name of the interpreter for the script will be read from the first line of the file; the interpreter will then be run with the path of the script passed as a command-line argument.
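As a rough illustration of the "#!" case, the sketch below models how the argument vector for the interpreter is assembled on Linux: the interpreter named on the first line comes first, anything after it on that line is kept as a single optional argument, and the path of the script follows. The function name shebang_argv() is hypothetical, and this is a simplified model rather than the kernel's own code.

```python
def shebang_argv(script_path, first_line, script_args=()):
    """Model the argv an interpreter receives for a '#!' script on Linux."""
    if not first_line.startswith("#!"):
        raise ValueError("not a '#!' script")
    rest = first_line[2:].strip()
    if not rest:
        raise ValueError("no interpreter named after '#!'")
    # Split off the interpreter; the remainder (if any) stays one argument.
    parts = rest.split(None, 1)
    argv = [parts[0]]
    if len(parts) == 2:
        argv.append(parts[1].strip())
    # The script is handed over by path, followed by the caller's arguments.
    argv.append(script_path)
    argv.extend(script_args)
    return argv


if __name__ == "__main__":
    print(shebang_argv("./hello.sh", "#!/bin/sh\n"))
    print(shebang_argv("./tool.py", "#!/usr/bin/env python3 -u\n", ["input.txt"]))
```

Note that the text after the interpreter name is not split on whitespace; that is why a line such as "#!/usr/bin/env python3 -u" hands "python3 -u" to env as one argument.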
But one can imagine many other possible formats for executable files. These could be binaries built for a different operating system (DOS binaries that could be run with DOSEMU, for example) or byte-code binaries that need to run on a specific machine (such as Java byte code). One could try to code awareness of all these formats into the kernel, but that gets unwieldy after a while. It also lacks flexibility, which is unfortunate; the kernel developers are never going to know about all of the possible executable formats that might be of interest.
The obvious solution is to allow user space to describe new executable formats to the kernel; that is the role of the binfmt_misc mechanism. If this feature is configured into the kernel (as it usually is), a system administrator can add a new executable format by writing a special string to /proc/sys/fs/binfmt_misc/register. That string includes:
- A way for the kernel to recognize the new format. It can either be
a particular file extension, or a "magic number" found near the
beginning of the file.
- The name of the interpreter that is to be run to execute
files with this format.
- Some flags that control how the argv array is created and, essentially, whether files in this format can be setuid or not.
The full details of how it all works can be found in Documentation/binfmt_misc.txt in the kernel source tree.
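The registration string is a series of colon-separated fields of the form :name:type:offset:magic:mask:interpreter:flags. The sketch below assembles such a string for a magic-number match (type "M"). The helper names and the interpreter path /usr/bin/demo-emulator are assumptions made up for illustration; writing to the register file requires root and is not done when the example runs.

```python
REGISTER_PATH = "/proc/sys/fs/binfmt_misc/register"


def hex_escape(data: bytes) -> str:
    """Render bytes as \\xNN escapes, which the kernel decodes on registration."""
    return "".join(f"\\x{b:02x}" for b in data)


def registration_string(name, magic, mask, interpreter, flags="", offset=0):
    """Build a ':name:M:offset:magic:mask:interpreter:flags' entry."""
    if len(magic) != len(mask):
        raise ValueError("magic and mask must have the same length")
    return (f":{name}:M:{offset}:{hex_escape(magic)}:"
            f"{hex_escape(mask)}:{interpreter}:{flags}")


def register(entry: str) -> None:
    """Hand the entry to the kernel (needs root; not called in this sketch)."""
    with open(REGISTER_PATH, "w") as f:
        f.write(entry)


if __name__ == "__main__":
    # Hypothetical format: any file starting with the four ELF magic bytes.
    entry = registration_string(
        "demo", b"\x7fELF", b"\xff\xff\xff\xff", "/usr/bin/demo-emulator")
    print(entry)
```

Escaping every byte keeps the entry unambiguous even if the magic number happens to contain a colon or a non-printable character.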
It is not hard to see how binfmt_misc can be used to run binaries built for a different architecture. It is a simple matter of describing those binaries and naming an emulator (QEMU, for example) that is able to run the binaries. That works well for binaries to be run directly on the host system, but it can be a bit more challenging to run a container that is built for another architecture.
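To make "describing those binaries" concrete, here is a sketch of a magic-plus-mask test applied to the first 20 bytes of an ELF header, using 64-bit little-endian AArch64 as the foreign architecture. The magic and mask values are derived from the ELF header layout and are meant as an illustration, not as the exact values any particular distribution registers; the helper names are hypothetical.

```python
# e_ident (16 bytes): magic, 64-bit class, little-endian, version 1, OS ABI,
# then padding; followed by e_type (2 = executable) and e_machine (0xb7).
AARCH64_MAGIC = (b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
                 + bytes([0x02, 0x00, 0xb7, 0x00]))
# Ignore the OS ABI byte entirely, and the low bit of e_type so that both
# executables (2) and position-independent binaries (3) are accepted.
AARCH64_MASK = bytes([0xff] * 7 + [0x00] + [0xff] * 8 + [0xfe, 0xff, 0xff, 0xff])


def matches(header: bytes, magic: bytes, mask: bytes, offset: int = 0) -> bool:
    """Return True if header matches magic at offset, considering only mask bits."""
    window = header[offset:offset + len(magic)]
    if len(window) < len(magic):
        return False
    return all(((b ^ m) & k) == 0 for b, m, k in zip(window, magic, mask))


def elf_header(machine: int, e_type: int = 2, osabi: int = 0) -> bytes:
    """Build the first 20 bytes of a 64-bit little-endian ELF header."""
    return (b"\x7fELF" + bytes([2, 1, 1, osabi]) + bytes(8)
            + e_type.to_bytes(2, "little") + machine.to_bytes(2, "little"))


if __name__ == "__main__":
    print(matches(elf_header(0xb7), AARCH64_MAGIC, AARCH64_MASK))  # AArch64
    print(matches(elf_header(0x3e), AARCH64_MAGIC, AARCH64_MASK))  # x86-64
```

The mask is what lets one entry cover a family of binaries: bits cleared in the mask are simply not compared.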
Architecture emulation in containers
The problem, as James Bottomley pointed out in this brief patch set, is that binfmt_misc has to locate and run the interpreter binary at the time that the foreign binary is invoked. This invocation happens within the container, so the interpreter has to be visible in the container as well, but, likely as not, the container is running within a namespace intended to keep it from seeing the rest of the system. As a result, the interpreter must be placed inside the container itself. That complicates what would otherwise be a containerized system built entirely for the emulated architecture. It also forces any orchestration system to be aware of the emulation within the container and set things up accordingly, making emulated containers less transparent than they would otherwise be.
The solution is to add a new mode for binfmt_misc wherein the interpreter binary is opened by the kernel when the new format is initially set up. When a binary in that format is encountered, the already-opened interpreter can be run, rather than seeking out and opening the interpreter at that time. This mechanism will work inside a container that otherwise has no access to the interpreter; the kernel already has the interpreter open, so it can run it directly.
This mode is set up by using the new "F" flag when describing the format to binfmt_misc. Once the kernel has opened the interpreter file, it will keep it open until the format is removed. That means that updates to the interpreter binary will not take effect unless the format is removed and reestablished. That should not ordinarily be a problem, but it could be a surprise for system administrators who are not aware of this behavior.
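A loose user-space analogy may help here; it is not how the kernel implements the "F" flag. In the sketch below, a process opens an executable once and later runs it through the open file descriptor, so no path lookup happens at execution time. The descriptor keeps referring to the file that was opened even if the path is later replaced, which mirrors the update caveat described above. The helper name run_by_descriptor() is hypothetical, and the example assumes a Linux system with a "true" utility on the PATH.

```python
import os
import shutil


def run_by_descriptor(fd, argv):
    """Fork, execute the already-open file behind fd, and return its exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: execute by descriptor rather than by path.
        try:
            os.execve(fd, argv, dict(os.environ))
        finally:
            # Reached only if execve() failed.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    # Open the program early, while its path is still reachable.
    fd = os.open(shutil.which("true"), os.O_RDONLY)
    try:
        # Later, only the descriptor is needed to run it.
        print(run_by_descriptor(fd, ["true"]))
    finally:
        os.close(fd)
```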
The patch set received a small number of generally favorable reviews. If
it is merged, as seems likely, it will make it easier to run containers
built for a number of machine architectures on the same host, making Linux
containers more flexible in general.
Index entries for this article | |
---|---|
Kernel | binfmt_misc |
Kernel | Containers |
Posted Mar 10, 2016 4:14 UTC (Thu)
by raven667 (subscriber, #5198)
Posted Mar 10, 2016 12:34 UTC (Thu)
by jejb (subscriber, #6654)
Posted Mar 10, 2016 15:39 UTC (Thu)
by raven667 (subscriber, #5198)
Posted Mar 10, 2016 12:40 UTC (Thu)
by RobSeace (subscriber, #4435)
Actually, it just passes the script as a command-line parameter not as stdin, doesn't it?
Posted Mar 10, 2016 17:20 UTC (Thu)
by itvirta (guest, #49997)
Posted Mar 10, 2016 21:01 UTC (Thu)
by eternaleye (guest, #67051)
However, you are correct that no mode of operation seems to pass it on stdin.
Posted Mar 10, 2016 20:47 UTC (Thu)
by smurf (subscriber, #17840)
Posted Mar 10, 2016 21:05 UTC (Thu)
by eternaleye (guest, #67051)
- Possibly less overhead, as it doesn't need to do FS traversals to get to the binary
Posted Mar 10, 2016 22:51 UTC (Thu)
by jejb (subscriber, #6654)
Hard-linking or bind-mounting a single emulator binary isn't that difficult, so I don't think this feature helps much.
- Doesn't break if the user accidentally unlinks it (as was called out as a potential failure mode of what you suggest)
- Reduces the divergence between an emulated container and a native container on that arch (as far as the emulated container can see)
- Avoids any need to make changes _within_ the container to boot it on another host arch
- Likely more I haven't thought of