How programs get run

January 28, 2015

This article was contributed by David Drysdale

This is the first in a pair of articles that describe how the kernel runs programs: what happens under the covers when a user program invokes the execve() system call? I recently worked on the implementation of a new execveat() system call, which is a close variant of execve() that allows the caller to specify the invoked program by a combination of file descriptor and path, as with other *at() system calls. (This will, in turn, enable an implementation of the fexecve() library function that doesn't rely on access to the /proc filesystem, which is important for sandboxed environments such as Capsicum.)

Along the way, I explored the existing execve() implementation, and so these articles present the details of that functionality. In this one, we'll focus on the general mechanisms that the kernel uses for program invocation, which allow for different program formats; the second article will focus on the details of running ELF binaries.

The view from user space

Before diving into the kernel, we'll start by exploring the behavior of program execution from user space (there's also a good description of this behavior in chapter 27 of The Linux Programming Interface). For Linux versions up to and including 3.18, the only system call that invokes a new program is execve(), which has the following prototype:

    int execve(const char *filename, char *const argv[], char *const envp[]);

The filename argument specifies the program to be executed, and the argv and envp arguments are NULL-terminated lists that specify the command line arguments and environment variables for the new program. A simple skeleton driver program (do_execve.c) allows us to explore how this behaves, by feeding in "zero", "one", "two" as arguments and "ENVVAR1=1", "ENVVAR2=2" as environment variables. To see the result in the invoked program, we use another simple program (show_info.c) that just outputs its command-line arguments (argv) and environment (environ).

Putting these together gives the expected result — the arguments and environment are passed through to the invoked program. Notice, though, that the argv[0] for the invoked binary is just the value specified by the caller of execve(); having the program's name in argv[0] isn't a convention that's required or policed by execve() itself, at least for binaries.

    % ./do_execve ./show_info
    argv[0] = 'zero'
    argv[1] = 'one'
    argv[2] = 'two'
    ENVVAR1=1
    ENVVAR2=2

Things change slightly when the program being invoked is a script rather than a binary program. To explore this, we use a shell script equivalent (show_info.sh) of our environment-outputting program; putting this together with the original program that invokes execve(), we see a couple of differences:

    % ./do_execve ./show_info.sh
    $0 = './show_info.sh'
    $1 = 'one'
    $2 = 'two'
    ENVVAR1=1
    ENVVAR2=2
    PWD=/home/drysdale/src/lwn/exec

First, the environment has gained an extra PWD value, indicating the current directory. Secondly, the initial argument to the script is now the script filename, rather than the "zero" value that the invoker specified. A further experiment reveals that the /bin/sh script interpreter added the PWD environment variable, but the kernel itself modified the arguments:

    % cat ./wrapper
    #!./show_info
    % ./do_execve ./wrapper
    argv[0] = './show_info'
    argv[1] = './wrapper'
    argv[2] = 'one'
    argv[3] = 'two'
    ENVVAR1=1
    ENVVAR2=2

More specifically, the kernel has removed the first ("zero") argument and replaced it with two arguments — the name of the script interpreter program (taken from the first line of the script) and the name of the invoked file (which holds the script text). If the first line of the script also includes command-line arguments for the interpreter (for example, awk needs an -f option to treat its input as a filename rather than script text), a third extra argument is also inserted, holding all of the extra options:

    % cat ./wrapper_args
    #!./show_info -a -b -c
    % ./do_execve ./wrapper_args
    argv[0] = './show_info'
    argv[1] = '-a -b -c'
    argv[2] = './wrapper_args'
    argv[3] = 'one'
    argv[4] = 'two'
    ENVVAR1=1
    ENVVAR2=2

Up to a point, we can also repeat this pop-one, push-two alteration of the arguments, by invoking scripts that wrap scripts and so on; each such alteration effectively pushes the wrapper script name in at argv[1]:

    argv[0]: 'zero'=>'./wrapper4'=>'./wrapper3'=>'./wrapper2'=>'./wrapper' =>'./show_info'
    argv[1]: 'one'   './wrapper5'  './wrapper4'  './wrapper3'  './wrapper2'  './wrapper'
    argv[2]: 'two'   'one'         './wrapper5'  './wrapper4'  './wrapper3'  './wrapper2'
    argv[3]:         'two'         'one'         './wrapper5'  './wrapper4'  './wrapper3'
    argv[4]:                       'two'         'one'         './wrapper5'  './wrapper4'
    argv[5]:                                     'two'         'one'         './wrapper5'
    argv[6]:                                                   'two'         'one'
    argv[7]:                                                                 'two'

However, this doesn't continue forever — once there are too many levels of wrappers, the process fails with ELOOP:

    % ./do_execve ./wrapper6
    Failed to execute './wrapper6', Too many levels of symbolic links

Into the kernel: `struct linux_binprm`

Now we move into kernel space and begin delving into the code that implements the execve() system call. A previous article explored the general system call machinery (and the special wrinkles needed for execve()), so we can pick up the story at the do_execve_common() function in fs/exec.c. The main purpose of the code in this function is to build a new struct linux_binprm instance that describes the current program invocation operation. In the structure:

The file field is set to a freshly opened struct file for the program being invoked; this allows the kernel to read the file contents and decide how to handle the file.
The filename and interp fields are both set to the name of the file holding the program; we'll see later why there are two distinct fields here.
The bprm_mm_init() function allocates and sets up the associated struct mm_struct and struct vm_area_struct data structures in preparation for managing the virtual memory of the new program. In particular, the new program's virtual memory ends at the highest possible address for the architecture; its stack will grow downward from there.
The p field is set to point at the end of memory space for the new program, but leaves space for a NULL pointer as an end marker for the stack. The value of p will be updated (downward) as more information is added to the new program's stack.
The argc and envc fields are set to hold the counts of arguments and environment values so that this information can be propagated to the new program later in the invocation process.
The unsafe field is set up to hold a bitmask of reasons why the program execution might not be safe; for example, if the process is being traced with ptrace() or has the PR_SET_NO_NEW_PRIVS bit set. The Linux Security Module (LSM) may subsequently use this information to deny the program execution operation.
The cred field is a separately allocated object of type struct cred that holds information about the credentials for the new program. These are generally inherited from the process that called execve(), but are updated to allow for setuid / setgid bits and other complications. The presence of setuid/setgid bits also disallows a collection of compatibility features because they have an adverse effect on security; the per_clear field records the bits in the process's personality that will be cleared later.
The security field allows an LSM to store LSM-specific information with the linux_binprm; the LSM is notified via a call to security_bprm_set_creds() and the bprm_set_creds LSM hook. The default implementation of this hook updates the new program's Linux capabilities to allow for the file capabilities of the program file; other LSM implementations chain this behavior into their own implementations of the hook (e.g. Smack, SELinux).
The buf scratch space is filled with the first chunk (128 bytes) of data from the program file. This data will be used later to detect the binary format so it can be processed appropriately.

The parts of this setup process that depend on the particular file that's being executed are performed in an inner prepare_binprm() function; this function can be called again later to update those fields if a different file (e.g. a script interpreter) is actually run.

Finally, information about the program invocation is copied into the top of the new program's stack, using the local copy_strings() and copy_strings_kernel() utility functions. First, the program filename is pushed to the stack (and its location is saved in the exec field of the linux_binprm instance), followed by all of the environment values, then by all of the arguments. At the end of this process, the stack looks like:

    ---------Memory limit---------
    NULL pointer
    program_filename string
    envp[envc-1] string
    ...
    envp[1] string
    envp[0] string
    argv[argc-1] string
    ...
    argv[1] string
    argv[0] string

Binary format handler iteration: `struct linux_binfmt`

With a complete struct linux_binprm in hand, the real business of program execution is performed in exec_binprm() and (more importantly) search_binary_handler(). This code iterates over a list of struct linux_binfmt objects, each of which provides a handler for a particular format of binary programs. A binary handler could potentially be defined in a kernel module, so the code calls try_module_get() for each format to ensure the relevant code can't be unloaded by another task while it's being used here.

For each struct linux_binfmt handler object, the load_binary() function pointer is called, passing in the linux_binprm object. If the handler code supports the binary format, it does whatever is needed to prepare the program for execution and returns success (>= 0). Otherwise, the handler returns a failure code (< 0) and iteration continues with the next handler.

Execution of a particular program may itself rely on execution of a different program; the obvious example is executable scripts, which need to invoke the script interpreter. To cope with this, the search_binary_handler() code can be called recursively, re-using the struct linux_binprm object. However, recursion depth is limited to prevent infinite recursion, giving the ELOOP error behavior seen earlier.

The system's LSM also gets a say in the operation; before the iteration over binary formats starts, the bprm_check_security LSM hook is triggered, allowing the LSM to make a decision on whether to allow the operation. To do so, it may use the state it stored in the linux_binprm.security field earlier.

At the end of the iteration, if no formats that can handle the program have been found (and the program appears to be binary rather than text, at least according to the first four bytes), then the code will also attempt to load a module named "binfmt-XXXX", where XXXX is the hex value of bytes three and four in the program file. This is an old mechanism (added in 1996 for Linux 1.3.57) to allow for a more dynamic way of associating binary format handlers with formats; the more recent binfmt_misc mechanism (described below) allows a more flexible way of doing something similar.

Binary formats

So what are the binary formats available in the standard kernel? A search for code that registers instances of struct linux_binfmt (via register_binfmt() and insert_binfmt()) gives us quite a collection of possible formats, all of which are configured and explained in the fs/Kconfig.binfmts file:

binfmt_script.c: Support for interpreted scripts, starting with a #! line.
binfmt_misc.c: Support miscellaneous binary formats, according to runtime configuration.
binfmt_elf.c: Support for ELF format binaries.
binfmt_aout.c: Support for traditional a.out format binaries.
binfmt_flat.c: Support for flat format binaries.
binfmt_em86.c: Support for Intel ELF binaries running on Alpha machines.
binfmt_elf_fdpic.c: Support for ELF FDPIC binaries.
binfmt_som.c: Support for SOM format binaries (an HP/UX PA-RISC format).

(plus a couple of other architecture-specific formats).

The next sections will examine the most important of these: interpreted scripts and the "miscellaneous" mechanism for supporting arbitrary formats; the next article will examine the ELF binary format — which is typically where all program execution ends up.

Script invocation: `binfmt_script.c`

Files that start with the character sequence #! (and have the execute bit set) are treated as scripts, handled by the fs/binfmt_script.c handler. After checking those first two bytes, this code parses the rest of the script-invocation line, splitting it into an interpreter name (everything after #! up to the first white space) and possible arguments (everything else up to the end of the line, stripping external white space).

(One detail to note: back when the struct linux_binprm object was created, only the first 128 bytes of the program were retrieved. This means that if the interpreter name and arguments are longer than this, the results will be truncated.)

With these in hand, the code then removes argv[0] from the top of the new program's stack (i.e. at the lowest address), and in its place pushes the following, adjusting the argc value in the linux_binprm object along the way:

the program name
(optionally) the collected interpreter arguments
the name of the interpreter program

Taken together, this explains the user space behavior we observed at the beginning of the article; our new program's stack is modified to look like:

    ---------Memory limit---------
    NULL pointer
    program_filename string
    envp[envc-1] string
    ...
    envp[1] string
    envp[0] string
    argv[argc-1] string
    ...
    argv[1] string
    program_filename string
    ( interpreter_args )
    interpreter_filename string

The code also changes the interp value in the linux_binprm structure so that it references the interpreter filename, rather than the script filename. This explains why the linux_binprm structure refers to two strings: one (interp) is the program that we currently want to execute, and one is the name (filename) that was originally invoked in the execve() call. Along similar lines, the file field in the linux_binprm is also updated to reference the new interpreter program, and the first 128 bytes of its contents are read into the buf scratch space.

The script handler code then recurses into search_binary_handler() to repeat the whole process for the script interpreter program. If the interpreter is itself a script, then the interp value will be changed once again but the filename will stay unchanged.

Miscellaneous interpreter detection: `binfmt_misc.c`

We saw previously that early versions of the Linux kernel supported a rough-and-ready way of dynamically adding format support, by hunting for a kernel module with a name containing the early bytes of the binary. That's not particularly convenient — only searching on a couple of bytes is very limited (compare the vast range of detection signatures that the file command uses) and requiring a kernel module raises the barrier to entry.

The miscellaneous binary format handler allows a more flexible and dynamic method of dealing with new formats, by allowing run-time configuration (via a special filesystem mounted under /proc/sys/fs/binfmt_misc) to specify:

How to recognize a supported format, based on filename extension or a magic value at a particular offset. (As with parsing script interpreters, this magic value has to fall within the first 128 bytes of the program file.)
The interpreter program to invoke, which will get the program filename passed to it as argv[1] (as with script invocation).

A good example of the miscellaneous format handler in use is for Java files: detect .class files (based on their 0xCAFEBABE prefix) or .jar files (based on the .jar extension) and automatically invoke the JVM executable on them. This will require a wrapper script to provide the relevant command-line arguments, as the miscellaneous configuration doesn't allow arguments to be specified — which means that the miscellaneous handler will invoke the script handler, which will then invoke the ELF handler for the JVM executable (and which will probably in turn invoke the dynamic linker ld.so, although that's a somewhat different story).

Internally, the kernel implementation for this format is similar to the handler for script programs described above, except that there is an initial search for a matching configuration entry, and that configuration is used to make some of the details (such as removing argv[0]) optional.

The format handlers for both scripts and miscellaneous formats recurse on to attempt to invoke the interpreter program that is needed for that particular format. This recursion has to end at some point, and on a modern Linux system this is almost always at an ELF binary program — the subject of the next article — stay tuned.


How programs get run

Posted Jan 29, 2015 6:16 UTC (Thu) by wahern (subscriber, #37304) [Link] (7 responses)

After checking those first two bytes, this code parses the rest of the script-invocation line, splitting it into an interpreter name (everything after #! up to the first white space) and possible arguments (everything else up to the end of the line, stripping external white space).

Linux passes the remainder of the line as a _single_ argument. You show this in your example where "-a -b -c" are all located in argv[1]. But you say

... a third extra argument is also inserted, holding all of the extra options:

Those aren't extra options--the plural is misleading. The distinction matters because neither getopt nor getopt_long will parse "-a -b -c" as three separate options. Rather, it'll be parsed as optc='a' and optarg=" -b -c", or it will parse as optc='a', optc=' ', optc='-', optc='b', etc. Most likely it'll just fail because your option specification won't match the parse. If a, b, and c are all single options without arguments, then you could put "-abc" on the shebang line. But you can't space them out, and you can't use an option that takes an argument unless the argument is the path of the script, as with the -f option for awk. And you can't mix non-argument with argument options unless the sole argument-taking option comes last. For example, "-abcf".

OS X, by contrast, will field-split the trailing shebang line in the kernel so that the script "#!./show_info -a -b -c" will print out

    argv[0] = './show_info'
    argv[1] = '-a'
    argv[2] = '-b'
    argv[3] = '-c'

Solaris is quirky. It will field-split, but only includes the first field. So "#!./show_info -a -b -c" will print out

    argv[0] = './show_info'
    argv[1] = '-a'

FWIW, OpenBSD 5.5, NetBSD 6.1, and FreeBSD 9.0 all behave like Linux. Which was surprising because I could have sworn that either FreeBSD or NetBSD (or both) would field-split the remainder of the shebang line.

How programs get run

Posted Jan 29, 2015 9:50 UTC (Thu) by drysdale (guest, #95971) [Link] (2 responses)

Thanks for the clarification & comparisons with other OSes -- I should have made clear that the bundling together of arguments into argv[1] means that multiple interpreter arguments basically won't work.

How programs get run

Posted Jan 29, 2015 17:17 UTC (Thu) by vonbrand (subscriber, #4458) [Link]

Please do update the article with this information. It is definitely one to bookmark.

How programs get run

Posted Jan 29, 2015 21:13 UTC (Thu) by wahern (subscriber, #37304) [Link]

FWIW, Linux and OS X are the only systems I'm aware of that permit recursive shebang execution. Some systems, like Free/Net/OpenBSD, will recursively search for the binary interpreter, but they won't stack the paths of the intervening interpreters. Instead the binary interpreter is only passed the original file path. (And any trailing shebang arguments in the scripts seem to get dropped altogether.)

That's not germane to how Linux executes binaries. But I have a feeling this page might end up near the top of the Google results (as all good LWN articles do) for shebang-related queries, so it's worth putting out there.

Because shells parse scripts line-by-line, if you can come up with a construct that is both valid shell code and valid code in your other language, you can mix interpreters portably. For example, the following is a mixed shell/Lua script which will locate a Lua interpreter. Because both the locations _and_ interpreter names of Lua differ across systems, even across Linux distributions, and even for the same version of Lua, you can't use the #!/usr/bin/env trick to run your Lua scripts and expect it to work even remotely reliably.

    #!/bin/sh
    _=[[ # variable assignment in shell, beginning of long string in Lua
    IFS=:
    for D in ${PATH:-$(command -p getconf PATH)}; do
        for F in ${D}/lua*; do
            # check if it's our preferred version
            if ...; then
                exec "${F}" "$0" "$@"
            fi
        done
    done
    printf "%s: unable to locate Lua interpreter\n" "${0##*/}" >&2
    exit 1
    ]]
    -- begin pure Lua code
    print(_VERSION)

I recently published a script, runlua, for portable execution of Lua scripts, which is why all of this stuff is still fresh in my mind.

How programs get run

Posted Jan 29, 2015 21:56 UTC (Thu) by peter-b (guest, #66996) [Link]

GNU Guile has a special "meta switch" which instructs the interpreter to interpret the first few lines of the file -- up to a line containing only "!#" -- as arguments to the interpreter rather than as source code. It seems to work quite well:

    #!/usr/local/bin/guile \
    -e main -s
    !#
    (define (main args)
            (map (lambda (arg) (display arg) (display " "))
                 (cdr args))
            (newline))

How programs get run

Posted Feb 5, 2015 8:50 UTC (Thu) by grawity (subscriber, #80596) [Link]

Sven Mascheck's website has loads of information regarding OS differences in #! handling.

How programs get run

Posted Nov 25, 2019 9:34 UTC (Mon) by Profpatsch (guest, #130533) [Link] (1 responses)

This here is the bible of shebang interpretations: https://www.in-ulm.de/~mascheck/various/shebang/

I have to look through it surprisingly often.

How programs get run

Posted Nov 25, 2019 9:35 UTC (Mon) by Profpatsch (guest, #130533) [Link]

Ah, grawity beat me to it (by about 4 years).

How programs get run

Posted Jan 29, 2015 14:43 UTC (Thu) by jem (subscriber, #24231) [Link]

This reminds me of the simplest way to make a Linux script that outputs some static text:

    #!/usr/bin/tail --lines=+2
    Text goes here.
    More text.
    Last line.

How programs get run

Posted Nov 25, 2019 9:37 UTC (Mon) by Profpatsch (guest, #130533) [Link]

The source code links are broken. This is such a valuable resource, it would be great if they could be rewritten to an archive.org link or something.
