Kernel development [LWN.net]

Kernel release status

The current development kernel is 3.19-rc6, released by Linus Torvalds on January 25. "I currently expect to make an rc7 next week, with the final 3.19 in two weeks, as per the usual schedule."

Stable kernels: The 3.18.4, 3.14.30, and 3.10.66 stable kernels were released on January 27. There are three stable kernels in the review process: 3.18.5, 3.14.31, and 3.10.67. They can be expected on or before January 30.

Quotes of the week

And since I curse at people who ignore regression reports because "it fixes a bug", I should take the time to say how much I liked seeing you explain to the people who reported this regression why it happened and what the thinking was. Now *that* is how things should work. "My bad, this was the background for why it seemed like a good idea".

— Linus Torvalds

In fact, one could argue that in the case of the Internet of Things, the tiniest embedded devices **especially** need secure crypto. It would be.... unfortunate.... if the next time North Korea gets upset at the Great Satan, that all of our light bulbs, refrigerators, cars, heating systems, etc., are subject to attack.

— Ted Ts'o

A crypto module loading vulnerability

By Jake Edge
January 28, 2015

Loading a module into a running kernel is rather invasive, which is part of why the operation is restricted in various ways. For the most part, it requires root privileges (or the CAP_SYS_MODULE capability) to load a module, but there are exceptions. Some modules get automatically loaded when a new piece of hardware is plugged in, a new filesystem type is mounted, or a new kernel cryptographic algorithm is needed. In the latter case, however, unprivileged users could cause any module in the official module directory to get loaded by exploiting a hole in the crypto subsystem—at least until recently.

The problem was actually discovered almost two years ago when Mathias Krause pointed it out in a thread about a similar bug with mount and user namespaces. In that case, the root user in a user namespace (who might be a regular user in the top-level namespace) could mount a filesystem using the -t (type) option and pass any module name as the parameter. That would cause the kernel's module-loading logic to load the module, even if it wasn't a filesystem. Krause noted that the same was true for the crypto subsystem.

The mount problem was fixed shortly after it was reported, but a fix for the crypto bug evidently slipped through the cracks. Neither of these bugs allowed unprivileged users to load arbitrary (i.e. attacker-controlled) modules, which would be a much more severe problem, but even being able to load unexpected modules from the standard location (normally under /lib/modules) can lead to various vulnerabilities including privilege escalation—full system compromise, essentially.

Since kernel modules have intimate access to the kernel, vulnerabilities in modules have been exploited in the past. But if an administrator believes that regular users cannot load certain modules, they may be less inclined to update their kernel to fix a problem in an "irrelevant" module. There is also the risk of unknown or zero-day module vulnerabilities. The risk of any of that may be fairly low, but there are reasons that module loading is restricted to certain users.

The usual fix (and the one that was applied for filesystems) is to prefix the user-supplied module name with a subsystem-specific string; for filesystems, "fs-" was used. All filesystems that can be built as modules got a MODULE_ALIAS() using the prefix. The request_module() call for filesystems was modified to prepend the prefix, which means that the kernel would no longer load just any module as a filesystem type, it would need to actually be a filesystem module.
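The effect of the prefix can be illustrated with a small user-space sketch. This is not the kernel's code: the module table, build_module_alias(), and resolve() are hypothetical stand-ins for the MODULE_ALIAS() entries and the lookup that the module loader performs.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical module table: each module has a name and, if it is a
 * filesystem, an "fs-" alias (standing in for MODULE_ALIAS()). */
struct module_entry {
    const char *name;
    const char *alias;
};

static const struct module_entry modules[] = {
    { "vfat",      "fs-vfat" },
    { "ext4",      "fs-ext4" },
    { "usbserial", NULL },      /* not a filesystem: no fs- alias */
};

/* Build the string that would be handed to the module loader.
 * Returns 0 on success, -1 if the result does not fit in out. */
static int build_module_alias(char *out, size_t outlen,
                              const char *prefix, const char *user_name)
{
    int n = snprintf(out, outlen, "%s%s", prefix, user_name);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Look a request up by module name or alias; NULL means "not found". */
static const char *resolve(const char *request)
{
    size_t i;

    for (i = 0; i < sizeof(modules) / sizeof(modules[0]); i++) {
        if (strcmp(modules[i].name, request) == 0)
            return modules[i].name;
        if (modules[i].alias && strcmp(modules[i].alias, request) == 0)
            return modules[i].name;
    }
    return NULL;
}
```

With the prefix in place, a request built from "usbserial" becomes "fs-usbserial", which matches nothing in the table, while "fs-vfat" still resolves to the vfat module.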

Obviously, the same scheme can be applied to the crypto subsystem, but there were a few wrinkles, as Krause outlined in a G+ post. First off, the vulnerability is a bit more easily accessed than the mount and user-namespace flaw. Any user-space program that binds to an AF_ALG socket (which provides a socket-based user-space interface to the crypto subsystem) can specify the type of cipher that it wishes to use. If that cipher is not present, the kernel will try to load a module of that name. Since there are no restrictions on the name, any module name can be passed and it will be loaded.
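For readers who have not used AF_ALG, a minimal sketch of the bind step follows. The helper names fill_alg_addr() and bind_alg() are made up for illustration; struct sockaddr_alg comes from <linux/if_alg.h>. The algorithm string placed in salg_name is what the kernel looks up, and it is this string that, on an unfixed kernel, could name an arbitrary module.

```c
#include <errno.h>
#include <linux/if_alg.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill in an AF_ALG address. "type" is a class such as "skcipher" or
 * "hash"; "name" is the algorithm string, e.g. "ecb(aes)". */
static void fill_alg_addr(struct sockaddr_alg *sa,
                          const char *type, const char *name)
{
    memset(sa, 0, sizeof(*sa));
    sa->salg_family = AF_ALG;
    strncpy((char *)sa->salg_type, type, sizeof(sa->salg_type) - 1);
    strncpy((char *)sa->salg_name, name, sizeof(sa->salg_name) - 1);
}

/* Returns a bound socket descriptor, or -1 with errno set on failure. */
static int bind_alg(const char *type, const char *name)
{
    struct sockaddr_alg sa;
    int fd = socket(AF_ALG, SOCK_SEQPACKET, 0);

    if (fd < 0)
        return -1;
    fill_alg_addr(&sa, type, name);
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        int saved_errno = errno;

        close(fd);
        errno = saved_errno;
        return -1;
    }
    return fd;
}
```

No privileges are needed for any of this, which is why the crypto variant of the bug was easier to reach than the mount one.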

Modifying all of the cryptographic algorithms that can be built as modules to add a "crypto-" prefix, while changing the crypto module–loading code to do the same, is the obvious path forward. Kees Cook made that change, but there was a problem with the fix. Crypto ciphers can also specify a mode, so the AES cipher in electronic codebook (ECB) mode would be specified as "ecb(aes)". Cook's original fix would allow any module to be specified for the mode (e.g. "vfat(aes)") and it would get loaded.

That led to a second patch from Cook, but that was missing some needed crypto module aliases. A patch from Krause added the necessary aliases.

But there was still one more (non-kernel) bug that was found in this process. The kernel turns the module-loading job over to the modprobe user-space utility, which finds the module file, reads it in, and uses the init_module() system call to add it to the kernel. As it turns out, Krause was using a BusyBox-based system to test the patches. He discovered that BusyBox's modprobe effectively uses the basename of the module name passed to it. That means everything up through the final "/" is ignored. A request for "/vfat" gets turned into a request for "crypto-/vfat", but the BusyBox modprobe ignores the first part and happily loads the vfat module—which takes us back to square one.
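The basename behavior is easy to reproduce. The function below is only a sketch of the effect Krause describes, not BusyBox's actual code, and strip_to_basename() is a hypothetical name.

```c
#include <string.h>

/* Keep only what follows the final '/', discarding everything before
 * it, including any subsystem prefix the kernel prepended. */
static const char *strip_to_basename(const char *name)
{
    const char *slash = strrchr(name, '/');

    return slash ? slash + 1 : name;
}
```

Given "crypto-/vfat", the function returns "vfat", so the prefix that was supposed to restrict the request has no effect; a name without a slash, such as "crypto-aes", passes through unchanged.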

That problem affected more than just crypto (in fact, Krause doesn't mention crypto in the bug report, presumably because Cook's patches had not yet been merged). He noted two other commands that would load modules when they shouldn't:

    # ifconfig /usbserial up
    # mount -t /snd_pcm none /

In both cases a prefix is used ("netdev-" and "fs-", respectively) to avoid this kind of problem, but BusyBox effectively ignored them. BusyBox maintainer Denys Vlasenko fixed the bug one day after Krause reported it. There were some fits and starts along the way, but those bugs are fixed now, as Krause noted:

So, all in all, this initial remark on an otherwise unrelated LKML thread [led] to an incomplete fix that, while being tested, uncovered its incompleteness and yet another bug in a completely different code base. Nice bug smashing, I would say ;)

The kernel bug has been around since 2011, when the AF_ALG interface to the crypto subsystem was introduced in 2.6.38. The bugs were assigned three separate CVE numbers: CVE-2013-7421 for the original bug Krause pointed out in 2013, CVE-2014-9644 for the "vfat(aes)" variation, and CVE-2014-9645 for the BusyBox modprobe bug. The kernel fixes are included in the mainline and will be released with 3.19; backports to the stable kernels may be coming as well.

While it is not a particularly critical bug, letting unprivileged users mess with kernel internals is certainly something to be avoided. But it has languished for nearly two years since its discovery, which is kind of surprising. Disabling module loading (either at boot or in the kernel config) is one fairly easy mitigation technique, though it may not be an option for some types of systems (especially desktops and laptops).

How programs get run

January 28, 2015

This article was contributed by David Drysdale

This is the first in a pair of articles that describe how the kernel runs programs: what happens under the covers when a user program invokes the execve() system call? I recently worked on the implementation of a new execveat() system call, which is a close variant of execve() that allows the caller to specify the invoked program by a combination of file descriptor and path, as with other *at() system calls. (This will, in turn, enable an implementation of the fexecve() library function that doesn't rely on access to the /proc filesystem, which is important for sandboxed environments such as Capsicum.)

Along the way, I explored the existing execve() implementation, and so these articles present the details of that functionality. In this one, we'll focus on the general mechanisms that the kernel uses for program invocation, which allow for different program formats; the second article will focus on the details of running ELF binaries.

The view from user space

Before diving into the kernel, we'll start by exploring the behavior of program execution from user space (there's also a good description of this behavior in chapter 27 of The Linux Programming Interface). For Linux versions up to and including 3.18, the only system call that invokes a new program is execve(), which has the following prototype:

    int execve(const char *filename, char *const argv[], char *const envp[]);

The filename argument specifies the program to be executed, and the argv and envp arguments are NULL-terminated lists that specify the command line arguments and environment variables for the new program. A simple skeleton driver program (do_execve.c) allows us to explore how this behaves, by feeding in "zero", "one", "two" as arguments and "ENVVAR1=1", "ENVVAR2=2" as environment variables. To see the result in the invoked program, we use another simple program (show_info.c) that just outputs its command-line arguments (argv) and environment (environ).
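The do_execve.c source is not reproduced in this text, so the following is only a guess at its core, written as a helper function rather than a complete program. The name exec_with_fixed_args() is hypothetical; the error message format is modeled on the failure output shown later in the article.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Invoke "path" with fixed arguments and environment. On success this
 * never returns; on failure it reports the error and returns -1 with
 * errno preserved. */
static int exec_with_fixed_args(const char *path)
{
    char *const argv[] = { "zero", "one", "two", NULL };
    char *const envp[] = { "ENVVAR1=1", "ENVVAR2=2", NULL };
    int saved_errno;

    execve(path, argv, envp);

    /* Only reached if execve() failed. */
    saved_errno = errno;
    fprintf(stderr, "Failed to execute '%s', %s\n",
            path, strerror(saved_errno));
    errno = saved_errno;
    return -1;
}
```

A main() that passes its first command-line argument to this function would give the "./do_execve ./show_info" usage seen below. Note that argv[0] is deliberately "zero" rather than the program name.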

Putting these together gives the expected result — the arguments and environment are passed through to the invoked program. Notice, though, that the argv[0] for the invoked binary is just the value specified by the caller of execve(); having the program's name in argv[0] isn't a convention that's required or policed by execve() itself, at least for binaries.

    % ./do_execve ./show_info
    argv[0] = 'zero'
    argv[1] = 'one'
    argv[2] = 'two'
    ENVVAR1=1
    ENVVAR2=2

Things change slightly when the program being invoked is a script rather than a binary program. To explore this, we use a shell script equivalent (show_info.sh) of our environment-outputting program; putting this together with the original program that invokes execve(), we see a couple of differences:

    % ./do_execve ./show_info.sh
    $0 = './show_info.sh'
    $1 = 'one'
    $2 = 'two'
    ENVVAR1=1
    ENVVAR2=2
    PWD=/home/drysdale/src/lwn/exec

First, the environment has gained an extra PWD value, indicating the current directory. Secondly, the initial argument to the script is now the script filename, rather than the "zero" value that the invoker specified. A further experiment reveals that the /bin/sh script interpreter added the PWD environment variable, but the kernel itself modified the arguments:

    % cat ./wrapper
    #!./show_info
    % ./do_execve ./wrapper
    argv[0] = './show_info'
    argv[1] = './wrapper'
    argv[2] = 'one'
    argv[3] = 'two'
    ENVVAR1=1
    ENVVAR2=2

More specifically, the kernel has removed the first ("zero") argument and replaced it with two arguments — the name of the script interpreter program (taken from the first line of the script) and the name of the invoked file (which holds the script text). If the first line of the script also includes command-line arguments for the interpreter (for example, awk needs an -f option to treat its input as a filename rather than script text), a third extra argument is also inserted, holding all of the extra options:

    % cat ./wrapper_args
    #!./show_info -a -b -c
    % ./do_execve ./wrapper_args
    argv[0] = './show_info'
    argv[1] = '-a -b -c'
    argv[2] = './wrapper_args'
    argv[3] = 'one'
    argv[4] = 'two'
    ENVVAR1=1
    ENVVAR2=2

Up to a point, we can also repeat this pop-one, push-two alteration of the arguments, by invoking scripts that wrap scripts and so on; each such alteration effectively pushes the wrapper script name in at argv[1]:

    argv[0]: 'zero'=>'./wrapper4'=>'./wrapper3'=>'./wrapper2'=>'./wrapper' =>'./show_info'
    argv[1]: 'one'   './wrapper5'  './wrapper4'  './wrapper3'  './wrapper2'  './wrapper'
    argv[2]: 'two'   'one'         './wrapper5'  './wrapper4'  './wrapper3'  './wrapper2'
    argv[3]:         'two'         'one'         './wrapper5'  './wrapper4'  './wrapper3'
    argv[4]:                       'two'         'one'         './wrapper5'  './wrapper4'
    argv[5]:                                     'two'         'one'         './wrapper5'
    argv[6]:                                                   'two'         'one'
    argv[7]:                                                                 'two'

However, this doesn't continue forever — once there are too many levels of wrappers, the process fails with ELOOP:

    % ./do_execve ./wrapper6
    Failed to execute './wrapper6', Too many levels of symbolic links

Into the kernel: `struct linux_binprm`

Now we move into kernel space and begin delving into the code that implements the execve() system call. A previous article explored the general system call machinery (and the special wrinkles needed for execve()), so we can pick up the story at the do_execve_common() function in fs/exec.c. The main purpose of the code in this function is to build a new struct linux_binprm instance that describes the current program invocation operation. In the structure:

The file field is set to a freshly opened struct file for the program being invoked; this allows the kernel to read the file contents and decide how to handle the file.
The filename and interp fields are both set to the name of the file holding the program; we'll see later why there are two distinct fields here.
The bprm_mm_init() function allocates and sets up the associated struct mm_struct and struct vm_area_struct data structures in preparation for managing the virtual memory of the new program. In particular, the new program's virtual memory ends at the highest possible address for the architecture; its stack will grow downward from there.
The p field is set to point at the end of memory space for the new program, but leaves space for a NULL pointer as an end marker for the stack. The value of p will be updated (downward) as more information is added to the new program's stack.
The argc and envc fields are set to hold the counts of arguments and environment values so that this information can be propagated to the new program later in the invocation process.
The unsafe field is set up to hold a bitmask of reasons why the program execution might not be safe; for example, if the process is being traced with ptrace() or has the PR_SET_NO_NEW_PRIVS bit set. The Linux Security Module (LSM) may subsequently use this information to deny the program execution operation.
The cred field is a separately allocated object of type struct cred that holds information about the credentials for the new program. These are generally inherited from the process that called execve(), but are updated to allow for setuid / setgid bits and other complications. The presence of setuid/setgid bits also disallows a collection of compatibility features because they have an adverse effect on security; the per_clear field records the bits in the process's personality that will be cleared later.
The security field allows an LSM to store LSM-specific information with the linux_binprm; the LSM is notified via a call to security_bprm_set_creds() and the bprm_set_creds LSM hook. The default implementation of this hook updates the new program's Linux capabilities to allow for the file capabilities of the program file; other LSM implementations chain this behavior into their own implementations of the hook (e.g. Smack, SELinux).
The buf scratch space is filled with the first chunk (128 bytes) of data from the program file. This data will be used later to detect the binary format so it can be processed appropriately.

The parts of this setup process that depend on the particular file that's being executed are performed in an inner prepare_binprm() function; this function can be called again later to update those fields if a different file (e.g. a script interpreter) is actually run.

Finally, information about the program invocation is copied into the top of the new program's stack, using the local copy_strings() and copy_strings_kernel() utility functions. First, the program filename is pushed to the stack (and its location is saved in the exec field of the linux_binprm instance), followed by all of the environment values, then by all of the arguments. At the end of this process, the stack looks like:

    ---------Memory limit---------
    NULL pointer
    program_filename string
    envp[envc-1] string
    ...
    envp[1] string
    envp[0] string
    argv[argc-1] string
    ...
    argv[1] string
    argv[0] string
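The downward-growing layout can be modeled in user space with a small buffer. This sketch is an illustration only: struct fake_stack and its helpers are hypothetical, and the real copy_strings() works on pages of the new process's memory rather than a flat array. The field p plays the role of the p field in struct linux_binprm.

```c
#include <stddef.h>
#include <string.h>

struct fake_stack {
    char mem[256];  /* stands in for the top of the new program's stack */
    size_t p;       /* offset of the current stack top; moves downward */
};

static void stack_init(struct fake_stack *s)
{
    memset(s->mem, 0, sizeof(s->mem));
    /* Leave room at the very top for a NULL pointer end marker. */
    s->p = sizeof(s->mem) - sizeof(void *);
}

/* Copy one NUL-terminated string just below p.
 * Returns its offset, or (size_t)-1 if there is no room. */
static size_t push_string(struct fake_stack *s, const char *str)
{
    size_t len = strlen(str) + 1;

    if (len > s->p)
        return (size_t)-1;
    s->p -= len;
    memcpy(s->mem + s->p, str, len);
    return s->p;
}

/* Push a list from last entry to first, so that entry 0 ends up at the
 * lowest address. Returns 0 on success, -1 if the stack is full. */
static int push_strings(struct fake_stack *s, int count,
                        const char *const *strs)
{
    int i;

    for (i = count - 1; i >= 0; i--)
        if (push_string(s, strs[i]) == (size_t)-1)
            return -1;
    return 0;
}
```

Pushing the filename, then the environment, then the arguments leaves argv[0] at offset p, with argv[1] immediately above it, matching the diagram.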

Binary format handler iteration: `struct linux_binfmt`

With a complete struct linux_binprm in hand, the real business of program execution is performed in exec_binprm() and (more importantly) search_binary_handler(). This code iterates over a list of struct linux_binfmt objects, each of which provides a handler for a particular format of binary programs. A binary handler could potentially be defined in a kernel module, so the code calls try_module_get() for each format to ensure the relevant code can't be unloaded by another task while it's being used here.

For each struct linux_binfmt handler object, the load_binary() function pointer is called, passing in the linux_binprm object. If the handler code supports the binary format, it does whatever is needed to prepare the program for execution and returns success (>= 0). Otherwise, the handler returns a failure code (< 0) and iteration continues with the next handler.

Execution of a particular program may itself rely on execution of a different program; the obvious example is executable scripts, which need to invoke the script interpreter. To cope with this, the search_binary_handler() code can be called recursively, re-using the struct linux_binprm object. However, recursion depth is limited to prevent infinite recursion, giving the ELOOP error behavior seen earlier.
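The shape of this loop, including the recursion limit, can be sketched in user space. Everything here is a simplified model rather than kernel code: the structures carry only the fields the sketch needs, the "script" handler just strips "#!" and recurses, and the limit of 5 is an assumption chosen to match the wrapper5/wrapper6 behavior observed earlier. The sketch moves on to the next handler only when one returns -ENOEXEC.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 5     /* assumed limit for this sketch */

struct binprm {
    const char *buf;        /* start of the "file" being examined */
    int recursion_depth;
};

struct binfmt {
    int (*load_binary)(struct binprm *bprm);
};

static int search_handler(struct binprm *bprm);

/* Toy script handler: a leading "#!" means "run what follows". */
static int load_script(struct binprm *bprm)
{
    if (strncmp(bprm->buf, "#!", 2) != 0)
        return -ENOEXEC;
    bprm->buf += 2;
    return search_handler(bprm);    /* recurse for the interpreter */
}

/* Toy ELF handler: recognizes the magic bytes and declares success. */
static int load_elf(struct binprm *bprm)
{
    if (strncmp(bprm->buf, "\177ELF", 4) != 0)
        return -ENOEXEC;
    return 0;
}

static const struct binfmt formats[] = {
    { load_script },
    { load_elf },
};

static int search_handler(struct binprm *bprm)
{
    size_t i;

    if (bprm->recursion_depth > MAX_DEPTH)
        return -ELOOP;

    for (i = 0; i < sizeof(formats) / sizeof(formats[0]); i++) {
        int ret;

        bprm->recursion_depth++;
        ret = formats[i].load_binary(bprm);
        bprm->recursion_depth--;
        if (ret != -ENOEXEC)
            return ret;     /* handled, or failed for another reason */
    }
    return -ENOEXEC;
}
```

In this model, five nested "#!" layers in front of an ELF header succeed, while a sixth produces -ELOOP.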

The system's LSM also gets a say in the operation; before the iteration over binary formats starts, the bprm_check_security LSM hook is triggered, allowing the LSM to make a decision on whether to allow the operation. To do so, it may use the state it stored in the linux_binprm.security field earlier.

At the end of the iteration, if no formats that can handle the program have been found (and the program appears to be binary rather than text, at least according to the first four bytes), then the code will also attempt to load a module named "binfmt-XXXX", where XXXX is the hex value of bytes three and four in the program file. This is an old mechanism (added in 1996 for Linux 1.3.57) to allow for a more dynamic way of associating binary format handlers with formats; the more recent binfmt_misc mechanism (described below) allows a more flexible way of doing something similar.
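A user-space sketch of how such a module name could be derived is shown below. The helper names are hypothetical, and two details are assumptions rather than facts taken from the text above: the exact definition of "printable" (tab, newline, or 0x20 through 0x7e), and the use of the host's byte order when turning the two bytes into a number.

```c
#include <stdio.h>
#include <string.h>

/* Assumed notion of "text" for this sketch. */
static int looks_printable(unsigned char c)
{
    return c == '\t' || c == '\n' || (c >= 0x20 && c <= 0x7e);
}

/* If the first four bytes look binary, write "binfmt-XXXX" into out and
 * return 1; if they all look like text, return 0 and leave out alone. */
static int binfmt_module_name(const unsigned char *buf,
                              char *out, size_t outlen)
{
    unsigned short magic;

    if (looks_printable(buf[0]) && looks_printable(buf[1]) &&
        looks_printable(buf[2]) && looks_printable(buf[3]))
        return 0;

    /* Bytes three and four of the file, read in host byte order. */
    memcpy(&magic, buf + 2, sizeof(magic));
    snprintf(out, outlen, "binfmt-%04x", (unsigned int)magic);
    return 1;
}
```

A file beginning with the bytes 00 00 01 01 would lead to a request for "binfmt-0101", whereas a file beginning with "#!/b" would not trigger a request at all.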

Binary formats

So what are the binary formats available in the standard kernel? A search for code that registers instances of struct linux_binfmt (via register_binfmt() and insert_binfmt()) gives us quite a collection of possible formats, all of which are configured and explained in the fs/Kconfig.binfmts file:

binfmt_script.c: Support for interpreted scripts, starting with a #! line.
binfmt_misc.c: Support miscellaneous binary formats, according to runtime configuration.
binfmt_elf.c: Support for ELF format binaries.
binfmt_aout.c: Support for traditional a.out format binaries.
binfmt_flat.c: Support for flat format binaries.
binfmt_em86.c: Support for Intel ELF binaries running on Alpha machines.
binfmt_elf_fdpic.c: Support for ELF FDPIC binaries.
binfmt_som.c: Support for SOM format binaries (an HP/UX PA-RISC format).

(plus a couple of other architecture-specific formats).

The next sections will examine the most important of these: interpreted scripts and the "miscellaneous" mechanism for supporting arbitrary formats; the next article will examine the ELF binary format — which is typically where all program execution ends up.

Script invocation: `binfmt_script.c`

Files that start with the character sequence #! (and have the execute bit set) are treated as scripts, handled by the fs/binfmt_script.c handler. After checking those first two bytes, this code parses the rest of the script-invocation line, splitting it into an interpreter name (everything after #! up to the first white space) and possible arguments (everything else up to the end of the line, stripping external white space).

(One detail to note: back when the struct linux_binprm object was created, only the first 128 bytes of the program were retrieved. This means that if the interpreter name and arguments are longer than this, the results will be truncated.)
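The parsing step might look roughly like the following. This is a sketch written for this explanation, not a copy of fs/binfmt_script.c; parse_shebang() is a hypothetical name, and the 128-byte size is taken from the description of the buf field above.

```c
#include <string.h>

#define BINPRM_BUF_SIZE 128

/* Parse a "#!" line in place. On success, *interp points at the
 * interpreter name and *args at the remaining text (or NULL if there
 * is none). Returns 0 on success, -1 if buf is not a usable script. */
static int parse_shebang(char buf[BINPRM_BUF_SIZE],
                         char **interp, char **args)
{
    char *cp;

    if (buf[0] != '#' || buf[1] != '!')
        return -1;

    /* Anything beyond the buffer is lost: terminate at the first
     * newline, or at the end of the buffer if there is none. */
    buf[BINPRM_BUF_SIZE - 1] = '\0';
    cp = strchr(buf, '\n');
    if (cp == NULL)
        cp = buf + BINPRM_BUF_SIZE - 1;
    *cp = '\0';

    /* Strip trailing white space. */
    while (cp > buf) {
        cp--;
        if (*cp == ' ' || *cp == '\t')
            *cp = '\0';
        else
            break;
    }

    /* Skip white space between "#!" and the interpreter name. */
    for (cp = buf + 2; *cp == ' ' || *cp == '\t'; cp++)
        ;
    if (*cp == '\0')
        return -1;
    *interp = cp;

    /* The name ends at the first white space; the rest is one argument. */
    for (; *cp && *cp != ' ' && *cp != '\t'; cp++)
        ;
    *args = NULL;
    if (*cp) {
        *cp++ = '\0';
        while (*cp == ' ' || *cp == '\t')
            cp++;
        if (*cp)
            *args = cp;
    }
    return 0;
}
```

Note that everything after the interpreter name is kept as a single string, which is why "-a -b -c" showed up as one argument in the earlier experiment.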

With these in hand, the code then removes argv[0] from the top of the new program's stack (i.e. at the lowest address), and in its place pushes the following, adjusting the argc value in the linux_binprm object along the way:

the program name
(optionally) the collected interpreter arguments
the name of the interpreter program

Taken together, this explains the user space behavior we observed at the beginning of the article; our new program's stack is modified to look like:

    ---------Memory limit---------
    NULL pointer
    program_filename string
    envp[envc-1] string
    ...
    envp[1] string
    envp[0] string
    argv[argc-1] string
    ...
    argv[1] string
    program_filename string
    ( interpreter_args )
    interpreter_filename string

The code also changes the interp value in the linux_binprm structure so that it references the interpreter filename, rather than the script filename. This explains why the linux_binprm structure refers to two strings: one (interp) is the program that we currently want to execute, and one is the name (filename) that was originally invoked in the execve() call. Along similar lines, the file field in the linux_binprm is also updated to reference the new interpreter program, and the first 128 bytes of its contents read into the buf scratch space.

The script handler code then recurses into search_binary_handler() to repeat the whole process for the script interpreter program. If the interpreter is itself a script, then the interp value will be changed once again but the filename will stay unchanged.
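The argument rewriting can be modeled on a plain array of strings. The function below is a hypothetical illustration (the kernel manipulates the new program's stack, not an array), but it produces the same ordering as the user-space experiments earlier in the article.

```c
#include <stddef.h>

#define MAX_ARGS 16

/* Rewrite argv in place the way the script handler does: drop argv[0],
 * then put the interpreter, its optional argument string, and the
 * script filename in front of what remains.
 * Returns the new argc, or -1 if the result would not fit. */
static int rewrite_for_script(const char *argv[MAX_ARGS], int argc,
                              const char *interp,
                              const char *interp_args,
                              const char *script)
{
    int extra = interp_args ? 3 : 2;
    int new_argc = argc - 1 + extra;
    int i, n = 0;

    if (argc < 1 || new_argc > MAX_ARGS)
        return -1;

    /* Move argv[1..] upward, highest index first, to make room. */
    for (i = argc - 1; i >= 1; i--)
        argv[i - 1 + extra] = argv[i];

    argv[n++] = interp;
    if (interp_args)
        argv[n++] = interp_args;
    argv[n] = script;
    return new_argc;
}
```

Applying it to { "zero", "one", "two" } with interpreter "./show_info", arguments "-a -b -c", and script "./wrapper_args" yields the five-entry list seen in the wrapper_args experiment.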

Miscellaneous interpreter detection: `binfmt_misc.c`

We saw previously that early versions of the Linux kernel supported a rough-and-ready way of dynamically adding format support, by hunting for a kernel module with a name containing the early bytes of the binary. That's not particularly convenient — only searching on a couple of bytes is very limited (compare the vast range of detection signatures that the file command uses) and requiring a kernel module raises the barrier to entry.

The miscellaneous binary format handler allows a more flexible and dynamic method of dealing with new formats, by allowing run-time configuration (via a special filesystem mounted under /proc/sys/fs/binfmt_misc) to specify:

How to recognize a supported format, based on filename extension or a magic value at a particular offset. (As with parsing script interpreters, this magic value has to fall within the first 128 bytes of the program file.)
The interpreter program to invoke, which will get the program filename passed to it as argv[1] (as with script invocation).

A good example of the miscellaneous format handler in use is for Java files: detect .class files (based on their 0xCAFEBABE prefix) or .jar files (based on the .jar extension) and automatically invoke the JVM executable on them. This will require a wrapper script to provide the relevant command-line arguments, as the miscellaneous configuration doesn't allow arguments to be specified — which means that the miscellaneous handler will invoke the script handler, which will then invoke the ELF handler for the JVM executable (and which will probably in turn invoke the dynamic linker ld.so, although that's a somewhat different story).
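The two recognition methods can be sketched as a single matching function. The structure and function names here are invented for illustration and do not mirror the kernel's internal types; the optional mask is an assumption about how a magic comparison can be made to ignore some bits. The check against 128 reflects the buffer limit mentioned above.

```c
#include <stddef.h>
#include <string.h>

#define BINPRM_BUF_SIZE 128

/* Hypothetical description of one configured format. */
struct misc_entry {
    size_t offset;                  /* where the magic value starts */
    size_t size;                    /* length of the magic value */
    const unsigned char *magic;
    const unsigned char *mask;      /* NULL means compare every bit */
    const char *extension;          /* non-NULL: match on extension */
    const char *interpreter;
};

/* Returns 1 if the file (first bytes in buf, name in filename) matches
 * the entry, 0 otherwise. */
static int entry_matches(const struct misc_entry *e,
                         const unsigned char *buf, const char *filename)
{
    size_t i;

    if (e->extension) {
        const char *dot = strrchr(filename, '.');

        return dot != NULL && strcmp(dot + 1, e->extension) == 0;
    }

    /* The magic value must lie entirely inside the buffered bytes. */
    if (e->offset + e->size > BINPRM_BUF_SIZE)
        return 0;

    for (i = 0; i < e->size; i++) {
        unsigned char m = e->mask ? e->mask[i] : 0xff;

        if ((buf[e->offset + i] & m) != (e->magic[i] & m))
            return 0;
    }
    return 1;
}
```

An entry with the four magic bytes CA FE BA BE at offset 0 would match a Java class file, and a second entry with extension "jar" would match any file whose name ends in ".jar", regardless of content.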

Internally, the kernel implementation for this format is similar to the handler for script programs described above, except that there is an initial search for a matching configuration entry, and that configuration is used to make some of the details (such as removing argv[0]) optional.

The format handlers for both scripts and miscellaneous formats recurse on to attempt to invoke the interpreter program that is needed for that particular format. This recursion has to end at some point, and on a modern Linux system this is almost always at an ELF binary program — the subject of the next article — stay tuned.

Patches and updates

Linus Torvalds Linux 3.19-rc6

Greg KH Linux 3.18.4

Greg KH Linux 3.14.30

Greg KH Linux 3.10.66

Bryan O'Donoghue x86: Add IMR support to Quark/Galileo

Ross Zwisler add support for new persistent memory instructions

James Hogan Add MIPS CDMM bus support

Vikas Shivappa x86: Intel Cache Allocation Support

Paul Moore Overhaul the audit filename handling

Andrey Ryabinin Kernel address sanitizer - runtime memory debugger.

Steven Rostedt tracing: Add new file system tracefs

Matt Fleming perf: Intel Cache QoS Monitoring support

Tejun Heo bitmap, cpumask, nodemask: implement %*pb[l] to format bitmaps directly from printf family of functions

Jiri Olsa perf tools: New build framework

Javi Merino Add array printing helpers to ftrace

Tina Ruchandani trace: Use 64-bit timekeeping

Alexei Starovoitov tracing: attach eBPF programs to tracepoints/syscalls/kprobe

Xunlei Pang drivers/rtc/interface.c: Update code to use y2038-safe time interfaces

Geert Uytterhoeven drivers: bus: Add Simple Power-Managed Bus

Javier Martinez Canillas platform/chrome: Add user-space dev inferface support

Shuah Khan media: au0828 - convert to use videobuf2

Stathis Voukelatos net: Linn Ethernet Packet Sniffer driver

Heikki Krogerus usb: ulpi bus

Scot Doyle fbcon: user-defined cursor blink interval

Sudeep Dutt misc: mic: SCIF driver

Olliver Schinagl Let leds use named gpios

Chanwoo Choi [PATCH v10 0/7] devfreq: Add devfreq-event class to provide raw data for devfreq device

Xander Huff driver core: add device_poll interface

Heinrich Schuchardt ioctl-fat.2: new manpage for the ioctl fat API

Namjae Jeon fs: Introduce FALLOC_FL_INSERT_RANGE for fallocate

Calvin Owens [RFC][PATCH] procfs: Always expose /proc/<pid>/map_files/ and make it readable

Jeff Layton locks: saner method for managing file locks

Ebru Akagunduz mm: incorporate read-only pages into transparent huge pages

Christoph Lameter Slab allocator array operations

Toshi Kani Kernel huge I/O mapping support

Joerg Roedel iommu: Move domain allocation into drivers

Joerg Roedel iommu: Introduce default domains for iommu groups

Kirill A. Shutemov Introduce <linux/mm_struct.h>

Vladimir Davydov slub: make dead caches discard free slabs immediately

David Decotigny net: mlx4: use new ETHTOOL_G/SSETTINGS API

Stephen Smalley Add security hooks to binder and implement the hooks for SELinux.

Luis R. Rodriguez x86/xen: add xen hypercall preemption

al.stone@linaro.org Start deprecating _OSI on new architectures

Tom Zanussi tinification: Make memory-access char devices optional
