Kernel development

Brief items

Kernel release status

The current development kernel is 4.0-rc2, released on March 3. Linus said:

So rc2 missed the usual Sunday afternoon timing, because I spent most of the weekend debugging an issue that happened on an old Mac Mini I have around, and I hate making even early -rc releases with problems on machines that I have direct access to. Even if it only affected old machines that actual developers are unlikely to have or at least use.

Today I got the patch from Daniel Vetter to fix it, so instead of doing a Sunday evening rc2, it's a Tuesday morning one. Go get it. It works better for the delay.

Stable updates: 3.18.8, 3.14.34, and 3.10.70 were released on February 26. The 3.19.1, 3.18.9, 3.14.35, and 3.10.71 updates are in the review process as of this writing; they can be expected on or after March 6.

Quotes of the week

Say Y here if you want your system to crash and hang more often.
Say N if you want a sane system.
Paul McKenney sells a new RCU feature

Lawyers don't seem to believe in #include <legalese.h>
Alan Cox

This hocus pocus coding is just going to lead us down the path of the black arts. I already have a black cat, so I'm good to go.
Steven Rostedt

Yes, I googled "worlds fastest snail" just to make sure - I can take on any snail and win, dammit.
Linus Torvalds

The kernel team is full of brilliant, friendly folks, far more than it is filled with bearded grognards looking for a fight.
SD Times

Kernel development news

Memory management when failure is not an option

By Jonathan Corbet
March 4, 2015

Last December, a discussion of system stalls related to low-memory situations led to the revelation that small memory allocations never fail in the kernel. Since then, the discussion on how to best handle low-memory situations has continued, focusing in particular on situations where the kernel cannot afford to let a memory allocation fail. That discussion has exposed some significant differences of opinion on how memory allocation should work in the kernel.

Some introductory concepts

The kernel's memory-management subsystem is charged with ensuring that memory is available when it is needed by either the kernel or a user-space process. That job is easy when a lot of memory is free, but it gets harder once memory fills up — as it inevitably does. When memory gets tight and somebody is requesting more, the kernel has a couple of options: (1) free some memory currently in use elsewhere, or (2) deny (fail) the allocation request.

The process of freeing (or "reclaiming") memory may involve writing the current contents of that memory to persistent storage. That, in turn, involves calling into the filesystem or block I/O code. But if any of those subsystems are, in fact, the source of the allocation request, calling back into them can lead to deadlocks and other unfortunate situations. For that reason (among others), allocation requests carry a set of flags describing the actions that can be performed in the handling of the request. The two flags of interest in this article are GFP_NOFS (calls back into filesystems are not allowed), and GFP_NOIO (no type of I/O can be started). The former inhibits attempts to write dirty pages back to files on disk; the latter can block activity like writing pages to swap.
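The effect of those two flags can be illustrated with a small Python model. This is a sketch only: the bit values are arbitrary placeholders, the way the masks are composed is assumed to mirror the kernel headers of the 3.19/4.0 era, and allowed_reclaim_actions() is a hypothetical helper that greatly simplifies what reclaim really does.

```python
from enum import IntFlag


class Gfp(IntFlag):
    # Bit values are arbitrary placeholders, not the kernel's real ones.
    WAIT = 0x1   # the caller may sleep while memory is reclaimed
    IO = 0x2     # reclaim may start block I/O (for example, swap writes)
    FS = 0x4     # reclaim may call back into filesystem code


# Compositions assumed to mirror include/linux/gfp.h of that era.
GFP_NOIO = Gfp.WAIT
GFP_NOFS = Gfp.WAIT | Gfp.IO
GFP_KERNEL = Gfp.WAIT | Gfp.IO | Gfp.FS


def allowed_reclaim_actions(gfp_mask):
    """Return the reclaim actions a request with this mask permits."""
    actions = ["drop clean pages"]
    if gfp_mask & Gfp.IO:
        actions.append("write pages to swap")
    if gfp_mask & Gfp.FS:
        actions.append("write dirty pages back to files")
    return actions


for name, mask in (("GFP_NOIO", GFP_NOIO), ("GFP_NOFS", GFP_NOFS),
                   ("GFP_KERNEL", GFP_KERNEL)):
    print(name, "->", allowed_reclaim_actions(mask))
```

The more bits a caller clears, the shorter the list of things reclaim may try, which is why constrained requests are the ones most likely to find no memory at all.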

Obviously, the more constrained the memory-management subsystem is, the higher the chances of it being unable to satisfy an allocation request at all. Kernel developers have long been told that (almost) any allocation request can fail; as a result, the kernel is full of error-handling paths meant to deal with that eventuality. But it became clear recently that the memory-management code does not actually allow smaller requests to fail; it will, instead, loop indefinitely trying to free some memory. That behavior has been seen to lead occasionally to locked-up systems, despite the fact that the code involved is prepared to deal with allocation failures. The "too small to fail" behavior is controversial, but would prove hard to change at this point.

There are, however, places in the kernel that are simply unprepared to deal with allocation failures, usually because the allocation happens deep within a complex series of operations that would be difficult to unwind. The __GFP_NOFAIL flag exists to explicitly state that failure is not an option for a given request, though its use is heavily discouraged.
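The difference between the implicit and the explicit form of "cannot fail" can be modeled in a few lines of Python. This is a simplified sketch, not the allocator's real logic: the size threshold, the flag value, and the names should_retry() and allocate() are assumptions made for illustration, and the max_steps limit exists only so that the demonstration terminates.

```python
PAGE_ALLOC_COSTLY_ORDER = 3   # assumed threshold: up to 2**3 = 8 pages
GFP_NOFAIL = 0x1              # placeholder bit, not the kernel's value


def should_retry(order, gfp_mask):
    """Decide whether a failed allocation attempt is retried."""
    if gfp_mask & GFP_NOFAIL:
        return True     # explicit: the caller cannot handle failure
    if order <= PAGE_ALLOC_COSTLY_ORDER:
        return True     # implicit: "too small to fail"
    return False        # larger requests may return failure


def allocate(order, gfp_mask, try_once, max_steps=1000):
    """Loop as described above; max_steps only keeps the demo finite."""
    for step in range(1, max_steps + 1):
        if try_once():
            return step            # number of attempts that were needed
        if not should_retry(order, gfp_mask):
            return None            # allocation failed
    raise RuntimeError("still looping; a real system would appear hung")
```

In this model a small request and a GFP_NOFAIL request behave identically, which is the first of the two questions the discussion turns on.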

The following discussion, in the end, focuses on two related questions: (1) should the kernel really be treating small allocations as if they all had __GFP_NOFAIL set, and (2) should failure-proof allocations be supported at all, and, if so, how can that support be made more robust?

No longer too small to fail

The discussion (re)started when Tetsuo Handa noted that memory allocation behavior had changed in the 3.19 kernel; in particular, small allocations with the GFP_NOIO or GFP_NOFS flags would fail under severe memory pressure. In previous kernels, such allocations would loop indefinitely if no memory was available. Among other things, this change can cause filesystem operations to fail on memory-stressed systems where they would have (eventually) succeeded before.

The behavior change is the result of this patch from Johannes Weiner which was aimed at avoiding the memory-allocation deadlocks that started the December discussion. The intent was to avoid looping forever in an allocation attempt if it appeared that no progress was being made toward freeing some memory for that allocation, but, by accident, it also prevented looping entirely in the GFP_NOIO and GFP_NOFS cases. So those allocations can now fail; that is a significant change from how previous kernels worked.

Johannes initially wanted to keep the new behavior, saying that it "makes more sense". But the filesystem developers disagreed strongly. It seems that there are numerous places in the filesystem code that depend on allocations succeeding reliably, and that many of them are not marked with __GFP_NOFAIL. Ted Ts'o threatened to add a lot of __GFP_NOFAIL flags to allocation calls in the ext4 filesystem if the change were not reverted. The memory-management developers were thus faced with the need to pick the option they disliked least.

In the end, the filesystem developers won out on this one; Johannes merged a change into 4.0-rc2 restoring the looping behavior for those allocation types. This change is likely to end up in the 3.19 stable series as well. The original patch is a good argument for the approach of refusing "cleanup" patches late in the development cycle. It was merged for the 3.19-rc7 prepatch, meaning that there was almost no time for problems to be noticed before the final 3.19 release came out.

The discussion was not limited to the unexpected effects of one late-arriving memory-management patch, though. The bigger problems of how to avoid deadlocks in low-memory situations and how to ensure that important tasks can proceed in those situations remain unsolved.

The OOM killer

The out-of-memory (OOM) killer is implicated in a number of stall scenarios. In the original problem reported last December, the OOM killer would choose a victim that was blocked on a lock, but that lock was held by the process waiting (forever) for a memory allocation to proceed. As a result, the victim could not exit and, thus, could not free its memory. Since the OOM killer only goes after a single process at a time, everything would stop at that point.

Johannes suggested a change to how the OOM killer works: if a targeted process failed to exit after five seconds, the OOM killer would give up and move on to another victim. The idea was not hugely popular, though. David Rientjes pointed out that there was no guarantee that the next victim would be any more appropriate than the one that came before. Dave Chinner claimed more broadly that efforts to tweak the OOM killer are misdirected:

I really don't care about the OOM Killer corner cases - it's completely the wrong line of development to be spending time on and you aren't going to convince me otherwise. The OOM killer a crutch used to justify having a memory allocation subsystem that can't provide forward progress guarantee mechanisms to callers that need it.

The end result is that OOM-killer timeouts will probably not find their way into the memory-management subsystem anytime soon.

__GFP_NOFAIL and looping

From the point of view of the memory-management developers, many things would get easier if any allocation request could fail when the necessary resources are not available. That would mean getting rid of the implicit "small allocations never fail" rule, but, beyond that, it would also require getting rid of the explicit __GFP_NOFAIL call sites. Michal Hocko was perhaps the most outspoken in this regard, saying that __GFP_NOFAIL "is deprecated and shouldn't be used". He also suggested that existing __GFP_NOFAIL call sites should be reimplemented in a way that allows them to recover from allocation failures.

Dave took issue with that idea, saying that failure-proof allocations are a hard requirement for the XFS filesystem. To rework XFS to be able to roll back dirty transactions in the face of an allocation failure would increase its complexity significantly, he said; the project would take a couple of years to reach a point where it could be put into production use. He summarized by saying "I'm not about to spend a couple of years rewriting XFS just so the VM can get rid of a GFP_NOFAIL user". Strangely enough, there were no other developers volunteering to take on that job either.

Contemporary filesystems are complex beasts that have to meet a wide variety of demands. They incorporate complex transaction mechanisms that help them to maintain filesystem integrity in every situation possible. Implementing such a mechanism in a way that allows it to recover from a memory-allocation failure in the middle of a transaction, after resources have been committed, locks taken, etc., is not a simple task. Filesystem developers on Linux have not taken on that task because, in the end, there has not been a need to. Allocations that cannot be allowed to fail have proved sufficient in almost all situations.

Once one accepts that some sort of failure-proof allocation mechanism is needed, though, the next question is: how should it be done? The __GFP_NOFAIL flag is one solution, but it turns out that quite a bit of code in the kernel does not make use of it. Instead, there are a number of places in the kernel that implement their own retry loops on top of a kmalloc() call without __GFP_NOFAIL. That is something that the memory-management developers don't like; those developers would rather not see __GFP_NOFAIL used at all, but they still prefer its use to retry loops implemented outside of the memory-management subsystem. Consider, for example, this message from Johannes saying that the XFS developers should replace a retry loop with a single __GFP_NOFAIL call.

There are a couple of reasons why such loops exist. One of those is that __GFP_NOFAIL was explicitly deprecated in 2009; the patch (from Andrew Morton) said:

__GFP_NOFAIL is a bad fiction. Allocations _can_ fail, and callers should detect and suitably handle this (and not by lamely moving the infinite loop up to the caller level either).

After this change went in, it became harder to get code containing __GFP_NOFAIL past reviewers. Whether it is done lamely or not, a hand-coded infinite retry loop is easier to sneak into the kernel than an easily greppable __GFP_NOFAIL use. So that is what developers did.

The memory-management developers dislike must-succeed allocations because they complicate the code and, as has occasionally been seen, create the possibility of deadlocks. If such allocations must be made, they would rather see the looping done in the memory-management code, where behavior can be tweaked and appropriate action taken (starting the OOM killer, for example) if it becomes clear that no progress is being made. In the real world, though, according to both Ted and Dave, looping actually works pretty well. The XFS code has a "canary" that puts out a warning when the looping goes on for too long, but, Dave said:

yet we *rarely* see the canary warnings we emit when we do too many allocation retries, the code has been that way for 13-odd years. Hence, despite your protestations that your way is *better*, we have code that is tried, tested and proven in rugged production environments. That's far more convincing evidence that the *code should not change* than your assertions that it is broken and needs to be fixed.
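The general shape of a caller-side retry loop with a "canary" can be sketched in Python. This is not the XFS code: alloc_with_canary(), try_alloc, and the warning interval of 100 retries are hypothetical choices made for illustration.

```python
import logging
import time

log = logging.getLogger("retry-demo")


def alloc_with_canary(try_alloc, warn_every=100, delay=0.0):
    """Retry try_alloc() until it succeeds, warning periodically.

    try_alloc is a hypothetical callable that returns None on failure.
    """
    retries = 0
    while True:
        obj = try_alloc()
        if obj is not None:
            return obj, retries
        retries += 1
        if retries % warn_every == 0:
            # The canary: a hint that forward progress may have stopped.
            log.warning("possible allocation deadlock (%d retries)", retries)
        time.sleep(delay)   # back off briefly before trying again
```

The loop never gives up; the warning is the only signal that something may be wrong, and the policy (how long to wait, when to complain) stays with the caller rather than with the allocator.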

One might take that as a statement that the XFS developers are currently uninterested in replacing their own loops with __GFP_NOFAIL invocations. But they actually have another reason to maintain a loop outside of the memory-management code: they want to retain control over how the filesystem should respond to low-memory conditions. It is, in their mind, a policy decision that the memory-management code lacks the information to handle. There are currently plans afoot to expose some of that policy to user space, allowing administrators to configure what the filesystem's low-memory response should be.

Reservations

Still, there is no real disagreement over this idea: looping over a failing memory allocation is undesirable and best avoided whenever possible. Thus it may well be that the most useful part of the discussion came when the developers got around to the topic of avoiding allocation failures altogether. There are a few ways of working toward that goal.

One of those is preallocation — allocating all of the needed memory resources before the code gets to a point where it can't back out of a transaction. Preallocation is used in many contexts in the kernel and works well, so it was natural for the memory-management developers to ask whether it can be used in this context. Dave shot that idea down fairly quickly:

However, preallocation is dumb, complex, CPU and memory intensive and will have a *massive* impact on performance. Allocating 10-100 pages to a reserve which we will almost *never use* and then free them again *on every single transaction* is a lot of unnecessary additional fast path overhead. Hence a "preallocate for every context" reserve pool is not a viable solution.

Mempools were also raised as a possibility. They are a form of preallocation that might avoid some of the overhead described above. But they are, according to Dave, poorly suited to the problem at hand. Mempools deal with a single size of object, while a filesystem transaction needs a wide variety of objects; that implies that several mempools would be needed at various levels in the stack. There is also a mismatch between object lifetimes that makes mempools difficult to use across multiple transactions. So mempools do not appear to be an option either.

Dave's suggestion, instead, is to add the concept of "reservations" to the memory-management subsystem. Prior to entering a transaction, the filesystem code would inform the memory-management code that it will need guaranteed access to a certain amount of memory; calculating an approximate memory requirement is, apparently, not that hard. The memory-management code would then ensure that the requisite amount of memory would be available; subsequent allocation requests would dip into the reserve if need be. As long as the estimate for the size of the reserve is sufficient, there should be no problem with failing allocations during the transaction.

Reservations may look a lot like preallocation, but there is a crucial difference. The memory-management code already maintains a "watermark," a level of free memory below which it is unwilling to go unless absolutely necessary. A reservation would simply raise that watermark, making a bit less memory available to the system as a whole. If a reservation would raise the watermark above the amount of memory that is currently free, the request would block until more memory could be reclaimed. In the simplest case, a reservation would be represented as an increased value in a single integer variable.
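A toy model may make the watermark idea more concrete. Everything here is hypothetical (the ToyMemory class, its method names, and the page counts); no such interface exists in the kernel, and where the real mechanism would block, this sketch simply returns False.

```python
class ToyMemory:
    """Toy model: memory is a page count guarded by a watermark."""

    def __init__(self, free_pages, base_watermark):
        self.free_pages = free_pages
        self.base_watermark = base_watermark
        self.reserved = 0   # the "single integer variable"

    @property
    def watermark(self):
        # Reservations simply raise the level ordinary callers must respect.
        return self.base_watermark + self.reserved

    def reserve(self, pages):
        """Raise the watermark; False means the caller would block."""
        if self.watermark + pages > self.free_pages:
            return False    # wait until reclaim frees more memory
        self.reserved += pages
        return True

    def release(self, pages):
        self.reserved -= pages

    def alloc(self, pages, use_reserve=False):
        """Take pages; only reserve users may dip below the raised mark."""
        floor = self.base_watermark if use_reserve else self.watermark
        if self.free_pages - pages < floor:
            return False
        self.free_pages -= pages
        return True
```

Note that reserve() touches no pages at all; it only changes a number. That is the difference from preallocation, which would allocate and free real memory on every transaction.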

There seems to be some general support for the addition of a reservation mechanism, but things get less clear once one looks at the details. Andrew Morton suggested a scheme where a process making a reservation would get a number of "tokens"; subsequent allocations done by that process would come from the reserve first. Dave does not like that idea, saying that it fails to account for the fact that many objects allocated during a transaction will be freed (perhaps by others) shortly and, thus, should not come from the reservation. His view of the reservation, instead, is a range of memory that is not touched at all unless there is no alternative; even then, only allocations using the GFP_RESERVE flag would be able to get at that memory. The reservation, in his view, comes into play when the kernel would have otherwise put the OOM killer into action.

Johannes, instead, says that this approach will not work. The problem is that "we simply don't KNOW the exact point when we ran out of reclaimable memory", so the memory-management subsystem cannot easily guarantee the sort of loose reservation that Dave has described. Dave disagreed with that assessment, it almost goes without saying. And that is about where the conversation wound down.

Reservations are a promising idea for a solution to some of the kernel's memory-allocation challenges. But, at this point, they are just an idea, with neither code nor a design consensus behind them. The discussion has slowed for the moment, but that is almost certain to be a temporary state of affairs. The annual Linux Storage, Filesystem, and Memory Management Summit is less than one week away as of this writing. This subject is on the agenda, and LWN will be there to report on the discussion.

Python bindings added for libseccomp 2.2.0

By Jake Edge
March 4, 2015

The secure computing (seccomp) facility was added to Linux some ten years ago as a way to restrict programs so that they can only make a small subset of system calls. It is a way to sandbox processes but, over the years, it was found to be too restrictive. Thus, after a few false starts, a new seccomp mode that used the kernel's Berkeley Packet Filter (BPF) implementation to provide a way to more flexibly sandbox processes came about in 2012. To help application writers use the facility, Paul Moore created libseccomp, and he has just released version 2.2.0 of the library.

This is the first release of libseccomp for more than a year, with a 2.1.1 minor release in October 2013 (after 2.1.0 in June 2013). One of the headline features for 2.2.0 is the addition of Python bindings, which have been around for a while but have not been part of a release until now. Other big changes for this release include a switch to Autotools, support for the ARM 64-bit architecture, as well as support for several flavors of the MIPS architecture (mips, mips64, and mips64n32).

The newer seccomp mode (known as seccomp filters or seccomp mode 2) allows developers to specify which system calls are allowed to be made and to restrict the arguments that can be passed to those system calls. In order to do that, the kernel requires a program written using the BPF language, which was originally targeted at network filtering, though it has grown well beyond that. Libseccomp is meant to provide a higher-level interface to that functionality—from C and, now, Python.

If you have a program that is handling untrusted user input—HTTP traffic or image formats, say—you might want to restrict the kinds of operations that program can perform. For example, there might just be a handful of operations that should be allowed to the program, so that if it were compromised by unexpected input, there is little an attacker can actually do. On the other hand, though, the program might require a bit more than the four system calls (read(), write(), exit(), and sigreturn()) allowed by the original seccomp mode.

For instance, open() might be allowed, but only to open files for reading. Or, the write() system call might be restricted to certain file descriptors (say, 1 and 2 for stdout and stderr). Meanwhile, all other system calls, including powerful calls that attackers might want to use, such as execve() or socket(), would be disabled. As with the C interface, the actions taken when a disallowed system call is made depend on how the library is initialized. Those calls could cause the program to be killed, to receive a signal, to generate a ptrace() event, or for the call to fail with a particular errno value.

Using the Python bindings is similar in many ways to calling the library directly from C:

    import os
    import sys
    from seccomp import *

    # Kill the process on any system call that no rule allows.
    f = SyscallFilter(defaction=KILL)

    f.add_rule(ALLOW, "open")
    # Allow write() only when argument 0 (the file descriptor) is stdout.
    f.add_rule(ALLOW, "write", Arg(0, EQ, sys.stdout.fileno()))

    f.load()

    # This write() targets a descriptor other than stdout, so it is blocked.
    x = os.open('/tmp/x', os.O_WRONLY)
    os.write(x, b'Hello, world\n')

That, at least conceptually, will create the filter object, add two rules, and load it, after which the write() will be rejected. The rules allow the open() system call, but only allow calling write() on the stdout file descriptor. The initialization of the filter object chooses the KILL default action, which means the program will be terminated if it makes a system call that no rule allows.

However, there is a bit more to it than that. When testing the non-error path by commenting out the os.write(), Python requires brk(), rt_sigaction(), and exit_group() to exit gracefully. So the following would need to be added to the list of rules:

    f.add_rule(ALLOW, "exit_group")
    f.add_rule(ALLOW, "rt_sigaction")
    f.add_rule(ALLOW, "brk")

While that does add to the list of allowed system calls, it doesn't really enlarge what an attacker could do when subverting this (extremely simple) program. Using the Python version of open() and write(), instead of those from the os module, requires opening up several more system calls (mmap(), read(), close(), and fstat()), which could be a bigger problem. Having both open() and read() available might allow an attacker to access files, but the contents can only be written to stdout, which may well be an impediment. Further refinement of the rules could limit an attacker even further.
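To see how the default action and the individual rules interact, the decision a filter makes can be imitated in plain Python. This is only a toy model: ToyFilter and decide() are hypothetical names, nothing is loaded into the kernel, and argument checks are reduced to simple equality on one argument.

```python
KILL, ALLOW = "KILL", "ALLOW"


class ToyFilter:
    """Pure-Python stand-in for a seccomp filter; it restricts nothing."""

    def __init__(self, defaction):
        self.defaction = defaction
        self.rules = []

    def add_rule(self, action, syscall, *arg_checks):
        # Each argument check is a pair: (argument index, required value).
        self.rules.append((action, syscall, arg_checks))

    def decide(self, syscall, *args):
        """Return the action of the first matching rule, else the default."""
        for action, name, arg_checks in self.rules:
            if name == syscall and all(args[i] == v for i, v in arg_checks):
                return action
        return self.defaction


f = ToyFilter(defaction=KILL)
f.add_rule(ALLOW, "open")
f.add_rule(ALLOW, "write", (0, 1))          # only descriptor 1 (stdout)
for name in ("exit_group", "rt_sigaction", "brk"):
    f.add_rule(ALLOW, name)

print(f.decide("write", 1, b"hi"))          # matches the write rule
print(f.decide("write", 3, b"hi"))          # wrong descriptor: default action
print(f.decide("execve", "/bin/sh"))        # no rule at all: default action
```

A call that matches a rule by name but not by argument falls through to the default action, just like a call with no rule at all.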

When debugging seccomp filters, there is often a need to track down which system call caused a failure. That can be done a number of different ways. When the KILL action is used, as above, the process is forced to exit (with SIGSYS as its status), so the shell simply prints "Bad system call". But it also leaves an audit trail that records the system call number that failed:

    type=SECCOMP msg=audit(1425421709.486:42015): auid=1000 uid=1000
      gid=1000 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
      pid=25113 comm="python" exe="/usr/bin/python2.7" sig=31 arch=c000003e
      syscall=5 compat=0 ip=0x7faf36e38824 code=0x0

One easy way to turn the number reported into a name is by using the scmp_sys_resolver tool that is bundled with libseccomp. Another option for determining which system call is causing a failure would be to switch to the TRAP action when setting up the filter object, then add a signal handler for the SIGSYS signal that gets generated when disallowed system calls are made. Using ptrace() or strace are possibilities as well.
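Pulling the interesting fields out of such an audit record is easy to script. The following sketch uses only the Python standard library; parse_audit() is a hypothetical helper, and the record is the one shown above joined into a single line.

```python
import re

RECORD = (
    'type=SECCOMP msg=audit(1425421709.486:42015): auid=1000 uid=1000 '
    'gid=1000 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 '
    'pid=25113 comm="python" exe="/usr/bin/python2.7" sig=31 arch=c000003e '
    'syscall=5 compat=0 ip=0x7faf36e38824 code=0x0'
)


def parse_audit(record):
    """Split an audit record into a dict of its key=value fields."""
    fields = {}
    for key, value in re.findall(r'(\w+)=("[^"]*"|\S+)', record):
        fields[key] = value.strip('"')
    return fields


fields = parse_audit(RECORD)
syscall_nr = int(fields["syscall"])      # the number to resolve to a name
arch = int(fields["arch"], 16)           # audit architecture token, in hex
print(fields["comm"], "made disallowed system call number", syscall_nr)
```

The number only has meaning together with the architecture, since system call numbering differs from one architecture to the next.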

Rules can be built using the Arg() function to specify the allowable values for up to six arguments. Those values can be tested using the usual comparison operators, but by name (e.g. EQ for =, GE for >=, LT for <, and so on). There is also a MASKED_EQ operator that can be used for flag values:

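    f.add_rule(ALLOW, "open",
               Arg(1, MASKED_EQ,
                   os.O_RDONLY | os.O_RDWR | os.O_WRONLY,
                   os.O_RDONLY))

For MASKED_EQ, the first value is the mask applied to the argument and the second is the value that the masked argument must equal. That rule would ensure that, of the three file access flags, only O_RDONLY is set, so open() would only be allowed for reads.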
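The comparison behind MASKED_EQ can be tried out without installing a filter. This sketch assumes Linux flag values (where O_RDONLY is zero); masked_eq() and open_allowed() are hypothetical helpers, not part of the bindings.

```python
import os

# The three access-mode flags; all other bits in the flags are ignored.
ACCESS_MASK = os.O_RDONLY | os.O_RDWR | os.O_WRONLY


def masked_eq(value, mask, datum):
    """Mask the argument first, then test the result for equality."""
    return (value & mask) == datum


def open_allowed(flags):
    """Would an open() with these flags pass a read-only rule?"""
    return masked_eq(flags, ACCESS_MASK, os.O_RDONLY)


for flags in (os.O_RDONLY, os.O_RDONLY | os.O_APPEND,
              os.O_WRONLY, os.O_RDWR | os.O_CREAT):
    print(oct(flags), open_allowed(flags))
```

Because unrelated bits are masked away, extra flags such as O_APPEND do not affect the decision; only the access mode does.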

The 2.2.0 release comes with a test suite that includes both C and Python versions of each test. It also has man pages for each of the C language seccomp_* calls. The only Python language documentation appears to be in the src/python/seccomp.pyx file, but perhaps that will change before long. Anyone looking to sandbox their programs should definitely take a peek at libseccomp.

Ftrace and histograms: a fork in the road

By Jonathan Corbet
March 4, 2015

The kernel's "ftrace" tracing machinery is useful for obtaining a great deal of information about what is going on inside a running kernel. What ftrace generally does not provide is analysis features to "boil down" tracing data into a more useful format. Tom Zanussi's "hist triggers" patch could change that situation, but it has exposed a significant difference of opinion over how such capabilities should be implemented in the kernel.

The idea behind hist triggers is simple enough: a user interested in a histogram of tracing data would write a command string to the appropriate ftrace control file specifying the parameters of the histogram. For example, one could look at kmalloc() calls with a command like:

    # echo 'hist:key=call_site:val=bytes_req' > \
           /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger

Here, the hist: prefix indicates that histogram output is desired. The key= and val= parameters describe the axes of the histogram; in this case, the user will get the total amount of memory requested from each location where kmalloc() is called. One obtains the results by reading the hist file that magically pops up in the tracing control directory:

    # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
    trigger info: hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]

    call_site: 18446744071581750326 hitcount:          1  bytes_req:         24
    call_site: 18446744071583151255 hitcount:          1  bytes_req:         32
    call_site: 18446744071582443167 hitcount:          1  bytes_req:        264
    call_site: 18446744072104099935 hitcount:          2  bytes_req:        464
    call_site: 18446744071579323550 hitcount:          3  bytes_req:        168
    [...]

There are additional options that can, for example, turn the call-site address into a symbolic location. This example was taken from the documentation posted with Tom's patch set; many more examples and details can be found there.
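The aggregation that such a trigger performs amounts to a hash table keyed by one event field and summing another. A user-space sketch in Python follows; build_hist() is a hypothetical helper and the events are made up for illustration, so this says nothing about how the patch set implements the table inside the kernel.

```python
from collections import defaultdict


def build_hist(events, key, val):
    """Aggregate events into {key value: {"hitcount": n, val: total}}."""
    table = defaultdict(lambda: {"hitcount": 0, val: 0})
    for event in events:
        entry = table[event[key]]
        entry["hitcount"] += 1
        entry[val] += event[val]
    return dict(table)


# Made-up kmalloc events; the call_site values are illustrative only.
events = [
    {"call_site": 0x1000, "bytes_req": 24},
    {"call_site": 0x2000, "bytes_req": 232},
    {"call_site": 0x2000, "bytes_req": 232},
    {"call_site": 0x3000, "bytes_req": 56},
    {"call_site": 0x3000, "bytes_req": 56},
    {"call_site": 0x3000, "bytes_req": 56},
]

hist = build_hist(events, key="call_site", val="bytes_req")
# Sort by hitcount, as in the trigger output shown above.
for site, entry in sorted(hist.items(), key=lambda kv: kv[1]["hitcount"]):
    print(f"call_site: {site:#x} hitcount: {entry['hitcount']:>10}"
          f"  bytes_req: {entry['bytes_req']:>10}")
```

Each new key needs a new table entry, which is where the memory allocation in the tracepoint path comes from.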

Tom thinks that this sort of functionality will be useful to a wide variety of tracing users. Indeed, it may reduce the need for more sophisticated tools:

A surprising number of typical use cases can be accomplished by users via this simple mechanism. In fact, a large number of the tasks that users typically do using the more complicated script-based tracing tools, at least during the initial stages of an investigation, can be accomplished by simply specifying a set of keys and values to be used in the creation of a hash table.

Nobody seems to disagree that this would be a nice feature to have, but, still, criticism of the patch set came from two directions. Ftrace maintainer Steve Rostedt complained that the tracepoint code generating the histograms performs memory allocations; those allocations are necessary to maintain the hash table used to hold the histogram data. Tracepoint callbacks can be called with all sorts of locks held; allocating memory in such a situation is not a safe thing to do. So, Steve said, that aspect of the patch set is a "showstopper." Future versions of the patch set will, thus, have to accumulate this data without allocating memory in the tracepoint callbacks.

A different type of criticism came from Alexei Starovoitov, the developer behind the eBPF work that has gone into the kernel over the last year. One of the use cases for a tool like eBPF is to allow users to gather data in kernel space and generate output in forms like histograms. Alexei thus duly suggested that eBPF should be used to implement Tom's histogram functionality. Rather than parse the commands in the kernel, though, Alexei would like to see the development of a tool that would parse the same commands in user space and load an eBPF program to do the actual work.

To Tom, the idea seemed "silly"; a lot of work would be required to implement the functionality that already exists. He saw the request as an attempt to force the use of eBPF on users who may not want to deal with it. Alexei responded by saying:

Your 'hist->bpf' tool could have been first to solve 'bpf hard to use' problem and over time it could have evolved into full dtrace alternative. Whereas by adding 'hist:keys=..' parsing to kernel you'll be stuck with it and somebody else's dtrace-like tool will supersede it.

Tom remained unimpressed:

I think there's some misunderstanding there - it was never my intent to create a full dtrace alternative, really the idea was (and has been, even before there was any such thing as ebpf in the kernel) only to provide access to some low-hanging fruit via the standard read/write interfaces people are used to with ftrace.

In the end, there is an important question to answer here. The eBPF subsystem provides a mechanism by which a great deal of interesting tracing functionality could be implemented without having to hardwire the logic in the kernel. Now that eBPF is here, adding new tracing modes as more C code in the kernel could lead to duplicated functionality that needs to be supported indefinitely, even if, someday, an alternative implemented in eBPF draws most of the users.

On the other hand, the current interface to ftrace, wherein users write simple control strings to a set of virtual files, appeals to a lot of users. It is relatively easy to work with, does not require any additional tools to use, and is straightforward to script. Some of those users would not be pleased if they felt pushed to move over to an interface requiring the compilation and loading of eBPF programs to get their work done.

This has the look of a debate that could go on for some time. In the absence of a decision by decree from a suitably placed subsystem maintainer, it seems unlikely that the developers involved will settle on a single approach to the problem of how to add new tracing features. The kernel's tracing subsystem is arguably at a fork in the road, but we may not know which branch will be taken for a while yet.

Patches and updates

Kernel trees

Linus Torvalds Linux 4.0-rc2
Greg KH Linux 3.18.8
Greg KH Linux 3.14.34
Greg KH Linux 3.10.70

Documentation

Michael Kerrisk (man-pages) man-pages-3.81 is released

Networking

Joe Stringer OVS conntrack support

Miscellaneous

Lucas De Marchi kmod 20

Page editor: Jonathan Corbet


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds