Kernel development
Brief items
Kernel release status
The current development kernel is 4.9-rc4, released on November 5. Linus said: "So I'm not going to lie: this is not a small rc, and I'd have been happier if it was. But it's not unreasonably large for this (big) release either, so it's not like I'd start worrying. I'm currently still assuming that we'll end up with the usual seven release candidates, assuming things start calming down. We'll see how that goes as we get closer to a release."
The November 6 regression report shows 17 known regressions in the 4.9 kernel.
Stable updates: none have been released in the last week. The 4.8.7 and 4.4.31 updates are in the review process as of this writing; they can be expected on or after November 11.
Kernel development news
A discussion on virtual-memory topics
The Kernel Summit "technical topics" day was a relatively low-key affair in 2016; much of the interesting discussion had been pulled into the adjoining Linux Plumbers Conference instead. That day did feature one brief discussion involving several core virtual-memory (VM) developers, though. While a number of issues were discussed, the overall theme was clear: the growth of persistent memory is calling into question many of the assumptions on which the kernel's VM subsystem was developed.

The half-hour session began with self-introductions from the four developers at the head of the room. Mel Gorman, who works in the performance group at SUSE, said that his current interest is in improving small-object allocation, a task which he hopes will be relatively straightforward. Rik van Riel has spent much of his time on non-VM topics recently, but has a number of projects he wants to get back to. Those include using the persistent-memory support in QEMU to shift the page cache out of virtualized guest systems and into the host, where it can be better managed and shared. Johannes Weiner has been working on scalability issues, and is working on improving page-cache thrashing detection. Vlastimil Babka is concerned with high-order (larger than one page) allocations. With the merging of the virtually mapped kernel stacks work, stack allocation no longer requires high-order allocations, but there are other users in the kernel. So his current focus is on making the compaction and anti-fragmentation code work better.
Mel opened the discussion by asking whether anybody in the room had problems with the VM subsystem. Laura Abbott replied that she is working on slab sanitization as a way of hardening the kernel; it zeroes memory when it is freed to prevent memory leaks. The initial code came from the PaX/grsecurity patch set, but that code is not acceptable in mainline due to the performance hit it adds to the memory allocator's hot paths. She has been pursuing a suggestion to use the SLUB allocator's debug path, but there is little joy to be found there; SLUB's slow path is very slow.
Mel replied that he would expect that the performance hit from sanitization would be severe in any case; thrashing the memory caches will hurt even if nothing else does. If a particular environment cares enough about security to want sanitization, it will have to accept the performance penalty; this would not be the first time that such tradeoffs have had to be made. That said, he has no fundamental opposition to the concept. Laura does believe that the hit can be reduced, though; much of the time spent in the slow path goes to lock contention, so perhaps lockless algorithms should be considered. Mel concurred, noting that the slow path was misnamed; it should be the "glacial path."
Swap woes
The next topic was brought up by a developer who is working on next-generation memory, and swapping to hardware-compressed memory in particular. In the absence of the actual hardware, he is working on optimizing swapping to a RAM disk using zram. There are a number of problems he is running into, including out-of-memory kills while there is still memory available. His main concern, though, is performance; with zram, about two-thirds of the CPU's time is spent on compression, while the remaining third is consumed by overhead. When the compression moves to hardware, that overhead will become the limiting factor, so he would like to find ways to reduce it.
Johannes replied that there are a lot of things to fix in the swap path when fast devices are involved, starting with the fact that the swapout path uses global locks. Because swapping to rotational devices has always been terrible, the system is biased heavily against swapping in general. A workload can be thrashing the page cache, but the VM subsystem will still only reclaim page-cache pages and not consider swapping. He has been working on a patch set to improve the balance between swapping and the page cache; it tries to reclaim memory from whichever of the two is thrashing the least. There are also problems with the swapout path splitting huge pages on the way out, with a consequent increase in overhead. Adding batching to the swap code will hopefully help here.
Mel suggested the posting of profiles showing where the overhead is in the problematic workload. Getting representative workloads is hard for the VM developers; without those workloads, they cannot easily reproduce or address the problems. In general, he said, swapping is "running into walls" and needs to be rethought. Patience will be required, though; it could be 6-24 months before the problems are fixed.
Shrinker shenanigans
Josef Bacik is working, as he has for years, on improving the Btrfs filesystem. He has observed a problematic pattern: if the system is using vast amounts of slab memory, everything will bog down. He has workloads that can fill 80% of memory with cached inodes and dentries. The size of those caches should be regulated by the associated shrinkers, but that is not working well. Invocation of shrinkers is tied to page-cache scanning, but this workload has almost no page cache, so that scanning is not happening and the shrinkers are not told to free as much memory as they should. As more subsystems use the shrinker API, he said, we will see more cases where it is not working as desired.
Ted Ts'o said that he has seen similar problems with the extent status slab cache in the ext4 filesystem. That cache can get quite large; it can also create substantial spinlock contention when multiple shrinkers are running concurrently. The problems are not limited to Btrfs, he said.
Rik asked whether it would make sense to limit the size of these caches to some more reasonable value. There are quite a few systems out there now that do not really have a page cache, and their number will grow as the use of persistent memory spreads. Persistent memory is nice in that it can make terabytes worth of file data instantly accessible, but that leads to the storing of a lot of metadata in RAM.
Christoph Hellwig replied that blindly limiting the size of metadata caches is not a good solution; it might be a big hammer that is occasionally needed, but it should not be relied upon in a well-functioning system. What is needed is better balancing, he said, not strict limits. The VM subsystem has been built around the idea that filesystems store much of their metadata in the page cache, but most of them have shifted that metadata over to slab-allocated memory now. So, he said, there needs to be more talk between the VM and filesystem developers to work out better balancing mechanisms.
Rik answered that the only thing the VM code can do now is to call the shrinkers. Those shrinkers will work through a global list of objects and free them, but there is a problem. Slab-allocated objects are packed many to a page; all objects in a page must be freed before the page itself can be freed. So, he said, a shrinker may have to clear out a large fraction of the cache before it is able to free the first whole page. The cache is wiped out, but little memory is made available to the rest of the system.
Christoph said that shrinkers are currently designed around a one-size-fits-all model. There needs to be a way to differentiate between clean objects (which can be freed immediately) and dirty objects (that must be written back to persistent store first). There should also be page-based shrinkers that can try to find pages filled with clean objects that can be quickly freed when the need arises.
Mel suggested that there might be a place for a helper that a shrinker can call to ask for objects that are on the same page; it could then free them all together. The problem of contention for shrinker locks could be addressed by limiting the number of threads that can be running in direct reclaim at any given time. Either that, or shrinkers should back off quickly when locks are unavailable on the assumption that other shrinkers are running and will get the job done.
Ted said that page-based shrinkers could make use of a shortcut by which they could indicate that a particular object is pinned and cannot be freed. The VM subsystem would then know that the containing page cannot be freed until the object is unpinned. Jan Kara suggested that there could be use in having a least-recently-used (LRU) list for slab pages to direct reclaim efforts, but Linus Torvalds responded that such a scheme would not work well for the dentry cache, which is usually one of the largest caches in the system.
The problem is that some objects will pin others in memory; inodes are pinned by their respective dentries, and dentries can pin the dentries corresponding to their parent directories. He suggested that it might make sense to segregate the dentries for leaves (ordinary files and such) from those for directories. Leaf dentries are much easier to free, so keeping them together will increase the chances of freeing entire pages. There's just one little problem: the kernel often doesn't know which type a dentry will be when it is allocated, so there is no way to know where to allocate it. There are places where heuristics might help, but it is not an easy problem. Mel suggested that the filesystem code could simply allocate another dentry and copy the data when it guesses wrong; Linus said that was probably doable.
Some final details
Linus said that there is possible trouble coming with the merging of slab caches in the SLUB allocator. SLUB normally does that merging for objects of similar size, but many developers don't like it. Slab merging would also obviously complicate the task of freeing entire pages. That merging currently doesn't happen when there is a shrinker associated with a cache, but that could change in the future; disabling merging increases the memory footprint considerably. We need to be able to do merging, he said, but perhaps need to be more intelligent about how it is done.
Tim Chen talked briefly about his swap optimization work. In particular, he is focused on direct access to swap when persistent memory is used as the swap device. Since persistent memory is directly addressable, the kernel can map swapped pages into a process's address space, avoiding the need to swap them back into RAM. There will be a performance penalty if the pages are accessed frequently, though, so some sort of decision needs to be made on when a page should be swapped back in. Regular RAM has the LRU lists to help with this kind of decision, but all that is available for persistent memory is the "accessed" bit in the page-table entry.
Johannes pointed out that the NUMA code has a page-table scanner that uses the accessed bit; perhaps swap could do something similar, but Rik said that this mechanism is too coarse for swap use. Instead, he said, perhaps the kernel could use the system's performance-monitoring unit (PMU) to detect situations where pages in persistent memory are being accessed too often. The problem with that approach, Andi Kleen pointed out, is that developers generally want the PMU to be available for performance work; they aren't happy when the kernel grabs the PMU for its own use. So it's not clear what form the solution to this problem will take.
All of the above was discussed in a mere 30 minutes. Mel closed the session by thanking the attendees, noting that some good information had been shared and that there appeared to be some low-hanging fruit that could be addressed in the near future.
The perils of printk()
One might be tempted to think that there is little to be said about the kernel's printk() function; all it does, after all, is output a line of text to the console. But printk() has its problems. In a Kernel Summit presentation, Sergey Senozhatsky said that he is simply unable to use printk() in its current form. The good news, he said, is that it is not unfixable — and that there are plans for addressing its problems.
Locking the system with printk()
One of the biggest problems associated with printk() is deadlocks, which can come about in a couple of ways. One of those is reentrant calls. Consider an invocation of printk() that is preempted by a non-maskable interrupt (NMI). The handler for that NMI will, likely as not, want to print something out; NMIs are extraordinary events, after all. If the preempted printk() call holds a necessary lock, the second call will deadlock when it tries to acquire the same lock. That is just the sort of unpleasantness that operating system developers normally go far out of their way to avoid.
This particular problem has been solved; printk() now has a special per-CPU buffer that is used for calls in NMI context. Output goes into that buffer and is flushed after the NMI completes, avoiding the need to acquire the locks normally needed by a printk() call.
Unfortunately, printk() deadlocks do not end there. It turns out that printk() calls can be recursive, the usual ban on recursion in the kernel notwithstanding. Recursive calls can happen as the result of warnings issued from deep within the kernel; lock debugging was also listed as a way to create printk() calls at inopportune times. If something calls printk() at the wrong time, the result is a recursive call that can deadlock in much the same way as preempted calls.
The problem looks similar to the NMI case, so it should not be surprising that the solution is similar as well. Sergey has a proposal to extend the NMI idea, creating more per-CPU buffers for printk() output. Whenever printk() wanders into a section of code where recursion could happen, output from any recursive calls goes to those buffers, to be flushed at a safe time. Two new functions, printk_safe_enter() and printk_safe_exit(), mark the danger areas. Perhaps confusingly, printk_safe_enter() does not mark a safe area; instead, it marks an area where the "safe" output code must be used.
Given that the per-CPU buffers are required in an increasing number of situations, Peter Zijlstra wondered whether printk() should just use the per-CPU buffer always. Sergey responded that this approach is under consideration.
Hannes Reinecke said that part of the problem results from the two distinct use cases for printk(): "chit-chat" and "the system is about to die now." The former type of output can go out whenever, while the latter is urgent. In the absence of better information, printk() must assume that everything is urgent, but a lot of problems could be solved by simply deferring non-urgent output to a safe time. Linus Torvalds pointed out that the log level argument should indicate which output is urgent, but Peter said that just deferring non-urgent output is not close to a full solution. The real problem, he said, is in the console drivers; this subject was revisited later in the session.
One problem with deferring non-urgent output, Sergey said, is that the ordering of messages can be changed and it can be hard to sort them out again. Peter suggested that this was not much of a problem; Hannes said, rather forcefully, that printk() output has timestamps on it, so placing it back into the proper order should not be difficult. The problem there, according to Linus, is that timestamps are not necessarily consistent across CPUs; if a thread migrates, the ordering of its messages could be wrong.
Petr Mladek, who joined Sergey at the front of the room, said that there is a problem with per-CPU buffers: they will almost necessarily be smaller than a single, global buffer, and can thus limit the amount of output that can be accumulated. So it is more likely that the system will lose messages if it is using per-CPU buffers. It was pointed out that the ftrace subsystem solved this problem long ago, but at the cost of a lot of complicated ring-buffer code. Linus said that the one thing that must be carefully handled is oops messages resulting from a kernel crash; those must make it immediately to the console.
Sergey went on to say that there is a larger set of printk() deadlocks that needs to be dealt with. Thus far, the conversation had concerned "internal" locks that are part of printk() itself. But printk() often has to acquire "external" locks in other parts of the kernel. The biggest problem area appears to be sending output to the console; there are locks and related problems in various serial drivers that can, once again, deadlock the system. Unlike internal locks, external locks are not controlled by printk(), so the problem is harder to solve.
The kernel already has a printk_deferred() function that goes out of its way to avoid taking external locks, once again deferring output to a safer time. Sergey's proposal is to make printk() always behave like printk_deferred(), eliminating the distinction between the two and enabling the eventual removal of printk_deferred() itself. The only exception would be for emergency output, which will always go directly to the console. Linus suggested going one step further, and taking the deferred path even in emergencies, but then flushing the buffers immediately thereafter.
Console troubles and more
Locks are not the only problem with printk(), though. To output its messages, it must call into the console drivers and, at completion, it must call console_unlock(), which will, among other things, flush any pending output to the console. This function has some unfortunate properties: it can loop indefinitely, it may not be preemptible, and the time it takes depends on the speed of the console — which may not be fast at all. As a result, nobody knows how long a printk() call will take, so it's not really safe to call it in any number of situations, including atomic context, RCU critical sections, interrupt context, and more.
To get around this kind of problem, Jan Kara has proposed making printk() completely asynchronous. Once again, output would be directed to a buffer and sent to the console later, but, with this proposal, the actual writing to the console would be done in a dedicated kernel thread. A call to printk() would simply store the message, then use the irq_work mechanism to kick off that thread. This suggestion passed by without much in the way of complaints from the group in the room.
Then, there is the problem of pr_cont(), a form of printk() used to print a single line using multiple calls. This function is not safe on SMP systems, with the result that output generated with it can be mixed up and corrupted. There is a strong desire to get rid of the "continuation line" style of printing, but, as Sergey pointed out, the number of pr_cont() calls in the kernel is growing rapidly. The problem, as Linus pointed out, is that there is no other convenient way to output variable-length lines in the kernel. Changing pr_cont() to use a per-CPU buffer, for example, is possible, but one would want to create a well-thought-out helper function. Then, perhaps, pr_cont() users could be easily fixed up with a Coccinelle script.
Ted Ts'o asked how much of a problem interleaved output really is on a production system; the consensus seemed to be that it was rarely a problem. Linus said that, on occasion, he sees ugly oops output as a result of continuation lines. Andy Lutomirski said, with a grin, that his algorithm for dealing with interleaved oops output is to wait for Linus to straighten it out for him. That solution seemed to work for the group as a whole; there does not seem to be any work planned in this area in the immediate future.
The final topic, covered in a bit of a hurry at the end of the session, was the console_sem semaphore. This semaphore covers access to all consoles in the system, so it is a global contention point. But there are paths that acquire console_sem that do not need to modify the console list or even write to a console. For example, simply reading /proc/consoles from user space will acquire that semaphore. That can cause unpleasant delays, including in printk() itself. And releasing this semaphore, once again, results in a call to console_unlock(), with the same associated problems.
Sergey suggested that console_sem should be turned into a reader/writer lock. That way, any path that does not need to modify the console list itself can acquire the lock in reader mode, increasing parallelism. That still won't help direct callers of console_unlock(), who will still be stuck flushing output to the device. For them, there was discussion of splitting console_unlock() into synchronous and asynchronous versions; the latter would just wake the printk() thread rather than flushing any pending console output itself. There does not appear to be any urgency to this work, though.
That is where time ran out and the session ended. Sergey's slides are available for those who are interested.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Device driver infrastructure
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet

![Rik van Riel, Mel Gorman, Johannes Weiner, and Vlastimil Babka (the virtual memory panel)](https://static.lwn.net/images/conf/2016/ks/vm-group-sm.jpg)