Brief items
Stable updates: no stable updates have been released in the last week. 3.2.34 is in the review process as of this writing; it can be expected on or after November 16.
Kernel development news
On November 6, memory management hacker Mel Gorman, who had not contributed code of his own toward NUMA scheduling so far, posted a new patch series called "Foundation for automatic NUMA balancing," or "balancenuma" for short. He pointed out that there were objections to both of the existing approaches to NUMA scheduling and that it was proving hard to merge the best from each. So his objective was to add enough infrastructure to the memory management subsystem to make it easy to experiment with different NUMA placement policies. He also implemented a placeholder policy of his own, named MORON (Migrate On Reference Of pte_numa Node).
In short, the MORON policy works by instantly migrating pages whenever a cross-node reference is detected using the NUMA hinting mechanism. Mel's second version, posted one week later, fixes a number of problems, adds the "home node" concept (that tries to keep processes and their memory on a single "home" NUMA node), and adds some statistics gathering to implement a "CPU follows memory" policy that can move a process to a new home node if it appears that better memory locality would result.
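For a sense of what that policy means in practice, here is a rough user-space analogue; it is not the kernel code. Migrating a page to the node of the CPU that just referenced it can be done by hand with the move_pages() system call; the libnuma helpers and the build command in the comment are assumptions about the environment.

/* User-space analogue of "migrate on reference", for illustration only;
 * balancenuma's placeholder policy does this automatically, in the kernel,
 * when a NUMA hinting fault reveals a cross-node reference.
 * Build with:  gcc -D_GNU_SOURCE migrate.c -lnuma */
#include <numa.h>       /* numa_node_of_cpu() (libnuma) */
#include <numaif.h>     /* move_pages(), MPOL_MF_MOVE */
#include <sched.h>      /* sched_getcpu() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Move one page (page-aligned address) to the node of the current CPU. */
static void migrate_to_local_node(void *page)
{
    int node = numa_node_of_cpu(sched_getcpu());
    int status;    /* receives the page's new node, or a negative errno */

    if (move_pages(0 /* this process */, 1, &page, &node, &status, MPOL_MF_MOVE))
        perror("move_pages");
    else
        printf("page now on node %d\n", status);
}

int main(void)
{
    size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    void *page;

    /* Allocate one page and touch it so that it has physical backing. */
    if (posix_memalign(&page, pagesize, pagesize))
        return 1;
    memset(page, 0, pagesize);

    migrate_to_local_node(page);
    free(page);
    return 0;
}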
Andrea Arcangeli, author of the AutoNUMA approach, said that balancenuma "looks OK" and that AutoNUMA could be built on top of it. Ingo Molnar, instead, was less accepting, saying "I've picked up a number of cleanups from your series and propagated them into tip:numa/core tree." He later added a request that Mel rebase his work on top of the numa/core tree. He clearly did not see the patch set as a "foundation" on which to build. A new numa/core patch set was posted on November 13.
Peter Zijlstra, meanwhile, has posted an "enhanced NUMA scheduling with adaptive affinity" patch set. This one does away with the "home node" concept altogether; instead, it looks at memory access patterns to determine where a process's memory lives and who that memory might be shared with. Based on that information, the CPU affinity mechanism is used to move processes to the appropriate nodes.
This patch set has not gotten a lot of review comments, and it does not appear to have been folded into the numa/core series as of this writing.
The numa/core approach remains in linux-next, which is meant to be the final staging area for code headed into the mainline. And, indeed, Ingo has reiterated that he plans to merge this code for the 3.8 cycle, saying "numa/core sums up the consensus so far." The use of that language might rightly raise some eyebrows; when there are between two and four competing patch sets (depending on how one counts) aimed at the same problem, the term "consensus" does not usually come to mind. Indeed, it seems that this consensus does not yet exist.
Andrew Morton has been overtly grumpy; the existence of numa/core in linux-next has made the management of his tree (which is based on linux-next) difficult, since his tree needs to be ready for the 3.8 merge window where, he thinks, numa/core should not be under consideration.
Hugh Dickins, a developer who is not normally associated with this sort of discussion, chimed in as well:
If we had wanted to push in a good solution a little prematurely, we would surely have chosen Andrea's AutoNUMA months ago, despite efforts to block it; and maybe we shall still want to go that way.
Please, forget about v3.8, cut this branch out of linux-next, and seek consensus around getting it right for v3.9.
Rik van Riel agreed, saying "Having unreviewed (some of it NAKed) code sitting in tip.git and you trying to force it upstream is not the right way to go." He also suggested that, if anything should be considered for merging in 3.8, it would be Mel's foundation patches.
And that is where the discussion stands as of this writing. There is a lot of uncertainty about what might happen with NUMA scheduling in 3.8, meaning that, most likely, nothing will happen at all. It is highly unlikely that Linus would merge the numa/core set in the face of the above complaints; he would be far more likely to sit back and tell the developers involved to work out something they can all agree on. So this is a discussion that might go on for a while yet.
Making changes to the memory management subsystem is a famously hard thing to do, especially when the changes are as large as those being considered here. But there is another factor that is complicating this particular situation. As the term "NUMA scheduling" suggests, this is not just a memory management problem. The path to improved NUMA performance will require coordinated changes to — and greater integration between — the memory management subsystem and the CPU scheduler. It's telling that the developers on one side of this divide are primarily associated with scheduler development, while those on the other side are mostly memory management folks. Each camp is, in a sense, invading the other's turf in an attempt to create a comprehensive solution to the problem; it is not surprising that some disagreements have emerged.
Also implicit in this situation is that Linus is unlikely to attempt to resolve the disagreement by decree. There are too many developers and too many interrelated core subsystems involved. So some sort of rough consensus will have to be found. Your editor's explicitly unreliable prediction is that little NUMA-related work will be merged in the 3.8 development cycle. Under pressure from several directions, the developers involved will figure out how to resolve their biggest differences in the next few months. The resulting code will likely be at least partially merged for 3.9 — later than many would wish, but the end result is likely to be better than would be seen with a patch set rushed into 3.8.
The idea behind Anton Vorontsov's vmpressure_fd() patch set is to create a mechanism by which the kernel can inform user space when the system is under memory pressure. An application using this call would start by filling out a vmpressure_config structure:
#include <linux/vmpressure.h>

struct vmpressure_config {
    __u32 size;
    __u32 threshold;
};
The size field should hold the size of the structure; it is there as a sort of versioning mechanism should more fields be added to the structure in the future. The threshold field indicates the minimum level of notification the application is interested in; the available levels are:
- VMPRESSURE_LOW: The system is out of free memory and is having to reclaim pages to satisfy new allocations. There is no particular trouble in performing that reclamation, though, so the memory pressure, while non-zero, is low.
- VMPRESSURE_MEDIUM: A medium level of memory pressure is being experienced — enough, perhaps, to cause some swapping to occur.
- VMPRESSURE_OOM: Memory pressure is at desperate levels, and the system may be about to fall prey to the depredations of the out-of-memory killer.
An application might choose to do nothing at low levels of memory pressure, clean up some low-value caches at the medium level, and clean up everything possible at the highest level of pressure. In this case, it would probably set threshold to VMPRESSURE_MEDIUM, since notifications at the VMPRESSURE_LOW level are not actionable.
Signing up for notifications is a simple matter:
int vmpressure_fd(struct vmpressure_config *config);
The return value is a file descriptor that can be read to obtain pressure events in this format:
struct vmpressure_event {
    __u32 pressure;
};
The current interface only supports blocking reads, so a read() on the returned file descriptor will not return until a pressure notification has been generated. Applications can use poll() to determine whether a notification is available; the current patch does not support asynchronous notification via the SIGIO signal.
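Putting those pieces together, a minimal consumer might look like the sketch below. It assumes a kernel carrying the vmpressure_fd() patches; since no C-library wrapper exists, the syscall is invoked through a placeholder __NR_vmpressure_fd number, and the structures and VMPRESSURE_* constants are taken from the proposed <linux/vmpressure.h> header.

/* Sketch only: assumes a kernel with the vmpressure_fd() patches applied.
 * __NR_vmpressure_fd is a placeholder; the real syscall number would come
 * from the patched kernel's headers. */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/vmpressure.h>   /* vmpressure_config, vmpressure_event, VMPRESSURE_* */

static int vmpressure_fd(struct vmpressure_config *config)
{
    return syscall(__NR_vmpressure_fd, config);
}

int main(void)
{
    struct vmpressure_config config = {
        .size = sizeof(config),
        .threshold = VMPRESSURE_MEDIUM,   /* low-level events are not actionable */
    };
    struct vmpressure_event event;
    int fd = vmpressure_fd(&config);

    if (fd < 0) {
        perror("vmpressure_fd");
        return 1;
    }

    /* Reads block until the kernel generates a notification. */
    while (read(fd, &event, sizeof(event)) == sizeof(event)) {
        if (event.pressure >= VMPRESSURE_OOM)
            puts("severe pressure: drop everything droppable");
        else
            puts("medium pressure: trim low-value caches");
    }
    close(fd);
    return 0;
}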
Internally, the virtual memory subsystem has no simple concept of memory pressure, so the patch has to add one. To that end, the "reclaimer inefficiency index" is calculated by looking at the number of pages examined by the reclaimer and how many of those pages could not be reclaimed. The need to look at large numbers of pages to find reclaim candidates indicates that reclaimable pages are getting hard to find — that the system, in other words, is under memory pressure. The index is simply the ratio of reclamation failures to the number of pages examined, expressed as a percentage.
This percentage is calculated over a "window" of pages examined; by default, it is generated each time the reclaimer looks at 256 pages. This value can be changed by tweaking the new vmevent_window sysctl knob. There are also controls to set the levels at which the various notifications occur: vmevent_level_medium (default 60) and vmevent_level_oom (default 99); the "low" level is wired at zero, so it will trigger anytime that the system is actively looking for pages to reclaim.
An additional mechanism exists to detect the out-of-memory case, since it can be hard to distinguish it using just the reclaimer inefficiency index. The reclaim code includes the concept of a "priority" which controls how aggressive it can be to reclaim pages; its value starts at 12 and falls over time as attempts to locate enough pages fail. If the priority falls to four (by default; it can be set with the vmevent_level_oom_priority knob), the system is deemed to be heading into an out-of-memory state and the notification is sent.
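As an illustration of the arithmetic described above (not the actual patch code), the index for one window and the resulting level could be computed as follows, using the default thresholds; the out-of-memory case additionally consults the reclaim priority, which this sketch ignores.

/* Illustration of the reclaimer inefficiency index, not the kernel code.
 * Defaults from the patch: window of 256 pages scanned, medium at 60%,
 * OOM at 99%; "low" is wired at zero, so any reclaim activity triggers it. */
enum vmpressure_level { LEVEL_LOW, LEVEL_MEDIUM, LEVEL_OOM };

static enum vmpressure_level level_for_window(unsigned int scanned,
                                              unsigned int reclaimed)
{
    /* Share of examined pages that could not be reclaimed, in percent. */
    unsigned int failed = scanned - reclaimed;
    unsigned int index = scanned ? (failed * 100) / scanned : 0;

    if (index >= 99)        /* vmevent_level_oom default */
        return LEVEL_OOM;
    if (index >= 60)        /* vmevent_level_medium default */
        return LEVEL_MEDIUM;
    return LEVEL_LOW;
}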
Some reviewers questioned the need for a new system call. We already have a system call — eventfd() — that exists to create file descriptors for notifications from the kernel. Actually using eventfd() tends to involve an interesting dance where the application gets a file descriptor from eventfd(), opens a control file, and writes the file descriptor number into that file to connect it to a specific source of events. But it is an established pattern, and it might be best to stick with it here. Another reviewer suggested using the perf events subsystem, but Anton believes, not without reason, that perf brings a lot of complexity to something that should be relatively simple.
The other complaint has to do with the integration (or lack thereof) with the "memcg" control-group-based memory usage controller. Memcg already has a notification mechanism (described in Documentation/cgroups/memory.txt) that can inform a process when a control group is running out of memory; it might make sense to use the same mechanism for this purpose. Anton responded that the memcg mechanism does not provide the same information, does not account for all memory use, and requires the use of control groups — not always a popular kernel feature. Still, even if vmpressure_fd() is merged as a separate mechanism, it will at least have to be extended to work at the control group level as well. The code shows that this integration has been thought about, but it has not yet been implemented.
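For comparison, the eventfd() dance used by the existing memcg threshold notifications (as described in Documentation/cgroups/memory.txt for cgroup v1) looks roughly like the sketch below; the cgroup path and the 256MB threshold are made-up example values.

/* Rough sketch of the existing memcg (cgroup v1) threshold notification:
 * create an eventfd, then register it against memory.usage_in_bytes by
 * writing "<event_fd> <usage_fd> <threshold>" to cgroup.event_control.
 * The cgroup path and threshold below are illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/mygroup";   /* assumed mount point */
    char path[256], cmd[128];
    uint64_t count;

    int efd = eventfd(0, 0);
    snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cg);
    int ufd = open(path, O_RDONLY);
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int cfd = open(path, O_WRONLY);

    if (efd < 0 || ufd < 0 || cfd < 0) {
        perror("setup");
        return 1;
    }

    /* Ask for a notification when usage crosses 256MB. */
    snprintf(cmd, sizeof(cmd), "%d %d %llu", efd, ufd, 256ULL << 20);
    if (write(cfd, cmd, strlen(cmd)) < 0) {
        perror("cgroup.event_control");
        return 1;
    }

    read(efd, &count, sizeof(count));   /* blocks until the threshold is crossed */
    printf("memory threshold crossed %llu time(s)\n", (unsigned long long)count);
    return 0;
}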
Given these concerns, it seems unlikely that the current patch set will find its way into the mainline. But there is a clear desire for this kind of functionality in all kinds of use cases from very large systems to very small ones (Anton's patches were posted from a linaro.org address). So, one way or another, a kernel in the near future will probably have the ability to inform processes that it is experiencing some memory pressure. The next challenge will then be getting applications to use those notifications and reduce that pressure.
At the moment, realtime development is concentrated in three areas. The first is the ongoing work to mitigate problems with software interrupts; that has been covered here before and has not changed much since then. On the memory management front, the SLUB allocator is now the default for realtime kernels. A few years ago, SLUB was considered hopeless for the realtime environment, but it has improved considerably since then. It now works well when allocating objects from the caches. Anytime it becomes necessary to drop into the page allocator, though, behavior will not be deterministic; there is little to be done about that.
Finally, the latest realtime patches include a new option called PREEMPT_LAZY. Thomas has been increasingly irritated by the throughput penalty experienced by realtime users; PREEMPT_LAZY is an attempt to improve that situation. It is an option that applies only to the scheduling of SCHED_OTHER tasks (the non-realtime part of the workload); it works by deferring context switches after a task is awakened, even if the newly-awakened task has a higher priority. Doing so reduces determinism, but SCHED_OTHER was never meant to be deterministic in the first place. The benefit is a reduction in context switches and a corresponding improvement in SCHED_OTHER throughput.
When SLUB and PREEMPT_LAZY are enabled, the realtime kernel shows a 60% throughput increase with some workloads. Someday, Thomas said (not entirely facetiously), realtime will be faster than mainline, especially for workloads involving a lot of networking. He is looking forward to the day when the realtime kernel offers superior network performance; there should be some interesting conversations with the networking developers when that happens.
The realtime kernel, he claimed in summary, is at production quality.
Looking at all the code that has been produced in the realtime effort, Thomas concluded that, at this point, 95% of it has been upstreamed into the mainline kernel. What is missing before the rest can go upstream is "mainline sellable" solutions for memory management, high-resolution timers (hrtimers), and software interrupts. The memory management work is the most complex, and the patches are still "butt ugly." A significant amount of cleanup work will be required before those patches can make it into the mainline.
The hrtimer code, instead, merely requires harassing the maintainer (a certain Thomas Gleixner) to get it into shape; it is just a matter of time. There needs to be a "less horrible" way to solve the software interrupt problem; the search is on. The rest of the realtime tree, including the infamous locking patches, is all nicely self-contained and should not be a problem for upstreaming.
So what is coming in the future? The next big feature looks to be CPU isolation. This is not strictly a realtime need, but it is useful for some realtime users. CPU isolation gives one or more processors over to user-space code, with no kernel overhead at all (as long as that code does not make any system calls, naturally). It is useful for applications that cannot wait even for a deterministic interrupt response; instead, they poll a device so that they can respond even more quickly to events. There are also high-performance users who pour vast amounts of money into expensive hardware; these users are willing to expend great effort for a 1% performance gain. For some kinds of workloads, the increased cache locality offered by CPU isolation can give an improvement of 3-4%, so it is unsurprising that these users are interested. A number of developers are working on this problem; some sort of solution should be ready before too long.
Also coming is the long-awaited deadline scheduler. According to Thomas, this code shows that, sometimes, it is possible to get useful work out of academic institutions. The solution is close, he said, and could possibly even be ready for the 3.8 merge window. There is also interest in doing realtime work from a KVM guest system; that would, it seems, allow realtime automation tasks to be offloaded into the cloud. Thomas clearly thought that virtualized realtime was a bit of a silly idea, but there is evidently a user community looking for this capability.
Given that things seem so close, Thomas asked himself why things were taking so long; the realtime effort has been going for some eight years now. The answer is that the problems are hard and that the manpower to solve them has been lacking. Noting that few developers have been contributing to the realtime tree, Thomas started to wonder if there was a lack of interest in the concept overall. Perhaps all this work was being done, but nobody was using it?
An opportunity to answer that question presented itself when kernel.org went down for an extended period in 2011. It became necessary to provide an alternative site for people wanting to grab the realtime patches; that, in turn, was the perfect opportunity to obtain download statistics. It turns out that most realtime patch set releases saw about 3,000 downloads within the first three days. About 30% of those went to European corporations, 25% to American corporations, 20% to Asian corporations, 5% to academic institutions, and 20% "all over." 75% of the downloads, he said, were done by users at identifiable corporations constituting a "who's who" of the industry.
All told, there were 2,250 corporations that downloaded the realtime patch set while this experiment was taking place. Of all those companies, less than 5% reported back to the realtime developers in any way, be it a patch, a bug report, or even an "it works, thanks" sort of message. A 5% response rate may seem good; it should be enough to get the bugs fixed. But a further look showed that 80% of the reports came from Red Hat and the Open Source Automation Development Laboratory; add in Thomas's company Linutronix, and the number goes up to 90%. "What," he asked the audience, "are the rest of you doing?"
Thomas's conclusion is that something is wrong. Perhaps we are seeing a return of the embedded nightmare in a new form? As in those days, he does see private reports from companies that are keeping all of their work secret. Private reports are better than nothing, but he would really like to see more participation in the community: more success reports, bug reports, documentation contributions, and fixes. Even incorrect fixes are a great thing; they give a lot of information about the problem and ease the process of making a proper fix.
To conclude, Thomas noted that some people have complained that his roadmap slides are insufficiently serious. In response, he said, he took a few days off and attended a marketing course; that has allowed him to produce a more suitable roadmap.
Perhaps the best conclusion to be drawn from this roadmap is that Thomas is unlikely to switch from development to marketing anytime in the near future. That is almost certainly good news for Linux users — and probably for marketing people as well.
[Your editor would like to thank the Linux Foundation for funding his travel to Barcelona.]
Patches and updates
Kernel trees
Architecture-specific
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds