
Kernel development

Brief items

Kernel release status

The current development kernel is 3.7-rc5, released on November 11. "This is quite a small -rc, I'm happy to say. -rc4 was already fairly calm, and -rc5 has fewer commits still. And more importantly, apart from one revert, and a pinctrl driver update, it's not just a fairly small number of commits, they really are mostly one-liners."

Stable updates: no stable updates have been released in the last week. 3.2.34 is in the review process as of this writing; it can be expected on or after November 16.

Comments (none posted)

Quotes of the week

I for one do mourn POSIX, and standardization in general. I think it's very sad that a lot of stuff these days is moving forward without going through a rigorous standardization. We had this little period known affectionately as the "Unix Wars" in the 1980s/90s and we're well on our way to a messy repeat in the Linux space.
Jon Masters

You're missing something; that is one of the greatest powers of open source. The many eyes (and minds) effect. Someone out there probably has a solution to whatever problem, the trick is to find that person.
Russell King

Unfortunately there is no EKERNELSCREWEDUP, so we usually use EINVAL.
Andrew Morton

Comments (8 posted)

Introducing RedPatch (Ksplice Blog)

Back in early 2011, we looked at changes to the way Red Hat distributed its kernel changes. Instead of separate patches, it switched to distributing a tarball of the source tree—a move which was met with a fair amount of criticism. The Ksplice team at Oracle has just announced the availability of a Git tree that breaks the changes up into individual patches again. "The Ksplice team is happy to announce the public availability of one of our git repositories, RedPatch. RedPatch contains the source for all of the changes Red Hat makes to their kernel, one commit per fix and we've published it on oss.oracle.com/git. With RedPatch, you can access the broken-out patches using git, browse them online via gitweb, and freely redistribute the source under the terms of the GPL." (Thanks to Dmitrijs Ledkovs.)

Comments (85 posted)

Masters: ARM atomic operations

Jon Masters has put together a summary of how atomic operations work on the ARM architecture for those who are not afraid of the grungy details. "To provide for atomic access to a given memory location, ARM processors implement a reservation engine model. A given memory location is first loaded using a special 'load exclusive' instruction that has the side-effect of setting up a reservation against that given address in the CPU-local reservation engine. When the modified value is later written back into memory, using the corresponding 'store exclusive' processor instruction, the reservation engine verifies that it has an outstanding reservation against that given address, and furthermore confirms that no external agents have interfered with the memory commit. A register returns success or failure."
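
For readers who want to see what that sequence looks like in code, here is a minimal sketch of an atomic increment built on the exclusive-access instructions; it is modeled loosely on the kernel's arch/arm atomic helpers and is an illustration, not code from the article.

    /*
     * Illustrative sketch of the ldrex/strex reservation loop described
     * above; not the article's code, and simplified relative to the
     * kernel's own arch/arm/include/asm/atomic.h.
     */
    static inline void atomic_inc_example(volatile int *counter)
    {
        int newval;             /* the updated value                   */
        unsigned long failed;   /* strex result: 0 means success       */

        __asm__ __volatile__(
        "1: ldrex   %0, [%2]\n"     /* load value, set up reservation  */
        "   add     %0, %0, #1\n"   /* modify the loaded value         */
        "   strex   %1, %0, [%2]\n" /* try the store; 0 only if the    */
        "   teq     %1, #0\n"       /*   reservation was still intact  */
        "   bne     1b"             /* something interfered; retry     */
        : "=&r" (newval), "=&r" (failed)
        : "r" (counter)
        : "cc", "memory");
    }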

Comments (5 posted)

A break for linux-next

Linux-next maintainer Stephen Rothwell has announced that the November 15 linux-next release will be the last until November 26.

Full Story (comments: none)

3.5.x to get extended stable support

Herton Ronaldo Krzesinski has announced that Canonical intends to maintain the 3.5.x stable kernel, which is shipped in the Ubuntu 12.10 release. This kernel will be supported for as long as 12.10 itself, currently planned to be through the end of March 2014.

Full Story (comments: 1)

Kernel development news

NUMA in a hurry

By Jonathan Corbet
November 14, 2012
The kernel's behavior on non-uniform memory access (NUMA) systems is, by most accounts, suboptimal; processes tend to get separated from their memory, leading to lots of cross-node traffic and poor performance. Until now, the work to improve this situation has been a story of two competing patch sets; it recently appeared that one of them may be set to be merged as the result of decisions made outside of the community's view. But nothing in memory management is ever simple, so it should be unsurprising that the NUMA scheduling discussion has become more complicated.

On November 6, memory management hacker Mel Gorman, who had not contributed code of his own toward NUMA scheduling so far, posted a new patch series called "Foundation for automatic NUMA balancing," or "balancenuma" for short. He pointed out that there were objections to both of the existing approaches to NUMA scheduling and that it was proving hard to merge the best from each. So his objective was to add enough infrastructure to the memory management subsystem to make it easy to experiment with different NUMA placement policies. He also implemented a placeholder policy of his own:

The actual policy it implements is a very stupid greedy policy called "Migrate On Reference Of pte_numa Node (MORON)". While stupid, it can be faster than the vanilla kernel and the expectation is that any clever policy should be able to beat MORON.

In short, the MORON policy works by instantly migrating pages whenever a cross-node reference is detected using the NUMA hinting mechanism. Mel's second version, posted one week later, fixes a number of problems, adds the "home node" concept (that tries to keep processes and their memory on a single "home" NUMA node), and adds some statistics gathering to implement a "CPU follows memory" policy that can move a process to a new home node if it appears that better memory locality would result.
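
As a rough illustration of what "instantly migrating pages" means in practice, a NUMA hinting fault handler under this policy would do little more than the following pseudocode; the function and helper names here are placeholders rather than Mel's actual code.

    /*
     * Pseudocode sketch of the MORON policy: on a NUMA hinting fault,
     * move the page to the node that just touched it. The names are
     * illustrative placeholders, not the balancenuma patch itself.
     */
    static void handle_numa_hinting_fault(struct page *page)
    {
        int page_nid = page_to_nid(page);   /* node the page lives on now */
        int this_nid = numa_node_id();      /* node of the faulting CPU   */

        if (page_nid != this_nid)
            migrate_page_to_node(page, this_nid);   /* hypothetical helper */
    }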

Andrea Arcangeli, author of the AutoNUMA approach, said that balancenuma "looks OK" and that AutoNUMA could be built on top of it. Ingo Molnar, instead, was less accepting, saying "I've picked up a number of cleanups from your series and propagated them into tip:numa/core tree." He later added a request that Mel rebase his work on top of the numa/core tree. He clearly did not see the patch set as a "foundation" on which to build. A new numa/core patch set was posted on November 13.

Peter Zijlstra, meanwhile, has posted an "enhanced NUMA scheduling with adaptive affinity" patch set. This one does away with the "home node" concept altogether; instead, it looks at memory access patterns to determine where a process's memory lives and who that memory might be shared with. Based on that information, the CPU affinity mechanism is used to move processes to the appropriate nodes. Peter says:

Note that this adaptive NUMA affinity mechanism integrated into the scheduler is essentially free of heuristics - only the access patterns determine which tasks are related and grouped. As a result this adaptive affinity code is able to move both threads and processes close(r) to each other if they are related - and let them spread if they are not.

This patch set has not gotten a lot of review comments, and it does not appear to have been folded into the numa/core series as of this writing.

What will happen in 3.8?

The numa/core approach remains in linux-next, which is meant to be the final staging area for code headed into the mainline. And, indeed, Ingo has reiterated that he plans to merge this code for the 3.8 cycle, saying "numa/core sums up the consensus so far." The use of that language might rightly raise some eyebrows; when there are between two and four competing patch sets (depending on how one counts) aimed at the same problem, the term "consensus" does not usually come to mind. And, indeed, it seems that this consensus does not yet exist.

Andrew Morton has been overtly grumpy; the existence of numa/core in linux-next has made the management of his tree (which is based on linux-next) difficult — his tree needs to be ready for the 3.8 merge window where, he thinks, numa/core should not be under consideration:

And yes, I'm assuming you're not targeting 3.8. Given the history behind this and the number of people who are looking at it, that's too hasty... And I must say that I deeply regret not digging my heels in when this went into -next all those months ago. It has caused a ton of trouble for me and for a lot of other people.

Hugh Dickins, a developer who is not normally associated with this sort of discussion, chimed in as well:

People are still reviewing and comparing competing solutions. Maybe this latest will prove to be closest to the right answer, maybe it will not. It's, what, about two days old right now?

If we had wanted to push in a good solution a little prematurely, we would surely have chosen Andrea's AutoNUMA months ago, despite efforts to block it; and maybe we shall still want to go that way.

Please, forget about v3.8, cut this branch out of linux-next, and seek consensus around getting it right for v3.9.

Rik van Riel agreed, saying "Having unreviewed (some of it NAKed) code sitting in tip.git and you trying to force it upstream is not the right way to go." He also suggested that, if anything should be considered for merging in 3.8, it would be Mel's foundation patches.

And that is where the discussion stands as of this writing. There is a lot of uncertainty about what might happen with NUMA scheduling in 3.8, meaning that, most likely, nothing will happen at all. It is highly unlikely that Linus would merge the numa/core set in the face of the above complaints; he would be far more likely to sit back and tell the developers involved to work out something they can all agree with. So this is a discussion that might go on for a while yet.

Making changes to the memory management subsystem is a famously hard thing to do, especially when the changes are as large as those being considered here. But there is another factor that is complicating this particular situation. As the term "NUMA scheduling" suggests, this is not just a memory management problem. The path to improved NUMA performance will require coordinated changes to — and greater integration between — the memory management subsystem and the CPU scheduler. It's telling that the developers on one side of this divide are primarily associated with scheduler development, while those on the other side are mostly memory management folks. Each camp is, in a sense, invading the other's turf in an attempt to create a comprehensive solution to the problem; it is not surprising that some disagreements have emerged.

Also implicit in this situation is that Linus is unlikely to attempt to resolve the disagreement by decree. There are too many developers and too many interrelated core subsystems involved. So some sort of rough consensus will have to be found. Your editor's explicitly unreliable prediction is that little NUMA-related work will be merged in the 3.8 development cycle. Under pressure from several directions, the developers involved will figure out how to resolve their biggest differences in the next few months. The resulting code will likely be at least partially merged for 3.9 — later than many would wish, but the end result is likely to be better than would be seen with a patch set rushed into 3.8.

Comments (none posted)

vmpressure_fd()

By Jonathan Corbet
November 14, 2012
One of the nice features of virtual memory is that applications do not have to concern themselves with how much memory is actually available in the system. One need not try to get too much work done to realize that some applications (or their developers) have taken that notion truly to heart. But it has often been suggested that the system as a whole would work better if interested applications could be informed when memory is tight. Those applications could react to that news by reducing their memory requirements, hopefully heading off thrashing or out-of-memory situations. The latest proposal along those lines is a new system call named vmpressure_fd(); it is unlikely to be merged in its current form, but it still merits a look.

The idea behind Anton Vorontsov's vmpressure_fd() patch set is to create a mechanism by which the kernel can inform user space when the system is under memory pressure. An application using this call would start by filling out a vmpressure_config structure:

    #include <linux/vmpressure.h>

    struct vmpressure_config {
        __u32 size;
        __u32 threshold;
    };

The size field should hold the size of the structure; it is there as a sort of versioning mechanism should more fields be added to the structure in the future. The threshold field indicates the minimum level of notification the application is interested in; the available levels are:

VMPRESSURE_LOW
The system is out of free memory and is having to reclaim pages to satisfy new allocations. There is no particular trouble in performing that reclamation, though, so the memory pressure, while non-zero, is low.

VMPRESSURE_MEDIUM
A medium level of memory pressure is being experienced — enough, perhaps, to cause some swapping to occur.

VMPRESSURE_OOM
Memory pressure is at desperate levels, and the system may be about to fall prey to the depredations of the out-of-memory killer.

An application might choose to do nothing at low levels of memory pressure, clean up some low-value caches at the medium level, and clean up everything possible at the highest level of pressure. In this case, it would probably set threshold to VMPRESSURE_MEDIUM, since notifications at the VMPRESSURE_LOW level are not actionable.

Signing up for notifications is a simple matter:

    int vmpressure_fd(struct vmpressure_config *config);

The return value is a file descriptor that can be read to obtain pressure events in this format:

    struct vmpressure_event {
        __u32 pressure;
    };

The current interface only supports blocking reads, so a read() on the returned file descriptor will not return until a pressure notification has been generated. Applications can use poll() to determine whether a notification is available; the current patch does not support asynchronous notification via the SIGIO signal.
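
Putting the pieces together, a consumer of this interface might look like the sketch below; it assumes a C library wrapper for the proposed vmpressure_fd() call and the structures shown above, none of which exist in a released kernel.

    /*
     * Sketch of a vmpressure_fd() consumer. The system call was never
     * merged, so this is illustrative only.
     */
    #include <stdio.h>
    #include <unistd.h>
    #include <linux/vmpressure.h>

    int main(void)
    {
        struct vmpressure_config config = {
            .size      = sizeof(config),
            .threshold = VMPRESSURE_MEDIUM,  /* skip "low" notifications */
        };
        struct vmpressure_event event;
        int fd = vmpressure_fd(&config);

        if (fd < 0)
            return 1;

        /* Blocking reads: each one returns a single pressure event. */
        while (read(fd, &event, sizeof(event)) == sizeof(event)) {
            if (event.pressure == VMPRESSURE_OOM)
                puts("drop every cache we can");
            else
                puts("trim low-value caches");
        }
        return 0;
    }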

Internally, the virtual memory subsystem has no simple concept of memory pressure, so the patch has to add one. To that end, the "reclaimer inefficiency index" is calculated by looking at the number of pages examined by the reclaimer and how many of those pages could not be reclaimed. The need to look at large numbers of pages to find reclaim candidates indicates that reclaimable pages are getting hard to find — that, in other words, the system is under memory pressure. The index is simply the ratio of reclamation failures to the number of pages examined, expressed as a percentage.

This percentage is calculated over a "window" of pages examined; by default, it is generated each time the reclaimer looks at 256 pages. This value can be changed by tweaking the new vmevent_window sysctl knob. There are also controls to set the levels at which the various notifications occur: vmevent_level_medium (default 60) and vmevent_level_oom (default 99); the "low" level is wired at zero, so it will trigger anytime that the system is actively looking for pages to reclaim.
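
Expressed as (hypothetical) code, the level calculation amounts to something like this sketch, using the default thresholds described above; the function name is invented, not the actual patch code.

    /*
     * Illustrative mapping from the reclaimer inefficiency index to a
     * notification level, evaluated once per window (vmevent_window
     * pages, 256 by default).
     */
    static unsigned int pressure_level(unsigned long scanned,
                                       unsigned long reclaimed)
    {
        unsigned int index;

        if (scanned == 0)
            return VMPRESSURE_LOW;  /* nothing examined in this window */

        /* Failures as a percentage of pages examined. */
        index = (scanned - reclaimed) * 100 / scanned;

        if (index >= 99)            /* vmevent_level_oom    */
            return VMPRESSURE_OOM;
        if (index >= 60)            /* vmevent_level_medium */
            return VMPRESSURE_MEDIUM;
        return VMPRESSURE_LOW;      /* "low" is wired at zero */
    }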

An additional mechanism exists to detect the out-of-memory case, since it can be hard to distinguish using the reclaimer inefficiency index alone. The reclaim code includes the concept of a "priority" that controls how aggressively it works to reclaim pages; its value starts at 12 and falls over time as attempts to locate enough pages fail. If the priority falls to four (by default; it can be set with the vmevent_level_oom_priority knob), the system is deemed to be heading into an out-of-memory state and the notification is sent.

Some reviewers questioned the need for a new system call. We already have a system call — eventfd() — that exists to create file descriptors for notifications from the kernel. Actually using eventfd() tends to involve an interesting dance where the application gets a file descriptor from eventfd(), opens a sysfs file, and writes the file descriptor number into the sysfs file to connect it to a specific source of events. But it is an established pattern that might be best maintained here. Another reviewer suggested using the perf events subsystem, but Anton believes, not without reason, that perf brings a lot of complexity to something that should be relatively simple.

The other complaint has to do with the integration (or lack thereof) with the "memcg" control-group-based memory usage controller. Memcg already has a notification mechanism (described in Documentation/cgroups/memory.txt) that can inform a process when a control group is running out of memory; it might make sense to use the same mechanism for this purpose. Anton responded that the memcg mechanism does not provide the same information, does not account for all memory use, and requires the use of control groups — not always a popular kernel feature. Still, even if vmpressure_fd() is merged as a separate mechanism, it will at least have to be extended to work at the control-group level as well. The code shows that this integration has been thought about, but it has not yet been implemented.
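
For comparison, the existing eventfd() "dance" mentioned above looks roughly like the following sketch, based on the memcg usage-threshold notifications described in Documentation/cgroups/memory.txt; the cgroup path and the 100MB threshold are arbitrary assumptions.

    /*
     * Sketch of the memcg/eventfd notification dance: create an eventfd,
     * open a memcg control file, and register the pair by writing to
     * cgroup.event_control. Paths and threshold are example values only.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    int main(void)
    {
        int efd = eventfd(0, 0);
        int ufd = open("/sys/fs/cgroup/memory/mygroup/memory.usage_in_bytes",
                       O_RDONLY);
        int cfd = open("/sys/fs/cgroup/memory/mygroup/cgroup.event_control",
                       O_WRONLY);
        char buf[64];
        uint64_t count;

        /* Format: "<eventfd> <fd of control file> <threshold in bytes>" */
        snprintf(buf, sizeof(buf), "%d %d %llu",
                 efd, ufd, (unsigned long long)(100 << 20));
        write(cfd, buf, strlen(buf));

        read(efd, &count, sizeof(count));   /* blocks until threshold crossed */
        printf("memory usage threshold crossed\n");
        return 0;
    }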

Given these concerns, it seems unlikely that the current patch set will find its way into the mainline. But there is a clear desire for this kind of functionality in all kinds of use cases from very large systems to very small ones (Anton's patches were posted from a linaro.org address). So, one way or another, a kernel in the near future will probably have the ability to inform processes that it is experiencing some memory pressure. The next challenge will then be getting applications to use those notifications and reduce that pressure.

Comments (1 posted)

LCE: Realtime, present and future

By Jonathan Corbet
November 13, 2012
As the standing-room-only crowd at Thomas Gleixner's LinuxCon Europe 2012 talk showed, there is still a lot of interest in the progress of the realtime preemption patch set. Your editor attended with the main purpose of heckling Thomas, thinking that our recent Realtime Linux Workshop coverage would include most of the information to be heard in Barcelona. As it turns out, though, there were some new things to be learned, along with some concerns about the possible return of an old problem in a new form.

At the moment, realtime development is concentrated in three areas. The first is the ongoing work to mitigate problems with software interrupts; that has been covered here before and has not changed much since then. On the memory management front, the SLUB allocator is now the default for realtime kernels. A few years ago, SLUB was considered hopeless for the realtime environment, but it has improved considerably since then. It now works well when allocating objects from the caches. Anytime it becomes necessary to drop into the page allocator, though, behavior will not be deterministic; there is little to be done about that.

Finally, the latest realtime patches include a new option called PREEMPT_LAZY. Thomas has been increasingly irritated by the throughput penalty experienced by realtime users; PREEMPT_LAZY is an attempt to improve that situation. It is an option that applies only to the scheduling of SCHED_OTHER tasks (the non-realtime part of the workload); it works by deferring context switches after a task is awakened, even if the newly-awakened task has a higher priority. Doing so reduces determinism, but SCHED_OTHER was never meant to be deterministic in the first place. The benefit is a reduction in context switches and a corresponding improvement in SCHED_OTHER throughput.

When SLUB and PREEMPT_LAZY are enabled, the realtime kernel shows a 60% throughput increase with some workloads. Someday, Thomas said (not entirely facetiously), realtime will be faster than mainline, especially for workloads involving a lot of networking. He is looking forward to the day when the realtime kernel offers superior network performance; there should be some interesting conversations with the networking developers when that happens.

The realtime kernel, he claimed in summary, is at production quality.

Looking at all the code that has been produced in the realtime effort, Thomas concluded that, at this point, 95% of it has been upstreamed into the mainline kernel. What is missing before the rest can go upstream is "mainline sellable" solutions for memory management, high-resolution timers (hrtimers), and software interrupts. The memory management work is the most complex, and the patches are still "butt ugly." A significant amount of cleanup work will be required before those patches can make it into the mainline.

The hrtimer code, instead, just requires harassing the maintainer (a certain Thomas Gleixner) to get it into shape; it's just a matter of time. There needs to be a "less horrible" way to solve the software interrupt problem; the search is on. The rest of the realtime tree, including the infamous locking patches, is all nicely self-contained and should not be a problem for upstreaming.

So what is coming in the future? The next big feature looks to be CPU isolation. This is not strictly a realtime need, but it is useful for some realtime users. CPU isolation gives one or more processors over to user-space code, with no kernel overhead at all (as long as that code does not make any system calls, naturally). It is useful for applications that cannot wait even for a deterministic interrupt response; instead, they poll a device so that they can respond even more quickly to events. There are also high-performance users who pour vast amounts of money into expensive hardware; these users are willing to expend great effort for a 1% performance gain. For some kinds of workloads, the increased cache locality offered by CPU isolation can give an improvement of 3-4%, so it is unsurprising that these users are interested. A number of developers are working on this problem; some sort of solution should be ready before too long.

Also coming is the long-awaited deadline scheduler. According to Thomas, this code shows that, sometimes, it is possible to get useful work out of academic institutions. The solution is close, he said, and could possibly even be ready for the 3.8 merge window. There is also interest in doing realtime work from a KVM guest system. That will allow us to offload our realtime automation tasks into the cloud. Thomas clearly thought that virtualized realtime was a bit of a silly idea, but there is evidently a user community looking for this capability.

Where are the contributors?

Given that things seem so close, Thomas asked himself why things were taking so long; the realtime effort has been going for some eight years now. The answer is that the problems are hard and that the manpower to solve them has been lacking. Noting that few developers have been contributing to the realtime tree, Thomas started to wonder if there was a lack of interest in the concept overall. Perhaps all this work was being done, but nobody was using it?

An opportunity to answer that question presented itself when kernel.org went down for an extended period in 2011. It became necessary to provide an alternative site for people wanting to grab the realtime patches; that, in turn, was the perfect opportunity to obtain download statistics. It turns out that most realtime patch set releases saw about 3,000 downloads within the first three days. About 30% of those went to European corporations, 25% to American corporations, 20% to Asian corporations, 5% to academic institutions, and 20% "all over." 75% of the downloads, he said, were done by users at identifiable corporations constituting a "who's who" of the industry.

All told, there were 2,250 corporations that downloaded the realtime patch set while this experiment was taking place. Of all those companies, less than 5% reported back to the realtime developers in any way, be it a patch, a bug report, or even an "it works, thanks" sort of message. A 5% response rate may seem good; it should be enough to get the bugs fixed. But a further look showed that 80% of the reports came from Red Hat and the Open Source Automation Development Laboratory; add in Thomas's company Linutronix, and the number goes up to 90%. "What," he asked the audience, "are the rest of you doing?"

Thomas's conclusion is that something is wrong. Perhaps we are seeing a return of the embedded nightmare in a new form? As in those days, he does see private reports from companies that are keeping all of their work secret. Private reports are better than nothing, but he would really like to see more participation in the community: more success reports, bug reports, documentation contributions, and fixes. Even incorrect fixes are a great thing; they give a lot of information about the problem and ease the process of making a proper fix.

To conclude, Thomas noted that some people have complained that his roadmap slides are insufficiently serious. In response, he said, he took a few days off and took a marketing course; that has allowed him to produce a more suitable roadmap that looks like this:

[Thomas's new roadmap]

Perhaps the best conclusion to be drawn from this roadmap is that Thomas is unlikely to switch from development to marketing anytime in the near future. That is almost certainly good news for Linux users — and probably for marketing people as well.

[Your editor would like to thank the Linux Foundation for funding his travel to Barcelona.]

Comments (45 posted)

Patches and updates

Kernel trees

Architecture-specific

Development tools

Device drivers

Documentation

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

  • Lucas De Marchi: kmod 11 (November 11, 2012)

Page editor: Jonathan Corbet


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds