
Kernel development

Brief items

Kernel release status

The 3.12 kernel is out, released on November 3. "I was vacillating whether to do an rc8 or just cut the final 3.12, but since the biggest reason to *not* do a final release was not so much the state of the code, as simply the fact that I'll be traveling with very bad internet connection next week, I didn't really want to delay the release."

Some of the main features in this release include improvements to the dynamic tick code, support infrastructure for DRM render nodes, TSO sizing and the FQ scheduler in the network layer, support for user namespaces in the XFS filesystem, multithreaded RAID5 in the MD subsystem, offline data deduplication in the Btrfs filesystem, and more. See the KernelNewbies 3.12 page for more information.

Linus noted a couple of other things in the announcement. One is that the 3.13 merge window will not be starting for another week. He is also starting to think about an eventual 4.0 release, and has tossed out the idea of having 4.0 be a bugfix-only release, though he has his doubts as to whether it would work. "But I do wonder.. Maybe it would be possible, and I'm just unfairly projecting my own inner squirrel onto other kernel developers. If we have enough heads-up that people *know* that for one release (and companies/managers know that too) the only patches that get accepted are the kind that fix bugs, maybe people really would have sufficient attention span that it could work."

Stable updates: 3.11.7, 3.10.18, and 3.4.68 were released on November 5.

Comments (none posted)

Quotes of the week

Overall as a community I think we still rely on enterprise distributions and their partners doing a complete round of testing on major releases as a backstop. It's not an ideal situation but we are never going to have enough dedicated physical, financial or human resources to do a complete job.
Mel Gorman

It's not developers that typically determine the stability of a subsystem but _maintainers_, and the primary method of stabilization is, beyond being careful when merging a patch, is to remember/monitor breakages and not merge new feature patches from a developer until fixable bugs are fixed by the developer.
Ingo Molnar

I wonder if it should be called kernel/locks, as that's less to type, smaller path names, and tastes good on bagels.
Steven Rostedt

Comments (none posted)

Support for atomic block I/O operations

By Jonathan Corbet
November 6, 2013
Some newer storage devices have the ability to perform atomic I/O operations. An atomic operation will either succeed or fail as a unit; if multiple blocks are to be written, they will all make it to persistent storage or none will. This feature has the potential to improve life considerably at the higher levels, but the kernel currently has no way to support it.

Chris Mason's atomic I/O patch set aims to fix that situation. It allows a file to be opened with the O_ATOMIC and O_DIRECT flags (only direct I/O is supported) to request atomic semantics. Thereafter, every write() call will be executed atomically if the hardware supports it. This feature is, thus, quite easy to use from user space.

Within the kernel, there is a new function available to block drivers:

    void blk_queue_set_atomic_write(struct request_queue *q,
                                    unsigned int segments);

This function tells the block layer that the device behind the given request queue can perform atomic operations up to the given number of segments (separate ranges of blocks on the storage medium). Thereafter, I/O requests may arrive with the REQ_ATOMIC flag set to request atomic execution. The block layer will ensure that the maximum segment count is not exceeded.

One can imagine a number of uses for this functionality. A journaling filesystem could, for example, use it to write out the journal and the commit block together, knowing that said commit block will only be visible if everything else was successfully written. But, Chris says, the first target is MySQL:

O_ATOMIC | O_DIRECT allows mysql and friends to disable double buffering. This cuts their write IO in half, making them roughly 2x more flash friendly.

The patch set (which does not add support to any block drivers) is relatively small and simple, so it stands a good chance of being merged in the very near future.

Comments (3 posted)

Kernel development news

The pernicious USB-stick stall problem

By Jonathan Corbet
November 6, 2013
Artem S. Tashkinov recently encountered a problem that will be familiar to at least some LWN readers. Plug a slow storage device (a USB stick, say, or a media player) into a Linux machine and write a lot of data to it. The entire system proceeds to just hang, possibly for minutes. Things eventually come back, but, by then, the user may well have given up in disgust and gone for a beer or two.

This time around, though, Artem made an interesting observation: the system would stall when running with a 64-bit kernel, but no such problem was experienced when using a 32-bit kernel on the same hardware. One might normally expect the block I/O subsystem to be reasonably well isolated from details like the word length of the processor, but, in this case, one would be surprised.

The problem

Linus was quick to understand what was going on here. It all comes down to the problem of matching the rate at which a process creates dirty memory to the rate at which that memory can be written to the underlying storage device. If a process is allowed to dirty a large amount of memory, the kernel will find itself committed to writing a chunk of data that might take minutes to transfer to persistent storage. All that data clogs up the I/O queues, possibly delaying other operations. And, as soon as somebody calls sync(), things stop until that entire queue is written. It's a storage equivalent to the bufferbloat problem.

The developers responsible for the memory management and block I/O subsystems are not entirely unaware of this problem. To prevent it from happening, they have created a set of tweakable knobs under /proc/sys/vm to control what happens when processes create a lot of dirty pages. These knobs are:

  • dirty_background_ratio specifies a percentage of memory; when at least that percentage is dirty, the kernel will start writing those dirty pages back to the backing device. So, if a system has 1000 pages of memory and dirty_background_ratio is set to 10% (the default), writeback will begin when 100 pages have been dirtied.

  • dirty_ratio specifies the percentage at which processes that are dirtying pages are made to wait for writeback. If it is set to 20% (again, the default) on that 1000-page system, a process dirtying pages will be made to wait once the 200th page is dirtied. This mechanism will, thus, slow the dirtying of pages while the system catches up.

  • dirty_background_bytes works like dirty_background_ratio except that the limit is specified as an absolute number of bytes.

  • dirty_bytes is the equivalent of dirty_ratio except that, once again, it is specified in bytes rather than as a percentage of total memory.

Setting these limits too low can affect performance: temporary files that will be deleted immediately will end up being written to persistent storage, and smaller I/O operations can lead to lower I/O bandwidth and worse on-disk placement. Setting the limits too high, instead, can lead to the sort of overbuffering described above.

The attentive reader may well be wondering: what happens if the administrator sets both dirty_ratio and dirty_bytes, especially if the values don't agree? The way things work is that either the percentage-based or byte-based limit applies, but not both. The one that applies is simply the one that was set last; so, for example, setting dirty_background_bytes to some value will cause dirty_background_ratio to be set to zero and ignored.

Two other details are key to understanding the behavior described by Artem: (1) by default, the percentage-based policy applies, and (2) on 32-bit systems, that ratio is calculated relative to the amount of low memory — the memory directly addressable by the kernel, not the full amount of memory in the system. In almost all 32-bit systems, only the first ~900MB of memory fall into the low-memory region. So on any current system with a reasonable amount of memory, a 64-bit kernel will implement dirty_background_ratio and dirty_ratio differently than a 32-bit system will. For Artem's 16GB system, the 64-bit dirty_ratio limit would be 3.2GB; the 32-bit system, instead, sets the limit at about 180MB.

The (huge) difference between these two limits is immediately evident when writing a lot of data to a slow storage device. The lower limit does not allow anywhere near as much dirty data to accumulate before throttling the process doing the writing, with much better results for the user of the system (unless said user wanted to give up in disgust and go for beer, of course).

Workarounds and fixes

When the problem is that clearly understood, one can start to talk about solutions. Linus suggested that anybody running into this kind of problem can work around it now by setting dirty_background_bytes and dirty_bytes to reasonable values. But it is generally agreed that the defaults just don't make much sense on contemporary 64-bit systems. In fact, according to Linus, the percentage-based limits have outlived their usefulness in general:

The percentage notion really goes back to the days when we typically had 8-64 *megabytes* of memory. So if you had a 8MB machine you wouldn't want to have more than one megabyte of dirty data, but if you were "Mr Moneybags" and could afford 64MB, you might want to have up to 8MB dirty!!

Things have changed.

Thus, he suggested, the defaults should be changed to use the byte-based limits; either that, or the percentage-based limits could be deemed to apply only to the first 1GB of memory.

Of course, it would be nicer to have smarter behavior in the kernel. The limit that applies to a slow USB device may not be appropriate for a high-speed storage array. The kernel has logic now that tries to estimate the actual writeback speeds achievable with each attached device; with that information, one could try to limit dirty pages based on the amount of time required to write them all out. But, as Mel Gorman noted, this approach is "not that trivial to implement".

Andreas Dilger argued that the whole idea of building up large amounts of dirty data before starting I/O is no longer useful. The Lustre filesystem, he said, will start I/O with 8MB or so of dirty data; he thinks that kind of policy (applied on a per-file basis) could solve a lot of problems with minimal complexity. Dave Chinner, however, sees a more complex world where that kind of policy will not work for a wide range of workloads.

Dave, instead, suggests that the kernel focus on implementing two fundamental policies: "writeback caching" (essentially how things work now) and "writethrough caching," where much lower limits apply and I/O starts sooner. Writeback would be used for most workloads, but writethrough makes sense for slow devices or sequential streaming I/O patterns. The key, of course, is enabling the kernel to figure out which policy should apply in each case without the need for user intervention. There are some obvious indications, including various fadvise() calls or high-bandwidth sequential I/O, but, doubtless, there would be details to be worked out.

In the short term, though, we're most likely to see relatively simple fixes. Linus has posted a patch limiting the percentage-based calculations to the first 1GB of memory. This kind of change could conceivably be merged for 3.13; fancier solutions, obviously, will take longer.

Comments (42 posted)

Ktap almost gets into 3.13

By Jonathan Corbet
November 6, 2013
The ktap project surprised almost everybody when it made its 0.1 release last May. In short, ktap is a dynamic tracing tool; it works by embedding a Lua interpreter within the kernel and hooking it into the existing tracepoint mechanism. A suitably privileged user can load a script into the kernel; that script can enable tracepoints, boil down the resulting data, and return information back to user space. It thus fills a niche similar to that of SystemTap or DTrace, but with a smaller, simpler, and (for now) less-functional code base.
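For a flavor of the language, a ktap script looks roughly like this (syntax loosely based on the project's 0.x sample scripts; details may have changed since):

```
# Print the process name on every entry to the open() syscall
trace syscalls:sys_enter_open {
        print(execname())
}
```

The script is compiled to Lua bytecode, loaded into the in-kernel interpreter, and attached to the named tracepoint; aggregation and output then happen without a round trip to user space for every event.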

Ktap was reasonably well received from the outset, with a number of developers welcoming this functionality. The work was presented at LinuxCon Japan, and the 0.2 release followed at the end of July. After that, things quieted down for a bit until mid-October, when Greg Kroah-Hartman announced (on Google+) that ktap had been merged into the staging tree for the 3.13 release. That cycle, which already looked to be relatively feature-heavy, appeared set to acquire a dynamic tracing framework as well.

It subsequently became clear, though, that some developers hadn't seen Greg's post and weren't happy. On October 24, Ingo Molnar sent a protest over the imminent merging of ktap. He added a "Nacked-by" for good measure; Steven Rostedt then followed up with a NACK of his own. After a Kernel Summit conversation with Ingo, Greg reverted the changes, removing ktap from the staging tree and, thus, from the queue for the 3.13 kernel. Those who have been waiting for this kind of dynamic tracing functionality in the Linux kernel will have to wait a bit longer.

What happened, and what's next

At first glance, this incident could be mistaken as an example of a newcomer being rejected by the incumbent developers of the existing tracing functionality. One need not look much further, though, to see that a rather different explanation applies. The rejection of ktap, which looks highly likely to be a temporary affair, is more a story about the right and wrong ways to get new functionality into the kernel.

The first and foremost mistake is that the patches were never actually posted to the linux-kernel mailing list for review. Ktap developer Jovi Zhangwei has posted his release announcements there, and the code has always been available in a public git repository. But getting review for code in a repository is never easy; that certainly proved to be true in this case. While some developers knew about ktap and what it could do, few had actually looked at the implementation.

As Ingo pointed out in his objection, the form of the code in the staging tree (a single changeset adding 16,000 lines of code) did not make the review process any easier. Neither did the lack of documentation about the design of ktap.

The biggest complaint, though, was about the lack of integration with the kernel's existing tracing functionality. Ingo pointed out a couple of areas where that kind of integration might be useful:

  • The kernel currently has a simple interpreter used to implement filter conditions on tracepoints. The Lua bytecode interpreter offers a rather richer execution environment; it could perhaps be put to work implementing a more comprehensive conditional evaluation mechanism for tracepoints.

  • Rather than requiring a new set of commands, ktap scripts could be made part of the perf probe command, tying them more firmly into the existing system. Ingo's goal is clearly to have a single set of comprehensive tracing functionality that can all be exercised without having to know about the separate subsystems that implement it.

  • Ingo also requested the ability to extract ktap scripts from the kernel and turn them back into their source form. That, he said, would improve security by making it possible to examine any scripts running within the kernel.

In summary, Ingo said, the code looks like it could be useful and make the kernel's instrumentation better:

Despite my criticism, I'm actually a big proponent of safe kernel probing concepts and this code does have many of the qualities that I always wanted the tracepoint filter code to have in the long run.

So it appears that ktap should be able to get in eventually, but first Jovi has his work cut out for him to get this code into proper shape for merging. The good news is that he seems prepared to do that work. The hope is to have a reworked version of ktap reviewed and ready for the 3.14 development cycle; at that point, it should be able to go directly into the core kernel without a stay in the staging tree. This delay is likely to be frustrating for everybody involved, but the end result will almost certainly be an improved, in-kernel scripting facility for dynamic tracing.

Comments (15 posted)

The future of realtime Linux

By Jake Edge
November 6, 2013

Real Time Linux Workshop

The future of the realtime (aka PREEMPT_RT) kernel patch set was on the agenda for the Realtime Linux minisummit—as usual—but this year's edition had a bit more urgency than in years past. It is clear that Thomas Gleixner, who is doing most of the development work on the patch set, is concerned about the future of the remaining pieces. There appears to be minimal interest in furthering the development of realtime Linux outside of its main sponsor, Red Hat, and that may not be a sustainable model, he reported to both the minisummit and the concurrent 15th Real Time Linux Workshop (RTLWS).

[Group photo]

The RTLWS is a four-day gathering of those interested in using Linux for realtime applications. The minisummit was an all-day meeting on the first day of the workshop for nine kernel developers who are involved in the PREEMPT_RT work: Ingo Molnar, Sebastian Siewior, Frédéric Weisbecker, John Kacur, Paul McKenney, Steven Rostedt, Paul Gortmaker, Gleixner, and Darren Hart. Gleixner also reported on the minisummit to the participants of the workshop on its second day. The conclusion was the same in both cases: the PREEMPT_RT project will be done in 2014, "one way or another". The first way would be to get most of the rest of the code upstream, but that will require more effort from a wider group than is currently involved. The alternative is to decide that the 95% of the realtime work already upstream is good enough and to drop further efforts.

Status

Since last year's minisummit, not much has changed for the realtime patch set, Gleixner said. He spent three months trying to clean up some of the last pieces to get them ready for mainline, but that quickly ballooned into nine separate projects with circular dependencies. That code was "stashed in the horror cabinet", he said, but the work was not entirely wasted as he has a good idea of what needs to be done, "but it's not pretty".

In order to focus on the development side of things, Gleixner has handed off doing the -rt releases to Siewior. The "whole problem" over the last couple of years has been the lack of developers, paid developers in particular, Gleixner said. Red Hat pays him to work half-time on realtime development and it pays the messaging, realtime, and grid (MRG) team to test and productize the realtime patch set. Beyond that, there are a few other contributors, including McKenney for read-copy-update (RCU) development as well as Hart and Gortmaker. Both of the latter two indicated they would try to encourage their employers (Intel and Wind River, respectively) to contribute more to realtime development moving forward.

Gleixner said that he spoke with Linus Torvalds and Andrew Morton at the recently completed Kernel Summit about the remaining realtime code. Both were favorably disposed toward getting those pieces upstream, but Gleixner is concerned that the lack of a dedicated realtime development team to sustain the code may be an impediment. Torvalds worries less about drivers or filesystems that "can bitrot away at the edge of the kernel" without affecting anything else, Gleixner said, but core kernel code is different. If Red Hat were to decide that realtime was no longer of interest, that code might go largely unmaintained.

Part of the problem may be that smaller and smaller segments of the user base are being served with each addition of realtime code to the mainline. McKenney said that he has seen that with RCU—the mainline version works for a lot of people, which is a sign of maturity. Newer features are just for specialized situations, which means it is hard to get a groundswell of support behind them.

Similarly, Hart said that there are customers who are talking about using PREEMPT_RT with the Yocto project, but there would need to be a whole lot more of them before he is likely to be able to convince his management to have him work half-time on realtime. The Open Source Automation Development Lab (OSADL)—sponsor of the RTLWS—was originally set up to gather enough funds to support four or so full-time engineers working on realtime Linux, Gleixner said. Unfortunately, the number of members has never risen to the point where that became a reality and the organization is just able to support its QA test farm alongside its other activities, he said.

From statistics gathered by Gleixner while he was hosting the -rt patches during the kernel.org outage, it is clear that there is interest in realtime Linux. Within five days of the announcement of a new -rt kernel, he would see around 3000 downloads from addresses that read like a "who's who" of the computer industry, he said. In particular, Germany—where OSADL is based—was well represented, with 45% or so of the downloads coming from there.

As Hart noted, this is a different kind of problem than the realtime project has faced in the past. It is not a technical problem like those that the project has tackled—largely successfully. There may be some help on the horizon, however, as there are plans to put the MRG offering on top of the RHEL kernel (with the -rt patches, of course), Kacur said. That may free up some people who are currently working to test and stabilize the -rt kernels because much of the RHEL driver testing and the like can be reused.

Gortmaker asked what the "end game" for realtime Linux looked like: was it going to be like linux-tiny, with some out-of-tree patches? Gleixner acknowledged that was likely to be the case; some "special case" pieces would need to be maintained out of tree.

Something that might help realtime get the final push it needs to be mainlined and to have enough maintenance support would be a contract that required mainline realtime support, McKenney suggested. The original realtime patches came about due to a contract that IBM and Red Hat had with the US Navy, so something like that might come along again.

While the financial industry was once a hotbed for interest in realtime, that seems to have cooled somewhat. The traders are more interested in throughput and are willing to allow Linux to miss some latency deadlines in the interests of throughput, Hart said. The embedded use cases for realtime seem to be where the "action" is today, Gleixner and others said, but there has been little effort to fund realtime development coming from that community.

Technical hurdles

After that, the conversation moved on to the remaining technical hurdles to clear to "finish" the realtime work. Gortmaker noted that Ted Ts'o was not opposed to replacing the bit spinlocks (single bits used as spinlocks to save space) in the ext4 filesystem with regular spinlocks if it could be shown there were no performance impacts or, better still, that performance increases offset the extra space used in buffer-heads. Bit spinlocks are a problem for the realtime code because they cannot be turned into sleeping spinlocks as is done with other mainline spinlocks. Gortmaker said that he plans to pursue the conversion with Ts'o as a possible step toward eliminating bit spinlocks.

Gleixner suggested that a patch which converted bit spinlocks to regular spinlocks when lockdep is enabled might be one approach to solving the bit spinlock problem for realtime. Right now, bit spinlocks are not tracked by lockdep, so a cleanup that tracked them could be sold as a debug feature and the realtime patches could enable that mode as well. Molnar said that doing so might well find locking bugs, which would demonstrate its usefulness.

Rostedt asked about getting rid of uses of cpu_chill(), which can lead to livelock situations. Gleixner said that it is a replacement for cpu_relax() that just does an msleep(1) to allow "nasty trylock loops" to continue to work. By delaying the looping task for a tick, it allows a preempted task to make progress which will, eventually, allow it to release a lock the looping task is waiting on.

Rostedt called cpu_chill() a "hail Mary" that just hopes whoever has the lock will let go of it. He suggested that the waiting task temporarily give its priority to the lock holder, but others thought that fixing the directory entry cache (dcache) code where cpu_chill() is used would be a better approach. For now, the msleep() is reasonable.

The bottleneck of turning reader-writer locks (rwlock) and semaphores (rwsem) into a single spinlock was another issue that Rostedt raised. He noted that there was some ongoing work by Mathieu Desnoyers to turn the oft-contended mmap_sem into something else. "Any heavily threaded application" works poorly on the realtime kernel, Rostedt said, because of mmap_sem contention. Either the realtime kernel can do something realtime-specific to alleviate the problem or there could be an effort to get rid of rwsems in the mainline kernel.

McKenney said that he had put Desnoyers and Peter Zijlstra together at the Kernel Summit to discuss the effort to rework mmap_sem. Zijlstra had made an attempt at the rework earlier, based on a paper from MIT that applied RCU in the page fault path to avoid the need for mmap_sem. It used a different kind of tree that required less rebalancing to track the mappings. He wasn't sure if anything came out of that conversation but that is one possible approach.

Another approach, suggested by Molnar, is to just eliminate rwsem in the realtime patch set and allow memory management to be non-deterministic. If an application cares about deterministic memory management behavior, it should be using mlock(), Gleixner said. Just removing rwsem and "see who complains" is probably best, Molnar said. If someone truly needs the functionality in the realtime kernel, they can fund the development of a realtime variant of rwsem, he said.

There was also discussion of restricting which softirqs are run when softirqs are processed as a result of a call to local_bh_enable(). A mask value could be provided to the local_bh_disable() call that would specify which types of softirqs were being disabled. Typically it would just be those for the subsystem doing the disabling, but others could be added if they needed to be held off.

Gortmaker plans to create an API to post as an RFC soon that would add the mask, but hide it behind a higher-level interface (like local_bh_disable_net() or similar). It would also convert one type of softirq (scheduler, RCU, or timer were suggested) to use the new API. That way, people can "scream" right away, Gleixner said, if they don't like the idea and another way to handle softirq processing for realtime can be found.

Where to next?

In summary, Gleixner said, much of the "low-hanging fruit" from last year's to-do list has been dealt with. Gortmaker picked up many of those and got them to the right subsystem maintainer, Gleixner said. What's left are the "nasty and intrusive" pieces that he keeps trying to "wrap my brain around", but the lack of developer time has really hurt that process.

[Thomas Gleixner]

Gleixner expanded on that problem some more in the report to the RTLWS. He believes a team of four or five full-time developers is needed to make realtime Linux truly sustainable. Without that kind of commitment, he is concerned that Torvalds will be unwilling to take the kind of core changes required by the remaining pieces of the realtime patch set.

The community contribution to the realtime patch set "amounts roughly to zero", he said. There is a fair amount of frustration on the team in always chasing mainline and being unable to stop realtime-unfriendly mainline features because the code is out of tree. In his mind, that should not continue past 2014; either a larger group steps up to work on the code or the project can be declared "finished", he said. "I could live with either outcome."

Gleixner said that he was trying to scare the audience a little bit with his proclamation, but it is clear he has nearly reached the end of his rope. While he would like to see the project continue—and prosper—he has real concerns about making the kinds of changes to the kernel that are required without a deeper and wider group to maintain it all. It seems that it is now up to those who use the realtime kernel to either step up or prepare for a future where the mainline kernel will have to serve their needs.

Comments (25 posted)

Patches and updates

Kernel trees

Architecture-specific

Core kernel code

Development tools

Sebastian Andrzej Siewior debugobject: add support for kref
Jonathan Lebon SystemTap 2.4 release

Device drivers

Documentation

Filesystems and block I/O

Janitorial

Memory management

Networking

Security-related

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds