
Kernel development

Brief items

Kernel release status

The current development kernel is 3.19-rc5, which was released on January 18. Things are not quite calming down the way Linus Torvalds would like, but: "That said, it's not like there is anything particularly scary in here. The arm64 vm bug that I mentioned as pending in the rc4 notes got fixed within a day of that previous rc release, and the rest looks pretty standard. Mostly drivers (networking, usb, scsi target, block layer, mmc, tty etc), but also arch updates (arm, x86, s390 and some tiny powerpc fixes), some filesystem updates (fuse and nfs), tracing fixes, and some perf tooling fixes."

Stable updates: The 3.18.3, 3.14.29, and 3.10.65 stable kernels were released on January 16. As of this writing, there are no stable kernels in the review process.


Quotes of the week

Greg, from my spell in IVI, I too have to say your faith in the wisdom of IVI developers' choices is touching. I think D-Bus was in the main picked because it had some nice features, but then people realized it had no bandwidth, and the solution has been "make D-Bus faster", rather than "maybe we should explore other (mixed model) solutions". This isn't to say that I'm against adding kdbus, but I don't think there's much strength to the argument you make above.
Michael Kerrisk

return -ETOOMANYWINDMILLS;
Josh Triplett

You can do ioctls in perl just fine if you are mad (and if you are using perl you are ;-) ) while python has a complete explicit fcntl.ioctl model.
Alan Cox


Dropping x86 EISA support

By Jake Edge
January 21, 2015

It is clear that Paul Gortmaker thought there was a pretty good chance to get rid of some old, unloved code when he proposed dropping support for the EISA bus from 32-bit x86 kernels. As he noted, when support for the MCA bus was dropped in 2012, Linus Torvalds mused that perhaps EISA could follow suit "some day". Obviously Gortmaker hoped that day had come, at least for the x86 architecture, but it seems he was a bit premature.

As Gortmaker pointed out, there are some architectures that are essentially "frozen in time (from a hardware perspective)"—he mentioned Alpha and PA-RISC as examples—so EISA support cannot be completely removed from the tree (as MCA was). Removing it from x86 did not save much in the way of code—it only deleted a little over 100 lines—but he had something else in mind:

Given that it is 20 years on since its demise, and the above specs might seem just barely acceptable for a wireless router today, lets stop forcing everyone to build EISA infrastructure and assoc. drivers during their routine build coverage testing for no value whatsoever.

But Maciej W. Rozycki was not on board with the removal, noting that it is needed "to support EISA FDDI [Fiber Distributed Data Interface] equipment I maintain if nothing else". He suggested that perhaps it could be hidden behind a configuration option for "more exotic stuff" so that not everyone needed to build and test it.

Unsurprisingly, Torvalds was quick to put the kibosh on EISA's removal: "So if we actually have a user, and it works, then no, we're not removing EISA support". But it is instructive to consider what might have happened if Rozycki had not posted his disagreement. It seems quite possible that if no one spoke up for EISA on x86, it might well have been removed.

There is always some tension in the kernel community between those who want to clean up and clear out "legacy" code and those who want to see it continue to live in the mainline tree. There is a cost associated with maintaining legacy code, though, even if it rarely needs to change, and it does continue to get built as part of various kernel-wide testing efforts. That puts some (possibly small) amount of burden on many other kernel developers, most of whom are not interested in the old code at all.

As certain kinds of hardware start to disappear entirely—from the kernel developers' consciousness, at a minimum—it behooves those using Linux on that hardware to pay attention to the kernel mailing list. As seen here, real users who do speak up will likely be able to block efforts to remove support, but timely responses will be needed. If a kernel release cycle or two goes by, it may well be too late.


Kernel development news

When real validation begins

By Jonathan Corbet
January 21, 2015

LCA 2015
No computer-oriented conference is complete without a good war-story presentation or two. Paul McKenney's LCA 2015 talk on the implications of enabling full dynamic tick support for all users fit the bill nicely. The result was an overview of what can happen when your code is unexpectedly enabled on vast numbers of machines — and some lessons for how to avoid disasters in the future.

Some history

Paul started by noting that, in the 1990s, there was little concern about CPU energy efficiency. In fact, in those days, an idle CPU tended to consume more power than one that was doing useful work. That's because an idle processor would sit in a tight loop waiting for something to do; there were no cache misses, so the CPU ran without a break. Delivering regular clock interrupts to an idle processor thus increased its energy efficiency; it was still doing nothing useful, but it would burn less power in the process.

Things have changed since then, he continued. CPUs are designed to be powered off when they have nothing to do, so clock interrupts to an idle CPU are bad news. But until the early 2000s, that's exactly what was happening on Linux systems. One of the changes merged for the 2.6.21 release in 2007 was partial dynamic tick support, which removed those idle clock interrupts.

That was a good step forward, but was not a full solution; delivering regular scheduling interrupts to a busy CPU can also be a problem. Realtime developers don't like clock ticks because they can introduce unwanted latency. High-performance computing users also complain; they are trying to get the most out of their CPUs, and any work that is not directed toward their problems is just overhead. Beyond that, high-performance computing workloads often communicate results between processors; delaying work on one processor can cause others to wait, creating cascading delays. The scheduling interrupt is often necessary, but, in a high-performance environment, there will only be one process running on a given CPU and no other work to do, so those interrupts can only slow things down.

Full dynamic tick support was first prototyped by Josh Triplett in 2009; it resulted in a 3% performance gain for CPU-intensive workloads. For people determined to get maximal performance from their systems, 3% is a big deal. But this patch, which was mostly a proof of concept, had some problems. Without the scheduling interrupt, a single task could monopolize the CPU and starve others. There was no process accounting, and read-copy-update (RCU) grace periods could go on forever, with the result that the system could run out of memory. So Frederic Weisbecker's decision to work on a production-ready version of the patch was welcome.

That code was merged for the 3.10 kernel. It works well, in that there will be no scheduler interrupt while only one task is running on the CPU. There is a residual once-per-second interrupt that, Paul said, serves as a sort of security blanket to make sure nothing slips through the cracks. It can be disabled, but that is not recommended at this time.

Paul did some of the work to ensure that RCU worked properly in a full dynamic tick environment. He had thought of full dynamic tick as a specialty feature that would only be enabled by users building their own kernels. So he was surprised to hear that the feature was enabled for all users in the RHEL7 kernel. But, he said ruefully, you would think he would know better after his many years of experience in this industry. Turning on the feature in a major distribution means that everybody is using it. That, he said, is when the real validation begins — validation by users running workloads that he had not thought to test his patches against.

The fun begins

He soon got an email from Rik van Riel asking why the rcu_sched process was taking 40% of the CPU. This was happening on a workload that had lots of context switches — a completely different environment than the one the dynamic tick feature was designed for. Paul's first thought was that grace periods might be completing too quickly in the presence of all those context switches, increasing the amount of grace-period processing that needed to be done. He tried slowing grace-period completion down artificially, but that did not help. Thus, he said, he was forced to actually analyze what was going on.

The real problem had to do with the RCU callback offloading mechanism, which moves RCU cleanup work off the CPUs that are being used in the dynamic-tick mode. This cleanup work is done in dedicated kernel threads that can be run on whichever CPU makes the most sense. It's a useful feature for high-performance workloads, but it isn't all that useful for everybody else; indeed, it appeared to be causing problems for other workloads. To address that problem, Paul put in a patch to only enable callback offloading if the nohz_full boot parameter (requesting full dynamic tick behavior) is set.

According to Paul, industry experience shows that one out of six fixes introduces a new bug of its own. This was, he said, one of those fixes. It turns out that RCU is used earlier in the boot process than he had thought, and the switch to the offloaded mode would cause early callbacks from the offloaded CPUs to be lost. The result would certainly be leaked memory, but it can also result in a full system hang if processes are waiting for a specific callback to complete. So another fix went in to make the decision on which CPUs to offload earlier.

So "now that the bleeding was stopped," he said, it was time to fix the real bug. After all, 40% CPU usage on an 80-CPU system is a bit excessive, and the problem would get worse as the number of CPUs increases. By the time the CPU count got up to 4000 or so, the system simply would not be able to keep up with the load. Since he already gets complaints about RCU performance on 4096-CPU machines, this was a real problem in need of a solution.

It turned out that a big part of the overhead was the simple process of waking up all of the offload threads at the beginning and end of grace periods. So he decided to hide the problem a bit; rather than wake all threads from the central RCU scheduling thread, he organized them into a tree and made a subset of threads responsible for waking the rest. The idea was to spread the load around the system a bit, but it also happened to reduce the total number of wakeups since it turned out to only be necessary to wake the first level of threads at the beginning of the grace period.

One of six fixes may introduce a new bug, but in this case, Paul admitted, it was two out of six. Some callbacks that were posted early in the life of the system were not being executed, leading to occasional system hangs. Yet another fix ensured that they got run, and everything was well again.

At least, all was well until somebody looked at their system and wondered why there were hundreds of callback-offload threads on a machine with a handful of CPUs. It turns out that some systems have firmware that lies about the number of installed CPUs, and "RCU was stupid enough to believe it." Changing the callback-offload code to defer starting the offload threads until the relevant CPU actually comes online dealt with that one.

At this point, the callback-offload code passed all of its tests. At least, it passed them all if loadable kernel modules were not enabled — the situation on Paul's machines. The problem was that a module could post callbacks that would still be outstanding when the module was removed. That would lead to the kernel jumping into code that was no longer present — an "embarrassing failure" that can lead to calls (of the telephone variety) back to the relevant kernel developers instead. The solution was to wait for all existing callbacks to be invoked before completing the removal of the module; that wait is done by posting a special callback on each CPU in the system and waiting for them all to report completion.

As mentioned above, the code had been fixed to run callback threads only for CPUs that are online. The module-unload fix just described, though, posts its special callbacks on all CPUs, including those that are currently offline. Since an offline CPU has no offload thread, those callbacks would wait forever. So yet another fix ensured that CPUs that have never been online do not get callbacks posted.

Lessons learned

At this point, as far as anybody knows, things have stabilized and no remaining bugs lurk to attack innocent users. There are, Paul said, a number of lessons that one can learn from his experience. The first of these is to limit the scope of all changes to avoid putting innocent bystanders at risk. Turning on full dynamic tick behavior for all users went against that lesson with unfortunate consequences. We should also recognize that the Linux kernel serves a wide variety of workloads; it will never be possible to test them all.

Fixes can — and will — generate more bugs. Fixes for minor bugs require more caution before they are applied; since they address a problem seen by only a small subset of users, they have a high probability of creating unforeseen problems for the larger majority. And, Paul said, it is not enough to simply check one's assumptions; one may have built "towers of logic" upon those assumptions and formed habits of thought that are hard to break out of. In this case, the assumption that all users of the dynamic tick code would be building their own kernels led to some unfortunate consequences. And finally, he said, people probably trust him too much.

[Your editor would like to thank linux.conf.au for funding his travel to the event.]


Inserting a hole into a file

By Jake Edge
January 21, 2015

Last March, we looked at a proposal for a new fallocate() option to collapse a range of blocks within a file. The FALLOC_FL_COLLAPSE_RANGE flag was added to the 3.15 kernel; its counterpart, FALLOC_FL_INSERT_RANGE, has been proposed by the same developer: Namjae Jeon. It would provide a way to open up a range of blocks within a file, without requiring an expensive data copy.

The example use case that Jeon has used for both new flags is the removal (using FALLOC_FL_COLLAPSE_RANGE) of advertisements from large video files, or their insertion (using FALLOC_FL_INSERT_RANGE) into those files. While that particular example may not resonate with everyone, there are other uses for quickly removing and inserting chunks of data in the middle of large files. For example, non-linear editing of various types of media (video, in particular) may benefit from reducing the amount of data copying needed. The requirement that the ranges be block-aligned, though, could limit the overall usefulness of both flags.

The fallocate() system call provides a means for programmers to alter the allocation of blocks for a file—essentially to give the filesystem more information about the programmer's plans for the file so that better allocation decisions can be made. Over time, additional features have been added to fallocate(), including the ability to punch holes in or to zero-out ranges of a file.

There are quite a few similarities between FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE. Each must be the only flag passed to fallocate() (some of the other flags can be ORed together); both require that the offset and length specified be multiples of the filesystem's logical block size; and both are implemented only for XFS and extent-based ext4 filesystems. Also, they are restricted to working within the existing file, so the range covered by offset + length must not stretch beyond the current end of file (EOF).

For inserting a range, the basic algorithm is the same for both XFS and ext4. Once the offset and length parameters are validated (i.e. block-aligned and not past EOF), the file size is increased by the length. The extent containing the logical block number for offset is then examined to see if that block number is the first in the extent. If not, the extent is split so that it starts with the block number corresponding to offset. Then, starting with that extent, all extents from there to the EOF are shifted over (i.e. to the right) by the length, which leaves behind a hole located at the offset with the specified length.

Once that is done, callers can fill that hole by writing whatever data they want into it—hopefully not just ads. Reading from that region before writing to it will return zeroes, as with other holes punched in files.

Beyond the changes to the kernel filesystem layer (which are minimal), XFS, and ext4 (which are more extensive), Jeon has also added a number of test cases to xfstests. There are simple tests of the insert range feature, as well as more complicated tests that do multiple inserts or inserts coupled with collapse operations to try to stress both of these features. In addition, he has added support for an "finsert" command to the xfs_io program from xfsprogs.

Jeon's patch set is up to version 8 at this point; there have been lots of suggestions for changes along the way, but little in the way of fundamental opposition. Given that the collapse range capability was added, it would seem likely that insert range will follow along before too long.


Patches and updates

Kernel trees

Linus Torvalds Linux 3.19-rc5
Greg KH Linux 3.18.3
Luis Henriques Linux 3.16.7-ckt4
Greg KH Linux 3.14.29
Steven Rostedt 3.14.29-rt26
Kamal Mostafa Linux 3.13.11-ckt14
Jiri Slaby Linux 3.12.36
Steven Rostedt 3.12.36-rt50
Greg KH Linux 3.10.65
Steven Rostedt 3.10.65-rt69
Steven Rostedt 3.4.105-rt129

Architecture-specific

Build system

Core kernel code

Development tools

Device drivers

Device driver infrastructure

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jake Edge


Copyright © 2015, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds