
Kernel development

Brief items

Kernel release status

The current development kernel is 4.8-rc7, released on September 18. Linus said: "Normally rc7 is the last in the series before the final release, but by now I'm pretty sure that this is going to be one of those releases that come with an rc8. Things didn't calm down as much as I would have liked, there are still a few discussions going on, and it's just unlikely that I will feel like it's all good and ready for a final 4.8 next Sunday."

See this summary for the state of currently known regressions in the 4.8 release.

Stable updates: 4.7.4 and 4.4.21 were released on September 15.

Comments (none posted)

Quotes of the week

We know bug reports come from everyone, there is no such thing as "bug free software", and none of us are claiming it. What we are claiming is that you should stick to the tree that is tested by as many people as possible the closest (i.e. mainline) as that gets you the most bug fixes, as well as the ability to use the kernel community to help you out when you have problems. Otherwise you are on your own with your 2.5 million lines added franken-kernel that no one will touch if they have a choice not to.
Greg Kroah-Hartman

Simply repeating "upstream first" over and over again and telling people that doing anything else is just silly isn't really helping move things forward. People have heard this but for a good chunk of the industry there's a big gap between that simple statement and something that can be practically acted on in any sort of direct fashion; it can easily just come over as dismissive and hostile. It's going to be much more productive to acknowledge the realities people are dealing with and talk about how people can improve their engagement with upstream, make the situation better and close the gaps.
Mark Brown

When someone says "pretty simple" regarding cryptography, it's often neither pretty nor simple.
Alex Elsayed

The point is, I suspect that the block layer community is all about throughput and the talk about latency and interactivity is seen as an annoying distraction.

Like the kids making noise about doing detours for catching Pokémons in the back seat of the car while you're in the driving seat, driving to some perceived important destination. If you see what I mean. Their problems are not really your problem, so you don't care much. It will be more "yeah yeah, we'll see about your Pokémons. Someday."

Linus Walleij

Comments (8 posted)

Kernel development news

Adding encryption to Btrfs

By Jonathan Corbet
September 21, 2016
One of the promises of the Btrfs filesystem is that its new design would facilitate the addition of modern features like compression and encryption. Compression has been there for a while, but Btrfs has yet to gain support for encryption; indeed, the ext4 filesystem got this feature first, over a year ago, with an implementation that is also used by the f2fs filesystem. Work to fill this gap is underway, as can be seen in this recently posted patch set from Anand Jain, but it would appear that encryption in Btrfs remains a distant goal.

It remains distant because it has become clear that this code will not be merged in anything like its current form. With luck, though, it should be the source of a lot of lessons that can be applied to later, hopefully more successful attempts. Sometimes, one simply has to stumble a few times when attacking a difficult problem space.

Crypto troubles

There is an aspect to cryptographic code development that has been learned the hard way many times over: this code needs to be written with help from people who understand cryptography well and know where the pitfalls are. Developers who set out without that domain knowledge are certain to make serious mistakes. So this is not a good way to introduce an encryption-related patch set:

Also would like to mention that a review from the security experts is due, which is important and I believe those review comments can be accommodated without major changes from here.

As Dave Chinner (among others) pointed out, it is far too late for a security review, which should really happen during the design phase. The ext4 encryption feature, he noted, did go through a design review phase ahead of the posting of any code, and quite a bit of useful feedback was the result.

In this case, it would appear that this kind of review would have been helpful. Eric Biggers, who is working on the ext4 encryption feature, looked at the code and came back with a harsh judgment:

You will also not get a proper review without a proper design document which details things like the threat model and the security properties provided. But I did take a short look at the code anyway because I was interested. The results were not pretty. As far as I can see the current proposal is fatally flawed as it does not provide confidentiality of file contents against a basic attack.

Alex Elsayed also pointed out some of the cryptographic problems in the code. It comes down to a poor choice of encryption modes that leaves a filesystem open to well-understood known-plaintext attacks. The reviewers said that a mode like XTS, which lacks this particular vulnerability, should have been used instead. Or, even better, an authenticated encryption (AE) approach should be used; AE modes are believed to be far more resistant to most known attacks. AE brings its own challenges, though; the (mostly obsolete) ecryptfs filesystem uses it, but the current ext4/f2fs implementation does not. A related issue, as Ted Ts'o pointed out, is the increasing importance of taking advantage of hardware-based encryption for performance; that will tend to rule out "exotic encryption modes" in favor of something boring (but hardware-supported) like AES.
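
To make the reviewers' point concrete, switching to a better mode need not involve exotic code. Below is a minimal, hypothetical sketch (it appears nowhere in the posted patches, and alloc_xts_tfm() is an invented name) of allocating an AES-XTS transform with the kernel's crypto API; the "xts(aes)" request will be satisfied by an accelerated driver, such as the AES-NI one, when the hardware supports it:

    /*
     * Hypothetical sketch: allocate an AES-XTS transform through the
     * kernel crypto API. Error handling is abbreviated.
     */
    #include <linux/err.h>
    #include <crypto/skcipher.h>

    static struct crypto_skcipher *alloc_xts_tfm(const u8 *key,
                                                 unsigned int keylen)
    {
            struct crypto_skcipher *tfm;

            tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
            if (IS_ERR(tfm))
                    return tfm;
            /* XTS takes a double-length key: one half for data, one for tweaks */
            if (crypto_skcipher_setkey(tfm, key, keylen)) {
                    crypto_free_skcipher(tfm);
                    return ERR_PTR(-EINVAL);
            }
            return tfm;
    }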

Crypto at the wrong level?

Another criticism of the patch set is that it implements a Btrfs-specific encryption infrastructure, rather than using the generic infrastructure added at the virtual filesystem (VFS) layer and used by ext4 and f2fs. One motivation for that approach is that Btrfs encryption is managed at the subvolume level, meaning that a single master key is used for the entire subvolume. Ext4 and f2fs, instead, lack the subvolume concept; they provide file-level encryption that allows different users to have different keys within the same filesystem. Another result is that Btrfs does not benefit from the work that has been done on the VFS infrastructure; as Chinner put it:

The generic file encryption code is solid, reviewed, tested and already widely deployed via two separate filesystems. There is a much wider pool of developers who will maintain it, review changes and know all the traps that a new implementation might fall into. There's a much bigger safety net here, which significantly lowers the risk of zero-day fatal flaws in a new implementation and of flaws in future modifications and enhancements.

He compared Btrfs-specific encryption to the Btrfs RAID5/6 implementation, which has had known problems for years and appears to be essentially unmaintained. "Encryption simply cannot be treated like this - it has to be right, and it has to be well maintained." Some Btrfs developers bristled at the description of the filesystem's RAID implementation, but there was general agreement that the VFS code should be used to the greatest extent possible — and improved in places where it cannot yet be used.

Btrfs does provide some unique challenges that will stress the capabilities of the existing VFS code. That code, for example, manages encryption keys as an inode attribute; that is how file-level encryption is supported. Btrfs throws a spanner into those works in a couple of ways:

  • If Btrfs snapshots are present, an inode is likely to be present in more than one of them. Without a great deal of care, these snapshots could be used to force a reuse of the encryption keys and "nonce" values used with a specific file; many AE algorithms will fail catastrophically if that happens.

  • In general, Btrfs does a lot of sharing of file blocks at the extent level. That is how the copy-on-write mechanism works in general, and features like deduplication will cause even more sharing to happen. Once again, this sharing could be used to expose information about encrypted data, or simply to reveal when one party has modified a file that shares extents with another.

A solution to some of these problems would be to simply copy extents and do without the sharing when encryption is involved. But another solution falls out of the requirements: encryption in Btrfs probably needs to be managed at the extent level, rather than at the file level. That would reduce the potential for nonce-reuse attacks and would eliminate problems that would otherwise result if one file sharing an extent is modified in a way that changes the extent's offset within the file.
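
Purely as illustration of that conclusion (none of these names exist in Btrfs), extent-level encryption might attach a small context to each extent, pairing a reference to the subvolume's master key with a nonce generated once at allocation time, so that sharing or relocating the extent can never reuse a (key, nonce) pair:

    /* Hypothetical: per-extent encryption state for a Btrfs-like design */
    struct extent_crypt_ctx {
            u64     key_id;         /* which master (subvolume) key applies */
            u8      nonce[16];      /* unique per extent, set at allocation */
    };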

As Btrfs developer Zygo Blaxell put it, the Btrfs extent-use model already creates challenges for the VFS layer:

Currently any extent in the filesystem can be shared by any inode in the filesystem (assuming the two inodes have compatible attributes, which could include encryption policy), including multiple references from the same inode to the same extent at different logical offsets. This is the basis of the deduplication and copy_file_range features.

This confuses the VFS caching layer when dealing with deduped, reflinked, or snapshotted files. It's not surprising that VFS crypto has problems coping with it as well.

At the moment, encryption at the VFS level doesn't have any real concept of extents at all; extents are generally something that only specific filesystems know about. So the VFS file-encryption code is not suitable for solving the Btrfs encryption problem in its current form. As many have pointed out, though, the solution is not to start over, but to enhance the VFS code to get it to the point where it can do the job.

About the only definite conclusion that came from the discussion was that there is still a lot of work to do before the Btrfs encryption problem is even well understood, much less properly implemented. If nothing else, the patches posted so far have served as a focus point for a discussion that needs to happen and, hopefully, a starting point for the next try, sometime in the future. Once again, we see that cryptography is hard, and the intersection with a next-generation filesystem makes it even harder.

Comments (14 posted)

Automating stable-kernel creation

By Jake Edge
September 21, 2016

LinuxCon North America

At LinuxCon North America 2016 in Toronto, Sasha Levin presented some of the tools and techniques he uses to maintain stable kernels. He maintains the 4.1 and 3.18 stable kernels and wanted to make his life easier, so he started automating the process. While creating a stable kernel will never be a fully automatic task, he has developed some tools that can help.

Stable trees are just like more -rc cycles, he said with a grin. The intent is that stable kernels only get small changes (< 100 lines) from the mainline that fix a non-theoretical bug that users are running into. The criteria are pretty strict, but an exception is made for new device IDs; those are normally one-line changes that simply enable new hardware using the existing code. Stable kernels are typically supported for the roughly ten weeks between mainline kernel releases.

[Sasha Levin]

There are also long-term support (LTS) stable kernels. Those follow the same rules, but continue to get support for much longer—typically two years or more. As time passes, fewer commits are made to the LTS kernels; since they don't add new features, they also don't add new bugs. But, on the other hand, fixing the bugs that are found is harder, since fixes must often be backported rather than simply cherry-picked from the mainline.

That means that the rate of LTS stable patches goes down, but each patch takes more time to handle. In addition, more people depend on those trees for servers and other critical infrastructure, where they don't want to change the kernel (and, in particular, update to a new major version) frequently. So it is important that those kernels are as reliable as they can be.

So, that "doesn't sound hard", Levin said, just look at every patch that goes into the mainline, decide if it is a fix, and add it to the tree if it is. But, of course, there are too many patches—around eight patches per hour, every hour of every day.

Even if someone could look at all those patches, it is not always obvious whether they fix a real problem or not. There is also the chance that Levin or some other stable maintainer misses a patch that does fix something. If no one is using that functionality, that isn't much of a problem, but if it is a critical security fix, that can be serious. On the flip side, if he takes a fix that he shouldn't have, it might introduce a security hole. For example, a few weeks earlier he took an XFS patch into the wrong kernel version and introduced a local privilege escalation.

Let's automate

So, "let's automate". He finds most of the patches needed for his trees by looking for "stable@" addresses or "Fixes" tags in the mainline commits. His first step, then, was to write a script that grabbed the logs and looked for those strings. But that was not enough.

As an example, he pointed to a commit, which is a simple fix for a minor security bug (an information leak), but was not marked for stable, nor with a "Fixes" tag. So he can't rely on those alone to find the patches that should be added to his tree(s).

Another technique he uses is to search for certain keywords and phrases. Strings like "fix", "NULL dereference", "buffer overflow", and so on might indicate a commit he should look at more closely. He has around twenty of these strings that he looks for now, though he adds to the list occasionally.

After that, he started "shamelessly stealing" Greg Kroah-Hartman's work. So Levin has a script, stable-show-missing, that looks at other stable trees to see what is missing in one or the other. Are there commits in Kroah-Hartman's (or another stable maintainer's) tree that are not in his? Or vice versa?

In a continuation of the "shameless stealing", he has a script that looks for backports of fixes into other stable trees. "Backports are evil", he said, and should be avoided, but it is important not to have multiple different backports of the same fix in various trees. If there is only one backport, it may be wrong, but at least all of the trees carry the same code, so a single fix can be applied to all of them if needed. For example, if a fix has been backported from 4.8 into 4.4, he can run his tool to find and show the backported patches; if they apply cleanly to his tree, he can simply adopt them.

Another tool, stable-deps, will give a list of commits that need to be in place before a particular fix can be applied. That list can be used to find stable-candidate commits that have been missed along the way. It can also show whether a fix is for a bug in some big feature that was introduced after the kernel version he is working with. That makes it easier to drop those kinds of fixes without doing costly research on the mailing list.

When looking at a specific patch, there is always the question of whether it truly should be applied or not. There are multiple rules in the stable_kernel_rules.txt file; the first five are straightforward, but the rest is "lawyer talk", he said with a chuckle. In any case, his common check_relevant() function will find some of the obvious violations, though of course it is not perfect.

Finding the "stable@" address in a commit is a good indicator that it is stable material, but is no guarantee that it truly is. On the other hand, there may not be a stable indicator, but the fix should be applied. Even if there is a "Cc: stable@vger.kernel.org" line in the commit, there are multiple different ways that "tag" is formed. Some have angle brackets or other formatting differences; there is also, perhaps, a version indication (which can also come in a variety of formats).

These version tags (e.g. "Cc: stable@vger.kernel.org # v3.4+") are meant to help the stable maintainers quickly determine whether they should be interested in the patch or not. But there is no standard way of specifying the applicable versions, so check_relevant() tries to parse the version specification and to determine which kernel versions it actually corresponds to.
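
As a sketch of what that parsing involves (this is not Levin's actual code; min_stable_version() is an invented name), even a simplistic C version has to tolerate variation in the tag:

    /*
     * Hypothetical sketch of the parsing check_relevant() must do:
     * extract a minimum version from a tag like
     * "Cc: stable@vger.kernel.org # v3.4+". Real tags come in many
     * more shapes than this handles.
     */
    #include <stdio.h>
    #include <string.h>

    static int min_stable_version(const char *line, int *major, int *minor)
    {
            const char *hash = strchr(line, '#');

            if (!hash)
                    return -1;      /* no version hint given */
            /* the space matches optional whitespace, so "#v3.4" also parses */
            if (sscanf(hash, "# v%d.%d", major, minor) == 2)
                    return 0;       /* tag applies from major.minor onward */
            return -1;
    }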

One problem he has encountered is the "fix for a fix". The "Fixes" tag refers to a commit that has been fixed, but that only works for mainline commit IDs. Once a fix has been cherry-picked into a stable tree, it will have a different commit ID than the corresponding change in the mainline. So a fix that references a mainline commit that has been cherry-picked into a stable tree is easy for a stable maintainer to miss. check_relevant() looks for that as well.

There are certain patch authors whose names are themselves flags that a patch should get stable consideration. He mentioned Linus Torvalds and David Miller as two maintainers who mostly just fix bugs, often important bugs. While Torvalds "tries to hide" security problems, the fact that he has authored a particular change is a big sign that it is significant.

Putting all of that together results in a stable-steal-commits tool. It can be run against upstream or various stable maintainers' trees and will create a new tree with the changes that his tools find. The resulting tree is not something that can be shipped directly, obviously, since it needs lots of validation, but it is a starting point. In particular, it is important to run stable-show-missing and look carefully at the results. Running stable-steal-commits takes about 30 minutes on an -rc release after -rc1; it takes around two hours for an -rc1 release.

When he is validating the tree that is created, he often finds that some patches need to be yanked out of the tree or that other patches need to be pulled in. That is not something that Git handles easily, which is why Kroah-Hartman uses quilt to manage stable-tree patches. Levin has created stable-yank and stable-insert to handle those kinds of problems. They are currently being used quite a bit, he said; he is trying to convince Kroah-Hartman to drop quilt in favor of them.

He now has a GitHub repository containing multiple tools that he uses for his stable kernel work. He also introduced his scripts in a post to the linux-kernel mailing list nearly a year ago.

Levin showed a rant from Dave Chinner that complained about having to make the same set of comments for multiple stable trees and maintainers. He wanted to see more coordination between the stable maintainers so that he and others could simply make one set of comments that would (somehow) propagate to all of the other stable trees that might also cherry-pick the commit(s) in question.

To help fill in that "somehow", Levin has come up with stable "notes": a tool grabs reviews and other comments from the mailing list and stores them as notes attached to the commits in a Git tree. Other stable maintainers can add Levin's tree as a remote repository and configure Git to consult the notes that he adds from stable reviews. That will help reviewers and maintainers avoid doing multiple reviews for multiple stable releases; it will also help the stable maintainers coordinate more easily.

The last piece of the puzzle is testing. Stable kernel candidates need to be tested before they can be released. He does local build tests and boots the kernels inside a virtual machine, but there is much more testing going on. The 0-day testing service and kernelci.org both test on every commit made to his Git repository. To him, it seems like these groups have "unlimited computing power or something" and their testing makes his life much easier. It is much better to find out about problems during the review cycle for the stable kernel rather than after it has been released.

[I would like to thank the Linux Foundation for travel assistance to attend LinuxCon North America in Toronto.]

Comments (none posted)

BBR congestion control

By Jonathan Corbet
September 21, 2016
Congestion-control algorithms are unglamorous bits of code that allow network protocols (usually TCP) to maximize the throughput of any given connection while simultaneously sharing the available bandwidth equitably with other users. New algorithms tend not to generate a great deal of excitement; the addition of TCP New Vegas during the 4.8 merge window drew little fanfare, for example. The BBR (Bottleneck Bandwidth and RTT) algorithm just released by Google, though, is attracting rather more attention; it moves away from the mechanisms traditionally used by these algorithms in an attempt to get better results in a network characterized by wireless links, meddling middleboxes, and bufferbloat.

The problem that any congestion-control algorithm must solve is that the net has no mechanism for informing an endpoint of the bandwidth available for a given connection. So the algorithm must, somehow, come to its own conclusions regarding just how much data it can send at any given time. Since the available bandwidth will generally vary over time, that bandwidth estimate must be revised occasionally. In other words, a congestion control algorithm must maintain an ongoing estimate of how much data can be sent, derived from the information that is available to it.

That information is somewhat sparse. These algorithms typically work by using one metric that they are easily able to measure: the number of packets that do not make it to the other end of the connection and must be retransmitted. When the network is running smoothly, dropped packets should be a rare occurrence. Once a router's buffers begin to fill, though, it will have no choice but to drop the packets it has no room for. Packet drops are thus a fairly reliable signal that a connection is overrunning the bandwidth available to it and should slow down.

The problem with this approach, on the network we have now, is that the buffers between any pair of endpoints can be huge. Oversized buffers have been recognized as a problem for some years now, and progress has been made in mitigating the resulting bufferbloat issues. But the world is still full of bloated routers and some link-level technologies, such as WiFi, require a certain amount of buffering for optimal performance. By the time an endpoint has sent enough data to overflow a buffer somewhere along the connection, the amount of data buffered could be huge. The packet-loss signal, in other words, comes far too late; by the time it is received, an endpoint could have been overdriving the connection for a long time.

Loss-based algorithms can also run into problems when short-lived conditions cause a dropped packet. They may slow down unnecessarily and, as a result, fail to make use of the bandwidth that is available.

Bottleneck Bandwidth and RTT

The BBR algorithm differs from most of the others in that it pays relatively little attention to packet loss. Instead, its primary metric is the actual bandwidth of data delivered to the far end. Whenever an acknowledgment packet is received, BBR updates its measurement of the amount of data delivered. The sum of data delivered over a period of time is a reasonably good indicator of the bandwidth the connection is able to provide, since the connection has demonstrably provided that bandwidth recently.
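
In the posted code, that bookkeeping is driven by a rate sample handed to the congestion-control module with each incoming ACK; the fragment below is adapted (and simplified) from the BBR patches, with the windowed max filter that retains the estimate elided:

    #include <net/tcp.h>    /* kernel context: struct sock, rate_sample */

    #define BW_SCALE 24
    #define BW_UNIT (1 << BW_SCALE)         /* fixed-point bandwidth scaling */

    /* Adapted from the BBR patches: turn an ACK's rate sample into bandwidth */
    static void bbr_update_bw(struct sock *sk, const struct rate_sample *rs)
    {
            u64 bw;

            if (rs->delivered <= 0 || rs->interval_us <= 0)
                    return;                 /* no usable sample */

            bw = (u64)rs->delivered * BW_UNIT;
            do_div(bw, rs->interval_us);    /* packets per microsecond, scaled */
            /* ... feed bw into a windowed max filter of recent samples ... */
    }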

When a connection starts up, BBR will be in the "startup" state; in this mode, it behaves like most traditional congestion-control algorithms in that it starts slowly, but quickly ramps up the transmission speed in an attempt to measure the available bandwidth. Most algorithms will continue to ramp up until they experience a dropped packet; BBR, instead, watches the bandwidth measurement described above. In particular, it looks at the actual delivered bandwidth for the last three round-trip times to see if it changes. Once the bandwidth stops rising, BBR concludes that it has found the effective bandwidth of the connection and can stop ramping up; this has a good chance of happening well before packet loss would begin.

The measured bandwidth is then deemed to be the rate at which packets should be sent over the connection. But in measuring that rate, BBR probably transmitted packets at a higher rate for a while; some of them will be sitting in queues waiting to be delivered. In an attempt to drain those packets out of the buffers where they languish, BBR will go into a "drain" state, during which it will transmit below the measured bandwidth until it has made up for the excess packets sent before.

Once the drain phase is done, BBR goes into the steady-state mode where it transmits at more-or-less the calculated bandwidth. That is "more-or-less" because the characteristics of a network connection will change over time, so the actual delivered bandwidth must be continuously monitored. Also, an increase in effective bandwidth can only be detected by occasionally trying to transmit at a higher rate, so BBR will scale the rate up by 25% about 1/8 of the time. If the bandwidth has not increased (transmitting at a higher rate does not result in data being delivered at a higher rate, in other words), that probe will be followed by a drain period to even things out again.
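
The posted patches express this steady-state behavior as a cycle of fixed-point "pacing gains": one phase probes at 5/4 of the bandwidth estimate, the next drains at 3/4 of it, and six phases cruise at the estimate itself:

    #define BBR_SCALE 8                     /* fixed point: BBR_UNIT == 1.0 */
    #define BBR_UNIT (1 << BBR_SCALE)

    /* From the BBR patches: the eight-phase gain cycle */
    static const int bbr_pacing_gain[] = {
            BBR_UNIT * 5 / 4,               /* probe for more bandwidth */
            BBR_UNIT * 3 / 4,               /* drain the queue the probe built */
            BBR_UNIT, BBR_UNIT, BBR_UNIT,   /* cruise at the estimated rate */
            BBR_UNIT, BBR_UNIT, BBR_UNIT,
    };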

One interesting aspect of BBR is that, unlike most other algorithms, it does not use the congestion window as the primary means of controlling outgoing traffic. The congestion window limits the maximum amount of data that can be in flight at any given time; an increase in the window will generally result in a burst of packets consuming the newly available bandwidth. BBR, instead, uses the tc-fq packet scheduler to send out data at the proper rate. The congestion window is still set as a way of ensuring that there is never too much data in flight, but it is no longer the main regulatory mechanism.
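
A hedged sketch of that mechanism (simplified; the real function converts units carefully and treats the startup phase specially): the bandwidth estimate, scaled by the current gain, becomes the socket's pacing rate, which the fq scheduler then enforces by spacing packets out on the wire:

    #define BBR_SCALE 8     /* fixed-point gain scaling, as in the gain table */

    /* Sketch: publish rate = bandwidth * gain for fq to enforce */
    static void set_pacing_rate_from_bw(struct sock *sk, u64 bw_bytes_per_sec,
                                        int gain)
    {
            u64 rate = bw_bytes_per_sec * gain >> BBR_SCALE;

            sk->sk_pacing_rate = min_t(u64, rate, sk->sk_max_pacing_rate);
    }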

There is one last complication: many network connections are subject to "policers", middleboxes that limit the maximum data rate any connection can reach. If such a box exists, there is little point in trying to exceed the rate it will allow. The BBR code looks for periods with a suspiciously constant bandwidth (within 4Kb/sec) and a high packet loss rate; should that happen, it concludes that there is a policer in the loop and limits the bandwidth to a level that will not cause that policer to start dropping packets.

The BBR patch set was posted by Neal Cardwell; the code itself carries signoffs from a number of people, including Van Jacobson and Eric Dumazet. Google has, its developers say, been using BBR for some time and is evidently happy with the results; BBR works fine when only one side of the connection is using it, so each deployment should, if it lives up to its promises, make the net that much better. We shouldn't have to wait long to find out; networking maintainer David Miller has applied the patches, meaning that BBR should be available in the 4.9 kernel.

Comments (48 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 4.8-rc7 Sep 18
Greg KH Linux 4.7.4 Sep 15
Sebastian Andrzej Siewior 4.6.7-rt13 Sep 15
Greg KH Linux 4.4.21 Sep 15
Steven Rostedt 4.4.21-rt30 Sep 21
Steven Rostedt 4.1.33-rt37 Sep 21
Steven Rostedt 3.18.42-rt44 Sep 21
Steven Rostedt 3.14.79-rt84 Sep 21
Steven Rostedt 3.12.63-rt84 Sep 21

Build system

Nicolas Pitre make POSIX timers configurable Sep 18

Documentation

Mauro Carvalho Chehab Create a book for Kernel development Sep 19
Jesper Dangaard Brouer XDP (eXpress Data Path) documentation Sep 20

Filesystems and block I/O

Kirill A. Shutemov ext4: support of huge pages Sep 15
Christoph Hellwig iomap based DAX path V3 Sep 16
Damien Le Moal ZBC / Zoned block device support Sep 20


Page editor: Jonathan Corbet