
A slow path to a fast fix

By Jonathan Corbet
March 23, 2016
In the Linux community, we like to take pride in our quick response to security problems. We may not be as good at preventing those problems (or preventing exploits) as we might like, but at least we fix them promptly when they come to light. As the example of CVE-2015-1805 shows, though, sometimes a useful response takes longer than it seems it should. In the end, it comes down to realizing that a given security problem is, indeed, severe.

In April 2014 Al Viro, as part of the project of converting much of the kernel's virtual filesystem code over to the iov_iter abstraction, committed a patch reworking how the pipe code performs I/O. In the old version, pipes would work with a struct iovec array; the replacement used struct iov_iter instead. It looked like a straightforward code refactoring and cleanup with, it was thought, no user-visible changes at all.
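For readers unfamiliar with the two interfaces, here is a minimal sketch of the difference; these are simplified definitions for illustration, not the actual kernel declarations. With a bare struct iovec array, every consumer must track its own position, typically by modifying the array in place; a struct iov_iter carries the position within itself and leaves the underlying array untouched:

    /* Simplified for illustration; the real kernel definitions differ. */
    struct iovec {
        void   *iov_base;    /* start of this segment */
        size_t  iov_len;     /* bytes remaining in this segment */
    };

    struct iov_iter {
        const struct iovec *iov;        /* segment array, never modified */
        unsigned long       nr_segs;    /* segments remaining */
        size_t              iov_offset; /* offset into the current segment */
        size_t              count;      /* total bytes remaining */
    };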

That turned out not to be true; the old code contained a subtle mistake that was not present in the new version. If an attempt to perform a read or write operation atomically (without blocking) failed, the code would retry in non-atomic mode. But, if that failure occurred partway through an operation with multiple iovecs to handle, the code would forget about the ones it had already processed; the subsequent retry would run off the end of the iovec array. There is no indication that Al knew, at the time, that he had fixed a security bug, but others eventually figured it out.
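The shape of the bug is easier to see in code. What follows is a deliberately buggy, self-contained userspace sketch - hypothetical names throughout, not the actual fs/pipe.c code - in which the copy helper records its progress by mutating the caller's segment array, so a retry for the full length walks off the end of that array (running it is undefined behavior by design):

    #include <string.h>

    struct seg { char *base; size_t len; };

    /* Copy len bytes out of the segment array, recording progress by
     * mutating the array itself.  A fault during the atomic pass is
     * simulated by fail_after: the copy gives up once that many bytes
     * have been transferred. */
    static int copy_from_segments(char *to, struct seg *seg, size_t len,
                                  size_t fail_after)
    {
        size_t done = 0;

        while (len > 0) {
            while (seg->len == 0)
                seg++;                     /* no bound check! */
            size_t n = len < seg->len ? len : seg->len;
            if (done + n > fail_after)
                return -1;                 /* simulated atomic fault */
            memcpy(to, seg->base, n);
            seg->base += n;                /* progress is recorded in */
            seg->len  -= n;                /* the caller's array      */
            to += n; len -= n; done += n;
        }
        return 0;
    }

    int main(void)
    {
        char a[4] = "aaaa", b[4] = "bbbb", out[8];
        struct seg iov[2] = { { a, 4 }, { b, 4 } };

        /* The atomic pass "faults" after 4 bytes; iov[0] is consumed. */
        if (copy_from_segments(out, iov, sizeof(out), 4) != 0)
            /* Buggy retry: asks for all 8 bytes again, but only 4 remain
             * in the array, so the inner loop walks past iov[1] into
             * whatever memory follows - in the kernel, an exploitable
             * out-of-bounds access. */
            copy_from_segments(out, iov, sizeof(out), (size_t)-1);
        return 0;
    }

The retry "forgetting" is exactly this: the progress made by the failed atomic pass lives only in the mutated array, while the caller asks for the original, full length a second time.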

But it took a while. In March 2015, nearly one year later, a Red Hat bugzilla entry was filed for this issue, warning that "A local, unprivileged user could use this flaw to crash the system or, potentially, escalate their privileges on the system." Red Hat's developers quickly verified that the RHEL 5, 6, and 7 distributions were all vulnerable to the problem. They were a little less quick to fix it, though; the first alerts didn't come out until the beginning of June, nearly three months later. A flood of advisories followed in the subsequent months; see the LWN vulnerability entry for a list.

Then things went quiet for a while, but that doesn't mean that nothing was happening. It would appear that the developers of the KingRoot tool happened upon this issue and were able to write an exploit for it. KingRoot is an Android app intended to let users gain root access on their devices. It is not something that most users would see as malicious; they install it intentionally so that they can gain control over their own devices. But if KingRoot can exploit this issue, other Android apps can do so as well.

Once Google became aware of the issue, a patch was put together and an advisory was sent out to the world. Concern about security issues led Google to start sending out monthly security updates a while back, but this one was seen as even more serious than that: an extraordinary, mid-month update was created for it. Owners of Nexus devices should be seeing it show up, though your editor's phone has not yet received the update.

With luck, this one has been caught before it was used to cause serious harm, though it is difficult to know that for sure. Meanwhile, owners of devices that do not receive regular security updates — that would be most of them — will be out of luck and vulnerable for the foreseeable future. In any case, it is unfortunate that a problem that was fixed two years ago, and understood to be a security issue one year ago, had to be fixed in an extraordinary update so much later on.

There is no knowing why Google didn't identify this vulnerability as important when it was first disclosed. Perhaps it didn't seem like something that could be exploited. One can only wonder how many other known but unfixed issues have slipped under the radar in the same way; it is hard to believe that number is zero.

There are a number of conclusions to be drawn from this event, most of which are hardly new or original. A much wider variety of bugs can be exploited than one might think; this is especially true in kernel space. There is value in improving code; designing and implementing a better I/O-tracking abstraction fixed a severe bug that the developers didn't even know was present. There is value in running newer code; the fix went into 3.16, so anything running a kernel at least that recent is unaffected. Unfortunately, Android devices are not known for running recent kernels.

But, most to the point, our current security process is failing us. If we were able to fix problems as quickly as we like to think, it would still be insufficient; it's increasingly clear that patching things up after somebody discloses them is not enough. But, sometimes, problems seemingly slip through the cracks and are not fixed quickly, even when they are known. Exploit writers are less inclined to drag their feet, and will thus always be ahead in this game. We need solutions that are not so exclusively dependent on patching up vulnerabilities after the fact.

Index entries for this article
Security: Linux kernel



A slow path to a fast fix

Posted Mar 24, 2016 6:13 UTC (Thu) by viro (subscriber, #7872) [Link]

No, I hadn't noticed at the time. The function in question had migrated into that commit on queue reordering - IOW, in my tree it got introduced (as a combination of iov_iter_copy_from_user() and iov_iter_advance()) a couple of months earlier. Early February, at a guess... The queue had been reordered and refactored many times; at some point it got used for the pipe conversion to ->write_iter(), where it was obviously the right thing and simplified the logic.

iov_iter_copy_from_user() predates the whole series; it goes back to 2007 and used to live in mm/filemap.c - it's a replacement of filemap_copy_from_user() that doesn't mutilate iovecs, part of the series by Nick Piggin that had introduced iov_iter in the first place. The whole pile (fault-in, try to use kmap_atomic, etc.) had originated there, actually - in a September 2002 series from Andrew Morton. In 2006 Jens Axboe had done something similar in fs/pipe.c; that's when the bug had been introduced.

iov_iter_copy_from_user() did the same song and dance (try to copy under kmap_atomic, etc.), but it didn't modify the iovecs *and* didn't advance the iov_iter. The latter was actually a bad idea - that's what copy_page_from_iter() had done differently, and that's what pretty much every caller wanted. However, the pipe.c variant *did* modify the iovecs and, what's more, it didn't do a self-contained fallback, so it had to be called twice. In a sense, that bug was a result of bad calling conventions...
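To make the calling-convention comparison concrete, here is a rough sketch of the three shapes involved; the signatures are simplified from the code of that era and should be read as illustrative rather than authoritative:

    /* Old fs/pipe.c helper: mutates the iovecs and has no internal
     * fallback, so the caller must invoke it twice (atomic, then not): */
    static int pipe_iov_copy_from_user(void *to, struct iovec *iov,
                                       unsigned long len, int atomic);

    /* iov_iter_copy_from_user(): leaves the iovecs alone and does its
     * own atomic-then-fallback dance internally, but does not advance
     * the iterator, so every caller had to follow it with
     * iov_iter_advance(): */
    size_t iov_iter_copy_from_user(struct page *page, struct iov_iter *i,
                                   unsigned long offset, size_t bytes);

    /* copy_page_from_iter(): leaves the iovecs alone, falls back by
     * itself, and advances the iterator - which is what nearly every
     * caller actually wants: */
    size_t copy_page_from_iter(struct page *page, size_t offset,
                               size_t bytes, struct iov_iter *i);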

The combination of iov_iter_copy_from_user() and advancing the iterator was fairly obvious from a look at its callers. What's more, it *did* the fallback, so it looked like it should've simplified the hell out of pipe_writev() - the thing there was obviously a homegrown analogue, with bloody inconvenient calling conventions.

So there it went. As it turned out, the homegrown analogue (with very obvious intended behaviour) wasn't correct. I really should have done the massage towards the open-coded copy_page_from_iter() - it would've caught the bogosity there and then, *and* it would have avoided compounding another bug in there with an embarrassing bug of my own: in case of a failed copy_from_user() at the very beginning, EFAULT got lost - write() ended up returning 0 instead. Eric Biggers caught that (and a very similar lost error nearby, dating all the way back to 2006) last October...
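A hedged sketch of the lost-error pattern being described - hypothetical names (copy_chunk(), sketch_write()), not the actual code. The write path accumulates a byte count; the usual convention is that an error must only surface when nothing at all was transferred, and the bug was returning the (zero) count even then:

    #include <errno.h>
    #include <sys/types.h>

    /* copy_chunk() stands in for copy_from_user(): returns 0 on
     * success, -EFAULT on a bad user pointer (hypothetical helper). */
    extern int copy_chunk(char *dst, const char *src, size_t n);

    static ssize_t sketch_write(char *dst, const char *src, size_t len)
    {
        ssize_t done = 0;
        int err = 0;

        while (len > 0) {
            size_t n = len < 4096 ? len : 4096;

            err = copy_chunk(dst, src, n);
            if (err)
                break;
            dst += n; src += n; len -= n; done += n;
        }
        /* Correct: a fault before any progress surfaces as -EFAULT.
         * The buggy version effectively returned done unconditionally,
         * so a write() that faulted on its very first byte came back
         * as 0 instead of -EFAULT. */
        return done ? done : err;
    }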

As an aside, it had become pretty clear early on that "take an array of iovecs, consume some data, adjust ->iov_{base,len} so that the next time it would point past the data consumed" was a source of recurring bugs waiting to happen. So a systematic switch to iterators that leave the iovecs unmodified was likely to reduce the likelihood of that class of bugs. Some got noticed during the conversion (e.g. CVE-2014-0069 got caught while doing a similar conversion in cifs_iovec_write(); the fix got reordered to the front of the queue, leaving the pure conversion to copy_page_from_iter() until the merge in April). Some didn't - bogus code with obvious intent got replaced with something that did what everyone assumed that bogus code to have been doing all along, without noticing a bug concealed from everyone (including the original author) by the convolutions of the old variant...

And no, it wasn't a coverup - the cifs bugs noticed at the same time got reported as such; I wouldn't have hesitated to do the same for the pipe one had I noticed it. The code in cifs_iovec_write() was simpler, so the bogosity there got spotted...

Multiple lines of defense are not superfluous

Posted Mar 24, 2016 9:28 UTC (Thu) by ortalo (guest, #4654) [Link]

We should not have to fix security vulnerabilities in a hurry [1]. That is the problem.

Some people like to dismiss successive security protections with varying names: excessive security, redundant controls, double risk coverage, etc. I've seen this kind of discourse in many contexts (technical or non-technical), but such people are simply wrong in the first place (and should probably stop making decisions or taking responsibilities in this area).
Redundant security measures are what allow you to stop losing the eternal chase between vulnerabilities and corrections, by tolerating the presence of some of the inevitable vulnerabilities without compromising security altogether.
Of course, there are useless defense in depth architectures, but they are like useless security mechanisms: useless.

Note that the careful design of a security architecture implementing defense in depth is entirely orthogonal to the problem of selecting good security mechanisms in the first place. (Those arguing for a single protection mechanism also frequently select inadequate ones, to add injury to insult.)

[1] Note that, in the kernel, this is debatable, though. I wonder if self-contradiction qualifies as a redundant control. :-)

A slow path to a fast fix

Posted Mar 24, 2016 10:04 UTC (Thu) by malor (guest, #2973) [Link] (4 responses)

While it's not relevant in this specific case, it would be very good if bugs that are known to have security impacts were labeled as such.

The policy of hiding security fixes is actively hurting end-users. Not all fixes get ported to all the various Linux forks. This will. not. change. Will. Not.

At least, by telling people that patches have a security context when you know they do, you increase the chance of that patch being brought to other kernel trees. Nobody is asking for security analysis if it doesn't already exist, just... share what you know. Hiding security patches is bullshit, and it's not working. There's a huge underground ecosystem that thrives on this, on noticing the things that kernel devs try to hide.

So stop trying to hide things. It would have a wildly disproportionate positive impact for the amount of mental energy expended.

A slow path to a fast fix

Posted Mar 24, 2016 11:27 UTC (Thu) by khim (subscriber, #9252) [Link] (2 responses)

> Not all fixes get ported to all the various Linux forks. This will. not. change. Will. Not.

Indeed. And that's fine. If people like to live with a known devil (and known vulnerabilities) then it's their choice.

> While it's not relevant in this specific case, it would be very good if bugs that are known to have security impacts were labeled as such.

Why? As this example very succinctly demonstrates, this wouldn't change anything: either you follow the ToT (top of tree) or you are broken. Heck, you are broken even if you do follow the ToT; what chance is there for you to remain protected if you decide to stay in the past?

The whole security model is broken. If we care about security then we should redesign our toys (most likely we'll need microkernels and hardware support to make them fast); if we don't (and we don't: we complain loudly when a new vulnerability is found, but if someone offers to sacrifice even just 5% of efficiency... no, we, as a society, don't want to pay for security... at all), then we don't. It's as simple as that.

A slow path to a fast fix

Posted Mar 24, 2016 13:05 UTC (Thu) by ortalo (guest, #4654) [Link] (1 responses)

IMHO, the need for security is here, and it demonstrates itself more and more regularly, and more and more catastrophically, due to its unmanaged state.
The problem is that we need to change the decisions of the people currently holding design authority over our toys (from both the hardware and the software point of view).
First, I guess we should point out that the problem is *not* that these top-level executives or government decision-makers do not want to pay for security.
It was/is that they only want to pay for *their* protection (using corporate or public money, by the way), and they now select the risks they want to cover much too narrow-mindedly, or with far too little adequate information.
Risk selection at the decision-making level is catastrophic in computer security at the moment (and certainly in security more generally), because it has been degrading continuously for... maybe two decades.

Because of that, executives are having everybody else take risks, and they are also taking too much risk for their own organizations (whether companies or administrations).
By the way, it seems to me some US companies are among the first to have started to realize this: that the risks of their customers are also necessarily theirs, to some extent; that they do not live in isolation. But apparently, as private companies, they cannot integrate the full spectrum of risk management (because of, you know, "business" reasons and the learning disabilities money-making seems to create).
Anyway, they should not be allowed to police these risk-management decisions by themselves - they are only businesses, after all.
But then, we are left with governments to actually do that. And governments will annoyingly be part of the problem until they really decide to acquire recent and accurate knowledge of the issues involved.

(Sidenote for the curious executive: no, restarting a second Crypto War debate does not qualify as such at all... Neither does merely multiplying inefficient security controls around your own neighbourhood. Until you actually try to learn, you will not learn anything new...)

A slow path to a fast fix

Posted Mar 28, 2016 0:48 UTC (Mon) by dps (guest, #5725) [Link]

Executives sometimes want security properties that are not consistent with the laws of mathematics. Case in point: could they access their email from anywhere, without a VPN, while it remained impossible for crackers to do the same? Other requests with similar problems involved the firewall machine protecting things on the wild side of it.

This is the sort of logic that RIM used: "The BlackBerry server is insufficiently secure for use in a DMZ network. We suggest you put it in your internal network." When I explained that this amounted to "The BlackBerry server is insecure. We suggest you put it where the damage is maximised.", and that no server anywhere should have security issues beyond those implied by the services it provides, it became an easy decision not to spend a large amount of money on that box.

There is also the issue that all employees need security, not just the chief executive; being the CEO does not make security optional.

A slow path to a fast fix

Posted Mar 28, 2016 7:07 UTC (Mon) by Seegras (guest, #20463) [Link]

> Not all fixes get ported to all the various Linux forks. This will. not. change. Will. Not.

Yes. But why should anyone run some kind of archaic kernel in the first place, a year after the fixed one was released?

A slow path to a fast fix

Posted Mar 25, 2016 13:23 UTC (Fri) by BenHutchings (subscriber, #37955) [Link] (1 responses)

Unfortunately, the fix by Seth Jennings for RHEL, later applied to the stable branches, was still incorrect, leading to CVE-2016-0774. I hope AOSP picks up the second fix as well.

A slow path to a fast fix

Posted Mar 26, 2016 18:43 UTC (Sat) by thestinger (guest, #91827) [Link]

I submitted your linux-stable fix:

https://android-review.googlesource.com/#/c/210720/
https://android-review.googlesource.com/#/c/210730/

I'm sure they're missing a lot more than this though. Maybe one day I'll go through the stable fixes for 3.10 and submit the relevant ones to them. It would be really nice if they pulled in all the stable kernel fixes. I can understand that they'd want to do it themselves, but they really aren't doing a good job.

A while ago, I noticed that they somehow had a vulnerability fix for their own Android extension (VMA naming) in most of their 3.10 kernels that was missing for 3.4. I had run into the vulnerability by writing buggy userspace code triggering the crash. They paid me a bug bounty for reporting it (was not expecting that at all)... so it's hard to complain.


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds