A kernel security hole

By Jake Edge
January 16, 2008

Security holes can sneak into code in surprising ways, even in highly scrutinized codebases. Perhaps even more surprising is how long they can persist in something as popular as the Linux kernel before someone notices. This week's release of stable kernels 2.6.22.16 and 2.6.23.14 is instructive for both of those reasons.

The bug that led to the releases is fixed by a two-line patch, but might be exploitable to cause filesystem corruption. If it were a bug in a driver for an obscure piece of hardware, with relatively few users, it might have been less eye-opening, but it was in the Virtual File System (VFS) layer of the kernel. VFS is the abstraction that allows all kernel filesystems to be used identically regardless of their underlying implementation. The open() system call is used to open any file on any type of filesystem; VFS is what makes that work.

In fact, it is the open() path that is affected by the bug. Due to a faulty test, the bug allows directories to be opened for writing, which is generally a recipe for disaster. It could also allow a file on a read-only filesystem to be opened for writing – depending on the underlying filesystem implementation, that could lead to corruption. Both problems are only locally exploitable.

The bug was introduced in a change to support NFS in October of 2005 – more than two years ago; all kernels since 2.6.15 are affected. The change was aimed at making NFSv4 open calls be atomic (because an open is really a lookup followed by an open), but also did some code reorganization that changed the semantics of a flag variable. That variable was being used to determine the access mode for directories and read-only filesystems, so that change subtly broke the tests.

Part of the problem is that the tests are in a function called may_open(), which takes two flag parameters:

    int may_open(struct nameidata *nd, int acc_mode, int flag)

The incorrect code was using flag in the tests when it should have been using acc_mode. Each of them is a bitmask of values that, at first glance, might be easy to confuse – each is related to permissions. The bit values for each have names like FMODE_WRITE and MAY_WRITE, which would seem to have a fair amount of overlap. This may explain why the problem was not spotted at the time it was introduced.
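
To make the mix-up concrete, here is a minimal user-space sketch of the two tests. It is not the kernel source: the `model_` names, the constant values, and the simplified parameters (two booleans standing in for what the kernel learns from `struct nameidata`) are assumptions made for illustration. The only thing it tries to show is that a test on `flag` and a test on `acc_mode` can disagree when the two masks carry different information.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative values only; the real definitions live in the kernel headers. */
#define MODEL_FMODE_READ  1
#define MODEL_FMODE_WRITE 2
#define MODEL_MAY_WRITE   2
#define MODEL_MAY_READ    4

/* Buggy shape: the write tests consult the open flags. */
int model_may_open_buggy(bool is_dir, bool fs_read_only, int acc_mode, int flag)
{
    (void)acc_mode;                     /* never consulted: that is the bug */
    if (is_dir && (flag & MODEL_FMODE_WRITE))
        return -EISDIR;
    if (fs_read_only && (flag & MODEL_FMODE_WRITE))
        return -EROFS;
    return 0;
}

/* Fixed shape: the same tests, but against the access mode. */
int model_may_open_fixed(bool is_dir, bool fs_read_only, int acc_mode, int flag)
{
    (void)flag;
    if (is_dir && (acc_mode & MODEL_MAY_WRITE))
        return -EISDIR;
    if (fs_read_only && (acc_mode & MODEL_MAY_WRITE))
        return -EROFS;
    return 0;
}
```

In this model, a caller whose access mode asks for write permission while its flags only say "read" gets 0 from the buggy version and -EISDIR (or -EROFS) from the fixed one. Whether and when the real kernel produces such a combination is outside what the sketch claims.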

There may be no easy solution to this kind of problem – other than more scrutiny. Using different types, rather than plain int, for each flag might have helped, but since the tests were using the right kind of bit values for flag, that is a somewhat hard sell.
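
As a rough illustration of what "different types" could look like in plain C, the sketch below wraps each mask in its own one-member struct. All names here (`acc_mode_bits`, `open_flag_bits`, the `SKETCH_` constants, `typed_may_open`) are hypothetical and are not kernel interfaces. It also shows why this is a hard sell for this particular bug: the wrong test would still have type-checked.

```c
#include <errno.h>
#include <stdbool.h>

/* Two anonymous structs are distinct types, so the compiler rejects
 * passing one where the other is expected. */
typedef struct { unsigned int bits; } acc_mode_bits;
typedef struct { unsigned int bits; } open_flag_bits;

/* Illustrative values only. */
#define SKETCH_MAY_WRITE   2u
#define SKETCH_FMODE_WRITE 2u

acc_mode_bits  make_acc_mode(unsigned int bits)  { acc_mode_bits m = { bits };  return m; }
open_flag_bits make_open_flag(unsigned int bits) { open_flag_bits f = { bits }; return f; }

/* Each mask can only be tested through its own accessor. */
bool acc_wants_write(acc_mode_bits m)   { return (m.bits & SKETCH_MAY_WRITE) != 0; }
bool flag_wants_write(open_flag_bits f) { return (f.bits & SKETCH_FMODE_WRITE) != 0; }

/* Swapping the last two arguments at a call site no longer compiles. */
int typed_may_open(bool is_dir, acc_mode_bits acc_mode, open_flag_bits flag)
{
    /* Writing flag_wants_write(flag) here would compile just as cleanly,
     * so the types alone do not say which mask the test should use. */
    (void)flag;
    if (is_dir && acc_wants_write(acc_mode))
        return -EISDIR;
    return 0;
}
```

The wrappers catch one class of mistake (handing a flag mask to code that expects an access mode) but not the choice of which correctly typed variable to test.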

Something unpleasant to consider in all of this is that this may not be the first time this problem has been noticed. It may just have been the first time it was noticed by someone who reported it. Folks with malicious intent are much less inclined to report bugs. This particular bug is not one that would be particularly useful to attackers, but we would do well to remember that fixing a two-year-old hole means that systems were vulnerable for all that time. It is not only the good guys who can read code.

A kernel security hole

Posted Jan 17, 2008 2:08 UTC (Thu) by jamesm (guest, #2273) [Link] (2 responses)

Something to note is that the LSM hooks associated with these checks always used the correct acc_mode variable, so there was potentially some mitigation possible if e.g. using SELinux MAC policy. This was really accidental, and outside the scope of the protection goals of LSMs, but it is not the first time that a kernel bug has been mitigated in this way. It seems that having both DAC and MAC frameworks in the kernel provides some unexpected "defense in depth" benefits.

A kernel security hole

Posted Jan 17, 2008 9:12 UTC (Thu) by filipjoelsson (guest, #2622) [Link] (1 responses)

So, what you're saying is that implementing the same checks twice independently significantly
lowers the risk of this kind of mistake, right? (If the chance of a mistake is 1:1000 - the
chance of the same mistake in both layers is 1:1000000.)

Though this is clearly true, doesn't it imply a bit of a performance problem? And is every
relevant check implemented twice (i.e. once in code and once for the security module)?

Or am I completely misunderstanding something?

A kernel security hole

Posted Jan 17, 2008 10:04 UTC (Thu) by jamesm (guest, #2273) [Link]

I don't know if you can generalize things in that way, as it is essentially a side effect of having independent DAC and MAC mechanisms that has been observed a couple of times. Yes, there is a performance hit when you have multiple security mechanisms, but MAC is typically expected to have some impact.

Wrong name

Posted Jan 17, 2008 9:56 UTC (Thu) by NAR (subscriber, #1313) [Link] (18 responses)

Maybe the name of the variable shouldn't be 'flag'. That's really the type of the variable (i.e.
it's a flag), but its name should indicate what it's used for (e.g. dir_access_mode - I don't
know how that variable is used, so it's probably a wrong name).

Wrong name

Posted Jan 17, 2008 10:45 UTC (Thu) by ms (subscriber, #41272) [Link] (17 responses)

Maybe people should actually use sensible types. At the very least, [Bool] would be
preferable, though obviously unbounded. It should really be a record with meaningful names:

data OpenFlags = OpenFlags
                 { flagName1 :: Bool
                 , flagName2 :: Bool
                 , ...
                 }

data OpenAccMode = OpenAccMode
                   { modeName1 :: Bool
                   , ...
                   }

And then, oh look, 1) we have meaningful names 2) the fact it's a bit mask is obvious and 3),
and most importantly, you can't get them the wrong way round as the type of may_open is:

may_open :: NameIData -> OpenAccMode -> OpenFlags -> IO Int

Once again, this bug is simply a result of relying on a prehistoric language.

Wrong name

Posted Jan 17, 2008 11:04 UTC (Thu) by njs (subscriber, #40338) [Link] (4 responses)

What language would you suggest?

For some reason the programming language geeks who understand why powerful type systems are
useful also only care about hyper-abstract functional languages, which are nice and all but
have serious limitations when it comes to writing a kernel, don't have compilers for most
architectures, etc.

Wrong name

Posted Jan 17, 2008 11:48 UTC (Thu) by ms (subscriber, #41272) [Link] (3 responses)

Well there's no hiding it, I'm a Haskell fan. GHC does support the more standard architectures
(x86, sparc, ppc - see http://haskell.org/ghc/download_ghc_682.html). As for performance, I
really think that the choice of C in general is a premature optimisation. I would argue that
it would be easier and quicker and much safer to write the kernel in a much higher-level
language like Haskell and then invest time in making the compiler and optimiser really
staggeringly clever.

On the other hand, I'm in no way speaking from experience, so I guess like with most
academics, the world will look the other way...

Incidentally, the reason why powerful type-systems tend to only appear in functional languages
is that imperative languages just allow the programmer too much madness to permit a really
powerful type system. Scala pretty much contains everything you can get away with in an
imperative language and even there, I'm unaware that anyone's actually proved it sound.

Wrong name

Posted Jan 17, 2008 12:14 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

"I would argue that it would be easier and quicker and much safer to write the kernel in a
much higher-level language like Haskell and then invest time in making the compiler and
optimiser really staggeringly clever."

It's not obvious to me that this would work, nor that if it did work it would fix a
sufficiently broad category of problems to be worthwhile, and nor that if it did work, AND
fixed a broad category of problems, it would actually take comparable time (if it takes 10
years to do what used to take six weeks then you've shot yourself in the foot because 10 years
is too late). It's also not obvious that Linux Kernel developers (often expert in C and
low-level hardware stuff) would make good compiler designers, since these are largely
unrelated skills.

No-one has, to my knowledge, been stopping people from actually developing these staggeringly
clever compilers over the decades since LISP and C were the state of the art. It's even
regarded as a genuinely interesting problem (unlike Operating systems which have been largely
treated as a commodity) so you could get funding to work on it.

Wrong name

Posted Jan 17, 2008 21:27 UTC (Thu) by droundy (subscriber, #4559) [Link]

"I would argue that it would be easier and quicker and much safer to write the kernel in a
much higher-level language like Haskell and then invest time in making the compiler and
optimiser really staggeringly clever."

I'm also a big fan of Haskell, but wouldn't really want the kernel to be written in Haskell.
It's a wonderful language, and adding more static type-checking to the kernel would be great,
but for the kernel, performance should be everything (well, almost everything).  I think the
kernel devs have it right: add static checks to C (via sparse).

David

high level kernel code

Posted Jan 18, 2008 21:49 UTC (Fri) by giraffedata (guest, #1954) [Link]

I agree with your view, but I don't like the phrasing, "compiler and optimizer." The optimizer is an intrinsic part of the compiler -- it's only in low level languages such as C that a user can have a concept of the natural object code, which can be distinguished from optimized code.

The compiler has to be thought of as something that writes machine code, not something that translates source code into machine code. Then it makes sense to say people should spend their time making compilers that write efficient code rather than writing efficient code themselves.

Wrong name

Posted Jan 17, 2008 11:10 UTC (Thu) by jonth (guest, #4008) [Link] (4 responses)

Yeah, let's rewrite the kernel in Java. Or C#. Or Haskell.

Can someone give me a mainstream example of a kernel written in anything other than C or
assembler? I would be surprised. Maybe someone's done one in C++, but it doesn't address the
concerns above.

J

Kernels in other languages

Posted Jan 17, 2008 12:05 UTC (Thu) by tialaramex (subscriber, #21167) [Link] (3 responses)

Haiku (which possibly doesn't count as mainstream) uses lots of C++ including in the kernel
and device drivers.

It barely works (still isn't self-hosting after more than six years) and its performance is
miserable, but there you are, an example of something other than C or assembler.

Kernels in other languages

Posted Jan 17, 2008 12:20 UTC (Thu) by ms (subscriber, #41272) [Link] (2 responses)

There are 3 kernels in Haskell, in various states of development and/or (dis)repair.

House: http://programatica.cs.pdx.edu/House/
seL4: http://www.ertos.nicta.com.au/research/sel4/
Kinetic: http://www.ninj4.net/kinetic/

I think seL4 is the most active at the moment.

Kernels in other languages

Posted Jan 17, 2008 12:29 UTC (Thu) by jonth (guest, #4008) [Link] (1 responses)

Hardly mainstream, but I have to say I didn't expect to see anything in Haskell. I doff my
cap. Does anyone know how fast these things go?

Kernels in other languages

Posted Jan 17, 2008 15:14 UTC (Thu) by mrfredsmoothie (guest, #3100) [Link]

They go to 11.

Wrong name

Posted Jan 17, 2008 12:25 UTC (Thu) by NAR (subscriber, #1313) [Link] (6 responses)

"Once again, this bug is simply a result of relying on a prehistoric language."

I'm afraid it's not. I agree that C could be improved with a bitfield type like this:

bitfield accessMode {
  bit open;
  bit write;
  ...
};

but wrongly named variables of the same type would still lead to this kind of error.

Wrong name

Posted Jan 17, 2008 12:29 UTC (Thu) by ms (subscriber, #41272) [Link] (3 responses)

No no. The problem is these parameters should not have the same type.

Wrong name

Posted Jan 17, 2008 13:06 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

They could be changed to enumerated types, but that wouldn't automatically help much because
conversion between integer and enumerated types is implicit in C. Perhaps sparse would have
caught it though.

Wrong name

Posted Jan 17, 2008 14:28 UTC (Thu) by nix (subscriber, #2304) [Link] (1 responses)

It could have been made more obvious, and thus more likely to be rapidly spotted, by making
sure that FMODE_WRITE and MAY_WRITE had different *values* which overlapped with something
quite different in the other flag: but if that had been thought of, this bug would never have
happened because people would have been paying extra attention to it anyway.

Wrong name

Posted Jan 17, 2008 16:17 UTC (Thu) by tbellman (guest, #49983) [Link]

Except that it wouldn't have helped.  The buggy code used FMODE_WRITE to check the bit in the
variable 'flag'; the correct code uses MAY_WRITE to check the bit in the variable 'acc_mode'.
The buggy code did use the correct access mechanisms for the variable it looked at, so no
amount of BDSM type control would have helped.

Wrong name

Posted Jan 17, 2008 13:06 UTC (Thu) by guus (subscriber, #41608) [Link] (1 responses)

C already supports bitfields:

struct accessMode {
    unsigned int open:1;
    unsigned int write:1;
    ...
};

Bitfields

Posted Jan 17, 2008 14:57 UTC (Thu) by zlynx (guest, #2285) [Link]

But the kernel doesn't appear to use bitfields much.  I seem to remember that it's something
to do with GCC and miserable performance on non-x86 arch's.

A kernel security hole

Posted Jan 17, 2008 14:57 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

I thought this only allowed open(..., ... | O_TRUNC) on directories, allowing attackers to
truncate directories. If it allows writing to directories too (I'm behind a harsh firewall and
can't check right now), then I suspect this could be an exploitable hole as well as 'mere'
filesystem corruption.

Consider: this code amounts to allowing hostile attackers to call link() on arbitrary files,
even in subdirectories they cannot read, by editing the directory to include a reference to
whatever-it-is. But more than that, it allows them to call link() *without adjusting the link
count*.

So, imagine a system which had /tmp or /var/tmp or something user-writable on the same
filesystem as /. The hostile attacker creates a directory, opens it for writing, and attaches a
link to /etc/passwd (without readdir()ing it, so IIRC the directory won't be cached yet so the
changes will be picked up: if not, allocate a lot of memory to push the cached copy out).

Then unlink() that copy, and the link count will fall to zero, leading to /etc/passwd being
unlinked. It's not open all the time, so this now leads to /etc/passwd's blocks being freed.
So far, so corruption: but now the attacker fills up his/her quota with minimum-size files
containing the /etc/passwd contents he wants (up to the length of /etc/passwd beforehand, or
one block, whichever is smaller), then unlinks them again, repeatedly. Because /etc/passwd is
now a view onto free space, I suspect that if you do this for long enough, it would let the
attacker replace /etc/passwd's contents, with, say, a single root account without password :)
unsubtle, but still an attack.

(Again, I haven't been able to check the code from here, and I can't recall whether files
whose link count falls to zero get their size reset to zero as well. If so, this particular
attack is prevented, and 'all' the attacker would be able to do would be to in effect truncate
any file on a filesystem he could write to, rather than being able to write arbitrary content
to it.)

A kernel security hole

Posted Jan 17, 2008 16:33 UTC (Thu) by tbellman (guest, #49983) [Link]

As far as I have been able to figure out, you can only truncate directories due to this bug,
not write arbitrary contents to them.  There are checks in place in the write() system call
that check that you actually have opened the file with O_WRONLY or O_RDWR (but having those
set would stop you already in the open() call).

The bug also only allows you to truncate directories to which you already have write
permission.  I.e, you can truncate /tmp, but not /etc.

However, that is probably exploitable enough.  After clearing /tmp, you could then create a
new /tmp/.X11-unix/X0 socket going to your own program.  Any new X client would connect to
that and talk to your program, which could act as a man-in-the-middle, grabbing the X cookie
and gaining access to the X server.  Or you can replace the socket to someone's ssh-agent, and
thus grab all ssh keys added after that.  (Normally, the sticky bit on /tmp prevents you from
doing this.)  I wouldn't be surprised if you can attack other programs too, by replacing
temporary files they don't expect to be replaced...

A kernel security hole

Posted Jan 18, 2008 21:40 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

"Then unlink() that copy, and the link count will fall to zero, leading to /etc/passwd being unlinked. It's not open all the time, so this now leads to /etc/passwd's blocks being freed."

It doesn't unlink /etc/passwd (and that's the problem). What it does is delete the password file. /etc/passwd now points to a ghost inode.

So what you want to do to exploit this is create files until one happens to get that inode slot (i.e. inode number). Now you own the password file.

A kernel security hole

Posted Jan 18, 2008 21:44 UTC (Fri) by nix (subscriber, #2304) [Link]

Argh. Yeah, that's worse: now you can put multi-block content in there 
fairly fast and without filling the disk up, which means you could e.g. 
copy the pre-existing /etc/passwd and remove root's password, or add a new 
uid 0 account... plus you own the file, too, I guess that's a security 
hole :)