By Jonathan Corbet
February 12, 2008
As this is being written, distributors are working quickly to ship kernel
updates fixing the local root vulnerabilities in the
vmsplice()
system call. Unlike a number of other recent vulnerabilities which have
required special situations (such as the presence of specific hardware) to
exploit, these vulnerabilities are trivially exploited and the code to do
so is circulating on the net. Your editor found himself wondering how such
a wide hole could find its way into the core kernel code, so he set himself
the task of figuring out just what was going on - a task which took rather
longer than he had expected.
The splice() system call, remember, is a mechanism for creating
data flow plumbing within the kernel. It can be used to join two file
descriptors; the kernel will then read data from one of those descriptors
and write it to the other in the most efficient way possible. So one can
write a trivial file copy program which opens the source and destination
files, then splices the two together. The vmsplice() variant
connects a file descriptor (which must be a pipe) to a region of user
memory; it is in this system call that the problems came to be.
The first step in understanding this vulnerability is that, in fact, it is
three separate bugs. When the word of this problem first came out, it was
thought to only affect 2.6.23 and 2.6.24 kernels. Changes to the
vmsplice() code had caused the omission of a couple of important
permissions checks. In particular, if the application had requested that
vmsplice() move the contents of a pipe into a range of memory, the
kernel didn't check whether that application had the right to write to that
memory. So the exploit could simply write a code snippet of its choice
into a pipe, then ask the kernel to copy it into a piece of kernel memory.
Think of it as a quick-and-easy rootkit installation mechanism.
If the application is, instead, splicing a memory range into a pipe, the
kernel must, first, read in one or more iovec structures
describing that memory range. The 2.6.23 vmsplice() changes omitted
a check on whether the purported iovec structures were in readable
memory. This looks more like an information disclosure vulnerability than
anything else - though, as we will see, it can be hard to tell sometimes.
These two vulnerabilities (CVE-2008-0009 and CVE-2008-0010) were patched in
the 2.6.23.15 and 2.6.24.1 kernel updates,
released on February 8.
On February 10, Niki Denev pointed out that
the kernel appeared to be still vulnerable after the fix. In fact, the
vulnerability was the result of a different problem - and it is a much worse one, in
that kernels all the way back to 2.6.17 are affected. At this point, a
large proportion of running Linux systems are vulnerable. This one has
been fixed in the 2.6.22.18,
2.6.23.16, and 2.6.24.2 kernels, also released
on the 10th. At this point, with luck, all of these bugs have been firmly
stomped - though, now, we need to see a lot of distributor updates.
The problem, once again, is in the memory-to-pipe implementation. The
function get_iovec_page_array() is charged with finding a set of
struct page pointers corresponding to the array of iovec
structures passed in by the calling application. Those pointers are stored
in this array:
struct page *pages[PIPE_BUFFERS];
Where PIPE_BUFFERS happens to be 16. In order to avoid
overflowing this array, get_iovec_page_array() does the following
check:
npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (npages > PIPE_BUFFERS - buffers)
npages = PIPE_BUFFERS - buffers;
Here, off is the offset into the first page of the memory to be
transferred, len is the length passed in by the application, and
buffers is the current index into the pages array.
Now, if we turn our attention to the exploit code for a
moment, we see it
setting up a number of memory areas with mmap(); some of that
setup is not necessary for the exploit to work, as it turns out. At the
end, the code does this (edited slightly):
iov.iov_base = map_addr;
iov.iov_len = ULONG_MAX;
vmsplice(pi[1], &iov, 1, 0);
The map_addr address points to one of the areas created with
mmap() which, crucially, is significantly more than
PIPE_BUFFERS pages long. And the length is passed through as the
largest possible unsigned long value.
Now let's go back to fs/splice.c, where the vmsplice()
implementation lives. We note that, prior to the fix, the
kernel did not check whether the memory area pointed to by the
iovec structure was readable by the calling process. Once again,
this looks like an information disclosure vulnerability - the process could
cause any bit of kernel memory to be written to the pipe, from which it
could be read. But the exploit code is, in fact, passing in a valid
pointer - it's just the length which is clearly absurd.
Looking back at the code which calculates npages, we see
something interesting:
npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (npages > PIPE_BUFFERS - buffers)
npages = PIPE_BUFFERS - buffers;
Since len will be ULONG_MAX when the exploit runs, the
addition will cause an integer overflow - with the effect that
npages is calculated to be zero. Which, one would think, would
cause no pages to be examined at all. Except that there is an unfortunate
interaction with another part of the kernel.
Once npages has been calculated, the next line of code looks like
this:
error = get_user_pages(current, current->mm,
(unsigned long) base, npages, 0, 0,
&pages[buffers], NULL);
get_user_pages() is the core memory management function used to
pin a set of user-space pages into memory and locate their struct
page pointers. While the npages variable passed as an
argument is an unsigned quantity, the prototype for
get_user_pages() declares it as a simple int called len. And, to
complete the evil, this function processes pages in a
do {} while(); loop which ends thusly:
len--;
} while (len && start < vma->vm_end);
So, if get_user_pages() is passed with a len argument of
zero, it will pass through the mapping loop once, decrement len to a
negative number, then continue faulting in pages until it hits an address
which lacks a valid mapping. At that point it will stop and return. But,
by then, it may have stored far more entries into the pages array
than the caller had allocated space for.
The practical result in this case is that get_user_pages() faults
in (and stores struct page pointers for) the entire region mapped
by the exploit code. That region (by design) has more than
PIPE_BUFFERS pages - in fact, it has three times that many, so 48
pointers get stored into a 16-pointer array. And this turns the failure to read-verify
the source array into a buffer overflow vulnerability
within the kernel. Once that is in place, it is a relatively
straightforward exercise for any suitably 31337 hacker to cause the kernel
to jump into the code of his or her choice. Game over. (Update: as
a linux-kernel reader pointed out, the
story is a little more complicated still at this point; this is an unusual
sort of buffer overflow attack).
The fix
which was applied simply checks the address range that the
application is trying to splice into the pipe. Since a range of length
ULONG_MAX is unlikely to be valid, the vulnerability is closed -
as are any potential information disclosure problems.
This vulnerability is a clear example of how a seemingly read-only
vulnerability can be escalated into something rather more severe. It also
shows what can happen when certain types of sloppiness find their way into
the code - if get_user_pages() is asked to get zero pages, that's
how many it should do. Your editor is working on a patch to clean that up
a bit. Meanwhile, everybody should ensure that they are running current
kernels with the vulnerability closed.
(
Log in to post comments)