By Jonathan Corbet
March 4, 2008
Back in February, LWN published
a
discussion of the vmsplice() exploit which showed how the
failure to check permissions for a read operation led to a buffer overflow
within the kernel. Subsequently, a linux-kernel reader
pointed out that the article
stopped short of a complete explanation: this is not an ordinary buffer
overflow exploit. Travel schedules and such prevented the writing of an
immediate followup, but your editor would still like to tell the full
story. So this article picks up where the last one left off and describes
how the
vmsplice() exploit makes use of this buffer overflow to
take over the system.
When vmsplice() is being used to feed data from memory into a
pipe, the function charged with making it all happen is
vmsplice_to_pipe(), found in fs/splice.c. It declares a
couple of arrays of interest:
struct page *pages[PIPE_BUFFERS];
struct partial_page partial[PIPE_BUFFERS];
PIPE_BUFFERS, remember, is 16 on exploitable configurations. Both
of these arrays are passed into get_iovec_page_array(), which, as
described in the previous article, makes a call to
get_user_pages() to fill in the pages array. As a result
of the failure to check whether the calling application is allowed to read
the requested region of memory, get_user_pages() will overflow the
pages array, writing far more than PIPE_BUFFERS pointers
into it. These are, however, pointers to legitimate kernel data
structures; it remains to be seen how this overflow enables the attacker to
take control of the system.
The partial array is also passed into
get_iovec_page_array(); it describes the portion of each page which
should be written into the pipe. To that end, a loop like this is run
immediately after returning from get_user_pages():
for (i = 0; i < error; i++) {
const int plen = min_t(size_t, len, PAGE_SIZE - off);
partial[buffers].offset = off;
partial[buffers].len = plen;
/* ... */
}
Since full pages are being written in this case, the calculated offset will be zero, and the length
will be PAGE_SIZE (4096). The value of error is the
return value from get_user_pages(); that will be the number of
pages actually mapped: 46, in the case of the exploit. Remember that the
partial array is also dimensioned to hold 16 entries, so this loop
will overflow that array as well.
Both of these arrays are declared, one right after the other, in
vmsplice_to_page(). A quick test by your editor suggests that the
partial array will be placed below pages in memory, so,
once partial is overflowed, the loop will start overwriting
pages instead. So the pages array will end up containing
alternating values of zero and 4096 rather than the real struct
page pointers it had before. (It's worth noting that the exploit
still works if the arrays are placed in the opposite order, since the
overflow causes code down the line to think that pages is larger
than it really is).
Once all this has happened, control returns to vmsplice_to_pipe()
- the overflow is not big enough to have overwritten the return address. A
call to splice_to_pipe() is supposed to finish the job, but
something interesting happens there. Toward the beginning of this
function, this test is made:
if (!pipe->readers) {
send_sig(SIGPIPE, current, 0);
if (!ret)
ret = -EPIPE;
break;
}
Looking back at the exploit
code, we see that it closes the read side of the pipe before calling
vmsplice(). So splice_to_pipe() will quit almost
immediately. On its way out, however, it does this:
while (page_nr < spd_pages)
page_cache_release(spd->pages[page_nr++]);
The call to get_user_pages() will have locked each of the relevant
pages into memory to allow the kernel to work with them; this is the
cleanup code which goes back and unlocks the pages which will not be used.
But remember that the pointers in the pages array have been
overwritten, and are now either zero or 4096. What would normally happen
here is a kernel oops, since those are not legitimate addresses. The
exploit code has done something tricky, though: using some special
mmap() calls, it has created some anonymous memory at the bottom
of its address space.
Directly dereferencing user-space addresses while running in kernel mode is
frowned upon for a number of reasons; it can blow up in a number of ways.
But, if the address is valid and the relevant page is resident in memory,
direct access to user-space memory will work. So, when the kernel starts
to work with the addresses that it thinks are struct page
pointers, it does not get any sort of fault; instead, it gets the data
placed in that memory by the exploit. Needless to say, that data has been
arranged carefully.
The Linux kernel normally manages each page as an independent object.
There are times, however, when pages are grouped into larger units, called
"compound pages." This generally happens when physically contiguous
allocations larger than one page are needed by the kernel; when this
happens, a compound page is passed back to the caller. These pages are
special in that they must be split back apart when they are released back
into the system, and there may be other cleanup work to do. So
compound pages have an attribute not found on normal pages: a destructor
which is called when the page is freed.
So, if we look at how the exploit sets up its low-memory page
structures, we see:
pages[0]->flags = 1 << PG_compound;
pages[0]->private = (unsigned long) pages[0];
pages[0]->count = 1;
pages[1]->lru.next = (long) kernel_code;
When the kernel looks for a page structure at user-space address
zero, it will find something which looks like a compound page. The
destructor (stored in the lru.next field of the second
page structure) is set to kernel_code(), a function
defined within the exploit itself. Since the count is set to one,
the call to page_cache_release() (which decrements that count)
will conclude that there are no further references and, since the page looks like
a compound page, the destructor will be called. At this point, the exploit
has arbitrary code running in kernel mode, and the show is truly over.
This code just sets the process's uid to zero (giving it root
access), then engages in some assembly-language trickery to return
immediately to user space, shorting out the rest of the cleanup process.
There are a couple of interesting implications from all of this. One, clearly,
is that this exploit is not something which was bashed out by a script
kiddie somewhere. It was written by somebody who understands low-level
kernel code quite well and who is able to use that understanding to
escalate an apparent information-disclosure vulnerability into a full code
execution problem. It is, clearly, a mistake to underestimate those who
write exploits, not all of whom immediately make their works known to the
development community. One also should not assume that they have not
already written exploits for other, still unfixed bugs.
Also worth noting is the fact that ordinary buffer overflow protection may
well have not been effective against this vulnerability. The return address on
the stack was not overwritten, and no exploit code was put in data areas.
This episode has caused a renewed interested in technical security measures
in the kernel. These measures are good, but it would be a mistake to think
that they will fix the problem. What is really needed is stronger review
of patches with security in mind; it is not yet clear to your editor that
this review is happening.
(
Log in to post comments)