Brief items
The current 2.6 prepatch is 2.6.25-rc1, released by Linus on February 10.
It is a huge patch. Among many other things, 2.6.25 will have realtime
group scheduling, preemptible RCU, LatencyTop support, a bunch of ext4
filesystem enhancements, the controller area network protocol, Atheros
wireless support, the reworked timerfd() system call, the page map patches,
the SMACK security module, the container memory use controller, the ACPI
thermal regulation API, and support for the MN10300/AM33 architecture. See
the short-form changelog for lots of details, or the long changelog for
more detail than anybody can cope with.
As of this writing, a few dozen small fixes have gone into the mainline git
repository since the -rc1 release.
The current stable 2.6 kernel is 2.6.24.2, released on February 10.
This update contains a single patch fixing the vmsplice()
vulnerability. 2.6.24.1 was
released - with a rather longer list of fixes - on February 8.
For older kernels: 2.6.23.16 and 2.6.22.18 both came out on
February 10; they, too, contain the vmsplice() fix. 2.6.23.15 was released on
February 8 with a few dozen fixes. And 2.6.22.17, also with quite a few
fixes, came out on February 6.
Kernel development news
Remember, we are currently clocking along at the steady rate of:
4000 lines added every day
1900 lines removed every day
1300 lines modified every day
--
Greg Kroah-Hartman
???? lines reviewed every day.
--
Al Viro
By Jonathan Corbet
February 12, 2008
The 2.6.25 merge window closed on February 10, after the merging of an
eye-opening 9450 non-merge changesets. Most of the changes merged for
2.6.25 were covered in the
first and
second "what got merged"
articles. This, the third in the series, covers the final 1900 patches
merged before the window closed.
User-visible changes include:
- There are new drivers for SC2681/SC2691-based serial ports, Dallas
DS1511 timekeeping chips, AT91sam9 realtime clock devices, Compaq
ASIC3 multi-function chips, Cell Broadband Engine memory controllers,
Marvell MV64x60 memory controllers, PA Semi PWRficient NAND flash
interfaces, Marvell Orion NAND flash controllers, Freescale eLBC NAND
flash controllers, Sharp Zaurus SL-6000x keyboards, Fujitsu Lifebook
Application Panel buttons, IPWireless 3G UMTS PCMCIA cards,
intelligent storage device enclosures, Winbond W83L786NG
and W83L786NR sensor chips, Texas Instruments ADS7828
12-bit 8-channel ADC devices, and Sony MemoryStick cards.
- Also added are updated video drivers for Radeon R500 chipsets (2D
acceleration is now supported) and Intel i915 chipsets (suspend and
resume now work properly).
- Several more obsolete OSS audio drivers have been removed. The old
mxser driver has also been removed in favor of mxser_new, now called
simply "mxser."
- File descriptors returned by inotify_init() now support
signal-based (using SIGIO) I/O. There is also a new
notification event (IN_ATTRIB) sent when the link count of a
watched file changes.
- The mac80211 (formerly Devicescape) wireless subsystem is no longer
marked "experimental."
- The memory use controller for containers has been merged. This
controller was described in this LWN article, but the
patch has evolved somewhat since then and the details have changed.
Some documentation can be found in Documentation/controllers/memory.txt.
- ACPI thermal regulation support has been added; see Documentation/thermal/sysfs-api.txt for
details on how it works. The ACPI code also now supports the Windows
Management Instrumentation interface, and uses that support to make
recent Acer laptops work.
- ACPI now provides support for users who want to override their
system's Differentiated System Description Table (DSDT).
- The XFS filesystem now supports the fallocate() system call.
- ATA-over-Ethernet (AoE) now properly supports devices with multiple
network interfaces (and, thus, multiple paths to the host).
- Support for the MN10300
architecture (little-endian mode only) has been added.
- Support for a.out binaries has been removed from the ELF loader. Pure
a.out systems will still work, though.
- Disk I/O statistics (as seen in /proc/diskstats and under
/sys/block) have been augmented with more information about
request merging and I/O wait time.
- The S390 architecture now implements dynamic page tables - processes
will use 2-, 3-, or 4-level page tables depending on the size of their
address space.
- The ext4 "in development" flag has been added; mounting an ext4
filesystem will now require an explicit "I know this might explode"
option.
Changes visible to kernel developers include:
- Many nopage() methods have been replaced by the newer
fault() API; the near-term plan is to remove
nopage() altogether. See this article for a
description of the new way of "page not present" handling.
- This cycle has also seen a bit of a reinvigoration of the long-stalled
project to eliminate the big kernel lock. A number of BKL-removal
patches have been merged, with more certainly to come.
- A generic resource counter mechanism was merged as part of the memory
controller patch set; see <linux/res_counter.h> for the
details.
- reserve_bootmem() has a new flags parameter. Most
callers will set it to BOOTMEM_DEFAULT; the kdump code,
though, uses BOOTMEM_EXCLUSIVE to ensure that it is the only
one to touch the memory.
- Most architectures now have support for cmpxchg64() and
cmpxchg_local().
- There is a new set of string functions:
    extern int strict_strtoul(const char *string, unsigned int base,
                              unsigned long *result);
    extern int strict_strtol(const char *string, unsigned int base,
                             long *result);
    extern int strict_strtoull(const char *string, unsigned int base,
                               unsigned long long *result);
    extern int strict_strtoll(const char *string, unsigned int base,
                              long long *result);
These functions convert the given strings to various forms of
long values, but they will return an error status if the
given string value, as a whole, does not represent a proper
integer value. These functions are now used in the parsing of kernel
parameters.
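The "as a whole" rule behind the strict string functions can be illustrated in user space. What follows is a minimal sketch, not the kernel's implementation: strict_parse_ulong() is a hypothetical name, it is built on the C library's strtoul(), and the choice of -EINVAL as the error status is an assumption made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical user-space analogue of a strict conversion: succeed only
 * if the entire string is a valid unsigned integer in the given base.
 * Returns 0 on success and -EINVAL otherwise (an assumed convention).
 */
int strict_parse_ulong(const char *string, unsigned int base,
                       unsigned long *result)
{
    char *end;
    unsigned long value;

    assert(string != NULL && result != NULL);

    /* strtoul() would skip whitespace and accept a sign; a strict
     * parser rejects both, along with the empty string. */
    if (!isalnum((unsigned char)*string))
        return -EINVAL;

    errno = 0;
    value = strtoul(string, &end, (int)base);

    /* Fail on overflow, on "no digits found", and on trailing junk. */
    if (errno != 0 || end == string || *end != '\0')
        return -EINVAL;

    *result = value;
    return 0;
}
```

With this sketch, "42" converts successfully while "42abc" is rejected outright; plain strtoul() would have quietly returned 42 for both and left it to the caller to notice the leftover characters.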
At this point, the merging of features is done (though there has been a bit
of pushing for one or two things to slip in) and the stabilization period
begins. With luck, that process will go a little more quickly than it did
with 2.6.24.
By Jonathan Corbet
February 13, 2008
The kernel development process operates at a furious pace, merging
on the order of 10,000 changesets over the course of a 2-3 month
release cycle. There have been many changes over the last few years which
have helped to make this level of patch flow possible, and the process has
been optimized significantly. An ongoing discussion on the kernel mailing
list has made it clear, though, that a truly optimal solution has not yet
been found.
It started with the announcement
of the linux-next tree. This tree, to be maintained by Stephen
Rothwell, is intended to be a gathering point for the patches which are
planned to be merged in the next development cycle. So, since we are
currently in the 2.6.25 cycle, linux-next will accumulate patches for
2.6.26. The idea is to solve the patch integration issues there and reduce
the demands on Andrew Morton's time.
The question which was immediately raised was this: how do we deal with big
API changes which require changes in multiple subsystems? These changes
are already problematic, often requiring maintainers to rework their trees
in the middle of the merge window. Trying to integrate such changes
earlier, in a separate tree, could bring a new set of problems. There will
be a lot of conflicts between patches done before and after the API change,
and somebody is going to have to put the pieces back together again.
Andrew does some of that now, but the problem is big enough that not even
Andrew can solve it all the time. The bidirectional SCSI patches merged
for 2.6.25 were held up as an example; that
change required coordinated SCSI and block layer patches, and it never was
possible to get the whole thing working in -mm.
Arjan van de Ven asserted that the only way
to make large API changes work is to merge them first, at the beginning of
the merge window. The merged patch would fix all in-tree users of the
changed API, as is
the usual rule. Maintainers of all other trees could then merge with the
updated mainline, fixing any new code which might be affected by the API
change. This is, essentially, the approach which was taken for the big
device model changes in 2.6.25; they hit the mainline at the beginning of
the merge window, then everybody else got to adapt to the new way of doing
things.
Greg Kroah-Hartman worries that this approach
is not sufficient, especially when live trees are being merged. If an
API change in one tree forces a change to a separate tree, the coordination
issues just get hard. Keeping the secondary changes in the primary tree
risks conflicts with patches in the proper subsystem tree. Patches which
reach across trees are also, increasingly, being discouraged as making life
harder for everybody. But the fixup patch will not apply to its nominal subsystem
tree as long as the API change itself is not there. In the -mm tree, this
sort of problem is glued together by a series of fixup patches maintained
by Andrew; Greg says that the linux-next tree would need something similar.
David Miller's suggestion was to resolve
this sort of conflict through frequent rebasing of the -next tree.
Rebasing is an operation (supported by git and other code management tools)
which takes a set of patches against one tree and does what's required to
make them apply to a different version of the tree. It can be quite useful
for maintaining patches against a moving target - which kernel trees tend
to be. David talked about how he rebases his (networking subsystem) trees
frequently as a way of eliminating conflicts with the mainline and, in the
process, cleaning some cruft out of the development history.
It turns out, though, that this frequent rebasing is not popular with the
developers who are downstream of David. Rebasing the tree forces all
downstream contributors to do the same thing, and to deal with any merge
conflicts that result. It makes it much harder to prepare trees which can
be pulled upstream and creates extra work.
This was where Linus jumped into the
conversation and expressed his dislike of rebasing. He echoed the
complaints from downstream developers that a constantly-rebased tree is
hard to prepare patches against. It also confuses the development history,
making changes to other developers' patches in silent ways. After
somebody's patch set has been rebased, it is no longer the patches that
were sent. So, says Linus:
    So there's a real reason why we strive to *not* rewrite
    history. Rewriting history silently turns tested code into totally
    untested code, with absolutely no indication left to say that it
    now is untested.
It is about here that Andrew Morton commented that git does not appear to be
matching entirely well with the way that kernel developers work. Some of
the solution may be found in tools more oriented toward the management of
patch queues - such as quilt. There may be a renewed push to get more
quilt-like functionality built into git (along the lines of the stacked git project) in the near
future.
Linus is also not entirely pleased with how
the integration of patches only happens in the mainline:
    I'm also a bit unhappy about the fact you think all merging has to
    go through my tree and has to be visible during the two-week merge
    period. Quite frankly, I think that you guys could - and should -
    just try to sort API changes out more actively against each other,
    and if you can't, then that's a problem too.
His suggestion is that a separate git tree should be created to contain a
large API change - and nothing else. Affected subsystem maintainers could
then merge that tree and develop against the result. In the end, all of
the pieces should merge nicely in the mainline.
This approach raises a number of interesting issues. The API-change tree
has to be agreed upon by everybody, and it must be quite stable - lots of
changes at that level will create downstream trouble. There must also be a
high degree of confidence that this API-change tree will, in fact, get
merged into the mainline; should Linus balk, everybody else's trees will no
longer be applicable to the mainline. Replacing the current "tree of
trees" patch flow with something messier could create a number of
coordination issues. And there are fears that a mainline tree built from
this process would fail to build in many of its intermediate states, which
would make tools like "git bisect" much harder to use. Even so, it could
be part of the long-term solution.
Linus also took the opportunity to complain about large-scale API changes
in general:
    Really. I do agree that we need to fix up bad designs, but I
    disagree violently with the notion that this should be seen as some
    ongoing thing. The API churn should absolutely *not* be seen as a
    constant pain, and if it is (and it clearly is) then I think the
    people involved should start off not by asking "how can we
    synchronize", but looking a bit deeper and saying "what are we
    doing wrong?"
He also stated that the costs of big API
changes are high enough that we should, more often, stay with older
interfaces, even if they are not as good as they could be. Others disagreed, claiming that Linux must continue
to evolve if it is to stay alive and relevant.
The rate of change seems unlikely to fall in the near future. There may be
some changes to how big changes are done, though. As suggested by Ted Ts'o, more changes could be
done by creating entirely new interfaces rather than breaking old ones.
With Ted's scheme, the old interface would be marked "deprecated" at the
beginning of the merge window. Developers would then have the entire
development cycle to adjust to the change, and the deprecated interface
would be removed before the final release.
There is resistance to this approach, based on the observation that getting
rid of deprecated interfaces tends to be harder than one would expect.
But, still, it is a relatively painless way of making changes. The current
transition (in the memory management area) from the nopage() VMA
operation to fault() is an example of how it can work. Nick
Piggin has been slowly changing in-tree users with the eventual goal of
removing nopage() altogether. For now, though, both interfaces
coexist in the tree and nothing has been broken.
Like the kernel itself, its development process is undergoing constant
change and (hopefully) improvement. As the development community and the
rate of change continue to grow, the process will have to adjust
accordingly. What changes come out of this discussion remains to be seen.
But it's worth noting that Andrew Morton fears that the biggest problem - regressions
and bugs - will be relatively unaffected.
By Jonathan Corbet
February 12, 2008
As this is being written, distributors are working quickly to ship kernel
updates fixing the local root vulnerabilities in the
vmsplice()
system call. Unlike a number of other recent vulnerabilities which have
required special situations (such as the presence of specific hardware) to
exploit, these vulnerabilities are trivially exploited and the code to do
so is circulating on the net. Your editor found himself wondering how such
a wide hole could find its way into the core kernel code, so he set himself
the task of figuring out just what was going on - a task which took rather
longer than he had expected.
The splice() system call, remember, is a mechanism for creating
data flow plumbing within the kernel. It can be used to join two file
descriptors; the kernel will then read data from one of those descriptors
and write it to the other in the most efficient way possible. So one can
write a trivial file copy program which opens the source and destination
files, then splices the two together by way of a pipe (one end of every
splice() call must be a pipe). The vmsplice() variant
connects a file descriptor (which must be a pipe) to a region of user
memory; it is in this system call that the problems came to be.
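The file copy idea can be sketched in a few lines of C. This is an illustration under assumptions rather than a reference implementation: splice_copy() is a hypothetical helper name, the 64KB chunk size is an arbitrary choice, and error handling is kept minimal. Since one side of every splice() call must be a pipe, the data moves from the source file into a pipe and from the pipe into the destination.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Copy up to `count` bytes from in_fd to out_fd using splice().
 * Returns the number of bytes copied, or -1 on error.
 * (Hypothetical helper for illustration only.)
 */
long splice_copy(int in_fd, int out_fd, size_t count)
{
    int pipefd[2];
    long total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    if (pipe(pipefd) < 0)
        return -1;

    while (count > 0) {
        /* Stage 1: pull a chunk of the source into the pipe. */
        size_t chunk = count < 65536 ? count : 65536;
        ssize_t in = splice(in_fd, NULL, pipefd[1], NULL, chunk,
                            SPLICE_F_MOVE);
        if (in < 0) {
            total = -1;
            break;
        }
        if (in == 0)
            break;              /* end of input */
        count -= (size_t)in;

        /* Stage 2: drain the pipe into the destination. */
        while (in > 0) {
            ssize_t out = splice(pipefd[0], NULL, out_fd, NULL,
                                 (size_t)in, SPLICE_F_MOVE);
            if (out <= 0) {
                total = -1;
                goto done;
            }
            in -= out;
            total += out;
        }
    }
done:
    close(pipefd[0]);
    close(pipefd[1]);
    return total;
}
```

A caller would open the two files, pass their descriptors along with a byte limit, and check the return value; the data itself never needs to pass through a user-space buffer.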
The first step in understanding this vulnerability is that, in fact, it is
three separate bugs. When the word of this problem first came out, it was
thought to only affect 2.6.23 and 2.6.24 kernels. Changes to the
vmsplice() code had caused the omission of a couple of important
permissions checks. In particular, if the application had requested that
vmsplice() move the contents of a pipe into a range of memory, the
kernel didn't check whether that application had the right to write to that
memory. So the exploit could simply write a code snippet of its choice
into a pipe, then ask the kernel to copy it into a piece of kernel memory.
Think of it as a quick-and-easy rootkit installation mechanism.
If the application is, instead, splicing a memory range into a pipe, the
kernel must, first, read in one or more iovec structures
describing that memory range. The 2.6.23 vmsplice() changes omitted
a check on whether the purported iovec structures were in readable
memory. This looks more like an information disclosure vulnerability than
anything else - though, as we will see, it can be hard to tell sometimes.
These two vulnerabilities (CVE-2008-0009 and CVE-2008-0010) were patched in
the 2.6.23.15 and 2.6.24.1 kernel updates,
released on February 8.
On February 10, Niki Denev pointed out that
the kernel appeared to be still vulnerable after the fix. In fact, the
vulnerability was the result of a different problem - and it is a much worse one, in
that kernels all the way back to 2.6.17 are affected. At this point, a
large proportion of running Linux systems are vulnerable. This one has
been fixed in the 2.6.22.18,
2.6.23.16, and 2.6.24.2 kernels, also released
on the 10th. At this point, with luck, all of these bugs have been firmly
stomped - though, now, we need to see a lot of distributor updates.
The problem, once again, is in the memory-to-pipe implementation. The
function get_iovec_page_array() is charged with finding a set of
struct page pointers corresponding to the array of iovec
structures passed in by the calling application. Those pointers are stored
in this array:
    struct page *pages[PIPE_BUFFERS];
Where PIPE_BUFFERS happens to be 16. In order to avoid
overflowing this array, get_iovec_page_array() does the following
check:
    npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (npages > PIPE_BUFFERS - buffers)
        npages = PIPE_BUFFERS - buffers;
Here, off is the offset into the first page of the memory to be
transferred, len is the length passed in by the application, and
buffers is the current index into the pages array.
Now, if we turn our attention to the exploit code for a
moment, we see it
setting up a number of memory areas with mmap(); some of that
setup is not necessary for the exploit to work, as it turns out. At the
end, the code does this (edited slightly):
    iov.iov_base = map_addr;
    iov.iov_len = ULONG_MAX;
    vmsplice(pi[1], &iov, 1, 0);
The map_addr address points to one of the areas created with
mmap() which, crucially, is significantly more than
PIPE_BUFFERS pages long. And the length is passed through as the
largest possible unsigned long value.
Now let's go back to fs/splice.c, where the vmsplice()
implementation lives. We note that, prior to the fix, the
kernel did not check whether the memory area pointed to by the
iovec structure was readable by the calling process. Once again,
this looks like an information disclosure vulnerability - the process could
cause any bit of kernel memory to be written to the pipe, from which it
could be read. But the exploit code is, in fact, passing in a valid
pointer - it's just the length which is clearly absurd.
Looking back at the code which calculates npages, we see
something interesting:
    npages = (off + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
    if (npages > PIPE_BUFFERS - buffers)
        npages = PIPE_BUFFERS - buffers;
Since len will be ULONG_MAX when the exploit runs, the
addition will cause an integer overflow - with the effect that
npages is calculated to be zero. Which, one would think, would
cause no pages to be examined at all. Except that there is an unfortunate
interaction with another part of the kernel.
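The wraparound is easy to reproduce outside of the kernel. The sketch below repeats the same arithmetic on unsigned long values; the function name pages_spanned() and the SKETCH_ constants are invented for this example, and a 4KB page size is assumed.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SHIFT 12                        /* assumed: 4KB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/*
 * Number of pages touched by `len` bytes starting `off` bytes into a
 * page, computed the same way as the npages expression. Unsigned
 * arithmetic wraps silently when off + len exceeds ULONG_MAX.
 */
unsigned long pages_spanned(unsigned long off, unsigned long len)
{
    assert(off < SKETCH_PAGE_SIZE);   /* off is an offset within a page */
    return (off + len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

For a page-aligned buffer (off of zero), a length of 4096 gives one page as expected, but a length of ULONG_MAX wraps the sum around to 4094, which shifts down to zero pages. The clamp against PIPE_BUFFERS never comes into play because zero is already below the limit.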
Once npages has been calculated, the next line of code looks like
this:
    error = get_user_pages(current, current->mm,
                           (unsigned long) base, npages, 0, 0,
                           &pages[buffers], NULL);
get_user_pages() is the core memory management function used to
pin a set of user-space pages into memory and locate their struct
page pointers. While the npages variable passed as an
argument is an unsigned quantity, the prototype for
get_user_pages() declares it as a simple int called len. And, to
complete the evil, this function processes pages in a
do {} while(); loop which ends thusly:
        len--;
    } while (len && start < vma->vm_end);
So, if get_user_pages() is called with a len argument of
zero, it will pass through the mapping loop once, decrement len to a
negative number, then continue faulting in pages until it hits an address
which lacks a valid mapping. At that point it will stop and return. But,
by then, it may have stored far more entries into the pages array
than the caller had allocated space for.
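The effect of that loop shape can be modeled with a few lines of ordinary C. This is a simplified stand-in, not kernel code: pages_filled() is a hypothetical function, page indexes replace addresses, and the `mapped` parameter plays the role of the vma->vm_end boundary.

```c
#include <assert.h>

/*
 * Model of a do/while loop that tests its counter only after the first
 * pass. Returns how many page slots would be filled when `len` pages
 * are requested from a mapping that is `mapped` pages long.
 */
int pages_filled(int len, int mapped)
{
    int start = 0;      /* index of the next page; stands in for an address */
    int filled = 0;

    assert(mapped > 0);
    do {
        filled++;       /* one struct page pointer would be stored here */
        start++;
        len--;          /* a request for zero pages becomes -1 here ... */
    } while (len && start < mapped);    /* ... and -1 counts as "true" */

    return filled;
}
```

A request for three pages from a 48-page mapping fills three slots, as one would hope. A request for zero pages fills all 48: the counter goes negative on the first pass and never returns to zero, so only the end of the mapping stops the loop.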
The practical result in this case is that get_user_pages() faults
in (and stores struct page pointers for) the entire region mapped
by the exploit code. That region (by design) has more than
PIPE_BUFFERS pages - in fact, it has three times that many, so 48
pointers get stored into a 16-pointer array. And this turns the failure to read-verify
the source array into a buffer overflow vulnerability
within the kernel. Once that is in place, it is a relatively
straightforward exercise for any suitably 31337 hacker to cause the kernel
to jump into the code of his or her choice. Game over. (Update: as
a linux-kernel reader pointed out, the
story is a little more complicated still at this point; this is an unusual
sort of buffer overflow attack).
The fix
which was applied simply checks the address range that the
application is trying to splice into the pipe. Since a range of length
ULONG_MAX is unlikely to be valid, the vulnerability is closed -
as are any potential information disclosure problems.
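The general shape of such a range check can be sketched as follows. The kernel's real check is architecture-specific and is not reproduced here; user_range_ok() and its `limit` parameter are hypothetical names chosen for illustration. The important detail is that the test is arranged so that the check itself cannot overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * True if the range [base, base + len) does not wrap around and ends at
 * or below `limit`, the (assumed) top of the user-space address range.
 */
bool user_range_ok(unsigned long base, unsigned long len,
                   unsigned long limit)
{
    assert(limit > 0);

    /* Compare against limit - len instead of computing base + len,
     * which could wrap and make a bogus range look valid. */
    if (len > limit)
        return false;
    return base <= limit - len;
}
```

Taking 0xC0000000 as an example limit, a 4096-byte range starting at 0x1000 passes, while the same starting address with a length of ULONG_MAX is rejected before any page arithmetic is attempted.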
This vulnerability is a clear example of how a seemingly read-only
vulnerability can be escalated into something rather more severe. It also
shows what can happen when certain types of sloppiness find their way into
the code - if get_user_pages() is asked to get zero pages, that's
how many it should do. Your editor is working on a patch to clean that up
a bit. Meanwhile, everybody should ensure that they are running current
kernels with the vulnerability closed.
Page editor: Jake Edge