Brief items
The current development kernel is 3.3-rc1, released on January 19; the 3.3
merge window is now closed. "Anyway, it's out now, and I'm taking off early
for a weekend of beer, skiing and poker (not necessarily in that order:
'don't drink and ski'). No email." See our merge window summaries (part 1,
part 2) for details on the features merged for the 3.3 release.
Stable updates: The 2.6.32.55, 3.0.18, and 3.2.2 stable updates were
released on January 25.
This is digressing a bit, but the binary nvidia driver is the best
way that I see that we can support our users with a feature set
compatible to that available to other operating systems. For
technical reasons, we've chosen to leverage a lot of common code
written internally, which allows us to release support for new
hardware and software features much more quickly than if those of
us working on the Linux/FreeBSD/Solaris drivers wrote it all from
scratch. This means that we share a lot with other NVIDIA drivers,
but we for better or worse can't share much infrastructure like
DRI.
--
Robert Morell
For a Linux kernel containing any code I own the code is under the
GNU public license v2 (in some cases or later), I have never given
permission for that code to be used as part of a combined or
derivative work which contains binary chunks. I have never said
that modules are somehow magically outside the GPL and I am
doubtful that in most cases a work containing binary modules for a
Linux kernel is compatible with the licensing, although I accept
there may be some cases that it is.
--
Alan Cox
Kernel development news
By Jake Edge
January 25, 2012
A privilege escalation in the kernel is always a serious threat that
leads kernel hackers and distributions to scramble to close the hole
quickly. That's exactly what happened after a January 17 report from Jüri
Aedla to the closed kernel security mailing list. Most people, though,
learned of the hole not from Aedla (who posted to a closed list) but
from Jason Donenfeld (aka zx2c4), who posted a detailed look at the flaw on
January 22. The fix was made by Linus Torvalds and went
into the mainline on January 17, though with a commit message that
obfuscated the security implications—something that didn't sit well
with some.
The problem and exploit
The problem itself stems from the removal
of the restriction on writes to /proc/PID/mem that was merged for the
2.6.39 kernel. It was part of a patch set
that was specifically targeted at allowing debuggers to write to the memory
of processes easily via the /proc/PID/mem file. Unfortunately, it
left open a hole that Aedla and Donenfeld (at least) were able to exploit.
The posting by Donenfeld is worth a read for those interested in how
exploits of this sort are created. The problem starts with the fact that
the open() call for /proc/PID/mem does no additional
checking beyond the normal VFS permissions before returning a file
descriptor. That will prove to be a mistake, and one that Torvalds's fix
remedies. Instead of checks at open() time, the code would check
in write() and only allow
writing if the process being written to is the same as the process doing
the writing (i.e. task == current).
That restriction seems like it would make an exploit difficult, but it can
be sidestepped by calling exec() and coercing the newly run program into
writing to itself. That is dangerous if, for example, the newly run program
is a setuid executable. But there is another test that is
meant to block that particular path, by testing that
current->self_exec_id has the same value as it did at
open() time. self_exec_id is incremented every time that
a process does an exec(), so it will definitely be different after
executing the setuid binary. But, since it is simply incremented, one can
arrange (via fork()) to have a child process with the same
self_exec_id as the main process after the setuid
exec()
is done.
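The counter arithmetic can be easier to follow in a toy model. The sketch below is plain user-space C, not kernel code: struct task, struct mem_file and the model_* helpers are hypothetical names invented for illustration, and the separate task == current test is deliberately left out. It only shows why a counter that is simply incremented can be lined up again after a fork().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a task; only the counter discussed above is modeled. */
struct task {
    unsigned int self_exec_id;
};

/* What open() on the mem file remembers in this model. */
struct mem_file {
    unsigned int opener_exec_id;
};

/* fork(): the child inherits a copy of the parent's counter. */
struct task model_fork(const struct task *parent)
{
    return *parent;
}

/* exec(): the counter is simply incremented. */
void model_exec(struct task *t)
{
    t->self_exec_id++;
}

/* open(): record the opener's counter; no other checks are made. */
struct mem_file model_open(const struct task *opener)
{
    struct mem_file f = { opener->self_exec_id };
    return f;
}

/* write(): allowed only if the writer's counter matches the recorded one. */
bool model_write_allowed(const struct mem_file *f, const struct task *writer)
{
    return writer->self_exec_id == f->opener_exec_id;
}

/* Walk through the sequence described above. */
void demo_exec_id_match(void)
{
    struct task parent = { 0 };
    struct task child = model_fork(&parent);
    struct mem_file naive = model_open(&parent); /* opened before any exec() */
    struct mem_file smuggled;

    model_exec(&child);              /* child re-executes itself */
    smuggled = model_open(&child);   /* records the child's counter */
    model_exec(&parent);             /* parent executes the setuid binary */

    assert(!model_write_allowed(&naive, &parent));   /* the intended block */
    assert(model_write_allowed(&smuggled, &parent)); /* counters line up */
}
```

A descriptor the parent opened for itself before the exec() is rejected, as intended; one opened by a child that has also done exactly one exec() passes the comparison.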
The child with the "correct" self_exec_id value (which it gets by doing
an exec()) can then
open the parent's /proc/PID/mem file (since there are no extra
checks on the open()) and pass the descriptor back to the parent via
Unix sockets. The parent then needs to arrange that the setuid
executable writes to that file descriptor once a seek() to the
proper address has been done. Finding that proper address and getting the
binary to write to the fd are the final pieces of the puzzle.
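Passing an open descriptor between processes over a Unix-domain socket is a standard facility (SCM_RIGHTS ancillary data), not something specific to this exploit. A minimal sketch of the two helpers follows; the names send_fd() and recv_fd() match the pseudocode in this article, but the two-argument signatures and the one-byte dummy payload are assumptions made here for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd over the connected Unix socket sock; 0 on success. */
int send_fd(int sock, int fd)
{
    char dummy = 'x';                 /* at least one byte of real data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;     /* "here is a descriptor" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns it, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;                        /* a new descriptor in this process */
}
```

The receiver gets a fresh descriptor number referring to the same open file, so any checks made at open() time are not repeated on the receiving side.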
Donenfeld's example uses su
because it is not compiled as a position-independent executable (PIE) for
most distributions, which makes it easier to figure out which address to
use. He exploits the fact that su prints an error message when it
is passed an unknown username and the error message helpfully prints the
username passed. That allows the exploit to pass shellcode (i.e. binary
machine language that spawns a shell when executed) as the argument to
su.
After printing the error message, su calls the exit()
function (really exit@plt), which is what Donenfeld's exploit
overwrites. It finds the
address of the function using objdump, subtracts the length of the
error message that gets printed before the argument, and seeks the file to
that location. It uses dup2() to connect stderr to the
/proc/PID/mem file descriptor and execs
su "shellcode".
In pseudocode, it might look something like this:
    if (!child && fork()) { /* child flag set based on -c */
        /* first program invocation, this is parent, wait for fd from child */
        fd = recv_fd(); /* get the fd from the child */
        dup2(2, 15); /* keep a copy of the original stderr */
        dup2(fd, 2); /* make fd be stderr */
        lseek(fd, offset, SEEK_SET); /* offset to overwrite location */
        exec("/bin/su", shellcode); /* will have self_exec_id == 1 */
    }
    else if (!child) {
        /* this is the child from the fork(), exec with child flag */
        exec("thisprog", "-c"); /* this program with -c (child) */
    }
    else {
        /* child after exec, will have self_exec_id == 1 */
        fd = open("/proc/PPID/mem", O_RDWR); /* open parent PID's mem file */
        send_fd(fd); /* send the fd to the parent */
    }
Of course, Aedla's proof-of-concept and Donenfeld's exploit code are likely
to be even more instructive.
It's obviously a complicated multi-step process, but it is also a completely
reliable way to get root privileges. Updates to Donenfeld's post show
exploits for distributions like Fedora that do build su as a PIE,
or for Gentoo where the read permissions on setuid binaries have been
removed so objdump can't be used to find the address of
the exit function. For Fedora, gpasswd can be substituted as it
is not built as a PIE, while on Gentoo, ptrace() can be used to
find the needed address. While it was believed that address space layout
randomization (ASLR) for PIEs would make exploitation much more difficult, that
proved to be only a small hurdle, at least on 32-bit systems.
The fix and reactions
The fix hit the mainline without any coordination with Linux
distributions. Kees Cook, who works on ChromeOS security (and
formerly was a member of the Ubuntu security team), told LWN that Red Hat has a person on the closed
kernel security mailing list, so it was aware of the problem but did not
share that information on the Linux distribution security list. "I've been told this will
change in the future, but I'm worried it will be overlooked again",
he said. The first indication that
other distributions had was likely from Red Hat's Eugene Teo's request for a CVE on the
oss-security mailing list.
As Cook points out, the abrupt public disclosure of the bug (via a mainline
commit) runs
counter to the policy described in
the kernel's Documentation/SecurityBugs
file, where the default policy is to leave roughly seven days between
reports to the mailing list and public disclosure to allow time for vendors
to fix the problem. Cook is concerned that bugs
reported to security@kernel.org are not being handled reasonably:
The current behavior of security@kernel.org harms end users, harms
distros, and harms security researchers, all while ignoring their own
published standards of notification. I have repeatedly seen the security@kernel.org list hold a
double-standard of "it is urgent to publish this security fix" and
"it's just a bug like any other bug". If it were just a bug, there
should be no problem in delaying publication. If it were an urgent
security fix, all the distros should be notified.
The "just a bug" refers to statements that Torvalds has made over the years
about security bugs being no different than any other kind of bug. In
email, Torvalds
described it this way:
To me, a bug is a bug. Nothing more, nothing less. Some bugs are
critical, but it's not about some random "security" crap - it could be
because it causes a machine to crash, or it could be because it causes
some user application to misbehave.
In keeping with that philosophy, Torvalds does not disclose the security
relevance of a fix in the commit message: "I think the whole 'mark this patch as having security implications' is
pure and utter garbage". Even if there is a known security problem
that is being fixed, his commit
messages do not reflect that, as with the message for the
/proc/PID/mem fix:
Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
robust, and it also doesn't match the permission checking of any of the
other related files.
This changes it to do the permission checks at open time, and instead of
tracking the process, it tracks the VM at the time of the open. That
simplifies the code a lot, but does mean that if you hold the file
descriptor open over an execve(), you'll continue to read from the _old_
VM.
Torvalds's commit message stands in pretty stark contrast to Aedla's report
to security@kernel.org (linked above):
I have found a privilege escalation vulnerability, introduced by making
/proc/<pid>/mem writable. It is possible to open /proc/self/mem as stdout or
stderr before executing a SUID. This leads to SUID writing to it's own memory.
This "masking" of the actual reason for a commit doesn't sit well with
either Cook or Teo (who also responded to an email query). Cook "cannot overstate how much I am
against this kind of masking", while Teo pointed out that this
particular bug is in no way unique:
There are many kernel vulnerabilities that were fixed silently in the
upstream kernel. This is not the first one, nor will be the last one I'm
afraid.
Both Teo and Cook were in agreement that disclosing what is known about a
fix at the time it is applied can only help distributions and others trying
to track kernel development. Torvalds, on the other hand, is concerned
about attackers reading commit messages, which could lead to more attacks
against Linux systems. He has a well-known contempt for security
"clowns" that seems to also factor into his reasoning:
So I just ignore the idiots, and go "fix things asap, but try not to
help black hats". No games, no crap, just get the damn work done and
don't make a circus out of it.
Both the security camps hate me. The full disclosure people think I
try to hide things (which is true), while the embargo people think I
despise their corrupt arses (which is also true).
The strange thing is that by explicitly not putting the known
security implications of a patch into the commit message, Torvalds
is treating security bugs differently. They are no longer "just
bugs" because some of the details of the bug are being purposely omitted.
That may make it difficult for "black hats"—though it would be
somewhat surprising if it did—but it definitely makes it more difficult
for those who are trying to keep Linux users secure. Worse yet, it makes
it more difficult down the road when someone is looking at a commit (or
reversion) in isolation because they may miss out on some important context.
Silent security fixes are a hallmark of proprietary software, and
Torvalds's policy resembles that to some extent. It could be argued (and
presumably would be by Torvalds and others) that the fixes aren't
silent since they go into a public repository and that is
true—as far as it goes. By deliberately omitting important
information about the bug, which is not done for most or all other
bugs, perhaps they aren't so much silent as they are "muted" or, sadly,
"covered up". There is definitely a lot of validity to Torvalds's
complaints about the security "circus", but his reaction to that circus
may not be in the best interests of the kernel community either.
By Jonathan Corbet
January 25, 2012
The kernel cannot be said to lack for memory allocation mechanisms. At the
lowest level, "memblock" handles chunks of memory for the rest of the
system. The page allocator provides memory to the rest of the kernel in
units of whole pages. Much of the kernel uses one of the three slab
allocators to get memory blocks in arbitrary sizes, but there is also
vmalloc() for situations where large, virtually-contiguous regions
are needed. Add in various other specialized allocation functions and
other allocators (like
CMA) and it starts
to seem like a true embarrassment of choices. So what's to be done in this
situation? Add another one, of course.
The "zsmalloc" allocator, proposed by Seth
Jennings, is aimed at a specific use case. The slab allocators work by
packing multiple objects into each page of memory; that works well when the
objects are small, but can be problematic as they get larger. In the worst
case, if a kernel subsystem routinely needs allocations that are just
larger than PAGE_SIZE/2, only one object will fit within a page. Slab
allocators can attempt to allocate multiple physically-contiguous pages in
order to pack those large objects more efficiently, but, on
memory-constrained systems, those allocations can become difficult - or
impossible. So, on systems that are already tight on memory, large objects
will need to be allocated one-per-page, wasting significant amounts of
memory through internal fragmentation.
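The cost is easy to put numbers on. The following sketch assumes a 4096-byte page (MODEL_PAGE_SIZE is a name made up for this example) and computes how many objects fit in a span of pages and how many bytes are left over.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for the example */

/* How many objects of obj_size fit in a span of `pages` pages. */
size_t objects_per_span(size_t obj_size, size_t pages)
{
    assert(obj_size > 0);
    return (pages * MODEL_PAGE_SIZE) / obj_size;
}

/* Bytes left over (internal fragmentation) in that span. */
size_t wasted_bytes(size_t obj_size, size_t pages)
{
    assert(obj_size > 0);
    return (pages * MODEL_PAGE_SIZE) % obj_size;
}
```

For a 2049-byte object (just over half a page), a single page holds one object and wastes 2047 bytes, or about half the page. A four-page span holds seven objects and wastes 2041 bytes, about 12% of the span, but only if four pages can be treated as one unit.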
The zsmalloc allocator attempts to address this problem by packing objects
into a new type of compound page where the component pages are not
physically contiguous. The result can be much more efficient memory usage,
but with some conditions:
- Code using this allocator must not require physically-contiguous
memory,
- Objects must be explicitly mapped before use, and
- Objects can only be accessed in atomic context.
Code using zsmalloc must start by creating an allocation pool to work from:
struct zs_pool *zs_create_pool(const char *name, gfp_t flags);
Where name is the name of the pool, and flags will be
used to allocate memory for the pool. It is not entirely clear (to your
editor, at least) why multiple pools exist; the zs_pool structure
is relatively large, and a pool is really only efficient if the number of
objects allocated from it is also large. But that's how the API is
designed.
A pool can be released with:
void zs_destroy_pool(struct zs_pool *pool);
A warning (or several warnings) will be generated if there are objects
allocated from the pool that have not been freed; those objects will become
entirely inaccessible after the pool is gone.
Allocating and freeing memory is done with:
void *zs_malloc(struct zs_pool *pool, size_t size);
void zs_free(struct zs_pool *pool, void *obj);
The return value from zs_malloc() will be a pointer value, or NULL
if the object cannot be allocated. It would be a fatal mistake, though, to
treat that pointer as if it were actually a pointer; it is actually a magic
cookie that represents the allocated memory indirectly. It might have been
better to use a non-pointer type, but, again, that is how the API is
designed. Getting a pointer that can actually be used is done with:
void *zs_map_object(struct zs_pool *pool, void *handle);
void zs_unmap_object(struct zs_pool *pool, void *handle);
The return value from zs_map_object() will be a kernel virtual address that
can be used to access the actual object. The return address is essentially
a per-CPU object, so the calling code will be in
atomic context until the object is unmapped with zs_unmap_object().
Note that the handle passed to zs_unmap_object() is the
original cookie obtained from zs_malloc(), not the pointer from
zs_map_object(). Note also that only one object can be safely
mapped at a time on any given CPU.
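Put together, the calling pattern might look like the sketch below. The six zs_*() prototypes are the ones quoted above; everything else is an assumption. In particular, the function bodies here are trivial malloc()-backed user-space stand-ins (so the example can be compiled outside the kernel) and have nothing to do with how zsmalloc really works, gfp_t and GFP_KERNEL are placeholder definitions, and store_blob() and load_blob() are hypothetical helpers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int gfp_t;     /* placeholder for the kernel type */
#define GFP_KERNEL 0u           /* placeholder value */

/* ---- user-space stand-ins for the API; NOT the real allocator ---- */
struct zs_pool { const char *name; };
struct hidden { void *data; };  /* the "cookie" points here, not at data */

struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
{
    struct zs_pool *pool = malloc(sizeof(*pool));
    (void)flags;
    if (pool)
        pool->name = name;
    return pool;
}
void zs_destroy_pool(struct zs_pool *pool) { free(pool); }
void *zs_malloc(struct zs_pool *pool, size_t size)
{
    struct hidden *h = malloc(sizeof(*h));
    (void)pool;
    if (h && !(h->data = malloc(size))) {
        free(h);
        h = NULL;
    }
    return h;
}
void zs_free(struct zs_pool *pool, void *obj)
{
    struct hidden *h = obj;
    (void)pool;
    free(h->data);
    free(h);
}
void *zs_map_object(struct zs_pool *pool, void *handle)
{
    (void)pool;
    return ((struct hidden *)handle)->data;
}
void zs_unmap_object(struct zs_pool *pool, void *handle)
{
    (void)pool;
    (void)handle;
}

/* ---- the calling pattern described in the text ---- */

/* Copy len bytes into a new object; returns the cookie or NULL. */
void *store_blob(struct zs_pool *pool, const void *src, size_t len)
{
    void *handle, *dst;

    assert(pool != NULL && src != NULL);
    handle = zs_malloc(pool, len);
    if (handle == NULL)
        return NULL;
    dst = zs_map_object(pool, handle);  /* in the kernel: atomic from here */
    memcpy(dst, src, len);
    zs_unmap_object(pool, handle);      /* pass the cookie, not dst */
    return handle;
}

/* Copy len bytes back out of the object named by handle. */
void load_blob(struct zs_pool *pool, void *handle, void *dst, size_t len)
{
    void *src = zs_map_object(pool, handle);
    memcpy(dst, src, len);
    zs_unmap_object(pool, handle);
}
```

The point of the pattern is that the value returned by zs_malloc() is only ever handed back to the allocator; all access to the data goes through a short map/copy/unmap window.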
Internally, zsmalloc divides allocations by object size much like the slab
allocators do, but with a much higher granularity - there are 254 possible
allocation sizes all less than PAGE_SIZE. For each size, the code
calculates an optimum number of pages (up to 16) that will hold an array of
objects of that size with minimal loss to fragmentation. When an
allocation is made, a "zspage" is created by allocating the calculated
number of individual pages and tying them all together. That tying is done
by overloading some fields of struct page in a scary way (that is
not a criticism of zsmalloc: any additional meanings overlaid onto
the already heavily overloaded page structure are scary):
- The first page of a zspage has the PG_private flag set. The
private field points to the second page (if any), while the
lru list structure is used to make a list of zspages of the
same size.
- Subsequent pages are linked to each other with the lru
structure, and are linked back to the first page with the
first_page field (which is another name for private,
if one looks at the structure definition).
- The last page has the PG_private_2 flag set.
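The per-size calculation of how many pages to tie together can be sketched as a small search. This is one plausible way to pick the count, written for illustration under the assumption of 4096-byte pages; it is not taken from the zsmalloc patch, and pages_per_zspage() and the MODEL_* constants are names invented here.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size */
#define MODEL_MAX_PAGES 16u     /* upper bound mentioned in the text */

/*
 * Pick the number of pages (1..16) whose combined size leaves the
 * smallest fraction unused when filled with objects of obj_size.
 * Ties go to the smaller page count.
 */
unsigned int pages_per_zspage(size_t obj_size)
{
    unsigned int best_pages = 1, n;
    size_t best_span = MODEL_PAGE_SIZE;
    size_t best_waste;

    assert(obj_size > 0 && obj_size <= MODEL_PAGE_SIZE);
    best_waste = best_span % obj_size;

    for (n = 2; n <= MODEL_MAX_PAGES; n++) {
        size_t span = (size_t)n * MODEL_PAGE_SIZE;
        size_t waste = span % obj_size;

        /* waste/span < best_waste/best_span, cross-multiplied */
        if (waste * best_span < best_waste * span) {
            best_pages = n;
            best_span = span;
            best_waste = waste;
        }
    }
    return best_pages;
}
```

With these assumptions a 2048-byte object needs only one page, a 3072-byte object is best served by three pages (four objects, nothing wasted), and an awkward 2049-byte object pushes the search all the way to sixteen pages.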
Within a zspage, objects are packed from the beginning, and may cross the
boundary between pages. The cookie returned from zs_malloc() is a
combination of a pointer to the page structure for the first
physical page and the offset of the object within the zspage. Making that
object accessible to the rest of the kernel at mapping time is a matter of
calculating its location, then either (1) mapping it with
kmap_atomic() if the object fits entirely within one physical
page, or (2) assigning a pair of virtual addresses if the object
crosses a physical page boundary.
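The "calculating its location" step is simple arithmetic once the object's index within the zspage is known. The sketch below is a user-space illustration with an assumed 4096-byte page; struct obj_location and locate_object() are hypothetical names, and the real code works from the cookie rather than from a bare index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size */

struct obj_location {
    size_t page_index;   /* which component page of the zspage */
    size_t page_offset;  /* byte offset inside that page */
    bool   spans_pages;  /* true if the object runs into the next page */
};

/* Locate the obj_index'th object of obj_size bytes, packed from offset 0. */
struct obj_location locate_object(size_t obj_index, size_t obj_size)
{
    struct obj_location loc;
    size_t byte_off;

    assert(obj_size > 0 && obj_size <= MODEL_PAGE_SIZE);
    byte_off = obj_index * obj_size;
    loc.page_index = byte_off / MODEL_PAGE_SIZE;
    loc.page_offset = byte_off % MODEL_PAGE_SIZE;
    loc.spans_pages = loc.page_offset + obj_size > MODEL_PAGE_SIZE;
    return loc;
}
```

For 3072-byte objects, object 0 fits in the first page, objects 1 and 2 each straddle a page boundary (the two-address case), and object 3 lands at offset 1024 of the third page and ends exactly at its last byte.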
The primary users of zsmalloc are the zcache and zram mechanisms, both of which are
currently in staging. These subsystems use the transcendent memory abstraction to store
compressed copies of pages in memory. Those compressed pages can still be
a substantial fraction of the (uncompressed) page size, so the fragmentation
issue addressed by zsmalloc can be a real problem. Given the specialized
use case and the limitations imposed by zsmalloc, it is not clear that it
will find users elsewhere in the kernel, but one never knows.
By Jonathan Corbet
January 20, 2012
Linux has a lot of filesystems, but two of them (ext4
and btrfs) tend to get most of the attention. In his 2012 linux.conf.au
talk, XFS developer Dave Chinner served notice that he thinks more users
should be considering XFS. His talk covered work that has been done to
resolve the biggest scalability problems in XFS and where he thinks things
will go in the future. If he has his way, we will see a lot more XFS
around in the coming years.
XFS is often seen as the filesystem for people with massive amounts of
data. It serves that role well, Dave said, and it has traditionally
performed well for a
lot of workloads. Where things have tended to fall down is in the
writing of metadata; support for workloads that generate a lot of metadata
writes has been a longstanding weak point for the filesystem. In short,
metadata writes were slow, and did not really scale past even a single
CPU.
How slow? Dave put up some slides showing fs-mark results compared to
ext4. XFS was significantly worse (as in half as fast) even on a
single CPU; the situation just gets worse up to eight threads, after which
ext4 hits a cliff and slows down as well. For I/O-heavy workloads with a
lot of metadata changes - unpacking a tarball was given as an example -
Dave said that ext4 could be 20-50 times faster than XFS. That is slow
enough to indicate the presence of a real problem.
Delayed logging
The problem turned out to be journal I/O; XFS was generating vast amounts
of journal traffic in response to metadata changes. In the worst cases,
almost all of the actual I/O traffic was for the journal - not the data the
user was actually trying to write. Solving this problem took multiple
attempts over years, one major algorithm change, and a lot of other
significant optimizations and tweaks. One thing that was not
required was any sort of on-disk format change - though that may be in the
works in the future for other reasons.
Metadata-heavy workloads can end up changing the same directory block many
times in a short period; each of those changes generates a record that must
be written to the journal. That is the source of the huge journal
traffic. The solution to the problem is simple in concept: delay the
journal updates and combine changes to the same block into a single entry.
Actually implementing this idea in a scalable way took a lot of work over
some years, but it is now working; delayed logging will be the only
XFS journaling mode supported in the 3.3 kernel.
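The effect of combining changes can be shown with a toy count. The sketch below is not XFS code; it simply assumes that, without delayed logging, every modification produces one journal record, while with it each distinct block modified between flushes produces one. The function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* One journal record per modification: the old behavior in this model. */
size_t records_immediate(const unsigned long *blocks, size_t n)
{
    (void)blocks;
    return n;
}

/* One journal record per distinct block: changes to a block are merged. */
size_t records_delayed(const unsigned long *blocks, size_t n)
{
    size_t distinct = 0, i, j;

    assert(blocks != NULL || n == 0);
    for (i = 0; i < n; i++) {
        for (j = 0; j < i; j++)
            if (blocks[j] == blocks[i])
                break;
        if (j == i)          /* first time this block appears */
            distinct++;
    }
    return distinct;
}
```

Six modifications that touch only two directory blocks (say blocks 7, 7, 7, 12, 7, 12) cost six records in the first model and two in the second; the gap widens as the same block is hit more often.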
The actual delayed logging technique was mostly stolen from the ext3
filesystem. Since that algorithm is known to work, a lot less time was
required to prove that it would work well for XFS as well. Along with its
performance benefits, this change resulted in a net reduction in code.
Those wanting details on how it works should find more than they ever
wanted in filesystems/xfs-delayed-logging.txt in the
kernel documentation tree.
Delayed logging is the big change, but far from the only one. The log
space reservation fast path is a very hot path in XFS; it is now lockless, though
the slow path still requires a global lock at this point. The asynchronous
metadata writeback code was creating badly scattered I/O, reducing
performance considerably. Now metadata writeback is delayed and sorted
prior to writing out. That means that the filesystem is, in Dave's words,
doing the I/O scheduler's work. But the I/O scheduler works with a request
queue that is typically limited to 128 entries while the XFS delayed
metadata writeback queue can have many thousands of entries, so it makes
sense to do the sorting in the filesystem prior to I/O submission. "Active log
items" are a mechanism that improves the performance of the (large) sorted log item list by
accumulating changes and applying them in batches. Metadata
caching has also been moved out of the page cache, which had a tendency to
reclaim pages at inopportune times. And so on.
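Why sorting before submission helps can be seen with a crude model in which the cost of a batch is the total distance between consecutively written block numbers. That cost model, and the function names below, are assumptions made for this illustration only; the sort itself is just the C library's qsort().

```c
#include <assert.h>
#include <stdlib.h>

/* Order block numbers ascending without risking subtraction overflow. */
static int cmp_blkno(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Sort the pending writeback queue by on-disk block number. */
void sort_writeback_queue(unsigned long *blknos, size_t n)
{
    assert(blknos != NULL || n == 0);
    if (n > 1)
        qsort(blknos, n, sizeof(*blknos), cmp_blkno);
}

/* Crude cost: sum of distances between consecutive writes. */
unsigned long total_travel(const unsigned long *blknos, size_t n)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 1; i < n; i++)
        sum += blknos[i] > blknos[i - 1] ? blknos[i] - blknos[i - 1]
                                         : blknos[i - 1] - blknos[i];
    return sum;
}
```

A queue of blocks 900, 10, 905, 12 has a travel cost of 2678 in arrival order and 895 once sorted. With thousands of entries to choose from, the filesystem can produce far longer sorted runs than a 128-entry request queue can.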
How the filesystems compare
So how does XFS scale now? For one or two threads, XFS is still slightly
slower than ext4, but it scales linearly up to eight threads, while ext4
gets worse, and btrfs gets a lot worse. The scalability constraints for
XFS are now to be found in the locking in the virtual filesystem layer
core, not in the filesystem-specific code at all. Directory traversal is
now faster for even one thread and much faster for eight. These are, he
suggested, not the kind of results that the btrfs developers are likely to show
people.
The scalability of space allocation is "orders of magnitude" better than what
ext4 offers now. That changes a bit with the "bigalloc" feature
added in 3.2, which improves ext4 space allocation scalability by two
orders of magnitude if a sufficiently large block size is used.
Unfortunately, it also increases small-file space
usage by about the same amount, to the point that 160GB are required to
hold a kernel tree. Bigalloc does not play well with some other ext4
options and requires complex configuration questions to be
answered by the administrator, who must think about how the filesystem will
be used over its entire lifetime when the filesystem is created. Ext4,
Dave said, is suffering from architectural deficiencies - using bitmaps for
space tracking, in particular - that are typical of a 1980s-era
filesystem. It simply cannot scale to truly large filesystems.
Space allocation in Btrfs is even slower than with ext4. Dave said that
the problem was primarily in the walking of the free space cache, which is
CPU intensive currently. This is not an architectural problem in btrfs, so
it should be fixable, but some optimization work will need to be done.
The future of Linux filesystems
Where do things go from here? At this point, metadata performance and
scalability in XFS can be considered to be a solved problem. The
performance bottleneck is now in the VFS layer, so the next round of work
will need to be done there. But the big challenge for the future is in the
area of reliability; that may require some significant changes in the XFS
filesystem.
Reliability is not just a matter of not losing data - hopefully XFS is
already good at that - it is really a scalability issue going forward. It
just is not practical to take a petabyte-scale filesystem offline to run a
filesystem check and repair tool; that work really needs to be done online
in the future. That requires robust failure detection built into the
filesystem so that metadata can be validated as correct on the fly. Some
other filesystems are implementing validation of data as well, but that is considered to
be beyond the scope of XFS; data validation, Dave said, is best done at
either the storage array or the application levels.
"Metadata validation" means making the metadata self describing to protect
the filesystem against writes that are misdirected by the storage layer.
Adding checksums is not sufficient - a checksum only proves that what is
there is what was written. Properly self-describing metadata can detect
blocks that were written in the wrong place and assist in the reassembly of
a badly broken filesystem. It can also prevent the "reiserfs problem,"
where a filesystem repair tool is confused by stale metadata or metadata
found in filesystem images stored in the filesystem being repaired.
Making the metadata self-describing involves a lot of changes. Every
metadata block will contain the UUID of the filesystem to which it belongs;
there will also be block and inode numbers in each block so the filesystem
can verify that the metadata came from the expected place. There will be
checksums to detect corrupted metadata blocks and an owner identifier to
associate metadata with its owning inode or directory. A reverse-mapping
allocation tree will allow the filesystem to quickly identify the file to
which any given block belongs.
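A sketch may help show why the location and UUID fields catch problems that a checksum alone cannot. Nothing below reflects the planned XFS on-disk format: the struct layout and function names are invented, and the checksum is a toy FNV-1a hash standing in for whatever the developers choose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct meta_hdr {               /* hypothetical self-describing header */
    uint8_t  fs_uuid[16];       /* filesystem this block belongs to */
    uint64_t blkno;             /* where the block is supposed to live */
    uint64_t owner;             /* owning inode or directory */
    uint32_t checksum;          /* covers the fields above plus payload */
};

enum meta_status { META_OK, META_WRONG_FS, META_MISPLACED, META_CORRUPT };

/* Toy FNV-1a hash, used only as a stand-in checksum. */
static uint32_t fnv1a(const void *buf, size_t len, uint32_t h)
{
    const uint8_t *p = buf;
    while (len--) {
        h ^= *p++;
        h *= 16777619u;
    }
    return h;
}

static uint32_t meta_checksum(const struct meta_hdr *h,
                              const void *payload, size_t len)
{
    uint32_t sum = fnv1a(h->fs_uuid, sizeof(h->fs_uuid), 2166136261u);
    sum = fnv1a(&h->blkno, sizeof(h->blkno), sum);
    sum = fnv1a(&h->owner, sizeof(h->owner), sum);
    return fnv1a(payload, len, sum);
}

/* Fill in the header for a block about to be written at blkno. */
void meta_stamp(struct meta_hdr *h, const uint8_t uuid[16], uint64_t blkno,
                uint64_t owner, const void *payload, size_t len)
{
    memcpy(h->fs_uuid, uuid, 16);
    h->blkno = blkno;
    h->owner = owner;
    h->checksum = meta_checksum(h, payload, len);
}

/* Validate a block that was just read from read_blkno. */
enum meta_status meta_verify(const struct meta_hdr *h, const void *payload,
                             size_t len, const uint8_t uuid[16],
                             uint64_t read_blkno)
{
    if (memcmp(h->fs_uuid, uuid, 16) != 0)
        return META_WRONG_FS;   /* stale, or from an embedded image */
    if (h->blkno != read_blkno)
        return META_MISPLACED;  /* intact, but written to the wrong place */
    if (h->checksum != meta_checksum(h, payload, len))
        return META_CORRUPT;
    return META_OK;
}
```

A block stamped for location 100 but found at location 200 still carries a perfectly valid checksum; only the embedded block number reveals that the write was misdirected.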
Needless to say, the current XFS on-disk format does not provide for the
storage of all this extra data. That implies an on-disk format change.
The plan, according to Dave, is to not provide any sort of forward or
backward format compatibility; the format change will be a true flag day.
This is being done to allow complete freedom in designing a new format that
will serve XFS users for a long time. While the format is being changed to
add the above-described reliability features, the developers will also add
space for d_type in the directory structure, NFSv4 version
counters, the inode creation time, and, probably, more. The maximum
directory size, currently a mere 32GB, will also be increased.
All this will enable a lot of nice things: proactive detection of
filesystem corruption, the location and replacement of disconnected blocks,
and better online filesystem repair. That means, Dave said, that XFS will
remain the best filesystem for large-data applications under Linux
for a long time.
What are the implications of all this from a btrfs perspective? Btrfs,
Dave said, is clearly not optimized for filesystems with metadata-heavy
workloads; there are some serious scalability issues getting in the way.
That is only to be
expected for a filesystem at such an early stage of development. Some of
these problems will take some time to overcome, and the possibility exists
that some of them might not be solvable. On the other hand, the
reliability features in btrfs are well developed and the filesystem is well
placed to handle the storage capabilities expected in the coming few years.
Ext4, instead, suffers from architectural scalability issues. According to
Dave's results, it is not the fastest filesystem anymore. There are few
plans for reliability improvements, and its on-disk format is showing its
age. Ext4 will struggle to support the storage demands of the near
future.
Given that, Dave had a question of sorts to end his presentation with.
Btrfs will, thanks to its features, soon replace ext4 as the default
filesystem in many distributions. Meanwhile, ext4 is being outperformed by XFS on most
workloads, including those where it was traditionally stronger. There are
scalability problems that show up on even smaller server systems. It is
"an aggregation of semi-finished projects" that do not always play well
together; ext4, Dave said, is not as stable or well-tested as people
think. So, he asked: why do we still need ext4?
One assumes that ext4 developers would have a robust answer to that
question, but none were present in the room. So this seems like a
discussion that will have to be continued in another setting; it should be
interesting to watch.
[ Your editor would like to thank the linux.conf.au organizers for their
assistance with his travel to the conference. ]
Page editor: Jonathan Corbet