Kernel development
Brief items
Kernel release status
The current development kernel is 3.3-rc1, released on January 19; the 3.3 merge window is now closed. "Anyway, it's out now, and I'm taking off early for a weekend of beer, skiing and poker (not necessarily in that order: 'don't drink and ski'). No email." See our merge window summaries (part 1, part 2) for details on the features merged for the 3.3 release.
Stable updates: The 2.6.32.55, 3.0.18, and 3.2.2 stable updates were released on January 25.
Quotes of the week
Kernel development news
A /proc/PID/mem vulnerability
A privilege escalation in the kernel is always a serious threat that leads kernel hackers and distributions to scramble to close the hole quickly. That's exactly what happened after a January 17 report from Jüri Aedla to the closed kernel security mailing list. Most people did not learn of the hole from Aedla's report, since that list is closed, but from Jason Donenfeld (aka zx2c4), who posted a detailed look at the flaw on January 22. The fix was made by Linus Torvalds and went into the mainline on January 17, though with a commit message that obfuscated the security implications, something that didn't sit well with some.
The problem and exploit
The problem itself stems from the removal of the restriction on writes to /proc/PID/mem that was merged for the 2.6.39 kernel. It was part of a patch set that was specifically targeted at allowing debuggers to write to the memory of processes easily via the /proc/PID/mem file. Unfortunately, it left open a hole that Aedla and Donenfeld (at least) were able to exploit.
The posting by Donenfeld is worth a read for those interested in how exploits of this sort are created. The problem starts with the fact that the open() call for /proc/PID/mem does no additional checking beyond the normal VFS permissions before returning a file descriptor. That will prove to be a mistake, and one that Torvalds's fix remedies. Instead of checks at open() time, the code would check in write() and only allow writing if the process being written to is the same as the process doing the writing (i.e. task == current).
That restriction seems like it would make an exploit difficult, but it can be avoided with an exec() and coercing the newly run program to do the writing to itself. That will be dangerous if the newly run program is a setuid executable for example. But there is another test that is meant to block that particular path, by testing that current->self_exec_id has the same value as it did at open() time. self_exec_id is incremented every time that a process does an exec(), so it will definitely be different after executing the setuid binary. But, since it is simply incremented, one can arrange (via fork()) to have a child process with the same self_exec_id as the main process after the setuid exec() is done.
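The counter arithmetic is easier to see in a toy model. The sketch below is a userspace simulation only: struct toy_task, toy_exec(), toy_fork(), and ids_match() are hypothetical names, and the assumption that the value compared at write() time is the opener's self_exec_id as recorded at open() time is a reading of the description above, not a copy of the kernel code.

```c
#include <assert.h>

/* Toy model of the one field that matters here; not the kernel's task_struct. */
struct toy_task {
    unsigned int self_exec_id;
};

/* exec() bumps the counter in place. */
static void toy_exec(struct toy_task *t)
{
    t->self_exec_id++;
}

/* fork() hands the child a copy of the parent's current counter. */
static struct toy_task toy_fork(const struct toy_task *parent)
{
    return *parent;
}

/* The write()-time test: does the writer's id match the one seen at open()? */
static int ids_match(unsigned int id_at_open, const struct toy_task *writer)
{
    return id_at_open == writer->self_exec_id;
}

/*
 * Walk through the sequence from the text: the child execs once before
 * opening the file, the parent execs once (the setuid binary) before
 * writing. Returns nonzero if the comparison is satisfied.
 */
static int toy_check_is_bypassed(void)
{
    struct toy_task parent = { .self_exec_id = 0 };
    struct toy_task child = toy_fork(&parent);
    unsigned int id_at_open;

    assert(child.self_exec_id == parent.self_exec_id);
    toy_exec(&child);                  /* child re-execs itself */
    id_at_open = child.self_exec_id;   /* value recorded at open() */
    toy_exec(&parent);                 /* parent execs the setuid program */
    return ids_match(id_at_open, &parent);
}
```

Because both counters start equal and each is incremented exactly once, the comparison succeeds even though an exec() has happened in between.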
The child with the "correct" self_exec_id value (which it gets by doing an exec()) can then open the parent's /proc/PID/mem file (since there are no extra checks on the open()) and pass the descriptor back to the parent via Unix sockets. The parent then needs to arrange that the setuid executable writes to that file descriptor once a seek() to the proper address has been done. Finding that proper address and getting the binary to write to the fd are the final pieces of the puzzle.
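Passing a descriptor "via Unix sockets" is done with SCM_RIGHTS ancillary data, a standard facility that has nothing specific to do with this bug. The following is a minimal sketch of that mechanism alone, using a pipe rather than a /proc file; demo_fd_passing(), prep_msg(), and union fd_ctl are names invented for the illustration, and send_fd()/recv_fd() merely borrow the names used in the pseudocode later in this article.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ancillary-data buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Fill in a msghdr carrying one data byte plus room for one control message. */
static void prep_msg(struct msghdr *msg, struct iovec *iov, char *byte,
                     union fd_ctl *ctl)
{
    memset(msg, 0, sizeof(*msg));
    memset(ctl, 0, sizeof(*ctl));
    iov->iov_base = byte;
    iov->iov_len = 1;
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    msg->msg_control = ctl->buf;
    msg->msg_controllen = sizeof(ctl->buf);
}

/* Send the open descriptor fd over the Unix socket sock; 0 on success. */
static int send_fd(int sock, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    union fd_ctl ctl;
    char byte = 'F';            /* at least one byte of real data must be sent */
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    prep_msg(&msg, &iov, &byte, &ctl);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock; returns the new fd, or -1 on failure. */
static int recv_fd(int sock)
{
    struct msghdr msg;
    struct iovec iov;
    union fd_ctl ctl;
    char byte;
    struct cmsghdr *cmsg;
    int fd = -1;

    prep_msg(&msg, &iov, &byte, &ctl);
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/*
 * The child creates a pipe the parent has never seen, sends the read end
 * across, then writes to the other end. Returns 0 if the parent can read
 * the child's data through the descriptor it received.
 */
static int demo_fd_passing(void)
{
    int sv[2], status = 0, got, ok;
    char buf[2] = { 0, 0 };
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        int pfd[2];

        close(sv[0]);
        if (pipe(pfd) < 0 || send_fd(sv[1], pfd[0]) < 0)
            _exit(1);
        _exit(write(pfd[1], "ok", 2) == 2 ? 0 : 1);
    }
    close(sv[1]);
    got = recv_fd(sv[0]);
    ok = got >= 0 && read(got, buf, 2) == 2 && memcmp(buf, "ok", 2) == 0;
    waitpid(pid, &status, 0);
    if (got >= 0)
        close(got);
    close(sv[0]);
    return ok && WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

The received descriptor refers to the same open file as the sender's, which is why any permission check performed only at open() time is evaluated against the sender rather than the eventual user.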
Donenfeld's example uses su because it is not compiled as a position-independent executable (PIE) for most distributions, which makes it easier to figure out which address to use. He exploits the fact that su prints an error message when it is passed an unknown username and the error message helpfully prints the username passed. That allows the exploit to pass shellcode (i.e. binary machine language that spawns a shell when executed) as the argument to su.
After printing the error message, su calls the exit() function (really exit@plt), which is what Donenfeld's exploit overwrites. It finds the address of the function using objdump, subtracts the length of the error message that gets printed before the argument, and seeks the file to that location. It uses dup2() to connect stderr to the /proc/PID/mem file descriptor and execs su "shellcode".
In pseudocode, it might look something like this:
if (!child && fork()) { /* child flag set based on -c */
    /* first program invocation, this is parent, wait for fd from child */
    fd = recv_fd(); /* get the fd from the child */
    dup2(2, 15);
    dup2(fd, 2); /* make fd be stderr */
    lseek(fd, offset); /* offset to overwrite location */
    exec("/bin/su", shellcode); /* will have self_exec_id == 1 */
}
else if (!child) {
    /* this is the child from the fork(), exec with child flag */
    exec("thisprog", "-c"); /* this program with -c (child) */
}
else {
    /* child after exec, will have self_exec_id == 1 */
    fd = open("/proc/PPID/mem", O_RDWR); /* open parent PID's mem file */
    send_fd(fd); /* send the fd to the parent */
}
Of course Aedla's proof-of-concept or Donenfeld's exploit code are likely to be even more instructive.
It's obviously a complicated multi-step process, but it is also a completely reliable way to get root privileges. Updates to Donenfeld's post show exploits for distributions like Fedora that do build su as a PIE, or for Gentoo where the read permissions on setuid binaries have been removed so objdump can't be used to find the address of the exit function. For Fedora, gpasswd can be substituted as it is not built as a PIE, while on Gentoo, ptrace() can be used to find the needed address. While it was believed that address space layout randomization (ASLR) for PIEs would make exploitation much more difficult, that proved to be only a small hurdle, at least on 32-bit systems.
The fix and reactions
The fix hit the mainline without any coordination with Linux distributions. Kees Cook, who works on ChromeOS security (and formerly was a member of the Ubuntu security team), told LWN that Red Hat has a person on the closed kernel security mailing list, so it was aware of the problem but did not share that information on the Linux distribution security list. "I've been told this will change in the future, but I'm worried it will be overlooked again", he said. The first indication that other distributions had was likely the request for a CVE that Red Hat's Eugene Teo posted to the oss-security mailing list.
As Cook points out, the abrupt public disclosure of the bug (via a mainline commit) runs counter to the policy described in the kernel's Documentation/SecurityBugs file, where the default policy is to leave roughly seven days between reports to the mailing list and public disclosure to allow time for vendors to fix the problem. Cook is concerned that bugs reported to security@kernel.org are not being handled reasonably:
The "just a bug" refers to statements that Torvalds has made over the years about security bugs being no different than any other kind of bug. In email, Torvalds described it this way:
In keeping with that philosophy, Torvalds does not disclose the security relevance of a fix in the commit message: "I think the whole 'mark this patch as having security implications' is pure and utter garbage". Even if there is a known security problem that is being fixed, his commit messages do not reflect that, as with the message for the /proc/PID/mem fix:
This changes it to do the permission checks at open time, and instead of tracking the process, it tracks the VM at the time of the open. That simplifies the code a lot, but does mean that if you hold the file descriptor open over an execve(), you'll continue to read from the _old_ VM.
Torvalds's commit message stands in pretty stark contrast to Aedla's report to security@kernel.org (linked above):
This "masking" of the actual reason for a commit doesn't sit well with either Cook or Teo (who also responded to an email query). Cook "cannot overstate how much I am against this kind of masking", while Teo pointed out that this particular bug is in no way unique:
Both Teo and Cook were in agreement that disclosing what is known about a fix at the time it is applied can only help distributions and others trying to track kernel development. Torvalds, on the other hand, is concerned about attackers reading commit messages, which could lead to more attacks against Linux systems. He has a well-known contempt for security "clowns" that seems to also factor into his reasoning:
Both the security camps hate me. The full disclosure people think I try to hide things (which is true), while the embargo people think I despise their corrupt arses (which is also true).
The strange thing is that by explicitly not putting the known security implications of a patch into the commit message, Torvalds is treating security bugs differently. They are no longer "just bugs" because some of the details of the bug are being purposely omitted. That may make it difficult for "black hats"—though it would be somewhat surprising if it did—but it definitely makes it more difficult for those who are trying to keep Linux users secure. Worse yet, it makes it more difficult down the road when someone is looking at a commit (or reversion) in isolation because they may miss out on some important context.
Silent security fixes are a hallmark of proprietary software, and Torvalds's policy resembles that to some extent. It could be argued (and presumably would be by Torvalds and others) that the fixes aren't silent since they go into a public repository and that is true—as far as it goes. By deliberately omitting important information about the bug, which is not done for most or all other bugs, perhaps they aren't so much silent as they are "muted" or, sadly, "covered up". There is definitely a lot of validity to Torvalds's complaints about the security "circus", but his reaction to that circus may not be in the best interests of the kernel community either.
The zsmalloc allocator
The kernel cannot be said to lack for memory allocation mechanisms. At the lowest level, "memblock" handles chunks of memory for the rest of the system. The page allocator provides memory to the rest of the kernel in units of whole pages. Much of the kernel uses one of the three slab allocators to get memory blocks in arbitrary sizes, but there is also vmalloc() for situations where large, virtually-contiguous regions are needed. Add in various other specialized allocation functions and other allocators (like CMA) and it starts to seem like a true embarrassment of choices. So what's to be done in this situation? Add another one, of course.
The "zsmalloc" allocator, proposed by Seth Jennings, is aimed at a specific use case. The slab allocators work by packing multiple objects into each page of memory; that works well when the objects are small, but can be problematic as they get larger. In the worst case, if a kernel subsystem routinely needs allocations that are just larger than PAGE_SIZE/2, only one object will fit within a page. Slab allocators can attempt to allocate multiple physically-contiguous pages in order to pack those large objects more efficiently, but, on memory-constrained systems, those allocations can become difficult - or impossible. So, on systems that are already tight on memory, large objects will need to be allocated one-per-page, wasting significant amounts of memory through internal fragmentation.
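The cost of that worst case is simple arithmetic. As a rough sketch, assuming 4096-byte pages and single-page packing (TOY_PAGE_SIZE, objs_per_page(), and wasted_percent() are names made up for this example):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size; it varies by architecture */

/* How many objects of obj_size fit when each page is packed on its own. */
static unsigned int objs_per_page(size_t obj_size)
{
    assert(obj_size > 0 && obj_size <= TOY_PAGE_SIZE);
    return (unsigned int)(TOY_PAGE_SIZE / obj_size);
}

/* Percentage of each page lost to internal fragmentation, rounded down. */
static unsigned int wasted_percent(size_t obj_size)
{
    size_t used = objs_per_page(obj_size) * obj_size;

    return (unsigned int)((TOY_PAGE_SIZE - used) * 100 / TOY_PAGE_SIZE);
}
```

An object of 2048 bytes packs two to a page with nothing lost; at 2049 bytes only one fits and nearly half of every page is wasted.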
The zsmalloc allocator attempts to address this problem by packing objects into a new type of compound page where the component pages are not physically contiguous. The result can be much more efficient memory usage, but with some conditions:
- Code using this allocator must not require physically-contiguous memory,
- Objects must be explicitly mapped before use, and
- Objects can only be accessed in atomic context.
Code using zsmalloc must start by creating an allocation pool to work from:
struct zs_pool *zs_create_pool(const char *name, gfp_t flags);
Where name is the name of the pool, and flags will be used to allocate memory for the pool. It is not entirely clear (to your editor, at least) why multiple pools exist; the zs_pool structure is relatively large, and a pool is really only efficient if the number of objects allocated from it is also large. But that's how the API is designed.
A pool can be released with:
void zs_destroy_pool(struct zs_pool *pool);
A warning (or several warnings) will be generated if there are objects allocated from the pool that have not been freed; those objects will become entirely inaccessible after the pool is gone.
Allocating and freeing memory is done with:
void *zs_malloc(struct zs_pool *pool, size_t size);
void zs_free(struct zs_pool *pool, void *obj);
The return value from zs_malloc() will be a pointer value, or NULL if the object cannot be allocated. It would be a fatal mistake, though, to treat that pointer as if it were actually a pointer; it is actually a magic cookie that represents the allocated memory indirectly. It might have been better to use a non-pointer type, but, again, that is how the API is designed. Getting a pointer that can actually be used is done with:
void *zs_map_object(struct zs_pool *pool, void *handle);
void zs_unmap_object(struct zs_pool *pool, void *handle);
The return value from zs_map_object() will be a kernel virtual address that can be used to access the actual object. The returned address is essentially a per-CPU object, so the calling code will be in atomic context until the object is unmapped with zs_unmap_object(). Note that the handle passed to zs_unmap_object() is the original cookie obtained from zs_malloc(), not the pointer from zs_map_object(). Note also that only one object can be safely mapped at a time on any given CPU.
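The calling pattern can be tried out in user space with a mock. Everything below is a stand-in written for this illustration: the function signatures follow the prototypes shown above, but the bodies, the gfp_t typedef, the slot table, and store_and_check() are assumptions that bear no resemblance to the real kernel implementation. The mock does enforce the two rules that are easy to get wrong: the handle is not a usable pointer, and only one object may be mapped at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MOCK_SLOTS 8
#define MOCK_MAX_SIZE 4096

typedef unsigned int gfp_t;        /* placeholder for the kernel type */

/* Userspace stand-in; the real struct zs_pool looks nothing like this. */
struct zs_pool {
    char *slots[MOCK_SLOTS];
    int mapped;                    /* index of the mapped slot, or -1 */
};

static int slot_of(void *handle)
{
    int i = (int)(uintptr_t)handle - 1;

    assert(i >= 0 && i < MOCK_SLOTS);
    return i;
}

static struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
{
    struct zs_pool *pool = calloc(1, sizeof(*pool));

    (void)name;
    (void)flags;
    if (pool)
        pool->mapped = -1;
    return pool;
}

static void zs_destroy_pool(struct zs_pool *pool)
{
    for (int i = 0; i < MOCK_SLOTS; i++)
        free(pool->slots[i]);
    free(pool);
}

/* Returns an opaque cookie (slot index + 1), never a usable address. */
static void *zs_malloc(struct zs_pool *pool, size_t size)
{
    if (size == 0 || size > MOCK_MAX_SIZE)
        return NULL;
    for (int i = 0; i < MOCK_SLOTS; i++) {
        if (!pool->slots[i]) {
            pool->slots[i] = malloc(size);
            return pool->slots[i] ? (void *)(uintptr_t)(i + 1) : NULL;
        }
    }
    return NULL;
}

static void zs_free(struct zs_pool *pool, void *handle)
{
    int i = slot_of(handle);

    free(pool->slots[i]);
    pool->slots[i] = NULL;
}

static void *zs_map_object(struct zs_pool *pool, void *handle)
{
    assert(pool->mapped == -1);    /* only one mapping at a time */
    pool->mapped = slot_of(handle);
    return pool->slots[pool->mapped];
}

/* Takes the cookie from zs_malloc(), not the pointer from zs_map_object(). */
static void zs_unmap_object(struct zs_pool *pool, void *handle)
{
    assert(pool->mapped == slot_of(handle));
    pool->mapped = -1;
}

/* Allocate, map, fill, unmap, then map again to read back; 0 on success. */
static int store_and_check(struct zs_pool *pool, const char *text)
{
    size_t len = strlen(text) + 1;
    void *handle = zs_malloc(pool, len);
    char *p;
    int same;

    if (!handle)
        return -1;
    p = zs_map_object(pool, handle);
    memcpy(p, text, len);
    zs_unmap_object(pool, handle);

    p = zs_map_object(pool, handle);
    same = strcmp(p, text) == 0;
    zs_unmap_object(pool, handle);

    zs_free(pool, handle);
    return same ? 0 : -1;
}
```

In real kernel code the section between the map and unmap calls runs in atomic context, so it must not sleep; the mock cannot model that constraint.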
Internally, zsmalloc divides allocations by object size much like the slab allocators do, but with a much higher granularity - there are 254 possible allocation sizes all less than PAGE_SIZE. For each size, the code calculates an optimum number of pages (up to 16) that will hold an array of objects of that size with minimal loss to fragmentation. When an allocation is made, a "zspage" is created by allocating the calculated number of individual pages and tying them all together. That tying is done by overloading some fields of struct page in a scary way (that is not a criticism of zsmalloc: any additional meanings overlaid onto the already heavily overloaded page structure are scary):
- The first page of a zspage has the PG_private flag set. The private field points to the second page (if any), while the lru list structure is used to make a list of zspages of the same size.
- Subsequent pages are linked to each other with the lru structure, and are linked back to the first page with the first_page field (which is another name for private, if one looks at the structure definition).
- The last page has the PG_private_2 flag set.
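The per-size calculation of an "optimum number of pages" mentioned above can be approximated with a short search. This is a sketch of one plausible way to do it, not the zsmalloc code itself: it assumes 4096-byte pages, assumes objects are packed back to back, and picks the smallest page count whose leftover space per page is lowest. TOY_PAGE_SIZE, TOY_MAX_PAGES, leftover_bytes(), and best_pages_per_zspage() are hypothetical names.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size */
#define TOY_MAX_PAGES 16u     /* upper bound quoted in the text */

/* Bytes left over when `pages` pages are filled with objects of `size`. */
static unsigned int leftover_bytes(unsigned int size, unsigned int pages)
{
    return (pages * TOY_PAGE_SIZE) % size;
}

/* Smallest page count in 1..16 with the lowest leftover per page. */
static unsigned int best_pages_per_zspage(unsigned int size)
{
    unsigned int best = 1;

    assert(size > 0 && size <= TOY_PAGE_SIZE);
    for (unsigned int n = 2; n <= TOY_MAX_PAGES; n++) {
        /* leftover(n)/n < leftover(best)/best, cross-multiplied to avoid division */
        if (leftover_bytes(size, n) * best < leftover_bytes(size, best) * n)
            best = n;
    }
    return best;
}
```

For 3072-byte objects, one page wastes a quarter of its space, while three pages hold exactly four objects with nothing left over.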
Within a zspage, objects are packed from the beginning, and may cross the boundary between pages. The cookie returned from zs_malloc() is a combination of a pointer to the page structure for the first physical page and the offset of the object within the zspage. Making that object accessible to the rest of the kernel at mapping time is a matter of calculating its location, then either (1) mapping it with kmap_atomic() if the object fits entirely within one physical page, or (2) assigning a pair of virtual addresses if the object crosses a physical page boundary.
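Deciding between the two mapping cases comes down to where the object starts and whether it runs past the end of its page. A minimal sketch, assuming 4096-byte pages and objects identified by their index within the zspage (struct obj_location and locate_object() are invented for this example; the real cookie encodes a page pointer and an offset instead):

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size */

struct obj_location {
    unsigned int page_index;  /* component page in which the object starts */
    unsigned int offset;      /* byte offset within that page */
    int spans_pages;          /* nonzero if the object runs into the next page */
};

/* Locate object number obj_index in a zspage of back-to-back objects. */
static struct obj_location locate_object(unsigned int obj_index, unsigned int size)
{
    struct obj_location loc;
    unsigned int start;

    assert(size > 0 && size <= TOY_PAGE_SIZE);
    start = obj_index * size;
    loc.page_index = start / TOY_PAGE_SIZE;
    loc.offset = start % TOY_PAGE_SIZE;
    loc.spans_pages = loc.offset + size > TOY_PAGE_SIZE;
    return loc;
}
```

With 3072-byte objects in a three-page zspage, the first and fourth objects fit within a single page, while the second and third each straddle a page boundary and would need the two-address mapping.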
The primary users of zsmalloc are the zcache and zram mechanisms, both of which are currently in staging. These subsystems use the transcendent memory abstraction to store compressed copies of pages in memory. Those compressed pages can still be a substantial fraction of the (uncompressed) page size, so the fragmentation issue addressed by zsmalloc can be a real problem. Given the specialized use case and the limitations imposed by zsmalloc, it is not clear that it will find users elsewhere in the kernel, but one never knows.
XFS: the filesystem of the future?
Linux has a lot of filesystems, but two of them (ext4 and btrfs) tend to get most of the attention. In his 2012 linux.conf.au talk, XFS developer Dave Chinner served notice that he thinks more users should be considering XFS. His talk covered work that has been done to resolve the biggest scalability problems in XFS and where he thinks things will go in the future. If he has his way, we will see a lot more XFS around in the coming years.
XFS is often seen as the filesystem for people with massive amounts of data. It serves that role well, Dave said, and it has traditionally performed well for a lot of workloads. Where things have tended to fall down is in the writing of metadata; support for workloads that generate a lot of metadata writes has been a longstanding weak point for the filesystem. In short, metadata writes were slow, and did not really scale past even a single CPU.
How slow? Dave put up some slides showing fs-mark results compared to ext4. XFS was significantly worse (as in half as fast) even on a single CPU; the situation just gets worse up to eight threads, after which ext4 hits a cliff and slows down as well. For I/O-heavy workloads with a lot of metadata changes - unpacking a tarball was given as an example - Dave said that ext4 could be 20-50 times faster than XFS. That is slow enough to indicate the presence of a real problem.
Delayed logging
The problem turned out to be journal I/O; XFS was generating vast amounts of journal traffic in response to metadata changes. In the worst cases, almost all of the actual I/O traffic was for the journal - not the data the user was actually trying to write. Solving this problem took multiple attempts over years, one major algorithm change, and a lot of other significant optimizations and tweaks. One thing that was not required was any sort of on-disk format change - though that may be in the works in the future for other reasons.
Metadata-heavy workloads can end up changing the same directory block many times in a short period; each of those changes generates a record that must be written to the journal. That is the source of the huge journal traffic. The solution to the problem is simple in concept: delay the journal updates and combine changes to the same block into a single entry. Actually implementing this idea in a scalable way took a lot of work over some years, but it is now working; delayed logging will be the only XFS journaling mode supported in the 3.3 kernel.
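The core idea, stripped of everything that makes the real implementation hard, is deduplication by block number before anything is written. The sketch below is a toy model under that assumption; struct pending_log, log_modification(), and log_flush() are hypothetical names and have nothing to do with the actual XFS data structures.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DIRTY 64

/* Hypothetical in-memory list of blocks changed since the last journal write. */
struct pending_log {
    unsigned long blocks[MAX_DIRTY];
    size_t count;           /* distinct dirty blocks: entries that get journaled */
    size_t modifications;   /* total changes seen: entries without coalescing */
};

/* Record a change; repeated changes to one block fold into a single entry. */
static void log_modification(struct pending_log *log, unsigned long block)
{
    log->modifications++;
    for (size_t i = 0; i < log->count; i++)
        if (log->blocks[i] == block)
            return;                     /* already pending */
    assert(log->count < MAX_DIRTY);
    log->blocks[log->count++] = block;
}

/* "Write" the journal: returns the number of entries flushed and resets. */
static size_t log_flush(struct pending_log *log)
{
    size_t written = log->count;

    log->count = 0;
    log->modifications = 0;
    return written;
}
```

Seven changes spread over two directory blocks produce two journal entries rather than seven; the more often a workload rewrites the same block, the larger the saving.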
The actual delayed logging technique was mostly stolen from the ext3 filesystem. Since that algorithm is known to work, a lot less time was required to prove that it would work well for XFS as well. Along with its performance benefits, this change resulted in a net reduction in code. Those wanting details on how it works should find more than they ever wanted in filesystems/xfs-delayed-logging.txt in the kernel documentation tree.
Delayed logging is the big change, but far from the only one. The log space reservation fast path is a very hot path in XFS; it is now lockless, though the slow path still requires a global lock at this point. The asynchronous metadata writeback code was creating badly scattered I/O, reducing performance considerably. Now metadata writeback is delayed and sorted prior to writing out. That means that the filesystem is, in Dave's words, doing the I/O scheduler's work. But the I/O scheduler works with a request queue that is typically limited to 128 entries while the XFS delayed metadata writeback queue can have many thousands of entries, so it makes sense to do the sorting in the filesystem prior to I/O submission. "Active log items" are a mechanism that improves the performance of the (large) sorted log item list by accumulating changes and applying them in batches. Metadata caching has also been moved out of the page cache, which had a tendency to reclaim pages at inopportune times. And so on.
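The benefit of sorting before submission can be illustrated with a crude cost model. This sketch assumes that the total distance between consecutive block numbers is a reasonable proxy for how scattered the I/O is; struct meta_buf, sort_for_writeback(), and seek_distance() are names invented here, not XFS interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a dirty metadata buffer awaiting writeback. */
struct meta_buf {
    unsigned long long blkno;
};

/* qsort() comparator: ascending block number, without overflow-prone subtraction. */
static int cmp_blkno(const void *a, const void *b)
{
    const struct meta_buf *x = a, *y = b;

    return (x->blkno > y->blkno) - (x->blkno < y->blkno);
}

/* Order the whole delayed-write queue by on-disk location. */
static void sort_for_writeback(struct meta_buf *bufs, size_t n)
{
    assert(bufs != NULL || n == 0);
    qsort(bufs, n, sizeof(*bufs), cmp_blkno);
}

/* Sum of the gaps between consecutive writes, as a rough scatter measure. */
static unsigned long long seek_distance(const struct meta_buf *bufs, size_t n)
{
    unsigned long long total = 0;

    for (size_t i = 1; i < n; i++) {
        unsigned long long a = bufs[i - 1].blkno, b = bufs[i].blkno;

        total += a > b ? a - b : b - a;
    }
    return total;
}
```

A queue of many thousands of entries gives the sort far more to work with than a 128-entry request queue could, which is the argument for doing it in the filesystem.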
How the filesystems compare
So how does XFS scale now? For one or two threads, XFS is still slightly slower than ext4, but it scales linearly up to eight threads, while ext4 gets worse, and btrfs gets a lot worse. The scalability constraints for XFS are now to be found in the locking in the virtual filesystem layer core, not in the filesystem-specific code at all. Directory traversal is now faster for even one thread and much faster for eight. These are, he suggested, not the kind of results that the btrfs developers are likely to show people.
Space allocation in XFS is "orders of magnitude" faster than what ext4 offers now. That changes a bit with the "bigalloc" feature added in 3.2, which improves ext4 space allocation scalability by two orders of magnitude if a sufficiently large block size is used. Unfortunately, it also increases small-file space usage by about the same amount, to the point that 160GB are required to hold a kernel tree. Bigalloc does not play well with some other ext4 options and requires complex configuration questions to be answered by the administrator, who must think, when the filesystem is created, about how it will be used over its entire lifetime. Ext4, Dave said, is suffering from architectural deficiencies - using bitmaps for space tracking, in particular - that are typical of a 1980s-era filesystem. It simply cannot scale to truly large filesystems.
Space allocation in Btrfs is even slower than with ext4. Dave said that the problem was primarily in the walking of the free space cache, which is CPU intensive currently. This is not an architectural problem in btrfs, so it should be fixable, but some optimization work will need to be done.
The future of Linux filesystems
Where do things go from here? At this point, metadata performance and scalability in XFS can be considered to be a solved problem. The performance bottleneck is now in the VFS layer, so the next round of work will need to be done there. But the big challenge for the future is in the area of reliability; that may require some significant changes in the XFS filesystem.
Reliability is not just a matter of not losing data - hopefully XFS is already good at that - it is really a scalability issue going forward. It just is not practical to take a petabyte-scale filesystem offline to run a filesystem check and repair tool; that work really needs to be done online in the future. That requires robust failure detection built into the filesystem so that metadata can be validated as correct on the fly. Some other filesystems are implementing validation of data as well, but that is considered to be beyond the scope of XFS; data validation, Dave said, is best done at either the storage array or the application levels.
"Metadata validation" means making the metadata self describing to protect the filesystem against writes that are misdirected by the storage layer. Adding checksums is not sufficient - a checksum only proves that what is there is what was written. Properly self-describing metadata can detect blocks that were written in the wrong place and assist in the reassembly of a badly broken filesystem. It can also prevent the "reiserfs problem," where a filesystem repair tool is confused by stale metadata or metadata found in filesystem images stored in the filesystem being repaired.
Making the metadata self-describing involves a lot of changes. Every metadata block will contain the UUID of the filesystem to which it belongs; there will also be block and inode numbers in each block so the filesystem can verify that the metadata came from the expected place. There will be checksums to detect corrupted metadata blocks and an owner identifier to associate metadata with its owning inode or directory. A reverse-mapping allocation tree will allow the filesystem to quickly identify the file to which any given block belongs.
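How those fields combine to catch different failures can be sketched in a few lines. The layout below is purely illustrative and is not the planned XFS on-disk format: struct meta_header, the status codes, meta_stamp(), meta_verify(), and especially toy_checksum() (a simple multiplicative hash standing in for a real CRC) are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical self-describing header; field choice follows the text above. */
struct meta_header {
    uint8_t  fs_uuid[16];   /* which filesystem this block belongs to */
    uint64_t blkno;         /* where the block was meant to be written */
    uint64_t owner;         /* owning inode or directory */
    uint32_t checksum;      /* covers the other fields and the payload */
};

struct meta_block {
    struct meta_header hdr;
    uint8_t payload[64];
};

enum meta_status { META_OK, META_BAD_CHECKSUM, META_WRONG_FS,
                   META_MISDIRECTED, META_WRONG_OWNER };

static uint32_t mix(uint32_t sum, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len--)
        sum = sum * 31u + *p++;
    return sum;
}

/* Toy stand-in for a CRC; hashes each field separately to skip struct padding. */
static uint32_t toy_checksum(const struct meta_block *b)
{
    uint32_t sum = 0;

    sum = mix(sum, b->hdr.fs_uuid, sizeof(b->hdr.fs_uuid));
    sum = mix(sum, &b->hdr.blkno, sizeof(b->hdr.blkno));
    sum = mix(sum, &b->hdr.owner, sizeof(b->hdr.owner));
    return mix(sum, b->payload, sizeof(b->payload));
}

/* Fill in the header just before the block is written out. */
static void meta_stamp(struct meta_block *b, const uint8_t uuid[16],
                       uint64_t blkno, uint64_t owner)
{
    assert(b != NULL);
    memcpy(b->hdr.fs_uuid, uuid, 16);
    b->hdr.blkno = blkno;
    b->hdr.owner = owner;
    b->hdr.checksum = toy_checksum(b);
}

/* Validate a block that was just read from block number read_from. */
static enum meta_status meta_verify(const struct meta_block *b,
                                    const uint8_t uuid[16],
                                    uint64_t read_from, uint64_t expected_owner)
{
    if (b->hdr.checksum != toy_checksum(b))
        return META_BAD_CHECKSUM;      /* contents were corrupted */
    if (memcmp(b->hdr.fs_uuid, uuid, 16) != 0)
        return META_WRONG_FS;          /* stale or foreign metadata */
    if (b->hdr.blkno != read_from)
        return META_MISDIRECTED;       /* intact, but written to the wrong place */
    if (b->hdr.owner != expected_owner)
        return META_WRONG_OWNER;
    return META_OK;
}
```

The misdirected-write case is the one a checksum alone cannot catch: the block is internally consistent, so only the embedded block number reveals that it is in the wrong place. The UUID check is what would let a repair tool ignore metadata belonging to a filesystem image stored inside the filesystem being repaired.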
Needless to say, the current XFS on-disk format does not provide for the storage of all this extra data. That implies an on-disk format change. The plan, according to Dave, is to not provide any sort of forward or backward format compatibility; the format change will be a true flag day. This is being done to allow complete freedom in designing a new format that will serve XFS users for a long time. While the format is being changed to add the above-described reliability features, the developers will also add space for d_type in the directory structure, NFSv4 version counters, the inode creation time, and, probably, more. The maximum directory size, currently a mere 32GB, will also be increased.
All this will enable a lot of nice things: proactive detection of filesystem corruption, the location and replacement of disconnected blocks, and better online filesystem repair. That means, Dave said, that XFS will remain the best filesystem for large-data applications under Linux for a long time.
What are the implications of all this from a btrfs perspective? Btrfs, Dave said, is clearly not optimized for filesystems with metadata-heavy workloads; there are some serious scalability issues getting in the way. That is only to be expected for a filesystem at such an early stage of development. Some of these problems will take some time to overcome, and the possibility exists that some of them might not be solvable. On the other hand, the reliability features in btrfs are well developed and the filesystem is well placed to handle the storage capabilities expected in the coming few years.
Ext4, instead, suffers from architectural scalability issues. According to Dave's results, it is not the fastest filesystem anymore. There are few plans for reliability improvements, and its on-disk format is showing its age. Ext4 will struggle to support the storage demands of the near future.
Given that, Dave had a question of sorts to end his presentation with. Btrfs will, thanks to its features, soon replace ext4 as the default filesystem in many distributions. Meanwhile, ext4 is being outperformed by XFS on most workloads, including those where it was traditionally stronger. There are scalability problems that show up on even smaller server systems. It is "an aggregation of semi-finished projects" that do not always play well together; ext4, Dave said, is not as stable or well-tested as people think. So, he asked: why do we still need ext4?
One assumes that ext4 developers would have a robust answer to that question, but none were present in the room. So this seems like a discussion that will have to be continued in another setting; it should be interesting to watch.
[ Your editor would like to thank the linux.conf.au organizers for their assistance with his travel to the conference. ]
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Filesystems and block I/O
Memory management
Networking
Security-related
Miscellaneous
Page editor: Jonathan Corbet
