Kernel development
Brief items
Kernel release status
The current development kernel is 3.13-rc3, released on December 6. Linus said: "I'm still on a Friday release schedule, although I hope that changes soon - the reason I didn't drag this one out to Sunday is that it's already big enough, and I'll wait until things start calming down. Which they really should, at this point. Hint hint."
Stable updates: 3.12.4, 3.10.23, and 3.4.73 were released on December 8. As of this writing, the 3.12.5, 3.10.24, and 3.4.74 updates are in the review process; they can be expected sometime on or after December 12.
Quotes of the week
Kernel development news
The Btrfs filesystem: An introduction
The Btrfs filesystem has been through almost every part of the hype cycle at one point or another in its short history. For a time, it was the next-generation filesystem that was going to solve many of our problems; distributors were racing to see who could be the first to ship it by default. Then it became clear that all those longtime filesystem developers weren't totally out to lunch when they warned that it takes many years to get a filesystem to a point where it can be trusted with important data. At this point, it is possible to go to a conference and hear almost nothing positive about Btrfs; disillusionment appears to have set in. By the time Btrfs is truly ready, some seem to think, it will be thoroughly obsolete.

The truth may not be quite so grim. Development on Btrfs continues, with a strong emphasis on stability and performance. Problems are getting fixed, more users are beginning to play with it, and openSUSE considered the idea of using it by default back in September. Your editor's sense is that the situation may be bottoming out, and that we may, slowly, be heading into a new phase where Btrfs takes its place — still slowly — as one of the key Linux filesystems.
This article is intended to be the first in a series for users interested in experimenting with and evaluating the Btrfs filesystem. We'll start with the basics of the design of the filesystem and how it is being developed; that will be followed by a detailed look at specific Btrfs features. One thing that will not appear in this series, though, is benchmark results; experience says that filesystem benchmarking is hard to do properly, and the results are highly workload- and hardware-dependent. Poor-quality results would not be helpful to anybody, so your editor will simply not try.
What makes Btrfs different?
Not that long ago, Linux users were still working with filesystems that had evolved little since the Unix days. The ext3 filesystem, for example, was still using block pointers: each file's inode (the central data structure holding all the information about the file) contained a list of pointers to each individual block holding the file's data. That design worked well enough when files were small, but it scales poorly: a 1GB file would require 256K individual block pointers. More recent filesystems (including ext4) use pointers to "extents" instead; each extent is a group of contiguous blocks. Since filesystems work to store data contiguously anyway, extent-based storage greatly reduces the overhead of managing a file's space.
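As a rough illustration, the difference looks something like the structures below. These are purely hypothetical sketches; neither matches the actual ext3 nor Btrfs on-disk format, but they show why one extent record can replace a long run of individual block pointers.

    /*
     * Purely illustrative structures; these are not the real ext3 or
     * Btrfs on-disk formats.
     */
    #include <stdint.h>

    #define BLOCK_SIZE      4096
    #define BLOCKS_PER_GB   ((1024 * 1024 * 1024) / BLOCK_SIZE)  /* 262144 */

    /* Block-pointer style: one pointer for every data block in the file. */
    struct blockptr_inode {
        uint64_t size;
        uint64_t block[BLOCKS_PER_GB];   /* a 1GB file needs 256K pointers */
    };

    /* Extent style: one record describes a long run of contiguous blocks. */
    struct extent {
        uint64_t logical_offset;   /* where this run starts within the file */
        uint64_t physical_block;   /* where the data starts on the device */
        uint32_t length;           /* number of contiguous blocks */
    };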
Naturally, Btrfs uses extents as well. But it differs from most other Linux filesystems in a significant way: it is a "copy-on-write" (or "COW") filesystem. When data is overwritten in an ext4 filesystem, the new data is written on top of the existing data on the storage device, destroying the old copy. Btrfs, instead, will move overwritten blocks elsewhere in the filesystem and write the new data there, leaving the older copy of the data in place.
The COW mode of operation brings some significant advantages. Since old data is not overwritten, recovery from crashes and power failures should be more straightforward; if a transaction has not completed, the previous state of the data (and metadata) will be where it always was. So, among other things, a COW filesystem does not need to implement a separate journal to provide crash resistance.
Copy-on-write also enables some interesting new features, the most notable of which is snapshots. A snapshot is a virtual copy of the filesystem's contents; it can be created without copying any of the data at all. If, at some later point, a block of data is changed (in either the snapshot or the original), that one block is copied while all of the unchanged data remains shared. Snapshots can be used to provide a sort of "time machine" functionality, or to simply roll back the system after a failed update.
Another important Btrfs feature is its built-in volume manager. A Btrfs filesystem can span multiple physical devices in a number of RAID configurations. Any given volume (collection of one or more physical drives) can also be split into "subvolumes," which can be thought of as independent filesystems sharing a single physical volume set. So Btrfs makes it possible to group part or all of a system's storage into a big pool, then share that pool among a set of filesystems, each with its own usage limits.
Btrfs offers a wide range of other features not supported by other Linux filesystems. It can perform full checksumming of both data and metadata, making it robust in the face of data corruption by the hardware. Full checksumming is expensive, though, so it remains likely to be used in only a minority of installations. Data can be stored on-disk in compressed form. The send/receive feature can be used as part of an incremental backup scheme, among other things. The online defragmentation mechanism can fix up fragmented files in a running filesystem. The 3.12 kernel saw the addition of an offline de-duplication feature; it scans for blocks containing duplicated data and collapses them down to a single, shared copy. And so on.
It is worth noting that the copy-on-write approach is not without its costs. Obviously, some sort of garbage collection is required or all those block copies will quickly eat up all of the available space on the filesystem. Copying blocks can take more time than simply overwriting them as well as significantly increasing the filesystem's memory requirements. COW operations will also have a tendency to fragment files, wrecking the nice, contiguous layout that the filesystem code put so much effort into creating. Fragmentation hurts less with solid-state devices than on rotational storage, but, even in the former case, fragmented files will not be as quick to access.
So all this shiny new Btrfs functionality does not come for free. In many settings, administrators may well decide that the costs associated with Btrfs outweigh the benefits; those sites will stick with filesystems like ext4 or XFS. For others, though, the flexibility and feature set provided with Btrfs are likely to be quite appealing. Once it is generally accepted that Btrfs is ready for real-world use, chances are it will start popping up on a lot of systems.
Development
One concern your editor has heard in conference hallways is that the pace of Btrfs development has slowed. For the curious, here's the changeset count history for the Btrfs code in the kernel, grouped into approximately one-year periods:
    Year                 Changesets   Developers
    2008 (2.6.25-29)        913           42
    2009 (2.6.30-33)        279           45
    2010 (2.6.34-37)        193           33
    2011 (2.6.38-3.2)       610           67
    2012 (3.3-8)            773           63
    2013 (3.9-13)           671           68
These numbers, on their own, do not demonstrate a slowing of development; there was an apparent slow period in 2010, but the number of changesets and the number of developers contributing them have held steady thereafter. That said, there are a couple of things to bear in mind when looking at those numbers. One is that the early work involved the addition of features to a brand-new filesystem, while work in 2013 is almost entirely fixes. So the size of the changes has shrunk considerably, but one could easily argue that things should be just that way.
The other relevant point is that contributions by Btrfs creator Chris Mason have clearly fallen in recent years. Partly that is because he has been working on the user-space btrfs-progs code — work which is not reflected in the above, kernel-side-only numbers — but it also seems clear that he has been busy with other work-related issues. It will be interesting to see how things change now that Chris and prolific Btrfs contributor Josef Bacik have found a new home at Facebook.
In summary, the amount of new code going into Btrfs has clearly fallen in recent years, but that will be seen as good news by anybody hoping for a stable filesystem anytime soon. There is still some significant effort going into this filesystem, and chances are good that developer attention will increase as distributors look more closely at using Btrfs by default.
What's next
All told, Btrfs still looks interesting, and it seems like the right time to take a closer look at what is still the next generation Linux filesystem. Now that the introductory material is out of the way, the next article in this series will start to actually play with Btrfs and explore its feature set. Those articles (appearing here as they are published) are:
- Getting started: where to get the software, and the basics of creating and using a Btrfs filesystem.
- Working with multiple devices: a Btrfs filesystem is not limited to a single physical device; instead, the filesystem supports multiple-device filesystems in a number of RAID and RAID-like configurations. This article describes that functionality and how to make use of it.
- Subvolumes and snapshots: creating multiple filesystems on a single storage volume, along with the associated snapshot mechanism.
- Send/receive and ioctl(): using the send/receive feature for incremental backups, a brief overview of functionality available with ioctl(), and concluding notes.
By the end of the series, we plan to have a reasonably comprehensive introduction to Btrfs in place; stay tuned.
Fixing FS_IOC_GETFLAGS
Making kernel interfaces that work for both 32- and 64-bit processors has proved to be something of a challenge over the years. One of the more problematic areas has been passing arguments to ioctl() so that the same code will work on both types of processor—in both big- and little-endian varieties. As a recent thread shows, not all of those problems have been completely worked out over time.
Aurelien Jarno posted a question to the linux-fsdevel mailing list about FS_IOC_GETFLAGS and FS_IOC_SETFLAGS (which query and set inode flags on files). He noted that the definitions of those requests in include/uapi/linux/fs.h list the argument type as a pointer to long, except for the 32-bit compatibility versions, which specify an int *. The code in the kernel filesystems expects and uses a 32-bit quantity, and most—but not all—user-space code passes a pointer to a 32-bit integer.
Any application that passed a pointer to a 64-bit integer would work, but only on little-endian systems. Since the kernel code treats the pointer as one to a 32-bit quantity, it's a matter of which four bytes are accessed when the pointer is dereferenced. On big-endian processors, it is the most significant four bytes, whereas little-endian systems reference the least significant end. Since all of the flags live in the least significant four bytes, the big-endian systems effectively pass zero to FS_IOC_SETFLAGS or retrieve a value with (undefined) high bits set with FS_IOC_GETFLAGS.
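As an illustration, here is a minimal sketch of the portable way to issue these requests, passing a pointer to an int, alongside the long-based call that goes wrong on big-endian 64-bit systems; error handling is kept to a minimum.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>     /* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS */

    int main(int argc, char **argv)
    {
        int fd, flags;
        long lflags = 0;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;

        /* Correct: the kernel reads and writes a 32-bit quantity. */
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0)
            printf("flags: %#x\n", flags);

        /*
         * Problematic: this matches the "long" in the header's definition
         * and happens to work on little-endian machines, but on a
         * big-endian 64-bit system the kernel only touches the first
         * (most significant) four bytes of lflags.
         */
        ioctl(fd, FS_IOC_GETFLAGS, &lflags);

        close(fd);
        return 0;
    }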
Darrick J. Wong pointed out that the kernel FUSE driver uses the types from that header, which also causes a problem: the driver expects to transfer a 64-bit quantity, but most user-space programs provide only 32 bits. He plans to special-case those ioctl() requests in the FUSE driver.
The number of big-endian 64-bit systems (e.g. PowerPC, MIPS, and s390) is dwarfed by the number of little-endian x86_64 systems. That means that few have seen the problem, but it also means that any fix needs to be made carefully to avoid breaking millions of existing systems. That is always an important—overriding—consideration for changes to the kernel, of course, but Ted Ts'o highlighted that concern when he explained a bit about how this had come about and why changing to a long * everywhere would not work. Because the majority of user-space programs pass an int *, a change like that would cause them to break on 64-bit systems regardless of endianness.
But, as Jarno pointed out, anyone trying to do the right thing and look up the argument type in <linux/fs.h> will get it wrong. "The bare minimum would be to add a comment close to the definition to explain to use an int and not a long." It turns out that there are four ioctl() requests (FS_IOC_GETVERSION and FS_IOC_SETVERSION in addition to the get/set flags mentioned above) that have the problematic definition. Jarno posted a patch to make that change by adding a warning to include/uapi/linux/fs.h (which gets installed in include/linux):
    /*
     * WARNING: The next four following ioctls actually take an int argument
     * despite their definition. This is important to support 64-bit big-endian
     * machines.
     */
One might think that just changing the type of the argument to 32 bits in the header file would be a possibility, but that cannot be done either. The mapping of the request name to a number is done using a set of macros that use sizeof() on the "type" argument; on 64-bit systems sizeof(long) is eight, while on 32-bit systems it is four. Since the numbers calculated for the requests are now a fixed part of the Linux ABI, changing the type of the argument in that header would not solve the problem.
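To see why, here are the current definitions along with a simplified picture of what the _IOR() macro encodes; the bit layout shown in the comment is approximate and varies by architecture.

    /* The current definitions, from include/uapi/linux/fs.h: */
    #define FS_IOC_GETFLAGS     _IOR('f', 1, long)   /* size encoded as 8 on 64-bit */
    #define FS_IOC32_GETFLAGS   _IOR('f', 1, int)    /* size encoded as 4 (compat) */

    /*
     * _IOR() folds sizeof() of its third argument into the request number,
     * roughly like this:
     *
     *     _IOR(type, nr, argtype) ==
     *         (_IOC_READ << 30) | (sizeof(argtype) << 16) | (type << 8) | nr
     *
     * so changing "long" to "int" in the header would silently change the
     * request number itself, breaking the existing ABI.
     */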
Several participants in the thread suggested adding a new request type (FS_IOC_GETFLAGS_NEW or FS_IOC_GETFLAGS_WIDE, for example) that would take a pointer to a 64-bit integer on all architectures. That would have the advantage of doubling the number of available flags; the existing space is getting close to being used up, with perhaps ten bits still available for expansion. Adding another 32 might cover any upcoming use cases, though some are rather skeptical that it will be enough.
The FS_IOC_[GS]ETFLAGS requests were originally added for the ext* filesystems, but have been picked up by other filesystems over time. In addition, there are flags for other filesystems that are only available via filesystem-specific ioctl() requests; according to Dave Chinner, XFS already exposes roughly ten such flags through a different request (XFS_IOC_FSGETXATTR), and other filesystems have their own sets. So, if a change is going to be made, Chinner asked, why not create one that unifies all of the disparate inode-flag handling and allows for more than the 64 flags that might otherwise be exhausted before long?
Ts'o is not convinced that the additional complexity is worth it. But Chinner sees XFS adding "tens of new inode flags" over the coming years. Other filesystems may well have similar needs. So a fixed-length bitmap may not be the best solution long-term, but there was little agreement on which alternative should be pursued.
Chinner suggested some kind of attribute-based interface that is open-ended so that it could be expanded to handle any inode flags for any filesystems down the road. He also mentioned the xstat() system call as another possibility. But, as Andreas Dilger pointed out, xstat() has been proposed many times, but has never made it into the kernel.
So there are some possible solutions that "solve the problem once and for all" (as Chinner put it), but it is not at all clear that anyone is planning to push for one of them. In the meantime, Jarno's "fix" to the header file will at least help users pass the right argument types. The user-space applications that pass long pointers (bup and libexplain were mentioned) will need to change, but that shouldn't be too onerous. A more ambitious, global solution may not be in the works anytime soon.
Memory barriers for TSO architectures
Developers — even kernel developers — like to think of access to memory as a nice, predictable, deterministic thing. If two memory locations are written in a given order, one would hope that any other processors would see those writes in the same order. The truth of the matter is rather murkier; compilers and processors are both happy to take liberties with the ordering of memory accesses in the name of performance. Most of this playing around is invisible to programmers, but it can interfere with the correct operation of concurrent systems, so developers must occasionally force things to happen in the right order through the use of memory barriers. The set of available barrier types looks like it will get a bit larger in the 3.14 kernel.
Introduction to memory barriers
A memory barrier is a directive that prohibits the hardware (and compiler) from reordering operations in specific ways. To see how they might be used, consider the following simple example, taken from a 2013 Kernel Summit session. The lockless insertion of a new element into a linked list can be performed in two steps. The first is to set the "next" pointer of the new item to point to the item that will follow it in the list. Once that is done, the list itself can be modified to include the new item (a sketch of the sequence appears below).
A thread walking the list will either see the new item or it won't, depending on the timing, but it will see a well-formed list in either case. If, however, the operations are reordered such that the second pointer assignment becomes visible before the first, there will be a period of time during which the structure of the list is corrupted. Should a thread follow that pointer at the wrong time, it will end up off in the weeds. To keep that from happening, this sort of list operation must use a memory barrier between the two writes. With a proper barrier in place, the pointer assignments will never be seen in the wrong order.
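In kernel-style code, the writer side might look something like the sketch below. This is an illustration only, assuming a simple, hypothetical singly linked list; real kernel code would normally use the list or RCU helpers, which package this pattern up.

    struct item {
        int data;
        struct item *next;
    };

    /* Insert "new" after "prev" in a list that may be traversed locklessly. */
    static void insert_after(struct item *prev, struct item *new)
    {
        /* Step one: point the new item at its successor. */
        new->next = prev->next;

        /*
         * Ensure that the assignment above is visible to other CPUs before
         * the new item itself becomes reachable from the list.
         */
        smp_wmb();

        /* Step two: link the new item into the list. */
        prev->next = new;
    }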
The kernel offers a wide variety of memory barrier operations adapted to specific situations, but the most commonly used barriers are:
- smp_wmb() is a write memory barrier; it ensures that any write operation executed after the barrier will not become visible until all writes executed prior to the barrier are visible. A write memory barrier would be the appropriate type to use in the linked list example above.
- smp_rmb() is a read memory barrier; any reads executed before the barrier are forced to complete before any reads after the barrier can happen. Code traversing a linked list that is subject to lockless modification would want to use read barriers between access to subsequent link pointers.
- smp_mb() is a barrier for both read and write operations; it can be thought of as the combination of smp_wmb() and smp_rmb().
Memory barriers almost invariably come in pairs. If one of two cooperating threads cares about the order in which two values are written, the other side must be equally concerned about the order in which those values are read.
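Continuing the hypothetical list sketch from above, a reader walking that list would pair its read barriers with the writer's smp_wmb() roughly as follows. Again, this is only an illustration; actual kernel code would typically use rcu_dereference() and the RCU list-traversal primitives rather than open-coding the barriers.

    /* Walk the list locklessly, pairing with the writer's smp_wmb() above. */
    static int list_contains(struct item *head, int value)
    {
        struct item *p = head->next;

        while (p) {
            /*
             * Make sure each link pointer is read before the contents of
             * the item it points to; this read barrier pairs with the
             * writer's write barrier.
             */
            smp_rmb();
            if (p->data == value)
                return 1;
            p = p->next;
        }
        return 0;
    }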
Naturally enough, the full story is rather more complex than described here. Readers with sufficient interest and free time, along with quite a bit of excess brain power, can read Documentation/memory-barriers.txt for the full story.
The primary reason for the proliferation of memory barrier types is performance. A full memory barrier can be an expensive operation; that is something that kernel developers would prefer to avoid in fast paths. Weaker barriers are often cheaper, especially if they can be omitted altogether on some architectures. The x86 architecture, in particular, offers more ordering guarantees than some others do, making it possible to do without barriers entirely in some situations.
TSO barriers
A situation that has come up relatively recently has to do with "total store order" (TSO) architectures, where, as Paul McKenney put it, "reads are ordered before reads, writes before writes, and reads before writes, but not writes before reads". The x86 architecture has this property, though some others do not. TSO ordering guarantees are enough for a number of situations, but, in current kernels, a full memory barrier must be used to ensure those semantics on non-TSO architectures. Thus, it would be nice to have yet another memory barrier primitive to suit this situation.
Peter Zijlstra had originally called the new barrier smp_tmb(), but Linus was less than impressed with the descriptive power of that name. So Peter came up with a new patch set adding two new primitives:
- smp_load_acquire() forces a read of a location in memory (in much the same way as ACCESS_ONCE()), but it ensures that the read happens before any subsequent reads or writes.
- smp_store_release() writes a value back to memory, ensuring that the write happens after any previously-executed reads or writes.
These new primitives are immediately put to work in the code implementing the ring buffer used for perf events. That buffer has two pointers, called head and tail; head is where the kernel will next write event data, while tail is the next location user space will read events from. Only the kernel changes head, while only user space can change tail. In other words, it is a fairly standard circular buffer.
The code on the kernel side works like this (in pseudocode form):
    tail = smp_load_acquire(ring_buffer->tail);
    write_events(ring_buffer->head);    /* If 'tail' indicates there is space */
    smp_store_release(ring_buffer->head, new_head);
The smp_load_acquire() operation ensures that the proper tail pointer is read before any data is written to the buffer. And, importantly, smp_store_release() ensures that any data written to the buffer is actually visible there before the new head pointer is made visible. Without that guarantee, the reader side could possibly see a head pointer indicating that more data is available before that data is actually visible in the buffer.
The code on the read side is the mirror image:
    head = smp_load_acquire(ring_buffer->head);
    read_events(tail);    /* If 'head' indicates available events */
    smp_store_release(ring_buffer->tail, new_tail);
Here, the code ensures that the head pointer has been read before any data is read from the buffer; that way, the data the reader sees is guaranteed to be the data the kernel side made visible before publishing that head value. This smp_load_acquire() operation is thus paired with the smp_store_release() in the kernel-side code; together they make sure that data is seen in the correct order. The smp_store_release() call here pairs, in turn, with the smp_load_acquire() call in the kernel-side code; it makes sure that the tail pointer does not visibly change until user space has fully read the data from the buffer. Without that guarantee, the kernel could possibly overwrite that data before it was actually read.
The ring buffer code worked properly before the introduction of these new operations, but it had to use full barriers, making it slower than it needed to be. The new operations allow this code to be optimized while also better describing the exact operations that are being protected by barriers. As it happens, a lot of kernel code may be able to work with the slightly weaker guarantees offered by the new barrier operations; the patch changelog says "It appears that roughly half of the explicit barriers in core kernel code might be so replaced."
The cost, of course, is that the kernel's complicated set of memory barrier operations has become even more complex. Once upon a time that might not have mattered much, since most use of memory barriers was deeply hidden within other synchronization primitives (spinlocks and mutexes, for example). With scalability pressures pushing lockless techniques into more places in the kernel, though, the need to be explicitly aware of memory barriers is growing. There may come a point where understanding memory-barriers.txt will be mandatory for working in much of the kernel.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Security-related
Page editor: Jonathan Corbet