Kernel development
Brief items
Kernel release status
The current development kernel is 2.6.36-rc1, released (without announcement) on August 15. A number of patches have been merged since last week's merge window summary; see below for the most significant of them. Overall, the headline additions to 2.6.36 look to be the AppArmor security module, a new suspend mechanism which might - or might not - address the needs of the Android project, the LIRC infrared controller driver suite, a new out-of-memory killer, and the fanotify hooks for anti-malware applications. The full changelog is available for those who want all the details.
A handful of patches have been merged since 2.6.36-rc1; they include parts of the VFS scalability patch set by Nick Piggin. We'll take a closer look at those patches for next week's edition.
Stable updates: The 2.6.27.51, 2.6.32.19, 2.6.34.4, and 2.6.35.2 updates were released on August 13. Greg notes that the 2.6.34 updates are coming to an end, with only one more planned. There is another 2.6.27 update in the review process as of this writing; the future of 2.6.27 updates appears to be short as well, given that Greg can no longer boot such old kernels on any hardware in his possession.
Previously, the 2.6.27.50, 2.6.32.18, 2.6.34.3, and 2.6.35.1 updates came out on August 10.
Quotes of the week
1) base and suffices choose the possible types.
2) order of types is always the same: int -> unsigned -> long -> unsigned long -> long long -> unsigned long long
3) we always choose the first type the value would fit into
4) L in suffix == "at least long"
5) LL in suffix == "at least long long"
6) U in suffix == "unsigned"
7) without U in suffix, base 10 == "signed"
That's it.
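For readers who want to see these rules in action, here is a small, purely illustrative C11 program - not part of the quoted discussion - that prints the type the compiler selects for a few constants; it assumes a typical 64-bit Linux system with a 32-bit int and a 64-bit long.

    /*
     * Illustrative only: print the type the compiler picks for a few
     * integer constants, following the rules quoted above.  Assumes a
     * C11 compiler and the usual 64-bit Linux ABI (32-bit int, 64-bit long).
     */
    #include <stdio.h>

    #define TYPE_OF(x) _Generic((x),                          \
            int: "int",                                       \
            unsigned int: "unsigned int",                     \
            long: "long",                                     \
            unsigned long: "unsigned long",                   \
            long long: "long long",                           \
            unsigned long long: "unsigned long long")

    int main(void)
    {
            /* Decimal, no suffix: only signed types are candidates (rule 7). */
            printf("2147483648 -> %s\n", TYPE_OF(2147483648));
            /* Hex, no suffix: unsigned types are candidates, so unsigned int. */
            printf("0x80000000 -> %s\n", TYPE_OF(0x80000000));
            /* "L" suffix: at least long (rule 4). */
            printf("1L         -> %s\n", TYPE_OF(1L));
            /* "U" suffix: unsigned (rule 6). */
            printf("1U         -> %s\n", TYPE_OF(1U));
            return 0;
    }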
Notes from the Boston Linux Power Management Mini-summit
Several developers concerned with Linux power management met for a mini-summit in Boston on August 9, immediately prior to LinuxCon. Numerous topics were discussed, including suspend performance, the Android and MeeGo approaches to power management, and more. Len Brown has posted a set of notes from this gathering; click below for the full text.
Notes from the LSF summit storage track
LWN readers will have seen our reporting from the Linux Storage and Filesystem Summit (day 1, day 2), held on August 8 and 9 in Boston. Your editor was unable to attend the storage-specific sessions, though, so they were not covered in those articles. Fortunately, James Bottomley took detailed notes, which he has now made available to us. Click below for the results, covering iSCSI, SAN management, thin provisioning, trim and discard, solid-state storage devices, multipath, and more.
KVM Forum presentations available
The KVM Forum was held concurrently with LinuxCon on August 9 and 10. Slides from all of the presentations made at the forum are now available in PDF format.
Testing crypto drivers at boot time
Developers, understandably, want their code to be used, but turning new features on by default is often thought to be taking things a bit too far. Herbert Xu and other kernel crypto subsystem developers recently ran afoul of this policy when a new option controlling the self-testing of crypto drivers at boot time was set to "yes" by default. They undoubtedly thought that this feature was important—bad cryptography can lead to system or data corruption—but Linux has a longstanding policy that features should default to "off". When David Howells ran into a problem caused by a bug when loading the cryptomgr module, Linus Torvalds was quick to sharply remind Xu of that policy.
The proximate cause of Howells's problem was that the cryptomgr was returning a value that made it appear as if it was not loaded. That caused a cascade of problems early in the boot sequence when the module loader was trying to write an error message to /dev/console, which had not yet been initialized. Xu sent out a patch to fix that problem, but Howells's bisection pointed to a commit that added a way to disable boot-time crypto self-tests—a way which defaulted to running the tests.
Torvalds was characteristically blunt: "People always think that their magical code is so important. I tell you up-front that [it] absolutely is not. Just remove the crap entirely, please." He was unhappy that, at least by default, everyone would be running these self-tests every time they boot. But Xu was worried about data corruption and potentially flaky crypto hardware:
The last thing you want is to upgrade your kernel with a new hardware crypto driver that detects that you have a piece of rarely-used crypto [hardware], decides to use it and ends up making your data toast.
But Torvalds was unconvinced: "The _developer_ had better test the thing. That is absolutely _zero_ excuse for then forcing every boot for every poor user to re-do the test over and over again." Others were not so sure, however.
Kyle Moffett noted that he had been personally bitten by new hardware crypto drivers that failed the self-tests—thus falling back to the software implementation—so he would like to see more testing.
Basically, Torvalds's point was that the cost of making every user run the self-tests at boot time was too high. The drivers should be reliable or they shouldn't be in the kernel. He continued: "And if you worry about alpha-particles, you should run a RAM test on every boot. But don't ask _me_ to run one."
Though Xu posted a patch to default the self-tests to "off", it has not yet made its way into the mainline. Given Torvalds's statements, though, that will probably happen relatively soon. If distributions disagree with his assessment, they can, and presumably will, enable the tests for their kernels.
Kernel development news
The conclusion of the 2.6.36 merge window
The 2.6.36 merge window closed with the release of 2.6.36-rc1 on August 15. About 1000 changesets were merged into the mainline after last week's update; this article will cover the significant additions since then, starting with the user-visible changes:
- The Squashfs filesystem has gained support for filesystems compressed with LZO.
- The Ceph filesystem now has advisory locking support.
- There is now support for erase and trim operations (including the "secure" variants) in the multi-media card (MMC) subsystem. The block layer has been extended with a new REQ_SECURE flag and a new BLKSECDISCARD ioctl() command to support this functionality; a brief usage sketch appears after this list.
- New drivers:
- Block: ARM PXA-based PATA controllers.
- Systems and processors: Income s.r.o. PXA270 single-board computers, Wiliboard WBD-111 boards, and Samsung S5PC210-based systems.
- Miscellaneous: Semtech SX150-series I2C GPIO expanders, Intersil ISL12022 RTC chips, Freescale IMX DryIce real time clocks, Dallas Semiconductor DS3232 real-time clock chips, SuperH Mobile HDMI controllers, iPAQ h1930/h1940/rx1950 battery controllers, Intel MID battery controllers, STMicroelectronics STMPE I/O expanders, TI TPS6586x power management chips, ST-Ericsson AB8500 power regulators, Maxim Semiconductor MAX8998 power management controllers, Intersil ISL6271A power regulators, Analog Devices AD5398/AD5821 regulators, Lenovo IdeaPad ACPI rfkill switches, Winbond/Nuvoton NUC900 I2C controllers, and SMSC EMC2103 temperature and fan sensors.
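As a rough illustration of the BLKSECDISCARD ioctl() mentioned in the MMC item above, here is a hedged userspace sketch. It assumes that, like the existing BLKDISCARD command, the new ioctl() takes a {start, length} pair of byte offsets; the device path is just a placeholder, and running something like this against real data would, of course, be destructive.

    /*
     * Hedged sketch: securely discard the first megabyte of an MMC
     * device.  Assumes BLKSECDISCARD takes a {start, length} pair in
     * bytes, as BLKDISCARD does; "/dev/mmcblk0" is a placeholder.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>

    int main(void)
    {
            uint64_t range[2] = { 0, 1024 * 1024 };   /* start, length */
            int fd = open("/dev/mmcblk0", O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (ioctl(fd, BLKSECDISCARD, range) < 0)
                    perror("BLKSECDISCARD");
            close(fd);
            return 0;
    }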
Changes visible to kernel developers include:
- The ioctl() file operation has been removed, now that all in-tree users have been converted to the unlocked_ioctl() version which does not acquire the big kernel lock. Removal of the BKL has gotten yet another step closer.
- The nearly unused function dma_is_consistent(), meant to indicate whether cache-coherent DMA can be performed on a specific range of memory, has been removed.
- The kfifo API has been reworked for ease of use and performance. Some examples of how to use the API have been added under samples/kfifo.
- There is a new set of functions for avoiding races with sysfs access to module parameters:
    kparam_block_sysfs_read(name);
    kparam_unblock_sysfs_read(name);
    kparam_block_sysfs_write(name);
    kparam_unblock_sysfs_write(name);
Here, name is the name of the parameter as supplied to module_param() in the same source file. They are implemented with a mutex; a minimal usage sketch appears after this list.
- Experimental support for multiplexed I2C busses has been added.
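The module-parameter helpers listed above are probably best understood with a short example. What follows is a minimal, hypothetical module fragment - the parameter name and the update logic are invented - showing the intended pattern of blocking sysfs writers around an update:

    /*
     * Hypothetical example: keep sysfs writes to "threshold" out while
     * the parameter and related state are updated together.  The
     * parameter and all surrounding logic are invented for illustration.
     */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/moduleparam.h>

    static int threshold = 10;
    module_param(threshold, int, 0644);    /* writable via sysfs */

    static void example_set_threshold(int new_value)
    {
            kparam_block_sysfs_write(threshold);
            threshold = new_value;
            /* ... update any state that must stay consistent with it ... */
            kparam_unblock_sysfs_write(threshold);
    }

    static int __init example_init(void)
    {
            example_set_threshold(20);
            return 0;
    }

    static void __exit example_exit(void)
    {
    }

    module_init(example_init);
    module_exit(example_exit);
    MODULE_LICENSE("GPL");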
All told, some 7,770 changes were incorporated during this merge window. There were not a whole lot of changes pushed back this time around. The biggest feature which was not merged, perhaps, was transparent hugepages, but that omission is most likely due to the lack of a proper pull request from the maintainer.
Now the stabilization period begins. Linus has suggested that he plans to repeat his attempt to hold a hard line against any post-rc1 changes which are not clearly important fixes; we will see how that works out in practice.
One billion files on Linux
What happens if you try to put one billion files onto a Linux filesystem? One might see this as an academic sort of question; even the most enthusiastic music downloader will have to work a while to collect that much data. It would require over 30,000 (clean) kernel trees to add up to a billion files. Even contemporary desktop systems, which often seem to be quite adept at the creation of vast numbers of small files, would be hard put to make a billion of them. But, Ric Wheeler says, this is a problem we need to be thinking about now, or we will not be able to scale up to tomorrow's storage systems. His LinuxCon talk used the billion-file workload as a way to investigate the scalability of the Linux storage stack.
One's first thought, when faced with the prospect of handling one billion files, might be to look for workarounds. Rather than shoveling all of those files into a single filesystem, why not spread them out across a series of smaller filesystems? The problems with that approach are that (1) it limits the kernel's ability to optimize head seeks and such, reducing performance, and (2) it forces developers (or administrators) to deal with the hassles involved in actually distributing the files. Inevitably things will get out of balance, forcing things to be redistributed in the future.
Another possibility is to use a database rather than the filesystem. But filesystems are familiar to developers and users, and they come with the operating system from the outset. Filesystems also are better at handling partial failure; databases, instead, tend to be all-or-nothing affairs.
If one wanted to experiment with a billion-file filesystem, how would one come up with hardware which is up to the task? The most obvious way at the moment is with external disk arrays. These boxes feature non-volatile caching and a hierarchy of storage technologies. They are often quite fast at streaming data, but random access may be fast or slow, depending on where the data of interest is stored. They cost $20,000 and up.
With regard to solid-state storage, Ric noted only that 1TB still costs a good $1000. So rotating media is likely to be with us for a while.
What if you wanted to put together a 100TB array on your own? They did it at Red Hat; the system involved four expansion shelves holding 64 2TB drives. It cost over $30,000, and was, Ric said, a generally bad idea. Anybody wanting a big storage array will be well advised to just go out and buy one.
The filesystem life cycle, according to Ric, starts with a mkfs operation. The filesystem is filled, iterated over in various ways, and an occasional fsck run is required. At some point in the future, the files are removed. Ric put up a series of plots showing how ext3, ext4, XFS, and btrfs performed on each of those operations with a one-million-file filesystem. The results varied, with about the only consistent factor being that ext4 generally performs better than ext3. Ext3/4 are much slower than the others at creating filesystems, due to the need to create the static inode tables. On the other hand, the worst performers when creating 1 million files were ext3 and XFS. Everybody except ext3 performs reasonably well when running fsck - though btrfs shows room for some optimization. The big loser when it comes to removing those million files is XFS.
To see the actual plots, have a look at Ric's slides [PDF].
It's one thing to put one million files into a filesystem, but what about one billion? Ric did this experiment on ext4, using the homebrew array described above. Creating the filesystem in the first place was not an exercise for the impatient; it took about four hours to run. Actually creating those one billion files, instead, took a full four days. Surprisingly, running fsck on this filesystem only took 2.5 hours - a real walk in the park. So, in other words, Linux can handle one billion files now.
That said, there are some lessons that came out of this experience; they indicate where some of the problems are going to be. The first of these is that running fsck on an ext4 filesystem takes a lot of memory: on a 70TB filesystem with one billion files, 10GB of RAM was needed. That number goes up to 30GB when XFS is used, though, so things can get worse. The short conclusion: you can put a huge amount of storage onto a small server, but you'll not be able to run the filesystem checker on it. That is a good limitation to know about ahead of time.
Next lesson: XFS, for all of its strengths, struggles when faced with metadata-intensive workloads. There is work in progress to improve things in this area, but, for now, it will not perform as well as ext3 in such situations.
According to Ric, running ls on a huge filesystem is "a bad idea"; iterating over that many files can generate a lot of I/O activity. When trying to look at that many files, you need to avoid running stat() on every one of them or trying to sort the whole list. Some filesystems can return the file type with the name in readdir() calls, eliminating the need to call stat() in many situations; that can help a lot in this case.
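As a small illustration of that point, the sketch below counts regular files using the d_type value that many Linux filesystems return from readdir(), falling back to stat() only when the filesystem reports DT_UNKNOWN; the directory path is a placeholder.

    /*
     * Count regular files in a directory without calling stat() on every
     * entry, using d_type where the filesystem provides it.  Purely
     * illustrative; "/some/huge/dir" is a placeholder.
     */
    #define _DEFAULT_SOURCE
    #include <dirent.h>
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
            DIR *dir = opendir("/some/huge/dir");
            struct dirent *de;
            long regular = 0;

            if (!dir) {
                    perror("opendir");
                    return 1;
            }
            while ((de = readdir(dir)) != NULL) {
                    if (de->d_type == DT_REG) {
                            regular++;
                    } else if (de->d_type == DT_UNKNOWN) {
                            /* This filesystem doesn't fill in d_type; ask stat(). */
                            struct stat st;
                            if (fstatat(dirfd(dir), de->d_name, &st, 0) == 0 &&
                                S_ISREG(st.st_mode))
                                    regular++;
                    }
            }
            printf("%ld regular files\n", regular);
            closedir(dir);
            return 0;
    }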
In general, enumeration of files tends to be slow; we can do, at best, a few thousand files per second. That may seem like a lot of files, but, if the target is one billion files, it will take a very long time to get through the whole list. A related problem is backup and/or replication. That, too, will take a very long time, and it can badly affect the performance of other things running at the same time. That can be a problem because, given that a backup can take days, it really needs to be run on an operating, production system. Control groups and the I/O bandwidth controller can maybe help to preserve system performance in such situations.
Finally, application developers must bear in mind that processes which run this long will invariably experience failures, sooner or later. So they will need to be designed with some sort of checkpoint and restart capability. We also need to do better about moving on quickly when I/O operations fail; lengthy retry operations can take a slow process and turn it into an interminable one.
In other words, as things get bigger we will run into some scalability problems. There is nothing new in that revelation. We've always overcome those problems in the past, and should certainly be able to do so in the future. It's always better to think about these things before they become urgent problems, though, so talks like Ric's provide a valuable service to the community.
The end of block barriers
One of the higher-profile decisions made at the recently-concluded Linux Storage and Filesystem summit was to get rid of support for barriers in the Linux kernel block subsystem. This was a popular decision, but also somewhat misunderstood (arguably, by your editor above all). Now, a new patch series from Tejun Heo shows how request ordering will likely be handled between filesystems and the block layer in the future.
The block layer must be able to reorder disk I/O operations if it is to obtain the sort of performance that users expect from their systems. On rotating media, there is much to be gained by minimizing head seeks, and that goal is best achieved by executing all nearby requests together, regardless of the order in which those requests were issued. Even with flash-based devices, there is some benefit to be had by grouping adjacent requests, especially when small requests can be coalesced into larger operations. Proper dispatch of requests to the low-level device driver is normally the I/O scheduler's job; the scheduler will freely reorder requests, blissfully ignorant of the higher-level decisions which created those requests in the first place.
Note that this reordering also usually happens within the storage device itself; requests will be cached in (possibly volatile) memory and writes will be executed at a time which the hardware deems to be convenient. This reordering is typically invisible to the operating system.
The problem, of course, is that it is not always safe to reorder I/O requests in arbitrary ways. The classic example is that of a journaling filesystem, which operates in roughly this way:
1. Begin a new transaction.
2. Write all planned metadata changes to the journal. Depending on the filesystem and its configuration, data changes may go to the journal as well.
3. Write a commit record closing out the transaction.
4. Begin the process of writing the journaled changes to the filesystem itself.
5. Goto 1.
If the system were to crash before step 3 completes, everything written to the journal would be lost, but the integrity of the filesystem would be unharmed. If the system crashes after step 3, but before the changes are written to the filesystem, those changes will be replayed at the next mount, preserving both the metadata and the filesystem's integrity. Thus, journaling makes a filesystem relatively crash-proof.
But imagine what can happen if requests are reordered. If the commit record is written before all of the other changes have been written to the journal, then, after a crash, an incomplete journal would be replayed, corrupting the filesystem. Or, if a transaction frees some disk blocks which are subsequently reused elsewhere in the filesystem, and the reused blocks are written before the transaction which freed them is committed, a crash at the wrong time would, once again, corrupt things. So, clearly, the filesystem must be able to impose some ordering on how requests are executed; otherwise, its attempts to guarantee filesystem integrity in all situations may well be for nothing.
For some years, the answer has been barrier requests. When the filesystem issues a request to the block layer, it can mark that request as a barrier, indicating that the block layer should execute all requests issued before the barrier prior to doing any requests issued afterward. Barriers should, thus, ensure that operations make it to the media in the right order while not overly constraining the block layer's ability to reorder requests between the barriers.
In practice, barriers have an unpleasant reputation for killing block I/O performance, to the point that administrators are often tempted to turn them off and take their risks. While the tagged queue operations provided by contemporary hardware should implement barriers reasonably well, attempts to make use of those features have generally run into difficulties. So, in the real world, barriers are implemented by simply draining the I/O request queue prior to issuing the barrier operation, with some flush operations thrown in to get the hardware to actually commit the data to persistent media. Queue-drain operations will stall the device and kill the parallelism needed for full performance; it's not surprising that the use of barriers can be painful.
In their discussions of this problem, the storage and filesystem developers have realized that the ordering semantics provided by block-layer barriers are much stronger than necessary. Filesystems need to ensure that certain requests are executed in a specific order, and they need to ensure that specific requests have made it to the physical media before starting others. Beyond that, though, filesystems need not concern themselves with the ordering for most other requests, so the use of barriers constrains the block layer more than is required. In general, it was concluded, filesystems should concern themselves with ordering, since that's where the information is, and not dump that problem into the block layer.
To implement this reasoning, Tejun's patch gets rid of hard-barrier operations in the block layer; any filesystem trying to use them will get a cheery EOPNOTSUPP error for its pains. A filesystem which wants operations to happen in a specific order will simply need to issue them in the proper order, waiting for completion when necessary. The block layer can then reorder requests at will.
What the block layer cannot do, though, is evade the responsibility for getting important requests to the physical media when the filesystem requires it. So, while barrier requests are going away, "flush requests" will replace them. On suitably-capable devices, a flush request can have two separate requirements: (1) the write cache must be flushed before beginning the operation, and (2) the data associated with the flush request itself must be committed to persistent media by the time the request completes. The second part is often called a "force unit access" (or FUA) request.
In this world, a journaling filesystem can issue all of the journal writes for a given transaction, then wait for them to complete. At that point, it knows that the writes have made it to the device, but the device might have cached those requests internally. The write of the commit record can then follow, with both the "flush" and "FUA" bits set; that will ensure that all of the journal data makes it to physical media before the commit record does, and that the commit record itself is written by the time the request completes. Meanwhile, all other I/O operations - playing through previous transactions or those with no transaction at all - can be in flight at the same time, avoiding the queue stall which characterizes the barrier operations implemented by current kernels.
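In code, that commit path might look something like the hedged sketch below; the REQ_FLUSH and REQ_FUA flag names come from the patch set, but the surrounding function is invented for illustration and is not taken from Tejun's patches or any real filesystem.

    /*
     * Hypothetical journaling-filesystem fragment.  The journal writes
     * for this transaction have already been submitted and waited for;
     * now the commit record is issued with "flush" and "FUA" semantics:
     * the device's write cache is flushed first (so the journal blocks
     * are on the media), and the commit record itself must be on stable
     * storage before the bio completes.
     */
    #include <linux/bio.h>
    #include <linux/blk_types.h>
    #include <linux/fs.h>

    static void submit_commit_record(struct bio *commit_bio, bio_end_io_t *done)
    {
            commit_bio->bi_end_io = done;   /* completion means the record is durable */
            submit_bio(WRITE | REQ_FLUSH | REQ_FUA, commit_bio);
    }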
The patch set has been well received, but there is still work to be done, especially with regard to converting filesystems to the new way of doing things. Christoph Hellwig has posted a set of patches to that end. A lot of testing will be required as well; there is little desire to introduce bugs in this area, since the consequences of failure are so high. But the development cycle has just begun, leaving a fair amount of time to shake down this work before the 2.6.37 merge window opens.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Device drivers
Filesystems and block I/O
Memory management
Miscellaneous
Page editor: Jonathan Corbet