Kernel development

Brief items

Kernel release status

The 2.6.36 merge window is still open, so no development kernel release is available yet. See the article below for a summary of the merges made in the last week.

Four stable kernels were released on August 10: 2.6.27.50, 2.6.32.18, 2.6.34.3, and 2.6.35.1.

Quotes of the week

I don't think the situation is in fact deteriorating. We're shipping decent releases, growing our user base, within and without the kernel developer community, and still have plenty of major feature areas to work on. We have not seen regressive LKML obstructions, though admittedly that is a low standard when it comes to serving the community.
-- SystemTap maintainer Frank Eigler

If my corporate overlords told me I had to use my Exchange "messaging" account for external email communication, they would get a quite clear 'no' in response. My response may also contain suggestions that they use certain other objects for purposes for which they were not designed.

Seriously, just use an external email account and ignore the broken corporate policy. 'Policy' is just a euphemism for not having to think for yourself.

-- David Woodhouse

Kernel development news

2.6.36 merge window: the sequel

By Jonathan Corbet
August 11, 2010
As of this writing, some 6700 non-merge changesets have been accepted for the 2.6.36 development cycle. These changes bring a lot of fixes and a number of new features, some of which have been in the works for some time. The most interesting changes merged since last week's summary are listed here.

User-visible changes include:

  • The ext3 filesystem, once again, defaults to the (safer) "ordered" mode at mount time. This reverses the change (to "writeback" mode) made in 2009, which was typically overridden by distributions.

  • The out-of-memory killer has been rewritten. The practical result is that the system may choose different processes to kill in out-of-memory situations, and the user-space API for adjusting how attractive processes appear to the OOM killer has changed.

  • The fanotify mechanism has been merged. Fanotify allows a user-space daemon to obtain notification of file operations and, perhaps, block access to specific files. It is intended for use with malware scanning applications, but there are other potential uses (hierarchical storage management, for example) as well.

  • There is a new system call for working with resource limits:

        int prlimit64(pid_t pid, unsigned int resource, 
                      const struct rlimit64 *new_rlim, struct rlimit64 *old_rlim);
    

    It is meant to (someday) replace setrlimit(); the differences include the ability to modify limits belonging to other processes and the ability to query and set a limit in a single operation.

  • The TTY driver has gained support for the EXTPROC mode supported by BSD for the last 20 years or so. This option was originally developed to facilitate telnet's "linemode", but it is useful for contemporary protocols as well.

  • New drivers:

    • Processors and systems: Ingenic JZ4740 SOC systems, Trapeze ITS GPR boards, ifm PDM360NG boards, Freescale P1022DS reference boards, TQM mpc8xx-based boards, TI TNETV107X-based systems, OMAP4430-based PandaBoards, NVIDIA Tegra-based systems, and Tilera TILEPro and TILE64 processors (a whole new architecture).

    • Block: QLogic ISP82XX host adaptors, AppliedMicro 460EX processor on-chip SATA controllers, Samsung S3C/S5P board PATA controllers, and Moorestown NAND Flash controllers.

    • Media: EasyCAP USB video adapters, Softlogic 6x10 MPEG codec cards, Winbond/Nuvoton NUC900-based audio controllers, Cirrus Logic CS42L51 codecs, Cirrus Logic EP93xx series audio devices, Marvell Kirkwood I2S audio devices, Ingenic JZ4740-based audio devices, SmartQ board audio devices, Wolfson Micro WM8741 codecs, and Samsung S5P FIMC video postprocessors.

    • Miscellaneous: Silicon Image sil164 TMDS transmitters, TI DSP bridge devices, PCILynx TSB12LV21/A/B controllers (as a FireWire sniffer; the user-space side has also been added under tools/firewire), Bosch Sensortec BMP085 digital pressure sensors, ROHM BH1780GLI ambient light sensors, Honeywell HMC6352 compasses, Summit Microelectronics SMM665 six-channel active DC output controller/monitor devices, JEDEC JC 42.4 compliant temperature sensors, Intel Topcliff PCH DMA controllers, Intel Moorestown DMAC1 and DMAC2 controllers, Intel Moorestown MAX3110 and MAX3107 UARTs, Intel Medfield UARTs, Quatech SSU-100 USB serial ports, and ARM Primecell SP805 watchdog timers.

Changes visible to kernel developers include:

  • The SCSI layer now supports runtime power management, but almost no work has been done (yet) to push that support down into individual drivers.

  • The MIPS architecture now has kprobes support.

  • The KGDB debugger is now supported with the Microblaze architecture.

  • There are a few new build-time configuration commands: listnewconfig outputs a list of new configuration options, oldnoconfig sets all new configuration options to "no" without asking, alldefconfig sets all options to their default values, and savedefconfig writes a minimal configuration file in defconfig. (The patch adding the first two options above also introduces a new Whatevered-by: patch tag, with unknown semantics.)

  • There is a new scripts/coccinelle directory containing a number of Coccinelle "semantic patches" which perform various useful checks. They can be run with "make coccicheck".

  • The kmemtrace ftrace plugin is gone; "perf kmem" should be used instead. The ksym plugin has also been superseded by perf, and, thus, removed.

  • There is a new function for short, blocking delays:

        void usleep_range(unsigned long min, unsigned long max);
    

    This function will sleep (uninterruptibly) for a period between min and max microseconds. It is based on hrtimers, so the timing will be more precise than obtained with msleep().

  • The new IRQF_NO_SUSPEND flag for request_irq() will cause the interrupt line not to be disabled during suspend; IRQF_TIMER can no longer be (mis)used for this purpose.

  • The concurrency-managed workqueues patch set has been merged, completely changing the way workqueues are implemented. One immediate user-visible result will be that there should be far fewer kernel threads running on most systems. All users of the "slow work" API have been converted to concurrency-managed workqueues, so the slow work mechanism has been removed from the kernel.

  • The cpuidle mechanism has been enhanced to allow for the set of available idle states to change over time. Details can be found in this patch.

  • The Blackfin architecture has gained dynamic ftrace support.

  • There is a new super_operations method called evict_inode(); it handles all of the necessary work when an in-core inode is being removed. It should be used instead of clear_inode() and delete_inode().

  • The inotify mechanism has been removed from inside the kernel; the fsnotify mechanism must be used instead. (Of course, the user-space inotify interface is still supported).

  • The Video4Linux2 layer has gained a new framework which simplifies the handling of controls; see this commit and Documentation/video4linux/v4l2-controls.txt for details.

  • The open() and release() functions in struct block_device_operations are now called without the big kernel lock held. Additionally, the locked_ioctl() function has gone away; all block drivers must implement their own locking there as well.

  • The domain name resolution code has been pulled out of the CIFS filesystem and made generic. It works by using the key mechanism to request DNS resolution from user space; see Documentation/networking/dns-resolver.txt for details.

The merge window remains open as of this writing, so we may yet see more interesting features merged for 2.6.36. Watch this space next week for the final merge window updates for this development cycle.

The 2010 Linux Storage and Filesystem Summit, day 1

By Jonathan Corbet
August 9, 2010
The fourth Linux storage and filesystem summit was held August 8 and 9 in Boston, immediately prior to LinuxCon. This time around, a number of developers from the memory management community were present as well. Your editor was also there; what follows are his notes from the first day of the summit.

Testing tools

The first topic of the workshop was "testing and tools," led by Eric Sandeen. The 2009 workshop identified a generic test suite as something that the community would very much like to have. One year later, quite a bit of progress has been made in the form of the xfstests package. As the name suggests, this test suite has its origins in the XFS filesystem, and it is still somewhat specific to XFS. But, with the addition of generic testing last May, about 70 of the 240 tests are now generic. Xfstests is concerned primarily with regression testing; it is not, generally, a performance-oriented test suite. Tests tend to get added when somebody stumbles across a bug and wants to verify that it's fixed - and that it stays fixed. Xfstests also does not have any real facility for the creation of test filesystems; tools like Impressions are best used for that purpose.

About 40 new tests have been added to xfstests over the last year; it is now heavily used in ext4 development. Most tests look for specific bugs; there isn't a whole lot of coverage for extreme situations - millions of files in one directory and such. Those tests just tend to take too long.

It was emphasized that just running xfstests is not enough on its own; tests must be run under most or all reasonable combinations of mount options to get good coverage. Ric Wheeler also pointed out that different types of storage have very different characteristics. Most developers, he fears, tend to test things on their laptops and call the result good. Testing on other types of storage, of course, requires access to the hardware; USB sticks are easy, but not all developers can test on enterprise-class storage arrays.

Tests which exercise more of the virtual memory and I/O paths would also be nice. There is one package which covers much of this ground: FIO, available from kernel.dk. Destructive power failure testing is another useful area which Red Hat (at least) is beginning to do. There has also been some work done using hdparm to corrupt individual sectors on disk to see how filesystems respond. A wishlist item was better latency measurement, with an emphasis on seeing how long I/O requests sit within drivers which do their own queueing. It was suggested that what is really needed is some sort of central site for capturing wishlist ideas for future tests; then, whenever somebody has some time available, those ideas are available.

In an attempt to better engage the memory management developers in the room, it was asked: how can we make tests which cover writeback? The key, it seems, is to choose a workload which is large enough to force writeback, but not so large that it pushes the system into heavy swapping. One simple test is a large tar run; while that is happening, monitor /proc/vmstat to see when writeback kicks in, and when things get bad enough that direct reclaim is invoked. An arguably more representative test can be had with sysbench; again, the key is to tune it so that the shared buffers fit within physical memory.
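
The /proc/vmstat monitoring described above could be sketched as follows. This is a minimal example, not a tool anyone at the summit presented: the helper names are hypothetical, and the counter names a monitor would watch (nr_dirty and nr_writeback for writeback, something like allocstall for direct reclaim) are assumptions that vary between kernel versions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in /proc/vmstat-style text ("name value" per line).
 * Returns 0 and stores the value on success, -1 if the name is absent.
 */
int vmstat_lookup(const char *text, const char *name, unsigned long long *value)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Require the full counter name followed by a space. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return sscanf(line + len, "%llu", value) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Read /proc/vmstat into buf as a string; returns bytes read or -1. */
long vmstat_read(char *buf, size_t size)
{
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, size - 1, f);
    fclose(f);
    buf[n] = '\0';
    return (long)n;
}
```

A monitor would call vmstat_read() once a second while the tar run proceeds and report the change in each counter of interest between samples.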

But, as Nick Piggin pointed out, anything that dirties memory is, in the end, a writeback test. The key is to find ways of making tests which adequately model real-world workloads.

Memory-management testing

Your editor is messing with the timing here, but the session on testing of memory management changes fits well with the above. So please ignore the fact that this session actually happened after lunch.

The question here is simple: how can memory management changes be tested to satisfy everybody? This is a subject which has been coming up for years; memory management changes seem to be especially subject to "have you tested with this other kind of workload?" questions. Developers find this frustrating; it never seems to be possible to do enough testing to satisfy everybody, especially since people asking for testing of different workloads are often unable or unwilling to supply an actual test program.

It was suggested that the real test should be "put the new code on the Google cluster and see if the Internet breaks." There are certain practical difficulties with this approach, however. So the question remains: how can a developer conclude that a memory management change actually works? Especially in a situation where "works" means different things to different people? There is far too wide a variety of workloads to test them all. Beyond that, memory management changes often involve tradeoffs - making one workload better may mean making another one worse. Changes which make life better for everybody are rare.

Still, it was agreed that a standard set of tests would help. Some suggestions were made, including hackbench, netperf, various database benchmarks (pgbench or sysbench, for example), and the "compilebench" test which is popular with kernel developers. There was also some talk of microbenchmarks; Nick Piggin noted that microbenchmarks are terrible when it comes to arguing for the inclusion of a change, but they can be useful for the detection of performance regressions.

Sometimes running a single benchmark is not enough; many memory management problems are only revealed when the system comes under a combination of stresses. And Andrea Arcangeli made the point that, in the end, only one test really matters: how much time does it take the system to complete running a workload which exceeds the amount of physical RAM available?

There was some discussion of the challenges involved in tracking down problems; Mel Gorman stated that the debuggability of the virtual memory subsystem is "a mess." Tracepoints can be useful for this purpose, but they are hard to get merged, partly due to Andrew Morton's hostility to tracepoints in general. There is also ongoing concern about the ABI status of tracepoints; what happens when programs (perhaps being run by large customers of enterprise distributions) depend on tracepoints which expose low-level kernel details? Those tracepoints may no longer make sense after the code changes, but breaking them may not be an option.

Filesystem freeze/thaw

The filesystem freeze feature enables a system administrator to suspend writes to a filesystem, allowing it to be backed up or snapshotted while in a consistent state. It had its origins in XFS, but has since become part of the Linux VFS layer. There are a few issues with how freeze works in current kernels, though.

The biggest of these problems is unmounting - what happens when the administrator unmounts a frozen filesystem? In current kernels, the whole thing hangs, leaving the system with an unusable, un-unmountable filesystem - behavior which does not further the Linux World Domination goal. So four possible solutions were proposed:

  1. Simply disallow the unmounting of frozen filesystems. Al Viro stated that this solution is not really an option; there are cases where the unmount cannot be disallowed. One such case arises when the final process exits the namespace in which the filesystem is mounted. Disallowing unmounts would also break the useful umount -l option, which is meant to work at all times.

  2. Keep the filesystem frozen across the unmount, so that the filesystem would still be frozen after the next mount. The biggest problem here is that there may be changes that the filesystem code needs to write to the device; if the system reboots before that can happen, bad things can result.

  3. Automatically thaw filesystems on unmount.

  4. Add a new ioctl() command which will cause the thawing of an unmounted filesystem.

Al suggested a variant on #3, in the form of a new freeze command. The proper way to handle freeze is to return a file descriptor; as long as that file descriptor is held open, the filesystem remains frozen. This solves the "last process exits" problem because the file descriptor will be closed as the process exits, automatically causing the filesystem to be thawed. Also, as Al noted, the kill command is often the system recovery tool of choice for system administrators, so having a suitably-targeted kill cause a frozen filesystem to be thawed makes sense.

There seemed to be a consensus that the file descriptor approach is the best long-term solution. Meanwhile, though, there are tools based on the older ioctl() commands which will take years to replace in the field. So we might also see an implementation of #4, above, to help in the near future.
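
For reference, the existing ioctl()-based interface can be used roughly as sketched here. The FIFREEZE and FITHAW commands come from <linux/fs.h> and are not named in the discussion above; the helper with_frozen_fs() and its callback are hypothetical. This is the older interface, not the file-descriptor scheme Al proposed.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FIFREEZE, FITHAW */

/*
 * Freeze the filesystem containing `mountpoint`, run `work` (a snapshot or
 * backup step supplied by the caller), then thaw it again. Returns the
 * result of `work`, or -1 if the mount point cannot be opened or frozen,
 * or if the thaw fails. Hypothetical helper for illustration only.
 */
int with_frozen_fs(const char *mountpoint, int (*work)(void *), void *arg)
{
    int fd = open(mountpoint, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Freezing requires administrative privilege. */
    if (ioctl(fd, FIFREEZE, 0) < 0) {
        close(fd);
        return -1;
    }

    int rc = work ? work(arg) : 0;

    /* Always attempt the thaw, even if the work step failed. */
    if (ioctl(fd, FITHAW, 0) < 0)
        rc = -1;
    close(fd);
    return rc;
}
```

Note that closing fd here does not thaw anything: if the process is killed between the two ioctl() calls, the filesystem stays frozen. That gap is what the file-descriptor approach is meant to close.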

Barriers

Contemporary filesystems go to great lengths to avoid losing data - or corrupting things - if the system crashes. To that end, quite a bit of thought goes into writing things to disk in the correct order. As a simple example, operations written to a filesystem journal must make it to the media before the commit record which marks those operations as valid. Otherwise, the filesystem could end up replaying a journal with random data, with an outcome that few people would love.

All of that care is for nothing, though, if the storage device reorders writes on their way to the media. And, of course, reordering is something that storage devices do all the time in the name of increasing performance. The solution, for years now, has been "barrier" operations; all writes issued before a barrier are supposed to complete before any writes issued after the barrier. The problem is that barriers have not always been well supported in the Linux block subsystem, and, when they are supported, they have a significant impact on performance. So, even now, many systems run with barriers disabled.

Barriers have been discussed among the filesystem and storage developers for years; it was hoped that this year, with the memory management developers present as well, some better solutions might be found.

There was some discussion about the various ways of implementing barriers and making them faster. The key point in the discussion, though, was the assertion that barriers are not really the same as the memory barriers they were patterned after. There are, instead, two important aspects to block subsystem barriers: request ordering and forcing data to disk. That led, eventually, to one of the clearest decisions in the first day of the summit: barriers, as such, will be no more. The problem of ordering will be placed entirely in the hands of filesystem developers, who will ensure ordering by simply waiting for operations to complete when needed. There will be no ordering issues as such in the block layer, but block drivers will be responsible for explicitly flushing writes to physical media when needed.

Whether this decision will lead to better-performing and more robust filesystem I/O remains to be seen, but it is a clearer description of the division of responsibilities than has been seen in the past.
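
The "wait for completion, then flush" ordering model can be illustrated with a user-space analogy. The sketch below is not kernel code and is not how any filesystem implements its journal; it only shows the same pattern using pwrite() and fdatasync(), with a made-up one-byte commit record.

```c
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/*
 * Write `len` bytes of journal payload at `off`, wait for them to reach
 * stable storage, and only then write a one-byte commit marker after them.
 * Returns 0 on success, -1 on any error.
 */
int journal_commit(int fd, off_t off, const void *payload, size_t len)
{
    const char commit = 'C';   /* hypothetical commit-record format */

    if (pwrite(fd, payload, len, off) != (ssize_t)len)
        return -1;
    /* Ordering point: the payload must be durable before the commit. */
    if (fdatasync(fd) < 0)
        return -1;
    if (pwrite(fd, &commit, 1, off + (off_t)len) != 1)
        return -1;
    /* The commit record itself must also be made durable. */
    return fdatasync(fd);
}
```

No ordering primitive appears anywhere in this code; the caller gets ordering simply by not issuing the commit write until the earlier write has completed and been flushed.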

Transparent hugepages

At this point, the summit split into three tracks for storage, filesystem, and memory management developers. Your editor followed the memory management track; with luck, we'll eventually have writeups from the other tracks as well.

Andrea Arcangeli presented his transparent hugepages work, starting with a discussion of the advantages of hugepages in general. Hugepages are a feature of many contemporary processors; they allow the memory management subsystem to use larger-than-normal page sizes in parts of the virtual address range. There are a number of advantages to using hugepages in the right places, but it all comes down to performance.

A hugepage takes much of the pressure off the processor's translation lookaside buffer (TLB), speeding memory access. When a TLB miss happens anyway, a 2MB hugepage requires traversing three levels of page table rather than four, saving a memory access and, again, reducing TLB pressure. The result is a doubling of the speed with which initial page faults can be handled, and better application performance in general. There can be some costs, especially when entire hugepages must be cleared or copied; that can wipe out much of the processor's cache. But this cost tends to be overwhelmed by the performance advantages that hugepages bring.
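
The arithmetic behind those claims can be worked through in a few lines. The sketch assumes x86-64 with 48-bit virtual addresses and nine bits of index per page-table level; the 64-entry TLB in the usage note is an invented figure, not a property of any particular processor.

```c
#include <stdint.h>

/* Assumed x86-64 four-level paging parameters. */
enum { VA_BITS = 48, BITS_PER_LEVEL = 9 };

/* Page-table levels walked to translate a page of 2^page_shift bytes. */
unsigned pt_levels(unsigned page_shift)
{
    return (VA_BITS - page_shift) / BITS_PER_LEVEL;
}

/* Bytes of address space covered by `entries` TLB entries of one page size. */
uint64_t tlb_reach(unsigned entries, unsigned page_shift)
{
    return (uint64_t)entries << page_shift;
}
```

With 4KB pages (a shift of 12) the walk takes four levels; with 2MB pages (a shift of 21) it takes three. A hypothetical 64-entry TLB covers 256KB with the former and 128MB with the latter, a factor of 512.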

Those advantages, incidentally, are multiplied when hugepages are used on systems hosting virtualized guests. Using hugepages in this situation can eliminate much of the remaining cost of running virtualized systems.

Hugepages on Linux are currently accessed through the hugetlbfs filesystem, which was discussed in great detail by Mel Gorman on LWN earlier this year. There are some real limitations associated with hugetlbfs, though: hugepages are not swappable, they must be reserved at boot time, there is no mixing of page sizes in the same virtual memory area, etc. Many of these problems could be fixed, but, as Andrea put it, hugetlbfs is becoming a sort of secondary - and inferior - Linux virtual memory subsystem. It is time to turn hugepages into first-class citizens in the real Linux VM.

Transparent hugepages eliminate much of the need for hugetlbfs by automatically grouping together sections of a process's virtual address space into hugepages when warranted. They take away the hassles of hugetlbfs and make it possible for the system to use hugepages with no need for administrator intervention or application changes. There seems to be a fair amount of interest in the feature; a Google developer said that the feature is attractive for internal use.

At the core of the patch is a new thread called khugepaged, which is charged with scanning memory and creating hugepages where it makes sense. Other parts of the VM can split those hugepages back into normally-sized pieces when the need arises. Khugepaged works by allocating a hugepage, then using the migration mechanism to copy the contents of the smaller component pages over. There was some talk of trying to defragment memory and "collapse in place" instead, but it doesn't seem worth the effort at this point. The amount of work to be done would be about the same except in the special case where a hugepage had been split and was being grouped back together before much had changed - a situation which is expected to be relatively rare.

Andrea put up a number of benchmarks showing how transparent hugepages improve performance; the all-important kernel compile benchmark (seen as a sort of worst case for hugepages) is 2.5% faster. Various other benchmarks show bigger improvements.

Transparent hugepages, it seems, will be enabled by default in the RHEL 6 kernel. Andrea would really like to get the feature into 2.6.36, but the merge window is already well advanced and it's not clear that things will work out that way. There is still a need to convince Linus that the feature is worthwhile, and, perhaps, some work to be done to enable the feature on SPARC systems.

mmap_sem

The memory map semaphore (mmap_sem) is a reader-writer semaphore which protects the tree of virtual memory area (VMA) structures describing each address space. It is, Nick Piggin says, one of the last nasty locking issues left in the virtual memory subsystem. Like many busy, global locks, mmap_sem can cause scalability problems through cache line bouncing. In this case, though, simple contention for the lock can be a problem; mmap_sem is often held while disk I/O is being performed. With some workloads, the amount of time that mmap_sem is held can slow things down significantly.

Various groups, including developers at HP and Google, have chipped away at the mmap_sem problem in the past, usually by trying to drop the semaphore in various paths. These patches have all run into the same problem, though: Linus hates them. In particular, he seems to dislike the additional complication added to the retry paths which must be followed when things change while the lock is dropped. So none of this work has gotten into the mainline.

There have also been some unfair rwsem proposals aimed at reducing mmap_sem contention; these have run aground over fears of writer starvation.

According to Nick, the real problem is the red-black tree used to track allocated address space; the data structure is cache-unfriendly and requires a global lock for its protection. His idea is to do away with this rbtree and associate VMAs directly with the page table entries, protecting them with the PTE locks. This approach would eliminate much of the locking entirely, since the page tables must be traversable without locks, and solve the mmap_sem problem.

That said, there are some challenges. A VMA associated with a page table entry can cover a maximum of 2MB of address space; larger areas would have to be split into (possibly a large number of) smaller VMAs. It's not clear how this mechanism would then interact with hugepages. The instantiation of large VMAs would require the creation of the full range of PTEs, which is not required now; that could hurt applications with very sparsely-populated memory areas. Growing VMAs would have its own challenges. There is also the issue of free space allocation, a problem which might be solved by preallocating ranges of addresses to each thread sharing an address space. In summary, the list of obstacles to be overcome before this idea becomes practical looks somewhat daunting.

The developers in the room seemed to not be entirely comfortable with this approach, but nobody could come up with a fundamental reason why it would not work. So we'll probably be seeing patches from Nick exploring this idea in the future.

copyfile()

The reflink() system call was originally proposed as a sort of fast copy operation; it would create a new "copy" of a file which shared all of the data blocks. If one of the files were subsequently written to, a copy-on-write operation would be performed so that the other file would not change. LWN readers last heard about this patch in September 2009, when Linus refused to pull it for 2.6.32. Among other things, he didn't like the name.

So now reflink() is back as copyfile(), with some proposed additional features. It would make the same copy-on-write copies on filesystems that support it, but copyfile() would also be able to delegate the actual copy work to the underlying storage device when it makes sense. For example, if a file is being copied on a network-mounted filesystem, it may well make sense to have the server do the actual copy work, eliminating the need to move the data over the network twice. The system call might also do ordinary copies within the kernel if nothing faster is available.

The first question that was asked is: should copyfile() perhaps be an asynchronous interface? It could return a file descriptor which could be polled for the status of the operation. Then, graphical utilities could start a copy, then present a progress bar showing how things were going. Christoph Hellwig was adamant, though, that copyfile() should be a synchronous operation like almost all other Linux system calls; there is no need to create something weird and different here. Progress bars neither justify nor require the creation of asynchronous interfaces.

There was also opposition to the mixing of the old reflink() idea with that of copying a file. There is little perceived value in creating a bad version of cp within the kernel. The two ideas were mixed because Linus seems to want it that way, but, after this discussion, they may yet be split apart again.

Dirty limits

Jan Kara led a short discussion on the problem of dirty limits. The tuning knob found at /proc/sys/vm/dirty_ratio contains a number representing a percentage of total memory. Any time that the number of dirty pages in the system exceeds that percentage, processes which are actually writing data will be forced to perform some writeback directly. This policy has a couple of useful results: it helps keep memory from becoming entirely filled with dirty pages, and it serves to throttle the processes which are creating dirty pages in the first place.

The default value for dirty_ratio is 20, meaning that 20% of memory can be dirty before processes are conscripted into writeback duties. But that turns out to be too low for a number of applications. In particular, it seems that many Berkeley DB applications exhibit behavior where they dirty a lot of pages all over memory; setting dirty_ratio too low causes a lot of excessive I/O and serious performance issues. For this reason, distributions like RHEL raise this limit to 40% by default.

But 40% is not an ideal number either; it can lead to a lot of wasted memory when the system's workloads are mostly sequential. Lots of dirty pages can also cause fsync() calls to take a very long time, especially with the ext3 filesystem. What's really needed is a way to set this parameter in a more automatic, adaptive way, but exactly how that should be done is not entirely clear.

What is likely to happen in the short term is that a user-space daemon will be written to experiment with various policies for dirty_ratio. Some VM tracepoints can be used to catch events and tune things accordingly. A system which is handling a lot of fsync() calls should probably have a lower value of dirty_ratio, for example. In the absence of reasons to the contrary, the daemon can try to nudge the limit higher and try to see if applications perform better. This kind of heuristic experimentation has its hazards, but there does not seem to be a better method on offer at the moment.
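
A tuning daemon of that sort might look something like the sketch below. Everything here is hypothetical: no such daemon existed at the time of the summit, the step size, bounds, and "busy" threshold are invented, and the threshold calculation follows the simple percentage-of-total-memory description given above, while the kernel's own calculation is more involved.

```c
#include <stdio.h>

/* Read an integer tunable such as /proc/sys/vm/dirty_ratio; -1 on error. */
int read_tunable(const char *path)
{
    int value = -1;
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Approximate count of dirty pages allowed before writers are throttled. */
unsigned long long dirty_threshold_pages(unsigned long long total_pages, int ratio)
{
    return total_pages * (unsigned long long)ratio / 100;
}

/*
 * One step of a hypothetical policy: back off when fsync() traffic is
 * heavy, otherwise nudge the limit upward. All constants are invented.
 */
int next_dirty_ratio(int current, unsigned long fsyncs_per_sec)
{
    const int min_ratio = 5, max_ratio = 40, step = 5;
    const unsigned long busy = 100;

    int next = fsyncs_per_sec > busy ? current - step : current + step;
    if (next < min_ratio)
        next = min_ratio;
    if (next > max_ratio)
        next = max_ratio;
    return next;
}
```

A real daemon would need some measure of application performance to decide whether an upward nudge actually helped; that feedback loop is the part which remains unclear.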

Topology and alignment

There was a brief session on storage device topology issues; unfortunately, it was late in the day and your editor's notes are increasingly fuzzy. Much of the discussion had to do with 4K-sector disks. There are still issues, it seems, with drives which implement strange sector alignments in an attempt to get better performance with some versions of Windows. Linux can cope with these drives, but only if the drives themselves export information on what they are doing. Not all hardware provides that information, unfortunately.

Meanwhile, the amount of software which does make use of the topology information exported through the kernel is increasing. Partitioning tools are getting smarter, and the device mapper now uses this information properly. The readahead code will be tweaked to create properly-aligned requests when possible.

Lightning talks

The last session of the day was dedicated to three lightning talks. The first, by Matthew Wilcox, had to do with merging of git trees. Quite a bit of work in the VM/filesystem/storage area depends on changes made in a number of different trees. Making those trees fit together can be a bit of a challenge. That problem can be solved in linux-next, but those solutions do not necessarily carry over into the mainline, where trees may be pulled in just about any order - or not at all. The result is a lot of work and merge-window scrambling by developers, who are getting a little tired of it.

So, it was asked, is it time for a git tree dedicated to storage as a whole, and a storage maintainer to go with it? The idea was to create something like David Miller's networking tree, which is the merge point for almost all networking-related patches. James Bottomley made the mistake of suggesting that this kind of discussion could not go very far without a volunteer to manage that tree; he was then duly volunteered for the job.

The discussion moved on to how this tree would work, and, in particular, whether its maintainer would become the "overlord of storage," or whether it would just be a more convenient place to work out merge conflicts. If its maintainer is to be a true overlord, a fairly hardline approach will need to be taken with regard to when patches would have to be ready for merging. It's not clear whether the storage community is ready to deal with such a maintainer. So, for the near future, James will run the tree as a merge point to see whether that helps developers get their code into the mainline. If it seems like there is need for a real storage maintainer, that question will be addressed after a development cycle or two.

Dan Magenheimer presented his Cleancache proposal, mostly with an eye toward trying to figure out a way to get it merged. There is still some opposition to it, and its per-filesystem hooks in particular. It's hard to see how those hooks can be avoided, though; Cleancache is not suitable for all filesystems and, thus, may not be a good fit for the VFS layer. The crowd seemed reasonably amenable to merging the patches, but the chief opponent - Christoph Hellwig - was not in the room at the time. So no real conclusions have been reached.

The final lightning talk came from Boaz Harrosh, who talked about "stable pages." Currently, pages which are under writeback can be modified by filesystem code. That's potentially a data integrity problem, and it can be fatal in situations where, for example, checksums of page contents are being made. That is why the RAID5 code must copy all pages being written to an array; changing data would break the RAID5 checksums. What, asked Boaz, would break if the ability to change pages under writeback were withdrawn?

The answer seems to be that nothing would break, but that some filesystems might suffer performance impacts. The only way to find out for sure, though, is to try it. As it happens, this is a relatively easy experiment to run, so filesystem developers will probably start playing with it sometime soon.
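The hazard with unstable pages can be shown with a small Python sketch. This is not kernel code: CRC32 from the standard zlib module stands in for whatever integrity code the storage stack computes, and the function names are invented for illustration.

```python
import zlib

PAGE_SIZE = 4096

def write_page_unstable(page: bytearray, modify_during_io: bool):
    """Checksum a page, optionally modify it mid-flight, then "write" it.

    Returns (data, checksum) as the device would end up seeing them.
    """
    checksum = zlib.crc32(page)       # checksum taken when the write is submitted
    if modify_during_io:
        page[0] ^= 0xFF               # page changes while it is under writeback
    return bytes(page), checksum      # data is transferred after the change

def write_page_copied(page: bytearray, modify_during_io: bool):
    """Same flow, but snapshot the page first, as the text says RAID5 must."""
    snapshot = bytes(page)
    checksum = zlib.crc32(snapshot)
    if modify_during_io:
        page[0] ^= 0xFF               # the change no longer affects this write
    return snapshot, checksum

def is_consistent(data: bytes, checksum: int) -> bool:
    """True if the stored data still matches the stored checksum."""
    return zlib.crc32(data) == checksum
```

Copying keeps data and checksum in agreement at the cost of an extra copy per page; stable pages would get the same guarantee by forbidding the modification instead.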

That was the end of the first day of the summit; reports from the second day will be posted as soon as they are ready.

The 2010 Linux Storage and Filesystem Summit, day 2

By Jonathan Corbet
August 10, 2010
The second day of the 2010 Linux Storage and Filesystem Summit was held on August 9 in Boston. Those who have not yet read the coverage from day 1 may want to start there. This day's topics were, in general, more detailed and technical and less amenable to summarization here. Nonetheless, your editor will try his best.

Writeback

The first session of the day was dedicated to the writeback issue. Writeback, of course, is the process of writing modified pages of files back to persistent store. There have been numerous complaints over recent years that writeback performance in Linux has regressed; the curious reader can refer to this article for some details, or this bugzilla entry for many, many details. The discussion was less focused on this specific problem, though; instead, the developers considered the problems with writeback as a whole.

Sorin Faibish started with a discussion of some research that he has done in this area. The challenges for writeback are familiar to those who have been watching the industry; the size of our systems - in terms of both memory and storage - has increased, but the speed of those systems has not increased proportionally. As a result, writing back a given percentage of a system's pages takes longer than it once did. It has become ever easier for the writeback system to fall behind processes which are dirtying pages, leading to poor performance.

His assertion is that the use of watermarks to control writeback is no longer appropriate for contemporary systems. Writeback should not wait until a certain percentage of memory is dirty; it should start sooner, and, crucially, be tied to the rate with which processes are dirtying pages. The system, he says, should work much more aggressively to ensure that the writeback rate matches the dirty rate.
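The difference between the two approaches can be illustrated with a toy simulation in Python. It is only a sketch of the idea as described here, not of any posted patch; the policy names, rates, and the 10% watermark are all hypothetical.

```python
def peak_dirty(policy, dirty_rate, device_rate, ticks,
               total_pages=10_000, watermark=0.10):
    """Return the largest number of dirty pages seen during the run.

    All rates are in pages per tick; every number here is hypothetical.
    """
    dirty = 0
    peak = 0
    for _ in range(ticks):
        dirty += dirty_rate                  # processes dirty some pages
        peak = max(peak, dirty)
        if policy == "watermark":
            # Stay idle until the dirty fraction exceeds the threshold,
            # then flush as fast as the device allows.
            over = dirty > watermark * total_pages
            flush = device_rate if over else 0
        elif policy == "rate":
            # Start immediately and aim to write as fast as pages are dirtied.
            flush = min(device_rate, dirty_rate)
        else:
            raise ValueError(f"unknown policy: {policy}")
        dirty -= min(flush, dirty)
    return peak
```

With a device that can keep up (say, 50 pages dirtied and up to 80 written per tick), the rate-matched policy never holds more than one tick's worth of dirty pages, while the watermark policy lets the backlog build past the threshold before doing anything.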

From there, the discussion wandered through a number of specific issues. Linux writeback now works by flushing out pages belonging to a specific file (inode) at a time, with the hope that those pages will be located nearby on the disk. The memory management code will normally ask the filesystem to flush out up to 4MB of data for each inode. One poorly-kept secret of Linux memory management is that filesystems routinely ignore that request - they typically flush far more data than requested if there are that many dirty pages. It's only by generating much larger I/O requests that they can get the best performance.

Ted Ts'o wondered if blindly increasing writeback size is the best thing to do. 4MB is clearly too small for most drives, but it may well be too large for a filesystem located on a slow USB drive. Flushing large amounts of data to such a filesystem can stall any other I/O to that device for quite some time. From this discussion came the idea that writeback should not be based on specific amounts of data, but, instead, should be time-based. Essentially, the backing device should be time-shared between competing interests in a way similar to how the CPU is shared.

James Bottomley asked if this idea made sense - is it ever right to cut off I/O to an inode which still has contiguous, dirty pages to write? The answer seems to be "yes." Consider a process which is copying a large file - a DVD image or something even larger. Writeback might not catch up with such a process until the copy is done, which may not be for a long time into the future; meanwhile, all other users of that device will be starved. That is bad for interactivity, and it can cause long delays before other files are flushed to disk. Also, the incremental performance benefit of extending large I/O operations tends to drop off over time. So, in the end, it's necessary to switch to another inode at some point, and making the change based on wall-clock time seems to be the most promising approach.
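What time-sharing the backing device might look like can be sketched as a round-robin over inodes with a fixed time slice. The following Python is an illustration only; the function name, the slice length, and the device throughput are assumptions, not anything proposed at the summit.

```python
from collections import deque

def time_sliced_writeback(dirty_pages, slice_ms, pages_per_ms):
    """Flush inodes round-robin, giving each at most slice_ms per turn.

    dirty_pages maps an inode name to its count of dirty pages.
    pages_per_ms is the (assumed) device throughput.
    Returns a dict mapping each inode to the time at which it became clean.
    """
    queue = deque(dirty_pages.items())
    budget = max(1, int(slice_ms * pages_per_ms))   # pages that fit in one slice
    clock_ms = 0.0
    completion = {}
    while queue:
        name, remaining = queue.popleft()
        written = min(remaining, budget)
        clock_ms += written / pages_per_ms
        remaining -= written
        if remaining:
            queue.append((name, remaining))         # go to the back of the line
        else:
            completion[name] = clock_ms
    return completion
```

With a large copy of 100,000 dirty pages queued ahead of a 10-page file, a 100ms slice on a device writing 25 pages per millisecond gets the small file onto disk after roughly 100ms rather than after the whole copy (about four seconds). The same slice on a slower device simply writes fewer pages per turn, which is the point of budgeting by time rather than by size.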

Boaz Harrosh raised the idea of moving the I/O scheduler's intelligence up to the virtual memory management level. Then, perhaps, application priorities could be used to give interactive processes privileged access to I/O bandwidth. Ted, instead, suggested that there may be value in allowing the assignment of priorities to individual file descriptors. It's fairly common for an application to have files it really cares about, and others (log files, say) which matter less. The problem with all of these ideas, according to Christoph Hellwig, is that the kernel has far too many I/O submission paths. The block layer is the only place where all of those I/O operations come together into a single place, so it's the only place where any sort of reasonable I/O control can be applied. A lot of fancy schemes are hard to implement at that level, so, even if descriptor-based priorities are a good idea (not everybody was convinced), it's not something that can readily be done now. Unifying the I/O submission paths was seen as a good idea, but it's not something for the near future.

Jan Kara asked how results can be measured, and against what requirements they will be judged. Without that information, it is hard to know if any changes have had good effects or not. There are trivial cases, of course - changes which slow down kernel compiles tend to be caught early on. But, in general, we have no way to measure how well we are doing with writeback. So, in the end, the first action item is likely to be an attempt to set down the requirements and to develop some good test cases. Once it's possible to decide whether patches make sense, there will probably be an implementation of some sort of time-based writeback mechanism.

Solid-state storage devices

There were two sessions on solid-state storage devices (SSDs) at the summit; your editor was able to attend only the first. The situation which was described there is one we have been hearing about for a couple of years at least. These devices are getting faster: they are heading toward a point where they can perform one million I/O operations per second. That said, they still exhibit significant latency on operations (though much less than rotating drives do), so the only way to get that kind of operation count is to run a lot of operations in parallel. "A lot" in this case means having something like 100 operations in flight at any given time.

Current SSDs work reasonably well with Linux, but there are certainly some problems. There is far too much overhead in the ATA and SCSI layers; at that kind of operation rate, microseconds hurt. The block layer's request queues are becoming a bottleneck; it's currently only possible to have about 32 concurrent operations outstanding on a device. The system needs to be able to distribute I/O completion work across multiple CPUs, preferably using smart controllers which can direct each completion interrupt to the CPU which initiated a specific operation in the first place.
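The relationship between concurrency, latency, and operation rate follows Little's law (operations in flight = throughput × latency). The Python below works through the figures quoted above; the 100µs latency is not stated in the talk but is what those two figures imply, and the result assumes a steady state with uniform latency.

```python
def implied_latency_us(iops, in_flight):
    """Average per-operation latency implied by Little's law, in microseconds."""
    return in_flight * 1_000_000 / iops

def achievable_iops(in_flight, latency_us):
    """Upper bound on operations per second at a given concurrency."""
    return in_flight * 1_000_000 / latency_us

latency = implied_latency_us(iops=1_000_000, in_flight=100)
print(f"implied latency: {latency:.0f} us")
print(f"ceiling at 32 in flight: {achievable_iops(32, latency):,.0f} IOPS")
```

Under those assumptions, a limit of 32 outstanding operations would cap a device at roughly a third of the million-operation target, which is why the request queue depth shows up as a bottleneck.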

For "storage-attached" SSDs (those which look like traditional disks), there are not a lot of problems at the filesystem level; things work pretty well. Once one gets into bus-attached devices which do not look like disks, though, the situation changes. One participant asserted that, on such devices, the ext4 filesystem could not be expected to get reasonable performance without a significant redesign. There is just too much to do in parallel.

Ric Wheeler questioned the claim that SSDs are bringing a new challenge for the storage subsystem. Very high-end enterprise storage arrays have achieved this kind of I/O rate for some years now. One thing those arrays do is present multiple devices to the system, naturally helping with parallelism; perhaps SSDs could be logically partitioned in the same way.

Resizing guest memory

A change of pace was had in the memory management track, where Rik van Riel talked about the challenges involved in resizing the memory available to virtualized guests. There are four different techniques in use currently:

  • Memory hotplug by way of simulated hardware hotplug events. This mechanism works well for adding memory to guests, but it cannot really be used to take memory back. Hot remove simply does not work well, because there's always some sort of non-movable allocation which ends up in the space which would be removed.

  • Ballooning, wherein a special driver in the guest allocates pages and retires them from use, essentially handing them back to the host. Memory can be fed back into the guest by having the balloon driver free the pages it has allocated. This mechanism is simple, if somewhat slow, but simple management policies are scarce.

  • Transcendent memory techniques like cleancache and frontswap, which can be used to adjust memory availability between virtual guests.

  • Page hinting, whereby guests mark pages which can be discarded by the host. These pages may be on the guest's free list, or they may simply be clean pages. Should the guest try to access such a page after the host has thrown it away, that guest will receive a special page fault telling it that it needs to allocate the page anew. Hinting techniques tend to bring a lot of complexity with them.
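Of the four techniques, ballooning is the easiest to model. The following Python toy captures only the bookkeeping described above; the class and method names are invented, and a real balloon driver would put the guest under memory pressure rather than simply refusing to take pages that are in use.

```python
class Guest:
    """Toy model of a guest whose memory is adjusted with a balloon."""

    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.used_pages = 0        # pages the guest's own workload holds
        self.balloon_pages = 0     # pages pinned by the balloon driver

    @property
    def available_pages(self):
        """Pages the guest can still allocate for itself."""
        return self.total_pages - self.used_pages - self.balloon_pages

    def inflate(self, pages):
        """Balloon allocates pages in the guest, handing them to the host."""
        granted = min(pages, self.available_pages)
        self.balloon_pages += granted
        return granted

    def deflate(self, pages):
        """Balloon frees pages, giving memory back to the guest."""
        released = min(pages, self.balloon_pages)
        self.balloon_pages -= released
        return released
```

The mechanism itself is only a pair of counters; the hard part, as noted above, is the policy that decides how far to inflate each guest's balloon.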

The real question of interest in this session seemed to be the "rightsizing" of guests - giving each guest just enough memory to optimize the performance of the system as a whole. Google is also interested in this problem, though it is using cgroup-based containers instead of full virtualization. It comes down to figuring out what a process's minimal working set size is - a problem which has resisted attempts at solution for decades.

Mel Gorman proposed one approach to determine a guest's working set size. Place that guest under memory pressure, slowly shrinking its available memory over time. There will come a point where the kernel starts scanning for reclaimable pages, and, as the pressure grows, a point where the process starts paging in pages which it had previously used. That latter point could be deemed to be the place where the available memory had fallen below the working set size. It was also suggested that page reactivations - when pages are recovered from the inactive list and placed back into active use - could be the metric by which the optimal size is determined.

Nick Piggin was skeptical of such schemes, though. He gave the example of two processes, one of which is repeatedly working through a 1GB file, while the other is working through a 1TB file. If both processes currently have 512MB of memory available, they will both be doing significant amounts of paging. Adjusting the memory size will not change that behavior, leading to the conclusion that there's not much to be done - until the process with the smaller file gets 1GB of memory to work with. At that point, its paging will stop. The process working with the larger file will never reach that point, though, at least on contemporary systems. So, even though both processes are paging at the same rate, the initial 512MB memory size is too small for one process, but is just fine for the other.
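Both the shrink-until-it-hurts probe and Nick's counterexample can be reproduced with a small LRU simulation. The Python below is a sketch under simplifying assumptions: strict LRU replacement (the kernel's reclaim is more elaborate), a purely cyclic access pattern, and a scale where one "page" stands for 16MB so that the 1TB case stays cheap to simulate. The function names are invented.

```python
from collections import OrderedDict

def refault_rate(file_pages, mem_pages, passes=3):
    """Fraction of accesses, after the first pass, that miss an LRU cache."""
    lru = OrderedDict()
    misses = 0
    accesses = 0
    for p in range(passes):
        for page in range(file_pages):       # cyclic scan through the file
            hit = page in lru
            if hit:
                lru.move_to_end(page)
            else:
                lru[page] = True
                if len(lru) > mem_pages:
                    lru.popitem(last=False)  # evict the least recently used
            if p > 0:                        # the first pass is all cold misses
                accesses += 1
                misses += not hit
    return misses / accesses

def probe_working_set(file_pages, start_mem):
    """Shrink memory until refaults appear; return the last size without any.

    Returns None if the workload is already refaulting at start_mem.
    """
    if refault_rate(file_pages, start_mem) > 0:
        return None
    mem = start_mem
    while mem > 1 and refault_rate(file_pages, mem - 1) == 0:
        mem -= 1
    return mem

SMALL, LARGE = 64, 65536      # 1GB and 1TB files at 16MB per simulated page
HALF_GB, ONE_GB = 32, 64      # 512MB and 1GB of memory at the same scale
```

At 512MB, both workloads miss on every access, so the refault rate cannot tell them apart. Raising the limit to 1GB drops the small workload's rate to zero and leaves the large one unchanged. A probe that starts above the working set finds the cliff for the small file, but one that starts below it (as with the large file here) has nothing to find.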

The fact that the problem is hard has not stopped developers from trying to improve the situation, though, so we are likely to see attempts made at dynamically resizing guests in an attempt to work out their optimal sizes.

I/O bandwidth controllers

Vivek Goyal led a brief session on the I/O bandwidth controller problem. Part of that problem has been solved - there is now a proportional-weight bandwidth controller in the mainline kernel. This controller works well for single-spindle drives, perhaps a bit less so with large arrays. With larger systems, the single dispatch queue in the CFQ scheduler becomes a bottleneck. Vivek has been working on a set of patches to improve that situation for a little while now.

The real challenge, though, is the desired maximum bandwidth controller. The proportional controller which is there now will happily let a process consume massive amounts of bandwidth in the absence of contention. In most cases, that's the right result, but there are hosting providers out there who want to be able to keep their customers within the bandwidth limits they have paid for. The problem here is figuring out where to implement this feature. Doing it at the I/O scheduler level doesn't work well when there are devices stacked higher in the chain.
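The difference between the two kinds of controller can be put in a few lines of Python. This is a toy model, not the interface of the in-kernel controller; the group names, weights, and bandwidth figures are all hypothetical.

```python
def proportional_share(capacity, weights, busy):
    """Split capacity among busy groups in proportion to their weights.

    Idle groups get nothing, and their share goes to whoever is busy.
    """
    total = sum(weights[g] for g in busy)
    return {g: (capacity * weights[g] / total if g in busy else 0.0)
            for g in weights}

def apply_max_cap(allocation, caps):
    """Clamp each group's bandwidth to its purchased maximum, if it has one."""
    return {g: min(bw, caps.get(g, bw)) for g, bw in allocation.items()}

weights = {"customer_a": 100, "customer_b": 300}
contended = proportional_share(200, weights, busy={"customer_a", "customer_b"})
alone = proportional_share(200, weights, busy={"customer_a"})
capped = apply_max_cap(alone, caps={"customer_a": 50})
```

Under contention, customer_a gets a quarter of the device; alone, it gets all of it, and only the explicit cap brings it back to what it paid for. Note that the capped case deliberately leaves bandwidth unused, which is why it needs a different mechanism from proportional sharing.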

One suggestion is to create a special device mapper target which would do maximum bandwidth throttling. There was some resistance to that idea, partly because some people would rather avoid the device mapper altogether, but also due to practical problems like the inability of current Linux kernels to insert a DM-based controller into the stack for an already-mounted disk. So we may see an attempt to add this feature at the request queue level, or we may see a new hook allowing a block I/O stream to be rerouted through a new module on the fly.

The other feature which is high on the list is support for controlling buffered I/O bandwidth. Buffered I/O is hard; by the time an I/O request has made it to the block subsystem, it has been effectively detached from the originating process. Getting around that requires adding some new page-level accounting, which is not a lightweight solution.

Reclaim topics

Back in the memory management track, a number of reclaim-oriented topics were covered briefly. The first of these is per-cgroup reclaim. Control groups can be used now to limit total memory use, so reclaim of anonymous and page-cache pages works just fine. What is missing, though, is the sort of lower-level reclaim used by the kernel to recover memory: shrinking of slab caches, trimming the inode cache, etc. A cgroup can consume considerable resources with this kind of structure, and there is currently no mechanism for putting a lid on such usage.

Zone-based reclaim would also be nice; that is evidently covered in the VFS scalability patch set, and may be pushed toward the mainline as a standalone patch.

Reclaim of smaller structures is a problem which came up a few times this afternoon. These structures are reclaimed individually, but the virtual memory subsystem is really only concerned with the reclaim of full pages. So reclaiming individual inodes (or dentries, or whatever) may just serve to lose useful cached information and increase fragmentation without actually freeing any memory for the rest of the system. So it might be nice to change the reclaim of structures like dentries to be more page-focused, so that useful chunks of memory can be returned to the system.

The ability to move these structures around in memory, freeing pages through defragmentation, would also be useful. That is a hard problem, though, which will not be amenable to a quick solution.

There is an interesting problem with inode reclaim: cleaning up an inode also clears all related page cache pages out of the system. There can be times when that's not what's really called for. It can free vast amounts of memory when only small amounts are needed, and it can deprive the system of cached data which will just need to be read in again in the near future. So there may be an attempt to change how inode reclaim works sometime soon.

There are some difficulties with how the page allocator works on larger systems; free memory can go well below the low watermark before the system notices. That is the result of how the per-CPU queues work; as the number of processors grows, the accounting of the size of those queues gets fuzzier. So there was talk of sending inter-processor interrupts on occasion to get a better count, but that is a very expensive solution. Better, perhaps, is just to iterate over the per-CPU data structures and take the locking overhead.
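The way per-CPU batching blurs a global count can be shown with a simplified model. The Python below is loosely patterned on the batched-counter idea and is not the kernel's implementation; the class, the batch size of 32, and the CPU count are assumptions chosen for illustration.

```python
class PerCPUCounter:
    """A global counter with per-CPU deltas that are folded in lazily."""

    def __init__(self, ncpus, batch, initial=0):
        self.global_count = initial
        self.local = [0] * ncpus     # per-CPU deltas not yet folded in
        self.batch = batch

    def add(self, cpu, delta):
        """Update the CPU-local delta; fold it in once it reaches the batch."""
        self.local[cpu] += delta
        if abs(self.local[cpu]) >= self.batch:
            self.global_count += self.local[cpu]
            self.local[cpu] = 0

    def read_fast(self):
        """Cheap read: ignores whatever is still sitting in per-CPU deltas."""
        return self.global_count

    def read_exact(self):
        """Expensive read: walks every CPU's delta."""
        return self.global_count + sum(self.local)

    def max_error(self):
        """Worst-case gap between read_fast() and read_exact()."""
        return len(self.local) * (self.batch - 1)

free_pages = PerCPUCounter(ncpus=64, batch=32, initial=10_000)
for cpu in range(64):
    free_pages.add(cpu, -31)         # each CPU allocates just under a batch
```

In this run the cheap read still reports 10,000 free pages while the true figure is 8,016. The worst-case error grows linearly with the number of CPUs, which matches the observation that the accounting gets fuzzier on larger systems; the exact read corresponds to the "iterate over the per-CPU data structures" option.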

Slab allocators

Christoph Lameter ran a discussion on slab allocators, talking about the three allocators which are currently in the kernel and the attempts which are being made to unify them. This is a contentious topic, but there was a relative lack of contentious people in the room, so the discussion was subdued. What happens will really depend on what patches Christoph posts in the future.

O_DIRECT

A brief session touched on a few problems associated with direct I/O. The first of these is an obscure race between get_user_pages() (which pins user-space pages in memory so they can be used for I/O) and the fork() system call. In some cases, a fork() while the pages are mapped can corrupt the system. A number of fixes have been posted, but they have not gotten past Linus. The real fix will involve fixing all get_user_pages() callers and (the real point of contention) slowing down fork(). The race is a real problem, so some sort of solution will need to find its way into the mainline.

Why, it was asked, do applications use direct I/O instead of just mapping file pages into their address space? The answer is that these applications know what they want to do with the hardware and do not want the virtual memory system getting in the way. This is generally seen as a valid requirement.

There is some desire for the ability to do direct I/O from the virtual memory subsystem itself. This feature could be used to support, for example, swapping over NFS in a safe way. Expect patches in the near future.

Finally, there is a problem with direct I/O to transparent hugepages. The kernel will go through and call get_user_pages_fast() for each 4K subpage, but that is unnecessary. So 512 mapping calls are being made when one would do. Some kind of fix will eventually need to be made so that this kind of I/O can be done more efficiently.
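The figure of 512 falls out of simple arithmetic, assuming a 2MB huge page (the usual size on x86-64; the text gives only the 4K subpage size). A short Python check, with a hypothetical helper name:

```python
def pin_calls(io_bytes, page_bytes):
    """Number of per-page pin calls needed to cover io_bytes, rounded up."""
    return -(-io_bytes // page_bytes)

HUGE = 2 * 1024 * 1024    # assumed huge page size: 2MB
BASE = 4 * 1024           # subpage size from the text: 4K

per_subpage = pin_calls(HUGE, BASE)   # one call for each 4K subpage
per_hugepage = pin_calls(HUGE, HUGE)  # one call for the whole huge page
```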

Lightning talks

Once again, the day ended with lightning talk topics. Matthew Wilcox started by asking developers to work at changing more uninterruptible waits into "killable" waits. The difference is that uninterruptible waits can, if they wait for a long time, create unkillable processes. System administrators don't like such processes; "kill -9" should really work at all times.

The problem is that making this change is often not straightforward; it turns a function call which cannot fail into one which can be interrupted. That means that, for each change, a new error path must be added which properly unwinds any work which had been done so far. That is typically not a simple change, especially for somebody who does not intimately understand the code in question, so it's not the kind of job that one person can just take care of.

It was suggested that iSCSI drives - which can cause long delays if they fall off the net - are a good way of testing this kind of code. From there, the discussion wandered into the right way of dealing with the problems which result when network-attached drives disappear. They can often hang the system for long periods of time, which is unfortunate. Even worse, they can sometimes reappear as the same drive after buffers have been dropped, leading to data corruption. The solution to all of this is faster and better recovery when devices disappear, especially once it becomes clear that they will not be coming back anytime soon. Additionally, should one of those devices reappear after the system has given up on it, the storage layer should take care that it shows up as a totally new device. Work will be done to this end in the near future.

Mike Rubin talked a bit about how things are done at Google. There are currently about 25 kernel engineers working there, but few of them are senior-level developers. That, it was suggested, explains some of the things that Google has tried to do in the kernel.

There are two fundamental types of workload at Google. "Shared" workloads work like classic mainframe batch jobs, contending for resources while the system tries to isolate them from each other. "Dedicated" workloads are the ones which actually make money for Google - indexing, searching, and such - and are very sensitive to performance degradation. In general, any new kernel which shows a 1% or higher performance regression is deemed to not be good enough.

The workloads exhibit a lot of big, sequential writes and smaller, random reads. Disk I/O latencies matter a lot for dedicated workloads; 15ms latencies can cause phone calls to the development group. The systems are typically doing direct I/O on not-too-huge files, with logging happening on the side. The disk is shared between jobs, with the I/O bandwidth controller used to arbitrate between them.

Why is direct I/O used? It's a decision which dates back to the 2.2 days, when buffered I/O worked less well than it does now. Things have gotten better, but, meanwhile, Google has moved much of its buffer cache management into user space. It works much like enterprise database systems do, and, chances are, that will not change in the near future.

Google uses the "fake NUMA" feature to partition system memory into 128MB chunks. These chunks are assigned to jobs, which are managed by control groups. The intent is to firmly isolate all of these jobs, but writeback still can cause interference between them.

Why, it was asked, does Google not use xfs? Currently, Mike said, they are using ext2 everywhere, and "it sucks." On the other hand, ext4 has turned out to be everything they had hoped for. It's simple to use, and the migration from ext2 is straightforward. Given that, they feel no need to go to a more exotic filesystem.

Mark Fasheh talked briefly about "cluster convergence," which really means sharing of code between the two cluster filesystems (GFS2 and OCFS2) in the mainline kernel. It turns out that there is a surprising amount of sharing happening at this point, with the lock manager, management tools, and more being common to both. The biggest difference between the two, at this point, is the on-disk format.

The cluster filesystems are in a bit of a tough place. Neither has a huge group dedicated to its development, and, as Ric Wheeler pointed out, there just isn't much of a hobbyist community equipped with enterprise-level storage arrays out there. So these two projects have struggled to keep up with the proprietary alternatives. Combining them into a single cluster filesystem looks like a good alternative to everybody involved. Practical and political difficulties could keep that from happening for some years, though.

There was a brief discussion about the DMAPI specification, which describes an API to be used to control hierarchical storage managers. What little support exists in the kernel for this API is going away, leaving companies with HSM offerings out in the cold. There are a number of problems with DMAPI, starting with the fact that it fails badly in the presence of namespaces. The API can't be fixed without breaking a range of proprietary applications. So it's not clear what the way forward will be.

Closing

[Group photo] The summit was widely seen as a successful event, and the participation of the memory management community was welcomed. So there will be a joint summit again for storage, filesystem, and memory management developers next year. It could happen as soon as early 2011; the participants would like to move the event back to the (northern) spring, and waiting for 18 months for the next gathering seemed like too long.

Page editor: Jonathan Corbet

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds