The 2.6.36 merge window is still open, so no development kernel
release is available yet. See the article below for a summary of the
merges made in the last week.
Four stable kernel updates were released on August 10.
I don't think the situation is in fact deteriorating. We're
shipping decent releases, growing our user base, within and without
the kernel developer community, and still have plenty of major
feature areas to work on. We have not seen regressive LKML
obstructions, though admittedly that is a low standard when it
comes to serving the community.
-- SystemTap maintainer Frank Eigler
If my corporate overlords told me I had to use my Exchange
"messaging" account for external email communication, they would
get a quite clear 'no' in response. My response may also contain
suggestions that they use certain other objects for purposes for
which they were not designed.
Seriously, just use an external email account and ignore the broken
corporate policy. 'Policy' is just a euphemism for not having to
think for yourself.
-- David Woodhouse
Kernel development news
As of this writing, some 6700 non-merge changesets have been accepted for
the 2.6.36 development cycle. These changes bring a lot of fixes and a
number of new features, some of which have been in the works for some
time. The most interesting changes merged since last week's summary follow.
User-visible changes include:
- The ext3 filesystem, once again, defaults to the (safer) "ordered"
mode at mount time. This reverses the change (to "writeback" mode)
made in 2009, which was typically overridden by distributions.
- The out-of-memory killer has
been rewritten. The practical result is that the system may
choose different processes to kill in out-of-memory situations, and
the user-space API for adjusting how attractive processes appear to
the OOM killer has changed.
- The fanotify mechanism
has been merged. Fanotify allows a user-space daemon to obtain
notification of file operations and, perhaps, block access to specific
files. It is intended for use with malware scanning applications, but
there are other potential uses (hierarchical storage management, for
example) as well.
- There is a new system call for working with resource limits:
int prlimit64(pid_t pid, unsigned int resource,
const struct rlimit64 *new_rlim, struct rlimit64 *old_rlim);
It is meant to (someday) replace setrlimit(); the differences
include the ability to modify limits belonging to other processes and
the ability to query and set a limit in a single operation. (A usage
sketch appears below, after the list of new drivers.)
- The TTY driver has gained support for the EXTPROC mode
supported by BSD for the last 20 years or so. This option was
originally developed to
facilitate telnet's "linemode", but it is useful for contemporary
protocols as well.
- New drivers:
- Processors and systems: Ingenic JZ4740 SOC systems,
Trapeze ITS GPR boards,
ifm PDM360NG boards,
Freescale P1022DS reference boards,
TQM mcp8xx-based boards,
TI TNETV107X-based systems,
NVIDIA Tegra-based systems, and
Tilera TILEPro and TILE64 processors (a whole new architecture).
Block devices: QLogic ISP82XX host adaptors,
AppliedMicro 460EX processor on-chip SATA controllers,
Samsung S3C/S5P board PATA controllers, and
Moorestown NAND Flash controllers.
Audio and video: EasyCAP USB video adapters,
Softlogic 6x10 MPEG codec cards,
Winbond/Nuvoton NUC900-based audio controllers,
Cirrus Logic CS42L51 codecs,
Cirrus Logic EP93xx series audio devices,
Marvell Kirkwood I2S audio devices,
Ingenic JZ4740-based audio devices,
SmartQ board audio devices,
Wolfson Micro WM8741 codecs, and
Samsung S5P FIMC video postprocessors.
Miscellaneous: Silicon Image sil164 TMDS transmitters,
TI DSP bridge devices,
PCILynx TSB12LV21/A/B controllers (as a FireWire sniffer; the
user-space side has also been added under tools/firewire),
Bosch Sensortec BMP085 digital pressure sensors,
ROHM BH1780GLI ambient light sensors,
Honeywell HMC6352 compasses,
Summit Microelectronics SMM665 six-channel active DC output
controllers,
JEDEC JC 42.4 compliant temperature sensors,
Intel Topcliff PCH DMA controllers,
Intel Moorestown DMAC1 and DMAC2 controllers,
Intel Moorestown MAX3110 and MAX3107 UARTs,
Intel Medfield UARTs,
Quatech SSU-100 USB serial ports, and
ARM Primecell SP805 watchdog timers.
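As a rough illustration of the prlimit64() call described in the
resource-limit item earlier in this list, here is a minimal user-space
sketch. It is an assumption-laden example: it supplies its own 64-bit
limit structure and invokes the system call directly via syscall(),
since C libraries provide no wrapper at this point, and the
RLIMIT_NOFILE values used are arbitrary.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/syscall.h>
    #include <sys/resource.h>

    /* Mirrors the kernel's struct rlimit64; defined locally to avoid
       depending on any particular C library version. */
    struct my_rlimit64 {
        uint64_t rlim_cur;
        uint64_t rlim_max;
    };

    int main(int argc, char **argv)
    {
        pid_t target = (argc > 1) ? atoi(argv[1]) : getpid();
        struct my_rlimit64 new_lim = { .rlim_cur = 4096, .rlim_max = 4096 };
        struct my_rlimit64 old_lim;

        /* Set RLIMIT_NOFILE for the target process and fetch the previous
           value in the same call; __NR_prlimit64 must be provided by the
           installed kernel headers. */
        if (syscall(__NR_prlimit64, target, RLIMIT_NOFILE,
                    &new_lim, &old_lim) != 0) {
            perror("prlimit64");
            return 1;
        }
        printf("previous RLIMIT_NOFILE: %llu soft, %llu hard\n",
               (unsigned long long)old_lim.rlim_cur,
               (unsigned long long)old_lim.rlim_max);
        return 0;
    }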
Changes visible to kernel developers include:
- The SCSI layer now supports runtime power management, but almost no
work has been done (yet) to push that support down into individual
drivers.
- The MIPS architecture now has kprobes support.
- The KGDB debugger is now supported on the Microblaze architecture.
- There are a few new build-time configuration commands:
listnewconfig outputs a list of new configuration options,
oldnoconfig sets all new configuration options to "no",
alldefconfig sets all options to their default values, and
savedefconfig writes a minimal configuration file in defconfig
format. (The patch adding the first two options above also introduces
a new Whatevered-by: patch tag, with unknown semantics.)
- There is a new scripts/coccinelle directory containing a
number of Coccinelle
"semantic patches" which perform various useful checks. They can be
run with "make coccicheck".
- The kmemtrace ftrace plugin is gone; "perf kmem" should be used
instead. The ksym plugin has also been superseded by perf and has,
thus, been removed.
- There is a new function for short, blocking delays:
void usleep_range(unsigned long min, unsigned long max);
This function will sleep (uninterruptibly) for a period between
min and max microseconds. It is based on hrtimers, so the timing
will be more precise than that obtained with msleep(). (A brief
driver sketch appears after this list.)
- The new IRQF_NO_SUSPEND flag for request_irq() will cause
the interrupt line not to be disabled during suspend;
IRQF_TIMER can no longer be (mis)used for this purpose. (See the
sketch after this list.)
- The concurrency-managed
workqueues patch set has been merged, completely changing the way
workqueues are implemented. One immediate user-visible result will be
that there should be far fewer kernel threads running on most systems.
All users of the "slow work" API have been converted to
concurrency-managed workqueues, so the slow work mechanism has been
removed from the kernel.
- The cpuidle mechanism has been enhanced to allow the set of
available idle states to change over time.
- The Blackfin architecture has gained dynamic ftrace support.
- There is a new super_operations method called
evict_inode(); it handles all of the necessary work when an
in-core inode is being removed. It should be used instead of
clear_inode() and delete_inode().
- The inotify mechanism has been removed from inside the kernel; the
fsnotify mechanism must be used instead. (Of course, the user-space
inotify interface is still supported).
- The Video4Linux2 layer has gained a new framework which simplifies the
handling of controls; see this
commit and Documentation/video4linux/v4l2-controls.txt for details.
- The open() and release() functions in struct
block_device_operations are now called without the big kernel
lock held. Additionally, the locked_ioctl() function has
gone away; all block drivers must now implement their own locking as
needed.
- The domain name resolution code has been pulled out of the CIFS
filesystem and made generic. It works by using the key mechanism to
request DNS resolution from user space; see
Documentation/networking/dns-resolver.txt for details.
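To make two of the driver-visible additions above more concrete, here
are a couple of small sketches. First, a hypothetical fragment using
usleep_range(); the polled register, "ready" bit, and delay bounds are
invented for illustration only.

    #include <linux/delay.h>
    #include <linux/errno.h>
    #include <linux/io.h>

    /* Poll a (hypothetical) memory-mapped status register until a "ready"
       bit appears.  usleep_range() sleeps, so this may only be called from
       process context; the 100-200 microsecond window is arbitrary. */
    static int mydev_wait_ready(void __iomem *status_reg)
    {
        int tries = 50;

        while (!(readl(status_reg) & 0x1)) {
            if (--tries == 0)
                return -ETIMEDOUT;
            usleep_range(100, 200);
        }
        return 0;
    }

Second, a sketch of IRQF_NO_SUSPEND usage; the handler and device
names are, again, placeholders rather than code from any real driver.

    #include <linux/interrupt.h>

    static irqreturn_t mydev_irq(int irq, void *data)
    {
        /* acknowledge and handle the device interrupt here */
        return IRQ_HANDLED;
    }

    static int mydev_setup_irq(int irq, void *dev)
    {
        /* The interrupt stays enabled across suspend; drivers previously
           (mis)used IRQF_TIMER to get this behavior. */
        return request_irq(irq, mydev_irq, IRQF_NO_SUSPEND, "mydev", dev);
    }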
The merge window remains open as of this writing, so we may yet see more
interesting features merged for 2.6.36. Watch this space next week for the
final merge window updates for this development cycle.
The fourth Linux storage and filesystem summit was held August 8
and 9 in Boston, immediately prior to LinuxCon. This time around, a
number of developers from the memory management community were present as
well. Your editor was also there; what follows are his notes from the
first day of the summit.
The first topic of the workshop was "testing and tools," led by Eric
Sandeen. The 2009 workshop identified a generic test suite as something
that the community would very much like to have. One year later, quite a
bit of progress has been made in the form of the xfstests package.
As the name suggests, this test suite has its origins in the XFS
filesystem, and it is still somewhat specific to XFS. But, with the
addition of generic testing last May, about 70 of the 240 tests are now
usable with other filesystems.
Xfstests is concerned primarily with regression testing; it is not,
generally, a performance-oriented test suite. Tests tend to get added when
somebody stumbles across a bug and wants to verify that it's fixed - and
that it stays fixed. Xfstests also does not have any real facility for the
creation of test filesystems; tools like Impressions are best used
for that purpose.
About 40 new tests have been added to xfstests over the last year; it is
now heavily used in ext4 development. Most tests look for specific bugs;
there isn't a whole lot of coverage for extreme situations - millions of files
in one directory and such. Those tests just tend to take too long.
It was emphasized that just running xfstests is not enough on its own;
tests must be run under most or all reasonable combinations of mount
options to get good coverage. Ric Wheeler also pointed out that different
types of storage have very different characteristics. Most developers, he
fears, tend to test things on their laptops and call the result good.
Testing on other types of storage, of course, requires access to the
hardware; USB sticks are easy, but not all developers can test on
enterprise-class storage arrays.
Tests which exercise more of the virtual memory and I/O paths would also be
nice. There is one package which covers much of this ground: FIO,
available from kernel.dk.
Destructive power failure testing is another useful area which Red Hat (at
least) is beginning to do. There has also been some work done using hdparm
to corrupt individual sectors on disk to see how filesystems respond.
A wishlist item was better latency measurement, with an emphasis on seeing
how long I/O requests sit within drivers which do their own queueing. It
was suggested that what is really needed is some sort of central site for
capturing wishlist ideas for future tests; then, whenever somebody has some
time to spare, those ideas will be there to pick up.
In an attempt to better engage the memory management developers in the
room, it was asked: how can we make tests which cover writeback? The key,
it seems, is to choose a workload which is large enough to force writeback,
but not so large that it pushes the system into heavy swapping. One simple
test is a large tar run; while that is happening, monitor
/proc/vmstat to see when writeback kicks in, and when things get
bad enough that direct reclaim is invoked.
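For those wanting to try this at home, the following user-space sketch
samples a few /proc/vmstat counters once per second while a workload
runs; the counter names used here (nr_dirty, nr_writeback, allocstall)
are assumptions based on typical /proc/vmstat output and can vary
between kernel versions.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Print selected /proc/vmstat counters once per second so the onset of
       writeback (nr_writeback) and direct reclaim (allocstall) can be seen
       while a test workload runs. */
    int main(void)
    {
        static const char *keys[] = { "nr_dirty", "nr_writeback", "allocstall" };
        char name[64];
        unsigned long long value;

        for (;;) {
            FILE *f = fopen("/proc/vmstat", "r");

            if (!f)
                return 1;
            while (fscanf(f, "%63s %llu", name, &value) == 2) {
                unsigned int i;

                for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++)
                    if (strcmp(name, keys[i]) == 0)
                        printf("%s=%llu ", name, value);
            }
            fclose(f);
            printf("\n");
            sleep(1);
        }
        return 0;
    }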
An arguably more representative test can be had with sysbench; again, the
key is to tune it so that the shared buffers fit within physical memory.
But, as Nick Piggin pointed out, anything that dirties memory is, in the
end, a writeback test. The key is to find ways of making tests which
adequately model real-world workloads.
Your editor is messing with the timing here, but the session on testing of
memory management changes fits well with the above. So please ignore the
fact that this session actually happened after lunch.
The question here is simple: how can memory management changes be tested to
satisfy everybody? This is a subject which has been coming up for years;
memory management changes seem to be especially subject to "have you tested
with this other kind of workload?" questions. Developers find this
frustrating; it never seems to be possible to do enough testing to satisfy
everybody, especially since people asking for testing of different
workloads are often unable or unwilling to supply an actual test program.
It was suggested that the real test should be "put the new code on the
Google cluster and see if the Internet breaks." There are certain
practical difficulties with this approach, however.
So the question remains: how can a developer conclude that a memory
management change actually works? Especially in a situation where "works"
means different things to different people? There is far too wide a
variety of workloads to test them all. Beyond that, memory management
changes often involve tradeoffs - making one workload better may mean
making another one worse. Changes which make life better for everybody are
rare.
Still, it was agreed that a standard set of tests would help. Some
suggestions were made, including hackbench, netperf, various database
benchmarks (pgbench or sysbench, for example), and the "compilebench" test
which is popular with kernel developers. There was also some talk of
microbenchmarks; Nick Piggin noted that microbenchmarks are terrible when
it comes to arguing for the inclusion of a change, but they can be useful
for the detection of performance regressions.
Sometimes running a single benchmark is not enough; many memory management
problems are only revealed when the system comes under a combination of
stresses. And Andrea Arcangeli made the point that, in the end, only one
test really matters: how much time does it take the system to complete
running a workload which exceeds the amount of physical RAM available?
There was some discussion of the challenges involved in tracking down
problems; Mel Gorman stated that the debugability of the virtual memory
subsystem is "a mess." Tracepoints can be useful for this purpose, but
they are hard to get merged, partly due to Andrew Morton's
hostility to tracepoints in
general. There is also ongoing concern
about the ABI status of tracepoints; what happens when programs (perhaps
being run by large customers of enterprise distributions) depend on
tracepoints which expose low-level kernel details? Those tracepoints may
no longer make sense after the code changes, but breaking them may not be
an option.
The filesystem freeze feature
enables a system administrator to suspend writes to a filesystem, allowing
it to be backed up or snapshotted while in a consistent state. It had its
origins in XFS, but has since become part of the Linux VFS layer. There
are a few issues with how freeze works in current kernels, though.
The biggest of these problems is unmounting - what happens when the
administrator unmounts a frozen filesystem? In current kernels, the whole
thing hangs, leaving the system with an unusable, un-unmountable filesystem
- behavior which does not further the Linux World Domination goal. So four
possible solutions were proposed:
- Simply disallow the unmounting of frozen filesystems. Al Viro
stated that this solution is not really an option; there are cases
where the unmount cannot be disallowed. The exit of the final process
in the namespace in which the filesystem is mounted is one of those
cases. Disallowing unmounts would also break the useful
umount -l option, which is meant to work at all times.
- Keep the filesystem frozen across the unmount, so that the filesystem
would still be frozen after the next mount. The biggest problem here
is that there may be changes that the filesystem code needs to write
to the device; if the system reboots before that can happen, bad
things can result.
- Automatically thaw filesystems on unmount.
- Add a new ioctl() command which will cause the thawing of an
unmounted filesystem.
Al suggested a variant on #3, in the form of a new freeze command. The
proper way to handle freeze is to return a file descriptor; as long as that
file descriptor is held open, the filesystem remains frozen. This solves
the "last process exits" problem because the file descriptor will be closed
as the process exits, automatically causing the filesystem to be thawed.
Also, as Al noted, the kill command is often the system recovery
tool of choice for system administrators, so having a suitably-targeted
kill cause a frozen filesystem to be thawed makes sense.
There seemed to be a consensus that the file descriptor approach is the
best long-term solution. Meanwhile, though, there are tools based on the
older ioctl() commands which will take years to replace in the
field. So we might also see an implementation of #4, above, to help in
the near future.
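For reference, the older ioctl()-based interface mentioned above is
used roughly as in the sketch below; the mount point is an example and
the error handling is minimal.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>           /* FIFREEZE, FITHAW */

    /* Freeze a mounted filesystem, do some work (a backup or snapshot),
       then thaw it.  If this program dies between the two ioctl() calls,
       the filesystem stays frozen - the problem that the proposed
       file-descriptor-based interface is meant to solve. */
    int main(void)
    {
        int fd = open("/mnt/data", O_RDONLY);   /* example mount point */

        if (fd < 0)
            return 1;
        if (ioctl(fd, FIFREEZE, 0) < 0)
            perror("FIFREEZE");
        /* ... take the snapshot here ... */
        if (ioctl(fd, FITHAW, 0) < 0)
            perror("FITHAW");
        close(fd);
        return 0;
    }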
Contemporary filesystems go to great lengths to avoid losing data - or
corrupting things - if the system crashes. To that end, quite a bit of
thought goes into writing things to disk in the correct order. As a simple
example, operations written to a filesystem journal must make it to the
media before the commit record which marks those operations as valid.
Otherwise, the filesystem could end up replaying a journal with random
data, with an outcome that few people would love.
All of that care is for nothing, though, if the storage device reorders
writes on their way to the media. And, of course, reordering is something
that storage devices do all the time in the name of increasing
performance. The solution, for years now, has been "barrier" operations;
all writes issued before a barrier are supposed to complete before any
writes issued after the barrier. The problem is that barriers have not
always been well supported in the Linux block subsystem, and, when they are
supported, they have a significant impact on performance. So, even now,
many systems run with barriers disabled.
Barriers have been discussed among the filesystem and storage developers for
years; it was hoped that this year, with the memory management developers
present as well, some better solutions might be found.
There was some discussion about the various ways of implementing barriers
and making them faster. The key point in the discussion, though, was the
assertion that barriers are not really the same as the memory barriers they
were patterned after. There are, instead, two important aspects to block
subsystem barriers: request ordering and forcing data to disk.
That led, eventually, to one of the clearest decisions in the first day of
the summit: barriers, as such, will be no more. The problem of ordering
will be placed entirely in the hands of filesystem developers, who will
ensure ordering by simply waiting for operations to complete when needed.
There will be no ordering issues as such in the block layer, but block
drivers will be responsible for explicitly flushing writes to physical
media when needed.
Whether this decision will lead to better-performing and more robust
filesystem I/O remains to be seen, but it is a clearer description of the
division of responsibilities than has been seen in the past.
At this point, the summit split into three tracks for storage, filesystem,
and memory management developers. Your editor followed the memory
management track; with luck, we'll eventually have writeups from the other
tracks as well.
Andrea Arcangeli presented his transparent hugepages work,
starting with a discussion of the advantages of hugepages in general.
Hugepages are a feature of many contemporary processors; they allow the
memory management subsystem to use larger-than-normal page sizes in parts
of the virtual address range. There are a number of advantages to using
hugepages in the right places, but it all comes down to performance.
A hugepage takes much of the pressure off the processor's translation
lookaside buffer (TLB), speeding memory access. When a TLB miss happens
anyway, a 2MB hugepage requires traversing three levels of page table
rather than four, saving a memory access and, again, reducing TLB
pressure. The result is a doubling of the speed with which initial page
faults can be handled, and better application performance in general.
There can be some costs, especially when entire hugepages must be cleared
or copied; that can wipe out much of the processor's cache. But this cost
tends to be overwhelmed by the performance advantages that hugepages bring.
Those advantages, incidentally, are multiplied when hugepages are used on
systems hosting virtualized guests. Using hugepages in this situation can
eliminate much of the remaining cost of running virtualized systems.
Hugepages on Linux are currently accessed through the hugetlbfs filesystem,
which was discussed in great
detail by Mel Gorman on LWN earlier this year. There are some real
limitations associated with hugetlbfs, though: hugepages are not swappable,
they must be reserved at boot time, there is no mixing of page sizes in the
same virtual memory area, etc. Many of these problems could be fixed, but,
as Andrea put it, hugetlbfs is becoming a sort of secondary - and inferior
- Linux virtual memory subsystem. It is time to turn hugepages into
first-class citizens in the real Linux VM.
Transparent hugepages eliminate much of the need for hugetlbfs by
automatically grouping together sections of a process's virtual address
space into hugepages when warranted. They take away the hassles of
hugetlbfs and make it possible for the system to use hugepages with no need
for administrator intervention or application changes. There seems to be a
fair amount of interest in the feature; a Google developer said that the
feature is attractive for internal use.
At the core of the patch is a new thread called khugepaged, which
is charged with scanning memory and creating hugepages where it makes
sense. Other parts of the VM can split those hugepages back into
normally-sized pieces when the need arises. Khugepaged works by allocating
a hugepage, then using the migration mechanism to copy the contents of the
smaller component pages over. There was some talk of trying to defragment
memory and "collapse in place" instead, but it doesn't seem worth the
effort at this point. The amount of work to be done would be about the
same except in the special case where a hugepage had been split and was
being grouped back together before much had changed - a situation which is
expected to be relatively rare.
Andrea put up a number of benchmarks showing how transparent hugepages
improve performance; the all-important kernel compile benchmark (seen as a
sort of worst case for hugepages) is 2.5% faster. Various other benchmarks
show bigger improvements.
Transparent hugepages, it seems, will be enabled by default in the
RHEL 6 kernel. Andrea would really like to get the feature into
2.6.36, but the merge window is already well advanced and it's not clear
that things will work out that way. There is still a need to convince
Linus that the feature is worthwhile, and, perhaps, some work to be done to
enable the feature on SPARC systems.
The memory map semaphore (mmap_sem) is a reader-writer semaphore
which protects the tree of virtual memory area (VMA) structures describing
each address space. It is, Nick Piggin says, one of the last nasty locking
issues left in the virtual memory subsystem. Like many busy, global locks,
mmap_sem can cause scalability problems through cache line
bouncing. In this case, though, simple contention for the lock can be a
problem; mmap_sem is often held while disk I/O is being performed.
With some workloads, the amount of time that mmap_sem is
held can slow things down significantly.
Various groups, including developers at HP and Google, have chipped away at
the mmap_sem problem in the past, usually by trying to drop the
semaphore in various paths. These patches have all run into the same
problem, though: Linus hates them. In particular, he seems to dislike the
additional complication added to the retry paths which must be followed
when things change while the lock is dropped. So none of this work has
gotten into the mainline.
There have also been some unfair
rwsem proposals aimed at reducing mmap_sem contention; these
have run aground over fears of writer starvation.
According to Nick, the real problem is the red-black tree used to track
allocated address space; the data structure is cache-unfriendly and
requires a global lock for its protection. His idea is to do away with
this rbtree and associate VMAs directly with the page table entries,
protecting them with the PTE locks. This approach would eliminate much of
the locking entirely, since the page tables must be traversable without
locks, and solve the mmap_sem problem.
That said, there are some challenges. A VMA associated with a page table
entry can cover a maximum of 2MB of address space; larger areas would have
to be split into (possibly a large number of) smaller VMAs. It's not clear
how this mechanism would then interact with hugepages. The instantiation
of large VMAs would require the creation of the full range of PTEs, which
is not required now; that could hurt applications with very
sparsely-populated memory areas. Growing VMAs would have its own
challenges. There is also the issue of free space allocation, a problem
which might be solved by preallocating ranges of addresses to each
thread sharing an address space. In summary, the list of obstacles to be
overcome before this idea becomes practical looks somewhat daunting.
The developers in the room seemed to not be entirely comfortable with this
approach, but nobody could come up with a fundamental reason why it would not
work. So we'll probably be seeing patches from Nick exploring this idea in
the future.
The reflink() system
call was originally proposed as a sort of fast copy operation; it would
create a new "copy" of a file which shared all of the data blocks. If one
of the files were subsequently written to, a copy-on-write operation would
be performed so that the other file would not change. LWN readers last
heard about this patch last September, when Linus refused to pull it for 2.6.32.
Among other things, he didn't like the name.
So now reflink() is back as copyfile(), with some
proposed additional features. It would make the same copy-on-write copies
on filesystems that support it, but copyfile() would also be able
to delegate the actual copy work to the underlying storage device when it
makes sense. For example, if a file is being copied on a network-mounted
filesystem, it may well make sense to have the server do the actual copy
work, eliminating the need to move the data over the network twice. The
system call might also do ordinary copies within the kernel if nothing
faster is available.
The first question that was asked is: should copyfile() perhaps be
an asynchronous interface? It could return a file descriptor which could
be polled for the status of the operation. Then, graphical utilities could
start a copy, then present a progress bar showing how things were going.
Christoph Hellwig was adamant, though, that copyfile() should be a
synchronous operation like almost all other Linux system calls; there is no
need to create something weird and different here. Progress bars neither
justify nor require the creation of asynchronous interfaces.
There was also opposition to the mixing of the old reflink() idea
with that of copying a file. There is little perceived value in creating a
bad version of cp within the kernel. The two ideas were mixed
because Linus seems to want it that way, but, after this discussion,
they may yet be split apart again.
Jan Kara led a short discussion on the problem of dirty limits. The tuning
knob found at /proc/sys/vm/dirty_ratio contains a number
representing a percentage of total memory. Any time that the number of
dirty pages in the system exceeds that percentage, processes which are
actually writing data will be forced to perform some writeback directly.
This policy has a couple of useful results: it helps keep memory from
becoming entirely filled with dirty pages, and it serves to throttle the
processes which are creating dirty pages in the first place.
The default value for dirty_ratio is 20, meaning that 20% of
memory can be dirty before processes are conscripted into writeback
duties. But that turns out to be too low for a number of applications. In
particular, it seems that many Berkeley DB applications exhibit behavior
where they dirty a lot of pages all over memory; setting
dirty_ratio too low causes a lot of excessive I/O and serious
performance issues. For this reason, distributions like RHEL raise this
limit to 40% by default.
But 40% is not an ideal number either; it can lead to a lot of wasted
memory when the system's workloads are mostly sequential. Lots of dirty
pages can also cause fsync() calls to take a very long time,
especially with the ext3 filesystem. What's really needed is a way to set
this parameter in a more automatic, adaptive way, but exactly how that
should be done is not entirely clear.
What is likely to happen in the short term is that a user-space daemon will
be written to experiment with various policies for dirty_ratio.
Some VM tracepoints can be used to catch events and tune things
accordingly. A system which is handling a lot of fsync() calls
should probably have a lower value of dirty_ratio, for example.
In the absence of reasons to the contrary, the daemon can try to nudge the
limit higher and try to see if applications perform better. This kind of
heuristic experimentation has its hazards, but there does not seem to be a
better method on offer at the moment.
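At its simplest, such a daemon would just read and rewrite the sysctl
file; the sketch below shows only that mechanical part, with a
hard-coded example value standing in for any real policy.

    #include <stdio.h>

    /* Read the current vm.dirty_ratio and write a new value.  A real
       daemon would derive the value from observed fsync() latency and
       writeback behavior rather than hard-coding it. */
    static int set_dirty_ratio(int new_ratio)
    {
        FILE *f = fopen("/proc/sys/vm/dirty_ratio", "r");
        int current = -1;

        if (f) {
            if (fscanf(f, "%d", &current) != 1)
                current = -1;
            fclose(f);
        }
        f = fopen("/proc/sys/vm/dirty_ratio", "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", new_ratio);
        fclose(f);
        printf("dirty_ratio: %d -> %d\n", current, new_ratio);
        return 0;
    }

    int main(void)
    {
        return set_dirty_ratio(25) ? 1 : 0;     /* example value only */
    }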
Topology and alignment
There was a brief session on storage device topology issues; unfortunately,
it was late in the day and your editor's notes are increasingly fuzzy.
Much of the discussion had to do with 4K-sector disks. There are still
issues, it seems, with drives which implement strange sector alignments in
an attempt to get better performance with some versions of Windows. Linux
can cope with these drives, but only if the drives themselves export
information on what they are doing. Not all hardware provides that
information.
Meanwhile, the amount of software which does make use of the topology
information exported through the kernel is increasing. Partitioning tools
are getting smarter, and the device mapper now uses this information
properly. The readahead code will be tweaked to create properly-aligned
requests when possible.
The last session of the day was dedicated to three lightning talks. The
first, by Matthew Wilcox, had to do with merging of git trees. Quite a bit
of work in the VM/filesystem/storage area depends on changes made in a
number of different trees. Making those trees fit together can be a bit of
a challenge. That problem can be solved in linux-next, but those solutions
do not necessarily carry over into the mainline, where trees may be pulled
in just about any order - or not at all. The result is a lot of work and
merge-window scrambling by developers, who are getting a little tired of
it.
So, it was asked, is it time for a git tree dedicated to storage as a
whole, and a storage maintainer to go with it? The idea was to create
something like David Miller's networking tree, which is the merge point for
almost all networking-related patches. James Bottomley made the mistake of
suggesting that this kind of discussion could not go very far without a
volunteer to manage that tree; he was then duly volunteered for the job.
The discussion moved on to how this tree would work, and, in particular,
whether its maintainer would become the "overlord of storage," or whether
it would just be a more convenient place to work out merge conflicts. If
its maintainer is to be a true overlord, a fairly hardline approach will
need to be taken with regard to when patches would have to be ready for
merging. It's not clear whether the storage community is ready to deal
with such a maintainer. So, for the near future, James will run the tree
as a merge point to see
whether that helps developers get their code into the mainline. If it
seems like there is need for a real storage maintainer, that question will
be addressed after a development cycle or two.
Dan Magenheimer presented his Cleancache proposal, mostly with
an eye toward trying to figure out a way to get it merged. There is still
some opposition to it, and its per-filesystem hooks in particular. It's
hard to see how those hooks can be avoided, though; Cleancache is not
suitable for all filesystems and, thus, may not be a good fit for the VFS
layer. The crowd seemed reasonably amenable to merging the patches, but
the chief opponent - Christoph Hellwig - was not in the room at the time.
So no real conclusions have been reached.
The final lightning talk came from Boaz Harrosh, who talked about "stable
pages." Currently, pages which are under writeback can be
modified by filesystem code. That's potentially a data integrity problem,
and it can be fatal in situations where, for example, checksums of page
contents are being made. That is why the RAID5 code must copy all pages
being written to an array; changing data would break the RAID5 checksums.
What, asked Boaz, would break if the ability to change pages under
writeback were withdrawn?
The answer seems to be that nothing would break, but that some filesystems
might suffer performance impacts. The only way to find out for sure,
though, is to try it. As it happens, this is a relatively easy experiment
to run, so filesystem developers will probably start playing with it soon.
That was the end of the first day of the summit; reports from the second
day will be posted as soon as they are ready.
The second day of the 2010 Linux Storage and Filesystem Summit was held on
August 9 in Boston. Those who have not yet read the coverage from day 1
may want to start there. This day's topics were, in general, more detailed and
technical and less amenable to summarization here. Nonetheless, your
editor will try his best.
The first session of the day was dedicated to the writeback issue.
Writeback, of course, is the process of writing modified pages of files
back to persistent store. There have been numerous complaints over recent
years that writeback performance in Linux has regressed; the curious reader
can refer to this article for
some details, or this
bugzilla entry for many, many details. The discussion was less focused
on this specific problem, though; instead, the developers considered the
problems with writeback as a whole.
Sorin Faibish started with a discussion of some research that he has done
in this area. The challenges for writeback are familiar to those who have
been watching the industry; the size of our systems - in terms of both
memory and storage - has increased, but the speed of those systems
has not increased proportionally. As a result, writing back a given
percentage of a system's pages takes longer than it once did, and it is
ever easier for writeback to fall behind the processes which are
dirtying pages, leading to poor performance.
His assertion is that the use of watermarks to control writeback is no
longer appropriate for contemporary systems. Writeback should not wait
until a certain percentage of memory is dirty; it should start sooner, and,
crucially, be tied to the rate with which processes are dirtying pages.
The system, he says, should work much more aggressively to ensure that the
writeback rate matches the dirty rate.
From there, the discussion wandered through a number of specific issues.
Linux writeback now works by flushing out pages belonging to a specific file
(inode) at a time, with the hope that those pages will be located nearby on
the disk. The memory management code will normally ask the filesystem to
flush out up to 4MB of data for each inode. One poorly-kept secret of
Linux memory management is that filesystems routinely ignore that request -
they typically flush far more data than requested if there are that many
dirty pages. It's only by generating much larger I/O requests that they
can get the best performance.
Ted Ts'o wondered if blindly increasing writeback size is the best thing to
do. 4MB is clearly too small
for most drives, but it may well be too large for a filesystem located on a
slow USB drive. Flushing large amounts of data to such a filesystem can
stall any other I/O to that device for quite some time. From this
discussion came the idea that writeback should not be based on specific
amounts of data, but, instead, should be time-based. Essentially, the
backing device should be time-shared between competing interests in a way
similar to how the CPU is shared.
James Bottomley asked if this idea made sense - is it ever right to cut
off I/O to an inode which still has contiguous, dirty pages to write? The answer
seems to be
"yes." Consider a process which is copying a large file - a DVD image or
something even larger. Writeback might not catch up with such a process
until the copy is done, which may not be for a long time into the future;
meanwhile, all other users of that device will be starved. That is bad for
interactivity, and it can cause long delays before other files are flushed
to disk. Also, the incremental performance benefit of extending large I/O
operations tends to drop off over time. So, in the end, it's necessary to
switch to another inode at some point, and making the change based on
wall-clock time seems to be the most promising approach.
Boaz Harrosh raised the idea of moving the I/O scheduler's
intelligence up to the virtual memory management level. Then, perhaps,
application priorities could be used to give interactive processes
privileged access to I/O bandwidth. Ted, instead, suggested that there may
be value in allowing
the assignment of priorities to individual file descriptors. It's fairly
common for an application to have files it really cares about, and others
(log files, say) which matter less. The problem with all of these ideas,
according to Christoph Hellwig, is that the kernel has far too many I/O
submission paths. The block layer is the only place where all of those I/O
operations come together into a single place, so it's the only place where
any sort of reasonable I/O control can be applied. A lot of fancy schemes
are hard to implement at that level, so, even if descriptor-based
priorities are a good idea (not everybody was convinced), it's not
something that can readily be done now. Unifying the I/O submission paths
was seen as a good idea, but it's not something for the near future.
Jan Kara asked about how results can be measured, and against what
requirements will they be judged? Without that information, it is hard to
know if any changes have had good effects or not. There are trivial cases,
of course - changes which slow down kernel compiles tend to be caught early
on. But, in general, we have no way to measure how well we are doing with
writeback.
So, in the end, the first action item is likely to be an attempt to set
down the requirements and to develop some good test cases. Once it's
possible to decide whether patches make sense, there will probably be an
implementation of some sort of time-based writeback mechanism.
Solid-state storage devices
There were two sessions on solid-state storage devices (SSDs) at the
summit; your editor was able to attend only the first. The situation which
was described there is one we have been hearing about for a couple of years
at least. These devices are getting faster: they are heading toward the
point where they can perform one million I/O operations per second. That
said, they
still exhibit significant latency on operations (though much less than
rotating drives do), so the only way to get that kind of operation count is
to run a lot of operations in parallel. "A lot" in this case means having
something like 100 operations in flight at any given time.
Current SSDs work reasonably well with Linux, but there are certainly some
problems. There is far too much overhead in the ATA and SCSI layers; at
that kind of operation rate, microseconds hurt. The block layer's request
queues are becoming a bottleneck; it's currently only possible to have
about 32 concurrent operations outstanding on a device. The system needs to be able to
distribute I/O completion work across multiple CPUs, preferably using smart
controllers which can direct each completion interrupt to the CPU which
initiated a specific operation in the first place.
For "storage-attached" SSDs (those which look like traditional disks),
there are not a lot of problems at the filesystem level; things work pretty
well. Once one gets into bus-attached devices which do not look like
disks, though, the situation changes. One participant asserted that, on
such devices, the ext4 filesystem could not be expected to get reasonable
performance without a significant redesign. There is just too much to do
for each operation.
Ric Wheeler questioned the claim that SSDs are bringing a new challenge
for the storage subsystem. Very high-end enterprise storage arrays have
achieved this kind of I/O rate for some years now. One thing those arrays
do is present multiple devices to the system, naturally helping with
parallelism; perhaps SSDs could be logically partitioned in the same way.
Resizing guest memory
A change of pace was had in the memory management track, where Rik van Riel
talked about the challenges involved in resizing the memory available to
virtualized guests. There are four different techniques in use currently:
- Memory hotplug by way of simulated hardware hotplug events. This
mechanism works well for adding memory to guests, but it cannot really
be used to take memory back. Hot remove simply does not work well,
because there's always some sort of non-movable allocation which ends
up in the space which would be removed.
- Ballooning, wherein a special driver in the guest allocates pages and
retires them from use, essentially handing them back to the host.
Memory can be fed back into the guest by having the balloon driver
free the pages it has allocated. This mechanism is simple, if
somewhat slow, but simple management policies are scarce.
- Transcendent memory techniques like cleancache and frontswap,
which can be used to adjust memory availability between virtual
machines.
- Page hinting, whereby guests mark pages which can be discarded by the
host. These pages may be on the guest's free list, or they may simply
be clean pages. Should the guest try to access such a page after the
host has thrown it away, that guest will receive a special page fault
telling it that it needs to allocate the page anew. Hinting
techniques tend to bring a lot of complexity with them.
The real question of interest in this session seemed to be the
"rightsizing" of guests - giving each guest just enough memory to optimize
the performance of the system as a whole. Google is also interested in
this problem, though it is using cgroup-based containers instead of full
virtualization. It comes down to figuring out what a process's minimal
working set size is - a problem which has resisted attempts at solution for
many years.
Mel Gorman proposed one approach to determine a guest's working set size.
Place that guest under memory pressure, slowly shrinking its available
memory over time. There will come a point where the kernel starts scanning
for reclaimable pages, and, as the pressure grows, a point where the
process starts paging in pages which it had previously used. That latter
point could be deemed to be the place where the available memory had fallen
below the working set size. It was also suggested that page reactivations
- when pages are recovered from the inactive list and placed back into
active use - could also be the metric by which the optimal size is
judged.
Nick Piggin was skeptical of such schemes, though. He gave the example of
two processes, one of which is repeatedly working through a 1GB file, while
the other is working through a 1TB file. If both processes currently have
512MB of memory available, they will both be doing significant amounts of
paging. Adjusting the memory size will not change that behavior, leading
to the conclusion that there's not much to be done - until the process with
the smaller file gets 1GB of memory to work with. At that point, its
paging will stop. The process working with the larger file will never
reach that point, though, at least on contemporary systems. So, even
though both processes are paging at the same rate, the initial 512MB memory
size is too small for one process, but is just fine for the other.
The fact that the problem is hard has not stopped developers from trying to
improve the situation, though, so we are likely to see attempts made at
dynamically resizing guests in an attempt to work out their optimal sizes.
I/O bandwidth controllers
Vivek Goyal led a brief session on the I/O bandwidth controller problem.
Part of that problem has been solved - there is now a proportional-weight
bandwidth controller in the mainline kernel. This controller works well
for single-spindle drives, perhaps a bit less so with large arrays. With
larger systems, the single dispatch queue in the CFQ scheduler becomes a
bottleneck. Vivek has been working on a set of patches to improve that
situation for a little while now.
The real challenge, though, is the desired maximum bandwidth controller.
The proportional controller which is there now will happily let a process
consume massive amounts of bandwidth in the absence of contention. In most
cases, that's the right result, but there are hosting providers out there
who want to be able to keep their customers within the bandwidth limits
they have paid for. The problem here is figuring out where to implement
this feature. Doing it at the I/O scheduler level doesn't work well when
there are devices stacked higher in the chain.
One suggestion is to create a special device mapper target which would do
maximum bandwidth throttling. There was some resistance to that idea,
partly because some people would rather avoid the device mapper altogether,
but also due to practical problems like the inability of current Linux
kernels to insert a DM-based controller into the stack for an
already-mounted disk. So
we may see an attempt to add this feature at the request queue level, or we
may see a new hook allowing a block I/O stream to be rerouted through a new
module on the fly.
The other feature which is high on the list is support for controlling
buffered I/O bandwidth. Buffered I/O is hard; by the time an I/O request
has made it to the block subsystem, it has been effectively detached from
the originating process. Getting around that requires adding some new
page-level accounting, which is not a lightweight solution.
Back in the memory management track, a number of reclaim-oriented topics
were covered briefly. The first of these is per-cgroup reclaim. Control
groups can be used now to limit total memory use, so reclaim of anonymous
and page-cache pages works just fine. What is missing, though, is the sort
of lower-level reclaim used by the kernel to recover memory: shrinking of
slab caches, trimming the inode cache, etc. A cgroup can consume
considerable resources with this kind of structure, and there is currently
no mechanism for putting a lid on such usage.
Zone-based reclaim would also
be nice; that is evidently covered in the VFS scalability patch set, and
may be pushed toward the mainline as a standalone patch.
Reclaim of smaller structures is a problem which came up a few times this
afternoon. These structures are reclaimed individually, but the virtual
memory subsystem is really only concerned with the reclaim of full pages.
So reclaiming individual inodes (or dentries, or whatever) may just serve
to lose useful cached information and increase fragmentation without
actually freeing any memory for the rest of the system. So it might be
nice to change the reclaim of structures like dentries to be more
page-focused, so that useful chunks of memory can be returned to the
system.
The ability to move these structures around in memory,
freeing pages through defragmentation, would also be useful. That is a
hard problem, though,
which will not be amenable to a quick solution.
There is an interesting problem with inode reclaim: cleaning up an inode
also clears all related page cache pages out of the system. There can be
times when that's not what's really called for. It can free vast amounts of
memory when only small amounts are needed, and it can deprive the system of
cached data which will just need to be read in again in the near future.
So there may be an attempt to change how inode reclaim works sometime soon.
There are some difficulties with how the page allocator works on larger
systems; free memory can go well below the low watermark before the system
notices. That is the result of how the per-CPU queues work; as the number
of processors grows, the accounting of the size of those queues gets
fuzzier. So there was talk of sending inter-processor interrupts on
occasion to get a better count, but that is a very expensive solution.
Better, perhaps, is just to iterate over the per-CPU data structures and
take the locking overhead.
Christoph Lameter ran a discussion on slab allocators, talking about the
three allocators which are currently in the kernel and the attempts which
are being made to unify them. This is a contentious topic, but there was a
relative lack of contentious people in the room, so the discussion was
subdued. What happens will really depend on what patches Christoph posts
in the future.
A brief session touched on a few problems associated with direct I/O. The
first of these is an obscure race between get_user_pages() (which
pins user-space pages in memory so they can be used for I/O) and the
fork() system call. In some cases, a fork() while the
pages are pinned can lead to data corruption. A number of fixes have been
posted, but they have not gotten past Linus. The real fix will involve
fixing all get_user_pages() callers and (the real point of
contention) slowing down fork(). The race is a real problem, so
some sort of solution will need to find its way into the mainline.
Why, it was asked, do applications use direct I/O instead of just mapping
file pages into their address space? The answer is that these applications
know what they want to do with the hardware and do not want the virtual
memory system getting in the way. This is generally seen as a valid
use case.
There is some desire for the ability to do direct I/O from the virtual
memory subsystem itself. This feature could be used to support, for
example, swapping over NFS in a safe way. Expect patches in the near
future.
Finally, there is a problem with direct I/O to transparent hugepages. The
kernel will go through and call get_user_pages_fast() for each 4K
subpage, but that is unnecessary. So 512 mapping calls are being made when
one would do. Some kind of fix will eventually need to be made so that
this kind of I/O can be done more efficiently.
Once again, the day ended with lightning talk topics. Matthew Wilcox
started by asking developers to work at changing more uninterruptible waits
into "killable" waits. The difference is that uninterruptible waits can,
if they wait for a long time, create unkillable processes. System
administrators don't like such processes; "kill -9" should
really work at all times.
The problem is that making this change is often not straightforward; it
turns a function call which cannot fail into one which can be interrupted.
That means that, for each change, a new error path must be added which
properly unwinds any work which had been done so far. That is typically
not a simple change, especially for somebody who does not intimately
understand the code in question, so it's not the kind of job that one
person can just take care of.
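In outline, the conversion being asked for looks like the sketch
below: a hypothetical in-kernel example replacing mutex_lock() with
mutex_lock_killable() and adding the error path that the change
requires. The names and structure are illustrative, not taken from any
particular driver.

    #include <linux/mutex.h>
    #include <linux/errno.h>

    /* Before: an uninterruptible wait.  A process stuck here cannot be
       killed, even with "kill -9", if the lock is never released. */
    static int mydev_op_uninterruptible(struct mutex *lock)
    {
        mutex_lock(lock);
        /* ... do the work ... */
        mutex_unlock(lock);
        return 0;
    }

    /* After: the wait can be ended by a fatal signal, so the caller gains
       a new failure case and must unwind anything done before this point. */
    static int mydev_op_killable(struct mutex *lock)
    {
        int ret = mutex_lock_killable(lock);

        if (ret)
            return ret;     /* typically -EINTR; undo earlier work here */
        /* ... do the work ... */
        mutex_unlock(lock);
        return 0;
    }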
It was suggested that iSCSI drives - which can cause long delays if they
fall off the net - are a good way of testing this kind of code. From
there, the discussion wandered into the right way of dealing with the
problems which result when network-attached drives disappear. They can
often hang the system for long periods of time, which is unfortunate. Even
worse, they can sometimes reappear as the same drive after buffers have
been dropped, leading to data corruption.
The solution to all of this is faster and better recovery when devices
disappear, especially once it becomes clear that they will not be coming
back anytime soon. Additionally,
should one of those devices reappear after the system has given
up on it, the storage layer should take care that it shows up as a totally
new device. Work will be done to this
end in the near future.
Mike Rubin talked a bit about how things are done at Google. There are
currently about 25 kernel engineers working there, but few of them are
senior-level developers. That, it was suggested, explains some of the
things that Google has tried to do in the kernel.
There are two fundamental types of workload at Google. "Shared" workloads
work like classic mainframe batch jobs, contending for resources while the
system tries to isolate them from each other. "Dedicated workloads" are
the ones which actually make money for Google - indexing, searching, and
such - and are very sensitive to performance degradation. In general, any
new kernel which shows a 1% or higher performance regression is deemed to
not be good enough.
The workloads exhibit a lot of big, sequential writes and smaller, random
reads. Disk I/O latencies matter a lot for dedicated workloads; 15ms
latencies can cause phone calls to the development group. The systems are
typically doing direct I/O on not-too-huge files, with logging happening on
the side. The disk is shared between jobs, with the I/O bandwidth
controller used to arbitrate between them.
Why is direct I/O used? It's a decision which dates back to the 2.2 days,
when buffered I/O worked less well than it does now. Things have gotten
better, but, meanwhile, Google has moved much of its buffer cache
management into user space. It works much like enterprise database systems
do, and, chances are, that will not change in the near future.
Google uses the "fake NUMA" feature to partition system memory into 128MB
chunks. These chunks are assigned to jobs, which are managed by control
groups. The intent is to firmly isolate all of these jobs, but writeback
still can cause interference between them.
Why, it was asked, does Google not use xfs? Currently, Mike said, they are
using ext2 everywhere, and "it sucks." On the other hand, ext4 has turned
out to be everything they had hoped for. It's simple to use, and the
migration from ext2 is straightforward. Given that, they feel no need to
go to a more exotic filesystem.
Mark Fasheh talked briefly about "cluster convergence," which really means
sharing of code between the two cluster filesystems (GFS2 and OCFS2) in the
mainline kernel. It turns out that there is a surprising amount of sharing
happening at this point, with the lock manager, management tools, and more
being common to both. The biggest difference between the two, at this
point, is the on-disk format.
The cluster filesystems are in a bit of a tough place. Neither has a huge
group dedicated to its development, and, as Ric Wheeler pointed out, there
just isn't much of a hobbyist community equipped with enterprise-level
storage arrays out there. So these two projects have struggled to keep up
with the proprietary alternatives. Combining them into a single cluster
filesystem looks like a good alternative to everybody involved. Practical
and political difficulties could keep that from happening for some years,
though.
There was a brief discussion about the DMAPI specification, which describes
an API to be used to control hierarchical storage managers. What little
support exists in the kernel for this API is going away, leaving companies
with HSM offerings out in the cold. There are a number of problems with
DMAPI, starting with the fact that it fails badly in the presence of
namespaces. The API can't be fixed without breaking a range of proprietary
applications. So it's not clear what the way forward will be.
The summit was widely seen as a successful event, and the participation of
the memory management community was welcomed. So there will be a joint
summit again for storage, filesystem, and memory management developers next
year. It could happen as soon as early 2011; the participants would like
to move the event back to the (northern) spring, and waiting for 18 months
for the next gathering seemed like too long.