Brief items
The current development kernel is 2.6.39-rc2, released on April 5.
"It's been an uncommonly calm -rc2, which should make me really
happy, but quite honestly just makes me really suspicious. You guys are up
to something, aren't you?" See the long-format changelog for all the
details.
Stable updates: the 2.6.35.12 update
was released, with a long list of fixes, on March 31.
And that concept can be brought to its logical conclusion: i think
it's only a matter of time until someone takes the Linux kernel,
integrates klibc and a toolchain into it with some good initial
userspace and goes wild with that concept, as a single, sane, 100%
self-hosting and self-sufficient OSS project, tracking the release
schedule of the Linux kernel.
-- Ingo Molnar
The minimal patchset is too minimal for Oren's use and the maximal
patchset seems to have run aground on general kernel sentiment. So
I guess you either take the minimal patchset and make it less
minimal or take the maximal patchset and make it less maximal,
ending up with the same thing. How's that for hand-waving useless
obviousnesses.
-- Andrew Morton
I've told people this before, and I'll tell it again: when I flame
submaintainers, they should try to push the pain down. I'm not
really asking those submaintainers to clean up all the stuff they
are getting: I'm basically asking people to say "no", or at least
push back a lot, and argue with the people who send you code. Tell
them what you don't like about the code, and tell them that you
can't take it any more.
-- Linus Torvalds
The mobile space is about proprietary drivers.
-- Mark Charlebois, Qualcomm Innovation Center (on stage at the Linux Foundation Collaboration Summit)
As I have seen this tangentially mentioned already a few times
publicly, I figured it warranted it's own announcement now.
Linux has lost a great developer with the passing of David Brownell
recently and he will be greatly missed.
-- Greg Kroah-Hartman
David made contributions to a large number of areas in the Linux
kernel. Even a quick look through MAINTAINERS will show that he
worked on USB controllers (OHCI, EHCI, OMAP and others), USB
gadgets, USB networking, and SPI. He was influential in the core
USB design (the HCD "glue" layer and the scatter-gather library)
and the development of Power Management (system sleep and the USB
PM implementation). His designs were elegant and his code was
always a pleasure to read.
He also was a big help to me personally, assisting in my initial
entry to USB core development. And he was the first person I met
at the first Linux conference I attended. I too will miss him.
-- Alan Stern
I guess many of us have similar experience with Dave. He also
helped me a lot when I first started doing Linux development. I
learned a lot from him and will miss him a lot. His teachings, I
will always carry with me.
-- Felipe Balbi
Kernel development news
By Jonathan Corbet
April 6, 2011
The Linux kernel supports a wide variety of architectures, some of which
are more prominent than others. The ARM architecture does not usually draw
a lot of attention, but, over the years, it has become one of the most
important architectures for Linux. There's now a vast array of embedded
devices which run Linux because the kernel runs well on ARM. So when the
mailing lists see extended and heated discussions about the quality of the
ARM architecture code, it's worth paying attention.
It all started early in the 2.6.39 merge window when Linus objected to one of many pull requests for an
ARM subarchitecture. He complained about excessive churn in the
architecture, duplicated code, board-specific data encoded in source files,
and conflicts between different merge requests. Much of that
board-specific data, he says, should be pulled out of the kernel and into
the boot loader; others have suggested that device trees could solve much
of that problem. Meanwhile, it is impossible to build a kernel which runs
on a wide variety of ARM systems, and that, he
says, is a problem for the platform as a whole:
Why? Think of the Ubuntu's etc of the world. If you can't make
half-way generic install images, you can't have a reasonably
generic distribution. And if you don't have that, then what happens
to your developer situation? Most sane people won't touch it with a
ten-foot pole, because the bother is simply not worth their time.
There actually seems to be a bit of a consensus on what the sources of the
problems with the ARM architecture are. The hardware itself varies widely
from one chip to the next; each vendor's system-on-chip offerings are
inconsistent with each other, and even more so with other vendors'
products. According to Nicolas Pitre, the openness of Linux has helped to
make ARM successful, but is also part of the
problem:
On ARM you have no prepackaged "real" Windows. That let hardware
people try things. So they do change the hardware platform all the
time to gain some edge. And this is no problem for them because
most of the time they have access to the OS source code and they
modify it themselves directly. No wonder why Linux is so popular on
ARM. I'm sure hardware designers really enjoy this freedom.
So the ARM architecture is a massive collection of "subplatforms,"
each of which is managed independently, often by different developers.
Few of those developers have the time for or interest in doing
cross-platform architecture work. The result is a lot of code flux, many
duplicated drivers, and lots of hacks.
Complicating the situation is the simple fact that the kernel is a victim
of its own success. For years developers have been beating on the embedded
industry to work upstream and to get its code into the kernel. Now the
industry is doing exactly that; the result is a lot of code, not all
of which is as nice as we would like. The fact that a lot of embedded
vendors seem to have little long-term vision or interest in solving
anything but immediate problems makes things worse. The result is code
that "works for now," but which is heading toward a long-term maintenance
disaster.
How is this problem to be solved? It seems clear that the ARM architecture
needs more maintainers who are concerned with cross-platform issues and
improving the state of ARM support as a whole. There would appear to be a
consensus that ARM maintainer Russell King is doing a good job with the
core code, and there are a few people (Nicolas Pitre, Catalin Marinas, Tony
Lindgren, etc.) who are trying to bring some order to the subplatform mess,
but they seem to be unable to contain the problem. As Nicolas put it:
So we need help! If core kernel people could get off their X86
stool and get down in the ARM mud to help sort out this mess that
would be really nice (thanks tglx). Until then all that the few of
us can do is to contain the flood and hope for the best, and so far
things being as they are have still worked surprisingly well in
practice for users....
And we can't count on vendor people doing this work. They are all busy
porting the kernel to their next SOC version so they can win the next
big Android hardware design, and doing so with our kernel quality
standards is already quite a struggle for them.
There are some developers who are willing to provide at least some of that
help. The Linaro project could also conceivably take on a role here. But
that still leaves open the question of just how the code can be cleaned
up. Arnd Bergmann has suggested the
radical step of creating a new ARM architecture tree with a new, clean,
design, then moving support over to it. Eventually the older code would
either fade away, or it would only be used to support older hardware.
Creating a new architecture tree seems like a big step, but it has been
done before - more than once. The x86-64 architecture was essentially a
clean start from x86; the two platforms were then eventually merged back
together into a much cleaner tree. PowerPC support went through a similar
process.
Whether that will happen with ARM remains to be seen; there are other
developers who would rather perform incremental cleanups on the existing
ARM tree. Either way, the first step will have to be finding developers
who are willing to do the work. There is no shortage of developers who are
interested in ARM, but fewer of them are willing and able to do high-level
architectural work - and to deal with the inevitable resistance to change.
As Thomas Gleixner said:
The only problem is to find a person, who is willing to do that,
has enough experience, broad shoulders and a strong accepted
voice. Not to talk about finding someone who is willing to pay a
large enough compensation for pain and suffering.
So there are some challenges to overcome. But there is also a great deal
of economic value to the ARM platform, a lot of people working in that
area, and a reasonably common understanding of where the problems are. So
chances are good that some sort of solution will be found.
By Jonathan Corbet
April 5, 2011
It has been a mere eight months since the 2010 Linux Filesystem, Storage,
and Memory Management Summit was held in Boston, but that does not mean
that there is little to talk about - or that there has not been
time to get a lot done. A number of items discussed at the 2010 event,
including writeback improvements, better error handling, transparent huge
pages, the I/O bandwidth controller, and the block-layer barrier rework,
have been completed or substantially advanced in those eight months. Some
other tasks remain undone, but there is hope: Trond Myklebust, in the
introductory session of the 2011 summit, said that this might well be the
last time that it will be necessary to discuss pNFS - a prospect which
caused very little dismay.
The following is a report from the first day of the 2011 meeting, held on
April 4. This coverage is necessarily incomplete; when the group split
into multiple tracks, your editor followed the memory management
discussions.
Writeback
The Summit started with a plenary session to review the writeback problem.
Writeback is the process of writing dirty pages back to persistent storage;
it has been identified as one of the most significant performance issues
with recent kernels. This session, led by Jan Kara, Jens Axboe, and
Johannes Weiner, made it clear that a lot of thought is going into the
issue, but that a full understanding of the problem has not yet been
reached.
One aspect of the problem that is well understood is that there are too
many places in the kernel which are submitting writeback I/O. As a result,
the different I/O streams conflict with each other and cause suboptimal I/O
patterns, even if the individual streams are well organized - which is not
always the case. So it would be useful to reduce the number of submission
points, preferably to just one.
Eliminating "direct reclaim," where processes which are allocating pages
take responsibility for flushing other pages out to disk, is at the top of
most lists. Direct reclaim cannot easily create large, well-ordered I/O,
is computationally expensive, and leads to excessive lock contention. Some
patches have been written in an attempt to improve direct reclaim by, for
example, performing "write-around" of targeted pages to create larger
contiguous operations, but none have passed muster for inclusion into the
mainline.
There was some talk of improving the I/O scheduler so that it could better
merge and organize the I/O stream created by direct reclaim. One problem
with that idea is that the request queue length is limited to 128 requests
or so, which is not enough to perform reordering when there are multiple
streams of heavy I/O. There were suggestions that the I/O scheduler might
be considered broken and in need of fixing, but that view did not go very
far. The problem with increasing the request queue length is that there
would be a corresponding increase in I/O latencies, which is not a desirable
result. Christoph Hellwig summed things up by saying that it was a mistake
to generate bad I/O patterns in the higher levels of the system and expect
the lower layers to fix them up, especially when it's relatively easy to
create better I/O patterns in the first place.
Many filesystems already try to improve things by generating larger writes
than the kernel asks them to. Each filesystem has its own, specific
hacks, though, and there is no generic solution to the problem. One thing
that some filesystems could apparently do better has to do with their
response to writeback on files where delayed allocation is being done. The
kernel will often request writeback on a portion of the delayed allocation
range; if the filesystem only completes allocation for the requested range,
excessive fragmentation of the file may result. So, in response to
writeback requests where the destination blocks have not yet been
allocated, the filesystem should always allocate everything it can to
create larger contiguous chunks.
There are a couple of patches out there aimed at the goal of eliminating
direct reclaim; both are based on the technique of blocking tasks which are
dirtying pages until pages can be written back elsewhere in the system.
The first of these was written by Jan; he was, he said, aiming at making
the code as simple as possible. With this patch, a process which goes over
its dirty page limit will be put on a wait queue. Occasionally the system
will look at the I/O completions on each device and "distribute" those
completions among the processes which are waiting on that device. Once a
process has accumulated enough completions, it will be allowed to continue
executing. Processes are also resumed if they go below their dirty limit
for some other reason.
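The completion-distribution idea described above can be modeled in a few lines. This is only an illustrative sketch, not Jan's code: the `Waiter` class, the `distribute_completions()` function, and the even split of completions among waiters are all assumptions made for the example. It models one periodic pass over the tasks waiting on a single device.

```python
from dataclasses import dataclass


@dataclass
class Waiter:
    """Hypothetical model of a task blocked after exceeding its dirty limit."""
    name: str
    needed: int        # completions required before the task may resume
    received: int = 0  # completions credited to the task so far


def distribute_completions(waiters, completions):
    """Run one periodic pass for a single device.

    The completions seen since the last pass are split evenly among the
    waiting tasks (an assumption; the real policy may differ). Returns a
    (resumed, still_waiting) pair of lists.
    """
    if not waiters:
        return [], []
    share = completions // len(waiters)
    resumed, still_waiting = [], []
    for w in waiters:
        w.received += share
        if w.received >= w.needed:
            resumed.append(w)       # enough completions: let the task run
        else:
            still_waiting.append(w) # keep it on the wait queue
    return resumed, still_waiting
```

In this model, a task that needs few completions resumes after a single pass while a heavier dirtier waits for several, which is one way to picture the bursty latencies mentioned in the discussion.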
The task of distributing completions runs every 100ms currently, leading to
concerns that this patch could cause 100ms latencies for running
processes. That could happen, but writeback problems can cause worse
latencies now. Jan was also asked if control groups should be used for
this purpose; his response was that he had considered the idea but put it
aside because it added too much complexity at this time. There were also
worries about properly distributing completions to processes; the code is
inexact, but, as Chris Mason put it, getting more consistent results than
current kernels is not a particularly hard target to hit.
Evidently this patch set does not yet play entirely well with network
filesystems; completions are harder to track and the result ends up being
bursty.
The alternative patch comes from Wu Fengguang. The core idea is the
same, but it works by attempting to limit the dirtying of pages based on
the amount of I/O bandwidth which is available for writeback. The
bandwidth calculations are said to be complex to the point that mere memory
management hackers have a hard time figuring out how it all works. When all
else fails, and the system goes beyond the global dirty limit, processes
will simply be put to sleep for 200ms at a time to allow the I/O subsystem
to catch up.
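A heavily simplified model may help to show the shape of this approach. It is not Fengguang's algorithm, whose bandwidth estimation is far more involved; the `throttle_pause()` function, its parameters, and the simple "pages divided by bandwidth" rule are assumptions for illustration. Only the 200ms fallback sleep comes from the discussion.

```python
HARD_PAUSE_SECONDS = 0.2  # the 200ms sleep used once the global limit is exceeded


def throttle_pause(pages_dirtied, writeback_pages_per_sec,
                   dirty_pages, global_dirty_limit):
    """Return how long (in seconds) a dirtying task should sleep.

    Hypothetical rule: below the global limit, the task pauses for as long
    as the device would need to write back the pages it just dirtied, so
    that dirtying cannot outrun the estimated writeback bandwidth. Above
    the limit (or with no usable estimate), fall back to a fixed sleep.
    """
    if dirty_pages > global_dirty_limit or writeback_pages_per_sec <= 0:
        return HARD_PAUSE_SECONDS
    return pages_dirtied / writeback_pages_per_sec
```

The sketch also makes the weakness raised in the session visible: everything depends on `writeback_pages_per_sec` being a good estimate, and a shift from sequential to random I/O would invalidate it.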
Mike Rubin complained that this patch would lead to unpredictable 200ms
latencies any time that 20% of memory (the default global dirty limit) is
dirtied. It was agreed that this is unfortunate, but it's no worse than
what happens now. There was some talk of making the limit higher, but in
the end that would change little; if pages are being dirtied faster than
they can be written out, any limit will be hit eventually. Putting the
limit too high, though, can lead to livelocks if the system becomes
completely starved of memory.
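The point that raising the limit changes little can be checked with simple arithmetic. The function below is a sketch with made-up names and numbers: when pages are dirtied faster than they are written back, the time until the limit is reached grows only linearly with the limit.

```python
def seconds_until_limit(limit_pages, dirty_rate, writeback_rate):
    """Time until `limit_pages` dirty pages accumulate, starting from zero.

    Rates are in pages per second. Returns None when writeback keeps up,
    since the limit is then never reached.
    """
    net_rate = dirty_rate - writeback_rate
    if net_rate <= 0:
        return None
    return limit_pages / net_rate
```

With a hypothetical dirtying rate of 300 pages/s against 100 pages/s of writeback, a limit of 1000 pages is hit after 5 seconds; doubling the limit to 2000 pages only delays the stall to 10 seconds.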
Another problem with this approach is that it's all based on the bandwidth
of the underlying block device. If the I/O pattern changes - a process
switches from sequential to random I/O, for example - the bandwidth
estimates will be wrong and the results will not be optimal. The code also
has no way of distinguishing between processes with different I/O patterns,
with the result that those doing sequential I/O will be penalized in favor
of other processes with worse I/O patterns.
Given that, James Bottomley asked, should we try to limit I/O operations
instead of bandwidth? The objection to that idea is that it requires
better tracking of the ownership of pages; the control group mechanism can
do that tracking, but it brings a level of complexity and overhead that is
not pleasing to everybody. It was asserted that control groups are
becoming more mandatory all the time, but that view has not yet won over
the entire community.
A rough comparison of the two approaches leads to the conclusion that Jan's
patch can cause bursts of latency and occasional pauses. Fengguang's
patch, instead, has smoother behavior but does not respond well to changing
workloads; it also suffers from the complexity of the bandwidth estimation
code. Beyond that, from the (limited) measurements which were presented,
the two seem to have similar performance characteristics.
What's the next step? Christoph suggested rebasing Fengguang's more
complex patch on top of Jan's simpler version, merging the simple patch, and
comparing from there. Before that comparison can be done, though, there
need to be more benchmarks run on more types of storage devices. Ted Ts'o
expressed concerns that the patches are insufficiently tested and might
cause regressions on some devices. The community as a whole still lacks a
solid idea of what benchmarks best demonstrate the writeback problems, so
it's hard to say if the proposed solutions are really doing the job. That
said, Ted liked the simpler patch for a reason that had not yet been
mentioned: by doing away with direct reclaim, it at least gets rid of the
stack space exhaustion problem. Even in the absence of performance
benefits, that would be a good reason to merge the code.
But it would be good to measure the patch's performance effects, so better
benchmarks are needed. Getting those benchmarks is easier said than done,
though; at least some of them need to run on expensive, leading-edge
hardware which must be updated every year. There are also many workloads
to test, many of which are not easily available to the public. Mike Rubin
did make a promise that Google would post at least three different
benchmarks in the near future.
There was some sympathy for the idea of merging Jan's patch, but Jan would
like to see more testing done first. Some workloads will certainly
regress, possibly as a result of performance hacks in specific
filesystems. Andrew Morton said that there needs to be a plan for
integration with the memory control group subsystem; he would also like a
better analysis of what writeback problems are solved by the patch. In the
near future, the code will be posted more widely for review and testing.
VFS summary
James Bottomley started off the session on the virtual filesystem layer
with a comment that, despite the fact that there were a number of "process
issues" surrounding the merging of Nick Piggin's virtual filesystem
scalability patches, this session was meant to be about technical issues
only. Nick was the leader of the session, since the planned leader, Al
Viro, was unable to attend due to health problems. As it turned out,
Nick wanted to talk about process issues, so that is where much of the
time was spent.
The merging of the scalability work, he said, was not ideal. The patches
had been around for a year, but there had not been enough review of them
and he knew it at the time. Linus knew it too, but decided to merge that
work anyway as a way of forcing others to look at it. This move worked,
but at the cost of creating some pain for other VFS developers.
Andrew Morton commented that the merging of the patches wasn't a huge
problem; sometimes the only way forward is to "smash something in and fix
it up afterward." The real problem, he said, was that Nick disappeared
after the patches went in; he wasn't there to support the work. Developers
really have to be available after merging that kind of change. If
necessary, Andrew said, he is available to lean on a developer's manager if
that's what it takes to make the time available.
In any case, the code is in and mostly works. The autofs4 automounter is
said to be "still sick," but that may not be entirely a result of the VFS
work - a lot of automounter changes went in at the same time.
An open issue is that lockless dentry lookup cannot happen on a directory
which has extended attributes or when there is any sort of security module
active. Nick acknowledged the problem, but had no immediate solution to
offer; it is a matter of future work. Evidently supporting lockless lookup
in such situations will require changes within the filesystems as well as
at the VFS layer.
A project that Nick seemed more keen to work on was adding per-node lists
for data structures like dentries and inodes, enabling them to be reclaimed
on a per-node basis. Currently, a memory shortage on one NUMA node can
force the eviction of inodes and dentries on all nodes, even though memory
may not be in short supply elsewhere in the system. Christoph Hellwig was
unimpressed by the idea, suggesting that Nick should try it and he would see
how badly it would work. Part of the problem, it seems, is that, for many
filesystems, an inode cannot be reclaimed until all associated pages have
been flushed to disk. Nick suggested that this was a problem in need of
fixing, since it makes memory management harder.
A related problem, according to Christoph, is that there is no coordination
when it comes to the reclaim of various data structures. Inodes and
dentries are closely tied, for example, but they are not reclaimed
together, leading to less-than-optimal results. There were also
suggestions that more of the reclaim logic for these data structures should
be moved to the slab layer, which already has a fair amount of
node-specific knowledge.
Transcendent memory
Dan Magenheimer led a session on his transcendent memory work. The core idea
behind transcendent memory - a type of memory which is only addressable on
a page basis and which is not directly visible to the kernel - remains the
same, but the uses have expanded. Transcendent memory has been tied to
virtualization (and to Xen in particular) but, Dan says, he has been
pushing it toward more general uses and has not written a line of Xen code
in six months.
So where has this work gone? It has resulted in "zcache," a mechanism for
in-RAM memory compression which was merged into the staging tree for
2.6.39. There is increasing support for devices meant to extend the amount
of available RAM - solid-state storage and phase-change memory devices, for
example. The "ramster" module is a sort of peer-to-peer memory mechanism
allowing pages to be moved around a cluster; systems which have free memory
can host pages for systems which are under memory stress. And yes,
transcendent memory can still be used to move RAM into and out of virtual
machines.
All of the above can be supported behind the cleancache and frontswap patches, which Linus
didn't get around to merging for 2.6.39-rc1. He has not yet said "no," but
the chances of a 2.6.39 merge seem to be declining quickly.
Hugh Dickins voiced a concern that frontswap is likely to be filled
relatively early in a system's lifetime, and what will end up there is
application initialization code which is unlikely to ever be needed again.
What is being done to avoid filling the frontswap area with useless stuff?
Dan acknowledged that it could be a problem; one solution is to have a
daemon which would fetch pages back out of frontswap when memory pressure
is light. Hugh worried that, in the long term, the kernel was going to
need new data structures to track pages as they are put into hierarchical
swap systems and that frontswap, which lacks that tracking, may be a step
in the wrong direction.
Andrea Arcangeli said that a feature like zcache could well be useful for
guests running under KVM as well. Dan agreed that it would be nice, but
that he (an Oracle employee) could not actually do the implementation.
How memory control groups are used
This session was a forum in which representatives of various companies
could talk about how they are making use of the memory control group
functionality. It turns out that this feature is of interest to a wide
range of users.
Ying Han gave the surprising news that Google has a lot of machines to
manage, but also a lot of jobs to run. So the company is always trying to
improve the utilization of those machines, and memory utilization in
particular. Lots of jobs tend to be packed into each machine, but that
leads to interference; there simply is not enough isolation between tasks.
Traditionally, Google has used a "fake NUMA" system to partition memory
between groups of tasks, but that has some problems. Fake NUMA suffers
from internal fragmentation, wasting a significant amount of memory. And
Google is forced to carry a long list of fake NUMA patches which are not
ever likely to make it upstream.
So Google would rather make more use of memory control groups, which are
upstream and which, thus, will not be something the company has to maintain
on its own indefinitely. Much work has gone upstream, but there are still
unmet needs. At the top of the list is better accounting of kernel
allocations; the memory controller currently only tracks memory allocations
made from user space. There is also a need for "soft limits" which would
be more accommodating of bursty application behavior; this topic was
discussed in more detail later on.
Hiroyuki Kamezawa of Fujitsu said that his employer deals with two classes
of customers in particular: those working in the high-performance computing
area, and government. The memory controller is useful in both situations,
but it has one big problem: performance is not what they really need it to
be. Things get especially bad when control groups begin to push up against
the limits. So his work is mostly focused on improving the performance of
memory control groups.
Larry Woodman of Red Hat, instead, deals mostly with customers in the
financial industry. These customers are running trading systems with tight
deadlines, but they also need to run backups at regular intervals during
the day. The memory controller allows these backups to be run in a small,
constrained group which enables them to make forward progress without
impeding the trading traffic.
Michal Hocko of SUSE deals with a different set of customers, working in an
area which was not well specified. These customers have large datasets
which take some time to compute, and which really need to be kept in RAM if
at all possible. Memory control groups allow those customers to protect
that memory from reclaim. They work like a sort of "clever
mlock()" which keeps important memory around most of the time, but
which does not get in the way of memory overcommit.
Pavel Emelyanov of Parallels works with customers operating as virtual
hosting Internet service providers. They want to be able to create
containers with a bounded amount of RAM to sell to customers; memory control
groups enable that bounding. They also protect the system against memory
exhaustion denial-of-service attacks, which, it seems, are a real problem
in that sector. He would like to see more graceful behavior when memory
runs out in a specific control group; it should be signaled as a failure in
a fork() or malloc() call rather than a segmentation
fault signal at page fault time. He would also like to see better
accounting of memory use to help with the provisioning of containers.
David Hansen of IBM talked about customers who are using memory control
groups with KVM to keep guests from overrunning their memory allocations.
Control groups are nice because they apply limits while still allowing the
guests to use their memory as they see fit. One interesting application is
in cash registers; these devices, it seems, run Linux with an emulation
layer that can run DOS applications. Memory control groups are useful for
constraining the memory use of these applications. Without this control,
these applications can grow until the OOM killer comes into play; the OOM
killer invariably kills the wrong process (from the customer's point of
view), leading to the filing of bug reports. The value of the memory
controller is not just that it constrains memory use - it also limits the
number of bug reports that he has to deal with.
Coly Li, representing Taobao, talked briefly about that company's use of
memory control groups. His main wishlist item was the ability to limit
memory use based on the device which is providing backing store.
What's next for the memory controller
The session on future directions for the memory controller featured
contributors who were both non-native English speakers and quite
soft-spoken, so your editor's notes are, unfortunately, incomplete.
One topic which came up repeatedly was the duplication of the
least-recently used (LRU) list. The core VM subsystem maintains LRU lists
in an attempt to track which pages have gone unused for the longest time
and which, thus, are unlikely to be needed in the near future. The memory
controller maintains its own LRU list for each control group, leading to a
wasteful duplication of effort. There is a strong desire to fix this
problem by getting rid of the global LRU list and performing all memory
management with per-control-group lists. This topic was to come back
later in the day.
Hiroyuki Kamezawa complained that the memory controller currently tracks
the sum of RAM and swap usage. There could be value, he said, in splitting
swap usage out of the memory controller and managing it separately.
Management of kernel memory came up again. It was agreed that this was a
hard problem, but there are reasons to take it on. The first step, though,
should be simple accounting of kernel memory usage; the application of
limits can come later. Pavel noted that it will never be possible to track
all kernel memory usage, though; some allocations can never be easily tied
to specific control groups. Memory allocated in interrupt handlers is one
example. It also is often undesirable to fail kernel allocations even when
a control group is over its limits; the cost would simply be too high.
Perhaps it would be better, he said, to focus on specific problem sources.
Page tables, it seems, are a data structure which can soak up a lot of
memory and a place where applying limits might make sense.
The way shared pages are accounted for was discussed for a bit. Currently,
the first control group to touch a page gets charged for it; all subsequent
users get the page for free. So if one control group pages in the entire C
library, it will find its subsequent memory use limited while everybody
else gets a free ride. In practice, though, this behavior does not seem to
be a problem; a control group which is carrying too much shared data will
see some of it reclaimed, at which point other users will pick up the
cost. Over time, the charging for shared pages is distributed throughout
the system, so there does not seem to be a need for a more sophisticated
mechanism for accounting for them.
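The first-touch behavior described above can be sketched as a toy model. The `SharedPageAccounting` class and its methods are hypothetical names, not the memory controller's interfaces; the model only tracks which group is charged for each page.

```python
class SharedPageAccounting:
    """Toy model of first-touch charging for shared pages."""

    def __init__(self):
        self.owner = {}    # page id -> group currently charged for it
        self.charged = {}  # group name -> number of pages charged

    def touch(self, group, page):
        """Charge `group` for `page` only if nobody is charged for it yet."""
        if page not in self.owner:
            self.owner[page] = group
            self.charged[group] = self.charged.get(group, 0) + 1

    def reclaim(self, page):
        """Drop a page; the next group to touch it will pick up the charge."""
        group = self.owner.pop(page, None)
        if group is not None:
            self.charged[group] -= 1
```

If group A touches four shared pages first, group B uses them for free. Once one of A's pages is reclaimed and B touches it again, the charge moves to B, which is how the cost spreads across groups over time.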
Local filesystems in the cloud
Mike Rubin of Google ran a plenary session on the special demands that
cloud computing puts on filesystems. Unfortunately, the notes on this
talk are also incomplete due to a schedule misunderstanding.
What cloud users need from a filesystem is predictable performance, the
ability to share systems, and visibility into how the filesystem works.
Visibility seems to be the biggest problem; it is hard, he said, to figure
out why even a single machine is running slowly. Tracking down problems
in an environment consisting of thousands of machines is far harder
still.
Part of that problem is just understanding a filesystem's resource
requirements. How much memory does an ext4 filesystem really need? That
number turns out to be 2MB of RAM for every terabyte of disk space managed
- a number which nobody had been able to provide. Just as important is the
metadata overhead - how much of a disk's bandwidth will be consumed by
filesystem metadata? In the past, Google has been surprised when adding
larger disks to a box has caused the whole system to fall over;
understanding the filesystem's resource demands is important to prevent
such things from happening in the future.
Tracing, he said, is important - he does not know how people ever lived
without it. But there need to be better ways of exporting the
information; there is a lack of user-space tools which can integrate the
data from a large number of systems. Ted Ts'o added that the "blktrace"
tool is fine for a single system where root access is available. In a
situation where there are hundreds or thousands of machines, and where
developers may not have root access on production systems, blktrace does
not do the job. There needs to be a way to get detailed, aggregated
tracing information - with file name information - without root access.
Michael said that he is happy that Google's storage group has been
upstreaming almost everything they have done. But, he said, the group has
a "diskmon" tool which still needs to see the light of day. It can create
histograms of activity and latencies at all levels of the I/O stack,
showing how long each operation took and how much of that time was consumed
by metadata. It is all tied to a web dashboard which can highlight
problems down to the ID of the process which is having trouble.
This tool is useful, but it is not yet complete. What we really need, he
said, is to have that kind of visibility designed into kernel subsystems
from the outset.
Michael concluded by saying that, in the beginning, his group was nervous
about engaging with the development community. Now, though, they feel that
the more they do it, the better it gets.
Memory controller targeted reclaim
Back in the memory management track, Ying Han led a session on improving
reclaim within the memory controller. When things get tight, she said, the
kernel starts reclaiming from the global LRU list, grabbing whatever pages
it finds. It would be far better to reclaim pages specifically from the
control groups which are causing the problem, limiting the impact on the
rest of the system.
One technique Google uses is soft memory limits in the memory controller.
Hard limits place an absolute upper bound on the amount of memory any group
can use. Soft limits, instead, can be exceeded, but only as long as the
system as a whole is not suffering from memory contention. Once memory
gets tight at the global level, the soft limits are enforced; that
automatically directs reclaim at the groups which are most likely to be
causing the global stress.
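The difference between the two kinds of limit can be put as a pair of predicates. This is a minimal sketch under invented names (memcg_limits, charge_allowed(), is_reclaim_target()); the real memory controller is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group state, counted in pages. */
struct memcg_limits {
    size_t usage;
    size_t soft_limit;
    size_t hard_limit;
};

/* Hard limit: a charge which would exceed it is never allowed. */
static bool charge_allowed(const struct memcg_limits *g, size_t nr_pages)
{
    assert(g->soft_limit <= g->hard_limit);
    return g->usage + nr_pages <= g->hard_limit;
}

/* Soft limit: exceeding it only matters once memory is tight globally. */
static bool is_reclaim_target(const struct memcg_limits *g,
                              bool global_pressure)
{
    return global_pressure && g->usage > g->soft_limit;
}
```

A group sitting between its soft and hard limits is left alone while the system is relaxed, and becomes a reclaim target as soon as global pressure appears.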
Adding per-group background reclaim, which would slowly clean up pages in
the background, would help the situation, she said. But the biggest
problem is the global LRU list. Getting rid of that list would eliminate
contention on the per-zone LRU lock, which is a problem, but, more
importantly, it would improve isolation between groups. Johannes Weiner
worried that eliminating the global LRU list would deprive the kernel of
its global view of memory, making zone balancing harder; Rik van Riel
responded that we are able to balance pages between zones using per-zone
LRU lists now; we should, he said, be able to do the same thing with
control groups.
The soft limits feature can help with global balancing. There is a
problem, though, in that configuring those limits is not an easy task. The
proper limits depend on the total load on the system, which can change over
time; getting them right will not be easy.
Andrea Arcangeli made the point that whatever is done with the global LRU
list cannot be allowed to hurt performance on systems where control groups
are configured out. The logic needs to transparently fall back to
something resembling the current implementation. In practice, that
fallback is likely to take the form of a "global control group" which
contains all processes which are not part of any other group. If control
groups are not enabled, the global group would be the only one in
existence.
Shrinking struct page_cgroup
The system memory map contains one struct page for every page in
the system. That's a lot of structures, so it's not surprising that
struct page is, perhaps, the most tightly designed structure in
the kernel. Every bit has been placed into service, usually in multiple
ways. The memory controller has its own per-page information requirements;
rather than growing struct page, the memory controller developers
created a separate struct page_cgroup instead. That structure
looks like this:
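    struct page_cgroup {
        unsigned long flags;            /* per-page memory controller flags */
        struct mem_cgroup *mem_cgroup;  /* control group charged for the page */
        struct page *page;              /* backpointer to the struct page */
        struct list_head lru;           /* entry on the controller's LRU list */
    };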
The existence of one of these structures for every page in the system is
part of why enabling the memory controller is expensive. But Johannes
Weiner thinks that he can reduce that overhead considerably - perhaps to
zero.
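To get a feel for the overhead, one can simply multiply the structure size by the number of pages. The sketch below declares user-space stand-ins for the kernel types so that the layout compiles outside the kernel; page_cgroup_overhead() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in declarations; only the sizes matter here. */
struct mem_cgroup;
struct page;
struct list_head { struct list_head *next, *prev; };

struct page_cgroup {
    unsigned long flags;
    struct mem_cgroup *mem_cgroup;
    struct page *page;
    struct list_head lru;
};

/* Bytes of page_cgroup metadata needed to cover ram_bytes of memory. */
static unsigned long long page_cgroup_overhead(unsigned long long ram_bytes,
                                               unsigned long page_size)
{
    assert(page_size > 0);
    return (ram_bytes / page_size) * sizeof(struct page_cgroup);
}
```

Assuming a typical 64-bit build where unsigned long and pointers are eight bytes each, the structure occupies 40 bytes; with 4KB pages that works out to roughly 1% of memory.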
Like the others, Johannes would like to get rid of the duplicated LRU
lists; that would allow the lru field to be removed from this
structure. It should also be possible to remove the struct page
backpointer by using a single LRU list as well. The struct
mem_cgroup pointer, he thinks, is excessive; there will usually be a
bunch of pages from a single file used in any given control group. So what
is really needed is a separate structure to map from a control group to the
address_space structure representing the backing store for a set
of pages. Ideally, he would point to that structure (instead of struct
address_space) in struct page, but that would require some
filesystem API changes.
The final problem is getting rid of the flags field. Some of the
flags used in this structure, Johannes thinks, can simply be eliminated.
The rest could be moved into struct page, but there is little room
for more flags there on 32-bit systems. How that problem will be resolved
is not yet entirely clear. One way or the other, though, it seems that
most or all of the memory overhead associated with the memory controller
can be eliminated with some careful programming.
Memory compaction
Mel Gorman talked briefly about the current state of the memory compaction code, which is charged with
the task of moving pages around to create larger, physically-contiguous
ranges of free pages. The rate of change in this code, he said, has
slowed "somewhat." Initially, the compaction code was relatively
primitive; it only had one user (hugetlbfs) to be concerned about. Since
then, the lumpy reclaim code has been
mostly pushed out of the kernel, and transparent huge pages have greatly increased
the demands on the compaction code.
Most of the problems with compaction have been fixed. The last was one in
which interrupts could be disabled for long periods - up to about a half
second, a situation which Mel described as "bad." He also noted that it
was distressing to see how long it took to find the bug, even with tools
like ftrace available. There are more interrupt-disabling problems in the
kernel, he said, especially in the graphics drivers.
One remaining problem with compaction is that pages are removed from the
LRU list while they are being migrated to their new location; then they are
put back at the head of the list. As a result, the kernel forgets what it
knew about how recently the page has actually been used; pages which should
have been reclaimed can live on as a result of compaction. A potential
fix, suggested by Minchan Kim, is to remember which pages were on either
side of the moved page in the LRU list; after migration, if those two pages
are still together on the LRU, it probably makes sense to reinsert the
moved page between them. Mel asked for comments on this approach.
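The heuristic attributed to Minchan Kim can be sketched with a simple circular list. This is an illustrative user-space model, not the proposed patch; lru_page, lru_neighbors and the lru_*() helpers are invented names, and the model assumes that the remembered neighbors are not freed while migration is in progress.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly-linked list with a sentinel head. */
struct lru_page {
    struct lru_page *prev, *next;
};

static void lru_init(struct lru_page *head)
{
    head->prev = head->next = head;
}

static void lru_insert_after(struct lru_page *pos, struct lru_page *page)
{
    page->prev = pos;
    page->next = pos->next;
    pos->next->prev = page;
    pos->next = page;
}

static void lru_remove(struct lru_page *page)
{
    page->prev->next = page->next;
    page->next->prev = page->prev;
    page->prev = page->next = NULL;   /* mark as off-list */
}

/* The pages on either side of an isolated page. */
struct lru_neighbors {
    struct lru_page *prev, *next;
};

/* Take a page off the LRU for migration, remembering its neighbors. */
static struct lru_neighbors lru_isolate(struct lru_page *page)
{
    struct lru_neighbors n = { page->prev, page->next };

    assert(n.prev != NULL && n.next != NULL);
    lru_remove(page);
    return n;
}

/*
 * Put the migrated page back. If the old neighbors are still adjacent,
 * reinsert between them; otherwise fall back to the head of the list.
 */
static void lru_putback(struct lru_page *head, struct lru_page *page,
                        struct lru_neighbors n)
{
    if (n.prev->next == n.next && n.next->prev == n.prev)
        lru_insert_after(n.prev, page);
    else
        lru_insert_after(head, page);
}
```

The fallback branch reproduces the behavior described above, in which the page lands at the head of the list and its age information is lost.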
Rik van Riel noted that, when transparent huge pages are used, the chances
of physically-contiguous pages appearing next to each other in the LRU list
are quite high; splitting a huge page will create a whole set of contiguous
pages. In that situation, compaction is likely to migrate several
contiguous pages together; that would break Minchan's heuristic. So Mel is
going to investigate a different approach: putting the destination page
into the LRU in the original page's place while migration is underway.
There are some issues that need to be resolved - what happens if the
destination page falls off the LRU and is reclaimed during migration, for
example - but that approach might be workable.
Mel also talked briefly about some experiments he ran writing large trees
to slow USB-mounted filesystems. Things have gotten better in this area,
but the sad fact is that generating lots of dirty pages which must be
written back to a USB stick can still stall the system for a long time. He
was surprised to learn that the type of filesystem used on the device makes
a big difference; VFAT is very slow, ext3 is better, and ext4 is better
yet. What, he asked, is going on?
There was a fair amount of speculation without a lot of hard conclusions.
Part of the problem is probably that the filesystem (ext3, in particular)
will end up blocking processes which are waiting on buffers until a big
journal commit frees some buffers. That can cause writes to a slow device
to stall unrelated processes. It seems that there is more going on,
though, and the problem is not yet solved.
Per-CPU variables
Christoph Lameter and Tejun Heo discussed per-CPU data. For the most part,
the session was a beginner-level introduction to this feature and its
reason for existence; see this article if a
refresher is needed. There was some talk about future applications of
per-CPU variables; Christoph thinks that there is a lot of potential for
improving scalability in the VFS layer in particular. Further in the
future, it might make sense to confine certain variables to specific CPUs,
which would then essentially function as servers for the rest of the
kernel; LRU scanning was one function which could maybe be implemented in
this way.
There was some side talk about the limitations placed on per-CPU variables
on 32-bit systems. Those limits exist, but 32-bit systems also create a
number of other, more severe limits. It was agreed that the limit to
scalability with 32 bits was somewhere between eight and 32 CPUs.
Lightning talks
The final session of the day was a small set of lightning talks. Your
editor will plead "incomplete notes" one last time; perhaps the long day
and the prospect of beer caused a bit of inattention.
David Howells talked about creating a common infrastructure for the
handling of keys in network filesystems. Currently most of these
filesystems handle keys for access control, but they all have their own
mechanisms. Centralizing this code could simplify a lot of things. He
would also like to create a common layer for the mapping of user IDs while
he is at it.
David also talked about a scheme for the attachment of attributes to
directories at any level of a network filesystem. These attributes would
control behavior like caching policies. There were questions as to why the
existing extended attribute mechanism could not be used; it came down to a
desire to control policy on the client side when root access to the server
might not be available.
Matthew Wilcox introduced the "NVM Express" standard to the group. This
standard describes the behavior of solid-state devices connected via
PCI-Express. The standard was released on March 1; a Linux driver, he
noted with some pride, was shipped on the same day. The Windows driver is
said to be due within 6-9 months; actual hardware can be expected within
about a year.
The standard seems to be reasonably well thought out; it provides for all
of the functionality one might expect on these devices. It allows devices
to implement multiple "namespaces" - essentially separate logical units
covering parts of the available space. There are bits for describing the
expected access patterns, and a "this data is already compressed so don't
bother trying to compress it yourself" bit. There is a queued "trim"
command which, with luck, won't destroy performance when it is used.
How the actual hardware will behave remains to be seen; Matthew
disappointed the audience with his failure to have devices to hand out.
Day 2
See this page for reporting from the second day of the summit.
By Jonathan Corbet
April 6, 2011
This article covers the second day of the 2011 Linux Filesystem, Storage,
and Memory Management Summit, held on April 5, 2011 in San Francisco,
California. Those who have not yet seen the
first day coverage may want to have a look
before continuing here.
The opening plenary session was led by Michael Cornwall, the global
director for technology standards at IDEMA, a standards organization for disk
drive manufacturers. His talk, which was discussed in a separate article, covered the changes
that are coming in the storage industry and how the Linux community can get
involved to make things work better.
I/O resource management
The main theme of the memory management track often appeared to be "control
groups"; for one session, though, the entire gathering got to share the control
group fun as Vivek Goyal, Fernando Cao, and Chad Talbott led a discussion on
I/O bandwidth management. There are two I/O bandwidth controllers in the
kernel now: the throttling controller (which can limit control groups to an
absolute bandwidth value) and the proportional controller (which divides up
the available bandwidth between groups according to an administrator-set
policy). Vivek was there to talk about the throttling controller, which is
in the kernel and working, but which still has a few open issues.
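The two policies differ in what they promise a group, which a pair of toy functions can show. This is only arithmetic under assumed names (throttled_bw(), proportional_bw()); neither function reflects how the block layer actually schedules requests.

```c
#include <assert.h>

/* Throttling: a group never gets more than its absolute cap. */
static unsigned long throttled_bw(unsigned long demand, unsigned long cap)
{
    return demand < cap ? demand : cap;
}

/* Proportional: a group gets weight/total_weight of what the device offers. */
static unsigned long proportional_bw(unsigned long device_bw,
                                     unsigned int weight,
                                     unsigned int total_weight)
{
    assert(total_weight > 0 && weight <= total_weight);
    return (unsigned long)((unsigned long long)device_bw * weight
                           / total_weight);
}
```

With throttling, a group capped at 100 units stays at 100 no matter how idle the device is; with proportional sharing, the same group's allocation rises and falls with the bandwidth available and the weights of its neighbors.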
One of those is that the throttling controller does not play entirely well
with journaling filesystems. I/O ordering requirements will not allow the
journal to be committed before other operations have made it to disk; if
some of those other operations have been throttled by the controller, the
journal commit stalls and the
whole filesystem slows down. Another is that the controller can only
manage synchronous writes; writes which have been buffered through the page
cache have lost their association with the originating control group and
cannot be charged against that group's quota. There are patches to perform throttling of buffered
writes, but that is complicated and intrusive work.
Another problem was pointed out by Ted Ts'o: the throttling controller
applies bandwidth limits on a per-device basis. If a btrfs filesystem is
in use, there may be multiple devices which make up that filesystem. The
administrator would almost certainly want limits to apply to the volume
group as a whole, but the controller cannot do that now. A related problem
is that some users want to be able to apply global limits - limits on the
amount of bandwidth used on all devices put together. The throttling
controller also does not work with NFS-mounted filesystems; they have no
underlying device at all, so there is no place to put a limit.
Chad Talbott talked about the proportional bandwidth controller; it works
well with readers and synchronous writers, but, like the throttling
controller, it is unable to deal with asynchronous writes. Fixing that
will require putting some control group awareness into the per-block-device
flushing threads. The system currently maintains a set of per-device lists
containing inodes with dirty pages; those lists need to be further
subdivided into per-control-group lists to enable the flusher threads to
write out data according to the set policy. This controller also does not
yet properly implement hierarchical group scheduling, though there are patches out there to add that functionality.
The following discussion focused mostly on whether the system is
accumulating too many control groups. Rather than a lot of per-subsystem
controllers, we should really have a cross-subsystem controller mechanism.
At this point, though, we have the control groups (and their associated
user-space API which cannot be broken) that are in the kernel. So, while
some (like James Bottomley) suggested that we should maybe dump the
existing control groups in favor of something new which gets it right, that
will be a tall order. Beyond that, as Mike Rubin pointed out, we don't
really know how control groups should look even now. There has been a lack
of "taste and style" people to help design this interface.
Working set estimation
Back in the memory management track, Michel Lespinasse discussed Google's
working set estimation code. Google has used this mechanism for some time
as a way of optimally placing new jobs in its massive cluster. By getting
a good idea of how much memory each job is really using, they can find the
machines with the most idle pages and send new work in that direction.
Working set estimation, in other words, helps Google to make better
decisions on how to overcommit its systems.
The implementation is a simple kernel thread which scans through the
physical pages on the system, every two minutes by default. It looks at
each page to determine whether it has been touched by user space or not and
remembers that state. The whole idea is to try to figure out how many
pages could be taken away from the system without causing undue memory
pressure on the jobs running there.
The kernel thread works by setting a new "idle" flag on each page which
looks like it has not been referenced. That bit is cleared whenever an
actual reference happens (as determined by looking at whether the VM
subsystem has cleared the "young" bit). Pages which are still marked idle
on the next scan are deemed to be unused. The estimation code does
not take any action to reclaim those pages; it simply exports statistics
on how many unused pages there are through a control group file. The
numbers are split up into clean, swap-backed dirty, and file-backed dirty
pages. It's then up to code in user space to decide what to do with that
information.
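The scanning logic can be approximated as follows. This is a simplified user-space model, not Google's code: est_page, touch() and scan_pass() are invented names, a plain boolean stands in for the hardware reference tracking, and the split into clean and dirty categories is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state for the estimator. */
struct est_page {
    bool referenced;   /* page was touched since the last scan */
    bool idle;         /* set by the scanner, cleared on reference */
};

/* Model of user space touching a page. */
static void touch(struct est_page *p)
{
    p->referenced = true;
    p->idle = false;
}

/*
 * One scan pass. Pages which are still idle from the previous pass are
 * counted as unused; pages which look unreferenced are marked idle so
 * that the next pass can judge them.
 */
static size_t scan_pass(struct est_page *pages, size_t nr)
{
    size_t unused = 0;

    assert(pages != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (pages[i].idle)
            unused++;
        else if (!pages[i].referenced)
            pages[i].idle = true;
        pages[i].referenced = false;   /* start a new observation window */
    }
    return unused;
}
```

As in the scheme described above, the scanner only counts; deciding what to do with the resulting number is left to whoever reads it.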
There were questions about the overhead of the page scanning; Michel said
that scanning every two minutes required about 1% of the available CPU
time. There were also questions about the daemon's use of two additional
page flags; those flags are a limited resource on 32-bit systems. It was
suggested that a separate bitmap outside of the page structure
could be used. Google runs everything in 64-bit mode, though, so there has
been little reason to care about page flag exhaustion so far. Rik van Riel
suggested that the feature could simply not be supported on 32-bit
systems. He also suggested that the feature might be useful in other
contexts; systems running KVM-virtualized guests could use it to control
the allocation of memory with the balloon driver, for example.
Virtual machine sizing
Rik then led a discussion on a related topic: allocating the right amount
of memory to virtual machines. As with many problems, there are two
distinct aspects: policy (figuring out what the right size is for any given
virtual machine) and mechanism (actually implementing the policy
decisions). There are challenges on both sides.
There are a number of mechanisms available for controlling the memory
available to a virtual machine. "Balloon drivers" can be used to allocate
memory in guests and make it available to the host; when a guest needs to
get smaller, the balloon "inflates," forcing the guest to give up some
pages. Page hinting is a mechanism by
which the guest can inform the host that certain pages do not contain
useful data (for example, they are on the guest's free list). The host can
then reclaim memory used for so-hinted pages without the need to write them
out to backing store. The host can also simply swap the guest's pages out
without involving the guest operating system at all. The KSM mechanism allows the kernel to recover
pages which contain duplicated contents. Compression can be used to cram
data into a smaller number of pages. Page contents can also simply be
moved around between systems or stashed into some sort of transcendent
memory scheme.
There seem to be fewer options on the policy side. The working set
estimation patches are certainly one possibility. One can control memory
usage simply through decisions on the placement of virtual machines. The
transcendent memory mechanism also allows the host to make policy decisions
on how to allocate its memory between guests.
One interesting possibility raised by Rik was to make the balloon mechanism
better. Current balloon drivers tend to force the release of random pages
from the guest; that leads to fragmentation in the host, thwarting attempts
to use huge pages. A better approach might be to use page hinting,
allowing the guest to communicate to the host which pages are free. The
balloon driver could then work by increasing the free memory thresholds
instead of grabbing pages itself; that would force the guest to keep more
pages free. Even better, memory compaction would come into play, so the
guest would be driven to free up contiguous ranges of pages. Since those
pages are marked free, the host can grab them (hopefully as huge pages) and
use them elsewhere. With this approach, there is no need to pass pages
directly to the host; the hinting is sufficient.
There are other reasons to avoid the direct allocation of pages in balloon
drivers; as Pavel Emelyanov pointed out, that approach can lead to
out-of-memory situations in the guest. Andrea Arcangeli stated that, when
balloon drivers are in use, the guest must be configured with enough swap
space to avoid that kind of problem; otherwise things will not be stable.
The policy implemented by current balloon drivers is also entirely
determined by the host system; it's not currently possible to let the guest
decide when it needs to grow.
There is also a simple problem of communication; the host has no
comprehensive view of the memory needs of its guest systems. Fixing that
problem will not be easy; any sort of intrusive monitoring of guest memory
usage will fail to scale well. And most monitoring tends to fall down when
a guest's memory usage pattern changes - which happens frequently.
Few conclusions resulted from this session. There will be a new set of
page hinting patches from Rik in the next few weeks; after that, thought
can be put into doing ballooning entirely through hinting without having to
call back to the host.
Dirty limits and writeback
The memory management track had been able to talk for nearly a full hour
without getting into control groups, but that was never meant to last; Greg
Thelen brought the subject back during his session on the management of
dirty limits within control groups. He made the claim that keeping track
of dirty memory within control groups is relatively easy, but then spent
the bulk of his session talking about the subtleties involved in that
tracking.
The main problem with dirty page tracking is a more general memory
controller issue: the first control group to touch a specific page
gets charged for it, even if other groups make use of that page later.
Dirty page tracking makes that problem worse; if control group "A" dirties
a page which is charged to control group "B", it will be B which is charged
with the dirty page as well. This behavior seems inherently unfair; it
could also perhaps facilitate denial of service attacks if one control
group deliberately dirties pages that are charged to another group.
One possible solution might be to change the ownership of a page when it is
dirtied - the control group which is writing to the page would then be
charged for it thereafter. The problem with that approach is pages which
are repeatedly dirtied by multiple groups; that could lead to the page
bouncing back and forth. One could try a "charge on first dirty" approach,
but Greg was not sure that it's all worth it. He does not expect that
there will be a lot of sharing of writable pages between control groups in
the real world.
The bigger problem is what to do about control groups which hit their dirty
limits. Presumably they will be put to sleep until their dirty page counts
go below the limit, but that will only work well if the writeback code
makes a point of writing back pages which are associated with those control
groups. Greg had three possible ways of making that happen.
The first of those involved creating a new memcg_mapping structure
which would take the place of the address_space structure used to
describe a particular mapping. Each control group would have one of these
structures for every mapping in which it has pages. The writeout code
could then find these mappings to find specific pages which need to be
written back to disk. This solution would work, but is arguably more
complex than is really needed.
An approach which is "a little dumber" would have the system associating
control groups with inodes representing pages which have been dirtied by
those control groups. When a control group goes over its limit, the system
could just queue writeback on the inodes where that group's dirty pages
reside. The problem here is that this scheme does not handle sharing of
inodes well; it can't put an inode on more than one group's list. One
could come up with a many-to-many mechanism allowing the inode to be
associated with multiple control groups, but that code does not exist now.
Finally, the simplest approach is to put a pointer to a memory control
group into each inode structure. When the writeback code scans through the
list of dirty inodes, it could simply skip those which are not associated
with control groups that have exceeded their dirty limit. This approach,
too, does not do sharing well; it also suffers from the disadvantage that
it causes the inode structure to grow.
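The third option is simple enough to sketch directly. The structures below are hypothetical stand-ins (memcg_model, inode_model), not the kernel's struct inode or struct mem_cgroup, and "writeback" is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical control group with a dirty page limit. */
struct memcg_model {
    size_t dirty_pages;
    size_t dirty_limit;
};

/* Hypothetical inode carrying a single control group pointer. */
struct inode_model {
    struct memcg_model *memcg;   /* the one group this inode is tied to */
    bool written_back;
};

static bool over_dirty_limit(const struct memcg_model *cg)
{
    return cg->dirty_pages > cg->dirty_limit;
}

/*
 * Walk the dirty inode list, skipping inodes whose group is within its
 * limit. Returns the number of inodes written back.
 */
static size_t writeback_over_limit(struct inode_model *dirty, size_t nr)
{
    size_t written = 0;

    assert(dirty != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++) {
        if (!dirty[i].memcg || !over_dirty_limit(dirty[i].memcg))
            continue;
        dirty[i].written_back = true;
        written++;
    }
    return written;
}
```

The single memcg pointer makes the sharing problem visible: an inode dirtied by two groups can only ever be attributed to one of them.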
Few conclusions were reached in this session; it seems clear that this code
will need some work yet.
Kernel memory accounting and soft limits
The kernel's memory control group mechanism is concerned with limiting
user-space memory use, but kernel memory can matter too. Pavel Emelyanov
talked briefly about why kernel memory is important and how it can be
tracked and limited. The "why" is easy; processes can easily use
significant amounts of kernel memory. That usage can impact the system in
general; it can also be a vector for denial of service attacks. For
example, filling the directory entry (dentry) cache is just a matter of
writing a loop running "mkdir x; cd x". For as long as
that loop runs, the entire chain of dentries representing the path to the
bottommost directory will be pinned in the cache; as the chain grows, it
will fill the cache and prevent anything else from performing path
lookups.
Tracking every bit of kernel data used by a control group is a difficult
job; it also becomes an example of diminishing returns after a while. Much
of the problem can be solved by looking at just a few data structures.
Pavel's work has focused on three structures in particular: the dentry
cache, networking buffers, and page tables. The dentry cache controller is
relatively straightforward; it can either be integrated into the memory
controller or made into a separate control group of its own.
Tracking network buffers is harder due to the complexities of the TCP
protocol. The networking code already does a fair amount of tracking,
though, so the right solution here is to integrate with that code to create
a separate controller.
Page tables can occupy large amounts of kernel memory; they present some
challenges of their own, especially when a control group hits its limit.
There are two ways a process can grow its page tables; one is via system
calls like fork() or mmap(). If a limit is hit there,
the kernel can simply return ENOMEM and let the process respond as
it will. The other way, though, is in the page fault handler; there is no
way to return a failure status there. The best the controller can do is to
send a segmentation fault signal; that usually just results in the
unexpected death of the program which incurred the page fault. The only
alternative would be to invoke the out-of-memory killer, but that may not
even help: the OOM killer is designed to free user-space memory, not kernel
memory.
Pavel plans to integrate the page table tracking into the memory
controller; patches are forthcoming.
Ying Han got a few minutes to discuss the implementation of soft limits in
the memory controller. As had been mentioned on the first day, soft limits
differ from the existing (hard) limits in that they can be exceeded if the
system is not under global memory pressure. Once memory gets tight, the
soft limits will be enforced.
That enforcement is currently suboptimal, though. The code maintains a
red-black tree in each zone containing the control groups which are over
their soft limits, even though some of those groups may not have
significant amounts of memory in that specific zone. So the system needs
to be taught to be more aware of allocations in each zone.
The response to memory pressure is also not perfect; the code picks the
control group which has exceeded its soft limit by the largest amount and
beats on it until it goes below the soft limit entirely. It would probably
be better to add some fairness to the algorithm and spread the pain among
all of the control groups which have gone over their limits. Some sort of
round-robin algorithm which would cycle through those groups would probably
be a better way to go.
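The two victim-selection policies can be compared side by side. This is a sketch under invented names (soft_group, pick_largest_excess(), pick_round_robin()); it uses a flat array where the kernel, as described above, keeps a per-zone red-black tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group state, counted in pages. */
struct soft_group {
    size_t usage;
    size_t soft_limit;
};

static size_t excess(const struct soft_group *g)
{
    return g->usage > g->soft_limit ? g->usage - g->soft_limit : 0;
}

/* Current behavior: pick the group furthest over its soft limit. */
static int pick_largest_excess(const struct soft_group *groups, size_t nr)
{
    int victim = -1;
    size_t worst = 0;

    for (size_t i = 0; i < nr; i++) {
        if (excess(&groups[i]) > worst) {
            worst = excess(&groups[i]);
            victim = (int)i;
        }
    }
    return victim;   /* -1 if no group is over its limit */
}

/* Alternative: cycle through over-limit groups, starting after "last". */
static int pick_round_robin(const struct soft_group *groups, size_t nr,
                            int last)
{
    size_t start = (size_t)(last + 1);

    assert(last >= -1 && (last < 0 || (size_t)last < nr));
    for (size_t step = 0; step < nr; step++) {
        size_t i = (start + step) % nr;

        if (excess(&groups[i]) > 0)
            return (int)i;
    }
    return -1;
}
```

Calling pick_round_robin() repeatedly, passing back the previous result as "last", visits each over-limit group in turn rather than hammering the worst offender until it drops below its limit.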
There was clearly more to discuss on this topic, but time ran out and the
discussion had to end.
Transparent huge page improvements
Andrea Arcangeli had presented the transparent huge page (THP) patch set at the
2010 Summit and gotten some valuable feedback in return. By the 2011
event, that code had been merged for the 2.6.38 kernel; it still had a
number of glitches, but those have since been fixed up. Since then, THP
has gained some improved statistics support under /proc; there is
also an out-of-tree patch to add some useful information to
/proc/vmstat. Some thought has been put into optimizing libraries
and applications for THP, but there is rarely any need to do that;
applications can make good use of the feature with no changes at all.
There are a number of future optimizations on Andrea's list, though he made
it clear that he does not plan to implement them all himself. The first
item, though - adding THP support to the mremap() system call -
has been completed. Beyond that, he would like to see the process of
splitting huge pages optimized to remove some unneeded TLB flush
operations. The migrate_pages() and move_pages() system
calls are not yet THP-aware, so they split up any huge pages they are asked
to move. Adding a bit of THP awareness to glibc could improve performance
slightly.
The big item on the list is THP support for pages in the page cache;
currently only anonymous pages are supported. There would be some big
benefits beyond another reduction in TLB pressure; huge pages in the page
cache would greatly reduce the number of pages which need to be scanned by
the reclaim code. It is, however, a huge job which would require changes
in all filesystems. Andrea does not seem to be in a hurry to jump into
that task. What might happen first is the addition of huge page support to
the tmpfs filesystem; that, at least, would allow huge pages to be used in
shared memory applications.
Currently THP only works with one size of huge pages - 2MB in most
configurations. What about adding support for 1GB pages as well? That
seems unlikely to happen anytime soon. Working with those pages would be
expensive - a copy-on-write fault on a 1GB page would take a long time to
satisfy. The code changes would not be trivial; the buddy allocator cannot
handle 1GB pages, and increasing MAX_ORDER (which determines the
largest chunk managed by the buddy allocator) would not be easy to do.
And, importantly, the
benefits would be small to the point that they would be difficult to
measure. 2MB pages are enough to gain almost all of the performance
benefits which are available, so supporting larger page sizes is almost
certainly not worth the effort. The only situation in which it might
happen is if 2MB pages become the basic page size for the rest of the
system.
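The buddy allocator limit can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a 4KB base page and a MAX_ORDER of 11, which are common defaults but not something stated in the session, and the order_for() helper is a hypothetical name.

```python
# Illustrative arithmetic only. PAGE_SIZE and MAX_ORDER are assumed values
# for a typical configuration; they were not given in the talk.
PAGE_SIZE = 4096
MAX_ORDER = 11  # the buddy allocator manages orders 0 .. MAX_ORDER - 1


def order_for(size_bytes):
    """Return the smallest buddy order whose block can hold size_bytes."""
    pages = -(-size_bytes // PAGE_SIZE)  # ceiling division
    return (pages - 1).bit_length()


largest_block = PAGE_SIZE << (MAX_ORDER - 1)
order_2mb = order_for(2 * 1024 * 1024)
order_1gb = order_for(1024 * 1024 * 1024)

print(f"largest buddy block: {largest_block // 2**20}MB")
print(f"2MB huge page: order {order_2mb}, fits: {order_2mb < MAX_ORDER}")
print(f"1GB huge page: order {order_1gb}, fits: {order_1gb < MAX_ORDER}")
print(f"base pages copied on a 1GB copy-on-write fault: {1 << order_1gb}")
```

Under those assumptions a 2MB page is an order-9 allocation, comfortably within the allocator's range, while a 1GB page would need order 18.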
Might a change in the primary page size happen? Not anytime soon. Andrea
actually tried it some years ago and ran into a number of problems. Among
other things, a larger page size would change a number of system call
interfaces in ways which would break applications. Kernel stacks would
become far more expensive; their implementation would probably have to
change. A lot of memory would be wasted in internal fragmentation. And a
lot of code would have to change. One should not expect a page size change
to happen in the foreseeable future.
NUMA migration
Non-uniform memory access systems are characterized by the fact that some
memory is more expensive to access than the rest. For any given node in
the system, memory which is local to that node will be faster than memory
found elsewhere in the system. So there is a real advantage to keeping
processes and their memory together. Rik van Riel made the claim that this
is often not happening. Long-running processes, in particular, can have
their memory distributed across the system; that can result in a 20-30%
performance loss. He would like to get that performance back.
His suggestion was to give each process a "home node" where it would run if
at all possible. The home node differs from CPU affinity in that the
scheduler is not required to observe it; processes can be migrated away
from their home node if necessary. But, when the scheduler performs load
balancing, it would move processes back to their homes whenever possible.
Meanwhile, the process's memory allocations would be performed on the home
node regardless of where the process is running at the time. The end
result should be processes running with local memory most of the time.
There are some practical difficulties with this scheme, of course. The
system may end up with a mix of processes which all got assigned to the
same home node; there may then be no way to keep them all there. It's not
clear what should happen if a process creates more threads than can be
comfortably run on the home node. There were also concerns about
predictability; the "home node" scheme might create wider variability
between identical runs of a program. The consensus, though, was that speed
beats predictability and that this idea is worth experimenting with.
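As a rough illustration of the idea, the toy model below separates where a task runs from where its memory is allocated. It is only a sketch: Task, allocation_node(), rebalance(), and the per-node capacity parameter are all hypothetical names, and nothing here reflects actual scheduler code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    home_node: int     # preferred node; a hint, not a hard affinity
    current_node: int  # where the task happens to be running now


def allocation_node(task):
    """Memory always comes from the home node, wherever the task runs."""
    return task.home_node


def rebalance(tasks, capacity):
    """Move tasks back to their home nodes while those nodes have room.

    'capacity' (tasks per node) is an assumed stand-in for whatever load
    metric a real load balancer would use.
    """
    load = {}
    for t in tasks:
        load[t.current_node] = load.get(t.current_node, 0) + 1
    for t in tasks:
        if t.current_node != t.home_node and load.get(t.home_node, 0) < capacity:
            load[t.current_node] -= 1
            load[t.home_node] = load.get(t.home_node, 0) + 1
            t.current_node = t.home_node


tasks = [Task("a", 0, 1), Task("b", 0, 0), Task("c", 0, 1)]
rebalance(tasks, capacity=2)
for t in tasks:
    print(t.name, "runs on node", t.current_node,
          "and allocates on node", allocation_node(t))
```

With three tasks sharing one home node and room for only two, one task stays away from home and keeps allocating remote memory, which is the overcommitted-node difficulty described above.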
Stable pages
What happens if a process (or the kernel) modifies the contents of a page
in the time between when that page is queued for writing to persistent
storage and when the hardware actually performs the write? Normally, the
result would be that the newer data is written, and that is not usually a
problem. If, however, something depends on the older contents, the result
could be problematic. Examples which have come up include checksums used
for integrity checking or pages which have been compressed or encrypted.
Changing those pages before the I/O completes could result in an I/O
operation failure or corrupted data - neither of which is desirable.
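The checksum case is easy to demonstrate in user space. The following sketch uses CRC32 from Python's zlib module as a stand-in for whatever integrity checksum a real storage stack would use; the buffer and variable names are made up for the example.

```python
import zlib

# A 4KB "page" that is about to be written out.
page = bytearray(b"A" * 4096)

# The checksum is computed when the write is queued...
queued_checksum = zlib.crc32(page)

# ...but the page is modified before the device transfers it.
page[0] = ord("B")

# The data reaching the device no longer matches the queued checksum.
written = bytes(page)
checksum_ok = zlib.crc32(written) == queued_checksum
print("checksum matches written data:", checksum_ok)
```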
The answer to this problem is "stable pages" - a rule that pages which are
in flight cannot be changed. Implementing stable pages is relatively easy
(with one exception - see below). Pages which are written to persistent
storage are already marked read-only by the kernel; if a process tries to
write to the page, the kernel will catch the fault, mark the page
(once again) dirty, then allow the write to proceed. To implement stable
pages, the kernel need only force that process to block until any
outstanding I/O operations have completed.
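The fault-handling sequence can be sketched as a single-threaded toy model. Everything here (Page, ToyKernel, and the method names) is hypothetical, and "blocking" is simulated by completing the outstanding I/O on the spot rather than by actually putting a process to sleep.

```python
class Page:
    def __init__(self, data):
        self.data = bytearray(data)
        self.dirty = False
        self.writable = True
        self.under_io = None  # block number while a write is in flight


class ToyKernel:
    def __init__(self):
        self.disk = {}
        self.log = []

    def start_writeback(self, page, block):
        # Pages queued for writing are marked clean and read-only.
        page.dirty = False
        page.writable = False
        page.under_io = block

    def complete_io(self, page):
        # The device transfers whatever the page contains at this moment.
        self.disk[page.under_io] = bytes(page.data)
        page.under_io = None

    def write_fault(self, page):
        # Stable pages: wait for outstanding I/O before allowing the write.
        if page.under_io is not None:
            self.log.append("blocked until I/O completes")
            self.complete_io(page)  # stands in for sleeping on the I/O
        page.dirty = True
        page.writable = True

    def store(self, page, offset, value):
        if not page.writable:
            self.write_fault(page)
        page.data[offset] = value


kernel = ToyKernel()
page = Page(bytes(16))
kernel.start_writeback(page, block=7)
kernel.store(page, 0, 0xFF)  # faults, waits, then proceeds
print(kernel.log, kernel.disk[7] == bytes(16), page.dirty)
```

The data that reaches block 7 is the content the page had when it was queued; the new store lands only afterward and leaves the page dirty again.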
The btrfs filesystem implements stable pages now; it needs them for a number
of reasons. Other filesystems do not have stable pages, though; xfs and
OCFS implement them for metadata only, and the rest have no concept of
stable pages at all. There has been some resistance to the idea of adding
stable pages because there is some fear that performance could suffer;
processes which could immediately write to pages under I/O would slow down
if they are forced to wait.
The truth of the matter seems to be that most of the performance worries
are overblown; in the absence of a deliberate attempt to show problems, the
performance degradation is not measurable. There are a few exceptions;
applications using the Berkeley database manager seem to be one example.
It was agreed that it would be good to have some better measurements of
potential performance issues; a tracepoint may be placed to allow
developers to see how often processes are actually blocked waiting for
pages under I/O.
It turns out that there is one place where implementing stable pages is
difficult. The kernel's get_user_pages() function makes a range
of user-space pages accessible to the kernel. If write access is
requested, the pages are made writable at the time of the call. Some time
may pass, though, before the kernel actually writes to those pages; in the
meantime, some of them may be placed under I/O. There is currently no way
to catch this particular race; it is, as Nick Piggin put it, a real
correctness issue.
There was some talk of alternatives to stable pages. One is to use bounce
buffers for I/O - essentially copying the page's contents elsewhere and
using the copy for the I/O operation. That would be expensive, though, so
the idea was not popular. A related approach would be to use
copy-on-write: if a process tries to modify a page which is being written,
the page would be copied at that time and the process would operate on the
copy. This solution may eventually be implemented, but only after stable
pages have been shown to be a real performance problem. Meanwhile, stable
pages will likely be added to a few other filesystems, possibly controlled
by a mount-time option.
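The difference in cost between the two alternatives comes down to when the copy is made. The sketch below is a hypothetical model (InFlightPage, submit_bounce(), submit_cow(), and write_cow() are invented names) that simply counts copies: a bounce buffer pays for one on every write, while copy-on-write pays only when a process actually touches a page in flight.

```python
class InFlightPage:
    def __init__(self, data):
        self.io_buffer = bytearray(data)  # what the device reads from
        self.live = self.io_buffer        # what the process sees
        self.copies = 0


def submit_bounce(data):
    """Bounce buffer: always copy the page when the I/O is submitted."""
    page = InFlightPage(data)
    page.io_buffer = bytearray(page.live)
    page.copies += 1
    return page


def submit_cow(data):
    """Copy-on-write: no copy is made up front."""
    return InFlightPage(data)


def write_cow(page, offset, value):
    """Copy the page only if it is modified while still in flight."""
    if page.live is page.io_buffer:
        page.live = bytearray(page.io_buffer)
        page.copies += 1
    page.live[offset] = value


bounced = submit_bounce(b"abcd")
bounced.live[0] = ord("X")

cow_idle = submit_cow(b"abcd")      # never modified during I/O
cow_written = submit_cow(b"abcd")
write_cow(cow_written, 0, ord("X"))

print(bounced.copies, cow_idle.copies, cow_written.copies)
```

In both cases the buffer handed to the device stays unchanged; only the number of copies differs.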
Closing sessions
Toward the end of the day, Qian Cai discussed the problem of sustainable
testing. There are a number of ways in which our testing is not as good as
it could be. Companies all have their own test suites; they duplicate a
lot of effort and tend not to collaborate in the development of the tests
or sharing of the results. There are some public test suites (such as
xfstests and the Linux Test Project), but they don't work together and
each has its own approach to things. Some tests need specific hardware
which may not be generally available. Other tests need to be run manually,
reducing the frequency with which they are run.
The subsequent discussion ranged over a number of issues without resulting
in any real action items. There was some talk of error injection; that was
seen as a useful feature, but a hard thing to implement well. It was said
that our correctness tests are in reasonably good shape, but that there are
fewer stress tests out there. The xfstests suite does some stress testing,
but it runs for a relatively short period of time so it cannot catch memory
leaks; xfstests is also not very useful for catching data corruption
problems.
The biggest problem, though, is one which has been raised a number of times
before: we are not very good at catching performance regressions. Ted Ts'o
stated that the "dirty secret" is that kernel developers do not normally
stress filesystems very much, so they tend not to notice performance
problems.
In the final set of lightning talks, Aneesh Kumar and Venkateswararao
Jujjuri talked about work which is being done with the 9p filesystem. Your
editor has long wondered why people are working on this filesystem, which
originally comes from the Plan 9 operating system. The answer was revealed
here: 9p makes it possible to export filesystems to virtualized guests in a
highly efficient way. Improvements to 9p have been aimed at that use case;
it now integrates better with the page cache, uses the virtio framework to
communicate with guests, can do zero-copy I/O to guests running under QEMU,
and supports access control lists. The code for all this is upstream and
will be shipping in some distributions shortly.
Amir Goldstein talked about his snapshot
code, which now works with the ext4 filesystem. The presentation
consisted mostly of benchmark results, almost all of which showed no
significant performance costs associated with the snapshot capability. The
one exception appears to be the postmark benchmark, which performs a lot of
file deletes.
Mike Snitzer went back to the "advanced format" discussion from the
morning's session on future technology. "Advanced format" currently means
4k sectors, but might the sector size grow again in the future? How much
pain would it take for Linux to support sector sizes which are larger than
the processor's page size? Would the page size have to grow too?
The answer to the latter question seems to be "no"; there is no need or
desire to expand the system page size to support larger disk sectors.
Instead, it would be necessary to change the mapping between pages in
memory and sectors on the disk; in many filesystems, this mapping is still
done with the buffer head structure. There are some pitfalls, including
proper handling of sparse files and efficient handling of page faults, but
that is just a matter of programming. It was agreed that it would be nice
to do this programming in the core system instead of having each filesystem
solve the problems in its own way.
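The mapping problem is mostly index arithmetic. As a sketch only, assuming a 4KB page and a hypothetical 16KB sector (no such size was proposed in the session), the helpers below show which pages back a given sector and which sector a given page belongs to; both function names are invented for the example.

```python
PAGE_SIZE = 4096  # assumed base page size


def pages_for_sector(sector, sector_size):
    """Page-cache indexes covering one on-disk sector."""
    first = (sector * sector_size) // PAGE_SIZE
    last = ((sector + 1) * sector_size - 1) // PAGE_SIZE
    return list(range(first, last + 1))


def sector_for_page(page_index, sector_size):
    """Sector holding the first byte of a given page."""
    return (page_index * PAGE_SIZE) // sector_size


SECTOR_SIZE = 16384  # hypothetical future sector size
print(pages_for_sector(1, SECTOR_SIZE))
print(sector_for_page(5, SECTOR_SIZE))
```

With a sector spanning four pages, writing back a single dirty page means the other three must also be available, whether from the page cache or by reading them in first; that is where the sparse file and page fault handling concerns come in.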
The summit concluded with an agreement that things had gone well, and that
the size of the event (just over 70 people) was just about right. The
summit, it was said, should be considered mandatory for all maintainers working
in this area. It was also agreed that the memory management developers
(who have only been included in the summit for the last couple of meetings)
should continue to be invited. That seems inevitable for the next summit;
the head of the program committee, it was announced, will be memory
management hacker Andrea Arcangeli.
Page editor: Jonathan Corbet