Day one of the Linux Storage, Filesystem, and Memory Management Summit was
held in San Francisco on April 1. What follows is a report on the combined
and memory management (MM) sessions from that day, largely based on Mel
Gorman's write-ups, with some editing and additions from my own notes. In
addition, James Bottomley sat in on the Filesystem and Storage discussions,
and his (lightly edited) reports are included as well. The plenary session
from day one, on runtime filesystem consistency checking, was covered in a
separate article.
Writeback
Fengguang Wu began by enumerating his work on improving the writeback
situation and instrumenting the system to get better information on why
writeback is initiated.
James Bottomley quickly pointed out that writeback has been discussed at
LSF/MM for several years now, and asked specifically where things stand
right now. Unfortunately, many people spoke at the same time, some without
microphones, making the discussion difficult to follow. It did focus on how
and when sync takes place, what impact it has, and whether anyone should
care about how dd benchmarks behave. The bulk of the comments concerned
fairness when dealing with multiple syncs coming from multiple sources.
Ironically, despite the clarity of the question, the discussion was vague.
Since audience members did not offer concrete examples, the only conclusion
that could be drawn was that "on some filesystems, for some workloads,
depending on what they do, writeback may do something bad".
Wu brought the session back on topic by focusing on I/O-less dirty
throttling and the complexities it brings. The intention is to minimize
seeks, reduce lock contention, and deliver low latency. He maintains that
there were some impressive performance gains with some minor regressions.
There are issues around integration with the per-task and per-cgroup I/O
controllers but, considering the current state of the I/O controllers, this
was somewhat expected.
Bottomley asked about how much complexity this added; Dave Chinner pointed
out that the complexity of the code was irrelevant because the focus should be
on the complexity of the algorithm. Wu countered that the coverage of
his testing was pretty comprehensive, covering a wide range of hardware,
filesystems, and workloads.
For dirty reclaim, there is now a greater focus on pushing pageout work to the
flusher threads with some effort to improve interactivity by focusing dirty
reclaim on the tasks doing the dirtying. He stated that dirty pages reaching the
end of the LRU are still a problem and suggested the creation of a dirty LRU
list. With current kernels, dirty pages are skipped over by direct
reclaimers, which increases CPU cost; how much of a problem this is varies
between kernel versions. Moving dirty pages to a separate list
unfortunately requires a page flag, and page flags are not readily
available.
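The CPU-cost argument can be made concrete with a toy model. The sketch below (plain Python with illustrative names; nothing here is kernel code) contrasts a reclaimer that rotates dirty pages back onto the LRU, as current kernels effectively do, with one that moves them to a hypothetical dirty list:

```python
from collections import deque

class Page:
    def __init__(self, dirty):
        self.dirty = dirty

def reclaim_skipping_dirty(lru, want):
    """Skip dirty pages by rotating them back onto the list; they will
    be scanned again and again until a scan budget runs out."""
    reclaimed, scanned, budget = 0, 0, 2 * len(lru)
    while lru and reclaimed < want and scanned < budget:
        page = lru.popleft()
        scanned += 1
        if page.dirty:
            lru.append(page)          # rotated; will be rescanned later
        else:
            reclaimed += 1
    return reclaimed, scanned

def reclaim_with_dirty_lru(lru, dirty_lru, want):
    """Move dirty pages to a dedicated list for the flusher threads,
    so each dirty page is touched only once by the reclaimer."""
    reclaimed, scanned = 0, 0
    while lru and reclaimed < want:
        page = lru.popleft()
        scanned += 1
        if page.dirty:
            dirty_lru.append(page)    # flushers write these back later
        else:
            reclaimed += 1
    return reclaimed, scanned
```

In the rotating version, once only dirty pages remain, the scan burns its whole budget without reclaiming anything further; the dirty-list version touches each dirty page exactly once.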
Memory control groups bring their own issues with writeback, particularly
around flusher fairness. This is currently beyond control with only
coarse options available such as limiting the number of operations that
can be performed on a per-inode basis or limiting the amount of IO that
can be submitted. There was mention of throttling based on the amount of
I/O a process has completed, but it was not clear how this would work in
practice.
The final topic was the block cgroup (blkcg) I/O controller and the
different approaches to throttling based on I/O operations per second
(IOPS) and on access to disk time. Buffered writes are a problem, as is the
question of how they could possibly be handled by the controller.
A big issue with throttling buffered writes is still identifying the I/O
owner and throttling them at the time the I/O is queued, which happens after
the I/O owner has already executed a read() or write(). There was a request
to clarify what the best approach might be but there were
few responses. As months, if not years, of discussion on the lists imply,
it is just not a straightforward topic and it was suggested that a spare
slot be stolen to discuss it further (see the follow-up in the filesystem
and storage sessions below).
At the end, Bottomley wanted an estimate of how close writeback was to
being finished. After some hedging, Wu estimated that it was 70% complete.
Stable pages
Issues surrounding stable pages were the next topic under discussion. As
Ted Ts'o noted, making writing
processes wait for writeback to complete on stable pages can lead to
unexpected and rather long latencies, which may be unacceptable for some
workloads. Stable pages are only really needed for some systems where
things like checksums calculated on the page require that the page be
unchanged when it actually gets written.
Sage Weil and Boaz Harrosh listed three options for handling the problem.
The first was to reissue the write for pages that changed while they were
undergoing writeback, but that can confuse some storage systems. Waiting on
the writeback (which is what is currently done) and doing a copy-on-write
(COW) of the page under writeback were the other two. The latter option was
the initial focus of the discussion.
James Bottomley asked if the cost of COW-ing the pages had been
benchmarked; Weil said that it had not. Weil and Harrosh are interested in
workloads that really require stable writes and whether they were truly
affected by waiting for the writeback to complete. Weil noted that Ts'o
can simply turn off stable pages, which fixes his problem. Bottomley asked
whether there could just be a mount flag to turn off stable pages. Another
approach might be to have the underlying storage system inform the
filesystem whether or not it needs stable writes.
Since waiting on writeback for stable pages introduces a number of
unexpected issues, there is a question of whether replacing it with
something with a different set of issues is the right way to go. The COW
proposal may lead to problems because it results in there being two pages
for the same storage location floating around. In addition, there are
concerns about what would happen for a file that gets truncated after its
pages have been copied, and how to properly propagate that information.
It is unclear whether COW would always be a win over waiting, so Bottomley
suggested that the first step should be to get some reporting added into
the stable writeback path to gather information on which workloads are
being affected and what those effects are. After that, someone could flesh
out a proposal for implementing the COW solution that describes how to work
out the various problems and corner cases that were mentioned.
Memory vs. performance
While the topic name of Dan Magenheimer's slot, "Restricting Memory Usage
with Equivalent Performance", was not of his choosing, that didn't deter
him from presenting a problem for memory management developers to
consider. He started by describing a graph of the performance of a workload
as the amount of RAM available to it increases. Adding RAM reduces the
amount of time the workload takes, to a certain point. After that point,
adding more memory has no effect on the performance.
It is difficult or impossible to know the exact amount of RAM required to
maximize the performance of a workload, he said. Two virtual machines on a
single
host are sharing the available memory, but one VM may need the additional
memory that the other does not really need. Some kind of balance point
between the workloads being handled by the two VMs needs to be found.
Magenheimer has some ideas on ways to think about the problem that he
described in the session.
He started with an analogy of two countries, one of which wants resources
that the other has. Sometimes that means they go to war, especially in the
past, but more recently economic solutions have been used rather than
violence to allocate the resource. He wonders if a similar mechanism could
be used in the kernel. There are a number of sessions in the memory
management track that are all related to the resource allocation problem,
he said, including those on memory control group soft limits and NUMA
scheduling.
The top-level question is how to determine how much memory an application
actually needs vs. how much it wants. The idea is to try to find
the point where giving some memory to another application has a negligible
performance impact on the giver while the other application can use it to
increase its performance.
Beyond tracking the size of the application, Magenheimer posited that one
could use calculus and calculate the derivative of the size growth to gain
an idea of the "velocity" of the workload. Rik van Riel noted that this
information could be difficult to track when the system is thrashing, but
Magenheimer thought that tracking refaults could help with that problem.
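The "velocity" idea can be sketched as a finite-difference derivative over periodic working-set samples. Treating refaults as hidden demand, as below, is one plausible reading of Magenheimer's suggestion, not an established algorithm:

```python
def workload_velocity(ws_samples, refaults=None, dt=1.0):
    """Estimate how fast a workload's memory footprint is growing.

    ws_samples: working-set size (in pages) sampled every dt seconds.
    refaults:   optional per-interval refault counts; a refault means the
                workload wanted a page it no longer had, so refaults are
                added back in as hidden demand that the raw size samples
                miss when the system is thrashing.
    """
    # derivative of the size growth, approximated by finite differences
    growth = [(b - a) / dt for a, b in zip(ws_samples, ws_samples[1:])]
    if refaults is not None:
        growth = [g + r / dt for g, r in zip(growth, refaults)]
    return growth
```

A shrinking or flat velocity would mark a VM as a candidate "giver" of memory; a positive one, especially refault-driven, marks a VM that should receive it.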
Ultimately, Magenheimer wants to apply these ideas to RAMster, which allows
machines to share "unused" memory between them. RAMster would allow
machines to negotiate storing pages for other machines. For example, in an
eight-machine system, seven machines could treat the remaining machine as a
memory server, offloading some of their pages to it.
Workload size estimation might help, but the discussion returned to the old
chestnut of trying to shrink memory to find the point at which the workload
starts "pushing" back by either refaulting or beginning to thrash. That
would allow the issue to be expressed in terms of control theory, a crucial
part of which is having a feedback mechanism. By and large, virtual
machines have almost no feedback mechanisms for establishing the priority
of different requests for resources; furthermore, performance analysis of
resource usage is limited.
Glauber Costa pointed out that some of this could potentially be
investigated using memory cgroups that vary in size to act as a type of
feedback mechanism, even though that would lack a global view of resource
usage. In the end, this session was a problem statement: what feedback
mechanisms does a virtual machine need to assess how much memory the
workload on a particular machine requires? This is related to working set
size estimation, but that is sufficiently different from Magenheimer's
requirement that the two may not share much in common.
Ballooning for transparent huge pages
Rik van Riel began by reminding the audience that transparent huge pages (THP) gave a large performance
gain in virtual machines by virtue of the fact that VMs use nested page
tables, which doubles
the normal cost of translation. Huge pages, by requiring far fewer
translations, can make much of the performance penalty associated with nested
page tables go away.
Once ballooning enters the picture though it rains on the parade by
fragmenting memory and reducing the number of huge pages that can be used.
The obvious approach is to balloon in 2M contiguous chunks. However, this
has its own problems because compaction can only do so much. If a guest must
shrink its memory by half, it may use all the regions that are capable
of being defragmented. This would reduce or eliminate the number of 2M huge
pages that could be used.
Van Riel's solution requires that balloon pages become movable within the
guest, which needs changes to both the balloon driver and potentially the
hypervisor; no one in the audience saw a problem with this as such. Balloon
pages are not particularly complicated because they have just one
reference. They need a new page mapping with a migration callback to
release the reference to the page; since the contents do not need to be
copied, there is an optimization available there.
Once that is established, it would also be nice to keep balloon pages
within the same 2M regions. Dan Magenheimer mentioned a user with a similar
type of problem, one closely related to what CMA does. It was
suggested that Van Riel may need something very similar to MIGRATE_CMA except
where MIGRATE_CMA forbids unmovable pages within their pageblocks, balloon
drivers would simply prefer that unmovable pages were not allocated. This
would allow further concentration of balloon pages within 2M regions without
using compaction aggressively.
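One way to read that MIGRATE_CMA-like preference is as a placement policy that concentrates balloon pages into as few 2M regions as possible. The sketch below (illustrative names, and a drastic simplification of what a balloon driver actually does) picks candidate page frames region by region:

```python
PAGE_SIZE = 4096
HUGE_SIZE = 2 * 1024 * 1024
PAGES_PER_HUGE = HUGE_SIZE // PAGE_SIZE   # 512 base pages per 2M region

def pick_balloon_pages(candidate_pfns, npages):
    """Choose page frames to inflate the balloon with, preferring to fill
    whole 2M-aligned regions so the regions left behind remain usable as
    huge pages."""
    # group candidate page frame numbers by their 2M-aligned region
    regions = {}
    for pfn in candidate_pfns:
        regions.setdefault(pfn // PAGES_PER_HUGE, []).append(pfn)
    picked = []
    # take from the regions where we can grab the most pages first,
    # concentrating the balloon instead of scattering it across regions
    for _, pfns in sorted(regions.items(), key=lambda kv: -len(kv[1])):
        picked.extend(pfns[: npages - len(picked)])
        if len(picked) == npages:
            break
    return picked
```

The effect is the same goal as described above: balloon pages end up clustered, so deflating the balloon (or simply never touching the other regions) preserves 2M-contiguous memory without aggressive compaction.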
There was no resistance to the idea in principle so one would expect that
some sort of prototype will appear on the lists during the next year.
Finding holes for mmap()
Rik van Riel started a discussion on the problem of quickly finding free
virtual areas during mmap() calls. Very simplistically, an mmap() requires
a linear search of the virtual address space, virtual memory area (VMA) by
VMA, with some minor caching of holes and scan pointers. However, some
workloads use thousands of VMAs, so this scan becomes expensive.
VMAs are already organized by a red-black tree (RB tree). Andrea Arcangeli had
suggested that information
about free areas near a VMA could be propagated up the RB tree toward
the root. Essentially it would be an augmented RB tree that stores both
allocated and free information. Van Riel was considering a simpler approach
using a callback on a normal RB tree to store the hole size in the VMA. Using
that, each RB node would know the total free space below it, in an unsorted
fashion. That potentially introduces fragmentation as a problem, but that
is inconsequential to Van Riel in comparison to the problem where a hole of
a
particular alignment is required. Peter Zijlstra maintained that augmented trees
should be usable to do this, but that was disputed by Van Riel who said that
augmented RB tree users have significant implementation responsibilities
so this detail needs further research.
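The augmented-tree idea can be illustrated with a simple (non-balancing) binary tree in which each node also records the largest hole in its subtree; the names and structure below are illustrative, not Van Riel's or Arcangeli's actual code:

```python
class VmaNode:
    """A VMA [start, end) in a binary tree ordered by address. 'gap' is
    the hole immediately below this VMA; 'max_gap' augments the node with
    the largest hole anywhere in its subtree, which is what lets a search
    skip entire subtrees that cannot satisfy a request."""
    def __init__(self, start, end, gap, left=None, right=None):
        self.start, self.end, self.gap = start, end, gap
        self.left, self.right = left, right
        self.max_gap = max(gap,
                           left.max_gap if left else 0,
                           right.max_gap if right else 0)

def find_hole(node, size):
    """Return the lowest address of a hole of at least 'size' bytes,
    descending only into subtrees whose max_gap can satisfy the request,
    so the search cost is proportional to the tree height."""
    if node is None or node.max_gap < size:
        return None
    found = find_hole(node.left, size)        # prefer lower addresses
    if found is not None:
        return found
    if node.gap >= size:
        return node.start - node.gap          # hole ends where this VMA begins
    return find_hole(node.right, size)
```

A real implementation would also have to update max_gap on insert, delete, and rotation, which is exactly the augmented-tree maintenance burden the discussion turned on.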
Again, there was little resistance to the idea in principle but there are
likely to be issues during review about exactly how it gets implemented.
AIO/DIO in the kernel
Dave Kleikamp talked about asynchronous I/O (AIO) and how it is currently
used for user pages. He
wants to be able to initiate AIO from within the kernel, so he wants to convert
struct iov_iter to contain either an iovec or bio_vec and then convert the
direct I/O path to operate on iov_iter.
He maintains that this should be a straightforward conversion based on
the fact that it is the generic code that does all the complicated things
with the various structures.
He tested the API change by converting the loop device to set O_DIRECT and
submit I/O via AIO. This eliminated caching in the underlying filesystem
and ensured consistency of the mounted file.
He sent out patches a month ago but did not
get much feedback and was
looking to figure out why that was. He was soliciting input on the
approach and how it might be improved but it seemed like many had either
missed the patches or otherwise not read them. There might be greater
attention in the future.
The question was asked whether this would be a compatible interface for
swap-over-arbitrary-filesystem. The latest swap-over-NFS patches introduced
an interface for pinning pages for kernel I/O, but Kleikamp's patches
appear to go further; it would seem that swap-over-NFS could be adapted to
use them.
Dueling NUMA migration schemes
Peter Zijlstra started the session by talking about his approach for
improving performance on NUMA machines. Simplistically, it assigns processes
to a home node that allocation policies will prefer to allocate from and
load balancer policies to keep the threads near the memory it is using.
System calls are introduced to allow the assignment of thread groups and
VMAs to nodes; applications must be aware of the API to take advantage of
it.
Once the decision has been made to migrate threads to a new node, their
pages are unmapped and migrated as they are faulted, minimizing the number
of pages to be migrated and correctly accounting for the cost to the
process moving between nodes. As file pages may potentially be shared, the
scheme focuses on anonymous pages.
In general, the scheme is expected to work well for the case where the
working set fits within a given NUMA node but be easier to implement than
binding support currently offered by the kernel. Preliminary tests indicate
that it does what it is supposed to do for the cases it handles.
One key advantage Zijlstra cited for his approach is that he maintains
information on a per-thread and per-VMA basis, which is predictable. In
contrast, Arcangeli's approach requires storing information on a per-page
basis and is much heavier in terms of memory consumption. There were few
questions on the specifics of the implementation, with comments from the
room focusing instead on comparing Zijlstra's and Arcangeli's approaches.
Arcangeli then presented AutoNUMA, which consists of a number of components.
The first is the knuma_scand component which is a page table
walker that tracks the
RSS usage of processes and the location of their pages. To track reference
behavior, a NUMA page-fault hinting component temporarily changes page
table entries (PTEs) in an arrangement that is similar, but not identical,
to PROT_NONE. Faults are then used to record which process is using a given
page in memory. knuma_migrateN
is a per-node thread that is responsible for migrating pages if a process
should move to a new node. Two further components move threads near
the memory they are using or alternatively, move memory to the CPU that is
using it. Which option it takes depends on how memory is currently being
used by the processes.
There are two types of data being maintained. The first, task_autonuma,
works on a task_struct basis, and its data is collected by the NUMA hinting
page faults. The second is mm_autonuma, which works on an mm_struct basis
and gathers information on the working set size and the location of the
pages it has mapped; that information is generated by knuma_scand.
The details of how it decides whether to move threads or memory to
different NUMA nodes are involved, but Arcangeli expressed a high degree of
confidence
that it could make close to optimal decisions on where threads and memory
should be located. Arcangeli's slide that describes the AutoNUMA workflow
is shown at right.
When it happens, migration is based on per-node queues and care is taken to
migrate pages at a steady rate to avoid bogging the machine down copying data.
While Arcangeli acknowledged the overall concept was complicated, he asserted
that it was relatively well-contained without spreading logic throughout
the whole of MM.
As with Zijlstra's talk, there were few questions on the specifics of how it
was implemented, implying that not many people in the room have reviewed
the patches, so Arcangeli moved on to explaining the benchmarks he ran. The
results of the benchmarks looked as if performance was within a few percent of
manually binding memory and threads to local nodes. It was interesting to
note that for one benchmark, specjbb, it was clear that how well AutoNUMA
does varies, which shows its non-deterministic behavior. But its performance
never dropped below the base performance. He explained that the variation
could be partially explained by the fact that AutoNUMA currently does not
migrate THP pages; instead, it splits them and migrates the individual
pages, depending on khugepaged to collapse the huge pages again afterward.
Zijlstra pointed out that, for some of the benchmarks that were presented,
his approach potentially performed just as well without the algorithm
complexity or memory overhead. He asserted this was particularly true
for KVM-based workloads as long as the workload fits within a NUMA node.
He pointed out that the history of memcg led to a situation where it had
to be disabled by default in many situations because of the overhead and
that AutoNUMA was vulnerable to the same problem.
When it got down to it, the points discussed were not massively different
from the discussions on the mailing list. One thing that received little
discussion was whether there was any compatibility between the two
approaches and what logic could be shared; this was due to time
limitations, but future reviewers may have a clearer view of the two
approaches.
Soft limits in memcg
Ying Han began by introducing soft reclaim; she wanted to find out what
blockers exist to merging parts of it. The work has reached the point where
it is sufficiently complicated that it is colliding with other aspects of
the memory control group (memcg) work.
Right now, the implementation of soft limits allows memcgs to grow above a soft
limit in the absence of global memory pressure. In the event of global memory
pressure then memcgs get shrunk if they are above their soft limit. The
results for shrinking are similar to hierarchical reclaim for hard limits.
In a superficial way, this concept is similar to what Dan Magenheimer wants
for RAMster, except that it applies to cgroups instead of machines.
Rik van Riel pointed out that a task can fit within a node and within its
soft limit but, if there are other cgroups on the same node, the aggregate
soft limit can exceed the node size; in some cases, such a cgroup should be
shrunk even though it is below its soft limit. This has a
quality-of-service impact that Han recognizes needs to be addressed. This
is somewhat of an administrative issue: the total of
all hard limits can exceed physical memory with the impact being that
global reclaim shrinks cgroups before they hit their hard limit. This
may be undesirable from an administrative point of view. For soft
limits, it makes even less sense if the total soft limits exceed
physical memory as it would be functionally similar to if the soft
limits were not set at all.
The primary issue was deciding at what ratio pages should be reclaimed from
the various cgroups. If there is global memory pressure and all cgroups are
under their soft limits, then a situation potentially arises whereby
reclaim is retried indefinitely without forward progress. Hugh Dickins
pointed out that
reclaim has no requirement that cgroups under their soft limit never be
reclaimed. Instead, reclaim from such cgroups should simply be resisted
and the question is how it should be resisted. This may require that all
cgroups get scanned to discover that they are all under their soft limit
and then require burning more CPU rescanning them. Throttling logic is
needed; ultimately, this is not dissimilar to how kswapd or direct
reclaimers get throttled when scanning too aggressively. As with many
things, memcg is similar to the global case, but the details are subtly
different.
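One plausible shape for the "resist but do not forbid" policy Dickins described is sketched below; the proportional split and fallback share are invented for illustration and are not taken from Han's patches:

```python
def pick_reclaim_targets(cgroups, needed):
    """Decide how many pages to reclaim from each memcg under global
    memory pressure. cgroups: list of (name, usage, soft_limit) tuples,
    all in pages; needed: total pages to reclaim."""
    over = [(name, usage - soft) for name, usage, soft in cgroups
            if usage > soft]
    if over:
        # shrink offenders in proportion to how far past the limit they are
        total = sum(excess for _, excess in over)
        return {name: needed * excess // total for name, excess in over}
    # Everyone is under its soft limit: rather than rescanning forever
    # without forward progress, reclaim a little from every group --
    # resisting, but not forbidding, reclaim below the soft limit.
    share = max(1, needed // len(cgroups))
    return {name: share for name, _, _, in cgroups} if False else \
           {name: share for name, _, _ in cgroups}
```

The unresolved question from the session lives in the fallback branch: how much is "a little", and how is it kept fair across groups.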
Even then, there was no real consensus on how much memory should be reclaimed
from cgroups below their soft limit. There is an inherent fairness
issue here that does not appear to have changed much between different
discussions. Unfortunately, discussions related to soft reclaim are separated
by a lot of time and people need to be reminded of the details. This meant
that little forward progress was made on whether to merge soft reclaim or
not but there were no specific objections during the session. Ultimately,
this is still seen as being a little Google-specific, particularly as some
of the shrinking decisions were tuned based on Google workloads. New use
cases are needed to tune the shrinking decisions and to support merging the
patches.
Kernel interference
Christoph Lameter started by stating that each kernel upgrade results in
new sources of interference for his target applications (which are used for
high-speed trading). This generates a lot of resistance to kernels
being upgraded on their platform. The primary sources of interference were
from faults, reclaim, inter-processor interrupts, kernel threads, and
user-space daemons. Any one
of these can create latency, sometimes to a degree that is catastrophic to their
application. For example, if reclaim causes an additional minor fault to
be incurred, it is in fact a major problem for their application.
This happens because of several trends. Kernels are simply more complex,
with more causes of interference, leaving less processor time available to
the user. Other trends that affect these applications are larger memory
sizes, leading to longer reclaim, and more processors, meaning that
for-all-CPU loops take longer.
One possible measure would be to isolate OS activities to a subset of CPUs
possibly including interrupt handling. Andi Kleen pointed out that even with
CPU isolation, if unrelated processes are sharing the same socket,
they can interfere with each other. Lameter maintained that while this
was true such isolation was still of benefit to them.
For some of the examples brought up, there are people working on the
issues but they are still works in progress and have not been
merged. The fact of the matter is that the situation is less than ideal
with kernels today. This is forcing them into a situation where they fully
isolate some CPUs and bypass the OS as much as possible, which turns Linux
into
a glorified boot loader. It would be in the interest of the community to
reduce such motivations by watching the kernel overhead, he said.
Filesystem and storage sessions
Copy offload
Frederick Knight, who is the NetApp T10 (SCSI) standards representative,
began by describing copy offload, a method for allowing SCSI devices to
copy ranges of blocks without involving the host operating system.
Copy offload is
designed to be a lot faster for large files because wire speed is no
longer the limiting factor. In fact, in spite of the attention now,
offloaded copy has been in SCSI standards in some form or other since
the SCSI-1 days. EXTENDED COPY (abbreviated as XCOPY) takes two
descriptors for the source and destination and a range of blocks.
It is then implemented in a push model (source sends the blocks to the
target) or a pull model (target pulls from source) depending on which
device receives the XCOPY command. There's no requirement that the
source and target use SCSI protocols to effect the copy (they may use an
internal bus if they're in the same housing) but, should there be a
failure, they're required to report errors as if they had used the SCSI
protocol.
A far more complex command set is token-based copy. The idea here is
that the token contains a ROD (Representation of Data) which allows
arrays to give you an identifier for what may be a snapshot. A token
represents a device and a range of sectors which the device guarantees
to be stable. However, if the device does not support snapshotting and
the region gets overwritten (or in fact, for any other reason), it may
decline to accept the token and mark it invalid. This, unfortunately,
means you have no real idea of the token lifetime, and every time the
token goes invalid, you have to do the data transfer by other means
(or renew the token and try again).
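The token lifecycle described above amounts to a retry loop with a fallback path. The sketch below models it with hypothetical callbacks standing in for the SCSI commands involved (e.g. POPULATE TOKEN and WRITE USING TOKEN); it is a protocol illustration, not a real driver interface:

```python
class TokenInvalid(Exception):
    """Raised when a device declines a ROD token, e.g. because the source
    range was overwritten and no snapshot backs the token."""

def offloaded_copy(src, dst, length, make_token, copy_with_token,
                   plain_copy, max_renewals=3):
    """Copy 'length' bytes using token-based copy offload, renewing the
    token when it is declined and falling back to an ordinary host-driven
    copy once renewals are exhausted. All three callbacks are hypothetical
    stand-ins for the underlying commands."""
    for _ in range(max_renewals):
        token = make_token(src, length)       # source produces a ROD token
        try:
            return copy_with_token(dst, token)
        except TokenInvalid:
            continue                          # token went invalid: renew, retry
    # transfer the data "by other means", as the standard allows
    return plain_copy(src, dst, length)
```

The key property the discussion turned on is visible here: because token lifetime is unknowable, every consumer of the interface must carry the fallback path anyway.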
There was a lot of debate on how exactly we'd make use of this feature and
whether tokens would be exposed to user space. They're supposed to be
cryptographically secure, but a lot of participants expressed doubt about
this; certainly, anyone breaking a token effectively has access to all of
the data it represents.
NFS and CIFS are starting to consider token-based copy commands, and the
token format would be standardized, which would allow copies from a SCSI disk
token into an NFS/CIFS volume.
Copy offload implementation
The first point made by Hannes Reinecke is that identification of source and target for
tokens is a nightmare if everything is done in user space. Obviously,
there is a
need to flush the source range before constructing the token, then we
can possibly use FIEMAP to get the sectors. Chris Mason pointed out
this wouldn't work for Btrfs; after further discussion, the concept of a
reference-counted FIETOKEN operation emerged instead. The discussion then
moved to hiding the token in some type of reflink()- and splice()-like
system calls. There was a lot more debate on the mechanics of this,
including whether the token should be exposed to user space (unfortunately
yes, since NFS and CIFS would need it). Discussion wrapped up with the
thought that we really need to understand the user-space use cases of copy
offload.
pNFS is beginning to require complex RAID-ed objects which require
advanced RAID topologies. This means that pNFS implementations need an
advanced, generic, composable RAID engine that can implement any
topology in a single compute operation. MD was rejected because
composition requires layering within the MD system and that means you can't
do advanced topologies in a single operation.
This proposal was essentially for a new interface that would unify all
the existing RAID systems by throwing them away and writing a new one.
Ted Ts'o pointed out that filesystems making use of this engine don't
want to understand how to reconstruct the data, so the implementation
should "just work" for the degraded case. If we go this route, we
definitely need to ensure that all existing RAID implementations work as
they currently do.
The action summary was to start with MD and then look at Btrfs. Since we
don't really want new administrative interfaces exposed to users, any new
implementation should be usable by the existing LVM tools.
xfstest
Dave Chinner reminded everyone that the methodology behind xfstest is
"golden output matching". That means that all tests produce output which is
then filtered (to remove extraneous differences like timestamps or,
rather, to fill them in with X's); success or failure is indicated by
whether the results differ from the expected golden output file. This means
that the test itself shouldn't process its own output.
Almost every current filesystem is covered by xfstest in some form, and the
XFS code is tested at 75-80% coverage. (Dave said we need to run the
code-coverage tools to determine what the coverage for other filesystems
actually is.) Ext4, XFS, and Btrfs regularly have the xfstest suite run as
part of their development cycles.
Xfstest consists of ~280 tests which run in 45-60 minutes (depending on
disk speed and processing power). Of these tests, about 100 are
filesystem-independent. One of the problems is that the tests are highly
dependent on the output format of the tools; if that changes, the tests
report false failures. On the other hand, this is easily fixed by
constructing a new golden output file for the tests.
One of the maintenance nightmares is that the tests are numbered rather
than named (which means everyone who writes a new test adds it as number 281
and Dave has to renumber). This should be fixed by naming tests instead.
The test space should also become hierarchical (grouping by function) rather
than the current flat scheme.
Keeping a matrix of test results over time allows far better data mining
and makes it easier to dig down and correlate reasons for intermittent
failures, Chinner said.
Flushing and I/O back pressure
This was a breakout session to discuss some thoughts that arose
during the general writeback session (reported above).
The main concept is that writeback limits are trying to limit the amount
of time (or IOPS, etc.) spent in writeback. However, the flusher threads
are currently unlimited because we have no way to charge the I/O they do
to the actual tasks. We also have problems accounting for metadata
(filesystems with journal threads), and there are I/O priority inversion
problems (a high-priority task must not be left blocked because writeout
for a low-priority task, which is being charged for it, has been halted).
There are three problems:
- Problems between CFQ and block flusher. This should now be
solved by tagging I/O with the originating cgroup.
- CFQ throws all I/O into a single queue (Jens Axboe thinks this isn't a
problem).
- Metadata ordering causes priority inversion.
On the last, the thought was that we could use transaction reservations as an
indicator for whether we had to complete the entire transaction (or just
throttle it entirely) regardless of the writeback limits which would
avoid the priority inversions caused by incomplete writeout of
transactions. For dirty data pages, we should hook writeback throttling
into balance_dirty_pages(). For the administrator, the system needs to
be simple, so there needs to be a single writeback "knob" to adjust.
Another problem is that we can throttle a process which uses buffered
I/O but not if it uses AIO or direct I/O (DIO), so we need to come up with a throttle
that works for all I/O.