Brief items
The current development kernel is 3.4-rc1,
released by Linus on March 31. See the
separate article below for a summary of the final changes merged for this
development cycle.
Stable updates: the 3.0.27, 3.2.14, and 3.3.1 updates were released on April 2;
they contain the usual long list of important fixes.
Publicly making fun of people is half the fun of open source
programming.
In fact, the real reason to eschew programming in closed
environments is that you can't embarrass people in public.
--
Linus Torvalds
/*
+ * Wikipedia: "The current (13th) b'ak'tun will end, or be completed, on
+ * 13.0.0.0.0 (December 21, 2012 using the GMT correlation". GMT or
+ * Mexico/General? What's 6 hours between Mayans friends.. let's follow
+ * 'Mexican time' rules. You might get 6 more hours of reading your
+ * mail, but don't count on it.
+ */
+#define END_13BAKTUN 1356069600
+extern int emulatemayanprophecy; /* End time before the Mayans do */
--
Theo de Raadt; Linux remains unprepared
Maybe I should ask the next person who submits a new architecture
to do that work, that's usually how progress in asm-generic happens
these days.
--
Arnd Bergmann
Although there have been numerous complaints about the complexity
of parallel programming (especially over the past 5-10 years), the
plain truth is that the incremental complexity of parallel
programming over that of sequential programming is not as large as
is commonly believed. Despite what you might have heard, the
mind-numbing complexity of modern computer systems is not due so
much to there being multiple CPUs, but rather to there being any
CPUs at all. In short, for the ultimate in computer-system
simplicity, the optimal choice is NR_CPUS=0.
--
Paul McKenney
![[plot]](/images/2012/osadl-determinism.jpg)
The Open Source Automation Development Lab has posted
a press
release celebrating a full year's worth of testing of latencies on
several systems running the realtime preemption kernel. "Each graph
consists of more than 730 latency plots put before one another with the
time scale running from back to front. A latency plot displays the number
of samples within a given latency class (resolution 1 µs). The logarithmic
frequency values at the y-scale ensure that even a single outlier would be
displayed (for details of the test procedures and the load scenarios please
refer to this description). The absence of any outlier in all the very
different systems clearly demonstrates that the perfect determinism of the
mainline Linux real-time kernel is a generic feature; it is not restricted
to any particular architecture." OSADL is an industry consortium
dedicated to encouraging the development and use of Linux in automated
systems.
Kernel development news
By Jonathan Corbet
April 3, 2012
Linus
announced the 3.4-rc1 release and the
closing of the merge window on March 31. At the outset, he had said
that this merge window could run a little longer than usual; in fact, at 13
days, it was slightly shorter. One should not conclude that there was not
much to pull, though; some 9,248 non-merge changesets went into the
mainline before 3.4-rc1, and a couple of significant features have sneaked
their way in afterward as well.
User-visible features merged since last week's
summary include:
- The device mapper "thin provisioning" target now supports discard
requests, a feature which should help it to use the underlying storage
more efficiently.
- The dm-verity device mapper target has
been merged. This target manages a read-only device where all blocks
are checked against a cryptographic hash maintained elsewhere; it thus
provides a certain degree of tampering detection. Details can be
found in Documentation/device-mapper/verity.txt
- Support for the x32 ABI has been
merged into the kernel. Getting support into the compiler and the C
library is an ongoing project, and the creation of distributions using
this ABI will take even longer, but the foundation, at least, is now
in place.
- The "high-speed synchronous serial interface" (HSI) framework has been
merged. HSI is an interface that is mainly used to connect processors
with cellular modem engines; it will be used for handset support in
future kernel releases.
- New drivers include:
- Processors and platforms:
Samsung EXYNOS5 SoCs, and NVIDIA Tegra3 SoCs.
- Flash:
SMI-attached SPEAR MTD NOR controllers,
DiskOnChip G4 NAND flash devices, and
Universal Flash Storage host controllers (details in Documentation/scsi/ufs.txt).
- Miscellaneous:
Apple "gmux" display multiplexers,
Intel Sodaville GPIO controllers,
TI TPS65217 and TPS65090 power management controllers,
Ricoh RC5T583 power management system devices,
Freescale i.MX on-chip ANATOP controllers,
Summit Microelectronics SMB347 battery chargers, and
ST Ericsson AB8500 battery management controllers.
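The dm-verity idea described in the list above (every block of a read-only device is checked against a hash kept elsewhere) can be illustrated with a minimal user-space sketch. This is not the kernel's on-disk format or interface; the flat hash table, the 4096-byte block size, and the function names `build_hash_table()` and `read_block()` are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for this example


def build_hash_table(device):
    """Compute one SHA-256 digest per block of a read-only image.

    In a real deployment the digests would be stored somewhere the
    attacker cannot modify; here they are just a Python list.
    """
    return [hashlib.sha256(device[off:off + BLOCK_SIZE]).digest()
            for off in range(0, len(device), BLOCK_SIZE)]


def read_block(device, index, hashes):
    """Return block `index`, or raise IOError if its hash does not match."""
    block = device[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    if hashlib.sha256(block).digest() != hashes[index]:
        raise IOError("block %d failed verification" % index)
    return block
```

A read of an untouched block succeeds, while a read of a block that was modified after the table was built raises an error; that is the "certain degree of tampering detection" in miniature.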
Changes visible to kernel developers include:
- The "common clock framework" unifies the handling of subsystem clocks,
especially on the ARM architecture (though it is not limited to ARM).
See Documentation/clk.txt for more
information.
- The DMA buffer sharing API has been extended to allow CPU access to
the buffers; see the updated Documentation/dma-buf-sharing.txt file
for details.
- The direct rendering subsystem has gained initial support for the DMA
buffer sharing mechanism. No drivers use it yet, but having this
support in the mainline will ease the development of driver support
for future kernels.
- The massive <asm/system.h> include file has been split
into several smaller files and removed; in-tree users have been fixed.
- The new /proc/dma-mappings file on the ARM architecture
displays the currently-active coherent DMA mappings. Since such
mappings tend to be in short supply on ARM, this can be a useful
debugging tool.
- The ARM architecture has gained jump label ("static branch") support.
- The just-in-time compiler for BPF packet filters has been ported to
the ARM architecture.
There are a couple of other features that Linus may still be considering
merging as of this writing, though the chances of them getting in would
appear to be diminishing. One is the DMA
mapping rework; Linus has been asking for potential users of this
change to speak up, but few have done so. In other words, if there are
developers out there who would like to see the improved DMA subsystem in
the 3.4 release, you are running out of time to make that desire known.
The other is POHMELFS, which has had some
review snags and which also seems to lack a vocal community clamoring for
its inclusion.
Beyond those possibilities, though, the time for new features to go into
the 3.4 development cycle has now passed. The stabilization process has
begun, with a probable final release in late May or early June.
By Jake Edge
April 3, 2012
Day one of the Linux Storage, Filesystem, and Memory Management Summit
(LSFMMS) was
held in San Francisco on April 1. What follows is a report on the combined
and MM sessions
from the day, largely based on Mel Gorman's write-ups, with some editing and
additions from my own notes. In addition, James Bottomley sat in on the
Filesystem and Storage discussions and his (lightly edited) reports are
included as well. The plenary session from day one, on runtime filesystem consistency checking, was
covered in a separate article.
Writeback
Fengguang Wu began by enumerating his work on improving the writeback
situation and instrumenting the system to get better information on why
writeback is initiated.
James Bottomley quickly pointed out that we've talked about writeback for
several years at LSFMMS
and specifically asked where we are right now. Unfortunately, many people spoke
at the same time, some without microphones, making the discussion difficult to follow.
They did focus on how and when sync takes place, what impact it has,
and whether anyone should care about how dd benchmarks behave. The bulk of
the comments focused on the fairness of dealing with multiple syncs coming
from multiple sources. Ironically, despite the clarity of the question,
the discussion was vague. Because audience members did not offer concrete
examples, the only possible conclusion was that "on some filesystems for
some workloads depending on what they do, writeback may do something bad".
Wu brought it back on topic by focusing on I/O-less dirty throttling and
the complexities that it brings. However, the intention is to minimize seeks,
and to provide less lock contention and low latency. He maintains that
there were some
impressive performance gains with some minor regressions. There are issues
around integration with task/cgroup I/O controllers but considering the
current state of I/O controllers, this was somewhat expected.
Bottomley asked about how much complexity this added; Dave Chinner pointed
out that the complexity of the code was irrelevant because the focus should be
on the complexity of the algorithm. Wu countered that the coverage of
his testing was pretty comprehensive, covering a wide range of hardware,
filesystems, and workloads.
For dirty reclaim, there is now a greater focus on pushing pageout work to the
flusher threads with some effort to improve interactivity by focusing dirty
reclaim on the tasks doing the dirtying. He stated that dirty pages reaching the
end of the LRU are still a problem and suggested the creation of a dirty LRU
list. With current kernels, dirty pages are skipped over by direct reclaimers,
which increases CPU cost, making it a problem that varies between kernel
versions. Moving them to a separate list unfortunately requires a page flag
which is not readily available.
Memory control groups bring their own issues with writeback, particularly
around flusher fairness. This is currently beyond control with only
coarse options available such as limiting the number of operations that
can be performed on a per-inode basis or limiting the amount of IO that
can be submitted. There was mention of throttling based on the amount of
IO a process completed but it was not clear how this would work in
practice.
The final topic was on the block cgroup (blkcg) I/O controller and the different
approaches
to throttling based on I/O operations/second (IOPS) and access to disk
time. Buffered writes are a
problem, as is how they could possibly be handled via
balance_dirty_pages().
A big issue with throttling buffered writes is still identifying the I/O
owner and throttling them at the time the I/O is queued, which happens after
the I/O owner has already executed a read() or write(). There was a request
to clarify what the best approach might be but there were
few responses. As months, if not years, of discussion on the lists imply,
it is just not a straightforward topic and it was suggested that a spare
slot be stolen to discuss it further (see the follow-up in the filesystem
and storage sessions below).
At the end, Bottomley wanted an estimate of how close writeback was to
being "done".
After some hedging, Wu estimated that it was 70% complete.
Stable pages
The
problems surrounding stable pages were
the next topic under discussion. As was noted by Ted Ts'o, making writing
processes wait for writeback to complete on stable pages can lead to
unexpected and rather long latencies, which may be unacceptable for some
workloads. Stable pages are only really needed for some systems where
things like checksums calculated on the page require that the page be
unchanged when it actually gets written.
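To see why a checksum forces the page to stay unchanged, consider this small sketch. It is only an illustration: `submit_write()` and `complete_write()` are hypothetical helpers, and CRC32 stands in for whatever checksum a particular storage stack might use.

```python
import zlib


def submit_write(page):
    """Compute the checksum when the write is queued (hypothetical helper)."""
    return zlib.crc32(bytes(page))


def complete_write(page, checksum):
    """Later, when the data actually goes out, check it against the
    checksum that was computed at submission time."""
    return zlib.crc32(bytes(page)) == checksum


page = bytearray(4096)
checksum = submit_write(page)
stable_ok = complete_write(page, checksum)      # page untouched: matches

page[0] = 0xFF                                  # a writer dirties the page
unstable_ok = complete_write(page, checksum)    # mismatch at the device
```

If the page is modified between the two calls, the data that reaches the device no longer matches its checksum, which is the failure that stable pages exist to prevent.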
Sage Weil and Boaz Harrosh listed the three options for handling the
problem. The first was to reissue the write for pages that have changed while they
were undergoing writeback, but that can confuse some storage systems. Waiting
on the writeback (which is what is currently done) or doing a copy-on-write
(COW) of the page under writeback were the other two. The latter option was the initial focus of the
discussion.
James Bottomley asked if the cost of COW-ing the pages had been benchmarked
and Weil said that they hadn't been. Weil and Harrosh are interested in
workloads that really require stable writes and whether they were truly
affected by waiting for the writeback to complete. Weil noted that Ts'o
can just turn off stable pages, which fixes his problem. Bottomley asked:
could there just
be a mount flag to turn off stable pages? Another way to
approach that might be to have the underlying storage system inform the
filesystem if it needed stable writes or not.
Since waiting on writeback for stable pages introduces a number of
unexpected issues, there is a question of whether replacing it with
something with a different set of issues is the right way to go. The COW
proposal may lead to problems because it results in there being two pages
for the same storage location floating around. In addition, there are
concerns about what would happen for a file that gets truncated after its
pages have been copied, and how to properly propagate that information.
It is unclear whether COW would always be a win over waiting, so
Bottomley suggested that the first step should be to get some reporting
added into the stable writeback path to gather information on what
workloads are being affected and what those effects are. After that,
someone could flesh out a proposal on how to implement the COW solution
that described how to
work out the various problems and corner cases that were mentioned.
Memory vs. performance
While the topic name of Dan Magenheimer's slot, "Restricting Memory Usage
with Equivalent Performance", was not of his choosing, that didn't deter
him from presenting a problem for memory management developers to
consider. He started by describing a graph of the performance of a workload
as the amount of RAM available to it increases. Adding RAM reduces the
amount of time the workload takes, to a certain point. After that point,
adding more memory has no effect on the performance.
It is difficult or impossible to know the exact amount of RAM required to
optimize
the performance of a workload, he said. Two virtual machines on a single
host are sharing the available memory, but one VM may need the additional
memory that the other does not really need. Some kind of balance point
between the workloads being handled by the two VMs needs to be found.
Magenheimer has some ideas on ways to think about the problem that he
described in the session.
He started with an analogy of two countries, one of which wants resources
that the other has. Sometimes that means they go to war, especially in the
past, but more recently economic solutions have been used rather than
violence to allocate the resource. He wonders if a similar mechanism could
be used in the kernel. There are a number of sessions in the
memory management track that are all related to the resource allocation
problem, he said, including memory control group soft limits, NUMA
balancing, and
ballooning.
The top-level question is how to determine how much memory an
application actually needs vs. how much it wants. The idea is to try to find
the point where giving some memory to another application has a negligible
performance impact on the giver while the other application can use it to
increase its performance.
Beyond tracking the size of the application, Magenheimer posited that one
could use calculus and calculate the derivative of the size growth to gain
an idea of the "velocity" of the workload. Rik van Riel noted that this
information could be difficult to track when the system is thrashing, but
Magenheimer thought that tracking refaults could help with that problem.
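The "velocity" idea amounts to a finite-difference estimate of the derivative of workload size over time. The sketch below assumes that the workload's size is sampled at a fixed interval; the function name and the sample values are invented for illustration and were not part of the talk.

```python
def growth_velocity(samples, interval_s):
    """Estimate d(size)/dt between consecutive size samples.

    `samples` holds workload sizes (in pages, say) taken every
    `interval_s` seconds; the result has one entry per adjacent pair.
    """
    return [(b - a) / interval_s for a, b in zip(samples, samples[1:])]


# A workload that grows quickly, slows down, then stops growing.
velocities = growth_velocity([100, 150, 180, 180], interval_s=10)
```

A velocity that falls toward zero suggests that the workload is approaching the size it actually needs, while a persistently positive one suggests that it would use more memory if given the chance.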
Ultimately, Magenheimer wants to apply these ideas to RAMster, which allows
machines to share "unused" memory between them. RAMster would allow
machines to negotiate storing pages for other machines. For example, in an
eight-machine system, seven machines could treat the remaining machine as a
memory server, offloading some of their pages to that machine.
Workload size estimation might help, but the discussion returned to the old chestnut of trying to shrink
memory to find at what point the workload starts "pushing" back by either
refaulting or beginning to thrash. This would allow the issue to be expressed
in terms of control theory. A crucial part of using control theory is
having a feedback mechanism. By and large, virtual machines have almost
non-existent
feedback mechanisms for establishing the priority of different requests
for resources. Further, performance analysis on resource usage is limited.
Glauber Costa pointed out that potentially some of this could be investigated
using memory cgroups that vary in size to act as a type of feedback
mechanism even
if it lacked a global view of resource usage.
In the end, this session was a problem statement - what feedback mechanisms
does a
VM need to assess how much memory the workload on a particular machine
requires? This is related to workload working set size estimation but that
is
sufficiently different from Magenheimer's requirement that they may not share
that much in common.
Ballooning for transparent huge pages
Rik van Riel began by reminding the audience that transparent huge pages (THP) gave a large performance
gain in virtual machines by virtue of the fact that VMs use nested page
tables, which doubles
the normal cost of translation. Huge pages, by requiring far fewer
translations, can make much of the performance penalty associated with nested
page tables go away.
Once ballooning enters the picture, though, it rains on the parade by
fragmenting memory and reducing the number of huge pages that can be used.
The obvious approach is to balloon in 2M contiguous chunks. However, this
has its own problems because compaction can only do so much. If a guest must
shrink its memory by half, it may use all the regions that are capable
of being defragmented. This would reduce or eliminate the number of 2M huge
pages that could be used.
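The fragmentation effect is easy to quantify in a toy model. The sketch below assumes 4K base pages and 2M huge pages (512 base pages each) and counts how many 2M-aligned regions remain free of ballooned pages; the function name is hypothetical and the model ignores compaction entirely.

```python
PAGES_PER_HUGE = 512  # 2 MiB huge page / 4 KiB base page


def intact_huge_regions(total_pages, ballooned):
    """Count 2M-aligned regions that contain no ballooned 4K page."""
    touched = {page // PAGES_PER_HUGE for page in ballooned}
    return total_pages // PAGES_PER_HUGE - len(touched)


total = 8 * PAGES_PER_HUGE                      # a guest with 8 huge regions

# Eight ballooned pages, one landing in each region: no huge pages left.
scattered = intact_huge_regions(total, range(0, total, PAGES_PER_HUGE))

# The same eight pages packed into a single region: seven regions survive.
packed = intact_huge_regions(total, range(8))
```

The same amount of ballooned memory costs either every huge page or only one of them, depending purely on placement, which is why keeping balloon pages together matters.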
Van Riel's solution requires that balloon pages become movable within the guest,
which requires
changes to both the balloon driver and potentially the hypervisor. However,
no one in the audience saw a problem with this as such. Balloon pages are
not particularly complicated, because they just have one reference. They need a
new page mapping with a migration callback to release the reference to the
page and the contents do not need to be copied so
there is an optimization available there.
Once that is established, it would also be nice to keep balloon pages
within the same 2M regions. Dan Magenheimer mentioned a user that has
a similar type
of problem, but that problem is very closely related to what CMA does. It was
suggested that Van Riel may need something very similar to MIGRATE_CMA except
where MIGRATE_CMA forbids unmovable pages within their pageblocks, balloon
drivers would simply prefer that unmovable pages were not allocated. This
would allow further concentration of balloon pages within 2M regions without
using compaction aggressively.
There was no resistance to the idea in principle so one would expect that
some sort of prototype will appear on the lists during the next year.
Finding holes for mmap()
Rik van Riel started a discussion on the problem of finding free virtual areas
quickly during mmap() calls. Very simplistically, an
mmap() requires a linear search
of the virtual address space by virtual memory area (VMA) with some minor
optimizations for
caching holes and scan pointers. However, there are some workloads that use
thousands of VMAs so this scan becomes expensive.
VMAs are already organized by a red-black tree (RB tree). Andrea Arcangeli had
suggested that information
about free areas near a VMA could be propagated up the RB tree toward
the root. Essentially it would be an augmented RB tree that stores both
allocated and free information. Van Riel was considering a simpler approach
using a callback on a normal RB tree to store the hole size in the VMA. Using
that, each RB node would know the total free space below it in an unsorted
fashion.
That potentially introduces fragmentation as a problem but that
is inconsequential to Van Riel in comparison to the problem where a hole of a
particular alignment is required. Peter Zijlstra maintained that augmented trees
should be usable to do this, but that was disputed by Van Riel who said that
augmented RB tree users have significant implementation responsibilities
so this detail needs further research.
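The augmented-tree idea can be sketched as follows. This is an illustration under simplifying assumptions, not the kernel's code: it uses a plain binary search tree balanced by construction rather than a red-black tree, it handles neither alignment nor the space above the last VMA, and the names `Node`, `build_tree()`, and `find_hole()` are invented. Each node records the free gap just below its VMA along with the largest gap anywhere in its subtree, so a search can skip whole subtrees that cannot satisfy the request.

```python
class Node:
    def __init__(self, start, end, gap):
        self.start, self.end = start, end  # VMA covers [start, end)
        self.gap = gap                     # free space just below this VMA
        self.max_gap = gap                 # largest gap in this subtree
        self.left = self.right = None


def build_tree(vmas, floor=0):
    """Build a balanced tree from sorted, non-overlapping (start, end) pairs."""
    items, prev_end = [], floor
    for start, end in vmas:
        items.append((start, end, start - prev_end))
        prev_end = end

    def build(lo, hi):
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        node = Node(*items[mid])
        node.left, node.right = build(lo, mid), build(mid + 1, hi)
        for child in (node.left, node.right):
            if child is not None:
                node.max_gap = max(node.max_gap, child.max_gap)
        return node

    return build(0, len(items))


def find_hole(node, size):
    """Return the lowest address of a hole of at least `size`, or None."""
    while node is not None and node.max_gap >= size:
        if node.left is not None and node.left.max_gap >= size:
            node = node.left               # a lower-addressed hole exists
        elif node.gap >= size:
            return node.start - node.gap   # the hole just below this VMA
        else:
            node = node.right              # the hole must be to the right
    return None
```

With a balanced tree, the search visits one node per level instead of walking every VMA, which is the improvement being sought for workloads with thousands of VMAs.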
Again, there was little resistance to the idea in principle but there are
likely to be issues during review about exactly how it gets implemented.
AIO/DIO in the kernel
Dave Kleikamp talked about asynchronous I/O (AIO) and how it is currently
used for user pages. He
wants to be able to initiate AIO from within the kernel, so he wants to convert
struct iov_iter to contain either an iovec or bio_vec and then convert the
direct I/O path to operate on iov_iter.
He maintains that this should be a straightforward conversion based on
the fact that it is the generic code that does all the complicated things
with the various structures.
He tested the API change by converting a loop device to set O_DIRECT and
submit via AIO. This eliminated caching in the underlying filesystem and
assured consistency of the mounted file.
He sent out patches a month ago but did not
get much feedback and was
looking to figure out why that was. He was soliciting input on the
approach and how it might be improved but it seemed like many had either
missed the patches or otherwise not read them. There might be greater
attention in the future.
The question was asked whether it would be a compatible interface for
swap-over-arbitrary-filesystem. The latest swap-over-NFS patches introduced
an interface for pinning pages for kernel I/O but Dave's patches appear to
go further. It would appear that swap-over-NFS could be adapted to use
Dave's work.
Dueling NUMA migration schemes
Peter Zijlstra started the session by talking about his approach for
improving performance on NUMA machines. Simplistically, it assigns processes
to a home node that allocation policies will prefer to allocate from and
that load-balancer policies will use to keep threads near the memory they are using.
System calls are introduced to allow assignment of thread groups and
VMAs to nodes. Applications must be aware of the API to take advantage
of it.
Once the decision has been made to migrate threads to a new node, their
pages are unmapped and migrated as they are faulted, minimizing the number
of pages to be migrated and correctly accounting for the cost of the
migration to
the process moving between nodes. As file pages may potentially be shared,
the scheme focuses on anonymous pages.
In general, the scheme is expected to work well for the case where the
working set fits within a given NUMA node but be easier to implement than
the hard
binding support currently offered by the kernel. Preliminary tests indicate
that it does what it is supposed to do for the cases it handles.
One key advantage Zijlstra cited for his approach was that he maintains
information based on thread and VMA, which is predictable. In contrast,
Andrea Arcangeli's
approach requires storing information on a per-page basis and is much heavier in
terms of memory consumption. There were few questions on the specifics of
how it was implemented with comments from the room focusing instead on comparing
Zijlstra's and Arcangeli's approaches.
Hence, Arcangeli presented AutoNUMA, which consists of a number of components.
The first is the knuma_scand component which is a page table
walker that tracks the
RSS usage of processes and the location of their pages. To track reference
behavior, a NUMA page fault hinting component temporarily changes page table entries
(PTEs) to an arrangement that is
similar, but not identical, to PROT_NONE. Faults are then used
to record what process is using a given page in memory. knuma_migrateN
is a per-node thread that is responsible for migrating pages if a process
should move to a new node. Two further components move threads near
the memory they are using or alternatively, move memory to the CPU that is
using it. Which option it takes depends on how memory is currently being
used by the processes.
There are two types of data being maintained for
decisions. sched_autonuma
works on a task_struct basis and the data is collected by NUMA hinting
page faults. The second is mm_autonuma which works on an
mm_struct basis
and gathers information on the working set size and the location of the
pages it has mapped, which is generated by knuma_scand.
The details on how it decides whether to move threads or memory to different
NUMA nodes are involved, but Arcangeli expressed a high degree of confidence
that it could make close to optimal decisions on where threads and memory
should be located. Arcangeli's slide that describes the AutoNUMA workflow
is shown at right.
When it happens, migration is based on per-node queues and care is taken to
migrate pages at a steady rate to avoid bogging the machine down copying data.
While Arcangeli acknowledged the overall concept was complicated, he asserted
that it was relatively well-contained without spreading logic throughout
the whole of MM.
As with Zijlstra's talk, there were few questions on the specifics of how it
was implemented, implying that not many people in the room have reviewed
the patches, so Arcangeli moved on to explaining the benchmarks he ran. The
results of the benchmarks looked as if performance was within a few percent of
manually binding memory and threads to local nodes. It was interesting to
note that for one benchmark, specjbb, it was clear that how well AutoNUMA
does varies, which shows its non-deterministic behavior. But its performance
never dropped below the base performance. He explained that the variation
could be partially explained by the fact that AutoNUMA currently does not
migrate THP pages; instead, it splits them and migrates the individual pages,
depending on khugepaged to collapse the huge pages again.
Zijlstra pointed out that, for some of the benchmarks that were presented,
his approach potentially performed just as well without the algorithm
complexity or memory overhead. He asserted this was particularly true
for KVM-based workloads as long as the workload fits within a NUMA node.
He pointed out that the history of memcg led to a situation where it had
to be disabled by default in many situations because of the overhead and
that AutoNUMA was vulnerable to the same problem.
When it got down to it, the points discussed were not massively different
from discussions on the mailing list except perhaps in terms of
tone. Unfortunately,
there was little discussion on whether there was any compatibility
between the two approaches and what logic could be shared. This was due
to time limitations but future reviewers may have a clearer view of the
high-level concepts.
Soft limits in memcg
Ying Han began by introducing soft reclaim and stated she wanted to find what
blockers existed for merging parts of it. It has reached the point where it
is getting sufficiently complicated that it is colliding with other aspects
of the memory cgroup (memcg) work.
Right now, the implementation of soft limits allows memcgs to grow above a soft
limit in the absence of global memory pressure. In the event of global memory
pressure then memcgs get shrunk if they are above their soft limit. The
results for shrinking are similar to hierarchical reclaim for hard limits.
In a superficial way, this concept is similar to what Dan Magenheimer
wanted for RAMster
except that it applies to cgroups instead of machines.
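The basic soft-limit behavior can be sketched in a few lines. The proportional split used here is an assumption made purely for illustration (the session did not settle how reclaim should be distributed), and the function name and the dictionaries are hypothetical.

```python
def soft_limit_reclaim(usage, soft_limit, pages_needed):
    """Under global memory pressure, pick pages to reclaim from groups
    that are above their soft limit, in proportion to their excess.

    Returns {group: pages_to_reclaim}; groups at or below their soft
    limit are left alone. The proportional policy is an assumption.
    """
    excess = {g: usage[g] - soft_limit[g]
              for g in usage if usage[g] > soft_limit[g]}
    total = sum(excess.values())
    if total == 0:
        return {}  # nobody is over their limit: nothing to take
    target = min(pages_needed, total)
    return {g: target * e // total for g, e in excess.items()}


usage = {"a": 300, "b": 150, "c": 50}
limits = {"a": 100, "b": 100, "c": 100}
plan = soft_limit_reclaim(usage, limits, pages_needed=100)
```

Note the empty result when every group is under its soft limit; a policy this naive makes no progress in that case, however much memory pressure there is.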
Rik van Riel pointed out that it is possible for a task to fit in a
node and be within its soft limit. If there are other cgroups on the
same node, the aggregate soft limit can be above the node size and, in
some cases, that cgroup should be shrunk even if it is below the soft limit.
This has a quality-of-service impact; Han recognizes that this needs to
be addressed. This is somewhat of an administrative issue. The total of
all hard limits can exceed physical memory with the impact being that
global reclaim shrinks cgroups before they hit their hard limit. This
may be undesirable from an administrative point of view. For soft
limits, it makes even less sense if the total soft limits exceed
physical memory as it would be functionally similar to if the soft
limits were not set at all.
The primary issue was deciding the ratio at which pages should be reclaimed from
cgroups. If there is global memory pressure and all cgroups are under
their soft limit then a situation potentially arises whereby reclaim is
retried indefinitely without forward progress. Hugh Dickins pointed out
that soft
reclaim has no requirement that cgroups under their soft limit never be
reclaimed. Instead, reclaim from such cgroups should simply be resisted
and the question is how it should be resisted. This may require that all
cgroups get scanned to discover that they are all under their soft limit
and then require burning more CPU rescanning them. Throttling logic is
required but
ultimately this is not dissimilar to how kswapd or direct reclaimers get
throttled when scanning too aggressively. As with many things, memcg is
similar to the global case but the details are subtly different.
Even then, there was no real consensus on how much memory should be reclaimed
from cgroups below their soft limit. There is an inherent fairness
issue here that does not appear to have changed much between different
discussions. Unfortunately, discussions related to soft reclaim are separated
by a lot of time and people need to be reminded of the details. This meant
that little forward progress was made on whether to merge soft reclaim or
not but there were no specific objections during the session. Ultimately,
this is still seen as being a little Google-specific, particularly as some
of the shrinking decisions were tuned based on Google workloads. New use
cases are needed to tune the shrinking decisions and to support the patches
being merged.
Kernel interference
Christoph Lameter started by stating that each kernel upgrade resulted in
slowdowns
for his target applications (which are for high-speed trading). This generates a lot of resistance to kernels
being upgraded on their platform. The primary sources of interference were
from faults, reclaim, inter-processor interrupts, kernel threads, and
user-space daemons. Any one
of these can create latency, sometimes to a degree that is catastrophic to their
application. For example, if reclaim causes an additional minor fault to
be incurred, it is in fact a major problem for their application.
This happens because of several trends. Kernels are simply more
complex, with more causes of interference leaving less processor time
available to the user. Other trends which affect them are
larger memory sizes leading to longer reclaim as well as more processors
meaning that for-all-cpu loops take longer.
One possible measure would be to isolate OS activities to a subset of CPUs
possibly including interrupt handling. Andi Kleen pointed out that even with
CPU isolation, if unrelated processes are sharing the same socket,
they can interfere with each other. Lameter maintained that, while this
was true, such isolation was still of benefit to them.
For some of the examples brought up, there are people working on the
issues but they are still works in progress and have not been
merged. The fact of the matter is that the situation is less than ideal
with kernels today. This is forcing them into a situation where they fully
isolate some CPUs and bypass the OS as much as possible, which turns Linux into
a glorified boot loader. It would be in the interest of the community to
reduce such motivations by watching the kernel overhead, he said.
Filesystem and storage sessions
Copy offload
Frederick Knight, who is the NetApp T10 (SCSI) standards guy, began by
describing copy offload, which is a method for allowing SCSI devices to
copy ranges of blocks without involving the host operating system.
Copy offload is
designed to be a lot faster for large files because wire speed is no
longer the limiting factor. In fact, in spite of the attention now,
offloaded copy has been in SCSI standards in some form or other since
the SCSI-1 days. EXTENDED COPY (abbreviated as XCOPY) takes two
descriptors for the source and destination and a range of blocks.
It is then implemented in a push model (source sends the blocks to the
target) or a pull model (target pulls from source) depending on which
device receives the XCOPY command. There's no requirement that the
source and target use SCSI protocols to effect the copy (they may use an
internal bus if they're in the same housing) but should there be a
failure, they're required to report errors as if they had used SCSI
commands.
A far more complex command set is TOKEN based copy. The idea here is
that the token contains a ROD (Representation of Data) which allows
arrays to give you an identifier for what may be a snapshot. A token
represents a device and a range of sectors which the device guarantees
to be stable. However, if the device does not support snapshotting and
the region gets overwritten (or in fact, for any other reason), it may
decline to accept the token and mark it invalid. This, unfortunately,
means you have no real idea of the token lifetime, and every time the
token goes invalid, you have to do the data transfer by other means
(or renew the token and try again).
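The renew-or-fall-back behavior described above can be sketched as a small retry loop. This is only an illustration: `TokenInvalid`, `copy_range()`, and the three callables are hypothetical names, not part of any SCSI library or kernel interface.

```python
class TokenInvalid(Exception):
    """Raised by the (hypothetical) device layer when a token is rejected."""


def copy_range(get_token, offload_copy, fallback_copy, max_renewals=1):
    """Try an offloaded copy, renewing the token a bounded number of times.

    get_token, offload_copy and fallback_copy are caller-supplied callables
    (assumptions for this sketch). Returns "offloaded" or "fallback".
    """
    for _ in range(max_renewals + 1):
        token = get_token()          # ask the device for a fresh token
        try:
            offload_copy(token)      # device may invalidate the token at any time
            return "offloaded"
        except TokenInvalid:
            continue                 # renew the token and try again
    fallback_copy()                  # transfer the data by other means
    return "fallback"
```

Bounding the number of renewals matters because, as noted above, the host has no real idea of the token lifetime.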
There was a lot of debate on how exactly we'd make use of this feature and
whether tokens would be exposed to user space. They're supposed to be
cryptographically secure, but a lot of participants expressed doubt on
this and certainly anyone breaking a token effectively has access to all of
your data.
NFS and CIFS are starting to consider token-based copy commands, and the
token format would be standardized, which would allow copies from a SCSI disk
token into an NFS/CIFS volume.
Copy offload implementation
The first point made by Hannes Reinecke is that identification of source and target for
tokens is a nightmare if everything is done in user space. Obviously,
there is a
need to flush the source range before constructing the token, then we
can possibly use FIEMAP to get the sectors. Chris Mason pointed out
this wouldn't work for Btrfs and after further discussion the concept of
a ref-counted FIETOKEN operation emerged instead.
Consideration then
moved to hiding the token in some type of reflink() and splice()-like system
calls. There was a lot more debate on the mechanics of this,
including whether the token should be exposed to user space (unfortunately, yes,
since NFS and CIFS would need it). Discussion wrapped up with the
thought that we really needed to understand the user-space use cases of
this technology.
RAID unification
pNFS is beginning to require complex RAID-ed objects which require
advanced RAID topologies. This means that pNFS implementations need an
advanced, generic, composable RAID engine that can implement any
topology in a single compute operation. MD was rejected because
composition requires layering within the MD system and that means you can't
do advanced topologies in a single operation.
This proposal was essentially for a new interface that would unify all
the existing RAID systems by throwing them away and writing a new one.
Ted Ts'o pointed out that filesystems making use of this engine don't
want to understand how to reconstruct the data, so the implementation
should "just work" for the degraded case. If we go this route, we
definitely need to ensure that all existing RAID implementations work as
well
as they currently do.
The action summary was to start with MD and then look at Btrfs. Since
we don't really want new administrative interfaces exposed to users, any new
implementation should be usable by the existing LVM
RAID interfaces.
Testing
Dave Chinner reminded everyone that the methodology behind xfstest is "golden
output matching". That means that all XFS tests produce output which is
then filtered (to remove extraneous differences like timestamps or,
rather, to fill them in with X's); success or failure is indicated by
whether the results differ from the expected golden result file. This
means that the test itself shouldn't process output.
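A minimal sketch of golden output matching follows. It is not the xfstest code (which uses its own filter helpers); the timestamp pattern and the function names `filter_output()` and `matches_golden()` are assumptions chosen for illustration.

```python
import re

# Assumed timestamp format for this sketch: YYYY-MM-DD HH:MM:SS
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def filter_output(raw):
    """Replace every digit inside a timestamp with 'X'."""
    return _TIMESTAMP.sub(lambda m: re.sub(r"\d", "X", m.group(0)), raw)


def matches_golden(raw_output, golden):
    """A test passes when the filtered output equals the golden text."""
    return filter_output(raw_output) == golden
```

The test itself only emits output; all judgment about pass or fail lives in the comparison against the golden file.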
Almost every current filesystem is covered by xfstest in some form
and all the code in XFS is tested at 75-80% coverage. (Dave said we
needed to run the code coverage tools to determine what the code
coverage of the tests in other filesystems actually is). Ext4, XFS and Btrfs
regularly have the xfstest suite run as part of their development cycle.
Xfstest consists of ~280 tests which run in 45-60 minutes (depending on
disk speed and processing power). Of these tests, about 100 are
filesystem-independent. One of the problems is that the tests are highly
dependent
on the output format of tools, so, if that changes, the test reports false
failures. On the other hand, it is easily fixed by constructing a new
golden output file for the tests.
One of the maintenance nightmares is that the tests are numbered rather
than named (which means everyone who writes a new test adds it as number 281
and Dave has to renumber). This should be fixed by naming tests instead.
The test space should also become hierarchical (grouping by function) rather
than the current flat scheme.
Keeping a matrix of test results over time allows far better data mining
and makes it easier to dig down and correlate reasons for intermittent
failures, Chinner said.
Flushing and I/O back pressure
This was a breakout session to discuss some thoughts that arose
during the general writeback session (reported above).
The main concept is that writeback limits are trying to limit the amount
of time (or IOPS, etc.) spent in writeback. However, the flusher threads
are currently unlimited because we have no way to charge the I/O they do
to the actual tasks. Also, we have problems accounting for metadata
(filesystems with journal threads) and there are I/O priority inversion problems
(can't have high priority task blocked because of halted writeout on a
low priority one which is being charged for it).
There are three problems:
- Problems between CFQ and block flusher. This should now be
solved by tagging I/O with the originating cgroup.
- CFQ throws all I/O into a single queue (Jens Axboe thinks this isn't a
problem).
- Metadata ordering causes priority inversion.
On the last, the thought was that we could use transaction reservations as an
indicator for whether we had to complete the entire transaction (or just
throttle it entirely) regardless of the writeback limits which would
avoid the priority inversions caused by incomplete writeout of
transactions. For dirty data pages, we should hook writeback throttling
into balance_dirty_pages(). For the administrator, the system needs to
be simple, so there needs to be a single writeback "knob" to adjust.
Another problem is that we can throttle a process which uses buffered
I/O but not if it uses AIO or direct I/O (DIO), so we need to come up with a throttle
that works for all I/O.
By Jake Edge
April 4, 2012
Day two of the Linux Storage, Filesystem, and Memory Management Summit was
much like its predecessor, but with fewer
combined sessions. It was held in San Francisco on April 2. Below is a
look at the combined sessions as well as
those in the Memory Management track that is largely based on write-ups
from Mel Gorman as well as some additions from my notes. In addition, James
Bottomley has written up the Filesystem and Storage track.
Flash media
Steven Sprouse was invited to the summit to talk about flash
media. He is the director of NAND systems architecture at SanDisk, and his
group is concerned with consumer flash products - for things like mobile
devices, rather than enterprise storage applications, which is handled by a
different group. But, he said, most of what he would be talking about is
generic to most flash technologies.
The important measure of flash for SanDisk is called "lifetime terabyte
writes", which is calculated by the following formula:
physical capacity * write endurance
-----------------------------------
write amplification
Physical capacity is increasing, but write endurance is decreasing (mostly
due to cost issues). Write amplification is a measure of the actual amount
of writing that must be done because the device has to copy data based on
its erase block size.
Write amplification is a function of the usage of the device, its block
size, over-provisioning, and the usage of the trim command (to tell the
device what blocks are no longer being used).
Block sizes (which are the biggest concern for write amplification) are
getting bigger for flash devices, resulting in higher
write amplification.
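As a rough illustration of the formula, the sketch below computes lifetime terabyte writes. It assumes that write endurance is expressed as a number of program/erase cycles; the function name and any capacity or cycle figures fed to it are made up for the example, not numbers from the talk.

```python
def lifetime_terabyte_writes(capacity_gb, endurance_cycles, write_amplification):
    """Lifetime writes = physical capacity * write endurance / write amplification.

    capacity_gb is in decimal gigabytes; the result is in decimal terabytes.
    """
    if write_amplification <= 0:
        raise ValueError("write amplification must be positive")
    lifetime_gb = capacity_gb * endurance_cycles / write_amplification
    return lifetime_gb / 1000
```

For a hypothetical 64GB device rated for 3,000 cycles, a write amplification of four gives 48TB of host writes; halving the write amplification doubles that figure, which is why it gets so much attention.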
The write endurance is measured in data retention years. As the cells in
the flash get cycled, the amount of time that data will last is reduced.
If 10,000 cycles are specified for the device, that doesn't mean they die
at that point, just that they may no longer hold data for the required
amount of time. There is also a temperature factor and most of the devices
he works with have a maximum operating temperature of 45-50°C. Someone
asked about read endurance, and Sprouse said that reads do affect endurance
but didn't give any more details.
James Bottomley asked if there were reasons that filesystems should start
looking
at storing long-lived and short-lived data separately and not mixing the
two. Sprouse said that may eventually be needed. He said there is a trend
toward hybrid architectures that have small amounts of high-endurance (i.e.
can handle many more write cycles) flash
and much larger amounts of low-endurance flash. Filesystems may want to
take advantage of that by storing things like the journal in the
high-endurance portion, and more stable OS files in the low-endurance
area. Or storing hot data on high-endurance and cold data on
low-endurance. How that will be specified is not determined, however.
The specs for a given device are based on the worst-case flash cell, but
the average cell will perform much better than that worst case. If you
cycle all of the cells in a device the same number of times, one of the
pages might well
only last 364 days, rather than the one year in the spec. Those values are
determined by the device being "cycled, read, and baked", he said. The
latter is the temperature testing that is done.
Sprouse likened DRAM and SRAM to paper that has been written on in pencil.
If a word is wrong, it can be erased without affecting the surrounding
words. Flash is like writing in pen; it can't be erased, so a one-word
mistake requires that
the entire page be copied. That is the source of write amplification.
From the host side, there may be a 512K write, but if that data resides in
a 2048K
block on the flash, the other three 512K chunks need to be copied, which
makes for a write amplification factor of four. In 2004, flash devices
were like
writing on a small Post-it pad that could only fit four words, but in 2012,
it is like writing on a piece of paper the size of a large table. The cost
for a one-word change has gone way up.
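The arithmetic behind that factor of four can be sketched as below. This is a simplified worst-case model, assuming that the write is aligned to erase-block boundaries and that every touched erase block is rewritten in full; real devices are more subtle, and the function name is hypothetical.

```python
def write_amplification(host_write_kib, erase_block_kib):
    """Bytes physically written divided by bytes the host asked to write."""
    # Ceiling division: number of erase blocks an aligned write touches.
    blocks_touched = -(-host_write_kib // erase_block_kib)
    return blocks_touched * erase_block_kib / host_write_kib
```

With a 512K write into a 2048K block the result is 4.0, matching the example above, while a write of exactly one erase block gives 1.0.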
In order for filesystems to optimize their operation for the geometry of
the flash, there needs to be a way to get that geometry information.
Christoph Hellwig pointed out that Linux developers have been asking for
five years for ways to get that information without success. Sprouse
admitted that was a problem and that exposing that information may need to
happen. There is also the possibility of filesystems tagging the data they
are writing to give the device the information necessary to make the right
decision.
Sprouse posed a question about the definition of a "random" write. A 1G
write would be considered sequential by most, while 4K writes would be
random, but what about sizes in between? Bottomley said that anything
beyond 128K is sequential for Linux, while Hellwig said that anything up to
64M is random. But the "right" answer was: "tell me what the erase block
size is". For flash, random writes are anything smaller than the erase
block size. In the past writing in 128K chunks would have been reasonable,
he said, but today each write of that size may make the flash copy several
megabytes of data.
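That definition of "random" reduces to a one-line comparison. The sketch below is illustrative only; the 4MB erase block used in the note that follows is an assumed value, since the real size is exactly what the devices do not expose.

```python
KIB = 1024
MIB = 1024 * KIB


def is_random_write(write_bytes, erase_block_bytes):
    """For flash, a write is 'random' if it is smaller than an erase block."""
    return write_bytes < erase_block_bytes
```

A 128K write is sequential on a device with 128K erase blocks, but random on one with (hypothetical) 4MB erase blocks.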
One way to minimize write amplification is to group data that is going to
become obsolete at roughly the same time. Obsolete can mean that the data is
overwritten or that it is thrown away via a trim or discard command.
The filesystem should strive to avoid having cold data get copied because
it is accidentally mixed in with hot data. As an example, Ted Ts'o
mentioned package files (from an RPM or Debian package), which are likely
to be obsoleted at the same time (by a package update or removal). Some
kind of interface so that user space can communicate that information would
be required.
In making those decisions, the focus should be on the hottest files (those changing most frequently)
rather than the coldest files, Sprouse said. If the device could somehow
know what the logical block addresses associated with each file are, that would
help it make better decisions.
As an example, if a flash device has four dies, and four files are being
written, those files could be written in parallel across the dies. That
has the effect of being fast for writing, but is much slower when updating
one of the files. Alternatively, each could be written serially, which is
slower, but will result in much less copying if one file is updated. Data
must be moved around under the hood, Sprouse said, and if the flash knows
that a set of rewrites are all associated with a single file, it could
reorganize the data appropriately when it does the update.
There are a number of things that a filesystem could communicate to the
device that would help it make better decisions. Which blocks relate to
the same file, and which are related by function, like files in a
/tmp directory that will be invalid after the next boot, or are OS
installation files or browser cache files. Filesystems could also mark
data that will be read frequently or written frequently. Flash vendors need
to provide a
way for the host to determine the geometry of a device like its page size,
block size, and
stripe size.
Those are all areas where OS developers and flash vendors could cooperate,
he said. Another that he mentioned was to provide some way for the host to
communicate how much time has elapsed since the last power off. Flash
devices are still "operating" even when they are powered off, because they
continue to hold the data that was stored. You could think of flash as
DRAM with a refresh rate of one year, for example. If the flash knows that
it has been off for six months it could make better decisions for data
retention.
Some in the audience advocated an interface to the raw
flash, rather than going through the flash translation layer (FTL).
Ric Wheeler disagreed, saying that we don't want filesystems to
have to
know about the low-level details of flash handling. Ts'o agreed and noted
that new technologies may come along that invalidate all of the work that
would have been put in for an FTL-less filesystem. Chris Mason also
pointed out that flash manufacturers want to be able to put a sticker on
the devices saying that it will store data for some fixed amount of time.
They will not be able (or willing) to do that if it requires the OS to do
the right
thing to achieve that.
One thing that Mason would like to see is some feedback on hints that
filesystems may end up providing to these devices. One of his complaints
is that there is no feedback mechanism for the trim command, so that
filesystem developers can't see what benefits using trim provides. Sprouse
said that trim has huge benefits, but Mason wants to know whether Linux is
effective at trimming. He would like to see ways to determine whether
particular trim strategies are better or worse and, by extension, how any
other hints provided by filesystems are performing.
Bottomley asked if flash vendors could provide a list of the information
they are willing to provide about the internals of a given device. With
that list, filesystem developers could say which would be useful. Many of
these "secrets" about the internals of flash devices are not so secret, as
Ts'o pointed out that Arnd Bergmann has done timing attacks to suss out
these details, which he has published.
Even if there are standards that provide ways for hosts to query these
devices for geometry and other information, that won't necessarily solve
the problem. As someone in the audience pointed out, getting things like
that into a
standard does not force the vendors to correctly fill in the data.
Wheeler asked if
it would help for the attendees' "corporate overlords" to officially ask
for that kind of cooperation from the vendors. There were representatives
of many large flash-buying companies at the summit, so that might be a way to
apply some pressure. Sprouse said that like most companies, there are
different factions within SanDisk (and presumably other flash companies).
His group sees the benefit of close cooperation with OS developers, but
others see the inner workings as "secret sauce".
It is clear there are plenty of ways for the OS and these devices to
cooperate, which would result in better usage and endurance. But there is
plenty of work to do on both sides before that happens.
Device mapper and Bcache
Kent Overstreet discussed the Bcache project, which creates
an SSD-based cache for other (slower) block devices. He
began by pointing out that the device mapper (DM) stores much of the
information that Bcache
would need in user space. Basically, the level of pain required to extract the
necessary
information from DM meant that they bypassed it entirely.
It was more or less acknowledged that Bcache is sufficiently
well established
in terms of performance that DM should perhaps provide an API it can use.
Basically, if a flash cache is to be implemented in kernel, basing it upon
Bcache would be preferable. It would also be preferred if any such cache
was configured
via an established interface such as DM; this is the core issue that is
often bashed around.
It was pointed out that Bcache also required some block-layer changes to split
BIOs in some cases, depending on the contents of the btree, which would have
been difficult to communicate via DM. This reinforces the original point that
adapting Bcache to DM would require a larger number of changes than expected.
There was some confusion on exactly how Bcache was implemented and what the
requirements are, but the Bcache developers were not against adding DM support as such.
They were just indifferent to DM because their needs had already been served.
In different variations, the point was made that the community is behind
the curve in terms of caching on flash and that some sort of decision is needed.
This did not prevent the discussion being pulled in numerous different
directions that brought up a large number of potential issues with any possible
approach. The semi-conclusion was that the community "has to do something", but
no real conclusion was reached on what that should be. There was a vague message that
a generic caching storage layer was required that would be based on SSD
initially, but exactly at which layer it should exist was unclear.
Memory hotplug
Hiroyuki Kamezawa discussed the problem of hot unplugging full NUMA nodes on
Intel "Ivy Bridge"-based platforms. There are certain structures that are
allocated on a
node that cannot be reclaimed before unplug such as pgdat. The basic
approach is to declare these nodes as fully ZONE_MOVABLE and allocate
needed support structures off-node.
The nodes this policy affects can be set via kernel parameters.
An alternative is to boot only one node and, later, hotplug the remaining
nodes,
marking them ZONE_MOVABLE as they are brought up. Unfortunately, there is an enumeration
problem with this. The mapping of physical CPUs to NUMA nodes is
not constant because altering a BIOS setting such as HT may change that mapping. For
similar reasons, the NUMA node ID may change if DIMMs change slots.
Hence, the problem is that the physical node IDs and node IDs as reported
by the kernel are not the same between boots. If, on a four-node machine
they boot nodes zero and one and hotplug node two, the physical addresses
might vary
and this is problematic when deciding which node to remove or even when
deciding where to place containers.
To overcome this, they need some sort of translation layer that virtualizes
the CPU and node ID numbers to keep the mappings consistent between boots.
There is more than one use case for this, but the problem mentioned
regarding companies that have very restrictive licensing based on CPU IDs
was not a very
popular one. To avoid going down a political rathole, that use case was
acknowledged, but the conversation moved on as there are enough other
reasons to provide the translation layer.
It was suggested that perhaps only one CPU and node be activated at boot,
with the remaining nodes brought up after udev is active.
udev could be used
to create symbolic links in sysfs mapping virtual CPU IDs to physical CPU IDs and,
similarly, virtual node IDs to the underlying physical IDs.
A further step might be to rename CPU IDs and node IDs at runtime
to match what udev discovers similar to the way network devices can be renamed,
but that may be unnecessary.
Conceivably, an alternative would be that the
kernel could be informed what the mapping from virtual IDs to physical IDs
should be (based on what's used by management software)
and rename the sysfs directories accordingly, but that would be
functionally equivalent. It was also suggested that
this should be managed by the hardware but that is probably optimistic and
would not work for older hardware.
Unfortunately, there was no real conclusion on whether such a scheme could
be made to work or if it would suit Kamezawa's requirements.
Stalled MM patches
Dan Magenheimer started by discussing whether frontswap should be merged. It
got stalled, he said, due to bad timing as
he passed a line where there was
an increased emphasis on review and testing. To address this he gave an overview
of transcendent memory and its components such as the cleancache and frontswap
front-ends and the zcache, RAMster, Xen, and KVM backends. Many of these
components
have been merged, with RAMster being the most recent addition, but frontswap is
noticeable by its absence despite the fact that some products ship with it.
He presented the results of a benchmark run based on the old reliable
parallel kernel build with increasing numbers of parallel compiles until
it started hitting swap. He showed the
performance difference when zcache was enabled. The figures seemed to imply
that the overhead of the schemes was minimal until there was memory pressure
but when zcache was enabled, performance could in fact improve due to more
efficient use of RAM and reduced file and swap I/O. He referred people to
the list where more figures are available.
He followed up by presenting the figures when the RAMster backend was used.
The point was made that using RAMster might show an improvement on the target
workload while regressing the performance of the machine that RAMster was
taking resources from. Magenheimer acknowledged this but felt that there was sufficient
evidence justifying frontswap's existence to have it merged.
Andrew Morton suggested posting it again with notes on what products are
shipping
with it already. He asked how many people had done a detailed review and
was discouraged that apparently no one had. On further pushing it turned
out that Andrea Arcangeli had looked at it and while he saw some problems he also
thought it had been significantly improved in recent times. Rik van Riel's
problem
was that frontswap's API was synchronous but Magenheimer believes that some
of these
concerns have been alleviated in recent updates. Morton said that if
this gets merged, it will affect everyone and insisted that people review
it. It seems probable that more review will be forthcoming this time around
as people in the room did feel that the frontswap+zcache combination, in
particular, would be usable by KVM.
Kyungmin Park then talked about the contiguous memory allocator (CMA) and
how it has gone through several versions
with review but without being merged. Morton said that he had almost merged
it a few times but then a new version would come out. He said to post it
again and he'll merge that.
Mel Gorman then brought up swap over NFS,
which has also stalled. He
acknowledged that the patches are complex, and the feedback has been that
the feature isn't really needed. But, he maintained, that's not true; it
is used by some and, in fact, ships with SUSE Linux. Red Hat does not ship it,
but he has had queries from at least one engineer there about the status of
the patches.
Gorman's basic question was whether the MM developers were willing to deal
with the complexity of swap over NFS. The network people have "stopped
screaming" at him, which is not quite the same thing as being happy with the
patches, but Gorman thinks progress has been made there. In addition,
there are several other "swap over network filesystem" patches floating
around, all of which will require much of the same infrastructure that swap
over NFS requires.
Morton said that the code needs to be posted again and "we need to promise
to look at it". Hopefully that will result in comments on whether it is
suitable in its current state or, if not, what has to be done to
make it acceptable.
Issues with mmap_sem
While implementing a page table walker for estimating working set size, Michel
Lespinasse found a number of cases where mmap_sem hold time for
writes caused
significant problems. Despite the topic title ("Working Set Estimation"),
he focused on enumerating the worst mmap_sem hold times,
such as when a mapped file is accessed and the atime must be updated or
when a threaded application is scanning files and hammering mmap_sem. The
user visible effects of this can be embarrassing. For example, ps can stall
for long periods of time if a process is stalled on mmap_sem which makes
it difficult to debug a machine that is responding poorly.
There was some discussion on how mmap_sem could be changed to alleviate
some of these problems. The proposed option was to tag a
task_struct before
entering a filesystem to access a page. If the filesystem needs to block
and the task_struct was tagged, it would release the mmap_sem and retry
the full fault from start after the filesystem returns control. Currently
the only fault handler that does this properly is x86. The implementation
was described as being ugly so he would like people to look at it and
see how it could be improved. Hugh Dickins agreed that it was ugly and wants an
alternative. He suggested that maybe we want an extension of pte_same to cover
pte_same_vma_same() but it was not deeply considered. One possibility would
be to have a sequence counter on the mm_struct and observe whether it changed.
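The sequence-counter idea can be modeled in user space as follows. This is a toy sketch, not kernel code: `AddressSpace` stands in for mm_struct, a plain mutex stands in for mmap_sem (which is really a reader/writer semaphore), and `handle_fault()` and `blocking_read` are hypothetical names.

```python
import threading


class AddressSpace:
    """Toy stand-in for mm_struct: a lock plus a change counter."""

    def __init__(self):
        self.lock = threading.Lock()  # stands in for mmap_sem
        self.seq = 0                  # bumped on every mapping change
        self.mappings = {}

    def change_mapping(self, addr, backing):
        with self.lock:
            self.mappings[addr] = backing
            self.seq += 1


def handle_fault(mm, addr, blocking_read, max_retries=8):
    """Drop the lock around blocking I/O; retry if the mappings changed."""
    for _ in range(max_retries):
        with mm.lock:
            seq_before = mm.seq
            backing = mm.mappings[addr]
        data = blocking_read(backing)   # lock is NOT held while blocked
        with mm.lock:
            if mm.seq == seq_before:
                return data             # nothing changed; result is valid
        # The address space changed underneath us: retry the full fault.
    raise RuntimeError("address space kept changing")
```

The point of the counter is that the fault path can detect, after re-taking the lock, that what it looked up before blocking may be stale, and restart instead of using it.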
Andrea Arcangeli pointed out that just dropping the mmap_sem may not help as it still
gets hammered by multiple threads and instead the focus should be on avoiding
blocking when holding mmap_sem for writing because it is an exclusive lock. Lespinasse
felt that this was only a particular problem for mlockall() so there may be
some promise for dropping mmap_sem for any blocking and recovering afterward.
Dickins felt that, at some point in the past, mmap_sem
was dropped for writes and just a read semaphore held under some circumstances. He
suggested doing some archeology of the commits to confirm whether the kernel ever
did that and, if so, why it was dropped.
The final decision for Lespinasse was to post the patch that expands task_struct
with information that would allow the mmap_sem to be dropped before doing
a blocking operation. Peter Zijlstra has concerns that this might have some scheduler
impact and Andi Kleen was concerned that it did nothing for hold times in other
cases. It was suggested that the patch be posted with a micro-benchmark that
demonstrates the problem and what impact the patch has on it. People that
feel that there are better alternatives can then evaluate different patches
with the same metrics.
Page flags
Hugh Dickins credited Johannes Weiner's
work on reducing the size of
mem_cgroup and highlighted
Hiroyuki Kamezawa's further work. He asserted that mem_cgroup is now
sufficiently small
that it should be merged with page_cgroup.
He then moved on to page flag availability and pointed out that there
currently should be plenty of flags available on 64-bit systems. Andrew Morton
pointed out that
some architectures have stolen some of those flags already and that should
be verified. Regardless of that potential problem it was noted that, due to
some slab alignment patches, there is a hole in struct page and
there is a
race to make use of that space by expanding page flags.
The discussion was side-tracked by bringing up the problem of virtual
memory area (VMA) flag
availability. There were some hiccups with making VMA flags 64-bit in the
past but thanks to work by Konstantin Khlebnikov, this is likely to be
resolved in the near future.
Dickins covered a number of different uses of flags in the memory cgroup
(memcg)
and where they
might be stored but pointed out that memcg was not the primary target. His
primary concern was that some patches are contorting themselves to avoid
using a page flag. He asserted that the overhead of this complexity is now
higher than the memory savings from having a smaller struct page. As keeping
struct page very small was originally for 32-bit server class
systems (which are now becoming rare) he felt
that we should just expand page flags. Morton pointed out that we are going
to have to expand page flags eventually and now is as good a time as any.
Unfortunately numerous issues were raised about 32-bit systems that would
be impacted by such a change and it was impossible to get consensus on
whether struct page should be expanded or not. For example, it was pointed
out that embedded CPUs with cache lines of 32 bytes benefit from the current
arrangement. Instead it looks like further tricks may be investigated for
prolonging the current situation such as reducing the number of NUMA nodes
that can be supported on 32-bit systems.
Statistics for memcg
Johannes Weiner wanted to discuss the memcg statistics and what should be
gathered. His problem is that he had very little traction on the list and
felt maybe it would be better if he explained the situation in person.
The most important statistics he requires are related to memcg hierarchical
reclaim. The simple case is just the root group and the basic case is one
child that is reclaimed by either hitting its hard limit or due to global
reclaim. It gets further complicated when there is an additional child and this
is the minimum case of interest. In the hierarchy, cgroups might be
arranged as follows:
root
    cgroup A
        cgroup B
The problem is that if cgroup B is being reclaimed then it should be
possible to identify whether the reclaim is due to internal or external
pressure. Internal pressure would be due to cgroup B hitting its hard
limit. External pressure would be due to either cgroup A hitting its hard
limit or global reclaim.
He wants to report pairs of counters for internal and external
reclaims. By walking the cgroup tree,
the statistics for external pressure can be calculated. By looking at the
external figures for each cgroup in user space it can be determined exactly
where external pressure originated from for any cgroup. The alternative is
needing one group of counters per parent which is unwieldy. Just tracking
counters about the parent would be complicated if the group were migrated.
The storage requirements are just for the current cgroup. When reporting
to user space a tree walk is necessary so it costs computationally but
the information will always be coherent even if memcg changes location in
the tree. There was some dispute on what file exactly should expose this
information but that was a relatively minor problem.
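The accounting described above can be illustrated with a small user-space model. This is only a sketch of one possible reading of the proposal, not Weiner's patch: the `Cgroup` class, the `record_reclaim()` and `report()` helpers, and the page counts are all hypothetical. Each group stores just its own pair of counters, and the tree is walked only when a report is produced.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Cgroup:
    """Hypothetical stand-in for a memcg; only two counters are stored per group."""
    name: str
    parent: Optional["Cgroup"] = None
    internal: int = 0  # pages reclaimed because this group hit its own hard limit
    external: int = 0  # pages reclaimed due to an ancestor's limit or global reclaim
    children: List["Cgroup"] = field(default_factory=list)

    def __post_init__(self):
        if self.parent is not None:
            self.parent.children.append(self)


def is_ancestor(candidate, group):
    """Return True if `candidate` is a strict ancestor of `group`."""
    node = group.parent
    while node is not None:
        if node is candidate:
            return True
        node = node.parent
    return False


def record_reclaim(victim, origin, pages):
    """Account pages reclaimed from `victim`.

    `origin` is the group whose hard limit triggered the reclaim,
    or None for global reclaim (an assumption of this sketch).
    """
    if origin is victim:
        victim.internal += pages
    elif origin is None or is_ancestor(origin, victim):
        victim.external += pages
    else:
        raise ValueError("origin must be the victim, an ancestor, or None")


def report(group, depth=0):
    """Walk the tree at report time; returns (depth, name, internal, external) rows."""
    rows = [(depth, group.name, group.internal, group.external)]
    for child in group.children:
        rows.extend(report(child, depth + 1))
    return rows


root = Cgroup("root")
a = Cgroup("A", parent=root)
b = Cgroup("B", parent=a)

record_reclaim(b, b, 10)    # B hit its own hard limit: internal pressure
record_reclaim(b, a, 4)     # A hit its hard limit: external pressure on B
record_reclaim(b, None, 1)  # global reclaim: external pressure on B

for depth, name, internal, external in report(root):
    print("  " * depth + f"{name}: internal={internal} external={external}")
```

Because the counters live only in the group being reclaimed, moving a group elsewhere in the tree does not require rewriting any per-parent state in this model.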
The point of the session was for people to understand how he wants to
report statistics and why it is a sensible choice. It seemed that people
in the room had a clearer view of his approach and future review might
be more straightforward.
Development tree for memcg
Michal Hocko stood up to discuss the current state of the memcg devel tree.
After the introduction of the topic, Andrew Morton asked why it was not based on linux-next,
which Hocko said was a moving target that potentially leads to rebases. Morton did not really get why the tree was
needed but the memcg maintainers said the motivation was to develop against
a stable point in time without having to wrestle with craziness in linux-next.
Morton wanted the memcg stuff to be a client of the -mm tree. That tree is
itself a client of linux-next, but Morton feels he could manage the issues as
long as the memcg developers were willing to deal with rebases, which they
were. Morton is confident he
can find a way to compromise without the creation of a new tree. In the
event of conflicts, he said that those conflicts should be resolved
sooner rather than later.
Morton separately asked how long it is going to take to finish memcg.
It's one file, how much more can there be to do? Peter Zijlstra pointed out
that much of the complexity is due to changing semantics and continual churn.
The rate of change is slowing but it still happens.
The conclusion is that Morton will work on extracting the memcg stuff from
his view of the linux-mm tree into the memcg devel tree on a regular basis to
give them a known base to work against for new features. Some people in
the room commented that they missed the mmotm tree as it used to form a
relatively stable tree to develop against. There might be some effort in
the future to revive something mmotm-like while still basing it on linux-next.
MM scalability
Andi Kleen talked a bit about some of the scalability issues he has run into.
These are issues that have shown up in both micro and macro benchmarks. He
gave the example of the VMA links for very large processes that fork causing
chains that are thousands of VMAs long.
TLB flushing is another problem: each page being reclaimed results
in an IPI, and he feels these operations need
to be batched. Andrea Arcangeli pointed out that batching may be awkward because pages
are being reclaimed in LRU order, not MM order.
The kernel could just send an IPI once a bunch of pages has been gathered, or
build lists of pages for multiple MMs.
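A toy count of flush requests shows why the ordering matters. This is a deliberately simplified model, not kernel code: it assumes one IPI per page in the unbatched case and one IPI per distinct MM per batch in the batched case, whereas real flushes target the CPUs an MM has run on. The function names and the sample page list are made up for illustration.

```python
def ipis_unbatched(pages):
    """One flush IPI for every reclaimed page."""
    return len(pages)


def ipis_batched(pages, batch_size):
    """Gather `batch_size` pages in LRU order, then send one IPI
    per distinct MM that appears in the gathered batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = 0
    for start in range(0, len(pages), batch_size):
        batch = pages[start:start + batch_size]
        total += len({mm for mm, _addr in batch})
    return total


# Pages as (mm_id, address) tuples, listed in the order reclaim finds them.
interleaved = [(1, 0x1000), (2, 0x2000), (1, 0x3000),
               (3, 0x4000), (2, 0x5000), (1, 0x6000)]
single_mm = [(1, 0x1000 * i) for i in range(1, 7)]

print(ipis_unbatched(interleaved), ipis_batched(interleaved, 3))
print(ipis_unbatched(single_mm), ipis_batched(single_mm, 3))
```

In this model, batching helps a great deal when consecutive LRU pages belong to the same MM and much less when the LRU interleaves pages from many MMs, which is the awkwardness Arcangeli described.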
Another issue is whether clearing the access bit should result in a TLB
flush or not. There were disagreements in the room as to whether this
would be safe. It potentially affects reclaim but the length of time
a page lives on the inactive LRU list should be enough to ensure that the
process gets scheduled and flushes the TLB. Relying on that was considered
problematic, but alternative solutions such as deferring the flush and then sending a
global broadcast would interfere with other efforts to reduce IPI traffic.
Just avoiding the flush when clearing the access bit should be fine in the
vast majority of cases so chances are a patch will appear on the list
for discussion.
Kleen next raised an issue with drain_pages(), which has a severe
lock contention problem when releasing the pages back to the zone list, as well as causing a
large number of IPIs to be sent.
His final issue was that swap clustering in general seems to be broken and
that the expected clustering of virtual addresses to contiguous areas in swap
is not happening. This was something 2.4 was easily able to do because of
how it scanned page tables but it's less effective now. However, there have
been recent patches related to swap performance so that particular issue
needs to be re-evaluated.
The clear point that shone through is that there are new scalability issues
that are going to be higher priority as large machines become cheaper and
that the community should be proactive in dealing with them.
Cleancache
Pavel Emelyanov briefly introduced how Parallels systems potentially create hundreds of
containers on a system that are all effectively clones of a template. In this
case, it is preferred that the file cache be shared between containers to
limit the memory usage so as to maximize the number of containers that can be supported.
In the past, they used a unionfs approach but as the number of containers
increased so did the response time. This was not a linear increase and
could be severe on loaded machines. If reclaim kicked in, then performance
would collapse.
Their proposal is to extend cleancache to store the template files and share
them between containers. Functionally this is de-duplication and, superficially,
kernel samepage merging (KSM) would suit their requirements. However, there were a large number of reasons
why KSM was not suitable, primarily because it would not be of reliable benefit
but also because it would not work for file pages.
Dan Magenheimer pointed out that Xen de-duplicates data through use of a backend to
cleancache and that they should create a new backend instead of extending
cleancache which would be cleaner. It was suggested that when they submit
the patches that they be very clear why KSM is not suitable to avoid the
patches being dismissed by the casual observer.
What remains to be done for checkpoint/restore in user space?
Pavel Emelyanov talked about a project he started about six months ago to
address some of
the issues encountered by previous checkpoint implementations, mostly by
trying to move it into user space. This was not without issue because
there is still some assistance needed from the kernel. For example, kernel
assistance was required to figure out if a page is really shared or not.
A second issue mentioned was that given a UNIX socket, it cannot be discovered
from userspace what its peer is.
They currently have two major issues. The first is with "stable memory
management". Applications create big mappings but they do not access every single
page in them, and writing the full VMA to a disk file is a waste of time and
space. They need to discover which pages have been touched. There is a system
call for memory residency but it cannot identify that an address is valid
but swapped out, for example. For private mappings, it cannot distinguish
between a COW page and one that is based on what is on disk. kpagemap also
gives insufficient information because information such as virtual address
to page frame number (PFN) is missing.
The second major problem is that, if an inode is being watched with inotify,
extracting exact information about the watched inode is difficult.
James Bottomley suggested using a debugfs interface. A second proposal was to extend the /proc interface
in some manner. The audience in the room was insufficiently familiar with the
issue to give full feedback so the suggestion was just to extend
/proc in some
manner, post the patch and see what falls out as people analyze the problem
more closely. There was some surprise from Bottomley that people would suggest
extending /proc but for the purpose of discussion it would not
cause any harm.
Filesystem and Storage sessions
High IOPS and SCSI/Block
Roland Dreier began by noting that people writing block drivers have only
two choices: A full request-based driver, or using make_request(). The
former is far too heavyweight with a single very hot lock (the queue
lock) and a full-fledged elevator. The latter is way too low down in
the stack and bypasses many of the useful block functions, so Dreier
wanted a third way that takes the best of both. Jens Axboe proposed
using his multi-queue work which essentially makes the block queue per-CPU (and thus lockless) coupled with a lightweight elevator. Axboe has
been sitting on these patches for a while but promised to dust them off
and submit them. Dreier agreed this would probably be fine for his
purposes.
Shyam Iyer previewed Dell's vision for where NVMe (Non-Volatile Memory
express - basically PCIe cards with fast flash on them) was going.
Currently the interface is disk-like, with all the semantics and
overhead that implies, but ultimately Dell sees the device as having a
pure memory interface using apertures over the PCIe bus. Many people
in the room pointed out that while a memory-mapped interface may be
appealing from the speed point of view, it wouldn't work if the device
still had the error characteristics of a disk, because error handling in
the memory space is much less forgiving. Memory doesn't do any software
error recovery and every failure to deliver data instantly is a hard
failure resulting in a machine check, so the device would have to do all
recovery itself and only signal a failure to deliver data as a last
resort.
LBA hinting and new storage commands
Frederick Knight began by previewing the current T10 thoughts on handling
shingle drives: devices which vastly increase storage density by
overlapping disk tracks. They can increase storage radically but at the
expense of having to write a band at a time (a band is a set of
overlapping [shingled] tracks). The committee has three thoughts on
handling them:
- Transparent: just make it look like a normal disk
- Banding: Make the host manage the geometry (back to the old IDE
driver days) and expose new SCSI commands for handling bands
- Transparent with Hints: make it look like a normal disk but
develop new SCSI commands to hint both ways between device and
host what the data is and device characteristics are to try to
optimize data placement
The room quickly decided that only the first and last were viable options,
so the slides on the new banding commands were skipped.
In the possible hint-based architecture, there would be static and
dynamic hints. Static would be from device to host signalling which
indicated geometry preferences by LBA range, while dynamic would be from
the host to device indicating the data characteristics on a write which
would allow the device to do more intelligent placement.
It was also pointed out that shingled drives have very similar
characteristics to SSDs if you consider a band to be equivalent to an
erase block.
The problem with the dynamic hinting architecture is that the proposal
would repurpose the current group field in the WRITE command to contain
the hint, but there would only be six bits available. Unfortunately,
virtually every member of the SCSI committee has their own idea about what
should be hinted (all the way from sequential vs random in a 32-level
sliding scale, write and read frequency and latency, boot time
preload, ...) and this led to orders of magnitude more hints than fit
into six bits, so the hint would be an index into a mode page which
described what it means in detail. The room pointed out unanimously
that the massive complexity in the description of the hints meant that
we would never have any real hope of using them correctly since not even
device manufacturers would agree exactly what they wanted. Martin
Petersen proposed identifying a simple set of five or so hints and
forcing at least array vendors to adhere to them when the LUN was in
Linux mode.
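The bit budget can be made concrete with a short sketch. Only the six-bit field width and the 32-level sequential/random scale come from the discussion above; the `MODE_PAGE` contents and the helper names are hypothetical, and nothing here reflects an actual T10 layout.

```python
HINT_BITS = 6                # width of the repurposed group field
HINT_SLOTS = 1 << HINT_BITS  # number of distinct values the field can hold


def bits_needed(levels):
    """Bits required to encode `levels` distinct values."""
    return (levels - 1).bit_length()


# Hypothetical mode page: the field value is only an index into this table,
# and the table entry carries the detailed meaning of the hint.
MODE_PAGE = ["no hint", "sequential", "random", "write-once", "boot-time preload"]


def encode_hint(index):
    """Return the value to place in the six-bit field."""
    if not 0 <= index < HINT_SLOTS:
        raise ValueError("hint index does not fit in six bits")
    return index


def decode_hint(field_value, mode_page):
    """Look up what a field value means; None if the device defines no entry."""
    if 0 <= field_value < len(mode_page):
        return mode_page[field_value]
    return None


print("slots available:", HINT_SLOTS)
print("bits used by a 32-level scale:", bits_needed(32))
print("bits left over:", HINT_BITS - bits_needed(32))
print("field value 4 means:", decode_hint(encode_hint(4), MODE_PAGE))
```

A 32-level scale on its own consumes five of the six bits, which is why the proposal falls back to an index: the field can only select one of 64 descriptions, however elaborate each description is.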
Storage manager
Lukáš Czerner gave a description of the current state of his storage
manager command-line
tool, which, apart from having some difficulty creating XFS volumes, was
working nicely and should take a lot of the annoying administrative complexity
out of creating LVM volumes for mounted devices.
Trim, unmap, and write same
Martin Petersen began by lamenting that in the ATA TRIM command, T13 only left
two bytes for the trim range, meaning that, with one sector of ranges, we
could trim at most 32MB of disk in one operation. The other problem is
that the current
architecture of the block layer only allows us to trim contiguous
ranges. Since TRIM is unqueued and filesystems can only send single
ranges inline, trimming is currently a huge performance hit. Christoph
Hellwig had constructed a prototype with XFS which showed that if we
could do multi-range trims inline, performance could come back to within
1% of what it was without sending trim.
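The size limits follow from simple arithmetic, sketched below. The layout is an assumption of this sketch rather than something stated in the session: 512-byte logical sectors, and 8-byte range entries made of a 48-bit starting LBA plus the two-byte sector count mentioned above. The `pack_ranges()` helper is hypothetical; it splits large ranges into multiple entries and pads the payload to whole sectors.

```python
import struct

SECTOR_BYTES = 512   # assumed logical sector size
ENTRY_BYTES = 8      # assumed entry size: 48-bit LBA + 16-bit sector count
MAX_COUNT = 0xFFFF   # largest sector count that fits in two bytes
ENTRIES_PER_SECTOR = SECTOR_BYTES // ENTRY_BYTES

MAX_BYTES_PER_RANGE = MAX_COUNT * SECTOR_BYTES
MAX_BYTES_PER_PAYLOAD_SECTOR = ENTRIES_PER_SECTOR * MAX_BYTES_PER_RANGE


def pack_ranges(ranges):
    """Turn (lba, nsectors) ranges into a sector-padded payload of 8-byte entries."""
    entries = []
    for lba, count in ranges:
        while count > 0:
            if lba >= 1 << 48:
                raise ValueError("LBA does not fit in 48 bits")
            chunk = min(count, MAX_COUNT)
            # Little-endian 64-bit value: count in the top 16 bits, LBA below.
            entries.append(struct.pack("<Q", (chunk << 48) | lba))
            lba += chunk
            count -= chunk
    payload = b"".join(entries)
    padding = -len(payload) % SECTOR_BYTES
    return payload + b"\x00" * padding


print("bytes per range entry:", MAX_BYTES_PER_RANGE)
print("entries per payload sector:", ENTRIES_PER_SECTOR)
print("bytes per payload sector:", MAX_BYTES_PER_PAYLOAD_SECTOR)

payload = pack_ranges([(0, 70000)])
print("payload length:", len(payload))
```

Under these assumptions a single range entry covers just under 32MB, which matches the figure quoted above, while a full sector of 64 non-contiguous entries would cover roughly 2GB. Filesystems that can only send one contiguous range at a time never get to fill those entries, which is where multi-range trims would help.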
Discussion then focused on what had to happen to the block layer to
send multi-range commands (it was pointed out that it isn't just trim:
scatter/gather SCSI commands with multiple ranges are also on the
horizon). Jens Axboe initially favored the idea of allowing a single BIO to
carry multiple ranges, whereas Petersen had a prototype using
linked BIOs for the range. After discussion it was decided that linked
BIOs was a better way forward for the initial prototype.
SR-IOV and FC sysfs
SR-IOV (Single Root I/O virtualization) is designed to take the
hypervisor out of storage virtualization by allowing a guest to have a
physical presence on the storage fabric. The specific problem is that
each guest needs a world wide name (WWN) as its unique address on the
fabric. It was agreed that we could use some extended host interface
for setting WWNs but that we shouldn't expose this to the guest. The
other thought was around naming of virtual functions when they attach to
hosts. In the network world, virtual function (vf) network devices appear as
eth<phys>-<virt> so should we do the same for SCSI? Without any good
justification for this naming scheme, the answer was a categorical
"hell no."
The final problem discussed was that when the vf is
created in the host, the driver automatically binds to it, so it has to
be unbound before passing the virtual function to the guest. Hannes Reinecke
pointed out that binding could simply be prevented using the standard sysfs
interfaces. James Bottomley would prefer that the driver simply refuse to
bind to vf devices in the host.
Robert Love noted that the first iteration of Fibre Channel attributes was out for
review. All feedback from Greg Kroah-Hartman has been incorporated so he
asked for others to look at it (Bottomley said he'd get round to it now that
Kroah-Hartman is happy).
Unit attention handling
How should we report "unit attentions" (UAs - basically SCSI errors reported by
storage devices) to userspace? Three choices
were proposed:
- netlink - which works but is only one way
- blktrace using debugfs - needs a tool to extract data
- using structured logging - feasible only in the current merge
window since the structured logging patch is now in 3.4-rc1
There was a lot of discussion, but it was agreed that the in-kernel
handling should be done by a notifier chain to which drivers and other
entities could subscribe, and the push to user space would happen at the
other end, probably via either netlink or structured logging.
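The agreed shape, a chain of subscribers with the push to user space as the last link, can be sketched in user-space terms. This is an analogy only: the `NotifierChain` class, the `NOTIFY_*` values as used here, the event names, and the `outbox` list standing in for netlink or structured logging are all assumptions of the sketch, not the kernel's notifier API.

```python
NOTIFY_DONE = 0  # subscriber is finished, keep walking the chain
NOTIFY_STOP = 1  # subscriber consumed the event, stop walking


class NotifierChain:
    """Minimal priority-ordered callback chain."""

    def __init__(self):
        self._subscribers = []  # (priority, registration order, callback)
        self._next_seq = 0

    def register(self, callback, priority=0):
        self._subscribers.append((priority, self._next_seq, callback))
        self._next_seq += 1
        # Higher priority runs first; ties run in registration order.
        self._subscribers.sort(key=lambda s: (-s[0], s[1]))

    def unregister(self, callback):
        self._subscribers = [s for s in self._subscribers if s[2] is not callback]

    def call(self, event, data):
        """Deliver an event; returns how many subscribers were invoked."""
        invoked = 0
        for _priority, _seq, callback in list(self._subscribers):
            invoked += 1
            if callback(event, data) == NOTIFY_STOP:
                break
        return invoked


outbox = []  # placeholder for the user-space transport


def driver_handler(event, data):
    # A hypothetical driver handles one event itself and ignores the rest.
    if event == "MODE_PARAMETERS_CHANGED":
        return NOTIFY_STOP
    return NOTIFY_DONE


def userspace_forwarder(event, data):
    # Last link in the chain: push whatever nobody consumed to user space.
    outbox.append((event, data))
    return NOTIFY_DONE


chain = NotifierChain()
chain.register(userspace_forwarder, priority=-10)
chain.register(driver_handler, priority=10)

chain.call("CAPACITY_CHANGED", {"device": "sda"})
chain.call("MODE_PARAMETERS_CHANGED", {"device": "sda"})
print(outbox)
```

Keeping the transport as just another subscriber means the choice between netlink and structured logging does not affect the drivers that subscribe to the chain.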
[ I would like to thank LWN subscribers for funding that allowed me to
attend the
summit. Big thanks are also due to Mel Gorman and James Bottomley for their
major contributions to the summit coverage. ]
Group photo
Thanks to Alasdair Kergon for making his photograph of the 2012 Linux
Storage, Filesystem, and Memory Management summit available.
Page editor: Jonathan Corbet