Brief items
The current development kernel is 3.1-rc2,
released on August 14. "
Hey, nice
calm first week after the merge window. Good job. Or maybe people are just
being lazy, and everybody is on vacation. Whatever. Don't tell me. I'm
reasonably happy, I want to stay that way." Details can be found in
the
full changelog. The code name for this kernel, incidentally, has been
changed to "wet seal."
Stable updates: the 2.6.32.45, 2.6.33.18, and 3.0.2 stable updates were released on
August 15. They contain the usual pile of fixes. All three updates
also include a change to how TCP sequence numbers
are generated; a (relatively) insecure 24-bit MD4 algorithm has been
replaced by 32-bit MD5. 3.0.3 was released
on August 17 with another set of useful fixes.
The truth to realize is that we have grown really good at
decimating our user-base every year or so.
--
Ingo
Molnar
As far as long-term kernels goes, from the Android perspective we
strongly prefer to snap up to the most recent released kernel on
every platform/device release. I prefer to be as up to date on
bugfixes and features from mainline as possible and minimize the
deltas on our stack 'o patches as much as possible.
--
Brian Swetland
Greg Kroah-Hartman has posted
a
proposal for some changes to how the stable and (especially) longterm
kernels are maintained. The changes are being driven by users other than
the enterprise distributors. "
Now that 2.6.32 is over a year and a
half, and the enterprise distros are off doing their thing with their
multi-year upgrade cycles, there's no real need from the distros for a new
longterm kernel release. But it turns out that the distros are not the
only user of the kernel, other groups and companies have been approaching
me over the past year, asking how they could pick the next longterm kernel,
or what the process is in determining this." The core idea is to
pick a new longterm kernel once a year; that kernel would then be
maintained for two years thereafter. There is
some
discussion on Google+; it should move to the mailing list around
August 15.
Kernel development news
By Jonathan Corbet
August 15, 2011
CPUs may not have gotten hugely faster in recent years, but they have
gained in other ways; a typical system-on-chip (SoC) device now has a
number of peripherals which would qualify as reasonably powerful CPUs in
their own right. More powerful devices with direct access to the memory
bus can take on more demanding tasks. For example, an image frame captured
from a camera device can often be passed directly to the graphics processor
for display without all of the user-space processing that was once
necessary. Increasingly, the CPU's job looks like that of a shop foreman
whose main concern is keeping all of the other processors busy.
The foreman's job will be easier if the various devices under its control
can communicate easily with each other. One useful addition in this area
might be the buffer sharing patch set
recently posted by Marek Szyprowski. The idea here is to make it possible
for multiple kernel subsystems to share buffers under the control of user
space. With this type of feature, applications could wire kernel
subsystems together in problem-specific ways then get out of the way,
letting the devices involved process the data as it passes through.
There are (at least) a couple of challenges which must be dealt with to
make this kind of functionality safe to export to applications. One is
that the application should not be able to "create" buffers at arbitrary
kernel addresses. Indeed, kernel-space addresses should not be visible to
user space at all, so the kernel must provide some other way for an
application to refer to a specific buffer. The other is that a shared
buffer must not go away until all users have let go of it. A buffer may
be created by a specific device driver, but it must persist, even if the
device is closed, until nobody else expects it to be there.
The mechanism added in this patch set (this part in particular is credited
to Tomasz Stanislawski) is relatively simple - though it will probably get
more complex in the future. Kernel code wanting to make a buffer available
to other parts of the kernel via user space starts by filling in one of
these structures:
    struct shrbuf {
        void (*get)(struct shrbuf *);
        void (*put)(struct shrbuf *);
        unsigned long dma_addr;
        unsigned long size;
    };
One could immediately raise a number of complaints about this structure:
the address should be a dma_addr_t, there's no reason not to put
the kernel virtual address there, only physically-contiguous buffers are
allowed, etc. It also seems like there could be value in the ability to
annotate the state of the buffer (filled or empty, for example) and
possibly signal another thread when that state changes.
But it's worth remembering that this is an explicitly
proof-of-concept patch posting and a lot of things will change. In
particular, the eventual plan is to pass a scatterlist around instead of a
single physical address.
The get() and put() functions are important: they manage
reference counts to the buffer, which must continue to exist until that
count goes to zero. Any subsystem depending on a buffer's continued
existence should hold a reference to that buffer. The put()
function should release the buffer when the last reference is dropped.
Once this structure exists, it can be passed to:
int shrbuf_export(struct shrbuf *sb);
The return value (if all goes well) will be an integer file descriptor
which can be handed to user space. This file descriptor embodies a
reference to the buffer, which now will not be released before the file
descriptor is closed. Other than closing it, there is very little that the
application can do with the descriptor other than give it to another kernel
subsystem; attempts to read from or write to it will fail, for example.
If a kernel subsystem receives a file descriptor which is purported to
represent a kernel buffer, it can pass that descriptor to:
struct shrbuf *shrbuf_import(int fd);
The return value will be the same shrbuf structure (or an
ERR_PTR() error value for a file descriptor of the wrong type). A
reference is taken on the structure before returning it, so the recipient
should call put() at some future time to release it.
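The reference-counting contract behind get() and put() can be sketched in ordinary user-space C. This is only an illustration under stated assumptions: struct shrbuf is copied from the patch posting, but struct demo_buf, its plain integer counter, the released flag, and every demo_* function are hypothetical names invented here. shrbuf_export() and shrbuf_import() are not modeled at all, and a real driver would need an atomic counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* The structure as it appears in the posted patch set. */
struct shrbuf {
    void (*get)(struct shrbuf *);
    void (*put)(struct shrbuf *);
    unsigned long dma_addr;
    unsigned long size;
};

/* Hypothetical driver-private buffer embedding a shrbuf. */
struct demo_buf {
    struct shrbuf sb;
    int refcount;   /* assumption: a real driver would use an atomic type */
    int *released;  /* set to 1 when the buffer is freed, for demonstration */
};

/* User-space stand-in for the kernel's container_of(). */
static struct demo_buf *to_demo_buf(struct shrbuf *sb)
{
    return (struct demo_buf *)((char *)sb - offsetof(struct demo_buf, sb));
}

/* Take one more reference on the buffer. */
static void demo_buf_get(struct shrbuf *sb)
{
    to_demo_buf(sb)->refcount++;
}

/* Drop a reference; the last put releases the buffer. */
static void demo_buf_put(struct shrbuf *sb)
{
    struct demo_buf *buf = to_demo_buf(sb);

    if (--buf->refcount == 0) {
        *buf->released = 1;
        free(buf);
    }
}

/* Create a buffer holding one reference owned by the creator. */
static struct shrbuf *demo_buf_create(unsigned long size, int *released)
{
    struct demo_buf *buf = calloc(1, sizeof(*buf));

    if (buf == NULL)
        return NULL;
    buf->sb.get = demo_buf_get;
    buf->sb.put = demo_buf_put;
    buf->sb.size = size;
    buf->refcount = 1;
    buf->released = released;
    *released = 0;
    return &buf->sb;
}
```

In this sketch, a second user calls sb->get(sb) before the creator calls sb->put(sb), so the buffer survives until the second user's own put().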
The patch set includes a new Video4Linux2 ioctl() command
(VIDIOC_EXPBUF) enabling the exporting of buffers as file
descriptors; a couple of capture drivers have been augmented to support
this functionality. No examples of the other side (importing a buffer)
have been posted yet.
There has been relatively little commentary on the patch set so far,
possibly because it was posted to a couple of relatively obscure mailing
lists. It has the look of functionality that could be useful beyond one or
two kernel subsystems, though. It would probably make sense for the next
iteration, which presumably will have more of the anticipated functionality
built into it, to be distributed more widely for review.
August 12, 2011
This article was contributed by Dan J. Williams
It is an innocent idea. After all, "
all
problems in computer science can be solved by another level of
indirection." However, when the problem is developing a device driver
for acceptance into the current mainline Linux kernel, OS abstraction
(using a level of indirection to hide a kernel's internal API) is taking
things a level too far. Seasoned Linux kernel developers will have already
cringed at the premise of this article. But they are not my intended
readers; instead, this text is aimed at those that find themselves in a
similar position as the original authors of the isci driver: a team new to
the process of getting a large driver accepted into the mainline, and
tasked with enabling several environments at once. The isci driver fell
into the OS abstraction trap. These are the lessons learned and attitudes
your author developed about this trap while leading the effort to rework
the driver for upstream acceptance.
As mentioned above, one would be hard pressed to find an experienced Linux
kernel developer willing to accept OS abstraction as a general
approach to driver design. So, a simplistic rule of thumb for those
wanting to avoid
the pain of reworking large amounts of code would be to not go it alone.
Arrange for a developer with at least 100 upstream commits to be
permanently assigned to the development team, and seek the advice of a
developer with at least 500 commits early in the design phase. After
the fact, it was one such developer, Arjan van de Ven, who set the
expectation for the magnitude of rework effort. When your author was
toying with ideas of Coccinelle and other automated ways to unwind the
driver's abstractions, Arjan presciently noted (paraphrasing): "...it
isn't about
the specific abstractions, it's about the wider assumptions that lead
to the abstractions."
The fundamental problem with OS abstraction techniques is that they
actively defeat the purpose of having an open driver in the first
place. As a community we want drivers upstream so that we can
refactor common code into generic infrastructure and drive uniformity
across drivers of the same class. OS abstraction, in contrast,
implies the development of driver-specific translations of common
kernel constructs, "lowest common denominator" programming to avoid
constructs that do not have a clear analogue in all environments,
and overlooking the subtleties of interfaces that appear similar but
have important semantic differences.
So what were the larger problematic assumptions that led to
the rework effort? It comes down to the following list that, given
the recurrence of OS-abstracted drivers, may be expected among other
developers new to the Linux mainline acceptance process.
- The programming interface of the Linux kernel is a static
contract to third party developers. The documentation is up to date,
and the recourse for upper layer bugs is to add workarounds to the
driver.
- The "community" is external to the development team. Many
conversations about Linux kernel requirements reference the
"community" as an anonymous body of developers and norms external to
the driver's development process.
- An OS abstraction layer can cleanly handle the differences between
operating systems.
In the case of the isci driver these assumptions resulted in a nearly
half-year effort to rework the driver as measured from the first
public release until the driver was ultimately deemed acceptable.
Who fixes the platform?
The kernel community runs on trust and reciprocation. To get things done in
a timely manner one needs to build up a cache of trust capital. One
quick way to build this capital is to fix bugs or otherwise lower the
overall maintenance burden of the kernel. Fixing a bug in a core
library, or spotting some duplicated patterns that can be unified are
golden opportunities to demonstrate proficiency and build trust.
The
attitude of aiming to become a co-implementer of common kernel
infrastructure is an inherently foreign idea for a developer with a
proprietary environment background. The proprietary OS vendor
provides an interface sandbox for the driver to play in that is
assumed to be rigid, documented and supported. A similar assumption was
carried to the isci driver; for example, it initially contained workarounds for
bugs (real and perceived) in libsas and the other upper layers. The
assumption behind those workarounds seems to be that the "vendor's"
(maintainer's) interface is broken and the vendor is on the hook for a
fix. This is, of course, the well-known "platform problem."
In the particular case of libsas, SCSI maintainer James Bottomley noted:
"there's
no overall maintainer, it's jointly maintained by its users." Internal
kernel
interfaces evolve to meet the needs of their users, but the users that
engender the most trust tend to have an easier time getting their
needs met. In this case, root-causing the bugs or allowing time to
clarify the documentation for libsas ahead of the driver's
introduction might have streamlined the acceptance process; it
certainly would have saved the effort of developing local workarounds
to global problems.
We the community
Similar to how internal kernel interfaces evolve at a different pace
than their documentation, so too do the expectations of the community
for mainline-acceptable code versus the documented submission
requirements. However, in contrast to the interface question where
the current code can be used to clarify interface details, the same
cannot be done to determine the current set of requirements for mainline
acceptance. The reality is that code exists in the mainline tree that
would not be acceptable if it were being re-submitted for inclusion
today.
A driver with an OS-abstracted core can be found in the tree,
but over time the maintenance burden incurred by that architecture has
precluded future drivers from taking the same approach. As a result,
attempting to understand "the community" from an external position is a
sure-fire way to underestimate the current set of requirements for
acceptable code. The only way to acquire this knowledge is ongoing
participation. Read other drivers, read the code reviews from other
developers, and try to answer the question "would someone external to
the development team have a chance at maintaining the driver without
assistance?".
One clear "no" answer to this question from the isci driver
experience came from the simple usage of C99 structure initializers. The
common core was targeted for reuse in environments where there was no
compiler support for this syntax. However, the state machine
implementation in the driver had dozens of tables filled with, in some cases,
hundreds of function pointers. Modifying such tables by counting commas
and trusting comments is error prone. The conversion to C99-style struct
initialization made the code safer to edit (compiler verifiable), more
readable and, consequently, allowed many of those tables to be removed.
These initializations were a
simple example of lowest-common-denominator programming and a nuance that
an "external" Linux kernel developer need not care to understand when
modifying the driver, especially when the next round of cleanups is
enabled by the change.
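The difference between the two initialization styles can be shown with a small, made-up state table. Nothing here comes from the isci driver: the states, the handler names, and the table layout are all hypothetical, chosen only to contrast positional initialization with C99 designated initializers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states and handlers; not taken from the isci driver. */
enum demo_state { DEMO_IDLE, DEMO_STARTING, DEMO_READY, DEMO_STATE_COUNT };

struct demo_state_handlers {
    int (*start)(void);
    int (*stop)(void);
    int (*reset)(void);
};

static int demo_start(void) { return 1; }
static int demo_stop(void)  { return 2; }
static int demo_reset(void) { return 3; }

/*
 * Positional style: correctness depends on counting commas and on the
 * comments staying in sync with the enum and the structure layout.
 */
static const struct demo_state_handlers positional_table[DEMO_STATE_COUNT] = {
    /* DEMO_IDLE     */ { demo_start, NULL,      demo_reset },
    /* DEMO_STARTING */ { NULL,       demo_stop, demo_reset },
    /* DEMO_READY    */ { NULL,       demo_stop, demo_reset },
};

/*
 * C99 designated initializers: each entry names its state and its
 * fields, so the compiler checks the names and unlisted members
 * default to NULL.
 */
static const struct demo_state_handlers designated_table[DEMO_STATE_COUNT] = {
    [DEMO_IDLE]     = { .start = demo_start, .reset = demo_reset },
    [DEMO_STARTING] = { .stop = demo_stop,   .reset = demo_reset },
    [DEMO_READY]    = { .stop = demo_stop,   .reset = demo_reset },
};
```

Both tables hold the same pointers, but reordering the enum or adding a member to the structure silently breaks only the positional one.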
Can the OS be hidden?
OS abstraction defenders may look at that last example and propose
automated ways to convert the code for different environments. The
dangerous assumptions of automated abstraction replacement engines are
the same as listed above. The Linux kernel internal interface is not
static, so the abstraction needs to keep up with the pace of API
change, but that effort is better spent participating in the Linux
interface evolution.
More dangerous is the assumption that similar
looking interfaces from different environments can be properly
captured by a unified abstraction. The prominent example from the isci driver
was memory mapping: converting between virtual and physical addresses.
As far as your author knows, Linux is one of the few environments that
utilizes IOMMUs (I/O memory management units) to protect standard
streaming DMA mappings requested by device drivers (via the DMA API).
The isci abstraction had a virtual-to-physical abstraction that mapped
to virt_to_phys() (broken but straightforward to fix), but it also had
a physical-to-virtual abstraction mapped to phys_to_virt() which was
not straightforward to fix. The assumption that physical-to-virtual
translation was a viable mechanism led to an implementation that not
only missed the DMA API requirements, but also the need to use kmap()
when accessing pages that may be located in high memory. The lesson
is that convenient interfaces in other environments can lead to the
diversionary search for equivalent functionality in Linux and magnify
the eventual rework effort.
Conclusion
The initial patch added 60,896 lines to the kernel over 159 files.
Once the rework was done, the number of new files was cut down to 34 and
overall diffstat for the effort was:
192 files changed, 23575 insertions(+), 60895 deletions(-)
There is no question that adherence to current Linux
coding principles resulted in a simpler implementation of the isci
driver. The community's mainline acceptance criteria are designed to
maximize the portability and effectiveness of a kernel developer's
skills across drivers. Any locally-developed convenience mechanisms
that diminish that global portability will almost always result in
requests for changes and prevent mainline acceptance. In the end
participation and getting one's hands dirty in the evolution of the
native interfaces is the requirement for mainline acceptance, and it
is very difficult to achieve that through a level of indirection.
I want to thank Christoph Hellwig and James Bottomley for their review
and recognize Dave Jiang, Jeff Skirvin, Ed Nadolski, Jacek Danecki,
Maciej Trela, Maciej Patelczyk and the rest of the isci development
team that accomplished this rework.
August 12, 2011
This article was contributed by Dan Magenheimer
The Linux kernel carefully enumerates and tracks all of its memory
and, for the most part, it can individually access every byte of
it. The purpose of transcendent memory ("tmem") is to provide the
kernel with the capability to utilize memory that it cannot enumerate,
sometimes cannot track, and cannot directly address. This may sound
counterintuitive, or even silly, to core kernel developers, but as we
will see it is actually quite useful; indeed it adds a level of
flexibility to the kernel that allows some rather complex functionalities
to be implemented and layered on a handful of tiny changes
to the core kernel. The end goal is that memory can be more
efficiently utilized by one kernel and/or load-balanced between
multiple kernels (in a virtualized OR a non-virtualized environment),
resulting in higher performance and/or lower RAM costs in a system or
across a data center. This article will provide an overview of
transcendent memory and how it is used in the Linux kernel.
Exactly how the kernel talks to tmem will be described in Part 2, but there
are certain classes of data maintained by the kernel that are suitable.
Two of these are known to kernel developers as "clean pagecache pages"
and "swap pages". The patch that deals with the former is known
as "cleancache"; it was merged into the 3.0 kernel. The patch that
deals with swap pages is known as frontswap and is still being reviewed
on the Linux kernel mailing list, with a target of linux-3.2.
There may well be other classes of data that will also work well with
tmem. Collectively these sources of suitable data for tmem can be
referred to as "frontends" for tmem and we will detail them in Part 3.
There are multiple implementations of tmem which store data using
different methods. We can refer to these data stores as "backends"
for tmem, and all frontends can be used with all backends (possibly
using a shim to connect them). The initial tmem implementation, known as
"Xen tmem," allows Xen hypervisor memory to be
used to store data for one or more tmem-enabled guest kernels.
Xen tmem has been implemented in Xen for over two years and has
been shipping in Xen since Xen 4.0; the in-kernel shim for Xen tmem
was merged into 3.0 (for cleancache only, updated to also support
frontswap in 3.1). Another Xen driver component, the Xen
self-ballooning driver, which helps encourage a guest kernel to use
tmem efficiently, was merged for 3.1 and also includes
the "frontswap-selfshrinker". See Appendix A for more information
about these.
The second tmem implementation does not involve virtualization at
all and is known as "zcache"; it is an in-kernel driver that stores
compressed pages. Zcache essentially "doubles RAM" for any class
of kernel memory that the kernel can provide via a tmem frontend
(e.g. cleancache, frontswap), thus reducing memory requirements in,
for example, embedded kernels. Zcache was merged as a staging driver
in 2.6.39 (though dependent on the cleancache and frontswap
patchsets which were not yet upstream).
A third tmem implementation is underway; it is known as "RAMster."
In RAMster, a "closely-connected" set of kernels effectively pool
their RAM so that a RAM-hungry workload on one machine can
temporarily and transparently utilize RAM on another machine
which is presumably idle or running a non RAM-hungry workload.
RAMster has also been dubbed "peer-to-peer transcendent memory"
and is intended for non-virtualized kernels but is being tested
also with virtualized kernels. While RAMster is best suited in
an environment where multiple systems are connected by a high-speed
"exofabric", in which one system can directly address another
system's memory, the initial prototype is built on a standard
ethernet connection.
Other tmem implementations have been proposed: For example,
there has been some argument about how useful tmem might be
for KVM and/or for containers. With recent changes to zcache
merged in 3.1, it may be very easy to simply implement the necessary
shims and try these out; nobody has yet stepped up to do it.
As another example, it has been observed that the tmem protocols
may be ideal for certain kinds of RAM-like technologies such as
"phase-change" memory (PRAM); most of these technologies have
certain idiosyncrasies, such as limited write-cycles, that can
be managed effectively through a software interface such as tmem.
Discussions have begun with certain vendors of such RAM-like
technologies. Yet another example is a variation of RAMster:
a single machine in a cluster acts as a "memory server" and memory
is added solely to that machine; the memory may be RAM, may be
RAM-like, or perhaps may be a fast SSD.
The existing tmem implementations will be described in Part 4
along with some speculation about future implementations.
2: How the kernel talks to transcendent memory
The kernel "talks" to tmem through a carefully defined interface,
which was crafted to provide maximum flexibility for the tmem
implementation while incurring low impact on the core kernel.
The tmem interface may appear odd but there are good reasons for
its peculiarities. Note that in some cases the
tmem interface is completely internal to the kernel and is
thus an "API"; in other cases it defines the boundary between
two independent software components (e.g. Xen and a guest Linux
kernel) so is properly called an "ABI".
(Casual readers may wish to skip this section.)
Tmem should be thought of as another "entity" that "owns"
some memory. The entity might be an in-kernel driver, another
kernel, or a hypervisor/host. As previously noted, tmem cannot be
enumerated by the kernel; the size of tmem is unknowable to the
kernel, may change dynamically, and may at any time be "full".
As a result, the kernel must "ask" tmem on every individual page
to accept data or to retrieve data.
Tmem is not byte-addressable -- only large chunks of data (exactly
or approximately a page in size) are copied between kernel memory and
tmem. Since the kernel cannot "see" tmem, it is the tmem side of the
API/ABI that copies the data from/to kernel memory. Tmem organizes
related chunks of data in a pool; within a pool, the kernel chooses
a unique "handle" to represent the equivalent of an address for
the chunk of data. When the kernel requests the creation of
a pool, it specifies certain attributes to be described below.
If pool creation is successful, tmem provides a "pool id".
Handles are unique within pools, not across pools, and consist
of a 192-bit "object id" and a 32-bit "index." The rough equivalent
of an object is a "file" and the index is the rough equivalent of
a page offset into the file.
The two basic operations of tmem are "put" and "get". If the
kernel wishes to save a chunk of data in tmem, it uses the "put"
operation, providing a pool id, a handle, and the location of the
data; if the put returns success, tmem has copied the data. If the kernel
wishes to retrieve data, it uses the "get" operation and provides the
pool id, the handle, and a location for tmem to place the data; if
the get succeeds, on return, the data will be present at the specified
location. Note that, unlike I/O, the copying performed by tmem is fully
synchronous. As a result, arbitrary locks can (and, to avoid races,
often should!) be held by the caller.
There are two basic pool types: ephemeral and persistent.
Pages successfully put to an ephemeral pool may or may not be
present later when the kernel uses a subsequent get with a matching
handle. Pages successfully put to a persistent pool are guaranteed
to be present for a subsequent get. (Additionally, a pool may
be "shared" or "private".)
The kernel is responsible for maintaining coherency between tmem
and the kernel's own data, and tmem has two types of "flush" operations
to assist with this: To disassociate a handle from any tmem data, the
kernel uses a "flush" operation. To disassociate all chunks of data in
an object, the kernel uses a "flush object" operation. After a flush,
subsequent gets will fail. A get on an (unshared) ephemeral pool is
destructive, i.e. implies a flush; otherwise, the get is non-destructive
and an explicit flush is required. (There are two additional coherency
guarantees that are described in Appendix B.)
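The put, get, and flush semantics described above can be made concrete with a toy, user-space model of a single pool. This is a sketch under assumptions, not the real tmem code: every demo_* name is hypothetical, the fixed four-slot array stands in for whatever a real backend keeps, and overwriting on a repeated put is a choice made here for simplicity. Only the handle layout (a 192-bit object id plus a 32-bit index) and the destructive get on a private ephemeral pool come from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096
#define DEMO_SLOTS     4    /* tiny capacity so that "full" is easy to hit */

/* Handle layout from the text: 192-bit object id, 32-bit index. */
struct demo_handle {
    uint64_t oid[3];
    uint32_t index;
};

struct demo_slot {
    int used;
    struct demo_handle handle;
    unsigned char data[DEMO_PAGE_SIZE];
};

/* One pool; a hypothetical stand-in for a backend's storage. */
struct demo_pool {
    int ephemeral;          /* nonzero: private ephemeral pool, get is destructive */
    struct demo_slot slots[DEMO_SLOTS];
};

static struct demo_slot *demo_lookup(struct demo_pool *pool,
                                     const struct demo_handle *h)
{
    for (int i = 0; i < DEMO_SLOTS; i++) {
        struct demo_slot *s = &pool->slots[i];

        if (s->used && s->handle.index == h->index &&
            memcmp(s->handle.oid, h->oid, sizeof(h->oid)) == 0)
            return s;
    }
    return NULL;
}

/* put: tmem copies the page in, or refuses. Returns 0 or -1. */
static int demo_tmem_put(struct demo_pool *pool, const struct demo_handle *h,
                         const void *page)
{
    struct demo_slot *s = demo_lookup(pool, h);

    for (int i = 0; s == NULL && i < DEMO_SLOTS; i++)
        if (!pool->slots[i].used)
            s = &pool->slots[i];
    if (s == NULL)
        return -1;          /* "full": the caller must cope with refusal */
    s->used = 1;
    s->handle = *h;
    memcpy(s->data, page, DEMO_PAGE_SIZE);
    return 0;
}

/* get: tmem copies the page out. Returns 0, or -1 if nothing is stored. */
static int demo_tmem_get(struct demo_pool *pool, const struct demo_handle *h,
                         void *page)
{
    struct demo_slot *s = demo_lookup(pool, h);

    if (s == NULL)
        return -1;
    memcpy(page, s->data, DEMO_PAGE_SIZE);
    if (pool->ephemeral)
        s->used = 0;        /* a destructive get implies a flush */
    return 0;
}

/* flush: disassociate the handle from any stored data. */
static void demo_tmem_flush(struct demo_pool *pool, const struct demo_handle *h)
{
    struct demo_slot *s = demo_lookup(pool, h);

    if (s != NULL)
        s->used = 0;
}
```

Note that the copy happens inside the call in both directions, which is the synchronous behavior the text describes; the caller never touches the pool's storage directly.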
3: Transcendent memory frontends: frontswap and cleancache
While other frontends are possible, the two existing tmem frontends,
frontswap and cleancache, cover two of the primary types of kernel
memory that are sensitive to memory pressure. These two frontends
are complementary: cleancache handles (clean) mapped pages that
would otherwise be reclaimed by the kernel; frontswap handles
(dirty) anonymous pages that would otherwise be swapped out by
the kernel. When a successful cleancache_get happens, a disk
read has been avoided. When a successful frontswap_put (or get)
happens, a swap device write (or read) has been avoided. Together,
assuming tmem is significantly faster than disk paging/swapping,
substantial performance gains may be obtained in a memory-constrained
environment.
Frontswap
The total amount of "virtual memory" in a Linux system is the
sum of the physical RAM plus the sum of all configured swap devices.
When the "working set" of a workload exceeds the size of physical
RAM, swapping occurs -- swap devices are essentially used to emulate
physical RAM. But, generally, a swap device is several orders of
magnitude slower than RAM so swapping has become synonymous with
horrible performance. As a result, wise system administrators increase
physical RAM and/or redistribute workloads to ensure that swapping
is avoided. But what if swapping isn't always slow?
Frontswap allows the Linux swap subsystem to use transcendent memory,
when available, in place of sending data to and from a swap device.
Frontswap is not in itself a swap device and, thus, requires no
swap-device-like configuration. It does not change the total virtual
memory in the system; it just results in faster swapping... some/most/nearly
all of the time, but not necessarily always. Remember that the
quantity of transcendent memory is unknowable and dynamic. With
frontswap, whenever a page needs to be swapped out the swap subsystem asks
tmem if it is willing to take the page of data. If tmem rejects it,
the swap subsystem writes the page, as normal, to the swap device.
If tmem accepts it, the swap subsystem can request the page of
data back at any time and it is guaranteed to be retrievable from
tmem. And, later, if the swap subsystem is certain the data is no
longer valid (e.g. if the owning process has exited), it can flush the
page of data from tmem.
Note that tmem can reject any or every frontswap "put". Why would it?
One example is if tmem is a resource shared between multiple kernels
(aka tmem "clients"), as is the case for Xen tmem or for RAMster;
another kernel may have already claimed the space, or perhaps this
kernel has exceeded some tmem-managed quota. Another example is
if tmem is compressing data as it does in zcache and it determines
that the compressed page of data is too large; in this case, tmem might
reject any page that isn't sufficiently compressible OR perhaps even
if the mean compression ratio is growing unacceptably.
The frontswap patchset is non-invasive and does not impact the
behavior of the swap subsystem at all when frontswap is disabled.
Indeed, a key kernel maintainer has observed that frontswap appears
to be "bolted on" to the swap subsystem. That is a good thing as
the existing swap subsystem code is very stable, infrequently used
(because swapping is so slow), yet critical to system correctness;
dramatic change to the swap subsystem is probably unwise and frontswap
only touches the fringes of it.
A few implementation notes: Frontswap requires one bit
of metadata per page of enabled swap. (The Linux swap subsystem
until recently required 16 bits, and now requires eight bits of
metadata per page so frontswap increases this by 12.5%.) This
bit-per-page records whether the page is in tmem or is on the physical
swap device. Since, at any time, some pages may be in frontswap and
some on the physical device, the swap subsystem "swapoff" code also
requires some modification. And because in-use tmem is more valuable
than swap device space, some additional modifications are provided
by frontswap so that a "partial swapoff" can be performed. And,
of course, hooks are at the read-page and write-page routines
to divert data into tmem and a hook is added to flush the data
when it is no longer needed. All told, the patch parts that affect
core kernel components add up to less than 100 lines.
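The bit-per-page bookkeeping can be sketched as follows. This is a user-space illustration only: the demo_* names, the 64-page swap area, and the tmem_room counter (a stand-in for tmem's unknowable willingness to accept pages) are assumptions made for the example and do not reflect the actual frontswap patch. The 12.5% figure above follows from adding one bit to the existing eight bits of per-page metadata (1/8 = 0.125).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_SWAP_PAGES 64
#define DEMO_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_MAP_WORDS  ((DEMO_SWAP_PAGES + DEMO_WORD_BITS - 1) / DEMO_WORD_BITS)

struct demo_swap_area {
    unsigned long in_tmem[DEMO_MAP_WORDS]; /* one bit per page of swap */
    unsigned int tmem_room;      /* hypothetical: pages tmem will still accept */
    unsigned int device_writes;  /* pages that went to the swap device */
};

/* Is the page at this swap offset currently held by tmem? */
static int demo_test_bit(const struct demo_swap_area *a, size_t off)
{
    return (int)((a->in_tmem[off / DEMO_WORD_BITS] >>
                  (off % DEMO_WORD_BITS)) & 1UL);
}

/* Flush: the data is no longer needed, so tmem can drop its copy. */
static void demo_swap_flush(struct demo_swap_area *a, size_t off)
{
    if (demo_test_bit(a, off)) {
        a->in_tmem[off / DEMO_WORD_BITS] &= ~(1UL << (off % DEMO_WORD_BITS));
        a->tmem_room++;
    }
}

/* Swap out one page: offer it to tmem first, fall back to the device. */
static void demo_swap_out(struct demo_swap_area *a, size_t off)
{
    demo_swap_flush(a, off);     /* drop any stale copy at this offset */
    if (a->tmem_room > 0) {
        a->tmem_room--;          /* tmem accepted the put */
        a->in_tmem[off / DEMO_WORD_BITS] |= 1UL << (off % DEMO_WORD_BITS);
    } else {
        a->device_writes++;      /* tmem refused: write to the device as usual */
    }
}
```

On swap-in, the same bit would tell the swap subsystem whether to issue a get to tmem or a read to the device.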
Cleancache
In most workloads, the kernel fetches pages from a slow disk and, when
RAM is plentiful, the kernel retains copies of many of these pages in
memory, assuming that a disk page used once is likely to be used again.
There's no sense incurring two disk reads when one will do and
there's nothing else to do with that plentiful RAM anyway. If any data
is written to one of those pages, the changes must be written to disk
but, in anticipation of future changes, the (now clean) page continues to
be retained in memory. As a result, the number of clean pages in this
"page cache" often grows to fill the vast majority of memory. Eventually,
when memory is nearly filled, or perhaps if the workload grows to require
more memory, the kernel "reclaims" some of those clean pages; the
data is discarded and the page frames are used for something else.
No data is lost because a clean page in memory is identical to the
same page on disk. However, if the kernel later determines that it does
need that page of data after all, it must again be fetched from disk,
which is called a "refault." Since the kernel can't predict the future,
some pages are retained that will never be used again and some pages
are reclaimed that soon result in a refault.
Cleancache allows tmem to be used to store clean page cache pages resulting
in fewer refaults. When the kernel reclaims a page, rather than
discard the data, it places the data into tmem, tagged as "ephemeral",
which means that the page of data may be discarded if tmem chooses.
Later, if the kernel determines it needs that page of data after all,
it asks tmem to give it back. If tmem has retained the page, it
gives it back; if tmem hasn't retained the page, the kernel proceeds
with the refault, fetching the data from the disk as usual.
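The reclaim and refault paths can be sketched with a deliberately tiny model: a one-slot ephemeral store that silently discards its previous page whenever a new one arrives. All names are hypothetical, the "disk" is simulated by filling a buffer with a pattern, and keying pages by inode number and page index is a simplification of the handle discussed elsewhere in this article.

```c
#include <assert.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

/* One-slot ephemeral store: a new put evicts whatever was there. */
struct demo_ephemeral_slot {
    int valid;
    unsigned long ino;
    unsigned long index;
    unsigned char data[DEMO_PAGE_SIZE];
};

struct demo_stats {
    unsigned int disk_reads;
    unsigned int tmem_hits;
};

/* Simulated disk read: fills the page with a pattern derived from its key. */
static void demo_read_from_disk(unsigned long ino, unsigned long index,
                                unsigned char *page, struct demo_stats *st)
{
    memset(page, (int)((ino + index) & 0xff), DEMO_PAGE_SIZE);
    st->disk_reads++;
}

/* Reclaim: rather than discarding the clean page, offer it to tmem. */
static void demo_reclaim_page(struct demo_ephemeral_slot *slot,
                              unsigned long ino, unsigned long index,
                              const unsigned char *page)
{
    slot->valid = 1;             /* any older page is dropped: ephemeral */
    slot->ino = ino;
    slot->index = index;
    memcpy(slot->data, page, DEMO_PAGE_SIZE);
}

/* Refault: ask tmem first; fall back to the disk if the page is gone. */
static void demo_refault_page(struct demo_ephemeral_slot *slot,
                              unsigned long ino, unsigned long index,
                              unsigned char *page, struct demo_stats *st)
{
    if (slot->valid && slot->ino == ino && slot->index == index) {
        memcpy(page, slot->data, DEMO_PAGE_SIZE);
        slot->valid = 0;         /* get on a private ephemeral pool is destructive */
        st->tmem_hits++;
        return;
    }
    demo_read_from_disk(ino, index, page, st);
}
```

Either way the kernel ends up with correct data; the only difference a miss makes is the cost of the disk read.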
To function properly, cleancache "hooks" are placed where pages are
reclaimed and where the refault occurs. The kernel is also responsible
for ensuring coherency between the page cache, disk, and tmem, so hooks
are also present wherever the kernel might invalidate the data.
Since cleancache affects the kernel's VFS layer, and since not all
filesystems use all VFS features, a filesystem must "opt in" to use
cleancache whenever it mounts a filesystem.
One interesting note about cleancache is that clean pages may be retained
in tmem for a file that has no pages remaining in the kernel's page cache.
Thus the kernel must provide a name ("handle") for the page which is unique
across the entire filesystem. For some filesystems, the inode number is
sufficient, but for modern filesystems, the 192-bit "exportfs" handle is used.
Other tmem frontends
A common question is: can user code use tmem? For example, can enterprise
applications that otherwise circumvent the pagecache use tmem? Currently
the answer is no, but one could implement "tmem syscalls" to allow this.
Coherency issues may arise, and it remains to be seen if they could be
managed in user space.
What about other in-kernel uses? Some have suggested that the kernel dcache
might provide a useful source of data for tmem. This too deserves further
investigation.
4: Transcendent memory backends
The tmem interface allows multiple frontends to function with
different backends. Currently, only one backend may be configured, though
some form of layering may be possible in the future. Tmem backends share some
common characteristics: Although a tmem backend might seem similar to
a block device, it does not perform I/O and does not use the
block I/O (bio) subsystem. In fact, a tmem backend must perform its functions
fully synchronously, that is, it must not sleep and the scheduler
may not be called. When a "put" completes, the kernel's page of data
has been copied. Similarly, a successful "get" does not complete
until the page of data has been copied to the kernel's data page.
While these constraints create some difficulty for tmem backends,
they also ensure that a tmem backend meets tmem's interface
requirements while also minimizing changes to the core kernel.
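The copy semantics can be illustrated with a small Python sketch. The
class name CopyingBackend and its method signatures are assumptions
made for this example and do not correspond to the kernel interface;
the point is only that the data has been copied by the time each call
returns, so the caller is free to reuse its page immediately.

```python
PAGE_SIZE = 4096


class CopyingBackend:
    """Hypothetical backend whose put and get complete synchronously."""

    def __init__(self):
        self._store = {}

    def put(self, handle, page):
        # When put returns, the backend holds its own copy of the data.
        self._store[handle] = bytes(page)
        return True

    def get(self, handle, page):
        data = self._store.get(handle)
        if data is None:
            return False
        # The copy into the caller's page is finished before returning.
        page[:] = data
        return True
```

Because put() stores a copy, scribbling on the source page afterward
has no effect on what a later get() returns.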
Zcache
Although tmem was conceived as a way to share a fixed resource (RAM)
among a number of clients with constantly varying memory appetites,
it also works nicely when the amount of RAM needed by a single kernel
to store some number, N, of pages of data is less than N*PAGE_SIZE
and when those pages of data need be accessed only at a page
granularity. So zcache combines an in-kernel implementation of tmem
with in-kernel compression code to reduce the space requirements
for data provided through a tmem frontend. As a result, when the
kernel is under memory pressure, zcache can substantially increase
the number of clean page cache pages and swap cache pages stored in
RAM and thus significantly decrease disk I/O.
The zcache implementation is currently a staging driver so it is subject
to change; it handles both persistent pages (from frontswap) and ephemeral
pages (from cleancache) and, in both cases, uses the in-kernel lzo1x
routines to compress/decompress the data contained in the pages.
Space for persistent pages is obtained through a shim to xvmalloc,
a memory allocator in the zram staging driver designed to store
compressed pages. Space for ephemeral pages is obtained through
standard kernel get_free_page() calls, then pairs of compressed
ephemeral pages are matched using an algorithm called "compression buddies".
This algorithm ensures that physical page frames containing two
compressed ephemeral pages can easily be reclaimed when necessary;
zcache provides a standard "shrinker" routine so those whole page frames
can be reclaimed when required by the kernel using the existing
kernel shrinker mechanism.
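The idea of pairing compressed pages within a page frame can be
sketched as follows. This is not zcache's actual matching policy: the
first-fit search, the BuddyFrames class, and its method names are all
assumptions for illustration, and only sizes are tracked, not data.

```python
PAGE_SIZE = 4096


class BuddyFrames:
    """Toy model: each page frame holds at most two compressed pages."""

    def __init__(self):
        # Each frame is a list of (handle, compressed_size) entries.
        self.frames = []

    def store(self, handle, size):
        if size > PAGE_SIZE:
            raise ValueError("compressed page larger than a page frame")
        # First fit (an assumed policy): find a frame with one occupant
        # and enough room left for this page.
        for frame in self.frames:
            if len(frame) == 1 and frame[0][1] + size <= PAGE_SIZE:
                frame.append((handle, size))
                return
        self.frames.append([(handle, size)])

    def shrink(self):
        """Reclaim one whole frame; return the handles that were dropped."""
        if not self.frames:
            return []
        frame = self.frames.pop(0)
        return [handle for handle, _ in frame]
```

Since no compressed page ever straddles two frames, giving a frame
back to the kernel means discarding at most two ephemeral pages, which
is always permitted.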
Zcache nicely demonstrates one of the flexibility features of tmem:
Recall that, although data may often compress nicely (i.e. by a factor of
two or more), it is possible that some workloads may produce long
sequences of data that compress poorly. Since tmem allows any page
to be rejected at the time of put, zcache policy (adjustable with
sysfs tuneables in 3.1) avoids storing this poorly compressible data, instead
passing it on to the original swap device for storage, thus dynamically
optimizing the density of pages stored in RAM.
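A rejection policy of this kind might look like the sketch below. Note
the assumptions: Python's zlib stands in for the kernel's lzo1x
routines, and the max_ratio threshold and the function names are
hypothetical rather than zcache's real tuneables.

```python
import os
import zlib

PAGE_SIZE = 4096


def try_put(store, handle, page, max_ratio=0.5):
    """Store the page compressed, unless it compresses too poorly.

    max_ratio is a made-up tuneable: the largest acceptable compressed
    size, as a fraction of PAGE_SIZE.
    """
    compressed = zlib.compress(page)  # zlib used here in place of lzo1x
    if len(compressed) > max_ratio * PAGE_SIZE:
        return False  # put rejected; the caller uses the swap device
    store[handle] = compressed
    return True


def get(store, handle):
    """Return the original page data, or None if it was never stored."""
    compressed = store.get(handle)
    return None if compressed is None else zlib.decompress(compressed)
```

A page of zeroes is accepted, while a page filled from os.urandom()
is rejected because random data does not compress.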
RAMster
RAMster is still under development but a proof-of-concept exists today.
RAMster assumes that we have a cluster-like set of systems with some
high-speed communication layer, or "exofabric", connecting them. The
collected RAM of all the systems in the "collective" is the shared RAM
resource used by tmem. Each cluster node acts as both a tmem client
and a tmem server, and decides how much of its RAM to provide to the
collective. Thus RAMster is a "peer-to-peer" implementation of tmem.
Ideally this exofabric allows some form of synchronous remote DMA to allow
one system to read or write the RAM on another system, but in the initial
RAMster proof-of-concept (RAMster-POC), a standard Ethernet connection is
used instead. As long as the exofabric is sufficiently faster than
disk reads/writes, there is still a net performance win.
Interestingly, RAMster-POC demonstrates a useful dimension of tmem:
Once pages have been placed in tmem, the data can be transformed in
various ways as long as the pages can be reconstituted when required.
When pages are put to RAMster-POC, they are first compressed and cached
locally using a zcache-like tmem backend. As local memory constraints
increase, an asynchronous process attempts to "remotify" pages to another
cluster node; if one node rejects the attempt, another node can be
used as long as the local node tracks where the remote data resides.
Although the current RAMster-POC doesn't implement this, one could
even remotify multiple copies to achieve higher availability (i.e.
to recover from node failures).
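The bookkeeping involved in remotifying a page can be sketched as
shown here. The RemoteNode class, the remotify() and fetch() helpers,
and the fixed per-node capacity are assumptions for illustration;
real nodes would be reached over the exofabric rather than through
direct method calls.

```python
class RemoteNode:
    """Hypothetical cluster node offering a limited number of page slots."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.pages = {}

    def offer(self, handle, data):
        # A node is free to reject a page, for example when it is full.
        if len(self.pages) >= self.capacity:
            return False
        self.pages[handle] = data
        return True


def remotify(handle, local_pages, nodes, location):
    """Try each node in turn; record where the page ended up."""
    data = local_pages[handle]
    for node in nodes:
        if node.offer(handle, data):
            location[handle] = node.name
            del local_pages[handle]
            return True
    return False  # every node refused; the page stays local


def fetch(handle, local_pages, nodes_by_name, location):
    """Return the page data from local memory or from its remote node."""
    if handle in local_pages:
        return local_pages[handle]
    return nodes_by_name[location[handle]].pages[handle]
```

The location table is what allows any node to be tried: the local
node only needs to remember which peer accepted each page.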
While this multi-level mechanism in RAMster works nicely for puts,
there is currently no counterpart for gets. When a tmem frontend
requests a persistent get, the data must be fetched immediately
and synchronously; the thread requiring the data must busy-wait
for the data to arrive and the scheduler must not be called. As
a result, the current RAMster-POC is best suited for many-core processors,
where it is unusual for all cores to be simultaneously active.
Transcendent memory for Xen
Tmem was originally conceived for Xen and so the Xen implementation
is the most mature. The tmem backend in Xen utilizes spare hypervisor
memory to store data, supports a large number of guests, and optionally
implements both compression and deduplication (both within a guest
and across guests) to maximize the volume of data that can be stored.
The tmem frontends are converted to Xen hypercalls using a shim.
Individual guests may be equipped with "self-ballooning" and
"frontswap-self-shrinking" (both in Linux 3.1) to optimize their
interaction with Xen tmem. Xen tmem also supports shared ephemeral
pools, so that guests co-located on a physical server that share a
cluster filesystem need only keep one copy of a cleancache page in tmem.
The Xen control plane also fully implements tmem: an extensive set
of statistics is available; live migration and save/restore of
tmem-using guests are fully supported; and limits, or "weights", may be
applied to tmem guests to avoid denial-of-service.
Transcendent memory for KVM
The in-kernel tmem code included in zcache has been updated in 3.1 to
support multiple tmem clients. With this in place, a KVM implementation
of tmem should be fairly easy to complete, at least in prototype form.
As with Xen, a shim would need to be placed in the guest to convert
the cleancache and frontswap frontend calls to KVM hypercalls. On the host
side, these hypercalls would need to be interfaced with the in-kernel
tmem backend code. Some additional control plane support would also be
necessary for this to be used in a KVM distribution.
Future tmem backends
The flexibility and dynamicity of tmem suggest that it may be useful
for other storage needs, and other backends have been proposed. The
idiosyncrasies of some RAM-extension technologies, such as SSD
and phase-change memory (PRAM), have been observed to be a possible fit;
since page-size quantities are always used, writes can be
carefully controlled and accounted, and user code never
writes to tmem, memory technologies that could previously only be
used as a fast I/O device could now instead be used as slow RAM.
Some of these ideas are already under investigation.
Appendix A: Self-ballooning and frontswap-self-shrinking
After a system has been running for a while, it is not uncommon for
the vast majority of its memory to be filled with clean pagecache
pages. With some tmem backends, especially Xen, it may make sense
for those pages to reside in tmem instead of in the guest. To achieve
this, Xen implements aggressive "self-ballooning", which artificially
creates memory pressure by driving the Xen balloon driver to claim
page frames, thus forcing the kernel to reclaim pages, which sends
them to tmem. The algorithm essentially uses control theory to drive
towards a memory target that approximates the current "working set"
of the workload using the "Committed_AS" kernel variable. Since
Committed_AS doesn't account for clean, mapped pages, these pages
end up residing in Xen tmem where, queueing theory assures us, Xen
can manage the pages more efficiently.
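One simple way to drive toward such a target is to close a fraction of
the gap on each pass, giving memory back slowly but taking it quickly
when the working set grows. The sketch below assumes that approach;
the function name, the hysteresis parameters, and their default values
are illustrative and should not be read as the driver's real settings.

```python
def next_balloon_target(current_pages, committed_pages,
                        down_hysteresis=8, up_hysteresis=1):
    """Move the guest's memory size one step toward Committed_AS.

    The hysteresis values are hypothetical tuneables: a larger value
    means a smaller step in that direction.
    """
    goal = committed_pages
    if goal < current_pages:
        gap = current_pages - goal
        step = -(-gap // down_hysteresis)  # ceiling division
        return current_pages - step
    if goal > current_pages:
        gap = goal - current_pages
        step = -(-gap // up_hysteresis)
        return current_pages + step
    return current_pages
```

With these defaults, a guest holding 1000 pages against a goal of 200
drops to 900 on the first pass and converges on 200 over repeated
passes, while a guest that needs more memory jumps to its goal at once.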
If the working set increases unexpectedly and faster than the
self-balloon driver is able to (or chooses to) provide usable RAM,
swapping occurs, but, in most cases, frontswap is able to absorb
this swapping into Xen tmem. However, because the
kernel swap subsystem assumes that swapping occurs to a disk,
swapped pages may sit on the "disk" for a very long time, even if
the kernel knows the page will never be used again, because the
disk space costs very little and can be overwritten when necessary.
When such stale pages are in frontswap, however, they are taking
up valuable space.
Frontswap-self-shrinking works to resolve this problem:
when frontswap activity is stable and the guest kernel returns to
a state where it is not under memory pressure, pressure is provided
to remove some pages from frontswap, using a "partial" swapoff
interface, and return them to kernel memory, thus freeing tmem
space for more urgent needs, i.e. other guests that are currently
memory-constrained.
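The control loop might be sketched like this: wait until frontswap has
stopped growing for a few samples, then ask for a slightly smaller
frontswap on each pass. The FrontswapShrinker class and its inertia
and hysteresis parameters are assumptions for illustration; the caller
is expected to pass any returned target to the partial swapoff
interface.

```python
class FrontswapShrinker:
    """Toy controller that decides when and how far to shrink frontswap."""

    def __init__(self, inertia=3, hysteresis=20):
        self.inertia = inertia        # quiet samples required before shrinking
        self.hysteresis = hysteresis  # shrink by 1/hysteresis per pass
        self.countdown = inertia
        self.last_pages = 0

    def sample(self, cur_pages):
        """Return a new target page count, or None to leave frontswap alone."""
        growing = cur_pages > self.last_pages
        self.last_pages = cur_pages
        if growing:
            # Swap activity: the guest is under pressure, so start over.
            self.countdown = self.inertia
            return None
        if self.countdown > 0:
            self.countdown -= 1
            return None
        if cur_pages <= self.hysteresis:
            return 0
        return cur_pages - cur_pages // self.hysteresis
```

Any renewed growth in frontswap usage resets the quiet period, so
pages are pulled back into kernel memory only while the guest is calm.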
Both self-ballooning and frontswap-self-shrinking provide sysfs
tuneables to drive their control processes. Further experimentation
will be necessary to optimize them.
Appendix B: Subtle tmem implementation requirements
Although tmem places most coherency responsibility on its clients, a
tmem backend itself must enforce two coherency requirements. These are
called "get-get" coherency and "put-put-get" coherency. For the
former, a tmem backend guarantees that if a get fails, a subsequent
get to the same handle will also fail (unless, of course, there is an
intermediate put). For the latter, if a put places data "A" into
tmem and a subsequent put with the same handle places data "B" into
tmem, a subsequent "get" must never return "A".
This second coherency requirement results in an unusual corner-case
which affects the API/ABI specification: If a put with a handle "X" of
data "A" is accepted, and then a subsequent put is done to handle "X" with
data "B", this is referred to as a "duplicate put". In this case,
the API/ABI allows the backend implementation two options, and the
frontend must be prepared for either: (1) the duplicate put is
accepted, in which case the backend replaces data "A" with data "B"
and returns success, or (2) the duplicate put fails, in which case the
backend must flush the data associated with "X" so that a subsequent
get will fail. This is the only case where a persistent get of
a previously accepted put may fail; fortunately in this case
the frontend has the new data "B" which would have overwritten the
old data "A" anyway.
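Both permitted behaviors for a duplicate put can be captured in a short
sketch. The PersistentBackend class and its replace_on_duplicate flag
are hypothetical; a real backend would choose between the two options
based on its own circumstances (whether the new data fits, for
example) rather than on a constructor argument.

```python
class PersistentBackend:
    """Toy persistent pool that honors put-put-get coherency."""

    def __init__(self, replace_on_duplicate=True):
        self.replace_on_duplicate = replace_on_duplicate
        self.store = {}

    def put(self, handle, data):
        if handle in self.store:
            if self.replace_on_duplicate:
                # Option 1: accept the duplicate put, replacing old data.
                self.store[handle] = bytes(data)
                return True
            # Option 2: fail the put, but flush the stale data so that
            # a later get cannot return it.
            del self.store[handle]
            return False
        self.store[handle] = bytes(data)
        return True

    def get(self, handle):
        # Returns the data, or None if the get fails.
        return self.store.get(handle)
```

Whichever option the backend takes, a get after the second put yields
either "B" or a failure, never the stale "A"; on a failed put, the
frontend still holds "B" and can store it elsewhere.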
Patches and updates
Kernel trees
- Peter Zijlstra: 3.0.1-rt9. (August 12, 2011)
Page editor: Jonathan Corbet