The current development kernel is 3.3-rc7, released
on March 10 despite Linus's
earlier wish not to do any more 3.3 prepatches. "Now, none
of the fixes here are all that scary in themselves, but there were just too
many of them, and across various subsystems. Networking, memory
management, drivers, you name it. And instead of having fewer commits than
in -rc6, we have more of them. So my hope that things would calm down
simply just didn't materialize."
Stable updates: the 3.2.10 and 3.0.24 updates were released on March 12;
both contain a long list of important fixes. 3.2.11 followed one day later to fix a build
problem. Another stable update, containing almost 200
changes, is in the review process as of this writing.
Comments (none posted)
Programming is not just an act of telling a computer what to do: it
is also an act of telling other programmers what you wished the
computer to do. Both are important, and the latter deserves care.
-- Andrew Morton
Dammit, I'm continually surprised by the *idiots* out there that
don't understand that binary compatibility is one of the absolute
top priorities. The *only* reason for an OS kernel existing in the
first place is to serve user-space. The kernel has no relevance on
its own. Breaking existing binaries - and then not acknowledging
how horribly bad that was - is just about the *worst* offense any
kernel developer can do.
-- Linus Torvalds
Kernel developers have thick heads, in most cases thicker than…
-- Borislav Petkov
Comments (9 posted)
Greg Kroah-Hartman discusses the
history of the 2.6.32 stable kernel series
and why he has stopped
supporting it. "With the 2.6.32 kernel being the base of these
longterm enterprise distros, it was originally guessed that it would be
hanging around for many many years to come. But, my old argument about how
moving a kernel forward for an enterprise distro finally sunk in for 2 of
the 3 major players. Both Oracle Linux and SLES 11, in their latest
releases these few months, have moved to the 3.0 kernel as the base of
them, despite leaving almost all other parts of the distro alone. They did
this to take advantage of the better support for hardware, newer features,
newer filesystems, and the hundreds of thousands of different changes that
has happened in the kernel.org releases since way back in 2009."
Comments (none posted)
Paul McKenney looks at
the state of hardware transactional memory
with an eye toward how it
might be useful for current software. "Even with forward-progress
guarantees, HTM is subject to aborts and rollbacks, which (aside from
wasting energy) are failure paths. Failure code paths are in my experience
difficult to work with. The possibility of failure is not handled
particularly well by human brain cells, which are programmed for
optimism. Failure code paths also pose difficulties for validations,
particularly in cases where the probability of failure is low or in cases
where multiple failures are required to reach a given code path."
Comments (20 posted)
After the late-February discussion on the
future of control groups, Tejun Heo has boiled down the comments and
come to some conclusions
as to where he
would like to go with this subsystem. The first of these is that multiple
hierarchies are doomed in the long term:
At least to me, nobody seems to have strong enough justification
for orthogonal multiple hierarchies, so, yeah, unless something
else happens, I'm scheduling multiple hierarchy support for the
chopping block. This is a long term thing (think years), so no
need to panic right now and as is life plans may change and fail to
materialize, but I intend to at least move away from it.
So there will, someday, be a single control group hierarchy. It will not,
however, be tied to the process tree; it will be an independent tree of
groups allowing processes to be combined in arbitrary ways.
The responses to Tejun's conclusions have mostly focused on details (how to
handle controllers that are not fully hierarchical, for example). There
does not appear to be any determined opposition to the idea of removing the
multiple hierarchy feature at some point when it can be done without
breaking systems, so users of control groups should consider the writing to
be on the wall.
Comments (3 posted)
Kernel development news
Kernel developers like to grumble about the kernels shipped by enterprise
distributions. Those kernels tend to be managed in ways that ignore the
best features of the Linux development process; indeed, sometimes they seem
to work against that process. But, enterprise kernels and the systems
built on them are also the platform on which the money that supports kernel
development is made, so developers only push their complaints so far. For
years, it has seemed that nothing could change the "enterprise mindset,"
but recent releases show that there may, indeed, be change brewing in this area.
Consider Red Hat Enterprise Linux 6; its kernel is ostensibly based on the
2.6.32 release. The actual kernel, as shipped by Red Hat, differs from
2.6.32 by around 7,700 patches, though. Many of those are fixes, but
others are major new features, often backported from more recent releases.
Thus, the RHEL "2.6.32" kernel includes features like per-session group
scheduling, receive packet/flow steering, transparent huge pages, pstore,
and, of course, support for a wide range of hardware that was not available
when 2.6.32 shipped. Throw in a few out-of-tree features (SystemTap, for
example), and the end result is a kernel far removed from anything shipped
by kernel.org. That is why Red Hat has had no real use for the 2.6.32
stable kernel series for some years.
Red Hat's motivation for creating these kernels is not hard to understand;
the company is trying to provide its customers with a combination of the
stability that comes from well-aged software and the features, fixes, and
performance improvements from the leading edge. This process, when it goes
well, can give those customers the best of both worlds. On the other hand,
the resulting kernels differ widely from the community's product, have not
been tested by the community, and exclude recent features that have not
been chosen for backporting. They are also quite expensive to create;
behind Red Hat's many high-profile kernel hackers is an army of developers
tasked with backporting features and keeping the resulting kernel stable.
When developers grumble about enterprise kernels, what they are really
saying is that enterprise distributions might be better served by simply
updating to more current kernels. In the process they would get all those
features, improvements, and bug fixes from the community, in the form that
they were developed and tested by that community. Enterprise distributors
shipping current kernels could dispense with much of their support expense
and could better benefit from shared maintenance of stable kernel
releases. The response that typically comes back is that enterprise
customers worry about kernel version bumps (though massive changes hidden
behind a minor number change are apparently not a problem) and that new
kernels bring new bugs with them. The cost of stabilizing a new kernel
release, it is suggested, could exceed that of backporting desired features
into an older release.
Given that, it is interesting to see two other enterprise distributors
pushing forward with newer kernels. Both SUSE Linux
Enterprise Server 11 Service Pack 2 and Oracle's Unbreakable
Enterprise Kernel Release 2 feature much more recent kernels -
3.0.10 and 3.0.16, respectively. In each case, the shift to a newer kernel
is a clear attempt to create a more attractive distribution; we may be
seeing the beginning of a change in the longstanding enterprise mindset.
SUSE seems firmly stuck in a second-place market position relative to Red
Hat. As a
result, the company will be searching for ways to differentiate its
distribution from RHEL. SUSE almost certainly also lacks the kind of
resources that Red Hat is able to apply to its enterprise kernels, so it
will be looking for cheaper ways to provide a competitive set of features.
Taking better advantage of the community's work by shipping more current
kernels is one obvious way to do that. By shipping recent releases, SUSE
does not have to backport fixes and features, and it is able to take
advantage of the long-term stable support planned for the 3.0 kernel. In
that context, it is not entirely surprising that SUSE has repeatedly pulled its
customers forward, jumping from 2.6.27 to 2.6.32 in the Service Pack 1
release, then to 3.0.
Oracle, too, has a need to differentiate its distribution - even more so,
given that said distribution is really just a rebranded RHEL. To that end,
Oracle would like to push some of its in-house features like btrfs,
which is optimistically labeled "production-ready" in a recent press
release. If btrfs is indeed ready for production use, it certainly has
only gotten there in very recent releases; moving to the 3.0 kernel allows
Oracle to push this feature while minimizing the amount of work required to
backport the most recent fixes. Oracle is offering this kernel with
releases 5 and 6 of Oracle Linux; had Oracle stuck with Red Hat's
RHEL 5 kernel, Oracle Linux 5 users would still be running
something based on 2.6.18. For a company trying to provide a more
feature-rich distribution on a budget, dropping in a current kernel must
seem like a bargain.
What about the down side of new kernels - all those new bugs? Both
companies have clearly tried to mitigate that risk by letting 3.0 stabilize
for six months or so before shipping it to customers. There have been over
1,500 fixes applied in the 24 updates to 3.0 released so far. The real proof,
though, is in users' experience. If SLES or Oracle Linux users experience
bugs or performance regressions as a result of the kernel version change,
they may soon start looking for alternatives. In the Oracle case, the
original Red Hat kernel remains an option for customers; SUSE, instead,
seems committed to the newer version.
Between these two distributions there should be enough users to eventually
establish whether moving to newer kernels in the middle of an enterprise
distribution's support period is a smart move or not. If it works out,
SUSE and Oracle may benefit from an influx of customers who are tired of
Red Hat's hybrid kernels. If the new kernels prove not to be
enterprise-ready, instead, Red Hat's position may become even stronger.
Learning which way things will go may take a while. Should Red Hat show up
one day with a newer kernel for RHEL customers, though, we'll know that the
issue has been decided at last.
Comments (10 posted)
Traditionally, the kernel has allowed the modification of pages in memory
while those pages are in the process of being written back to persistent
storage. If a process writes to a section of a file that is currently
under writeback, that
specific writeback operation may or may not contain all of the most
recently written data. This behavior is not normally a problem; all the
data will get to disk eventually, and developers (should) know that if
they want to get data to disk at a specific time, they should use the
fsync() system call to get it there. That said, there are times
when modifying under-writeback pages can create problems; those problems
have been addressed, but now it appears that the cure may be as bad as the disease.
Some storage hardware can transmit and store checksums along with data;
those checksums can provide assurance that the data written to (or read
from) disk matches what the processor thought it was writing. If the data
in a page changes after the calculation of the checksum, though, that data
will appear to be corrupted when the checksum is verified later on.
Volatile data can also create problems on RAID devices and with filesystems
implementing advanced features like data compression. For all of these
reasons, the stable pages feature was added
to ext4 for the 3.0 release (some other filesystems, btrfs included, have
had stable pages for some time). With this feature, pages under writeback
are marked as not being writable; any process attempting to write to such a
page will block until the writeback completes. It is a relatively simple
change that makes system behavior more deterministic and predictable.
That was the thought, anyway, and things do work out that way most of the
time. But, occasionally, as described by
Ted Ts'o, processes performing writes can find themselves blocked for
lengthy periods (multiple seconds) of time. Occasional latency spikes are
not the sort of deterministic behavior the developers were after; they also
leave users unamused.
In a general sense, it is not hard to imagine what may be going on after seeing
this kind of problem report. The system in question is very busy, with
many processes contending for the available I/O bandwidth. One process is
happily minding its own business while appending to its log file. At some
point, though, the final page in that log file is submitted for writeback;
it then becomes unwritable. As soon as our hapless process tries to add
another line to the file, it will be blocked waiting for that writeback to
complete. Since the disks are contended and the I/O queues are long, that
wait can go on for some time. By the time the process is allowed to
proceed, it has suffered an extensive, unexpected period of latency.
Ted's proposed solution was to only implement stable pages if the data
integrity features are built into the kernel. That fix is unlikely to be
merged in that form for a few reasons. Many distributor kernels are likely
to have the feature enabled, but it will actually be used on relatively few
systems. As noted above, there are other places where changing data in
pages under writeback can create problems. So the real solution may
be some sort of runtime switch - perhaps a filesystem mount option -
indicating when stable pages are needed.
It is also possible that the real problem is somewhere else. Chris Mason
expressed discomfort with the idea of only
using stable pages where they are strictly needed:
I'm not against only turning on stable pages when they are needed,
but the code that isn't the default tends to be somewhat less used.
So it does increase testing burden when we do want stable pages,
and it tends to make for awkward bugs that are hard to reproduce
because someone neglects to mention it.
According to Chris, writeback latencies simply should not be seen on the
scale of multiple seconds; he would like to see some effort put into
figuring out why that is happening. Then, perhaps, the real problem could be fixed.
But it may be that the real problem is simply that the system's resources
are heavily oversubscribed and the I/O queues are long. In that case, a
real fix may be hard to come by.
Boaz Harrosh suggested avoiding writeback on the final
pages of any files that have been modified in the last few seconds. That
might help in the "appending to a log file" case, but will not avoid
unpredictable latency resulting from modification of the file at any
location other than the end. People have suggested that pages modified
while under writeback could be copied, allowing the modification to proceed
immediately and not interfere with the writeback. That solution, though,
requires more memory (perhaps during a time when the system is desperately
trying to free memory) and copying pages is not free. Another option, suggested by Ted, would be to add a callback
to be invoked by the block layer just before a page is passed on to the
device; that callback could calculate checksums and mark the page
unwritable only for the (presumably much shorter) time that it is actually under I/O.
Other solutions certainly exist. The first step, though, would appear to
be to get a real handle on the problem so that solutions are written with
an understanding of where the latency is actually coming from. Then,
perhaps, we can have a stable pages implementation that provides stable
data with stable latency in all situations.
Comments (15 posted)
The Contiguous Memory Allocator (or CMA), which LWN looked at back in June 2011,
has been developed to allow allocation of big, physically-contiguous memory blocks. Simple in principle, it has grown
quite complicated, requiring cooperation between many
subsystems. Depending on one's perspective, there are different things to be
done and watch out for with CMA. In this article, I will describe
how to use CMA and how to integrate it with a given platform.
From a device driver author's point of view, nothing should
change. CMA is integrated with the DMA subsystem, so the usual calls
to the DMA API (such as dma_alloc_coherent()) should work
as usual. In fact, device drivers should never need to call the CMA API
directly, since instead of bus addresses and kernel mappings it
operates on pages and page frame numbers (PFNs), and provides no
mechanism for maintaining cache coherency.
For more information, Documentation/DMA-API.txt and
Documentation/DMA-API-HOWTO.txt are worth a look; those two documents
describe the provided functions as well as giving usage examples.
Of course, someone has to integrate CMA with the DMA subsystem of
a given architecture. This is performed in a few fairly easy steps.
CMA works by reserving memory early at boot time. This memory,
called a CMA area or a CMA context, is later
returned to the buddy allocator so that it can be used by regular
applications. To do the reservation, one needs to call:
void dma_contiguous_reserve(phys_addr_t limit);
just after the low-level "memblock" allocator is initialized but
prior to the buddy
allocator setup. On ARM, for example, it is called in
arm_memblock_init(), whereas on x86 it is just after memblock
is set up in setup_arch().
The limit argument specifies the physical address above
which no memory will be prepared for CMA. The intention is to
limit CMA contexts to addresses that DMA can handle. In the
case of ARM, the limit is the minimum of arm_dma_limit and
arm_lowmem_limit. Passing zero will allow CMA to
allocate its context as high as it wants. The only constraint is
that the reserved memory must belong to the same zone.
The amount of reserved memory depends on a few Kconfig options
and a cma kernel parameter. I will describe them further down in the article.
The dma_contiguous_reserve() function will reserve memory
and prepare it to be used with CMA. On some architectures (e.g. ARM) some
architecture-specific work needs to be performed as well. To
allow that, CMA will call the following function:
void dma_contiguous_early_fixup(phys_addr_t base, unsigned long size);
It is the architecture's responsibility to provide it along with
its declaration in the asm/dma-contiguous.h header file. If
a given architecture does not need any special handling, it's enough
to provide an empty function definition.
It will be called quite early, thus some subsystems
(e.g. kmalloc()) will not be available. Furthermore, it
may be called several times (since, as described below, several
CMA contexts may exist).
The second thing to do is to change the architecture's DMA implementation to use
the whole machinery. To allocate CMA memory one uses:
struct page *dma_alloc_from_contiguous(struct device *dev, int count, unsigned int align);
Its first argument is a device that the allocation is performed on
behalf of. The second specifies the number of pages (not
bytes or order) to allocate. The third argument is the alignment expressed as a page order.
It enables allocation of buffers whose physical addresses are aligned
to 2^align pages. To avoid fragmentation, pass zero here if at
all possible. It is worth noting that there is
a Kconfig option (CONFIG_CMA_ALIGNMENT) which specifies the
maximum alignment accepted by the function. Its default value is
8, meaning 256-page alignment.
The return value is the first page of a sequence of count allocated pages.
To free the allocated buffer, one needs to call:
bool dma_release_from_contiguous(struct device *dev, struct page *pages, int count);
The dev and count arguments are the same as before,
whereas pages is what
dma_alloc_from_contiguous() returned. If the region passed to the function did not come from CMA, the
function will return false. Otherwise, it will return
true. This removes the need for higher-level functions to track
which allocations were made with CMA and which were made using some other method.
Beware that dma_alloc_from_contiguous() may not be
called from atomic context. It performs some “heavy” operations
such as page migration, direct reclaim, etc., which may take
a while. Because of that, to make
dma_alloc_coherent() and friends work as advertised,
the architecture needs to have a different method of allocating
memory in atomic context.
The simplest solution is to put aside a bit of memory at boot
time and perform atomic allocations from that. This is in fact what
ARM is doing. Existing architectures most likely already have a special
path for atomic allocations.
Special memory requirements
At this point, most of the drivers should “just work”. They
use the DMA API, which calls CMA. Life is beautiful. Except some
devices may have special memory requirements. For instance,
Samsung's S5P Multi-format codec requires buffers to be located in different
memory banks (which allows reading them through two memory channels,
thus increasing memory bandwidth).
Furthermore, one may want to separate some devices' allocations
from others to limit fragmentation within CMA areas.
CMA operates on contexts. Devices use one global area by
default, but private contexts can be used as well. There is
a many-to-one mapping between struct devices and a
struct cma (i.e. CMA context). This means that a single
device driver needs to have separate struct device
objects to use more than one CMA context, while at the same time
several struct device objects may point to the same CMA context.
To assign a CMA context to a device, all one needs to do is call:
int dma_declare_contiguous(struct device *dev, unsigned long size,
phys_addr_t base, phys_addr_t limit);
Like dma_contiguous_reserve(), this needs to be called
after memblock initializes but before too much memory gets grabbed
from it. For ARM platforms, a convenient place to put the call
to this function is in the machine's reserve() callback. This
won't work for automatically probed devices or those loaded as
modules, so some other mechanism will be needed if those kinds of
devices require CMA contexts.
The first argument of the function is the device that the new
context is to be assigned to. The second specifies the size in
bytes (not in pages) to reserve for the area. The third is the physical address of
the area or zero. The last one has the same meaning as
dma_contiguous_reserve()'s limit argument. The
return value is
either zero or a negative error code.
There is a limit to how many “private” areas can be declared,
namely CONFIG_CMA_AREAS. Its default value is seven but
it can be safely increased if the need arises.
Things get a little bit more complicated if the same non-default CMA
context needs to be used by two or more devices. The
current API does not provide a trivial way to do that. What can
be done is to use dev_get_cma_area() to figure out the CMA area
that one device is using, and dev_set_cma_area() to set the
same context to another device. This sequence must be called no
sooner than in postcore_initcall(). Here is how it might be done:

    static int __init foo_set_up_cma_areas(void)
    {
        struct cma *cma;

        cma = dev_get_cma_area(device1);
        dev_set_cma_area(device2, cma);
        return 0;
    }
    postcore_initcall(foo_set_up_cma_areas);
As a matter of fact, there is nothing special about the
default context that is created by
the dma_contiguous_reserve() function. It is in no way
required and the system will work without it. If there is no default
context, dma_alloc_from_contiguous() will return
NULL for devices without assigned
areas. dev_get_cma_area() can be used to
distinguish between this situation and allocation failure.
dma_contiguous_reserve() does not take a size as an
argument, so how does it know how much
memory should be reserved? There are two sources of this information:
There is a set of Kconfig options, which specify the default
size of the reservation. All of those options are located under
“Device Drivers” » “Generic Driver Options” » “Contiguous Memory
Allocator” in the Kconfig menu. They allow choosing from four
possibilities: the size can be an absolute value in megabytes,
a percentage of total memory, the smaller of the two, or the larger
of the two. The default is to reserve 16 MiB.
There is also a cma= kernel command line option. It
lets one specify the size of the area at boot time without the
need to recompile the kernel. This option specifies the size in
bytes and accepts the usual suffixes.
So how does it work?
To understand how CMA works, one needs to know a little about
migrate types and pageblocks.
When requesting memory from the buddy allocator, one provides
a gfp_mask. Among other things, it specifies the
"migrate type" of the requested page(s). One of the migrate types
is MIGRATE_MOVABLE. The idea behind it is that data
from a movable page can be migrated (or moved, hence the name),
which works well for disk caches, process pages, etc.
To keep pages with the same migrate type together, the buddy allocator
groups pages into "pageblocks," each having a migrate type assigned to it.
The allocator then tries to allocate pages from pageblocks with a type
corresponding to the request. If that's not possible, however, it
will take pages from different pageblocks and may even
change a pageblock's migrate type.
This means that a non-movable page can be allocated from
a MIGRATE_MOVABLE pageblock which can also result in that
pageblock changing its migrate type. This is undesirable for CMA,
so it introduces a MIGRATE_CMA type which has one
important property: only movable pages can be allocated from a MIGRATE_CMA pageblock.
So, at boot time, when the dma_contiguous_reserve() and/or
dma_declare_contiguous() functions are called, CMA talks
to memblock to reserve a portion of RAM, just to give it back to the
buddy system later on with the underlying pageblock's migrate type
set to MIGRATE_CMA. The end result is that all the
reserved pages end up back in the buddy allocator, so they
can be used to satisfy movable page allocations.
During CMA allocation, dma_alloc_from_contiguous()
chooses a page range and calls:
int alloc_contig_range(unsigned long start, unsigned long end, unsigned migratetype);
The start and end arguments specify the page
frame numbers (or the PFN range) of the target memory. The last
argument, migratetype, indicates the migrate type of the
underlying pageblocks; in the case of CMA, this is MIGRATE_CMA.
The first thing this function does is to mark the pageblocks
contained within the [start, end) range as
MIGRATE_ISOLATE. The buddy allocator will never touch
a pageblock with that migrate type.
Changing the migrate type does not magically free pages, though; this is
why __alloc_contig_migrate_range() is called next. It
scans the PFN range and looks for pages that can be migrated away.
Migration is the process of copying a page to some other portion
of system memory and updating any references to it. The former is
straightforward and the latter is handled by the memory management
subsystem. After its data has been migrated, the old page is freed by
giving it back to the buddy allocator. This is why the containing pageblocks needed
to be marked as MIGRATE_ISOLATE beforehand. Had they been given
a different migrate type, the buddy allocator would not think twice
about using them to fulfill other allocation requests.
Now all of the pages that alloc_contig_range() cares
about are (hopefully) free. The function takes them away from the buddy
system, then changes the pageblocks' migrate type back to
MIGRATE_CMA. Those pages are then returned to the caller.
Freeing memory is a much simpler process.
dma_release_from_contiguous() delegates most of its work to:
void free_contig_range(unsigned long pfn, unsigned nr_pages);
which simply iterates over all the pages and puts them back into the buddy system.
The Contiguous Memory Allocator patch set has gone a long way from its first version (and even longer from its
predecessor – Physical
Memory Management posted almost three years ago). On the way, it
lost some of its functionality but got better at what it does now. On
complex platforms, it is likely that CMA won't be usable on its own,
but will be used in combination with ION and dmabuf.
Even though it is at its 23rd version, CMA is still not
perfect and, as always, there's still a lot that can be done to
improve it. Hopefully though, getting it finally merged into the -mm tree
will get more people working on it to create a solution that benefits everyone.
Comments (none posted)
Patches and updates
Core kernel code
Filesystems and block I/O
Benchmarks and bugs
Page editor: Jonathan Corbet