Brief items
The current development kernel is 3.2-rc2, released on November 15.
"And for being an -rc2 release of a pretty large merge-window, it seems
to be quite
reasonably sized. In fact, despite this having been the largest linux-next
in a release in our linux-next history (I think), rc2 has the exact same
number of commits since rc1 as we had during the 3.1 release." There
are lots of fixes and a couple of ktest improvements.
Stable updates: the 3.0.9 and 3.1.1 stable kernels were released on
November 11. Both contain a very long list of important fixes.
And what do pgfault/pgmajfault mean within memcg? I now fear to
ask - given the pgpgin/pgpgout situation, these are probably
related to my shampoo viscosity or something.
--
Andrew Morton
It's difficult to know for sure that this is the right thing to do -
there's zero public documentation on the interaction between all of
these components. But enough vendors enable ASPM on platforms and then
set this bit that it seems likely that they're expecting the OS to
leave them alone.
Measured to save around 5W on an idle Thinkpad X220.
--
Matthew Garrett maybe fixes a serious
power problem
Kernel development news
By Jonathan Corbet
November 14, 2011
It is a rare event, but it is no fun when it strikes. Plug in a slow
storage device - a USB stick or a music player, for example - and run
something like rsync to move a lot of data to that device. The
operation takes a while, which is unsurprising; more surprising is when
random processes begin to stall. In the worst cases, the desktop can lock
up for minutes at a time; that, needless to say, is not the kind of
interactive response that most users are looking for. The problem can
strike in seemingly arbitrary places; the web browser freezes, but a
network audio stream continues to play without a hiccup. Everything
unblocks eventually, but, by then, the user is on their third beer and
contemplating the virtues of proprietary operating systems. One might be
forgiven for thinking that the system should work a little better than that.
Numerous people have reported this sort of behavior in recent times; your
editor has seen it as well. But it is hard to reproduce, which means it
has been hard to track down. It is also entirely possible that there is
more than one bug causing this kind of behavior. In any case, there should
now be one less bug of this type if Mel
Gorman's patch proves to be effective. But a few developers are
wondering if, in some cases, the cure is worse than the disease.
The problem Mel found appears to go somewhat like this. A process (that
web browser, say) is doing its job when it incurs a page fault. This is
normal; the whole point of contemporary browsers sometimes seems to be to
stress-test virtual memory management systems. The kernel
will respond by grabbing a free page to slot into the process's address
space. But, if the transparent huge pages feature is built into the kernel
(and most distributors do enable this feature), the page fault handler will
attempt to allocate a huge page instead. With luck, there will be a huge
page just waiting for this occasion, but that is not always the case; in
particular, if there is a process dirtying a lot of memory, there may be no
huge pages available. That is when things start to go wrong.
Once upon a time, one just had to assume that, once the system had been
running for a while, large chunks of physically-contiguous memory would
simply not exist. Virtual memory management tends to fragment such chunks
quickly. So it is a bad idea to assume that huge pages will just be
sitting there waiting for a good home; the kernel has to take explicit
action to cause those pages to exist. That action is compaction: moving pages around to defragment
the free space and bring free huge pages into existence. Without
compaction, features like transparent huge pages would simply not work in
any useful way.
A lot of the compaction work is done in the background. But current
kernels will also perform "synchronous compaction" when an attempt to
allocate a huge page would fail due to lack of availability. The process
attempting to perform that allocation gets put to work migrating pages in
an attempt to create the huge page it is asking for. This operation is not
free in the best of times, but it should not be causing multi-second (or
multi-minute) stalls. That is where the USB stick comes in.
If a lot of data is being written to a slow storage device, memory will
quickly be filled with dirty pages waiting to be written out. That, in
itself, can be a problem, which is why the recently-merged I/O-less dirty throttling code tries hard to
keep pages for any single device from taking too much memory. But
writeback to a slow device plays poorly with compaction; the memory
management code cannot migrate a page that is being written back until the
I/O operation completes. When synchronous compaction encounters such
a page, it will go to sleep waiting for the I/O on that page to complete.
If the page is headed to a slow device, and it is far back on a queue of
many such pages, that sleep can go on for a long time.
One should not forget that producing a single huge page can involve
migrating hundreds of ordinary pages. So once that long sleep completes,
the job is far from done; the process stuck performing compaction may find
itself at the back of the writeback queue quite a few times before it can
finally get its page fault resolved. Only then will it be able to resume
executing the code that the user actually wanted run - until the
next page fault happens and the whole mess starts over again.
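The cost described above can be put into rough numbers. The following is a minimal back-of-the-envelope sketch, not kernel code: it assumes x86-64 page sizes (4KB base pages, 2MB huge pages), and the function names, queue depth, and device throughput are all hypothetical values chosen for illustration.

```c
#define BASE_PAGE_SIZE 4096UL               /* assumed: 4KB base page (x86-64) */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* assumed: 2MB huge page (x86-64) */

/* Number of base pages that must be contiguous and free for one huge page. */
unsigned long pages_per_huge_page(void)
{
    return HUGE_PAGE_SIZE / BASE_PAGE_SIZE;
}

/*
 * Seconds a task sleeps on one page under writeback, assuming that
 * queued_bytes (that page included) must reach the device at
 * device_bytes_per_sec before the page can be migrated.
 */
double writeback_wait_seconds(unsigned long queued_bytes,
                              unsigned long device_bytes_per_sec)
{
    return (double)queued_bytes / (double)device_bytes_per_sec;
}

/* Total stall if compaction has to sleep `waits` times for one huge page. */
double compaction_stall_seconds(unsigned int waits,
                                unsigned long queued_bytes,
                                unsigned long device_bytes_per_sec)
{
    return waits * writeback_wait_seconds(queued_bytes, device_bytes_per_sec);
}
```

With a hypothetical 200MB of dirty data queued for a device that sustains 5MB/s, writeback_wait_seconds(200000000UL, 5000000UL) comes to 40 seconds for a single sleep; a task that lands at the back of the queue three times would be stalled for two minutes on one page fault.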
Mel's fix is a simple one-liner: if a process is attempting to allocate a
transparent huge page, synchronous compaction should not be performed. In
such a situation, Mel figured, it is far better to just give the process an
ordinary page and let it continue running. The interesting thing is that
not everybody seems to agree with him.
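The policy change can be modeled in a few lines. This is a toy user-space sketch of the decision as the article describes it, not the actual patch; every type and function name here is hypothetical, and the real allocator has many more cases than the three shown.

```c
#include <stdbool.h>

/* How hard the allocator may work to build a huge page (toy model). */
enum compact_mode {
    COMPACT_ASYNC,  /* skip pages under writeback, never sleep on I/O */
    COMPACT_SYNC    /* sleep until writeback finishes, then migrate */
};

/* What the faulting task ends up with. */
enum fault_result {
    GOT_HUGE_PAGE,
    GOT_BASE_PAGE
};

/*
 * The policy described above: a transparent huge page fault never
 * escalates to synchronous compaction. In this model, any other
 * caller is still allowed to.
 */
enum compact_mode pick_compact_mode(bool is_thp_fault)
{
    return is_thp_fault ? COMPACT_ASYNC : COMPACT_SYNC;
}

/*
 * Resolve a THP fault.
 *   huge_page_free:         a huge page is already available
 *   async_compaction_works: one can be built without sleeping on writeback
 *   *slept:                 set if the task would have blocked on I/O
 */
enum fault_result handle_thp_fault(bool huge_page_free,
                                   bool async_compaction_works,
                                   bool *slept)
{
    *slept = false;
    if (huge_page_free || async_compaction_works)
        return GOT_HUGE_PAGE;
    if (pick_compact_mode(true) == COMPACT_SYNC) {
        *slept = true;      /* the old behavior: wait on writeback */
        return GOT_HUGE_PAGE;
    }
    return GOT_BASE_PAGE;   /* fall back to an ordinary page right away */
}
```

The trade-off the objections focus on is visible in the last line: the task never sleeps, but it also leaves the fault with a small page whenever a huge page is not cheaply available.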
Andrew Morton was the first to object,
saying "Presumably some people would prefer to get lots of
huge pages for their 1000-hour compute job, and waiting a bit to get
those pages is acceptable." David Rientjes, presumably thinking of
Google's throughput-oriented tasks, said
that there are times when the latency is entirely acceptable, but that some
tasks really want to get huge pages at fault time. Mel's change makes it
that much less likely that processes will be allocated huge pages in
response to faults; David does not appear to see that as a good thing.
One could (and Mel did) respond that the transparent huge page mechanism
does not only work at fault time. The kernel will also try to replace
small pages with huge pages in the background while the process is running;
that mechanism should bring more huge pages into use - for longer-running
processes, at least - even if they are not available at fault time. In
cases where that is not enough, there has been talk of adding a new knob
to allow the system administrator to request that synchronous compaction be
used. The actual semantics of such a knob are not clear; one could argue
that if huge page allocations are that much more important than latency,
the system should perform more aggressive page reclaim as well.
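The knob under discussion does not exist, but kernels with transparent huge pages already export related settings as sysfs files such as /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag, where the active choice is shown in square brackets. As a hedged sketch of how an administrator's tool might inspect them (the helper names are made up for this example):

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) choice out of a line such as
 * "always [madvise] never". Returns 0 on success, -1 if there is
 * no bracketed word or it does not fit in out.
 */
int thp_active_choice(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close)
        return -1;
    len = (size_t)(close - open - 1);
    if (len + 1 > outlen)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read one THP sysfs file and report its active choice; -1 on failure. */
int thp_read_setting(const char *path, char *out, size_t outlen)
{
    char line[128];
    FILE *f = fopen(path, "r");
    int ret = -1;

    if (!f)
        return -1;  /* THP not built in, or sysfs not mounted */
    if (fgets(line, sizeof(line), f))
        ret = thp_active_choice(line, out, outlen);
    fclose(f);
    return ret;
}
```

Calling thp_read_setting() on the defrag file reports how allocation-time defragmentation is currently configured; whether a future synchronous-compaction knob would live alongside it is not something the discussion has settled.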
Andrea Arcangeli commented that he does not like how Mel's
change causes failures to use huge pages at fault time; he would rather
find a way to keep synchronous compaction from stalling instead. Some
ideas for doing that are being thrown around, but no solution has been
found as of this writing.
Such details can certainly be worked out over time. Meanwhile, if Mel's
patch turns out to be the best fix, the decision on merging should be clear
enough. Given a choice between (1) a system that continues to be
responsive during heavy I/O to slow devices and (2) random, lengthy
lockups in such situations, one might reasonably guess that most users
would choose the first alternative. Barring complications, one would
expect this patch to go into the mainline fairly soon, and possibly into
the stable tree shortly thereafter.
By Jake Edge
November 16, 2011
Back in June, we looked at a proposed
mechanism for adding aliases to device names, disks in particular. Since
then, the patch has been merged into the mainline, but some kernel
developers are not happy with that and have asked that it be reverted. Part of the
complaint is that the functionality adds to the kernel ABI, which will need
to be maintained "forever", but there are other solutions to the problem
that don't require kernel changes. So far, the patch has not been
reverted, but there is an underlying question: who gets to decide when and
where to extend the kernel's ABI?
The alias patch was authored by Nao Nishijima and came into the mainline
(for 3.2-rc1) by way of James Bottomley's SCSI tree. The patch allows
administrators to associate an alias name with a particular disk by
writing to the /sys/block/<disk>/alias sysfs file. That way, certain log
messages can be made using the user-supplied disk name rather than the raw
name of the disk, which may change on each boot.
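As an illustration of what using this interface would look like from user space, here is a minimal sketch that writes an alias to the file named above. It assumes a kernel carrying the alias patch; the helper names are invented for this example, and on a kernel without the patch the open simply fails.

```c
#include <stdio.h>

/* Build "/sys/block/<disk>/alias"; returns 0, or -1 if buf is too small. */
int alias_path(const char *disk, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "/sys/block/%s/alias", disk);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Write alias to the disk's alias file; returns 0 on success, -1 on error. */
int set_disk_alias(const char *disk, const char *alias)
{
    char path[256];
    FILE *f;
    int ok;

    if (alias_path(disk, path, sizeof(path)) < 0)
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;  /* no such disk, no alias support, or not root */
    ok = fputs(alias, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;     /* sysfs write errors can surface at close time */
    return ok ? 0 : -1;
}
```

A call such as set_disk_alias("sda", "backup-disk") (both names hypothetical) is all it takes, which is part of the concern: once user space starts depending on a file like this, the kernel has to keep providing it.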
Tejun Heo requested that the patch be reverted, noting that it "has been
nacked by people working on device driver core, block
layer and kernel-userland interface and shouldn't have been
upstreamed". That request was quickly acked by several people (Greg
Kroah-Hartman, Kay Sievers, Jens Axboe, and Jeff Garzik), with Axboe
explicitly noting that it should be done soon: "We need to
revert it before 3.2 rolls out, otherwise we are stuck with it."
As might be guessed, though, Bottomley disagreed that it should be reverted, saying
that it solved a real problem:
No, I can't agree with this ... if you propose a working alternative,
I'm listening, but in the absence of one, I think the hack fills a gap
we have in log analysis and tides us over until we have an agreed on
proper solution (at which point, I'm perfectly happy to pull the hack
back out).
Several folks pounced on the "hack" admission in Bottomley's note, but
both Kroah-Hartman and Sievers believe that there is no need for a
kernel-side solution at all. As Sievers put it:
The solution to this problem is to let udev log the known symlinks to
the log stream at device discovery time. That way you can log _all_
kernel device messages to the currently [known] disk names. This works
already even on old systems,
Furthermore, Kroah-Hartman pointed out that Nishijima recognizes that it
can be solved in user space: "Again, this is fixable in userspace, the
author of the patch agrees with that, yet refuses to make the userspace
changes despite having a few _years_ in which to do so". As with the
others commenting, Sievers is also concerned about adding to the
user-space interface: "Such hacks are not supposed to get in, and its
really hard to get them out again."
While the patch has not been reverted, Nishijima may be anticipating that
outcome with a post that looks at changes
to udev: "I understood why this patch is not acceptable and would like to
solve the problem of the device name mismatch in *user space* using
udev". Kroah-Hartman suggests posting udev patches that implement
the changes to the linux-hotplug
mailing list as a good starting point.
It would seem that Bottomley made something of an end-run
around the objections of various maintainers by pulling the change into his
tree. His reasons for doing so make sense, because there are
customers asking for the change, but it still routes around the usual
paths. Heo's request certainly indicates that he doesn't believe it came
in via the proper path, and Kroah-Hartman is blunt about that as well:
"Also, you should have gotten this through the block layer
maintainer...". It is a hack, as everyone seems to agree, but
it's a hack that leaves behind an ABI for the kernel to maintain
forevermore. It is not surprising that a number of core developers would
like to see it reverted.
By Jonathan Corbet
November 16, 2011
Direct memory access (DMA) I/O is a simple-sounding concept: devices are
able to access memory directly and transfer data without involving the
CPU. In practice, of course, it turns into a complex problem when
confronted with the real world and its strange architectural differences,
problematic devices, and varying I/O needs. The DMA mapping API was
created as a way to minimize the amount of DMA-related complexity that
drivers have to deal with, a goal it has achieved fairly well. Changing
needs and increasing hardware complexity are driving some changes in this
area, though, with the side benefit that the ARM architecture should get a
nice cleanup as well.
As is the case in many areas, the ARM architecture has its own
implementation of the DMA API, despite the fact that there is quite a bit
of architecture-independent code available to be used. The usual reasons
apply here: a combination of developers only working in the ARM tree
and peculiarities specific to that architecture. It is a pattern that has
been seen in many other places; it is certainly not specific to ARM.
One of the first things done by Marek Szyprowski's ARM DMA redesign patch set is to hook ARM into the
common DMA mapping framework. That enables the deletion of a certain
amount of duplicated code and its replacement with common code. Among
other things, this work simplifies the handling of differences within the
ARM architecture itself. Through the use of the common struct
dma_map_ops, an architecture can provide a set of mapping operations
specific to a given situation - different devices can have different DMA
operations, for example.
But there is more to ARM's DMA implementation than the common interface;
ARM's API has special functions like:
    void *dma_alloc_writecombine(struct device *dev, size_t len,
                                 dma_addr_t *dma_addr, gfp_t flags);
This function allocates a DMA buffer with "write combining" attributes,
meaning that data written to that memory (by the CPU) may be delayed by the
memory hardware and flushed out in batches. Use of write-combining memory can
yield significant performance improvements for some device types, but this
memory clearly has to be handled carefully so that deferred writes don't
get mixed up with accesses by the device. A number of drivers use this
function, but only one other architecture (avr32) provides it.
ARM also has special functions for mapping DMA buffers into user space:
    int dma_mmap_coherent(struct device *dev, struct vm_area_struct *vma,
                          void *cpu_addr, dma_addr_t dma_addr, size_t len);
On most architectures, memory-mapping a coherent buffer requires no special
handling, so the generic DMA code does not provide any special support for this
operation; only one
other architecture (PowerPC) has felt the need to add this function.
Clearly, bringing the ARM DMA API into line with common code will require
some way of handling these special functions. The fact that, for each of
the above functions, one other architecture has added an implementation
indicates that ARM, as strange as it is, is not alone in needing an
expanded API. So the logical thing to do is to move support for these
functions into the common DMA core implementation.
That could be done by adding new alloc_writecombine() and
mmap_coherent() functions (and, yes, mmap_writecombine()
too) to struct dma_map_ops. As the number of combinations of
operations and memory attributes grows, though, the size of that structure
will grow as well. Marek decided to take a different approach; his patch
removes the existing alloc_coherent() and free_coherent()
members, replacing them with:
    void *(*alloc)(struct device *dev, size_t size, dma_addr_t *dma_handle,
                   gfp_t gfp, struct dma_attrs *attrs);
    void (*free)(struct device *dev, size_t size, void *vaddr,
                 dma_addr_t dma_handle, struct dma_attrs *attrs);
    int (*mmap)(struct device *dev, struct vm_area_struct *vma, void *cpu_addr,
                dma_addr_t dma_addr, size_t size, struct dma_attrs *attrs);
As it happens, struct dma_attrs already exists in current kernels.
It is not heavily used, though; there are currently only two attributes defined
(described in Documentation/DMA-attributes.txt) that seem to
only be implemented in the ia64 and PowerPC/Cell architectures. Only one
of them (DMA_ATTR_WRITE_BARRIER) seems to actually be used, and in
only one place (the InfiniBand code). But the mechanism already exists, so
adding more attributes seems like a better approach than adding a new way
to express things like "write combining." Marek's patch adds the
convention that a null attrs pointer means "coherent," then adds
attributes for noncoherent and write-combining mappings. The various
allocation functions can then be replaced with:
    void *dma_alloc_attrs(struct device *dev, size_t size,
                          dma_addr_t *dma_handle, gfp_t flag,
                          struct dma_attrs *attrs);
This function can be used to request a mapping with any set of attributes
that the underlying platform may support; similar functions exist for
freeing and memory-mapping DMA buffers. Marek's patch does not extend this
functionality into other architectures - even those that have added
functions similar to those used by ARM - but that seems like an obvious
next step.
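The attribute-based design can be illustrated with a small user-space model. This is only a sketch of the idea: the toy_ types, the attribute names, and the use of malloc() in place of a real DMA allocation are all assumptions made for this example, and treating a non-null but empty attribute set as coherent is a choice made here, not something the patch is known to specify.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy attribute bits; the names are illustrative, not the kernel's. */
enum toy_dma_attr {
    TOY_ATTR_WRITE_COMBINE = 1u << 0,
    TOY_ATTR_NON_COHERENT  = 1u << 1
};

struct toy_dma_attrs {
    unsigned int flags;
};

/* Memory types the toy allocator can hand out. */
enum toy_mem_type {
    TOY_MEM_COHERENT,
    TOY_MEM_WRITE_COMBINE,
    TOY_MEM_NON_COHERENT
};

/* A NULL attrs pointer means "coherent", per the convention above. */
enum toy_mem_type toy_mem_type_for(const struct toy_dma_attrs *attrs)
{
    if (!attrs)
        return TOY_MEM_COHERENT;
    if (attrs->flags & TOY_ATTR_WRITE_COMBINE)
        return TOY_MEM_WRITE_COMBINE;
    if (attrs->flags & TOY_ATTR_NON_COHERENT)
        return TOY_MEM_NON_COHERENT;
    return TOY_MEM_COHERENT;  /* assumption: no attributes set */
}

/* One entry point for every memory type; malloc() stands in for the
 * real allocation, and *type_out reports what was requested. */
void *toy_dma_alloc_attrs(size_t size, const struct toy_dma_attrs *attrs,
                          enum toy_mem_type *type_out)
{
    *type_out = toy_mem_type_for(attrs);
    return malloc(size);
}

/* The old specialized calls become thin wrappers. */
void *toy_alloc_coherent(size_t size, enum toy_mem_type *type_out)
{
    return toy_dma_alloc_attrs(size, NULL, type_out);
}

void *toy_alloc_writecombine(size_t size, enum toy_mem_type *type_out)
{
    struct toy_dma_attrs attrs = { .flags = TOY_ATTR_WRITE_COMBINE };

    return toy_dma_alloc_attrs(size, &attrs, type_out);
}
```

The point of the design shows up in the wrappers: supporting a new kind of memory means adding an attribute bit, not another member of the operations structure.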
Once that is done, Marek can get to what was perhaps his real goal: adding
support for per-device I/O memory management units (IOMMUs) to the ARM DMA
API. Some hardware has a separate IOMMU built into it that cannot be used
for other devices, so the IOMMU cannot be made available to the system as a
whole. But it is possible to attach a device-specific dma_map_ops
structure to such devices that would cause the DMA API to use the IOMMU
without the device driver even needing to know about it. And that, of
course, leads to simpler and more reliable code.
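A rough user-space model shows why the driver does not need to know about the IOMMU. Everything here is hypothetical (the toy_ names, the single map operation, and the trivial address arithmetic standing in for programming real IOMMU page tables); only the idea of a per-device operations pointer comes from the text above.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_device;

/* Per-device mapping operations, in the spirit of struct dma_map_ops. */
struct toy_dma_ops {
    uint64_t (*map)(struct toy_device *dev, uint64_t phys);
};

struct toy_device {
    const struct toy_dma_ops *dma_ops;  /* NULL: use the platform default */
    uint64_t iommu_base;                /* start of this device's I/O window */
};

/* Default: the device sees physical addresses directly. */
uint64_t toy_direct_map(struct toy_device *dev, uint64_t phys)
{
    (void)dev;
    return phys;
}

/* Private IOMMU: hand out an address inside the device's own window.
 * Real code would set up IOMMU page tables here. */
uint64_t toy_iommu_map(struct toy_device *dev, uint64_t phys)
{
    return dev->iommu_base + (phys & 0xfffULL);
}

const struct toy_dma_ops toy_direct_ops = { .map = toy_direct_map };
const struct toy_dma_ops toy_iommu_ops  = { .map = toy_iommu_map };

/* What a driver calls; it never learns which operations are in use. */
uint64_t toy_dma_map(struct toy_device *dev, uint64_t phys)
{
    const struct toy_dma_ops *ops =
        dev->dma_ops ? dev->dma_ops : &toy_direct_ops;

    return ops->map(dev, phys);
}
```

Platform code that knows a device sits behind its own IOMMU would set dma_ops to toy_iommu_ops at setup time; the driver's call to toy_dma_map() is identical either way.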
Prior to this work, IOMMU awareness had been built into specific drivers
directly. But that caused opposition at review time; drivers written in
that way cannot really be merged into the mainline. When he talked about
this work at LinuxCon Prague, Marek passed on a few lessons that he had
learned from the experience. The first of those is that one should always
use existing APIs whenever possible. Every developer thinks they can do
something better; that may or may not be true, but using the common code
works out better in the long run. But, he said, developers should not be
afraid of extending core interfaces when the need arises. That is how
problems get solved and how the core gets better. The final lesson was
"expect it to take some time" when one has to solve problems of this
nature.
On the subject of time: it is not clear when this work might make it into
the mainline. It has not yet really been submitted for inclusion; the
current patches have some obvious work that needs to be done before they
are ready. But Marek, after a number of tries, appears to have gotten past
the serious technical objections and is now working on getting the details
right. So, while one should follow his advice and expect it to take some
time, the value of "some time" should be approaching a reasonably small
number.
Page editor: Jonathan Corbet