Kernel development

Brief items

Kernel release status

The current development kernel is 3.2-rc2, released on November 15. "And for being an -rc2 release of a pretty large merge-window, it seems to be quite reasonably sized. In fact, despite this having been the largest linux-next in a release in our linux-next history (I think), rc2 has the exact same number of commits since rc1 as we had during the 3.1 release." There are lots of fixes and a couple of ktest improvements.

Stable updates: the 3.0.9 and 3.1.1 stable kernels were released on November 11. Both contain a very long list of important fixes.

Quotes of the week

And what do pgfault/pgmajfault mean within memcg? I now fear to ask - given the pgpgin/pgpgout situation, these are probably related to my shampoo viscosity or something.
-- Andrew Morton

It's difficult to know for sure that this is the right thing to do - there's zero public documentation on the interaction between all of these components. But enough vendors enable ASPM on platforms and then set this bit that it seems likely that they're expecting the OS to leave them alone.

Measured to save around 5W on an idle Thinkpad X220.

-- Matthew Garrett maybe fixes a serious power problem

Kernel development news

Huge pages, slow drives, and long delays

By Jonathan Corbet
November 14, 2011
It is a rare event, but it is no fun when it strikes. Plug in a slow storage device - a USB stick or a music player, for example - and run something like rsync to move a lot of data to that device. The operation takes a while, which is unsurprising; more surprising is when random processes begin to stall. In the worst cases, the desktop can lock up for minutes at a time; that, needless to say, is not the kind of interactive response that most users are looking for. The problem can strike in seemingly arbitrary places; the web browser freezes, but a network audio stream continues to play without a hiccup. Everything unblocks eventually, but, by then, the user is on their third beer and contemplating the virtues of proprietary operating systems. One might be forgiven for thinking that the system should work a little better than that.

Numerous people have reported this sort of behavior in recent times; your editor has seen it as well. But it is hard to reproduce, which means it has been hard to track down. It is also entirely possible that there is more than one bug causing this kind of behavior. In any case, there should now be one less bug of this type if Mel Gorman's patch proves to be effective. But a few developers are wondering if, in some cases, the cure is worse than the disease.

The problem Mel found appears to go somewhat like this. A process (that web browser, say) is doing its job when it incurs a page fault. This is normal; the whole point of contemporary browsers sometimes seems to be to stress-test virtual memory management systems. The kernel will respond by grabbing a free page to slot into the process's address space. But, if the transparent huge pages feature is built into the kernel (and most distributors do enable this feature), the page fault handler will attempt to allocate a huge page instead. With luck, there will be a huge page just waiting for this occasion, but that is not always the case; in particular, if there is a process dirtying a lot of memory, there may be no huge pages available. That is when things start to go wrong.

Once upon a time, one just had to assume that, once the system had been running for a while, large chunks of physically-contiguous memory would simply not exist. Virtual memory management tends to fragment such chunks quickly. So it is a bad idea to assume that huge pages will just be sitting there waiting for a good home; the kernel has to take explicit action to cause those pages to exist. That action is compaction: moving pages around to defragment the free space and bring free huge pages into existence. Without compaction, features like transparent huge pages would simply not work in any useful way.

A lot of the compaction work is done in the background. But current kernels will also perform "synchronous compaction" when an attempt to allocate a huge page would fail due to lack of availability. The process attempting to perform that allocation gets put to work migrating pages in an attempt to create the huge page it is asking for. This operation is not free in the best of times, but it should not be causing multi-second (or multi-minute) stalls. That is where the USB stick comes in.

If a lot of data is being written to a slow storage device, memory will quickly be filled with dirty pages waiting to be written out. That, in itself, can be a problem, which is why the recently-merged I/O-less dirty throttling code tries hard to keep pages for any single device from taking too much memory. But writeback to a slow device plays poorly with compaction; the memory management code cannot migrate a page that is being written back until the I/O operation completes. When synchronous compaction encounters such a page, it will go to sleep waiting for the I/O on that page to complete. If the page is headed to a slow device, and it is far back on a queue of many such pages, that sleep can go on for a long time.
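The difference between sleeping on writeback and skipping it can be sketched with a small userspace model. This is not kernel code: the page states, the statistics structure, and the fixed per-page wait are all hypothetical, and the asynchronous mode is assumed to simply skip pages that are under writeback.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page states for this userspace model; not kernel types. */
enum model_page_state { MODEL_PAGE_IDLE, MODEL_PAGE_WRITEBACK };

struct model_compaction_stats {
    size_t migrated;          /* pages moved out of the target range */
    size_t skipped;           /* pages left behind (async mode only) */
    unsigned long waited_ms;  /* simulated time spent asleep on I/O */
};

/*
 * Walk a range of pages the way a compaction pass might. In sync mode
 * the caller "sleeps" (adds writeback_ms) for every page that is under
 * writeback before migrating it; in async mode such pages are skipped.
 */
struct model_compaction_stats
model_compact_range(const enum model_page_state *pages, size_t count,
                    bool sync, unsigned long writeback_ms)
{
    struct model_compaction_stats stats = { 0, 0, 0 };

    for (size_t i = 0; i < count; i++) {
        if (pages[i] == MODEL_PAGE_WRITEBACK) {
            if (!sync) {
                stats.skipped++;
                continue;
            }
            /* Assumed cost of waiting for a slow device. */
            stats.waited_ms += writeback_ms;
        }
        stats.migrated++;
    }
    return stats;
}
```

In this model the synchronous pass migrates every page but accumulates one wait per page under writeback, while the asynchronous pass never waits and gives up on those pages instead.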

One should not forget that producing a single huge page can involve migrating hundreds of ordinary pages. So once that long sleep completes, the job is far from done; the process stuck performing compaction may find itself at the back of the writeback queue quite a few times before it can finally get its page fault resolved. Only then will it be able to resume executing the code that the user actually wanted run - until the next page fault happens and the whole mess starts over again.
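To put a number on "hundreds": on a typical x86-64 configuration, base pages are 4 KiB and transparent huge pages are 2 MiB. Those sizes are an assumption about the platform, not something stated above, and other architectures differ. The helper below is a hypothetical illustration of the arithmetic.

```c
#include <stddef.h>

/*
 * Number of base pages that make up one huge page. Returns 0 if the
 * sizes are invalid (zero base size, or not an exact multiple).
 */
size_t base_pages_per_huge_page(size_t base_page_bytes, size_t huge_page_bytes)
{
    if (base_page_bytes == 0 || huge_page_bytes % base_page_bytes != 0)
        return 0;
    return huge_page_bytes / base_page_bytes;
}
```

With 4096-byte base pages and 2 MiB huge pages, the result is 512, so compaction may have to clear up to 512 page frames to assemble a single huge page.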

Mel's fix is a simple one-liner: if a process is attempting to allocate a transparent huge page, synchronous compaction should not be performed. In such a situation, Mel figured, it is far better to just give the process an ordinary page and let it continue running. The interesting thing is that not everybody seems to agree with him.
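The policy change can be modeled in a few lines of userspace C. This is a sketch of the behavior described above, not the patch itself: all names are hypothetical, and synchronous compaction is assumed to succeed eventually whenever it is allowed to run.

```c
#include <stdbool.h>

/* Hypothetical outcome of a page fault in this model. */
enum model_alloc_outcome { MODEL_GOT_HUGE_PAGE, MODEL_GOT_BASE_PAGE };

struct model_fault_result {
    enum model_alloc_outcome outcome;
    bool may_stall;  /* true if the faulting process may sleep on I/O */
};

/*
 * Resolve a fault that would like a transparent huge page.
 * skip_sync_for_thp selects the behavior: false models kernels that
 * fall back to synchronous compaction, true models the proposed fix.
 */
struct model_fault_result
model_thp_fault(bool huge_page_free, bool skip_sync_for_thp)
{
    struct model_fault_result result = { MODEL_GOT_HUGE_PAGE, false };

    if (huge_page_free)
        return result;  /* best case: a huge page was waiting */

    if (!skip_sync_for_thp) {
        /* Synchronous compaction: assumed to succeed, but may stall. */
        result.may_stall = true;
        return result;
    }

    /* Proposed behavior: hand out an ordinary page and keep running. */
    result.outcome = MODEL_GOT_BASE_PAGE;
    return result;
}
```

The trade-off debated in the following paragraphs is visible in the two fallback branches: one yields a huge page at the risk of a stall, the other never stalls but yields only a base page.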

Andrew Morton was the first to object, saying "Presumably some people would prefer to get lots of huge pages for their 1000-hour compute job, and waiting a bit to get those pages is acceptable." David Rientjes, presumably thinking of Google's throughput-oriented tasks, said that there are times when the latency is entirely acceptable, but that some tasks really want to get huge pages at fault time. Mel's change makes it that much less likely that processes will be allocated huge pages in response to faults; David does not appear to see that as a good thing.

One could (and Mel did) respond that the transparent huge page mechanism does not only work at fault time. The kernel will also try to replace small pages with huge pages in the background while the process is running; that mechanism should bring more huge pages into use - for longer-running processes, at least - even if they are not available at fault time. In cases where that is not enough, there has been talk of adding a new knob to allow the system administrator to request that synchronous compaction be used. The actual semantics of such a knob are not clear; one could argue that if huge page allocations are that much more important than latency, the system should perform more aggressive page reclaim as well.

Andrea Arcangeli commented that he does not like how Mel's change causes failures to use huge pages at fault time; he would rather find a way to keep synchronous compaction from stalling instead. Some ideas for doing that are being thrown around, but no solution has been found as of this writing.

Such details can certainly be worked out over time. Meanwhile, if Mel's patch turns out to be the best fix, the decision on merging should be clear enough. Given a choice between (1) a system that continues to be responsive during heavy I/O to slow devices and (2) random, lengthy lockups in such situations, one might reasonably guess that most users would choose the first alternative. Barring complications, one would expect this patch to go into the mainline fairly soon, and possibly into the stable tree shortly thereafter.

Reverting disk aliases?

By Jake Edge
November 16, 2011

Back in June, we looked at a proposed mechanism for adding aliases to device names, disks in particular. Since then, the patch has been merged into the mainline, but some kernel developers are not happy with that and have asked that it be reverted. Part of the complaint is that the functionality adds to the kernel ABI, which will need to be maintained "forever", even though there are other solutions to the problem that don't require kernel changes. So far, the patch has not been reverted, but there is an underlying question: who gets to decide when and where to extend the kernel's ABI?

The alias patch was authored by Nao Nishijima and came into the mainline (for 3.2-rc1) by way of James Bottomley's SCSI tree. The patch allows administrators to associate an alias name with a particular disk by writing to the /sys/block/<disk>/alias sysfs file. That way, certain log messages can use the user-supplied disk name rather than the raw name of the disk, which may change on each boot.
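From user space, setting an alias would amount to writing a string to that sysfs file. The helper below is a minimal sketch under stated assumptions: set_disk_alias() is a hypothetical name, the sysfs root is passed in so the code can be tried against an ordinary directory, and whether the attribute expects a trailing newline is not specified here, so none is written.

```c
#include <stdio.h>

/*
 * Write `alias` to <sysfs_block_dir>/<disk>/alias.
 * Returns 0 on success, -1 on any failure.
 * On a kernel carrying the patch, sysfs_block_dir would be "/sys/block".
 */
int set_disk_alias(const char *sysfs_block_dir, const char *disk,
                   const char *alias)
{
    char path[4096];  /* arbitrary limit chosen for this sketch */
    int len = snprintf(path, sizeof(path), "%s/%s/alias",
                       sysfs_block_dir, disk);

    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;  /* path did not fit */

    FILE *file = fopen(path, "w");
    if (!file)
        return -1;  /* no such disk, no such attribute, or no permission */

    int ok = fputs(alias, file) >= 0;
    if (fclose(file) != 0)
        ok = 0;  /* buffered write errors surface at close time */

    return ok ? 0 : -1;
}
```

A call such as set_disk_alias("/sys/block", "sda", "backup-disk") would fail cleanly on kernels without the alias attribute, since the file does not exist there.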

Tejun Heo requested that the patch be reverted, noting that it "has been nacked by people working on device driver core, block layer and kernel-userland interface and shouldn't have been upstreamed". That request was quickly acked by several people (Greg Kroah-Hartman, Kay Sievers, Jens Axboe, and Jeff Garzik), with Axboe explicitly noting that it should be done soon: "We need to revert it before 3.2 rolls out, otherwise we are stuck with it."

As might be guessed, though, Bottomley disagreed that it should be reverted, saying that it solved a real problem:

No, I can't agree with this ... if you propose a working alternative, I'm listening, but in the absence of one, I think the hack fills a gap we have in log analysis and tides us over until we have an agreed on proper solution (at which point, I'm perfectly happy to pull the hack back out).

Several folks pounced on the "hack" admission in Bottomley's note, but both Kroah-Hartman and Sievers believe that there is no need for a kernel-side solution at all. As Sievers put it:

The solution to this problem is to let udev log the known symlinks to the log stream at device discovery time. That way you can log _all_ kernel device messages to the currently [known] disk names. This works already even on old systems,

Furthermore, Kroah-Hartman pointed out that Nishijima recognizes that it can be solved in user space: "Again, this is fixable in userspace, the author of the patch agrees with that, yet refuses to make the userspace changes despite having a few _years_ in which to do so". Like the others commenting, Sievers is concerned about adding to the user-space interface: "Such hacks are not supposed to get in, and its really hard to get them out again."

While the patch has not been reverted, Nishijima may be anticipating that outcome with a post that looks at changes to udev: "I understood why this patch is not acceptable and would like to solve the problem of the device name mismatch in *user space* using udev". Kroah-Hartman suggests posting udev patches that implement the changes to the linux-hotplug mailing list as a good starting point.

It would seem that Bottomley made something of an end-run around the objections of various maintainers by pulling the change into his tree. His reasons for doing so make sense, because there are customers asking for the change, but it still routes around the usual paths. Heo's request certainly indicates that he doesn't believe it came in via the proper path, and Kroah-Hartman is blunt about that as well: "Also, you should have gotten this through the block layer maintainer...". It is a hack, as everyone seems to agree, but it is a hack that leaves behind an ABI for the kernel to maintain forevermore. It is not surprising that a number of core developers would like to see it reverted.

Reworking the DMA mapping code (especially on ARM)

By Jonathan Corbet
November 16, 2011
Direct memory access (DMA) I/O is a simple-sounding concept: devices are able to access memory directly and transfer data without involving the CPU. In practice, of course, it turns into a complex problem when confronted with the real world and its strange architectural differences, problematic devices, and varying I/O needs. The DMA mapping API was created as a way to minimize the amount of DMA-related complexity that drivers have to deal with, a goal it has achieved fairly well. Changing needs and increasing hardware complexity are driving some changes in this area, though, with the side benefit that the ARM architecture should get a nice cleanup as well.

As is the case in many areas, the ARM architecture has its own implementation of the DMA API, despite the fact that there is quite a bit of architecture-independent code available to be used. The usual reasons apply here: a combination of developers only working in the ARM tree and peculiarities specific to that architecture. It is a pattern that has been seen in many other places; it is certainly not specific to ARM.

One of the first things done by Marek Szyprowski's ARM DMA redesign patch set is to hook ARM into the common DMA mapping framework. That enables the deletion of a certain amount of duplicated code and its replacement with common code. Among other things, this work simplifies the handling of differences within the ARM architecture itself. Through the use of the common struct dma_map_ops, an architecture can provide a set of mapping operations specific to a given situation - different devices can have different DMA operations, for example.

But there is more to ARM's DMA implementation than the common interface; ARM's API has special functions like:

    void *dma_alloc_writecombine(struct device *dev, size_t len,
                                 dma_addr_t *dma_addr, gfp_t flags);

This function allocates a DMA buffer with "write combining" attributes, meaning that data written to that memory (by the CPU) may be delayed by the memory hardware and flushed out in batches. Use of write-combining memory can yield significant performance improvements for some device types, but this memory clearly has to be handled carefully so that deferred writes don't get mixed up with accesses by the device. A number of drivers use this function, but only one other architecture (avr32) provides it.

ARM also has special functions for mapping DMA buffers into user space:

    int dma_mmap_coherent(struct device *dev, struct vm_area_struct *vma,
                          void *cpu_addr, dma_addr_t dma_addr, size_t len);

On most architectures, memory-mapping a coherent buffer requires no special handling, so the generic DMA code does not provide any special support for this operation; only one other architecture (PowerPC) has felt the need to add this function.

Clearly, bringing the ARM DMA API into line with common code will require some way of handling these special functions. The fact that, for each of the above functions, one other architecture has added an implementation indicates that ARM, as strange as it is, is not alone in needing an expanded API. So the logical thing to do is to move support for these functions into the common DMA core implementation.

That could be done by adding new alloc_writecombine() and mmap_coherent() functions (and, yes, mmap_writecombine() too) to struct dma_map_ops. As the number of combinations of operations and memory attributes grows, though, the size of that structure will grow as well. Marek decided to take a different approach; his patch removes the existing alloc_coherent() and free_coherent() members, replacing them with:

    void *(*alloc)(struct device *dev, size_t size, dma_addr_t *dma_handle,
                   gfp_t gfp, struct dma_attrs *attrs);
    void (*free)(struct device *dev, size_t size, void *vaddr,
                 dma_addr_t dma_handle, struct dma_attrs *attrs);
    int (*mmap)(struct device *dev, struct vm_area_struct *vma, void *cpu_addr,
                dma_addr_t dma_addr, size_t size, struct dma_attrs *attrs);

As it happens, struct dma_attrs already exists in current kernels. It is not heavily used, though; there are currently only two attributes defined (described in Documentation/DMA-attributes.txt) that seem to only be implemented in the ia64 and PowerPC/Cell architectures. Only one of them (DMA_ATTR_WRITE_BARRIER) seems to actually be used, and in only one place (the InfiniBand code). But the mechanism already exists, so adding more attributes seems like a better approach than adding a new way to express things like "write combining." Marek's patch adds the convention that a null attrs pointer means "coherent," then adds attributes for noncoherent and write-combining mappings. The various allocation functions can then be replaced with:

    void *dma_alloc_attrs(struct device *dev, size_t size,
                          dma_addr_t *dma_handle, gfp_t flag,
                          struct dma_attrs *attrs);

This function can be used to request a mapping with any set of attributes that the underlying platform may support; similar functions exist for freeing and memory-mapping DMA buffers. Marek's patch does not extend this functionality into other architectures - even those that have added functions similar to those used by ARM - but that seems like an obvious next step.
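The way a single attribute-taking allocator can stand in for several special-purpose ones can be sketched in userspace. Everything below is a hypothetical model (the model_ names, the flag layout, and the use of malloc() in place of real DMA memory); it only illustrates the convention that a null attrs pointer means "coherent" and that the older entry points can become thin wrappers.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical attribute bits; the real names and layout may differ. */
enum model_dma_attr { MODEL_ATTR_WRITE_COMBINE, MODEL_ATTR_NON_COHERENT };

struct model_dma_attrs { unsigned long flags; };

enum model_mapping_kind {
    MODEL_MAP_COHERENT, MODEL_MAP_WRITECOMBINE, MODEL_MAP_NONCOHERENT
};

struct model_buffer {
    void *cpu_addr;
    size_t size;
    enum model_mapping_kind kind;
};

static void model_dma_set_attr(enum model_dma_attr attr,
                               struct model_dma_attrs *attrs)
{
    attrs->flags |= 1UL << attr;
}

/* A null pointer (or, by assumption, an empty set) means "coherent". */
enum model_mapping_kind model_mapping_kind(const struct model_dma_attrs *attrs)
{
    if (!attrs)
        return MODEL_MAP_COHERENT;
    if (attrs->flags & (1UL << MODEL_ATTR_WRITE_COMBINE))
        return MODEL_MAP_WRITECOMBINE;
    if (attrs->flags & (1UL << MODEL_ATTR_NON_COHERENT))
        return MODEL_MAP_NONCOHERENT;
    return MODEL_MAP_COHERENT;
}

/* The one general allocator; returns 0 on success, -1 on failure. */
int model_dma_alloc_attrs(struct model_buffer *buf, size_t size,
                          const struct model_dma_attrs *attrs)
{
    buf->cpu_addr = malloc(size);  /* stand-in for real DMA memory */
    if (!buf->cpu_addr)
        return -1;
    buf->size = size;
    buf->kind = model_mapping_kind(attrs);
    return 0;
}

/* The older special-purpose calls reduce to wrappers. */
int model_dma_alloc_coherent(struct model_buffer *buf, size_t size)
{
    return model_dma_alloc_attrs(buf, size, NULL);
}

int model_dma_alloc_writecombine(struct model_buffer *buf, size_t size)
{
    struct model_dma_attrs attrs = { 0 };

    model_dma_set_attr(MODEL_ATTR_WRITE_COMBINE, &attrs);
    return model_dma_alloc_attrs(buf, size, &attrs);
}

void model_dma_free(struct model_buffer *buf)
{
    free(buf->cpu_addr);
    buf->cpu_addr = NULL;
}
```

The appeal of this shape is that a new memory type costs one more attribute bit rather than another set of function pointers in the operations structure.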

Once that is done, Marek can get to what was perhaps his real goal: adding support for per-device I/O memory management units (IOMMUs) to the ARM DMA API. Some hardware has a separate IOMMU built into it that cannot be used for other devices, so the IOMMU cannot be made available to the system as a whole. But it is possible to attach a device-specific dma_map_ops structure to such devices that would cause the DMA API to use the IOMMU without the device driver even needing to know about it. And that, of course, leads to simpler and more reliable code.
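A userspace sketch of per-device operations shows why the driver need not know about the IOMMU. The types and functions here are hypothetical stand-ins for the kernel's, and the "IOMMU" mapping is a toy that places the buffer's page offset inside a fixed I/O window.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t model_dma_addr_t;  /* stand-in for the kernel's dma_addr_t */

struct model_device;

/* Cut-down, hypothetical analogue of struct dma_map_ops. */
struct model_dma_map_ops {
    model_dma_addr_t (*map_single)(struct model_device *dev,
                                   void *cpu_addr, size_t size);
};

struct model_device {
    const struct model_dma_map_ops *dma_ops;  /* NULL: use the default */
    model_dma_addr_t iommu_window;            /* used by the IOMMU ops only */
};

/* Default: the device sees the same address as the CPU. */
static model_dma_addr_t model_direct_map(struct model_device *dev,
                                         void *cpu_addr, size_t size)
{
    (void)dev;
    (void)size;
    return (model_dma_addr_t)(uintptr_t)cpu_addr;
}

/* Toy IOMMU: expose the buffer inside the device's private I/O window. */
static model_dma_addr_t model_iommu_map(struct model_device *dev,
                                        void *cpu_addr, size_t size)
{
    (void)size;
    return dev->iommu_window + ((uintptr_t)cpu_addr & 0xfffu);
}

const struct model_dma_map_ops model_direct_ops = {
    .map_single = model_direct_map,
};
const struct model_dma_map_ops model_iommu_ops = {
    .map_single = model_iommu_map,
};

static const struct model_dma_map_ops *
model_get_dma_ops(const struct model_device *dev)
{
    return dev->dma_ops ? dev->dma_ops : &model_direct_ops;
}

/* What a driver calls; it never learns which implementation ran. */
model_dma_addr_t model_dma_map_single(struct model_device *dev,
                                      void *cpu_addr, size_t size)
{
    return model_get_dma_ops(dev)->map_single(dev, cpu_addr, size);
}
```

Platform setup code would attach model_iommu_ops to the devices that sit behind their own IOMMU; driver code calling model_dma_map_single() is identical in both cases.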

Prior to this work, IOMMU awareness had been built into specific drivers directly. But that caused opposition at review time; drivers written in that way cannot really be merged into the mainline. When he talked about this work at LinuxCon Prague, Marek passed on a few lessons that he had learned from the experience. The first of those is that one should always use existing APIs whenever possible. Every developer thinks they can do something better; that may or may not be true, but using the common code works out better in the long run. But, he said, developers should not be afraid of extending core interfaces when the need arises. That is how problems get solved and how the core gets better. The final lesson was "expect it to take some time" when one has to solve problems of this nature.

On the subject of time: it is not clear when this work might make it into the mainline. It has not yet really been submitted for inclusion; the current patches have some obvious work that needs to be done before they are ready. But Marek, after a number of tries, appears to have gotten past the serious technical objections and is now working on getting the details right. So, while one should follow his advice and expect it to take some time, the value of "some time" should be approaching a reasonably small number.

Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds