User: Password:
Subscribe / Log in / New account

Kernel development

Brief items

Kernel release status

The current development kernel remains 2.6.35-rc5; no new -rc releases have been made over the last week. The stream of fixes into the mainline continues, though; there has also been a renaming of the logical memory block (LMB) code to "memblock." The -rc6 release can probably be expected shortly after LWN's publication time.

Stable updates: there have been no stable updates over the last week.

Comments (none posted)

Quote of the week

With Rusty's patch for the modules bug, and reverting Greg Vandal-Hartman's "driver core: remove CONFIG_SYSFS_DEPRECATED" and deleting the BUG_ON from generic_delete_inode(), I have a login prompt! Admittedly I don't have any networking any more, but that seems a minor quibble.
-- Andrew Morton

Comments (1 posted)

The ghost of sysfs past

By Jonathan Corbet
July 21, 2010
A few years back, it seemed that incompatible sysfs changes created broken systems on a regular basis. Since then, though, things have gotten better, with no reports of broken systems or forced udev upgrades for a while. That improvement is the result of a deliberate effort on the part of the sysfs hackers to stabilize things and to establish best practices for the use of sysfs-exported information. As some linux-next testers are currently finding out, though, the legacy of older sysfs problems has not entirely faded away yet.

The CONFIG_SYSFS_DEPRECATED configuration option exists as one way of mitigating the effects of a major sysfs change. In the early days of sysfs, devices tended to pop up in strange places, including, especially, under /sys/class. In order to bring more consistency to the filesystem, the layout was reorganized to move more device information into /sys/devices, create the /sys/block directory, and more. Needless to say, any such change would be fatal for systems which expected the old layout, so the configuration option was added to restore that old layout when needed.

In 2010, nobody has shipped a distribution which relies on the old layout for some time. So Greg Kroah-Hartman has posted a patch to remove the configuration option and the significant amount of code needed to support it; that patch has also gone into linux-next. Greg notes: "This is no longer needed by any userspace tools, so it's safe to remove."

Except that maybe it's not safe to remove. Andrew Morton quickly reported that his Fedora Core 6 box would not boot without this option. Andrew is well known for running archaic distributions just for the purpose of finding this kind of compatibility issue; one might argue that there probably are not that many other FC6 boxes in use, and even fewer which will be wanting to run 2.6.35 kernels. But, as Dave Airlie noted, RHEL5 boxes will also fail to boot, and there are rather more of those in operation.

Dave's advice was blunt: "Live with your mistakes guys, don't try and bury them." He knows as well as anybody what the cost of living with mistakes is: the graphics ABIs include a few of their own. Mistakes will happen, but, when they become part of the user-space ABI, they can be difficult to get away from. That is why ABI additions tend to come under high levels of scrutiny: once somebody depends on them, they must be supported indefinitely.

Comments (19 posted)

Kernel development news

Fixing writeback from direct reclaim

By Jonathan Corbet
July 20, 2010
"Writeback" is the process of writing the contents of dirty memory pages back to their backing store, where that backing store is normally a file or swap area. Proper handling of writeback is crucial for both system performance and data integrity. If writeback falls too far behind the dirtying of pages, it could leave the system with severe memory pressure problems. Having lots of dirty data in memory also increases the amount of data which may be lost in the event of a system crash. Overly enthusiastic writeback, on the other hand, can lead to excessive I/O bandwidth usage, and poorly-planned writeback can greatly reduce I/O performance with excessive disk seeks. Like many memory-management tasks, getting writeback right is a tricky exercise involving compromises and heuristics.

Back in April, LWN looked at a specific writeback problem: quite a bit of writeback activity was happening in direct reclaim. Normally, memory pages are reclaimed (made available for new uses, with data written back, if necessary) in the background; when all goes well, there will always be a list of free pages available when memory is needed. If, however, a memory allocation request cannot be satisfied from the free list, the kernel may try to reclaim pages directly in the context of the process performing the allocation. Diverting an allocation request into this kind of cleanup activity is called "direct reclaim."

Direct reclaim normally works, and it is a good way to throttle memory-hungry processes, but it also suffers from a couple of significant problems. One of those is stack overflows; direct reclaim can happen from almost anywhere in the kernel, so it may be that the kernel stack is already mostly used before the reclaim process even starts. But if reclaim involves writing file pages back, it can be just the beginning of a long call chain in its own right, leading to the overflow of the kernel stack. Beyond that, direct reclaim, which reclaims pages wherever it can find them, tends to create seek-intensive I/O, hurting the whole system's I/O performance.

Both of these problems have been seen on production systems. In response, a number of filesystems have been changed so that they simply ignore writeback requests which come from the direct reclaim code. That makes the problem go away, but it is a kind of papering-over that pleases nobody; it also arguably increases the risk that the system could go into the dreaded out-of-memory state.

Mel Gorman has been working on the reclaim problem, on and off, for a few months now. His latest patch set will, with luck, improve the situation. The actual changes made are relatively small, but they apparently tweak things in the right direction.

The key to solving a problem is understanding it. So, perhaps, it's not surprising that the bulk of the changes do not actually affect writeback; they are, instead, tracing instrumentation and tools which provide information on what the reclaim code is actually doing. The new tracepoints provide visibility into the nature of the problem and, importantly, how much each specific change helps.

The core change is deep within the direct reclaim loop. If direct reclaim stumbles across a page which is dirty, it now must think a bit harder about what to do with it. If the dirty page is an anonymous (process data) page, writeback happens as before. The reasoning here seems to be that the writeback path for these pages (which will be going to a swap area) will be simpler than it is for file-backed pages; there are also fewer opportunities for anonymous pages to be written back via other paths. As a result, anonymous writeback might still create seek problems - but only if the swap area shares a spindle with other, high-bandwidth data.

For dirty, file-backed pages, the situation is a little different; direct reclaim will no longer try to write back those pages directly. Instead, it creates a list of the dirty pages it encounters, then hands them over to the appropriate background process for the real writeback work. In some cases (such as when lumpy reclaim is trying to free specific larger chunks of memory), the direct reclaim code will wait in the hope that the identified pages will soon become free. The rest of the time, it simply moves on, trying to find free pages elsewhere.

Handing the reclaim work over to the threads which exist for that task has a couple of benefits. It is, in effect, a simple way of switching to another kernel stack - one which is known to be mostly empty - before heading into the writeback paths. Switching stacks directly in the direct reclaim code had been discussed, but it was decided that the mechanism the kernel already has for switching stacks (context switches) was probably the right thing to use in this situation. Keeping the writeback work in kswapd and the per-BDI writeback threads should also help performance, since those threads try to order operations to minimize head seeks.

When this problem was discussed in April, Andrew Morton pointed out that, over time, the amount of memory written back in direct reclaim has grown significantly, with an adverse effect on system performance. He wanted to see thought put into why that change has happened rather than trying to mitigate its effects. The final patch in Mel's series looks like an attempt to address this concern. It changes the direct reclaim code so that, if that code starts encountering dirty pages, it pokes the writeback threads and tells them to start cleaning pages more aggressively. The idea here is to keep the normal reclaim mechanisms running at a fast-enough pace that direct reclaim is not needed so often.

This tweak seems to have a significant effect on some benchmarks; Mel says:

Apparently, background flush must have been doing a better job getting [pages] cleaned in time and the direct reclaim stalls are harmful overall. Waking background threads for dirty pages made a very large difference to the number of pages written back. With all patches applied, just 759 filesystem pages were written back in comparison to 581811 in the vanilla kernel and overall the number of pages scanned was reduced.

Anybody who likes digging through benchmark results is advised to look at Mel's patch posting - he appears to have run just about every test that he could to quantify the effects of this patch series. This kind of extensive benchmarking makes sense for deep memory management changes, since even small changes can have surprising results on specific workloads. At this point, it seems that the changes have the desired effect and most of the concerns expressed with previous versions have been addressed. The writeback changes, perhaps, are getting ready for production use.

Comments (9 posted)

Adding periods to SCHED_DEADLINE

By Jonathan Corbet
July 20, 2010
The Linux scheduler, in both the mainline and realtime versions, provides a couple of scheduling classes for realtime tasks. These classes implement the classic POSIX priority-based semantics, wherein the highest-priority runnable task is guaranteed to have access to the CPU. While this scheduler works as advertised, priority-based scheduling has a number of problems and has not been the focus of realtime research for some time. Cool schedulers in this century are based on deadlines instead. Linux does not yet have a deadline scheduler, though there is one in the works. A recent discussion on implementing the full deadline model has shown, once again, just how complex it can be to get deadline scheduling right in the real world.

Deadline scheduling does away with priorities, replacing them with a three-parameter tuple: a worst-case execution time (or budget), a deadline, and a period. In essence, a process tells the scheduler that it will require up to a certain amount of CPU time (the budget) by the given deadline, and that the deadline optionally repeats with the given period. So, for example, a video-processing application might request 1ms of CPU time to process the next incoming frame, expected in 10ms, with a 33ms period thereafter for subsequent frames. Deadline scheduling is appealing because it allows the specification of a process's requirements in a natural way which is not affected by any other processes running in the system. There is also great value, though, in using the deadline parameters to guarantee that a process will be able to meet its deadline, and to reject any process which might cause a failure to keep those guarantees.

The SCHED_DEADLINE scheduler being developed by Dario Faggioli appears to be on track for an eventual mainline merger, though nobody, yet, has been so foolish to set a deadline for that particular task. This scheduler works, but, thus far, it takes a bit of a shortcut: in SCHED_DEADLINE, the deadline and the period are assumed to be the same. This simplification makes the "admission test" - the decision as to whether to accept a new SCHED_DEADLINE task - relatively easy. Each process gets a "bandwidth" parameter, being the ratio of the CPU budget to the deadline/period value. As long as the sum of the bandwidth values for all processes on a given CPU does not exceed 1.0, the scheduler can guarantee that the deadlines will be met.

As Dario recently brought up on linux-kernel, there are users who would like to be able to specify separate deadline and period values. Adjusting the scheduler to implement those semantics is not particularly hard. Coming up with an admission test which insures that deadlines can still be met is rather harder, though. Once the period and the deadline are separated from each other, it becomes possible for processes to miss their deadlines even if the total bandwidth of the CPU has not been oversubscribed.

To see how this might come about, consider an example posted by Dario. Imagine a process which needs 4ms of CPU time with a period of 10ms and a deadline of 5ms. A timeline of how that process might be scheduled could look like this:

[Scheduling timeline]

Here, the scheduler is able to run the process within its deadline; indeed, there is 1ms of time to spare. Now, though, if a second process comes in with the same set of requirements, the results will not be as good:

[Scheduling timeline]

Despite the fact that 20% of the available CPU time remains unused, process P2 will miss its deadline by 3ms. In a hard realtime situation, that tardiness could prove fatal. Clearly, the scheduler should reject P2 in this situation. The problem is that detecting this kind of problem is not easy, especially if the scheduler is (as seems reasonable) expected to leave some CPU time for applications and not use it all performing complex admission calculations. For this reason, admission decision algorithms are currently an area of considerable research interest in the academic realtime community. See this paper by Alejandro Masrur et al. [PDF] or this one by Marko Bertogna [PDF] for examples of how involved it can get.

There are a couple of ways to simplify the problem. One of those would be to change the bandwidth calculation to be the ratio of the CPU budget to the relative deadline time (rather than to the period). For the example processes shown above, each has a bandwidth of 0.8 using this formula; the scheduler, on seeing that the second process would bump the total to 1.6, could then decide to reject it. Using this calculation, the scheduler can, once again, guarantee that deadlines will be met, but at a cost: this formula will cause the rejection of processes that, in reality, could be scheduled without trouble. It is an overly pessimistic heuristic which will prevent full utilization of the available resources.

An alternative, proposed by Dario, would be to stick with the period-based bandwidth values for admission decisions and to take the risk that some deadlines might not be met. In this case, user-space developers would be responsible for ensuring that the full set of tasks on the system can be scheduled. They might take some comfort in the fact that, since the overall bandwidth of the CPU would still not be oversubscribed, the amount by which a deadline could be missed would be deterministically bounded.

That idea did not survive its encounter with Peter Zijlstra, who thinks it ruins everything that a deadline scheduler is supposed to provide:

The whole reason SCHED_FIFO and friends suck so much is that they don't provide any kind of isolation, and thus, as an Operating-System abstraction they're an utter failure. If you take out admission control you end up with a similar situation.

Peter's suggestion, instead, was to split deadline scheduling logically into two different schedulers. The hard realtime scheduler would retain the highest priority, and would require that the deadline and period values be the same. If, at some future time, a suitable admission controller is developed then that requirement could be relaxed as long as deadlines could still be guaranteed.

Below the hard realtime scheduler would be a soft realtime scheduler which would have access to (most of) the CPU bandwidth left unused by the hard scheduler. That scheduler could accept processes using period-based bandwidth values with the explicit understanding that deadlines might be missed by bounded amounts. Soft realtime is entirely good enough for a great many applications, so there is no real reason not to provide it as long as hard realtime is not adversely affected.

So that is probably how things will go, though the real shape of the solution will not be seen until Dario posts the next version of the SCHED_DEADLINE patch. Even after this problem is solved, though, deadline scheduling has a number of other challenges to overcome, with good multi-core performance being near the top of the list. So, while Linux will almost certainly have a deadline scheduler at some point, it's still hard to say just when that might be.

(Readers who are interested in the intersection of academic and practical realtime work might want to peruse the recently-released proceedings [PDF] from the OSPERT 2010 conference, held in Brussels in early July.)

Comments (21 posted)

Contiguous memory allocation for drivers

By Jonathan Corbet
July 21, 2010
Allocation of physically-contiguous memory buffers is required by many device drivers, but it cannot always be reliably done on long-running Linux systems. That leads to all kinds of unsatisfying workarounds in driver code. The contiguous memory allocator patches recently posted by Michal Nazarewicz are an attempt to solve this problem in a consistent way for all drivers.

A few years ago, when your editor was writing the camera driver for the original OLPC XO system, a problem turned up. The video acquisition hardware on the system was capable of copying video frames into memory via DMA operations, but only to physically contiguous buffers. There was, in other words, no scatter/gather DMA capability built into this (cheap) DMA engine. A choice was thus forced: either allocate memory for video acquisition at boot time, or attempt to allocate it on the fly when the camera is actually used. The former choice is reliable, but it has the disadvantage of leaving a significant chunk of memory idle (on a memory-constrained system) whenever the camera is not in use - most of the time on most systems. The latter choice does not waste memory, but is unreliable - large, contiguous allocations are increasingly hard to do as memory gets fragmented. In the OLPC case, the decision was to sacrifice the memory to ensure that the camera would always work.

This particular problem has been faced many times by many developers over the years; each driver author has tended to go with whatever ad hoc solution seems to make sense at the time. For some years, the "bigphysarea" patch was available to help, but that patch was never put into the mainline and has not seen any maintenance for some time. So the problem remains unsolved in any sort of general sense.

The contiguous memory allocation (CMA) patches are an attempt to put together a flexible solution which can be used in all drivers. The basic technique will be familiar: CMA grabs a chunk of contiguous physical memory at boot time (when it's plentiful), then doles it out to drivers in response to allocation requests. Where it differs is mainly in an elaborate mechanism for defining the memory region(s) to reserve and the policies for handing them out.

A system using CMA will always need to have at least one boot-time parameter describing the memory region(s) to use and the policy for allocating from those regions. The syntax used is rather complex, to the point that a large portion of the patch is made up of parsing code; see the included Documentation/cma.txt file for the full details. A simple example of a CMA command-line option would be something like:

    cma=c=10M cma_map=camera=c

This defines a 10MB region (called "c") and states that allocation requests from the camera device should be satisfied from this region. Multiple regions can be defined, each with its own size, alignment constraints, and allocation algorithm, and memory regions can be split into different "kinds" as well. The "kinds" feature might be used to separate large and small allocations, or to put different buffers into different DMA zones or NUMA nodes. The more complex command lines are reminiscent of regular expressions, but with less readability. The purpose behind this complexity is to enable a great deal of flexibility in how memory is handled without the need to change the drivers which are working with that memory. Whether that flexibility is worth the cost is not (to your editor, at least) entirely clear.

A driver can actually allocate a memory chunk with:

    #include <linux/cma.h>

    unsigned long cma_alloc(const struct device *dev, const char *kind,
	     		    unsigned long size, unsigned long alignment);

If all goes well, the return value will be the physical address of the allocated memory region.

For reasons which are not entirely clear, buffers allocated with CMA have a reference count associated with them. So two functions are provided to manipulate that count:

    int cma_get(unsigned long addr);
    int cma_put(unsigned long addr);

Since reference counting is used, there is no cma_free() function; instead, the memory chunk is passed to cma_put() and freed internally when the reference count goes to zero.

CMA comes with a best-fit allocator, but it is designed to work with multiple internal allocators. So, should there be a need to use a different allocation algorithm, it's a really straightforward matter to add it to the system. Naturally enough, the command-line syntax offers a way to specify which allocator should be used for each region.

In summary: CMA offers a solution to a problem which driver authors have been dealing with for some years. Your editor suspects, though, that it will require some changes before a mainline merge can be contemplated. The complexity of the solution is probably more than is really called for in this situation, and the whole thing might benefit from some integration with the DMA mapping infrastructure. But, someday, it would be nice to incorporate a solution to the large-buffer problem that all drivers can use.

Comments (15 posted)

Patches and updates


Core kernel code

Device drivers

Filesystems and block I/O

Memory management



Virtualization and containers


Page editor: Jonathan Corbet
Next page: Distributions>>

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds