
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.31-rc2, released on July 4. It contains a long list of bug fixes. The short-form changelog is in the announcement, or see the long-form changelog for all the details.

The current stable kernel is 2.6.30.1, released on July 2. It contains a long list of fixes (147 files changed) for serious problems in a number of areas. The 2.6.29.6 and 2.6.27.26 updates came out at the same time; 2.6.29.6 is the final stable update for the 2.6.29 series.

Comments (none posted)

Kernel development news

Quotes of the week (VFAT special)

We put in workarounds for broken hardware, even though we rightfully dislike it. Similarly, in this case a minor change prevents a real problem experienced by our users (yes, I know we only have one public bug report).

It's bad. Yes, let's bitch about it. And workaround it anyway.

-- Rusty Russell

When the vendor of an operating system doesn't even bother to display a clean "sorry, you don't get updates any more" page for their OS then I think it is safe to say that the operating system is dead and buried.
-- Andrew Tridgell

How is shipping something labelled as vfat that doesn't work properly "fighting back". It's "trying to make the users think our code is crap".
-- Alan Cox

Comments (1 posted)

In brief

By Jonathan Corbet
July 8, 2009
VFAT. The VFAT patent workaround discussion continues, focused primarily on two points. The first is whether it is appropriate to put patent workarounds into the kernel. The main opposition comes from Alan Cox, who asserts that individual companies need to find ways to navigate around country-specific legal landmines; workarounds for these problems should not find their way into the upstream code. Besides, Alan claims that vendors who are worried about patent suits will not be satisfied with a kernel configuration option; they will hack out the code entirely.

Meanwhile, Andrew Tridgell continues to push for the use of workarounds - a strategy which has worked well for Samba. Tridge says:

The only weapon we have to fight patents right now is our collective ability to find smart solutions to problems. The "every vendor for themselves" approach that you seem to be so keen on makes that collective approach pretty much impossible.

How this side of things will play out remains to be seen. Meanwhile, there has also been an extensive technical discussion focused on interoperability problems caused by the patch. Some of these problems, as explained by Tridge, are really the result of various existing VFAT mount options; in some cases, even without the patch under discussion, Linux will create long name entries for names which appear to fit the 8.3 format. This kind of interoperability problem has existed for quite some time.

Mount options do not explain all of the problems, though. It seems that there are some music players out there which understand long names, but which still require that valid 8.3 names exist. There are also difficulties with Windows98 which may (or may not) be resolved by changes in how the patch fills the 8.3 information. Tridge suggests that Windows98 is old enough to not be worth supporting, but not all agree on that.

The checkpatch police. Alan Cox recently proposed the addition of a new event interface for the virtual terminal driver. Ingo Molnar responded with a list of errors from the checkpatch.pl script, requesting that they be fixed. Alan's reply was:

Given the code doesn't currently *work* yet but is a first draft its hardly important and there are better uses of time than playing checkpatch policeman.

What followed was an extensive discussion on the value of checkpatch.pl, whether code should be checkpatch-clean even before first submission, whether coding style problems should be fixed piece-by-piece as other work is done, and so on. Ingo would like to see coding style issues dealt with early on; among other things, consistent coding style makes reviewing the code much easier. Alan sees that sort of cleanup as a distraction, to be done after more substantial issues (which are not lacking in the TTY code) have been dealt with.

Consensus was not to be had in this discussion; expect this to be one of those themes which returns regularly to linux-kernel.

Comments (10 posted)

Some ado about zero

By Jonathan Corbet
July 7, 2009
Computers use a lot of zeroes. Early in your editor's programming career, he worked on a machine that provided a special hardware register containing zero; programmers on this system knew they could use all the zeroes they needed with no fear of running out. Meanwhile, in this century, the Linux kernel sets aside a page full of zeros. It's called empty_zero_page on the x86 architecture, and it's even exported to modules. Interestingly, this special page is not used as heavily as it was prior to the 2.6.24 kernel, but that may be about to change.

In the good old days, the kernel would use the zero page in situations where it knew it needed a page full of zeroes. So, for example, if a process incurred a read fault on a page it had never used, the kernel would simply map the zero page into that address. A copy-on-write mapping would be used, of course; if the process subsequently modified the page, it would end up with its own copy. But deferring the creation of a new, zero-filled page helped to conserve zeroes, keeping the kernel from running out. Incidentally, it also saved memory, reduced cache pressure, and eliminated the need to clear the new page.

Memory management changes made back in 2007 had the effect of adding reference counting to the zero page. And that turned out to be a problem on multiprocessor machines. Since all processors shared the same zero page (per-CPU differences being unlikely), they also all manipulated the same reference count. That led to serious problems with cache line bouncing, with a measurable performance impact. In response, Nick Piggin evaluated a number of possible fixes, including special hacks to avoid reference-counting the zero page or adding per-CPU zero pages. The patch that got merged, though, simply eliminated most use of the zero page altogether. The change was justified this way:

Inserting a ZERO_PAGE for anonymous read faults appears to be a false optimisation: if an application is performance critical, it would not be doing many read faults of new memory, or at least it could be expected to write to that memory soon afterwards. If cache or memory use is critical, it should not be working with a significant number of ZERO_PAGEs anyway (a more compact representation of zeroes should be used).

There was some nervousness about the patch at the time; Linus grumbled about the changes which created the problem in the first place, and worried:

The kernel has *always* (since pretty much day 1) done that ZERO_PAGE thing. This means that I would not be at all surprised if some application basically depends on it. I've written test-programs that depends on it - maybe people have written other code that basically has been written for and tested with a kernel that has basically always made read-only zero pages extra cheap.

Despite his misgivings, Linus merged the patch for 2.6.24 to see what sort of problems might come to the surface. For the next 18 months, it appeared that such problems were scarce indeed; most people forgot about the zero page altogether. In early June, though, Julian Phillips reported a problem he had observed:

I have a program which creates a reasonably large private anonymous map. The program then writes into a few places in the map, but ends up reading from all of them.

When I run this program on a system running 2.6.20.7 the process only ever seems to use enough memory to hold the data that has actually been written (well - in units of PAGE_SIZE). When I run the program on a system running 2.6.24.5 then as it reads the map the amount of memory used continues to increase until the complete map has actually been allocated (and since the total size is greater than the physically available RAM causes swapping). Basically I seem to be seeing copy-on-read instead of copy-on-write type behaviour.

What Julian was seeing, of course, was the effects from the removal of the zero page. On older kernels, all of the unwritten pages in the data structure would be mapped to the zero page, using no additional physical memory at all. As of 2.6.24, each of those pages gets an actual physical page - containing nothing but zeroes - assigned to it, increasing memory use significantly.

Hiroyuki Kamezawa reports that he has seen zero-page-dependent workloads at other sites. Many of those sites, he says, are running enterprise Linux distributions which have not, yet, shipped kernels new enough to lack zero page support. He worries that these users will encounter the same sort of unpleasant surprise Julian found when they upgrade to newer kernels. In response, he has posted a patch which restores zero page support to the kernel.

Hiroyuki's zero page support isn't quite the same as what came before, though. It avoids reference counting for the zero page, a change which should eliminate the worst of the performance problems. It does add some interesting special cases, however, where virtual memory code has to be careful to test for zero pages; the bulk of those cases are handled with the addition of a get_user_pages_nonzero() function which removes any zero pages from the indicated range. Linus dislikes the special cases, thinking that they are unnecessary. Instead, he has proposed an alternative implementation using the relatively new PTE_SPECIAL flag to mark zero pages. As of this writing, an updated version of the patch using this approach has not yet been posted.

Nick Piggin, who wrote the patch removing zero page support in the first place, would rather not see it return. With regard to the affected users, he asks:

Can we just try to wean them off it? Using zero page for huge sparse matricies is probably not ideal anyway because it needs to still be faulted in and it occupies TLB space. They might see better performance by using a better algorithm.

Linus, however, would like to see this feature restored if it can be done in a clean way. So the return of zero page support seems fairly likely, assuming the patch can be worked into sufficiently good shape. Whether that will bring comfort to enterprise kernel users remains to be seen, though; the next generation of enterprise Linux releases look set to use kernels around 2.6.27. Unless distributors backport the zero page patch, enterprise Linux users will still be stuck with the current, zero-wasting behavior.

Comments (14 posted)

A lockless ring-buffer

By Jake Edge
July 8, 2009

One of the outcomes from last year's Kernel Summit and Linux Plumbers Conference was a plan to create a low-level ring-buffer implementation that could be shared among the various kernel and user-space tracing solutions available for Linux. One implementation of the common ring-buffer was released as part of 2.6.28, but it was somewhat lock-heavy, which impacted its performance. Recently, Steven Rostedt has proposed a lockless ring-buffer algorithm that eliminates locking on writes, the fast path for tracing.

As tracing information is gathered in the kernel, it needs to be stored somewhere very quickly, so that the impact on the timing of the events observed—and system performance overall—is fairly minimal. Reading the data is done from user space, though, so it is generally not performance-sensitive. The current ring-buffer implementation creates a circular, doubly-linked list of pages, along with a head and tail pointer, so writes are done at the tail, while reads are done from the head.

If the ring-buffer gets full, or nearly so, there is the potential for writers to overwrite data in the head page, which could corrupt data that is being read. For this reason, there is a separate reader page, which has been removed from the list entirely, that reader processes can use without being concerned about corruption from writers. But, having that separate page requires that there be a bit of a dance whenever the reader is done with the page and needs a new one. The reader page must be placed back into the list somewhere after the tail, while the current head page needs to be removed as the new reader page, and the head page must be pushed forward. That requires locking.

The diagram below, from Rostedt's ring-buffer-design.txt document, gives an idea of how the ring-buffer would look. Observant readers will note the H pointer, which is the HEADER-flagged pointer described below.

	      reader page
		  |
		  v
		 +---+
		 |   |------+
		 +---+      |
			    |
			    v
	+---+    +---+    +---+    +---+
    <---|   |--->|   |-H->|   |--->|   |--->
    --->|   |<---|   |<---|   |<---|   |<---
	+---+    +---+    +---+    +---+

Writers can be interrupted by other writers, so long as the interrupting writer completes its write before the interrupted writer can continue. This is in keeping with the way interrupts stack, and it is important that it be enforced for the integrity of the ring-buffer structure. When a write is initiated, space is reserved after the tail pointer to hold the event. This moves the tail pointer, so another pointer, called the commit pointer, is needed to track the latest complete write.

In nearly empty ring-buffers, it is possible for the reader page to also be the commit and tail pages. While the reader page has been removed from the ring-buffer, its next pointer still leads to the next ring-buffer entry. Once enough writes are done, the commit and tail pointers will simply follow the next pointer as they normally do.

In order to remove the locking for writers, which currently need to use locks to synchronize updates of the head, tail, and commit pointers, Rostedt leverages the cmpxchg() atomic operation available on some architectures. It works as follows:

    R = cmpxchg(A, C, B)
        - Assign A = B if A == C
        - Return the value A held at the time of the call, unconditionally

The success of the exchange can be determined by checking whether R is equal to C; if so, the exchange was done.
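As a rough userspace illustration of those semantics (the kernel's real cmpxchg() is architecture-specific assembly), the operation can be modeled with GCC's __sync_val_compare_and_swap() builtin:

```c
/* Userspace analogue of the kernel's cmpxchg(): atomically set
 * *a = b if *a == c, and return the value *a held at the time of
 * the call, unconditionally. The caller compares the return value
 * against c to learn whether the exchange happened. */
static unsigned long cmpxchg_ul(unsigned long *a, unsigned long c,
                                unsigned long b)
{
    return __sync_val_compare_and_swap(a, c, b);
}
```

If another CPU changed the value between the read that produced c and the cmpxchg() itself, the operation fails and the caller retries; that retry loop is the heart of any lockless algorithm built on this primitive.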

The algorithm requires that the pointers to the linked-list structures be 4-byte aligned so that it can reserve the bottom 2 bits for flags. The two flags are:

  • HEADER - the pointer is to the current head page

  • UPDATE - the pointer is to a page that is currently being written and either is, or is about to be, the head page
These flags, along with the cmpxchg() operation, are what allow lockless writing to the buffer.

When the reader page has been exhausted, the current head page needs to be detached from the ring-buffer as the new reader page. By using the HEADER flag on the next pointer that points to the head page, writers can keep readers from interfering without taking a lock. When trying to change the next pointer as part of the swapping process, readers use cmpxchg() to require that the HEADER flag be present. Writers can prevent that from happening by setting the flag to UPDATE or clearing the flags entirely. When the reader's cmpxchg() fails, it means that writers have changed the state of the ring, so the reader must look for a new head page and start the process all over.
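A toy model of that interaction might look like the following; the structure and function names here are invented for illustration and are not taken from Rostedt's patch, but the mechanism — flags carried in the low bits of an aligned pointer, with the reader's cmpxchg() requiring the HEADER flag — is the one described above:

```c
#include <stdint.h>

/* List pointers are at least 4-byte aligned, so the low two bits
 * are free to carry state. Illustrative only. */
#define RB_FLAG_MASK   3UL
#define RB_PAGE_HEADER 1UL   /* next pointer targets the head page */
#define RB_PAGE_UPDATE 2UL   /* a writer is moving the head forward */

struct buffer_page {
    uintptr_t next;          /* pointer-plus-flags to the next page */
};

/* Strip the flag bits to recover the plain pointer. */
static struct buffer_page *rb_page(uintptr_t val)
{
    return (struct buffer_page *)(val & ~RB_FLAG_MASK);
}

/* Reader side: atomically replace the pointer to the head page with
 * a pointer to our spare page, but only if the HEADER flag is still
 * set. A concurrent writer that sets UPDATE (or clears the flags)
 * makes this cmpxchg fail, forcing the reader to find the new head
 * and retry. Returns 1 on success, 0 on failure. */
static int rb_try_take_head(struct buffer_page *prev,
                            struct buffer_page *head,
                            struct buffer_page *spare)
{
    uintptr_t expect = (uintptr_t)head | RB_PAGE_HEADER;
    uintptr_t repl   = (uintptr_t)spare | RB_PAGE_HEADER;

    return __sync_val_compare_and_swap(&prev->next, expect, repl)
           == expect;
}
```

The single cmpxchg() covers both the pointer and the flag, which is what lets writers veto a reader's swap without ever taking a lock.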

When writers change to a new tail page, as they fill the buffer, they check the next pointer of the new page for the HEADER flag. If it is present, it is changed to UPDATE. That indicates that the page is volatile, as writers are currently using it, and will cause the cmpxchg() of a reader to fail, should it try to detach the head page. This is an indication that the buffer is close to full; only one page (i.e. the new tail page) remains for storing events.

The ring-buffer can operate in two modes, overwrite (aka "flight recorder") mode, where new events overwrite older events when the buffer fills up, or producer/consumer mode where writing to a full buffer causes the write to fail. In producer/consumer mode, the head page only changes at the behest of a reader, but in overwrite mode, once the tail page reaches the head, the head must be pushed forward one page, which is why the UPDATE flag must be used.

The basic function of the algorithm is relatively straightforward—if a bit head-exploding—but there are a number of more complex scenarios to consider. One is the possibility that nested writes cause the buffer to fill, such that the tail page reaches the commit page. There is no choice but to drop writes at that point, but it is possible that the commit page is on the reader page (as shown below). Naïvely pushing the head page forward, past the entry that the reader page points to, would break the ability for the commit page to move from the reader page back into the ring-buffer. So writers must check for this condition and start dropping writes if it is detected.

	      reader page    commit page
		  |              |
		  v              |
		 +---+           |
		 |   |<----------+
		 |   |
		 |   |------+
		 +---+      |
			    |
			    v
	+---+    +---+    +---+    +---+
    <---|   |--->|   |-H->|   |--->|   |--->
    --->|   |<---|   |<---|   |<---|   |<---
	+---+    +---+    +---+    +---+
		   ^
		   |
	       tail page

Other complex scenarios are possible. Interested readers are directed to Rostedt's design document for more information. It is quite detailed and chock full of ASCII artwork depicting ring-buffer operations. The algorithm itself is the subject of a patent application by Rostedt for Red Hat. If granted, it will be available for free software implementations under Red Hat's patent policy.

Mathieu Desnoyers, developer of the Linux Trace Toolkit Next Generation (LTTng), has been following the ring-buffer submission closely, as LTTng would be one of the tracing solutions expected to use the common ring-buffer. The proposed algorithm is complex, "near that of RCU mechanisms", he said, but unlike RCU (or the LTTng lockless buffer algorithm), no formal proof of the algorithm has been done. He agrees that lockless buffers for tracing are desirable and achievable, but he is concerned that the lack of formal verification of Rostedt's algorithm could lead to an extended period of bug chasing. That complexity has a bit of a silver lining, though, as Desnoyers noted in a review of the design: "The great news to me is that no one can say LTTng's lockless buffering algorithm is complex compared to this. ;)"

Two other concerns he mentioned are performance and fast user-space tracing. Rostedt's algorithm depends on being able to disable preemption, which is not possible for user-space tracing. Desnoyers said that LTTng has more compact events which he believes will allow the LTTng version to be able to handle more events per second than Rostedt's, but no real performance comparisons have, as yet, been done. Desnoyers is hopeful that he will be able to propose an alternative lockless ring-buffer implementation based on the LTTng code sometime soon, but there is the small matter of a Ph.D. dissertation to complete before that can happen.

Rostedt is targeting the 2.6.32 kernel for merging the lockless ring-buffer; it remains to be seen whether there will be objections to its inclusion. It may also have to fend off alternatives. Sooner or later, though, some kind of lockless buffering for trace events seems likely to make it into the kernel.

Comments (none posted)

Transcendent memory

By Jonathan Corbet
July 8, 2009
Making the best use of available memory is one of the biggest challenges for any operating system. Throwing virtualization into the mix adds both new challenges (balancing memory use between guests, for example) and opportunities (sharing pages between guests). Developers have responded with technologies like hot-plug memory and KSM, but nobody seems to think that the problem is fully solved. Transcendent memory is a new memory-management technique which, it is hoped, will improve the system's use of scarce RAM, regardless of whether virtualization is being used.

In his linux-kernel introduction, Dan Magenheimer asks:

What if there was a class of memory that is of unknown and dynamically variable size, is addressable only indirectly by the kernel, can be configured either as persistent or as "ephemeral" (meaning it will be around for awhile, but might disappear without warning), and is still fast enough to be synchronously accessible?

Dan (along with a list of other kernel developers) is exploring this concept, which he calls "transcendental memory." In short, transcendental memory can be thought of as a sort of RAM disk with some interesting characteristics: nobody knows how big it is, writes to the disk may not succeed, and, potentially, data written to the disk may vanish before being read back again. At first blush, it may seem like a relatively useless sort of device, but it is hoped that transcendental memory will be able to improve performance in a few situations.

There is an API specification [PDF] available; there is also a related C API found in the patch itself. This discussion will focus on the latter, which suffers from less EXCESSIVE CAPITAL USE and is generally easier to understand.

Transcendental memory operates on the concept of page pools; once a pool is created, data can be stored to pages within the pool. The calls for creating and destroying pools look like this:

    u32 pool_id = tmem_new_pool(struct tmem_pool_uuid uuid, u32 flags)
    tmem_destroy_pool(u32 pool_id);

Pools are identified by the uuid value, though the identification really only matters for pools which might be shared among multiple users. A fair amount of information is stored in the flags field, including:

  • An "ephemeral" bit, which controls whether data successfully written to the pool is allowed to disappear at a random future time.

  • A "shared" bit indicating whether the pool is to be shared with other users.

  • The size of pages to use in the pool, expressed as a kernel "order" value.

  • A specification version number, used to ensure that both sides of the conversation know how to understand each other.

While users must specify a page size, there is no way to specify the size of the pool as a whole. Determining the proper sizing for a pool (which almost certainly changes over time) is left to the hypervisor or whatever other software component is managing the pool.

As suggested by the above interface, transcendental memory is very much page-based. Beyond that, it also can never be referenced directly; users are required to copy data into and out of the pool explicitly. The functions used for moving data between normal and transcendental memory are:

    int tmem_put_page(u32 pool_id, u64 object_id, u32 page_id, unsigned long pfn);
    int tmem_get_page(u32 pool_id, u64 object_id, u32 page_id, unsigned long pfn);

For both of these calls, pool_id specifies an existing pool. The object_id and page_id values, together, form a unique identifier for the page within the pool. If the pool is being used to cache file pages, for example, the object_id would identify the file, while page_id would be the offset within the file. pfn (a page frame number) identifies the page which is the source of the data (for tmem_put_page()) or the destination (tmem_get_page()).

Note that either call might fail. Since the size of the pool is not known, callers can never know in advance whether tmem_put_page() will succeed. So any transcendental memory user must have a backup plan ready in case the call fails. For pools marked as "ephemeral," tmem_get_page() is allowed to fail even if tmem_put_page() on the same ID succeeded; in other words, the implementation is allowed to drop pages from ephemeral pools if it decides that the memory can be put to better use elsewhere. It's also worth noting that, with private, ephemeral pools, tmem_get_page() will remove the indicated page from the pool.
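Those semantics can be modeled with a toy user-space pool. Everything below is illustrative — a four-slot "pool" of 16-byte "pages", nothing like a real implementation — but it captures the two rules callers must plan around: puts fail when the pool is full, and gets remove pages from private ephemeral pools:

```c
#include <string.h>

/* Toy model of a private ephemeral pool: a tiny fixed-size cache
 * keyed by (object_id, page_id). Illustrative only; the real pool
 * lives in the hypervisor or kernel, invisible to callers. */
#define TOY_SLOTS 4
#define TOY_PAGE  16

struct toy_slot {
    int used;
    unsigned long long object_id;
    unsigned int page_id;
    char data[TOY_PAGE];
};

static struct toy_slot toy_pool[TOY_SLOTS];

/* Returns 0 on success, -1 when no slot is free; the caller must
 * have a backup plan, exactly as with tmem_put_page(). */
int toy_put_page(unsigned long long obj, unsigned int page,
                 const char *src)
{
    int i;

    for (i = 0; i < TOY_SLOTS; i++) {
        if (!toy_pool[i].used) {
            toy_pool[i].used = 1;
            toy_pool[i].object_id = obj;
            toy_pool[i].page_id = page;
            memcpy(toy_pool[i].data, src, TOY_PAGE);
            return 0;
        }
    }
    return -1;   /* pool full: fall back to the slow path */
}

/* Returns 0 and removes the page (private ephemeral semantics), or
 * -1 if the pool no longer holds it and the caller must go to disk. */
int toy_get_page(unsigned long long obj, unsigned int page, char *dst)
{
    int i;

    for (i = 0; i < TOY_SLOTS; i++) {
        if (toy_pool[i].used && toy_pool[i].object_id == obj &&
            toy_pool[i].page_id == page) {
            memcpy(dst, toy_pool[i].data, TOY_PAGE);
            toy_pool[i].used = 0;   /* get removes the page */
            return 0;
        }
    }
    return -1;
}
```

The essential point is that both return values must always be checked; a transcendental memory user treats the pool as a pure optimization, never as authoritative storage.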

As an example of how this feature might be used, consider the Linux page cache, which maintains copies of pages from disk files. When memory gets tight, the page cache will start forgetting pages which are clean, but which have not been referenced in the recent past. With transcendental memory, the page cache could, before dropping the pages, attempt to store them into an ephemeral transcendental memory pool. At some future time, when one of those pages is needed again, the page cache would first attempt to fetch it from the pool. If the tmem_get_page() call succeeds, a disk I/O operation will have been avoided and everybody benefits; otherwise the page is read from disk as usual.

Persistent (non-ephemeral) pools could be used as a sort of swap device. If the swapping code succeeds in writing a page to the pool, it can avoid writing it to the real swap device. The result is saved I/O at both swap-out and swap-in times. If the pool lacks space for the swapped page, it will be written to the real swap device in the usual way.

Meanwhile, the transcendental memory implementation can try to optimize its management of the memory pools. Guests which are more active (or which have been given a higher priority) might be allowed to allocate more pages from the pools. Duplicate pages can be coalesced; KSM-like techniques could be used, but the use of object IDs could make it easier to detect duplicates in a number of situations. And so on.

The API specifies a number of other operations. There are a couple of calls to flush pages from the pool; one of them can remove all pages with a given object ID. Sub-page-size reads and writes are supported; there is also a tmem_xchg() call to atomically exchange data within a transcendental memory page. See the API specification for the full list.

A number of concerns were raised in the subsequent discussion; as a result, the above API is likely to change a bit. The biggest concern, though, appears to be security. The potential for hostile code to tap into shared pools and read out pages has developers worried; the need to guess a 128-bit UUID first has proved not to be sufficiently reassuring. Even with legitimate users only, a shared pool has the potential to contain data which should not, in reality, be shared between guests. As a result, any transcendental memory user will have to be written to take high-level security issues into account in low-level code.

Dan seemingly doesn't see the security problems as being as worrisome as others do. Even so, he eventually announced that the next transcendental memory patch would not include support for shared pools, and, indeed, version 2 lacks that feature. That feature will probably not come back until the security issues have been thought through and the concerns have been addressed.

Beyond that, transcendental memory will need some convincing evidence that it improves performance before it can make it into the mainline. The potential for improvements is clearly there; it is essentially a way for the system to take higher-level information into account when managing its virtual memory resources. If transcendental memory is able to fulfill that potential in a secure way, there may well be a place for it in the mainline kernel.

Comments (21 posted)

Patches and updates

Kernel trees

Linus Torvalds	Linux 2.6.31-rc2
Greg KH	Linux 2.6.30.1
Greg KH	Linux 2.6.29.6
Greg KH	Linux 2.6.27.26


Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds