
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.26-rc4, released on May 26. In addition to the usual fixes, there are quite a few driver updates in various areas: DRI, networking, MMC, USB, watchdog, and IDE. There is also a fix for 32-bit x86 users who have "too much memory" in their machines. "So if you had PAE enabled _and_ a recent enough CPU to have NX, but not recent enough to be 64-bit (or you were just perverse and wanted to run a 32-bit kernel despite having a chip that could do 64-bit and enough memory that you _really_ should have used a 64-bit kernel), you'd get various random program failures with SIGSEGV. It ranged from X not starting up to apparently OpenOffice not working if it did." As always, all the gory details can be found in the long-format changelog.


Kernel development news

Quotes of the week

Stress testing is rarely useful here - things like ioctl races you find by thinking evil thoughts.
-- Alan Cox

The greatest problem, as I see it is that by pouring vitriol like this on newbies, we're really damaging our reputation as a community that welcomes newcomers and strangling our necessary supply of willing volunteers. On the other hand, as a maintainer, when there's people yelling me at about patches not being included plus a persistent regressions list and about ten bug reports to track down, the last thing I want to see within a million miles of my inbox is a white space fixing patch. The more of these patches we get, the worse the problem becomes and the shorter and more inflammatory the responses get. We can't go on like this.
-- James Bottomley

The #1 project for all kernel beginners should surely be "make sure that the kernel runs perfectly at all times on all machines which you can lay your hands on". Usually the way to do this is to work with others on getting things fixed up (this can require persistence!) but that's fine - it's a part of kernel development.
-- Andrew Morton


Responding to ext4 journal corruption

By Jonathan Corbet
May 27, 2008
Last week's article on barriers described one way in which things could go wrong with journaling filesystems. Therein, it was noted that the journal checksum feature added to the ext4 filesystem would mitigate some of those problems by preventing the replay of the journal if it had not been completely written before a crash. As a discussion this week shows, though, the situation is not quite that simple.

Ted Ts'o was doing some ext4 testing when he noticed a problem with how the journal checksum is handled. The journal will normally contain several transactions which have not yet been fully played into the filesystem. Each one of those transactions includes a commit record which contains, among other things, a checksum for the transaction. If the checksum matches the actual transaction data in the journal, the system knows that the transaction was written completely and without errors; it should thus be safe to replay the transaction into the filesystem.

The problem that Ted noticed was this: if a transaction in the middle of the series failed to match its checksum, the playback of the journal would stop - but only after writing the corrupted transaction into the filesystem. This is a sort of worst-of-all-worlds scenario: the kernel will dump data which is known to be corrupt into the filesystem, then silently throw away the (presumably good) transactions after the bad one. The ext4 developers quickly arrived at a consensus that this behavior is a bug which should be fixed.
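The check described above can be modeled in a few lines of user-space C. This is only a sketch, not the jbd2 code: struct toy_txn, toy_checksum() and toy_count_replayable() are hypothetical names, the checksum is an Adler-32-style stand-in for whatever the real journal uses, and stopping at the first bad transaction is just one of the policies under discussion. The point is the ordering: the checksum is verified before anything is written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified transaction; real journal structures differ. */
struct toy_txn {
    const uint8_t *data;   /* journaled metadata for this transaction */
    size_t len;            /* length of data in bytes */
    uint32_t commit_sum;   /* checksum stored in the commit record */
};

/* Adler-32-style toy checksum, a stand-in for the real algorithm. */
static uint32_t toy_checksum(const uint8_t *p, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + p[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/*
 * Return how many leading transactions are safe to replay. Each
 * checksum is verified before anything would be written, so a corrupt
 * transaction is never copied into the filesystem.
 */
static size_t toy_count_replayable(const struct toy_txn *txns, size_t n)
{
    assert(txns != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (toy_checksum(txns[i].data, txns[i].len) != txns[i].commit_sum)
            return i;   /* stop before the first bad transaction */
    }
    return n;
}
```

A caller would replay only the first toy_count_replayable() transactions; what to do with the remainder is exactly the open question.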

But what should really be done is not as clear as one might think. Ted's suggestion was this:

So I think the right thing to do is to replay the *entire* journal, including the commits with the failed checksums (except in the case where journal_async_commit is enabled and the last commit has a bad checksum, in which case we skip the last transaction). By replaying the entire journal, we don't lose any of the revoke blocks, which is critical in making sure we don't overwrite any data blocks, and replaying subsequent metadata blocks will probably leave us in a much better position for e2fsck to be able to recover the filesystem.

A bit of background might help in understanding the problem that Ted is trying to solve here. In the default data=ordered mode, ext3 and ext4 do not write all data to the journal before it goes to the filesystem itself. Instead, only filesystem metadata goes to the journal; data blocks are written directly to the filesystem. The "ordered" part means that all of the data blocks will be written before the filesystem code will start writing the metadata; in this way, the metadata will always describe a complete and correct filesystem.

Now imagine a journal which contains a set of transactions similar to these (in this order):

  1. A file is created, with its associated metadata.

  2. That file is then deleted, and its metadata blocks are released.

  3. Some other file is extended, with the newly-freed metadata blocks being reused as data blocks.

Imagine further that the system crashes with those transactions in the journal, but transaction 2 is corrupt. Simply skipping the bad transaction and replaying transaction 3 would lead to the filesystem being most confused about the status of the reused blocks. But just stopping at the corrupt transaction also has a problem: the data blocks created in transaction 3 may have already been written, but, as of transaction 1, the filesystem thinks those are metadata blocks. That, too, leads to a corrupt filesystem. By replaying the entire journal, Ted hopes to catch situations like that and leave the filesystem in an overall better shape.

It is, perhaps, not surprising that there was some disagreement with this approach. Andreas Dilger argued:

The whole point of this patch was to avoid the case where random garbage had been written into the journal and then splattering it all over the filesystem. Considering that the journal has the highest density of important metadata in the filesystem, it is virtually impossible to get more serious corruption than in the journal.

The next proposal was to make a change to the on-disk journal format ("one more time") turning the per-transaction checksum into a per-block checksum. Then it would be possible to get a handle on just how bad any corruption is, and even corrupt transactions could be mostly replayed. As of this writing, that looks like the approach which will be taken.
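To see why a per-block checksum helps, consider the following user-space sketch. It is an illustration under assumed names (struct pb_block, pb_checksum(), pb_replay() and the tiny block size are all hypothetical) and says nothing about the eventual on-disk format. Each journaled block carries its own checksum, so replay can copy the blocks that verify and report how many did not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PB_BLOCK_SIZE 8   /* tiny block size, for illustration only */

/* Hypothetical journal block carrying its own checksum. */
struct pb_block {
    uint32_t target;               /* destination block number */
    uint8_t data[PB_BLOCK_SIZE];   /* block contents */
    uint32_t sum;                  /* checksum of data[] alone */
};

/* Adler-32-style toy checksum, a stand-in for the real algorithm. */
static uint32_t pb_checksum(const uint8_t *p, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + p[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/*
 * Copy every block whose checksum verifies into disk[] and return the
 * number of blocks skipped as corrupt or out of range.
 */
static size_t pb_replay(const struct pb_block *blk, size_t n,
                        uint8_t disk[][PB_BLOCK_SIZE], size_t disk_blocks)
{
    size_t bad = 0;

    assert(blk != NULL && disk != NULL);

    for (size_t i = 0; i < n; i++) {
        if (blk[i].target >= disk_blocks ||
            pb_checksum(blk[i].data, PB_BLOCK_SIZE) != blk[i].sum) {
            bad++;
            continue;
        }
        memcpy(disk[blk[i].target], blk[i].data, PB_BLOCK_SIZE);
    }
    return bad;
}
```

The return value gives recovery code "a handle on just how bad any corruption is"; whether good blocks from a damaged transaction should actually be replayed remains a policy decision.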

Arguably, the real conclusion to take from this discussion was best expressed by Arjan van de Ven in an entirely different context: "having a journal is soooo 1999". The Btrfs filesystem, which has a good chance of replacing ext3 and ext4 a few years from now, does not have a journal; instead, it uses its fast snapshot mechanism to keep transactions consistent. Btrfs may, thus, avoid some of the problems that come with journaling - though, perhaps, at the cost of introducing a set of interesting new problems.


Using the firmware loader for static data

By Jake Edge
May 28, 2008

Some device drivers need firmware to load into the hardware at initialization time. The kernel firmware loader interface exists to support that functionality, but it requires help from user space which may not be available in all environments. David Woodhouse has proposed a patch that would eliminate that requirement so that more drivers can use the firmware loader rather than craft their own solution.

Embedded devices will be one of the main users of this ability. Many of those do not have a user space filesystem available at boot time—via initrd or initramfs—but they still need to access firmware images to download to peripherals. The new request_firmware() implementation would allow those devices to link the firmware into the kernel while still using the kernel firmware infrastructure.

Woodhouse has an excellent summary of what he is trying to do in the patch posting:

Some drivers have their own hacks to bypass the kernel's firmware loader and build their firmware into the kernel; this renders those unnecessary.

Other drivers don't use the firmware loader at all, because they always want the firmware to be available. This allows them to start using the firmware loader.

A third set of drivers already use the firmware loader, but can't be used without help from userspace, which sometimes requires an initrd. This allows them to work in a static kernel.

A driver that has static firmware data declares it using:

    DECLARE_BUILTIN_FIRMWARE("firmware_name", blob);

The firmware_name is used as a key to find the specific firmware when request_firmware() is called. blob is a pointer to the actual code. The declaration adds the firmware to the end of an array holding struct builtin_fw elements, which look like this:

    struct builtin_fw {
            char *name;
            void *data;
            unsigned long size;
    };

When a call is made to request_firmware(), the new code linearly searches the array for a matching key before calling out to user space. This allows any statically created firmware blobs to take precedence over those in the filesystem. Whichever is found is returned.
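The lookup step might look roughly like the following user-space sketch. Only the struct builtin_fw layout comes from the patch as described here; find_builtin_fw() and its arguments are hypothetical, and the real code walks a linker-built array rather than a table passed in by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Layout as given in the article; everything else is illustrative. */
struct builtin_fw {
    char *name;
    void *data;
    unsigned long size;
};

/*
 * Hypothetical helper: search the built-in table linearly for a
 * matching name. A NULL return means the caller should fall back to
 * the user-space firmware loader.
 */
static const struct builtin_fw *
find_builtin_fw(const struct builtin_fw *table, size_t count,
                const char *name)
{
    assert(name != NULL);

    for (size_t i = 0; i < count; i++) {
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    }
    return NULL;
}
```

Because the table is consulted first, a built-in blob shadows any file of the same name in the filesystem.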

There seemed to be strong agreement that Woodhouse's approach was the right way to go. His original implementation copied the firmware blob before returning it to a request_firmware() caller, which required a vmalloc() and wasted precious memory on embedded devices. The copy was there because Woodhouse was concerned that some drivers might modify the firmware before loading it into the device. Once he started looking, he found examples of that, but instead of penalizing all devices, he changed the firmware data returned in a struct firmware to be constant, resulting in the following structure:

    struct firmware {
            size_t size;
            const u8 *data;
    };

This constitutes an API change for anyone using the request_firmware() interface. In-tree drivers have been modified by Woodhouse appropriately, but out-of-tree drivers need to be aware of the change. Any driver that needs to modify the data must make a copy for itself.
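For a driver that patches its firmware before loading it, the change amounts to something like the sketch below. It is written for user space, so the u8 typedef and malloc() stand in for their kernel counterparts, and fw_writable_copy() is a hypothetical helper rather than part of the firmware API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint8_t u8;   /* user-space stand-in for the kernel's u8 */

/* Structure as shown in the article: the data is now read-only. */
struct firmware {
    size_t size;
    const u8 *data;
};

/*
 * Hypothetical driver helper: return a private, writable copy of the
 * firmware image, or NULL on allocation failure. The caller owns the
 * copy and must free() it; in the kernel this would be a kernel
 * allocation instead.
 */
static u8 *fw_writable_copy(const struct firmware *fw)
{
    u8 *copy;

    assert(fw != NULL && fw->data != NULL);

    copy = malloc(fw->size ? fw->size : 1);
    if (copy != NULL)
        memcpy(copy, fw->data, fw->size);
    return copy;
}
```

Only drivers that really modify the image pay for the extra allocation; everyone else reads the constant data in place.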

Another feature that would be useful for memory-constrained devices is compression of the firmware in the kernel image. This is on Woodhouse's radar, but is not seen as a feature that must be in the first release. Not copying the data for most drivers is a bigger win, but compression, especially for large firmware images, might help. In those cases, though, both the compressed and uncompressed data will be in memory while the driver is downloading it.

Getting this work included into 2.6.26 has been discussed, even though the merge window has closed. Woodhouse thinks it might be possible:

Well, it's supposedly too late, but it's dead simple and shouldn't have much chance of breaking anything, so I suppose as long as we don't include the korg1212 patch and the rest of the similar patches which we're still working on, that's not such an insane request.

This is a fairly simple patch that adds some very useful functionality, especially for the embedded community. Woodhouse has recently stepped up as one of the kernel embedded maintainers, so we may see more things like this from him in the future. It is unlikely that Linus Torvalds will merge this feature so late in the 2.6.26 cycle, but inclusion into 2.6.27 seems quite probable.


GEM v. TTM

By Jonathan Corbet
May 28, 2008
Getting high-performance, three-dimensional graphics working under Linux is quite a challenge even when the fundamental hardware programming information is available. One component of this problem is memory management: a graphics processor (GPU) is, essentially, a computer of its own with a distinct view of memory. Managing the GPU's memory - and its view of system RAM - must be done carefully if the resulting system is intended to work at all, much less with acceptable performance.

Not that long ago, it appeared that this problem had been solved with the translation table maps (TTM) subsystem. TTM remains outside of the mainline kernel, though, as do all drivers which use it. A recent query about what would be required to get TTM merged led to an interesting discussion where it turned out that, in fact, TTM may not be the future of graphics memory management after all.

A number of complaints about TTM have been raised. Its API is far larger than is needed for any free Linux driver; it has, in other words, a certain amount of code dedicated to the needs of binary-only drivers. The fencing mechanism (which manages concurrency between the host CPUs and the GPU) is seen as being complex, difficult to work with, and not always yielding the best performance. Heavy use of memory-mapped buffers can create performance problems of its own. The TTM API is an exercise in trying to provide for everything in all situations; as a result it is, according to some driver developers, hard to match to any specific hardware, hard to get started with, and still insufficiently flexible. And, importantly, there is a distinct shortage of working free drivers which use TTM. So Dave Airlie worries:

I was hoping that by now, one of the radeon or nouveau drivers would have adopted TTM, or at least demoed something working using it, this hasn't happened which worries me... The real question is whether TTM suits the driver writers for use in Linux desktop and embedded environments, and I think so far I'm not seeing enough positive feedback from the desktop side

All of these worries would seem to be moot, since TTM is available and there is nothing else out there. Except, as it turns out, there is something out there: it's called the Graphics Execution Manager, or GEM. The Intel-sponsored GEM project is all of one month old, as of this writing. The GEM developers had not really intended to announce their work quite yet, but the TTM discussion brought the issue to the fore.

Keith Packard's introduction to GEM includes a document describing the API as it exists so far. There are a number of significant differences in how GEM does things. To begin with, GEM allocates graphical buffer objects using normal, anonymous, user-space memory. That means that these buffers can be forced out to swap when memory gets tight. There are clear advantages to this approach, and not just in memory flexibility: it also makes the implementation of suspend and resume easier by automatically providing backing store for all buffer objects.

The GEM API tries to do away with the mapping of buffers into user space. That mapping is expensive to do and brings all sorts of interesting issues with cache coherency between the CPU and GPU. So, instead, buffer objects are accessed with simple read() and write() calls. Or, at least, that's the way it would be if the GEM developers could attach a file descriptor to each buffer object. The kernel, however, does not make the management of that many file descriptors easy (yet), so the real API uses separate handles for buffer objects and a series of ioctl() calls.

That said, it is possible to map a buffer object into user space. But then the user-space driver must take explicit responsibility for the management of cache coherency. To that end there is a set of ioctl() calls for managing the "domain" of a buffer; the domain, essentially, describes which component of the system owns the buffer and is entitled to operate on it. Changing the domains (there are two, one for read access and one for writes) of a buffer will perform the necessary cache flushes. In a sense, this mechanism resembles the streaming DMA API, where the ownership of DMA buffers can be switched between the CPU and the peripheral controller. That is not entirely surprising, as a very similar problem is being solved.

This API also does away with the need for explicit fence operations. Instead, a CPU operation which requires access to a buffer will simply wait, if necessary, for the GPU to finish any outstanding operations involving that buffer.

Finally, the GEM API does not try to solve the entire problem; a number of important operations (such as the execution of a set of GPU commands) are left for the hardware-specific driver to implement. GEM is, thus, quite specific to the needs of Intel's driver at this time; it does not try for the same sort of generality that was a goal of TTM. As described by Eric Anholt:

The problem with TTM is that it's designed to expose one general API for all hardware, when that's not what our drivers want... We're trying to come at it from the other direction: Implement one driver well. When someone else implements another driver and finds that there's code that should be common, make it into a support library and share it.

The advantage to this approach is that it makes it relatively easy to create something which works well with Intel drivers. And that may well be a good start; one working set of drivers is better than none. On the other hand, that means that a significant amount of work may be required to get GEM to the point where it can support drivers for other hardware. There seem to be two points of view on how that might be done: (1) add capabilities to GEM when needed by other drivers, or (2) have each driver use its own memory manager.

The first approach is, in many ways, more pleasing. But it implies that the GEM API could change significantly over time. And that, in turn, could delay the merging of the whole thing; the GEM API is exported to user space, and, as a result, must remain compatible as things change. So there may be resistance to a quick merge of an API which looks like it may yet have to evolve for some time.

The second approach, instead, is best described by Dave Airlie:

Well the thing is I can't believe we don't know enough to do this in some way generically, but maybe the TTM vs GEM thing proves its not possible. So we can then punt to having one memory manager per driver, but I suspect this will be a maintenance nightmare, so if people decide this is the way forward, I'm happy to see it happen. However the person submitting the memory manager n+1 must damn well be willing to stand behind the interface until time ends, and explain why they couldn't re-use 1..n memory managers.

One other remaining issue is performance. Keith Whitwell posted some benchmark results showing that the i915 driver performs significantly worse with either TTM or GEM than without. Keith Packard gets different results, though; his tests show that the GEM-based driver is significantly faster. Clearly there is a need for a set of consistent benchmarks; performance of graphics drivers is important, but performance cannot be optimized if it cannot be reliably measured.

The use of anonymous memory also raises some performance concerns: a first-person shooter game will not provide the same experience if its blood-and-gore textures must be continually paged in. Anonymous memory can also be high memory, and, thus, not necessarily accessible via a 32-bit pointer. Some GPU hardware cannot address high memory; that will likely force the use of bounce buffers within the kernel. In the end, GEM will have to prove that it can deliver good performance; GEM's developers are highly motivated to make their hardware look good, so there is a reasonable chance that things will work out on this front.

The conclusion to draw from all of this is that the GPU memory management problem cannot yet be considered solved. GEM might eventually become that solution, but it is a very new API, and a lot of work remains to be done in this area.

(Thanks to Timo Jyrinki for suggesting this topic.)


Page editor: Jonathan Corbet

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds