Brief items
The current 2.6 development kernel is 2.6.26-rc4,
released on May 26. In
addition to the usual fixes, there are quite a few driver updates in
various areas: DRI, networking, MMC, USB, watchdog, and IDE. There is also
a fix for 32-bit x86 users who have "too much memory" in their
machines. "So if you had PAE enabled _and_ a recent enough CPU to
have NX, but not recent enough to be 64-bit (or you were just perverse and
wanted to run a 32-bit kernel despite having a chip that could do 64-bit
and enough memory that you _really_ should have used a 64-bit kernel),
you'd get various random program failures with SIGSEGV. It ranged from X
not starting up to apparently OpenOffice not working if it did." As
always, all the gory details can be found in the long-format changelog.
Kernel development news
Stress testing is rarely useful here - things like ioctl races you
find by thinking evil thoughts.
--
Alan Cox
The greatest problem, as I see it is that by pouring vitriol like
this on newbies, we're really damaging our reputation as a
community that welcomes newcomers and strangling our necessary
supply of willing volunteers. On the other hand, as a maintainer,
when there's people yelling me at about patches not being included
plus a persistent regressions list and about ten bug reports to
track down, the last thing I want to see within a million miles of
my inbox is a white space fixing patch. The more of these patches
we get, the worse the problem becomes and the shorter and more
inflammatory the responses get. We can't go on like this.
--
James Bottomley
The #1 project for all kernel beginners should surely be "make sure
that the kernel runs perfectly at all times on all machines which
you can lay your hands on". Usually the way to do this is to work
with others on getting things fixed up (this can require
persistence!) but that's fine - it's a part of kernel development.
--
Andrew Morton
By Jonathan Corbet
May 27, 2008
Last week's article on
barriers described one way in which things could go wrong with journaling
filesystems. Therein, it was noted that the journal checksum feature added
to the ext4 filesystem would mitigate some of those problems by preventing
the replay of the journal if it had not been completely written before a
crash. As a discussion this week shows, though, the situation is not quite
that simple.
Ted Ts'o was doing some ext4 testing when he noticed a problem with how the journal
checksum is handled. The journal will normally contain several
transactions which have not yet been fully played into the filesystem.
Each one of those transactions includes a commit record which contains,
among other things, a checksum for the transaction. If the checksum
matches the actual transaction data in the journal, the system knows that
the transaction was written completely and without errors; it should thus
be safe to replay the transaction into the filesystem.
The problem that Ted noticed was this: if a transaction in the middle of
the series failed to match its checksum, the playback of the journal would
stop - but only after writing the corrupted transaction into the
filesystem. This is a sort of worst-of-all-worlds scenario: the kernel
will dump data which is known to be corrupt into the filesystem, then
silently throw away the (presumably good) transactions after the bad one. The
ext4 developers quickly arrived at a consensus that this behavior is a bug
which should be fixed.
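The check-before-write ordering that the checksum is meant to provide can be sketched in user-space C. This is only an illustration under stated assumptions: struct txn, txn_checksum(), and replay_journal() are hypothetical names, the structure is far simpler than the real journal format, and the Adler-32-style sum is a stand-in rather than the checksum ext4 uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified transaction; the real journal layout differs. */
struct txn {
    const uint8_t *data;   /* journaled blocks, flattened into one buffer */
    size_t len;            /* number of bytes in data */
    uint32_t checksum;     /* value stored in the commit record */
};

/* Stand-in Adler-32-style checksum; not the algorithm ext4 uses. */
static uint32_t txn_checksum(const uint8_t *data, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

/*
 * Verify each transaction before applying it. Returns the number of
 * transactions replayed; replay stops in front of the first bad one,
 * so data known to be corrupt is never written to the filesystem.
 */
static size_t replay_journal(const struct txn *txns, size_t count,
                             void (*apply)(const struct txn *))
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (txn_checksum(txns[i].data, txns[i].len) != txns[i].checksum)
            break;              /* check first, write second */
        if (apply)
            apply(&txns[i]);
    }
    return i;
}
```

The sketch fixes only the ordering problem. It still discards every transaction after the bad one, which is exactly the policy question the developers went on to debate.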
But what should really be done is not as clear as one might think. Ted's
suggestion was this:
So I think the right thing to do is to replay the *entire* journal,
including the commits with the failed checksums (except in the case
where journal_async_commit is enabled and the last commit has a bad
checksum, in which case we skip the last transaction). By
replaying the entire journal, we don't lose any of the revoke
blocks, which is critical in making sure we don't overwrite any
data blocks, and replaying subsequent metadata blocks will probably
leave us in a much better position for e2fsck to be able to recover
the filesystem.
A bit of background might help in understanding the problem that Ted is
trying to solve here. In the default data=ordered mode, ext3 and
ext4 do not write all data to the journal before it goes to the filesystem
itself. Instead, only filesystem metadata goes to the journal; data
blocks are written directly to the filesystem. The "ordered" part means
that all of the data blocks will be written before the filesystem code will
start writing the metadata; in this way, the metadata will always describe
a complete and correct filesystem.
Now imagine a journal which contains a set of transactions similar to these
(in this order):
1. A file is created, with its associated metadata.
2. That file is then deleted, and its metadata blocks are released.
3. Some other file is extended, with the newly-freed metadata blocks
   being reused as data blocks.
Imagine further that the system crashes with those transactions in the journal,
but transaction 2 is corrupt. Simply skipping the bad transaction and
replaying transaction 3 would lead to the filesystem being most
confused about the status of the reused blocks. But just stopping at the
corrupt transaction also has a problem: the data blocks created in
transaction 3 may have already been written, but, as of
transaction 1, the filesystem thinks those are metadata blocks. That,
too, leads to a corrupt filesystem. By replaying the entire journal, Ted
hopes to catch situations like that and leave the filesystem in an overall
better shape.
It is, perhaps, not surprising that there was some disagreement with this
approach. Andreas Dilger argued:
The whole point of this patch was to avoid the case where random
garbage had been written into the journal and then splattering it
all over the filesystem. Considering that the journal has the
highest density of important metadata in the filesystem, it is
virtually impossible to get more serious corruption than in the
journal.
The next proposal was to make a change to
the on-disk journal format ("one more time") turning the per-transaction
checksum into a per-block checksum. Then it would be possible to get a
handle on just how bad any corruption is, and even corrupt transactions
could be mostly replayed. As of this writing, that looks like the approach
which will be taken.
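To see why a per-block checksum helps, consider the following user-space sketch. It is not the proposed on-disk format: struct jblock, the tiny BLOCK_SIZE, block_checksum(), and replay_blocks() are all assumptions made for illustration, and the checksum is again an Adler-32-style stand-in.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 8    /* tiny blocks keep the example small; an assumption */

/* Hypothetical journal block carrying its own checksum. */
struct jblock {
    uint8_t data[BLOCK_SIZE];
    uint32_t checksum;
};

/* Stand-in Adler-32-style checksum; not the algorithm ext4 uses. */
static uint32_t block_checksum(const uint8_t *data, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

/*
 * Check every block of a transaction independently. replayed[i] is set
 * to 1 if block i verified and could be replayed, 0 if it was corrupt.
 * Returns the number of corrupt blocks, a measure of how bad things are.
 */
static size_t replay_blocks(const struct jblock *blocks, size_t count,
                            uint8_t *replayed)
{
    size_t bad = 0;

    for (size_t i = 0; i < count; i++) {
        int ok = block_checksum(blocks[i].data, BLOCK_SIZE)
                 == blocks[i].checksum;

        replayed[i] = ok ? 1 : 0;
        if (!ok)
            bad++;
    }
    return bad;
}
```

With one checksum per transaction, a single damaged block condemns the whole transaction; here the damage can be localized and the remaining blocks still replayed.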
Arguably, the real conclusion to take from this discussion was best expressed by Arjan van de Ven in
an entirely different context: "having a journal is soooo
1999". The Btrfs filesystem, which has a good chance of replacing
ext3 and ext4 a few years from now, does not have a journal; instead, it
uses its fast snapshot mechanism to keep transactions consistent. Btrfs
may, thus, avoid some of the problems that come with journaling - though,
perhaps, at the cost of introducing a set of interesting new problems.
By Jake Edge
May 28, 2008
Some device drivers need firmware to load into the hardware at
initialization time. The kernel firmware loader interface exists to
support that functionality, but it requires help from user space
which may not be available in all environments. David Woodhouse has
proposed a patch that would
eliminate that requirement so that more drivers can use the firmware
loader rather than craft their own solution.
Embedded devices will be one of the main users of this ability. Many
of those do not have a user space filesystem available at boot
time—via initrd or initramfs—but they still need to access
firmware images to download to peripherals. The new
request_firmware() implementation would allow those devices to
link the firmware into the kernel while still using the kernel firmware
infrastructure.
Woodhouse has an excellent summary of what he is trying to do in the patch
posting:
Some drivers have their own hacks to bypass the kernel's firmware loader
and build their firmware into the kernel; this renders those unnecessary.
Other drivers don't use the firmware loader at all, because they always
want the firmware to be available. This allows them to start using the
firmware loader.
A third set of drivers already use the firmware loader, but can't be
used without help from userspace, which sometimes requires an initrd.
This allows them to work in a static kernel.
A driver that has static firmware data declares it using:
    DECLARE_BUILTIN_FIRMWARE("firmware_name", blob);
The firmware_name is used as a key to find the specific firmware when
request_firmware() is called; blob is a pointer to the actual code. The
declaration adds the firmware to the end of an array holding
struct builtin_fw elements, which look like this:
    struct builtin_fw {
        char *name;
        void *data;
        unsigned long size;
    };
When a call is made to request_firmware(), the new code linearly
searches the array for a matching key before calling out to user space.
This allows any statically created firmware blobs to take precedence over
those in the filesystem. Whichever is found is returned.
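The lookup described here amounts to a linear search by name. A minimal user-space sketch follows; the table contents, the blob names, and find_builtin_fw() are invented for illustration and are not taken from the patch, where the array is populated by DECLARE_BUILTIN_FIRMWARE() instead.

```c
#include <stddef.h>
#include <string.h>

/* Same layout as the struct builtin_fw described in the article. */
struct builtin_fw {
    char *name;
    void *data;
    unsigned long size;
};

/* Hypothetical firmware blobs standing in for real images. */
static unsigned char blob_a[] = { 0xde, 0xad };
static unsigned char blob_b[] = { 0xbe, 0xef, 0x01 };

/* Hypothetical table; the patch builds its array at link time instead. */
static struct builtin_fw builtin_table[] = {
    { "example-a.fw", blob_a, sizeof(blob_a) },
    { "example-b.fw", blob_b, sizeof(blob_b) },
};

/*
 * Search the built-in table for a matching name. A NULL return means
 * nothing was built in, so the caller would fall back to user space.
 */
static const struct builtin_fw *find_builtin_fw(const char *name)
{
    size_t n = sizeof(builtin_table) / sizeof(builtin_table[0]);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(builtin_table[i].name, name) == 0)
            return &builtin_table[i];
    }
    return NULL;
}
```

Because the table is consulted first, a built-in blob shadows any file of the same name that user space could have supplied.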
There seemed to be strong agreement that Woodhouse's approach was the right
way to go. His original implementation copied
the firmware blob before returning it to a request_firmware()
caller, which required a vmalloc() call and wasted precious memory
on embedded devices.
Woodhouse was concerned that some drivers might modify the firmware before
loading it into the device. Once he started looking, he found examples of
that, but instead of penalizing all devices, he changed the firmware data
returned in a struct firmware to be constant, resulting in the
following structure:
struct firmware {
size_t size;
const u8 *data;
};
This constitutes an API change for anyone using the
request_firmware() interface. In-tree drivers have been modified
by Woodhouse appropriately, but out-of-tree drivers need to be aware of the
change. Any driver that needs to modify the data
must make a copy for itself.
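For a driver that does patch its firmware before loading it, the change might look something like the following user-space sketch. fw_copy_for_patching() is a hypothetical helper, the u8 typedef merely mimics the kernel type, and malloc() stands in for whatever kernel allocator a real driver would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint8_t u8;     /* kernel type name, defined here for user space */

/* Same layout as the struct firmware described in the article. */
struct firmware {
    size_t size;
    const u8 *data;
};

/*
 * Return a private, writable copy of the firmware image, or NULL on
 * failure. The caller owns the copy and must free() it. The original
 * image, which may live in read-only kernel data, is left untouched.
 */
static u8 *fw_copy_for_patching(const struct firmware *fw)
{
    u8 *copy;

    if (!fw || !fw->data || fw->size == 0)
        return NULL;

    copy = malloc(fw->size);
    if (!copy)
        return NULL;

    memcpy(copy, fw->data, fw->size);
    return copy;
}
```

Only the drivers that need a writable image pay for the extra memory; everybody else uses the constant data in place.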
Another feature that would be useful for memory-constrained devices is
compression of the firmware in the kernel image. This is on Woodhouse's
radar, but is not seen as a feature that must be in the first release.
Not copying the data for most drivers is a bigger win, but compression,
especially for large firmware images, might help. In those cases, though,
both the compressed and uncompressed data will be in memory while the
driver is downloading it.
Getting this work included into 2.6.26 has been discussed, even though the
merge window has closed. Woodhouse thinks
it might be possible:
Well, it's supposedly too late, but it's dead simple and shouldn't have
much chance of breaking anything, so I suppose as long as we don't
include the korg1212 patch and the rest of the similar patches which
we're still working on, that's not such an insane request.
This is a fairly simple patch that adds some very useful functionality,
especially for the embedded community. Woodhouse has recently stepped up
as one of the kernel embedded maintainers, so we may see more things like
this from him in the future. It is unlikely that Linus Torvalds will merge
this feature so late in the 2.6.26 cycle, but inclusion into 2.6.27 seems
quite probable.
By Jonathan Corbet
May 28, 2008
Getting high-performance, three-dimensional graphics working under Linux is
quite a challenge even when the fundamental hardware programming
information is available. One component of this problem is memory
management: a graphics processor (GPU) is, essentially, a computer of its
own with a distinct view of memory. Managing the GPU's memory - and its
view of system RAM - must be done carefully if the resulting system is
intended to work at all, much less with acceptable performance.
Not that long ago, it appeared that this problem had been solved with the
translation table maps (TTM)
subsystem. TTM remains outside of the mainline kernel, though, as do
all drivers which use it. A recent query
about what would be required to get TTM merged led to an interesting
discussion where it turned out that, in fact, TTM may not be the future of
graphics memory management after all.
A number of complaints about TTM have been raised. Its API is far larger
than is needed for any free Linux driver; it has, in other words, a certain
amount of code dedicated to the needs of binary-only drivers. The fencing
mechanism (which manages concurrency between the host CPUs and the GPU) is
seen as being complex, difficult to work with, and not always yielding the
best performance. Heavy use of memory-mapped buffers can create performance
problems of its own. The TTM API is an exercise in trying to provide for
everything in all situations; as a result it is, according to some
driver developers, hard to match to
any specific hardware, hard to get started with, and still insufficiently
flexible. And, importantly, there is a
distinct shortage of working free drivers which use TTM. So Dave Airlie worries:
I was hoping that by now, one of the radeon or nouveau drivers
would have adopted TTM, or at least demoed something working using
it, this hasn't happened which worries me... The real question is
whether TTM suits the driver writers for use in Linux desktop and
embedded environments, and I think so far I'm not seeing enough
positive feedback from the desktop side
All of these worries would seem to be moot, since TTM is available and
there is nothing else out there. Except, as it turns out, there is
something out there: it's called the Graphics Execution Manager, or GEM.
The Intel-sponsored GEM project is all of one month old, as of this writing.
The GEM developers had not really intended to announce
their work quite yet, but the TTM discussion brought the issue to the fore.
Keith Packard's introduction to GEM includes a
document describing the API as it exists so far. There are a number of
significant differences in how GEM does things. To begin with, GEM
allocates graphical buffer objects using normal, anonymous, user-space
memory. That means that these buffers can be forced out to swap when
memory gets tight. There are clear advantages to this approach, and not
just in memory flexibility: it also makes the implementation of suspend and
resume easier by automatically providing backing store for all buffer
objects.
The GEM API tries to do away with the mapping of buffers into user space.
That mapping is expensive to do and brings all sorts of interesting issues
with cache coherency between the CPU and GPU. So, instead, buffer objects
are accessed with simple read() and write() calls. Or,
at least, that's the way it would be if the GEM developers could attach a
file descriptor to each buffer object. The kernel, however, does not make
the management of that many file descriptors easy (yet), so the real API
uses separate handles for buffer objects and a series of ioctl()
calls.
That said, it is possible to map a buffer object into user space. But then
the user-space driver must take explicit responsibility for the management
of cache coherency. To that end there is a set of ioctl() calls
for managing the "domain" of a buffer; the domain, essentially, describes
which component of the system owns the buffer and is entitled to operate on
it. Changing the domains (there are two, one for read access and one for
writes) of a buffer will perform the necessary cache flushes. In a sense,
this mechanism resembles the streaming DMA API, where the ownership of DMA
buffers can be switched between the CPU and the peripheral controller.
That is not entirely surprising, as a very similar problem is being solved.
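The domain idea can be modeled in a few lines of user-space C. Everything here is a toy under stated assumptions: the DOMAIN_* flags, struct buffer_obj, and set_domain() are hypothetical and do not reproduce the actual GEM ioctl() interface, and a counter stands in for real cache flushes.

```c
#include <stdint.h>

/* Hypothetical domain flags; the real GEM definitions may differ. */
#define DOMAIN_CPU 0x1u
#define DOMAIN_GPU 0x2u

/* Toy buffer object tracking which domains may read and write it. */
struct buffer_obj {
    uint32_t read_domains;   /* domains currently allowed to read */
    uint32_t write_domain;   /* domain currently writing, 0 for none */
    unsigned flushes;        /* count of simulated cache flushes */
};

/*
 * Move a buffer to new read and write domains. If the previous writer
 * differs from any domain about to access the buffer, its caches must
 * be flushed first; the flush is simulated by bumping a counter.
 */
static void set_domain(struct buffer_obj *bo, uint32_t read_domains,
                       uint32_t write_domain)
{
    uint32_t wanted = read_domains | write_domain;

    if (bo->write_domain && (wanted & ~bo->write_domain))
        bo->flushes++;       /* hand-off from the old writer */

    bo->read_domains = read_domains;
    bo->write_domain = write_domain;
}
```

Handing a CPU-written buffer to the GPU costs one flush in this model, while repeated requests from the same owner cost nothing, which is the property that makes explicit domain tracking attractive.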
This API also does away with the need for explicit fence operations.
Instead, a CPU operation which requires access to a buffer will simply
wait, if necessary, for the GPU to finish any outstanding operations
involving that buffer.
Finally, the GEM API does not try to solve the entire problem; a number of
important operations (such as the execution of a set of GPU commands) are
left for the hardware-specific driver to implement. GEM is, thus, quite
specific to the needs of Intel's driver at this time; it does not try for
the same sort of generality that was a goal of TTM. As described by Eric Anholt:
The problem with TTM is that it's designed to expose one general
API for all hardware, when that's not what our drivers want...
We're trying to come at it from the other direction: Implement one
driver well. When someone else implements another driver and finds
that there's code that should be common, make it into a support
library and share it.
The advantage to this approach is that it makes it relatively easy to
create something which works well with Intel drivers. And that may well be
a good start; one working set of drivers is better than none. On the other
hand, that means that a significant amount of work may be required to get
GEM to the point where it can support drivers for other hardware. There
seem to be two points of view on how that might be done: (1) add
capabilities to GEM when needed by other drivers, or (2) have each
driver use its own memory manager.
The first approach is, in many ways, more pleasing. But it implies that
the GEM API could change significantly over time. And that, in turn, could
delay the merging of the whole thing; the GEM API is exported to user
space, and, as a result, must remain compatible as things change. So there
may be resistance to a quick merge of an API which looks like it may yet
have to evolve for some time.
The second approach, instead, is best described by Dave Airlie:
Well the thing is I can't believe we don't know enough to do this
in some way generically, but maybe the TTM vs GEM thing proves its
not possible. So we can then punt to having one memory manager per
driver, but I suspect this will be a maintenance nightmare, so if
people decide this is the way forward, I'm happy to see it
happen. However the person submitting the memory manager n+1 must
damn well be willing to stand behind the interface until time ends,
and explain why they couldn't re-use 1..n memory managers.
One other remaining issue is performance. Keith Whitwell posted some benchmark results showing that the
i915 driver performs significantly worse with either TTM or GEM than
without. Keith Packard gets different
results, though; his tests show that the GEM-based driver is significantly
faster. Clearly there is a need for a set of consistent benchmarks;
performance of graphics drivers is important, but performance cannot be
optimized if it cannot be reliably measured.
The use of anonymous memory also raises some performance concerns: a
first-person shooter game will not provide the same experience if its
blood-and-gore textures must be continually paged in. Anonymous memory can
also be high memory, and, thus, not necessarily accessible via a 32-bit
pointer. Some GPU hardware cannot address high memory; that will likely
force the use of bounce buffers within the kernel. In the end, GEM will
have to prove that it can deliver good performance; GEM's developers are
highly motivated to make their hardware look good, so there is a reasonable
chance that things will work out on this front.
The conclusion to draw from all of this is that the GPU memory management
problem cannot yet be considered solved. GEM might eventually become that
solution, but it is a very new API which still needs a fair amount of
work, and much remains to be done in this area as a whole.
(Thanks to Timo Jyrinki for suggesting this topic.)
Page editor: Jonathan Corbet