Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.35-rc4, released on July 4. "I've been back online for a week, and at least judging by the kinds of patches and pull requests I've been getting, I have to say that I think that having been strict for -rc3 ended up working out pretty well. The diffstat of rc3..rc4 looks quite reasonable, despite the longer window between rc's. And while there were certainly some things that needed fixing, I'm hoping that we'll have a timely 2.6.35 release despite my vacation [...]". For all of the details, both the short-form changelog and the full changelog are available.

The stable kernel floodgates opened up and five separate kernels poured out: 2.6.27.48, 2.6.31.14, 2.6.32.16, 2.6.33.6, and 2.6.34.1. There will be no more 2.6.31 stable kernels, and there will only be one more 2.6.33 release. The others will still see updates for some time to come.

Comments (3 posted)

Quotes of the week

So you've upstreamed the kernel bits, kept the good userspace bits to yourselfs, are stroking them on your lap like some sort of Dr Evil, now why should the upstream kernel maintainers take the burden when you won't actually give them the stuff to really make their hardware work?
-- Dave Airlie

Side note: when a respected information source covers something where you have on-the-ground experience, the result is often to make you wonder how much fecal matter you've swallowed in areas outside your own expertise.
-- Rusty Russell

Comments (3 posted)

Kernel development news

Ask a kernel developer

July 7, 2010

This article was contributed by Greg Kroah-Hartman.

One in a series of columns in which questions are asked of a kernel developer and he tries to answer them. If you have unanswered questions relating to technical or procedural things around Linux kernel development, ask them in the comment section, or email them directly to the author.

How do I figure out who to email about a problem I am having with the kernel? I see a range of messages in the kernel log; how do I go from that to the proper developers who can help me out?

For example, right now I am seeing the following error:

    [  658.831697] [drm:edid_is_valid] *ERROR* Raw EDID:
    [  658.831702] 48 48 48 48 50 50 50 50 20 20 20 20 4c 4c 4c 4c HHHHPPPP    LLLL
    [  658.831705] 50 50 50 50 33 33 33 33 30 30 30 30 36 36 36 36 PPPP333300006666
    [  658.831709] 35 35 35 35 0a 0a 0a 0a 20 20 20 20 20 20 20 20 5555....
Where do I start with tracking this down?

The kernel log is telling you where the problem is; the real trick is tracking down who is responsible for those messages. There are different ways to go about this. You could first try simply grepping the kernel source tree for the error string:

    $ cd linux-2.6
    $ git grep "\*ERROR\* Raw EDID"
	
OK, that didn't work; let's try narrowing the string down a bit:
    $ git grep "Raw EDID"
    drivers/gpu/drm/drm_edid.c:		DRM_ERROR("Raw EDID:\n");
	
OK, now you have a file to look at. But who is responsible for this file? As mentioned previously, you can use the get_maintainer.pl script, passing it the name of the file you are curious about:
    $ ./scripts/get_maintainer.pl -f drivers/gpu/drm/drm_edid.c
    David Airlie <airlied@linux.ie>
    Dave Airlie <airlied@redhat.com>
    Adam Jackson <ajax@redhat.com>
    Zhao Yakui <yakui.zhao@intel.com>
    dri-devel@lists.freedesktop.org
    linux-kernel@vger.kernel.org
	
This shows that the main DRM developer, David Airlie, is the best person to ask, but other developers may also be able to help out. Sending mail to David and CCing the others listed (including the mailing lists) will get the problem in front of those most likely to be able to assist.

Another way to find out what code is responsible for the problem is to look at the name of the function that was writing out the error:

    [  658.831697] [drm:edid_is_valid] *ERROR* Raw EDID:
The function name appears between the [] characters, and we can search for that name to see what code produces the message:
    $ git grep edid_is_valid
    drivers/gpu/drm/drm_edid.c: * drm_edid_is_valid - sanity check EDID data
    drivers/gpu/drm/drm_edid.c:bool drm_edid_is_valid(struct edid *edid)
    drivers/gpu/drm/drm_edid.c:EXPORT_SYMBOL(drm_edid_is_valid);
    drivers/gpu/drm/drm_edid.c:	if (!drm_edid_is_valid(edid)) {
    drivers/gpu/drm/radeon/radeon_combios.c:	if (!drm_edid_is_valid(edid)) {
    include/drm/drm_crtc.h:extern bool drm_edid_is_valid(struct edid *edid);
	
This points again at the drivers/gpu/drm/drm_edid.c file as being responsible for the error message.
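
As an aside, this explains why the first grep came up empty: the "*ERROR*" portion isn't part of the string at the call site, but is added by the DRM_ERROR() macro, which also supplies the "[drm:function]" prefix via __func__. A sketch of how such a macro works (the real definition lives in the DRM headers and is similar in spirit):

    /* Sketch of a DRM_ERROR()-style logging macro: __func__ supplies the
     * "[drm:edid_is_valid]" prefix seen in the log above, and "*ERROR*"
     * is prepended to every message - which is why it never appears in
     * the source as part of the format string. */
    #define DRM_NAME "drm"
    #define DRM_ERROR(fmt, arg...) \
            printk(KERN_ERR "[" DRM_NAME ":%s] *ERROR* " fmt, __func__, ##arg)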

A look at the function drm_edid_is_valid() shows a number of other messages that could have appeared in the kernel log right before this one:

    if (csum) {
	    DRM_ERROR("EDID checksum is invalid, remainder is %d\n", csum);
	    goto bad;
    }

    if (edid->version != 1) {
	    DRM_ERROR("EDID has major version %d, instead of 1\n", edid->version);
	    goto bad;
    }
	
So, when you email the developers and mailing lists found by the get_maintainer.pl script, it is important to provide the entire kernel log, not just the last few lines containing the error: there may be messages a bit higher up that give the developers more information to work with when debugging the problem.

[ Thanks to Peter Favrholdt for sending in this question. ]

Comments (12 posted)

On the scalability of Linus

By Jonathan Corbet
July 2, 2010
The Linux kernel development process stands out in a number of ways; one of those is the fact that there is exactly one person who can commit code to the "official" repository. There are many maintainers looking after various subsystems, but every patch they merge must eventually be accepted by Linus Torvalds if it is to get into the mainline. Linus's unique role affects the process in a number of ways; for example, as this article is being written, Linus has just returned from a vacation which resulted in nothing going into the mainline for a couple of weeks. There are more serious concerns associated with the single-committer model, though, with scalability being near the top of the list.

Some LWN readers certainly remember the 2.1.123 release in September, 1998. That kernel failed to compile due to the incomplete merging (by Linus) of some frame buffer patches. A compilation failure in a development kernel does not seem like a major crisis, but this mistake served as a flash point for anger which had been growing in the development community for a while: patches were increasingly being dropped and people were getting tired of it. At the end of a long and unpleasant discussion, Linus threw up his hands and walked away:

Quite frankly, this particular discussion (and others before it) has just made me irritable, and is ADDING pressure. Instead, I'd suggest that if you have a complaint about how I handle patches, you think about what I end up having to deal with for five minutes.

Go away, people. Or at least don't Cc me any more. I'm not interested, I'm taking a vacation, and I don't want to hear about it any more. In short, get the hell out of my mailbox.

This was, of course, the famous "Linus burnout" episode of 1998. Everything stopped for a while until Linus rested a bit, came back, and started merging patches again. Things got kind of rough again in 2000, leading to Eric Raymond's somewhat sanctimonious curse of the gifted lecture. In 2002, as the 2.5 series was getting going, frustration with dropped patches was, again, on the rise; Rob Landley's "patch penguin" proposal became the basis for yet another extended flame war on the dysfunctional nature of the kernel development process and the "Linus does not scale" problem.

Shortly thereafter, things got a whole lot smoother. There is no doubt as to what changed: the adoption of BitKeeper - for all the flame wars that it inspired - made the kernel development process work. The change to Git improved things even more; it turns out that, given the right tools, Linus scales very well indeed. In 2010, he handles a volume of patches which would have been inconceivable back then and the process as a whole is humming along smoothly.

Your editor, however, is concerned that there may be some clouds on the horizon; might there be another Linus scalability crunch point coming? In the 2.6.34 cycle, Linus established a policy of unpredictable merge window lengths - though that policy has been more talk than fact so far. For 2.6.35, quite a few developers saw the merge window end with no response to their pull requests; Linus simply decided to ignore them. The blowup over the ARM architecture was part of this, but quite a few other trees remained unpulled as well. We have not gone back to the bad old days where patches would simply disappear into the void, and perhaps Linus is just experimenting a bit with the development process to try to encourage different behavior from some maintainers. Still, silently ignoring pull requests does bring back a few memories from that time.

Too many pulls?

A typical development cycle sees more than 10,000 changes merged into the mainline. Linus does not touch most of those patches directly, though; instead, he pulls them from trees managed by subsystem maintainers. How much attention is paid to specific pull requests is not entirely clear; he does look at each closely enough to ensure that it contains what the maintainer said would be there. Some pulls are obviously subjected to closer scrutiny, while others get by with a quick glance. Still, it's clear that every pull request and every patch will require a certain amount of attention and thought before being merged.

The following table summarizes mainline merging activity by Linus over the last ten kernel releases (the 2.6.35 line covers changes through 2.6.35-rc3):

    Release    Pulls               Patches
               Merge win  Total    Direct  Total
    2.6.26     159        426      288     1496
    2.6.27     153        436      339     1413
    2.6.28     150        398      313      954
    2.6.29     129        418      267      896
    2.6.30     145        411      249      618
    2.6.31     187        479      300      788
    2.6.32     185        451      112      789
    2.6.33     176        444      104      605
    2.6.34     118        393       94      581
    2.6.35     160        218       38      405

The two columns under "pulls" show the number of trees pulled during the merge window and during the development cycle as a whole. Note that it's possible that these numbers are too small, since "fast-forward" merges do not necessarily leave any traces in the git history. Linus does very few fast-forward merges, though, so the number of missed merges, if any, will be small.

Linus still directly commits some patches into his repository. The bulk of those come from Andrew Morton, who does not use git to push patches to Linus. In the table above, the "total" column includes changes that went by way of Andrew, while the "direct" column only counts patches that Andrew did not handle.

Some trends are obvious from this table: the number of patches going directly into the mainline has dropped significantly; almost everything goes through somebody else's tree first. What's left for Linus, at this point, is mostly release tagging, urgent fixes, and reverts. Andrew Morton remains the maintainer of last resort for much of the kernel, but, increasingly, changes are not going through his tree. Meanwhile, the number of pulls is staying roughly the same. It is interesting to think about why that might be.

Perhaps there is no need for more pulls despite the increase in the number of subsystem trees over time. Or perhaps we're approaching the natural limit of how many subsystem pull requests one talented benevolent dictator can pay attention to without burning out. After all, it stands to reason that the number of pull requests handled by Linus cannot increase without bound; if the kernel community continues to grow, there must eventually be a scalability bottleneck there. The only real question is where it might be.

[2.6.35 merge paths] If there is a legitimate concern here, then it might be worth contemplating a response before things break down again. One obvious approach would be to change the fact that almost all trees are pulled directly into the mainline; this plot shows just how flat the structure is for 2.6.35. Subsystem maintainers who have earned sufficient trust could handle more of the lower-level pull requests themselves and present a result to Linus that he can merge with relatively little worry. The networking subsystem already works this way; a number of trees feed into David Miller's networking tree before being sent upward. Meanwhile, other pressures have led to the opposite happening with the ARM architecture: there are now several subarchitecture trees which go straight to Linus. The number of ARM pulls seems to have been a clear motivation for Linus to shut things down during the 2.6.35 merge window.

Another solution, of course, would be to empower others to push trees directly into the mainline. It's not clear that anybody is ready for such a radical change in the kernel development process, though. Ted Ts'o's 1998 warning to anybody wanting a "core team" model still bears reading nearly twelve years later.

But if Linus is to retain his central position in Linux kernel development, the community as a whole needs to ensure that the process scales and does not overwhelm him. Doing more merges below him seems like an approach with potential, but the word your editor has heard is that Linus is resistant to too much coalescing of trees; he wants to look stuff over on its way into the mainline. Still, there must be places where this would work. Maybe we need an overall ARM architecture tree again, and perhaps there could be a place for a tree which would collect most driver patches.

The Linux kernel and its development process have a much higher profile than they did even back in 2002. If the process were to choke again due to scalability problems at the top, the resulting mess would be played out in a very public way. While there is no danger of immediate trouble, we should not let the smoothness of the process over the last several years fool us into thinking that it cannot happen again. As with the code itself, it makes sense to think about the next level of scalability issues in the development process before they strike.

Comments (32 posted)

Bcache: Caching beyond just RAM

July 2, 2010

By William Stearns and Kent Overstreet

Kent Overstreet has been working on bcache, which is a Linux kernel module intended to improve the performance of block devices. Instead of using just memory to cache hard drives, he proposes to use one or more solid-state storage devices (SSDs) to cache block data (hence bcache, a block device cache).

The code is largely filesystem agnostic as long as the filesystem has an embedded UUID (which includes the standard Linux filesystems and swap devices). When data is read from the hard drive, a copy is saved to the SSD. Later, when one of those sectors needs to be retrieved again, the kernel checks to see if it's still in page cache. If so, the read comes from RAM just like it always has on Linux. If it's not in RAM but it is on the SSD, it's read from there. It's like we've added 64GB or more of - slower - RAM to the system and devoted it to caching.

The design of bcache allows the use of more than one SSD to perform caching. It is also possible to cache more than one existing filesystem, or choose instead to just cache a small number of performance-critical filesystems. It would be perfectly reasonable to cache a partition used by a database manager, but not cache a large filesystem holding archives of old projects. The standard Linux page cache can be wiped out by copying a few large (near the size of your system RAM) files. Large file copies on that project archive partition won't wipe out an SSD-based cache using bcache.

Another potential use is local media caching remote disks. You could use an existing partition, drive, or loopback device to cache any of the following: an AOE (ATA-over-Ethernet) drive, a SAN LUN, DRBD or NBD remote drives, iSCSI drives, a local CD, or slow local USB thumb drives. The local cache wouldn't have to be an SSD; it would just have to be faster than the media being cached.

Note that this type of caching is only for block devices (anything that shows up as a block device in /dev/). It isn't for network filesystems like NFS, CIFS, and so on (see the FS-cache module for the ability to cache individual files on an NFS or AFS client).

Implementation

To intercept filesystem operations, bcache hooks into the top of the block layer, in __generic_make_request(). It thus works entirely in terms of BIO structures. By hooking into the sole function through which all disk requests pass, bcache doesn't need to make any changes to block device naming or filesystem mounting. If /dev/md5 was originally mounted on /usr/, it continues to show up as /dev/md5 mounted on /usr/ after bcache is enabled for it. Because the caching is transparent, there are no changes to the boot process; in fact, bcache could be turned on long after the system is up and running. This approach of intercepting BIO requests as they pass through the block layer allows us to start and stop caching on the fly, to add and remove cache devices, and to boot with or without bcache.
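
To give a feel for the mechanics, here is a minimal sketch of a make_request-level hook in the style of a stacking block driver. Bcache itself hooks a level higher, inside __generic_make_request() itself, but the flavor is the same; cache_lookup() here is a hypothetical helper standing in for bcache's actual index lookup:

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/types.h>

    /* Hypothetical helper: returns true if the bio was filled from the SSD. */
    static bool cache_lookup(struct bio *bio);

    static make_request_fn *lower_make_request; /* the device's original handler */

    static int cached_make_request(struct request_queue *q, struct bio *bio)
    {
            if (bio_data_dir(bio) == READ && cache_lookup(bio)) {
                    bio_endio(bio, 0);      /* hit: complete the IO immediately */
                    return 0;
            }

            /* Miss (or a write): pass the bio down to the real device. */
            return lower_make_request(q, bio);
    }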

Bcache's design focuses on avoiding random writes and playing to an SSD's strengths. Roughly, a cache device is divided up into buckets, which are intended to match the SSD's erase blocks. Each bucket has an eight-bit generation number which is maintained in a separate array on the SSD just past the superblock. Pointers (both to btree buckets and to cached data) contain the generation number of the bucket they point to; thus, to free and reuse a bucket, it is sufficient to increment the generation number.

This mechanism allows bcache to keep the cache device completely full; when it wants to write some new data, it just picks a bucket and increments its generation number, invalidating all the existing pointers to it. Garbage collection will remove the actual pointers eventually; there is no need for backpointers or any other infrastructure.

For each bucket, bcache remembers the generation number from the last time a full garbage collection was performed. Once the difference between the current generation number and the remembered number reaches 64, it's time for another garbage collection. Since collection thus runs long before the counter could ever wrap, an eight-bit generation number is sufficient.
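
A toy illustration of the scheme (the types and names here are invented for the example, not bcache's actual structures): a pointer is live only while its recorded generation matches the bucket's, and the 64-generation collection threshold guarantees that the eight-bit counter can never be lapped:

    #include <linux/types.h>

    #define GC_THRESHOLD 64

    struct bucket {
            u8 gen;          /* current generation */
            u8 last_gc_gen;  /* generation at the last full garbage collection */
    };

    /* A pointer into a bucket is live only while its recorded generation
     * matches the bucket's current one. */
    static bool ptr_stale(const struct bucket *b, u8 ptr_gen)
    {
            return b->gen != ptr_gen;
    }

    /* "Freeing" a bucket is just a generation bump: every existing
     * pointer into it instantly becomes stale. */
    static void bucket_invalidate(struct bucket *b)
    {
            b->gen++;
    }

    /* u8 arithmetic wraps modulo 256; since collection runs once the gap
     * reaches 64, the counter is never lapped and eight bits suffice. */
    static bool bucket_needs_gc(const struct bucket *b)
    {
            return (u8)(b->gen - b->last_gc_gen) >= GC_THRESHOLD;
    }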

Unlike standard btrees, bcache's btrees aren't kept fully sorted, so if you want to insert a key you don't have to rebalance the whole thing. Rather, they're kept sorted according to when they were written out; if the first ten pages are already on disk, bcache will insert into the 11th page, in sorted order, until it's full. During garbage collection (and, in the future, during insertion if there are too many sets) it'll re-sort the whole bucket. This means bcache doesn't have much of the index pinned in memory, but it also doesn't have to do much work to keep the index written out. Compare that to a hash table of ten million or so entries and the advantages are obvious.
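
One consequence of keeping several independently sorted sets per page is that a lookup must consult each of them, newest first, so that a recently written key shadows an older version. A sketch of how that might look (illustrative only; struct bkey, MAX_SETS, and the layout are invented for the example):

    #include <linux/types.h>

    #define MAX_SETS 16                     /* made-up bound for the example */

    struct bkey { u64 key; /* ... pointer, generation, etc. ... */ };

    struct bset {                           /* one sorted run of keys */
            size_t nr;
            struct bkey *keys;              /* sorted in ascending order */
    };

    struct btree_page {
            unsigned nr_sets;
            struct bset sets[MAX_SETS];     /* sets[nr_sets - 1] is the newest */
    };

    /* Binary search within a single sorted set. */
    static struct bkey *bset_bsearch(struct bset *s, u64 search)
    {
            size_t lo = 0, hi = s->nr;

            while (lo < hi) {
                    size_t mid = lo + (hi - lo) / 2;

                    if (s->keys[mid].key < search)
                            lo = mid + 1;
                    else
                            hi = mid;
            }
            return (lo < s->nr && s->keys[lo].key == search) ? &s->keys[lo] : NULL;
    }

    /* Search the newest set first; garbage collection will eventually
     * merge the sets back into a single sorted run. */
    static struct bkey *btree_page_lookup(struct btree_page *p, u64 search)
    {
            int i;

            for (i = p->nr_sets - 1; i >= 0; i--) {
                    struct bkey *k = bset_bsearch(&p->sets[i], search);

                    if (k)
                            return k;
            }
            return NULL;
    }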

State of the code

Currently, the code is looking fairly stable; it's survived overnight torture tests. Production is still a ways away, and there are some corner cases and IO error handling to flesh out, but more testing would be very welcome at this point.

There's a long list of planned features:

IO tracking: By keeping a hash of the most recent IOs, it's possible to detect sequential IO and bypass the cache; large file copies, backups, and RAID resyncs will all bypass it. This one's mostly implemented.

Write-behind caching: Synchronous writes are becoming more and more of a problem for many workloads, but with bcache, random writes become sequential writes. Given the ability to buffer 50 or 100GB of writes, many of them might be rewritten before they ever hit the RAID array.

The initial write-behind caching implementation isn't far off, but at first it'll work by flushing new btree pages out to disk sooner, before they fill up; journaling won't be required, because the only metadata that must be written out in order is a single index. Since we have garbage collection, we can mark buckets as in use before we use them, and leaking free space is a non-issue. (This is analogous to soft updates - only much more practical.) Journaling would still be advantageous, though: all new keys could be flushed out sequentially, with btree updates happening as pages fill up rather than as many smaller writes, so that synchronous writes can complete quickly. Since bcache's btree is very flat, this won't be much of an issue for most workloads, but it should still be worthwhile.

Multiple cache devices have been planned from the start, and are mostly implemented. Suppose you had multiple SSDs to use - you could stripe them, but then you'd have no redundancy, which is a problem for write-behind caching. Or you could mirror them, but then you're pointlessly duplicating data that's present elsewhere. Bcache will be able to mirror only the dirty data, then drop one of the copies when it's flushed out.

Checksumming is a ways off, but definitely planned; it'll keep checksums of all the cached data in the btree, analogous to what Btrfs does. If a checksum doesn't match, that data can be simply tossed, the error logged, and the data read from the backing device or redundant copy.

There's also a lot of room for experimentation and potential improvement in the various heuristics. Right now the cache functions in a least-recently-used (LRU) mode, but it's flexible enough to allow for other schemes. Potentially, we can retain data based on how much real time it saves the backing device, calculated from both the seek time and bandwidth.
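
As a hedged sketch of what a pluggable replacement policy might look like (purely illustrative; these names and fields are invented, not bcache's internals), an LRU mode could stamp each bucket with a logical clock on every hit and reuse the bucket with the oldest stamp:

    #include <linux/types.h>

    struct cache_bucket {
            u64 last_used;          /* logical time of the most recent hit */
    };

    static u64 cache_clock;         /* advances on every cache access */

    static void bucket_touch(struct cache_bucket *b)
    {
            b->last_used = ++cache_clock;
    }

    /* Pick the least recently used bucket for reuse.  A real
     * implementation would keep a heap or a list rather than doing a
     * linear scan. */
    static struct cache_bucket *bucket_alloc_lru(struct cache_bucket *buckets,
                                                 size_t nr)
    {
            struct cache_bucket *victim = &buckets[0];
            size_t i;

            for (i = 1; i < nr; i++)
                    if (buckets[i].last_used < victim->last_used)
                            victim = &buckets[i];

            return victim;
    }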

Sample performance numbers

Of course, performance is the only reason to use bcache, so benchmarks matter. Unfortunately, there's still an odd bug affecting buffered IO, so the current benchmarks don't yet fully reflect bcache's potential; they are more a measure of current progress. Bonnie isn't particularly indicative of real-world performance, but it has the advantages of familiarity and easy interpretation; here is the bonnie output:

Uncached: 2TB Western Digital Green SATA hard drive

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   672  91 68156   7 36398   4  2837  98 102864   5 269.3   2
Latency             14400us    2014ms   12486ms   18666us     549ms     460ms

And now cached with a 64GB Corsair Nova:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
utumno          16G   536  92 70825   7 53352   7  2785  99 181433  11  1756  15
Latency             14773us    1826ms    3153ms    3918us    2212us   12480us

In these numbers, the per-character columns are mostly irrelevant for our purposes, as they're affected by other parts of the kernel. The write and rewrite numbers are interesting only in that they don't go down, since bcache isn't doing write-behind caching yet. The sequential input test reads data that bonnie previously wrote, so it should all be coming from the SSD; that's where bcache is currently lacking, as the SSD is capable of about 235MB/s. The random IO numbers are actually about 90% reads and 10% writes of 4k each; without write-behind caching, bonnie is bottlenecked on the writes hitting the spinning metal disk, and bcache isn't far off from the theoretical maximum.

For more information

The bcache wiki holds more details about the software, more formal benchmark numbers, and sample commands for getting started.

The git repository for the kernel code is available at git://evilpiepirate.org/~kent/linux-bcache.git. The userspace tools are in a separate repository: git://evilpiepirate.org/~kent/bcache-tools.git. Both are viewable with a web browser at the gitweb site.

Comments (50 posted)

Patches and updates

Kernel trees

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Architecture-specific

Virtualization and containers

Miscellaneous

Page editor: Jake Edge

Copyright © 2010, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds