Brief items
The current development kernel is 2.6.35-rc4, released on July 4. "I've been back online for a week, and at least judging by the kinds of
patches and pull requests I've been getting, I have to say that I
think that having been strict for -rc3 ended up working out pretty
well. The diffstat of rc3..rc4 looks quite reasonable, despite the
longer window between rc's. And while there were certainly some things
that needed fixing, I'm hoping that we'll have a timely 2.6.35 release
despite my vacation [...]". For all of the details, both the short-form
changelog and the full
changelog are available.
The stable kernel floodgates opened up and five separate kernels poured
out: 2.6.27.48, 2.6.31.14, 2.6.32.16, 2.6.33.6, and 2.6.34.1. There will be no more 2.6.31 stable
kernels, and there will only be one more 2.6.33 release. The others will
still see updates for some time to come.
Comments (3 posted)
So you've upstreamed the kernel bits, kept the good userspace bits
to yourselfs, are stroking them on your lap like some sort of Dr
Evil, now why should the upstream kernel maintainers take the
burden when you won't actually give them the stuff to really make
their hardware work?
--
Dave Airlie
Side note: when a respected information source covers something where you
have on-the-ground experience, the result is often to make you wonder how
much fecal matter you've swallowed in areas outside your own expertise.
--
Rusty Russell
Comments (3 posted)
Kernel development news
One in a series of columns in which questions are asked of a kernel
developer and he tries to answer them. If you have unanswered questions
relating to technical or procedural things around Linux kernel
development, ask them in the comment section, or email them directly to
the author.
How do I figure out who to email about a problem I am having with the
kernel? I see a range of messages in the kernel log; how do I go from
that to the proper developers who can help me out?
For example, right now I am seeing the following error:
[ 658.831697] [drm:edid_is_valid] *ERROR* Raw EDID:
[ 658.831702] 48 48 48 48 50 50 50 50 20 20 20 20 4c 4c 4c 4c HHHHPPPP LLLL
[ 658.831705] 50 50 50 50 33 33 33 33 30 30 30 30 36 36 36 36 PPPP333300006666
[ 658.831709] 35 35 35 35 0a 0a 0a 0a 20 20 20 20 20 20 20 20 5555....
Where do I start with tracking this down?
The kernel log is telling you where the problem is, so the real trick is
going to be tracking down who is responsible for those messages.
There are different ways to go about this. You could first try to just
grep the kernel source tree for the error string:
$ cd linux-2.6
$ git grep "\*ERROR\* Raw EDID"
OK, that didn't work; let's try narrowing down the string a bit:
$ git grep "Raw EDID"
drivers/gpu/drm/drm_edid.c: DRM_ERROR("Raw EDID:\n");
OK, now you have a file to look at. But who is responsible for this
file? As mentioned previously, you can use the
get_maintainer.pl script by passing it the
filename you are curious about:
$ ./scripts/get_maintainer.pl -f drivers/gpu/drm/drm_edid.c
David Airlie <airlied@linux.ie>
Dave Airlie <airlied@redhat.com>
Adam Jackson <ajax@redhat.com>
Zhao Yakui <yakui.zhao@intel.com>
dri-devel@lists.freedesktop.org
linux-kernel@vger.kernel.org
This shows that the main DRM developer, David Airlie, is the best person
to ask, but other developers may also be able to help out. Sending
mail to David, and CCing the others listed (including the mailing lists)
will get it in front of those who are most likely to be able to assist.
Another way to find out what code is responsible for the problem is to
look at the name of the function that was writing out the error:
[ 658.831697] [drm:edid_is_valid] *ERROR* Raw EDID:
The function name is contained within the
[] characters, and we can search for that name
to see where it is defined and called:
$ git grep edid_is_valid
drivers/gpu/drm/drm_edid.c: * drm_edid_is_valid - sanity check EDID data
drivers/gpu/drm/drm_edid.c:bool drm_edid_is_valid(struct edid *edid)
drivers/gpu/drm/drm_edid.c:EXPORT_SYMBOL(drm_edid_is_valid);
drivers/gpu/drm/drm_edid.c: if (!drm_edid_is_valid(edid)) {
drivers/gpu/drm/radeon/radeon_combios.c: if (!drm_edid_is_valid(edid)) {
include/drm/drm_crtc.h:extern bool drm_edid_is_valid(struct edid *edid);
This points again at the
drivers/gpu/drm/drm_edid.c file as
being responsible for the error message.
Looking at the function drm_edid_is_valid, there are a
number of other messages that could have been produced in the kernel log
right before this one:
if (csum) {
DRM_ERROR("EDID checksum is invalid, remainder is %d\n", csum);
goto bad;
}
if (edid->version != 1) {
DRM_ERROR("EDID has major version %d, instead of 1\n", edid->version);
goto bad;
}
So when you email the developers and mailing lists found by the
get_maintainer.pl script, it is important to provide the entire
kernel log, not just the last few lines containing the error; there may be
additional information a bit higher up that the developers can use to help
debug the problem.
[ Thanks to Peter Favrholdt for sending in this question. ]
Comments (12 posted)
By Jonathan Corbet
July 2, 2010
The Linux kernel development process stands out in a number of ways; one of
those is the fact that there is exactly one person who can commit code to
the "official" repository. There are many maintainers looking after
various subsystems, but every patch they merge must eventually be accepted
by Linus Torvalds if it is to get into the mainline. Linus's unique role
affects the process in a number of ways; for example, as this article is
being written, Linus has just returned from a vacation which resulted in
nothing going into the mainline
for a couple of weeks. There are more serious concerns associated with the
single-committer model, though, with scalability
being near the top of the list.
Some LWN readers certainly remember the 2.1.123 release in September,
1998. That kernel failed to compile due to the incomplete merging (by
Linus) of some frame buffer patches. A compilation failure in a
development kernel does not seem like a major crisis, but this mistake
served as a flash point for anger which had been growing in the development
community for a while: patches were increasingly being dropped and people
were getting tired of it. At the end of a long and unpleasant discussion,
Linus threw
up his hands and walked away:
Quite frankly, this particular discussion (and others before it)
has just made me irritable, and is ADDING pressure. Instead, I'd
suggest that if you have a complaint about how I handle patches,
you think about what I end up having to deal with for five minutes.
Go away, people. Or at least don't Cc me any more. I'm not
interested, I'm taking a vacation, and I don't want to hear about
it any more. In short, get the hell out of my mailbox.
This was, of course, the famous "Linus burnout" episode of 1998.
Everything stopped for a while until Linus rested a bit, came back, and
started merging patches again. Things got kind of rough again in 2000,
leading to Eric Raymond's somewhat sanctimonious curse of the gifted
lecture. In 2002, as the 2.5 series was getting going, frustration
with dropped patches was, again, on the rise; Rob Landley's "patch penguin"
proposal became the basis for yet another extended flame war on the
dysfunctional nature of the kernel development process and the "Linus does not
scale" problem.
Shortly thereafter, things got a whole lot smoother. There is no doubt as
to what changed: the adoption of BitKeeper - for all the flame wars that it
inspired - made the kernel development process work. The change to Git
improved things even more; it turns out that,
given the right tools, Linus scales very well indeed. In 2010, he handles
a volume of patches which would have been inconceivable back then and the
process as a whole is humming along smoothly.
Your editor, however, is concerned that there may be some clouds on the
horizon; might there be another Linus scalability crunch point coming? In
the 2.6.34 cycle, Linus established a policy of unpredictable merge window
lengths - though that policy has been more talk than fact so far. For
2.6.35, quite a few developers saw the merge window end with no response to
their pull requests; Linus simply decided to ignore them. The blowup over the ARM architecture
was part of this, but quite a few other trees remained unpulled as well.
We have not gone back to the bad old days where patches would simply
disappear into the void, and perhaps Linus is just experimenting a
bit with the development process to try to encourage different behavior from
some maintainers. Still, silently ignoring pull requests does bring
back a few memories from that time.
Too many pulls?
A typical development cycle sees more than 10,000 changes merged into the
mainline. Linus does not touch most of those patches directly, though;
instead, he pulls them from trees managed by subsystem maintainers. How
much attention is paid to specific pull requests is not entirely clear;
presumably he looks at each closely enough to ensure that it contains what the
maintainer said would be there. Some pulls are obviously subjected to
closer scrutiny, while others get by with a quick glance. Still, it's
clear that every pull request and every patch will require a certain amount
of attention and thought before being merged.
The following table summarizes mainline merging activity by Linus over the
last ten kernel releases (the 2.6.35 line covers changes through 2.6.35-rc3):
  Release     Pulls                       Patches
              Merge window   Total        Direct   Total
  2.6.26           159        426           288     1496
  2.6.27           153        436           339     1413
  2.6.28           150        398           313      954
  2.6.29           129        418           267      896
  2.6.30           145        411           249      618
  2.6.31           187        479           300      788
  2.6.32           185        451           112      789
  2.6.33           176        444           104      605
  2.6.34           118        393            94      581
  2.6.35           160        218            38      405
The two columns under "pulls" show the number of trees pulled during the
merge window and during the development cycle as a whole. Note that it's
possible that these numbers are too small, since "fast-forward" merges do
not necessarily leave any traces in the git history. Linus does very few
fast-forward merges, though, so the number of missed merges, if any, will
be small.
Linus still directly commits some patches into his repository. The bulk of
those come from Andrew Morton, who does not use git to push patches to
Linus. In the table above, the "total" column includes changes that went
by way of Andrew, while the "direct" column only counts patches that Andrew
did not handle.
Some trends are obvious from this table: the number of patches going
directly into the mainline has dropped significantly; almost everything
goes through somebody else's tree first. What's left for Linus, at this
point, is mostly release tagging, urgent fixes, and reverts. Andrew Morton
remains the maintainer of last resort for much of the kernel, but,
increasingly, changes are not going through his tree. Meanwhile, the
number of pulls is staying roughly the same. It is interesting to think
about why that might be.
Perhaps there is no need for more pulls despite the increase in the number
of subsystem trees over time. Or perhaps we're approaching the natural
limit of how many subsystem pull requests one talented benevolent dictator
can pay attention to without burning out. After all, it stands to reason
that the number of pull requests handled by Linus cannot increase without
bound; if the kernel community continues to grow, there must eventually be
a scalability bottleneck there. The only real question is where it might
be.
If there is a legitimate concern here, then it might be worth contemplating
a response before things break down again. One obvious approach would be
to change the fact that almost all trees are pulled directly into the
mainline; this plot shows just how
flat the structure is for 2.6.35.
Subsystem maintainers who have earned sufficient trust could possibly handle more
lower-level pull requests and present a result to Linus that he can merge
with relatively little worry. The networking subsystem already works this
way; a number of trees feed into David Miller's networking tree before
being sent upward. Meanwhile, other pressures have led to the opposite
thing happening with the ARM architecture: there are now several
subarchitecture trees which go straight to Linus. The number of ARM pulls
seems to have been a clear motivation for Linus to shut things down during
the 2.6.35 merge window.
Another solution, of course, would be to empower others to push trees
directly into the mainline. It's not clear that anybody is ready for such
a radical change in the kernel development process, though. Ted Ts'o's 1998 warning to anybody
wanting a "core team" model still bears reading nearly twelve years later.
But if Linus is to retain his central position in Linux kernel development,
the community as a whole needs to ensure that the process scales and does
not overwhelm him. Doing more merges below him seems like an approach that
could have potential, but the word your editor has heard is that Linus is
resistant to too much coalescing of trees; he wants to look stuff over on
its way into the mainline. Still, there must be places where this would
work. Maybe we need an overall ARM architecture tree again, and perhaps
there could be a place for a tree which would collect most driver patches.
The Linux kernel and its development process have a much higher profile
than they did even back in 2002. If the process were to choke again due to
scalability problems at the top, the resulting mess would be played out in
a very public way. While there is no danger of immediate trouble, we should
not let the smoothness of the process over the last several years fool us
into thinking that it cannot happen again. As with the code itself,
it makes sense to think about the next level of scalability issues in the
development process before they strike.
Comments (32 posted)
July 2, 2010
By William Stearns and Kent Overstreet
Kent Overstreet has been working on bcache, which is a Linux kernel
module intended to improve the performance of block devices.
Instead of using just memory to cache hard drives, he proposes
to use one or more solid-state storage devices (SSDs) to cache block data
(hence bcache, a block device cache).
The code is largely filesystem agnostic as long as the
filesystem has an embedded UUID (which includes the standard Linux
filesystems and swap devices). When data is read from the hard drive, a
copy is saved to the SSD. Later, when one of those sectors needs to be
retrieved again, the kernel checks to see if it's still in page cache. If so,
the read comes from RAM just like it always has on Linux. If it's not
in RAM but it is on the SSD, it's read from there. It's like we've added
64GB or more of - slower - RAM to the system and devoted it to
caching.
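To make that lookup order concrete, here is a minimal user-space sketch; the
names are invented for illustration and plain arrays stand in for the real
devices, so this is not bcache code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NSECTORS    16

/* Three tiers of storage, fastest to slowest (all just arrays here). */
static char ram[NSECTORS][SECTOR_SIZE];   /* the page cache           */
static bool ram_valid[NSECTORS];
static char ssd[NSECTORS][SECTOR_SIZE];   /* the SSD cache            */
static bool ssd_valid[NSECTORS];
static char disk[NSECTORS][SECTOR_SIZE];  /* the slow backing device  */

static void read_sector(unsigned s, char *buf)
{
        if (ram_valid[s]) {                       /* 1: page cache hit */
                memcpy(buf, ram[s], SECTOR_SIZE);
                return;
        }
        if (ssd_valid[s]) {                       /* 2: SSD cache hit  */
                memcpy(buf, ssd[s], SECTOR_SIZE);
        } else {                                  /* 3: go to the disk */
                memcpy(buf, disk[s], SECTOR_SIZE);
                memcpy(ssd[s], disk[s], SECTOR_SIZE);
                ssd_valid[s] = true;              /* populate the SSD  */
        }
        memcpy(ram[s], buf, SECTOR_SIZE);         /* keep a copy in RAM */
        ram_valid[s] = true;
}

int main(void)
{
        char buf[SECTOR_SIZE];

        strcpy(disk[3], "hello from the spinning disk");
        read_sector(3, buf);    /* miss: read disk, populate SSD and RAM */
        read_sector(3, buf);    /* hit: served straight from RAM         */
        puts(buf);
        return 0;
}

The first read comes off the "disk" and is copied into both caches; the second
is served from RAM, just as described above.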
The design of bcache allows the use of more than one SSD to
perform caching. It is also possible to cache more than one existing
filesystem,
or choose instead to just cache a small number of performance-critical
filesystems. It would be perfectly reasonable to cache a partition used
by a database manager, but not cache a large filesystem holding archives of old
projects. The standard Linux page cache can be wiped out
by copying a few large (near the size of your system RAM) files. Large
file copies on that project archive partition won't wipe out an SSD-based cache
using bcache.
Another potential use is caching remote disks with local media. You could
use an existing partition/drive/loopback device to
cache any of the following: AOE (ATA-over-Ethernet) drive, SAN LUN, DRBD
or NBD remote drives, iSCSI drives, local CD, or local (slow) USB thumb
drives. The local cache wouldn't have to be an SSD, it would just have
to be faster than the media you're caching.
Note that this type of caching is only for block devices
(anything that shows up as a block device in /dev/). It isn't for
network filesystems like NFS, CIFS, and so on (see the FS-cache module for the
ability to cache individual files on an NFS or AFS client).
Implementation
To intercept filesystem operations, bcache hooks into the top
of the block layer, in __generic_make_request(). It thus works
entirely in terms of BIO structures.
By hooking into the sole function through which all disk
requests pass, bcache doesn't need to make any changes to block device naming
or filesystem mounting. If /dev/md5 was originally mounted on
/usr/, it
continues to show up as /dev/md5 mounted on /usr/ after
bcache is
enabled for it. Because the caching is transparent, there are no
changes to the boot process; in fact, bcache could be turned on long
after the system is up and running. This approach of intercepting bio
requests in the background allows us to start and stop caching on the
fly, to add and remove cache devices, and to boot with or without
bcache.
bcache's design focuses on avoiding random writes and playing
to the strengths of SSDs. Roughly, a cache device is divided up into buckets,
which are intended to match the SSD's erase blocks. Each
bucket has an eight-bit generation number which is maintained in a separate
array
on the SSD just past the superblock. Pointers (both to btree buckets
and to cached data) contain the generation number of the bucket they
point to; thus to free and reuse a bucket, it is sufficient to increment
the generation number.
This mechanism allows bcache to keep the cache device completely
full; when it wants to write some new data, it just picks a bucket and
increments its generation number, invalidating all the existing pointers to
it. Garbage collection will remove the actual pointers eventually; there
is no need for backpointers or any other infrastructure.
For each bucket, bcache remembers the generation number from
the last time a full garbage collection was performed on it. Once the
difference between the current generation number and the remembered number
reaches 64, it's time for another garbage collection. Since collection thus
happens long before the counter could wrap, an 8-bit generation number is
sufficient.
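To make the generation arithmetic concrete, here is a small user-space sketch
of the scheme just described; the structure and function names are invented
for illustration and are not taken from the bcache source:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One cache bucket; the generation numbers live in an array on the SSD
   just past the superblock. */
struct bucket {
        uint8_t gen;          /* current generation                        */
        uint8_t last_gc_gen;  /* generation at the last garbage collection */
};

/* Pointers into the cache carry the generation they were created under. */
struct cache_ptr {
        unsigned bucket_index;
        uint8_t  gen;
};

/* A pointer is stale once its bucket's generation has moved on. */
static bool ptr_stale(const struct bucket *buckets, const struct cache_ptr *p)
{
        return buckets[p->bucket_index].gen != p->gen;
}

/* Reusing a bucket is just a generation bump: every existing pointer into
   it becomes stale at once, and garbage collection removes them later. */
static void invalidate_bucket(struct bucket *b)
{
        b->gen++;
}

/* Unsigned 8-bit subtraction handles wraparound, and a collection is forced
   every 64 generations, long before the counter could wrap all the way. */
static bool needs_gc(const struct bucket *b)
{
        return (uint8_t)(b->gen - b->last_gc_gen) >= 64;
}

int main(void)
{
        struct bucket buckets[1] = { { .gen = 10, .last_gc_gen = 10 } };
        struct cache_ptr p = { .bucket_index = 0, .gen = 10 };

        assert(!ptr_stale(buckets, &p));
        invalidate_bucket(&buckets[0]);      /* reuse the bucket           */
        assert(ptr_stale(buckets, &p));      /* old pointer is now ignored */

        buckets[0].gen += 63;                /* 64 generations later...    */
        assert(needs_gc(&buckets[0]));       /* ...time to collect again   */
        puts("generation scheme behaves as described");
        return 0;
}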
Unlike standard btrees, bcache's btrees aren't kept fully
sorted, so if you want to insert a key you don't have to rebalance the
whole thing. Rather, they're kept sorted according to when they were
written out; if the first ten pages are already on disk, bcache will insert into
the 11th page, in sorted order, until it's full. During garbage
collection (and, in the future, during insertion if there are too many sets)
it'll re-sort the whole bucket. This means bcache
doesn't have much of the index pinned in memory, but it also doesn't have to do
much work to keep the index written out. Compare that to a hash table of ten
million or so entries and the advantages are obvious.
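As a rough model of that layout, the sketch below keeps several independently
sorted sets per node, inserts only into the newest one, and merges everything
back together when asked; the names, sizes, and overflow handling are all
simplifications invented for this illustration, not bcache's actual on-disk
format:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SETS     16
#define KEYS_PER_SET 256

/* One independently sorted run of keys; older sets are already on disk,
   and only the newest set still accepts insertions. */
struct bset {
        unsigned nkeys;
        uint64_t keys[KEYS_PER_SET];
};

/* A btree node: a sequence of sorted sets, ordered by when they were
   written out.  This sketch assumes a node starts with one empty set. */
struct bnode {
        unsigned    nsets;
        struct bset sets[MAX_SETS];
};

/* Insert into the newest set only, keeping just that one set sorted; the
   sets already written to disk are never rewritten on insert. */
static void bnode_insert(struct bnode *n, uint64_t key)
{
        struct bset *s = &n->sets[n->nsets - 1];
        unsigned i;

        if (s->nkeys == KEYS_PER_SET)       /* open a fresh set when full */
                s = &n->sets[n->nsets++];   /* (overflow handling elided) */

        i = s->nkeys++;
        while (i > 0 && s->keys[i - 1] > key) {   /* one insertion step   */
                s->keys[i] = s->keys[i - 1];
                i--;
        }
        s->keys[i] = key;
}

/* A lookup has to examine every sorted set in turn. */
static bool bnode_contains(const struct bnode *n, uint64_t key)
{
        for (unsigned si = 0; si < n->nsets; si++)
                for (unsigned i = 0; i < n->sets[si].nkeys; i++)
                        if (n->sets[si].keys[i] == key)
                                return true;
        return false;
}

static int cmp_u64(const void *a, const void *b)
{
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
}

/* Garbage collection merges everything back into densely packed, fully
   sorted sets, so lookups get cheap again. */
static void bnode_resort(struct bnode *n)
{
        uint64_t merged[MAX_SETS * KEYS_PER_SET];
        unsigned total = 0;

        for (unsigned si = 0; si < n->nsets; si++) {
                memcpy(merged + total, n->sets[si].keys,
                       n->sets[si].nkeys * sizeof(merged[0]));
                total += n->sets[si].nkeys;
        }
        qsort(merged, total, sizeof(merged[0]), cmp_u64);

        memset(n->sets, 0, sizeof(n->sets));
        n->nsets = 0;
        for (unsigned off = 0; off < total; off += KEYS_PER_SET) {
                unsigned chunk = total - off < KEYS_PER_SET ?
                                 total - off : KEYS_PER_SET;
                memcpy(n->sets[n->nsets].keys, merged + off,
                       chunk * sizeof(merged[0]));
                n->sets[n->nsets++].nkeys = chunk;
        }
        if (n->nsets == 0)
                n->nsets = 1;               /* keep one empty, open set */
}

int main(void)
{
        static struct bnode node = { .nsets = 1 };

        for (uint64_t k = 1000; k > 0; k--)
                bnode_insert(&node, k);     /* fills several sets        */
        bnode_resort(&node);
        return bnode_contains(&node, 500) ? 0 : 1;
}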
State of the code
Currently, the code is looking fairly stable; it's survived overnight torture
tests. Production is still a ways away, and there are some corner cases and IO
error handling to flesh out, but more testing would be very welcome
at this point.
There's a long list of planned features:
IO tracking: By keeping a hash of the most recent IOs, it's possible to detect
sequential IO and bypass the cache; large file copies, backups, and RAID
resyncs will all skip the cache. This one's mostly implemented.
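One hypothetical sketch of this kind of sequential-IO detection, with invented
names and an arbitrary threshold (the real heuristic may well differ), might
look like this:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_RECENT 64    /* small table of recently seen IO streams */

/* Remember where each recent stream of IO left off. */
struct io_track {
        uint64_t next_sector;   /* sector where the stream would continue */
        unsigned sequential;    /* contiguous requests seen so far        */
};

static struct io_track recent[NR_RECENT];

/* Called for each incoming request: returns true if it looks like part of
   a large sequential stream (big copy, backup, RAID resync) that should
   bypass the cache rather than pollute it. */
static bool should_bypass(uint64_t sector, unsigned nr_sectors)
{
        struct io_track *prev = &recent[sector % NR_RECENT];
        unsigned seq = (prev->next_sector == sector) ? prev->sequential + 1 : 1;
        struct io_track *next = &recent[(sector + nr_sectors) % NR_RECENT];

        next->next_sector = sector + nr_sectors;  /* track the continuation */
        next->sequential  = seq;

        return seq > 8;      /* arbitrary cut-off for "sequential enough" */
}

int main(void)
{
        uint64_t sector = 0;

        for (int i = 0; i < 12; i++, sector += 8)  /* a 12-request stream */
                printf("request %2d at sector %3llu: %s\n", i,
                       (unsigned long long)sector,
                       should_bypass(sector, 8) ? "bypass" : "cache");
        return 0;
}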
Write behind caching: Synchronous writes are becoming more and more of a
problem for many workloads, but with bcache random writes become sequential
writes. If you've got the ability to buffer 50 or 100GB of writes, many
might never hit your RAID array before being rewritten.
The initial write behind caching implementation isn't far off, but at first
it'll work by flushing out new btree pages to disk quicker, before they fill
up - journaling won't be required because the only metadata to write out in
order is a single index. Since we have garbage collection, we can mark
buckets as in-use before we use them, so leaking free space is a non-issue.
(This is analogous to soft updates - only much more practical). However, journaling
would still be advantageous so that all new keys can be flushed out sequentially;
updates to the btree could then happen as pages fill up, rather than as many smaller
writes, so that synchronous writes can be completed quickly. Since bcache's btree
is very flat, this won't be much of an issue for most workloads, but should
still be worthwhile.
Multiple cache devices have been planned from the start, and mostly
implemented. Suppose you had multiple SSDs to use - you could stripe them, but
then you have no redundancy, which is a problem for writeback caching. Or you
could mirror them, but then you're pointlessly duplicating data that's present
elsewhere. Bcache will be able to mirror only the dirty data, and then drop
one of the copies when it's flushed out.
Checksumming is a ways off, but definitely planned; it'll keep checksums of
all the cached data in the btree, analogous to what Btrfs does. If a
checksum doesn't match, that data can be simply tossed, the error
logged, and the data read from the backing device or redundant copy.
There's also a lot of room for experimentation and potential improvement
in the various heuristics. Right now the cache functions in a
least-recently-used (LRU) mode, but
it's flexible enough to allow for other schemes. Potentially, we can retain
data based on how much real time it saves the backing device, calculated
from both the seek time and bandwidth.
Sample performance numbers
Of course, performance is the only reason to use bcache, so benchmarks
matter. Unfortunately, there's still an odd bug affecting buffered IO, so
the current benchmarks don't yet fully reflect bcache's potential; they
are more a measure of current progress. Bonnie isn't particularly
indicative of real world performance, but has the advantage of
familiarity and being easy to interpret; here is the bonnie output:
Uncached: SATA 2 TB Western Digital Green hard drive
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
utumno 16G 672 91 68156 7 36398 4 2837 98 102864 5 269.3 2
Latency 14400us 2014ms 12486ms 18666us 549ms 460ms
And now cached with a 64GB Corsair Nova:
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
utumno 16G 536 92 70825 7 53352 7 2785 99 181433 11 1756 15
Latency 14773us 1826ms 3153ms 3918us 2212us 12480us
In these numbers, the per character columns are mostly irrelevant
for our purposes, as they're affected by other parts of the kernel. The
write and rewrite numbers are only interesting in that they don't go
down, since bcache isn't doing write behind caching yet. The sequential
input is reading data bonnie previously wrote, and thus should all be
coming from the SSD. That's where bcache is lacking; the SSD is capable
of about 235MB/sec. The random IO numbers are actually about 90% reads and
10% writes of 4KB each; without write behind caching, bonnie is
bottlenecked on the writes hitting the spinning metal disk, and bcache
isn't that far off from the theoretical maximum.
For more information
The bcache
wiki holds more details about the software, more formal benchmark
numbers, and sample commands for getting started.
The git repository for the kernel code is available at
git://evilpiepirate.org/~kent/linux-bcache.git. The userspace tools
are in a separate repository:
git://evilpiepirate.org/~kent/bcache-tools.git. Both are viewable with
a web browser at the gitweb
site.
Comments (50 posted)
Page editor: Jake Edge