The 2.6.32 merge window is still open
as of this writing, so there
is no current 2.6 development kernel. The 2.6.32-rc1 release (and the
closing of the merge window) can be expected as soon as September 24.
The current stable kernel is 2.6.31. There have been no stable
update releases in the last week;
a series of stable updates is in the review process, but they have not been
released as of this writing.
Quite frankly, I have _never_ever_ seen a good reason for talking
to the kernel with some idiotic packet interface. It's just a fancy
way to do ioctl's, and everybody knows that ioctl's are bad and
evil. Why are fancy packet interfaces suddenly much better?
-- Linus Torvalds
I've yet to see any believable and complete explanation for these
gains. I've asked about these things multiple times and nothing
I suspect that what happened over time was that previously-working
code got broken, then later people noticed the breakage but failed
to analyse and fix it in favour of simply ripping everything out
and starting again.
So for the want of analysing and fixing several possible
regressions, we've tossed away some very sensitive core kernel code
which had tens of millions of machine-years testing. I find this
-- Andrew Morton
on per-BDI writeback
-extern void refrigerator(void);
+extern void refrigerator(void) __cold;
-- Stephen Hemminger
on proper refrigeration
The AppArmor security module has had a difficult life - even considering
that security modules tend to have a hard path into the mainline in
general. Its pathname-based approach concerned numerous developers, and its
implementation caused the net to echo with NACKs. Eventually, its core
developers lost their jobs and moved on to other pursuits, some
distributors lost interest, and AppArmor disappeared from
view. Meanwhile, the pathname-based TOMOYO Linux module managed to
overcome the hurdles and get into the mainline.
Speaking at LinuxCon, your editor stated that he did not know if AppArmor
would come back or not. The next day, a new set of AppArmor patches
was posted by John Johansen. Interestingly, John works at Canonical, so
AppArmor, should it get into the mainline, could well become one of that
company's largest contributions to the kernel. Its chances of merger
should be better now; TOMOYO Linux has broken down the barriers to
pathname-based mandatory access control, and AppArmor uses the new security
module hooks which were added to support TOMOYO. As of this writing,
though, there have been no reviews posted, so anything could still happen.
The SystemTap team has announced the release of SystemTap 1.0;
SystemTap is a dynamic tracing tool for Linux. Your editor is not sure why
this particular release qualifies as 1.0, but there is a lot of new stuff
in it, including "experimental support for unprivileged users,
cross-compiling for foreign architectures, matching C++ class and namespace
scopes, reduced runtime memory consumption, reduced dormant overhead in
userspace markers, bug fixes, and more..." See the announcement for
details.
Some developers have been unhappy about the merging of devtmpfs for 2.6.32;
one even posted a patch to remove it again. Ingo Molnar, instead, simply
reported a bug: when devtmpfs created device nodes, it made them
inaccessible to
unprivileged accounts. That breaks most applications in the system, which,
Ingo thought, was not entirely desirable.
The devtmpfs developers originally responded that udev should have set the
permissions properly by the time any sort of user-space application was
running. But devtmpfs raises the possibility of running without udev
altogether, at least on relatively simple systems. Linus agreed that this would be a nice possibility,
but noted that it would not work if a small number of special files were
not world-accessible. Setting the permissions properly is not that hard,
but it leads in a direction the devtmpfs developers had not wanted to go:
it puts a certain amount of administrative policy into the kernel.
In the end, though, that is exactly what happened; devtmpfs gained the
ability to query kernel subsystems for default permissions and implement them
in the filesystem. Given that the permissions problem was Linus's largest
complaint about the whole thing, it now seems likely that devtmpfs has a secure
place in the 2.6.32 kernel.
The paravirt_ops mechanism provides a way for the Linux kernel, when
running in a virtualized mode, to hook efficiently into the hypervisor for
privileged operations. Over time, processors have grown hardware features
aimed at supporting virtualization, but there has still been a performance
benefit to implementing some operations through paravirt_ops. That
situation would appear to be changing, though.
VMI is a paravirtualization layer for VMware, built on top of
paravirt_ops. Recently, developers at VMware ran a series of tests and came to an
interesting conclusion: with contemporary hardware, using VMI did not
improve the performance of guest systems. Indeed, it made things worse.
Reasonable hardware virtualization should be available on almost all
systems that matter in the near future, so VMware's developers have decided
that VMI no longer makes sense; they are now planning to remove it.
KVM developer Avi Kivity noted that a
similar conclusion had been reached in that camp; KVM will be dropping
support for some paravirtualized operations in the near future. That
leaves two other systems - Xen and lguest - using paravirt_ops. Xen, it
seems, will continue to do so for some time, and lguest is highly unlikely
to ever sacrifice sufficient puppies to move to hardware virtualization.
So paravirt_ops will remain for a little while yet, but its eventual
demise would appear to be in the cards. When it goes, it may just take
lguest with it.
Kernel development news
Since last week's update,
some 3300 changesets have been merged into the mainline for the 2.6.32
development cycle. The total number of non-merge changesets going into
2.6.32 is now just over 7800 - quite a few, but not yet a record.
Changes visible to users include:
- There are two new system clocks available:
CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE.
They are aimed at applications which need to obtain timestamps with
minimal cost and are willing to lose some resolution in the process (a
small usage example follows this list).
- The Sunplus
S+core architecture is now supported.
- The performance monitoring code has gained new capabilities for
recording and analyzing scheduler latency information. There is
a new facility for tracking power management state change events.
There has also been a rebranding from "performance counters" to
"performance events."
- Arjan van de Ven's timechart
tool has been merged. Timechart records system events in a way
which allows users to zoom in on specific periods of time and gain
increasing levels of detail on where system delays are coming from.
- The Intel i915 graphics driver now supports dynamic clock frequency
control. This feature allows clock frequencies to be reduced when
there is little or no graphics activity with a corresponding reduction
in power use.
- The Radeon kernel mode setting (KMS) code continues to evolve at a rapid
rate, with increasing numbers of features being supported there.
There is now KMS support for the R600 series.
- Quite a bit of new information has been added to virtual files like
/proc/meminfo with the intent of helping administrators track
down memory users in out-of-memory situations.
- The kernel shared memory
(KSM) subsystem has been merged. KSM scans memory for pages with
identical content. Duplicate pages are replaced with copy-on-write
links, resulting in significant reductions in memory use.
- The cpuidle governor changes described in this article have been merged.
- The Video4Linux layer now understands the ISDB-T and ISDB-S broadcast
standards, giving access to digital TV in places like Japan and Brazil.
- Expanded information about thread stack usage can now be found under
- The ocfs2 filesystem has gained reflink support, but without
the (to be reworked) reflink() system call.
- Write support has been removed from the qnx4 filesystem; that is likely
to inconvenience very few users, since it never really worked anyway.
- There is the usual pile of new drivers:
- Boards and processors:
Broadcom BCM63xx system-on-chip processors,
TI DA830/OMAP-L137 and DA850/OMAP-L138 SOC processors,
EcoVec (SH7724) SuperH boards, and
SuperH SH7757 processors.
- Graphics: VIA VX855 integrated graphics chipsets,
Gumstix Overo LCD controllers,
OMAP3 EVM LCD controllers,
Qualcomm MSM/QSD framebuffers.
- ATA: ARTOP 867X 64-bit 4-channel UDMA133 ATA controllers.
- USB: Nuvoton W90X900(W90P910) EHCI controllers and
Philips ISP1362 host controllers.
- Video4Linux: Conexant 25821-based TV cards,
DiBcom DiB8000 ISDB-T/ISDB-Tsb demodulators,
GL860 USB camera devices,
NXP SAA7164-based TV cards,
Friio ISDB-T USB2.0 receivers, and
Earthsoft PT1 PCI cards.
- Miscellaneous: Texas Instruments TMP421/422/423 temperature sensors,
GPIO devices on a number of Freescale Coldfire CPUs,
Wolfson Microelectronics WM831x power management ICs,
Motorola PCAP touchscreens,
ST-Ericsson AB3100 RTC devices,
Renesas R8A66597 USB Peripheral Controllers,
Nuvoton NUC900 series watchdog devices,
Winbond IR remote control devices,
Qualcomm MSM 7X00A SDCC controllers,
OMAP4 multimedia card interfaces,
PPC4xx SPI controllers,
Freescale STMP37xx/378x SPI/SSP controllers,
Freescale MC33880 high-side/low-side switches,
ST-Ericsson COH 901 331 realtime clocks,
Philips PCF2123 RTC devices,
Freescale STMP3xxx and MXC RTC devices,
ACPI 4.0 power meters, and
TI TPS65023 and TPS6507x voltage regulator devices.
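As a quick illustration of the new coarse clocks mentioned above, here is a
minimal sketch (not from the kernel tree; link with -lrt on older glibc)
comparing a coarse timestamp with a full-resolution one. The fallback
definition of CLOCK_MONOTONIC_COARSE is only needed if the C library's
headers do not yet know about it.

    #include <stdio.h>
    #include <time.h>

    #ifndef CLOCK_MONOTONIC_COARSE
    #define CLOCK_MONOTONIC_COARSE 6   /* value used by the 2.6.32 kernel */
    #endif

    int main(void)
    {
            struct timespec fine, coarse;

            /* Full-resolution monotonic time: may need a hardware read. */
            clock_gettime(CLOCK_MONOTONIC, &fine);

            /* Coarse monotonic time: returns the kernel's cached tick
             * value, so it is cheaper but only tick-accurate. */
            clock_gettime(CLOCK_MONOTONIC_COARSE, &coarse);

            printf("fine:   %ld.%09ld\n", (long)fine.tv_sec, fine.tv_nsec);
            printf("coarse: %ld.%09ld\n", (long)coarse.tv_sec, coarse.tv_nsec);
            return 0;
    }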
Changes visible to kernel developers include:
- The x86 architecture code has been significantly reorganized
so that support for the Intel "Moorestown" architecture could be added.
- The driver core API has been extended to
allow subsystems to provide non-default permissions for device
nodes created in devtmpfs (a brief sketch follows this list).
- The (now) unused kernel markers mechanism has been removed;
tracepoints should be used instead.
- The user-space USB driver API now allows drivers to claim specific hub
ports.
- There are new tracepoints for memory page allocation and freeing
events and timer (and hrtimer) events.
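To make the driver-core item above a little more concrete, here is a rough
sketch of how a subsystem might supply a default mode for its device nodes
using the devnode() class callback added in this cycle. The class name and
the 0666 mode are invented for illustration; real subsystems will have
their own policies.

    #include <linux/device.h>

    /* Hypothetical example: give this class's nodes 0666 permissions in
     * devtmpfs instead of the default. */
    static char *example_devnode(struct device *dev, mode_t *mode)
    {
            if (mode)
                    *mode = 0666;
            return NULL;    /* keep the default node name */
    }

    static struct class example_class = {
            .name    = "example",
            .devnode = example_devnode,
    };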
The merge window would normally be nearing its end;
it's possible that Linus will extend it slightly, though, to make up for
the time he has spent at LinuxCon and the Linux Plumbers conference.
[Editor's note: Greg Kroah-Hartman has graciously agreed to write an
occasional column for LWN in which he answers questions that readers may
wish to ask of the kernel development community. Greg will do a great job,
but the key to a successful column will be good questions; please come up
with your best and send them in.]
Hi, and welcome to a new semi-weekly column. In here, we are going to
try to answer your common questions about Linux kernel development.
This column will rely on the readers to submit new questions to be
answered either here in comments, or by email to email@example.com, with
the understanding that not all questions can be answered.
Valid topics can range from the technical, to the procedural, or on
toward anything remotely related to the Linux kernel that you can think of.
To start it off, I've provided a few "seed" questions that I get asked a
lot, and would like to finally answer in one place so I don't have to do
so over and over.
Why is the 2.6.27 kernel still being maintained while the newer
2.6.29 kernel is no longer getting updates?
The Linux kernel stable series strives to only maintain one kernel tree
at a time, the most recent one, with a small overlap of a release or
two when a new kernel is released. So for right now, as the 2.6.31
kernel was just released, both the .31 and .30 trees are being updated.
After the next release of the .30 stable tree, it will be abandoned, and
only the .31 tree will be updated with security and bug fixes.
But some kernel trees are a bit "special". The 2.6.27 kernel looked
like a good kernel to maintain for a longer period of time. Some users
have reported that they like to remain on one kernel version for longer
than 3-4 months, so the 2.6.27 kernel tree will try to be a tree that
they can rely on to get security and bug fixes for a longer time frame.
As the 2.6.27 kernel was first released on October 9, 2008, there has
been almost a full year of support for this kernel so far.
After I get tired of maintaining this kernel branch, Adrian Bunk has
volunteered to maintain it even longer, so in another year or so,
maintenance will switch over to him, and it will continue to live on.
How do I get a patch included in the stable kernel updates?
First off, take a look at the file Documentation/stable_kernel_rules.txt
in the kernel source tree to verify that the patch you are
considering meets the rules for a stable kernel release. If it does,
the easiest way to get it included is to add a:
Cc: stable <firstname.lastname@example.org>
line to the Signed-off-by: area in the patch before it is sent to the
subsystem maintainer. When a patch with that line in it is accepted
into Linus's kernel tree, the stable team will be automatically notified
that the patch should be included, and they will queue it up for the
next stable kernel release(s).
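As an illustration (the subject line, description, and names below are
invented), the tag simply sits with the other tags at the end of the patch
changelog:

    example: fix use-after-free in example_cleanup()

    (description of the problem and why the fix qualifies for stable)

    Cc: stable <firstname.lastname@example.org>
    Signed-off-by: A. Developer <developer@example.com>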
If you notice a patch that you feel should be included in the stable
release, and does not have this marking, and is already in Linus's tree,
a simple email to the email@example.com address with the git commit id
of the patch in Linus's tree and a short description of which stable
kernel releases you feel it should be included in is all that is needed.
So bring on the questions!
When you say "log-structured file system," most storage developers
will immediately think of Ousterhout and Rosenblum's classic paper,
The Design and Implementation of a Log-structured File System, and
the nearly two decades of subsequent work attempting to solve the
nasty segment cleaner problem (see below) that came with it. Linux
developers might think of JFFS2, NILFS, or LogFS, three of several
modern log-structured file systems specialized for use with solid
state devices (SSDs). Few people, however, will think of SSD firmware. The
flash translation layer in a modern, full-featured SSD resembles a
log-structured file system in several important ways. Extrapolating
from log-structured file systems research lets us predict how to get
the best performance out of an SSD. In particular, full support for
the TRIM command, at both the SSD and file system levels, will be key for
sustaining long-term peak performance for most SSDs.
What is a log-structured file system?
Log-structured file systems, oddly enough, evolved from logging
file systems. A logging (or journaling) file system is a normal
write-in-place file system in the style of ext2 or FFS, just with a
log of write operations bolted on to the side of it. (We'll use the
term "journaling file system" in the rest of the paper to avoid
confusion between "logging" and "log-structured" file systems.) A
journaling file system keeps the on-disk state of the file system
consistent by writing a summary of each write operation to the log,
stored somewhere non-volatile like disk (or NVRAM if you have the
money), before writing the changes directly to their long-term place
in the file system. This summary, or log record, contains enough
information to repeat the entire operation if the direct write to the
file system gets interrupted mid-way through (e.g., by a system
crash). This operation is called replaying the log. So, in short,
every change to the file system gets written to disk twice:
once to the log, and once in the permanent location.
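A toy program (all structures and names here are invented for
illustration; no real file system looks like this) makes that write-twice
ordering explicit:

    #include <stdio.h>

    /* Invented toy structures, purely to show the ordering of a
     * journaled update. */
    struct update  { const char *what; };
    struct journal { int records; };

    static void log_write_record(struct journal *j, struct update *u)
    {
            j->records++;
            printf("journal: logged '%s'\n", u->what);
    }

    static void log_flush(struct journal *j)
    {
            printf("journal: flushed %d record(s) to stable storage\n",
                   j->records);
    }

    static void write_in_place(struct update *u)
    {
            printf("fs: wrote '%s' to its permanent location\n", u->what);
    }

    int main(void)
    {
            struct journal j = { 0 };
            struct update  u = { "inode 42 size change" };

            /* 1. Describe the change in the log and make it durable... */
            log_write_record(&j, &u);
            log_flush(&j);

            /* 2. ...then write the change in place. A crash between these
             *    two steps is recovered by replaying the log record. */
            write_in_place(&u);
            return 0;
    }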
Around 1988, John K. Ousterhout and several collaborators realized
that they could skip the second write entirely if they treated the
entire file system as one enormous log. Instead of writing the
operation to the log and then rewriting the changes in place somewhere
else on the disk, it would just write it once to the end of the log
(wherever that is) and be done with it. Writes to existing files and
inodes are copy-on-write - the old version is marked as free space,
and the new version is written at the end of the log. Conceptually,
finding the current state of the file system is a matter of replaying
the log from beginning to end. In practice, a log-structured file
system writes checkpoints to disk periodically; these checkpoints describe the
state of the file system at that point in time without requiring any
log replay. Any changes to the file system after the checkpoint are
recovered by replaying the relatively small number of log entries
following the checkpoint.
One of the interesting benefits of the log-structured file system (LFS)
structure is that most
writes to the file system are sequential. The section describing the
motivation for Sprite LFS, written nearly 20 years ago, demonstrates
how little has changed in the storage world:
Over the last decade CPU speeds have increased dramatically while disk
access times have only improved slowly. This trend is likely to
continue in the future and it will cause more and more applications to
become disk-bound. [...] Log-structured file systems are based on the
assumption that files are cached in main memory and that increasing
memory sizes will make the caches more and more effective at
satisfying read requests. As a result, disk traffic will become
dominated by writes.
But wait, why are we still talking about disk seeks? SSDs have
totally changed the performance characteristics of storage! Disks are
dead! Long live flash!
Surprisingly, log-structured file systems are more relevant than ever
when it comes to SSDs. The founding assumption of log-structured file
systems - that reads are cheap and writes are expensive - is
emphatically true for the bare-metal building blocks of
flash. (For the rest of this article, "flash" refers to NAND-based
flash and SSD refers to a NAND-based flash device with a
wear-leveling, write-gathering flash translation layer.) When it comes
to flash, reads may be done at small granularities - a few hundreds of
bytes - but writes must be done in large contiguous blocks - on the
order of tens of thousands or hundreds of thousands of bytes. A write
to flash takes two steps: First the entire block is cleared, setting
all the bits to the same value (usually 1, counter-intuitively).
Second, individual bits in the block are flipped back to 0 until you
get the block you wanted.
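A toy model, again invented purely for illustration, captures the
asymmetry: erase is the only way to get bits back to 1 and it affects the
whole block, while programming can only clear bits.

    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    /* Toy model of a NAND eraseblock; real block and page sizes vary. */
    #define ERASE_BLOCK_SIZE 16     /* tiny, so the output is readable */
    static unsigned char block[ERASE_BLOCK_SIZE];

    /* Erase: every bit in the whole block goes to 1. */
    static void erase_block(void)
    {
            memset(block, 0xff, sizeof(block));
    }

    /* Program: bits may only go from 1 to 0 within an erased block. */
    static void program(size_t off, const unsigned char *data, size_t len)
    {
            size_t i;

            for (i = 0; i < len && off + i < sizeof(block); i++)
                    block[off + i] &= data[i];
    }

    int main(void)
    {
            const unsigned char msg[4] = { 0x12, 0x34, 0x56, 0x78 };

            erase_block();
            program(0, msg, sizeof(msg));
            printf("block[0..3] = %02x %02x %02x %02x\n",
                   block[0], block[1], block[2], block[3]);
            /* Rewriting these bytes with different data now requires
             * erasing the whole block again first. */
            return 0;
    }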
Log-structured file systems turn out to be a natural fit for flash.
One of the details of the log-structured design is that the log is
written in large contiguous chunks, called "segments," on the
order of several megabytes in size. To cut down on metadata overhead
and get the best performance, log entries are gathered and written out
sequentially to a completely free segment. Most segments are
partially in use and partially free at any given time, so the file
system has to collect all the in-use data from a segment and move it
elsewhere before it can start writing to it. When the file system
needs a fresh segment, it first
cleans an existing partially-used segment by moving all the
in-use, or live data to another free segment - basically, it
garbage-collects. Now that everything is arranged properly, the file
system can do one big streaming write to the empty segment. This
system of segments and cleaning is exactly what is needed to
efficiently write to a flash device, given the necessity to erase
large contiguous blocks of flash before writing to them.
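A toy version of the cleaner (invented for illustration; a real
log-structured file system would also update the maps that record where
each live block now lives) looks roughly like this:

    #include <stdio.h>

    /* Toy model: a "segment" is just an array of blocks, each either
     * live (1) or free (0). */
    #define BLOCKS_PER_SEG 8
    struct segment { int live[BLOCKS_PER_SEG]; };

    /* Move every live block from 'victim' into 'target'; once nothing
     * live remains, the victim is a clean segment ready for one big
     * sequential write (or an erase, on flash). */
    static void clean_segment(struct segment *victim, struct segment *target)
    {
            int i, j = 0;

            for (i = 0; i < BLOCKS_PER_SEG; i++) {
                    if (!victim->live[i])
                            continue;
                    while (j < BLOCKS_PER_SEG && target->live[j])
                            j++;
                    if (j < BLOCKS_PER_SEG) {
                            target->live[j] = 1;    /* "copy" the block */
                            victim->live[i] = 0;    /* old copy is free */
                    }
            }
    }

    int main(void)
    {
            struct segment victim = { { 1, 0, 1, 0, 0, 1, 0, 0 } };
            struct segment target = { { 0 } };
            int i;

            clean_segment(&victim, &target);
            for (i = 0; i < BLOCKS_PER_SEG; i++)
                    printf("victim[%d]=%d target[%d]=%d\n",
                           i, victim.live[i], i, target.live[i]);
            return 0;
    }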
The match between log-structured file systems and flash is obvious
when you look at file systems written for the bare flash programming
interface - that is, for devices without built-in wear-leveling or
write-gathering. File systems that know about and have to manage
erase blocks and other details of the flash hardware are almost
invariably log-structured in design. The most widely used such file
system for Linux is JFFS2, used in many embedded devices, such as
ticket machines and seatback airline entertainment systems. More than
once, I've boarded a plane and seen a JFFS2 error message reporting
flash corruption on a hung seatback entertainment system. (Sadly, many
thousands of people probably now associate the Tux penguin bootup logo
with the inability to watch TV on long distance flights.)
For SSDs that export a disk-style block interface - most
consumer-grade SSDs these days - the operating system uses a regular
file system to talk to the SSD via the block interface (that is, read
block #37 into this buffer, write this buffer into block #42, etc.).
However, this system still contains the logical equivalent of a
log-structured file system; it's just hidden inside the SSD. The
firmware that implements wear-leveling, write-gathering, and any other
features has to solve the same problems as a log-structured file system.
Most SSD manufacturers refuse to reveal any details of their internal
firmware, but we can be fairly confident that it has a lot in common
with log-structured file systems. First, the only way to implement
efficient random writes is to buffer them and write them out to a
single erase block together. This requires clearing an erase block,
moving all the in-use blocks to another area, and keeping a mapping
between the logical location of blocks and their physical locations -
exactly what a log-structured file system does. Second, when we do
get SSD implementation details in publications, they look like
log-structured file systems. Third,
when we look at long-term performance testing of SSDs, we see the same
pattern of performance degradation over time that we do with
log-structured file systems. We'll talk about this in detail in the
next section.
Log-structured file system performance
Log-structured file systems are a natural fit for flash-based storage
today, but back in 1990, they appeared to have great potential for
disk-based file systems as well. Yet, as we all know, we're not using
log-structured file systems on our disk-based laptops and servers.
In short, log-structured file systems performed relatively well as
long as most of the segment cleaning - movement of live data out of a
segment so it can be re-used - could be done in the background when
the file system wasn't busy with "real" work.
A major follow-up paper on LFS [PDF] found performance of LFS degraded by
up to 40% from the best case at real-world levels of disk utilization,
memory-to-disk ratio, and file system traffic. In short, in the
steady state the file system was spending a significant amount of disk
access time cleaning segments - moving old data out of a segment so it
could be used for new writes. This segment cleaning problem
was the subject of active research for at least another decade, but
none of the solutions could consistently beat state-of-the-art
write-in-place file systems at practical levels of disk utilization.
It's a little bit like comparing garbage collection to explicit
reference counting for memory management; when memory usage is low and
the occasional high latency hit is okay, the convenience of garbage
collecting outweighs the performance benefits. But at "high" levels
of disk utilization - as little as 50% - the cleaning cost and
periodic high latencies waiting for space to be freed up become a
problem. As that paper showed, the key to good performance in a log-structured file
system is to place data such that nearly empty segments are created
about as quickly as they are used. The file system write bandwidth is
limited by the rate at which it can produce clean segments. The worst
case happens when, in a file system that is X% full, every segment is also X%
full. Producing one clean segment requires collecting the live data
from N = ceiling(1/(1 - X)) segments and writing out the
old data to N - 1 of those segments. For a disk
utilization of 80%, we get N = ceiling(1/(1 - 0.80)) = 1/0.20 = 5
segments to clean.
If segments were 1MB in size, we'd have to read
5 * 800KB = 4MB
of data seekily and write 4MB sequentially before we could
write 1MB of new data. (Note to pedants: I'm using MB/KB in powers of
10, not 2).
The best case, instead, is a file system with two kinds of
segments, completely full and completely empty. The best case write
pattern is one that changes all of the metadata and data in a single
segment, so that when the new versions are written out, the old
versions are freed and the entire segment becomes free again. Reality
lies somewhere between these two cases. The goal for a log-structured
file system is to create a bimodal segment usage distribution: Most
segments are either very full or very empty, and full segments tend to
be unchanged. This turns out to be difficult to achieve.
SSDs have an extra interesting constraint: wear-leveling. Even in the
best case in which most segments are 100% full and no writes ever
change the data in them, the SSD must still move those segments around
occasionally because it has to spread writes out over every available
flash block. This adds an extra segment move in some cases and makes
achieving good performance even harder than in a disk-based
log-structured file system.
Lessons - learned?
It's great that SSD manufacturers can learn from two decades of prior
work on log-structured file systems. What's not clear is whether they
are doing so. Most manufacturers take a very closed approach to SSD
firmware development - it's the secret sauce that turns cheap
commodity flash with very low margins into extremely expensive,
reliable, high-performance storage devices with high margins. Some
are better at this task than others. Currently, manufacturers are
taking the trade secret strategy for maintaining their competitive
advantage - apply for patents on individual elements of the design,
but keep the overall implementation a secret. The message to file
systems developers is "Just trust us" and "Don't worry your pretty
little systems programmers' heads about it" whenever we ask for more
information on SSD implementation. You can't particularly argue with
this strategy at present, but it tends to come from (and reinforce) the
mindset that not only refuses to share information with the outside, but
also ignores information from the outside, such as previously
published academic work.
One of the greatest missed opportunities for optimization based on
lessons learned from log-structured file systems is the slow adoption
of TRIM support for SSDs. TRIM is a command to a block device informing it
that a certain range of blocks is no longer in use by the file system
- basically a
free() call for blocks. As described
earlier, the best performance comes when empty segments are created as
a side effect of ongoing writes. As a simple example, imagine a
segment that contains only a single inode and all of its file data.
If the next set of writes to the file system overwrites all of the
file data (and the inode as a side effect), then that segment becomes
completely free and the file system doesn't have to move any live data
around before it uses that segment again. The equivalent action for
an SSD is to write to a block that has already been written in the
past. Internally, the SSD knows that the old copy of that block is
now free, and it can reuse it without copying its data elsewhere.
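When discard support is plumbed all the way through, a file system or
administrative tool can tell the device that a range is free. As one
concrete illustration - with the caveat that device and kernel support was
still settling at the time of writing - user space can ask the block layer
to discard a byte range with the BLKDISCARD ioctl:

    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>           /* BLKDISCARD */

    /* Discard (TRIM) the first 1MB of a block device. Destructive: the
     * device is free to throw the data in that range away. */
    int main(int argc, char **argv)
    {
            uint64_t range[2] = { 0, 1024 * 1024 };  /* start, length */
            int fd;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]);
                    return 1;
            }
            fd = open(argv[1], O_WRONLY);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (ioctl(fd, BLKDISCARD, &range) < 0) {
                    perror("BLKDISCARD");
                    close(fd);
                    return 1;
            }
            close(fd);
            return 0;
    }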
But log-structured file systems have a distinct advantage over
pre-TRIM SSDs (basically all commercially available SSDs as of now,
September 2009). Log-structured file systems know when on-disk data
has been freed even when it isn't overwritten. Consider the case of
deleting the one-segment file: the entire segment is freed, but no
overwrite occurred. A log-structured file system knows that this
happened and now has a free segment to work with. All the SSD sees is
a couple of tiny writes to other blocks on the disk. As far as it's
concerned, the blocks used by the now-deleted file are still precious
data in-use by the file system and it must continue to move that data
around forever. Once every block in the device has been written at
least once, the SSD is doomed to a worst case performance state in
which its spare blocks are at a minimum and data must be moved each
time a new block is rotated into use.
As we've seen, the key to good performance in a log-structured file
system is the availability of free or nearly-free segments. An SSD
without TRIM support does not know about many free segments and
accrues an immense performance disadvantage, which makes it somewhat
shocking that any SSD ever shipped without the TRIM feature. My guess
is that SSDs were initially performance tested only with
write-in-place file systems (cough, cough, NTFS) and low total file
system usage (say, 70% or less).
Unfortunately, TRIM in its current form is both designed and implemented to
perform incredibly poorly: TRIM commands aren't tagged and at least one
SSD takes hundreds of milliseconds to process a TRIM command.
Kernel developers have debated exactly how to implement TRIM support
at the Linux Plumbers
Storage and File System Workshop, and on mailing lists: what the
performance cost of each TRIM is, what granularity TRIMs should have,
how often they should be issued, and whether it's okay to forget or miss
TRIM commands. In my opinion, the in-use/free state of a block on a
TRIM-enabled device should be tracked as carefully as that of a page
of memory. The file system implementation can take the form of
alloc()/free() calls, or else
asynchronous garbage collection (during a file system check or
scrubbing run), but we shouldn't "leak" in-use blocks for all the same
reasons we don't leak memory.
Additionally, in an ideal world, TRIM would be redesigned or replaced by a
command that is a full-featured, well-designed first-class citizen in the
ATA spec, rather than a hack bolted on after the fact.
Of course, all this is speculation in the absence of implementation
details from the SSD manufacturers. Perhaps some SSD firmware
programmers have come up with entirely new algorithms for remapping
and write-gathering that don't resemble log-structured file systems at
all, and the performance characteristics and optimizations we have
seen so far just happen to match those for log-structured file
systems. However, so far it appears that treating an SSD as though it
were backed by a log-structured file system is a good rule of thumb
for getting good performance. Full TRIM support by both SSDs and file
systems will be key to long-term good performance.