Brief items
The current development kernel is 2.6.31-rc8,
released by Linus Torvalds on August 27.
"This should be the last -rc, and it's really been quieting
down. There's 131 commits there, and it's all pretty trivial." He
predicts the final 2.6.31 release will happen on Labor Day
(September 7).
There have been no stable updates in the last week, and none are in the
review process as of this writing.
Kernel development news
As I see it, there are no SSD devices which don't lose data; there
are only SSD devices which haven't lost your data _yet_.
--
David Woodhouse
What I've been recommending for some time is that people use LVM, and
run fsck on a snapshot every week or two, at some convenient time when
the system load is at a minimum. There is an e2croncheck script in
the e2fsprogs sources, in the contrib directory; it's short enough
that I'll attach here here.
Is it *necessary*? In a world where hardware is perfect, no. In a
world where people don't bother buying ECC memory because it's 10%
more expensive, and PC builders use the cheapest possible parts --- I
think it's a really good idea.
--
Ted Ts'o
What it basically shows is how intolerant the mainline kernel
community members have become towards people who hold a different
view to them. The attitude is: either conform or you're an idiot
and we're going to attack you until you conform.
I do hope others see what has happened here, and seriously consider
whether they want to get involved in a sniping dictatorial
community. Maybe considering to go down the BSD route instead.
--
Russell King
Because it throws out everything about what we know is good about how to
design a modern scheduler in scalability.
Because it's so ridiculously simple.
Because it performs so ridiculously well on what it's good at despite being
that simple.
Because it's designed in such a way that mainline would never be interested
in adopting it, which is how I like it.
Because it will make people sit up and take notice of where the problems are
in the current design.
Because it throws out the philosophy that one scheduler fits all and shows
that you can do a -lot- better with a scheduler designed for a particular
purpose. I don't want to use a steamroller to crack nuts.
--
Con Kolivas is back
By Jonathan Corbet
September 2, 2009
CFS hard limits. The Linux "completely fair scheduler" works by
dividing the available CPU time between the processes contending for
it. In many situations, though, processes running on the system will not
actually use their full fair share; they may spend enough time waiting for
I/O, for example, that they simply cannot run enough to use all of the time
they are entitled to. In such situations, CFS will give the left-over time
to more CPU-intensive processes that can make good use of it, even if those
processes have exceeded their allocation.
That is normally the right thing to do; better to put the CPU time to good
use than to have the processor go idle while processes want to run. But
there are, it seems, situations where system administrators would rather
not hand out excess CPU time in that way. If, for example, the processes
belong to a customer who is paying for a certain amount of processing time,
giving away more could be bad business. To keep this from happening,
Bharata B Rao has created the CFS
hard limits patch set. Hard limits are managed using control groups;
they allow the administrator to set an absolute limit on the amount of CPU
time the control group as a whole is able to use over a given period of
real time. Billing users who want their limit raised is, of course, a
user-space policy issue, so it's not part of this patch.
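The accounting behind a hard limit can be pictured with a small user-space sketch. This is not code from the patch set; the structure, field names, and functions below are hypothetical and only illustrate the idea of a fixed quota of CPU time per period of real time, with no hand-out of left-over time once the quota is used up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-group state; names are illustrative, not from the patch. */
struct group_bw {
    uint64_t period_us;        /* length of the enforcement period */
    uint64_t quota_us;         /* CPU time the group may use per period */
    uint64_t used_us;          /* CPU time consumed in the current period */
    uint64_t period_start_us;  /* real time at which the period began */
};

/* Begin a fresh period (and refill the quota) once the old one has expired. */
void bw_refresh(struct group_bw *g, uint64_t now_us)
{
    assert(g->period_us > 0);
    uint64_t elapsed = now_us - g->period_start_us;
    if (elapsed >= g->period_us) {
        g->period_start_us += (elapsed / g->period_us) * g->period_us;
        g->used_us = 0;
    }
}

/*
 * Ask for want_us of CPU time at real time now_us. Returns the amount
 * actually granted; 0 means the group is throttled until the next period,
 * even if the CPU would otherwise sit idle.
 */
uint64_t bw_charge(struct group_bw *g, uint64_t now_us, uint64_t want_us)
{
    bw_refresh(g, now_us);
    uint64_t left = g->quota_us - g->used_us;
    uint64_t grant = want_us < left ? want_us : left;
    g->used_us += grant;
    assert(g->used_us <= g->quota_us);
    return grant;
}
```

With a one-second period and a 250ms quota, a group that has already run for 200ms gets only 50ms more, then nothing until the next period starts.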
Discard again. The "discard" operation, which informs a block
storage device that specific blocks are no longer in use, should help a
wide variety of storage technologies - including solid-state devices and
"thin provisioned" arrays - to perform better. But discard, itself, has
some performance issues; see "The
trouble with discard" for details.
Christoph Hellwig is trying to improve discard performance with a new set of patches, some of
which originally come from Matthew Wilcox. These changes allow discard
requests to cover much larger sections of the storage device; previously
they had been limited by the maximum request size for the device. When
combined with the XFS-specific XFS_IOC_TRIM ioctl()
command, this change allows user-space to issue bulk discard operations for
all of the free portions of a filesystem partition at an opportune time.
The patches also add better control over whether any specific discard
request should be seen as a queue barrier and whether it should be
performed as a blocking operation.
Upcoming network driver API change. Not content with having
reworked the network driver API once (by moving operations into their own
structure), Stephen Hemminger now has a new patch set which changes
the API implemented by all drivers. The function involved is
ndo_start_xmit(), which is used by the networking layer to pass a
packet to the driver for transmission. This function should really only
return one of two values: NETDEV_TX_OK (meaning that the packet
has been accepted and queued for transmission) or NETDEV_TX_BUSY
(the packet was not accepted because the queue was full or some similar
problem came up). Drivers using the deprecated LLTX mode can also return
NETDEV_TX_LOCKED to indicate that the transmit lock was already
taken.
The problem is that the return type for ndo_start_xmit() was
defined as int; some driver writers thought that meant they could
return arbitrary error codes to the networking layer. With Stephen's
patch, the return type becomes netdev_tx_t, an enum
containing only the defined return codes. That should catch any driver
writers who try to return the wrong thing - but at the cost of changing a
lot of drivers.
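As a rough user-space illustration of the idea, the sketch below declares an enum with the three return codes named above and a toy transmit function that can only return one of them. The numeric values, the toy_ring structure, and toy_start_xmit() are assumptions made for this example; a real ndo_start_xmit() takes a packet and a device, and how strictly a wrong return value is flagged depends on the compiler and on static-analysis tools.

```c
#include <assert.h>
#include <stddef.h>

/* Restricted return type; the numeric values here are illustrative only. */
typedef enum netdev_tx {
    NETDEV_TX_OK = 0,   /* packet accepted and queued */
    NETDEV_TX_BUSY,     /* packet not accepted, e.g. queue full */
    NETDEV_TX_LOCKED    /* LLTX only: transmit lock already taken */
} netdev_tx_t;

/* Hypothetical stand-in for a driver's transmit ring. */
struct toy_ring {
    size_t capacity;
    size_t queued;
};

/*
 * Toy transmit routine: with netdev_tx_t as the return type, the
 * signature itself documents that errno-style codes are not welcome.
 */
netdev_tx_t toy_start_xmit(struct toy_ring *ring)
{
    assert(ring != NULL);
    if (ring->queued >= ring->capacity)
        return NETDEV_TX_BUSY;   /* not -ENOMEM or another arbitrary code */
    ring->queued++;
    return NETDEV_TX_OK;
}
```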
Checkpoint/restore wiki. There is a new wiki
dedicated to the collection of information about the rapidly-developing
checkpoint/restore functionality. It's a little bare at the moment, but,
one assumes, it will soon be filled with information about this feature.
The actual checkpoint/restore task remains an exercise in complexity. As
an example, consider one of the most recently-posted pieces: checkpoint and restore for security
credentials. It requires a number of hooks into LSM modules to obtain
the current security state, serialize it, and to restore it at some future
time. It can all probably be made to work, but long-term maintenance could
prove to be painful.
The BFS scheduler. Con Kolivas, who worked on desktop interactivity
issues in the past before abruptly leaving the kernel
development community in 2007, has posted a new
scheduler called BFS. Con says:
It was designed to be forward looking
only, make the most of lower spec machines, and not scale to massive
hardware. ie it is a desktop orientated scheduler, with extremely low
latencies for excellent interactivity by design rather than 'calculated',
with rigid fairness, nice priority distribution and extreme scalability
within normal load levels.
(See the original LWN posting
for the associated comment thread.)
By Jonathan Corbet
September 1, 2009
When developers think about forcing data written to files to be flushed to
the underlying storage device, they tend to think about the
fsync()
system call. But it is also possible to request synchronous behavior for
all operations on
a file descriptor, either at
open() time or using
fcntl(). Support in Linux for synchronous I/O flags is likely to
improve in 2.6.32, but this work has raised a couple of interesting issues
with regard to the current implementation and forward compatibility.
There are three standard-defined flags which can be used to specify
synchronous I/O behavior:
- O_SYNC: requires that any write operations block until all
data and all metadata have been written to persistent storage.
- O_DSYNC: like O_SYNC, except that there is no
requirement to wait for any metadata changes which are not necessary
to read the just-written data. In practice, O_DSYNC means
that the application does not need to wait until ancillary information
(the file modification time, for example) has been written to disk.
Using O_DSYNC instead of O_SYNC can often eliminate
the need to flush the file inode on a write.
- O_RSYNC: this flag, which only affects read operations, must
be used in combination with either O_SYNC or
O_DSYNC. It will cause a read() call to block until
the data (and maybe metadata) being read has been flushed to disk (if
necessary). This flag thus gives the kernel the option of delaying
the flushing of data to disk; any number of writes can happen, but
data need not be flushed until the application reads it back.
O_DSYNC and O_RSYNC are not new; they were added to the
relevant standards well over ten years ago. But Linux has never really
supported them (they are optional features), so glibc simply defines them
both to be the same as O_SYNC.
Christoph Hellwig is working on a proper
implementation of these flags, with an eye toward merging the changes
in 2.6.32.
It should be a relatively straightforward change at this point; the kernel
has some nice infrastructure for handling data and metadata flushing now.
What is potentially harder is making the change in a way which best meets
the expectations of existing applications.
There are two unrelated issues which make this transition harder than one
might expect it should be:
- Linux has never actually implemented O_SYNC; what
applications have been getting, instead, is O_DSYNC.
- The open() implementation in the kernel simply ignores flags
that it knows nothing about. This behavior can be changed only at
risk of breaking unknown numbers of applications; it's an aspect of
the kernel ABI.
Given the first problem listed above, one might be tempted to make a new flag
for O_DSYNC and use it to obtain the current behavior, while
O_SYNC would get the full metadata synchronization semantics. If
this were to be done, though, applications which are built against a new C
library but run on an older kernel would be presenting an unknown flag to
open(), which would duly ignore it. Such applications would not get
synchronous I/O behavior at all, which is almost certainly not a good
thing. So something trickier will have to be done.
There is also the question of which semantics older applications should
get. Jamie Lokier argued that applications
requesting O_SYNC behavior wanted full metadata synchronization,
even if the kernel has been
cheating them out of the full experience. So, when running under a future
kernel with a proper O_SYNC implementation, an old, binary
application should get O_SYNC behavior. Ulrich Drepper, instead,
thinks that behavior should not change for
older applications:
But these programs apparently can live with the broken semantics.
I don't worry too much about this. If people really need the fixed
O_SYNC semantics then let them recompile their code.
It looks like Ulrich's view will win out, for the simple reason that the
performance cost of the additional metadata synchronization seems worse than
giving applications the semantics they have been running with anyway, even
if those semantics are not quite what was promised.
Christoph outlined the likely course of
action. Internally, O_SYNC will become O_DSYNC, and the
open() flag which is currently O_SYNC will come to mean O_DSYNC. The
open() system call will then take a new flag (name unknown;
O_FULLSYNC and O_ISYNC have been suggested) which will be
hidden from applications. At the glibc level, applications will see this:
#define O_SYNC (O_FULLSYNC|O_DSYNC)
On older kernels, the O_DSYNC flag (with the same value as
O_SYNC now) will yield the same behavior as always, while
O_FULLSYNC will be ignored. On newer
kernels, the new flag will yield the full O_SYNC semantics. As
long as applications do not reach under the hood and try to manipulate the
O_FULLSYNC flag directly, all will be well.
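The compatibility argument can be checked with a small simulation. The bit values and the SK_ prefix below are invented for this sketch (the real values and the final name of the hidden flag were not settled at the time of writing); the two functions model an old kernel that ignores unknown bits and a new kernel that understands both.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_O_DSYNC    0x001000u   /* the bit old kernels already treat as O_SYNC */
#define SK_O_FULLSYNC 0x100000u   /* new bit, hidden from applications */
#define SK_O_SYNC     (SK_O_FULLSYNC | SK_O_DSYNC)

enum sync_mode { SYNC_NONE, SYNC_DATA, SYNC_FULL };

/* Old kernel: knows one sync bit and silently ignores everything else. */
enum sync_mode old_kernel_mode(unsigned int flags)
{
    return (flags & SK_O_DSYNC) ? SYNC_DATA : SYNC_NONE;
}

/* New kernel: both bits together mean full O_SYNC semantics. */
enum sync_mode new_kernel_mode(unsigned int flags)
{
    if ((flags & SK_O_SYNC) == SK_O_SYNC)
        return SYNC_FULL;
    if (flags & SK_O_DSYNC)
        return SYNC_DATA;
    /* The hidden bit on its own is not a supported combination here. */
    return SYNC_NONE;
}
```

A newly built program passing SK_O_SYNC still gets data synchronization on an old kernel, and an old binary passing only the SK_O_DSYNC bit keeps its existing behavior on a new one.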
By Jake Edge
September 2, 2009
One of the primary functions of any kernel is to manage the CPU resources
of the hardware that it is running on. A recent patch, proposed by Raz
Ben-Yehuda, would change that by taking one or more CPUs out from under the
kernel's control, so that processes could run, undisturbed, on those
processors. The "offline scheduler", as Ben-Yehuda calls his patch, had
some rough sailing in the initial reactions to the idea, but as the thread
on linux-kernel evolved, kernel hackers stepped back and looked at the
problems it is trying to solve—and came up with other potential
solutions.
The basic idea behind the offline scheduler is fairly straightforward: use
the CPU hot-unplug facility to remove the processor from the system, but
instead of halting the processor, allow other code to be run on it.
Because the processor would not be participating in the various CPU
synchronization schemes (RCU, spinlocks, etc.), nor would it be handling
interrupts, it can completely devote its attention to the code that it is
running. The idea is that code running on the offline processor would not
suffer from any kernel-introduced latencies at all.
The core patch is fairly small. It
provides an interface to register a function to be called when a particular
CPU is taken offline:
int register_offsched(void (*offsched_callback)(void), int cpuid);
This registers a callback that will be made when the CPU with the given
cpuid
is taken offline (i.e. hot unplugged). Typically, a user would load a
module that calls
register_offsched(), then take the CPU
offline which triggers the callback on the just-offlined CPU. When the
processing completes, and
the callback returns, the
processor will then be halted.
At that point, the CPU can be brought back online and returned to the
kernel's control.
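The register-then-offline sequence can be mimicked in user space. Everything below is a mock: mock_register_offsched() only borrows the shape of the prototype shown above, the 0/-1 return convention is an assumption, and mock_cpu_offline() stands in for the hot-unplug path, which in the real patch runs the callback on the offlined processor itself.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NR_CPUS 4

/* One callback slot per CPU in this mock. */
static void (*offsched_cb[MOCK_NR_CPUS])(void);

/* Mock registration; returning 0 on success and -1 on error is assumed. */
int mock_register_offsched(void (*offsched_callback)(void), int cpuid)
{
    if (cpuid < 0 || cpuid >= MOCK_NR_CPUS || offsched_callback == NULL)
        return -1;
    if (offsched_cb[cpuid] != NULL)
        return -1;                      /* CPU already claimed */
    offsched_cb[cpuid] = offsched_callback;
    return 0;
}

/*
 * Mock hot-unplug: run the registered callback, after which the real CPU
 * would be halted. Returns 1 if a callback ran, 0 otherwise.
 */
int mock_cpu_offline(int cpuid)
{
    if (cpuid < 0 || cpuid >= MOCK_NR_CPUS || offsched_cb[cpuid] == NULL)
        return 0;
    offsched_cb[cpuid]();
    offsched_cb[cpuid] = NULL;
    return 1;
}

/* Example payload a module might register. */
int demo_work_done;
void demo_service(void) { demo_work_done = 1; }
```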
The interface points to one of the problems that potential users of the
offline scheduler have brought up: one can only run kernel-context, and not
user-space, code using the facility. Because many of the applications that
might benefit from having the full attention of a CPU are existing
user-space programs, making the switch to in-kernel code is seen as
problematic.
Ben-Yehuda notes that the isolated
processor has "access to every piece of memory in the system"
and the kernel would still have access to any memory that the isolated
processor is using. He sees that as a benefit, but others, particularly
Mike Galbraith, see it differently:
I personally find the concept of
injecting an RTOS into a general purpose OS with no isolation to be
alien. Intriguing, but very very alien.
One of the main problems that some kernel hackers see with the offline
scheduler approach is that it
bypasses Linux entirely. That is, of course, the entire point of the
patch: devoting 100% of a CPU to a particular job. As Christoph Lameter puts it:
OFFSCHED takes the OS noise (interrupts,
timers, RCU, cacheline stealing etc etc) out of certain processors. You
cannot run an undisturbed piece of software on the OS right now.
Peter Zijlstra, though, sees that as a major negative: "Going around
the kernel doesn't benefit anybody, least of all Linux." There are
existing ways to do the same thing, so adding one into the kernel adds no
benefit, he says:
So its the concept of running stuff on a CPU outside of Linux that I
don't like. I mean, if you want that, go ahead and run RTLinux, RTAI,
L4-Linux etc.. lots of special non-Linux hypervisor/exo-kernel like
things around for you to run things outside Linux with.
But Ben-Yehuda sees multiple applications for processors dedicated to
specific tasks. He envisions a different kind of system, which he calls a
Service Oriented System (SOS), where the kernel is just one component, and
if the kernel "disturbs" a specific service, it should be
moved out of the way:
What i am suggesting is merely a different approach of how to handle
multiple core systems. instead of thinking in processes, threads and so
on i am thinking in services. Why not take a processor and define this
processor to do just firewalling ? encryption ? routing ? transmission ?
video processing... and so on...
Moving the kernel out of the way is not particularly popular with many
kernel hackers. But the idea of completely dedicating a processor to a
specific task is important to some users. In the high performance
computing (HPC) world, multiple processors spend most of their time working
on a
single, typically number-crunching, task. Removing even minimal
interruptions, those that perform scheduling and other kernel housekeeping
tasks, leads
to better overall performance. Essentially, those users want the
convenience of Linux running on one CPU, while the rest of the system's
CPUs are devoted to their particular application.
After a somewhat heated digression about generally reducing latencies in
the kernel, Andrew Morton asked for a
problem statement: "All I've seen is 'I want 100% access to a CPU'.
That's not a problem
statement - it's an implementation."
In answer, Chris Friesen described one
possible application:
In our case the problem statement was that we had an inherently
single-threaded emulator app that we wanted to push as hard as
absolutely possible.
We gave it as close to a whole cpu as we could using cpu and irq
affinity and we used message queues in shared memory to allow another
cpu to handle I/O. In our case we still had kernel threads running on
the app cpu, but if we'd had a straightforward way to avoid them we
would have used it.
That led Thomas Gleixner to consider an
alternative approach. He restated the problem as: "Run exactly one
thread on a dedicated CPU w/o any disturbance by the
scheduler tick." Given that definition, he suggested a fairly simple
approach:
All you need is a way to tell the
kernel that CPUx can switch off the scheduler tick when only one
thread is running and that very thread is running in user space. Once
another thread arrives on that CPU or the single thread enters the
kernel for a blocking syscall the scheduler tick has to be
restarted.
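Gleixner's condition is simple enough to write down as a predicate. The structure and names below are hypothetical and exist only to restate his rule in code; no such interface is described in the discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of one CPU's state. */
struct cpu_state {
    bool tickless_allowed;   /* administrator opted this CPU in */
    unsigned int nr_running; /* runnable threads on this CPU */
    bool in_user_space;      /* the running thread is executing user code */
};

/* The tick may stay off only while all three conditions hold. */
bool tick_can_stop(const struct cpu_state *c)
{
    assert(c != NULL);
    return c->tickless_allowed && c->nr_running == 1 && c->in_user_space;
}
```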
Gregory Haskins then suggested modifying
the FIFO scheduler class, or creating a new class with a higher priority,
so that it disables the scheduler tick. That would incorporate Gleixner's
idea into the existing scheduling framework. As might be guessed, there
are still some details to work out on running a process without the
scheduler tick, but Gleixner and others think it is something that can be
done.
The offline scheduler itself kind of fell by the wayside in the
discussion. Ben-Yehuda, unsurprisingly, is still pushing his approach, but
aside from the distaste expressed about circumventing the kernel, the
inability to run user-space code is problematic. Gleixner was fairly blunt about it:
I was talking about the problem that you
cannot run an ordinary user space task on your offlined CPU. That's
the main point where the design sucks. Having specialized programming
environments which impose tight restrictions on the application
programmer for no good reason are horrible.
Others are also thinking about the problem, as a similar idea to Gleixner's
was recently posted by Josh Triplett in an
RFC to linux-kernel. Triplett's tiny patch simply disables the timer tick
permanently
as a demonstration of the gain in performance that can be achieved for CPU-bound
processes. He notes that the overhead for the timer tick can be
significant:
On my system, the timer tick takes about
80us, every 1/HZ seconds; that represents a significant overhead. 80us
out of every 1ms, for instance, means 8% overhead. Furthermore, the
time taken varies, and the timer interrupts lead to jitter in the
performance of the number crunching.
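The arithmetic in that quote generalizes to other tick rates. The helper below is only a worked version of the calculation, with a made-up function name: a tick costing tick_us microseconds, taken hz times per second, consumes tick_us * hz microseconds out of every million.

```c
#include <assert.h>

/* Percentage of CPU time spent in the timer tick. */
double tick_overhead_percent(unsigned int tick_us, unsigned int hz)
{
    assert(hz > 0);
    /* (tick_us * hz) us per 1,000,000 us, times 100, is a division by 10,000. */
    return ((double)tick_us * hz) / 10000.0;
}
```

An 80us tick at HZ=1000 gives the 8% figure quoted above; the same tick at HZ=250 would cost 2%.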
Triplett warns that his patch "by no means represents a complete
solution" in that it breaks RCU, process accounting, and other
things. But it does boot and can run his tests. He has fixes for some of
those problems in progress, as well as an overall goal: "I'd like to work towards a patch which really can kill off the timer
tick, making the kernel entirely event-driven and removing the polling
that occurs in the timer tick. I've reviewed everything the timer tick
does, and every last bit of it could occur using an event-driven
approach."
It is pretty unlikely that we will see the offline scheduler ever make it
into the mainline, but the idea behind it has spawned some interesting
discussions that may lead to a solution for those looking to eliminate
kernel overhead on some CPUs. In many ways, it is another example of the
perils of
developing kernel code in isolation. Had Ben-Yehuda been working in the
open, and looking for comments from the kernel community, he might have
realized that his approach would not be acceptable—at least for the
mainline—much sooner.
By Jonathan Corbet
August 31, 2009
Technologies such as filesystem journaling (as used with ext3) or RAID are
generally adopted with the purpose of improving overall reliability. Some
system administrators may thus be a little disconcerted by a recent
linux-kernel thread suggesting that, in some situations, those technologies
can actually increase the risk of data loss. This article attempts to
straighten out the arguments and reach a conclusion about how worried
system administrators should be.
The conversation actually began last March, when Pavel Machek posted a proposed documentation patch describing the
assumptions that he saw as underlying the design of Linux filesystems.
Things went quiet for a while, before springing back to life at the end of
August. It
would appear that Pavel had run into some data-loss problems when using a
flash drive with a flaky connection to the computer; subsequent tests done
by deliberately removing active drives confirmed that it is easy to lose
data that way. He hadn't expected that:
Before I pulled that flash card, I assumed that doing so is safe,
because flashcard is presented as block device and ext3 should cope
with sudden disk disconnects. And I was wrong wrong wrong. (Noone
told me at the university. I guess I should want my money back).
In an attempt to prevent a surge in refund requests at universities
worldwide, Pavel tried to get some warnings put into the kernel
documentation. He has run into a surprising amount of opposition, which he
(and some others) have taken as an attempt to sweep shortcomings in Linux
filesystems under the rug. The real story, naturally, is a bit more
complex.
Journaling technology like that used in ext3 works by writing some data to
the filesystem twice. Whenever the filesystem must make a metadata change,
it will first gather together all of the block-level changes required and
write them to a special area of the disk (the journal). Once it is known
that the full description of the changes has made it to the media, a
"commit record" is written, indicating that the filesystem code is
committed to the change. Once the commit record is also safely on the
media, the filesystem can start writing the metadata changes to the
filesystem itself. Should the operation be interrupted (by a power
failure, say, or a system crash or abrupt removal of the media), the
filesystem can recover the plan for the changes from the journal and start
the process over again. The end result is to make metadata changes
transactional; they either happen completely or not at all. And that
should prevent corruption of the filesystem structure.
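The sequence described above (log the changes, write a commit record, then update in place, replaying on recovery) can be sketched with a toy in-memory "disk". This is a simplified model with invented names, not ext3 code; in particular, an integer stands in for each metadata block, and a "crash" is simulated by applying only some of the updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBLOCKS 8   /* toy filesystem size, in blocks */
#define MAX_TXN 4   /* maximum block updates per transaction */

struct update { int blkno; int value; };

struct toy_disk {
    int blocks[NBLOCKS];             /* metadata in its final location */
    struct update journal[MAX_TXN];  /* the journal area */
    int journal_len;
    bool committed;                  /* is a commit record present? */
};

/* Step 1: write the full description of the changes to the journal. */
void journal_log(struct toy_disk *d, const struct update *u, int n)
{
    assert(n >= 0 && n <= MAX_TXN);
    memcpy(d->journal, u, (size_t)n * sizeof(*u));
    d->journal_len = n;
    d->committed = false;
}

/* Step 2: write the commit record. */
void journal_commit(struct toy_disk *d)
{
    d->committed = true;
}

/* Step 3: update in place; limit < journal_len simulates a crash midway. */
void journal_apply(struct toy_disk *d, int limit)
{
    for (int i = 0; i < d->journal_len && i < limit; i++)
        d->blocks[d->journal[i].blkno] = d->journal[i].value;
}

/* Recovery at mount time: replay committed work, discard the rest. */
void journal_recover(struct toy_disk *d)
{
    if (d->committed)
        journal_apply(d, d->journal_len);
    d->journal_len = 0;
    d->committed = false;
}
```

A crash before the commit record leaves the old metadata untouched; a crash after it is repaired by replaying the journal, so the change is all-or-nothing.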
One thing worth noting here is that actual data is not normally written to
the journal, so a certain amount of recently-written data can be lost in
an abrupt failure. It is possible to configure ext3 (and ext4) to write
data to the journal as well, but, since the performance cost is
significant, this
option is not heavily used. So one should keep in mind that most
filesystem journaling is there to protect metadata, not the data itself.
Journaling does provide some data protection anyway - if the metadata is
lost, the associated data can no longer be found - but that's not its
primary reason for existing.
It is not the lack of journaling for data which has created grief for Pavel
and others, though. The nature of flash-based storage makes another
"interesting" failure mode possible. Filesystems work with fixed-size
blocks, normally 4096 bytes on Linux. Storage devices also use fixed-size
blocks; on traditional rotating media, those blocks are traditionally 512
bytes in length, though larger
block sizes are on the horizon. The key point is that, on a normal
rotating disk, the filesystem can write a block without disturbing any
unrelated blocks on the drive.
Flash storage also uses fixed-size blocks, but they tend to be large -
typically tens to hundreds of kilobytes. Flash blocks can only be
rewritten as a unit, so writing a 4096-byte "block" at the operating system
level will require a larger read-modify-write cycle within the flash drive. It is
certainly possible for a careful programmer to write flash-drive firmware
which does this operation in a safe, transactional manner. It is also possible
that the flash drive manufacturer was rather more interested in getting a
cheap device to market quickly than in careful programming. In the commodity
PC hardware market, that possibility becomes something much closer to a
certainty.
What this all means is that, on a low-quality flash drive, an interrupted
write operation could result in the corruption of blocks unrelated to that
operation. If the interrupted write was for metadata, a journaling
filesystem will redo the operation on the next mount, ensuring that the
metadata ends up in its intended destination. But the filesystem cannot
know about any unrelated blocks which might have been trashed at the same
time. So journaling will not protect against this kind of failure - even
if it causes the sort of metadata corruption that journaling is intended to
prevent.
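A toy model makes the collateral damage easy to see. The sketch below assumes a hypothetical flash erase block holding four filesystem blocks and a naive firmware routine that erases before rewriting; real devices differ widely, and nothing here describes any particular product.

```c
#include <assert.h>
#include <string.h>

#define FS_BLOCKS_PER_ERASE 4   /* hypothetical geometry */
#define ERASED (-1)             /* marker for an erased, unwritten block */

struct erase_block {
    int fs_block[FS_BLOCKS_PER_ERASE];  /* one int stands in for 4096 bytes */
};

/*
 * Naive read-modify-write: copy the erase block to RAM, erase the whole
 * unit, then program the blocks back one at a time. power_fails_after is
 * the number of blocks programmed before power is lost; pass
 * FS_BLOCKS_PER_ERASE for an uninterrupted write.
 */
void naive_rmw(struct erase_block *eb, int idx, int value,
               int power_fails_after)
{
    int copy[FS_BLOCKS_PER_ERASE];

    assert(idx >= 0 && idx < FS_BLOCKS_PER_ERASE);
    memcpy(copy, eb->fs_block, sizeof(copy));
    copy[idx] = value;

    for (int i = 0; i < FS_BLOCKS_PER_ERASE; i++)
        eb->fs_block[i] = ERASED;

    for (int i = 0; i < FS_BLOCKS_PER_ERASE && i < power_fails_after; i++)
        eb->fs_block[i] = copy[i];
}
```

If power fails after two blocks have been programmed, a write aimed at block 0 also wipes out blocks 2 and 3, which the filesystem never touched and its journal knows nothing about.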
This is the "bug" in ext3 that Pavel wished to document. He further
asserted that journaling filesystems can actually make things worse in this
situation. Since a full fsck is not normally required on journaling
filesystems, even after an improper dismount, any "collateral" metadata
damage will go undetected. At best, the user may remain unaware for some
time that random data has been lost. At worst, corrupt metadata could
cause the code to corrupt other parts of the filesystem over the course of
subsequent operation. The skipped fsck may have enabled the system to come
back up quickly, but it has done so at the risk of letting corruption
persist and, possibly, spread.
One could easily argue that the real problem here is the use of hidden
translation layers to make a flash device look like a normal drive. David
Woodhouse did exactly that:
This just goes to show why having this "translation layer" done in
firmware on the device itself is a _bad_ idea. We're much better
off when we have full access to the underlying flash and the OS can
actually see what's going on. That way, we can actually debug, fix
and recover from such problems.
The manufacturers of flash drives have, thus far, proved impervious to this
line of reasoning, though.
There is a similar failure mode with RAID devices which was also
discussed. Drives can be grouped into a RAID5 or RAID6 array, with the
result that the array as a whole can survive the total failure of any drive
within it. As long as only one drive fails at a time, users of RAID arrays
can rest assured that the smoke coming out of their array is not taking
their data with it.
But what if more than one drive fails? RAID works by combining blocks into
larger stripes and associating checksums with those stripes. Updating a
block requires rewriting the stripe containing it and the associated
checksum block. So, if writing a block can cause the array to lose the
entire stripe, we could see data loss much like that which can happen with
a flash drive. As a normal rule, this kind of loss will not occur with a
RAID array. But it can happen if (1) one drive has already
failed, causing the array to run in "degraded" mode, and (2) a second
failure occurs (Pavel pulls the power cord, say) while the write is
happening.
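The double-failure case can be reproduced with XOR parity, the scheme RAID5 uses. The sketch below is a toy with one byte per block and invented names; it shows that when one drive is already missing, a write interrupted between the data update and the parity update changes what the missing drive's block reconstructs to.

```c
#include <assert.h>

#define NDATA 3   /* data blocks per stripe in this toy array */

struct stripe {
    unsigned char data[NDATA];  /* one byte stands in for each block */
    unsigned char parity;
};

/* Parity is the XOR of all data blocks in the stripe. */
unsigned char stripe_parity(const struct stripe *s)
{
    unsigned char p = 0;
    for (int i = 0; i < NDATA; i++)
        p ^= s->data[i];
    return p;
}

/* Degraded read: rebuild block `failed` from the survivors plus parity. */
unsigned char reconstruct(const struct stripe *s, int failed)
{
    unsigned char v = s->parity;

    assert(failed >= 0 && failed < NDATA);
    for (int i = 0; i < NDATA; i++)
        if (i != failed)
            v ^= s->data[i];
    return v;
}
```

With data {0x11, 0x22, 0x33} and drive 2 failed, reconstruct() returns 0x33 as expected. Overwrite block 0 with 0x44 and lose power before the parity is updated, and block 2, which was never written, now reconstructs to 0x66.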
Pavel concluded from this scenario that RAID devices may actually be more
dangerous than storing data on a single disk; he started a whole separate
subthread (under the subject "raid is dangerous
but that's secret") to that effect. This claim caused a fair amount of
concern on the list; many felt that it would push users to forgo
technologies like RAID in favor of single, non-redundant drive
configurations. Users who do that will avoid the possibility of data loss
resulting from a specific, unlikely double failure, but at the cost of
rendering themselves entirely vulnerable to a much more likely single
failure. The end result would be a lot more data lost.
The real lessons from this discussion are fairly straightforward:
- Treat flash drives with care, do not expect them to be more reliable
than they are, and do not remove them from the system until all writes
are complete.
- RAID arrays can increase data reliability, but an array which is not
running with its full complement of working, populated drives has lost
the redundancy which provides that reliability. If the consequences
of a second failure would be too severe, one should avoid writing to
arrays running in degraded mode.
- As Ric Wheeler pointed out, the
easiest way to lose data on a Linux system is to run the disks with
their write cache enabled. This is especially true on RAID5/6
systems, where write barriers are still not properly supported. There
has been some talk of
disabling drive write caches and enabling barriers by default, but no
patches have been posted yet.
- There is no substitute for good backups. Your editor would add that
any backups which have not been checked recently have a strong chance
of not being good backups.
How this information will be reflected in the kernel documentation remains
to be seen. Some of it seems like the sort of system administration
information which is not normally considered appropriate for inclusion in
the documentation of the kernel itself. But there is value in knowing what
assumptions one's filesystems are built on and what the possible failure
modes are. A better understanding of how we can lose data can only help us
to keep that from actually happening.
Page editor: Jonathan Corbet