
Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.31-rc8, released by Linus Torvalds on August 27. "This should be the last -rc, and it's really been quieting down. There's 131 commits there, and it's all pretty trivial." He predicts that the final 2.6.31 release will happen on Labor Day (September 7).

There have been no stable updates in the last week, and none are in the review process as of this writing.

Comments (none posted)

Kernel development news

Quotes of the week

As I see it, there are no SSD devices which don't lose data; there are only SSD devices which haven't lost your data _yet_.
-- David Woodhouse

What I've been recommending for some time is that people use LVM, and run fsck on a snapshot every week or two, at some convenient time when the system load is at a minimum. There is an e2croncheck script in the e2fsprogs sources, in the contrib directory; it's short enough that I'll attach it here.

Is it *necessary*? In a world where hardware is perfect, no. In a world where people don't bother buying ECC memory because it's 10% more expensive, and PC builders use the cheapest possible parts --- I think it's a really good idea.

-- Ted Ts'o

What it basically shows is how intolerant the mainline kernel community members have become towards people who hold a different view to them. The attitude is: either conform or you're an idiot and we're going to attack you until you conform.

I do hope others see what has happened here, and seriously consider whether they want to get involved in a sniping dictatorial community. Maybe considering to go down the BSD route instead.

-- Russell King

Because it throws out everything about what we know is good about how to design a modern scheduler in scalability. Because it's so ridiculously simple. Because it performs so ridiculously well on what it's good at despite being that simple. Because it's designed in such a way that mainline would never be interested in adopting it, which is how I like it. Because it will make people sit up and take notice of where the problems are in the current design. Because it throws out the philosophy that one scheduler fits all and shows that you can do a -lot- better with a scheduler designed for a particular purpose. I don't want to use a steamroller to crack nuts.
-- Con Kolivas is back

Comments (12 posted)

In brief

By Jonathan Corbet
September 2, 2009
CFS hard limits. The Linux "completely fair scheduler" works by dividing the available CPU time between the processes contending for it. In many situations, though, processes running on the system will not actually use their full fair share; they may spend enough time waiting for I/O, for example, that they simply cannot run enough to use all of the time they are entitled to. In such situations, CFS will give the left-over time to more CPU-intensive processes that can make good use of it, even if those processes have exceeded their allocation.

That is normally the right thing to do; better to put the CPU time to good use than to have the processor go idle while processes want to run. But there are, it seems, situations where system administrators would rather not hand out excess CPU time in that way. If, for example, the processes belong to a customer who is paying for a certain amount of processing time, giving away more could be bad business. To keep this from happening, Bharata B Rao has created the CFS hard limits patch set. Hard limits are managed using control groups; they allow the administrator to set an absolute limit on the amount of CPU time the control group as a whole is able to use over a given period of real time. Billing users who want their limit raised is, of course, a user-space policy issue, so it's not part of this patch.
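The quota-per-period model behind the hard limits patch can be sketched with a toy simulation. This is purely illustrative (the names and the accounting are mine, not the patch's); the real patch set enforces the limit inside the scheduler itself:

```python
# Toy model of CFS-style hard limits: a group may consume at most
# `quota_us` microseconds of CPU in every `period_us` microseconds of
# real time, even if the CPU would otherwise go idle. Illustrative
# only; not the kernel's implementation.

class ThrottledGroup:
    def __init__(self, quota_us, period_us):
        self.quota_us = quota_us
        self.period_us = period_us
        self.used_us = 0
        self.period_start = 0

    def try_run(self, now_us, want_us):
        """Return how many microseconds the group may run at time now_us."""
        # Start a fresh accounting period if the old one has elapsed.
        if now_us - self.period_start >= self.period_us:
            self.period_start = now_us
            self.used_us = 0
        granted = max(min(want_us, self.quota_us - self.used_us), 0)
        self.used_us += granted
        return granted

# A group limited to 25ms of CPU in every 100ms period:
g = ThrottledGroup(quota_us=25_000, period_us=100_000)
print(g.try_run(0, 30_000))        # 25000: capped at the quota
print(g.try_run(50_000, 10_000))   # 0: quota exhausted until next period
print(g.try_run(100_000, 10_000))  # 10000: new period, quota refreshed
```

The key difference from plain CFS shares is visible in the second call: the CPU may be idle, but the group still gets nothing until its period rolls over.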

Discard again. The "discard" operation, which informs a block storage device that specific blocks are no longer in use, should help a wide variety of storage technologies - including solid-state devices and "thin provisioned" arrays - to perform better. But discard itself has some performance issues; see the trouble with discard for details.

Christoph Hellwig is trying to improve discard performance with a new set of patches, some of which originally come from Matthew Wilcox. These changes allow discard requests to cover much larger sections of the storage device; previously they had been limited by the maximum request size for the device. When combined with the XFS-specific XFS_IOC_TRIM ioctl() command, this change allows user-space to issue bulk discard operations for all of the free portions of a filesystem partition at an opportune time. The patches also add better control over whether any specific discard request should be seen as a queue barrier and whether it should be performed as a blocking operation.

Upcoming network driver API change. Not content with having reworked the network driver API once (by moving operations into their own structure), Stephen Hemminger now has a new patch set which changes the API implemented by all drivers. The function involved is ndo_start_xmit(), which is used by the networking layer to pass a packet to the driver for transmission. This function should really only return one of two values: NETDEV_TX_OK (meaning that the packet has been accepted and queued for transmission) or NETDEV_TX_BUSY (the packet was not accepted because the queue was full or some similar problem came up). Drivers using the deprecated LLTX mode can also return NETDEV_TX_LOCKED to indicate that the transmit lock was already taken.

The problem is that the return type for ndo_start_xmit() was defined as int; some driver writers thought that meant they could return arbitrary error codes to the networking layer. With Stephen's patch, the return type becomes netdev_tx_t, an enum containing only the defined return codes. That should catch any driver writers who try to return the wrong thing - but at the cost of changing a lot of drivers.
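The design principle here - constrain a return value to an explicit set of codes so that stray errno-style integers are caught - can be sketched outside the kernel. The constant names below mirror the kernel's, but the checking function is an illustration of the idea, not kernel code (in C, the netdev_tx_t enum catches offenders via type checking tools rather than at run time):

```python
# Sketch of the idea behind netdev_tx_t: a transmit routine's return
# value is restricted to a small set of named codes rather than a bare
# int, so that a driver returning, say, -ENOMEM is caught immediately.
from enum import Enum

class NetdevTx(Enum):
    NETDEV_TX_OK = 0      # packet accepted and queued for transmission
    NETDEV_TX_BUSY = 1    # queue full; networking core should retry
    NETDEV_TX_LOCKED = 2  # LLTX drivers only: transmit lock already held

def check_xmit_return(value):
    """Reject anything that is not a defined transmit return code."""
    if not isinstance(value, NetdevTx):
        raise TypeError(f"driver returned {value!r}, not a netdev_tx_t")
    return value

check_xmit_return(NetdevTx.NETDEV_TX_OK)   # fine
try:
    check_xmit_return(-12)                 # a driver returning -ENOMEM
except TypeError as e:
    print("caught:", e)
```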

Checkpoint/restore wiki. There is a new wiki dedicated to the collection of information about the rapidly-developing checkpoint/restore functionality. It's a little bare at the moment, but, one assumes, it will soon be filled with information about this feature.

The actual checkpoint/restore task remains an exercise in complexity. As an example, consider one of the most recently-posted pieces: checkpoint and restore for security credentials. It requires a number of hooks into LSM modules to obtain the current security state, serialize it, and to restore it at some future time. It can all probably be made to work, but long-term maintenance could prove to be painful.

The BFS scheduler. Con Kolivas, who worked on desktop interactivity issues in the past before abruptly leaving the kernel development community in 2007, has posted a new scheduler called BFS. Con says:

It was designed to be forward looking only, make the most of lower spec machines, and not scale to massive hardware. ie it is a desktop orientated scheduler, with extremely low latencies for excellent interactivity by design rather than 'calculated', with rigid fairness, nice priority distribution and extreme scalability within normal load levels.

(See the original LWN posting for the associated comment thread.)

Comments (none posted)

O_*SYNC

By Jonathan Corbet
September 1, 2009
When developers think about forcing data written to files to be flushed to the underlying storage device, they tend to think about the fsync() system call. But it is also possible to request synchronous behavior for all operations on a file descriptor, either at open() time or using fcntl(). Support in Linux for synchronous I/O flags is likely to improve in 2.6.32, but this work has raised a couple of interesting issues with regard to the current implementation and forward compatibility.

There are three standard-defined flags which can be used to specify synchronous I/O behavior:

  • O_SYNC: requires that any write operations block until all data and all metadata have been written to persistent storage.

  • O_DSYNC: like O_SYNC, except that there is no requirement to wait for any metadata changes which are not necessary to read the just-written data. In practice, O_DSYNC means that the application does not need to wait until ancillary information (the file modification time, for example) has been written to disk. Using O_DSYNC instead of O_SYNC can often eliminate the need to flush the file inode on a write.

  • O_RSYNC: this flag, which only affects read operations, must be used in combination with either O_SYNC or O_DSYNC. It will cause a read() call to block until the data (and maybe metadata) being read has been flushed to disk (if necessary). This flag thus gives the kernel the option of delaying the flushing of data to disk; any number of writes can happen, but data need not be flushed until the application reads it back.

O_DSYNC and O_RSYNC are not new; they were added to the relevant standards well over ten years ago. But Linux has never really supported them (they are optional features), so glibc simply defines them both to be the same as O_SYNC.
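On systems where the flags are genuinely supported, using them from an application is straightforward. A minimal sketch, using Python's os-level open() as a stand-in for the C interface; the fallback to O_SYNC mirrors what glibc historically did when O_DSYNC was not separately implemented:

```python
# Open a file for data-synchronous writes: each write() returns only
# once the data (though not necessarily all metadata, e.g. the mtime)
# has reached persistent storage. Minimal demonstration; the fallback
# covers platforms where O_DSYNC is not separately defined.
import os
import tempfile

dsync = getattr(os, "O_DSYNC", os.O_SYNC)

path = os.path.join(tempfile.mkdtemp(), "testfile")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | dsync, 0o600)
try:
    os.write(fd, b"hello, synchronous world\n")
finally:
    os.close(fd)

with open(path, "rb") as f:
    print(f.read().decode(), end="")
```

The synchronous behavior itself is invisible in the return values, of course; what changes is when write() is allowed to return.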

Christoph Hellwig is working on a proper implementation of these flags, with an eye toward merging the changes in 2.6.32. It should be a relatively straightforward change at this point; the kernel has some nice infrastructure for handling data and metadata flushing now. What is potentially harder is making the change in a way which best meets the expectations of existing applications.

There are two unrelated issues which make this transition harder than one might expect:

  • Linux has never actually implemented O_SYNC; what applications have been getting, instead, is O_DSYNC.

  • The open() implementation in the kernel simply ignores flags that it knows nothing about. This behavior can be changed only at risk of breaking unknown numbers of applications; it's an aspect of the kernel ABI.

Given the first problem listed above, one might be tempted to make a new flag for O_DSYNC and use it to obtain the current behavior, while O_SYNC would get the full metadata synchronization semantics. If this were to be done, though, applications which are built against a new C library but run on an older kernel would be presenting an unknown flag to open(), which would duly ignore it. That application would not get synchronous I/O behavior at all, which is almost certainly not a good thing. So something trickier will have to be done.

There is also the question of which semantics older applications should get. Jamie Lokier argued that applications requesting O_SYNC behavior wanted full metadata synchronization, even if the kernel has been cheating them out of the full experience. So, when running under a future kernel with a proper O_SYNC implementation, an old, binary application should get O_SYNC behavior. Ulrich Drepper, instead, thinks that behavior should not change for older applications:

But these programs apparently can live with the broken semantics. I don't worry too much about this. If people really need the fixed O_SYNC semantics then let them recompile their code.

It looks like Ulrich's view will win out, for the simple reason that the performance cost of the additional metadata synchronization seems worse than giving applications the semantics they have been running with anyway, even if those semantics are not quite what was promised.

Christoph outlined the likely course of action. Internally, O_SYNC will become O_DSYNC, and the open() flag which is currently O_SYNC will come to mean O_DSYNC. The open() system call will then take a new flag (name unknown; O_FULLSYNC and O_ISYNC have been suggested) which will be hidden from applications. At the glibc level, applications will see this:

    #define O_SYNC	(O_FULLSYNC|O_DSYNC)

On older kernels, the O_DSYNC flag (with the same value as O_SYNC now) will yield the same behavior as always, while O_FULLSYNC will be ignored. On newer kernels, the new flag will yield the full O_SYNC semantics. As long as applications do not reach under the hood and try to manipulate the O_FULLSYNC flag directly, all will be well.
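As it happens, this is essentially the scheme that later shipped (with __O_SYNC, rather than O_FULLSYNC or O_ISYNC, as the hidden kernel flag), so the predicted bit relationship - O_DSYNC's bits being a strict subset of O_SYNC's - can be checked on a current Linux system. A quick, guarded check (the relation is Linux-specific; other platforms define the flags independently):

```python
# On Linux kernels and C libraries where this plan landed (2.6.33 and
# later), O_SYNC is a new "full sync" bit OR'd with O_DSYNC. Check the
# predicted relationship between the two flag values. Guarded so it is
# a no-op on platforms with different flag layouts.
import os
import sys

if sys.platform.startswith("linux") and hasattr(os, "O_DSYNC"):
    # O_DSYNC's bits are a subset of O_SYNC's...
    assert os.O_SYNC & os.O_DSYNC == os.O_DSYNC
    # ...but O_SYNC carries an extra bit requesting metadata sync too.
    assert os.O_SYNC != os.O_DSYNC
    print(f"O_SYNC={os.O_SYNC:#o} O_DSYNC={os.O_DSYNC:#o}")
```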

Comments (none posted)

The offline scheduler

By Jake Edge
September 2, 2009

One of the primary functions of any kernel is to manage the CPU resources of the hardware that it is running on. A recent patch, proposed by Raz Ben-Yehuda, would change that by removing one or more CPUs from the kernel's control, so that processes could run, undisturbed, on those processors. The "offline scheduler", as Ben-Yehuda calls his patch, had some rough sailing in the initial reactions to the idea, but as the thread on linux-kernel evolved, kernel hackers stepped back and looked at the problems it is trying to solve—and came up with other potential solutions.

The basic idea behind the offline scheduler is fairly straightforward: use the CPU hot-unplug facility to remove the processor from the system, but instead of halting the processor, allow other code to be run on it. Because the processor would not be participating in the various CPU synchronization schemes (RCU, spinlocks, etc.), nor would it be handling interrupts, it can completely devote its attention to the code that it is running. The idea is that code running on the offline processor would not suffer from any kernel-introduced latencies at all.

The core patch is fairly small. It provides an interface to register a function to be called when a particular CPU is taken offline:

    int register_offsched(void (*offsched_callback)(void), int cpuid);

This registers a callback that will be made when the CPU with the given cpuid is taken offline (i.e. hot-unplugged). Typically, a user would load a module that calls register_offsched(), then take the CPU offline, which triggers the callback on the just-offlined CPU. When the processing completes and the callback returns, the processor is halted. At that point, the CPU can be brought back online and returned to the kernel's control.

The interface points to one of the problems that potential users of the offline scheduler have brought up: one can only run kernel-context, and not user-space, code using the facility. Because many of the applications that might benefit from having the full attention of a CPU are existing user-space programs, making the switch to in-kernel code is seen as problematic.

Ben-Yehuda notes that the isolated processor has "access to every piece of memory in the system" and the kernel would still have access to any memory that the isolated processor is using. He sees that as a benefit, but others, particularly Mike Galbraith, see it differently:

I personally find the concept of injecting an RTOS into a general purpose OS with no isolation to be alien. Intriguing, but very very alien.

One of the main problems that some kernel hackers see with the offline scheduler approach is that it bypasses Linux entirely. That is, of course, the entire point of the patch: devoting 100% of a CPU to a particular job. As Christoph Lameter puts it:

OFFSCHED takes the OS noise (interrupts, timers, RCU, cacheline stealing etc etc) out of certain processors. You cannot run an undisturbed piece of software on the OS right now.

Peter Zijlstra, though, sees that as a major negative: "Going around the kernel doesn't benefit anybody, least of all Linux." There are existing ways to do the same thing, so adding one into the kernel adds no benefit, he says:

So its the concept of running stuff on a CPU outside of Linux that I don't like. I mean, if you want that, go ahead and run RTLinux, RTAI, L4-Linux etc.. lots of special non-Linux hypervisor/exo-kernel like things around for you to run things outside Linux with.

But, Ben-Yehuda sees multiple applications for processors dedicated to specific tasks. He envisions a different kind of system, which he calls a Service Oriented System (SOS), where the kernel is just one component, and if the kernel "disturbs" a specific service, it should be moved out of the way:

What i am suggesting is merely a different approach of how to handle multiple core systems. instead of thinking in processes, threads and so on i am thinking in services. Why not take a processor and define this processor to do just firewalling ? encryption ? routing ? transmission ? video processing... and so on...

Moving the kernel out of the way is not particularly popular with many kernel hackers. But the idea of completely dedicating a processor to a specific task is important to some users. In the high performance computing (HPC) world, multiple processors spend most of their time working on a single, typically number-crunching, task. Removing even minimal interruptions, those that perform scheduling and other kernel housekeeping tasks, leads to better overall performance. Essentially, those users want the convenience of Linux running on one CPU, while the rest of the system's CPUs are devoted to their particular application.

After a somewhat heated digression about generally reducing latencies in the kernel, Andrew Morton asked for a problem statement: "All I've seen is 'I want 100% access to a CPU'. That's not a problem statement - it's an implementation." In answer, Chris Friesen described one possible application:

In our case the problem statement was that we had an inherently single-threaded emulator app that we wanted to push as hard as absolutely possible.

We gave it as close to a whole cpu as we could using cpu and irq affinity and we used message queues in shared memory to allow another cpu to handle I/O. In our case we still had kernel threads running on the app cpu, but if we'd had a straightforward way to avoid them we would have used it.
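The user-space half of the affinity technique Friesen describes is available today; a minimal Linux-only sketch (IRQ affinity is configured separately, by writing masks to /proc/irq/*/smp_affinity, which this does not attempt):

```python
# Pin the current process to a single CPU - the scheduling-affinity
# half of the approach described above. Linux-only; guarded so it is
# a no-op on platforms without sched_setaffinity().
import os

if hasattr(os, "sched_setaffinity"):
    available = os.sched_getaffinity(0)   # CPUs this process may use
    target = min(available)               # pick one, e.g. the lowest

    os.sched_setaffinity(0, {target})     # restrict to that one CPU
    assert os.sched_getaffinity(0) == {target}

    os.sched_setaffinity(0, available)    # restore the original mask
    print(f"pinned to CPU {target}, then restored")
```

Note that, as Friesen says, this only keeps other user processes away; kernel threads and the scheduler tick still run on the pinned CPU, which is exactly the residual disturbance the rest of this discussion is about.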

That led Thomas Gleixner to consider an alternative approach. He restated the problem as: "Run exactly one thread on a dedicated CPU w/o any disturbance by the scheduler tick." Given that definition, he suggested a fairly simple approach:

All you need is a way to tell the kernel that CPUx can switch off the scheduler tick when only one thread is running and that very thread is running in user space. Once another thread arrives on that CPU or the single thread enters the kernel for a blocking syscall the scheduler tick has to be restarted.

Gregory Haskins then suggested modifying the FIFO scheduler class, or creating a new class with a higher priority, so that it disables the scheduler tick. That would incorporate Gleixner's idea into the existing scheduling framework. As might be guessed, there are still some details to work out on running a process without the scheduler tick, but Gleixner and others think it is something that can be done.

The offline scheduler itself kind of fell by the wayside in the discussion. Ben-Yehuda, unsurprisingly, is still pushing his approach, but aside from the distaste expressed about circumventing the kernel, the inability to run user-space code is problematic. Gleixner was fairly blunt about it:

I was talking about the problem that you cannot run an ordinary user space task on your offlined CPU. That's the main point where the design sucks. Having specialized programming environments which impose tight restrictions on the application programmer for no good reason are horrible.

Others are also thinking about the problem, as a similar idea to Gleixner's was recently posted by Josh Triplett in an RFC to linux-kernel. Triplett's tiny patch simply disables the timer tick permanently as a demonstration of the gain in performance that can be achieved for CPU-bound processes. He notes that the overhead for the timer tick can be significant:

On my system, the timer tick takes about 80us, every 1/HZ seconds; that represents a significant overhead. 80us out of every 1ms, for instance, means 8% overhead. Furthermore, the time taken varies, and the timer interrupts lead to jitter in the performance of the number crunching.

Triplett warns that his patch "by no means represents a complete solution" in that it breaks RCU, process accounting, and other things. But it does boot and can run his tests. He has fixes for some of those problems in progress, as well as an overall goal: "I'd like to work towards a patch which really can kill off the timer tick, making the kernel entirely event-driven and removing the polling that occurs in the timer tick. I've reviewed everything the timer tick does, and every last bit of it could occur using an event-driven approach."
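Triplett's arithmetic generalizes simply: a tick costing t seconds, fired HZ times per second, consumes a fraction t·HZ of the CPU. A quick check of the numbers:

```python
# The overhead arithmetic from Triplett's example, generalized: a tick
# that costs `tick_cost_s` seconds, fired `hz` times per second,
# consumes tick_cost_s * hz of the CPU.
def tick_overhead(tick_cost_s, hz):
    return tick_cost_s * hz

# 80us per tick at HZ=1000 (one tick per millisecond) is 8% overhead:
print(f"{tick_overhead(80e-6, 1000):.0%}")
# The same tick cost at the lower common HZ values:
print(f"{tick_overhead(80e-6, 250):.0%}")
print(f"{tick_overhead(80e-6, 100):.1%}")
```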

It is pretty unlikely that we will see the offline scheduler ever make it into the mainline, but the idea behind it has spawned some interesting discussions that may lead to a solution for those looking to eliminate kernel overhead on some CPUs. In many ways, it is another example of the perils of developing kernel code in isolation. Had Ben-Yehuda been working in the open, and looking for comments from the kernel community, he might have realized that his approach would not be acceptable—at least for the mainline—much sooner.

Comments (11 posted)

Ext3 and RAID: silent data killers?

By Jonathan Corbet
August 31, 2009
Technologies such as filesystem journaling (as used with ext3) or RAID are generally adopted with the purpose of improving overall reliability. Some system administrators may thus be a little disconcerted by a recent linux-kernel thread suggesting that, in some situations, those technologies can actually increase the risk of data loss. This article attempts to straighten out the arguments and reach a conclusion about how worried system administrators should be.

The conversation actually began last March, when Pavel Machek posted a proposed documentation patch describing the assumptions that he saw as underlying the design of Linux filesystems. Things went quiet for a while, before springing back to life at the end of August. It would appear that Pavel had run into some data-loss problems when using a flash drive with a flaky connection to the computer; subsequent tests done by deliberately removing active drives confirmed that it is easy to lose data that way. He hadn't expected that:

Before I pulled that flash card, I assumed that doing so is safe, because flashcard is presented as block device and ext3 should cope with sudden disk disconnects. And I was wrong wrong wrong. (Noone told me at the university. I guess I should want my money back).

In an attempt to prevent a surge in refund requests at universities worldwide, Pavel tried to get some warnings put into the kernel documentation. He has run into a surprising amount of opposition, which he (and some others) has taken as an attempt to sweep shortcomings in Linux filesystems under the rug. The real story, naturally, is a bit more complex.

Journaling technology like that used in ext3 works by writing some data to the filesystem twice. Whenever the filesystem must make a metadata change, it will first gather together all of the block-level changes required and write them to a special area of the disk (the journal). Once it is known that the full description of the changes has made it to the media, a "commit record" is written, indicating that the filesystem code is committed to the change. Once the commit record is also safely on the media, the filesystem can start writing the metadata changes to the filesystem itself. Should the operation be interrupted (by a power failure, say, or a system crash or abrupt removal of the media), the filesystem can recover the plan for the changes from the journal and start the process over again. The end result is to make metadata changes transactional; they either happen completely or not at all. And that should prevent corruption of the filesystem structure.

One thing worth noting here is that actual data is not normally written to the journal, so a certain amount of recently-written data can be lost in an abrupt failure. It is possible to configure ext3 (and ext4) to write data to the journal as well, but, since the performance cost is significant, this option is not heavily used. So one should keep in mind that most filesystem journaling is there to protect metadata, not the data itself. Journaling does provide some data protection anyway - if the metadata is lost, the associated data can no longer be found - but that's not its primary reason for existing.
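The commit-record protocol described above can be sketched with a toy write-ahead journal. This is not ext3 code: a dict stands in for the disk, a list for the journal, and "crashes" are simulated by stopping after a given number of low-level steps:

```python
# Toy write-ahead journal illustrating the commit-record protocol:
# changes are fully described in the journal, a commit record is
# written, and only then are the real filesystem blocks updated.
# Recovery replays journaled changes only if the commit record exists.

def journal_write(disk, journal, changes, crash_after=None):
    """Apply metadata `changes` transactionally; `crash_after` simulates
    power loss after N low-level write steps."""
    steps = 0
    def step():
        nonlocal steps
        steps += 1
        if crash_after is not None and steps > crash_after:
            raise RuntimeError("crash")
    try:
        for block, data in changes.items():   # 1. journal the changes
            journal.append(("data", block, data)); step()
        journal.append(("commit",)); step()   # 2. write the commit record
        for block, data in changes.items():   # 3. update the filesystem
            disk[block] = data; step()
    except RuntimeError:
        pass                                  # the "power failure"

def replay(disk, journal):
    """Mount-time recovery: apply the journal only if it was committed."""
    if ("commit",) in journal:
        for entry in journal:
            if entry[0] == "data":
                disk[entry[1]] = entry[2]
    journal.clear()

# Crash before the commit record: nothing is applied, old state survives.
disk, journal = {"inode": "old"}, []
journal_write(disk, journal, {"inode": "new"}, crash_after=0)
replay(disk, journal)
print(disk["inode"])   # old

# Crash after the commit record: replay completes the change.
disk, journal = {"inode": "old"}, []
journal_write(disk, journal, {"inode": "new"}, crash_after=2)
replay(disk, journal)
print(disk["inode"])   # new
```

Either way, the metadata ends up in a consistent state - which is exactly the guarantee journaling provides, and the full extent of it.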

It is not the lack of journaling for data which has created grief for Pavel and others, though. The nature of flash-based storage makes another "interesting" failure mode possible. Filesystems work with fixed-size blocks, normally 4096 bytes on Linux. Storage devices also use fixed-size blocks; on traditional rotating media, those blocks have long been 512 bytes in length, though larger block sizes are on the horizon. The key point is that, on a normal rotating disk, the filesystem can write a block without disturbing any unrelated blocks on the drive.

Flash storage also uses fixed-size blocks, but they tend to be large - typically tens to hundreds of kilobytes. Flash blocks can only be rewritten as a unit, so writing a 4096-byte "block" at the operating system level will require a larger read-modify-write cycle within the flash drive. It is certainly possible for a careful programmer to write flash-drive firmware which does this operation in a safe, transactional manner. It is also possible that the flash drive manufacturer was rather more interested in getting a cheap device to market quickly than careful programming. In the commodity PC hardware market, that possibility becomes something much closer to a certainty.
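The hazard can be made concrete with a toy model of a careless translation layer. The sizes here are illustrative (real erase blocks vary by device), and real firmware is rather more involved; the point is only the ordering of the erase and program steps:

```python
# Toy model of the flash read-modify-write hazard: to rewrite one 4KB
# filesystem block, a naive (non-transactional) firmware erases and
# reprograms the whole 16KB erase block containing it. Power loss
# between the erase and the program destroys the *unrelated* blocks
# sharing that erase block. Sizes are illustrative.

FS_BLOCK = 4096
ERASE_BLOCK = 4 * FS_BLOCK   # four filesystem blocks per erase block

def naive_write(flash, lba, data, power_loss=False):
    """Rewrite one 4KB block via erase + program of the whole 16KB unit."""
    base = (lba * FS_BLOCK // ERASE_BLOCK) * ERASE_BLOCK
    whole = bytearray(flash[base:base + ERASE_BLOCK])        # read
    off = lba * FS_BLOCK - base
    whole[off:off + FS_BLOCK] = data                         # modify
    flash[base:base + ERASE_BLOCK] = b"\xff" * ERASE_BLOCK   # erase!
    if power_loss:
        return            # power fails before the program step
    flash[base:base + ERASE_BLOCK] = whole                   # write back

flash = bytearray(b"A" * FS_BLOCK + b"B" * FS_BLOCK +
                  b"C" * FS_BLOCK + b"D" * FS_BLOCK)
naive_write(flash, 1, b"X" * FS_BLOCK, power_loss=True)

# Only block 1 was being written, but block 0 is collateral damage:
print(flash[:FS_BLOCK] == b"A" * FS_BLOCK)   # False: erased along with it
```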

What this all means is that, on a low-quality flash drive, an interrupted write operation could result in the corruption of blocks unrelated to that operation. If the interrupted write was for metadata, a journaling filesystem will redo the operation on the next mount, ensuring that the metadata ends up in its intended destination. But the filesystem cannot know about any unrelated blocks which might have been trashed at the same time. So journaling will not protect against this kind of failure - even if it causes the sort of metadata corruption that journaling is intended to prevent.

This is the "bug" in ext3 that Pavel wished to document. He further asserted that journaling filesystems can actually make things worse in this situation. Since a full fsck is not normally required on journaling filesystems, even after an improper dismount, any "collateral" metadata damage will go undetected. At best, the user may remain unaware for some time that random data has been lost. At worst, corrupt metadata could cause the code to corrupt other parts of the filesystem over the course of subsequent operation. The skipped fsck may have enabled the system to come back up quickly, but it has done so at the risk of letting corruption persist and, possibly, spread.

One could easily argue that the real problem here is the use of hidden translation layers to make a flash device look like a normal drive. David Woodhouse did exactly that:

This just goes to show why having this "translation layer" done in firmware on the device itself is a _bad_ idea. We're much better off when we have full access to the underlying flash and the OS can actually see what's going on. That way, we can actually debug, fix and recover from such problems.

The manufacturers of flash drives have, thus far, proved impervious to this line of reasoning, though.

There is a similar failure mode with RAID devices which was also discussed. Drives can be grouped into a RAID5 or RAID6 array, with the result that the array as a whole can survive the total failure of any drive within it. As long as only one drive fails at a time, users of RAID arrays can rest assured that the smoke coming out of their array is not taking their data with it.

But what if more than one drive fails? RAID works by combining blocks into larger stripes and associating checksums with those stripes. Updating a block requires rewriting the stripe containing it and the associated checksum block. So, if writing a block can cause the array to lose the entire stripe, we could see data loss much like that which can happen with a flash drive. As a normal rule, this kind of loss will not occur with a RAID array. But it can happen if (1) one drive has already failed, causing the array to run in "degraded" mode, and (2) a second failure occurs (Pavel pulls the power cord, say) while the write is happening.
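The degraded-mode failure - often called the RAID5 "write hole" - is easy to demonstrate with XOR parity, the scheme RAID5 actually uses. A toy stripe of three data blocks plus parity:

```python
# Toy RAID5 stripe: three data blocks plus an XOR parity block. With
# one drive failed (degraded mode), its contents can be reconstructed
# from the survivors - unless a second failure interrupts a write and
# leaves data and parity inconsistent. Illustrative only.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity(blocks):
    p = blocks[0]
    for b in blocks[1:]:
        p = xor(p, b)
    return p

data = [b"aaaa", b"bbbb", b"cccc"]
par = parity(data)

# Drive 2 fails: its block is reconstructed from the others plus parity.
reconstructed = xor(xor(data[0], data[1]), par)
print(reconstructed)   # b'cccc'

# Still degraded, a write to drive 0 is interrupted after the data
# block is updated but before the parity block is (the write hole):
data[0] = b"zzzz"      # new data hits the platter...
# ...and power fails here, so `par` is never updated.

# Reconstructing the failed drive now yields garbage - data loss on a
# drive that was never even written to:
print(xor(xor(data[0], data[1]), par) == b"cccc")   # False
```

This is why the double-failure scenario matters: the lost data belongs to the already-failed drive, not to the block being written.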

Pavel concluded from this scenario that RAID devices may actually be more dangerous than storing data on a single disk; he started a whole separate subthread (under the subject "raid is dangerous but that's secret") to that effect. This claim caused a fair amount of concern on the list; many felt that it would push users to forgo technologies like RAID in favor of single, non-redundant drive configurations. Users who do that will avoid the possibility of data loss resulting from a specific, unlikely double failure, but at the cost of rendering themselves entirely vulnerable to a much more likely single failure. The end result would be a lot more data lost.

The real lessons from this discussion are fairly straightforward:

  • Treat flash drives with care, do not expect them to be more reliable than they are, and do not remove them from the system until all writes are complete.

  • RAID arrays can increase data reliability, but an array which is not running with its full complement of working, populated drives has lost the redundancy which provides that reliability. If the consequences of a second failure would be too severe, one should avoid writing to arrays running in degraded mode.

  • As Ric Wheeler pointed out, the easiest way to lose data on a Linux system is to run the disks with their write cache enabled. This is especially true on RAID5/6 systems, where write barriers are still not properly supported. There has been some talk of disabling drive write caches and enabling barriers by default, but no patches have been posted yet.

  • There is no substitute for good backups. Your editor would add that any backups which have not been checked recently have a strong chance of not being good backups.

How this information will be reflected in the kernel documentation remains to be seen. Some of it seems like the sort of system administration information which is not normally considered appropriate for inclusion in the documentation of the kernel itself. But there is value in knowing what assumptions one's filesystems are built on and what the possible failure modes are. A better understanding of how we can lose data can only help us to keep that from actually happening.

Comments (100 posted)

Patches and updates

Kernel trees

Linus Torvalds: Linux 2.6.31-rc8

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Security-related

Virtualization and containers

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds