Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.38-rc2, released on January 21. "Anyway. -rc2 is out there, and the only reason it's reasonably sized is that it was a short -rc2 - I think a few of the pull requests I got were a bit larger than I would have been happy with. And I might as well warn people that because the laptop I'm bringing with me is pitifully slow, I'm also planning on going into 'anal' mode, and not even bother pulling from trees unless they are clearly -rc material. IOW, don't try to push large pushes on me. I won't take them, and they can wait for 39." See the announcement for the short changelog, or the full changelog for all the details.

Stable updates: there have been no stable or longterm updates released in the last week.

Comments (none posted)

Quotes of the week

We care about everything. If the objective was to make life easier for ourselves, we'd all be on the golf course.

-- Andrew Morton

No way in hell do I want the situation of "the system is screwed, so let's overwrite the disk" to be something the kernel I release might do. It's crazy.

-- Linus Torvalds

Nice to see it gone - it seemed such a good idea in Linux 1.3

-- Alan Cox won't miss the BKL

Comments (3 posted)

The real BKL end game

By Jonathan Corbet
January 26, 2011

The removal of the big kernel lock (BKL) has been on the kernel community's "to do" list almost since that lock was first added to make the kernel work on multiprocessor systems. Over time, the significance of the lock has diminished as finer-grained locking was added to various kernel subsystems, but the BKL itself has endured. Getting rid of it for good remained desirable because the BKL can still cause unwanted latencies at times. There's also a certain amount of pride involved in completing the job. That completion has been long in coming, though; once the worst performance issues associated with the BKL were resolved, interest in doing the low-level work needed to finish the job declined.

Two years or so ago, though, developers started working on BKL removal again. Some of this work was motivated by the realtime tree, where patience with latency sources is rather more limited. Still, it seemed like completion remained a distant goal; hundreds of BKL call sites remained in the kernel.

Then Arnd Bergmann took on the task of eliminating the BKL entirely. His cleanup work has been going on for some time; if he has his way, this patch set (or something derived from it) will remove the BKL entirely in 2.6.39. To get there, about a dozen modules need to be addressed. Some of them (i830, autofs3, and smbfs) are simply to be removed. Others (appletalk and hpfs) are to be moved to the staging tree for near-term removal, though there is some resistance to that idea. The remaining modules are to be fixed in some way. Once that's taken care of, the final patch in the series removes the lock itself. It will not be missed.

Comments (24 posted)

Kernel development news

LCA: Server power management

By Jonathan Corbet
January 26, 2011

Power management is often seen as a concern mostly for embedded and mobile systems. They worry about power management because we want our phones to run for longer between recharges and our laptops to not inflict burns on our thighs. But power management is equally important for data centers, which are currently responsible for about 3% of the total power consumption in the US. Keeping the net going in the US requires about ~~15TW~~ 15GW of power - the dedicated output of about 15 nuclear power plants. Clearly there would be some real value to saving some of that power. Matthew Garrett's talk during the Southern Plumbers Miniconf at linux.conf.au 2011 covered the work that is being done in that area and where Linux stands relative to other operating systems.

Much of the power consumed by data centers is not directly controllable by Linux - it is overhead which is consumed outside of the computers themselves. About one watt of power is consumed by overhead for each watt consumed by computation. This overhead includes network infrastructure and power supply loss, but the biggest component is air conditioning. So the obvious thing to do here is to create more efficient cooling and power infrastructure. Running at higher ambient temperatures, while uncomfortable for humans, can also help. The best contemporary data centers have been able to reduce their overhead to about 20% - a big improvement. Cogeneration techniques - using heat from data centers to warm buildings, for example - can reduce that overhead even further.

But we still have trouble. A 48-core system, Matthew says, will draw about 350W when it is idle; a rack full of such systems will still pull a lot of power. What can be done? Most power management attention has been focused on the CPU, which is where a lot of the power goes. As a result, an idle Intel CPU now draws approximately zero watts of power - it is "terrifying" how well it works. When the CPU is working, though, the situation is a bit different; the power consumption is about 20W per core, or about 960W for a busy 48-core system.

The clear implication is that we should keep the CPUs idle whenever possible. That can be tricky, though; it is hard to write software which does nothing. Or - as Matthew corrected himself - it's hard to write useful software which does nothing.

There are some trends which can be pointed to in this area. CPU power management is essentially solved; Linux is quite good at it. In fact, Linux is better than any other operating system with regard to CPU power; we have more time in deep idle states and fewer wakeups than others. So interest is shifting toward memory power management. If all of the CPUs in a package within the system can be idled, the associated memory controller will go idle as well. It's also possible to put memory into "self-refresh" mode if it is idle, reducing power use while preserving the contents. In other situations, running memory at a lower clock rate can reduce power usage. There will be a lot of work in this area because, at this point, memory looks like the biggest, lowest-hanging fruit.

Even more power can be saved by simply turning a system off; that is where virtualization comes into play. If applications are run on virtualized servers, those servers can be consolidated onto a small number of machines during times of low load, allowing the other machines to be powered down. There is a great deal of customer interest in this capability, but there is still work to be done; in particular, we need fast guest migration, which is a hard problem to solve.

The other hard problem is the fact that optimal power behavior may make tradeoffs which enterprise customers may be unwilling to make. Performance matters for these people, and, if that means expending more energy, they are willing to pay that cost. As an example, consider the gettimeofday() system call which, while having been ruthlessly optimized, can still be slower than some people would like. Simply reading the processor's time stamp counter (TSC) can be faster. The problem is that the TSC can become unreliable in the presence of power management. Once upon a time, changing the CPU frequency would change the rate of the TSC, but that problem has been solved by the CPU vendors for a few years now. So TSC problems are no longer an excuse to avoid lowering the clock frequency.

Unfortunately, that is not too useful, because it rarely makes sense to run a CPU at a lower frequency; best results usually come from running at full speed and spending more time in a sleep state ("C state"). But C states can stop the TSC altogether, once again creating problems for performance-sensitive users. In response, manufacturers have caused the TSC to run even when the CPU is sleeping. So, while virtualization remains a hassle, systems running on bare metal can expect the TSC to work properly in all power management states.

But that still doesn't satisfy some performance-sensitive users because deep C states create latency. It can take a millisecond to wake a CPU out of a deep sleep - that is a very long time in some applications. We have the pm_qos mechanism which can let the kernel know whether deep sleeps are acceptable at any given time, allowing power management to happen when latency is not an immediate concern. Not a perfect solution, but that may be as good as it gets for now.

Another interesting feature of contemporary CPUs is the "turbo" mode, which can allow a CPU to run in an overclocked mode for a period of time. Using this mode can get work done faster, allowing longer sleeps and better power behavior, but it depends on good power management if it is to work at all. If a core is to run in turbo mode, all other cores on the same die must be in a sleep state. The end result is that turbo mode can give good results for single-threaded workloads.

Some effort is going into powering down unused hardware components - I/O controllers, for example - even though the gains to be had in this area are relatively small. Many systems have quite a few USB ports, many of which are entirely unused. Versions 1 and 2 of the USB specification make powering down those port hard; even worse, those ports will repeatedly wake the CPU even if nothing is plugged in. USB 3 is better in this regard.

Unfortunately, even in this case, it's hard to power down the ports because it is a feature which is poorly specified, poorly documented, and poorly implemented. The reliability of the hardware varies; Windows tends not to use the PCI power management event infrastructure, so it often simply does not work. This problem has been solved by polling the hardware once every second; that is "the least bad thing" they could come up with. The result is better power behavior, but also up to one second of latency before the system responds to the plugging-in of a new USB device. Since, as Matthew noted, that one second is probably less than the user already lost while trying to insert the plug upside-down, it shouldn't be a problem.

Similar things can be done with other types of hardware - firewire ports, audio devices, SD ports, etc. It's just a matter of figuring out how to make it work. There is also some interest in reducing the power consumption of graphics processors (GPUs), even though enterprise systems tend not to have fancy GPUs. The level of support varies from one GPU to the next, but work is being done to improve power consumption for most of them.

Work for the future includes better CPU frequency governor development; we need to do better at ramping up the processor's frequency when there is work to be done. The scheduler needs tweaks to do a better job of consolidating jobs onto one package, allowing others to be powered down. And there is the continued exploitation of other power management features in hardware; there are a lot of them that we are not using. On the other hand, others are not using those features either, so they probably do not work.

In summary: Linux is doing pretty well with regard to enterprise-level power management; the GPU is the only place where we perform worse than Windows does. But we can always do better, so work will continue in that direction.

Comments (8 posted)

Concurrent code and expensive instructions

January 26, 2011

This article was contributed by Paul McKenney

Introduction

Symmetric multiprocessing (SMP) code often requires expensive instructions, including atomic operations and memory barriers, and often causes expensive cache misses. Yet some SMP code can be extremely cheap and fast, using no expensive instructions at all. Examples of cheap SMP code include per-CPU counters and RCU read-side critical sections. So why can't all SMP code be cheap? Is it just that we aren't smart enough to spot clever ways of implementing other algorithms, for example, concurrent stacks and queues? Is it that we might be able to implement concurrent stacks and queues without expensive instructions, but only at the cost of mind-bending complexity? Or is it simply impossible to implement concurrent stacks and queues without using expensive instructions?

My traditional approach has been to place my faith in two observations: (1) if you beat your head against a wall long enough, one of two things is bound to happen, and (2) I have a hard head. Although this approach has worked well, something less painful would be quite welcome. And so it was with great interest that I read a paper entitled "Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot be Eliminated" by Attiya et al., with the “et al.” including Maged Michael, whom I have had the privilege of working with for quite some time.

It is important to note that the title overstates the paper's case somewhat. Yes, the paper does present some laws requiring that many concurrent algorithms use expensive instructions, however, all laws have their loopholes, including the Laws of Order. So while we do need to understand the Laws of Order, we most especially need to understand how to fully exploit their loopholes.

To arrive at the Laws of Order, this paper first expands the definition of commutativity to include sequential composition, which in the C language can best be thought of as the “;” operator. In this case, commutativity depends not just on the operator, but on the operands, which for our purposes can be thought of as calls to arbitrary C functions. For example, the statements:

    atomic_inc(&x); 
    atomic_inc(&y);

are commutative: the values of x and y are the same regardless of the order of execution. In contrast:

    atomic_set(&x, 1); 
    atomic_set(&x, 2);

are non-commutative: the value of x will be either 1 or 2, depending on which executes first.

These examples execute sequentially, but the paper considers concurrent execution. To see this, consider a concurrent set that has these operations:

a set-member-addition function (call it set_add()) that returns an indication of whether the element to be added was already in the set,
a set-member-test function (call it set_member()), and
a set-member-removal function (call it set_remove()) that returns an indication of whether anything had actually been removed.

Then concurrently testing two distinct members is commutative: the order in which set_member(&s, 1) and set_member(&s, 2) execute will not affect the return values from either, and the final value of set s will be the same in either case. Therefore, it is not necessary for the two invocations to coordinate with each other. The fact that coordination is not required means that there is some hope that expensive instructions are not needed to implement set_member().

In contrast, concurrent invocation of set_add(&s, 1) and set_member(&s, 1) would not be commutative if the set s initially did not contain the value 1. The set_member() invocation would return true only if it executed after the set_add(). Some coordination between the two functions is clearly required.

The most important results of the paper rely on a strong form of non-commutativity, which is strangely enough termed “strong non-commutativity”, which we can abbreviate to “SNC”. The example in the previous paragraph is not SNC because, while set_add() can affect set_member(), the reverse is not the case. In contrast, an SNC pair of functions would each affect the other's result. For example, consider set_add(&s, 1) and set_remove(&s, 1), where the set s is initially empty. If set_add(&s, 1) executes first, then both functions will indicate success, and set s will be empty. On the other hand, if set_remove(&s, 1) executes first, then only the set_add(&s, 1) will indicate success and the set s will contain 1. In this case, the return value of set_remove() is affected by the order of execution. On the other hand, if the set s initially contains 1, it will be set_add(&s, 1) whose return value is affected. Therefore, the order of execution can affect the return value of both functions, and these functions are therefore SNC.

Quick Quiz 1: Is atomic_add_return() SNC? In other words, are multiple concurrent calls to this function SNC?
Answer

The key result of the paper is that under certain conditions, the implementation of a pair of SNC functions must contain a heavyweight instruction, where a “heavyweight instruction” can either be an atomic read-modify-write instruction or a heavyweight memory barrier. In the Linux kernel, only smp_mb() qualifies as a heavyweight memory barrier.

The “certain conditions” are:

Both functions in the pair must be deterministic, in other words, the final state (including return values) must be a strict function of the initial state and order of execution.
The functions must be linearizable.

Interestingly enough, although the paper requires that the implementation of an SNC, deterministic, and linearizable pair of functions each contain at least one heavyweight instruction, it does not require that this instruction be executed on each invocation.

Quick Quiz 2: Imagine an increment function that is not permitted to lose counts even when multiple invocations execute concurrently, and that does not return the value of the counter. Must the implementation of such a function contain an atomic read-modify-write instruction or a heavyweight memory barrier?
Answer

So if we want our code to run fast, we have four ways to avoid heavyweight instructions:

Formulate the API to be non-SNC.
Design the implementation so that any required heavyweight instructions almost never need to actually be executed.
Accept non-determinism.
Accept non-linearizability. The paper ignores this possibility, possibly due to the common academic view that non-linearizable algorithms are by definition faulty.

Interestingly enough, relativistic programming has long suggested use of several of these approaches to attain good performance and scalability. The “Laws of Order” therefore provides a good theoretical basis for understanding why relativistic programming is both desirable and necessary.

Let's take a look at some examples, starting with a memory allocator. Given that concurrent calls to kmalloc() are not supposed to return a pointer to the same block of memory, we have to conclude that kmalloc() is SNC and thus might need heavyweight instructions in its implementation.

Quick Quiz 3: How can we avoid the use of heavyweight instructions in the implementation of kmalloc? If it turns out to be impossible to completely avoid their use, how can we reduce the frequency of their execution?
Answer

The second example is of course RCU. Let's focus on the rcu_read_lock(), rcu_read_unlock(), synchronize_rcu(), rcu_dereference() and rcu_assign_pointer() API members. The rcu_read_lock() function is unaffected by any of the other members, so any pair that includes rcu_read_lock() is non-SNC, which is why this function need not include any heavyweight instructions. The same is true of rcu_read_unlock().

Interestingly enough, synchronize_rcu() is affected by both rcu_read_lock() and rcu_read_unlock(), in that the former can prevent synchronize_rcu() from returning and the latter can enable it to return. However, neither rcu_read_lock() nor rcu_read_unlock() is affected by synchronize_rcu(). This means that synchronize_rcu() is non-SNC and might therefore have an implementation that does not use heavyweight instructions. However, such an implementation seems quite implausible if you include the actions of the updater both before and after the call to synchronize_rcu() in conjunction with the RCU read-side critical section. The paper, though, considers only data flowing via those function's arguments and return value. It would be interesting to see a generalization of this work that includes side effects.

My guess is that for a given code fragment to be non-SNC, any conceivable API would need to be non-SNC. If my guess is correct, then the full RCU update is non-SNC with respect to any RCU read-side critical section containing rcu_dereference(). The reasoning is that the return value from rcu_dereference() can be affected by the RCU update, and the duration for which synchronize_rcu() blocks can be affected by rcu_read_lock() and rcu_read_unlock().

Quick Quiz 4: Are there any conditions in which rcu_read_unlock() will be SNC with respect to synchronize_rcu()?
Answer

Finally, let us look at the set implementation that includes set_add(), set_member(), and set_remove(). We saw that set_add() and set_remove() were SNC.

Quick Quiz 5: Is there any way to implement set_add() and set_remove() without using heavyweight instructions?
Answer

Of course, this paper does have a few shortcomings, many of which fall under the rubric of “future work”:

The paper describes the theoretical limitations at great length, but does not describe many ways of avoiding them. However, I am quite confident that the Linux kernel community will be more than able to produce good software engineering solutions that work around these limitations. In fact, there is a lot to be said for letting the theoreticians worry about limitations and letting us hackers worry about solving problems in spite of those limitations.
The paper focuses almost exclusively on reordering carried out by the CPU. It turns out that reordering due to compiler optimizations can be at least as “interesting” as CPU reordering. These sorts of compiler optimizations are allowed by the current C-language standard, which permits the compiler to assume that there is only one thread in the address space. Within the Linux kernel, the barrier() directive restricts the compiler's ability to move code, and this directive (or its open-coded equivalent) is used in locking primitives, atomic operations, and memory barriers.
There is some uncertainty about exactly what properties of code must be SNC for this paper's results to hold. The paper focuses almost exclusively on function arguments and return values, but my guess is that the list of properties is quite general. For example, an unconditional lock-acquisition primitive certainly seems like it should be covered by this paper's result, but such primitives do not return a value. Can the fact that the second of two concurrent acquisitions simply fails to return be considered to be evidence of the SNC nature of lock acquisition? If not, exactly why not? If so, exactly what is the set of effects that must be taken into account when judging whether or not this code fragment is SNC?
This seems to be a future-work topic.
A bit of thought about the results of this paper give clear reasons why it is often so hard to parallelize existing sequential code. Sequential code inflicts no penalties for the use of SNC APIs, so SNC APIs can be expected to appear in sequential code even when a non-SNC API might have served just as well. After all, what programmer could resist the temptation to make set_add() return an indication of whether the element was already in the set? The paper would have done well to state this point clearly.
The paper fails to call out non-linearizability as a valid loophole to its laws of order.
An interesting open question: What are the consequences of using one of the loopholes of the laws of order? In my limited personal experience, leveraging non-linearizability and privatization permits full generality (for example, parallel memory allocators), while leveraging non-SNC and non-determinism results in specialized algorithms (for example, RCU). It would be quite interesting to better understand any theoretical and software-engineering limitations imposed by these loopholes.
The paper overreaches a bit when it states that:

For synthesis and verification of concurrent algorithms, our result is potentially useful in the sense that a synthesizer or a verifier need not generate or attempt to verify algorithms that do not use RAW [smp_mb()] and AWAR [atomic read-modify-write operations] for they are certainly incorrect.
As we have seen, it is perfectly legal for a concurrent algorithm to avoid use of these operations as long as that algorithm is either: (1) non-SNC, (2) non-deterministic, or (3) non-linearizable. There are a few other places where the limitations on the main result are not stated as carefully as they should be. Given that the rest of the paper seems quite accurate and on-point, I would guess that this sentence is simply an honest error that slipped through the peer-review process. We all make mistakes.

Although I hope that these shortcomings will be addressed, I hasten to add that they are insignificant compared to the huge step forward that this paper represents.

In summary, the “Laws of Order” paper shines some much-needed light on the question of whether heavyweight instructions are needed to implement a given concurrent algorithm. Although I am not going to say that this paper fully captures my parallel-programming intuition, I am quite happy that it does land within a timezone or two, which represents a great improvement over previous academic papers. But the really good news is that the limitations called out in this paper have some interesting loopholes that can be exploited in many cases. If the Linux kernel community pays careful attention to both the limitations and the loopholes called out in this paper, I am confident that the community's already-impressive parallel-programming capabilities will become even more formidable.

Acknowledgments

I owe thanks to Maged Michael, Josh Triplett, and Jon Walpole for illuminating discussions and for their review of this paper, and to Jim Wasko for his support of this effort.

Legal Statement

This work represents the view of the author and does not necessarily represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

Answers to Quick Quizzes

Quick Quiz 1: Is atomic_add_return() SNC? In other words, are multiple concurrent calls to this function SNC?

Answer: Yes. Suppose that an atomic_t variable named a is initially zero and that a pair of concurrent atomic_add_return(1, &a) functions execute. The first one to execute will return zero, and the second one will return one. Each instance's return value is therefore affected by the order of execution, which indicates strong non-commutativity.

This may seem strange, given that addition is commutative. And in fact the final value of a will be two regardless of order of execution.

To see the reasoning behind the definition of SNC, consider atomic_inc(&a), which also adds one to a but does not return the initial value. In this case, because there are no return values, the invocations of atomic_inc(&a) cannot possibly affect each others' return values.

Therefore, atomic_inc(&a) is non-SNC.

It is interesting to note that the designers of the Linux kernel's suite of atomic operations had an intuitive understanding of the results of this paper. The atomic operations that return a value (and thus are more likely to be SNC) are the ones that are required to provide full memory ordering.

Back to Quick Quiz 1.

Answer: No. Although this function is deterministic and linearizable, it is non-SNC. And in fact such a function could be implemented via a “split counter” that uses per-CPU non-atomic variables. Because each CPU increments only its own variable, counts are never lost. To get the aggregate value of the counter, simply sum up the individual per-CPU variables.

Of course, it might be necessary to disable preemption and/or interrupts across the increments, but such disabling requires neither atomic read-modify-write instructions nor heavyweight memory barriers.

However, the linearizability of this function depends on the counter always being incremented by the value 1. To see this, imagine a counter with an initial value of zero to which three CPUs are concurrently adding the values 3, 5, and 7, and that meanwhile three other CPUs are reading out the counter's value. Because there are no ordering guarantees, these three other CPUs might see the additions in any order. One of these CPUs might add the per-CPU variables and obtain a sum of 3, another might obtain a sum of 5, and the third might obtain a sum of 7. These three results are not consistent with any possible ordering of the additions, so this counter is not linearizable.

However, for a great many uses, this lack of linearizability is not a problem.

Back to Quick Quiz 2.

Quick Quiz 3: How can we avoid the use of heavyweight instructions in the implementation of kmalloc()? If it turns out to be impossible to completely avoid their use, how can we reduce the frequency of their execution?

Answer: The usual approach is to observe that a given pair of invocations of kmalloc() invocations will be SNC only if it is possible for them to be satisfied by the same block of memory. The usual way to greatly reduce the probability of a pair of kmalloc() invocations fighting over the same block of memory is to maintain per-CPU pools of memory blocks, which is what the Linux kernel's implementation of kmalloc() actually does. Heavyweight instructions are executed only if a given CPU's pool either becomes exhausted or overflows. This approach is related to the paper's suggestion of using “single-owner” algorithms.

It might be possible to avoid heavyweight instructions by introducing non-determinism, for example, by making kmalloc() randomly fail. This can certainly be accomplished by making kmalloc() unconditionally return NULL if the CPU's pool was exhausted, but such an implementation might not prove to be fully satisfactory to its users. Coming up with a reasonable implementation that uses non-determinism to avoid heavyweight instructions is left as an exercise for the adventurous reader.

Similarly, eliminating heavyweight instructions by introducing non-linearizability is left as an exercise for the adventurous reader.

Back to Quick Quiz 3.

Quick Quiz 4: Are there any conditions in which rcu_read_unlock() will be SNC with respect to synchronize_rcu()?

Answer: In some implementations of RCU, synchronize_rcu() can interact directly with rcu_read_unlock() when the grace period has extended too long, either via force_quiescent_state() machinations our via RCU priority boosting. In these implementations, rcu_read_unlock() will be SNC with respect to synchronize_rcu(). The Linux kernel's Preemptible Tree RCU is an example of such an implementation, as can be seen by examining the rcu_read_unlock_special() function in kernel/rcutree_plugin.h. This code executes rarely, thus using the second loophole called out above (“Design the implementation so that any required heavyweight instructions almost never need to actually be executed”).

Back to Quick Quiz 4.

Quick Quiz 5: Is there any way to implement set_add() and set_remove() without using heavyweight instructions?

Answer: This can be done easily for sets containing small integers if there is no linearizability requirement. The set is represented as a dense array of bytes so that each potential member of the set maps to a specific byte. The set_add() function would set the corresponding byte to one, the set_remove() function would clear the corresponding byte to zero, and the set_member() function would test the corresponding byte for non-zero.

This implementation is non-linearizable because different CPUs might well disagree on the order that members were added to and removed from the set.

Back to Quick Quiz 5.

Comments (11 posted)

A tale of two SCSI targets

January 22, 2011

This article was contributed by Goldwyn Rodrigues

At the end of 2010, the LIO project was chosen to replace STGT as the in-kernel SCSI target implementation. There were two main contenders (LIO and SCST) which tried to get their code into the Linux kernel tree. This article will compare the two projects and try to describe what these implementations have to offer.

What are SCSI targets?

The SCSI subsystem uses a sort of client-server model. Typically a computer is the client or "initiator," requesting that blocks be written to or read from a "target," which is usually a data storage device. The SCSI target subsystem enables a computer node to behave as a SCSI storage device, responding to storage requests by other SCSI initiator nodes. This opens up the possibility of creating custom SCSI devices and putting intelligence behind the storage.

An example of an intelligent SCSI target is Data Domain's online backup appliance, which supports de-duplication (thus saving space). The appliance, functioning as a SCSI target, is a computer node which intelligently writes only those blocks which are not already stored, and increases the reference counts of the blocks which are already present, thus writing only the blocks which have changed since the last backup. On the other side of the SCSI link, the initiator sees the appliance as a normal, shared SCSI storage device and uses its regular backup application to write to the target.

The most common implementation of the SCSI target subsystem is an iSCSI server, which uses a standard TCP/IP encapsulation of SCSI to export a SCSI device over the network. Most SCSI target projects started with the idea supporting iSCSI targets before supporting other protocols. Since only a network interface is needed to act as both an iSCSI initiator and an iSCSI target, supporting iSCSI doesn't require any special hardware beyond a network port, which almost every computer has these days. However, most SCSI targets can be supported with existing initiator cards, so if you have a Fibre, SAS, or Parallel SCSI card, it should be possible to use one of the SCSI target projects to make your computer into a SCSI target for the particular SCSI bus supported by the card.

Current Status

The Linux kernel SCSI subsystem currently uses STGT to implement the SCSI target functionality; STGT was introduced into the Linux kernel at the end of 2006 by Fujita Tomonori. It has a library in the kernel which assists the in-kernel target drivers. All target processing happens in user space, which may lead to performance bottlenecks.

Two out-of-tree kernel SCSI target solutions were contenders to replace STGT: LIO and SCST. SCST has been pushing to be included in the Linux kernel since at least 2008. It was decided then that the STGT project could serve the kernel for a little longer. As time passed, the design limitations of STGT were encountered and a replacement sought. The main criteria for a replacement SCSI target subsystem defined by James Bottomley, the SCSI maintainer, were:

That it would be a drop in replacement for STGT (our current in-kernel target mode driver), since there is room for only one SCSI target infrastructure.
That it used a modern sysfs-based control and configuration plane.
That the code was reviewed as clean enough for inclusion.

The first condition proved to be too restrictive; it was not possible to avoid breaking the ABI entirely. So the current goal, instead, is to find a way to gracefully transition STGT users to the new interface.

Hints of LIO replacing the STGT project came in the 2010 Linux Storage and Filesystem Summit. Christoph Hellwig volunteered to review and clean up the code; he managed to reduce the code-base by around 10,000 lines to make it ready to merge into the kernel.

Comparison

Both projects have drawn comparison charts of their feature lists which are available on their respective web sites: LIO and SCST. However, before exploring the differences, lets compare the similarities. Both projects implement an in-kernel SCSI target core. They provide local SCSI targets similar to loop devices, which comes in handy for using targets in virtualized environments. Both projects support iSCSI, which was one of the initial and main motivations for both projects.

Back-storage handlers are available on both projects in kernel space as well as for user space. Back-storage handlers allow target administrators to control how devices are exported to the initiators. For example, a pass-through handler allows exporting the SCSI hardware as it is, instead of masking the details of that hardware, while a virtual-disk handler allows exporting of files as virtual disk to the initiator.

Both projects support Persistent Reservations (PR); a feature for I/O fencing and failover/retakeover of storage devices in high-availability clusters. Using the PR commands, an initiator can establish, preempt, query, or reset a reservation policy with a specified target. During a failover takeover, the new virtual resource can reset the reservation policy of the old virtual resource, making device takeover easier and faster.

SCST

The main users of the SCSI target subsystem are storage companies providing storage solutions to the industry. Most of these storage solutions are plug-and-play appliances which can be attached to the storage network and used with little or no configuration. SCST boasts of a wider user base, which probably comes from the fact that they have wider range of transport support.

SCST supports both Qlogic and Emulex fibre channel cards whereas LIO supports only Qlogic target drives for now, and it is still in its beta stages of development. SCST supports the SCSI RDMA Protocol (SRP), and claims to be ahead in terms of development with respect to Fibre Channel over Ethernet (FCoE), LSI's Parallel/Wide SCSI Fibre Channel, and Serial Attached SCSI (SAS). It already has support for IBM's pSeries Virtual SCSI. Companies such as Scalable Informatics, Storewize, and Open-e have developed PnP appliance products which rely on these target transports based on SCST.

SCST supports notifications of session changes using asynchronous event notification (AEN). AEN is a protocol feature that may be used by SCSI targets to notify a SCSI initiator of events that occur in the target, when the target is not serving a request. This enables initiators to be notified of changes at the target end, such as devices added, removed, resized, or media changes. This way the initiators can see any target changes in a plug-and-play manner.

The SCST developers claim that their design conforms to more SCSI standards in terms of robustness and safety. The SCSI protocol requires that if an initiator clears a reservation held by another initiator, the reservation holder must be notified about the reservation clearance or else several initiators could change reservation data, ultimately corrupting it. SCST is capable of implementing safe RESERVE/RELEASE operations on devices to avoid such corruption.

According to the SCSI protocol, the initiator and target can communicate with each other to decide on the transfer size. An incorrect transfer size communicated by the initiator can lead to target device lockups or a crash. SCST safeguards against miscommunication of transfer sizes or transfer directions to avoid such a situation. The code claims to have a good memory management policy to avoid out-of-memory (OOM) situations. It can also limit the number of initiators that can connect to the target to avoid resource usage by too many connections. It also offers per-portal visibility control, which means that it can be configured in such a way that a target is visible to a particular subset of initiators only.

LIO

The LIO project began with the iSCSI design as its core objective, and created a generic SCSI target subsystem to support iSCSI. Simplicity has been a major design goal and hence LIO is easier to understand. Beyond that, the LIO developers have shown more willingness to work with the kernel developers as James pointed out to SCST maintainer Vladislav Bolkhovitin:

Look, let me try to make it simple: It's not about the community you bring to the table, it's about the community you have to join when you become part of the linux kernel. The interactions in the wider community are critical to the success of an open source project. You've had the opportunity to interact with a couple of them: sysfs we've covered elsewhere, but in the STGT case you basically said, here's our interface, use it. LIO actually asked what they wanted and constructed something to fit. Why are you amazed then when the STGT people seem to prefer LIO?

The LIO project also boasts of features which are either not present in SCST or are in early development phases. For example, LIO supports asymmetric logical unit assignment (ALUA). ALUA allows a target administrator to manage the access states and path attributes of the targets. This allows the multipath routing method to select the best possible path to optimize usage of available bandwidth, depending on the current access states of the targets. In other words, the path taken by the initiator in a multipath environment can be manipulated by target administrator by changing the access states.

LIO supports Management Information Base (MIB) which makes management of SCSI devices simpler. The SCSI target devices export management information values described in SCSI MIB RFC-4455 which is picked up by an SNMP agent. This feature extends to iSCSI devices and is beneficial in managing a storage network with multiple SCSI devices.

An error in the iSCSI connection can happen at three different levels: the session, digest, or connection level. Error recovery can be initiated at each of these levels, which makes sure that the recovery is made at the current level, and the error does not pass through to the next one. Error recovery starts with detecting a broken connection. In reponse, the iSCSI initiator driver establishes another TCP connection to the target, then it informs the target that the SCSI command path is being changed to the new TCP connection. The target can then continue processing SCSI commands on the new TCP connection. The upper level SCSI driver remains unaware that a new TCP connection has been established and that control has been transferred to the new connection. The iSCSI Session remains active during the period and does not have to be reinstated. LIO supports a maximum Error Recovery Level (ERL) of 2, which means that it can recover errors at the session, digest, or connection levels. SCST supports an ERL of 0, which means it can recover from session-level errors only and that all connection oriented errors are communicated to the SCSI driver.

LIO also supports "multiple connections per session" (MC/S). MC/S allows the initiator to open multiple connections between the initiator and target, either on the same or a different physical link. Hence, in case of a failure of one path, the established session can use another path without terminating the session. MC/S can also be used for load balancing across all established connections. Architectural session command ordering is preserved across those communication paths.

The LIO project also claims that its code is used in a number of appliance products and deployments though the user base does not seem to be as varied as that of SCST.

No comparison can be complete without a performance comparison. SCST developers have released their performance numbers from time to time. However, all their numbers were compared against STGT. The SCST comparison page speaks of SCST performing better than LIO, but the results were drawn on source-code study and not using real-world tests. SCST blames LIO for not releasing performance numbers, and there exist no performance data (to my knowledge) which would compare apples to apples.

The decision has finally been made, though, with quite a bit of opposition. Now comes the task of getting all the niche features which LIO lacks to be ported from SCST to LIO. While the decision was contentious, it is yet another example of the difficulty of getting something merged without being able to cooperate with the kernel development community.

Comments (8 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.38-rc2 ?

Architecture-specific

Daniel Walker Nexus One Support ?

Catalin Marinas ARM: Add support for the Large Physical Address Extensions ?

Akinobu Mita Introduce little-endian bitops ?

Sebastian Andrzej Siewior Device tree on x86, part v3 ?

Core kernel code

Rik van Riel directed yield for Pause Loop Exiting ?

Development tools

Ahmed S. Darwish [PATCH 0/2][concept RFC] x86: BIOS-save kernel log to disk upon panic ?

Device drivers

riyer@nvidia.com input: tegra-kbc - Add tegra keyboard driver ?

Po-Yu Chuang net: add Faraday FTMAC100 10/100 Ethernet driver ?

halli manjunatha TI WL 128x FM V4L2 driver ?

Filesystems and block I/O

Tejun Heo block: reimplement FLUSH/FUA to support merge ?

Jens Axboe On-stack explicit block queue plugging ?

Darrick J. Wong Refactor barrier=/nobarrier flags from fs to block layer ?

Janitorial

Arnd Bergmann Proposal for remaining BKL users ?

Memory management

Balbir Singh Unmapped page cache control (v3) ?

Jeremy Fitzhardinge Add apply_to_page_range_batch() and use it ?

Peter Zijlstra mm: Preemptibility -v7 ?

Security-related

Roberto Sassu eCryptfs: added support for the encrypted key type ?

Jari Ruusu Announce loop-AES-v3.6a file/swap crypto package ?

Virtualization and containers

Naoya Horiguchi HWPOISON for hugepage backed KVM guest ?

Glauber Costa New Proposal for steal time in KVM ?

Miscellaneous

Borislav Petkov RAS daemon v4 ?

Karel Zak util-linux v2.19-rc2 ?

Page editor: Jonathan Corbet
Next page: Distributions>>