Brief items
The current development kernel is 2.6.38-rc2,
released on January 21.
"
Anyway. -rc2 is out there, and the only reason it's reasonably sized
is that it was a short -rc2 - I think a few of the pull requests I got were
a bit larger than I would have been happy with. And I might as well warn
people that because the laptop I'm bringing with me is pitifully slow, I'm
also planning on going into 'anal' mode, and not even bother pulling from
trees unless they are clearly -rc material. IOW, don't try to push large
pushes on me. I won't take them, and they can wait for 39." See the
announcement for the short changelog, or
the
full changelog for all the details.
Stable updates: there have been no stable or longterm updates
released in the last week.
Comments (none posted)
We care about everything. If the objective was to make life easier
for ourselves, we'd all be on the golf course.
--
Andrew Morton
No way in hell do I want the situation of "the system is screwed,
so let's overwrite the disk" to be something the kernel I release
might do. It's crazy.
--
Linus Torvalds
Nice to see it gone - it seemed such a good idea in Linux 1.3
--
Alan Cox won't miss the BKL
Comments (3 posted)
By Jonathan Corbet
January 26, 2011
The removal of the big kernel lock (BKL) has been on the kernel
community's "to do" list almost since that lock was first added to make the
kernel work on multiprocessor systems. Over time, the significance of the
lock has diminished as finer-grained locking was added to various kernel
subsystems, but the BKL itself has endured. Getting rid of it for good
remained desirable because the BKL can still cause unwanted latencies at
times. There's also a certain amount of pride involved in completing the
job. That completion has been long in coming, though; once the worst
performance issues associated with the BKL were resolved, interest in doing
the low-level work needed to finish the job declined.
Two years or so ago, though, developers started working on BKL removal
again. Some of this work was motivated by the realtime tree, where
patience with latency sources is rather more limited. Still, it seemed
like completion remained a distant goal; hundreds of BKL call sites
remained in the kernel.
Then Arnd Bergmann took on the task of eliminating the BKL entirely. His
cleanup work has been going on for some time; if he has his way, this patch set (or something derived from it)
will remove the BKL entirely in 2.6.39. To get there, about a dozen
modules need to be addressed. Some of them (i830, autofs3, and smbfs) are
simply to be removed. Others (appletalk and hpfs) are to be moved to the
staging tree for near-term removal, though there is some resistance to that
idea. The remaining modules are to be fixed in some way. Once that's taken
care of, the final patch in the series
removes the lock itself. It will not be missed.
Comments (24 posted)
Kernel development news
By Jonathan Corbet
January 26, 2011
Power management is often seen as a concern mostly for embedded and mobile
systems. They worry about power management because we want our phones
to run for longer between recharges and our laptops to not inflict burns on
our thighs. But power management is equally important for data centers,
which are currently responsible for about 3% of the total power consumption
in the US. Keeping the net going in the US requires about
15TW 15GW of power -
the dedicated output of about 15 nuclear power plants. Clearly there would
be some real value to saving some of that power. Matthew Garrett's
talk during the Southern Plumbers Miniconf at linux.conf.au 2011 covered
the work that is being done in that area and where Linux stands relative to
other operating systems.
Much of the power consumed by data centers is not directly controllable by
Linux - it is overhead which is consumed outside of the computers
themselves. About one watt of power is consumed by overhead for each watt
consumed by computation. This overhead includes network infrastructure
and power supply loss, but the biggest component is air conditioning. So
the obvious thing to do here is to create more efficient cooling and power
infrastructure. Running at higher ambient temperatures, while
uncomfortable for humans, can also help. The best contemporary data
centers have been able to reduce their overhead to about 20% - a big
improvement. Cogeneration techniques - using heat from data centers to
warm buildings, for example - can reduce that overhead even further.
But we still have trouble. A 48-core system, Matthew says, will draw about
350W when it is idle; a rack full of such systems will still pull a lot of
power. What can be done? Most power management attention has been focused
on the CPU, which is where a lot of the power goes. As a result, an idle
Intel CPU now draws approximately zero watts of power - it is "terrifying"
how well it works. When the CPU is working, though, the situation is a bit
different; the power consumption is about 20W per core, or about 960W for a
busy 48-core system.
The clear implication is that we should keep the CPUs idle whenever
possible. That can be tricky, though; it is hard to write software which
does nothing. Or - as Matthew corrected himself - it's hard to write
useful software which does nothing.
There are some trends which can be pointed to in this area. CPU power
management is essentially solved; Linux is quite good at it. In fact,
Linux is better than any other operating system with regard to CPU power;
we have more time in deep idle states and fewer wakeups than others. So
interest is shifting toward memory power management. If all of the CPUs in
a package within the system can be idled, the associated memory controller
will go idle as well. It's also possible to put memory into "self-refresh"
mode if it is idle, reducing power use while preserving the contents. In
other situations, running memory at a lower clock rate can reduce power
usage. There will be a lot of work in this area because, at this point,
memory looks like the biggest, lowest-hanging fruit.
Even more power can be saved by simply turning a system off; that is where
virtualization comes into play. If applications are run on virtualized
servers, those servers can be consolidated onto a small number of machines
during times of low load, allowing the other machines to be powered down.
There is a great deal of customer interest in this capability, but there is
still work to be done; in particular, we need fast guest migration, which
is a hard problem to solve.
The other hard problem is the fact that optimal power behavior may make
tradeoffs which enterprise customers may be unwilling to make. Performance
matters for these people, and, if that means expending more energy, they
are willing to pay that cost.
As an example, consider the gettimeofday() system call which,
while having been ruthlessly optimized, can still be slower than some
people would like. Simply reading the processor's time stamp counter (TSC)
can be faster. The problem is that the TSC can become unreliable in the
presence of power management. Once upon a time, changing the CPU frequency
would change the rate of the TSC, but that problem has been solved by the
CPU vendors for a few years now. So TSC problems are no longer an excuse
to avoid lowering the clock frequency.
Unfortunately, that is not too useful, because it rarely makes sense to run
a CPU at a lower frequency; best results usually come from running at full
speed and spending more time in a sleep state ("C state"). But C
states can stop the TSC altogether, once again creating problems for
performance-sensitive
users. In response, manufacturers have caused the TSC to run even when the
CPU is sleeping. So, while virtualization remains a hassle, systems
running on bare metal can expect the TSC to work properly in all power
management states.
But that still doesn't satisfy some performance-sensitive users because
deep C states create latency. It can take a millisecond to wake a CPU out
of a deep sleep - that is a very long time in some applications. We have
the pm_qos mechanism which can let the
kernel know whether deep sleeps are acceptable at any given time, allowing
power management to happen when latency is not an immediate concern. Not a
perfect solution, but that may be as good as it gets for now.
Another interesting feature of contemporary CPUs is the "turbo" mode, which
can allow a CPU to run in an overclocked mode for a period of time. Using
this mode can get work done faster, allowing longer sleeps and better power
behavior, but it depends on good power management if it is to work at all.
If a core is to run in turbo mode, all other cores on the same die must be
in a sleep state. The end result is that turbo mode can give good results
for single-threaded workloads.
Some effort is going into powering down unused hardware components - I/O
controllers, for example - even though the gains to be had in this area are
relatively small. Many systems have quite a few USB ports, many of which
are entirely unused. Versions 1 and 2 of the USB specification
make powering down those port hard; even worse, those ports will repeatedly
wake the CPU even if nothing is plugged in. USB 3 is better in this
regard.
Unfortunately, even in this case, it's hard to power down the ports because
it is a feature which is poorly specified, poorly documented, and poorly
implemented. The reliability of the hardware varies; Windows tends not to
use the PCI power management event infrastructure, so it often simply does
not work. This problem has been solved by polling the hardware once every
second; that is "the least bad thing" they could come up with. The result
is better power behavior, but also up to one second of latency before the
system responds to the plugging-in of a new USB device. Since, as Matthew
noted, that one second is probably less than the user already lost while
trying to insert the plug upside-down, it shouldn't be a problem.
Similar things can be done with other types of hardware - firewire ports,
audio devices, SD ports, etc. It's just a matter of figuring out how to
make it work. There is also some interest in reducing the power
consumption of graphics processors (GPUs), even though enterprise systems
tend not
to have fancy GPUs. The level of support varies from one GPU to the next,
but work is being done to improve power consumption for most of them.
Work for the future includes
better CPU frequency governor development; we need to do better at ramping
up the processor's frequency when there is work to be done. The scheduler
needs tweaks to do a better job of consolidating jobs onto one package,
allowing others to be powered down. And there is the continued
exploitation of other power management features in hardware; there are a
lot of them that we are not using. On the other hand, others are not using
those features either, so they probably do not work.
In summary: Linux is doing pretty well with regard to enterprise-level
power management; the GPU is the only place where we perform worse than
Windows does. But we can always do better, so work will continue in that
direction.
Comments (8 posted)
January 26, 2011
This article was contributed by Paul McKenney
Introduction
Symmetric multiprocessing (SMP) code often requires expensive
instructions, including atomic
operations and memory barriers, and often causes expensive cache
misses.
Yet some SMP code can be extremely cheap and fast, using no expensive
instructions at all.
Examples of cheap SMP code
include per-CPU counters and RCU read-side critical sections.
So why can't all SMP code be cheap?
Is it just that we aren't smart enough to spot clever ways of implementing
other algorithms, for example, concurrent stacks and queues?
Is it that we might be able to implement concurrent stacks and queues
without expensive instructions, but only at the cost of mind-bending
complexity?
Or is it simply impossible to implement concurrent stacks and queues
without using expensive instructions?
My traditional approach has been to place my faith
in two observations: (1) if you beat your head against a wall long
enough, one of two things is bound to happen, and (2) I
have a hard head.
Although this approach has worked well, something less painful would
be quite welcome.
And so it was with great interest that I read a paper
entitled "Laws of
Order: Expensive Synchronization in
Concurrent Algorithms Cannot be Eliminated" by Attiya et al., with
the “et al.” including Maged Michael, whom I have had
the privilege of working with for quite some time.
It is important to note that the title overstates the paper's
case somewhat.
Yes, the paper does present some laws requiring that many concurrent
algorithms use expensive instructions,
however, all laws have their loopholes,
including the Laws of Order.
So while we do need to understand the Laws of Order, we most especially
need to understand how to fully exploit their loopholes.
To arrive at the Laws of Order, this paper first expands
the definition of commutativity to include
sequential composition, which in the C language can best be thought
of as the “;” operator.
In this case, commutativity depends not just on the operator, but on
the operands, which for our purposes can be thought of as calls
to arbitrary C functions.
For example, the statements:
atomic_inc(&x);
atomic_inc(&y);
are commutative: the values of x and y are
the same regardless of the order of execution.
In contrast:
atomic_set(&x, 1);
atomic_set(&x, 2);
are non-commutative:
the value of x will be either 1 or 2, depending on which
executes first.
These examples execute sequentially, but the paper considers
concurrent execution.
To see this, consider a concurrent set that has these operations:
- a set-member-addition
function (call it
set_add())
that returns an indication of whether the element to be
added was already in the set,
- a set-member-test function (call it
set_member()), and
- a set-member-removal function (call it
set_remove())
that returns an indication of whether anything had actually been
removed.
Then concurrently testing two distinct members is commutative: the
order in which set_member(&s, 1) and
set_member(&s, 2)
execute will not affect the return values from either, and the final
value of set s will be the same in either case.
Therefore, it is not necessary for the two invocations to
coordinate with each other.
The fact that coordination is not required means that there is some hope
that expensive instructions are not needed to implement
set_member().
In contrast, concurrent invocation of set_add(&s, 1)
and set_member(&s, 1)
would not be commutative if the set s initially
did not contain the value 1.
The set_member() invocation would return true only if it
executed after the set_add().
Some coordination between the two functions is clearly required.
The most important results of the paper rely on a strong form
of non-commutativity, which is strangely enough termed
“strong non-commutativity”, which we can abbreviate
to “SNC”.
The example in the previous paragraph is not SNC because, while
set_add() can affect set_member(),
the reverse is not the case.
In contrast, an SNC pair of functions would each affect the other's
result.
For example, consider set_add(&s, 1) and
set_remove(&s, 1), where the set s is initially
empty.
If set_add(&s, 1) executes first, then both functions
will indicate success, and set s will be empty.
On the other hand, if set_remove(&s, 1) executes first,
then only the set_add(&s, 1) will indicate success
and the set s will contain 1.
In this case, the return value of set_remove() is affected
by the order of execution.
On the other hand, if the set s initially contains 1,
it will be set_add(&s, 1) whose return value is affected.
Therefore, the order of execution can affect the return value of both
functions, and these functions are therefore SNC.
Quick Quiz 1:
Is atomic_add_return() SNC? In other words, are multiple
concurrent calls to this function SNC?
Answer
The key result of the paper is that under certain conditions,
the implementation of
a pair of SNC functions must contain a heavyweight instruction,
where a “heavyweight instruction” can either be an
atomic read-modify-write instruction or a heavyweight memory barrier.
In the Linux kernel, only smp_mb() qualifies as
a heavyweight memory barrier.
The “certain conditions” are:
- Both functions in the pair must be
deterministic, in other words,
the final state (including return values) must be a strict
function of the initial state and order of execution.
- The functions must be
linearizable.
Interestingly enough, although the paper requires that the implementation
of an SNC, deterministic, and linearizable pair of functions each
contain at least one heavyweight instruction, it does
not require that
this instruction be executed on each invocation.
Quick Quiz 2:
Imagine an increment function that is not permitted to lose counts
even when multiple invocations execute concurrently, and that does
not return the value of the counter.
Must the implementation of such a function contain an
atomic read-modify-write instruction or a heavyweight memory barrier?
Answer
So if we want our code to run fast, we have four ways to avoid
heavyweight instructions:
- Formulate the API to be non-SNC.
- Design the implementation so that any required heavyweight
instructions almost never need to actually be executed.
- Accept non-determinism.
- Accept non-linearizability. The paper ignores
this possibility, possibly due to the common academic
view that non-linearizable algorithms are by
definition faulty.
Interestingly enough,
relativistic programming
has long suggested use of several of these approaches to attain
good performance and scalability.
The “Laws of Order” therefore provides a good theoretical
basis for understanding why relativistic programming is both
desirable and necessary.
Let's take a look at some examples, starting with a memory allocator.
Given that concurrent calls to kmalloc() are not supposed
to return a pointer to the same block of memory, we have to conclude that
kmalloc() is SNC and thus might need heavyweight instructions
in its implementation.
Quick Quiz 3:
How can we avoid the use of heavyweight instructions in the implementation
of kmalloc?
If it turns out to be impossible to completely avoid their use, how can
we reduce the frequency of their execution?
Answer
The second example is of course RCU.
Let's focus on the rcu_read_lock(),
rcu_read_unlock(),
synchronize_rcu(),
rcu_dereference() and
rcu_assign_pointer() API members.
The rcu_read_lock() function is unaffected by any of the
other members, so any pair that includes rcu_read_lock()
is non-SNC, which is why this function need not include any heavyweight
instructions.
The same is true of rcu_read_unlock().
Interestingly enough, synchronize_rcu() is affected
by both rcu_read_lock() and rcu_read_unlock(),
in that the former can prevent synchronize_rcu() from
returning and the latter can enable it to return.
However, neither rcu_read_lock() nor
rcu_read_unlock() is affected by synchronize_rcu().
This means that synchronize_rcu() is non-SNC and might therefore
have an implementation that does not use heavyweight instructions.
However, such an implementation seems quite implausible if you include
the actions of the updater both before and after the call to
synchronize_rcu() in conjunction with the RCU
read-side critical section.
The paper, though, considers only data flowing via
those function's arguments and return value.
It would be interesting to see a generalization of this work that
includes side effects.
My guess is that for a given code fragment to be non-SNC, any conceivable
API would need to be non-SNC.
If my guess is correct, then the full RCU update is non-SNC with respect
to any RCU read-side critical section containing rcu_dereference().
The reasoning is that the return value from rcu_dereference() can
be affected by the RCU update, and the duration for which
synchronize_rcu() blocks can be affected by
rcu_read_lock() and rcu_read_unlock().
Quick Quiz 4:
Are there any conditions in which rcu_read_unlock() will
be SNC with respect to synchronize_rcu()?
Answer
Finally, let us look at the set implementation that includes
set_add(),
set_member(), and
set_remove().
We saw that set_add() and set_remove()
were SNC.
Quick Quiz 5:
Is there any way to implement set_add() and
set_remove() without using heavyweight instructions?
Answer
Of course, this paper does have a few shortcomings, many of
which fall under the rubric of “future work”:
- The paper describes the theoretical limitations at great length,
but does not describe many ways of avoiding them.
However, I am quite confident that the Linux kernel community will be
more than able to produce good
software engineering solutions that work around
these limitations.
In fact, there is a lot to be said for letting the theoreticians
worry about limitations and letting us hackers worry about
solving problems in spite of those limitations.
- The paper focuses almost exclusively on reordering carried out
by the CPU.
It turns out that reordering due to compiler optimizations can be
at least as “interesting” as CPU reordering.
These sorts of compiler optimizations are allowed by the
current C-language standard, which permits the
compiler to
assume that there is only one thread in the address space.
Within the Linux kernel, the
barrier() directive
restricts the compiler's ability to move code, and this
directive (or its open-coded equivalent) is used in locking
primitives, atomic operations, and memory barriers.
- There is some uncertainty about exactly what properties of code
must be SNC for this paper's results to hold.
The paper focuses almost exclusively on function arguments and return
values, but
my guess is that the list of properties is quite general.
For example, an unconditional lock-acquisition primitive certainly
seems like it should be covered by this paper's result,
but such primitives do not return a value.
Can the fact that the second of two concurrent acquisitions
simply fails to return be considered to be evidence of the
SNC nature of lock acquisition?
If not, exactly why not?
If so, exactly what is the set of effects that must be
taken into account when judging whether or not this code fragment
is SNC?
This seems to be a future-work topic.
- A bit of thought about the results of this paper
give clear reasons why it is
often so hard to
parallelize existing sequential code.
Sequential code inflicts no penalties for the use of SNC
APIs, so SNC APIs can be expected to appear in sequential
code even when a non-SNC API might have served just as well.
After all, what programmer could resist the
temptation to make
set_add() return an indication
of whether the element was already in the set?
The paper would have done well to state this point clearly.
- The paper fails to call out non-linearizability as a valid
loophole to its laws of order.
- An interesting open question: What are the consequences of
using one of the loopholes of the laws of order?
In my limited personal experience, leveraging non-linearizability
and privatization permits full generality (for example,
parallel memory allocators), while leveraging non-SNC and
non-determinism results in specialized algorithms
(for example, RCU).
It would be quite interesting to better understand any
theoretical and software-engineering limitations imposed
by these loopholes.
- The paper overreaches a bit when it states that:
For synthesis and verification of concurrent algorithms,
our result is potentially useful in the sense that a
synthesizer or a verifier need not generate or attempt
to verify algorithms that do not use RAW
[smp_mb()]
and AWAR [atomic read-modify-write operations] for
they are certainly incorrect.
As we have seen, it is perfectly legal for a concurrent algorithm
to avoid use of these operations as long as that algorithm is either:
(1) non-SNC, (2) non-deterministic, or (3) non-linearizable.
There are a few other places where the limitations on the
main result are not stated as carefully as they should be.
Given that the rest of the paper seems quite accurate and
on-point, I would guess that this sentence is simply an
honest error that slipped through the peer-review process.
We all make mistakes.
Although I hope that these shortcomings will be addressed, I
hasten to add that they are insignificant compared to the huge
step forward that this paper represents.
In summary, the “Laws of Order” paper shines some
much-needed
light on the question of whether heavyweight instructions are needed to
implement a given concurrent algorithm.
Although I am not going to say that this paper fully captures my
parallel-programming intuition, I am quite happy that it does
land within a timezone or two, which represents a great improvement
over previous academic papers.
But the really good news is that the limitations called out in this
paper have some interesting
loopholes that can be exploited in many cases.
If the Linux kernel community pays careful attention to both the
limitations and the loopholes called out in this paper,
I am confident that the community's
already-impressive parallel-programming capabilities will
become even more formidable.
I owe thanks to Maged Michael, Josh Triplett, and Jon Walpole for illuminating
discussions and for their review of this paper, and to Jim Wasko
for his support of this effort.
This work represents the view of the author and does not necessarily
represent the view of IBM.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or
service marks of others.
Quick Quiz 1:
Is atomic_add_return() SNC? In other words, are multiple
concurrent calls to this function SNC?
Answer: Yes.
Suppose that an atomic_t variable named a
is initially zero
and that a pair of concurrent atomic_add_return(1, &a)
functions execute.
The first one to execute will return zero, and the second one will
return one.
Each instance's return value is therefore affected by the order of
execution, which indicates strong non-commutativity.
This may seem strange, given that addition is commutative.
And in fact the final value of a will be two
regardless of order of execution.
To see the reasoning behind the definition of SNC,
consider atomic_inc(&a),
which also adds one to a but does not return the
initial value.
In this case, because there are no return values, the invocations
of atomic_inc(&a) cannot possibly affect each others'
return values.
Therefore, atomic_inc(&a) is non-SNC.
It is interesting to note that the designers of the Linux
kernel's suite of atomic operations had an intuitive understanding
of the results of this paper.
The atomic operations that return a value (and thus are more likely
to be SNC) are the ones that are required to provide
full memory ordering.
Back to Quick Quiz 1.
Quick Quiz 2:
Imagine an increment function that is not permitted to lose counts
even when multiple invocations execute concurrently, and that does
not return the value of the counter.
Must the implementation of such a function contain an
atomic read-modify-write instruction or a heavyweight memory barrier?
Answer:
No.
Although this function is deterministic and linearizable, it is non-SNC.
And in fact such a function could be implemented via a “split
counter” that uses per-CPU non-atomic variables.
Because each CPU increments only its own variable, counts are never
lost.
To get the aggregate value of the counter, simply sum up the individual
per-CPU variables.
Of course, it might be necessary to disable preemption and/or
interrupts across the increments, but such disabling requires
neither atomic read-modify-write instructions nor heavyweight memory
barriers.
However, the linearizability of this function depends
on the counter always being incremented by the value 1.
To see this, imagine a counter with an initial value of zero
to which three CPUs are concurrently adding the
values 3, 5, and 7, and that meanwhile three other CPUs are
reading out the counter's value.
Because there are no ordering guarantees, these three other CPUs
might see the additions in any order.
One of these CPUs might add the per-CPU variables and obtain a sum
of 3, another might obtain a sum of 5, and the third might obtain
a sum of 7.
These three results are not consistent with any possible ordering
of the additions, so this counter is not linearizable.
However, for a great many uses, this lack of linearizability
is not a problem.
Back to Quick Quiz 2.
Quick Quiz 3:
How can we avoid the use of heavyweight instructions in the implementation
of kmalloc()?
If it turns out to be impossible to completely avoid their use, how can
we reduce the frequency of their execution?
Answer:
The usual approach is to observe that a given pair of invocations of
kmalloc() invocations will be SNC only if it is possible
for them to be satisfied by the same block of memory.
The usual way to greatly reduce the probability of a pair of
kmalloc() invocations fighting over the same block of
memory is to maintain per-CPU pools of memory blocks, which is what
the Linux kernel's implementation of kmalloc() actually
does.
Heavyweight instructions are executed only if a given CPU's pool either
becomes exhausted or overflows.
This approach is related to the paper's suggestion of using
“single-owner” algorithms.
It might be possible to avoid heavyweight instructions by
introducing non-determinism, for example, by making kmalloc()
randomly fail.
This can certainly be accomplished by making kmalloc()
unconditionally return NULL if the CPU's pool was
exhausted, but such an implementation might not prove to be fully
satisfactory to its users.
Coming up with a reasonable implementation that uses non-determinism
to avoid heavyweight instructions is left as an exercise for the
adventurous reader.
Similarly, eliminating heavyweight instructions by introducing
non-linearizability is left as an exercise for the adventurous reader.
Back to Quick Quiz 3.
Quick Quiz 4:
Are there any conditions in which rcu_read_unlock() will
be SNC with respect to synchronize_rcu()?
Answer:
In some implementations of RCU, synchronize_rcu() can
interact directly with rcu_read_unlock() when the
grace period has extended too long, either via
force_quiescent_state()
machinations our via
RCU priority boosting.
In these implementations, rcu_read_unlock() will
be SNC with respect to synchronize_rcu(). The
Linux kernel's Preemptible Tree RCU
is an example of such an implementation, as can be seen by examining
the rcu_read_unlock_special() function in
kernel/rcutree_plugin.h.
This code executes rarely, thus using the second loophole called out
above (“Design the implementation so that any required heavyweight
instructions almost never need to actually be executed”).
Back to Quick Quiz 4.
Quick Quiz 5:
Is there any way to implement set_add() and
set_remove() without using heavyweight instructions?
Answer:
This can be done easily for sets containing small integers if
there is no linearizability requirement.
The set is represented as a dense array of bytes so that each potential
member of the set maps to a specific byte.
The set_add() function would set the corresponding byte to
one, the set_remove() function would clear the corresponding
byte to zero, and the set_member() function would
test the corresponding byte for non-zero.
This implementation is non-linearizable because different CPUs
might well disagree on the order that members were added to and removed
from the set.
Back to Quick Quiz 5.
Comments (9 posted)
January 22, 2011
This article was contributed by Goldwyn Rodrigues
At the end of 2010,
the LIO
project was chosen to replace STGT as the in-kernel SCSI target
implementation. There were two main contenders (LIO and SCST) which tried
to get their code into the Linux kernel tree. This article will compare
the two projects and try to describe what these implementations have to offer.
What are SCSI targets?
The SCSI subsystem uses a sort of client-server model. Typically a computer
is the client or "initiator," requesting that blocks be written to or read
from a "target," which is usually a data storage device. The SCSI target
subsystem enables a computer node to behave as a SCSI storage device, responding
to storage requests by other SCSI initiator nodes. This opens up the possibility
of creating custom SCSI devices and putting intelligence behind the storage.
An example of an intelligent SCSI target is Data Domain's
online backup appliance, which supports de-duplication (thus saving
space). The appliance, functioning as a SCSI target, is a computer node which
intelligently writes only those blocks which are not already stored, and
increases the reference counts of the blocks which are already present, thus
writing only the blocks which have changed since the last backup. On the
other side of the SCSI link,
the initiator sees the appliance as a normal, shared SCSI storage device and
uses its regular backup application to write to the target.
The most common implementation of the SCSI target subsystem is an iSCSI
server, which uses a standard TCP/IP encapsulation of SCSI to export a SCSI
device over the network. Most SCSI target projects started with the idea
supporting iSCSI
targets before supporting other protocols. Since only a network interface is needed to act as both an iSCSI
initiator and an iSCSI target, supporting iSCSI doesn't require any
special hardware beyond a network port, which almost every computer has
these days. However, most SCSI
targets can be supported with existing initiator cards, so if you have a
Fibre, SAS, or Parallel SCSI card, it should be possible to use one of
the SCSI target projects to make your computer into a SCSI target for
the particular SCSI bus supported by the card.
Current Status
The Linux kernel SCSI subsystem currently uses STGT to implement the
SCSI target functionality; STGT was introduced into the Linux kernel at the
end of 2006 by
Fujita Tomonori. It has a library in the kernel which assists the in-kernel
target drivers. All target processing happens in user space, which may
lead to performance bottlenecks.
Two out-of-tree kernel SCSI target solutions were contenders to replace
STGT: LIO and SCST. SCST has been pushing to
be included in the Linux kernel since at least 2008. It was decided
then that
the STGT project could serve the kernel for a little longer. As time passed, the
design limitations of STGT were encountered and a replacement sought. The
main criteria for a replacement SCSI target subsystem
defined by James Bottomley, the SCSI maintainer, were:
- That it would be a drop in replacement for STGT (our current in-kernel
target mode driver), since there is room for only one SCSI target
infrastructure.
- That it used a modern sysfs-based control and configuration plane.
- That the code was reviewed as clean enough for inclusion.
The first condition proved to be too restrictive; it was not possible to
avoid breaking the ABI entirely. So the current goal, instead, is to find
a way to gracefully transition STGT users to the new interface.
Hints of LIO replacing the STGT project came in
the 2010 Linux Storage and Filesystem
Summit. Christoph Hellwig volunteered to review and clean
up the code; he managed to reduce the code-base by around 10,000 lines to
make it
ready to merge into the kernel.
Comparison
Both projects have drawn comparison charts of their feature lists which are
available on their respective web sites: LIO and SCST. However, before
exploring the differences, lets compare the similarities. Both projects
implement an in-kernel SCSI target core. They provide local SCSI targets
similar to loop devices, which comes in handy for using targets in virtualized
environments. Both projects support iSCSI, which was one of the initial and main
motivations for both projects.
Back-storage handlers are available on both projects in kernel space as
well as for user space. Back-storage handlers allow target administrators
to control how devices are exported to the initiators. For example, a
pass-through handler allows exporting the SCSI hardware as it is, instead
of masking the details of that hardware, while a virtual-disk handler
allows exporting of files as virtual disk to the initiator.
Both projects support Persistent Reservations (PR); a feature for I/O
fencing and failover/retakeover of storage devices in high-availability
clusters. Using the PR commands, an initiator can establish, preempt,
query, or reset a reservation policy with a specified target. During a
failover takeover, the new virtual resource can reset the reservation
policy of the old virtual resource, making device takeover easier and
faster.
SCST
The main users of the SCSI target subsystem are storage companies providing
storage solutions to the industry. Most of these storage solutions are
plug-and-play appliances which can be attached to the storage network and
used with little or no configuration. SCST boasts of a wider user base, which
probably comes from the fact that they have wider range of transport support.
SCST supports both Qlogic and Emulex fibre channel cards whereas LIO
supports only Qlogic target drives for now, and it is still in its beta stages of
development. SCST supports the SCSI RDMA Protocol
(SRP), and claims to be ahead in terms of
development with respect to Fibre
Channel over Ethernet (FCoE), LSI's
Parallel/Wide SCSI Fibre Channel, and Serial Attached SCSI
(SAS). It already has
support for IBM's
pSeries Virtual SCSI. Companies such as Scalable Informatics, Storewize, and Open-e
have developed PnP appliance products which rely on these target
transports based on SCST.
SCST supports notifications of session changes using asynchronous event
notification (AEN). AEN is a protocol feature that may be used
by SCSI targets to notify a SCSI initiator of events that occur in the target,
when the target is not serving a request. This enables
initiators to be notified of changes at the target end, such as devices added,
removed, resized, or media changes. This way the initiators can see any target
changes in a plug-and-play manner.
The SCST developers claim that their design conforms to more SCSI standards
in terms of robustness
and safety. The SCSI protocol requires that if an initiator clears a
reservation
held by another initiator, the reservation holder must be notified about the
reservation clearance or else several initiators could change reservation data,
ultimately corrupting it. SCST is capable of implementing safe RESERVE/RELEASE
operations on devices to avoid such corruption.
According to the SCSI protocol, the initiator and target can communicate with
each other to decide on the transfer size. An incorrect transfer size
communicated by the initiator can lead to target device lockups or a crash.
SCST safeguards against miscommunication of transfer sizes or transfer
directions to avoid such a situation. The code claims to have a good memory
management policy to avoid out-of-memory (OOM) situations. It can also limit the
number of initiators that can connect to the target to avoid resource usage by
too many connections. It also offers per-portal visibility control, which
means that it can be configured in such a way that a target is visible to a
particular subset of initiators only.
LIO
The LIO project began with the iSCSI design as its core objective, and
created a generic SCSI target subsystem to support iSCSI. Simplicity has
been a major
design goal and hence LIO is easier to understand. Beyond that, the LIO
developers have shown more willingness to work with the kernel developers
as James pointed out
to SCST maintainer Vladislav Bolkhovitin:
Look, let me try to make it simple: It's not about the community you
bring to the table, it's about the community you have to join when you
become part of the linux kernel. The interactions in the wider
community are critical to the success of an open source project. You've
had the opportunity to interact with a couple of them: sysfs we've
covered elsewhere, but in the STGT case you basically said, here's our
interface, use it. LIO actually asked what they wanted and constructed
something to fit. Why are you amazed then when the STGT people seem to
prefer LIO?
The LIO project also boasts of features which are either not present in SCST
or are in early development phases. For example, LIO supports asymmetric
logical unit
assignment (ALUA). ALUA allows a target administrator to manage the access
states and path attributes of the targets. This allows the multipath routing
method to select the best possible path to optimize usage of available
bandwidth, depending on the current access states of the targets. In other
words, the path taken by the initiator in a multipath environment can be
manipulated by target administrator by changing the access states.
LIO supports Management
Information Base (MIB) which makes management of SCSI
devices simpler. The SCSI target devices export management
information values described in SCSI MIB RFC-4455 which is
picked up by an SNMP agent. This feature extends to iSCSI devices and is
beneficial in managing a storage network with multiple SCSI devices.
An error in the iSCSI connection can happen at three different levels: the
session, digest, or connection level. Error recovery can be initiated
at each of these levels, which makes sure that the recovery is made at the
current level, and the error does not pass through to the next one. Error
recovery starts with detecting a broken connection. In reponse, the iSCSI
initiator driver
establishes another TCP connection to the target, then it informs the
target that the SCSI command path is being changed to
the new TCP connection. The target can then continue processing
SCSI commands on the new TCP connection. The upper level SCSI driver remains
unaware that a new TCP connection has been established and that control has
been transferred to the new connection. The iSCSI Session remains active during
the period and does not have to be reinstated. LIO supports a maximum Error
Recovery Level (ERL) of 2, which means that it can recover errors at the
session, digest, or connection levels. SCST supports an ERL of 0, which
means
it can recover from session-level errors only and that all connection
oriented errors are communicated to the SCSI driver.
LIO also supports "multiple connections per session" (MC/S). MC/S allows the
initiator to open multiple connections between the initiator and target, either
on the same or a different physical link. Hence, in case of a failure of
one path,
the established session can use another path without terminating the session.
MC/S can also be used for load balancing across all established connections.
Architectural session command ordering is preserved across those communication
paths.
The LIO project also claims that its code is used in a number of appliance products and
deployments though the user base does not seem to be as varied as that of
SCST.
No comparison can be complete without a performance comparison. SCST developers
have released their performance numbers from time to time. However, all their
numbers were compared against STGT. The SCST comparison
page speaks of SCST performing better than LIO, but the results were drawn on
source-code study and not using real-world tests. SCST blames LIO for not
releasing performance numbers, and there exist no performance data (to my
knowledge) which would compare apples to apples.
The decision has finally been made, though, with quite a bit of opposition.
Now comes the task of getting all the niche features which LIO lacks to
be ported from SCST to LIO. While the decision was contentious, it is yet
another example of the difficulty of getting something merged without
being able to cooperate with the kernel development community.
Comments (8 posted)
Patches and updates
Kernel trees
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>