Brief items
The current development kernel is 2.6.32-rc1,
released by Linus on
September 27. Note that Linus fat-fingered the version number in the
makefile, so this kernel thinks it's -rc2. See the separate article,
below, for the list of significant changes added at the end of the merge
window.
The current stable kernel is 2.6.31.1, released (along with 2.6.27.35 and 2.6.30.8) on September 24.
These updates contain a number of important fixes, some of which are
security-related.
The "Anti-Malware" industry is just snake oil anyway. I think the
proper approach to support it is just to add various no-op exports
claim to do something and all the people requiring anti-virus on
Linux will be just as happy with it.
-- Christoph Hellwig
Oh nnooooooooooooo-(takes breath)-ooooooo...
Damn. I hadn't realized. I'm a moron.
Ok, so it's an extra-special -rc1. It's the "short bus" kind of special
-rc1 release.
-- Linus Torvalds
By Jonathan Corbet
September 30, 2009
The 2.6.32 merge window closed on September 27 with the 2.6.32-rc1 release; this merge
window ran a little longer than usual to make up for the distractions of
LinuxCon and the Linux Plumbers Conference. Changes merged since last week's
update include:
- The 9p (Plan9) filesystem has been updated to make use of the FS-cache
caching layer.
- Control group hierarchies can now have names bound to them.
- The fcntl() system call supports new F_SETOWN_EX and
F_GETOWN_EX operations. They differ from F_SETOWN
and F_GETOWN in that they direct SIGIO signals to a specific
thread within a multi-threaded application.
- The HWPOISON subsystem
has been merged.
- Framebuffer compression support has been added for Intel graphics
chipsets. Compression reduces the amount of work involved in driving
the display, leading to a claimed 0.5 watt reduction in power
consumption. A set of tracepoints has also been added to the Intel
graphics driver.
- There are new drivers for
ADP5588 I2C QWERTY Keypad and IO Expander devices,
OpenCores keyboard controllers,
Atmel AT42QT2160 touch sensor chips,
MELFAS MCS-5000 touchscreen controllers,
Maxim MAX7359 key switch controllers,
ARM "tightly-coupled memory" areas,
Palm Tungsten|C handheld systems,
Iskratel Electronics XCEP boards,
EMS CPC-USB/ARM7 CAN/USB interfaces,
Broadcom 43xx-based SDIO devices,
Avionic Design Xanthos watchdog and backlight devices,
WM831x PMIC backlight devices,
Samsung LMS283GF05 LCDs,
Analog Devices ADP5520/ADP5501 MFD PMIC backlight devices, and
WM831x PMIC status LEDs.
- The proc_handler function prototype, used in sysctl handling,
has lost its unused struct file argument.
In the end, 8742 non-merge changesets were incorporated in the 2.6.32 merge
window.
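To make the new fcntl() operations from the list above a little more concrete, here is a minimal sketch of how an application might point SIGIO at one particular thread. The helper names (current_tid(), direct_sigio_to_thread(), query_sigio_owner()) are made up for illustration; F_SETOWN_EX, F_GETOWN_EX, F_OWNER_TID, and struct f_owner_ex are the interface as exposed by glibc when _GNU_SOURCE is defined.

```c
#define _GNU_SOURCE          /* exposes F_SETOWN_EX and struct f_owner_ex in glibc */
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the kernel thread ID of the calling thread. */
pid_t current_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/*
 * Ask the kernel to direct SIGIO for fd to one specific thread.
 * Returns 0 on success, -1 with errno set on failure.
 */
int direct_sigio_to_thread(int fd, pid_t tid)
{
    struct f_owner_ex owner = { .type = F_OWNER_TID, .pid = tid };

    assert(fd >= 0);
    return fcntl(fd, F_SETOWN_EX, &owner);
}

/* Read the current owner back; returns 0 on success and fills *owner. */
int query_sigio_owner(int fd, struct f_owner_ex *owner)
{
    assert(owner != NULL);
    return fcntl(fd, F_GETOWN_EX, owner);
}
```

Setting the owner only chooses the recipient; signal-driven I/O still has to be enabled on the descriptor (with O_ASYNC) before any SIGIO is generated.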
By Jonathan Corbet
September 30, 2009
Last week's quotes of the
week included a complaint from Andrew Morton about the replacement of
the writeback code in 2.6.32. According to Andrew, a bunch of critical
code had been redone, replacing a well-tested implementation with new code
without any hard justification. It's a complaint which should be taken
seriously; replacing the writeback code has the potential to introduce
performance regressions for specific workloads. It should not be done
without a solid reason.
Chris Mason has tried to provide that
justification with a combination of benchmark results and
explanations. The benchmarks show a clear - and large - performance
improvement from the use of per-BDI writeback. That is good, but does not,
by itself, justify the switch to per-BDI writeback; Andrew had suggested
that the older code was slower as the result of performance regressions
introduced over time by other changes. If the 2.6.31 code could be fixed, the
performance improvement could be (re)gained without replacing the entire
subsystem.
What Chris is saying is that the old, per-CPU pdflush method could not be
fixed. The fundamental problem with pdflush is that it would back off when
the backing device appeared to be congested. But congestion is easy to
cause, and no other part of the system backs off in the same way. So
pdflush could end up not doing writeback for significant periods of time.
Forcing all other writers to back off in the face of congestion could
improve things, but that would be a big change which doesn't address the
other problem: congestion-based backoff can defeat attempts by filesystem
code and the block layer to write large, contiguous segments to disk.
As it happens, there is a more general throttling mechanism already built
into the block layer: the finite number of outstanding requests allowed for
any specific device. Once requests are exhausted, threads generating block
I/O operations are forced to wait until request slots become free again.
Pdflush cannot use this mechanism, though, because it must perform
writeback to multiple devices at once; it cannot block on request
allocation. A per-device writeback thread can block there, though,
since it will not affect I/O to any other device. The per-BDI patch
creates these per-device threads and, as a result, it is able to keep
devices busier. That, it seems, is why the old writeback code needed to be
replaced instead of patched.
By Jonathan Corbet
September 30, 2009
Tracepoints are proving to be increasingly useful as system development and
diagnostic tools. There is one question about tracepoints, though, which
has not yet gotten a real answer: do tracepoints constitute a user-space
ABI? If so, some serious constraints come into play. An ABI, once
exposed, cannot be changed in a way which might break applications.
Tracepoints, being tightly tied to the kernel code they instrument, are
inherently hard to keep stable. If a tracepoint cannot be modified or
removed, it will make modifications to the surrounding code harder.
In the worst case, ABI-preservation
requirements could block the incorporation of important kernel changes - an
outcome which could quickly sour developers on the tracepoint idea as a
whole.
Arjan van de Ven's TRACE_EVENT_ABI patch is an
attempt to bring some clarity to the situation. For now, it just defines a
tracepoint in exactly the same way as TRACE_EVENT; the difference
is that it is meant to create a tracepoint which can be relied upon as part
of the kernel ABI. Such tracepoints should continue to exist in future
kernel releases, and the format of the associated trace information will
not change in application-breaking ways. What that means in practice is
that no fields would be deleted, and any new fields would be added at the
end.
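The "append only" rule is easy to picture with a pair of record layouts. What follows is a sketch under stated assumptions, not the way TRACE_EVENT actually lays out its data: the wakeup_event structures and parse_wakeup_v1() are hypothetical, and stand in for a tracepoint's record format and for an application built against the older format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record layout as first exported to user space. */
struct wakeup_event_v1 {
    uint32_t pid;
    uint32_t prio;
};

/* A later kernel may only append; existing fields keep their offsets. */
struct wakeup_event_v2 {
    uint32_t pid;
    uint32_t prio;
    uint32_t target_cpu;   /* new field, added at the end */
};

/*
 * An "old" consumer built against v1: it reads only the prefix it knows
 * about and ignores any trailing bytes. Returns 0 on success, -1 if the
 * record is too short to contain the v1 fields.
 */
int parse_wakeup_v1(const void *record, size_t len, struct wakeup_event_v1 *out)
{
    assert(record != NULL && out != NULL);
    if (len < sizeof(*out))
        return -1;
    memcpy(out, record, sizeof(*out));
    return 0;
}
```

Because nothing is deleted or reordered, the old parser keeps working when handed a newer, longer record; removing or moving a field is exactly the kind of change that would break it.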
Whether this approach will work remains to be seen. The word from Linus in
the past has been that kernel ABIs are created by applications which rely on an
interface, rather than any specific marking on the interface itself. So if people start
using applications which expect to be able to use a specific tracepoint,
that tracepoint may be set in cement regardless of whether it was defined with
TRACE_EVENT_ABI. This macro would thus be a good guide to the kernel
developers' intent, but it can make no guarantee that only specially-marked
tracepoints will be subject to ABI stability requirements.
Kernel development news
September 30, 2009
This article was contributed by Valerie Aurora (formerly Henson)
Soft updates, a method of
maintaining on-disk file system consistency through carefully ordering
writes to disk, have only been implemented once in a production operating
system (FreeBSD). You can argue about exactly why they have not been
implemented elsewhere, and in Linux in particular, but my theory is that
not enough file systems geniuses exist in the world to write and maintain
more than one instance of soft updates.
Chris Frost, a graduate student
at UCLA, agrees with the too-complicated-for-mere-mortals theory. That's
why in 2006, he and several co-conspirators at UCLA wrote the
Featherstitch system for
keeping on-disk data consistent.
Featherstitch is a generalization of the soft updates system of write
dependencies and rollback data. The resulting system is general
enough that most (possibly all) other file system consistency strategies
(e.g., journaling) can be efficiently implemented on top of the
Featherstitch interface. What makes Featherstitch unique among file
systems consistency techniques is that it exports a safe, efficient,
non-blocking mechanism to userland applications that lets them group
and order writes without using fsync() or relying on file
system-specific behavior (like ext3 data=ordered mode).
Featherstitch basics: patches, dependencies, and undo data
What is Featherstitch, other than something file system aficionados
throw in your face whenever you complain about soft updates being too
complicated? Featherstitch grew out of soft updates and has a lot in
common with that approach architecturally. The main difference between
Featherstitch and soft updates is that the latter implements each
file system operation individually with a different specialized set of
data structures specific to the FFS file system, while Featherstitch
generalizes the concept of a set of updates to different blocks and
creates one data structure and write-handling mechanism shared by all
file system operations. As a result, Featherstitch is easier to
understand and implement than soft updates.
Featherstitch records all changes to the file system in "patches" (the
dearth of original terminology in software development strikes again).
A patch includes the block number, a linked list of patches that this
patch depends on, and the "undo data." The undo data is a byte-level
diff of the changes made to this block by this patch, including the
offset, length, and contents of the range of bytes overwritten by this
change. Another version of a patch is optimized for bit-flip changes,
like those made to block bitmaps. The rule for writing patches out to
storage is simple: If any of the patches this patch depends on - its
dependencies - aren't confirmed to be written to disk, this patch
can't be written yet.
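As a rough illustration of the description above, a patch and its write rule might look something like the following. This is a sketch only: the structure and field names are invented here and are not taken from the Featherstitch source, and the bit-flip variant is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct patch;

/* One node in a patch's linked list of dependencies. */
struct dep {
    struct patch *on;      /* the patch that must reach disk first */
    struct dep   *next;
};

/* Illustrative layout only; field names are not Featherstitch's own. */
struct patch {
    uint64_t       block;      /* block number this patch modifies */
    struct dep    *deps;       /* patches this one depends on */
    size_t         offset;     /* start of the overwritten byte range */
    size_t         length;     /* length of that range */
    const uint8_t *undo;       /* previous contents of the range */
    bool           on_disk;    /* confirmed written to storage */
};

/* A patch may be written only once every dependency is on disk. */
bool patch_can_write(const struct patch *p)
{
    assert(p != NULL);
    for (const struct dep *d = p->deps; d != NULL; d = d->next)
        if (!d->on->on_disk)
            return false;
    return true;
}
```

A patch with an empty dependency list is always writable; everything else has to wait for confirmation from the storage layer.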
In other words, patches and dependencies look a lot like a generic
directed acyclic graph (DAG), with patches as the circles and
dependencies as the arrows. If you are a programmer, you've probably
drawn hundreds or thousands of these pictures in your life. Just
imagine a little diff hanging off each circle and you've got a good
mental model for thinking about Featherstitch. The interesting bits
are around reducing the number of little circles - in the first
implementation, the memory used by Featherstitch undo data was often
twice that of the actual changes written to disk.
For example, untarring a 220MB kernel source tree
allocated about 460MB of undo data.
The acyclic-ness of Featherstitch patch dependencies deserves a little
more attention. It is the caller's responsibility to avoid creating
circular patch dependencies in the first place; Featherstitch doesn't
detect or attempt to fix them. (The simplified interface exported to
userspace makes cycles impossible to create at all; more about that
later.) However, lack of circular dependencies among
patches does not imply a lack of circular dependencies between blocks.
Patches are a record of a change to a block and each block can have
multiple outstanding patches against it. Imagine a chain of patch
dependencies: patch A depends on patch B, which depends on patch C. That
is, A->B->C, where "->" reads as "depends
on." If patch A applies to block 1, and patch B applies to block 2,
and patch C applies to block 1, then viewing the blocks and their
outstanding patches as a whole, you have a circular dependency where
block 1 must be written before block 2, but block 2 must also be
written before block 1. This is called a "block-level cycle" and it
causes most of the headaches in a system based on write ordering.
The way both soft updates and Featherstitch resolve block-level cycles is
by keeping enough information about each change to roll it back. When it
comes time to write a block, any applied patches which can't be written yet
(because their dependencies haven't been written yet) are rolled back using
their undo data. In our example, with A->B->C and A and
C both applied to block 1, we would roll back A on block 1, write block 1
with patch C applied, write B's block, and then write block 1 a second time
with both patch A and C applied.
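The rollback step can be sketched in a few lines. The code below is a toy model, not Featherstitch code: the names are invented, blocks are eight bytes long, and it assumes that the patches on a block touch non-overlapping byte ranges, so that reverting one of them cannot disturb another.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8
#define MAX_RANGE  4

/* A byte-range change to one in-memory block, plus the bytes it replaced. */
struct range_patch {
    size_t  offset;
    size_t  length;
    uint8_t new_data[MAX_RANGE];
    uint8_t undo[MAX_RANGE];
};

/* Apply the patch to the cached block, saving the old bytes as undo data. */
void patch_apply(uint8_t *block, struct range_patch *p)
{
    assert(p->length <= MAX_RANGE && p->offset + p->length <= BLOCK_SIZE);
    memcpy(p->undo, block + p->offset, p->length);
    memcpy(block + p->offset, p->new_data, p->length);
}

/* Roll the patch back by restoring the saved bytes. */
void patch_revert(uint8_t *block, const struct range_patch *p)
{
    memcpy(block + p->offset, p->undo, p->length);
}

/*
 * Produce the image that may go to disk now: the cached block with one
 * not-yet-writable patch reverted. The cached copy itself is left intact.
 */
void block_image_without(const uint8_t *cached, const struct range_patch *held,
                         uint8_t *image)
{
    memcpy(image, cached, BLOCK_SIZE);
    patch_revert(image, held);
}
```

In terms of the A->B->C example, block_image_without() with A as the held patch yields the first write of block 1 (C applied, A rolled back); the cached copy still contains both changes and is what gets written the second time, after B's block is on disk.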
Optimization
The first version of Featherstitch was elegant, general purpose, easy
to understand, and extraordinarily inefficient. On several
benchmarks, the original implementation allocated over twice as much
memory for patches and undo data as needed for the actual new data
itself. The system became CPU-bound with as few as 128 blocks in the
buffer cache.
The first goal was to reduce the number of patches needed to complete
an operation. In many cases, a patch will never be reverted - for
example, if we write to a file's data block when no other writes are
outstanding on the file system, then there is no reason we'd ever have
to roll back to the old version of the block. In this case,
Featherstitch creates a "hard patch" - a patch that doesn't keep any
undo data. The next optimization is to merge patches when they can
always be written together without violating any dependencies. A
third optimization merges overlapping patches in some cases. All of
these patch reduction techniques hinge on the Featherstitch rules for
creating patches and dependencies, in particular that a patch's
dependencies must be specified at creation time. Some opportunities
for merging can be detected at patch creation time, others when a
patch commits and is removed from the queue.
The second major goal was to efficiently find patches ready for
writing. A normal buffer cache holds several hundred thousand blocks,
so any per-block data structures and algorithms must be extremely
efficient. Normally, the buffer cache just has to, in essence, walk a
list of dirty blocks and issue writes on them in some reasonably
optimal manner. With Featherstitch, it can find a dirty block, but
then it has to walk its list of patches checking to see if there is a
subset whose dependencies have been satisfied and are therefore ready
for writing. This list can be long, and it can turn out that none of
the patches are ready, in which case it has to give up and go on to
the next block. Rather than randomly searching in the cache,
Featherstitch instead keeps a list of patches that are ready to be
written. When a patch has committed, the list of patches that
depended on it is traversed and newly ready patches added to the list.
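The ready-list idea can be sketched as follows. This is one plausible way to do it, assumed for illustration rather than taken from the Featherstitch code: each patch keeps a count of unmet dependencies and a small array of the patches waiting on it, so that a commit only touches the patches it can actually unblock.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPENDENTS 4

/* Illustrative node: counts unmet dependencies instead of rescanning them. */
struct rpatch {
    int            unmet;                        /* dependencies not yet on disk */
    struct rpatch *dependents[MAX_DEPENDENTS];   /* patches waiting on this one */
    size_t         ndependents;
    struct rpatch *next_ready;                   /* link in the ready list */
};

/* Singly linked list of patches whose dependencies are all satisfied. */
struct ready_list {
    struct rpatch *head;
};

void ready_push(struct ready_list *rl, struct rpatch *p)
{
    p->next_ready = rl->head;
    rl->head = p;
}

struct rpatch *ready_pop(struct ready_list *rl)
{
    struct rpatch *p = rl->head;
    if (p != NULL)
        rl->head = p->next_ready;
    return p;
}

/* Record that "before" must reach disk ahead of "after". */
void add_dependency(struct rpatch *after, struct rpatch *before)
{
    assert(before->ndependents < MAX_DEPENDENTS);
    before->dependents[before->ndependents++] = after;
    after->unmet++;
}

/* Called when p is confirmed on disk: queue any dependents that became ready. */
void patch_committed(struct ready_list *rl, struct rpatch *p)
{
    for (size_t i = 0; i < p->ndependents; i++) {
        struct rpatch *d = p->dependents[i];
        if (--d->unmet == 0)
            ready_push(rl, d);
    }
}
```

The writer then never has to scan blocks whose patches are all still blocked; it simply pops from the ready list until the list is empty.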
With these optimizations, the memory overhead of Featherstitch dropped
from 200+% to 4-18% in the set of benchmarks used for evaluation -
still high, but in the realm of practicality. The optimizations
described above were only partially implemented in some cases, leaving
more room for improvement without any further insight.
Performance
For head-to-head performance comparisons, the authors implemented
several versions of file system consistency using the Featherstitch
patch interface and compared them to the ext2 and ext3 equivalents.
Using ext2 as the on-disk file system format, they re-implemented soft
updates, metadata-only journaling, and full data/metadata journaling.
Metadata-only journaling corresponds to ext3's data=writeback mode (file data is written without
regard to the state of file system metadata that refers to it) and
full journaling corresponds to ext3's data=journal mode (all
file data is written directly to the journal along with file system
metadata).
The benchmarks used were extraction of a ~200MB tar file (the kernel
source code, natch), deletion of the results of the previous extraction, a Postmark
run, and a modified Andrew file system benchmark - in other words, the
usual motley crew of terrible, incomplete, unrepresentative file
system benchmarks we always run because there's nothing better
available. The deficiency shows: under this workload, ext3 performed
about the same in data=writeback and data=ordered mode (not usually the case in real-world
systems), which is one of the reasons the authors didn't implement
ordered mode for Featherstitch. The overall performance result was
that the Featherstitch implementations were on par with or somewhat better
than the comparable ext3 version for elapsed time, but used
significantly more CPU time.
Patchgroups: Featherstitch for userspace
So, you can use Featherstitch to re-implement all kinds of file system
consistency schemes - soft updates, copy-on-write, journaling of all
flavors - and it will go about as fast as the old version while using up
more of your CPU. When you have big new features like checksums and
snapshots in btrfs, it's hard to get excited about an under-the-covers
re-implementation of file system internals. It's cool, but no one but
file systems developers will care, right?
In my opinion, the most exciting application of Featherstitch is not
in the kernel, but userland. In short, Featherstitch exports an
interface that applications can use to get the on-disk consistency
results they want, AND keep most of the performance benefits that come
with re-ordering and delaying writes. Right now, applications have
only two practical choices for controlling the order of changes to the
file system: Wait for all writes to a file to complete using fsync(),
or rely on file system-specific implementation details, like
ext3 data=ordered mode. Featherstitch gives you a third
option: Describe the exact, minimal ordering relationship between
various file system writes and then let the kernel re-order, delay,
and otherwise optimize the writes as much as possible within those
constraints.
The userland interface is called "patchgroups." The interface
prevents the two major pitfalls that usually accompany exporting a
kernel-level consistency mechanism to userspace. First, it prevents
deadlocks caused by dependency cycles ("Hey, kernel! Write A depends
on write B! And, oh yeah, write B depends on write A! Have a nice
day!"). In the kernel, you can define misuse of the interface as a
kernel bug, but if an application screws up a dependency, the whole
kernel grinds to a halt. Second, it prevents applications from
stalling their own or other writes by opening a transaction and
holding it open indefinitely while it adds new changes to the
transaction (or goes off into an infinite loop, or crashes, or
otherwise fails to wrap up its changes in a neat little bow).
The patchgroups interface simply says that all of those writes over
there must be on-disk before any of these writes over here can start
being written to disk. Any other writes that happen to be going on
outside of these two sets can go to disk in any order they please, and
the writes inside each set are not ordered with respect to each other
either. Here's a pseudo-code example of using patchgroups:
/* Atomic update of a file using patchgroups */
/* Create a patch group to track the creation of the new copy of the file */
copy_pg = pg_create();
/* Tell it to track all our file system changes until pg_disengage() */
pg_engage(copy_pg);
/* Open the source file, get a temporary filename, etc. */
/* Create the temp file */
temp_fd = creat();
/* Copy the original file data to the temp file and make your changes */
/* All changes done, now wrap up this patchgroup */
pg_disengage(copy_pg);
The temp file now contains the new version of the file, and all of
the related file system changes are part of the current patchgroup.
Now we want to put the following rename() in a separate patchgroup
that depends on the patchgroup containing the new version of the
file.
/* Start a new patchgroup for the rename() */
rename_pg = pg_create();
pg_engage(rename_pg);
/*
 * MAGIC OCCURS HERE: This is where we tell the system that the
 * rename() can't hit disk until the temporary file's changes have
 * committed. If you don't have patchgroups, this is where you would
 * fsync() instead. fsync() can also be thought of as:
 *
 *     pg_depend(all previous writes to this file, this_pg);
 *     pg_sync(this_pg);
 */
/* This new patchgroup, rename_pg, depends on the copy_pg patchgroup */
pg_depend(copy_pg, rename_pg);
/* This rename() becomes part of the rename_pg patchgroup */
rename();
/* All set! */
pg_disengage(rename_pg);
/* Cleanup. */
pg_close(copy_pg);
pg_close(rename_pg);
Short version: No more "Firefox fsync()" bug, O_PONIES for
everyone who wants them and very little cost for those who don't.
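For comparison, here is roughly what the same atomic update looks like today, without patchgroups, using fsync() as the ordering point. It is a minimal sketch: atomic_replace() and its parameters are names chosen for this example, error handling is kept simple, and the temporary file is assumed to live on the same file system as the target.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Replace the file at "path" with "len" bytes from "data" so that a crash
 * leaves either the old or the new contents. "tmp_path" must be on the same
 * file system as "path". Returns 0 on success, -1 on failure.
 */
int atomic_replace(const char *path, const char *tmp_path,
                   const void *data, size_t len)
{
    assert(path != NULL && tmp_path != NULL);

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* The big hammer: wait until the new data is on disk. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Only now is it safe to make the new version visible. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

The fsync() call stalls the application until the data is on disk, even though all it really wants is an ordering guarantee; that stall is what the pg_depend() call in the pseudo-code replaces. A fully durable version would also fsync() the containing directory after the rename().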
Conclusion
Featherstitch is a generalization and simplification of soft updates,
with reasonable, but not stellar, performance and overhead.
Featherstitch really shines when it comes to exporting a useful, safe
write ordering interface for userspace applications. It replaces the
enormous performance-destroying hammer of fsync() with a
minimal and elegant write grouping and ordering mechanism.
When it comes to the Featherstitch paper itself, I highly recommend
reading the entire paper simply for the brief yet accurate summaries
of complex storage-related issues. Sometimes I feel like I'm reading
the distillation of three hours of the Linux
Storage and File Systems Workshop plus another couple of weeks of
mailing list discussion, all in one paragraph. For example, section 7
describes, in a most extraordinarily succinct manner, the options for
correctly flushing a disk's write cache, including specific commands,
both SCSI and ATA, and a brief summary of the quality of hardware
support for these commands.
By Jonathan Corbet
September 28, 2009
Prior to the Eleventh Real Time Linux Workshop in Dresden, Germany, a small
group met to discuss the further development of the realtime preemption
work for the Linux kernel. This "mini-summit" covered a wide range of
topics, but was driven by a straightforward set of goals: the continuing
improvement of realtime capabilities in Linux and the merging of the
realtime preemption patches into the mainline.
The participants were:
Stefan Assmann,
Jan Blunck,
Jonathan Corbet,
Sven-Thorsten Dietrich,
Thomas Gleixner,
Darren Hart,
John Kacur,
Paul McKenney,
Ingo Molnar,
Oleg Nesterov,
Steven Rostedt,
Frederic Weisbecker,
Clark Williams, and
Peter Zijlstra.
Together they represented several companies working in the area of realtime
Linux; they brought a lot of experience with customer needs to the table.
The discussion was somewhat unstructured - no formal agenda existed - but
a lot of useful topics were covered.
Threaded interrupt handlers came up early in the discussion. This
feature was merged into the mainline for the 2.6.30 kernel; it is useful in
realtime situations because it allows interrupt handlers to be prioritized
and scheduled like any other process.
There is one part of the threaded interrupt code which remains outside of
the mainline: the piece which forces all drivers to use threaded
handlers. There are no plans to move that code into the mainline; instead,
it's going to be a matter of persuasion to get driver writers to switch to
the newer way of doing things.
Uptake in the mainline is small so
far; few drivers are actually using this feature. That is beginning to
change, though; the SCSI layer is one example. SCSI has always featured
relatively heavyweight interrupt-handling code and work done in
single-threaded workqueues. This code could move fairly
naturally to process context; the SCSI developers are said to be evaluating
a possible move toward threaded interrupt handlers in the near future.
There have also been suggestions that the network stack
might eventually move in that direction.
System management interrupts (SMIs) are a very different sort of
problem. These interrupts happen at a very low level in the hardware and
are handled by the BIOS code. They often perform hardware monitoring
tasks, from simple thermal monitoring to far more complex operations not
normally associated with BIOS-level software. SMIs are almost entirely
invisible to the operating system and are
generally not subject to control at that level, but they are visible in
some important ways: they monopolize anything between one CPU and all CPUs
in the system for a measurable period of time, and they can change
important parameters like the system clock rate. SMIs on some types of
hardware can run for surprisingly long periods; one vendor sells systems
where an SMI for managing ECC memory runs for 200µs every three
minutes. That is long enough to play havoc with any latency deadlines
that the operating system is trying to meet.
Dealing with the SMI problem is a challenge. Some hardware allows SMIs to
be disabled, but it's never clear what the consequences of doing so might
be; if the CPU melts into a puddle of silicon, the resulting latencies will
be even worse than before. Sharing information about SMI problems can be
hard because
many of the people working in this area are working under non-disclosure
agreements with the hardware vendors; this is unfortunate, because some
vendors have done a far better job of avoiding SMI-related latencies than
others. There is a tool now (hwlat_detector)
which can measure SMI latency, so we should start seeing more
publicly-posted information on this issue. And, with luck, vendors will
start to deal with the problem.
Not all hardware latency is caused by SMIs; hypervisors, too, can be a
significant source of latency problems.
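One simple way to look for stalls of this kind is to read a clock in a tight loop and watch for unexpectedly large gaps between consecutive readings. The sketch below does that from user space; it is not how hwlat_detector is implemented, the function names are invented, and a user-space loop cannot tell an SMI or hypervisor pause apart from ordinary preemption by the scheduler.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict modes */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Poll the clock back to back for "samples" iterations and return the
 * largest gap seen between two consecutive readings, in nanoseconds.
 */
uint64_t max_clock_gap_ns(unsigned long samples)
{
    assert(samples > 0);
    uint64_t worst = 0;
    uint64_t prev = now_ns();

    for (unsigned long i = 0; i < samples; i++) {
        uint64_t cur = now_ns();
        if (cur - prev > worst)
            worst = cur - prev;
        prev = cur;
    }
    return worst;
}
```

Run at a high realtime priority on an otherwise idle, isolated CPU, a gap far above the usual loop time suggests that something outside the operating system's control took the processor away.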
A related issue is hardware changes imposed by SMI handlers. If the BIOS
determines that the system is overheating, it may respond by slowing the
clock rate or lowering the processor voltage. On a throughput-oriented
system, that may well be the right thing to do. When latencies are
important, though, slowing the processor could be a mistake - it could
cause applications to miss their deadlines. A better response might be to
simply shut down some processors while keeping others at full speed. What
is really needed here is a way to get this information to user space so
that policy decisions can be made there.
Testing is always an issue in this kind of software development; how
do the developers know that they are really making things better? There
are various test suites out there (RTMB, for example),
but there is no complete and integrated test suite.
There was some talk of trying to move more of the realtime testing code into the
Linux Test Project, but LTP is a huge body of code. So the realtime tests
might remain on their own, but it would be nice, at least, to standardize
test options and output formats to help with the automation of testing.
XML output from test programs is favored by some, but it is fair to say
that XML is not universally loved in this crowd.
The big kernel lock is a perennial outstanding issue for realtime
development for a couple of reasons. One is that, despite having been
pushed out of much of the core code, the BKL can still create long
latencies. The other is that elimination of the BKL would appear to be
part of the price for an eventual merge of sleeping spinlocks into the
mainline kernel. The ability to preempt code running under the BKL was
removed in 2.6.26; this change was directly motivated by a performance
regression caused by the semaphore rewrite, but it was also seen as a way
to help inspire BKL-removal efforts by those who care about latencies.
Much of the hard work in getting rid of the BKL has been done; one big
outstanding piece is the conversion of reiserfs being done by Frederic
Weisbecker. After that, what's left is a lot of grunt work: figuring out
what (if anything) is protected by a lock_kernel() call and putting in
proper locking. The "tip" tree has a branch (rt/kill-the-bkl) where this
work can be coordinated and collected.
Signal delivery is still not an entirely solved problem. Actually,
signals are always a problem, for implementers and users alike. In the
realtime context, signal delivery has some specific latency issues. Signal
delivery to thread groups involves an O(n) algorithm to determine which
specific thread to target; getting through this code can create excessive
latencies. There are also some locks in the delivery path
which interfere with the delivery of signals in realtime interrupt
context.
Everybody agrees that the proper solution is to avoid signals in
applications whenever possible. For example, timerfd() can be
used for timer events. But everybody also agrees that applications will
continue to use signals, so they have to be made to work somehow. The
probable solution is to remove much of the work from the immediate signal
delivery path. Signal delivery would just enqueue the information and set
a bit in the task structure; the real work would then be done in the
context of the receiving process. That work might still be expensive, but
it would at least fall to the process which is actually using signals
instead of imposing latencies on random parts of the system.
A side discussion on best practices for efficient realtime
application development yielded a few basic recommendations. The best API
to use, it turns out, is the basic pthread interface; it has been well
optimized over time. SYSV IPC is best avoided.
Cpusets work better than the affinity mechanism
for CPU isolation. In general, developers should realize that getting the
best performance out of a realtime system will require a certain amount of
manual tuning effort. Realtime Linux allows the prioritization of things
like interrupt handlers, but the hard work of figuring out what those
priorities should be can only be done by developers or administrators. It
was acknowledged that the interfaces provided to administrators currently
are not entirely easy to use; it can be hard to identify interrupt threads,
for example. Red Hat's tuna
tool can help in this regard, but more needs to be done.
Scalability was a common theme at the meeting. As a general rule,
realtime development has not been focused specifically on scalability
issues. But there is interest in running realtime applications on larger
systems, and that is bringing out problems. The realtime kernel tends to
run into scalability problems before the mainline kernel does; it was
described as an early warning system which highlights issues that the
mainline will be dealing with five years from now. So realtime will tend
to scale more poorly than mainline, but fixing realtime's problems will
eventually benefit mainline users as well.
Darren Hart presented a couple of charts
containing the results of some work by John Stultz
showing the impact of running the realtime kernel on a 24-processor
system. When running in anything other than uniprocessor mode, the
realtime kernel imposes a roughly 50% throughput penalty on a suitably
pathological workload - a severe price.
Interestingly, if the locking changes from the realtime kernel are removed
while leaving all of the other changes, most of the performance loss goes
away. This has led Darren to wonder if there should be a hybrid option
available for situations where hard latency requirements are not present.
In other situations, the realtime kernel generally shows performance
degradation starting with eight CPUs, with sixteen showing unacceptable
overhead.
As it happens, nobody really understands where the performance cost of
realtime locking comes from. It could be in the sleeping spinlocks, but
there is also a lot of suspicion directed at reader-writer locks. In the
mainline kernel, rwlocks allow multiple readers to run in parallel; in the
realtime kernel, instead, only one reader runs at a time. That change is
necessary to make priority inheritance work; priority inheritance in the
presence of multiple readers is a difficult problem. One obvious
conclusion that comes from this observation is that, perhaps, rwlocks
should not implement priority inheritance. There is resistance to that
idea, though; priority inheritance is important in situations where the
highest-priority process should always run as quickly as possible.
The alternative to changing rwlocks is to simply stop using them whenever
possible. The usual way to remove an rwlock is to replace it with a
read-copy-update scheme. Switching to RCU will improve scalability,
arguably at the cost of increasing complexity. But before embarking on any
such effort, it is important to get a handle on how much of the problem
really comes down to rwlocks. Some research will be done in the near
future to better understand the source of the scalability problems.
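The shape of a read-copy-update replacement can be sketched in user
space with C11 atomics. This is an illustration of the publish step only,
not kernel RCU: it assumes a single writer at a time, and it leaves out
the grace-period tracking that tells the writer when the old copy can be
freed. The names (struct config, read_timeout(), update_timeout()) are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical read-mostly data that an rwlock might have protected. */
struct config {
    int timeout_ms;
    int retries;
};

/* Readers and the writer share only this pointer; it starts out NULL. */
static _Atomic(struct config *) current_config;

/* Reader side: one atomic load, no lock taken. Returns -1 if unset. */
int read_timeout(void)
{
    struct config *c =
        atomic_load_explicit(&current_config, memory_order_acquire);
    return c != NULL ? c->timeout_ms : -1;
}

/*
 * Writer side: copy the current version, modify the copy, then publish it.
 * Assumes writers are serialized by the caller. The previous version is
 * handed back through *old_out; it may only be freed once no reader can
 * still hold a pointer to it (the grace period this sketch does not track).
 * Returns 0 on success, -1 if allocation fails.
 */
int update_timeout(int timeout_ms, struct config **old_out)
{
    struct config *old;
    struct config *fresh;

    assert(old_out != NULL);

    old = atomic_load_explicit(&current_config, memory_order_relaxed);
    fresh = malloc(sizeof(*fresh));
    if (fresh == NULL)
        return -1;

    if (old != NULL) {
        *fresh = *old;              /* copy */
    } else {
        fresh->timeout_ms = 0;
        fresh->retries = 0;
    }
    fresh->timeout_ms = timeout_ms; /* update the private copy */

    /* Publish: readers see either the old or the new version, never a mix. */
    atomic_store_explicit(&current_config, fresh, memory_order_release);
    *old_out = old;
    return 0;
}
```

Readers never block and never write to shared memory, which is where the
scalability comes from. The added complexity sits on the writer side, in
deciding when the old version is safe to free.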
Another problem is per-CPU variables, which work by disabling preemption
while a specific variable is being used. Disabling preemption is anathema
to the realtime developers, so per-CPU variables in the realtime tree are
protected by sleeping locks instead. That increases overhead. The problem
is especially acute in slab-level memory allocators, which make extensive
use of per-CPU variables.
Solutions take a number of forms. There will eventually be a more
realtime-friendly slab allocator, probably a variant of SLQB. Minimizing
the use of per-CPU variables in general makes sense for realtime.
There are also schemes involving the creation of multiple virtual "CPUs" so
that even processes running on the same processor can have their own
"per-CPU" variables. That decreases contention for those variables
considerably at the cost of a slightly higher cache footprint.
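The trade-off in the "virtual CPU" idea can be illustrated with a
user-space analogue: a counter split into several slots, each on its own
cache line, so that updaters using different slots do not contend. This
is a sketch under assumptions, not code from the realtime tree; the slot
count, the 64-byte cache-line size, and all names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS     8    /* assumed number of "virtual CPU" slots */
#define CACHE_LINE 64   /* assumed cache-line size in bytes */

/* One slot per virtual CPU, padded out to a full cache line. */
struct slot {
    _Alignas(CACHE_LINE) atomic_long value;
};

struct sharded_counter {
    struct slot slots[NSLOTS];
};

/* Add n to the slot selected by slot_id (wrapped into range). */
void counter_add(struct sharded_counter *c, unsigned slot_id, long n)
{
    assert(c != NULL);
    atomic_fetch_add_explicit(&c->slots[slot_id % NSLOTS].value, n,
                              memory_order_relaxed);
}

/* Sum all slots; the result is approximate while updates are in flight. */
long counter_sum(struct sharded_counter *c)
{
    long total = 0;

    assert(c != NULL);
    for (size_t i = 0; i < NSLOTS; i++)
        total += atomic_load_explicit(&c->slots[i].value,
                                      memory_order_relaxed);
    return total;
}
```

Each slot costs a whole cache line even though it holds one long, which
is the larger cache footprint paid for the reduced contention.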
Plain old locks can also be a problem; a run of dbench on a 16-processor
system during the workshop showed a 90% reduction in throughput, with the
processors sitting idle half the time. The problem in this case turns out
to be dcache_lock, one of the last global spinlocks remaining in
the kernel. The realtime tree feels the effects of this lock more strongly
for a couple of reasons. One is that threads holding the lock can be
preempted; that leads to longer lock hold times and more context switches. The
other is that sleeping spinlocks are simply more complicated, especially in
the contended slow path of the code. So the locking primitives themselves
require more CPU time.
The solution to this particular problem can only be the elimination of the
global dcache_lock. Nick Piggin has a patch set which does
exactly that, but it has not yet been tested with the realtime tree.
Realtime makes life harder for the scheduler. On a normal system, the
scheduler can optimize for overall system throughput. The constraints
imposed by realtime, though, require the scheduler to respond much more
aggressively to events. So context-switch rates are higher and processes are
much more likely to migrate between CPUs - better for bounded response
times, but worse for throughput. By the time the system scales up to
something relatively large - 128 CPUs, say - there does not seem to be
any practical way to get consistently good decisions from the scheduler.
There is some interest in deadline-oriented schedulers. Adding an
"earliest deadline first" or related scheduler could be useful for
application developers, but nobody seems to feel that a deadline scheduler
would scale better than the current code.
What all this means is that realtime applications running on that kind of system
must be partitioned. When specific CPUs are set aside for specific
processes, the scheduling problem gets simpler. Partitioning requires real
work on the part of the administrator, but it seems unavoidable for larger
systems.
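At the level of a single process, the basic building block for this kind
of partitioning is the Linux sched_setaffinity() call. The sketch below
pins the calling thread to one CPU; the helper names are hypothetical,
and a real deployment would more likely use cpusets or an administrative
tool than hard-coded calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/*
 * Restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (errno is set by the system call).
 */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}
```

This only restricts where one thread may run. It does not keep other
tasks, kernel threads, or interrupts off the chosen CPU.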
It doesn't help that complete CPU isolation is still hard to accomplish on
a Linux system. Certain sorts of operations, such as workqueue flushes,
can spill into a processor which has been set aside for specific
processes. In general, anything involving interrupts - both device
interrupts and inter-processor interrupts - is a problem when one is trying
to dedicate a CPU to a task. Steering device interrupts to a given
processor is not that hard, though the management tools could use
improvement. Inter-processor interrupts are currently harder to avoid;
code generating IPIs needs to be reviewed and, when possible, modified to
avoid interrupting processors which do not actually have work to do.
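Steering a device interrupt is done by writing a hexadecimal CPU bitmask
to /proc/irq/N/smp_affinity. A minimal sketch of that operation follows;
the helper names are hypothetical, the mask helper handles only CPUs 0
through 31 for simplicity, and the write requires root privileges.

```c
#include <assert.h>
#include <stdio.h>

/* Build "/proc/irq/<irq>/smp_affinity"; returns its length, or -1. */
int irq_affinity_path(char *buf, size_t len, unsigned irq)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "/proc/irq/%u/smp_affinity", irq);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Format the hex mask selecting one CPU (0..31); returns length, or -1. */
int cpu_to_hex_mask(char *buf, size_t len, unsigned cpu)
{
    int n;

    assert(buf != NULL);
    if (cpu >= 32)
        return -1;      /* wider masks are out of scope for this sketch */
    n = snprintf(buf, len, "%x", 1u << cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Direct the given IRQ to a single CPU. Returns 0 on success, -1 on error. */
int steer_irq(unsigned irq, unsigned cpu)
{
    char path[64], mask[16];
    FILE *f;
    int ok;

    if (irq_affinity_path(path, sizeof(path), irq) < 0 ||
        cpu_to_hex_mask(mask, sizeof(mask), cpu) < 0)
        return -1;

    f = fopen(path, "w");
    if (f == NULL)
        return -1;      /* typically: not root, or no such IRQ */
    ok = fputs(mask, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;         /* the kernel may reject the mask at write time */
    return ok ? 0 : -1;
}
```

Calling steer_irq(19, 3), for example, would write "8" to
/proc/irq/19/smp_affinity; the IRQ number there is made up. Nothing
comparable exists for inter-processor interrupts, which is why they are
the harder problem.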
Integrating interrupt management into the current cpuset and control group
code would be useful for system administrators. That seems to be a harder
task; Paul Jackson, the original cpuset developer, was strongly opposed to
trying to include interrupt management there. There's a lack of good
abstractions for this kind of administration, though the generic IRQ layer
helps. The opinion at the meeting seemed to be that this was a solvable
problem; if it can be solved for the x86 architecture, the other
architectures will eventually follow.
Going to a fully tickless kernel is also an important step for full CPU
isolation. Some work has recently been done in that direction, but much
remains to be done.
Stable kernel ABI concerns made a surprising appearance. The
"enterprise" Linux offerings from distributors generally include a promise
that the internal kernel interface will not change. The realtime
enterprise distributions have been an exception to this rule, though; the
realtime code is simply in too much flux to make such a promise practical.
This exemption has made life easier for developers working on that code,
naturally; it also has made it possible for customers to get the newest
code much more quickly. There are some concerns that, once the remaining
realtime code is merged into the mainline, the same kernel ABI constraints
may be imposed on realtime distributions. It is not clear that this needs
to happen, though; realtime customers seem to be more interested in keeping
up with newer technology and more willing to put up with large changes.
Future work was discussed briefly. Some of the things remaining to
be done include:
- More SMP work, especially on NUMA systems.
- A realtime idle loop. There is the usual tension there between
preserving the best response time and minimizing power consumption.
- Supporting hardware-assisted operations - things like onboard
cryptographic acceleration hardware.
- Elimination of the timer tick.
- Synchronization of clock events across CPUs. Clock synchronization is
always a challenging task. In this case, it's complicated by the fact
that a certain amount of clock skew can actually be advantageous on an
SMP system. If clock events are strictly synchronized, processors
will be trying to do things at the same time and lock contention will
increase.
A near-future issue is spinlock naming. Merging the sleeping
spinlock code requires a way to distinguish between traditional, spinning
locks and the newer type of lock which might sleep on a realtime system.
The best solution, in theory, is to rename sleeping locks to something like
lock_t, but that would be a huge change affecting many thousands
of files. So the realtime developers have been contemplating a new name
for non-sleeping locks instead. There are far fewer of these locks, so
renaming them to something like atomic_spinlock would be much less
disruptive.
There was some talk of the best names for "atomic spinlocks"; they could be
"core locks," "little kernel locks," or "dread locks." What really came
out of the discussion, though, is that there was a fair amount of confusion
regarding the two types of locks even in this group, which understands them
better than anybody else. That suggests that some extra care should go
into the naming, with the goal of making the locking semantics clear and
discouraging the use of non-sleeping locks. If the semantics of
spinlock_t change, there is a good argument that the name should
also change. That supports the idea of the massive lock renaming,
regardless of how disruptive it might be.
Whether such a change would be accepted is an open question, though. For
now, both the small renaming and the massive renaming will be prepared for
review. The issue may then be taken to the kernel summit in October for a
final decision.
Tools for realtime developers came up a couple of times. There are
a number of tools for application optimization now, but they are scattered
and not always easy to use. And, it is said, there needs to be a tool with
a graphical interface or a lot of users simply will not take it seriously.
The "perf" tool, part of the kernel's "performance events" subsystem, seems
poised to grow into this role. It can handle many of the desired tasks -
latency tracing, for example - now, and new features are being added. The
"tuna" tool may be extended to provide a nicer interface to perf.
User-space tracepoints seem to be high on the list of desirable features
for application developers. Best would be to integrate these tracepoints
with ftrace somehow. Alternatively, user-space trace data could be
collected separately and integrated with kernel trace data at
postprocessing time. That leads to clock synchronization issues, though,
which are never easy to solve.
The final part of the meeting became a series of informal discussions and
hacking efforts. The participants universally saw it as a worthwhile
gathering, with much learned by all. There are some obvious
action items, including more testing to better understand scalability
problems, increasing adoption of threaded interrupt handlers, solving the
spinlock naming problem, improving tools, and more. Plenty of work for all
to do. But your editor has been assured that the work will be done and
merged in the next year - for real this time.
Page editor: Jake Edge