Brief items
The current development kernel remains 2.6.32-rc6; there have been
no 2.6.32 prepatches released over the last week.
The current stable kernel is 2.6.31.6, released (along with 2.6.27.39) on November 9.
Both kernels contain a long list of important fixes, including some which
are security-related.
Comments (none posted)
I, Paul E. McKenney, maintainer of the RCU implementation embodied
in the Linux kernel and co-inventor of RCU, being of sound mind and
body, notwithstanding the wear and tear inherent in my numerous
decades sojourn on this planet, hereby declare that the following
usage of work queues constitutes a valid RCU implementation...
--
Acked-by: Paul McKenney style (thanks to
Amit Shah)
Hugh's hypothesis: for every variable x initialized by a
subfunction, there exists at least one version V of gcc, such that
V reports that x may be used uninitialized.
--
Hugh Dickins
Comments (1 posted)
By Jonathan Corbet
November 7, 2009
I/O bandwidth allocation and control is a tricky problem; as was
discussed here earlier this
year, several groups have tried to solve this problem at different levels
in the block I/O stack. At a minisummit in Tokyo, the developers behind
various I/O controllers got together and decided on a multi-level approach
to the problem. Since then, developer Vivek Goyal has remarked:
IO control is a huge problem and the moment we start addressing all
the issues in one patchset, it bloats to unmanageable proportions
and then nothing gets inside the kernel. So at io mini summit we
agreed that lets take small steps and once a piece of code is
inside the kernel and stabilized, take the next step. So this is the
first step.
This first small step is a
20-part patch series implementing a proportional-weight I/O controller
within the CFQ I/O scheduler. A new control group (cgroup) class (called "blkio") is
created, with each group given an I/O weight between 100 and 1000.
The available I/O bandwidth is then handed out to groups as determined by
their configured weight.
Working within CFQ gives the I/O controller fine-grained control over how
much bandwidth each group gets. On the other hand, CFQ is poorly
positioned to identify block I/O traffic which is not directly generated by
user space (writeback, for example) or to throttle activity at the virtual
memory level. Achieving that will require the integration of a controller
which hooks into the system at a much higher level. The plan is to write
such a controller and have it work transparently within the same blkio
cgroup. But a couple more small steps will probably be required before
that is ready to go in.
Comments (1 posted)
By Jonathan Corbet
November 11, 2009
The "sysctl" mechanism is used by the kernel to export a wide variety of
tuning options to user space. Sysctl is actually two interfaces which have
been awkwardly joined together: the
sysctl() system call and the
/proc/sys directory hierarchy. Of the two,
/proc/sys is
much more widely used, to the point that developers rarely even think about
the system call. But the
sysctl() implementation is a significant
amount of code which suffers from chronic neglect. It has thus been
the
target of removal attempts for years.
The problem with removing sysctl(), of course, is that it is
part of the kernel ABI. As long as the possibility of broken applications
exists, this ABI cannot be removed. So it continues to sit in the kernel,
despite the fact that its absence would be noted by few people.
Eric Biederman has come up with a
new approach to the problem. His patch set removes the current
sysctl() implementation, getting rid of a few thousand lines of
unloved code. He then adds back a new wrapper which emulates the
sysctl() ABI by way of /proc/sys. So any applications
using sysctl() should continue to work, but the code dedicated to
making it work is much reduced from what was there before.
The patch set still concerns some developers. The compatibility wrapper has
its own configuration option, leading some to worry that distributions
might disable it and cause obscure things to break. Going through
/proc/sys will make access to these variables much slower than it
was before. That should not really be a problem: access to sysctl
variables is not normally a performance-critical operation. So there does
not appear to be any sort of real obstacle to the merging of these patches;
maybe, someday, binary sysctl() will truly vanish into the past.
Comments (4 posted)
IBM developerWorks has posted
an introduction to the NilFS2 and exofs filesystems. "
An interesting aspect of NiLFS(2) is its technique of continuous snap-shotting. As NILFS is log structured, new data is written to the head of the log while old data still exists (until it's necessary to garbage-collect it). Because the old data is there, you can step back in time to inspect epochs of the file system. These epochs are called checkpoints in NiLFS(2) and are an integral part of the file system. NiLFS(2) creates these checkpoints as changes are made, but you can also force a checkpoint to occur."
Comments (8 posted)
Kernel development news
By Jonathan Corbet
November 11, 2009
Much of the fuss involving
fsync() and crash robustness over the
last year was really about how applications can get transactional semantics
out of filesystem operations. An application developer often wants to see
a set of operations either succeed or fail as a unit, without the
possibility of partially-completed operations. Providing ways for
applications to get that behavior can be a challenge, though.
Btrfs has tried to make this capability available to user space by way of
the BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END
ioctl() calls. There are some real problems with this approach,
though. They operate as a pair of system calls, with any operations
between the two being treated as a transaction within the filesystem. But
if something fails, or if the application never quite gets around to ending
the transaction, things will eventually come to a halt. That is why the
btrfs code includes this comment:
There are many ways the trans_start and trans_end ioctls can lead
to deadlocks. They should only be used by applications that
basically own the machine, and have a very in depth understanding
of all the possible deadlocks and enospc problems.
It is, in other words, a dangerous capability which cannot be made
generally available.
Sage Weil has posted a patch
taking a rather different approach to the problem. The key idea is to
avoid the problem of never-completed transactions by encapsulating the
entire thing within a single system call. The result is a new
BTRFS_IOC_USERTRANS command for ioctl(); chances are it
will require a bit of work yet, but it could be the base for user-space
transactions in the future.
This command takes a structure which looks something like the following:
struct btrfs_ioctl_usertrans {
__u64 num_ops;
struct btrfs_ioctl_usertrans_op *ops_ptr;
__u64 num_fds;
__u64 data_bytes, metadata_ops;
__u64 flags;
__u64 ops_completed;
};
The ops_ptr argument points to an array of num_ops
individual operations to complete:
struct btrfs_ioctl_usertrans_op {
__u64 op;
__s64 args[5];
__s64 rval;
__u64 flags;
};
Here, op describes an individual operation. It can be
BTRFS_IOC_UT_OP_OPEN (open a file),
BTRFS_IOC_UT_OP_CLOSE (close a file),
BTRFS_IOC_UT_OP_PWRITE (write to a file),
BTRFS_IOC_UT_OP_UNLINK (unlink a file),
BTRFS_IOC_UT_OP_LINK (make a link to a file),
BTRFS_IOC_UT_OP_MKDIR (create a directory),
BTRFS_IOC_UT_OP_RMDIR (remove a directory),
BTRFS_IOC_UT_OP_TRUNCATE (truncate a file),
BTRFS_IOC_UT_OP_SETXATTR (set an extended attribute),
BTRFS_IOC_UT_OP_REMOVEXATTR (remove an extended attribute), or
BTRFS_IOC_UT_OP_CLONERANGE (copy a range of data from one file to
another).
For each operation, the args field contains a set of arguments
similar to what one would see for the corresponding system call. One
interesting difference is that there are no hard-coded file descriptor numbers;
instead, the transaction gets a new file descriptor table and all
operations work with indexes into that table. Essentially, transactions
work within a file descriptor space separated from that used by the calling
process.
The flags field describes how the return value from each operation
should be interpreted. It can be contain any of:
BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE,
BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ,
BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT,
BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT,
BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE, and
BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE. In each case, the flag causes
the return value to be compared against the passed-in rval field;
if the comparison is successful, the transaction will fail.
What happens if the transaction fails? The partially-completed transaction
will not be rolled back; btrfs, not being a database, is not really set up
to do that. Instead, the number of successfully-completed operations will
be passed back to user space. Optionally, the application can provide the
BTRFS_IOC_UT_FLAG_WEDGEONFAIL flag, causing btrfs to leave the
transaction open, locking up the filesystem until the system is rebooted.
This may seem like a rather antisocial approach to transaction atomicity,
but, if failure is both highly unlikely and highly problematic, that might
be the right thing to do.
A patch like this raises a lot of questions. The first obstacle may be the
fact that it requires exporting a number of system call implementations to
modules, a change which has been resisted in the past. Kernel code need
not normally call functions like sys_mkdir(), but this patch does
exactly that. Calling system call implementations directly can be a bit
tricky on some architectures, and there are good reasons for not making
these functions available to modules in general.
Another problem is that the filesystem has
no real way of determining whether a transaction will succeed before
jumping into it; the best it can do is reserve some metadata space in
accordance with an estimate provided by the application. If transactions
are allowed to complete partially, they are no longer transactions. But
the alternative of locking up the system can only leave people wondering if
there isn't a better way.
Then, there is a question which was raised on the list: btrfs provides
cheap snapshots, why not use them to implement transactions? Using a
snapshot would take advantage of existing functionality and should make
proper rollback possible. The problem would appear to be performance:
btrfs snapshots are not quite that cheap, especially when one
considers the need to exclude other activity on the filesystem while the
transaction is active. So Chris Mason has suggested that the old standby - write-ahead
logging - would be preferable because it will perform much better. But, he
thinks, the multi-operation ioctl() could maybe perform better
yet.
Finally, there would appear to be some real similarities between this API
and the rather more general syslets mechanism. Syslets have
been on the back burner for some time now, but they could come back forward
if they seemed like a good way to solve this problem.
Clearly, like much of btrfs, this new ioctl() is a work in
progress. If it gets into the mainline, it will be likely to have changed
quite a bit on the way. But the problem this patch is trying to solve is
real; it's clearly an issue which is worth thinking about.
Comments (18 posted)
By Jonathan Corbet
November 11, 2009
"Fanotify" is a much-revised system for providing filesystem event
notifications to user space, and, possibly, allowing user space to block
open() operations on specific files. The intended use case is
malware-scanning utilities, but there are others: hierarchical storage has
been cited as one possibility. This code has had a long, hard path into
the kernel for a couple of reasons: kernel developers are not big fans of
malware scanning, and nailing down the user-space ABI has been
challenging.
The first obstacle has been more-or-less overcome. Even developers who
think that malware scanning is the worst sort of security snake oil can
agree that having these utilities use a well-defined kernel interface is
better than having them employ nasty tricks like hooking into the system
call table. ABI difficulties can be harder to overcome, though. With the latest fanotify posting,
developer Eric Paris may have resolved this issue for at least a portion of
the fanotify functionality.
The new version does away with the novel interface using
setsockopt() in favor of a couple of new system calls. The first
of these is fanotify_init():
int fanotify_init(unsigned int flags, unsigned int event_f_flags,
int priority);
This system call initializes the fanotify subsystem, returning a file
descriptor which is used for further operations. There are two
flags values implemented: FAN_NONBLOCK creates a
nonblocking file descriptor, and FAN_CLOEXEC sets the
close-on-exec flag. Currently, event_f_flags and
priority are unused; they should be set to zero.
Management of notification events is then done with
fanotify_mark():
int fanotify_mark(int fanotify_fd, unsigned int flags,
int dfd, const char *pathname, u64 mask,
u64 ignored_mask);
This call is used to "mark" specific parts of the filesystem hierarchy,
indicating an interest in events involving those files.
fanotify_fd is the file descriptor returned by
fanotify_init(). The flags parameter must be one of
FAN_MARK_ADD or FAN_MARK_REMOVE, indicating whether this
call adds new marks or removes existing ones; there are also a couple of
flags to control following of symbolic links and the marking of directories
(without their contents).
The file(s) to be marked are determined by dfd and
pathname; these parameters work much like in any of the
*at() system calls. If dfd is AT_FDCWD, the
pathname is resolved using the current working directory. If,
instead, dfd points to a directory, the pathname lookup
starts at that directory. If pathname is null, though, then
dfd is interpreted as the actual object to mark.
Finally, mask and ignored_mask control which events are
reported. To generate a specific event, a file must have the appropriate
flag set in mask and clear in ignored_mask. The flags
are FAN_ACCESS (file access),
FAN_MODIFY (a file is modified),
FAN_CLOSE_WRITE (a writable file has been closed),
FAN_CLOSE_NOWRITE (a read-only file has been closed),
FAN_OPEN (a file has been opened), and
FAN_EVENT_ON_CHILD (events on children of a directory). There is
also a FAN_Q_OVERFLOW event for event queue overflows, but that is
not currently implemented.
Once files have been marked, the application can simply read from the
fanotify file descriptor to get events. The events look like:
struct fanotify_event_metadata {
__u32 event_len;
__u32 vers;
__s32 fd;
__u64 mask;
};
Here, event_len is the length of the structure, vers
indicates which version of fanotify generated the structure, fd is
an open file descriptor for the object being accessed, and mask
describes what is actually happening.
There is one crucial component missing in these patches: there is no way
for the fanotify user to react to these events. In particular, the ability
to block an open() call, a core part of the malware-scanning
process, is missing. That, presumably, is to be added in a future
revision. Meanwhile, Eric has requested permission to put the notification
code into linux-next, presumably with a 2.6.33 merge in mind. As of this
writing, objections have not been forthcoming.
Comments (11 posted)
November 11, 2009
This article was contributed by Darren Hart
The
futex
[PDF] mechanism, introduced in 2.5.7 by Rusty Russell, Hubertus Franke, and
Mathew Kirkwood, is a fast, lightweight kernel-assisted locking
primitive for user-space applications. It provides for very fast
uncontended lock acquisition and
release. The futex state is stored in a user-space variable (an unsigned
32-bit integer on all platforms). Atomic operations are used in order to
change the state of the futex in the uncontended case without the
overhead of a syscall. In the contended cases, the kernel is invoked to
put tasks to sleep and wake them up.
Futexes are the basis of several mutual exclusion constructs commonly
used in threaded programming. These include pthread mutexes,
condvars, semaphores, rwlocks, and barriers. They have been through a
lot of reconstructive and cosmetic surgery over the last several years,
and are now more efficient, more functional, and better documented than
ever before.
Overview
While few application developers will use futexes directly, a cursory
knowledge of how to do so is necessary to appreciate the improvements
presented a bit later. By way of a simple example, futexes can be used
to store the state of a lock and provide a kernel waitqueue for tasks
blocking on the lock. To minimize syscall overhead, this state should
allow for atomic lock acquisition when the lock is uncontended. The
state could be defined as:
- unlocked
- locked
In order to acquire the lock, an atomic test-and-set instruction (such
as cmpxchg()) can be used to test for 0 and set to 1. In
this case, the
locking thread acquires the lock without involving the kernel (and the
kernel has no knowledge that this futex exists). When the next thread
attempts to acquire the lock, the test for zero will fail and the kernel
needs to be involved. The blocking thread can then use the futex()
system call with the FUTEX_WAIT opcode to put itself to sleep on the futex,
passing the address of the futex state variable as an argument. To
release the lock, the owner changes the lock state to zero (unlocked) and
issues the FUTEX_WAKE opcode, which will wake the blocked thread to
return to user space and try to acquire the lock (as described above).
This is a an obviously trivial example with lots of room for
optimization. Ulrich Drepper's "Futexes are Tricky" [PDF] is still the
undisputed reference for using futexes to build locking primitives such
as mutexes. It explores the many race conditions involved with using
futexes as well as optimizations to improve on the example given here.
When the user threads call into the kernel with the futex() system
call,
they pass the address of the futex state (uaddr), the opcode to perform
(op), and various other arguments. The uaddr is used by the kernel to
generate a unique "futex_key" to reference the futex. When a thread
requests to block on a futex, as with FUTEX_WAIT, a "futex_q" is created
and queued in the "futex_queues" hash table. There is one futex_q for
every task blocked on a futex, possibly many futex_q's per futex. The
futex_queues themselves (the hash table lists, not the "futex_q's") are
shared among futexes, since multiple futex_keys will hash to the same
queue. These relationships are illustrated below:
In most cases, there is no policy defining how the user space state
variable is to be used (despite what the futex man pages may or
may not say). The application (or a library such as glibc) uses this
value to define the state of the locking construct being implemented.
This can be as simple as a boolean variable (as in the example above),
however optimized implementations and other locking mechanisms require
more complex state values.
In addition to the simple FUTEX_WAIT and FUTEX_WAKE operations, the
kernel also manages special operations that require more knowledge of
the locking construct's state than can be had in user space, most
notably the priority inheritance (PI) chains and robust lists. PI and
robust futexes are exceptions to the user-defined-policy rule regarding
the state variable. Their state depends not only on the locked state of
the mutex, but also on the identity of the owner and whether or not
there are waiters. As such, the futex value is defined as the thread
identifier (TID) of the owner and a bit to indicate pending owners. This
policy still allows for user space atomic operations to avoid calling
into the kernel in the uncontended case.
Improvements
Futexes have seen numerous improvements from a handful of developers
since their debut appearance in 2.5.7. Some notable improvements include
priority based wake-up for real-time tasks (by Pierre Peiffer) and
robust and PI futexes (by Ingo Molnar and Thomas Gleixner).
These latter features have been in use for some time and have been
adequately covered here on LWN.net as well as in the excellent
discussions on the kernel mailing list. Your author's foray into futexes
picks up here, about two and a half years ago. Aside from several fixes
to address rare corner cases and race conditions, the futex code has
seen several functional and performance improvements since those earlier
contributions.
Significant effort went into reducing futex overhead. Eric Dumazet
introduced private futexes as an optimization for
PTHREAD_PROCESS_PRIVATE pthread mutexes. Private futexes can only be
used by threads of the same process. They are distinguishable from each
other simply by their virtual address, while shared futexes have
different virtual addresses in each process, requiring the kernel to
lookup their physical address for unique identification. This
optimization eliminates the use of the mmap_sem semaphore for
private futexes,
reducing system-wide contention. It also eliminates the atomic
operations used in reference counting for private futexes, resulting in
less cache-line bouncing on SMP machines. Glibc now uses private futexes
by default.
Peter Ziljstra further reduced the futex dependency on mmap_sem by
using get_user_pages_fast() in the fast paths, making use of
get_user_pages(), and pushing the mmap_sem locks down tightly
around the
slow paths (September 2008). These changes had the added benefit of
removing much of the virtual memory related logic from kernel/futex.c,
simplifying the code considerably. Due to their dependence on user space
addresses, futexes are burdened with several possible fault points.
Holding mmap_sem complicated the fault logic since it had to be
released prior to calling get_user(). With mmap_sem
usage reduced,
your author greatly simplified the fault logic (March 2009), resulting
in far more legible code.
Bitset conditional wakeup was added by Thomas Gleixner (February 2008)
in order to enable optimized rwlock implementations in glibc.
FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET allow the user to specify a
bitmask which limits the woken tasks to those which specified the same
bitset (or a superset, such as FUTEX_BITSET_MATCH_ANY) at wait time.
Since the introduction of PI futexes, the glibc condvar implementation
of pthread_cond_broadcast() (with a PI mutex) has been forced to wake
all waiters, rather than take advantage of FUTEX_REQUEUE, due to the
lack of support for requeueing to PI futexes. This leads to a wake-up
storm as all the waiters race back to user space to contend for the
lock. It also fails to ensure that the highest priority task acquires
the lock first. Recent kernels (2.6.31-rt* and 2.6.32-rc*) now have your
author's FUTEX_CMP_REQUEUE_PI patches (April 2009) which provide the
kernel-side support for requeueing waiters from a non-PI futex to a PI
futex. With glibc patches in the works by Dinakar Guniguntala,
real-time applications will soon be able to use pthread condition
variables with guaranteed wake-up order and fewer overall wake-ups.
Now What?
While there are several items that futex developers may consider in the
future, they are hopeful that kernel/futex.c and all its
brain-bending,
liver-killing insanity can be put to rest for at least a little while.
However, since no article is complete without a list of next steps, the
following items may receive some attention in the future:
Man pages: The current man pages do not include some of the new futex
operations. They suggest a policy for the value of the futex which has
led to some confusion regarding usage of futexes. Worst of all, the user
space futex() definition has been removed from
/usr/include/linux/futex.h, rendering the man pages not only
incomplete,
but also inaccurate. Users of futexes must use the syscall interface
directly.
Adaptive futexes: It is possible that some of the scheduling overhead of
futexes can be reduced by some optional amount of spinning prior to
going to sleep in the kernel. However, as futexes expose their state to
user space, this spinning can also be done in user space, as is done
with adaptive mutexes in glibc now, albeit without the knowledge of whether
the owner is running, so spinning is reduced to a simple maximum-retries loop.
Interruptible futexes: There is some interest in interruptible blocking
lock operations from large proprietary software projects. Futex
operations currently restart themselves in the event of a signal, rather
than returning -EINTR to user space. Futexes could be flagged with
FUTEX_INTERRUPTIBLE which would be checked on signal-induced wakeup to
determine if the syscall should be restarted or if -ECANCELED should be
returned to user space. Exposing such a feature in the pthread locking
primitives would involve non-POSIX compliant changes to the pthread
library, but this is not without precedent.
Scalability enhancements: There has been some discussion on LKML
regarding private as well as NUMA-optimized hash tables. The futex
hash table is shared across all processes and is protected by spinlocks
which can lead to real overhead, especially on large systems. This
overhead is not serving any useful purpose if these systems are
partitioned on NUMA nodes, or even for processes that use private
futexes exclusively.
Futex test suite: Your author has been compiling a list of requirements
for an exhaustive test suite to validate futex functionality. This
test suite would serve as a regression suite for future development. The
many corner cases and misuse cases possible with futexes complicate the
test suite and present a challenge to its design.
Acknowledgements
I would like to extend my thanks to John Stultz, Will Schmidt, Paul
McKenney, Nivedita Singhvi, and, of course, Jon Corbet, whose reviews
have made this article far more legible and complete than it would have
been otherwise.
Comments (6 posted)
Patches and updates
Kernel trees
Build system
- nir.tzachar@gmail.com: nconfig v5 .
(November 6, 2009)
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Janitorial
Memory management
Networking
Architecture-specific
Security-related
Virtualization and containers
Miscellaneous
Page editor: Jonathan Corbet
Next page: Distributions>>