By Jonathan Corbet
January 16, 2013
Deadlocks in the kernel are a relatively rare occurrence in recent years.
The credit largely belongs to the "
lockdep"
subsystem, which watches
locking activity and points out patterns that could lead to deadlocks when
the timing goes wrong. But locking is not the source of all deadlock
problems, as was recently shown by an old deadlock bug which was only
recently found and fixed.
In early January, Alex Riesen reported some
difficulties with USB devices on recent kernels; among other things, it was
easy to simply lock up the system altogether. A fair amount of discussion
followed before Ming Lei identified the
problem. It comes down to the block layer's use of the asynchronous function call infrastructure used
to increase parallelism in the kernel.
The asynchronous code is relatively simple in concept: a function that is to be run
asynchronously can be called via async_schedule(); it will then
run in its own thread at some future time. There are various ways of
waiting until asynchronously called functions have completed; the most
thorough is async_synchronize_full(), which waits until all
outstanding asynchronous function calls anywhere in the kernel have
completed. There are ways of
waiting for specific functions to complete, but, if the caller does not
know how many asynchronous function calls may be outstanding,
async_synchronize_full() is the only way to be sure that they are
all done.
The block layer in the kernel makes use of I/O schedulers to organize and
optimize I/O operations. There are several I/O schedulers available; they
can be switched at run time and can be loaded as modules. When the block
layer finds that it needs an I/O scheduler that is not currently present in
the system, it will call request_module() to ask user space to
load it. The module loader, in turn, will call
async_synchronize_full() at the end of the loading process; it
needs to ensure that any asynchronous functions called by the newly loaded
module have completed so that the module will be fully ready by the time
control returns to user space.
So far so good, but there is a catch. When a new block device is
discovered, the block layer will do its initial work (partition probing and
such) in an asynchronous function of its own. That work requires
performing I/O to the device; that in turn, requires an I/O scheduler. So
the block layer may well call request_module() from code that is
already running as an asynchronous function. And that is where things turn
bad.
The problem is that the (asynchronous) block code must wait for
request_module() to complete before it can continue with its
work. As described above, the module loading process involves a call to
async_synchronize_full(). That call will wait for all
asynchronous functions, including the one that called
request_module() in the first place, and which is still waiting
for request_module() to complete. Expressed more concisely, the
sequence looks like this:
- sd_probe() calls async_schedule() to scan a device
asynchronously.
- The scanning process tries to read data from the device.
- The block layer realizes it needs an I/O scheduler, so, in
elevator_get(), it calls request_module() to load
the relevant kernel module.
- The module is loaded and initializes itself.
- do_module_init() calls async_synchronize_full() to
wait for any asynchronous functions called by the just-loaded module.
- async_synchronize_full() waits for all asynchronous
functions, including the one
called back in step 1, which is waiting for the
async_synchronize_full() call to complete.
That, of course, is a classic
deadlock.
Fixing that deadlock turns out not to be as easy as one would like. Ming suggested
that the call to async_synchronize_full() in the module loader
should just be removed, and that user space should be taught that devices
might not be ready immediately when the modprobe binary
completes. Linus was not impressed with
this approach, however, and it was quickly discarded.
The optimal solution would be for the module loader to wait only for
asynchronous functions that were called by the loaded module itself. But
the kernel does not currently have the infrastructure to allow that to
happen; adding it as an urgent bug fix is not really an option. So
something else needed to be worked out. To that end, Tejun Heo was brought
into the discussion and asked to help come up with a solution. Tejun
originally thought that the problem could
be solved by detecting deadlock situations and proceeding without waiting
in that case, but the problem of figuring out when it would be safe to
proceed turned out not to be tractable.
The solution that emerged instead is
regarded as a bit of a hack by just about everybody involved. Tejun added
a new process flag (PF_USED_ASYNC) to mark when a process has
called asynchronous functions. The module loader then tests this flag; if
no asynchronous functions are called as the module is loaded, the call to
async_synchronize_full() is skipped. Since the I/O scheduler
modules make no such calls, that check avoids the deadlock in this
particular case. Obviously, the problem remains in any case where an
asynchronously-loaded module calls asynchronous functions of its own, but
no other such cases have come to light at the moment. So it seems like a
workable solution.
Even so, Tejun remarked "It makes me feel dirty but makes the problem
go away and I can't think of anything better." The patch has found
its way into the mainline and will be present in the 3.8 final
release. By then, though, it would not be entirely surprising if somebody
else were to take up the task of finding a more elegant solution for a
future development cycle.
(
Log in to post comments)