Module reference counts - back to the future?
[Posted May 7, 2003 by corbet]
In the 2.2 (and prior) kernels, loadable modules were charged with the task
of maintaining a count of references to the module. When the reference
counting was done correctly, the kernel knew when it was safe to unload a
module. Unfortunately, the maintenance of the reference counts often was
not done right, and, in any case, in-module reference counting was subject
to certain small but unavoidable race conditions. Simply put, there was no
way to completely avoid situations where the kernel was executing module
code while the module's reference count was zero.
Starting in 2.3, the reference counting task moved slowly outside of the
modules themselves. For example, the file_operations structure
exported by char drivers got an owner field which points to the
module. Before the kernel will, say, call a module's open()
method, it increments the reference count. This mechanism puts the
reference count in one place (rather than in hundreds of module
open() methods) and eliminates race conditions. In 2.5, this
mechanism was extended further, and attempts to increment reference counts
were allowed to fail (for example, when the module still exists, but is
being unloaded).
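The owner-based scheme can be pictured with a small user-space sketch. The names try_module_get(), module_put(), and the owner field follow the kernel's, but the bodies here are simplified stand-ins (the real counters are not plain integers), and core_open(), core_release(), and the demo_* objects are hypothetical names invented for illustration.

```c
/* User-space sketch of owner-based module reference counting.
 * Not kernel code: the structures are reduced to the bare minimum. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum module_state { MODULE_LIVE, MODULE_GOING };

struct module {
    enum module_state state;
    int refcount;               /* simplified: a plain int, no locking */
};

/* Take a reference unless the module is already being unloaded. */
bool try_module_get(struct module *mod)
{
    if (mod == NULL)
        return true;            /* built-in code: nothing to pin */
    if (mod->state != MODULE_LIVE)
        return false;           /* still exists, but on its way out */
    mod->refcount++;
    return true;
}

void module_put(struct module *mod)
{
    if (mod == NULL)
        return;
    assert(mod->refcount > 0);
    mod->refcount--;
}

struct file_operations {
    struct module *owner;       /* which module implements these methods */
    int (*open)(void);
};

/* Hypothetical core-side helper: pin the owner before calling into it. */
int core_open(const struct file_operations *fops)
{
    int err;

    if (!try_module_get(fops->owner))
        return -1;              /* refuse to enter a departing module */
    err = fops->open();
    if (err)
        module_put(fops->owner);    /* a failed open holds no reference */
    return err;
}

/* Hypothetical counterpart, called when the file is closed. */
void core_release(const struct file_operations *fops)
{
    module_put(fops->owner);
}

/* A stand-in "driver": its open() never touches the count itself. */
int demo_open(void) { return 0; }
struct module demo_module = { MODULE_LIVE, 0 };
struct file_operations demo_fops = { &demo_module, demo_open };
```

The point of the arrangement is that the count changes before control enters the module, so the driver's open() method needs no counting code of its own.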
This mechanism works reasonably well for most device drivers; the interface
between the kernel and the module is narrow, and references to the module
are limited to a few types of objects (open files and memory mappings).
Life gets harder, however, when you get into other parts of the kernel. A
recent discussion on the netdev list, started by the discovery of a situation where the
networking subsystem can call into a module which has been unloaded, shows
how hard it can be.
The networking code keeps track of a vast array of objects, each of which
can reference others, and each of which must be reference counted. A
networking module can only be unloaded when all of those objects are no
longer referenced and have been cleaned up. The immediate problem has to
do with network devices (exported by network device drivers); numerous
parts of the kernel can reference such a device. So the device itself
contains a reference count. In some situations, however, the kernel can
remove a driver module even though a particular device's reference count
has not dropped to zero. One solution to the problem, as proposed by
Rusty Russell, is to increment the module's reference count every time
the reference count of one of its network devices goes up. The problem
with this approach, according to David Miller, is that devices are just
the beginning:
    So you propose to add this kind of thing for every ARP entry, every
    route cache entry, every IPSEC policy, every socket, every struct
    sock, every networking dynamic object ever created? When we add
    SKB recycling, will we need to do a module get/put on every SKB
    alloc/free/clone/copy? I think this way lies insanity :)
The insanity comes from the fact that attempts to increment module use
counts can fail. Trying to add an unbelievable number of failure paths to
the networking code to deal with this case does indeed seem like a one-way
ticket to the funny farm. All this extra reference counting also adds
significant overhead to the networking code's hand-crafted fast paths; that
is a penalty that the networking hackers are not prepared to accept.
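To see where the new failure paths come from, consider a rough user-space sketch of the proposal. The kernel's dev_hold() returns nothing and cannot fail; dev_hold_pinned(), dev_put_pinned(), route_set_dev(), and the cut-down structures below are hypothetical names made up for this illustration.

```c
/* User-space sketch: a device reference that also pins the driver module.
 * All structures are simplified stand-ins, not the kernel's definitions. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct module {
    bool going;                 /* true once unloading has started */
    int refcount;
};

struct net_device {
    struct module *owner;       /* driver module behind this device */
    int refcnt;                 /* the device's own reference count */
};

bool try_module_get(struct module *mod)
{
    if (mod->going)
        return false;
    mod->refcount++;
    return true;
}

void module_put(struct module *mod)
{
    assert(mod->refcount > 0);
    mod->refcount--;
}

/* Hypothetical variant of dev_hold() which, unlike the original, can fail. */
bool dev_hold_pinned(struct net_device *dev)
{
    if (!try_module_get(dev->owner))
        return false;
    dev->refcnt++;
    return true;
}

void dev_put_pinned(struct net_device *dev)
{
    assert(dev->refcnt > 0);
    dev->refcnt--;
    module_put(dev->owner);
}

/* One of many objects that cache a device pointer (hypothetical). */
struct route_entry {
    struct net_device *dev;
};

bool route_set_dev(struct route_entry *rt, struct net_device *dev)
{
    if (!dev_hold_pinned(dev))
        return false;           /* a failure path that did not exist before */
    rt->dev = dev;
    return true;
}
```

Every caller of route_set_dev() now has an error to handle, and the same change would be needed for each kind of object that holds a device pointer, with two extra counter updates on each hold and release.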
The solution that some of the networking developers are asking for is to go
back to having modules maintain their own reference counts - sort of. At
the least, modules need some way of saying whether they can or cannot be
unloaded at a particular time. Usually that decision is just a matter of
looking at the internal objects they have to maintain anyway. So the
addition of a simple can_unload() function to many modules would
solve the immediate problem.
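A can_unload() hook might look something like the following user-space sketch. The structure layout, the unload_module() helper, and the proto_* counters are assumptions made for illustration; the discussion described here did not settle on an interface.

```c
/* User-space sketch of a module-supplied "may I be unloaded?" hook. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct module {
    bool (*can_unload)(void);   /* optional: NULL means "always yes" */
    bool loaded;
};

/* State a protocol module would be tracking anyway (hypothetical). */
int proto_open_sockets;
int proto_cached_routes;

/* The module answers from its own bookkeeping, with no per-object
 * module reference counts involved. */
bool proto_can_unload(void)
{
    return proto_open_sockets == 0 && proto_cached_routes == 0;
}

/* Hypothetical core-side unload path: 0 on success, -1 if busy. */
int unload_module(struct module *mod)
{
    assert(mod != NULL);
    if (mod->can_unload != NULL && !mod->can_unload())
        return -1;              /* module says it is still in use */
    mod->loaded = false;
    return 0;
}
```

The fast paths stay untouched because the question is only asked at unload time. The sketch does nothing about the window between the check and the actual removal, which is the kind of race that in-module counting was subject to in the first place.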
There is still another problem, though: actually getting a complex module
to a state where it can be unloaded can be a tricky task. Removing a
network protocol, for example, requires shutting down the protocol and
waiting for all objects to be freed. Little details (like sockets which
must, according to the protocol specification, sit in a 60-second
TIME_WAIT state before going away) complicate the picture and can
make the unload process take a long time. Users tend to worry when an
rmmod command appears to just hang. Handling all the details of
that case (especially if you want to allow users to interrupt the
rmmod operation) gets to be tricky indeed. Possible solutions are
being discussed, but no implementations are currently on the horizon.
Of course, one could always go back to Rusty's suggestion from the 2002
Kernel Summit: simply do not allow modules to be removed from the kernel.