One of the nice (and increasingly important) features of the 2.5 device
model is sysfs. This virtual filesystem exports a view of the system's
structure to user space; it also provides a nice control interface - and
/proc replacement - by allowing attributes to be attached to sysfs
entries. Sysfs is not without its traps, however, and many kernel
developers are just now beginning to realize the sort of care that is
necessary to avoid making mistakes.
The hardware supported by Linux is increasingly dynamic; devices can appear
and disappear at any time. The sysfs filesystem adjusts itself in response
to hardware events by creating and removing directories associated with
devices, classes, and other objects. Kernel code typically implements this
functionality by allocating (and registering) device structures and other
objects when a device is plugged in, and deleting those structures when the
device is removed. It tends to work quite well.
But consider the following possible sequence of events:
- A user plugs in a shiny new hotplug PCI frobnicator.
- The driver creates a device structure and registers it; as a result,
the directory /sys/devices/pci0/00:11.0/ (or some such) gets
created and filled with attributes.
- A user process moves into that directory, opens one of the attribute
files, but doesn't get around to reading it yet.
- The user, having done enough frobnication for one day, unplugs the
device.
- The driver unregisters and frees the device structures.
All seems well, except for the small problem of that user process. By
sitting in the directory, it maintains a reference there. The open
attribute file is yet another reference. If the driver has truly cleaned
up and freed the devices, the user process will be holding structures with
pointers into freed memory. An attempt to read the (already open)
attribute file at this point is almost certain to crash the system.
The above scenario is not hypothetical; a fair number of such conditions
exist in the 2.5 kernel now. That is why this issue (titled "kobject
refcounting") appears in the 2.6 must-fix
list. It truly must be fixed.
The infrastructure exists to handle these problems, but it must be used
properly to be effective. The solution lies in the same place as the
problem - the kobject structure. The 2.5.72 version of this
structure looks like:
    struct kobject {
        char                name[KOBJ_NAME_LEN];
        atomic_t            refcount;
        struct list_head    entry;
        struct kobject      *parent;
        struct kset         *kset;
        struct kobj_type    *ktype;
        struct dentry       *dentry;
    };
Entries in sysfs are closely tied to kobjects; there is a kobject
associated with each directory in the filesystem. When a process moves
into a sysfs directory or opens a sysfs file, the associated kobject has
its refcount field incremented. As long as the reference count is
above zero, the kobject cannot be deleted.
The same kobjects, of course, are embedded deeply within the structures
used to represent devices and other system objects. So a nonzero reference
count in a kobject means that the entire device structure (and, perhaps,
the module infrastructure supporting it) is still in use. Safely putting
things into sysfs is really just a matter of not deleting objects until
their reference counts hit zero.
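The embedding relationship can be sketched in user space. What follows is a toy model, not kernel code: toy_kobject, toy_device, to_toy_device() and toy_device_in_use() are hypothetical names, and toy_container_of() is a simplified stand-in for the kernel's container_of() macro. It shows how code holding only a pointer to the embedded object can get back to the containing structure, which is why a reference on the kobject pins the whole device.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a kobject: only the reference count is modeled. */
struct toy_kobject {
    int refcount;
};

/* Hypothetical device structure with the kobject embedded inside it. */
struct toy_device {
    int id;
    struct toy_kobject kobj;
};

/* Simplified container_of(): step back from a pointer to a member to
 * the structure that contains that member. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the device from a pointer to its embedded kobject. */
static struct toy_device *to_toy_device(struct toy_kobject *kobj)
{
    assert(kobj != NULL);
    return toy_container_of(kobj, struct toy_device, kobj);
}

/* The device may be freed only when nobody references its kobject. */
static int toy_device_in_use(const struct toy_device *dev)
{
    return dev->kobj.refcount > 0;
}
```

Since the two structures share one allocation, there is only one lifetime to manage: freeing the device while toy_device_in_use() is true would pull the kobject out from under whoever holds the reference.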
Of course, that is easily said, but the current mechanism for implementing
such a policy is not entirely obvious. An example might help, so we'll
look at the block subsystem, which does things right. Disks, within the
kernel, are represented by the gendisk
structure. The function used to create a gendisk is
alloc_disk(), which, after allocating and initializing a
gendisk structure (which contains a kobject), executes this
mysterious line of code:
    kobj_set_kset_s(disk,block_subsys);
This line tweaks the kobject within disk (the gendisk
structure) to make it a part of block_subsys. The block subsystem
structure, in turn, contains a pointer to a kobj_type structure,
which, in this case, looks like:
    static struct kobj_type ktype_block = {
        .release       = disk_release,
        .sysfs_ops     = &disk_sysfs_ops,
        .default_attrs = default_attrs,
    };
We'll come back to this structure in a moment. For now, suffice to say
that it identifies the kobject (and the gendisk structure that
contains it) as something belonging to the block code, and provides some
methods implementing the object's operations.
The function which puts a new disk into the system is add_disk();
it creates the associated sysfs structure, and increments the disk's
reference count. The disk then goes through its lifecycle, with the
reference count going up and down as it is mounted and unmounted, and as
its sysfs files are accessed. Should the disk disappear, the driver will
do some cleanup and call del_gendisk() to return the
gendisk structure to the system.
del_gendisk() does not actually free the structures, however. It
removes the sysfs entries and generally shuts things down; it then finishes
by decrementing the reference count. That operation releases the reference
which was first obtained in add_disk(). The driver also must
release its own reference with put_disk(). These operations may
drop the reference count to zero - if nobody else is holding a reference to
the disk. But there is no way to know ahead of time.
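The reference flow described above can be modeled with a plain counter. This is a user-space sketch under simplifying assumptions: the toy_* functions are hypothetical stand-ins named after the kernel functions they imitate, the initial count of one (the driver's own reference) is an assumption of the model, a plain int replaces the kernel's atomic_t, and none of the real sysfs work is performed.

```c
#include <assert.h>

/* Toy disk: just a reference count and a flag for sysfs visibility. */
struct toy_disk {
    int refcount;
    int in_sysfs;
};

/* Allocation hands the driver its own reference (assumed here). */
static void toy_alloc_disk(struct toy_disk *d)
{
    d->refcount = 1;
    d->in_sysfs = 0;
}

/* Take one more reference, e.g. when a sysfs file is opened. */
static void toy_get(struct toy_disk *d)
{
    d->refcount++;
}

/* Drop one reference; returns 1 if that was the last one. */
static int toy_put(struct toy_disk *d)
{
    assert(d->refcount > 0);
    d->refcount--;
    return d->refcount == 0;
}

/* Models add_disk(): make the disk visible and take a reference. */
static void toy_add_disk(struct toy_disk *d)
{
    d->in_sysfs = 1;
    toy_get(d);
}

/* Models del_gendisk(): hide the disk, drop the add_disk() reference. */
static int toy_del_gendisk(struct toy_disk *d)
{
    d->in_sysfs = 0;
    return toy_put(d);
}

/* Models put_disk(): the driver gives up its own reference. */
static int toy_put_disk(struct toy_disk *d)
{
    return toy_put(d);
}
```

If a user process still holds an open attribute file (one extra toy_get()), both toy_del_gendisk() and toy_put_disk() return 0 in this model; the toy_put() made when the file is finally closed is the one that returns 1.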
Sooner or later, however, the last reference will go away. The function
which actually decrements the count (kobject_put()) tests that
count for zero. If no references remain, kobject_put() will go
back to the kobj_type structure associated with the kobject
(the ktype_block we saw above, in the case of a gendisk) and
call the release() method found there. That method, knowing that
nobody is referring to the object, can actually remove it from the system.
That is how sysfs objects must be managed. They must have a
destructor associated with them, by way of the kobj_type
structure, and that destructor must understand the higher-level objects
that it is dealing with. With this mechanism in place, objects will
continue to exist as long as references to them are held.
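A minimal user-space sketch of that destructor pattern follows. All of the names (toy_kobject, toy_ktype, toy_disk and the functions operating on them) are hypothetical, the counter is a plain int instead of an atomic_t, and the logic is a simplified imitation of the behavior described for kobject_put(), not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_kobject;

/* Toy counterpart of kobj_type: only the release() method is modeled. */
struct toy_ktype {
    void (*release)(struct toy_kobject *kobj);
};

struct toy_kobject {
    int refcount;
    const struct toy_ktype *ktype;
};

/* Higher-level object with the kobject embedded in it. */
struct toy_disk {
    int capacity;
    struct toy_kobject kobj;
};

/* Counts release() calls so the behavior can be observed. */
static int toy_release_calls;

/* The destructor knows the containing type, so it can free all of it. */
static void toy_disk_release(struct toy_kobject *kobj)
{
    struct toy_disk *disk =
        (struct toy_disk *)((char *)kobj - offsetof(struct toy_disk, kobj));

    toy_release_calls++;
    free(disk);
}

static const struct toy_ktype toy_ktype_block = {
    .release = toy_disk_release,
};

/* Allocate a disk holding one reference, owned by the caller. */
static struct toy_disk *toy_alloc_disk(void)
{
    struct toy_disk *disk = calloc(1, sizeof(*disk));

    if (disk) {
        disk->kobj.refcount = 1;
        disk->kobj.ktype = &toy_ktype_block;
    }
    return disk;
}

static void toy_kobject_get(struct toy_kobject *kobj)
{
    kobj->refcount++;
}

/* Drop a reference; the last put invokes the type's release() method. */
static void toy_kobject_put(struct toy_kobject *kobj)
{
    assert(kobj->refcount > 0);
    if (--kobj->refcount == 0 && kobj->ktype && kobj->ktype->release)
        kobj->ktype->release(kobj);
}
```

The point of the indirection is that toy_kobject_put() never needs to know what kind of object it is releasing; only the release() method does. Callers, in turn, never call free() themselves and must not touch the object after dropping their reference.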
Of course, things can get more complicated than that. If, for example, a
module adds attributes to sysfs entries, that module cannot be removed
until it is certain that all of the relevant references have gone away.
It gets even worse if kernel code tries to attach attributes to objects
which it does not own; in that case it can be very hard to get everything
right. It may eventually prove necessary to rework some of the sysfs
interfaces to make it easier to avoid mistakes, but that seems unlikely for
2.5 at this point. In the meantime, connecting the pieces together
correctly can be an intimidating task the first time around, but the
alternative is to put denial of service vulnerabilities into the kernel.