Avoiding sysfs surprises

One of the nice (and increasingly important) features of the 2.5 device model is sysfs. This virtual filesystem exports a view of the system's structure to user space; it also provides a nice control interface - and /proc replacement - by allowing attributes to be attached to sysfs entries. Sysfs is not without its traps, however, and many kernel developers are just now beginning to realize the sort of care that is necessary to avoid making mistakes.

The hardware supported by Linux is increasingly dynamic; devices can appear and disappear at any time. The sysfs filesystem adjusts itself in response to hardware events by creating and removing directories associated with devices, classes, and other objects. Kernel code typically implements this functionality by allocating (and registering) device structures and other objects when a device is plugged in, and deleting those structures when the device is removed. It tends to work quite well.

But consider the following possible sequence of events:

  1. A user plugs in a shiny new hotplug PCI frobnicator.

  2. The driver creates a device structure and registers it; as a result, the directory /sys/devices/pci0/00:11.0/ (or some such) gets created and filled with attributes.

  3. A user process moves into that directory, opens one of the attribute files, but doesn't get around to reading it yet.

  4. The user, having done enough frobnication for one day, unplugs the device.

  5. The driver unregisters and frees the device structures.

All seems well, except for the small problem of that user process. By sitting in the directory, it maintains a reference there. The open attribute file is yet another reference. If the driver has truly cleaned up and freed the devices, the user process will be holding structures with pointers into freed memory. An attempt to read the (already open) attribute file at this point is almost certain to crash the system.

The above scenario is not hypothetical; a fair number of such conditions exist in the 2.5 kernel now. That is why this issue (titled "kobject refcounting") appears in the 2.6 must-fix list. It truly must be fixed.

The infrastructure exists to handle these problems, but it must be used properly to be effective. The solution lies in the same place as the problem - the kobject structure. The 2.5.72 version of this structure looks like:

struct kobject {
	char			name[KOBJ_NAME_LEN];
	atomic_t		refcount;
	struct list_head	entry;
	struct kobject		*parent;
	struct kset		*kset;
	struct kobj_type	*ktype;
	struct dentry		*dentry;
};

Entries in sysfs are closely tied to kobjects; there is a kobject associated with each directory in the filesystem. When a process moves into a sysfs directory or opens a sysfs file, the associated kobject has its refcount field incremented. As long as the reference count is above zero, the kobject cannot be deleted.

The same kobjects, of course, are embedded deeply within the structures used to represent devices and other system objects. So a nonzero reference count in a kobject means that the entire device structure (and, perhaps, the module infrastructure supporting it) is still in use. Safely putting things into sysfs is really just a matter of not deleting objects until their reference counts hit zero.
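To make the embedding concrete, here is a minimal user-space sketch of how code holding only a pointer to an embedded kobject can find its way back to the enclosing device structure. Everything in it is an assumption made for illustration: struct sketch_kobject, struct frob_device, and the helper names are hypothetical stand-ins, not kernel definitions, and a plain int replaces the kernel's atomic_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's kobject; only the counter is modeled. */
struct sketch_kobject {
	int refcount;
};

/* Hypothetical device structure with the kobject embedded in the middle. */
struct frob_device {
	int irq;
	struct sketch_kobject kobj;
	char label[16];
};

/* Recover the containing structure from a pointer to one of its members. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Map an embedded kobject back to the device that contains it. */
static struct frob_device *to_frob_device(struct sketch_kobject *kobj)
{
	assert(kobj != NULL);
	return sketch_container_of(kobj, struct frob_device, kobj);
}

/* While anybody references the kobject, the whole device must stay allocated. */
static int frob_device_in_use(const struct sketch_kobject *kobj)
{
	return kobj->refcount > 0;
}
```

Because the kobject lives inside the device structure, freeing the device while frob_device_in_use() is still true would also free the memory that the remaining reference holders point to.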

Of course, that is easily said, but the current mechanism for implementing such a policy is not entirely obvious. An example might help, so we'll look at the block subsystem, which does things right. Disks, within the kernel, are represented by the gendisk structure. The function used to create a gendisk is alloc_disk(), which, after allocating and initializing a gendisk structure (which contains a kobject), executes this mysterious line of code:

    kobj_set_kset_s(disk,block_subsys);

This line tweaks the kobject within disk (the gendisk structure) to make it a part of block_subsys. The block subsystem structure, in turn, contains a pointer to a kobj_type structure, which, in this case, looks like:

static struct kobj_type ktype_block = {
	.release	= disk_release,
	.sysfs_ops	= &disk_sysfs_ops,
	.default_attrs	= default_attrs,
};

We'll come back to this structure in a moment. For now, suffice it to say that it identifies the kobject (and the gendisk structure that contains it) as something belonging to the block code, and provides some methods implementing the object's operations.

The function which puts a new disk into the system is add_disk(); it creates the associated sysfs structure, and increments the disk's reference count. The disk then goes through its lifecycle, with the reference count going up and down as it is mounted and unmounted, and as its sysfs files are accessed. Should the disk disappear, the driver will do some cleanup and call del_gendisk() to return the gendisk structure to the system.

del_gendisk() does not actually free the structures, however. It removes the sysfs entries and generally shuts things down; it then finishes by decrementing the reference count. That operation releases the reference which was first obtained in add_disk(). The driver also must release its own reference with put_disk(). These operations may drop the reference count to zero - if nobody else is holding a reference to the disk. But there is no way to know ahead of time.
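The reference counting described above can be followed with a small user-space model. This is only a sketch under stated assumptions: struct disk_model and the model_*() functions are hypothetical names that mirror the roles of alloc_disk(), add_disk(), del_gendisk(), and put_disk() as the text describes them; the model assumes the driver holds one reference from allocation time, and it uses a plain int where the kernel uses an atomic_t.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a disk: a counter plus flags that record its state. */
struct disk_model {
	int refcount;
	bool in_sysfs;
	bool released;
};

/* Take one reference, as a mount or an open sysfs file would. */
static void model_get(struct disk_model *d)
{
	assert(!d->released);
	d->refcount++;
}

/* Drop one reference; the last holder triggers the (modeled) destructor. */
static void model_put(struct disk_model *d)
{
	assert(d->refcount > 0);
	if (--d->refcount == 0)
		d->released = true;	/* stands in for the release() method */
}

/* Role of alloc_disk(): the driver starts out holding one reference. */
static void model_alloc_disk(struct disk_model *d)
{
	d->refcount = 1;
	d->in_sysfs = false;
	d->released = false;
}

/* Role of add_disk(): create the sysfs entries and take a reference. */
static void model_add_disk(struct disk_model *d)
{
	d->in_sysfs = true;
	model_get(d);
}

/* Role of del_gendisk(): remove the sysfs entries, drop add_disk()'s reference. */
static void model_del_gendisk(struct disk_model *d)
{
	d->in_sysfs = false;
	model_put(d);
}

/* Role of put_disk(): drop the driver's own reference. */
static void model_put_disk(struct disk_model *d)
{
	model_put(d);
}
```

In this model, if a user process calls model_get() (an open attribute file) before the driver runs model_del_gendisk() and model_put_disk(), the count is still one afterward and released stays false; only the user's final model_put() brings the count to zero.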

Sooner or later, however, the last reference will go away. The function which actually decrements the count (kobject_put()) tests that count for zero. If no references remain, kobject_put() will go back to the kobj_type structure associated with the kobject (the ktype_block we saw above, in the case of a gendisk) and call the release() method found there. That method, knowing that nobody is referring to the object, can actually remove it from the system.

That is how sysfs objects must be managed. They must have a destructor associated with them, by way of the kobj_type structure, and that destructor must understand the higher-level objects that it is dealing with. With this mechanism in place, objects will continue to exist as long as references to them are held.
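As a rough illustration of that arrangement, the following user-space sketch wires a destructor to an object through a type structure and calls it when the last reference is dropped. It is not the kernel's implementation: every sketch_* name is hypothetical, the counter is a plain int rather than an atomic_t, and malloc-style allocation stands in for whatever the real subsystem uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_kobject;

/* Per-type operations; only the destructor is modeled here. */
struct sketch_ktype {
	void (*release)(struct sketch_kobject *kobj);
};

struct sketch_kobject {
	int refcount;
	const struct sketch_ktype *ktype;
};

/* Hypothetical higher-level object embedding the kobject. */
struct sketch_gendisk {
	int minors;
	struct sketch_kobject kobj;
};

static int sketch_disks_released;	/* lets callers observe the destructor */

/* Destructor: it knows that this kobject lives inside a sketch_gendisk. */
static void sketch_disk_release(struct sketch_kobject *kobj)
{
	struct sketch_gendisk *disk = (struct sketch_gendisk *)
		((char *)kobj - offsetof(struct sketch_gendisk, kobj));

	sketch_disks_released++;
	free(disk);
}

static const struct sketch_ktype sketch_ktype_block = {
	.release = sketch_disk_release,
};

/* Allocate a disk whose kobject starts with the creator's reference. */
static struct sketch_gendisk *sketch_disk_create(int minors)
{
	struct sketch_gendisk *disk = calloc(1, sizeof(*disk));

	if (!disk)
		return NULL;
	disk->minors = minors;
	disk->kobj.refcount = 1;
	disk->kobj.ktype = &sketch_ktype_block;
	return disk;
}

static struct sketch_kobject *sketch_kobject_get(struct sketch_kobject *kobj)
{
	kobj->refcount++;
	return kobj;
}

/* Drop one reference; the last one out calls the type's release(). */
static void sketch_kobject_put(struct sketch_kobject *kobj)
{
	assert(kobj->refcount > 0);
	if (--kobj->refcount == 0 && kobj->ktype && kobj->ktype->release)
		kobj->ktype->release(kobj);
}
```

The point of the design is that the code dropping a reference never frees anything itself. Whoever happens to call sketch_kobject_put() last ends up running the destructor, so neither the creator nor any later user needs to know in advance who that will be.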

Of course, things can get more complicated than that. If, for example, a module adds attributes to sysfs entries, that module cannot be removed until it is certain that all of the relevant references have gone away. It gets even worse if kernel code tries to attach attributes to objects which it does not own; in that case it can be very hard to get everything right. It may eventually prove necessary to rework some of the sysfs interfaces to make it easier to avoid mistakes, but that seems unlikely for 2.5 at this point. In the meantime, connecting the pieces together correctly can be an intimidating task the first time around, but the alternative is to put denial-of-service vulnerabilities into the kernel.


Garbage collection

Posted Jun 21, 2003 16:40 UTC (Sat) by rwmj (guest, #5474) [Link]

Looks like a good argument for garbage collection to me.

Rich.

Avoiding sysfs surprises

Posted Sep 25, 2003 14:58 UTC (Thu) by stripes (guest, #15431) [Link]

All seems well, except for the small problem of that user process. By sitting in the directory, it maintains a reference there. The open attribute file is yet another reference. If the driver has truly cleaned up and freed the devices, the user process will be holding structures with pointers into freed memory. An attempt to read the (already open) attribute file at this point is almost certain to crash the system.

There is another way this can be handled. In BSD at least (and I assume Linux) "umount -f" forcibly unmounts a filesystem even when files are still open. The open file handles have their in-memory VNODEs replaced with VNODEs from the "dead file system", which returns an error for every attempt to use it except for close. That is also done with ttys/ptys that have hangup called for them, and in some other places where file handles (or VNODE-backed memory segments) become invalid (and there is a command line utility to revoke access in case there is a wee security lapse...).

It is a useful concept, and a lot simpler than reference counting. In a few cases it is "more correct" as well. In this case, for things like hot-pluggable devices, it seems like the right thing. For other stuff that might show up on sysfs, maybe not (should a process's memory image stay around after the process is kill -9'ed just because a debugger has the image open via sysfs?).


Copyright © 2003, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds