Kernel development

Brief items

Kernel release status

The current development kernel is 2.6.0-test8, which was released by Linus on October 17. This patch includes a working NFS direct I/O implementation, a workaround for the Athlon prefetch bug, various architecture updates, working signal handling for kernel threads, an ALSA update, some software suspend work, and numerous other fixes. The long-format changelog has the details.

Linus's BitKeeper repository is full of stability fixes, as is appropriate for his current goal of getting 2.6.0 in shape. It also includes an SGI Altix serial console driver and Jeff Garzik's libata driver (covered here last August).

The current stable kernel is 2.4.22; Marcelo has not released any 2.4.23 prepatches since 2.4.23-pre7 on October 9.

Kernel development news

The unfinished SCSI job

The repository for SCSI patches has just been forked into two separate trees. One of them is a bugfix-only repository, with its contents meant to get past Linus's "stability fixes only" filter and into the 2.6.0-test kernel. The other is for everything else, which will be held for 2.7, or, at least, a post-2.6.0 release.

This change brought out the question: what about expanding the number of SCSI disks (and partitions) that can be supported by the kernel? That was, after all, one of the reasons for expanding the dev_t type in the first place. The larger device numbers are now in place, but there are no patches in the mainline to make more SCSI disks available.

There are, as it turns out, a few remaining issues that must be addressed before the SCSI expansion can be completed. One of those is naming. Currently, the first 26 SCSI drives are called sda through sdz. Then a second letter is added, making sdaa through sdzz available. The default plan seems to be to go to sdaaa thereafter, and sdaaaa if need be.
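The naming progression described above amounts to counting in a base-26 system that has no zero digit. A minimal sketch of that scheme follows; sd_disk_name() is a hypothetical helper written for this illustration and is not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: sd_disk_name() is a hypothetical helper, not a
 * function from the kernel source.  It formats the name of the SCSI
 * disk with the given zero-based index: 0 -> "sda", 25 -> "sdz",
 * 26 -> "sdaa", 701 -> "sdzz", 702 -> "sdaaa".
 * Returns 0 on success, or -1 if buf is too small.
 */
int sd_disk_name(unsigned int index, char *buf, size_t buflen)
{
    char letters[16];   /* suffix letters, least significant first */
    size_t n = 0, i;

    assert(buf != NULL);

    /*
     * This is "bijective base 26": there is no zero digit, so one is
     * subtracted each time we carry into the next position.
     */
    for (;;) {
        letters[n++] = (char)('a' + index % 26);
        index /= 26;
        if (index == 0)
            break;
        index--;
    }

    /* Room is needed for "sd", the letters, and the trailing NUL. */
    if (buflen < 2 + n + 1)
        return -1;

    memcpy(buf, "sd", 2);
    for (i = 0; i < n; i++)
        buf[2 + i] = letters[n - 1 - i];
    buf[2 + n] = '\0';
    return 0;
}
```

Two letters cover 26 + 676 = 702 drives, so the first three-letter name would belong to drive number 702 (counting from zero).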

Is the number of partitions per drive to be expanded? The current limit of fifteen is apparently constraining to some. As a result, there has been persistent talk of raising the limit to 63. That change, however, would create interesting numbering challenges. The current numbering scheme divides the (eight-bit) minor number in half; the upper nibble is the drive number, and the lower nibble is the partition number. To support more partitions, the portion of the (now 20-bit) minor number dedicated to the partition number would have to be expanded. A naive implementation would simply remap the minor number so that bits 0..5 describe the partition, and bits 6..19 the drive number.
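As a rough sketch, the two layouts just described can be written as a pair of encoding functions. The function names are invented for this example; only the bit positions come from the description above.

```c
#include <assert.h>

/*
 * Current layout: an eight-bit minor number, with the drive in the
 * upper nibble and the partition in the lower nibble.
 * (Hypothetical helper name, for illustration only.)
 */
unsigned int sd_old_minor(unsigned int drive, unsigned int part)
{
    assert(drive < 16 && part < 16);
    return (drive << 4) | part;
}

/*
 * Naive expanded layout: a 20-bit minor number, with the partition
 * in bits 0..5 and the drive in bits 6..19.
 * (Hypothetical helper name, for illustration only.)
 */
unsigned int sd_naive_minor(unsigned int drive, unsigned int part)
{
    assert(drive < (1u << 14) && part < 64);
    return (drive << 6) | part;
}
```

Under the current layout, the second drive (drive 1, partition 0) has minor number 16; under the naive layout the same drive would move to minor number 64, and minor number 16 would instead name partition 16 of the first drive.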

The only problem with that approach is that it would break all existing SCSI device nodes. The kernel hackers have a sense that they might get a complaint or two if they did that, so they are fairly strongly committed to ensuring that old device numbers continue to work. As a result, there have been proposals for more complicated schemes, with the two new partition bits being placed, for example, up at the high end of the minor number. This approach would put an end to the manual creation of device nodes for large SCSI devices - who wants to figure out what number to give to mknod? - but there was not likely to be much of that going on anyway.
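To make the idea concrete, here is one possible backward-compatible layout, sketched under assumptions: the exact bit positions are invented for this example and do not correspond to any specific proposal. The low eight bits keep their current meaning, the extra drive bits occupy bits 8..17, and the two new partition bits sit in bits 18..19.

```c
#include <assert.h>

/*
 * Hypothetical compatible layout for a 20-bit minor number:
 *   bits  0..3   low four bits of the partition number
 *   bits  4..7   low four bits of the drive number
 *   bits  8..17  remaining ten bits of the drive number
 *   bits 18..19  the two new high bits of the partition number
 */
unsigned int sd_compat_minor(unsigned int drive, unsigned int part)
{
    assert(drive < (1u << 14) && part < 64);
    return (part & 0xf)
         | ((drive & 0xf) << 4)
         | ((drive >> 4) << 8)
         | ((part >> 4) << 18);
}

/* Reverse the encoding above. */
void sd_compat_decode(unsigned int minor,
                      unsigned int *drive, unsigned int *part)
{
    assert(drive != 0 && part != 0);
    *part = (minor & 0xf) | (((minor >> 18) & 0x3) << 4);
    *drive = ((minor >> 4) & 0xf) | (((minor >> 8) & 0x3ff) << 4);
}
```

With this arrangement every existing device node (drives 0 through 15, partitions 0 through 15) keeps its old minor number, but the number for, say, partition 16 of the first drive becomes 262144, which is not something anybody would want to work out by hand.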

A better long-term approach might be to go to one or more completely new major numbers for SCSI drives. The block layer could then assign numbers dynamically as the drives are discovered, with a tool like udev creating device nodes on demand. For sites that need old numbers to work, a small compatibility module could map between the old and new numbers at device open time. That is all certainly 2.7 material, however. For 2.6.0, the most likely scenario might be the merging of a simple patch (like Badari Pulavarty's patch found in the -mm tree) which expands the number of disks supported in a relatively unintrusive way. The complete solution can come later.

The cpuset mechanism

A set of patches has been making the rounds for the last month or so which implements a concept known as a "cpuset." A cpuset is simply an arbitrary collection of processors in an SMP system; cpusets can be used to partition a large system into smaller virtual machines in a flexible sort of way. This patch was originally posted by Simon Derr; more recent versions (found in the "patches" section, below) have been sent out by Stephen Hemminger at OSDL.

Internally, the patch creates a hierarchy of cpusets. At boot time, the root set is created containing all of the system's processors. System calls can then be used to create child sets. The creation of a cpuset is not a privileged task, but no process can expand beyond the set of processors initially assigned to it. Thus, for example, the system administrator can create a cpuset for a particular group of processes which will be confined to the designated processors. Those processes can, however, further partition the set for their own purposes.

In normal use, one would expect cpusets to correspond to the underlying hardware; all processors in a set would normally be part of the same NUMA node, for example. There is nothing in the patch that requires users to do things that way, however; cpusets can be any arbitrary subset of the available processors. Processors can also belong to multiple cpusets, so cpusets can overlap each other in arbitrary ways. There is, however, a "strict" flag which can be set to disallow the sharing of processors in this way.

There are a few new system calls created by this patch:

cpuset_create(): Creates a new cpuset as a child of the process's current cpuset, containing the same processors as the parent.
cpuset_destroy(): Destroys the given cpuset.
cpuset_attach(): Attaches a process to a particular cpuset.
cpuset_alloc(): Changes the set of processors belonging to a cpuset. The name of this call is a little misleading, since it can release processors from a cpuset. In fact, removing CPUs will be the normal usage, since a cpuset cannot contain processors which are not also contained in its parent.
cpuset_getfreecpus(): Returns a list of processors which are not part of the current cpuset, but which could be added.
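The parent/child rule behind these calls can be illustrated with a small user-space model. This is not the patch's code or its system call interface (the real argument lists are not shown here); the structure and the cpuset_model_*() names are made up for the example, and the "strict" flag is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* A toy cpuset: bit n of cpus is set if processor n is a member. */
struct cpuset_model {
    const struct cpuset_model *parent;  /* NULL for the root set */
    unsigned long cpus;
};

/* A new child starts out with the same processors as its parent. */
void cpuset_model_create(struct cpuset_model *child,
                         const struct cpuset_model *parent)
{
    assert(child != NULL && parent != NULL);
    child->parent = parent;
    child->cpus = parent->cpus;
}

/*
 * Change the membership of a set.  The request fails (returning -1)
 * if it names a processor that the parent does not contain.
 */
int cpuset_model_alloc(struct cpuset_model *set, unsigned long cpus)
{
    assert(set != NULL);
    if (set->parent != NULL && (cpus & ~set->parent->cpus) != 0)
        return -1;
    set->cpus = cpus;
    return 0;
}

/* Processors which are not in the set, but which could be added. */
unsigned long cpuset_model_getfreecpus(const struct cpuset_model *set)
{
    assert(set != NULL);
    if (set->parent == NULL)
        return 0;
    return set->parent->cpus & ~set->cpus;
}
```

A child created from an eight-processor root starts with all eight; after it is shrunk to the low four, the "free" list is the high four, and any attempt to add a ninth processor fails. A real implementation would also have to deal with existing children when a set shrinks, which this model does not attempt.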

Processes running within a cpuset have no view of the processors which are not contained within that set. Processors in a cpuset are renumbered to appear to be the only processors on the system; thus, for example, system calls like sched_setaffinity() will only bind processes within their particular cpuset.
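One way to picture the renumbering is as a mapping from a set-relative processor number to a physical one. The sketch below assumes that processors are renumbered in ascending order of their physical numbers; that ordering, like the function name, is an assumption made for illustration rather than something taken from the patch.

```c
#include <assert.h>
#include <limits.h>

/*
 * Translate a cpuset-relative ("virtual") processor number into a
 * physical processor number, given the set's membership mask.
 * Returns -1 if the set has no such processor.
 * Hypothetical helper; ascending-order renumbering is assumed.
 */
int cpuset_virt_to_phys(unsigned long cpus, unsigned int virt)
{
    unsigned int phys;

    assert(cpus != 0);
    for (phys = 0; phys < sizeof(cpus) * CHAR_BIT; phys++) {
        if (cpus & (1UL << phys)) {
            if (virt == 0)
                return (int)phys;
            virt--;
        }
    }
    return -1;
}
```

In a set holding physical processors 2, 3, and 5, a process asking for "CPU 2" would, under this assumption, end up on physical processor 5, while a request for "CPU 3" would fail.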

This patch has generated a certain amount of interest in the large-systems community. It clearly does not fall within the 2.6.0-test "stability patches only" mandate, but there may be pressure to get it into the kernel not much after 2.6.0 is released.

Driver porting

kobjects and sysfs

This article is part of the LWN Porting Drivers to 2.5 series.

In The Zen of Kobjects, this series looked at the kobject abstraction and the various interfaces that go with it. That article, however, glossed over one important part of the kobject structure (with a promise to fill it in later): its interface to the sysfs virtual filesystem. The time has come to fulfill that promise and look at how sysfs works at the lower levels.

To use the functions described below, you will need to include both <linux/kobject.h> and <linux/sysfs.h> in your source files.

How kobjects get sysfs entries

As we saw in the previous article, there are two functions which are used to set up a kobject. If you use kobject_init() by itself, you will get a standalone kobject with no representation in sysfs. If, instead, you use kobject_register() (or call kobject_add() separately), a sysfs directory will be created for the kobject; no other effort is required on the programmer's part.

The name of the directory will be the same as the name given to the kobject itself. The location within sysfs will reflect the kobject's position in the hierarchy you have created. In short: the kobject's directory will be found in its parent's directory, as determined by the kobject's parent field. If you have not explicitly set the parent field, but you have set its kset pointer, then the kset will become the kobject's parent. If there is no parent and no kset, the kobject's directory will become a top-level directory within sysfs, which is rarely what you really want.
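The placement rules can be summarized with a small user-space model. Nothing here is kernel code: struct kobj_model and its helpers are stand-ins invented for the example, and the kset is represented simply by a pointer to the kobject embedded within it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the fields of a kobject that determine its location. */
struct kobj_model {
    const char *name;
    const struct kobj_model *parent;  /* explicit parent, or NULL */
    const struct kobj_model *kset;    /* the kset's own kobject, or NULL */
};

/* The explicit parent wins; otherwise the kset becomes the parent. */
const struct kobj_model *kobj_model_parent(const struct kobj_model *k)
{
    assert(k != NULL);
    if (k->parent != NULL)
        return k->parent;
    return k->kset;     /* NULL here means "top-level directory" */
}

/*
 * Build the path of k's directory, relative to the sysfs mount point.
 * Returns the length of the path, or -1 if buf is too small.
 */
int kobj_model_path(const struct kobj_model *k, char *buf, size_t len)
{
    const struct kobj_model *up = kobj_model_parent(k);
    size_t used = 0, n = strlen(k->name);

    if (up != NULL) {
        int r = kobj_model_path(up, buf, len);
        if (r < 0)
            return -1;
        used = (size_t)r;
    }
    /* Need room for '/', the name, and the trailing NUL. */
    if (used + 1 + n + 1 > len)
        return -1;
    buf[used] = '/';
    memcpy(buf + used + 1, k->name, n + 1);
    return (int)(used + 1 + n);
}
```

An object named "dev0" with no explicit parent, but with a kset whose kobject is "mybus" (itself a child of a top-level "bus"), would thus show up as /bus/mybus/dev0 under the sysfs mount point.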

Populating a kobject's directory

Getting a sysfs directory corresponding to a kobject is easy, as we have seen. That directory will be empty, however, which is not particularly useful. Most applications will want the kobject's sysfs entry to contain one or more attributes with useful information. Creating those attributes requires some additional steps, but is not all that hard.

The key to sysfs attributes is the kobject's kobj_type pointer. When we looked at kobject types before, we passed over a couple of sysfs-related entries. One, called default_attrs, describes the attributes that all kobjects of this type should have; it is a pointer to an array of pointers to attribute structures:

    struct attribute {
        char            *name;
        struct module   *owner;
        mode_t          mode;
    };

In this structure, name is the name of the attribute (as it will appear within sysfs), owner is a pointer to the module (if any) which is responsible for the implementation of this attribute, and mode is the protection bits which are to be applied to this attribute. The mode is usually S_IRUGO for read-only attributes; if the attribute is writable, you can toss in S_IWUSR to give write access to root only. The last entry in the default_attrs list must be NULL.
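As an illustration, a default attribute list for an imaginary temperature sensor might be laid out as follows. The sketch is written so that it compiles in user space: struct attribute is copied from the definition above, S_IRUGO is defined by hand, and the owner field is left NULL where a module would normally supply THIS_MODULE. The sensor and its attribute names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

struct module;                  /* opaque stand-in for the kernel type */

struct attribute {              /* as shown in the article */
    char            *name;
    struct module   *owner;
    mode_t          mode;
};

#ifndef S_IRUGO                 /* the kernel provides this itself */
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

/* A read-only attribute and one which root may also write. */
struct attribute sensor_temp_attr  = { "temperature", NULL, S_IRUGO };
struct attribute sensor_limit_attr = { "limit", NULL, S_IRUGO | S_IWUSR };

/* The list that a kobj_type's default_attrs field would point to. */
struct attribute *sensor_default_attrs[] = {
    &sensor_temp_attr,
    &sensor_limit_attr,
    NULL,                       /* the terminating entry is required */
};

/* Count the entries in a NULL-terminated attribute list. */
size_t count_attrs(struct attribute **attrs)
{
    size_t n = 0;

    assert(attrs != NULL);
    while (attrs[n] != NULL)
        n++;
    return n;
}
```

The resulting modes are 0444 for temperature and 0644 for limit.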

The default_attrs array says what the attributes are, but does not tell sysfs how to actually implement those attributes. That task falls to the kobj_type->sysfs_ops field, which points to a structure defined as:

    struct sysfs_ops {
        ssize_t (*show)(struct kobject *kobj, struct attribute *attr,
                        char *buffer);
        ssize_t (*store)(struct kobject *kobj, struct attribute *attr,
                         const char *buffer, size_t size);
    };

These functions will be called for each read and write operation, respectively, on an attribute of a kobject of the given type. In each case, kobj is the kobject whose attribute is being accessed, attr is the struct attribute for the specific attribute, and buffer is a one-page buffer for attribute data.

The show() function should encode the attribute's full value into buffer, being sure not to overrun PAGE_SIZE. Remember that the sysfs convention requires that attributes contain single values or, at most, an array of similar values, so the one-page limit should never be a problem. The return value is, of course, the number of bytes of data actually put into buffer or a negative error code.
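A show() function along these lines might look like the following sketch. It is arranged to compile outside the kernel, so struct kobject, struct attribute, container_of(), and the page size are simplified stand-ins; the sensor structure is, once again, imaginary.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

#define MODEL_PAGE_SIZE 4096    /* stand-in for the kernel's PAGE_SIZE */

struct kobject { const char *name; };               /* stand-in */
struct attribute { char *name; mode_t mode; };      /* stand-in */

/* The kernel supplies container_of(); this is an equivalent. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* A hypothetical object with an embedded kobject. */
struct sensor {
    struct kobject kobj;
    int temperature;
};

/*
 * Encode the attribute's whole value into the one-page buffer.
 * Returns the number of bytes written, or a negative error code.
 */
ssize_t sensor_show(struct kobject *kobj, struct attribute *attr,
                    char *buffer)
{
    struct sensor *s = container_of(kobj, struct sensor, kobj);
    int n;

    (void)attr;                 /* this type has only one attribute */
    assert(buffer != NULL);

    /* snprintf() will never write past the end of the page. */
    n = snprintf(buffer, MODEL_PAGE_SIZE, "%d\n", s->temperature);
    if (n < 0 || n >= MODEL_PAGE_SIZE)
        return -EINVAL;
    return n;
}
```

For a sensor reading 42, the function puts "42" and a newline into the buffer and returns 3.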

The store() function has a similar interface; the additional size parameter gives the length of the data received from user space. Never forget that buffer contains unchecked, user-supplied data; treat it carefully and be sure that it fits whatever format you require. The return value should normally be the same as size, unless something has gone wrong.
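The sketch below shows the sort of caution that is called for. It is a user-space approximation: the structures are stand-ins, strtol() takes the place of whatever parsing helper kernel code would use, and the sensor with its 0..150 limit range is invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct kobject { const char *name; };               /* stand-in */
struct attribute { char *name; mode_t mode; };      /* stand-in */

/* The kernel supplies container_of(); this is an equivalent. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* A hypothetical object with one writable value. */
struct sensor {
    struct kobject kobj;
    int limit;
};

/*
 * Parse a decimal limit from user-supplied data.  Returns size on
 * success, or -EINVAL (leaving the old value alone) on bad input.
 */
ssize_t sensor_store(struct kobject *kobj, struct attribute *attr,
                     const char *buffer, size_t size)
{
    struct sensor *s = container_of(kobj, struct sensor, kobj);
    char tmp[16];
    char *end;
    long val;

    (void)attr;
    assert(buffer != NULL);

    /* Bound the input, and do not assume it is NUL-terminated. */
    if (size == 0 || size >= sizeof(tmp))
        return -EINVAL;
    memcpy(tmp, buffer, size);
    tmp[size] = '\0';

    errno = 0;
    val = strtol(tmp, &end, 10);
    if (end == tmp || errno == ERANGE)
        return -EINVAL;         /* no digits, or out of range */
    if (*end == '\n')
        end++;                  /* tolerate one trailing newline */
    if (*end != '\0')
        return -EINVAL;         /* trailing junk */
    if (val < 0 || val > 150)
        return -EINVAL;         /* outside the range we accept */

    s->limit = (int)val;
    return (ssize_t)size;
}
```

Writing "75" followed by a newline sets the limit and returns 3; input such as "abc" or "999" is rejected without touching the stored value.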

As you can see, sysfs requires the use of a single set of show() and store() functions for all attributes of kobjects of the same type. Those functions will, usually, maintain their own array of attribute information to enable them to find the real function charged with implementing each attribute.
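One common way of handling that dispatch, as an alternative to a lookup array, is to embed the attribute structure inside a larger, type-specific structure which carries its own method pointers. The following user-space sketch shows the idea; the structures are simplified stand-ins, and the sensor type, its attributes, and the choice of -EIO for a missing method are all assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

#define MODEL_PAGE_SIZE 4096    /* stand-in for the kernel's PAGE_SIZE */

struct kobject { const char *name; };               /* stand-in */
struct attribute { char *name; mode_t mode; };      /* stand-in */

/* The kernel supplies container_of(); this is an equivalent. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct sensor {                 /* hypothetical object type */
    struct kobject kobj;
    int temperature;
    int limit;
};

/* A type-specific attribute wrapping the generic one. */
struct sensor_attribute {
    struct attribute attr;
    ssize_t (*show)(struct sensor *s, char *buffer);
};

static ssize_t temp_show(struct sensor *s, char *buffer)
{
    return snprintf(buffer, MODEL_PAGE_SIZE, "%d\n", s->temperature);
}

static ssize_t limit_show(struct sensor *s, char *buffer)
{
    return snprintf(buffer, MODEL_PAGE_SIZE, "%d\n", s->limit);
}

struct sensor_attribute sensor_temp  = { { "temperature", 0444 }, temp_show };
struct sensor_attribute sensor_limit = { { "limit", 0644 }, limit_show };
struct sensor_attribute sensor_blank = { { "blank", 0444 }, NULL };

/* The single show() method that sysfs_ops would point to. */
ssize_t sensor_attr_show(struct kobject *kobj, struct attribute *attr,
                         char *buffer)
{
    struct sensor *s = container_of(kobj, struct sensor, kobj);
    struct sensor_attribute *sa =
        container_of(attr, struct sensor_attribute, attr);

    assert(buffer != NULL);
    if (sa->show == NULL)
        return -EIO;            /* attribute has no read method */
    return sa->show(s, buffer);
}
```

The generic show() method never needs to know the names of the attributes; it simply recovers the containing structure and calls through its function pointer. A store() method would be handled in the same way.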

Non-default attributes

In many cases, the kobject type's default_attrs field describes all of the attributes that kobject will ever have. It does not need to be that way, however; attributes can be added and removed at will. If you wish to add a new attribute to a kobject's sysfs directory, simply fill in an attribute structure and pass it to:

    int sysfs_create_file(struct kobject *kobj, struct attribute *attr);

If all goes well, the file will be created with the name given in the attribute structure and the return value will be zero; otherwise, the usual negative error code is returned.

Note that the same show() and store() functions will be called to implement operations on the new attribute. Before you add a new, non-default attribute to a kobject, you should take whatever steps are necessary to ensure that those functions know how to implement that attribute.

To remove an attribute, call:

    int sysfs_remove_file(struct kobject *kobj, struct attribute *attr);

After the call, the attribute will no longer appear in the kobject's sysfs entry. Do be aware, however, that a user-space process could have an open file descriptor for that attribute, and that show() and store() calls are still possible after the attribute has been removed.

Symbolic links

The sysfs filesystem has the usual tree structure, reflecting the hierarchical organization of the kobjects it represents. The relationships between objects in the kernel are often more complicated than that, however. For example, one sysfs subtree (/sys/devices) represents all of the devices known to the system, while others represent the device drivers. These trees do not, however, represent the relationships between the drivers and the devices they implement. Showing these additional relationships requires extra pointers which, in sysfs, are implemented with symbolic links.

Creating a symbolic link within sysfs is easy:

    int sysfs_create_link(struct kobject *kobj, 
			  struct kobject *target,
			  char *name);

This function will create a link (called name) pointing to target's sysfs entry as an attribute of kobj. It will be a relative link, so it works regardless of where sysfs is mounted on any particular system.

The link will persist even if target is removed from the system. If you are creating symbolic links to other kobjects, you should probably have a way of knowing about changes to those kobjects, or some sort of assurance that the target kobjects will not disappear. The consequences (dead symbolic links within sysfs) are not particularly grave, but they would not do much to create confidence in the proper functioning of the system either.

Symbolic links can be removed with:

    void sysfs_remove_link(struct kobject *kobj, char *name);

Binary attributes

The sysfs conventions call for all attributes to contain a single value in a human-readable text format. That said, there is an occasional, rare need for the creation of attributes which can handle larger chunks of binary data. In the 2.6.0-test kernel, the only use of binary attributes is in the firmware subsystem. When a device requiring firmware is encountered in the system, a user-space program can be started (via the hotplug mechanism); that program then passes the firmware code to the kernel via a binary sysfs attribute. If you are contemplating any other use of binary attributes, you should think carefully and be sure there is no other way to accomplish your objective.

Binary attributes are described with a bin_attribute structure:

    struct bin_attribute {
        struct attribute attr;
        size_t size;
        ssize_t (*read)(struct kobject *kobj, char *buffer,
                        loff_t pos, size_t size);
        ssize_t (*write)(struct kobject *kobj, char *buffer,
                         loff_t pos, size_t size);
    };

Here, attr is an attribute structure giving the name, owner, and permissions for the binary attribute, and size is the maximum size of the binary attribute (or zero if there is no maximum). The read() and write() functions work similarly to the normal char driver equivalents; they can be called multiple times for a single load with a maximum of one page worth of data in each call. There is no way for sysfs to signal the last of a set of write operations, so code implementing a binary attribute must be able to determine that some other way.
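The chunk-at-a-time nature of write() can be seen in the following user-space model, which gathers successive pieces of data into a fixed-size buffer. It is only a sketch: struct fw_model and fw_model_write() are invented names, the function takes the model directly rather than a kobject, off_t stands in for the kernel's loff_t, and the 8192-byte limit is arbitrary.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define FW_MODEL_MAX 8192       /* arbitrary maximum image size */

/* Hypothetical holder for data arriving through a binary attribute. */
struct fw_model {
    unsigned char data[FW_MODEL_MAX];
    size_t len;                 /* end of the data received so far */
};

/*
 * Accept one chunk of at most a page, to be placed at offset pos.
 * Returns the number of bytes accepted, or -EFBIG if the chunk
 * would not fit within the maximum size.
 */
ssize_t fw_model_write(struct fw_model *fw, const char *buffer,
                       off_t pos, size_t size)
{
    assert(fw != NULL && buffer != NULL);

    if (pos < 0 || (size_t)pos > FW_MODEL_MAX ||
        size > FW_MODEL_MAX - (size_t)pos)
        return -EFBIG;

    memcpy(fw->data + pos, buffer, size);
    if ((size_t)pos + size > fw->len)
        fw->len = (size_t)pos + size;
    return (ssize_t)size;
}
```

Note that nothing in this function can tell whether the chunk it has just stored was the last one; the code using the data must learn that by some other means, such as a separate control attribute or a length recorded in the data itself.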

Binary attributes must be created explicitly; they cannot be set up as default attributes. To create a binary attribute, call:

    int sysfs_create_bin_file(struct kobject *kobj, 
			      struct bin_attribute *attr);

Binary attributes can be removed with:

    int sysfs_remove_bin_file(struct kobject *kobj, 
			      struct bin_attribute *attr);

Last notes

This article has described the low-level interface between kobjects and sysfs. Unless you are implementing a new subsystem, however, you are unlikely to work with this interface directly. Each subsystem typically implements its own set of default attributes, and, perhaps, a mechanism for interested code to add new ones. This mechanism is generally a straightforward wrapper around the low-level attribute code, however, so it should look familiar to readers of this page.

Patches and updates

Kernel trees

Linus Torvalds Linux 2.6.0-test8

Andrew Morton 2.6.0-test8-mm1

Architecture-specific

Pallipadi, Venkatesh 0/3 Dynamic cpufreq governor and updates to ACPI P-state driver

Pallipadi, Venkatesh 1/3 Dynamic cpufreq governor and updates to ACPI P-state driver

Pallipadi, Venkatesh 2/3 Dynamic cpufreq governor and updates to ACPI P-state driver

Pallipadi, Venkatesh 3/3 Dynamic cpufreq governor and updates to ACPI P-state driver

Ingo Molnar updated exec-shield patch, 2.4/2.6 -G4

Patrick Gefre Altix console driver

Core kernel code

Nick Piggin Nick's scheduler v16

Nigel Cunningham swsusp 2.0-rc2.

Randy.Dunlap kexec patches for 2.6.0-test8

Stephen Hemminger (1/4) [PATCH] cpuset -- 2.6.0-test8

Stephen Hemminger (2/4) [PATCH] cpuset - Kconfig

Stephen Hemminger (3/4) [PATCH] cpuset - build without CPUSET configured

Stephen Hemminger (4/4) [PATCH] cpuset -- seqfile change

Device drivers

Roman Zippel iSCSI target implementation

Stephen Hemminger NAPI for 8139too

Filesystems and block I/O

David Woodhouse JFFS2 support for NAND flash..

Nikita Danilov new reiser4 snapshot

Nir Tzachar srfs - a new file system.

Benchmarks and bugs

Randy Hron I/O regression after 2.6.0-test5

Miscellaneous

Eli Billauer frandom - fast random generator module

Greg KH udev 003 release

Greg KH udev 004 release

Harald Welte Summary of netfilter developer workshop 2003

Page editor: Jonathan Corbet