Release status
Kernel release status
The current development kernel is 2.6.0-test8, which was
released by Linus on October 17. This patch
includes a working NFS direct I/O implementation, a workaround for the
Athlon prefetch bug, various architecture updates, working signal handling
for kernel threads, an ALSA update, some software suspend work, and
numerous other fixes. The
long-format
changelog has the details.
Linus's BitKeeper repository is full of stability fixes, as is appropriate
for his current goal of getting 2.6.0 in shape. It also includes an SGI
Altix serial console driver and Jeff Garzik's libata driver (covered here last August).
The current stable kernel is 2.4.22; Marcelo has not released any
2.4.23 prepatches since 2.4.23-pre7 on
October 9.
Kernel development news
The unfinished SCSI job
The repository for SCSI patches
has just been
forked into two separate trees. One of them is a bugfix-only
repository, with its contents meant to get past Linus's "stability fixes
only" filter and into the 2.6.0-test kernel. The other is for everything
else, which will be held for 2.7, or, at least, a post-2.6.0 release.
This change brought out the question: what about expanding the number of
SCSI disks (and partitions) that can be supported by the kernel? That was,
after all, one of the reasons for expanding the dev_t type in the
first place. The larger device numbers are now in place, but there are no
patches in the mainline to make more SCSI disks available.
There are, as it turns out, a few remaining
issues that must be addressed before the SCSI expansion can be
completed. One of those is naming. Currently, the first 26 SCSI drives
are called sda through sdz. Then a second letter is
added, making sdaa through sdzz available. The default
plan seems to be to go to sdaaa thereafter, and sdaaaa if
need be.
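The naming progression amounts to counting in a base-26 system that has no zero digit. A minimal user-space sketch of that scheme follows; the helper name sd_disk_name() is invented here and is not taken from any posted patch.

```c
#include <stdio.h>

/*
 * Hypothetical helper: format the name of the SCSI disk with the given
 * zero-based index, following the sda..sdz, sdaa..sdzz, sdaaa..
 * progression described in the text.
 */
static void sd_disk_name(unsigned int index, char *buf, size_t len)
{
    char suffix[8];                 /* seven letters cover any 32-bit index */
    int pos = sizeof(suffix) - 1;
    unsigned long long n = (unsigned long long)index + 1;  /* count from 1 */

    suffix[pos] = '\0';
    while (n > 0 && pos > 0) {
        n--;                        /* shift so that 0..25 map to a..z */
        suffix[--pos] = (char)('a' + (n % 26));
        n /= 26;
    }
    snprintf(buf, len, "sd%s", &suffix[pos]);
}
```

With this helper, index 25 yields sdz, index 26 yields sdaa, and index 702 is the first three-letter name, sdaaa.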
Is the number of partitions per drive to be expanded? The current limit of
fifteen is apparently constraining to some. As a result, there has been
persistent talk of raising the limit to 63.
That change, however, would create interesting numbering challenges. The
current numbering scheme divides the (eight-bit) minor number in half; the
upper nibble is the drive number, and the lower nibble is the partition
number. To support more partitions, the portion of the (now 20-bit) minor
number dedicated to the partition number would have to be expanded. A
naive implementation would simply remap the minor number so that bits 0..5
describe the partition, and bits 6..19 the drive number.
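The two layouts can be written down as a pair of small functions. This is only an illustration of the bit arithmetic described above; the function names are made up for this sketch and do not appear in the kernel.

```c
/* Current layout: eight-bit minor, drive in the upper nibble,
 * partition in the lower nibble. */
static unsigned int sd_minor_old(unsigned int drive, unsigned int part)
{
    return ((drive & 0xf) << 4) | (part & 0xf);
}

/* Naive 20-bit layout: bits 0..5 hold the partition number,
 * bits 6..19 hold the drive number. */
static unsigned int sd_minor_naive(unsigned int drive, unsigned int part)
{
    return ((drive & 0x3fff) << 6) | (part & 0x3f);
}
```

For the first partition on the second drive (drive 1, partition 1), sd_minor_old() gives 17 while sd_minor_naive() gives 65.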
The only problem with that approach is that it would break all existing
SCSI device nodes. The kernel hackers have a sense that they might get a
complaint or two if they did that, so they are fairly strongly committed to
ensuring that old device numbers continue to work. As a result, there have
been proposals for more complicated schemes, with the two new partition
bits being placed, for example, up at the high end of the minor number.
This approach would put an end to the manual creation of device nodes for
large SCSI devices - who wants to figure out what number to give to
mknod? - but there is not likely to be much of that going on
anyway.
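To make the compatible approach concrete, here is a sketch of one possible arrangement, with the two extra partition bits in bits 18 and 19 of the minor number. The exact bit positions and the function names are assumptions chosen for illustration; the proposals under discussion may well differ.

```c
/*
 * Hypothetical compatible layout:
 *   bits 0..3    low four bits of the partition number
 *   bits 4..17   drive number
 *   bits 18..19  the two new, high partition bits
 */
static unsigned int sd_minor_compat(unsigned int drive, unsigned int part)
{
    return ((part & 0x30) << 14) | ((drive & 0x3fff) << 4) | (part & 0xf);
}

/* Recover the drive number from a minor number in this layout. */
static unsigned int sd_compat_drive(unsigned int minor)
{
    return (minor >> 4) & 0x3fff;
}

/* Recover the full six-bit partition number. */
static unsigned int sd_compat_part(unsigned int minor)
{
    return ((minor >> 14) & 0x30) | (minor & 0xf);
}
```

Drive 1, partition 1 still encodes as 17, just as in the eight-bit scheme, so an existing node keeps working. Partition 17 on the same drive becomes 262161, which is not a number anybody will want to type by hand.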
A better long-term approach might be to go to one or more completely new
major numbers for SCSI drives. The block layer could then assign numbers
dynamically as the drives are discovered, with a tool like udev creating
device nodes on demand. For sites that need old numbers to work, a small
compatibility module could map between the old and new numbers at device
open time. That is all certainly 2.7 material, however. For 2.6.0, the
most likely scenario might be the merging of a simple patch (like Badari
Pulavarty's patch found in the -mm tree) which expands the number of
disks supported in a relatively unintrusive way. The complete solution can
come later.
The cpuset mechanism
A set of patches has been making the rounds for the last month or so which
implements a concept known as a "cpuset." A cpuset is simply an arbitrary
collection of processors in an SMP system; cpusets can be used to partition
a large system into smaller virtual machines in a flexible sort of way.
This patch was originally
posted by Simon
Derr; more recent versions (found in the "patches" section, below) have
been sent out by Stephen Hemminger at OSDL.
Internally, the patch creates a hierarchy of cpusets. At boot time, the
root set is created containing all of the system's processors. System
calls can
then be used to create child sets. The creation of a cpuset is not a
privileged task, but no process can expand beyond the set of processors
initially assigned to it. Thus, for example, the system administrator can
create a cpuset for a particular group of processes which will be confined
to the designated processors. Those processes can, however, further
partition the set for their own purposes.
In normal use, one would expect cpusets to correspond to the underlying
hardware; all processors in a set would normally be part of the same NUMA
node, for example. There is nothing in the patch that requires users to do
things that way, however; cpusets can be any arbitrary subset of the
available processors. Processors can also belong to multiple cpusets, so
cpusets can overlap each other in arbitrary ways. There is, however, a
"strict" flag which can be set to disallow the sharing of processors in
this way.
There are a few new system calls created by this patch:
- cpuset_create(): Creates a new cpuset as a child of the process's current
  cpuset, containing the same processors as the parent.
- cpuset_destroy(): Destroys the given cpuset.
- cpuset_attach(): Attaches a process to a particular cpuset.
- cpuset_alloc(): Changes the set of processors belonging to a cpuset. The
  name of this call is a little misleading, since it can release processors
  from a cpuset. In fact, removing CPUs will be the normal usage, since a
  cpuset cannot contain processors which are not also contained in its
  parent.
- cpuset_getfreecpus(): Returns a list of processors which are not part of
  the current cpuset, but which could be added.
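The containment rule behind these calls can be modeled in user space with a bitmask per cpuset. The structure and function names below are invented for this sketch, the real system call arguments are not shown here, and the "strict" flag is ignored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cpuset: one bit per processor. */
struct cpuset_model {
    const struct cpuset_model *parent;  /* NULL for the root set */
    unsigned long cpus;
};

/* Change the processors in a set; refuse any CPU the parent lacks. */
static bool cpuset_model_alloc(struct cpuset_model *cs, unsigned long new_cpus)
{
    if (cs->parent && (new_cpus & ~cs->parent->cpus))
        return false;
    cs->cpus = new_cpus;
    return true;
}

/* Processors held by the parent but not by this set, which is to say
 * the ones that could still be added. */
static unsigned long cpuset_model_freecpus(const struct cpuset_model *cs)
{
    return cs->parent ? (cs->parent->cpus & ~cs->cpus) : 0;
}
```

In this model, a child of a four-processor root can shrink itself to CPUs 0 and 1, after which CPUs 2 and 3 show up as free; an attempt to claim a fifth processor fails.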
Processes running within a cpuset have no view of the processors which are
not contained within that set. Processors in a cpuset are renumbered to
appear to be the only processors on the system; thus, for example, system
calls like sched_setaffinity() will only bind processes within
their particular cpuset.
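The renumbering can be pictured as a mapping from a processor number as seen inside the cpuset to a physical processor. The sketch below assumes the simplest possible mapping, in which virtual CPU n is the nth processor in the set in ascending order; the patch itself may order things differently, and the function name is made up.

```c
/*
 * Hypothetical mapping: return the physical CPU number behind virtual
 * CPU "vcpu" in a set described by the bitmask "cpus", or -1 if the set
 * has no such processor.
 */
static int cpuset_virt_to_phys(unsigned long cpus, unsigned int vcpu)
{
    int phys;

    for (phys = 0; phys < (int)(sizeof(cpus) * 8); phys++) {
        if (!(cpus & (1UL << phys)))
            continue;               /* not in this cpuset */
        if (vcpu-- == 0)
            return phys;
    }
    return -1;
}
```

A cpuset made up of physical processors 2, 3, and 5 would thus see them as CPUs 0, 1, and 2, and a request for CPU 3 would find nothing.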
This patch has generated a certain amount of interest in the large-systems
community. It clearly does not fall within the 2.6.0-test "stability
patches only" mandate, but there may be pressure to get it into the kernel
not much after 2.6.0 is released.
Driver porting
kobjects and sysfs
In
The Zen of Kobjects, this
series looked at the kobject abstraction and the various interfaces that go
with it. That article, however, glossed over one important part of the
kobject structure (with a promise to fill it in later): its interface to
the sysfs virtual filesystem. The time has come to fulfill that promise
and look at how sysfs works at the lower levels.
To use the functions described below, you will need to include both
<linux/kobject.h> and <linux/sysfs.h> in your
source files.
How kobjects get sysfs entries
As we saw in the previous article, there are two functions which are used
to set up a kobject. If you use
kobject_init() by itself, you
will get a standalone kobject with no representation in sysfs. If,
instead, you use
kobject_register() (or call
kobject_add() separately), a sysfs directory will be created for
the kobject; no other effort is required on the programmer's part.
The name of the directory will be the same as the name given to the kobject
itself. The location within sysfs will reflect the kobject's position in
the hierarchy you have created. In short: the kobject's directory will be
found in its parent's directory, as determined by the kobject's
parent field. If you have not explicitly set the parent field,
but you have set its kset pointer, then the kset will become the
kobject's parent. If there is no parent and no kset, the kobject's
directory will become a top-level directory within sysfs, which is rarely
what you really want.
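The placement rules reduce to a short decision chain. The following user-space sketch uses cut-down stand-ins for the kernel structures (the "_model" names are invented) and assumes only what is stated above, plus the fact that a kset carries a kobject of its own.

```c
#include <stddef.h>

struct kset_model;

/* Simplified stand-in for struct kobject. */
struct kobject_model {
    const char *name;
    struct kobject_model *parent;
    struct kset_model *kset;
};

/* Simplified stand-in for struct kset, which embeds its own kobject. */
struct kset_model {
    struct kobject_model kobj;
};

/* Whose directory will contain this kobject's directory?  A NULL
 * return means the directory lands at the top level of sysfs. */
static struct kobject_model *sysfs_parent_of(struct kobject_model *kobj)
{
    if (kobj->parent)
        return kobj->parent;        /* explicit parent wins */
    if (kobj->kset)
        return &kobj->kset->kobj;   /* otherwise fall back to the kset */
    return NULL;
}
```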
Populating a kobject's directory
Getting a sysfs directory corresponding to a kobject is easy, as we have
seen. That directory will be empty, however, which is not particularly
useful. Most applications will want the kobject's sysfs entry to contain
one or more attributes with useful information. Creating those attributes
requires some additional steps, but is not all that hard.
The key to sysfs attributes is the kobject's kobj_type pointer.
When we looked at kobject types before, we passed over a couple of
sysfs-related entries. One, called default_attrs, describes the
attributes that all kobjects of this type should have; it is a pointer to
an array of pointers to attribute structures:
    struct attribute {
        char *name;
        struct module *owner;
        mode_t mode;
    };
In this structure, name is the name of the attribute (as it
will appear within sysfs), owner is a pointer to the module (if any)
which is responsible for the implementation of this attribute, and
mode is the protection bits which are to be applied to this
attribute. The mode is usually S_IRUGO for read-only attributes;
if the attribute is writable, you can toss in S_IWUSR to give
write access to root only. The last entry in the default_attrs
list must be NULL.
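A default_attrs array might be laid out as shown below. This is a user-space sketch so that it can be compiled on its own: S_IRUGO is defined by hand, struct module is left opaque, and the attribute names "size" and "label" are invented. Kernel code would normally set owner to THIS_MODULE.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

#ifndef S_IRUGO
/* Kernel shorthand for "readable by all"; not defined in user space. */
#define S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)
#endif

struct module;                      /* opaque in this sketch */

struct attribute {
    char *name;
    struct module *owner;
    mode_t mode;
};

/* A read-only attribute and one that root may also write. */
static struct attribute size_attr  = { "size",  NULL, S_IRUGO };
static struct attribute label_attr = { "label", NULL, S_IRUGO | S_IWUSR };

/* The array itself; the terminating NULL is mandatory. */
static struct attribute *example_default_attrs[] = {
    &size_attr,
    &label_attr,
    NULL,
};

/* Walk the array the way a consumer would, stopping at the NULL. */
static size_t count_attrs(struct attribute **attrs)
{
    size_t n = 0;

    while (attrs[n])
        n++;
    return n;
}
```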
The default_attrs array says what the attributes are, but does not
tell sysfs how to actually implement those attributes. That task falls to
the kobj_type->sysfs_ops field, which points to a structure
defined as:
    struct sysfs_ops {
        ssize_t (*show)(struct kobject *kobj, struct attribute *attr,
                        char *buffer);
        ssize_t (*store)(struct kobject *kobj, struct attribute *attr,
                         const char *buffer, size_t size);
    };
These functions will be called for each read and write operation,
respectively, on an attribute of a kobject of the given type. In each
case, kobj is the kobject whose attribute is being accessed,
attr is the struct attribute for the specific attribute,
and buffer is a one-page buffer for attribute data.
The show() function should encode the attribute's full value into
buffer, being sure not to overrun PAGE_SIZE. Remember
that the sysfs convention requires that attributes contain single values
or, at most, an array of similar values, so the one-page limit should never
be a problem. The return value is, of course, the number of bytes of data
actually put into buffer or a negative error code.
The store() function has a similar interface; the additional
size parameter gives the length of the data received from user
space. Never forget that buffer contains unchecked, user-supplied
data; treat it carefully and be sure that it fits whatever format you
require. The return value should normally be the same as size,
unless something has gone wrong.
As you can see, sysfs requires the use of a single set of show()
and store() functions for all attributes of kobjects of the same
type. Those functions will, usually, maintain their own array of attribute
information to enable them to find the real function charged with
implementing each attribute.
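One way to arrange that dispatch is to embed each struct attribute in a larger structure which carries per-attribute methods, then recover the larger structure from the attribute pointer. The sketch below does this in user space with simplified stand-in types; the example_attribute wrapper, the to_example_attr() macro, the "level" attribute, and PAGE_SIZE_MODEL are all invented for illustration, and the store() side assumes a NUL-terminated buffer.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/types.h>

#define PAGE_SIZE_MODEL 4096        /* stand-in for the kernel's PAGE_SIZE */

struct module;
struct kobject { const char *name; };   /* heavily simplified */
struct attribute { char *name; struct module *owner; mode_t mode; };

/* Hypothetical wrapper pairing an attribute with its own methods. */
struct example_attribute {
    struct attribute attr;
    ssize_t (*show)(struct kobject *kobj, char *buffer);
    ssize_t (*store)(struct kobject *kobj, const char *buffer, size_t size);
};

/* Step back from the embedded attribute to its wrapper. */
#define to_example_attr(a) \
    ((struct example_attribute *)((char *)(a) - \
        offsetof(struct example_attribute, attr)))

/* The type-wide show(): find the real method and call it. */
static ssize_t example_show(struct kobject *kobj, struct attribute *attr,
                            char *buffer)
{
    struct example_attribute *ea = to_example_attr(attr);

    return ea->show ? ea->show(kobj, buffer) : -EIO;
}

/* The type-wide store(), dispatching in the same way. */
static ssize_t example_store(struct kobject *kobj, struct attribute *attr,
                             const char *buffer, size_t size)
{
    struct example_attribute *ea = to_example_attr(attr);

    return ea->store ? ea->store(kobj, buffer, size) : -EIO;
}

static int level = 3;               /* the value behind the attribute */

static ssize_t level_show(struct kobject *kobj, char *buffer)
{
    (void)kobj;
    return snprintf(buffer, PAGE_SIZE_MODEL, "%d\n", level);
}

/* Accept a single digit, optionally followed by a newline; reject
 * everything else rather than trusting user-supplied data. */
static ssize_t level_store(struct kobject *kobj, const char *buffer,
                           size_t size)
{
    char *end;
    long val;

    (void)kobj;
    val = strtol(buffer, &end, 10);
    if (end == buffer || (*end != '\0' && *end != '\n') ||
        val < 0 || val > 9)
        return -EINVAL;
    level = (int)val;
    return (ssize_t)size;
}

static struct example_attribute level_attr = {
    { "level", NULL, 0644 }, level_show, level_store
};
```

The sysfs_ops structure for the type would point at example_show() and example_store(); adding another attribute then means defining one more example_attribute rather than touching the dispatch code.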
Non-default attributes
In many cases, the kobject type's
default_attrs field describes
all of the attributes that kobject will ever have. It does not need to be
that way, however; attributes can be added and removed at will. If you
wish to add a new attribute to a kobject's sysfs directory, simply fill in
an
attribute structure and pass it to:
int sysfs_create_file(struct kobject *kobj, struct attribute *attr);
If all goes well, the file will be created with the name given in the
attribute structure and the return value will be zero; otherwise,
the usual negative error code is returned.
Note that the same show() and store() functions will be
called to implement operations on the new attribute. Before you add a new,
non-default attribute to a kobject, you should take whatever steps are
necessary to ensure that those functions know how to implement that
attribute.
To remove an attribute, call:
int sysfs_remove_file(struct kobject *kobj, struct attribute *attr);
After the call, the attribute will no longer appear in the kobject's sysfs
entry. Do be aware, however, that a user-space process could have an open
file descriptor for that attribute, and that show() and
store() calls are still possible after the attribute has been
removed.
Symbolic links
The sysfs filesystem has the usual tree structure, reflecting the
hierarchical organization of the kobjects it represents. The relationships
between objects in the kernel are often more complicated than that,
however. For example, one sysfs subtree (/sys/devices) represents
all of the devices known to the system, while others represent the
device drivers. These trees do not,
however, represent the relationships between the drivers and the devices
they implement. Showing these additional relationships requires extra
pointers which, in sysfs, are implemented with symbolic links.
Creating a symbolic link within sysfs is easy:
    int sysfs_create_link(struct kobject *kobj,
                          struct kobject *target,
                          char *name);
This function will create a link (called name) pointing to
target's sysfs entry as an attribute of kobj. It will be
a relative link, so it works regardless of where sysfs is mounted on any
particular system.
The link will persist even if target is removed from the system.
If you are creating symbolic links to other kobjects, you should probably
have a way of knowing about changes to those kobjects, or some sort of
assurance that the target kobjects will not disappear. The consequences
(dead symbolic links within sysfs) are not particularly grave, but they
would not do much to create confidence in the proper functioning of the
system either.
Symbolic links can be removed with:
void sysfs_remove_link(struct kobject *kobj, char *name);
Binary attributes
The sysfs conventions call for all attributes to contain a single value in
a human-readable text format. That said, there is an occasional, rare need
for the creation of attributes which can handle larger chunks of binary
data. In the 2.6.0-test kernel, the only use of binary attributes is in
the firmware subsystem. When a device requiring firmware is
encountered in the system, a user-space program can be started (via the
hotplug mechanism); that program then passes the firmware code to
the kernel via a binary sysfs attribute. If you are contemplating any
other use of binary
attributes, you should think carefully and be sure there is no other way to
accomplish your objective.
Binary attributes are described with a bin_attribute structure:
    struct bin_attribute {
        struct attribute attr;
        size_t size;
        ssize_t (*read)(struct kobject *kobj, char *buffer,
                        loff_t pos, size_t size);
        ssize_t (*write)(struct kobject *kobj, char *buffer,
                         loff_t pos, size_t size);
    };
Here, attr is an attribute structure giving the name,
owner, and
permissions for the binary attribute, and size is the maximum size
of the binary attribute (or zero if there is no maximum). The
read() and write() functions work similarly to the normal
char driver equivalents; they can be called multiple times for a single
load with a maximum of one page worth of data in each call. There is no
way for sysfs to signal the last of a set of write operations, so code
implementing a binary attribute must be able to determine that some other
way.
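A write() method for a binary attribute therefore has to assemble its data from a series of chunks, each arriving with its own offset. The following user-space sketch shows the bookkeeping involved; fw_image, fw_len, FW_MAX, and fw_write() are hypothetical names, struct kobject is a stub, and off_t stands in for the kernel's loff_t.

```c
#include <stddef.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>

#define FW_MAX 8192                 /* hypothetical limit on the image size */

struct kobject { const char *name; };   /* stub for this sketch */

static char fw_image[FW_MAX];       /* where the chunks are assembled */
static size_t fw_len;               /* highest offset written so far */

/* Called once per chunk; "pos" says where the chunk belongs. */
static ssize_t fw_write(struct kobject *kobj, char *buffer,
                        off_t pos, size_t size)
{
    (void)kobj;
    /* Refuse anything which would land outside the image buffer. */
    if (pos < 0 || (size_t)pos > FW_MAX || size > FW_MAX - (size_t)pos)
        return -ENOSPC;
    memcpy(fw_image + pos, buffer, size);
    if ((size_t)pos + size > fw_len)
        fw_len = (size_t)pos + size;
    return (ssize_t)size;
}
```

Nothing here knows when the last chunk has arrived. One option, not shown, is a separate, ordinary attribute which user space writes to when the transfer is complete.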
Binary attributes must be created explicitly; they cannot be set up as
default attributes. To create a binary attribute, call:
    int sysfs_create_bin_file(struct kobject *kobj,
                              struct bin_attribute *attr);
Binary attributes can be removed with:
    int sysfs_remove_bin_file(struct kobject *kobj,
                              struct bin_attribute *attr);
Last notes
This article has described the low-level interface between kobjects and
sysfs. Unless you are implementing a new subsystem, however, you are
unlikely to work with this interface directly. Each subsystem typically
implements its own set of default attributes, and, perhaps, a mechanism for
interested code to add new ones. This mechanism is generally a
straightforward wrapper around the low-level attribute code, however, so it
should look familiar to readers of this page.
Patches and updates
Kernel trees
Core kernel code
Device drivers
Filesystems and block I/O
Architecture-specific
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet