Kernel development
Brief items
Kernel release status
The current 2.6 prepatch is 2.6.13-rc7, announced by Linus on August 23. This prepatch, probably the final one before 2.6.13, includes a rather large number of small fixes; the long-format changelog has the details.
A handful of additional fixes has found its way into Linus's git repository since 2.6.13-rc7 came out.
The current -mm tree is 2.6.13-rc6-mm2. Recent changes to -mm include a number of architecture updates (including various i386 tweaks to better support virtualization), a couple of new timeout functions (see below), and various fixes.
Kernel development news
A pair of new timeout functions
The traditional way to delay a process for a given period of time is via schedule_timeout():
set_current_state(state);
schedule_timeout(delay);
The state parameter to set_current_state() should be either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, depending on whether the delay should be cut short on signal delivery or not. Evidently, a common error is to omit the call to set_current_state(), with the result that the requested delay does not happen. As a way of making life easier, the -mm tree now includes a pair of new functions:
signed long schedule_timeout_interruptible(signed long timeout);
signed long schedule_timeout_uninterruptible(signed long timeout);
These functions take care of setting the process state, so the delay should always happen as expected. Presumably these functions will be merged into the mainline for 2.6.14.
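The failure mode, and the way the new helpers avoid it, can be modeled in user space. What follows is only a sketch: the state variable, the jiffies_slept counter, and the mock schedule_timeout() are stand-ins invented for illustration, not kernel code. The mock copies the one property that matters here: a task which is still TASK_RUNNING is not put to sleep.

```c
#include <assert.h>

/* User-space stand-ins for the kernel's task states (values are arbitrary). */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE };

enum task_state current_state = TASK_RUNNING;
long jiffies_slept;   /* total "time" this mock task has spent asleep */

void set_current_state(enum task_state state)
{
    current_state = state;
}

/*
 * Mock of schedule_timeout(): returns the number of jiffies left over.
 * A task that is still TASK_RUNNING does not sleep at all.
 */
long schedule_timeout(long timeout)
{
    assert(timeout >= 0);
    if (current_state == TASK_RUNNING)
        return timeout;             /* forgotten state: no delay happened */
    jiffies_slept += timeout;       /* pretend the whole delay elapsed */
    current_state = TASK_RUNNING;   /* the task is running again on return */
    return 0;
}

/* The shape of the new helpers: set the state, then sleep. */
long schedule_timeout_interruptible(long timeout)
{
    set_current_state(TASK_INTERRUPTIBLE);
    return schedule_timeout(timeout);
}

long schedule_timeout_uninterruptible(long timeout)
{
    set_current_state(TASK_UNINTERRUPTIBLE);
    return schedule_timeout(timeout);
}
```

In this model, a bare schedule_timeout(10) returns 10 with no time slept, while schedule_timeout_interruptible(10) returns 0 after the full delay.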
Linux and TCP offload engines
The TCP/IP protocol suite takes a certain amount of CPU power to implement. So it is not surprising that network adapter manufacturers have long been adding protocol support to their cards. This support can vary from the simple (checksumming of packets, for example) through to full TCP/IP implementations. An adapter with full protocol support is often called a TCP offload engine or TOE.
Linux has never supported the TOE features of any network cards. For some time, there had not even been much discussion of TOE support. The topic has returned, however, with this patch adding TOE support which was posted by Scott Bardone of Chelsio Communications. This TOE patch is clearly intended to support Chelsio's line of network adapters, but it has been coded as a more generic "open TOE" framework. The Chelsio folks would very much like to see this patch merged for the 2.6.14 kernel release.
Those who are curious about the TOE patch can go in and look at the code; it is relatively straightforward. At its core, it creates a new type of extended network device (struct toedev) with an additional set of methods:
    int (*open)(struct toedev *dev);
    int (*close)(struct toedev *dev);
    int (*can_offload)(struct toedev *dev, struct sock *sk);
    int (*connect)(struct toedev *dev, struct sock *sk);
    int (*send)(struct toedev *dev, struct sk_buff *skb);
    int (*recv)(struct toedev *dev, struct sk_buff **skb, int n);
    int (*ctl)(struct toedev *dev, unsigned int req, void *data);
    void (*neigh_update)(struct net_device *lldev, struct toedev *dev,
                         struct neighbour *neigh, int fl);
There are various hooks sprinkled through the TCP code to detect when a TOE-capable device is being used and call the appropriate method rather than performing the TCP processing in the kernel. One assumes that the patch works as advertised, but its chances of getting into the kernel appear to be relatively small. There is a very long list of objections which have been raised, including:
- The TOE code must, by necessity, hook deeply into the Linux TCP
implementation. These hooks will make it harder to make high-level
TCP changes in the future. The TOE patch thus represents a long-term
maintenance burden.
- TOE shorts out much of the Linux networking code. In the process, it
cuts out little features like netfilter, traffic control, and more.
So a Linux system using TOE will lack many of the capabilities which
characterize the Linux networking stack. The networking hackers can
already foresee the interminable series of "why doesn't my TOE adapter
support netfilter?" questions which will go their way.
- The Linux networking stack is easy to fix when a bug or security issue
comes up. If a security problem turns up in a TOE adapter, instead, there is
very little which can be done to fix it.
- The performance benefits from TOE are minimal at best. Even if a TOE adapter and software stack currently outperform "dumber" adapters for very high networking speeds (10G currently, say), that advantage tends to disappear by the time those speeds are in common use. Jeff Garzik claims that 100Mb/s TOE adapters (which used to be the bleeding-edge high speed) are now slower than the Linux networking stack. So any performance advantage from TOE is a temporary thing, but, once it is merged, the code must be supported forever.
There is also the inconvenient little detail that a company called Alacritech owns several patents relating to TOE. It recently used those patents to extract money from Microsoft, which is including TOE support in its upcoming Windows release. This, alone, would almost certainly cause distributors to disable TOE support, even if it were to find its way into the kernel. (For the record, Chelsio claims to have done its legal homework, but not everybody finds that claim to be convincing.)
Will it find its way in? Not if David Miller has anything to say on the matter:
There is essentially zero chance of a networking patch being merged over David's objections, so the TOE developers have an uphill road ahead of them.
One might well ask: if TOE cannot be merged, how will Linux maintain competitive speeds as networks get faster? A big area of interest, currently, is offloading parts of the protocol which do not require great intelligence or state in the card. The kernel already supports TCP segmentation offloading (TSO), where an adapter can create TCP packets out of a large array of data. TSO reduces the necessary CPU power, bus overhead, and cache impact to send a series of packets, but it still does not require that the adapter actually know anything about specific TCP connections. There is talk of using a similar technique for incoming packets: an adapter could merge a configurable set of incoming packets into a single array, thus reducing the demands on the rest of the system. One way or another, the networking stack is likely to keep up with the demands of current hardware.
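The arithmetic behind TSO is simple enough to sketch. The host hands the adapter one large buffer along with the maximum segment size (MSS), and the adapter cuts the payload into packets. The two functions below are hypothetical helpers written for illustration; they are not part of any driver interface, and they ignore header generation entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Number of TCP segments needed for len bytes of payload at a given MSS. */
size_t tso_segment_count(size_t len, size_t mss)
{
    assert(mss > 0);
    return (len + mss - 1) / mss;   /* round up */
}

/*
 * Describe segment number idx: where its payload starts in the large
 * buffer and how many bytes it carries. Returns 0 on success, or -1 if
 * idx is past the last segment.
 */
int tso_segment(size_t len, size_t mss, size_t idx,
                size_t *offset, size_t *seg_len)
{
    if (idx >= tso_segment_count(len, mss))
        return -1;
    *offset = idx * mss;
    *seg_len = (len - *offset < mss) ? len - *offset : mss;
    return 0;
}
```

A 64KB buffer with a 1460-byte MSS, for example, becomes 45 segments, with the last one carrying 1296 bytes. The host builds one large request rather than 45 small ones, and the adapter needs nothing beyond the buffer and the MSS to do the cutting.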
It has often been said that a maintainer's real job is to say "no" to patches. Not all features are worth their (very real) cost, and merging some patches can be detrimental to the kernel in the long run. For years, the networking maintainers have felt that TOE support is the kind of patch which should not be accepted, and the current implementation appears not to have changed their minds. TOE appears to be one of those ideas which never really goes away, however, so chances are good that we will see this debate again in the future.
Configfs - an introduction
Complicated kernel subsystems can require complex configuration. Traditionally, Unix-like subsystems have made this configuration possible either via new system calls, or by way of a complex, ioctl()-based interface. Neither approach is considered to be optimal. New system calls clutter the namespace and must be added separately for each architecture; they are also quite inflexible once defined and used by user-space code. Anybody who uses the ioctl() interface for new code tends to get sneered at; using ioctl() is like adding new system calls but without the clear definition of the interface that a system call gives you.
So how should a new subsystem allow for configuration from user space? In some cases, sysfs can be used. Sysfs, however, was never really meant for this application. It provides a view into the kernel's data structures, and it can be used to cause things to happen with those structures. But sysfs cannot be used to create new objects - at least, not without distorting the interface somewhat. It is the wrong tool for this job.
The right tool might turn out to be a thing called configfs. It is yet another virtual filesystem, but one which is oriented toward user-space configuration tasks. It is currently part of the OCFS2 patch set, but it is likely to be merged separately due to interest from other kernel projects. It could, conceivably, be merged as early as 2.6.14.
Configfs is meant to be mounted on /config. Each subsystem which uses configfs then creates one or more top-level directories within configfs for its configuration; the distributed lock manager code, for example, creates /config/dlm/. That directory can start out empty, or it can be populated with the initial configuration of the subsystem, whichever is appropriate.
Like sysfs, configfs uses directories as the way of representing objects. Directories contain files ("attributes") which display the current state of the object, and which, optionally, may be writable to change that state. A fundamental difference, however, is that a suitably-privileged user-space process can create directories within configfs. That action will result in a callback within the kernel and the creation of the corresponding object. Directories created within configfs will have a set of attribute files from the beginning.
As an example (taken from the configfs documentation), consider a hypothetical network block device driver called "fakenbd." This driver would set up /config/fakenbd, which would start out empty. A system administrator could then use mkdir to create a network disk by creating an appropriately-named subdirectory under /config/fakenbd. That directory (called disk1, say) would be populated by the kernel with the relevant attributes: target for the IP address of the server providing the disk, device for the device on the server, and rw to control whether the disk is to be writable or not. The administrator would simply write the appropriate value into each attribute, and the disk would be configured.
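An administrator would normally do all of this with mkdir and echo, but the same steps can be sketched in C. The helper names (write_attr(), fakenbd_add_disk()) and the way the rw value is written as "1" or "0" are assumptions made for illustration; fakenbd itself is hypothetical. The root directory is a parameter so that the sketch can be tried against an ordinary directory; on a real configfs mount the kernel would have created the attribute files as soon as mkdir() returned, and fopen() would simply open them.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write one text value into the attribute file dir/attr. Returns 0 or -1. */
int write_attr(const char *dir, const char *attr, const char *value)
{
    char path[512];
    FILE *f;
    int n, ok;

    assert(dir != NULL && attr != NULL && value != NULL);
    n = snprintf(path, sizeof(path), "%s/%s", dir, attr);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Create root/name (the mkdir is what asks the driver for a new disk),
 * then fill in the three attributes described in the text.
 */
int fakenbd_add_disk(const char *root, const char *name,
                     const char *target, const char *device, int writable)
{
    char dir[256];
    int n;

    n = snprintf(dir, sizeof(dir), "%s/%s", root, name);
    if (n < 0 || n >= (int)sizeof(dir))
        return -1;
    if (mkdir(dir, 0755) != 0 && errno != EEXIST)
        return -1;
    if (write_attr(dir, "target", target) != 0 ||
        write_attr(dir, "device", device) != 0 ||
        write_attr(dir, "rw", writable ? "1" : "0") != 0)
        return -1;
    return 0;
}
```

A call like fakenbd_add_disk("/config/fakenbd", "disk1", "10.0.0.1", "/dev/sda1", 1) would then correspond to the mkdir-and-write sequence described above; the address and device name are made-up values.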
Some observers have questioned the distinction between configfs and sysfs. Users may well wonder why there are two separate directory trees performing similar tasks - especially since sysfs can be used for certain types of administrative functions. Configfs also has certain problems (such as persistence of attribute permissions) which have already been encountered - and solved - in sysfs. The kernel developers do see the two as being fundamentally different, however, so a merger seems unlikely.
If configfs takes off, one could imagine it being used all over the kernel. Much of what is done with ioctl() now could be moved over. Other patches (such as CKRM) which have their own configuration filesystems could switch to configfs. In the long term, configfs could be the path to a much more consistent - and transparent - way of configuring the many subsystems which make up the Linux kernel.
Configfs - the API
The configfs introduction described how this filesystem looks from user space. Anybody wanting to use configfs within a kernel subsystem will also be interested in the kernel-side interface. The configfs API will be somewhat familiar to developers who have worked with kobjects and sysfs; there are some differences, however. What follows is a blindingly fast overview of the configfs API; hold on tight.
Configfs implements a set of object types used to put together a configuration hierarchy:
- A config item (struct config_item) is the internal
representation of an object to be configured. It corresponds to a
directory in user space, and behaves somewhat like (the sysfs aspect
of) a kobject in kernel space. Each config item has one or more
attributes, represented in user space as files containing text
values. Like kobjects, config items are almost always embedded within
other, domain-specific structures.
- A config group (struct config_group) is just a config item which can contain other
config items (or groups).
- A config subsystem (struct configfs_subsystem) is a top-level config group. Like the sysfs subsystem type, it contains a semaphore used for mutual exclusion within the configfs code. The presence of the semaphore is somewhat interesting; the sysfs equivalent has been recognized for a while as being superfluous, and it will eventually be eliminated. The system being configured will have to perform its own internal locking anyway, so the same lock might as well be used at the configfs level.
More specifically, anybody wanting to create a configfs hierarchy must set up one or more config items - even if the only item, at the outset, is the configfs_subsystem structure implementing the top-level directory. Creating a config item requires, in turn, that some other structures be set up. The first of these is:
struct configfs_item_operations {
    void (*release)(struct config_item *item);
    ssize_t (*show_attribute)(struct config_item *item,
                              struct configfs_attribute *attr,
                              char *buffer);
    ssize_t (*store_attribute)(struct config_item *item,
                               struct configfs_attribute *attr,
                               const char *buffer, size_t size);
    int (*allow_link)(struct config_item *src,
                      struct config_item *target);
    int (*drop_link)(struct config_item *src,
                     struct config_item *target);
};
This structure defines how a specific config item operates. The release() method will be called whenever a config item's reference count drops to zero; its job is to perform the necessary cleanup. Attributes are implemented via the show_attribute() and store_attribute() methods, which work in the obvious manner. The final two methods, if present, control whether the creation of symbolic links between config items is allowed (allow_link()) and provide notification when a symbolic link is removed (drop_link()).
The above operations structure should be filled in for a specific type of config item. Then, it is necessary to store a pointer to the structure in a config_item_type structure:
struct config_item_type {
    struct module *ct_owner;
    struct configfs_item_operations *ct_item_ops;
    struct configfs_group_operations *ct_group_ops;
    struct configfs_attribute **ct_attrs;
};
Here, ct_owner is used to manage module reference counts, and ct_item_ops is the set of methods seen above. ct_group_ops is a separate set of operations for config groups; we'll get to those shortly. The final field, ct_attrs, defines the actual attributes which belong to this type of config item; it is an array of pointers to configfs_attribute structures:
struct configfs_attribute {
    char *ca_name;
    struct module *ca_owner;
    mode_t ca_mode;
};
As with sysfs, the structure representing an attribute contains little information beyond its name and permissions. A single set of functions is used for all attributes belonging to a given item type; they must figure out which attribute is being accessed themselves by looking at the name or by embedding the configfs_attribute structure inside another structure.
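The embedding approach can be tried out in user space. In the sketch below, the structures are cut-down copies of the ones above, and struct disk, struct disk_attribute, and the disk_show_*() functions are hypothetical names invented for the example. One further liberty: the single show method takes a buffer size, which the real show_attribute() does not, so that the example can use snprintf() safely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified user-space copies of the structures in the text. */
struct config_item { const char *ci_name; };

struct configfs_attribute {
    const char *ca_name;
    unsigned int ca_mode;
};

/* Same idea as the kernel's container_of(): member pointer to enclosing struct. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical domain structure with the config item embedded in it. */
struct disk {
    struct config_item item;
    int rw;
    const char *target;
};

/* Hypothetical wrapper: the generic attribute plus a per-attribute method. */
struct disk_attribute {
    struct configfs_attribute attr;
    int (*show)(struct disk *d, char *buffer, size_t size);
};

int disk_show_rw(struct disk *d, char *buffer, size_t size)
{
    return snprintf(buffer, size, "%d\n", d->rw);
}

int disk_show_target(struct disk *d, char *buffer, size_t size)
{
    return snprintf(buffer, size, "%s\n", d->target);
}

struct disk_attribute disk_attr_rw = { { "rw", 0644 }, disk_show_rw };
struct disk_attribute disk_attr_target = { { "target", 0644 }, disk_show_target };

/* The single show method shared by every attribute of this item type. */
int disk_show_attribute(struct config_item *item,
                        struct configfs_attribute *attr,
                        char *buffer, size_t size)
{
    struct disk *d = container_of(item, struct disk, item);
    struct disk_attribute *da = container_of(attr, struct disk_attribute, attr);

    assert(da->show != NULL);
    return da->show(d, buffer, size);
}
```

The alternative mentioned above, comparing attr->ca_name against known names, avoids the wrapper structure at the cost of a string comparison on every access.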
An actual config item looks like this:
struct config_item {
    char *ci_name;
    char ci_namebuf[UOBJ_NAME_LEN];
    struct kref ci_kref;
    struct list_head ci_entry;
    struct config_item *ci_parent;
    struct config_group *ci_group;
    struct config_item_type *ci_type;
    struct dentry *ci_dentry;
};
Code creating a config item should zero the entire structure, then initialize it with one of:
void config_item_init(struct config_item *item);
void config_item_init_type_name(struct config_item *item,
                                const char *name,
                                struct config_item_type *type);
If the name and type are set using the second form, no other initialization is required. The item, once created, will show up in configfs and will contain the attributes defined by its type.
Config items have a reference count, which is manipulated with the usual sort of functions:
struct config_item *config_item_get(struct config_item *item);
void config_item_put(struct config_item *item);
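The relationship between these functions and the release() method can be modeled in a few lines of user-space C. This is a sketch under loud assumptions: a plain integer stands in for the struct kref, the operations pointer hangs directly off the item rather than being reached through ci_type, and note_release() with its counter is a hypothetical callback that exists only so the effect can be observed.

```c
#include <assert.h>
#include <stddef.h>

struct config_item;

struct configfs_item_operations {
    void (*release)(struct config_item *item);
};

struct config_item {
    const char *ci_name;
    int ci_refcount;                          /* stand-in for struct kref */
    struct configfs_item_operations *ci_ops;  /* via ci_type in the real thing */
};

int releases_seen;   /* how many times the sample release() has run */

/* Hypothetical release method: real code would free the enclosing object. */
void note_release(struct config_item *item)
{
    (void)item;
    releases_seen++;
}

/* A freshly initialized item starts with one reference, held by its creator. */
void config_item_init(struct config_item *item)
{
    item->ci_refcount = 1;
}

struct config_item *config_item_get(struct config_item *item)
{
    if (item != NULL)
        item->ci_refcount++;
    return item;
}

/* Dropping the last reference is what triggers release(). */
void config_item_put(struct config_item *item)
{
    if (item == NULL)
        return;
    assert(item->ci_refcount > 0);
    if (--item->ci_refcount == 0 && item->ci_ops && item->ci_ops->release)
        item->ci_ops->release(item);
}
```

In this model, an item which has been initialized and then had one extra reference taken needs two calls to config_item_put() before release() runs.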
Items are created within a config group, defined by this structure:
struct config_group {
    struct config_item cg_item;
    struct list_head cg_children;
    struct configfs_subsystem *cg_subsys;
    struct config_group **default_groups;
};
As noted before, a config group is just a config item which can contain other items (or groups). So it has a config_item structure embedded within it. There is also a set of subgroups which will automatically be created whenever a group is created within this group. This list (default_groups), along with the list of attributes associated with the config item, defines the full contents of the group's directory when it is created.
Groups are initialized in a manner similar to items:
void config_group_init(struct config_group *group);
void config_group_init_type_name(struct config_group *group,
                                 const char *name,
                                 struct config_item_type *type);
Groups must define a set of group operations (and store a pointer to them in the config_item_type structure):
struct configfs_group_operations {
    struct config_item *(*make_item)(struct config_group *group,
                                     const char *name);
    struct config_group *(*make_group)(struct config_group *group,
                                       const char *name);
    int (*commit_item)(struct config_item *item);
    void (*drop_item)(struct config_group *group,
                      struct config_item *item);
};
Any particular config group type should only define either make_item() or make_group(), but not both. If make_group() exists, it will be called in response to a request from user space to create a directory; its job is to create a config_group structure, initialize it, and return it. In the absence of a make_group() method, make_item() will be called instead. There is, thus, no way to create a group which allows the dynamic creation of both items and groups within it; that limitation is unlikely to be a problem in most cases.
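The selection rule just described fits in a short user-space model. Everything here is a simplification: the structures carry only the fields the example needs, handle_mkdir() is a hypothetical name for whatever configfs does internally when user space creates a directory, and the sample_*() callbacks use static storage purely to keep the sketch free of allocation. Refusing the request when neither method exists is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct config_item { const char *ci_name; };
struct config_group { struct config_item cg_item; };

struct configfs_group_operations {
    struct config_item *(*make_item)(struct config_group *group,
                                     const char *name);
    struct config_group *(*make_group)(struct config_group *group,
                                       const char *name);
};

/*
 * Model of a mkdir within a group: make_group() wins if it exists,
 * otherwise make_item() is tried. Returns the item behind the new
 * directory, or NULL if nothing could be created.
 */
struct config_item *handle_mkdir(struct config_group *parent,
                                 const struct configfs_group_operations *ops,
                                 const char *name)
{
    assert(parent != NULL && name != NULL);
    if (ops == NULL)
        return NULL;
    if (ops->make_group) {
        struct config_group *g = ops->make_group(parent, name);
        return g ? &g->cg_item : NULL;
    }
    if (ops->make_item)
        return ops->make_item(parent, name);
    return NULL;   /* neither method: the sketch refuses the request */
}

/* Hypothetical callbacks backed by static storage. */
static struct config_item sample_item;
static struct config_group sample_group;

struct config_item *sample_make_item(struct config_group *group,
                                     const char *name)
{
    (void)group;
    sample_item.ci_name = name;
    return &sample_item;
}

struct config_group *sample_make_group(struct config_group *group,
                                       const char *name)
{
    (void)group;
    sample_group.cg_item.ci_name = name;
    return &sample_group;
}
```

When both methods are filled in, make_item() is simply never reached, which is why a group type cannot offer both kinds of dynamic child.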
The drop_item() method will be called when an item (or group) is deleted from the group. The commit_item() method is there to support transactional access to group members; that functionality is not implemented in the current configfs patch.
The top level of the hierarchy is a configfs_subsystem structure, which is just a special group:
struct configfs_subsystem {
    struct config_group su_group;
    struct semaphore su_sem;
};
Code creating a subsystem must first initialize the embedded group in the usual manner, then register the subsystem with:
int configfs_register_subsystem(struct configfs_subsystem *subsys);
There is a configfs_unregister_subsystem() function as well.
The above whirlwind tour is, hopefully, enough to give a feel for how to work with configfs. Those wanting more information may wish to consult the extensive documentation file and the example module distributed with the configfs patch.
Page editor: Jonathan Corbet
