
Kernel development

Brief items

Kernel release status

The current stable 2.6 kernel is 2.6.11.5, released on March 18.

The current 2.6 prepatch is 2.6.12-rc1, released (without an announcement) by Linus on March 18. This huge patch contains, among many other things, a driver for the "trusted computing" TPM chip (see the Trusted Computing Group site for more information on TPM), SuperHyway bus support, a new multi-level security implementation for SELinux, the "cpuset" patch (see cpusets.txt for information on cpusets), a new nVidia framebuffer driver, the device mapper multipath patches, an IPv6 update (including a patch removing the "experimental" designation for IPv6), a patch allowing an administrator to enable a subset of the "magic SysRq" functions, numerous driver updates, the address space randomization patches, a new packet classifier mechanism for the networking layer, a Tiger digest algorithm implementation, the restoration of the Philips webcam driver, some software suspend improvements, a big block I/O barrier rewrite (which enables full barrier support on serial ATA drives), a set of patches to shrink the kernel for embedded use, and high-resolution POSIX CPU clock support (not the full high-resolution timers patch). The details can be found in the long-format changelog.

Linus's BitKeeper repository contains some architecture updates, some networking fixes, and an IPv4 multipath implementation. Linus is out of the office this week, so patches are not being merged for a little bit. Andrew Morton, meanwhile, is encouraging developers to work on shortening his list of 140 2.6.12-rc1 bugs.

The current -mm tree is 2.6.12-rc1-mm1. Recent changes to -mm include ACPI-based PCI bridge hotplug support, the pluggable TCP congestion avoidance modules patch (see below), and some kernel timer improvements.

The current 2.4 prepatch is 2.4.30-rc1, released by Marcelo on March 18.


Kernel development news

Device model changes in store

The Linux device model is a core subsystem which implements various useful device-level functions, including reference counting, sysfs, hotplug event generation, and more. Some of the lower-level device model subsystems were covered in the LWN driver porting series; there is also a device model chapter in LDD3. All of that nice documentation is now threatened with obsolescence, however; a number of device model changes are currently in the works.

Class code changes

The device model "class" code is the mechanism behind /sys/class. Its purpose is to make information about devices (and more) available in a way which is independent of the underlying hardware topology. The largest use of classes, probably, is to export device numbers which can be used (by tools like udev) to create device nodes when hardware is added to the system. The class subsystem, like much of the device model code, has proved to be somewhat complex and error-prone to work with.

As a way of making things easier, the "class_simple" interface was added some time ago. This interface handles much of the boilerplate code for allocation of class structures, management of attributes, and life cycle management. Greg Kroah-Hartman has now concluded that class_simple was the sort of interface which was needed from the outset, so he has posted a set of patches which move the full class interface in that direction.

With the new interface, class structures are no longer created by the driver. Instead, one is allocated with a call to:

    struct class *class_create(struct module *owner, char *name);

This function will allocate the structure, initialize it, and register it with the given name. When the structure is no longer needed, it can be handed to class_destroy(), which will unregister it, decrement its reference count, and, eventually, get rid of it.

The class_device structure, which represents a single device under a class, also gets a dynamic allocation function:

    struct class_device *class_device_create(struct class *cls, dev_t devno,
                                             struct device *device, 
                                             char *fmt, ...);

The devno parameter is the device number; it is used to create the dev attribute for the class device entry. If device is non-NULL, it will be used to create a symbolic link to the appropriate entry under /sys/devices. The name of the device is passed in as a printk()-style format string.

Interestingly, a class_device structure is not destroyed directly; instead, the driver should call:

    void class_device_destroy(struct class *cls, dev_t devno);

The class code will find the class_device entry corresponding to the given device number and get rid of it.
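
Putting the pieces together, a driver's initialization and cleanup code under the new interface might look like the following sketch. The "foo" names are invented for illustration, and the error handling is abbreviated (it assumes class_create() returns an ERR_PTR() value on failure):

    static struct class *foo_class;
    static dev_t foo_devno;	/* obtained from alloc_chrdev_region() */

    static int __init foo_init(void)
    {
        foo_class = class_create(THIS_MODULE, "foo");
        if (IS_ERR(foo_class))
            return PTR_ERR(foo_class);
        /* Creates /sys/class/foo/foo0, including the "dev" attribute
           that udev reads to make the device node; a real driver
           would check the return value. */
        class_device_create(foo_class, foo_devno, NULL, "foo%d", 0);
        return 0;
    }

    static void __exit foo_exit(void)
    {
        /* The class code finds the entry by device number. */
        class_device_destroy(foo_class, foo_devno);
        class_destroy(foo_class);
    }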

The new functions may just look like some added convenience utilities, but Greg's long-term intent is to phase out the current class interface in favor of the new functions. The older versions, he says, are simply too hard to use correctly. Others may agree with this point, but there have been a few objections to this change. It really does represent a different way of doing things with the driver model.

Under the old scheme, class and class_device structures are typically embedded within larger, bus-specific (or driver-specific) structures. The reference counting implemented for the class-subsystem structures then also works for the containing structure, so the higher-level code, if written correctly, does not have to implement separate reference counting and life cycle management for its own structures.

The new way of doing things makes it impossible to embed the class structures in this way; they must, instead, be allocated separately and accessed via a pointer. So the bus-level or driver-level code must do its own reference counting for its own structures. The changes are often small. The patch to change the USB subsystem over, for example, adds a kref to struct usb_bus. Then, the function for obtaining a reference to a USB bus structure is changed this way:

    struct usb_bus *usb_bus_get(struct usb_bus *bus)
    {
    	if (bus)
    -		class_device_get(&bus->class_dev);
    +		kref_get(&bus->kref);
    	return bus;
    }
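
The matching release side is not shown above; presumably it gains a kref release callback along these lines (our reconstruction, not the actual patch):

    static void usb_bus_release(struct kref *kref)
    {
        struct usb_bus *bus = container_of(kref, struct usb_bus, kref);
        /* Called when the last reference has been dropped. */
        kfree(bus);
    }

    void usb_bus_put(struct usb_bus *bus)
    {
        if (bus)
            kref_put(&bus->kref, usb_bus_release);
    }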

So the changes are not all that huge, but, if all users of the old interface are to be switched over, new reference counts will have to be added in a number of places. If this change goes through, look for similar changes to other parts of the device model API in the future.

Delaying hotplug events

When a device is added to (or removed from) the system (more specifically, when a kobject representing that device is added or removed), the kernel generates a hotplug event to inform user space. That event is passed on to a tool like udev, which looks up the device number in sysfs and creates the appropriate device node(s). As it turns out, however, the hotplug event is generated before the sysfs attribute containing the device number is created. So, if the timing works out badly, udev must spin in a loop waiting for the attribute it needs to show up.

Kay Sievers has posted a series of patches which addresses this problem by making a change to the kobject API. In particular, kobject_add() and kobject_del() no longer generate hotplug events. Kernel code which uses those interfaces must explicitly generate hotplug events itself through calls to kobject_hotplug(). This change would appear to put extra work on higher-level code, but it has an important advantage: the kobject_hotplug() call can be made after the relevant sysfs attributes have been set up properly. Making the system as a whole work more smoothly is worth a small amount of added complexity.

The wrapper functions kobject_register() and kobject_unregister() have not been changed, and still generate hotplug events.
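
In code, the new ordering would look something like the sketch below; the attribute involved is hypothetical, but the point is that the hotplug call comes last:

    /* Register the kobject; with Kay's patches, no event is sent here. */
    kobject_add(&dev->kobj);

    /* Populate the attributes udev will want - the "dev" file, say. */
    sysfs_create_file(&dev->kobj, &dev_attr_dev.attr);

    /* Only now tell user space that the device has arrived. */
    kobject_hotplug(&dev->kobj, KOBJ_ADD);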

Locking and klists

The device model implements a shockingly complex data structure which must be protected against concurrent access. Much of that protection is handled by a reader-writer semaphore (rwsem) kept in the top-level subsystem structure. There has been a slow stream of patches aimed at removing that rwsem for a while now; it is seen as inelegant and a performance bottleneck. Pat Mochel has just posted a series of patches aimed at pushing this process forward some more.

Many of the structures needing locking are linked lists. In the current device model code, the standard kernel list type is used to implement these lists. Pat has decided that a new list type, which he calls a klist, is the right way to deal with many of the locking issues in the device core. The klist is built on the regular list_head type, but it adds some interesting properties.

The first of those properties is that the real head of the list has a different type (struct klist) from the entries in the list (struct klist_node). So klists are not explicitly circular lists; they have a clear starting point. The klist structure contains a spinlock which is used to serialize access to the list itself (but not to the individual nodes on the list).

The set of basic klist functions is rather smaller than the equivalent list_head functions:

    void klist_init(struct klist *list);
    void klist_add_tail(struct klist *list, struct klist_node *node);
    void klist_add_head(struct klist *list, struct klist_node *node);

The node structure is initialized automatically when it is added to the list, so there is no need for the caller to worry about it.

The klist_node structures contain their own reference count; as long as the count is non-zero, the node will continue to be part of the list. There are two removal functions:

    void klist_del(struct klist_node *node);
    void klist_remove(struct klist_node *node);

A call to klist_del() will decrement the node's reference count and return immediately; the entry may still exist on the list at that point. klist_remove() is like klist_del(), but it will, if necessary, sleep until the last reference has been given up and the node has actually been taken off the list.

Working through a klist requires the creation of an iterator structure - struct klist_iter. Iteration is started with a call to one of:

    void klist_iter_init(struct klist *list, struct klist_iter *iter);
    void klist_iter_init_node(struct klist *list, struct klist_iter *iter,
                              struct klist_node *node);

The first form starts iteration at the beginning of the list, while the second can be used to start at an arbitrary entry within the list. Stepping through the list is accomplished with:

    struct klist_node *klist_next(struct klist_iter *iter);

This function will return a pointer to the next node in the list, if there is one. It also will grab a reference to that node, so that it will not go away while the iterating code is working with it. Among other things, that feature makes it safe to call klist_del() on a node while iterating through the list; that node will continue to exist (at least) until klist_next() is called. Also implied is that calling klist_remove() while iterating through a list is a very bad idea; it will wait rather longer than the caller intended.

Iteration is ended with:

    void klist_iter_exit(struct klist_iter *iter);

This function will release the reference on the last node returned from klist_next() (if any) and stop the iteration.
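
Put together, walking a klist looks something like the following sketch; struct my_dev and the surrounding code are invented for illustration:

    struct my_dev {
        struct klist_node knode;
        /* ... device data ... */
    };

    static struct klist my_devs;	/* set up with klist_init(&my_devs) */

    static void walk_devs(void)
    {
        struct klist_iter iter;
        struct klist_node *node;

        klist_iter_init(&my_devs, &iter);
        /* klist_next() holds a reference on the node it returns, so
           the entry cannot disappear while we are working with it. */
        while ((node = klist_next(&iter)) != NULL) {
            struct my_dev *dev = container_of(node, struct my_dev, knode);
            /* ... use dev; calling klist_del(node) here would be safe ... */
        }
        klist_iter_exit(&iter);
    }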

The klist code drew an objection about the obfuscation caused by all of the device model "kfoo stuff." Pat responds that the klist code is, instead, a step toward cleaning up some of that obfuscation. There were not a whole lot of other comments on this patch series.

It's worth noting that, as of this writing, none of the patches described above have been merged. They are sufficiently disruptive that, at this point, they may have to wait until 2.6.13.


Pluggable congestion avoidance modules

Many years ago, when the TCP/IP protocols were young, the early Internet went through a bad period. As the number of systems on the net grew, the high-speed (56K) long-haul links which tied the backbone sites together became clogged and the net became very difficult to use. The TCP implementations in use at that time did not understand how to deal with (or even detect) congestion, and, as a result, made the problem worse. Some people began to ask if TCP was going to work at all.

Van Jacobson saved the situation with a simple observation: there is no point in sending data faster than the slowest link between the endpoints can handle it, even if the local network connection is very fast. Overwhelming the long-haul link just causes lots of dropped packets, retransmissions, and even more congestion. The solution was to start transmitting data slowly on a new connection, then to ramp up the speed until packets started getting dropped. The optimal speed was deemed to be one at which just a very small number of packets would fail to arrive. That speed would be adjusted over the life of the connection as conditions on the network changed. With TCP tweaked in this way, communicating systems would scale back their transmissions as the network got more congested, but would ramp up when the bandwidth became available. The result was a net which actually worked for everybody involved. It became possible, for example, to download the entire GNU emacs distribution without having to split it into dozens of small pieces first.

(We had to content ourselves with what we could get in those days.)

Since then, the net has become much larger, more complex, and faster. The congestion avoidance problem has grown as well, to the point that there are several competing algorithms seeking to provide the best TCP performance while being fair to other network users. Several of these algorithms have found their way into Linux, with a corresponding increase in the complexity of the TCP code. As a way of helping those experimenting with congestion avoidance and eliminating the need to patch the TCP code directly, Stephen Hemminger has posted a new infrastructure which allows congestion avoidance algorithms to be written as pluggable modules. He has also reworked the existing algorithms in the kernel to use the new infrastructure. The result is, among other things, an opportunity to look at how these algorithms work.

The core of the TCP protocol is the concept of a "window," being the amount of data which one side is willing to accept from the other at any given time. The window size reflects what the receiving system can handle - how much buffer space it has available - but it says nothing about what the routers in between can deal with. Congestion avoidance algorithms try to account for the slowest link serving a connection with a "congestion window," which is the maximum amount of data which can be in transit without an acknowledgment from the remote end. An ideal congestion window setting will allow a system to maximize throughput on a connection without excessive packet loss rates, and without taking an unfair amount of the shared network bandwidth. Finding that setting is still more of an art than a science.

Stephen's patches create a new structure to identify a congestion avoidance algorithm:

    struct tcp_ca_type {
	void (*start)(struct tcp_sock *tp);
	u32 (*ssthresh)(struct tcp_sock *tp);
	u32 (*min_cwnd)(struct tcp_sock *tp);
	void (*cong_avoid)(struct tcp_sock *tp, u32 ack, 
			   u32 rtt, u32 in_flight, int good);
	void (*rtt_sample)(struct tcp_sock *tp, u32 rtt);
	void (*set_state)(struct tcp_sock *tp, u8 new_state);

	void (*cwnd_event)(struct tcp_sock *tp, enum tcp_ca_event ev);
	u32  (*undo_cwnd)(struct tcp_sock *tp);
	void (*get_info)(struct tcp_sock *tp, u32 ext, struct sk_buff *skb);

	struct list_head	list;
	struct module 		*owner;
	const char 		*name;
    };

Each of the methods in this structure is a hook into the TCP code which allows the algorithm to obtain information on network conditions and react accordingly:

  • The start() method initializes the algorithm when a new batch of data is being transmitted; this can happen for new sockets, or when one has been idle for a while.

  • The ssthresh() method calculates the "slow start threshold"; when the congestion window is below that threshold, the connection is in slow start mode rather than full congestion avoidance mode. This method is called when congestion occurs.

  • The actual initial window may be set by min_cwnd() to be less than the threshold value as a starting point for the slow start algorithm.

  • When an acknowledgment arrives from the remote end, the cong_avoid() method is invoked; it may respond to successful packet delivery by enlarging the congestion window.

  • rtt_sample() tells the algorithm about a measured round-trip time - the time taken between sending a packet and receiving the corresponding acknowledgment.

  • set_state() indicates that the TCP state of the socket has changed.

  • Various events of interest can be communicated to the algorithm via cwnd_event().

  • Sometimes, transient situations can cause the congestion window to be reduced; the undo_cwnd() method can be called when such a situation is detected to restore a larger window.

  • The get_info() method can be used to make congestion avoidance information available to user space.
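
To make the shape of the interface concrete, here is a skeletal module sketch with only a couple of methods filled in, using Reno-like values. The registration function name (tcp_ca_register()) is our guess at the patch's interface and may differ:

    static u32 boring_ssthresh(struct tcp_sock *tp)
    {
        /* On congestion, halve the window (minimum of two segments). */
        return max(tp->snd_cwnd >> 1U, 2U);
    }

    static void boring_cong_avoid(struct tcp_sock *tp, u32 ack,
                                  u32 rtt, u32 in_flight, int good)
    {
        /* A real algorithm would grow tp->snd_cwnd here. */
    }

    static struct tcp_ca_type boring_ca = {
        .ssthresh   = boring_ssthresh,
        .cong_avoid = boring_cong_avoid,
        .owner      = THIS_MODULE,
        .name       = "boring",
    };

    static int __init boring_init(void)
    {
        return tcp_ca_register(&boring_ca);	/* hypothetical name */
    }
    module_init(boring_init);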

The TCP "Reno" algorithm is Van Jacobson's original; it remains wired into the kernel in a non-pluggable form (though it can be overridden). The congestion window starts at the min_cwnd() value, and increases by one segment for each sequential acknowledgment received from the remote end until it hits the slow-start threshold. At that point, the congestion window increases much more slowly until it either hits the TCP window size or packet loss happens. When congestion is detected, the congestion window is cut in half (to a minimum of two segments) and the process starts over.
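
In rough pseudo-C (not the kernel's actual code), the Reno behavior looks like:

    /* On each incoming ACK (cwnd and ssthresh counted in segments): */
    if (cwnd < ssthresh)
        cwnd++;			/* slow start: doubles roughly once per RTT */
    else if (++acks_seen >= cwnd) {
        cwnd++;			/* avoidance: about one segment per RTT */
        acks_seen = 0;
    }

    /* On detected congestion: */
    ssthresh = max(cwnd / 2, 2);	/* cut in half, two-segment minimum */
    cwnd = ssthresh;		/* and the process starts over */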

The Westwood algorithm is a tweak to the Reno approach. The Westwood code carefully tracks the round-trip times of the packets sent, and uses that information to estimate the effective bandwidth of the network connection. When packets get dropped, the congestion window and slow start thresholds are set relative to that bandwidth estimate. As a result, Westwood tends to back off more slowly than Reno, and should, thus, get better bandwidth overall. Its authors claim that Westwood is especially good for wireless links or other situations where the loss of an occasional packet may have nothing to do with congestion.

TCP Vegas also makes use of detailed round-trip time information. In particular, it tries to address a perceived failure in the Reno algorithm, which determines the optimal packet rate by increasing the congestion window until that rate is exceeded. Vegas, instead, monitors changes to the packet round-trip time as the congestion window is increased. If a larger window leads to longer round-trip times, the algorithm concludes that congestion is about to set in and the window is reduced slightly. The Vegas algorithm (or at least the Linux implementation thereof) does not perform well in all environments, so it is not enabled by default.

Binary Increase Congestion Control (BIC) [PDF] tries to be smarter about how the congestion window size is adjusted. Among other things, it is aimed at high-performance networks where the TCP window may be quite large. The other algorithms may, in congestion avoidance mode, make large changes to the congestion window which can result in abrupt increases in network traffic. The BIC code combines two algorithms as a way of trying to quickly converge on the proper congestion window while avoiding massive packet dumps. The core technique is a binary search; if the window is to be increased, the point midway between the current value and the maximum size is chosen. Decreases are handled by picking the midpoint between the current value and the threshold. If, however, the endpoints are too far apart, an "additive increase" is done instead - the congestion window is resized by a constant value.
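
The increase side of that decision reduces to something like this sketch (the constant and names are invented for illustration):

    #define MAX_STEP 16	/* largest jump allowed per update (illustrative) */

    static u32 bic_next_cwnd(u32 cwnd, u32 last_max)
    {
        if (cwnd < last_max) {
            u32 midpoint = (cwnd + last_max) / 2;
            /* Binary search toward the old maximum, but fall back to
               an additive increase if the endpoints are far apart. */
            if (midpoint - cwnd > MAX_STEP)
                return cwnd + MAX_STEP;
            return midpoint;
        }
        /* Beyond the old maximum: probe slowly for a new one. */
        return cwnd + 1;
    }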

The High-speed TCP algorithm is optimized for very fat pipes - 10G Ethernet and such. When things are congested, it behaves much like the Reno algorithm. When the congestion window is being increased, however, the high-speed algorithm makes use of a table to pick large increment values. This approach lets the congestion window get very large (tens of thousands of segments, say) quickly, and stay large, without requiring that the network function for long periods of time without a single dropped packet.

The last of the pluggable modules is the TCP Hybla implementation. Hybla is based on the observation that the other algorithms, which use round-trip times heavily in their calculations, tend to be biased against satellite links and other high-latency connections. So Hybla includes a calculation which allows the congestion window to become larger, more quickly, when the round-trip time is very high. In this way, it tries to keep the pipe full enough to make use of the available bandwidth, even though the time to turn around any individual packet is long.

Stephen is currently suggesting that this patch set should go into 2.6.13, after a good shakedown period in the -mm tree. There does not seem to be a whole lot of opposition, so things may well happen just that way.


Patches and updates

Kernel trees

Andrew Morton 2.6.12-rc1-mm1
Domen Puncer 2.6.12-rc1-kj
Greg KH Linux 2.6.11.5
Marcelo Tosatti Linux 2.4.30-rc1
Willy Tarreau linux-2.4.29-hf5

Architecture-specific

Core kernel code

Development tools

Device drivers

Filesystems and block I/O

Memory management

Networking

Stephen Hemminger TCP infrastructure split out
Stephen Hemminger TCP BIC 1.1 support
Stephen Hemminger TCP High speed support
Stephen Hemminger TCP Vegas support
Stephen Hemminger TCP Hybla
Stephen Hemminger TCP Westwood+ support

Security-related

Miscellaneous

Page editor: Jonathan Corbet


Copyright © 2005, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds