Kernel development
Brief items
Kernel release status
The current 2.6 prepatch remains 2.6.18-rc4; Linus will be on vacation for some time yet. In his absence, Greg Kroah-Hartman has released 2.6.18-rc4-gkh1, containing 64 patches intended for merging into the mainline after Linus returns.
The current -mm tree is 2.6.18-rc4-mm1. Recent changes to -mm include a reworking of the serial ATA configuration options ("If you blindly run `make oldconfig' you won't have any disks."), a new set of USB endpoint functions, a big x86-64 update, a reworking of the network time protocol code, support for read-only bind mounts, and the new Thinkpad embedded controller driver (despite concerns about its origin - see below).
The current 2.4 kernel is 2.4.33, released by Marcelo on August 11. This is Marcelo's final 2.4 release; the maintainership of this kernel now passes on to Willy Tarreau.
Kernel development news
The return of network block device deadlock prevention
Just over one year ago, LWN covered a patch set aimed at preventing potential deadlocks in the network subsystem. The problem being addressed can come about when the system is using a block (disk) device which is located on the other side of a network link. When the system runs short on memory, one of the things it must do is to write dirty pages back to disk, allowing that memory to be reused for other purposes. But writing to a network disk can require memory allocations in its own right - a need which comes at the worst possible time. This particular problem, which also arises with locally-attached drives, has been solved for a while by keeping a small memory reserve specifically for block I/O operations.
Network-attached drives have an additional problem, however, in that no write can be considered complete until an acknowledgment has been received from the remote device. Receiving that acknowledgment requires that the system be able to receive (and process) network packets - and that can require unbounded amounts of memory. There may be any amount of incoming network data which has nothing to do with outstanding block I/O requests, and that data can make it impossible to receive the packets which the memory-constrained system is so desperately waiting to receive. The deadlock avoidance patch made some changes aimed at ensuring that the system could always receive and process incoming block I/O traffic.
A year later, this patch set has resurfaced. The original author (Daniel Phillips) has stepped aside, and Peter Zijlstra has taken the lead. In many ways, the current version of the patch resembles its predecessors, but there have been enough changes to warrant a new look.
The patch still works by enlarging the emergency reserve area maintained by the core page allocator. There is a GFP flag (__GFP_MEMALLOC) which allows a particular allocation call to be satisfied out of the reserve, if necessary. The core idea is to use this reserve to receive vital incoming network packets without allowing it to be overrun with useless stuff.
To that end, code which is performing block I/O over a network connection sets the SOCK_MEMALLOC flag on its socket(s). Previous versions of the patch would then set a flag on any associated network interfaces to indicate that block I/O was passing through that interface, but the current version skips that step. Instead, any attempt to allocate an sk_buff (packet) structure from a network device driver will dip into the memory reserves if need be. Thus, as long as the reserves hold out, the system will always be able to allocate buffers for incoming packets.
The key is to receive the important packets without exhausting the reserves with useless data (streaming video from LinuxWorld keynotes, say). To that end, the networking code is patched to check for the SOCK_MEMALLOC flag as soon as possible after the socket for each incoming packet is identified. If that flag is not set, and the incoming packet is using memory from the reserves, the packet will be dropped immediately, freeing its memory for other uses. So packets related to block I/O are received and processed as usual; just about everything else gets dropped at the earliest possible moment.
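The drop decision described above can be modeled in a few lines of userspace C. This is only a sketch of the policy, not code from the patch: the demo_sock and demo_packet structures and the demo_should_drop() function are hypothetical stand-ins for the kernel's socket, sk_buff, and receive-path check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel socket. */
struct demo_sock {
    bool memalloc;      /* models the SOCK_MEMALLOC flag */
};

/* Hypothetical stand-in for an sk_buff. */
struct demo_packet {
    bool from_reserve;  /* buffer came out of the emergency reserve */
};

/*
 * Decide whether an incoming packet should be dropped once its
 * destination socket has been identified.  A packet is dropped only
 * when it is consuming reserve memory AND its socket is not marked
 * as carrying block I/O traffic.
 */
static bool demo_should_drop(const struct demo_sock *sk,
                             const struct demo_packet *pkt)
{
    assert(sk != NULL && pkt != NULL);
    return pkt->from_reserve && !sk->memalloc;
}
```

Note that, in this model, ordinary traffic is untouched as long as memory is plentiful; the check only bites once packets start coming out of the reserve.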
The latest version of the patch includes a new memory allocator, called SROG, which is used for handling reserve memory. It is intended to be fast and simple, and to release memory back to the system as quickly as possible. To that end, it tries to group related allocations together, and it isolates each group of allocations (generally the sk_buff structure and its associated data area) onto their own pages. So every time a packet is released, its associated memory immediately becomes available to the system as a whole.
This patch set is proving to be a bit of a hard sell, however. The deadlock scenario is seen as being relatively unlikely - there have not been streams of bug reports on this topic - and, in most cases, it can be avoided simply by swapping to a local disk. The set of systems whose owners can afford fancy network storage arrays, but where those same owners are unable to invest in a local disk for swapping, is thought to be small. Making the networking layer more complex to address this particular problem does not appeal to everybody.
Networking maintainer David Miller would like to see a different sort of approach to network memory allocations:
We already limit and control TCP socket memory globally in the system. If we do this for all socket and anonymous network buffer allocations, which is sort of implicity in Evgeniy's network tree allocator design, we can solve this problem in a more reasonable way.
This comment refers to Evgeniy Polyakov's network memory allocator patch, recently posted for consideration. This work is in a highly transitional state and is a little hard to read. The core, however, is this: it is (yet another) separate memory allocator, oriented toward the needs of the networking system. It is designed to keep memory allocations local to a single CPU, so each processor has its own set of pages to hand out. Allocated objects are packed as tightly as possible, minimizing internal fragmentation. There is no recourse to the system memory allocator in the current design, so, when a particular processor runs out, allocations will fail. Memory exhaustion in the rest of the system will not affect the network allocator, however. The author claims improved networking performance.
This code is also written with an eye toward mapping networking buffers directly into user space, perhaps in conjunction with a future network channel implementation.
The network allocator patch clearly has the eye of the networking maintainer at the moment. That code is fairly far from being ready to merge, however, and not everybody agrees that it solves all of the problems. So this is a discussion which could go on for some time yet.
Code of (still) uncertain origin
In last week's episode, we looked at the story of the new Thinkpad embedded controller driver and its author "Shem Multinymous." The situation had been put on hold after Pavel Machek had offered to sign off on the code, and the discussion died down for a bit. Not for long, though.
Robert Love, the author of the accelerometer driver which (among other things) is replaced by this code, reviewed it, noting "I am glad someone has apparently better access to hardware specs than I did." That brought Andrew Morton back in, saying:
We're setting precedent here and we need Linus around to resolve this. Perhaps we can ask "Shem" to reveal his true identity to Linus (and maybe me) privately and then we proceed on that basis. The rule could be "each of the Signed-off-by:ers should know the identity of the others".
That is not good enough for Greg Kroah-Hartman, however.
Jean Delvare has also declined to look at the code, saying that the legal uncertainty is too strong. Shem Multinymous, on the other hand, seems willing to come clean to Linus and Andrew if that is what it takes to get the code into the kernel. So it is conceivable that things could happen that way, with the code bypassing the maintainers who would normally handle (and review) it. Some residual concern could remain, however, perhaps to the point that distributors would consider removing the code from the kernels they ship.
"Shem" has also posted two separate messages on the provenance of the information used in this driver. The story, it seems, starts with a reverse-engineered Windows driver. Then, a real spec for the embedded controller chip was found. After that, it was mostly a matter of putting the pieces together. Or so it is said.
If this story holds together, then the new code probably is something which can be merged into the mainline without worry; it should be at least as legitimate as the original driver which it replaces. But, even if it gets in, this code will have set a precedent of sorts: anonymous submissions (at least, those submitted under an obvious pseudonym) are going to have a hard time getting through the process. Nobody wants to be the person who guided bad code into the kernel.
The cdev interface
Since time immemorial, the basic registration interface for char devices in the kernel has been:
    int register_chrdev(unsigned int major, const char *name,
                        const struct file_operations *fops);
    int unregister_chrdev(unsigned int major, const char *name);
In the old days, register_chrdev() would allocate all 256 minor numbers associated with the given major, associating the given name and file operations with all of them. If the major number is given as zero, one will be allocated on the fly. The corresponding unregister_chrdev() call would release all of those minor numbers. This call asked for the name as a safety measure; if the name did not match that provided when the major number was registered, the unregister_chrdev() call would fail.
In the intense period prior to the release of the 2.6.0 kernel, Al Viro set out to find a way to expand the device number range. One of the problems to be solved was the huge set of drivers which "knew" that minor numbers never went any higher than 255. One option would have been to audit every driver in the tree, ensuring that it did the right thing with minor numbers. Time was in short supply, however, and volunteers to do that particular job were in even shorter supply. So Al took a different approach: he created a new interface for the registration of char devices, then reimplemented the old interface as a compatibility layer which would allocate minor numbers 0..255 for a given major. In this way, unconverted code would continue to work as always, with the kernel guaranteeing that it would never see any minor numbers that it would not have seen before. Over time, drivers could be converted to the new interface, which has a number of advantages.
As it happens, that conversion never really came to be. Since the old interface continued to work, was familiar, and was a little simpler to use, developers stuck with it. Perhaps more importantly, the long-feared device number shortage never happened. Greater use of dynamic numbers, more generic device interfaces, and the hotplug mechanism all came together to make (most) Linux systems fit easily within the older device number space, to the point that the expanded numbers are rarely used. A quick scan on your editor's system reveals exactly three minor numbers greater than 255, all under /dev/bus/usb. So there has been no strong reason to convert to the new character device interface.
Recently, Alexey Dobriyan noticed that unregister_chrdev() no longer checks the name argument, so he posted a patch which removes that argument, fixing all callers in the process. Your editor suggested that, perhaps, this would be a good time to move those callers to the newer interface, rather than reworking the older, compatibility interface. In response, another developer suggested that better documentation for the new interface would be a good thing to have. To that end, here is a quick overview of how char device registration is meant to be done in 2.6.
The newer interface breaks down char device registration into two distinct steps: allocation of a range of device numbers, and association of specific devices with those numbers. The allocation phase is handled with either of:
    int register_chrdev_region(dev_t first, unsigned int count,
                               const char *name);
    int alloc_chrdev_region(dev_t *first, unsigned int firstminor,
                            unsigned int count, char *name);
The first form will allocate count minor numbers, starting with the major/minor pair found in first, and remembering name with all of them. The second form is intended for use when the desired major number is not known ahead of time; it will allocate a major number, then allocate count minor numbers, starting at firstminor. The beginning of the allocated number range will be returned in first. The return value will be zero on success or a negative error code on failure.
A few things are worth noting here. With either version, the major number used could be shared with other, completely unrelated devices. Only the specific minor number range allocated belongs to any given caller. These minor numbers can be greater than 255. It is possible that the allocated range of device numbers could overflow the minor number range, spilling into the next major number. That behavior is enabled by design, and everything should work correctly - though, as far as your editor knows, no production kernel has any allocations which work that way.
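How a range can spill into the next major number is easiest to see with the device number arithmetic written out. The following userspace sketch mirrors the in-kernel layout of a 2.6 dev_t (a 32-bit value with a 12-bit major and a 20-bit minor, as in <linux/kdev_t.h>); the DEMO_ macros and demo_ functions are hypothetical names invented for this illustration, not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t demo_dev_t;    /* models the kernel's 32-bit dev_t */

#define DEMO_MINORBITS  20
#define DEMO_MINORMASK  ((1U << DEMO_MINORBITS) - 1)
#define DEMO_MKDEV(ma, mi) \
    (((demo_dev_t)(ma) << DEMO_MINORBITS) | (demo_dev_t)(mi))
#define DEMO_MAJOR(dev) ((unsigned int)((dev) >> DEMO_MINORBITS))
#define DEMO_MINOR(dev) ((unsigned int)((dev) & DEMO_MINORMASK))

/* Last device number in a range of count numbers starting at first. */
static demo_dev_t demo_range_last(demo_dev_t first, unsigned int count)
{
    assert(count > 0);
    return first + count - 1;
}

/* Nonzero when the range runs past the end of first's major number. */
static int demo_range_spills(demo_dev_t first, unsigned int count)
{
    return DEMO_MAJOR(demo_range_last(first, count)) != DEMO_MAJOR(first);
}
```

Since the minor number occupies the low bits, adding to a device number simply carries into the major field once the minor range is exhausted.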
Regardless of which allocation function was used, device numbers can be returned to the system with:
    void unregister_chrdev_region(dev_t first, unsigned int count);
The association of device numbers with specific devices happens by way of the cdev structure, found in <linux/cdev.h>. It is possible to allocate and initialize a cdev structure with a sequence like:
    struct cdev *my_dev = cdev_alloc();

    if (my_dev != NULL) {
        my_dev->ops = &my_fops;  /* The file_operations structure */
        my_dev->owner = THIS_MODULE;
    } else {
        /* No memory, we lose */
    }
In the more common usage pattern, however, the cdev structure will be embedded within some larger, device-specific structure, and it will be allocated with that structure. In this case, the function to initialize the cdev is:
    void cdev_init(struct cdev *cdev, const struct file_operations *fops);
    /* Need to set ->owner separately */
Either way, the structure is put into proper operating condition, and it will be equipped with the file_operations which should be invoked for the associated device. The owner field of the structure should be initialized to THIS_MODULE to protect against ill-advised module unloads while the device is active.
The final step is to add the cdev to the system, associating it with the appropriate device number(s). The tool for that job is:
    int cdev_add(struct cdev *cdev, dev_t first, unsigned int count);
This function will add cdev to the system. It will service operations for the count device numbers starting with first; a cdev will often serve a single device number, but it does not have to be that way. Note that cdev_add() can fail; if the return code is nonzero (a negative error code), the device has not been added to the system.
Just as importantly: as soon as cdev_add() succeeds, the device is live, and its file operations can be called by the kernel. So a driver should not call cdev_add() until the initialization of the associated device is complete. To do otherwise is to invite unpleasant race conditions.
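The ordering constraint, along with the matching error unwinding, can be shown with a small userspace model. Nothing here is kernel code: the demo_ functions are hypothetical stand-ins for the number allocation, device setup, and cdev_add() steps, and each can be told to fail so that the cleanup path is exercised. The steps record themselves in demo_log ("N" for numbers allocated, "S" for device set up, "A" for cdev added, with lower-case letters for the corresponding undo steps).

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

static char demo_log[64];   /* records the order in which steps ran */
static int demo_fail_at;    /* 0 = never fail; 1..3 = step which fails */

static void demo_note(const char *step)
{
    strncat(demo_log, step, sizeof(demo_log) - strlen(demo_log) - 1);
}

/* Each stand-in returns zero on success or a negative error code. */
static int demo_alloc_numbers(void)
{
    if (demo_fail_at == 1)
        return -ENOMEM;
    demo_note("N");
    return 0;
}

static int demo_setup_device(void)
{
    if (demo_fail_at == 2)
        return -ENOMEM;
    demo_note("S");
    return 0;
}

static int demo_add_cdev(void)
{
    if (demo_fail_at == 3)
        return -ENOMEM;
    demo_note("A");
    return 0;
}

static void demo_teardown_device(void) { demo_note("s"); }
static void demo_release_numbers(void) { demo_note("n"); }

static int demo_driver_init(int fail_at)
{
    int ret;

    demo_log[0] = '\0';
    demo_fail_at = fail_at;

    ret = demo_alloc_numbers();
    if (ret)
        return ret;
    ret = demo_setup_device();
    if (ret)
        goto out_release;
    /* The device goes live here, so this step comes last. */
    ret = demo_add_cdev();
    if (ret)
        goto out_teardown;
    return 0;

out_teardown:
    demo_teardown_device();
out_release:
    demo_release_numbers();
    return ret;
}
```

A real driver would follow the same shape, with the cleanup steps run in the reverse order of the setup steps.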
Removal of a char device from the system is done with:
    void cdev_del(struct cdev *cdev);
The cdev should not be referenced after this call. In particular, if cdev was obtained with cdev_alloc(), it will likely be freed in cdev_del().
One final trick worth knowing about: when a char device's file operations are invoked, the associated inode pointer will be passed in, as usual. The field inode->i_cdev contains a pointer to the cdev structure for the device. Drivers can use that pointer to get to their own device-specific structure (perhaps with container_of()). It is, thus, no longer necessary to try to map the minor number onto an internal device - an operation which many drivers got wrong.
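The container_of() technique can be tried out in user space. In the sketch below, demo_cdev and demo_inode are minimal, hypothetical stand-ins for the kernel structures, demo_device is an invented driver structure, and demo_container_of() is a simplified version of the kernel macro (without its type checking).

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for struct cdev and struct inode. */
struct demo_cdev {
    int placeholder;
};

struct demo_inode {
    struct demo_cdev *i_cdev;
};

/* Step back from a member pointer to the structure which contains it. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* A device-specific structure with the cdev embedded within it. */
struct demo_device {
    int unit;
    struct demo_cdev cdev;
};

/* What an open() method might do to find its own device structure. */
static struct demo_device *demo_device_from_inode(const struct demo_inode *inode)
{
    assert(inode != NULL && inode->i_cdev != NULL);
    return demo_container_of(inode->i_cdev, struct demo_device, cdev);
}
```

No table mapping minor numbers to devices is needed; the pointer arithmetic alone recovers the enclosing structure.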
The cdev interface evolved somewhat in early 2.6 releases, but has not seen any changes in some time.
Page editor: Jonathan Corbet
