Brief items
The current 2.6 prepatch remains 2.6.18-rc4; Linus will be on
vacation for some time yet. In his absence, Greg Kroah-Hartman has
released
2.6.18-rc4-gkh1,
containing 64 patches intended for merging into the mainline after Linus
returns.
The current -mm tree is 2.6.18-rc4-mm1. Recent changes
to -mm include a reworking of the serial ATA configuration options
("If you blindly run `make oldconfig' you won't have any
disks."), a new set of USB endpoint functions, a big x86-64 update,
a reworking of the network time protocol code, support for read-only bind
mounts, and the new Thinkpad embedded controller driver (despite concerns
about its origin - see below).
The current 2.4 kernel is 2.4.33, released by Marcelo on
August 11. This is Marcelo's final 2.4 release; the maintainership of
this kernel now passes on to Willy Tarreau.
Kernel development news
ext3 will be around for many years yet. We cannot just let it rot
due to some false belief that performing routine maintenance
against it will for some magical reason cause it to break.
-- Andrew Morton
Just over one year ago, LWN
covered a patch set aimed at
preventing potential deadlocks in the network subsystem. The problem being
addressed can come about when the system is using a block (disk) device
which is located on the other side of a network link. When the system runs
short on memory, one of the things it must do is to write dirty pages back
to disk, allowing that memory to be reused for other purposes. But writing to
a network disk can require memory allocations in its own right - a need
which comes at the worst possible time. This particular problem, which
also arises with locally-attached drives, has been solved for a while by
keeping a small memory reserve specifically for block I/O operations.
Network-attached drives have an additional problem, however, in that no
write can be considered complete until an acknowledgment has been received
from the remote device. Receiving that acknowledgment requires that the
system be able to receive (and process) network packets - and that can
require unbounded amounts of memory. There may be any amount of incoming
network data which has nothing to do with outstanding block I/O requests,
and that data can make it impossible to receive the packets which the
memory-constrained system is so desperately waiting to receive. The
deadlock avoidance patch made some changes aimed at ensuring that the
system could always receive and process incoming block I/O traffic.
A year later, this patch set has resurfaced. The original author
(Daniel Phillips) has stepped aside, and Peter Zijlstra has taken the
lead. In many ways, the current version of the patch resembles its
predecessors, but there have been enough changes to warrant a new look.
The patch still works by enlarging the emergency reserve area maintained by
the core page allocator. There is a GFP flag (__GFP_MEMALLOC)
which allows a particular allocation call to be satisfied out of the
reserve, if necessary. The core idea is to use this reserve to receive
vital incoming network packets without allowing it to be overrun with
useless stuff.
To that end, code which is performing block I/O over a network connection
sets the SOCK_MEMALLOC flag on its socket(s). Previous versions
of the patch would then set a flag on any associated network interfaces to
indicate that block I/O was passing through that interface, but the current
version skips that step. Instead, any attempt to allocate an
sk_buff (packet) structure from a network device driver will dip
into the memory reserves if need be. Thus, as long as the reserves hold
out, the system will always be able to allocate buffers for incoming
packets.
The key is to receive the important packets without exhausting the reserves
with useless data (streaming video from LinuxWorld keynotes, say). To that
end, the networking code is patched to check for the SOCK_MEMALLOC
flag as soon as possible after the socket for each incoming packet is
identified. If that flag is not set, and the incoming packet is using
memory from the reserves, the packet will be dropped immediately, freeing
its memory for other uses. So packets related to block I/O are received
and processed as usual; just about everything else gets dropped at the
earliest possible moment.
The latest version of the patch includes a new memory allocator, called
SROG, which is used for handling reserve memory. It is intended to be fast
and simple, and to release memory back to the system as quickly as
possible. To that end, it tries to group related allocations together, and
it isolates each group of allocations (generally the sk_buff
structure and its associated data area) onto their own pages. So every
time a packet is released, its associated memory immediately becomes
available to the system as a whole.
This patch set is proving to be a bit of a hard sell, however. The
deadlock scenario is seen as being relatively unlikely - there have not
been streams of bug reports on this topic - and, in most cases, it can be
avoided simply by swapping to a local disk. The set of systems whose
owners can afford fancy network storage arrays, but where those same owners
are unable to invest in a local disk for swapping, is thought to be small.
Making the networking layer more complex to address this particular problem
does not appeal to everybody.
Networking maintainer David Miller would like
to see a different sort of approach to network memory allocations:
I think there is more profitability from a solution that really
does something about "network memory", and doesn't try to say
"these devices are special" or "these sockets are special".
Special cases generally suck.
We already limit and control TCP socket memory globally in the
system. If we do this for all socket and anonymous network buffer
allocations, which is sort of implicity in Evgeniy's network tree
allocator design, we can solve this problem in a more reasonable
way.
This comment refers to Evgeniy Polyakov's network memory allocator patch,
recently posted for consideration. This work is in a highly transitional
state and is a little hard to read. The core, however, is this: it is (yet
another) separate memory allocator, oriented toward the needs of the
networking system. It is designed to keep memory allocations local to a
single CPU, so each processor has its own set of pages to hand out.
Allocated objects are packed as tightly as possible, minimizing internal
fragmentation. There
is no recourse to the system memory allocator in the current design, so,
when a particular processor runs out, allocations will fail. Memory
exhaustion in the rest of the system will not affect the network allocator,
however. The author claims improved networking performance:
Benchmarks with trivial epoll based web server showed noticeable
(more than 40%) improvements of the request rates (1600-1800
requests per second vs. more than 2300 ones). It can be described
by more cache-friendly freeing algorithm, by tighter objects
packing and thus reduced cache line ping-pongs, reduced lookups
into higher-layer caches and so on.
This code is also written with an eye toward mapping networking buffers
directly into user space, perhaps in conjunction with a future network
channel implementation.
The network allocator patch clearly has the eye of the networking
maintainer at the moment. That code is fairly far from being ready to
merge, however, and not everybody agrees that it solves all of the problems.
So this is a discussion which could go on for some time yet.
In
last week's episode, we
looked at the story of the new Thinkpad embedded controller driver and its
author "Shem Multinymous." The situation had been put on hold after Pavel
Machek had offered to sign off on the code, and the discussion died down
for a bit. Not for long, though.
Robert Love, the author of the accelerometer driver which (among other
things) is replaced by this code, reviewed
it, noting "I am glad someone has apparently better access
to hardware specs than I did." That brought Andrew Morton back in, saying:
This situation is still a concern. From where did this additional
register information come? [...]
We're setting precedent here and we need Linus around to resolve
this. Perhaps we can ask "Shem" to reveal his true identity to
Linus (and maybe me) privately and then we proceed on that basis.
The rule could be "each of the Signed-off-by:ers should know the
identity of the others".
That is not good enough for Greg
Kroah-Hartman, however:
For what it's worth, I'm not going to be handling these patches at
all (normally the hwmon patches go to Linus through Jean and then
through me.) If the original developer does not want to work in
the open like the rest of us, I can respect that, but unfortunately
I can't accept the risk of accepting their code.
Jean Delvare has also declined to look at the
code, saying that the legal uncertainty is too strong. Shem
Multinymous, on the other hand, seems willing to come clean to Linus and
Andrew if that is what it takes to get the code into the kernel. So it is
conceivable that things could happen that way, with the code bypassing the
maintainers who would normally handle (and review) it. Some residual
concern could remain, however, perhaps to the point that distributors would
consider removing the code from the kernels they ship.
"Shem" has also posted two separate messages on the provenance of the
information used in this driver. The story, it seems, starts with a
reverse-engineered Windows driver. Then, a real spec for the embedded
controller chip was found. After that, it was mostly a matter of putting
the pieces together. Or so it is said.
If this story holds together, then the new code probably is something which
can be merged into the mainline without worry; it should be at least as
legitimate as the original driver which it replaces. But, even if it gets
in, this code will have set a precedent of sorts: anonymous submissions (at
least, those submitted under an obvious pseudonym) are going to
have a hard time getting through the process. Nobody wants to be the
person who guided bad code into the kernel.
Since time immemorial, the basic registration interface for char devices in
the kernel has been:
    int register_chrdev(unsigned int major, const char *name,
                        const struct file_operations *fops);
    int unregister_chrdev(unsigned int major, const char *name);
In the old days, register_chrdev() would allocate all 256 minor
numbers associated with the given major, associating the given
name and file operations with all of them. If the major number is
given as zero, one will be allocated on the fly. The corresponding
unregister_chrdev() call would release all of those minor numbers.
This call asked for the name as a safety measure; if the name did not match
that provided when the major number was registered, the
unregister_chrdev() call would fail.
In the intense period prior to the release of the 2.6.0 kernel, Al Viro set
out to find a way to expand the device number range. One of the problems
to be solved was the huge set of drivers which "knew" that minor numbers
never went any higher than 255. One option would have been to audit every
driver in the tree, ensuring that it did the right thing with minor
numbers. Time was in short supply, however, and volunteers to do that
particular job were in even shorter supply. So Al took a different
approach: he created a new interface for the registration of char devices,
then reimplemented the old interface as a compatibility layer which would
allocate minor numbers 0..255 for a given major. In this way, unconverted
code would continue to work as always, with the kernel guaranteeing that it
would never see any minor numbers that it would not have seen before. Over
time, drivers could be converted to the new interface, which has a number
of advantages.
As it happens, that conversion never really came to be. Since the old
interface continued to work, was familiar, and was a little simpler to use,
developers stuck with it. Perhaps more importantly, the long-feared device
number shortage never happened. Greater use of dynamic numbers, more
generic device interfaces, and the hotplug mechanism all came together to
make (most) Linux systems fit easily within the older device number space,
to the point that the expanded numbers are rarely used. A quick scan on
your editor's system reveals exactly three minor numbers greater than 255,
all under /dev/bus/usb. So there has been no strong reason to
convert to the new character device interface.
Recently, Alexey Dobriyan noticed that unregister_chrdev() no
longer checks the name argument, so he posted a patch which removes that
argument, fixing all callers in the process. Your editor suggested that,
perhaps, this would be a good time to move those callers to the newer
interface, rather than reworking the older, compatibility interface. In
response, another developer suggested that better documentation for the new
interface would be a good thing to have. To that end, here is a quick
overview of how char device registration is meant to be done in 2.6.
The newer interface breaks down char device registration into two distinct
steps: allocation of a range of device numbers, and association of specific
devices with those numbers. The allocation phase is handled with either
of:
    int register_chrdev_region(dev_t first, unsigned int count,
                               const char *name);
    int alloc_chrdev_region(dev_t *first, unsigned int firstminor,
                            unsigned int count, char *name);
The first form will allocate count minor numbers, starting with
the major/minor pair found in first, and remembering name
with all of them. The second form is intended for use when the desired
major number is not known ahead of time; it will allocate a major number,
then allocate count minor numbers, starting at
firstminor. The beginning of the allocated number range will be
returned in first. The return value will be zero on success or a
negative error code on failure.
A few things are worth noting here. With either version, the major number
used could be shared with other, completely unrelated devices. Only the
specific minor number range allocated belongs to any given caller. These
minor numbers can be greater than 255. It is possible that the allocated
range of device numbers could overflow the minor number range, spilling
into the next major number. That behavior is enabled by design, and
everything should work correctly - though, as far as your editor knows, no
production kernel has any allocations which work that way.
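To see how a range can spill from one major number into the next, it helps to look at how device numbers are packed. The sketch below is a userspace model, assuming the 2.6 dev_t layout of 20 minor bits below 12 major bits; the model_ names are hypothetical stand-ins for the kernel's MKDEV(), MAJOR(), and MINOR() macros, and model_range_last() is an illustrative helper which does not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 12 major bits above 20 minor bits in a 32-bit value. */
#define MODEL_MINORBITS 20
#define MODEL_MINORMASK ((1U << MODEL_MINORBITS) - 1)

typedef uint32_t model_dev_t;

/* Pack a major/minor pair into one device number. */
static inline model_dev_t model_mkdev(unsigned int major, unsigned int minor)
{
    assert(minor <= MODEL_MINORMASK);
    return ((model_dev_t)major << MODEL_MINORBITS) | minor;
}

static inline unsigned int model_major(model_dev_t dev)
{
    return dev >> MODEL_MINORBITS;
}

static inline unsigned int model_minor(model_dev_t dev)
{
    return dev & MODEL_MINORMASK;
}

/* Last device number in a range of count numbers starting at first. */
static inline model_dev_t model_range_last(model_dev_t first,
                                           unsigned int count)
{
    assert(count > 0);
    return first + count - 1;
}
```

Under these assumptions, a two-number range starting at the highest minor of major 250 ends at minor 0 of major 251, since the range is simply a run of consecutive integers. Minor numbers above 255, such as model_mkdev(250, 256), fit without any trouble.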
Regardless of which allocation function was used, device numbers can be
returned to the system with:
    void unregister_chrdev_region(dev_t first, unsigned int count);
The association of device numbers with specific devices happens by way of
the cdev structure, found in <linux/cdev.h>. It is
possible to allocate and initialize a cdev structure with a
sequence like:
    struct cdev *my_dev = cdev_alloc();

    if (my_dev != NULL) {
        my_dev->ops = &my_fops;  /* The file_operations structure */
        my_dev->owner = THIS_MODULE;
    } else {
        /* No memory, we lose */
    }
In the more common usage pattern, however, the cdev structure will
be embedded within some larger, device-specific structure, and it will be
allocated with that structure. In this case, the function to initialize
the cdev is:
    void cdev_init(struct cdev *cdev, const struct file_operations *fops);
    /* Need to set ->owner separately */
Either way, the structure is put into proper operating condition, and it
will be equipped with the file_operations which should be invoked
for the associated device. The owner field of the structure
should be initialized to THIS_MODULE to protect against
ill-advised module unloads while the device is active.
The final step is to add the cdev to the system, associating it
with the appropriate device number(s). The tool for that job is:
    int cdev_add(struct cdev *cdev, dev_t first, unsigned int count);
This function will add cdev to the system. It will service
operations for the count device numbers starting with
first; a cdev will often serve a single device number,
but it does not have to be that way. Note that cdev_add() can
fail; if the return code is nonzero, the device has not been added to
the system.
Just as importantly: as soon as cdev_add() succeeds, the device is
live, and its file operations can be called by the kernel. So a driver
should not call cdev_add() until the initialization of the
associated device is complete. To do otherwise is to invite unpleasant
race conditions.
Removal of a char device from the system is done with:
    void cdev_del(struct cdev *cdev);
The cdev should not be referenced after this call. In particular,
if cdev was obtained with cdev_alloc(), it will likely be
freed in cdev_del().
One final trick worth knowing about: when a char device's file operations
are invoked, the associated inode pointer will be passed in, as
usual. The field inode->i_cdev contains a pointer to the
cdev structure for the device. Drivers can use that pointer to
get to their own device-specific structure (perhaps with
container_of()). It is, thus, no longer necessary to try to map
the minor number onto an internal device - an operation which many drivers
got wrong.
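The container_of() trick can be tried out in user space. The following is a minimal sketch rather than real driver code: model_cdev and model_inode are hypothetical stand-ins for the kernel's struct cdev and struct inode, my_device is an invented driver structure, and model_container_of() is a simplified version of the kernel macro without its type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace equivalent of the kernel's container_of() macro. */
#define model_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for struct cdev and struct inode. */
struct model_cdev {
    int placeholder;
};

struct model_inode {
    struct model_cdev *i_cdev;   /* models inode->i_cdev */
};

/* Hypothetical driver structure with the cdev embedded inside it. */
struct my_device {
    int unit;
    struct model_cdev cdev;
};

/* What an open() method might do to find its per-device data. */
static inline struct my_device *
my_device_from_inode(const struct model_inode *inode)
{
    assert(inode != NULL && inode->i_cdev != NULL);
    return model_container_of(inode->i_cdev, struct my_device, cdev);
}
```

The lookup is pure pointer arithmetic: the offset of the embedded cdev within my_device is subtracted from the pointer stored in the inode. No table indexed by minor number is involved, so there is no bounds check to get wrong.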
The cdev interface evolved somewhat in early 2.6 releases, but has
not seen any changes in some time.
Page editor: Jonathan Corbet