Brief items
The current 2.6 prepatch remains 2.6.9-rc2; Linus has released no
prepatches since September 13.
Linus's BitKeeper repository contains more __iomem annotations
(see last week's Kernel Page), new
sparse annotations intended to flush out byte endianness errors, an NTFS
update, ethtool support in the loopback driver, m32r architecture support,
the "string" I/O memory access
functions, support for more than eight partitions on BSD-labeled disks,
some User-mode Linux cleanups, a tunable "max sectors" limit for block I/O
requests (a latency reduction feature), a new prctl() option
allowing programs to change their name, some shared memory scalability
improvements, and a change in TCP ICMP source quench behavior (such
messages are simply ignored now).
The current tree from Andrew Morton is 2.6.9-rc2-mm1. Recent changes to -mm include
a number of Ingo Molnar's latency reduction patches, a
rework of tty locking, a number of User-mode Linux updates, and various
fixes.
The current 2.4 prepatch is still 2.4.28-pre3; Marcelo has released
no prepatches since September 11.
Kernel development news
The I/O scheduler ("elevator") has a challenging job: it must arrange for disk I/O
operations to be executed in the optimal order. "Optimal" means maximizing
the I/O bandwidth to the disk while, simultaneously, ensuring that all
requests are satisfied in a timely manner, that no process suffers excessive
latency, and, for desktop systems, that the interactive "feel" of the
system is responsive. Some schedulers take on additional tasks, such as
dividing the available bandwidth equally between processes (or users)
contending for each disk.
Given that set of demands, it is not surprising that there are multiple I/O
schedulers in the Linux kernel. The deadline scheduler works by
enforcing a maximum latency for all requests. The anticipatory scheduler
briefly stalls I/O after a read request completes with the idea that
another, nearby read is likely to come in quickly. The completely fair
queueing scheduler (recently updated by
Jens Axboe) applies a bandwidth allocation policy. And there is a simple
"noop" scheduler for devices, such as RAM disks, which do not benefit from
fancy scheduling schemes (though such devices usually short out the request
queue entirely).
The kernel has a nice, modular scheme for defining and using I/O
schedulers. What it lacks, however, is any flexible way of letting a
system administrator choose a
scheduler. I/O schedulers are built into the kernel code, and exactly one
of them can be selected - for all disks in the system - at boot time with the
elevator= parameter. There is no way to use different schedulers
for different drives, or to change schedulers once the system boots. The
chosen scheduler is used, and any others configured into the system simply
sit there and consume memory.
Jens Axboe has recently posted a patch
which improves on this situation. With this patch in place, I/O schedulers
can be built as loadable modules (though, as Jens cautions, at least one
scheduler must be linked directly into the kernel or the system will have a
hard time booting). A new scheduler attribute in each drive's
sysfs tree lists the available schedulers, noting which one is active at
any given time. Changing schedulers is simply a matter of writing the name
of the new scheduler into that attribute.
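From user space, switching schedulers therefore comes down to an ordinary file write. The following is a minimal sketch of that operation, not code from the patch: the helper name set_io_scheduler() is hypothetical, and the location of the attribute is left to the caller, since the text above only says that it lives somewhere in each drive's sysfs tree.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a scheduler name into a sysfs-style attribute file.
 * Returns 0 on success or a negative errno-style value on failure.
 * The helper name and the caller-supplied path are assumptions made
 * for this sketch; they are not taken from the patch.
 */
static int set_io_scheduler(const char *attr_path, const char *name)
{
    FILE *f;
    size_t len;
    int err = 0;

    assert(attr_path != NULL && name != NULL);

    len = strlen(name);
    if (len == 0)
        return -EINVAL;

    f = fopen(attr_path, "w");
    if (f == NULL)
        return errno ? -errno : -EIO;

    if (fwrite(name, 1, len, f) != len)
        err = -EIO;
    /* Buffered data is flushed at close time, so check that too. */
    if (fclose(f) == EOF && err == 0)
        err = -EIO;

    return err;
}
```

A caller might invoke it as set_io_scheduler(path, "deadline"), where path names the scheduler attribute for the drive in question; reading the same file back would then show which scheduler is active.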
The patch is long, but the amount of work required to support switchable
I/O schedulers wasn't all that great. The internal structures describing
elevators have been split apart to reflect the more dynamic nature of
things; struct elevator_ops contains the scheduler methods, while
struct elevator_type holds the metadata which describes an I/O
scheduler to the kernel. The new elevator_queue structure glues
an instance of an I/O scheduler to a specific request queue. Updating the
mainline schedulers to work with the new structures required a fair number
of relatively straightforward code changes. Each scheduler now also has
module initialization and cleanup functions which have been separated from
the code needed to set up or destroy an elevator for a specific queue.
One interesting question is: what should be done with the currently queued
block requests when an I/O scheduler change is requested? One could
imagine requeueing all of those requests with the new scheduler in order to
let it have its say immediately. The simpler approach, which was chosen
for this patch, is to block the creation of new requests and wait for the
queue to empty out. Once all outstanding I/O has been finished up, the old
scheduler can be shut down and moved out of the way.
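The drain-then-switch approach can be illustrated with a small user-space model. Everything here is a hypothetical stand-in (struct model_queue, model_switch(), and friends do not appear in the patch), and the waiting is simulated by completing requests in a loop where a real kernel would sleep until the outstanding I/O finishes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
struct sched_type {
    const char *name;
};

struct model_queue {
    const struct sched_type *sched; /* currently attached scheduler */
    int pending;                    /* requests queued but not completed */
    int blocked;                    /* nonzero: refuse new requests */
};

/* Try to queue a request; fails while a switch is in progress. */
static int model_submit(struct model_queue *q)
{
    if (q->blocked)
        return -1;
    q->pending++;
    return 0;
}

/* Complete one outstanding request, if there is one. */
static void model_complete_one(struct model_queue *q)
{
    if (q->pending > 0)
        q->pending--;
}

/*
 * Switch schedulers: stop new requests, let the queue drain, then swap.
 * The drain is simulated here; a kernel would wait for completions.
 */
static void model_switch(struct model_queue *q, const struct sched_type *next)
{
    assert(q != NULL && next != NULL);

    q->blocked = 1;
    while (q->pending > 0)
        model_complete_one(q);
    q->sched = next;    /* the old scheduler never sees a live request */
    q->blocked = 0;
}
```

The appeal of this design is that neither scheduler ever has to accept requests it did not queue itself, at the cost of a brief stall while the queue empties.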
There have been no (public) objections to the patch; chances are it will
find its way into the mainline sometime after 2.6.9 comes out.
In the Good Old Days, loadable modules had to manage their own reference
counts with the
MOD_INC_USE_COUNT and
MOD_DEC_USE_COUNT
macros. This mechanism was always subject to race conditions; since the
count was manipulated inside the module itself, there was no way to avoid
situations where the kernel was executing inside the module, but the use
count was zero. And that was for correctly written modules; distributing
responsibility for the reference count in this way also provided lots of
opportunities for module writers to get things wrong.
So, for 2.6, reference count management was moved up into the code which
calls into modules, and the MOD_*_USE_COUNT macros were
deprecated. In recent times the kernel janitors have been busy, to the
effect that, at this point, there are no more users of those macros in the
mainline kernel. So Christoph Hellwig has posted a patch removing them altogether. That patch
has not been merged as of this writing, but the writing is clearly on the
wall. Any external modules which are still using these macros should
probably be fixed up in a hurry.
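The difference between the two schemes is easier to see in a toy model. The sketch below is single-threaded user-space code with hypothetical names throughout; it is only meant to show the calling code pinning the module before it jumps into it. In the 2.6 kernel itself, that job falls to try_module_get() and module_put(), typically driven by an owner field in the operations structure.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a module; all names here are hypothetical. */
struct model_module {
    int refcount;   /* users currently inside the module */
    int unloading;  /* set once removal has started */
};

struct model_ops {
    struct model_module *owner; /* which module provides these methods */
    int (*open)(void);
};

/* Take a reference unless the module is on its way out. */
static int model_try_get(struct model_module *m)
{
    if (m->unloading)
        return 0;
    m->refcount++;
    return 1;
}

static void model_put(struct model_module *m)
{
    assert(m->refcount > 0);
    m->refcount--;
}

/* Removal succeeds only when nobody holds a reference. */
static int model_unload(struct model_module *m)
{
    if (m->refcount > 0)
        return -1;
    m->unloading = 1;
    return 0;
}

/* The caller, not the module, holds the reference around the call. */
static int model_call_open(const struct model_ops *ops)
{
    int ret;

    if (!model_try_get(ops->owner))
        return -1;
    ret = ops->open();
    model_put(ops->owner);
    return ret;
}

/* Demonstration module: records the count it sees while running. */
static struct model_module demo_mod;
static int demo_refs_seen;

static int demo_open(void)
{
    demo_refs_seen = demo_mod.refcount;
    return 0;
}
```

Since the count is raised before control enters demo_open(), there is no window in which code inside the module runs with a count of zero; a real implementation would, of course, need atomic operations rather than plain integers.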
Christoph has also sent out a patch marking
the lightly-used inter_module functions as deprecated. These functions,
which perform a sort of run-time linking between modules, have never been
seen as elegant or safe to use.
Rusty Russell, meanwhile, has added a warning
to the kernel informing users that the ipchains and ipfwadm interfaces
to netfilter will be going away soon. They have been obsolete since 2.4,
but the kernel developers have kept them around because they are a
user-space interface which is still very much in use. Once a site
administrator gets a set of firewall rules that works, he or she is rarely
amused by the idea of rewriting everything for a new interface.
Supporting these interfaces requires the maintenance of an intermediate
compatibility layer in the netfilter code, however, and that makes
maintenance and development of the code hard. In the interests of carrying
the code forward, the netfilter developers want to get rid of the older
cruft. For now, they are just adding a warning; no time frame has been
given for (1) firmer warnings, or (2) actual removal of the
code.
There are a couple of obstacles to actually taking this code out:
- The users of the old interfaces. For those trying to convert to
iptables, William Stearns has posted a
script which converts ipchains rules to iptables.
- 32-bit emulation. The binary interface used by iptables is
exceedingly difficult to implement for 32-bit user-space programs in a
64-bit kernel - with the result that it has not been done. For this
reason, x86-64 maintainer Andi Kleen has requested that ipchains not be removed at
this time. Fixing that problem will not be a straightforward task,
however.
In the longer term, it seems clear that the older interfaces have to go.
The alternative is a steady accumulation of compatibility cruft which,
eventually, causes the kernel to collapse under its own weight.
Some platforms, it seems, have an interesting property: writes to I/O
memory space from multiple processors may be reordered before reaching the
device. Even if the device registers are protected by a lock (pretty much
necessary to keep multiple processors from writing simultaneously and
confusing the device), writes issued by one CPU can arrive before those
from another, even if the second CPU had held the lock and issued its
writes first. The Itanium architecture in particular behaves this way,
though others may as well.
The answer, according to Jesse Barnes, is
the addition of a new type of memory barrier to force the ordering of
writes to the device. Jesse's patch adds a new function,
mmiowb(), which implements this barrier. He has also updated the
qla1280 driver to make use of it.
Authors of PCI drivers are accustomed to coding a different sort of
barrier: reading from a device register to ensure that all writes have
actually been posted to the device. mmiowb() is a different,
lighter-weight mechanism. After a call to mmiowb(), writes might
still not have reached the device. Writes are not forced out; they
just have their ordering with respect to subsequent writes guaranteed. In
many situations, that sort of guarantee is all that is needed.
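As an illustration of where such a barrier might sit in a driver, here is a user-space sketch in which the kernel primitives are replaced by stubs that merely log their invocation. The stub names, the start_dma() routine, and the register layout are all hypothetical, and placing the barrier just before the lock is released is an assumption about typical usage rather than something taken from Jesse's patch.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_MAX 16

/* Event log so that the order of operations can be inspected. */
enum ev { EV_LOCK, EV_WRITE, EV_MMIOWB, EV_UNLOCK };
static enum ev ev_log[LOG_MAX];
static size_t ev_count;

static void record(enum ev e)
{
    assert(ev_count < LOG_MAX);
    ev_log[ev_count++] = e;
}

/* Stubs standing in for kernel primitives; they only record the call. */
static void stub_spin_lock(void)   { record(EV_LOCK); }
static void stub_spin_unlock(void) { record(EV_UNLOCK); }
static void stub_mmiowb(void)      { record(EV_MMIOWB); }

static void stub_writel(unsigned int val, volatile unsigned int *reg)
{
    *reg = val;
    record(EV_WRITE);
}

/* Hypothetical driver routine which programs two device registers. */
static void start_dma(volatile unsigned int *regs)
{
    stub_spin_lock();
    stub_writel(0x1000, &regs[0]);  /* say, a buffer address */
    stub_writel(1, &regs[1]);       /* say, a "go" bit */
    stub_mmiowb();                  /* order these writes before unlocking */
    stub_spin_unlock();
}
```

The point of the placement is that the next CPU to take the lock issues its writes after the barrier, so they cannot overtake the ones made here. No read from the device is involved, which is what makes this cheaper than the traditional flush.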
Li Shaohua
ran into a problem when
repeatedly plugging and unplugging an e1000 network adaptor. After 32
times, the adaptor would no longer work. It seems that the driver (like
many others in the 2.6 kernel) was
designed to discover at most 32 devices at boot time, and it has space for
configuration parameters for just that many devices. Each new hotplug
event looked like a new device, so the driver quickly ran out of parameter
storage. In fact, the e1000 driver can handle many more devices than that;
it just lacks space in its boot-time arrays to hold default configuration
information.
Mr. Li's diagnosis was that the problem lies with the e1000 driver's
inability to reuse board numbers internally. So he wrote up a patch to
keep track of existing boards, and to reuse their numbers when they are
removed. After some discussion, this patch was reworked into a general mechanism using the
"idr" facility (described in the next article) - since the e1000 is not the
only driver which behaves this
way, it makes sense to fix the problem once for everybody.
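The failure mode, and the effect of reusing board numbers, can be shown with a few lines of user-space code. This is a sketch under stated assumptions: the function names are hypothetical, a simple bitmap stands in for both Mr. Li's original tracking code and the idr-based rework, and the limit of 32 simply mirrors the number given above.

```c
#include <assert.h>

#define MAX_BOARDS 32   /* size of the boot-time parameter arrays */

static unsigned int boards_in_use; /* bit n set: board number n is taken */
static int next_board;             /* old scheme: ever-increasing counter */

/* Old behavior: every probe consumes a fresh number, never reused. */
static int old_alloc_board(void)
{
    if (next_board >= MAX_BOARDS)
        return -1;
    return next_board++;
}

/* Reusing scheme: hand out the lowest free number. */
static int alloc_board(void)
{
    int n;

    for (n = 0; n < MAX_BOARDS; n++) {
        if (!(boards_in_use & (1u << n))) {
            boards_in_use |= 1u << n;
            return n;
        }
    }
    return -1;
}

/* Called on hot-unplug so that the number can be handed out again. */
static void free_board(int n)
{
    assert(n >= 0 && n < MAX_BOARDS);
    boards_in_use &= ~(1u << n);
}
```

With the old scheme, the 33rd probe fails even if only one adaptor was ever present; with reuse, a single adaptor can be plugged and unplugged indefinitely and will get board number zero every time.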
Not everybody agrees that this is the right
approach, however. Boot-time configuration parameters can be useful for
many (if not most) systems where the network interfaces are screwed down
and are unlikely to be replaced while the system is up. But do they really
make sense for hotpluggable devices? There is a whole system in place for
the configuration of hotpluggable devices; perhaps that should be used
rather than adding complexity to the network drivers. Given that the
conversation came to a hard stop after this view was posted, it seems
likely to carry the day.
There have been a fair number of patches in recent times which
convert one part or other of the kernel over to the "idr" facility. Idr is
a set of library functions for the management of small integer ID numbers.
In essence, an idr object can be thought of as a sparse array mapping
integer IDs onto arbitrary pointers, with a "get me an available entry"
function as well. This code was first added in February, 2003 as part of
the POSIX clocks patch, and has seen various tweaks since.
Working with idr requires including <linux/idr.h>. Creating
a new idr object is simply a matter of allocating a
struct idr and passing it to:
void idr_init(struct idr *idp);
The interface for allocating new IDs is somewhat unintuitive. The authors
decided to separate out the parts of the
ID allocation process which may require getting memory from the system;
the idea was that the memory allocation could be done with no locks held,
while the actual generation of an ID number could be done in a locked
state. Thus, before allocating a new ID, one must call:
int idr_pre_get(struct idr *idp, unsigned int gfp_mask);
This function will get set up to allocate a new ID number, allocating
memory (with the given gfp_mask) if necessary. Contrary to the
usual conventions, the return value
will be zero if something goes wrong, nonzero otherwise.
Once that is done, a new ID can be allocated with either of:
int idr_get_new(struct idr *idp, void *ptr, int *id);
int idr_get_new_above(struct idr *idp, void *ptr, int start_id, int *id);
The first form gets the next available ID number, stores it in id,
and associates it with the given ptr internally. If you wish to
specify a minimum value for the new ID, use idr_get_new_above()
instead. If all goes well, the return value will be zero; if no more IDs
can be allocated, -ENOSPC will be returned.
Imagine a situation where two processors are both looking to allocate a new
ID. Both call idr_pre_get(), guaranteeing that enough memory
exists to allocate at least one more ID. Then one processor swoops in and
grabs that ID, leaving no memory for the other. In that case,
idr_get_new() will not attempt to allocate more memory; it will,
instead, return -EAGAIN. At that point, the code should emit a
heavy sigh, release its locks, and go back to the idr_pre_get()
stage. Thus, ID allocation code can look something like this:
again:
    if (idr_pre_get(&my_idr, GFP_KERNEL) == 0) {
        /* No memory, give up entirely */
        return -ENOMEM;
    }
    spin_lock(&my_lock);
    result = idr_get_new(&my_idr, &target, &id);
    if (result == -EAGAIN) {
        sigh();
        spin_unlock(&my_lock);
        goto again;
    }
    /* On success (result == 0), id now holds the new ID */
    spin_unlock(&my_lock);
It should be noted that calls to idr_get_new() (and most other idr
functions) must be serialized by some sort of lock, or unpleasant things
could happen. idr_pre_get() can sleep, however, and should not be
called under lock.
Looking up an existing ID is much simpler:
void *idr_find(struct idr *idp, int id);
The return value will be the pointer associated with the given id,
or NULL if no pointer is associated with that ID.
To deallocate an ID, use:
void idr_remove(struct idr *idp, int id);
With these functions, kernel code can generate ID numbers to use as minor
device numbers, inode numbers, or in any other place where small integer
IDs are useful.
There is one more interesting twist to the idr code: it does (almost)
nothing to help users detect reused ID numbers. When an object is
destroyed, it may not be possible to tell whether anybody still has its ID
number around or not. When some part of the kernel comes along with an ID
number, it would be nice to know that it refers to a currently-existing
object, rather than being left over from some previous time.
The idr code makes it possible for callers to perform this check by
ignoring the high-order bits in the ID number. Here, "high-order" is
defined as "all the bits which are not needed to represent the largest
allocated ID." By putting some sort of unique information in the upper
part of the ID (and by limiting the maximum ID number which can be used),
idr users can turn the small ID numbers into unique identifiers. The POSIX
timer and SCTP code use idr in this way; most of the other in-kernel users
treat idr as a sort of unique number generation service and do not perform
this sort of check.
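One way a caller might exploit the ignored high-order bits is to pack a generation counter above the small ID. The sketch below is an assumption-laden illustration, not the POSIX timer or SCTP code: the 16-bit split, the generation counter, and all of the function names are hypothetical, and the actual number of bits which idr ignores depends on the largest ID allocated.

```c
#include <assert.h>

#define ID_BITS  16                     /* assumed cap: IDs below 65536 */
#define ID_MASK  ((1u << ID_BITS) - 1)
#define GEN_MASK (~0u >> ID_BITS)       /* bits left over for the generation */

/* Combine a small ID with a generation counter in the upper bits. */
static unsigned int make_handle(unsigned int id, unsigned int generation)
{
    assert(id <= ID_MASK);
    return ((generation & GEN_MASK) << ID_BITS) | id;
}

/* The low bits are what the ID allocator actually cares about. */
static unsigned int handle_id(unsigned int handle)
{
    return handle & ID_MASK;
}

static unsigned int handle_generation(unsigned int handle)
{
    return handle >> ID_BITS;
}

/* A handle is current only if its generation matches that of the slot. */
static int handle_is_current(unsigned int handle, unsigned int slot_generation)
{
    return handle_generation(handle) == (slot_generation & GEN_MASK);
}
```

If the owner of the ID bumps the generation each time the ID is reused, a stale handle will carry the right low bits but the wrong upper bits, and the mismatch can be caught before the wrong object is touched.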
Page editor: Jonathan Corbet