Brief items
The current 2.6 prepatch is 2.6.18-rc6,
released by Linus on
September 3. It is possibly the final prepatch before the 2.6.18
final release. There are a lot of fixes; one of those is the removal of
much of the
SMP alternatives
work, which was causing build problems with some compilers. See
the long-format changelog for the details.
A handful of additional fixes have gone into the mainline git repository
since the -rc6 release.
The current -mm tree is 2.6.18-rc5-mm1. Recent changes
to -mm include some enhancements to the no-MMU architecture support, a
number of NFS server improvements, a steady trickle of reiser4 fixes, and a
set of patches allowing a kernel to be built with no block device support
(though those don't quite work yet).
On the 2.4 front, Willy Tarreau has released 2.4.33.3 with a small set of
important fixes, and 2.4.34-pre2 with a larger set of
fixes. The current plan is that the gcc 4.x support discussed
here a couple of weeks ago will be merged into 2.4.34-pre3.
Other notes: a number of developers have been having difficulties
posting patches to lists hosted on vger.kernel.org recently. The list
maintainers have begun using bogofilter in an attempt to cut down on the
amount of spam getting through to the list - and, presumably, to reduce the
amount of time they put into manually maintaining filter patterns. To
date, bogofilter appears to believe that quite a few patches are spam. On
the other hand, a substantial amount of distinctly non-technical mail
involving nontraditional approaches to family quality time and members of
the animal kingdom has been
sailing through without a hitch. One can only assume that the training
process will eventually iron out these little problems.
Kernel development news
"Tiedostoa tai hakemistoa ei ole" just means "No such file or directory".
EVERYBODY knows that.
It's not like there's even a single ä or ö in the whole sentence, so you
can't even blame the strange letters (and that's unusual, since in
Finnish, usually every other letter is 'ä' if only just to confuse the
uninitiated).
--
Linus Torvalds
Device drivers are generally done inside the kernel for the usual reasons
of performance and control. There are times, however, when the ability to
run a device driver in user space is helpful. These include situations
where the code is far too large to go into the kernel (X.org, for example)
and where the author of the driver does not wish to
place the code under the GPL. Some types of drivers (such as those for USB
devices) are easily run in user space now, but others can be a bit more
challenging. Very few PCI drivers, for example, are written in user space.
Thomas Gleixner has written an
interface module which may help to change that situation. With this
code in place, PCI drivers (some of them, at least) can be written almost
entirely in user space, with only a small stub module loaded into the kernel.
That module has two specific jobs to carry out. The first is to register
the device to be driven, with a couple of bits of important information.
To that end, it should fill out an iio_device structure, which
contains the following fields:
    struct iio_device {
        char *name;
        char *version;
        unsigned long physaddr;
        void *virtaddr;
        unsigned long size;
        long irq;
        irqreturn_t (*handler)(int irq, void *dev_id, struct pt_regs *regs);
        ssize_t (*event_write)(struct file *filep, const char __user *buf,
                               size_t count, loff_t *ppos);
        struct file_operations *fops;
        void *priv;
        /* ... */
    };
The first part of the structure provides information about the hardware
to be driven - its name, where its I/O memory area lives
(physaddr), where that area has been mapped into the kernel
(virtaddr), its size, and the interrupt being used by the device.
If virtaddr is zero, then physaddr is interpreted as the
beginning of a range of I/O ports, rather than a memory address.
The fops field provides the file operations for the device;
normally, they are set to the generic versions provided by the IIO (for
"industrial I/O") driver: iio_open(), iio_read(),
iio_mmap(), etc.
With this setup, the driver can create a
basic device which allows a user-space program to read from or write to
device memory (or ports). I/O memory can also be mapped into user space.
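From the user-space side, mapping the device's I/O memory might look
something like the sketch below. It is an illustration only, not code from
the patch: it assumes that the device node accepts an mmap() call at offset
zero, and the helper names (map_device_window(), reg_read32(),
reg_write32()) are invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of the device node at path, starting at offset zero
 * (an assumption about how the node behaves).  Returns NULL on failure.
 */
volatile uint32_t *map_device_window(const char *path, size_t len)
{
    void *base;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return NULL;

    base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after the descriptor is closed */

    return base == MAP_FAILED ? NULL : (volatile uint32_t *)base;
}

/* Read the 32-bit register at byte offset off (off must be a multiple of 4). */
uint32_t reg_read32(volatile uint32_t *win, size_t off)
{
    return win[off / 4];
}

/* Write the 32-bit register at byte offset off (off must be a multiple of 4). */
void reg_write32(volatile uint32_t *win, size_t off, uint32_t value)
{
    win[off / 4] = value;
}
```

The volatile qualifier keeps the compiler from caching or reordering away
register accesses; a real driver would also need to unmap the window with
munmap() when it is done.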
The capabilities described thus far are not all that different from what
can be done with /dev/mem; the main difference is that the stub
driver can enable the PCI device and perform any other needed
initialization. The real hitch in writing user-space PCI drivers, however,
has been in the handling of interrupts. There is currently no way to write
a user-space interrupt handler, and the IIO patch doesn't really change
that. Instead, the stub driver is expected to provide a minimal interrupt
handler of its own.
This handler is needed because every device requires its own specific
interrupt acknowledgment ritual. The kernel must respond quickly to an
interrupt and give the device the attention it craves so that said device
will stop asserting the interrupt. After that, any additional processing
can be done at relative leisure. So, once the handler provided with the
stub driver acknowledges the interrupt, the rest of the work can normally
be done by the user-space driver.
All that is needed is to let this driver know that the interrupt has
happened. The IIO module provides a couple of mechanisms for that
purpose. One is a second device node associated with the device; whenever
an interrupt happens, a byte can be read from this "event device." So a
user-space driver can simply block on a read from that device, or it can
use poll() in more complicated situations. It is also possible
for the user-space driver to receive SIGIO signals when an
interrupt happens, but using signals will normally increase the ultimate
response time to the interrupt.
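The poll()-based approach can be sketched as follows. This is a minimal
illustration rather than code from the patch: wait_for_irq_event() is a
hypothetical helper, and it relies only on the behavior described above,
namely that one byte becomes readable on the event device for each
interrupt.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds (-1 means forever) for one interrupt
 * notification on event_fd.
 * Returns 1 if an event byte was consumed, 0 on timeout, -1 on error.
 */
int wait_for_irq_event(int event_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = event_fd, .events = POLLIN };
    unsigned char token;
    ssize_t n;
    int ready;

    /* Retry if a signal interrupts the wait. */
    do {
        ready = poll(&pfd, 1, timeout_ms);
    } while (ready < 0 && errno == EINTR);

    if (ready <= 0)
        return ready;           /* 0: timed out, -1: poll() failed */
    if (!(pfd.revents & POLLIN))
        return -1;              /* error or hangup with nothing to read */

    /* Consume the single byte that stands for this interrupt. */
    do {
        n = read(event_fd, &token, 1);
    } while (n < 0 && errno == EINTR);

    return n == 1 ? 1 : -1;
}
```

A driver's main loop would call this helper on the opened event device and
then service the hardware through its mapped registers each time it
returns 1.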
So, to make all this happen, the stub driver provides a minimal interrupt
handler in the handler() field of the iio_device
structure. When an interrupt happens, the IIO module will call this handler;
if it returns IRQ_HANDLED, user space will be notified.
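A user-space model of that arrangement is sketched below. Everything in it
is an assumption made for illustration: the typedefs stand in for kernel
headers so that the code compiles outside the kernel, the register layout
(a "pending" bit which is acknowledged by writing it to a second register)
belongs to an imaginary device, and model_dispatch() only mimics the rule
that user space is notified when the handler returns IRQ_HANDLED.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for kernel definitions, so the sketch builds in user space. */
typedef int irqreturn_t;
#define IRQ_NONE    0
#define IRQ_HANDLED 1
struct pt_regs;                     /* opaque; unused by this sketch */

/* Hypothetical device: bit 0 of status means "interrupt pending". */
#define DEMO_IRQ_PENDING 0x1u

struct demo_regs {
    volatile uint32_t status;
    volatile uint32_t ack;
};

struct demo_dev {
    struct demo_regs *regs;
    unsigned long events;           /* models notifications sent to user space */
};

/* Minimal handler: claim the interrupt only if this device raised it. */
irqreturn_t demo_handler(int irq, void *dev_id, struct pt_regs *regs)
{
    struct demo_dev *dev = dev_id;

    (void)irq;
    (void)regs;

    if (!(dev->regs->status & DEMO_IRQ_PENDING))
        return IRQ_NONE;            /* not ours, e.g. on a shared line */

    dev->regs->ack = DEMO_IRQ_PENDING;          /* device-specific acknowledgment */
    dev->regs->status &= ~DEMO_IRQ_PENDING;     /* simulates the device deasserting */
    return IRQ_HANDLED;
}

/* Model of the dispatch rule: notify user space only for claimed interrupts. */
void model_dispatch(struct demo_dev *dev, int irq)
{
    if (demo_handler(irq, dev, NULL) == IRQ_HANDLED)
        dev->events++;
}
```

The point of the split is that only the acknowledgment needs to know the
device's register layout and run in the kernel; everything after the event
count is bumped can happen in user space.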
If the stub driver provides an event_write() function, that
function will be called in response to a write operation on the event
device. This capability can be used to further control the kernel-space
response to interrupts, request that interrupts be masked, etc.
Readers who think that the event mechanism shares some features with the
proposed kevent subsystem are right. It is probable that the IIO event
handling code will be rewritten to use kevents, if and when kevents are
merged into the mainline.
Meanwhile, the IIO driver works. Thomas has posted an example driver (or parts of one, anyway) to
show how this mechanism can be used. The real question which appears to be
on a number of minds, though, is: could ATI and nVidia use IIO to move
their drivers out of the kernel? Only those vendors can answer that
question, so, until they say something, nobody really knows.
Back in 2003, Jeff Garzik
announced the availability of "a
new SCSI driver." That driver was, in fact, the libata subsystem, which
was to be the foundation for serial ATA support in Linux. In the process,
however, Jeff had thought a bit about supporting the current parallel ATA
(PATA) drives, but that was not really his goal:
Note that PATA in my driver is only an afterthought. The main area
of focus, now and in the future, is SATA.
In the last three years, the parallel ATA drives that most of us use have
continued to be driven by the old IDE driver subsystem. Some of this code
dates back to the beginning of Linux; since then it has been maintained by
a substantial list of people, a number of whom are widely held to have been
driven insane by the experience. The current maintainer, Bartlomiej
Zolnierkiewicz, has kept a rather low profile for some time now; he
signed off no patches in either the 2.6.17 or the upcoming 2.6.18 kernel.
Not much has been happening in the IDE area.
That does not mean that things have been quiet in the parallel ATA area,
however. Over the last year or so, Alan Cox has been working to bring full
PATA support into the libata code. The resulting drivers have been sitting
in the -mm tree for a while, but that period is about to end: the PATA
driver set has been queued for
merging into 2.6.19.
The stated advantages of the new PATA code are many. The code has been
reworked from the beginning, and is up to current kernel standards. The
use of libata means that these drivers are well integrated with their SATA
cousins, bringing two divergent subsystems back together. The new drivers
support a number of chipsets that the IDE layer doesn't handle. Error
handling has been much improved. Also, according to Alan's announcement from August,
the new drivers feature "active maintenance and updates" and "more
interesting bugs to find and help fix."
On the other hand, the new PATA drivers are not considered to be ready for
production use yet, and distributors are not expected to enable them in the
near future. The merging into 2.6.19 is intended mainly to broaden the
test base. A completely new disk subsystem is the sort of thing that one
likes to test very well before entrusting it with data that one wishes to
actually keep; that process may go on for a little while yet. It is also
worth noting that the new PATA code drops support for some ancient IDE
controllers.
The issue that gets everybody's attention, however, is that, as with all
drives handled through libata, PATA drives show up as if they were SCSI
disks, and are named /dev/sd*. Anybody who just switches to the
new drivers without updating /etc/fstab (or using the
mount-by-label feature) is likely to have a rough bootstrap experience.
That is an easy problem to work around, but the use of the SCSI drive
namespace seems to bother some people. What appears to be happening in
reality is that Linux is slowly moving toward having a generic disk
subsystem, where everything can just be called /dev/diskN. All
that's left is a few details and a new set of udev rules to rename the
device nodes.
Someday, most of us will be using the new PATA code. But this is not a
process which is expected to go quickly, and there are no plans to remove
or deprecate the existing IDE code:
At this point in time it is premature to discuss or plan the point
at which the old IDE layer would go away. That discussion can start
at the point where everyone is happy that the new libata based
layer is providing better quality and coverage than the old
one. Even then there would be no need to hurry.
So it appears that Linux will have parallel subsystems for parallel ATA
support for some time.
Paravirtualized systems are operating systems unto themselves - they look
like independent systems to the greatest extent possible. In the end,
however, a paravirtualized system is still running under a host, and must
interact with that host. A recent set of patches (entitled "guest page
hinting") shows how running
paravirtualized systems in a fully independent mode can hurt performance -
and the sorts of tricks which can be required to make things run more
efficiently.
Consider, for example, a short-lived application which runs on a guest
system. That application may dirty a number of pages, then exit, its job
finished. The guest system knows that the dirty pages are no longer in
use, and can be recycled. From the host's point of view, however, the only
thing known is that the pages are dirty. So the host will, if it needs to
reclaim those pages, carefully write their (useless) data out to swap
first. This is wasted effort which it would be nice to avoid.
The hinting patches add a couple of low-level primitives for use by guest
operating systems: set_page_unused() and
set_page_stable(). The former marks a page as being unneeded by
the guest, while the latter marks the page as being in active use. The
s/390 architecture (which is the main target for this patch set currently)
can implement these states through a pair of page flags which the guest can
set, making the operations fast. Once pages have been marked as unused,
the host system can reclaim them with no further effort, making the whole
virtual memory subsystem more efficient.
The next step is to consider page cache pages. These pages will contain
data from a file found on a storage device somewhere, meaning that they can
be recreated from the source if need be. That, in turn, means that the
host could discard them in response to memory pressure. But, once again,
the host knows nothing about the
guests' page caches. So the hinting patches add another state, called
"volatile," to mark pages with backing store. When the host is feeling
memory pressure, it is
free to discard volatile pages without saving their contents
first. It must, however, make sure that the guest system knows that
this action has taken place so that the page can be removed from the
guest's page cache. In the current patch set, this notification only works
for s/390 machines, however.
Pages which have been locked into memory pose an extra challenge here -
they can be part of the page cache, but they still shouldn't be taken away
by the host system. So such pages cannot be marked as "volatile." The
problem is that figuring out if a page is locked is harder than it might
seem; it can involve scanning a list of virtual memory area (VMA)
structures, which is slow. So the hinting patches add a new flag to the
address_space structure to note that somebody has locked pages
from that address space in memory. When the flag is set, those pages are
not marked as being volatile.
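The three page states, and what each allows the host to do, can be
summarized in a toy model. The code below is not taken from the patch set:
the enum and function names are invented, and the logic simply restates the
rules described above (unused pages are dropped, volatile pages are dropped
with a notification to the guest, and everything else must be preserved).

```c
#include <stdbool.h>

/* Hint states a guest can attach to a page (names are hypothetical). */
enum page_hint { PAGE_STABLE, PAGE_UNUSED, PAGE_VOLATILE };

/* What the host has to do in order to reclaim a page in a given state. */
enum host_action {
    HOST_WRITE_TO_SWAP,     /* contents matter: save them first */
    HOST_DROP,              /* contents are garbage: just take the page */
    HOST_DROP_AND_NOTIFY,   /* guest can refetch: take it, then tell the guest */
};

/*
 * Guest side: choose a hint for a page.  The last argument models the
 * address_space flag noting that somebody has locked pages from that
 * mapping in memory.
 */
enum page_hint guest_page_hint(bool in_use, bool has_backing_store,
                               bool mapping_has_locked_pages)
{
    if (!in_use)
        return PAGE_UNUSED;
    if (has_backing_store && !mapping_has_locked_pages)
        return PAGE_VOLATILE;
    return PAGE_STABLE;
}

/* Host side: decide how to reclaim a page carrying the given hint. */
enum host_action host_reclaim_action(enum page_hint hint)
{
    switch (hint) {
    case PAGE_UNUSED:
        return HOST_DROP;
    case PAGE_VOLATILE:
        return HOST_DROP_AND_NOTIFY;
    case PAGE_STABLE:
    default:
        return HOST_WRITE_TO_SWAP;
    }
}
```

Note how the locked-pages flag is deliberately coarse in this model: one
locked page keeps every page in the mapping stable, trading some lost
reclaim opportunities for not having to scan the VMA list.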
The swap cache also benefits from some hinting work - once the guest has written
a page to swap, that page has good backing store and can be grabbed by the
host system. The approach taken is similar to that used with the page
cache, though there are a few extra details to take care of. For example,
the guest must take care to have the page marked stable (and deal with its
potentially having been discarded by the host) before freeing the
associated entry in the swap area.
Attentive readers may have noticed that these patches are heavily oriented
toward the s/390 architecture. IBM has, of course, been doing
virtualization for a very long time, so it is not surprising that some
relatively advanced virtualization patches are coming from that direction -
or that IBM's architectures are designed with virtualization in mind.
Other paravirtualization projects will encounter many of the same issues,
however, and may well benefit from this work. So the next stage for this
patch set should be consideration by other projects and possible work to
make the hinting features more generally applicable.
Patches and updates
Filesystems and block I/O
- Andreas Gruenbacher: Tmpfs acls.
(September 4, 2006)
Page editor: Jonathan Corbet