The current development kernel remains 2.5.67; Linus has not
released a development kernel since April 7. He has been merging
numerous patches into his BitKeeper tree, however; along with the usual
fixes there is some NFS performance tuning, some changes to the workqueue
interface, the merging of s390 and s390x into a single architecture (along
with a bunch of other s390 work), the generation of hotplug events from
kobject registration, a new __user
attribute to mark user-space
pointers (to help find bugs with static analyzers), a small change to the
(it no longer actually starts any I/O), some
reverse-mapping VM speedups, a new requirement that gcc version 2.95 (or
later) be used to compile the kernel, a big pile of small fixes from Alan
Cox, an NFSv4 update, and a big IA-64 update.
Dave Jones has posted a new version of his
"what to expect in 2.5" document. It's a good read for people interested
in testing the new kernel, or for those who are simply interested in what
is coming.
The current stable kernel is 2.4.20. The last 2.4.21 prepatch was
2.4.21-pre7, released on April 4.
Kernel development news
The coming increase in the size of dev_t
adds to the urgency of
the device naming problem. Even if device numbers remain entirely static,
there will be management issues to deal with. Consider the case of SCSI
disks, for example. The wider dev_t
will make it possible to have
thousands of disks on a single system, and the maximum number of partitions
will be increased to 64. /dev
is already a big directory on
modern distributions - over 12,000 entries on a Red Hat Linux 7.3
system, 2000 in the cciss
subdirectory alone. It is unwieldy to
work with now, but consider what happens when
the device names for all those new drives and partitions are added; /dev
now has several hundred thousand entries. And we haven't even
begun to look at all those new serial ports, tape drives, printers, and
CueCat barcode readers we'll be able to add.
Richard Gooch beat the rush and started worrying about this problem some
years ago; the result was devfs. The devfs code has been in the mainline
kernel since the 2.3 days, but it is not heavily used. It puts naming
policy firmly in the kernel itself (you get /dev/disc whether you
like it or not), and it solves persistent permissions issues by way of a
daemon process and a "make a tarball at shutdown" technique that strikes
some as inelegant. Some kernel developers have also made a longstanding
hobby of complaining about the quality of the devfs code.
The end result is that there would seem to be an opening for a different
approach. One alternative began to come into focus this week with the release of udev 0.1. udev is an
effort by Greg Kroah-Hartman (and others) to push the device naming issue
completely into user space, with the result that the kernel hackers would
be free to go off and argue about something else. The current udev
implementation is a minimal demonstration of the concept, but the
longer-term vision calls for three distinct components:
- "namedev" is a subsystem which has the job of coming up with useful
names for devices. It could make use of whatever information is
available: device numbers, hardware ID numbers, filesystem labels,
etc.; it would then apply the site's particular policy to produce a
suitable name. On simple systems, a simple flat file (or hardcoded
names) would suffice; the 4000-disk monster system could dedicate one
drive to a relational database for device naming.
- "libsysfs" would provide a common API for obtaining information about
devices from sysfs.
- "udev" is a separate application which is run in response to hotplug
events; it uses the above two modules to gather the information it
needs, then creates or removes device nodes as appropriate.
In the current release, everything is bundled together into a single "udev"
binary. It requires a series of patches on top of 2.5.67 to create hotplug
events when kobjects are registered (these patches have been merged into
Linus's BitKeeper repository, and thus will be unnecessary for 2.5.68 and
later). Even then, udev can only work with devices which export their device number
via sysfs. Still, your editor had no trouble making it work on his
sacrificial system. Loading the simple block
driver from the driver porting series caused a set of block device
nodes to be created in /udev - with no changes to the driver
required. The basic idea works.
A lot of work remains to be done before udev is ready for prime time,
however. Some of the issues needing resolution are:
- Robust management of device events. The current hotplug mechanism
creates a separate process for each event, each of which runs whatever
program has been designated to handle those events. Among other
things, this mechanism has race conditions; if a device is quickly
attached and removed, the unplug event could end up being processed
first. Attaching a large disk array could create an "event storm"
that threatens to overwhelm the system. So there is a fair amount of
interest in serializing events, but little agreement on how that
should be done.
- A related issue is that multiple programs may want to receive hotplug
events. One might load a driver, another runs udev, yet another
mounts partitions on a newly-attached disk, etc. Possible solutions
here include using Greg's /sbin/hotplug
multiplexor, distributing events in user space with D-BUS, or
distributing them in the kernel via some new mechanism.
- How desirable is per-site device naming policy anyway? A world where
each distribution, if not each installation, has its own device naming
scheme does not look like an improvement to a lot of people. Vendors
cringe at trying to support that sort of setup. So there is a need
for some sort of common policy. The Linux Standard Base decrees that
the devices.txt file is the definitive authority for standard device
names, which is a start. But there is a strong desire for more
flexible and generic naming (all disks under /dev/disk, for
example, with no distinction between SCSI and IDE drives); the device
list will probably have to be revised to fit the dynamic, very large
systems of the future.
All of these issues should be solvable, of course, and the fact that they
are being discussed indicates that people are getting serious about solving
the problems. The 2.6 kernel will probably go out with the larger
dev_t and, perhaps, some hooks for udev-like programs. Things
could get more interesting once the 2.7 development series opens up,
however.
One of the latest bright ideas to go around on the linux-kernel mailing
list is that the messages printed by the kernel should be presented in the
local language. After all, the rest of the system can be localized, but
the kernel remains firmly English-only. Wouldn't it be better to complete
the job?
There are a number of approaches one could take to this sort of problem.
One would be to have the various printk() strings available to the
kernel in all supported languages, with the correct one selected at run
time. One need only look at what that approach would do to the size of the
kernel to reject it outright. Trying to support a compile-time language
option seems impractical at best.
And besides, Linus has been quite clear on
what he thinks of in-kernel localization support:
The answer is: go ahead and do it, but don't do it in the
kernel. Do it in klogd or similar.
So would-be translators are forced to look at user-space solutions. Riley
Williams posted one possible approach: add a
unique message number to each message printed to the kernel. Format
strings passed to printk() are already expected to begin with a
string like "<2>", which provides the log level of the
message. Why not put in, instead, something like
"<2.12345>"? User-space translation code could then use the
message number to index into a file of localized messages.
The devil, of course, is in the details. In the 2.5.67 kernel, there are
almost 52,000 details (in the form of printk() statements). It is
hard to imagine anybody having the patience to go through and assign unique
message numbers to each of those statements. It's even harder to conceive
of anybody being willing to translate that many messages into even a single
other language. They do not make the most exciting reading material,
especially since all the really good profanity is restricted to code
comments. There are very few prospective translators with an itch that
requires scratching that strongly.
Now try to imagine that whole structure of message numbers and translations
surviving past more than about two minor kernel releases. Each new message
would require a new number; just administering the number space would take
quite a bit of somebody's time. Translations would have to keep up with
changes to messages. Bear in mind that the 2.5.67 patch, alone, affected
824 printk() statements. 2.4.20, amazingly, affected more than
6,000. This system would be entirely unmaintainable.
So in-kernel support for internationalization is unlikely in any form.
Whether it can be done entirely externally is another question; Linus suggests trying to translate the messages
directly from text. That, probably, is a way of saying that it will not
happen at all. But one never knows...
The driver porting series this week contains two articles having to do with
memory management; one looks at supporting mmap()
(mapping kernel memory into user space), and the other at
mapping user-space pages into the kernel. In
addition, a couple of older articles (on workqueues)
have been updated to keep them current with recent
kernels. As always, the full set of articles can be found on this page.
Occasionally, a device driver will need to map an address range into a user
process's space. This mapping can be done to give the process direct
access to a device's I/O memory area, or to the driver's DMA buffers. 2.6
features a number of changes to the virtual memory subsystem, but, for most
drivers, supporting mmap()
will be relatively painless.
There are two techniques in use for implementing mmap(); often the
simpler of the two is remap_page_range(). This function
creates a set of page table entries covering a given physical address
range. The prototype of remap_page_range() changed slightly in
2.5.3; the relevant virtual memory area (VMA) pointer must be passed as the
first argument:

    int remap_page_range(struct vm_area_struct *vma, unsigned long from,
                         unsigned long to, unsigned long size,
                         pgprot_t prot);
remap_page_range() is now explicitly documented as requiring that
the memory management semaphore (usually
current->mm->mmap_sem) be held when the function is called.
Drivers will almost invariably call remap_page_range() from their
mmap() method, where that semaphore is already held. So, in other
words, driver writers do not normally need to worry about acquiring
mmap_sem themselves. If you use remap_page_range() from
somewhere other than your mmap() method, however, do be sure you
have acquired the semaphore first.
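Putting the pieces above together, a driver's mmap() method using remap_page_range() might look like the following sketch; mydev_mmap, mydev_phys, and MYDEV_REGION_SIZE are hypothetical driver details, not kernel symbols:

```c
/* Hypothetical mmap() method mapping a device's I/O memory region
 * (physical base mydev_phys, MYDEV_REGION_SIZE bytes) into user space. */
static int mydev_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    /* Refuse mappings that run past the end of the device region. */
    if ((vma->vm_pgoff << PAGE_SHIFT) + size > MYDEV_REGION_SIZE)
        return -EINVAL;

    /* mmap_sem is already held when the mmap() method is called, so
     * remap_page_range() may be invoked directly. */
    if (remap_page_range(vma, vma->vm_start,
                         mydev_phys + (vma->vm_pgoff << PAGE_SHIFT),
                         size, vma->vm_page_prot))
        return -EAGAIN;
    return 0;
}
```

A driver mapping true I/O memory would substitute io_remap_page_range() here, as noted below.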
Note that, if you are remapping into I/O space, you may want to use:
    int io_remap_page_range(struct vm_area_struct *vma, unsigned long from,
                            unsigned long to, unsigned long size,
                            pgprot_t prot);
On all architectures other than SPARC, io_remap_page_range() is
just another name for remap_page_range(). On SPARC systems,
however, io_remap_page_range() uses the system's I/O mapping
hardware to provide access to I/O memory.
remap_page_range() retains its longstanding limitation: it cannot
be used to remap most system RAM. Thus, it works well for I/O memory
areas, but not for internal buffers. For that case, it is necessary to
define a nopage() method. (Yes, if you are curious, the "mark
pages reserved" hack still works as a way of getting around this
limitation, but its use is strongly discouraged).
The other way of implementing mmap() is to override the default VMA
operations to set up a driver-specific nopage()
method. That method will be called to deal with page faults in the mapped
area; it is expected to return a struct page
pointer to satisfy the fault. This
approach is flexible, but it cannot be used to remap I/O
regions; only memory represented in the system memory map can be mapped in
this way.
The nopage() method made it through the entire 2.5 development
series without changes, only to be modified in the 2.6.1 release.
The prototype for that
function used to be:
    struct page *(*nopage)(struct vm_area_struct *area,
                           unsigned long address,
                           int unused);
As of 2.6.1, the unused argument is no longer unused, and the
prototype has changed to:
    struct page *(*nopage)(struct vm_area_struct *area,
                           unsigned long address,
                           int *type);
The type argument is now used to return the type of the page
fault; VM_FAULT_MINOR would indicate a minor fault - one where the
page was in memory, and all that was needed was a page table fixup. A
return of VM_FAULT_MAJOR would, instead, indicate that the page
had to be fetched from disk. Driver code using nopage() to
implement a device mapping would probably return VM_FAULT_MINOR.
In-tree code checks whether type is NULL before assigning
the fault type; other users would be well advised to do the same.
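A nopage() method following the 2.6.1 prototype could be sketched as below; mydev_pages and MYDEV_BUFFER_SIZE are hypothetical driver details:

```c
/* Hypothetical nopage() method for a driver mapping backed by an array
 * of pages (mydev_pages) covering MYDEV_BUFFER_SIZE bytes. */
static struct page *mydev_nopage(struct vm_area_struct *area,
                                 unsigned long address, int *type)
{
    unsigned long offset = (address - area->vm_start) +
                           (area->vm_pgoff << PAGE_SHIFT);
    struct page *page;

    if (offset >= MYDEV_BUFFER_SIZE)
        return NOPAGE_SIGBUS;   /* fault outside the buffer */

    page = mydev_pages[offset >> PAGE_SHIFT];
    get_page(page);             /* the fault path consumes a reference */
    if (type)                   /* check for NULL, as in-tree code does */
        *type = VM_FAULT_MINOR;
    return page;
}
```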
There are a couple of other things worth mentioning. One is that the
vm_operations_struct is rather smaller than it was in 2.4.0; several
methods have gone away (they were actually deleted in 2.4.2). Device
drivers made little use of these methods, and should not be affected by
their removal.
There is also one new vm_operations_struct method:
    int (*populate)(struct vm_area_struct *area, unsigned long address,
                    unsigned long len, pgprot_t prot, unsigned long pgoff,
                    int nonblock);
The populate() method was added in 2.5.46; its purpose is to
"prefault" pages within a VMA. A device driver could certainly implement
this method by simply invoking its nopage() method for each page
within the given range, then using:
    int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
                     unsigned long addr, struct page *page,
                     pgprot_t prot);
to create the page table entries. In practice, however, there is no real
advantage to doing things in this way. No driver in the mainline (2.5.67)
kernel tree implements the populate() method.
Finally, one use of nopage() is to allow a user process to map a
kernel buffer which was created with vmalloc(). In the past, a
driver had to walk through the page tables to find a struct page
corresponding to a vmalloc() address. As of 2.5.5 (and 2.4.19),
however, all that is needed is a call to:
struct page *vmalloc_to_page(void *address);
This call is not a variant of vmalloc() - it allocates no memory.
It simply returns a pointer to the struct page associated with an
address obtained from vmalloc().
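For a vmalloc() buffer, the nopage() pattern shown earlier reduces to a call to vmalloc_to_page(); in this sketch, mydev_buf is a hypothetical vmalloc() allocation backing the mapping:

```c
/* Hypothetical nopage() method for a mapping backed by a vmalloc()
 * buffer (mydev_buf).  vmalloc_to_page() finds the struct page for
 * the faulting offset; no page-table walking is required. */
static struct page *mydev_vmalloc_nopage(struct vm_area_struct *area,
                                         unsigned long address, int *type)
{
    unsigned long offset = (address - area->vm_start) +
                           (area->vm_pgoff << PAGE_SHIFT);
    struct page *page = vmalloc_to_page(mydev_buf + offset);

    get_page(page);
    if (type)
        *type = VM_FAULT_MINOR;
    return page;
}
```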
The kiobuf abstraction was introduced in 2.3 as a low-level way of
representing I/O buffers. Its primary use, perhaps, was to represent
zero-copy I/O operations going directly to or from user space. A number of
problems were found with the kiobuf
interface, however; among
other things, it forced large I/O operations to be broken down into small
chunks, and it was seen as a heavyweight data structure. So, in 2.5.43,
kiobufs were removed from the kernel.
This article looks at how to port drivers which used the kiobuf
interface in 2.4. We'll proceed on the assumption that the real feature of
interest was direct access to user space; there wasn't much motivation to
use a kiobuf otherwise.
Zero-copy block I/O
The 2.6 kernel has a well-developed direct I/O capability for block
devices. So, in general, it will not be necessary for block driver writers
to do anything to implement direct I/O themselves. It all "just works."
Should you have a need to perform zero-copy block operations, it's worth
noting the presence of a useful helper function:
    struct bio *bio_map_user(struct block_device *bdev,
                             unsigned long uaddr,
                             unsigned int len,
                             int write_to_vm);
This function will return a BIO describing a direct operation to the given
block device bdev. The parameters uaddr and len
describe the user-space buffer to be transferred; callers must check the
returned BIO, however, since the area actually mapped might be smaller than
what was requested. The write_to_vm flag is set if the operation
will change memory - that is, if it is a read-from-disk operation. The returned BIO
(which can be NULL - check it) is ready for submission to the
appropriate device driver.
When the operation is complete, undo the mapping with:
void bio_unmap_user(struct bio *bio, int write_to_vm);
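A possible usage sequence, with completion handling elided and mydev_direct_read a hypothetical function, might be:

```c
/* Hypothetical sketch: read 'len' bytes from 'bdev' directly into the
 * user buffer at 'uaddr' using bio_map_user().  A real driver must
 * wait for the I/O to complete before unmapping. */
static int mydev_direct_read(struct block_device *bdev,
                             unsigned long uaddr, unsigned int len)
{
    /* write_to_vm is nonzero: this read from disk will change memory */
    struct bio *bio = bio_map_user(bdev, uaddr, len, 1);

    if (!bio)
        return -ENOMEM;
    if (bio->bi_size < len) {
        /* short mapping; a real driver would loop or fail */
        bio_unmap_user(bio, 1);
        return -EINVAL;
    }
    submit_bio(READ, bio);
    /* ... wait for completion ... */
    bio_unmap_user(bio, 1);
    return 0;
}
```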
Mapping user-space pages
If you have a char driver which needs direct user-space access (a
high-performance streaming tape driver, say), then you'll want to map
user-space pages yourself. The modern replacement is a function called
get_user_pages():

    int get_user_pages(struct task_struct *task,
                       struct mm_struct *mm,
                       unsigned long start,
                       int len,
                       int write,
                       int force,
                       struct page **pages,
                       struct vm_area_struct **vmas);
task is the process performing the mapping; the primary purpose of
this argument is to say who gets charged for page faults incurred while
mapping the pages. This parameter is almost always passed as
current. The memory management structure for the user's address
space is passed in the mm parameter; it is usually
current->mm. Note that get_user_pages() expects that
the caller will have a read lock on mm->mmap_sem.
The start and len parameters describe the user-space buffer to
be mapped; len is in pages. If
the memory will be written to, write should be non-zero. The
force flag forces read or write access, even if the current page
protection would otherwise not allow that access. The pages array
(which should be big enough to hold len entries) will be filled
with pointers to the page structures for the user pages. If
vmas is non-NULL, it will be filled with a pointer to the
vm_area_struct structure containing each page.
The return value is the number of pages actually mapped, or a negative
error code if something goes wrong. Assuming things worked, the user pages
will be present (and locked) in memory, and can be accessed by way of the
struct page pointers. Be aware, of course, that some or all of
the pages could be in high memory.
There is no equivalent put_user_pages() function, so callers of
get_user_pages() must perform the cleanup themselves. There are
two things that need to be done: marking of modified pages, and releasing
them from the page cache. If your device modified the user pages, the
virtual memory subsystem may not know about it, and may fail to write the
pages to permanent storage (or swap). That, of course, could lead to data
corruption and grumpy users. The way to avoid this problem is to call:
SetPageDirty(struct page *page);
for each page in the mapping. Current (2.6.3) kernel code checks to ensure
that pages are not reserved first, with code like:

    if (!PageReserved(page))
        SetPageDirty(page);

But pages mapped from user space should not, normally, be marked reserved
in the first place.
Finally, every mapped page must be released from the page cache, or it will
stay there forever; simply pass each page structure to:
void page_cache_release(struct page *page);
After you have released the page, of course, you should not access it
again.
driver, see the definition of sgl_map_user_pages() in
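The whole cycle - lock, map, do the I/O, dirty, release - might be sketched as follows; mydev_map_user and its parameters are hypothetical:

```c
/* Hypothetical sketch of the full get_user_pages() cycle for a char
 * driver.  'uaddr' and 'nr_pages' describe a user buffer the device
 * will write into; 'pages' must hold at least nr_pages entries. */
static int mydev_map_user(unsigned long uaddr, int nr_pages,
                          struct page **pages)
{
    int i, got;

    down_read(&current->mm->mmap_sem);
    got = get_user_pages(current, current->mm, uaddr, nr_pages,
                         1 /* write */, 0 /* force */, pages, NULL);
    up_read(&current->mm->mmap_sem);
    if (got <= 0)
        return got ? got : -EFAULT;

    /* ... perform the device I/O into the mapped pages ... */

    for (i = 0; i < got; i++) {
        SetPageDirty(pages[i]);         /* the device wrote to them */
        page_cache_release(pages[i]);   /* drop the page cache reference */
    }
    return 0;
}
```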
Patches and updates
Core kernel code
- Andries.Brouwer@cwi.nl: kdevt-diff.
(April 14, 2003)
Filesystems and block I/O
- Rik van Riel: rmap 15f.
(April 13, 2003)
Benchmarks and bugs
Page editor: Jonathan Corbet