Kernel development
Brief items
Kernel release status
The current development kernel remains 2.5.67; Linus has not released a development kernel since April 7. He has been merging numerous patches into his BitKeeper tree, however; along with the usual fixes there is some NFS performance tuning, some changes to the workqueue interface, the merging of s390 and s390x into a single architecture (along with a bunch of other s390 work), the generation of hotplug events from kobject registration, a new __user attribute to mark user-space pointers (to help find bugs with static analyzers), a small change to the semantics of msync(MS_ASYNC) (it no longer actually starts any I/O), some reverse-mapping VM speedups, a new requirement that gcc version 2.95 (or later) be used to compile the kernel, a big pile of small fixes from Alan Cox, an NFSv4 update, and a big IA-64 update.

Dave Jones has posted a new version of his "what to expect in 2.5" document. It's a good read for people interested in testing the new kernel, or for those who are simply interested in what has changed.
The current stable kernel is 2.4.20. The last 2.4.21 prepatch was 2.4.21-pre7, released on April 4.
Kernel development news
Managing dynamic device naming
The coming increase in the size of dev_t adds to the urgency of the device naming problem. Even if device numbers remain entirely static, there will be management issues to deal with. Consider the case of SCSI disks, for example. The wider dev_t will make it possible to have thousands of disks on a single system, and the maximum number of partitions will be increased to 64. /dev is already a big directory on modern distributions - over 12,000 entries on a Red Hat Linux 7.3 system, 2000 in the cciss subdirectory alone. It is unwieldy to work with now, but consider what happens when the device names for all those new drives and partitions are added; suddenly /dev has several hundred thousand entries. And we haven't even begun to look at all those new serial ports, tape drives, printers, and CueCat barcode readers we'll be able to add.

Richard Gooch beat the rush and started worrying about this problem some years ago; the result was devfs. The devfs code has been in the mainline kernel since the 2.3 days, but it is not heavily used. It puts naming policy firmly in the kernel itself (you get /dev/disc whether you like it or not), and it solves persistent permissions issues by way of a daemon process and a "make a tarball at shutdown" technique that strikes some as inelegant. Some kernel developers have also made a longstanding hobby of complaining about the quality of the devfs code.
The end result is that there would seem to be an opening for a different approach. One alternative began to come into focus this week with the release of udev 0.1. udev is an effort by Greg Kroah-Hartman (and others) to push the device naming issue completely into user space, with the result that the kernel hackers would be free to go off and argue about something else. The current udev implementation is a minimal demonstration of the concept, but the longer-term vision calls for three distinct components:
- "namedev" is a subsystem which has the job of coming up with useful
names for devices. It could make use of whatever information is
available: device numbers, hardware ID numbers, filesystem labels,
etc.; it would then apply the site's particular policy to produce a
suitable name. On simple systems, a simple flat file (or hardcoded
names) would suffice; the 4000-disk monster system could dedicate one
drive to a relational database for device naming.
- "libsysfs" would provide a common API for obtaining information about
devices from sysfs.
- "udev" is a separate application which is run in response to hotplug events; it uses the above two modules to gather the information it needs, then creates or removes device nodes as appropriate.
In the current release, everything is bundled together into a single "udev" binary. It requires a series of patches on top of 2.5.67 to create hotplug events when kobjects are registered (these patches have been merged into Linus's BitKeeper repository, and thus will be unnecessary for 2.5.68 and later kernels), and, even then, can only work with devices which export their device number via sysfs. Still, your editor had no trouble making it work on his sacrificial system. Loading the simple block driver from the driver porting series caused a set of block device nodes to be created in /udev - with no changes to the driver required. The basic idea works.
A lot of work remains to be done before udev is ready for prime time, however. Some of the issues needing resolution are:
- Robust management of device events. The current hotplug mechanism
creates a separate process for each event, each of which runs whatever
program has been designated to handle those events. Among other
things, this mechanism has race conditions; if a device is quickly
attached and removed, the unplug event could end up being processed
first. Attaching a large disk array could create an "event storm"
that threatens to overwhelm the system. So there is a fair amount of
interest in serializing events, but little agreement on how that
should be done.
- A related issue is that multiple programs may want to receive hotplug
events. One might load a driver, another runs udev, yet another
mounts partitions on a newly-attached disk, etc. Possible solutions
here include using Greg's /sbin/hotplug
multiplexor, distributing events in user space with D-BUS, or
distributing them in the kernel via a new
event interface.
- How desirable is per-site device naming policy anyway? A world where each distribution, if not each installation, has its own device naming scheme does not look like an improvement to a lot of people. Vendors cringe at trying to support that sort of setup. So there is a need for some sort of common policy. The Linux Standard Base decrees that the LANANA devices.txt file is the definitive authority for standard device names, which is a start. But there is a strong desire for more flexible and generic naming (all disks under /dev/disk, for example, with no distinction between SCSI and IDE drives); the device list will probably have to be revised to fit the dynamic, very large systems of the future.
All of these issues should be solvable, of course, and the fact that they are being discussed indicates that people are getting serious about solving the problems. The 2.6 kernel will probably go out with the larger dev_t and, perhaps, some hooks for udev-like programs. Things could get more interesting once the 2.7 development series opens up, however.
Time to internationalize the kernel?
One of the latest bright ideas to go around on the linux-kernel mailing list is that the messages printed by the kernel should be presented in the local language. After all, the rest of the system can be localized, but the kernel remains firmly English-only. Wouldn't it be better to complete the job?

There are a number of approaches one could take to this sort of problem. One would be to have the various printk() strings available to the kernel in all supported languages, with the correct one selected at run time. One need only look at what that approach would do to the size of the kernel to reject it outright. Trying to support a compile-time language option seems impractical at best.
And besides, Linus has been quite clear that he wants no part of in-kernel localization support.
So would-be translators are forced to look at user-space solutions. Riley Williams posted one possible approach: add a unique message number to each message printed to the kernel. Format strings passed to printk() are already expected to begin with a string like "<2>", which provides the log level of the message. Why not put in, instead, something like "<2.12345>"? User-space translation code could then use the message number to index into a file of localized messages.
The devil, of course, is in the details. In the 2.5.67 kernel, there are almost 52,000 details (in the form of printk() statements). It is hard to imagine anybody having the patience to go through and assign unique message numbers to each of those statements. It's even harder to conceive of anybody being willing to translate that many messages into even a single other language. They do not make the most exciting reading material, especially since all the really good profanity is restricted to code comments. There are very few prospective translators with an itch that requires scratching that strongly.
Now try to imagine that whole structure of message numbers and translations surviving past more than about two minor kernel releases. Each new message would require a new number; just administering the number space would take quite a bit of somebody's time. Translations would have to keep up with changes to messages. Bear in mind that the 2.5.67 patch, alone, affected 824 printk() statements. The 2.4.20 patch, amazingly, affected more than 6,000. This system would be entirely unmaintainable.
So in-kernel support for internationalization is unlikely in any form. Whether it can be done entirely externally is another question; Linus suggests trying to translate the messages directly from text. That, probably, is a way of saying that it will not happen at all. But one never knows...
Driver porting
This week in the driver porting series
The driver porting series this week contains two articles having to do with memory management; one looks at supporting the mmap() system call (mapping kernel memory into user space), and the other at get_user_pages() (mapping user space pages into the kernel). In addition, a couple of older articles (on workqueues and the BIO structure) have been updated to keep them current with recent kernels. As always, the full set of articles can be found on this page.

Driver porting: supporting mmap()
| This article is part of the LWN Porting Drivers to 2.6 series. |
Using remap_page_range()
There are two techniques in use for implementing mmap(); often the simpler of the two is using remap_page_range(). This function creates a set of page table entries covering a given physical address range. The prototype of remap_page_range() changed slightly in 2.5.3; the relevant virtual memory area (VMA) pointer must be passed as the first parameter:
int remap_page_range(struct vm_area_struct *vma, unsigned long from,
unsigned long to, unsigned long size,
pgprot_t prot);
remap_page_range() is now explicitly documented as requiring that the memory management semaphore (usually current->mm->mmap_sem) be held when the function is called. Drivers will almost invariably call remap_page_range() from their mmap() method, where that semaphore is already held. So, in other words, driver writers do not normally need to worry about acquiring mmap_sem themselves. If you use remap_page_range() from somewhere other than your mmap() method, however, do be sure you have acquired the semaphore first.
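Putting the pieces above together, a minimal mmap() method built on remap_page_range() might look like the following sketch. It assumes a hypothetical driver with an I/O memory region at my_dev_phys_addr of size MY_REGION_SIZE; those names are illustrative, not part of any real API.

```c
/*
 * Sketch of an mmap() method (2.5.67-era API) which maps a device's
 * I/O memory region into user space with remap_page_range().
 * my_dev_phys_addr and MY_REGION_SIZE are assumed driver values.
 */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	if (size > MY_REGION_SIZE)
		return -EINVAL;
	/* mmap_sem is already held when the mmap() method is invoked */
	if (remap_page_range(vma, vma->vm_start, my_dev_phys_addr,
			     size, vma->vm_page_prot))
		return -EAGAIN;
	return 0;
}
```

A driver mapping I/O space instead of regular memory would simply substitute io_remap_page_range() for the remap_page_range() call.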
Note that, if you are remapping into I/O space, you may want to use:
int io_remap_page_range(struct vm_area_struct *vma, unsigned long from,
unsigned long to, unsigned long size,
pgprot_t prot);
On all architectures other than SPARC, io_remap_page_range() is just another name for remap_page_range(). On SPARC systems, however, io_remap_page_range() uses the system's I/O mapping hardware to provide access to I/O memory.
remap_page_range() retains its longstanding limitation: it cannot be used to remap most system RAM. Thus, it works well for I/O memory areas, but not for internal buffers. For that case, it is necessary to define a nopage() method. (Yes, if you are curious, the "mark pages reserved" hack still works as a way of getting around this limitation, but its use is strongly discouraged).
Using vm_operations
The other way of implementing mmap is to override the default VMA operations to set up a driver-specific nopage() method. That method will be called to deal with page faults in the mapped area; it is expected to return a struct page pointer to satisfy the fault. The nopage() approach is flexible, but it cannot be used to remap I/O regions; only memory represented in the system memory map can be mapped in this way.

The nopage() method made it through the entire 2.5 development series without changes, only to be modified in the 2.6.1 release. The prototype for that function used to be:
struct page *(*nopage)(struct vm_area_struct *area,
unsigned long address,
int unused);
As of 2.6.1, the unused argument is no longer unused, and the prototype has changed to:
struct page *(*nopage)(struct vm_area_struct *area,
unsigned long address,
int *type);
The type argument is now used to return the type of the page fault; VM_FAULT_MINOR would indicate a minor fault - one where the page was in memory, and all that was needed was a page table fixup. A return of VM_FAULT_MAJOR would, instead, indicate that the page had to be fetched from disk. Driver code using nopage() to implement a device mapping would probably return VM_FAULT_MINOR. In-tree code checks whether type is NULL before assigning the fault type; other users would be well advised to do the same.
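As an illustration, a nopage() method for a driver exporting an in-memory buffer could look something like the sketch below; my_buffer and MY_BUFFER_SIZE are assumed, driver-specific names.

```c
/*
 * Sketch of a nopage() method (post-2.6.1 prototype) which satisfies
 * faults from a driver's in-memory buffer. my_buffer and
 * MY_BUFFER_SIZE are assumed driver values.
 */
static struct page *my_nopage(struct vm_area_struct *vma,
			      unsigned long address, int *type)
{
	unsigned long offset = address - vma->vm_start;
	struct page *page;

	if (offset >= MY_BUFFER_SIZE)
		return NOPAGE_SIGBUS;
	page = virt_to_page(my_buffer + offset);
	get_page(page);		/* take a reference to the page */
	if (type)		/* check for NULL, as in-tree code does */
		*type = VM_FAULT_MINOR;
	return page;
}
```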
There are a couple of other things worth mentioning. One is that the vm_operations_struct is rather smaller than it was in 2.4.0; the protect(), swapout(), sync(), unmap(), and wppage() methods have all gone away (they were actually deleted in 2.4.2). Device drivers made little use of these methods, and should not be affected by their removal.
There is also one new vm_operations_struct method:
int (*populate)(struct vm_area_struct *area, unsigned long address,
unsigned long len, pgprot_t prot, unsigned long pgoff,
int nonblock);
The populate() method was added in 2.5.46; its purpose is to "prefault" pages within a VMA. A device driver could certainly implement this method by simply invoking its nopage() method for each page within the given range, then using:
int install_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, struct page *page,
pgprot_t prot);
to create the page table entries. In practice, however, there is no real advantage to doing things in this way. No driver in the mainline (2.5.67) kernel tree implements the populate() method.
Finally, one use of nopage() is to allow a user process to map a kernel buffer which was created with vmalloc(). In the past, a driver had to walk through the page tables to find a struct page corresponding to a vmalloc() address. As of 2.5.5 (and 2.4.19), however, all that is needed is a call to:
struct page *vmalloc_to_page(void *address);
This call is not a variant of vmalloc() - it allocates no memory. It simply returns a pointer to the struct page associated with an address obtained from vmalloc().
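So a nopage() method for a vmalloc()ed buffer reduces to little more than an offset calculation; in this sketch, my_vmalloc_buf and MY_VBUF_SIZE are assumed driver values.

```c
/*
 * Sketch: satisfying faults from a vmalloc()ed buffer with
 * vmalloc_to_page(). my_vmalloc_buf and MY_VBUF_SIZE are assumed.
 */
static struct page *my_vmalloc_nopage(struct vm_area_struct *vma,
				      unsigned long address, int *type)
{
	unsigned long offset = address - vma->vm_start;
	struct page *page;

	if (offset >= MY_VBUF_SIZE)
		return NOPAGE_SIGBUS;
	page = vmalloc_to_page(my_vmalloc_buf + offset);
	get_page(page);
	if (type)
		*type = VM_FAULT_MINOR;
	return page;
}
```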
Driver porting: Zero-copy user-space access
| This article is part of the LWN Porting Drivers to 2.6 series. |
This article looks at how to port drivers which used the kiobuf interface in 2.4. We'll proceed on the assumption that the real feature of interest was direct access to user space; there wasn't much motivation to use a kiobuf otherwise.
Zero-copy block I/O
The 2.6 kernel has a well-developed direct I/O capability for block devices. So, in general, it will not be necessary for block driver writers to do anything to implement direct I/O themselves. It all "just works."

Should you have a need to perform zero-copy block operations, it's worth noting the presence of a useful helper function:
struct bio *bio_map_user(struct block_device *bdev,
unsigned long uaddr,
unsigned int len,
int write_to_vm);
This function will return a BIO describing a direct operation to the given block device bdev. The parameters uaddr and len describe the user-space buffer to be transferred; callers must check the returned BIO, however, since the area actually mapped might be smaller than what was requested. The write_to_vm flag is set if the operation will change memory - if it is a read-from-disk operation. The returned BIO (which can be NULL - check it) is ready for submission to the appropriate device driver.
When the operation is complete, undo the mapping with:
void bio_unmap_user(struct bio *bio, int write_to_vm);
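A typical map/submit/unmap cycle might be sketched as follows; error handling is abbreviated, and bdev, uaddr, and len are assumed to have been set up elsewhere.

```c
/*
 * Sketch: direct read from disk into a user buffer via bio_map_user().
 * bdev, uaddr, and len are assumed to exist in the calling code.
 */
struct bio *bio = bio_map_user(bdev, uaddr, len, 1 /* write_to_vm */);

if (bio) {
	if (bio->bi_size < len) {
		/* the mapped area can be smaller than requested */
	}
	submit_bio(READ, bio);
	/* ... wait for the operation to complete ... */
	bio_unmap_user(bio, 1);
}
```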
Mapping user-space pages
If you have a char driver which needs direct user-space access (a high-performance streaming tape driver, say), then you'll want to map user-space pages yourself. The modern equivalent of map_user_kiobuf() is a function called get_user_pages():
int get_user_pages(struct task_struct *task,
struct mm_struct *mm,
unsigned long start,
int len,
int write,
int force,
struct page **pages,
struct vm_area_struct **vmas);
task is the process performing the mapping; the primary purpose of this argument is to say who gets charged for page faults incurred while mapping the pages. This parameter is almost always passed as current. The memory management structure for the user's address space is passed in the mm parameter; it is usually current->mm. Note that get_user_pages() expects that the caller will have a read lock on mm->mmap_sem. The start and len parameters describe the user buffer to be mapped; len is in pages. If the memory will be written to, write should be non-zero. The force flag forces read or write access, even if the current page protection would otherwise not allow that access. The pages array (which should be big enough to hold len entries) will be filled with pointers to the page structures for the user pages. If vmas is non-NULL, it will be filled with a pointer to the vm_area_struct structure containing each page.
The return value is the number of pages actually mapped, or a negative error code if something goes wrong. Assuming things worked, the user pages will be present (and locked) in memory, and can be accessed by way of the struct page pointers. Be aware, of course, that some or all of the pages could be in high memory.
There is no equivalent put_user_pages() function, so callers of get_user_pages() must perform the cleanup themselves. There are two things that need to be done: marking of modified pages, and releasing them from the page cache. If your device modified the user pages, the virtual memory subsystem may not know about it, and may fail to write the pages to permanent storage (or swap). That, of course, could lead to data corruption and grumpy users. The way to avoid this problem is to call:
SetPageDirty(struct page *page);
for each page in the mapping. Current (2.6.3) kernel code checks to ensure that pages are not reserved first with code like:
if (!PageReserved(page))
SetPageDirty(page);
But pages mapped from user space should not, normally, be marked reserved in the first place.
Finally, every mapped page must be released from the page cache, or it will stay there forever; simply pass each page structure to:
void page_cache_release(struct page *page);
After you have released the page, of course, you should not access it again.
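Putting the whole lifecycle together, the map/use/release sequence might be sketched as follows; uaddr, nr_pages, and MAX_PAGES are assumed values from the calling driver.

```c
/*
 * Sketch: full get_user_pages() cycle - map, use, dirty, release.
 * uaddr, nr_pages, and MAX_PAGES are assumed driver values.
 */
struct page *pages[MAX_PAGES];
int i, res;

down_read(&current->mm->mmap_sem);
res = get_user_pages(current, current->mm, uaddr, nr_pages,
		     1 /* write */, 0 /* no force */, pages, NULL);
up_read(&current->mm->mmap_sem);
if (res < 0)
	return res;	/* nothing was mapped */

/* ... transfer data into the (possibly high-memory) pages ... */

for (i = 0; i < res; i++) {
	if (!PageReserved(pages[i]))
		SetPageDirty(pages[i]);	/* the device modified the page */
	page_cache_release(pages[i]);
}
```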
For a good example of how to use get_user_pages() in a char driver, see the definition of sgl_map_user_pages() in drivers/scsi/st.c.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
Memory management
Networking
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet