
Kernel development

Kernel release status

The current 2.6 prepatch is still 2.6.9-rc2; there have been no 2.6.9 prepatches since September 13.

Patches continue to accumulate in Linus's BitKeeper repository; changes queued up for -rc3 include the re-merging of the two in-kernel software suspend mechanisms, an XFS update, a new wait_event_timeout() primitive, more __iomem annotations (see the September 16 Kernel Page), new sparse annotations intended to flush out byte endianness errors, an NTFS update, ethtool support in the loopback driver, m32r architecture support, the "string" I/O memory access functions, support for more than eight partitions on BSD-labeled disks, some User-mode Linux cleanups, a tunable "max sectors" limit for block I/O requests (a latency reduction feature), a new prctl() option allowing programs to change their name, some shared memory scalability improvements, and a change in TCP ICMP source quench behavior (such messages are simply ignored now).

The current prepatch from Andrew Morton is 2.6.9-rc2-mm4. Recent changes to -mm include the "big kernel semaphore" patch (see the September 16 Kernel Page), a consolidation of the x86-64 and i386 no-exec code, a remap_page_range() API change (see below), a rework of the filesystem extended attribute code, the "Single Priority Array" scheduler, kernel crash dumps through kexec, and library functions implementing a simple circular buffer structure.

The current 2.4 prepatch remains 2.4.28-pre3, which was released on September 11.


Kernel development news

Proceedings of Netfilter Developer Workshop 2004

Harald Welte has posted the proceedings from the 2004 Netfilter Developer Workshop; they are available as plain text and in a number of other formats.


Respite from the OOM killer

Thomas Habets had an unfortunate experience recently. His Linux system ran out of memory, and the dreaded "OOM killer" was loosed upon the system's unsuspecting processes. One of its victims turned out to be his screen locking program, leaving his session open to whoever might happen to walk by. His response was the oom_pardon patch, which allows the system administrator to exempt certain processes from the OOM killer's revenge. It turns out that SUSE has a similar patch which allows administrators to set the "OOM score" of specific processes, increasing or decreasing their chances of being chosen for an untimely demise.

The OOM killer exists because the Linux kernel, by default, can commit to supplying more memory than it can actually provide. Overcommitting memory in this way allows the kernel to make fuller use of the system's resources, because processes typically do not use all of the memory they claim. As an example, consider the fork() system call, which copies all of a process's memory for the new child process. In fact, all it does is to mark the memory as "copy on write" and allow parent and child to share it. Should either change a page shared in this way, a true copy is made. In theory, the kernel could be called upon to copy all of the copy-on-write memory in this way; in practice, that does not happen. If the kernel reserved all of the necessary virtual memory (which includes swap space), some of that space would certainly go unused. Rather than waste that space - and fail to run programs or memory allocations that, in practice, it could have handled - the kernel overcommits itself and hopes for the best.
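The copy-on-write sharing is invisible to user space; all a program can observe is that parent and child end up with independent contents, even though no copy was made at fork() time. What follows is a minimal sketch of that behavior, not a demonstration of overcommit itself; the function name parent_view_after_child_write() is invented for illustration.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that overwrites a buffer inherited from the parent, then
 * return the first byte as the parent still sees it. Returns -1 on error.
 */
int parent_view_after_child_write(size_t len)
{
    unsigned char *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memset(buf, 'P', len);      /* the parent's contents */

    pid_t pid = fork();         /* pages are now shared, marked copy-on-write */
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Each first write to a shared page gives the child a private copy. */
        memset(buf, 'C', len);
        _exit(buf[0] == 'C' ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        free(buf);
        return -1;
    }

    int seen = buf[0];          /* the parent's pages were never modified */
    free(buf);
    return seen;
}
```

Calling parent_view_after_child_write(1 << 20) in the parent yields 'P': only the pages the child actually wrote were ever duplicated, which is why reserving a full second copy at fork() time would usually be wasted.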

When the best does not happen, the OOM killer comes into play; its job is to kill processes and free up some memory. Getting it to kill the right processes has been an ongoing challenge, however. One person's useless memory hog is another's crucial application. Thus, over the years, numerous efforts have been made to refine the OOM killer's heuristics, and patches like "oom_pardon" have been created.

Not everybody agrees that this is a fruitful use of developer time. Andries Brouwer came up with this analogy:

An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

Overcommitting memory and fearing the OOM killer are not necessary parts of the Linux experience, however. Simply setting the sysctl parameter vm/overcommit_memory to 2 turns off the overcommit behavior and keeps the OOM killer forever at bay. Most modern systems should have enough disk space to provide an ample swap file for most situations. Rather than trying to keep pet processes from being killed when overcommitted memory runs out, it might be easier just to avoid the situation altogether.
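The policy currently in force can be read back through procfs. A small sketch, assuming /proc is mounted in the usual place; read_overcommit_mode() is a hypothetical helper name. A value of 0 selects the default heuristic, 1 means "always overcommit", and 2 is the strict accounting mode discussed above.

```c
#include <stdio.h>

/*
 * Return the current vm/overcommit_memory setting (0, 1 or 2),
 * or -1 if the file cannot be opened or parsed.
 */
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

Changing the setting requires root privileges; it is done by writing the new value into the same file or through the sysctl utility.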



Last month we looked at a possible change to the heavily-used remap_page_range() function as a way of making io_remap_page_range() work the same way on all architectures. Since then, a driver author has stepped forward with a different problem: he wants to remap some reserved memory which sits above the 4GB boundary. Since remap_page_range() takes its "start" physical address as an unsigned long, which is only 32 bits wide on 32-bit systems, it cannot be used to remap memory above that boundary.

In response, William Lee Irwin has posted a series of patches which changes remap_page_range() to:

    int remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
                        unsigned long pfn, unsigned long size,
                        pgprot_t prot);

The old "start" address has been changed to pfn, which is a page frame number: the physical address shifted right by the page shift. Since these mappings can only happen on page boundaries, this change does not take away any old functionality. It does, however, free up the low bits that were always zero in a page-aligned address (typically twelve, for 4KB pages) to carry additional address bits, making it possible to remap memory well above 4GB.
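The arithmetic can be worked through in a few lines of user-space C. This is only an illustration: it assumes 4KB pages (a shift of twelve bits) and models the 32-bit unsigned long of a 32-bit kernel with uint32_t; EXAMPLE_PAGE_SHIFT and the helper names are invented for the example.

```c
#include <stdint.h>

#define EXAMPLE_PAGE_SHIFT 12   /* assumption: 4KB pages */

/* Would this physical address fit in a 32-bit "start" argument? */
int fits_in_32bit_address(uint64_t phys)
{
    return phys <= UINT32_MAX;
}

/* Page frame number for a page-aligned physical address. */
uint32_t phys_to_pfn32(uint64_t phys)
{
    return (uint32_t)(phys >> EXAMPLE_PAGE_SHIFT);
}

/* Physical address of the first byte of a page frame. */
uint64_t pfn32_to_phys(uint32_t pfn)
{
    return (uint64_t)pfn << EXAMPLE_PAGE_SHIFT;
}
```

A region starting at 5GB (0x140000000) does not fit in a 32-bit address, but its page frame number is just 0x140000. Under these assumptions the largest 32-bit pfn corresponds to physical address 0xFFFFFFFF000, just short of 16TB.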

William's patches fix all in-kernel callers of remap_page_range(), of which there are several dozen, and remove the old interface altogether. He also manages to eliminate a fair amount of related code - it seems that large numbers of drivers have their own, private copy of kvirt_to_pa(), which obtains a physical address for memory from vmalloc(). For in-kernel users, the change should be a purely positive thing. Out-of-kernel drivers which use remap_page_range() will have to be fixed, however.

These patches have found their way into the -mm tree, and are thus likely to end up in the mainline eventually.


Watching filesystem events with inotify

It is not uncommon for applications to want to know when something happens within a subtree of a filesystem. File managers are the most obvious example; if an application creates a new file within a directory represented in a file manager, users really like to see that new file show up, quickly. One could also imagine other sorts of applications - such as security monitoring code or just daemons wanting to know when their configuration files have changed - which can benefit from being told about filesystem activity.

The Linux mechanism for communicating filesystem events to user space is called "dnotify." A program watches a directory by opening it, then issuing an fcntl(F_NOTIFY) call. Thereafter, changes in that directory will result in a SIGIO signal being sent to the process, which can then dig through its cached information and try to figure out just what happened. People like to complain about dnotify; the interface is ugly (signals are a pain), it is hard to figure out what the changes are, it requires keeping files open and thus blocks the unmounting of removable media, etc. So there has long been interest in a replacement.
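For reference, a dnotify watch looks roughly like the following. This is a minimal sketch using the glibc definitions of F_NOTIFY and the DN_* flags (which require _GNU_SOURCE); the helper names watch_directory() and directory_changed() are invented here, and a kernel built without dnotify support will simply fail the fcntl() call.

```c
#define _GNU_SOURCE             /* exposes F_NOTIFY and the DN_* flags */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t dir_changed;

static void on_sigio(int sig)
{
    (void)sig;
    dir_changed = 1;            /* the signal does not say which file changed */
}

/*
 * Start watching a directory. Returns the open descriptor, which must
 * stay open for as long as the watch is wanted, or -1 on failure.
 */
int watch_directory(const char *path)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* DN_MULTISHOT keeps the watch armed after the first event. */
    if (fcntl(fd, F_NOTIFY,
              DN_CREATE | DN_MODIFY | DN_DELETE | DN_MULTISHOT) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Report, and clear, whether a change has been signalled. */
int directory_changed(void)
{
    int changed = dir_changed;

    dir_changed = 0;
    return changed;
}
```

The sketch shows both complaints at once: the handler learns only that something changed, and the directory descriptor has to be held open for the life of the watch.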

The most visible effort in that direction is inotify, which has been under development (by John McCutchan) for some time now; recently Robert Love has jumped in to help the project along. inotify 0.11 was released on September 28, and an increasingly strong push is being made to get it included into -mm for wider exposure and testing.

inotify works through a new character pseudo-device. Any application which wants to monitor filesystem activity need only open /dev/inotify and issue one of two ioctl() commands to it:

INOTIFY_WATCH: this call provides a filename and a mask of desired events; inotify will begin watching the given file (or directory) for activity.

The second command stops the stream of events for the given file.

Quite a few possible events can be watched for: IN_ACCESS (the file was accessed), IN_MODIFY (the file was changed), IN_ATTRIB (file attributes changed), IN_OPEN and IN_CLOSE (for open and close events), IN_MOVED_FROM and IN_MOVED_TO (when files are renamed), IN_CREATE_SUBDIR and IN_DELETE_SUBDIR (creation and deletion of subdirectories), IN_CREATE_FILE and IN_DELETE_FILE (creation and deletion of files within a directory), IN_DELETE_SELF (when a monitored file is deleted), IN_UNMOUNT (when the filesystem containing the file is unmounted), and a couple of others. The events themselves are obtained by simply reading from the device. Thus a program can block on the device itself, or use poll() to incorporate notifications into a larger event-processing loop. No signals are involved.

The actual implementation of inotify is relatively simple. The in-core inode structure is augmented with a linked list of processes interested in events involving that inode. When an INOTIFY_WATCH call is made, an entry is made in the corresponding list (and the inode is pinned into memory for the duration). Various parts of the filesystem code get an extra inotify_inode_queue_event() call when an action succeeds. The rest is just the usual overhead of maintaining lists of events for processes, waking those processes up when new events arrive, etc.
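The bookkeeping described above can be modeled in user space. The following is an illustrative sketch only, not code from the inotify patch: the structures, the event bit values, and the function names (add_watch(), queue_event(), remove_watches()) are all hypothetical stand-ins for the in-kernel equivalents.

```c
#include <stdint.h>
#include <stdlib.h>

#define EV_MODIFY 0x1u          /* hypothetical bits, not inotify's real values */
#define EV_ATTRIB 0x2u

struct watcher {                /* one entry per interested process */
    int owner;                  /* stand-in for the watching process */
    uint32_t mask;              /* events this watcher asked for */
    unsigned int pending;       /* events queued for it so far */
    struct watcher *next;
};

struct inode_sketch {           /* stand-in for the in-core inode */
    struct watcher *watchers;
    unsigned int pin_count;     /* nonzero keeps the inode in memory */
};

/* Model of a watch request: link a watcher in and pin the inode. */
int add_watch(struct inode_sketch *inode, int owner, uint32_t mask)
{
    struct watcher *w = calloc(1, sizeof(*w));

    if (w == NULL)
        return -1;
    w->owner = owner;
    w->mask = mask;
    w->next = inode->watchers;
    inode->watchers = w;
    inode->pin_count++;
    return 0;
}

/* Called after an operation succeeds; returns how many watchers matched. */
unsigned int queue_event(struct inode_sketch *inode, uint32_t event)
{
    unsigned int notified = 0;

    for (struct watcher *w = inode->watchers; w != NULL; w = w->next) {
        if (w->mask & event) {
            w->pending++;       /* a real implementation would wake the reader */
            notified++;
        }
    }
    return notified;
}

/* Drop every watch and release the pins. */
void remove_watches(struct inode_sketch *inode)
{
    struct watcher *w = inode->watchers;

    while (w != NULL) {
        struct watcher *next = w->next;

        free(w);
        w = next;
    }
    inode->watchers = NULL;
    inode->pin_count = 0;
}
```

The cost on the filesystem side is one list walk per successful operation on a watched inode; inodes with no watchers pay only for the check of an empty list.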

While most interest and activity seems to be around inotify, it is not the only dnotify replacement in circulation; nonotify is an alternative. There are also some remaining issues about the interface exported by inotify. It has been suggested that the inotify ioctl() calls should take file descriptors rather than file names; that change would eliminate problems in dealing with long file names and would also make access control checks happen automatically. The interface would have to be done in such a way that the application could close the file and still receive events, though; otherwise dnotify's problems with unmount blocking and excessive use of file descriptors would just come back again. These issues notwithstanding, inotify looks like it is headed for inclusion into a mainline kernel in the not-too-distant future.


Driver core functions: GPL only

Patrick Mochel may have been expecting to start a flame war with this patch, which changes most of the driver core functions to be exported only to GPL-licensed modules. The affected functions include the bus-level code, classes (but not class_simple), device_register() and friends, the platform and system bus functions, low-level sysfs functions, and the kobject primitives. In fact, the flame war failed to materialize; nobody seems to be upset by these changes. Whether Patrick is pleased or disappointed by the silence is for him to say.

The affected functions are a fundamental component of the Linux driver model; they are used by every device driver and filesystem, and by many other parts of the kernel as well. Even so, few, if any, proprietary modules will be affected by this change. The interfaces used by most modules are built on top of - and hide - the driver core. Thus, it is a rare driver which calls device_register(); instead, something like usb_register_dev() is used. Those upper-layer functions remain exported to all modules.

So why make the change? Patrick's reasoning is that he wants all users of the low-level functions to be part of the mainline kernel tree.

In short, being able to audit all of the users of these functions is necessary to their continued evolution (whatever that may entail). It would make the most sense if all users were part of the kernel, and it makes little sense to support their use by any unknown or binary modules.

As the kernel tree becomes more dynamic internally, it will be increasingly hard for external modules - free or not - to keep up with the changes. It would not be surprising to see ever more "encouragement" to merge external modules into the mainline. Code which remains outside will require a higher level of maintenance, or it is likely to break frequently.


Patches and updates


  • Kristian Sørensen: Umbrella 0.4.1. (September 27, 2004)


Page editor: Jonathan Corbet

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds