Kernel development
Brief items
Kernel release status
The current 2.6 prepatch remains 2.6.14-rc2; no prepatches have been released over the last week.
The flow of patches into Linus's git repository has slowed; that repository currently contains some key management improvements, a SCSI update, some netfilter patches, an InfiniBand update, and lots of fixes.
The current -mm tree is 2.6.14-rc2-mm1. Recent changes to -mm include a cs5535 ALSA driver, a new device_is_registered() helper function (since merged), some network time protocol cleanups, the controversial Adaptec serial attached SCSI (SAS) patch set, and the usual pile of fixes.
The current 2.4 prepatch is 2.4.32-rc1, released by Marcelo on September 22. This prepatch adds a small set of fixes (some backported from 2.6) to the upcoming 2.4.32 release.
Kernel development news
User-space software suspend
Suspend-to-disk is a feature desired by many Linux users; both laptop and desktop users can benefit from being able to save the state of the system to a local drive and, after a reboot, find everything as they left it. The current in-kernel suspend mechanism works for many, but not everybody is comfortable with the large amount of invasive code required. The out-of-tree suspend2 implementation adds quite a few worthwhile features, but at the cost of expanding the software suspend implementation still further. Concern over putting some of the suspend2 features into the kernel has been one of the factors preventing its merging so far.
Pavel Machek, the maintainer of the in-kernel suspend implementation, has now complicated the picture with the swsusp3 patch, which moves some of the work of suspending the system into user space. This code is said to work; if this approach continues to show promise, it could point the way toward adding suspend2's features without growing the kernel.
The software suspend process, in very rough terms, works like this:
- All processes on the system (with a few exceptions) are put into a special "frozen" state.
- Any memory which has on-disk backing store is forced out to disk; this step essentially clears the system of all user-space pages. Any kernel memory which can be done without - caches and such - is also dropped.
- Any remaining memory which is not in reserved space (not part of the kernel text, for all practical purposes) is written to a suspend image on the disk. Also written is a map saying where the pages came from in the first place.
- The system is shut down.
When the system is resumed, these steps are performed in reverse order - except that user-space memory remains on disk until faulted in by the newly-restarted system.
The swsusp3 patch does not move all of the above work to user space - much of it must be done in the kernel. What does move is step 3: the writing of kernel memory to disk. This operation is handled by way of /dev/kmem. To that end, the swsusp3 patch adds a set of scary ioctl() calls to the /dev/kmem driver.
The new user-space suspend program begins by locking itself into memory. This step is required - it would not do for it to change the memory state in the middle of the process via page faults. A call to the new IOCTL_FREEZE operation on /dev/kmem performs the first two steps listed above: freezing processes and clearing memory. The IOCTL_ATOMIC_SNAPSHOT call then puts devices on hold and creates an in-kernel list of pages which must be saved.
The ioctl(/dev/kmem, IOCTL_ATOMIC_SNAPSHOT) call returns a pointer to that list of pages. The user-space program can then obtain the list (by reading it from /dev/kmem) and pass through it. Each page on the list is read from kernel memory and written to the suspend image file. Finally, the list itself is written to the suspend image. Once that is done, the system can be powered down.
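The page-copying loop described above might look roughly like the following sketch. It is not taken from the swsusp3 patch: the function names, the 4096-byte page size, the representation of the page list as an array of 64-bit offsets, and the use of pread() on the memory device are all assumptions made for illustration, and the ioctl() calls themselves are left out.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 4096    /* assumed page size */

/* Write all of buf, retrying after short writes; returns 0 or -1. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: copy each listed page from mem_fd (standing in
 * for /dev/kmem) into image_fd, then append the page list itself so
 * that a resume program can tell where every page came from.
 * Returns 0 on success, -1 on any I/O error.
 */
static int save_image(int mem_fd, int image_fd,
                      const uint64_t *page_addrs, size_t npages)
{
    char page[PAGE_SZ];
    size_t i;

    for (i = 0; i < npages; i++) {
        if (pread(mem_fd, page, PAGE_SZ, (off_t)page_addrs[i]) != PAGE_SZ)
            return -1;
        if (write_all(image_fd, page, PAGE_SZ) < 0)
            return -1;
    }
    /* The list goes last, matching the order given in the text. */
    return write_all(image_fd, page_addrs, npages * sizeof(*page_addrs));
}
```

A real suspend program would, as the text notes, have locked itself into memory before reaching this point, so that the loop does not disturb the state it is trying to save.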
The resume process writes the saved image back into kernel memory. It has the additional problem, however, of having to deal with two kernels at once. This process will be running under a freshly-booted kernel (the "resume kernel") with its own idea of the state of the world; that state will eventually be overwritten by the state from the suspended kernel, but that step must be handled carefully. The resume process cannot simply overwrite arbitrary kernel memory, since it is counting on the resume kernel to continue to function until all of the suspended kernel's memory has been read in. So the user-space resume process must be able to allocate pages in kernel space.
The answer is, of course, another ioctl() command, IOCTL_KMALLOC, which executes a get_zeroed_page() call and returns the address of the resulting page to user space. Once a full set of pages has been loaded with the suspended kernel's memory, an updated page map can be stored in the kernel, and an IOCTL_ATOMIC_RESTORE operation tells the resume kernel to finish the process.
This code is very much in an early stage; even people who do not hesitate to use software suspend may want to be careful with swsusp3 on systems they actually care about resuming. Once things settle down, however, swsusp3 could open the door to a number of features, including graphical progress displays and the ability to interrupt the suspend process, which users have been asking for.
Swap prefetching
It's a common occurrence: some large application runs briefly and pushes all kinds of useful memory out to swap space. Examples include large ld runs, backups, slocate, and others. Once the program is done, the Linux system is left with a great deal of free memory, and a substantial amount of useful application data stuck in swap space. When the user tries to use a running application, everything stops while it populates that free memory with its pages. Wouldn't it be nice if the system could restore swapped-out pages when the memory becomes available and avoid making the user wait later on?
A number of attempts have been made at prefetching swapped data in the past. It has proved hard, however, to repopulate memory from swap in a way which does not adversely affect the performance of the system as a whole. A well-intended interactivity optimization can easily turn into a performance hit in real use. Con Kolivas has been making another try at it, however, with a series of prefetch patches based on code originally written by Thomas Schlichter. Version 11 of the swap prefetch patch was posted on September 23.
This patch creates two new data structures to track pages which have been evicted to swap. Each swapped page is represented by a swapped_entry_t structure; this structure is added to a linked list and a radix tree. The list enables the prefetch code to find the most recently swapped pages, with the idea that those pages are more likely to be useful in the near future than others which have been languishing in swap for longer. The radix tree, instead, allows the quick removal of entries without having to search the entire (possibly very long) list to find them.
Whenever a page is pushed out to swap, it is also added to the list and radix tree. There is a limit on how many pages will be remembered; it is currently set to a relatively high value which keeps the swapped page entries from occupying more than 5% of RAM. If that limit is exceeded, an older entry will be recycled. The add_to_swapped_list() code also refuses to wait for any locks; if there is a conflict with another processor, it will simply forget a page rather than spin on the lock. The consequence of forgetting a page (it will never be prefetched) is relatively small, so holding up the swap process for contention is not worth it in this case.
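The two-structure bookkeeping can be modeled in user space with a sketch like the one below. It is not the code from the patch: a small hash table stands in for the kernel's radix tree, the entry limit is a tiny constant rather than a fraction of RAM, locking (and thus the "give up rather than wait" behavior) is omitted, and every name here is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_ENTRIES 4   /* hypothetical cap; the patch derives its limit from RAM size */
#define NBUCKETS    8   /* hash buckets standing in for the radix tree */

struct swapped_entry {
    unsigned long swp;                  /* swap location, used as the key */
    struct swapped_entry *prev, *next;  /* recency list, newest at the head */
    struct swapped_entry *hnext;        /* hash chain */
};

struct swapped_set {
    struct swapped_entry *head, *tail;
    struct swapped_entry *bucket[NBUCKETS];
    size_t count;
};

/* Find the link pointing at the entry for swp, or at the chain's end. */
static struct swapped_entry **slot_of(struct swapped_set *s, unsigned long swp)
{
    struct swapped_entry **pp = &s->bucket[swp % NBUCKETS];

    while (*pp && (*pp)->swp != swp)
        pp = &(*pp)->hnext;
    return pp;
}

/* Unlink an entry from both structures without walking the recency list. */
static struct swapped_entry *detach(struct swapped_set *s, unsigned long swp)
{
    struct swapped_entry **pp = slot_of(s, swp), *e = *pp;

    if (!e)
        return NULL;
    *pp = e->hnext;
    if (e->prev) e->prev->next = e->next; else s->head = e->next;
    if (e->next) e->next->prev = e->prev; else s->tail = e->prev;
    s->count--;
    return e;
}

/* Forget a page, for example because it has been swapped back in. */
static void remove_swapped(struct swapped_set *s, unsigned long swp)
{
    free(detach(s, swp));
}

/* Remember a newly swapped page, recycling the oldest entry when full. */
static int add_swapped(struct swapped_set *s, unsigned long swp)
{
    struct swapped_entry *e;

    if (*slot_of(s, swp))
        return 0;                       /* already tracked */
    if (s->count >= MAX_ENTRIES)
        e = detach(s, s->tail->swp);    /* reuse the oldest entry */
    else
        e = malloc(sizeof(*e));
    if (!e)
        return -1;
    e->swp = swp;
    e->prev = NULL;
    e->next = s->head;
    if (s->head) s->head->prev = e; else s->tail = e;
    s->head = e;
    e->hnext = s->bucket[swp % NBUCKETS];
    s->bucket[swp % NBUCKETS] = e;
    s->count++;
    return 0;
}
```

The head of the list is where a prefetcher would start looking, since it holds the most recently evicted pages; the index exists only so that removal does not have to scan that list.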
The code which actually performs prefetching is even more timid; every effort has been made to make the process of swap prefetching as close to free as possible. The prefetch code only runs once every five seconds - and that gets pushed back any time there is VM activity. The number of available free pages must be substantially above the minimum desired number, or prefetching will not happen. The code also checks that no writeback is happening, that the number of dirty pages in the system is relatively small, that the number of mapped pages is not too high, that the swap cache is not too large, and that the available pages are outside of the DMA zone. When all of those conditions are met, a few pages will be read from swap into the swap cache; they remain on the swap device so that they can be immediately reclaimed should a sudden shortage of memory develop.
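The chain of conditions reads naturally as a single predicate. The sketch below is only a restatement of the paragraph above in C; the structure, its field names, and every numeric threshold other than the five-second interval are assumptions chosen for illustration, not values from the patch.

```c
#include <stdbool.h>

#define PREFETCH_INTERVAL_SECS 5    /* from the text: runs once every five seconds */

/* Hypothetical snapshot of the VM counters the checks would consult. */
struct vm_snapshot {
    unsigned long total_pages;
    unsigned long free_pages;
    unsigned long min_free_pages;        /* the "minimum desired number" */
    unsigned long writeback_pages;
    unsigned long dirty_pages;
    unsigned long mapped_pages;
    unsigned long swapcache_pages;
    bool free_outside_dma;               /* free pages exist outside the DMA zone */
    unsigned long secs_since_vm_activity;
};

/* Return true only when every condition for prefetching holds. */
static bool may_prefetch(const struct vm_snapshot *v)
{
    if (v->secs_since_vm_activity < PREFETCH_INTERVAL_SECS)
        return false;                    /* recent VM activity pushes us back */
    if (v->free_pages < 2 * v->min_free_pages)
        return false;                    /* assumed margin above the minimum */
    if (v->writeback_pages > 0)
        return false;                    /* no writeback may be in progress */
    if (v->dirty_pages > v->total_pages / 20)
        return false;                    /* assumed "relatively small" limit */
    if (v->mapped_pages > v->total_pages / 2)
        return false;                    /* assumed limit on mapped pages */
    if (v->swapcache_pages > v->total_pages / 20)
        return false;                    /* assumed limit on swap cache size */
    if (!v->free_outside_dma)
        return false;
    return true;
}
```

Ordering the cheapest and most frequently failing test first keeps the common case, a busy system, close to free.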
Con claims that the end result is worthwhile:
That seems like a benefit worth having, if the cost of the prefetch code is truly low. Discussion on the list has been limited, suggesting that developers are unconcerned about the impacts of prefetching - or simply uninterested at this point.
securityfs
Some observers might well believe that the kernel has accumulated plenty of special-purpose virtual filesystems. Even so, 2.6.14 will include yet another one: securityfs. This filesystem is meant to be used by security modules, some of which were otherwise creating their own filesystems; it should be mounted on /sys/kernel/security. Securityfs thus looks, from user space, like part of sysfs, but it is a distinct entity.
The API for securityfs is quite simple - it only exports three functions (defined in <linux/security.h>). The usual first step will be to create a directory specific to the security module at hand with:
    struct dentry *securityfs_create_dir(const char *name,
                                         struct dentry *parent);
If parent is NULL, the directory will be created in the root of the filesystem.
That directory can be populated with files using:
    struct dentry *securityfs_create_file(const char *name,
                                          mode_t mode,
                                          struct dentry *parent,
                                          void *data,
                                          struct file_operations *fops);
Here, name is the name of the file, mode is the permissions the file will have, parent is the containing directory (or NULL for the filesystem root), data is a private data pointer, and fops is a file_operations structure containing the methods which actually implement the file. The calling module must provide operations which make the file behave as desired. Securityfs differs from sysfs in this regard; it makes no attempt to hide the low-level file implementation. As a result, security modules can do ill-advised things like creating highly complex files, providing ioctl() operations, and more. Most modules, however, will simply want to provide straightforward open(), read(), and (maybe) write() methods and be done with it.
All of these files and directories should be cleaned up when the module is unloaded. The same function is used for both files and directories:
void securityfs_remove(struct dentry *dentry);
There is no automatic cleanup of files performed, so this step is mandatory.
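Putting the three calls together, a minimal module might look something like the sketch below. It is kernel code, so it builds only against a kernel tree, and it is an illustration rather than anything taken from the seclvl conversion: the "demo" directory, the "status" file, the demo_* names, and the file's contents are all invented. The error handling assumes that a failed create call returns either NULL or an ERR_PTR() value, and so checks for both.

```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>
#include <linux/err.h>
#include <linux/security.h>

/* Hypothetical example: one read-only file under <securityfs>/demo/. */
static struct dentry *demo_dir, *demo_file;
static const char demo_msg[] = "enabled\n";

/* read() method: hand the static message back to user space. */
static ssize_t demo_read(struct file *filp, char __user *buf,
                         size_t count, loff_t *ppos)
{
    return simple_read_from_buffer(buf, count, ppos,
                                   demo_msg, sizeof(demo_msg) - 1);
}

static struct file_operations demo_fops = {
    .owner = THIS_MODULE,
    .read  = demo_read,
};

static int __init demo_init(void)
{
    /* A NULL parent places the directory in the filesystem root. */
    demo_dir = securityfs_create_dir("demo", NULL);
    if (!demo_dir || IS_ERR(demo_dir))
        return demo_dir ? PTR_ERR(demo_dir) : -ENOMEM;

    demo_file = securityfs_create_file("status", 0444, demo_dir,
                                       NULL, &demo_fops);
    if (!demo_file || IS_ERR(demo_file)) {
        int err = demo_file ? PTR_ERR(demo_file) : -ENOMEM;

        securityfs_remove(demo_dir);
        return err;
    }
    return 0;
}

static void __exit demo_exit(void)
{
    /* Nothing is cleaned up automatically: remove the file, then the directory. */
    securityfs_remove(demo_file);
    securityfs_remove(demo_dir);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

With securityfs mounted in the recommended place, the file in this sketch would appear as /sys/kernel/security/demo/status.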
Those wanting to see an example of securityfs in action can look at the patch in 2.6.14 which converts the seclvl module to use it.
Page editor: Jonathan Corbet