The current 2.6 prepatch remains 2.6.14-rc2; no prepatches have been
released over the last week.
The flow of patches into Linus's git repository has slowed; that repository
currently contains some key management improvements,
a SCSI update, some netfilter patches, an InfiniBand update, and lots of fixes.
The current -mm tree is 2.6.14-rc2-mm1. Recent changes
to -mm include a cs5535 ALSA driver, a new device_is_registered()
helper function (since merged), some network time protocol cleanups,
the controversial (see thread starting here) Adaptec serial attached storage patch
set, and the usual pile of fixes.
The current 2.4 prepatch is 2.4.32-rc1, released by Marcelo on September 22.
This prepatch adds a small set of fixes (some backported from 2.6) to the
upcoming 2.4.32 release.
Kernel development news
Suspend-to-disk is a feature desired by many Linux users; both laptop and
desktop users can benefit from being able to save the state of the system
to a local drive and, after a reboot, find everything as they left it. The
current in-kernel suspend mechanism works for many, but not everybody is
comfortable with the large amount of invasive code required. The
suspend2 patch adds quite a few worthwhile features,
but at the cost of expanding the software suspend implementation still
further. Concern over putting some of the suspend2 features into the
kernel has been one of the factors preventing its merging so far.
Pavel Machek, the maintainer of the in-kernel suspend implementation, has
now complicated the picture with the swsusp3 patch, which moves
some of the work of suspending the system into user space. This code is
said to work; if this approach continues to show promise, it could point
the way toward adding suspend2's features without growing the kernel.
The software suspend process, in very rough terms, works like this:
1. All processes on the system (with a few exceptions) are put into a
special "frozen" state.
2. Any memory which has on-disk backing store is forced out to disk; this
step essentially clears the system of all user-space pages. Any
kernel memory which can be done without (caches and such) is also
freed.
3. Any remaining memory which is not in reserved space (not part of the
kernel text, for all practical purposes) is written to a suspend image
on the disk. Also written is a map saying where the pages came from
in the first place.
4. The system is shut down.
When the system is resumed, these steps are performed in reverse order,
except that user-space memory remains on disk until faulted in by the
processes that need it.
The swsusp3 patch does not move all of the above work to user space - much
of it must be done in the kernel. What does move is step 3, the
writing of kernel memory to disk. This operation is handled by way of
/dev/kmem. To that end, the swsusp3 patch adds a set of scary
ioctl() calls to the /dev/kmem driver.
The new user-space suspend program begins by locking itself into memory.
This step is required - it would not do for it to change the memory state
in the middle of the process via page faults. A call to the new
IOCTL_FREEZE operation on /dev/kmem performs the
first two steps listed above: freezing processes and clearing memory. The
IOCTL_ATOMIC_SNAPSHOT call then puts devices on hold and creates
an in-kernel list of pages which must be saved.
The ioctl(/dev/kmem, IOCTL_ATOMIC_SNAPSHOT) call returns a pointer
to that list of pages. The user-space program can then obtain the list (by
reading it from /dev/kmem) and pass through it. Each page on the
list is read from kernel memory and written to the suspend image file. Finally, the
list itself is written to the suspend image. Once that is done,
the system can be powered down.
The resume process writes the saved image back into kernel memory. It has
the additional problem, however, of having to deal with two kernels at
once. This process will be running under a freshly-booted kernel (the
"resume kernel") with its
own idea of the state of the world; that state will eventually be
overwritten by the state from the suspended kernel, but that step must be
handled carefully. The resume process cannot simply overwrite arbitrary
kernel memory, since it is counting on the resume kernel to continue to
function until all of the suspended kernel's memory has been read in. So
the user-space resume process must be able to allocate pages in kernel
memory.
The answer is, of course, another ioctl() command, IOCTL_KMALLOC,
which executes a get_zeroed_page() call and returns the address of
the resulting page to user space. Once a full set of pages has been loaded
with the suspended kernel's memory, an updated page map can be stored in
the kernel, and an IOCTL_ATOMIC_RESTORE operation tells the resume
kernel to finish the process.
This code is very much in an early stage; even people who do not hesitate
to use software suspend may want to be careful with swsusp3 on systems they
actually care about resuming. Once things settle down, however, swsusp3
could open the door to a number of features, including graphical progress
displays and the ability to interrupt the suspend process, which users have
been asking for.
It's a common occurrence: some large application runs briefly and pushes
all kinds of useful memory out to swap space. Examples include large
runs, backups, slocate, and others. Once the program
is done, the Linux system is left with a great deal of free memory, and a
substantial amount of useful application data stuck in swap space. When
the user tries to use a running application, everything stops while it
populates that free memory with its pages. Wouldn't it be nice if the
system could restore swapped out pages when the memory becomes available
and avoid making the user wait later on?
A number of attempts have been made at prefetching swapped data in the
past. It has proved hard, however, to repopulate memory from swap in a way
which does not adversely affect the performance of the system as a whole.
A well-intended interactivity optimization can easily turn into a
performance hit in real use.
Con Kolivas has been making another try at it, however, with a series of
prefetch patches based on code originally written by Thomas Schlichter. Version 11 of the swap prefetch patch was
posted on September 23.
This patch creates two new data structures to track pages which have
been evicted to swap. Each swapped page is represented by a
swapped_entry_t structure; this structure is added to a linked
list and a radix tree. The list enables the prefetch code to find the most
recently swapped pages, with the idea that those pages are more likely to
be useful in the near future than others which have been languishing in
swap for longer. The radix tree, instead, allows the quick removal of
entries without having to search the entire (possibly very long) list to
find them.
Whenever a page is pushed out to swap, it is also added to the list and
radix tree. There is a limit on how many pages will be remembered; it is
currently set to a relatively high value which keeps the swapped page
entries from occupying more than 5% of RAM. If that limit is exceeded, an
older entry will be recycled. The add_to_swapped_list() code also
refuses to wait for any locks; if there is a conflict with another
processor, it will simply forget a page rather than spin on the lock. The
consequence of forgetting a page (it will never be prefetched) is relatively
small, so holding up the swap process for contention is not worth it.
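The bookkeeping described above (a recency-ordered list, an index for quick removal, a cap with recycling, and a refusal to wait for locks) can be illustrated with a small user-space sketch. It is not the patch's code: a flat array indexed by swap slot stands in for the radix tree, a C11 atomic_flag stands in for the kernel lock, the cap is a tiny constant rather than a fraction of RAM, and every name here (struct swapped_list, swapped_list_add(), and so on) is hypothetical. Letting removal wait for the lock is likewise an assumption of the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SLOTS   1024  /* assumption: swap slots are numbered 0..1023 */
#define MAX_ENTRIES 4     /* assumption: tiny cap, for demonstration only */

struct swapped_entry {
    size_t slot;                        /* swap slot this entry describes */
    struct swapped_entry *prev, *next;  /* recency list, newest first */
};

struct swapped_list {
    atomic_flag busy;                        /* try-lock */
    struct swapped_entry pool[MAX_ENTRIES];
    struct swapped_entry *index[MAX_SLOTS];  /* stand-in for the radix tree */
    struct swapped_entry *newest, *oldest, *free;
};

void swapped_list_init(struct swapped_list *l)
{
    memset(l, 0, sizeof *l);
    atomic_flag_clear(&l->busy);
    for (size_t i = 0; i < MAX_ENTRIES; i++) {
        l->pool[i].next = l->free;
        l->free = &l->pool[i];
    }
}

/* Detach e from the list and the index, and put it back on the free pool. */
static void release_entry(struct swapped_list *l, struct swapped_entry *e)
{
    if (e->prev) e->prev->next = e->next; else l->newest = e->next;
    if (e->next) e->next->prev = e->prev; else l->oldest = e->prev;
    l->index[e->slot] = NULL;
    e->prev = NULL;
    e->next = l->free;
    l->free = e;
}

/* Remember a swapped-out page; returns false if the page was forgotten. */
bool swapped_list_add(struct swapped_list *l, size_t slot)
{
    assert(slot < MAX_SLOTS);
    if (atomic_flag_test_and_set(&l->busy))
        return false;                      /* contended: forget this page */
    if (l->index[slot])
        release_entry(l, l->index[slot]);  /* already known: refresh it */
    if (!l->free)
        release_entry(l, l->oldest);       /* at the cap: recycle the oldest */
    struct swapped_entry *e = l->free;
    l->free = e->next;
    e->slot = slot;
    e->prev = NULL;
    e->next = l->newest;
    if (l->newest) l->newest->prev = e; else l->oldest = e;
    l->newest = e;
    l->index[slot] = e;
    atomic_flag_clear(&l->busy);
    return true;
}

/* Drop the entry for a slot, using the index rather than a list walk. */
bool swapped_list_remove(struct swapped_list *l, size_t slot)
{
    assert(slot < MAX_SLOTS);
    while (atomic_flag_test_and_set(&l->busy))
        ;                                  /* assumption: removal may wait */
    struct swapped_entry *e = l->index[slot];
    if (e)
        release_entry(l, e);
    atomic_flag_clear(&l->busy);
    return e != NULL;
}
```

A prefetcher built on this structure would start at the newest pointer and work toward older entries; the index keeps removal cheap however long the list grows.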
The code which actually performs prefetching is even more timid; every
effort has been made to make the process of swap prefetching as close to
free as possible. The prefetch code only runs once every five seconds -
and that gets pushed back any time there is VM activity. The number of
available free pages must be substantially above the minimum desired
number, or prefetching will not happen. The code also checks that no
writeback is happening, that the number of dirty pages in the system is
relatively small, that the number of mapped pages is not too high, that the
swap cache is not too large, and that the available pages are outside of
the DMA zone. When all of those conditions are met, a few pages will be
read from swap into the swap cache; they remain on the swap device so that
they can be immediately reclaimed should a sudden shortage of memory
arise.
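The gating conditions listed above amount to a single yes-or-no check made before each prefetch attempt. The sketch below shows one way such a check might look; only the five-second interval comes from the description above. The structures, the function names, and every threshold (the factor of three, the 5% and 50% fractions) are assumptions made for illustration, not values taken from the patch.

```c
#include <stdbool.h>

#define PREFETCH_INTERVAL_SECS 5  /* prefetch runs once every five seconds */

/* Hypothetical snapshot of the counters the checks look at. */
struct vm_snapshot {
    unsigned long total_pages;
    unsigned long free_pages;      /* free pages outside the DMA zone */
    unsigned long min_free_pages;  /* minimum desired number of free pages */
    unsigned long writeback_pages;
    unsigned long dirty_pages;
    unsigned long mapped_pages;
    unsigned long swapcache_pages;
};

struct prefetch_state {
    long next_run;                 /* earliest time (seconds) to try again */
};

/* Any VM activity pushes the next attempt a full interval into the future. */
void prefetch_delay(struct prefetch_state *st, long now)
{
    st->next_run = now + PREFETCH_INTERVAL_SECS;
}

/* Return true only when every condition for prefetching is met. */
bool prefetch_suitable(const struct prefetch_state *st,
                       const struct vm_snapshot *vm, long now)
{
    if (now < st->next_run)
        return false;              /* too soon after recent VM activity */
    if (vm->free_pages < 3 * vm->min_free_pages)
        return false;              /* assumed factor: not enough headroom */
    if (vm->writeback_pages != 0)
        return false;              /* writeback in progress */
    if (vm->dirty_pages > vm->total_pages / 20)
        return false;              /* assumed limit: too many dirty pages */
    if (vm->mapped_pages > vm->total_pages / 2)
        return false;              /* assumed limit: too many mapped pages */
    if (vm->swapcache_pages > vm->total_pages / 20)
        return false;              /* assumed limit: swap cache too large */
    return true;
}
```

Because any single failed check cancels the attempt, prefetching only happens on a system that is close to idle.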
Con claims that the end result is worthwhile:
In testing on modern pc hardware this results in wall-clock time
activation of the firefox browser to speed up 5 fold after a worst
case complete swap-out of the browser on an static web page.
That seems like a benefit worth having, if the cost of the prefetch code is
truly low. Discussion on the list has been limited, suggesting that
developers are unconcerned about the impacts of prefetching - or simply
uninterested at this point.
Some observers might well believe that the kernel has accumulated plenty of
special-purpose virtual filesystems. Even so, 2.6.14 will include yet
another one: securityfs. This filesystem is meant to be used by security
modules, some of which were otherwise creating their own filesystems; it
should be mounted on /sys/kernel/security. Securityfs thus looks,
from user space, like part of sysfs, but it is a distinct entity.
The API for securityfs is quite simple - it only exports three functions
(defined in <linux/security.h>). The usual first step will
be to create a directory specific to the security module at hand with:
struct dentry *securityfs_create_dir(const char *name,
struct dentry *parent);
If parent is NULL, the directory will be created in the
root of the filesystem.
That directory can be populated with files using:
struct dentry *securityfs_create_file(const char *name,
mode_t mode,
struct dentry *parent,
void *data,
struct file_operations *fops);
Here, name is the name of the file,
mode is the permissions the file will have,
parent is the containing directory (or NULL for the root directory),
data is a private data pointer,
and fops is a file_operations structure containing the
methods which actually implement the file. The calling module must
provide operations which make the file behave as desired. Securityfs
differs from sysfs in this regard; it makes no attempt to hide the low-level
file implementation. As a result, security modules can do ill-advised things like
creating highly complex files, providing ioctl() operations, and
more. Most modules, however, will simply want to provide straightforward
open(), read(), and (maybe) write() methods and
be done with it.
All of these files and directories should be cleaned up when the module is
unloaded. The same function is used for both files and directories:
void securityfs_remove(struct dentry *dentry);
No automatic cleanup of files is performed, so this step is necessary.
Those wanting to see an example of securityfs in action can look at this patch in 2.6.14 which causes the
seclvl module to use it.
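For readers who want to see the three calls used together, here is a minimal sketch of the create-then-clean-up pattern. Real kernel code cannot be built outside a kernel tree, so the sketch supplies trivial user-space stand-ins for struct dentry, struct file_operations, and the three securityfs functions; those stand-ins, the example_init() and example_exit() names, and the "example" and "level" file names are all hypothetical. The stand-ins signal failure by returning NULL, which is an assumption of the mock; the error convention of the real functions is not covered here and should be checked in the kernel source.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>

/* ---- User-space stand-ins so the sketch runs outside the kernel ---- */
struct file_operations { int unused; };
struct dentry { const char *name; struct dentry *parent; };

static int live_dentries;          /* entries created and not yet removed */

static struct dentry *mock_create(const char *name, struct dentry *parent)
{
    struct dentry *d = malloc(sizeof *d);

    if (d) {
        d->name = name;
        d->parent = parent;
        live_dentries++;
    }
    return d;
}

struct dentry *securityfs_create_dir(const char *name, struct dentry *parent)
{
    return mock_create(name, parent);
}

struct dentry *securityfs_create_file(const char *name, mode_t mode,
                                      struct dentry *parent, void *data,
                                      struct file_operations *fops)
{
    (void)mode; (void)data; (void)fops;   /* ignored by the mock */
    return mock_create(name, parent);
}

void securityfs_remove(struct dentry *dentry)
{
    if (dentry) {
        live_dentries--;
        free(dentry);
    }
}

/* ---- The calling pattern a security module might follow ---- */
static struct dentry *example_dir, *example_file;
static struct file_operations example_fops;   /* would hold open/read/write */

int example_init(void)
{
    example_dir = securityfs_create_dir("example", NULL);
    if (!example_dir)
        return -1;
    example_file = securityfs_create_file("level", 0644, example_dir,
                                          NULL, &example_fops);
    if (!example_file) {
        securityfs_remove(example_dir);   /* undo the partial setup */
        example_dir = NULL;
        return -1;
    }
    return 0;
}

void example_exit(void)
{
    /* Nothing is cleaned up automatically: remove the file, then the dir. */
    securityfs_remove(example_file);
    securityfs_remove(example_dir);
    example_file = example_dir = NULL;
}
```

In a real module, example_init() and example_exit() would be the module's init and exit functions, and the file_operations structure would carry the methods that implement the file.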
Patches and updates
Core kernel code
- Marco Costalba: qgit-0.95.
(September 26, 2005)
- dmitry pervushin: SPI.
(September 28, 2005)
Filesystems and block I/O
Page editor: Jonathan Corbet