The current 2.6 prepatch is still 2.6.9-rc2; there have been no
2.6.9 prepatches since September 13.
Patches continue to accumulate in Linus's BitKeeper repository; changes
queued up for -rc3 include the re-merging of the two in-kernel software
suspend mechanisms, an XFS update, a new
wait_event_timeout() primitive,
more __iomem annotations
(see the September 16 Kernel Page), new
sparse annotations intended to flush out byte endianness errors, an NTFS
update, ethtool support in the loopback driver, m32r architecture support,
the "string" I/O memory access
functions, support for more than eight partitions on BSD-labeled disks,
some User-mode Linux cleanups, a tunable "max sectors" limit for block I/O
requests (a latency reduction feature), a new prctl() option
allowing programs to change their name, some shared memory scalability
improvements, and a change in TCP ICMP source quench behavior (such
messages are simply ignored now).
The current prepatch from Andrew Morton is 2.6.9-rc2-mm4. Recent changes to -mm include
the "big kernel semaphore" patch (see the
September 16 Kernel Page), a consolidation of the x86-64 and i386
no-exec code, a remap_page_range() API change (see below), a
rework of the filesystem extended attribute code, the "Single Priority
Array" scheduler, kernel crash dumps through kexec, and library functions
implementing a simple circular buffer structure.
The current 2.4 prepatch remains 2.4.28-pre3, which was released on
September 11.
Kernel development news
Harald Welte has posted the proceedings from the 2004 Netfilter Developer
Workshop. They are available in plain text and in a number of other
formats.
Thomas Habets had an unfortunate experience recently. His Linux system ran
out of memory, and the dreaded "OOM killer" was loosed upon the system's
unsuspecting processes. One of its victims turned out to be his screen
locking program, leaving his session open to whoever might happen to walk
by. His response was
the oom_pardon patch,
which allows the system administrator to exempt certain processes from the
OOM killer's revenge. It turns out that SUSE has
a similar patch which allows administrators to
set the "OOM score" of specific processes, increasing or decreasing their
chances of being chosen for an untimely demise.
The OOM killer exists because the Linux kernel, by default, can commit to
supplying more memory than it can actually provide. Overcommitting memory
in this way allows the kernel to make fuller use of the system's resources,
because processes typically do not use all of the memory they claim. As an
example, consider the fork() system call, which nominally copies all of a
process's memory for the new child process. In fact, all it does is
mark the memory as "copy on write" and allow parent and child to share it.
Should either change a page shared in this way, a true copy is made. In
theory, the kernel could be called upon to copy all of the copy-on-write
memory in this way; in practice, that does not happen. If the kernel
reserved all of the necessary virtual memory (which includes swap space),
some of that space would certainly go unused. Rather than waste that space
- and fail to run programs or memory allocations that, in practice, it
could have handled - the kernel overcommits itself and hopes for the best.
When the best does not happen, the OOM killer comes into play; its job is
to kill processes and free up some memory. Getting it to kill the right
processes has been an ongoing challenge, however. One person's useless
memory hog is another's crucial application. Thus, over the years,
numerous efforts have been made to refine the OOM killer's heuristics, and
patches like "oom_pardon" have been created.
Not everybody agrees that this is a fruitful use of developer time.
Andries Brouwer came up with this analogy:
An aircraft company discovered that it was cheaper to fly its
planes with less fuel on board. The planes would be lighter and use
less fuel and money was saved. On rare occasions however the amount
of fuel was insufficient, and the plane would crash. This problem
was solved by the engineers of the company by the development of a
special OOF (out-of-fuel) mechanism. In emergency cases a passenger
was selected and thrown out of the plane. (When necessary, the
procedure was repeated.) A large body of theory was developed and
many publications were devoted to the problem of properly selecting
the victim to be ejected. Should the victim be chosen at random?
Or should one choose the heaviest person? Or the oldest? Should
passengers pay in order not to be ejected, so that the victim would
be the poorest on board? And if for example the heaviest person was
chosen, should there be a special exception in case that was the
pilot? Should first class passengers be exempted? Now that the OOF
mechanism existed, it would be activated every now and then, and
eject passengers even when there was no fuel shortage. The
engineers are still studying precisely how this malfunction is
caused.
Overcommitting memory and fearing the OOM killer are not necessary parts of
the Linux experience, however. Simply setting the sysctl parameter
vm/overcommit_memory to 2 turns off the overcommit
behavior and keeps the OOM killer forever at bay. Most modern systems
should have enough disk space to provide an ample swap file for most
situations. Rather than trying to keep pet processes from being killed
when overcommitted memory runs out, it might be easier just to avoid the
situation altogether.
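For those who want to check how a given system is configured, the current policy can be read from /proc/sys/vm/overcommit_memory. What follows is a minimal sketch, not a definitive tool; the helper names overcommit_mode_name() and read_overcommit_mode() are made up for this illustration. It assumes the usual meanings of the three values: 0 for heuristic overcommit, 1 for unconditional overcommit, and 2 for strict accounting.

```c
#include <stdio.h>

/*
 * Describe a vm/overcommit_memory setting.
 * (Hypothetical helper, written for this example only.)
 */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:
        return "heuristic overcommit (the default)";
    case 1:
        return "always overcommit";
    case 2:
        return "strict accounting, no overcommit";
    default:
        return "unknown";
    }
}

/*
 * Read the mode from the given procfs file, normally
 * /proc/sys/vm/overcommit_memory. Returns -1 if the file
 * cannot be opened or does not contain a number.
 */
int read_overcommit_mode(const char *path)
{
    FILE *f = fopen(path, "r");
    int mode = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

A caller would pass "/proc/sys/vm/overcommit_memory" to read_overcommit_mode() and print the result of overcommit_mode_name(). Changing the setting requires root privileges and is normally done with sysctl or by writing to the same file.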
Last month we looked at a possible change to
the heavily-used
remap_page_range() function as a way of making
io_remap_page_range() be the same on all architectures. Since
then, a driver author has stepped forward with a different problem: he
wants to remap some reserved memory which sits above the 4GB memory
boundary. Since
remap_page_range() takes its "start" address as an unsigned long (which is
32 bits wide on 32-bit systems), it cannot be used to remap memory above
that boundary.
In response, William Lee Irwin has posted a
series of patches which replaces remap_page_range() with:

    int remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
                        unsigned long pfn, unsigned long size,
                        pgprot_t prot);

The old "start" address has been changed to pfn, which is a page
frame number. Since these mappings can only happen on page boundaries,
this change does not take away any old functionality. It does,
however, make twelve bits (typically) of address space available, making it
possible to remap memory well above 4GB.
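The arithmetic behind those twelve bits can be sketched in user space. This is only an illustration under stated assumptions: it assumes 4KB pages (a shift of 12), and SKETCH_PAGE_SHIFT, phys_to_pfn() and pfn_to_phys() are names invented here, not kernel interfaces. A 32-bit page frame number shifted left by twelve bits can describe a physical address of up to 44 bits.

```c
#include <stdint.h>

/* Assumption: 4KB pages, as on i386. Other architectures differ. */
#define SKETCH_PAGE_SHIFT 12

/* Convert a page-aligned physical address to a page frame number. */
uint32_t phys_to_pfn(uint64_t phys)
{
    return (uint32_t)(phys >> SKETCH_PAGE_SHIFT);
}

/* Convert a 32-bit page frame number back to a physical address. */
uint64_t pfn_to_phys(uint32_t pfn)
{
    return (uint64_t)pfn << SKETCH_PAGE_SHIFT;
}
```

A region starting at the 5GB mark (0x140000000) does not fit in 32 bits as an address, but its page frame number, 0x140000, fits easily.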
William's patches fix all in-kernel callers of remap_page_range(),
of which there are several dozen, and remove the old interface
altogether. He also manages to eliminate a fair amount of related code -
it seems that large numbers of drivers have their own, private copy of
kvirt_to_pa(), which obtains a physical address for memory from
vmalloc(). For in-kernel users, the change should be a purely
positive thing. Out-of-kernel drivers which use
remap_page_range() will have to be fixed, however.
These patches have found their way into the -mm tree, and are thus likely
to end up in the mainline eventually.
It is not uncommon for applications to want to know when something happens
within a subtree of a filesystem. File managers are the most obvious
example; if an application creates a new file within a directory
represented in a file manager, users really like to see that new file show
up, quickly. One could also imagine other sorts of applications - such as
security monitoring code or just daemons wanting to know when their
configuration files have changed - which can benefit from being told about
filesystem activity.
The Linux mechanism for communicating filesystem events to user space is
called "dnotify." A program watches a directory by opening it, then
issuing an fcntl(F_NOTIFY) call. Thereafter, changes in that
directory will result in a SIGIO signal being sent to the process,
which can then dig through its cached information and try to figure out
just what happened. People like to complain about dnotify; the interface
is ugly (signals are a pain), it is hard to figure out what the changes
are, it requires keeping files open and thus blocks the unmounting of
removable media, etc. So there has long been interest in a replacement.
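For reference, the dnotify sequence described above looks roughly like the following. It is a minimal sketch rather than a complete program: watch_directory() and directory_changed() are names invented for this example, error handling is reduced to a bare minimum, and the DN_MULTISHOT flag is used so that the watch persists after the first event. Note that the signal says only that something changed; the application still has to rescan the directory to find out what.

```c
#define _GNU_SOURCE   /* exposes F_NOTIFY and the DN_* flags */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the signal handler when the kernel reports a change. */
static volatile sig_atomic_t dir_changed;

static void on_sigio(int sig)
{
    (void)sig;
    dir_changed = 1;
}

/*
 * Begin watching a directory with dnotify. Returns the open file
 * descriptor, which must stay open for as long as the watch is
 * wanted, or -1 on failure.
 */
int watch_directory(const char *path)
{
    struct sigaction sa;
    int fd;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        return -1;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Ask for SIGIO on creation, deletion, and modification. */
    if (fcntl(fd, F_NOTIFY,
              DN_CREATE | DN_DELETE | DN_MODIFY | DN_MULTISHOT) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Report whether a change was seen since the last call. */
int directory_changed(void)
{
    int seen = dir_changed;

    dir_changed = 0;
    return seen;
}
```

The open descriptor returned by watch_directory() is exactly what keeps removable media from being unmounted.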
The most visible effort in that direction is inotify, which has been under
development (by John McCutchan) for some time now; recently Robert Love has
jumped in to help the project along. inotify
0.11 was released on September 28, and an increasingly strong push
is being made to get it included into -mm for wider exposure and testing.
inotify works through a new character pseudo-device. Any application which
wants to monitor filesystem activity need only open /dev/inotify
and issue one of two ioctl() commands to it:
- INOTIFY_WATCH: This call provides a filename and a mask of desired
  events; inotify will begin watching the given file (or directory) for
  activity.
- INOTIFY_IGNORE: This call will stop the stream of events for the given
  file.
Quite a few possible events can be watched for: IN_ACCESS (the
file was accessed), IN_MODIFY (the file was changed),
IN_ATTRIB (file attributes changed), IN_OPEN and
IN_CLOSE (for open and close events), IN_MOVED_FROM and
IN_MOVED_TO (when files are renamed), IN_CREATE_SUBDIR
and IN_DELETE_SUBDIR (creation and deletion of subdirectories),
IN_CREATE_FILE and IN_DELETE_FILE (creation and deletion
of files within a directory), IN_DELETE_SELF (when a monitored
file is deleted), IN_UNMOUNT (when the filesystem containing the
file is unmounted), and a couple of others. The events themselves are
obtained by simply reading from the device. Thus a program can block on
the device itself, or use poll() to incorporate notifications into
a larger event-processing loop. No signals are involved.
The actual implementation of inotify is relatively simple. The in-core
inode structure is augmented with a linked list of processes interested in
events involving that inode. When an INOTIFY_WATCH call is made,
an entry is made in the corresponding list (and the inode is pinned into
memory for the duration). Various parts of the filesystem code get an
extra inotify_inode_queue_event() call when an action succeeds.
The rest is just the usual overhead of maintaining lists of events for
processes, waking those processes up when new events arrive, etc.
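The bookkeeping described above can be modeled in a few lines of user-space C. This is a loose sketch and not the code from the inotify patch: every structure, function, and event value here (sketch_inode, sketch_watcher, sketch_add_watch(), sketch_queue_event(), SKETCH_EV_*) is hypothetical, and locking, inode pinning, wakeups, and cleanup are all left out.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical event bits; the real IN_* values are not shown here. */
#define SKETCH_EV_MODIFY 0x1u
#define SKETCH_EV_ATTRIB 0x2u

/* One entry per process interested in a given inode. */
struct sketch_watcher {
    uint32_t mask;                  /* events this watcher asked for */
    unsigned int pending;           /* events queued, not yet read */
    struct sketch_watcher *next;
};

/* Stand-in for the in-core inode, augmented with a watcher list. */
struct sketch_inode {
    struct sketch_watcher *watchers;
};

/* Rough analogue of INOTIFY_WATCH: add an entry to the inode's list. */
struct sketch_watcher *sketch_add_watch(struct sketch_inode *inode,
                                        uint32_t mask)
{
    struct sketch_watcher *w = malloc(sizeof(*w));

    if (!w)
        return NULL;
    w->mask = mask;
    w->pending = 0;
    w->next = inode->watchers;
    inode->watchers = w;
    return w;
}

/*
 * Rough analogue of the hook called when a filesystem operation
 * succeeds: queue the event for every watcher whose mask matches.
 * Returns the number of watchers notified.
 */
int sketch_queue_event(struct sketch_inode *inode, uint32_t event)
{
    struct sketch_watcher *w;
    int notified = 0;

    for (w = inode->watchers; w; w = w->next) {
        if (w->mask & event) {
            w->pending++;
            notified++;
        }
    }
    return notified;
}
```

In this model, a watcher registered for modifications sees its pending count rise when a modify event is queued, while a watcher registered only for attribute changes is left alone.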
While most interest and activity seems to be around
inotify, it is not the only dnotify replacement in circulation; nonotify is an alternative. There
are also some remaining issues about the interface exported by inotify. It
has been suggested that the inotify ioctl() calls should take file
descriptors rather than file names; that change would eliminate problems in
dealing with long file names and would also make access control checks
happen automatically. The interface would have to be done in such a way
that the application could close the file and still receive events, though;
otherwise dnotify's problems with unmount blocking and excessive use of
file descriptors would just come back again. These issues notwithstanding,
inotify looks like it is headed for inclusion into a mainline kernel in the
not-too-distant future.
Patrick Mochel may have been expecting to start a flame war with
this patch, which changes most of the driver
core functions to be exported only to GPL-licensed modules. The affected
functions include the bus-level code, classes (but not
class_simple),
device_register() and friends, the
platform and system bus functions, low-level sysfs functions, and the
kobject primitives. In fact, the flame war failed to materialize; nobody
seems to be upset by these changes. Whether Patrick is pleased or
disappointed by the silence is for him to say.
The affected functions are a fundamental component of the Linux driver
model; they are used by every device driver and filesystem, and by many
other parts of the kernel as well. Even so, few, if any, proprietary
modules will be affected by this change. The interfaces used by most
modules are built on top of - and hide - the driver core. Thus, it is a
rare driver which calls device_register(); instead, something like
usb_register_dev() is used. Those upper-layer functions remain
exported to all modules.
So why make the change? Patrick's reasoning is that he wants all users of
the low-level functions to be part of the mainline kernel tree.
In short, being able to audit all of the users of these functions
is necessary to their continued evolution (whatever that may
entail). It would make the most sense if all users were part of the
kernel, and it makes little sense to support their use by any
unknown or binary modules.
As the kernel tree becomes more dynamic internally, it will be increasingly
hard for external modules - free or not - to keep up with the changes. It
would not be surprising to see ever more "encouragement" to merge external
modules into the mainline. Code which remains outside will require a
higher level of maintenance, or it is likely to break frequently.
Patches and updates
Security-related
- Kristian Sørensen: Umbrella 0.4.1. (September 27, 2004)
Page editor: Jonathan Corbet