Brief items
The current 2.6 development kernel remains 2.6.29-rc3. As of this
writing, just over 500 changesets have been merged into the mainline since
2.6.29-rc3; they are dominated by fixes but there are also some UBIFS
enhancements (including direct I/O support), a driver for AMD CS5536 PATA
controllers, and SPARC64 NMI watchdog support. Past experience suggests
that the 2.6.29-rc4 release can be expected a few milliseconds after this
page is published.
The current stable 2.6 kernel is 2.6.28.3, released on February 2; 2.6.27.14 was released at the
same time. Both updates contain a long list of fixes for serious problems.
The 2.6.28.4 and 2.6.27.15 updates are in the review
process as of this writing; their probable release date is
February 6.
Kernel development news
The kernel is a lazy, deceitful sack of scum; this is the
fundamental principle of virtual memory. It applies in most
situations, some familiar and some surprising, but the rule is that
VMAs record what has been agreed upon, while PTEs reflect what has
actually been done by the lazy kernel.
--
Gustavo Duarte describes memory management
--
Evgeniy Polyakov claims a
slight performance advantage
Unless we uncover devastating issues with the transition from ext3
to ext4 as the default file system, Fedora 11 installed systems
will be using ext4.
--
James Laska
By Jake Edge
February 4, 2009
Creating initramfs images, for use by the kernel at "early boot" time, is a
rather messy business. It is made more so by the fact that each individual
distribution has its own tools to build the image, as well as its own set
of tools inside it. At the 2008 Kernel Summit, Dave Jones spent some time
discussing the problem along with his idea
to start over by creating a cross-distribution initramfs. That has led to
the Dracut project, which was announced by Jeremy Katz in
December, and a new mailing list,
aptly named "initramfs", in which to discuss it.
An initramfs is a cpio archive of the initial filesystem that gets loaded
into memory when the kernel is loaded. That filesystem needs to contain
all of the drivers and tools needed to mount the real root filesystem. It
isn't strictly necessary to have an initramfs; a minimal /dev
along with the required drivers built into the kernel is another
alternative. Distributions, though, all use an initramfs and,
over time, each has come up with its own way to handle this
process. Jones, Katz, and others would like to see something more
standardized that gets pushed upstream into the mainline kernel so that
distributions can stop fussing with the problem.
There are a number of advantages to that approach. Building an initramfs
from the kernel sources would eliminate problems that users who build their
own kernels sometimes run into. If a distribution's initramfs scheme falls
behind the pace of kernel development in some fashion, users can find
themselves unable to build a kernel+initramfs combination that will work.
There is also hope that dracut will help speed up the boot process by using
udev, as Katz puts it:
By instead moving to where we're basing everything off of uevents we can
hopefully move away from the massive shell scripts of doom, speed up
boot and also maybe get to where a more general initramfs can be built
_with the kernel_ instead of per-system.
Because initramfs is so integral to the early boot process—and so
difficult to debug if problems arise—there is a concern about
starting over. It is not surprising, then, that there is some resistance
to throwing out years of hard-earned knowledge that is embodied in the
various distributions' initramfs handling, leading Maximilian Attems to ask:
btw why do we need dracut at all?
your blog has vague allusion against initramfs-tools,
which is much better tested and has seen the field.
beside having more features and flexibility it does not hardcode udev usage,
nor bash, why should it not be considered at first!?
It is a question that is frequently asked, but one that Jones has a ready
answer for:
"why not use the ubuntu one?"
"why not use the suse one?"
they all have some good and bad tradeoffs. Distro X has feature Y
which no-one else does. etc.
When the project began we spent some time looking at what everyone
else already does, and "lets start over and hope others participate"
seemed more attractive than taking an existing one and bending it to fit.
So, the Red Hat folks, at least, are proceeding with dracut. Jones
recently posted a status
report on his blog that outlined what is working and what still needs
to be done. Though it currently is "Fedora-centric, with a few
hardcoded assumptions in there, so it'll likely fall over on other
distros", fixing that is clearly high on the to-do list. The status
report is an effort to get people up-to-speed so that other distributions
can start trying it out. In addition, he plans to start trying it on
various distributions himself.
In its current form, dracut is rather minimal. It has a script named
dracut that will generate a gzipped cpio file for the initramfs
image, as
well as an init shell script that ends up in that image.
Jones says that init "achieves quite a lot in its 119
lines": setting up device nodes, starting udev, waiting for the root
device to show up and mounting it, mounting /proc and /sys,
and more. If anything goes wrong during that process, init will
drop to a shell that will allow diagnosis of the problem. So far, it only
supports
the simpler cases for the location of the root filesystem:
Currently, dracut supports root on raw disks (/dev/sda), lvm (/dev/mapper...),
and mounting root by label or uuid.
If you have a more esoteric rootfs setup, such as root-on-nfs, right now
it'll fail horribly.
There is only one remaining barrier to getting rid of the unlamented
nash, and that is a utility to do a switch_root (i.e. switch to a new
root directory and start an init from there). The plan is to
write a standalone utility that would be added to the util-linux
package. The environment
provided by the initramfs would include util-linux, bash, and
use glibc,
which doesn't sit well with some embedded folks. They generally prefer a
statically linked busybox environment. Kay Sievers outlines the reasons for a standard environment:
Busybox is nice as an option to be able to rescue/hack. It should
definitely be provided as an optional "plugin" for people who need it.
But there is no chance to depend on it by default, for the very same
reason klibc, or any other libc is not an option.
Full-featured distros who make their money with support, can just not
afford to support tools compiled differently from the tools in the
real rootfs. SUSE used klibc for one release, and stopped doing that
immediately, because you go crazy if you run into problems with bootup
problems on [customer] setups you can not reproduce with the tools from
the real rootfs.
There is plenty to do to make dracut into a real tool for creating
initramfs images—at least ones that work on more than just
Fedora—more root filesystem types need to be handled, hibernation
signatures need to be recognized and handled, the udev rules
need to be cleaned up, kdump images need to be supported, etc. But the
overriding question is: will other distributions start working on dracut as
well? If and when Jones (or others) get things at least limping along on
Debian/Ubuntu and/or SUSE, will those distributions start getting on board?
So far, there is not a lot of evidence of anyone other than Red Hat working
on dracut.
But, the plan is to eventually submit dracut upstream to the mainline
kernel, so that make initramfs works in a standard kernel tree. It
would seem that many kernel hackers see the need for standardizing
initramfs and eventually moving it into the kernel, as Ted Ts'o notes:
[...] So the idea that was explored was adding a
common mkinitramfs with basic functionality into kernel sources, with
the ability for distributions to add various "value add" enhancements
if they like. This way if the kernel wants to move more functionality
(for example, in the area of resuming from hibernation) out of the
kernel into initramfs, it can do so without breaking the ability of
older distributions from being able to use kernel.org kernels.
So IMHO, it's important not only that the distributions standardize on
a single initramfs framework, but that framework get integrated into
the kernel sources.
No one is very happy about losing their particular version of the
tools to build an initramfs—if only because of familiarity—but
a standardized solution is something whose time has come. Probably any of
the existing tools could have been used as a starting point, but for
political
reasons, it makes sense to start anew. There is a fair amount of
cruft that has built up in the existing tools as well, which folks are
unlikely to miss, so there are also technical reasons to start over. It should
come
as no surprise that a project started by Red Hat might be somewhat
Fedora-centric in its early form, but the clear intent is to make it
distribution-agnostic. It would seem the right time for other
distributions and constituencies (embedded for example) to get involved to
help shape dracut into something useful for all.
By Jonathan Corbet
February 4, 2009
Any filesystem designed for use with rotating media must pay careful
attention to the layout of files on the disk. If a file's blocks can be
placed sequentially on the device, they can be read or written as a unit,
without the need for performance-destroying head seeks in the middle. Even
the most careful filesystem will sometimes fail to lay out files in a
minimal number of contiguous extents, though. If a file grows, for
example, and the blocks just past the previous end are not available, the
filesystem has no choice other than placing the new blocks somewhere else.
Depending on how full the filesystem is, those blocks could end up far away
indeed. This sort of fragmentation can result in filesystems slowing down
over time.
Fragmentation problems can be fixed up after the fact. The most obvious
way to defragment a disk is to make a new filesystem on it; after all,
empty filesystems tend not to have fragmentation problems. But the new
filesystem will have less fragmentation even after its old contents have
been restored onto it. When the ultimate size of every file is known in
advance, it's relatively easy to make good layout decisions. Knowing this,
system administrators have used backup-and-restore cycles as a way of
cleaning up overly fragmented disks for many years.
There is, of course, a problem with this approach which goes beyond the
risk of discovering that one's backup is not quite as good as one had
thought. The downtime associated with rewriting a disk can be unwelcome to
users; a filesystem which is down responds even more slowly than a
filesystem with fragmentation problems. So it would be nice to have a way
to defragment a filesystem while keeping it online and available. This
online defragmentation capability has been on the ext4 "planned features"
list for a long time; it is, at this point, about the only planned feature
which has not yet been merged into the mainline.
Some attempts at online defragmentation have been made in the past, but
they have not, yet, gotten through review. Now Akira Fujita has come
forward with a new ext4 online
defragmentation patch which, by virtue of a different view of the
problem, might just make it into the mainline. Previous attempts exposed
an interface whereby a user-space application could ask the filesystem to
defragment a specific file by allocating new (contiguous) blocks to it.
That turned out to be a bit too much work to put into the kernel; so, with
this patch, Akira has created an interface which moves a bit more of the
work into user space.
In the new scheme, a user-space defragmentation daemon will pick a file
which, in its opinion, is too spread out on the disk. The daemon will then
set about creating a new, less-fragmented file to replace it. That is done
by creating a new, temporary file on the same filesystem, then unlinking it
(while holding the file descriptor open). Calls to fallocate()
can then be used to add the requisite number of blocks to the new file.
Once the new file is up to the correct size, the daemon can use the
FS_IOC_FIEMAP
ioctl() to query the number of extents (fragments) it contains. If the
new file is not an improvement over the old one, the daemon should just
close it and give up; the filesystem simply does not have enough contiguous
storage available.
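The daemon's preparatory steps can be sketched in C. This is only an illustration, not code from Akira's tool: the function names make_donor_file() and count_extents() are made up here, and posix_fallocate() is used in place of a direct fallocate() call so that the sketch also works on filesystems without native preallocation. Passing fm_extent_count = 0 asks FIEMAP to report only the number of extents.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, FIEMAP_* constants */

/* Hypothetical helper: create an already-unlinked temporary file in `dir`
 * and preallocate `size` bytes for it. Returns an open fd, or -1. */
int make_donor_file(const char *dir, off_t size)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/defrag-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the file lives on only through fd */

    /* posix_fallocate() returns an error number, not -1 */
    if (posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: ask the filesystem how many extents back `fd`.
 * Returns the count, or -1 if FIEMAP fails or is not supported. */
long count_extents(int fd)
{
    struct fiemap fm;

    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
    fm.fm_flags = FIEMAP_FLAG_SYNC;     /* flush dirty data first */
    fm.fm_extent_count = 0;             /* count only, return no extents */

    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
        return -1;
    return (long)fm.fm_mapped_extents;
}
```

A daemon built along these lines would compare count_extents() for the original file and for the donor file, and give up if the donor is no better.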
The daemon could, at this point, simply copy the old file into the new one,
then put the newly defragmented version in the place of the old one. The
problems with that approach include performance (all that data must be
copied through user space) and robustness. If some other process changes
the file while the copy is happening, the new file may lose those changes.
Indeed, if some process has the old file open, it may never notice that the
replacement has happened. So something smarter is needed.
Akira's patch addresses these problems with the creation of a new, magic
ioctl() call for ext4. The defragmentation application must fill
out a structure like:
    struct move_extent {
        int org_fd;          /* original file descriptor */
        int dest_fd;         /* destination file descriptor */
        ext4_lblk_t start;   /* logical offset of org_fd and dest_fd */
        ext4_lblk_t len;     /* exchange block length */
    };
This structure, when passed to the new EXT4_IOC_DEFRAG
ioctl(), expresses a request to the kernel to move len
blocks from the original file to the new one, starting at start.
Essentially, it copies an extent's worth of data into the (fully allocated,
nicely contiguous) space in the new file, then performs a magic block
swap. The contiguous blocks from the new file are patched into the old
file, while the fragmented blocks are, instead, put into the new file.
Once the entire file has been treated in this way, the file will have been
defragmented without having been visibly moved.
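The block swap is easier to see in a toy model. The sketch below is purely illustrative and does not call the proposed ioctl(): each "file" is just an array mapping logical blocks to physical block numbers, and toy_move_extent() (a made-up name) exchanges the mappings for a range, which is the step the kernel would perform after copying the data.

```c
#include <stddef.h>

/* Toy model: blocks[i] is the physical block backing logical block i. */
struct toy_file {
    unsigned long *blocks;
    size_t nblocks;
};

/* Count the runs of physically consecutive blocks (the "extents"). */
size_t toy_count_extents(const struct toy_file *f)
{
    size_t extents = 0;

    for (size_t i = 0; i < f->nblocks; i++)
        if (i == 0 || f->blocks[i] != f->blocks[i - 1] + 1)
            extents++;
    return extents;
}

/* Exchange the mappings of logical blocks [start, start + len) between
 * the original file and the donor. Returns 0, or -1 if out of range. */
int toy_move_extent(struct toy_file *orig, struct toy_file *donor,
                    size_t start, size_t len)
{
    if (start > orig->nblocks || len > orig->nblocks - start)
        return -1;
    if (start > donor->nblocks || len > donor->nblocks - start)
        return -1;

    for (size_t i = start; i < start + len; i++) {
        unsigned long tmp = orig->blocks[i];
        orig->blocks[i] = donor->blocks[i];   /* contiguous block goes to the original */
        donor->blocks[i] = tmp;               /* fragmented block goes to the donor */
    }
    return 0;
}
```

After the whole range has been exchanged, the original file holds the donor's contiguous blocks and the donor holds the scattered ones, ready to be freed when the donor is closed.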
The final step is to delete the "new" file, which now contains the "old"
file's blocks. Since the file had been unlinked, that will cause the
filesystem to recover the old blocks and the task will be complete. For
the curious, Akira has posted the source for a
user-space defragmentation tool which shows how this interface can be
used.
There have not been a whole lot of objections to the new code. Chris Mason
did point out that the system will do
unfortunate things if the layout of a swap file changes. He has clearly
thought about the problem - to an extent:
Btrfs is currently getting around this by dropping bmap support, so
swapfiles on btrfs won't work at all. A real long term solution is
required ;)
Beyond that, there are some minor issues, such as the definition of the ABI
in terms of types like int instead of architecture-independent
types. Requests for separate source and destination block numbers have
been made; that feature would help developers working on hierarchical
storage systems. The ability to guide the allocation of blocks would be
useful in situations where performance can be improved by grouping related
files together on the disk.
There could also be value in finding a way to move much of this
functionality into the VFS layer where it could be used with other
filesystems as well; that could prove to be a difficult task, though, and
ext4 maintainer Ted Ts'o has little
desire to take on that job.
Those little issues notwithstanding, it does appear that the ext4 filesystem
may be closer to getting the much-requested online defragmentation feature.
February 4, 2009
This article was contributed by Goldwyn Rodrigues
Under desperately low memory conditions, the out-of-memory (OOM) killer
kicks in and picks a process to kill using a set of heuristics which has
evolved over time. This may be pretty annoying for users who may have
wanted a different process to be killed. The process killed may also be
important from the system's perspective. To avoid the untimely demise of
the wrong processes, many developers feel that a greater degree of control
over the OOM killer's activities is required.
Why the OOM-killer?
Major distribution kernels set the default value of
/proc/sys/vm/overcommit_memory to zero, which means that processes
can
request more memory than is currently free in the system. This is
done based on the heuristic that allocated memory is not used
immediately, and that processes, over their lifetime, do not use all
of the memory they allocate. Without overcommit, a system will
not fully utilize its memory, thus wasting some of it.
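As a small illustration, the current policy can be read back from the sysctl file. The helper names below are invented for this sketch; the meaning of the three values (0 for heuristic overcommit, 1 for always overcommit, 2 for strict accounting) follows the kernel's overcommit documentation.

```c
#include <stdio.h>

/* Hypothetical helper: read the overcommit mode from `path`, normally
 * /proc/sys/vm/overcommit_memory. Returns 0, 1 or 2, or -1 on error. */
int read_overcommit_mode(const char *path)
{
    FILE *f = fopen(path, "r");
    int mode;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return (mode >= 0 && mode <= 2) ? mode : -1;
}

/* Hypothetical helper: describe a mode value in words. */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "heuristic overcommit";
    case 1:  return "always overcommit";
    case 2:  return "never overcommit";
    default: return "unknown";
    }
}
```

Calling read_overcommit_mode("/proc/sys/vm/overcommit_memory") on a typical distribution kernel would be expected to return 0.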
Overcommitting memory allows the system to use the memory in a more
efficient way, but at the risk of OOM situations. Memory-hogging programs
can deplete the system's memory, bringing the whole system to a
grinding halt. Memory can become so scarce that not even a single page
can be allocated, whether to a user process (so that the administrator
can kill an appropriate task) or to the kernel (so that it can carry out
important operations such as freeing memory). In
such a situation, the OOM-killer kicks in and identifies the process
to be the sacrificial lamb for the benefit of the rest of the system.
Users and system administrators have often asked for ways to control the
behavior of the OOM killer. To facilitate control, the
/proc/<pid>/oom_adj knob was introduced to save
important processes in the
system from being killed, and define an order of processes to be
killed. The possible values of oom_adj range from -17 to
+15. The higher the value, the more likely the associated process is to
be killed by the OOM-killer. If oom_adj is set
to -17, the process is not considered for OOM-killing.
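A program can adjust the knob for itself or for another process by writing to the file. The following is a minimal sketch under the assumptions in the text (a range of -17 to +15, with -17 meaning "never kill"); the constant and function names are local to this example.

```c
#include <stdio.h>
#include <sys/types.h>

/* Names local to this sketch; the values come from the range given above. */
#define OOM_ADJ_DISABLE (-17)   /* exempt the process from OOM-killing */
#define OOM_ADJ_MAX     15      /* most likely to be killed */

/* Return nonzero if `adj` is inside the accepted range. */
int oom_adj_valid(int adj)
{
    return adj >= OOM_ADJ_DISABLE && adj <= OOM_ADJ_MAX;
}

/* Write `adj` to /proc/<pid>/oom_adj. Returns 0 on success, -1 on a bad
 * value or any I/O error (for example, insufficient privileges). */
int set_oom_adj(pid_t pid, int adj)
{
    char path[64];
    FILE *f;
    int ok;

    if (!oom_adj_valid(adj))
        return -1;

    snprintf(path, sizeof(path), "/proc/%ld/oom_adj", (long)pid);
    f = fopen(path, "w");
    if (!f)
        return -1;

    ok = fprintf(f, "%d\n", adj) > 0;
    if (fclose(f) != 0)         /* the write may only fail at close time */
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, a database supervisor running as root might call set_oom_adj(getpid(), OOM_ADJ_DISABLE) at startup to keep itself out of the OOM killer's sights.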
Who's Bad?
The process to be killed in an out-of-memory situation is selected
based on its badness score. The badness score is reflected in
/proc/<pid>/oom_score. This value is determined on
the basis that the system
loses the minimum amount of work done, recovers a large amount of
memory, doesn't kill any process innocent of eating tons of memory, and
kills the minimum number of processes (if possible limited to one).
The badness score is computed using the original memory size of the process,
its CPU time (utime + stime), the run time (uptime - start time) and
its oom_adj value. The more memory the process uses, the higher
the score.
The longer a process is alive in the system, the smaller the score.
Any process unlucky enough to be in the swapoff() system call
(which removes a swap file from the system) will be
selected to be killed first. For the rest,
the initial memory size becomes the original badness score of the process.
Half of each child's memory size is added to the parent's score if they do not
share the same memory. Thus forking servers are the prime candidates
to be killed. Having only one "hungry" child will make the parent less
preferable than the child. Finally, the following heuristics are
applied to save important processes:
- if the task has nice value above zero, its score doubles
- superuser or direct hardware access tasks (CAP_SYS_ADMIN,
CAP_SYS_RESOURCE or CAP_SYS_RAWIO) have their score divided
by 4. This is cumulative, i.e., a super-user task with
hardware access would have its score divided by 16.
- if the OOM condition happened in one cpuset and the task being checked
does not belong to that set, its score is divided by 8.
- the resulting score is multiplied by two to the power of oom_adj
(i.e. points <<= oom_adj when it is positive and
points >>= -(oom_adj) otherwise).
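The adjustments listed above can be put together in a simplified model. This is a sketch of the article's description, not the kernel's code: the struct and function names are invented, the scaling by CPU time and run time is left out because its exact form is not given here, and overflow of the left shift for very large scores is ignored.

```c
/* Hypothetical description of one task, in pages. */
struct task_model {
    unsigned long mem_pages;        /* memory size of the task itself */
    unsigned long child_mem_pages;  /* memory of children not sharing its memory */
    int nice;                       /* nice value */
    int is_superuser;               /* CAP_SYS_ADMIN or CAP_SYS_RESOURCE */
    int has_rawio;                  /* CAP_SYS_RAWIO */
    int in_oom_cpuset;              /* belongs to the cpuset that ran out */
    int oom_adj;                    /* -17 .. +15 */
};

/* Simplified badness score following the steps described in the text. */
unsigned long badness_model(const struct task_model *t)
{
    unsigned long points;

    if (t->oom_adj == -17)
        return 0;                   /* never considered for killing */

    /* own memory plus half of the children's memory */
    points = t->mem_pages + t->child_mem_pages / 2;

    if (t->nice > 0)
        points *= 2;                /* niced tasks are less important */
    if (t->is_superuser)
        points /= 4;
    if (t->has_rawio)
        points /= 4;                /* cumulative with the above: /16 */
    if (!t->in_oom_cpuset)
        points /= 8;

    if (t->oom_adj > 0)
        points <<= t->oom_adj;
    else if (t->oom_adj < 0)
        points >>= -t->oom_adj;
    return points;
}
```

With 1000 pages of its own and 400 pages of children, an ordinary task scores 1200 in this model; the same task running as root with raw I/O access scores 75.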
The task with the highest badness score is then selected and its children
are killed. The process itself will be killed in an OOM situation when it
does not have children.
Shifting OOM-killing policy to user-space
/proc/<pid>/oom_score is a dynamic value which changes
with time, so it does not lend itself to the different and changing
policies an administrator may require. It is difficult to determine which process will be killed
in case of an OOM condition. The administrator must adjust the score
for every process created, and for every process which exits. This
could be quite a task in a system with quickly-spawning processes. In an
attempt to
make OOM-killer policy implementation easier, a name-based solution
was proposed by Evgeniy Polyakov. With his patch, the process to die first
is the one running the program whose name is found in
/proc/sys/vm/oom_victim.
A name-based solution has its limitations:
- The task name is not a reliable indicator of the program's true name,
and it is truncated in the process name fields.
Moreover, symlinks to executing binaries under
different names will not work with this approach.
- This approach can specify only one name at a time, ruling
out the possibility of a hierarchy.
- There could be multiple processes of the same name but from
different binaries.
- The behavior falls back to the current default
implementation if there is no process by the name defined in
/proc/sys/vm/oom_victim. This increases the number of scans
required to find the victim process.
Alan Cox disliked this solution, suggesting that
containers are the most appropriate way to
control the problem. In response to this suggestion, the oom_killer controller,
contributed by Nikanth
Karthikesan, provides control of the sequence of processes to be killed when the
system runs out of memory. The patch introduces an OOM control group
(cgroup) with an oom.priority field. The process to be killed is
selected from the processes having the highest oom.priority value.
To take control of the OOM-killer, mount the cgroup OOM
pseudo-filesystem introduced by the patch:
# mount -t cgroup -o oom oom /mnt/oom-killer
The OOM-killer directory contains the list of all processes in the file
tasks, and their OOM priority in oom.priority. By default,
oom.priority is set to one.
If you want to create a special control group containing the list of
processes which should be the first to receive the OOM killer's
attention, create a directory under /mnt/oom-killer to represent it:
# mkdir lambs
Set oom.priority to a value high enough:
# echo 256 > /mnt/oom-killer/lambs/oom.priority
oom.priority is a 64-bit unsigned integer, so it can hold any
value up to the 64-bit unsigned maximum. While scanning for the
process to be killed, the OOM-killer selects a process from the list
of tasks with the highest oom.priority value.
Add the PID of the target process to the group's list of tasks:
# echo <pid> > /mnt/oom-killer/lambs/tasks
To create a list of processes which will not be killed by the
OOM-killer, make a directory to contain the processes:
# mkdir invincibles
Setting oom.priority to zero excludes all the processes in this cgroup
from the list of target processes to be killed:
# echo 0 > /mnt/oom-killer/invincibles/oom.priority
To add more processes to this group, add the pid of the task to the
list of tasks in the invincible group:
# echo <pid> > /mnt/oom-killer/invincibles/tasks
Important processes, such as database processes and their
controllers, can be added to this group, so they are ignored when
OOM-killer searches for processes to be killed.
All children of the processes listed in tasks are automatically added
to the same control group and inherit the oom.priority of the parent.
When multiple tasks have the highest oom.priority, the OOM killer
selects the process based on the oom_score and oom_adj.
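The selection rule described for this proposal (zero priority means exempt, highest priority wins, ties broken by the usual score) can be sketched as follows. The structure and the pick_victim() function are invented for illustration and are not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one candidate task. */
struct candidate {
    int pid;
    uint64_t oom_priority;      /* cgroup's oom.priority; 0 = never kill */
    unsigned long oom_score;    /* badness score used to break ties */
};

/* Return the index of the task to kill, or -1 if every task is exempt. */
int pick_victim(const struct candidate *c, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (c[i].oom_priority == 0)
            continue;           /* the "invincibles" group */
        if (best < 0 ||
            c[i].oom_priority > c[best].oom_priority ||
            (c[i].oom_priority == c[best].oom_priority &&
             c[i].oom_score > c[best].oom_score))
            best = (int)i;
    }
    return best;
}
```

Note that a task with a huge badness score but a priority of zero is never chosen, while a small task in the "lambs" group goes first.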
This approach did not appeal to cpuset users, though. Consider two
cpusets, A and B. If a process in cpuset A has a high oom.priority
value, it will be killed if cpuset B runs out of memory,
even though there is enough memory in cpuset A. This calls for a
different design to tame the OOM killer.
An interesting outcome of the discussion has been handling OOM situations in
user space. The kernel sends notification to user space, and
applications respond by dropping their user-space caches. In case the
user-space processes are not able to free enough memory, or the
processes ignore the kernel's requests to free memory, the kernel
resorts to the good old method of killing processes.
mem_notify, developed
by Kosaki Motohiro, is one such attempt made in the past. However, the
mem_notify patch
cannot be applied to versions beyond 2.6.28 because the memory
management reclaiming sequence has changed, but the design principles
and goals can be reused. David Rientjes suggests one of
two hybrid solutions:
One is the cgroup OOM notifier that allows you to attach a task to
wait on an OOM condition for a collection of tasks. This allows userspace to
respond to the condition by dropping caches, adding nodes to a cpuset,
elevating memory controller limits, sending a signal, etc. It can
also defer to the kernel OOM killer as a last resort.
The other is /dev/mem_notify that allows you to poll() on a device
file and be informed of low memory events. This can include the cgroup oom
notifier behavior when a collection of tasks is completely out of memory,
but can also warn when such a condition may be imminent. I suggested that
this be implemented as a client of cgroups so that different handlers can
be responsible for different aggregates of tasks.
Most developers prefer making /dev/mem_notify a client of control
groups. This can be further extended to merge with the proposed
oom-controller.
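From an application's point of view, the /dev/mem_notify idea amounts to an ordinary poll() loop. The sketch below shows only that waiting step; wait_for_low_memory() is a made-up name, and since /dev/mem_notify is a proposed interface rather than a mainline one, the function takes whatever file descriptor the caller has opened.

```c
#include <poll.h>

/* Wait up to `timeout_ms` milliseconds for a notification on `fd`.
 * Returns 1 if an event is readable, 0 on timeout, -1 on error. */
int wait_for_low_memory(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;              /* poll() itself failed */
    if (rc == 0)
        return 0;               /* nothing happened in time */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

An application would call this in a loop and, on a return value of 1, drop its caches before going back to waiting.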
Low Memory in Embedded Systems
The Android developers required a greater degree of control over the low
memory situation because the OOM killer does not kick in till late in
the low memory situation, i.e. till all the cache is emptied. Android
wanted a solution which would start early while the free memory is
being depleted. So they introduced the "lowmemory" driver, which
has multiple thresholds of low memory. In a low-memory situation, when
the first thresholds are met, background processes are notified of the
problem. They do
not exit, but, instead, save their state. This affects the latency when
switching applications, because the application has to reload on
activation. On further pressure, the lowmemory killer kills the
non-critical background processes whose state had been saved in the
previous threshold and, finally, the foreground applications.
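The escalation can be pictured as a simple threshold table. The sketch below is not the Android driver's code; the names and the idea of expressing thresholds in free pages are assumptions made for illustration.

```c
/* Hypothetical actions, in order of increasing severity. */
enum lowmem_action {
    LOWMEM_NONE,
    LOWMEM_NOTIFY_BACKGROUND,   /* ask background processes to save state */
    LOWMEM_KILL_BACKGROUND,     /* kill non-critical background processes */
    LOWMEM_KILL_FOREGROUND      /* last resort */
};

/* Hypothetical thresholds, in free pages, from mildest to most severe. */
struct lowmem_thresholds {
    unsigned long notify_pages;
    unsigned long kill_background_pages;
    unsigned long kill_foreground_pages;
};

/* Choose the most severe action whose threshold has been crossed. */
enum lowmem_action lowmem_decide(const struct lowmem_thresholds *t,
                                 unsigned long free_pages)
{
    if (free_pages < t->kill_foreground_pages)
        return LOWMEM_KILL_FOREGROUND;
    if (free_pages < t->kill_background_pages)
        return LOWMEM_KILL_BACKGROUND;
    if (free_pages < t->notify_pages)
        return LOWMEM_NOTIFY_BACKGROUND;
    return LOWMEM_NONE;
}
```

Checking the most severe threshold first keeps the function correct even when free memory drops across several thresholds between two checks.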
Keeping multiple low memory triggers gives the processes enough time to free
memory from their caches because in an OOM situation, user-space
processes may not be able to run at all. All it takes is a single
allocation from the kernel's internal structures, or a page fault
to make the system run out of memory. An earlier notification
of a low-memory situation could avoid the OOM situation with a little help
from the user space applications which respond to low memory notifications.
Killing processes based on kernel heuristics is not an
optimal solution, and these new initiatives, which let the user
choose the sacrificial lamb, are steps toward a more robust design.
However, it may take some time to come to a consensus on a final control
solution.
Page editor: Jonathan Corbet