Kernel development
Brief items
Kernel release status
The current 2.6 development kernel remains 2.6.29-rc3. As of this writing, just over 500 changesets have been merged into the mainline since 2.6.29-rc3; they are dominated by fixes, but there are also some UBIFS enhancements (including direct I/O support), a driver for AMD CS5536 PATA controllers, and SPARC64 NMI watchdog support. Past experience suggests that the 2.6.29-rc4 release can be expected a few milliseconds after this page is published.

The current stable 2.6 kernel is 2.6.28.3, released on February 2; 2.6.27.14 was released at the same time. Both updates contain a long list of fixes for serious problems.
The 2.6.28.4 and 2.6.27.15 updates are in the review process as of this writing; their probable release date is February 6.
Kernel development news
Quotes of the week
-- Evgeniy Polyakov claims a slight performance advantage
Dracut looks to replace the initramfs patchwork
Creating initramfs images, for use by the kernel at "early boot" time, is a rather messy business. It is made more so by the fact that each individual distribution has its own tools to build the image, as well as its own set of tools inside it. At the 2008 Kernel Summit, Dave Jones spent some time discussing the problem along with his idea to start over by creating a cross-distribution initramfs. That has led to the Dracut project, which was announced by Jeremy Katz in December, and a new mailing list, aptly named "initramfs", in which to discuss it.
An initramfs is a cpio archive of the initial filesystem that gets loaded into memory when the kernel boots. That filesystem needs to contain all of the drivers and tools needed to mount the real root filesystem. An initramfs isn't strictly necessary; a minimal /dev, along with the required drivers built into the kernel, is another alternative. Distributions, though, all use an initramfs and, over time, each has come up with its own way to handle this process. Jones, Katz, and others would like to see something more standardized that gets pushed upstream into the mainline kernel, so that distributions can stop fussing with the problem.
There are a number of advantages to that approach. Building an initramfs from the kernel sources would eliminate problems that users who build their own kernels sometimes run into; if a distribution's initramfs scheme falls behind the pace of kernel development in some fashion, users can find themselves unable to build a kernel+initramfs combination that will work. There is also hope, as Katz has noted, that dracut will help speed up the boot process by using udev.
Because initramfs is so integral to the early boot process—and so difficult to debug if problems arise—there is a concern about starting over. It is not surprising, then, that there is some resistance to throwing out years of hard-earned knowledge that is embodied in the various distributions' initramfs handling, leading Maximilian Attems to ask:
beside having more features and flexibility it does not hardcode udev usage, nor bash, why should it not be considered at first!?
It is a question that is frequently asked, but one that Jones has a ready answer for:
"why not use the suse one?"
they all have some good and bad tradeoffs. Distro X has feature Y which no-one else does. etc.
When the project began we spent some time looking at what everyone else already does, and "lets start over and hope others participate" seemed more attractive than taking an existing one and bending it to fit.
So, the Red Hat folks, at least, are proceeding with dracut. Jones recently posted a status report on his blog that outlined what is working and what still needs to be done. Though it currently is "Fedora-centric, with a few hardcoded assumptions in there, so it'll likely fall over on other distros", fixing that is clearly high on the to-do list. The status report is an effort to get people up to speed so that other distributions can start trying it out. In addition, he plans to start trying it on various distributions himself.
In its current form, dracut is rather minimal. It has a script named dracut that will generate a gzipped cpio file for the initramfs image, as well as an init shell script that ends up in that image. Jones says that init "achieves quite a lot in its 119 lines": setting up device nodes, starting udev, waiting for the root device to show up and mounting it, mounting /proc and /sys, and more. If anything goes wrong during that process, init will drop to a shell that will allow diagnosis of the problem. So far, it only supports the simpler cases for the location of the root filesystem.
There is only one remaining barrier to getting rid of the unlamented nash, and that is a utility to do a switch_root (i.e. switch to a new root directory and start an init from there). The plan is to write a standalone utility that would be added to the util-linux package. The environment provided by the initramfs would include util-linux and bash, and would use glibc, which doesn't sit well with some embedded folks; they generally prefer a statically linked busybox environment. Kay Sievers outlines the reasons for a standard environment:
Full-featured distros who make their money with support, can just not afford to support tools compiled differently from the tools in the real rootfs. SUSE used klibc for one release, and stopped doing that immediately, because you go crazy if you run into problems with bootup problems on [customer] setups you can not reproduce with the tools from the real rootfs.
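The switch_root utility itself is conceptually simple; a minimal sketch of the sequence (purely illustrative, and not the eventual util-linux implementation; the helper name is this author's own) might look like:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/mount.h>

/* Illustrative sketch of a switch_root-style operation: move the new
 * root filesystem over "/", chroot into it, and exec the new init.
 * This is NOT the util-linux implementation; a real tool would also
 * delete the old initramfs contents to free the memory they occupy. */
int do_switch_root(const char *newroot, const char *init)
{
    if (chdir(newroot) < 0)                        /* enter the new root */
        return -1;
    if (mount(".", "/", NULL, MS_MOVE, NULL) < 0)  /* move it over "/" */
        return -1;
    if (chroot(".") < 0 || chdir("/") < 0)         /* make it the root */
        return -1;
    execl(init, init, (char *)NULL);               /* become the new init */
    return -1;                                     /* reached only on failure */
}
```

Of course, only a process running as PID 1 from an initramfs would actually want to do this; run from anywhere else, the calls above will simply fail with the old root intact.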
There is plenty to do to make dracut into a real tool for creating initramfs images—at least ones that work on more than just Fedora—more root filesystem types need to be handled, hibernation signatures need to be recognized and handled, the udev rules need to be cleaned up, kdump images need to be supported, etc. But the overriding question is: will other distributions start working on dracut as well? If and when Jones (or others) get things at least limping along on Debian/Ubuntu and/or SUSE, will those distributions start getting on board? So far, there is not a lot of evidence of anyone other than Red Hat working on dracut.
But, the plan is to eventually submit dracut upstream to the mainline kernel, so that make initramfs works in a standard kernel tree. It would seem that many kernel hackers see the need for standardizing initramfs and eventually moving it into the kernel, as Ted Ts'o notes:
So IMHO, it's important not only that the distributions standardize on a single initramfs framework, but that framework get integrated into the kernel sources.
No one is very happy about losing their particular version of the tools to build an initramfs—if only because of familiarity—but a standardized solution is something whose time has come. Probably any of the existing tools could have been used as a starting point, but for political reasons, it makes sense to start anew. There is a fair amount of cruft that has built up in the existing tools as well, which folks are unlikely to miss, so there are also technical reasons to start over. It should come as no surprise that a project started by Red Hat might be somewhat Fedora-centric in its early form, but the clear intent is to make it distribution-agnostic. It would seem the right time for other distributions and constituencies (embedded for example) to get involved to help shape dracut into something useful for all.
Online defragmentation for ext4
Any filesystem designed for use with rotating media must pay careful attention to the layout of files on the disk. If a file's blocks can be placed sequentially on the device, they can be read or written as a unit, without the need for performance-destroying head seeks in the middle. Even the most careful filesystem will sometimes fail to lay out files in a minimal number of contiguous extents, though. If a file grows, for example, and the blocks just past the previous end are not available, the filesystem has no choice other than placing the new blocks somewhere else. Depending on how full the filesystem is, those blocks could end up far away indeed. This sort of fragmentation can result in filesystems slowing down over time.

Fragmentation problems can be fixed up after the fact. The most obvious way to defragment a disk is to make a new filesystem on it; after all, empty filesystems tend not to have fragmentation problems. And the new filesystem will have less fragmentation even after its old contents have been restored onto it: when the ultimate size of every file is known in advance, it's relatively easy to make good layout decisions. Knowing this, system administrators have used backup-and-restore cycles as a way of cleaning up overly fragmented disks for many years.
There is, of course, a problem with this approach which goes beyond the risk of discovering that one's backup is not quite as good as one had thought. The downtime associated with rewriting a disk can be unwelcome to users; a filesystem which is down responds even more slowly than a filesystem with fragmentation problems. So it would be nice to have a way to defragment a filesystem while keeping it online and available. This online defragmentation capability has been on the ext4 "planned features" list for a long time; it is, at this point, about the only planned feature which has not yet been merged into the mainline.
Some attempts at online defragmentation have been made in the past, but they have not, yet, gotten through review. Now Akira Fujita has come forward with a new ext4 online defragmentation patch which, by virtue of a different view of the problem, might just make it into the mainline. Previous attempts exposed an interface whereby a user-space application could ask the filesystem to defragment a specific file by allocating new (contiguous) blocks to it. That turned out to be a bit too much work to put into the kernel; so, with this patch, Akira has created an interface which moves a bit more of the work into user space.
In the new scheme, a user-space defragmentation daemon will pick a file which, in its opinion, is too spread out on the disk. The daemon will then set about creating a new, less-fragmented file to replace it. That is done by creating a new, temporary file on the same filesystem, then unlinking it (while holding the file descriptor open). Calls to fallocate() can then be used to add the requisite number of blocks to the new file. Once the new file is up to the correct size, the daemon can use the FS_IOC_FIEMAP ioctl() to query the number of extents (fragments) it contains. If the new file is not an improvement over the old one, the daemon should just close it and give up; the filesystem simply does not have enough contiguous storage available.
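The extent-counting step can be done with the FS_IOC_FIEMAP ioctl(): passing a zero fm_extent_count asks the kernel to report only how many extents back the range. A simplified sketch (the helper name is this author's own, not part of the patch) might look like:

```c
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

/* Ask the filesystem how many extents back the first "length" bytes
 * of the file.  Returns the extent count, or -1 if the filesystem
 * does not support FIEMAP. */
int count_extents(int fd, unsigned long long length)
{
    struct fiemap fm;

    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = length;
    fm.fm_extent_count = 0;   /* just count; don't return extent data */

    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
        return -1;
    return fm.fm_mapped_extents;
}
```

The defragmentation daemon would call something like this on both the old file and the freshly allocated replacement, and proceed only when the new file has fewer extents.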
The daemon could, at this point, simply copy the old file into the new one, then put the newly defragmented version in the place of the old one. The problems with that approach include performance (all that data must be copied through user space) and robustness. If some other process changes the file while the copy is happening, the new file may lose those changes. Indeed, if some process has the old file open, it may never notice that the replacement has happened. So something smarter is needed.
Akira's patch addresses these problems with the creation of a new, magic ioctl() call for ext4. The defragmentation application must fill out a structure like:
    struct move_extent {
        int org_fd;        /* original file descriptor */
        int dest_fd;       /* destination file descriptor */
        ext4_lblk_t start; /* logical offset of org_fd and dest_fd */
        ext4_lblk_t len;   /* exchange block length */
    };
This structure, when passed to the new EXT4_IOC_DEFRAG ioctl(), expresses a request to the kernel to move len blocks from the original file to the new one, starting at start. Essentially, it copies an extent's worth of data into the (fully allocated, nicely contiguous) space in the new file, then performs a magic block swap. The contiguous blocks from the new file are patched into the old file, while the fragmented blocks are, instead, put into the new file. Once the entire file has been treated in this way, the file will have been defragmented without having been visibly moved.
The final step is to delete the "new" file, which now contains the "old" file's blocks. Since the file had been unlinked, that will cause the filesystem to recover the old blocks and the task will be complete. For the curious, Akira has posted the source for a user-space defragmentation tool which shows how this interface can be used.
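In outline, the user-space side of a single exchange could look something like the following. Note that the ioctl request number below is a stand-in: the real value is defined by Akira's patch and is not available in mainline headers, so this sketch shows only the call pattern.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/types.h>

typedef __u32 ext4_lblk_t;

/* Structure layout from Akira's patch. */
struct move_extent {
    int org_fd;          /* original file descriptor */
    int dest_fd;         /* destination file descriptor */
    ext4_lblk_t start;   /* logical offset of org_fd and dest_fd */
    ext4_lblk_t len;     /* exchange block length */
};

/* Stand-in request number; the real one comes from the patch's
 * ext4 headers. */
#ifndef EXT4_IOC_DEFRAG
#define EXT4_IOC_DEFRAG _IOW('f', 15, struct move_extent)
#endif

/* Ask the kernel to swap "len" blocks at logical offset "start"
 * between the fragmented file and the preallocated donor file.
 * Returns 0 on success, -1 on failure. */
int exchange_blocks(int org_fd, int dest_fd, ext4_lblk_t start, ext4_lblk_t len)
{
    struct move_extent me = {
        .org_fd  = org_fd,
        .dest_fd = dest_fd,
        .start   = start,
        .len     = len,
    };

    return ioctl(org_fd, EXT4_IOC_DEFRAG, &me);
}
```

On a kernel without the patch the call simply fails, which is also what a careful daemon must be prepared for on filesystems other than ext4.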
There have not been a whole lot of objections to the new code. Chris Mason did point out that the system will do unfortunate things if the layout of a swap file changes. He has clearly thought about the problem - to an extent.
Beyond that, there are some minor issues, such as the definition of the ABI in terms of types like int instead of architecture-independent types. Requests for separate source and destination block numbers have been made; that feature would help developers working on hierarchical storage systems. The ability to guide the allocation of blocks would be useful in situations where performance can be improved by grouping related files together on the disk.
There could also be value in finding a way to move much of this functionality into the VFS layer where it could be used with other filesystems as well; that could prove to be a difficult task, though, and ext4 maintainer Ted Ts'o has little desire to take on that job.
Those little issues notwithstanding, it does appear that the ext4 filesystem may be closer to getting the much-requested online defragmentation feature.
Taming the OOM killer
Under desperately low memory conditions, the out-of-memory (OOM) killer kicks in and picks a process to kill using a set of heuristics which has evolved over time. This may be pretty annoying for users who may have wanted a different process to be killed. The process killed may also be important from the system's perspective. To avoid the untimely demise of the wrong processes, many developers feel that a greater degree of control over the OOM killer's activities is required.
Why the OOM-killer?
Major distribution kernels set the default value of /proc/sys/vm/overcommit_memory to zero, which means that processes can request more memory than is currently free in the system. This is done based on the heuristic that allocated memory is not used immediately, and that processes, over their lifetime, also do not use all of the memory they allocate. Without overcommit, a system will not fully utilize its memory, thus wasting some of it. Overcommitting memory allows the system to use memory more efficiently, but at the risk of OOM situations. Memory-hogging programs can deplete the system's memory, bringing the whole system to a grinding halt. This can lead to a situation where memory is so low that not even a single page can be allocated: not to a user process so that the administrator can kill an appropriate task, and not to the kernel so that it can carry out important operations such as freeing memory. In such a situation, the OOM killer kicks in and identifies the process to be the sacrificial lamb for the benefit of the rest of the system.
Users and system administrators have often asked for ways to control the behavior of the OOM killer. To facilitate control, the /proc/<pid>/oom_adj knob was introduced to save important processes in the system from being killed, and to define an order of processes to be killed. The possible values of oom_adj range from -17 to +15. The higher the value, the more likely the associated process is to be killed by the OOM killer. If oom_adj is set to -17, the process is not considered for OOM killing at all.
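Adjusting the knob is just a matter of writing a value to the proc file; a small illustrative helper (the function name is this author's own) could look like:

```c
#include <stdio.h>
#include <unistd.h>

/* Write an oom_adj value (-17 to +15) for the given process.
 * Lowering the value below the current one requires privilege
 * (CAP_SYS_RESOURCE); raising it is allowed for one's own tasks.
 * Returns 0 on success, -1 on failure. */
int set_oom_adj(int pid, int value)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/oom_adj", pid);
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fprintf(f, "%d\n", value) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

A daemon manager might call set_oom_adj(pid, -17) for a critical database process, for example, to exempt it entirely.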
Who's Bad?
The process to be killed in an out-of-memory situation is selected based on its badness score, which is reflected in /proc/<pid>/oom_score. This value is determined with several goals in mind: the system should lose the minimum amount of work done, recover a large amount of memory, avoid killing any innocent process merely eating tons of memory, and kill the minimum number of processes (if possible, limited to one). The badness score is computed using the original memory size of the process, its CPU time (utime + stime), its run time (uptime - start time), and its oom_adj value. The more memory the process uses, the higher the score; the longer a process has been alive in the system, the smaller the score.
Any process unlucky enough to be in the swapoff() system call (which removes a swap file from the system) will be selected to be killed first. For the rest, the initial memory size becomes the original badness score of the process. Half of each child's memory size is added to the parent's score if they do not share the same memory. Thus forking servers are the prime candidates to be killed. Having only one "hungry" child will make the parent less preferable than the child. Finally, the following heuristics are applied to save important processes:
- If the task has a nice value above zero, its score doubles.
- Superuser or direct hardware access tasks (CAP_SYS_ADMIN, CAP_SYS_RESOURCE, or CAP_SYS_RAWIO) have their score divided by 4. This is cumulative; i.e., a superuser task with hardware access would have its score divided by 16.
- If the OOM condition happened in one cpuset and the task being checked does not belong to that set, its score is divided by 8.
- The resulting score is multiplied by two to the power of oom_adj (i.e. points <<= oom_adj when it is positive and points >>= -(oom_adj) otherwise).
The task with the highest badness score is then selected, but it is not necessarily the first to die: one of its children is killed instead when possible, and the process itself is killed in an OOM situation only when it has no children.
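Put together, the heuristics above amount to something like the following simplified sketch. The real code lives in mm/oom_kill.c and handles many more details; the function and parameter names here are illustrative only.

```c
/* Simplified model of the badness-score heuristics.  "points" is the
 * base score derived from the task's memory size (already adjusted
 * for CPU time, run time, and child memory); the flags select which
 * of the heuristics apply.  Not the actual mm/oom_kill.c code. */
unsigned long badness(unsigned long points,  /* base score: memory size */
                      int nice_above_zero,   /* task has nice > 0 */
                      int has_sys_admin,     /* CAP_SYS_ADMIN/_RESOURCE */
                      int has_rawio,         /* CAP_SYS_RAWIO */
                      int other_cpuset,      /* OOM was in another cpuset */
                      int oom_adj)           /* -17 .. +15 */
{
    if (oom_adj == -17)
        return 0;           /* task is exempt from OOM killing */
    if (nice_above_zero)
        points *= 2;        /* niced tasks make better victims */
    if (has_sys_admin)
        points /= 4;        /* protect privileged tasks */
    if (has_rawio)
        points /= 4;        /* cumulative: /16 with both capabilities */
    if (other_cpuset)
        points /= 8;        /* prefer tasks in the failing cpuset */
    if (oom_adj > 0)
        points <<= oom_adj; /* scale by 2^oom_adj */
    else
        points >>= -oom_adj;
    return points;
}
```

The final shift makes oom_adj a powerful lever: each step doubles or halves a task's chance of selection relative to its neighbors.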
Shifting OOM-killing policy to user-space
/proc/<pid>/oom_score is a dynamic value which changes over time, and it does not accommodate the different, dynamic policies an administrator might require. It is difficult to determine which process will be killed in case of an OOM condition. The administrator must adjust the score for every process created, and for every process which exits; that could be quite a task on a system with quickly-spawning processes. In an attempt to make OOM-killer policy implementation easier, a name-based solution was proposed by Evgeniy Polyakov. With his patch, the process to die first is the one running the program whose name is found in /proc/sys/vm/oom_victim. A name-based solution has its limitations, though:
- The task name is not a reliable indicator of the true name, and it is truncated in the process name fields. Moreover, symlinks to executing binaries, but with different names, will not work with this approach.
- This approach can specify only one name at a time, ruling out the possibility of a hierarchy.
- There could be multiple processes with the same name, but from different binaries.
- The behavior falls back to the current default implementation if there is no process with the name given in /proc/sys/vm/oom_victim; this increases the number of scans required to find the victim process.
Alan Cox disliked this solution, suggesting that containers are the most appropriate way to control the problem. In response to this suggestion, the oom_killer controller, contributed by Nikanth Karthikesan, provides control of the sequence of processes to be killed when the system runs out of memory. The patch introduces an OOM control group (cgroup) with an oom.priority field. The process to be killed is selected from the processes having the highest oom.priority value.
To take control of the OOM-killer, mount the cgroup OOM pseudo-filesystem introduced by the patch:
# mount -t cgroup -o oom oom /mnt/oom-killer
The OOM-killer directory contains the list of all processes in the file tasks, and their OOM priority in oom.priority. By default, oom.priority is set to one.
If you want to create a special control group containing the list of processes which should be the first to receive the OOM killer's attention, create a directory under /mnt/oom-killer to represent it:
# mkdir lambs
Set oom.priority to a suitably high value:
# echo 256 > /mnt/oom-killer/lambs/oom.priority
oom.priority is a 64-bit unsigned integer, so it can range up to the maximum value such a number can hold. While scanning for the process to be killed, the OOM killer selects a process from the list of tasks with the highest oom.priority value.
Add the PID of the process to be added to the list of tasks:
# echo <pid> > /mnt/oom-killer/lambs/tasks
To create a list of processes, which will not be killed by the OOM-killer, make a directory to contain the processes:
# mkdir invincibles
Setting oom.priority to zero excludes all processes in this cgroup from the list of target processes to be killed:
# echo 0 > /mnt/oom-killer/invincibles/oom.priority
To add more processes to this group, add the pid of the task to the list of tasks in the invincible group:
# echo <pid> > /mnt/oom-killer/invincibles/tasks
Important processes, such as database processes and their controllers, can be added to this group, so they are ignored when the OOM killer searches for processes to be killed. All children of the processes listed in tasks are automatically added to the same control group and inherit the oom.priority of the parent. When multiple tasks have the highest oom.priority, the OOM killer selects the process based on oom_score and oom_adj.
This approach did not appeal to cpuset users, though. Consider two cpusets, A and B. If a process in cpuset A has a high oom.priority value, it will be killed if cpuset B runs out of memory, even though there is enough memory in cpuset A. This calls for a different design to tame the OOM killer.
An interesting outcome of the discussion has been the handling of OOM situations in user space. The kernel sends a notification to user space, and applications respond by dropping their user-space caches. If the user-space processes are not able to free enough memory, or if they ignore the kernel's requests to free memory, the kernel resorts to the good old method of killing processes. mem_notify, developed by Kosaki Motohiro, is one such attempt made in the past. The mem_notify patch cannot be applied to versions beyond 2.6.28 because the memory-management reclaiming sequence has changed, but the design principles and goals can be reused. David Rientjes suggests one of two hybrid solutions:
The other is /dev/mem_notify that allows you to poll() on a device file and be informed of low memory events. This can include the cgroup oom notifier behavior when a collection of tasks is completely out of memory, but can also warn when such a condition may be imminent. I suggested that this be implemented as a client of cgroups so that different handlers can be responsible for different aggregates of tasks.
Most developers prefer making /dev/mem_notify a client of control groups. This can be further extended to merge with the proposed oom-controller.
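A client of such a device would simply poll() for low-memory events. The sketch below follows the mem_notify proposal's model (the device path and semantics are not a mainline interface, and the helper name is this author's own):

```c
#include <poll.h>

/* Wait up to timeout_ms for a low-memory event on an already-opened
 * notification fd (e.g. the proposed /dev/mem_notify).  Returns 1
 * when an event arrives, 0 on timeout, -1 on error.  On an event,
 * the application would drop its caches and trim its heap. */
int wait_for_lowmem(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;                       /* no memory pressure yet */
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

Because poll() works on any file descriptor, the same client code would keep working whether the events come from a character device or, as suggested, from a per-cgroup notifier.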
Low Memory in Embedded Systems
The Android developers required a greater degree of control over low-memory situations because the OOM killer does not kick in until late in a low-memory situation, i.e. until all of the cache is emptied. Android wanted a solution which would start early, while free memory is still being depleted, so its developers introduced the "lowmemory" driver, which has multiple thresholds of low memory. In a low-memory situation, when the first threshold is crossed, background processes are notified of the problem. They do not exit but, instead, save their state; this affects latency when switching applications, because an application has to reload on activation. Under further pressure, the lowmemory killer kills the non-critical background processes whose state had been saved at the previous threshold and, finally, the foreground applications.
Keeping multiple low-memory triggers gives processes enough time to free memory from their caches; in a full OOM situation, user-space processes may not be able to run at all. All it takes to push the system over the edge is a single allocation for the kernel's internal structures, or a page fault. An earlier notification of a low-memory situation could avoid the OOM situation entirely, with a little help from the user-space applications which respond to low-memory notifications.
Killing processes based on kernel heuristics is not an optimal solution, and these new initiatives, which offer users better control over the selection of the sacrificial lamb, are steps toward a more robust design. However, it may take some time to reach consensus on a final control solution.
Patches and updates
Kernel trees
Architecture-specific
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Networking
Security-related
Benchmarks and bugs
Miscellaneous
Page editor: Jonathan Corbet
