Kernel development
Brief items
Kernel release status
The 2.6.28 kernel is out, released on December 24. Some of the highlights of this kernel are the addition of the GEM GPU memory manager, the removal of the "experimental" label from the ext4 filesystem, scalability improvements in memory management via the reworked vmap() and the pageout scalability patches, the move of the -staging drivers into the mainline, and much more. See the excellent KernelNewbies summary for lots more details about 2.6.28.
The current 2.6 stable kernel is 2.6.27.10, released on December 18. It contains nearly two dozen fixes for some fairly serious problems in 2.6.27.
Kernel development news
Quotes of the week
On the subject of the longstanding "treason uncloaked!" kernel message:
Now that certainly fits my definition of amusing and if my goal for Linux was to amuse myself at the expense of users, I'd be all for keeping it[1]. But perversely, I actually want users to enjoy their Linux experience.
[1] Hell, I'd probably even get them to use git.
Justifying FS-Cache
In what must seem like a never-ending effort, David Howells is once again trying to get a generic mechanism to do local caching for network filesystems into the kernel. The latest version, number 41, of his FS-Cache patches was posted back in November, so now he is asking for it to be added to linux-next. That would mean that the feature was on track for the mainline in 2.6.29, but it would appear that 2.6.30 (if ever) is more likely.
The idea behind FS-Cache is to create a way for "slow" filesystems to cache their data on the local disk, so that repeated accesses do not require accessing the underlying slow storage. Howells has been working on getting it into the kernel for a number of years; our first article about it appeared in 2004. The canonical example of where it might be useful is a network filesystem on a heavily-used or low bandwidth link—the cost of re-reading data from the network may be much higher than retrieving it from a local disk. In addition, the cache can be persistent across reboots, allowing some files to live locally for a very long time.
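The caching idea can be illustrated with a toy read-through cache. What follows is only a sketch in Python under stated assumptions; it is not the FS-Cache implementation or its interface, and the names ReadThroughCache, fetch_remote, and cache_dir are invented for the example.

```python
import hashlib
import os
import tempfile


class ReadThroughCache:
    """Toy read-through cache keeping copies of remote files on local disk.

    Illustration only; this is not how FS-Cache itself is implemented.
    """

    def __init__(self, fetch_remote, cache_dir):
        self.fetch_remote = fetch_remote  # hypothetical slow network read
        self.cache_dir = cache_dir        # on disk, so it outlives the object
        self.remote_reads = 0

    def _local_path(self, name):
        digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def read(self, name):
        path = self._local_path(name)
        if os.path.exists(path):
            # Cache hit: served from local disk, no network traffic.
            with open(path, "rb") as f:
                return f.read()
        # Cache miss: pay the network cost once, then keep a local copy.
        data = self.fetch_remote(name)
        self.remote_reads += 1
        with open(path, "wb") as f:
            f.write(data)
        return data


def demo():
    remote = {"/export/big.iso": b"x" * 1024}  # stand-in for a remote server
    with tempfile.TemporaryDirectory() as d:
        cache = ReadThroughCache(remote.__getitem__, d)
        first = cache.read("/export/big.iso")
        second = cache.read("/export/big.iso")
        # A fresh object over the same directory models a persistent cache.
        again = ReadThroughCache(remote.__getitem__, d)
        third = again.read("/export/big.iso")
        return first == second == third, cache.remote_reads, again.remote_reads
```

A real cache also has to keep cached data coherent with the server and bound the disk space it uses; the sketch ignores both.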
But Howells already has a fairly large, intrusive patch that is headed for 2.6.29: credentials. That patch touches a lot of code in the kernel, in particular the VFS layer. Christoph Hellwig is concerned about both credentials and FS-Cache going in at the same time:
While that would delay the addition of FS-Cache, Andrew Morton has a larger concern:
It's a huuuuuuuuge lump of new code, so it really needs to provide decent value. Can we revisit this? Yet again? What do we get from all this?
Morton is worried about adding additional maintenance headaches with no (or limited) benefits. Using a local disk to cache data from a remote disk is only useful in some scenarios; it can certainly make things worse in others. As Howells puts it: "It's a compromise: a trade-off between the loading and latencies of your network vs the loading and latencies of your disk; you sacrifice disk space to make up for the deficiencies of your network." What Morton is looking for is a push from users, be that end users or distributions that are shipping the feature. He would also like to see some benchmarks that show what gain there is when using FS-Cache.
Howells has patiently answered these concerns, pointing at some benchmarks he had posted in November that showed some significant savings. The benchmarks used NFS over a deliberately slow link (to simulate a heavily used network) and showed a huge decrease in the time required to read a large file, but were essentially break-even when operating on a kernel tree. In the kernel tree benchmark, though, the reduction in network traffic was significant.
More important, perhaps, is the fact that Red Hat has shipped FS-Cache in RHEL 5 and that there are customers using it, as well as customers interested in using it, as Howells pointed out:
While shipping out-of-tree code is no guarantee that the feature will get merged—AppArmor is an excellent counterexample—actual users whose needs are being met by a particular feature are a fairly persuasive argument. Howells outlines some customer use cases for FS-Cache, for example:
In all, it would seem that Morton's concerns were addressed. Whether that means the path is clear for 2.6.30 or these or other concerns will come to the fore is a question that will likely have to wait another three months or so.
Development statistics for 2.6.28
As of this writing, the 2.6.28 kernel is getting quite close to its final release. The flow of patches into the mainline repository has slowed to a trickle. So it becomes appropriate to look at what was done in this development cycle, and where all that code came from.
In these articles, your editor routinely forgets to thank Greg Kroah-Hartman, who continues to do a lot of work to ensure that these statistics are at least moderately accurate. So we'll get that taken care of at the outset: thanks, Greg!
The 2.6.28 development cycle has seen the incorporation of just under 9,000 changesets; that makes it a bit smaller in this regard than 2.6.27 (10,600) or 2.6.26 (10,100). The development base broadened, though; 1,262 developers have contributed to 2.6.28, more than has been seen with its predecessors. Those developers added 769,000 lines of code while removing 285,000, for a net growth of 484,000 lines - a relatively large amount. Much of that growth came by way of a single developer, as we will see below.
In recent development cycles, some 25% of the patches merged were accepted after the close of the merge window. Linus Torvalds has been making sounds about tightening the criteria for patches during the stabilization period, to the point that they would have to address known regressions to be accepted. A look at 2.6.28, though, shows that 1835 patches (so far) have gone in since 2.6.28-rc1. At 20% of the total, the patch flow rate during the stabilization period has fallen - but not by much.
So where did these patches come from? Here are the top twenty contributors to 2.6.28:
Most active 2.6.28 developers
By changesets
David S. Miller  239  2.7%
Yinghai Lu  200  2.2%
Al Viro  154  1.7%
Bartlomiej Zolnierkiewicz  150  1.7%
Alexey Dobriyan  121  1.3%
Paul Mundt  117  1.3%
Ingo Molnar  109  1.2%
Gerrit Renker  109  1.2%
Russell King  91  1.0%
Johannes Berg  91  1.0%
Steven Rostedt  85  0.9%
Alan Cox  84  0.9%
Takashi Iwai  83  0.9%
Tejun Heo  75  0.8%
Harvey Harrison  75  0.8%
Mark Brown  75  0.8%
Suresh Siddha  73  0.8%
Joerg Roedel  72  0.8%
Hans Verkuil  71  0.8%
Eric Miao  70  0.8%
By changed lines
Greg Kroah-Hartman  127848  14.9%
Inaky Perez-Gonzalez  24084  2.8%
Mark Brown  17714  2.1%
Joseph Chan  15749  1.8%
Pavel Machek  15529  1.8%
David S. Miller  15368  1.8%
Herbert Xu  13309  1.5%
Yinghai Lu  12861  1.5%
Paul Mundt  10088  1.2%
Magnus Damm  10077  1.2%
James Smart  8103  0.9%
Gerrit Renker  7536  0.9%
Johannes Berg  7196  0.8%
Bartlomiej Zolnierkiewicz  7182  0.8%
Eric Miao  7130  0.8%
Ron Mercer  7093  0.8%
Michael Buesch  6475  0.8%
Nick Kossifidis  6380  0.7%
David Vrabel  6357  0.7%
Adrian Bunk  6289  0.7%
On the changesets side, David Miller contributes a lot of work to the network stack, but the bulk of his changes this time around are to the SPARC architecture code. Yinghai Lu is a constant source of x86 architecture patches. Al Viro returns to the list with a lot of cleanup work in the VFS code, user-mode Linux, and beyond. Bartlomiej Zolnierkiewicz continues to clean up the legacy IDE code, despite the fact that its user base is shrinking. And Alexey Dobriyan contributed work in a number of areas, with the bulk of it being in the netfilter subsystem and /proc.
When looking at changed lines, one gets the sense that Greg Kroah-Hartman has been rather busy this time around. As it happens, Greg did not actually write most of that code; the bulk of it came in with the addition of the -staging tree. It seems that Greg, the self-named "maintainer of crap," has acquired substantial amounts of it. Inaky Perez-Gonzalez was the source of the patches adding support for ultrawideband radio and wireless USB. Expect to see him show up again soon; he is now working to get the WIMAX subsystem into the kernel. Mark Brown added drivers for a number of Wolfson Micro devices. Joseph Chan contributed the VIA framebuffer driver, and Pavel Machek added a handful of miscellaneous drivers.
So who paid for this work to be done? The 2.6.28 employer table looks like this:
Most active 2.6.28 employers
By changesets
(None)  1683  18.8%
Red Hat  1101  12.3%
(Unknown)  790  8.8%
Intel  654  7.3%
IBM  526  5.9%
Novell  460  5.1%
(Consultant)  227  2.5%
Oracle  206  2.3%
Sun  203  2.3%
Renesas Technology  169  1.9%
AMD  158  1.8%
Parallels  152  1.7%
Marvell  134  1.5%
(Academia)  131  1.5%
Analog Devices  122  1.4%
HP  120  1.3%
University of Aberdeen  109  1.2%
Fujitsu  106  1.2%
Nokia  97  1.1%
Freescale  87  1.0%
By lines changed
Novell  159527  18.6%
(None)  119373  13.9%
(Unknown)  78785  9.2%
Red Hat  67972  7.9%
Intel  64108  7.5%
IBM  31289  3.6%
Renesas Technology  24900  2.9%
Sun  19926  2.3%
(Consultant)  19605  2.3%
Wolfson Micro  17697  2.1%
VIA  17210  2.0%
Marvell  14108  1.6%
Freescale  12693  1.5%
Oracle  12101  1.4%
Analog Devices  10170  1.2%
University of Aberdeen  9969  1.2%
Emulex  8112  0.9%
Nokia  7744  0.9%
QLogic  7676  0.9%
Atmel  6885  0.8%
In general, the employer tables tend not to change too much from one development cycle to the next. Greg's staging tree work did put Novell at the top of the lines-changed column, despite the fact that this work did not originate at Novell. As always, one needs to bear in mind that these numbers are approximate.
One welcome change is the first-time appearance of VIA. It appears that this company is truly getting serious about supporting Linux, and that can only be a good thing.
Writing all this code is important, but so is reviewing, testing, and reporting bugs. Continuing with a relatively new tradition, we'll look at who shows up in patch tags indicating this kind of participation, starting with the reviewers:
Developers with the most reviews (total 83)
James Morris  12  14.5%
Rene Herman  12  14.5%
Matthew Wilcox  6  7.2%
KOSAKI Motohiro  5  6.0%
Richard Genoud  4  4.8%
Tomas Winkler  3  3.6%
Paul E. McKenney  3  3.6%
Mingming Cao  2  2.4%
Michael Krufky  2  2.4%
KAMEZAWA Hiroyuki  2  2.4%
Pekka Enberg  2  2.4%
Daisuke Nishimura  2  2.4%
Christoph Lameter  2  2.4%
Balbir Singh  2  2.4%
Julius Volz  2  2.4%
At this point, we are seeing about one Reviewed-by tag for every 100 changes going into the mainline repository. Fortunately, the review situation is not quite that bad; most reviewers simply do not provide these tags for the patches they look at.
The numbers for bug reporting and patch testing look like this:
Most credited 2.6.28 testers
Reported-by credits
Adrian Bunk  5  2.6%
Randy Dunlap  4  2.1%
Arjan van de Ven  3  1.5%
Ingo Molnar  3  1.5%
Stephen Rothwell  3  1.5%
Robert P. J. Day  3  1.5%
Stephane Eranian  3  1.5%
Daniel Marjamäki  3  1.5%
Rafael J. Wysocki  2  1.0%
Yinghai Lu  2  1.0%
Venki Pallipadi  2  1.0%
Eric Dumazet  2  1.0%
Carlos R. Mafra  2  1.0%
Wu Fengguang  2  1.0%
Zoltan Borbely  2  1.0%
Andy Wettstein  2  1.0%
Steven Noonan  2  1.0%
Alexander Beregalov  2  1.0%
Andrew Morton  2  1.0%
Alexey Dobriyan  2  1.0%
Heiko Carstens  2  1.0%
Jiri Slaby  2  1.0%
Sergei Shtylyov  2  1.0%
Johannes Weiner  2  1.0%
Mike Galbraith  2  1.0%
Hideo Saito  2  1.0%
Zvonimir Rakamaric  2  1.0%
Rik Theys  2  1.0%
Andreas Steffen  2  1.0%
Vegard Nossum  2  1.0%
Tested-by credits
Ingo Molnar  5  2.9%
Dirk Teurlings  5  2.9%
Peter van Valderen  5  2.9%
Nicolas Pitre  4  2.3%
Matt Helsley  4  2.3%
Christian Borntraeger  3  1.7%
Rafael J. Wysocki  3  1.7%
Riku Voipio  3  1.7%
Byron Bradley  3  1.7%
Tim Ellis  3  1.7%
Kamalesh Babulal  3  1.7%
Alan Jenkins  3  1.7%
Robert Jarzmik  3  1.7%
Martyn Welch  3  1.7%
Takashi Iwai  2  1.2%
Badari Pulavarty  2  1.2%
Jeff Moyer  2  1.2%
Eric Dumazet  2  1.2%
Jesper Dangaard Brouer  2  1.2%
Ramon Casellas  2  1.2%
Markus Trippelsdorf  2  1.2%
Sitsofe Wheeler  2  1.2%
Andrey Borzenkov  2  1.2%
In each case, everybody with at least two credits was listed. The good news is that, while there are certainly some familiar names on that list, we are also seeing appearances by people who are not known as kernel developers. There really is a testing community out there which includes more than just developers. Your editor suspects that we still are not doing a very good job of crediting them for their work, but this convention is relatively new and we can still hope for progress in this direction. To that end, the developers who are crediting reporters and testers are:
Developers giving credits in 2.6.28
Reported-by credits
Jiri Kosina  9  4.6%
Ingo Molnar  8  4.1%
Adrian Bunk  7  3.6%
Bartlomiej Zolnierkiewicz  6  3.1%
Linus Torvalds  6  3.1%
Peter Zijlstra  6  3.1%
Markus Metzger  6  3.1%
Randy Dunlap  5  2.6%
Andrew Morton  5  2.6%
Yinghai Lu  4  2.1%
Venki Pallipadi  4  2.1%
Jiri Slaby  4  2.1%
Suresh Siddha  4  2.1%
Roland Dreier  4  2.1%
Patrick McHardy  4  2.1%
Mark Brown  4  2.1%
Takashi Iwai  3  1.5%
Steven Rostedt  3  1.5%
Stefan Richter  3  1.5%
Paul Mundt  3  1.5%
Thomas Gleixner  3  1.5%
Dmitry Torokhov  3  1.5%
Tested-by credits
Lennert Buytenhek  22  12.8%
Takashi Iwai  6  3.5%
Rafael J. Wysocki  5  2.9%
Linus Torvalds  5  2.9%
Alan Stern  5  2.9%
Alexey Starikovskiy  5  2.9%
Henrik Rydberg  5  2.9%
Matt Helsley  4  2.3%
KAMEZAWA Hiroyuki  4  2.3%
Russell King  4  2.3%
Patrick McHardy  4  2.3%
Paul Mundt  3  1.7%
Jens Axboe  3  1.7%
Theodore Tso  3  1.7%
Bartlomiej Zolnierkiewicz  3  1.7%
Jean Delvare  3  1.7%
Thomas Gleixner  3  1.7%
David Brownell  3  1.7%
FUJITA Tomonori  3  1.7%
A quick grep shows that the number of Reported-by and Tested-by tags in patches was almost exactly the same over the 2.6.27 and 2.6.28 development cycles. Given the smaller number of patches in 2.6.28, this indicates that a slightly higher percentage of patches is now carrying those tags. Emphasis on "slightly" is in order, though; we are, for the most part, still not crediting a great many people who have helped to get 2.6.28 into shape.
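For readers who want to repeat that sort of count, here is a rough sketch of how tag credits might be tallied from commit-message text. It is not the tooling used to produce the tables above; the function name count_credits and the sample log (including its names and addresses) are invented for illustration.

```python
import re
from collections import Counter

# Matches lines such as "Tested-by: Some Name <address>"; the address is optional.
TAG_RE = re.compile(
    r"^[ \t]*(Reported-by|Tested-by|Reviewed-by):[ \t]*(.+?)[ \t]*(?:<[^>\n]*>)?[ \t]*$",
    re.MULTILINE,
)


def count_credits(log_text):
    """Return {tag: Counter({name: credits})} for the given commit messages."""
    counts = {}
    for tag, name in TAG_RE.findall(log_text):
        counts.setdefault(tag, Counter())[name] += 1
    return counts


# Invented sample input; real input would be commit messages from git.
SAMPLE_LOG = """\
Fix a crash on resume

Reported-by: B. Reporter <reporter@example.org>
Tested-by: A. Tester <tester@example.org>

Another fix

    Tested-by: A. Tester
    Signed-off-by: C. Developer <dev@example.org>
"""

credits = count_credits(SAMPLE_LOG)
```

Feeding it the output of something like "git log v2.6.27..v2.6.28" should give per-person totals for each tag, though the exact numbers depend on how variations in the spelling of names are handled.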
Unifying filesystems with union mounts
Unification of filesystems is the concept of mounting several filesystems on a single mount point, with the resulting mount showing the logical combination of all the filesystems. Traditionally, when a filesystem is mounted on a directory, the existing contents of the directory are masked, and the content of the latest mounted filesystem is shown. These masked files are available only after the mounted filesystem is unmounted. Even though these files exist, they are inaccessible to the user. Union mount overcomes this by providing access to all directories and files present in the directory, even after a mount.
In the kernel, the filesystems are stacked in order of their mount sequence: the first mounted filesystem is at the bottom of the mount stack, and the latest mount is at the top of the stack. Normally, only the files and directories at the top of the mount stack are visible. With union mounts, directory entries from the lower filesystems are merged with the directory entries of the upper filesystem, thus making a logical combination of all mounted filesystems. Files with the same name in a lower filesystem are masked, as the upper one takes precedence.
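The merging rule can be modeled in a few lines. This is a toy sketch, not kernel code; the function union_listing and the contents assumed for each layer are made up for the example.

```python
def union_listing(stack):
    """Merge the directory entries of a mount stack.

    `stack` is ordered bottom (first mounted) to top (latest mounted);
    each element maps an entry name to a description of that entry.
    """
    merged = {}
    for layer in stack:       # walk from the bottom of the stack up ...
        merged.update(layer)  # ... so upper entries replace lower ones
    return merged


# Assumed contents, for illustration only.
lower = {"dir1": "lower dir1", "file1": "lower file1", "link1": "lower link1"}
upper = {"dir1": "upper dir1", "dir4": "upper dir4"}

merged = union_listing([lower, upper])
print(sorted(merged))
```

Both layers contain an entry called dir1; only the upper one survives the merge, while file1 and link1 show through from below.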
Union mounts could be used to update packages of a distribution on a DVD. A writable filesystem could be mounted over the read-only filesystem on the DVD. All new and updated package files would be written to the writable, topmost filesystem, while hiding the duplicate files of the read-only media, or even deleting files (this is done through white-outs, discussed later). This allows the user to change any of the files on the system, with the new file stored transparently in the image. Such a setup could be used to roll up an updated DVD, or maintain a package repository with the latest packages for network installs.
Compared to other implementations, such as unionfs, union mounts try to do all directory entry unification handling in the VFS layer, instead of creating a new filesystem type. Some of the advantages of this approach are:
- Simple and Lightweight Design: Since all merges happen inside VFS, there is no need for an additional filesystem layer to maintain and merge metadata.
- No need to re-iterate the mount stack by the user while mounting: the user is not required to list the directories participating in the union as a part of the mount command. Only the mount point is enough.
- Bind mount works without any problems: this is a VFS feature to remount part of the filesystem hierarchy at additional mount points.
Union mount, developed by Jan Blunck, Bharata B Rao, and Miklos Szeredi, is the first step in unifying mounts in the VFS. The patch implementation is similar to that of the Plan 9/Inferno operating system. Currently, it only does namespace unification at the root directory level and not in the subdirectories.
To mount directories through union mount, the mount command must be modified to recognize and set the union mount options. The util-linux patches that update the mount command can be found at ftp://ftp.suse.com/pub/people/jblunck/union-mount/
As an example, consider the following directory structure of two filesystems:
Issuing the following commands will perform a union mount:
# mount /dev/sdb /mnt
# ls /mnt
dir1 file1 link1
# mount --union /dev/sdc /mnt
# ls /mnt
dir1 dir4 file1 link1
After the union, the directory structure looks like:
Unmounting the /mnt directory unwinds the filesystem mount stack:
# umount /mnt
# ls /mnt
dir1 file1 link1
The filesystems are stacked in the mount order in the kernel. The MNT_UNION flag in vfsmnt is set while mounting union mounts. This helps to identify that the directory entries of the stacked filesystems are supposed to be merged. While performing the lookup sequence, if the MNT_UNION flag is set, all root directory entries of all filesystems are scanned. Scanning happens from top of the filesystem stack to bottom, and the first matching entry is returned. This way any duplicate entries in underlying filesystems are automatically ignored.
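The top-to-bottom scan might be modeled as follows. This is a simplified sketch, not the VFS code: the Mount class, the numeric value given to MNT_UNION, and the rule that a mount without the flag hides everything beneath it are all assumptions made for the example.

```python
MNT_UNION = 0x1  # hypothetical value; only the flag's name comes from the patches


class Mount:
    def __init__(self, name, root_entries, flags=0):
        self.name = name
        self.root_entries = set(root_entries)  # names in the root directory
        self.flags = flags


def lookup(stack, entry):
    """Return the name of the mount that provides `entry`, or None.

    `stack` is ordered bottom (first mounted) to top (latest mounted).
    """
    for mnt in reversed(stack):       # scan from the top of the stack down
        if entry in mnt.root_entries:
            return mnt.name           # first match wins; lower duplicates are ignored
        if not (mnt.flags & MNT_UNION):
            return None               # a plain mount masks whatever lies below
    return None


sdb = Mount("sdb", ["dir1", "file1", "link1"])
sdc = Mount("sdc", ["dir1", "dir4"], flags=MNT_UNION)
plain = Mount("sdc", ["dir1", "dir4"])
```

With the flag set on the upper mount, lookup([sdb, sdc], "file1") falls through to the lower filesystem; without it, the same name is not found.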
Similarly, for the readdir() call, the directory entries are read from the topmost union mount directory to the lowest, and collected in the cache. The cache is responsible for collecting and keeping the directory entries across the stacked filesystem, with different callbacks for each filesystem. Like regular files, directories are seekable and the position of the following read is marked by the file position filp->f_pos. When reading from directories across filesystems, it is possible that the file position exceeds the inode size of the directory where it is merged. In such a situation, the file position is rearranged to select the correct directory in the union stack. This is done by subtracting the inode size if the file position exceeds it and selecting the next member of the union.
This works for filesystems such as ext2 that use flat file directories. The directory entry offsets are arranged linearly and are always smaller than the inode size of the directory. However, some filesystems return special cookies as directory entry offsets which are unrelated to the position in the directory or the inode size. Updating file->f_pos to accommodate more directories does not work for such filesystems.
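The position arithmetic for the linear-offset case can be sketched as below. The function name locate, the example sizes, and the treatment of a position exactly equal to the directory size are assumptions; the real patches may handle the boundary differently.

```python
def locate(f_pos, dir_sizes):
    """Map a union-wide file position to (member index, offset in that member).

    `dir_sizes` lists the directory inode size of each member of the
    union, topmost first. Returns None when the position lies past the
    end of the last directory.
    """
    for index, size in enumerate(dir_sizes):
        if f_pos < size:
            return index, f_pos
        f_pos -= size  # beyond this directory: subtract its size, try the next member
    return None
```

For a union whose top directory is 4096 bytes and whose lower directory is 1024 bytes, position 4106 lands at offset 10 in the lower directory. This only makes sense when offsets grow linearly within the directory; with cookie-style offsets the subtraction is meaningless.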
There can be multiple calls to readdir()/getdents() routines for reading the entries of a single directory. Currently, the union directory cache is not maintained across these calls. Instead, for every call the previously read entries are re-read into the cache and newly read entries are compared against these for duplicates before being returned to user space. The developers are working on making this efficient by maintaining the cache across readdir()/getdents() calls.
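The cost of the current approach shows up clearly in a toy model. In this sketch the set of names already seen is rebuilt from the start on every call; the function union_getdents and its start/count parameters are invented for the example and do not correspond to the real system call interface.

```python
def union_getdents(layers, start, count):
    """Return up to `count` unique names after skipping the first `start`.

    `layers` lists the directory entries of each filesystem, topmost
    first. Nothing is remembered between calls: earlier entries are
    re-read each time so that duplicates can be detected.
    """
    if count <= 0:
        return []
    seen = set()
    result = []
    for entries in layers:
        for name in entries:
            if name in seen:
                continue              # duplicate from a lower filesystem
            seen.add(name)
            if len(seen) > start:     # past the entries returned earlier
                result.append(name)
                if len(result) == count:
                    return result
    return result


layers = [["dir1", "dir4"], ["dir1", "file1", "link1"]]
```

Each successive call repeats all the work of the previous ones before producing anything new, which is why keeping the cache across calls is attractive.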
Future Plans: Writable Unions
Currently, the namespace unification is limited to the root filesystem directory entries. Future plans, known as writable unions, would come close to the implementations of unionfs namespace unification. Directory entry merging would not be limited to the root filesystem, but would be done for subdirectories as well. Though these patches have been developed, they still require some time and clean up for the mainline.
Using the example above, a writable union mount of the two filesystems would contain:
Note that the dir1 directory now contains both file_b1 and file_c1.
All writes are directed to the topmost mounted filesystem if it is mounted read-write. Mounting a new filesystem upon the current union mount makes all filesystems lower in the stack read-only, though the unified namespace would appear read-write to the user. Any modification to files in the lower filesystems is handled through copy-on-write. If a file belonging to the lower layers of the stack is opened, the entire file is copied to the topmost filesystem on the stack. This is also known as copy-up, where the file is copied to the topmost layer if it has to record a change. While performing a copy-up, the directory path of the file is also recreated on the topmost filesystem, so that the next time it is mounted as a union, it appears in the same location. The older file gets masked during the directory merge the next time the filesystems are union-mounted in the same order.
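Copy-up can be imitated in user space with two ordinary directories standing in for the layers. This is a sketch only, assuming that the copy is triggered when a file is opened for writing; the function open_for_write and the placement of file_b1 in the lower layer are choices made for the example.

```python
import os
import shutil
import tempfile


def open_for_write(upper, lower, relpath):
    """Return a writable path for `relpath`, copying the file up if needed.

    `upper` stands in for the writable topmost layer and `lower` for a
    read-only layer beneath it.
    """
    upper_path = os.path.join(upper, relpath)
    if not os.path.exists(upper_path):
        lower_path = os.path.join(lower, relpath)
        # Recreate the file's directory path on the topmost layer.
        os.makedirs(os.path.dirname(upper_path), exist_ok=True)
        # Copy the entire file, not just the part about to change.
        shutil.copy2(lower_path, upper_path)
    return upper_path


def demo():
    with tempfile.TemporaryDirectory() as lower, tempfile.TemporaryDirectory() as upper:
        os.makedirs(os.path.join(lower, "dir1"))
        with open(os.path.join(lower, "dir1", "file_b1"), "w") as f:
            f.write("old")
        path = open_for_write(upper, lower, os.path.join("dir1", "file_b1"))
        with open(path, "w") as f:
            f.write("new")
        with open(os.path.join(lower, "dir1", "file_b1")) as f:
            lower_text = f.read()
        with open(path) as f:
            upper_text = f.read()
        return lower_text, upper_text, path.startswith(upper)
```

The lower copy is left untouched; it is simply masked by the upper one whenever the two layers are merged.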
Rename on union mounts is handled through -EXDEV. -EXDEV is returned in a rename() operation if the source and destination file paths are on different mounted filesystems. In such a case, the application, such as mv, resorts to a copy operation and then unlinks the file from the filesystem it was moved from. On union mounts, since any writes are performed in the topmost layer, a move operation to directories in the lower layers returns -EXDEV, which means the application must copy the file to the new directory. If both the source and destination of the rename() operation are in the topmost layer, the traditional rename method is used.
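The application side of that fallback looks roughly like this. It is a sketch of what a tool such as mv might do, not its actual source; the rename parameter is injectable only so that the cross-filesystem case can be simulated without a second filesystem.

```python
import errno
import os
import shutil
import tempfile


def move(src, dst, rename=os.rename):
    """Move a file, falling back to copy-and-unlink on EXDEV.

    Returns "renamed" or "copied" to show which path was taken.
    """
    try:
        rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
    # Source and destination are on different filesystems:
    # copy the data, then unlink the original.
    shutil.copy2(src, dst)
    os.unlink(src)
    return "copied"


def demo():
    def cross_device_rename(src, dst):
        raise OSError(errno.EXDEV, "Invalid cross-device link")

    with tempfile.TemporaryDirectory() as d:
        a, b, c = (os.path.join(d, n) for n in ("a", "b", "c"))
        with open(a, "w") as f:
            f.write("data")
        first = move(a, b)                        # ordinary rename
        second = move(b, c, cross_device_rename)  # simulated -EXDEV
        return first, second, os.path.exists(b), os.path.exists(c)
```

Since existing tools already cope with EXDEV in this way, union mounts can lean on the fallback rather than implementing cross-layer renames in the kernel.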
Deletion of files is handled by a special file type called a white-out. The white-out file type is similar to a negative dentry: it describes a filename which isn't there. This is used to mark a file in the lower read-only filesystem as deleted, since only the topmost layer can be modified. However, white-outs would require support from all the filesystems, to store and recognize such a special file type. Currently, there is a special type, DT_WHT, defined in include/linux/fs.h, which defines a white-out but is not in use.
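A white-out can be thought of as a marker in the topmost layer that stops a lookup from descending any further. The model below is a toy built on dictionaries; the WHITEOUT marker and the function names are stand-ins, not the on-disk representation a filesystem would use.

```python
WHITEOUT = object()  # stand-in marker; a real filesystem would store a white-out entry


def union_lookup(layers, name):
    """Look `name` up from the topmost layer down; None means "not there"."""
    for entries in layers:               # topmost first
        if name in entries:
            value = entries[name]
            return None if value is WHITEOUT else value
    return None


def union_unlink(layers, name):
    """Delete `name` from the union, modifying only the topmost layer."""
    top = layers[0]
    top.pop(name, None)
    if any(name in lower for lower in layers[1:]):
        top[name] = WHITEOUT             # mask the read-only copy below


def union_listing(layers):
    """List the visible names, hiding anything covered by a white-out."""
    seen, names = set(), []
    for entries in layers:
        for name, value in entries.items():
            if name in seen:
                continue
            seen.add(name)
            if value is not WHITEOUT:
                names.append(name)
    return sorted(names)


top = {}
bottom = {"file1": "data", "link1": "target"}
layers = [top, bottom]
union_unlink(layers, "file1")
```

After the unlink, file1 still exists in the lower layer but is no longer visible through the union.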
Directory namespace unification is a tough task. FreeBSD implementations gave up after calling it "messy code"; unionfs entered the -mm tree for a brief period, but it did not make it to mainline. Since the unification is pathname-based, it is best handled in the VFS instead of using a separate stacked filesystem. The union mount offers a cleaner and more lightweight approach for merging directories; however, getting it to adhere to POSIX-compliant directory calls such as telldir() or seekdir() is still a challenge and is currently being worked on.
The git repository to track union mounts is located at:
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git
under the union-dir branch. The union mounts developers intend to release the patches in a phased manner, starting with the current patch of root directory level merging. Further developments would see patches related to merging at the subdirectory level as well.
Patches and updates
Kernel trees
Architecture-specific
Build system
Core kernel code
Development tools
Device drivers
Filesystems and block I/O
Memory management
Virtualization and containers
Benchmarks and bugs
Page editor: Jake Edge