
Kernel development

Brief items

Kernel release status

The current 2.6 development kernel is 2.6.31-rc1, released by Linus on June 24. "There's a lot in there, but let me say that as far as the whole merge window has gone, I've seldom felt happier merging stuff from people. I'm really hoping it isn't just a fluke, but people kept their git trees clean, and while there were a couple of times that I said "no, I'm not going to merge that", on the whole this was a really painless merge window for me." Linus did say that he is likely to still merge the S+Core architecture (score) as it only touches the MAINTAINERS file outside of its tree. The significant changes for 2.6.31 include performance counters, character devices in user space (CUSE), kernel modesetting for Radeon hardware, kmemleak and kmemcheck, fsnotify (which provides a common implementation for dnotify and inotify), along with a vast quantity of drivers of various sorts. See the long-format changelog for all the details.

There have been no stable updates in the last week, nor are there any stable patches out for review.


Kernel development news

Quotes of the week

Poulsbo is another example of this. Intel wanted a low-power mobile graphics chipset and chose to buy in a 3D core from an external vendor. IP issues prevent them from releasing any significant information about that 3D core, so the driver remains closed source. The implication is pretty clear - whichever section of Intel was responsible for the design of Poulsbo presumably had "Linux support" as a necessary feature, but didn't think "Open driver" was a required part of that.
-- Matthew Garrett (the entire post is worth reading)

We are removing more crap than we are adding, looks like progress to me! :)
-- Greg Kroah-Hartman gives an update on the -staging tree

When it [comes] to code coverage, x86 matters _so_ much more than any other architecture, that verification features like lockdep etc are way more important on x86 than on anything else.

Sure, there may be locking issues in some arch-specific code, and other architectures could be better off caring. But the advantage of lockdep for some pissant architecture that has a very limited user base (maybe lots of chips, but much more limited _use_ - fewer drivers, fewer workloads etc) is much lower, since those architectures know that x86 will give them 99% of the coverage.

-- Linus Torvalds


2.6.31 merge window, week 2

By Jake Edge
June 24, 2009

There have been around 2050 non-merge changesets merged into the mainline since last week's article, bringing the total to 8288 changes merged for 2.6.31. The merge window has closed, so all major features of 2.6.31 (with the possible exception of the S+Core architecture) will have been merged. Interesting changes since last week that are user visible include:

  • The MIPS architecture has added support for hugetlbfs, as well as hibernation support (but only for uni-processor systems).

  • The SLUB allocator has added new diagnostic printk()s for debugging out-of-memory (OOM) conditions.

  • A /proc/softirqs file has been added to show the number of software interrupts for each CPU. Also, a "softirq" line has been added to /proc/stat.

  • The gcov profiling infrastructure, for code coverage testing, has been merged. It adds functions needed by the profiling code, kbuild support for building kernels with gcov profiling, along with a debugfs interface to retrieve the profiling data.

  • An API for pulse-per-second (PPS) devices has been added. These are devices which provide a high-precision signal that can be used to adjust system clock time.

  • The EXT4_IOC_MOVE_EXT ioctl() has been added to support ext4 online defragmentation.

  • A sysfs interface to add I2C devices, which takes the place of various force_* module parameters, has been merged.

  • A command stream checker for Radeon r3xx-r5xx hardware has been added to stop user-space processes from accessing memory outside of what they own.

  • The perf tool has added multiple features, including raw data output as well as call graph profiling.

  • The PowerPC architecture has added support for software performance counters.

  • PCI end-to-end CRC checking (ECRC) can now be enabled or disabled with the ecrc boot parameter.

  • A PCI Express Advanced Error Reporting (AER) software error injector has been merged.

  • NFS version 4.1 client support has been added as an experimental feature. Server support for 4.1 is, as yet, not merged.

  • Firewire (IEEE 1394) now has support for IPv4 networking.

  • New device drivers

    • Architectures/processors/systems: Keymile KMETER1 PPC boards, X-ES Freescale MPC85xx-based single-board computers, Palm Treo 680 smartphones, Openmoko GTA02 / Freerunner phones, MINI2440 ARM-based development boards.

    • Network: Xtensa S6105 GMAC ethernet devices.

    • Input devices: TI DaVinci DM355 EVM keypads and IR remotes, TWL4030 Power buttons, WM97xx Atmel accelerated touchscreens, LM8323 keypad chips, W90P910 touchscreens, EETI touchscreen panels, Synaptics I2C touchpads.

    • Miscellaneous: Toshiba TXx9 SoC DMA controllers, TX4939 hardware random number generators, ST-Ericsson AB3100 Mixed Signal circuits (core functionality needed for other AB3100 devices), PCAP ASIC for EZX phones (needed to support other devices), Epson RX-8025SA/NB real-time clocks, IBM CPC925 PPC Memory Controller, PrimeCell PL061 GPIO devices, TI DaVinci DM355 EVM Keypad and IR remote devices, VIA SD/MMC card readers, MSM7K onboard serial devices, NAND Flash devices for OMAP2 and OMAP3, Broadcom BCM47xx watchdog timers, PNX833x hardware watchdog timers, TWL4030 watchdog timers, ST-Ericsson COH 901 327 watchdog timers, Freescale STMP3XXX watchdog timers, FibreChannel ELS/CT pass-thru support, Synopsys DesignWare I2C adapters, Maxim MAX17040 Fuel Gauge batteries.

    • Staging: Cavium Networks Octeon ethernet ports, CPC CAN USB driver, USB Quatech ESU-100 8 Port Serial Driver (as serqt_usb2, replacing the obsolete serqt_usb staging driver), RDC_17F3101X IDE devices, Displaylink USB framebuffer devices, Realtek RTL8192 USB wifi devices.

Changes visible to kernel developers include:

  • Quite a bit of Big Kernel Lock (BKL) removal code has been merged in the fs/ tree. Now, all of the super_operations and address_space_operations are called without holding the BKL.

  • IRQF_SAMPLE_RANDOM, which governs whether a driver's interrupts are used as an entropy source, has been added to the feature-removal-schedule.

  • The memory debugging infrastructure for DRM has been removed. "It hasn't been used in ages, and having the user tell your how much memory is being freed at free time is a recipe for disaster even if it was ever used."

  • David Miller is now the IDE subsystem maintainer, taking over from Bartlomiej Zolnierkiewicz, in a friendly handoff. Miller plans to put IDE into maintenance-only mode.

  • The SCSI device information matching has added support for multiple blacklist tables.

  • The instrumentation of jbd2 and ext4 has been converted from kernel markers to tracepoints.

  • OCFS2 has added support for lockdep, by adding the proper lockdep annotations for all of the cluster locks except those that are acquired for a node, rather than a process.

  • Access control list (ACL) information is now cached in struct inode for some filesystems (jfs, ext2, ext3, ext4, jffs2, btrfs, reiserfs, nilfs2, xfs).

Since the merge window has closed, the next step is stabilization. Something approaching 3000 more changes will likely make their way into the mainline before the 2.6.31 release, which should happen in late August or early September.


Protected RAMFS

June 24, 2009

This article was contributed by Goldwyn Rodrigues

Many embedded systems have a block of non-volatile RAM (NVRAM) separate from normal system memory. A recent patch, posted by Marco Stornelli, implements a filesystem for these kinds of NVRAM devices, where the device could store frequently accessed data (such as the address book for a cellphone). Protected RAMFS (PRAMFS) protects the NVRAM-based filesystem from errant or stray writes to the protected portion of the RAM caused by kernel bugs. Because it is stored in NVRAM, the filesystem can survive a reboot, and hence can also be used to keep important crash information.

Basic Features

PRAMFS is robust in the face of errant writes to the protected area, which could arise from kernel bugs. The page table entries that map the backing-store RAM are marked read-only on initialization. A write operation to the filesystem temporarily marks the pages to be written as writable, carries out the write with locks held, then marks the page table entries read-only again. This confines writes to the filesystem to the window in which the locks are held. The write-protection feature can be disabled with the kernel config option CONFIG_PRAMFS_NOWP.

PRAMFS forces all files to use direct I/O: filp->f_flags is set to O_DIRECT when a file is opened. Opening all files with O_DIRECT bypasses the page cache, so data is written immediately to the backing store. Since that store is RAM, this costs little in speed, but it forces applications to do block-aligned I/O.

PRAMFS does not have recovery facilities, such as journaling, to survive a crash or power failure during a write operation. The filesystem maintains checksums for the superblock and inode to check the validity of the stored object. An inode with an incorrect checksum is marked as bad, which may lead to data loss in case of power failure during a write operation.

PRAMFS also supports execute in place (XIP), a technique that executes programs directly from the storage device instead of copying them into RAM first. XIP makes sense for a RAM filesystem, since the system can execute from the storage device as fast as it can from system RAM, and no duplicate copy is made in RAM.


There is no mkfs utility to create a PRAMFS. The filesystem is automatically created when the filesystem is mounted with the init option. The command to create and mount a PRAMFS is:

    # mount -t pramfs -o physaddr=0x20000000,init=0x2F000,bs=1024 none /mnt/pram

This command creates a filesystem of 0x2F000 bytes, with a block size of 1024 bytes, and locates it at the physical address 0x20000000.

To retrieve an existing filesystem, mount the PRAMFS with the physaddr parameter that was used in the previous mount. The details of the filesystem such as blocksize and filesystem size are read from the superblock:

    # mount -t pramfs -o physaddr=0x20000000 none /mnt/pram

Other filesystem parameters are:

  • bpi: specifies the bytes-per-inode ratio. For every bpi bytes in the filesystem, an inode is created.

  • N: specifies the number of inodes to allocate in the inode table. If the option is not specified, the bytes-per-inode ratio is used to calculate the number of inodes.

If the init option is not specified, the bs, bpi, or N options are ignored by the mount, since this information is picked up from the existing filesystem. When creating the filesystem, if no option for the inode reservation is specified, by default 5% of the filesystem space is used for the inode table.
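The inode accounting described above is simple arithmetic; here is a sketch using the 128-byte inode size and 5% default reservation mentioned in this article (the function names are invented):

```c
#include <stddef.h>

#define PRAM_INODE_SIZE 128	/* each PRAMFS inode is 128 bytes */

/* With bpi given: one inode for every bpi bytes of filesystem. */
static size_t inodes_from_bpi(size_t fs_size, size_t bpi)
{
	return fs_size / bpi;
}

/* With neither N nor bpi given: reserve 5% of the space for the
 * inode table and see how many inodes fit there. */
static size_t inodes_by_default(size_t fs_size)
{
	return (fs_size / 20) / PRAM_INODE_SIZE;
}
```

For the 0x2F000-byte (192512-byte) filesystem created in the example above, the default reservation leaves room for 75 inodes.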

To test the memory protection of PRAMFS, the developers have written a kernel module that attempts to write within the PRAMFS memory with the intention of corrupting it. The attempt causes a kernel protection fault and, after a reboot, the filesystem can be re-mounted to confirm that the test module was unable to corrupt it.

Filesystem Layout

PRAMFS has a simple layout: the superblock occupies the first 128 bytes of the RAM block, followed by the inode table, the block usage map, and finally the data blocks. The superblock contains all of the important information, such as the filesystem size and block size, needed to remount the filesystem.

[PRAMFS layout]

The inode table consists of the inodes required for the filesystem; the number of inodes is computed when the filesystem is initialized. Each inode is 128 bytes long. Directory entry information, such as the filename and owning inode, is contained within the inode. This presents a problem for hard links, because a hard link requires two directory entries, under different directories, for the same inode; hence, PRAMFS does not support hard links. The inode format also limits filenames to 48 characters. The inode number is the absolute byte offset of that inode from the beginning of the filesystem.
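Because the inode number is just a byte offset, translating between inode numbers and inode addresses requires no lookup at all. A sketch under the layout described above (constant and function names are ours, not the patch's):

```c
#include <stdint.h>

#define PRAM_SB_SIZE	128	/* superblock: first 128 bytes */
#define PRAM_INODE_SIZE	128	/* each inode is 128 bytes */

/* The inode number is the inode's byte offset from the start of the
 * filesystem, so its address is just base + number. */
static uintptr_t pram_inode_addr(uintptr_t fs_base, uint32_t ino)
{
	return fs_base + ino;
}

/* Inode number of the i-th inode in the table, which immediately
 * follows the superblock. */
static uint32_t pram_nth_inode(uint32_t i)
{
	return PRAM_SB_SIZE + i * PRAM_INODE_SIZE;
}
```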

Regular PRAMFS file inodes contain the i_type.reg.row_block field, which points to a data block containing doubly-indirect pointers to the file's data blocks. This is similar to the double-indirect block field of the ext2 inode, but it means that even a file smaller than one block requires three blocks to store: two levels of pointer blocks plus the data block itself.

[PRAMFS inode]

Inodes within a directory are linked together in a doubly-linked list. The directory inode stores the first and last inodes in the directory listing. The previous pointer of the first inode and the next pointer of the last inode are null.

Write Protection

PRAMFS utilizes the system's paging unit by mapping its RAM initially as read-only. Writes to data objects first mark the corresponding page table entries as writable, perform the write and then mark them read-only again. This operation is done atomically by holding the page-table spin-lock with interrupts disabled. Following a write, stale entries in the system TLB are flushed. Write locks are held at the superblock, inode, or block level, depending on the granularity of modification.

Since PRAMFS attempts to avoid filesystem corruption caused by kernel bugs, shared mmap() regions can only be read; dirty pages in the page cache cannot be written back to the filesystem. For this reason, PRAMFS defines only the readpage() member of struct address_space_operations; the writepage() entry is declared as NULL.


This is the second attempt to get PRAMFS into the mainline. The previous attempt was made in 2004 by Steve Longerbeam of MontaVista.

The home page of PRAMFS claims that the filesystem is fully-featured but, in the linux-kernel discussion, Henrique de Moraes Holschuh strongly disagreed:

It is not full-featured if it doesn't have support for hardlinks, security labels, extended attributes, etc. Please call it a specialized filesystem instead, that seems to be much more in line with the comments about pramfs use cases in this thread...

There is not yet enough benchmark information comparing PRAMFS against other filesystems to form an opinion on its performance. Tests done while adding execute in place (XIP) revealed throughput as low as 13MB/s for per-character writes and 35MB/s for block writes using bonnie. Pavel Machek considers these numbers to be pretty low, especially for a RAM-based filesystem:

Even on real embedded hardware you should get better than 13MB/sec writing to _RAM_. I guess something is seriously wrong with pramfs.

No tests have been performed against existing solutions, such as a ramdisk on the same hardware, to compare apples with apples. The low performance is attributed to the excessive locking done for writes. Pavel believes the developers of PRAMFS are confused about the goals of the filesystem: whether they are designing for speed, completeness, or robustness.

PRAMFS is a niche filesystem, mostly for embedded devices with NVRAM, and hence lacks important features, such as hard links and shared mmap()s. However, for quite a number of situations an entire filesystem seems like overkill. Pavel suggests that a special NVRAM-based block device holding a traditional filesystem, or a filesystem modeled on those designed for solid-state devices (SSDs), would be a better option. Given the current number of objections, PRAMFS is unlikely to get into the mainline. However, Marco plans to further improve the code with more features, and to update the PRAMFS homepage to better reflect the filesystem's goals.


Linux kernel design patterns - part 3

June 22, 2009

This article was contributed by Neil Brown

In this final article we will be looking at just one design pattern. We started with the fine details of reference counting, zoomed out to look at whole data structures, and now move to the even larger perspective of designing subsystems. Like every pattern, this pattern needs a name, and our working title is "midlayer mistake". This makes it sound more like an anti-pattern, as it appears to describe something that should be avoided. While that is valid, it is also very strongly a pattern with firm prescriptive guides. When you start seeing a "midlayer" you know you are in the target area for this pattern, and it is time to see whether the pattern applies and wants to guide you in a different direction.

In the Linux world, the term "midlayer" seems (in your author's mind and also in Google's cache) most strongly related to SCSI. The "scsi midlayer" went through a bad patch quite some years ago, and there was plenty of debate on the relevant lists as to why it failed to do what was needed. It was watching those discussions that provided the germ from which this pattern slowly took form.

The term "midlayer" clearly implies a "top layer" and a "bottom layer". In this context, the "top" layer is a suite of code that applies to lots of related subsystems. This might be the POSIX system call layer which supports all system calls, the block layer which supports all block devices, or the VFS which supports all filesystems. The block layer would be the top layer in the "scsi midlayer" example. The "bottom" layer then is a particular implementation of some service. It might be a specific system call, or the driver for a specific piece of hardware or a specific filesystem. Drivers for different SCSI controllers fill the bottom layer beneath the SCSI midlayer. Brief reflection on the list of examples shows that the position a piece of code takes is largely a matter of perspective. To the VFS, a given filesystem is part of the bottom layer. To a block device, the same filesystem is part of the top layer.

A midlayer sits between the top and bottom layers. It receives requests from the top layer, performs some processing common to the implementations in the bottom layer, and then passes the preprocessed requests - presumably now much simpler and domain-specific - down to the relevant driver. This provides uniformity of implementation, code sharing, and greatly simplifies the task of implementing a bottom-layer driver.

The core thesis of the "midlayer mistake" is that midlayers are bad and should not exist. That common functionality which it is so tempting to put in a midlayer should instead be provided as library routines which can be used, augmented, or ignored by each bottom-level driver independently. Thus every subsystem that supports multiple implementations (or drivers) should provide a very thin top layer which calls directly into the bottom-layer drivers, and a rich library of support code that eases the implementation of those drivers. This library is available to, but not forced upon, those drivers.
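The shape this prescribes can be caricatured in a few lines of C (all names invented): a thin top layer that only dispatches through per-driver methods, plus a library routine that one driver uses as-is and another ignores entirely:

```c
struct request { int sector; int done; };

struct driver {
	void (*submit)(struct driver *, struct request *);
};

/* Library helper: common processing, available but never imposed. */
static void lib_default_submit(struct driver *d, struct request *rq)
{
	(void)d;
	rq->done = 1;
}

/* One driver simply uses the library... */
static void plain_submit(struct driver *d, struct request *rq)
{
	lib_default_submit(d, rq);
}

/* ...another bypasses it, much as md/dm/loop bypass the elevator. */
static void stacked_submit(struct driver *d, struct request *rq)
{
	(void)d;
	rq->done = 2;	/* a real stacked driver would remap and resubmit */
}

/* The thin top layer: no processing of its own, just dispatch. */
static void top_submit(struct driver *d, struct request *rq)
{
	d->submit(d, rq);
}

/* Exercise both paths; returns 12 if each driver did its own thing. */
static int dispatch_demo(void)
{
	struct driver plain = { plain_submit }, stacked = { stacked_submit };
	struct request r1 = { 0, 0 }, r2 = { 0, 0 };

	top_submit(&plain, &r1);
	top_submit(&stacked, &r2);
	return r1.done * 10 + r2.done;
}
```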

To try to illuminate this pattern, we will explore three different subsystems and see how the pattern specifically applies to them - the block layer, the VFS, and the 'md' raid layer (i.e. the areas your author is most familiar with).

Block Layer

The bulk of the work done by the block layer is to take 'read' and 'write' requests for block devices and send them off to the appropriate bottom level device driver. Sounds simple enough. The interesting point is that block devices tend to involve rotating media, and rotating media benefits from having consecutive requests being close together in address space. This helps reduce seek time. Even non-rotating media can benefit from having requests to adjacent addresses be adjacent in time so they can be combined into a smaller number of large requests. So, many block devices can benefit from having all requests pass through an elevator algorithm to sort them by address and so make better use of the device.
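A toy version of the idea keeps the queue sorted by sector so that adjacent requests are dispatched together; the kernel's real IO schedulers are far more sophisticated, and these names are invented:

```c
#include <stddef.h>

struct rq {
	long sector;
	struct rq *next;
};

/* Insert a request into an ascending, singly-linked queue so that
 * dispatch order follows the device's address space. */
static void elevator_add(struct rq **q, struct rq *rq)
{
	while (*q && (*q)->sector < rq->sector)
		q = &(*q)->next;
	rq->next = *q;
	*q = rq;
}

/* Queue requests for sectors 70, 10, 40 and encode the resulting
 * dispatch order as a decimal number. */
static long elevator_demo(void)
{
	struct rq a = { 70, NULL }, b = { 10, NULL }, c = { 40, NULL };
	struct rq *q = NULL;
	long order = 0;

	elevator_add(&q, &a);
	elevator_add(&q, &b);
	elevator_add(&q, &c);
	for (struct rq *r = q; r; r = r->next)
		order = order * 100 + r->sector;
	return order;	/* 104070: sectors 10, 40, 70 in order */
}
```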

It is very tempting to implement this elevator algorithm in a 'midlayer'. i.e. a layer just under the top layer. This is exactly what Linux did back in the days of 2.2 kernels and earlier. Requests came in to ll_rw_block() (the top layer) which performed basic sanity checks and initialized some internal-use fields of the structure, and then passed the request to make_request() - the heart of the elevator. Not quite every request went to make_request() though. A special exception was made for "md" devices. Those requests were passed to md_make_request() which did something completely different as is appropriate for a RAID device.

Here we see the first reason to dislike midlayers - they encourage special cases. When writing a midlayer it is impossible to foresee every possible need that a bottom level driver might have, so it is impossible to allow for them all in the midlayer. The midlayer could conceivably be redesigned every time a new requirement came along, but that is unlikely to be an effective use of time. Instead, special cases tend to grow.

Today's block layer is, in many ways, similar to the way it was back then with an elevator being very central. Of course lots of detail has changed and there is a lot more sophistication in the scheduling of IO requests. But there is still a strong family resemblance. One important difference (for our purposes) is the existence of the function blk_queue_make_request() which every block device driver must call, either directly or indirectly via a call to blk_init_queue(). This registers a function, similar to make_request() or md_make_request() from 2.2, which should be called to handle each IO request.

This one little addition effectively turns the elevator from a midlayer which is imposed on every device into a library function which is available for devices to call upon. This was a significant step in the right direction. It is now easy for drivers to choose not to use the elevator. All virtual drivers (md, dm, loop, drbd, etc.) do this, and even some drivers for physical hardware (e.g. umem) provide their own make_request_fn().

While the elevator has made a firm break from being a mid-layer, it still retains the appearance of a midlayer in a number of ways. One example is the struct request_queue structure (defined in <linux/blkdev.h>). This structure is really part of the block layer. It contains fields that are fundamental parts of the block interface, such as the make_request_fn() function pointer that we have already mentioned. However many other fields are specific to the elevator code, such as elevator (which chooses among several IO schedulers) and last_merge (which is used to speed lookups in the current queue). While the elevator can place fields in struct request_queue, all other code must make use of the queuedata pointer to store a secondary data structure.

This arrangement is another tell-tale for a midlayer. When a primary data structure contains a pointer to a subordinate data structure, we probably have a midlayer managing that primary data structure. A better arrangement is to use the "embedded anchor" pattern from the previous article in this series. The bottom level driver should allocate its own data structure which contains the data structure (or data structures) used by the libraries embedded within it. struct inode is a good example of this approach, though with slightly different detail. In 2.2, struct inode contained a union of the filesystem-specific data structure for each filesystem, plus a pointer (generic_ip) for another filesystem to use. In the 2.6 kernel, struct inode is normally embedded inside a filesystem-specific inode structure (though there is still an i_private pointer which seems unnecessary).
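The embedded-anchor arrangement rests on the usual container_of() trick: the generic structure lives inside the driver's own structure, and the driver recovers its structure from a pointer to the generic one. A standalone sketch (the myfs names are invented, and struct inode is reduced to one field):

```c
#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct inode { unsigned long i_ino; };	/* stand-in for the real thing */

/* Filesystem-specific inode embedding the generic one, instead of
 * hanging a private structure off a generic_ip/i_private pointer. */
struct myfs_inode {
	int		myfs_flags;
	struct inode	vfs_inode;
};

static struct myfs_inode *MYFS_I(struct inode *inode)
{
	return container_of(inode, struct myfs_inode, vfs_inode);
}

/* Given only the generic pointer, recover the private data. */
static int anchor_demo(void)
{
	struct myfs_inode mi = { .myfs_flags = 42, .vfs_inode = { 7 } };
	struct inode *generic = &mi.vfs_inode;

	return MYFS_I(generic)->myfs_flags;
}
```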

One last tell-tale sign of a midlayer, which we can still see hints of in the elevator, is the tendency to group unrelated code together. The library design will naturally provide separate functionality as separate functions and leave it to the bottom level driver to call whatever it needs. The midlayer will simply call everything that might be needed.

If we look at __make_request() (the 2.6 entry point for the elevator), we see an early call to blk_queue_bounce(). This provides support for hardware that cannot access the entire address space when using DMA to move data between system memory and the device. To support such cases, data sometimes needs to be copied into more accessible memory before being transferred to the device, or to be copied from that memory after being transferred from the device. This functionality is quite independent of the elevator, yet it is being imposed on all users of the elevator.

So we see, in the block layer and its relationship with the elevator, a subsystem which was once implemented as a midlayer but has taken a positive step away from that by making the elevator clearly optional. It still contains traces of its heritage, which have served as a useful introduction to the key identifiers of a midlayer: code being imposed on the lower layer, special cases in that code, data structures storing pointers to subordinate data structures, and unrelated code being called by the one support function.

With this picture in mind, let us move on.


The VFS

The VFS (or Virtual File System) is a rich area to explore to learn about midlayers and their alternatives. This is because there is a lot of variety in filesystems, a lot of useful services that they can make use of, and a lot of work has been done to make it all work together effectively and efficiently. The top layer of the VFS is largely contained in the vfs_ function calls which provide the entry points to the VFS. These are called by the various sys_ functions that implement system calls, by nfsd which does a lot of file system access without using system calls, and from a few other parts of the kernel that need to deal with files.

The vfs_ functions fairly quickly call directly in to the filesystem in question through one of a number of _operations structures which contain a list of function pointers. There are inode_operations, file_operations, super_operations etc, depending on what sort of object is being manipulated. This is exactly the model that the "midlayer mistake" pattern advocates. A thin top layer calls directly into the bottom layer which will, as we shall see, make heavy use of library functions to perform its task.

We will explore and contrast two different sets of services provided to filesystems, the page cache and the directory entry cache.

The page cache

Filesystems generally want to make use of read-ahead and write-behind. When possible, data should be read from storage before it is needed so that, when it is needed, it is already available, and once it has been read, it is good to keep it around in case, as is fairly common, it is needed again. Similarly, there are benefits from delaying writes a little, so that throughput to the device can be evened out and applications don't need to wait for writeout to complete. Both of these features are provided by the page cache, which is largely implemented by mm/filemap.c and mm/page-writeback.c.

In its simplest form a filesystem provides the page cache with an object called an address_space which has, in its address_space_operations, routines to read and write a single page. The page cache then provides operations that can be used as file_operations to present the abstraction of a file to the VFS top layer. Looking at the file_operations for a regular file in ext3, we see:

    const struct file_operations ext3_file_operations = {
	.llseek		= generic_file_llseek,
	.read		= do_sync_read,
	.write		= do_sync_write,
	.aio_read	= generic_file_aio_read,
	.aio_write	= ext3_file_write,
	.unlocked_ioctl	= ext3_ioctl,
    #ifdef CONFIG_COMPAT
	.compat_ioctl	= ext3_compat_ioctl,
    #endif
	.mmap		= generic_file_mmap,
	.open		= generic_file_open,
	.release	= ext3_release_file,
	.fsync		= ext3_sync_file,
	.splice_read	= generic_file_splice_read,
	.splice_write	= generic_file_splice_write,
    };
Eight of the thirteen operations are generic functions provided by the page cache. Of the remaining five, the two ioctl() operations and the release() operation require implementations specific to the filesystem; ext3_file_write() and ext3_sync_file() are moderately sized wrappers around generic functions provided by the page cache. This is the epitome of good subsystem design according to our pattern. The page cache is a well defined library which can be used largely as it stands (as when reading from an ext3 file), allows the filesystem to add functionality around various entry points (like ext3_file_write()), and can be simply ignored altogether when not relevant (as with sysfs or procfs).

Even here there is a small element of a midlayer imposing on the bottom layer as the generic struct inode contains a struct address_space which is only used by the page cache and is irrelevant to non-page-cache filesystems. This small deviation from the pattern could be justified by the simplicity it provides, as the vast majority of filesystems do use the page cache.

The directory entry cache (dcache)

Like the page cache, the dcache provides an important service for a filesystem. File names are often accessed multiple times, much more so than the contents of files. So caching them is vital, and having a well designed and efficient directory entry cache is a big part of having efficient access to all filesystem objects. The dcache has one very important difference from the page cache, though: it is not optional. It is imposed upon every filesystem and is effectively a "midlayer". Understanding why this is, and whether it is a good thing, is an important part of understanding the value and applicability of this design pattern.

One of the arguments in favor of an imposed dcache is that there are some interesting races related to directory renames; these races are easy to fail to handle properly. Rather than have every filesystem potentially getting these wrong, they can be solved once and for all in the dcache. The classic example is if /a/x is renamed to /b/c/x at the same time that /b/c is renamed to /a/x/c. If these both succeed, then 'c' and 'x' will contain each other and be disconnected from the rest of the directory tree, which is a situation we would not want.

Protecting against this sort of race is not possible if we only cache directory entries at a per-directory level. The common caching code needs to at least be able to see a whole filesystem to be able to detect such a possible loop-causing race. So maintaining a directory cache on a per-filesystem basis is clearly a good idea, and strongly encouraging local filesystems to use it is very sensible, but whether forcing it on all filesystems is a good choice is less clear.
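The check that whole-filesystem visibility makes possible can be sketched in a few lines (with our own simplified types): before moving a directory, walk from the new parent up to the root and refuse the rename if the directory being moved is found along the way.

```c
#include <stddef.h>

struct dent { struct dent *parent; };

/* Returns 1 if moving 'victim' under 'new_parent' would create a
 * loop, i.e. victim is new_parent or one of its ancestors. */
static int would_create_loop(struct dent *victim, struct dent *new_parent)
{
	for (struct dent *d = new_parent; d; d = d->parent)
		if (d == victim)
			return 1;
	return 0;
}

/* Replay the /a/x vs /b/c scenario from the text: once the first
 * rename has moved x under c, the second rename must be refused. */
static int rename_race_demo(void)
{
	struct dent root = { NULL };
	struct dent a = { &root }, b = { &root };
	struct dent c = { &b }, x = { &a };

	if (would_create_loop(&x, &c))	/* /a/x -> /b/c/x: fine */
		return -1;
	x.parent = &c;			/* first rename completes */
	return would_create_loop(&c, &x); /* /b/c -> /a/x/c: loop! */
}
```

Serializing the two renames (which the dcache's locking does) is what turns the concurrent race into two sequential checks, the second of which fails.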

Network filesystems do not benefit from the loop detection that the dcache can provide as all of that must be done on the server anyway. "Virtual" filesystems such as sysfs, procfs, ptyfs don't particularly need a cache at all as all the file names are in memory permanently. Whether a dcache hurts these filesystems is not easy to tell as we don't have a complete and optimized implementation that does not depend on the dcache to compare with.

Of the key identifiers for a midlayer that were discussed above, the one that most clearly points to a cost is the fact that midlayers tend to grow special case code. So it should be useful to examine the dcache to see if it has suffered from this.

The first special cases that we find in the dcache are among the flags stored in d_flags. Two of these flags are DCACHE_AUTOFS_PENDING and DCACHE_NFSFS_RENAMED. Each is specific to just one filesystem. The AUTOFS flag appears to only be used internally to autofs, so this isn't really a special case in the dcache. However the NFS flag is used to guide decisions made in common dcache code in a couple of places, so it clearly is a special case, though not necessarily a very costly one.

Another place to look for special case code is when a function pointer in an _operations structure is allowed to be NULL, and the NULL is interpreted as implying some specific action (rather than no action at all). This happens when a new operation is added to support some special-case, and NULL is left to mean the 'default' case. This is not always a bad thing, but it can be a warning signal.

In the dentry_operations structure there are several functions that can be NULL. d_revalidate() is an example which is quite harmless. It simply allows a filesystem to check whether the entry is still valid and either update it or invalidate it. Filesystems that don't need this can simply leave the pointer NULL, as making a function call just to do nothing would be pointless.

However, we also find d_hash() and d_compare(), which allow the filesystem to provide non-standard hash and compare functions to support, for example, case-insensitive file names. This does look a lot like a special case because the common code uses an explicit default if the pointer is NULL. A more uniform implementation would have every filesystem providing a non-NULL d_hash() and d_compare(), where many filesystems would choose the case-sensitive ones from a library.

It could easily be argued that doing this - forcing an extra function call for hash and compare on common filesystems - would be an undue performance cost, and this is true. But given that, why is it appropriate to impose such a performance cost on filesystems which follow a different standard?

A more library-like approach would have the VFS pass a path to the filesystem and allow it to do the lookup, either by calling in to a cache handler in a library, or by using library routines to pick out the name components and doing the lookups directly against its own stored file tree.

So the dcache is clearly a midlayer, and does have some warts as a result. Of all the midlayers in the kernel it probably best fits the observation above that they could "be redesigned every time a new requirement came along". The dcache does see constant improvement to meet the needs of new filesystems. Whether that is "an effective use of time" must be a debate for a different forum.

The MD/RAID layer

Our final example, as we consider midlayers and libraries, is the md driver, which supports various software-RAID implementations and related code. md is interesting because it has a mixture of midlayer-like features and library-like features and, as such, is a bit of a mess.

The "ideal" design for the md driver is (according to the "midlayer mistake" pattern) to provide a bunch of useful library routines which independent RAID-level modules would use. So, for example, RAID1 would be a standalone driver which might use some library support for maintaining spares, performing resync, and reading metadata. RAID0 would be a separate driver which uses the same code to read metadata, but which has no use for the spares management or resync code.

Unfortunately that is not how it works. One of the reasons for this relates to the way the block layer formerly managed major and minor device numbers. It is all much more flexible today, but in the past a different major number implied a unique device driver and a unique partitioning scheme for minor numbers. Major numbers were a limited resource, and having a separate major for RAID0, RAID1, and RAID5 etc would have been wasteful. So just one number was allocated (9) and one driver had to be responsible for all RAID levels. This necessity undoubtedly created the mindset that a midlayer to handle all RAID levels was the right thing to do, and it persisted.

Some small steps have been made towards more of a library focus, but they are small and inconclusive. One simple example is the md_check_recovery() function. This is a library function in the sense that a particular RAID level implementation needs to explicitly call it or it doesn't get used. However, it performs several unrelated tasks such as updating the metadata, flushing the write-intent-bitmap, removing devices which have failed, and (surprisingly) checking if recovery is needed. As such it is a little like part of a midlayer, in that it imposes the combination of a number of unrelated tasks.

Perhaps a better example is md_register_thread() and friends. Some md arrays need to have a kernel thread running to provide some support (such as scheduling read requests to different drives after a failure). md.c provides library routines md_register_thread() and md_unregister_thread(), which can be called by the personality as required. This is all good. However, md sometimes takes it upon itself to call md_unregister_thread() rather than leaving that up to the particular RAID-level driver. This is a clear violation of the library approach. While it is not causing any actual problems at the moment, it is exactly the sort of thing that could require the addition of special cases later.

It has often been said that md and dm should be unified in some way (though it is less often that the practical issues of what this actually means are considered). Both md and dm suffer from having a distinct midlayer that effectively keeps them separate. A full understanding of the fact that this midlayer is a mistake, and moving to replace it with an effective library structure is likely to be an important first step towards any sort of unification.

Wrap up

This ends our exploration of midlayers and libraries in the kernel -- except maybe to note that more recent additions include such things as libfs, which provides support for virtual filesystems, and libata, which provides support for SATA drives. These show that the tendency away from midlayers is not only on the wishlist of your author but also present in existing code.

Hopefully it has resulted in an understanding of the issues behind the "midlayer mistake" pattern and the benefits of following the library approach.

Here too ends our little series on design patterns in the Linux kernel. There are doubtlessly many more that could be usefully extracted, named, and illuminated with examples. But they will have to await another day.

Once compiled, such a collection would provide invaluable insight on how to build kernel code both effectively and uniformly. This would be useful in understanding how current code works (or why it doesn't), in making choices when pursuing new development, and in commenting on design during the review process, and would generally improve visibility at this design level of kernel construction. Hopefully this could lead, in the long term, to an increase in general quality.

For now, as a contribution to that process, here is a quick summary of the Patterns we have found.

  • kref: Reference counting when the object is destroyed with the last external reference

  • kcref: Reference counting when the object can persist after the last external reference is dropped

  • plain ref: Reference counting when object lifetime is subordinate to another object.

  • biased-reference: An anti-pattern involving adding a bias to a reference counter to store one bit of information.

  • Embedded Anchor: This is very useful for lists, and can be generalized as can be seen if you explore kobjects.

  • Broad Interfaces: This reminds us that trying to squeeze lots of use-cases into one function call is not necessary - just provide lots of function calls (with helpful and (hopefully) consistent names).

  • Tool Box: Sometimes it is best not to provide a complete solution for a generic service, but rather to provide a suite of tools that can be used to build custom solutions.

  • Caller Locks: When there is any doubt, choose to have the caller take locks rather than the callee. This puts more control in the hands of the client of a function.

  • Preallocate Outside Locks: This is in some ways fairly obvious. But it is very widely used within the kernel, so stating it explicitly is a good idea.

  • Midlayer Mistake: When services need to be provided to a number of low-level drivers, provide them with a library rather than imposing a midlayer on them.


Exercises

  1. Examine the "blkdev_ioctl()" interface to the block layer from the perspective of whether it is more like a midlayer or a library. Compare the versions in 2.6.27 with 2.6.28. Discuss.

  2. Choose one other subsystem such as networking, input, or sound, and examine it in the light of this pattern. Look for special cases, and imposed functionality. Examine the history of the subsystem to see if there are signs of it moving away from, or towards, a "midlayer" approach.

  3. Identify a design pattern which is specific to the Linux kernel but has not been covered in this series. Give it a name, and document it together with some examples and counter examples.


Page editor: Jake Edge

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds