
Kernel development

Brief items

Kernel release status

The current 2.6 prepatch is 2.6.24-rc4, released by Linus on December 3. He says that the size of the patch is "a bit disheartening," and, in fact, there are quite a few changes which have been merged. They are almost all fixes, but there is also the addition of a CPU accounting controller for monitoring the CPU usage of groups of processes. See the short-form changelog for the details, or the full changelog for lots of details.

As of this writing, just under 100 changesets have gone into the mainline repository since the -rc4 release.

The current -mm tree is 2.6.24-rc4-mm1. Recent changes to -mm include the latest timerfd API, a new memory controller patch, and a reimplemented ramdisk driver.


Kernel development news

Quotes of the week

A person will stand on the top of a hill for a very long time with their mouth open before a roast duck will fly in.
-- James Morris

For the purposes of figuring out what is needed you can consider a random simple user case such as a system which protects you against the works of Eric S Raymond. Replace the mathematical analysis and heuristics with a user space tool which spots the various ESR papers and design it for that if it makes you happier.

SELinux seems to be able to do most of the lifting around the problem as it can relabel a file into eric_t and constrain further access to it.

-- Alan Cox



By Jonathan Corbet
December 3, 2007
Sparse files have an apparent size which is larger than the amount of storage actually allocated to them. The usual way to create such a file is to seek past its end and write some new data; Unix-derived systems will traditionally not allocate disk blocks for the portion of the file past the previous end which was skipped over. The result is a "hole," a piece of the file which logically exists, but which is not represented on disk. A read operation on a hole succeeds, with the returned data being all zeroes. Relatively smart file archival and backup utilities will recognize holes in files; these holes are not stored in the resulting archive and will not be filled if the file is restored from that archive.
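
The seek-and-write recipe can be tried out with a few lines of C. What follows is a minimal sketch, not a reference implementation; the helper name make_sparse_file() is made up for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a file at `path` holding a single byte at
 * offset `hole_len`; nothing before that byte is ever written.
 * Returns 0 on success (with *st filled in) or -1 on error.
 */
int make_sparse_file(const char *path, off_t hole_len, struct stat *st)
{
	assert(path != NULL && st != NULL && hole_len >= 0);

	int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
	if (fd < 0)
		return -1;

	/* Seek past the (empty) end of the file, then write one byte. */
	if (lseek(fd, hole_len, SEEK_SET) < 0 || write(fd, "x", 1) != 1 ||
	    fstat(fd, st) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

After a successful call, st_size reports the apparent size (hole_len plus one byte), while st_blocks shows how much storage was really allocated. On a filesystem which supports holes, the second number should be far smaller than the first.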

The process of recognizing holes is relatively primitive, though: about the only way to do it in a portable way is to simply look for blocks filled with zeroes. This technique works, but it requires making a pass over the data to obtain information which the lower levels of the system already know. It seems like there should be a better way.
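
The portable technique amounts to something like the following sketch, which works on an in-memory image for simplicity; a real archiver would read the file block by block. The function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if all `len` bytes at `buf` are zero. */
bool block_is_zero(const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0)
			return false;
	return true;
}

/*
 * Count the blocks of `blksz` bytes which contain only zeroes and could
 * therefore be left out of an archive. A trailing partial block is
 * checked as well.
 */
size_t count_zero_blocks(const unsigned char *data, size_t len, size_t blksz)
{
	size_t zeros = 0;

	assert(blksz > 0);
	for (size_t off = 0; off < len; off += blksz) {
		size_t n = len - off < blksz ? len - off : blksz;

		if (block_is_zero(data + off, n))
			zeros++;
	}
	return zeros;
}
```

Every byte must be examined, which is exactly the cost that the interfaces discussed here are meant to avoid.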

About two years ago, the Solaris ZFS developers proposed an extension to lseek() which would allow an application to find the holes in sparse files more efficiently. This extension works by adding two new "whence" options:

  • SEEK_HOLE positions the file descriptor to the beginning of the first hole which occurs after the given offset. For the purposes of this operation, "hole" is defined as a region of all zeroes of any length, but the system is not required to actually detect all holes. So, in practice, small ranges of zeroes will be skipped over, as will, in all likelihood, large (multi-block) ranges which have actually been written to disk.

  • SEEK_DATA moves to the start of the next region (after the given offset) which is not a hole.
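
An application could use the two options together to walk the data regions of a file. The sketch below assumes a C library which exposes SEEK_DATA and SEEK_HOLE under _GNU_SOURCE, and it assumes that a SEEK_DATA request past the last data region fails with ENXIO, as it does on Solaris. The function name is hypothetical.

```c
#define _GNU_SOURCE	/* assumption: needed for SEEK_DATA and SEEK_HOLE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sum the sizes of the data regions of the file at `path` by alternating
 * SEEK_DATA and SEEK_HOLE. Returns the total, or -1 on error.
 */
off_t count_data_bytes(const char *path)
{
	off_t total = 0, pos = 0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	for (;;) {
		off_t data = lseek(fd, pos, SEEK_DATA);
		if (data < 0) {
			if (errno != ENXIO)	/* ENXIO: no data past pos */
				total = -1;
			break;
		}

		off_t hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0) {
			total = -1;
			break;
		}
		total += hole - data;	/* [data, hole) is a data region */
		pos = hole;
	}
	close(fd);
	return total;
}
```

Since the system is not required to detect every hole, the result is an upper bound on the amount of non-zero data; a filesystem without hole support may simply report the whole file as one data region.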

This functionality has been part of Solaris for a while; the Solaris developers would like to see it spread elsewhere and become something more than a Solaris-only extension. To that end, Josef Bacik has recently posted an implementation of this extension for Linux. Internally, it adds a new member to the file_operations structure (seek_hole_data()) intended to allow filesystems to efficiently implement the new operations.

One might argue that anybody who wants to separate holes and data in a file can already do so with the FIBMAP ioctl() command. While that is true, FIBMAP is an inefficient way of getting this sort of information, especially on filesystems which support extents. A FIBMAP call returns the mapping information for exactly one block; mapping out a large file may require millions of calls when, once again, the filesystem should already know how to provide that information in a much more straightforward manner.
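
To see why FIBMAP scales poorly, consider this sketch of a hole counter built on it. It makes one ioctl() call per block, so a 4GB file with 4096-byte blocks needs about a million calls. The helper names are made up; FIBMAP itself normally requires root privileges, and it reports an unmapped block as physical block zero.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FIBMAP */

/* Number of FIBMAP calls needed to map `size` bytes of `blksz`-byte blocks. */
unsigned long fibmap_calls_needed(unsigned long long size, unsigned long blksz)
{
	assert(blksz > 0);
	return (unsigned long)((size + blksz - 1) / blksz);
}

/*
 * Count the unallocated blocks among the first `nblocks` blocks of `fd`.
 * Returns -1 if any ioctl() call fails.
 */
long count_hole_blocks(int fd, unsigned long nblocks)
{
	long holes = 0;

	for (unsigned long i = 0; i < nblocks; i++) {
		int blk = (int)i;	/* in: logical block, out: physical block */

		if (ioctl(fd, FIBMAP, &blk) < 0)
			return -1;
		if (blk == 0)		/* physical block 0 means "not mapped" */
			holes++;
	}
	return holes;
}
```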

Even so, this patch looks relatively unlikely to make it into the mainline. The API is unpopular, being seen as ugly and as a change in the semantics of the lseek() call. But, more to the point, it may be interesting to learn much more about the representation of a file than just where the holes are. And, as it turns out, there is already a proposed ioctl() command which can provide all of that information. That interface is the FIEMAP ioctl() specified by Andreas Dilger back in October.

A FIEMAP call takes the following structure as an argument:

    struct fiemap {
	__u64	fm_start;	 /* logical starting byte offset (in/out) */
	__u64	fm_length;	 /* logical length of map (in/out) */
	__u32	fm_flags;	 /* FIEMAP_FLAG_* flags for request (in/out) */
	__u32	fm_extent_count; /* number of extents in fm_extents (in/out) */
	__u64	fm_end_offset;	 /* end of mapping in last ioctl */
	struct fiemap_extent	fm_extents[0];
    };

An application wanting to learn something about how a file is stored will put the starting offset into fm_start and the length of the region of interest in fm_length. If fm_flags contains FIEMAP_FLAG_NUM_EXTENTS, the system call will simply set fm_extent_count to the number of extents used to store the specified range of bytes and return. In this form, FIEMAP can be used to determine how fragmented the file is on disk.

If the application is looking for more information than that, it will allocate enough space for one or more fiemap_extent structures in the fm_extents array:

    struct fiemap_extent {
    	__u64 fe_offset;/* offset in bytes for the start of the extent */
    	__u64 fe_length;/* length in bytes for the extent */
    	__u32 fe_flags; /* returned FIEMAP_EXTENT_* flags for the extent */
    	__u32 fe_lun;   /* logical device number for extent(starting at 0)*/
    };

In this case, fm_extent_count should be set to the number of these structures before making the FIEMAP call. On return, these structures (as many as is indicated by the returned value of fm_extent_count) will be filled in with information on the actual file extents; fe_offset says where (on disk) the extent starts, and fe_length is the size of the extent. There are quite a few values which can appear in the fe_flags field:

  • FIEMAP_EXTENT_HOLE says that there is no data for this range of the file - it's a hole.

  • FIEMAP_EXTENT_UNWRITTEN says that the space has been allocated on disk, but that nothing has been written to that space. Space which has been preallocated with fallocate() would be marked this way.

  • FIEMAP_EXTENT_UNMAPPED, instead, marks an extent where some application has written data, but for which no disk blocks have been allocated.

  • FIEMAP_EXTENT_DELALLOC indicates that delayed allocation is being done; this flag implies FIEMAP_EXTENT_UNMAPPED as well.

  • FIEMAP_EXTENT_SECONDARY is an indication that the data for this segment is in some sort of secondary storage; one would see this flag on filesystems managed by some sort of hierarchical storage manager. This flag, too, is likely to imply FIEMAP_EXTENT_UNMAPPED.

  • FIEMAP_EXTENT_NO_DIRECT says that the data cannot be accessed directly - it requires processing (decompression or decryption, for example) first.

  • FIEMAP_EXTENT_LAST marks the final extent in a file.

  • FIEMAP_EXTENT_EOF indicates that the requested range goes beyond the end of the file.

  • FIEMAP_EXTENT_ERROR marks an extent which has experienced some sort of error; the fe_offset field will contain an error number in this case.

  • FIEMAP_EXTENT_UNKNOWN says that the data exists, but its location is unknown. This flag would describe much of your editor's personal file space, though it is unclear how the kernel would know that.

As can be seen, there is a wealth of information available from this new call, including details on how the file has been split up on disk, allocation strategies, and even the decisions made by a hierarchical storage engine. An implementation exists for the ext4 filesystem. None of this code has been pushed toward the mainline yet, but it would be surprising if that did not happen sometime in the relatively near future. Once that is done, the C library will be able to implement SEEK_HOLE and SEEK_DATA in user space, should that be desirable.
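
A user-space SEEK_HOLE might then reduce to a walk over the returned extent array, along the lines of the sketch below. Everything here is an assumption made for illustration: the structure is a local copy of the fields shown above, the flag value is a placeholder (the real FIEMAP_EXTENT_HOLE value is not given here), and the extents are assumed to come back in logical order, tiling the requested range without gaps starting at fm_start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder bit; NOT the real FIEMAP_EXTENT_HOLE value. */
#define SKETCH_EXTENT_HOLE 0x1u

/* Local stand-in for struct fiemap_extent, using user-space types. */
struct sketch_extent {
	uint64_t fe_offset;	/* on-disk start of the extent (unused here) */
	uint64_t fe_length;	/* length in bytes */
	uint32_t fe_flags;	/* SKETCH_EXTENT_* bits */
	uint32_t fe_lun;
};

/*
 * Return the first logical offset >= `offset` which lies inside a hole
 * extent, or the end of the mapped range if no such hole exists.
 * ext[0] is assumed to begin at logical offset `start`.
 */
uint64_t sketch_seek_hole(const struct sketch_extent *ext, size_t n,
			  uint64_t start, uint64_t offset)
{
	uint64_t pos = start;

	assert(ext != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t end = pos + ext[i].fe_length;

		if ((ext[i].fe_flags & SKETCH_EXTENT_HOLE) && end > offset)
			return offset > pos ? offset : pos;
		pos = end;
	}
	return pos;
}
```

A SEEK_DATA counterpart would be the same loop with the flag test inverted.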


Memory access and alignment

December 4, 2007

This article was contributed by Daniel Drake

When developing kernel code, it is usually important to consider constraints and requirements of architectures other than your own. Otherwise, your code may not be portable to other architectures, as I recently discovered when an unaligned memory access bug was reported in a driver which I develop. Not having much familiarity with the concepts of unaligned memory access, I set out to research the topic and complete my understanding of the issues.

Certain architectures require that memory accesses meet specific alignment criteria; accesses which do not are illegal. The exact criteria that determine whether an access is suitably aligned depend upon the address being accessed and the number of bytes involved in the transaction, and vary from architecture to architecture. Kernel code is typically written to obey natural alignment constraints, a scheme that is sufficiently strict to ensure portability to all supported architectures. Natural alignment requires that every N-byte access must be aligned on a memory address boundary of N. We can express this in terms of the modulus operator: addr % N must be zero. Some examples:

  1. Accessing 4 bytes of memory from address 0x10004 is aligned (0x10004 % 4 = 0).
  2. Accessing 4 bytes of memory from address 0x10005 is unaligned (0x10005 % 4 = 1).
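
The same test can be written as a small C helper; the function name is invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if an access of `n` bytes at address `addr` is naturally aligned. */
bool naturally_aligned(uintptr_t addr, size_t n)
{
	assert(n > 0);
	return addr % n == 0;
}
```

With this helper, naturally_aligned(0x10004, 4) is true and naturally_aligned(0x10005, 4) is false, matching the two cases above. Single-byte accesses (n equal to one) are always aligned.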

The phrase "memory access" is quite vague; the context here is assembly-level instructions which read or write a number of bytes to or from memory (e.g. movb, movw, movl in x86 assembly). It is relatively easy to relate these to C statements; for example, the instructions that are generated when the following code is compiled would likely include a single instruction that accesses two bytes (16 bits) of data from memory:

void example_func(unsigned char *data) {
	u16 value = *((u16 *) data);	/* one 16-bit load from data */
}

The effects of unaligned access vary from architecture to architecture. On architectures such as ARM32 and Alpha, a processor exception is raised when an unaligned access occurs, and the kernel is able to catch the exception and correct the memory access (at large cost to performance). Other architectures raise processor exceptions but the exceptions do not provide enough information for the access to be corrected. Some architectures that are not capable of unaligned access do not even raise an exception when unaligned access happens; instead, they just perform a different memory access from the one that was requested and silently return the wrong answer.

Some architectures are capable of performing unaligned accesses without having to raise bus errors or processor exceptions, i386 and x86_64 being some common examples. Even so, unaligned accesses can degrade performance on these systems, as Andi Kleen explains:

On Opteron the typical cost of a misaligned access is a single cycle and some possible penalty to load-store forwarding. On Intel it is a bit worse, but not all that much. Unless you do a lot of accesses of it in a loop it's not really worth something caring about too much.

At the end of the day, if you write code that causes unaligned accesses then your software will not work on some systems. This applies to both kernel-space and userspace code.

The theory is relatively easy to get to grips with, but how does this apply to real code? After all, when you allocate a variable on the stack, you have no control over its address. You don't get to control the addresses used to pass function parameters, or the addresses returned by the memory allocation functions. Fortunately, the compiler understands the alignment constraints of your architecture and will handle the common cases just fine; it will align your variables and parameters to suitable boundaries, and it will even insert padding inside structures to ensure the access to members is suitably aligned. Even when using the GCC-specific packed attribute (which tells GCC not to insert padding), GCC will transparently insert extra instructions to ensure that standard accesses to potentially unaligned structure members do not violate alignment constraints (at a cost to performance).
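
The compiler's padding can be observed directly with offsetof(). This is a user-space sketch with made-up structure and function names; it assumes GCC or a compiler which accepts the same packed attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct padded {
	uint8_t  tag;
	uint32_t value;		/* compiler pads so this member is aligned */
};

struct squeezed {
	uint8_t  tag;
	uint32_t value;		/* no padding: follows tag immediately */
} __attribute__((packed));

/* Offsets of `value` in each layout. */
size_t padded_value_offset(void)   { return offsetof(struct padded, value); }
size_t squeezed_value_offset(void) { return offsetof(struct squeezed, value); }

/*
 * A standard member access on the packed structure; the compiler knows
 * the member may be unaligned and generates code accordingly.
 */
uint32_t squeezed_read(const struct squeezed *s)
{
	assert(s != NULL);
	return s->value;
}
```

In the packed structure the value member sits at offset one, which can never be 32-bit aligned relative to the start of the structure. In the padded version its offset is a multiple of the type's alignment.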

In order to illustrate a situation that might cause unaligned memory access, consider the example_func() implementation from above. The first line of the function accesses two bytes (16 bits) of data from a memory address passed in as a function parameter; however, we do not have any other information about this address. If the data parameter points to an odd address (as opposed to even), for example 0x10005, then we end up with an unaligned access. The main places where you will potentially run into unaligned accesses are when accessing multiple bytes of data (in a single transaction) from a pointer, and when casting variables to types of increased lengths.

Conceptually, the way to avoid unaligned access is to use byte-wise memory access because accessing single bytes of memory cannot violate alignment constraints. For example, for a little-endian system we could replace the example_func() implementation with the following:

void fixed_example_func(unsigned char *data) {
	u16 value = data[0] | (data[1] << 8);	/* two single-byte loads */
}

memcpy() is another possible alternative in the general case, as long as either the source or destination is a pointer to an 8-bit data type (i.e. char). Inside the kernel, two macros are provided which simplify unaligned accesses: get_unaligned() and put_unaligned(). It is worth noting that using any of these solutions is significantly slower than accessing aligned memory, so it is wise to completely avoid unaligned access where possible.
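
The memcpy() approach might look like the following in user space. This is only an illustration of the idea, not the kernel's get_unaligned() or put_unaligned() implementation, and the function names are invented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a host-endian 16-bit value from a possibly unaligned address. */
uint16_t load_u16(const unsigned char *p)
{
	uint16_t v;

	assert(p != NULL);
	memcpy(&v, p, sizeof(v));	/* copy by bytes; no 16-bit load from p */
	return v;
}

/* Store a host-endian 16-bit value to a possibly unaligned address. */
void store_u16(unsigned char *p, uint16_t v)
{
	assert(p != NULL);
	memcpy(p, &v, sizeof(v));
}
```

Unlike the byte-shifting version, this one preserves the host's byte order, so it is a drop-in replacement for the cast-and-dereference idiom on both little-endian and big-endian systems.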

Another option is to simply document the fact that example_func() requires a 16-bit-aligned data parameter, and rely on the call sites to ensure this or simply not use the function. Linux's optimized routine for comparing two ethernet addresses (compare_ether_addr()) is a real life example of this: the addresses must be 16-bit-aligned.

I have applied my newfound knowledge to the task of writing some kernel documentation, which covers this topic in more detail. If you want to learn more, you may want to read the most recent revision (as of this writing) of the document. Additionally, the initial revision of the document generated a lot of interesting discussion, but be aware that the initial attempt contained some mistakes. Finally, chapter 11 of Linux Device Drivers touches upon this topic.

I'd like to thank everyone who helped me improve my understanding of unaligned access, as this article would not have been possible without their assistance.


The return of network channels

By Jonathan Corbet
December 4, 2007
The network channels concept was first aired by Van Jacobson almost two years ago at linux.conf.au 2006. This idea promises much-improved networking performance by pushing processing of network data as close to the end point as possible - perhaps even into user space. By getting the kernel out of the packet processing business and by keeping that processing in a single place (on the same CPU), channel schemes hope to minimize cache misses, context switches, and other performance-degrading activities. Channels have had a rough encounter with the real world, though; when one starts to consider needs like packet filtering, address translation, and so on, it gets hard to maintain the simplicity upon which the performance of channels relies. So, two years later, there is no channels implementation which is even close to merging into the mainline.

That does not mean that no work is happening in this area, though. Evgeniy Polyakov, perhaps the most discouragement-resistant hacker out there, continues to develop his channel patches; the 22nd release came out on December 4.

This version of the patch has a well-defined internal structure to allow kernel code to hook into channels. The best-developed mode, however, is the one which simply transfers packets to and from user space. To that end, there is a new system call:

    int netchannel_control(struct unetchannel_control *ctl);

The full contents of the unetchannel_control structure can be seen in the patch. The more important fields are:

  • cmd, describing the action that the calling process wishes to execute. Unlike previous versions of the patch, the current code only supports one action: NETCHANNEL_CREATE, which makes a new channel.

  • type, the type of the channel to create. At the moment, the only implemented type is NETCHANNEL_COPY_USER, which copies packets to and from user space.

  • A channel description field, which holds the source and destination addresses and ports and a protocol number for the channel to be created.

Once a network channel is created, it is added to a search tree which is oriented toward blindingly-fast lookups. There is a new hook in the packet receive code which looks up each incoming packet in that tree; packets which do not turn up a hit there are processed normally by the kernel's networking stack. Any packet whose addresses, ports, and protocol are matched by an entry in the tree, however, is shunted over to the channel code before even being queued by the network stack.
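
The receive-side decision can be pictured with a small user-space sketch. Note that it swaps in a sorted array and binary search where the patch uses its own search tree, and that the key structure and function names are invented here; the real layout is in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical lookup key: addresses, ports, and protocol number. */
struct chan_key {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t  proto;
};

/* Total order over keys, comparing field by field. */
static int key_cmp(const void *a, const void *b)
{
	const struct chan_key *x = a, *y = b;

	if (x->saddr != y->saddr) return x->saddr < y->saddr ? -1 : 1;
	if (x->daddr != y->daddr) return x->daddr < y->daddr ? -1 : 1;
	if (x->sport != y->sport) return x->sport < y->sport ? -1 : 1;
	if (x->dport != y->dport) return x->dport < y->dport ? -1 : 1;
	return (int)x->proto - (int)y->proto;
}

/* Sort the channel table so that lookups can binary-search it. */
void chan_table_sort(struct chan_key *tab, size_t n)
{
	assert(tab != NULL || n == 0);
	qsort(tab, n, sizeof(*tab), key_cmp);
}

/*
 * Return the matching channel, or NULL if the packet matches no channel
 * and should be handled by the normal networking stack.
 */
const struct chan_key *chan_lookup(const struct chan_key *tab, size_t n,
				   const struct chan_key *pkt)
{
	assert(pkt != NULL);
	return bsearch(pkt, tab, n, sizeof(*tab), key_cmp);
}
```

The lookup runs for every incoming packet, which is why the patch puts so much emphasis on making it fast.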

The final piece (on the receive side) is a simple read() implementation. A process wishing to receive a packet from a network channel need only read the associated file descriptor and the next available packet will be copied into the supplied buffer. It would, of course, be nice to do away with that copy operation, but that is a hard trick to carry out: the packet must be received before its destination is known. There are network adapters which can direct packets based on their header information, but the current netfilter does not have the driver API enhancements which would be required to use that capability for zero-copy packet reception.

Similarly, a write() operation causes the associated packet to be copied into the kernel and fed into the networking stack at a fairly low level. There is currently no zero-copy write support.

Evgeniy clearly has zero-copy operations in mind, though, probably using his network allocator patch. Even without that feature, though, the channel code, when used with his user-space network stack, appears to be quite fast. Some posted benchmark results claim significant improvements over the core Linux networking stack - three times the maximum bandwidth with one-third of the CPU usage when small packets are being transferred. For larger (4096-byte) packets the performance improvements essentially disappear - most likely the cost of copying the packets into and out of the kernel is the dominating factor there.

Improvements in small-packet performance are welcome: there are a number of applications, including high-end financial trading, which require large numbers of small transfers. The addition of zero-copy logic has the potential to make the large-packet performance better as well. The real test, though, will be the addition of all of the other features expected by contemporary networking users, most of which are currently absent from the channels implementation. There are hooks in the code aimed at the insertion of per-packet processing; they could be used for filtering, address translation, traffic control, or any of the other things that one might want to have. Whether those hooks can be used without killing the performance advantages of channels remains to be seen, though. But one suspects that Evgeniy will not give up until he has an answer to that question.


Patches and updates

Filesystems and block I/O

  • Chris Mason: Btrfs v0.9. (December 5, 2007)


Page editor: Jonathan Corbet

Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds