
Kernel development

Brief items

Kernel release status

The current development kernel is 3.1-rc5, released on September 4. The ongoing problems at kernel.org have caused this release to be hosted in an unusual place.

However, kernel.org is still down, and there really hasn't been a ton of development going on, so I considered just skipping a week. But hey, the whole point (well, *one* of the points) of distributed development is that no single place is really any different from any other, so since I did a github account for my divelog thing, why not see how well it holds up to me just putting my whole kernel repo there too?

For obvious reasons, the full changelog is not available on kernel.org at this time.

Stable updates: there have been no stable updates released in the last week, and none are in the review process as of this writing.

Comments (1 posted)

Quotes of the week

It's not possible to absolutely say that many Android distributors no longer have the right to distribute Linux. But nor is it possible to absolutely say that they haven't lost that right. Any sufficiently motivated kernel copyright holder probably could engage in a pretty effective shakedown racket against Android vendors. Whether they will do so remains to be seen, but honestly if I were an Android vendor I'd be worried. There's plenty of people out there who hold copyright over significant parts of the kernel. Would you really bet on all of them being individuals of extreme virtue?
-- Matthew Garrett

I think that realistically we should definitely look at our practices, but at the same time, I personally do put a lot of trust in "human relationships".

Often way more than "technical models".

So there is a lot of safety in just a purely human "this looks like the kind of pull request I expect". A lot of kernel developers write nice messages explaining the pull, and there may not be a cryptographic signature in text like that, but there is definitely a "human signature" that you start to expect.

-- Linus Torvalds

Kernelroll is a linux kernel module for advanced rickrolling. It works by patching the open() system call to open a specified music file instead of other music files. Currently, it only checks if the file extension "mp3" is present and calls the original open() with the supplied path instead.
-- "fpletz"; acceptance into the mainline not guaranteed

Comments (5 posted)

Kernel development news

On multi-platform drivers

By Jonathan Corbet
September 7, 2011
LWN recently looked at the discussion around moving the Broadcom wireless driver into the mainline from the staging tree. This driver raises a number of issues about how the kernel community interacts with hardware manufacturers. One important aspect of the discussion, though, did not come up until after that article was written. Linux drivers are expected to be drivers for Linux, only. Attempts to maintain a Linux driver as a multi-platform driver will lead to unhappiness, for a number of reasons. What follows is an unabashedly partisan article on why multi-platform drivers do not fit well with the Linux kernel.

Broadcom developer Henry Ptasinski brought the issue to the fore while talking about why the company was not interested in supporting the in-mainline b43 driver:

The brcmsmac driver has architectural alignment with our drivers for other operating systems, and we intend to enhance and maintain this driver in parallel with drivers for other operating systems. Maintaining alignment between our Linux driver and drivers for other operating systems allows us to leverage feature and chip support across all platforms.

To developers who have worked with the kernel for a while, these words look like a fundamental mistake. To others, though, they seem reasonable: if Broadcom wants to support its hardware, why not let the company do things its way?

One clear problem with trying to maintain "architectural alignment" with drivers for other operating systems is that only the original company can maintain that alignment. The other associated drivers are almost certainly not open source; nobody else in the community has any way to know which changes are consistent with those other drivers and which are not. Not even the relevant subsystem maintainer can make that kind of call.

One also must consider that most other kernel developers have no motivation for - or interest in - maintaining the correspondence between the drivers, even if they did know how to do it.

The obvious conclusion here is that allowing a vendor to maintain a multi-platform driver in the kernel tree will only work if that vendor is given absolute control over the code. If others can make arbitrary changes, there is no way for the vendor to keep the drivers consistent. But, in the kernel, nobody has that kind of absolute control with the possible exception of Linus Torvalds. If something needs to be fixed or changed, anybody with the relevant technical skills can do it. If a piece of the kernel tree were to be fenced off and made off-limits for kernel developers, the kernel as a whole becomes a little less free.

And that freedom matters. Consider the problem of internal API changes. As anybody who watches kernel development knows, internal interfaces are changed all the time in response to problems and changing needs. Those changes can, at times, force significant changes in users of the affected interfaces. Contemporary rules call for a developer who makes an interface change to fix any code broken by that change. Code that has been designated as off limits will be hard to fix in this way, slowing down the evolution of the kernel as a whole. As one example, consider the removal of the big kernel lock; that job required significant locking changes in many places. Literally hundreds of drivers were modified in the process. Impeding those changes would have made the BKL removal task even slower - and maybe impossible.

Manufacturers are not known for long-term support of their products; they have no real reason to support five-year-old chipsets that they no longer sell. Indeed, they have every reason to end such support and encourage the replacement of older hardware with shiny new stuff. Linux, instead, tends to support hardware for as long as it is in use. Giving a vendor absolute control over a driver is certain to create conflict when that vendor moves to drop support for older chipsets.

A vendor's agenda can differ from the community's needs in other ways as well. Vendors may not appreciate patches to enable undocumented features or make low-end offerings behave more like their more expensive alternatives. Or consider Hans Reiser's opposition to the addition of extended attribute and access control list support to reiserfs. His argument was that users should wait for the shiny new Reiser4 filesystem to obtain such features; had he been listened to, reiserfs users never would have had support for those basic filesystem capabilities. The kernel works well because it is maintained as the best kernel for all users over the long term, even if that occasionally causes conflicts with short-term vendor desires.

Multi-platform drivers from vendors tend to be written around the minimal set of support functions that are available on all platforms. The result is a lot of code duplicating functionality already found in the Linux kernel; consider, for example, just how many wireless drivers initially came with their own 802.11 stacks built in. Developing and maintaining just one rock-solid 802.11 implementation is hard; having several of them in the kernel would bloat the kernel and endow it with a number of second-rate implementations - all of which must be maintained into the future. Other kernel support code - from simple linked lists through complicated memory management - is also often avoided by multi-platform drivers. Those drivers will be bigger, buggier, and harder for kernel developers to read and support. They are also much less likely to behave consistently with other Linux drivers for the same type of hardware.

Beyond all of the above, it is also far from clear that maintaining a multi-platform driver actually saves any work. Drivers written for Linux can make full use of the available support infrastructure. Multi-platform drivers must, instead, duplicate much of that functionality and maintain an operating system abstraction layer as well. Maintaining a multi-platform driver means maintaining a larger body of code without help from the community.

In summary: trying to maintain a single driver for multiple operating systems may look like a good idea on the surface. But it is only sustainable in a world where the vendor keeps complete control over the code. Even then, it leads to worse code, duplicated effort, long-term maintenance issues, and more work overall. Linux works best when its drivers are written for Linux and can be fully integrated with the rest of the kernel. The community's developers understand this well; that is why multi-platform drivers have a hard time getting into the mainline.

Comments (12 posted)

Upcoming DSP architectures

September 7, 2011

This article was contributed by Arnd Bergmann

The Linux-3.2 merge window may be the first time that two new CPU architectures get merged at the same time: the c6x architecture from Texas Instruments and the Hexagon architecture from Qualcomm. Following the recently merged OpenRisc platform, the two submissions look very solid and should see no major obstacles getting merged into Linux after the usual review comments have been resolved, but there is still some debate over how to best add glibc support for the architectures.

Interestingly, there is a lot that these two architectures have in common, far beyond coincidentally implementing the same bugs. Neither is a regular CPU core designed to run an operating system; both are essentially digital signal processors, similar to the Blackfin architecture that was merged into linux-2.6.22 a few years ago. Further, both Hexagon and c6x are already widely available in systems running Linux on the ARM core of a TI OMAP or Qualcomm MSM system-on-a-chip, where they are used for offloading CPU-intensive work such as that required for video codecs.

It will be interesting to see how the two Linux ports can coexist in the long run when the same SoC can run Linux on either of its CPU architectures. The ARM architecture is currently transitioning to probing based on the dts device tree format, and all new architectures merged into Linux will have to use that format as well when they have devices that cannot be automatically detected. If the device tree vision comes true, a single board will actually be able to use the same device tree binary on either CPU, independent of which one actually runs the kernel.

Another intriguing scenario is running Linux on both architectures (ARM and DSP) simultaneously, using shared memory to communicate between them. Ohad Ben-Cohen has recently posted a framework based on virtio to allow just that on the TI platform. While virtio was intended to be used for communication between a virtual machine and the host operating system, it turns out to be flexible enough to allow the same drivers for communication between operating system instances on the same hardware.

Looking closer at the actual DSP architectures, there are some major differences between Hexagon and C6x. The former is quite capable, with support for symmetric multiprocessing, a memory management unit and even a hypervisor. It can be seen as a competitor to established CPU architectures like ARM, x86 or powerpc, at least in the embedded space. In contrast, C6x is a rather minimalistic architecture dating back to the TMS320 introduced in 1983. So far, its kernel supports neither SMP nor an MMU, which means it is restricted to running μClibc instead of glibc, and it has a very limited set of applications that can be supported as long as it is missing the MMU.

Beyond Linux 3.2, there are still more architectures that have been around for a long time and could get merged if the respective maintainers were interested. FPGA-based Nios2 is apparently close to getting submitted, while the similar lm32 architecture saw a lot of activity in 2010 but does not seem to be actively worked on now. Synopsys ARC and Imagination META are both claimed to have Linux and Android support, but there is no indication that the authors are actively working on upstream submission or even on making the patches for current kernels easily available. Finally, Donald Knuth's MMIX architecture has seen some occasional work in the past but now appears to be stalled, the latest kernel source version being 2.6.18.

Comments (4 posted)

Ensuring data reaches disk

September 7, 2011

This article was contributed by Jeff Moyer

In a perfect world, there would be no operating system crashes, power outages or disk failures, and programmers wouldn't have to worry about coding for these corner cases. Unfortunately, these failures are more common than one would expect. The purpose of this document is to describe the path data takes from the application down to the storage, concentrating on places where data is buffered, and to then provide best practices for ensuring data is committed to stable storage so it is not lost along the way in the case of an adverse event. The main focus is on the C programming language, though the system calls mentioned should translate fairly easily to most other languages.

I/O buffering

In order to program for data integrity, it is crucial to have an understanding of the overall system architecture. Data can travel through several layers before it finally reaches stable storage, as seen below:

[Figure: I/O data flow from application buffers, through library buffers and the kernel page cache, to the storage device's write-back cache and, finally, stable storage]

At the top is the running application which has data that it needs to save to stable storage. That data starts out as one or more blocks of memory, or buffers, in the application itself. Those buffers can also be handed to a library, which may perform its own buffering. Regardless of whether data is buffered in application buffers or by a library, the data lives in the application's address space. The next layer that the data goes through is the kernel, which keeps its own version of a write-back cache called the page cache. Dirty pages can live in the page cache for an indeterminate amount of time, depending on overall system load and I/O patterns. When dirty data is finally evicted from the kernel's page cache, it is written to a storage device (such as a hard disk). The storage device may further buffer the data in a volatile write-back cache. If power is lost while data is in this cache, the data will be lost. Finally, at the very bottom of the stack is the non-volatile storage. When the data hits this layer, it is considered to be "safe."

To further illustrate the layers of buffering, consider an application that listens on a network socket for connections and writes data received from each client to a file. Before closing the connection, the server ensures the received data was written to stable storage, and sends an acknowledgment of such to the client.

After accepting a connection from a client, the application will need to read data from the network socket into a buffer. The following function reads the specified amount of data from the network socket and writes it out to a file. The caller already determined from the client how much data is expected, and opened a file stream to write the data to. The (somewhat simplified) function below is expected to save the data read from the network socket to disk before returning.

 0 int
 1 sock_read(int sockfd, FILE *outfp, size_t nrbytes)
 2 {
 3      int ret;
 4      size_t written = 0;
 5      char *buf = malloc(MY_BUF_SIZE);
 6
 7      if (!buf)
 8              return -1;
 9
10      while (written < nrbytes) {
11              ret = read(sockfd, buf, MY_BUF_SIZE);
12              if (ret <= 0) {
13                      if (errno == EINTR)
14                              continue;
15                      return ret;
16              }
17              written += ret;
18              ret = fwrite((void *)buf, ret, 1, outfp);
19              if (ret != 1)
20                      return ferror(outfp);
21      }
22
23      ret = fflush(outfp);
24      if (ret != 0)
25              return -1;
26
27      ret = fsync(fileno(outfp));
28      if (ret < 0)
29              return -1;
30      return 0;
31 }

Line 5 is an example of an application buffer; the data read from the socket is put into this buffer. Now, since the amount of data transferred is already known, and given the nature of network communications (they can be bursty and/or slow), we've decided to use libc's stream functions (fwrite() and fflush(), represented by "Library Buffers" in the figure above) in order to further buffer the data. Lines 10-21 take care of reading the data from the socket and writing it to the file stream. At line 22, all data has been written to the file stream. On line 23, the file stream is flushed, causing the data to move into the "Kernel Buffers" layer. Then, on line 27, the data is saved to the "Stable Storage" layer shown above.


Now that we've hopefully solidified the relationship between APIs and the layering model, let's explore the intricacies of the interfaces in a little more detail. For the sake of this discussion, we'll break I/O down into three different categories: system I/O, stream I/O, and memory mapped (mmap) I/O.

System I/O can be defined as any operation that writes data into the storage layers accessible only to the kernel's address space via the kernel's system call interface. The following routines (not comprehensive; the focus is on write operations here) are part of the system (call) interface:

Open: open(), creat()
Write: write(), aio_write(), pwrite(), pwritev()
Sync: fsync(), sync()
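To make the system-I/O path concrete, here is a minimal sketch, a hypothetical helper rather than anything from this article, that writes a buffer with write() and then forces it to stable storage with fsync(); the function name and error-handling conventions are illustrative:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write len bytes of buf to path and force them to stable storage.
 * Returns 0 on success, -1 on failure (with errno set). */
int write_durably(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t off = 0;
    while (off < len) {
        /* write() may perform a short write; loop until done */
        ssize_t n = write(fd, buf + off, len - off);
        if (n < 0) {
            close(fd);
            return -1;
        }
        off += (size_t) n;
    }

    /* The data may still be dirty in the page cache; fsync() pushes it
     * to the device and asks the device to flush its volatile cache. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);   /* close() can report deferred I/O errors too */
}
```

Note that the write() loop is needed because a single call may write fewer bytes than requested.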

Stream I/O is I/O initiated using the C library's stream interface. Writes using these functions may not result in system calls, meaning that the data still lives in buffers in the application's address space after making such a function call. The following library routines (not comprehensive) are part of the stream interface:

Open: fopen(), fdopen(), freopen()
Write: fwrite(), fputc(), fputs(), putc(), putchar(), puts()
Sync: fflush(), followed by fsync() or sync()

Memory mapped files are similar to the system I/O case above. Files are still opened and closed using the same interfaces, but access to the file data is performed by mapping that data into the process' address space, and then performing memory read and write operations as you would with any other application buffer.

Open: open(), creat()
Write: memcpy(), memmove(), read(), or any other routine that writes to application memory
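As a sketch of the mmap case (the helper name write_via_mmap is made up for illustration), the same durable write can be done through a shared mapping, with msync(MS_SYNC) playing the role that fsync() plays for system I/O:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy len bytes of buf into path through a shared mapping and
 * synchronize the mapping.  Returns 0 on success, -1 on failure. */
int write_via_mmap(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t) len) < 0) {   /* size the file first */
        close(fd);
        return -1;
    }
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map, buf, len);    /* plain memory stores dirty the pages */

    /* msync(MS_SYNC) is the mmap counterpart of fsync() */
    int ret = msync(map, len, MS_SYNC);
    munmap(map, len);
    if (ret < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Depending on the filesystem, a separate fsync() on the descriptor may still be advisable to make file metadata durable as well.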

There are two flags that can be specified when opening a file to change its caching behavior: O_SYNC (and related O_DSYNC), and O_DIRECT. I/O operations performed against files opened with O_DIRECT bypass the kernel's page cache, writing directly to the storage. Recall that the storage may itself store the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage. The O_DIRECT flag is only relevant for the system I/O API.
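One practical consequence worth showing: O_DIRECT requires the buffer address, the file offset, and the transfer size to all be suitably aligned. The sketch below is illustrative only; the function name is hypothetical, and the 4096-byte alignment is a common safe choice rather than a guarantee (the real requirement depends on the filesystem and device):

```c
#define _GNU_SOURCE            /* for O_DIRECT; must precede the includes */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 040000        /* value on most Linux architectures */
#endif

/* Write one 4096-byte block bypassing the page cache.
 * Returns 0 on success, 1 if this filesystem rejects O_DIRECT,
 * and -1 on any other error. */
int write_direct_block(const char *path, const char *src, size_t srclen)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0)
        return errno == EINVAL ? 1 : -1;  /* EINVAL: O_DIRECT unsupported */

    /* Both the buffer address and the transfer size must be aligned */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0, 4096);
    memcpy(buf, src, srclen < 4096 ? srclen : 4096);

    ssize_t n = write(fd, buf, 4096);
    free(buf);
    if (n != 4096) {
        close(fd);
        return -1;
    }
    /* The device's write-back cache may still hold the block;
     * fsync() asks the device to flush it, as noted above. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```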

Raw devices (/dev/raw/rawN) are a special case of O_DIRECT I/O. These devices can be opened without specifying O_DIRECT, but still provide direct I/O semantics. As such, all of the same rules apply to raw devices that apply to files (or devices) opened with O_DIRECT.

Synchronous I/O is any I/O (system I/O with or without O_DIRECT, or stream I/O) performed to a file descriptor that was opened using the O_SYNC or O_DSYNC flags. These are the synchronous modes, as defined by POSIX:

  • O_SYNC: File data and all file metadata are written synchronously to disk.

  • O_DSYNC: Only file data and metadata needed to access the file data are written synchronously to disk.

  • O_RSYNC: Not implemented

The data and associated metadata for write calls to such file descriptors end up immediately on stable storage. Note the careful wording there: metadata that is not required for retrieving the data of the file may not be written immediately. That metadata may include the file's access time, creation time, and/or modification time.

It is also worth pointing out the subtleties of opening a file descriptor with O_SYNC or O_DSYNC, and then associating that file descriptor with a libc file stream. Remember that fwrite()s to the file pointer are buffered by the C library. It is not until an fflush() call is issued that the data is known to be written to disk. In essence, associating a file stream with a synchronous file descriptor means that an fsync() call is not needed on the file descriptor after the fflush(). The fflush() call, however, is still necessary.

When Should You Fsync?

There are some simple rules to follow to determine whether or not an fsync() call is necessary. First and foremost, you must answer the question: is it important that this data is saved now to stable storage? If it's scratch data, then you probably don't need to fsync(). If it's data that can be regenerated, it might not be that important to fsync() it. If, on the other hand, you're saving the result of a transaction, or updating a user's configuration file, you very likely want to get it right. In these cases, use fsync().

The more subtle usages deal with newly created files, or overwriting existing files. A newly created file may require an fsync() of not just the file itself, but also of the directory in which it was created (since this is where the file system looks to find your file). This behavior is actually file system (and mount option) dependent. You can either code specifically for each file system and mount option combination, or just perform fsync() calls on the directories to ensure that your code is portable.

Similarly, if you encounter a system failure (such as power loss, ENOSPC or an I/O error) while overwriting a file, it can result in the loss of existing data. To avoid this problem, it is common practice (and advisable) to write the updated data to a temporary file, ensure that it is safe on stable storage, then rename the temporary file to the original file name (thus replacing the contents). This ensures an atomic update of the file, so that other readers get one copy of the data or another. The following steps are required to perform this type of update:

  1. create a new temp file (on the same file system!)
  2. write data to the temp file
  3. fsync() the temp file
  4. rename the temp file to the appropriate name
  5. fsync() the containing directory

Checking For Errors

When performing write I/O that is buffered by the library or the kernel, errors may not be reported at the time of the write() or the fflush() call, since the data may only be written to the page cache. Errors from writes are instead often reported during calls to fsync(), msync() or close(). Therefore, it is very important to check the return values of these calls.

Write-Back Caches

This section provides some general information on disk caches, and the control of such caches by the operating system. The options discussed in this section should not affect how a program is constructed at all, and so this discussion is intended for informational purposes only.

The write-back cache on a storage device can come in many different flavors. There is the volatile write-back cache, which we've been assuming throughout this document; such a cache is lost upon power failure. However, most storage devices can be configured to run in either a cache-less mode or a write-through caching mode, neither of which will report success for a write request until the request is on stable storage. External storage arrays often have a non-volatile, or battery-backed, write cache; this configuration will also persist data in the event of power loss. The application programmer has no visibility into any of these parameters, however, so it is best to assume a volatile cache and program defensively. In cases where the data is in fact safe, the operating system will perform whatever optimizations it can to maintain the highest performance possible.

Some file systems provide mount options to control cache flushing behavior. For ext3, ext4, xfs and btrfs as of kernel version 2.6.35, the mount option is "-o barrier" to turn barriers (write-back cache flushes) on (the default), or "-o nobarrier" to turn barriers off. Previous versions of the kernel may require different options ("-o barrier=0,1"), depending on the file system. Again, the application writer should not need to take these options into account. When barriers are disabled for a file system, it means that fsync calls will not result in the flushing of disk caches. It is expected that the administrator knows that the cache flushes are not required before she specifies this mount option.

Appendix: some examples

This section provides example code for common tasks that application programmers often need to perform.

  1. Synchronizing I/O to a file stream

  2. Synchronizing I/O using file descriptors (system I/O). This is actually a subset of the first example and is independent of the O_DIRECT open flag (so will work whether or not that flag was specified).

  3. Replacing an existing file (overwrite).

  4. sync-samples.h (needed by the above examples).

Comments (41 posted)

Patches and updates

Kernel trees


Core kernel code

Development tools

Device drivers


Filesystems and block I/O

Memory management



Page editor: Jonathan Corbet

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds