Brief items
The current development kernel is 3.1-rc5,
released on September 4. The ongoing
problems at kernel.org have caused this release to be hosted in an unusual
place.
However, master.kernel.org is still down, and there really
hasn't been a ton of development going on, so I considered just skipping a
week. But hey, the whole point (well, *one* of the points) of distributed
development is that no single place is really any different from any other,
so since I did a github account for my divelog thing, why not see how well
it holds up to me just putting my whole kernel repo there too?
For
obvious reasons, the full changelog is not available on kernel.org at this
time.
Stable updates: there have been no stable updates released in the
last week, and none are in the review process as of this writing.
It's not possible to absolutely say that many Android distributors
no longer have the right to distribute Linux. But nor is it
possible to absolutely say that they haven't lost that right. Any
sufficiently motivated kernel copyright holder probably could
engage in a pretty effective shakedown racket against Android
vendors. Whether they will do remains to be seen, but honestly if I
were an Android vendor I'd be worried. There's plenty of people out
there who hold copyright over significant parts of the
kernel. Would you really bet on all of them being individuals of
extreme virtue?
--
Matthew Garrett
I think that realistically we should definitely look at our
practices, but at the same time, I personally do put a lot of trust
in "human relationships".
Often way more than "technical models".
So there is a lot of safety in just a purely human "this looks like
the kind of pull request I expect". A lot of kernel developers
write nice messages explaining the pull, and there may not be a
cryptographic signature in text like that, but there is definitely
a "human signature" that you start to expect.
--
Linus Torvalds
Kernelroll is a linux kernel module for advanced rickrolling. It
works by patching the open() system call to open a specified music
file instead of other music files. Currently, it only checks if the
file extension "mp3" is present and calls the original open() with
the supplied path instead.
--
"fpletz"; acceptance
into the mainline not guaranteed
Kernel development news
By Jonathan Corbet
September 7, 2011
LWN recently
looked at the discussion around
moving the Broadcom wireless driver into the mainline from the staging
tree. This driver raises a number of issues on how the kernel community
interacts with hardware manufacturers. One important aspect of the
discussion, though, did not come up until after that article was written.
Linux drivers are expected to be drivers for Linux, only. Attempts to
maintain a Linux driver as a multi-platform driver will lead to
unhappiness, for a number of reasons. What follows is an unabashedly
partisan article on why multi-platform drivers do not fit well with the
Linux kernel.
Broadcom developer Henry Ptasinski brought the
issue to the fore while talking about why the company was not
interested in supporting the in-mainline b43 driver:
The brcmsmac driver has architectural alignment with our drivers
for other operating systems, and we intend to enhance and
maintain this driver in parallel with drivers for other operating
systems. Maintaining alignment between our Linux driver and
drivers for other operating systems allows us to leverage feature
and chip support across all platforms.
To developers who have worked with the kernel for a while, these words look
like a fundamental mistake. To others, though, they seem reasonable: if
Broadcom wants to support its hardware, why not let the company do things
its way?
One clear problem with trying to maintain "architectural alignment"
with drivers for other operating systems is that only the original company
can maintain that alignment. The other associated drivers are almost certainly not
open source; nobody else in the community has any way to know which changes
are consistent with those other drivers and which are not. Not even the
relevant subsystem maintainer can make that kind of call.
One also must consider that most other kernel developers have no motivation
for - or interest in - maintaining the correspondence between the drivers,
even if they did know how to do it.
The obvious conclusion here is that allowing a vendor to maintain a
multi-platform driver in the kernel tree will only work if that vendor is
given absolute control over the code. If others can make arbitrary
changes, there is no way for the vendor to keep the drivers consistent.
But, in the kernel, nobody has that kind of absolute control with the
possible exception of Linus Torvalds. If something needs to be fixed or
changed, anybody with the relevant technical skills can do it. If a piece
of the kernel tree were to be fenced off and made off-limits for kernel
developers, the kernel as a whole becomes a little less free.
And that freedom matters. Consider the problem of internal API changes.
As anybody who watches kernel development knows, internal interfaces are
changed all the time in response to problems and changing needs. Those
changes can, at times, force significant changes in users of the affected
interfaces. Contemporary rules call for a developer who makes an interface
change to fix any code broken by that change. Code that has been
designated as off limits will be hard to fix in this way, slowing down the
evolution of the kernel as a whole. As one example, consider the removal
of the big kernel lock; that job required significant locking changes in
many places. Literally hundreds of drivers were modified in the process.
Impeding those changes would have made the BKL removal task even slower -
and maybe impossible.
Manufacturers are not known for long-term support of their products; they
have no real reason to support five-year-old chipsets that they no longer
sell. Indeed, they have every reason to end such support and encourage
the replacement of older hardware with shiny new stuff. Linux, instead, tends to
support hardware for as long as it is in use. Giving a vendor absolute
control over a driver is certain to create conflict when that vendor moves
to drop support for older chipsets.
A vendor's agenda can differ from the community's needs in other ways as
well. Vendors may not appreciate patches to enable undocumented features
or make low-end offerings behave more like their more expensive
alternatives. Or consider Hans Reiser's opposition to the addition of extended
attribute and access control list support to reiserfs. His argument was
that users should wait for the shiny new Reiser4 filesystem to obtain such
features; had he been listened to, reiserfs users never would have had
support for those basic filesystem capabilities. The kernel works well
because it is maintained as the best kernel for all users over the long
term, even if that occasionally causes conflicts with short-term vendor
desires.
Multi-platform drivers from vendors tend to be written around the minimal
set of support functions that are available on all platforms. The result
is a lot of code duplicating functionality already found in the Linux
kernel; consider, for example, just how many wireless drivers initially
came with their own 802.11 stacks built in. Developing and maintaining
just one rock-solid 802.11 implementation is hard; having several of them
in the kernel would bloat the kernel and endow it with a number of
second-rate implementations - all of which must be maintained into the
future. Other kernel support code - from simple linked lists through
complicated memory management - is also often avoided by multi-platform
drivers. Those drivers will be bigger, buggier, and harder for kernel
developers to read and support. They are also much less likely to behave
consistently with other Linux drivers for the same type of hardware.
Beyond all of the above, it is also far from clear that maintaining a
multi-platform
driver actually saves any work. Drivers written for Linux can make full
use of the available support infrastructure. Multi-platform drivers must,
instead, duplicate much of that functionality and maintain an operating
system abstraction layer as well. Maintaining a multi-platform driver
means maintaining a larger body of code without help from the community.
In summary: trying to maintain a single driver for multiple operating
systems may look like a good idea on the surface. But it is only
sustainable in a world where the vendor keeps complete control over the
code. Even then, it leads to worse code, duplicated effort, long-term
maintenance issues, and more work overall. Linux works best when its
drivers are written for Linux and can be fully integrated with the rest of
the kernel. The community's developers understand this well; that is why
multi-platform drivers have a hard time getting into the mainline.
September 7, 2011
This article was contributed by Arnd Bergmann
The Linux-3.2 merge window may be the first time that two new CPU
architectures get merged at the same time: the
c6x
architecture from Texas Instruments and the
Hexagon architecture
from Qualcomm.
Following the recently merged
OpenRisc
platform, the two submissions look very solid and should see no major
obstacles getting merged into Linux after the usual review comments have
been resolved, but there is still
some
debate over how to best add glibc
support for the architectures.
Interestingly, there is a lot that these two architectures have in common,
far beyond coincidentally
implementing the same bugs.
Neither is a regular CPU core designed to run an operating system; both are
essentially digital signal processors,
similar to the Blackfin architecture that was merged into linux-2.6.22
a few years ago. Further, both Hexagon and c6x are already widely available
in systems running Linux on the ARM core of a TI OMAP or Qualcomm MSM
system-on-a-chip, where they are used for offloading CPU intensive work
such as that required for video codecs.
It will be interesting to see how Linux can coexist in the long run when
the same SoC can run Linux on either of the two CPU architectures. The
ARM architecture is currently transitioning to probing based on the
dts device tree format and all new
architectures merged into Linux will have to use that format as well
when they have devices that cannot be automatically detected. If the device
tree vision comes true, a single board will actually be able to use
the same device tree binary on either one, independent of which CPU
actually runs the kernel.
Another intriguing scenario is running Linux on both architectures
(ARM and DSP) simultaneously, using shared memory to communicate between
them. Ohad Ben-Cohen has recently posted a
framework based on virtio
to allow just that on the TI platform.
While virtio was intended to be used for communication between a virtual
machine and the host operating system, it turns out to be flexible enough
to allow the same drivers for communication between operating system
instances on the same hardware.
Looking closer at the actual DSP architectures, there are some major
differences between Hexagon and C6x. The former is quite capable,
with support for symmetric multiprocessing, a memory management
unit and even a hypervisor. It can be seen as a competitor to
established CPU architectures like ARM, x86 or powerpc, at least in
the embedded space. In contrast, C6x is a rather minimalistic architecture
dating back to the TMS320 introduced in 1983. So far, its kernel supports
neither SMP nor an MMU, which means it is restricted to running μClibc
instead of glibc, and it has a very limited set of
applications that can be supported as long as it is missing the MMU.
Beyond Linux 3.2, there are still more architectures that have been
around for a long time and could get merged if the respective maintainers
were interested. FPGA-based Nios2 is
apparently close to getting submitted, while the similar lm32 architecture
saw a lot
of activity in 2010 but does not seem to be actively worked on now.
Synopsys ARC
and Imagination META are
both claimed
to have Linux and Android support, but there is no indication that the
authors are actively working on upstream submission or even on making
the patches for current kernels easily available.
Finally, Donald Knuth's MMIX
architecture has seen some occasional work in
the past but
now appears to be stalled, the latest kernel source version being 2.6.18.
September 7, 2011
This article was contributed by Jeff Moyer
In a perfect world, there would be no operating system crashes, power
outages or disk failures, and programmers wouldn't have to worry about
coding for these corner cases. Unfortunately, these failures are more
common than one would expect. The purpose of this document is to
describe the path data takes from the application down to the storage,
concentrating on places where data is buffered, and to then provide
best practices for ensuring data is committed to stable storage so it
is not lost along the way in the case of an adverse event. The main
focus is on the C programming language, though the system calls
mentioned should translate fairly easily to most other languages.
I/O buffering
In order to program for data integrity, it is crucial to have an
understanding of the overall system architecture. Data can travel
through several layers before it finally reaches stable storage, as
seen below:
At the top is the running application which has data that it needs to
save to stable storage. That data starts out as one or more blocks
of memory, or buffers, in the application itself. Those buffers can
also be handed to a library, which may perform its own buffering.
Regardless of whether data is buffered in application buffers or by a
library, the
data lives in the application's address space. The next layer that the data
goes through is the kernel, which keeps its own version of a
write-back cache called the page cache. Dirty pages can live in the
page cache for an indeterminate amount of time, depending on overall
system load and I/O patterns. When dirty data is finally evicted from
the kernel's page cache, it is written to a storage device (such as a
hard disk). The storage device may further buffer the data in a
volatile write-back cache. If power is lost while data is in this
cache, the data will be lost. Finally, at the very bottom of the
stack is the non-volatile storage. When the data hits this layer, it
is considered to be "safe."
To further illustrate the layers of buffering, consider an
application that listens on a network socket for connections and
writes data received from each client to a file. Before closing the
connection, the server ensures the received data was written to stable
storage, and sends an acknowledgment of such to the client.
After accepting a connection from a client, the application will need
to read data from the network socket into a buffer. The following
function reads the specified amount of data from the network socket
and writes it out to a file. The caller already determined from the
client how much data is expected, and opened a file stream to write
the data to. The (somewhat simplified) function below is expected to
save the data read from the network socket to disk before
returning.
 0  int
 1  sock_read(int sockfd, FILE *outfp, size_t nrbytes)
 2  {
 3      int ret;
 4      size_t written = 0;
 5      char *buf = malloc(MY_BUF_SIZE);
 6
 7      if (!buf)
 8          return -1;
 9
10      while (written < nrbytes) {
11          ret = read(sockfd, buf, MY_BUF_SIZE);
12          if (ret <= 0) {
13              if (ret < 0 && errno == EINTR)
14                  continue;
15              return ret;
16          }
17          written += ret;
18          ret = fwrite((void *)buf, ret, 1, outfp);
19          if (ret != 1)
20              return ferror(outfp);
21      }
22
23      ret = fflush(outfp);
24      if (ret != 0)
25          return -1;
26
27      ret = fsync(fileno(outfp));
28      if (ret < 0)
29          return -1;
30      return 0;
31  }
Line 5 is an example of an application buffer; the data read from the
socket is put into this buffer. Now, since the amount of data
transferred is already known, and given the nature of network
communications (they can be bursty and/or slow), we've decided to use
libc's stream functions (fwrite() and fflush(), represented by "Library
Buffers" in the figure above) in order to further buffer the data.
Lines 10-21 take care of reading the data from the socket and writing it
to the file stream. At line 22, all data has been written to the file
stream. On line 23, the file stream is flushed, causing the data to
move into the "Kernel Buffers" layer. Then, on line 27, the data is
saved to the "Stable Storage" layer shown above.
I/O APIs
Now that we've hopefully solidified the relationship between APIs and
the layering model, let's explore the intricacies of the interfaces in
a little more detail. For the sake of this discussion, we'll break
I/O down into three different categories: system I/O, stream I/O, and
memory mapped (mmap) I/O.
System I/O can be defined as any operation that writes data into the
storage layers accessible only to the kernel's address space via the
kernel's system call interface. The following routines (not
comprehensive; the focus is on write operations here) are part of the
system (call) interface:
| Operation | Function(s) |
| Open | open(), creat() |
| Write | write(), aio_write(), pwrite(), pwritev() |
| Sync | fsync(), sync() |
| Close | close() |
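As a minimal sketch of how the system I/O calls in the table fit together, consider the function below. The name write_file_durably() and the 0644 permissions are assumptions made for illustration; they do not come from any library. The function loops on write() because a single call may write less than requested, and it only reports success after fsync() and close() have both succeeded.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustrative name): write len bytes from buf to
 * path using system I/O, returning 0 only after fsync() has succeeded.
 */
int write_file_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved; retry */
            close(fd);
            return -1;
        }
        p += n;                 /* write() may be short; advance and loop */
        len -= (size_t)n;
    }

    if (fsync(fd) < 0) {        /* flush this file's dirty data toward stable storage */
        close(fd);
        return -1;
    }
    return close(fd);           /* close() can also report a deferred error */
}
```

A caller would treat any nonzero return as "the data may not be on disk" and react accordingly, for example by retrying or by reporting the failure to the user.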
Stream I/O is I/O initiated using the C library's stream interface.
Writes using these functions may not result in system calls, meaning
that the data still lives in buffers in the application's address
space after making such a function call. The following library
routines (not comprehensive) are part of the stream interface:
| Operation | Function(s) |
| Open | fopen(), fdopen(), freopen() |
| Write | fwrite(), fputc(), fputs(), putc(), putchar(), puts() |
| Sync | fflush(), followed by fsync() or sync() |
| Close | fclose() |
Memory mapped files are similar to the system I/O case above. Files
are still opened and closed using the same interfaces, but access to
the file data is performed by mapping that data into the process'
address space, and then performing memory read and write operations as
you would with any other application buffer.
| Operation | Function(s) |
| Open | open(), creat() |
| Map | mmap() |
| Write | memcpy(), memmove(), read(), or any other routine that writes to application memory |
| Sync | msync() |
| Unmap | munmap() |
| Close | close() |
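The memory mapped sequence can be sketched in the same way. The helper name mmap_store() is hypothetical, and the sketch assumes that the caller wants the file to be exactly len bytes long (len must be nonzero, since mmap() rejects empty mappings). The "write" here is an ordinary memcpy() into the mapping; msync() with MS_SYNC is what waits for the dirty pages to be written back.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustrative name): store len bytes at the start
 * of path through a shared mapping, then msync() the mapping.
 */
int mmap_store(const char *path, const void *src, size_t len)
{
    int ret = -1;
    void *map;
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;

    /* Assumption: the file should be exactly len bytes long. */
    if (ftruncate(fd, (off_t)len) < 0)
        goto out_close;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    memcpy(map, src, len);              /* the write is a plain memory store */

    if (msync(map, len, MS_SYNC) == 0)  /* wait for the pages to be written back */
        ret = 0;
    if (munmap(map, len) < 0)
        ret = -1;

out_close:
    if (close(fd) < 0)
        ret = -1;
    return ret;
}
```

MAP_SHARED matters in this sketch: stores into a MAP_PRIVATE mapping are never carried through to the underlying file, so there would be nothing for msync() to write back.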
There are two flags that can be specified when opening a file to
change its caching behavior: O_SYNC (and related O_DSYNC), and
O_DIRECT.
I/O operations performed against files opened with O_DIRECT bypass the
kernel's page cache, writing directly to the storage. Recall that the
storage may itself store the data in a write-back cache, so
fsync() is still required for files opened with O_DIRECT
in order to
save the data to stable storage. The O_DIRECT flag is only
relevant for the system I/O API.
Raw devices (/dev/raw/rawN) are a special case of
O_DIRECT I/O. These
devices can be opened without specifying O_DIRECT, but still provide
direct I/O semantics. As such, all of the same rules apply to raw
devices that apply to files (or devices) opened with O_DIRECT.
Synchronous I/O is any I/O (system I/O with or without O_DIRECT, or
stream I/O) performed to a file descriptor that was opened using the
O_SYNC or O_DSYNC flags. These are the synchronous
modes, as defined by POSIX:
- O_SYNC: File data and all file metadata are written
synchronously to disk.
- O_DSYNC: Only file data and metadata needed to access the
file data are written synchronously to disk.
- O_RSYNC: Not implemented
The data and associated metadata for write calls to such file
descriptors end up immediately on stable storage. Note the careful wording,
there. Metadata that is not required for retrieving the data of the
file may not be written immediately. That metadata may include the file's
access time, creation time, and/or modification time.
It is also worth pointing out the subtleties of opening a file
descriptor with O_SYNC or O_DSYNC, and then associating
that file
descriptor with a libc file stream. Remember that fwrite()s to the
file pointer are buffered by the C library. It is not until an
fflush() call
is issued that the data is known to be written to disk. In essence,
associating a file stream with a synchronous file descriptor means
that an fsync() call is not needed on the file descriptor after the
fflush(). The fflush() call, however, is still necessary.
When Should You Fsync?
There are some simple rules to follow to determine whether or not an
fsync() call is necessary. First and foremost, you must answer
the question:
is it important that this data is saved now to stable storage? If
it's scratch data, then you probably don't need to fsync(). If it's
data that can be regenerated, it might not be that important to fsync()
it. If, on the other hand, you're saving the result of a transaction,
or updating a user's configuration file, you very likely want to get
it right. In these cases, use fsync().
The more subtle usages deal with newly created files, or overwriting
existing files. A newly created file may require an fsync() of not just
the file itself, but also of the directory in which it was created
(since this is where the file system looks to find your file). This
behavior is actually file system (and mount option) dependent. You
can either code specifically for each file system and mount option
combination, or just perform fsync() calls on the directories to
ensure that your code is portable.
Similarly, if you encounter a system failure (such as power loss,
ENOSPC or an I/O error) while overwriting a file, it can result in the
loss of existing data. To avoid this problem, it is common practice
(and advisable) to write the updated data to a temporary file,
ensure that it is safe on stable storage, then rename the
temporary file to the original file name (thus replacing the
contents). This ensures an atomic update of the file, so that other
readers get one copy of the data or another. The following steps are
required to perform this type of update:
- create a new temp file (on the same file system!)
- write data to the temp file
- fsync() the temp file
- rename the temp file to the appropriate name
- fsync() the containing directory
Checking For Errors
When performing write I/O that is buffered by the library or the
kernel, errors may not be reported at the time of the write() or
the fflush() call, since the data may only be written to the page
cache. Errors from writes are instead often reported during calls to
fsync(),
msync() or close(). Therefore, it is very important to
check the return values of these calls.
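One way to see where an error surfaces is to check every stage separately. The sketch below uses a hypothetical function, save_and_close(), and a hypothetical status enumeration, both invented for this illustration. It takes ownership of the stream and reports the first stage that failed.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical status codes naming the stage that reported the error. */
enum save_status {
    SAVE_OK = 0,
    SAVE_WRITE_FAILED,
    SAVE_FLUSH_FAILED,
    SAVE_FSYNC_FAILED,
    SAVE_CLOSE_FAILED
};

/*
 * Hypothetical helper (illustrative name): write buf to fp, push it to
 * stable storage, and close the stream, checking every return value.
 */
enum save_status save_and_close(FILE *fp, const void *buf, size_t len)
{
    enum save_status status = SAVE_OK;

    if (len > 0 && fwrite(buf, len, 1, fp) != 1)
        status = SAVE_WRITE_FAILED;     /* rare: small writes only fill the libc buffer */
    else if (fflush(fp) != 0)
        status = SAVE_FLUSH_FAILED;     /* the underlying write() failed */
    else if (fsync(fileno(fp)) < 0)
        status = SAVE_FSYNC_FAILED;     /* write-back to the device failed */

    /* Always close; only report a close error if nothing failed earlier. */
    if (fclose(fp) != 0 && status == SAVE_OK)
        status = SAVE_CLOSE_FAILED;
    return status;
}
```

On a Linux system, passing a stream opened on /dev/full is a convenient way to try this out: the small fwrite() appears to succeed because the data only reaches the library buffer, and the failure is first reported by fflush().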
Write-Back Caches
This section provides some general information on disk caches, and the
control of such caches by the operating system. The options discussed
in this section should not affect how a program is constructed at all,
and so this discussion is intended for informational purposes only.
The write-back cache on a storage device can come in many different
flavors. There is the volatile write-back cache, which we've been
assuming throughout this document. Such a cache is lost upon power
failure. However, most storage devices can be configured to run in
either a cache-less mode, or in a write-through caching mode. Each of
these modes will not return success for a write request until the
request is on stable storage. External storage arrays often have a
non-volatile, or battery-backed write-cache. This configuration also
will persist data in the event of power loss. From an application
programmer's point of view, there is no visibility into these
parameters, however. It is best to assume a volatile cache, and
program defensively. In cases where the data is saved, the
operating system will perform whatever optimizations it can to
maintain the highest performance possible.
Some file systems provide mount options to control cache flushing
behavior. For ext3, ext4, xfs and btrfs as of kernel version 2.6.35, the
mount option is "-o barrier" to turn barriers (write-back cache flushes)
on (the default), or "-o nobarrier" to turn barriers off. Previous
versions of the kernel may require different options
("-o barrier=0,1"),
depending on the file system. Again, the application writer should not
need to take these
options into account. When barriers are disabled for a file system,
it means that fsync calls will not result in the flushing of disk
caches. It is expected that the administrator knows that the cache
flushes are not required before she specifies this mount option.
Appendix: some examples
This section provides example code for common tasks that
application programmers often need to perform.
- Synchronizing I/O to a file stream
- Synchronizing I/O using file
descriptors (system I/O)
This is actually a subset of the first example and is independent of the
O_DIRECT open flag (so will work whether or not that flag was
specified).
- Replacing an existing file
(overwrite).
- sync-samples.h (needed by the above
examples).
Page editor: Jonathan Corbet