Brief items
The current 2.6 prepatch is 2.6.24-rc4, released by Linus on
December 3. He says that the size of the patch is "a bit
disheartening," and, in fact, there are quite a few changes which have been
merged. They are almost all fixes, but there is also the addition of a CPU
accounting controller for monitoring the CPU usage of groups of processes. See
the short-form changelog for the details, or the full changelog for lots of
details.
As of this writing, just under 100 changesets have gone into the mainline
repository since the -rc4 release.
The current -mm tree is 2.6.24-rc4-mm1. Recent changes
to -mm include the latest timerfd
API, a new memory controller patch, and a reimplemented ramdisk driver.
Kernel development news
A person will stand on the top of a hill for a very long time with
their mouth open before a roast duck will fly in.
--
James Morris
For the purposes of figuring out what is needed you can consider a
random simple user case such as a system which protects you against
the works of Eric S Raymond. Replace the mathematical analysis and
heuristics with a user space tool which spots the various ESR
papers and design it for that if it makes you happier.
SELinux seems to be able to do most of the lifting around the
problem as it can relabel a file into eric_t and constrain further
access to it.
--
Alan Cox
By Jonathan Corbet
December 3, 2007
Sparse files have an apparent size which is larger than the amount of
storage actually allocated to them. The usual way to create such
a file is to seek past its end and write some new data; Unix-derived
systems will traditionally not allocate disk blocks for the portion of the
file past the previous end which was skipped over. The result is a "hole,"
a piece of the file which logically exists, but which is not represented on
disk. A read operation on a hole succeeds, with the returned data being
all zeroes. Relatively smart file archival and backup utilities will
recognize holes in files; these holes are not stored in the resulting
archive and will not be filled if the file is restored from that archive.
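The seek-and-write recipe described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name make_sparse_file(), the temporary path, and the choice of one byte on each side of the hole are all made up for the example, and whether the skipped range really ends up unallocated depends on the filesystem in use.

```c
#define _XOPEN_SOURCE 700       /* for mkstemp(), lseek(), fstat() */
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a sparse file: write one byte, seek hole_size bytes past the
 * end, then write one more byte. On success, returns 0 and fills in *st
 * with the file's metadata; returns -1 on any error.
 * (Hypothetical helper; the path template is arbitrary.)
 */
static int make_sparse_file(off_t hole_size, struct stat *st)
{
    char path[] = "/tmp/sparse-demo-XXXXXX";
    int fd, ret = -1;

    assert(st != NULL && hole_size >= 0);

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the file goes away when fd is closed */

    if (write(fd, "a", 1) == 1 &&
        lseek(fd, hole_size, SEEK_CUR) != (off_t)-1 &&  /* skip: no blocks written */
        write(fd, "b", 1) == 1 &&
        fstat(fd, st) == 0)
        ret = 0;

    close(fd);
    return ret;
}
```

After a successful call, st_size reports the apparent size (hole_size plus two bytes), while st_blocks reports what was actually allocated; on a filesystem which supports holes, the second number will normally be far smaller than the first.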
The process of recognizing holes is relatively primitive, though: about the
only way to do it in a portable way is to simply look for blocks filled
with zeroes. This technique works, but it requires making a pass over the
data to obtain information which the lower levels of the system already
know. It seems like there should be a better way.
About two years ago, the Solaris ZFS developers proposed
an extension to lseek() which would allow an application to
find the holes in sparse files more efficiently. This extension
works by adding two new "whence" options:
- SEEK_HOLE positions the file descriptor to the beginning of
the first hole which occurs after the given offset. For the purposes
of this operation, "hole" is defined as a region of all zeros of any
length, but the system is not required to actually detect all holes.
So, in practice, small ranges of zeroes will be skipped over, as will,
in all likelihood, large (multi-block) ranges which have actually been
written to disk.
- SEEK_DATA moves to the start of the next region (after the given
offset) which is not a hole.
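As a rough sketch of how an archiver might use the two options together, the loop below alternates between them to enumerate the data regions of an open file. It assumes a system whose lseek() implements both options and whose C library headers define SEEK_DATA and SEEK_HOLE (with glibc that means defining _GNU_SOURCE); the function name list_data_regions() is invented for the example. It also assumes the Solaris convention that the end of the file counts as a hole and that seeking for data beyond the last data region fails with ENXIO.

```c
#define _GNU_SOURCE             /* makes glibc expose SEEK_DATA and SEEK_HOLE */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Print every data region of fd as a half-open byte range and return
 * the total number of bytes found in data regions, or -1 on error.
 * (Hypothetical helper, not part of any library.)
 */
static off_t list_data_regions(int fd)
{
    off_t total = 0, data, hole = 0;

    assert(fd >= 0);
    for (;;) {
        /* Find the next data at or after the end of the previous region. */
        data = lseek(fd, hole, SEEK_DATA);
        if (data == (off_t)-1)
            return errno == ENXIO ? total : (off_t)-1;  /* ENXIO: no more data */

        /* Find where that data region ends (EOF counts as a hole). */
        hole = lseek(fd, data, SEEK_HOLE);
        if (hole == (off_t)-1)
            return (off_t)-1;

        printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
        total += hole - data;
    }
}
```

Since the system is not required to report every hole, the returned total is an upper bound on the interesting data rather than an exact count; a filesystem with no hole tracking at all may simply report the whole file as one data region.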
This functionality has been part of Solaris for a while; the Solaris
developers would like to see it spread elsewhere and become something more
than a Solaris-only extension. To that end, Josef Bacik has recently
posted an implementation of
this extension for Linux. Internally, it adds a new member to the
file_operations structure (seek_hole_data()) intended to
allow filesystems to efficiently implement the new operations.
One might argue that anybody who wants to separate holes and data in a file
can already do so with the FIBMAP ioctl() command. While
that is true, FIBMAP is an inefficient way of getting
this sort of information, especially on filesystems which support extents.
A FIBMAP call returns the mapping information for exactly one
block; mapping out a large file may require millions of calls when, once
again, the filesystem should already know how to provide that information
in a much more straightforward manner.
Even so, this patch looks relatively unlikely to make it into the
mainline. The API is unpopular, being seen as ugly and as a change in the
semantics of the lseek() call. But, more to the point, it may be
interesting to learn much more about the representation of a file than just
where the holes are. And, as it turns out, there is already a proposed
ioctl() command which can provide all of that information. That
interface is the FIEMAP
ioctl() specified by Andreas Dilger back in October.
A FIEMAP call takes the following structure as an argument:
struct fiemap {
__u64 fm_start; /* logical starting byte offset (in/out) */
__u64 fm_length; /* logical length of map (in/out) */
__u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */
__u32 fm_extent_count; /* number of extents in fm_extents (in/out) */
__u64 fm_end_offset; /* end of mapping in last ioctl */
struct fiemap_extent fm_extents[0];
};
An application wanting to learn something about how a file is stored will
put the starting offset into fm_start and the length
of the region of interest in fm_length. If fm_flags
contains FIEMAP_FLAG_NUM_EXTENTS, the system call will simply set
fm_extent_count to the number of extents used to store the
specified range of bytes and return. In this form, FIEMAP can be
used to determine how fragmented the file is on disk.
If the application is looking for more information than that, it will
allocate enough space for one or more fiemap_extent structures (the fm_extents array):
struct fiemap_extent {
__u64 fe_offset;/* offset in bytes for the start of the extent */
__u64 fe_length;/* length in bytes for the extent */
__u32 fe_flags; /* returned FIEMAP_EXTENT_* flags for the extent */
__u32 fe_lun; /* logical device number for extent(starting at 0)*/
};
In this case, fm_extent_count should be set to the number of these
structures before making the FIEMAP call. On return, these
structures (as many as are indicated by the returned value of
fm_extent_count) will be filled in with information on the actual
file extents; fe_offset says where (on disk) the extent starts,
and fe_length is the size of the extent. There are quite a few
values which can appear in the fe_flags field:
- FIEMAP_EXTENT_HOLE says that there is no data for this
range of the file - it's a hole.
- FIEMAP_EXTENT_UNWRITTEN says that the space has been
allocated on disk, but that nothing has been written to that space.
Space which has been preallocated with fallocate() would be
marked this way.
- FIEMAP_EXTENT_UNMAPPED, instead, marks an extent where some
application has written data, but for which no disk blocks have been
allocated.
- FIEMAP_EXTENT_DELALLOC indicates that delayed allocation is
being done; this flag implies FIEMAP_EXTENT_UNMAPPED as well.
- FIEMAP_EXTENT_SECONDARY is an indication that the data for
this segment is in some sort of secondary storage; one would see this
flag on filesystems managed by some sort of hierarchical storage
manager. This flag, too, is likely to imply
FIEMAP_EXTENT_UNMAPPED.
- FIEMAP_EXTENT_NO_DIRECT says that the data cannot be accessed
directly - it requires processing (decompression or decryption, for
example) first.
- FIEMAP_EXTENT_LAST marks the final extent in a file.
- FIEMAP_EXTENT_EOF indicates that the requested range goes
beyond the end of the file.
- FIEMAP_EXTENT_ERROR marks an extent which has experienced some
sort of error; the fe_offset field will contain an error
number in this case.
- FIEMAP_EXTENT_UNKNOWN says that the data exists, but its
location is unknown. This flag would describe much of your editor's
personal file space, though it is unclear how the kernel would know
that.
As can be seen, there is a wealth of information available from this new
call, including details on how the file has been split up on disk,
allocation strategies, and even the decisions made by a hierarchical
storage engine. An implementation exists for the ext4 filesystem. None of
this code has been pushed toward the mainline yet, but it would be
surprising if that did not happen sometime in the relatively near future.
Once that is done, the C library will be able to implement
SEEK_HOLE and SEEK_DATA in user space, should that be
desirable.
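As a rough illustration of the calling convention described above, the following user-space sketch allocates and fills in a request buffer with room for a given number of extents. The two structures are local copies of the proposed definitions quoted in this article (with fixed-width C types standing in for the kernel's __u64 and __u32); they are not taken from any installed header, and the interface could well change before it is merged. The helper names fiemap_request_alloc() and fiemap_extent_at() are invented for the example, and the ioctl() call itself is left out because the command number is not given here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local copies of the proposed structures, for illustration only. */
struct fiemap_extent {
    uint64_t fe_offset;         /* start of the extent, in bytes */
    uint64_t fe_length;         /* length of the extent, in bytes */
    uint32_t fe_flags;          /* returned FIEMAP_EXTENT_* flags */
    uint32_t fe_lun;            /* logical device number */
};

struct fiemap {
    uint64_t fm_start;          /* logical starting byte offset */
    uint64_t fm_length;         /* logical length of the range */
    uint32_t fm_flags;          /* FIEMAP_FLAG_* request flags */
    uint32_t fm_extent_count;   /* number of entries in fm_extents */
    uint64_t fm_end_offset;     /* end of mapping in last ioctl */
    struct fiemap_extent fm_extents[];  /* C99 flexible array member */
};

/*
 * Allocate a zeroed request covering [start, start + length) with room
 * for max_extents extent records. Returns NULL on failure; the caller
 * releases the buffer with free().
 */
static struct fiemap *fiemap_request_alloc(uint64_t start, uint64_t length,
                                           uint32_t max_extents)
{
    struct fiemap *fm;

    /* Guard against overflow in the size computation. */
    if (max_extents > (SIZE_MAX - sizeof(*fm)) / sizeof(struct fiemap_extent))
        return NULL;

    fm = calloc(1, sizeof(*fm) + max_extents * sizeof(struct fiemap_extent));
    if (fm == NULL)
        return NULL;

    fm->fm_start = start;
    fm->fm_length = length;
    fm->fm_extent_count = max_extents;  /* tells the kernel how many fit */
    return fm;
}

/* Bounds-checked access to the i-th extent record. */
static struct fiemap_extent *fiemap_extent_at(struct fiemap *fm, uint32_t i)
{
    assert(fm != NULL && i < fm->fm_extent_count);
    return &fm->fm_extents[i];
}
```

An application would pass the resulting buffer to the FIEMAP ioctl() and then walk the first fm_extent_count records, as updated by the call, checking fe_flags on each.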
December 4, 2007
This article was contributed by Daniel Drake
When developing kernel code, it is usually important to consider
constraints and requirements of architectures other than your
own. Otherwise, your code may not be portable to other architectures, as I
recently discovered when an unaligned memory access bug was reported
in a driver which I develop. Not having much familiarity with the concepts
of unaligned memory access, I set out to research the topic and complete my
understanding of the issues.
Certain architectures require that memory
accesses meet certain alignment criteria; accesses which do not are
illegal. The exact criteria that determine whether an access is suitably
aligned depends upon the address being accessed and the number of bytes
involved in the transaction, and varies from architecture to architecture.
Kernel code is typically written to obey natural alignment
constraints, a scheme that is sufficiently strict to ensure portability to
all supported architectures. Natural alignment requires that every N byte
access must be aligned on a memory address boundary of N. We can express
this in terms of the modulus operator: addr % N must be
zero. Some examples:
- Accessing 4 bytes of memory from address 0x10004 is aligned
(0x10004 % 4 = 0).
- Accessing 4 bytes of memory from address 0x10005 is unaligned
(0x10005 % 4 = 1).
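The natural alignment rule is simple enough to write down directly. The helper below is a minimal sketch of the addr % N test; its name is made up for this example and it is not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if an access of n bytes starting at addr is naturally
 * aligned, i.e. if addr is a multiple of n.
 * (Hypothetical helper for illustration.)
 */
static bool is_naturally_aligned(uintptr_t addr, size_t n)
{
    assert(n != 0);             /* a zero-byte access makes no sense here */
    return addr % n == 0;
}
```

With this definition, is_naturally_aligned(0x10004, 4) is true and is_naturally_aligned(0x10005, 4) is false, matching the two examples above; a single-byte access is aligned at any address.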
The phrase "memory access" is quite vague; the context here is
assembly-level instructions which read or write a number of bytes to or
from memory (e.g. movb, movw, movl in x86 assembly). It is relatively easy
to relate these to C statements; for example, the instructions that are
generated when the following code is
compiled would likely include a single instruction that accesses two bytes
(16 bits) of data from memory:
void example_func(unsigned char *data) {
u16 value = *((u16 *) data);
[...]
}
The effects of unaligned access vary from architecture to
architecture. On architectures such as ARM32 and Alpha, a processor
exception is raised when an unaligned access occurs, and the kernel is able
to catch the exception and correct the memory access (at large cost to
performance). Other architectures raise processor exceptions but the
exceptions do not provide enough information for the access to be
corrected. Some architectures that are not capable of unaligned access do
not even raise an exception when unaligned access happens; instead, they
just perform a different memory access from the one that was requested and
silently return the wrong answer.
Some architectures are capable of performing unaligned accesses without
having to raise bus errors or processor exceptions, i386 and x86_64 being
some common examples. Even so, unaligned accesses can degrade performance
on these systems, as Andi Kleen explains:
On Opteron the typical cost of a
misaligned access is a single cycle and some possible penalty to load-store
forwarding. On Intel it is a bit worse, but not all that much. Unless you
do a lot of accesses of it in a loop it's not really worth something caring
about too much.
At the end of the day, if you write code that causes unaligned accesses
then your software will not work on some systems. This applies to both
kernel-space and userspace code.
The theory is relatively easy to get to grips with, but how does this apply
to real code? After all, when you allocate a variable on the stack, you
have no control over its address. You don't get to control the addresses
used to pass function parameters, or the addresses returned by the memory
allocation functions. Fortunately, the compiler understands the alignment
constraints of your architecture and will handle the common cases just
fine; it will align your variables and parameters to suitable boundaries,
and it will even insert padding inside structures to ensure the access to
members is suitably aligned. Even when using the GCC-specific packed
attribute (which tells GCC not to insert padding), GCC will
transparently insert extra instructions to ensure that standard accesses to
potentially unaligned structure members do not violate alignment
constraints (at a cost to performance).
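The compiler's padding can be observed from C with offsetof(). The two structures below are invented for this illustration; they have identical members, but the second carries the GCC-specific packed attribute. The exact offset chosen for the padded version depends on the architecture, so the comments only state what is typical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The compiler inserts padding after 'tag' so that 'value' is aligned. */
struct padded {
    uint8_t  tag;
    uint32_t value;             /* typically at offset 4 */
};

/* With the packed attribute, no padding is inserted. */
struct squeezed {
    uint8_t  tag;
    uint32_t value;             /* at offset 1: potentially unaligned */
} __attribute__((packed));

/*
 * An ordinary member access through the packed type is still safe: GCC
 * knows that 'value' may be unaligned and generates code accordingly.
 */
static uint32_t squeezed_value(const struct squeezed *s)
{
    assert(s != NULL);
    return s->value;
}
```

Here offsetof(struct padded, value) is a multiple of the alignment of uint32_t, while offsetof(struct squeezed, value) is 1 and sizeof(struct squeezed) is 5. The safety only holds for accesses made through the packed structure type itself; taking the address of the member and dereferencing it as a plain uint32_t pointer brings the problem back.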
In order to illustrate a situation that might cause unaligned memory
access, consider the example_func() implementation from
above. The first line of the function accesses two bytes (16 bits) of data
from a memory address passed in as a function parameter; however, we do not
have any other information about this address. If the data
parameter points to an odd address (as opposed to even), for example
0x10005, then we end up with an unaligned access. The main
places where you will potentially run into unaligned accesses are when
accessing multiple bytes of data (in a single transaction) from a pointer,
and when casting variables to types of increased lengths.
Conceptually, the way to avoid unaligned access is to use byte-wise memory
access because accessing single bytes of memory cannot violate alignment
constraints. For example, for a little-endian system we could replace the
example_func() implementation with the following:
void fixed_example_func(unsigned char *data) {
u16 value = data[0] | data[1] << 8;
[...]
}
memcpy() is another possible alternative in the general case,
as long as either the source or destination is a pointer to an 8-bit data
type (i.e. char). Inside the kernel, two macros are provided
which simplify unaligned accesses: get_unaligned() and
put_unaligned(). It is worth noting that using any of these
solutions is significantly slower than accessing aligned memory, so it is
wise to completely avoid unaligned access where possible.
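For code which cannot use the kernel macros, the two portable techniques mentioned above might look like the following user-space sketch. The function names are invented for this example; neither is get_unaligned() itself, though they serve a similar purpose.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a 16-bit value in host byte order from a possibly unaligned
 * address. memcpy() works on bytes, so no alignment rule is violated.
 */
static uint16_t load_u16_memcpy(const unsigned char *p)
{
    uint16_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}

/*
 * Read a 16-bit little-endian value one byte at a time. The result does
 * not depend on the byte order of the machine running the code.
 */
static uint16_t load_le16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

Given the bytes 0x34 0x12 at an odd address, load_le16() returns 0x1234 everywhere, while load_u16_memcpy() returns 0x1234 on a little-endian machine and 0x3412 on a big-endian one.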
Another option is to simply document the fact that
example_func() requires a 16-bit-aligned data parameter, and
rely on the call sites to ensure this or simply not use the
function. Linux's optimized routine for comparing two ethernet addresses
(compare_ether_addr()) is a real life example of this: the
addresses must be 16-bit-aligned.
I have applied my newfound knowledge to the task of writing some kernel
documentation, which covers this topic in more detail. If you want to learn
more, you may want to read the most recent
revision (as of this writing) of the document. Additionally, the initial
revision of the document generated a lot of interesting discussion, but
be aware that the initial attempt contained some mistakes. Finally, chapter
11 of Linux Device Drivers
touches upon this topic.
I'd like to thank everyone who helped me improve my understanding of
unaligned access, as this article would not have been possible without
their assistance.
By Jonathan Corbet
December 4, 2007
The network channels concept was
first aired by Van Jacobson
almost two years ago at linux.conf.au 2006. This idea promises
much-improved networking performance by pushing processing of network data
as close to the end point as possible - perhaps even into user
space. By getting the kernel out of the packet processing business and by
keeping that processing in a single place (on the same CPU), channel
schemes hope to minimize cache misses, context switches, and other
performance-degrading activities. Channels have had a rough encounter with
the real world, though; when one starts to consider needs like packet
filtering, address translation, and so on, it gets hard to maintain the
simplicity upon which the performance of channels relies. So, two years
later, there is no channels implementation which is even close to merging
into the mainline.
That does not mean that no work is happening in this area, though. Evgeniy
Polyakov, perhaps the most discouragement-resistant hacker out there,
continues to develop his channel patches; the 22nd release came out on
December 4.
This version of the patch has a well-defined internal structure to allow
kernel code to hook into channels. The best-developed mode, however, is
the one which simply transfers packets to and from user space. To that
end, there is a new system call:
int netchannel_control(struct unetchannel_control *ctl);
The full contents of the unetchannel_control structure can be seen
in the patch. The more important fields are:
- cmd, describing the action that the calling process wishes
to execute. Unlike previous versions of the patch, the current code
only supports one action: NETCHANNEL_CREATE, which makes a
new channel.
- type, the type of the channel to create. At the moment, the
only implemented type is NETCHANNEL_COPY_USER, which copies
packets to and from user space.
- unc.data which describes the channel to be created: it
contains source and destination addresses and ports and a protocol
number.
Once a network channel is created, it is added to a search tree which is
oriented toward blindingly-fast lookups. There is a new hook in the packet
receive code which looks up each incoming packet in that tree; packets
which do not turn up a hit there are processed normally by the
kernel's networking stack. Any packet whose addresses, ports, and protocol
are matched by an entry in the tree, however, is shunted over to the
channel code before even being queued by the network stack.
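The receive-side decision can be modeled in user space with a small sketch. Everything here is an assumption made for illustration: the structure and function names are invented, and a sorted array searched with bsearch() stands in for the patch's search tree. The key holds the same items the article lists for a channel: source and destination addresses and ports, plus a protocol number.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* What identifies a channel (hypothetical layout). */
struct chan_key {
    uint32_t saddr, daddr;      /* source and destination addresses */
    uint16_t sport, dport;      /* source and destination ports */
    uint8_t  proto;             /* protocol number */
};

/* Compare one field and return from the caller if the keys differ there. */
#define CMP_FIELD(a, b, f) \
    do { if ((a)->f != (b)->f) return (a)->f < (b)->f ? -1 : 1; } while (0)

/* Total order over keys, comparing field by field to ignore padding. */
static int chan_key_cmp(const void *pa, const void *pb)
{
    const struct chan_key *a = pa, *b = pb;

    CMP_FIELD(a, b, saddr);
    CMP_FIELD(a, b, daddr);
    CMP_FIELD(a, b, sport);
    CMP_FIELD(a, b, dport);
    CMP_FIELD(a, b, proto);
    return 0;
}

/* Sort the table once so that lookups can use binary search. */
static void chan_table_sort(struct chan_key *table, size_t n)
{
    assert(table != NULL || n == 0);
    qsort(table, n, sizeof(*table), chan_key_cmp);
}

/*
 * Look up an incoming packet's key. A non-NULL result means "hand the
 * packet to the channel"; NULL means "process it normally".
 */
static const struct chan_key *chan_lookup(const struct chan_key *table,
                                          size_t n,
                                          const struct chan_key *pkt)
{
    assert((table != NULL || n == 0) && pkt != NULL);
    return bsearch(pkt, table, n, sizeof(*table), chan_key_cmp);
}
```

A packet which differs from every channel in even one field, the protocol number for example, misses in the lookup and takes the normal path through the networking stack.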
The final piece (on the receive side) is a simple read()
implementation. A process wishing to receive a packet from a network
channel need only read the associated file descriptor and the next
available packet will be copied into the supplied buffer. It would, of
course, be nice to do away with that copy operation, but that is a hard
trick to carry out: the packet must be received before its destination is
known. There are network adapters which can direct packets based on their
header information, but the current netfilter does not have the driver
API enhancements which would be required to use that capability for
zero-copy packet reception.
Similarly, a write() operation causes the associated packet to be
copied into the kernel and fed into the networking stack at a fairly low
level. There is currently no zero-copy write support.
Evgeniy clearly has zero-copy operations in mind, though, probably using
his network allocator patch.
Even without that feature, though, the channel code, when used with his user-space
network stack, appears to be quite fast. Some posted benchmark
results claim significant improvements over the core Linux networking
stack - three times the maximum bandwidth with one-third of the CPU usage
when small packets are being transferred. For larger (4096-byte) packets
the performance improvements essentially disappear - most likely the cost
of copying the packets into and out of the kernel is the dominating factor
there.
Improvements in small-packet performance are welcome: there are a number of
applications, including high-end financial trading, which require large
numbers of small transfers. The addition of zero-copy logic has the
potential to make the large-packet performance better as well. The real
test, though, will be the addition of all of the other features expected by
contemporary networking users, most of which are currently absent from the
channels implementation. There are hooks in the code aimed at the
insertion of per-packet processing; they could be used for filtering,
address translation, traffic control, or any of the other things that one
might want to have. Whether those hooks can be used without killing the
performance advantages of channels remains to be seen, though. But one
suspects that Evgeniy will not give up until he has an answer to that
question.
Patches and updates
Kernel trees
Build system
Core kernel code
Development tools
Device drivers
Documentation
Filesystems and block I/O
- Chris Mason <chris.mason@oracle.com>: Btrfs v0.9 (December 5, 2007)
Memory management
Networking
Architecture-specific
Virtualization and containers
Page editor: Jonathan Corbet