Brief items
The current 2.6 development kernel is 2.6.30-rc1,
released by Linus on
April 8. "
So the two week merge window has closed, and just as
well - because we had a lot of changes. As usual. Certainly I had no urges
to keep the window open to get those last remaining few megabytes of
patches." Significant changes in 2.6.30 will include the
integrity management
architecture, the
TOMOYO
Linux security module, the
preadv() and
pwritev()
system calls,
object storage
device support, the FS-Cache local filesystem caching layer, several
new tracing features, the
Nilfs filesystem,
a number of other filesystem changes, and a huge number of new drivers.
See
the
long-format changelog for all the details.
The current stable 2.6 kernel is 2.6.29.1, released
on April 2. "There's many bugfixes all over the tree, but this should
specifically fix the networking issues people had w/ 2.6.29. As usual,
you're encouraged to upgrade."
Kernel development news
The point is, that expectation that the BIOS returns 20 seems very
unreasonable. BIOS writers tend to have been on pain medication for
so long that they can hardly remember their own name, much less
actually make sure they follow all the documentation.
-- Linus Torvalds
But that's not the point. The point is that you barely had time to
compile that thing, much less give it any testing. The whole "it
compiles, it's perfect, ship it" mentality is _strictly_ only for
me.
-- Linus Torvalds
The problem is, this is what the application programmers are
telling the filesystem developers. They refuse to change their
programs; and the features they want are sometimes mutually
contradictory, or at least result in a overconstrained problem ---
and then they throw the whole mess at the filesystem developers'
feet and say, "you fix it!"
-- Ted Ts'o
Which application developers did you speak to? Because, frankly,
the majority of the ones I know felt that ext3 embodied the pony
that they'd always dreamed of as a five year old. Stephen gave them
that pony almost a decade ago and now you're trying to take it to
the glue factory. I remember almost crying at that bit on Animal
Farm, so I'm really not surprised that you're getting pushback
here.
-- Matthew Garrett
Thou shalt remember to use 'git add' or errors shall be visited on
your downloads and there shall be wrath from on list and much
gnashing of teeth.
Thou shalt remember to use git status or there shall be catcalls and much
embarrasment shall come to pass.
-- Alan Cox
By Jonathan Corbet
April 8, 2009
There have been some 3400 non-merge changesets incorporated into the
mainline since
last week's
update, for a total of some 9600 changes merged for 2.6.30 overall. At
this point, the 2.6.30 merge window is complete.
User-visible changes merged since last week include:
- The preadv() and pwritev() system calls have been
added. They have been long in coming; LWN first covered these system
calls in 2005. The expected user-space interface will be:
ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);
Due to the portability
challenges involved, though, the actual kernel interface (seen
only by the C library) is somewhat different.
- The loop block driver supports a new ioctl()
(LOOP_SET_CAPACITY) which can be used to change the size of
the device on the fly.
- The eventfd() system call takes a new flag
(EFD_SEMAPHORE) which causes it to implement simple
counting-semaphore behavior. See the
changelog entry for a description of how this works.
- The ext4 filesystem is now more careful about forcing data out to disk in
situations where small files have been truncated or renamed. This
behavior increases robustness in the face of crashes, but it can also
have a performance cost. There is a new mount option
(auto_da_alloc) which controls this behavior; mounting with
noauto_da_alloc disables it.
Also new for ext4 is a set of control knobs found under
/sys/fs/ext4.
- The ext3 filesystem, too, is more careful to flush data to disk when
running in the data=writeback mode.
- The default mode for ext3 has been changed from data=ordered
to data=writeback. The latter performs quite a bit better in
2.6.30, but also carries an information disclosure risk if the system
crashes. Distributors can change the default mode when they configure
their kernels; some may well choose to retain the older
data=ordered default.
- The btrfs filesystem has also been changed to be careful about
flushing data to disk after truncate or rename operations.
- The Nilfs log-structured
filesystem has been merged.
- The MD RAID layer now has support for block-layer integrity
checking. MD can also change chunk_size and layout in a reshape
operation - a capability which makes it possible to turn a RAID5 array
into RAID6 while it is running.
- The exofs (formerly osdfs) filesystem, providing support for object storage
devices, has been merged.
- FS-Cache (formerly cachefs) has been merged. This subsystem (first covered here in 2004)
provides a local caching layer for network filesystems; it has finally
overcome the concerns
expressed by some developers and made it into the mainline.
- The distributed storage
subsystem and pohmelfs network filesystem have been merged.
Interestingly, this code went in via the -staging tree.
- The ATA subsystem has gained support for the TRIM command.
- There are two new tuning knobs under /proc/sys/vm
(nr_pdflush_threads_min and nr_pdflush_threads_max);
they place limits on the number of running pdflush
threads in the system.
- Multiple message queue namespaces are now supported.
- The PA-RISC architecture has gained support for ftrace and
latencytop.
- The ARM architecture now has high memory support, for all of you out
there with 2GB ARM-based systems.
- The Xtensa architecture now supports systems without a memory
management unit.
- New device drivers:
- Block: Marvell MMC/SD/SDIO host drivers.
- Graphics: Samsung S3C framebuffers.
- Miscellaneous: National Semiconductor LM95241 sensor chips,
Linear Technology LTC4215 Hot Swap controller
I2C monitoring interfaces,
PPC4xx IBM DDR2 memory controllers,
AMD8111 HyperTransport I/O hubs,
AMD8131 HyperTransport PCI-X Tunnel chips,
TI TWL4030/TWL5030/TPS695x0 PMIC voltage
regulators,
DragonRise game controllers,
National Semiconductor DAC124S085 SPI DAC
devices,
Rohm BD2802 RGB LED controllers,
TXx9 SoC NAND flash memory controllers, and
ASUS ATK0110 ACPI hardware monitoring
interfaces.
- Networking: Neterion X3100 Series 10GbE PCIe server adapters.
- Processors and systems: Tensilica S6000 processors and
S6105 IP camera reference design kits, and
Merisc AVR32-based boards.
- Sound: HTC Magician audio devices.
- Video: i.MX1/i.MXL CMOS sensor interfaces,
Conexant cx231xx USB video capture devices, and
Legend Silicon LGS8913/LGS8GL5/LGS8GXX DMB-TH
demodulators.
- Staging drivers (those not considered ready for regular
mainline inclusion): stlc4550 and stlc4560 wireless chipsets,
Brontes PCI frame grabbers,
ATEN 2011 USB to serial adapters,
Phison PS5000 IDE adapters,
Plan 9 style capability pseudo-devices,
Intel Management Engine Interfaces,
Line6 PODxt Pro audio devices,
USB Quatech ESU-100 8 port serial devices,
Ralink RT3070 wireless network adapters,
and a vast array of COMEDI data acquisition drivers.
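As a rough illustration of the preadv() and pwritev() interface listed above, here is a minimal user-space sketch. It assumes a 2.6.30 or later kernel and a C library which exports the wrappers with the prototypes shown earlier; the function name vectored_roundtrip() and the scratch file under /tmp are invented for this example. It writes two buffers at a given offset with one call, reads them back into two separate buffers with another, and checks that the file position never moved.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes preadv()/pwritev() in <sys/uio.h> */
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: scatter/gather round trip at a fixed file offset.
 * Returns 0 on success, -1 on any failure.
 */
int vectored_roundtrip(off_t offset)
{
    char path[] = "/tmp/preadv-demo-XXXXXX";   /* scratch file, assumption */
    char hdr_out[] = "header", body_out[] = "body";
    char hdr_in[6], body_in[4];
    struct iovec out[2] = {
        { .iov_base = hdr_out,  .iov_len = 6 },
        { .iov_base = body_out, .iov_len = 4 },
    };
    struct iovec in[2] = {
        { .iov_base = hdr_in,  .iov_len = sizeof(hdr_in) },
        { .iov_base = body_in, .iov_len = sizeof(body_in) },
    };
    int ok = 0;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                               /* file vanishes on close */

    /* One call writes both buffers at `offset`; one call reads them back.
     * Unlike readv()/writev(), neither call moves the file position. */
    if (pwritev(fd, out, 2, offset) == 10 &&
        preadv(fd, in, 2, offset) == 10 &&
        lseek(fd, 0, SEEK_CUR) == 0 &&
        memcmp(hdr_in, "header", 6) == 0 &&
        memcmp(body_in, "body", 4) == 0)
        ok = 1;

    close(fd);
    return ok ? 0 : -1;
}
```

The explicit offset argument is what distinguishes these calls from readv() and writev(): several threads can share one file descriptor without fighting over its position.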
Changes visible to kernel developers include:
- There is a new memory debug tool controlled by the PAGE_POISONING
configuration variable. Turning this feature on causes a pattern to
be written to all freed pages and checked at allocation time. The
result is "a large slowdown," but also the potential to catch a number
of use-after-free errors.
- The new function:
int pci_enable_msi_block(struct pci_dev *dev, int count);
allows a driver to enable a block of MSI interrupts.
- As part of the FS-Cache work, the "slow work" thread pool mechanism
has been merged. Some have expressed the hope that it would become
the One True Kernel Thread Pool, but there seems to be little progress
in that direction. See Documentation/slow-work.txt for more
information.
- There is a pair of new printing functions:
int vbin_printf(u32 *bin_buf, size_t size, const char *fmt, va_list args);
int bstr_printf(char *buf, size_t size, const char *fmt,
const u32 *bin_buf);
The difference here is that vbin_printf() places the binary
value of its arguments into bin_buf. The process can be
reversed with bstr_printf(), which formats a string from the
given binary buffer. The main use for these functions would appear to
be with Ftrace; they allow the encoding of values to be deferred until
a given trace string is read by user space.
- Also added is printk_once(), which only prints its message
the first time it is executed.
- The "kmemtrace" tracing facility has been merged. Kmemtrace provides
data on how the core slab allocations function. See Documentation/vm/kmemtrace.txt for
details.
- A number of ftrace changes have been merged. There is a workqueue
tracer which tracks the operations of workqueue threads. The blktrace
block subsystem tracer can now be used via ftrace. The new "event"
tracer allows a user to turn on specific tracepoints within the
kernel; tracepoints have been added for various scheduler and
interrupt events. "Raw" events (with binary-formatted data) are
available now. The new "syscall" tracer is for tracing system calls.
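The page poisoning idea from the list above can be modeled in a few lines of user-space C. This is only a sketch of the technique, not the kernel's code: the page size, the 0xaa pattern byte, and all of the demo_ names are assumptions made for the illustration. A "freed" page is filled with the pattern, and the check at "allocation" time reports the first byte that somebody scribbled on in the meantime.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096
#define DEMO_POISON    0xaa    /* assumed pattern byte for this sketch */

/* Called when a page is freed: overwrite it with the poison pattern. */
void demo_poison_page(unsigned char *page)
{
    memset(page, DEMO_POISON, DEMO_PAGE_SIZE);
}

/* Called when the page is handed out again: return the offset of the
 * first byte that no longer holds the pattern, or -1 if it is intact. */
long demo_check_poison(const unsigned char *page)
{
    for (size_t i = 0; i < DEMO_PAGE_SIZE; i++)
        if (page[i] != DEMO_POISON)
            return (long)i;
    return -1;
}

/* Simulate a use-after-free write at `offset` and report what the
 * allocation-time check finds.  A negative offset means "no stray write". */
long demo_use_after_free(long offset)
{
    static unsigned char page[DEMO_PAGE_SIZE];

    demo_poison_page(page);            /* page is freed */
    if (offset >= 0 && offset < DEMO_PAGE_SIZE)
        page[offset] = 0;              /* stale pointer writes to it */
    return demo_check_poison(page);    /* page is allocated again */
}
```

The full scan of every page on both free and allocation is where the "large slowdown" comes from.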
The merge window is now closed, and the stabilization process can begin.
Past experience suggests that something close to 3000 more changes will
find their way into the mainline before the 2.6.30 release, which can be
expected to happen sometime in June.
April 7, 2009
This article was contributed by Valerie Aurora (formerly Henson)
In the
first article in
this series about unioning file systems, I reviewed the terminology
and major design issues of unioning file systems. In
the
second article, I
described three implementations of union mounts: Plan 9, BSD, and
Linux. In this article, I will examine two unioning file systems for
Linux: unionfs and aufs.
While union mounts and union file systems have the same goals, they
are fundamentally different "under the hood." Union mounts are a
first-class operating system object, implemented right smack in the
middle of the VFS code; they usually require some minor modifications to
the underlying file systems. Union file systems, instead, are implemented in
the space between the VFS and the underlying file system, with few or
no changes outside the union file system code itself. With a union
file system, the VFS thinks it's talking to a regular file system, and
the file system thinks it's talking to the VFS, but in reality both
are actually talking to the union file system code. As we'll see,
each approach has advantages and disadvantages.
Unionfs
Unionfs is the best-known and longest-lived implementation of a
unioning file system for Linux. Unionfs development began at SUNY
Stony Brook in 2003, as part of
the
FiST stackable file
system project. Both projects are led
by
Erez Zadok, a
professor at Stony Brook as well as an active contributor to the Linux
kernel. Many developers have contributed to unionfs over the years;
for a complete list, see the list of past students on
the
unionfs
web page - or read the copyright notices in the unionfs source code.
Unionfs comes in two major versions, version 1.x and version 2.x.
Version 1 was the original implementation, started in 2003. Version 2
is a rewrite intended to fix some of the problems with version 1; it
is the version under active development. A design document for
version 2 is available
at http://www.filesystems.org/unionfs-odf.txt.
Not all the features described in this document are implemented (at
least not in the
publicly
available git tree); for example, whiteouts are still stored as
directory entries with special names, which pollutes the namespace and
makes stacking of a unionfs file system over another unionfs file
system impossible.
Unionfs basic architecture
The unionfs code is a shim between the VFS and underlying file systems
(the branches). Unionfs registers itself as a file system with the
VFS and communicates with it using the standard VFS-file system
interface. Unionfs supplies various file system operation sets (such
as super block operations, which specify how to set up the file system
at mount, allocate new inodes, sync out changes to disk, and tear down
its data structures on unmount). At the data structure level, unionfs
file systems have their own superblock, mount, inode, dentry, and file
structures that link to those of the underlying file systems. Each
unionfs file system object includes an array of pointers to the
related objects from the underlying branches. For example, the
unionfs dentry private data (kept in the
d_fsdata field) looks
like:
/* unionfs dentry data in memory */
struct unionfs_dentry_info {
    /*
     * The semaphore is used to lock the dentry as soon as we get into a
     * unionfs function from the VFS. Our lock ordering is that children
     * go before their parents.
     */
    struct mutex lock;
    int bstart;
    int bend;
    int bopaque;
    int bcount;
    atomic_t generation;
    struct path *lower_paths;
};
The
lower_paths member is a pointer to an array of path
structures (which include a pointer to both the dentry and
the
mnt structure) for each directory with the same path in
the lower file systems. For example, if you had three branches, and
two of the branches had a directory named "/foo/bar/",
then, on lookup of that directory, unionfs will allocate (1) a
dentry structure, (2) a
unionfs_dentry_info
structure with a three-member
lower_paths array, and (3) two
dentry structures for the directories. Two members of
the
lower_paths array will be filled with pointers to
these dentries and their respective
mnt structures. The
array itself is dynamically allocated, grown, and shrunk according to
the number of branches. The number of branches (and therefore the
size of the array) is limited by a compile-time
constant,
UNIONFS_MAX_BRANCHES, which defaults to 128 -
about 126 more than commonly necessary, and more than enough for every
reasonable application of union file systems. The rest of the unionfs
data structures - super blocks, dentries, etc. - look very similar to
the structure described above.
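The three-branch example can be modeled in user space as follows. This is a sketch under stated assumptions, not unionfs code: the demo_ types stand in for the kernel's dentry, vfsmount, and path structures, and the reading of bstart and bend as the indices of the first and last branch holding the object is an interpretation of the field names rather than something the text above states.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_BRANCHES 3

struct demo_lower_path {            /* stand-in for struct path */
    const char *dentry;             /* would be a struct dentry pointer */
    const char *mnt;                /* would be a struct vfsmount pointer */
};

struct demo_dentry_info {
    int bstart;                     /* first branch with the object (assumed) */
    int bend;                       /* last branch with the object (assumed) */
    struct demo_lower_path *lower_paths;   /* one slot per branch */
};

/* Look up "/foo/bar" in every branch; present[i] says whether branch i
 * has it.  Slots for branches lacking the directory stay NULL. */
struct demo_dentry_info *demo_lookup(const int present[DEMO_BRANCHES])
{
    struct demo_dentry_info *info = malloc(sizeof(*info));

    if (!info)
        return NULL;
    info->lower_paths = calloc(DEMO_BRANCHES, sizeof(*info->lower_paths));
    if (!info->lower_paths) {
        free(info);
        return NULL;
    }
    info->bstart = info->bend = -1;
    for (int i = 0; i < DEMO_BRANCHES; i++) {
        if (!present[i])
            continue;
        info->lower_paths[i].dentry = "/foo/bar";
        info->lower_paths[i].mnt = "branch mount";
        if (info->bstart < 0)
            info->bstart = i;
        info->bend = i;
    }
    return info;
}

/* Count the filled slots for a given layout, then release everything. */
int demo_filled_slots(const int present[DEMO_BRANCHES])
{
    struct demo_dentry_info *info = demo_lookup(present);
    int filled = 0;

    if (!info)
        return -1;
    for (int i = 0; i < DEMO_BRANCHES; i++)
        if (info->lower_paths[i].dentry)
            filled++;
    free(info->lower_paths);
    free(info);
    return filled;
}
```

With the directory present in two of three branches, the array has three slots and two of them are filled, matching the description above.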
The VFS calls the unionfs inode, dentry, etc. routines directly, which
then call back into the VFS to perform operations on the corresponding
data structures of the lower level branches. Take the example of
writing to a file: the VFS calls the write() function in
the inode's file operations vector. The inode is a unionfs inode, so
it calls unionfs_write(), which
finds the lower-level inode and checks whether it is hosted on a
read-only branch. (Unionfs copies up a file on the first write to the
data or metadata, not on the first open() with write
permission.) If the file is hosted on a read-only branch, unionfs
finds a writable branch and creates a new file on that branch (and any
directories in the path that don't already exist on the selected
branch). It then copies up the various associated attributes - file
modification and access times, owner, mode, extended attributes,
etc. - and the file data itself. Finally, it calls the
low-level write() file operation from the newly allocated
inode and returns the result back to the VFS.
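The copy-up sequence just described can be sketched with a tiny in-memory model. Everything here is hypothetical: the demo_ names, the fixed-size "file", the whole-file write, and the choice to look for a writable branch only among higher-priority branches are assumptions made to keep the example short, and real unionfs also creates missing parent directories and copies extended attributes.

```c
#include <string.h>

#define DEMO_BRANCHES 2
#define DEMO_FILE_MAX 64

struct demo_rw_branch {             /* branch 0 has the highest priority */
    int  writable;
    int  has_file;
    long mtime;
    char data[DEMO_FILE_MAX];
};

/* Write new_data to the unioned file.  Returns the index of the branch
 * that received the write, or -1 on failure. */
int demo_union_write(struct demo_rw_branch *br, const char *new_data, long now)
{
    int src = -1, dst = -1;

    if (strlen(new_data) >= DEMO_FILE_MAX)
        return -1;

    /* Find the topmost branch that holds the file. */
    for (int i = 0; i < DEMO_BRANCHES && src < 0; i++)
        if (br[i].has_file)
            src = i;
    if (src < 0)
        return -1;

    if (!br[src].writable) {
        /* Copy up: pick a writable branch above the read-only one. */
        for (int i = 0; i < src && dst < 0; i++)
            if (br[i].writable)
                dst = i;
        if (dst < 0)
            return -1;              /* nowhere to copy up to */
        br[dst].has_file = 1;
        br[dst].mtime = br[src].mtime;                  /* attributes */
        memcpy(br[dst].data, br[src].data, DEMO_FILE_MAX);  /* file data */
        src = dst;
    }

    /* Finally perform the write on the (now writable) copy. */
    strcpy(br[src].data, new_data);
    br[src].mtime = now;
    return src;
}

/* File lives on read-only branch 1; the write must land on branch 0 and
 * leave the original untouched.  Returns 0 if that is what happened. */
int demo_copy_up_scenario(void)
{
    struct demo_rw_branch br[DEMO_BRANCHES] = {
        { .writable = 1, .has_file = 0 },
        { .writable = 0, .has_file = 1, .mtime = 100, .data = "old" },
    };

    if (demo_union_write(br, "new", 200) != 0)
        return -1;
    if (strcmp(br[0].data, "new") != 0 || strcmp(br[1].data, "old") != 0)
        return -1;
    return 0;
}
```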
Unionfs supports multiple writable branches. A file deletion (unlink) operation
is propagated through all writable branches, deleting (decrementing
the link count of) all files with the same pathname. If unionfs
encounters a read-only branch, it creates a whiteout entry in the
branch above it. Whiteout entries are named
".wh.<filename>"; a directory is marked opaque with
an entry named ".wh.__dir_opaque".
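A short sketch shows how name-based whiteouts might be consulted during lookup. Only the two name formats come from the text above; the demo_ functions and the idea of scanning a plain list of directory entries are assumptions for illustration.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define DEMO_WH_PREFIX ".wh."
#define DEMO_WH_OPAQUE ".wh.__dir_opaque"

/* Build the whiteout name for `name` into `out`.
 * Returns 0 on success, -1 if the buffer is too small. */
int demo_whiteout_name(const char *name, char *out, size_t size)
{
    int n = snprintf(out, size, DEMO_WH_PREFIX "%s", name);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Given the entries of a directory in a higher-priority branch, decide
 * whether `name` in the branches below it is hidden: either by its own
 * whiteout or because the whole directory is marked opaque. */
int demo_is_hidden(const char *const *upper_entries, size_t count,
                   const char *name)
{
    char wh[256];

    if (demo_whiteout_name(name, wh, sizeof(wh)) < 0)
        return 0;
    for (size_t i = 0; i < count; i++) {
        if (strcmp(upper_entries[i], wh) == 0 ||
            strcmp(upper_entries[i], DEMO_WH_OPAQUE) == 0)
            return 1;
    }
    return 0;
}
```

The sketch also makes the namespace problem visible: a user who really wants a file called ".wh.passwd" cannot have one, because the union would interpret it as a whiteout for "passwd".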
Unionfs provides some level of cache-coherency by revalidating
dentries before operating on them. This works reasonably well as long
as all accesses to the underlying file systems go through the
unionfs mount. Direct changes to the underlying file systems are
possible, but unionfs cannot correctly handle this in all cases,
especially when the directory structure changes.
Unionfs is under active development. According to
the version 2
design document, whiteouts will be moved to a small external
file system. An inode remapping file in the external file system will
allow persistent, stable inode numbers to be returned, making NFS
exports of unionfs file systems behave correctly.
The status of unionfs as a candidate for merging into the mainline
Linux kernel is mixed. On the one hand, Andrew Morton merged unionfs
into the -mm tree in January 2007, on the theory that unionfs may not
be the ideal solution, but it is one solution to a real problem.
Merging it into -mm may also prompt developers who don't like the
design to work on other unioning designs. However, unionfs has strong
NACKs from Al Viro and Christoph Hellwig, among others, and Linus is
reluctant to overrule subsystem maintainers.
The main objections to unionfs include its heavy duplication of data
structures such as inodes, the difficulty of propagating operations
from one branch to another, a few apparently insoluble race
conditions, and the overall code size and complexity. These
objections also apply to a greater or lesser degree to other stackable
file systems, such as ecryptfs. The consensus at the 2009 Linux file
systems workshop was that stackable file systems are conceptually
elegant, but difficult or impossible to implement in a maintainable
manner with the current VFS structure. My own experience writing a
stacked file system (an in-kernel chunkfs prototype) leads me to agree
with these criticisms.
Stackable file systems may be on the way out. Dustin Kirkland
proposed a new design for encrypted file systems that would not be
based on stackable file systems. Instead, it would create generic
library functions in the VFS to provide features that would also be
useful for other file systems. We identified several specific
instances where code could be shared between btrfs, NFS, and the
proposed ecryptfs design. Clearly, if stackable file systems are no longer
a part of Linux, the future of a unioning file system built on stacking is
in doubt.
aufs
Aufs, short for "Another UnionFS", was initially implemented as a fork
of the unionfs code, but was rewritten from scratch in 2006. The lead
developer is Junjiro R. Okajima, with some contributions from other
developers. The main aufs web site is at
http://aufs.sourceforge.net/.
The architecture of aufs is very similar to unionfs. The basic
building block is the array of lower-level file system structures
hanging off of the top-level aufs object. Whiteouts are named
similarly to those in unionfs, but they are hard links to a single whiteout
inode in the local directory. (When the maximum link count for the
whiteout inode is reached, a new whiteout inode is allocated.)
Aufs is the most featureful of the unioning file systems. It supports
multiple writable branch selection policies. The most useful is
probably the "allocate from branch with the most free space" policy.
Aufs supports stable, persistent inode numbers via an inode
translation table kept on an external file system. Hard links across
branches work. In general, if there is more than one way to do it,
aufs not only implements them all but also gives you a run-time
configuration option to select which way you would like to do it.
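To make the "allocate from branch with the most free space" policy concrete, here is a minimal sketch of such a selection function. It is not taken from aufs; the structure, the field names, and the tie-breaking rule (first branch wins) are assumptions.

```c
#include <stddef.h>

struct demo_aufs_branch {           /* hypothetical branch description */
    const char *path;
    int writable;
    unsigned long long free_bytes;
};

/* Return the index of the writable branch with the most free space,
 * or -1 if no branch is writable.  Ties go to the earlier branch. */
int demo_pick_most_free(const struct demo_aufs_branch *br, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!br[i].writable)
            continue;
        if (best < 0 || br[i].free_bytes > br[best].free_bytes)
            best = (int)i;
    }
    return best;
}
```

A run-time configurable file system would keep several functions like this one behind a per-mount policy pointer, which hints at how the option count grows.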
Given the incredible flexibility and feature set of aufs, why isn't it
more popular? A quick browse through the source code gives a clue.
Aufs consists of about 20,000 lines of dense, unreadable, uncommented
code, as opposed to around 10,000 for unionfs, 3,000 for union
mounts, and 60,000 for all of the VFS. The aufs code is generally something
that one does not want to look at.
The evolution of the aufs source base tends towards increasing
complexity; for example, when removing a directory full of whiteouts
takes an unreasonably long time, the solution is to create a kernel
thread that removes the whiteouts in the background, instead of trying
to find a more efficient way to handle whiteouts. Aufs slices, dices,
and makes julienne fries, but it does so in ways that are difficult to
maintain and which pollute the namespace. More is not better in this case;
the general trend is that the fewer the lines of code (and features)
in a unioning file system, the better the feedback from other file
system developers.
Junjiro Okajima recently submitted a somewhat stripped down version of
aufs for mainline:
I have another version which dropped many features and the size became
about half because such suggestion was posted LKML. But I got no
response for it. Additionally I am afraid it is useless in real world
since the dropped features are so essential.
While aufs is used by a number of practical projects (such as the
Knoppix Live CD), aufs shows no sign of getting closer to being merged
into mainline Linux.
The future of unioning file systems development
Disclaimer: I am actively working on union mounts, so my summary will
be biased in their favor.
Union file systems have the advantage of keeping most of the unioning
code segregated off into its own corner - modularity is good. But
it's hard to implement efficient race-free file system operations
without the active cooperation of the VFS.
My personal opinion is that union mounts will be the dominant unioning
file system solution. Union mounts have always been more popular with
the VFS maintainers, and during the VFS session at the recent file
systems workshop, Jan Blunck and I were able to satisfactorily answer
all of Al Viro's questions about corner cases in union mounts.
Part of what makes union mounts attractive is that we have focused on
specific use cases and dumped the features that have a low
reward-to-maintenance-cost ratio. We said "no" to NFS export of unioned
file systems and therefore did not have to implement stable inode
numbers. While NFS export would be nice, it's not a key design
requirement for the top use cases, and implementing it would require a
stackable file system-style double inode structure, with the attendant
complexity of propagating file system operations up and down between
the union mount inode and the underlying file system inode. We won't
handle online modification of branches other than the topmost writable
branch, or modification of file systems that don't go through the
union mount code, so we don't have to deal with complex
cache-coherency issues. To enforce this policy, Al Viro suggested a
per-superblock "no, you REALLY can't ever write to this file system"
flag, since currently read/write permissions are on a per-mount basis.
The st_dev and st_ino fields in stat will
change after a write to a file (technically, an open with write
permission), but most programs use this information, along
with ctime/mtime, to decide whether a file has changed -
which is exactly what has just happened, so the application should
behave as expected. Files from different underlying devices in the
same directory may confuse userland programs that expect to be able to
rename within a directory - e.g., at least some versions of "make
menuconfig" barf in this situation. However, this problem already
exists with bind mounts, which can result in entries with different
backing devices in the same directory. Rewriting the few programs
that don't handle this correctly is necessary to handle already
existing Linux features.
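The kind of check that user-space programs typically make can be sketched as below; the demo_ function names are invented for the example. A program remembers a struct stat for a file and later compares it with a fresh one. After a copy-up, the device and inode numbers differ, so the program concludes that the file has changed, which is the correct answer.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Typical "has this file changed?" test: identity plus timestamps. */
int demo_file_changed(const struct stat *before, const struct stat *after)
{
    return before->st_dev != after->st_dev ||
           before->st_ino != after->st_ino ||
           before->st_mtime != after->st_mtime ||
           before->st_ctime != after->st_ctime;
}

/* Helper for experimenting: build two stat records that differ only in
 * their device and inode numbers, as they would across a copy-up. */
int demo_identity_changed(dev_t dev1, ino_t ino1, dev_t dev2, ino_t ino2)
{
    struct stat before = {0}, after = {0};

    before.st_dev = dev1;
    before.st_ino = ino1;
    after.st_dev = dev2;
    after.st_ino = ino2;
    return demo_file_changed(&before, &after);
}
```

Programs which instead use st_dev and st_ino to decide whether two open descriptors refer to the same file are the ones which could be surprised.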
Changes to the underlying read-only file system must be done offline -
when it is not mounted as part of the union. We have at least two
schemes for propagating those changes up to the writable branch, which
may have marked directories opaque that we now want to see through
again. One is to run a userland program over the writable file system
to mark everything transparent again. Another is to use the
mtime/ctime information on directories to see if the underlying
directory has changed since we last copied up its entries. This can
be done incrementally at run-time.
Overall, the solution with the most buy-in from kernel developers is
union mounts. If we can solve the readdir() problem -
and we think we can - then it will be on track for merging in a
reasonable time frame.
By Jonathan Corbet
April 7, 2009
The annual Linux kernel summit may gain the most attention, but the size of
the kernel community makes it hard to get deeply into subsystem-specific
topics at that event. So, increasingly, kernel developers gather for more
focused events where some real work can be done. One of those gatherings
is the Linux Storage and Filesystem workshop; the
2009
workshop began on
April 6. Here is your editor's summary of the discussions which took
place on the first day.
Things began with a quick recap of the action items from the previous
year. Some of these had been fairly well resolved over that time; these
include power management, support for object storage devices, fibre channel
over Ethernet, barriers on by default in ext4, the fallocate()
system call, and
enabling relatime by default. The record for some other objectives is not
quite so good; low-level error handling is still not what it could be, "too
much work" has been done with I/O bandwidth controllers while nothing has
made it upstream, the union filesystem problem has not been solved, etc.
As a whole, a lot has been done, but a lot remains to do.
Device discovery
Joel Becker and Kay Sievers led a session on device discovery. On a
contemporary system, device numbers are not stable across reboots, and
neither are device names. So anything in the system which must work with
block devices and filesystems must somehow find the relevant device first.
Currently, that is being done by scanning through all of the devices on the
system. That works reasonably well on a laptop, but it is a real problem
on systems with huge numbers of block devices. There are stories of large
systems taking hours to boot, with the bulk of that time being spent
scanning (repeatedly - once for every mount request) through known devices.
What comes out of the discussion, of course, is that user space needs a
better way to locate devices. A given program may be searching for a
specific filesystem label, UUID, or something else; a good search API would
support all of these modes and more. What would be best would be to build
some sort of database where each new device is added at discovery time. As
additional information becomes available (when a filesystem label is found,
for example), it is added to the database.
Then, when a specific search is done, the information has already been
gathered and a scan of the system's devices is no longer necessary.
In the simplest form, this database can be the various directories full of
symbolic links that udev creates now. These directories solve much of the
problem, but they can never be a complete solution for one reason: some
types of devices - iSCSI targets, for example - do not really exist for the
system until user space has connected to them. Multipath devices also
throw a spanner into the works. For this reason, Ted Ts'o asserted that
some sort of programmatic API will always be needed.
Not a lot of progress was made toward specifying a solution; the main
concern, seemingly, was coming to a common understanding of the problem.
What's likely to happen is that the libblkid library will be extended to
provide the needed functionality. Next year, we'll see if that has been
done.
Asynchronous and direct I/O
Zach Brown's stated purpose in this session was to "just rant for 45
minutes" about the poor state of asynchronous I/O (AIO) support in Linux.
After ten years, he says, we still have an inadequate system which has
never been fixed. The problems with Linux AIO are well documented: only a
few operations are truly asynchronous, the internal API is terrible, it
does not properly support the POSIX AIO API, etc. There are, Zach says,
people wanting to do a lot more with AIO than is currently supported by
Linux.
That said, various alternatives have been proposed over time but nobody
ever tests them.
The conversation then shifted for a bit;
Jeff Moyer took a turn to complain about the related topic of direct I/O. It works poorly for
applications, he says, its semantics are different for different
filesystems, the internal I/O paths for direct I/O are completely
different from those used for buffered I/O, and it is full of races and
corner cases. Not a pretty picture.
One of the biggest complications with direct I/O is the need for the system
to support simultaneous direct and buffered I/O on the same file.
Prohibiting that combination would simplify the problem considerably, but
that is a hard thing to do. In particular, it would tend to break backups,
which often want to read (in buffered mode) a file which is open for direct
I/O. There was some talk of adding a new O_REALLYDIRECT mode which would
lock out buffered operations, but it's not clear that the advantages would
make this change worthwhile.
Another thing that would help with direct I/O would be to remove the
alignment restrictions on I/O buffers. That's a hard change to make,
though; many disk controllers can only perform DMA to properly-aligned
buffers. So allowing unaligned buffers would force the kernel to copy data
internally, which rather defeats the purpose of direct I/O. There
is one use case, though, where direct I/O might still make sense: some
direct I/O users really only
want to avoid filling the system page cache with their data. Using the
fadvise() system call is arguably a better way of achieving that
goal, but application developers are said to distrust it.
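Both halves of that tradeoff can be seen from user space. The sketch below allocates a buffer aligned the way O_DIRECT users generally have to (the 4096-byte figure is an assumption; the real requirement depends on the device and filesystem), then shows the fadvise() alternative: ordinary buffered writes followed by a request to drop the pages from the cache. The demo_ names and the scratch file are invented for the example, and the file is deliberately not opened with O_DIRECT since not every filesystem accepts that flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEMO_ALIGN 4096             /* assumed alignment requirement */

/* Allocate a buffer suitable for direct I/O under the assumption above. */
void *demo_alloc_aligned(size_t len)
{
    void *buf = NULL;

    if (posix_memalign(&buf, DEMO_ALIGN, len) != 0)
        return NULL;
    return buf;
}

/* Buffered write, then ask the kernel to forget the cached pages.
 * Returns 0 on success, -1 on failure. */
int demo_write_then_drop_cache(void)
{
    char path[] = "/tmp/fadvise-demo-XXXXXX";   /* scratch file, assumption */
    void *buf = demo_alloc_aligned(DEMO_ALIGN);
    int fd, rc = -1;

    if (!buf)
        return -1;
    fd = mkstemp(path);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    unlink(path);
    memset(buf, 'x', DEMO_ALIGN);

    /* Dirty pages cannot be dropped, so flush before giving the advice. */
    if (write(fd, buf, DEMO_ALIGN) == DEMO_ALIGN &&
        fsync(fd) == 0 &&
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) == 0)
        rc = 0;

    close(fd);
    free(buf);
    return rc;
}
```

The advice is only a hint, which may be part of why application developers are reluctant to rely on it.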
All told, it seems from the discussion that there is not a whole lot to be
done to improve direct I/O on Linux.
Returning to the AIO problem, the developers discussed Zach's proposed acall() API, which
shifts blocking operations into special-purpose kernel threads. The use
of threads in this manner promises a better AIO implementation than Linux
has ever had in the past. But there is a cost: some core scheduler changes
need to be made to support acall(). Among other things, there are
some complexities related to transferring credentials between threads,
propagating signals from AIO threads back to the original process, etc.
The end result is that scheduler performance may well suffer slightly. The
scheduler developers tend to be sensitive to even very small performance
penalties, so there may well be pushback when acall() is proposed
for mainline inclusion.
The addition of acall() would also add a certain maintenance
burden. Whenever a kernel developer makes a change to the task structure,
that developer would have to think about whether the change is relevant to
acall() and whether it would need to be transferred to or from
worker threads.
The conclusion was that acall() looks promising, and that the
developers in the room thought that it could work. They also agreed,
though, that a number of the relevant people were not in the room, so the
question of whether acall() is appropriate for the kernel as a
whole could not be answered.
RAID unification
The kernel currently contains two software RAID implementations, found in
the MD and device mapper (DM) subsystems. Additionally, the Btrfs
filesystem is gaining RAID capabilities of its own, a process which is
expected to continue in the future. It is generally agreed that having
three (or more) versions of RAID in the kernel is not an optimal situation.
What a proper solution will look like, though, is not all
that clear.
The session on RAID unification started with this question: who thinks that
block subsystem development should be happening in the device mapper
layer? A single hand was raised. In general, it seems, the developers in
the room had a relatively low opinion of the device mapper RAID code. It
should be said, of course, that there were no DM developers present.
What it comes down to is that the next generation of filesystems wants to
include multiple device support. Plans for Btrfs include eventual
RAID 6 support, but Btrfs developer Chris Mason has no interest in
writing that code. It would be much nicer to use a generic RAID layer
provided by the kernel. There are challenges, though. For example, a
RAID-aware filesystem really wants to use different stripe sizes for data
and metadata. Standard RAID, which knows little about the filesystems
built on it, does not provide any such feature.
So what would a filesystem RAID API look like? Christoph Hellwig is
working on this problem, but he's not ready to deal with the filesystem
problem yet. Instead, he's going to start by figuring out how to unify the
MD and DM RAID code. Some of this work may involve creating a set of
tables in the block layer for mapping specific regions of a virtual device
onto real regions in a lower-level device. The block layer already does
that - it's how partitions work - but incorporating RAID would complicate
things considerably. But, once that's done, we'll be a lot closer to
having a general-purpose RAID layer which can be used by multiple callers.
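To make the idea of a mapping table concrete, here is a minimal sketch in
Python. It is not Christoph's design or any existing kernel interface; the
Extent and MappingTable names, and the device names, are invented purely
for illustration. Each row maps a region of a virtual device onto a region
of a lower-level device, and a lookup is a binary search over the rows.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """One row of a hypothetical mapping table (all names are illustrative)."""
    virt_start: int   # first sector of the region on the virtual device
    length: int       # number of sectors in the region
    device: str       # name of the lower-level device
    dev_start: int    # first sector of the region on that device


class MappingTable:
    def __init__(self, extents):
        # Keep rows sorted by virtual start so lookups can use binary search.
        self.extents = sorted(extents, key=lambda e: e.virt_start)
        self.starts = [e.virt_start for e in self.extents]

    def lookup(self, sector):
        """Translate a virtual sector into (device, sector on that device)."""
        i = bisect.bisect_right(self.starts, sector) - 1
        if i < 0:
            raise ValueError(f"sector {sector} is not mapped")
        e = self.extents[i]
        if sector >= e.virt_start + e.length:
            raise ValueError(f"sector {sector} is not mapped")
        return e.device, e.dev_start + (sector - e.virt_start)


# A partition is the simple case: one region at a fixed offset on one device.
partition = MappingTable([Extent(0, 1000, "sda", 2048)])

# A linear concatenation of two devices needs two rows.
concat = MappingTable([
    Extent(0, 1000, "sda", 0),
    Extent(1000, 500, "sdb", 0),
])
```

A partition needs a single row, which is why the existing block-layer case
is easy. Striped or parity RAID levels would presumably compute the mapping
rather than list every region, which hints at where the added complexity
comes from.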
The talk wandered into the area of error handling for a while. In
particular, the tools Linux provides to administrators to deal with bad
blocks are still not what they could be. There was talk about providing a
consistent interface for reporting bad blocks - including tools for mapping
those blocks back to the files that contain them - as well as performing
passive scanning for bad blocks.
The action items that came out of this discussion include the rework of
in-kernel RAID by Christoph. After that, the process of trying to define
filesystem-specific interfaces will begin.
Rename, fsync, and ponies
Prior to Ted Ts'o's session on fsync() and rename(), some
joker filled the room with coloring-book pages depicting ponies. These
pages reflected the sentiment that Ted has often expressed: application
developers are asking too much of the filesystem, so they might as well
request a pony while they're at it.
Ted apologized to the room for his part in the implementation of the
data=ordered mode for ext3. This mode was added as a way to
improve the security of the filesystem, but it had the side effect of
flushing many changes to the filesystem within a five-second window. That
allowed application developers to "get lazy" and stop worrying about
whether their data had actually hit the disk at the right times. Now those
developers are resisting the idea that they should begin to worry again.
This problem has a longer history than many people realize. The XFS
filesystem first hit it back around 2001. But, Ted says, most application
developers didn't understand why they were getting corrupted files after a
crash. Rather than fix their applications, they just switched filesystems
- to ext3. Things worked for some time until Ubuntu users started testing
the alpha "Jaunty" release, which makes ext4 available as an installation
option. At that point,
they started finding zero-length files after crashes, and they blamed
ext4.
But, Ted says, the real problem is the missing fsync() calls.
There are a number of reasons why they are not there, including developer
laziness, the problem that fsync() on ext3 has become very
expensive, the difficulty involved in preserving access control lists and
other extended attributes when creating new files, and concerns about the
battery-life costs of forcing the disk to spin up. Ted had more sympathy
for some of these reasons than others, but, he says, "the application
developers outnumber us," so something will have to be done to meet their
concerns.
Valerie Aurora broke in to point out that application developers have been
put into a position where they cannot do the right thing. A call to
fsync() can stall the system for quite a while on ext3. Users
don't like that either; witness the fuss caused by excessive use of
fsync() by the Firefox browser. So it's not just that application
developers are lazy; there are real disincentives to the use of
fsync(). Ted agreed, but he also claimed that a lot of application
developers are refusing to help fix the problem.
In the short term, the ext4 filesystem has gained a number of workarounds to
help prevent the worst surprises. If a newly-written file is renamed on
top of another, existing file, its data will be flushed to disk with the
next commit. Similar things happen with files which have been truncated
and rewritten. There is a performance cost to these changes, but they do
make a significant part of the problem go away.
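For reference, the pattern which does not depend on these workarounds looks
something like the following Python sketch. The function name is made up
for this example; the calls it uses (mkstemp(), fsync(), rename()) are the
standard ones. The new contents go into a temporary file, that file is
flushed with fsync(), and only then is it renamed over the old file.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace the contents of `path` with `data` (bytes).

    Illustrative helper: the data is forced to disk before the rename, so
    a crash should leave either the old file or the complete new one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new contents to a temporary file in the same directory,
    # so that the final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push the data to disk
        os.rename(tmp_path, path)   # atomically replace the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Note that the temporary file is a brand-new file with its own permissions;
this sketch makes no attempt to copy access control lists or extended
attributes from the file being replaced, which is one of the difficulties
mentioned above.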
For the longer term, Ted asked: should the above-described fixes become a
part of the filesystem policy for Linux? In other words, should
application developers be assured that they'll be able to write a file,
rename it on top of another file, omit fsync(), and not encounter
zero-length files after a crash? The answer turns out to be "yes," but
first Ted presented his other long-term ideas.
One of those is to improve the performance of the
fsync() system call. The ext4 workarounds have also been added to
ext3 when it runs in the data=writeback mode. Additionally, some
block-layer fixes have been incorporated into 2.6.30. With those fixes in
place, it is possible to run in data=writeback mode, avoid the
zero-length file problem, and also avoid the fsync() performance
problem. So, Ted asked, should data=writeback be made the default
for ext3?
This idea was received with a fair amount of discomfort. The
data=writeback mode brings back problems that were fixed by
data=ordered; after a crash, a file which was being written could
turn up with completely unrelated data in it. It could be somebody else's
sensitive data. Even if it's boring data, the problem looks an awful lot
like file corruption to many users. It seems like a step backward and a
change which is hard to justify for a filesystem which is headed toward
maintenance mode. So it would be surprising to see this change made.
[After writing the above, your editor noticed that Linus had just merged a
change to make data=writeback the default for ext3 in 2.6.30.
Your editor, it seems, is easily surprised.]
Finally, the idea of the fbarrier() system call was raised.
Essentially, fbarrier() would ensure that any data written to a
file prior to the call would be flushed to disk before any metadata changes
made after the call. It could be implemented with fsync(); for
ext3 data=ordered mode, it would do nothing at all. Ted did not
try hard to sell this system call, saying that it was mainly there to
address the laptop power consumption concern. Ric Wheeler claimed that it
would be a waste of time; by the time people are actually using it, we'll
all have solid-state drives in our laptops and the power concern will be
gone. In general, enthusiasm for fbarrier() was low.
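The semantics can be illustrated with a short Python sketch. There is no
fbarrier() system call; the function below is a hypothetical stand-in which
falls back to fsync(), as the discussion suggested an implementation could.
The helper name replace_with_barrier() is also invented for this example.

```python
import os


def fbarrier(fd):
    """Hypothetical stand-in for the proposed fbarrier() call.

    No such system call exists; this fallback simply calls fsync(), which
    is stronger than required: it waits for the data to reach the disk
    instead of only ordering it ahead of later metadata changes.
    """
    os.fsync(fd)


def replace_with_barrier(tmp_path, final_path, data):
    """Write data, then rename, with a barrier between the two steps."""
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)  # small writes only; no short-write handling here
        fbarrier(fd)        # data written above must land before the rename
    finally:
        os.close(fd)
    os.rename(tmp_path, final_path)
```

The difference from the fsync() pattern is only in what the caller asks
for: ordering rather than immediate durability, which is what would let a
laptop's disk stay spun down.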
So the discussion turned back to the idea of generalizing and guaranteeing
the ext4 workarounds. Chris Mason asked when there might be a time that
somebody would not want to rename files safely; he did not get an
answer. There was concern that these workarounds could not be allowed to
hurt the performance of well-written applications. But the general
sentiment was that these workarounds should become policy that all
filesystems should implement.
pNFS
There was a session on supporting parallel NFS (pNFS). It was
mostly a detailed, technical discussion on what sort of API is needed to
allow clustered filesystems to tell pNFS about how files are distributed
across servers. Your editor will confess that his eyes glazed over after a
while, and his notes are relatively incoherent. Suffice it to say that,
eventually, OCFS2 and GFS will be able to communicate with pNFS servers and
that all the people who really care about how that works will understand
it.
Miscellaneous topics
The final session of the day related to "miscellaneous VFS topics"; the
first had to do with eCryptfs. This filesystem provides encryption for
individual files; it is currently implemented as a stacking filesystem
using an ordinary filesystem to provide the real storage. The stacking
nature of eCryptfs has long been a problem; now some Ubuntu developers are
working to change it.
In particular, what they would like to do is to move the encryption
handling directly into the VFS layer. Somehow users will supply a key to
the kernel, which will then transparently handle the encryption and
decryption of data. To that end, some sort of transformation layer will be
provided to process the data between the page cache and the underlying
block device.
One question that came up was: what happens when the user does not have a
valid key? Should the VFS just provide encrypted data in that case? Al
Viro raised the question of what happens when one process opens the file
with a key while another one opens it without a key. At that point there
will be a mixture of encrypted and clear-text pages in the cache, a
situation which seems sure to lead to confusion. So it seems that the VFS
will simply refuse to provide access to files if the necessary key is not
provided.
There are various problems to be solved in the creation of the
transformation layer - things like not letting processes modify a page
while it is being encrypted or decrypted. Chris Mason noted that he faces
a similar problem when generating checksums for pages in Btrfs. These are
problems which can be addressed, though. But it was clear that this kind
of transformation is likely to be built into the VFS in the future.
Stacking filesystems just do not work well with the Linux VFS as it exists
now.
Next up was David Brown, who works in the scientific high-performance
computing field. David has an interesting problem. He runs massive
systems with large storage arrays spread out across many systems. Whenever
some process calls stat() on a file stored in that array, the
entire cluster essentially has to come to a stop. Locks have to be
acquired, cached pages have to be flushed out, etc., just to ensure that
specific metadata (the file size in particular) is available and
correct. So, if a scientist logs in and types "ls" in a large directory,
the result can be 30 minutes in coming and little work gets done in
the meantime. Not ideal.
What David would like is a "stat() light" call which wouldn't cause all of
this trouble. It should return the metadata to the best of its knowledge,
but it would not flush caches or take cluster-wide locks to obtain this
information. If that means that the size is not entirely
accurate, so be it. In the subsequent discussion, the idea was modified a
little bit. "Slightly inaccurate" results would not be returned; instead,
the size would simply be zeroed out. It was felt that returning no
information at all was better than returning something which may have no
real basis in reality.
Beyond that, there would likely be a mask
associated with the system call. Initially it was suggested that the mask
would be returned; it would have bits set to indicate which fields in the
return stat structure are valid. But it was also suggested that
the mask should be an input parameter instead; the call would then do
whatever was needed to provide the fields requested by the caller. Using
the mask as an input parameter would avoid the need for duplicate calls in
the case where the necessary information is not provided the first time
around.
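The input-mask variant might behave something like the toy model below,
written in Python. Everything here is hypothetical: the mask bits, the
LiteStat structure, and the ClusterFile class are invented for illustration,
as is the assumption that only the size is expensive to obtain. The point is
that fields which were not requested cost nothing and come back zeroed.

```python
from dataclasses import dataclass

# Hypothetical mask bits naming the fields a caller can ask for.
WANT_MODE = 0x1
WANT_MTIME = 0x2
WANT_SIZE = 0x4


@dataclass
class LiteStat:
    mode: int = 0
    mtime: int = 0
    size: int = 0   # left at zero unless the caller asked for it


class ClusterFile:
    """Toy model of a file whose size is expensive to determine."""

    def __init__(self, mode, mtime, size):
        self._mode, self._mtime, self._size = mode, mtime, size
        self.expensive_lookups = 0  # counts simulated cluster-wide lock rounds

    def _authoritative_size(self):
        # Stands in for taking locks and flushing caches across the cluster.
        self.expensive_lookups += 1
        return self._size

    def stat_lite(self, mask):
        """Fill in only the fields named in the input mask."""
        st = LiteStat()
        if mask & WANT_MODE:
            st.mode = self._mode
        if mask & WANT_MTIME:
            st.mtime = self._mtime
        if mask & WANT_SIZE:
            st.size = self._authoritative_size()
        return st
```

A directory listing which wants only names and modes would pass a mask
without WANT_SIZE and never trigger the expensive path; a caller which truly
needs the size asks for it once and pays the cost once.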
The actual form of the system call is likely to be determined when somebody
follows Christoph Hellwig's advice to "send a bloody patch."
The final topic of the day was union mounts. Valerie Aurora, who led this
session, recently wrote an
article about union filesystems and the associated problems for LWN.
The focus of this session was the readdir() system call in
particular. POSIX requires that readdir() provide a position
within a directory which can be used by the application at any future time
to return to the same spot and resume reading directory entries. This
requirement is hard for any contemporary filesystem to meet. It becomes
almost impossible for union filesystems, which, by definition, are
presenting a combination of at least two other filesystems.
The solution that Valerie was proposing was to simply recreate directories
in the top (writable) layer of the union. The new directories would point
to files in the appropriate places within the union and would have
whiteouts applied. That would eliminate the need to mix together directory
entries from multiple layers later on, and the readdir()
problem would collapse back to
the single-filesystem implementation. At least, that holds true for as
long as none of the lower-level filesystems in the union change. Valerie
proposes that these filesystems be forced to be read-only, with an unmount
required before they could be changed.
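The directory-copying approach can be sketched in a few lines of Python.
This is a toy model and not the union mount code; the function names and
the use of dictionaries for directories are assumptions made for the
example. The merged directory is built once in the top layer, with
whiteouts applied, after which reading it involves only one directory.

```python
def build_top_directory(upper, lower, whiteouts):
    """Build the directory stored in the top (writable) layer.

    upper, lower: dicts mapping a name to the layer-local object it names.
    whiteouts: names deleted in the union, which hide lower-layer entries.
    """
    merged = {}
    # Start with lower-layer entries, skipping any that are whited out.
    for name, target in lower.items():
        if name not in whiteouts:
            merged[name] = ("lower", target)
    # Top-layer entries take precedence over same-named lower entries.
    for name, target in upper.items():
        merged[name] = ("upper", target)
    return merged


def readdir(directory):
    """With the merge already done, readdir() walks a single directory."""
    return sorted(directory)
```

The merge is only valid for as long as the lower dictionary does not
change, which is the reason for the read-only requirement.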
The good news is that this is how BSD union mounts have worked for a long
time.
The bad news is that there's one associated problem: inode number
stability. NFS servers are expected to provide stable inode numbers to
clients even
across reboots. But copying a file entry up to the top level of a union
will change its inode number, confusing NFS clients. One possible solution
to this problem is to simply decree that union mounts cannot be exported
via NFS. It's not clear that there is a plausible use case for this kind
of export in any case. The other solution is to just let the inode number
change. That could lead to different NFS clients having open file
descriptors to different versions of the file, but so be it. The consensus
seemed to lean toward the latter solution.
And that is where the workshop concluded. Your editor will be attending
most of the second and final day (minus a brief absence for a cameo
appearance at the Embedded Linux Conference); a report from that day will
be posted shortly thereafter.
By Jonathan Corbet
April 8, 2009
The second and final day of the Linux Storage and Filesystem Workshop was
held in San Francisco, California on April 7. Conflicting commitments
kept your editor from attending the entire event, but he was able to
participate in sessions on solid-state device support, storage topology
information, and more.
Supporting SSDs
The solid-state device topic was the most active discussion of the
morning. SSDs clearly stand to change the storage landscape, but it often
seems that nobody has yet figured out just how things will change or what
the kernel should do to make the best use of these devices. Some things
are becoming clearer, though. The
kernel will be well positioned to support the current-generation SSDs.
Supporting future products, though, is going to be a challenge.
Matthew Wilcox, who led the discussion, started by noting that Intel SSDs
are able to handle a large number of operations in parallel. The
parallelism is so good, in fact, that there is really little or no
advantage in delaying operations. I/O requests should be submitted
immediately; the block I/O subsystem shouldn't even attempt to merge
adjacent requests. This message was diluted a bit later on, but the core
message is clear: the kernel should, when driving an SSD, focus on getting
out of the way and processing operations as quickly as possible.
It was asked: how do these drives work internally? This would be nice to
know; the better informed the kernel developers are, the better a job they
can do of driving the devices. It seems, though, that the firmware in
these devices - the part that, for now, makes Intel devices work better
than most of the alternatives - is laden with Valuable Intellectual
Property, and not much information will be forthcoming. Solid-state
devices will be black boxes for the foreseeable future.
In any case, current-generation Intel SSDs are not the only type of device
that the kernel will have to work with. Drives will differ greatly in the
coming years. What the kernel really needs to know is a few basic
parameters: what kind of request alignment works best, what request sizes
are fastest, etc. It would be nice if the drives could export this
information to the operating system. There is a mechanism by which this
can be done, but current drives are not making much information available.
One clear rule holds, though: bigger requests are better. They might
perform better in the drive itself, but, with high-quality SSDs, the real
bottleneck is simply the number of requests which can be generated and
processed in a given period of time. Bundling things into larger requests
will tend to increase the overall bandwidth.
A related rule has to do with changes in usage patterns.
It would appear that the Intel drives, at least, observe the requests
issued by the computer and adapt their operation to improve performance.
In particular, they may look at the typical alignment of requests. As a
result, it is important to let the drive know if the usage pattern is about
to change - when the drive is repartitioned and given a new filesystem, for
example. The way to do this, evidently, is to issue an ATA "secure erase"
command.
From there, the conversation moved to discard (or "trim") requests, which
are used by the host to tell the drive that the contents of specific blocks
are no longer needed. Judicious use of trim requests can help the drive in
its garbage collection work, improving both performance and the overall
life span of the hardware. But what constitutes "judicious use"? Doing a
trim when a new filesystem is made is one obvious candidate. When the
kernel initializes a swap file, it trims the entire file at the outset
since it cannot contain anything of use. There is no controversy here
(though it's amusing to note that mkfs does not, yet, issue trim
commands).
But what about when the drive is repartitioned? It was suggested that the
portion of the drive which has been moved from one partition to another
could be trimmed. But that raises an immediate problem: if the partition
table has been corrupted and the "repartitioning" is really just an attempt
to restore the drive to a working state, trimming that data would be a
fatal error. The same is true of using trim in the fsck command, which is
another idea which has been suggested. In the end, it is not clear that
using trim in either case is a safe thing to do.
The other obvious place for a trim command is when a file is deleted; after
all, its data clearly is no longer needed. But some people have questioned
whether that is a good time too. Data recovery is one issue; sometimes
people want to be able to get back the contents of an erroneously-deleted
file. But there is also a potential performance issue: on ATA drives, trim
commands cannot be issued as tagged commands. So, when a trim is
performed, all normal operations must be brought to a halt. If that
happens too often, the throughput of the drive can suffer. This problem
could be mitigated by saving up trim operations and issuing them all
together every few minutes. But it's not clear that the real performance
impact is enough to justify this effort. So some benchmarking work will be
needed to try to quantify the problem.
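The batching idea might look something like this Python sketch. The
TrimBatcher class is invented for illustration, the five-minute interval is
an arbitrary choice, and the caller supplies the current time so that the
behavior is easy to follow. Ranges are queued, adjacent or overlapping
ranges are merged, and the whole set is handed to the device in one go.

```python
class TrimBatcher:
    """Collect discard ranges and issue them together (illustrative only)."""

    def __init__(self, issue_trim, interval=300.0):
        self.issue_trim = issue_trim  # callback taking a list of (start, length)
        self.interval = interval      # seconds between flushes; arbitrary
        self.pending = []
        self.last_flush = 0.0

    def discard(self, start, length, now):
        """Queue a range; flush if the interval has elapsed."""
        self.pending.append((start, length))
        if now - self.last_flush >= self.interval:
            self.flush(now)

    def flush(self, now):
        """Merge adjacent or overlapping ranges and issue them as one batch."""
        merged = []
        for start, length in sorted(self.pending):
            if merged and start <= merged[-1][0] + merged[-1][1]:
                prev_start, prev_len = merged[-1]
                end = max(prev_start + prev_len, start + length)
                merged[-1] = (prev_start, end - prev_start)
            else:
                merged.append((start, length))
        if merged:
            self.issue_trim(merged)
        self.pending = []
        self.last_flush = now
```

One thing this sketch leaves out: if a queued block is reallocated and
written before the batch goes out, its range must be dropped from the
pending list, or the later trim would throw away live data.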
An alternative which was suggested was to not use trim at all. Instead, a
similar result could be had by simply reusing the same logical block
numbers over and over. A simple-minded implementation would always just
allocate the lowest-numbered free block when space is needed, thus
compressing the data toward the front end of the drive. There are a couple
of problems with this approach, though, starting with the fact that a lot
of cheaper SSDs have poor wear-leveling implementations. Reusing
low-numbered blocks repeatedly will wear those drives out prematurely. The
other problem is that allocating blocks this way would tend to fragment
files. The cost of fragmentation is far less than with rotating storage,
but there is still value in keeping files contiguous. In particular, it
enables larger I/O operations, and, thus, better performance.
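Both problems show up even in a toy version of the allocator. The Python
sketch below is illustrative only (the class name and the per-block counter
are made up for this example); it always hands out the lowest-numbered free
block and counts how often each block is allocated.

```python
import heapq


class LowestFirstAllocator:
    """Always hand out the lowest-numbered free block (illustrative only)."""

    def __init__(self, nblocks):
        self.free = list(range(nblocks))  # an ascending list is a valid heap
        self.writes = [0] * nblocks       # per-block allocation count

    def alloc(self):
        block = heapq.heappop(self.free)
        self.writes[block] += 1
        return block

    def release(self, block):
        heapq.heappush(self.free, block)


a = LowestFirstAllocator(8)
file_a = [a.alloc() for _ in range(3)]   # blocks 0, 1, 2
file_b = [a.alloc() for _ in range(2)]   # blocks 3, 4
file_a.remove(1)
a.release(1)                             # file A gives up block 1
file_b.append(a.alloc())                 # file B grows into the hole
```

After this sequence, file B occupies blocks 3, 4, and 1, so it is no longer
contiguous, and block 1 has been allocated twice while blocks 5 through 7
have never been touched.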
There was a side discussion on how the kernel might be able to distinguish
"crap" drives from those with real wear-leveling built in. There's
actually some talk of trying to create value-neutral parameters which a
drive could use to export this information, but there doesn't seem to be
much hope that the vendors will ever get it right. No drive vendor wants
its hardware to self-identify as a lower-quality product.
One suggestion is that the kernel could interpret support for the trim
command as an indicator that it's dealing with one of the better drives.
That led to the revelation that the much-vaunted Intel drives do
not, currently, support trim. That will change in future versions, though.
A related topic is a desire to let applications issue their own trim
operations on portions of files. A database manager could use this feature
to tell the system that it will no longer be interested in the current
contents of a set of file blocks. This is essentially a version of the
long-discussed punch() system call, with the exception that the
blocks would remain allocated to the file. De-allocating the blocks would
be correct at one level, but it would tend to fragment the file over time,
force journal transactions, and make O_DIRECT operations block
while new space is allocated. Database developers would like to avoid all
of those consequences. So this variant of punch() (perhaps
actually a variant of fallocate()) would discard the data, but
keep the blocks in place.
From there, the discussion went to the seemingly unrelated topic of "thin
provisioning." This is an offering from certain large storage array
vendors; they will sell an array which claims to be much larger than the
amount of storage actually installed. When the available space gets low,
the customer can buy more drives from the vendor. Meanwhile, from the
point of view of the system, the (apparently) large array has never
changed.
Thin provisioning providers can use the trim command as well; it lets them
know that the indicated space is unused and can be allocated elsewhere.
But that leads to an interesting problem if trim is used to discard the
contents of some blocks in the middle of the file. If the application
later writes to those blocks - which are, theoretically, still in place -
the system could discover that the device is out of space and fail the
request. That, in turn, could lead to chaos.
The truth of the matter is that thin provisioning has this problem
regardless of the use of the trim command. Space "allocated" with
fallocate() could turn out to be equally illusory. And if space
runs out when the filesystem is trying to write metadata, the filesystem
code is likely to panic, remount the filesystem read-only, and, perhaps,
bring down the system. So thin provisioning should be seen as broken
currently. What's needed to fix it is a way for the operating system to
tell the storage device that it intends to use specific blocks; this is an
idea which will be taken back to the relevant standards committees.
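The failure mode is easy to reproduce in a toy model. The ThinDevice class
below is a Python sketch invented for this illustration and does not
correspond to any real array's behavior: it advertises more blocks than it
can back, hands out real storage on first write, and takes it back on trim.

```python
class ThinDevice:
    """Toy thin-provisioned device: advertises more blocks than it has."""

    def __init__(self, advertised_blocks, physical_blocks):
        self.advertised = advertised_blocks
        self.physical = physical_blocks
        self.backed = set()  # logical blocks which currently have storage

    def write(self, block):
        if not 0 <= block < self.advertised:
            raise IndexError("block outside the advertised size")
        if block not in self.backed:
            if len(self.backed) >= self.physical:
                # The host sees this as a failed write to a valid block.
                raise OSError("out of physical space")
            self.backed.add(block)

    def trim(self, block):
        # The array is free to hand the backing storage to someone else.
        self.backed.discard(block)
```

With two physical blocks behind a 100-block device, writing blocks 0 and 1,
trimming block 0, and then writing block 50 leaves no storage for a later
rewrite of block 0, even though the filesystem still considers that block
allocated. A "reserve these blocks" operation of the sort described above
would amount to populating the backed set ahead of time.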
Finally, there was some discussion of the CFQ I/O scheduler, which has a
lot of intelligence which is not needed for SSDs. There's a way
to bypass CFQ for some SSD operations, but CFQ still adds an
approximately 3% performance penalty compared to the no-op I/O scheduler.
That kind of cost is bearable now, but it's not going to work for future
drives. There is real interest in being able to perform 100,000 operations
per second - or more - on an SSD. That kind of I/O rate does not leave
much room for system overhead. So, at some point, we're going to see a
real effort to streamline the block I/O paths to ensure that Linux can
continue to get the best out of solid-state devices.
Storage topology
Martin Petersen introduced the storage topology issue by talking about the
coming 4K-sector drives. The sad fact is that, for all the talk of SSDs,
rotating storage will be with us for a while yet. And the vendors of disk
drives intend to shift to 4-kilobyte sectors by 2011. That leads to a
number of interesting support problems, most of which were covered in an
LWN article in March. In
the end, the kernel is going to have to know a lot more about I/O sizes and
alignment requirements to be able to run future drives.
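The alignment issue comes down to simple arithmetic, shown here as a Python
sketch. It assumes a drive which uses 4096-byte sectors internally while
still presenting 512-byte logical sectors to the host; that assumption, and
the function names, are for illustration only.

```python
LOGICAL = 512     # sector size the drive reports to the host, in bytes
PHYSICAL = 4096   # sector size used internally by the drive, in bytes


def is_aligned(start_lba, length_sectors):
    """True if a request covers only whole physical sectors."""
    start = start_lba * LOGICAL
    length = length_sectors * LOGICAL
    return start % PHYSICAL == 0 and length % PHYSICAL == 0


def physical_sectors_touched(start_lba, length_sectors):
    """Number of physical sectors the drive must access for a request."""
    first = (start_lba * LOGICAL) // PHYSICAL
    last = ((start_lba + length_sectors) * LOGICAL - 1) // PHYSICAL
    return last - first + 1
```

A 4KB write starting at logical sector 64 lands on exactly one physical
sector. The same write starting at sector 63, the traditional start of the
first DOS partition, straddles two physical sectors, and the drive has to
read and rewrite both of them.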
To that end, Martin has prepared a set of patches which export this information
to the system. The result is a set of directories under
/sys/block/drive/topology which provide the sector size,
needed alignment, optimal I/O flag, and more. There's also a "consistency
flag" which tells the user whether any of the other information actually
matches reality. In some situations (a RAID mirror made up of drives with
differing characteristics, for example), it is not possible to provide real
information, so the kernel has to make something up.
There was some wincing over this use of sysfs, but the need for this kind of
information is clear. So these patches will probably be merged into the
2.6.31 kernel.
readdirplus()
There was also a session on the proposed readdirplus() system
call. This call would function much like readdir() (or, more
likely, like getdents()), but it would provide file metadata along
with the names. That, in turn, would avoid the need for a separate
stat() call and, hopefully, speed things considerably in some
situations.
Most of the discussion had to do with how this new system call would be
implemented. There is a real desire to avoid the creation of independent
readdir() and readdirplus() implementations in each
filesystem. So there needs to be a way to unify the internal
implementation of the two system calls. Most likely that would be done by
using only the readdirplus() function if a filesystem provides
one; this callback would have a "no stat information needed" flag for the
case when normal readdir() is being called.
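The shape of that unification can be sketched in user-space Python. Nothing
here reflects an actual kernel interface; the ToyFilesystem class, the
need_stat flag, and the two wrapper functions are assumptions made for the
example. The filesystem supplies a single callback, and plain readdir() is
simply that callback with metadata collection turned off.

```python
import os


class ToyFilesystem:
    """Filesystem exposing one directory-reading callback (illustrative)."""

    def __init__(self, root):
        self.root = root
        self.stat_calls = 0  # how many times metadata was actually fetched

    def readdirplus(self, path, need_stat=True):
        """Yield (name, stat result or None) for each directory entry."""
        full = os.path.join(self.root, path)
        for name in sorted(os.listdir(full)):
            st = None
            if need_stat:
                self.stat_calls += 1
                st = os.lstat(os.path.join(full, name))
            yield name, st


def vfs_readdir(fs, path):
    """Plain readdir(): same callback, with 'no stat needed' requested."""
    return [name for name, _ in fs.readdirplus(path, need_stat=False)]


def vfs_readdirplus(fs, path):
    """readdirplus(): names and metadata from a single pass."""
    return list(fs.readdirplus(path, need_stat=True))
```

In this user-space model the metadata still costs one lstat() per entry; the
hoped-for gain from a real readdirplus() is that a filesystem could return
metadata it already has in hand while walking the directory.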
The creation of this system call looks like an opportunity to leave some
old mistakes behind. So, for example, it will not support seeking within a
directory. There will also probably be a new dirent structure
with 64-bit fields for most parameters. Beyond that, though, the shape of
this new system call remains somewhat cloudy. Somebody clearly needs to
post a patch.
Conclusion
And there ends the workshop - at least, the part that your editor was able
to attend. There were a number of storage-related sessions which, beyond doubt,
covered interesting topics, but it was not possible to be in both rooms at
the same time (though, with luck, your editor will soon receive another
attendee's notes from those sessions). The consensus among the attendees
was that it was a highly
successful and worthwhile event; the effects should be seen to ripple
through the kernel tree over the next year.
Page editor: Jonathan Corbet