Brief items
The current 2.6 prepatch remains 2.6.17-rc6; no prepatches have been
released over the past week. A few dozen fixes have been merged into the
mainline repository since -rc6 was released, but the pace has slowed
considerably. The 2.6.17 final release may well happen in the near future.
The current -mm tree is 2.6.17-rc6-mm2. Recent changes
to -mm include a new statistics infrastructure for the memory management
subsystem, virtualized namespaces for SYSV interprocess communications
primitives, and some lock validator work.
Kernel development news
I think the interesting point is how we're moving away from the
"global development" model (ie everything breaks at the same time
between 2.4.x and 2.6.x), and how the fact that we're trying to
maintain a more stable situation may well mean that we'll see more
of the "local development" model where a specific subsystem goes
through a development series, but where stability requirements mean
that we must not allow it to disturb existing users.
And even more interestingly (at least to me), the question might
become one of "how does that affect the tools and build and
configuration infrastructure", and just the general flow of
development.
-- Linus Torvalds
Linux supports a wide variety of filesystems. While it is true that the
Linux VFS layer treats all filesystems equally, the ext3 filesystem is
certainly the first among equals. Ext3 is the default choice for a large
majority of distributions; it can thus be found on vast numbers of
installed Linux systems. If any filesystem were to be named the
Linux filesystem, it would be ext3.
Ext3 is based on decades of experience with Unix filesystems. As a result,
it is relatively straightforward to understand and highly reliable in its
operation. It is, however, also showing its age in a number of ways. One
of those is the maximum size of the underlying device it can handle. This
limit is a mere 8 TB. That is enough to hold most of our mail spools
- even before spam filtering - but it is a limit which is already affecting
some users. With the size of contemporary disks, the creation of an
8 TB array is not an entirely outlandish thing to do now, and it will
only become easier over time.
There are a couple of reasons for this limit. One of them is the use of
32-bit block numbers within the filesystem - and signed 32-bit numbers at
that. The ext3 code can only track 2 gigablocks, which, using a 4K
block size, sets the limit at 8 TB. Switching to an unsigned type can
double that limit, but that only pushes back the problem by about one
year. Clearly, larger block numbers are required.
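The arithmetic behind these limits can be checked with a short sketch. The function name max_device_bytes() is hypothetical and is not taken from the ext3 code; it only multiplies the number of addressable blocks by the block size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest device size, in bytes, that can be addressed when `bits` bits of
 * a block number are usable. Illustrative only: the caller must choose
 * values whose product fits in 64 bits.
 */
static uint64_t max_device_bytes(unsigned bits, uint64_t block_size)
{
    assert(bits < 64);          /* shifting by 64 or more is undefined */
    return ((uint64_t)1 << bits) * block_size;
}
```

With 4K blocks, 31 usable bits (a signed 32-bit number) gives 8 TB, 32 bits gives 16 TB, and 48 bits gives 1024 PB.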
The other problem has to do with how ext3 tracks the blocks associated with
any given file. The ext3 inode structure contains an array of
fifteen 32-bit pointers; the first twelve of those pointers contain the
indexes of the first twelve blocks of the file. Thus, with a filesystem
using 4K blocks, the first twelve pointers can describe a file of up to
48KB in length. If the file exceeds that length, an "indirect block" is
created. This block is a big array of block pointers, holding the indexes
for the next 1024 blocks; the 13th pointer in the inode structure
tracks the location of this indirect block. Should that space not suffice,
the 14th pointer is used for a double-indirect block - a block holding
pointers to indirect blocks. Finally, the 15th pointer will be used for a
triple-indirect block if need be.
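As a rough illustration of this scheme, the following sketch works out how many indirect blocks must be read to reach a given block of a file, and how many blocks the whole arrangement can map. The names NDIRECT, indirection_level(), and max_mappable_blocks() are made up for this example; the real ext3 code is organized differently.

```c
#include <assert.h>
#include <stdint.h>

#define NDIRECT 12u     /* direct block pointers held in the inode */

/*
 * Number of indirect-block lookups (0 to 3) needed to reach logical block
 * `lblk`, or -1 if the block lies beyond the triple-indirect range.
 * ptrs_per_block is block_size / 4 for 32-bit pointers (1024 for 4K blocks).
 */
static int indirection_level(uint64_t lblk, uint64_t ptrs_per_block)
{
    uint64_t span = ptrs_per_block;     /* blocks covered at this level */
    int level;

    assert(ptrs_per_block > 0);
    if (lblk < NDIRECT)
        return 0;
    lblk -= NDIRECT;
    for (level = 1; level <= 3; level++) {
        if (lblk < span)
            return level;
        lblk -= span;
        span *= ptrs_per_block;
    }
    return -1;
}

/* Total number of blocks reachable through all fifteen pointers. */
static uint64_t max_mappable_blocks(uint64_t n)
{
    return NDIRECT + n + n * n + n * n * n;
}
```

For 4K blocks, max_mappable_blocks(1024) comes to a little over 2^30 blocks, which is where a per-file ceiling of roughly 4 TB comes from.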
This arrangement is not too different from how Unix systems structured
filesystems two decades or more ago. It imposes a per-file maximum size of
about 4 TB - big, but perhaps limiting for today's hot applications
(such as comprehensive, nationwide telephone call archival). It works well
for small files but, as files get larger, this organization becomes
increasingly inefficient. Keeping a pointer to every single block is
expensive, both in terms of space usage and the time it can take to
locate a specific file block. Since larger filesystems will tend to hold
larger files, this overhead becomes increasingly limiting over time.
A solution to these problems can be found in the extents and 48-bit support patch
set. These patches have been posted by Mingming Cao; many other
developers - especially Alex Tomas - have worked on them as well. They
change the way files are stored to make things more efficient, and to allow
the filesystem to index the blocks on larger devices.
The core of the patch is the support for extents. An extent is simply a
set of blocks which are logically contiguous within the file and also on
the underlying block device. Most contemporary filesystems put
considerable effort into allocating contiguous blocks for files as a way of
making I/O operations faster, so blocks which are logically contiguous
within the file often are also contiguous on-disk. As a result, storing
the file structure as extents should result in significant compression of
the file's metadata, since a single extent can replace a large number of
block pointers. The reduction in metadata should enable faster access as
well.
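To see how much compression is possible, consider a sketch which turns a per-block map into a list of extents. The struct extent_sketch type and the blocks_to_extents() function are hypothetical, in-memory illustrations; they are not the patch's on-disk format, and the sketch assumes a file with no holes and ignores any limit on extent length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct extent_sketch {          /* illustrative only, not the on-disk layout */
    uint32_t logical;           /* first logical block covered */
    uint32_t len;               /* number of blocks covered */
    uint64_t physical;          /* first physical block */
};

/*
 * phys[i] holds the physical block backing logical block i. Runs of
 * physically consecutive blocks are merged into single extents. Returns the
 * number of extents written; stops early if `out` fills up.
 */
static size_t blocks_to_extents(const uint64_t *phys, size_t nblocks,
                                struct extent_sketch *out, size_t max_out)
{
    size_t n = 0;

    assert(phys != NULL && out != NULL);
    for (size_t i = 0; i < nblocks; i++) {
        /* Extend the current extent if this block follows it on disk. */
        if (n > 0 && phys[i] == out[n - 1].physical + out[n - 1].len) {
            out[n - 1].len++;
            continue;
        }
        if (n == max_out)
            break;              /* no room left; caller sees a short list */
        out[n].logical = (uint32_t)i;
        out[n].len = 1;
        out[n].physical = phys[i];
        n++;
    }
    return n;
}
```

A seven-block file laid out as blocks 100-103, 500-501, and 9 needs seven block pointers but only three extents.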
An ext3 filesystem mounted with the extents option enabled will handle files stored
in the old way, using block pointers, as always. New files will be created
using extents, however. In these files, the fifteen-pointer array
described above is overlaid with a new data structure. There is a short
header, followed by a few occurrences of this structure:
    struct ext3_extent {
        __le32 ee_block;    /* first logical block extent covers */
        __le16 ee_len;      /* number of blocks covered by extent */
        __le16 ee_start_hi; /* high 16 bits of physical block */
        __le32 ee_start;    /* low 32 bits of physical block */
    };
Here, ee_block is the index (within the file, not on disk) of the
first block covered by this extent. The number of blocks in the extent is
stored in ee_len, and the pointer to the first of those blocks (on
disk, now) lives in the combination of ee_start and
ee_start_hi. By storing physical block numbers this way, ext3 can
handle 48-bit block numbers - enough to index a 1024 PB device. That
should be enough to last for a couple of years or so.
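Putting the two halves of the physical block number together might look something like the following. This is a sketch, not code from the patch set: struct extent_fields is a stand-in which assumes the little-endian on-disk fields have already been converted to host byte order, and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct extent_fields {          /* host-endian copy of the on-disk fields */
    uint32_t ee_block;
    uint16_t ee_len;
    uint16_t ee_start_hi;
    uint32_t ee_start;
};

/* Combine the high 16 and low 32 bits into one 48-bit block number. */
static uint64_t extent_start(const struct extent_fields *e)
{
    return ((uint64_t)e->ee_start_hi << 32) | e->ee_start;
}

/* Split a block number back into its two fields. */
static void extent_set_start(struct extent_fields *e, uint64_t pblk)
{
    assert((pblk >> 48) == 0);  /* must fit in 48 bits */
    e->ee_start = (uint32_t)(pblk & 0xffffffffu);
    e->ee_start_hi = (uint16_t)(pblk >> 32);
}
```

The cast to a 64-bit type before the shift matters; shifting the 16-bit field directly would lose the high bits.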
For files with few extents, all of the information can be stored within the
on-disk inode itself. As the number of extents grows, however, the
available space runs out. In that case, a form of indirect blocks is used;
the in-inode extents array describes ranges of blocks holding extents
arrays of their own. The tree of indirect extents blocks can grow to an
essentially unlimited depth, allowing the filesystem to represent even very
large, highly-fragmented files.
Beyond extents, relatively little had to be done to prepare ext3 for 48-bit
block addressing. The signed, 32-bit block numbers are gone, having been
converted to the larger sector_t type. Some reserved space in the
ext3 superblock has been grabbed to store the high 16 bits of some global
block counts. Much of the tracking of free blocks within the filesystem is
done using block numbers relative to the beginning of the block group, so
that code did not need to change much at all. A few tweaks to the
journaling code were required for it to be able to handle the larger block
numbers.
The end result is an enhancement to the ext3 filesystem which enables it to
work with much larger devices. Existing filesystems can use the new
features immediately with no dump-and-restore cycle. It would appear to be
(nearly) universally agreed that these changes turn ext3 into a better filesystem.
Whether that better filesystem should still be called ext3 is
controversial, but that is a subject for another article.
As described in the article above, patches which
add extents and 48-bit capability to the ext3 filesystem have been
circulated for wider review. Everybody seems to agree that these changes
are good and should be part of the Linux kernel. Well, almost everybody
agrees. But the way in which
these features get in has become the inspiration for an extended discussion
on how filesystem development should work.
Some developers, most prominently Jeff Garzik, have expressed concerns about merging these changes
into ext3; they would rather see a new ext4 filesystem created for new
features. There are a number of reasons put forward for doing things this
way. First and foremost, perhaps, is the fact that using the extents/48-bit
features results in filesystems which are no longer backward compatible.
If a system administrator enables extents on a filesystem, a special
"incompatible feature" flag will be set in the filesystem superblock.
Thereafter, it will no longer be possible to mount that filesystem with any
older kernel which does not recognize that flag. Until now, it has
generally been possible to mount ext3 filesystems on older kernels - even
those which only support ext2 (with one ugly exception involving a
distributor which was heavily pushing SELinux features).
The overall effect of all these changes on filesystem stability is also a
concern. Filesystems are important, and users tend to take a very dim view
of "upgrades" which introduce bugs or impact performance. As Linus puts it:
For me, the biggest cost tends to actually be support. A stable
filesystem that is used by thousands and thousands of people and
that isn't actually developed outside of just maintaining it IS A
REALLY GOOD THING TO HAVE.
The incorporation of major features into ext3 certainly takes it out of the
"just maintaining it" realm.
As more features are added, the filesystem code (which must support
filesystems both with and without that feature) gets more complicated. In
particular, one sees increasing amounts of code which looks like:
    if (has_this_fancy_feature)
        do_it_the_fancy_way();
    else
        do_it_the_old_boring_way();
Such code can be harder to follow, and it tends not to isolate the new
feature code as nicely as one might like. If, instead, one were to put the
new features into a new filesystem, a lot of these conditionals could be
taken out.
Finally, it is said that the need to be so careful about backward
compatibility is a drag on filesystem development. By separating
development filesystems from those which are meant to be stable, the
developers can push forward with the new capabilities they would like to
implement. For practical examples, consider the separation of ext2 and
ext3, the separation of the SMB and CIFS filesystems, and the creation of
libata rather than shoehorning serial ATA support
into the old ATA drivers.
Needless to say, the ext3 developers have their own take on all of this.
A filesystem with the new features will not work on older kernels
regardless of whether it is called ext3 or ext4. Since a feature like
extents must be explicitly enabled by the system administrator (assuming
the distributor does not quietly do it for them), nobody should be
surprised by a filesystem which no longer works on older systems. Pushing
the new features into an ext4 would simply slow their uptake without buying
much else.
While some think that splitting out development into a new filesystem will
ease code maintenance, others are less sure. In particular, there is worry
that bugs fixed in one of the filesystems may not get fixed in the other.
It has been noted, repeatedly, that users very much like to be able to get
new features into their filesystems without having to back up and restore
the whole thing. The transition from ext2 to ext3 is a clear example of
how this can work; if moving to ext3 had required restoring the filesystem
from scratch, ext3 would have been adopted much more slowly, and less
universally, than it was. As this example shows, however, putting new
features into a new ext4 filesystem would not necessarily preclude this
sort of upgrade.
The ext3 developers also point out that they have been working on that
filesystem for many years and have not yet created big problems for the
Linux user base. They have, they feel, earned a certain amount of trust.
So they would prefer to move ahead with some features which have been put
together with great care and extensive review, rather than cloning ext3 into
ext4 and starting something new.
An attempt to guess how all this might settle out could start with these words from Linus:
Quite frankly, at this point, there's no way in hell I believe we
can do major surgery on ext3. It's the main filesystem for a lot of
users, and it's just not worth the instability worries unless it's
something very obviously transparent.
Yet another point of view comes from Andrew Morton:
All that being said, Linux's filesystems are looking increasingly
crufty and we are getting to the time where we would benefit from a
greenfield start-a-new-one. That new one might even be based on
reiser4 - has anyone looked? It's been sitting around for a couple
of years.
As reiser4 shows, getting a truly new filesystem into the kernel isn't
necessarily an easy thing to do. It may well not happen before large
numbers of users start running into the current limits of ext3. So the
current set of enhancements will probably find its way in - though what the
resulting filesystem will be called is still not entirely clear.
"Resource" is the term used within the Linux kernel for a specific set of
I/O-related hardware resources - I/O memory and ports, in particular.
Device drivers allocate specific resources with functions like
request_region(), but, underneath that layer, Linux has a set of
generic resource allocation utilities. And at the core of that code is
struct resource, which tracks individual resource allocations. A
read of /proc/iomem or /proc/ioports is really just
dumping out one of those resource data structures.
Since the resource management code was added by Linus at the
beginning of the 2.3 development cycle, the unsigned long type has
been used to track actual resource values. That worked at the time, but it
can be problematic on 32-bit machines which have I/O memory resources at
high addresses. If a memory region is located out of the 32-bit range, the
resource management code can no longer deal with it.
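The failure mode can be modeled in user space. In this sketch, ulong32_sketch_t stands in for unsigned long on a 32-bit machine and resource_size_sketch_t for a 64-bit resource value; both names, and the helper functions, are made up for illustration and do not appear in the kernel.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ulong32_sketch_t;        /* unsigned long on a 32-bit box */
typedef uint64_t resource_size_sketch_t;  /* a 64-bit resource value */

static_assert(sizeof(ulong32_sketch_t) == 4, "model assumes a 32-bit long");

/* Store an address the way 32-bit resource code would. */
static ulong32_sketch_t store_as_ulong32(resource_size_sketch_t addr)
{
    return (ulong32_sketch_t)addr;  /* the high 32 bits are silently lost */
}

/* Nonzero if the address comes back unchanged after 32-bit storage. */
static int survives_32bit_storage(resource_size_sketch_t addr)
{
    return (resource_size_sketch_t)store_as_ulong32(addr) == addr;
}
```

An I/O memory region just below 4 GB survives the round trip, but one located at or above 4 GB comes back as a different, lower address.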
The solution, of course, is to start using 64-bit numbers to track resource
allocations. Vivek Goyal (along with others) has been working for some
time on a set of patches
which makes this change. Those patches have been fixed up by Greg
Kroah-Hartman and, by all appearances, are set for merging once the 2.6.18
development cycle starts.
Introducing new typedefs into the kernel is generally frowned upon, but
this patch creates resource_size_t anyway. Early in the patch
series, this type is just unsigned long; only when the type has
propagated through the source is it changed to a 64-bit value. There is a
configuration option controlling whether 64-bit resources are used;
interestingly, 64-bit is the default, and the old 32-bit mode is marked
"experimental."
The result of the change is that the prototypes for some commonly-used
functions change:
    struct resource *request_region(resource_size_t start,
                                    resource_size_t n,
                                    const char *name);
    void release_region(resource_size_t start, resource_size_t n);
    struct resource *request_mem_region(resource_size_t start,
                                        resource_size_t n,
                                        const char *name);
    void release_mem_region(resource_size_t start, resource_size_t n);
Most driver code will be unaffected by these changes; simple constant
resource locations will still work, and, in many cases, the bus layer
handles the details of resource allocation anyway. But, in cases where a
driver is directly storing or working with resource locations, the new type
will have to be used.
Page editor: Jonathan Corbet