By Jonathan Corbet
July 15, 2008
One likes to think of disk drives as being a reliable store of data. As
long as nothing goes so wrong as to let the smoke out of the device, blocks
written to the disk really should come back with the same bits set in the
same places. The reality of the situation is a bit less encouraging,
especially when one is dealing with the sort of hardware which is available
at the local computer store. Stories of blocks which have been corrupted,
or which have been written to a location other than the one which was
intended, are common.
For this reason, there is steady interest in filesystems which use
checksums on data stored
to block devices. Rather than take the device's word that it successfully
stored and retrieved a block, the filesystem can compare checksums and be sure. A
certain amount of checksumming is also done by paranoid applications in
user space. The checksums used by BitKeeper are said to have caught a
number of corruption problems; successor tools like git have checksums
wired deeply into their data structures. If a disk drive corrupts a git
repository, users will know about it sooner rather than later.
Checksums are a useful tool, but they have one minor problem: checksum
failures tend to come when they are too late to be useful. By the time a
filesystem or application notices that a disk block isn't quite what it
once was, the original data may be long gone and unrecoverable. But disk
block corruption often happens in the process of getting the data to the
disk; it would sure be nice if the disk itself could use a checksum to
ensure that (1) the data got to the disk intact, and (2) the disk
itself hasn't mangled it.
To that end, a few standards groups have put together schemes for the
incorporation of data integrity checking into the hardware itself. These
mechanisms generally take the form of an additional eight-byte checksum
attached to each 512-byte block. The host system generates the checksum
when it prepares a block for writing to the drive; that checksum will
follow the data through the series of host controllers, RAID
controllers, network fabrics, etc., with the hardware verifying the
checksum along each step of the way. The checksum is stored with the data,
and, when the data is read in the future, the checksum travels back with
it, once again being verified at each step. The end result should be that
data corruption problems are caught immediately, and in a way which
identifies which component of the system is at fault.
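In the SCSI world, this scheme goes by the name of the data integrity field (DIF). As a purely illustrative sketch (the structure name below is invented for this example; the kernel's own definitions live in the integrity patches), the eight bytes attached to each sector are laid out roughly like this:

    #include <linux/types.h>

    /*
     * Illustrative layout of the eight bytes of integrity data which
     * accompany each 512-byte sector under the SCSI DIF scheme.
     */
    struct dif_tuple {
            __be16 guard_tag;   /* checksum (CRC) of the sector's data */
            __be16 app_tag;     /* space available to the OS or application */
            __be32 ref_tag;     /* typically the low bits of the sector number */
    };

The reference tag, in particular, is what allows a misdirected write - data landing in the wrong place - to be caught, since it ties each tuple to a specific sector.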
Needless to say, this integrity mechanism requires operating system
support. As of the 2.6.27 kernel, Linux will have such support, at least
for SCSI and SATA drives, thanks to Martin Petersen. The well-written documentation file included with the data
integrity patches envisions three places where checksum generation and
verification can be performed: in the block layer, in the filesystem, and
in user space. Truly end-to-end protection seems to need user-space
verification, but, for now, the emphasis is on doing this work in the block
layer or filesystem - though, as of this writing, no integrity-aware
filesystems exist in the mainline repository.
Drivers for block devices which can manage integrity data need to register
some information with the block layer. This is done by filling in a
blk_integrity structure and passing it to
blk_integrity_register(). See the document for the full details;
in short, this structure contains two function pointers:
generate_fn(), which generates a checksum for a block of data, and
verify_fn(), which verifies one. There are also functions for
attaching a tag to a block - a feature supported by some drives. The data
stored in the tag can be used by filesystem-level code to, for example,
ensure that the block is really part of the file it is supposed to belong
to.
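As a rough sketch of how registration might look (the mydrv_* names are invented for this example, and the callbacks are shown in simplified form; the documentation file has the authoritative prototypes):

    #include <linux/blkdev.h>
    #include <linux/genhd.h>

    /* Compute a checksum for each sector described by the exchange
     * structure, storing one tuple per sector in bix->prot_buf. */
    static void mydrv_generate_fn(struct blk_integrity_exchg *bix)
    {
            /* ... fill bix->prot_buf from bix->data_buf ... */
    }

    /* Recompute and compare the checksums; return zero on success. */
    static int mydrv_verify_fn(struct blk_integrity_exchg *bix)
    {
            /* ... compare bix->prot_buf against bix->data_buf ... */
            return 0;
    }

    static struct blk_integrity mydrv_integrity = {
            .name        = "MYDRV-EXAMPLE-DIF",
            .generate_fn = mydrv_generate_fn,
            .verify_fn   = mydrv_verify_fn,
            .tuple_size  = 8,   /* eight bytes of integrity data per sector */
            .tag_size    = 0,   /* no per-sector tag space in this sketch */
    };

    /* Called once the gendisk for the device has been set up. */
    static int mydrv_register_integrity(struct gendisk *disk)
    {
            return blk_integrity_register(disk, &mydrv_integrity);
    }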
The block layer will, in the absence of an integrity-aware filesystem,
prepare and verify checksum data itself. To that end, the bio
structure has been extended with a new bi_integrity field, pointing
to a bio_integrity_payload structure which holds a bio_vec describing
the checksum information along with some additional housekeeping. Happily, the integrity
standards were written to allow the checksum information to be stored
separately from the actual data; the alternative would have been to modify
the entire Linux memory management system to accommodate that information.
The bi_integrity area is where that information goes;
scatter/gather DMA operations are used to transfer the checksum and data
to and from the drive together.
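In simplified form (the structure below is a sketch; the real payload in the patches carries more housekeeping fields than are shown here), what hangs off bi_integrity looks something like this:

    #include <linux/bio.h>

    /*
     * Simplified view of the integrity payload attached to a bio.
     */
    struct bio_integrity_payload_sketch {
            struct bio     *bip_bio;     /* the data bio this belongs to */
            struct bio_vec *bip_vec;     /* pages holding the integrity tuples */
            unsigned short  bip_vcnt;    /* number of integrity segments */
            sector_t        bip_sector;  /* starting sector, for reference tags */
    };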
Integrity-aware filesystems, when they exist, will be able to take over the
generation and verification of checksum data from the block layer. A call
to bio_integrity_prep() will prepare a given bio
structure for integrity verification; it's then up to the filesystem to
generate the checksum (for writes) or check it (for reads). There's also a
set of functions for managing the tag data; again, see the document for the
details.
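Based on that description, a hypothetical write path in an integrity-aware filesystem might look roughly like this (myfs_submit_write() is an invented name used only for illustration):

    #include <linux/bio.h>
    #include <linux/fs.h>

    /*
     * Hypothetical example: prepare a bio for integrity protection
     * before submitting a write.  bio_integrity_prep() attaches the
     * integrity payload; filling in the checksums and any tag data
     * for the write is then the filesystem's job.
     */
    static int myfs_submit_write(struct bio *bio)
    {
            int ret = bio_integrity_prep(bio);

            if (ret)
                    return ret;

            /* ... generate checksums and tag data for each sector ... */

            submit_bio(WRITE, bio);
            return 0;
    }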
Extended partitions
One of the more annoying and long-lived irritations in the Linux block layer
has been the limit on the number of partitions which can be created on any
one device. IDE devices can handle up to 64 partitions, which is usually
enough, but SCSI devices can only manage 16 - including one reserved for
the full device. As these devices get larger, and as applications which
benefit from filesystem isolation (virtualization, for example) become more
popular, this limit only becomes more irksome.
The interesting thing is that the work needed to circumvent this problem
was done some years ago when device numbers were extended to 32 bits. Some
complicated schemes were
proposed back in 2004 as a way of extending the number of partitions while
not changing any existing device numbers, but that approach was never
adopted. In the meantime, increasing use of tools like udev has
pretty much eliminated the need for device number compatibility; on most
distributions, there are no persistent device files anymore.
So when Tejun Heo revisited the
partition limit problem, he didn't bother with obscure bit-shuffling
schemes. Instead, with his patch set, block devices simply move to a new
major device number and have all minor numbers dynamically assigned. That
means that no block device has a stable (across boots) number; it also
means that the minor numbers for partitions on the same device are not
necessarily grouped together. But, since nobody really ever sees the
device numbers on a contemporary distribution, none of this should matter.
Tejun's patch series is an interesting exercise in slowly evolving an
interface toward a final goal, with a number of intermediate states. In
the end, the API as seen by block drivers changes very little. There is a
new flag (GENHD_FL_EXT_DEVT) which allows the disk to use extended
partition numbers; once the number of minor numbers given to
alloc_disk() is exhausted, any additional partitions will be
numbered in the extended space. The intended use, though, would appear to
be to allocate no traditional minor numbers at all - allocating disks with
alloc_disk(0) - and creating all partitions in that extended
space. Tejun's patch causes both the IDE and sd drivers to allocate
gendisk structures in that way, moving all disks on most systems
into the (shared) extended number space.
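Following that intended usage, a driver opting into the extended space might do something like the following (mydrv_alloc_gendisk() is, again, an invented name for the purposes of this sketch):

    #include <linux/genhd.h>

    /*
     * Sketch of a driver opting into the extended device number space:
     * allocate no traditional minor numbers at all and mark the disk
     * with GENHD_FL_EXT_DEVT, so that the device and all of its
     * partitions are numbered dynamically in the extended space.
     */
    static struct gendisk *mydrv_alloc_gendisk(void)
    {
            struct gendisk *disk = alloc_disk(0);

            if (!disk)
                    return NULL;

            disk->flags |= GENHD_FL_EXT_DEVT;
            /* major and minor numbers are assigned dynamically when
             * the disk is added */
            return disk;
    }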
Even though modern distributions are comfortable with dynamic device
numbers (and names, for that matter), it seems hard to imagine that a
change like this would be entirely free of systems management problems
across the full Linux user base. Distributors may still be a little
nervous from the grief they took after the shift to the PATA drivers
changed drive names on installed systems. So it's not really clear when
Tejun's patches might make it into the mainline, or when distributors would
make use of that functionality. The pressure for more partitions is
unlikely to go away, though, so these patches may find their way in before
too long.