
Block layer: integrity checking and lots of partitions

By Jonathan Corbet
July 15, 2008
One likes to think of disk drives as being a reliable store of data. As long as nothing goes so wrong as to let the smoke out of the device, blocks written to the disk really should come back with the same bits set in the same places. The reality of the situation is a bit less encouraging, especially when one is dealing with the sort of hardware which is available at the local computer store. Stories of blocks which have been corrupted, or which have been written to a location other than the one which was intended, are common.

For this reason, there is steady interest in filesystems which use checksums on data stored to block devices. Rather than take the device's word that it successfully stored and retrieved a block, the filesystem can compare checksums and be sure. A certain amount of checksumming is also done by paranoid applications in user space. The checksums used by BitKeeper are said to have caught a number of corruption problems; successor tools like git have checksums wired deeply into their data structures. If a disk drive corrupts a git repository, users will know about it sooner rather than later.

Checksums are a useful tool, but they have one problem: checksum failures tend to come too late to be useful. By the time a filesystem or application notices that a disk block isn't quite what it once was, the original data may be long gone and unrecoverable. But disk block corruption often happens in the process of getting the data to the disk; it would sure be nice if the disk itself could use a checksum to ensure that (1) the data got to the disk intact, and (2) the disk itself hasn't mangled it.

To that end, a few standards groups have put together schemes for the incorporation of data integrity checking into the hardware itself. These mechanisms generally take the form of an additional eight-byte checksum attached to each 512-byte block. The host system generates the checksum when it prepares a block for writing to the drive; that checksum will follow the data through the series of host controllers, RAID controllers, network fabrics, etc., with the hardware verifying the checksum along each step of the way. The checksum is stored with the data, and, when the data is read in the future, the checksum travels back with it, once again being verified at each step. The end result should be that data corruption problems are caught immediately, and in a way which identifies which component of the system is at fault.
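
Concretely, the SCSI version of the scheme (the T10 Data Integrity Field, or DIF) splits those eight bytes into a checksum ("guard tag"), a free-form application tag, and a reference tag that ties the block to its intended location on the drive. The sketch below is descriptive only and is not copied from any kernel header:

/*
 * Rough sketch of the eight bytes of protection information attached to
 * each 512-byte sector under T10 DIF; field names are descriptive.
 */
#include <linux/types.h>

struct my_dif_tuple {
        __be16 guard_tag;       /* checksum (CRC) of the 512 data bytes */
        __be16 app_tag;         /* free for use by the owner of the data */
        __be32 ref_tag;         /* typically the low 32 bits of the target LBA */
} __attribute__((packed));      /* eight bytes per sector, on disk and on the wire */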

Needless to say, this integrity mechanism requires operating system support. As of the 2.6.27 kernel, Linux will have such support, at least for SCSI and SATA drives, thanks to Martin Petersen. The well-written documentation file included with the data integrity patches envisions three places where checksum generation and verification can be performed: in the block layer, in the filesystem, and in user space. Truly end-to-end protection seems to need user-space verification, but, for now, the emphasis is on doing this work in the block layer or filesystem - though, as of this writing, no integrity-aware filesystems exist in the mainline repository.

Drivers for block devices which can manage integrity data need to register some information with the block layer. This is done by filling in a blk_integrity structure and passing it to blk_integrity_register(). See the document for the full details; in short, this structure contains two function pointers. generate_fn() generates a checksum for a block of data, and verify_fn() will verify a checksum. There are also functions for attaching a tag to a block - a feature supported by some drives. The data stored in the tag can be used by filesystem-level code to, for example, ensure that the block is really part of the file it is supposed to belong to.
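
As a rough illustration, a driver-side registration might look like the sketch below. It is modeled on the interface described in the documentation; my_generate(), my_verify(), the "MYDEV-DIF-CRC" profile name, and struct my_dif_tuple (the eight-byte tuple sketched above) are all placeholders, and the exact structure fields and callback signatures should be checked against the kernel headers of the day.

#include <linux/blkdev.h>
#include <linux/crc-t10dif.h>

/* Generate protection tuples for the data in bix->data_buf. */
static void my_generate(struct blk_integrity_exchg *bix)
{
        void *buf = bix->data_buf;
        struct my_dif_tuple *pi = bix->prot_buf;  /* tuple layout sketched above */
        sector_t sector = bix->sector;
        unsigned int i;

        for (i = 0; i < bix->data_size; i += bix->sector_size) {
                pi->guard_tag = cpu_to_be16(crc_t10dif(buf, bix->sector_size));
                pi->app_tag = 0;
                pi->ref_tag = cpu_to_be32(sector & 0xffffffff);
                buf += bix->sector_size;
                sector++;
                pi++;
        }
}

/* Verify the tuples that came back from the device; return 0 if all is well. */
static int my_verify(struct blk_integrity_exchg *bix)
{
        void *buf = bix->data_buf;
        struct my_dif_tuple *pi = bix->prot_buf;
        unsigned int i;

        for (i = 0; i < bix->data_size; i += bix->sector_size) {
                if (pi->guard_tag != cpu_to_be16(crc_t10dif(buf, bix->sector_size)))
                        return -EIO;
                buf += bix->sector_size;
                pi++;
        }
        return 0;
}

static struct blk_integrity my_integrity = {
        .name           = "MYDEV-DIF-CRC",      /* made-up profile name */
        .generate_fn    = my_generate,
        .verify_fn      = my_verify,
        .tuple_size     = 8,                    /* bytes of metadata per sector */
};

/* In the driver's probe path, once the gendisk exists:
 *      blk_integrity_register(disk, &my_integrity);
 */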

The block layer will, in the absence of an integrity-aware filesystem, prepare and verify checksum data itself. To that end, the bio structure has been extended with a new bi_integrity field, pointing to a bio_vec structure describing the checksum information and some additional housekeeping. Happily, the integrity standards were written to allow the checksum information to be stored separately from the actual data; the alternative would have been to modify the entire Linux memory management system to accommodate that information. The bi_integrity area is where that information goes; scatter/gather DMA operations are used to transfer the checksum and data to and from the drive together.
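
Conceptually, the new field hangs a separate descriptor off the bio along these lines; this is a simplified illustration rather than the kernel's actual struct bio_integrity_payload, which carries additional housekeeping (completion handling, mempool bookkeeping, and so on).

/* Simplified illustration of the integrity descriptor attached to a bio. */
struct bio_integrity_sketch {
        sector_t        sector;         /* first sector the tuples cover */
        unsigned int    vcnt;           /* number of integrity segments */
        struct bio_vec  *vec;           /* pages holding the protection tuples,
                                           mapped with their own scatter/gather
                                           list, separate from the data pages */
};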

Integrity-aware filesystems, when they exist, will be able to take over the generation and verification of checksum data from the block layer. A call to bio_integrity_prep() will prepare a given bio structure for integrity verification; it's then up to the filesystem to generate the checksum (for writes) or check it (for reads). There's also a set of functions for managing the tag data; again, see the document for the details.
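
A write path in a hypothetical integrity-aware filesystem might thus look something like the following; my_fill_checksums() stands in for filesystem-specific code, and the exact calling conventions should be taken from the documentation file.

#include <linux/bio.h>

static int my_submit_write(struct bio *bio)
{
        int ret;

        /* Attach and prepare an integrity payload for this bio. */
        ret = bio_integrity_prep(bio);
        if (ret)
                return ret;

        /* Fill in the protection tuples for every sector covered by the bio
         * (hypothetical helper standing in for filesystem-specific code). */
        my_fill_checksums(bio);

        submit_bio(WRITE, bio);
        return 0;
}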

Extended partitions

One of the more annoying and long-lived restrictions in the Linux block layer has been the limit on the number of partitions which can be created on any one device. IDE devices can handle up to 64 partitions, which is usually enough, but SCSI devices can only manage 16 - including one reserved for the full device. As these devices get larger, and as applications which benefit from filesystem isolation (virtualization, for example) become more popular, this limit only becomes more irksome.

The interesting thing is that the work needed to circumvent this problem was done some years ago when device numbers were extended to 32 bits. Some complicated schemes were proposed back in 2004 as a way of extending the number of partitions while not changing any existing device numbers, but that approach was never adopted. In the meantime, increasing use of tools like udev has pretty much eliminated the need for device number compatibility; on most distributions, there are no persistent device files anymore.

So when Tejun Heo revisited the partition limit problem, he didn't bother with obscure bit-shuffling schemes. Instead, with his patch set, block devices simply move to a new major device number and have all minor numbers dynamically assigned. That means that no block device has a stable (across boots) number; it also means that the minor numbers for partitions on the same device are not necessarily grouped together. But, since nobody really ever sees the device numbers on a contemporary distribution, none of this should matter.

Tejun's patch series is an interesting exercise in slowly evolving an interface toward a final goal, with a number of intermediate states. In the end, the API as seen by block drivers changes very little. There is a new flag (GENHD_FL_EXT_DEVT) which allows the disk to use extended partition numbers; once the number of minor numbers given to alloc_disk() is exhausted, any additional partitions will be numbered in the extended space. The intended use, though, would appear to be to allocate no traditional minor numbers at all - allocating disks with alloc_disk(0) - and creating all partitions in that extended space. Tejun's patch causes both the IDE and sd drivers to allocate gendisk structures in that way, moving all disks on most systems into the (shared) extended number space.
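
In a driver, opting into the extended space would, under this scheme, look something like the sketch below; setup of the queue, fops, name, and capacity is elided, and error handling is abbreviated.

#include <linux/genhd.h>

static int my_add_disk(void)
{
        struct gendisk *disk;

        disk = alloc_disk(0);                   /* reserve no traditional minors */
        if (!disk)
                return -ENOMEM;

        disk->flags |= GENHD_FL_EXT_DEVT;       /* partitions get dynamically
                                                   assigned extended numbers */
        /* ... set disk->fops, disk->queue, disk_name, capacity as usual ... */
        add_disk(disk);
        return 0;
}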

Even though modern distributions are comfortable with dynamic device numbers (and names, for that matter), it seems hard to imagine that a change like this would be entirely free of systems management problems across the full Linux user base. Distributors may still be a little nervous from the grief they took after the shift to the PATA drivers changed drive names on installed systems. So it's not really clear when Tejun's patches might make it into the mainline, or when distributors would make use of that functionality. The pressure for more partitions is unlikely to go away, though, so these patches may find their way in before too long.



Lots of partitions? what for?

Posted Jul 17, 2008 9:36 UTC (Thu) by smurf (subscriber, #17840) [Link]

There's LVM. It does exactly the same thing a partition does, only better (you can resize arbitrary volumes, or even move them to another disk transparently).

Thus I kind of doubt that this is much of a problem.

Lots of partitions? what for?

Posted Jul 17, 2008 10:54 UTC (Thu) by ljt (guest, #33337) [Link]

I use LVM on top of RAID5, but it is _very_ convenient to be able to slice the enormous disks into many parts.
I sliced my 4 disks into 14 partitions each, thus making 14 RAID5 volumes. I can now assign those PVs to whatever VGs I am using. It is extremely flexible.

The *only* problem I encounter is that 14 partitions is not enough: 400 Go / 14 partitions * (4-1 RAID disks) = 85 Go per volume. I would rather have had a 20 Go unit.

Lots of partitions? what for?

Posted Jul 18, 2008 7:56 UTC (Fri) by dgm (subscriber, #49227) [Link]

Just to ensure I parsed it right:

Go = Gigaoctet = French abbreviation = GB (Gigabyte) ?

Go explained

Posted Jul 18, 2008 20:57 UTC (Fri) by pr1268 (subscriber, #24648) [Link]

My recently-purchased SATA disk says "1 TB/To" and "32 MB Cache/Mo Cachette" on the retail box, so my assumption is yes, this is French. Plus, the line "Garantie limitée de 5 ans" (5-year limited warranty) would seem to confirm this.

I don't know French, but I can recognize it in written/printed text. Yes, this is a late-model Seagate consumer drive. :-)

Go explained

Posted Jul 18, 2008 21:14 UTC (Fri) by nix (subscriber, #2304) [Link]

It's French: 'octet'. (You often see this in older standards documents, 
too, which have to be clear about the number of bits in a byte.)

Lots of partitions? what for?

Posted Jul 18, 2008 9:14 UTC (Fri) by smurf (subscriber, #17840) [Link]

Well, personally I don't see any reason for having multiple flexible-sized VGs on a single
RAID in the first place, much less ~60 of them, but maybe I'm just missing something.

Lots of partitions? what for?

Posted Jul 18, 2008 22:08 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

The point is you can do that with LVM instead of partitions. Slice each disk (which is a PV and VG) into 14 LVs, make 14 RAID5 volumes out of those, assign those PVs to VGs, ...

I've always hated partitions; even before LVM existed I knew a stacked device driver was a cleaner way than having partitioning intelligence in the lowest level of kernel disk management code and a weird minor number interpretation scheme.

Originally (pre-Linux), partitions were actually in a layer beneath the kernel and that made sense for the problems that had to be solved at that time. But inside Linux, LVM (or anything else layered on top of the physical device) is the cleaner way to go.

Lots of partitions? what for?

Posted Jul 24, 2008 9:01 UTC (Thu) by eduperez (guest, #11232) [Link]

> I use LVM on top of RAID5, but it is _very_ convenient to be able to slice the enormous disks into many parts.
> I sliced my 4 disks into 14 partitions each, thus making 14 RAID5 volumes. I can now assign those PVs to whatever VGs I am using. It is extremely flexible.

Could you explain why you need to do that, please?

Lots of partitions? what for?

Posted Jul 24, 2008 13:07 UTC (Thu) by yhdezalvarez (guest, #29255) [Link]

Does LVM support write barriers already? Last time I checked it didn't. So I'm using partitions for now.

Lots of partitions? what for?

Posted Oct 16, 2008 5:29 UTC (Thu) by cortana (subscriber, #24596) [Link]

I believe it does for 'linear' mappings. But that's just my recollection from a recent LWN article on the topic, so I may be wrong.

Block layer: integrity checking and lots of partitions

Posted Jul 17, 2008 16:09 UTC (Thu) by mkp (subscriber, #45897) [Link]

Great writeup on the block integrity stuff. Just a few comments:

  • The SCSI Data Integrity Field specification is fully baked and ratified, and products are beginning to appear on the market. The T13 committee that governs SATA is currently reviewing a proposal called "External Path Protection" that is essentially SCSI DIF adapted to the ATA protocol. IOW, SATA support is a work in progress, but the block layer infrastructure has been designed to accommodate it.

  • Short of using 520-byte sectors directly (and mangling the VM), there is no "standard" for DMAing protection information to and from memory. HBA interfaces are outside the scope of the T10 SCSI committee.

    When we started this project it became obvious that separate scatterlists for data and protection information were an absolute must. Without them it would be far too intrusive to make Linux support end-to-end data integrity. So we engaged with HBA vendors to make it so. Docs available here: http://oss.oracle.com/projects/data-integrity/documentation/.

  • With regards to making filesystems integrity-aware and passing protection information to and from userland: Yep, that's next on the list. I'm hoping to be able to yak about this at the Plumbers Conference.

Block layer: integrity checking and lots of partitions

Posted Jul 19, 2008 11:01 UTC (Sat) by garloff (subscriber, #319) [Link]

The current mapping of SCSI disks is documented in sd.c:

/*
 * Device no to disk mapping:
 * 
 *       major         disc2     disc  p1
 *   |............|.............|....|....| <- dev_t
 *    31        20 19          8 7  4 3  0
 * 
 * Inside a major, we have 16k disks, however mapped non-
 * contiguously. The first 16 disks are for major0, the next
 * ones with major1, ... Disk 256 is for major0 again, disk 272 
 * for major1, ... 
 * As we stay compatible with our numbering scheme, we can reuse 
 * the well-know SCSI majors 8, 65--71, 136--143.
 */
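
Decoded, that layout amounts to the following (illustration only; these are not the kernel's own macros):

#define SD_P1(dev)      ((dev) & 0xf)                   /* bits  3..0: partition */
#define SD_DISC(dev)    (((dev) >>  4) & 0xf)           /* bits  7..4: disc      */
#define SD_DISC2(dev)   (((dev) >>  8) & 0xfff)         /* bits 19..8: disc2     */
#define SD_MAJOR(dev)   (((dev) >> 20) & 0xfff)         /* bits 31..20: major    */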

With some more bit shuffling, support for 64 partitions would be possible without breaking backwards compatibility. For this, the two upper bits of disc2 could be taken, limiting us to 4k disks per major (or 32k disks in total). That's fine; the naming scheme (sda->sdz, sdaa->sdaz, sdbX->sdzX, sdaaa->sdzzz) only works for up to 18k disks anyway.

Ugly of course ...

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds