
Making FIEMAP and delayed allocation play well together

By Jonathan Corbet
February 22, 2011
The FIEMAP ioctl() command can be used to learn about how a file's blocks are laid out on the disk. It's useful for determining fragmentation, optimizing boot-time readahead order, and a number of other things. One of those other things, though, has turned up bugs in how a couple of important filesystems implement FIEMAP.
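As a rough illustration of the interface (not taken from any of the code discussed here), a FIEMAP request can be built and decoded from Python along the following lines. The structure layouts are transcribed from the kernel's <linux/fiemap.h>; the helper names are invented for this sketch, and the FS_IOC_FIEMAP value assumes the common ioctl number encoding used on x86 and ARM.

```python
import fcntl
import struct

# _IOWR('f', 11, struct fiemap) with the common ioctl encoding (assumption:
# some architectures encode the direction bits differently).
FS_IOC_FIEMAP = 0xC020660B

# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
FIEMAP_HEADER = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3]
FIEMAP_EXTENT = struct.Struct("=QQQQQIIII")


def build_request(start, length, flags, extent_count):
    """Return a mutable buffer holding a request with room for extent_count extents."""
    header = FIEMAP_HEADER.pack(start, length, flags, 0, extent_count, 0)
    return bytearray(header + bytes(FIEMAP_EXTENT.size * extent_count))


def parse_reply(buf):
    """Decode the extents the kernel filled in as (logical, physical, length, flags)."""
    _, _, _, mapped, capacity, _ = FIEMAP_HEADER.unpack_from(buf, 0)
    extents = []
    for i in range(min(mapped, capacity)):
        offset = FIEMAP_HEADER.size + i * FIEMAP_EXTENT.size
        fields = FIEMAP_EXTENT.unpack_from(buf, offset)
        logical, physical, length, flags = fields[0], fields[1], fields[2], fields[5]
        extents.append((logical, physical, length, flags))
    return extents


def fiemap(fd, extent_count=32, flags=0):
    """Map the whole file; returns at most extent_count extents."""
    buf = build_request(0, 0xFFFFFFFFFFFFFFFF, flags, extent_count)
    fcntl.ioctl(fd, FS_IOC_FIEMAP, buf)  # the buffer is updated in place
    return parse_reply(buf)
```

A caller would pass the descriptor of an open regular file, for example fiemap(f.fileno()); a file with more extents than extent_count would need repeated calls starting after the last extent returned.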

The cp application, it seems, has recently been taught to use FIEMAP to find holes in files. The idea is to optimize the copying of such files by not even reading the holes; that way, the need to zero-fill pages (in the kernel) and compare them against pages full of zeros (in user space) can be eliminated. It seems like a better way of doing things.
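The hole-skipping idea can be sketched as follows. This is not the coreutils implementation, just a minimal illustration that assumes the caller already has a list of (logical offset, length) extents for the source file; both function names are hypothetical.

```python
def find_holes(extents, file_size):
    """Return (offset, length) ranges of the file not covered by any extent."""
    holes = []
    pos = 0
    for logical, length in sorted(extents):
        logical = min(logical, file_size)
        if logical > pos:
            holes.append((pos, logical - pos))
        pos = max(pos, min(logical + length, file_size))
    if pos < file_size:
        holes.append((pos, file_size - pos))
    return holes


def sparse_copy(src, dst, extents, file_size, chunk=1 << 16):
    """Copy only the mapped ranges, leaving everything else as a hole in dst."""
    for logical, length in sorted(extents):
        src.seek(logical)
        dst.seek(logical)
        # Extents are block-aligned and may run past the end of the file.
        remaining = min(length, file_size - logical)
        while remaining > 0:
            data = src.read(min(chunk, remaining))
            if not data:
                break
            dst.write(data)
            remaining -= len(data)
    # Set the final size so that a trailing hole is preserved too.
    dst.truncate(file_size)
```

Everything here depends on the extent list being complete: any written range missing from it is silently treated as a hole and comes out as zeros in the copy.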

Somewhere along the way, Chris Mason got word that cp was corrupting files on btrfs filesystems. The problem, naturally enough, was that FIEMAP was reporting holes where none should exist. The root cause was that FIEMAP was not prepared to deal with regions of a file which have been written to, but which do not actually have blocks assigned yet. The delayed allocation mechanism used by most contemporary filesystems will create exactly that kind of situation, so this is not a theoretical concern.
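The FIEMAP interface does have a way to describe such regions: an extent can carry the FIEMAP_EXTENT_DELALLOC flag, and a caller can pass FIEMAP_FLAG_SYNC to ask that the file be synced before it is mapped. A cautious user-space caller might therefore apply a check like the one below before trusting the map. The flag values come from <linux/fiemap.h>; the function name and the fall-back policy are assumptions made for this sketch, and a check of this kind cannot help when a filesystem leaves the delayed-allocation range out of its reply altogether.

```python
# Request flag: sync the file before mapping it.
FIEMAP_FLAG_SYNC = 0x0001

# Per-extent flags.
FIEMAP_EXTENT_LAST = 0x0001      # last extent in the file
FIEMAP_EXTENT_UNKNOWN = 0x0002   # data location is not known
FIEMAP_EXTENT_DELALLOC = 0x0004  # written, but no blocks allocated yet


def can_trust_layout(extents):
    """Decide whether gaps between extents may be treated as real holes.

    extents is a list of (logical, physical, length, flags) tuples.
    Returns False whenever the map looks incomplete or contains ranges
    without allocated blocks; the caller should then read the file normally.
    """
    if not extents:
        # Conservative choice: an empty map gives nothing to verify.
        return False
    unsettled = FIEMAP_EXTENT_UNKNOWN | FIEMAP_EXTENT_DELALLOC
    if any(flags & unsettled for _, _, _, flags in extents):
        return False
    # Without the LAST flag the map was truncated by the request size.
    return bool(extents[-1][3] & FIEMAP_EXTENT_LAST)
```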

Chris fixed the problem for btrfs, then decided to see how other filesystems handled the same situation. From his report, xfs handled things well, but ext4 had similar bugs in situations where delayed allocation and real holes came together in the same file. Certain types of bugs, it seems, are likely to turn up in more than one context.

Chris's fix should get into 2.6.38 before the final release; chances are good that an ext4 fix will be fast-tracked as well. Expect stable kernel backports too. In the meantime, be careful when copying recently-written files with new versions of cp on those filesystems.

Index entries for this article
Kernel: FIEMAP ioctl()



Making FIEMAP and delayed allocation play well together

Posted Feb 24, 2011 15:26 UTC (Thu) by dberkholz (guest, #23346) [Link]

Thanks so much for posting this, I've been hitting this problem for the past few weeks and was totally puzzled. Now I see it correlates with when I installed coreutils 8.10.

Making FIEMAP and delayed allocation play well together

Posted Feb 24, 2011 21:20 UTC (Thu) by dougg (guest, #1894) [Link] (4 responses)

Been looking at the FIEMAP ioctl description in the kernel documentation. That should frighten away any user space folks thinking of using it. For a start it could define what an "extent" is.
Anyway I have a different angle. Will FIEMAP work when a file is opened O_DIRECT? What about when the file is a partition or a disk (with or without O_DIRECT)? When a SCSI disk is opened O_DIRECT the FIEMAP ioctl could map through to the SCSI GET LBA STATUS command. Most likely I'm just dreaming.

Making FIEMAP and delayed allocation play well together

Posted Feb 24, 2011 21:26 UTC (Thu) by corbet (editor, #1) [Link] (3 responses)

An extent is a group of blocks in a file laid out contiguously on disk by the filesystem. It's a filesystem concept, which is what is needed to answer your questions. O_DIRECT shouldn't change anything. If your file descriptor is for a partition or a block device, there's no filesystem, so FIEMAP will make no sense. And FIEMAP cannot possibly map to a low-level SCSI operation, since there is no filesystem knowledge at that level.
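To make the definition concrete, here is a small sketch that counts how many physically contiguous runs a file occupies and checks that a descriptor refers to a regular file before any mapping is attempted. Both helper names are hypothetical, and restricting the check to regular files is a simplification for this example.

```python
import os
import stat


def is_regular_file(fd):
    """True if fd refers to a regular file rather than a device, pipe, etc."""
    return stat.S_ISREG(os.fstat(fd).st_mode)


def count_fragments(extents):
    """Count runs of extents that are adjacent both in the file and on disk.

    extents is a list of (logical, physical, length) tuples.
    """
    fragments = 0
    prev_logical_end = prev_physical_end = None
    for logical, physical, length in sorted(extents):
        if logical != prev_logical_end or physical != prev_physical_end:
            fragments += 1
        prev_logical_end = logical + length
        prev_physical_end = physical + length
    return fragments
```

A file laid out in a single run gives a count of 1; each additional discontinuity adds one.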

Making FIEMAP and delayed allocation play well together

Posted Feb 24, 2011 21:57 UTC (Thu) by razb (guest, #43424) [Link] (1 responses)

I hacked on raid5+xfs several years ago. From time to time I ran into delayed allocation, which was a huge headache. Question: can we prevent delayed allocation?

Making FIEMAP and delayed allocation play well together

Posted Feb 25, 2011 16:32 UTC (Fri) by nix (subscriber, #2304) [Link]

Delayed allocation is a really useful technique to (among other things) keep fragmentation down and increase the size of contiguous writes hitting the disk to something closer to the umpty-megabytes-at-once which the disk would actually prefer. It's better fixed than ripped out, I'd say.

Making FIEMAP and delayed allocation play well together

Posted Feb 24, 2011 22:19 UTC (Thu) by dougg (guest, #1894) [Link]

cp seems to be using FIEMAP on the src file to detect sparseness (i.e. holes) so it doesn't waste time reading what may be a lot of zeros. Unix already has good support for generating a sparse dst (unless dst is being overwritten). Next consider 'cp /dev/sda /dev/sdb' (don't try that at home) with unmapped (aka trimmed) blocks on /dev/sda. The SCSI GET LBA STATUS command on /dev/sda would play the same role as FIEMAP.


Copyright © 2011, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds