if so, a horrible case where the file is 0101010101 would seem to be miserable (or any file that uses null as a field separator)
is there any way to tune how _many_ 0's have to be found before it's considered a hole?
while most media uses 512 byte blocks (and therefore holes can only really be in multiples of 512 bytes aligned on multiples of 512) I think it would be bad to assume that this will always be the case.
The return of SEEK_HOLE
Posted Apr 28, 2011 8:55 UTC (Thu) by peter-b (subscriber, #66996)
No. Read the fine article. SEEK_HOLE finds the next zone in the file for which storage has not yet been allocated.
For example, create a 100 MB file (using e.g. mmap()) and write to the first and last 1 KB of it, and close the file. You will find that many filesystems will not allocate 100 MB of disk space for the file; they will store the data that was written (which might require some padding) and simply note that there is a big chunk of "empty space" in the middle of the file.
When you read the file again, the filesystem will report the "empty space" as containing 0. Sometimes, you know that you can safely skip over these unallocated blocks as an optimisation when reading the file (such as in the case of 'cp'). The idea of SEEK_HOLE is to enable this.
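As a rough illustration of the example above, the following Python sketch creates a 100 MB file with data only in its first and last 1 KB, then compares the logical size with the space the filesystem reports as allocated. It uses ftruncate() and pwrite() rather than mmap(); the file name, the fill bytes, and the helper names `make_sparse_file` and `report` are assumptions made for this sketch. The allocated figure depends entirely on the filesystem, so no particular value should be expected.

```python
import os
import tempfile

SIZE = 100 * 1024 * 1024   # logical file size: 100 MB
CHUNK = 1024               # 1 KB written at each end


def make_sparse_file(path):
    """Create a SIZE-byte file with data only in its first and last CHUNK bytes."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.ftruncate(fd, SIZE)                      # set the length without writing data
        os.pwrite(fd, b"A" * CHUNK, 0)              # first 1 KB
        os.pwrite(fd, b"Z" * CHUNK, SIZE - CHUNK)   # last 1 KB
    finally:
        os.close(fd)


def report(path):
    """Return (logical size, bytes reported as allocated) for path."""
    st = os.stat(path)
    # On Linux, st_blocks counts 512-byte units regardless of the filesystem block size.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.bin")   # hypothetical file name
    make_sparse_file(path)
    logical, allocated = report(path)
    print(f"logical size: {logical} bytes, allocated: {allocated} bytes")
    with open(path, "rb") as f:
        f.seek(SIZE // 2)
        middle = f.read(16)   # reads from the unwritten middle come back as zero bytes
    print("middle of file:", middle)
```

On a filesystem that supports sparse files, the allocated figure should be far smaller than the logical size; on one that does not, the two will be roughly equal.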
Posted Apr 28, 2011 11:09 UTC (Thu) by johill (subscriber, #25196)
A "hole," in this case, is defined as a range of zeroes which need not correspond to blocks which have actually been omitted from the file, though in practice it almost certainly will.
This comes from the Solaris/FreeBSD definition. The actual patch implements what you say though. There's some talk about the minimum hole size as well.
Posted Apr 28, 2011 11:31 UTC (Thu) by peter-b (subscriber, #66996)
> This comes from the Solaris/FreeBSD definition. The actual patch implements what you say though. There's some talk about the minimum hole size as well.
Okay, whoops. I kinda concentrated on the details of the actual patch. My bad. Sorry about that.
Posted Apr 28, 2011 17:32 UTC (Thu) by chad.netzer (guest, #4257)
And the man page you linked to says: "A 'hole' is defined as a contiguous range of bytes in a file, all having the value of zero, but not all zeros in a file are guaranteed to be represented as holes returned with SEEK_HOLE."
So, it would seem that the "10101010" case would be covered by skipping (it's clearly not a hole). Even if the filesystem were maniacal and reported each 0 as a hole (ie. lying) everything should still work, it just means backup software could be made to do an insane amount of work to faithfully "reproduce" such bogus holes; they could protect against that. Might be fun to add this behavior to a bogo-filesystem for testing.
Posted Apr 29, 2011 5:11 UTC (Fri) by dlang (✭ supporter ✭, #313)
how many 0's need to be in place before it should be identified as a hole?
what if the source is on 4192 byte sector media and the destination is on 512 byte sector media and a string of 1M zeros starts at an offset of 1024 into the file? does the hole start at 1K (where the zeros start and space could be saved on the 512 byte sector media), or at 4K where the source may have a hole? what if the source is a raid array where holes can only really be useful if punched in blocks of 64K*#drives?
I think the process of 'SEEK_HOLE' is better than getting a dump of how the file happens to be allocated at the moment, but it seems like this is too big a question to just overload into a single flag.
Since software would have to be modified to use it anyway, it seems like it may be better to have a seek_hole() function rather than a flag on the existing lseek() so that you could tell seek_hole() what you consider a hole.
I see the following valid definitions of holes as obvious
1. if it's a hole at the time of the seek
2. if it could be a hole at the time of the seek (taking into account the filesystem and device holding the file)
3. #2 with an automatic 'if you could punch a hole here, do so' flag
4. any string of X sequential bytes of 0 aligned on a multiple of Y bytes
5. #4 but with a definition that represents the 'sector size', so X == Y and it only needs to be specified once.
I could see all of these being useful in different situations.
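Definition 4 in the list above can be made concrete with a small userspace sketch. The function name `find_zero_runs` and its parameters are hypothetical, and the sketch picks one possible reading of the definition: the start of each zero run is rounded up to a multiple of Y (`align`), and the run counts only if at least X (`min_len`) bytes remain after that. The length is not rounded. Definition 5 corresponds to calling it with `min_len == align`.

```python
import re


def find_zero_runs(data, min_len, align):
    """Return (start, length) for each zero run of at least min_len bytes,
    measured from a start that has been rounded up to a multiple of align."""
    runs = []
    for m in re.finditer(rb"\x00+", data):
        start = -(-m.start() // align) * align   # round the start up to the alignment
        length = m.end() - start                 # may be <= 0 if the run is too short
        if length >= min_len:
            runs.append((start, length))
    return runs


# The alternating 0/1 file from earlier in the thread: with a 512-byte
# threshold nothing qualifies, with a 1-byte threshold every zero does.
alternating = bytes([0, 1] * 5)
print(find_zero_runs(alternating, 512, 512))
print(find_zero_runs(alternating, 1, 1))
```

This scans the data itself, so it is the expensive content-based approach, not something a filesystem could answer from its allocation metadata.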
Posted Apr 29, 2011 18:32 UTC (Fri) by chad.netzer (guest, #4257)
The key is that the filesystem has this information *without* having to go searching through data blocks for it. It can therefore relay the hole locations to userspace efficiently, as opposed to the current userspace heuristic methods which must read() all the data and examine it for (position,length) runs of zeros, which it can then seek() forward over while writing.
So, with that:
"is 1000001 a hole?" - No. I'm quite confident no existing filesystem would store that with a hole in it. The zero run is too short. However, a general 100...001 string may, or may not contain a hole, which could potentially (but not necessarily) be reported by lseek(SEEK_HOLE).
"how many 0's need to be in place before it should be identified as a hole?" - it *should* only be identified as a hole if the zeros are not actually stored in data blocks, and thus the file is "sparse". We are not asking the filesystem to analyze the content for potential holes, but just to report what holes it currently has, which should be efficient. In fact, any lseek(SEEK_HOLE) implementation that has to examine data blocks for zero-content should probably be considered broken, imo.
"what if the source is on 4192 byte sector media and the destination is on 512 byte sector media..." - It doesn't matter. Nothing says holes are commutative, they are simply a storage optimization and need not be reproduced exactly. A destination filesystem will automatically convert to the padding and alignment of new holes in its own internal structure if you lseek over zeros (though perhaps not optimally; it depends on how the file is constructed). On vfat, for example, sparse files are not possible, and *every* zero byte will be stored literally, regardless of the sparsity of the original file.
"it seems like it may be better to have a seek_hole() function" - the FIEMAP ioctl still exists, although since it is apparently error prone, the lseek(SEEK_HOLE) interface may produce less buggy client code, and still be hugely efficient on sparse file copies.
"so that you could tell seek_hole() what you consider a hole." - There is no point in telling seek_hole() what you consider a hole, is there? You can already find and try to reproduce all possible holes in a file, in user-space, by read()ing, and then seek()ing over zeros on copies (I'll have to check what madness 'coreutils' uses for "cp --sparse"). The filesystem *can* efficiently tell us what holes it currently has, though, potentially saving a lot of read()s.
So currently, the classic way to create holes in sparse files is by lseeking() (mmap() also works); it seems somewhat logical to use lseek() to also detect those holes. It's not all powerful, but it's reasonably simple.
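The userspace heuristic described above (read() everything, then seek over the zeros while writing) might look roughly like the following. This is a minimal sketch, not what coreutils actually does; the name `sparse_copy`, the 4096-byte block size, and the demo file names are assumptions. Whether the skipped ranges really become holes is up to the destination filesystem.

```python
import os
import tempfile


def sparse_copy(src_path, dst_path, block=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them."""
    zero = bytes(block)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            buf = src.read(block)
            if not buf:
                break
            if buf == zero[:len(buf)]:
                # Skip forward; the filesystem may represent the gap as a hole.
                dst.seek(len(buf), os.SEEK_CUR)
            else:
                dst.write(buf)
        # Seeking past the end does not extend the file, so set the final length.
        dst.truncate(src.tell())


with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "src.bin")   # hypothetical file names
    dst_file = os.path.join(tmp, "dst.bin")
    payload = b"head" + bytes(1 << 20) + b"tail" + bytes(8192)
    with open(src_file, "wb") as f:
        f.write(payload)
    sparse_copy(src_file, dst_file)
    with open(dst_file, "rb") as f:
        copied = f.read()
    print("contents identical:", copied == payload)
```

Note that every byte of the source is still read and compared, which is exactly the cost that lseek(SEEK_HOLE) is meant to avoid.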
Posted Apr 29, 2011 20:44 UTC (Fri) by dlang (✭ supporter ✭, #313)
The interface created at Sun used the lseek() system call, which is normally used to change the read/write position within a file. If the SEEK_HOLE option is provided to lseek(), the offset will be moved to the beginning of the first hole which starts after the specified position. The SEEK_DATA option, instead, moves to the beginning of the first non-hole region which starts after the given position. A "hole," in this case, is defined as a range of zeroes which need not correspond to blocks which have actually been omitted from the file, though in practice it almost certainly will. Filesystems are not required to know about or report holes; SEEK_HOLE is an optimization, not a means for producing a 100% accurate map of every range of zeroes in the file.
note specifically: A "hole," in this case, is defined as a range of zeroes which need not correspond to blocks which have actually been omitted from the file
so this seems to be implying that this isn't just reporting what holes currently exist, but holes that could potentially exist, even if they haven't been punched out yet. at that point the question of what should be reported arises.
Posted Apr 29, 2011 21:55 UTC (Fri) by nybble41 (subscriber, #55106)
Ergo, an implementation which only reported filesystem-level blocks of zeros actually omitted from the file would be perfectly valid. The interface is allowed, but not *required*, to report "holes that could potentially exist". In practice I would expect filesystems to only report omissions, as scanning arbitrarily large amounts of stored data for the first non-zero byte would be prohibitively expensive (and can be done just as easily from userspace).
SEEK_HOLE and SEEK_DATA are meant as optimizations. It makes little sense to save the application the trouble of scanning for ranges of zeros in stored data at the expense of moving the same task into the filesystem. On the other hand, if the filesystem already knows that there is a hole--for example, because it was omitted from the stored data--then SEEK_HOLE and SEEK_DATA allow it to save the application some unnecessary reads.
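For comparison, here is a hedged sketch of how an application could use SEEK_DATA and SEEK_HOLE to walk only the regions the filesystem reports as data. It relies on `os.SEEK_DATA` and `os.SEEK_HOLE`, which Python exposes only on platforms that define them; the generator name `data_segments` is an assumption. The segments returned depend on the filesystem: one that tracks no holes may simply report the whole file as a single data segment, which is still valid under the loose definition discussed here.

```python
import errno
import os
import tempfile


def data_segments(fd):
    """Yield (offset, length) for each region the filesystem reports as data."""
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:   # no data after offset: the rest is a hole
                return
            raise
        # The end of the file always counts as a hole, so this call finds an end point.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        yield start, end - start
        offset = end


with tempfile.TemporaryFile() as f:
    f.write(b"head")
    f.seek(1 << 20)          # skip 1 MB without writing anything
    f.write(b"tail")
    f.flush()
    segments = list(data_segments(f.fileno()))
    print(segments)
```

A copy tool would then read and write only the reported segments, and could trust that everything outside them reads back as zeros.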
Posted Apr 29, 2011 22:19 UTC (Fri) by dlang (✭ supporter ✭, #313)
so an implementation that reported every 0 in the file would be valid
and an implementation that didn't report any holes in the file would be valid (although useless)
I'm arguing that it would be better to allow the flexibility to define what a hole is if applications are going to be modified to make use of this feature.
I'm not sure if the application should define the hole, or if it should be something that's tunable at the system (or device) level. I can definitely see a reluctance to have the app try and figure out what size hole is relevant, but at the same time, the ability to find potential holes without having to push the data all the way to userspace just to find 0's in the file seems like a useful optimisation for a small amount of code.
Posted Apr 29, 2011 23:29 UTC (Fri) by chad.netzer (guest, #4257)
I assume the wording of the specification has to be "loose" like this to cover cases where the filesystem converts zero data blocks to holes (via block data scrubbing), or a hole gets rewritten as actual zeros (an optimization like zero-block data deduplication, for example), so that while the logical content of the file has not changed, the "hole" structure is different and an offset previously returned by lseek(SEEK_HOLE) may no longer be a hole. This is a lesser constraint than if the content itself is altered, and should still work.
"so an implementation that reported every 0 in the file would be valid" - Yes, although it should at least adhere to the _PC_MIN_HOLE_SIZE as a lower bound. If that lower bound can be 1, clients should be prepared for that; in particular, backup software might need to detect and refuse to bother with bookkeeping such small holes, and just read and store the zeros verbatim.
"the ability to find potential holes without having to push the data all the way to userspace just to find 0's in the file seems like a useful optimisation for a small amount of code." - A filesystem could choose to "scrub" the data in the background and look for places to add holes, but whether it's userspace or kernel, the act of looking for potential holes will involve processing a lot of data blocks, and could be tricky when done on active filesystems. The copying to userspace is trivial compared to the block reads (even on non-rotating media). Creating the file with holes initially, by contrast, can often be done efficiently, since the writing application may know where the holes belong at the start. (i.e. compare "time dd if=/dev/zero of=/var/tmp/non-sparse-file bs=1M count=1000" vs. "time dd if=/dev/zero of=/var/tmp/sparse-file bs=1M count=1 seek=999")
That said, I wonder if any of the compressing filesystems try to aggressively find ways to make files sparser (given that they have to process all the data anyway)? My guess is that sparseness is not much of a win on those filesystems, so they don't bother.
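The suggestion above, that backup software refuse to do bookkeeping for very small holes and just store the zeros verbatim, could be sketched as a filter over reported data segments. The function name `merge_small_holes` and the `min_hole` threshold are hypothetical; the input is assumed to be a sorted, non-overlapping list of (offset, length) data segments such as an lseek(SEEK_DATA)/lseek(SEEK_HOLE) walk might produce.

```python
def merge_small_holes(segments, min_hole):
    """Merge data segments that are separated by holes shorter than min_hole bytes."""
    merged = []
    for start, length in segments:
        if merged and start - (merged[-1][0] + merged[-1][1]) < min_hole:
            # The gap is too small to be worth tracking: treat it as data.
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, start + length - prev_start)
        else:
            merged.append((start, length))
    return merged


# A 1-byte hole is absorbed into the surrounding data; a 4984-byte hole is kept.
print(merge_small_holes([(0, 10), (11, 5), (5000, 100)], 4096))
```

With this in place, a filesystem that reported every single zero byte as a hole would cost the backup tool some extra lseek() calls but would not bloat its hole map.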
Posted Apr 29, 2011 23:49 UTC (Fri) by dlang (✭ supporter ✭, #313)
I think that I'm saying that _PC_MIN_HOLE_SIZE and what alignment it needs to have should be configurable at least on the device (including logical device) level.
if the purpose of this is to allow backups and copies to deal with holes efficiently, it seems like it would be good to be able to tune how aggressively to look for holes (or possible holes, if things are layered, you may not know for sure if the holes are real or not). remember that this is all happening long after the file was created (and after it may have been mangled by other tools that filled in holes because they didn't know any better)
as for compressed filesystems, since a string of 0's compresses _really_ well, I suspect that none of them look for the special case of a full block of 0's aligned on a block boundary as it probably would take just about as much to record that special case as it takes to record that they are zero anyway ;-)
if de-duplication logic forces holes to be replaced with a block of 0's (even a shared one), the authors of that code should be fired; they are moving in the wrong direction (the block of 0's now takes up space and I/O where it didn't before)
Posted Apr 30, 2011 19:23 UTC (Sat) by jrn (subscriber, #64214)
Linux doesn't support Solaris's _PC_MIN_HOLE_SIZE currently. It doesn't seem very useful --- it just lets applications know that any hole will be at least such-and-such size (e.g., 512 bytes).
Posted May 4, 2011 18:27 UTC (Wed) by chad.netzer (guest, #4257)
It was a pure hypothetical, but for example some systems can convert an online volume to de-duped mode and back, all while serving files from it. I could see (in such cases of intermediate online filesystem conversions, or other hypothetical situations) that a filesystem could choose to not honor, or incorrectly report the SEEK_HOLE values. In such cases, the API would allow backups to still work, just less efficiently. So, my point is that the SEEK_HOLE API is not bound by any particular filesystem constraint.
> if the purpose of this is to allow backups and copies to deal with holes efficiently, it seems like it would be good to be able to tune how aggressively to look for holes
You don't want the filesystem to "look" for holes; it just knows them outright, if it supports them, based on what data blocks are actually stored. The "looking" for all potential holes can already be (and is) done in userspace for any filesystem, at the cost of examining a lot of zeros. Anyway, that's my view.
Posted May 4, 2011 19:01 UTC (Wed) by dlang (✭ supporter ✭, #313)
step 1 use SEEK_HOLE to find holes the filesystem knows about
step 2 read the remainder of the file through userspace to look for additional holes (or holes that SEEK_HOLE didn't report).
examining a range of memory to find if it's exclusively zero seems like the type of thing that is amenable to optimisation based on the particular CPU in use. Since the kernel is already optimised this way it would seem to be better to leverage this rather than require multiple userspace tools to all implement the checking (with the optimisations)
the full details of what extents are used for a file seems like it isn't the right answer, both because it's complex, but also because it's presenting a lot of information that isn't useful (i.e. you don't care if a block of real data is in one block, or fragmented into lots of blocks), but at the same time it seems a bit wasteful to find the holes by doing a separate system call for each hole boundary.
Posted May 4, 2011 19:54 UTC (Wed) by chad.netzer (guest, #4257)
Perhaps, but it's almost certainly I/O bound, not CPU.
If you *really* want to aggressively replace long runs of zeros with holes, in existing files (ie. make them sparser), a background userspace scrubber could be employed; although doing it in-place without forcing a copy (new inode) is tricky. At least some Linux filesystems have, or will have, the ability to "punch holes":
Posted Apr 30, 2011 3:16 UTC (Sat) by jrn (subscriber, #64214)
I think you're misreading it.
This is about reporting holes, but nobody wanted to guarantee that such a thing as a hole exists. So the semantics are: if SEEK_HOLE reports a hole, the content there consists of NUL bytes. That's it (though naturally enough any sane kernel is only going to report large blocks of NUL bytes, for example by reporting the actual holes, and userspace programs are likely to rely on that assumption for reasonable performance).
Posted Apr 30, 2011 3:29 UTC (Sat) by dlang (✭ supporter ✭, #313)
what may be a large block for a filesystem running on one device may not be a large block for another device.
I'm not saying that it makes sense to have it report down to every single null byte in the file, but I do think that there should be some ability to define what 'large block' means outside of editing the source.
Posted Apr 30, 2011 19:10 UTC (Sat) by jrn (subscriber, #64214)
Perhaps you're talking about the holes feature in general, and saying that users or applications should be able to configure when a seek while writing will create a hole? Then I would understand a little better.
Posted May 2, 2011 4:28 UTC (Mon) by njs (guest, #40338)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds