Allocating uninitialized file blocks

By Jonathan Corbet
April 17, 2012

The fallocate() system call can be used to increase the size of a file without actually writing to the new blocks. It is useful as a way to encourage the kernel to lay out the new blocks contiguously on disk, or just to ensure that sufficient space is available before beginning a complex operation. Filesystems implementing fallocate() take care to note that the new blocks have not actually been written; attempts to read those uninitialized blocks will normally just return zeroes. To do otherwise would be to risk disclosing information remaining in blocks recently freed from other files.
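As a rough illustration of the behavior described above, the sketch below preallocates a file and reads it back. It assumes Python's os.posix_fallocate(), which the C library typically implements with fallocate() where the filesystem supports it (and may fall back to writing zeroes where it does not); the helper names preallocate() and demo() are invented for this example.

```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for `path` without the caller writing any data."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Extends the file to `size` bytes and asks for backing blocks.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def demo(size=1 << 20):
    """Preallocate a scratch file and report (length read, all bytes zero?)."""
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "prealloc.bin")
        preallocate(path, size)
        with open(path, "rb") as f:
            data = f.read()
    return len(data), data.count(0) == len(data)
```

Calling demo() should report the full requested length with every byte reading back as zero, since the unwritten blocks are never exposed to the reader.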

For most users, fallocate() works just as it should. In some cases, though, the application in question does a lot of random writes scattered throughout the file. Writing to a small part of an uninitialized extent may force the filesystem to initialize a much larger range of blocks, slowing things down. But if the application knows where it has written in the file, and will thus never read from uninitialized parts of that file, it gains no benefit from this extra work.
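The kind of application in question might look something like the following sketch, which keeps its own record of which blocks it has written and refuses to read any others. The class name TrackedFile, the 4096-byte block size, and the in-memory set used for tracking are all assumptions made for illustration; a real application would need to persist that record.

```python
import os
import tempfile

BLOCK = 4096  # assumed block size for this sketch


class TrackedFile:
    """Preallocated file whose owner remembers which blocks it has written."""

    def __init__(self, path, nblocks):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.posix_fallocate(self.fd, 0, nblocks * BLOCK)
        self.nblocks = nblocks
        self.written = set()  # indices of blocks holding real data

    def write_block(self, index, data):
        if not 0 <= index < self.nblocks:
            raise IndexError(index)
        if len(data) != BLOCK:
            raise ValueError("data must be exactly one block")
        os.pwrite(self.fd, data, index * BLOCK)
        self.written.add(index)

    def read_block(self, index):
        # Never touch a block the application itself has not filled in.
        if index not in self.written:
            raise KeyError(f"block {index} was never written")
        return os.pread(self.fd, BLOCK, index * BLOCK)

    def close(self):
        os.close(self.fd)


def demo():
    """Write one scattered block, read it back, then try an unwritten one."""
    with tempfile.TemporaryDirectory() as scratch:
        tf = TrackedFile(os.path.join(scratch, "store.bin"), 16)
        try:
            tf.write_block(7, b"x" * BLOCK)
            ok = tf.read_block(7) == b"x" * BLOCK
            try:
                tf.read_block(3)
                refused = False
            except KeyError:
                refused = True
        finally:
            tf.close()
    return ok, refused
```

Because such a program never reads a block it has not written, zero-filling the rest of the extent on its behalf is wasted effort from its point of view.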

How much does this initialization cost? Zheng Liu recently implemented a new fallocate() flag (called FALLOC_FL_NO_HIDE_STALE) that marks new blocks as being initialized, even though the filesystem has not actually written them; these blocks will thus contain random old data. A random-write benchmark that took 76 seconds on a mainline kernel ran in 18 seconds when this flag was used. Needless to say, that is a significant performance improvement; for that reason, Zheng has proposed that this flag be merged into the mainline.

Such a feature has obvious security implications; Zheng's patch tries to address them by providing a sysctl knob to enable the new feature and defaulting it to "off." Still, Ric Wheeler didn't like the idea, saying "Sounds like we are proposing the introduction a huge security hole instead of addressing the performance issue head on". Ted Ts'o was a little more positive, especially if access to the feature required a capability like CAP_SYS_RAWIO. But everybody seems to agree that a good first step would be to figure out why performance is so bad in this situation and see if a proper fix can be made. If the performance issue can be made to go away without requiring application changes or possibly exposing sensitive data, everybody will be better off in the end.

Allocating uninitialized file blocks

Posted Apr 19, 2012 7:15 UTC (Thu) by slashdot (guest, #22014) [Link] (3 responses)

This is something that should never be done unless the benefit is truly huge and it is mathematically provable that there's no other way to achieve it.

Even if an application has root privileges, a bug in it (even just not zeroing out padding in a structure) may result in an accidental leak of data from a totally unrelated file to a remote client.

The only sensible variant is zeroing blocks from deleted files in the background and only using those scrubbed blocks.

Allocating uninitialized file blocks

Posted Apr 19, 2012 15:35 UTC (Thu) by jzbiciak (guest, #5246) [Link] (2 responses)

How is this at all different than just opening up the device node associated with the partition? If you're root, what's stopping you?

As Ted Ts'o suggested, if a process has the capability to do raw I/O, why not let it see raw disk blocks occasionally? You've already given it permission to do low level I/O in spite of a filesystem, so what's the harm in letting it see stale blocks within a filesystem?

Allocating uninitialized file blocks

Posted Apr 19, 2012 19:28 UTC (Thu) by nybble41 (subscriber, #55106) [Link] (1 responses)

The problem is that while a root process or one with raw I/O capabilities can see those blocks itself, it wouldn't usually write them out to a file which other users could read. However, if a root process allocates space for a file readable by non-root processes, and that space remains uninitialized, the other processes will have access to the former contents of those blocks.

Allocating uninitialized file blocks

Posted Apr 20, 2012 5:41 UTC (Fri) by jzbiciak (guest, #5246) [Link]

Well, this new fallocate() feature is explicit. It's not like we're suddenly changing the semantics of holes in files. It's really about policy vs. mechanism. We need to ask if this is a useful mechanism, and if so, can user space use it safely if it adopts appropriate policies?

The definition of "appropriate policy" depends entirely on the usage scenario and security requirements of the system and application. A DVR disk that has nothing but video files on it won't leak anything interesting, so this fallocate() mode may be perfectly suited to it, for example, assuming a BitTorrent-style scattered download.

All that said, this new mode does need to prove its usefulness. If the performance issue is unique to ext4, then it's probably better to just fix ext4.

Allocating uninitialized file blocks

Posted Apr 19, 2012 9:48 UTC (Thu) by epa (subscriber, #39769) [Link] (10 responses)

Maybe the filesystem needs a low-priority scrubber process that zeroes out unallocated blocks, and marks them as zeroed. They can then be allocated quickly later.

If enough already-zeroed blocks aren't available (or they aren't contiguous enough) then you still have to go to the work of initializing blocks when fallocate() is called. But if those blocks are never used and the file is then shrunk or deleted, the zeroed blocks are still available for next time so the work isn't entirely wasted.
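The idea can be sketched as a toy free-block pool (purely illustrative, not any filesystem's allocator; the class BlockPool and its methods are invented here, and contiguity is ignored entirely):

```python
from collections import deque


class BlockPool:
    """Toy model of free blocks split into 'dirty' and 'already zeroed'."""

    def __init__(self, nblocks):
        self.dirty = deque(range(nblocks))  # free, but may hold stale data
        self.zeroed = deque()               # free and already scrubbed
        self.zeroed_on_demand = 0           # blocks zeroed at allocation time

    def scrub(self, budget):
        """Idle-time work: zero up to `budget` dirty blocks."""
        done = 0
        while self.dirty and done < budget:
            self.zeroed.append(self.dirty.popleft())
            done += 1
        return done

    def allocate(self, count):
        """Hand out blocks, preferring ones that are already zeroed."""
        if count > len(self.zeroed) + len(self.dirty):
            raise ValueError("not enough free blocks")
        blocks = []
        while len(blocks) < count and self.zeroed:
            blocks.append(self.zeroed.popleft())
        while len(blocks) < count:
            # Pool ran dry: this block has to be zeroed right now.
            blocks.append(self.dirty.popleft())
            self.zeroed_on_demand += 1
        return blocks

    def release(self, blocks, touched):
        """Free blocks; those never written to are still known to be zero."""
        for b in blocks:
            (self.dirty if b in touched else self.zeroed).append(b)
```

With eight blocks, scrubbing four and then allocating six leaves two to be zeroed on demand; releasing them with only one block actually written returns the other five to the zeroed pool for next time.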

Allocating uninitialized file blocks

Posted Apr 19, 2012 14:56 UTC (Thu) by vonbrand (subscriber, #4458) [Link] (9 responses)

The problem is that zeroing blocks is work, and besides, SSDs (and RAIDs) shouldn't be written except when really needed.

Why not just split the range fallocated and just rewrite the affected blocks, not the whole range?
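A minimal sketch of that splitting, assuming byte ranges given as (start, end) pairs and a 4096-byte block size; the function split_unwritten() is hypothetical and says nothing about how ext4 or any other filesystem actually handles its extents:

```python
def split_unwritten(extent, write, block=4096):
    """Split an unwritten extent around a write.

    Both arguments are (start, end) byte ranges with `end` exclusive.
    Returns (unwritten_before, initialized, unwritten_after); the first
    and last parts are None when empty. The initialized range is the
    write rounded out to block boundaries, so only the partially covered
    blocks at its edges would need zero-filling.
    """
    e_start, e_end = extent
    w_start, w_end = write
    if not (e_start <= w_start < w_end <= e_end):
        raise ValueError("write must fall inside the extent")
    # Round the write outward to whole blocks, clamped to the extent.
    i_start = max(e_start, w_start - w_start % block)
    i_end = min(e_end, -(-w_end // block) * block)
    before = (e_start, i_start) if i_start > e_start else None
    after = (i_end, e_end) if i_end < e_end else None
    return before, (i_start, i_end), after
```

For a 100-byte write at offset 10000 into a 1MB unwritten extent, only the single block covering bytes 8192 to 12288 becomes initialized; everything on either side stays marked unwritten.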

Allocating uninitialized file blocks

Posted Apr 19, 2012 17:01 UTC (Thu) by epa (subscriber, #39769) [Link]

Yeah, it's work; my point is that since most systems have idle periods and busy periods, it makes sense to do some of this work during idle periods so that it doesn't have to be done when the system (and the disk) is busy.

Allocating uninitialized file blocks

Posted Apr 19, 2012 18:40 UTC (Thu) by slashdot (guest, #22014) [Link]

If it's an SSD or a "smart" RAID array, use TRIM instead (if it's not supported, throw the stupid thing away and get a decent one).

Allocating uninitialized file blocks

Posted Apr 20, 2012 7:32 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

Modern SSDs have TRIM which 'zeroes' the TRIM-ed blocks automatically.

Allocating uninitialized file blocks

Posted Apr 20, 2012 19:31 UTC (Fri) by drag (guest, #31333) [Link] (5 responses)

But then you are depending on your hardware to have correct functionality.
Which is a very very unsafe assumption.

Anyways, I believe TRIM just tells them that the blocks are no longer being used. It's not an order to zero out the blocks. Without trim the SSD doesn't know if the file system still expects them to be used or not.

Allocating uninitialized file blocks

Posted Apr 20, 2012 20:17 UTC (Fri) by dlang (guest, #313) [Link]

trim says that the blocks are not being used. If there is an attempt to read from blocks that are not being used, the SSD returns all 0's (it doesn't actually read the block, because after the trim, that block doesn't actually exist on the flash media anymore)

If you were to take apart the drive and bypass the controller to read the flash chips directly, you would have a chance at recovering the data.

Allocating uninitialized file blocks

Posted Apr 20, 2012 21:33 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

>But then you are depending on your hardware to have correct functionality.

Its presence is indicated by the 'Deterministic read data after TRIM' capability (you can check it using "hdparm -I"). So it's not like you need to blindly trust your SSD.

>Anyways, I believe TRIM just tells them that the blocks are no longer being used. It's not a order to zero out the blocks.

With deterministic zeros one can also use it as a way to quickly erase blocks.

Besides, overwriting something on an SSD in most cases would NOT actually overwrite it in the real hardware flash due to load balancing and remapping.

Allocating uninitialized file blocks

Posted Apr 20, 2012 21:39 UTC (Fri) by dlang (guest, #313) [Link] (2 responses)

unfortunately trim is not a fast command for many drives

Allocating uninitialized file blocks

Posted Apr 20, 2012 21:45 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

That depends. TRIM-ing 100GB of data on my Intel SSD-based RAID takes little more than 2-3 seconds; it's way faster than writing zeros directly and more flash-friendly.

However, TRIM command can't be queued. So it probably makes no sense to use it for large allocations and/or to keep a pool of recently-trimmed pages for immediate small allocations.

Allocating uninitialized file blocks

Posted Apr 20, 2012 21:51 UTC (Fri) by dlang (guest, #313) [Link]

the fact that it can't be queued is a major performance hit.

The SSD does keep a pool of unused pages for new allocations. What trim does is it lets the SSD know that you no longer care about the data on that block, and so it can add the block back to that pool.

If the SSD runs out of this pool, writing slows drastically as it must first erase a block before it can write anything. If you are expecting to do a LOT of writing to an SSD, you may want to make sure that you partition it to something less than the advertised size so that the extra space will remain in the pool (this works as long as that extra space has never been written to, or is explicitly released via a TRIM command).

Allocating uninitialized file blocks

Posted Apr 19, 2012 15:21 UTC (Thu) by sandeen (guest, #42852) [Link] (1 responses)

As the thread wore on, numbers were posted which show XFS has no serious performance degradation under the same workload. At this point, blowing a security hole in ext4 and promoting the flag to the VFS level seems incredibly premature.

I'd look into just what is making ext4 slow here but so far I can't reproduce a slowdown anything like what the patch submitter has seen...

-Eric

Allocating uninitialized file blocks

Posted Apr 23, 2012 1:21 UTC (Mon) by szaka (guest, #12740) [Link]

The flag (with a better name) could be helpful for filesystems which can't fully support uninitialized allocated blocks efficiently. We are supporting several such interoperable filesystems (NTFS, exFAT, FAT) where changing the specification is not possible.

There is real user need for this, even after the potential security consequences have been explained. A typical usage scenario is a large file used as a container by an application which tracks free/used blocks itself. Windows supports this feature through SetFileValidData() if an extra privilege is granted.

The performance gain can be huge on embedded systems using low-end storage and SoCs. In one of our cases it took 5 days vs 12 minutes to fully set up a large file for use.

Allocating uninitialized file blocks

Posted Apr 19, 2012 15:44 UTC (Thu) by joey (guest, #328) [Link] (2 responses)

I was just reading yesterday about a torrent program that used the equivalent option in Windows. So every partially downloaded torrent file exposes other "deleted" data. Ugh.

Allocating uninitialized file blocks

Posted Apr 19, 2012 20:51 UTC (Thu) by intgr (subscriber, #39733) [Link] (1 responses)

> that used the equivalent option in Windows

Are you saying that Windows has an equivalent to the newly proposed FALLOC_FL_NO_HIDE_STALE flag too? I'd be surprised. Can you link to the discussion?

(Maybe you're just confused. I'm sure there is a fallocate() equivalent feature in Windows -- but fallocate() doesn't leak any foreign file data, without this new proposed flag)

Allocating uninitialized file blocks

Posted Apr 19, 2012 21:22 UTC (Thu) by Fowl (subscriber, #65667) [Link]

That would be the SetFileValidData function, which "is useful in very limited scenarios."

To use it requires the SE_MANAGE_VOLUME_NAME privilege (ie. raw disk access).

"Applications should call SetFileValidData only on files that restrict access to those entities that have SE_MANAGE_VOLUME_NAME access. The application must ensure that the unwritten ranges of the file are never exposed, or security issues can result as follows."

[Something about giving just enough rope to hang yourself / trusting programmers to know what they're doing.]

Allocating uninitialized file blocks

Posted May 1, 2012 5:50 UTC (Tue) by butlerm (subscriber, #13312) [Link]

It sounds like EXT4 and any other modern filesystems with this problem need a more robust form of unwritten extent tracking similar to that used by XFS.