Allocating uninitialized file blocks
For most users, fallocate() works just as it should. In some cases, though, the application in question does a lot of random writes scattered throughout the file. Writing to a small part of an uninitialized extent may force the filesystem to initialize a much larger range of blocks, slowing things down. But if the application knows where it has written in the file, and will thus never read from uninitialized parts of that file, it gains no benefit from this extra work.
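The access pattern described above can be sketched in a few lines. This is a minimal illustration, not code from the patch discussion: it assumes Linux and uses Python's os.posix_fallocate(), which glibc normally implements with fallocate() in its default mode. The function name and its arguments are hypothetical.

```python
import os

def preallocate_and_scatter(path, size, writes):
    """Preallocate `size` bytes, then apply each (offset, data) write in place.

    Returns the first 16 bytes of the file so the caller can inspect what
    a reader sees in a range that may never have been written.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Reserve the blocks up front; the filesystem tracks them as
        # allocated but not yet written.
        os.posix_fallocate(fd, 0, size)
        # Small writes scattered through the preallocated range.
        for offset, data in writes:
            os.pwrite(fd, data, offset)
        return os.pread(fd, 16, 0)
    finally:
        os.close(fd)
```

With the default fallocate() behavior, an untouched range reads back as zeros; keeping that guarantee is the extra work whose cost is at issue here.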
How much does this initialization cost? Zheng Liu recently implemented a new fallocate() flag (called FALLOC_FL_NO_HIDE_STALE) that marks new blocks as being initialized, even though the filesystem has not actually written them; these blocks will thus contain whatever stale data was previously stored there. A random-write benchmark that took 76 seconds on a mainline kernel ran in 18 seconds when this flag was used, a speedup of roughly a factor of four. That is a significant performance improvement; for that reason, Zheng has proposed that this flag be merged into the mainline.
Such a feature has obvious security implications; Zheng's patch tries to
address them by providing a sysctl knob to enable the new feature and
defaulting it to "off." Still, Ric Wheeler didn't like the idea, saying "Sounds
like we are proposing the introduction a huge security hole instead of
addressing the performance issue head on". Ted Ts'o was a little more
positive, especially if access
to the feature required a capability like CAP_SYS_RAWIO. But
everybody seems to agree that a good first step would be to figure out why
performance is so bad in this situation and see if a proper fix can be
made. If the performance issue can be made to go away without requiring
application changes or possibly exposing sensitive data, everybody will be
better off in the end.
Posted Apr 19, 2012 7:15 UTC (Thu)
by slashdot (guest, #22014)
[Link] (3 responses)
Even if an application has root privileges, a bug in it (even just not zeroing out padding in a structure) may result in an accidental leak of data from a totally unrelated file to a remote client.
The only sensible variant is zeroing blocks from deleted files in the background and only using those scrubbed blocks.
Posted Apr 19, 2012 15:35 UTC (Thu)
by jzbiciak (guest, #5246)
[Link] (2 responses)
As Ted Ts'o suggested, if a process has the capability to do raw I/O, why not let it see raw disk blocks occasionally? You've already given it permission to do low-level I/O in spite of a filesystem, so what's the harm in letting it see stale blocks within a filesystem?
Posted Apr 19, 2012 19:28 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
Posted Apr 20, 2012 5:41 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
Well, this new fallocate() feature is explicit. It's not like we're suddenly changing the semantics of holes in files. It's really about policy vs. mechanism. We need to ask if this is a useful mechanism, and if so, can user space use it safely if it adopts appropriate policies? The definition of "appropriate policy" depends entirely on the usage scenario and security requirements of the system and application. A DVR disk that has nothing but video files on it won't leak anything interesting, so this fallocate() mode may be perfectly suited to it, for example, assuming a BitTorrent-style scattered download. All that said, this new mode does need to prove its usefulness. If the performance issue is unique to ext4, then it's probably better to just fix ext4.
Posted Apr 19, 2012 9:48 UTC (Thu)
by epa (subscriber, #39769)
[Link] (10 responses)
If enough already-zeroed blocks aren't available (or they aren't contiguous enough) then you still have to go to the work of initializing blocks when fallocate() is called. But if those blocks are never used and the file is then shrunk or deleted, the zeroed blocks are still available for next time so the work isn't entirely wasted.
Posted Apr 19, 2012 14:56 UTC (Thu)
by vonbrand (subscriber, #4458)
[Link] (9 responses)
The problem is that zeroing blocks is work, and besides, SSDs (and RAIDs) shouldn't be written to except when really needed. Why not just split the fallocated range and rewrite only the affected blocks, not the whole range?
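The splitting idea can be shown with a toy model. This is only a sketch under stated assumptions: the Extent record and the split_for_write() helper are hypothetical and do not reflect ext4's actual on-disk format or code paths.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Extent:
    start: int     # first block covered by the extent
    length: int    # number of blocks
    written: bool  # False means allocated but uninitialized

def split_for_write(extent, first, count):
    """Split an unwritten extent so only blocks [first, first + count)
    are converted to written; the rest stays uninitialized."""
    if extent.written:
        return [extent]  # nothing to convert
    end = extent.start + extent.length
    if count <= 0 or first < extent.start or first + count > end:
        raise ValueError("write must fall inside the extent")
    pieces = []
    if first > extent.start:
        # Leading part stays unwritten.
        pieces.append(Extent(extent.start, first - extent.start, False))
    # Only this part needs to hold real data.
    pieces.append(Extent(first, count, True))
    if first + count < end:
        # Trailing part stays unwritten.
        pieces.append(Extent(first + count, end - (first + count), False))
    return pieces
```

In this model a four-block write into the middle of a 100-block unwritten extent yields three extents, and only the four blocks actually touched are marked written. The price is a growing extent count under heavy random writes, which a real filesystem would have to weigh against the saved initialization.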
Posted Apr 19, 2012 17:01 UTC (Thu)
by epa (subscriber, #39769)
[Link]
Posted Apr 19, 2012 18:40 UTC (Thu)
by slashdot (guest, #22014)
[Link]
Posted Apr 20, 2012 7:32 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Apr 20, 2012 19:31 UTC (Fri)
by drag (guest, #31333)
[Link] (5 responses)
Anyways, I believe TRIM just tells them that the blocks are no longer being used. It's not an order to zero out the blocks. Without TRIM the SSD doesn't know if the file system still expects them to be used or not.
Posted Apr 20, 2012 20:17 UTC (Fri)
by dlang (guest, #313)
[Link]
If you were to take apart the drive and bypass the controller to read the flash chips directly, you would have a chance at recovering the data.
Posted Apr 20, 2012 21:33 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
Its presence is indicated by the 'Deterministic read data after TRIM' capability (you can check it using "hdparm -I"). So it's not like you need to blindly trust your SSD.
>Anyways, I believe TRIM just tells them that the blocks are no longer being used. It's not an order to zero out the blocks.
With deterministic zeros one can also use it as a way to quickly erase blocks.
Besides, overwriting something on an SSD in most cases would NOT actually overwrite it in the real hardware flash due to load balancing and remapping.
Posted Apr 20, 2012 21:39 UTC (Fri)
by dlang (guest, #313)
[Link] (2 responses)
Posted Apr 20, 2012 21:45 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
However, the TRIM command can't be queued. So it probably makes no sense to use it for large allocations and/or to keep a pool of recently-trimmed pages for immediate small allocations.
Posted Apr 20, 2012 21:51 UTC (Fri)
by dlang (guest, #313)
[Link]
The SSD does keep a pool of unused pages for new allocations. What trim does is it lets the SSD know that you no longer care about the data on that block, and so it can add the block back to that pool.
If the SSD runs out of this pool, writing slows drastically as it must first erase a block before it can write anything. If you are expecting to do a LOT of writing to an SSD, you may want to make sure that you partition it to something less than the advertised size so that the extra space will remain in the pool (this works as long as that extra space has never been written to, or is explicitly released via a TRIM command).
Posted Apr 19, 2012 15:21 UTC (Thu)
by sandeen (guest, #42852)
[Link] (1 responses)
I'd look into just what is making ext4 slow here but so far I can't reproduce a slowdown anything like what the patch submitter has seen...
-Eric
Posted Apr 23, 2012 1:21 UTC (Mon)
by szaka (guest, #12740)
[Link]
There is real user need despite explaining potential security consequences. Typical usage scenarios are using a large file as a container for an application which tracks free/used blocks itself. Windows supports this feature by SetFileValidData() if extra privilege is granted.
The performance gain can be huge on embedded systems using low-end storage and SoCs. In one of our cases it took 5 days vs 12 minutes to fully set up a large file for use.
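A rough sketch of that container pattern follows. Everything here is an assumption made for illustration: the TrackedContainer class, the 4096-byte block size, and the in-memory set of written blocks are hypothetical, and the file is preallocated with the ordinary os.posix_fallocate() call rather than the proposed flag.

```python
import os

BLOCK = 4096  # assumed block size for this sketch

class TrackedContainer:
    """Container file whose owner remembers which blocks it has written."""

    def __init__(self, path, nblocks):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.posix_fallocate(self.fd, 0, nblocks * BLOCK)
        self.nblocks = nblocks
        self.written = set()  # block numbers known to hold valid data

    def write_block(self, n, data):
        if not 0 <= n < self.nblocks or len(data) != BLOCK:
            raise ValueError("bad block number or block size")
        os.pwrite(self.fd, data, n * BLOCK)
        self.written.add(n)

    def read_block(self, n):
        # Refuse to expose a block this application never wrote.
        if n not in self.written:
            raise KeyError(f"block {n} has never been written")
        return os.pread(self.fd, BLOCK, n * BLOCK)

    def close(self):
        os.close(self.fd)
```

Because the application never reads a block it has not written, zeroing by the filesystem buys it nothing. A real implementation would have to persist the written-block map safely, since losing it would expose whatever the unwritten blocks contain.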
Posted Apr 19, 2012 15:44 UTC (Thu)
by joey (guest, #328)
[Link] (2 responses)
Posted Apr 19, 2012 20:51 UTC (Thu)
by intgr (subscriber, #39733)
[Link] (1 responses)
Are you saying that there's an equivalent to the new proposed FALLOC_FL_NO_HIDE_STALE flag too? I'd be surprised. Can you link to the discussion?
(Maybe you're just confused. I'm sure there is a fallocate() equivalent feature in Windows -- but fallocate() doesn't leak any foreign file data, without this new proposed flag)
Posted Apr 19, 2012 21:22 UTC (Thu)
by Fowl (subscriber, #65667)
[Link]
That would be the SetFileValidData function, which "is useful in very limited scenarios." To use it requires the SE_MANAGE_VOLUME_NAME privilege (i.e. raw disk access). [Something about giving just enough rope to hang yourself / trusting programmers to know what they're doing.]
Posted May 1, 2012 5:50 UTC (Tue)
by butlerm (subscriber, #13312)
[Link]
Which is a very very unsafe assumption.
"Applications should call SetFileValidData only on files that restrict access to those entities that have SE_MANAGE_VOLUME_NAME access. The application must ensure that the unwritten ranges of the file are never exposed, or security issues can result as follows."
It sounds like ext4 and any other modern filesystem with this problem needs a more robust form of unwritten extent tracking similar to that used by XFS.