LWN.net Logo

Uninitialized blocks and unexpected flags

By Jonathan Corbet
November 28, 2012
One often-heard complaint in the early BitKeeper era was that, by letting code reach the mainline without going via a mailing list, BitKeeper made it easy for maintainers to slip surprise changes in underneath the review radar. Those worries have mostly proved unfounded; when surprises have happened, the response from the community has usually helped to ensure that there would be no repeats. But some developers are charging that the 3.7 kernel contains exactly this type of stealth change and are demanding that it be reverted.

Background

The fallocate() system call is meant to be a way for an application to request the efficient allocation of blocks for a file. Use of fallocate() allows a process to verify that the required disk space is available, helps the filesystem to allocate all of the space in a single, contiguous group, and avoids the overhead that block-by-block allocation would incur. In the absence of an fallocate() implementation (each filesystem must implement it independently), the C library will emulate it by simply writing zeroes to the requested block range; that gets the space allocated, but is less efficient than one would like. The implementation of fallocate() within filesystems tries to be more efficient than that; one way to do so is to avoid the process of writing zeroes to the newly-allocated blocks.

Leaving stale data in allocated blocks has obvious security implications: a hostile application could read those blocks in the hopes of finding confidential documents, passwords, or the missing Fedora 18 Beta release announcement. To avoid this exposure, filesystems like ext4 will mark unwritten blocks as being uninitialized; any attempt to read those blocks will be intercepted and just return zeroes. In the normal case, the application will write data to those blocks before ever trying to read them; writing obviously initializes the blocks without the need to write zeroes first. This implementation seems like it should be about optimal.

Except that, seemingly, ext4 marks uninitialized blocks at the extent (group of contiguous blocks) level. So, if an application writes to one uninitialized block, the containing extent must be split and the newly-written block(s) added to the previous extent, if possible. That turns out to be more expensive than some users would like. So a shortcut was attempted.

That shortcut first appeared in April, 2012, in the form of a new fallocate() flag called FALLOC_FL_NO_HIDE_STALE. If fallocate() was called with that flag, the newly-allocated blocks would be marked as being initialized even though the old data remained untouched. That obviously brings the old security issues back; to mitigate the problem, the patch added a mount option making the new functionality available only to members of a specific group. That was deemed to be enough, especially for settings where access to the machine as a whole is tightly controlled.

At least, the authors and supporters of the patch deemed the group check to be enough. The patch was roundly criticized by other filesystem developers; the prevailing opinion appeared to be that it was trying to open up a huge security hole in order to avoid fixing an ext4 performance problem. After that discussion, the patch went away and wasn't heard from again.

A surprise flag

At least, it was not heard from until recently, when some filesystem developers were surprised to discover this commit by Ted Ts'o which found its way into the mainline (via the ext4 tree) during the 3.7 merge window. The patch is small and simple; it simply defines the FALLOC_FL_NO_HIDE_STALE flag, but adds no code to actually implement it. The changelog reads:

As discussed at the Plumber's Conference, reserve the bit 0x04 in fallocate() to prevent collisions with a commonly used out-of-tree patch which implements the no-hide-stale feature.

Filesystem developer Dave Chinner, at least, does not recall this discussion. His response was to post a patch reverting the change, saying:

The lack of formal review and discussion for a syscall API change is grounds for reverting patch, especially given the controversial nature of the feature and the previous discussions and NAKs. The way the change was pushed into mainline borders on an abuse of the trust we place in maintainers and hence as a matter of principle this change should be reverted.

It is true that this particular change is a bit abnormal. It changes the core filesystem code but came by way of a filesystem-specific tree with no acks from any other developers. The patch does not appear to have been posted to any relevant mailing list, violating the rule that all patches should go through public review before being pushed toward the mainline. The addition of a flag with no in-kernel users is also contrary to usual kernel practice. It is, in summary, the sort of change that less well-established kernel developers would never get away with making. So it is hard to fault other filesystem developers for being surprised and unhappy.

On the other hand, the change just adds a flag definition; it obviously cannot cause problems for existing code. And there does appear to be a real user community for this feature. Ted justified his action this way:

It doesn't change the interface or break anything; it just reserves a bit so that out-of-tree patches don't collide with future allocations. There are significant usages of this bit within Google and Tao Bao. It is true that there has been significant pushback about adding this functionality on linux-fsdevel; I find it personally frustrating that in effect, if enough people scream, they can veto an optional feature that might only be implemented by a single file system.

This explanation does not appear to have satisfied anybody, though. So we have an impasse of sorts; some developers want a flag to control a functionality they need, while others see it as a security problem and the result of an abuse of the kernel's trust system.

Alan Cox suggested that it would be possible to, instead, reserve a set of filesystem-private flags that could be used for any purpose by any filesystem. Dave pointed out, however, that a flag bit that behaved differently from one filesystem to the next is a recipe for trouble. His suggestion, instead, is that this functionality should be implemented via the ioctl() interface, which is where filesystem-specific options usually hide. The ioctl() approach seems like it should be workable, but no patches to that effect have been posted thus far.

As of this writing, Linus has not accepted the revert, so the FALLOC_FL_NO_HIDE_STALE flag can still be found in the 3.7 kernel. He has also remained silent in the discussion. He will have to make a decision one way or the other, though, before the final 3.7 release is made. Once that flag is made available in a stable mainline release, it will be much harder to get rid of, so, if that flag is going to come out, it needs to happen soon.


(Log in to post comments)

Uninitialized blocks and unexpected flags

Posted Nov 29, 2012 6:14 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

... a hostile application could read those blocks in the hopes of finding confidential documents, passwords, or the missing Fedora 18 Beta release announcement.

Hehe... Our esteemed editor's witty humor strikes again. :)

Making that 2040 thesis easy

Posted Nov 29, 2012 11:03 UTC (Thu) by man_ls (guest, #15091) [Link]

For future historians, the quote is a reference to this story and its comments.

Uninitialized blocks and unexpected flags

Posted Nov 29, 2012 20:44 UTC (Thu) by emunson (subscriber, #44357) [Link]

Maybe I haven't been around long enough to have seen it, but I thought that making life easier for people that maintain out-of-tree patches was not good enough justification to get something merged...

Uninitialized blocks and unexpected flags

Posted Nov 29, 2012 21:33 UTC (Thu) by bronson (subscriber, #4806) [Link]

You're absolutely right. I guess the exception comes when the out-of-tree patch is used by the maintainer's employer.

Uninitialized blocks and unexpected flags

Posted Nov 30, 2012 9:21 UTC (Fri) by dgm (subscriber, #49227) [Link]

I guess that's the kind of stuff companies hire maintainers for.

Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds