Atime and btrfs: a bad combination?
One of the core design features of btrfs is its copy-on-write nature. Blocks on disk are never modified in place; instead, when it becomes necessary to commit a change, the affected block is copied and rewritten into a new location. Copy-on-write applies to metadata as well as data; if a file's metadata (such as its last-accessed time) is changed, the block containing that metadata will be copied to a new spot. So, on btrfs, an operation that reads a lot of files (creating a tar archive, say, or a recursive grep) can, through atime updates, cause the copying and rewriting of a lot of metadata blocks.
Needless to say, performance is not improved this way, but that is not where the big problem comes in. As Alexander Block pointed out, the real problem has to do with the interaction between atime, copy-on-write, and snapshots.
Btrfs provides a fast snapshotting feature that can create a copy of the state of the filesystem at a specific time. When a snapshot is created, it shares all data and metadata with the "trunk" filesystem. Should a file be changed, the resulting copy-on-write operation separates the trunk from the snapshot, keeping both versions of the data available. Because a snapshot shares data and metadata with the trunk, it requires little additional space; snapshots can thus be thought of as being nearly free as long as the filesystem remains relatively quiet.
Atime updates change the situation, though. If somebody takes a snapshot of a filesystem, then performs a recursive grep on that filesystem, the last-access time of every file touched may be updated. That, in turn, can cause copy-on-write operations on each file's inode structure, with the result that many or all of the inodes in the filesystem may be duplicated. That can increase the space consumption of the filesystem considerably; Alexander posted an example where a recursive grep caused 2.2GB of free space to disappear. That is a surprising result for what is meant to be a read-only operation.
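The effect described above can be illustrated with a toy model. The following Python sketch is not btrfs code; the class `ToyCowFs` and its methods are hypothetical names, and it assumes (unrealistically) that each inode lives in its own metadata block. It only shows how a snapshot that is free at creation time ends up pinning a second copy of every inode once atime updates start rewriting the live tree.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCowFs:
    """Toy copy-on-write model: each tree maps inode number -> metadata block id.

    Assumption for illustration only: one inode per metadata block.
    """
    live: dict = field(default_factory=dict)
    snapshots: list = field(default_factory=list)
    next_block: int = 0

    def _alloc(self):
        # Hand out a fresh block id; blocks are never modified in place.
        self.next_block += 1
        return self.next_block

    def create(self, ino):
        self.live[ino] = self._alloc()

    def snapshot(self):
        # A new snapshot shares every block with the live tree.
        self.snapshots.append(dict(self.live))

    def read_with_atime(self, ino):
        # Updating atime changes the inode, so the live tree gets a new copy;
        # any snapshot still references the old block.
        self.live[ino] = self._alloc()

    def blocks_in_use(self):
        # A block is in use while the live tree or any snapshot references it.
        used = set(self.live.values())
        for snap in self.snapshots:
            used.update(snap.values())
        return len(used)


def demo(n_files=1000):
    fs = ToyCowFs()
    for ino in range(n_files):
        fs.create(ino)
    before = fs.blocks_in_use()
    fs.snapshot()
    after_snapshot = fs.blocks_in_use()
    for ino in range(n_files):      # the "recursive grep"
        fs.read_with_atime(ino)
    after_grep = fs.blocks_in_use()
    return before, after_snapshot, after_grep
```

In this model, `demo()` returns `(1000, 1000, 2000)`: taking the snapshot costs nothing, but reading every file afterward doubles the number of metadata blocks in use. Reading everything a second time without a new snapshot leaves the count at 2000, since the superseded live copies are no longer referenced by anything.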
Once upon a time, when disk capacities were measured in megabytes, it was said that the only standard feature of Unix systems was the message of the day telling users to clean up their files. Atime was often used by harried system administrators trying to recover some disk space; they would scan for infrequently-accessed files and, depending on how desperate the situation was and how powerful their users were, either send lists of unused files to users or simply back those files up to tape and delete them. It is somewhat ironic that a feature meant (among other things) to help keep disk space free has now, on next-generation filesystems, become part of the problem.
It's worth noting that the relatime option (which only updates atime once per day unless the file has been modified since the last atime update) is of little help here. It only takes one atime update to force an inode to be rewritten and unshared with any snapshots. So the fact that such updates are limited to one per day offers little in the way of consolation.
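The relatime rule can be sketched as a small decision function. This is a simplified Python illustration, not the kernel's implementation; the function name is hypothetical, and the check against the inode change time (`ctime`) is an assumption beyond what the text above states. Timestamps are plain seconds.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read should update atime under a relatime-like policy."""
    if mtime >= atime:
        # File was modified since it was last marked as accessed.
        return True
    if ctime >= atime:
        # Assumption: an inode change also counts as "changed since last access".
        return True
    if now - atime >= DAY:
        # The recorded atime is at least a day old.
        return True
    return False
```

For a file last read two days ago, `relatime_needs_update(atime=0, mtime=-10, ctime=-10, now=2 * DAY)` is true, so the first read after a snapshot still rewrites the inode; only the repeat reads within the following day are suppressed.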
Users are also unlikely to be consoled by one other aspect of the problem pointed out by Alexander: since reading data can consume space in the filesystem, read operations might fail with "no space available" errors on an overflowing filesystem. That may make it difficult or impossible to fix the problem by copying data out of a full filesystem. By the time that happens, a typical user could be forgiven for thinking that, perhaps, they don't need last-accessed time tracking at all.
Along those lines, Alexander suggested that it might be a good idea to default to "noatime" (which turns off atime recording entirely) for btrfs mounts, even if that means that btrfs would then behave differently than other filesystems. That idea was quickly shot down for a simple reason: there are still applications that actually need the atime information to function correctly. The classic example is the mutt email client which, in the absence of atime, cannot tell whether a mailbox contains unread mail. Programs that clean up temporary directories (tmpreaper or tmpwatch, for example) will fail without atime. There are also hierarchical storage systems that, like the Unix system administrator of old, use atime to determine when to move files to slower storage. So atime needs to stick around, lest users run into a different kind of unpleasant surprise.
For now, the only recourse for users who run into (or are worried about) this problem is to explicitly mount their filesystems with the "noatime" option. Further ahead, it might be possible to make some tweaks to btrfs to mitigate the problem; Boaz Harrosh suggested disabling atime updates when the free space falls below a certain threshold or moving the atime data into a separate data structure. But nobody appears to be working on such solutions now. So it may be that, as usage of btrfs grows, users will occasionally be surprised that reading a file can consume space in their filesystems.
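As a rough aid for checking whether a system is exposed, the following Python sketch lists btrfs mounts that lack the "noatime" option. It assumes input in the whitespace-separated format used by /proc/mounts (device, mount point, filesystem type, options, ...); the function name and the sample data are made up for illustration.

```python
def btrfs_mounts_with_atime(mounts_text):
    """Return mount points of btrfs filesystems not mounted with noatime."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mountpoint, fstype, options = fields[:4]
        if fstype != "btrfs":
            continue
        if "noatime" not in options.split(","):
            result.append(mountpoint)
    return result


# Made-up sample in /proc/mounts format.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data btrfs rw,relatime,space_cache 0 0
/dev/sdc1 /backup btrfs rw,noatime,space_cache 0 0
"""

exposed = btrfs_mounts_with_atime(SAMPLE)
```

On a live Linux system the same function could be fed `open("/proc/mounts").read()`. A filesystem it reports can be switched over by adding noatime to its options in /etc/fstab or, for an already-mounted filesystem, with something like `mount -o remount,noatime <mountpoint>`.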
Posted Jun 1, 2012 4:30 UTC (Fri)
by jzbiciak (guest, #5246)
[Link] (17 responses)
I seem to recall posting a crazy idea about moving atime out to its own separately managed data structure and keeping it out of the inode entirely. (I can't find the comment now, of course.)
But still... atime is broken. It turns reads into writes and is generally just nasty. Furthermore, most things just don't need it.
Here's a totally radical, unsellable idea:
Even in the absence of that crazy idea, it still sounds like having atime in the inode allows for trivial bandwidth and storage-size amplification attacks. If you could factor out atime updates to some dedicated on-disk structure that relied more on versioning semantics than COW, you could at least fix btrfs' immediate problem without totally ditching atime. With 8-byte atime and 8-byte inode numbers (let's say), the 2.2GB quoted in the article is enough space to store 128M atime updates if they were stored like a journal.
Posted Jun 1, 2012 4:51 UTC (Fri)
by neilbrown (subscriber, #359)
[Link] (3 responses)
I don't use it a lot, but I certainly do use it from time to time to see what files are being accessed. Not a killer feature, but a valuable one.
I'm a big fan of keeping atime in a separate data structure. The liveness properties, stability requirements, and necessary precision are very different from other values in the inode and keeping it together with them is a simplification, not a requirement.
Posted Jun 1, 2012 5:06 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
It seems to me the other option, if you don't fix atime, is to mitigate it with hacks (relatime -- which doesn't work well for the attack against btrfs shown in the article) or outright disable it everywhere or almost everywhere.
My comment above was perhaps slightly over the top. Sorry for any confusion.
Posted Jun 1, 2012 14:10 UTC (Fri)
by jezuch (subscriber, #52988)
[Link] (1 responses)
Hah. I disable atime on all my filesystems and the only use I have for it is a side effect: it functions as creation time, which is much more valuable for me than access time :)
Posted Jun 1, 2012 15:13 UTC (Fri)
by jamesh (guest, #1159)
[Link]
Posted Jun 1, 2012 7:53 UTC (Fri)
by MrWim (subscriber, #47432)
[Link] (4 responses)
One nice property about this solution is that reads being writes are now explicit and if disk runs out read isn't going to fail but the failure mode can be implemented in the watching daemon.
Potentially you could take the dconf like design where a convenient atime API is provided such that atime can be read synchronously by mapping the atime "database" into the process that is reading it read-only whereas the atimed process would be the only process with write access to this file.
Posted Jun 1, 2012 12:45 UTC (Fri)
by ablock (guest, #84933)
[Link]
Posted Jun 1, 2012 13:13 UTC (Fri)
by jzbiciak (guest, #5246)
[Link] (2 responses)
Of course, here's where it breaks: How do you export that over NFS? I guess nfsd would also need to talk to that infrastructure.
My suggestion about doing it at the kernel level is that you retain the userspace ABI, and there's never an application that breaks because you've radically changed where atime gets monitored and recorded.
I guess you could provide a FUSE-like mechanism to hook userspace back up to 'stat'.
Posted Jun 3, 2012 22:39 UTC (Sun)
by MrWim (subscriber, #47432)
[Link] (1 responses)
As you say a FUSE filesystem could be offered to preserve the stat() interface but it would probably be less work to get the existing apps which use atime to use some new library. The only examples I ever hear of are tmpwatch and mutt.
Posted Jun 7, 2012 12:29 UTC (Thu)
by MrWim (subscriber, #47432)
[Link]
Posted Jun 1, 2012 9:41 UTC (Fri)
by bergwolf (guest, #55931)
[Link] (1 responses)
Posted Jun 5, 2012 10:25 UTC (Tue)
by mgedmin (subscriber, #34497)
[Link]
Posted Jun 1, 2012 18:48 UTC (Fri)
by Yorick (guest, #19241)
[Link] (4 responses)
In fact, once most people agree that there is no reason whatsoever not to mount everything with noatime, we can drop it altogether and start reaping the benefits. All operations are faster, the code becomes simpler, and we can put the now free space in inodes (both on disk and in memory) to more productive use. It is difficult to see any costs here—what would break? finger?
Then, once that has been taken care of, we can go on dealing with some other part of the baroque Unix legacy. Remove 99 % of the TTY options, perhaps? We can start slowly, by taking away the one that converts lower to upper case, and see if anyone notices.
Posted Jun 1, 2012 21:40 UTC (Fri)
by jzbiciak (guest, #5246)
[Link]
<stty olcuc>
I'M SURE ANYONE WHO MIGHT COMPLAIN WILL DO SO VERY LOUDLY.
<STTY -OLCUC>
Posted Jun 4, 2012 12:59 UTC (Mon)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted Jun 4, 2012 14:50 UTC (Mon)
by hummassa (subscriber, #307)
[Link]
One of our problems as developers is exactly that: one does not simply take features away. Lots of systems made me bury them exactly by trying to take "my" features (the ones I used and cared for and needed) away.
Posted Jun 5, 2012 12:19 UTC (Tue)
by Yorick (guest, #19241)
[Link]
To remove old cruft, a good start is quarantine. Simply don't implement atime in new file systems (btrfs); people who need it for their business-critical fingerd can run UFS or ext2 or something else. The important part is that we don't let use of a bad feature spread, since that is only going to make it harder to get rid of.
Instead of making code worse for everyone for the (dubious) benefit of a vocal minority of cavemen, deal with the problem head-on. Give them a chance to adapt - help them all you can - but set a firm date for when the coddling stops.
Posted Jun 5, 2012 11:42 UTC (Tue)
by roblucid (guest, #48964)
[Link]
Rather than whinge about it on LKML (Kernel developers & casual enthusiasts aren't the ppl who find atimes useful), implementing atimes better ought to be the focus of discussion. Ppl think about advanced features like snap-shotting and ignore the basic POSIX requirements.
atime doesn't need synchronous update guarantees, in real use the fuzzy relatime (better 23hr min update than 24, to be predictable on daily jobs) would likely be adequate. If your FS can't stand some inode info updates, during reading, then it is what is broken, not the spec.
Posted Jun 1, 2012 6:05 UTC (Fri)
by nestal (subscriber, #66970)
[Link] (4 responses)
If you "really" want atime, 2.2GB of space is like a small price to pay :P
Posted Jun 1, 2012 7:20 UTC (Fri)
by ptman (subscriber, #57271)
[Link] (2 responses)
Posted Jun 1, 2012 9:12 UTC (Fri)
by Otus (subscriber, #67685)
[Link]
That could only happen if you did daily snapshots as well, right?
Otherwise the COW source data should be freed as unreferenced.
Posted Jun 1, 2012 10:30 UTC (Fri)
by cwillu (guest, #67268)
[Link]
Posted Jun 7, 2012 14:16 UTC (Thu)
by slashdot (guest, #22014)
[Link]
Using atime is broken anyway since the fact that a file was accessed doesn't mean that the user read the mail (e.g. it could be the user grepping the mailbox).
Posted Jun 1, 2012 7:41 UTC (Fri)
by bakterie (guest, #37541)
[Link] (4 responses)
Posted Jun 2, 2012 13:03 UTC (Sat)
by Ringding (guest, #34316)
[Link]
Posted Jun 7, 2012 16:18 UTC (Thu)
by quanstro (guest, #77996)
[Link] (2 responses)
it would seem to me that this could be fixed
(it must be doing that right, otherwise how would
one would think that this is a general problem with
Posted Jun 7, 2012 23:23 UTC (Thu)
by HenrikH (subscriber, #31152)
[Link] (1 responses)
Posted Jun 10, 2012 16:52 UTC (Sun)
by quanstro (guest, #77996)
[Link]
it still makes more sense to me to copy into the
Posted Jun 1, 2012 8:39 UTC (Fri)
by geertj (guest, #4116)
[Link] (1 responses)
Posted Jun 1, 2012 9:43 UTC (Fri)
by bergwolf (guest, #55931)
[Link]
Posted Jun 1, 2012 11:53 UTC (Fri)
by pjm (guest, #2080)
[Link] (9 responses)
Posted Jun 1, 2012 12:37 UTC (Fri)
by ablock (guest, #84933)
[Link] (3 responses)
Posted Jun 1, 2012 15:07 UTC (Fri)
by drag (guest, #31333)
[Link]
Posted Jun 1, 2012 17:21 UTC (Fri)
by faramir (subscriber, #2327)
[Link] (1 responses)
As to snapshots being "cheap", that can refer to two different things: storage requirements OR execution time. Preallocation might not be cheap
Now as to why BTRFS needed such a large amount of space relative to the size of the filesystem, in the example, that would seem to be an issue with the design of BTRFS and how it interacts with traditional Unix filesystem functionality. As others have suggested, storing atimes separately (perhaps just for the "trunk") might work. Perhaps better would be to give the trunk a "current" atime allocation in addition to the standard one. Updates would go to both the current data structure (after copying) as well as the atime-only structure up until the disk was full. OTOH, this is getting complicated. Not something to figure out in a couple of minutes.
Posted Jun 2, 2012 18:52 UTC (Sat)
by drag (guest, #31333)
[Link]
I think it refers to _both_.
Plus it's tough to pre-allocate when you have no idea how much space you are actually going to have to end up using.
Posted Jun 1, 2012 18:04 UTC (Fri)
by cwillu (guest, #67268)
[Link]
Posted Jun 2, 2012 0:48 UTC (Sat)
by droundy (subscriber, #4559)
[Link] (3 responses)
Posted Jun 7, 2012 6:10 UTC (Thu)
by butlerm (subscriber, #13312)
[Link] (2 responses)
Posted Jun 7, 2012 8:33 UTC (Thu)
by dgm (subscriber, #49227)
[Link]
People tend to forget that and think that COW is cheaper than a full copy in terms of space. It is only so initially, and maybe as long as the data remains unchanged.
Posted Jun 8, 2012 3:09 UTC (Fri)
by pjm (guest, #2080)
[Link]
We all agree on physically what's happening,
and I'm sure we agree that in truth it's not just reading or snapshotting by itself that uses extra
space, it's the combination of a read and a preceding snapshot. The only question is what to do about the possibility of there not being enough space to rewrite the inode.
Some possibilities include: I don't want to advocate one solution over another, and I'm pretty happy with
what I'm told is the current approach, I'm just listing some of the options.
Posted Jun 1, 2012 15:03 UTC (Fri)
by josefwhiter (guest, #39238)
[Link]
Posted Jun 1, 2012 18:34 UTC (Fri)
by jmorris42 (guest, #2203)
[Link]
Posted Jun 1, 2012 19:20 UTC (Fri)
by ballombe (subscriber, #9523)
[Link] (6 responses)
Posted Jun 2, 2012 7:49 UTC (Sat)
by liljencrantz (guest, #28458)
[Link] (5 responses)
As such, I think it's extremely sad to see that relatively modern software, like popcon, is *still* being written that makes the misguided mistake of using this feature.
Posted Jun 3, 2012 10:19 UTC (Sun)
by ballombe (subscriber, #9523)
[Link] (3 responses)
Posted Jun 3, 2012 11:00 UTC (Sun)
by liljencrantz (guest, #28458)
[Link] (1 responses)
Posted Jun 3, 2012 14:57 UTC (Sun)
by Yorick (guest, #19241)
[Link]
We see this every time a sensible proposal comes forth to dump an old misfeature that causes way more grief than enjoyment. Control characters in file names, for example...
Posted Jun 3, 2012 19:48 UTC (Sun)
by nybble41 (subscriber, #55106)
[Link]
The downside, of course, would be that some daemon would have to run in the background to collect the audit data. However, that could still involve less overhead than updating atimes on every filesystem read.
Posted Jun 5, 2012 9:31 UTC (Tue)
by zack (subscriber, #7062)
[Link]
The "converting reads into write" argument is very nice (and lyric), but it's not particularly compelling. atime is not changing the intrinsic nature of reads. atime is a logging/accounting mechanism like many others that an OS kernel implements. It is in the nature of accounting to consume space and that happens (inevitably) also when you log actions that, per se, wouldn't have consumed any space. I don't see any inherent flaw in that, it is "just" a matter of difficult trade-offs about where to store the information and how to minimize their size when space is tight.
Posted Jun 7, 2012 9:35 UTC (Thu)
by Mity (guest, #85011)
[Link]
I.e. the unchanged files in the snapshot would silently share the atime with the live file as long as the live file is not really explicitly written.
WE CAN START SLOWLY, BY TAKING AWAY THE ONE THAT CONVERTS LOWER TO UPPER CASE, AND SEE IF ANYONE NOTICES.
The problem is Filesystem Designers not implementing it well.
> each time you execute grep (or per day that you do that, if using
> relatime).
10 snapshots in the example would take up just
one extra copy of the metadata, not 10 after
changing the live data.
without de-dup by simply making the copy into
the live file system rather than all the snapshots.
that would be O(1) for a block not O(snapshots).
we generate 10x the original metadata)
btrfs snapshots, and not specific to atime.
to begin with, then the recursive grep would only
trigger a copy into the last snapshot.
working tree rather than the snap. that way all
snaps can continue to share blocks as best they can.
Avoiding disk-full problems
in terms of storage, but it should be fairly cheap in execution time.
If I'm right about only needing preallocation for the most recent snapshot, you might be able to just handoff the preallocated space from snapshot to snapshot reducing the execution cost. As for reserving space for this, it would be kind of like the X% reserved for "root" which many filesystems have/had (which was often actually done to reduce fragmentation). The preallocation here would be to preserve functionality (i.e. working atimes) even when the filesystem was "full".
Quite right—for every odd feature there will always be someone having found a use for it and object to it being taken away. But the cost of its existence is often carried by everyone: in performance, code complexity, security, ease of use, conceptual simplicity, and so on. For atime, it should stand clear that the occasional benefits stand in no proportion to those costs.
A compromise?
(1) Specifying that atimes of files in snapshot are undefined.
(2) Hack btrfs so that an atime change alone does not trigger copy-on-write.