Unix and Unix-like systems have traditionally recorded the time of last access for each file in the system. This practice has fallen partially out of favor over the last decade for a simple reason: writing the last-accessed time ("atime") takes up a lot of I/O bandwidth when lots of files are being read; see this article from 2007, for example. The worst of the atime-related problems have long since been mitigated by moving to the "relatime" mount option by default; relatime only updates atime a maximum of once per day for unchanging files. But now it seems that atime recording can be especially problematic with the btrfs filesystem, and relatime may not help much.
One of the core design features of btrfs is its copy-on-write nature. Blocks on disk are never modified in place; instead, when it becomes necessary to commit a change, the affected block is copied and rewritten into a new location. Copy-on-write applies to metadata as well as data; if a file's metadata (such as its last-accessed time) is changed, the block containing that metadata will be copied to a new spot. So, on btrfs, an operation that reads a lot of files (creating a tar archive, say, or a recursive grep) can, through atime updates, cause the copying and rewriting of a lot of metadata blocks.
Needless to say, performance is not improved this way, but that is not where the big problem comes in. As Alexander Block pointed out, the real problem has to do with the interaction between atime, copy-on-write, and snapshots.
Btrfs provides a fast snapshotting feature that can create a copy of the state of the filesystem at a specific time. When a snapshot is created, it shares all data and metadata with the "trunk" filesystem. Should a file be changed, the resulting copy-on-write operation separates the trunk from the snapshot, keeping both versions of the data available. So snapshots can be thought of as being nearly free as long as the filesystem remains relatively quiet. Snapshots will share data and metadata, so they do not require a lot of additional space.
Atime updates change the situation, though. If somebody takes a snapshot of a filesystem, then performs a recursive grep on that filesystem, the last-access time of every file touched may be updated. That, in turn, can cause copy-on-write operations on each file's inode structure, with the result that many or all of the inodes in the filesystem may be duplicated. That can increase the space consumption of the filesystem considerably; Alexander posted an example where a recursive grep caused 2.2GB of free space to disappear. That is a surprising result for what is meant to be a read-only operation.
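The unsharing effect described above can be illustrated with a toy model. This is only a sketch and not how btrfs is implemented: the `Inode` and `ToyFilesystem` names are hypothetical, inodes are modeled as immutable Python objects, and "space used" is approximated by counting the distinct inode objects still referenced by the live tree or by a snapshot.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Inode:
    """Immutable stand-in for an on-disk inode; any change makes a new copy."""
    name: str
    atime: int


class ToyFilesystem:
    """Hypothetical model: a live tree plus snapshots that share inode objects."""

    def __init__(self, names):
        self.live = {n: Inode(n, atime=0) for n in names}
        self.snapshots = []

    def snapshot(self):
        # A snapshot copies references only, so it is nearly free at first.
        self.snapshots.append(dict(self.live))

    def read(self, name, now):
        # Copy-on-write: the atime update builds a new inode for the live
        # tree, while any snapshot keeps pointing at the old one.
        self.live[name] = replace(self.live[name], atime=now)

    def distinct_inodes(self):
        # Count inode objects referenced by the live tree or any snapshot.
        seen = {id(inode) for inode in self.live.values()}
        for snap in self.snapshots:
            seen.update(id(inode) for inode in snap.values())
        return len(seen)


fs = ToyFilesystem(["a", "b", "c"])
fs.snapshot()
before = fs.distinct_inodes()      # live tree and snapshot share everything
for name in list(fs.live):         # a "recursive grep" touching every file
    fs.read(name, now=1)
after = fs.distinct_inodes()       # every inode now exists twice
print(before, after)
```

Without the `snapshot()` call the old inode objects would simply become unreferenced after each read, which matches the observation that the extra space is only pinned when a snapshot still points at the old copies.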
Once upon a time, when disk capacities were measured in megabytes, it was said that the only standard feature of Unix systems was the message of the day telling users to clean up their files. Atime was often used by harried system administrators trying to recover some disk space; they would scan for infrequently-accessed files and, depending on how desperate the situation was and how powerful their users were, either send lists of unused files to users or simply back those files up to tape and delete them. It is somewhat ironic that a feature meant (among other things) to help keep disk space free has now, on next-generation filesystems, become part of the problem.
It's worth noting that the relatime option (which only updates atime once per day unless the file has been modified since the last atime update) is of little help here. It only takes one atime update to force an inode to be rewritten and unshared with any snapshots. So the fact that such updates are limited to one per day offers little in the way of consolation.
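As a rough illustration of why relatime does not help, the rule can be written as a small predicate. This is a sketch of the rule as described here, not kernel code: the function name and parameters are made up for the example, and the change-time check is included on the assumption that a status change is treated like a modification.

```python
DAY = 24 * 60 * 60  # seconds


def relatime_needs_update(atime, mtime, ctime, now, max_age=DAY):
    """Return True if a read at time `now` would update atime under relatime.

    All arguments are timestamps in seconds. `max_age` is the once-per-day
    limit; it is a parameter only to make the sketch easy to experiment with.
    """
    # The file was modified (or had its status changed) since the last
    # recorded access, so the stored atime is stale.
    if mtime >= atime or ctime >= atime:
        return True
    # Otherwise, update only if the recorded atime is at least a day old.
    return now - atime >= max_age


# A file last read a week ago and never modified since: the first read after
# a snapshot still triggers an update, and one update is enough to unshare
# the inode.
print(relatime_needs_update(atime=0, mtime=-10, ctime=-10, now=7 * DAY))
```

Any file that has gone unread for more than a day passes this check on its next read, so a recursive grep over a mostly idle tree still rewrites nearly every inode once.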
Users are also unlikely to be consoled by one other aspect of the problem pointed out by Alexander: since reading data can consume space in the filesystem, read operations might fail with "no space available" errors on an overflowing filesystem. That may make it difficult or impossible to fix the problem by copying data out of a full filesystem. By the time that happens, a typical user could be forgiven for thinking that, perhaps, they don't need last-accessed time tracking at all.
Along those lines, Alexander suggested that it might be a good idea to default to "noatime" (which turns off atime recording entirely) for btrfs mounts, even if that means that btrfs would then behave differently than other filesystems. That idea was quickly shot down for a simple reason: there are still applications that actually need the atime information to function correctly. The classic example is the mutt email client which, in the absence of atime, cannot tell whether a mailbox contains unread mail. Programs that clean up temporary directories (tmpreaper or tmpwatch, for example) will fail without atime. There are also hierarchical storage systems that, like the Unix system administrator of old, use atime to determine when to move files to slower storage. So atime needs to stick around, lest users run into a different kind of unpleasant surprise.
For now, the only recourse for users who run into (or are worried about)
this problem is to explicitly mount their filesystems with the "noatime"
option. Further ahead, it might be possible to make some tweaks to btrfs
to mitigate the problem; Boaz Harrosh suggested disabling atime updates when the
free space falls below a certain threshold or moving the atime data into a
separate data structure. But nobody appears to be working on such
solutions now. So it may be that, as usage of btrfs grows, users will
occasionally be surprised that reading a file can consume space in their
filesystems.
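
For users who want to check whether their own systems are exposed, a small script can list the btrfs mounts that lack the "noatime" option. This is a minimal sketch that assumes the usual whitespace-separated layout of /proc/mounts (device, mount point, filesystem type, options); the function name and the sample data are invented for the example.

```python
def btrfs_mounts_with_atime(mounts_text):
    """Return mount points of btrfs filesystems not mounted with noatime."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type == "btrfs" and "noatime" not in options.split(","):
            result.append(mount_point)
    return result


# Invented sample in the /proc/mounts layout, for illustration only.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data btrfs rw,relatime,space_cache 0 0
/dev/sdc1 /backup btrfs rw,noatime 0 0
"""

for mount_point in btrfs_mounts_with_atime(SAMPLE):
    print(f"{mount_point}: atime updates enabled")
```

On a Linux system, the same function could be fed the contents of /proc/mounts instead of the sample text.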
| Index entries for this article | |
|---|---|
| Kernel | Btrfs |
| Kernel | Filesystems/Btrfs |
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 4:30 UTC (Fri) by jzbiciak (guest, #5246) [Link]
I seem to recall posting a crazy idea about moving atime out to its own separately managed data structure and keeping it out of the inode entirely. (I can't find the comment now, of course.)
But still... atime is broken. It turns reads into writes and is generally just nasty. Furthermore, most things just don't need it.
Here's a totally radical, unsellable idea:
Even in the absence of that crazy idea, it still sounds like having atime in the inode allows for trivial bandwidth and storage-size amplification attacks. If you could factor out atime updates to some dedicated on-disk structure that relied more on versioning semantics than COW, you could at least fix btrfs' immediate problem without totally ditching atime. With 8-byte atime and 8-byte inode numbers (let's say), the 2.2GB quoted in the article is enough space to store 128M atime updates if they were stored like a journal.
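The estimate above is easy to check. The sketch below assumes the 16-byte record proposed in the comment (8 bytes of atime plus 8 bytes of inode number) and no other journal overhead; the function name is illustrative.

```python
RECORD_BYTES = 8 + 8  # 8-byte atime plus 8-byte inode number


def records_that_fit(space_bytes, record_bytes=RECORD_BYTES):
    """Number of whole fixed-size journal records that fit in the space."""
    return space_bytes // record_bytes


# Reading "2.2GB" as decimal gigabytes:
print(records_that_fit(2_200_000_000))   # 137500000

# Rounding down to 2 GiB gives the "128M" figure exactly (128 * 2**20):
print(records_that_fit(2 * 2**30))       # 134217728
```

Either way, the space lost in the article's example would hold well over a hundred million journal-style atime records.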
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 4:51 UTC (Fri) by neilbrown (subscriber, #359) [Link]
I don't use it a lot, but I certainly do use it from time to time to see what files are being accessed. Not a killer feature, but a valuable one.
I'm a big fan of keeping atime in a separate data structure. The liveness properties, stability requirements, and necessary precision are very different from other values in the inode and keeping it together with them is a simplification, not a requirement.
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 5:06 UTC (Fri) by jzbiciak (guest, #5246) [Link]
It seems to me the other option, if you don't fix atime, is to mitigate it with hacks (relatime -- which doesn't work well for the attack against btrfs shown in the article) or outright disable it everywhere or almost everywhere.
My comment above was perhaps slightly over the top. Sorry for any confusion.
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 14:10 UTC (Fri) by jezuch (subscriber, #52988) [Link]
Hah. I disable atime on all my filesystems and the only use I have for it is a side effect: it functions as creation time, which is much more valuable for me than access time :)
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 15:13 UTC (Fri) by jamesh (guest, #1159) [Link]
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 7:53 UTC (Fri) by MrWim (subscriber, #47432) [Link]
One nice property of this solution is that reads being writes is now explicit: if the disk runs out of space, the read isn't going to fail, and the failure mode can be implemented in the watching daemon instead.
Potentially you could take the dconf like design where a convenient atime API is provided such that atime can be read synchronously by mapping the atime "database" into the process that is reading it read-only whereas the atimed process would be the only process with write access to this file.
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 12:45 UTC (Fri) by ablock (guest, #84933) [Link]
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 13:13 UTC (Fri) by jzbiciak (guest, #5246) [Link]
Of course, here's where it breaks: How do you export that over NFS? I guess nfsd would also need to talk to that infrastructure.
My suggestion about doing it at the kernel level is that you retain the userspace ABI, and there's never an application that breaks because you've radically changed where atime gets monitored and recorded.
I guess you could provide a FUSE-like mechanism to hook userspace back up to 'stat'.
Atime and btrfs: a bad combination?
Posted Jun 3, 2012 22:39 UTC (Sun) by MrWim (subscriber, #47432) [Link]
As you say a FUSE filesystem could be offered to preserve the stat() interface but it would probably be less work to get the existing apps which use atime to use some new library. The only examples I ever hear of are tmpwatch and mutt.
Atime and btrfs: a bad combination?
Posted Jun 7, 2012 12:29 UTC (Thu) by MrWim (subscriber, #47432) [Link]
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 9:41 UTC (Fri) by bergwolf (subscriber, #55931) [Link]
Atime and btrfs: a bad combination?
Posted Jun 5, 2012 10:25 UTC (Tue) by mgedmin (subscriber, #34497) [Link]
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 18:48 UTC (Fri) by Yorick (guest, #19241) [Link]
In fact, once most people agree that there is no reason whatsoever not to mount everything with noatime, we can drop it altogether and start reaping the benefits. All operations are faster, the code becomes simpler, and we can put the now free space in inodes (both on disk and in memory) to more productive use. It is difficult to see any costs here—what would break? finger?
Then, once that has been taken care of, we can go on dealing with some other part of the baroque Unix legacy. Remove 99 % of the TTY options, perhaps? We can start slowly, by taking away the one that converts lower to upper case, and see if anyone notices.
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 21:40 UTC (Fri) by jzbiciak (guest, #5246) [Link]
<stty olcuc>
WE CAN START SLOWLY, BY TAKING AWAY THE ONE THAT CONVERTS LOWER TO UPPER CASE, AND SEE IF ANYONE NOTICES.
I'M SURE ANYONE WHO MIGHT COMPLAIN WILL DO SO VERY LOUDLY.
<STTY -OLCUC>
Atime and btrfs: a bad combination?
Posted Jun 4, 2012 12:59 UTC (Mon) by nix (subscriber, #2304) [Link]
Atime and btrfs: a bad combination?
Posted Jun 4, 2012 14:50 UTC (Mon) by hummassa (subscriber, #307) [Link]
One of our problems as developers is exactly that: one does not simply take features away. Lots of systems made me bury them exactly by trying to take "my" features (the ones I used and cared for and needed) away.
Atime and btrfs: a bad combination?
Posted Jun 5, 2012 12:19 UTC (Tue) by Yorick (guest, #19241) [Link]
To remove old cruft, a good start is quarantine. Simply don't implement atime in new file systems (btrfs); people who need it for their business-critical fingerd can run UFS or ext2 or something else. The important part is that we don't let use of a bad feature spread, since that is only going to make it harder to get rid of.
Instead of making code worse for everyone for the (dubious) benefit of a vocal minority of cavemen, deal with the problem head-on. Give them a chance to adapt - help them all you can - but set a firm date for when the coddling stops.
Atime and btrfs: a bad combination?
Posted Jun 5, 2012 11:42 UTC (Tue) by roblucid (guest, #48964) [Link]
Rather than whinging about it on LKML (kernel developers & casual enthusiasts aren't the ppl who find atimes useful), implementing atimes better ought to be the focus of discussion. Ppl think about advanced features like snapshotting and ignore the basic POSIX requirements.
atime doesn't need synchronous update guarantees; in real use the fuzzy relatime (better 23hr min update than 24, to be predictable on daily jobs) would likely be adequate. If your FS can't stand some inode info updates during reading, then it is what is broken, not the spec.
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 6:05 UTC (Fri) by nestal (subscriber, #66970) [Link]
If you "really" want atime, 2.2GB of space is like a small price to pay :P
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 7:20 UTC (Fri) by ptman (subscriber, #57271) [Link]
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 9:12 UTC (Fri) by Otus (subscriber, #67685) [Link]
That could only happen if you did daily snapshots as well, right?
Otherwise the COW source data should be freed as unreferenced.
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 10:30 UTC (Fri) by cwillu (subscriber, #67268) [Link]
Atime and btrfs: a bad combination?
Posted Jun 7, 2012 14:16 UTC (Thu) by slashdot (guest, #22014) [Link]
Using atime is broken anyway since the fact that a file was accessed doesn't mean that the user read the mail (e.g. it could be the user grepping the mailbox).
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 7:41 UTC (Fri) by bakterie (guest, #37541) [Link]
Atime and btrfs: a bad combination?
Posted Jun 2, 2012 13:03 UTC (Sat) by Ringding (subscriber, #34316) [Link]
Atime and btrfs: a bad combination?
Posted Jun 7, 2012 16:18 UTC (Thu) by quanstro (subscriber, #77996) [Link]
it would seem to me that this could be fixed
without de-dup by simply making the copy into
the live file system rather than all the snapshots.
that would be O(1) for a block not O(snapshots).
(it must be doing that right, otherwise how would
we generate 10x the original metadata)
one would think that this is a general problem with
btrfs snapshots, and not specific to atime.
Atime and btrfs: a bad combination?
Posted Jun 7, 2012 23:23 UTC (Thu) by HenrikH (subscriber, #31152) [Link]
Atime and btrfs: a bad combination?
Posted Jun 10, 2012 16:52 UTC (Sun) by quanstro (subscriber, #77996) [Link]
it still makes more sense to me to copy into the
working tree rather than the snap. that way all
snaps can continue to share blocks as best they can.
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 8:39 UTC (Fri) by geertj (subscriber, #4116) [Link]
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 9:43 UTC (Fri) by bergwolf (subscriber, #55931) [Link]
Avoiding disk-full problems
Posted Jun 1, 2012 11:53 UTC (Fri) by pjm (guest, #2080) [Link]
Avoiding disk-full problems
Posted Jun 1, 2012 12:37 UTC (Fri) by ablock (guest, #84933) [Link]
Avoiding disk-full problems
Posted Jun 1, 2012 15:07 UTC (Fri) by drag (guest, #31333) [Link]
Avoiding disk-full problems
Posted Jun 1, 2012 17:21 UTC (Fri) by faramir (subscriber, #2327) [Link]
As to snapshots being "cheap", that can refer to two different things: storage requirements OR execution time. Preallocation might not be cheap in terms of storage, but it should be fairly cheap in execution time.
If I'm right about only needing preallocation for the most recent snapshot, you might be able to just handoff the preallocated space from snapshot to snapshot reducing the execution cost. As for reserving space for this, it would be kind of like the X% reserved for "root" which many filesystems have/had (which was often actually done to reduce fragmentation). The preallocation here would be to preserve functionality (i.e. working atimes) even when the filesystem was "full".
Now as to why BTRFS needed such a large amount of space relative to the size of the filesystem in the example, that would seem to be an issue with the design of BTRFS and how it interacts with traditional Unix filesystem functionality. As others have suggested, storing atimes separately (perhaps just for the "trunk") might work. Perhaps better would be to give the trunk a "current" atime allocation in addition to the standard one. Updates would go to both the current data structure (after copying) as well as the atime-only structure up until the disk was full. OTOH, this is getting complicated. Not something to figure out in a couple of minutes.
Avoiding disk-full problems
Posted Jun 2, 2012 18:52 UTC (Sat) by drag (guest, #31333) [Link]
I think it refers to _both_.
Plus it's tough to pre-allocate when you have no idea how much space you are actually going to have to end up using.
Avoiding disk-full problems
Posted Jun 1, 2012 18:04 UTC (Fri) by cwillu (subscriber, #67268) [Link]
Avoiding disk-full problems
Posted Jun 2, 2012 0:48 UTC (Sat) by droundy (subscriber, #4559) [Link]
Avoiding disk-full problems
Posted Jun 7, 2012 6:10 UTC (Thu) by butlerm (guest, #13312) [Link]
Avoiding disk-full problems
Posted Jun 7, 2012 8:33 UTC (Thu) by dgm (subscriber, #49227) [Link]
People tend to forget that and think that COW is cheaper than a full copy in terms of space. It is only cheaper initially, and maybe for as long as the data remains unchanged.
Avoiding disk-full problems
Posted Jun 8, 2012 3:09 UTC (Fri) by pjm (guest, #2080) [Link]
We all agree on physically what's happening, and I'm sure we agree that in truth it's not just reading or snapshotting by itself that uses extra space, it's the combination of a read and a preceding snapshot.
The only question is what to do about the possibility of there not being enough space to rewrite the inode. Some possibilities include:
I don't want to advocate one solution over another, and I'm pretty happy with what I'm told is the current approach; I'm just listing some of the options.
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 15:03 UTC (Fri) by josefwhiter (guest, #39238) [Link]
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 18:34 UTC (Fri) by jmorris42 (guest, #2203) [Link]
Atime and btrfs: a bad combination?
Posted Jun 1, 2012 19:20 UTC (Fri) by ballombe (subscriber, #9523) [Link]
Atime and btrfs: a bad combination?
Posted Jun 2, 2012 7:49 UTC (Sat) by liljencrantz (guest, #28458) [Link]
As such, I think it's extremely sad to see that relatively modern software, like popcon, is *still* being written that makes the misguided mistake of using this feature.
Atime and btrfs: a bad combination?
Posted Jun 3, 2012 10:19 UTC (Sun) by ballombe (subscriber, #9523) [Link]
Atime and btrfs: a bad combination?
Posted Jun 3, 2012 11:00 UTC (Sun) by liljencrantz (guest, #28458) [Link]
Atime and btrfs: a bad combination?
Posted Jun 3, 2012 14:57 UTC (Sun) by Yorick (guest, #19241) [Link]
Quite right: for every odd feature there will always be someone who has found a use for it and objects to it being taken away. But the cost of its existence is often carried by everyone: in performance, code complexity, security, ease of use, conceptual simplicity, and so on. For atime, it should stand clear that the occasional benefits stand in no proportion to those costs. We see this every time a sensible proposal comes forth to dump an old misfeature that causes way more grief than enjoyment. Control characters in file names, for example...
Atime and btrfs: a bad combination?
Posted Jun 3, 2012 19:48 UTC (Sun) by nybble41 (subscriber, #55106) [Link]
The downside, of course, would be that some daemon would have to run in the background to collect the audit data. However, that could still involve less overhead than updating atimes on every filesystem read.
Atime and btrfs: a bad combination?
Posted Jun 5, 2012 9:31 UTC (Tue) by zack (subscriber, #7062) [Link]
The "converting reads into writes" argument is very nice (and lyric), but it's not particularly compelling. atime is not changing the intrinsic nature of reads. atime is a logging/accounting mechanism like many others that an OS kernel implements. It is in the nature of accounting to consume space, and that happens (inevitably) also when you log actions that, per se, wouldn't have consumed any space. I don't see any inherent flaw in that; it is "just" a matter of difficult trade-offs about where to store the information and how to minimize its size when space is tight.
A compromise?
Posted Jun 7, 2012 9:35 UTC (Thu) by Mity (guest, #85011) [Link]
I.e. the unchanged files in the snapshot would silently share the atime with the live file as long as the live file is not really explicitly written.
Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds