Temporary files: RAM or disk? [LWN.net]

Temporary files: RAM or disk?

Posted Jun 4, 2012 7:46 UTC (Mon) by dvdeug (guest, #10998) [Link] (13 responses)

The default policy should be that when I save a file it's saved. If they had created this idea that only fsync puts the file on the disk, say, forty years ago, code would be littered with fsyncs (and no doubt filesystem writers would be cheating on that invariant and complaining that people overused fsync.)

Right now, after I've spent 15 minutes working on something and saving my work along the way, if I lose my data because something didn't run fsync in that 15 minutes, I'm going to be royally pissed. It takes a lot of speed increase on a benchmark to make up for 15 minutes of lost work. The time that users lose when stuff goes wrong doesn't show up on benchmarks, though.

Temporary files: RAM or disk?

Posted Jun 4, 2012 7:57 UTC (Mon) by dlang (guest, #313) [Link] (2 responses)

the idea that your data isn't safe if the system crashes and you haven't done an fsync on that file (not just any other file in the system) HAS been around for 40 years.

current filesystems attempt to schedule data to be written to disk within about 5 seconds or so in most cases (I remember that at one point reiserfs allowed for 30 seconds, and so was posting _amazing_ benchmark numbers, for benchmarks that took <30 seconds to run), but it's possible for it to take longer, or for the data to get to disk on the wrong order, or partially get to disk (again in some random order)

because of this, applications that really care about their data in crash scenarios (databases, mail servers, log servers, etc), do have fsync calls "littered" through their code. It's only recent "desktop" software that is missing this. In part because ext3 does have such pathological behaviour on fsync

Temporary files: RAM or disk?

Posted Jun 4, 2012 21:25 UTC (Mon) by giraffedata (guest, #1954) [Link] (1 responses)

current filesystems attempt to schedule data to be written to disk within about 5 seconds or so in most cases

Are you sure? The last time I looked at this was ten years ago, but at that time there were two main periods: every 5 seconds kswapd checked for dirty pages old enough to be worth writing out and "old enough" was typically 30 seconds. That was easy to confirm on a personal computer, because 30 seconds after you stopped working, you'd see the disk light flash.

But I know economies change, so I could believe dirty pages don't last more than 5 seconds in modern Linux and frequently updated files just generate 6 times as much I/O.

Temporary files: RAM or disk?

Posted Jun 4, 2012 23:11 UTC (Mon) by dlang (guest, #313) [Link]

this is a filesystem specific time setting for the filesystem journal. I know it's ~5 seconds on ext3. it could be different on other filesystems.

also, this is for getting the journal data to disk, if the journal is just metadata it may not push the file contents to disk (although it may, to prevent the file from containing blocks that haven't been written to yet and so contain random, old data)

Temporary files: RAM or disk?

Posted Jun 4, 2012 8:00 UTC (Mon) by neilbrown (subscriber, #359) [Link] (9 responses)

> The default policy should be that when I save a file it's saved.

You are, of course, correct.
However this is a policy that is encoded in your editor, not in the filesystem. And I suspect most editors do exactly that. i.e. they call 'fsync' before 'close'.

But not every "open, write, close" sequence is an instance of "save a file". It may well be "create a temporary file which is completely uninteresting if I get interrupted". In that case an fsync would be pointless and costly. So the filesystem doesn't force an fsync on every close as the filesystem doesn't know what the 'close' means.

Any application that is handling costly-to-replace data should use fsync. An app that is handling cheap data should not. It is really that simple.

Temporary files: RAM or disk?

Posted Jun 4, 2012 9:11 UTC (Mon) by dvdeug (guest, #10998) [Link] (6 responses)

Another choice for a set of semantics would be to make programs that don't want to use a filesystem as a permanent storage area for files specify that. That is, fail safe, not fail destructive. As it is, no C program can portably save a file; fsync is not part of the C89/C99/C11 standards. Many other languages can not save a file at all without using an interface to C.

I've never seen this in textbooks and surely that should be front and center with the discussion of file I/O, that if you're actually saving user data, that you need to use fsync. It's not something you'll see very often in actual code. But should you actually be in a situation where this blows up in your face, it will be all your fault.

Temporary files: RAM or disk?

Posted Jun 4, 2012 9:51 UTC (Mon) by dgm (subscriber, #49227) [Link] (5 responses)

It's not in the C standard because it has nothing to do with C itself, but with the underlaying OS. You will find fsync() in POSIX, and it's portable as long as the target OS supports POSIX semantics (event Windows used to).

Temporary files: RAM or disk?

Posted Jun 4, 2012 10:24 UTC (Mon) by dvdeug (guest, #10998) [Link] (4 responses)

What do you mean nothing to do with C itself? Linux is interpreting C semantics to mean that a standard C program cannot reliably produce permanent files. That's certainly legal, but it means that most people who learn to write C will learn to write code that doesn't reliably produce permanent files. Linux could interpret the C commands as asking for the creation of permanent files and force people who want temporary file to use special non-portable commands.

Temporary files: RAM or disk?

Posted Jun 4, 2012 10:33 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (2 responses)

Mount your filesystems with O_SYNC and see how long you can endure that. Making everything synchronous by default is a completely useless behaviour. *NO* general purpose OS in the last years does that.
Normally you need only very few points where you fsync (or equivalent) and quite some more places where you write data...

Temporary files: RAM or disk?

Posted Jun 4, 2012 11:20 UTC (Mon) by neilbrown (subscriber, #359) [Link] (1 responses)

To be fair, O_SYNC is much stronger than what some people might reasonably want to expect.

O_SYNC means every write request is safe before the write system call returns.

An alternate semantic is that a file is safe once the last "close" on it returns. I believe this has been implemented for VFAT filesystems which people sometimes like to pull out of their computers without due care.
It is quite an acceptable trade-off in that context.

This is nearly equivalent to always calling fsync() just before close().

Adding a generic mount option to impose this semantic on any fs might be acceptable. It might at least silence some complaints.

Temporary files: RAM or disk?

Posted Jun 4, 2012 12:19 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

> To be fair, O_SYNC is much stronger than what some people might reasonably want to expect.
> O_SYNC means every write request is safe before the write system call returns.
Hm. Not sure if that really is what people expect. But I can certainly see why it would be useful for some applications. Should probably be a fd option or such though? I would be really unhappy if a rm -rf or copy -r would behave that way.

Sometimes I wish userspace controllable metadata transactions where possible with a sensible effort/interface...

Temporary files: RAM or disk?

Posted Jun 4, 2012 16:44 UTC (Mon) by dgm (subscriber, #49227) [Link]

Linux does not interpret C semantics. Linux implements POSIX semantics, and C programs use POSIX calls to access those semantics. So this has nothing to do with C, but POSIX.

POSIX offers a tool to make sure your data is safely stored: the fsync() call. POSIX and the standard C library are careful not to make any promises regarding the reliability of writes, because this would mean a burden for all systems implementing those semantics, some of which do not even have a concept of fail-proof disk writes.

Now Linux could chose to deviate from the standard, but that would be exactly the reverse of portability, wouldn't it?

Temporary files: RAM or disk?

Posted Jun 4, 2012 15:37 UTC (Mon) by giraffedata (guest, #1954) [Link] (1 responses)

Any application that is handling costly-to-replace data should use fsync. An app that is handling cheap data should not. It is really that simple.

Well, it's a little more complex because applications are more complex than just C programs. Sometimes the application is a person sitting at a workstation typing shell commands. The cost of replacing the data is proportional to the amount of data lost. For that application, the rule isn't that the application must use fsync, but that it must use a sync shell command when the cost of replacement has exceeded some threshold. But even that is oversimplified, because it makes sense for the system to do a system-wide sync automatically every 30 seconds or so to save the user that trouble.

On the other hand, we were talking before about temporary files on servers, some of which do adhere to the fsync dogma such that an automatic system-wide sync may be exactly the wrong thing to do.

Temporary files: RAM or disk?

Posted Jun 4, 2012 23:06 UTC (Mon) by dlang (guest, #313) [Link]

a system-wide sync can take quite a bit of time, and during that time it may block a lot of other activity (or make it so expensive that the system may as well be blocked)

Temporary files: RAM or disk?

Posted Jun 4, 2012 9:39 UTC (Mon) by dgm (subscriber, #49227) [Link] (5 responses)

Ext3 does worse than ext2 because it tries to keep metadata consistency, but that is useless for a tmp filesystem, where all files are going to be wiped out on reboot or crash.

It's not a regression, but a conscientious design decision, and that use case is outside of what Ext3 is good for.

ext3 regression: unnecessarily syncs temporary files

Posted Jun 4, 2012 15:43 UTC (Mon) by giraffedata (guest, #1954) [Link] (4 responses)

It's not a regression, but a conscientious design decision

It's a regression due to a conscious design decision. Regression doesn't mean mistake, it means the current thing does something worse than its predecessor. Software developers have a bias against regressions, but they do them deliberately, and for the greater good, all the time.

ext3 regression: unnecessarily syncs temporary files

Posted Jun 4, 2012 21:24 UTC (Mon) by dgm (subscriber, #49227) [Link] (3 responses)

Regression does mean mistake, and this is clearly not the case.

A more enlightening example: the latest version of the kernel requires more memory than 0.99 but nobody could possibly claim this is a regression. If anything, it's a trade-off.

ext3 regression: unnecessarily syncs temporary files

Posted Jun 5, 2012 1:42 UTC (Tue) by giraffedata (guest, #1954) [Link] (2 responses)

the latest version of the kernel requires more memory than 0.99 but nobody could possibly claim this is a regression

I claim that's a regression. Another area where kernel releases have steadily regressed: they run more slowly. And there are machines current kernels won't run on at all that previous ones could. Another regression.

I'm just going by plain meaning of the word (informed somewhat by it's etymology, the Latin for "step backward."). And the fact that it's really useful to be able to talk about the steps backward without regard to whether they're worth it.

Everyone recognizes that sometimes you have to regress in some areas in order to progress in others. And sometimes it's a matter of opinion whether the tradeoff is right. For example, regression testing often uncovers the fact that the new release runs so much slower than the previous one that some people consider it a mistake and it gets "fixed."

I like to use Opera, but almost every upgrade I've ever done has contained functional regressions, usually intentional. As they are often regressions that matter to me, I tend not to upgrade Opera (and it makes no difference to me whether it's a bug or not).

ext3 regression: unnecessarily syncs temporary files

Posted Jun 5, 2012 8:35 UTC (Tue) by dgm (subscriber, #49227) [Link] (1 responses)

Whatever, keep using 0.99 then, or better go back to first version that just printed AAAABBBB on the screen. Everything from there is a regression.

ext3 regression: unnecessarily syncs temporary files

Posted Jun 5, 2012 14:25 UTC (Tue) by giraffedata (guest, #1954) [Link]

Whatever, keep using 0.99 then, or better go back to first version that just printed AAAABBBB on the screen. Everything from there is a regression.

Everything since this is a regression in certain areas, but you seem to be missing the essential point that I stated several ways: These regressions come along with progressions. The value of the progressions outweigh the cost of the regressions. I hate in some way every "upgrade" I make, but I make them anyway.

Everyone has to balance the regressions and the progressions in deciding whether to upgrade, and distributors tend to make sure the balance is almost always in favor of the progressions. We can speak of a "net regression," which most people would not find current Linux to be with respect to 0.99.

Temporary files: RAM or disk?

Posted Jun 4, 2012 15:51 UTC (Mon) by bronson (subscriber, #4806) [Link] (14 responses)

No. There are so many buggy, non-fsyncing programs out there that, if a filesystem has 1G of writes outstanding, it's almost certainly going to lose many hours of work. (Unless it's manually flushing every 20 seconds or so, in which case that's fine but also slower than tmpfs).

In an ideal world, you're exactly right. In today's world, that would be fairly dangerous.

> I've seen filesystems that have mount options and file attributes that specifically indicate that files are temporary

Agreed, but if you're remounting part of your hierarchy with crazy mount options, why not just use tmpfs?

Temporary files: RAM or disk?

Posted Jun 4, 2012 23:08 UTC (Mon) by dlang (guest, #313) [Link] (13 responses)

because tempfs just uses ram? and while you can add swap to give you more space, the use of the swap will not be targeted. This means that you may end up with things swapped out that you really would rather have remained active, even if the result was that it took a little more time to retrieve a temporary file.

Temporary files: RAM or disk?

Posted Jun 5, 2012 7:05 UTC (Tue) by bronson (subscriber, #4806) [Link] (12 responses)

That's true, that's an important difference. But you could have a smilar situation with the filesystem-with-options, right? If the filesystem uses a lot of memory, the important things could get swapped out as well.

Temporary files: RAM or disk?

Posted Jun 5, 2012 7:19 UTC (Tue) by dlang (guest, #313) [Link] (11 responses)

True, but the difference is that it would need to be a very poorly written filesystem to eat up more memory than the contents that it's holding. And it's much easier to tell where the memory is being used, and therefor make an intelligent decision about what to write to disk (and what to throw away), than when it all has to be stored in memory and your only disk backing you have is swap.

Also, reading and writing swap tends to be rather inefficient compared to normal I/O (data ends up very fragmented on disk, bearing no resemblance to any organization that it had in ram, let alone the files being stored in tempfs.

Temporary files: RAM or disk?

Posted Jun 5, 2012 15:33 UTC (Tue) by giraffedata (guest, #1954) [Link] (10 responses)

reading and writing swap tends to be rather inefficient compared to normal I/O (data ends up very fragmented on disk, bearing no resemblance to any organization that it had in ram, let alone the files being stored in tempfs.

I believe the tendency is the other way around. One of the selling points for tmpfs for me is that reading and writing swap is more efficient than reading and writing a general purpose filesystem. First, there aren't inodes and directories to pull the head around. Second, writes stream out sequentially on disk, eliminating more seeking.

Finally, I believe it's usually the case that, for large chunks of data, the data is referenced in the same groups in which it becomes least recently used. A process loses its timeslice and its entire working set ages out at about the same time and ends up in the same place on disk. When it gets the CPU again, it faults in its entire working set at once. For a large temporary file, I believe it is even more pronounced - unlike many files, a temporary file is likely to be accessed in passes from beginning to end. I believe general purpose filesystems are only now gaining the ability to do the same placement as swapping in this case; to the extent that they succeed, though, they can at best reach parity.

In short, reading and writing swap has been (unintentionally) optimized for the access patterns of temporary files, where general purpose filesystems are not.

Temporary files: RAM or disk?

Posted Jun 6, 2012 6:53 UTC (Wed) by Serge (guest, #84957) [Link] (3 responses)

> I believe the tendency is the other way around. One of the selling points for tmpfs for me is that reading and writing swap is more efficient than reading and writing a general purpose filesystem. First, there aren't inodes and directories to pull the head around.

It's not that simple. Tmpfs is not "plain data" filesystem, you can create directories there, so it has to store all the metadata as well. It also has inodes internally.

> Second, writes stream out sequentially on disk, eliminating more seeking.

This could be true if swap was empty. Same when you write to the empty filesystem. But what if it was not empty? You get the same swap fragmentation and seeking as you would get in any regular filesystem.

> In short, reading and writing swap has been (unintentionally) optimized for the access patterns of temporary files, where general purpose filesystems are not.

And filesystem is intentionally optimized for storing files. Swap is not a plain data storage, otherwise "suspend to disk" could not work. Swap has its internal format, there're even different versions of its format (`man mkswap` reveals v0 and v1). I.e. instead of writing through one ext3fs level you write through two fs levels tmpfs+swap.

Things get worse when you start reading. When you read something from ext3, the oldest part of the filecache is dropped and data is placed to RAM. But reading from swap means that your RAM is full, and in order to read a page from swap you must first write another page there. I.e. sequential read from ext3 turns into random write+read from swap.

Temporary files: RAM or disk?

Posted Jun 6, 2012 15:24 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (1 responses)

> But reading from swap means that your RAM is full, and in order to read a page from swap you must first write another page there. I.e. sequential read from ext3 turns into random write+read from swap.

_Writing_ to swap means that your RAM is full (possibly including things like clean cache which are currently higher priority, but could be dropped at need). _Reading_ from swap implies only that something previously written to swap is needed in RAM again. There could be any amount of free space at that point. Even if RAM does happen to be full, the kernel can still drop clean data from the cache to make room, just as with reading from ext3.

Temporary files: RAM or disk?

Posted Jun 6, 2012 17:43 UTC (Wed) by dgm (subscriber, #49227) [Link]

Yes, merely reading from swap doesn't imply that your RAM is full. What is true is that _when_ your RAM is full (notice that I don't say "if") it _may_ imply a write to swap, depending in how dirty the page cache is. The problem is, tmpfs is a factor that contributes a lot to pollute the page cache. Temporary files are created to be written and then re-read in short, so all pages used by tmpfs are expected to be dirty.

All of this is of no consequence on system startup, when the page cache is mostly clean. Once the system has been up for a while, though... I think a few tests have to be done.

Temporary files: RAM or disk?

Posted Jun 7, 2012 2:28 UTC (Thu) by giraffedata (guest, #1954) [Link]

... First, there aren't inodes and directories to pull the head around.
It's not that simple. Tmpfs is not "plain data" filesystem, you can create directories there, so it has to store all the metadata as well. It also has inodes internally.

I was talking about disk structures. Inodes and directory information don't go into the swap space, so they don't pull the head around.

(But there's an argument in favor of regular filesystem /tmp: if you have lots of infrequently accessed small files, tmpfs will waste memory).

Second, writes stream out sequentially on disk, eliminating more seeking.
This could be true if swap was empty. Same when you write to the empty filesystem. But what if it was not empty? You get the same swap fragmentation and seeking as you would get in any regular filesystem.

It's the temporary nature of the data being swapped (and the strategies the kernel implements based on that expectation) that makes the data you want at any particular time less scattered in swap space than in a typical filesystem that has to keep copious eternally growing files forever. I don't know exactly what policies the swapper follows (though I have a pretty good idea), but if it were no better at storing anonymous process data than ext3 is at storing file data, we would really have to wonder at the competence of the people who designed it. And my claim is that since it's so good with process anonymous data, it should also be good with temporary files, since they're used almost the same way.

in order to read a page from swap you must first write another page there.

Actually, the system does the same thing for anonymous pages as it does for file cache pages: it tries to clean the pages before they're needed so that when a process needs to steal a page frame it usually doesn't have to wait for a page write. Also like file cache, when the system swaps a page in, it tends to leave the copy on disk too, so if it doesn't get dirty again, you can steal its page frame without having to do a page out.

Temporary files: RAM or disk?

Posted Jun 7, 2012 13:15 UTC (Thu) by njs (subscriber, #40338) [Link] (5 responses)

I don't know about tmpfs, but my experience is: if I have a process with a large (multi-gigabyte) working set, and it goes to sleep and gets swapped out, then there's no point in waking it back up again; I might as well kill it and start over. At least on our compute servers (running some vaguely recent Ubuntu, IIRC), swap-in is definitely not doing successful readahead. I've often wished for some hack that would just do a sequential read through the swap file to load one process back into memory; it would be hundreds of times faster.

Temporary files: RAM or disk?

Posted Jun 7, 2012 13:28 UTC (Thu) by Jonno (subscriber, #49613) [Link]

If you have enough free memory at the time you want to swap in that process, try to run "sudo swapoff -a ; sudo swapon -a", it will sequentially read in all swap to memory, no random access.

I find that if I have two processes with large working sets causing swaping, and kill one them, doing a swapoff will get the other one performant again much faster than letting it swap in only the stuff it needs as it needs it.

Temporary files: RAM or disk?

Posted Jun 7, 2012 15:44 UTC (Thu) by giraffedata (guest, #1954) [Link] (1 responses)

At least on our compute servers (running some vaguely recent Ubuntu, IIRC), swap-in is definitely not doing successful readahead

Good information.

That's probably a good reason to use a regular filesystem instead of tmpfs for large temporary files.

I just checked, and the only readahead tmpfs does is the normal swap readahead, which consists of reading an entire cluster of pages when one of the pages is demanded. A cluster of pages is pages that were swapped out at the same time, so they are likely to be re-referenced at the same time and are written at the same spot on the disk. But this strategy won't effect streaming, like typical filesystem readahead.

And the kernel default size of the cluster is 8 pages. You can control it with /proc/sys/vm/page-cluster, though. I would think on a system with multi-gigabyte processes, a much larger value would be optimal.

Temporary files: RAM or disk?

Posted Jun 11, 2012 14:51 UTC (Mon) by kleptog (subscriber, #1183) [Link]

This is actually related to another problem I ran into recently: is there some way see what is actually in swap? I know /proc/<pid>/smaps gives you information about which blocks are in swap. But I can't see a way to get information about the order. That is, is my swap fragmented?

Temporary files: RAM or disk?

Posted Jun 7, 2012 21:36 UTC (Thu) by quotemstr (subscriber, #45331) [Link] (1 responses)

> I've often wished for some hack that would just do a sequential read through the swap file to load one process back into memory

Windows 8 will do that for modern applications. http://blogs.msdn.com/b/b8/archive/2012/04/17/reclaiming-...

Temporary files: RAM or disk?

Posted Jun 8, 2012 0:15 UTC (Fri) by giraffedata (guest, #1954) [Link]

I've often wished for some hack that would just do a sequential read through the swap file to load one process back into memory
Windows 8 will do that for modern applications.

When njh says "hack" I think it means something an intelligent user can invoke to override the normal system paging strategy because he knows a process is going to be faulting back much of its memory anyway.

The Windows 8 thing is automatic, based on an apparently pre-existing long-term scheduling facility. Some applications get long-term scheduled out, aka "put in the background," aka "suspended," mainly so devices they are using can be powered down and save battery energy. But there is a new feature that also swaps all the process' memory out when it gets put in the background, and the OS takes care to put all the pages in one place. Then, when the process gets brought back to the foreground, the OS brings all those pages back at once, so the process is quickly running again.

This of course requires applications that explicitly go to sleep, as opposed to just quietly not touching most of their memory for a while, and then suddenly touching it all again.