Temporary files: RAM or disk?
Posted Jun 1, 2012 2:41 UTC (Fri) by neilbrown (subscriber, #359)
Parent article: Temporary files: RAM or disk?
Many years ago I worked with Apollo workstations running "Domain/OS" - which was Unix-like. They didn't have a swap partition, or a swap file. They just used spare space in the filesystem for swap.
Could that work for Linux? You could probably create a user-space solution that monitored swap usage and created new swap files on demand. But I suspect it wouldn't work very well.
Or you could teach Linux filesystems to support swap files that grow on demand - or instantiate space on demand.
Once the swap-over-NFS patches get merged this should be quite possible. The filesystem is told that a given file is being used for swap, then it can preload enough data so that it can allocate space immediately without needing any further memory allocation. You could then create a 100G sparse file and add that as a swap destination and it would "just work". Writing to a tmpfs filesystem would be fast for small files, but big files would spill out into the same space as is used by the filesystem.
(Yes, I realise this is a long-term solution while what is needed is a short-term solution.)
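For illustration, a rough user-space sketch of that "create new swap files on demand" idea - the poll interval, threshold, size and the /var/swap path are all made up, and as noted above it would probably react too slowly to be reliable:
#!/bin/sh
# Watch SwapFree and add a 1GB swap file whenever it drops too low.
while sleep 10; do
    free_kb=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
    if [ "$free_kb" -lt 262144 ]; then        # less than ~256MB of swap left
        f=/var/swap/swap.$(date +%s)          # assumes /var/swap exists
        dd if=/dev/zero of="$f" bs=1M count=1024 &&
            chmod 600 "$f" && mkswap "$f" && swapon "$f"
    fi
done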
Posted Jun 1, 2012 3:18 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (67 responses)
Right now if I need to create a big file (and I do need it quite often) there is no alternative for /tmp.
Posted Jun 1, 2012 3:36 UTC (Fri)
by neilbrown (subscriber, #359)
[Link] (66 responses)
Isn't that answered in the article?
Because "as it is", /tmp imposes unnecessary disk IO which can be noticed when creating lots of small short-lived files. Let's see if we can make it faster, without making it any smaller.
Posted Jun 1, 2012 5:17 UTC (Fri)
by wahern (subscriber, #37304)
[Link] (65 responses)
Posted Jun 1, 2012 5:31 UTC (Fri)
by neilbrown (subscriber, #359)
[Link] (52 responses)
With journalling things become a bit more complex. You need to ensure that the various metadata are journalled in the right order, and by far the easiest way to do that is to place every updated block in the "next" transaction. So with ext3 journalling (if I understand it correctly), every metadata block that gets changed will be written to the journal on the next journal commit, and then to the filesystem.
A filesystem which does delayed allocation would be better placed to optimise out short-lived files completely, and maybe ext4/xfs/btrfs do better at this. However I suspect it is far from trivial to optimise out *all* storage updates for short-lived files, and I doubt it is something that fs developers optimise for.
So I think that you probably could see it as a filesystem problem, but I'm not sure that seeing it that way would lead to the best solution (but if some fs developers see this as a challenge and prove me wrong, I won't complain).
Posted Jun 1, 2012 7:03 UTC (Fri)
by wookey (guest, #5501)
[Link] (44 responses)
One thing Serge keeps coming back to is 'Please show us real-world improvements from /tmp-in-tmpfs, significant enough to make it a better _default_, given the well-documented problems'. This seems to be key, and I leave it to posters to make up their own minds about that. I certainly learned a lot from the thread. It starts here: https://lwn.net/Articles/499534/. And there is clearly a longer-term issue to fix this properly.
Posted Jun 1, 2012 7:35 UTC (Fri)
by wujj123456 (guest, #84680)
[Link] (40 responses)
I always mount /tmp as tmpfs, but I have large RAM and know exactly what I am doing. I used to analyze ~10G of data, and reading from RAM was at least 300% faster, even including the heavy data processing. I also rendered movies using tmpfs when the size fit, and again observed a dramatic difference.
The problem is: if a user cares about that performance difference, he probably knows how to use tmpfs himself. Setting /tmp to tmpfs will confuse normal users when an application fails. Given the popularity of those big distros, it might not be a good move. Even Firefox doesn't store its temporary files in /tmp unless you override it in about:config. It might be worthwhile to check how existing applications are using tmpfs (/dev/shm). I have a feeling that most applications don't care at all.
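For what it's worth, a quick way to get a rough idea (assuming lsof is installed; it just lists processes that currently hold files open under those directories):
lsof +D /tmp
lsof +D /dev/shm
du -sh /tmp /dev/shm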
Posted Jun 1, 2012 7:43 UTC (Fri)
by neilbrown (subscriber, #359)
[Link]
Are you serious? The only people who care about performance are people who dig into the arcane configuration details of OSes?? I don't think so.
Wasn't there a recent quote of the week along the lines of "We should make things simple and safe so that people don't *need* to carefully form good habits."?? I think that applies here too, only so that people don't *need* to dig into arcane details.
I agree that we shouldn't make /tmp == tmpfs the default while it causes problems. But I do think that we should work to fix the problems so that we can do it safely.
Posted Jun 2, 2012 7:01 UTC (Sat)
by Los__D (guest, #15263)
[Link]
Errrr... Yeah, right.
Posted Jun 2, 2012 23:41 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (37 responses)
You imply that with /tmp in a disk-based filesystem, you didn't read from RAM. Why would that be? Why weren't your files in cache?
Posted Jun 3, 2012 15:09 UTC (Sun)
by bronson (subscriber, #4806)
[Link] (36 responses)
I can write 1G of data to tmpfs, read it back, and delete it (a typical scientific profile), without ever expecting it to hit rust. I'd be very VERY disappointed in any filesystem that allowed its write buffers to get that far behind.
Posted Jun 3, 2012 17:44 UTC (Sun)
by giraffedata (guest, #1954)
[Link] (35 responses)
Getting this far behind is a valuable feature, and any filesystem that doesn't let you do it is lacking. Someone pointed out earlier that the more modern ext3 is incapable of getting that far behind, whereas the less modern ext2 is quite capable of it. That's a regression (but it effectively explains why a tmpfs /tmp could be faster than an ext3 one).
I've seen filesystems that have mount options and file attributes that specifically indicate that files are temporary -- likely to be overwritten or deleted soon -- so that the page replacement algorithm doesn't waste valuable I/O time cleaning the file's pages.
Furthermore, many people believe that whenever you want data to be hardened to disk, you should fsync. Given that philosophy, the default kernel policy should be not to write the data to disk until you need the memory (with some allowance for forecasting future need for memory).
Posted Jun 4, 2012 7:46 UTC (Mon)
by dvdeug (guest, #10998)
[Link] (13 responses)
Right now, after I've spent 15 minutes working on something and saving my work along the way, if I lose my data because something didn't run fsync in that 15 minutes, I'm going to be royally pissed. It takes a lot of speed increase on a benchmark to make up for 15 minutes of lost work. The time that users lose when stuff goes wrong doesn't show up on benchmarks, though.
Posted Jun 4, 2012 7:57 UTC (Mon)
by dlang (guest, #313)
[Link] (2 responses)
current filesystems attempt to schedule data to be written to disk within about 5 seconds or so in most cases (I remember that at one point reiserfs allowed for 30 seconds, and so was posting _amazing_ benchmark numbers for benchmarks that took <30 seconds to run), but it's possible for it to take longer, or for the data to get to disk in the wrong order, or to only partially get to disk (again, in some random order)
because of this, applications that really care about their data in crash scenarios (databases, mail servers, log servers, etc), do have fsync calls "littered" through their code. It's only recent "desktop" software that is missing this. In part because ext3 does have such pathological behaviour on fsync
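The flusher timing is visible in the vm sysctls (values are in centiseconds; the numbers shown are common defaults, but check your own kernel), and the ext3/ext4 journal commit interval is a separate knob:
cat /proc/sys/vm/dirty_writeback_centisecs   # how often the flusher wakes up, e.g. 500 = 5s
cat /proc/sys/vm/dirty_expire_centisecs      # how old dirty data must be before it is written, e.g. 3000 = 30s
mount -o remount,commit=5 /                  # example only: ext3/ext4 journal commit every 5 seconds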
Posted Jun 4, 2012 21:25 UTC (Mon)
by giraffedata (guest, #1954)
[Link] (1 responses)
Are you sure? The last time I looked at this was ten years ago, but at that time there were two main periods: every 5 seconds kswapd checked for dirty pages old enough to be worth writing out and "old enough" was typically 30 seconds. That was easy to confirm on a personal computer, because 30 seconds after you stopped working, you'd see the disk light flash.
But I know economies change, so I could believe dirty pages don't last more than 5 seconds in modern Linux and frequently updated files just generate 6 times as much I/O.
Posted Jun 4, 2012 23:11 UTC (Mon)
by dlang (guest, #313)
[Link]
also, this is for getting the journal data to disk, if the journal is just metadata it may not push the file contents to disk (although it may, to prevent the file from containing blocks that haven't been written to yet and so contain random, old data)
Posted Jun 4, 2012 8:00 UTC (Mon)
by neilbrown (subscriber, #359)
[Link] (9 responses)
You are, of course, correct.
But not every "open, write, close" sequence is an instance of "save a file". It may well be "create a temporary file which is completely uninteresting if I get interrupted". In that case an fsync would be pointless and costly. So the filesystem doesn't force an fsync on every close, as the filesystem doesn't know what the 'close' means.
However this is a policy that is encoded in your editor, not in the filesystem. And I suspect most editors do exactly that, i.e. they call 'fsync' before 'close'.
Any application that is handling costly-to-replace data should use fsync. An app that is handling cheap data should not. It is really that simple.
Posted Jun 4, 2012 9:11 UTC (Mon)
by dvdeug (guest, #10998)
[Link] (6 responses)
I've never seen this in textbooks, and surely it should be front and center in any discussion of file I/O that if you're actually saving user data, you need to use fsync. It's not something you'll see very often in actual code. But should you actually be in a situation where this blows up in your face, it will be all your fault.
Posted Jun 4, 2012 9:51 UTC (Mon)
by dgm (subscriber, #49227)
[Link] (5 responses)
Posted Jun 4, 2012 10:24 UTC (Mon)
by dvdeug (guest, #10998)
[Link] (4 responses)
Posted Jun 4, 2012 10:33 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
Normally you need only very few points where you fsync (or equivalent), and quite a few more places where you write data...
Posted Jun 4, 2012 11:20 UTC (Mon)
by neilbrown (subscriber, #359)
[Link] (1 responses)
O_SYNC means every write request is safe before the write system call returns.
An alternate semantic is that a file is safe once the last "close" on it returns. I believe this has been implemented for VFAT filesystems, which people sometimes like to pull out of their computers without due care. It is quite an acceptable trade-off in that context.
This is nearly equivalent to always calling fsync() just before close().
Adding a generic mount option to impose this semantic on any fs might be acceptable. It might at least silence some complaints.
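Rough examples of both behaviours as they exist today (device and mount point names are made up): "-o sync" gives roughly write-through semantics on most filesystems, and VFAT already has a "flush" option that pushes data out eagerly, e.g. when a file is closed:
mount -o sync /dev/sdb1 /mnt/usb            # every write goes straight to the medium (slow)
mount -t vfat -o flush /dev/sdb1 /mnt/usb   # FAT-specific eager flushing, close to "safe once closed"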
Posted Jun 4, 2012 12:19 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link]
> O_SYNC means every write request is safe before the write system call returns.
Hm. Not sure that really is what people expect. But I can certainly see why it would be useful for some applications. It should probably be an fd option or such, though? I would be really unhappy if a rm -rf or cp -r behaved that way.
Sometimes I wish userspace-controllable metadata transactions were possible with a sensible effort/interface...
Posted Jun 4, 2012 16:44 UTC (Mon)
by dgm (subscriber, #49227)
[Link]
POSIX offers a tool to make sure your data is safely stored: the fsync() call. POSIX and the standard C library are careful not to make any promises regarding the reliability of writes, because this would mean a burden for all systems implementing those semantics, some of which do not even have a concept of fail-proof disk writes.
Now Linux could choose to deviate from the standard, but that would be exactly the reverse of portability, wouldn't it?
Posted Jun 4, 2012 15:37 UTC (Mon)
by giraffedata (guest, #1954)
[Link] (1 responses)
> Any application that is handling costly-to-replace data should use fsync. An app that is handling cheap data should not. It is really that simple.
Well, it's a little more complex, because applications are more complex than just C programs. Sometimes the application is a person sitting at a workstation typing shell commands. The cost of replacing the data is proportional to the amount of data lost. For that application, the rule isn't that the application must use fsync, but that it must use a sync shell command when the cost of replacement has exceeded some threshold. But even that is oversimplified, because it makes sense for the system to do a system-wide sync automatically every 30 seconds or so to save the user that trouble.
On the other hand, we were talking before about temporary files on servers, some of which do adhere to the fsync dogma such that an automatic system-wide sync may be exactly the wrong thing to do.
Posted Jun 4, 2012 23:06 UTC (Mon)
by dlang (guest, #313)
[Link]
Posted Jun 4, 2012 9:39 UTC (Mon)
by dgm (subscriber, #49227)
[Link] (5 responses)
It's not a regression, but a conscientious design decision, and that use case is outside of what Ext3 is good for.
Posted Jun 4, 2012 15:43 UTC (Mon)
by giraffedata (guest, #1954)
[Link] (4 responses)
It's a regression due to a conscious design decision. Regression doesn't mean mistake, it means the current thing does something worse than its predecessor. Software developers have a bias against regressions, but they do them deliberately, and for the greater good, all the time.
Posted Jun 4, 2012 21:24 UTC (Mon)
by dgm (subscriber, #49227)
[Link] (3 responses)
A more enlightening example: the latest version of the kernel requires more memory than 0.99 but nobody could possibly claim this is a regression. If anything, it's a trade-off.
Posted Jun 5, 2012 1:42 UTC (Tue)
by giraffedata (guest, #1954)
[Link] (2 responses)
I claim that's a regression. Another area where kernel releases have steadily regressed: they run more slowly. And there are machines current kernels won't run on at all that previous ones could. Another regression.
I'm just going by the plain meaning of the word (informed somewhat by its etymology, the Latin for "step backward"). And the fact that it's really useful to be able to talk about the steps backward without regard to whether they're worth it.
Everyone recognizes that sometimes you have to regress in some areas in order to progress in others. And sometimes it's a matter of opinion whether the tradeoff is right. For example, regression testing often uncovers the fact that the new release runs so much slower than the previous one that some people consider it a mistake and it gets "fixed."
I like to use Opera, but almost every upgrade I've ever done has contained functional regressions, usually intentional. As they are often regressions that matter to me, I tend not to upgrade Opera (and it makes no difference to me whether it's a bug or not).
Posted Jun 5, 2012 8:35 UTC (Tue)
by dgm (subscriber, #49227)
[Link] (1 responses)
Whatever, keep using 0.99 then, or better go back to the first version that just printed AAAABBBB on the screen. Everything from there is a regression.
Posted Jun 5, 2012 14:25 UTC (Tue)
by giraffedata (guest, #1954)
[Link]
Everything since then is a regression in certain areas, but you seem to be missing the essential point that I stated several ways: these regressions come along with progressions. The value of the progressions outweighs the cost of the regressions. I hate in some way every "upgrade" I make, but I make them anyway.
Everyone has to balance the regressions and the progressions in deciding whether to upgrade, and distributors tend to make sure the balance is almost always in favor of the progressions. We can speak of a "net regression," which most people would not find current Linux to be with respect to 0.99.
Posted Jun 4, 2012 15:51 UTC (Mon)
by bronson (subscriber, #4806)
[Link] (14 responses)
In an ideal world, you're exactly right. In today's world, that would be fairly dangerous.
> I've seen filesystems that have mount options and file attributes that specifically indicate that files are temporary
Agreed, but if you're remounting part of your hierarchy with crazy mount options, why not just use tmpfs?
Posted Jun 4, 2012 23:08 UTC (Mon)
by dlang (guest, #313)
[Link] (13 responses)
Posted Jun 5, 2012 7:05 UTC (Tue)
by bronson (subscriber, #4806)
[Link] (12 responses)
Posted Jun 5, 2012 7:19 UTC (Tue)
by dlang (guest, #313)
[Link] (11 responses)
Also, reading and writing swap tends to be rather inefficient compared to normal I/O (data ends up very fragmented on disk, bearing no resemblance to any organization that it had in RAM, let alone to the files being stored in tmpfs).
Posted Jun 5, 2012 15:33 UTC (Tue)
by giraffedata (guest, #1954)
[Link] (10 responses)
I believe the tendency is the other way around. One of the selling points for tmpfs for me is that reading and writing swap is more efficient than reading and writing a general purpose filesystem. First, there aren't inodes and directories to pull the head around. Second, writes stream out sequentially on disk, eliminating more seeking.
Finally, I believe it's usually the case that, for large chunks of data, the data is referenced in the same groups in which it becomes least recently used. A process loses its timeslice and its entire working set ages out at about the same time and ends up in the same place on disk. When it gets the CPU again, it faults in its entire working set at once. For a large temporary file, I believe it is even more pronounced - unlike many files, a temporary file is likely to be accessed in passes from beginning to end. I believe general purpose filesystems are only now gaining the ability to do the same placement as swapping in this case; to the extent that they succeed, though, they can at best reach parity.
In short, reading and writing swap has been (unintentionally) optimized for the access patterns of temporary files, where general purpose filesystems are not.
Posted Jun 6, 2012 6:53 UTC (Wed)
by Serge (guest, #84957)
[Link] (3 responses)
It's not that simple. Tmpfs is not "plain data" filesystem, you can create directories there, so it has to store all the metadata as well. It also has inodes internally.
> Second, writes stream out sequentially on disk, eliminating more seeking.
This could be true if swap was empty. Same when you write to the empty filesystem. But what if it was not empty? You get the same swap fragmentation and seeking as you would get in any regular filesystem.
> In short, reading and writing swap has been (unintentionally) optimized for the access patterns of temporary files, where general purpose filesystems are not.
And filesystem is intentionally optimized for storing files. Swap is not a plain data storage, otherwise "suspend to disk" could not work. Swap has its internal format, there're even different versions of its format (`man mkswap` reveals v0 and v1). I.e. instead of writing through one ext3fs level you write through two fs levels tmpfs+swap.
Things get worse when you start reading. When you read something from ext3, the oldest part of the filecache is dropped and data is placed to RAM. But reading from swap means that your RAM is full, and in order to read a page from swap you must first write another page there. I.e. sequential read from ext3 turns into random write+read from swap.
Posted Jun 6, 2012 15:24 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
_Writing_ to swap means that your RAM is full (possibly including things like clean cache which are currently higher priority, but could be dropped at need). _Reading_ from swap implies only that something previously written to swap is needed in RAM again. There could be any amount of free space at that point. Even if RAM does happen to be full, the kernel can still drop clean data from the cache to make room, just as with reading from ext3.
Posted Jun 6, 2012 17:43 UTC (Wed)
by dgm (subscriber, #49227)
[Link]
All of this is of no consequence on system startup, when the page cache is mostly clean. Once the system has been up for a while, though... I think a few tests have to be done.
Posted Jun 7, 2012 2:28 UTC (Thu)
by giraffedata (guest, #1954)
[Link]
> > ... First, there aren't inodes and directories to pull the head around.
> It's not that simple. Tmpfs is not "plain data" filesystem, you can create directories there, so it has to store all the metadata as well. It also has inodes internally.
I was talking about disk structures. Inodes and directory information don't go into the swap space, so they don't pull the head around.
(But there's an argument in favor of a regular filesystem /tmp: if you have lots of infrequently accessed small files, tmpfs will waste memory.)
> > Second, writes stream out sequentially on disk, eliminating more seeking.
> This could be true if swap was empty. Same when you write to the empty filesystem. But what if it was not empty? You get the same swap fragmentation and seeking as you would get in any regular filesystem.
It's the temporary nature of the data being swapped (and the strategies the kernel implements based on that expectation) that makes the data you want at any particular time less scattered in swap space than in a typical filesystem that has to keep copious, eternally growing files forever. I don't know exactly what policies the swapper follows (though I have a pretty good idea), but if it were no better at storing anonymous process data than ext3 is at storing file data, we would really have to wonder at the competence of the people who designed it. And my claim is that since it's so good with anonymous process data, it should also be good with temporary files, since they're used in almost the same way.
> in order to read a page from swap you must first write another page there.
Actually, the system does the same thing for anonymous pages as it does for file cache pages: it tries to clean the pages before they're needed, so that when a process needs to steal a page frame it usually doesn't have to wait for a page write. Also like the file cache, when the system swaps a page in, it tends to leave the copy on disk too, so if the page doesn't get dirty again, you can steal its page frame without having to do a page-out.
Posted Jun 7, 2012 13:15 UTC (Thu)
by njs (subscriber, #40338)
[Link] (5 responses)
Posted Jun 7, 2012 13:28 UTC (Thu)
by Jonno (subscriber, #49613)
[Link]
I find that if I have two processes with large working sets causing swapping, and I kill one of them, doing a swapoff will get the other one performant again much faster than letting it swap in only the stuff it needs as it needs it.
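The blunt version of that trick, assuming everything still fits in RAM once the other process is gone:
swapoff -a && swapon -a    # force everything back in, then re-enable swap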
Posted Jun 7, 2012 15:44 UTC (Thu)
by giraffedata (guest, #1954)
[Link] (1 responses)
> At least on our compute servers (running some vaguely recent Ubuntu, IIRC), swap-in is definitely not doing successful readahead
Good information.
That's probably a good reason to use a regular filesystem instead of tmpfs for large temporary files.
I just checked, and the only readahead tmpfs does is the normal swap readahead, which consists of reading an entire cluster of pages when one of the pages is demanded. A cluster of pages is pages that were swapped out at the same time, so they are likely to be re-referenced at the same time and are written at the same spot on the disk. But this strategy won't produce streaming reads the way typical filesystem readahead does.
And the kernel default size of the cluster is 8 pages. You can control it with /proc/sys/vm/page-cluster, though. I would think that on a system with multi-gigabyte processes a much larger value would be optimal.
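Note that the sysctl holds the base-2 log of the cluster size, so the default of 3 means 2^3 = 8 pages. A sketch of checking and raising it (the value 5 is just an example):
cat /proc/sys/vm/page-cluster          # 3 -> 8 pages per swap-in
echo 5 > /proc/sys/vm/page-cluster     # 2^5 = 32 pages, for large sequential working sets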
Posted Jun 11, 2012 14:51 UTC (Mon)
by kleptog (subscriber, #1183)
[Link]
Posted Jun 7, 2012 21:36 UTC (Thu)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Windows 8 will do that for modern applications. http://blogs.msdn.com/b/b8/archive/2012/04/17/reclaiming-...
Posted Jun 8, 2012 0:15 UTC (Fri)
by giraffedata (guest, #1954)
[Link]
> I've often wished for some hack that would just do a sequential read through the swap file to load one process back into memory
> Windows 8 will do that for modern applications.
When njs says "hack" I think it means something an intelligent user can invoke to override the normal system paging strategy, because he knows a process is going to be faulting back much of its memory anyway.
The Windows 8 thing is automatic, based on an apparently pre-existing long-term scheduling facility. Some applications get long-term scheduled out, aka "put in the background," aka "suspended," mainly so devices they are using can be powered down and save battery energy. But there is a new feature that also swaps all the process' memory out when it gets put in the background, and the OS takes care to put all the pages in one place. Then, when the process gets brought back to the foreground, the OS brings all those pages back at once, so the process is quickly running again.
This of course requires applications that explicitly go to sleep, as opposed to just quietly not touching most of their memory for a while, and then suddenly touching it all again.
Posted Jun 8, 2012 0:59 UTC (Fri)
by CycoJ (guest, #70454)
[Link] (2 responses)
Posted Jun 8, 2012 17:14 UTC (Fri)
by apoelstra (subscriber, #75205)
[Link]
It's screaming fast. I originally started doing this when I had my $HOME mounted over SSHFS, and Firefox would single-handedly saturate my pipe, and took forever to do anything. Its disk IO is (was) obscene.
This also has the benefit (if you want to see it that way) that my history does not get so filled with garbage, since every reboot the profile is reset. I have a line in my .Xclients which copies a template .mozilla into place, so that I start off with Noscript, Adblock, Tor, etc, all enabled, and my history is seeded with LWN and other sites I frequent.
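Roughly, the .Xclients fragment looks like this - the template path and tmpfs location are made up, it assumes ~/.mozilla was turned into a symlink once, and anything not copied back is gone at logout:
rm -rf /dev/shm/$USER-mozilla
cp -a $HOME/.mozilla-template /dev/shm/$USER-mozilla
ln -sfn /dev/shm/$USER-mozilla $HOME/.mozilla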
Posted Jun 9, 2012 15:51 UTC (Sat)
by Serge (guest, #84957)
[Link]
It might be a good idea to save some SSD writes, but does it really increase performance? My ~/.mozilla profile is about 2GB, so it was not a good idea to put it in RAM, but I tried that with a new empty profile and noticed no difference. What should I look at?
PS: it's not related to the /tmp dir, I assume, but it's still interesting to see some tmpfs benefits for a popular application.
Posted Jun 2, 2012 23:05 UTC (Sat)
by mirabilos (subscriber, #84359)
[Link]
Posted Jun 4, 2012 7:10 UTC (Mon)
by Serge (guest, #84957)
[Link] (5 responses)
Probably. But it won't trigger disk access. You can check that:
for i in `seq 5`; do echo 123 > f; rm -f f; grep sda1 /proc/diskstats; done
(replace "sda1" with the disk you write to)
If file creation/deletion (metadata change) triggers disk access you'll see all the lines different. But if the lines are the same, then there was no disk access.
Cache still works for journaled filesystems. The Linux kernel is written by smart people, yeah.
PS: I've seen reiserfs trigger a "read" in such a test. You can see a description of the diskstats numbers in:
http://www.kernel.org/doc/Documentation/iostats.txt
Posted Jun 4, 2012 7:27 UTC (Mon)
by neilbrown (subscriber, #359)
[Link] (4 responses)
This doesn't agree with my understanding of ext3 journalling, so maybe I expressed it poorly.
If you put a 5 second sleep in that loop, I expect you would see changes. I do - once I found a suitably quiet ext3 filesystem to test on.
The metadata blocks do go into the next transaction, but transactions can live in memory for up to 5 seconds before they are flushed.
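i.e. something like this, giving each iteration time to cross a journal commit (same disk name as in the test above):
for i in `seq 5`; do echo 123 > f; rm -f f; sleep 5; grep sda1 /proc/diskstats; done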
Posted Jun 4, 2012 10:17 UTC (Mon)
by Serge (guest, #84957)
[Link] (2 responses)
The exact number of seconds depends on /proc/sys/vm/dirty_*_centisecs value and /proc/sys/vm/laptop_mode...
Anyway, are you talking about file content or file name being written to disk in 5 seconds? Or both?
We can check whether the content of a deleted file is written to disk. Run:
for i in `seq 100`; do dd if=/dev/zero of=f bs=1M count=10; rm -f f; done
then check /proc/diskstats or `iostat -k`. If you see writes increased by 1GB, your filesystem writes data even for deleted files. My ext3 does not.
> I do - once I found a suitably quiet ext3 filesystem to test on.
Try /boot. :) Or just insert some USB flash stick and create ext3 there.
Posted Jun 4, 2012 11:28 UTC (Mon)
by neilbrown (subscriber, #359)
[Link] (1 responses)
It defaults to 5 seconds (JBD_DEFAULT_MAX_COMMIT_AGE) and can be changed by the "commit=nn" mount option. That many seconds after a journal transaction has been opened, it is closed and flushed - if it hadn't been closed already.
It is the metadata that is written to the journal - inodes, free-block bitmaps, directory names etc.
The file contents are handled differently for different settings of "data=":
ordered: data that relates to the metadata is flushed before the metadata is written to the journal
writeback: data is written according to /proc/sys/vm/dirty* rules
journal: data is written to the journal with the metadata.
I'm not sure what the default is today. If you create then delete a file, the data will not go to disk, except possibly for "data=journal". But the metadata will.
Posted Jun 4, 2012 15:17 UTC (Mon)
by Serge (guest, #84957)
[Link]
That's harder to test. Maybe compare the amount of writes generated by something like:
for i in `seq 10`; do touch $i; rm -f $i; done
with the amount of writes generated by:
for i in `seq 1000`; do touch $i; rm -f $i; done
Every creation/deletion is written to disk if the latter line generates about 100 times more writes. On my ext3 I see a sub-equal number of writes...
But, anyway, it looks like it's not a problem for /tmp then, meaning that ext2 would not be (noticeably) better than ext3 in /tmp use cases.
Posted Jun 4, 2012 14:13 UTC (Mon)
by hummassa (subscriber, #307)
[Link]
Posted Jun 1, 2012 13:21 UTC (Fri)
by Richard_J_Neill (subscriber, #23093)
[Link] (1 responses)
BTW, Mandriva/Mageia has done /tmp on tmpfs for ages (I think ~ 5 years), and it does work fine.
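For anyone who wants to try it on their own machine, it is a single fstab line - the size cap here is only an example:
tmpfs  /tmp  tmpfs  nodev,nosuid,size=2G  0  0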
Posted Jun 5, 2012 12:23 UTC (Tue)
by roblucid (guest, #48964)
[Link]
That prevents files getting left around, so rather than a new flag, filesystems could simply stop syncing the disk copy in this situation, reasoning that the file is ephemeral.
On tmpfs-based /tmp systems like Solaris (I used it with SunOS 4), humongous temporary files needed special arrangements and placement; disks just tended not to have much free space. Disks were not even 1GB, and overloading memory + swap space with temp files tended to be more reliable in practice, because processes could still run even when some luser had filled the disk.
Posted Jun 8, 2012 11:46 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (9 responses)
Gentoo shoves all its compiles into /tmp. And when compiling LO, you need a lot of temp space. So rather than having space dedicated to tmp for compiling, I have something like 10 or 20Gb of swap (plus 8Gb RAM), and simply have a huge tmpfs /tmp.
SuSE on the other hand ... Why oh WHY can't they give you sane defaults! Swap space defaults to twice ram (good) but without doing a "wipe and redo manually", you can't *increase* swap space! I always set swap space to at least twice the mobo's max ram.
The other thing I didn't realise is that tmpfs defaults to half of available RAM. So with 8Gb, the first few times I tried to compile OOo, I couldn't work out why it kept crashing !-)
So yeah, I'm all in favour of /tmp in tmpfs. But make sure you have *sane* defaults, and those defaults are *easy* to over-ride. SuSE, I'm glaring at you !!!
Cheers,
Wol
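The half-of-RAM default is just the default for the size= option; it can be overridden, and even changed on the fly with a remount (24G is only an example):
mount -t tmpfs -o size=24G tmpfs /tmp
mount -o remount,size=24G /tmp     # resize later without unmounting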
Posted Jun 8, 2012 15:20 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (1 responses)
> Swap space defaults to twice ram (good) but without doing a "wipe and redo manually", you can't *increase* swap space!
You can always increase swap space after the fact by means of swap files (rather than swap partitions).
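A minimal sketch, with an arbitrary size and path (note that a few filesystems - btrfs at the time - could not host swap files at all):
dd if=/dev/zero of=/var/swapfile bs=1M count=4096
chmod 600 /var/swapfile
mkswap /var/swapfile
swapon /var/swapfile
echo '/var/swapfile none swap sw 0 0' >> /etc/fstab   # make it permanent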
Posted Jun 8, 2012 19:35 UTC (Fri)
by dlang (guest, #313)
[Link]
Posted Jun 8, 2012 20:23 UTC (Fri)
by jackb (guest, #41909)
[Link]
> Gentoo shoves all its compiles into /tmp.
As long as I've been using it, compiling has always been done in /var/tmp, not /tmp. Mounting /var/tmp/portage on tmpfs is not the default behavior but has become extremely common. For large packages like Chromium or LibreOffice there are ways to override the default PORTAGE_TMPDIR to point to a non-tmpfs directory.
Posted Jun 9, 2012 18:06 UTC (Sat)
by Serge (guest, #84957)
[Link] (5 responses)
Why? Does it make things faster for you? It would be interesting to see some benchmarks. I've seen tests showing there's no difference, and seen one with extfs being faster than tmpfs+swap for compiling.
> and simply have a huge tmpfs /tmp.
Imho, it's much simpler to have it on disk. :)
> So yeah, I'm all in favour of /tmp in tmpfs.
/tmp is not the only place where you can mount tmpfs. If you want your /var/tmp/portage in tmpfs, you don't have to break other apps and put /tmp there.
Posted Jun 12, 2012 14:03 UTC (Tue)
by TRauMa (guest, #16483)
[Link] (4 responses)
Another thing: I thought the plan was to migrate to per-user-tmp anyway, somewhere in $HOME, for apps that use a lot of tmp like DVD rippers this would be a good idea anyway.
Posted Jun 16, 2012 4:30 UTC (Sat)
by Serge (guest, #84957)
[Link] (3 responses)
A per-user directory would not get cleaned on reboot. Using a per-user temporary directory may also be a bad thing for users with an NFS /home; they would prefer using a local tmp. Also, a common /tmp for all users is still needed for file exchange on multiuser servers. And finally, why would DVD software use something-in-$HOME, if it can use /tmp, which is there exactly for those things. ;)
Why put /tmp on tmpfs? Having /var/tmp/portage on tmpfs does not force you to put /tmp there. And it's really hard to find an application that becomes faster just because of /tmp on tmpfs. Even for portage it's not that obvious.
> Compiles on tmpfs are faster, factor is 1.8 to 2 in my tests
Hm... My simple test shows that tmpfs is just about 1-2% faster. Here's the script to resemble a basic package build: mount tmpfs or ext3 to /mnt/test, then
$ cd /mnt/test
$ wget http://curl.haxx.se/download/curl-7.26.0.tar.bz2
$ export CFLAGS='-O2 -g -pipe' CXXFLAGS='-O2 -g -pipe'
$ time sh -c 'tar xf curl-7.26.0.tar.bz2 && cd curl-7.26.0 && ./configure && make install DESTDIR=/mnt/test/root && cd ../root && tar czf ../curl-package.tar.gz * && cd .. && rm -rf curl-7.26.0 root'
tmpfs results:
real 70.983s user 48.685s sys 26.527s
real 70.635s user 48.390s sys 26.694s
real 70.701s user 48.203s sys 26.929s
real 70.867s user 48.636s sys 27.090s
real 70.744s user 48.297s sys 27.082s
ext3 results:
real 71.690s user 48.401s sys 27.498s
real 71.614s user 48.340s sys 27.869s
real 71.531s user 48.836s sys 27.520s
real 71.479s user 48.306s sys 27.469s
real 71.635s user 48.540s sys 27.496s
What have I missed?
Posted Jun 16, 2012 13:44 UTC (Sat)
by nix (subscriber, #2304)
[Link] (2 responses)
(One application that becomes a lot faster with /tmp on tmpfs is GCC without -pipe, or, even with -pipe, at the LTO link step. It writes really quite a lot of large extremely temporary intermediate output to files in /tmp in each stage of the processing pipeline, then reads it back again in the next stage.)
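This is easy to see for yourself, since GCC honours $TMPDIR for its intermediate files and -pipe avoids them; 'big.c' is just a stand-in for any large source file:
TMPDIR=/tmp strace -f -e trace=open,openat gcc -c big.c 2>&1 | grep /tmp/cc   # watch the temporaries go by
gcc -pipe -c big.c                                                            # compare: no /tmp traffic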
Posted Jun 25, 2012 9:40 UTC (Mon)
by Serge (guest, #84957)
[Link] (1 responses)
You don't need tmpfs then. This will work with /tmp anywhere (disk, RAM, separate partition, NFS, etc). I mean, this is neither a reason to use tmpfs nor a reason to avoid it.
> One application that becomes a lot faster with /tmp on tmpfs is GCC without -pipe, or, even with -pipe, at the LTO link step.
Faster linking? Let's check that with something having a lot of binaries. Mount tmpfs or ext3 to /mnt/test, then
$ cd /mnt/test
$ wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.17.tar.xz
$ export CFLAGS='-O2 -g -flto' TMPDIR=/mnt/test
$ time sh -c "tar xf coreutils-8.17.tar.xz; cd coreutils-8.17; ./configure; make install DESTDIR=/mnt/test/root; cd ../root; tar czf ../coreutils-package.tar.gz *; cd ..; rm -rf coreutils-8.17 root"
tmpfs results:
real 882.876s user 760.111s sys 110.353s
real 884.456s user 761.408s sys 110.603s
real 885.245s user 762.770s sys 110.525s
real 884.914s user 762.417s sys 110.395s
real 885.352s user 762.865s sys 110.360s
ext3 results:
real 895.244s user 762.620s sys 115.027s
real 893.134s user 762.447s sys 114.841s
real 898.353s user 763.645s sys 116.369s
real 898.010s user 763.472s sys 116.074s
real 897.525s user 763.671s sys 116.219s
If my test is correct, it's still the same 1-2%. It is faster, but not a lot.
Posted Jun 26, 2012 15:49 UTC (Tue)
by nix (subscriber, #2304)
[Link]
It's not just linking that a tmpfs /tmp speeds up a bit, in theory: it's compilation, because without -pipe GCC writes its intermediate .S file to TMPDIR (and -pipe is not the default: obviously it speeds up compilation by allowing extra parallelism as well as reducing potential disk I/O, so I don't quite understand *why* it's still not the default, but there you are.)
btw, coreutils is by nobody's standards 'something having a lot of binaries'. It has relatively few very small binaries, few object files, and an enormous configure script that takes about 95% of the configure/make time (some of which, it is true, runs the compiler and writes to TMPDIR, but most of which is more shell-dependent than anything). LTO time will also have minimal impact in this build.
But, you're right, I'm pontificating in the absence of data -- or data less than eight years old, anyway, as the last time I measured this was in 2004. That's so out of date as to be useless. Time to measure again. But let's use some more hefty test cases than coreutils, less dominated by weird marginal workloads like configure runs.
Let's try a full build of something with more object files, and investigate elapsed time, cpu+sys time, and (for non-tmpfs) disk I/O time as measured from /proc/diskstats (thus, possibly thrown off by cross-fs merging: this is unavoidable, alas). A famous old test, the kernel (hacked to not use -pipe, with hot cache), shows minimal speedup, since the kernel does a multipass link process and writes the intermediates to non-$TMPDIR anyway:
tmpfs TMPDIR, with -pipe (baseline): 813.75user 51.28system 2:13.32elapsed
tmpfs TMPDIR: 812.23user 50.62system 2:12.96elapsed
ext4 TMPDIR: 809.74user 51.90system 2:29.15elapsed 577%CPU; TMPDIR reads: 11, 88 sectors; writes: 6394, 1616928 sectors; 19840ms doing TMPDIR I/O.
So, a definite effect, but not a huge one. I note that the effect of -pipe is near-nil these days, likely because the extra parallelism you get from combining the compiler and assembler is just supplanting the extra parallelism you would otherwise get by running multiple copies of the compiler in parallel via make -j. (On a memory-constrained or disk-constrained system, where the useless /tmp writes may contend with useful disk reads, and where reads may be required as well, we would probably see a larger effect, but this system has 24Gb RAM and a caching RAID controller atop disks capable of 250Mb/s in streaming write, so it is effectively unconstrained, being quite capable of holding the whole source tree and all build products in RAM simultaneously. So this is intentionally a worst case for my thesis. Smaller systems will see a larger effect. Most systems these days are not I/O- or RAM-constrained when building a kernel, anyway.)
How about a real 900kg monster of a test, GCC? This one has everything, massive binaries, massive numbers of object files, big configure scripts writing to TMPDIR run in parallel with ongoing builds, immense link steps, you name it: if there is an effect this will show it. (4.6.x since that's what I have here right now: full x86_64/x86 multilibbed biarch nonprofiled -flto=jobserver -j 9 bootstrap including non-multilib libjava, minus testsuite run: hot cache forced by cp -a'ing the source tree before building; LTO is done in stage3 but in no prior stages so as to make the comparison with the next test a tiny bit more meaningful: stage2/3 comparison is suppressed for the same reason):
tmpfs TMPDIR: 13443.91user 455.17system 36:02.86elapsed 642%CPU
ext4 TMPDIR: 13322.24user 514.38system 36:01.62elapsed 640%CPU; TMPDIR reads: 59, 472 sectors; writes: 98661, 20058344 sectors; 83690ms doing TMPDIR I/O
So, no significant effect elapsed-time-wise, well into the random noise: though the system time is noticeably higher for the non-tmpfs case, it is hugely dominated by the actual compilation. However, if you were doing anything else with the system you would have noticed: paging was intense, as you'd expect with around 10Gb of useless writes being flushed to disk. Any single physical disk would have been saturated, and a machine with much less memory would have been waiting on it.
That's probably the most meaningful pair of results here, a practical worst case for the CPU overhead of non-tmpfs use. Note that the LTO link stage alone writes around six gigabytes to TMPDIR, with peak usage at any one time around 4Gb, and most of this cannot be -pipe'd (thus this is actually an example of something that on many machines cannot be tmpfsed effectively).
Posted Jun 1, 2012 4:42 UTC (Fri)
by thedevil (guest, #32913)
[Link] (2 responses)
It exists, or existed; search for "swapd". I remember it because it was in that context that I submitted my one and only kernel patch (which was rightfully ignored).
swapd was essentially useless, because there was just no way for userspace to notice an out-of-swap condition soon enough by polling. Maybe with a bit of kernel help, like a netlink socket, it would have been possible.
Posted Jun 2, 2012 20:37 UTC (Sat)
by branden (guest, #7029)
[Link]
Posted Jun 14, 2012 10:11 UTC (Thu)
by daenzer (subscriber, #7050)
[Link]