LWN: Comments on "Luu: Files are hard" https://lwn.net/Articles/667788/ This is a special feed containing comments posted to the individual LWN article titled "Luu: Files are hard". en-us Tue, 14 Oct 2025 09:23:51 +0000 Tue, 14 Oct 2025 09:23:51 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Luu: Files are hard https://lwn.net/Articles/669630/ https://lwn.net/Articles/669630/ ghane <div class="FormattedComment"> +1<br> <p> This bit me very hard in early 2012; to make it worse I was doing three sub-optimal things:<br> 1. Running the desktop in VMPlayer<br> 2. Using btrfs for the first time<br> 3. Updating daily with 12.04-proposed-update<br> <p> A daily update of 30 packages would take more than an hour, I assumed this was VMPlayer. Finally googled to find the solution, libeatmydata.<br> <p> In fact, I have got into the habit of doing a:<br> eatmydata apt-get --purge dist-upgrade<br> daily, on the 16.04-devel branch, even though I switched to xfs[1]. One of these days it will bite me badly.<br> <p> [1] Why xfs? Because I thought I would see something different, and learn something new. Alas, things have been so event-less, I daily learn nothing.<br> <p> </div> Wed, 30 Dec 2015 14:45:53 +0000 Luu: Files are hard https://lwn.net/Articles/669375/ https://lwn.net/Articles/669375/ throwaway <div class="FormattedComment"> Could a reasonable take-away from the article be "if you want some guarantee of consistency for writes, don't use plain files at all, go for sqlite.". Which I believe is what android apps mostly do, right?<br> <p> </div> Mon, 28 Dec 2015 16:12:13 +0000 Luu: Files are hard https://lwn.net/Articles/669357/ https://lwn.net/Articles/669357/ nix <div class="FormattedComment"> Yeah. 
dpkg is careful with its fsync()s, etc -- and programs like that generally assume that fsync() does something, and that by extension writeback caching is in use.<br> </div> Mon, 28 Dec 2015 12:04:32 +0000 Luu: Files are hard https://lwn.net/Articles/669283/ https://lwn.net/Articles/669283/ flussence <div class="FormattedComment"> I believe dpkg does, or did, the same thing as depmod as well; the first version of the Ubuntu installer to have a Btrfs option would take several hours of constant disk thrashing if it was used.<br> <p> Write patterns like those are unfortunately so common it's led to people pushing back in the opposite direction with things like libeatmydata, tmpfs-mounted browser profiles etc.<br> </div> Sun, 27 Dec 2015 17:46:21 +0000 Luu: Files are hard https://lwn.net/Articles/669263/ https://lwn.net/Articles/669263/ nix <div class="FormattedComment"> Sync-mounted filesystems are reliable as far as I can tell (I have an autotester that mounts its progress log fs -o sync and crashes, ah, quite a lot, and it seems to work as expected) but for / it is hopeless, and I suspect the same is true for $HOME. To give just one example, depmod appears to do its writes in small pieces, so depmod -a of a distribution kernel takes literally hours on a spinning rust disk, with the modules.dep file growing at a few hundred bytes a second. It is impossible to believe that depmod is the only such program: indeed, these days I would be surprised at a program that *didn't* assume that small write()s sans (f,data)sync() were effectively instantaneous due to writeback caching.<br> <p> </div> Sat, 26 Dec 2015 21:53:13 +0000 Luu: Files are hard https://lwn.net/Articles/669238/ https://lwn.net/Articles/669238/ anselm <p> Not that they were a great innovation in Windows NT. Remember that back then, Windows NT was basically a version of VMS for Intel CPUs. 
</p> Sat, 26 Dec 2015 12:11:49 +0000 Luu: Files are hard https://lwn.net/Articles/669235/ https://lwn.net/Articles/669235/ chojrak11 <div class="FormattedComment"> Welcome to 1994... Windows NT 3.5 had them already...<br> </div> Sat, 26 Dec 2015 10:32:58 +0000 No to Synchronous Writes! https://lwn.net/Articles/669210/ https://lwn.net/Articles/669210/ anton Maybe what you are looking for is what I call <a rel="nofollow" href="http://www.complang.tuwien.ac.at/papers/czezatke&amp;ertl00/#sect-in-order">in-order semantics</a>. If a file system provides that, and the application does not lose consistency when the process is killed, the data will also be consistent (but not necessarily up-to-date) if the OS crashes. As nix points out, in a distributed system, you occasionally also need up-to-date-ness; then you have to sync; but at least you don't have to sync for file consistency. And you can test your application by killing the process, which is quite a bit nicer than pushing the reset button or pulling the plug. <p>Concerning the block effects in flash, there are erase blocks (big, maybe 256k), and write blocks (smaller, but I don't find numbers at the moment). Ideal for a log-structured file system. So always syncing is not completely unrealistic, certainly not for low-bandwidth usage. Unfortunately, SSDs don't give us access to flash, but provide a HD-oriented interface, with the firmware optimizing maybe for FAT or NTFS access patterns. We have to see how much that hurts. Fri, 25 Dec 2015 17:36:19 +0000 Luu: Files are hard https://lwn.net/Articles/669209/ https://lwn.net/Articles/669209/ anton What you call the barrier/writeback set seems to be what I think of as a commit in a log-structured or COW file system. If Linux does that correctly in memory, that's good, but it also needs to do it correctly when writing out to disk in order to give decent crash consistency guarantees. 
If you have a journaling file system with full journaling or a log-structured or COW file system, that's not too hard. And yet, even journaling and COW file systems in Linux (except NILFS) don't give such guarantees last I looked. <p>Concerning bug reports, the kernel people's (especially Ted Ts'o's) stand in the O_PONIES discussion would certainly discourage me from making such reports. On the practical side, Linux crashes so rarely, and power outages are so rare 'round here, that there are very few real opportunities to actually see crash consistency in action, so I would have little to report even if I were willing to. Fri, 25 Dec 2015 17:04:44 +0000 Luu: Files are hard https://lwn.net/Articles/669206/ https://lwn.net/Articles/669206/ anton <blockquote> It seems that everyone assumes ACID is the only use case. That used to be the case, when we all had systems running on a server in a nearby room and when we pressed the Save button we expected it to get absolute confirmation the data was saved in the time it took to respond to the enter button on a dumb terminal. After all the human isn't going to remember what they just typed. But now the most likely scenario is the data is coming from another computer, probably a web page, and it is travelling over a link with a TCP handshake overlaid onto a 50ms latency. It's not at all unusual for the final submit button to take 5 seconds, and if it failed the user hits back and tries again. Or even better, it's coming from pages like Google documents - where javascript in the web page is trickling updates back to the server and is perfectly happy to wait forever and resend over and over again. </blockquote> I think it's exactly the other way 'round. <p>If the user is typing into the editor, and there is a power outage or OS crash, the user notices this, and will check that the file he just saved is really complete.
Sure, it would be cool if the editor could also tell the user (asynchronously) that the file now resides permanently on disk, but does the user want to wait for that by using something like fsync()? Probably not, if it takes a noticeable amount of time. <p>By contrast, in a distributed system, when you tell the customer that his flight has been reserved, you had better be sure that it's in permanent storage, because the user won't notice if the server crashes one second after the notification. Also, I think that a distributed system where the client has to make up for an unreliable server is very hard to program, and we will see lots of lossage if we follow that model (just as we see now with file system maintainers who provide unreliable file systems and expect user space to make up for it). <p>Concerning the disconnection example, you just simulated a disconnection, not a server that lost data. The client just had to reconnect, but client and server were still consistent with each other after the reconnection. When the server crashes and loses some data, it's different. The easiest-to-program model for that is fully synchronous operation from the client to the server's disk hardware and back; if that's too slow, asynchronous operation (separate write requests and completion information) would be the way to go. <p>I don't see that most Linux file systems currently have a workable model for consistency and performance. Last I looked, the only file system that gave a decent consistency guarantee was NILFS. With the mainstream file systems, if we want consistency without programming to a complex API, use a libc variant that sync()s after every write() or other file system change (or maybe use sync-mounted file systems, but given how untested that probably is, I would not trust that option to work, especially given the data=journal precedent in Linux).
Fri, 25 Dec 2015 16:40:10 +0000 Luu: Files are hard https://lwn.net/Articles/669073/ https://lwn.net/Articles/669073/ neilbrown <div class="FormattedComment"> Let me try to paint a picture in terms that make sense in the kernel and in an API to the kernel.<br> <p> To get Atomicity it is nearly axiomatic that you need a journal or log or something a lot like that. Let's assume that is a separate file written to with an append-only discipline. After a crash every block in the file will be either the data that was written there, or nuls (unless you use a deliberately-broken filesystem like ext3 with data=writeback) so it is easy to find the transactions and to be sure that after replaying the journal Atomicity is provided.<br> <p> To get Consistency you need to be sure that data is safe in the main database file(s) before removing it from the journal. To get Durability you need to be sure that the data is safe in the journal before telling the application that data is safe. So from the kernel-api perspective, these are much the same thing.<br> <p> (Isolation is an application level concern, not a kernel-level concern. The kernel provides advisory locks and other IPC mechanisms that can be used to provide whatever isolation is needed).<br> <p> So how can we know that "data is safe". We know that the kernel will write it "eventually", so we just need to know when "eventually" is. You (quite reasonably) didn't like the idea of the kernel telling us when *all* of a file was stable so we need the kernel to have some concept of at least two different sets of pages in the file: one set which triggers a notification when it becomes empty (possibly after a flush is sent to the device) and one set which is all the other dirty pages.<br> <p> Then the "barrier" command that people seem to want would move all dirty pages (in a given range maybe) into the first set, and would request notification when the set became empty. 
We can call this the "barrier set" and the other set the "dirty set". When you request a barrier for the whole file, the dirty set becomes empty and then gradually gains new pages. When a new barrier is created, everything in the dirty set moves to the barrier set.<br> <p> There are a couple of issues that need to be addressed at this point:<br> 1/ What if I write to a page of a file which is already in the barrier set? Does it stay in the barrier set, or move to the dirty set? I think a few moments' thought will confirm that it must stay in the barrier set. So the barrier is 1-way. Newly dirtied data might get written before the barrier notification arrives, but old dirty pages cannot "slip past" the barrier and only get written after the notification.<br> <p> 2/ What if memory pressure or some other force needs to write out dirty-set pages before all the barrier-set pages are written? Should those pages be written separately from the barrier-set, or should they just be moved into the barrier-set and then it be flushed? It is hard to reason about this in a completely abstract context, but I suspect that if any of a file needs to be flushed out, then the barrier-set really should be flushed out first.<br> <p> So the simple approach is that the kernel divides the pages in a file into three sets: "clean" (which don't need to be written out), "dirty" (which need to be written out eventually), and "barrier" (which need to be written out before the dirty pages, and when they are written, a notification is generated).<br> <p> This is very nearly what we already have: Clean, Dirty, and Writeback is what we call them. "Writeback" pages are pages that are queued for a "backing device". Where it makes sense, they tend to be queued at a fairly low priority, certainly lower than synchronous reads (as opposed to async read-ahead).<br> <p> We even have a mechanism for getting a notification when the Writeback set is empty.
Fork, call sync_file_range with the flag SYNC_FILE_RANGE_WAIT_BEFORE, then signal the parent (maybe by simply exiting).<br> <p> Moving pages from the dirty set to the barrier (aka writeback) set involves calling sync_file_range with the SYNC_FILE_RANGE_WRITE flag. Note that this doesn't wait for the write to complete; it just queues the IO.<br> <p> So I think we already have very nearly all the functionality you need to do what you want. There are some issues that may (or may not) be a problem.<br> <p> - It would be nice if the "barrier" operation didn't block. sync_file_range(SYNC_FILE_RANGE_WRITE) can block though. All the pages are queued for the backing-device, and if that has a queue size limit (which it must) then there could be delays until the queue drops below that limit. Even without the queue limit, there is typically a need to allocate small data structures (bios, requests) to store the pages in the queue. Their memory allocations could block. These are really implementation details though. If someone made a case for a non-blocking "barrier" operation and had a genuine application wanting to use it, I'm sure something could be arranged.<br> <p> - While background writes tend to have a low priority, it might not be as low as you would like. Once a page is in writeback it will progress on the queue and be written. There seems to be a suggestion above that you would like it to languish around a bit more to maximise the opportunity for write-combining. It is hard to know how important this really is, and the importance probably varies a lot between different storage technologies and different loads. There may be room for tuning queuing priorities if a problem was demonstrated.<br> <p> So I fall back on what I've been saying all along. Linux *already* *has* what you need to write files safely, reliably, efficiently. If it isn't quite perfect, or if it doesn't work as advertised, then it is because you haven't submitted bug reports.
There is undoubtedly room for improvement (an aio_sync_file_range would be nice) but improvements only happen when someone drives them.<br> <p> </div> Wed, 23 Dec 2015 22:10:38 +0000 No to Synchronous Writes! https://lwn.net/Articles/668993/ https://lwn.net/Articles/668993/ nybble41 <div class="FormattedComment"> <font class="QuotedText">&gt; They [editors] are used by humans, not other computers, and if the system crashes just after the user exits their editor, they're not likely to expect everything to be miraculously saved just as they meant to. It doesn't matter really, since a crash 10 seconds earlier, just before the user pressed 'save' would be well expected to lose data.</font><br> <p> It is expected that you may lose the most recent edits. However, if you opened up a large text file and made one minor edit before saving, you might hope that if the system should happen to crash right in the middle of saving you would end up with either the old version of the file or the new version, and not a zero-length file. The truncate/write/fsync sequence can trash all the data, not just the changes.<br> </div> Wed, 23 Dec 2015 03:45:31 +0000 No to Synchronous Writes! https://lwn.net/Articles/668826/ https://lwn.net/Articles/668826/ mathstuf <div class="FormattedComment"> These are more journal files than actual backups. Creating a new file:<br> <p> $ vim foo.txt<br> ^Z<br> $ cat .foo.txt.swp<br> 3210#"! U<br> </div> Tue, 22 Dec 2015 01:26:29 +0000 Luu: Files are hard https://lwn.net/Articles/668809/ https://lwn.net/Articles/668809/ ras <div class="FormattedComment"> <font class="QuotedText">&gt; How does the javascript know that it needs to resend, or that it never will need to again? At some point durability is needed.</font><br> <p> The best answer is to watch it in action. Just create a new Google Spreadsheet, enter stuff into a few cells, disconnect the computer from the network and continue typing.
Soon a little bar will appear telling you it is saving your changes to Google. It won't succeed, of course, but you will be able to continue entering new data. A minute or two later it will give up and put up a message saying "Trying to connect ..." and block you from typing (presumably so you won't lose too much work). Then plug the cable back in. The browser will notice, re-send the data that was dropped and so pick up from where it left off. Your guess is as good as mine on how it works under the hood.<br> <p> The real answer is there are as many ways of doing it as there are programmers, and I'm sure now you've seen an example of it in action you could think up your own. One simple way would be to not send the AJAX response until the kernel notifies you the data is on disk, and pair it with a separate "flush file" AJAX command that is sent when the user attempts to leave the page. Presumably the flush would cause all pending commits to happen, so any pages waiting for commits would return soon.<br> <p> Believe it or not, there are times when the degree of caring whether the data made it to disk or not is so low it's hard to measure. The classic example I have been involved in is a sort of "big data" application, where the user uploads gigabytes of data to the server where it is processed by a proprietary application. Usually they will do a series of uploads. The uploads then have to be processed looking for patterns, which generates many times as many gigabytes of indexes. This all happens on a purchased VM where every cycle and IO request is charged for, and so the goal is to minimise those charges. The one thing they don't want to do is lose those indexes, as they represent a lot of accumulated CPU cycles over many uploads. But a failed upload can be re-done, and if a VM dies during an indexing run no one cares provided the on disk data structure remains consistent - it can just be re-run.
The cost benefit tradeoff is waiting for a flush to complete after every transaction moving the database from one consistent state to another, versus re-doing the entire operation in the very rare event of failure.<br> </div> Tue, 22 Dec 2015 00:45:25 +0000 emacs writing new files https://lwn.net/Articles/668796/ https://lwn.net/Articles/668796/ robbe Yes, it's the default. For most cases. But there are situations where emacs will not create a backup file, and will use simple overwrite. <p> Emacs will not create backups of files in /tmp (this was your gotcha), or if a recent backup already exists. It goes without saying that this default behaviour can be changed. <p> Here is a more typical example: <pre> $ ls -lin lwn-test* 1836267 -rw-r--r-- 1 1000 1000 5 Dez 21 22:14 lwn-test $ strace -f -eopen,write,unlink,rename -p 1532 2&gt;&amp;1 | sed -n "/SIGIO/b;s|$HOME|\$HOME|g;p" [open "lwn-test", append two lines, then save] open("$HOME/lwn-test", O_RDONLY|O_CLOEXEC) = 18 open("$HOME", O_RDONLY|O_DIRECTORY|O_CLOEXEC) = 18 rename("$HOME/lwn-test", "$HOME/lwn-test~") = 0 open("$HOME/lwn-test", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 18 write(18, "test\nbla\nblo\n", 13) = 13 unlink("$HOME/.#lwn-test") = 0 $ ls -lin lwn-test* 1836273 -rw-r--r-- 1 1000 1000 13 Dez 21 22:15 lwn-test 1836267 -rw-r--r-- 1 1000 1000 5 Dez 21 22:14 lwn-test~ </pre> Mon, 21 Dec 2015 21:33:23 +0000 Luu: Files are hard https://lwn.net/Articles/668793/ https://lwn.net/Articles/668793/ neilbrown <div class="FormattedComment"> <font class="QuotedText">&gt; It was to get to the stage before that, which is to get some acknowledgement there is fruit here worthy of spending some energy to harvest.</font><br> <p> I think that is an excellent idea. I think that the first step would be to have a very clear very specific use-case. Something that you could implement and then say "see, only N updates per second" and then I could implement differently and I say "Look, I can get M updates per second". 
Then you can say "but if a crash happens *there* you lose consistency".<br> <p> Then we can create a new API that allows 10*M updates per second, and write a crash-test that randomly resets a KVM instance and never ever detects corruption after tens of thousands of crash cycles.<br> <p> Then we would have something to sell.<br> <p> I can't even begin to suggest a use case. All I ever care about is whole files. write;fsync;rename;fsync-directory. Done.<br> <p> <font class="QuotedText">&gt; where javascript in the web page is trickling updates back to the server and is perfectly happy to wait forever and resend over and over again. We absolutely need ACI from ACID, but the D is not as important as it once was.</font><br> <p> How does the javascript know that it needs to resend, or that it never will need to again? At some point durability is needed.<br> If you have multiple webpages all updating the same backend, then in the case of a server crash where each client replays, you need to be sure that any ordering issues are resolved in the same way (at least if they were externally visible).<br> <p> I accept that durability itself may not always be required, but I think it is by far the easiest way to achieve other things that are required. With the recent and expected advances in hardware, durability on demand is so cheap that there seems little point in coming up with a more complex solution.<br> <p> </div> Mon, 21 Dec 2015 21:25:47 +0000 No to Synchronous Writes! https://lwn.net/Articles/668776/ https://lwn.net/Articles/668776/ itvirta <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; Most editors for example, actually write out a new copy of a file, rather than face the complex integrity issues of updating in place. </font><br> <font class="QuotedText">&gt;The lack of fsync() in mg, nano, and joe seems like a pretty gross omission.
</font><br> <p> I'm not exactly sure if interactive editors in general should be expected to give very strict integrity guarantees.<br> They are used by humans, not other computers, and if the system crashes just after the user exits their editor,<br> they're not likely to expect everything to be miraculously saved just as they meant to. It doesn't matter really, since a crash<br> 10 seconds earlier, just before the user pressed 'save' would be well expected to lose data.<br> (And given all the problems and quirks with fsync, I'm even more inclined to excuse their creators for this.)<br> <p> <font class="QuotedText">&gt; Update-via-rename is aesthetically appealing, but has practical problems with metadata preservation </font><br> <font class="QuotedText">&gt; and less-than-graceful behavior with large files (slowness, potential for ENOSPC).</font><br> <p> Many editors leave a backup file anyway, with the associated space cost. Joe (jmacs) and nano seem to write it by just reading<br> and copying the original file, which I suppose would have a higher IO cost than renaming Emacs-style. <br> <p> The default Emacs on my Debian seems to happily rename a hard-linked file. Whether that's useful or a problem depends on<br> what one wants. Hard links can be used as a cheap-ass file level copy-on-write system. :)<br> <p> </div> Mon, 21 Dec 2015 17:32:49 +0000 No to Synchronous Writes! https://lwn.net/Articles/668699/ https://lwn.net/Articles/668699/ nijhof <div class="FormattedComment"> Emacs creates a new file (moving old =&gt; old~) only the first time around in any edit session!<br> Then saving after any further edits writes again to the same 'new' file, without further inode changes.<br> <p> Even if you throw away the backup file, with the editor still open, on re-saving emacs will keep overwriting without turning the last version into a new backup file. I learnt that the hard way :-).<br> <p> </div> Mon, 21 Dec 2015 11:44:29 +0000 No to Synchronous Writes!
https://lwn.net/Articles/668693/ https://lwn.net/Articles/668693/ epa <blockquote>Which is why user-space should have a mechanism for telling the kernel what is important.</blockquote>Yes, I agree. But one must also consider: when user space says nothing, and hasn't told the kernel anything about what is important, what should the kernel assume by default? I suggest that as far as possible the default should tend towards safer (if slower) behaviour. OK, making all writes synchronous is a step too far, but still it should be possible to write a simple program creating some files and moving them around, without having learnt all the details of write barriers and how to signal your intent to the kernel, and have some reasonable semantics even in the presence of system crashes. There are two indisputable facts here: systems do crash, and 90% of programmers will never learn the exact incantations and subtleties of asynchronous disk writes. Even the top 10% will struggle, since there isn't much test infrastructure which simulates crashes part way through disk operations, or static checkers that verify filesystem access is being done safely. Mon, 21 Dec 2015 09:56:22 +0000 Luu: Files are hard https://lwn.net/Articles/668687/ https://lwn.net/Articles/668687/ ras <div class="FormattedComment"> <font class="QuotedText">&gt; This might be a nice idea, but it is completely impractical on Linux (without a massive rewrite).</font><br> <p> To be honest I knew that when I wrote it.<br> <p> The point of the post wasn't to solve the problem, as I don't know enough about the kernel to do it. It was to get to the stage before that, which is to get some acknowledgement that there is fruit here worthy of spending some energy to harvest. Reading through the comments it looked to me the discussion hadn't made it that far.<br> <p> It seems that everyone assumes ACID is the only use case.
That used to be the case, when we all had systems running on a server in a nearby room and when we pressed the Save button we expected to get absolute confirmation the data was saved in the time it took to respond to the enter button on a dumb terminal. After all, the human isn't going to remember what they just typed. But now the most likely scenario is the data is coming from another computer, probably a web page, and it is travelling over a link with a TCP handshake overlaid onto a 50ms latency. It's not at all unusual for the final submit button to take 5 seconds, and if it failed the user hits back and tries again. Or even better, it's coming from pages like Google documents - where javascript in the web page is trickling updates back to the server and is perfectly happy to wait forever and resend over and over again. We absolutely need ACI from ACID, but the D is not as important as it once was.<br> <p> Forcing the application writer to destroy any chance the kernel has of optimising batched writes, because the only way he can keep his on disk data structures consistent is flushing everything, all the time, makes no sense in this world. Yet as far as I can tell, there is no way around it.<br> <p> <font class="QuotedText">&gt; You could get a notification that a file has no more dirty pages in the page cache without too much trouble. You might even be able to get a notification that there are no dirty pages in a given range.</font><br> <p> Flushing the page cache doesn't quite cut it. The fsync() man page says "This includes writing through or flushing a disk cache if present" because that's what "I am now certain my on-disk data structures are consistent" requires. I don't think "a file has no more dirty pages" would work either, as it's not difficult to dream up scenarios where that condition would be very rare to non-existent, which would effectively mean the final write that moves the on-disk data structure to the new version can never be done.
A blockv system call that returned when a list of ranges was safely on disk would be the minimum, I think. And no, I'm not seriously proposing that - as each blockv call would need its own thread.<br> </div> Mon, 21 Dec 2015 08:16:53 +0000 Luu: Files are hard https://lwn.net/Articles/668685/ https://lwn.net/Articles/668685/ neilbrown <div class="FormattedComment"> <font class="QuotedText">&gt; It really has to be "all writes done on this file descriptor", as opposed to "all writes done on this open file description" </font><br> <p> This might be a nice idea, but it is completely impractical on Linux (without a massive rewrite).<br> <p> The distinction between file descriptors disappears almost the instant you enter a system call.<br> The distinction between open file descriptions doesn't last much longer for a write request.<br> <p> You could get a notification that a file has no more dirty pages in the page cache without too much trouble. You might even be able to get a notification that there are no dirty pages in a given range.<br> <p> For directory entries you could similarly arrange a notification that all updates to a directory are safe, and possibly that the name used for a given file descriptor was safe.<br> <p> To get stability guarantees in the face of ongoing updates you would probably need to block new updates until old updates are flushed. If that caused a problem (I suspect it would) then having two log files that you alternate between might be a solution.<br> <p> The only way you could hope to track "all writes done to this file descriptor" would be to use O_DIRECT.<br> <p> </div> Mon, 21 Dec 2015 06:27:27 +0000 Luu: Files are hard https://lwn.net/Articles/668673/ https://lwn.net/Articles/668673/ ras <div class="FormattedComment"> Now that I've thought about it, an "all writes done on this file descriptor have hit the disk" flag for epoll would be sufficient.
It really has to be "all writes done on this file descriptor", as opposed to "all writes done on this open file description", so the application can use a file descriptor to track a bunch of related writes that move the on disk data structure from one consistent state to another.<br> <p> This doesn't address Luu's prime complaint: keeping your on disk data structures consistent remains hard. This just makes it possible to do it without wrecking performance by forcing flushes. It also makes it possible to write a userspace library that does make it easy.<br> <p> IMHO that would be a huge step forward. I was just reading a paper on LevelDB vs LMDB, which boils down to Log Structured Merge (LSM) versus Multiversion Concurrency Control (MCC). <a href="http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A751062&amp;dswid=5252">http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A7...</a> Oddly, at the 1000ft view LSM and MCC are very similar. Both are COW, writing updated data to a separate place and deferring the expensive "this is the new version" operation till later. The primary difference is LSM defers it for a looong time by writing it to a separate file (often several) then doing a merge step in the background. In the most extreme case it can avoid flushes entirely (because if you wait long enough, you can safely assume the data is on disk even though you don't have an API telling you it's there). MCC on the other hand effectively does the merge on every commit, but to do that it must use the only API the kernel provides that tells you the data has hit disk - a flush.<br> <p> Not surprisingly, in this paper LSM blitzed MCC in write speed by a factor of 4 (ie 400%). But that comes at the expense of having to merge the logs on the fly on every read. Because MCC wears the flush, the number of unmerged changes remains tiny.
And again not surprisingly, MCC blitzed LSM by a factor of 4 for reads.<br> <p> If MCC could avoid the flushes there is no reason it could not be as fast as LSM. But as of right now it can't be, because the kernel doesn't supply the APIs to make it possible. As a consequence, Linux applications are at an unnecessary fourfold speed disadvantage in some scenarios. How we have tolerated that for over a decade now is a bit of a mystery to me.<br> </div> Mon, 21 Dec 2015 02:20:48 +0000 No to Synchronous Writes! https://lwn.net/Articles/668650/ https://lwn.net/Articles/668650/ jem On my machine with Emacs 24.5.1: <pre> [jem@red ~]$ ls -li foo.txt* 10488614 -rw------- 1 jem jem 1310 20 dec 14.17 foo.txt [jem@red ~]$ emacs -nw -Q foo.txt [jem@red ~]$ ls -li foo.txt* 10525938 -rw------- 1 jem jem 1313 20 dec 14.25 foo.txt 10488614 -rw------- 1 jem jem 1310 20 dec 14.17 foo.txt~ [jem@red ~]$ </pre> My interpretation of this listing is that, after editing, the file foo.txt is a new file with inode number 10525938, whereas the original file before editing has been renamed foo.txt~. The rename takes place before the call to open(..., O_CREAT)? Sun, 20 Dec 2015 12:31:02 +0000 No to Synchronous Writes! https://lwn.net/Articles/668649/ https://lwn.net/Articles/668649/ andresfreund <div class="FormattedComment"> Opening with O_CREAT ("If the file does not exist, it will be created.") on an existing file does nothing. It doesn't create a new inode, and it doesn't create new attributes.<br> </div> Sun, 20 Dec 2015 12:03:00 +0000 No to Synchronous Writes! https://lwn.net/Articles/668647/ https://lwn.net/Articles/668647/ jem Now I am not sure if we are talking about the same thing. In my terminology, opening a file with O_CREAT <strong>is</strong> creating a new file, thus to "write out a new copy of a file, rather than face the complex integrity issues of updating in place."<p/> The new file does have the old name, but it is a new file object, with a new inode number and new attributes. 
Sun, 20 Dec 2015 11:55:02 +0000 No to Synchronous Writes! https://lwn.net/Articles/668645/ https://lwn.net/Articles/668645/ andresfreund <div class="FormattedComment"> Is it really what it does by default?<br> <p> 1$ emacs --version<br> GNU Emacs 24.5.1<br> 1$ emacs -nw -Q /tmp/test.txt<br> &lt;edit&gt;<br> &lt;save&gt;<br> 2$ strace -f -eopen,write,unlink,rename -p pid-of-above<br> 1$<br> &lt;edit&gt;<br> &lt;save&gt;<br> 2$ <br> [pid 17835] open("/tmp/test.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 7<br> [pid 17835] write(7, "line1\nline2\nline3\nline4\nline5", 29) = 29<br> [pid 17835] write(7, "\n", 1) = 1<br> [pid 17835] close(7) = 0<br> <p> <p> </div> Sun, 20 Dec 2015 11:21:22 +0000 No to Synchronous Writes! https://lwn.net/Articles/668643/ https://lwn.net/Articles/668643/ jem Yes, of course it is configurable. We are, after all, talking about GNU Emacs here. But that's beside the point -- the point was that zev above called the behaviour an urban legend, when it is what Emacs does by default. Sun, 20 Dec 2015 11:10:46 +0000 No to Synchronous Writes! https://lwn.net/Articles/668633/ https://lwn.net/Articles/668633/ butlerm <div class="FormattedComment"> There probably should be a way to request that, but doing it without a synchronous wait requires global recovery synchronization across all mounted filesystems to bring them back to a mutually consistent state. That is a difficult thing to implement. There are also substantial issues using a write barrier that covers all operations on a single filesystem. The only way to implement that on existing filesystems is to force everything to disk.<br> <p> I think it is safe to say that if you request a write barrier using a portable interface anytime in the near future you are going to get a synchronous operation of some sort, and you don't really want that to be more expensive than a series of pertinent fsyncs. 
And if it is not a portable interface, no one will use it, etc...<br> </div> Sun, 20 Dec 2015 04:23:47 +0000 No to Synchronous Writes! https://lwn.net/Articles/668624/ https://lwn.net/Articles/668624/ nix <div class="FormattedComment"> This is a configurable behaviour. You can even turn it off, or on, if the file has a link count &gt;1, to avoid (or enforce) snapping that link.<br> </div> Sun, 20 Dec 2015 00:34:41 +0000 No to Synchronous Writes! https://lwn.net/Articles/668623/ https://lwn.net/Articles/668623/ nix <div class="FormattedComment"> Unfortunately it's more complex than that. I can easily imagine people being very confused if a program does a write, then tells some other program via IPC or the network (i.e. *not a filesystem*, just an fd) that it's done it... and then there's a crash and it's gone because it was never committed to disk.<br> <p> i.e. filesystem ops don't only have to be consistent with other filesystem ops, but with everything else too.<br> </div> Sun, 20 Dec 2015 00:33:47 +0000 No to Synchronous Writes! https://lwn.net/Articles/668622/ https://lwn.net/Articles/668622/ andyc <div class="FormattedComment"> Vim actually uses a .&lt;filename&gt;.swp file<br> <p> </div> Sun, 20 Dec 2015 00:10:30 +0000 No to Synchronous Writes! https://lwn.net/Articles/668609/ https://lwn.net/Articles/668609/ butlerm <div class="FormattedComment"> <font class="QuotedText">&gt; You can provide synchronous semantics without having to do synchronous writes, if you make sure that the interactions with outside are serialized correctly.</font><br> <p> That works, but I think that guarantee is stronger than is necessary. 
<br> <p> I believe the basic requirement for a write barrier is this: if the system crashes, then upon recovery, writes and other operations made by any process after the barrier have no effect unless the effects of all writes and other operations made before the barrier are preserved - with respect to some pertinent subset of the files in a single filesystem, a subset as small as a single file.<br> <p> <p> <p> <p> <p> <p> <p> <p> </div> Sat, 19 Dec 2015 19:15:36 +0000 No to Synchronous Writes! https://lwn.net/Articles/668604/ https://lwn.net/Articles/668604/ jem GNU Emacs writes a new file when saving the buffer. You can verify this by checking the inode number: you get a new number each time, and the original file is renamed with a tilde appended to the name. Sat, 19 Dec 2015 16:43:04 +0000 No to Synchronous Writes! https://lwn.net/Articles/668587/ https://lwn.net/Articles/668587/ zev &gt; Most editors for example, actually write out a new copy of a file, rather than face the complex integrity issues of updating in place. <br/><br/> This is a common claim, but as far as I can tell it's an urban legend (though an understandable one). <br/><br/> I just installed a few editors pulled from Debian's package repos and ran some brief tests with strace; here's what I found: <br/><br/> 1) GNU emacs: <pre>open(O_CREAT|O_TRUNC); write(); fsync(); close();</pre><br/> 2) vim: <pre>open(O_CREAT|O_TRUNC); write(); fsync(); close();</pre><br/> 3) mg: <pre>open(O_CREAT|O_TRUNC); write(); close(); /* no fsync()! */</pre><br/> 4) GNU nano: <pre>open(O_CREAT|O_TRUNC); write(); close(); /* no fsync()! */</pre><br/> 5) joe: <pre>open(O_CREAT|O_TRUNC); write(); close(); /* no fsync()! */</pre><br/> <br/> The lack of fsync() in mg, nano, and joe seems like a pretty gross omission. But all of them go for in-place truncate-and-overwrite, no new files or renaming in sight. 
Update-via-rename is aesthetically appealing, but has practical problems with metadata preservation (see jameslivingston's comment for one such example: https://lwn.net/Articles/667808/) and less-than-graceful behavior with large files (slowness, potential for ENOSPC). Pick your poison, I suppose. <br/><br/> But there's also one other editor I looked at that was something of an outlier (still no renaming though): alpine-pico. Firstly, it opens the file *without* O_TRUNC. Then, when extending a file (i.e. saving when the to-be-written data is longer than the existing file), it first lseeks to EOF and writes garbage data to extend the file to the new size, fsyncs the garbage, then lseeks back to zero and writes out the new real data, ftruncates to the expected length, and closes it (without a second fsync on the real data!). When not extending it's similar, but minus the garbage-data write and fsync. Not sure exactly what its aim is with that little dance that a basic write-and-fsync wouldn't achieve. (Shrug.) Sat, 19 Dec 2015 09:14:37 +0000 Luu: Files are hard https://lwn.net/Articles/668583/ https://lwn.net/Articles/668583/ butlerm <div class="FormattedComment"> One correction: It is possible for a filesystem to implement a write barrier without a synchronous wait. 
A synchronous wait is only needed if you want the write barrier to commit before proceeding, which often isn't necessary.<br> <p> <p> <p> </div> Sat, 19 Dec 2015 07:12:43 +0000 Luu: Files are hard https://lwn.net/Articles/668579/ https://lwn.net/Articles/668579/ butlerm <div class="FormattedComment"> <font class="QuotedText">&gt; "When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency."</font><br> <p> I believe the main issue there is not how to write code to preserve consistency, it is that some filesystems silently preserve some form of consistency as an artifact of the way they were designed, without the program having to request it.<br> <p> The second issue is merely making code written to the standard perform well, but that is mostly up to the filesystem.<br> </div> Sat, 19 Dec 2015 04:29:49 +0000 Luu: Files are hard https://lwn.net/Articles/668534/ https://lwn.net/Articles/668534/ butlerm <div class="FormattedComment"> <font class="QuotedText">&gt; Your "create sfd" function I would rename as "start transaction".</font><br> <p> I wouldn't call this a "transaction" at all, because that implies ACID semantics, which would be much more difficult to implement in a kernel. This is just write barriers with the ability to get completion notification.<br> <p> The suggested API is more extensive than one might expect in order to allow a single process to handle multiple "write threads" without itself being multi-threaded, or having as many threads as there are application-level transactions in process. Each sfd is its own write thread. If you don't want to use sfds explicitly, every thread would have an implicit one. Not necessarily an independent one, but an implicit one.<br> <p> That means that if all you want is a write barrier you would make one call (or maybe two) and that would be it. 
Nothing else required.<br> <p> The real work would be implementing a write barrier internally (i.e. in the kernel) as more than a series of synchronous fsync or sync_file_range operations. There already is the ability to initiate write out (at least for portions of files) without blocking, so basically what you want to do internally is initiate write out for everything on the write thread that hasn't already completed.<br> <p> Then when the next write operation on the same "write thread" comes along, if there is an outstanding write barrier, you want to wait on write out for all those operations to complete before proceeding. As long as POSIX semantics are in force, you must wait for write out to complete before proceeding with another write operation on the same write thread, or the write barrier is meaningless.<br> <p> Write threads are important here because there are writes you do not want to block or wait for commit of in any circumstances. That is why you need an API so that different writes can go on different "write threads" or sfds. The implicit sfd makes things simple for standard write(2) calls, and applications and libraries that just do the normal thing, but an explicit write as part of this "write thread" interface is necessary in the general case.<br> <p> Implementation wise there is an issue with how many uncommitted operations you can track so you can commit them when necessary. I believe that is the biggest issue, especially with implicit write threads - you could easily have thousands of operations on a write thread that haven't been committed to durable storage yet. At some point you have to start combining them or retiring them early or you may have to block synchronously in unexpected places, just to free up space in your uncommitted operation buffer.<br> <p> One thing that would help there is a filesystem level transaction group number for metadata and certain other operations. 
Then an internal operation can return the TXG number that the operation is expected to be committed in, and everything in the buffer for the same filesystem with a lower TXG number than the current committed one for that FS can be efficiently discarded - no need to make an inquiry or be notified about the completion of each one. You could just group them together - at most one entry per write thread per filesystem TXG. That is easily a thousandfold reduction in some cases. Then a mere handful of entries - something much more practical to keep in kernel - would suffice most of the time. You would only need more for writes of the sort the FS doesn't plan to commit in the next TXG.<br> <p> <p> </div> Fri, 18 Dec 2015 19:32:16 +0000 Luu: Files are hard https://lwn.net/Articles/668533/ https://lwn.net/Articles/668533/ meyert <div class="FormattedComment"> For Java there exists XADisk - <a href="https://xadisk.java.net/">https://xadisk.java.net/</a> - which plugs in as a JTA resource manager. After reading this article, though, I wonder how safe it really can be. But I think it's the right way to get out of this mess!<br> </div> Fri, 18 Dec 2015 17:02:43 +0000 No to Synchronous Writes! https://lwn.net/Articles/668464/ https://lwn.net/Articles/668464/ epa I thought about this some more and realized that what matters is the interaction between the userspace program and the outside world. You can provide synchronous <i>semantics</i> without having to do synchronous writes, if you make sure that the interactions with the outside are serialized correctly. <p> So if a program does three write() calls and then runs along by itself for a few seconds, making no system calls in that time, it is not necessary to flush the writes to disk. Only when it makes some further system call, interacting with the outside world, do you need to make sure the writes are completed and committed before the next operation. 
(And if the next operation is only a read(), you might loosely decide that this does not communicate information from the program to the outside, and so can return immediately before the writes are committed to disk.) <p> Would this give good enough performance? I'm not sure. Fri, 18 Dec 2015 10:15:50 +0000 Luu: Files are hard https://lwn.net/Articles/668454/ https://lwn.net/Articles/668454/ mjthayer <div class="FormattedComment"> Actually I forgot almost the most important thing there - the more cases you choose to support, and the more complex they are, the higher the chance that you will get things wrong and introduce bugs into your code, which could well cause more loss of data than crashes do. It pays to concentrate your energy on simple and reasonable cases which you have enough time to test thoroughly.<br> </div> Fri, 18 Dec 2015 07:57:28 +0000