Luu: Files are hard
Posted Dec 14, 2015 21:15 UTC (Mon) by jgg (guest, #55211) [Link]
This article basically gets it right: there is no standardized way to write a file and its contents and define a crash-consistency barrier. The business of CRCs, fsyncing . and .. will work in a lot of places, but it certainly isn't guaranteed by any standard (which says basically nothing).
I really wish the Linux FS community would sit down and define some kind of useful crash-consistency model and uAPI enhancement to put this issue to rest for good. Maybe even try to standardize it.
Hopefully this research will motivate that. The idea that essentially nothing out there gets this right is mildly terrifying and, to me, says our uAPI is somewhere around a -5 or -6 on the Rusty API scale.
Luu: Files are hard
Posted Dec 14, 2015 23:03 UTC (Mon) by Wol (guest, #4433) [Link]
Well, Linus is on record as saying he ignores Posix when Posix is stupid. And (reported on LWN iirc) the filesystem developers did reach out to people like the PostgreSQL guys, asking what they needed in a file system. I know - I mucked in on that - but really, I think the thing most people really need is some guaranteed write barrier. If I *KNOW* that every write request I make before the barrier will hit spinning rust before any request made after the barrier hits, then I can reason about what's going to happen, and make sensible judgements.
I don't even need fsync, which is a killer anyway, because that has all sorts of nasty side effects that can bring a system to its knees - call fsync when a streaming write is filling buffers faster than the system can flush them, and a "compliant" implementation will cause the system to effectively hang ...
Otherwise, we're going to repeatedly have re-runs of the ext2/3 fiasco where, iirc, code written assuming ext2 would bring ext3 to its knees AND VICE VERSA. Nasty!!!
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 2:45 UTC (Tue) by dgc (subscriber, #6611) [Link]
That's the way kernel filesystems work. IO ordering is enforced by the layer with the ordering dependencies, and it's done by waiting for all dependent IO to complete, then flushing physical caches and/or issuing FUA writes as the "barrier" to ensure ordering at the physical storage layer is correct.
Dependencies are different for fsync, journal operations, metadata writeback, etc and they are all occurring concurrently in a filesystem. It's far more efficient to have each different subsystem control their IO dependencies directly via the above mechanism than it is to place global ordering barriers in the IO stream as all that such barriers do is create pipeline bubbles that reduce the utilisation of the IO devices. Hence by using a more efficient method of controlling order dependencies the penalty for using "barriers" vs "nobarriers" is now only a couple of percent, simply because we no longer have pipeline bubbles that incorrectly block non-dependent IOs...
You can do exactly the same thing from userspace using AIO+DIO and fsync. AIO+DIO to submit and wait for completion of dependent IO, then fdatasync the file(s) to commit metadata changes and enforce physical IO ordering in the storage. Then you can unplug the barrier and issue all your new IO and the problem is solved. You don't need to block/order any non-dependent IO on the barrier, so other parts of the application can continue oblivious (i.e. at full speed) to the parts of the app that have ordering dependencies.
Yes, I know, it doesn't help apps that use mmap/buffered IO; only deranged monkeys use DIO and crack-addled undead armadillos use AIO, etc. But the reality is that if you are going to step outside POSIX, we already have the tools for apps to implement strict, optimised ordering of their IO....
> I don't even need fsync, which is a killer anyway, because that has all
> sorts of nasty side effects that can bring a system to its knees - call fsync
> when a streaming write is filling buffers faster than the system can flush
> them, and a "compliant" implementation will cause the system to
> effectively hang ...
I'm pretty sure those old fsync/sync livelock problems were pretty much fixed back in 2010 by commit 6e6938b ("writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage").
-Dave.
Luu: Files are hard
Posted Dec 15, 2015 12:04 UTC (Tue) by Wol (guest, #4433) [Link]
Because the problem, simply put, was that if process A was writing faster than the disk could handle, and process B called sync, then process B would just hang until A finished, because the buffers never emptied.
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 14:58 UTC (Tue) by nix (subscriber, #2304) [Link]
> Yes, I know, it doesn't help apps that use mmap/buffered IO; only deranged monkeys use DIO and crack-addled undead armadillos use AIO, etc. But the reality is that if you are going to step outside POSIX, we already have the tools for apps to implement strict, optimised ordering of their IO....

The only problem is that the number of practical barriers to doing this is almost insurmountable for most purposes. In practice DIO is nearly useless because almost no real apps other than a certain database can do anything useful within its alignment constraints, and it doesn't work over most network transports, so if your app uses it, it's now totally broken on some filesystems (which users will be *much* less satisfied with than data loss on crash). It's also totally Linux-specific...
As for AIO, well, the kernel has an AIO layer, but the aio syscalls in glibc still emulate it badly using threads, and what are the alternatives? SIGIO? Give me a break. There's libaio, but anyone using that is not portable to non-Linux platforms, which means that for most applications using it is a non-starter. You can write files reliably and fast, if you totally rewrite your file I/O to be asynchronous and rely on a library that doesn't work off Linux...
Luu: Files are hard
Posted Dec 15, 2015 15:24 UTC (Tue) by andresfreund (subscriber, #69562) [Link]
FWIW, I don't think it's that specific to big O. MySQL uses O_DIRECT. At least three people, me included, have written patches to use O_DIRECT for PostgreSQL.
Sure, that's still mostly databases ;). But I think it's rather hard to efficiently make use of O_DIRECT without a whole lot of infrastructure. Which mostly only more complex applications will bring with them... You suddenly need to care a lot more about IO scheduling, ordering, et al, unless you want to seriously degrade performance.
> It's also totally Linux-specific...
While O_DIRECT isn't really Linux-specific, given the differences between OSs (especially related to performance) the statement basically holds. Which e.g. is the primary reason postgres doesn't rely on O_DIRECT for the main IO paths atm.
Luu: Files are hard
Posted Dec 15, 2015 18:50 UTC (Tue) by Wol (guest, #4433) [Link]
And as a user, why should I care about it? What's it got to do with me as a user-space dev? ... let's change tack. gcc does a load of optimisation. But when gcc buggers up the kernel because it re-orders instructions that shouldn't be re-ordered, when it optimises away writes because it spots that you're writing to a memory location and not reading it, buggering up your i/o, when it optimises away reads because it spots you're reading the same location repeatedly ... the kernel devs file a bug report against gcc. They don't want to get involved in the internals of gcc, they just want the code to work the way they wrote it, and not have bugs inserted by optimisation.
So why, as a user-space dev, should I have to worry about kernel-space optimisation? Surely, if it messes up my program, it's a kernel bug? You talk about journals, metadata, fsync etc - that's all internal to the filesystem. Why should a user-space developer be FORCED to care about it? Why should a user-space program be FORCED to have file-system specific code if it cares about data integrity? Surely that's the job of the filesystem and disk driver? Surely that's a pretty serious layering violation?
I do take your point about efficient use of i/o devices, but I'd contend that's actually a perfect example of "premature optimisation". By all means, optimise the hell out of your internals, but as soon as you commit a layering violation you are pushing a load of your costs into the other layer. As a user-space dev, that costs me a lot more than it saves you! As a user-space dev, I don't give a monkeys about journals and metadata and all that stuff, all I care about is that my DATA makes it safely to disk.
And if you give me the OPTION of putting a barrier in at the data level, it's then MY decision as to whether the cost of "bubbling" the i/o is worth it or not. I contend - very seriously - that most database apps would almost certainly be quite happy to take the hit. I have file create, file open, write, close. If at any point I can put a synchronicity barrier in there I don't care how you optimise it, but I then have a guarantee that one operation completes before the next begins. And in user space, that will make my life SO much simpler.
I may be wrong, but all of these debates come across to me that the filesystem devs are more concerned about the metadata and filesystem integrity, than they are about the user data. Excuse me, but surely protecting the user's DATA must come before everything else? What's the point of having your filesystem recover perfectly, if the user then has to run a full-blown integrity check to make sure the DATA hasn't been corrupted? You've saved the cost of a fsck, at the expense of a full disk scan! That imho is a pretty poor trade ...
(One of the reasons I'm so anti-Relational is precisely because of those layering violations! As a database developer, MultiValue does lots of stuff in the database layer that relational off-loads to the application layer. That's why relational is so inefficient - all this stuff is in the wrong layer!)
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 22:15 UTC (Tue) by BenHutchings (subscriber, #37955) [Link]
> So why, as a user-space dev, should I have to worry about kernel-space optimisation?
O_SYNC is there if you want it and can accept the low performance. If you want high performance, well, that doesn't come for free.
> Surely, if it messes up my program, it's a kernel bug?
A system crash is the kernel bug. Failure to infer transactions from an application's writes is not.
Luu: Files are hard
Posted Dec 15, 2015 22:52 UTC (Tue) by fandingo (subscriber, #67019) [Link]
Synchronous operations and operations that have barriers/transactions ensuring sequential operation are totally different concepts. Just because I want my operations well-ordered doesn't necessarily mean that I need data hitting the disk synchronously. I may be perfectly fine with writes sitting in a volatile buffer for a short time, so long as when it's flushed, everything happens in the correct order.
O_SYNC is like wearing a full hazmat suit to stay dry instead of using an umbrella.
Luu: Files are hard
Posted Dec 16, 2015 13:17 UTC (Wed) by Wol (guest, #4433) [Link]
Yup. But as the other guy said, you're offering me either fast and risky, or safe but slow. And if I value my user's data, I have no choice but to accept safe. And no, I don't want to wear a hazmat suit just to stay dry :-)
If I can have a half-way house that gives me safe *when I need it*, that's better all round.
Cheers,
Wol
Luu: Files are hard
Posted Dec 16, 2015 6:18 UTC (Wed) by dgc (subscriber, #6611) [Link]
I expected someone asking for an IO barrier to think about the problems involved with IO barriers, especially when an example has been given of how the barrier semantics being requested are known to be deficient.
i.e. I gave an example of how "complete-before-submit" dependency ordering ends up being more efficient and performant than a "put a barrier in the stream here" operation. Neil, independently, has described the same issues as I did but with MD as the example rather than a filesystem. I'll quote your response and put it in the context of what I said:
> You mean, the user space app calls a "wait until this io has completed"
> function? I'm okay with that. Not as simple from the user-dev point of view,
> but not that hard.
Right, the "wait until this io has completed" function is exactly what I was refering to when I said "you can do today in userspace with AIO+DIO". i.e. Using libaio, submit a bunch of IO on an AIO context, then call io_getevents() configured to block until all the AIO events submitted on that context are delivered.
That's the core of a context specific, userspace IO barrier implementation, and if you combine it with fdatasync() after the completion then you have a multi-IO integrity barrier. All you need to do then is queue incoming IO while the completion processing part of the barrier is being executed, and then submit them once the integrity operation completes.
So we already have a mechanism for inserting integrity operations into a write stream without needing any special kernel help at all. That should be the start point for userspace library development, and once there is a solid library we can then look to optimising/improving the implementation of the library with kernel tweaks....
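In rough code, the sequence is something like this (a sketch only - the file name, sizes and IO count are invented; link with -laio):

#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_IOS 4
#define BLK    4096

int main(void)
{
	io_context_t ctx = 0;
	struct iocb iocbs[NR_IOS], *iocbps[NR_IOS];
	struct io_event events[NR_IOS];
	int fd, i;

	/* O_DIRECT bypasses the page cache entirely */
	fd = open("journal.dat", O_WRONLY | O_CREAT | O_DIRECT, 0600);
	if (fd < 0 || io_setup(NR_IOS, &ctx) < 0)
		exit(1);

	for (i = 0; i < NR_IOS; i++) {
		void *buf;
		/* O_DIRECT needs aligned buffers, offsets and lengths */
		if (posix_memalign(&buf, BLK, BLK))
			exit(1);
		memset(buf, 'x', BLK);
		io_prep_pwrite(&iocbs[i], fd, buf, BLK, (long long)i * BLK);
		iocbps[i] = &iocbs[i];
	}

	/* submit the dependent IO... */
	if (io_submit(ctx, NR_IOS, iocbps) != NR_IOS)
		exit(1);
	/* ...and block until every submitted IO has completed */
	if (io_getevents(ctx, NR_IOS, NR_IOS, events, NULL) != NR_IOS)
		exit(1);
	/* commit metadata changes and enforce ordering in the storage */
	if (fdatasync(fd) < 0)
		exit(1);
	/* the "barrier" is complete: IO submitted from here on is
	   ordered after everything above */
	io_destroy(ctx);
	close(fd);
	return 0;
}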
-Dave.
Luu: Files are hard
Posted Dec 15, 2015 8:06 UTC (Tue) by neilbrown (subscriber, #359) [Link]
While that might be true(*) I don't think it is an interface that the kernel should provide.
To implement a "write barrier" you need to identify which requests are "before" and which are "after", so you are really asking the kernel identify groups of requests for you and that opens various cans of worms regarding privilege and resource reservation.
I think that the interfaces you want are "make change", "flush changes", and "wait for flush to complete", with the ability to identify "changes" with reasonable precision.
We have "sync_file_range" which allows fairly precise identification of which part of a file you want to flush or wait for. We cannot offer more precision in directories than "all names in the directory". Would there really be value in that?
If you wanted to implement a "barrier" it should be possible in a library, using a separate thread to wait for the "before" writes to complete before initiating the "after" requests. Certainly a genuine attempt at that would be an important precursor to adding any sort of support to the kernel.
(*) However I don't think "All you need is barriers" will ever be a hit song.
Luu: Files are hard
Posted Dec 15, 2015 11:53 UTC (Tue) by Wol (guest, #4433) [Link]
Except you're offering a fix for just a small part of the problem :-( What happens if I'm scattering writes all over the file system?
What I *NEED* is some way of telling the kernel that for a given bunch of writes, order is irrelevant, but as soon as I make one particular write, ALL previous writes MUST have been committed otherwise corruption is a possibility. Basically, this is an absolute necessity for any copy-on-write system, and if you give me this guarantee, I can then sort myself out.
But you're limiting the guarantee on offer, and what you're offering just isn't enough. All I need is a system call, that gets queued just like any other write, but that tells the layers below "don't optimise *across* this barrier". It can optimise like hell between barriers, but then passes the barrier down to the next layer, right down to the disk driver.
And then I have the ability to reason about disk writes. I can guarantee that transactions are synchronous (across a filesystem, at least). And the system won't be brought to its knees if a programmer is stupid and calls the barrier every single write.
(Of course, if the system treats data, metadata, journals and logs all separately and the barrier doesn't apply "across the board", this then brings in another big can of worms ...)
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 12:21 UTC (Tue) by Wol (guest, #4433) [Link]
You're putting me to a load of work, solving a problem I'm not interested in, just because the *side* *effect* is what I need. I have to individually check all my writes have been flushed before I move on. And that's putting the kernel to a load of unnecessary work too.
If I have some way of saying "I don't want *this* write to start, until all previous writes have finished", then the kernel has far more freedom to optimise, I don't need to keep track of stuff I don't care about, and everything is a lot simpler. It provides a synchronous guarantee, when I need it ...
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 12:33 UTC (Tue) by andresfreund (subscriber, #69562) [Link]
Luu: Files are hard
Posted Dec 15, 2015 13:05 UTC (Tue) by alonz (subscriber, #815) [Link]
Yes - but the only (semi?) guaranteed way they can implement these is by using O_DIRECT access to the storage, and preventing the kernel from optimizing at all.
Luu: Files are hard
Posted Dec 15, 2015 13:22 UTC (Tue) by andresfreund (subscriber, #69562) [Link]
You "just" have to do the whole preallocate_file(tmp);fsync(tmp);fsync(dir);rename(tmp, normal);fsync(normal);fsync(dir); dance to create the files for journaling. And then protect individual journal writes with fdatasync() (falling back to fsync if not present). If you get ENOSYS for any of those, you give up.
Luu: Files are hard
Posted Dec 15, 2015 13:27 UTC (Tue) by andresfreund (subscriber, #69562) [Link]
Luu: Files are hard
Posted Dec 15, 2015 18:22 UTC (Tue) by jgg (guest, #55211) [Link]
The application actually has to go and zero fill the file directly before O_DIRECT and AIO begin to work properly.
Something about fallocate just reserves physical space, but it might not be zero, so the FS has a special slow path for the first write to any block, because it is also accompanied by a metadata write to mark the block as written. mumble mumble. I never dug into it enough to be sure.
Luu: Files are hard
Posted Dec 15, 2015 13:55 UTC (Tue) by Wol (guest, #4433) [Link]
Oh - and what happens if we get something like the ext2/ext3 transition, where the behaviour that one filesystem REQUIRED was actually TOXIC to the other? So your recommended behaviour is, unfortunately, filesystem-dependent :-( Along comes a new filesystem, with new behaviour, and your behaviour no longer works :-( In fact, it could kill the system ...
And yep, I would expect to have logs and all that, but what I'm trying to do is avoid as much of that crud as possible. If my database is copy-on-write, and I can guarantee synchronicity where I need it (hence the write barrier), I can be pretty confident that the chances of a failure in a critical section are very slim. Error recovery code rarely gets tested ... :-)
Don't forget, I'm a database programmer by trade. And I don't "do relational" - my database DESIGN gives me a lot of integrity that a SQL programmer has to implement by hand. I get my integrity "by default" to a large degree. And I want the OS to give me the same guarantees. Yes, I know some things can't be guaranteed. But something as simple as guaranteeing that disk writes will be processed in a fixed order?
If the OS can guarantee me, that it will pass writes to the hardware in the order that I ask, and not corrupt my code by optimising the hell out of it where it shouldn't, then it makes my life much simpler. (And as I said elsewhere, I know Physics makes life a pain, some degree of optimisation is a necessity.)
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 14:40 UTC (Tue) by corbet (editor, #1) [Link]
The person you're replying to knows a wee bit about databases too... :)
So if you really want a new interface from the kernel, could you describe that interface? What would the system call(s) be, with which semantics? With that in place, one could argue about whether it's best done in the kernel or in user space.
Luu: Files are hard
Posted Dec 15, 2015 18:23 UTC (Tue) by HenrikH (subscriber, #31152) [Link]
Luu: Files are hard
Posted Dec 15, 2015 18:32 UTC (Tue) by fandingo (subscriber, #67019) [Link]
I think the solution is pretty simple (at least in concept) and steals the database transaction pattern:
begin_transaction(fd, flag)
(critical IO operations)
commit_transaction(fd)
while the names are the same as DB transactions, the feature wouldn't necessarily have any sort of rollback functionality, although that would be cool. Additionally, "commit" means more of a "I'm done with the transaction, go back to normal" rather than an "only now put all the data on disk." Begin_transaction() does two things. 1) IO reordering is disabled through all storage layers for blocks associated with that FD.* 2) Every IO syscall in the transaction for that FD automatically has fsync() called. Commit_transaction just turns it off -- the data have been hitting the disk as normal, though without reordering, the whole time.
Through a feature flag argument on begin, you could allow tuning how each subfeature works.
The single biggest issue, in my opinion, from the article and comments is the unpredictable ordering of IO operations, and the only solution is to utilize synchronization methods (O_SYNC, fsync, custom in user space with O_DIRECT/AIO, etc.) that have serious downsides. You know, sometimes I want the Atomicity and Isolation in ACID but don't care so much about Durability. The current kernel features force you to do both -- hurting performance -- and even then, you'll probably still have subtle bugs.
* Much easier said than done.
> With that in place, one could argue about whether it's best done in the kernel or in user space.
At least in my idea, the principal concern is allowing the user to definitely prevent IO reordering on specified FDs. That's impossible to do in user space -- at least without going down the undesired existing-synchronization paths.
Or, hell, perhaps just go full-bore:
begin_acid(fd, flag)
(Critical region)
end_acid(fd)
where flag is simply a 5-bit flag corresponding to enabling each property in the acronym. (Again, no rollback necessarily, so atomicity is only at the individual syscall level within that block, and there is no larger idea of an atomic transaction. Of course, modern, COW file systems can simply reflink at begin_acid and do some FD magic if there is an error if they want to support developer-friendly features.)
Luu: Files are hard
Posted Dec 15, 2015 19:33 UTC (Tue) by Wol (guest, #4433) [Link]
Let's think about a typical database transaction. Write to log, write to database, clear log. We have *at* *least* two files involved - the log file and the database file(s), and we need clear air between each of the three steps. The log needs to be flushed before the first write gets to the database, and the database needs to be flushed before the log gets cleared, and certainly in my setup the database can be plastered across a lot of files.
But if I can say to the kernel "I don't care what optimisation you do AS LONG AS you guarantee those three operations are synchronous", then I can take responsibility for recovering from errors, without having to have that massive layering violation with loads of filesystem-specific code to make sure the stuff I need is safely on disk.
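With today's primitives, that "clear air" can only be spelled fsync - something like this (a sketch; log_fd, db_fd and clear_log() are placeholders):

#include <stddef.h>
#include <unistd.h>

extern int log_fd, db_fd;		/* placeholders */
extern void clear_log(int fd);		/* hypothetical helper */

void commit(const void *log_rec, size_t log_len,
	    const void *db_rec, size_t db_len)
{
	write(log_fd, log_rec, log_len);
	fsync(log_fd);			/* log durable before...      */
	write(db_fd, db_rec, db_len);	/* ...the database is touched */
	fsync(db_fd);			/* database durable before... */
	clear_log(log_fd);		/* ...the log is cleared      */
	fsync(log_fd);
}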
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 20:51 UTC (Tue) by fandingo (subscriber, #67019) [Link]
Nonetheless, I think my suggestion could work in the use case you describe.
> We have *at* *least* two files involved - the log file and the database file(s), and we need clear air between each of the three steps.
Okay, so you'd do:
begin_transaction(log_fd, ...)
(Log IO operations)
commit_transaction(log_fd)
begin_transaction(data_fd, ...)
(Data IO operations)
commit_transaction(data_fd)
I suppose this was implied, but I'll state it explicitly: Commit_transaction() has an fsync, creating the necessary durability between your two phases. Additionally, it would be possible to nest transactions. Remember there's an (optional through the flag) fsync() after every IO call on that FD inside the transaction, so if I pwrite(log_fd, ...), that call is fully written* once it returns. Therefore, pwrite(log_fd, ...) followed by pwrite(data_fd, ...) would be well-ordered.
> But if I can say to the kernel "I don't care what optimisation you do AS LONG AS you guarantee those three operations are synchronous"
Since the title of article is "Files Are Hard," it's worth pointing this out. I primarily care about well-ordering, which is not the same as synchronous IO, although we tend to ensure well-ordering by using synchronization methods at tremendous performance impact.
* Well, as best as the OS can ensure. Storage controllers can have their own whims.
> Except that doing it at the FD level isn't really enough. If transactions carry the process-id, you could probably do it at that level.
I considered this, but a process doesn't universally need this feature. Forcing a developer to use multiprocess programming just to overcome a synthetic limitation isn't friendly. It seems more likely that a library author would define tx_open and tx_close that do open(2)/close(2) and set up a transaction, if this were desired universally within a program. If, however, we did want to make it per-process, we'd probably be happier with limiting it to a specific thread rather than an entire process.
Luu: Files are hard
Posted Dec 15, 2015 19:36 UTC (Tue) by Wol (guest, #4433) [Link]
Cheers,
Wol
Luu: Files are hard
Posted Dec 17, 2015 14:35 UTC (Thu) by Wol (guest, #4433) [Link]
Sorry to rain on your parade, but you've just given me a solution to make sure my JOURNAL gets safely to disk. I want a solution to make sure my DATA (whatever that is) gets safely to disk. And if I happen to be writing to a pre-existing database file that's several gigs in size, this particular solution is no help :-(
I'm after a generic solution - I write something, I want confirmation it's safe on disk. Any solution that makes assumptions about that "something" is wrong/not good enough. And I'd much rather it was simple, not an error-prone song-and-dance.
Cheers,
Wol
Luu: Files are hard
Posted Dec 17, 2015 20:39 UTC (Thu) by bronson (subscriber, #4806) [Link]
Luu: Files are hard
Posted Dec 16, 2015 7:01 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
> Yes - but the only (semi?) guaranteed way they can implement these is by using O_DIRECT access to the storage, and preventing the kernel from optimizing at all.
I am going to put my foot in here and show my naivety, but wouldn't the following achieve journal reliability without requiring too much from the file system: each journal entry is check-summed or similar for integrity, and each one clearly references its predecessor. Then after a crash you only replay integral entries with no gaps in the chain and drop the rest.
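One possible shape for such an entry, as a sketch (the layout is invented; crc32() as provided by zlib):

#include <stdint.h>
#include <zlib.h>	/* crc32() */

struct journal_entry {
	uint64_t seq;		/* predecessor's seq + 1: no gaps */
	uint32_t payload_len;
	uint32_t crc;		/* over seq, payload_len and payload */
	unsigned char payload[];
};

/* Replay stops at the first entry that is torn, corrupt or out of
   sequence; everything after it is dropped. */
static int entry_valid(const struct journal_entry *e, uint64_t expect)
{
	uint32_t crc = crc32(0L, Z_NULL, 0);

	if (e->seq != expect)
		return 0;	/* gap in the chain */
	crc = crc32(crc, (const unsigned char *)&e->seq, sizeof e->seq);
	crc = crc32(crc, (const unsigned char *)&e->payload_len,
		    sizeof e->payload_len);
	crc = crc32(crc, e->payload, e->payload_len);
	return crc == e->crc;	/* a torn write shows up here */
}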
Luu: Files are hard
Posted Dec 16, 2015 12:01 UTC (Wed) by nye (guest, #51576) [Link]
I mean at this point you've basically reinvented ZFS.
Luu: Files are hard
Posted Dec 16, 2015 12:11 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
Right you are, I always mix the two up.
> I mean at this point you've basically reinvented ZFS.
I think I still have quite a way to go before I get there!
Luu: Files are hard
Posted Dec 16, 2015 12:17 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
> I think I still have quite a way to go before I get there!
More seriously - yes, I know that this is a basic file-system idea, but I do not know enough to know all the difficulties and traps associated with it. Or for that matter, whether it can be made to perform in a way which is acceptable for a database.
Luu: Files are hard
Posted Dec 16, 2015 13:22 UTC (Wed) by andresfreund (subscriber, #69562) [Link]
Luu: Files are hard
Posted Dec 16, 2015 13:22 UTC (Wed) by Wol (guest, #4433) [Link]
What happens if you read the log and the checksum is incorrect (or the log entry never even made it to disk!) and then you discover that the database writes didn't make it to disk properly, either?
Cheers,
Wol
Luu: Files are hard
Posted Dec 16, 2015 13:36 UTC (Wed) by andresfreund (subscriber, #69562) [Link]
What? fsync does actually exist.
> (b) you have no way of knowing that the database entries made it to the disk until you need them.
Commonly you only write out 'data' entries to the OS once the corresponding log entry has been flushed to disk. E.g. by tagging each page with the position of the corresponding log entry.
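In sketch form (names invented - this is the generic WAL-before-data rule, not PostgreSQL's actual code):

#include <stdint.h>

struct page {
	uint64_t lsn;	/* log position of the last record touching it */
	/* ... page contents ... */
};

extern uint64_t flushed_lsn;		/* WAL durable up to here */
extern void flush_wal_to(uint64_t lsn);	/* fdatasync() the log */
extern void write_page(const struct page *p);

/* Never let a data page reach the OS before the log that describes
   its changes is safely on disk. */
void checkpoint_page(const struct page *p)
{
	if (p->lsn > flushed_lsn)
		flush_wal_to(p->lsn);	/* advances flushed_lsn */
	write_page(p);
}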
Luu: Files are hard
Posted Dec 17, 2015 14:25 UTC (Thu) by Wol (guest, #4433) [Link]
And if you read the original article, fsync has a couple of "gotchas" which can bite!
If I write a log/journal file, and call fsync on the log, then can I be sure that the data has been written and will survive a crash? NOOO!!!! my log has gone !!!!!
The more I discuss this and think about it, the more it becomes obvious that it is extremely hard to write defensively such that your data will survive a crash :-(
Cheers,
Wol
Luu: Files are hard
Posted Dec 17, 2015 14:38 UTC (Thu) by andresfreund (subscriber, #69562) [Link]
So? Yes, there's a bunch of additional rules. No, not everyone gets them right. But you can follow them. And yes, I read the article.
> If I write a log/journal file, and call fsync on the log, then can I be sure that the data has been written and will survive a crash? NOOO!!!! my log has gone !!!!!
If you do it correctly, and you're not using broken hardware, then yes, you can be sure.
> The more I discuss this and think about it, the more it becomes obvious that it is extremely hard to write defensively such that your data will survive a crash :-(
Reinforcing that perspective seems to be your prime interest, so I'm stopping here.
Luu: Files are hard
Posted Dec 18, 2015 1:47 UTC (Fri) by Wol (guest, #4433) [Link]
"When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency."
As I read this, he is stating pretty damn clearly that my user-space code has to contain file-system dependencies. That means I have no guarantee that what works today will work tomorrow. That means that writing correct code is not merely difficult, it's impossible.
You say it's easy. Luu says it's impossible. Who do I believe? When faced with the evidence of the ext3 debacle (where the code for ext2 worked, just so badly it was unusable) I'm pretty much forced to conclude Luu *was* right. Is he still right? I see no evidence to the contrary. I haven't seen anybody here actually challenge his paper.
Okay, Neil claims that that is actually a kernel bug. That's no help to me, a user-space dev. All I'm asking for is a simple kernel API where I can ask the kernel "is my data safely passed through to the hardware (or network, or whatever)", and I get back one of two answers, either "yes", or "sorry, we had an error". If that's a linux virtual file system api, then it's easy to write robust code, and if something goes wrong it's a clear, blatant, and serious bug.
Cheers,
Wol
Luu: Files are hard
Posted Dec 18, 2015 7:48 UTC (Fri) by michaeljt (subscriber, #39183) [Link]
As I mentioned before, I am no expert in this particular code area. I can answer this statement as a generalist though. As a database developer, certain things are under your control and certain things are not (you have a certain amount of choice as to where you set the limit). In this case you have clearly left the decision as to which file-system to use in the hands of the user of your database. As you pointed out, guaranteeing correctness for every possible file-system the user might use is impossible. Even for the file-system developer (not to mention for your own code using the file-system) ruling out bugs which might cause things to break is impossible.
There is a simple solution though: decide and clearly state what you will support and what not. This does not have to be binary. You can tell the user: "you can use this or this file-system, but you have a higher risk of losing data if your system crashes". Perhaps having that optional choice even adds value for the user. You can say (if you really want to): "if you use this particular file-system I guarantee you in some form that you will not lose data if this or this sort of crash happens". And if you are not taking money from the user, this is also a good option: "I expect no loss of data on a crash with this and this set-up, feel free to investigate further and to point me to problems."
Luu: Files are hard
Posted Dec 18, 2015 7:57 UTC (Fri) by michaeljt (subscriber, #39183) [Link]
Luu: Files are hard
Posted Dec 19, 2015 4:29 UTC (Sat) by butlerm (subscriber, #13312) [Link]
I believe the main issue there is not how to write code to preserve consistency, it is that some filesystems silently preserve some form of consistency as an artifact of the way they were designed without the program having to request it.
The second issue is merely making code written to the standard perform well, but that is mostly up to the filesystem.
Luu: Files are hard
Posted Dec 16, 2015 13:45 UTC (Wed) by michaeljt (subscriber, #39183) [Link]
Luu: Files are hard
Posted Dec 16, 2015 15:27 UTC (Wed) by nye (guest, #51576) [Link]
This is basically what happens in ZFS: all writes are grouped into transactions and any incomplete operations are discarded. When loading a pool it scans a list of uberblocks (basically filesystem root nodes) stored in a circular log, to find the one with the highest transaction group number; if that doesn't check out, it looks at the next highest, and so on. If somehow every uberblock in the log fails its checksum, you are having a very bad day, ZFS explodes into shards of superheated shrapnel, and there is much wailing and gnashing of teeth.
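In outline, the recovery scan is something like this (a paraphrase, not actual ZFS code):

#include <stddef.h>
#include <stdint.h>

struct uberblock {
	uint64_t txg;	/* transaction group number */
	/* checksum, root block pointer, ... */
};

extern int checksum_ok(const struct uberblock *ub);

/* Pick the newest root that actually checks out. */
const struct uberblock *pick_root(const struct uberblock *ring, size_t n)
{
	const struct uberblock *best = NULL;
	size_t i;

	for (i = 0; i < n; i++)
		if (checksum_ok(&ring[i]) &&
		    (best == NULL || ring[i].txg > best->txg))
			best = &ring[i];
	return best;	/* NULL: every uberblock failed - very bad day */
}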
(I think you might need to tell it to do the rewind manually using an import flag, because automatically throwing away newly written data can be a bad idea even if it has a checksum failure, on the basis that something might be recoverable. I wouldn't swear to this though.)
For more on how ZFS handles this sort of thing, have a look at https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsA....
Luu: Files are hard
Posted Dec 17, 2015 14:21 UTC (Thu) by Wol (guest, #4433) [Link]
Cheers,
Wol
Luu: Files are hard
Posted Dec 17, 2015 14:41 UTC (Thu) by andresfreund (subscriber, #69562) [Link]
Luu: Files are hard
Posted Dec 18, 2015 2:17 UTC (Fri) by flussence (subscriber, #85566) [Link]
Luu: Files are hard
Posted Dec 15, 2015 21:52 UTC (Tue) by neilbrown (subscriber, #359) [Link]
> If I have some way of saying "I don't want *this* write to start, until all previous writes have finished",
I'm convinced that the best way to say "Don't start this write" is to not say "start this write".
i.e. don't submit the "write" request until you are ready for it to happen.
Maybe this is because I was burnt by the "write barriers" we had in the kernel for a while. That turned out to be horrible and I was so glad it was discarded. Now we effectively have separate "write" and "flush" interfaces internally and life is much easier.
Something we do have in the kernel that maybe isn't available to user-space is completion notification. *All* write requests are async with a completion notifier. The completion handler can achieve barrier-like semantics by queuing the dependent write if all prerequisites have been met (which is exactly how RAID5 enforces ordering w.r.t the bitmap or journal).
'aio' might make this sort of thing available to user-space, but I don't see an "aio_sync_file_range", so maybe not.
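The pattern is roughly this (a userspace paraphrase - C11 atomics standing in for the kernel's completion machinery):

#include <stdatomic.h>

/* A dependent write is issued from the completion path only once
   every prerequisite IO has finished: barrier-like semantics with
   no barrier in the request stream. */
struct dep_write {
	atomic_int prereqs_left;		/* outstanding prerequisites */
	void (*submit)(struct dep_write *self);	/* issues the dependent write */
};

/* Called from each prerequisite IO's completion handler. */
void prereq_completed(struct dep_write *w)
{
	if (atomic_fetch_sub(&w->prereqs_left, 1) == 1)
		w->submit(w);	/* last prerequisite done: release it */
}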
Someone has to keep track of the things that you claim not to care about and it isn't clear to me that the kernel is the best place to do that.
You want the kernel to sometimes delay writes, but that could cause problems for the memory balancing code. When memory is tight, we typically start flushing out all dirty pages ..... but not that one or that one or those 5000 over there, because they are waiting on a filesystem operation (rename?) which is blocked by a memory allocation...
The more complex ordering requirements you put in the kernel, the more deadlock-prone it becomes, and the more opportunity there is for (unprivileged) userspace processes to pin kernel memory.
Luu: Files are hard
Posted Dec 15, 2015 22:26 UTC (Tue) by Wol (guest, #4433) [Link]
:-)
> > If I have some way of saying "I don't want *this* write to start, until all previous writes have finished",
> I'm convinced that the best way to say "Don't start this write" is to not say "start this write". i.e. don't submit the "write" request until you are ready for it to happen.
You mean, the user space app calls a "wait until this io has completed" function? I'm okay with that. Not as simple from the user-dev point of view, but not that hard.
> Maybe this is because I was burnt by the "write barriers" we had in the kernel for a while. That turned out to be horrible and I was so glad it was discarded. Now we effectively have separate "write" and "flush" interfaces internally and life is much easier.
Or if user-space can invoke a flush, knowing that anything written after the flush won't actually get processed until the flush is complete. I think that is actually a pretty accurate description of my write barrier!
> Someone has to keep track of the things that you claim not to care about and it isn't clear to me that the kernel is the best place to do that.
The problem is that - under the current state of affairs - it seems clear that user-space is definitely the wrong place to do that. Different file-systems have different semantics, requiring different song-and-dances to ensure data is flushed safely and efficiently. That means user-space needs to concern itself with what sort of file system it's using - my program shouldn't have to worry about whether the backing store is a simple disk, raid, network, or what type of filesystem (FAT, ext, btrfs, whatever) it is. I just want a simple api that enables me to know that what I've written is now safely passed through the kernel to the hardware underneath.
> You want the kernel to sometimes delay writes, but that could cause problems for the memory balancing code. When memory is tight, we typically start flushing out all dirty pages ..... but not that one or that one or those 5000 over there, because they are waiting on a filesystem operation (rename?) which is blocked by a memory allocation...
> The more complex ordering requirements you put in the kernel, the more deadlock-prone it becomes, and the more opportunity there is for (unprivileged) userspace processes to pin kernel memory.
But all I'm asking for is the ability to enforce FIFO when I need to ...
I understand your problem, but you're then creating a very hard problem for me - "how do I guarantee the integrity of my user's data in the face of a crash, when I have no easy way of ensuring that data has actually hit the disk?" And note that pretty much all of the work-arounds available to me, consist of trying to break kernel optimisations !!! Note the bit of the thread that said that the only decent, PARTLY reliable method was to bypass the kernel optimisations as much as possible, and move as much of the work as possible into user space. Pretty damning indictment of the kernel :-(
At the end of the day, all I need is some api - at the kernel interface level - that lets me reason about the state of the disk. That means it's got to be filesystem-independent. That means the kernel is about the only place it can go ... but a "flush" that *guarantees* to me that all my writes so far have hit disk is probably more than good enough ... :-)
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 22:55 UTC (Tue) by neilbrown (subscriber, #359) [Link]
If that is true, then that is categorically a bug.
There is a base API, largely involving fsync, that provides data stability with reasonable efficiency, and that all filesystems should implement (I thought they did).
Some filesystems may provide stronger guarantees than needed, but shouldn't impose higher costs when the guarantees are requested.
Do you have specific details of differences that user-space needs to be aware of, which have been reported, and which have been rejected as "not a bug"? Failing that - differences that haven't been reported would be interesting.
I'm aware of the issue with ext3 where fsync wasn't always needed (not a bug) but could cause substantial performance cost when used (definitely a bug). It's not clear to me that it was ever reported as a bug though.
As ext3 has been removed from the kernel now, I think we need to leave that in the garbage can of history and move on.
Luu: Files are hard
Posted Dec 18, 2015 0:55 UTC (Fri) by Wol (guest, #4433) [Link]
> If that is true, then that is categorically a bug.
Can I just quote Luu's article ... "When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency. "
I may be misunderstanding it, but that certainly sounds to me like the song-and-dance varies by file-system ...
> I'm aware of the issue with ext3 where fsync wasn't always needed (not a bug) but could cause substantial performance cost when used (definitely a bug). It's not clear to me that it was ever reported as a bug though.
Bearing in mind that it actually worked as specified (apparently), maybe it wasn't reported as a bug because it wasn't seen as a bug. "It's far too slow" may be a show-stopper, but to me it isn't technically a bug ... (unless, of course, the spec actually makes performance guarantees).
But I'm going to have to bow out of this discussion very soon - I'm about to go away for the weekend and I try (for my wife's sake) to stay away from the internet when I'm on holiday ... :-)
Sorry.
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 23:37 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]
Not that I'm against it. It looks like they are a really nice primitive for high-performance applications.
Luu: Files are hard
Posted Dec 16, 2015 9:23 UTC (Wed) by stressinduktion (subscriber, #46452) [Link]
Luu: Files are hard
Posted Dec 16, 2015 11:26 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]
Luu: Files are hard
Posted Dec 26, 2015 10:32 UTC (Sat) by chojrak11 (guest, #52056) [Link]
Luu: Files are hard
Posted Dec 26, 2015 12:11 UTC (Sat) by anselm (subscriber, #2796) [Link]
Not that they were a great innovation in Windows NT. Remember that back then, Windows NT was basically a version of VMS for Intel CPUs.
Luu: Files are hard
Posted Dec 15, 2015 18:04 UTC (Tue) by jgg (guest, #55211) [Link]
So, how about fcntl(F_CRASH_RECOVERY,F_STRONG) as an API?
Guarantees that upon recovery the linear sequence of actions applied to that FD (speaking broadly, meaning all metadata, directory metadata and data blocks) will be unwound to one of the linear states the application created. Guarantees that fsync completion means the unwind will be to the state at fsync, or later.
The API would specifically and deliberately link data and metadata together - which is the real problem. Keeping those flows distinct for performance is what utterly destroys the crash consistency model.
I.e. for ext I could imagine implementing the above as forcing data=journal for just the tagged file.
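Usage might look like this (everything here is hypothetical - neither constant exists in any kernel, and the values are made up purely so the sketch is self-contained):

#include <fcntl.h>
#include <unistd.h>

#define F_CRASH_RECOVERY 1050	/* hypothetical fcntl command */
#define F_STRONG 1		/* hypothetical argument */

extern const char *rec1, *rec2;
extern size_t len1, len2;

void example(void)
{
	int fd = open("table.db", O_RDWR | O_CREAT, 0600);

	if (fcntl(fd, F_CRASH_RECOVERY, F_STRONG) < 0)
		return;	/* unsupported: fall back to the fsync dance */

	write(fd, rec1, len1);	/* after a crash, recovery unwinds to one  */
	write(fd, rec2, len2);	/* of the linear states: nothing, rec1, or */
	fsync(fd);		/* rec1+rec2; after the fsync, at least
				   rec1+rec2 survives */
}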
Luu: Files are hard
Posted Dec 17, 2015 16:39 UTC (Thu) by butlerm (subscriber, #13312) [Link]
Luu: Files are hard
Posted Dec 16, 2015 4:42 UTC (Wed) by ras (subscriber, #33059) [Link]
No, I'm with Wol here. The case you're talking about boils down to reporting to the "user" that his action has happened, ie his airline reservation is firm or his payment has been processed. When that's what you must have then your API is fine, but it comes at a large performance cost. Call this case (A).
Another condition people often want is "my on disk data structure is consistent". To some this is known as "no fsck needed". It translates to perhaps losing what you were doing, but never losing data you had promised you had saved for case (A) - which could well happen if the data structure were allowed to become inconsistent. This condition can be implemented easily enough with a write barrier or the other alternatives you have mentioned like completion notification. The reason it's popular is that the primitive itself is effectively free in terms of performance. Free is a very attractive number, so attractive that people will sometimes be happy enough to trade the possible loss of a few minutes of data for it.
What they are generally NOT willing to trade is losing 1TB of data for free. Interestingly, we have voted on this when it comes to file systems - I often run with journal_data_writeback for speed, but I never run with journalling turned off entirely. So you kernel guys have decided it's important enough to give your users that choice for things you store on disk like file systems - but apparently no user space data is worthy of the same flexibility.
Luu: Files are hard
Posted Dec 21, 2015 2:20 UTC (Mon) by ras (subscriber, #33059) [Link]
This doesn't address Luu's prime complaint: keeping your on disk data structures consistent remains hard. This just makes it possible to do it without wrecking performance by forcing flushes. It also makes it possible to write a userspace library that does make it easy.
IMHO that would be a huge step forward. I was just reading a paper on LevelDB vs LMDB, which boils down to Log Structured Merge (LSM) versus Multiversion Concurrency Control (MCC). http://www.diva-portal.org/smash/record.jsf?pid=diva2%3A7... Oddly, at the 1000ft view LSM and MCC are very similar. Both are COW, writing updated data to a separate place and deferring the expensive "this is the new version" operation till later. The primary difference is LSM defers it for a looong time by writing it to a separate file (often several) then doing a merge step in the background. In the most extreme case it can avoid flushes entirely (because if you wait long enough, you can safely assume the data is on disk even though you don't have an API telling you it's there). MCC on the other hand effectively does the merge on every commit, but to do that it must use the only API the kernel provides that tells you the data has hit disk - a flush.
Not surprisingly, in this paper LSM blitzed MCC in write speed by a factor of 4 (ie 400%). But that comes at the expense of having to merge the logs on the fly on every read. Because MCC wears the flush, the number of unmerged changes remains tiny. And again not surprisingly, MCC blitzed LSM by a factor of 4 for reads.
If MCC could avoid the flushes there is no reason it could not be as fast as LSM. But as of right now it can't be because the kernel doesn't supply the API's to make it possible. As a consequence, Linux applications are at an unnecessary 4 fold speed disadvantage in some scenarios. How we have tolerated that for over a decade now is a bit of a mystery to me.
Luu: Files are hard
Posted Dec 21, 2015 6:27 UTC (Mon) by neilbrown (subscriber, #359) [Link]
This might be a nice idea, but it is completely impractical on Linux (without a massive rewrite).
The distinction between file descriptors disappears almost the instant you enter a system-call.
The distinction between open file descriptions doesn't last much longer for a write request.
You could get a notification that a file has no more dirty pages in the page cache without too much trouble. You might even be able to get a notification that there are no dirty pages in a given range.
For directory entries you could similarly arrange a notification that all updates to a directory are safe, and possibly that the name used for a given file descriptor was safe.
To get stability guarantees in the face of ongoing updates you would probably need to block new updates until old updates are flushed. If that caused a problem (I suspect it would) then having two log files that you alternate between might be a solution.
The only way you could hope to track "all writes done to this file descriptor" would be to use O_DIRECT.
Luu: Files are hard
Posted Dec 21, 2015 8:16 UTC (Mon) by ras (subscriber, #33059) [Link]
To be honest I knew that when I wrote it.
The point of the post wasn't to solve the problem as I don't know enough about the kernel to do it. It was to get to the stage before that, which is to get some acknowledgement there is fruit here worthy of spending some energy to harvest. Reading through the comments it looked to me the discussion hadn't made it that far.
It seems that everyone assumes ACID is the only use case. That used to be the case, when we all had systems running on a server in a nearby room, and when we pressed the Save button we expected to get absolute confirmation the data was saved in the time it took to respond to the enter key on a dumb terminal. After all the human isn't going to remember what they just typed. But now the most likely scenario is the data is coming from another computer, probably a web page, and it is travelling over a link with a TCP handshake overlaid onto a 50ms latency. It's not at all unusual for the final submit button to take 5 seconds, and if it failed the user hits back and tries again. Or even better, it's coming from pages like Google documents - where javascript in the web page is trickling updates back to the server and is perfectly happy to wait forever and resend over and over again. We absolutely need ACI from ACID, but the D is not as important as it once was.
Forcing the application writer to destroy any chance the kernel has of optimising batched writes, because the only way he can keep his on-disk data structures consistent is flushing everything, all the time, makes no sense in this world. Yet as far as I can tell, there is no way around it.
> You could get a notification that a file has no more dirty pages in the page cache without too much trouble. You might even be able to get a notification that there are no dirty pages in a given range.
Flushing page cache doesn't quite cut it. The fsync() man page says "This includes writing through or flushing a disk cache if present" because that's what "I am now certain my on-disk data structures are consistent" requires. I don't think "a file has no more dirty pages" would work either, as it's not difficult to dream up scenarios where that condition would be very rare to non-existent, which would effectively mean the final write that moves the on-disk data structure to the new version could never be done. A blockv system call that returned when a list of ranges was safely on disk would be the minimum, I think. And no, I'm not seriously proposing that - as each blockv call would need its own thread.
Luu: Files are hard
Posted Dec 21, 2015 21:25 UTC (Mon) by neilbrown (subscriber, #359) [Link]
I think that is an excellent idea. I think that the first step would be to have a very clear, very specific use-case. Something that you could implement and then say "see, only N updates per second", and then I could implement differently and say "Look, I can get M updates per second". Then you can say "but if a crash happens *there* you lose consistency".
Then we can create a new API that allows 10*M updates per second, and write a crash-test that randomly resets a KVM instance and never ever detects corruption after tens of thousands of crash cycles.
Then we would have something to sell.
I can't even begin to suggest a use case. All I ever care about is whole files. write;fsync;rename;fsync-directory. Done.
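Spelled out (paths invented):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int save_whole_file(const void *buf, size_t len)
{
	int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
	int dfd = open(".", O_RDONLY | O_DIRECTORY);

	if (fd < 0 || dfd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len)	/* write */
		return -1;
	if (fsync(fd) < 0)				/* fsync */
		return -1;
	close(fd);
	if (rename("config.tmp", "config") < 0)		/* rename */
		return -1;
	if (fsync(dfd) < 0)				/* fsync-directory */
		return -1;
	close(dfd);
	return 0;
}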
> where javascript in the web page is trickling updates back to the server and is perfectly happy to wait forever and resend over and over again. We absolutely need ACI from ACID, but the D is not as important as it once was.
How does the javascript know that it needs to resend, or that it never will need to again? At some point durability is needed.
If you have multiple webpages all updating the same backend, then in the case of a server crash where each client replays, you need to be sure that any ordering issues are resolved in the same way (at least if they were externally visible).
I accept the durability itself may not be always required, but I think it is by far the easiest way to achieve other things that are required. With the recent and expected advances in hardware, durability on demand is so cheap, there seems little point in coming up with a more complex solution.
Luu: Files are hard
Posted Dec 22, 2015 0:45 UTC (Tue) by ras (subscriber, #33059) [Link]
The best answer is to watch it in action. Just create a new Google Spreadsheet, enter stuff into a few cells, disconnect the computer from the network and continue typing. Soon a little bar will appear telling you it is saving your changes to Google. It wouldn't work of course, but you will be able to continue entering new data. A minute or two later it will put up a message saying "Trying to connect ..." and block you from typing (presumably so you won't lose too much work). Then plug the cable back in. The browser will notice, re-send the data that was dropped, and so pick up from where it left off. Your guess is as good as mine on how it works under the hood.
The real answer is there are as many ways of doing it as there are programmers, and I'm sure now you've seen an example of it in action you could think up your own. One simple way would be to not send the AJAX response until the kernel notifies you the data is on disk, and pair it with a separate "flush file" AJAX command that is sent when the user attempts to leave the page. Presumably the flush would cause all pending commits to happen, so any pages waiting for commits would return soon.
Believe it or not, there are times when the degree of caring whether the data made it to disk or not is so low it's hard to measure. The classic example I have been involved in is a sort of "big data" application, where the user uploads gigabytes of data to the server where it is processed by a proprietary application. Usually they will do a series of uploads. The uploads then have to be processed looking for patterns, which generates many times as many gigabytes of indexes. This all happens on a purchased VM where every cycle and IO request is charged for, and so the goal is to minimise those charges. The one thing they don't want to do is lose those indexes, as they represent a lot of accumulated CPU cycles over many uploads. But a failed upload can be re-done, and if a VM dies during an indexing run no one cares provided the on-disk data structure remains consistent - it can just be re-run. The cost-benefit tradeoff is waiting for a flush to complete after every transaction moving the database from one consistent state to another, versus re-doing the entire operation in the very rare event of failure.
Luu: Files are hard
Posted Dec 23, 2015 22:10 UTC (Wed) by neilbrown (subscriber, #359) [Link]
To get Atomicity it is nearly axiomatic that you need a journal or log or something a lot like that. Let's assume that is a separate file written to with an append-only discipline. After a crash every block in the file will be either the data that was written there, or NULs (unless you use a deliberately-broken filesystem like ext3 with data=writeback), so it is easy to find the transactions and to be sure that after replaying the journal Atomicity is provided.
To get Consistency you need to be sure that data is safe in the main database file(s) before removing it from the journal. To get Durability you need to be sure that the data is safe in the journal before telling the application that data is safe. So from the kernel-api perspective, these are much the same thing.
(Isolation is an application level concern, not a kernel-level concern. The kernel provides advisory locks and other IPC mechanisms that can be used to provide whatever isolation is needed).
So how can we know that "data is safe". We know that the kernel will write it "eventually", so we just need to know when "eventually" is. You (quite reasonably) didn't like the idea of the kernel telling us when *all* of a file was stable so we need the kernel to have some concept of at least two different sets of pages in the file: one set which triggers a notification when it becomes empty (possibly after a flush is sent to the device) and one set which is all the other dirty pages.
Then the "barrier" command that people seem to want would move all dirty pages (in a given range maybe) into the first set, and would request notification when the set became empty. We can call this the "barrier set" and the other set the "dirty set". When you request a barrier for the whole file, the dirty set becomes empty and then gradually gains new -pages. When a new barrier is created, everything in the dirty set moves to the barrier set.
There are a couple of issues that need to be addressed at this point:
1/ What if I write to a page of a file which is already in the barrier-set? Does it stay in the barrier set, or move to the dirty set? I think a few moments' thought will confirm that it must stay in the barrier set. So the barrier is 1-way. Newly dirtied data might get written before the barrier notification arrives, but old dirty pages cannot "slip past" the barrier and only get written after the notification.
2/ What if memory pressure or some other force needs to write out dirty-set pages before all the barrier-set pages are written? Should those pages be written separately from the barrier-set, or should they just be moved into the barrier-set and then it be flushed? It is hard to reason about this in a completely abstract context, but I suspect that if any of a file needs to be flushed out, then the barrier-set really should be flushed out first.
So the simple approach is that the kernel divides the pages in a file into three sets, "clean" (which don't need to be written out), "dirty" (which need to be written out eventually), and "barrier" (which need to be written out before the dirty pages, and when they are written, a notification is generated).
This is very nearly what we already have: Clean, Dirty, and Writeback is what we call them. "Writeback" are pages that are queued for a "backing device". Where it makes sense, they tend to be queued at a fairly low priority, certainly lower than synchronous reads (as opposed to async read-ahead).
We even have a mechanism for getting a notification when the Writeback set is empty. Fork, call sync_file_range with the flag SYNC_FILE_RANGE_WAIT_BEFORE, then signal the parent (maybe by simply exiting).
Moving pages from the dirty set to the barrier (aka writeback) set involves calling sync_file_range with the SYNC_FILE_RANGE_WRITE flag. Note that this doesn't wait for the write to complete, it just queues the IO.
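A minimal sketch of that two-step barrier, assuming a file descriptor fd and a byte range (the function names are mine, and error checking is omitted):

#define _GNU_SOURCE
#include <fcntl.h>

/* "Barrier": move the dirty pages of [off, off+len) into the
 * writeback set. This only queues the I/O; it does not wait
 * for it to reach the device. */
static void barrier_start(int fd, off64_t off, off64_t len)
{
        sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* Wait until everything previously pushed into the writeback set
 * has been written. Run this in a forked child (signalling the
 * parent on return) to turn it into an asynchronous notification. */
static void barrier_wait(int fd, off64_t off, off64_t len)
{
        sync_file_range(fd, off, len, SYNC_FILE_RANGE_WAIT_BEFORE);
}

Note that SYNC_FILE_RANGE_WAIT_BEFORE on its own waits for in-flight writeback in the range without starting any new write-out, which is exactly the "tell me when the barrier set is empty" notification described above.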
So I think we already have very nearly all the functionality you need to do what you want. There are some issues that may (or may not) be a problem.
- It would be nice if the "barrier" operation didn't block. sync_file_range(SYNC_FILE_RANGE_WRITE) can block though. All the pages are queued for the backing-device, and if that has a queue size limit (which it must) then there could be delays until the queue drops below that limit. Even without the queue limit, there is typically a need to allocate small data structures (bios, requests) to store the pages in the queue. Their memory allocations could block. These are really implementation details though. If someone made a case for a non-blocking "barrier" operation and had a genuine application wanting to use it, I'm sure something could be arranged.
- While background writes tend to have a low priority, it might not be as low as you would like. Once a page is in writeback it will progress on the queue and be written. There seems to be a suggestion above that you would like it to languish around a bit more to maximise the opportunity for write-combining. It is hard to know how important this really is, and the importance probably varies a lot between different storage technologies and different loads. There may be room for tuning queuing priorities if a problem was demonstrated.
So I fall back on what I've been saying all along. Linux *already* *has* what you need to write files safely, reliably, efficiently. If it isn't quite perfect, or if it doesn't work as advertised, then it is because you haven't submitted bug reports. There is undoubtedly room for improvement (an aio_sync_file_range would be nice) but improvements only happen when someone drives them.
Luu: Files are hard
Posted Dec 25, 2015 17:04 UTC (Fri) by anton (subscriber, #25547) [Link]
What you call the barrier/writeback set seems to be what I think of as a commit in a log-structured or COW file system. If Linux does that correctly in memory, that's good, but it also needs to do it correctly when writing out to disk in order to give decent crash consistency guarantees. If you have a journaling file system with full journaling, or a log-structured or COW file system, that's not too hard. And yet, even journaling and COW file systems in Linux (except NILFS) don't give such guarantees, last I looked.
Concerning bug reports, the kernel people's (especially Ted Ts'o's) stand in the O_PONIES discussion would certainly discourage me from making such reports. On the practical side, Linux crashes so rarely, and power outages are so rare 'round here, that there are very few real opportunities to actually see crash consistency in action, so I would have little to report even if I were willing to.
Luu: Files are hard
Posted Dec 25, 2015 16:40 UTC (Fri) by anton (subscriber, #25547) [Link]
> It seems that everyone assumes ACID is the only use case. That used to be the case, when we all had systems running on a server in a nearby room and when we pressed the Save button we expected to get absolute confirmation the data was saved in the time it took to respond to the enter button on a dumb terminal. After all, the human isn't going to remember what they just typed. But now the most likely scenario is the data is coming from another computer, probably a web page, and it is travelling over a link with a TCP handshake overlaid onto a 50ms latency. It's not at all unusual for the final submit button to take 5 seconds, and if it failed the user hits back and tries again. Or even better, it's coming from pages like Google documents - where javascript in the web page is trickling updates back to the server and is perfectly happy to wait forever and resend over and over again.
I think it's exactly the other way 'round.
If the user is typing into the editor, and there is a power outage or OS crash, the user notices this, and will check that the file he just saved is really complete. Sure, it would be cool if the editor could also tell the user (asynchronously) that the file now resides permanently on disk, but does the user want to wait for that by using something like fsync()? Probably not, if it takes a noticeable amount of time.
By contrast, in a distributed system, when you tell the customer that his flight has been reserved, you better be sure that it's in permanent storage, because the user won't notice if the server crashes one second after the notification. Also, I think that a distributed system where the client has to make up for an unreliable server is very hard to program, and we will see lots of lossage if we follow that model (just as we see now with file system maintainers who provide unreliable file systems and expect user space to make up for it).
Concerning the disconnection example, you just simulated a disconnection, not a server that lost data. The client just had to reconnect, but client and server were still consistent with each other after the reconnection. When the server crashes and loses some data, it's different. The easiest-to-program model for that is fully synchronous operation from the client to the server's disk hardware and back; if that's too slow, asynchronous operation (separate write requests and completion information) would be the way to go.
I don't see that most Linux file systems currently have a workable model for consistency and performance. Last I looked, the only file system that gave a decent consistency guarantee was NILFS. With the mainstream file systems, if we want consistency without programming to a complex API, use a libc variant that sync()s after every write() or other file system change (or maybe use sync-mounted file systems, but given how untested that probably is, I would not trust that option to work, especially given the data=journal precedent in Linux).
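A crude sketch of such a libc variant as an LD_PRELOAD shim (purely illustrative: errors are ignored, only write() is wrapped, and fsync() will simply fail harmlessly on pipes and sockets):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

/* Interpose write(): forward to the real libc write(), then force
 * the data out with fsync() before returning to the caller. */
ssize_t write(int fd, const void *buf, size_t count)
{
        static ssize_t (*real_write)(int, const void *, size_t);

        if (!real_write)
                real_write = (ssize_t (*)(int, const void *, size_t))
                        dlsym(RTLD_NEXT, "write");
        ssize_t ret = real_write(fd, buf, count);
        if (ret > 0)
                fsync(fd);      /* brute-force durability after every write */
        return ret;
}

Built with gcc -shared -fPIC and run under LD_PRELOAD, this is the same trick that libeatmydata (mentioned below) plays in the opposite direction: it interposes fsync() and turns it into a no-op.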
Luu: Files are hard
Posted Dec 26, 2015 21:53 UTC (Sat) by nix (subscriber, #2304) [Link]
Luu: Files are hard
Posted Dec 27, 2015 17:46 UTC (Sun) by flussence (subscriber, #85566) [Link]
Write patterns like those are unfortunately so common it's led to people pushing back in the opposite direction with things like libeatmydata, tmpfs-mounted browser profiles etc.
Luu: Files are hard
Posted Dec 28, 2015 12:04 UTC (Mon) by nix (subscriber, #2304) [Link]
Luu: Files are hard
Posted Dec 30, 2015 14:45 UTC (Wed) by ghane (subscriber, #1805) [Link]
This bit me very hard in early 2012; to make it worse I was doing three sub-optimal things:
1. Running the desktop in VMPlayer
2. Using btrfs for the first time
3. Updating daily with 12.04-proposed-update
A daily update of 30 packages would take more than an hour; I assumed this was VMPlayer's doing. Finally I googled and found the solution: libeatmydata.
In fact, I have got into the habit of doing a:
eatmydata apt-get --purge dist-upgrade
daily, on the 16.04-devel branch, even though I switched to xfs[1]. One of these days it will bite me badly.
[1] Why xfs? Because I thought I would see something different, and learn something new. Alas, things have been so event-less, I daily learn nothing.
Luu: Files are hard
Posted Dec 15, 2015 9:36 UTC (Tue) by epa (subscriber, #39769) [Link]
Then simple code will get simple semantics. If you know what you are doing and you want to sacrifice safety for performance, you can set a flag to allow the kernel to delay flushing to disk or even reorder writes; or the application could use asynchronous I/O to start with.
Luu: Files are hard
Posted Dec 15, 2015 11:33 UTC (Tue) by Wol (guest, #4433) [Link]
Have you ever used Windows before/without smartdrv? IT'S PAINFUL !!! I have/had to jump through all sorts of contortions to install XP on a bare system, and I can tell you that the mere act of copying the install files changes from a 30-minute job with smartdrv to a 30-hour job without ...
Now try and run a database system on top of that.
Sorry, I think you will find that idea is totally impractical. Yes it would be lovely if we could have fast synchronous writes, but I think you'll find the reason we're in this mess in the first place is that the Physics says "tough, you can't do that".
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 17:22 UTC (Tue) by epa (subscriber, #39769) [Link]
Although I do remember MS-DOS back in the days before smartdrv, that was with spinning disks where almost every write operation will require moving the head to the correct place, which could take tens of milliseconds. SSDs are nothing like that.
Luu: Files are hard
Posted Dec 15, 2015 19:46 UTC (Tue) by Wol (guest, #4433) [Link]
Which is exactly the problem - the database manages its own i/o, journalling and writeback because it CANNOT RELY on the OS to do it right! It shouldn't have to do that!
> Although I do remember MS-DOS back in the days before smartdrv, that was with spinning disks where almost every write operation will require moving the head to the correct place, which could take tens of milliseconds. SSDs are nothing like that.
And why do I want an SSD? The whole point of MultiValue systems is that disk i/o is incredibly efficient. Let's just say I could implement an MV database on spinning rust that would give a relational database on SSD a run for its money!
Plus, you're throwing hardware at the problem. Bad move. The code should be clean and fast before you start worrying about the hardware (and as I've been hammering on elsewhere, the application shouldn't have to care about the filesystem (or hardware), it shouldn't enter into the equation as far as programming goes).
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 20:39 UTC (Tue) by epa (subscriber, #39769) [Link]
I don't want to get into a discussion about multi-value databases like Pick compared to relational ones using SQL. I will just note that I was talking specifically about systems using an SSD with no spinning disks (in other words, 90% of all systems currently sold, if you include phones and tablets) and about typical desktop or mobile workloads. I accept that specialized applications will need specialized treatment, which is why I suggest the application should be able to turn on write-back caching if it wants it. The key point is that the simplest code should get the simplest semantics: the writes are done in the order they are requested. In fact, that doesn't exclude write-back caching at all, since the OS can keep a journal; but I suggest that even without such techniques a lot of the caching tradeoffs which were made for the age of spinning disks are no longer needed for SSDs.
Luu: Files are hard
Posted Dec 16, 2015 17:58 UTC (Wed) by nix (subscriber, #2304) [Link]
> I will just note that I was talking specifically about systems using an SSD with no spinning disks (in other words, 90% of all systems currently sold, if you include phones and tablets) and about typical desktop or mobile workloads
You mean, workloads that are generally not disk-bound except at application startup and system boot in any case? That's not terribly useful.
Luu: Files are hard
Posted Dec 17, 2015 11:31 UTC (Thu) by epa (subscriber, #39769) [Link]
Again, I am entirely in favour of a sane userspace API for defining write barriers or whatever; you should be able to say 'I do not require these writes to be committed immediately I make them, but I would like to make sure they are committed by this point'. But the most naive code that doesn't want to care about tricky issues of filesystem consistency should get the simplest and safest semantics. The ten-line program to write 'hello world' to a file and then rename that file to overwrite another one should not have hidden gotchas because operations are committed to disk in a different order. Make the default sane, and then allow more advanced use on top of that.
No to Synchronous Writes!
Posted Dec 17, 2015 10:00 UTC (Thu) by roblucid (subscriber, #48964) [Link]
If the kernel did NOT do caching, the complexity of file buffering would have to live in almost all applications, which would just be horrible.
In UNIX a raw character device was available without the block-device caching, yet it was mainly used by high-performance DBMSs, by dump(8) for backups, and by clueful admins doing bit-for-bit copies of filesystems with dd(1).
So whatever the imperfections of the POSIX file semantics, the fact is they are convenient and what is wanted most of the time, for the overwhelming majority of applications. Most editors, for example, actually write out a new copy of a file rather than face the complex integrity issues of updating in place.
No to Synchronous Writes!
Posted Dec 17, 2015 11:41 UTC (Thu) by epa (subscriber, #39769) [Link]
> horrible on SSD's because of write amplification effects, where the flash page size tends to be much greater than block size.
Thanks, I hadn't considered that issue. The operating system can't provide truly synchronous semantics without a high performance cost. I might suggest some kind of compromise, such as committing all writes when a file is closed, or before any filesystem metadata operation (so if you write a file then rename it, all writes are committed before the rename).
> So whatever the imperfections of the POSIX file semantics, the fact is they are convenient and what is wanted most of the time
Isn't the problem that there are no POSIX filesystem semantics in the area under discussion? POSIX says little or nothing about the semantics in the presence of disk crashes or system crashes. We need to agree on and codify some semantics of when a write operation becomes permanent and when it is 'volatile' and can be lost on a crash. Subject, of course, to some model of hardware failure which defines what kind of crashes we are considering (no semantics can be defined for when a hard disk is squashed by an asteroid impact).
No to Synchronous Writes!
Posted Dec 17, 2015 19:49 UTC (Thu) by Wol (guest, #4433) [Link]
If you think about it, you'll still have a problem - flushing a file on close will interfere with normal buffering - this flushing and barriers and all that have a cost.
Which is why user-space should have a mechanism for telling the kernel what is important. I know I'm thinking MultiValue, but I'm assuming a workload that (a) is probably thrashing memory, (b) is doing a lot of reads and writes, and (c) is actually providing decent response times to users (I've actually worked on a system that regularly achieved that).
The snag is, the more I think about it, the harder the problem gets :-) The *default* *has* to be just let the kernel get on with it - if the i/o load is low you'll pretty much get the semantics you want. The trouble is, if the system is under load that not only increases the risk of an accident, but it increases the consequences of an accident ... at which point the system guessing what is important not only increases the risk of an accident even further, but if it gets it wrong the consequences are worse still ...
This problem is hard ... :-)
Cheers,
Wol
No to Synchronous Writes!
Posted Dec 18, 2015 10:15 UTC (Fri) by epa (subscriber, #39769) [Link]
I thought about this some more and realized that what matters is the interaction between the userspace program and the outside world. You can provide synchronous semantics without having to do synchronous writes, if you make sure that the interactions with the outside world are serialized correctly.
So if a program does three write() calls and then runs along by itself for a few seconds, making no system calls in that time, it is not necessary to flush the writes to disk. Only when it makes some further system call, interacting with the outside world, do you need to make sure the writes are completed and committed before the next operation. (And if the next operation is only a read(), you might loosely decide that this does not communicate information from the program to the outside and so can return immediately, before the writes are committed to disk.)
Would this give good enough performance? I'm not sure.
No to Synchronous Writes!
Posted Dec 19, 2015 19:15 UTC (Sat) by butlerm (subscriber, #13312) [Link]
That works, but I think that guarantee is stronger than is necessary.
I believe the basic requirement for a write barrier is that if the system crashes, upon recovery writes and other operations made by any process subsequent to the write barrier have no effect unless the effects of all writes or other operations that are made prior to the write barrier are preserved, with respect to some pertinent subset of the files in a single filesystem - a subset as small as a single file.
No to Synchronous Writes!
Posted Dec 20, 2015 0:33 UTC (Sun) by nix (subscriber, #2304) [Link]
i.e. filesystem ops don't only have to be consistent with other filesystem ops, but with everything else too.
No to Synchronous Writes!
Posted Dec 20, 2015 4:23 UTC (Sun) by butlerm (subscriber, #13312) [Link]
I think it is safe to say that if you request a write barrier using a portable interface anytime in the near future you are going to get a synchronous operation of some sort, and you don't really want that to be more expensive than a series of pertinent fsyncs. And if it is not a portable interface, no one will use it, etc...
No to Synchronous Writes!
Posted Dec 21, 2015 9:56 UTC (Mon) by epa (subscriber, #39769) [Link]
> Which is why user-space should have a mechanism for telling the kernel what is important.
Yes, I agree. But one must also consider: when user space says nothing, and hasn't told the kernel anything about what is important, what should the kernel assume by default? I suggest that as far as possible the default should tend towards safer (if slower) behaviour. OK, making all writes synchronous is a step too far, but it should still be possible to write a simple program creating some files and moving them around, without having learnt all the details of write barriers and how to signal your intent to the kernel, and get some reasonable semantics even in the presence of system crashes. There are two indisputable facts here: systems do crash, and 90% of programmers will never learn the exact incantations and subtleties of asynchronous disk writes. Even the top 10% will struggle, since there isn't much test infrastructure that simulates crashes part way through disk operations, nor static checkers to verify that filesystem access is being done safely.
No to Synchronous Writes!
Posted Dec 25, 2015 17:36 UTC (Fri) by anton (subscriber, #25547) [Link]
Maybe what you are looking for is what I call in-order semantics. If a file system provides that, and the application does not lose consistency when the process is killed, the data will also be consistent (but not necessarily up-to-date) if the OS crashes. As nix points out, in a distributed system you occasionally also need up-to-date-ness; then you have to sync; but at least you don't have to sync for file consistency. And you can test your application by killing the process, which is quite a bit nicer than pushing the reset button or pulling the plug.
Concerning the block effects in flash, there are erase blocks (big, maybe 256k) and write blocks (smaller, but I can't find numbers at the moment). Ideal for a log-structured file system. So always syncing is not completely unrealistic, certainly not for low-bandwidth usage. Unfortunately, SSDs don't give us access to flash, but provide a HD-oriented interface, with the firmware optimizing maybe for FAT or NTFS access patterns. We have to see how much that hurts.
No to Synchronous Writes!
Posted Dec 19, 2015 9:14 UTC (Sat) by zev (subscriber, #88455) [Link]
> Most editors for example, actually write out a new copy of a file, rather than face the complex integrity issues of updating in place.
open(O_CREAT|O_TRUNC); write(); fsync(); close();
open(O_CREAT|O_TRUNC); write(); close(); /* no fsync()! */
open(O_CREAT|O_TRUNC); write(); close(); /* no fsync()! */
open(O_CREAT|O_TRUNC); write(); close(); /* no fsync()! */
No to Synchronous Writes!
Posted Dec 19, 2015 16:43 UTC (Sat) by jem (subscriber, #24231) [Link]
GNU Emacs writes a new file when saving the buffer. You can verify this by checking the inode number: you get a new number each time, and the original file is renamed with a tilde appended to the name.
No to Synchronous Writes!
Posted Dec 20, 2015 0:34 UTC (Sun) by nix (subscriber, #2304) [Link]
No to Synchronous Writes!
Posted Dec 20, 2015 11:10 UTC (Sun) by jem (subscriber, #24231) [Link]
Yes, of course it is configurable. We are, after all, talking about GNU Emacs here. But that's beside the point -- the point was that zev above called the behaviour an urban legend, when it is what Emacs does by default.
No to Synchronous Writes!
Posted Dec 20, 2015 11:21 UTC (Sun) by andresfreund (subscriber, #69562) [Link]
1$ emacs --version
GNU Emacs 24.5.1
1$ emacs -nw -Q /tmp/test.txt
<edit>
<save>
2$ strace -f -eopen,write,unlink,rename -p pid-of-above
1$
<edit>
<save>
2$
[pid 17835] open("/tmp/test.txt", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 7
[pid 17835] write(7, "line1\nline2\nline3\nline4\nline5", 29) = 29
[pid 17835] write(7, "\n", 1) = 1
[pid 17835] close(7) = 0
No to Synchronous Writes!
Posted Dec 20, 2015 11:55 UTC (Sun) by jem (subscriber, #24231) [Link]
Now I am not sure if we are talking about the same thing. In my terminology, opening a file with O_CREAT is creating a new file, thus to "write out a new copy of a file, rather than face the complex integrity issues of updating in place." The new file does have the old name, but it is a new file object, with a new inode number and new attributes.
No to Synchronous Writes!
Posted Dec 20, 2015 12:03 UTC (Sun) by andresfreund (subscriber, #69562) [Link]
No to Synchronous Writes!
Posted Dec 20, 2015 12:31 UTC (Sun) by jem (subscriber, #24231) [Link]
On my machine with Emacs 24.5.1:
[jem@red ~]$ ls -li foo.txt*
10488614 -rw------- 1 jem jem 1310 20 dec 14.17 foo.txt
[jem@red ~]$ emacs -nw -Q foo.txt
[jem@red ~]$ ls -li foo.txt*
10525938 -rw------- 1 jem jem 1313 20 dec 14.25 foo.txt
10488614 -rw------- 1 jem jem 1310 20 dec 14.17 foo.txt~
[jem@red ~]$
My interpretation of this listing is that, after editing, the file foo.txt is a new file with inode number 10525938, whereas the original file before editing has been renamed foo.txt~. The rename takes place before the call to open(..., O_CREAT)?
No to Synchronous Writes!
Posted Dec 21, 2015 11:44 UTC (Mon) by nijhof (subscriber, #4034) [Link]
Even if you throw away the backup file, with the editor still open, on re-saving emacs will keep overwriting without turning the last version into a new backup file. I learnt that the hard way :-).
emacs writing new files
Posted Dec 21, 2015 21:33 UTC (Mon) by robbe (subscriber, #16131) [Link]
Yes, it's the default. For most cases. But there are situations where emacs will not create a backup file, and will use simple overwrite.
Emacs will not create backups of files in /tmp (this was your gotcha), or if a recent backup already exists. It goes without saying that this default behaviour can be changed.
Here is a more typical example:
$ ls -lin lwn-test*
1836267 -rw-r--r-- 1 1000 1000 5 Dez 21 22:14 lwn-test
$ strace -f -eopen,write,unlink,rename -p 1532 2>&1 | sed -n "/SIGIO/b;s|$HOME|\$HOME|g;p"
[open "lwn-test", append two lines, then save]
open("$HOME/lwn-test", O_RDONLY|O_CLOEXEC) = 18
open("$HOME", O_RDONLY|O_DIRECTORY|O_CLOEXEC) = 18
rename("$HOME/lwn-test", "$HOME/lwn-test~") = 0
open("$HOME/lwn-test", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 18
write(18, "test\nbla\nblo\n", 13) = 13
unlink("$HOME/.#lwn-test") = 0
$ ls -lin lwn-test*
1836273 -rw-r--r-- 1 1000 1000 13 Dez 21 22:15 lwn-test
1836267 -rw-r--r-- 1 1000 1000 5 Dez 21 22:14 lwn-test~
No to Synchronous Writes!
Posted Dec 20, 2015 0:10 UTC (Sun) by andyc (subscriber, #1130) [Link]
No to Synchronous Writes!
Posted Dec 22, 2015 1:26 UTC (Tue) by mathstuf (subscriber, #69389) [Link]
$ vim foo.txt
^Z
$ cat .foo.txt.swp
3210#"! U
No to Synchronous Writes!
Posted Dec 21, 2015 17:32 UTC (Mon) by itvirta (subscriber, #49997) [Link]
I'm not exactly sure that interactive editors in general should be expected to give very strict integrity guarantees. They are used by humans, not other computers, and if the system crashes just after the user exits their editor, they're not likely to expect everything to be miraculously saved just as they meant it. It doesn't really matter, since a crash 10 seconds earlier, just before the user pressed 'save', would be fully expected to lose data.
(And given all the problems and quirks with fsync, I'm even more inclined to excuse their creators for this.)
> Update-via-rename is aesthetically appealing, but has practical problems with metadata preservation and less-than-graceful behavior with large files (slowness, potential for ENOSPC).
Many editors leave a backup file anyway, with the associated space cost. Joe (jmacs) and nano seem to write it by just reading and copying the original file, which I suppose has a higher IO cost than renaming Emacs-style.
The default Emacs on my Debian seems to happily rename a hard-linked file. Whether that's useful or a problem depends on what one wants. Hard links can be used as a cheap-ass file-level copy-on-write system. :)
No to Synchronous Writes!
Posted Dec 23, 2015 3:45 UTC (Wed) by nybble41 (subscriber, #55106) [Link]
It is expected that you may lose the most recent edits. However, if you opened up a large text file and made one minor edit before saving, you might hope that if the system happened to crash right in the middle of saving, you would end up with either the old version of the file or the new version, and not a zero-length file. The truncate/write/fsync sequence can trash all the data, not just the changes.
Luu: Files are hard
Posted Dec 15, 2015 21:22 UTC (Tue) by dfsmith (guest, #20302) [Link]
>... I make before the barrier will hit spinning rust before any request ...
You still use floppy disks? You have bigger problems than application crash resilience... ;-)
Iron oxide particulate hard drives were obsolete by 1990 (the IBM 3380 switched to thin film with the 10" platters!) There were some Conner stragglers in the 40MB capacity range, but the industry moved pretty quickly to film/sputtered/evaporated chrome-based surfaces.
Luu: Files are hard
Posted Dec 14, 2015 21:27 UTC (Mon) by needs (subscriber, #98089) [Link]
The same logic also applies to directories, although you will have to use links or symlinks to get something really atomic.
It may not work on strangely configured systems, for example if your files are spread over different devices over the network (or maybe with NFS). But in those cases you will be able to detect it if you catch the errors from rename() and co (and you should catch them, of course). So no silver bullet here, but still a good shot.
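One way to read "use links or symlinks" is the classic symlink flip: build the new directory tree next to the old one and atomically repoint a symlink at it, since the rename() of the symlink is the only step that needs to be atomic. A sketch, with made-up names and trimmed error handling:

#include <stdio.h>
#include <unistd.h>

/* Repoint the "current" symlink at a freshly built tree, data.v2.
 * rename() atomically replaces the old symlink, so readers see
 * either the old tree or the new one, never a mixture. */
int flip(void)
{
        if (symlink("data.v2", "current.tmp") < 0)
                return -1;
        if (rename("current.tmp", "current") < 0) {
                unlink("current.tmp");
                return -1;
        }
        /* an fsync() of the parent directory is still needed for
         * durability, as the following comments discuss */
        return 0;
}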
Luu: Files are hard
Posted Dec 14, 2015 21:35 UTC (Mon) by juliank (subscriber, #45896) [Link]
You at least forgot to fsync before the rename.
The data write may happen after the metadata write otherwise.
Luu: Files are hard
Posted Dec 14, 2015 22:00 UTC (Mon) by Sesse (subscriber, #53779) [Link]
As I understand it: You will need to fsync the file. Then, if you changed the file's size (say, by appending to it, or it was created anew), you will need to fsync the directory (which is basically impossible from some languages). Then, you can do your rename, and then finally, you must fsync the directory again to make that rename durable.
Of course, if you're on OS X or iOS, fsync() won't do since it's basically ignored, and if you're writing to a FUSE file system (say, /sdcard on Android), you might not even have fsync() and would need to rely on a global syncfs() or similar and hope the FUSE driver isn't being evil with you.
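Spelled out in C (setting aside the OS X and FUSE caveats), that dance looks something like the sketch below. It is only a sketch: error handling is trimmed, the file and directory names are placeholders, and as the thread discusses, which fsync() calls are strictly required varies by filesystem.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace dir/name with new contents via a temp file. */
int replace(const char *dir, const char *name, const void *buf, size_t len)
{
        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        int fd = openat(dfd, "name.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        write(fd, buf, len);
        fsync(fd);                            /* 1: the file's data and size */
        close(fd);
        fsync(dfd);                           /* 2: the temp file's directory entry */
        renameat(dfd, "name.tmp", dfd, name); /* the atomic replacement itself */
        fsync(dfd);                           /* 3: make the rename durable */
        close(dfd);
        return 0;
}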
Luu: Files are hard
Posted Dec 14, 2015 22:02 UTC (Mon) by needs (subscriber, #98089) [Link]
Luu: Files are hard
Posted Dec 14, 2015 22:18 UTC (Mon) by Zizzle (guest, #67739) [Link]
I took that to mean the fsync() may not be relied upon?
Luu: Files are hard
Posted Dec 14, 2015 22:25 UTC (Mon) by Wol (guest, #4433) [Link]
And with a database, that can be pretty much impossible - I'd hate to use "copy, rename" on a *bunch* of files, all in the multi-gig range (that's assuming I've even got enough spare disk space to copy the files!).
Cheers,
Wol
Luu: Files are hard
Posted Dec 14, 2015 22:28 UTC (Mon) by juliank (subscriber, #45896) [Link]
Luu: Files are hard
Posted Dec 14, 2015 22:52 UTC (Mon) by Wol (guest, #4433) [Link]
Cheers,
Wol
Luu: Files are hard
Posted Dec 14, 2015 22:13 UTC (Mon) by jameslivingston (guest, #57330) [Link]
Consider two users foo and bar who are both members of the group staff. There is a file with ownership foo:staff, with the group-write permission set. How does bar write to the file without altering the ownership? Writing a temporary file and moving it over the top will change the file ownership, and bar can't change the owner of the temporary back to foo.
exchangedata (an OS X call) swaps the data of two files, keeping all metadata as-is (except mtime).
Luu: Files are hard
Posted Dec 15, 2015 1:06 UTC (Tue) by needs (subscriber, #98089) [Link]
Luu: Files are hard
Posted Dec 15, 2015 1:50 UTC (Tue) by eternaleye (subscriber, #67051) [Link]
"Sadly more complex than it seemed in the first place // A realist's guide to engineering"
Luu: Files are hard
Posted Dec 17, 2015 17:00 UTC (Thu) by butlerm (subscriber, #13312) [Link]
Luu: Files are hard
Posted Dec 14, 2015 23:21 UTC (Mon) by marduk (subscriber, #3831) [Link]
(hope entire this comment makes it to di
Luu: Files are hard
Posted Dec 15, 2015 1:31 UTC (Tue) by scientes (guest, #83068) [Link]
Crash Consistency
Posted Dec 15, 2015 6:35 UTC (Tue) by madthanu (guest, #102468) [Link]
The "Try It Yourself!" section in the CACM article is hopefully interesting :)
Luu: Files are hard
Posted Dec 15, 2015 8:32 UTC (Tue) by philipstorry (subscriber, #45926) [Link]
A lovely acknowledgement there that LWN gets some of the best in-depth tech coverage there is. But isn't it scary that if LWN died, a source of documentation would be dying?
I'm not knocking LWN here, of course. I'm simply agreeing with the point that a lot of the documentation around these fundamental issues is lacking. I sometimes wish that I'd picked doing community documentation as one of my personal projects, but it's too late for that now. (And I know very little C, which probably wouldn't help.)
Luu: Files are hard
Posted Dec 15, 2015 16:33 UTC (Tue) by mathstuf (subscriber, #69389) [Link]
> I sometimes wish that I'd picked doing community documentation as one of my personal projects, but it's too late for that now.
Why is it too late?
Luu: Files are hard
Posted Dec 15, 2015 9:30 UTC (Tue) by michaeljt (subscriber, #39183) [Link]
Luu: Files are hard
Posted Dec 15, 2015 12:11 UTC (Tue) by Wol (guest, #4433) [Link]
User space needs to be able to rely on certain guarantees about the behaviour of backing store and, to the best of my knowledge, Posix provides NONE of that. Posix defines correct behaviour. A system crash takes you into "undefined" territory :-(
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 14:44 UTC (Tue) by welinder (guest, #4699) [Link]
All disks in use today are, essentially, NAS. That is, they are independent computers at the end of some kind of network connection. That's true even for disks physically sitting inside your machine. The network might not be an ethernet connection in that case, but SATA, USB, SCSI, or whatever. It's all the same.
You may think you are writing data in a nice, ordered manner. You may even have gotten the kernel to send the data towards the disk in a well-defined order using the right combination of fsync and offerings at the full moon. Maybe the disk has even been told not to do any reordering whatsoever.
It doesn't matter.
The disk, for reasons of its own, will then make its own decisions about what to do: write this block now and send an acknowledgment? Send the acknowledgment right away and write later? Write this block before that block because the latter cannot be written now? Delay writing this block because it is likely to be rewritten shortly?
There is financial pressure on disk manufacturers to make the decisions that will cause the drive to appear fast by cutting each and every corner imaginable.
Until and unless the power goes out suddenly, you will never know the difference. And then the blame usually doesn't stick to the drive manufacturer.
M.
Luu: Files are hard
Posted Dec 15, 2015 16:58 UTC (Tue) by butlerm (subscriber, #13312) [Link]
Write threads with independent barriers would be even better, but there do not appear to be any storage devices out there that support that sort of thing yet.
Luu: Files are hard
Posted Dec 15, 2015 19:51 UTC (Tue) by Wol (guest, #4433) [Link]
The bad thing is that, at present, I can't even guarantee that the data gets *to* the drive in the required order.
Cheers,
Wol
Luu: Files are hard
Posted Dec 17, 2015 10:45 UTC (Thu) by renox (subscriber, #23785) [Link]
You find buggy hardware by testing and maintaining a database of good&buggy hardware.
But having good HW isn't good enough: you also need good APIs (and good implementations of those APIs).
Luu: Files are hard
Posted Dec 15, 2015 18:11 UTC (Tue) by sprin (guest, #101377) [Link]
I would imagine the API would be similar to SQL "START TRANSACTION" and "COMMIT", except that an explicit transaction descriptor is required for each operation. Something like:
int txn = start_fs_txn();
txn_write(txn, ...);
txn_pwrite(txn, ...);
...
commit_fs_txn(txn);
`start_fs_txn()` could take optional flags to loosen the transaction requirements for performance.
Apologies if this is naive.
Luu: Files are hard
Posted Dec 15, 2015 18:55 UTC (Tue) by fandingo (subscriber, #67019) [Link]
Surely the kernel can allocate and distinguish between FDs and TXNs in a way that doesn't step on each other, meaning that given a descriptor, the kernel knows whether it's a FD or a TXN. Consequently, you don't need special txn_ syscalls; just use the regular ones, with different internal implementations depending on whether it's a file or transaction descriptor.
Luu: Files are hard
Posted Dec 15, 2015 20:02 UTC (Tue) by sprin (guest, #101377) [Link]
"Transaction" might be a bad choice of word since it might be
construed to be an ACID transaction, but I don't think it is necessary
to guarantee atomicity or isolation. As used by jgg in the first
comment, "barrier", implying consistency and durability, is probably a
better choice of word.
I can imagine a possible implementation where each txn_OP call would
immediately make the corresponding syscall to start the write, but
register a callback to do the appropriate steps in the "get all the bits
on stable storage" dance once the transaction is committed. Upon commit,
the number of calls to correctly sync would be reduced to the minimum.
Luu: Files are hard
Posted Dec 15, 2015 20:04 UTC (Tue) by Wol (guest, #4433) [Link]
So even if my logs and stuff are on different filesystems, possibly across a network, I can write my logfile, say "wait until that's flushed to storage", carry on and update the database itself, say "wait until that's flushed to disk", and then clear the log.
For the sake of a couple of ordering guarantees given to the app, the app suddenly becomes MUCH simpler, because it no longer has to worry about whether the data has really been saved (okay, as pointed out elsewhere, hardware can lie, but every level we can push the problem further down the stack, the less likely it is to bite).
Cheers,
Wol
Luu: Files are hard
Posted Dec 15, 2015 20:06 UTC (Tue) by HIGHGuY (subscriber, #62277) [Link]
Downside: fragmentation, but with SSDs this shouldn't be a big concern.
Luu: Files are hard
Posted Dec 15, 2015 20:59 UTC (Tue) by fandingo (subscriber, #67019) [Link]
This seems like the best solution. Plus, it allows the kernel to use COW, reflink semantics on compatible file systems. I like the idea about locking regions, and if we could allow the kernel to start an early copy of the source file (i.e. before commit_txn, although commit_txn would still block until completed), performance on non-IO bound applications should be minimally impacted.
Luu: Files are hard
Posted Dec 17, 2015 14:53 UTC (Thu) by ksandstr (subscriber, #60862) [Link]
In practice, how does the transaction bracket interact with e.g. writable memory-mapped data? What about data that's being written by a concurrent transaction? ACID databases require applications to be capable of restarting any transaction to guarantee overall completion in the face of non-serializable operations (e.g. write-to-read dependency); would it be at all appropriate to require filesystem clients to adhere to such requirements? (consider a shell command line that involves reading from a number of files, passing data through a couple of pipes, and finally making an output.)
In my opinion, filesystem-level transactions won't happen because it's fundamentally intractable within traditional POSIX. A speculative operating system that encapsulated all inter-process interactions and wrapped them in a distributed transaction protocol might allow for this, but the relation between files and memory in Unix-like systems would make it equivalent to a combination of distributed shared memory and transactional semantics -- a wet nightmare squared. Moreover, side effects such as user interaction would cause implicit commits, removing the option of rolling everything back from the application layer and turning transactional operation into an unpredictable circus hoop. That's to say: such transactions aren't practically tractable outside of POSIX, either.
Until POSIX changes to allow for some kind of per-process dependency ordering (and I don't see why it wouldn't), programs that're robust in the face of crashes will either be specific to a family of implementations (while deceptively supporting any old "POSIX" platform), or be designed to recover from any permutation of observably durable operations (a quickly-growing millstone around userspace's neck). The other option is undefined, irrecoverable outcomes as programs become more complex in their I/O, which is terrible.
So here's to hoping that the POSIX organ comes up with a reasonable atomicity-durability model before things start going horribly wrong with (say) increasing btrfs adoption. For now, filesystem access is for hardarses and the sanguine only.
Luu: Files are hard
Posted Dec 17, 2015 16:12 UTC (Thu) by butlerm (subscriber, #13312) [Link]
Keep per process or per thread list of all file / directory fds with potentially uncommitted changes. When a write barrier is requested, call fsync or fdatasync on all of them and clear the list. If the list gets too large, retire operations from the head of the list as necessary.
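As a sketch of that fallback (the names are invented; real code would also track directory fds and check for errors):

#include <unistd.h>

#define MAX_PENDING 64

static int pending[MAX_PENDING];
static int npending;

/* The barrier: flush everything written since the last one, oldest first. */
void write_barrier(void)
{
        for (int i = 0; i < npending; i++)
                fdatasync(pending[i]);
        npending = 0;
}

/* Remember that fd has potentially uncommitted changes; if the list
 * fills up, retire the whole backlog early rather than lose track. */
void note_dirty(int fd)
{
        if (npending == MAX_PENDING)
                write_barrier();
        pending[npending++] = fd;
}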
Luu: Files are hard
Posted Dec 17, 2015 16:44 UTC (Thu) by nybble41 (subscriber, #55106) [Link]
That seems a bit heavier than a standard write barrier, because like fsync() it blocks the thread until after all the data has been written out. Normally a barrier only impacts the order of I/O operations; it doesn't prevent computation or queuing of additional I/O requests. To implement a write barrier without blocking the thread I think you would need to queue any filesystem operations which occur after the barrier, which would complicate the implementation. At that point you effectively need something like AIO.
Luu: Files are hard
Posted Dec 17, 2015 17:47 UTC (Thu) by butlerm (subscriber, #13312) [Link]
Yes, that is a fallback implementation for where you lack the ability to do asynchronous I/O on ordinary files and directories. If you write your application to the appropriate interface, the implementation can then be improved to issue write barriers without synchronous waits.
A next level implementation might create a number of threads to issue those operations in the background, for example, perhaps one per independent filesystem at issue.
A kernel level implementation really needs the ability (at least internally) to do asynchronous operations on ordinary files and directories. Implementing cross filesystem write barriers any other way is rather tricky. If you are only dealing within the same filesystem - which is true in many cases - the file system may be able to provide ordering guarantees without asynchronous I/O.
Speaking of AIO, an internal interface could use the ability to upgrade an outstanding operation from 'complete this at your leisure' to '(we are waiting on this now) complete this asap', the ability to impose a strict ordering on dependent reads and writes, and of course the ability to do an asynchronous equivalent of sync_file_range(2). That sort of thing could usefully go as far as a specific disk drive. How else is your SSD or whatever going to know which requests need to be committed *right* now?
Luu: Files are hard
Posted Dec 18, 2015 17:02 UTC (Fri) by meyert (subscriber, #32097) [Link]
Luu: Files are hard
Posted Dec 17, 2015 9:17 UTC (Thu) by butlerm (subscriber, #13312) [Link]
There are three major problems here:
The first problem is that these operations are slow on many important filesystems, and the trend is not positive in many cases - for copy-on-write filesystems in particular.
The second problem is that you generally have to issue a series of synchronous waits, each of which can take two or three seconds to complete, if not more, rather than doing them all at once.
The third problem is that you often have no idea when or if operations associated with library calls or child processes commit to stable storage.
A much better interface would provide something like this:
(a) A function that creates a synchronization descriptor (sfd).
(b) Versions of write(2), rename(2), etc. that accept a sfd as a secondary parameter.
(c) A function to create a barrier on a sfd, returning a sequentially allocated barrier id.
(d) A function to initiate write out on all the operations associated with a sfd.
(e) A function to synchronously wait for all operations associated with the sfd to commit.
(f) Support for using select(2) or something like it on an sfd to be notified when the operations associated with the sfd have been committed up through a previously specified barrier on an edge triggered basis.
(g) A function to identify the most recent barrier that has been committed on an sfd.
(h) Functions to get and set the (implied) sfd for the current thread.
(i) Have write(2), rename(2) use the current sfd.
(j) Inherit the current sfd across a fork/exec combination if flag set when sfd is created.
(k) On process creation, if there is no current sfd, create one.
In many cases you probably really care whether the operations associated with a sub process have committed or not, and it is currently almost impossible to do that without heavily customizing the code for the child process and/or its child processes. An inheritable current sfd would allow you to easily associate those operations with the current process and wait for all of them to be done together. That solves problem three.
The current sfd allows all operations to be implicitly associated and reordered between barriers and waited on as a group. That solves problem two.
The first problem is mitigated as much as possible, but ultimately it is up to the filesystem to be able to sync quickly, by journalling small operations for example rather than waiting for the next directory tree transaction to complete.
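To make the shape of the proposal concrete, usage might look like the following. None of these calls exist anywhere; the names and signatures are invented purely to illustrate points (a) through (e):

/* journal-then-data ordering on one synchronization descriptor */
int sfd = sfd_create(0);                   /* (a) create the descriptor */

sfd_write(sfd, logfd, rec, reclen);        /* (b) journal record first */
int b = sfd_barrier(sfd);                  /* (c) nothing below may pass this */
sfd_write(sfd, datafd, page, pagelen);     /* (b) then the database proper */

sfd_initiate(sfd);                         /* (d) kick off write-out */
sfd_wait(sfd);                             /* (e) block until all of it is durable */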
Luu: Files are hard
Posted Dec 18, 2015 0:28 UTC (Fri) by Wol (guest, #4433) [Link]
Your point that you can have multiple synchronous waits is one of the things I was thinking of ... if the system is a database server and pretty much every user process is repeatedly doing this, then performance is going to be hammered unnecessarily ...
If we want a very simple API, I've come up with two simple ideas since getting involved in this discussion. One call, let's call it "tidemark", that simply guarantees that all filesystem operations made before tidemark is called have completed. It doesn't interfere with optimisation and re-ordering, it simply requires that any and all optimisations that cross the tidemark only ever pull writes forward - you can NOT optimise by pushing a write backwards across a tidemark. And it can be synchronous (doesn't return until the tidemark has been flushed) or asynchronous (sets a flag for the caller to check when the tidemark has been flushed).
The second call, let's call it "snapshot". This is the barrier Neil doesn't like - it forces all i/o before it to complete before any i/o after it is allowed to start. In other words, the runtime state at the point the call is made is guaranteed to be an exact match of the permanent state when the call completes. And again it can be synchronous (blocking) or asynchronous (sets a user-space flag). And yes, this could cause a serious performance hit, but imho it's not the kernel's place to dictate to user space whether or not this is acceptable. If user space wants this, then it's the kernel's job to deliver.
User-space can then use these calls to check whether data has actually made it to backing store, and either use synchronous calls to confirm to the user their data is safe, or use asynchronous calls and accept the fact they might suddenly discover after the fact that something has gone wrong and they need to clean up. Either way, they can reason simply and effectively about the real state of the system in the face of a serious error.
I must admit, though, that your version sounds very nice. Depending on how easy it is to track specific i/o operations though, I'd alter slightly, as follows ...
Your "create sfd" function I would rename as "start transaction".
This would then cause the sfd to capture every and all subsequent i/o operations.
Your functions (d) and (e) I would rename as "end transaction", with the choice of doing it synchronously or asynchronously. And the choice also of preventing any subsequent transaction starting until the previous one had finished.
This actually is better than my idea above, primarily because if "end transaction" fails, then I know there is a problem with my transaction. With my idea, if it fails, I have no idea whether it was my writes that failed or someone else's, so I have to do a possibly useless check to make sure my transaction succeeded.
Cheers,
Wol
Luu: Files are hard
Posted Dec 18, 2015 19:32 UTC (Fri) by butlerm (subscriber, #13312) [Link]
I wouldn't call this transaction anything because that implies ACID semantics, which would be much more difficult to implement in a kernel. This is just write barriers with the ability to get completion notification.
The suggested API is more extensive than one might expect in order to allow a single process to handle multiple "write threads" without itself being multi-threaded, or having as many threads as there are application level transactions in process. Each sfd is its own write thread. If you don't want to use sfds explicitly, every thread would have an implicit one. Not necessarily an independent one, but an implicit one.
That means that if all you want is a write barrier you would make one call (or maybe two) and that would be it. Nothing else required.
The real work would be implementing a write barrier internally (i.e. in the kernel) as more than a series of synchronous fsync or sync_file_range operations. There already is the ability to initiate write out (at least for portions of files) without blocking, so basically what you want to do internally is initiate write out for everything on the write thread that hasn't already completed.
Then when the next write operation on the same "write thread" comes along, if there is an outstanding write barrier, you want to wait on write out for all those operations to complete before proceeding. As long as POSIX semantics are in force, you must wait for write out to complete before proceeding with another write operation on the same write thread, or the write barrier is meaningless.
Write threads are important here because there are writes you do not want to block or wait for commit of in any circumstances. That is why you need an API so that different writes can go on different "write threads" or sfds. The implicit sfd makes things simple for standard write(2) calls, and applications and libraries that just do the normal thing, but an explicit write as part of this "write thread" interface is necessary in the general case.
Implementation-wise, there is an issue with how many uncommitted operations you can track so you can commit them when necessary. I believe that is the biggest issue, especially with implicit write threads - you could easily have thousands of operations on a write thread that haven't been committed to durable storage yet. At some point you have to start combining them or retiring them early, or you may have to block synchronously in unexpected places, just to free up space in your uncommitted operation buffer.
One thing that would help there is a filesystem level transaction group number for metadata and certain other operations. Then an internal operation can return the TXG number that the operation is expected to be committed in, and everything in the buffer for the same filesystem with a lower TXG number than the current committed one for that FS can be efficiently discarded - no need to make an inquiry or be notified about the completion of each one. You could just group them together - at most one entry per write thread per filesystem TXG. That is easily a thousand fold reduction in some cases. Then a mere handful of entries - something much more practical to keep in kernel - would suffice most of the time. You would only need more for writes of the sort the FS doesn't plan to commit in the next TXG.
Luu: Files are hard
Posted Dec 19, 2015 7:12 UTC (Sat) by butlerm (subscriber, #13312) [Link]
Luu: Files are hard
Posted Dec 28, 2015 16:12 UTC (Mon) by throwaway (guest, #106019) [Link]