Luu: Files are hard
Posted Dec 15, 2015 13:22 UTC (Tue) by andresfreund (subscriber, #69562)
In reply to: Luu: Files are hard by alonz
Parent article: Luu: Files are hard
You "just" have to do the whole preallocate_file(tmp);fsync(tmp);fsync(dir);rename(tmp, normal);fsync(normal);fsync(dir); dance to create the files for journaling. And then protect individual journal writes with fdatasync() (falling back to fsync if not present). If you get ENOSYS for any of those, you give up.
Posted Dec 15, 2015 13:27 UTC (Tue)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
Posted Dec 15, 2015 18:22 UTC (Tue)
by jgg (subscriber, #55211)
[Link]
The application actually has to go and zero fill the file directly before O_DIRECT and AIO begin to work properly.
Something about fallocate just reserving physical space: it might not be zeroed, so the FS has a special slow path for the first write to any block, since that write is also accompanied by a metadata write to mark the block as written. Mumble mumble. I never dug into it enough to be sure.
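Something like this, roughly (a minimal sketch, assuming the file was already created and preallocated to its final size; the chunk size is arbitrary, error handling is abbreviated, and whether this actually avoids the slow path depends on the filesystem):

/* Overwrite a freshly (f)allocated file with real zeros so that later
 * O_DIRECT/AIO writes land on already-written blocks rather than the
 * unwritten-extent slow path. */
#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>

int zero_fill(int fd, off_t size)
{
    const size_t chunk = 1 << 20;       /* 1 MiB per write */
    char *buf = calloc(1, chunk);       /* zeroed buffer */
    if (!buf)
        return -1;

    for (off_t off = 0; off < size; ) {
        size_t n = (size - off < (off_t)chunk) ? (size_t)(size - off) : chunk;
        ssize_t w = pwrite(fd, buf, n, off);
        if (w <= 0) {
            free(buf);
            return -1;
        }
        off += w;
    }
    free(buf);
    return fsync(fd);                   /* make the zero fill durable */
}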
Posted Dec 15, 2015 13:55 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (6 responses)
Oh - and what happens if we get something like the ext2/ext3 transition, where the behaviour that one filesystem REQUIRED was actually TOXIC to the other? So your recommended behaviour is, unfortunately, filesystem-dependent :-( Along comes a new filesystem, with new behaviour, and your behaviour no longer works :-( In fact, it could kill the system ...
And yep, I would expect to have logs and all that, but what I'm trying to do is avoid as much of that crud as possible. If my database is copy-on-write, and I can guarantee synchronicity where I need it (hence the write barrier), I can be pretty confident that the chances of a failure in a critical section are very slim. Error recovery code rarely gets tested ... :-)
Don't forget, I'm a database programmer by trade. And I don't "do relational" - my database DESIGN gives me a lot of integrity that a SQL programmer has to implement by hand. I get my integrity "by default" to a large degree. And I want the OS to give me the same guarantees. Yes, I know some things can't be guaranteed. But something as simple as guaranteeing that disk writes will be processed in a fixed order?
If the OS can guarantee me, that it will pass writes to the hardware in the order that I ask, and not corrupt my code by optimising the hell out of it where it shouldn't, then it makes my life much simpler. (And as I said elsewhere, I know Physics makes life a pain, some degree of optimisation is a necessity.)
Cheers,
Wol
Posted Dec 15, 2015 14:40 UTC (Tue)
by corbet (editor, #1)
[Link] (5 responses)
So if you really want a new interface from the kernel, could you describe that interface? What would the system call(s) be, with which semantics? With that in place, one could argue about whether it's best done in the kernel or in user space.
Posted Dec 15, 2015 18:23 UTC (Tue)
by HenrikH (subscriber, #31152)
[Link]
Posted Dec 15, 2015 18:32 UTC (Tue)
by fandingo (guest, #67019)
[Link] (3 responses)
I think the solution is pretty simple (at least in concept) and steals the database transaction pattern:
begin_transaction(fd, flag)
(critical IO operations)
commit_transaction(fd)
While the names are the same as DB transactions, the feature wouldn't necessarily have any sort of rollback functionality, although that would be cool. Additionally, "commit" means more of a "I'm done with the transaction, go back to normal" rather than an "only now put all the data on disk." Begin_transaction() does two things: 1) IO reordering is disabled through all storage layers for blocks associated with that FD.* 2) Every IO syscall in the transaction for that FD automatically has fsync() called. Commit_transaction() just turns it off -- the data have been hitting the disk as normal, though without reordering, the whole time.
Through a feature flag argument on begin, you could allow tuning how each subfeature works.
The single biggest issue, in my opinion, from the article and comments is the unpredictable ordering of IO operations, and the only solution is to utilize synchronization methods (O_SYNC, fsync, custom in user space with O_DIRECT/AIO, etc.) that have serious downsides. You know, sometimes I want the Atomicity and Isolation in ACID but don't care so much about Durability. The current kernel features force you to do both -- hurting performance -- and even then, you'll probably still have subtle bugs.
* Much easier said than done.
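Purely as an illustration -- nothing like this exists in the kernel today, and the flag names are invented -- the interface might be declared as:

/* Hypothetical prototypes for the interface sketched above. */
#define TX_NO_REORDER  0x1   /* forbid IO reordering for this fd's blocks  */
#define TX_SYNC_EACH   0x2   /* implicit fsync() after every IO syscall    */

int begin_transaction(int fd, unsigned int flags);
int commit_transaction(int fd);  /* "done, back to normal", not "flush now" */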
> With that in place, one could argue about whether it's best done in the kernel or in user space.
At least in my idea, the principal concern is allowing the user to definitely prevent IO reordering on specified FDs. That's impossible to do in user space -- at least without going down the undesired existing-synchronization paths.
Or, hell, perhaps just go full-bore:
begin_acid(fd, flag)
(Critical region)
end_acid(fd)
where flag is simply a 4-bit flag corresponding to enabling each property in the acronym. (Again, no rollback necessarily, so atomicity is only at the individual syscall level within that block, and there is no larger idea of an atomic transaction. Of course, modern COW file systems could simply reflink at begin_acid and do some FD magic on error if they want to support developer-friendly features.)
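Again purely hypothetical (invented names, no such syscalls), the per-property bits might look like:

/* One bit per ACID property; no rollback implied. */
#define ACID_ATOMIC      0x1   /* per-syscall atomicity within the region  */
#define ACID_CONSISTENT  0x2
#define ACID_ISOLATED    0x4
#define ACID_DURABLE     0x8

int begin_acid(int fd, unsigned int flags);
int end_acid(int fd);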
Posted Dec 15, 2015 19:33 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (1 responses)
Let's think about a typical database transaction. Write to log, write to database, clear log. We have *at* *least* two files involved - the log file and the database file(s), and we need clear air between each of the three steps. The log needs to be flushed before the first write gets to the database, and the database needs to be flushed before the log gets cleared, and certainly in my setup the database can be plastered across a lot of files.
But if I can say to the kernel "I don't care what optimisation you do AS LONG AS you guarantee those three operations are synchronous", then I can take responsibility for recovering from errors, without having to have that massive layering violation with loads of filesystem-specific code to make sure the stuff I need is safely on disk.
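For comparison, a minimal sketch of those three steps with today's primitives (one data file shown where a real database touches many; short writes and error details are glossed over, and every fdatasync() is there only to create the ordering barrier between steps):

#include <sys/types.h>
#include <unistd.h>

int apply_transaction(int log_fd, int db_fd,
                      const void *log_rec, size_t log_len, off_t log_off,
                      const void *page, size_t page_len, off_t page_off)
{
    if (pwrite(log_fd, log_rec, log_len, log_off) < 0 || fdatasync(log_fd))
        return -1;               /* 1: log durable before the data write */

    if (pwrite(db_fd, page, page_len, page_off) < 0 || fdatasync(db_fd))
        return -1;               /* 2: data durable before clearing the log */

    static const char cleared = 0;   /* stand-in for "clear the log entry" */
    if (pwrite(log_fd, &cleared, 1, log_off) < 0 || fdatasync(log_fd))
        return -1;               /* 3: log cleared */
    return 0;
}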
Cheers,
Wol
Posted Dec 15, 2015 20:51 UTC (Tue)
by fandingo (guest, #67019)
[Link]
Nonetheless, I think my suggestion could work in the use case you describe.
> We have *at* *least* two files involved - the log file and the database file(s), and we need clear air between each of the three steps.
Okay, so you'd do:
begin_transaction(log_fd, ...)
(Log IO operations)
commit_transaction(log_fd)
begin_transaction(data_fd, ...)
(Data IO operations)
commit_transaction(data_fd)
I suppose this was implied, but I'll state it explicitly: Commit_transaction() has an fsync, creating the necessary durability between your two phases. Additionally, it would be possible to nest transactions. Remember, there's an (optional through the flag) fsync() after every IO call on that FD inside the transaction, so if I pwrite(log_fd, ...), that call is fully written* once it returns. Therefore, pwrite(log_fd, ...) followed by pwrite(data_fd, ...) would be well-ordered.
> But if I can say to the kernel "I don't care what optimisation you do AS LONG AS you guarantee those three operations are synchronous"
Since the title of the article is "Files Are Hard," it's worth pointing this out: I primarily care about well-ordering, which is not the same as synchronous IO, although we tend to ensure well-ordering by using synchronization methods at tremendous performance impact.
* Well, as best as the OS can ensure. Storage controllers can have their own whims.
> Except that doing it at the FD level isn't really enough. If transactions carry the process-id, you could probably do it at that level.
I considered this, but a process doesn't universally need this feature. Forcing a developer to use multiprocess programming just to overcome a synthetic limitation isn't friendly. It seems more likely that a library author would define tx_open and tx_close that do open(2)/close(2) and set up a transaction, if this were desired universally within a program. If, however, we did want to make it per-process, we'd probably be happier with limiting it to a specific thread rather than an entire process.
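Roughly like this, assuming the hypothetical begin_transaction()/commit_transaction() calls from above (tx_open/tx_close are illustrative names, not a real library API):

#include <fcntl.h>
#include <unistd.h>

int tx_open(const char *path, int oflags, unsigned int tx_flags)
{
    int fd = open(path, oflags, 0666);
    if (fd >= 0 && begin_transaction(fd, tx_flags) != 0) {
        close(fd);                 /* couldn't start the transaction */
        return -1;
    }
    return fd;
}

int tx_close(int fd)
{
    int ret = commit_transaction(fd);   /* end the transaction first */
    if (close(fd) != 0 || ret != 0)
        return -1;
    return 0;
}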
Posted Dec 15, 2015 19:36 UTC (Tue)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Dec 17, 2015 14:35 UTC (Thu)
by Wol (subscriber, #4433)
[Link] (1 responses)
Sorry to rain on your parade, but you've just given me a solution to make sure my JOURNAL gets safely to disk. I want a solution to make sure my DATA (whatever that is) gets safely to disk. And if I happen to be writing to a pre-existent database file that's several gigs in size, this particular solution is no help :-(
I'm after a generic solution - I write something, I want confirmation it's safe on disk. Any solution that makes assumptions about that "something" is wrong/not good enough. And I'd much rather it was simple, not an error-prone song-and-dance.
Cheers,
Wol
Posted Dec 17, 2015 20:39 UTC (Thu)
by bronson (subscriber, #4806)
[Link]
The person you're replying to knows a wee bit about databases too...:)