Luu: Files are hard

Posted Dec 15, 2015 13:22 UTC (Tue) by andresfreund (subscriber, #69562)
In reply to: Luu: Files are hard by alonz
Parent article: Luu: Files are hard

I think using O_DIRECT helps here. Neither does it guarantee queues are flushed, nor that volatile on-device caches are flushed, nor does it commit the filesystem's metadata transactions.

You "just" have to do the whole preallocate_file(tmp);fsync(tmp);fsync(dir);rename(tmp, normal);fsync(normal);fsync(dir); dance to create the files for journaling. And then protect individual journal writes with fdatasync() (falling back to fsync if not present). If you get ENOSYS for any of those, you give up.
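For readers who want to see the dance spelled out, here is a minimal sketch in Python, assuming Linux and the standard os module. The helper names (fsync_dir, create_journal_file, journal_write) are made up for illustration, and errors are simply propagated, which is the "give up" case.

```python
import os

def fsync_dir(path):
    """fsync a directory so entries created or renamed in it are durable."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def create_journal_file(directory, name, size):
    """Create directory/name, preallocated to size bytes, via tmp + rename."""
    tmp_path = os.path.join(directory, name + ".tmp")
    final_path = os.path.join(directory, name)
    fd = os.open(tmp_path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)   # preallocate_file(tmp)
        os.fsync(fd)                      # fsync(tmp)
        fsync_dir(directory)              # fsync(dir)
        os.rename(tmp_path, final_path)   # rename(tmp, normal)
        os.fsync(fd)                      # fsync(normal): same inode, new name
        fsync_dir(directory)              # fsync(dir)
    except OSError:
        os.close(fd)                      # any failure: give up
        raise
    return fd

def journal_write(fd, data, offset):
    """Write data at offset, then fdatasync (or fsync if fdatasync is absent)."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written
    sync = getattr(os, "fdatasync", os.fsync)
    sync(fd)
```

Whether this is sufficient on a given filesystem or storage stack is exactly what the thread is arguing about; the sketch only mirrors the sequence of calls listed above.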

Luu: Files are hard

Posted Dec 15, 2015 13:27 UTC (Tue) by andresfreund (subscriber, #69562)

Gah. Obviously I meant that I *don't* think O_DIRECT helps.

Luu: Files are hard

Posted Dec 15, 2015 18:22 UTC (Tue) by jgg (subscriber, #55211)

Last time I checked, O_DIRECT and AIO were broken on many file systems if all you do is create(), ftruncate(), fallocate() - i.e. AIO becomes blocking :(

The application actually has to go and zero fill the file directly before O_DIRECT and AIO begin to work properly.

Something about fallocate just reserving physical space that might not be zeroed, so the FS has a special slow path for the first write to any block, because that write is also accompanied by a metadata write to mark the block as written. mumble mumble. I never dug into it enough to be sure.
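A minimal sketch of that workaround in Python, assuming Linux. The function name preallocate_and_zero and the 1 MiB chunk size are illustrative choices, and whether the zero fill is still needed depends on the filesystem and kernel version, which this comment does not pin down.

```python
import os

def preallocate_and_zero(path, size, chunk=1 << 20):
    """Reserve size bytes, then overwrite them with zeros and fsync.

    After this, every block has been written once, so later writes
    should not need the "first write to this block" metadata update.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)       # reserve the space
        zeros = bytes(chunk)
        offset = 0
        while offset < size:
            step = min(chunk, size - offset)
            offset += os.pwrite(fd, zeros[:step], offset)
        os.fsync(fd)                          # flush data and metadata
    except OSError:
        os.close(fd)
        raise
    return fd
```

The file would then be reopened with O_DIRECT (or handed to AIO) by the caller; that part is left out because its alignment requirements are a separate topic.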

Luu: Files are hard

Posted Dec 15, 2015 13:55 UTC (Tue) by Wol (subscriber, #4433)

Which is a horrendous dance :-(

Oh - and what happens if we get something like the ext2/ext3 transition, where the behaviour that one filesystem REQUIRED was actually TOXIC to the other? So your recommended behaviour is, unfortunately, filesystem-dependent :-( Along comes a new filesystem, with new behaviour, and your behaviour no longer works :-( In fact, it could kill the system ...

And yep, I would expect to have logs and all that, but what I'm trying to do is avoid as much of that crud as possible. If my database is copy-on-write, and I can guarantee synchronicity where I need it (hence the write barrier), I can be pretty confident that the chances of a failure in a critical section are very slim. Error recovery code rarely gets tested ... :-)

Don't forget, I'm a database programmer by trade. And I don't "do relational" - my database DESIGN gives me a lot of integrity that a SQL programmer has to implement by hand. I get my integrity "by default" to a large degree. And I want the OS to give me the same guarantees. Yes, I know some things can't be guaranteed. But something as simple as guaranteeing that disk writes will be processed in a fixed order?

If the OS can guarantee me that it will pass writes to the hardware in the order that I ask, and not corrupt my code by optimising the hell out of it where it shouldn't, then it makes my life much simpler. (And as I said elsewhere, I know Physics makes life a pain, some degree of optimisation is a necessity.)

Cheers,
Wol

Luu: Files are hard

Posted Dec 15, 2015 14:40 UTC (Tue) by corbet (editor, #1)

The person you're replying to knows a wee bit about databases too...:)

So if you really want a new interface from the kernel, could you describe that interface? What would the system call(s) be, with which semantics? With that in place, one could argue about whether it's best done in the kernel or in user space.

Luu: Files are hard

Posted Dec 15, 2015 18:23 UTC (Tue) by HenrikH (subscriber, #31152)

Perhaps something like how TCP_CORK works for sockets?
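For reference, this is roughly how TCP_CORK is used on a Linux socket, sketched in Python. The helper name send_corked is made up, and how the idea would map onto files is not something this comment specifies.

```python
import socket

def send_corked(sock, parts):
    """Send several buffers while corked, then uncork to flush them."""
    # While the cork is set, the kernel holds back partial frames.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    try:
        for part in parts:
            sock.sendall(part)
    finally:
        # Removing the cork pushes out whatever is still queued.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)
```

The application marks the start and end of a group of writes and leaves the batching inside that group to the kernel.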

Luu: Files are hard

Posted Dec 15, 2015 18:32 UTC (Tue) by fandingo (guest, #67019)

I can't speak for Wol, but I have a similar opinion. I see two major issues at play. 1) IO reordering is problematic sometimes and needs a fine-grained off-switch. 2) Constant fsync is tedious and shouldn't be necessary.

I think the solution is pretty simple (at least in concept) and steals the database transaction pattern:

begin_transaction(fd, flag)

(critical IO operations)

commit_transaction(fd)

While the names are the same as DB transactions, the feature wouldn't necessarily have any sort of rollback functionality, although that would be cool. Additionally, "commit" means more of an "I'm done with the transaction, go back to normal" rather than an "only now put all the data on disk." Begin_transaction() does two things. 1) IO reordering is disabled through all storage layers for blocks associated with that FD.* 2) Every IO syscall in the transaction for that FD is automatically followed by an fsync(). Commit_transaction() just turns it off -- the data have been hitting the disk as normal, though without reordering, the whole time.

Through a feature flag argument on begin, you could allow tuning how each subfeature works.
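Part 2 of the proposal (an fsync after every IO call on the FD) can be approximated in user space today. The sketch below assumes Python on a POSIX system; the names transaction and SyncedFile are hypothetical, and it does nothing about part 1, since disabling reordering across storage layers is the bit that needs kernel support.

```python
import os
from contextlib import contextmanager

class SyncedFile:
    """Wraps an fd so each write is followed by fsync (hypothetical helper)."""

    def __init__(self, fd, sync_each_call=True):
        self.fd = fd
        self.sync_each_call = sync_each_call  # stands in for the "flag"
        self.sync_count = 0

    def pwrite(self, data, offset):
        view = memoryview(data)
        while len(view) > 0:
            written = os.pwrite(self.fd, view, offset)
            view = view[written:]
            offset += written
        if self.sync_each_call:
            os.fsync(self.fd)
            self.sync_count += 1

@contextmanager
def transaction(fd, sync_each_call=True):
    # "begin": switch the per-call fsync behaviour on for this fd.
    yield SyncedFile(fd, sync_each_call)
    # "commit": nothing to flush, the data has been hitting disk all along.
```

Usage would look like `with transaction(fd) as tx: tx.pwrite(b"...", 0)`. This is the "existing synchronization" path with its performance cost, not the proposed feature itself.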

The single biggest issue, in my opinion, from the article and comments is the unpredictable ordering of IO operations, and the only solution is to utilize synchronization methods (O_SYNC, fsync, custom in user space with O_DIRECT/AIO, etc.) that have serious downsides. You know, sometimes I want the Atomicity and Isolation in ACID but don't care so much about Durability. The current kernel features force you to do both -- hurting performance -- and even then, you'll probably still have subtle bugs.

* Much easier said than done.

> With that in place, one could argue about whether it's best done in the kernel or in user space.

At least in my idea, the principal concern is allowing the user to definitely prevent IO reordering on specified FDs. That's impossible to do in user space -- at least without going down the undesired existing-synchronization paths.

Or, hell, perhaps just go full-bore:

begin_acid(fd, flag)

(Critical region)

end_acid(fd)

where flag is simply a 4-bit flag, one bit for each property in the acronym. (Again, no rollback necessarily, so atomicity is only at the individual syscall level within that block, and there is no larger idea of an atomic transaction. Of course, modern COW file systems can simply reflink at begin_acid and do some FD magic if there is an error, if they want to support developer-friendly features.)
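One way to picture such a flag argument is one bit per ACID property. The Python sketch below is purely illustrative: the Acid enum and its bit values are assumptions, and begin_acid() itself does not exist anywhere.

```python
import enum

class Acid(enum.IntFlag):
    """Hypothetical flag bits for the proposed begin_acid(fd, flag)."""
    ATOMICITY = 1
    CONSISTENCY = 2
    ISOLATION = 4
    DURABILITY = 8

# "I want the Atomicity and Isolation in ACID but don't care so much
# about Durability" would then be expressed as:
flags = Acid.ATOMICITY | Acid.ISOLATION
```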

Luu: Files are hard

Posted Dec 15, 2015 19:33 UTC (Tue) by Wol (subscriber, #4433)

Except that doing it at the FD level isn't really enough. If transactions carry the process-id, you could probably do it at that level.

Let's think about a typical database transaction. Write to log, write to database, clear log. We have *at* *least* two files involved - the log file and the database file(s), and we need clear air between each of the three steps. The log needs to be flushed before the first write gets to the database, and the database needs to be flushed before the log gets cleared, and certainly in my setup the database can be plastered across a lot of files.
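With today's interfaces, the "clear air" between the three steps is usually bought with an fsync after each one. A minimal Python sketch, assuming a single data file and a made-up log format (8-byte target offset followed by the payload); recovery after a crash is not shown.

```python
import os
import struct

OFFSET = struct.Struct("<Q")  # hypothetical log header: target offset

def write_all(fd, data, offset):
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written

def apply_update(log_fd, data_fd, payload, offset):
    # Step 1: write the intent to the log and make it durable.
    write_all(log_fd, OFFSET.pack(offset) + payload, 0)
    os.fsync(log_fd)
    # Step 2: only now touch the database file, and make that durable.
    write_all(data_fd, payload, offset)
    os.fsync(data_fd)
    # Step 3: the update is safe, so the log entry can be cleared.
    os.ftruncate(log_fd, 0)
    os.fsync(log_fd)
```

Each fsync here forces durability when all that is really wanted is ordering.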

But if I can say to the kernel "I don't care what optimisation you do AS LONG AS you guarantee those three operations are synchronous", then I can take responsibility for recovering from errors, without having to have that massive layering violation with loads of filesystem-specific code to make sure the stuff I need is safely on disk.

Cheers,
Wol

Luu: Files are hard

Posted Dec 15, 2015 20:51 UTC (Tue) by fandingo (guest, #67019)

I get what you're saying, and my prototype wanted to keep things simple with a single FD, although personally being a Python programmer, I'd allow commit_transaction to take a file or an iterable of file objects.

Nonetheless, I think my suggestion could work in the use case you describe.

> We have *at* *least* two files involved - the log file and the database file(s), and we need clear air between each of the three steps.

Okay, so you'd do:

begin_transaction(log_fd, ...)

(Log IO operations)

commit_transaction(log_fd)

begin_transaction(data_fd, ...)

(Data IO operations)

commit_transaction(data_fd)

I suppose this was implied, but I'll state it explicitly: Commit_transaction() has an fsync, creating the necessary durability between your two phases. Additionally, it would be possible to nest transactions. Remember there's an fsync() (optional, through the flag) after every IO call on that FD inside the transaction, so if I pwrite(log_fd, ...), that call is fully written* once it returns. Therefore, pwrite(log_fd, ...) followed by pwrite(data_fd, ...) would be well-ordered.

> But if I can say to the kernel "I don't care what optimisation you do AS LONG AS you guarantee those three operations are synchronous"

Since the title of the article is "Files Are Hard," it's worth pointing this out. I primarily care about well-ordering, which is not the same as synchronous IO, although we tend to ensure well-ordering by using synchronization methods at tremendous performance impact.

* Well, as best as the OS can ensure. Storage controllers can have their own whims.

> Except that doing it at the FD level isn't really enough. If transactions carry the process-id, you could probably do it at that level.

I considered this, but a process doesn't universally need this feature. Forcing a developer to use multiprocess programming just to overcome a synthetic limitation isn't friendly. It seems more likely that a library author would define tx_open and tx_close that do open(2)/close(2) and set up a transaction if this were desired universally within a program. If, however, we did want to make it per-process, we'd probably be happier with limiting it to a specific thread rather than an entire process.

Luu: Files are hard

Posted Dec 15, 2015 19:36 UTC (Tue) by Wol (subscriber, #4433)

Just noticed you mentioned COW filesystems. The obvious thing for me, as a database programmer, is if I write the database system itself I will write to my files in COW mode. But if I don't have the ability to do synchronous writes, I could find that my "head" has been updated but my data hasn't, and the file DATA has been corrupted !!! :-(
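The hazard can be made concrete with a toy copy-on-write file, sketched in Python. The layout is invented for illustration: bytes 0-7 hold the "head" (offset of the current record), and each record is a 4-byte length plus payload appended at the end. It assumes the 8-byte head write is not torn, which real systems cannot take for granted.

```python
import os
import struct

HEAD = struct.Struct("<Q")    # bytes 0-7: offset of the current record
LENGTH = struct.Struct("<I")  # each record: 4-byte length, then payload

def write_all(fd, data, offset):
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written

def cow_update(fd, payload):
    """Append a new version, flush it, and only then repoint the head."""
    end = os.lseek(fd, 0, os.SEEK_END)
    new_offset = max(end, HEAD.size)
    write_all(fd, LENGTH.pack(len(payload)) + payload, new_offset)
    os.fsync(fd)  # the barrier: data must be on disk before the head moves
    write_all(fd, HEAD.pack(new_offset), 0)
    os.fsync(fd)
    return new_offset

def read_current(fd):
    (offset,) = HEAD.unpack(os.pread(fd, HEAD.size, 0))
    (length,) = LENGTH.unpack(os.pread(fd, LENGTH.size, offset))
    return os.pread(fd, length, offset + LENGTH.size)
```

Without the first fsync, nothing stops the head update from reaching the disk before the record it points at, which is the corruption described above.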

Cheers,
Wol

Luu: Files are hard

Posted Dec 17, 2015 14:35 UTC (Thu) by Wol (subscriber, #4433)

> You "just" have to do the whole preallocate_file(tmp);fsync(tmp);fsync(dir);rename(tmp, normal);fsync(normal);fsync(dir); dance to create the files for journaling. And then protect individual journal writes with fdatasync() (falling back to fsync if not present). If you get ENOSYS for any of those, you give up.

Sorry to rain on your parade, but you've just given me a solution to make sure my JOURNAL gets safely to disk. I want a solution to make sure my DATA (whatever that is) gets safely to disk. And if I happen to be writing to a pre-existent database file that's several gigs in size, this particular solution is no help :-(

I'm after a generic solution - I write something, I want confirmation it's safe on disk. Any solution that makes assumptions about that "something" is wrong/not good enough. And I'd much rather it was simple, not an error-prone song-and-dance.
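The closest existing thing to "I write something, I get confirmation" for an already existing file is probably opening it with O_DSYNC (or O_SYNC), mentioned elsewhere in this thread. A Python sketch, assuming Linux; the helper names are made up, and the confirmation is only as good as the storage stack underneath.

```python
import os

def open_dsync(path):
    """Open an existing file so each write returns only after the data
    (and the metadata needed to read it back) has been pushed to storage."""
    return os.open(path, os.O_WRONLY | os.O_DSYNC)

def write_confirmed(fd, data, offset):
    """Write data at offset; with O_DSYNC, returning means it was flushed."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written
```

This makes no assumptions about what is being written, but every write pays the full synchronous cost and nothing is said about ordering between different files.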

Cheers,
Wol

Luu: Files are hard

Posted Dec 17, 2015 20:39 UTC (Thu) by bronson (subscriber, #4806)

The O_PONIES discussion was in 2009. Sounds like you want to go over all that again?