Luu: Files are hard

Posted Dec 15, 2015 18:32 UTC (Tue) by fandingo (guest, #67019)
In reply to: Luu: Files are hard by corbet
Parent article: Luu: Files are hard

I can't speak for Wol, but I have a similar opinion. I see two major issues at play. 1) IO reordering is problematic sometimes and needs a fine-grained off-switch. 2) Constant fsync is tedious and shouldn't be necessary.

I think the solution is pretty simple (at least in concept) and steals the database transaction pattern:

begin_transaction(fd, flag)

(critical IO operations)

commit_transaction(fd)

While the names are the same as DB transactions, the feature wouldn't necessarily have any sort of rollback functionality, although that would be cool. Additionally, "commit" means more of a "I'm done with the transaction, go back to normal" rather than an "only now put all the data on disk." Begin_transaction() does two things. 1) IO reordering is disabled through all storage layers for blocks associated with that FD.* 2) Every IO syscall in the transaction for that FD automatically has fsync() called. Commit_transaction just turns it off -- the data have been hitting the disk as normal, though without reordering, the whole time.

Through a feature flag argument on begin, you could allow tuning how each subfeature works.
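
To make the second half of that concrete, here is a rough user-space sketch in Python of the "fsync after every IO call" behaviour. It cannot do the first half: nothing in user space can switch off reordering in the lower layers. The names transaction, Transaction, TX_SYNC_EACH_WRITE and TX_SYNC_ON_COMMIT are made up for illustration; only os.pwrite and os.fsync are real calls.

```python
import os
from contextlib import contextmanager

# Hypothetical flag bits; not a real kernel or library API.
TX_SYNC_EACH_WRITE = 0x1   # fsync after every write inside the transaction
TX_SYNC_ON_COMMIT = 0x2    # fsync once more when the transaction ends


class Transaction:
    """User-space stand-in for the proposed begin/commit pair."""

    def __init__(self, fd, flags):
        self.fd = fd
        self.flags = flags
        self.syncs = 0  # counts fsync calls, for illustration only

    def _sync(self):
        os.fsync(self.fd)
        self.syncs += 1

    def pwrite(self, data, offset):
        # pwrite() may write fewer bytes than asked, so loop until done.
        view = memoryview(data)
        while len(view) > 0:
            n = os.pwrite(self.fd, view, offset)
            view = view[n:]
            offset += n
        if self.flags & TX_SYNC_EACH_WRITE:
            self._sync()


@contextmanager
def transaction(fd, flags=TX_SYNC_EACH_WRITE):
    # "begin": hand out an object that applies the chosen flags.
    tx = Transaction(fd, flags)
    try:
        yield tx
    finally:
        # "commit": optionally flush once more, then go back to normal.
        if flags & TX_SYNC_ON_COMMIT:
            tx._sync()
```

Usage would look like "with transaction(fd, TX_SYNC_EACH_WRITE) as tx: tx.pwrite(b'...', 0)". This only emulates the flag-controlled fsync; the ordering guarantee is exactly the part that would need kernel support.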

The single biggest issue, in my opinion, from the article and comments is the unpredictable ordering of IO operations, and the only solution is to utilize synchronization methods (O_SYNC, fsync, custom in user space with O_DIRECT/AIO, etc.) that have serious downsides. You know, sometimes I want the Atomicity and Isolation in ACID but don't care so much about Durability. The current kernel features force you to do both -- hurting performance -- and even then, you'll probably still have subtle bugs.

* Much easier said than done.

> With that in place, one could argue about whether it's best done in the kernel or in user space.

At least in my idea, the principal concern is allowing the user to definitely prevent IO reordering on specified FDs. That's impossible to do in user space -- at least without going down the undesired existing-synchronization paths.

Or, hell, perhaps just go full-bore:

begin_acid(fd, flag)

(Critical region)

end_acid(fd)

where flag is simply a 4-bit flag corresponding to enabling each property in the acronym. (Again, no rollback necessarily, so atomicity is only at the individual syscall level within that block, and there is no larger idea of an atomic transaction. Of course, modern, COW file systems can simply reflink at begin_acid and do some FD magic if there is an error if they want to support developer-friendly features.)
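
As a sketch of what that flag argument might look like, ACID names four properties, so one bit per property is enough. The Acid class and its bit values below are purely hypothetical; no such kernel interface exists.

```python
from enum import IntFlag


class Acid(IntFlag):
    """Hypothetical flag bits for begin_acid(); one bit per ACID property."""
    ATOMICITY = 0b0001
    CONSISTENCY = 0b0010
    ISOLATION = 0b0100
    DURABILITY = 0b1000


# Atomicity and isolation without paying for durability.
flags = Acid.ATOMICITY | Acid.ISOLATION

# An implementation would test individual bits like this.
wants_durability = bool(flags & Acid.DURABILITY)
```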

Luu: Files are hard

Posted Dec 15, 2015 19:33 UTC (Tue) by Wol (subscriber, #4433)

Except that doing it at the FD level isn't really enough. If transactions carry the process-id, you could probably do it at that level.

Let's think about a typical database transaction. Write to log, write to database, clear log. We have *at* *least* two files involved - the log file and the database file(s), and we need clear air between each of the three steps. The log needs to be flushed before the first write gets to the database, and the database needs to be flushed before the log gets cleared, and certainly in my setup the database can be plastered across a lot of files.
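
With today's interfaces, those three steps with "clear air" between them come out roughly as below. This is a minimal sketch, assuming a single log file, a single database file, and a made-up log record layout (8-byte offset, 4-byte length, payload); logged_update and write_all are illustrative names, and a real database would also need recovery code that replays the log.

```python
import os


def write_all(fd, data, offset):
    """pwrite() may write fewer bytes than asked; loop until done."""
    view = memoryview(data)
    while len(view) > 0:
        n = os.pwrite(fd, view, offset)
        view = view[n:]
        offset += n


def logged_update(log_fd, data_fd, offset, payload):
    # Step 1: record the intended change in the log and flush it.
    record = offset.to_bytes(8, "big") + len(payload).to_bytes(4, "big") + payload
    write_all(log_fd, record, 0)
    os.fsync(log_fd)

    # Step 2: apply the change to the database file and flush it.
    write_all(data_fd, payload, offset)
    os.fsync(data_fd)

    # Step 3: clear the log only once the data has been flushed.
    os.ftruncate(log_fd, 0)
    os.fsync(log_fd)
```

Each fsync() here is a full durability barrier, which is stronger (and slower) than the ordering guarantee actually needed between the steps.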

But if I can say to the kernel "I don't care what optimisation you do AS LONG AS you guarantee those three operations are synchronous", then I can take responsibility for recovering from errors, without having to have that massive layering violation with loads of filesystem-specific code to make sure the stuff I need is safely on disk.

Cheers,
Wol

Luu: Files are hard

Posted Dec 15, 2015 20:51 UTC (Tue) by fandingo (guest, #67019)

I get what you're saying, and my prototype wanted to keep things simple with a single FD, although personally being a Python programmer, I'd allow commit_transaction to take a file or an iterable of file objects.

Nonetheless, I think my suggestion could work in the use case you describe.

> We have *at* *least* two files involved - the log file and the database file(s), and we need clear air between each of the three steps.

Okay, so you'd do:

begin_transaction(log_fd, ...)

(Log IO operations)

commit_transaction(log_fd)

begin_transaction(data_fd, ...)

(Data IO operations)

commit_transaction(data_fd)

I suppose this was implied, but I'll state it explicitly: Commit_transaction() has an fsync, creating the necessary durability between your two phases. Additionally, it would be possible to nest transactions. Remember there's an (optional through the flag) fsync() after every IO call on that FD inside the transaction, so if I pwrite(log_fd, ...), that call is fully written* once it returns. Therefore, pwrite(log_fd, ...) followed by pwrite(data_fd, ...) would be well-ordered.

> But if I can say to the kernel "I don't care what optimisation you do AS LONG AS you guarantee those three operations are synchronous"

Since the title of the article is "Files Are Hard," it's worth pointing this out. I primarily care about well-ordering, which is not the same as synchronous IO, although we tend to ensure well-ordering by using synchronization methods at tremendous performance impact.

* Well, as best as the OS can ensure. Storage controllers can have their own whims.

> Except that doing it at the FD level isn't really enough. If transactions carry the process-id, you could probably do it at that level.

I considered this, but a process doesn't universally need this feature. Forcing a developer to use multiprocess programming just to overcome a synthetic limitation isn't friendly. It seems more likely that a library author would define tx_open and tx_close that do open(2)/close(2) and set up a transaction if this were desired universally within a program. If, however, we did want to make it per-process, we'd probably be happier with limiting it to a specific thread rather than an entire process.

Luu: Files are hard

Posted Dec 15, 2015 19:36 UTC (Tue) by Wol (subscriber, #4433)

Just noticed you mentioned COW filesystems. The obvious thing for me, as a database programmer, is if I write the database system itself I will write to my files in COW mode. But if I don't have the ability to do synchronous writes, I could find that my "head" has been updated but my data hasn't, and the file DATA has been corrupted !!! :-(
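
The hazard is easiest to see in code. Below is a minimal sketch of a copy-on-write update, assuming a made-up file layout in which the first 8 bytes hold the offset of the current data (the "head"); cow_update, read_head and write_all are illustrative names. The first fsync() is the step that keeps the head from pointing at data that never reached the disk.

```python
import os

HEADER_SIZE = 8  # hypothetical layout: first 8 bytes hold the head offset


def write_all(fd, data, offset):
    """pwrite() may write fewer bytes than asked; loop until done."""
    view = memoryview(data)
    while len(view) > 0:
        n = os.pwrite(fd, view, offset)
        view = view[n:]
        offset += n


def cow_update(fd, payload):
    """Append a new copy of the data, flush it, then repoint the head."""
    end = max(os.fstat(fd).st_size, HEADER_SIZE)
    write_all(fd, payload, end)
    os.fsync(fd)  # data must be flushed before the head refers to it
    write_all(fd, end.to_bytes(HEADER_SIZE, "big"), 0)
    os.fsync(fd)  # now flush the new head itself
    return end


def read_head(fd):
    """Return the offset the head currently points at."""
    return int.from_bytes(os.pread(fd, HEADER_SIZE, 0), "big")
```

The sketch also assumes the 8-byte head write is not torn by a crash, which the file system does not promise in general. Without the first fsync(), nothing stops the head update from reaching the disk ahead of the data.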

Cheers,
Wol