LWN: Comments on "Fast commits for ext4" https://lwn.net/Articles/842385/ This is a special feed containing comments posted to the individual LWN article titled "Fast commits for ext4". en-us Sat, 30 Aug 2025 06:07:33 +0000 Sat, 30 Aug 2025 06:07:33 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net What about other filesystems? https://lwn.net/Articles/864183/ https://lwn.net/Articles/864183/ andresfreund <div class="FormattedComment"> &gt; The OS/FS aren&#x27;t magic.<br> <p> In particular they have no reliable way of knowing which files (or even parts of files) are related and need constrained ordering between writes, and which are unrelated and thus can be handled independently. <br> </div> Fri, 23 Jul 2021 18:01:28 +0000 What about other filesystems? https://lwn.net/Articles/864181/ https://lwn.net/Articles/864181/ andresfreund <div class="FormattedComment"> You can trust the data if you tell the OS you want to pay the price (use fsync, or O_SYNC/O_DSYNC) and you only rely on data known to be flushed after a crash.<br> <p> The alternative you&#x27;re proposing basically implies that writes cannot be buffered for any meaningful amount of time. Oh, your browser updated its history database? Let&#x27;s just make that wait for all temporary files being written out, the file being copied concurrently, etc. And what&#x27;s worse, do not allow any concurrent sync writes to finish (like file creation), because that would violate ordering.<br> <p> Ext3&#x27;s ordering guarantees were weaker and yet led to horrible stalls all the time. There were constant complaints about Firefox&#x27;s SQLite databases stalling the system etc.<br> <p> The OS/FS aren&#x27;t magic.<br> </div> Fri, 23 Jul 2021 17:55:49 +0000 What about other filesystems? https://lwn.net/Articles/864167/ https://lwn.net/Articles/864167/ andresfreund <div class="FormattedComment"> <font class="QuotedText">&gt; Do an fsync on the user-space journal. </font><br> <p> With a small bit of care fdatasync() should be enough, and will often be a lot faster (no synchronous updating of unimportant filesystem metadata like mtime, turning a single write into multiple).<br> <p> <font class="QuotedText">&gt; Anything else will not work and has never worked. Ok, especially if you are developing databases, O_DIRECT might be a viable alternative.</font><br> <p> O_DIRECT on its own does not remove the need for an fsync/fdatasync. Devices have volatile write caches, and they do lose data on power loss/reset. O_DIRECT alone only avoids issues with the OS write cache. And makes the f[data]sync cheaper, because it will often only have to send a cache flush (and transparently avoid that on devices without volatile write caches).<br> <p> Alternatively it can be combined with O_DSYNC to achieve actually durable writes - but if one isn&#x27;t careful that can tank throughput. It either adds the FUA flag to each write or does separate sync commands after each write, which can end up being more cache flushes for a workload. It can be significantly faster to do a series of writes and then a single cache flush.<br> <p> It&#x27;s hardware dependent too whether FUA or separate cache flush commands are faster :(. Dear Samsung, please fix your drives.<br> <p> My testing says that on most NVMe devices with a volatile cache DSYNC wins if the queue depth is very low and latency is king (only one roundtrip needed). 
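<p> For concreteness, the two approaches being compared look roughly like this (an illustrative sketch, not from the original comment; the log file name is hypothetical and error handling is trimmed):
<pre>
/* Sketch only: a hypothetical append-only log file "wal.log". */
#include <fcntl.h>
#include <unistd.h>

/* Option 1: O_DSYNC - each write() returns only once it is durable
 * (a FUA write, or a write plus a cache flush, per call). */
int append_dsync(const char *buf, size_t len)
{
    int fd = open("wal.log", O_WRONLY | O_APPEND | O_DSYNC);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* Option 2: buffered writes, then one fdatasync() for the whole batch -
 * a single cache flush covers every record written so far. */
int append_batch(const char *bufs[], const size_t lens[], int count)
{
    int fd = open("wal.log", O_WRONLY | O_APPEND);
    if (fd < 0)
        return -1;
    for (int i = 0; i < count; i++) {
        if (write(fd, bufs[i], lens[i]) != (ssize_t)lens[i]) {
            close(fd);
            return -1;
        }
    }
    int rc = fdatasync(fd);
    close(fd);
    return rc;
}
</pre>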
fdatasync wins if there&#x27;s more than a few writes happening at once, especially if all/most need to complete before the &quot;user transaction&quot; finishes - the lower number of flushes saves more than the additional roundtrip costs.<br> <p> <font class="QuotedText">&gt; Actually database developers are probably among the first to scream in terror if a filesystem will guarantee strong ordering properties because of the immense performance penalties, which they do not want to buy.</font><br> <p> Indeed! And there&#x27;s no realistic way the FS can do better. <br> </div> Fri, 23 Jul 2021 17:43:35 +0000 What about other filesystems? https://lwn.net/Articles/864076/ https://lwn.net/Articles/864076/ Defenestrator <div class="FormattedComment"> <font class="QuotedText">&gt; And I think it is guaranteed that a rename only is done after the renamed file has hit the disk.</font><br> <p> This is often true in practice (in particular, in ext3 and in ext4 outside of early versions), but not always explicitly guaranteed. See the auto_da_alloc option added to ext4 for more info and background.<br> </div> Fri, 23 Jul 2021 13:10:08 +0000 Fast commits for ext4 https://lwn.net/Articles/845388/ https://lwn.net/Articles/845388/ mrybczyn <div class="FormattedComment"> Thank you for the comment, Jan. We&#x27;ve clarified the sentence.<br> </div> Mon, 08 Feb 2021 15:58:56 +0000 What about other filesystems? https://lwn.net/Articles/844689/ https://lwn.net/Articles/844689/ drjohnnyfever <div class="FormattedComment"> The situation on FreeBSD is actually a bit complicated. The current default configuration uses journaled soft updates (su+j) which work like traditional soft updates with the exception that there is enough metadata journaled to avoid having to run a background fsck to reclaim space from orphaned allocations. <br> <p> FreeBSD also supports proper journaling (logging) with gjournal which works at the block layer rather than in UFS itself. If I recall correctly, gjournal keeps a proper intent log of disk writes rather than just a metadata journal. It also allows you to use a separate device for the log.<br> <p> ZFS does seem to be the way forward on FreeBSD but Netflix is pretty big consumer of UFS/FFS so they have been sponsoring continued development.<br> </div> Mon, 01 Feb 2021 14:25:32 +0000 PostgreSQL might benefit from a fd-list fsync() API https://lwn.net/Articles/843509/ https://lwn.net/Articles/843509/ ringerc <div class="FormattedComment"> Yes. PostgreSQL may have many dirty FDs. Each tables or index is stored as separate file, and split into 1GB extents. I imagine that the extent splitting could be changed if there&#x27;s a benefit to doing so, but the separate files per relation not so much.<br> </div> Fri, 22 Jan 2021 01:15:37 +0000 What about other filesystems? https://lwn.net/Articles/843471/ https://lwn.net/Articles/843471/ mstone_ <div class="FormattedComment"> Yeah, the confusion here comes from comparing the current state of affairs to a past state that didn&#x27;t exist.<br> </div> Thu, 21 Jan 2021 19:42:43 +0000 What about other filesystems? https://lwn.net/Articles/843272/ https://lwn.net/Articles/843272/ anton <blockquote>All this talk about "apps should use fsync" or "apps should use fsfsync" fills me with horror, as someone who wants to write a system that relies on data integrity. 
Are you telling me that my app needs to be filesystem-aware, and not only that but aware of what mount options were used, so I know which commands to call to make sure that my data is safe?</blockquote> I think that Ted Ts'o's position is that you should use fsync a lot, independent of the file system; and that you should not just use it on the file you were working on, but possibly on other files; he explicitly mentions the directory that contains the file, but my expectation is that it might also affect other files, depending on the file system. E.g., some file systems have a file that contains the inodes, so why not require that the application also fsyncs that? <p>My position is pretty much the opposite: A good file system should provide decent consistency guarantees in case of an OS crash or power outage (which, BTW, is not covered by POSIX, so any claim by Ted Ts'o that POSIX requires applications to use fsync the way he likes is nonsense). And these consistency guarantees should be oriented towards the guarantees that file systems give in the non-crash case. This is the usual case, which is what programmers develop for and test against; so even if Ted Ts'o's file systems guaranteed to corrupt all non-fsynced data in case of a crash, it is unlikely that all applications would contain enough fsyncs that Ted Ts'o could not blame them for data loss. <p>Given that POSIX guarantees sequential consistency for file system operations in the non-crash case, my take is that file systems should guarantee a sequentially consistent state in case of a crash; but for performance reasons, not the state at the time of the crash. You can use sync whenever you want the crash state to be consistent with the logical state, e.g., before reporting the completion of a transaction over the network. <p>Unfortunately, last I looked there was only one file system in Linux that gives such a guarantee (NILFS2), but that's because we let Linux get away with file systems that don't give such guarantees; admittedly Linux crashes so rarely these days and at least around here power outages are so rare that the itch is small. <p>Concerning whether apps should be file system aware, my take is that you should program for a filesystem that gives good guarantees, and recommend that the user should not be using other file systems (I have certainly avoided any file system maintained by Ted Ts'o since the O_PONIES discussion). Or, if you feel like accommodating bad file systems, have a flag for the application that inserts an fsync of the file and its directory after every change of the file (that may satisfy Ted Ts'o for now); the user can use that flag when using a bad file system. <p>Concerning efficiency, this can be implemented efficiently, by writing the changes (data blocks and/or journal entries) to free blocks in arbitrary order, then a write barrier, then a root block that allows finding all this stuff, and once that root block reaches the platter, the recovery code will find everything. This does not even require any synchronous operation in principle, so it could be very efficient (in practice, I think that the barrier operations on existing drives are synchronous). <p>Concerning lying hardware: If the hardware lies, all the fsyncs that Ted Ts'o asks for will not help you. Just as we need file systems with decent guarantees, we need honest hardware. And I expect it exists, because otherwise database servers would be unable to guarantee consistency after crashes. 
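<p>To make that commit scheme concrete, here is a minimal sketch against a hypothetical raw block device; fdatasync() on the device stands in for the write barrier, and the block numbers, block size, and commit() interface are invented for illustration:
<pre>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define BLK 4096

/* Stage new data/journal blocks in free locations (any order), issue a
 * barrier, then publish a root block from which recovery finds the rest. */
int commit(int dev, const char *blocks[], const off_t blknos[], int n,
           const char *root, off_t root_blkno)
{
    for (int i = 0; i < n; i++)
        if (pwrite(dev, blocks[i], BLK, blknos[i] * BLK) != BLK)
            return -1;

    if (fdatasync(dev) != 0)      /* barrier: staged blocks become durable */
        return -1;

    if (pwrite(dev, root, BLK, root_blkno * BLK) != BLK)
        return -1;
    return fdatasync(dev);        /* root durable: the commit is now visible */
}
</pre>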
Wed, 20 Jan 2021 18:51:14 +0000 What about other filesystems? https://lwn.net/Articles/843269/ https://lwn.net/Articles/843269/ anton <blockquote>Oh - and wasn't advice about how to shut a system down always "# sync; sync; sync; halt"?</blockquote> No, the advice was to type <pre> sync sync sync halt </pre> That's because sync did not block (unlike on Linux), so the time you needed to type the additional syncs and the halt was needed to finish the sync. Wed, 20 Jan 2021 17:44:42 +0000 Fast commits for ext4 https://lwn.net/Articles/843198/ https://lwn.net/Articles/843198/ viiru <div class="FormattedComment"> <font class="QuotedText">&gt; What does the system call do in other filesystems?</font><br> <p> <font class="QuotedText">&gt; If it&#x27;s an ext4 quirk then any software relying on that would already break just by virtue of moving to a </font><br> <font class="QuotedText">&gt; different filesystem.</font><br> <p> In my understanding that is exactly what it is. Ext3 shares the behavior, but for example XFS does not. This caught many application developers by surprise, but that happened a couple of decades ago and has most likely been fixed in any sensible application.<br> </div> Wed, 20 Jan 2021 07:48:21 +0000 PostgreSQL might benefit from a fd-list fsync() API https://lwn.net/Articles/843113/ https://lwn.net/Articles/843113/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; Note that Pg can&#x27;t use FS-level write-order barriers after each WAL record written. It&#x27;d eliminate the latency after each fsync(), but would be grossly inefficient for this because it&#x27;d prevent reordering of non-WAL writes across barriers, and we want the FS to be free to reorder non-WAL writes as aggressively as possible (up to the next checkpoint) in order to do write combining and write deduplication.</font><br> <p> Does PG often spend time io-bound? How important really is &quot;as efficient as possible&quot; disk io?<br> <p> I would have thought this issue would actually impact me even more, given that traditionally Pick has always been &quot;don&#x27;t bother with caching, it&#x27;s faster to retrieve it from disk&quot; and Pick can also really hammer the disk.<br> <p> Cheers,<br> Wol<br> </div> Tue, 19 Jan 2021 14:55:29 +0000 PostgreSQL might benefit from a fd-list fsync() API https://lwn.net/Articles/843112/ https://lwn.net/Articles/843112/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; Finally, for checkpoints PostgreSQL must fsync() all the dirty files it has open. This will force all pending writes on those files to disk, not just the writes made before the current checkpoint target position. On most Linux FSes it also flushes a bunch of other irrelevant and uninteresting files that don&#x27;t need to be durable at all.</font><br> <p> Does PG tend to have a lot of these (dirty files)? Because Pick, the database I&#x27;m thinking of, typically stores a single &quot;table&quot; in one OR MORE os files, so I could have a *lot* of dirty files open ...<br> <p> Although of course, like PG, the main thing I&#x27;m concerned about is knowing that the journal is flushed ...<br> <p> Cheers,<br> Wol<br> </div> Tue, 19 Jan 2021 14:51:28 +0000 Fast commits for ext4 https://lwn.net/Articles/843106/ https://lwn.net/Articles/843106/ jan.kara <div class="FormattedComment"> <font class="QuotedText">&gt; In the default data=ordered mode, where the journal entry is written only after flushing all pending data, delayed allocation might thus delay the writing of the &gt; journal. 
</font><br> <p> This is actually not quite correct. Delayed allocation just means that write(2) stores data in the page cache without actually allocating blocks on disk. This also means that the journalling machinery is completely ignorant of the write at this moment. Later, when VM decides to write out dirty pages from the page cache, filesystem allocates blocks for the pages and it is only at this point that there are filesystem metadata changes that are journalled. So it isn&#x27;t true that &quot;delayed allocation may delay writing of the journal&quot;.<br> </div> Tue, 19 Jan 2021 14:24:57 +0000 Fast commits for ext4 https://lwn.net/Articles/843100/ https://lwn.net/Articles/843100/ LtWorf <div class="FormattedComment"> What does the system call do in other filesystems?<br> <p> If it&#x27;s an ext4 quirk then any software relying on that would already break just by virtue of moving to a different filesystem.<br> </div> Tue, 19 Jan 2021 08:15:01 +0000 PostgreSQL might benefit from a fd-list fsync() API https://lwn.net/Articles/843097/ https://lwn.net/Articles/843097/ ringerc <div class="FormattedComment"> I think PostgreSQL might benefit significantly from a fsync() variant that takes a fd-list. Though the project is considering moving to syncfs() on Linux. It generally only cares about flushes to many files when it&#x27;s doing checkpoints, and ordering of individual flushes isn&#x27;t important for those. <br> <p> Pg has three major write/flush order requirements.<br> <p> 1. For WAL (its journal): WAL record writes must be flushed to disk strictly in ascending byte order in the WAL files. Not all WAL record writes must be immediately durable, but all writes prior to a durable record like a commit must complete before the durable record is flushed. PostgreSQL uses either O_DATASYNC or fdatasync() to ensure this.<br> <p> 2. Ordering of non-WAL writes with respect to their corresponding WAL record writes. Non-WAL writes such as heap pages, index pages, clog (commit state) must not be flushed to disk until the corresponding WAL writes are known to be durable. To do this, PostgreSQL buffers the data to be written in its own shared memory, delaying any write() to the OS until it knows the WAL for these writes is durable. This causes significant double-buffering waste and write latency.<br> <p> 3. Checkpoint WAL segment removal vs non-WAL writes. PostgreSQL must know that all non-WAL writes corresponding to WAL records up to a given point in the WAL are durably on disk before it removes or overwrites a WAL segment (file). PostgreSQL does this by remembering each file descriptor that was touched since the last checkpoint and fsync()ing it before removing any WAL segments.<br> <p> If this sounds inefficient to you, that&#x27;s because it is.<br> <p> The O_DATASYNC or fdatasync() after many WAL record writes can limit overall throughput of sync-sensitive writes like commit records, especially since sometimes a large volume of less critical records must be flushed before the critical record is durably flushed. WAL writing spends a lot of time waiting because there&#x27;s no API that lets us pipeline or queue up fsync()s in a non-blocking manner - except possibly helper threads for blocking calls, and they cause plenty of other issues.<br> <p> For non-WAL (heap) writes, PostgreSQL has to tie up memory for its pending writes until it knows the OS has flushed the corresponding WAL writes. 
Only then can it write() them, because it has no way to tell the OS that they must not hit disk before the corresponding WAL records are durable. That means it has to buffer them for longer, and double handle the writes.<br> <p> Finally, for checkpoints PostgreSQL must fsync() all the dirty files it has open. This will force all pending writes on those files to disk, not just the writes made before the current checkpoint target position. On most Linux FSes it also flushes a bunch of other irrelevant and uninteresting files that don&#x27;t need to be durable at all.<br> <p> What I&#x27;d really like to have is a way to tell the FS about write-order dependencies when PostgreSQL issues a write().<br> <p> Postgres would request a tag for each WAL record write, and then tag each write that depends on that WAL record with an ordering requirement against that WAL write tag. Then if WAL ordering was critical for a given WAL write it&#x27;d tag the next WAL write with an ordering requirement against the previous critical WAL write. So the OS would be free to write in natural order:<br> <p> * WAL record A<br> * Heap changes for A<br> * WAL record B<br> * Heap changes for B<br> <p> or reorder:<br> <p> * WAL record A<br> * WAL record B<br> * Heap changes for A and B in any combination<br> <p> but could not write heap changes for A before WAL A, or heap changes for B before WAL B if postgres specified the A before B write-order requirement.<br> <p> That&#x27;d let Pg just write everything to the OS without blocking on fsync() etc, and let the kernel&#x27;s dirty writeback and VFS/block layer sort out the ordering.<br> <p> Basically a kind of AIO with the ability to impose write ordering requirements and with a sensible interface for confirming data is durably flushed. The latter is woefully lacking in any current AIO facilities.<br> <p> It&#x27;d be necessary to guarantee that any read of a file with pending writes always returned the latest data that&#x27;s pending writeback. Otherwise Pg would have to pin each dirty buffer in its own buffer cache until it knew it was flushed by the OS anyway.<br> <p> I hope that the recent blk-mq work may well benefit PostgreSQL if the needed balance between permitting reordering and imposing necessary write-order constraints proves to be possible. <br> <p> Note that Pg can&#x27;t use FS-level write-order barriers after each WAL record written. It&#x27;d eliminate the latency after each fsync(), but would be grossly inefficient for this because it&#x27;d prevent reordering of non-WAL writes across barriers, and we want the FS to be free to reorder non-WAL writes as aggressively as possible (up to the next checkpoint) in order to do write combining and write deduplication.<br> </div> Tue, 19 Jan 2021 04:36:35 +0000 Fast commits for ext4 https://lwn.net/Articles/843084/ https://lwn.net/Articles/843084/ tytso <div class="FormattedComment"> So for decades, competently written text editors write new precious files, such as source files via:<br> <p> 1) Write the new contents of foo.c to foo.c.new<br> 2) Fsync foo.c.new --- check the error return from the fsync(2) as well as the close(2)<br> 3) Delete foo.c.bak<br> 4) Create a hard link from foo.c to foo.c.bak<br> 5) Rename foo.c.new on top of foo.c<br> <p> This doesn&#x27;t require an fsync of the directory, but it guarantees that /path/to/foo.c will either have the original contents of foo.c., or the new contents of foo.c, even if there is a crash any time during the above process. 
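<p> In C, the ritual above looks roughly like this (a sketch only: foo.c, foo.c.new and foo.c.bak are the example names from the steps above, and error handling is abbreviated):
<pre>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int save_precious(const char *data, size_t len)
{
    int fd = open("foo.c.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {  /* 1) write new contents */
        close(fd);
        return -1;
    }
    if (fsync(fd) != 0) {                        /* 2) check fsync(2)...   */
        close(fd);
        return -1;
    }
    if (close(fd) != 0)                          /*    ...and close(2)     */
        return -1;
    unlink("foo.c.bak");                         /* 3) drop the old backup */
    if (link("foo.c", "foo.c.bak") != 0)         /* 4) keep a backup link  */
        return -1;
    return rename("foo.c.new", "foo.c");         /* 5) atomic replace      */
}
</pre>
<p> The rename(2) in the last step is what makes the switch atomic; everything before it only prepares a fully written, durable copy.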
If you want portability to other Posix operating systems, including people running, say, retro versions of BSD 4.3, this is what you should do. It&#x27;s what emacs and vi do, and some of the &quot;ritual&quot;, such as making sure you check the error return from close(2), is because otherwise you might lose data if you run into a quota overrun on the Andrew File System (the distributed file system developed at CMU, and used at MIT Project Athena, as well as several National Labs and financial institutions).<br> <p> That being said, rename is not a write barrier, but as part of the O_PONIES discussion, on a close(2) of an open which was opened with O_TRUNC, or on a rename(2) where the destination file is getting overwritten, the file being closed, or the source file of the rename will have an immediate write-out initiated. It&#x27;s not going to block the rename(2) operation from returning, but it narrows the race window from 30 seconds to however long it takes to accomplish the writeout, which is typically less than a second. It&#x27;s also something that was implemented informally by all of the major file systems at the time of the O_PONIES controversy, but it doesn&#x27;t necessarily account for what newer file systems (for example, like bcachefs and f2fs) might decide to do, and of course, this is not applicable for what other operating systems such as MacOS might be doing.<br> <p> The compromise is something that was designed to minimize performance impact, since users and applications also notice --- and get cranky --- when there are performance regressions, while still papering over most of the problems caused by careless applications. From the file system developers&#x27; perspective, the ultimate responsibility is on application writers if they think a particular file write is precious and must not be lost after a system or application crash. After all, an application may be doing something really stupid, such as overwriting a precious file by using open(2) with O_TRUNC, because it&#x27;s too much of a pain to copy over ACLs and extended attributes, so it&#x27;s simpler to just use O_TRUNC, overwrite the data file, and cross your fingers. There is absolutely no way the file system can protect against application writer stupidity, but we can try to minimize the risk of damage, while not penalizing the performance of applications which are doing the right thing, and are writing, say, a scratch file.<br> <p> <p> <p> </div> Mon, 18 Jan 2021 19:24:38 +0000 What about other filesystems? https://lwn.net/Articles/843076/ https://lwn.net/Articles/843076/ hkario <div class="FormattedComment"> precisely, the issue is not that hardware can fail and that the file system can&#x27;t promise anything in such a case, the problem is that there is no specification common _to all file systems_ that says what is expected to happen under such and such scenarios<br> <p> or to put it the other way round: every file system will exhibit different behaviour on power failure and every file system requires slightly different handling to get something you can reasonably expect (like, when the file system says it committed data to disk, the data is committed to disk)<br> <p> that&#x27;s no way to program stuff when dealing with such a fundamental thing in computing as data storage<br> </div> Mon, 18 Jan 2021 17:47:40 +0000 What about other filesystems? 
https://lwn.net/Articles/843071/ https://lwn.net/Articles/843071/ smoogen <div class="FormattedComment"> From reading your other posts on this, I think you also have to be aware of what hardware the system has in it if you want that level of guarantee. There are a large number of hardware pieces which, for the sake of speed, fool the OS into saying stuff was written but wasn&#x27;t. You can fsync(), sync(), and all other things and the hardware will write stuff in the order it wants. And yes, like bufferbloat, it is built into all parts of modern hardware, from the cache on the harddrive, to the controller of the harddrive, to the PCI bus the controller is connected to, to the memory subsystem. <br> <p> In the end for a lot of hardware you just have to throw away guaranteeing a lot of things because the entire industry has agreed to lie in order to give the feeling of speed. To get around this you start having to move back to Real Time hardware which is much slower (less smart caches which could break RT guarantees) but more predictable. Not the answer you probably want to hear... but I think a good reason the kernel people don&#x27;t see the lies as a problem any more is that finding hardware post 2004 which doesn&#x27;t lie in some way to its calling system is rare. <br> </div> Mon, 18 Jan 2021 17:05:52 +0000 What about other filesystems? https://lwn.net/Articles/842956/ https://lwn.net/Articles/842956/ pbonzini <div class="FormattedComment"> <font class="QuotedText">&gt; AND YOU CAN&#x27;T EVEN RELY ON JOURNALLING because you don&#x27;t know whether the file system has written the journal before, after, or in the middle of writing the data.</font><br> <p> The filesystem is going to write data before metadata, so that you won&#x27;t have a file that&#x27;s full of zeros (or worse, full of stale data including another user&#x27;s cleartext password). With &quot;old Unix&quot; you could get a file that&#x27;s full of trash after a power failure; I sure did. So if anything journalling makes things better.<br> </div> Mon, 18 Jan 2021 13:05:00 +0000 What about other filesystems? https://lwn.net/Articles/842954/ https://lwn.net/Articles/842954/ farnz <p><tt>io_uring</tt> does provide the primitives needed; there's <tt>IORING_OP_FSYNC</tt> (with <tt>IORING_FSYNC_DATASYNC</tt> to weaken from fsync to fdatasync) and <tt>IORING_OP_SYNC_FILE_RANGE</tt> for flushing caches asynchronously, and the <tt>IOSQE_IO_DRAIN</tt> and <tt>IOSQE_IO_LINK</tt> flags to order <tt>io_uring</tt> operations with respect to each other so that you can issue the fsync after all the related writes have been done. Mon, 18 Jan 2021 10:45:18 +0000 What about other filesystems? https://lwn.net/Articles/842953/ https://lwn.net/Articles/842953/ farnz <p>No, because even on ancient systems, you had elevator reordering for performance, and no guarantees about metadata writes; in the event of a crash, you simply did not know the state of the update or the transaction log, as even if you wrote them in a careful order, the elevator could reorder writes to disk, and the metadata writes might be reordered, too. <p>In other words, as soon as there's a kernel panic or a power failure, all bets are off on an old UNIX system. This wasn't an issue with reliable systems, but as reliability went down (no dedicated power supplies, no UPSes etc), it became an issue again. Mon, 18 Jan 2021 10:29:39 +0000 What about other filesystems? 
https://lwn.net/Articles/842949/ https://lwn.net/Articles/842949/ NYKevin <div class="FormattedComment"> Well, you can also use aio_fsync(3), but that&#x27;s basically just a crappier version of &quot;use a thread pool.&quot;<br> <p> IMHO this is a broader issue with aio(7) and not a problem with fsync in particular.<br> </div> Mon, 18 Jan 2021 07:16:42 +0000 What about other filesystems? https://lwn.net/Articles/842948/ https://lwn.net/Articles/842948/ joib <div class="FormattedComment"> I believe io_uring supports fsync, so that would be a way to do an asynchronous fsync on somewhat modern Linux.<br> </div> Mon, 18 Jan 2021 05:34:39 +0000 What about other filesystems? https://lwn.net/Articles/842941/ https://lwn.net/Articles/842941/ NYKevin <div class="FormattedComment"> The process that SQLite uses is documented in <a href="https://sqlite.org/atomiccommit.html">https://sqlite.org/atomiccommit.html</a> in a very high level of detail.<br> <p> TL;DR: They make a copy (&quot;rollback journal&quot;) of the data they are about to overwrite, fsync that copy, overwrite the data, fsync the database itself, and finally delete the rollback journal.<br> </div> Sun, 17 Jan 2021 22:13:36 +0000 What about other filesystems? https://lwn.net/Articles/842939/ https://lwn.net/Articles/842939/ Wol <div class="FormattedComment"> mmmm<br> <p> The risk of a corrupted filesystem hasn&#x27;t changed.<br> <p> But if the application writes a journal before doing an update, then provided there&#x27;s no collateral damage it can recover from a crash mid transaction on an old unix system.<br> <p> On a new system, it can&#x27;t be sure whether the transaction log is okay and the update is damaged, or the transaction log is damaged and the transaction is lost, or even worse the transaction log is damaged and the transaction is partially complete!<br> <p> Cheers,<br> Wol<br> </div> Sun, 17 Jan 2021 21:23:48 +0000 What about other filesystems? https://lwn.net/Articles/842938/ https://lwn.net/Articles/842938/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; Actually database developers are probably among the first to scream in terror if a filesystem will guarantee strong ordering properties</font><br> <p> I did say the *appearance* of strong ordering guarantees :-)<br> <p> <font class="QuotedText">&gt; If the filesystem gets corrupted, then all user data is lost, so indeed the journal is there to protect the data.</font><br> <p> But if the data in flight is corrupted, then the only way to get the system back to a sane (for the user) state may be &quot;format, recover backup&quot;. Yes making sure the filesystem is itself consistent is important but it&#x27;s only part of the picture. If I can&#x27;t trust the state of the data, I need to recover from backup.<br> <p> Cheers,<br> Wol<br> </div> Sun, 17 Jan 2021 21:18:25 +0000 What about other filesystems? https://lwn.net/Articles/842937/ https://lwn.net/Articles/842937/ matthias <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; Actually this was very well what could happen and still can happen with non-journaling filesystems. Plugging the power-cord in the wrong moment and you have a broken filesystem that cannot not be mounted. With a bit of luck you can get your files back with fsck. Fortunately, the situation improved. In a journaling filesystem you should in any way only loose data in modification.</font><br> <p> <font class="QuotedText">&gt;And this is pretty much the perfect example of what is wrong with the current setup. 
The filesystem journal is there to protect the filesystem, to hell with the user&#x27;s data. </font><br> <p> If the filesystem gets corrupted, then all user data is lost, so indeed the journal is there to protect the data.<br> <p> <font class="QuotedText">&gt;So HOW as a database developer am I supposed to protect my database (other than writing to a raw partition!) if I can&#x27;t trust the filesystem to protect my user-space journal!</font><br> <p> You can trust it. Do an fsync on the user-space journal. Anything else will not work and has never worked. Ok, especially if you are developing databases, O_DIRECT might be a viable alternative. And writing to a raw partition also does not guarantee that the writes are in order. You still have to flush the caches to ensure that something actually has hit the disk, or use synchronous IO from the beginning.<br> <p> <font class="QuotedText">&gt;That&#x27;s why ext3 journal=&quot;ordered&quot; was so good - it gave the APPLICATION DEVELOPERS the guarantee that, after a crash, writes *appeared* to have been written to disk in the order that they were made. That&#x27;s ALL a developer needs!</font><br> <p> This is clearly wrong. data=ordered only ensures that data is written before related meta-data. Writes to different files are not guaranteed to be ordered. Overwriting a file is not guaranteed to be ordered. The blocks can hit the disk in a random order. The thing that is guaranteed by data=ordered is that when appending to a file the new data hits the disk before the new blocks are added to the inode, i.e., it is not possible that the file contains garbage instead of the new data. And I think it is guaranteed that a rename is only done after the renamed file has hit the disk. This of course helps those people that update data by the create-new-file-then-rename model. But there is not much that really helps for databases.<br> <p> <font class="QuotedText">&gt;(Oh, and I don&#x27;t think my work on raid would help that, even if raid could provide those guarantees which I hope it can, that an application could rely on it unless the filesystem ALSO provided those guarantees.)</font><br> <p> The work on raid is essential, as of course all layers below the filesystem have to provide certain data safety features. Especially the filesystem requires some kind of transactional semantics for its own journal.<br> <p> Best,<br> Matthias<br> <p> P.S.: Actually database developers are probably among the first to scream in terror if a filesystem will guarantee strong ordering properties because of the immense performance penalties, which they do not want to buy. At least the big systems know pretty well which data has which ordering requirements, which data should be in memory cache or will likely not be used again soon and want to control all these aspects themselves. If they do not use a raw partition from the beginning, they will just use the filesystem to reserve a bunch of blocks and then use synchronous IO.<br> </div> Sun, 17 Jan 2021 21:03:20 +0000 What about other filesystems? https://lwn.net/Articles/842936/ https://lwn.net/Articles/842936/ matthias <div class="FormattedComment"> <font class="QuotedText">&gt;&gt; Taking code that didn&#x27;t require fsync (because it didn&#x27;t (sic) exist) and, in the words of zlynx, saying that &quot;it&#x27;s broken&quot; makes all ISO C code that needs data safety broken, which seems extreme.</font><br> <font class="QuotedText">&gt;Actually, I think that&#x27;s called a regression, is it not? 
And one of Linus&#x27; absolute rules is &quot;no regressions&quot;, isn&#x27;t it?</font> There is no regression. The code works as well as it did back in the day. Back in the day it was clear that the data is only safe if the system is working properly, including no power outages. If you make sure that your system never crashes, the old code will work fine. If the system crashes, the old code might lose data, but this was always the case with this code. If you want additional guarantees (like no data loss in case of power loss), you have to use fsync. <br> <p> Best,<br> Matthias<br> </div> Sun, 17 Jan 2021 20:31:46 +0000 What about other filesystems? https://lwn.net/Articles/842935/ https://lwn.net/Articles/842935/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; Actually this was very well what could happen and still can happen with non-journaling filesystems. Plugging the power-cord in the wrong moment and you have a broken filesystem that cannot not be mounted. With a bit of luck you can get your files back with fsck. Fortunately, the situation improved. In a journaling filesystem you should in any way only loose data in modification.</font><br> <p> And this is pretty much the perfect example of what is wrong with the current setup. The filesystem journal is there to protect the filesystem, to hell with the user&#x27;s data. So HOW as a database developer am I supposed to protect my database (other than writing to a raw partition!) if I can&#x27;t trust the filesystem to protect my user-space journal!<br> <p> That&#x27;s why ext3 journal=&quot;ordered&quot; was so good - it gave the APPLICATION DEVELOPERS the guarantee that, after a crash, writes *appeared* to have been written to disk in the order that they were made. That&#x27;s ALL a developer needs!<br> <p> (Oh, and I don&#x27;t think my work on raid would help that, even if raid could provide those guarantees which I hope it can, that an application could rely on it unless the filesystem ALSO provided those guarantees.)<br> <p> Cheers,<br> Wol<br> </div> Sun, 17 Jan 2021 17:58:39 +0000 What about other filesystems? https://lwn.net/Articles/842934/ https://lwn.net/Articles/842934/ Wol <div class="FormattedComment"> What I find hard to understand is, if the database (SQLite, whatever) is using linux syscalls, how does it know the data has actually been written? Or does it do loads of sync()s, and then pause all writes for ten seconds or so waiting for the data to flush, etc etc.<br> <p> I can see how databases can provide 99.999% reliability. I&#x27;m active on the raid list. I know all about disk timeouts, disks lying, how long things take to get flushed, etc etc. I simply do not see how an application can guarantee safety.<br> <p> As for &quot;why should it be in the kernel&quot; - because LOTS of developers will benefit from the ability to reason about the state of a system in a crash scenario. Why should all the database developers be forced to duplicate each other&#x27;s work?<br> <p> And frankly, if I commit something to the filesystem for saving, surely I should be able to ask the filesystem &quot;have you saved it?&quot; AND BE ABLE TO RELY ON THE ANSWER! (Yep, I know disks lie, and I don&#x27;t expect the file system necessarily to deal with that, but it really should be held responsible for its own actions!)<br> <p> Cheers,<br> Wol<br> </div> Sun, 17 Jan 2021 17:40:18 +0000 What about other filesystems? 
https://lwn.net/Articles/842933/ https://lwn.net/Articles/842933/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; Taking code that didn&#x27;t require fsync (because it didn&#x27;t (sic) exist) and, in the words of zlynx, saying that &quot;it&#x27;s broken&quot; makes all ISO C code that needs data safety broken, which seems extreme.</font><br> <p> Actually, I think that&#x27;s called a regression, is it not? And one of Linus&#x27; absolute rules is &quot;no regressions&quot;, isn&#x27;t it?<br> <p> Cheers,<br> Wol<br> </div> Sun, 17 Jan 2021 17:28:33 +0000 What about other filesystems? https://lwn.net/Articles/842931/ https://lwn.net/Articles/842931/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; 2. I did not claim that old code was &quot;broken,&quot; merely that it was at risk of losing data. My point is that both the application developer and the sysadmin would have been aware of that problem, and would take appropriate steps to remediate it (such as making regular backups, building a RAID, or whatever else makes sense). Everyone should still be taking those steps today, because as you say, nothing is 100% reliable.</font><br> <p> RAID is useless if it can&#x27;t guarantee that stuff has been safely saved to disk ... which it can&#x27;t if the linux layers provide no guarantees ...<br> <p> Backups are pretty much useless BY DEFINITION, because if the data is corrupted while saving to disk (which is what we&#x27;re discussing here), then it&#x27;s not been around long enough to be saved to backup.<br> <p> Come on, all I&#x27;m asking for is the ABILITY TO REASON about what is happening, so I can provide my own guarantees. &quot;The system may or may not have saved this data in the event of a crash&quot; is merely the filesystem guys saying &quot;not our problem&quot;, and the references to the SQLite guys jumping through hoops to make certain is the perfect example of them having to do somebody else&#x27;s job, because surely it&#x27;s the filesystem&#x27;s guys&#x27; job to make sure that data entrusted to the filesystem is actually safely saved by the filesystem.<br> <p> If I can have some guarantee that &quot;this data is saved before that data starts to be written&quot;, then at least I can reason about it.<br> <p> And yes, I know making all filesystems provide these sort of guarantees may be fun - I&#x27;m on the raid mailing list - I know - because I read all the messages and glance at all the patches and all that stuff (and don&#x27;t understand much of it :-) - but when (I know, I know) I find the time to start really digging in to it, I want the raid layer to provide exactly those guarantees.<br> <p> And why can&#x27;t we say &quot;these are the guarantees we *intend* to provide&quot;, and make it a requirement that anything new *does* provide them! If I provide a &quot;flush&quot; in the raid code, I can then pass it on to the next layer down, and then when it says it&#x27;s done it I can then pass success back up (or E_NOTSUPPORTED if I can&#x27;t pass it down). But this is exactly another of those *new* things they&#x27;re trying to get into the linux block layer, isn&#x27;t it - the ability to pass error codes back to the caller other than the most basic of &quot;succeeded&quot; or &quot;failed&quot;, isn&#x27;t it? If they can get that in, surely they can get my &quot;flush&quot; in, can&#x27;t they?<br> <p> Cheers,<br> Wol<br> </div> Sun, 17 Jan 2021 17:25:08 +0000 What about other filesystems? 
https://lwn.net/Articles/842926/ https://lwn.net/Articles/842926/ NYKevin <div class="FormattedComment"> 1. Modern filesystems are much safer than old filesystems, by default. When was the last time you had to run fsck on boot?<br> 2. I did not claim that old code was &quot;broken,&quot; merely that it was at risk of losing data. My point is that both the application developer and the sysadmin would have been aware of that problem, and would take appropriate steps to remediate it (such as making regular backups, building a RAID, or whatever else makes sense). Everyone should still be taking those steps today, because as you say, nothing is 100% reliable.<br> 3. Safety and speed are a tradeoff. But since we can&#x27;t get to 100% safety, the primary value of safety is extrinsic: a safer system causes us to spend less time and resources on recovery (e.g. sitting around waiting for fsck to complete so I can boot my machine). So safety is itself a form of speed, and we can directly compare the time spent on recovery to the time spent on disk I/O - and as it turns out, once you make fsck obsolete, the disk I/O is a lot bigger for most people under most circumstances.<br> </div> Sun, 17 Jan 2021 09:33:17 +0000 What about other filesystems? https://lwn.net/Articles/842922/ https://lwn.net/Articles/842922/ dvdeug <div class="FormattedComment"> Linux can&#x27;t save you if the computer fails due to any number of physical problems. It has always been the case that &quot;you might lose data&quot;. The change is that older Unixes don&#x27;t require you to do anything special to achieve maximal data safety offered, whereas modern systems require you to do something special for the OS to try its best. Arguably (as Wol does), going from ordered behavior to complex reordering is a downgrade in the promised level of support.<br> <p> Taking code that didn&#x27;t require fsync (because it doesn&#x27;t exist) and, in the words of zlynx, saying that &quot;it&#x27;s broken&quot; makes all ISO C code that needs data safety broken, which seems extreme. From my perspective, filesystem developers got the ability to increase safety by default or speed by default, and chose speed. That doesn&#x27;t really upset me, so much as the fact that the word &quot;pony&quot; gets pulled out and one side gets painted as unreasonable, instead of it getting painted as a tradeoff and argued on that basis.<br> </div> Sun, 17 Jan 2021 08:16:23 +0000 What about other filesystems? https://lwn.net/Articles/842920/ https://lwn.net/Articles/842920/ NYKevin <div class="FormattedComment"> It depends. Let&#x27;s go through these:<br> <p> - For fsync()&#x27;ing multiple files, the standard answer is &quot;use a thread pool.&quot; This is also the standard answer to &quot;I want asynchronous I/O like on Windows,&quot; so no surprise there.<br> - As the article mentions, they are discussing an &quot;fsync multiple files&quot; syscall, which will (probably) further alleviate this problem (if it actually happens).<br> - I&#x27;m not aware of any syscall called &quot;fsfsync(),&quot; so I assume you meant syncfs(2). That function is not in POSIX, so all we have to go on is the note in that man page, which explicitly states that &quot;sync() or syncfs() provide the same guarantees as fsync() called on every file in the system or filesystem respectively.&quot;<br> - POSIX says that sync(2) is not required to wait for the writes to complete before returning (unlike fsync()). 
As noted above, POSIX does not specify syncfs() at all.<br> - Arguably, a conforming implementation could implement sync() as a no-op, because POSIX says it causes outstanding data &quot;to be scheduled for writing out&quot; - but it was *already* scheduled for writing out.<br> - Therefore, if you want to be pedantically POSIX-correct, you should not use sync(2) at all, because it gets you exactly nothing according to the standard.<br> - Since syncfs() is already Linux-specific, you can rely on its Linux-specific guarantees, if you are in a position to call it in the first place.<br> </div> Sun, 17 Jan 2021 04:41:21 +0000 What about other filesystems? https://lwn.net/Articles/842915/ https://lwn.net/Articles/842915/ orib <div class="FormattedComment"> <font class="QuotedText">&gt; All this talk about &quot;apps should use fsync&quot; or &quot;apps should use fsfsync&quot; fills me with horror, as someone who wants to write a system that relies on data integrity. Are you telling me that my app needs to be filesystem-aware, and not only that but aware of what mount options were used, so I know which commands to call to make sure that my data is safe?</font><br> <p> No, the opposite: your app needs to be aware of the *STANDARDS* that the filesystem is attempting to conform to, rather than what the implementation of the day happens to do.<br> </div> Sun, 17 Jan 2021 00:53:03 +0000 What about other filesystems? https://lwn.net/Articles/842912/ https://lwn.net/Articles/842912/ Wol <div class="FormattedComment"> <font class="QuotedText">&gt; Why should fsbarrier() be any different in this regard than fsync(). Neither of the two requires the system to cripple performance. And both of them can be implemented by just forcing a global filesystem sync. The performance of fsync is getting much better, as the developers actually use the freedom they have. But I am wondering why you expect filesystem developers to implement the (from a filesystem perspective) much harder fsbarrier() call more efficiently than the relatively straightforward fsync() call. fsbarrier() would probably require a major rewrite of the VFS layer to even be able to compute the list of files that are effected by such a call. Chances are good that developers will use similar shortcuts as they have done for fsync() for decades and performance of the whole system will cripple with such a call.</font><br> <p> So let&#x27;s say I want to guarantee - let&#x27;s say ten or twenty - files have all flushed before I start writing the next file, can I do those fsync()s in parallel? Without having to spawn 20 threads and then wait on them all? Whatever, that&#x27;s a lot of work.<br> <p> And with an fsfsync, again does that provide the ordering guarantee? 
I&#x27;ve heard that yes it guarantees everything that&#x27;s been written gets flushed, but does it put a hard barrier in (like my fsbarrier()), or does it just stall all new writes until all the old writes have been flushed, or does it just guarantee that everything written before the fsfsync is flushed but it doesn&#x27;t stop newer writes being merged forwards and being caught up in the flush?<br> <p> Because if fsfsync() puts that barrier in, I&#x27;m simply changing a synchronous fsfsync() to an asynchronous fsbarrier(), if it&#x27;s the second option it&#x27;s causing a performance impact on the system, and if it&#x27;s the third option then my app has to do a synchronous call with the performance impact that implies.<br> <p> Cheers,<br> Wol<br> </div> Sat, 16 Jan 2021 21:42:40 +0000 What about other filesystems? https://lwn.net/Articles/842909/ https://lwn.net/Articles/842909/ matthias <div class="FormattedComment"> <font class="QuotedText">&gt;AND YOU CAN&#x27;T EVEN RELY ON JOURNALLING because you don&#x27;t know whether the file system has written the journal before, after, or in the middle of writing the data.</font><br> <p> Journalling was primarily invented to ensure the integrity of the filesystem. I.e., to avoid a total loss of the filesystem in case of power loss/crash.<br> <p> <font class="QuotedText">&gt; Really, all I want is something like fsbarrier(), which GUARANTEES that stuff written after it is written after stuff that was written before it. </font><br> <p> This would be quite nice. fsync() only guarantees ordering for data written to the given file descriptor. fsbarrier() would probably be easier to use for the app developer. No need to call it for every involved file descriptor. And yes, in many cases guaranteeing ordering would be enough. No need to actually force the data to the disk before the syscall can return.<br> <p> <font class="QuotedText">&gt; I don&#x27;t give a monkeys whether the filesystem batches, parallelises, or what ever other O_PONIES writes, provided I can reason that this call makes sure my stuff hits the disk in the order I expect.</font><br> <font class="QuotedText">&gt; If I want to trash my application&#x27;s performance with excessive use of fsbarrier(), that&#x27;s my problem. If the OS expects me to trash EVERYONE ELSE&#x27;S performance with excessive use of fsync() or fsfsync(), then that&#x27;s a BIG problem for the OS!</font><br> <p> Why should fsbarrier() be any different in this regard than fsync(). Neither of the two requires the system to cripple performance. And both of them can be implemented by just forcing a global filesystem sync. The performance of fsync is getting much better, as the developers actually use the freedom they have. But I am wondering why you expect filesystem developers to implement the (from a filesystem perspective) much harder fsbarrier() call more efficiently than the relatively straightforward fsync() call. fsbarrier() would probably require a major rewrite of the VFS layer to even be able to compute the list of files that are effected by such a call. Chances are good that developers will use similar shortcuts as they have done for fsync() for decades and performance of the whole system will cripple with such a call.<br> <p> <font class="QuotedText">&gt; Oh - and wasn&#x27;t advice about how to shut a system down always &quot;# sync; sync; sync; halt&quot;? So all of us old hands expect sync() to do a filesystem flush? 
And do you really expect me as a developer to do that after most writes when I expect something like that to bring the system to its knees?</font><br> <p> sync guarantees a full filesystem flush. No changes there. That is indeed a bit of overkill if you just require ordering. fsync used to be quite inefficient as well, but it is getting better in this regard. And I know nobody who suggests to use sync in normal apps. fsync should be enough if used correctly.<br> <p> Best,<br> Matthias<br> </div> Sat, 16 Jan 2021 21:10:06 +0000 What about other filesystems? https://lwn.net/Articles/842911/ https://lwn.net/Articles/842911/ NYKevin <div class="FormattedComment"> The fundamental problem with this argument is that the API you describe can be (and has been) implemented in userspace (in the form of SQLite, as well as numerous &quot;real&quot; databases). Therefore, if you want to argue in favor of doing this in kernel space, it is not enough to argue that a new API would be &quot;better&quot; in various ways. You need to *specifically* address one question: Why should anyone re-implement already working userspace code in the kernel? Would it provide some performance advantage? Would it somehow enable you to do things that you can&#x27;t currently do? Or would it just be &quot;more convenient?&quot; If the latter, how is that the kernel&#x27;s problem?<br> </div> Sat, 16 Jan 2021 20:49:23 +0000