Thanks for the nudge

Posted Feb 20, 2025 21:44 UTC (Thu) by koverstreet (✭ supporter ✭, #4296)
Parent article: Filesystem support for block sizes larger than the page size

This was on my todo list, but I'd been too lazy to look up the relevant helper :)

bcachefs now has it in for-next, so it should land in 6.15:
https://evilpiepirate.org/git/bcachefs.git/commit/?h=for-...

bcachefs

Posted Feb 21, 2025 15:16 UTC (Fri) by DemiMarie (subscriber, #164188) [Link] (11 responses)

Given that it is CoW, could bcachefs support arbitrary atomic writes, or even (gulp) transactions in the style of NTFS?

bcachefs

Posted Feb 21, 2025 18:10 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Unlikely. And even NTFS' successor removed the transactional support.

The problem is not in the filesystem itself, where transactions are reasonably easy, but in the VFS layer and the page cache. There's no easy way to impose isolation between parts of the page cache, and rollbacks are even more tricky.

bcachefs

Posted Feb 21, 2025 20:05 UTC (Fri) by koverstreet (✭ supporter ✭, #4296) [Link] (9 responses)

Yes, it could support arbitrary atomic writes (not of infinite size; bcachefs transactions have practical limits). That's if someone wanted to fund it, though: it's not a particularly big interest of mine.

Unfamiliar with NTFS style transactions.

bcachefs

Posted Feb 22, 2025 22:41 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (8 responses)

To the best of my understanding, "NTFS style transactions" means, roughly speaking, "you expose full transaction semantics to userspace, so that userspace can construct transactions with arbitrary combinations of writes, including writes that span multiple files or directories." And then, once it exists and works correctly, you write documentation[1] telling userspace not to use it, supposedly because userspace never really wanted it in the first place (which I find hard to believe, personally).

[1]: https://learn.microsoft.com/en-us/windows/win32/fileio/de...

bcachefs

Posted Feb 22, 2025 23:23 UTC (Sat) by koverstreet (✭ supporter ✭, #4296) [Link] (6 responses)

> Many applications which deal with "document-like" data tend to load the entire document into memory, operate on it, and then write it back out to save the changes. The needed atomicity here is that the changes either are completely applied or not applied at all, as an inconsistent state would render the file corrupt. A common approach is to write the document to a new file, then replace the original file with the new one. One method to do this is with the ReplaceFile API.

Yeah, I tend to agree with Microsoft :) I'm not aware of applications that would benefit, but if you do know of some please let me know.

I'm more interested in optimizations for fsync overhead.

bcachefs

Posted Feb 22, 2025 23:45 UTC (Sat) by intelfx (subscriber, #130118) [Link] (1 response)

> but if you do know of some please let me know.

Package managers? Text editors? Basically anything that currently has to do the fsync+rename+fsync dance?

Now, I'm not saying that someone should get on coding userspace transactions yesterday™, but at a glance, there are definitely uses for that.

bcachefs

Posted Feb 27, 2025 10:30 UTC (Thu) by koverstreet (✭ supporter ✭, #4296) [Link]

That fsync already isn't needed on bcachefs (nor on ext4, and maybe xfs as well), since we do an implicit fsync on an overwrite rename, where we flush the data but not the journal.

That is, you get ordering, not persistence, which is exactly what applications want in this situation.
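
A minimal sketch of the write-then-rename pattern being discussed, in Python. The name atomic_replace and the sync_data flag are made up for illustration. Passing sync_data=False relies on the filesystem flushing the new file's data when the rename overwrites an existing file, as described above for bcachefs and ext4; that behaviour should not be assumed on other filesystems.

```python
import os
import tempfile


def atomic_replace(path, data, sync_data=True):
    """Replace `path` with `data` so readers see the old or the new contents."""
    dirname = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same filesystem),
    # otherwise the final rename cannot be atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            if sync_data:
                # First fsync of the "dance": make the data durable
                # before the rename exposes the new file.
                os.fsync(f.fileno())
        # Atomically swap the new file into place.
        os.replace(tmp, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Second fsync: persist the directory entry, for callers that need the
    # rename itself to survive a crash and not just be ordered after the data.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Note that mkstemp creates the file with mode 0600, so a real implementation would probably also copy the original file's permissions and ownership before renaming.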

bcachefs

Posted Feb 23, 2025 9:37 UTC (Sun) by Wol (subscriber, #4433) [Link]

Depends how much state, across how many files, but (if I understand correctly) I'm sure object based databases could benefit.

I would want to update part of a file (maybe two or three blocks, across a several-meg (or more) file) and the ability to rewrite just the blocks of interest, then flush a new inode or whatever, changing just those block pointers, would be wonderful.

Maybe we already have that. Maybe it's too complicated (as in multiple people trying to update the same file at the same time ...)

Cheers,
Wol
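
For contrast, here is a rough Python sketch of what userspace can do today for a partial update: overwrite the chosen blocks in place with pwrite and then fsync. The 4096-byte block size and the rewrite_blocks helper are assumptions for illustration. Nothing here makes the group of writes atomic; a crash partway through can leave some blocks updated and others not, which is the gap the comment above is asking a filesystem to close.

```python
import os

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def rewrite_blocks(path, updates):
    """Overwrite whole blocks in place; `updates` maps block index -> bytes."""
    fd = os.open(path, os.O_WRONLY)
    try:
        for index, block in sorted(updates.items()):
            if len(block) != BLOCK_SIZE:
                raise ValueError("each update must be exactly one block")
            offset = index * BLOCK_SIZE
            view = memoryview(block)
            # pwrite may write fewer bytes than requested, so loop until done.
            while view:
                written = os.pwrite(fd, view, offset)
                view = view[written:]
                offset += written
        # Make the new data durable. This gives persistence, not atomicity.
        os.fsync(fd)
    finally:
        os.close(fd)
```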

bcachefs

Posted Feb 24, 2025 18:52 UTC (Mon) by tim-day-387 (subscriber, #171751) [Link] (2 responses)

Lustre would benefit from a filesystem agnostic transaction API (at least, in kernel space). The OSD layer is essentially implementing that. We're making a push to get Lustre included upstream and the fate of OSD/ldiskfs/ext4 is one of the big open questions. Having a shared transaction API would make that much easier to answer.

bcachefs

Posted Feb 24, 2025 21:37 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 response)

How does Lustre currently handle transactions? Especially rollbacks?

It looks like transactions in Lustre are more like an atomic group of operations, rather than something long-lived? I.e. you can't start a transaction, spend 2 hours doing something with it, and then commit it?

bcachefs

Posted Feb 25, 2025 16:49 UTC (Tue) by tim-day-387 (subscriber, #171751) [Link]

Currently, Lustre hooks into ext4 transactions in osd_trans_start() and osd_trans_stop() [1]. So the transactions aren't long-lived and are usually scoped to a single function. Lustre patches ext4 (to create ldiskfs) and interfaces with it directly. But it'd probably be better to have a generic way for filesystems to (optionally) expose these primitives. Infiniband has a concept of kverbs: drivers can optionally expose an interface to in-kernel users. We could do something similar for transaction handling.

[1] https://git.whamcloud.com/?p=fs/lustre-release.git;a=blob;...

bcachefs

Posted Feb 23, 2025 1:45 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

> And then, once it exists and works correctly

Except this never happened :) I wrote an application that actually used distributed transactions with NTFS and SQL Server, for video file management from CCTV cameras, some time around 2008.

There were tons of corner cases that didn't work quite right. For example, if you created a folder and a file within that folder, then nobody else could create files in that folder until the transaction committed, because the folder would have to be deleted on rollback.

And this at least made some sense within Windows, as it's a very lock-heavy system. It will make much less sense in a Linux FS.