
Other use cases

Posted Mar 3, 2026 16:10 UTC (Tue) by farnz (subscriber, #17727)
In reply to: Other use cases by epa
Parent article: The ongoing quest for atomic buffered writes

There are three notifications the kernel could give you:

  1. The data is visible to other readers, but not necessarily the storage device.
  2. The data is in the storage device's volatile cache, but not necessarily persistent.
  3. The device reports that the data has been persisted to the long-term storage medium, and absent a hardware issue, will be safe after a system restart.

The problem is that the second notification is not normally a lot more valuable than the first (they both mean "your data is not yet safe, but is visible to other readers", just with different values of "not yet safe"), while the third is costly (either you flush caches to convert "in the storage device's volatile cache" into "persisted", or you use a more expensive "forced unit access" write command, and prevent the device from doing its normal reordering optimizations on writes).

It's thus best for performance if you have to choose up front which of the three notifications you get, and accept the cost of the "data is persisted" notification if that's the one you've chosen. It could, for example, be worth accepting the increased cost because you're writing to a USB 2.0 attached flash drive: the extra cost is minimal compared to the cost of the write itself, and well worth it to avoid a long delay between the write finishing and the device becoming safe to remove. Or it might not be worth the extra cost because you're writing to NVMe storage, and the user won't remove it.
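
As a rough illustration of choosing the notification up front, here is a minimal Python sketch built on today's POSIX calls. The `Notify` enum and `write_with_notification` are hypothetical names, not an existing or proposed kernel API. The sketch assumes that a returned `write()` corresponds to "visible to other readers" and that a returned `fsync()` corresponds to "the device reports the data as persisted"; it has no portable way to observe the middle state, so it refuses that one.

```python
import os
from enum import Enum


class Notify(Enum):
    """Hypothetical names for the three notifications described above."""
    VISIBLE = 1        # visible to other readers, not necessarily on the device
    DEVICE_CACHE = 2   # in the device's volatile cache
    PERSISTED = 3      # device reports the data as persistent


def write_with_notification(path: str, data: bytes, level: Notify) -> None:
    """Write data and return once the chosen notification would fire."""
    if level is Notify.DEVICE_CACHE:
        # Assumption: no portable call reports this state on its own.
        raise NotImplementedError("no portable equivalent for DEVICE_CACHE")

    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        if level is Notify.PERSISTED:
            # Pay the flush cost only when persistence was asked for.
            os.fsync(fd)
    finally:
        os.close(fd)
```

The caller picks the level before writing, so the cheap path never pays for a flush it did not ask for.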



Other use cases

Posted Mar 4, 2026 10:34 UTC (Wed) by taladar (subscriber, #68407) [Link] (3 responses)

Technically there could also be more layers, e.g. if you have a RAID controller with its own cache or some sort of system where NVMe disks cache data from rotating ones.

Other use cases

Posted Mar 4, 2026 12:54 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

Even with more layers, the interesting notifications remain similar - they're telling you how big a failure is needed to lose your data. The interesting states are "any failure", "only a failure in the storage subsystem, but one that's quite likely to happen" and "only unlikely failures that we try to engineer out of the system completely".
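
One way to picture this is to name each state after the smallest failure that would lose the data. The Python sketch below is only a toy model: the `Failure` and `State` enums, their members, and the example failures in the comments are assumptions made for illustration, not anything the kernel exposes.

```python
from enum import IntEnum


class Failure(IntEnum):
    """Hypothetical failure classes, ordered from smallest to largest."""
    HOST_CRASH = 1         # e.g. the kernel crashes or the host loses power
    STORAGE_SUBSYSTEM = 2  # e.g. the device loses its volatile cache
    UNLIKELY = 3           # e.g. the storage medium itself fails


class State(IntEnum):
    """Hypothetical states, named for the smallest failure that loses data."""
    LOST_BY_ANY_FAILURE = 1
    LOST_BY_STORAGE_FAILURE = 2
    LOST_BY_UNLIKELY_FAILURE = 3


def survives(state: State, failure: Failure) -> bool:
    """Data survives only failures smaller than the one the state names."""
    return failure < state
```

Extra layers (a RAID controller cache, a caching tier) would change which hardware event counts as which failure class, but in this model they would not add new states.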

Other use cases

Posted Mar 5, 2026 13:11 UTC (Thu) by taladar (subscriber, #68407) [Link] (1 responses)

True, my point was more that you probably don't want to bake assumptions about current storage architectures into the long-lived FS API.

Other reasons for more layers with messy properties might include network filesystems which might have local cached storage in case of network failure.

Other use cases

Posted Mar 5, 2026 13:56 UTC (Thu) by farnz (subscriber, #17727) [Link]

Yep - you want to base the notifications on what's interesting to the application, not on the current layering; one for "other readers will see the write" (obvious), one for "this system can crash without losing data, but if the storage stack crashes, you may lose data" (there are commit protocols where this is the point at which you can tentatively acknowledge the write), and one for "data should not be lost even if everything crashes" (since this is the final state for all writes) are still probably the interesting three, no matter how complicated you make the underlying layers.

I can also see some room for splitting the first into two: "other readers on this system" and "other readers with access to this storage stack" - but again, the point has to be "what is interesting to the application about this state", as opposed to "what states do I know about?", with "what states do I know about?" only being useful for preventing you offering notifications you can't provide at a sensible cost.

How much would it cost for every NVMe drive to have a non-volatile cache?

Posted Mar 8, 2026 21:24 UTC (Sun) by DemiMarie (subscriber, #164188) [Link] (3 responses)

If NVMe drives were required to have non-volatile caches, how much would it add to the BOM cost?

How much would it cost for every NVMe drive to have a non-volatile cache?

Posted Mar 9, 2026 14:55 UTC (Mon) by farnz (subscriber, #17727) [Link] (2 responses)

What difference does it make? The notifications (in kernel caches, sent to device but not yet guaranteed on non-volatile storage, guaranteed as safe as the device can make it) are the same whether the cache is volatile, or non-volatile.

And the BOM cost for being able to save the volatile cache to non-volatile store is not small - maybe $10 for the capacitance to power the DRAM, flash and MCU until the cache is saved, plus however much extra flash you need so that you have a "safe space" to write the volatile cache to. You can't just make the device only have non-volatile cache, because the performance characteristics of non-volatile memory aren't what you need when you're doing things like "read 1 MiB from the main flash, replace 4 KiB with this new change, write 1 MiB to a new location, mark the original 1 MiB as safe to erase".
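
The read-modify-write sequence quoted above can be sketched in a few lines of Python. This is a toy model, not drive firmware: `flash` is a dict standing in for the main flash, `erasable` is a set of locations that may be erased, and `rewrite_block` is a hypothetical helper. The working buffer is the part that wants fast, byte-addressable memory.

```python
ERASE_BLOCK = 1024 * 1024  # 1 MiB, the block size used in the comment
PAGE = 4 * 1024            # 4 KiB, the size of the update


def rewrite_block(flash: dict, erasable: set, old_loc: int, new_loc: int,
                  offset: int, update: bytes) -> None:
    """Apply a 4 KiB update by copying a whole 1 MiB block to a new location."""
    if len(update) != PAGE:
        raise ValueError("update must be exactly one 4 KiB page")
    if offset % PAGE != 0 or offset + PAGE > ERASE_BLOCK:
        raise ValueError("offset must be page-aligned and inside the block")

    buf = bytearray(flash[old_loc])     # read 1 MiB from the main flash
    buf[offset:offset + PAGE] = update  # replace 4 KiB with the new change
    flash[new_loc] = bytes(buf)         # write 1 MiB to a new location
    erasable.add(old_loc)               # mark the original as safe to erase
```

Note that the old block is left intact until the new copy is written; only then is it marked erasable.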

Note, too, that depending on the NVMe design, a write in non-volatile cache may not be considered "safe", because the chances of the non-volatile cache being corrupted are too high. At least one device I've looked at with non-volatile cache has a clear statement that the non-volatile cache is only safe during a commanded shutdown of the device, and not a power loss event, because the microcontroller may scribble over non-volatile cache during power loss.

How much would it cost for every NVMe drive to have a non-volatile cache?

Posted Mar 9, 2026 16:20 UTC (Mon) by Wol (subscriber, #4433) [Link] (1 responses)

> You can't just make the device only have non-volatile cache, because the performance characteristics of non-volatile memory aren't what you need when you're doing things like "read 1 MiB from the main flash, replace 4 KiB with this new change, write 1 MiB to a new location, mark the original 1 MiB as safe to erase".

Did NVRAM use to be 4KiB blocks? Can you still get it?

Okay, I don't know the cost implications, but couldn't you write your 4K updates to a 4K NVRAM cache, and then use normal DRAM as your cache for actually doing your "read 1MiB in, update 4K, write 1 MiB out"? And then you just need to be able to flush your incoming data to the 4K NVRAM on a power fail, which is an easier problem than flushing everything to the permanent backing store?

(And have a "here is a 1MiB write" command, so if the OS is streaming data to disk it can still send data in chunks that match the NVMe block size for efficiency).

Cheers,
Wol

How much would it cost for every NVMe drive to have a non-volatile cache?

Posted Mar 9, 2026 16:32 UTC (Mon) by farnz (subscriber, #17727) [Link]

You can get various forms of flash with different block sizes; the smaller the blocks, the more expensive it becomes per byte, and the faster you want it, the more expensive it gets.

But the problem is that you don't have just 4K of updates at a time to handle - a high performance NVMe drive is handling gigabytes of unwritten data in its volatile cache, and having gigabytes of very fast flash is expensive, as compared to gigabytes of RAM, gigabytes of reasonably fast flash, and a supercapacitor to let you flush the RAM to the fast flash before you lose power completely. However, even that's going to be significant - doing it with pSLC flash will add maybe $30 to your $100/TiB NVMe SSD, more if you want a bigger cache (useful for performance).
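
To put the rough figures above in proportion, here is a small Python calculation. It takes the $30 and $100/TiB estimates from the comment as defaults and assumes, purely for illustration, that the cache cost stays fixed as the drive capacity grows; `cache_overhead` is a hypothetical helper.

```python
def cache_overhead(drive_tib: float, cache_cost: float = 30.0,
                   cost_per_tib: float = 100.0) -> float:
    """Fraction added to the drive price by the protected cache."""
    if drive_tib <= 0:
        raise ValueError("drive capacity must be positive")
    # Assumption: the cache cost does not scale with drive capacity.
    return cache_cost / (drive_tib * cost_per_tib)
```

With these defaults, the cache adds about 30% to the price of a 1 TiB drive, and a smaller share on larger drives if the cache size is held constant.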

