LWN: Comments on "The NOVA filesystem"

The NOVA filesystem/Patches from large companies

vomlehn — Thu, 07 Sep 2017 19:22:58 +0000

The biggest reason why contributions from large companies have an easier time getting merged is that they tend to approach the kernel community with complete implementations that have been run for months on a large number of systems, and they often have benchmarks showing improvements in performance, memory use, etc. over a range of configurations. Compare that with someone who has an idea that they've tested a few times and that seems to work for them.

Additionally, large companies tend to have the resources to hire people and keep them working on the kernel. This allows a level of familiarity and comfort with their work to grow. This is very useful when evaluating the likelihood that someone's work is correct, though the same factors risk a sense of complacency.

Note that large companies do not always get it right. For example, multiple features Google created were too narrowly focused on Android and did not work as the larger community would have wished. It's taken years to remedy the ones that could be remedied. A large part of this was a failure to openly and frequently engage with the kernel community up front.

I would, however, hope that nobody is discouraged if they don't work for a large company. Those companies, economic powerhouses though they may be, tend towards groupthink and hold no monopolies on creativity. It's a lot of work to get something into the kernel, but it still remains well within the abilities of lone developers.

Technology-specific

mips — Fri, 18 Aug 2017 11:13:10 +0000

I would imagine it can do RAID 5 but you may be occasionally prompted to insert disc 2.

The NOVA filesystem

anton — Wed, 16 Aug 2017 13:56:25 +0000

Software has problems like memory leaks and bugs that make process persistence less practical than it may appear at first.

E.g., some earlier version of Microsoft Word exhibited many of the disadvantages of trying to work with process persistence only: When the user saved the document, it produced more or less a dump of its memory, and when loading the document, the memory dump was restored. The document contained a lot of stuff that had been deleted by the user (which led to various interesting findings), and if the memory was corrupted in some way (that would lead to a crash on certain operations), that corruption would end up in the saved document, persisting the crash across the save. Letting a new version of Word load an old document was probably a major programming feat, and the other way was not possible at all.

Saving data in a well-defined format instead of just keeping it in persistent process memory or saving it as a memory dump is more work at first, but it pays off.
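
To make the contrast concrete, here is a minimal sketch of the two saving strategies. Everything in it is hypothetical: the `Document` class, its `undo_history` field, and the JSON layout are invented for illustration and say nothing about how Word actually stored files. Dumping the whole in-memory state carries deleted text along with it, while an explicit format writes only what the format defines.

```python
import json
import pickle


class Document:
    """Toy in-memory document (hypothetical, for illustration only)."""

    def __init__(self):
        self.text = ""
        self.undo_history = []  # internal state the user never sees

    def insert(self, s):
        self.undo_history.append(self.text)
        self.text += s

    def delete_all(self):
        self.undo_history.append(self.text)
        self.text = ""


def save_as_dump(doc):
    # "Memory dump" style: serialize all internal state, whatever it is.
    return pickle.dumps(vars(doc))


def save_as_format(doc):
    # Well-defined format: only the fields the format specifies, plus a version.
    return json.dumps({"version": 1, "text": doc.text}).encode("utf-8")


doc = Document()
doc.insert("secret draft")
doc.delete_all()
doc.insert("final")

dump_bytes = save_as_dump(doc)
format_bytes = save_as_format(doc)
```

In this sketch the deleted text survives in `dump_bytes` but not in `format_bytes`, and the explicit `version` field is what gives a later reader a chance to handle old files.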

We have had persistent OSs like EROS for decades (they saved the changed pages to disk every 30s or so, so you would lose up to that amount of work on a crash, much like with file systems), but they did not catch on, nor did mainstream OSs acquire process persistence features (well, there's hibernation, but the OS is still not designed to let processes live forever). Maybe now, with containers.

So when we have a persistent variant of RAM, we still need to save data from processes, so we still need a file system. The choice of having a log-structured base for such a file system surprised me, though.

mmap() + checksum = ?

swanson — Tue, 15 Aug 2017 17:58:17 +0000

This is an interesting problem.

The approach that NOVA takes is to disable parity/checksum protection while data is mmap'd and re-enable it when the mapping is finished. We track overlapping maps etc., and we log mmap operations so we can re-enable protection after reboot.

The reasoning for this approach is that when you use DAX-mmap() you take responsibility for your data. This includes responsibility for updating it consistently and responsibility for protecting it from media errors, etc. As you point out, the file system really can't fill this role (at least not without a significant performance penalty), so it must fall to the application.

-steve
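
The bookkeeping described above can be sketched roughly as follows. This is a user-space toy, not NOVA's code: the `ProtectedFile` class, the per-page map counts, the use of CRC32, and the in-memory `mmap_log` list (standing in for a persistent log) are all assumptions made for illustration.

```python
import zlib
from collections import Counter


class ProtectedFile:
    """Toy model: per-page checksums that are suspended while a page is mapped."""

    def __init__(self, data, page_size=4096):
        self.data = bytearray(data)
        self.page_size = page_size
        npages = (len(self.data) + page_size - 1) // page_size
        self.checksums = [self._checksum(p) for p in range(npages)]
        self.map_counts = Counter()  # page -> number of live mappings
        self.mmap_log = []           # stand-in for a persistent log of map/unmap

    def _checksum(self, page):
        start = page * self.page_size
        return zlib.crc32(self.data[start:start + self.page_size])

    def is_protected(self, page):
        return self.map_counts[page] == 0

    def map_pages(self, first, last):
        self.mmap_log.append(("map", first, last))
        for p in range(first, last + 1):
            self.map_counts[p] += 1

    def unmap_pages(self, first, last):
        self.mmap_log.append(("unmap", first, last))
        for p in range(first, last + 1):
            self.map_counts[p] -= 1
            if self.map_counts[p] == 0:
                # Last mapping is gone: recompute and re-enable protection.
                self.checksums[p] = self._checksum(p)

    def verify(self, page):
        if not self.is_protected(page):
            return True  # protection is off; the application owns this data
        return self._checksum(page) == self.checksums[page]


f = ProtectedFile(b"\0" * 8192)
f.map_pages(0, 1)
f.map_pages(1, 1)      # overlapping mapping of page 1
f.data[5000] = 7       # in-place store through a mapping (page 1)
f.unmap_pages(0, 1)
page0_protected = f.is_protected(0)
page1_protected_early = f.is_protected(1)
f.unmap_pages(1, 1)
```

After a crash, replaying something like `mmap_log` would tell a recovery pass which pages were mapped and therefore need their checksums recomputed rather than verified.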

The NOVA filesystem

giraffedata — Sun, 13 Aug 2017 21:30:24 +0000

>> When you hit the power button, the instruction pointer picks right back up and the machine keeps running.
> For that you need not only persistent memory, but also a persistent CPU.

You need a lot more than that to be persistent. Main memory is only one of myriad holders of state in a computer. Look at how hard suspend/resume is. You can't just store the contents of main memory, power down, then restore that memory upon power up and keep going. Every device in the computer has some volatile memory in it (disk controllers, for example). And since you can't keep the time of day persistent, there is some difficulty there too.

>> The image I'm getting here is that "persistent memory" is nothing but marketing speak - it's actually just a SSD.
> The big conceptual difference is that the CPU can address the memory directly.

This misses Tara_Li's point. This difference affects only what kernel code you write to use it. The user still sees persistent storage that works at solid state speeds, that you could use to store data, which is exactly what SSDs are all about.

But Jon points out one difference between persistent memory and SSD that could actually make a difference in its application: persistent memory is byte-addressable.

The NOVA filesystem

bof — Thu, 10 Aug 2017 15:31:23 +0000

tmpfs is totally usable with large files - as long as you have enough RAM to spare and don't have it go to swap.

I use tmpfs continuously on server VMs, with multi-GB files and/or up to 80 GB in smaller files (precomputed mysql tables...), and it has always worked without any issues. However, these VMs don't have any swap configured (or even swap support in the kernel), so YMMV.

The NOVA filesystem

krakensden — Thu, 10 Aug 2017 15:23:36 +0000

I don't know how it's implemented, but my first bad experience with tmpfs /tmp was discovering that Vagrant wrote machine images there, and everything started behaving super poorly. Is it safe to use now with large files?

The NOVA filesystem

zlynx — Wed, 09 Aug 2017 22:06:40 +0000

You can do that if you want. Take a file and mmap it. If NOVA works like tmpfs then writing to the mmap is writing directly into RAM. I haven't checked but I don't see why it would do anything else. Double buffering it would just be silly.
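
As a minimal sketch of "take a file and mmap it": the helper below maps a file and modifies it through the mapping. The function names and the use of a temporary file are illustrative assumptions; whether the stores land directly in the backing memory, as the comment expects, depends on the filesystem underneath (tmpfs or a DAX mount versus an ordinary page-cached one), and nothing here checks that.

```python
import mmap
import os
import tempfile


def write_through_mmap(path, offset, payload):
    """Map the whole file and overwrite bytes in place through the mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:   # length 0 maps the entire file
            m[offset:offset + len(payload)] = payload
            m.flush()                          # ask the OS to write changes back
    with open(path, "rb") as f:
        return f.read()


def demo():
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello world")
        os.close(fd)
        return write_through_mmap(path, 0, b"HELLO")
    finally:
        os.remove(path)
```

Calling `demo()` returns the file contents after the in-place update; no `write()` call is involved in changing the data.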

The NOVA filesystem

iabervon — Wed, 09 Aug 2017 18:47:28 +0000

The fundamental thing is that emacs communicates with gcc and git by doing one side of a particular set of system calls, and the other ends of this one-to-many-over-time interaction expect to use another set of system calls to retrieve the data. Additionally, the system calls have defined semantics for which intermediate states in the source program's sequence are acceptable to make visible to other programs, including the case where the source program crashes for some reason (not necessarily a power failure).

If you think about the IPC properties of filesystems, it's clear that you need one, even if memory isn't at particular risk from losing power.

The NOVA filesystem

mfuzzey — Wed, 09 Aug 2017 17:03:40 +0000

Not sure about big companies having an easier time due to it being easier to justify changes.

I think the justification is far more linked to the technical merits, maintenance load and the impact on the kernel than the number of devices using it.

For example, the Android "wake lock" stuff took years before being merged (and it wasn't merged "as is" but in a significantly modified form).

There are drivers for very niche devices; driver submissions are accepted, even from "unknown" individuals, provided they respect the license and coding style rules and pass review. I've never seen any questions about how many of the devices are out there for it to be "worth the effort"...
For this reason, contrary to popular belief, Linux actually supports *more* devices than Windows (particularly true for older devices).

Getting non-trivial code into the core kernel, though, is significantly more difficult since the potential for breakage is much higher.

But yes, corporate developers do have the advantage of having more time, by virtue of being paid to work on the kernel, and often of in-house peer review before anything even gets submitted to the public mailing lists.

The point

JFlorian — Wed, 09 Aug 2017 13:24:28 +0000

You and I share a common history there. A DG Nova was my first experience with a larger system. I'm always amazed at how far technology has advanced, but this reminder was rather jarring (in a good way).

Technology-specific

alison — Tue, 08 Aug 2017 05:37:48 +0000

Corbet, what RAID levels does your floppy-array support?

The NOVA filesystem

smckay — Tue, 08 Aug 2017 02:36:40 +0000

I'm very much an outsider, but it definitely does seem like contributions from large companies have an easier time getting merged. I'm sure a lot of it is down to having people experienced with kernel development getting paid to do the work, but I think justifying the change is easier too. Like, is this patch worth the effort? Will it be used? If it's already running on umpty-million Android handsets or a jillion cloud servers or a thousand enterprise customers are slavering to buy the hardware that goes with this driver, the usefulness is already there. Whereas something coming out of academia, probably not production quality yet, without a significant installed base, is going to be harder to justify.

The NOVA filesystem

pizza — Mon, 07 Aug 2017 17:12:55 +0000

It takes ongoing work to get something merged into the kernel, and that work falls outside the scope of typical academic funding/manpower. Meanwhile, Google/Facebook/etc. employ full-time kernel folks, and getting stuff mainlined is part of their job description.

So it's less about $BIGCORP and more to do with the realities of academia being somewhat different from the real world. (Or at least the realities of the kernel development process..)

The NOVA filesystem

sperl — Mon, 07 Aug 2017 17:07:26 +0000

I still wonder why no implementation of shared memory that uses nvram and persists across reboots/power outages has appeared yet.

The NOVA filesystem

sasha — Mon, 07 Aug 2017 16:13:38 +0000

"Academic work has a discouraging tendency to stop when the papers are published and the grant money runs out"

As the BFQ I/O scheduler (2008: https://lkml.org/lkml/2008/4/1/234 2014: https://lwn.net/Articles/600366/ 2016: https://lwn.net/Articles/709202/) shows, academic work has a discouraging tendency to be ignored by the kernel community for years, because there is no "BIG CORP" behind it. Google or Facebook can say "we already use this patch, please accept it" and the kernel community accepts even an imperfect patch. It is not so easy for patches from the academic community; at least it was not so easy for Paolo Valente.

mmap() + checksum = ?

abatters — Mon, 07 Aug 2017 13:54:02 +0000

> NOVA disables COW for the portions of a file that have been mapped into a process's address space, so changes are made in place.

> There are a number of self-protection measures built into NOVA, including checksumming for data and metadata.

If you mmap() the storage as memory and write to it in-place from the CPU, that would invalidate the data checksum. If the system crashes at that point, what happens to the file with invalid checksum after the system reboots?
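
The concern can be shown in a few lines. This is only a sketch of the general problem: CRC32 over a 4096-byte block is used here as a stand-in, and nothing about the block size or checksum algorithm is taken from NOVA itself.

```python
import zlib


def stored_checksum_matches(data, stored):
    """Return True if the data still matches the checksum recorded earlier."""
    return zlib.crc32(data) == stored


# The filesystem records a checksum when the block is written normally.
block = bytearray(b"A" * 4096)
stored = zlib.crc32(block)

# A store through an mmap()ed mapping changes the data in place,
# without the filesystem getting a chance to update the checksum.
block[100] = ord("B")

stale = not stored_checksum_matches(block, stored)
```

Here `stale` is true: until something recomputes the checksum, the perfectly valid new data is indistinguishable from corruption.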

The NOVA filesystem

Paf — Mon, 07 Aug 2017 12:52:53 +0000

Ah, sorry. I'm drawing my info from public sources, and I thought the reviews meant it was generally available.

That IS disappointing.

The NOVA filesystem

Cyberax — Mon, 07 Aug 2017 09:20:10 +0000

I would imagine that each CPU will have several partitions assigned to it. If the number of CPUs is less than the number of partitions, then some CPUs will have multiple partitions; if the reverse is true, then some partitions will be shared across several CPUs.
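
One way such an assignment could work is simple modulo arithmetic. The function below is a guess at the scheme being imagined in this comment, not a description of what NOVA does; the name `partitions_for_cpu` and the modulo rule are assumptions.

```python
from typing import List


def partitions_for_cpu(cpu: int, num_cpus: int, num_partitions: int) -> List[int]:
    """Return the partitions a CPU would use under a plain modulo assignment."""
    if num_cpus <= num_partitions:
        # Fewer CPUs than partitions: each CPU owns several partitions.
        return [p for p in range(num_partitions) if p % num_cpus == cpu]
    # More CPUs than partitions: several CPUs share one partition.
    return [cpu % num_partitions]
```

With 2 CPUs and 4 partitions, CPU 1 gets partitions 1 and 3; with 8 CPUs and 4 partitions, CPUs 1 and 5 both land on partition 1.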

The NOVA filesystem

alonz — Mon, 07 Aug 2017 08:50:41 +0000

Does this mean some memory cannot be allocated unless you have all 256 CPUs?

How about ubifs and friends?

giggls — Mon, 07 Aug 2017 08:28:47 +0000

Reading the article, I was wondering how this approach differs from ubifs/mtd.

Am I right in assuming that the devices targeted by ubifs/mtd cannot be written in a byte-wise manner, while the devices targeted by the NOVA filesystem can?

Number of CPUs

skitching — Mon, 07 Aug 2017 07:49:30 +0000

That sounds to me like "bucketing" as used in NoSQL databases or "consistent hashing" algorithms.

The NOVA filesystem

jem — Mon, 07 Aug 2017 06:24:54 +0000

> When you hit the power button, the instruction pointer picks right back up and the machine keeps running.

For that you need not only persistent memory, but also a persistent CPU.

> The image I'm getting here is that "persistent memory" is nothing but marketing speak - it's actually just a SSD.

The big conceptual difference is that the CPU can address the memory directly. The CPU is able to store and load data, and execute code just like with ordinary RAM. That's not possible with today's SSDs.

The NOVA filesystem

alison — Mon, 07 Aug 2017 05:05:40 +0000

>bork the filesystem, reboot, and it's still going to be borked.

Sorry for being unclear. I was thinking more that having 'memory' and 'storage on filesystems' as distinct things has advantages and disadvantages. Blurring the distinction between them will have many consequences, some of which are unpleasant and perhaps unanticipated.

The NOVA filesystem

Cyberax — Mon, 07 Aug 2017 02:23:47 +0000

You cannot buy 3D xpoint on the open market right now (I tried to buy one); apparently it's sold only to Intel partners under heavy NDAs for evaluation purposes only.

All I was able to find were battery-backed DRAMs with flash for longer-term storage, which is kinda disappointing.

The NOVA filesystem

dskoll — Sun, 06 Aug 2017 21:25:32 +0000

tmpfs data doesn't suck away precious RAM. It uses the cache to store data and that memory is reclaimed if needed.

The point

corbet — Sun, 06 Aug 2017 19:42:48 +0000

Many years ago, I worked with a Data General Nova machine that had core memory. It worked that way: turn it off at the end of the day, and it would pick up where it left off in the morning. Most of the time.

Persistent memory is not core memory, though, and it's not a replacement for DRAM, at least not now; it has rather different performance characteristics. So systems will have both types of memory for the foreseeable future. It differs rather significantly from an SSD, though, in that it is byte-addressable by the CPU. That changes a lot of the calculations and is why filesystems like NOVA may make sense.

The NOVA filesystem

josh — Sun, 06 Aug 2017 19:38:14 +0000

> When you hit the power button, the instruction pointer picks right back up and the machine keeps running.

That's an eventual goal, but for compatibility with existing software, that doesn't happen yet. And the rest of the hardware on the system doesn't work without power, either, so it's more like suspend-to-RAM but with zero power usage.

The NOVA filesystem

Tara_Li — Sun, 06 Aug 2017 19:20:22 +0000

But I thought the point of persistent memory is that it is *just* memory - when you shut the machine down, the CPU quits incrementing the instruction pointer and the system shuts down. When you hit the power button, the instruction pointer picks right back up and the machine keeps running. The memory is managed by the standard memory manager, I would expect - programs know where their memory is supposed to be. The image I'm getting here is that "persistent memory" is nothing but marketing speak - it's actually just a SSD.

The NOVA filesystem

Paf — Sun, 06 Aug 2017 17:54:11 +0000

Really? Is it? Intel's marketing of 3D xpoint (available now) sort of suggests it will be RAM level, but the actual numbers are well off. They're faster than flash, but they live between flash and RAM.

They're claiming they'll be ~0.5 microsecond latency, but current 3D xpoint stuff is more like 5 microseconds. Flash is more like 100 microseconds, DRAM is 5-30 nanoseconds. So another 100x faster than 3D xpoint, 1000x if you use the latencies of what you can buy today.

Those numbers are from memory mostly, but I'm pretty sure they're broadly correct. Maybe there's other tech or I've got the #s off a bit...?

So it creates another (fascinating) stopping point in the storage/memory/cache hierarchy.
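
The ratios follow directly from the figures quoted above. The sketch below just does that arithmetic; the numbers are the ones recalled from memory in this comment, not measurements, and the dictionary keys are made up for the example.

```python
# Latencies in nanoseconds, as quoted in the comment above.
LATENCY_NS = {
    "dram_best": 5,
    "dram_worst": 30,
    "xpoint_claimed": 500,     # ~0.5 microseconds
    "xpoint_current": 5_000,   # ~5 microseconds
    "flash": 100_000,          # ~100 microseconds
}


def ratio(slower, faster):
    """How many times longer the slower technology's latency is."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


claimed_vs_dram = ratio("xpoint_claimed", "dram_best")
current_vs_dram = ratio("xpoint_current", "dram_best")
flash_vs_current = ratio("flash", "xpoint_current")
```

Against the fastest DRAM figure this gives the 100x and 1000x gaps mentioned in the comment; against the 30 ns figure the gaps shrink to roughly 17x and 167x, and flash comes out about 20x slower than the current 3D xpoint number.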

The NOVA filesystem

jhoblitt — Sun, 06 Aug 2017 14:42:26 +0000

This is vaguely similar to XFS' allocation groups?

The NOVA filesystem

idra — Sun, 06 Aug 2017 12:17:30 +0000

Why would you ever touch swap? I already have to regularly systemctl mask tmp.mount to avoid useless tmp data sucking away precious RAM...

The NOVA filesystem

swanson — Sun, 06 Aug 2017 07:50:26 +0000

We don't need anything so complicated as a separate utility.

The origin of the CPU-count dependence is that NOVA divides PMEM into per-CPU allocation regions. We use the current CPU ID as a hint about which region to use and avoid contention on the locks that protect it.

So moving from a smaller number of CPUs to a larger number of CPUs just means more contention for the locks. Moving from a larger number to a smaller number is no problem at all. So, our current plan is to set the CPU count very high (like 256) when the file system is created.

-steve
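
A rough sketch of the idea of per-CPU allocation regions with the CPU ID used as a hint. This is a user-space toy under stated assumptions: the `RegionAllocator` class, the free lists, and in particular the fallback to neighbouring regions when the hinted one is empty are all invented for illustration and are not taken from NOVA's source.

```python
import threading


class RegionAllocator:
    """Toy allocator: blocks are split into regions, each with its own lock."""

    def __init__(self, total_blocks, num_regions=256):
        self.regions = []
        per_region = total_blocks // num_regions
        for r in range(num_regions):
            start = r * per_region
            # The last region absorbs any remainder.
            end = total_blocks if r == num_regions - 1 else start + per_region
            self.regions.append(
                {"lock": threading.Lock(), "free": list(range(start, end))}
            )

    def allocate(self, cpu_id):
        """Allocate one block, preferring the region hinted at by cpu_id."""
        n = len(self.regions)
        for i in range(n):
            region = self.regions[(cpu_id + i) % n]
            with region["lock"]:   # contention only with CPUs sharing this region
                if region["free"]:
                    return region["free"].pop()
        raise MemoryError("no free blocks in any region")


alloc = RegionAllocator(total_blocks=8, num_regions=4)
first = alloc.allocate(cpu_id=1)    # from region 1
second = alloc.allocate(cpu_id=5)   # CPU 5 maps onto region 1 as well
third = alloc.allocate(cpu_id=1)    # region 1 is empty, so fall back to region 2
```

In this model, a machine with more CPUs than regions simply has several CPUs hashing to the same lock, while a machine with fewer CPUs leaves some regions to be reached only through the fallback path.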

The NOVA filesystem

swanson — Sun, 06 Aug 2017 07:43:15 +0000

I'm one of NOVA's designers, and I wanted to clarify one point in the (very nice) article:

NOVA does support DAX mmap. The original paper focused on the atomic mmap mechanism because it was novel (such is the way of research), but normal DAX mmap is fully supported. We received some useful (and not very positive) feedback about atomic mmap from several people, so I'm not sure that atomic mmap is likely to remain a feature.

-steve

The NOVA filesystem

kmeyer — Sun, 06 Aug 2017 04:44:03 +0000

The Intel Xpoint storage should come out in DIMM form factor soon, which should work with this. https://en.wikipedia.org/wiki/3D_XPoint

The NOVA filesystem

Cyberax — Sun, 06 Aug 2017 03:03:12 +0000

The current tech is basically a mock-up of future persistent RAM. It's expected to be within the usual DRAM performance range, if not that durable. It'll certainly be faster than flash.

The NOVA filesystem

pr1268 — Sat, 05 Aug 2017 22:42:30 +0000

> it is impossible to move a NOVA filesystem from one system to another if the two machines do not have the same number of CPUs.

Do you suppose that an option to limit the number of per-CPU arrays of inodes when creating a NOVA filesystem can be implemented?

For example: % mknovafs -O cpu=4 /dev/nvramdevice1 (or similar) such that the number of cpus may be less than how many CPUs that computer has.

Setting cpu=1 could make a filesystem capable of being exported to any computer. Of course, that would not make sense if your whole purpose is to exploit that technology to its fullest (quoting our Editor).

The NOVA filesystem

Paf — Sat, 05 Aug 2017 18:05:09 +0000

In addition to what Corbet noted, most of the proposed tech sits between flash and true RAM in terms of performance: 10x-1000x slower than RAM, depending on tech and what you're measuring (and whose claims of future performance you believe). One could end up quite sad from expecting truly RAM-like performance from some persistent memory solutions.

(The above leaves out battery backed DIMMs, I suppose. But they don't seem like a tech with broad applicability.)

Technology-specific

Paf — Sat, 05 Aug 2017 17:59:30 +0000

Particularly given that we support dynamic CPU plugging/unplugging...

The NOVA filesystem

nix — Sat, 05 Aug 2017 17:45:56 +0000

> We can all look forward to the day when userspace borks itself, and we reboot to find that userspace is still borked.

Well, that's exactly what we have in the present day: bork the filesystem, reboot, and it's still going to be borked. NOVA's just another filesystem. Sure, the storage is directly addressable and probably insanely fast, but it's not some fundamentally new abstraction, and the 'bork an important filesystem and reboot will not save you' property is just the same as it is on present-day filesystems.