
The NOVA filesystem

By Jonathan Corbet
August 4, 2017
Nonvolatile memory offers the promise of fast, byte-addressable storage that persists over power cycles. Taking advantage of that promise requires the imposition of some sort of directory structure so that the persistent data can be found. There are a few approaches to the implementation of such structures, but the usual answer is to employ a filesystem, since managing access to persistent data is what filesystems were created to do. But traditional filesystems are not a perfect match to nonvolatile memory, so there is a natural interest in new filesystems that were designed for this medium from the beginning. The recently posted NOVA filesystem is a new entry in this race.

The filesystems that are currently in use were designed with a specific set of assumptions in mind. Storage is slow, so it is worth expending a considerable amount of CPU power and memory to minimize accesses to the underlying device. Rotational storage imposes a huge performance penalty on non-sequential operations, so there is great value in laying out data consecutively. Sector I/O is atomic; either an entire sector will be written, or it will be unchanged. All of these assumptions (and more) are wired deeply into most filesystems, but they are all incorrect for nonvolatile memory devices. As a result, while filesystems like XFS or ext4 can be sped up considerably on such devices, the chances are good that a filesystem designed from the beginning with nonvolatile memory in mind will perform better and be more resistant to data corruption.

NOVA is intended to be such a filesystem. It is not just unsuited for regular block devices, it cannot use them at all, since it does not use the kernel's block layer. Instead, it works directly with storage mapped into the kernel's address space. A filesystem implementation gives up a lot if it avoids the block layer: request coalescing, queue management, prioritization of requests, and more. On the other hand, it saves the overhead imposed by the block layer and, when it comes to nonvolatile memory performance, cutting down on CPU overhead is a key part of performing well.

NOVA filesystem structure

Like most filesystems, NOVA starts with a superblock — the top-level data structure that describes the filesystem and provides the locations of the other data structures. One of those is the inode table, an inode being the internal representation of a file (or directory) within the filesystem. The NOVA inode table is set up as a set of per-CPU arrays, allowing any CPU to allocate new inodes without having to take cross-processor locks.

Free space is also split across a system's CPUs; it is managed in a red-black tree on each processor to facilitate coalescing of free regions. Unlike the inode tables, the free lists are maintained in normal RAM, not nonvolatile memory. They are written back when the filesystem is unmounted; if the filesystem is not unmounted properly, the free list will be rebuilt with a scan of the filesystem as a whole.
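The per-CPU split can be illustrated with a toy allocator. This is a purely illustrative Python sketch, not NOVA's code (which is kernel C managing red-black trees in nonvolatile memory); all names here are invented. The point it shows is that when each CPU owns its own pool, the common allocation case needs no cross-processor coordination.

```python
class PerCpuAllocator:
    """Toy model of per-CPU free-space management (invented names)."""

    def __init__(self, ncpus, pages_per_cpu):
        # Each CPU owns a disjoint range of page numbers.
        self.free = {
            cpu: list(range(cpu * pages_per_cpu, (cpu + 1) * pages_per_cpu))
            for cpu in range(ncpus)
        }

    def alloc(self, cpu, n):
        """Allocate n pages, preferring the requesting CPU's own list."""
        local = self.free[cpu]
        if len(local) >= n:
            pages, self.free[cpu] = local[:n], local[n:]
            return pages
        # Fall back to another CPU's pool; this is the (rare) case where
        # a real implementation would need cross-processor locking.
        for other, lst in self.free.items():
            if len(lst) >= n:
                pages, self.free[other] = lst[:n], lst[n:]
                return pages
        raise MemoryError("no free space")

alloc = PerCpuAllocator(ncpus=4, pages_per_cpu=1024)
pages = alloc.alloc(cpu=2, n=3)   # served from CPU 2's own range
```

Rebuilding such a structure after an unclean unmount is then just a matter of scanning which pages the filesystem's logs reference and treating the rest as free.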

Perhaps the most interesting aspect of NOVA is how the inodes are stored. On a filesystem like ext4, the on-disk inode is a well-defined structure containing much of a file's metadata. To make things fast, NOVA took a different approach based on log-structured filesystems. The result is that an inode in the table is just a pair of pointers to a data structure that looks something like this:

[inode log]

Each active inode consists of a log describing the changes that have been made to the file; all that is found in the inode structure itself is a pair of pointers indicating the first and last valid log entries. Those entries are stored (in nonvolatile memory) in chunks of at least 4KB, organized as a linked list. Each log entry will indicate an event like:

  • The attributes of the file have been changed — a change in the permission bits, for example.

  • An entry has been added to a directory (for directory inodes, obviously).

  • A link to the file was added.

  • Data has been written to the file.

The case of writing data is worth looking at a bit more closely. If a process writes to an empty file, there will be no data pages already allocated. The NOVA implementation will allocate the needed memory from the per-CPU free list and copy the data into that space. It will then append an entry to the inode log indicating the new length of the file and pointing to the location in the array where the data was written. Finally, an atomic update of the inode's tail pointer will complete the operation and make it visible globally.

If, instead, a write operation overwrites existing data, things are done a little differently. NOVA is a copy-on-write (COW) filesystem, so the first step is, once again, to allocate new (nonvolatile) memory for the new data. Data is copied from the old pages into the new if necessary, then the new data is added. A new entry is added to the log pointing to the new pages, the tail pointer is updated, and the old log entry for those pages is invalidated. At that point, the operation is complete and the old data pages can be freed for reuse.

Thus, the "on-disk" inode in NOVA isn't really a straightforward description of the file it represents. It is perhaps better thought of as a set of instructions that, when followed in order (and skipping the invalidated ones), will yield a complete description of the file and its layout in memory. This structure has the advantage of being quite fast to update when the file changes, with minimal locking required. It will obviously be a bit slower when it comes to accessing an existing file. NOVA addresses that by assembling a compact description of the file in RAM when the file is opened. Even that act of assembly should not be all that slow. Remember that the whole linked-list structure is directly addressable by the CPU. Storing this type of structure on a rotating disk, or even on a solid-state disk accessed as a normal block device, would be prohibitively slow, but direct addressability changes things.
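The write path and the log replay described above can be sketched as a toy model. This is illustrative Python with invented names, not NOVA's on-media format; it only demonstrates the append-then-commit ordering and how replaying valid entries reconstructs the file's layout.

```python
class InodeLog:
    """Toy model of a NOVA-style inode log (invented names)."""

    def __init__(self):
        self.entries = []   # stands in for the NVM-resident log chunks

    def append_write(self, offset, pages):
        """COW write: append an entry for the new pages, then invalidate
        any older entry covering the same offset. A real filesystem
        commits the append with a single atomic store to the inode's
        tail pointer."""
        entry = {"type": "write", "offset": offset,
                 "pages": pages, "valid": True}
        self.entries.append(entry)        # step 1: append past the tail
        for old in self.entries[:-1]:     # step 2: invalidate old entry
            if (old["type"] == "write" and old["valid"]
                    and old["offset"] == offset):
                old["valid"] = False      # its pages may now be freed

    def resolve(self):
        """Replay the log to build the in-RAM description of the file."""
        layout = {}
        for e in self.entries:
            if e["type"] == "write" and e["valid"]:
                layout[e["offset"]] = e["pages"]
        return layout

log = InodeLog()
log.append_write(0, ["page_a"])   # first write to an empty file
log.append_write(0, ["page_b"])   # COW overwrite of the same range
```

After the second write, replaying the log yields only the new pages; the old ones are candidates for reuse.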

There is another interesting feature enabled by this log structure. Each entry in the log contains an "epoch number" that is set when the entry is created. That makes it possible to create snapshots by incrementing the global epoch number, and associating the previous number with a pointer to the snapshot. When the snapshot is mounted, any log entries with an epoch number greater than the snapshot's number can be simply ignored to give a view of the file as it existed when the snapshot was taken. There are some details to manage, of course: entries associated with snapshots cannot be invalidated, and those entries have to be passed over when the snapshot is not in use. But it is still an elegant solution to the problem.
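The epoch-number trick can be reduced to a few lines. This is a hedged sketch in Python with invented names, not NOVA's implementation: each entry records the epoch it was created in, and reading "as of" a snapshot's epoch simply skips newer entries.

```python
current_epoch = 0
log = []   # stands in for an inode's log entries

def append(data):
    # Every new entry is stamped with the global epoch number.
    log.append({"epoch": current_epoch, "data": data})

def take_snapshot():
    """Snapshotting is just incrementing the global epoch; the old
    epoch number identifies the snapshot."""
    global current_epoch
    snap = current_epoch
    current_epoch += 1
    return snap

def read(as_of_epoch=None):
    """Replay the log, optionally as of a snapshot's epoch: entries
    newer than the snapshot are simply ignored."""
    return [e["data"] for e in log
            if as_of_epoch is None or e["epoch"] <= as_of_epoch]

append("v1")
snap = take_snapshot()
append("v2")
live = read()                  # current view of the file
old = read(as_of_epoch=snap)   # view as of the snapshot
```

The bookkeeping the article mentions (not invalidating entries that a snapshot still needs) would live in the invalidation path, which this sketch omits.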

DAX and beyond

Readers may be wondering about how NOVA interacts with the kernel's DAX interface, which exists to allow applications to directly map files in nonvolatile memory into their address space, bypassing the kernel entirely for future accesses. It can be hard to make direct mapping work well with a COW-based write mechanism. In this 2016 paper describing NOVA [PDF], the authors say they don't even try. Rather than support DAX, NOVA supports an alternative mechanism called "atomic mmap" which copies data into "replica pages" and maps those instead. In a sense, atomic mmap reimplements a part of the page cache.

One can imagine that this approach was seen as being suboptimal; direct access to nonvolatile memory is one of that technology's most compelling features. Happily, the posted patch set does claim to support DAX. As far as your editor can tell from the documentation and the code, NOVA disables COW for the portions of a file that have been mapped into a process's address space, so changes are made in place. One significant shortcoming is that pages that have been mapped into a process's address space cannot be written to with write(). There is some relatively complex logic (described in this other paper [PDF]) to ensure that the filesystem does the right thing when taking a snapshot of a file that is currently directly mapped into some process's address space.

There are a number of self-protection measures built into NOVA, including checksumming for data and metadata. One of the more interesting mechanisms seems likely to prove controversial, though. One possible hazard of having your entire storage array mapped into the kernel's address space is that writing to a stray pointer can directly corrupt persistent data. That would not be a concern in a bug-free kernel but, well, that is not the world we live in. In an attempt to prevent inadvertent overwriting of data, NOVA can keep the entire array mapped read-only. When a change must be made, the processor's write-protect bit is temporarily cleared, allowing the kernel to bypass the memory permissions. Disabling write protection has been deemed too dangerous in the past; it seems unlikely that the idea will get a better reception now. Protection against stray writes is a valuable feature, though, so hopefully another way to implement it can be found.
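The protection scheme can be modeled abstractly. The real mechanism maps the array read-only and briefly clears the x86 CR0 write-protect bit around intended updates, which nothing outside the kernel can express; this Python sketch (invented names) only mirrors the shape of it: writes fail by default and succeed only inside a narrow, explicitly opened window.

```python
class ProtectedArray:
    """Abstract model of a write-protected persistent array."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._writable = False      # "mapped read-only" by default

    def write(self, off, byte):
        if not self._writable:
            # The analogue of a stray kernel pointer hitting the array.
            raise PermissionError("stray write blocked")
        self._data[off] = byte

    def __enter__(self):            # open the write window
        self._writable = True
        return self

    def __exit__(self, *exc):       # close it again immediately
        self._writable = False

arr = ProtectedArray(16)
blocked = False
try:
    arr.write(0, 1)                 # stray write: rejected
except PermissionError:
    blocked = True
with arr:
    arr.write(0, 1)                 # intended update inside the window
```

The controversy is not with this shape but with the means: clearing the write-protect bit disables the protection for everything the kernel does during the window, not just the intended store.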

There are a few other things that will need to be fixed before NOVA can be seriously considered for merging upstream. For example, it only works on the x86-64 architecture and, due to the per-CPU inode table structure, it is impossible to move a NOVA filesystem from one system to another if the two machines do not have the same number of CPUs. NOVA doesn't support access control lists or disk quotas. There is no filesystem checker tool. And so on. The developers are aware of these issues and expect to deal with them.

The fact that the developers do want to take care of those details and get the filesystem upstream is generally encouraging, but it is especially so given that NOVA comes from the academic world (from the University of California at San Diego in particular). Academic work has a discouraging tendency to stop when the papers are published and the grant money runs out, so the free-software world in general gets far less code from universities than one might expect. With luck, NOVA will be one development that escapes academia and becomes useful to the wider world.

There are, of course, many other aspects of this filesystem that cannot be covered in such a short article. See the two papers referenced above and the documentation in the patch itself for more information. This appears to be a project to keep an eye on; if all goes well, it will show the way forward for getting full performance out of the huge, fast nonvolatile memory arrays that, we're told, we'll all be able to get sometime soon.


The NOVA filesystem

Posted Aug 5, 2017 5:41 UTC (Sat) by alison (subscriber, #63752) [Link] (2 responses)

We can all look forward to the day when userspace borks itself, and we reboot to find that userspace is still borked. Perhaps then there will be some kind of intelligent interface allowing operators to go back one snapshot so that the system comes up all the way? I know that some existing FSs have snapshotting features, but don't know how people use them, beyond transferring data between VMs.

Why not just keep a read-only filesystem in the NVM and then bind- or overlay-mount over it? The start-up time while the overlay was read into memory would be longer, but after that, the performance should be great, with little risk of data loss for some applications.

The NOVA filesystem

Posted Aug 5, 2017 17:45 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

> We can all look forward to the day when userspace borks itself, and we reboot to find that userspace is still borked.
Well, that's exactly what we have in the present day: bork the filesystem, reboot, and it's still going to be borked. NOVA's just another filesystem. Sure, the storage is directly addressable and probably insanely fast, but it's not some fundamentally new abstraction, and the 'bork an important filesystem and reboot will not save you' property is just the same as it is on present-day filesystems.

The NOVA filesystem

Posted Aug 7, 2017 5:05 UTC (Mon) by alison (subscriber, #63752) [Link]

>bork the filesystem, reboot, and it's still going to be borked.

Sorry for being unclear. I was thinking more that having 'memory' and 'storage on filesystems' as distinct things has advantages and disadvantages. Blurring the distinction between them will have many consequences, some of which are unpleasant and perhaps unanticipated.

The NOVA filesystem

Posted Aug 5, 2017 12:09 UTC (Sat) by eSyr (guest, #112051) [Link] (5 responses)

Is it better than tmpfs? If so, can tmpfs be transparently replaced with nova?

The NOVA filesystem

Posted Aug 5, 2017 13:32 UTC (Sat) by dskoll (subscriber, #1630) [Link] (4 responses)

My understanding is that tmpfs will spill into swap space if necessary. That's clearly not something NOVA would want to do. So if you need that feature, then NOVA cannot replace tmpfs.

The NOVA filesystem

Posted Aug 6, 2017 12:17 UTC (Sun) by idra (guest, #36289) [Link] (3 responses)

Why would you ever touch swap? I already have to regularly systemctl mask tmp.mount to avoid useless tmp data sucking away precious RAM...

The NOVA filesystem

Posted Aug 6, 2017 21:25 UTC (Sun) by dskoll (subscriber, #1630) [Link] (2 responses)

tmpfs data doesn't suck away precious RAM. It uses the cache to store data and that memory is reclaimed if needed.

The NOVA filesystem

Posted Aug 10, 2017 15:23 UTC (Thu) by krakensden (subscriber, #72039) [Link] (1 responses)

I don't know how it's implemented, but my first bad experience with tmpfs /tmp was discovering that Vagrant wrote machine images there, and everything started behaving super poorly. Is it safe to use now with large files?

The NOVA filesystem

Posted Aug 10, 2017 15:31 UTC (Thu) by bof (subscriber, #110741) [Link]

tmpfs is totally usable with large files - as long as you have enough RAM to spare and don't have it go to swap.

I use tmpfs continuously on server VMs, with multi GB files and/or up to 80 GB in smaller files (precomputed mysql tables...) - always worked without any issues. However, these VMs don't have any swap configured (or even swap support in the kernel), so YMMV.

The NOVA filesystem

Posted Aug 5, 2017 14:14 UTC (Sat) by vadim (subscriber, #35271) [Link] (1 responses)

What kind of hardware is available to run this on? Anything affordable?

The NOVA filesystem

Posted Aug 6, 2017 4:44 UTC (Sun) by kmeyer (subscriber, #50720) [Link]

The Intel Xpoint storage should come out in DIMM form factor soon, which should work with this. https://en.wikipedia.org/wiki/3D_XPoint

The NOVA filesystem

Posted Aug 5, 2017 14:55 UTC (Sat) by pabs (subscriber, #43278) [Link] (4 responses)

Tying a filesystem to a particular storage technology and thus losing the ability to move the filesystem between storage devices doesn't seem to be a desirable feature in any filesystem from a user perspective.

Technology-specific

Posted Aug 5, 2017 15:57 UTC (Sat) by corbet (editor, #1) [Link] (3 responses)

Tying it to the technology can make sense if your whole purpose is to exploit that technology to its fullest. If you later decide you want to move the filesystem back to your floppy disk array, you'll need to do a copy anyway, so changing the filesystem type shouldn't be a big deal. How often do you move a filesystem image between different media types?

Now the inability to move to another system with a different number of processors...that seems like a problem...

Technology-specific

Posted Aug 5, 2017 17:59 UTC (Sat) by Paf (subscriber, #91811) [Link]

Particularly given that we support dynamic CPU plugging/unplugging...

Technology-specific

Posted Aug 8, 2017 5:37 UTC (Tue) by alison (subscriber, #63752) [Link] (1 responses)

Corbet, what RAID levels does your floppy-array support?

Technology-specific

Posted Aug 18, 2017 11:13 UTC (Fri) by mips (guest, #105013) [Link]

I would imagine it can do RAID 5 but you may be occasionally prompted to insert disc 2.

The NOVA filesystem

Posted Aug 5, 2017 15:09 UTC (Sat) by Tara_Li (guest, #26706) [Link] (15 responses)

What I don't quite get is ... it's memory. Why not just treat it as memory? Shift the processor into 64-bit address space mode and just call it allllll memory.

The NOVA filesystem

Posted Aug 5, 2017 15:55 UTC (Sat) by corbet (editor, #1) [Link] (7 responses)

It's not just memory, it's persistent memory. But persistence is only useful if you can find stuff later. As mentioned in the article, that means you need some sort of directory structure.

The NOVA filesystem

Posted Aug 6, 2017 19:20 UTC (Sun) by Tara_Li (guest, #26706) [Link] (6 responses)

But I thought the point of persistent memory is that it is *just* memory - when you shut the machine down, the CPU quits incrementing the instruction pointer and the system shuts down. When you hit the power button, the instruction pointer picks right back up and the machine keeps running. The memory is managed by the standard memory manager, I would expect - programs know where their memory is supposed to be. The image I'm getting here is that "persistent memory" is nothing but marketing speak - it's actually just a SSD.

The NOVA filesystem

Posted Aug 6, 2017 19:38 UTC (Sun) by josh (subscriber, #17465) [Link]

> When you hit the power button, the instruction pointer picks right back up and the machine keeps running.

That's an eventual goal, but for compatibility with existing software, that doesn't happen yet. And the rest of the hardware on the system doesn't work without power, either, so it's more like suspend-to-RAM but with zero power usage.

The point

Posted Aug 6, 2017 19:42 UTC (Sun) by corbet (editor, #1) [Link] (1 responses)

Many years ago, I worked with a Data General Nova machine that had core memory. It worked that way: turn it off at the end of the day, and it would pick up where it left off in the morning. Most of the time.

Persistent memory is not core memory, though, and it's not a replacement for DRAM, at least not now; it has rather different performance characteristics. So systems will have both types of memory for the foreseeable future. It differs rather significantly from an SSD, though, in that it is byte-addressable by the CPU. That changes a lot of the calculations and is why filesystems like NOVA may make sense.

The point

Posted Aug 9, 2017 13:24 UTC (Wed) by JFlorian (guest, #49650) [Link]

You and I share a common history there. A DG Nova was my first experience with a larger system. I'm always amazed at how far technology has advanced, but this reminder was rather jarring (in a good way).

The NOVA filesystem

Posted Aug 7, 2017 6:24 UTC (Mon) by jem (subscriber, #24231) [Link] (1 responses)

> When you hit the power button, the instruction pointer picks right back up and the machine keeps running.

For that you need not only persistent memory, but also a persistent CPU.

> The image I'm getting here is that "persistent memory" is nothing but marketing speak - it's actually just a SSD.

The big conceptual difference is that the CPU can address the memory directly. The CPU is able to store and load data, and execute code just like with ordinary RAM. That's not possible with today's SSDs.

The NOVA filesystem

Posted Aug 13, 2017 21:30 UTC (Sun) by giraffedata (guest, #1954) [Link]

> When you hit the power button, the instruction pointer picks right back up and the machine keeps running.

> For that you need not only persistent memory, but also a persistent CPU.

You need persistence in a lot more than that. Main memory is only one of myriad holders of state in a computer. Look at how hard suspend/resume is. You can't just store the contents of main memory, power down, then restore that memory upon power up and keep going. Every device in the computer has some volatile memory in it (disk controllers, for example). And since you can't keep the time of day persistent, there is some difficulty there too.

> The image I'm getting here is that "persistent memory" is nothing but marketing speak - it's actually just a SSD.

> The big conceptual difference is that the CPU can address the memory directly.

This misses Tara_Li's point. This difference affects only what kernel code you write to use it. The user still sees persistent storage that works at solid state speeds, that you could use to store data, which is exactly what SSDs are all about.

But Jon points out one difference between persistent memory and SSD that could actually make a difference in its application: persistent memory is byte-addressable.

The NOVA filesystem

Posted Aug 16, 2017 13:56 UTC (Wed) by anton (subscriber, #25547) [Link]

Software has problems like memory leaks and bugs that make process persistence less practical than it may appear at first.

E.g., some earlier version of Microsoft Word exhibited many of the disadvantages of trying to work with process persistence only: When the user saved the document, it produced more or less a dump of its memory, and when loading the document, the memory dump was restored. The document contained a lot of stuff that had been deleted by the user (which led to various interesting findings), and if the memory was corrupted in some way (that would lead to a crash on certain operations), that corruption would end up in the saved document, persisting the crash across the save. Letting a new version of Word load an old document was probably a major programming feat, and the other way was not possible at all.

Saving data in a well-defined format instead of just keeping it in persistent process memory or saving it as a memory dump is more work at first, but it pays off.

We have had persistent OSs like EROS for decades (they saved the changed pages to disk every 30s or so, so you would lose up to that amount of work on a crash, much like with file systems), but they did not catch on, nor did mainstream OSs acquire process persistence features (well, there's hibernation, but the OS is still not designed to let processes live forever). Maybe now, with containers.

So when we have a persistent variant of RAM, we still need to save data from processes, so we still need a file system. The choice of having a log-structured base for such a file system surprised me, though.

The NOVA filesystem

Posted Aug 5, 2017 18:05 UTC (Sat) by Paf (subscriber, #91811) [Link] (4 responses)

In addition to what Corbet noted, most of the proposed tech sits between flash and true RAM in terms of performance. 10x-1000x slower than RAM, depending on tech and what you're measuring (and whose claims of future performance you believe). One could end up quite sad from expecting truly RAM like performance from some persistent memory solutions.

(The above leaves out battery backed DIMMs, I suppose. But they don't seem like a tech with broad applicability.)

The NOVA filesystem

Posted Aug 6, 2017 3:03 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

The current tech is basically a mock-up of future persistent RAM. It's expected to be within the usual DRAM performance range, if not that durable. It'll certainly be faster than flash.

The NOVA filesystem

Posted Aug 6, 2017 17:54 UTC (Sun) by Paf (subscriber, #91811) [Link] (2 responses)

Really? Is it? Intel's marketing of 3D xpoint (available now) sort of suggests it will be RAM level, but the actual numbers are well off. They're faster than flash, but they live between flash and RAM.

They're claiming they'll be ~0.5 microsecond latency, but current 3D xpoint stuff is more like 5 microseconds. Flash is more like 100 microseconds, DRAM is 5-30 nanoseconds. So another 100x faster than 3D xpoint, 1000x if you use the latencies of what you can buy today.

Those numbers are from memory mostly, but I'm pretty sure they're broadly correct. Maybe there's other tech or I've got the #s off a bit...?

So it creates another (fascinating) stopping point in the storage/memory/cache hierarchy.

The NOVA filesystem

Posted Aug 7, 2017 2:23 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

You can not buy 3D xpoint on the open market right now (I tried to buy one), apparently it's sold only to Intel partners under heavy NDAs for evaluation purposes only.

All I was able to find were battery-backed DRAMs with flash for longer-term storage, which is kinda disappointing.

The NOVA filesystem

Posted Aug 7, 2017 12:52 UTC (Mon) by Paf (subscriber, #91811) [Link]

Ah, sorry. I'm drawing my info from public sources, and I thought the reviews meant it was generally available.

That IS disappointing.

The NOVA filesystem

Posted Aug 9, 2017 18:47 UTC (Wed) by iabervon (subscriber, #722) [Link]

The fundamental thing is that emacs communicates to gcc and git by doing one side of a particular set of system calls, and the other ends of this one-to-many-over-time interaction expect to use another set of system calls to retrieve it. Additionally, the system calls have some defined semantics for which intermediate states in the source program's sequence are acceptable to make visible to other programs, including the case where the source program crashes for some reason (not necessarily a power failure).

If you think about the IPC properties of filesystems, it's clear that you need one, even if memory isn't at particular risk from losing power.

The NOVA filesystem

Posted Aug 9, 2017 22:06 UTC (Wed) by zlynx (guest, #2285) [Link]

You can do that if you want. Take a file and mmap it. If NOVA works like tmpfs then writing to the mmap is writing directly into RAM. I haven't checked but I don't see why it would do anything else. Double buffering it would just be silly.

The NOVA filesystem

Posted Aug 5, 2017 22:42 UTC (Sat) by pr1268 (guest, #24648) [Link] (5 responses)

it is impossible to move a NOVA filesystem from one system to another if the two machines do not have the same number of CPUs.

Do you suppose that an option to limit the number of per-CPU arrays of inodes when creating a NOVA filesystem can be implemented?

For example: % mknovafs -O cpu=4 /dev/nvramdevice1 (or similar), such that the number of CPUs may be less than how many CPUs that computer has.

Setting cpu=1 could make a filesystem capable of being exported to any computer. Of course, that would not make sense if your whole purpose is to exploit that technology to its fullest (quoting our Editor).

The NOVA filesystem

Posted Aug 6, 2017 7:50 UTC (Sun) by swanson (guest, #116493) [Link] (4 responses)

We don't need anything so complicated as a separate utility.

The origin of the CPU-count dependence is that NOVA divides PMEM into per-CPU allocation regions. We use the current CPU ID as a hint about which region to use and avoid contention on the locks that protect it.

So moving from a smaller number of CPUs to a larger number of CPUs just means more contention for the locks. Moving from a larger number to a smaller number is no problem at all. So, our current plan is to set the CPU count very high (like 256) when the file system is created.

-steve
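The scheme Steve describes could be sketched as follows. The names are invented and this is not NOVA's code; it only illustrates fixing the region count at filesystem-creation time and using the current CPU ID as nothing more than a hint for picking a region.

```python
NREGIONS = 256   # fixed at mkfs time, independent of the machine

def pick_region(cpu_id):
    """Map the current CPU ID to an allocation region.

    More CPUs than regions just means some CPUs share a region (more
    lock contention); fewer CPUs leaves some regions idle. Either way,
    the filesystem mounts correctly on any machine.
    """
    return cpu_id % NREGIONS

r_small = pick_region(3)     # a 4-CPU box: each CPU gets its own region
r_big = pick_region(300)     # a 300-CPU box: CPU 300 shares a region
```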

The NOVA filesystem

Posted Aug 6, 2017 14:42 UTC (Sun) by jhoblitt (subscriber, #77733) [Link]

This is vaguely similar to XFS' allocation groups?

Number of CPUs

Posted Aug 7, 2017 7:49 UTC (Mon) by skitching (guest, #36856) [Link]

That sounds to me like "bucketing" as used in NoSQL databases or "consistent hashing" algorithms.

The NOVA filesystem

Posted Aug 7, 2017 8:50 UTC (Mon) by alonz (subscriber, #815) [Link] (1 responses)

Does this mean some memory cannot be allocated unless you have all 256 CPUs?

The NOVA filesystem

Posted Aug 7, 2017 9:20 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

I would imagine that each CPU will have several partitions assigned to it. If the number of CPUs is less than the number of partitions then some CPUs will have multiple partitions, if the reverse is true then some partitions will be shared across several CPUs.

The NOVA filesystem

Posted Aug 6, 2017 7:43 UTC (Sun) by swanson (guest, #116493) [Link] (2 responses)

I'm one of NOVA's designers, and I wanted to clarify one point in the (very nice) article:

NOVA does support DAX mmap. The original paper focused on the atomic mmap mechanism because it was novel (such is the way of research), but normal DAX mmap is fully supported. We received some useful (and not very positive) feedback about atomic mmap from several people, so I'm not sure that atomic mmap is likely to remain a feature.

-steve

mmap() + checksum = ?

Posted Aug 7, 2017 13:54 UTC (Mon) by abatters (✭ supporter ✭, #6932) [Link] (1 responses)

> NOVA disables COW for the portions of a file that have been mapped into a process's address space, so changes are made in place.

> There are a number of self-protection measures built into NOVA, including checksumming for data and metadata.

If you mmap() the storage as memory and write to it in-place from the CPU, that would invalidate the data checksum. If the system crashes at that point, what happens to the file with invalid checksum after the system reboots?

mmap() + checksum = ?

Posted Aug 15, 2017 17:58 UTC (Tue) by swanson (guest, #116493) [Link]

This is an interesting problem.

The approach that NOVA takes is to disable parity/checksum protection while data is mmap'd and re-enable it when the mapping is finished. We track overlapping maps etc., and we log mmap operations so we can re-enable protection after reboot.

The reasoning for this approach is that when you use DAX-mmap() you take responsibility for your data. This includes responsibility for updating it consistently and responsibility for protecting it from media errors, etc. As you point out, the file system really can't fill this role (at least not without a significant performance penalty), so it must fall to the application.

-steve

How about ubifs and friends?

Posted Aug 7, 2017 8:28 UTC (Mon) by giggls (subscriber, #48434) [Link]

Reading the Article I was wondering about the differences of this approach compared to ubifs/mtd.

Am I right in the assumption that the devices targeted by ubifs/mtd can not be written in a byte-wise manner while the devices targeted by NOVA filesystem can?

The NOVA filesystem

Posted Aug 7, 2017 16:13 UTC (Mon) by sasha (guest, #16070) [Link] (4 responses)

"Academic work has a discouraging tendency to stop when the papers are published and the grant money runs out"

As the BFQ I/O scheduler (2008: https://lkml.org/lkml/2008/4/1/234 2014: https://lwn.net/Articles/600366/ 2016: https://lwn.net/Articles/709202/) shows, academic work has a discouraging tendency to be ignored by the kernel community for years, because there is no "BIG CORP" behind it. Google or Facebook can say "we already use this patch, please accept it" and the kernel community accepts even an imperfect patch. It is not so easy for patches from the academic community; at least it was not so easy for Paolo Valente.

The NOVA filesystem

Posted Aug 7, 2017 17:12 UTC (Mon) by pizza (subscriber, #46) [Link] (3 responses)

It takes ongoing work to get something merged into the kernel, and that work falls outside the scope of typical academic funding/manpower. Meanwhile, Google/Facebook/etc employs full-time kernel folks, and getting stuff mainlined is part of their job description.

So it's less about $BIGCORP and more to do with the realities of academia being somewhat different from the real world. (Or at least the realities of the kernel development process..)

The NOVA filesystem

Posted Aug 8, 2017 2:36 UTC (Tue) by smckay (guest, #103253) [Link] (2 responses)

I'm very much an outsider, but it definitely does seem like contributions from large companies have an easier time getting merged. I'm sure a lot of it is down to having people experienced with kernel development getting paid to do the work, but I think justifying the change is easier too. Like, is this patch worth the effort? Will it be used? If it's already running on umpty-million Android handsets or a jillion cloud servers or a thousand enterprise customers are slavering to buy the hardware that goes with this driver, the usefulness is already there. Whereas something coming out of academia, probably not production quality yet, without a significant installed base, is going to be harder to justify.

The NOVA filesystem

Posted Aug 9, 2017 17:03 UTC (Wed) by mfuzzey (subscriber, #57966) [Link]

Not sure about big companies having an easier time due to it being easier to justify changes.

I think the justification is far more linked to the technical merits, maintenance load and the impact on the kernel than the number of devices using it.

For example the android "wake lock" stuff took years before being merged (and it wasn't merged "as is" but in a significantly modified form).

There are drivers for very niche devices; driver submissions are accepted, even from "unknown" individuals, provided they respect the license and coding-style rules and pass review. I've never seen any questions about how many of the devices are out there for it to be "worth the effort"...
For this reason, contrary to popular belief, Linux actually supports *more* devices than Windows (particularly true for older devices).

Getting non trivial code into the core kernel though, is significantly more difficult since the potential for breakage is much higher.

But yes, corporate developers do have the advantage of having more time, by virtue of being paid to work on the kernel, and often in-house peer review before anything even gets submitted to the public mailing lists.

The NOVA filesystem/Patches from large companies

Posted Sep 7, 2017 19:22 UTC (Thu) by vomlehn (guest, #45588) [Link]

The biggest reason why contributions from large companies have an easier time getting merged is that they tend to approach the kernel community with complete implementations that have been run for months on a large number of systems, and they often have benchmarks to show performance, memory, etc. improvements over a range of configurations. Compare that with someone who has a notion that they've tested a few times and that seems to work for them.

Additionally, large companies tend to have the resources to hire people and keep them working on the kernel. This allows a level of familiarity and comfort with their work to grow. This is very useful when evaluating the likelihood that someone's work is correct, though the same factors risk a sense of complacency.

Note that large companies do not always get it right. For example, multiple features Google created were too narrowly focused on Android and did not work as the larger community would have wished. It's taken years to remedy the ones that could be remedied. A large part of this was a failure to openly and frequently engage with the kernel community up front.

I would, however, hope that nobody is discouraged if they don't work for a large company. Those companies, economic powerhouses though they may be, tend towards group think and hold no monopolies on creativity. It's a lot of work to get something into the kernel but it still remains well within the abilities of lone developers.

The NOVA filesystem

Posted Aug 7, 2017 17:07 UTC (Mon) by sperl (subscriber, #5657) [Link]

I still wonder why no implementation of shared memory that uses nvram and which persists across reboots/power outages has appeared yet.


Copyright © 2017, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds