Preparing for nonvolatile RAM
Matthew Wilcox raised the issue, noting that nonvolatile memory (NVM) is coming, that it promises bandwidth and latency numbers similar to those offered by dynamic RAM, and that, being cheaper than DRAM, it is likely to be offered in larger sizes than DRAM is. He later disclaimed any resemblance between this description and any future products to be offered by his employer; it is, he says, simply where the industry is going. Given that, it would be a good idea for the kernel community to be ready for this technology when it arrives.
One part of being ready is figuring out how to deal with nonvolatile memory within the kernel. The suggested approach was to use a filesystem:
A filesystem approach would allow the association of names with regions of NVM space; an API was then proposed to allow the kernel to perform tasks like mapping regions of NVM into the kernel's address space.
One question that came up quickly was: won't the use of the filesystem
model slow things down? There is a lot of overhead in the block layer,
which was not designed to deal with "storage" that operates at full memory
bandwidth. Matthew was never thinking of bringing in the full block layer,
though; instead, he said: "I'm hoping that filesystem developers will
indicate enthusiasm for moving to new APIs". Such enthusiasm was in short
supply in this discussion; that
is probably more indicative of a lack of thought about the problem than any
sort of active opposition (which was also in short supply).
James Bottomley, though, questioned the filesystem idea, suggesting that NVM should be treated like memory. He said that the way to access NVM might be through the kernel's normal memory APIs, with nonvolatility just being another attribute of interest. One could imagine calling kmalloc() with a new GFP_NONVOLATILE flag, for example. The only problem with this approach is that it is not enough to request an arbitrary nonvolatile region; callers will usually want a specific NVM region that, presumably, contains data from a previous use. So the memory API would have to be extended with some sort of namespace giving reliable access to persistent data. To many, that namespace looks like a filesystem; James suggested using 32-bit keys like the SYSV shared memory mechanism does, but admirers of SYSV IPC tend to be scarce indeed on linux-kernel.
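To make the namespace requirement concrete, here is a toy user-space sketch in Python, not kernel code. It maps 32-bit integer keys to persistent regions, loosely in the spirit of the SYSV-style keys mentioned above. The `RegionRegistry` class, its `get()` method, and the use of ordinary files in a directory as a stand-in for NVM are all assumptions made for illustration; no such API was proposed in the discussion.

```python
import os
import tempfile


class RegionRegistry:
    """Hypothetical keyed namespace; each region is a file standing in for NVM."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        if not 0 <= key < 2**32:
            raise ValueError("key must fit in 32 bits")
        return os.path.join(self.root, f"{key:08x}")

    def get(self, key, size):
        """Return (path, created): reopen region `key`, or create it zero-filled."""
        path = self._path(key)
        created = not os.path.exists(path)
        if created:
            with open(path, "wb") as f:
                f.write(b"\0" * size)
        return path, created


def demo():
    root = tempfile.mkdtemp()
    first = RegionRegistry(root)
    path, created = first.get(0x1234, 4096)
    with open(path, "r+b") as f:
        f.write(b"journal")
    # A "later boot": a fresh registry object finds the same region by key.
    second = RegionRegistry(root)
    path2, created2 = second.get(0x1234, 4096)
    with open(path2, "rb") as f:
        data = f.read(7)
    return created, created2, data
```

The point of the sketch is only that a caller asks for a specific region by name and gets back whatever was stored there before, which is the property a plain allocation flag cannot provide on its own.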
So, while there are a lot of details to be worked out, some sort of name-based kernel API seems certain to come about. Then there will be a mechanism, either through the memory-related or filesystem-related system calls, to make NVM available to user space. But that leads to another, perhaps harder question: what, then, do we do with all that fast, nonvolatile memory?
Some of it, certainly, could be used for caching; technologies like bcache could make good use of it. The page cache could go there; Matthew suggested that the inode cache might be another possibility. Both could speed booting considerably, though it would be necessary to somehow validate the cache contents against filesystems that could have changed while the system was down. Boaz Harrosh suggested that filesystems could store their journals in that space, speeding journal access and reducing journal I/O load on the underlying storage devices. He also mentioned checkpointing the system to NVM, allowing for quick recovery should the system go down unexpectedly. Vyacheslav Dubeyko had some wilder ideas about how NVM could eliminate system bootstrap entirely and make the concept of filesystems obsolete; instead, everything would just live in a persistent object environment.
Clearly, many of these ideas are going to take some time to come to
fruition. Nonvolatile memory changes things in fundamental ways; Linux may
have to scramble to keep up, but, then, that is a high-quality problem to
have. It will be most interesting to watch how this plays out over the
coming years.
Index entries for this article | |
---|---|
Kernel | Memory management/Nonvolatile memory |
Posted May 24, 2012 2:03 UTC (Thu)
by JoeBuck (subscriber, #2330)
[Link]
Perhaps mmap is the right abstraction: if a file is mmap'ed read-write it is randomly addressable and persistent. Another alternative, if the NVRAM is as fast as any other RAM, is to ignore the nonvolatile characteristic and use it as ordinary memory. Or it could be a mixture of both: users can create files in it and what's left is available as memory.
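As a rough illustration of the mmap idea, the following Python sketch writes into a file through a shared read-write mapping and then reads the same bytes back through the ordinary file interface. The temporary file is only a stand-in for an NVM-backed file, and the function name `mmap_roundtrip` and the offset of 100 are arbitrary choices for this example.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(payload=b"hello"):
    """Write through a read-write mapping, then reread the bytes via the file."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)  # a mapping needs a non-empty file
        with mmap.mmap(fd, mmap.PAGESIZE) as region:
            region[100:100 + len(payload)] = payload  # random access by offset
            region.flush()  # write the modified page back to the file
        os.close(fd)
        with open(path, "rb") as f:
            f.seek(100)
            return f.read(len(payload))
    finally:
        os.unlink(path)
```

The mapping is addressed like memory, yet the data outlives it because the file does; with real NVM the backing store would presumably be the memory itself rather than a disk file.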
Posted May 24, 2012 3:12 UTC (Thu)
by pj (subscriber, #4506)
[Link] (1 responses)
I dunno, just an idea I had awhile back.
Posted May 24, 2012 17:41 UTC (Thu)
by cesarb (subscriber, #6266)
[Link]
Posted May 24, 2012 13:24 UTC (Thu)
by felixfix (subscriber, #242)
[Link] (9 responses)
I do not understand the problem as posed here, nor the responses. Why not just treat it exactly like a regular suspend and wakeup? The sole difference is that some peripherals may have lost power, including the usually-non-removable disk drives.
What is the point of trying to make it look like a disk? It reminds me of those fun projects to run Linux on a 6502 emulator running under Windows running under Wine on Linux. Fun, interesting ... but ultimately silly.
Posted May 24, 2012 18:44 UTC (Thu)
by iabervon (subscriber, #722)
[Link] (8 responses)
I think the point of making it look like a disk is the same as the point of having tmpfs: it's nice to have filesystems (with their actual benefit of namespaces, naming, and authorization properties) with full memory bandwidth.
It's also possible that it would be beneficial to be able to keep data in NVM usefully while switching to a new kernel, which requires some sort of in-storage data structures which are stable across kernel versions.
Posted May 24, 2012 19:16 UTC (Thu)
by dlang (guest, #313)
[Link] (5 responses)
Posted May 24, 2012 23:42 UTC (Thu)
by giraffedata (guest, #1954)
[Link] (3 responses)
Don't forget cheaper.
I think the real question may be: after you replace all your DRAM with NVM and do the obvious suspend/resume exploitation, what more can you do with all the additional NVM you have in excess of what used to be DRAM? The article's reference to "available in larger sizes" alludes to this.
A bigger file cache in the kernel was mentioned. This would address the problem some systems have today that their file cache size is limited by how long it takes to prime it after each boot.
The filesystem idea seems to allude to using it for stuff we used to keep on SSDs, but were limited by the cost of dragging it over a SATA wire into a CPU register (stopping off at DRAM along the way).
Posted May 24, 2012 23:57 UTC (Thu)
by dlang (guest, #313)
[Link] (2 responses)
If they are talking about sizes 2x-4x larger than current DRAM devices, the answer is simple: you will use all that you have as normal RAM and not have 'extra' to find a use for.
If they are talking 20x+ larger, then it may start replacing flash in systems, but unless they are talking significantly larger than that, they won't start replacing spinning rust.
Posted Jun 2, 2012 9:45 UTC (Sat)
by Duncan (guest, #6647)
[Link] (1 responses)
Using pricewatch.com as a guide on going street-price, current near-best prices (note that for most gig quantities, prices are higher, and of course spinning rust comes in far larger gig quantities... I simply quickly scanned and picked what appeared to be the best gigs per $ in each category) :
DRAM ~ $4/gig ($32ish, 8 gig)
SSDs ~ $1/gig ($230, 240 gig)
Flash ~ 50 cents a gig ($16, 32 gig USB stick)
Spinning rust 3.5 inch ~ 5 cents a gig ($145, 3000 gig)
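The per-gig figures in the list above are simply price divided by capacity. A quick check in Python, using only the prices and sizes quoted there (the names `QUOTES`, `dollars_per_gig`, and `PER_GIG` are made up for this sketch):

```python
# (price in dollars, capacity in gigabytes), as quoted in the list above
QUOTES = {
    "DRAM": (32, 8),
    "SSD": (230, 240),
    "Flash": (16, 32),
    "Spinning rust 3.5 inch": (145, 3000),
}


def dollars_per_gig(price, gigs):
    """Price per gigabyte for one quoted product."""
    return price / gigs


PER_GIG = {name: dollars_per_gig(*quote) for name, quote in QUOTES.items()}
```

This gives $4.00 for DRAM, about $0.96 for SSDs, $0.50 for flash, and about 4.8 cents for 3.5-inch disks, matching the rounded figures quoted.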
Now it's worth considering just what this NVRAM is being compared against. It's being compared against $4+/gig DRAM, similar latency, higher density, lower cost, except that it happens to be non-volatile.
It's *NOT* being compared against the next step down in both price and latency, SSDs, except lower latency, higher bandwidth, tho its non-volatile nature might make that a more logical direct comparison.
That certainly says a great deal about at least the intended price point. They'd rather be compared against $4/gig DRAM than against $1/gig SSDs.
OK, that sets some pretty close bounds on the practical price-point, well under an order of magnitude. We're looking at, probably, $2-3/gig at today's capacity/price points, tho of course all three technologies (SSD, NVRAM, DRAM) can be predicted to be somewhat below that by the time it comes out in reasonable quantities and target sub-dram prices (the time frame wasn't mentioned, but maybe a couple years?)
You've made a very practical point that at the target sub-dram price-point (tho you said size/capacity, but price per capacity is I believe the controlling factor, or we'd all be using battery-backed DRAM storage for our tebibytes of videos!), perhaps half that of dram, in practice we're simply talking a cheaper dram replacement that has one very different property than current dram -- non-volatility.
That means it's *NOT* going to be the end of the world as we know it. We're simply looking at, for the most part, a cheaper dram with one rather interesting quality compared to current dram. We're NOT, at least near term, going to be replacing spinning rust, for sure.
While it's likely to replace current tech SSDs at the high end, that's likely to simply push them down-market a bit, much as SSDs forced spinning rust down-market quite a bit. Current tech SSDs will in turn drop back toward on par with flash, where it seems they settled for quite awhile, altho they're double flash's price ATM. That will hopefully in turn push flash prices down... so were this NVRAM available at the target price-points today, we might see something like this instead of the above.
DRAM, $3/gig
NVRAM, $2/gig
SSD, 80 cents/gig
Flash, 40 cents/gig
Spinning rust, 4 cents/gig 3.5, 8 cents/gig 2.5
Or for a time NVRAM might floor the dram market, to say $2/gig (relative), with NVRAM at $1.50/gig, then actually increase the price of dram as it gets eclipsed and drops from competitive so that prices for dram don't fall in line with the rest of the market, thus going up relative to it, such that people might end up with half-gig dram machines again, the rest nvram, and dram costing (relative, the whole market would of course be cheaper by then) say $6-8/gig low end as a result. (The dynamic would thus be much like that for eclipsed memory technologies like DDR-1 and PC100/133, today. You can compare their prices per gig on pricewatch, or your favorite alternative, if you like. I just took a WAG above, but just checked and $6-8/gig for DDR is pretty close, actually, PC100/133 about double that! Prices for eclipsed tech do drop, but not by nearly as much, relative to current tech.)
The big point of course being that all this talk about killing filesystems or even SSDs and spinning rust... is VERY premature, to say the least! Otherwise, the comparison would as I said, be to SSDs, but lower latency and higher bandwidth, not to dram, but non-volatile.
Of course predicting ten years out is a tough business. By then, this could well either look like a flash in the pan or could have taken over the whole market. But more likely, it'll be simply incremental, spinning rust will still be around for our then approaching petabyte needs and nvram may or may not have replaced dram and/or ssds but may, if it survives, be used like one or the other, or both, and the world will go on much like it does today, but different, just as today is much like 2002, but different.
Posted Jun 2, 2012 20:43 UTC (Sat)
by dlang (guest, #313)
[Link]
This is similar to the way that flash has essentially eliminated both ROM and EEPROM from the market.
Posted May 25, 2012 0:41 UTC (Fri)
by iabervon (subscriber, #722)
[Link]
Posted May 24, 2012 23:49 UTC (Thu)
by felixfix (subscriber, #242)
[Link] (1 responses)
Some things simply are not upgradeable. If you have an old computer with DRAM, you almost certainly won't be able to replace the DRAM with NVRAM.
And if the NVRAM computer comes with a BIOS that thinks it needs to set up refresh cycles and such, it is much too broken to buy, and anyone who does gets what they deserve.
Posted May 25, 2012 1:06 UTC (Fri)
by iabervon (subscriber, #722)
[Link]
Of course, you could also just say, "unexpected power failures are still fatal; the only benefit of NVM is that you can remove all power while cleanly suspended." I'd expect that to be the behavior if the OS doesn't know about NVM, but I think being able to recover from unexpected power failures is more interesting.
Posted May 24, 2012 18:24 UTC (Thu)
by ncm (guest, #165)
[Link]
In place of a file system, programs exchange pointer values. You can construct a name mapping if you want, or several (and they did), but the OS doesn't insist on it. In effect, every file is memory-mapped, and its inode number is also a pointer to it; alternatively, every allocated memory block is also a file, and its name is also a pointer to it.
AS/400 sneers at your uptime statistics.
Posted May 24, 2012 19:11 UTC (Thu)
by maney (subscriber, #12630)
[Link]
It was interesting trying to fix that CPU board set - microcoded diagnostics kept leading us to change more chips (this was gate level TTL, the microcode ROM & RAM and the ALU slices were about the most complex chips), and it always seemed we were making progress. Then they wondered why this one job was taking so long, which is when they told us about this interesting failure mode, and the board set was sent to the scrap heap.
Posted May 24, 2012 20:48 UTC (Thu)
by daglwn (guest, #65432)
[Link] (25 responses)
In fact we don't really need filesystems today. We could simply use the paging system to write out the contents of memory to persistent storage on a shutdown and load them back up on a restart, not unlike suspend/resume. Pointers become the names of objects.
Posted May 24, 2012 23:26 UTC (Thu)
by viro (subscriber, #7872)
[Link] (16 responses)
Posted May 24, 2012 23:40 UTC (Thu)
by daglwn (guest, #65432)
[Link] (15 responses)
The point is that we have an entire OS subsystem that probably isn't needed at all for a lot of use cases. Who wouldn't want to get rid of that complexity?
Posted May 25, 2012 3:48 UTC (Fri)
by viro (subscriber, #7872)
[Link] (14 responses)
What do you think 'ls' stands for? That's right, "list segments". In a directory segment, that is. As for the supposed complexity... Take a look at the amount of code in fs/ramfs someday. Especially if you leave no-MMU side of things alone...
Posted May 29, 2012 20:40 UTC (Tue)
by daglwn (guest, #65432)
[Link] (13 responses)
There is a lot of complex code that could be dumped if everything lived in a random-access memory. Device drivers alone would be a huge savings.
Posted May 29, 2012 21:33 UTC (Tue)
by paulj (subscriber, #341)
[Link] (5 responses)
Posted May 30, 2012 1:36 UTC (Wed)
by daglwn (guest, #65432)
[Link] (4 responses)
Posted May 30, 2012 4:43 UTC (Wed)
by dlang (guest, #313)
[Link] (1 responses)
However, they have always run into the stumbling block that it's just impractical to deal with all hunks of data in a flat namespace. Directories are EVIL, but nobody has made anything else work even one tenth as well.
Also, just keeping everything in ram falls apart as soon as you want someone else to access it (or you lose the device, or the device gets destroyed, or ...)
Many of the people pushing back have been through this "eliminate filesystems" experiment before and have the scars to show for it. Listen and learn (then go try and build something to prove them wrong :-)
Posted May 31, 2012 19:35 UTC (Thu)
by timka.org (guest, #53366)
[Link]
Posted May 30, 2012 4:52 UTC (Wed)
by paulj (subscriber, #341)
[Link] (1 responses)
Posted May 30, 2012 14:54 UTC (Wed)
by daglwn (guest, #65432)
[Link]
I think the discussion is pretty pointless now...
Posted May 30, 2012 0:11 UTC (Wed)
by Trelane (subscriber, #56877)
[Link] (1 responses)
Because not all memory is the same, let alone the pipe over which we get it.
How would you use a hypothetical storage medium that had 1EB of storage but you could only access it at 128kbps?
Posted May 30, 2012 1:37 UTC (Wed)
by daglwn (guest, #65432)
[Link]
The filesystem is the abstraction and that abstraction has certain costs. Changing the abstraction doesn't imply we immediately forget everything we know.
Posted May 30, 2012 6:53 UTC (Wed)
by viro (subscriber, #7872)
[Link] (4 responses)
One more time, slowly:
* fs/ramfs/inode.c does *not* optimise for rotating storage, what with having nothing whatsoever to do with any storage.
* it does *not* optimise for disc buffering, what with having no backing storage, disc or otherwise
* file cache (page cache, really) is just a mechanism for finding a page by offset in file. In case of object living entirely in RAM, that's exactly what you need to work with that object. Unless you want your objects to be contiguous in RAM, that is - great idea, that, for e.g. 800Kb text file. Or a 22Mb PDF document.
* Device drivers have nothing whatsoever to do with aforementioned ramfs.
* You have demonstrated just what is wrong with "visionaries". You keep making profound sounds without stopping to check whether they have anything to do with reality. Other than that of your bowel movements, that is.
As for being deliberately dense... I wouldn't have dared - any attempt to fake being dense would be simply pathetic next to the genuine article of that magnitude.
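The page-cache point in the list above (finding a page by offset in a file, without requiring the object to be contiguous in RAM) can be sketched with a toy in Python. This is an illustration of the idea only, not how the kernel implements it; `ToyPageCache`, its methods, and the fixed 4096-byte `PAGE_SIZE` are assumptions for the example.

```python
PAGE_SIZE = 4096


class ToyPageCache:
    """Maps page index -> page buffer; pages need not be adjacent in memory."""

    def __init__(self):
        self.pages = {}

    def page_for(self, offset):
        """Find the page holding `offset`, allocating a zeroed one on first touch."""
        index = offset // PAGE_SIZE
        return self.pages.setdefault(index, bytearray(PAGE_SIZE))

    def write(self, offset, data):
        """Store `data` byte by byte, crossing page boundaries as needed."""
        for i, byte in enumerate(data):
            pos = offset + i
            self.page_for(pos)[pos % PAGE_SIZE] = byte

    def read(self, offset, length):
        """Return `length` bytes starting at `offset`; untouched bytes read as zero."""
        return bytes(
            self.page_for(offset + i)[(offset + i) % PAGE_SIZE]
            for i in range(length)
        )
```

A write that straddles a page boundary lands in two separately allocated pages, yet reads back as one run of bytes, which is all the "file" abstraction needs from the lookup.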
Posted May 30, 2012 14:55 UTC (Wed)
by daglwn (guest, #65432)
[Link] (3 responses)
Posted May 30, 2012 21:25 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Don't know about you, but after Al Viro's post I went and checked VFS source code - and he's entirely correct.
Posted May 30, 2012 21:59 UTC (Wed)
by daglwn (guest, #65432)
[Link] (1 responses)
I never claimed to be a "visionary." I'm far, far from that. The idea isn't even original, people have talked about it for years. It just strikes me that it makes a lot of sense given system architecture trends. Outright dismissal accompanied by foul language, holier-than-thou attitudes and outright insults says much more about Al than it does me.
Al definitely lost a notch or two on my respect ladder and that's a pity because I've generally enjoyed reading his posts/articles.
Posted May 31, 2012 1:29 UTC (Thu)
by bronson (subscriber, #4806)
[Link]
I'm no linguist but anybody who disagrees is a dweeb. :)
Posted May 25, 2012 1:18 UTC (Fri)
by iabervon (subscriber, #722)
[Link] (7 responses)
Oh, and we use filesystems for disks, which NVM doesn't replace, not for DRAM, which NVM does replace. It would make more sense to say that we can now have *only* filesystems, not anonymous memory. But that's also dumb.
Posted May 25, 2012 1:53 UTC (Fri)
by neilbrown (subscriber, #359)
[Link]
Posted May 25, 2012 3:03 UTC (Fri)
by daglwn (guest, #65432)
[Link] (5 responses)
There are many other, better ways to do IPC.
I don't think a filesystem-less Linux would work that well - the concept is too ingrained into the kernel. But starting from scratch? I would seriously consider not providing a filesystem at all.
Posted May 25, 2012 5:10 UTC (Fri)
by mrons (subscriber, #1751)
[Link] (1 responses)
http://www.jlkeedy.net/research-highlights.html
Don't know how far they got with an actual implementation though.
Posted May 31, 2012 19:56 UTC (Thu)
by timka.org (guest, #53366)
[Link]
Posted May 25, 2012 8:52 UTC (Fri)
by dgm (subscriber, #49227)
[Link] (2 responses)
And as Viro said above, the moment you add a name->page mapping... voilà! the filesystem is back.
Posted May 29, 2012 20:45 UTC (Tue)
by daglwn (guest, #65432)
[Link] (1 responses)
I'll grant that removable media presents a challenge. But with The Cloud(TM) these days, do we even need it? I'm talking about certain common use cases. Certainly we need it for lots of things but I don't think everyone in the world needs it.
Even so, I wonder if page migration could be adapted to support removable storage. Page out on one system, page back in on another.
Just thinking blue sky here, I'm not an expert on any of this.
Posted Jun 1, 2012 1:07 UTC (Fri)
by kevinm (guest, #69913)
[Link]
About the only thing you can change about that and still have a mapping from names to memory is what the names look like. Maybe you don't want them to be hierarchical - maybe you want them to be flat, or multidimensional instead of single-dimensional. But there doesn't seem to be a good reason why changing how the names look is in any way related to NVM.
Posted May 24, 2012 21:54 UTC (Thu)
by tbird20d (subscriber, #1901)
[Link]
Posted May 25, 2012 10:51 UTC (Fri)
by drag (guest, #31333)
[Link]
http://en.wikipedia.org/wiki/Execute_in_place
Fragmentation would be a serious problem, so it would be necessary to revisit and improve the anti-fragmentation memory patches and such that I've seen discussed here in the past.
Like:
As far as booting and dealing with BIOS...
Have the bootloader bootstrap the system just enough to start executing whatever is at 0x0 on the NVRAM. Then leave it up to the kernel to notice that the system was just booted and undo whatever brain-dead state the UEFI or BIOS or whatever just put the system into.
Yes, I remember working with core memory. It was very convenient.
I guess I am a little slow here
Uses for NVM
if NVM is as fast as DRAM and higher density, why would people still have DRAM in their system?
Spinning rust 2.5 inch ~ 9 cents a gig ($60, 640 gig)
Paging AS/400 veterans
Data General core machines
Dmitry Zavalishin, the author of Phantom OS (which "eliminates filesystems"), was asked the question about removable storage when he was giving a talk about the OS at HighLoad++ in 2009.
His idea is to start a separate Phantom VM for a removable media which then can be seen as another "host" accessible via "network". AFAIU, this means Phantom's native IPC is substituted by some protocol. Smells somewhat like Plan 9 to me.
There's also Phantom OS. Never tried it myself but it's still active and looks interesting.
No one has brought up PRAMFS, which seems like a natural fit for this.
http://lwn.net/Articles/211505/