How to use a terabyte of RAM

By Jonathan Corbet
March 12, 2008

We have not yet reached a point where systems - even high-end boxes - come with a terabyte of installed memory. But products like those from Violin Memory make it clear that the day is coming; one can buy a Violin box with 500GB in it now. So it seems worth asking the question: once one has spent the not inconsiderable sum to buy a box like that, what does one do with all that memory - especially now that the Firefox developers have gotten serious about fixing memory leaks?

Perhaps it's time for some wild ideas. And there is no better source for such ideas than Daniel Phillips, whose Ramback patch has stirred up a bit of discussion this week. The core idea behind Ramback is that all of that memory is turned into a ramdisk, but with a persistent device attached to it. In normal conditions, all application I/O involves only the ramdisk, and is, thus, quite fast ("Every little factor of 25 performance increase really helps."). In the background, the kernel worries about synchronizing data from the ramdisk onto permanent storage. But the synchronization process is mostly concerned with I/O performance, rather than providing guarantees about just when any given block will make it onto the disk platters.

Ramback thus differs from the normal block I/O caching done by the kernel in a number of ways. It keeps the entire device in memory, so that, in steady-state operation, applications need never encounter a disk I/O delay. Should an application call fsync(), the expected result (blocking until the data is written to physical media) will not happen. Filesystems take great care to order operations in a way that minimizes the risk of data loss in a crash; Ramback ignores all of that and writes data to physical media in whatever order it decides is best. As Daniel put it, the "most basic principle" of Ramback's design is:

[T]he backing store is not expected to represent a consistent filesystem state during normal operation. Only the ramdisk needs to maintain a consistent state, which I have taken care to ensure. You just need to believe in your battery, Linux and the hardware it runs on. Which of these do you mistrust?

Ramback does include an emergency mode which will endeavor to bring the disk up to date in a hurry should the UPS indicate that power has been lost. But that does not seem to be enough for everybody. In the resulting discussion, nobody complained about the sort of performance benefits that a tool like Ramback could provide. But there was a lot of concern about data integrity; it seems that many people distrust their battery, their hardware, and Linux. And that has led to a sort of impasse, with several developers claiming that Ramback would be too risky to use and Daniel dismissing their concerns as FUD.
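
To make the design concrete, here is a minimal Python sketch of the idea as described above: every read and write is served from memory, fsync() does not wait for the disk, a background flusher writes dirty blocks in whatever order suits it, and an emergency mode drains everything at once. This is a toy model, not Ramback's code; the class and method names are invented for illustration, and flushing in ascending block order is only an assumption about what "best" might mean.

```python
class WritebackRamdisk:
    """Toy model of a ramdisk with a lazily synchronized backing store."""

    def __init__(self, backing_store):
        self.ram = {}                 # block number -> bytes (the "ramdisk")
        self.backing = backing_store  # dict standing in for the persistent device
        self.dirty = set()            # blocks modified since their last flush

    def write(self, block, data):
        # Application writes touch only memory.
        self.ram[block] = data
        self.dirty.add(block)

    def read(self, block):
        # A real device would be preloaded from disk; here only written blocks exist.
        return self.ram[block]

    def fsync(self):
        # Returns immediately: nothing waits for physical media.
        return None

    def flush_some(self, max_blocks):
        # Background writeback in ascending block order (an assumption),
        # not in the order the filesystem issued the writes.
        batch = sorted(self.dirty)[:max_blocks]
        for block in batch:
            self.backing[block] = self.ram[block]
            self.dirty.discard(block)
        return len(batch)

    def emergency_flush(self):
        # What a "UPS says power is gone" handler might do: drain everything.
        while self.dirty:
            self.flush_some(len(self.dirty))


disk = {}
ramdisk = WritebackRamdisk(disk)
ramdisk.write(7, b"journal commit")
ramdisk.write(2, b"file data")
ramdisk.fsync()            # the backing store is still empty at this point
ramdisk.flush_some(1)      # block 2 reaches "disk" before block 7
ramdisk.emergency_flush()  # now the backing store matches the ramdisk
```

Between the two flush calls the backing store holds block 2 but not block 7, which is the kind of inconsistent on-disk state the critics in the discussion were worried about.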

FUD or not, those concerns are likely to be a difficult barrier for Ramback to overcome. Meanwhile, Daniel is looking for people to help test out the code, but that presents challenges of its own:

This driver is ready to try for a sufficiently brave developer. It will deadlock and livelock in various ways and you will have to reboot to remove it. But it can already be coaxed into running well enough for benchmarks, and when it solidifies it will be pretty darn amazing.

So far, reports from suitably courageous testers have been, well, scarce. Your editor fears that this work could suffer the same fate as many of Daniel's other patches: they can contain brilliant ideas and great coding but just don't quite survive the encounter with the real, messy world. But we need people thinking about how our systems will work in the coming years; one hopes that Daniel won't stop.

Index entries for this article
Kernel	Block layer/Block drivers
Kernel	Ramback

How to use a terabyte of RAM

Posted Mar 13, 2008 2:08 UTC (Thu) by pj (subscriber, #4506) [Link]

This sounds like a good idea for embedded devices; they often have (or could cheaply have!)
much more RAM than flash storage.  Imagine something like a 400-500MHz ARM with 1GB of RAM and
64 or 128MB of flash.  Flash write speed actually *is* a limitation.

How to use a terabyte of RAM

Posted Mar 13, 2008 2:12 UTC (Thu) by jengelh (guest, #33263) [Link] (1 responses)

>We have not yet reached a point where systems - even high-end boxes - come with a terabyte of installed memory

I beg to differ; of course there are. Some SGI Altix 4700 in the datacenter:

cat /proc/meminfo 
MemTotal:     1645645808 kB
MemFree:      1149075520 kB
Buffers:          1136 kB
Cached:       62445312 kB
Active:       391650624 kB
Inactive:     34224432 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:     1645645808 kB
LowFree:      1149075520 kB
Dirty:        31667888 kB
...
Hugepagesize:    262144 kB
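
For readers who do not convert kilobytes in their heads, a small Python sketch follows that turns /proc/meminfo-style lines into TiB. The helper name is made up for this example; the only outside fact relied on is that the "kB" in /proc/meminfo means units of 1024 bytes.

```python
def meminfo_to_tib(text):
    """Parse 'Key:   value kB' lines and return {key: size in TiB}."""
    sizes = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip anything that is not "<number> kB", such as the "..." line.
        if sep and len(fields) == 2 and fields[1] == "kB" and fields[0].isdigit():
            sizes[key.strip()] = int(fields[0]) / 1024**3  # KiB -> TiB
    return sizes


sample = """MemTotal:     1645645808 kB
MemFree:      1149075520 kB
Cached:       62445312 kB
...
"""
sizes = meminfo_to_tib(sample)
print(f"MemTotal = {sizes['MemTotal']:.2f} TiB")  # prints "MemTotal = 1.53 TiB"
```

So the machine above has a little over one and a half terabytes installed, with more than one terabyte of it free.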

How to use a terabyte of RAM

Posted Mar 13, 2008 18:39 UTC (Thu) by riel (subscriber, #3142) [Link]

I'm working on making those work better with normal VM functionality, see http://linux-mm.org/PageReplacementDesign.

You can get split VM patches on lkml and from my people page.

How to use a terabyte of RAM

Posted Mar 13, 2008 3:20 UTC (Thu) by dlang (guest, #313) [Link] (15 responses)

the Violin box appears to the system like any other external SCSI device. it implements its
own battery and backup mechanism, so ramback is not relevant to it.

How to use a terabyte of RAM

Posted Mar 13, 2008 4:28 UTC (Thu) by zlynx (guest, #2285) [Link] (3 responses)

It would be relevant if the Violin was running Linux as its embedded OS.

How to use a terabyte of RAM

Posted Mar 13, 2008 5:37 UTC (Thu) by daniel (guest, #3181) [Link] (2 responses)

"It would be relevant if the Violin was running Linux as its embedded OS."

It is.

How to use a terabyte of RAM

Posted Mar 14, 2008 21:47 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

but the storage does not show up as system memory, even inside violin

since it can be either 500G of RAM or a much larger amount of flash, I'm sure that it shows up
to the Linux OS on the system as storage of some sort, not as system RAM

How to use a terabyte of RAM

Posted Mar 15, 2008 0:05 UTC (Sat) by nix (subscriber, #2304) [Link]

Until recently I'd have suspected that the page replacement algorithms 
would scream and die at the sight of that much RAM, but I've seen logs of 
systems with 1.5TB of accessible RAM, so this sort of thing is happening. 
(That was a big Altix, IIRC.)

How to use a terabyte of RAM

Posted Mar 13, 2008 5:37 UTC (Thu) by daniel (guest, #3181) [Link] (10 responses)

"the Violin box appears to the system like any other external SCSI device. it implements its
own battery and backup mechanism, so ramback is not relevant to it."

The violin box does not appear as a scsi device, but as a _much_ faster PCI-e (external)
device.  For which Violin has written a driver to make it appear as a block device.  They also
have a driver to make it appear as memory, but I have not tried that.

The Violin plays only for a single one-ear box, or for small groups with two ears as well?

Posted Mar 13, 2008 17:05 UTC (Thu) by hmh (subscriber, #3838) [Link] (2 responses)

Just curious: How many PCIe ports does a Violin box have?  Can I have redundant connections
(two PCIe HBAs per host)?  Can I plug several hosts into the Violin, much like I could with a
storage array with multiple FC ports?

This sort of information really should be in their webpages, but I have failed to locate it
anywhere.

The Violin plays only for a single one-ear box, or for small groups with two ears as well?

Posted Mar 15, 2008 19:24 UTC (Sat) by daniel (guest, #3181) [Link] (1 responses)

The Violin 1010 box has two PCI-e 8x (external) links, each delivering about 1.7 Gbytes/sec of
read throughput that can be connected in parallel to the same host for over 3 GB/sec of read
bandwidth, or connected to two different hosts for redundancy.

They really ought to post this info in a blinking, scrolling banner on their site, if you ask me.

The Violin plays only for a single one-ear box, or for small groups with two ears as well?

Posted Mar 15, 2008 23:37 UTC (Sat) by hmh (subscriber, #3838) [Link]

Agreed.  Nobody sane would bother to buy a US$50k+ storage unit without knowing this
information first; it is weird not to publish it up front.

Two ports is way too little for this class of hardware, IMO.  They need to add two more.

One needs two ports per host for HBA and cabling redundancy, and usually one will also need to
be able to connect a minimum of two separate hosts to the same storage (for HA redundancy).
Typically, a high-uptime HA setup would have two hosts connected via two PCIe links each to two
Violin boxes.

also as ram

Posted Mar 14, 2008 3:33 UTC (Fri) by ccyoung (guest, #16340) [Link] (1 responses)

Violin also sells the RAM itself in DDR2 packaging, so this becomes immediately feasible.

http://www.violin-memory.com/products/vimms.html

And next gen flash gets rid of power objections.

also as ram

Posted Mar 15, 2008 19:28 UTC (Sat) by daniel (guest, #3181) [Link]

"next gen flash gets rid of power objections"

It does and it doesn't.  A box with a heavy continuous write load will eventually end up with
all its "buffer" flash in erase mode, and write throughput will drop down to erase speed.  So
RAM will always be better than flash for high performance transactional setups, but less
demanding loads will be fine with flash.

How to use a terabyte of RAM

Posted Mar 14, 2008 20:03 UTC (Fri) by giraffedata (guest, #1954) [Link] (4 responses)

I'm getting confused. What is the relationship between Ramback and Violin?

How to use a terabyte of RAM

Posted Mar 14, 2008 21:48 UTC (Fri) by dlang (guest, #313) [Link] (3 responses)

the existence of the violin is being used as evidence that something like Ramback is needed

How to use a terabyte of RAM

Posted Mar 14, 2008 22:07 UTC (Fri) by giraffedata (guest, #1954) [Link] (2 responses)

Thanks. Much clearer now.

If I have 500 GB of memory, Violin looks like a much better way than Ramback to use it. The reason is that it better answers my mistrust of the battery, hardware, and Linux. I can put multiple Violins in a RAID array and when the battery, hardware, or Linux fails in one, I don't lose data. And it looks roughly as fast as Ramback.

If I do put 500 GB of memory in the application server, I think I'd just like to add it to the pool and let the memory manager decide if caching all the contents of my filesystems is the best use for it. Maybe with some parameters to say I trust my battery, hardware, and Linux enough that Linux need not hurry to write any of it back to disk.

How to use a terabyte of RAM

Posted Mar 15, 2008 19:32 UTC (Sat) by daniel (guest, #3181) [Link] (1 responses)

"Violin looks like a much better way than Ramback to use it. The reason is that it better
answers my mistrust of the battery, hardware, and Linux. I can put multiple Violins in a RAID
array and when the battery, hardware, or Linux fails in one, I don't lose data. And it looks
roughly as fast as Ramback."

It is exactly as fast as ramback, because it is ramback.  Ramback was written to provide the
Violin box with stable backing store.  It is just a nice bonus that ramback happens to be
useful for ramdisks in general, and for my code hacking workstation in particular.

How to use a terabyte of RAM

Posted Mar 15, 2008 22:30 UTC (Sat) by giraffedata (guest, #1954) [Link]

OK, then, based on that and some more reading, I believe the answer to my question, "what is the relationship between Ramback and Violin" is this: Ramback couples a nonpersistent block device (a device that doesn't retain its memory across an orderly shutdown) with persistent storage so as to create a block device with the speed of RAM and the persistence of disk. The violin box is one source of a nonpersistent block device, and the one that inspired Ramback. The box uses DRAM for storage and connects to the application server via PCI express and comes with a driver for Linux to make a Linux block device out of it. Ramback, running on the application server, uses that block device.

That means my earlier comments comparing use of Violin with use of Ramback are nonsense; they aren't alternatives because one provides persistent storage and the other doesn't. And my comparison of using Ramback to adding memory to the regular pool is similarly nonsense because Ramback can use DRAM that isn't in the application server (e.g. the Violin box).

How to not use your hard disk

Posted Mar 13, 2008 9:31 UTC (Thu) by rvfh (guest, #31018) [Link] (3 responses)

I was wondering: is there a way to achieve something similar just using the I/O cache? 

My laptop is meant to run Vista and thus comes with 2GB of RAM, of which Linux (or even XP!)
does not use that much; it would be nice to tell the system to not flush some parts of the FS
to disk and keep them in RAM for a very long time, without having to set up a RAM disk and
then commit it to disk on shutdown...

More concretely: my source code is like so:

project/src/test/file_test_dir

All the file_test_dir's are big and contain big, useless, short-lived data. What is below is
important code. I'd love to be able to say: *test_dir does not need to be sync'd to disk at
all, or only very rarely.

OK, just writing it down makes me think that it's not trivial unless I do create a RAM disk
for my test data... oh well.
Or maybe unionfs can help?

How to not use your hard disk

Posted Mar 13, 2008 15:36 UTC (Thu) by i3839 (guest, #31386) [Link]

If it's on a separate partition (or image file) you could try a huge value for the "commit"
mount option in ext3. But using tmpfs is probably easier, and just doing a cp -a when wanting
to store it to disk.

How to not use your hard disk

Posted Mar 14, 2008 20:22 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

So the goal is to eliminate writes to disk of data that will probably never be read back? Do you have a problem with too much disk I/O? Is it slowing down other I/O?

Some filesystems (none on Linux that I know of) have, for this reason, an attribute of a file or file image "delayed write," which means don't bother to harden writes to disk until the OS finds a better use for the memory or there's an orderly shutdown. And some policy engine that can set the attribute based on file name.

On Linux, I'd probably make those directories symlinks to directories in a tmpfs filesystem. Not ramfs, because the memory manager can probably do a better job than my static policy at determining when the memory can be better used for something else.
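
The symlink approach can be sketched in a few lines of Python. This is only an illustration: move_to_tmpfs is a made-up helper, the /dev/shm path in the usage comment assumes a tmpfs is mounted there (common on Linux, but not guaranteed), and name collisions between directories with the same basename are not handled.

```python
import os
import shutil


def move_to_tmpfs(directory, tmpfs_root):
    """Move `directory` under `tmpfs_root` and leave a symlink at the old path."""
    directory = os.path.abspath(directory)
    target = os.path.join(tmpfs_root, os.path.basename(directory))
    shutil.move(directory, target)  # falls back to copy + delete across filesystems
    os.symlink(target, directory, target_is_directory=True)
    return target


# Hypothetical usage, assuming /dev/shm/scratch exists on a tmpfs mount:
#   move_to_tmpfs("project/src/test/file_test_dir", "/dev/shm/scratch")
```

Anything moved this way disappears at reboot and the symlink is left dangling, which is fine for short-lived test output but means the target directory has to be recreated at boot.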

How to not use your hard disk

Posted Mar 15, 2008 17:42 UTC (Sat) by rvfh (guest, #31018) [Link]

This is exactly it! I might do the symlink trick, but I would really like to have that
per-file delayed write option...

Thanks!

How to use a terabyte of RAM

Posted Mar 13, 2008 10:02 UTC (Thu) by dankamongmen (subscriber, #35141) [Link] (1 responses)

"Perhaps it's time for some wild ideas. And there is no better source for such ideas than
Daniel Phillips,"

LOL; it's good to see Mr. Phillips is still at it after all these years. It's always good to
open the day to one of his trademark tome-like posts.

How to use a terabyte of RAM

Posted Mar 21, 2008 6:33 UTC (Fri) by muwlgr (guest, #35359) [Link]

I remember his TUX2 concept. Lovely thing, but the world had finally settled on EXT3. There
were patents clashing with TUX2 or similar stuff.

How to use a terabyte of RAM

Posted Mar 13, 2008 13:05 UTC (Thu) by davecb (subscriber, #1574) [Link] (2 responses)

If the memory filesystem doesn't try to decide
on the order of writes to the underlying filesystem,
but instead simply creates a queue of operations 
and feeds them to the disk in the order they
are received, then a logging filesystem can
reorder and coalesce them to get the physical
disk update quite quickly.

A colleague did a thesis on that, and found
that reordering and coalescing paid a huge
benefit.

Alternatively, one could have an intermediate
layer which the memory filesystem could pass
operations to, which could do optimizations like
doing all directory operations asap and in-order
and delaying all file writes until close or
a certain large size had accumulated, then
writing the whole file in one fell swoop. The
latter is *very* effective at increasing performance
while decreasing seeks and CPU usage.

--dave

How to use a terabyte of RAM

Posted Mar 13, 2008 15:51 UTC (Thu) by i3839 (guest, #31386) [Link] (1 responses)

> then a logging filesystem can
> reorder and coalesce them

The IO scheduler does exactly that already (also for reads).

But the problem is that the writing order is very important quite often.

> Alternatively, one could have an intermediate
> layer which the memory filesystem could pass
> operations to, which could do optimizations like
> doing all directory operations asap and in-order
> and delaying all file writes until close or
> a certain large size had accumulated, then
> writing the whole file in one fell swoop.

This sounds like the dirsync mount option and the current file caching that happens at the VM
level.

Currently all file operations happen on files cached in RAM. Disk IO is only done for reads
of uncached files and for writes that are explicitly requested by user space (sync), forced by
the VM because of memory pressure, or triggered by a timeout that says it's time to write the
dirty data to disk as well. By that point enough data has usually accumulated that the IO
scheduler actually has something to coalesce and reorder.
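
As a rough illustration of what "coalesce and reorder" means, here is a Python sketch that sorts a batch of pending writes by starting block and merges the ones that touch or overlap. It is not the kernel's IO scheduler; the function name and the (start_block, block_count) request format are invented for the example, and it deliberately ignores the ordering constraints mentioned above.

```python
def coalesce(writes):
    """Merge (start_block, block_count) requests into sorted, non-overlapping extents."""
    merged = []
    for start, count in sorted(writes):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # Adjacent to or overlapping the previous extent: grow that extent.
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged


pending = [(100, 8), (0, 4), (108, 8), (4, 4), (50, 2)]
print(coalesce(pending))  # [(0, 8), (50, 2), (100, 16)]
```

Five requests become three sequential extents. The longer writes are allowed to accumulate before being flushed, the more of this merging is possible.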

How to use a terabyte of RAM

Posted Mar 13, 2008 18:07 UTC (Thu) by davecb (subscriber, #1574) [Link]

  Dirsync is somewhat similar, except it's 
used to guarantee the write completes
serially and synchronously. It's therefore
a performance pessimization, somewhat like
the sync mount option (;-))

  I was just talking about the need to 
request directory updates as they come in
and in order, as opposed to delaying
and reordering them.

  Ian's thesis actually went into the
degree to which one could delay and reorder 
directory writes, but I didn't want to add
that complexity to a very short email.

--dave

How to use a terabyte of RAM

Posted Mar 13, 2008 15:49 UTC (Thu) by landley (guest, #6789) [Link] (3 responses)

When I suspend/resume my laptop (A Dell Inspiron E1505 that came pre-installed with Ubuntu,
since upgraded to x86-64 Kubuntu 7.10), about 10% of the time the keyboard and touchpad are
dead when it comes back.  The rest of the system is still working fine; if I press the power
button the "log out/suspend/hibernate/restart/turn off" dialog comes up (useless if I have no
mouse and keyboard), and if I plug in a USB mouse or USB keyboard I can use the system
normally.  (And if I suspend and resume it again via said USB peripherals, the
keyboard/touchpad controller usually revives itself when it comes back.)

If I don't happen to be carrying USB peripherals around with me, the only thing I can do is
hold the darn power button down until it hard powers off, and then reboot it.  I wind up doing
this, on average, about twice a month.

So there _is_ more to life than just trusting your battery...

How to use a terabyte of RAM

Posted Mar 13, 2008 17:56 UTC (Thu) by Hawke (guest, #6978) [Link] (1 responses)

You could set it so that the system doesn't ask what to do, but instead suspends immediately
when you press the power button. (system -> preferences -> power management, general tab,
"when the power button is pressed: suspend").

Alternatively, if you have a suspend key on your keyboard, that might be handled by the BIOS
instead of the OS, and might trigger a suspend event.

How to use a terabyte of RAM

Posted Mar 13, 2008 20:48 UTC (Thu) by landley (guest, #6789) [Link]

Those instructions seem to be for gnome, and I'm using kde (without 
kpowersave installed, which I'm reluctant to install due to the "ubuntu 
laptop disks eating themselves after 6 months if you don't 'hdparm -B 
255 /dev/sda' them" problem; see 
http://www.atomicmpc.com.au/forums.asp?s=2&c=16&t... for details).

However, all I had to do was edit /etc/acpi/powerbtn.sh and replace the 
call to /usr/bin/dcop with a call to /etc/acpi/sleep.sh.

I'm aware this is probably not the _approved_ way of doing this, but as 
with most "chainsaw, shotgun, and duct tape" solutions, it works just 
fine for me...

How to use a terabyte of RAM

Posted Mar 15, 2008 20:05 UTC (Sat) by daniel (guest, #3181) [Link]

"So there _is_ more to life than just trusting your battery..."

I suggest that you do not design your Dell Inspiron laptop into a mission critical transaction
processing system.  Or that if you do, you should consider avoiding sleep state :-)

How to use a terabyte of RAM

Posted Mar 20, 2008 22:15 UTC (Thu) by bobort (guest, #5019) [Link]

You can get a remarkably similar effect today with tmpfs; in fact, I'm writing this on a
root-on-tmpfs machine.  I store my root fs as a tar file so it can be read and written without
much disk seeking.  I have a paltry 8GB of RAM, but it's more than enough for the 2.7G rootfs,
which takes about a minute to load/dump.  It's dumped every night, and I even have apcupsd
start a dump when the power goes out.  I've been running this way since about November of 2005
with no problems, and it's FAST!
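
A setup like this might be approximated with Python's standard tarfile module. The sketch below is not the poster's script: dump_tree and load_tree are hypothetical names, the paths in the usage comments are placeholders, and the archive is assumed to live outside the tree being dumped. A real root filesystem would also need care with ownership, device nodes and special files.

```python
import os
import tarfile


def dump_tree(root, archive):
    """Write everything under `root` to a single tar file, read and written sequentially."""
    tmp = archive + ".tmp"
    with tarfile.open(tmp, "w") as tar:
        tar.add(root, arcname=".")
    # Swap in the new dump only once it is complete, so an interrupted
    # dump (for example, the UPS running dry) leaves the previous one intact.
    os.replace(tmp, archive)


def load_tree(archive, root):
    """Unpack a dump into `root`, for instance a freshly mounted tmpfs."""
    with tarfile.open(archive, "r") as tar:
        tar.extractall(root)


# Hypothetical usage:
#   load_tree("/persistent/rootfs.tar", "/mnt/tmpfs-root")   # at boot
#   dump_tree("/mnt/tmpfs-root", "/persistent/rootfs.tar")   # nightly, or on power failure
```

Writing one large file keeps the disk streaming rather than seeking, which matches the poster's reason for using a tar file.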