LWN: Comments on "The future calculus of memory management" https://lwn.net/Articles/475681/ This is a special feed containing comments posted to the individual LWN article titled "The future calculus of memory management". en-us Mon, 06 Oct 2025 11:34:59 +0000 Mon, 06 Oct 2025 11:34:59 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net The future calculus of memory management https://lwn.net/Articles/479374/ https://lwn.net/Articles/479374/ nix <blockquote> If you solve the cross-datacentre RAM allocation problem, have you also invented an algorithm that could be used to optimally drive a real-world command economy? </blockquote> Well, <i>that</i> has already been invented, by <a href="http://en.wikipedia.org/wiki/Linear_programming">Leonid Kantorovich</a> way back before WWII, originally with the declared intention of treating the entire Soviet economy as a single vast optimization problem. It didn't work, and not just because the models were insanely complex and the necessary computer power didn't then exist. No algorithm can deal with the fact that people <i>lie</i>, and no algorithm exists, nor likely can ever exist, that can produce correct outputs given largely-lying inputs. (If only a few lie, you can deal with it: but in many command economies, especially those that can resort to force when needed, there's an active incentive to lie to your superiors. So, in the end, most people will stretch the truth, and your beautiful optimizer fails ignominiously.) <p> I wish someone could find a way to globally optimize economies without the wastage inherent in competition, but then I also wish for immortality and FTL travel. Fri, 03 Feb 2012 20:11:00 +0000 The future calculus of memory management https://lwn.net/Articles/479215/ https://lwn.net/Articles/479215/ kevinm <div class="FormattedComment"> Your first paragraph is quite interesting - I like the idea that such a system would be a kind of analogue of a real-world financial economy. I think there is a lot of truth to this - you might indeed expect to see big booms and busts in the "datacentre RAM economy" if it's composed of a free market of locally-optimising agents.<br> <p> I wonder if in the future research into this will lead to results that can be applied in macroeconomics, or vice versa? If you solve the cross-datacentre RAM allocation problem, have you also invented an algorithm that could be used to optimally drive a real-world command economy?<br> </div> Fri, 03 Feb 2012 03:52:08 +0000 The future calculus of memory management https://lwn.net/Articles/477773/ https://lwn.net/Articles/477773/ dlang <div class="FormattedComment"> <font class="QuotedText">&gt; Individual machines are getting smaller and more likely to share external storage rather than have local storage.</font><br> <p> I agree they are more likely to share external storage, but I disagree about them getting smaller.<br> <p> with virtualization, the average machine size is getting significantly larger (while the resources allocated to an individual VM are significantly smaller than what an individual machine used to be)<br> <p> there is a significant per-physical-machine overhead in terms of cabling, connectivity, management, and fragmentation of resources.
In addition, the sweet spot of price/performance keeps climbing.<br> <p> as a result, the machines are getting larger.<br> <p> for what you are saying to be true, a company would have to buy in to SAN without also buying in to VMs, and I think the number of companies making that particular combination of choices is rather small.<br> </div> Sat, 28 Jan 2012 01:32:05 +0000 The future calculus of memory management https://lwn.net/Articles/477751/ https://lwn.net/Articles/477751/ djm1021 <div class="FormattedComment"> Good discussion! A few replies.<br> <p> <font class="QuotedText">&gt; what kind of application would actually benefit from this?</font><br> <p> Here's what I'm ideally picturing (though I think RAMster works for other environments too):<br> <p> Individual machines are getting smaller and more likely to share external storage rather than have local storage. Cloud providers that are extremely cost conscious but can provide performance if paid enough (just as web hosters today offer a virtual server vs physical server option, the latter being more expensive).<br> <p> Picture a rack of "microblades" in a lights-out data center, no local disks, labor-intensive and/or downtime-intensive to remove and upgrade (or perhaps closed sheet-metal so not upgradeable at all), RAM maxed out (or not upgradeable), some kind of fabric or mid-plane connecting blades in the rack (but possibly only one or two GbE ports per blade).<br> <p> <font class="QuotedText">&gt; but throughput will be much lower</font><br> <p> There isn't a high degree of locality in swap read/write so readahead doesn't help a whole lot. Also, with RAMster, transmitted pages are compressed so less data is moved across the wire.<br> <p> <font class="QuotedText">&gt; Have you actually characterized the performance of RAMster vs spinning disk vs SSD?</font><br> <p> Yes, some, against disk but not SSD. RAMster only recently started working well enough.<br> <p> </div> Fri, 27 Jan 2012 23:46:11 +0000 The future calculus of memory management https://lwn.net/Articles/477739/ https://lwn.net/Articles/477739/ vlovich <div class="FormattedComment"> The point I was making was that 10GbE was too expensive (since 10GbE is roughly at performance parity with mass storage). With 1GbE, you're looking at a lot less performance than attached storage. So yes, if your servers are using a regular spinning disk, then this *MIGHT* help the performance a bit since latency may be theoretically better (but throughput will be much lower).
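<br> <p> A rough back-of-envelope sketch makes that tradeoff concrete. All the numbers below are idealized assumptions (wire speed, a guessed ~100us network round-trip, a typical 7200rpm disk), not measurements:<br> <pre>
/* Back-of-envelope: servicing one 4 KiB page over 1GbE vs. from a
 * spinning disk.  All inputs are rough assumptions, not measurements. */
#include &lt;stdio.h&gt;

int main(void)
{
    const double page_bits   = 4096.0 * 8;
    const double gbe_bps     = 1e9;     /* 1GbE wire speed               */
    const double net_rtt_s   = 100e-6;  /* assumed stack+switch latency  */
    const double disk_seek_s = 8e-3;    /* typical 7200rpm average seek  */
    const double disk_mbps   = 120.0;   /* sustained sequential MB/s     */

    double net_page_s  = net_rtt_s + page_bits / gbe_bps;
    double disk_page_s = disk_seek_s + 4096.0 / (disk_mbps * 1e6);

    printf("random 4K page: net ~%.0f us, disk ~%.0f us (%.0fx)\n",
           net_page_s * 1e6, disk_page_s * 1e6, disk_page_s / net_page_s);
    printf("streaming: net %.0f MB/s, disk %.0f MB/s\n",
           gbe_bps / 8 / 1e6, disk_mbps);
    return 0;
}
</pre> <p> With those inputs a random 4K fault is serviced roughly 60x faster over the wire than off a platter, while streaming throughput is capped at wire speed - comparable to the platter and far below a SATA SSD.<br> <p>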
However, as the cluster grows, the performance will degrade (latency will get worse since the switch has to process more, + bandwidth will decrease since your switch likely can't actually do a simultaneous n ports * 1GbE of throughput).<br> <p> So the point I've been trying to understand is what kind of application would actually benefit from this?<br> <p> spinning storage in each box, limited amount of RAM, lots of machines, hitting swap occasionally, extremely cost conscious, not performance conscious<br> <p> So here are three possible alternatives that provide significantly better performance for not much extra cost:<br> * add more RAM to 1 or more boxes (depending on how much you have in your budget) &amp; make sure heavy workloads only go to those machines.<br> * if adding more RAM is too expensive, use equivalently-priced SSDs instead of spinning disks - yes, you have less storage per machine, but use a cluster-FS for the data instead (my hunch is that a cluster-FS + SSD for swap will actually get you better performance than spinning disk + RAMster for swap).<br> * use a growable swap instead of a fixed-size swap? that'll get you more efficient usage of the storage space<br> <p> Have you actually characterized the performance of RAMster vs spinning disk vs SSD? I think it's useful to see the cost/performance tradeoff in some scenarios. Furthermore, it's important to characterize the cost of RAM vs the cost of the system &amp; what kind of performance you require from the application + what the specs of the machines in the cluster are.<br> <p> Also, getting back to the central thesis of the article: if you're hitting swap you're looking at a huge performance hit anyway, so why require paying for the "RAM swap" you use when you already pay for it implicitly in performance (i.e. you are already encouraged to use less RAM to improve performance, so charging for the swap use doesn't add anything; if you could lower the RAM usage, you would already be encouraged to do so)?<br> </div> Fri, 27 Jan 2012 22:59:12 +0000 Migration https://lwn.net/Articles/477724/ https://lwn.net/Articles/477724/ slashdot <div class="FormattedComment"> The solution is to not buy more RAM for ALL machines, but only for some, and dynamically migrate VMs from nodes where you are memory-constrained to nodes where you aren't.<br> <p> Of course, if you need real-time behavior, this likely won't work, and then you really need to buy as much RAM (and as many CPU cores) as will ever be needed.<br> <p> </div> Fri, 27 Jan 2012 21:30:03 +0000 The future calculus of memory management https://lwn.net/Articles/477706/ https://lwn.net/Articles/477706/ djm1021 <div class="FormattedComment"> Thanks for the thorough cost analysis.<br> <p> First, RAMster works fine over built-in 1GbE and provides an improvement over spinning rust, so 10GbE is not required. True, it's probably not faster than a local SSD, but that misses an important point: To use a local SSD, you must permanently allocate it (or part of it) for swap. So if you have a thousand machines, EACH machine must have a local SSD (or part of it) dedicated for swap. By optimizing RAM utilization across machines, the local SSD is not necessary... though some larger ones on some of the thousand machines might be.<br> <p> The key point is statistics: Max of sums is almost always less (and often MUCH less) than sum of max. This is especially true when the value you are measuring (in this case "working set") varies widely.
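<br> <p> A toy simulation shows the effect. The workload shape here (a 2GB baseline with rare spikes) is a made-up assumption, not RAMster data:<br> <pre>
/* "Max of sums" vs "sum of max": N machines with bursty working sets.
 * Synthetic numbers purely for illustration. */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define MACHINES 100
#define TICKS    1000

int main(void)
{
    double peak[MACHINES] = {0};   /* per-machine maximum working set */
    double agg_peak = 0;           /* maximum of the per-tick sums    */
    double sum_of_max = 0;

    srand(42);
    for (int t = 0; t &lt; TICKS; t++) {
        double sum = 0;
        for (int m = 0; m &lt; MACHINES; m++) {
            /* 2GB baseline, 2% chance of a spike of up to 9GB more */
            double ws = 2.0 + (rand() % 100 &lt; 2 ? rand() % 10 : 0);
            if (ws &gt; peak[m])
                peak[m] = ws;
            sum += ws;
        }
        if (sum &gt; agg_peak)
            agg_peak = sum;
    }
    for (int m = 0; m &lt; MACHINES; m++)
        sum_of_max += peak[m];

    printf("sum of per-machine peaks: %.0f GB\n", sum_of_max);
    printf("peak of the aggregate:    %.0f GB\n", agg_peak);
    return 0;
}
</pre> <p> Provisioning every machine for its own peak (the first number) buys several times more RAM than provisioning the pool for the peak of the aggregate (the second); that gap is the savings being chased here.<br> <p>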
Whether you are counting RAM or SSD or both, if you can "overflow" to a remote resource when the local resource is insufficient, the total resource needed (RAM and/or SSD) is less (and often MUCH less). Looking at it the other way, you can always overprovision a system, but at some point your ROI is diminishing. E.g. why don't you always max out RAM on all of your machines? Because it's not cost-effective. So why do you want to overprovision EVERY machine with an SSD?<br> <p> P.S. The same underlying technology can be used for RAMster as for VMs. See <a href="http://lwn.net/Articles/454795/">http://lwn.net/Articles/454795/</a><br> And for RAMster, see: <a href="http://marc.info/?l=linux-mm&amp;m=132768187222840&amp;w=2">http://marc.info/?l=linux-mm&amp;m=132768187222840&amp;w=2</a> <br> <p> </div> Fri, 27 Jan 2012 20:33:42 +0000 The future calculus of memory management https://lwn.net/Articles/477687/ https://lwn.net/Articles/477687/ perfgeek <div class="FormattedComment"> 10GbE latency on things like netperf TCP_RR is indeed in the range of 50 microseconds, and with lower-level things one can probably do &lt; 10 microseconds, but isn't that an unloaded latency, without all that many hops and probably measured with something less than 4096 bytes flowing in either direction? Also, can we really expect page sizes to remain 4096 bytes?<br> </div> Fri, 27 Jan 2012 18:37:29 +0000 The future calculus of memory management https://lwn.net/Articles/477518/ https://lwn.net/Articles/477518/ vlovich <div class="FormattedComment"> You're correct - I forgot that 10GbE removes CSMA/CD, so there will never be 10GbE hubs.<br> <p> But let's compare performance #s:<br> RAM/CPU cache: 17-100 GB/s (real), 7-70 ns (real)<br> Commercial SSD: 0.5 GB/s (real), ~83 000 - 100 000 ns (real)<br> 10GbE Ethernet: 1.25 GB/s (max theoretical), ~50 000 ns<br> <p> 10GbE performance is about comparable to a SATA SSD.<br> <p> Let's analyze cost:<br> 128 GB RAM (maximum amount you can currently outfit - 8 x 16 GB): ~$1400+<br> 64 GB RAM: $400-600<br> 32 GB RAM: $200-300<br> 4 GB RAM: ~$30<br> 2 GB RAM: ~$20<br> 128 GB SSD: $200<br> 10GbE controller w/ 2 ports: ~$500 (&amp; quality of controller is very important).<br> 10GbE switch: $4000-25000 (the $4k one is a 16-port switch, thus equivalent in cost to just having n ports via controllers)<br> <p> So for the Ethernet approach, it costs C_shared_ram + C_server + n * (1.5 * C_ethernet + C_per_machine_ram + C_machine).<br> The crossover with the no-shared-RAM case is at n = (C_shared_ram + C_server) / (C_dedicated_ram - 1.5 * C_ethernet - C_per_machine_ram + (C_server - C_machine)), since the machine for the dedicated-RAM case will be slightly more expensive.<br> <p> At 64GB of RAM or lower, it's obviously never cheaper to use 10GbE to share that RAM as opposed to just putting it into the machine (each 10GbE controller connection costs on average $750).<br> <p> Assuming a $200 CPU, $500 server motherboard &amp; $100 power supply, you're looking at $800 for a machine that takes 128GB of RAM, $500 for the ones that take less.<br> <p> For 4 GB per machine + 128 GB shared vs 128 GB dedicated, you're looking at a crossover of about 3.2 machines. 10GbE = a 4x PCI-e lane, and the controller has 2 ports, so you'll need 8x lanes at least. Assuming enough PCI-e lanes for 5 such controllers, that's 10 machines you can share the RAM with &amp; thus ~$2800 saved for every 10 machines.
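<br> <p> Here is that crossover formula in code, with one possible reading of the prices listed above (exactly which prices to plug in is a judgment call, so treat the inputs - and the resulting n - as assumptions; different picks move the crossover around within the low single digits):<br> <pre>
/* The crossover formula above with one possible reading of the listed
 * prices; the inputs are assumptions, and the result shifts with them. */
#include &lt;stdio.h&gt;

int main(void)
{
    double c_shared_ram = 1400; /* 128 GB pooled in one server         */
    double c_server     = 800;  /* CPU/board/PSU that takes 128 GB     */
    double c_machine    = 500;  /* CPU/board/PSU that takes less       */
    double c_ethernet   = 500;  /* dual-port 10GbE controller          */
    double c_local_ram  = 30;   /* 4 GB in each client machine         */
    double c_dedicated  = 1400; /* 128 GB put in every machine instead */

    double n = (c_shared_ram + c_server) /
               (c_dedicated - 1.5 * c_ethernet - c_local_ram +
                (c_server - c_machine));
    printf("sharing pays off past ~%.1f machines\n", n);
    return 0;
}
</pre> <p>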
Of course, the performance hit of going &gt; 4GB on any machine will increase as more machines share the RAM, so at 3.2, you're potentially below just having a dedicated SSD for swap (if you use 64GB of RAM + SSD, your savings for 10 machines will be $48-$340 depending on the cost of the RAM &amp; SSD).<br> <p> To me, it seems like at no point is 10GbE swap an attractive option - the cost savings aren't that compelling &amp; the performance seems much worse than just using more RAM &amp; having an SSD dedicated for swap.<br> <p> Now, having some kind of VM ballooning thing where you have a 128 GB machine where the VMs can vary the amount of RAM they use is obviously a more attractive solution &amp; makes a lot of sense (combined with page-deduping + compcache).<br> </div> Fri, 27 Jan 2012 05:15:14 +0000 The future calculus of memory management https://lwn.net/Articles/477513/ https://lwn.net/Articles/477513/ quanstro <div class="FormattedComment"> 10gbe latency is &lt; 50 microseconds on most networks. there<br> are no such things as 10gbe hubs, so collision domains consist<br> of two peers on a full-duplex link. there never has been a 10gbe<br> collision. :-)<br> <p> </div> Fri, 27 Jan 2012 01:56:41 +0000 Not so good usage of swap https://lwn.net/Articles/477346/ https://lwn.net/Articles/477346/ renox <div class="FormattedComment"> What is a little sad about this kind of article is that, currently, languages with a GC don't use RAM efficiently, because swap isn't used as it should be: rarely used pages don't get swapped out when a GC is in play.<br> <p> Of course that's a difficult issue to solve, as both the kernel and the GCs must be modified; see <a rel="nofollow" href="http://lambda-the-ultimate.org/node/2391">http://lambda-the-ultimate.org/node/2391</a><br> <p> </div> Thu, 26 Jan 2012 14:16:39 +0000 The future calculus of memory management https://lwn.net/Articles/476880/ https://lwn.net/Articles/476880/ jzbiciak <P>It's describing instructions that implement a stricter form of<TT> madvise(..., MADV_DONTNEED) </TT> and<TT> madvise(..., MADV_WILLNEED) </TT>I believe. As described, these are verbs, not adjectives.</P> <P>The main difference, if I understood correctly, is that they're not just advice. If I PageOffline a given page, it strictly goes away until I explicitly PageOnline that same page.</P> Tue, 24 Jan 2012 08:27:17 +0000 The future calculus of memory management https://lwn.net/Articles/476439/ https://lwn.net/Articles/476439/ vlovich <div class="FormattedComment"> Sure, but that's essentially how AWS works. It's how all distributed algorithms work. But it's never transparent. You still need to design the program to run distributed to begin with. That's not what was being proposed in the article, though - it was some kind of amorphous proposal to build a "cloud" of RAM, rent it out, &amp; have it all work transparently.<br> </div> Sat, 21 Jan 2012 01:38:51 +0000 The future calculus of memory management https://lwn.net/Articles/476437/ https://lwn.net/Articles/476437/ giraffedata One good way to lease memory to someone else is to let him run his stuff on your processor and send the results back via IP network. Sat, 21 Jan 2012 01:17:27 +0000 Virtual machines: old and new https://lwn.net/Articles/476435/ https://lwn.net/Articles/476435/ giraffedata It's been about 10 years since x86 servers got virtual machine technology like that used on 1970s servers, because the economics of server granularity made it worthwhile.
I don't know if the technology has completely caught up yet, but it's at least really close. The 1970s virtual machines did not have anything like the memory allocation intelligence we're talking about here. Sat, 21 Jan 2012 01:14:32 +0000 The future calculus of memory management https://lwn.net/Articles/476406/ https://lwn.net/Articles/476406/ ejr <div class="FormattedComment"> Check into the DoE exascale studies. Moving data from RAM into the processor eats more power, and an absurd amount if it involves an off-board network. The power cost of keeping RAM switched on will drop drastically soon (phase-change memory with a relatively small cache), but fetching the data error-free? Painful. The voltage gaps have to keep shrinking.<br> <p> This doesn't necessarily change your VM/oversubscription scenario, but it's crucial at massive scale.<br> </div> Fri, 20 Jan 2012 21:37:14 +0000 Too soon to know https://lwn.net/Articles/476404/ https://lwn.net/Articles/476404/ vlovich <div class="FormattedComment"> But again, that's for IPC. That really doesn't work for RAM. 1us latency (which still isn't possible according to that article) would still be orders of magnitude slower than regular RAM.<br> </div> Fri, 20 Jan 2012 21:14:28 +0000 Too soon to know https://lwn.net/Articles/476394/ https://lwn.net/Articles/476394/ man_ls Latency is probably a bigger problem than bandwidth, and people are already working on <a href="http://highscalability.com/blog/2011/9/15/paper-its-time-for-low-latency-inventing-the-1-microsecond-d.html">the microsecond datacenter</a>. Fri, 20 Jan 2012 20:42:20 +0000 The future calculus of memory management https://lwn.net/Articles/476370/ https://lwn.net/Articles/476370/ Cyberax <div class="FormattedComment"> That doesn't help. Linux uses spare RAM for file system cache, so it quickly fills up.<br> <p> There's the 'balloon driver', which can return memory to the host OS, but it doesn't really work that well with dynamic policies.<br> </div> Fri, 20 Jan 2012 17:56:13 +0000 Again: I'm surprised that you are surprised... https://lwn.net/Articles/476350/ https://lwn.net/Articles/476350/ khim <blockquote><font class="QuotedText">Maybe I'm missing details in the article, but how could you lease RAM from one machine to another?</font></blockquote> <p>This is called a <a href="http://en.wikipedia.org/wiki/Virtual_machine">VM</a>. As was already noted, it's a <a href="http://lwn.net/Articles/476142/">back-to-the-future moment</a>. Looks like PCs are so large and powerful now that we face the same problems mainframes faced eons ago.</p> Fri, 20 Jan 2012 16:08:38 +0000 The future calculus of memory management https://lwn.net/Articles/476349/ https://lwn.net/Articles/476349/ felixfix <div class="FormattedComment"> A virtual server with more RAM could re-allocate RAM on the fly between instances.<br> </div> Fri, 20 Jan 2012 16:05:38 +0000 The future calculus of memory management https://lwn.net/Articles/476323/ https://lwn.net/Articles/476323/ vlovich <div class="FormattedComment"> RAM bus speed is what - something on the order of 17 GBytes/s.
So a 10 Gigabit hub (which btw has to deal with collisions, errors, &amp; other nastiness) is still an order of magnitude slower for bandwidth &amp; you're probably still looking at 2-10x (if not more) larger latencies (which are what's really going to kill you).<br> <p> It's still essentially swap - just maybe faster than current swap storage, &amp; that's a very generous maybe, especially when you start piling on the number of users sharing the bus and/or memory.<br> <p> BTW - the 10 GigE interconnect isn't all that common yet &amp; the cost of it + cabling is still up there.<br> <p> Compare all of this with just adding some extra RAM... I'm just not seeing the use case. Either add RAM or bring up more machines &amp; distribute your computation so as not to need so much RAM on any one box.<br> </div> Fri, 20 Jan 2012 12:27:36 +0000 The future calculus of memory management https://lwn.net/Articles/476322/ https://lwn.net/Articles/476322/ etienne <div class="FormattedComment"> It is probably possible to have a firmware/FPGA box (no processor - too slow) with 64 Gbytes of RAM shared by all members of the 10 Gb/s network, a bit like ATAoE; but security may be a problem.<br> That box could be the Ethernet 10 Gb/s hub/switch.<br> </div> Fri, 20 Jan 2012 12:13:30 +0000 The future calculus of memory management https://lwn.net/Articles/476312/ https://lwn.net/Articles/476312/ vlovich <div class="FormattedComment"> Maybe I'm missing details in the article, but how could you lease RAM from one machine to another? I'm going to assume it's over Ethernet or some other network connection. The question is, can you get the latency &amp; throughput of your network connection to such a point where it beats using local storage as swap (especially with SSDs)? &amp; remember that there will be significant additional computational overhead in terms of managing the state between machines, checking permissions, etc.<br> <p> Local storage: 6Gbps+ @ a latency of ~1-10ms (even more bandwidth &amp; less latency for the RAM-on-PCI-e solutions).<br> <p> "Network" RAM: 1Gbps @ a latency of ~1-10ms (depending on network topology).<br> <p> Btw - this is typically called swap. It never really makes any sense to swap remotely. Now, if you can come up with a mechanism that lets you transport the algorithms seamlessly too (i.e. so you can transfer the code that works with the data to the remote machine as well), then you're on to something, but I think that can only be done on a per-problem basis, so you're back at step 1.<br> <p> So it doesn't seem like there's a compelling reason to add a bunch of complexity for something that's not a benefit. Now, what does make sense is figuring out how to allocate RAM in a way that allows for disabling large parts of it to save on power (i.e. even during execution you can keep most of RAM powered off) for embedded systems &amp; servers.<br> </div> Fri, 20 Jan 2012 11:46:19 +0000 The future calculus of memory management https://lwn.net/Articles/476286/ https://lwn.net/Articles/476286/ ekj <div class="FormattedComment"> Purchasing RAM for a single machine is cheap. But if you've got a million machines, then investing another $100 in each of them still costs you one hundred million dollars.<br> <p> Which means it's worth it to spend a lot of programming hours, even for just a modest decrease in this. A single percentage-point reduction in memory consumption is worth a million dollars.<br> <p> Lots of stuff is worth it at large scale, even if at small scale it doesn't matter.
If you're a small company owning a dozen servers, then in 99% of the cases it'll be cheaper to just throw more RAM at them than to spend programmer time reducing the memory footprint of the applications.<br> <p> If you've got a million servers, the math looks different.<br> </div> Fri, 20 Jan 2012 08:05:35 +0000 The future calculus of memory management https://lwn.net/Articles/476150/ https://lwn.net/Articles/476150/ dgm <div class="FormattedComment"> If you cannot use the memory in your current machine, then the price goes up to that of a whole new system. <br> <p> There are many old machines out there that do not support more than 2 or 4 GB. Many of those machines are in production environments, and are not going to be replaced any time soon for multiple reasons, including the cost of retesting, downtime, and current buying policies.<br> </div> Thu, 19 Jan 2012 18:01:59 +0000 This is a variant on what once was called "Goal Mode" in resource management https://lwn.net/Articles/476142/ https://lwn.net/Articles/476142/ davecb <div class="FormattedComment"> Once more, it's "back to the future" time in computer science (;-))<br> <p> What you describe here is a superset of a problem that we suffered in the days of the mainframe: that of optimizing resource usage against a "success" criterion. One wanted, in those days, to adjust dispatch priority and disk storage to benefit a program that was overloaded, to get it out of trouble.<br> <p> Resource management initially allowed one to set guaranteed minimums, and to share what one wasn't using of one's allocation, but only statically. IBM then introduced a scheme, called "goal mode", that diagnosed a slowdown and added more resources dynamically. It still exists on mainframes.<br> <p> Modern resource management schemes don't go quite that far. We guarantee minimums, provide maximums so as to avoid letting us shoot ourselves in the foot, make sharing of unused resources easy, and selectively penalize memory hogs by making them page against themselves.<br> <p> We need a modern goal mode: as Linux is a hot-bed of resource management research, I wouldn't be unduly surprised to see it happen here.<br> <p> --dave <br> </div> Thu, 19 Jan 2012 17:29:15 +0000 The future calculus of memory management https://lwn.net/Articles/476133/ https://lwn.net/Articles/476133/ Cyberax <div class="FormattedComment"> 4GB of RAM is not a lot. Unless you have hundreds or thousands of nodes.<br> <p> I'm working with genomic data and some of our workloads are "spiky". They usually tend to require 2-3GB of RAM and can easily fit on Amazon EC2 "large" nodes. But sometimes they might require more than 10GB of RAM, which requires much more powerful (and expensive!) nodes. You really start appreciating RAM usage when you're paying for it by the hour (I think I now understand early mainframe programmers).<br> <p> I've cobbled together a system which uses checkpointing to move workloads across nodes when there's a danger of swapping. It works, but I'd really like the ability to 'borrow' extra RAM for short periods.<br> </div> Thu, 19 Jan 2012 16:53:20 +0000 The future calculus of memory management https://lwn.net/Articles/476129/ https://lwn.net/Articles/476129/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; Is memory really that expensive?
For $100 I can just buy 16GB; I suspect for most people it would be cheaper to just add more RAM than to take the productivity loss you get from not having enough.</font><br> <p> It is not just the price of the memory stick. It also uses more power. You might need a more expensive server with more memory slots. It is one more piece of hardware which can fail and need to be replaced.<br> <p> On the software side, the operating system has to use more memory to track all the available pages, and on 32-bit architectures this memory might have to be in the very valuable low memory area (below the 4G limit). Your memory management algorithms might also need more time to manage all these pages.<br> </div> Thu, 19 Jan 2012 15:44:21 +0000 The future calculus of memory management https://lwn.net/Articles/476123/ https://lwn.net/Articles/476123/ cesarb <div class="FormattedComment"> <font class="QuotedText">&gt; Let's assume that the next generation processors have two new instructions: PageOffline and PageOnline. When PageOffline is executed (with a physical address as a parameter), that (4K) page of memory is marked by the hardware as inaccessible and any attempts to load/store from/to that location result in an exception until a corresponding PageOnline is executed.</font><br> <p> Forgive me for being obtuse, but isn't that the Present bit?<br> </div> Thu, 19 Jan 2012 15:29:55 +0000 The future calculus of memory management https://lwn.net/Articles/476111/ https://lwn.net/Articles/476111/ felixfix <div class="FormattedComment"> But then scale that for Google. Even for a small server farm of a hundred servers, that's $10K. Anyone can think of things more useful than $10K of RAM that sits unused most of the time.<br> </div> Thu, 19 Jan 2012 14:32:15 +0000 The future calculus of memory management https://lwn.net/Articles/476097/ https://lwn.net/Articles/476097/ mlankhorst <div class="FormattedComment"> Is memory really that expensive? For $100 I can just buy 16GB; I suspect for most people it would be cheaper to just add more RAM than to take the productivity loss you get from not having enough.<br> </div> Thu, 19 Jan 2012 12:32:51 +0000 One example... https://lwn.net/Articles/476075/ https://lwn.net/Articles/476075/ khim <blockquote><font class="QuotedText"> However, I still think that systems can't be overcommitted safely, so I remain skeptical about what tricks an administrator could pull without violating a service's expectations.</font></blockquote> <p>This depends on what services you have available. Google's infrastructure is a good example. It runs different kinds of tasks on the same machines. Some serve search results and have a "no overquota" policy. Some are batch processes, crawlers, <a href="http://research.google.com/university/exacycle_program.html">exacycle</a>, etc. These can be killed at any time, because you can always just start them on another system.</p> <p>Now, it's not only batch processes that can be run in overcommit mode - Google can even take memory reserved for a critical search process! Because if it actually asks for the memory later, you can kill a non-critical process with extreme prejudice and give the memory to the critical one. Not sure if Google actually does this or not, but it is obviously doable.</p> <p>If you think about what real clusters are doing, you may be surprised just how much work is done by processes which can actually be killed and restarted.
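Here is a toy sketch of such a policy - purely illustrative, with invented task names and sizes, and of course nothing like Google's actual machinery:</p> <pre>
/* Toy sketch: overcommit freely, and under memory pressure reclaim by
 * killing restartable (batch) tasks first.  Invented names and sizes;
 * no real process control. */
#include &lt;stdio.h&gt;

struct task { const char *name; int mb; int restartable; int alive; };

/* Free memory for a critical allocation by killing batch tasks. */
static int reclaim(struct task *t, int n, int needed_mb)
{
    for (int i = 0; i &lt; n &amp;&amp; needed_mb &gt; 0; i++) {
        if (t[i].alive &amp;&amp; t[i].restartable) {
            t[i].alive = 0;   /* kill it; it gets restarted elsewhere */
            needed_mb -= t[i].mb;
            printf("killed %s, freed %d MB\n", t[i].name, t[i].mb);
        }
    }
    return needed_mb &lt;= 0;    /* did we free enough? */
}

int main(void)
{
    struct task t[] = {
        { "search-frontend", 4096, 0, 1 },  /* critical, never killed */
        { "crawler",         2048, 1, 1 },
        { "batch-reindex",   3072, 1, 1 },
    };
    /* the critical task suddenly claims the 4 GB it had reserved */
    if (reclaim(t, 3, 4096))
        printf("critical allocation satisfied\n");
    return 0;
}
</pre> <p>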
Sadly, today such tasks are usually run synchronously in the context of a critical user-facing process, so to use memory efficiently you'll need to do some serious refactoring.</p> Thu, 19 Jan 2012 09:35:48 +0000 The future calculus of memory management https://lwn.net/Articles/476060/ https://lwn.net/Articles/476060/ alankila <div class="FormattedComment"> Hmm. Reading this again, I think the economic logic was not meant to be taken literally, but was only there to illustrate the motivation and the thinking involved. I think I made a mistake around that paragraph and flipped my bozo bit too early. However, I still think that systems can't be overcommitted safely, so I remain skeptical about what tricks an administrator could pull without violating a service's expectations.<br> <p> One thing occurred to me, though: it might be possible to trade one resource for another. Suppose, for instance, that one workload is very memory- and CPU-hungry but does hardly any I/O, and another workload is able to operate at the same efficiency with less memory if it gains more I/O to replace it. It would make sense to trade the resource bounds between the workloads, if the data exists to build a valid model of each workload's throughput within its resource constraints. However, I doubt the models will ever be anything better than purely empirical, and that means they sometimes won't work, and then everything crashes. Ugly.<br> </div> Thu, 19 Jan 2012 08:17:55 +0000 The future calculus of memory management https://lwn.net/Articles/476054/ https://lwn.net/Articles/476054/ alankila <div class="FormattedComment"> This article makes me think that it's time to start looking for a new line of work. I so dislike the notion of replacing the predictable reliability of fixed resources with this vague quasi-economic logic of boom and crunch periods whose timing will depend on the dynamic demands placed on unrelated systems which nevertheless share the same resources.<br> <p> The problem here is that computer systems have a minimum expectation of the availability of the various resources, and that in turn dictates the minimum which must always be available to each service, or that service will risk collapsing under load. Of course, this minimum guaranteed level looks a lot like spare capacity to somebody who doesn't realize that the system is already at the optimal minimum. (The other big complication is that the throughput of a complex system tends to be limited by its most constrained resource, so screwing this up in any of the major resource categories is going to be bad news for throughput.)<br> <p> However, I suppose that sometimes it is guaranteed that resource demand peaks occur at non-overlapping times. So you might put a nightly backup service on a host that also serves some other function during the daytime, or something like that. It is also possible to guarantee minimal resource availability to some virtual machines and leave the rest to struggle under "best effort" guarantees. This kind of competition for resources doesn't sound anything like a "RAM market" to me, though.<br> <p> I will refrain from criticizing this article much further, but I'll personally be rather happy if there are no more like it. RAM servers to make a tidy profit? I find the notion ridiculous and have no idea how it could really work or even make any sense.<br> </div> Thu, 19 Jan 2012 07:44:14 +0000