
The future calculus of memory management

Posted Jan 20, 2012 12:13 UTC (Fri) by etienne (guest, #25256)
In reply to: The future calculus of memory management by vlovich
Parent article: The future calculus of memory management

It is probably possible to build a firmware/FPGA box (no processor, that would be too slow) with 64 GB of RAM shared by all members of the 10 Gb/s network, a bit like ATA over Ethernet; but security may be a problem.
That box could double as the 10 Gb/s Ethernet hub/switch.



The future calculus of memory management

Posted Jan 20, 2012 12:27 UTC (Fri) by vlovich (guest, #63271) [Link]

RAM bus speed is what - something on the order of 17 GB/s? So a 10 Gigabit hub (which, btw, has to deal with collisions, errors, & other nastiness) is still an order of magnitude slower on bandwidth, & you're probably looking at latencies 2-10x larger (if not much more) - and latency is what's really going to kill you.
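
A quick sketch of that bandwidth ratio, using the 17 GB/s figure above and the 10 Gb/s line rate (illustrative arithmetic only, ignoring protocol overhead):

# Bandwidth gap between the quoted RAM bus speed and a 10 Gb/s link.
ram_bw = 17e9        # ~17 GB/s, as quoted above
net_bw = 10e9 / 8    # 10 Gb/s line rate = 1.25 GB/s, before protocol overhead

print(f"RAM bus is ~{ram_bw / net_bw:.0f}x the bandwidth of a 10 Gb/s link")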

It's still essentially swap - just maybe faster than current swap storage & that's a very generous maybe, especially when you start piling on the number of users sharing the bus and/or memory.

BTW - 10 GigE interconnects aren't all that common yet & the cost of the cards + cabling is still up there.

Compare all of this with just adding some extra RAM... I'm just not seeing the use-case. Either add RAM or bring up more machines & distribute your computation to not need so much RAM on any one box.

Too soon to know

Posted Jan 20, 2012 20:42 UTC (Fri) by man_ls (guest, #15091) [Link]

Latency is probably a bigger problem than bandwidth, and people are already working on the microsecond datacenter.

Too soon to know

Posted Jan 20, 2012 21:14 UTC (Fri) by vlovich (guest, #63271) [Link]

But again, that's for IPC. That really doesn't work for RAM. 1 us latency (which still isn't possible according to that article) would still be one to two orders of magnitude slower than regular RAM.

The future calculus of memory management

Posted Jan 21, 2012 1:17 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

One good way to lease memory to someone else is to let him run his stuff on your processor and send the results back via IP network.

The future calculus of memory management

Posted Jan 21, 2012 1:38 UTC (Sat) by vlovich (guest, #63271) [Link]

Sure, but that's essentially how AWS works. It's how all distributed algorithms work. But it's never transparent; you still need to design the program to run distributed in the first place. That's not what was being proposed in the article, though - it was some kind of amorphous proposal to build a "cloud" of RAM, rent it out, & have it all work transparently.

The future calculus of memory management

Posted Jan 27, 2012 1:56 UTC (Fri) by quanstro (guest, #77996) [Link]

10gbe latency is < 50 microseconds on most networks. there are no such things as 10gbe hubs, so collision domains consist of two peers on a full-duplex link. there never has been a 10gbe collision. :-)

The future calculus of memory management

Posted Jan 27, 2012 5:15 UTC (Fri) by vlovich (guest, #63271) [Link]

You're correct - I forgot that 10GbE drops CSMA/CD, so there will never be 10GbE hubs.

But let's compare performance numbers:
RAM/CPU cache: 17-100 GB/s (real), 7-70 ns (real)
Commercial SSD: 0.5 GB/s (real), ~83,000-100,000 ns (real)
10GbE Ethernet: 1.25 GB/s (max theoretical), ~50,000 ns

10GbE performance is about comparable to a SATA SSD.
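
A sketch of what those figures mean per 4 KiB page fault - latency plus transfer time, using the numbers above (real devices add controller and protocol overhead):

# Rough service time for a single 4 KiB page, using the figures listed above.
PAGE = 4096  # bytes

media = {
    # name: (latency in ns, bandwidth in bytes/s)
    "RAM":      (70,     17e9),
    "SATA SSD": (90_000, 0.5e9),
    "10GbE":    (50_000, 1.25e9),
}

for name, (lat_ns, bw) in media.items():
    total_us = (lat_ns + PAGE / bw * 1e9) / 1000  # latency + transfer, in us
    print(f"{name:8s} ~{total_us:6.1f} us per 4 KiB page")

The 10GbE and SATA SSD lines come out within a factor of two of each other, which is the point being made.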

Let's analyze cost:
128 GB RAM (maximum amount you can currently outfit - 8 × 16 GB): ~$1400+
64 GB RAM: $400-600
32 GB RAM: $200-300
4 GB RAM: ~$30
2 GB RAM: ~$20
128 GB SSD: ~$200
10GbE controller w/ 2 ports: ~$500 (& the quality of the controller is very important)
10GbE switch: $4000-25000 (the $4k one is a 16-port switch, so roughly the same per-port cost as the controllers)

So for the Ethernet approach, the total cost is C_shared_ram + C_server + n * (1.5 * C_ethernet + C_per_machine_ram + C_machine).
Setting that equal to the no-shared-RAM cost of n * (C_dedicated_ram + C_dedicated_machine), the break-even point is n = (C_shared_ram + C_server) / (C_dedicated_ram - 1.5 * C_ethernet - C_per_machine_ram + (C_dedicated_machine - C_machine)), since the machine for the dedicated-RAM case will be slightly more expensive.

At 64 GB of RAM or lower, it's obviously never cheaper to use 10GbE to share that RAM than to just put it into the machine (each 10GbE connection comes to roughly $750 worth of controller and switch ports on average).

Assuming a $200 CPU, $500 server motherboard & $100 power supply, you're looking at $800 for a machine that takes 128GB of RAM, $500 for the ones that take less.

For 4 GB per machine + 128 GB shared vs 128 GB dedicated, you're looking at a cross-over of about 3.2 machines. A 10GbE port needs a 4x PCIe link, and the controller has 2 ports, so each controller needs at least 8 lanes. Assuming the server has 5 slots with enough PCIe lanes, that's 10 machines you can share the RAM with, & thus ~$2800 saved for every 10 machines. Of course, the performance hit of going over 4 GB on any machine will increase as more machines share the RAM, so even at the 3.2 cross-over you're potentially below the performance of just having a dedicated SSD for swap (if you use 64 GB of RAM + an SSD, your savings for 10 machines will be $48-$340 depending on the cost of the RAM & SSD).
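
A sketch of that break-even arithmetic, plugging in the prices above (one assumption: the shared-RAM server's own ~$750 10GbE connection is folded into its cost, which is what reproduces the ~3.2 figure):

# Break-even point for 10GbE-shared RAM vs. dedicated RAM, using the rough
# 2012 prices quoted above (illustrative; the server's $750 NIC is an assumption).
C_shared_ram        = 1400        # 128 GB in the shared-RAM server
C_server            = 800 + 750   # 128 GB-capable box + its own 10GbE connection
C_ethernet          = 500         # dual-port 10GbE controller
C_per_machine_ram   = 30          # 4 GB in each client machine
C_machine           = 500         # client machine (small motherboard)
C_dedicated_ram     = 1400        # 128 GB per machine in the dedicated case
C_dedicated_machine = 800         # machine that can take 128 GB

per_client_shared    = 1.5 * C_ethernet + C_per_machine_ram + C_machine
per_client_dedicated = C_dedicated_ram + C_dedicated_machine

n = (C_shared_ram + C_server) / (per_client_dedicated - per_client_shared)
print(f"break-even at ~{n:.1f} machines")   # ~3.2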

To me, it seems like at no point is 10GbE swap an attractive option - the cost savings aren't that compelling & the performance seems much worse than just using more RAM & having an SSD dedicated for swap.

Now, some kind of VM ballooning scheme where you have a 128 GB machine & the VMs can vary the amount of RAM they use is obviously a more attractive solution & makes a lot of sense (combined with page deduplication + compcache).

The future calculus of memory management

Posted Jan 27, 2012 20:33 UTC (Fri) by djm1021 (guest, #31130) [Link]

Thanks for the thorough cost analysis.

First, RAMster works fine over built-in 1GbE and provides an improvement over spinning rust, so 10GbE is not required. True, it's probably not faster than a local SSD, but that misses an important point: to use a local SSD, you must permanently allocate it (or part of it) for swap. So if you have a thousand machines, EACH machine must have a local SSD (or part of it) dedicated for swap. By optimizing RAM utilization across machines, the local SSD is not necessary... though some larger ones on some of the thousand machines might be.

The key point is statistics: the max of sums is almost always less (and often MUCH less) than the sum of maxes. This is especially true when the value you are measuring (in this case the "working set") varies widely. Whether you are counting RAM or SSD or both, if you can "overflow" to a remote resource when the local resource is insufficient, the total resource needed (RAM and/or SSD) is less (and often MUCH less). Looking at it the other way, you can always overprovision a system, but at some point the ROI diminishes. E.g. why don't you always max out RAM on all of your machines? Because it's not cost-effective. So why would you want to overprovision EVERY machine with an SSD?
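
A toy simulation of that statistical argument (the workload distribution is invented; only the gap between the two totals matters):

# "Max of sums" vs. "sum of maxes" for per-machine working sets.
# The workload numbers are made up; the point is the size of the gap.
import random

random.seed(1)
machines, samples = 1000, 1000

def working_set_gb():
    # Usually a 2-3 GB working set, with a rare spike toward 16 GB.
    return 2 + (14 * random.random() if random.random() < 0.02 else random.random())

history = [[working_set_gb() for _ in range(machines)] for _ in range(samples)]

sum_of_maxes = sum(max(row[m] for row in history) for m in range(machines))
max_of_sums  = max(sum(row) for row in history)

print(f"provisioning every machine for its own peak: {sum_of_maxes:7.0f} GB")
print(f"provisioning for the cluster-wide peak:      {max_of_sums:7.0f} GB")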

P.S. The same underlying technology can be used for RAMster as for VMs. See http://lwn.net/Articles/454795/
And for RAMster, see: http://marc.info/?l=linux-mm&m=132768187222840&w=2

The future calculus of memory management

Posted Jan 27, 2012 22:59 UTC (Fri) by vlovich (guest, #63271) [Link]

The point I was making was that 10GbE was too expensive (since 10GbE is roughly at performance parity with mass storage). With 1GbE, you're looking at a lot less performance than attached storage. So yes, if your servers are using a regular spinning disk, then this *MIGHT* help performance a bit, since latency may be theoretically better (but throughput will be much lower). However, as the cluster grows, the performance will degrade (latency will get worse since the switch has to process more, + bandwidth will decrease since your switch likely can't actually sustain n ports * 1GbE of simultaneous throughput).

So the point I've been trying to understand is: what kind of application would actually benefit from this? The target scenario seems to be:

spinning storage in each box, a limited amount of RAM, lots of machines, hitting swap occasionally, extremely cost-conscious, not performance-conscious.

So here are a few possible alternatives that provide significantly better performance for not much extra cost:
* add more RAM to 1 or more boxes (depending on how much budget you have) & make sure heavy workloads only go to those machines.
* if adding more RAM is too expensive, use equivalently-priced SSDs instead of spinning disks - yes, you have less storage per machine, but use a cluster FS for the data instead (my hunch is that a cluster FS + SSD for swap will actually get you better performance than spinning disk + RAMster for swap).
* use a growable swap instead of a fixed-size swap? that'll get you more efficient use of the storage space.

Have you actually characterized the performance of RAMster vs spinning disk vs SSD? I think it's useful to see the cost/performance tradeoff in some scenarios. Furthermore, it's important to weigh the cost of RAM against the cost of the system, what kind of performance the application requires, & what the specs of the machines in the cluster are.

Also, getting back to the central thesis of the article: if you're hitting swap you're looking at a huge performance hit anyway, so why require paying for the "RAM swap" you use when you already pay for it implicitly in performance? (i.e. you are already encouraged to use less RAM to improve performance, so charging for swap use doesn't add anything; if you could lower your RAM usage, you would already have an incentive to do so.)

The future calculus of memory management

Posted Jan 27, 2012 23:46 UTC (Fri) by djm1021 (guest, #31130) [Link]

Good discussion! A few replies.

> what kind of application would actually benefit from this?

Here's what I'm ideally picturing (though I think RAMster works for other environments too):

Individual machines are getting smaller and are more likely to share external storage rather than have local storage. Think of cloud providers that are extremely cost-conscious but can provide performance if paid enough (just as web hosts today offer a virtual-server vs physical-server option, the latter being more expensive).

Picture a rack of "microblades" in a lights-out data center, no local disks, labor-intensive and/or downtime-intensive to remove and upgrade (or perhaps closed sheet-metal so not upgradeable at all), RAM maxed out (or not upgradeable), some kind of fabric or mid-plane connecting blades in the rack (but possibly only one or two GbE ports per blade).

> but throughput will be much lower

There isn't a high degree of locality in swap read/write so readahead doesn't help a whole lot. Also, with RAMster, transmitted pages are compressed so less data is moved across the wire.
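
For a rough feel of how much compression offsets the thinner 1GbE pipe, a sketch (zlib stands in for whatever RAMster actually uses in the kernel, and the sample page content is invented and artificially compressible):

# Wire time for one 4 KiB page over 1GbE, raw vs. compressed.
# zlib is only a stand-in for RAMster's in-kernel compression.
import zlib

PAGE = 4096
GBE = 1e9 / 8   # 1 Gb/s line rate in bytes/s, ignoring protocol overhead

page = (b"fairly repetitive application data " * 200)[:PAGE]
compressed = zlib.compress(page)

for label, size in (("raw", PAGE), ("compressed", len(compressed))):
    print(f"{label:10s} {size:5d} bytes -> {size / GBE * 1e6:5.1f} us on the wire")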

> Have you actually characterized the performance of RAMster vs spinning disk vs SSD?

Yes, some, against disk, not SSD. RAMster has only recently been working well enough.

The future calculus of memory management

Posted Jan 28, 2012 1:32 UTC (Sat) by dlang (subscriber, #313) [Link]

> Individual machines are getting smaller and more likely to share external storage rather than have local storage.

I agree they are more likely to share external storage, but I disagree about them getting smaller.

with virtualization, the average machine size is getting significantly larger (while the resources allocated to an individual VM are significantly smaller than what an individual machine used to be)

there is a significant per-physical-machine overhead in terms of cabling, connectivity, management, and fragmentation of resources. In addition, the sweet spot of price/performance keeps climbing.

as a result, the machines are getting larger.

for what you are saying to be true, a company would have to buy into SAN without also buying into VMs, and I think the number of companies making that particular combination of choices is rather small.

The future calculus of memory management

Posted Jan 27, 2012 18:37 UTC (Fri) by perfgeek (guest, #82604) [Link]

10GbE latency on things like netperf TCP_RR is indeed in the range of 50 microseconds, and with lower-level things one can probably do < 10 microseconds, but isn't that an unloaded latency, without all that many hops and probably measured with something less than 4096 bytes flowing in either direction? Also, can we really expect page sizes to remain 4096 bytes?
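
On the page-size point, a minimal sketch of serialization time alone at 10 Gb/s line rate (no hops, no protocol overhead, no load):

# Time just to serialize a page onto a 10 Gb/s link, for a few page sizes.
LINE_RATE = 10e9 / 8   # bytes per second

for size in (4096, 64 * 1024, 2 * 1024 * 1024):   # 4 KiB, 64 KiB, 2 MiB
    print(f"{size:>8d} bytes -> {size / LINE_RATE * 1e6:7.1f} us on the wire")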

