
LC-Asia: Facebook contemplates ARM servers

By Jonathan Corbet
March 12, 2013
By any reckoning, the ARM architecture is a big success; there are more ARM processors shipping than any other type. But, despite the talk of ARM-based server systems over the last few years, most people still do not take ARM seriously in that role. Jason Taylor, Facebook's Director of Capacity Engineering & Analysis, came to the 2013 Linaro Connect Asia event to say that it may be time for that view to change. His talk was an interesting look into how one large, server-oriented operation thinks ARM may fit into its data centers.

It should come as a surprise to few readers that Facebook is big. The company claims 1 billion users across the planet. Over 350 million photographs are uploaded to Facebook's servers every day; Jason suggested that perhaps 25% of all photos taken end up on Facebook. The company's servers handle 4.2 billion "likes," posts, and comments every day, along with vast numbers of users checking in. To be able to handle that kind of load, Facebook invests a lot of money in its data centers; that, in turn, has naturally led to a high level of interest in efficiency.

Facebook sees a server rack as its basic unit of computing. Those racks are populated with five standard types of server; each type is optimized for the needs of one of the top five users within the company. Basic web servers offer a lot of CPU power, but not much else, while database servers are loaded with a lot of memory and large amounts of flash storage capable of providing high I/O operation rates. "Hadoop" servers offer medium levels of CPU and memory, but large amounts of rotating storage; "haystack" servers offer lots of storage and not much of anything else. Finally, there are "feed" servers with fast CPUs and a lot of memory; they handle search, advertisements, and related tasks. The fact that these servers run Linux wasn't really even deemed worth mentioning.

There are clear advantages to focusing on a small set of server types. The machines become cheaper as a result of volume pricing; they are also easier to manage and easier to move from one task to another. New servers can be allocated and placed into service in a matter of hours. On the other hand, these servers are optimized for specific internal Facebook users; everybody else just has to make do with servers that might not be ideal for their needs. Those needs also tend to change over time, but the configuration of the servers remains fixed. There would be clear value in the creation of a more flexible alternative.

Facebook's servers are currently all built using large desktop processors made by Intel and AMD. But, Jason noted, interesting things are happening in the area of mobile processors. Those processors will cross a couple of important boundaries in the next year or two: 64-bit versions will be available, and they will start reaching clock speeds of 2.4 GHz or so. As a result, he said, it is becoming reasonable to consider the use of these processors for big, compute-oriented jobs.

That said, there are a couple of significant drawbacks to mobile processors. The number of instructions executed per clock cycle is still relatively low, so, even at a high clock rate, mobile processors cannot get as much computational work done as desktop processors. And that hurts because processors do not run on their own; they need to be placed in racks, provided with power supplies, and connected to memory, storage, networking, and so on. A big processor reduces the relative cost of those other resources, leading to a more cost-effective package overall. In other words, using "wimpy cores" means buying roughly three times as many of those other fixed-cost components to deliver the same total computing power.
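That trade-off can be sketched with a back-of-envelope calculation; all of the figures below are illustrative assumptions, not numbers from the talk:

```python
# Back-of-envelope sketch of the "wimpy cores" objection: per-server fixed
# costs (rack space, power supplies, NIC, storage) are amortized over the
# compute each processor delivers. All numbers here are made up.

def cost_per_compute(cpu_cost, fixed_cost, relative_performance):
    """Total system cost divided by the compute it delivers."""
    return (cpu_cost + fixed_cost) / relative_performance

# Hypothetical figures: a desktop-class CPU at 3x the performance of a
# mobile CPU, with the same ~$400 of fixed costs around each server.
big = cost_per_compute(cpu_cost=600, fixed_cost=400, relative_performance=3.0)
small = cost_per_compute(cpu_cost=100, fixed_cost=400, relative_performance=1.0)

# The small CPU is cheaper to buy, but the fixed costs weigh far more
# heavily per unit of compute delivered.
print(f"big core:   ${big:.0f} per unit of compute")
print(f"small core: ${small:.0f} per unit of compute")
```

Even with the cheaper chip, the per-system overhead dominates, which is why a dense design like "Group Hug" amortizes those costs over many processors per board.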

Facebook's solution to this problem is a server board called, for better or worse, "Group Hug." This design, being put together and published through Facebook's Open Compute Project, puts ten ARM processor boards onto a single server board; each processor has a 1Gb network interface which is aggregated, at the board level, into a single 10Gb interface. The server boards have no storage or other peripherals. The result is a server board with far more processors than a traditional dual-socket board, but with roughly the same computing power as a server board built with desktop processors.

These ARM server boards can then be used in a related initiative called the "disaggregated rack." The problem Facebook is trying to address here is the mismatch between available server resources and what a particular task may need. A particular server may provide just the right amount of RAM, for example, but the CPU will be idle much of the time, leading to wasted resources. Over time, that task's CPU needs might grow to the point that the CPU power on its servers becomes inadequate, slowing things down overall. With Facebook's current server architecture, it is hard to keep up with the changing needs of this kind of task.

In a disaggregated rack, the resources required by a computational task are split apart and provided at the rack level. CPU power is provided by boxes with processors and little else — ARM-based "Group Hug" boards, for example. Other boxes in the rack may provide RAM (in the form of a simple key/value database service), high-speed storage (lots of flash), or high-capacity storage in the form of a pile of rotating drives. Each rack can be configured differently, depending on a specific task's needs. A rack dedicated to the new "graph search" feature will have a lot of compute servers and flash servers, but not much storage. A photo-serving rack, instead, will be dominated by rotating storage. As needs change, the configuration of the rack can change with it.
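As a sketch of the idea, a disaggregated rack can be modeled as a list of single-purpose boxes composed per workload; the box types, capacities, and counts below are hypothetical, not Facebook's actual design:

```python
# Toy model of a "disaggregated rack": racks are assembled from
# single-purpose resource boxes, with a different mix per workload.
# Box kinds, capacities, and counts are invented for illustration.

from dataclasses import dataclass

@dataclass
class Box:
    kind: str        # "compute", "ram", "flash", or "disk"
    capacity: int    # cores, GB, or TB, depending on the kind

def build_rack(spec):
    """Expand a {kind: number_of_boxes} spec into a rack of boxes."""
    sizes = {"compute": 40, "ram": 512, "flash": 10, "disk": 120}
    return [Box(kind, sizes[kind]) for kind, n in spec.items() for _ in range(n)]

# A search-style rack: heavy on compute and flash, no rotating storage.
graph_search_rack = build_rack({"compute": 12, "ram": 4, "flash": 6})
# A photo-serving rack: dominated by rotating storage.
photo_rack = build_rack({"compute": 2, "disk": 16})

print(len(graph_search_rack), len(photo_rack))
```

The point of the design is that changing a workload's resource mix becomes a matter of re-cabling or re-provisioning boxes in the rack rather than buying a new server type.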

All of this has become possible because the speed of network interfaces has increased considerably. With networking speeds up to 100Gb/sec within the rack, the local bandwidth begins to look nearly infinite, and the network can become the backplane for computers built at a higher level. The result is a high-performance computing architecture that allows systems to be precisely tuned to specific needs and allows individual components to be depreciated (and upgraded) on independent schedules.

Interestingly, Jason's talk did not mention power consumption — one of ARM's biggest advantages — at all. Facebook is almost certainly concerned about the power costs of its data centers, but Linux-based ARM servers are apparently of interest mostly because they can offer relatively inexpensive and flexible computing power. If the disaggregated rack experiment succeeds, it may well demonstrate one way in which ARM-based servers can take a significant place in the data center.

[Your editor would like to thank Linaro for travel assistance to attend this event.]

Index entries for this article
Conference: Linaro Connect/2013



LC-Asia: Facebook contemplates ARM servers

Posted Mar 14, 2013 10:33 UTC (Thu) by epa (subscriber, #39769) (14 responses)

Since ARM cores are relatively low power, it's a pity that no chip maker sells a package with, say, sixteen of them on a single integrated circuit. The heat output from that would probably still be less than a single Intel Xeon server chip.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 14, 2013 17:21 UTC (Thu) by robert_s (subscriber, #42402) (13 responses)

It's not really that simple. For instance, what would you expect the memory topology to be for such a processor?

All in a single shared memory space? There would be an awful lot of cache coherency work to do there. AFAIK, ARM's current (released) MPCore implementation only goes up to 4-way.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 14, 2013 19:27 UTC (Thu) by butlerm (subscriber, #13312) (12 responses)

Why bother with cache coherency? Just make a chip with sixteen independent processors. Add a high-speed internal interconnect and you would have a very nice cluster on a chip. Each processor could be associated with a fixed amount of RAM from the external DIMMs, with no instruction-level access to other processors' RAM and no cache coherency issues other than what you might have for DMA.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 14, 2013 22:12 UTC (Thu) by bronson (subscriber, #4806) (4 responses)

So each CPU gets 1/16 of the memory bus? You'd need an absurd amount of cache to make that competitive with current desktop offerings.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 15, 2013 0:47 UTC (Fri) by smoogen (subscriber, #97)

I would expect the answer would be to make 16 memory busses. Since each ARM core would see at best a 4GB DIMM, you could still come out ahead cost-wise :).

[And no this is not meant to be serious. But I filed a patent on it anyway.]

LC-Asia: Facebook contemplates ARM servers

Posted Mar 15, 2013 1:17 UTC (Fri) by butlerm (subscriber, #13312) (2 responses)

Each core on an eight-core processor gets 1/8 of the memory bus already. If you have non-SMP cores, what is the difference? The most obvious one is that, if each core is less powerful, you had better make sure that your application can use all of them. If you are memory-bandwidth constrained, more cores aren't going to help, but they aren't going to hurt very much either.

An active market for processors with up to sixteen x86 cores certainly suggests that devices with sixteen or even thirty-two smaller, somewhat lower-clocked cores aren't out of line, provided you are dealing with applications that work well on a horizontally scaled, non-SMP basis. Not workstations; servers.

Cache coherency doesn't scale. That is why companies like Facebook and Google use racks of thousands of relatively lightweight servers in the first place, instead of hundred processor NUMA setups. Going to non-SMP cores on the same silicon substrate is the next logical step in that evolution. For any sufficiently large scale application, SMP is a crutch, and a power hungry, expensive one at that.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 15, 2013 1:43 UTC (Fri) by dlang (guest, #313) (1 response)

you can't just connect processors to the memory bus; you need to set up some form of arbitration between them.

you can create systems with lots of cores, but it's not easy, it's not cheap, and it's not low power.

There's a lot more to making a non-SMP system than just throwing cores on a memory bus.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 15, 2013 5:45 UTC (Fri) by butlerm (subscriber, #13312)

It certainly wouldn't be trivial, but as long as the power required per core doesn't go up, you would come out far ahead of any kind of cache-coherent design. Large-scale applications can't rely on cache coherency, and so they don't. At some point it is just a waste of silicon.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 15, 2013 9:11 UTC (Fri) by epa (subscriber, #39769) (2 responses)

Or... each processor can see all the address space, but only write to 1/16 of it. So if you are processor zero you have full, cached access to the lower one gigabyte of memory, and read-only access to the rest. That would simplify cache coherency a great deal.
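As a toy model of that scheme (the slice size and ownership function are illustrative, not a real ARM memory map):

```python
# Toy model of the scheme described above: every core can read the whole
# physical address space, but may write only its own 1/16 slice.
# Purely illustrative; no real hardware is modeled here.

N_CORES = 16
GB = 1 << 30

def owner(addr, total_bytes=16 * GB):
    """Which core owns (i.e., may write) this physical address."""
    return addr // (total_bytes // N_CORES)

def check_write(core, addr):
    """Reject writes outside the core's own slice; reads are always allowed."""
    if owner(addr) != core:
        raise PermissionError(f"core {core} cannot write address {addr:#x}")

check_write(0, 0x1000)       # fine: low memory belongs to core 0
# check_write(0, 5 * GB)     # would raise: that slice belongs to core 5
```

Since each cache line has exactly one writer, a core's cached copies of its own slice can never be stale, which is what removes the coherency traffic.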

LC-Asia: Facebook contemplates ARM servers

Posted Mar 15, 2013 16:39 UTC (Fri) by robert_s (subscriber, #42402) (1 response)

> That would simplify cache coherency a great deal.

But probably complicate OS design massively.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 17, 2013 23:42 UTC (Sun) by butlerm (subscriber, #13312)

I don't think you need to go quite as far as saying you can't have shared writable memory at all, the issue is that hardware can't maintain cache coherency across core specific caches on a scalable or efficient basis. So what you need is for any caching of such memory to be carefully arbitrated in software, e.g. on an acquire and release basis, and for the use of shared mutable data structures to be minimized in favor of things that you can partition across cores on a relatively long term basis.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 15, 2013 11:48 UTC (Fri) by etienne (guest, #25256) (1 response)

Possible problems:
- How to start/stop tasks on other processors? Use ssh?
- How to migrate tasks from overloaded processors to under-used ones?
- How to share the page cache? Have the kernel and all libraries in each address space?
- How to share input/output? Network, serial lines/keyboards, screens, hard disks?
You may decide to have one processor controlling the other 15, with special access to their memory, but usually that processor will be the weakest link.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 17, 2013 20:47 UTC (Sun) by butlerm (subscriber, #13312)

If all you need is a cluster on a chip those are largely solved problems. Existing cluster applications have the necessary management tools, and I/O devices can be either hardwired to specific cores or shared using something like PCI IOV. It might be convenient to have an internal high speed Ethernet interface for each core with an embedded switching fabric, for example, although something designed for the purpose would probably be more efficient for on chip communication.

A long term solution to the cache coherency problem would indeed substantially complicate kernel design. For example, if the memory overhead of each core is a problem, what you really want is a unified physical address space, explicit sharing of read only code and data, and no sharing or highly restricted sharing of mutable code and data. There is no reason why the kernels running on each core can't trust each other, they just can't scalably share mutable data on an instruction by instruction basis.

And if you really want a single system image, a much more radical redesign would be required. In the long run, that is probably inevitable, because cache coherency doesn't scale, not across per-core caches at any rate.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 15, 2013 17:05 UTC (Fri) by robert_s (subscriber, #42402) (1 response)

So you need 16 copies of the OS and common software in memory?

How much memory were you planning on putting on those external DIMMs?

If "a lot", how were you planning on cramming all those lanes into the single smallish SoC package?

There are of course many-many-core chips like those made by Tilera, but note that these seem to use a rather exotic and, well, "different" memory architecture that can't just be cooked up by a SoC vendor "throwing" 16 cores onto a die.

LC-Asia: Facebook contemplates ARM servers

Posted Mar 21, 2013 19:29 UTC (Thu) by dfsmith (guest, #20302)

Well, let's say there's about 100MB of system software. With the price of DDR3 at around $6/GB (http://www.dramexchange.com/), that makes it on the order of 60 cents per CPU unit.

Fixing the issue to reduce system software overhead would probably cost 4 person-years; or about $1M of burdened payroll.

That puts payback at around 2 million CPU units. Feasible, but probably not cost effective. And by the time the redesign is finished, DRAM prices will have dropped again.
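A quick check of that arithmetic, using the same assumed figures (100MB of duplicated system software per core, DDR3 at $6/GB, and $1M of engineering cost):

```python
# Verifying the payback estimate above with the same assumed figures.

dram_price_per_gb = 6.00          # dollars, from the dramexchange quote
overhead_gb = 100 / 1024          # ~100MB of duplicated system software
cost_per_cpu = overhead_gb * dram_price_per_gb

engineering_cost = 1_000_000      # ~4 person-years of burdened payroll
break_even_units = engineering_cost / cost_per_cpu

print(f"${cost_per_cpu:.2f} per CPU")        # about $0.59
print(f"{break_even_units:,.0f} CPU units")  # about 1.7 million
```

So the break-even point works out to roughly 1.7 million CPU units, consistent with the "around 2 million" figure above.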

LC-Asia: Facebook contemplates ARM servers

Posted Mar 14, 2013 11:36 UTC (Thu) by wingo (guest, #26929) (5 responses)

Interesting article from RealWorldTech on "microservers" and the potential for alternative architectures to win on the server side: http://www.realworldtech.com/microservers/

The summary seems to be that it's complicated!

Calxeda's ARM server tested

Posted Mar 14, 2013 13:15 UTC (Thu) by pflugstad (subscriber, #224) (4 responses)

Another recent article:

http://www.anandtech.com/show/6757/calxedas-arm-server-te...

I'm still skeptical that ARM servers are anything but a fad. As RealWorldTech's article above noted, they'll need to offer a 10x improvement in some aspect of a job in order to make inroads, which means they'll need to specialize in a big way. And even then, I'm doubtful.

It remains to be seen if ARMv8 (64 bit) can improve performance enough without increasing power to make a difference.

Calxeda's ARM server tested

Posted Mar 15, 2013 0:49 UTC (Fri) by smoogen (subscriber, #97)

The only way to validate or invalidate that statement is to ask how long you think a fad lasts. I ask this because I know professors out there who still think structured programming is a fad; they just aren't sure when it will end (I said once Perl went mainstream, myself ;)).

Calxeda's ARM server tested

Posted Mar 15, 2013 16:45 UTC (Fri) by geuder (guest, #62854) (2 responses)

> I'm still skeptical that ARM Servers are anything but a fad.

I'd say at the moment they are hype. How many models are commercially usable without the kind of experiments described in the article? How many have been sold?

Whether they will be forgotten after the hype, turn into a fad (like netbooks 5 years ago), or come to stay, I wouldn't dare to predict. It looks promising to me, so as a distro or Linux vendor I would be working on it. But many things have looked promising at some stage and never lived up to their promises. Unless you are Facebook or Google or something close to that, I would not care about deployment yet; until AArch64 is ready for mass deployment it's too early to say. Of course, if somebody has a nice business idea in the area and can get venture capital for it, why not. For an engineer, ARM is much nicer than another widely used architecture...

Calxeda's ARM server tested

Posted Mar 15, 2013 19:06 UTC (Fri) by dlang (guest, #313) (1 response)

I think that ARM servers are here to stay.

However, I am NOT saying that ARM servers will replace amd64 servers in the datacenter.

ARM servers are here to stay for home/small office environments where a small ARM box (including many current wireless access points) provides all the power needed to be a fileserver, mail server, web server, etc.

I agree with you that it's an open question if ARM servers are going to scale up to replace the high-density datacenter nodes or not.

But distros should be supporting ARM anyway, because there are already many places where they are 'good enough', and as they get more powerful, the number of places where they will be 'good enough' is going to grow.

Low power home server

Posted Mar 18, 2013 21:36 UTC (Mon) by man_ls (guest, #15091)

Well said. I have had a SheevaPlug server in my home for several years and I am very happy with it. Now I am planning to replace it with a Raspberry Pi, which might also be used as a desktop if the need arises. They are versatile computers that waste very little energy and can be left on 24/7.

ARM is the third most popular architecture for Debian after amd64 and i386. Interestingly, armel appears to have reached a plateau lately, with the highest point around late 2011.


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds