A look at The Machine
In what was perhaps one of the shortest keynotes on record (ten minutes), Keith Packard outlined the hardware architecture of "The Machine"—HP's ambitious new computing system. That keynote took place at LinuxCon North America in Seattle and was thankfully followed by an hour-long technical talk by Packard the following day (August 18), which looked at both the hardware and software for The Machine. It is, in many ways, a complete rethinking of the future of computers and computing, but there is a fairly long way to go between here and there.
The hardware
The basic idea of the hardware is straightforward. Many of the "usual computation units" (i.e. CPUs or systems on chip—SoCs) are connected to a "massive memory pool" using photonics for fast interconnects. That leads to something of an equation, he said: electrons (in CPUs) + photons (for communication) + ions (for memory storage) = computing. Today's computers transfer a lot of data and do so over "tiny little pipes". The Machine, instead, can address all of its "amazingly huge pile of memory" from each of its many compute elements. One of the underlying principles is to stop moving memory around to use it in computations—simply have it all available to any computer that needs it.
![Keith Packard](https://static.lwn.net/images/2015/lcna-packard-sm.jpg)
Some of the ideas for The Machine came from HP's DragonHawk systems, which were traditional symmetric multiprocessing systems, but packed a "lot of compute in a small space". DragonHawk systems would have 12TB of RAM in an 18U enclosure, while the nodes being built for The Machine will have 32TB of memory in 5U. It is, he said, a lot of memory and it will scale out linearly. All of the nodes will be connected at the memory level so that "every single processor can do a load or store instruction to access memory on any system".
Nodes in this giant cluster do not have to be homogeneous, as long as they are all hooked to the same memory interconnect. The first nodes that HP is building will be homogeneous, just for pragmatic reasons. There are two circuit boards on each node, one for storage and one for the computer. Connecting the two will be the "next generation memory interconnect" (NGMI), which will also connect both parts of the node to the rest of the system using photonics.
The compute part of the node will have a 64-bit ARM SoC with 256GB of purely local RAM along with a field-programmable gate array (FPGA) to implement the NGMI protocol. The storage part will have four banks of memory (each with 1TB), each with its own NGMI FPGA. A given SoC can access memory elsewhere without involving the SoC on the node where the memory resides—the NGMI bridge FPGAs will talk to their counterpart on the other node via the photonic interface. Those FPGAs will eventually be replaced by application-specific integrated circuits (ASICs) once the bugs are worked out.
ARM was chosen because it was easy to get ARM vendors to talk with the project, Packard said. There is no "religion" about the instruction set architecture (ISA), so others may be used down the road.
Eight of these nodes can be collected up into a 5U enclosure, which gives eight processors and 32TB of memory. Ten of those enclosures can then be placed into a rack (80 processors, 320TB) and multiple racks can all be connected on the same "fabric" to allow addressing up to 32 zettabytes (ZB) from each processor in the system.
The storage and compute portions of each node are powered separately. The compute piece has two 25Gb network interfaces that are capable of remote DMA. The storage piece will eventually use some kind of non-volatile/persistent storage (perhaps even the fabled memristor), but is using regular DRAM today, since it is available and can be used to prove the other parts of the design before switching.
SoCs in the system may be running more than one operating system (OS) and for more than one tenant, so there are some hardware protection mechanisms built into the system. In addition, the memory-controller FPGAs will encrypt the data at rest so that pulling a board will not give access to the contents of the memory even if it is cooled (à la cold boot) or when some kind of persistent storage is used.
At one time, someone said that 640KB of memory should be enough, Packard said, but now he is wrestling with the limits of the 48-bit addresses used by the 64-bit ARM and Intel CPUs. That only allows addressing up to 256TB, so memory will be accessed in 8GB "books" (or, sometimes, 64KB "booklettes"). Beyond the SoC, the NGMI bridge FPGA (which is also called the "Z bridge") deals with two different kinds of addresses: 53-bit logical Z addresses and 75-bit Z addresses. Those allow addressing 8PB and 32ZB respectively.
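Those limits are straightforward powers of two; a quick sanity check of the arithmetic (a minimal sketch using binary units, where 1TB is 2^40 bytes, 1PB is 2^50, and 1ZB is 2^70):

```c
/* Sanity-check the address-space sizes quoted above (binary units). */
#include <stdio.h>

int main(void)
{
    printf("48-bit: %llu TB\n", (1ULL << 48) >> 40);   /* 256 TB */
    printf("53-bit: %llu PB\n", (1ULL << 53) >> 50);   /* 8 PB   */
    /* 2^75 bytes overflows a 64-bit integer, so compute 2^(75-70) instead. */
    printf("75-bit: %llu ZB\n", 1ULL << (75 - 70));    /* 32 ZB  */
    return 0;
}
```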
The logical Z addresses are used by the NGMI firewall to determine the access rights to that memory for the local node. Those access controls are managed outside of whatever OS is running on the SoC. So the mapping of memory is handled by the OS, while the access controls for the memory are part of the management of The Machine system as a whole.
NGMI is not intended to be a proprietary fabric protocol, Packard said, and the project is trying to see if others are interested. A memory transaction on the fabric looks much like a cache access. The Z address is presented and 64 bytes are transferred.
The software
Packard's group is working on GPL operating systems for the system, but others can certainly be supported. If some "proprietary Washington company" wanted to port its OS to The Machine, it certainly could. Meanwhile, though, other groups are working on other free systems, but his group is made up of "GPL bigots" who are working on Linux for the system. There will not be a single OS (or even distribution or kernel) running on a given instance of The Machine—it is intended to support multiple different environments.
Probably the biggest hurdle for the software is that there is no cache coherence within the enormous memory pool. Each SoC has its own local memory (256GB) that is cache coherent, but accesses to the "fabric-attached memory" (FAM) between two processors are completely uncoordinated by hardware. That has implications for applications and the OS that are using that memory, so OS data structures should be restricted to the local, cache-coherent memory as much as possible.
For the FAM, there is a two-level allocation scheme that is arbitrated by a "librarian". It allocates books (8GB) and collects them into "shelves". The hardware protections provided by the NGMI firewall are done on book boundaries. A shelf could be a collection of books that are scattered all over the FAM in a single load-store domain (LSD—not Packard's favorite acronym, he noted), which is defined by the firewall access rules. That shelf could then be handed to the OS to be used for a filesystem, for example. That might be ext4, some other Linux filesystem, or the new library filesystem (LFS) that the project is working on.
Talking to the memory in a shelf uses the POSIX API. A process does an open() on a shelf and then uses mmap() to map the memory into the process. Underneath, it uses the direct access (DAX) support to access the memory. For the first revision, LFS will not support sparse files. Also, locking will not be global throughout an LSD, but will be local to an OS running on a node.
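As a rough illustration of that flow, a process might map a shelf as in the minimal sketch below; the /lfs mount point and shelf name are invented for the example, but the open()/mmap() pattern is the standard DAX one described above:

```c
/* Minimal sketch: map a shelf exposed as a file on an LFS (DAX) mount
 * and store to it directly.  The mount point and shelf name here are
 * hypothetical. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = (size_t)8 << 30;          /* one 8GB book, for example */
    int fd = open("/lfs/my-shelf", O_RDWR);
    if (fd < 0)
        return 1;

    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return 1;

    /* Loads and stores now go straight to fabric-attached memory. */
    memcpy(base, "hello, FAM", 11);

    munmap(base, len);
    close(fd);
    return 0;
}
```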
For management of the FAM, each rack will have a "top of rack" management server, which is where the librarian will run. That is a fairly simple piece of code that just does bookkeeping and keeps track of the allocations in a SQLite database. The SoCs are the only parts of the system that can talk to the firewall controller, so other components communicate with a firewall proxy that runs in user space, which relays queries and updates. There are a "whole bunch of potential adventures" in getting the memory firewall pieces all working correctly, Packard said.
The lack of cache coherence makes atomic operations on the FAM problematic, as traditional atomics rely on that feature. So the project has added some hardware to the bridges to do atomic operations at that level. There is a fam_atomic library to access the operations (fetch and add, swap, compare and store, and read), which means that each operation is done at the cost of a system call. Once again, this is just the first implementation; other mechanisms may be added later. One important caveat is that the FAM atomic operations do not interact with the SoC cache, so applications will need to flush those caches as needed to ensure consistency.
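A sketch of what using such an interface could look like follows; the function names and prototypes are assumptions based on the operations listed above, not a copy of the library's exact API:

```c
/* Illustrative sketch only: the fam_atomic prototypes below are assumed
 * for the example.  The point is that each atomic operation goes through
 * a system call that acts on a mapped FAM location, bypassing the SoC
 * cache. */
#include <stdint.h>

/* Hypothetical prototypes for the sketch. */
extern int64_t fam_atomic_64_fetch_add(int64_t *addr, int64_t increment);
extern int64_t fam_atomic_64_compare_store(int64_t *addr,
                                           int64_t expected, int64_t desired);

/* Bump a shared counter living in fabric-attached memory. */
static int64_t bump_counter(int64_t *fam_counter)
{
    return fam_atomic_64_fetch_add(fam_counter, 1);
}

/* A simple lock built on compare-and-store; every spin is a system call,
 * and any data protected by the lock still has to be flushed from or
 * invalidated in the SoC cache separately. */
static void fam_lock(int64_t *lock)
{
    while (fam_atomic_64_compare_store(lock, 0, 1) != 0)
        ;   /* keep trying until the old value was 0 */
}
```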
Physical addresses at the SoC level can change, so there needs to be support for remapping those addresses. But the SoC caches and DAX both assume static physical mappings. A subset of the physical address space will be used as an aperture into the full address space of the system and books can be mapped into that aperture.
Flushing the SoC cache line by line would "take forever", so a way to flush the entire cache when the physical address mappings change has been added. In order to do that, two new functions have been added to the Intel persistent memory library (libpmem): one to check for the presence of non-coherent persistent memory (pmem_needs_invalidate()) and another to invalidate the CPU cache (pmem_invalidate()).
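A sketch of how those two calls might be used around a remap follows; their exact signatures are not given in the talk, so the prototypes below are assumptions:

```c
/* Sketch of using the proposed libpmem additions when a physical mapping
 * changes; the prototypes are assumptions based on the names given in
 * the talk. */
#include <stdbool.h>
#include <stddef.h>

/* Assumed prototypes for the two new calls described above. */
extern bool pmem_needs_invalidate(const void *addr, size_t len);
extern void pmem_invalidate(const void *addr, size_t len);

static void remap_book(void *aperture, size_t len)
{
    /* ... ask the management plane to map a different book into the
     *     aperture (details omitted) ... */

    /* On non-coherent persistent memory the old cache contents are now
     * stale, so drop them before touching the newly mapped book. */
    if (pmem_needs_invalidate(aperture, len))
        pmem_invalidate(aperture, len);
}
```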
In a system of this size, with the huge amounts of memory involved, there needs to be well-defined support for memory errors, Packard said. Read is easy—errors are simply signaled synchronously—but writes are trickier because the actual write is asynchronous. Applications need to know about the errors, though, so SIGBUS is used to signal an error. The pmem_drain() call will act as a barrier, such that errors in writes before that call will signal at or before the call. Any errors after the barrier will be signaled post-barrier.
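A minimal sketch of that error-reporting model is below; pmem_memcpy_nodrain() and pmem_drain() are existing libpmem calls, while the SIGBUS handler itself is only illustrative:

```c
/* Sketch: catching asynchronous write errors to fabric-attached memory.
 * Errors from the copy are signaled via SIGBUS at or before the
 * pmem_drain() barrier. */
#define _POSIX_C_SOURCE 200809L
#include <libpmem.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void bus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    (void)info;   /* info->si_addr identifies the failing address */
    /* Only async-signal-safe calls belong in a handler. */
    static const char msg[] = "write to fabric-attached memory failed\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

static void copy_with_error_barrier(void *fam_dst, const void *src, size_t len)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = bus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    pmem_memcpy_nodrain(fam_dst, src, len);  /* writes may fail later */
    pmem_drain();   /* barrier: errors from the writes above are
                     * signaled at or before this point */
}
```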
There are various areas where the team is working on free software, he said, including persistent memory and DAX. There is also ongoing work on concurrent/distributed filesystems and non-coherent cache management. Finally, reliability, availability, and serviceability (RAS) are quite important to the project, so free software work is proceeding in that area as well.
Even with two separate sessions, it was a bit of a whirlwind tour of The Machine. As he noted, it is an environment that is far removed from the desktop world Packard had previously worked in. By the sound of it, there are plenty of challenges to overcome before The Machine becomes a working computing device—it will be an interesting process to watch.
[I would like to thank the Linux Foundation for travel assistance to Seattle for LinuxCon North America.]
| Index entries for this article | |
|---|---|
| Conference | LinuxCon North America/2015 |
FPGAs are temporary implementation detail
Posted Aug 27, 2015 2:09 UTC (Thu) by michaelkjohnson (subscriber, #41438)
Posted Aug 27, 2015 2:19 UTC (Thu) by Paf (subscriber, #91811)
Perhaps the photonics are particularly special? The current generation of HPC interconnects (such as Aries from Cray, which is a few years old now) already use fibre for the longer distance runs... And they already do a lot of the memory stuff... I mean, you can address memory on remote nodes easily, it just has a cost. And consistency stuff gets complex (it can be enforced or not, basically, depending on the programming model and such).
This just sounds like a tightly integrated HPC style cluster with a fibre optic interconnect. Those exist already... Have I missed something really special here? I mean, sure, they're working on an OS for the whole machine, but some systems work that way already as well. SGI has multi thousand core machines that run a single OS image, if I remember right.
So does anyone see why this should be truly exciting?
Posted Aug 27, 2015 3:07 UTC (Thu) by keithp (subscriber, #5140)
A pure single-system-image multi-processor machine shares everything between nodes, and is cache coherent. A network cluster shares nothing between nodes and copies data around using a network.
The Machine is in-between these worlds; sharing 'something', in the form of a large fabric-attached memory pool. Each SoC runs a separate operating system out of a small amount of local memory, and then accesses the large pool of shared memory purely as a data store, not part of the main machine memory. So, it looks like a cluster in many ways, with an extremely fast memory-mapped peripheral.
The first instance we're building holds 320TB of memory (80 nodes of 4TB each), which is larger than ARM or x86 can address, so we *can't* make it look like a single flat address space, even if we wanted to.
Memristor will make the store persistent, silicon photonics will allow us to build a far denser system. What we're building today is designed to help get the rest of the hardware and software ecosystem ready for that.
Posted Aug 27, 2015 14:09 UTC (Thu) by Paf (subscriber, #91811)
I don't actually know how current HPC systems in global shared memory mode handle the addressing. I can think of a few ways, but none of them are the OS directly. (It's a little like a peripheral, as you describe memory in The Machine.)
As a distributed file system developer, I'm particularly interested in persistent memory, especially in a distributed access model like the one you describe.
Thanks for the reply and good luck!
Posted Aug 27, 2015 14:10 UTC (Thu) by Paf (subscriber, #91811)
Posted Aug 27, 2015 17:30 UTC (Thu) by smoogen (subscriber, #97)
Posted Aug 27, 2015 17:48 UTC (Thu) by gioele (subscriber, #61675)
In addition I wonder in how many places one can already find the trick "I will use these extra 16 bits for my own business" based on the justification that "all architectures discard the top 16 bits anyway".
Posted Aug 27, 2015 18:01 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
Posted Aug 27, 2015 18:13 UTC (Thu) by keithp (subscriber, #5140)
Posted Aug 30, 2015 1:21 UTC (Sun) by mathstuf (subscriber, #69389)
Posted Aug 30, 2015 22:54 UTC (Sun) by keithp (subscriber, #5140)
Just give us enough bits to address every particle in the universe and we'll be happy, right?
Posted Aug 30, 2015 23:15 UTC (Sun) by mathstuf (subscriber, #69389)
Looking back at the slides[1], it appears that there is virtual memory addressing, but there is only one address space. This allows them to put the TLB behind the cache (slide 52). Not sure if that would change things for The Machine.
Posted Sep 4, 2015 18:39 UTC (Fri) by adler187 (guest, #80400)
The MI architecture has special instructions for manipulating pointers, which prevent casting back and forth between int and pointer; if you try it, the tag bit will become unset by the hardware and attempting to dereference the pointer will cause a segmentation fault. This is similar to the CHERI CPU: https://lwn.net/Articles/604298/, though CHERI differs in that it has pointers and "capabilities" and you have to opt in to capabilities, so applications that assume pointer == long continue to work, but you can get better protection by using capabilities. On IBM i there is no such luck; you have to update your application.
Posted Sep 11, 2015 12:12 UTC (Fri) by mathstuf (subscriber, #69389)
Posted Sep 3, 2015 7:04 UTC (Thu) by oldtomas (guest, #72579)
For a while now, tagging implementations have tended to (mis)use some lower bits, based on the realisation that a pointer "to something useful" is divisible by four (which gives you two bits) or even by eight (which gives you three).
Guile Scheme is an example, but surely not the only one.
Perhaps the time has come to address memory in bigger chunks than 8-bit bytes?
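A minimal sketch of that low-bit tagging trick, assuming 8-byte-aligned allocations (the tag values are arbitrary examples):

```c
/* Low-bit pointer tagging: an 8-byte-aligned pointer always has its
 * three low bits clear, so they can carry a small type tag. */
#include <assert.h>
#include <stdint.h>

#define TAG_MASK   ((uintptr_t)0x7)   /* the three low bits */
#define TAG_INT    0x1                /* example tag values */
#define TAG_PAIR   0x2

static inline void *tag_ptr(void *p, uintptr_t tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0);   /* must be 8-byte aligned */
    return (void *)((uintptr_t)p | tag);
}

static inline uintptr_t get_tag(void *p)
{
    return (uintptr_t)p & TAG_MASK;
}

static inline void *strip_tag(void *p)
{
    return (void *)((uintptr_t)p & ~TAG_MASK);
}
```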
Posted Aug 28, 2015 11:15 UTC (Fri) by justincormack (subscriber, #70439)
tuple spaces
Posted Aug 28, 2015 9:14 UTC (Fri) by skitching (guest, #36856)
In that software design, applications running on a cluster of servers can "connect to a shared memory space" using a library. The library API provides some reasonably standard data structures (particularly key/value maps and queues).
In practice, the storage is usually distributed across the nodes belonging to the cluster (i.e. there is no "central server" with lots of memory). In the case of The Machine, perhaps tuple-spaces might be a useful way to access this central memory (which would _really_ be central rather than distributed), to reduce the problems related to coherency?
Posted Aug 28, 2015 17:30 UTC (Fri) by keithp (subscriber, #5140)
Posted Oct 19, 2015 11:11 UTC (Mon) by pabs (subscriber, #43278)
I want a stripped down non-expensive system
Posted Sep 7, 2015 0:30 UTC (Mon) by porton (guest, #94885)
I am probably not going to pay for the 320TB system (I don't need that much). How much is the minimal configuration expected to cost?