A look at The Machine
In what was perhaps one of the shortest keynotes on record (ten minutes), Keith Packard outlined the hardware architecture of "The Machine"—HP's ambitious new computing system. That keynote took place at LinuxCon North America in Seattle and was thankfully followed by an hour-long technical talk by Packard the following day (August 18), which looked at both the hardware and software for The Machine. It is, in many ways, a complete rethinking of the future of computers and computing, but there is a fairly long way to go between here and there.
The hardware
The basic idea of the hardware is straightforward. Many of the "usual computation units" (i.e. CPUs or systems on chip—SoCs) are connected to a "massive memory pool" using photonics for fast interconnects. That leads to something of an equation, he said: electrons (in CPUs) + photons (for communication) + ions (for memory storage) = computing. Today's computers transfer a lot of data and do so over "tiny little pipes". The Machine, instead, can address all of its "amazingly huge pile of memory" from each of its many compute elements. One of the underlying principles is to stop moving memory around to use it in computations—simply have it all available to any computer that needs it.
![Keith Packard](https://static.lwn.net/images/2015/lcna-packard-sm.jpg)
Some of the ideas for The Machine came from HP's DragonHawk systems, which were traditional symmetric multiprocessing systems, but packed a "lot of compute in a small space". DragonHawk systems would have 12TB of RAM in an 18U enclosure, while the nodes being built for The Machine will have 32TB of memory in 5U. It is, he said, a lot of memory and it will scale out linearly. All of the nodes will be connected at the memory level so that "every single processor can do a load or store instruction to access memory on any system".
Nodes in this giant cluster do not have to be homogeneous, as long as they are all hooked to the same memory interconnect. The first nodes that HP is building will be homogeneous, just for pragmatic reasons. There are two circuit boards on each node, one for storage and one for the computer. Connecting the two will be the "next generation memory interconnect" (NGMI), which will also connect both parts of the node to the rest of the system using photonics.
The compute part of the node will have a 64-bit ARM SoC with 256GB of purely local RAM along with a field-programmable gate array (FPGA) to implement the NGMI protocol. The storage part will have four banks of memory (each with 1TB), each with its own NGMI FPGA. A given SoC can access memory elsewhere without involving the SoC on the node where the memory resides—the NGMI bridge FPGAs will talk to their counterpart on the other node via the photonic interface. Those FPGAs will eventually be replaced by application-specific integrated circuits (ASICs) once the bugs are worked out.
ARM was chosen because it was easy to get ARM vendors to talk with the project, Packard said. There is no "religion" about the instruction set architecture (ISA), so others may be used down the road.
Eight of these nodes can be collected up into a 5U enclosure, which gives eight processors and 32TB of memory. Ten of those enclosures can then be placed into a rack (80 processors, 320TB) and multiple racks can all be connected on the same "fabric" to allow addressing up to 32 zettabytes (ZB) from each processor in the system.
The storage and compute portions of each node are powered separately. The compute piece has two 25Gb network interfaces that are capable of remote DMA. The storage piece will eventually use some kind of non-volatile/persistent storage (perhaps even the fabled memristor), but is using regular DRAM today, since it is available and can be used to prove the other parts of the design before switching.
SoCs in the system may be running more than one operating system (OS) and for more than one tenant, so there are some hardware protection mechanisms built into the system. In addition, the memory-controller FPGAs will encrypt the data at rest so that pulling a board will not give access to the contents of the memory even if it is cooled (à la cold boot) or when some kind of persistent storage is used.
At one time, someone said that 640KB of memory should be enough, Packard said, but now he is wrestling with the limits of the 48-bit addresses used by the 64-bit ARM and Intel CPUs. That only allows addressing up to 256TB, so memory will be accessed in 8GB "books" (or, sometimes, 64KB "booklettes"). Beyond the SoC, the NGMI bridge FPGA (which is also called the "Z bridge") deals with two different kinds of addresses: 53-bit logical Z addresses and 75-bit Z addresses. Those allow addressing 8PB and 32ZB respectively.
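Those limits are straightforward powers of two; a quick sanity check of the arithmetic (a minimal sketch using binary units, where 1TB is 2^40 bytes, 1PB is 2^50, and 1ZB is 2^70):

```c
/* Sanity-check the address-space sizes quoted above (binary units). */
#include <stdio.h>

int main(void)
{
    printf("48-bit: %llu TB\n", (1ULL << 48) >> 40);   /* 256 TB */
    printf("53-bit: %llu PB\n", (1ULL << 53) >> 50);   /* 8 PB   */
    /* 2^75 bytes overflows a 64-bit integer, so compute 2^(75-70) instead. */
    printf("75-bit: %llu ZB\n", 1ULL << (75 - 70));    /* 32 ZB  */
    return 0;
}
```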
The logical Z addresses are used by the NGMI firewall to determine the access rights to that memory for the local node. Those access controls are managed outside of whatever OS is running on the SoC. So the mapping of memory is handled by the OS, while the access controls for the memory are part of the management of The Machine system as a whole.
NGMI is not intended to be a proprietary fabric protocol, Packard said, and the project is trying to see if others are interested. A memory transaction on the fabric looks much like a cache access. The Z address is presented and 64 bytes are transferred.
The software
Packard's group is working on GPL operating systems for the system, but others can certainly be supported. If some "proprietary Washington company" wanted to port its OS to The Machine, it certainly could. Meanwhile, though, other groups are working on other free systems, but his group is made up of "GPL bigots" who are working on Linux for the system. There will not be a single OS (or even distribution or kernel) running on a given instance of The Machine—it is intended to support multiple different environments.
Probably the biggest hurdle for the software is that there is no cache coherence within the enormous memory pool. Each SoC has its own local memory (256GB) that is cache coherent, but accesses to the "fabric-attached memory" (FAM) between two processors are completely uncoordinated by hardware. That has implications for applications and the OS that are using that memory, so OS data structures should be restricted to the local, cache-coherent memory as much as possible.
For the FAM, there is a two-level allocation scheme that is arbitrated by a "librarian". It allocates books (8GB) and collects them into "shelves". The hardware protections provided by the NGMI firewall are done on book boundaries. A shelf could be a collection of books that are scattered all over the FAM in a single load-store domain (LSD—not Packard's favorite acronym, he noted), which is defined by the firewall access rules. That shelf could then be handed to the OS to be used for a filesystem, for example. That might be ext4, some other Linux filesystem, or the new library filesystem (LFS) that the project is working on.
Talking to the memory in a shelf uses the POSIX API. A process does an open() on a shelf and then uses mmap() to map the memory into the process. Underneath, it uses the direct access (DAX) support to access the memory. For the first revision, LFS will not support sparse files. Also, locking will not be global throughout an LSD, but will be local to an OS running on a node.
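As a rough illustration of that flow, a process might map a shelf as in the minimal sketch below; the /lfs mount point and shelf name are invented for the example, but the open()/mmap() pattern is the standard DAX one described above:

```c
/* Minimal sketch: map a shelf exposed as a file on an LFS (DAX) mount
 * and store to it directly.  The mount point and shelf name here are
 * hypothetical. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = (size_t)8 << 30;          /* one 8GB book, for example */
    int fd = open("/lfs/my-shelf", O_RDWR);
    if (fd < 0)
        return 1;

    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return 1;

    /* Loads and stores now go straight to fabric-attached memory. */
    memcpy(base, "hello, FAM", 11);

    munmap(base, len);
    close(fd);
    return 0;
}
```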
For management of the FAM, each rack will have a "top of rack" management server, which is where the librarian will run. That is a fairly simple piece of code that just does bookkeeping and keeps track of the allocations in a SQLite database. The SoCs are the only parts of the system that can talk to the firewall controller, so other components communicate with a firewall proxy that runs in user space, which relays queries and updates. There are a "whole bunch of potential adventures" in getting the memory firewall pieces all working correctly, Packard said.
The lack of cache coherence makes atomic operations on the FAM problematic, as traditional atomics rely on that feature. So the project has added some hardware to the bridges to do atomic operations at that level. There is a fam_atomic library to access the operations (fetch and add, swap, compare and store, and read), which means that each operation is done at the cost of a system call. Once again, this is just the first implementation; other mechanisms may be added later. One important caveat is that the FAM atomic operations do not interact with the SoC cache, so applications will need to flush those caches as needed to ensure consistency.
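A sketch of what using such an interface could look like follows; the function names and prototypes are assumptions based on the operations listed above, not a copy of the library's exact API:

```c
/* Illustrative sketch only: the fam_atomic prototypes below are assumed
 * for the example.  The point is that each atomic operation goes through
 * a system call that acts on a mapped FAM location, bypassing the SoC
 * cache. */
#include <stdint.h>

/* Hypothetical prototypes for the sketch. */
extern int64_t fam_atomic_64_fetch_add(int64_t *addr, int64_t increment);
extern int64_t fam_atomic_64_compare_store(int64_t *addr,
                                           int64_t expected, int64_t desired);

/* Bump a shared counter living in fabric-attached memory. */
static int64_t bump_counter(int64_t *fam_counter)
{
    return fam_atomic_64_fetch_add(fam_counter, 1);
}

/* A simple lock built on compare-and-store; every spin is a system call,
 * and any data protected by the lock still has to be flushed from or
 * invalidated in the SoC cache separately. */
static void fam_lock(int64_t *lock)
{
    while (fam_atomic_64_compare_store(lock, 0, 1) != 0)
        ;   /* keep trying until the old value was 0 */
}
```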
Physical addresses at the SoC level can change, so there needs to be support for remapping those addresses. But the SoC caches and DAX both assume static physical mappings. A subset of the physical address space will be used as an aperture into the full address space of the system and books can be mapped into that aperture.
Flushing the SoC cache line by line would "take forever", so a way to flush the entire cache when the physical address mappings change has been added. In order to do that, two new functions have been added to the Intel persistent memory library (libpmem): one to check for the presence of non-coherent persistent memory (pmem_needs_invalidate()) and another to invalidate the CPU cache (pmem_invalidate()).
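A sketch of how those two calls might be used around a remap follows; their exact signatures are not given in the talk, so the prototypes below are assumptions:

```c
/* Sketch of using the proposed libpmem additions when a physical mapping
 * changes; the prototypes are assumptions based on the names given in
 * the talk. */
#include <stdbool.h>
#include <stddef.h>

/* Assumed prototypes for the two new calls described above. */
extern bool pmem_needs_invalidate(const void *addr, size_t len);
extern void pmem_invalidate(const void *addr, size_t len);

static void remap_book(void *aperture, size_t len)
{
    /* ... ask the management plane to map a different book into the
     *     aperture (details omitted) ... */

    /* On non-coherent persistent memory the old cache contents are now
     * stale, so drop them before touching the newly mapped book. */
    if (pmem_needs_invalidate(aperture, len))
        pmem_invalidate(aperture, len);
}
```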
In a system of this size, with the huge amounts of memory involved, there needs to be well-defined support for memory errors, Packard said. Read is easy—errors are simply signaled synchronously—but writes are trickier because the actual write is asynchronous. Applications need to know about the errors, though, so SIGBUS is used to signal an error. The pmem_drain() call will act as a barrier, such that errors in writes before that call will signal at or before the call. Any errors after the barrier will be signaled post-barrier.
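A minimal sketch of that error-reporting model is below; pmem_memcpy_nodrain() and pmem_drain() are existing libpmem calls, while the SIGBUS handler itself is only illustrative:

```c
/* Sketch: catching asynchronous write errors to fabric-attached memory.
 * Errors from the copy are signaled via SIGBUS at or before the
 * pmem_drain() barrier. */
#define _POSIX_C_SOURCE 200809L
#include <libpmem.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void bus_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    (void)info;   /* info->si_addr identifies the failing address */
    /* Only async-signal-safe calls belong in a handler. */
    static const char msg[] = "write to fabric-attached memory failed\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
    _exit(1);
}

static void copy_with_error_barrier(void *fam_dst, const void *src, size_t len)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = bus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);

    pmem_memcpy_nodrain(fam_dst, src, len);  /* writes may fail later */
    pmem_drain();   /* barrier: errors from the writes above are
                     * signaled at or before this point */
}
```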
There are various areas where the team is working on free software, he said, including persistent memory and DAX. There is also ongoing work on concurrent/distributed filesystems and non-coherent cache management. Finally, reliability, availability, and serviceability (RAS) are quite important to the project, so free software work is proceeding in that area as well.
Even with two separate sessions, it was a bit of a whirlwind tour of The Machine. As he noted, it is an environment that is far removed from the desktop world Packard had previously worked in. By the sound of it, there are plenty of challenges to overcome before The Machine becomes a working computing device—it will be an interesting process to watch.
[I would like to thank the Linux Foundation for travel assistance to Seattle for LinuxCon North America.]
| Index entries for this article | |
|---|---|
| Conference | LinuxCon North America/2015 |
FPGAs are temporary implementation detail
Posted Aug 27, 2015 2:09 UTC (Thu) by michaelkjohnson (subscriber, #41438)
Posted Aug 27, 2015 2:19 UTC (Thu) by Paf (subscriber, #91811)
Perhaps the photonics are particularly special? The current generation of HPC interconnects (such as Aries from Cray, which is a few years old now) already use fibre for the longer distance runs... And they already do a lot of the memory stuff... I mean, you can address memory on remote nodes easily, it just has a cost. And consistency stuff gets complex (it can be enforced or not, basically, depending on the programming model and such).
This just sounds like a tightly integrated HPC style cluster with a fibre optic interconnect. Those exist already... Have I missed something really special here? I mean, sure, they're working on an OS for the whole machine, but some systems work that way already as well. SGI has multi thousand core machines that run a single OS image, if I remember right.
So does anyone see why this should be truly exciting?
Posted Aug 27, 2015 3:07 UTC (Thu) by keithp (subscriber, #5140)
A pure single-system-image multi-processor machine shares everything between nodes, and is cache coherent. A network cluster shares nothing between nodes and copies data around using a network.
The Machine is in-between these worlds; sharing 'something', in the form of a large fabric-attached memory pool. Each SoC runs a separate operating system out of a small amount of local memory, and then accesses the large pool of shared memory purely as a data store, not part of the main machine memory. So, it looks like a cluster in many ways, with an extremely fast memory-mapped peripheral.
The first instance we're building holds 320TB of memory (80 nodes of 4TB each), which is larger than ARM or x86 can address, so we *can't* make it look like a single flat address space, even if we wanted to.
Memristor will make the store persistent, silicon photonics will allow us to build a far denser system. What we're building today is designed to help get the rest of the hardware and software ecosystem ready for that.
Posted Aug 27, 2015 14:09 UTC (Thu) by Paf (subscriber, #91811)
I don't actually know how current HPC systems in global shared memory mode handle the addressing. I can think of a few ways, but none of them are the OS directly. (It's a little like a peripheral, as you describe memory in The Machine.)
As a distributed file system developer, I'm particularly interested in persistent memory, especially in a distributed access model like the one you describe.
Thanks for the reply and good luck!
Posted Aug 27, 2015 14:10 UTC (Thu) by Paf (subscriber, #91811)
Posted Aug 27, 2015 17:30 UTC (Thu) by smoogen (subscriber, #97)
Posted Aug 27, 2015 17:48 UTC (Thu) by gioele (subscriber, #61675)
In addition I wonder in how many places one can already find the trick "I will use these extra 16 bits for my own business" based on the justification that "all architectures discard the top 16 bits anyway".
Posted Aug 27, 2015 18:01 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)
Posted Aug 27, 2015 18:13 UTC (Thu) by keithp (subscriber, #5140)
Posted Aug 30, 2015 1:21 UTC (Sun) by mathstuf (subscriber, #69389)
Posted Aug 30, 2015 22:54 UTC (Sun) by keithp (subscriber, #5140)
Just give us enough bits to address every particle in the universe and we'll be happy, right?
Posted Aug 30, 2015 23:15 UTC (Sun) by mathstuf (subscriber, #69389)
Looking back at the slides[1], it appears that there is virtual memory addressing, but there is only one address space. This allows them to put the TLB behind the cache (slide 52). Not sure if that would change things for The Machine.
Posted Sep 4, 2015 18:39 UTC (Fri) by adler187 (guest, #80400)
The MI architecture has special instructions for manipulating pointers, which prevent casting back and forth between int and pointer; if you try it, the tag bit will become unset by the hardware and attempting to dereference the pointer will cause a segmentation fault. This is similar to the CHERI CPU: https://lwn.net/Articles/604298/, though CHERI differs in that it has pointers and "capabilities" and you have to opt in to capabilities, so applications that assume pointer == long continue to work, but you can get better protection by using capabilities. On IBM i there is no such luck; you have to update your application.
Posted Sep 11, 2015 12:12 UTC (Fri) by mathstuf (subscriber, #69389)
Posted Sep 3, 2015 7:04 UTC (Thu) by oldtomas (guest, #72579)
For a while now, tagging implementations have tended to (mis)use some lower bits, based on the realisation that a pointer "to something useful" is divisible by four (which gives you two bits) or even by eight (which gives you three).
Guile Scheme is an example, but surely not the only one.
Perhaps the time has come to address memory in bigger chunks than 8-bit bytes?
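A minimal sketch of that low-bit tagging trick, assuming 8-byte-aligned allocations (the tag values are arbitrary examples):

```c
/* Low-bit pointer tagging: an 8-byte-aligned pointer always has its
 * three low bits clear, so they can carry a small type tag. */
#include <assert.h>
#include <stdint.h>

#define TAG_MASK   ((uintptr_t)0x7)   /* the three low bits */
#define TAG_INT    0x1                /* example tag values */
#define TAG_PAIR   0x2

static inline void *tag_ptr(void *p, uintptr_t tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0);   /* must be 8-byte aligned */
    return (void *)((uintptr_t)p | tag);
}

static inline uintptr_t get_tag(void *p)
{
    return (uintptr_t)p & TAG_MASK;
}

static inline void *strip_tag(void *p)
{
    return (void *)((uintptr_t)p & ~TAG_MASK);
}
```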
Posted Aug 28, 2015 11:15 UTC (Fri) by justincormack (subscriber, #70439)
tuple spaces
Posted Aug 28, 2015 9:14 UTC (Fri) by skitching (guest, #36856)
In that software design, applications running on a cluster of servers can "connect to a shared memory space" using a library. The library API provides some reasonably standard data structures (particularly key/value maps and queues).
In practice, the storage is usually distributed across the nodes belonging to the cluster (i.e. there is no "central server" with lots of memory). In the case of The Machine, perhaps tuple-spaces might be a useful way to access this central memory (which would _really_ be central rather than distributed), to reduce the problems related to coherency?
Posted Aug 28, 2015 17:30 UTC (Fri) by keithp (subscriber, #5140)
Posted Oct 19, 2015 11:11 UTC (Mon) by pabs (subscriber, #43278)
I want a stripped down non-expensive system
Posted Sep 7, 2015 0:30 UTC (Mon) by porton (guest, #94885)
I am probably not going to pay for the 320TB system (I don't need that much). How much is the minimal configuration expected to cost?