A look at The Machine

Posted Aug 27, 2015 2:19 UTC (Thu) by Paf (subscriber, #91811)
Parent article: A look at The Machine

This... doesn't sound that special from an HPC insider's perspective (I work at a major supercomputer company). To me, the persistent memory was the highlight feature of The Machine in the original proposals, and the thing which really justified new thinking.

Perhaps the photonics are particularly special? The current generation of HPC interconnects (such as Aries from Cray, which is a few years old now) already uses fibre for the longer-distance runs... And those systems already do a lot of the memory stuff... I mean, you can address memory on remote nodes easily, it just has a cost. And consistency gets complex (it can be enforced or not, basically, depending on the programming model and such).

This just sounds like a tightly integrated HPC-style cluster with a fibre optic interconnect. Those exist already... Have I missed something really special here? I mean, sure, they're working on an OS for the whole machine, but some systems work that way already as well. SGI has multi-thousand-core machines that run a single OS image, if I remember right.

So does anyone see why this should be truly exciting?

A look at The Machine

Posted Aug 27, 2015 3:07 UTC (Thu) by keithp (subscriber, #5140) [Link] (16 responses)

It is different from other HPC systems in some important ways.

A pure single-system-image multi-processor machine shares everything between nodes, and is cache coherent. A network cluster shares nothing between nodes and copies data around using a network.

The Machine is in between these worlds, sharing 'something' in the form of a large fabric-attached memory pool. Each SoC runs a separate operating system out of a small amount of local memory, and then accesses the large pool of shared memory purely as a data store, not as part of the main machine memory. So, it looks like a cluster in many ways, with an extremely fast memory-mapped peripheral.
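As a rough illustration of "a shared pool used purely as a data store", here is a minimal sketch in Python. Everything in it is an assumption for illustration: an ordinary file stands in for the fabric-attached memory, two mappings of it stand in for two SoCs, and the names `open_pool` and `demo_shared_store` are hypothetical. It says nothing about how The Machine actually exposes its memory.

```python
import mmap
import os
import struct
import tempfile

POOL_SIZE = 4096  # hypothetical pool size; a stand-in, not a real figure


def open_pool(path, size=POOL_SIZE):
    """Map a file that stands in for the shared memory pool."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        os.ftruncate(fd, size)      # make sure the backing file is large enough
        return mmap.mmap(fd, size)  # shared, read/write mapping by default
    finally:
        os.close(fd)                # the mapping keeps its own handle


def demo_shared_store():
    """Write through one mapping and read the value back through another."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pool")
        node_a = open_pool(path)    # "node A" view of the pool
        node_b = open_pool(path)    # "node B" view of the same pool
        try:
            struct.pack_into("<Q", node_a, 0, 42)        # A stores a 64-bit value
            return struct.unpack_from("<Q", node_b, 0)[0]  # B loads it
        finally:
            node_a.close()
            node_b.close()


if __name__ == "__main__":
    print(demo_shared_store())
```

Each "node" keeps its own private state (ordinary Python objects here) and touches the pool only through explicit loads and stores at offsets, which is the cluster-plus-peripheral shape described above.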

The first instance we're building holds 320TB of memory (80 nodes of 4TB each), which is larger than ARM or x86 can address, so we *can't* make it look like a single flat address space, even if we wanted to.
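The arithmetic behind that claim can be checked quickly. This is a sketch under two assumptions that are not stated in the comment itself: decimal terabytes, and a 48-bit address width (the "48 number" that comes up elsewhere in this thread).

```python
TB = 10**12  # assumption: decimal terabytes


def pool_bytes(nodes, tb_per_node):
    """Total size of the shared pool in bytes."""
    return nodes * tb_per_node * TB


def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


def bits_needed(size_bytes):
    """Smallest address width that can name every byte of size_bytes."""
    return (size_bytes - 1).bit_length()


pool = pool_bytes(80, 4)                 # 80 nodes of 4TB each
print(pool > addressable_bytes(48))      # does the pool exceed a 48-bit space?
print(bits_needed(pool))                 # address bits needed for the pool
```

A 48-bit space covers 2^48 bytes, roughly 281 TB, so a 320TB pool needs at least one more address bit.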

Memristor will make the store persistent, silicon photonics will allow us to build a far denser system. What we're building today is designed to help get the rest of the hardware and software ecosystem ready for that.

A look at The Machine

Posted Aug 27, 2015 14:09 UTC (Thu) by Paf (subscriber, #91811) [Link] (1 responses)

Ahh, ok. It sounds like a fantastic project...

I don't actually know how current HPC systems in global shared memory mode handle the addressing. I can think of a few ways, but none of them are the OS directly. (It's a little like a peripheral, as you describe memory in The Machine.)

As a distributed file system developer, I'm particularly interested in persistent memory, especially in a distributed access model like the one you describe.

Thanks for the reply and good luck!

A look at The Machine

Posted Aug 27, 2015 14:10 UTC (Thu) by Paf (subscriber, #91811) [Link]

*by "the OS directly", I mean they're not using normal memory addressing via the kernel. There are multi-PB systems in this mode, so they must be using another approach.

A look at The Machine

Posted Aug 27, 2015 17:30 UTC (Thu) by smoogen (subscriber, #97) [Link] (10 responses)

I wonder how long until this pushes the need for 128-bit CPUs (with, say, a 96-bit MMU), or for just moving the 48 number higher in 64-bit CPUs, since it seems to be more of an "at this point in time we don't see how having it higher would be helpful in our designing this..." type of decision.

A look at The Machine

Posted Aug 27, 2015 17:48 UTC (Thu) by gioele (subscriber, #61675) [Link] (8 responses)

> I wonder how long until this pushes the need for 128-bit CPUs (with, say, a 96-bit MMU), or for just moving the 48 number higher in 64-bit CPUs, since it seems to be more of an "at this point in time we don't see how having it higher would be helpful in our designing this..." type of decision.

In addition I wonder in how many places one can already find the trick "I will use these extra 16 bits for my own business" based on the justification that "all architectures discard the top 16 bits anyway".

A look at The Machine

Posted Aug 27, 2015 18:01 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

Quite a few... One project I know of uses the upper 8 bits of pointers as an object type tag to help the GC.
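For readers who haven't seen the trick, a minimal sketch of top-byte tagging follows, with Python integers standing in for 64-bit pointers. The names `tag_pointer` and `split_pointer` and the tag value are hypothetical; this is not the code of the project mentioned above.

```python
TAG_SHIFT = 56                    # the tag lives in bits 63..56
ADDR_MASK = (1 << TAG_SHIFT) - 1  # the low 56 bits hold the address


def tag_pointer(addr, tag):
    """Pack an 8-bit type tag into the top byte of a 64-bit word."""
    if not 0 <= tag < 256:
        raise ValueError("tag must fit in 8 bits")
    if addr < 0 or addr & ~ADDR_MASK:
        raise ValueError("address overlaps the tag bits")
    return (tag << TAG_SHIFT) | addr


def split_pointer(word):
    """Return (tag, address); the address must be used, never the raw word."""
    return word >> TAG_SHIFT, word & ADDR_MASK


word = tag_pointer(0x7F1234567000, 0x2A)
print(hex(word), split_pointer(word))
```

The scheme only works while real addresses never need those top bits, which is exactly the assumption a larger address space would break.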

A look at The Machine

Posted Aug 27, 2015 18:13 UTC (Thu) by keithp (subscriber, #5140) [Link]

Fortunately, x86-64 requires that the top 16 bits of the virtual address all be the same (either all 1 or all 0, matching bit 47), and http://support.amd.com/TechDocs/24592.pdf says that a general-protection or stack fault is generated when this isn't true. So, at least we're somewhat protected from applications doing "bad" things by hiding data up in those bits, unless they're masking them out before use. We've been here before several times, and it looks like AMD took those lessons to heart.
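The rule is easy to state in code. This is a sketch of the check only, assuming 48 implemented virtual address bits; the function name `is_canonical` is made up for illustration.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    if not 0 <= addr < 2**64:
        raise ValueError("expected a 64-bit unsigned value")
    upper = addr >> (va_bits - 1)                # bit 47 and everything above
    all_ones = (1 << (64 - va_bits + 1)) - 1     # 17 one-bits when va_bits=48
    return upper == 0 or upper == all_ones


print(is_canonical(0x00007FFFFFFFFFFF))  # top of the lower half
print(is_canonical(0xFFFF800000000000))  # bottom of the upper half
print(is_canonical(0x2A007F1234567000))  # data hidden in the top byte
```

A word with a tag stuffed into the top byte fails the check, so software using such tags has to mask them off before every dereference.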

A look at The Machine

Posted Aug 30, 2015 1:21 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (4 responses)

Ooh, Mill is going to shake things up quite a bit here. Four reserved bits: one for use in implementing fork (apparently used to flag relative versus absolute... which hints that maybe the hardware representation is all relative addressing) and three for use by garbage collectors. The sixty remaining bits address the (flat) memory space. Casting between pointer and int is an instruction (which can presumably return NaR) to make sure you don't muck with the bits.

A look at The Machine

Posted Aug 30, 2015 22:54 UTC (Sun) by keithp (subscriber, #5140) [Link] (1 responses)

At least limited virtual address space is cheaper to hack around than limited physical address space. Swapping virtual to physical mappings is something the cache system understands, while swapping physical to device mappings makes the cache system very unhappy.

Just give us enough bits to address every particle in the universe and we'll be happy, right?

A look at The Machine

Posted Aug 30, 2015 23:15 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

Well, with pointers being distinct entities from an integer representation, I imagine one could spec a Mill processor with 128 bit pointers so you'd have 124 (or 120 to expand the GC bits some) bits of address.

Looking back at the slides[1], it appears that there is virtual memory addressing, but there is only one address space. This allows them to put the TLB behind the cache (slide 52). Not sure if that would change things for The Machine.

[1] http://millcomputing.com/docs/memory/

A look at The Machine

Posted Sep 4, 2015 18:39 UTC (Fri) by adler187 (guest, #80400) [Link] (1 responses)

The IBM i (née AS/400, System i, etc.) MI architecture has been using 128-bit pointers since the 80s and does similar things to distinguish pointers. There are many different pointer types. One is a Space Pointer, which points to an MI space object (a memory region of up to 16MiB minus one page) that can be used for storing data. The bottom 64 bits are the actual 64-bit machine pointer, and the top 64 bits are 0 except for the top bit, which is used to indicate that it's a space pointer. If the machine ever goes to a CPU with a larger address space (as already happened in the 90s, moving from 48-bit custom CISC to 64-bit POWER), more bits can be used for the address (I don't know exactly how many bits are reserved for type distinction, though).
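To make the layout concrete, here is a small sketch that follows only the description in this comment (address in the bottom 64 bits, a single flag in the very top bit, zeros in between). It is not taken from IBM documentation, and the function names are hypothetical.

```python
SPACE_POINTER_FLAG = 1 << 127     # top bit of the 128-bit value
ADDRESS_MASK = (1 << 64) - 1      # bottom 64 bits: the machine pointer


def make_space_pointer(addr):
    """Build a 128-bit value from a 64-bit machine address."""
    if not 0 <= addr <= ADDRESS_MASK:
        raise ValueError("address must fit in 64 bits")
    return SPACE_POINTER_FLAG | addr


def is_space_pointer(value):
    """True if the top 64 bits are zero apart from the top (flag) bit."""
    return (value >> 64) == (1 << 63)


def address_of(value):
    """Extract the machine address, refusing anything not flagged."""
    if not is_space_pointer(value):
        raise ValueError("not a space pointer")
    return value & ADDRESS_MASK


p = make_space_pointer(0xDEADBEEF)
print(hex(p), hex(address_of(p)))
```

With this shape, widening the machine address later only means moving the boundary between the address field and the unused middle bits.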

The MI architecture has special instructions for manipulating pointers, which prevent casting back and forth between int and pointer. If you try it, the hardware unsets the tag bit, and attempting to dereference the pointer will cause a segmentation fault. This is similar to the CHERI CPU: https://lwn.net/Articles/604298/, though CHERI differs in that it has both pointers and "capabilities", and you have to opt in to capabilities. Applications that assume pointer == long continue to work, but you can get better protection by using capabilities. On IBM i there is no such luck: you have to update your application.

A look at The Machine

Posted Sep 11, 2015 12:12 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

It's nice to know that such problems have been dealt with before. However the set of software which has been exposed is, I suspect, very different for the MI versus the Mill's targets.

A look at The Machine

Posted Sep 3, 2015 7:04 UTC (Thu) by oldtomas (guest, #72579) [Link]

> In addition I wonder in how many places one can already find the trick "I will use these extra 16 bits for my own business"

For a while now, tagging implementations have tended to (mis-)use some lower bits instead, out of the realisation that a pointer "to something useful" is divisible by four (which gives you two bits) or even by eight (which gives you three).

Guile Scheme is an example, but surely not the only one.
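The low-bit variant can be sketched like this, assuming 8-byte alignment and using Python integers as stand-in pointers. The names are illustrative, and this is not Guile's actual tagging scheme.

```python
ALIGNMENT = 8                          # assumption: objects are 8-byte aligned
TAG_BITS = ALIGNMENT.bit_length() - 1  # 3 free low bits
TAG_MASK = ALIGNMENT - 1               # 0b111


def tag_low(addr, tag):
    """Store a small tag in the always-zero low bits of an aligned address."""
    if addr % ALIGNMENT:
        raise ValueError("address is not aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in the free bits")
    return addr | tag


def untag_low(word):
    """Return (address, tag)."""
    return word & ~TAG_MASK, word & TAG_MASK


word = tag_low(0x1000, 5)
print(hex(word), untag_low(word))
```

Unlike top-bit tagging, this does not depend on how many address bits the hardware implements, only on the allocator's alignment guarantee.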

Perhaps time has come to address memory in bigger chunks than 8 bit bytes?

A look at The Machine

Posted Aug 28, 2015 11:15 UTC (Fri) by justincormack (subscriber, #70439) [Link]

RISC-V has a 128-bit instruction set version, for just this eventuality. I think we will see full 64-bit physical addresses arrive fairly soon though, to stave off this type of segmentation.

tuple spaces

Posted Aug 28, 2015 9:14 UTC (Fri) by skitching (guest, #36856) [Link] (1 responses)

Somewhere between a multi-processor coherent system and a cluster? That sounds somewhat like the "tuple spaces" concept (https://en.wikipedia.org/wiki/Tuple_space).

In that software design, applications running on a cluster of servers can "connect to a shared memory space" using a library. The library api provides some reasonably standard data-structures (particularly key/value maps and queues).

In practice, the storage is usually distributed across the nodes belonging to the cluster (i.e. there is no "central server" with lots of memory). In the case of "the machine", perhaps tuple spaces might be a useful way to access this central memory (which would _really_ be central rather than distributed), to reduce the problems related to coherency?
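For anyone unfamiliar with the idea, a toy version of such a library API is sketched below. It is a single-process, non-blocking stand-in written for illustration: the class and method names loosely follow the classic Linda operations (out, rd, in) and are not the API of any particular product.

```python
class TupleSpace:
    """A toy, in-process tuple space; None in a pattern matches anything."""

    def __init__(self):
        self._tuples = []

    def out(self, tup):
        """Publish a tuple into the space."""
        self._tuples.append(tuple(tup))

    def _find(self, pattern):
        """Index of the first stored tuple matching the pattern, or None."""
        for i, tup in enumerate(self._tuples):
            if len(tup) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, tup)
            ):
                return i
        return None

    def rd(self, pattern):
        """Return a matching tuple without removing it (None if absent)."""
        i = self._find(pattern)
        return None if i is None else self._tuples[i]

    def take(self, pattern):
        """Remove and return a matching tuple (Linda's 'in'; None if absent)."""
        i = self._find(pattern)
        return None if i is None else self._tuples.pop(i)


space = TupleSpace()
space.out(("job", 1, "resize"))
space.out(("job", 2, "crop"))
print(space.rd(("job", None, "crop")))
print(space.take(("job", None, None)))
```

Because clients only ever publish, read, or atomically remove whole tuples, the coherency question shrinks to making those three operations safe, rather than keeping arbitrary loads and stores coherent.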

tuple spaces

Posted Aug 28, 2015 17:30 UTC (Fri) by keithp (subscriber, #5140) [Link]

Thanks for the pointer. We've got a bunch of people in HP Labs figuring out how to get data structures to work on this hardware; I will pass this along to them.

A look at The Machine

Posted Oct 19, 2015 11:11 UTC (Mon) by pabs (subscriber, #43278) [Link]

Have you looked at the 128-bit variant of the RISC-V architecture? An open ISA might be a better choice than ARM.