The Machine: Controlling storage with a filesystem
Packard described The Machine as "memory-driven computing". Most computing systems are built around the CPU, but, on The Machine, the focus is on data instead. The architecture of The Machine is not aimed at providing fast processors; instead, it is built around connecting commodity CPUs to vast amounts of user data, and providing "load-and-store" access (i.e. all of the data is addressable like main memory). Traditional Linux systems route data access through filesystems and disk drivers; The Machine will remove those layers and connect applications directly to their data.
The CPUs in The Machine are ARM64 processors — reasonably fast, but far from the strongest CPUs out there. They have "a little bit of DRAM, about a quarter terabyte" directly attached to them. But the core of The Machine is a large array of persistent memory attached to each CPU. Those memory arrays are then connected into a fabric that makes all of the memory available to all CPUs in the system. This fabric is built around the "Gen-Z" memory interconnect, a now-open project complete with its own consortium to further its development.
One of the key features of The Machine is that security is built in at every level; its components do not trust each other. So it is not possible to just allow every CPU to access any part of the memory fabric; instead, there must be a layer of protection between the CPUs and that fabric. The Machine needs a way to name portions of memory, to perform access control so that each processor can reach only the memory it is allowed to, and so on.
Packard said that there is a well-established mechanism for doing this sort of work — the filesystem. The designers of The Machine had sketched out an API for the management of the system's memory, but he realized quickly that this API looked an awful lot like POSIX. So, rather than create a management module with a one-off interface, he set out to make a filesystem interface instead.
The memory in the fabric is managed in 8GB units called "books"; that is the minimum granularity in which memory can be assigned to any given processor. The plans for The Machine call for over 300TB of installed memory, which is a fair amount, but that is only about 40,000 books, which is not a huge number to keep track of. That simplifies the management of memory considerably, since there is no need for complex data structures to get the required level of performance.
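Those numbers are easy enough to check with a few lines of Python (the language the management code itself is written in); the figures below are illustrative, taking 320TB as a round number for the installed memory:

    # Rough book count for a fully populated system; the exact total is
    # illustrative, not from any published specification.
    BOOK_SIZE = 8 * 2**30        # one book: 8GB
    TOTAL_MEMORY = 320 * 2**40   # a bit over 300TB of fabric-attached memory

    books = TOTAL_MEMORY // BOOK_SIZE
    print(f"{books:,} books to track")   # prints: 40,960 books to track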
That management is not actually done by The Machine at all; instead, there is a "librarian" system — a standard x86 server — connected via Ethernet to do that job. The librarian handles only metadata; it never sees the actual data and has no access to it. Within The Machine, the Library Filesystem (LFS) implements a view of the memory fabric and provides an interface by which fabric-attached memory can be managed. The filesystem itself is implemented as a FUSE (filesystem in user space) module using a "hacked-up" version of the kernel's FUSE interface.
The Machine is all about high-performance access to storage, so some might find it amusing that the LFS implementation and the librarian are both written in Python; they communicate via JSON-formatted messages. That may not seem like a recipe for performance, but there are only 40,000 books to manage, so a faster implementation is not needed. The magic is in the fabric interconnect, not the metadata management system.
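The wire protocol between LFS and the librarian was not spelled out in the talk; the hypothetical exchange below (all field names invented) is only meant to illustrate the kind of metadata-only, JSON-formatted messages involved:

    import json

    # LFS asks the librarian to back a file with some books (hypothetical format).
    request = json.dumps({
        "command": "allocate",
        "path": "/lfs/scratch/data0",   # the file being backed
        "books": 4,                     # 4 books = 32GB
    })

    # The librarian replies with metadata only (which books were assigned);
    # it never touches the data stored in those books.
    reply = json.loads('{"status": "ok", "books": [1021, 1022, 2047, 2048]}')
    assert reply["status"] == "ok"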
The system's security model requires that no CPU be given access to memory for which it is not authorized, so the fabric needs to have access control and a firewall built in. The original intention was for this functionality to be implemented outside of the processors themselves (since the point was to protect the memory from the processors) but, in the real world, it is CPU-based. This privileged code is implemented using the ARM TrustZone security system; Packard allowed that making that setup work was not the most fun part of this job.
The core functionality of LFS is to manage access to 8GB books of memory, combining those books into larger virtual storage devices as needed. This is, of course, functionality that is also provided by the existing device-mapper code and the LVM2 tools. But, he said, the actual management of the books is trivial; there is no real need for the device mapper. And if this functionality is exposed by a filesystem, then ordinary shell commands can be used to perform the needed operations.
Thus, for example, a simple touch command will create a volume set; there is no need for lvcreate. An invocation of fallocate can allocate books for this volume set. And so on. In each case, he said, the normal shell commands are far simpler and easier to use than the associated LVM2 commands, which he can never remember without rereading the man pages first. The result is a simple and scriptable interface to memory management within The Machine.
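The same workflow is just as reachable from Python as from the shell; the sketch below assumes an LFS mount at /lfs and a hypothetical file name, and simply mirrors what touch and fallocate do:

    import os

    # Create a volume set (the equivalent of "touch /lfs/myvolume") ...
    path = "/lfs/myvolume"
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # ... then ask for books to back it, as fallocate would;
        # 16GB here means two 8GB books.
        os.posix_fallocate(fd, 0, 16 * 2**30)
    finally:
        os.close(fd)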
There is one challenge, though, in that applications often want control over where books of memory are allocated. The nature of the memory fabric necessarily means that there will be performance differences between local and remote memory, even if the remote access has been made as fast as possible. A common requirement is to spread the books across all CPUs on the system so that all CPUs see approximately equal performance overall. An application may also want to explicitly allocate memory on two separate nodes, running on separate power supplies, and mirror its data across the two; that way, the data remains available even if part of the system goes down. Other allocation policies can be easily imagined as well.
Maintaining the filesystem abstraction requires finding a way to express these policies in the filesystem interface; in this case, that was done by storing allocation policies as extended attributes. Once again, that makes it easy to manipulate these policies in scripts. The actual allocations obtained (which could differ from the requested policies) are also available as extended attributes. One big advantage of this approach is that those attributes will be saved in backups of the system, and will be preserved should there be a need to restore from the backups.
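As a concrete illustration, those policies can be read and written with the ordinary extended-attribute calls; the attribute names and policy values below are invented for the example, since the real LFS namespace was not given in the talk:

    import os

    path = "/lfs/myvolume"

    # Request that books for this file be spread across all nodes
    # (hypothetical attribute name and value).
    os.setxattr(path, "user.lfs.allocation_policy", b"interleave")

    # The placement actually obtained, which may differ from the request,
    # is reported back through another attribute.
    print(os.getxattr(path, "user.lfs.allocation_obtained"))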
There is a cost to not using the device mapper, of course. For example, it is not possible to create RAID arrays. But storage in a RAID array cannot be made directly accessible to the CPU via the DAX subsystem (since there would be no way for the kernel to intervene and do the RAID management). Direct access to all storage is one of the key design features of The Machine, so using RAID is out of the picture — and not all that useful with fabric-attached memory in the first place. Avoiding the device mapper also sacrifices the ability to spread storage across multiple types of devices, but there is only one type of storage in The Machine, so that also matters little.
Instead, by using the filesystem interface (along with some trickery to implement a specialized block device in the same module), The Machine is able to map all of that storage directly into the processors' address spaces, and all of the management can be done with simple shell tools. That may well prove to be a winning combination — assuming The Machine ever becomes a commercial product available outside of HPE.
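What that direct mapping looks like from an application's point of view can be sketched in a few lines; the path and size here are, again, illustrative:

    import mmap
    import os

    # Map a book-backed file straight into the process's address space and
    # store into it with ordinary memory operations; no read() or write()
    # calls are involved on the data path.
    fd = os.open("/lfs/myvolume", os.O_RDWR)
    try:
        with mmap.mmap(fd, 8 * 2**30) as mem:   # map one 8GB book
            mem[0:5] = b"hello"
    finally:
        os.close(fd)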
For those wanting to play with the code, including a virtualization-based emulation system, it can all be found in the FabricAttachedMemory GitHub repository.
[Your editor would like to thank linux.conf.au and the Linux Foundation for assisting with his travel to the event.]
Index entries for this article
Conference: linux.conf.au/2017
Posted Jan 26, 2017 11:41 UTC (Thu) by Wol (subscriber, #4433)
There really is nothing new under the sun ... :-)
Reading this my first reaction was "Pick OS on a mainframe" - low powered CPUs (the mainframe bit) using permanent backing as if it was addressable memory (Pick).
Okay there's a heck of a lot more there behind this new stuff, but the basic concepts haven't changed.
Cheers,
Wol