
UKUUG: Arnd Bergmann on interconnecting with PCIe

By Jake Edge
November 19, 2008

PCI Express (PCIe) is not normally thought of as a way to connect computers; it is a bus for attaching peripherals. There are advantages to using it as an interconnect, though. Kernel hacker Arnd Bergmann gave a presentation at the recent UKUUG Linux 2008 conference on the work he has been doing with PCIe for IBM, outlining the current state of Linux support as well as some plans for the future.

One major advantage is the availability of PCIe endpoints in much of the hardware in use today. By using PCIe instead of another interconnect such as InfiniBand, the same throughput can be achieved with lower latency and power consumption. Bergmann noted that leaving out a separate InfiniBand chip saves 10-30 watts per node, which adds up rather quickly on a 30,000-node supercomputer: somewhere between 300 and 900 kilowatts.

There are some downsides to PCIe as well. There is no security model, for example, so a root process on one machine can crash other connected machines. The PCIe root port is also a single point of failure: if it goes down, it takes the network with it or, as Bergmann put it, "if anything goes wrong, the whole system goes down". PCIe lacks a standard high-level interface for Linux and there is no generic code shared between the various drivers, at least so far.

As an example of a system that uses PCIe, Bergmann described the "Roadrunner" supercomputer that is currently the fastest in existence. It is a cluster of hybrid nodes, called "Triblades", each of which has one Opteron blade along with two Cell blades. The nodes are connected with InfiniBand, but PCIe is used to communicate between the processors within each node by using the Opteron root port and PCIe endpoints on the Cells.

There is other hardware that uses PCIe in this way, including the Fixstars GigaAccel 180 accelerator board and an embedded PowerPC 440/460 system-on-a-chip (SoC) board, both of which use the same Axon PCIe device. Bergmann also talked about PCIe switches and non-transparent bridges that perform the same kinds of functions as networking switches and bridges. Bridges are called "non-transparent" because they have I/O remapping tables—sometimes IOMMUs—that can be addressed by the two root ports that are connected via the bridge. These bridges may also have DMA engines to facilitate data transfer without host processor control.

Bergmann then moved on to the software side of things, looking at the drivers available—and planned—to support connection via PCIe. The first driver was written by Mercury Computers in 2006 for a Cell accelerator board and is now "abandonware". It has many deficiencies and would take a lot of work to get it into shape for the mainline.

Another choice is the driver used in the Roadrunner Triblade and the GigaAccel device, which is loosely modeled on InfiniBand. Its interface uses custom ioctl() commands that implement just eight operations, as opposed to hundreds for InfiniBand. Even so, it is "enormous for a Linux device driver", weighing in at 13,000 lines of code.
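
For a sense of what such an interface looks like, a header for a small ioctl()-based command set of this kind might resemble the sketch below. The structure and command names are hypothetical, invented purely for illustration; they are not the actual Triblade interface.

    /* Hypothetical command set; the names and layout are invented, not
     * the actual Triblade driver interface. */
    #include <linux/ioctl.h>
    #include <linux/types.h>

    struct pcie_xfer {
        __u64 local_addr;       /* buffer in the caller's address space */
        __u64 remote_addr;      /* bus address on the peer side */
        __u64 len;              /* transfer length in bytes */
    };

    #define PCIE_EP_MAGIC           'P'
    #define PCIE_EP_MAP_MEM         _IOW(PCIE_EP_MAGIC, 1, struct pcie_xfer)
    #define PCIE_EP_UNMAP_MEM       _IOW(PCIE_EP_MAGIC, 2, struct pcie_xfer)
    #define PCIE_EP_DMA_READ        _IOW(PCIE_EP_MAGIC, 3, struct pcie_xfer)
    #define PCIE_EP_DMA_WRITE       _IOW(PCIE_EP_MAGIC, 4, struct pcie_xfer)
    #define PCIE_EP_SIGNAL_PEER     _IO(PCIE_EP_MAGIC, 5)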

The Triblade driver is not as portable as it could be; it is very specific to the Opteron and Cell architectures. On the Cell side, it is implemented as an Open Firmware driver, while the Opteron side is a PCIe driver. There is a lot of virtual ethernet code mixed in as well. Overall, it is not seen as the best way forward for supporting these kinds of devices in Linux.

Another approach was taken by a group of students sponsored by IBM who developed a virtual ethernet prototype to talk to an IBM BladeCenter from a workstation by way of a non-transparent bridge. Each side could access memory on the other by using ioremap() on one side and dma_map_single() on the other. By implementing a virtio driver, they did not have to write an ethernet driver, as the virtio abstraction provided that functionality. The driver was a bit slow, as it didn't use DMA, but it is a start down the road that Bergmann thinks should be taken.
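
A minimal sketch of that arrangement might look like the following, with placeholder device pointers, BAR number, and window size rather than anything taken from the students' code: on the root-port side the peer's memory BAR is mapped with ioremap(), while the endpoint side uses dma_map_single() to hand the peer a bus address for one of its own buffers.

    #include <linux/pci.h>
    #include <linux/io.h>
    #include <linux/dma-mapping.h>

    #define WINDOW_SIZE     (64 * 1024)     /* shared window size (placeholder) */

    /* Root-port side: map the endpoint's first memory BAR so that the
     * host can read and write the peer's memory directly. */
    static void __iomem *map_peer_window(struct pci_dev *pdev)
    {
        return ioremap(pci_resource_start(pdev, 0), WINDOW_SIZE);
    }

    /* Endpoint side: hand the peer a bus address for a local buffer;
     * callers should check the result with dma_mapping_error(). */
    static dma_addr_t export_local_buffer(struct device *dev, void *buf)
    {
        return dma_map_single(dev, buf, WINDOW_SIZE, DMA_BIDIRECTIONAL);
    }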

He went on to describe a "conceptual driver" for PCIe endpoints that is based on the students' work, but adds things like DMA support as well as additional virtio drivers. Adding a virtio block device would allow embedded devices to use hard disks over PCIe; implementing a Plan 9 filesystem (9pfs) virtio driver would allow individual files to be accessed directly over PCIe. All of this depends on using the virtio abstraction.

Virtio is seen as a useful layer in the driver because it is a standard abstraction for "doing something when you aren't limited by hardware". Networking, block device, and filesystem "hosts" are all implemented atop virtio drivers, which makes them available fairly easily. One problem area, though, is the runtime configuration piece. The problem there is "not in coming up with something that works, but something that will also work in the future".
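
Plugging a new transport underneath virtio mostly amounts to providing a virtio_device backed by a set of virtio_config_ops and registering it with the core; the existing network, block, and 9p drivers then bind to it unchanged. The skeleton below only illustrates that shape; the PCIe-specific structure and the empty config operations are placeholders, not code from any real driver.

    #include <linux/virtio.h>
    #include <linux/virtio_config.h>
    #include <linux/virtio_ids.h>

    /* Skeleton of a hypothetical PCIe-endpoint virtio transport; the
     * structure and naming are invented for illustration. */
    struct pcie_ep_vdev {
        struct virtio_device vdev;
        void __iomem *regs;     /* MMIO window to the peer (placeholder) */
    };

    static const struct virtio_config_ops pcie_ep_config_ops = {
        /* .get, .set, .get_status, .set_status, .reset, and the
         * virtqueue setup hooks would be filled in here, backed by
         * the MMIO window or a DMA engine. */
    };

    static int pcie_ep_register(struct device *parent, struct pcie_ep_vdev *ep)
    {
        ep->vdev.dev.parent = parent;
        ep->vdev.id.device = VIRTIO_ID_NET;     /* appear as virtual ethernet */
        ep->vdev.config = &pcie_ep_config_ops;

        /* The virtio core then binds virtio_net to this device. */
        return register_virtio_device(&ep->vdev);
    }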

Replacing the ioctl() interface with the InfiniBand verbs (ibverb) interface is planned. The ibverb interface may not be the best choice in an abstract sense, but it exists and supports OpenMPI, so the new driver should implement it as well.
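
To give a flavor of what implementing ibverbs entails, the snippet below walks through the first few libibverbs calls that an MPI stack makes at startup; a PCIe endpoint driver exposing the verbs interface would need to back operations like these. This is plain, generic libibverbs usage (build with -libverbs), not code from the planned driver.

    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        struct ibv_context *ctx;
        struct ibv_pd *pd;
        struct ibv_cq *cq;

        if (!devs || num == 0) {
            fprintf(stderr, "no verbs-capable devices found\n");
            return 1;
        }

        ctx = ibv_open_device(devs[0]);             /* open the first device */
        if (!ctx)
            return 1;
        pd = ibv_alloc_pd(ctx);                     /* protection domain */
        cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* 16-entry completion queue */

        printf("using device %s\n", ibv_get_device_name(devs[0]));

        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }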

Two types of virtqueue implementation are envisioned: one based on memory-mapped I/O (MMIO) and another based on DMA. The MMIO version would be the most basic: the sender writes into memory on the remote side and the receiver reads the data locally, because read access over PCIe is much slower than write access; a read must flush all outstanding writes and then wait for the data to arrive. Data and signaling information would have separate areas, so that data-ordering guarantees can be relaxed on the data area for better performance, while strict ordering is kept for the signaling area.
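
One way to get those two ordering domains on Linux is to map the data area write-combined and the signaling area with a normal, strongly ordered ioremap(). In the sketch below the offsets and sizes are invented; the point is only the split between the two mappings, and a wmb() would still be needed before ringing the doorbell in order to flush the write-combining buffers.

    #include <linux/pci.h>
    #include <linux/io.h>

    /* Invented layout: a write-combined data area followed by a small,
     * strictly ordered signaling area at the end of the peer's BAR. */
    #define DATA_OFFSET     0x0
    #define DATA_SIZE       (60 * 1024)
    #define SIG_OFFSET      DATA_SIZE
    #define SIG_SIZE        (4 * 1024)

    struct mmio_vq {
        void __iomem *data;     /* relaxed ordering, bulk payload */
        void __iomem *sig;      /* strict ordering, ring indices and doorbell */
    };

    static int map_mmio_vq(struct pci_dev *pdev, struct mmio_vq *vq)
    {
        resource_size_t bar = pci_resource_start(pdev, 0);

        /* Write-combining lets the CPU merge payload writes... */
        vq->data = ioremap_wc(bar + DATA_OFFSET, DATA_SIZE);
        /* ...while the signaling area stays uncached; a wmb() before
         * writing here flushes the write-combined payload first. */
        vq->sig = ioremap(bar + SIG_OFFSET, SIG_SIZE);

        if (!vq->data || !vq->sig) {
            if (vq->data)
                iounmap(vq->data);
            if (vq->sig)
                iounmap(vq->sig);
            return -ENOMEM;
        }
        return 0;
    }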

The DMA-engine virtqueue implementation would be highly hardware-specific, in order to accommodate the performance characteristics and other limitations of the underlying engine. In some cases, for example, it is not worth setting up a DMA transfer for less than 2KB, so copying via MMIO should be used instead. DMA would be used for transferring payload data, but signaling would still be handled via MMIO. Bergmann noted that the kernel's DMA abstraction may not provide everything that is needed, so enhancements to that interface may be required as well.
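
A driver along these lines might sit on top of the kernel's generic dmaengine API and simply fall back to an MMIO copy below the threshold. In the sketch that follows, the buffers are assumed to be DMA-mapped already and completion handling is omitted; the 2KB cutoff comes from the talk, everything else is illustrative.

    #include <linux/dmaengine.h>
    #include <linux/io.h>

    #define DMA_THRESHOLD   2048    /* below this, an MMIO copy wins (per the talk) */

    /* Copy one payload buffer to the peer: DMA for large transfers,
     * MMIO copy for small ones. */
    static int send_payload(struct dma_chan *chan, void __iomem *mmio_dst,
                            const void *src, dma_addr_t dma_dst,
                            dma_addr_t dma_src, size_t len)
    {
        struct dma_async_tx_descriptor *desc;

        if (len < DMA_THRESHOLD) {
            memcpy_toio(mmio_dst, src, len);    /* small: just MMIO */
            return 0;
        }

        desc = dmaengine_prep_dma_memcpy(chan, dma_dst, dma_src, len, 0);
        if (!desc)
            return -EIO;

        dmaengine_submit(desc);
        dma_async_issue_pending(chan);          /* signaling still goes via MMIO */
        return 0;
    }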

Bergmann did not provide any kind of time frame in which this work might make its way into the kernel as it is a work in progress. There is much still to be done, but his presentation laid out a roadmap of where he thinks it is headed.

In a post-talk email exchange, Bergmann pointed to his triblade-2.6.27 branch for those interested in looking at the current state of affairs, while noting that it "is only mildly related to what I think we should be doing". He also mentioned a patch by Ira Snyder that implements virtual ethernet over PCI, which "is more likely to go into the kernel in the near future". Bergmann and Snyder have agreed to join forces down the road to add more functionality along the lines that were outlined in the talk.


Index entries for this article
Kernel: Device drivers
Kernel: PCIe
Conference: UKUUG Linux Conference/2008



mvXCell board available in Europe

Posted Nov 20, 2008 9:16 UTC (Thu) by arnd (subscriber, #8866)

I recently learned about the mvXCell from MATRIX VISION, which is another OEM version of the GigaAccel 180 and is now available in Europe as well. It runs the same device drivers, meaning that we're also targeting it with this work.
http://www.matrix-vision.com/products/cell/mvxcell.php?la...

UKUUG: Arnd Bergmann on interconnecting with PCIe

Posted Nov 20, 2008 22:49 UTC (Thu) by jengelh (guest, #33263)

This triblade concept of linking an Opteron with Cell over PCIe sounds highly interesting. But I doubt it is an SSI, is it?

I'd really wish there were SSI hardware with different CPU types. That could aid in software development, or just be fun to possess. Something unlike Cell though (SPU/SPE are all in the PPC group, kinda boring) — what I have in mind is more like a mainboard with, say, one AMD64 core, one IA64 core, and one SPARC64 core.

UKUUG: Arnd Bergmann on interconnecting with PCIe

Posted Nov 21, 2008 16:12 UTC (Fri) by drag (guest, #31333)

IBM, as you probably know, is about a decade or two ahead of everybody else in terms of virtualization. In the past they've offered POWER boxes with an add-on Xeon processor so that you could run Windows alongside AIX and Linux on the same machine.

This seems like an extension of that.

UKUUG: Arnd Bergmann on interconnecting with PCIe

Posted Nov 21, 2008 16:28 UTC (Fri) by jengelh (guest, #33263)

Now add “affordable to the end customer” to the equation ;-)

UKUUG: Arnd Bergmann on interconnecting with PCIe

Posted Nov 21, 2008 22:27 UTC (Fri) by sbergman27 (guest, #10767)

True for now. However, it is possible that Windows licenses will come down in price at some future time. ;-)


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds