
End-to-end network programmability

By Jake Edge
August 10, 2020

Netdev

Nick McKeown kicked off the virtual Netdev 0x14 conference with a talk on extending the programmability of networking equipment well beyond where it is today. His vision is of an end-to-end system with programmable pieces at every level. Getting there will require collaboration between the developers of the networking stacks on endpoint operating systems as well as those of switches, routers, and other backbone equipment. The keynote was held on July 28, a little over two weeks before the seven days of talks, workshops, and tutorials for Netdev, which begins on August 13.

[Nick McKeown]

McKeown began by noting that he has used free operating systems throughout his 30-year career in networking, first BSD, then Linux. Those operating systems have shaped networking in various ways; they have also shaped how networking is taught to undergraduates at Stanford University, where he is a professor. The Linux infrastructure is "an amazing example of networking at its best that we show to our students and try to get them experience getting their hands dirty using it", he said.

He is a "huge believer in the open-source community for networking". In his group at Stanford, all of the code is released as open source. The "real revolution in networking" over the last ten years or more has been the rise of open source as a "trustworthy infrastructure for how we learn about and operate networks". Ten or 12 years ago, everyone was using closed-source, proprietary networking equipment, but today's largest data centers are all running on mostly open-source software, mainly on Linux-based equipment.

This change is pleasing to him—not simply for the sake of openness—but because it has allowed the owners and operators of this equipment to be able to program it. Those players can then introduce changes into their networks to improve their service in various ways. That kind of innovation can only be helpful to the networking world in the future.

A combination of express data path (XDP) and BPF provides the ability to do fast packet forwarding in the Linux kernel. In parallel, new forwarding pipelines, hardware accelerators, switches, and smart network-interface cards (NICs) are emerging, many of which are programmable using the P4 language. How can those two things be brought together so that the benefits can be gained end-to-end? Those two "camps" could be seen as being in opposition to each other, but he hopes that does not end up being the case. If the two do not end up working together, he said, it "will only confuse developers and users".
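
To make that combination a bit more concrete, here is a minimal XDP sketch in C (a hypothetical example, assuming libbpf's helper headers and a separate attach step, for instance via ip link); it only parses the Ethernet header, passing IPv4 frames to the normal stack and dropping everything else, but it shows the shape of the in-kernel fast path that real forwarding programs build on:

    /* Minimal XDP sketch: pass IPv4, drop everything else.
     * Assumes libbpf headers; attach separately, e.g. with
     * "ip link set dev eth0 xdpgeneric obj xdp_ipv4.o sec xdp". */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_ipv4_only(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        /* Bounds check required by the BPF verifier. */
        if ((void *)(eth + 1) > data_end)
            return XDP_DROP;

        /* Hand IPv4 frames to the normal stack; drop the rest. */
        if (eth->h_proto == bpf_htons(ETH_P_IP))
            return XDP_PASS;
        return XDP_DROP;
    }

    char _license[] SEC("license") = "GPL";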

Some history

When he was a graduate student in the early 90s, the internet was still called NSFNet and routers were called "Cisco boxes". These routers were CPU-based and could "process a whopping ten-thousand packets per second". He and some other students decided to create a multi-port Fiber Distributed Data Interface (FDDI) router called the Bay Bridge [PDF]. It used complex programmable logic devices (CPLDs—a precursor to FPGAs) instead of CPUs to try to outdo the commercial routers.

The Bay Bridge was controlled from a Sun workstation over SBus. The CPLDs implemented a new microcoded language that described how packets should be processed. That made the Bay Bridge fast; it could process around 100,000 packets per second—ten times what the commercial products could do. The device sat on the UC Berkeley FDDI ring for about five years while McKeown and his colleagues went off and did other things.

After that, they wanted to add some new features to the Bay Bridge; McKeown learned a valuable lesson from that experience. It was a "wake-up call on the rapid obsolescence, not only of microcode [...] but the obsolescence of brain cells because we couldn't remember how to program it". It ended up taking longer to add a fairly simple feature than it took to originally program the Bay Bridge. Creating a very long instruction word (VLIW) microcontroller seemed like a great idea at the time, but that experience has made him skeptical of microcode-programmable hardware accelerators.

When the idea of a network processor (or NPU) came about in the late 90s, he did not see that as the right path forward. Instead of stepping back and looking at the problem to be solved, NPU developers were simply creating a chip that contained an array of CPU cores. Network processing requires "extremely deep pipelines and very very fast I/O", neither of which is present on CPUs (or arrays of CPUs), he said.

He showed a graph that he has used in talks since around 2000, which compared the performance of network-processing chips and CPUs. Currently, the best chips can handle roughly 12.8Tbps, while CPUs are only a bit over 100Gbps, though you can argue a bit on the exact numbers, he said. When he first started looking at it, the difference was around 5x, but it is now around 100x. That led him to the conclusion that it was inevitable that something "based on a deep pipeline, high-speed I/O, and a fixed sequence of operations that corresponded to standard protocols" would have to be used for the highest performance. That would require the least amount of power, be most likely to fit on a single die, and, thus, would "provide the lowest overall cost".

For example, today you can get an application-specific integrated circuit (ASIC) switch that will handle 40 protocols at 10Tbps and use 400W. The CPU "equivalent" is 10Tbps for only four protocols and requires 25kW. CPUs are optimized for memory load and store operations, while the ASICs are optimized for I/O and pipelining. The upshot is that high-performance switches will be ASIC-based for the foreseeable future. He quoted "conventional wisdom", though he may have been the one who originally said it: "Programmable switches run 10-100x slower, consume more power, and cost more. Inherently."

The problem comes when a new protocol is needed. When the virtual extensible LAN (VXLAN) protocol needed to be added to the fixed-function switches, it took roughly four years for it to roll out in new hardware. Even though it takes a long time to get new features into the kernel, he said, four years is "pretty crazy". It makes him think that the development process we have for changing how packets are processed is wrong.

When the first programmable switches arrived, they were based on a variety of approaches: FPGAs, NPUs, or ASICs. None of them made it easy for the user to write code for their needs, so the device makers wrote the code. The device makers do not operate large networks, however, so they tended to simply implement existing standards—they are not going to be the ones that innovate, he said. All of that makes it hard to introduce new ideas, so everything tends to stagnate.

Domain-specific processing

He and others started looking at the domain-specific processors that have come about in various areas: GPUs for graphics and, it turned out, machine learning, digital signal processors (DSPs), tensor processing units (TPUs) for machine learning, and so on. As with general-purpose computing on CPUs, all of these domain-specific processors had higher-level languages that could be compiled for them; those compilers would optimize the emitted code to take advantage of the instruction set and the parallelism available.

In networking, there was no language where you could specify the behavior that would "lend itself to running at very high speed in an unrolled feed-forward path on a hardware accelerator", McKeown said. Around 2010 he and others started to think about a new domain-specific processor optimized for packet processing; it would specifically target allowing network operators to program it themselves. To that end, a new high-level language was needed that was hardware-independent and could be compiled to run at line-rate—all without sacrificing power, performance, or size.

Stanford started a project in collaboration with Texas Instruments, until TI got out of the large ASIC business, at which point the project started working with Barefoot Networks. The project developed the P4 language [PDF], which can target the protocol-independent switch architecture (PISA) that is provided by Barefoot Networks hardware.

McKeown described PISA at a high level. It consists of a programmable parser element that can pull apart packet headers based on a state-machine definition of the header layout. Each header then traverses a pipeline of match-and-action elements; one packet's headers are being processed at each stage of the pipeline, which provides one dimension of parallelism. Each of the match-and-action stages is identical and contains multiple elements that can do table-driven matches of various sorts (e.g. exact matches, associative matches) and take actions to modify the headers. There may be hundreds of match-and-action elements within each stage of the pipeline, which provides another dimension of parallelism.
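
To make the parser element more concrete, here is a toy C model (purely illustrative, with invented names) of a state machine that walks the headers and uses a field in each one to pick the next state; a real PISA parser is generated from a P4 parser declaration rather than hand-written like this:

    /* Toy model of a programmable parser: a state machine in which each
     * state extracts one header and a field in that header selects the
     * next state. Illustration only; not how a real parser is built. */
    #include <stdint.h>
    #include <stddef.h>

    enum parse_state { P_ETHERNET, P_IPV4, P_IPV6, P_ACCEPT, P_REJECT };

    struct parsed {
        int has_ipv4;
        int has_ipv6;
    };

    enum parse_state parse_packet(const uint8_t *pkt, size_t len,
                                  struct parsed *out)
    {
        enum parse_state st = P_ETHERNET;
        size_t off = 0;

        out->has_ipv4 = out->has_ipv6 = 0;
        while (st != P_ACCEPT && st != P_REJECT) {
            switch (st) {
            case P_ETHERNET: {
                if (off + 14 > len)
                    return P_REJECT;
                /* The EtherType field chooses the next state, much like
                 * a "select" expression in a P4 parser. */
                uint16_t ethertype = (pkt[off + 12] << 8) | pkt[off + 13];
                off += 14;
                st = (ethertype == 0x0800) ? P_IPV4 :
                     (ethertype == 0x86dd) ? P_IPV6 : P_REJECT;
                break;
            }
            case P_IPV4:
                if (off + 20 > len)
                    return P_REJECT;
                out->has_ipv4 = 1;
                st = P_ACCEPT;
                break;
            case P_IPV6:
                if (off + 40 > len)
                    return P_REJECT;
                out->has_ipv6 = 1;
                st = P_ACCEPT;
                break;
            default:
                st = P_REJECT;
                break;
            }
        }
        return st;
    }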

There is a tendency to want to optimize the match-and-action stages based on how packets are usually processed today (e.g. layer 3 before layer 4), but they found that doing so gave too many constraints to the compiler. That is why all of those stages are identical; the number of stages is determined by the "degree of serial dependency in the programs". He showed four stages in the pipeline on his slide, but typically the pipelines are 16-25 stages deep. That provides room for programmers to add their own processing over and above the typical processing of today's protocols.

P4 is used to program the whole pipeline, including the parser, the match-and-action stages, and the control flow. The match-and-action stages consist of tables that describe the types of matches to be done on various fields in the headers and the actions that should be taken when they occur; the control flow describes the sequence of tables that a packet will traverse on its way through the pipeline. Initially, the pipeline knows nothing about any protocol, so the information for any protocols (e.g. IPv4 and IPv6) must be specified.
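
A similarly simplified C sketch of a single table-driven match-and-action stage might look like the following (again only a toy model with made-up names; the real thing is written in P4 and compiled down to hardware lookup tables, with the entries installed at run time by a control plane):

    /* Toy model of one match-and-action stage: an exact-match table whose
     * entries carry an action and its parameters. Illustration only. */
    #include <stdint.h>
    #include <stddef.h>

    struct headers {
        uint32_t ipv4_dst;      /* parsed destination address */
        uint8_t  ttl;
        uint16_t egress_port;   /* metadata set by the action */
    };

    typedef void (*action_fn)(struct headers *h, uint32_t param);

    /* One possible action: forward out of a given port and decrement TTL. */
    static void set_egress(struct headers *h, uint32_t port)
    {
        h->egress_port = (uint16_t)port;
        h->ttl--;
    }

    struct table_entry {
        uint32_t  key;          /* exact-match key, here ipv4_dst */
        action_fn action;       /* action to run on a hit */
        uint32_t  param;        /* action data installed by the control plane */
    };

    /* The stage itself: look up the key and run the matching action. */
    static void match_action_stage(struct headers *h,
                                   const struct table_entry *tbl, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (tbl[i].key == h->ipv4_dst) {
                tbl[i].action(h, tbl[i].param);
                return;
            }
        }
        /* Miss: fall through to the default action (here, do nothing). */
    }

    /* Example entry a control plane might install: send 10.0.0.1 to port 3. */
    static const struct table_entry ipv4_fib[] = {
        { .key = 0x0a000001, .action = set_egress, .param = 3 },
    };

    int main(void)
    {
        struct headers h = { .ipv4_dst = 0x0a000001, .ttl = 64 };
        match_action_stage(&h, ipv4_fib, 1);
        return h.egress_port == 3 ? 0 : 1;
    }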

The pipeline itself looks much like the fixed-function pipeline, he said, "so what's the big difference?" Unlike with a fixed-function device, the pipeline in a PISA device can be changed. He has been interested to see what kinds of changes network operators make on the PISA-based switch chips from Barefoot Networks, beyond just having the protocols that "we all know and love: IPv4, IPv6, etc.". Load-balancing changes are common, as is adding various types of telemetry to observe the behavior and performance of the switch. He is not sure that he would recommend it, but he has seen some use of non-standard IP address sizes (e.g. a 48-bit IPv4 address) on these programmable switches.

He showed a comparison of a fixed-function switch and one based on the Tofino P4-programmable chip from Barefoot Networks; both had 64 100Gbps ports and were pretty much identical other than the packet-processing chip used. The maximum forwarding rate was essentially the same, with the Tofino-based switch a bit higher. Power used per port was similar, but the Tofino was somewhat lower. Similarly, the latency was a bit less on the Tofino, but roughly equivalent. All of this shows that his conventional wisdom from earlier in the talk is now incorrect; programmable switches have the same performance, power, and cost characteristics as fixed-function devices. That means network operators will be likely to choose the programmable option for the additional flexibility it provides.

Where we are headed

Network owners and operators are "taking control of the software that controls their networks", McKeown said. It is "their lifeblood", so they need to ensure that it is reliable, secure, and can be extended in ways that allow them to differentiate themselves. To that end, they are also starting to take control of how their packets are being processed because of the availability of programmable switches and NICs. That is a transition that is starting to happen now as these devices all become more malleable.

He wondered what this means for how networks are going to be programmed in the future. He believes that we will think of the network as a programmable platform, rather than as a collection of separate elements; the behavior of the network will be described from the top (i.e. top-down). His hope is that the behavior will then be partitioned, compiled, and run across all of the elements in the network. There is still a lot of work to do to get there, however.

Furthermore, every data center will work differently, as they will be programmed locally to tailor them for the needs of the operator. It may be done to make them simpler, by removing protocols that are not needed, for example, or to add security measures specific to the needs of the organization that runs the data center.

A somewhat more controversial belief is that we will largely stop thinking in terms of protocols, he said; instead we will think in terms of software. The functions and protocols of today's internet will migrate into software. He suggested that many in the audience already thought that way, but that much of the network-development world is focused on interoperability, which is more of a concern when you are building things bottom-up and need to ensure that it all works well together. Today's large networks tend to be "somewhat homogeneous", using similar equipment throughout. If you can express the desired behavior from the top "such that it is consistent across all of the devices, interoperability will matter, but much less than it used to" because the devices should work together by design, effectively.

All of this means that students of networking will need to learn about "programming a network top-down as a distributed computing platform". He and other educators need to figure out how to evolve their classes in that direction. It may mean that protocols are described in "quaint, historical terms"; things like routing and congestion control will instead be programs that are "partitioned across the system by a compiler".

Something that he thinks will be "absolutely huge" is the introduction of software-engineering techniques that will be routinely used in networking. If you have a specification of the behavior at the top level, then each API abstraction layer on the way down to packet processing can have its code checked for correctness. Techniques like unit testing, and even formal verification, can be applied to these programs to ensure that they are functioning correctly. We are a long way from being able to do that today, but all of the techniques either already exist or are within our reach as they are being researched and worked on today, he said.

Fine-grained per-packet telemetry will become more widespread, he believes. While in-band network telemetry [PDF] (INT) will be part of that, there will be other flavors and improvements over time. Once a device is programmable, the owner can determine what should be measured based on their needs; that will lead to a lot of innovation.

The eventual goal, McKeown said, is that "we will have networks that are programmed by many and operated by few". They will be programmed by network operators and owners, but also by those who are studying and researching networking; hopefully those networks would be "operated by a lot fewer people than they are today". He noted that Stanford has around 35,000 people on its campus in normal times and that it takes around 200 people to keep the network running for them; that stands in stark contrast to the old telephone network that only took three people to keep it running. "We clearly haven't quite got it right yet in terms of making networks simple to operate and manage."

Programmable platform

[Netdev keynote]

The goal should be to be able to write code describing the network that is clear, will run at line-rate, and can be moved around to the component where it makes the most sense to run. He described the elements of the network pipeline, starting with user space, which uses the Data Plane Development Kit (DPDK) for networking in virtual machines (VMs), containers, or user-space programs. Then there is the kernel, which uses XDP and BPF. After that are the NICs and switches, which are increasingly being programmed using P4, though other languages or techniques may emerge.
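
For the user-space end of that pipeline, the DPDK model looks roughly like the sketch below (a hypothetical fragment: it assumes rte_eal_init() and the port/queue/mempool setup have already been done elsewhere, and error handling is omitted); a polling thread pulls bursts of packets straight from the NIC and forwards them without ever entering the kernel:

    /* Condensed DPDK forwarding loop: poll bursts from one port and
     * transmit them on another. Setup and error handling omitted. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    void forward_loop(uint16_t rx_port, uint16_t tx_port)
    {
        struct rte_mbuf *bufs[BURST];

        for (;;) {
            /* Busy-poll a burst of packets from RX queue 0 of rx_port. */
            uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST);
            if (nb_rx == 0)
                continue;

            /* Forward them out of TX queue 0 of tx_port; free any
             * packets the driver could not queue for transmission. */
            uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);
            for (uint16_t i = nb_tx; i < nb_rx; i++)
                rte_pktmbuf_free(bufs[i]);
        }
    }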

The DPDK and XDP/BPF components already exist and work extremely well, he said; many of the developers of those pieces are in the audience. PISA and P4 are emerging on switches and he thinks those capabilities will also be moving into the NICs. So there is a potential collision between the two different ways of doing things; both are trying to efficiently handle packets in a familiar and easy-to-use way. He is not advocating leaving C and C++ behind, but, as XDP/BPF shows, there is a clear need for a "constrained way of programming" so that these programs can operate safely within the kernel.

As an example of the kind of program that might need to move to different elements in the pipeline, McKeown raised congestion control. It is somewhat "dangerous ground" that he is stepping into, because everyone has their "religious convictions about how congestion control should be done". He was not picking a side, but did want to show how the signals of congestion have moved around to the various parts of the pipeline for different algorithms.

The operators of clouds and large data centers have been experimenting with different techniques for determining when congestion is occurring. Originally, the signal for congestion was packet drops and duplicate ACKs, which is something that is mostly observed by the kernel. Later, round-trip time (RTT) was used as a signal, which required highly accurate timers so it was best done in the NICs. More recent research has looked at queue occupancy in the switches as a signal, which requires changes to those devices to gather the information along with changes to the NICs and kernel to handle a new header format to communicate the queue data. Adding VMs and containers into the mix, with a possibly different view of the congestion state from the underlying kernel or out on the NIC, makes things even more confusing. It makes sense that there would be a desire to be able to move things around to those various components as more is learned.

Routing has similar characteristics with regard to ideas changing over time. If there are ways that allow operators to move functionality around, they will do so, and the likely result is better techniques for networking. But the big question is how to get there. He said that he would be taking the "dangerous risk of putting down a very tentative strawman". The overall problem deserves a lot of careful thought, because if it can be done right, it will have "dramatic consequences for the field of networking as a whole".

There is an enormous amount of user-space and kernel networking code that has already been written, which should be maintained going forward. But that general-purpose code will not directly run on a hardware-accelerated pipeline in a NIC or a switch; there needs to be some method of constraining those programs so that they can be run on those devices. Finding the right balance there "is not entirely obvious", McKeown said.

His idea is that the overall pipeline structure would be specified in P4 (or another language that allows specifying the serial dependencies of the processing). Using the P4 extern capability, much of the program code could still be written in C/C++, especially for things that will always run on a CPU. Other code would be written in P4 so that it could be moved to the hardware accelerators.

He gave an example of some functionality that is currently implemented in smart NICs for VM and container security in the cloud. When the cloud operators want to add new bare-metal systems, such as supercomputers, to their cloud, they cannot trust the NICs on those devices because they do not control the software that runs on them. They handle that by moving the security functionality into the switch. If they could just take the same code they are already running on the NIC and put it on the switch, it would make this process much easier.

Figuring all of this out is important, but he does not think that either the P4 or the Linux networking communities should try to figure this out on their own. There is expertise in both communities and, in the spirit of open source, they should come together to collaborate on solving these problems. He proposed that "netdev" (the kernel networking community centered around the netdev mailing list) and P4.org form a working group to specifically focus on finding open-source solutions to make all of this work end-to-end.

He wrapped up his talk by saying that, over the next decade, he believes that networks are going to become end-to-end programmable and that there is a need for collaboration to make it all work consistently. That will result in a lot of innovation in networking, and it will happen much faster than it would otherwise. Network operators will create the "dials that they need to observe the behavior" as well as "the control knobs that they need to change that behavior"; he suspects that most of the time they will use it to make their networks simpler, more reliable, and more secure. "Our job is to figure out how to make it possible for them to do so."

After his hour-long talk, McKeown spent another 30 minutes or so fielding a wide variety of questions from an obviously engaged audience. Even this long article did not cover everything; there is more that can be found in the video of the talk (and Q&A), which will be released sometime after Netdev 0x14 wraps up on August 21.


Index entries for this article
Kernel: Networking
Conference: Netdev/2020



End-to-end network programmability

Posted Aug 15, 2020 10:13 UTC (Sat) by ras (subscriber, #33059) [Link] (1 responses)

So where do things like OpenVswitch[0] and faucet[1] fit into all this? This looks to be a different way of doing the same thing. It would be nice to see a comparison.

[0] https://www.openvswitch.org/

[1] https://faucet.nz/

End-to-end network programmability

Posted Aug 15, 2020 12:31 UTC (Sat) by leromarinvit (subscriber, #56850) [Link]

I've never used any kind of SDN stack, but I'm under the impression that at least if you want hardware support, you're limited by what protocols your hardware supports with traditional switches, even if controlled by OVS or Faucet. Whereas with PISA and P4, you can just write a new parser and load it into your existing switch, and it will work at line speed.

Now, as I said, I'm certainly no expert on any of this, so please correct me if I'm wrong.

End-to-end network programmability

Posted Sep 11, 2020 18:01 UTC (Fri) by marcH (subscriber, #57642) [Link]

> ... but that much of the network-development world is focused on interoperability, which is more of a concern when you are building things bottom-up and need to ensure that it all works well together. Today's large networks tend to be "somewhat homogeneous", using similar equipment throughout. If you can express the desired behavior from the top "such that it is consistent across all of the devices, interoperability will matter, but much less than it used to" because the devices should work together by design, effectively.

I'm not sure what "large" means here. "Large company network", OK. The Internet likely never will be, because of https://www.google.com/search?q=site%3Alwn.net+ossification


Copyright © 2020, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds