Netconf 2018 Day 1
[Posted June 11, 2018 by ris]
The two-day Linux kernel networking development plenary session,
called Netconf, was held in Boston, Massachusetts, on May 31st and
June 1st, 2018. Covered here is day one of the sessions, attended by
15 developers. A subsequent article will cover day two of the event.
====================
DPDK (Stephen Hemminger)
Stephen Hemminger spoke about DPDK as a project for the kernel
networking community to learn from.
DPDK has been passed around under the umbrella of several entities
including Intel and the Linux Foundation. The community is very
active with many contributors, although most are hardware vendors.
Most consumers of the technology fork a copy internally and then stick
with that for a long time. Distros, on the other hand, tend to roll
with the upstream releases.
The main users of DPDK are FD.io and OVS. FD.io uses Cisco's VPP
infrastructure and has demonstrated 1 terabit/sec performance. It uses
batching extensively and has a very good performance regression
testbed.
The first good point about DPDK is obviously performance. Next, it
allows people to spin a product on an ancient kernel. Also, the BSD
license enables proprietary products and makes it easier to innovate.
On the other hand, DPDK's ABI lacks stability, which is often why
consumers fork internally. There is also a lot of hardware catch-all,
where every vendor wants to enable its own features. DPDK also has a very
weak config model, which is complicated and uses very long command
lines. Power management and security are also quite a challenge with
DPDK.
A set of basic high-level recommendations was given. The first is
something well established already: do a lot of batching when
possible. Next, we can learn from DPDK's ideas of pipeline
decomposition, using lockless rings for managing packets, and pinning
resources to their owner to avoid sharing.
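As a concrete (and purely illustrative) sketch of the lockless-ring
idea, not DPDK's actual rte_ring, a minimal single-producer/
single-consumer ring in C could look like this; the type name and
sizes are invented for the example.

    /* Minimal single-producer/single-consumer lockless ring, in the spirit
     * of the rings used to pass packet buffers between pipeline stages.
     * Illustrative sketch only. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SIZE 256                   /* must be a power of two */
    #define RING_MASK (RING_SIZE - 1)

    struct spsc_ring {
        _Atomic size_t head;                /* written only by the producer */
        _Atomic size_t tail;                /* written only by the consumer */
        void *slots[RING_SIZE];
    };

    /* Producer side: returns false when the ring is full. */
    static bool ring_enqueue(struct spsc_ring *r, void *pkt)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SIZE)
            return false;                   /* full */
        r->slots[head & RING_MASK] = pkt;
        /* Publish the slot contents before advancing head. */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    /* Consumer side: returns NULL when the ring is empty. */
    static void *ring_dequeue(struct spsc_ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        void *pkt;

        if (tail == head)
            return NULL;                    /* empty */
        pkt = r->slots[tail & RING_MASK];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return pkt;
    }

The point of the sketch is the combination of properties mentioned
above: no locks in the fast path, each index written by exactly one
side, and each slot owned by either the producer or the consumer at
any given moment.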
====================
BPF, Cilium, and bpfilter (Daniel Borkmann)
BPF patches used to get merged directly into the networking GIT trees.
But now Alexei Starovoitov and Daniel manage separate GIT trees ('bpf'
and 'bpf-next') for these changes which eventually get pulled into net
and net-next. There is also a 'bpf' delegate in the networking
patchwork instance. Meanwhile, BPF submissions have been increasing
dramatically.
Kernel 4.18 should be a new record for BPF with 248 patches
(excluding XDP driver changes) and 35 unique contributors so far. The
'bpf' delegate has been assigned 2065 patches (averaging to 17 per
work day), 794 of which ended up in state 'accepted' (averaging to 7
per work day). There have been 18 pull-requests for 'bpf-next' and 26
pull-requests for 'bpf'.
The main bottleneck in BPF development is reviewing. To address this
the BPF developers are experimenting with a rotating on-call
assignment of reviewers for spreading the load. The goal is to get
all BPF patches reviewed promptly. The final vetting and application
of the changes into the GIT tree(s) is still done by Alexei and
Daniel. Not only does this make the review process more scalable, it
also makes the rotating reviewers more familiar with various areas of
the BPF code.
Testing-wise, test_verifier and test_kmod.sh alone run a total of 2018
test programs. Outside of RCU torture, BPF is the biggest subsystem
under kselftests. The BPF developers are also very happy with the
state of syzkaller and how it improves the testing of BPF.
AF_XDP has been merged recently, but some problems have arisen. There
are NICs which use a buffering scheme that is incompatible with the
ring buffer design. This presents a difficult decision. We could
disable the feature in the tree until this is resolved, or we could
revert the entire set of AF_XDP changes. Nobody in the room found
either choice desirable. It was instead proposed that AF_XDP be
adjusted to support multiple ring layouts. In this way we can keep
the code in the tree and not disable it.
In Cilium, XDP is used for fast DDoS protection and possibly DSR load
balancing in the future. cls_bpf is used in direct-action mode for
all heavy duty data path work. And sk progs are used for L7 in-kernel
policy enforcement and redirect to accelerate L7 proxies like Envoy.
Sockmap programs parse and trigger verdicts for the data flowing
to/from application sockets. Future development necessary to flesh
out this technology includes bounded loops for BPF and kTLS
integration of sockmap. The latter is a hard problem, and developers are deciding
whether sockmap should be installed as a stacked ULP with kTLS or
together with kTLS as a single integrated ULP.
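To give a rough idea of what a cls_bpf direct-action program looks
like (a trivial hand-written sketch, not Cilium code; the mark value
checked here is arbitrary):

    /* Minimal cls_bpf program intended for direct-action mode: the return
     * value is itself the TC verdict, so no separate action object is
     * attached.  Illustrative sketch only. */
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    #ifndef __section
    #define __section(NAME) __attribute__((section(NAME), used))
    #endif

    __section("classifier")
    int tc_main(struct __sk_buff *skb)
    {
        /* Real programs parse headers, consult maps, rewrite packets,
         * and so on; here we just drop anything carrying an example
         * mark value and pass the rest. */
        if (skb->mark == 0xbeef)
            return TC_ACT_SHOT;
        return TC_ACT_OK;
    }

    char __license[] __section("license") = "GPL";

Such an object is typically built with clang's BPF target and attached
to a clsact qdisc with tc's 'da' (direct-action) flag, so that the
program's return value is the final verdict.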
Bpfilter is a recently started project which will take existing
netfilter (iptables and nftables) kernel API calls and translate them
into BPF code sequences which can execute either from the existing
netfilter hooks or via XDP as the situation allows. When running via
XDP this means that netfilter rules can be accelerated on SmartNICs.
Via this mechanism we will be reducing the attack surface of the
netfilter data paths. All of the generated BPF code is pushed through
the BPF system call interface via UMH helper programs. This means
that the code is generated in userspace and crosses into the kernel
only at the syscall boundary. Once the project is feature complete,
the old in-kernel xtables code and friends can be removed.
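For a sense of the kind of BPF sequence involved (a hand-written
illustration, not actual bpfilter output; it assumes libbpf's
bpf_endian.h for the byte-order helpers), an iptables-style "drop this
source address" rule running at the XDP hook could boil down to
something like:

    /* Roughly what a translated "-s 192.0.2.1 -j DROP" rule might look
     * like at the XDP hook.  Illustrative sketch only. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_endian.h>

    #ifndef __section
    #define __section(NAME) __attribute__((section(NAME), used))
    #endif

    __section("xdp")
    int drop_example_source(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph;

        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return XDP_PASS;

        /* 192.0.2.1 is used purely as an example address. */
        if (iph->saddr == bpf_htonl(0xc0000201))
            return XDP_DROP;

        return XDP_PASS;
    }

    char __license[] __section("license") = "GPL";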
The UMH module approach has several advantages. It delegates the
complex logic into userspace. Crashes of the UMH helper doesn't take
down the kernel. It can leverage all of the debugging, testing,
sanitizing, fuzzing etc. infrastructure that exists for userspace. It
also means that non-kernel developers can contribute to this project.
The current state of the project is that the UMH infrastructure and
the basic skeleton for bpfilter itself have been merged into net-next
and will appear in 4.18.
There were some discussions on how to deal with building the UMH
helpers even when cross-compiling. Alexei suggested
that merging a very small implementation of libc (such as klibc) into
the kernel might be the best way to handle this.
====================
Netflix and BPF; future work on BPF tracing (Brendan Gregg)
Inside Netflix, BPF is used by Performance Engineering (observability),
Network Engineering (flow tracing, XDP), the Security Team (intrusion
detection, whitelist generation), and the Container Team (networking,
security).
Brendan next showed a huge diagram which illustrates all of the kernel
subsystems and elements which can be analyzed using eBPF and bcc right
now. It was quite extensive.
Next, a brief outline of tracing and BPF features was given. The
idea was to show which kernel version each feature landed in and how
this has made eBPF-based tracing more powerful over time.
Various examples of tracing and statistics gathering were given, such
as block I/O (bio) latency rendered as an ASCII table as well as a
heat map. BPF allows
more sophisticated counting than we can get with just the basic
tracing infrastructure. For example, we can do a 'tcptop' with full
information on the socket object inside the kernel as well as the task
operating on that socket. This is far superior to tools which try to
do this using facilities outside of the kernel such as libpcap.
We also saw two more examples of TCP tracing. First, session logging
using inet_sock_set_state. Second, TCP packet dropping via tcp_drop.
When the drop triggers, the tracer records a stack backtrace so we can
get a better idea why the packet was dropped.
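A minimal sketch of the BPF side of such a tool (hand-written for
illustration, not the actual tcpdrop tool; it assumes a libbpf-style
build where bpf_helpers.h is available) might look like this:

    /* Kprobe on tcp_drop(): record the kernel stack of every drop and
     * count how many drops each unique stack produced.  Sketch only. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_STACK_TRACE);
        __uint(max_entries, 1024);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, 127 * sizeof(__u64));
    } stack_traces SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __s32);                 /* stack id */
        __type(value, __u64);               /* drops seen for that stack */
    } drop_counts SEC(".maps");

    SEC("kprobe/tcp_drop")
    int trace_tcp_drop(void *ctx)
    {
        __s32 stack_id = bpf_get_stackid(ctx, &stack_traces, 0);
        __u64 one = 1, *count;

        if (stack_id < 0)
            return 0;

        count = bpf_map_lookup_elem(&drop_counts, &stack_id);
        if (count)
            __sync_fetch_and_add(count, 1);
        else
            bpf_map_update_elem(&drop_counts, &stack_id, &one, BPF_ANY);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";

Userspace would then walk drop_counts, resolve each stack id via
stack_traces, and symbolize the addresses to show why packets were
being dropped.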
In the future, we might want to encapsulate TCP statistic bumping into
a special function so that it specifically can be traced and we can
get similar information at these event points just like in the
tcp_drop example above.
More advanced situations were shown next. In particular, we got to
see the power of off-wake flame graphs. Here we can see (from the
bottom, up) the blocked task down to the scheduling point and then the
waking task up to the return to userspace.
Taking this a step further, we can chain wakeup stacks into a huge
flame graph.
A full Linux tracing timeline was given, starting from static tracers
and prototype dynamic tracers in the 1990s. Eventually we get perf,
ftrace, tracepoints, uprobes, eBPF tracing, ftrace histogram filters,
and then we get to the current state of affairs (and beyond).
A brief overview was given for 'ply' and 'bpftrace'. These are
potentially very powerful tools, but their development has become
stale. Brendan made a plea for someone to finish one of these two
tools before someone starts yet another one.
====================
BPF offload; NIC switchdev mode; killing tc egdev (Jakub Kicinski)
Jakub started off by discussing the ongoing development of bpftool.
As discussed in Daniel's presentation, the goal is to make this
utility the go-to tool for everything eBPF. It's particularly
powerful for introspection. For example, dumping MAPS, showing
offload info and device specific assembly, getting the control flow
graph, listing tracing attachment points, etc.
Speaking of BPF offloading, this topic was discussed next with respect to the nfp driver.
Interesting eBPF features that nfp can offload include MAPS (including
lookup, update, and delete from the data path), atomic adds (both
32-bit and 64-bit), bpf_get_prandom_u32(), and perf event output.
In the future, Jakub would like to see support for parallel offload
and driver eBPF. The idea is that cases that can be offloaded to the
NIC will be, and the cases that can't will be punted to the driver's
XDP path (as happens for non-SmartNIC drivers that have
explicit XDP support).
Next, supporting an SR-IOV e-switch using pieces of eBPF
infrastructure was presented. It can provide full switchdev mode but
requires some new bits of infrastructure in eBPF land. For example,
per-ASIC program sharing, and dealing with multicast and broadcast
will require a new eBPF helper of some sort.
There are a lot of issues to resolve to make this a reality beyond
the new features mentioned above. Some side effects are of
concern as well, such as the new netdevs that get created, and the
fact that this kind of scheme doesn't work for HW that can't do
representors.
As far as statistics are concerned, the situation is an inconsistent
hodge-podge, especially for sophisticated devices such as the nfp.
There are various flavors of statistics via IFLA_*_STATS and ethtool.
The problem with the latter is that every driver chooses their own
names for their ethtool stats and there is therefore very little
consistency.
====================
Networking Traffic Control (Cong Wang)
Cong discussed the current state of tc in the kernel. He started with
a list of some new features such as pfifo_fast becoming "more"
lockless (but not quite completely lockless), TC filter 'block'
objects, hardware offloading, and finally extack support.
RCU completeness was the next goal described. Several
elements of the packet scheduler are RCU complete (such as filters)
but we are not quite there elsewhere. For example, for TC actions we
are missing the "copy" part of RCU and this can potentially cause
problems. Cong would like to see all of the spinlocks in the TC
action fast paths removed.
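To make the missing "copy" concrete, here is the standard kernel RCU
update pattern in generic form (the structure and field names are
hypothetical, not actual TC code): readers run locklessly against
whichever version they happen to see, while the updater modifies a
private copy and only then publishes it.

    /* Generic read-copy-update sketch.  Without the copy step, readers
     * and the updater touch the same object, which is part of why the
     * TC action fast paths still carry spinlocks. */
    #include <linux/rcupdate.h>
    #include <linux/rtnetlink.h>
    #include <linux/slab.h>

    struct act_params {
        int rate;
        int burst;
        struct rcu_head rcu;
    };

    struct my_action {
        struct act_params __rcu *params;
    };

    /* Reader (fast path): no locks, just an RCU read-side section. */
    static int action_fast_path(struct my_action *a)
    {
        struct act_params *p;
        int rate;

        rcu_read_lock();
        p = rcu_dereference(a->params);
        rate = p->rate;
        rcu_read_unlock();
        return rate;
    }

    /* Updater (control path, RTNL held): copy, modify, publish, reclaim. */
    static int action_update(struct my_action *a, int new_rate)
    {
        struct act_params *old, *new;

        new = kmalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
            return -ENOMEM;

        old = rtnl_dereference(a->params);
        *new = *old;                        /* the "copy" part */
        new->rate = new_rate;               /* modify the private copy */

        rcu_assign_pointer(a->params, new); /* publish */
        kfree_rcu(old, rcu);                /* free after a grace period */
        return 0;
    }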
He also thinks that we can make at least simple qdiscs like pfifo_fast
lockless by using lock-free data structures, since pfifo_fast is just
a simple ring buffer with no classes or filters.
We'd also like to see the RTNL mutex broken up a bit in the control
paths. Thanks to Florian Westphal's infrastructure work last year,
subsystems can try to execute their control plane without taking the
RTNL. This can be difficult to achieve in practice, especially when
the correctness of a configuration operation involves keeping objects
from multiple subsystems of the networking stack stable until completion.
Cong's presentation ended with a brief discussion of several other
ideas involving TX backpressure, ingress shaping, and
dev->tx_queue_len.
====================
TCP work (Eric Dumazet)
TCP Zerocopy went into net-next recently. It uses the header split
feature of NICs and an MTU of 4096 plus the size of the networking
headers. This way the TCP payload lands completely in a single page.
An mmap() method for TCP sockets completes the design. One concern is
that in multi-threaded applications we might get
contention on the VM locks and overhead due to TLB flushing.
In analyzing the performance, we have to first understand that just by
going to a 4K MTU we get significant performance improvements.
However, with mmap() and TCP RX zerocopy we get an additional effect
which is that the cpu overhead drops considerably (from 230usec/MB
down to 60usec/MB in Eric's tests).
Due to the 4K MTU, this isn't a feature which is easy to deploy at
large sites. One has to deal with black holes somehow if we want to
use it automatically; otherwise, some explicit enabling is necessary
(f.e. via connect/accept BPF hooks).
There are some plans to increase the default socket send buffer size
limits for TCP. It currently sits at 4MB, and the plan is to bump it
to 16MB or even 64MB. Without this increase we don't get optimal
performance for long distance flows.
However, by itself this could allow large numbers of sockets to
consume enormous amounts of memory, causing unexpected OOM situations.
There is a plan to fix several problems that exist with
TCP_NOTSENT_LOWAT (too many context switches, no wakeups when
receiving SACK to drain the queue below the LOWAT limit). Once that
is done, the send buffer size limit can be increased safely.
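For context, this is roughly how applications use TCP_NOTSENT_LOWAT
today (a minimal hand-written sketch with error handling trimmed); it
bounds how much unsent data a socket queues, which is why fixing its
problems is a precondition for raising the default send buffer limits.

    /* Cap the amount of not-yet-sent data queued on a TCP socket and only
     * write more once the kernel reports the socket writable again. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void send_with_lowat(int fd, const char *buf, size_t len)
    {
        int lowat = 128 * 1024;             /* at most 128 KB unsent */

        setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));

        while (len > 0) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };

            /* POLLOUT fires only once unsent data drops below the
             * threshold; the problems described above (extra context
             * switches, missed wakeups on SACK) revolve around this
             * wakeup. */
            poll(&pfd, 1, -1);

            ssize_t n = send(fd, buf, len, 0);
            if (n <= 0)
                break;
            buf += n;
            len -= n;
        }
    }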
There are performance problems caused by the number of ACKs that our
TCP stack currently emits. The effect is seen most acutely
on wireless, but the effect everywhere is enough to cause general
concern. Luckily most ACKs are redundant.
We now have SACK compression support, which uses a high
resolution timer. Eric will try to reuse that timer for normal ACKs
in cases where delayed ACKs might save cpu cycles and network
capacity.
Finally, Eric proposed to set the PSH bit on all TSO packets. It can
reduce the workload on the receiver when processing GRO clusters. The
problem is that GRO clusters are maintained on a linked list so
keeping the length of that list in check helps a lot.
Related to being able to deploy this, Eric mentioned the idea of using
usec resolution for TCP timestamp values. Because of limitations in
the RFC definition of the TCP timestamp timer, this won't be easy to
deploy on a global scale.
====================
Layer 1 boring stuff (Florian Fainelli)
Florian began by discussing the PHY testing changes he has been
working on. Either in the field or in a controlled test lab or
facility, people want to run different kinds of PHY tests (IEEE
compliance tests, cable diagnostics, packet generation/testing). Of
note is that most of these tests disrupt the link state.
Requirements wise, the facility must allow for tests to run for an
undefined amount of time, and it must communicate the link state
disruption to the user clearly (f.e. visible in 'ip' command output).
The interface must also be future proof, and it might have both input
and output data.
Various ethtool interfaces were presented and then the discussion
diverged to ethtool in general. The issue is that converting ethtool
over to a netlink-based facility is long overdue. As long as we keep
letting people add features to the ioctl-based ethtool, that
conversion becomes harder and harder.
There was agreement that no new ethtool features should be allowed
until the netlink conversion occurs. Once we have a netlink based
ethtool we can use things like netlink attributes to design truly
extensible APIs.
There was no agreement, however, on how to go about doing the
conversion. Some have proposed putting this ethtool facility into
devlink. Jiri Pirko and others don't agree. Jiri picked up this
topic in his own presentation on day two.
Next, Florian discussed PHYLINK. This is a bridge between MDIO, MAC
to MAC devices, and things like SFP/SFF. He listed a bunch of
rationales for PHYLINK, but the most important one is simply
supporting these kinds of hardware.
PHYLINK uses several layers: the device tree for instantiating
hardware, I2C for EEPROM and diagnostics, and finally the PHY library
for Clause 22/45 devices. What PHYLINK provides is an SFP bus
abstraction layer.
The ongoing work on PHYLINK was discussed and Florian expressed the
wish that hardware vendors would make more use of it. At the very least
this would result in a more robust implementation over time.
Kiddingly, Florian stated that the DSA layer is feature complete. It
obviously is not, and there is more work to do: support for multicast
without a bridge, DSA testing, bug fixing, and of course supporting
more devices such as the Microchip KSZ.
Florian asked if we should use devlink to expose things like
statistics, link state, and register dumps for parts of the DSA
hardware that are not exposed via bona fide netdev representors.
The next topic was lightweight devices. This is a recurring theme.
When one creates a lot of netdev objects, they allocate a ton of
resources; one example is sysfs files. David Ahern brought up this
very topic at a previous netconf and proposed to provide a way to not
instantiate various parts of a netdev's state when it is created.
Other proposals include Florian's "L2 only", as well as Si-Wei's
IFF_HIDDEN_DEVICE facility.
Finally, Florian tried to touch upon the issue of how we have too many
network device driver interfaces. Some of these provide overlapping
behavior. The biggest offender is ethtool's rxnfc colliding with
various tc offloads.
When a driver implements multiple of these facilities, it is unclear
what is supposed to happen to the underlying hardware configuration
when more than one of them is used at the same time.
The main drawback to the TC side of things, according to Florian, is
the lack of a facility for controlling rule placement (where in the
chip the rule is loaded, etc.).