Netconf 2018 Day 2
This article covers day two of Netconf, the plenary sessions for Linux kernel networking, which was held in Boston, Massachusetts. (For coverage of day one of the sessions, see this article.)

====================

Being Less Indirect (David S. Miller)

David presented a brief overview of Spectre and the implications it has for the kernel. The networking subsystem is particularly vulnerable because, by design, there are indirect function calls all over the data paths for protocol demux and for running filters or rules.

A side effect of Spectre is that optimization schemes which proved not so effective when tried in the past are now open for a full re-evaluation. For example, various methods of batching which did not give clear performance gains should be looked into once more. We can also use eBPF to translate filtering rules into straight-line code paths. This linearization eliminates all of the indirect calls except the one necessary to call into the eBPF code stream itself.

Some side topics were discussed next, the first being SKB lists. The SKB metadata is one of the few core data structures in the kernel that implements its own linked-list handling. Different parts of the networking stack use the list's next and prev pointers in different ways, which complicates any conversion over to the kernel's generic list_head. The main offender is GRO, and Eric Dumazet said that he planned to change that singly linked list scheme into a per-NAPI hash table. This would significantly ease the conversion over to generic list handling.

The various robots we have operating on the kernel tree were discussed next. From the kbuild robot onwards to technologies like syzbot, we have a lot of automated critters building and testing our code. Is this, overall, a good thing? The verdict is not entirely clear. Kbuild obviously provides value: it can detect build problems before they propagate too deeply into permanently committed changes. Early on, kbuild had trouble figuring out exactly which tree a set of changes applied to, but it is much better at this now.

Syzbot tends to have a roller-coaster pattern of activity. Every time it learns how to explore a new area of the kernel, it finds a lot of new bugs. Developers scramble to fix the onslaught of new bug reports, and once that initial peak subsides things are quiet for a while. This sequence repeats over and over again. The worry is that this kind of tool can become a crutch: people can just decide to run syzbot in order to test a set of new changes, and there could be less and less human, by-hand auditing of the sources. Such by-hand auditing is an effective way to find problems and to become more familiar with the code base itself.

David really likes the number of tests being added for networking under selftests, and he hopes the trend continues. Perhaps at some point we can require new tests for any set of changes that adds a new feature to the networking stack.

====================

TC Flower Tunneling (Simon Horman)

Simon began by explaining how encapsulation key dissection works. This is a sub-part of the generic networking flow dissector and is currently only used by TC flower. The key point is that this facility allows differentiation between inner and outer header keys. Most of the networking stack dissects on the inner header before decapsulation, using skb_flow_dissect(). Encapsulation keys, on the other hand, are seeded from the outer header during decapsulation, using skb_flow_dissect_tunnel_info().
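As a rough illustration of what that separation buys, here is a toy view of a flower-style key that carries both inner and encapsulation fields. The struct below is invented for this article; only the FLOW_DISSECTOR_KEY_* names mentioned in the comments are real kernel identifiers.

    #include <stdint.h>

    /*
     * Toy model for illustration only; these are not the kernel's
     * structures.  The FLOW_DISSECTOR_KEY_* names in the comments are
     * the real keys from include/net/flow_dissector.h.
     */
    struct toy_ipv4_addrs { uint32_t src, dst; };
    struct toy_ports      { uint16_t src, dst; };

    struct toy_flower_key {
            /* Inner headers: what skb_flow_dissect() extracts. */
            struct toy_ipv4_addrs ip;         /* FLOW_DISSECTOR_KEY_IPV4_ADDRS */
            struct toy_ports      tp;         /* FLOW_DISSECTOR_KEY_PORTS */

            /* Outer (encapsulation) headers: seeded from the tunnel
             * metadata by skb_flow_dissect_tunnel_info(). */
            struct toy_ipv4_addrs enc_ip;     /* FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS */
            struct toy_ports      enc_tp;     /* FLOW_DISSECTOR_KEY_ENC_PORTS */
            uint32_t              enc_keyid;  /* FLOW_DISSECTOR_KEY_ENC_KEYID (e.g. a VNI) */
    };

A rule that matches on enc_ip and enc_keyid while leaving the inner fields wildcarded effectively classifies on the tunnel itself; a rule that matches only the inner fields ignores how the packet arrived.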
One main limitation of the current encapsulation keys is that they key on addresses and ports. This basically limits them to UDP-based encapsulations (VXLAN, Geneve, and so on), and further limits them to the RFC-defined protocol ports, because the port in the key is what is used to identify the tunnel type. Simon proposed adding an encapsulation type field to the encapsulation key. This would make the type of tunnel explicit, allow non-UDP-based tunneling, and also allow UDP tunnels to run on non-standard ports.

On the topic of matching on Geneve tunnel options, it is not exactly clear what the right way to expose a configuration mechanism for this would be. Do we accept raw TLV blobs? At the very least we should validate the TLVs, so that we only accept Geneve TLVs for Geneve tunnels. That works for setting, but for matching, TC flower needs support added for maskable TLVs. Also, what about ordering? Does TLV matching require exact ordering, or is any order allowed?

====================

Who Fears the Spectres? (Paolo Abeni)

Paolo provided graphs of UDP small-packet receive performance in several scenarios: with no mitigations, with PTI enabled, with just retpolines, and with both PTI and retpolines. Generally, PTI has the largest effect, but retpolines give a noticeable hit as well.

Some perf output snapshots were shown, along with how each of the mitigations changes what we end up seeing in perf. PTI causes the syscall entry/exit to show up clearly. The retpoline effect is less clear: what we see at the top of the perf output is not exactly an indirect call, but rather some point right after that indirect call.

Another test, this time using pktgen, was shown. Here, since there are essentially no transitions between the kernel and userspace, the PTI effect is basically zero, but we can clearly see a performance hit from retpolines.

Some suggestions for fighting these problems followed. Batching was mentioned first. We already have batching in GRO, GSO, and qdisc dequeue, but these don't really trigger for forwarding workloads. Furthermore, we only have partial support for segmentation offloading of UDP.

A lot of our indirect calls in the kernel are actually not really indirect. An interesting case is skb->destructor, which only ever takes on one of a few values. We could instead encode an integer and make a direct call to the desired destructor based on it. Eric Dumazet mentioned that there have been experiments to formalize this kind of technique; for example, a macro at the protocol demux points that expands to a series of tests on the protocol value, all of which lead to direct invocations of that protocol's packet input routine.
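A minimal sketch of that idea, with invented names standing in for the real destructors, might look like this:

    /*
     * Sketch only, not kernel code.  Instead of storing a function
     * pointer and calling through it (which now costs a retpoline),
     * store a small integer identifying the destructor and dispatch
     * with a switch, so the calls become direct and predictable.
     */
    enum toy_destructor_id {
            TOY_DTOR_NONE,
            TOY_DTOR_SOCK_WFREE,
            TOY_DTOR_SOCK_RFREE,
    };

    struct toy_skb {
            enum toy_destructor_id destructor_id;
            /* ... packet data and metadata ... */
    };

    static void toy_sock_wfree(struct toy_skb *skb) { (void)skb; /* release TX accounting */ }
    static void toy_sock_rfree(struct toy_skb *skb) { (void)skb; /* release RX accounting */ }

    static void toy_skb_run_destructor(struct toy_skb *skb)
    {
            switch (skb->destructor_id) {
            case TOY_DTOR_SOCK_WFREE:
                    toy_sock_wfree(skb);            /* direct call, no retpoline */
                    break;
            case TOY_DTOR_SOCK_RFREE:
                    toy_sock_rfree(skb);            /* direct call, no retpoline */
                    break;
            case TOY_DTOR_NONE:
                    break;
            }
    }

Eric's demux macro applies the same idea at the protocol dispatch points: test for a handful of likely protocols explicitly and fall back to the indirect call only for everything else.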
Three more ideas were brought up. First, it may be possible to move a lot of our COMPAT-layer code into a new usermode helper (UMH) of some sort. Next, Paolo suggested reviving an old idea of Eric Dumazet's: perform remote SKB freeing (always free the SKB on the CPU the packet came from). Finally, he asked whether early socket demux for unconnected sockets is still worth pursuing.

There was also a discussion about TCP regressions due to how skb_scrub_packet() works. One side effect of this operation is that it NULLs out the SKB's socket pointer, which in turn causes us to lose the TX queue mapping in use by that socket; this breaks flow separation completely. After some discussion of various ways to deal with the problem, it was generally agreed that we can preserve the socket pointer during a scrub operation.

====================

TLS, Crypto, and ULPs (Dave Watson)

Dave discussed the current state of the in-kernel TLS layer. Right now we have full software support for RX and TX; the Chelsio driver can offload both TX and RX, and the Mellanox driver can offload TX. A basic design principle of the TLS layer is that the handshake is still done in userspace.

The challenges of zerocopy were discussed next, and how this all fits together when one tries to use kTLS over NBD; in particular, NBD needs larger buffer sizes to achieve zerocopy.

kTLS currently only supports 128-bit keys, and Dave would like to remove that limitation soon. TLS 1.3 support is also on the way, and it encompasses several things. One nice aspect of TLS 1.3 is that rekeying has been removed; rekeying is tricky to support properly in this multi-modal design.

Next, the limitations and problems of using the strparser layer for kTLS were listed. In particular, its handling of SKB sharing needs to be improved: it uses skb_clone(), and this interacts badly with skb_cow_data() and friends.

We have several implementations of the relevant crypto algorithms, all with different performance characteristics. The C version supports all key sizes and full scatter-gather, but is the slowest. The AVX and SSE versions are faster, but support only certain key sizes or lack scatter-gather.

The next issue in the crypto layer is that FPU usage is inefficient. A round of crypto using SSE/AVX starts by loading the key material into the FPU registers; if we are doing multiple rounds, we should only do this key load once. However, that is not what happens currently; the keying material is loaded every single time around.

ULPs (upper-layer protocols) are a hot topic of conversation because we'd like to use them in new and interesting ways. The current users of ULPs are kTLS and BPF. ULPs operate by overloading the proto and socket operations so that they can sit in the middle. This works fine when only one ULP is active, but once you try to run several at a time it gets messy. For some callbacks you want to invoke the ULPs starting from the "topmost" one and working down; in other situations you want the exact opposite. Also, once multiple ULPs are stacked together, how do you unload one of them?
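For reference, a ULP is attached with the TCP_ULP socket option. Following the kernel's TLS documentation, handing an already-established connection's transmit path to kTLS looks roughly like this (a sketch: AES-GCM-128 with TLS 1.2, the only combination supported at the time, with error handling trimmed; the fallback #defines are only needed if the userspace headers are older):

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <linux/tls.h>

    #ifndef TCP_ULP
    #define TCP_ULP 31          /* attach an upper-layer protocol to a socket */
    #endif
    #ifndef SOL_TLS
    #define SOL_TLS 282         /* setsockopt() level used by the TLS ULP */
    #endif

    /*
     * The TLS handshake has already been done in userspace; iv, key, salt,
     * and rec_seq are the negotiated values for the write direction.
     */
    static int enable_ktls_tx(int sock, const unsigned char *iv,
                              const unsigned char *key,
                              const unsigned char *salt,
                              const unsigned char *rec_seq)
    {
            struct tls12_crypto_info_aes_gcm_128 ci;

            /* Step 1: attach the "tls" ULP, which overrides the socket's ops. */
            if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
                    return -1;

            /* Step 2: hand the keying material for the TX direction to the kernel. */
            memset(&ci, 0, sizeof(ci));
            ci.info.version = TLS_1_2_VERSION;
            ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
            memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
            memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
            memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
            memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

            return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }

After this, plain write() and sendfile() calls on the socket produce TLS records. The stacking problem described above shows up as soon as a second ULP wants to interpose on the same proto operations.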
====================

TC changes and "ethlink" (Jiri Pirko)

Jiri first discussed a feature he is working on called chain templates. One limitation of hardware offloading is that the driver can't assume anything about the width of the key masks that will arrive in future operations, so it may allocate more space in the chip's tables than is really necessary. Chain templates aim to give the driver a better idea of how wide such values can ever be; it is all about using hardware tables, which are a scarce resource, in a more optimal way.

Vlad Buslov has been working to increase the rate of TC rule insertion. Inserts currently cannot run in parallel; they are all serialized by the RTNL lock. The idea is to elide RTNL locking and use reference counting and smaller locks for more fine-grained control. The initial work will target TC actions and the flower classifier; future work will involve batching of changes.

An overview of devlink was given, as well as some discussion of the ongoing work by Moshe Shemesh to add configuration parameters. Different kinds of parameter changes can require that the hardware be reset in different ways for the change to take effect, and this concept is built directly into the new facility: each changeable parameter carries a value describing what kind of reset is necessary, and when changing a parameter the user asks explicitly for one of those reset types.

Alex Vesker has been working on support for resources in devlink; this allows the user to operate on memory regions that are usually part of the firmware image. Users can snapshot, dump, and read these regions.

The final topic of conversation here was "ethlink", which is Jiri's name for exposing ethtool operations via netlink. He believes it would be a mistake to place this into devlink: devlink is for performing operations on objects which are not netdevs, whereas ethtool is firmly netdev-based. A high-level design, as well as a migration strategy, was provided. As a first step, we could implement the netlink attributes for ethlink and start with code that simply emits netlink events when ethtool operations are performed.

====================

RX Batching, GRO, Megaflow merging, ARFS, BPF Verifier (Edward Cree)

Edward touched upon a variety of topics, starting with RX batching. This is a patch series he proposed a while ago which, at the time, received a lukewarm reception; he hopes that with Spectre it may now gain more traction. The idea is to gather packets into a list at NAPI poll time and walk them through the networking stack as a group, even if they are somewhat unrelated. He went on to describe the listification algorithm he uses and how it leverages the existing networking stack: it is simply hooking up the existing datapath in a new way. Plans are underway to rebase his changes, clean them up, and get more recent performance numbers.

Another experimental idea is to extend this concept to XDP: the verifier emits a prologue and epilogue that transparently make an XDP program capable of processing a list of packets instead of just one.
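A toy sketch of the listified-receive idea described above (invented names, not the kernel's NAPI API) looks something like this:

    #include <stddef.h>

    /*
     * Toy sketch of listified receive; invented names, not the kernel API.
     * Gather everything available at poll time into a list first, then hand
     * the whole batch to the stack in one call, amortizing per-packet costs
     * such as (retpolined) indirect calls and cache misses.
     */
    struct toy_skb {
            struct toy_skb *next;
            /* ... packet data and metadata ... */
    };

    /* Stand-ins for the driver's ring dequeue and the stack's entry point. */
    struct toy_skb *toy_driver_poll_one(void *rxq);
    void toy_stack_receive_list(struct toy_skb *list);

    static int toy_napi_poll(void *rxq, int budget)
    {
            struct toy_skb *head = NULL, **tail = &head;
            int work = 0;

            /* Phase 1: pull up to 'budget' packets off the RX ring. */
            while (work < budget) {
                    struct toy_skb *skb = toy_driver_poll_one(rxq);

                    if (!skb)
                            break;
                    *tail = skb;
                    tail = &skb->next;
                    work++;
            }
            *tail = NULL;

            /* Phase 2: let the stack walk the whole batch at once. */
            if (head)
                    toy_stack_receive_list(head);

            return work;
    }

The interesting part is what toy_stack_receive_list() would do: each layer processes the entire list before passing it on to the next, so per-layer costs such as indirect calls are paid once per batch rather than once per packet.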
Megaflow merging is a scheme which can hopefully improve the performance of OVS/flower, at least for certain workloads. The trick is finding the optimal set of flows to merge. Edward has some test code in Python which tries to achieve this in various ways; it is still very early, experimental work, but the initial results are promising.

Loose GRO was the next topic. LRO is generally frowned upon because, unlike GRO, it doesn't maintain enough state to allow emitting exactly the original packet stream. LRO can therefore coalesce in some situations where GRO can't, so at least theoretically it might be faster in some circumstances. Although this is unproven, Edward said that many customers nevertheless think they need LRO. The proposal was to allow less strict, LRO-style behavior in GRO when some option or sysctl is set. In reality, we only need to respect the resegmentability of the packet stream when we are forwarding, so perhaps that could act as the key to control the behavior. Edward is interested in upstreaming something like this, as it would allow him to remove a bit of code from the Solarflare driver. What to key on was discussed some more, and one idea from Eric Dumazet is to perform early demux of sockets even earlier than we do now: if an SKB maps to a local socket, we know we won't be forwarding the packet, so we can safely do loose GRO in that case.

ARFS flapping can occur when interrupt affinities are misconfigured. Edward didn't initially realize this was the cause, so he implemented code to avoid the flapping explicitly. In the end, he thinks that interrupt affinities should simply be set properly, so his changes probably aren't very useful anymore.

Recently there has been some disagreement on how to proceed in making the eBPF verifier more powerful. Edward has one set of approaches he would like to use, and Alexei Starovoitov has another. These disagreements have been fermenting on the mailing lists for some time, so it was great to have both of them in the room to discuss the differences and try to work things out. One of Edward's chief complaints seems to be the hard push to stick to traditional, tested userspace compiler algorithms for this work; he would like more freedom to explore more inventive approaches. Alexei, for his part, is very concerned about the verifier in the long term: he would like to see it handle a one-million-instruction program efficiently. Because of new developments in the pipeline (bounded loops, real function calls, libraries), he thinks getting this right is a life-or-death situation for BPF, which is why he wants to start with tried-and-tested mechanisms for data-flow analysis.

====================

SCTP offload and tunnel ICMP handling (Xin Long)

Xin gave a detailed explanation of how SCTP chunking works. This is important to understand because it has a huge impact on how GRO and GSO need to work. Unlike a normal stream protocol like TCP, SCTP can transport multiple data streams at once. These are managed by the stream scheduler and packaged into SCTP frames using what are called chunks. So the data stream is not just the SCTP payload; there is a further level of header parsing and demuxing that must occur, and it is this which complicates a GSO implementation considerably.
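To give a feel for that extra level, here is a simplified view of the wire format; the kernel's real definitions live in include/linux/sctp.h, and the structures and walker below are only for illustration.

    #include <stdint.h>
    #include <stddef.h>
    #include <arpa/inet.h>

    /* Simplified SCTP wire format (see RFC 4960); illustration only. */
    struct toy_sctp_hdr {
            uint16_t source;        /* source port */
            uint16_t dest;          /* destination port */
            uint32_t vtag;          /* verification tag */
            uint32_t checksum;      /* CRC32c */
    };

    struct toy_sctp_chunk_hdr {
            uint8_t  type;          /* DATA, SACK, HEARTBEAT, ... */
            uint8_t  flags;
            uint16_t length;        /* includes this header; network byte order */
    };

    /*
     * One SCTP packet can bundle several chunks, and DATA chunks may belong
     * to different streams.  This is why GSO/GRO cannot simply slice the
     * payload at MSS-sized boundaries the way they can for TCP; they have
     * to respect chunk boundaries.
     */
    static void toy_walk_chunks(const uint8_t *pkt, size_t len,
                                void (*cb)(const struct toy_sctp_chunk_hdr *))
    {
            size_t off = sizeof(struct toy_sctp_hdr);

            while (off + sizeof(struct toy_sctp_chunk_hdr) <= len) {
                    const struct toy_sctp_chunk_hdr *ch =
                            (const struct toy_sctp_chunk_hdr *)(pkt + off);
                    size_t clen = ntohs(ch->length);

                    if (clen < sizeof(*ch) || off + clen > len)
                            break;                  /* malformed packet */
                    cb(ch);
                    off += (clen + 3) & ~(size_t)3; /* chunks are padded to 4 bytes */
            }
    }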
Right now, the SCTP GSO implementation puts the packets together using SKB fragment lists. Not many drivers support receiving SKB fragment lists, so the SCTP GSO frame has to be unraveled right before it is given to the driver. The discussion went on to cover how feasible it would be to get drivers to support SKB fragment lists in order to handle this. It isn't easy, because once you have fragment lists, drivers have trouble figuring out exactly how much space in their TX ring might be necessary to send an arbitrary packet. Eric Dumazet suggested that it might be possible to encode SCTP GSO packets using paged frags, just like TCP does, and that the current implementation of SCTP GSO should be reconsidered. He made it clear that doing this would be much easier than getting drivers to accept SKB fragment lists.

Next, Xin presented various cases of ICMP handling over tunnels that are troublesome. The gist of the issue is that the current way UDP sockets are looked up for UDP-based tunnels doesn't work when processing ICMP messages. Xin feels that we can adjust the demux in certain ways so that we find the correct UDP socket in most cases. He went on to show that L2TP tunnels have a similar set of issues, although the path to solving the problem there is much less clear.

====================

BPF and the Future of Kernel Extensibility (Alexei Starovoitov)

Alexei explained that the goal of BPF is simply to let non-kernel developers safely and easily modify kernel behavior, and this means making BPF easy to use. He used various visuals (a Transformer, a bunch of children playing with Lego) to show the difference between BPF in the past (cool and powerful, but only in limited forms) and BPF in the present (a giant Lego set with no instruction manual). However, even without this instruction manual, people have built lots of interesting and powerful things. He went on to describe why people learn BPF, and why the kernel needs BPF in order to stay relevant in this quickly changing world.

The BPF verifier, as always, was a major topic. Programs up to this point have been relatively simple for the verifier to work with, but this is going to change, especially because of BPF-to-BPF calls. There is also work afoot to track pointer lifetimes in the verifier; this kind of analysis will allow things like lock/unlock and malloc/free patterns to be verified in BPF programs. Another major area of work is supporting bounded loops. Other things to look forward to are global variables, local storage, indirect calls which are statically verified and patched, libraries, and dynamic linking. Alexei would like to see the verifier move away from its brute-force "walk all instructions" approach; he feels that this is fundamentally necessary in order to handle huge programs efficiently.

Support for BTF (BPF Type Format) was merged recently, and more is coming. It describes all of the types in a given program. In the future, BTF can integrate with the verifier for safety checks. Also, LLVM could emit code that references structure members symbolically, and the BPF loader could fix them up using the BTF information.

Facebook recently fully open sourced its XDP-based load balancer, called katran. Alexei explained the key aspects of its design which were enabled by XDP. He would also like to see a move to more file-descriptor-based APIs for BPF objects; these solve all of the weird concurrency and lifetime issues we see these days.

Another upcoming feature is cgroup local storage. Variables are annotated with "__cgroup", just as one can use "__thread" in userspace; such a variable becomes local to a cgroup and can be used accordingly.

Drivers are getting more and more complex, so it is proposed to make the memory-management aspect of adding XDP support as common as possible. Alexei feels that this complexity is part of why XDP uptake in drivers has been so slow.

Finally, the topic of firmware came up. The firmware running on today's hardware is a different beast than what we had in the past. It used to be just a shim over the hardware, but now it is something much more; in fact, at many shops the firmware team is several times larger than the kernel driver team. Such a huge piece of unpublished code is just a security problem waiting to happen. It is also full of secret features and, even more unfortunately, bugs. Therefore, firmware needs to become more open, part of the driver, and part of the kernel git tree before this problem becomes even worse. It will be a huge uphill challenge to achieve this, however.
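To close, here is a rough idea of what the restricted C discussed in the verifier sections looks like once BPF-to-BPF calls are available. This is an illustrative sketch intended for clang's BPF target, not code from any of the projects mentioned above.

    #include <linux/bpf.h>

    /* Section annotation the BPF loaders key off of; normally provided by a
     * helper header such as the one in the kernel's selftests. */
    #define SEC(name) __attribute__((section(name), used))

    /*
     * A BPF-to-BPF call: marked noinline so it stays a real call that the
     * verifier must follow, rather than being inlined away by clang.
     */
    static __attribute__((noinline)) int pick_verdict(struct xdp_md *ctx)
    {
            void *data = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;

            /* Drop frames too short to hold an Ethernet header. */
            if (data + 14 > data_end)
                    return XDP_DROP;
            return XDP_PASS;
    }

    SEC("xdp")
    int xdp_example(struct xdp_md *ctx)
    {
            return pick_verdict(ctx);
    }

    char _license[] SEC("license") = "GPL";

The verifier has to prove that both functions terminate and that every packet access is bounds-checked, which is exactly the kind of analysis the scaling discussion above is about.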
