Your editor still remembers installing his first Ethernet adapter. Through
the expenditure of massive engineering resources, DEC was able to squeeze
this device onto a linked pair of UNIBUS boards - the better part of a
square meter of board space in total - so that a VAX system could be put
onto a modern network. Supporting 10Mb/sec was a bit of a challenge in
those days. In the intervening years, leading-edge network adapters have
sped up to 10Gb/sec - a full three orders of magnitude. Supporting them is
a challenge, though for different reasons. At the 2009 Japan
Linux Symposium, Herbert Xu discussed those challenges and how Linux has
evolved to meet them.
Part of the problem is that 10G Ethernet is still Ethernet underneath.
There is value in that; it minimizes the changes required in other parts of
the system. But it's an old technology which brings some heavy baggage
with it, with the heaviest bag of all being the 1500-byte maximum transmission
unit (MTU) limit. With packet size capped at 1500 bytes (12,000 bits), a 10G network
link running at full speed will be transferring over 800,000 packets per
second. Again, that's an increase of three orders of magnitude from the
10Mb days, but CPUs have not kept pace. So the amount of CPU time
available to process a single Ethernet packet is less than it was in the
early days. Needless to say, that is putting some pressure on the
networking subsystem; the amount of CPU time required to process each
packet must be squeezed wherever possible.
(Some may quibble that, while individual CPU speeds have not kept pace, the
number of cores has grown to make up the difference. That is true, but the
focus of Herbert's talk was single-CPU performance for a couple of reasons:
any performance work must benefit uniprocessor systems, and distributing a
single adapter's work across multiple CPUs has its own challenges.)
Given the importance of per-packet overhead, one might well ask whether it
makes sense to raise the MTU. That can be done; the "jumbo frames"
mechanism can handle packets up to 9KB in size. The problem, according to
Herbert, is that "the Internet happened." Most connections of interest go
across the Internet, and those are all bound by the lowest MTU in the
entire path. Sometimes that MTU is even less than 1500 bytes.
Path MTU discovery exists to find out what that MTU is, but it does not work
reliably on the Internet; in particular, a lot of firewall setups break it by
blocking the ICMP messages it depends on. So, while jumbo frames might work
well for local networks, the
sad fact is that we're stuck with 1500 bytes on the wider Internet.
If we can't use a larger MTU, we can go for the next-best thing: pretend
that we're using a larger MTU. For a few years now Linux has supported
network adapters which perform "TCP segmentation offload," or TSO. With a
TSO-capable adapter, the kernel can prepare much larger packets (64KB, say)
for outgoing data; the adapter will then re-segment the data into smaller
packets as the data hits the wire. That cuts the kernel's per-packet
overhead by a factor of 40. TSO is well supported in Linux; for systems
which are engaged mainly in the sending of data, it's sufficient to make
a 10G link run at full speed.
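To make that concrete, here is a minimal sketch of how a driver might advertise
these capabilities to the stack. The NETIF_F_* bits are the kernel's real
feature flags, but the function and its details are hypothetical and vary from
driver to driver:

    #include <linux/netdevice.h>

    /* Hypothetical probe-time setup; TSO also requires scatter/gather
     * I/O and hardware checksumming, so those flags go together. */
    static void example_set_offload_features(struct net_device *dev)
    {
            dev->features |= NETIF_F_SG;       /* scatter/gather DMA */
            dev->features |= NETIF_F_IP_CSUM;  /* hardware TCP/IP checksums */
            dev->features |= NETIF_F_TSO;      /* accept large TCP buffers */
    }

Once those bits are set, the stack will begin handing the driver oversized TCP
buffers rather than individual 1500-byte packets.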
The kernel actually has a generic segmentation offload mechanism (called
GSO) which is not limited to TCP. It turns out that performance improves
even if the feature is emulated in the driver. But GSO only works for data
transmission, not reception. That limitation is entirely fine for broad
classes of users; sites providing content to the net, for example, send far
more data than they receive. But other sites have different workloads,
and, for them, packet reception overhead is just as important as transmission
overhead.
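For the curious, the segmentation parameters travel with the buffer itself;
the sketch below shows the real skb_shinfo() fields involved, wrapped in a
purely illustrative helper. Whichever layer eventually splits the buffer - the
adapter for TSO, the GSO code otherwise - reads the segment size from here:

    #include <linux/skbuff.h>

    /* The gso_size and gso_type fields are the kernel's real shared-info
     * members; the wrapper function is hypothetical. */
    static void example_mark_for_segmentation(struct sk_buff *skb,
                                              unsigned int mss)
    {
            skb_shinfo(skb)->gso_size = mss;           /* per-segment payload */
            skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4; /* TCP over IPv4 */
    }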
Solutions on the receive side have been a little slower in coming, and not
just because the first users were more interested in transmission
performance. Optimizing the receive side is harder because packet
reception is, in general, harder. When it is transmitting data, the kernel
is in complete control and able to throttle sending processes if
necessary. But incoming packets are entirely asynchronous events, under
somebody else's control, and the kernel just has to cope with what it gets.
Still, a solution has emerged in the form of "large receive offload" (LRO),
which takes a very similar approach: incoming packets are merged at
reception time so that the operating system sees far fewer of them. This
merging can be done either in the driver or in the hardware; even LRO
emulation in the driver has performance benefits. LRO is widely supported
by 10G drivers under Linux.
But LRO is a bit of a flawed solution, according to Herbert; the real
problem is that it "merges everything in sight." This transformation is
lossy; if there are important differences between the headers in incoming
packets, those differences will be lost. And that breaks things. If a
system is serving as a router, it really should not be changing the headers
on packets as they pass through. LRO can totally break satellite-based
connections, where some very strange header tricks are done by providers to
make the whole thing work. And bridging breaks, which is a serious
problem: most virtualization setups use a virtual network bridge between
the host and its clients. One might simply avoid using LRO in such
situations, but these also tend to be the workloads that one really wants
to optimize. Virtualized networking, in particular, is already slower; any
possible optimization in this area is much needed.
The solution is generic receive offload (GRO).
In GRO, the criteria for which packets can be merged are greatly
restricted; the MAC headers must be identical and only a few TCP or IP
headers can differ. In fact, the set of headers which can differ is
severely restricted: checksums are necessarily different, and the IP ID
field is allowed to increment. Even the TCP timestamps must be identical,
which is less of a restriction than it may seem; the timestamp is a
relatively low-resolution field, so it's not uncommon for lots of packets
to have the same timestamp. As a result of these restrictions, merged
packets can be resegmented losslessly; as an added benefit, the GSO code
can be used to perform resegmentation.
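As a rough illustration (this is not the kernel's actual GRO code), the test
being described works out to something like the following for TCP over IPv4;
everything must match except the fields that segmentation itself would have
changed:

    #include <linux/skbuff.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>
    #include <linux/if_ether.h>
    #include <linux/string.h>

    /* Illustrative only.  Two received TCP/IPv4 packets may be merged
     * when their headers are identical apart from checksums, lengths,
     * and an IP ID that advances by exactly one. */
    static bool example_gro_can_merge(const struct sk_buff *p,
                                      const struct sk_buff *skb)
    {
            const struct iphdr *iph1 = ip_hdr(p), *iph2 = ip_hdr(skb);
            const struct tcphdr *th1 = tcp_hdr(p), *th2 = tcp_hdr(skb);

            /* MAC headers must be identical */
            if (memcmp(skb_mac_header(p), skb_mac_header(skb), ETH_HLEN))
                    return false;
            /* Same IP endpoints and TCP ports */
            if (iph1->saddr != iph2->saddr || iph1->daddr != iph2->daddr ||
                th1->source != th2->source || th1->dest != th2->dest)
                    return false;
            /* The IP ID field may only increment */
            if (ntohs(iph2->id) != ntohs(iph1->id) + 1)
                    return false;
            /* TCP options, including any timestamps, must match exactly */
            if (th1->doff != th2->doff ||
                memcmp(th1 + 1, th2 + 1, (th1->doff - 5) * 4))
                    return false;
            return true;
    }

The real code is both more careful and more general, but the principle is the
same: if the merged result could not be resegmented back into the original
packets, the merge is not allowed.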
One other nice thing about GRO is that, unlike LRO, it is not limited to TCP/IPv4.
The GRO code was merged for 2.6.29, and it is supported by a number of 10G
drivers. The conversion of drivers to GRO is quite simple. The biggest
problem, perhaps, is with new drivers which are written to use the LRO API
instead. To head this off, the LRO API may eventually be removed, once the
networking developers are convinced that GRO is fully functional with no
remaining performance regressions.
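To give a sense of how small that conversion is, here is a minimal sketch of a
NAPI poll routine using GRO. napi_gro_receive() is the real entry point added
for this work and napi_complete() is the standard NAPI completion call, while
struct example_adapter and the example_* helpers are hypothetical stand-ins for
driver-specific code:

    /* Sketch of a GRO-converted NAPI poll routine; previously the
     * driver would have called netif_receive_skb() directly. */
    static int example_poll(struct napi_struct *napi, int budget)
    {
            struct example_adapter *adapter =
                    container_of(napi, struct example_adapter, napi);
            struct sk_buff *skb;
            int work_done = 0;

            while (work_done < budget &&
                   (skb = example_get_rx_packet(adapter)) != NULL) {
                    napi_gro_receive(napi, skb);  /* was netif_receive_skb(skb) */
                    work_done++;
            }

            if (work_done < budget) {
                    napi_complete(napi);
                    example_enable_rx_interrupts(adapter);
            }
            return work_done;
    }

Packets held for possible merging are pushed up the stack when the poll ends,
so the driver needs no separate timers or flushing logic of its own.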
In response to questions, Herbert said that there has not been a lot of
effort toward using LRO in 1G drivers. In general, current CPUs can keep
up with a 1G data stream without too much trouble. There might be a
benefit, though, in embedded systems which typically have slower
processors. How does the kernel decide how long to wait for incoming
packets before merging them? It turns out that there is no real need for
any special waiting code: the NAPI API already has the driver polling for
new packets occasionally and processing them in batches. GRO can simply be
performed at NAPI poll time.
The next step may be toward "generic flow-based merging"; it may also be
possible to start merging unrelated packets headed to the same destination
to make larger routing units. UDP merging is on the list of things to do.
There may even be a benefit in merging TCP ACK packets. Those packets are
small, but there are a lot of them - typically one for every two data
packets going the other direction. This technology may go in surprising
directions, but one thing is clear: the networking developers are not short
of ideas for enabling Linux to keep up with ever-faster hardware.