
Credit where it's due

Posted Aug 21, 2018 15:42 UTC (Tue) by ecree (guest, #95790)
Parent article: Batch processing of network packets

The original idea was not mine: it was Jesper Dangaard Brouer who thought of it. I 'merely' implemented it.
He suggested it in http://lists.openwall.net/netdev/2016/01/15/51, and the following thread also seems to contain some early prefiguring of XDP.



Credit where it's due

Posted Aug 22, 2018 2:43 UTC (Wed) by dgc (subscriber, #6611) (3 responses)

Is it a sign that you're getting old when you notice that the old, forgotten ways have become the shiny new again?

-Dave.

Credit where it's due

Posted Aug 22, 2018 22:02 UTC (Wed) by ejr (subscriber, #51652)

Credit where it's due

Posted Aug 27, 2018 19:20 UTC (Mon) by roblucid (guest, #48964)

Batch processing was all the rage when I first used computers!

Credit where it's due

Posted Aug 27, 2018 19:21 UTC (Mon) by roblucid (guest, #48964)

Batch processing was all the rage when I was a teen :)

Credit where it's due

Posted Aug 22, 2018 3:57 UTC (Wed) by mtaht (subscriber, #11087) (3 responses)

Normally batching anything makes me grumpy, but I like this. :)

However, I've seen a lot of code that does stupid things to batch GRO in software, in ways that actually overwhelm the cache. I'm curious whether, with this new code, folk have experimented with decreasing the (oft rather large) NAPI budget in the first place, and to what extent it was tested on devices with small i/d-caches.

Skylake is one thing, but a typical i/d-cache size on small boxes is 32K/32K.

Credit where it's due

Posted Aug 22, 2018 3:59 UTC (Wed) by mtaht (subscriber, #11087) (2 responses)

Also, I so dearly want timestamps on ingress, always. Being able to amortize one itty-bitty timestamp per draining of the RX ring would make me happy, and wouldn't cost even a fraction of the CPU you just saved.

Credit where it's due

Posted Aug 23, 2018 16:15 UTC (Thu) by ecree (guest, #95790) (1 response)

Packets should still be timestamped if they were before; see netif_receive_skb_list_internal() (and compare with netif_receive_skb_internal()).
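
For the curious, here's a rough paraphrase of the first pass in net/core/dev.c around the time this landed (v4.19); the deferred-timestamp and RPS handling are elided, so treat it as a sketch rather than the exact code:

    static void netif_receive_skb_list_internal(struct list_head *head)
    {
            struct sk_buff *skb, *next;

            list_for_each_entry_safe(skb, next, head, list) {
                    /* the same per-packet check the single-skb path does */
                    net_timestamp_check(netdev_tstamp_prequeue, skb);
                    /* ... skb_defer_rx_timestamp() handling elided ... */
            }
            /* ... RPS steering and __netif_receive_skb_list() follow ... */
    }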

Credit where it's due

Posted Aug 24, 2018 7:26 UTC (Fri) by mtaht (subscriber, #11087)

you missed the "always" part. :)

In my ideal world, packets inside the kernel, and from future devices, would have something like the following format:

|tx_timestamp|rx_timestamp|packet|some|different|hashes|skb control block for all kinds of other stuff|

The rx_timestamp is free if you have it in hw; the rx/tx timestamps would make all the codel-y work faster on tx, and also enable the timer queues VJ is talking about ( https://netdevconf.org/0x12/session.html?evolving-from-af... ) -- see also sch_etf and the igb network hw. And... selfishly, on top of "timestamp always", 3 hashes would make sch_cake fast enough for general use. The timestamps are at the front because they are "free", though they could live at the back; the hashes are at the back because you need time to calculate them, and on a cut-through switch you can't wait for them. The existing cb has some other fields I'd do always too, and make persistent through the stack.
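
Purely as an illustration of the shape I mean (the field names, widths, and hash count here are made up for the example; nothing like this struct exists in the kernel or in any NIC):

    /* Hypothetical sketch only -- not a real kernel or hardware structure. */
    #include <stdint.h>

    struct ideal_pkt_meta {
            uint64_t tx_timestamp;   /* when it left, or when it should leave */
            uint64_t rx_timestamp;   /* free if the NIC stamps it in hardware */
            /* the packet bytes themselves would sit between the timestamps
             * and the hashes, since the hashes need time to be calculated */
            uint32_t hash_a;         /* e.g. a flow hash */
            uint32_t hash_b;         /* e.g. a host hash */
            uint32_t hash_c;         /* a third hash for whatever the qdisc wants */
            /* ... followed by an skb->cb-like area for everything else ... */
    };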

I'm so totally not going to make more of this "modest proposal", as much as I think the whole skb layout could use a major revision, because it would take a forklift, time, and taste to redo in Linux, DPDK, RGMII, etc. It would make it really difficult to backport code from one format to the other. It would take years to implement in Linux (years more in BSD, OS X, Windows), for a benefit that would initially mostly accrue to 100GbE+ interfaces. A metric ton of people would have their favorite fixed-format field they'd want somewhere, so politics and gnashing of standards bodies' teeth would ensue... specialized hw offload engines would break...

Still, if I could have one sysctl to immutably rx-timestamp packets on all interfaces, always, it would be great. Packets being managed by codel could then measure the entire system from ingress to egress and *drop them*, shedding load automatically as the system as a whole (rather than a single qdisc) got more stress than it should take. Figuring out the cost of each substep through the layers would become something you could do on a per-packet basis on your own workload, rather than by blasting specialized test tools through it that *don't model real traffic* and claiming that pps really meant something...

And there'd be ponies! And speckled dancing unicorns! Harder RTT deadlines! And systems that didn't collapse under load! And poppies, poppies, poppies everywhere to roll around in, to help you sleep even better!

It's after midnight. I usually don't post anything after midnight.

...

I am happy that eBPF is making it easier to offload more stuff into smarter hardware, and I look forward to trying this new skb-list idea out next quarter on some very old, slow, tiny hardware.

