
Congestion-management subtleties

Posted Jan 25, 2011 19:49 UTC (Tue) by jthill (guest, #56558)
In reply to: LCA: Vint Cerf on re-engineering the Internet by gdt
Parent article: LCA: Vint Cerf on re-engineering the Internet

I think I might have a helpful contribution here. Long ago I spent many years fixing and building high-performance networking code in address spaces handling thousands of connections, so I hope it turns out to be worth attention. It may be something everybody already knows about, but the discussion I'm seeing makes me suspect it isn't a known and discarded idea: if it were, your reasoning shouldn't have been so easily missed.

So let me start with a slightly artificial example to get all the elements in play: on such a router with high-volume TCP endpoints of its own, the TCP buffers need to be kept separate from the routed-packet buffers because the TCP buffers are necessary only for TCP retransmit and shouldn't be allowed to clog the queue for telnet or whatnot.

No need to burden QoS for this: separate and shrink the routed-packet pool, and arrange for the routed pool to ask the TCP pool for another packet shortly before it needs one. If I have it right, that does the trick automatically. It also occurs to me that, since endpoint TCP has much more information available than any router, it should be able to prioritize that much better anyway.

To keep fairness with non-local sources, local TCP gets some simple proportion of the packets in the pool relative to the packets from other sources. Packets going to local TCP never enter the routed pool at all: it's a matter of swapping a full routed-packet buffer for an empty receive-window buffer. TCP can offer ACK packets in return right there.
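A rough sketch of the two-pool arrangement, in Python for brevity. The names (RoutedPool, local_share, refill, send_one) are made up for illustration, and the fairness rule is just one simple way to give local TCP a fixed proportion of the slots; nothing here is taken from a real router.

```python
from collections import deque

class RoutedPool:
    """Small pool of routed-packet buffers that pulls packets from its sources."""

    def __init__(self, capacity, local_share):
        self.capacity = capacity        # deliberately small
        self.local_share = local_share  # fraction of slots offered to local TCP (assumption)
        self.slots = deque()
        self.local_taken = 0
        self.total_taken = 0

    def refill(self, local_tcp, external):
        """Fill free slots, keeping local TCP near local_share of all packets taken."""
        while len(self.slots) < self.capacity and (local_tcp or external):
            # Local TCP is owed a slot if it is below its share of the running total.
            want_local = self.local_taken < self.local_share * (self.total_taken + 1)
            if local_tcp and (want_local or not external):
                self.slots.append(local_tcp.popleft())
                self.local_taken += 1
            else:
                self.slots.append(external.popleft())
            self.total_taken += 1

    def send_one(self):
        """Transmit the oldest routed packet, freeing a slot for the next refill."""
        return self.slots.popleft() if self.slots else None
```

The point of pulling rather than pushing is that the TCP retransmit buffers stay in TCP's own pool; only the packet that is about to go out ever occupies a routed slot.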

So, to the payload: even though doing this for local TCP achieves the purpose in that scenario, I don't see any intrinsic reason to do this only for TCP, or only locally.

When the opportunity and need coincide, why not do this kind of coordinated buffer management across links?

This isn't source-quench. The basic idea needs extension to handle more general cases, but start small.

Pick a leaf router, where one link reaches the vast majority of the net and virtually everything reaching it is going to use that link. Use the idle local bandwidth to make the backpressure explicit.

To put it in a way that might horrify some, why not have a congested leaf router convert its downstream links to half-duplex for the duration of the congestion? It's easy: "Ok. Go 'way now, I'm busy". "Gimme a packet". "Ok, send what you like".

Those need acks to avoid throttling in error, but again those are sent on links that should be idle anyway. This is one-hop link management.
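To make the three messages concrete, here is a minimal sketch of the downstream side of one link. The message names (CHOKE, POLL, RELEASE), the DownstreamSender class and the ("ACK", msg) tuple are all invented for illustration; this is not an existing protocol.

```python
from collections import deque
from enum import Enum

class Msg(Enum):
    CHOKE = "go 'way now, I'm busy"
    POLL = "gimme a packet"
    RELEASE = "ok, send what you like"

class DownstreamSender:
    """Downstream end of a link that the congested router may switch to polled mode."""

    def __init__(self):
        self.queue = deque()
        self.choked = False

    def enqueue(self, pkt):
        """Return the packets to transmit now: the packet itself unless choked."""
        if self.choked:
            self.queue.append(pkt)
            return []
        return [pkt]

    def handle(self, msg):
        """Apply one control message; return (ack, packets to transmit now)."""
        if msg is Msg.CHOKE:
            self.choked = True
            out = []
        elif msg is Msg.POLL:
            # One packet per poll; the sender picks which (here simply the oldest).
            out = [self.queue.popleft()] if self.queue else []
        else:  # Msg.RELEASE
            self.choked = False
            out = list(self.queue)
            self.queue.clear()
        return ("ACK", msg), out
```

Every control message is answered with an ack, so the router can repeat one that got lost instead of leaving the link throttled by mistake.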

Plainly, when life starts to get interesting (i.e. when more than one of the router's links is likely to get congested), the poll should explicitly list congested routing entries. An overbroad (or ignored) choke list would slow some things down unnecessarily, but if the choke is honored at all (and the router sanely reserves one or two packets for each link no matter what) the congestion gets pushed directly to its source.
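A sketch of what honoring a choke list could look like on the sending side, assuming the list arrives as CIDR prefixes and packets are (destination, payload) pairs. Both are assumptions of mine, as is the function name; only the standard ipaddress module is real.

```python
import ipaddress

def split_by_choke_list(packets, choke_list):
    """Split (dst, payload) packets into (held, forward) given choked prefixes."""
    choked = [ipaddress.ip_network(prefix) for prefix in choke_list]
    held, forward = [], []
    for dst, payload in packets:
        addr = ipaddress.ip_address(dst)
        if any(addr in net for net in choked):
            held.append((dst, payload))     # wait for a poll or a release
        else:
            forward.append((dst, payload))  # uncongested route, send as usual
    return held, forward
```

An overbroad list just moves more packets into held than necessary, and an ignored one is the same as an empty list, which is why the router still needs its small per-link reserve.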

When you get to nodes where combinations of inbound links are saturating combinations of outbound links, I think this starts running out of steam, but as I understand it those aren't the nodes where we're seeing this problem in the first place.

So, thanks for reading. If it's a good idea I don't feel at all possessive about it, and either way I'd appreciate feedback.



Congestion-management subtleties

Posted Jan 25, 2011 23:10 UTC (Tue) by ajb (subscriber, #9694)

Sounds vaguely like backward congestion notification, which is now in data-center grade ethernet: www.ieee802.org/3/ar/public/0505/bergamasco_1_0505.pdf

Congestion-management subtleties

Posted Jan 26, 2011 2:05 UTC (Wed) by jthill (guest, #56558)

Yeah, that's the idea, only IP-aware and not so unselective. As I said, I think this scheme starts running out of steam as you get towards the core. Cisco's saying theirs starts there, where the routers are already too busy to think. If more than a few simple address ranges were included in this scheme's backpressure notifications I think it'd start getting ugly. For intranet border routers, for example, it occurs to me that greenlight ranges (send me what you want for these guys, you hang on to traffic for anyplace else) would be simpler.
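The greenlight variant is just the inverse test: anything not explicitly allowed is held. A tiny sketch, with a hypothetical may_send helper and ranges given as CIDR prefixes (my assumption):

```python
import ipaddress

def may_send(dst, greenlight):
    """True if dst is inside a greenlit range; everything else is held by the sender."""
    addr = ipaddress.ip_address(dst)
    return any(addr in ipaddress.ip_network(prefix) for prefix in greenlight)
```

For a border router the greenlight list would typically be the handful of intranet prefixes, which keeps the notification short no matter how much of the outside world is congested.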

Fwiw, it seems to me from reading his links that gmaxwell has it right about the seeming contradiction between the results Gettys and Villamizar/Song get: if I recall prices then, the idea of grossly overprovisioning buffers would have seemed insane in 1994. Plus the market was more technical, so there'd be little reason for the earlier study to examine it.

Some things I like about this notion (I am, of course, completely objective on the subject) are that

  • Unlike Cisco's BCN, the sender can still forward e.g. network control packets (in addition to packets destined for outbound uncongested links, because it knows what those are).
  • Like Cisco's scheme it's incremental. If the congestion is local only, i.e. if the aggregate buffering in the route back to the source is sufficient, the sending TCPs never see it at all. When they do hear of it, they hear via backpressure from their local router:
    • the pipe is never unnecessarily drained
    • they know why they're not getting ACKs, so if the jam lasts they don't have to retransmit
    • and they can prioritize what to send when polled using every bit of local state
There's more, but the rest are even more obvious than these.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds