Meh. I can't believe how much I trip myself up on this. Who knows, maybe I've been getting it wrong all this time, too....
Let me try again.
The total amount of buffering in the txqueue + tx ring buffer portion of the stack needs to be not much larger than the BDP to the *next hop*.
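To put rough numbers on that, here is a back-of-envelope sketch. The link rate, next-hop RTT, and packet size are made-up illustration values, and the function names are mine; the 1000-packet figure is the usual Linux txqueuelen default for Ethernet.

```python
def bdp_bytes(rate_bps: int, rtt_us: int) -> int:
    """Bandwidth-delay product in bytes: rate (bits/s) times RTT (microseconds)."""
    return rate_bps * rtt_us // 8_000_000


def bdp_packets(rate_bps: int, rtt_us: int, packet_bytes: int = 1500) -> int:
    """BDP expressed in full-size packets, rounded up."""
    return -(-bdp_bytes(rate_bps, rtt_us) // packet_bytes)


def queue_drain_ms(packets: int, packet_bytes: int, rate_bps: int) -> float:
    """Time in milliseconds to drain a queue of `packets` at the link rate."""
    return packets * packet_bytes * 8 * 1000 / rate_bps


if __name__ == "__main__":
    rate = 100_000_000   # assumed: 100 Mbit/s link
    next_hop_rtt = 1000  # assumed: 1 ms RTT to the next hop, in microseconds

    print(bdp_bytes(rate, next_hop_rtt))     # 12500 bytes
    print(bdp_packets(rate, next_hop_rtt))   # 9 packets
    print(queue_drain_ms(1000, 1500, rate))  # 120.0 ms for a full 1000-packet txqueue
```

Under those assumptions the next-hop BDP is about 9 packets, while a full 1000-packet txqueue holds roughly 120 ms worth of data.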
BQL appears to solve the tx ring portion of the problem thoroughly, at least on Ethernet.
Figuring out how many streams can coexist in the txqueuelen set of buffers above the tx ring, and when to start dropping packets there, is an AQM problem about which much debate exists. The next-hop BDP*sqrt(flows) thing is, well, debatable, but getting the effective txqueue length down to where that portion of the AQM debate can take place again seems doable with the time-in-queue idea floating about.
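For what I mean by time in queue, here is a toy sketch, not any particular kernel implementation or proposed AQM: stamp each packet on enqueue and, at dequeue, throw away anything that has waited longer than some limit. The class name, the 5 ms limit in the usage below, and the injectable clock are all hypothetical, picked only for illustration.

```python
from collections import deque


class SojournQueue:
    """Toy FIFO that drops packets which have sat in the queue too long."""

    def __init__(self, max_sojourn_ms, clock):
        self.max_sojourn_ms = max_sojourn_ms  # hypothetical limit, in milliseconds
        self.clock = clock                    # callable returning the current time in ms
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, packet):
        # Record the arrival time alongside the packet.
        self.queue.append((self.clock(), packet))

    def dequeue(self):
        # Skip (drop) stale packets until a fresh one is found or the queue empties.
        while self.queue:
            arrived, packet = self.queue.popleft()
            if self.clock() - arrived <= self.max_sojourn_ms:
                return packet
            self.dropped += 1
        return None


if __name__ == "__main__":
    now = [0]  # fake clock in milliseconds, so the example is deterministic
    q = SojournQueue(max_sojourn_ms=5, clock=lambda: now[0])
    q.enqueue("a")     # arrives at t=0
    now[0] = 4
    q.enqueue("b")     # arrives at t=4
    now[0] = 6
    print(q.dequeue())  # b  ("a" waited 6 ms and is dropped)
    print(q.dropped)    # 1
```

The point of bounding delay rather than packet count is that the limit means the same thing whatever the link rate happens to be.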
The total amount of buffering in TCP's algorithms, which do their own buffering internally, that is required to handle the end-to-end queue depends on the end-to-end BDP. I'm going to flat out wave hands and say that AQM can help there a lot, and that it typically has very 'interesting' problems with streams of different RTTs.