Your comments make it sound like you don't understand how Linux plugging works - though maybe I misunderstand you...
There is no unplugging timer. There was a timer in the previous code that would unplug after 3ms, but that was mainly a 'just in case' measure; normally the unplug would happen much earlier. If it were only the timer that triggered an unplug, then you would be right - performance would be terrible.
Also, plugging was only relevant when the device was idle. If the queue was not empty, it would not get plugged. So with long, nearly continuous writes, the queue would only be plugged once, at the very beginning.
The purpose of plugging is primarily about latency, not bandwidth. If bandwidth is the issue, your queue will not be empty, so the time it takes for a request to reach the front of the queue is long enough for any other related requests to be seen and merged - so there is no point in plugging (at least not old-style plugging; the new style still brings benefits).
The new plugging code is a bit different. Rather than only plugging idle devices, it doesn't really plug devices at all; it plugs request submitters instead. (So the new code is a lot closer to the page cache than the old code.)
So when a process submits a request, it gets queued in the process, not in the device. Once a process has submitted all that it wants to submit (or whenever it schedules - to avoid deadlocks), the requests queued in the process are released.
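
To make that concrete, here is a minimal sketch of a submitter batching bios under the new scheme. blk_start_plug() and blk_finish_plug() are the real interface; the submit_batch() wrapper is made up for illustration, and submit_bio()'s exact signature has varied between kernel versions:

    #include <linux/blkdev.h>
    #include <linux/bio.h>

    /* Illustrative only: collect several bios on one on-stack plug.
     * While plugged, submitted requests sit on a per-task list where
     * they can be sorted and merged; if the task sleeps, the list is
     * flushed automatically to avoid deadlocks. */
    static void submit_batch(struct bio **bios, int count)
    {
            struct blk_plug plug;
            int i;

            blk_start_plug(&plug);          /* start queuing in this task */
            for (i = 0; i < count; i++)
                    submit_bio(bios[i]);    /* held on the plug list */
            blk_finish_plug(&plug);         /* release the whole batch */
    }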
So you get a similar effect on the starting transient - a large number of pre-sorted requests gets handled all at once. However, there are other advantages as well, in terms of lock contention: the device's queue lock needs to be taken once per batch rather than once per request.
I think it is very hard to imagine plugging slowing down even a very fast, low-latency device. The page cache has a bunch of pages that it wants to perform IO on; it assembles them into a big chunk and sends them all to the device. It is true that the first request won't get there quite as soon, but unless the device is as fast as memory, it will still get there fast enough that you probably cannot measure the difference.
And if a device is as fast as memory, then it probably shouldn't be under the request_queue code at all - a driver more like the umem.c driver might be appropriate. It just takes bios directly and turns them into DMA descriptors. But even that uses plugging so it can start a chain of DMAs at once rather than just one.
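
For reference, a driver in that style hooks in below the request_queue machinery by supplying its own make_request function. This is only a sketch: blk_alloc_queue() and blk_queue_make_request() are the real registration calls (their signatures have also changed between kernel versions), while the mydev names and the DMA helper are hypothetical:

    /* Hypothetical umem.c-style bio-based driver.  Only
     * blk_alloc_queue()/blk_queue_make_request() are real API;
     * struct mydev and mydev_queue_dma() are made up. */
    static int mydev_make_request(struct request_queue *q, struct bio *bio)
    {
            struct mydev *dev = q->queuedata;

            mydev_queue_dma(dev, bio);  /* bio -> DMA descriptors directly */
            return 0;                   /* no request_queue involvement */
    }

    ...
        dev->queue = blk_alloc_queue(GFP_KERNEL);
        dev->queue->queuedata = dev;
        blk_queue_make_request(dev->queue, mydev_make_request);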
I've probably rambled a bit there, but I really think you aren't being fair to plugging. It may not be perfect, but I would need a lot more evidence before I could see any justification for calling it a gross mistake.