GNU C Library version 2.39

Posted Feb 7, 2024 11:41 UTC (Wed) by paulj (subscriber, #341)
In reply to: GNU C Library version 2.39 by meven-collabora
Parent article: GNU C Library version 2.39

The reference to "Workload-aware writeback pacing" as a use-case is interesting. In the network world we call this "congestion control". Workload-pacing for storage writes, thankfully, shouldn't have to worry about lost writes though!



GNU C Library version 2.39

Posted Feb 7, 2024 12:34 UTC (Wed) by farnz (subscriber, #17727) [Link] (9 responses)

There's a related issue: distros don't currently set per-bdi writeback limits "sensibly" (and indeed, "sensible" is in the eye of the beholder here). If the user expects to remove a device, the per-bdi writeback limits should be set so that the kernel won't buffer more than a second or so of writeback, accepting that the consequence is that all operations on the device slow down as soon as there's a small amount of dirty data to write. For devices intended to be permanently connected, the writeback limits should be large, so that the kernel can delay writes for longer and only pays the penalty of delaying if you have a large amount of data pending writeback at shutdown time.

The kernel can't do this itself, because the policy about "permanent" or "removable" isn't known to the kernel; if you have a USB SSD attached as the main drive for a Raspberry Pi, that's "permanent", and a large writeback limit is reasonable. If you have a USB SSD plugged into that same RPi so that you can copy data to it and then unplug it to move to another location, that's "removable", and you want the writeback limit to be small.
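
For illustration, here's a rough sketch of how a boot script or udev hook might apply such a policy, keyed off the kernel's "removable" flag. It assumes the per-bdi sysfs knobs shown later in this thread (strict_limit and max_bytes); the device name and the 40 MB/s write-speed guess are just placeholders:

  # Hypothetical policy hook: budget roughly one second of writeback for
  # removable devices, leave fixed devices on the global defaults.
  dev=sdb                                  # example device, not a recommendation
  bdi=$(cat /sys/block/$dev/dev)           # the bdi name is the disk's major:minor, e.g. "8:16"
  if [ "$(cat /sys/block/$dev/removable)" = "1" ]; then
      echo 1 > /sys/class/bdi/$bdi/strict_limit
      # ~1 second of dirty data, assuming ~40 MB/s sustained writes to this device
      echo $((40 * 1024 * 1024)) > /sys/class/bdi/$bdi/max_bytes
  else
      # Fixed device: leave the default behaviour, governed by the global dirty limits.
      echo 0 > /sys/class/bdi/$bdi/strict_limit
  fi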

GNU C Library version 2.39

Posted Feb 7, 2024 13:04 UTC (Wed) by paulj (subscriber, #341) [Link] (6 responses)

This sounds mildly similar to the problem in the networking world of "buffer bloat". Network devices, and the transmitting host particularly, buffer up writes to the network (a.k.a. packets) with the expectation that writes will be bursty. The buffering allows the bursts to be smoothed out and the links kept busy even when the sender isn't actively writing. The problem, of course, is when the sender is /not/ bursty and has a large amount of data to send; there is now an intrinsic delay that need not be there. Plus, this buffering makes it harder for senders to accurately measure the latency - which is necessary for good workload pacing / congestion control.

Interesting parallels. ;)

GNU C Library version 2.39

Posted Feb 7, 2024 14:00 UTC (Wed) by farnz (subscriber, #17727) [Link] (5 responses)

The difference is that a networking host doesn't have any way to determine the path capacity - indeed, it can change significantly over time. A storage host does usually have a way to determine the device capacity; we can make good guesses at the number of IOPS the device can do, and at how large each IOP can be before it reduces the number of IOPS we can handle.
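
As a rough illustration, one way to get such a guess for sequential write bandwidth is simply to time a flushed write (a sketch only; the mount point and sizes are placeholders, and this says nothing about random-I/O IOPS):

  # conv=fdatasync makes dd call fdatasync() before exiting, so the reported
  # rate includes flushing to the device rather than just dirtying the page cache.
  dd if=/dev/zero of=/mnt/usb/testfile bs=1M count=256 conv=fdatasync
  # dd prints a summary such as "268435456 bytes ... copied, 6.7 s, 40 MB/s";
  # that figure is the kind of estimate a "one second of writeback" budget can be based on.
  rm /mnt/usb/testfile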

Also, we have the weirdness that for some devices, we want the buffer to be bloated, and for others, we don't; /, /home and other internal filesystems on my laptop can have a very bloated buffer, since there's no practical cost to it, but there is a potential gain from a huge buffer (turning lots of small operations into a smaller number of big operations).

GNU C Library version 2.39

Posted Feb 7, 2024 15:30 UTC (Wed) by paulj (subscriber, #341) [Link] (2 responses)

True, though the analogous entity in the storage case is more the user process - not the host and its system software. The user process doesn't have good knowledge of the storage throughput either. Just as in the networking case, it has to write, measure the time to some event, and infer the throughput from that.

One difference you're raising there is that in the storage case, you have what the networking world would call "content addressable networking". I.e., the process specifies /what/ content to read and write, thus allowing the system (a tiny distributed system, in a sense) to offer caching (including write caching) at various levels. This is something the networking world generally lacks, sadly (?). In networking the reads/writes are generally intimately tied to the location of the data. Caching is thus minimal, and we have to build very complicated systems to virtualise the location of the data /somewhat/ (within the scope of that complicated system).

Of course, the single-system storage model morphs into that same problem once it exceeds the capacity of the highly-cohesive, coherent single-system model, as the answer will involve introducing much less coherent technologies, i.e. networking. ;)

GNU C Library version 2.39

Posted Feb 7, 2024 16:32 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

The other important difference is that in the storage case, we rarely care about the effects of congestion on shared links. Either we can afford to wait when the in-memory kernel cache flushes out to the device (the internal drive case), or we want to keep the cache small compared to the speed of the device so that it's quick to flush when needed (the removable drive case), and we very rarely have links slower than the devices they're connecting (even under congestion).

GNU C Library version 2.39

Posted Feb 7, 2024 16:55 UTC (Wed) by paulj (subscriber, #341) [Link]

Well, you have reliable links, so you don't have to worry about overloading the links and causing loss in transmission. However, you do still want some system that is able to pace each distinct user's traffic and keep things fair between those users. In networking - at least the classic "dumb packet switching network" that the likes of TCP/IP runs on, notionally - that ends up being a distributed problem. In a coherent single system, a central scheduler can arbitrate and enforce.

But that was what my first comment was pointing at: the noted use-case of /end process/ workload pacing would start to introduce some of that functionality into the user process (which is equivalent to the "end host"). ;) Who knows where that leads in the future. ;)

Maybe at some point the coherent single-system becomes more of an explicit distributed system. (It already is a distributed system, but hides it very well; HyperTransport, PCIe, etc., are all at least packet based, but non-blocking and [very near] perfectly reliable - making the presentation of a very coherent system much easier than with networking).

GNU C Library version 2.39

Posted Feb 15, 2024 11:32 UTC (Thu) by tanriol (guest, #131322) [Link] (1 responses)

> The difference is that a networking host doesn't have any way to determine the path capacity - indeed, it can change significantly over time. A storage host does usually have a way to determine the device capacity; we can make good guesses at the number of IOPS the device can do, and at how large each IOP can be before it reduces the number of IOPS we can handle.

And then the RAM cache of the SSD fills up and the available bandwidth drops. And then the SLC cache area of the SSD fills up and it drops again.

GNU C Library version 2.39

Posted Feb 15, 2024 11:55 UTC (Thu) by Wol (subscriber, #4433) [Link]

And then you discover it's not an SSD, it's a shingled rotating rust drive ... :-)

Cheers,
Wol

GNU C Library version 2.39

Posted Feb 8, 2024 4:40 UTC (Thu) by intelfx (subscriber, #130118) [Link] (1 responses)

Last time I checked, per-bdi writeback limits were totally nonfunctional even when set manually. I think there were even some references in the kernel to auto-tuning these limits, but that behavior is nowhere to be seen either.

Could you please clarify how to make them work? :-)

GNU C Library version 2.39

Posted Feb 8, 2024 10:28 UTC (Thu) by farnz (subscriber, #17727) [Link]

Tested on 6.6.6 and 6.7.3 kernels, on a Fedora 39 system. I have a slow USB device as /dev/sda, which is therefore bdi 8:0. To restrict it to 1 second of dirty data, I need to run two commands as root:

  1. echo 1 > /sys/class/bdi/8:0/strict_limit
  2. echo $((4 * 1024 * 1024)) > /sys/class/bdi/8:0/max_bytes

Per the documentation for bdi limits, the first command tells the kernel that this device's limits must be respected even if the global background dirty limit isn't reached; my system has a global background dirty limit around 8 GiB, so without this, any limit I set below 8 GiB is ignored.

The second sets the actual limit - in this case, 1 second of writes to the device, which is 4 MiB of data. You can see why strict limits matter here, though - without strict limits, the global background dirty limit would let me reach 8 GiB of dirty data before the per-bdi limits were even checked, unless I had a lot of writes in progress to other devices. And I want the global limits to be relatively large, since my laptop has a big battery; when I build a Yocto image, I've often got many readers in parallel that need time on the NVMe drive, along with writes that can be delayed, and I'd prefer the writes to wait so that I'm blocking on CPU, not on I/O.
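
For completeness, here's roughly how to see where the global background threshold sits on a given machine (a sketch; the kernel's exact calculation works from available rather than total memory, so this only approximates it):

  # If dirty_background_bytes is non-zero it is used directly; otherwise the
  # threshold is dirty_background_ratio percent of (roughly) the available memory.
  cat /proc/sys/vm/dirty_background_bytes
  cat /proc/sys/vm/dirty_background_ratio
  grep MemAvailable /proc/meminfo
  # e.g. the default 10% ratio on a machine with ~80 GiB available would give
  # a background limit in the region of 8 GiB.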

