Reason for needing I/O wait
Posted Sep 10, 2024 19:36 UTC (Tue) by farnz (subscriber, #17727)
In reply to: Disk encryption by cloehle
Parent article: The trouble with iowait
If I've understood your descriptions properly, the point of I/O wait isn't about being CPU bound - it's about slow I/O devices (like hard drives instead of SSDs). A HDD can easily have an average seek time over 8ms, and a Haswell-era Intel CPU (a 4th-generation Core i7, for example) has (according to static struct cpuidle_state hsw_cstates[] in intel_idle.c) a C-state with 7.7ms target residency and 2.6ms exit latency (note that the numbers in the struct are in µs).
Without iowait, it's plausible that the system would determine that it can sleep for 7.7ms comfortably after a single read triggers a seek, and enter that deep sleep state; it then takes 2.6ms to wake up before it even processes the completion interrupt. With iowait, the process can contribute to "wakefulness" of the CPU, and ensure that it doesn't enter the deep C-state, and thus avoid sleeping.
Thus, I'd expect that you'd see the worst case with a CPU that has long exit latencies from the deep C state and a HDD as the I/O device - some of the time, you'll hit long exit latencies because cpuidle (correctly) predicts that you're going to sleep for a long time, and those exit latencies will hurt.
Posted Sep 11, 2024 9:43 UTC (Wed) by cloehle (subscriber, #128160)
Unfortunately, even if we did accept iowait as a good metric to base cpuidle heuristics on (I have outlined in the cover letter why it isn't), for how long would we keep selecting shallower states? Tasks can be in_iowait for seconds (i.e. aeons), and whether you suffer an 'end-to-end' throughput decrease from the exit latency (or, for cpufreq, from the low CPU frequency), and whether that decrease might be a good trade-off for the power saved, is practically impossible for the kernel to tell.
You're referring to cpuidle. In that case the long IO latency is a problem, as you said: with 8ms IO latency, the cpuidle governor will have made a correct decision when choosing a state with a target residency ("How long do we have to sleep for this state to be worth it?") of <8ms, and it doesn't care about the exit latency. (I should note that the exit latency is treated as something of a worst case, so you will probably observe significantly less than the 2.6ms in your example.)
The theoretical worst case would be an IO device whose latency is just above the target residency of the deepest state (if the IO device is slower, the cost of the exit latency diminishes relative to the IO latency).
If you are in that theoretical worst case and the exit latency does hurt you, you're much better off not relying on the governor at all, and instead either using the PM QoS API or disabling the state(s) while you're doing IO. menu (with its iowait heuristic) wasn't always doing a good job there either; see the cover letter of my patch.
Similarly to the iowait boost comment: if you're maximizing IO performance, the IO request that just completed is hopefully large enough for completions not to happen often, or it is just one of many queued requests for the IO device.