Short waits with umwait
The problem with busy waiting, of course, is that it occupies the processor with work that is even more useless than cryptocoin mining. It generates heat and uses power to no useful end. On hyperthreaded CPUs, a busy-waiting process could prevent the sibling thread from running and doing something of actual value. For all of these reasons, it would be a lot nicer to ask the CPU to simply wait for a brief period until something interesting happens.
To that end, Intel is providing three new instructions. umonitor provides an address to the CPU, informing it that the currently running application is interested in any writes to the (cache-line-sized) range of memory containing that address. A umwait instruction tells the processor to stop executing until such a write occurs; the CPU is free to go into a low-power state or switch to a hyperthreaded sibling during that time. This instruction also provides a timeout value in a pair of registers; the CPU will only wait until the timestamp counter (TSC) value exceeds that timeout. For code that is only interested in the timeout aspect, the tpause instruction will stop execution without monitoring any addresses.
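As a rough sketch of how this might look from user space (the flag variable, the wait_for_flag() helper, and the 50,000-cycle deadline are invented for illustration; the intrinsics come from the compilers' "waitpkg" support and need a CPU that actually implements these instructions):

    /* Minimal sketch: wait for another thread to set "flag" while letting
     * the CPU drop into a low-power wait state instead of spinning.
     * Build with a waitpkg-capable compiler (e.g. -mwaitpkg). */
    #include <immintrin.h>
    #include <x86intrin.h>          /* __rdtsc() */
    #include <stdint.h>

    volatile uint32_t flag;         /* written by some other thread */

    void wait_for_flag(void)
    {
        while (flag == 0) {
            _umonitor((void *)&flag);       /* arm the monitor */
            if (flag != 0)                  /* re-check to close the race */
                break;
            /* Sleep until the monitored range is written or the TSC passes
             * the deadline, whichever comes first; the kernel-set limit
             * described below caps the wait in any case. */
            _umwait(0, __rdtsc() + 50000);
        }
    }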
It's worth noting that these are unprivileged instructions; any process can execute them. As a general rule, instructions that can halt a processor (or put it into a low-power state) are not available to unprivileged code for fairly obvious reasons. In this case, these instructions have (hopefully) been rendered safe by allowing the kernel to set an upper bound on how long the umwait and tpause instructions can wait before normal execution continues. Yu's patch set makes that upper bound available to system administrators in a sysfs file:
    /sys/devices/system/cpu/umwait_control/max_time
Since the TSC is involved, this value is in processor cycles; the default is 100,000, or about 100µs on a 1GHz CPU. That value was suggested by Andy Lutomirski during a discussion on a previous version of the patch set in January.
The "C0.2" state referred to in that discussion is one of two special low-power states that the CPU can go into while waiting with one of these instructions; the other is, unsurprisingly, named C0.1. The C0.1 state is a "light" low-power state that doesn't reduce power usage that much, but which can be exited with relatively little latency. C0.2 is a deeper sleep that saves more power and takes longer to get out of.
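The choice between the two states is made by the instruction itself: bit zero of the control value passed to umwait or tpause selects C0.1 when set and allows C0.2 when clear. A hedged sketch of a pure timed pause, with the short_pause() wrapper and cycle count invented for illustration:

    #include <immintrin.h>
    #include <x86intrin.h>

    /* Pause for roughly "cycles" TSC ticks without monitoring any address.
     * A control value of 1 requests the lighter C0.1 state; 0 would also
     * allow the deeper C0.2 state (subject to the kernel-imposed limit). */
    static inline void short_pause(unsigned long long cycles)
    {
        _tpause(1, __rdtsc() + cycles);
    }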
It is conceivable that system administrators might not want to allow the system to go into C0.2 if, for example, it is handling workloads with realtime response requirements. The enable_c02 file in the same sysfs directory can be used to restrict the processor to C0.1. The default is to allow the deeper sleeps.
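Neither knob requires anything special to query; here is a minimal sketch (assuming only the two file paths given above) of how an application might check what it is allowed to do:

    #include <stdio.h>

    /* Read one of the umwait_control files described above. */
    static long read_sysfs_long(const char *path)
    {
        long val = -1;
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        long max_time = read_sysfs_long(
            "/sys/devices/system/cpu/umwait_control/max_time");
        long c02 = read_sysfs_long(
            "/sys/devices/system/cpu/umwait_control/enable_c02");
        printf("umwait cap: %ld TSC cycles, C0.2 %s\n",
               max_time, c02 > 0 ? "enabled" : "disabled");
        return 0;
    }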
In the same message linked above, Lutomirski worried about the security implications of instructions that allow a process to monitor when a range of memory is touched. As he put it, umwait "seems quite delightful as a tool to create a highish-bandwidth covert channel, and it's possibly quite nice to augment Spectre-like attacks". Exactly how useful it would be has not really been described anywhere, though doubtless there will be an academic paper on the topic in the near future. Yu did answer that these instructions can be disabled outright (with a significant performance cost), though no administrator-level knob has been provided to do that.
Meanwhile, these instructions (which should first appear in the upcoming "Tremont" processors) do appear to offer some value for specific types of workloads. Most of the comments on the patches have been addressed, with seemingly little left to fix at this point. So, most likely, there will be kernel support for the umwait family of instructions in the near future.
| Index entries for this article | |
|---|---|
| Kernel | umwait | 
Posted Jun 13, 2019 19:04 UTC (Thu) by flussence (guest, #85566) [Link] (2 responses)
Posted Jun 13, 2019 20:07 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)
Conceptually the way an OS is supposed to work is that an ethernet packet arrives, interrupts the CPU which jumps to the scheduler which jumps into the process sleeping on a socket read which reads the data which writes to a pipe which transfers control to a second process sleeping on a read from the pipe... It doesn't get more event oriented than that. 
What makes this type of instruction different is that it's waiting on memory addresses. But doing this in a generic way--being able to detect changes on any [virtual] memory address--is actually quite expensive to do in the hardware. It's why we don't have fully general LL/SC for proper software transactional memory. You'd have to add an extra bit, at least, to every byte- or cacheline-sized block of memory in the system to track reads/writes. So instead you get interfaces that look general but really have some clever hackery in the microcode which suspiciously looks like the kind of solution you'd usually implement in the kernel, with the downside being that microcode (the new lowest-level software layer) is inaccessible and cannot be extended. 
Ultimately what this is all about is being able to transfer logical control to different threads of execution. Blocking IPC was the OS-level abstraction that made this simple and intuitive. But things got more complicated and it wasn't as convenient and performant as it used to be, or at least was perceived that way. Some of the alternatives did a better job at abstracting control transfer than others. 
 
     
    
Posted Jun 15, 2019 8:09 UTC (Sat) by cpitrat (subscriber, #116459) [Link]
We need eBPF for micro-code! 
     
Posted Jun 14, 2019 1:29 UTC (Fri) by Fowl (subscriber, #65667) [Link] (6 responses)
Posted Jun 14, 2019 2:55 UTC (Fri) by evanp (guest, #50543) [Link] (5 responses)
Posted Jun 14, 2019 8:59 UTC (Fri) by Tov (subscriber, #61080) [Link] (4 responses)
I still remember in horror how a number of ISA bus reads were used for small delays, as they were guaranteed to be some amount of slow. :-/ ... Heh! I even found a stackoverflow answer describing that practice:
https://stackoverflow.com/questions/6793899/what-does-the...
Posted Jun 14, 2019 12:55 UTC (Fri) by nilsmeyer (guest, #122604) [Link] (3 responses)
they couldn't use coffee break since the term break is ambiguous ;) 
     
    
Posted Jun 14, 2019 15:49 UTC (Fri) by dskoll (subscriber, #1630) [Link] (2 responses)
Posted Jun 15, 2019 4:22 UTC (Sat) by ncm (guest, #165) [Link] (2 responses)
I assume that, instead of hammering the bus as one would in a spin loop, the relevant cache line is just primed to watch for an invalidation notification from the bus, and let the hyperthread proceed. So, the sleep is very like a cache miss stall, and the wake very like delivery of the missed line. 
It looks to me as if the main desirable result of using this instruction (vs. a spin) is that other threads that have a productive use for the memory bus will not be driven off of it?
     
    
Posted Jun 16, 2019 12:01 UTC (Sun) by caliloo (subscriber, #50055) [Link] (1 responses)
Posted Jun 17, 2019 2:48 UTC (Mon) by ncm (guest, #165) [Link]
Since Haswell, Intel already does fusion of two ALU instructions and two branches to one cycle -- on a good day, anyway; when I last checked, Clang was very bad at putting instructions in the right order to get this. 
     
Posted Jun 17, 2019 0:54 UTC (Mon) by xywang (guest, #108121) [Link] (2 responses)
     
    
Posted Jun 17, 2019 3:01 UTC (Mon) by ncm (guest, #165) [Link]
But probably you would run this on an isolcpu, with nohz, and hope never to get scheduled out. 
It appears I have not yet discovered a formula that guarantees no schedule breaks, ever. I would welcome enlightenment. 
     
Posted Jun 17, 2019 4:08 UTC (Mon) by jcm (subscriber, #18262) [Link]