LWN: Comments on "Short waits with umwait"

Short waits with umwait

intgr — Mon, 17 Jun 2019 14:28:39 +0000

The main topic: Um, wait...

Short waits with umwait

jcm — Mon, 17 Jun 2019 04:08:55 +0000

I believe recent enough parts allow a small delta in the VMEXIT to be controlled by the Hypervisor. No idea if this is respected for the userspace mwait instructions, but it could be. In any case, you're coming out of the VM and when you go back in you're going to have to restart the pause. Similar for kernel/userspace. Interrupts will preempt anything like this. Otherwise you'd have a DoS opportunity.

Short waits with umwait

ncm — Mon, 17 Jun 2019 03:01:13 +0000

The kernel would need to resume the thread after the wait, as the deadline would certainly have passed. It could resume at the instruction, but that would break immediately so there would be no point -- unless sleeps longer than a scheduling interval were permitted, which seems unlikely.

But probably you would run this on an isolcpu, with nohz, and hope never to get scheduled out.

It appears I have not yet discovered a formula that guarantees no schedule breaks, ever. I would welcome enlightenment.

Short waits with umwait

ncm — Mon, 17 Jun 2019 02:48:19 +0000

It occurs to me that there is really no need for this instruction -- micro-op fusion ought to recognize a busy-wait loop, and translate it internally.

Since Haswell, Intel already does fusion of two ALU instructions and two branches to one cycle -- on a good day, anyway; when I last checked, Clang was very bad at putting instructions in the right order to get this.

Short waits with umwait

xywang — Mon, 17 Jun 2019 00:54:37 +0000

What will happen if the sleeping task has to be interrupted for a reschedule? Do these instructions temporarily ignore or delay the timer interrupt?

Short waits with umwait

caliloo — Sun, 16 Jun 2019 12:01:01 +0000

Was thinking the same thing. It would be nice to have more details on the limits of the memory range that can be specified (hopefully one cannot provide a range, virtual I suppose, that starts at 0x0 and has length 2^64), and on how the instruction behaves with DMA.

Short waits with umwait

cpitrat — Sat, 15 Jun 2019 08:13:01 +0000

Let's return to the main topic.

Short waits with umwait

cpitrat — Sat, 15 Jun 2019 08:09:56 +0000

> with the downside being that microcode is inaccessible and cannot be extended.

We need eBPF for micro-code!

Short waits with umwait

ncm — Sat, 15 Jun 2019 04:22:05 +0000

I will totally use this instruction to watch for a mapped ring buffer's head indicator to be updated after a batch of packets has been DMA'd in.

I assume that, instead of hammering the bus as one would in a spin loop, the relevant cache line is just primed to watch for an invalidation notification from the bus, and let the hyperthread proceed. So, the sleep is very like a cache miss stall, and the wake very like delivery of the missed line.

It looks to me as if the main desirable result of using this instruction (vs a spin) is that other threads that have a productive use for the memory bus will not be driven off of it?
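As a rough illustration of the ring-buffer idea above, here is a minimal C sketch. It assumes a compiler that provides the `_umonitor` and `_umwait` intrinsics (GCC or Clang with `-mwaitpkg`, which defines `__WAITPKG__`); without that flag it degrades to a plain polling loop. The names `wait_for_head`, `WAIT_TICKS` and `max_rounds` are made up for the example, and whether a DMA write actually wakes the waiter is the assumption made in the comment above, not something the sketch verifies.

```c
#include <stdatomic.h>
#include <stdint.h>

#if defined(__WAITPKG__)
#include <x86intrin.h>
#endif

/* Hypothetical per-wait budget in TSC ticks; the OS may cap it further. */
#define WAIT_TICKS 100000ull

/*
 * Wait until *head no longer equals `seen`, giving up after `max_rounds`
 * wait rounds. Returns the most recently observed value of *head.
 */
static uint32_t wait_for_head(_Atomic uint32_t *head, uint32_t seen,
                              unsigned max_rounds)
{
    for (unsigned i = 0; i < max_rounds; i++) {
        uint32_t cur = atomic_load_explicit(head, memory_order_acquire);
        if (cur != seen)
            return cur;
#if defined(__WAITPKG__)
        /* Arm the monitor on the line holding the head index. */
        _umonitor((void *)head);
        /* Re-check after arming so a store in between is not missed. */
        cur = atomic_load_explicit(head, memory_order_acquire);
        if (cur != seen)
            return cur;
        /* Control value 0 requests C0.2; the second argument is a TSC deadline. */
        _umwait(0, __rdtsc() + WAIT_TICKS);
#endif
    }
    return atomic_load_explicit(head, memory_order_acquire);
}
```

A consumer would call `wait_for_head(&ring_head, last_seen, rounds)` and process entries whenever the returned value differs from `last_seen`. The re-check between arming and waiting is the usual pattern for monitor-style instructions; the bound on rounds is there so the caller regains control even if nothing ever arrives.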

Short waits with umwait

dskoll — Fri, 14 Jun 2019 15:49:37 +0000

I'm intrigued... continue please...

Short waits with umwait

nilsmeyer — Fri, 14 Jun 2019 12:55:32 +0000

> Actually it is strange it took so long to add a TPAUSE ("tea pause"? :-) instruction.

They couldn't use "coffee break" since the term "break" is ambiguous ;)

Short waits with umwait

Tov — Fri, 14 Jun 2019 08:59:35 +0000

Actually it is strange it took so long to add a TPAUSE ("tea pause"? :-) instruction. There are many places a hardware driver needs a slight (microsecond) delay, where scheduler ticks are much too coarse and NOP loops are too unreliable (being clock speed dependent).
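For what it's worth, a user-space version of such a short delay might look like the sketch below. It assumes the `_tpause` intrinsic (GCC or Clang with `-mwaitpkg`) and a TSC frequency `tsc_hz` that the caller has obtained somehow; both `delay_ticks` and `short_delay` are hypothetical helper names. Without `-mwaitpkg` it falls back to polling the C11 `timespec_get` clock, which is only a stand-in and not a monotonic clock.

```c
#include <stdint.h>
#include <time.h>

#if defined(__WAITPKG__)
#include <x86intrin.h>
#endif

/* Convert nanoseconds to TSC ticks; tsc_hz is supplied by the caller. */
static uint64_t delay_ticks(uint64_t ns, uint64_t tsc_hz)
{
    /* Handle whole seconds separately to keep the product small. */
    return (ns / 1000000000ull) * tsc_hz
         + (ns % 1000000000ull) * tsc_hz / 1000000000ull;
}

/* Delay for roughly `ns` nanoseconds. */
static void short_delay(uint64_t ns, uint64_t tsc_hz)
{
#if defined(__WAITPKG__)
    uint64_t deadline = __rdtsc() + delay_ticks(ns, tsc_hz);
    /* tpause may return early (OS-imposed cap, interrupt), so loop. */
    while (__rdtsc() < deadline)
        _tpause(0, deadline);   /* control value 0 requests C0.2 */
#else
    (void)tsc_hz;
    struct timespec start, now;
    int64_t elapsed;
    timespec_get(&start, TIME_UTC);
    do {
        /* Wall-clock polling: fine for a sketch, wrong if the clock is stepped. */
        timespec_get(&now, TIME_UTC);
        elapsed = (int64_t)(now.tv_sec - start.tv_sec) * 1000000000
                + (now.tv_nsec - start.tv_nsec);
    } while (elapsed < (int64_t)ns);
#endif
}
```

Because the deadline is expressed in TSC ticks rather than loop iterations, the delay does not depend on how fast the core happens to be clocked, which is the property NOP loops lack.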

I still remember in horror how a number of ISA bus reads were used for small delays, as they were guaranteed to be some amount of slow. :-/ ... Heh! I even found a stackoverflow answer describing that practice:
https://stackoverflow.com/questions/6793899/what-does-the...

Short waits with umwait

evanp — Fri, 14 Jun 2019 02:55:48 +0000

The kernel-only versions (monitor/mwait, without the 'u') have been around since SSE3, though tpause is new....

Short waits with umwait

Fowl — Fri, 14 Jun 2019 01:29:04 +0000

Interesting that userspace seems to get this before the kernel itself. Surely spinwaits are used more in the kernel?

Short waits with umwait

wahern — Thu, 13 Jun 2019 20:07:30 +0000

Does it shy away from it? I think it's just been abstracted away.

Conceptually the way an OS is supposed to work is that an ethernet packet arrives, interrupts the CPU which jumps to the scheduler which jumps into the process sleeping on a socket read which reads the data which writes to a pipe which transfers control to a second process sleeping on a read from the pipe... It doesn't get more event oriented than that.

What makes this type of instruction different is that it's waiting on memory addresses. But doing this in a generic way--being able to detect changes on any [virtual] memory address--is actually quite expensive to do in the hardware. It's why we don't have fully general LL/SC for proper software transactional memory. You'd have to add an extra bit, at least, to every byte- or cacheline-sized block of memory in the system to track reads/writes. So instead you get interfaces that look general but really have some clever hackery in the microcode which suspiciously looks like the kind of solution you'd usually implement in the kernel, with the downside being that microcode (the new lowest-level software layer) is inaccessible and cannot be extended.

Ultimately what this is all about is being able to transfer logical control to different threads of execution. Blocking IPC was the OS-level abstraction that made this simple and intuitive. But things got more complicated and it wasn't as convenient and performant as it used to be, or at least was perceived that way. Some of the alternatives did a better job at abstracting control transfer than others.

Short waits with umwait

flussence — Thu, 13 Jun 2019 19:04:53 +0000

It's somewhat weird to me that hardware shies away from this kind of event-driven design; after all, they've been on the “performance per watt” drive for over a decade now. But I guess there's some level of paranoia about the security implications of timing attacks now, especially within Intel.