Short waits with umwait
The problem with busy waiting, of course, is that it occupies the processor with work that is even more useless than cryptocoin mining. It generates heat and uses power to no useful end. On hyperthreaded CPUs, a busy-waiting process could prevent the sibling thread from running and doing something of actual value. For all of these reasons, it would be a lot nicer to ask the CPU to simply wait for a brief period until something interesting happens.
To that end, Intel is providing three new instructions. umonitor provides an address to the CPU, informing it that the currently running application is interested in any writes to the (cache-line-sized) range of memory containing that address. A umwait instruction tells the processor to stop executing until such a write occurs; the CPU is free to go into a low-power state or switch to a hyperthreaded sibling during that time. This instruction also provides a timeout value in a pair of registers; the CPU will only wait until the timestamp counter (TSC) value exceeds that timeout. For code that is only interested in the timeout aspect, the tpause instruction will stop execution without monitoring any addresses.
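As a rough sketch of how this might look from user space (the flag variable, the wait_for_flag() helper, and the 50,000-cycle deadline are invented for illustration; the intrinsics come from the compilers' "waitpkg" support and need a CPU that actually implements these instructions):

    /* Minimal sketch: wait for another thread to set "flag" while letting
     * the CPU drop into a low-power wait state instead of spinning.
     * Build with a waitpkg-capable compiler (e.g. -mwaitpkg). */
    #include <immintrin.h>
    #include <x86intrin.h>          /* __rdtsc() */
    #include <stdint.h>

    volatile uint32_t flag;         /* written by some other thread */

    void wait_for_flag(void)
    {
        while (flag == 0) {
            _umonitor((void *)&flag);       /* arm the monitor */
            if (flag != 0)                  /* re-check to close the race */
                break;
            /* Sleep until the monitored range is written or the TSC passes
             * the deadline, whichever comes first; the kernel-set limit
             * described below caps the wait in any case. */
            _umwait(0, __rdtsc() + 50000);
        }
    }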
It's worth noting that these are unprivileged instructions; any process can execute them. As a general rule, instructions that can halt a processor (or put it into a low-power state) are not available to unprivileged code for fairly obvious reasons. In this case, these instructions have (hopefully) been rendered safe by allowing the kernel to set an upper bound on how long the umwait and tpause instructions can wait before normal execution continues. Yu's patch set makes that upper bound available to system administrators in a sysfs file:
    /sys/devices/system/cpu/umwait_control/max_time
Since the TSC is involved, this value is in processor cycles; the default is 100,000, or about 100µs on a 1GHz CPU. That value was suggested by Andy Lutomirski during a discussion on a previous version of the patch set in January.
The "C0.2" state referred to in that discussion is one of two special low-power states that the CPU can go into while waiting with one of these instructions; the other is, unsurprisingly, named C0.1. The C0.1 state is a "light" low-power state that doesn't reduce power usage that much, but which can be exited with relatively little latency. C0.2 is a deeper sleep that saves more power and takes longer to get out of.
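The choice between the two states is made by the instruction itself: bit zero of the control value passed to umwait or tpause selects C0.1 when set and allows C0.2 when clear. A hedged sketch of a pure timed pause, with the short_pause() wrapper and cycle count invented for illustration:

    #include <immintrin.h>
    #include <x86intrin.h>

    /* Pause for roughly "cycles" TSC ticks without monitoring any address.
     * A control value of 1 requests the lighter C0.1 state; 0 would also
     * allow the deeper C0.2 state (subject to the kernel-imposed limit). */
    static inline void short_pause(unsigned long long cycles)
    {
        _tpause(1, __rdtsc() + cycles);
    }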
It is conceivable that system administrators might not want to allow the system to go into C0.2 if, for example, it is handling workloads with realtime response requirements. The enable_c02 file in the same sysfs directory can be used to restrict the processor to C0.1. The default is to allow the deeper sleeps.
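Neither knob requires anything special to query; here is a minimal sketch (assuming only the two file paths given above) of how an application might check what it is allowed to do:

    #include <stdio.h>

    /* Read one of the umwait_control files described above. */
    static long read_sysfs_long(const char *path)
    {
        long val = -1;
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &val) != 1)
                val = -1;
            fclose(f);
        }
        return val;
    }

    int main(void)
    {
        long max_time = read_sysfs_long(
            "/sys/devices/system/cpu/umwait_control/max_time");
        long c02 = read_sysfs_long(
            "/sys/devices/system/cpu/umwait_control/enable_c02");
        printf("umwait cap: %ld TSC cycles, C0.2 %s\n",
               max_time, c02 > 0 ? "enabled" : "disabled");
        return 0;
    }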
In the same message linked above, Lutomirski worried about the security implications of instructions that allow a process to monitor when a range of memory is touched. As he put it, umwait "seems quite delightful as a tool to create a highish-bandwidth covert channel, and it's possibly quite nice to augment Spectre-like attacks". Exactly how useful it would be has not really been described anywhere, though doubtless there will be an academic paper on the topic in the near future. Yu did answer that these instructions can be disabled outright (with a significant performance cost), though no administrator-level knob has been provided to do that.
Meanwhile, these instructions (which should first appear in the upcoming "Tremont" processors) do appear to offer some value for specific types of workloads. Most of the comments on the patches have been addressed, with seemingly little left to fix at this point. So, most likely, there will be kernel support for the umwait family of instructions in the near future.
| Index entries for this article | |
|---|---|
| Kernel | umwait | 
Posted Jun 13, 2019 19:04 UTC (Thu) by flussence (guest, #85566) [Link] (2 responses)
Posted Jun 13, 2019 20:07 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)
Conceptually the way an OS is supposed to work is that an ethernet packet arrives, interrupts the CPU which jumps to the scheduler which jumps into the process sleeping on a socket read which reads the data which writes to a pipe which transfers control to a second process sleeping on a read from the pipe... It doesn't get more event oriented than that. 
What makes this type of instruction different is that it's waiting on memory addresses. But doing this in a generic way--being able to detect changes on any [virtual] memory address--is actually quite expensive to do in the hardware. It's why we don't have fully general LL/SC for proper software transactional memory. You'd have to add an extra bit, at least, to every byte- or cacheline-sized block of memory in the system to track reads/writes. So instead you get interfaces that look general but really have some clever hackery in the microcode which suspiciously looks like the kind of solution you'd usually implement in the kernel, with the downside being that microcode (the new lowest-level software layer) is inaccessible and cannot be extended. 
Ultimately what this is all about is being able to transfer logical control to different threads of execution. Blocking IPC was the OS-level abstraction that made this simple and intuitive. But things got more complicated and it wasn't as convenient and performant as it used to be, or at least was perceived that way. Some of the alternatives did a better job at abstracting control transfer than others. 
 
     
    
Posted Jun 15, 2019 8:09 UTC (Sat) by cpitrat (subscriber, #116459) [Link]
We need eBPF for micro-code! 
     
Posted Jun 14, 2019 1:29 UTC (Fri) by Fowl (subscriber, #65667) [Link] (6 responses)
Posted Jun 14, 2019 2:55 UTC (Fri) by evanp (guest, #50543) [Link] (5 responses)
Posted Jun 14, 2019 8:59 UTC (Fri) by Tov (subscriber, #61080) [Link] (4 responses)
I still remember in horror how a number of ISA bus reads were used for small delays, as they were guaranteed to be some amount of slow. :-/ ... Heh! I even found a stackoverflow answer describing that practice:
https://stackoverflow.com/questions/6793899/what-does-the...
Posted Jun 14, 2019 12:55 UTC (Fri) by nilsmeyer (guest, #122604) [Link] (3 responses)
they couldn't use coffee break since the term break is ambiguous ;) 
     
    
Posted Jun 14, 2019 15:49 UTC (Fri) by dskoll (subscriber, #1630) [Link] (2 responses)
Posted Jun 15, 2019 4:22 UTC (Sat) by ncm (guest, #165) [Link] (2 responses)
I assume that, instead of hammering the bus as one would in a spin loop, the relevant cache line is just primed to watch for an invalidation notification from the bus, and let the hyperthread proceed. So, the sleep is very like a cache miss stall, and the wake very like delivery of the missed line. 
It looks to me as if the main desirable result of using this instruction (vs. a spin) is that other threads that have a productive use for the memory bus will not be driven off of it?
     
    
Posted Jun 16, 2019 12:01 UTC (Sun) by caliloo (subscriber, #50055) [Link] (1 responses)
Posted Jun 17, 2019 2:48 UTC (Mon) by ncm (guest, #165) [Link]
Since Haswell, Intel already does fusion of two ALU instructions and two branches to one cycle -- on a good day, anyway; when I last checked, Clang was very bad at putting instructions in the right order to get this. 
     
Posted Jun 17, 2019 0:54 UTC (Mon) by xywang (guest, #108121) [Link] (2 responses)
     
    
Posted Jun 17, 2019 3:01 UTC (Mon) by ncm (guest, #165) [Link]
But probably you would run this on an isolcpu, with nohz, and hope never to get scheduled out. 
It appears I have not yet discovered a formula that guarantees no schedule breaks, ever. I would welcome enlightenment. 
     
Posted Jun 17, 2019 4:08 UTC (Mon) by jcm (subscriber, #18262) [Link]