
Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 4:02 UTC (Wed) by vomlehn (guest, #45588)
Parent article: Facebook and the kernel

Adaptive spinlocks in user space are a doable thing; a friend of mine and I did some work on this fifteen years ago. I'm pretty sure I'm forgetting some details, but the approach was something like this:

The question we looked at was: how long do you spin before you give up and go into the kernel? Our short answer was to assume that spinlocks are only held for a short while, so the biggest thing you want to know is whether the task holding the lock is currently running. A process using these locks does two things:

  • It registers a section of memory to hold status of other tasks with which locks might be shared. This list is, of necessity, dynamic.
  • When it acquires the lock, it stores its PID in the lock. As I recall, we trusted processes to be honest in reporting the PID, but we may have taken a somewhat different approach.

If the process acquires the lock on the first try, all is well. If not, you look at the PID stored in the lock and index into the task status memory (actually, you probably don't use the PID per se, since PIDs can be pretty sparse). If the task is running, you loop again. If it isn't, you do a system call to wait on the spinlock.
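
To make that concrete, a minimal sketch of the acquire path might look something like this; the names here (the task_status array, sys_spinlock_wait(), and so on) are hypothetical stand-ins, not what we actually wrote:

    /* Hypothetical sketch of the acquire path described above. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define LOCK_FREE 0u

    /* Kernel-published status array, one entry per task; nonzero means
     * "running".  How the mapping is set up is not shown here. */
    extern volatile const uint8_t *task_status;

    /* Hypothetical syscall wrapper: block until the lock is handed over. */
    extern void sys_spinlock_wait(atomic_uint *lock);

    void adaptive_lock(atomic_uint *lock, uint32_t my_id)
    {
        for (;;) {
            unsigned int holder = LOCK_FREE;

            /* Fast path: store our own ID into the free lock word. */
            if (atomic_compare_exchange_weak(lock, &holder, my_id))
                return;

            /* Contended: 'holder' now carries the owner's ID. */
            if (holder != LOCK_FREE && task_status[holder])
                continue;                 /* owner is running: keep spinning */

            if (holder != LOCK_FREE)
                sys_spinlock_wait(lock);  /* owner not running: sleep in kernel */
        }
    }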

When a task fails to acquire a lock and enters the kernel, the kernel makes a note that there is a task waiting for the lock. When the lock holder is ready to release the lock, it checks to see if something is waiting. If not, it zeros out the lock and goes on. If a task is waiting, it makes a system call to release the lock.
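
The release side is equally small. One plausible way for the kernel to "make a note" that somebody is waiting is a bit it sets in the lock word itself, which keeps the uncontended release free of system calls; again, these names are invented for illustration:

    /* Hypothetical sketch of the release path. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define LOCK_FREE   0u
    #define WAITERS_BIT 0x80000000u   /* set by the kernel when a task blocks */

    /* Hypothetical syscall wrapper: hand the lock to a waiter. */
    extern void sys_spinlock_release(atomic_uint *lock);

    void adaptive_unlock(atomic_uint *lock, uint32_t my_id)
    {
        unsigned int expected = my_id;

        /* Fast path: no waiter bit set, so just zero out the lock. */
        if (atomic_compare_exchange_strong(lock, &expected, LOCK_FREE))
            return;

        /* WAITERS_BIT was set: a contender blocked in the kernel, so
         * let the kernel wake it and hand over the lock. */
        sys_spinlock_release(lock);
    }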

As you can see, this is highly optimized on the assumption that spinlock contention is low. But, if it's not, you shouldn't be using spin locks. This approach yielded some amazingly flat curves, however, even for some pretty high contention rates.

Note that checking to see whether a task holding a lock is currently running can be augmented if other information is known. You could, possibly, look and see how much longer it has until its scheduler quantum expires, or whether it is currently processing a signal, or anything else that can provide a hint as to whether the task is likely to run until it releases the lock. Likewise, metrics on how long locks are likely to be held can provide heuristics useful in deciding whether more CPU time will be spent spinning or doing a system call. And, on the latter issue, there are tradeoffs of CPU time versus latency.
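
As a toy illustration of that trade-off, the spin decision could be driven by a running estimate of lock hold times compared against an assumed cost for blocking in the kernel; all of the numbers and names here are made up:

    #include <stdint.h>

    #define SYSCALL_COST_CYCLES 4000u   /* assumed cost of blocking and waking */

    static uint64_t avg_hold_cycles;    /* exponentially weighted average */

    /* Spin only if the lock is usually released sooner than a trip into
     * the kernel would cost. */
    static inline int worth_spinning(void)
    {
        return avg_hold_cycles < SYSCALL_COST_CYCLES;
    }

    /* avg = 7/8 * avg + 1/8 * sample, updated by the lock holder. */
    static inline void record_hold_time(uint64_t cycles)
    {
        avg_hold_cycles += (cycles >> 3) - (avg_hold_cycles >> 3);
    }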

Lots and lots of things can be done here, even by the intrepid novice.



Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 12:35 UTC (Wed) by intgr (subscriber, #39733)

> so the biggest thing you want to know is whether the task currently holding the lock is currently running

So the point is to figure out if the owning PID is currently running or preempted? How do you find that out?

If you ask the kernel via /proc or some other means, you've already paid the cost of a syscall or few. Doesn't that mean you already incur just as much overhead as you would using normal synchronization provided by the kernel, like futexes? Or am I missing something?

Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 17:36 UTC (Wed) by vomlehn (guest, #45588)

Good question. As I said, it has been a while, but as I recall, we didn't actually use PIDs because they are sparse. Instead, we assigned each task a dense unique task ID. Then, on initialization, you request access to a read-only memory region from the kernel, which contains an array indexed by task ID. When a task's status changes, such as when it starts to run, the appropriate array element is updated to indicate that. So, checking on the status of a task is simply a read of a memory location. Though in our implementation the task status just indicated whether it was running or not, it could be more sophisticated than that.
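
So the check boils down to a single load from a kernel-maintained, read-only page. A rough sketch, with made-up names and without the details of how the mapping is obtained at initialization:

    #include <stdbool.h>
    #include <stdint.h>

    enum { TASK_STOPPED = 0, TASK_RUNNING = 1 };

    /* One byte per dense task ID, published read-only by the kernel.
     * Assume some init-time call fills in these two variables. */
    static volatile const uint8_t *task_status;
    static uint32_t task_status_len;

    static inline bool holder_is_running(uint32_t holder_id)
    {
        if (holder_id >= task_status_len)
            return false;               /* unknown task: don't spin on it */
        return task_status[holder_id] == TASK_RUNNING;  /* one memory read */
    }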

Note that, from the security standpoint, this scheme does have a slow information leak in that tasks can signal to each other by suspending or resuming execution, and it does allow some information about all processes to leak to all other processes. Some thought needs to be given to this. I think it's not a big deal but I may be wrong. Also, security is one motivation for sparse PIDs. If the dense task ID has limited scope, I think it should not pose much of a security risk.

Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 23:34 UTC (Wed) by valkyrie (guest, #96637)

As the other contributor mentioned by vomlehn (Hi!), I remember a few more
of the details.

We were originally asked to look at a method to tell the kernel "I am the
lock holder, and nobody can do anything until I'm done, so run me
exclusively." I think we can all agree that was probably the wrong way to
look at the problem.

Since the patent application was denied and the text has been freely viewable for years, I don't see any harm in filling in the gaps in the description above. We had it working on FreeBSD and Linux (2.4.18-ish, I think). Our experiments seemed to indicate that it solved the class of problems we set out to work. The code never made it out of the lab, unless there's a backup sitting in a landfill somewhere. I'm sure an interested party could probably take the design concepts and re-implement this without too much trouble if it is useful.

First, as vomlehn noted, the most important piece of information when
trying to decide how to contend (spin, sleep) is the status of the lock
holder. Second, if the lock holder is not running, contenders need an
efficient method to get out of the way. Third, even though I feel it's
implied by the word "efficient," it's worth pointing out that effort must
be made to minimize context switches.

As my cohort mentioned, we did trust the lock holder to write its own
identifier (index, handle, what-have-you) as the lock value, because we
didn't want to expose task list memory. I don't think we wrote the PIDs
into the structure, because that could be read by other contenders and
used to harass the lock holder.

Our solution required that tasks agree on a shared memory region that
would hold:

1) Spinlocks
2) Process status information
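
A plausible layout for that region might be something like this; the names and sizes are guesses for illustration, not our original structure:

    #include <stdatomic.h>
    #include <stdint.h>

    #define MAX_LOCKS 64
    #define MAX_TASKS 256

    /* The region all participating tasks agree on: the lock words
     * themselves, plus one status byte per task that the scheduler
     * publishes into. */
    struct shared_region {
        atomic_uint      locks[MAX_LOCKS];    /* holder's handle, 0 = free */
        volatile uint8_t status[MAX_TASKS];   /* written by the kernel */
    };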

It isn't necessary to teach the kernel about user-space spinlocks.
Instead, our solution provides primitives that allow a task to inform the
scheduler that this task isn't interested in running until a memory
location it can read contains (or does not contain) a specific value. We
provided another primitive that allowed processes to inform the scheduler
that it should publish their status changes (running, waiting, dead) in a
specified memory location in their address space. In this way, tasks can
opt-in to sharing certain information in a shared memory block, with all
of the protections (or lack thereof) that shared memory usually enjoys.
This information could be used for many purposes (DMA, IPC, etc.), but our
experiments focused on implementing efficient spinlocks with these
primitives. The addresses supplied by the task were converted to kernel
space by methods which should be obvious, and stored in the task's
structure.
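
In rough terms, those two primitives could be sketched as the following pair of declarations; these are reconstructions for illustration, not the prototypes we actually used:

    #include <stdint.h>

    /* Wait-until-value primitive: the scheduler will not run the caller
     * again until the 32-bit value at 'addr' compares against 'value'
     * according to 'op'.  The check is made whenever the scheduler would
     * otherwise pick this task, without the task itself running. */
    enum wait_op { WAIT_UNTIL_EQUAL, WAIT_UNTIL_NOT_EQUAL };

    int sched_wait_on_value(volatile void *addr, uint32_t value,
                            enum wait_op op);

    /* Status-publication primitive: ask the kernel to write this task's
     * state transitions (running, waiting, dead) into a byte the task
     * has placed in memory it shares with its peers. */
    int sched_publish_status(volatile uint8_t *status_slot);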

Therefore, a contender can determine the state of the lock holder. If the
lock holder is dead, the lock can be safely cleared and acquired
atomically. If the lock holder is running, the contender might choose to
spin for a time. If the lock holder is waiting, or if it is running but
the contender's patience has expired, the contender might suspend itself
(via one of our primitives) until the specific location (that is, the
spinlock) contains the "available" value. Alternatively, it might choose
to suspend itself until the location no longer contains the handle of the
current lock holder. Whenever the scheduler would normally wake a process
which has suspended itself in this manner, the value at the specified
address is checked first. If the value matches the criteria, then the
task has a non-zero chance of acquiring the resource (the lock holder has
released the lock or a new lock holder has acquired the lock), and the
contender is run as normal. If the value does not match the criteria, the
task has virtually no chance of acquiring the resource, and is skipped.
It may be germane to mention that the scheduler never knows WHY the
criteria is interesting. The criteria either matches or doesn't, and the
process is run (or not) accordingly.
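
Putting the pieces together, a contender's decision logic would read something like this sketch, again with invented names (sched_wait_on_value() is the hypothetical primitive sketched above):

    #include <stdatomic.h>
    #include <stdint.h>

    #define LOCK_FREE   0u
    #define SPIN_BUDGET 1000            /* arbitrary patience limit */

    enum task_state { TASK_RUNNING = 1, TASK_WAITING = 2, TASK_DEAD = 3 };
    enum wait_op    { WAIT_UNTIL_EQUAL, WAIT_UNTIL_NOT_EQUAL };

    extern volatile const uint8_t *task_status;   /* kernel-published */
    extern int sched_wait_on_value(volatile void *addr, uint32_t value,
                                   enum wait_op op);

    void contend(atomic_uint *lock, uint32_t my_handle)
    {
        int spins = 0;

        for (;;) {
            unsigned int holder = LOCK_FREE;

            /* Try to take a free lock first. */
            if (atomic_compare_exchange_weak(lock, &holder, my_handle))
                return;
            if (holder == LOCK_FREE)
                continue;                          /* raced; just retry */

            switch (task_status[holder]) {
            case TASK_DEAD:
                /* Dead holder: clear and acquire in one atomic step. */
                if (atomic_compare_exchange_strong(lock, &holder, my_handle))
                    return;
                break;                             /* lock changed hands; retry */

            case TASK_RUNNING:
                if (++spins < SPIN_BUDGET)
                    break;                         /* keep spinning */
                /* patience expired: fall through and sleep */

            default:                               /* holder is waiting */
                /* Sleep until the lock word reads "available". */
                sched_wait_on_value(lock, LOCK_FREE, WAIT_UNTIL_EQUAL);
                spins = 0;
            }
        }
    }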

One caveat: our prototype addressed neither the case where the permissions on the shared segment are changed, nor the case where the segment is removed. A production implementation might want to consider these cases for security reasons.

Whatever you do, don't ever read the (denied) patent application (US
20040103414 A1). It's incomprehensible and does not in any way resemble
the write-up we gave the lawyers. Reading it will only rot your brain and
may wake the Great Old One: Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn.

