
Facebook and the kernel--adaptive spinlocks


Posted Apr 16, 2014 17:36 UTC (Wed) by vomlehn (subscriber, #45588)
In reply to: Facebook and the kernel--adaptive spinlocks by intgr
Parent article: Facebook and the kernel

Good question. As I said, it has been a while, but as I recall, we didn't actually use PIDs because they are sparse. Instead, we assigned each task a dense unique task ID. Then, on initialization, you request access to a read-only memory region from the kernel, which contains an array indexed by task ID. When a task's status changes, such as when it starts to run, the appropriate array element is updated to reflect that. So, checking the status of a task is simply a read of a memory location. Though in our implementation the task status indicated only whether the task was running, it could be more sophisticated than that.
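A minimal sketch of the idea, with all names and the status encoding invented for illustration (the original code is not available): the kernel exports a read-only region holding one status word per dense task ID, so checking whether a task is running is a single load.

```c
#include <stdint.h>

/* Hypothetical status values; the real scheme only distinguished
 * running from not-running. */
enum task_status { TASK_NOT_RUNNING = 0, TASK_RUNNING = 1 };

/* In the real scheme this would point at a read-only region mapped in
 * from the kernel at initialization; here it is an ordinary array. */
static volatile uint8_t *task_status_map;

/* One load, indexed by the dense task ID -- no syscall needed. */
static int task_is_running(unsigned int task_id)
{
    return task_status_map[task_id] == TASK_RUNNING;
}
```

The dense ID matters here: a PID-indexed array would be mostly empty, while a dense ID keeps the table small and cache-friendly.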

Note that, from the security standpoint, this scheme does have a slow information leak in that tasks can signal to each other by suspending or resuming execution, and it does allow some information about all processes to leak to all other processes. Some thought needs to be given to this. I think it's not a big deal but I may be wrong. Also, security is one motivation for sparse PIDs. If the dense task ID has limited scope, I think it should not pose much of a security risk.



Facebook and the kernel--adaptive spinlocks

Posted Apr 16, 2014 23:34 UTC (Wed) by valkyrie (guest, #96637)

As the other contributor mentioned by vomlehn (Hi!), I remember a few more
of the details.

We were originally asked to look at a method to tell the kernel "I am the
lock holder, and nobody can do anything until I'm done, so run me
exclusively." I think we can all agree that was probably the wrong way to
look at the problem.

Since the patent application was denied and the text has been freely viewable for years, I don't see any harm in filling in the gaps in the description above. We had it working on FreeBSD and Linux (2.4.18-ish, I think). Our experiments seemed to indicate that it solved the class of problems we set out to solve. The code never made it out of the lab, unless there's a backup sitting in a landfill somewhere. An interested party could probably take the design concepts and re-implement this without too much trouble if it proved useful.

First, as vomlehn noted, the most important piece of information when
trying to decide how to contend (spin, sleep) is the status of the lock
holder. Second, if the lock holder is not running, contenders need an
efficient method to get out of the way. Third, even though I feel it's
implied by the word "efficient," it's worth pointing out that effort must
be made to minimize context switches.

As my cohort mentioned, we did trust the lock holder to write its own
identifier (index, handle, what-have-you) as the lock value, because we
didn't want to expose task list memory. I don't think we wrote the PIDs
into the structure, because that could be read by other contenders and
used to harass the lock holder.
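The lock word scheme above can be sketched with C11 atomics; the constant and function names here are assumptions, not the original code. The lock word holds either a "free" value or the dense handle of the current holder, never a PID.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOCK_FREE 0u   /* assumed sentinel: no handle is ever 0 */

typedef _Atomic uint32_t spinlock_word;

/* Try once to acquire: atomically write our own handle into the
 * lock word if it is currently free. */
static int try_acquire(spinlock_word *lock, uint32_t my_handle)
{
    uint32_t expected = LOCK_FREE;
    return atomic_compare_exchange_strong(lock, &expected, my_handle);
}

/* Read the handle of whoever holds the lock (LOCK_FREE if nobody).
 * The handle is an index, not a PID, so contenders can check the
 * holder's status without learning how to signal it. */
static uint32_t lock_holder(spinlock_word *lock)
{
    return atomic_load(lock);
}

static void release(spinlock_word *lock)
{
    atomic_store(lock, LOCK_FREE);
}
```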

Our solution required that tasks agree on a shared memory region that
would hold:

1) Spinlocks
2) Process status information
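One possible layout for that agreed-upon region, purely illustrative (the sizes and field names are invented, and the original layout is not documented):

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_LOCKS 64    /* assumed capacity */
#define MAX_TASKS 1024  /* assumed capacity */

/* Shared between all participating tasks; mapped at an agreed
 * location so indices, not pointers, identify locks and tasks. */
struct shared_region {
    uint32_t locks[MAX_LOCKS];   /* 1) spinlocks: free, or holder's handle */
    uint8_t  status[MAX_TASKS];  /* 2) task status, indexed by dense task ID */
};
```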

It isn't necessary to teach the kernel about user-space spinlocks.
Instead, our solution provided a primitive that allows a task to inform
the scheduler that it isn't interested in running until a memory
location it can read contains (or does not contain) a specific value. We
provided another primitive that allowed processes to inform the scheduler
that it should publish their status changes (running, waiting, dead) in a
specified memory location in their address space. In this way, tasks can
opt-in to sharing certain information in a shared memory block, with all
of the protections (or lack thereof) that shared memory usually enjoys.
This information could be used for many purposes (DMA, IPC, etc.), but our
experiments focused on implementing efficient spinlocks with these
primitives. The addresses supplied by the task were converted to kernel
space by methods which should be obvious, and stored in the task's
structure.
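The first primitive might look like the stub below. The real version was a kernel call that suspended the task; this single-process stand-in (names and semantics are my assumptions, not the original API) only evaluates the condition the scheduler would check, reporting whether the task would remain suspended.

```c
#include <stdint.h>

/* The two wait conditions described above: run me when the location
 * contains the value, or when it no longer contains it. */
enum wait_kind { WAIT_UNTIL_EQUAL, WAIT_UNTIL_NOT_EQUAL };

/* Returns 0 if the condition already holds (the task may run),
 * 1 if the scheduler would keep the task suspended.  Note that the
 * check knows nothing about locks: it is just a value comparison. */
static int sched_would_block(volatile uint32_t *addr, uint32_t val,
                             enum wait_kind kind)
{
    uint32_t cur = *addr;
    if (kind == WAIT_UNTIL_EQUAL)
        return cur != val;
    return cur == val;
}
```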

Therefore, a contender can determine the state of the lock holder. If the
lock holder is dead, the lock can be safely cleared and acquired
atomically. If the lock holder is running, the contender might choose to
spin for a time. If the lock holder is waiting, or if it is running but
the contender's patience has expired, the contender might suspend itself
(via one of our primitives) until the specific location (that is, the
spinlock) contains the "available" value. Alternatively, it might choose
to suspend itself until the location no longer contains the handle of the
current lock holder. Whenever the scheduler would normally wake a process
which has suspended itself in this manner, the value at the specified
address is checked first. If the value matches the criteria, then the
task has a non-zero chance of acquiring the resource (the lock holder has
released the lock or a new lock holder has acquired the lock), and the
contender is run as normal. If the value does not match the criteria, the
task has virtually no chance of acquiring the resource, and is skipped.
It may be germane to mention that the scheduler never knows WHY the
criterion is interesting. The criterion either matches or it doesn't, and
the process is run (or not) accordingly.
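The contender's decision described above can be condensed into a small sketch; the status values, action names, and the "patience" counter are invented here for illustration.

```c
/* Assumed holder states, readable from the shared status array. */
enum holder_status { HOLDER_DEAD, HOLDER_WAITING, HOLDER_RUNNING };

enum contend_action {
    ACTION_CLEAR_AND_ACQUIRE,  /* holder dead: atomically clear and take lock */
    ACTION_SPIN,               /* holder running and we still have patience */
    ACTION_SUSPEND             /* wait (via the primitive) on the lock word */
};

static enum contend_action decide(enum holder_status holder, int patience_left)
{
    if (holder == HOLDER_DEAD)
        return ACTION_CLEAR_AND_ACQUIRE;
    if (holder == HOLDER_RUNNING && patience_left > 0)
        return ACTION_SPIN;
    /* Holder is waiting, or our patience expired while it ran. */
    return ACTION_SUSPEND;
}
```

Suspension would then use the wait primitive on the lock word itself, either until it holds the "available" value or until it no longer holds the current holder's handle.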

One caveat: our prototype addressed neither the case where the permissions on the shared segment are changed, nor the case where the segment is removed. A production implementation might want to consider these cases for security reasons.

Whatever you do, don't ever read the (denied) patent application (US
20040103414 A1). It's incomprehensible and does not in any way resemble
the write-up we gave the lawyers. Reading it will only rot your brain and
may wake the Great Old Ones. Ph'nglui mglw'nafh Cthulhu R'lyeh wgah'nagl fhtagn


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds