Redesigning asynchronous suspend/resume

By Jonathan Corbet
December 16, 2009

Your editor suspects that, were somebody to poll the community of Linux users, very few would state that they dislike the idea of having their systems suspend and resume more quickly. Rafael Wysocki has been working toward this goal for some time; his asynchronous suspend/resume patches were covered here back in August. This code has not encountered any real turbulence for a while, so one might well assume that Rafael's 2.6.33 pull request containing asynchronous suspend/resume would not be controversial. Such assumptions, however, fail to take into account the "last-minute Linus" effect.

The simple fact of the matter is that, like anybody else, Linus cannot possibly follow all of the projects under way at any given time; that makes it entirely possible for work on a specific project to proceed to a conclusion without ever drawing his attention. That will inevitably come to an end, though, when somebody sends a pull request asking that the work be merged into the mainline. It seems clear that some requests are scrutinized more closely than others, but some are looked at closely indeed. The power management request, as it turns out, was one of those.

Linus didn't like what he saw, to say the least. The code struck him as overly complex and possibly unsafe; he refused to pull it. In particular, he thought that far too much work went into trying to map out the device tree topology and all of the dependencies between devices. In the past, attempts to make things asynchronous based on just the apparent topology have run into trouble; why should it be different this time?

Having said that, Linus then went on to outline an alternative solution based mainly on the device tree. In so doing, he wanted to make it possible for most drivers to ignore the concept of asynchronous suspend and resume entirely. For much of the hardware on the system, the time required for either operation is so short that there is really little point in trying to do it in parallel. If a device can be suspended in a few milliseconds, one might as well just do it serially and avoid the complexity.

For the rest, Linus very much wanted the decision on whether to do things asynchronously to be made at the driver level. But the power management core still needs to know enough about asynchronous operation to wait until it is done; one cannot suspend a controller until all devices connected to it have, themselves, completed suspending. After some revisions, Linus's plan came down to something like this:

A reader/writer semaphore (rwsem) is associated with each node in the device tree. These semaphores allow an unlimited number of concurrent reader locks, but only one writer lock can exist at any given time, and writers must first wait for any readers to finish. At the beginning of the suspend process, no locks are taken.
The suspend process is initiated on all children of a given node. If suspend is done synchronously, it happens right away and no further action is required.
Should the driver decide to suspend its device asynchronously, it starts a thread to do that work. It also takes a read lock on the parent's rwsem.
When an asynchronous suspend for a specific device completes, the read lock is released.
The parent node acquires a write lock on its own rwsem before suspending the device. If any child nodes are suspending asynchronously, the write lock will block as a result of the outstanding read locks. Only when all read locks are released - meaning that all children are suspended - can the parent acquire its write lock and suspend.
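
As a rough illustration only, the steps above can be modeled in user space. The sketch below is not kernel code and was not posted in the discussion; the RWSem class, the Device class, and the function names are all hypothetical stand-ins, and Python threads take the place of the kernel threads a driver would start.

```python
import threading
import time


class RWSem:
    """Toy reader/writer semaphore: many readers, or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def down_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def up_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def down_write(self):
        with self._cond:
            # A writer waits for any current writer and for every reader.
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def up_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class Device:
    def __init__(self, name, is_async=False, children=()):
        self.name = name
        self.is_async = is_async      # assumed to be the driver's own choice
        self.children = list(children)
        self.rwsem = RWSem()


def suspend_tree(dev, log, threads, parent=None):
    # Initiate suspend on all children before touching this node.
    for child in dev.children:
        suspend_tree(child, log, threads, parent=dev)

    def do_suspend():
        dev.rwsem.down_write()        # blocks while async children hold read locks
        time.sleep(0.01)              # stand-in for slow hardware
        log.append(dev.name)
        dev.rwsem.up_write()          # released here only to keep the demo short
        if dev.is_async and parent is not None:
            parent.rwsem.up_read()    # tell the parent this child is done

    if dev.is_async and parent is not None:
        parent.rwsem.down_read()      # taken before the worker thread starts
        worker = threading.Thread(target=do_suspend)
        threads.append(worker)
        worker.start()
    else:
        do_suspend()                  # synchronous: finished on return


def suspend_all(root):
    log, threads = [], []
    suspend_tree(root, log, threads)
    for worker in threads:
        worker.join()
    return log


if __name__ == "__main__":
    disk = Device("disk", is_async=True)
    ctrl = Device("controller", is_async=True, children=[disk])
    root = Device("root", children=[ctrl, Device("rtc")])
    print(suspend_all(root))
```

The order in which siblings finish can vary from run to run, but a parent should never appear in the log before any of its children, which is the only guarantee the scheme is meant to provide.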

For resume, the write lock is taken first, and all children take read locks on their parent before resuming the hardware. That will ensure that parent devices complete resuming before any child devices begin the process.

This scheme has the benefit of simplicity. Getting it implemented took a few rounds of discussion, though, with Linus repeatedly asking developers to retain that simplicity and not try to make up new locking schemes. Things still changed along the way; as of this writing, the current suspend/resume patch set does not use Linus's plan as originally written. Among other things, Rafael, who did implement an rwsem-based solution, ran into problems with lockdep that Linus agreed were serious.

What has been implemented instead is a variant on that scheme based on completions. Every device node gets a completion structure, initially set to the "not complete" state. Additionally, any driver which implements asynchronous suspend/resume needs to call device_enable_async_suspend() to inform the power management core of that fact. It's now up to that core to create threads for asynchronous suspend/resume operations, and to invoke driver callbacks from those threads. Before suspending a specific device node, the power core will wait for completions for any child devices which have been marked for asynchronous callbacks. Once again, that ensures that all children have been suspended before the parent node is suspended.
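
The completion-based variant can be sketched in the same user-space style. Again, this is a toy model under stated assumptions rather than the actual patch: threading.Event stands in for a kernel completion, the Device class and suspend_all() are hypothetical, and the Python function named device_enable_async_suspend() merely mimics the role of the kernel call mentioned above.

```python
import threading


class Device:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.async_suspend = False
        self.completion = threading.Event()   # starts out "not complete"


def device_enable_async_suspend(dev):
    # Stand-in for the driver telling the core it can suspend asynchronously.
    dev.async_suspend = True


def _suspend_one(dev, log):
    # Wait for every child to complete. Synchronous children are already
    # complete by the time we get here, so only asynchronous ones can block.
    for child in dev.children:
        child.completion.wait()
    log.append(dev.name)          # stand-in for the driver's suspend callback
    dev.completion.set()          # signal "complete" to the parent


def suspend_all(root):
    log, threads = [], []

    def walk(dev):
        # Children are visited before their parent.
        for child in dev.children:
            walk(child)
        if dev.async_suspend:
            # The core, not the driver, creates the thread.
            worker = threading.Thread(target=_suspend_one, args=(dev, log))
            threads.append(worker)
            worker.start()
        else:
            _suspend_one(dev, log)

    walk(root)
    for worker in threads:
        worker.join()
    return log


if __name__ == "__main__":
    disk = Device("disk")
    ctrl = Device("controller", children=[disk])
    root = Device("root", children=[ctrl, Device("rtc")])
    device_enable_async_suspend(disk)
    device_enable_async_suspend(ctrl)
    print(suspend_all(root))
```

Compared with the rwsem sketch, the driver here only sets a flag; thread creation and all of the waiting live in one place, which roughly mirrors the division of labor described above.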

Linus doesn't like the completion-based approach, but has indicated that he will be willing to take it. As of this writing, that has not yet happened, though.

Seen in one light, this episode highlights the sort of disregard for developer time which is occasionally seen in the kernel development process. It is not that uncommon for code which has seen a lot of work to end up being discarded or massively reworked. This model can seem quite wasteful, and there can be no doubt that it can be highly frustrating for the developers involved. But it is also a fundamental part of how quality control for the kernel works. The suspend/resume code was clearly improved by this last-minute redesign. One might say that it would have been better done some months ago, but what matters most for Linux users is that it happens at all.

Posted Dec 17, 2009 16:11 UTC (Thu) by jzbiciak (guest, #5246)

I'm reminded of Frederick Brooks' statement:

Plan to throw one away; you will, anyhow.

Of course, Brooks' aphorism is a bit over-compressed, and he's noted its limitations in later essays. I can't help but think, though, that these massive reworks are only successful because the thing being reworked already exists and already shows benefits despite its warts. That gives incentive to go back and fix the things that are clearly not right, but retain the value.

"Because it took a lot of time to write" is a horrible reason to keep a piece of code if you can see that there's clearly a better way to do something once you get to the end. And besides, this massive rework took, what, less than two weeks? (Rafael's pull request was dated Dec 5th, and today is 12 days later.) If you covered the original patches back in August, then comparing the 2 weeks relative to the previous 3+ months doesn't feel so bad.

Sometimes you don't know why something is bad until after you've implemented it. The rework that happens at the end benefits from hindsight.

Posted Dec 17, 2009 18:09 UTC (Thu) by halla (subscriber, #14185)

On the other hand, there's the second system effect and the mob of attention-deficit teenagers. Though in Krita we've rewritten the core at least four times, and I'm definitely not a teenager anymore.

Posted Dec 17, 2009 19:43 UTC (Thu) by cpeterso (guest, #305)

there's the second system effect and the mob of attention-deficit teenagers.

Perhaps Linux development breaks another of Brooks's laws: Adding manpower to a late software project makes it later.

Brooks assumed that new developers would negatively impact existing developers' productivity. I don't think that applies when the pool of potential Linux developers is (practically) infinite. There is little cost to Linus/Linux if dozens of autonomous developers are all writing their own Linux process schedulers.

Posted Dec 17, 2009 23:56 UTC (Thu) by jzbiciak (guest, #5246)

I'm going to reply to both of you in one spot. :-)

(Unfortunately, I don't have my copy of Mythical Man-Month with me, so I can't quote him directly.) In more recent editions of MMM, he includes essays that look back on some of his original statements, and presents refinements, counter-arguments, and clarifications.

boudewijn said:

there's the second system effect

With respect to this and the "throw-one-away" comment, he pointed out that if you take both literally and out of context, this would seem like a recipe for disaster: You'd throw your first system away, and end up with your second system rife with feeping creaturitis. (Obviously paraphrased.)

He goes on to say that the two are talking about different things. The first one refers to multiple implementations of the same system. The second one refers to the second complete system as a discrete entity / project from the first. The Linux kernel is too large a project to be described in this manner, IMHO. It's more interesting to look at individual subsystems, and even then, only when the same group of designers implemented major revisions of that subsystem. (Recall, Brooks described the "second system effect" in the context of a particular designer's second system.)

cpeterso said:

Perhaps Linux development breaks another of Brooks's laws: Adding manpower to a late software project makes it later.

As I noted above, I don't think you can apply that to the Linux kernel as a whole, because it's actually a collection of many, many smaller projects. The whole of Linux marches forward mostly-independently of any of its constituent pieces. Something doesn't stay merged if it breaks things.

For one thing, Linux defies the notion of "late," since "late" only makes sense relative to a schedule. "When it's ready" is not a schedule. That doesn't mean Brooks' Law is invalid, it just needs a change in nomenclature. A more accurate restatement would be "Adding more developers to a project that is well on its way to completion will delay that project's completion, due to the overhead of bringing the new members up to speed, and the additional need for cross communication among developers." In particular, he notes that the number of potential communication paths among developers increases as the square of the number of developers.

In this more general form, I believe Brooks' observation does still apply, at least to a major discrete feature development at the subsystem level. If you look at any such addition, rapid progress on its implementation does generally slow down when it's opened to a larger community, because of the additional communication overheads. The "last minute Linus effect" is just a particular manifestation of this property.

In later versions of MMM, Brooks' retrospective does point out that there are organizations that have shown much better productivity scaling than the N² developer interactions suggest by segmenting the work effectively. In Linux kernel land, I think this directly translates to the notion of discrete subsystems that largely don't care about each other most of the time, thus enabling them to remain independent of each other and considered separately.

Posted Jan 6, 2010 21:42 UTC (Wed) by jchrist (guest, #14782)

I'm trying to track down the original "mob of attention-deficit teenagers"
source without any luck. Any pointers?

-Jon

Posted Jan 6, 2010 22:18 UTC (Wed) by mjg59 (subscriber, #23239)

JWZ