User: Password:
Subscribe / Log in / New account

A new approach to opportunistic suspend

A new approach to opportunistic suspend

Posted Sep 29, 2011 1:01 UTC (Thu) by neilbrown (subscriber, #359)
Parent article: A new approach to opportunistic suspend

Assuming that I understand the proposal correctly, I think this is a horrible proposal.

The key need isn't to 'allow "important" processes to keep the device awake' but rather to 'allow "important" *events* to keep the device awake'. So the proposed scheme would require tying events to processes and require blocking access to event generators. i.e. it would preclude a non-blocking programming style where a main loop uses e.g. poll() to be alerted to new events and handled then as required.

In any case, I thought that this problem was solved, it is just that no-one is using the solution. See /sys/power/wakeup_count and the documentation in kernel/power/main.c.

The opportunistic-suspend user-space program does:

- read /sys/power/wakeup_count
- check that the rest of user-space is ready to suspend.
- write the value read back into /sys/power/wakeup_count
- if that succeeds, write 'sleep' to /sys/power/state
- wait for user-space to be quiescent again
- go back to the top of the loop.

Is this not sufficient?

(Log in to post comments)

A new approach to opportunistic suspend

Posted Sep 29, 2011 2:13 UTC (Thu) by jstultz (subscriber, #212) [Link]

Unfortunately I don't believe the addition of /sys/power/wakeup_count is sufficient.

While it does allow us to know if any event has occurred before we try to suspend, it doesn't tell us the event has been consumed by userland.

The easiest example for me is my nightly backup, which is triggered by a RTC backed alarmtimer is scheduled for midnight. Then I walk away from my computer at exactly 11:45, when the system is set to suspend after 15 minutes of no X input.

The RTC interrupt fires first, upping the wakeup count and waking up my backup application. Then before my backup application can run, the pm-daemon starts, reads the wakeup count, echo's it back in, and then echo's "mem" to /sys/power/state. In that case, the system would suspend and by backup wouldn't start until I returned to my computer the next day.

A new approach to opportunistic suspend

Posted Sep 29, 2011 2:49 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I think the wakeup_count model would require the pm-daemon, between reading wakeup_count and writing back to it, to say to the alarm timer "you aren't doing anything right now, are you" and for the alarm timer to check and say "no, I'm right - nothing happening here". Only then does pm-daemon go ahead and suspend. In your case the alarm timer would say "Oh, I just got a wake-up, better not suspend just yet".

The only place I can see room for a race is inside the kernel as the event is handed from one subsystem to another. If e.g. an interrupt handler registered a wakeup event, then queued it for a kernel thread to handle and make available to user-space. I cannot immediately see how we are guaranteed not to sleep before the event becomes visible to user-space. However I suspect a very short usleep in pm-daemon would cover the race, and there may be some rules for writing wakeup-aware drivers that I don't know about.

A new approach to opportunistic suspend

Posted Sep 29, 2011 7:01 UTC (Thu) by jstultz (subscriber, #212) [Link]

I'm not sure I follow the pm-daemon talking with the alarmtimer. But I'd be interested in hearing more.

The trouble is that you have to keep suspend from happening from the wakeup irq to the point that userland runs and consumes the event.

You're right about protecting the path from the wakeup irq to the task wakeup, and that it might go through numerous kernel threads (threaded irqs, softirqs, work queues, or even timers). However, that can be protected using pm_stay_awake/pm_relax chaining, as I did for the RTC/alarmtimer paths.

But we also have to wait for the woken task to run and do whatever communication with the pm-daemon to stop it from suspending. I don't really see how that can be solved without some kernel assistance.

With wakelocks, there is the select/userland-wakelock/read method for overlapping the critical sections. With my proposal, the kernel deboosts before calling schedule() and reboosts the task on wakeup.

Trying to get userland to do something like registering interest with the pm-daemon, and then having to ack the pm-daemon the event was consumed might be plausable, but with only a global wakeup_count I'm not sure how we'd associate which wakeup events we're waiting for an ack on and which ones to ignore (much less the general ugliness of having to map userland's sense of wakeup sources to the actual kernel devices causing them).

I'm very interested in proposals on how to properly solve this on the userland side. However, so far folks seem to wave at it, and no one has shown a working proof of concept.

A new approach to opportunistic suspend

Posted Sep 29, 2011 7:58 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I have to confess that I have only half-implemented this. The kernel running on my phone predates wakeup_count (2.6.29!) but when I get a GTA04....

Anyway this is how it would work - with lots of implementation details that are to some extent irrelevant, but make it more concrete.

There is a file /var/lock/suspend which is empty. Any process which cares about suspend takes a shared lock on this file and arranges to be notified when it changes (dnotify or inotify - choose your poison).
There is another file: /var/run/suspend_disabled. Any process that doesn't want a suspend to happen takes a shared lock on this file.

Some process that is responsible for initiating suspend either periodically tries to take a non-blocking exclusive lock on /var/run/suspend_disabled, or just sits waiting on a blocking exclusive lock.

When it gets a lock, it then drops it and starts the suspend cycle handshake. This involves:
1 - read /sys/power/wakeup_count
2 - create a new file /var/lock/suspend-next
3 - write anything to /var/lock/suspend
4 - get a blocking exclusive lock on /var/lock/suspend

Any process that is interested in suspend will now get notified that a suspend in imminent ([di]notify tells it, and it can check the size of the suspend file to confirm). It does any preparation that might be needed for suspend (tell the gsm chip to stop reporting cell changes, record the time of the next cron job in the RTC alarm), takes a shared lock on /var/lock/suspend-next, asks for change notification on that file, and closes /var/lock/suspend (thus dropping the lock).
If the preparation determined that something actually needed to be done *now* so a suspend should not happen it should take a shared lock on
/var/run/suspend_disabled before it closes /var/lock/suspend.

When all interested parties have actioned their part of the handshake the suspending thread will get the exclusive lock it wanted and:

5 - get a non-blocking exclusive lock on /var/run/suspend_disabled.
6 - if that succeeded, write the wakeup_count back to /sys/power/wakeup_count
7 - if that succeeded, write sleep to /sys/power/state
8 - unlock /var/run/suspend_disabled if it was locked
9 - rename /var/lock/suspend-next to /var/lock/suspend

This completes the cycle. Any client which asked for notification on /var/lock/suspend_disabled will find out that it has been renamed (so it now has a shared lock on /var/lock/suspend) and can do any wake-from-suspend processing such a telling the GSM chip it is allowed to be noisy again.

This way any process that cares can get race-free notification of suspend and resume so it can do any necessary processing, and it has the opportunity to abort the suspend before it commits. If any wakeup event happens after a thread does a check, that event will prevent the system from actually suspending, and the relevant thread will get another notification and another chance to consume the event before there is any chance of another suspend.

Your alarmtimer would need to hook into this somehow. Presumably there would be a cron-like task which gets notification of suspends, arranges an RTC wakeup before allowing suspend to complete, and holds a suspend_disabled lock from a couple of seconds before something needs to run (depending on how long a typical suspend/resume cycle is) until the task has completed.

So the "wait for woken task to run and ... stop it from suspending" is achieved by holding a lock which the pm-daemon must wait for before it is allowed to complete a suspend.

Does that make it clear and concrete enough? Is it convincing?

Current best guess at Android suspend-blocker requirements

Posted Oct 1, 2011 17:41 UTC (Sat) by PaulMcKenney (subscriber, #9624) [Link]

It is not clear to me how your approach handles wakeup events (for example, keyboard/button input). Your thought is to have the kernel drivers acquire locks on the files, or did you have something else in mind?

Also, are you assuming FIFO processing of file lock requests? In the absence of FIFO processing, I don't see what causes all outstanding requests to be processed between your step 4 and 5. (I don't know of any formal standard requiring that file locking process requests in FIFO order, but I might well be missing some fine print somewhere.)

Current best guess at Android suspend-blocker requirements

Posted Oct 1, 2011 21:46 UTC (Sat) by neilbrown (subscriber, #359) [Link]

Hi Paul,

I trust the kernel to handle wakeup events properly and to get them to a point where user-space can see them before allowing the wakeup event to be considered complete. Up thread John says "that can be protected using pm_stay_awake/pm_relax chaining, as I did for the RTC/alarmtimer paths" and I believe that he is talking about exactly that issue and believes it can be addressed - and I trust him in this.... and I just had a look at his patches and the relevant code and find my trust in John was not misplaced (though my trust in the kernel might be ... the kernel currently doesn't do the right thing with RTC wakeups ... but that is a bug that John is fixing, it is not a problem with the design).

The locking protocol relies on the difference between shared and exclusive locks. The clients (who want suspend notification) take shared locks. The service (which will initiate suspend) takes an exclusive lock which will wait for all shared locks to release. This provides the required ordering.

New client locks are directed to a different file: suspend-next (if 'suspend' is not empty, clients must not try to lock it, but should lock suspend-next instead).

I believe this foundation can be used to address all the concerns in your fine document. I'm not completely certain on "reasonable statistics in order to enable debugging and to enable reporting to endusers" as I don't know what the "reasonable statistics are". Presumably that is just statistics about who locks "suspend_disabled" and for how long. That isn't available directly in my scheme, but the locking of suspend_disabled is a trivial part of the API - quite separate from the race-free-suspend-notification - and could easily be replaced by something that carries more detailed information.

(Of course the whole API could be replaced with something using "binder", or sockets, or SUN-RPC or maybe even smoke signals. I just like filesystem APIs so that is what I tend to use. It is the sequencing and interactions that are important).

The interactions are:
- service multicasts "I'm going to suspend now - prepare yourselves"
- clients perform any preparation, which may include saying "please don't suspend now" on a separate channel.
- clients confirm they are prepared
- services waits for all confirmations. Then checks again if anyone is blocking suspend (via wakeup_count or checking user-space requests).
- If no-one is blocking suspend - enter suspend, and resume
- service tells all client "Ok, we are awake again - go about your business"
- clients act accordingly

Current best guess at Android suspend-blocker requirements

Posted Oct 2, 2011 8:09 UTC (Sun) by neilbrown (subscriber, #359) [Link]

It occurred to me that I didn't really answer your first question properly.

The auto-suspend daemon performs 2 checks to ensure that it is OK to suspend.
1/ The file locking handshake that I described in detail - this makes sure that user-space processes are happy with the suspend.
2/ the wakeup_count handshake. This ensures that the kernel is happy with the suspend.

The kernel has "wakeup_sources" that are similar to suspend blockers. When a wakeup_source is active, it means that suspend should not succeed.

When a wakeup_source is activated (normally in an interrupt handler), two global counters are incremented, and when it is deactivated (which should happen once the event is visible to user-space) one of them is decremented.

When the auto-suspend daemon reads from /sys/power/wakeup_count, the read blocks until the decrementing count is zero - so there are no active wakeup_sources. The other count is returned.
Subsequently - after checking with all of userspace - the daemon writes the number back to wakeup_count. If the corresponding counter has changed, the write fails. This means there has been another event and the daemon needs to check with user-space again.
If it succeeds then the kernel stores this number and the next attempt to suspend (echo mem > /sys/power/state) will only complete if there are no more wakeup events. i.e. if, right up to the moment of suspend, the event counter has the same value as the stored number that was written to /sys/power/state.

So the file-locking is only for user-space, and the wakeup event counting completely handles the kernel-event side.

Current best guess at Android suspend-blocker requirements

Posted Oct 2, 2011 22:24 UTC (Sun) by PaulMcKenney (subscriber, #9624) [Link]

Ah, understood. For whatever reason, I thought you were trying to re-implement John's patches as opposed to building a user-space API for them. Agreed, it would be quite hard to properly handle wakeup events in user space, given that the kernel needs to be involved in the wakeup/resume process!

Current best guess at Android suspend-blocker requirements

Posted Oct 2, 2011 23:43 UTC (Sun) by neilbrown (subscriber, #359) [Link]

Yes and no.

Some of John's patches were just bug-fixes. The wakeup event from the CMOS timer wasn't being counted as a wakeup event properly. Those patches I wasn't trying to re-implement - they are good.

The other patches from John did things to the scheduler so that distinguished processes could prevent suspend just by being busy. I was trying to show that these patches were not necessary because a completely race-free protocol is available in userspace.

(and finally there was a patch which tried to make /sys/power/state resistant to processes that didn't follow the wakeup_count protocol. That issue is best addressed by fixing the processes, or bind-mounting a file over /sys/power/state).

Current best guess at Android suspend-blocker requirements

Posted Oct 3, 2011 16:18 UTC (Mon) by PaulMcKenney (subscriber, #9624) [Link]

Cool! But I must defer to John and the Android guys at this point, as they would have a better idea of what works and does not in the Android environment.

My guess is that if they needed something lighter weight, this same general approach could be implemented within a device driver, though an implementation that just used existing user-space interfaces would be preferable, all else being equal. (Which it never is, but that is life!)

A new approach to opportunistic suspend

Posted Sep 29, 2011 18:22 UTC (Thu) by iabervon (subscriber, #722) [Link]

Surely, it's not true that your computer is set to suspend after 15 minutes of no X input; your computer is set not to suspend (by itself) within 15 minutes of X input. I think that the difference between these two similar statements is a large part of the reason that a lot of proposed solutions are useless in practice. If you say it the way your said it, the misbehavior you discuss is as specified. If you want your system to be glitch-free, I think you have to start by carefully describing glitch-free policies in your text.

(For that matter, I think the main problem that Android has is due to specifying the wakeup button in a glitchy way; I think it would solve everything to say that the wakeup button keeps the system on for a certain period of time, measured by kernel timer, during which period the user has a chance to starting using the system in a way that keeps it on longer but wouldn't be detectable while the system was off.)

A new approach to opportunistic suspend

Posted Sep 29, 2011 22:10 UTC (Thu) by neilbrown (subscriber, #359) [Link]

It is an important distinction that you draw. No application should be saying "nothing has happened for a while, time to suspend". Rather it should be able to say "something is happening, don't suspend just now".

However "should be" != "is"

The Power Management Preferences on my Gnome desktop have an option "Put compute to sleep when inactive for: ....". Assuming that text is representative of what is actually done, the g-p-m seems to be acting in exactly the wrong way.

I read the original lkml thread, and a big part of John's apparent motivation was to be able to intervene when g-p-m explicitly tries to put the system to sleep. Apparently a dbus API is coming but until then.....

It is actually fairly easy to interfere with g-p-m. Simply create an empty file and bind-mount it over /sys/power/state. Then when g-p-m tries to suspend the system it will write to the file instead. Some daemon can watch the file and when "mem" appears in it, check that nothing else wants to keep the system up, and only then write to /sys2/power/state (assuming sysfs is mounted at /sys2 as well).

It is a bit ugly, but it should allow effective prototyping of a sensible suspend manager until g-p-m comes to the party.

(and if there are any g-p-m developers listening who want to tell me that I'm completely wrong about g-p-m, I'd love to hear the details, thanks!)

A new approach to opportunistic suspend

Posted Sep 29, 2011 22:58 UTC (Thu) by iabervon (subscriber, #722) [Link]

I think having a userspace daemon that tracks why the system is on (and turns it off if there's no need for it to be on) would take care of a lot of the issues; however, I'm not entirely convinced that userspace can be race-free for this sort of thing. The other part of his situation is that he's worried about the fact that he'd have to poke the userspace daemon (whatever it is) from the hardware timer interrupt in order to make sure that the system doesn't suspend after the timer has fired (so the timer doesn't wake it) but before the daemon finds out that something needs the system up.

Anyway, I think the principle that most sensible policies specify "I need the system on/I don't need the system on", not "I want the system off" applies to the kernel. (For that matter, where there are sensible policies for "I want the system off", it's things like low battery or the lid being closed, where you want to keep the system suspended, regardless of ordinary wakeup events, until the particular issue that triggers the suspend is cleared.) So I don't see the logic in having the kernel API only offer a misleading operation, rather than letting userspace specify the policy in sensible terms and the kernel carry it out.

A new approach to opportunistic suspend

Posted Sep 29, 2011 23:32 UTC (Thu) by neilbrown (subscriber, #359) [Link]

Yes, exactly. The /sys/power/state api is wrong and backwards. But arguable the hardware/firmware api which it provides access to is busted as well. But plenty of interfaces are busted and we just have to work with them until well after the next-best-thing is perfected.

There is a new api which is per-device power management where the kernel is in control of everything instead of the firmware, and there is a school of thought which says we should forget suspend because both the hardware concept and the kernel API is broken - just use per-device power management and fix all the bugs.

However we still need to work with hardware that saves more power on "suspend" and /sys/power/state is the natural API to work with that. By itself is a racy API, so /sys/power/wake_count was added to make the races avoidable, which it does. Putting more work into make the API in the kernel for suspend/resume even "better" would be a mistake - the work should go to making per-device power-management work really well.

So yes: /sys/power/state is busted, but it is possible to create a user-space API on top of that which provides sensible semantics. So we should just do that rather than nagging the kernel devs to make life easier.

And I think you are wrong - as user-space solution using power/state and power/wake_count can be race free (modulo bugs). You just need to use power/wake_count correctly (see my post up-thread)

Think of system-wide suspend/resume as legacy. Yes it should work and work reliably, but it doesn't need to be pretty. per-device power states and power policy is the new way. We should use that were possible and only fall back on suspend/resume on legacy systems that require it (like .. uhmmm... x86). Android (and all others) are welcome to use suspend/resume and it should work. But they are encouraged to explore the various improvements possible with finer grained power management.

A new approach to opportunistic suspend

Posted Sep 29, 2011 2:29 UTC (Thu) by jstultz (subscriber, #212) [Link]

I do think you're right that the non-blocking style would have issues, if you were wanting the polling task to be "important". In my mail I mentioned the difficulty with using select() on multiple fds that all are not backed by wakeup devices, and if suspending while that blocked (and thus would in effect ignore the non-wakeup fds until a wakeup event occurs) would be appropriate or not.

That said, even with nonblocking polls, however the system was suspended, any such application is in effect blocking on an event that will wake the system up.

Its an interesting point though, and I appreciate the critique. I was not expecting this idea to be warmly received, but I do want it to make people think about what is actually needed, and what other approaches can be used to solve the same issues.

A new approach to opportunistic suspend

Posted Sep 29, 2011 13:29 UTC (Thu) by kiko (subscriber, #69905) [Link]

And yet I'm surprised to read Peter and Thomas echoing a familiar "give me use cases" meme back on this thread. After all this time, are the use cases this tries to address not clear? Maybe we should fix that first.

A new approach to opportunistic suspend

Posted Sep 29, 2011 13:58 UTC (Thu) by khim (subscriber, #9252) [Link]

After all this time, are the use cases this tries to address not clear?

Of course not! That's the problem.

Maybe we should fix that first.

Not possible. Period.

That's the problem with userspace APIs: they are used to implement all kind of functionality not ever imagined by their developers. This is the same with wakelocks: they were initially created to solve problems of a few concrete examples. But since they were available people started using them for all kinds of other stuff (remember that they are available in official Android API). Thus today if you want to rip them out you'll need to either implement identical functionality (and then we can only talk about clean interface between kernel and userpsace) or you should support all the usecases hundred of thousand of apps out there are presenting.

A new approach to opportunistic suspend

Posted Sep 29, 2011 16:04 UTC (Thu) by kiko (subscriber, #69905) [Link]

How can it not be possible to provide a compelling subset of the existing use cases?

At any rate, you have a valid point that what we are arguing here is for a capability that has the potential of addressing a major class of problem that application developers face. Perhaps better than outlining the use cases individually might be describing the class of problems wakelocks let you address. I think Paul McKenney actually tried this a while back, but it rambled on too much -- but maybe John has the guts to run a more focused thread on the subject.

Alternatively, we could motivate the kernel community into appreciating that while they don't see the feature addressing their own use cases, a population of Android application developers are relying on the functionality, and it would be the responsible thing to either support them or convert them to use something else.

A new approach to opportunistic suspend

Posted Sep 29, 2011 20:23 UTC (Thu) by jrn (subscriber, #64214) [Link]

> Alternatively, we could motivate the kernel community into appreciating that while they don't see the feature addressing their own use cases, [...]

The usual approach in that case is for people who do see value in the feature to join the kernel community. Hard work, but possible.

A new approach to opportunistic suspend

Posted Sep 29, 2011 22:41 UTC (Thu) by neilbrown (subscriber, #359) [Link]

Bah. The use cases are perfectly clear.
And (with one small caveat) the Android API is perfectly sane and simple to implement on todays mainline kernel (baring bugs due to lack of testing).

The small caveat is what does "CPU" mean in the context of things that can be "on" or "off".

The Android interpretation means "CPU is off" is when system is "suspended" and "CPU is on" is when system is not suspended. Among other things, "suspend" defines a class of events which can wake from suspend. All other events will not.
So we could re-describe "CPU is off" as "only class X of events will be responded to" while "CPU is on" is "all events will be responded to". Because a process could be CPU-bound, we might need to interpret the completion of one CPU instruction as an event which triggers the execution of the next CPU instruction. These events are ignored during suspend.

The school of thought to which Peter Z seems to belong is essentially saying that this classification of events is too coarse because

a/ modern hardware often has a lot more control of power states and can save the same power without flushing all registers and stopping all clocks and going fully into suspend. So an application should get a wakelock on the events it really wants rather than on the class X of events called "CPU".

b/ Badly behaved applications can still waste power even without being able to take wakelocks. If one app is holding a CPU wakelock, then any app is allowed to use the CPU and waste power.

'b' touches the monitoring side of wakelocks rather than the locking side. wakelocks allow the system to know who is keeping the device awake to know who to blame. But we really want to know who is keeping the CPU busy and so wasting power. So I think the suggestion is that rather than making the CPU available to all processes, or not, the system should count the cycles used by each process and charge accordingly. So if any process wants to execute instructions it shouldn't need a wake-lock, it should just execute them, but be charged. It might still need a wake-lock on any events it wants like timers or frame-refresh signals or keyboard or whatever. An app which shows a high battery-use charge is likely to be quickly uninstalled.

So if the hardware were capable of very low power idle, and of accounting CPU cycles to processes, then taking the "execute next instruction" event out of the set of events tied to "CPU is on", and charging for it explicitly would probably make a lot of sense, and wouldn't need a change in the API.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds