User: Password:
Subscribe / Log in / New account

A new approach to opportunistic suspend

A new approach to opportunistic suspend

Posted Sep 29, 2011 2:13 UTC (Thu) by jstultz (subscriber, #212)
In reply to: A new approach to opportunistic suspend by neilbrown
Parent article: A new approach to opportunistic suspend

Unfortunately I don't believe the addition of /sys/power/wakeup_count is sufficient.

While it does allow us to know if any event has occurred before we try to suspend, it doesn't tell us the event has been consumed by userland.

The easiest example for me is my nightly backup, which is triggered by a RTC backed alarmtimer is scheduled for midnight. Then I walk away from my computer at exactly 11:45, when the system is set to suspend after 15 minutes of no X input.

The RTC interrupt fires first, upping the wakeup count and waking up my backup application. Then before my backup application can run, the pm-daemon starts, reads the wakeup count, echo's it back in, and then echo's "mem" to /sys/power/state. In that case, the system would suspend and by backup wouldn't start until I returned to my computer the next day.

(Log in to post comments)

A new approach to opportunistic suspend

Posted Sep 29, 2011 2:49 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I think the wakeup_count model would require the pm-daemon, between reading wakeup_count and writing back to it, to say to the alarm timer "you aren't doing anything right now, are you" and for the alarm timer to check and say "no, I'm right - nothing happening here". Only then does pm-daemon go ahead and suspend. In your case the alarm timer would say "Oh, I just got a wake-up, better not suspend just yet".

The only place I can see room for a race is inside the kernel as the event is handed from one subsystem to another. If e.g. an interrupt handler registered a wakeup event, then queued it for a kernel thread to handle and make available to user-space. I cannot immediately see how we are guaranteed not to sleep before the event becomes visible to user-space. However I suspect a very short usleep in pm-daemon would cover the race, and there may be some rules for writing wakeup-aware drivers that I don't know about.

A new approach to opportunistic suspend

Posted Sep 29, 2011 7:01 UTC (Thu) by jstultz (subscriber, #212) [Link]

I'm not sure I follow the pm-daemon talking with the alarmtimer. But I'd be interested in hearing more.

The trouble is that you have to keep suspend from happening from the wakeup irq to the point that userland runs and consumes the event.

You're right about protecting the path from the wakeup irq to the task wakeup, and that it might go through numerous kernel threads (threaded irqs, softirqs, work queues, or even timers). However, that can be protected using pm_stay_awake/pm_relax chaining, as I did for the RTC/alarmtimer paths.

But we also have to wait for the woken task to run and do whatever communication with the pm-daemon to stop it from suspending. I don't really see how that can be solved without some kernel assistance.

With wakelocks, there is the select/userland-wakelock/read method for overlapping the critical sections. With my proposal, the kernel deboosts before calling schedule() and reboosts the task on wakeup.

Trying to get userland to do something like registering interest with the pm-daemon, and then having to ack the pm-daemon the event was consumed might be plausable, but with only a global wakeup_count I'm not sure how we'd associate which wakeup events we're waiting for an ack on and which ones to ignore (much less the general ugliness of having to map userland's sense of wakeup sources to the actual kernel devices causing them).

I'm very interested in proposals on how to properly solve this on the userland side. However, so far folks seem to wave at it, and no one has shown a working proof of concept.

A new approach to opportunistic suspend

Posted Sep 29, 2011 7:58 UTC (Thu) by neilbrown (subscriber, #359) [Link]

I have to confess that I have only half-implemented this. The kernel running on my phone predates wakeup_count (2.6.29!) but when I get a GTA04....

Anyway this is how it would work - with lots of implementation details that are to some extent irrelevant, but make it more concrete.

There is a file /var/lock/suspend which is empty. Any process which cares about suspend takes a shared lock on this file and arranges to be notified when it changes (dnotify or inotify - choose your poison).
There is another file: /var/run/suspend_disabled. Any process that doesn't want a suspend to happen takes a shared lock on this file.

Some process that is responsible for initiating suspend either periodically tries to take a non-blocking exclusive lock on /var/run/suspend_disabled, or just sits waiting on a blocking exclusive lock.

When it gets a lock, it then drops it and starts the suspend cycle handshake. This involves:
1 - read /sys/power/wakeup_count
2 - create a new file /var/lock/suspend-next
3 - write anything to /var/lock/suspend
4 - get a blocking exclusive lock on /var/lock/suspend

Any process that is interested in suspend will now get notified that a suspend in imminent ([di]notify tells it, and it can check the size of the suspend file to confirm). It does any preparation that might be needed for suspend (tell the gsm chip to stop reporting cell changes, record the time of the next cron job in the RTC alarm), takes a shared lock on /var/lock/suspend-next, asks for change notification on that file, and closes /var/lock/suspend (thus dropping the lock).
If the preparation determined that something actually needed to be done *now* so a suspend should not happen it should take a shared lock on
/var/run/suspend_disabled before it closes /var/lock/suspend.

When all interested parties have actioned their part of the handshake the suspending thread will get the exclusive lock it wanted and:

5 - get a non-blocking exclusive lock on /var/run/suspend_disabled.
6 - if that succeeded, write the wakeup_count back to /sys/power/wakeup_count
7 - if that succeeded, write sleep to /sys/power/state
8 - unlock /var/run/suspend_disabled if it was locked
9 - rename /var/lock/suspend-next to /var/lock/suspend

This completes the cycle. Any client which asked for notification on /var/lock/suspend_disabled will find out that it has been renamed (so it now has a shared lock on /var/lock/suspend) and can do any wake-from-suspend processing such a telling the GSM chip it is allowed to be noisy again.

This way any process that cares can get race-free notification of suspend and resume so it can do any necessary processing, and it has the opportunity to abort the suspend before it commits. If any wakeup event happens after a thread does a check, that event will prevent the system from actually suspending, and the relevant thread will get another notification and another chance to consume the event before there is any chance of another suspend.

Your alarmtimer would need to hook into this somehow. Presumably there would be a cron-like task which gets notification of suspends, arranges an RTC wakeup before allowing suspend to complete, and holds a suspend_disabled lock from a couple of seconds before something needs to run (depending on how long a typical suspend/resume cycle is) until the task has completed.

So the "wait for woken task to run and ... stop it from suspending" is achieved by holding a lock which the pm-daemon must wait for before it is allowed to complete a suspend.

Does that make it clear and concrete enough? Is it convincing?

Current best guess at Android suspend-blocker requirements

Posted Oct 1, 2011 17:41 UTC (Sat) by PaulMcKenney (subscriber, #9624) [Link]

It is not clear to me how your approach handles wakeup events (for example, keyboard/button input). Your thought is to have the kernel drivers acquire locks on the files, or did you have something else in mind?

Also, are you assuming FIFO processing of file lock requests? In the absence of FIFO processing, I don't see what causes all outstanding requests to be processed between your step 4 and 5. (I don't know of any formal standard requiring that file locking process requests in FIFO order, but I might well be missing some fine print somewhere.)

Current best guess at Android suspend-blocker requirements

Posted Oct 1, 2011 21:46 UTC (Sat) by neilbrown (subscriber, #359) [Link]

Hi Paul,

I trust the kernel to handle wakeup events properly and to get them to a point where user-space can see them before allowing the wakeup event to be considered complete. Up thread John says "that can be protected using pm_stay_awake/pm_relax chaining, as I did for the RTC/alarmtimer paths" and I believe that he is talking about exactly that issue and believes it can be addressed - and I trust him in this.... and I just had a look at his patches and the relevant code and find my trust in John was not misplaced (though my trust in the kernel might be ... the kernel currently doesn't do the right thing with RTC wakeups ... but that is a bug that John is fixing, it is not a problem with the design).

The locking protocol relies on the difference between shared and exclusive locks. The clients (who want suspend notification) take shared locks. The service (which will initiate suspend) takes an exclusive lock which will wait for all shared locks to release. This provides the required ordering.

New client locks are directed to a different file: suspend-next (if 'suspend' is not empty, clients must not try to lock it, but should lock suspend-next instead).

I believe this foundation can be used to address all the concerns in your fine document. I'm not completely certain on "reasonable statistics in order to enable debugging and to enable reporting to endusers" as I don't know what the "reasonable statistics are". Presumably that is just statistics about who locks "suspend_disabled" and for how long. That isn't available directly in my scheme, but the locking of suspend_disabled is a trivial part of the API - quite separate from the race-free-suspend-notification - and could easily be replaced by something that carries more detailed information.

(Of course the whole API could be replaced with something using "binder", or sockets, or SUN-RPC or maybe even smoke signals. I just like filesystem APIs so that is what I tend to use. It is the sequencing and interactions that are important).

The interactions are:
- service multicasts "I'm going to suspend now - prepare yourselves"
- clients perform any preparation, which may include saying "please don't suspend now" on a separate channel.
- clients confirm they are prepared
- services waits for all confirmations. Then checks again if anyone is blocking suspend (via wakeup_count or checking user-space requests).
- If no-one is blocking suspend - enter suspend, and resume
- service tells all client "Ok, we are awake again - go about your business"
- clients act accordingly

Current best guess at Android suspend-blocker requirements

Posted Oct 2, 2011 8:09 UTC (Sun) by neilbrown (subscriber, #359) [Link]

It occurred to me that I didn't really answer your first question properly.

The auto-suspend daemon performs 2 checks to ensure that it is OK to suspend.
1/ The file locking handshake that I described in detail - this makes sure that user-space processes are happy with the suspend.
2/ the wakeup_count handshake. This ensures that the kernel is happy with the suspend.

The kernel has "wakeup_sources" that are similar to suspend blockers. When a wakeup_source is active, it means that suspend should not succeed.

When a wakeup_source is activated (normally in an interrupt handler), two global counters are incremented, and when it is deactivated (which should happen once the event is visible to user-space) one of them is decremented.

When the auto-suspend daemon reads from /sys/power/wakeup_count, the read blocks until the decrementing count is zero - so there are no active wakeup_sources. The other count is returned.
Subsequently - after checking with all of userspace - the daemon writes the number back to wakeup_count. If the corresponding counter has changed, the write fails. This means there has been another event and the daemon needs to check with user-space again.
If it succeeds then the kernel stores this number and the next attempt to suspend (echo mem > /sys/power/state) will only complete if there are no more wakeup events. i.e. if, right up to the moment of suspend, the event counter has the same value as the stored number that was written to /sys/power/state.

So the file-locking is only for user-space, and the wakeup event counting completely handles the kernel-event side.

Current best guess at Android suspend-blocker requirements

Posted Oct 2, 2011 22:24 UTC (Sun) by PaulMcKenney (subscriber, #9624) [Link]

Ah, understood. For whatever reason, I thought you were trying to re-implement John's patches as opposed to building a user-space API for them. Agreed, it would be quite hard to properly handle wakeup events in user space, given that the kernel needs to be involved in the wakeup/resume process!

Current best guess at Android suspend-blocker requirements

Posted Oct 2, 2011 23:43 UTC (Sun) by neilbrown (subscriber, #359) [Link]

Yes and no.

Some of John's patches were just bug-fixes. The wakeup event from the CMOS timer wasn't being counted as a wakeup event properly. Those patches I wasn't trying to re-implement - they are good.

The other patches from John did things to the scheduler so that distinguished processes could prevent suspend just by being busy. I was trying to show that these patches were not necessary because a completely race-free protocol is available in userspace.

(and finally there was a patch which tried to make /sys/power/state resistant to processes that didn't follow the wakeup_count protocol. That issue is best addressed by fixing the processes, or bind-mounting a file over /sys/power/state).

Current best guess at Android suspend-blocker requirements

Posted Oct 3, 2011 16:18 UTC (Mon) by PaulMcKenney (subscriber, #9624) [Link]

Cool! But I must defer to John and the Android guys at this point, as they would have a better idea of what works and does not in the Android environment.

My guess is that if they needed something lighter weight, this same general approach could be implemented within a device driver, though an implementation that just used existing user-space interfaces would be preferable, all else being equal. (Which it never is, but that is life!)

A new approach to opportunistic suspend

Posted Sep 29, 2011 18:22 UTC (Thu) by iabervon (subscriber, #722) [Link]

Surely, it's not true that your computer is set to suspend after 15 minutes of no X input; your computer is set not to suspend (by itself) within 15 minutes of X input. I think that the difference between these two similar statements is a large part of the reason that a lot of proposed solutions are useless in practice. If you say it the way your said it, the misbehavior you discuss is as specified. If you want your system to be glitch-free, I think you have to start by carefully describing glitch-free policies in your text.

(For that matter, I think the main problem that Android has is due to specifying the wakeup button in a glitchy way; I think it would solve everything to say that the wakeup button keeps the system on for a certain period of time, measured by kernel timer, during which period the user has a chance to starting using the system in a way that keeps it on longer but wouldn't be detectable while the system was off.)

A new approach to opportunistic suspend

Posted Sep 29, 2011 22:10 UTC (Thu) by neilbrown (subscriber, #359) [Link]

It is an important distinction that you draw. No application should be saying "nothing has happened for a while, time to suspend". Rather it should be able to say "something is happening, don't suspend just now".

However "should be" != "is"

The Power Management Preferences on my Gnome desktop have an option "Put compute to sleep when inactive for: ....". Assuming that text is representative of what is actually done, the g-p-m seems to be acting in exactly the wrong way.

I read the original lkml thread, and a big part of John's apparent motivation was to be able to intervene when g-p-m explicitly tries to put the system to sleep. Apparently a dbus API is coming but until then.....

It is actually fairly easy to interfere with g-p-m. Simply create an empty file and bind-mount it over /sys/power/state. Then when g-p-m tries to suspend the system it will write to the file instead. Some daemon can watch the file and when "mem" appears in it, check that nothing else wants to keep the system up, and only then write to /sys2/power/state (assuming sysfs is mounted at /sys2 as well).

It is a bit ugly, but it should allow effective prototyping of a sensible suspend manager until g-p-m comes to the party.

(and if there are any g-p-m developers listening who want to tell me that I'm completely wrong about g-p-m, I'd love to hear the details, thanks!)

A new approach to opportunistic suspend

Posted Sep 29, 2011 22:58 UTC (Thu) by iabervon (subscriber, #722) [Link]

I think having a userspace daemon that tracks why the system is on (and turns it off if there's no need for it to be on) would take care of a lot of the issues; however, I'm not entirely convinced that userspace can be race-free for this sort of thing. The other part of his situation is that he's worried about the fact that he'd have to poke the userspace daemon (whatever it is) from the hardware timer interrupt in order to make sure that the system doesn't suspend after the timer has fired (so the timer doesn't wake it) but before the daemon finds out that something needs the system up.

Anyway, I think the principle that most sensible policies specify "I need the system on/I don't need the system on", not "I want the system off" applies to the kernel. (For that matter, where there are sensible policies for "I want the system off", it's things like low battery or the lid being closed, where you want to keep the system suspended, regardless of ordinary wakeup events, until the particular issue that triggers the suspend is cleared.) So I don't see the logic in having the kernel API only offer a misleading operation, rather than letting userspace specify the policy in sensible terms and the kernel carry it out.

A new approach to opportunistic suspend

Posted Sep 29, 2011 23:32 UTC (Thu) by neilbrown (subscriber, #359) [Link]

Yes, exactly. The /sys/power/state api is wrong and backwards. But arguable the hardware/firmware api which it provides access to is busted as well. But plenty of interfaces are busted and we just have to work with them until well after the next-best-thing is perfected.

There is a new api which is per-device power management where the kernel is in control of everything instead of the firmware, and there is a school of thought which says we should forget suspend because both the hardware concept and the kernel API is broken - just use per-device power management and fix all the bugs.

However we still need to work with hardware that saves more power on "suspend" and /sys/power/state is the natural API to work with that. By itself is a racy API, so /sys/power/wake_count was added to make the races avoidable, which it does. Putting more work into make the API in the kernel for suspend/resume even "better" would be a mistake - the work should go to making per-device power-management work really well.

So yes: /sys/power/state is busted, but it is possible to create a user-space API on top of that which provides sensible semantics. So we should just do that rather than nagging the kernel devs to make life easier.

And I think you are wrong - as user-space solution using power/state and power/wake_count can be race free (modulo bugs). You just need to use power/wake_count correctly (see my post up-thread)

Think of system-wide suspend/resume as legacy. Yes it should work and work reliably, but it doesn't need to be pretty. per-device power states and power policy is the new way. We should use that were possible and only fall back on suspend/resume on legacy systems that require it (like .. uhmmm... x86). Android (and all others) are welcome to use suspend/resume and it should work. But they are encouraged to explore the various improvements possible with finer grained power management.

Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds