Kernel development

Brief items

Kernel release status

The current development kernel is 3.5-rc6, released on July 7. "There's mainly some btrfs and md stuff in here, with the normal driver changes, arm updates and some networking changes. And a smattering of random stuff (including docs etc). None of it looks very scary, it's all pretty small, and there aren't even all that many of those small changes." Linus also notes that the 3.6 merge window is likely to hit when a lot of developers are on vacation, so the 3.6 kernel might contain a relatively small set of changes.

Stable updates: 3.2.22 was released on July 5. The 3.2.23 update is in the review process as of this writing; it can be expected almost anytime.

Comments (none posted)

Quotes of the week

One man's idiom is another man's idiocy.

— Andrew Morton

Perhaps it's a typo, and was meant to be AArgh64

— Jan Ceuleers

Dunno, the moment Apple comes out with their 64-bit iPhone6 (or whichever version they'll go 64-bit) *everyone* will scramble to go 64-bit, for marketing reasons. Code and SoC details will be ported to 64-bit in the usual fashion: in the crudest, fastest possible way.

There's also a technological threshold: once RAM in a typical smartphone goes above 2GB the pain of a 32-bit kernel becomes significant. We are only about a year away from that point.

So are you *really* convinced that the colorful ARM SoC world is not going to go 64-bit and will all unify behind a platform, and that we can actually force this process by not accepting non-generic patches? Is such a platform design being enforced by ARM, like Intel does it on the x86 side?

— Ingo Molnar

A number of the developers all went to a climbing gym one evening, and I found myself climbing with another kernel developer who worked for a different company, someone whose code I had rejected in the past for various reasons, and then eventually accepted after a number of different iterations. So I've always thought after that incident, "always try to be nice in email, you never know when the person on the other side of the email might be holding onto a rope ensuring your safety."

— Greg Kroah-Hartman

Comments (none posted)

30 Linux Kernel Developers in 30 Weeks: Greg Kroah-Hartman (Linux.com)

Jennifer Cloer interviews Greg Kroah-Hartman for the Linux.com series. "I was an embedded software developer testing the device I was working on (a barcode scanner) with all different operating systems to ensure that I had gotten the USB firmware correct. Linux had very little USB support at the time, and I realized I could help out and contribute to make it work better. One thing led to another and I soon got a job doing Linux kernel development full time over 10 years ago and never looked back."

Comments (none posted)

Kernel development news

Btrfs send/receive

By Jonathan Corbet
July 11, 2012

The btrfs snapshot capability allows a system administrator to quickly capture the state of a filesystem at any given time. Thanks to the copy-on-write mechanism used by btrfs, snapshots share data with other snapshots or the "live" system; blocks are only duplicated when they are changed. While btrfs makes the creation and management of snapshots easy, it currently lacks the ability to efficiently determine what the differences are between two snapshots and save that information for future use. Given that some other advanced filesystems (ZFS, for example) offer that capability, btrfs can arguably be seen as falling a little short in this particular area.

Happily, that situation appears to be about to change, as Alexander Block's btrfs send/receive patch set has been well received by the development community. In short, with this patch set (and the associated user space tools), btrfs can be instructed to calculate the set of changes made between two snapshots and serialize them to a file. That file can then be replayed elsewhere, possibly at some future time, to regenerate one snapshot from the other.

This functionality is implemented with the new BTRFS_IOC_SEND ioctl() command. In its simplest form, this operation accepts a file descriptor representing a mounted volume and the subvolume ID corresponding to the snapshot of interest; it will then find the changes between the snapshot and the "parent" snapshot it was generated from. There are more options, though:

The operation can actually take a list of snapshot/subvolume IDs and generate a combined file for all of them.
The parent snapshot can be specified explicitly. That may be required for older btrfs volumes that lack the needed identifying information. It may also be useful to generate differences that skip over a set of snapshots — differences from a grandparent, say, instead of the direct parent.
The command also accepts an optional list of "clone sources." Those are subvolumes that can be expected to exist on the receiving side; when possible, data blocks will be "cloned" from those snapshots rather than being written into the differences file. That reduces the size of the differences, and enables better data sharing on the receive side.

The generated file is essentially a set of instructions for converting the parent snapshot into the one being "sent." The list of commands is surprisingly long, including operations like create a file (or directory, device node, FIFO, symbolic link, ...), rename or link a file, unlink a file, set and remove extended attributes, write data, clone data blocks, truncate a file, change ownership and permissions, set file times, and so on. The code that generates this file is also surprisingly long, being several thousand lines of complex, nearly uncommented functions (some of the comments that do exist, saying things like "magic happens here," are not entirely helpful).

Interestingly, according to the patch introduction, the custom file format was not in the original plan. Instead, the output was meant to be in something close to the tar file format — close enough that the tar command could be used to extract data from it. Tar turned out not to have the needed capabilities, though, so a new format was created. The format should be considered to be in flux still, though, clearly, it will need to stabilize before this feature can be considered ready for production use. As it happens, the playback of this file can be done almost entirely in user space, so there is no need for a BTRFS_IOC_RECEIVE operation.

At the command level, using this feature can be as simple as:

    btrfs send snapshot

This will send the given snapshot (in its entirety) to the standard output stream. Writing the command as:

    btrfs send -i oldsnap snapshot

will cause the creation of an incremental send containing just the differences from oldsnap. The receive command can be used to apply a file created by btrfs send to an existing filesystem.

The primary use case for this feature (which is clearly patterned after the ZFS send/receive functionality) is backups in various forms. A cron job could easily send a snapshot to a remote server on a regular basis, maintaining a mirror of a filesystem there. The send files can simply be stored as backups; an entire volume can be sent as a full backup, while snapshots are easily sent as incrementals. With some additional tooling, the send/receive feature could develop into an advanced backup capability with low-level support from the underlying filesystem.

That is for some time in the future, though; the feature is currently experimental, and Alexander warns potential users:

If you use it for backups, you're taking big risks and may end up with unusable backups. Please do not only count on btrfs send/receive backups!

That said, there seems to be a fair amount of interest in this feature (btrfs creator Chris Mason described it as "just awesome"), so chances are it will be worked into reasonable shape relatively quickly. Then btrfs will have one more useful feature and one less reason to be concerned about comparisons with that other filesystem.

Comments (8 posted)

Supporting 64-bit ARM systems

By Jonathan Corbet
July 10, 2012

ARM is one of the most successful processor architectures ever created; most of us possess several ARM cores for every x86 processor we have. ARM is very much thought of as an embedded systems processor; it is focused on minimal power use and the ability to be built into a variety of system-on-chip configurations. The "small systems" image of ARM is certainly encouraged by the fact that ARM processors are all 32-bit only. That situation is about to change, though, with the arrival of 64-bit ARM processors. Linux will be ready for these systems — the first set of 64-bit ARM support patches have just been posted — but there is still some debate around a couple of fundamental decisions.

One might well wonder whether a 64-bit ARM processor is truly needed. 64-bit computing seems a bit rich even for the fanciest handsets or tablets, much less for the kind of embedded controllers where ARM processors predominate. But mobile devices are beginning to push the memory-addressing limits of 32-bit systems; even a 1GB system requires the use of high memory in most configurations. So, even if the heavily foreshadowed ARM server systems never materialize, there will be a need for 64-bit ARM processors just to be able to efficiently use the memory that future mobile devices will have. "Mobile" and "embedded" no longer mean "tiny."

Naturally, Linux support is an important precondition to a successful 64-bit ARM processor introduction, so ARM has been supporting work in that area for some time. The initial GCC patches were posted back in May, and the first set of kernel patches was posted by Catalin Marinas on July 6. All this code exists despite the fact that no 64-bit ARM hardware is yet available; it all has been developed on simulators. Once the hardware shows up, with luck, the software will work with a minimum of tweaking required.

64-bit ARM support involves the addition of thousands of lines of new code via a 36-part patch set. There are some novel features, such as the ability to run with a 64KB native memory page size, and a lot of important technical decisions to be reviewed. So the kernel developers did what one would expect: they started complaining about the name given to the architecture. That name ("AArch64") strikes many as simultaneously redundant (of course it is an architecture) and uninformative (what does "A" stand for?). Many would prefer either ARMv8 (which is the actual hardware architecture name—"AArch64" is ARMv8's 64-bit operating mode) or arm64.

Arguments in favor of the current name include the fact that it is already used to identify the architecture in the ELF triplet used in binaries; using the same name everywhere should help to reduce confusion. But, then, as Arnd Bergmann noted: "If everything else is aarch64, we should use that in the kernel directory too, but if everyone calls it arm64 anyway, we should probably use that name for as many things as possible." Jon Masters added that, in classic contrarian style, he likes the name as it is; Fedora is planning to use "aarch64" as the name for its 64-bit ARM releases. Others, such as Ingo Molnar, argue in favor of changing the name now when it is relatively easy to do. Catalin seems inclined to keep the current name but says he will think about it before posting the next version of the patch series.

An arguably more substantive question was raised by a number of developers: wouldn't it make more sense to unify the 32-bit and 64-bit ARM implementations from the outset? A number of other architectures (x86, PowerPC, SPARC, and MIPS) all started with separate implementations, but ended up merging them later on, usually with some significant pain involved. Rather than leave that pain for future ARM developers, it has been suggested that, perhaps, it would be better to start with a unified implementation.

There are a lot of reasons given for the separate 64-bit ARM architecture implementation. Much of the relevant thinking can be found in this note from Arnd. The 64-bit ARM instruction set is completely different from the 32-bit variety, to the point that there is no possibility of writing assembly code that works on both architectures. The system call interfaces also differ significantly, with the 64-bit version taking a more standard approach and leaving a lot of legacy code behind. The 64-bit implementation hopes to leave the entire 32-bit ARM "platform" concept behind as well; indeed, as Jon put it, there are hopes that it will be possible to have a single kernel binary that runs on all 64-bit ARM systems from the outset. In general, it is said, giving AArch64 a clean start in its own top-level hierarchy will make it possible to leave a lot of ARM baggage behind and will result in a better implementation overall.

Others were quick to point out that most of these arguments have been heard in the context of other architectures. x86_64 was also meant to be a clean start that dumped a lot of old i386 code. In the end, things have turned out otherwise. It may be possible that things are different here; 32-bit ARM has rather more legacy baggage than other architectures did, and the processor differences seem to be larger. Some have said that the proper comparison is with x86 and ia64, though one gets the sense that the AArch64 developers don't want to be seen in the same light as ia64 in general.

This decision will come down to what the AArch64 developers want, in the end; it's up to them to produce a working implementation and to maintain it into the future. If they insist that it should be a separate top-level architecture, it is unlikely that others will block its merging for that reason alone. Of course, it will also be up to those developers to manage a merger of the two in the future, should that prove to be necessary. If nothing else, life as a separate top-level architecture will allow some experimentation without the fear of breaking older 32-bit systems; the result could be a better unified architecture some years from now, should things move in that direction.

Thus far, there has been little in the way of deeper technical criticism of the AArch64 patch set. Things may stay that way. The code has already been through a number of rounds of private review involving prominent developers, so the worst problems should already have been found and addressed. Few developers have the understanding of this new processor that would be necessary to truly understand much of the code. So it may go into the mainline kernel (perhaps as early as 3.7) without a whole lot of substantial changes. After that, all that will be needed is actual hardware; then things should get truly interesting.

Comments (39 posted)

Linux power management: The documentation I wanted to read

July 10, 2012

This article was contributed by Neil Brown

Last week we discussed three elements that might serve to guide the creation of introductory technical documentation. This week we put those elements to the test by using them to create some introductory documentation for Linux power management. For me, this exercise precisely answers the question "What were you looking for that you didn't find?", as it is the documentation I would have liked to read.

This documentation is necessarily incomplete, partly because my own experience is not yet broad enough to provide a comprehensive document, and partly because doing so might try the patience of the present readership. As such it stops short of delving into the details of hibernation and completely omits any treatment of quality-of-service and wakeup sources, all of which would have an important place in a more complete document. Fortunately there are still sufficient topics to showcase the presentation of structure, purpose, and examples.

Three perspectives on Linux power management

The power management infrastructure in Linux is quite complex, but hopefully not intractably so. To get a handle on this complexity it is helpful to view it from three different perspectives. The first perspective highlights the different holistic states of the system which roughly divide into "in use", "not in use", and "indefinitely not in use", corresponding to "run time power management", "suspend" and "hibernate". One of the distinctions between these is the size of the power switch. The first uses lots of little power switches at different times, while the last turns off everything all at once (except maybe a real-time clock or similar).

The second of these states is somewhat harder to define. It covers a range of states which are not easy to clearly differentiate. At one end of the spectrum we have the traditional "suspend" mode of a laptop, which is almost like hibernation but uses a little more power and is a little quicker to get into and out of. Once the laptop has entered suspend it really must stay there using minimal power until it is explicitly wakened, as it might have been placed in a padded case for transport and any increase in power usage could result in over-heating and damage. This state is often entered with help from BIOS firmware so, to the OS, it is a bit like a single power switch which transitions from "on" to "suspend".

At the other end of the spectrum is the way that "suspend" is used in the Android mobile platform and similar devices. These devices are expected to wake up spontaneously for various reasons, whether due to an incoming phone call, a reminder alarm, or just a periodic check for new email or software updates. Management of power and temperature is generally better than notebooks so the risk of over-heating is not present. There is normally little or no firmware and the entire power-management transition is handled by the OS, so it is responsible for turning off each individual device in the correct order, and then restoring them again later.

Between these extremes of a light hibernation and a heavy snooze there is room for other possibilities. A server might use a BIOS-based suspend to save power after arranging for wake-ups via wake-on-LAN or a realtime clock alarm. This can be seen as a deeper sleep than an Android phone normally enters, but not as deep as the laptop in its padded case. The "suspend" mode in Linux attempts to cater to all of these and that flexibility leads to some of the complexity.

The second perspective highlights the broad variety in components that need to be managed. Some, like rotating disk drives, have a high cost in power and time for turning off and on again, while others like an LED have essentially no cost. Some, such as a UART, need to either be off or sufficiently on to be able to accept full-rate data at any moment. Others, such as USB, can enter intermediate states where they can receive external signaling, but are free to take some time to fully wake up.

Other sources of variability include the level of independence from other devices, the degree of involvement of user-space in management of the device, and how power is routed - whether through the same bus that commands and data flow, or through some separate regulator or "power domain". These are just some of the ways that devices can vary and thus some of the issues that Linux power management needs to be prepared for.

The final perspective highlights the different stages on the way towards a low-power state, and on the way back to full functionality. The key elements of the low-power transition are to move all relevant components to a quiescent state, to record that state, then to stop powering some or all of the components; similar elements apply on the way back up. The details of managing all the aforementioned complexity through this simple transition means that we have quite a few stages as we will shortly see.

Two Understandings

Part of understanding the solution to managing this complexity is understanding the balance that has been chosen between a "mid-layer" solution and a "library" solution. That is, how much responsibility for correct behavior and sequencing is taken by the core code and imposed on the drivers, and how much of the responsibility is left in the hands of the drivers. Centralizing responsibility tends to be safe but inflexible, while distributing it is risky but versatile. Linux power management takes a middle road, so it is important to understand where each responsibility lies.

The main imposition made by the PM core is the over-all sequencing of suspend and resume. Allowing individual drivers to take a more active role in this process would probably require a general dependency solver and would undoubtedly make debugging a lot harder. In contrast, choices that are local to a specific device, such as timeouts before power management activates, or the use of a separate thread for performing power management actions are actively supported by the core without being imposed on drivers that don't want them.

One other imposition, which will be raised again later, involves interaction with interrupts. The PM core strongly encourages a specific sequencing, but does provide hooks for a driver to escape it if absolutely necessary.

Understanding Linux power management also requires knowing how devices are classified in Linux. The most obvious classification is shown by the "subsystem" link that can be found in the sysfs entry for the device. This points to either a "bus" or a "class" that the device belongs to. This subsystem roughly describes the interface that the device provides. Together with this can be a "device type" which allows further specialization. A simple example is that members of the class "block" - which are block devices such a disk drives - can be of type "disk" or type "partition" reflecting the fact that both the whole device and each individual partition is a block device, but that they do have some specific behaviors that are quite different.

Finally each device can have a "power domain" (or pm_domain). This is an abstraction that is currently only used for ARM SoC modules and represents the fact that different collections of devices within the SoC can be powered on or off together, thus the power domain may need to know when each device changes power state so it can re-evaluate or adjust the overall state of the domain.

These classifications are used to direct all the power management calls that are described below. If a device has a power domain, it gets to handle the call. If not, but the type, class, or bus declares any PM operations, those operations get to handle the call, otherwise the call is handled by the driver for the device. The PM core doesn't attempt to call all of the possible handlers for a particular device, just the first that is found. This is an example of distribution of responsibility. The first handler has the freedom to call more specific handlers, or to do all the work itself, and equally has the responsibility to ensure all required handlers are called.

For example, the power domain handler for the OMAP platform (in arch/arm/plat-omap/omap_device.c) calls the driver-specific handler (bypassing any subsystem handlers) before or after doing any OMAP-specific handling. The MMC bus handlers call into driver-specific handlers which are stored in a non-standard location - presumably for historical reasons.

With these perspective and understandings in place, we can move on to some specifics.

Runtime PM

Runtime power management has the fewest states and so is probably the best place to start digging into details. This is the part of Linux PM that manages power for individual devices without taking the whole system into a low-power state.

In this case the most interesting stage of the transition to lower power is "move to quiescent state". Once that is done there is one method call (runtime_suspend()) which combines "record current state" and "remove power", and another (runtime_resume()) which must restore power and reload any needed device state.

For runtime PM, the "move to quiescent state" transition is a cause, not an effect - the new state isn't requested, it is simply noticed. The PM core keeps track of the activity of each device using two counters and an optional timer. One counter (usage_count) counts active references to the device. These may be external references such as open file handles, or other devices that are making use of this one, or they may be internal references used to hold the device active for the duration of some operation. The other counter (child_count) counts the number of children that are active. The timer can be used to add a delay between the counters reaching zero and the device being considered to be idle. This is useful for devices with a high cost for turning on or off.

This "autosuspend" timer is not widely used at present, with only nine drivers calling pm_runtime_put_autosuspend() to start the timer, while 14 call pm_runtime_set_autosuspend_delay() which sets the timeout (though that can be set via sysfs). One user is the omap_hsmmc driver for the High Speed Multi-Media Card interface in OMAP processors. It sets a 100ms delay before declaring a device to be truly idle, presumably due to costs in activating and deactivating the cards.

The counter of active children can optionally be ignored when determining whether a device is idle. Normally the parent is needed to access the child - typically the parent is a bus sending commands to the child - so powering down the parent while children are active would be counterproductive. Sometimes it is useful though.

One good example is an I2C bus. I2C (inter-integrated circuit) is a very simple 2-wire bus for signaling between integrated circuits on a board. It doesn't carry power, only a clock signal and a bidirectional data signal. The bus is entirely master-driven. Slaves (which appears as children in the Linux device tree) cannot signal the master directly at all, they simply respond to commands from the master.

As an I2C controller is very cheap to turn on before a command is sent, and off after the response is received, there is no need to keep it powered just because its child (which could be a sensor that is monitoring the environment and may have a higher turn-on cost) is left on. Consequently some I2C controllers, such as i2c-nomadik and i2c-sh_mobile use pm_suspend_ignore_children() to allow them to report as idle even when they have active children.

When a device is deemed to be idle by the above criteria its runtime_idle() method is called. This function will normally perform any further checks (as does usb_runtime_idle()) and possibly call pm_runtime_suspend() to initiate the change in power state. For a slight variation, lnw_gpio_runtime_idle() in the gpio-langwell.c driver doesn't call pm_runtime_suspend() directly but rather calls pm_schedule_suspend() with a 500ms delay. Presumably this design predates the introduction of the autosuspend feature.

There is one class of devices that does not follow this structure for power management, and that is the CPU. The general pattern of entering a quiescent state, recording state information, and reducing power usage is the same, however the particular implementation is vastly different. This is partly due to the uniquely central role that the CPU plays, and partly due to the fact that a CPU typically has many more levels and styles of power saving. Runtime PM for the CPU is implemented using the cpuidle, cpufreq, and CPU hotplug subsystems, which will not be discussed further here; see this article for an introduction to cpuidle.

System Suspend

It can be helpful to view the "suspend" process as forcing all devices into a quiescent state, and then simply allowing runtime power management to put them all to sleep. The last to go to sleep would be the CPU (or CPUs) under the guidance of "cpuidle". While this isn't the way it is actually implemented, it provides a perspective which exposes the relationship between suspend and runtime PM quite well.

There are several reasons for not implementing it this way. Possibly the most unavoidable is that PM_RUNTIME and SUSPEND are separate kernel config options and there is a desire to keep it that way, so neither can rely on the other being present. There is also the fact that a BIOS (such as ACPI) might be involved in one or the other and may impose different handling requirements. Finally, individual drivers might want to make different decisions based on what sort of power management is happening, so it is generally best to tell them what is actually happening, rather than pretending that one thing is a form of another.

Forcing devices into a quiescent state has an important difference from just allowing them to get there on their own - any interdependencies between devices need to be explicitly handled. Linux PM has chosen to manage this by having a clear sequence of steps for transitioning to low power, and an explicit ordering of devices so that they make each step in a well defined order.

The ordering (stored in dpm_list linked through dev->power.entry) is normally the order in which devices are registered, with new devices added to the end, thus being after any devices that they depend on. However it is possible to reorder the list using device_move() which gives a device a new parent, and can place it directly after that parent, or at the end of the list. For example, when an rfcomm tty-over-bluetooth device is opened, a bluetooth connection is created and the tty device is reparented under the relevant bluetooth device and placed at the end of the device list.

The first stage of suspend, after some preliminaries like calling "sync" to flush out dirty data and switching to a separate virtual console, is to move all processes into a quiescent state. Devices which interact closely with processes need a chance to have one last chat before their process goes to sleep and this is achieved by registering a "notifier" which gets called before processes are put to sleep, and again when they are woken up.

This is variously used to:

load firmware in case it will be needed during resume

copy device state to swappable memory as may be needed when the device state can be enormous such as video RAM (drivers/gpu/drm/vmwgfx)

avoid deadlock when interacting with sysfs (drivers/acpi/battery.c)

preemptively remove devices that might be removed while the system is suspended, so appropriate cleanup can happen (drivers/mmc/core)

and a few other minor tasks. Once these notifiers run, all processes are sent a special signal which results them being moved to the "freezer" where they are forced to wait for system resume to happen.

Once all processes are quiescent, the next step is to instruct all devices to also become quiescent. To do this we need to walk the list in reverse order, putting children to sleep before their parents - as the parent may be needed to help put the child to sleep. However as a new child could be born at any moment (e.g. due to a device being plugged in), and as children get added to the end of the list, we might miss some children on the first pass. To avoid this, the PM core makes two passes over the list. The first pass starts at the beginning and simply asks devices to stop adding children by calling their "prepare()" method. If children are born during this time they can only be added after the current pointer in the list, and so will not be missed. Once this is complete we know that no new devices will be added, so the list is walked in the reverse order calling the "suspend" method.

The "suspend" method is actually three separate methods, suspend(), suspend_late(), and suspend_noirq(), which can share among themselves the three tasks of making the device quiescent, saving any state, and reducing power usage. How much of which task is allocated to which methods is largely up to each driver providing that the division works with the calling patterns of the three methods.

Calls to these methods are made to all devices in child-before-parent order and the sets of calls are interleaved with system-wide suspend operations, made largely through the suspend_ops dispatch table. The ordering is roughly:

system wide begin()
per-device suspend()
system wide prepare()
per-device suspend_late()
system wide disable (almost) all interrupt handlers
per-device suspend_noirq()
system wide prepare_late()
disable nonboot CPUs
syscore_suspend()
system wide enter()

Note that it is possible for the sequence from system wide prepare() onwards to be repeated (after being unwound by corresponding "resume" actions) without going all the way up to fully awake and starting the sequence from the top. This happens if the suspend_again() suspend operation requests it. Currently this is only requested by the charger manager which often needs to wake up parts of the system to check battery charging state, without wanting the cost of a full wakeup.

Deducing the purpose of these method calls by looking for example usage in the code is problematic for a number of reasons.

For the system-wide operations (begin(), prepare(), prepare_late()), there are few users and those that exist do not make their purpose clear to an untrained observer. The most complete user is ACPI, so possibly a full understanding of that specification would help. Unfortunately that is beyond the scope of this article (and of this author).
In general, ACPI recommends specific procedures for entering and leaving system sleep states (such as suspend) and Linux PM was modeled on that and then adjusted to meet broader needs. For example, prepare_late() was added to resolve a conflict between the needs of ACPI and the needs of the ARM platform.

For the per-device operations, suspend_late() was only recently added (commit cf579dfb82550e3) and there are no users in Linux-3.4. So any examples we find may be working around the absence of suspend_late() and so should not be copied.

The initial reason for producing this document was finding code in drivers that simply wasn't working correctly and trying to understand what "correctly" might mean. Those drivers clearly cannot be used as good examples and there is evidence that other drivers aren't always doing the right thing, so any example may be equally faulty.

Examining the documentation brings a little more useful information.

suspend() should leave the device in a quiescent state

suspend_late() can often be the same as runtime_suspend() (see also commit cf579dfb82550e3)

suspend_noirq() happens after interrupts are disabled and is useful when shared interrupts are used as you can be certain that the interrupt handler will not be called after suspend_noirq() runs. Some interrupts, such as the timer interrupt, are not disabled.

One observation from the code that seems to be important before we try to paint the big picture is that, after calling the suspend() method on a device, runtime power management is disabled. The purpose of this seems to be to stop runtime PM from racing with system suspend PM - we really don't want two threads trying to power off a device at once, and this is the interlock that prevents that. It also prevents runtime PM from powering the device back on again, so any device that might be needed in the late states of power management needs to be left on when runtime PM is disabled.

Tying all these threads together we get that:

The suspend() method should cause the device to stop doing anything, and enter a state much like it would be just before runtime PM might decide to turn it off. So it should wait for any DMA requests to complete and ensure new ones won't start. It should stop transmitting information and ensure that incoming information is either ignored, or triggers a wake-from-suspend (possibly marking the interrupt for wakeups). It should cancel any timers and generally prepare for nothing to happen for a while.
If the device might be needed to power down other devices, such as an I2C controller that might be needed to tell some regulator to turn off, then the device should be activated for runtime PM purposes so that it will still be active when runtime PM is disabled.
Part of the task of ignoring incoming information is to ensure that no new children will be created much like the runtime PM prepare() method does. Having new devices appear after suspend() would be awkward.

The suspend_late() method should power off the device in much the same way that runtime_suspend() does, and it may be exactly the same routine as runtime_suspend(). Occasionally preparing the device to wake up may differ between the system suspend and runtime PM cases. This would be one situation where suspend_late() might need to be different from runtime_suspend().
The only case where suspend_late() should not be used is where interrupts might still be delivered, but the interrupt handler cannot tolerate the device being off. In many cases the suspend() routine will have put the device in a state in which it will not generate interrupts. Likely exceptions to this are when the interrupt line is shared, or when the device supports wake-from-suspend and so deliberately does not disable interrupts.
If the platform that the device runs on uses BIOS support to enter suspend, then it is possible that this support will power off the device, so suspend_late() does not need to bother. If it doesn't, it could still be that the device gets powered off by instructing the BIOS to effect the state change, and it may require different power-off procedures for runtime PM and for entering suspend. If this is the case, then suspend_late() will quite likely be very different from runtime_suspend().

The suspend_noirq() method is an alternative to suspend_late() but is run without interrupts enabled. It is unlikely that any driver will provide both methods.
Having interrupts disabled means not only that an interrupt will not occur at an awkward time, but also that using any functionality that requires interrupts will not work. So if the driver uses an I2C bus or similar to tell the device to turn off, and if the I2C bus uses interrupts to indicate completion (which is normal), then either the device must be powered-off in suspend_late, or the I2C interrupt must be marked IRQF_NO_SUSPEND.

Paired with each of these methods is a method that is called when returning back towards full-functionality: resume_noirq(), resume_early() and resume(). These simply do the reverse of what the corresponding "suspend" function did.

Closing

Structure, purpose, and examples - these seem to be the elements that distinguish good documentation and enable the reader not just to collect knowledge but to gain understanding. I'll leave you, dear reader, to be the judge of whether their presence here is sufficient to bring an understanding of power management, or indeed an understanding of quality documentation.

I would like to thank Rafael Wysocki for valuable review of an early draft of this article.

Comments (5 posted)

Patches and updates

Kernel trees

Linus Torvalds Linux 3.5-rc6 ?

Ben Hutchings Linux 3.2.22 ?

Steven Rostedt 3.2.22-rt34 ?

Architecture-specific

Catalin Marinas AArch64 Linux kernel port ?

Kelvin Cheung MIPS: Add board support for Loongson1B ?

Joerg Roedel AMD IOMMU interrupt remapping support v2 ?

Core kernel code

Chen RIFS-V3-Test For 3.4.x kernel. ?

Tejun Heo workqueue: reimplement high priority using a separate worker pool ?

Michel Lespinasse rbtree updates ?

John Stultz Fix for leapsecond caused hrtimer/futex issue (updated) ?

Development tools

Anton Vorontsov KGDB/KDB FIQ (NMI) debugger ?

David Herrmann fblog: framebuffer kernel log driver ?

Anton Vorontsov Function tracing support for pstore ?

Yan, Zheng perf/x86: Add Intel Nehalem-EX/Westmere-EX uncore support ?

Device drivers

Sangbeom Kim Initial release of Samsung S2MPS11 pmic driver ?

Christian Riesch Add a driver for the ASIX AX88172A ?

Qiao Zhou add 88pm80x mfd driver ?

Javier Martin Add support for 'Coda' video codec IP. ?

Or Gerlitz Add Ethernet IPoIB driver ?

Benson Leung Input: add driver for Cypress APA I2C Trackpad ?

Filesystems and block I/O

Lin Ming : block layer runtime pm ?

Namjae Jeon fat: Support fallocate on fat. ?

vgoyal@redhat.com [V4] block: Support online resize of disk partitions ?

Dmitry Monakhov RFC: introduce extended inode owner identifier v10 ?

Memory management

Johannes Weiner mm: memcg: charge/uncharge improvements ?

Yasuaki Ishimatsu memory-hotplug : hot-remove physical memory ?

Maarten Lankhorst Dmabuf synchronization ?

Networking

Eric Dumazet tcp: TCP Small Queues ?

Virtualization and containers

Jason Wang Multiqueue virtio-net ?

Page editor: Jonathan Corbet
Next page: Distributions>>