
Kernel development

Brief items

Kernel release status

The current development kernel is 3.14-rc2, released on February 9. Linus noted that the patch volume has been light, but worried that kernel developers are lurking in the background waiting to dump more stuff on him. "Because I know kernel developers, and they are sneaky. I suspect Davem (to pick somebody not at random) is giggling to himself, waiting for this release message, planning to send me some big-ass pull request tomorrow."

Stable updates: 3.13.2, 3.12.10, 3.10.29, and 3.4.79 were released on February 6.

The 3.13.3, 3.12.11, 3.10.30, and 3.4.80 updates are in the review process as of this writing. Greg notes that these updates may be a bit more problematic than some:

Some -stable releases spring out from my build system bright and shiny and ready to go. Not so with these releases. Maybe it's the horrid weather that was happening during the creation of these kernels, or something else, but whatever it was, they came into this world screaming, kicking, killing build servers left-and-right, and breaking the build every other patch. [...]

Test these out well, they have barely survived my systems, and I don't trust them in the slightest to not eat your disks, reap your tasks, and run away laughing as your CPU turns into a space heater.

Assuming the carnage turns out not to be that bad, these updates can be expected on or after February 13.

3.2.55 is also in the review process, with a release expected on or after the 14th.


Quotes of the week

This could be a cool drinking game. Every time you fix someone else's sparse error, they have to drink a pint.
Steven Rostedt

... which game would result in a heightened mood amongst developers and even more Sparse errors, resulting in more fixes from PeterZ and more pints downed, creating a nice feedback loop. Sounds like fun for everyone, except Peter?
Ingo Molnar

We don't really *have* a good way of deprecation, this is the problem. Usually it doesn't happen until we find out that a bug snuck its way in and "X hasn't worked for N releases now, and no one has noticed."
H. Peter Anvin on subarchitecture deprecation policy

We plan to remove 31 bit kernel support of the s390 architecture in the Linux kernel sources.

The reason for this is quite simple: the current 31 bit kernel was broken for nearly a year before somebody noticed.

Heiko Carstens using it in practice


Kernel development news

Controlling device power management

By Jonathan Corbet
February 12, 2014
The kernel's power management code works to keep the hardware in the most power-efficient state that is consistent with the current demands on the system. Sometimes, though, overly aggressive power management can interfere with the proper functioning of the system; putting the CPU into a sleep state might wreck ongoing DMA operations, for example. To avoid situations like that, the pm_qos (power management quality of service) mechanism was added to the kernel; using pm_qos, device drivers can describe their requirements to the power-management subsystem. More recently, we have seen a bit of a change in focus in pm_qos, though, as it evolves to handle power management within peripheral devices as well.

A partial step in that direction was taken in the 3.2 development cycle, when per-device constraints were added. Like the original pm_qos subsystem, this mechanism is a way for devices to specify their own quality-of-service needs; it allows a driver to specify a maximum value for how long a powered-down device can wait to get power back when it needs to do something. This value (called DEV_PM_QOS_LATENCY in current kernels) is meant to be used with the power domains feature to determine whether (and how deeply) a particular domain on a system-on-chip could be powered down.

The quest for lower power consumption continues, though, and, as a result, we are seeing more devices that perform their own internal power management based on the access patterns they observe. Memory controllers might put some banks into lower power states if they are not seeing much use, for example; this technology seems to work well enough to take much of the wind out of the sails of the various memory power management patch sets out there. Disk drives can spin themselves down, camera sensors can turn themselves off, and so on. Peripherals do not have as good an idea of future access patterns as the host computer should, but, it turns out, they can often do a good job of guessing based on the recent past.

That said, there will certainly be times when a device will decide to take a nap at an inopportune moment. To help avoid this kind of embarrassing situation, many devices that have internal power management provide a way for the host system to communicate its latency needs to the device. If such a device has been informed by the CPU that it should respond with a latency no greater than, say, 10ms, it will not go into any sleep states that would take longer to come back out of.

Current kernels have no formalized way to control the latency requirements communicated to devices. That situation could change as early as the 3.15 development cycle, though, if Rafael Wysocki's latency tolerance device pm_qos type patch set finds its way into the mainline. This work uses much of the existing pm_qos framework, but to a different end: rather than allowing drivers to communicate their requirements to the power management core, this mechanism carries latency requirements back to drivers.

The first step is to rename DEV_PM_QOS_LATENCY, which, it could be argued, has an ambiguous name in the new way of doing things. The new name (DEV_PM_QOS_RESUME_LATENCY) may not be that much clearer to developers reading the code from the outside, but it does make room for the new DEV_PM_QOS_LATENCY_TOLERANCE value. As noted above, this pm_qos type differs from the others in that it communicates a tolerance to a device; it also differs in that it is exposed to user space. Any device that supports this feature will have a new attribute (pm_qos_latency_tolerance_us) in its sysfs power directory. A specific latency value (in µs) can be written to this attribute to indicate that the device must be able to respond in the given period of time. There are two special values as well: "auto", which puts the device into its fully automatic power-management mode, and "any", which does not set any specific constraints, but which tells the hardware not to adjust its latency tolerance values in response to other power-management events (transitions to and from a suspended state, for example).

Device power management information is stored in struct dev_pm_info which, in turn, is found in struct device. Devices supporting DEV_PM_QOS_LATENCY_TOLERANCE need to provide a new function in that structure:

    void (*set_latency_tolerance)(struct device *dev, s32 tolerance);

Whenever the latency tolerance value changes, set_latency_tolerance() will be called with the new value. The special tolerance value PM_QOS_LATENCY_ANY corresponds to the "any" value described above. Otherwise, a negative tolerance value indicates that the device should be put into the fully automatic mode.
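As a sketch of how a driver-side callback might act on these values, consider the following mock-up. The structure, the sentinel value, and the latency table here are invented for illustration only; the real callback hangs off struct dev_pm_info and uses kernel types.

```c
/* Illustrative stand-ins; these are NOT the kernel's definitions. */
#define PM_QOS_LATENCY_ANY  (-2)   /* hypothetical "any" sentinel */

struct mock_device {
	int deepest_allowed_state;  /* index into resume_latency_us[] */
	int locked;                 /* "any": freeze the current setting */
};

/* Hypothetical table: resume latency (in µs) of each device sleep state,
 * ordered from shallowest (fastest resume) to deepest. */
static const int resume_latency_us[] = { 0, 100, 1500, 20000 };
#define NSTATES 4

/* Pick the deepest sleep state whose resume latency fits the tolerance.
 * Per the semantics described in the article: a negative tolerance means
 * fully automatic mode, and "any" freezes the current configuration. */
static void set_latency_tolerance(struct mock_device *dev, int tolerance_us)
{
	int i, deepest = 0;

	if (tolerance_us == PM_QOS_LATENCY_ANY) {
		dev->locked = 1;	/* do not adjust on other PM events */
		return;
	}
	dev->locked = 0;
	if (tolerance_us < 0) {		/* fully automatic mode */
		dev->deepest_allowed_state = NSTATES - 1;
		return;
	}
	for (i = 0; i < NSTATES; i++)
		if (resume_latency_us[i] <= tolerance_us)
			deepest = i;
	dev->deepest_allowed_state = deepest;
}
```

With a 2000µs tolerance, such a device could use the 1500µs state but not the 20ms one; a tolerance of 50µs would pin it to the fully active state.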

In many cases, driver authors will not need to concern themselves with providing this callback, though. Instead, it will be handled at the bus level, perhaps in combination with the firmware. The initial implementation posted by Rafael takes advantage of the "latency tolerance reporting" registers provided via ACPI by some Intel devices; for such devices, the power management implementation exists in the ACPI code and need not be duplicated elsewhere.

The final step is to actually make use of this feature when hardware that supports it is available. Such use seems most likely to show up in mobile systems and other dedicated settings where the software can easily be taught to tweak the latency parameters when the need arises. Writing applications that can tune those parameters on a general-purpose system seems like a harder task. But, even there, when the hardware wants to do the wrong thing, there will be a mechanism to set it straight.

Comments (none posted)

Flags as a system call API design pattern

February 12, 2014

This article was contributed by Michael Kerrisk.

The renameat2() system call recently proposed by Miklos Szeredi is a fresh reminder of a category of failures in the design of kernel-user-space APIs that has a long history on Linux (and, going even further back, Unix). A closer look at that history yields a lesson that should be kept in mind for all future system calls added to the kernel.

The renameat2() system call is an extension of renameat() which, in turn, is an extension of the ancient rename() system call. All of these system calls perform the same general task: manipulating directory entries to give an existing file a new name on the same filesystem. The original rename() system call took just two arguments: the old pathname and the new pathname. renameat() added two arguments, one associated with each pathname argument. Each of the new arguments can be a file descriptor that refers to a directory: if the corresponding pathname argument is relative, then it is interpreted relative to the associated directory file descriptor, rather than the current working directory (as is done by rename()).

renameat() was one of a raft of thirteen new system calls added to Linux in kernel 2.6.16 to perform various operations on files. The twofold purpose of the directory file descriptor argument is elaborated in the openat(2) manual page:

  • to avoid race conditions that could occur with the corresponding traditional system calls if one of the directory components in a (relative) pathname was changed at the same time as the system call, and

  • to allow the implementation of per-thread "current working directories" via directory file descriptors.

The next step, renameat2(), extends the functionality of renameat() to support a new use case: atomically swapping two existing pathnames. Although that use case is related to the earlier system calls, it was necessary to define a new system call for one simple reason: renameat() lacked a mechanism for the kernel to support (and the caller to request) variations in its behavior. In other words, it lacked the kind of flags bit-mask argument that is provided by system calls such as clone(), fcntl(), mremap(), and open(), all of which allow a varying number of arguments, depending on the bits specified in the flags argument.

renameat2() implements the new "swap" functionality and adds a new flags argument whose bits can be used to select variations in the behavior of the system call. The first of these bits is RENAME_EXCHANGE, which selects the "swap" functionality; without that flag, renameat2() behaves like renameat(). The addition of the flags argument hopefully forestalls the need to one day create a renameat3() system call to add other new functionality. And indeed, Andy Lutomirski soon observed that another flag could be added: RENAME_NOREPLACE, to prevent a rename operation from overwriting an existing file. Formerly, the only race-free way of preventing an existing file from being clobbered was to use link() (which fails if the target pathname exists) to create the new name, followed by unlink() to remove the old name.
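On a kernel with renameat2() (3.15 or later) and a glibc that wraps it (2.28 or later), that race-free no-clobber rename reduces to a single call. The helper below is an illustrative sketch under those assumptions; note that individual filesystems must also support the flag.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* Rename without clobbering: fails with EEXIST if newpath already
 * exists, with no link()/unlink() window in between.  Requires
 * Linux >= 3.15, glibc >= 2.28, and filesystem support. */
static int rename_noreplace(const char *oldpath, const char *newpath)
{
	if (renameat2(AT_FDCWD, oldpath, AT_FDCWD, newpath,
		      RENAME_NOREPLACE) == 0)
		return 0;
	return errno;
}
```

On older systems the same call can be made via syscall(2); the RENAME_EXCHANGE flag is used the same way to swap two existing names atomically.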

Mistakes repeated

There is, of course, a sense of déjà vu about the renameat2() story, since the reason that the earlier renameat() system call was required was that rename() lacked the extensibility that would have been allowed by a flags argument. Consideration of this example prompts one to ask: "How many times have we made that particular mistake?" The answer turns out to be "quite a few."

One does not need to go far to find some other examples. Returning to the thirteen "directory file descriptor" system calls that were added in Linux 2.6.16, we find that, with no particular rhyme or reason, four of the new system calls (fchownat(), fstatat(), linkat(), and unlinkat()) added a flags argument that was not present in the traditional call, while eight others (faccessat(), fchmodat(), futimesat(), mkdirat(), mknodat(), readlinkat(), renameat(), and symlinkat()) did not. (The remaining call, openat(), retained the flags argument that was already present in open().)

Of the new calls that did not include a flags argument, one, futimesat(), was soon superseded by a new call that did have a flags argument (utimensat(), added in Linux 2.6.22), and renameat() seems poised to suffer the same fate. One is left wondering: would any of the remaining calls also have benefited from the inclusion of a flags argument? Studying this set of functions further, it is soon evident that the answer is "yes", in at least three cases.

The first case is the faccessat() system call. This system call lacks a flags argument, but the GNU C Library (glibc) wrapper function adds one. If bits are specified in that argument, then the wrapper function instead uses the fstatat() system call to determine file access permissions. It seems clear that the lack of a flags argument was realized too late, and the design problem was subsequently papered over in glibc. (The implementer of the "directory file descriptor" system calls was the then glibc maintainer.)

The second case is the fchmodat() system call. Like the faccessat() system call, it lacks a flags argument, but the glibc wrapper adds one. That wrapper function allows for an AT_SYMLINK_NOFOLLOW flag. However, the flag is not currently supported, because the kernel doesn't provide the necessary support. Clearly, the glibc wrapper function was written to allow for the possibility of an fchmodat2() system call in the future.

The third case is the readlinkat() system call. To understand why this system call would have benefited from a flags argument, we need to consider three of the system calls that were added in Linux 2.6.16 that do permit a flags argument—fchownat(), fstatat(), and linkat(). Those system calls added the AT_EMPTY_PATH flag in Linux 2.6.39. If this flag is specified in the call, and the pathname argument is an empty string, then the call instead operates on the open file referred to by the "directory file descriptor" argument (and in this case, that argument can refer to file types other than directories). This allows these system calls to provide functionality analogous to that provided by fchown() and fstat() in the traditional Unix API. (There is no "flink()" in the traditional API.)

Strictly speaking, the AT_EMPTY_PATH functionality could have been supported without the use of a flag: if the pathname argument was an empty string, then these calls could have assumed that they are to operate on the file descriptor argument. However, the requirement to use a flag serves the dual purposes of documenting the programmer's intent and preventing accidents that might occur if the pathname argument was unintentionally specified as an empty string.
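The AT_EMPTY_PATH idiom looks like this in practice: fstatat() with an empty pathname and the flag behaves like fstat() on the descriptor itself. The helper name below is ours, for illustration.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/* fstat()-like operation through the *at interface: with AT_EMPTY_PATH
 * and an empty pathname, the call operates on fd itself rather than on
 * a path resolved relative to it.  Returns the file size, or -1. */
static long long size_of_fd(int fd)
{
	struct stat st;

	if (fstatat(fd, "", &st, AT_EMPTY_PATH) != 0)
		return -1;
	return (long long)st.st_size;
}
```

Without the flag, the same call with an empty pathname would simply fail with ENOENT, which is exactly the accident-prevention property the article describes.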

The "operate on a file descriptor" functionality also turned out to be useful for readlinkat(), which likewise added that functionality in Linux 2.6.39. However, readlinkat() does not have a flags argument; the call simply operates on the file descriptor if the pathname argument is an empty string, and thus does not have the benefits that the AT_EMPTY_PATH flag confers on the other system calls. Thus readlinkat() is another system call where a flags argument would have been desirable.

In summary, then, of the eight "directory file descriptor" system calls that lacked a flags argument, this lack has turned out to be a mistake in at least five cases.

Of course, Linux developers were not the first to make this kind of design error. Long before Linux appeared, there was wait() without flags and then wait3() with flags. And Linux has gone on to fix some instances of this design error in APIs inherited from Unix, adding, for example, dup3() as a successor to dup2(), and pipe2() as the successor to pipe() (both new system calls added in kernel 2.6.27).

Latter-day missing-flags examples

But, given the lessons of history, we've managed to repeat the mistake far too many times in Linux-specific system calls. As well as the directory file descriptor examples mentioned above, here are some other examples:

    Original system call        Successor
    epoll_create()  (2.6.0)     epoll_create1()  (2.6.27)
    eventfd()       (2.6.22)    eventfd2()       (2.6.27)
    inotify_init()  (2.6.13)    inotify_init1()  (2.6.27)
    signalfd()      (2.6.22)    signalfd4()      (2.6.27)

The realization that certain system calls might need a flags argument sometimes comes in waves, as developers realize that multiple related APIs may need such an argument; one such wave occurred in Linux 2.6.16, when four of the "directory file descriptor" system calls added a flags argument.

As can be seen from the other examples shown just above, another such wave occurred in kernel 2.6.27, when a total of six new system calls were added. All of these new calls, as well as accept4(), which was added for the same reasons in Linux 2.6.28, return new file descriptors. The main reason for the addition of the new calls was to allow the caller the option of requesting that the close-on-exec flag be set on the new file descriptor at the time it is created, rather than in a separate step using the fcntl(F_SETFD) operation. This allows user-space applications to avoid certain race conditions when using the traditional counterparts of these system calls in multithreaded applications. Those races could occur when one thread tried to create a file descriptor and use fcntl(F_SETFD) to set its close-on-exec flag at the same time as another thread happened to perform a fork() plus execve(). (The socket() and socketpair() system calls also added this new functionality in 2.6.27. However, somewhat bizarrely, this was done by jamming bit flags into the high bytes of these calls' socket type argument, rather than creating new system calls with a flags argument.)
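For instance, with pipe2() the close-on-exec flag is set atomically at creation time; the equivalent pipe()-then-fcntl(F_SETFD) sequence leaves exactly the window described above for a concurrent fork() plus execve(). A minimal sketch (the helper name is ours):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Create a pipe whose two descriptors have FD_CLOEXEC set from the
 * moment they exist, closing the race against a fork()/execve() in
 * another thread that the separate fcntl(F_SETFD) step leaves open. */
static int cloexec_pipe(int fds[2])
{
	return pipe2(fds, O_CLOEXEC);
}
```

The other 2.6.27-era calls (eventfd2(), signalfd4(), epoll_create1(), inotify_init1()) and accept4() follow the same pattern, each taking a *_CLOEXEC flag bit.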

Turning to more recent Linux development history, we see that a number of new system calls added since kernel 2.6.28 have all included a flags argument, including fanotify_init(), fanotify_mark(), open_by_handle_at(), and name_to_handle_at(). However, in all of those cases, the flags argument was required at the outset, so no decision about future-proofing this aspect of the API was required.

On the other hand, there have been some misses or near misses for other system calls. The syncfs() system call added in Linux 2.6.39 does not have a flags argument, although one wonders whether some filesystem developer might have taken advantage of such a flag, if it existed, to allow the caller to vary the manner in which a filesystem is synced to disk. And the finit_module() system call added in Linux 3.8 only got a flags argument after some last-minute prompting; once added, the flag proved immediately useful.

The conclusion from this oft-repeated pattern of creating new incarnations of system calls that add a flags argument is that a suitable question to ask during the design of every new system call is: "Is there a reason not to include a flags argument in the API?" Considering the question from that perspective is likely to more often lead developers to default to following the wise example of the process_vm_readv() and process_vm_writev() system calls added in Linux 3.2. The developers of those system calls included a (currently unused) flags argument on the suspicion that it may prove useful in the future. History suggests that they'll one day be proved right.
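The defensive convention that such a flags argument enables can be sketched in a few lines. The call and its flag bits below are entirely imaginary, but the check is the pattern a new system call should adopt from day one: reject unknown bits with EINVAL, so that flags added later fail loudly on old kernels instead of being silently ignored.

```c
#include <errno.h>

/* Hypothetical flag bits for an imaginary "frobnicate" system call. */
#define FROB_VERBOSE	0x1
#define FROB_FORCE	0x2
#define FROB_ALL_FLAGS	(FROB_VERBOSE | FROB_FORCE)

/* Kernel-style entry point: any flag bit this version does not know
 * about is an error, preserving the bit for future extensions. */
static int frobnicate(unsigned int flags)
{
	if (flags & ~FROB_ALL_FLAGS)
		return -EINVAL;
	/* ... the actual (imaginary) work would go here ... */
	return 0;
}
```

A userspace caller probing for a new flag can then distinguish "not supported here" (EINVAL) from "accepted and quietly done nothing", which is what makes the argument genuinely future-proof.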


Best practices for a big patch series

February 12, 2014

This article was contributed by Wolfram Sang

The kernel development process features thousands of developers all working together without stepping on each other's toes — very often, anyway. The modularity of the kernel is one of the biggest reasons for the smoothness of the process; developers rarely find themselves trying to work on the same code at the same time. But there are always exceptions, one of which is the large, cross-subsystem patch series. Merging such a series does not have to be a recipe for trouble, especially if a few guidelines are followed; this article offers some suggestions in that direction.

Changing the whole kernel tree using a pattern has become a lot easier in recent years. There is more processing power available, example scripts are out there, and tools like Coccinelle are specifically targeted at such tasks. While this is great for wide-ranging work like API changes and bug fixes across all drivers, handling a patch series spanning various subsystems can be a bit cumbersome. Dependencies and responsibilities need to be clear, the granularity (i.e. the number of patches) needs to be proper, and relevant information needs to reach all of the people involved. If these conditions are not met, maintainers might miss important details, which means more work for both the submitter and the maintainer. The best practices described below are intended to make submitting such a patch series smooth and to avoid this unnecessary work.

Patch organization

The first question to answer is: in what form should your changes be posted? Here are the most commonly used choices, along with examples of when they were used. There are no strict rules about when to use which approach (and there can't be), so the examples will hopefully give you an idea of what issues to consider and what might be appropriate for your series.

  1. Changing the whole tree at once: Having one patch changing files tree-wide in one go has the advantage of immediately changing the API (no transition time). Once applied, it is done, ready, and there should be no cruft left over. Because only one maintainer is needed to merge the huge patch, this person can easily handle any dependencies that might exist. The major drawback is a high risk of merge conflicts all over the tree because so many subsystems are touched. This approach was used for renaming INIT_COMPLETION() to reinit_completion().

  2. Grouping changes per file: Having one patch for every modified file gives each subsystem maintainer freedom regarding when to apply the patches and how to handle merge conflicts because the patches do not cross subsystems. However, if there are dependencies, this can become a nightmare ("Shall I apply patches 32-53 to my tree now? Do I have to wait until 1-5 are applied? Who does that? Or is there a V2 of the series coming?"). Also, a huge number of patches pollutes the Git history. This choice was used for removing platform_driver_probe() from bus masters like I2C and SPI. It was chosen to provide a more fine-grained bisectability in case something went wrong.

  3. Grouping changes per subdirectory: Having a patch per subdirectory somewhat resembles a patch per subsystem. This is a compromise of the former two options. Fewer patches to handle, but still each subsystem maintainer is responsible for applying and for conflict resolution. When the pinctrl core became able to select the default state for a group of pins, the explicit function call doing that in drivers was removed in this fashion. In another example, a number of drivers did sanity checks of resources before passing them to devm_ioremap_resource(). Because the function does its own checks already, the drivers could be cleaned up a little, one subdirectory at a time. Finally, the notorious UAPI header file split was also handled this way.

  4. Drop the series: Finally, some tasks are just not suitable for mass conversion. One example is changing device drivers to use the managed resources API (devm_* and friends). There are surely some useful patterns to remove boilerplate code here. Still, not knowing hardware details may lead to subtle errors. Those will probably be noticed for popular drivers, but may introduce regressions for less popular ones. So, those patches should be tested on real hardware before they are applied. If you really want to do a series like this as a service to the community, you should then ask for and collect Tested-by tags. Expect the patches to go in individually, not as a series. Patches that do not get properly tested may never be applied.

Of course, the decision of which form to use should be driven by technical reasons only; patch-count statistics, in particular, should not be a concern. As mentioned before, there are no hard rules, but you can assume that changing the whole tree at once is usually frowned upon unless the dependencies require it. Also, try to keep the number of patches low without sacrificing flexibility; that makes changes per subdirectory a good starting point if you are unsure. In any case, say in the cover letter what you think would be best, and be open to discussion, because approaches do vary. For example, I would have preferred that the removal of the __dev* attributes had been one huge patch instead of 358 small ones. As a result, be prepared to convert your series from one form into another.

Note: To automatically create commits per subdirectory with git, the following snippet can be used as a basis. It reads a commit message template specified by $commit_msg_template to create the commit descriptions. There, it replaces the string SEDME with the directory currently being processed.

    dirs=$(git status --porcelain --untracked-files=no $startdir | \
	 dirname $(awk '/^ M / { print $2 }') | sort -u)

    for d in $dirs; do
        git add --update $d/*.[ch]
        sed "s|SEDME|${d//|/\|}|" $commit_msg_template | git commit --quiet -F -
    done

An example commit message template might look like:

    SEDME: calling foo() in drivers is obsolete

    foo() is called by the core since commit 12345678 ("call foo() in core").
    No need for the driver to do it.

    Signed-off-by: Wolfram Sang <>

The procedure

With any patch series, the good old "release early, release often" rule holds true. Let people know what you are up to. Set up a public repository, push your complete branch there, and update it regularly. If the series is not trivial, send an RFC to collect opinions. For an RFC, it may be appropriate to start by patching only one subsystem rather than the whole tree, or to use a whole-tree patch this one time in order to keep the mail count low. Always send a cover letter and describe your aim, dependencies, public repositories, and other relevant information.

Ask Fengguang Wu to build your branch with his great kbuild test service. When all issues are resolved and there are no objections, send the whole series right away. Again, don't forget a proper cover letter. In case of per-file or per-directory patches, the subsystem maintainers will pick up the individual patches as they see fit. Be prepared for this process to take longer than one development cycle. In that case, rerun your pattern in the next development cycle and post an updated series. Keep at it until done.

If it has been agreed to use the all-at-once approach, there may be a subsystem maintainer willing to pick up your series and take care of needed fixups during the merge window (or maybe you will be asked to do them). If there is no maintainer to pick your series but appropriate Acked-by tags were given, then (and only then) it is up to you to send a pull request to Linus. Shortly after the -rc1 release is a good time for this, though it is best to agree on this timing ahead of time. Make sure you have reapplied your pattern on the relevant -rc1 release so that the patches apply. Ask Stephen Rothwell to pull your branch into linux-next. If all went well, send out the pull request to Linus.

Whom to send the patches to

When you send the series, use git send-email. The linux-kernel mailing list is usually the best --to recipient. Manually add people and lists to CC if they should be interested in the whole series.

For other CCs, from the kernel scripts directory is the tool to use. It supports custom settings via .get_maintainer.conf, which must be placed in the kernel top directory. The option --no-rolestats should be in that file; it suppresses the printing of information about why an email address was added. This extra output may confuse git and is also seen as noise on the mailing lists. The other default options are sane, but the usage of --git-fallback depends on the series you want to send. For per-file changes, it makes sense to activate this feature, because it adds people who actually worked on the modified files. For per-subsystem and whole-tree changes, --no-git-fallback (the default) makes sense, because those changes are mostly interesting to maintainers, so individual developers don't need to be on CC. If they are interested in the series, they will usually read the mailing list of the subsystem and notice your work there.

There is one last tricky bit left: the cover letter. If it has too few CCs, people who receive individual patches might miss it; they are then left wondering what the patches are trying to accomplish. On the other hand, copying the cover letter to everybody who is also on CC for the patches will usually result in rejected emails, because the CC list becomes too large. The rule of thumb here is: add all mailing lists which get patches to the cover letter. Below is a script that does exactly that; it can be used as a --cc-cmd for git send-email. If it detects the cover letter (patch number 0000), it runs on all patches in the series, collecting only the mailing lists (the --no-m option). If it detects an ordinary patch, it simply executes on it:

    #! /bin/bash
    # cocci_cc - send cover letter to all mailing lists referenced in a patch series
    # intended to be used as 'git send-email --cc-cmd=cocci_cc ...'
    # done by Wolfram Sang in 2012-14, version 20140204 - WTFPLv2

    shopt -s extglob
    cd $(git rev-parse --show-toplevel) > /dev/null

    name=${1##*/}
    num=${name%%-*}
    dir=${1%/*}

    if [ "$num" = "0000" ]; then
        for f in $dir/!(0000*).patch; do
            scripts/ --no-m $f
        done | sort -u
    else
        scripts/ $1
    fi


Applying patterns to the kernel tree is surely a useful tool. As with any tool, knowledge of when to use it and how to handle it properly needs to be developed. This article is hopefully a useful contribution in that direction. The author hopes to inspire other developers, and is open to discussion.


Page editor: Jonathan Corbet

Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds