Brief itemsreleased on February 23. Linus's comment was: "Nothing big, and nothing that looks particularly scary."
Kernel development newsdeferrable timers concept in the kernel dates back to 2007. A deferrable timer can be used in situations where an arbitrary delay between the timer's expiration and the execution of the timeout code is acceptable. In such situations, the expiration of the timer can be delayed until something else wakes up the CPU. Deferring expiration in this way minimizes the number of wakeups, which, in turn, helps to minimize power consumption.
Deferrable timers are available within the kernel, but they are not provided to user space. So the timer-related system calls (including timerd_settime(), clock_nanosleep(), and nanosleep()) will all make an effort to notify user space quickly on timer expiration, even if user space could tolerate some delay. That is irritating to developers who are working to improve power performance. The good news for those developers is that, after a couple of false starts, it now appears that deferrable timers may finally be made available to user space.
Some readers will certainly be thinking about the timer slack mechanism, which is available in user space. However, it affects all timers; for some applications, some timers may be more deferrable than others. Timers with slack can also only be deferred for so long, meaning that they might still wake a sleeping CPU when they expire. So deferrable timers may well address some use cases that are not well handled by timer slack.
Back in 2012, Anton Vorontsov sent out a patch set adding deferrable timer support to the timerfd_settime() system call. In putting together this patch, Anton ran into a problem: like all of the timing-related system calls, the timerfd mechanism uses the kernel's high-resolution timers. But high-resolution timers do not support deferrable operation; that functionality is limited to the older "timer wheel" mechanism (described in this article). The timer wheel is old code with a number of problems; it has been slated for removal for years. But nobody has done that work, so the timer wheel remains in place, and it remains the only facility with the deferrable option.
Anton's response was to split timers in the timerfd subsystem across both mechanisms. Regular timer requests would use the high-resolution timer subsystem as usual, but any request with the TFD_TIMER_DEFERRABLE flag set would, instead, be handled by the timer wheel. Among other things, that limited the resolution of deferrable timers to one jiffy (0.001 to 0.01 seconds, depending on the kernel configuration), but, if the timer is deferrable, low resolution is not necessarily a problem. Still, this patch set did not go very far, and Anton appears to have dropped it fairly quickly.
More recently, Alexey Perevalov has picked up this concept and tried to push it forward. He first posted a patch set in January; this posting generated rather more discussion than its predecessor did. John Stultz was concerned that only timerfd timers gained the new functionality; it would be better, he thought, to push it down to a lower level so that all timer-related system calls would benefit. Doing so, he thought, would likely entail adding the deferrable capability to the high-resolution timer subsystem.
Thomas Gleixner was rather more emphatic, stating that use of the timer wheel "is not going to happen." He suggested that this functionality should instead be added to high-resolution timers in the form of a new set of clock IDs. The clock ID is a parameter provided by user space describing which timekeeping regime should be used. CLOCK_REALTIME, for example, corresponds to the system clock; it can go backward should the administrator change the system time. The CLOCK_MONOTONIC clock, instead, is guaranteed to always move forward. There are a number of other clocks, including CLOCK_TAI, added in 3.10, which corresponds to international atomic time. Thomas tossed out a proof-of-concept patch adding new versions of all of these clocks (CLOCK_MONOTONIC_DEFERRABLE, for example) that would provide deferrable operation.
John, however, argued that clock IDs were not the right interface to expose to user space:
The use of separate clock IDs within the kernel as an implementation detail is fine, he said, but the right way for user space to request the feature is with modifier flags. Fortunately, almost all of the relevant system calls already have flag arguments; the big exception is nanosleep(), which is a call some developers would like to see simply vanish anyway. John's argument, when expressed in this way, prevailed with no real dissent.
Alexey posted a couple more versions of his patch set, but, to put it gently, they did not find approval with Thomas, who eventually posted a deferrable timers patch set of his own, showing how he thinks the problem should be solved. In this patch set, clock_nanosleep() gets a new TIMER_IS_DEFERRABLE flag, while timerfd_settime() gets TFD_TIMER_IS_DEFERRABLE. In either case, setting that flag causes the kernel to use one of the new deferrable internal clock IDs. Timers on those IDs never actually program the hardware, so they never generate interrupts and cannot wake the system. Instead, the expiration functions will be executed when the system is awakened for some other task. In the absence of the new flag, these system calls behave as they always have.
Thomas's patch set has not gotten much in the way of comments beyond low-level implementation details; if that situation persists for long, silence is likely to indicate consent. So, unless some surprise comes along, the kernel will probably offer deferrable timers to user space around 3.15 or so.Last week's article on C11 atomic variables covered the discussion on the apparent mismatch between what the C11 standard defines and the kernel needs. This discussion did not conveniently end with the publication of the article, though. So this followup looks at the ground that was covered since then, with a particular focus on the "consume" memory order. It is possible, though far from guaranteed, that the outcome of this discussion could lead to changes in the standard to make it more applicable to kernel use.
Much of the work around memory ordering relates to two modes of operation called "acquire" and "release". This page describes the meaning of these models within the standard. In short: a read from memory with "acquire" semantics is guaranteed to happen before any subsequent reads or writes in the same thread. A write with "release" semantics will happen (become globally visible) after any preceding reads or writes. The two are typically used together. Code that modifies a data structure will perform the final write (the one that makes any other data it wrote accessible globally) with a "release" operation, while code consuming that data will read the pointer to the data with an "acquire" operation.
Acquire and release are useful concepts when trying to figure out how to work with shared data in a lockless manner. But in many cases an acquire operation provides stronger ordering than is really necessary. Reading with acquire semantics imposes ordering on all subsequent reads and stores, even if many of those operations do not depend on the value that was read and could be more freely reordered by either the compiler or the processor. There is one case in particular in the kernel where it would be nice to have weaker (and cheaper) ordering guarantees than acquire provides.
That case has to do with the read-copy-update (RCU) operation. The LWN kernel index includes many articles on the details of RCU; to simplify things greatly here, for the purposes of this article it is hopefully enough to say that RCU works by putting potentially volatile data into structures that are accessed via pointers. Changing the data involves allocating a new structure, copying the new data into it, then updating the pointer to point to the new structure. Code consuming that data will see either the older or the newer pointer, depending on the relative timing of things, but either will be valid at the time. It is important, though, that the data written to the new structure all be globally visible before the pointer to that structure becomes visible; otherwise a consuming thread could end up reading the wrong information.
This requirement can be met by assigning the pointer with release ordering, and reading the pointer (usually done with rcu_dereference()) with acquire ordering. But the only ordering that really matters is that between obtaining the pointer and accessing the contents of the structure it points to. On many processors, that ordering comes for free, with no expensive memory barriers required at all.
Providing this weaker ordering is the role of the "consume" ordering, which only ensures that writes that are "dependent" on the read value must be visible. So in code that looks like this:
p = rcu_dereference(pointer); q = p->something; *a = something_else;
With acquire ordering, the assignment to *a could not be reordered to happen before the rcu_dereference() call; with consume ordering, instead, that reordering could be done, and, on some architectures at least, the run-time cost of ensuring that ordering would be lower (or zero). Given that techniques like RCU are used in places where performance matters greatly, the extra performance obtained through the use of consume semantics seems worth having.
The problem with consume ordering as defined by the standard is that it requires extensive tracking of dependencies between data accesses. That tracking, it seems, is hard to understand and hard to do. The result is a standard text that is not entirely approachable to developers. There are also reported bugs in GCC indicating that the handling of consume ordering is not always done correctly. With some compilers, it seems, consume is just implemented as acquire, leading to correct results but losing the performance advantages that consume is supposed to provide.
These problems make consume ordering sufficiently difficult to use in the kernel that, chances are, the kernel will continue to use its current mix of architecture-dependent macros and barriers. But what if the definition of consume ordering could be tweaked in the standard itself? There are (probably) few users of consume now, and many implementations likely just implement it as if it were acquire, so there may be scope for changes.
Linus has an idea for a change that, he thinks, would solve most of the problems. He would like to get rid of the extensive language describing dependencies and their tracking and replace it with something simpler. His suggested wording is:
The idea here is simple and, in theory, it provides just the ordering that RCU needs. There are some interesting subtleties, though. The "chain of pointers" concept, for example, refers to assignments and simple modifications. So with an assignment like:
p = rcu_dereference(something);
These assignments would create pointers in the chain:
q = p; r = p + 1;
What the "chain" idea explicitly does not cover is aliases. If some other pointer in the function happens to point to the object that p points to, accesses to that object via the second pointer will not be ordered in any way. That makes Linus's idea of consume semantics different from that found in the standard; the latter requires the compiler to try to catch and handle that kind of aliasing.
But what really makes up a "chain of pointers"? Paul McKenney, who would be the person who would have to try to sell any such concept to the standard committee, posted a set of twelve rules describing how these chains would be formed. There is an attempt to distinguish between pairs of operations like this (for example):
q = p & ~0x1; r = p & 0x1;
This kind of logical AND operation is often found in the kernel; the lowest bits of pointers are sometimes used as flags to carry additional information about the pointer. The assignment of q above should preserve the dependency chain, while the assignment of r would not. Compilers can often detect assignments like that second one, which produces an integer value, not a pointer, and reorder them in surprising ways.
It turns out that there are a lot of ways that one can destroy the essential "pointerness" of a pointer and break the dependency chain. In fact, there are so many that Linus advised (to put it politely) Paul to give up on trying to describe them:
An example he gave was:
p = atomic_read(pp, consume); if (p == &variable) return p->val;
In this case, he said, the compiler could reasonably turn p->val into variable.val. At that point, there is no chain of pointers and no ordering; the read of variable.val could conceivably happen before the atomic read. If, instead, the == were to be changed to !=, the chain (and the ordering of the operations) would be preserved because there is no way for the compiler to know where p might point.
After reading Linus's description, Paul tried to write down the requirements again, and came up with this summary:
Linus more-or-less agreed that this is the case:
He did go on to confess, though, that what he really wants is something like what Intel hardware provides. If he were the king of the world, he said, he would outlaw the weaker ordering provided by architectures like ARM and PowerPC.
And that ties into one of Paul's biggest concerns: will we be able to count on hardware providing the relatively strong Intel-style ordering in the future? Optimization techniques have advanced considerably over the years and will likely continue to do so. Paul wondered: "Are ARM and Power really the bad boys here? Or are they instead playing the role of the canary in the coal mine?" If the latter is true, then building a memory ordering regime around Intel's rules might prove hard to sustain over the long term.
Responses so far suggest that others do not expect weaker ordering to hold in the long term; as George Spelvin put it, once a processor adds the cache-coherency hardware to support other types of advanced optimization, it has the capability to provide stronger ordering anyway. Programming on systems with weaker memory ordering is harder, and thus more costly; there will come a time, some think, when those costs clearly are not justified if the ability to provide stronger ordering is available.
Predicting the long-term future of computing hardware is hard, of course, and, meanwhile, systems with weaker ordering are around and must be supported. If Linus's model of consume semantics were to prevail, it could be supported on such hardware now with the use of appropriate memory barriers. But predicting whether Linus's vision might ever make it into a standard revision is just as hard. It might just have a persuasive champion who could present it to the committee in the proper language, but standard committees move in strange, mysterious, and slow ways. So this could be a story that plays out over years; in the meantime, the kernel will almost certainly not switch to C11 atomic variables for anything that benefits from consume-style semantics.
As noted by various commenters on our earlier article, Flags as a system call API design pattern, there is an important step that is required when equipping a system call with a flags argument. If the kernel does not take care to return an error if unknown flags are set, it is setting the stage for a number of compatibility problems in the future. Unfortunately, the history of Linux (and Unix) system call development shows that this lesson has been hard to learn.
In particular, a system call flags argument (or indeed any input structure argument that has a bit-flags field) should always include a check of the following form in its implementation:
if (flags & ~(FL_XXX | FL_YYY)) return -EINVAL;
Here, FL_XXX and FL_YYY form the hypothetical set of flags that the system call understands, and the effect of this check is to deliver an error when the caller specifies any bit value other than one in the set. Checks like this future-proof the API against the day when the system call understands additional flags. Suppose that the system call adds a new flag, FL_ZZZ, and adjusts its check to:
if (flags & ~(FL_XXX | FL_YYY | FL_ZZZ)) return -EINVAL;
A user-space application is now able to check whether it is running on a kernel where the system call supports FL_ZZZ by checking for an EINVAL error when making the system call. This allows the application to flexibly deal with system call differences across kernel versions.
Although implementing flags checks such as the above inside the kernel might seem simple and obvious, it turns out that dozens of system calls don't make this check, including clock_nanosleep(), clone(), epoll_ctl(), fcntl(F_SETFL), mmap(), msgrcv(), msgsnd(), open(), recv(), send(), sigaction(), splice(), unshare(), and many others.
Most of those system calls have been around for several years. In more recent times, most new system calls that have a flags argument include the required check. However, such checks are missing even in a few system calls added in recent kernel versions, such as as open_by_handle_at() (2.6.39), recvmmsg() (2.6.33), and sendmmsg() (3.0). In each of those recent cases, the implementer was presumably emulating the lack of checking that was done in the corresponding earlier system call (open(), recv(), send()). However, the failure to add the checks represents a missed opportunity to improve on the original API.
For each of the system calls that lack a check on the flags argument, user-space applications have no easy way of detecting what API flags a particular kernel version supports. Furthermore, failure to implement such checks in the kernel can also complicate the lives of kernel developers, as a couple of examples demonstrate.
When the kernel fails to check that only valid bits are passed in flags, user-space applications can, with impunity, place random garbage in the "unused" bits of flags. If a kernel developer then decides to make use of one of the hitherto unused bits, this may lead to surprising breakage in user-space applications, which in turn may require the kernel developer to write suboptimal implementations of new user-space API features. One recent example of this was in the implementation of the EPOLLWAKEUP flag, where avoiding user-space breakage meant that the kernel silently ignored this flag if the caller did not have the CAP_BLOCK_SUSPEND capability. Ideally, of course, the kernel would have informed the caller by returning an error from the call. Consequently, applications that want to be absolutely sure that the call will succeed must explicitly check beforehand that they have the CAP_BLOCK_SUSPEND capability.
An even more recent example was in the implementation of the O_TMPFILE flag for open(), where the flag definition incorporated the O_DIRECTORY flag, with the goal that older kernels that do not support O_TMPFILE would give an error if the flag was specified in a call to open(). This was necessary, because applications that create temporary files are often security conscious, and need to know whether their requests to create hidden temporary files have been honored. Without this fix, the O_TMPFILE flag would be silently ignored on older kernels, and an application might end up creating a visible file. An unpleasant side effect of that implementation is that user-space applications must check for two different errors from open() in order to determine whether they are running on a kernel that doesn't support O_TMPFILE.
Finally, it is worth mentioning that a few system calls have added the required flags check after the call was first implemented. Two examples are the ancient system calls umount2() (check added in Linux 2.6.34) and swapon() (check added in Linux 3.4). In addition, the mremap() call, which first appeared in Linux 2.0, added the check in Linux 2.4, and the timerfd_settime() system call, which first appeared in Linux 2.6.25, added the check in Linux 2.6.29.
However, the addition of flags checks to these system calls represents an exception to the general rule that such checks cannot be added after the fact, because doing so would break existing applications that happen to pass random garbage in the "unused" bits of flags. With umount2() and swapon(), the change was possible presumably because there are few users of these system calls other than the mount and swapon commands, and those programs could be modified if the kernel change caused them to break. In the case of timerfd_settime(), the change was made soon after the initial implementation, when there were likely to have been few users of the interface. And in the case of mremap(), the change was made at the time of a major kernel version change (from 2.2 to 2.4), when such ABI changes were occasionally permitted; with the contemporary 10-week release cycle, such changes are not permitted.
Thus if the check on unused flag bits is not included in the initial implementation, it is often impossible to add it later. The clear conclusion is that any addition of flag bits to a system call should come with the proper checks from the outset.
Patches and updates
Core kernel code
Filesystems and block I/O
Page editor: Jonathan Corbet
Next page: Distributions>>
Copyright © 2014, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds