LWN.net Weekly Edition for April 27, 2017
Welcome to the April 27, 2017 LWN.net Weekly Edition
This edition marks the second week of our experiment with the edition format. Our feature content from this week includes:
- The great leap backward: the numerology of software versioning doesn't get any easier.
- Turmoil for Drupal: when disapproval of a member's private life spills over into project governance.
- Which email client for Ubuntu 17.10?: Or, to put it another way, does that release need an email client at all?
- The MuQSS CPU scheduler: the latest in interactive scheduling from Con Kolivas.
- Two new block I/O schedulers for 4.12: the lack of multiqueue block I/O scheduling will come to an end with the probable merging of two schedulers for different use cases.
- Device power management with the OPP library: a close look at device power management from one of the developers doing the work.
- A formal kernel memory-ordering model (part 2): the conclusion of the series on proving the correctness of concurrent memory accesses and scaring small children at scale. Sharpen your pencils, quick quizzes are in your future.
As usual, the following two pages are full of brief news items from all over our community. Thanks are due, as always, to LWN.net's subscribers, who make all of this writing possible.
The great leap backward
Sayre's law states: "In any dispute the intensity of feeling is inversely proportional to the value of the issues at stake". In that context, it is perhaps easy to understand why the discussion around the version number for the next major openSUSE Leap release has gone on for hundreds of sometimes vitriolic messages. While this change is controversial, the openSUSE board hopes that it will lead to more rational versioning in the long term — but the world has a way of interfering with such plans.
OpenSUSE Leap is an interesting hybrid distribution; its core packages come from the slow-and-stable SUSE Linux Enterprise (SLE) release, but those packages are replaced or supplemented by much newer software where desired. The current (and only) openSUSE Leap release was originally based on SLE 12 and openSUSE 13.1. The project had an immediate problem in that it needed to come up with a version number for this new distribution; in the end, it did what any of us would have done and chose 42. The current release is openSUSE Leap 42.2.
Eventually even an enterprise distribution gets around to a new major release; that is expected to happen with SLE sometime in 2018. A new SLE release will have a new version number, of course. The working expectation on the openSUSE side was that the next SLE would be SLE 13, and that openSUSE Leap would, continuing with the SLE+30 formula, release openSUSE Leap 43 based on that release.
But, as openSUSE board chair Richard Brown recently noted, plans can change. In its wisdom, SUSE management decided that the next SLE release would not be SLE 13; it seems that SLE 15 has greater market appeal. The reasons behind the decision have not escaped the smoke-filled room in which it was made, but it has been speculated that the purpose was to avoid the number 13, which is seen as being unlucky in many parts. Similarly, 14 contains a 4, an ill-starred number in parts of Asia, so it was out of the running. 15, being a triangular, hexagonal, and pentatope number (among other things), was the obvious choice.
Parents in the US, who often acquire much gray hair at the prospect of their children getting driving permits at the age of 15, may well not consider that number to be especially auspicious either. It seems that Europe-based SUSE is less concerned about that particular fear, unreasonable as that may seem.
Naturally, using 43 as the version number for the next openSUSE Leap release is now entirely out of the question; there is no point in even talking about it. So the openSUSE board retired to its rather smaller smoke-filled room to find a solution to this existential problem. The triumphant result was then announced by Brown: openSUSE Leap 42.x would be followed by openSUSE Leap 15. That is when the mailing lists erupted with complaints about this decision; it seems that the inherent rightness of 15 isn't entirely obvious to the community as a whole.
There were complaints that this decision was made in private, without the prior involvement of the community. As anybody who has watched our community for any time knows, had the version-number question been posed to the mailing list as a whole, the result would have certainly been a much more polite conversation converging on an obvious consensus that all participants would then support wholeheartedly. It always works that way, after all, why should this time be any different?
Unsurprisingly, some community members are horrified by the idea of a software release having a lower version number than its predecessor; they think that both humans and scripts might become confused in a world where the first derivative of version numbers is not always positive. But they fail to appreciate the full mathematical complexity of openSUSE version numbers, of which the headline number (15 in this case) is only the most visible. Consider, for example, the suse_version number found within the spec files used to build openSUSE packages.
The old openSUSE 13.1 release sets suse_version to 1310, while the slightly less old 13.2 release (from 2014) has 1320. The much newer openSUSE Leap 42 release, having branched off before openSUSE 13.2, uses 1315 for suse_version — another example of what we might call, in this day and age, "alternative incrementing". OpenSUSE Leap 15 is expected to use 1500, restoring order to that part of the universe, but at the cost of creating potential confusion elsewhere. The leading-edge Tumbleweed release, which should have the highest version number, will not be content to remain at its current 1330; Brown worries that numerous scripts will "explode" when Tumbleweed inevitably leapfrogs Leap to something over 1500. Throw in sle_version (tied to the underlying SLE version) and leap_version (giving the precise Leap version — but only in 42.2) and one might argue that a backward step in the major version number is the least of the project's numbering issues.
That said, this decision has set up another existential crisis for the future: what happens when the SLE 42 release comes out and openSUSE Leap is faced with reusing a version number — a deed seen as being even more foul than going backward? Brown shrugged off this problem, saying that, at the current release rate, SLE 42 isn't due for over 100 years. There should, he implied, be time for plenty of other flame wars before that one needs to heat up.
Brown's math is neglecting an important fact, though: SLE just skipped over two numbers, and might well be expected to do the same thing again in the next century. After all, 16 is a power of two, and all those zeroes might make some potential customers nervous. It's also the atomic number of sulfur; best to just skip it. Italians see 17 as an exceptionally ill-starred number. 18 is voting age in much of the world, and nobody has had luck with voting recently, so that one should be avoided too. 19 is suspiciously prime, but might yet prove acceptable pending further research. And so on; SLE 42 may come far sooner than anybody expects.
Be that as it may, it would appear that minds are made up, and that calls to "kick the board" and continue with the existing versioning scheme, reasonable as they may be, will go unheeded. OpenSUSE may have leaped before it looked, but whether that happened now or back when "42" was chosen is unclear. The good news is that our community is resilient, and even openSUSE Leap's version-number decisions will fade into history to a well-earned place alongside events like the merging of devfs and the invention of JavaScript. The flames may be high but, once again, that's a reliable indication that the stakes are low.
Turmoil for Drupal
The Drupal content management system (CMS) has been an open-source tool of choice for many web site owners for well over a decade now. Over that time, it has been overseen by its original developer, Dries Buytaert, who is often referred to as the benevolent dictator for life (BDFL) for the project. Some recent events have led a sizable contingent in the Drupal community to question his leadership, however. A request that a prominent developer leave the Drupal community, apparently over elements of his private life rather than any Drupal-related misstep, has led to something of an outcry in that community—it may well lead to a change in the governance of the project.
Too much information
Publicly, at least, the incident began with a post from Drupal developer Larry Garfield on March 22. It was titled "TMI about me" and largely lives up to the "too much information" (TMI) billing. It is, he said, a "self-outing" that describes his personal life and, in particular, his involvement in the Gorean subculture [possibly NSFW] of the BDSM community (which is an umbrella term for bondage, discipline, dominance, submission, sadism, and masochism generally associated with sexual activity).
Garfield said that he felt he had to "out" himself due to a smear campaign against him that started in October 2016 and culminated in Buytaert calling him on February 24; Buytaert asked Garfield to "step down from Drupal" because it is "in the best interest of the project". That was followed up by an email from Drupal Association executive director Megan Sanicki on February 27; based on her conversation with Buytaert, she was "informing me that I'd been summarily dismissed from my position as track chair and as a speaker at DrupalCon", Garfield said.
Notably lacking from any of these communications was any indication that he had violated the code of conduct, he said. Garfield summarized the situation as follows:
Now take that paragraph, replace the word "Gor" with "being gay", and go back in time 15 years. Maybe even 10. Imagine being told that you need to leave Drupal before people find out that you're gay and it [embarrasses] the project.
Now try replacing "Gor" with "Muslims", and think about it today.
A different view
As might be guessed, Buytaert sees things differently. On March 23, he posted about the situation since Garfield had gone public; that post has been updated multiple times since. There is, Buytaert said, confidential information that Garfield had omitted from his post; while Buytaert does not "judge the choices anyone makes in their private life or what beliefs they subscribe to", he does find the Gorean philosophy espoused by Garfield to be misaligned with Drupal's fundamental values. Buytaert said that he has to look at the effect on the project as a whole:
Buytaert said that he has to look at the effect on the project as a whole:
However, when a highly-visible community member's private views become public, controversial, and disruptive for the project, I must consider the impact that his words and actions have on others and the project itself. In this case, Larry has entwined his private and professional online identities in such a way that it blurs the lines with the Drupal project. Ultimately, I can't get past the fundamental misalignment of values.
On the same day, Sanicki also posted about Garfield's removal as a DrupalCon track chair and the cancellation of his talks at the conference. The "decision was based on confidential information conveyed in private by many sources", she said, but was "not because of his private life or personal beliefs".
Confidential information
The responses to Buytaert's and Sanicki's posts mostly focused on the "confidential information" they referred to. Since there was no specific mention of violations of the code of conduct, which would seemingly be the only grounds for the actions taken against Garfield, many were puzzled, or outraged. To some, it smacked of a secret court convicting someone of secret offenses using secret evidence. References to 1984 were made in the comment threads.
According to Garfield, who followed up his "TMI" post on March 27, some folks had been digging up his postings on various Gor and BDSM forums, as well as, eventually, information from a dating profile he posted. These are decidedly non-Drupal forums and collecting and sharing information from them is a violation of the terms of service of the sites, he said. Much of that information had been turned over to the Drupal Community Working Group (CWG), which "initially found that there were no Code of Conduct violations by Larry", the group said in a statement.
The CWG tried to mediate between the parties (Garfield and Klaus Purer, evidently, though only Garfield names Purer), but that mediation failed. At that point, the CWG escalated the matter to Buytaert.
There are various interpretations of the actions of the CWG, Buytaert, and Sanicki floating around, but the crux of the issue for many seems to be the lack of any concrete statement that Garfield's actions within the Drupal community were somehow problematic. The closest seems to come from a March 31 joint statement by Buytaert and Sanicki that outlines the information used in the decision-making process. That includes "some of Larry's online interactions, both on and off Drupal.org", though no details are provided, even though any postings to drupal.org are seemingly already public. There is also mention of information gathered from members-only sites; Buytaert and Sanicki say they do not condone that activity, which has been referred to the CWG for possible action.
Garfield, for his part, has tried to stay reasonably positive, while steadfastly—stridently—defending his reputation and disputing some parts of the public statements made about him and the situation. Apparently, someone had been threatening to go public with the information about Garfield's private life (which is part of why he outed himself); Garfield clearly sees that "blackmail" as part of why Buytaert acted how and when he did:
But that, like much of the information on the alleged offenses, is Garfield's interpretation. That he has to interpret it at all is, of course, one of the complaints he and others have about the process and the current state of affairs. He said that he doesn't want to see this affect the project and has made a proposal to Buytaert to try to get things back on course:
[...]
There has been wild talk of a Drupal fork, of reorganizing the Drupal Association, of people resigning, and so forth. I have no interest in such discussion, nor interest in a Drupal fork. My goal is not to split or harm Drupal, nor anyone in it. My goal is entirely defending my reputation and putting a stop to blackmail and libel.
Garfield also strongly condemned some who were trying to defend him using the same tactics that were used by his accusers. In no uncertain terms he made it clear that he wanted that to stop:
He did point to the Drupal Confessions site that is gathering support for an open letter to Buytaert. That letter has been "taken off the site for the duration of DrupalCon", according to an email from the site's contact address; the Twitter account for the site has some additional information. DrupalCon Baltimore runs from April 24-28.
The letter takes a strong stance against the actions taken by the project toward Garfield and asks Buytaert to apologize to Garfield and to take some active steps to ensure this kind of thing cannot happen again. It ends with a warning that some of the 150+ signatories, some of them high-profile members of the community, will cease participating in Drupal if Buytaert does not act to restore faith in the professionalism and governance of the community.
Moving forward
Buytaert apologized to the community on April 9 and posted some thoughts on project governance the next day. The overarching goal of the governance change is to remove Buytaert as the sole arbiter of community and governance problems that are not resolved in the CWG or elsewhere:
He would also like to review the code of conduct to clarify its intent and scope. The plan is to start some of that work at DrupalCon Baltimore and to hopefully complete it (factoring in comments from the community and taking input from both insiders and outsiders with experience in governance, diversity, inclusion, and so on) by DrupalCon Vienna in late September.
In his final post before DrupalCon, Buytaert noted that he and Garfield were talking and that Garfield would be attending the conference as a member of the community. Though both have said that Garfield was asked to leave the community, Buytaert pointed to a CWG message from April 16 that unequivocally stated Garfield remained a member of the community. That message also noted that the CWG members "strongly reject any suggestion or assertion that Larry was asked to leave solely on the basis of his personal beliefs or what he does in his private life"; if that had been the case, all of them would have resigned from the CWG. There is also mention of the community member who threatened Garfield (presumably with exposure, though it is not said); "they have agreed to step down from all positions of responsibility and leadership within the community and make personal apologies to those directly impacted by their actions."
Messy
Any way you slice it, this is a messy situation. There is a lot of murkiness and confidentiality that make it difficult to understand what actually went on. Most of the Drupal community is in the dark and it would seem that the project leadership would have much preferred to sweep this all under the rug.
It is a bit hard to reconcile Buytaert's recent statements with the earliest one; is adherence to a Gorean lifestyle incompatible with being a Drupal member or not? If it is, as the first post from Buytaert said, then it is hard to see how Garfield was not shown the door for his beliefs, which is something Buytaert and others have more recently strongly denied.
There are multiple lessons here for Drupal and other communities. If you are attempting to eject a member for some transgression, it would probably be prudent to have a high-level, non-confidential reason for doing so that you can share with the person and the greater community. There may well be good reasons for ending Garfield's relationship with the project, but none have been forthcoming—except for the confidential information that is apparently not a legal matter nor a code of conduct violation. "Trust us" can only go so far for an open source community.
The role of the code of conduct and its bearing on what behavior is considered to be acceptable is obviously important as well. There is more to a code of conduct than just having one. It's a bit hard to see how there could be an action that is legal and does not cross the line of a reasonably written code of conduct that is also so egregious it requires removing someone from a project.
In this Drupal controversy, some may see echoes of Brendan Eich being forced out as Mozilla's CEO due to his financial support for an anti-gay-marriage law in California. There are some parallels, in that there were no indications that Eich's views spilled over into the workplace, but that his adherence to those views was considered enough by some to refuse to work with him—and by extension Mozilla. There are reasonable arguments on both sides of the Eich situation, but there is a difference between being a high-profile project member, even in a leadership role, and being the CEO of a company. For good or ill, a project's code of conduct does not directly govern the executives (or other employees) of a company sponsoring a project.
There will always be people who think differently on any project; there are no end of opinions on hot-button topics that one side or the other will find to be detestable and unfit for anyone to hold. We need to learn to live with that diversity as well; people should be judged on their actions while engaged in community activities. The lines can be somewhat blurry, granted, but policing opinions and consensual actions that happen completely outside of the "community space" is tantamount to the dreaded thought police. If those actions do blur or spill into the space, the code of conduct should be there to help. Projects would do well to consider that in light of the Drupal situation.
With luck, as these words are being typed, progress is being made toward resolving the situation and on considering steps to head off other similar situations in the future. This kind of controversy has the potential to fragment the community—we have already seen signs of that. But putting things back on an even keel, while looking to learn and evolve from a mess of this sort, can also help bring the community together again.
Which email client for Ubuntu 17.10?
An email client was once a mandatory offering for any operating system, but that may be changing. A discussion on the ubuntu-desktop mailing list explores the choices for a default email client for Ubuntu 17.10, which is due in October. One of the possibilities being considered is to not have a default email client at all.
Jeremy Bicha raised the issue in mid-April. He noted that Ubuntu had switched from Evolution to Thunderbird in 2011 and thought it time to revisit that decision. For one thing, while an email client is useful, it may not be "useful enough to enough people to justify it being installed for everyone". If there is to be a default email client, though, which should it be?
The Ubuntu GNOME 16.10 release included Evolution, but the 17.04 release had no client by default, which was disappointing to some, but there has been little in the way of negative feedback, he said. In addition, Michael Catanzaro of the GNOME release team "recommends not installing an email client by default since there isn't an app that is both well-maintained and very well-integrated into the GNOME 3 style".
Since Ubuntu will be switching away from Unity to GNOME over the next year, something that integrates well with that environment is desirable. But the belief is that most users access their email with web mail and/or phone apps; those who do want a desktop client do not agree on which one it should be.
Both of the obvious choices, Evolution and Thunderbird, suffer from a number of shortcomings that Bicha outlined. Neither completely integrates with GNOME 3, though Evolution is closer. Both suffer from relatively small development communities. Thunderbird has a lot of work ahead of it to handle the changes coming in Firefox to eliminate traditional extensions; the project has also been considering a rewrite using web technologies. And so on.
There was some support for leaving out a default email client, given that the main options were not quite right, but there was concern about "mailto:" links in web pages. If there is no default mail client, there is nowhere to direct those links. Robert Ancell suggested some kind of dialog program that would catch the mailto links and redirect them to the user's email provider. Dimitri John Ledkov went a bit further, suggesting:
Some were not convinced that Ubuntu should be without a default mail client, though.
There were recommendations of looking at other clients, such as Nylas Mail and Geary. Nylas is pursuing an open-core strategy, but does have a client with lots of features available under the GPL; Bicha called it "a bit controversial in the open-source community". That client is not well integrated with GNOME, however, and it is not packaged for Debian or Ubuntu.
There were also complaints about Electron-based applications, which is what Nylas Mail is built on. Electron allows developers to build cross-platform desktop applications using JavaScript, HTML, and CSS. Adolfo Jayme Barrientos said that he hated Electron applications for unspecified reasons, but suggested Sylpheed. Jorge O. Castro cautioned against completely eliminating applications simply because of the platform (Electron); there are other relevant "health metrics" to consider.
Castro did acknowledge some "real performance/memory implications of electron apps" that should be discussed, but "at the end of the day I'd rather use a well maintained electron app than a poorly maintained 'native' one". However, Dmitry Shachnev pointed out that Electron applications are difficult to package:
Geary, on the other hand, targets GNOME 3, but it suffers from a lack of features. Sebastien Bacher is worried that it is not quite ready:
Bicha came to a similar conclusion after briefly trying Geary out. He ran into a number of deficiencies and said: "I think the hope has been for years that Geary would be the simple GNOME email client. I'm not sure it's quite there yet." But Khurshid Alam pointed out that several of those problems are being worked on and may be solved before the next Geary release, which is due well before 17.10 needs to be finalized.
There is an even more fundamental question, though, that Jo-Erlend Schinstad raised: should Ubuntu stop shipping default applications that are free software simply because many have moved on to proprietary cloud alternatives?
There's very few desktop applications in Ubuntu that needs to be shipped once we accept the argument that the proprietary clouds are a suitable replacement.
That argument didn't draw much in the way of comment, but it is one that Ubuntu and other distributions will have to grapple with as time goes on. Is the future of operating systems to simply be a platform for running web applications? Is not installing an email client by default one step down a slippery slope? In the end, users will determine whether that is the future they want, but replacing our entire free-software ecosystem with just a (hopefully free) browser would be a pretty disappointing outcome. It is worth pondering on that, not just from a distribution-centric view, but from a free and open-source software community perspective as well.
No real decision was made or even consensus reached in the thread, but there is still a fair amount of time until the 17.10 release. It remains to be seen where it all leads.
The MuQSS CPU scheduler
The scheduler is a topic of keen interest for the desktop user; the scheduling algorithm partially determines the responsiveness of the Linux desktop as a whole. Con Kolivas maintains a series of scheduler patch sets that he has tuned considerably over the years for his own use, focusing primarily on latency reduction for a better desktop experience. In early October 2016, Kolivas updated the design of his popular desktop scheduler patch set, which he renamed MuQSS (for "Multiple Queue Skiplist Scheduler"). It is an update (and a name change) from his previous scheduler, BFS, and it is designed to address scalability concerns that BFS had with an increasing number of CPUs.
BFS
To understand the updates to the scheduler in MuQSS, we first take a look at the design of BFS, which is the basis for the new patch set. First released in 2009, BFS was a simplified scheduler that was made in response to Kolivas's issues with the mainline scheduler for desktop use. The mainline scheduler has to work well for CPU-bound tasks as well as keeping the desktop smooth and snappy, and it is a difficult balancing act to achieve. Kolivas does not believe in a one-size-fits-all scheduler, so he crafted his own specifically for desktop kernels.
BFS implements a single, global queue of tasks. There is no priority modification based on sleep behavior, and all priorities are set according to the relevant process's nice value. Kolivas argues that any kind of interactivity estimation algorithm just doesn't work well enough and triggers a lot of false positives when trying to find which tasks are interactive. The single run queue design was chosen because Kolivas wanted a scheduler that made global decisions independently of the originating CPU of the processes on the system, and to avoid cycling through a number of different queues to find the next task to run.
BFS uses an earliest eligible virtual deadline first (EEVDF) algorithm to decide which tasks get to run when. Every task entering the run queue gets a time slice and a deadline. The time slice (rr_interval) determines the amount of CPU time every task receives. The deadline is computed as the current time plus the product of rr_interval and the priority level, which is the nice level expressed as a ratio. Processes with a lower nice value (hence higher priority) get sooner deadlines than lower-priority ones, so a high-priority task may get to run many times before a lower-priority task's deadline is reached. When a task's time slice expires, it is requeued with a new deadline. However, if the task sleeps or is preempted by a higher-priority task, its deadline is not readjusted; instead, the task is just put back on the queue, where it will be attended to again in the next round of scheduling.
The rr_interval tunable is set to 6ms by default; the reason is that the threshold for human perception of jitter is just under 7ms. Kolivas predicts a common case of between zero and two running tasks per CPU; in such a scenario, a task will need to wait no longer than 6ms to get service.
BFS does away with explicit CPU load-rebalancing algorithms, instead selecting the CPU for a task's execution whenever the task wakes up. The global run queue will select an available idle CPU for the next task that is ready to run. When selecting a CPU, the scheduler will first try the last CPU the task was running on. Failing that, the CPU selection moves "outward", next trying any hyperthreaded or core siblings that share cache. CPUs in different sockets or on different NUMA nodes are treated as "cache distant" and are the least preferred when selecting a CPU for a task to run on.
Other simplifications in the design of BFS include the elimination of most of the tunable knobs and the lack of support for control groups. Kolivas argues that these are features desktop users have no interest in, and an excess of tuning knobs just creates confusion.
MuQSS
Due to BFS's single run queue of tasks across all CPUs, protected by a global lock, Kolivas reports that it suffers from lock contention when the number of CPUs increases beyond 16. Every time the scheduler wanted to decide which task should go next, it needed to scan the entire run queue for the one with the earliest deadline. Also, iterating over a linked list led to cache-thrashing behavior.
MuQSS is BFS with multiple run queues, one per CPU. Instead of linked lists, the queues have been implemented as skip lists. First described by William Pugh in 1990, skip lists are probabilistic data structures that give some of the performance advantages of a balanced binary tree (such as the red-black tree used in CFS), but with an implementation that is computationally less expensive — if less deterministically efficient. Kolivas's implementation is a custom skip list for his scheduler where the "levels" are static eight-entry arrays. The size of the array was chosen to be exactly one cache line wide.
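For the curious, the shape of such a structure might be sketched as follows; this is purely illustrative (kernel-style C with invented names), not the actual MuQSS code, which differs in its details:

/*
 * Illustrative sketch of a skip-list node whose level arrays are
 * statically sized, as described above.  Eight 8-byte pointers fill
 * exactly one 64-byte cache line.
 */
#define SKIPLIST_LEVELS 8

struct skiplist_node {
	u64 key;					/* sort key: the virtual deadline */
	struct skiplist_node *next[SKIPLIST_LEVELS];	/* forward links, one per level */
	struct skiplist_node *prev[SKIPLIST_LEVELS];	/* backward links, one per level */
};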
The deadlines for MuQSS now use "niffies" for keeping track of time. Niffies are a nanosecond-resolution monotonic counter, the value of which is obtained from the high resolution TSC timers. Niffies were introduced sometime in 2010 for BFS, which initially used jiffies for calculating the deadline. The virtual deadline algorithm is essentially the same as BFS:
virtual_deadline = niffies + (prio_ratio * rr_interval)
Again, prio_ratio is the nice level of a task normalized as a ratio and rr_interval is the time slice that the task gets at its nice level. When a task is inserted into the skip list queue, it is ordered by the value of its virtual deadline. As a result, the scheduler can find the next eligible task to run in O(1) time, as that task will be at the head of the queue. Insertion into the skip list is done in O(log n) time.
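As a standalone toy, the deadline formula can be played with directly. The prio_ratio ladder below (roughly 10% longer deadlines per nice level, starting from a base of 128 at nice -20) follows the general BFS/MuQSS approach, but the exact constants and the scaling are illustrative rather than copied from the patches:

#include <stdint.h>
#include <stdio.h>

#define RR_INTERVAL_NS (6 * 1000 * 1000ULL)	/* default 6ms quantum */

/* Illustrative prio_ratio: ~10% longer deadlines per nice level. */
static uint64_t prio_ratio(int nice)
{
	uint64_t ratio = 128;	/* base ratio at nice -20 */
	int i;

	for (i = -20; i < nice; i++)
		ratio = ratio * 11 / 10;
	return ratio;
}

int main(void)
{
	uint64_t niffies = 0;	/* pretend "now" is time zero */
	int nice;

	/*
	 * virtual_deadline = niffies + prio_ratio * rr_interval,
	 * scaled here so that nice -20 lands exactly one quantum out.
	 */
	for (nice = -20; nice <= 19; nice += 13)
		printf("nice %3d -> deadline at %llu ns\n", nice,
		       (unsigned long long)(niffies +
				prio_ratio(nice) * RR_INTERVAL_NS / 128));
	return 0;
}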
When selecting a task to run, the scheduler does a lockless examination of every CPU's run queue to find an eligible task, picking the one with the nearest deadline. The scheduler uses a non-blocking "trylock" attempt when popping the chosen task from the relevant run queue, but will simply move on to the next-nearest deadline on another queue if it fails to acquire the queue lock. Although this means the scheduler doesn't always get to run the task with the absolute nearest deadline, it does alleviate the problem of lock contention among different CPU queues.
MuQSS, like BFS, is aware of the topology of the logical CPUs, whether they are hyperthreaded siblings, cores sharing a package, or NUMA nodes. The hyperthreading awareness of the scheduler means that an idle core, if available, will be selected before a hyperthreaded sibling CPU to avoid slowdowns due to the CPU's processing capacity being shared between tasks on the same core. When two tasks are executing on a hyperthreaded core, the lower-priority task will have its time limited to allow the higher-priority task to get more CPU time. Also included is a feature called "SMT Nice", which effectively proportions CPU time on hyperthreaded siblings to reflect their nice priorities.
BFS (and subsequently MuQSS) introduced two new scheduler policies called SCHED_ISO and SCHED_IDLEPRIO. SCHED_ISO, or isochronous scheduling, is a policy that allows unprivileged processes to gain quasi-realtime behavior. A task set to SCHED_ISO will preempt SCHED_NORMAL tasks as long as it does not exceed 70% CPU usage (a tunable value) over a rolling average of five seconds. If more than one SCHED_ISO task is running at the same time, they will execute round-robin; this is to prevent tasks from starving each other. If a SCHED_ISO task exceeds the 70% threshold, it is temporarily demoted to SCHED_NORMAL and scheduled appropriately until enough time has elapsed that its average CPU usage dips below 70% again. Since the 70% threshold is calculated across all available CPUs on the system, it is possible to entirely consume one CPU with SCHED_ISO tasks on SMP machines; a single SCHED_ISO task on a dual-core machine can, at most, consume 50% of total CPU capacity.
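Requesting SCHED_ISO from user space goes through the standard sched_setscheduler() interface. The policy constant is not exported by the glibc headers, so the sketch below defines it by hand; the value 4 matches the slot that mainline reserves for SCHED_ISO and that the -ck patches use, but verify it against your kernel's headers:

#include <sched.h>
#include <stdio.h>

#ifndef SCHED_ISO
#define SCHED_ISO 4	/* reserved-but-unimplemented slot in mainline */
#endif

int main(void)
{
	struct sched_param param = { .sched_priority = 0 };

	/* Unlike SCHED_FIFO/SCHED_RR, no special privileges are needed. */
	if (sched_setscheduler(0, SCHED_ISO, &param) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("running as a quasi-realtime SCHED_ISO task\n");
	return 0;
}

On a mainline kernel the call simply fails with EINVAL, since the policy is reserved there but not implemented.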
SCHED_IDLEPRIO is the opposite of SCHED_ISO in that it forces tasks to be ultra-low priority. Although it is similar to SCHED_IDLE in the mainline scheduler, there is a subtle difference: the mainline scheduler will still eventually run SCHED_IDLE tasks even if there are other, higher-priority tasks running at the same time, but SCHED_IDLEPRIO tasks will only run when absolutely nothing else is demanding the CPU's attention. This is useful for batch tasks that scavenge for idle CPUs, such as SETI@home. To avoid deadlocks, if a SCHED_IDLEPRIO task is holding a shared resource, such as a mutex or semaphore, it will be promoted back to SCHED_NORMAL temporarily, allowing it to run even if there are higher-priority processes in the queue.
There is a limited set of tunables that can control the behavior of the scheduler; a brief usage sketch follows the list:
- rr_interval, the CPU quantum, which defaults to 6ms.
- interactive, a tunable that toggles the deadline behavior. If disabled, the search for the next task to run is done independently on each CPU, instead of across all CPUs.
- iso_cpu, which determines the percentage of CPU time, across a rolling five-second average, that isochronous tasks will be allowed to use.
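These tunables are exposed by the -ck kernels as sysctls; assuming the conventional paths under /proc/sys/kernel/ (check your kernel before relying on them), a quick inspection tool might look like this:

#include <stdio.h>

/* Read a single integer tunable from procfs; returns -1 on failure. */
static long read_tunable(const char *path)
{
	long val;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	printf("rr_interval: %ld ms\n",
	       read_tunable("/proc/sys/kernel/rr_interval"));
	printf("iso_cpu:     %ld %%\n",
	       read_tunable("/proc/sys/kernel/iso_cpu"));
	return 0;
}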
Comparison to the mainline kernel
Kolivas has not tried to merge his scheduler into the mainline kernel, and it is unlikely that he will, since his scheduler patches were done with his own desktop use case in mind. However, his patch set has a following among some Linux desktop users who report a "smoother" desktop experience from using it. Previous throughput benchmarking efforts comparing BFS to the mainline CFS did not conclusively demonstrate an advantage for either scheduler. MuQSS reportedly achieves better Interbench benchmark scores than CFS. Ultimately, though, it is hard to quantify "smoothness" and "responsiveness" and turn them into an automated benchmark, so the best way for interested users to evaluate MuQSS is to try it out themselves.
[I would like to thank Con Kolivas for his help in answering questions about MuQSS.]
Two new block I/O schedulers for 4.12
The multiqueue block layer subsystem, introduced in 2013, was a necessary step for the kernel to scale to the fastest storage devices on large systems. The implementation in current kernels is incomplete, though, in that it lacks an I/O scheduler designed to work with multiqueue devices. That gap is currently set to be closed in the 4.12 development cycle, when the kernel will probably get not just one, but two new multiqueue I/O schedulers.
The lack of an I/O scheduler might seem like a fatal flaw in the multiqueue code, but the truth is that the need for a scheduler was not clearly evident at the outset. High-end drives are generally solid-state devices lacking rotational delay problems; they are thus not as sensitive to the ordering of operations. But it turns out that there is value in I/O scheduling even for the fastest devices; a scheduler can coalesce adjacent requests, reducing the overall operation count, and it can prioritize some operations over others. So the desire to add I/O scheduling for multiqueue devices has been there for a while, but the code has been lacking.
Things got closer in the 4.11 merge window, when the block layer gained support for I/O scheduling for multiqueue devices. The deadline I/O scheduler was ported to this mechanism as a proof of concept, but it was never seen as the real solution going forward.
When I/O scheduling support was added, the first intended user was the Budget Fair Queuing (BFQ) scheduler. BFQ has been in the works for years; it assigns an I/O budget to each process that, when combined with a bunch of heuristics, is said to produce much better I/O response, especially on slower devices. Users with rotational storage benefit from BFQ, but there are also benefits when using slower solid-state devices; as a result, there is a fair amount of interest in using BFQ on devices like mobile handsets, for example.
The idea that BFQ is an improvement over the CFQ scheduler found in mainline kernels is fairly uncontroversial, but getting BFQ merged was still a lengthy process. One of the final stumbling blocks was that it was a traditional I/O scheduler rather than a multiqueue scheduler. The block subsystem developers have a long-term goal of moving all drivers to the multiqueue model, and merging a non-multiqueue I/O scheduler did not seem like a step in that direction.
Over the last several months, BFQ developer Paolo Valente has worked to port the code to the multiqueue interface. The known problems with the port have been resolved, and block subsystem maintainer Jens Axboe has agreed to merge it for 4.12. If all goes to plan, the long wait for the BFQ I/O scheduler will finally be at an end.
The interesting thing is that BFQ will not be the only multiqueue I/O scheduler entering the mainline in 4.12; there will be another one, developed over a much shorter time period, that should also be merged then. One might well wonder why a second scheduler is needed, especially since kernel developers place a premium on general solutions that can address a wide variety of use cases. But it seems that there is, indeed, a reasonable case to be made for merging a second multiqueue I/O scheduler.
BFQ is a complex scheduler that is designed to provide good interactive response, especially on those slower devices. It has a relatively high per-operation overhead, which is justified when the I/O operations themselves are slow and expensive. This complexity may not make sense, though, in situations where I/O operations are cheap and throughput is a primary concern. When running a server workload using solid-state devices, it may be better to run a much simpler scheduler that allows for request merging and perhaps some simple policies, but which mostly stays out of the way.
The Kyber I/O scheduler, posted by Omar Sandoval, would appear to be such a beast. Kyber is intended for fast multiqueue devices and lacks much of the complexity found in BFQ; it is less than 1,000 lines of code. Its policies, while simple, would appear to be an interesting echo of the bufferbloat work done in the networking stack.
I/O requests passing through Kyber are split into two primary queues, one for synchronous requests and one for asynchronous requests — or, in other words, one for reads and one for writes. A process issuing a read request is typically unable to proceed until that request completes and the data is available, so such requests are seen as synchronous. A write operation, instead, can complete at some later time; the initiating process doesn't usually care when that write actually happens. So it is common to prioritize reads over writes, but not to the point that writes are starved.
The key to the Kyber algorithm is that the number of operations (both reads and writes) sent to the dispatch queues (the queues that feed operations directly to the device) is strictly limited, keeping those queues relatively short. If the dispatch queues are short, then the amount of time that passes while any given request waits in the queues (the per-request latency) will be relatively small. That ensures a quick completion time for higher-priority requests.
The scheduler tunes the actual number of requests allowed into the dispatch queues by measuring the completion time of each request and adjusting the limits to achieve the desired latencies. There are two tuning knobs available to the system administrator for setting the latency targets; they default to 2ms for read requests and 10ms for writes. As always, there will be tradeoffs in setting these values; setting them too low will ensure low latencies, but at the cost of reducing the opportunities for the merging of requests, hurting throughput.
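Sandoval's patch contains the real logic; as a rough, invented illustration of that feedback loop (none of these names or constants come from Kyber itself), the depth adjustment could be sketched like this:

#include <stdio.h>

/* Invented sketch of latency-feedback control over queue depth. */
struct domain {
	unsigned int depth;		/* dispatch-queue depth limit */
	unsigned long target_ns;	/* latency target for this domain */
};

static void adjust_depth(struct domain *d, unsigned long measured_ns)
{
	if (measured_ns > d->target_ns) {
		/* Completions too slow: shorten the dispatch queue. */
		if (d->depth > 1)
			d->depth /= 2;
	} else if (measured_ns < d->target_ns / 2) {
		/* Comfortably under target: favor merging and throughput. */
		d->depth++;
	}
}

int main(void)
{
	struct domain reads = { .depth = 64, .target_ns = 2000000 };	/* 2ms */

	adjust_depth(&reads, 5000000);	/* one slow completion halves the depth */
	printf("read depth now %u\n", reads.depth);
	return 0;
}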
Kyber, too, has been accepted for the 4.12 merge window. So, if all goes according to plan, the 4.12 kernel will have two new options for I/O scheduling on multiqueue devices. Users concerned with interactive response and possibly using slower devices will likely opt for BFQ, while throughput-sensitive server loads are more likely to run with Kyber. Either way, an important gap in the kernel's multiqueue block I/O subsystem has been filled, clearing the path for an eventual transition of all of the kernel's block drivers to the multiqueue API.
Device power management with the OPP library
During the 4.6 development cycle, the operating performance points (OPP) framework gained the infrastructure to do dynamic voltage and frequency scaling (DVFS) on behalf of device drivers. This helps reduce the complexity of those drivers, which can instead focus on platform-specific details. The rest of this article discusses what has changed and how we can use it to simplify our device drivers.
Until Linux kernel release 4.5, the OPP framework acted as a helper library that provided a table of voltage-frequency pairs (with some additional information) for the kernel. Kernel frameworks, like cpufreq and devfreq, used these OPP tables to perform DVFS for the devices. The OPP framework creates this table dynamically via platform-specific code and statically from device-tree blobs.
Operating performance points
Systems on chips (SoCs) have become increasingly complex and power-efficient. There are multiple sub-modules within a SoC that work together, but not all of them are required to function at their highest performance frequency and voltage levels at all times, as that would waste power. Devices like CPUs, GPUs, and I/O devices can operate over a range of frequency and voltage pairs; they should stay at lower voltages and frequencies when the system load is low, and move to higher levels otherwise.
The set of discrete frequency-voltage tuples that the device supports is called its "operating performance points". For example, a CPU core that can operate at 1.0GHz at a minimum voltage of 1.0V, 1.1GHz at a minimum voltage of 1.1V, and 1.2GHz at a minimum voltage of 1.3V can be represented by these OPP tuples:
Hz uV
1000000000 1000000
1100000000 1100000
1200000000 1300000
These tuples may contain more configurable values as well, for example voltage levels for multiple power supplies. The example at the end of this article shows how the OPP nodes are present in a device tree (DT) blob.
Before the 4.6 kernel, the OPP framework was responsible for creating an OPP table by parsing the device tree (or via the platform-specific code) and providing a set of helpers to inquire about target OPPs, such as finding the minimum or maximum OPP corresponding to a target frequency. The consumer drivers of the OPP library used the helpers to find an OPP corresponding to the target frequency and used it to configure the device's clock and power supplies (if required).
What's new
For the most common configurations (with at most one power supply for the device), all consumer drivers had pretty much identical DVFS code. So it made sense to let the OPP core configure the device to a particular OPP and simplify the drivers by removing such code from them. During the 4.6 development cycle, the OPP core thus gained the functionality to perform DVFS on behalf of device drivers. Those drivers need to pass a target frequency, and the OPP core will find and set the best possible OPP corresponding to that.
In order to perform DVFS on behalf of device drivers, the OPP core needs some of the device's resources. Some of them are acquired automatically by the OPP core, while the core needs help from the driver to get others. It is important for driver writers to understand the expectations of the OPP core before they try to use it to do DVFS for their devices.
In order to change the frequency of a device, the OPP core needs a pointer to the struct clk for the device. The OPP core gets this automatically by calling clk_get() using the device's struct device pointer. The driver must make sure that the device has a valid clock registered for it with the clock framework, otherwise the OPP core will fail to do DVFS for the device.
Voltage scaling isn't always required for frequency scaling, so acquiring the power-supply resources is optional. But for platforms that need to do voltage scaling, the OPP core needs some input from the driver. The OPP core supports devices that need no power supply, a single supply, or multiple supplies. The driver needs to provide the OPP core with the names of all the power supplies that must be configured to perform DVFS for the device, using:
struct opp_table *dev_pm_opp_set_regulators(struct device *dev,
const char * const names[],
unsigned int count);
Here, dev is the pointer to the device structure, names is a pointer to an array of power-supply names, and count is the number of entries in that array. This routine returns a pointer to the struct opp_table for the device on success and an error number (using ERR_PTR()) if something goes wrong. The order in which the names of the power supplies appear in this array is significant: the OPP core assumes that the entries in the opp-microvolt property in the OPP table in DT are present in the same order. Refer to the example at the end for more on the opp-microvolt property. If this function isn't called for a device, the OPP core assumes that the device doesn't need to participate in voltage scaling and that frequency scaling can be done independently.
The OPP core in turn calls regulator_get_optional() for each string present in the names array. If the OPP core fails to get the regulator corresponding to any of the strings, it returns with an error.
Once the consumer driver is done with the OPP table, it should free the resources acquired by the OPP core using the following routine:
void dev_pm_opp_put_regulators(struct opp_table *opp_table);
Here, opp_table is the pointer to the OPP table, earlier returned by dev_pm_opp_set_regulators().
Performing DVFS
Once the OPP core has all the resources it needs to do DVFS for a device, consumer drivers can use the helpers described below to let the OPP core perform DVFS on their behalf. The DVFS method differs a bit depending on the number of power supplies that must be configured for the device. In the most common cases, the OPP core either needs to do only frequency scaling (no power supply) or needs to do voltage scaling for a single power supply along with it. For such platforms, the driver needs to call this helper to let the OPP core do DVFS for the device:
int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq);
Here, dev is the pointer to the device structure and target_freq is the frequency the device should be programmed for. This routine configures the device for the OPP with the lowest frequency greater than or equal to the target frequency; it returns zero on success and a negative error number otherwise.
If the device doesn't need voltage scaling at all, then dev_pm_opp_set_rate() can be called without a prior call to dev_pm_opp_set_regulators(). Otherwise, dev_pm_opp_set_regulators() must have been called successfully before calling dev_pm_opp_set_rate(). If the target OPP has a higher frequency than the current OPP, then dev_pm_opp_set_rate() does voltage scaling before frequency scaling; otherwise, frequency scaling is done before voltage scaling.
The handling is a bit different in the complex cases where voltage scaling of multiple power supplies is required. The order in which multiple power supplies need to be programmed is platform-specific, and it is difficult to come up with common code that works in all cases. To simplify things, the OPP core lets drivers register a platform-specific set_opp() callback, which will be called by the OPP core from within dev_pm_opp_set_rate() at the time of DVFS. This callback can be registered using:
struct opp_table *dev_pm_opp_register_set_opp_helper(struct device *dev,
int (*set_opp)(struct dev_pm_set_opp_data *data));
Here, dev is the pointer to the device structure and set_opp() is the platform-specific callback. The callback takes a struct dev_pm_set_opp_data as its argument, which contains all the configuration the callback needs to do DVFS, and it returns zero on success and a negative error number otherwise. This helper returns a pointer to the struct opp_table for the device on success and an error number (cast as a pointer) if something went wrong.
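As an illustration, a skeleton callback for a hypothetical platform with two supplies that must be raised in a fixed order might look like the following; the dev_pm_set_opp_data field names below reflect the kernel headers of this era, but double-check them against your tree:

static int platform_set_opp(struct dev_pm_set_opp_data *data)
{
	struct dev_pm_opp_supply *new = data->new_opp.supplies;
	int ret;

	/* Hypothetical platform rule: program supply 0 before supply 1. */
	ret = regulator_set_voltage_triplet(data->regulators[0],
					    new[0].u_volt_min, new[0].u_volt,
					    new[0].u_volt_max);
	if (ret)
		return ret;

	ret = regulator_set_voltage_triplet(data->regulators[1],
					    new[1].u_volt_min, new[1].u_volt,
					    new[1].u_volt_max);
	if (ret)
		return ret;

	return clk_set_rate(data->clk, data->new_opp.rate);
}

For simplicity, this sketch always scales the voltages before the frequency; a real callback would order those steps based on whether the device is scaling up or down, as dev_pm_opp_set_rate() itself does in the single-supply case.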
The platform-specific callback should be unregistered using the following routine after the driver is done with the OPP table:
void dev_pm_opp_register_put_opp_helper(struct opp_table *opp_table);
Here, opp_table is the pointer to the OPP table, earlier returned by dev_pm_opp_register_set_opp_helper().
Connecting it all together
Here is an example that connects the dots to explain how it all fits together. We have two CPU devices here (that share their clock/voltage rails) and we need to configure a single power supply to perform DVFS for them. The device-tree fragment describing the CPUs themselves would be:
cpus {
#address-cells = <1>;
#size-cells = <0>;
cpu@0 {
compatible = "arm,cortex-a9";
reg = <0>;
next-level-cache = <&L2>;
clocks = <&clk_controller 0>;
clock-names = "cpu";
vdd-supply = <&vdd_supply0>;
operating-points-v2 = <&cpu_opp_table>;
};
cpu@1 {
compatible = "arm,cortex-a9";
reg = <1>;
next-level-cache = <&L2>;
clocks = <&clk_controller 0>;
clock-names = "cpu";
vdd-supply = <&vdd_supply0>;
operating-points-v2 = <&cpu_opp_table>;
};
};
These definitions reference cpu_opp_table, which is a table describing the valid operating points for these CPUs; it is also found in the device tree:
cpu_opp_table: opp_table {
compatible = "operating-points-v2";
opp-shared;
opp@1000000000 {
opp-hz = /bits/ 64 <1000000000>;
opp-microvolt = <990000 1000000 1050000>;
opp-microamp = <70000>;
clock-latency-ns = <300000>;
opp-suspend;
};
opp@1100000000 {
opp-hz = /bits/ 64 <1100000000>;
opp-microvolt = <1090000 1100000 1150000>;
opp-microamp = <80000>;
clock-latency-ns = <310000>;
};
opp@1200000000 {
opp-hz = /bits/ 64 <1200000000>;
opp-microvolt = <1190000 1200000 1250000>;
opp-microamp = <90000>;
clock-latency-ns = <290000>;
turbo-mode;
};
};
The platform-specific code needed to set up DVFS would look something like:
const char * const name[] = {"vdd"};
struct opp_table *opp_table;

opp_table = dev_pm_opp_set_regulators(dev, name, ARRAY_SIZE(name));
if (IS_ERR(opp_table))
	dev_err(dev, "Failed to set regulators: %ld\n", PTR_ERR(opp_table));
The driver responsible for voltage and frequency scaling would then do something like this:
ret = dev_pm_opp_set_rate(dev, target_freq);
if (ret)
dev_err(dev, "Failed to set rate: %d\n", ret);
With these enhancements in the OPP core, using the standard interfaces like clocks and regulators, the device drivers are simplified to a great extent. Going forward we should enhance the OPP core further to keep all future DVFS-related configurations in a single place.
A formal kernel memory-ordering model (part 2)
This article continues the description of the Linux-kernel memory model that started here, with the intended audience for each section given in parentheses:
- Memory models and the role of cycles (people interested in using memory-ordering tools).
- Specifying a memory model in terms of prohibited cycles (people interested in understanding the memory model).
- Conclusions (all).
This is followed by the inescapable answers to the quick quizzes.
Memory models and the role of cycles
One way of formalizing a memory model is to create an abstract description of how a running system operates internally, and then enumerate all the possible outcomes this abstract operation can give rise to. There are tools that take this operational approach. Another way is to define the constraints imposed by the memory model, in the form of logical axioms, and then enumerate all the possible outcomes that are consistent with these constraints. A tool using this axiomatic approach (called "herd") is described here. This tool can be downloaded and built as described in the INSTALL.md file.
Both approaches take as input a small fragment of code and an assertion (together called a litmus test) and produce an output value indicating whether the memory model permits the code fragment to execute in a way that would make the assertion true. Here is a simple example of a litmus test (with line numbers added) that illustrates the so-called “message-passing” pattern:
Litmus Test #1
1 C C-MP+o-mb-o+o-mb-o
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*y, 1);
9 smp_mb();
10 WRITE_ONCE(*x, 1);
11 }
12
13 P1(int *x, int *y)
14 {
15 int r1;
16 int r2;
17
18 r1 = READ_ONCE(*x);
19 smp_mb();
20 r2 = READ_ONCE(*y);
21 }
22
23 exists
24 (1:r1=1 /\ 1:r2=0)
Line 1 identifies the source language of the code fragment (“C”) and gives the litmus test's name (“C-MP+o-mb-o+o-mb-o”). Lines 3 and 4 are where initial values could be provided. In this program no explicit initialization is needed, because all variables' initial values default to zero. Lines 6-21 provide the code, in this case, one function for each of two processors. You can choose any name you like for these functions as long as it consists of a “P” immediately followed by the processor's number, numbered consecutively starting from zero. By convention, local variable names begin with “r” (these variables are treated as though they are stored in CPU registers), and global variables must be passed in by reference as function parameters. The names of these function parameters are significant: they must match the names of the corresponding global variables.
Finally, lines 23 and 24 provide an “exists” assertion expression to evaluate the final state. This final state is evaluated after the dust has settled: both processes have completed and all of their memory references and memory barriers have propagated to all parts of the system. The references to the local variables “r1” and “r2” in line 24 must be prefixed with “1:” to specify which processor they are local to. Note that a single “=” in this expression is an equality operator rather than an assignment (the assertion expression is written in the litmus-test language rather than in C). The “/\” character combination means “and”; it is an ASCII representation of the mathematical “∧” symbol. Similarly, “\/” stands for “or” (the mathematical “∨” symbol); this assertion could have been expressed just as well in negated form by writing:
23 forall
24 (1:r1=0 \/ 1:r2=1)
The “~” character indicates negation, so this assertion could also have been written in non-negated form as follows:
23 exists
24 ~(1:r1=0 \/ 1:r2=1)
The herd tool normally only prints variables that appear in the exists (or forall) clause. Additional variables can be printed using locations, which is demonstrated here.
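For example, here is a sketch of how the final lines of Litmus Test #1 could be augmented so that herd also reports the final values of x and y in each state (our own illustration using the standard litmus locations directive; the rest of the test is unchanged):

locations [x; y;]
exists
(1:r1=1 /\ 1:r2=0)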
The software tools mentioned above simply tell you whether the logic expression evaluates to true in all, some, or none of the possible executions of the code. Value judgments are left to the user. herd may be run using the linux.def macro file included in the source package, the Litmus Test #1 source file, and the “bell” and “cat” files for the strong kernel memory model described here. The command is as follows:
herd7 -macros linux.def -bell strong-kernel.bell -cat strong-kernel.cat \
C-MP+o-mb-o+o-mb-o.litmus
For people who prefer shorter command lines, the strong.cfg configuration file specifies these settings already, along with several others related to the style of the plot files herd is capable of producing. The command using this configuration file is:

herd7 -conf strong.cfg C-MP+o-mb-o+o-mb-o.litmus

The output from either command is:

Outcome for Litmus Test #1 (strong model)
 1 Test C-MP+o-mb-o+o-mb-o Allowed
 2 States 3
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=1;
 5 1:r1=1; 1:r2=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (1:r1=1 /\ 1:r2=0)
10 Observation C-MP+o-mb-o+o-mb-o Never 0 3
11 Hash=3240a31645e46554cb09739d726087ad
This output indicates the three possible outcomes from running this code in the Linux kernel:
- r1 == 0 && r2 == 0. This outcome occurs when P1 completes before P0 begins.
- r1 == 0 && r2 == 1. This outcome occurs when P1 executes concurrently with P0, so that P1's read from x executes before P0's write to x, but P1's read from y executes after P0's write to y.
- r1 == 1 && r2 == 1. This outcome occurs when P1 starts after P0 completes.
The outcome r1 == 1 && r2 == 0 is not possible, as indicated by the “Never 0 3” near the end of the output. Here, “Never” indicates that the test condition is never true; “0 3” reinforces that by saying that the condition was found to be true in zero possible outcomes and false in three outcomes. This forbidden outcome would require a cycle of events, each happening before the next and the last happening before the first:
- P0 writes to x,
- P1 reads from x,
- P1 reads from y, and
- P0 writes to y.
[Diagram: the four events above, linked into a cycle by po, rf, and fr edges.]

Please note that the concept of cycles can be thought of as a mathematically precise generalization of the memory-barriers.txt concept of memory-barrier pairing. The labels in the diagram are defined as follows:
- fr = “from-read”, linking each read to any writes to the same variable that execute too late to affect the value returned by that read.
- po = “program order”, linking statements within a given process in the order that they appear in the instruction stream.
- rf = “reads from”, linking a given write to any reads that load the value stored by that write.
The fr relation can be somewhat counter-intuitive, so please look here for additional explanation. For instance, in the forbidden execution of Litmus Test #1, r2 = READ_ONCE(*y) returns the initial value zero, so fr links that read to WRITE_ONCE(*y, 1), the write that executed too late to affect it. The herd tool provides many additional relations, which are tabulated here.
It is important to note that not all cycles are prohibited. To see this, consider the following:
Litmus Test #2
1 C C-MP+o-o+o-o
2
3 {
4 }
5
6 P0(int *x, int *y)
7 {
8 WRITE_ONCE(*y, 1);
9 WRITE_ONCE(*x, 1);
10 }
11
12 P1(int *x, int *y)
13 {
14 int r1;
15 int r2;
16
17 r1 = READ_ONCE(*x);
18 r2 = READ_ONCE(*y);
19 }
20
21 exists
22 (1:r1=1 /\ 1:r2=0)
This is exactly the same as the previous litmus test except that the smp_mb() calls have been removed. Despite the fact that the outcome r1 == 1 && r2 == 0 exhibits the same cycle as above, it can in fact occur on weakly ordered systems where, for example, P0's writes and P1's reads can be reordered by the hardware. On such systems, the smp_mb() statements are necessary to ensure that the writes and reads execute in the same order in which they appear in the source code.

Quick Quiz 1: Can't the compiler also reorder these accesses?
Answer

This can be confirmed by running the tool in the same way as before, but on the new litmus test:
herd7 -conf strong.cfg C-MP+o-o+o-o.litmus
The output will be as follows:
Outcome for Litmus Test #2 (strong model)
 1 Test C-MP+o-o+o-o Allowed
 2 States 4
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=1;
 5 1:r1=1; 1:r2=0;
 6 1:r1=1; 1:r2=1;
 7 Ok
 8 Witnesses
 9 Positive: 1 Negative: 3
10 Condition exists (1:r1=1 /\ 1:r2=0)
11 Observation C-MP+o-o+o-o Sometimes 1 3
12 Hash=c3bdaae6256fa364ad31fb3c1e07c0f5
Note that all four possible states are present, and note also the “Sometimes 1 3” near the end of the output.
On sufficiently weakly ordered systems, the cyclic outcome in Litmus Test #2 could occur even without instruction reordering, because the writes might not propagate from P0 to P1 in the order they were executed. And even on more strongly ordered systems, it would be sufficient to reorder either the reads or the writes; it is not necessary to reorder both. For example, if P1's accesses were reordered then we could have the following sequence of events:
- P1 reads from y,
- P0 writes to y,
- P0 writes to x, and
- P1 reads from x.
This illustrates an important point: cycles in time of instruction execution are impossible, because time is linearly ordered (in our universe, even if not in all solutions to Einstein's field equations). Part of a memory model's job is to provide the conditions under which one instruction must execute before another and to check for any resulting cycles. On the other hand, if there is no such cycle then it is possible to find an order of execution for all the instructions which is compatible with the memory model's ordering requirements (for example, by doing a topological sort). If this potential execution order did not violate any of the memory model's other requirements, it would demonstrate that the litmus test's assertion could hold.
Okay, we admit the preceding paragraph is an oversimplification. Modern CPUs do not execute instructions at precise moments in time; instead they run instructions through complicated multi-stage pipelines and engage in multiple issue (running more than one instruction through the same pipeline stages in parallel). Furthermore, other ordering requirements come into play along with time of execution, such as cache coherence (see below). Nevertheless, the basic idea is valid.
It is worth pointing out that computer hardware almost always has additional restrictions beyond what the memory models describe; CPU designers generally do not implement all of the behaviors allowed by the instruction set architecture. The fact that a memory model says a particular litmus test's assertion might hold does not mean it can actually happen on any given computer. As a simple example, the finite write buffers found in real hardware prevent that hardware from actually doing all the reorderings of writes that memory models typically allow. It also goes the other way—sometimes CPU designers mistakenly implement a behavior that is prohibited by the instruction set architecture (otherwise known as a “silicon bug” or “CPU erratum”).
Specifying a memory model in terms of prohibited cycles
As we have just seen, there is a close relationship between orderings and the existence of cycles: If some events are constrained to be ordered in a certain way then that ordering cannot contain a cycle. Conversely, if a given relation among various events does not contain any cycles then it is possible to order those events consistently with the relation. Thus, if we can precisely specify which instructions must execute before others in a given piece of Linux kernel code, we will be well on our way to constructing a formal model that defines the kernel's execution-ordering guarantees in terms of cycles among instructions. Even better, this model can then be used to construct a tool that analyzes litmus tests for execution-ordering problems. (And of course, the same technique can be used for describing a memory model's other ordering requirements.)
The herd tool implements a language, called cat, designed to represent memory models, which it does by specifying what cycles are prohibited. This specification is defined in terms of sets and relations involving memory-access events, barriers, and threads. (For our purposes, each processor in a litmus test corresponds to a distinct thread.) herd is discussed in more detail here; in this section we will see how to write some simple memory models in the cat language.
But first, what cycles should the Linux kernel memory model prohibit? Here is a partial list:
- Placing a full memory barrier (smp_mb()) between each pair of memory accesses in each process will prohibit all cycles. In other words, smp_mb() can be said to restore sequential consistency, that is, to ensure that all processors agree on a global order of all memory accesses by all processors. The next section shows example herd code that accomplishes this.
- So-called out-of-thin-air computations must be ruled out. These are cycles where each CPU writes a value depending on the value written by the previous CPU in the cycle (a litmus-test sketch appears after this list), and they are handled by the Happens-Before section of the model's cat file. The underlying reason why the model does not produce out-of-thin-air values is that any WRITE_ONCE() that depends on a READ_ONCE() is ordered after it, regardless of the type of dependency.
- All CPUs should agree on the order of accesses to any single memory location. Making this happen is described below.
- A chain of release-acquire pairs, where each load-acquire returns the value stored by the preceding store-release, should never form a cycle. The strong model's bell and cat files prohibit such cycles.
- Cycles violating RCU's guarantees must be prohibited. This is handled by the RCU-specific portions of the bell and cat files.
There are quite a few additional nuances of Linux-kernel use cases and peculiarities of specific hardware, but this list provides a good starting point. The following sections present trivial “toy” memory models that prohibit the first two types of cycles.
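To make the out-of-thin-air case concrete, here is a minimal litmus-test sketch in the same style as the tests above (the test name and the value 42 are our own illustration, not from the model's test suite). Each process simply stores the value it just loaded, so the only way the assertion could succeed would be for the value 42 to appear out of thin air; a model satisfying the second requirement above reports this outcome as Never:

C C-OOTA-sketch

{
}

P0(int *x, int *y)
{
	int r1;

	r1 = READ_ONCE(*x);
	WRITE_ONCE(*y, r1);
}

P1(int *x, int *y)
{
	int r2;

	r2 = READ_ONCE(*y);
	WRITE_ONCE(*x, r2);
}

exists
(0:r1=42 /\ 1:r2=42)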
Relaxed memory order: toy specification
The following shows a simple herd program that represents a fragment of the Linux kernel memory model involving simple memory accesses (READ_ONCE() and WRITE_ONCE()) and strong memory barriers (smp_mb()):
toy-RMO.cat
1 "Toy RMO"
2
3 include "cos.cat"
4
5 let rfe = rf & ext
6 let fence = fencerel(F)
7
8 let rmo-order = fence | rfe | co | fr
9 acyclic rmo-order
Line 1 provides a name for the model, and line 3 pulls in some definitions that can be thought of as the herd equivalent to the C-language:
#include <stdio.h>
However, instead of defining I/O primitives, “cos.cat” defines some basic relations, including the fr relation mentioned earlier.
For Litmus Test #2 and Litmus Test #1 above (assuming the cyclic execution), the built-in rf (“reads-from”) relation contains the following links:
- WRITE_ONCE(*x, 1) ⟶ r1 = READ_ONCE(*x), and
- INIT(*y, 0) ⟶ r2 = READ_ONCE(*y).
The fr relation contains this link:
- r2 = READ_ONCE(*y) ⟶ WRITE_ONCE(*y, 1).
And the co (“coherence order”) relation contains these links:
- INIT(*x, 0) ⟶ WRITE_ONCE(*x, 1), and
- INIT(*y, 0) ⟶ WRITE_ONCE(*y, 1).
Line 5 computes rfe (“reads-from external”), which is a restricted version of the rf relation that covers only write-read pairs where the write and the read are executed by different threads. It does this by intersecting (the & operator) the rf relation with the predefined ext relation, which links all pairs of instructions belonging to different threads. For the two litmus tests above, the rfe relation turns out to be exactly the same as the rf relation.
Line 6 uses the standard fencerel() function and F event set to define a relation that links any two instructions separated by a memory barrier. For Litmus Test #2, which contains no instances of smp_mb(), this relation is empty. For Litmus Test #1, it contains the following links:
- WRITE_ONCE(*y, 1) ⟶ WRITE_ONCE(*x, 1)
- r1 = READ_ONCE(*x) ⟶ r2 = READ_ONCE(*y)
Line 8 defines the rmo-order relation as the union (the | operator) of the fence, rfe, co, and fr relations. rmo-order includes all pairs of instructions for which this toy model of relaxed memory order (RMO) requires the first to execute before the second. Line 9 expresses this requirement by stating that the rmo-order relation is acyclic (contains no cycles).
For Litmus Test #2, rmo-order does not contain a cycle, as shown below:

[Diagram: the rmo-order links for Litmus Test #2. The dotted “po” edges are for illustration only; they are not present in the rmo-order relation and do not contribute to any cycles.]

On the other hand, for Litmus Test #1, the additional links added by the fence relation do create a cycle:

[Diagram: the rmo-order links for Litmus Test #1, in which the added fence links complete a cycle.]
Thus this model correctly distinguishes the “message-passing” examples with and without memory barriers, as can be seen by downloading toy-RMO.cat and passing it via the -cat command-line argument for Litmus Test #2 as follows:
herd7 -conf strong.cfg -cat toy-RMO.cat C-MP+o-o+o-o.litmus
This produces the following output:
Outcome for Litmus Test #2 (toy-RMO model)
 1 Test C-MP+o-o+o-o Allowed
 2 States 4
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=1;
 5 1:r1=1; 1:r2=0;
 6 1:r1=1; 1:r2=1;
 7 Ok
 8 Witnesses
 9 Positive: 1 Negative: 3
10 Condition exists (1:r1=1 /\ 1:r2=0)
11 Observation C-MP+o-o+o-o Sometimes 1 3
12 Hash=c3bdaae6256fa364ad31fb3c1e07c0f5

Given the lack of a cycle in the rmo-order relation, the counter-intuitive cyclic execution is permitted, as indicated by “Sometimes 1 3” in the output. In contrast, for Litmus Test #1, with memory barriers, the command line:

herd7 -conf strong.cfg -cat toy-RMO.cat C-MP+o-mb-o+o-mb-o.litmus

produces the following output:

Outcome for Litmus Test #1 (toy-RMO model)
 1 Test C-MP+o-mb-o+o-mb-o Allowed
 2 States 3
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=1;
 5 1:r1=1; 1:r2=1;
 6 No
 7 Witnesses
 8 Positive: 0 Negative: 3
 9 Condition exists (1:r1=1 /\ 1:r2=0)
10 Observation C-MP+o-mb-o+o-mb-o Never 0 3
11 Hash=3240a31645e46554cb09739d726087ad

As expected, the memory barriers exclude the counter-intuitive outcome where r1 == 1 && r2 == 0.
Relaxed memory order: coherence included
Consider this ridiculous single-thread litmus test:
Litmus Test #3
1 C C-CO+o-o.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 *x = 3;
9 *x = 4;
10 }
11
12 exists
13 (x=3)
On the face of it, this test can never succeed. If we set x to 3 and then overwrite it with the value 4, how can x possibly end up containing 3? Nevertheless, running the Toy RMO model shows that this outcome is permitted:
Outcome for Litmus Test #3 (toy-RMO model)
 1 Test C-CO+o-o Allowed
 2 States 2
 3 x=3;
 4 x=4;
 5 Ok
 6 Witnesses
 7 Positive: 1 Negative: 1
 8 Condition exists (x=3)
 9 Observation C-CO+o-o Sometimes 1 1
10 Hash=b9e4f0d747854e10ad7310b4381f3652
This is because the model does not forbid it, and everything that is not explicitly forbidden is permitted. The model does not account for cache coherence, a feature supported by most modern microprocessors—and demanded by the vast majority of sane kernel hackers. That's one reason why this model should be considered to be a toy.
Cache coherence (sometimes referred to as “per-location sequential consistency”) requires that the writes to any one location in memory occur in a single total order (the coherence order), which all the processors must agree on. It also says that within each thread, the coherence order must be consistent with the program order, as described by the following four coherence rules:
- Write-write coherence: If two writes in the same thread access the same location, the write that comes first in program order must come first in the coherence order for that location.
- Write-read coherence: If a write W precedes (in program order) a read R of the same location, then R must read from W or from a write that occurs after W in the location's coherence order.
- Read-write coherence: If a read R precedes (in program order) a write W of the same location, then R must read from a write that occurs before W in the location's coherence order.
- Read-read coherence: If two reads R and R' in the same thread access the same location, where R comes before R' in program order, either they must read from the same write or else the write read by R must occur before the write read by R' in the location's coherence order.
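As a concrete illustration of the write-read coherence rule, consider the following single-process sketch (our own example, in the same style as the litmus tests above, not from the article's test suite). The read must return either the value of the program-order-earlier write or that of some write coming after it in the coherence order; since no such later write exists, a coherent model reports the r1 == 0 outcome as Never:

C C-WR-coherence-sketch

{
}

P0(int *x)
{
	int r1;

	WRITE_ONCE(*x, 1);
	r1 = READ_ONCE(*x);
}

exists
(0:r1=0)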
In Litmus Test #3 above, there are three writes to the location where x is stored: the initializing write of 0 (implicit in lines 3-4), and the writes of 3 and 4 (lines 8-9). The initializing write always comes first in the coherence order, and the value tested in the “exists” clause is always the value stored by the write that comes last in the coherence order (called the final write). Thus for the test to succeed, the coherence order for x would have to be: x=0, x=4, x=3. But this would violate the write-write coherence rule, because the write that sets x to 3 comes before (in program order) the write that sets it to 4.
(Note: The C11 standard recognizes the notion of sequenced-before rather than that of program order. For the most part the two are the same, referring to the order in which loads and stores occur in the source code, but there are a few differences. For example, the compiler is not required to evaluate the arguments to a function call in any particular order. Thus, even though the statement:

printf("%d %d", WRITE_ONCE(x, 3), WRITE_ONCE(x, 4));

will always print “3 4”, after it executes x may be equal either to 3 or to 4. We will not worry about such subtleties for now. But we will point out that in Litmus Test #3, the “*x = 3” write is sequenced before the “*x = 4” write, and the compiler is not permitted to reorder them. That is why we have omitted the WRITE_ONCE() calls and reverted to plain ordinary assignment. It's okay in this case, because x isn't shared between processors and we're only trying to make a simple point. But note that even with this two-line test program, the compiler is permitted to eliminate the “*x = 3” write entirely.)
Our Toy RMO memory model can be strengthened to take cache coherence into account. Here is the result:
coherent-RMO.cat
 1 "Coherent RMO"
 2
 3 include "cos.cat"
 4
 5 let rfe = rf & ext
 6 let fence = fencerel(F)
 7
 8 let rmo-order = fence | rfe | co | fr
 9 acyclic rmo-order
10
11 let com = rf | co | fr
12 let coherence-order = po-loc | com
13 acyclic coherence-order
Quick Quiz 2: The rf, co, and fr terms in the definition of com describe write-read, write-write, and read-write links respectively, corresponding to three of the four coherence rules. Why is there no term corresponding to the read-read rule?
Answer
po-loc in line 12 is another standard relation; it is the intersection of po and loc, where the loc relation links all pairs of memory accesses that refer to the same location in memory. Thus, po-loc links each memory access to all those that occur after it in program order and access the same variable. Lines 12-13 go on to define coherence-order as the union of po-loc and com and to require that coherence-order not have any cycles.
Since Litmus Test #3 contains no reads, its rf and fr relations are empty and therefore com ends up being the same as co. In the non-intuitive execution accepted by the Toy RMO model (where x=3 comes last in the coherence order), com contains the following links:
- INIT(*x, 0) ⟶ *x = 3,
- INIT(*x, 0) ⟶ *x = 4, and
- *x = 4 ⟶ *x = 3.
In addition, the po-loc relation contains this link:
- *x = 3 ⟶ *x = 4.
The com link from “*x = 4” to “*x = 3” together with the po-loc link from “*x = 3” back to “*x = 4” forms a cycle, which line 13 of the model prohibits. Running Litmus Test #3 against the Coherent RMO model therefore yields only the intuitive result:
Outcome for Litmus Test #3 (coherent-RMO model)
 1 Test C-CO+o-o Allowed
 2 States 1
 3 x=4;
 4 No
 5 Witnesses
 6 Positive: 0 Negative: 1
 7 Condition exists (x=3)
 8 Observation C-CO+o-o Never 0 1
 9 Hash=b9e4f0d747854e10ad7310b4381f3652
Here's a slightly more sophisticated test that probes the read-read coherence rule:
Litmus Test #4
1 C C-CO+o-o+o-o.litmus
2
3 {
4 }
5
6 P0(int *x)
7 {
8 WRITE_ONCE(*x, 3);
9 WRITE_ONCE(*x, 4);
10 }
11
12 P1(int *x)
13 {
14 int r1;
15 int r2;
16
17 r1 = READ_ONCE(*x);
18 r2 = READ_ONCE(*x);
19 }
20
21 exists
22 (1:r1=4 /\ 1:r2=3)
Because of the write-write coherence rule, we know that the coherence order for x must be: x=0, x=3, x=4. If r1 and r2 were to end up equal to 4 and 3 respectively, it would mean the later read (in program order) had read from the earlier write (in x's coherence order), thereby violating read-read coherence.
To see why the Coherent RMO model forbids this result, consider how the various relations would turn out. Because x=4 must come last in the coherence order for x, the co relation contains these links:
- INIT(*x, 0) ⟶ WRITE_ONCE(*x, 3),
- INIT(*x, 0) ⟶ WRITE_ONCE(*x, 4), and
- WRITE_ONCE(*x, 3) ⟶ WRITE_ONCE(*x, 4).
In the execution where r1 ends up containing 4 and r2 ends up containing 3, the rf relation contains these links:
- WRITE_ONCE(*x, 4) ⟶ r1 = READ_ONCE(*x) and
- WRITE_ONCE(*x, 3) ⟶ r2 = READ_ONCE(*x),
and therefore the fr relation contains this link:
- r2 = READ_ONCE(*x) ⟶ WRITE_ONCE(*x, 4).
Finally, the po-loc relation contains these links:
- WRITE_ONCE(*x, 3) ⟶ WRITE_ONCE(*x, 4) and
- r1 = READ_ONCE(*x) ⟶ r2 = READ_ONCE(*x).
Putting these together shows that coherence-order contains the following length-3 cycle:
- r2 = READ_ONCE(*x) ⟶ WRITE_ONCE(*x, 4) (an fr link),
- WRITE_ONCE(*x, 4) ⟶ r1 = READ_ONCE(*x) (an rf link), and
- r1 = READ_ONCE(*x) ⟶ r2 = READ_ONCE(*x) (a po-loc link).
As can be seen in the following herd output, this cycle is prohibited:
Outcome for Litmus Test #4 (strong model)
 1 Test C-CO+o-o+o-o Allowed
 2 States 6
 3 1:r1=0; 1:r2=0;
 4 1:r1=0; 1:r2=3;
 5 1:r1=0; 1:r2=4;
 6 1:r1=3; 1:r2=3;
 7 1:r1=3; 1:r2=4;
 8 1:r1=4; 1:r2=4;
 9 No
10 Witnesses
11 Positive: 0 Negative: 6
12 Condition exists (1:r1=4 /\ 1:r2=3)
13 Observation C-CO+o-o+o-o Never 0 6
14 Hash=e28b27408fda33a59c7f2cd8a5ff7615
Quick Quiz 3: But don't Itanium and SPARC RMO allow read-read reordering of accesses to a single variable by a single CPU? How does the model handle these CPUs?
Answer
Quick Quiz 4: Whatever happened to memory-barrier pairing???
Answer
This background will help you to understand the strong memory model itself.
Conclusions
We have presented a Linux-kernel memory model that we hope will be useful for education, for concurrent design, and for inclusion in other tooling. As far as we know, this is the first realistic formal memory model that includes RCU ordering properties, and we believe it to be the first realistic formal memory model of the Linux kernel.
This model is not set in stone, but rather is subject to change with the evolution of hardware and of Linux-kernel use cases. We expect the change rate to be roughly similar to the historical change rate of Documentation/memory-barriers.txt; however, we believe that the guiding principles underlying this memory model will be more durable.
The strong model accepts significant complexity to attain greater strength. In contrast, the weak model accepts some weakenings in order to achieve some degree of simplicity. Candidate weakenings include:
- Omitting C11 release sequences.
- Simplifying the preserved-program-order relations by omitting the po-loc, rd-rdw, detour, and rdw relations from them.
- Omitting B-cumulativity, which has the effect of allowing some cycles as exemplified by this litmus test.
- Retaining a less-ornate version of the obs (observation) relation, which is called short-obs.
- Retaining a less-ornate version of the cpord (coherence-point ordering) relation. This allows write-only cycles as exemplified by this litmus test.
Although we expect that this memory model will prove quite valuable, it does have a few limitations in addition to those called out earlier here and here:
- These memory models do not constitute official statements by the various CPU vendors on their respective architectures. For example, any of these vendors might report bugs at any time against any version of this memory model. This memory model is therefore not a substitute for a carefully designed and vigorously executed validation regime. In addition, this memory model is under active development and might change at any time.
- It is quite possible that this memory model will disagree with CPU architectures or with real hardware. For example, the model might well choose to allow behavior that all CPUs forbid if forbidding that behavior would render the model excessively complex. On the other hand, any situation where the model forbids behavior that some CPU allows constitutes a bug, either in the model or in the CPU.
- This tool is exponential in nature. Litmus tests that seem quite small compared to the entire Linux kernel might well take geologic time for the herd tool to analyze. That said, this tool can be extremely effective in exhaustively analyzing the code at the core of a synchronization primitive.
- The herd tool can only detect problems for which you have coded an assertion. This weakness is common to all formal methods, and is one reason that we expect testing to continue to be important. In the immortal words of Donald Knuth: "Beware of bugs in the above code; I have only proved it correct, not tried it."
On the other hand, one advantage of formal memory models is that tools based on them can detect any problem that might occur, even if the probability of occurrence is vanishingly small; in fact, even if current hardware is incapable of making that problem happen. Use of tools based on this memory model is therefore an excellent way to future-proof your code.
Acknowledgments
We owe thanks to H. Peter Anvin, Will Deacon, Andy Glew, Derek Williams, Leonid Yegoshin, and Peter Zijlstra for their patient explanations of their respective systems' memory models. We are indebted to Peter Sewell, Susmit Sarkar, and their groups for their seminal work formalizing many of these same memory models. We all owe thanks to Dmitry Vyukov, Boqun Feng, and Peter Zijlstra for their help making this human-readable. We are also grateful to Michelle Rankin and Jim Wasko for their support of this effort.
Answers to quick quizzes
Quick Quiz 1: Can't the compiler also reorder these accesses?
Answer: Given the current Linux-kernel definitions of READ_ONCE() and WRITE_ONCE(), no. These two macros map to volatile accesses, which the compiler is not allowed to reorder with respect to each other.
However, if these macros instead mapped to non-volatile C11 memory_order_relaxed loads and stores, then the compiler would be permitted to reorder them. And, as a general rule, compilers are much more aggressive about reordering accesses than even the most weakly ordered hardware. In both cases, those who don't like such code rearrangement call it “weak ordering” while those who do call it “optimization”.
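For reference, here is a simplified sketch of why these macros constrain the compiler (the actual Linux-kernel definitions include additional machinery for type checking and for accesses larger than a machine word):

/*
 * Simplified sketch of the kernel's ONCE macros. The casts to
 * volatile are what forbid the compiler from reordering, merging,
 * or eliminating the marked accesses.
 */
#define READ_ONCE(x) \
	(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) \
	do { *(volatile typeof(x) *)&(x) = (val); } while (0)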
Quick Quiz 2: The rf, co, and fr terms in the definition of com describe write-read, write-write, and read-write links respectively, corresponding to three of the four coherence rules. Why is there no term corresponding to the read-read rule?
Answer: It's not needed. As we will see in the discussion of Litmus Test #4, a violation of the read-read coherence rule involves a write being “interposed” between two reads in the coherence order. It therefore can be described as a length-3 cycle in coherence-order, involving an fr link followed by an rf link followed by a po-loc link.
Quick Quiz 3: But don't Itanium and SPARC RMO allow read-read reordering of accesses to a single variable by a single CPU? How does the model handle these CPUs?
Answer: In the case of Itanium, gcc compiles volatile reads (as in READ_ONCE()) as ld,acq, which enforces read-read ordering. And the Linux kernel runs SPARC in TSO mode, which prohibits read-read reorderings in general, including to a single variable.
Quick Quiz 4: Whatever happened to memory-barrier pairing???
Answer: Memory-barrier pairing can be thought of as a special case of cycles, but it was designed for a simpler time when people used much less aggressive lockless designs. Here is an example that breaks memory-barrier pairing:
Litmus Test #5
1 C C-R+o-wmb-o+o+mb+o.litmus
2
3 {
4 }
5
6 P0(int *a, int *b)
7 {
8 WRITE_ONCE(*a, 1);
9 smp_wmb();
10 WRITE_ONCE(*b, 1);
11 }
12
13 P1(int *a, int *b)
14 {
15 int r1;
16
17 WRITE_ONCE(*b, 2);
18 smp_mb();
19 r1 = READ_ONCE(*a);
20 }
21
22 exists
23 (b=2 /\ 1:r1=0)
Because the smp_wmb() orders writes and because the smp_mb() orders everything, straightforward application of memory-barrier pairing would lead you to believe that this cycle would be forbidden. This belief would be incorrect, as can be seen from running the litmus test against the strong model:
Outcome for Litmus Test #5 (strong model)
 1 Test C-R+o-wmb-o+o+mb+o Allowed
 2 States 4
 3 1:r1=0; b=1;
 4 1:r1=0; b=2;
 5 1:r1=1; b=1;
 6 1:r1=1; b=2;
 7 Ok
 8 Witnesses
 9 Positive: 1 Negative: 3
10 Condition exists (b=2 /\ 1:r1=0)
11 Observation C-R+o-wmb-o+o+mb+o Sometimes 1 3
12 Hash=0a4dd1e17f6132a7145a13b711ccd167
The problem is that the co relationship between P0()'s and P1()'s stores does not imply any sort of causal or temporal relationship between the two stores. Real hardware can and does choose the coherence ordering after the fact, and so real hardware can and does allow the cycle.
In short, memory-barrier pairing was useful in its day, but its day is rapidly drawing to a close. More sophisticated use of lockless algorithms requires more sophisticated mental models of memory barriers.
Page editor: Jonathan Corbet
Inside this week's LWN.net Weekly Edition
- Briefs: Bash Bunny; grsecurity; Debian FTP; Kali Linux; Linkerd 1.0; Quotes; ...
- Announcements: Newsletters, CfP, Events, Security updates, Kernel patches.
