LWN Weekly Edition Front pageSecurity Kernel development Distributions Development Linux in the news Announcements ->One big page
This page Previous weekFollowing week Sponsored link Serve your customers, not your servers, with VERIO Linux VPS. Full-access test-drive here. |
Kernel developmentRelease status Kernel release status The 2.6.26 merge window is open, so there is no current 2.6 development release. See the article below for a summary of the patches merged for 2.6.26 so far.The current -mm tree is 2.6.25-mm1. Recent changes to -mm include some read-copy-update enhancements and the OLPC architecture support code, but mostly -mm is just getting ready for the big flow of patches into the mainline. See the -mm merge plans document for Andrew's plans for 2.6.26. The current stable 2.6 kernel is 2.6.25, released on April 16. After nearly three months of development and the merging of over 12,000 patches from almost 1200 developers, this kernel is now considered ready for wider use. Highlights of this release include the ath5k (Atheros wireless) driver, a bunch of realtime work including realtime group scheduling, preemptable RCU, LatencyTop support, a number of new ext4 filesystem features, support for the controller area network protocol, more network namespace work, the return of the timerfd() system call, the page map patches (providing much better information on system memory use), the SMACK security module, better kernel support for Intel and ATI R500 graphics chipsets, the memory use controller, ACPI thermal regulation support, MN10300 architecture support, and much more. See the KernelNewbies 2.6.25 page for lots of details, or the full changelog for unbelievable amounts of detail. 2.6.24.5 was released on April 18. It contains a relatively long list of fixes for significant 2.6.24 problems. For older kernels: 2.4.36.3 was released on April 19. "Nothing outstanding here, I've just decided to release pending fixes. Those already running 2.4.36.2 have no particular reason to upgrade, unless they already experience troubles in the fixed areas."
Kernel development news Quotes of the week
In any case, we'll continue to use the fact that mprotect is also
broken to get our WC mapping working (using mprotect PROT_NONE
followed by mprotect PROT_READ|PROT_WRITE causes the CD and WT bits
to get cleared). We're fortunate in this case that we've found a
bug to exploit that gives us the desired behaviour.
-- Keith Packard
Nice-looking code - kgdb has improved rather a lot. I'm glad we
finally got it in. Maybe one day I'll get to use it again.
-- Andrew Morton
/me duly notes this request to break Andrew's systems even more frequently ;-)
-- Ingo Molnar
The 2.6.26 merge window opens That shiny new 2.6.25 kernel which was released on April 16 is now ancient history; some 3500 changesets have been merged into the mainline git repository since then. Some of the most significant user-visible changes include:
Changes visible to kernel developers include:
Needless to say, this development series is still young and, as of this writing, the merge window has over a week to run. So there will be a lot more code going into the mainline before the shape of 2.6.26 becomes clear.
4K stacks by default? The kernel stack is a rather important chunk of memory in any Linux system. The unpleasant kernel memory corruption that results from overflowing it is something that is to be avoided at all costs. But the stack is allocated for each process and thread in the system, so those who are looking to reduce memory usage target the 8K stack used by default on x86. In addition, an 8K stack requires two physically contiguous pages (an "order 1" allocation) which can be difficult to satisfy on a running system due to fragmentation. Linux has had optional support for 4K stacks for nearly four years now, with Fedora and RHEL enabling it on the kernels they ship, but a recent patch to make it the default for x86 has raised some eyebrows. Andrew Morton sees it as bypassing the normal patch submission process:
This patch will cause kernels to crash.
It has no changelog which explains or justifies the alteration. afaict the patch was not posted to the mailing list and was not discussed or reviewed. It is not surprising that patch author Ingo Molnar sees things a little differently:
what mainline kernels crash and how will they crash? Fedora and other
distros have had 4K stacks enabled for years [ ... ]
and we've conducted tens of thousands of bootup tests with all sorts of
drivers and kernel options enabled and have yet to see a single crash
due to 4K stacks. So basically the kernel default just follows the
common distro default now. (distros and users can still disable it)
As described in an earlier LWN article, the main concerns about only providing 4K for the kernel stack are for complicated storage configurations or for people using NDISwrapper. There is fairly high disdain for the latter case—as it is done to load proprietary Windows drivers into the kernel—but it could lead to a pretty hideous failure in the former. Data corruption certainly seems like a possibility, but, regardless, a kernel crash is definitely not what an administrator wants to have to deal with. Arjan van de Ven summarized the current state, noting that NDISwrapper really requires 12K stacks, so having 8K only makes it less likely those kernels will crash. The stacking of multiple storage drivers (network filesystems, device mapper, RAID, etc.) is a bigger issue:
we need to know which they are, and then solve them, because even on x86-64
with 8k stacks they can be a problem (just because the stack frames are
bigger, although not quite double, there).
Proponents of default 4K stacks seem to be puzzled why there is objection to the change since there have been no problems with Red Hat kernels. But Andi Kleen notes:
One way they do that is by marking significant parts of the kernel
unsupported. I don't think that's an option for mainline.
The xfs filesystem, which is not supported in RHEL or Fedora, can potentially use a great deal of stack. This leads some kernel hackers to worry that a complicated configuration that uses it, an "nfs+xfs+md+scsi writeback" configuration as Eric Sandeen puts it, could overflow. Work is already proceeding to reduce the xfs stack usage, but it clearly is a problem that xfs hackers have seen. David Chinner responds to a question about stack overflows:
We see them regularly enough on x86 to know that the first question
to any strange crash is "are you using 4k stacks?". In comparison,
I have never heard of a single stack overflow on x86_64....
It would seem premature to make 4K stacks the default. There is good reason to believe that folks using xfs could run into problems. But there is a larger issue, one that Morton brought up in his initial message, then reiterated later in the thread:
Anyway. We should be having this sort of discussion _before_ a patch
gets merged, no?
The memory savings can be significant, especially in the embedded world. Coupled with the elimination of order 1 allocations each time a process gets created, there is good reason to keep working toward 4K stacks by default. As of this writing, the default remains for 4K stacks in Linus's tree, but that could change before long.
ELC: Morton and Saxena on working with the kernel community In many ways, Andrew Morton's keynote set the tone for this year's Embedded Linux Conference (ELC) by describing the ways that embedded companies and developers can work with the kernel community in a way that will be "mutually beneficial". Morton provided reasons, from a purely economic standpoint, why it makes sense for companies to get their code into the mainline kernel. He also provided concrete suggestions on how to make that happen. The theme of the conference seemed to be "working with the community" and Morton's speech provided an excellent example of how and why to do just that. Conference organizer Tim Bird introduced the keynote as "the main event" for ELC, noting that he often thought of Morton as "kind of like the adult in the room" on linux-kernel. Readers of that mailing list tend to get the impression that there's more than one of him around because of all that he does. He also noted that it was surprising to some that Morton has an embedded Linux background—from his work at Digeo. Morton believes that embedded development is underrepresented in kernel.org work relative to its economic importance. This is caused by a number of factors, not least the financial constraints under which much embedded development is done. An exceptional case is the chip and board manufacturers who have a big interest in seeing Linux run well on their hardware so that they can attract more customers. But even those do not contribute as much as he would like to see to kernel development. An effect of this underrepresentation is a risk that it will tilt kernel development more toward the server and desktop. The kernel team is already accused of being server-centric, and there is some truth to that, "but not as much as one might think". Kernel hackers do care about the desktop as well as embedded devices, but without an advocate for embedded concerns, sometimes things get missed. Something Morton would like to see is a single full-time "embedded maintainer". That person would serve as the advocate for embedded concerns, ensuring that they didn't get overlooked in the process. An embedded maintainer could make a significant impact for embedded development. Not all kernel contributions need to be code, he said. There is a need just to hear the problems that are being faced by the embedded community along with lists of things that are missing. "Senior, sophisticated people" are needed to help prioritize the features that are being considered as well. Morton often finds out things he didn't know at conferences, things that he should have known about much earlier: "That's bad!" Morton is trying to incite the embedded community to interact with the kernel hackers more on linux-kernel. He said that a great way to get the attention of the team is to come onto the mailing list and make them look bad. Unfavorable comparisons to other systems or earlier kernels, for example, especially when backed up with numbers, are noticed quickly. He said that it is important to remember that the person who makes the most noise gets the most attention. One of the areas that he is most concerned about is the practice of "patch hoarding"—holding on to kernel changes as patches without submitting them upstream to the kernel hackers. It is hopefully only due to a lack of resources, but he has heard that some are doing it to try and gain a competitive advantage. This is simply wrong, he said, companies have a "moral if not legal obligation" to submit those patches. [PULL QUOTE: The code will be better because of the review done by the kernel hackers; once it is done, the maintenance cost falls to near zero as well. He also touted the competitive advantage, noting that getting your code merged means that you have won—competing proposals won't get in. END QUOTE]There are many good reasons for getting code merged upstream that Morton outlined. The code will be better because of the review done by the kernel hackers; once it is done, the maintenance cost falls to near zero as well. He also touted the competitive advantage, noting that getting your code merged means that you have won—competing proposals won't get in. Being the first to merge a feature can make it easier on yourself and harder on your competition. There are downsides to getting your code upstream as well. Most of those stem from not getting code out there early enough for review. The kernel developers can ask for significant changes to the code especially in the area of user space interfaces. If a company already has lots of code using the new feature and/or interface, it could be very disruptive; "sorry, there's no real fix for that except getting your code out early enough". Another downside that companies may run into is with competitors being brought into the process. Morton and other kernel hackers will try to find others who might have a stake in a new feature to get them involved so that everybody's needs are taken into account. This can blunt the "win" of getting your feature merged. Some are also concerned that competitors will get access to the code once it has been submitted; "tough luck" Morton said, everything in the kernel is GPL. Morton had specific suggestions for choosing a kernel version to use for an embedded project. 2.6.24 is not a lot better than 2.4.18 for embedded use, but it has one important feature: the kernel team will be interested in bugs in the current kernel. He suggests starting with the current kernel, upgrading it while development proceeds, freezing it only when it is time to ship the product. He also suggests that a company create an internal kernel team with one or two people who are the interface to linux-kernel. This will help with name recognition on the mailing list, which will in turn get patches submitted more attention. Over time, by participating and reviewing others' code, the interface people will build up "brownie points" that will allow them to call in favors to get their code reviewed, or to help smooth the path for inclusion. The kernel.org developers appear to give free support, generally very good support, Morton said, but it is not truly free. Kernel hackers do it as a "mutually beneficial transaction"; they don't do it to make more money for your company, they do it to make the kernel better. Morton is definitely a big part of that, inviting people to email him, especially if "five minutes of my time can save months of yours". The decision about when to merge a new feature is hard for some to understand. Many consider Linux a dictatorship, which is incorrect, it is instead "a democracy that doesn't vote". The merge decision is made on the model of the "rule of law" with kernel hackers playing the role of judges. Unfortunately, there are few written rules. Some of the factors that go into his decision about a particular feature are its maintainability, whether there will be an ongoing maintenance team, as well as the general usefulness of the feature. Depending on the size of the feature, an ongoing maintenance team can be the deciding factor. It is not so important for a driver, but a new architecture, for example, needs ongoing maintenance that can only be done by people with knowledge of and access to the hardware. MontaVista kernel hacker, Deepak Saxena, gave a presentation entitled "Appropriate Community Practices: Social and Technical Advice" later in the conference that mirrored many of Morton's points. He showed some examples of hardware vendors making bad decisions that got shot down by the kernel developers, mostly because they didn't "release early and release often". There is a dangerous attitude that "it's Linux, it's open source, I can do anything I want" which is true, but won't get you far with the community. Saxena has high regard for the benefits of working with the system: if your competitor is active in the community, they are getting an advantage that you aren't. Like Morton, he believes that some members of the development team need to get involved in kernel.org activities. "The community is an extension of your team, your team is an extension of the community." He also has specific advice for hardware vendors: avoid abstraction layers, recognize that your hardware is not unique, and think beyond the reference board implementation. Generalizing your code so that others can use it will make it much more acceptable, as will talking with the developers responsible for the subsystems you are touching. Abstraction layers may be helpful for hardware vendors trying to support multiple operating systems, but they make it difficult for the kernel hackers to understand and maintain the code. The kernel.org folks are not interested in finding and fixing bugs in an abstraction layer. He also points out additional benefits of getting code merged. Once it is in the kernel, the company's team will no longer have to keep up with kernel releases, updating their patches to follow the latest changes. The code will still need to be maintained, but day-to-day changes will be handled by the kernel.org folks. An additional benefit is that the code will be enhanced by various efforts to automatically find bugs in mainline kernel code with tools like lockdep. It is clear that the kernel hackers are making a big effort to not only get code from the embedded folks, but also some of their expertise. There are various outreach efforts to try and get more people involved in the Linux development process; these two talks are certainly a part of that. By making it clear that there are benefits to both parties, they hope to make an argument that will reach up from engineering to management resulting in a better kernel for all.
Integrating and Validating dynticks and Preemptable RCU IntroductionRead-copy update (RCU) is a synchronization mechanism that was added to the Linux kernel in October of 2002. RCU is most frequently described as a replacement for reader-writer locking, but it has also been used in a number of other ways. RCU is notable in that RCU readers do not directly synchronize with RCU updaters, which makes RCU read paths extremely fast, and also permits RCU readers to accomplish useful work even when running concurrently with RCU updaters. In early 2008, a preemptable variant of RCU was accepted into mainline Linux in support of real-time workloads, a variant similar to the RCU implementations in the -rt patchset since August 2005. Preemptable RCU is needed for real-time workloads because older RCU implementations disable preemption across RCU read-side critical sections, resulting in excessive real-time latencies. However, one disadvantage of the -rt implementation was that each grace period required work to be done on each CPU, even if that CPU is in a low-power “dynticks-idle” state, and thus incapable of executing RCU read-side critical sections. The idea behind the dynticks-idle state is that idle CPUs should be physically powered down in order to conserve energy. In short, preemptable RCU can disable a valuable energy-conservation feature of recent Linux kernels. Although Josh Triplett and Paul McKenney had discussed some approaches for allowing CPUs to remain in low-power state throughout an RCU grace period (thus preserving the Linux kernel's ability to conserve energy), matters did not come to a head until Steve Rostedt integrated a new dyntick implementation with preemptable RCU in the -rt patchset. This combination caused one of Steve's systems to hang on boot, so in
October, Paul coded up a dynticks-friendly modification to preemptable RCU's
grace-period processing.
Steve coded up Paul reviewed the code repeatedly from October 2007 to February 2008, and almost always found at least one bug. In one case, Paul even coded and tested a fix before realizing that the bug was illusory, but in all cases, the “bug” was in fact illusory. Near the end of February, Paul grew tired of this game. He therefore decided to enlist the aid of Promela and spin, as described in the LWN article Using Promela and Spin to verify parallel algorithms. This article presents a series of seven increasingly realistic Promela models, the last of which passes, consuming about 40GB of main memory for the state space.
Quick Quiz 1:
Yeah, that's great!!!
Now, just what am I supposed to do if I don't happen to have a machine with
40GB of main memory???
More important, Promela and Spin did find a very subtle bug for me!!! This article is organized as follows: These sections are followed by conclusions and answers to the Quick Quizzes. Introduction to Preemptable RCU and dynticksThe per-CPU
Preemptable RCU's grace-period machinery samples the value of
the The following three sections give an overview of the task
interface, the interrupt/NMI interface, and the use of
the Task InterfaceWhen a given CPU enters dynticks-idle mode because it has no more
tasks to run, it invokes 1 static inline void rcu_enter_nohz(void)
2 {
3 mb();
4 __get_cpu_var(dynticks_progress_counter)++;
5 WARN_ON(__get_cpu_var(dynticks_progress_counter) & 0x1);
6 }
This function simply increments Similarly, when a CPU that is in dynticks-idle mode prepares to
start executing a newly runnable task, it invokes
1 static inline void rcu_exit_nohz(void)
2 {
3 __get_cpu_var(dynticks_progress_counter)++;
4 mb();
5 WARN_ON(!(__get_cpu_var(dynticks_progress_counter) & 0x1));
6 }
This function again increments The Interrupt InterfaceThe Interrupt entry is handled by the 1 void rcu_irq_enter(void)
2 {
3 int cpu = smp_processor_id();
4
5 if (per_cpu(rcu_update_flag, cpu))
6 per_cpu(rcu_update_flag, cpu)++;
7 if (!in_interrupt() &&
8 (per_cpu(dynticks_progress_counter, cpu) & 0x1) == 0) {
9 per_cpu(dynticks_progress_counter, cpu)++;
10 smp_mb();
11 per_cpu(rcu_update_flag, cpu)++;
12 }
13 }
Quick Quiz 2:
Why not simply increment
Line 3 fetches the current CPU's number, while lines 4 and 5
increment the rcu_update_flag, and then only
increment dynticks_progress_counter if the old value
of rcu_update_flag was zero???
Quick Quiz 3:
But if line 7 finds that we are the outermost interrupt, wouldn't
we always need to increment rcu_update_flag nesting counter if it
is already non-zero.
Lines 6 and 7 check to see whether we are the outermost level of
interrupt, and, if so, whether dynticks_progress_counter
needs to be incremented.
If so, line 9 increments dynticks_progress_counter,
line 10 executes a memory barrier, and line 11 increments
rcu_update_flag.
As with rcu_exit_nohz(), the memory barrier ensures that
any other CPU that sees the effects of an RCU read-side critical section
in the interrupt handler (following the rcu_irq_enter()
invocation) will also see the increment of
dynticks_progress_counter.
Interrupt entry is handled similarly by
1 void rcu_irq_exit(void)
2 {
3 int cpu = smp_processor_id();
4
5 if (per_cpu(rcu_update_flag, cpu)) {
6 if (--per_cpu(rcu_update_flag, cpu))
7 return;
8 WARN_ON(in_interrupt());
9 smp_mb();
10 per_cpu(dynticks_progress_counter, cpu)++;
11 WARN_ON(per_cpu(dynticks_progress_counter, cpu) & 0x1);
12 }
13 }
Line 3 fetches the current CPU's number, as before.
Line 5 checks to see if the These two sections have described how the
Grace-Period InterfaceOf the four preemptable RCU grace-period states shown below
(taken from
The Design of Preemptable Read-Copy Update),
only the Of course, if a given CPU is in dynticks-idle state, we shouldn't
wait for it.
Therefore, just before entering one of these two states,
the preceding state takes a snapshot of each CPU's
1 static void dyntick_save_progress_counter(int cpu)
2 {
3 per_cpu(rcu_dyntick_snapshot, cpu) =
4 per_cpu(dynticks_progress_counter, cpu);
5 }
The 1 static inline int
2 rcu_try_flip_waitack_needed(int cpu)
3 {
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
12 if ((curr - snap) > 2 || (snap & 0x1) == 0)
13 return 0;
14 return 1;
15 }
Lines 7 and 8 pick up current and snapshot versions of
For its part, the 1 static inline int
2 rcu_try_flip_waitmb_needed(int cpu)
3 {
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
12 if (curr != snap)
13 return 0;
14 return 1;
15 }
This is quite similar to
Quick Quiz 4:
Can you spot any bugs in any of the code in this section?
We now have seen all the code involved in the interface between RCU and the dynticks-idle state. The next section builds up the Promela model used to validate this code. Validating Preemptable RCU and dynticksThis section develops a Promela model for the interface between dynticks and RCU step by step, with each of the following sections illustrating one step, starting with the process-level code, adding assertions, interrupts, and finally NMIs. Basic ModelThis section translates the process-level dynticks entry/exit
code and the grace-period processing into
Promela.
We start with 1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5
6 do
7 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
8 :: i < MAX_DYNTICK_LOOP_NOHZ ->
9 tmp = dynticks_progress_counter;
10 atomic {
11 dynticks_progress_counter = tmp + 1;
12 assert((dynticks_progress_counter & 1) == 1);
13 }
14 tmp = dynticks_progress_counter;
15 atomic {
16 dynticks_progress_counter = tmp + 1;
17 assert((dynticks_progress_counter & 1) == 0);
18 }
19 i++;
20 od;
21 }
Lines 6 and 20 define a loop.
Line 7 exits the loop once the loop counter
Quick Quiz 5:
Why isn't the memory barrier in
rcu_exit_nohz()
and rcu_enter_nohz() modeled in Promela?
Quick Quiz 6:
Isn't it a bit strange to model Each pass through the loop therefore models a CPU exiting dynticks-idle mode (for example, starting to execute a task), then re-entering dynticks-idle mode (for example, that same task blocking). The next step is to model the interface to RCU's grace-period
processing.
For this, we need to model
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5
6 atomic {
7 printf("MAX_DYNTICK_LOOP_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
8 snap = dynticks_progress_counter;
9 }
10 do
11 :: 1 ->
12 atomic {
13 curr = dynticks_progress_counter;
14 if
15 :: (curr == snap) && ((curr & 1) == 0) ->
16 break;
17 :: (curr - snap) > 2 || (snap & 1) == 0 ->
18 break;
19 :: 1 -> skip;
20 fi;
21 }
22 od;
23 snap = dynticks_progress_counter;
24 do
25 :: 1 ->
26 atomic {
27 curr = dynticks_progress_counter;
28 if
29 :: (curr == snap) && ((curr & 1) == 0) ->
30 break;
31 :: (curr != snap) ->
32 break;
33 :: 1 -> skip;
34 fi;
35 }
36 od;
37 }
Lines 6-9 print out the loop limit (but only into the .trail file
in case of error) and model a line of code
from Lines 10-22 model the relevant code in
Line 23 models a line from Finally, lines 24-36 model the relevant code in
Quick Quiz 7:
Wait a minute!
In the Linux kernel, both
dynticks_progress_counter and
rcu_dyntick_snapshot are per-CPU variables.
So why are they instead being modeled as single global variables?
The resulting model, when run with the runspin.sh script, generates 691 states and passes without errors, which is not at all surprising given that it completely lacks the assertions that could find failures. The next section therefore adds safety assertions. Validating SafetyA safe RCU implementation must never permit a grace period to
complete before the completion of any RCU readers that started
before the start of the grace period.
This is modeled by a 1 #define GP_IDLE 0 2 #define GP_WAITING 1 3 #define GP_DONE 2 4 byte grace_period_state = GP_DONE; The 1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5
6 grace_period_state = GP_IDLE;
7 atomic {
8 printf("MAX_DYNTICK_LOOP_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
9 snap = dynticks_progress_counter;
10 grace_period_state = GP_WAITING;
11 }
12 do
13 :: 1 ->
14 atomic {
15 curr = dynticks_progress_counter;
16 if
17 :: (curr == snap) && ((curr & 1) == 0) ->
18 break;
19 :: (curr - snap) > 2 || (snap & 1) == 0 ->
20 break;
21 :: 1 -> skip;
22 fi;
23 }
24 od;
25 grace_period_state = GP_DONE;
26 grace_period_state = GP_IDLE;
27 atomic {
28 snap = dynticks_progress_counter;
29 grace_period_state = GP_WAITING;
30 }
31 do
32 :: 1 ->
33 atomic {
34 curr = dynticks_progress_counter;
35 if
36 :: (curr == snap) && ((curr & 1) == 0) ->
37 break;
38 :: (curr != snap) ->
39 break;
40 :: 1 -> skip;
41 fi;
42 }
43 od;
44 grace_period_state = GP_DONE;
45 }
Quick Quiz 8:
Given there are a pair of back-to-back changes to
Lines 6, 10, 25, 26, 29, and 44 update this variable (combining
atomically with algorithmic operations where feasible) to
allow the grace_period_state on lines 25 and 26,
how can we be sure that line 25's changes won't be lost?
dyntick_nohz() process to validate the basic
RCU safety property.
The form of this validation is to assert that the value of the
grace_period_state variable cannot jump from
GP_IDLE to GP_DONE during a time period
over which RCU readers could plausibly persist.
The 1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
10 tmp = dynticks_progress_counter;
11 atomic {
12 dynticks_progress_counter = tmp + 1;
13 old_gp_idle = (grace_period_state == GP_IDLE);
14 assert((dynticks_progress_counter & 1) == 1);
15 }
16 atomic {
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle || grace_period_state != GP_DONE);
19 }
20 atomic {
21 dynticks_progress_counter = tmp + 1;
22 assert((dynticks_progress_counter & 1) == 0);
23 }
24 i++;
25 od;
26 }
Line 13 sets a new The resulting model, when run with the runspin.sh script, generates 964 states and passes without errors, which is reassuring. That said, although safety is critically important, it is also quite important to avoid indefinitely stalling grace periods. The next section therefore covers validating liveness. Validating LivenessAlthough liveness can be difficult to prove, there is a simple
trick that applies here.
The first step is to make 1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
10 tmp = dynticks_progress_counter;
11 atomic {
12 dynticks_progress_counter = tmp + 1;
13 old_gp_idle = (grace_period_state == GP_IDLE);
14 assert((dynticks_progress_counter & 1) == 1);
15 }
16 atomic {
17 tmp = dynticks_progress_counter;
18 assert(!old_gp_idle || grace_period_state != GP_DONE);
19 }
20 atomic {
21 dynticks_progress_counter = tmp + 1;
22 assert((dynticks_progress_counter & 1) == 0);
23 }
24 i++;
25 od;
26 dyntick_nohz_done = 1;
27 }
With this variable in place, we can add assertions to
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5 bit shouldexit;
6
7 grace_period_state = GP_IDLE;
8 atomic {
9 printf("MAX_DYNTICK_LOOP_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 shouldexit = 0;
11 snap = dynticks_progress_counter;
12 grace_period_state = GP_WAITING;
13 }
14 do
15 :: 1 ->
16 atomic {
17 assert(!shouldexit);
18 shouldexit = dyntick_nohz_done;
19 curr = dynticks_progress_counter;
20 if
21 :: (curr == snap) && ((curr & 1) == 0) ->
22 break;
23 :: (curr - snap) > 2 || (snap & 1) == 0 ->
24 break;
25 :: else -> skip;
26 fi;
27 }
28 od;
29 grace_period_state = GP_DONE;
30 grace_period_state = GP_IDLE;
31 atomic {
32 shouldexit = 0;
33 snap = dynticks_progress_counter;
34 grace_period_state = GP_WAITING;
35 }
36 do
37 :: 1 ->
38 atomic {
39 assert(!shouldexit);
40 shouldexit = dyntick_nohz_done;
41 curr = dynticks_progress_counter;
42 if
43 :: (curr == snap) && ((curr & 1) == 0) ->
44 break;
45 :: (curr != snap) ->
46 break;
47 :: else -> skip;
48 fi;
49 }
50 od;
51 grace_period_state = GP_DONE;
52 }
We have added the Lines 32, 39, and 40 operate in a similar manner for the second (memory-barrier) loop. However, running this
model
results in failure, as line 23 is checking that the wrong variable
is even.
Upon failure, We see that the So one of these two conditions has to be incorrect.
Referring to the comment block in /* * If the CPU remained in dynticks mode for the entire time * and didn't take any interrupts, NMIs, SMIs, or whatever, * then it cannot be in the middle of an rcu_read_lock(), so * the next rcu_read_lock() it executes must use the new value * of the counter. So we can safely pretend that this CPU * already acknowledged the counter. */ The first condition does match this, because if /* * If the CPU passed through or entered a dynticks idle phase with * no active irq handlers, then, as above, we can safely pretend * that this CPU already acknowledged the counter. */ The first part of the condition is correct, because if The corrected C code is as follows: 1 static inline int
2 rcu_try_flip_waitack_needed(int cpu)
3 {
4 long curr;
5 long snap;
6
7 curr = per_cpu(dynticks_progress_counter, cpu);
8 snap = per_cpu(rcu_dyntick_snapshot, cpu);
9 smp_mb(); /* force ordering with cpu entering/leaving dynticks. */
10 if ((curr == snap) && ((curr & 0x1) == 0))
11 return 0;
12 if ((curr - snap) > 2 || (curr & 0x1) == 0)
13 return 0;
14 return 1;
15 }
Making the corresponding correction in the
model
results in a correct validation with 661 states that passes without
errors.
However, it is worth noting that the first version of the liveness
validation failed to catch this bug, due to a bug in the liveness
validation itself.
This liveness-validation bug was located by inserting an infinite
loop in the We have now successfully validated both safety and liveness conditions, but only for processes running and blocking. We also need to handle interrupts, a task taken up in the next section. InterruptsThere are a couple of ways to model interrupts in Promela:
A bit of thought indicated that the second approach would have a
smaller state space, though it requires that the interrupt handler
somehow run atomically with respect to the Fortunately, it turns out that Promela permits you to branch
out of atomic statements.
This trick allows us to have the interrupt handler set a flag, and
recode 1 #define EXECUTE_MAINLINE(label, stmt) \
2 label: skip; \
3 atomic { \
4 if \
5 :: in_dyntick_irq -> goto label; \
6 :: else -> stmt; \
7 fi; \
8 } \
One might use this macro as follows:
EXECUTE_MAINLINE(stmt1, tmp = dynticks_progress_counter)
Line 2 of the macro creates the specified statement label.
Lines 3-8 are an atomic block that tests the Validating Interrupt HandlersThe first step is to convert 1 proctype dyntick_nohz()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NOHZ -> break;
9 :: i < MAX_DYNTICK_LOOP_NOHZ ->
10 EXECUTE_MAINLINE(stmt1, tmp = dynticks_progress_counter)
11 EXECUTE_MAINLINE(stmt2,
12 dynticks_progress_counter = tmp + 1;
13 old_gp_idle = (grace_period_state == GP_IDLE);
14 assert((dynticks_progress_counter & 1) == 1))
15 EXECUTE_MAINLINE(stmt3,
16 tmp = dynticks_progress_counter;
17 assert(!old_gp_idle || grace_period_state != GP_DONE))
18 EXECUTE_MAINLINE(stmt4,
19 dynticks_progress_counter = tmp + 1;
20 assert((dynticks_progress_counter & 1) == 0))
21 i++;
22 od;
23 dyntick_nohz_done = 1;
24 }
Quick Quiz 9:
But what would you do if you needed the statements in a single
It is important to note that when a group of statements is passed
to EXECUTE_MAINLINE() group to execute non-atomically?
Quick Quiz 10:
But what if the EXECUTE_MAINLINE(), as in lines 11-14, all
statements in that group execute atomically.
The next step is to write a 1 proctype dyntick_irq()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_IRQ -> break;
9 :: i < MAX_DYNTICK_LOOP_IRQ ->
10 in_dyntick_irq = 1;
11 if
12 :: rcu_update_flag > 0 ->
13 tmp = rcu_update_flag;
14 rcu_update_flag = tmp + 1;
15 :: else -> skip;
16 fi;
17 if
18 :: !in_interrupt && (dynticks_progress_counter & 1) == 0 ->
19 tmp = dynticks_progress_counter;
20 dynticks_progress_counter = tmp + 1;
21 tmp = rcu_update_flag;
22 rcu_update_flag = tmp + 1;
23 :: else -> skip;
24 fi;
25 tmp = in_interrupt;
26 in_interrupt = tmp + 1;
27 old_gp_idle = (grace_period_state == GP_IDLE);
28 assert(!old_gp_idle || grace_period_state != GP_DONE);
29 tmp = in_interrupt;
30 in_interrupt = tmp - 1;
31 if
32 :: rcu_update_flag != 0 ->
33 tmp = rcu_update_flag;
34 rcu_update_flag = tmp - 1;
35 if
36 :: rcu_update_flag == 0 ->
37 tmp = dynticks_progress_counter;
38 dynticks_progress_counter = tmp + 1;
39 :: else -> skip;
40 fi;
41 :: else -> skip;
42 fi;
43 atomic {
44 in_dyntick_irq = 0;
45 i++;
46 }
47 od;
48 dyntick_irq_done = 1;
49 }
Quick Quiz 11:
Why are lines 44 and 45 (the
The loop from line 7-47 models up to in_dyntick_irq = 0;
and the i++;) executed atomically?
Quick Quiz 12:
What property of interrupts is this MAX_DYNTICK_LOOP_IRQ
interrupts, with lines 8 and 9 forming the loop condition and line 45
incrementing the control variable.
Line 10 tells dyntick_nohz() that an interrupt handler
is running, and line 44 tells dyntick_nohz() that this
handler has completed.
Line 48 is used for liveness validation, much as is the corresponding
line of dyntick_nohz().
Lines 11-24 model 1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5 bit shouldexit;
6
7 grace_period_state = GP_IDLE;
8 atomic {
9 printf("MAX_DYNTICK_LOOP_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 printf("MAX_DYNTICK_LOOP_IRQ = %d\n", MAX_DYNTICK_LOOP_IRQ);
11 shouldexit = 0;
12 snap = dynticks_progress_counter;
13 grace_period_state = GP_WAITING;
14 }
15 do
16 :: 1 ->
17 atomic {
18 assert(!shouldexit);
19 shouldexit = dyntick_nohz_done && dyntick_irq_done;
20 curr = dynticks_progress_counter;
21 if
22 :: (curr == snap) && ((curr & 1) == 0) ->
23 break;
24 :: (curr - snap) > 2 || (curr & 1) == 0 ->
25 break;
26 :: else -> skip;
27 fi;
28 }
29 od;
30 grace_period_state = GP_DONE;
31 grace_period_state = GP_IDLE;
32 atomic {
33 shouldexit = 0;
34 snap = dynticks_progress_counter;
35 grace_period_state = GP_WAITING;
36 }
37 do
38 :: 1 ->
39 atomic {
40 assert(!shouldexit);
41 shouldexit = dyntick_nohz_done && dyntick_irq_done;
42 curr = dynticks_progress_counter;
43 if
44 :: (curr == snap) && ((curr & 1) == 0) ->
45 break;
46 :: (curr != snap) ->
47 break;
48 :: else -> skip;
49 fi;
50 }
51 od;
52 grace_period_state = GP_DONE;
53 }
The implementation of This model results in a correct validation with roughly half a million states, passing without errors. However, this version of the model does not handle nested interrupts. This topic is taken up in the next section. Validating Nested Interrupt HandlersNested interrupt handlers may be modeled by splitting the body of
the loop in 1 proctype dyntick_irq()
2 {
3 byte tmp;
4 byte i = 0;
5 byte j = 0;
6 bit old_gp_idle;
7 bit outermost;
8
9 do
10 :: i >= MAX_DYNTICK_LOOP_IRQ && j >= MAX_DYNTICK_LOOP_IRQ -> break;
11 :: i < MAX_DYNTICK_LOOP_IRQ ->
12 atomic {
13 outermost = (in_dyntick_irq == 0);
14 in_dyntick_irq = 1;
15 }
16 if
17 :: rcu_update_flag > 0 ->
18 tmp = rcu_update_flag;
19 rcu_update_flag = tmp + 1;
20 :: else -> skip;
21 fi;
22 if
23 :: !in_interrupt && (dynticks_progress_counter & 1) == 0 ->
24 tmp = dynticks_progress_counter;
25 dynticks_progress_counter = tmp + 1;
26 tmp = rcu_update_flag;
27 rcu_update_flag = tmp + 1;
28 :: else -> skip;
29 fi;
30 tmp = in_interrupt;
31 in_interrupt = tmp + 1;
32 atomic {
33 if
34 :: outermost ->
35 old_gp_idle = (grace_period_state == GP_IDLE);
36 :: else -> skip;
37 fi;
38 }
39 i++;
40 :: j < i ->
41 atomic {
42 if
43 :: j + 1 == i ->
44 assert(!old_gp_idle ||
45 grace_period_state != GP_DONE);
46 :: else -> skip;
47 fi;
48 }
49 tmp = in_interrupt;
50 in_interrupt = tmp - 1;
51 if
52 :: rcu_update_flag != 0 ->
53 tmp = rcu_update_flag;
54 rcu_update_flag = tmp - 1;
55 if
56 :: rcu_update_flag == 0 ->
57 tmp = dynticks_progress_counter;
58 dynticks_progress_counter = tmp + 1;
59 :: else -> skip;
60 fi;
61 :: else -> skip;
62 fi;
63 atomic {
64 j++;
65 in_dyntick_irq = (i != j);
66 }
67 od;
68 dyntick_irq_done = 1;
69 }
This is similar to the earlier Line 40 has the do-loop conditional for interrupt-exit modeling:
as long as we have exited fewer interrupts than we have entered, it is
legal to exit another interrupt.
Lines 41-48 check the safety criterion, but only if we are exiting
from the outermost interrupt level.
Finally, lines 63-66 increment the interrupt-exit count This model results in a correct validation with a bit more than half a million states, passing without errors. However, this version of the model does not handle NMIs, which are taken up in the nest section. Validating NMI HandlersWe take the same general approach for NMIs as we do for interrupts,
keeping in mind that NMIs do not nest.
This results in a 1 proctype dyntick_nmi()
2 {
3 byte tmp;
4 byte i = 0;
5 bit old_gp_idle;
6
7 do
8 :: i >= MAX_DYNTICK_LOOP_NMI -> break;
9 :: i < MAX_DYNTICK_LOOP_NMI ->
10 in_dyntick_nmi = 1;
11 if
12 :: rcu_update_flag > 0 ->
13 tmp = rcu_update_flag;
14 rcu_update_flag = tmp + 1;
15 :: else -> skip;
16 fi;
17 if
18 :: !in_interrupt && (dynticks_progress_counter & 1) == 0 ->
19 tmp = dynticks_progress_counter;
20 dynticks_progress_counter = tmp + 1;
21 tmp = rcu_update_flag;
22 rcu_update_flag = tmp + 1;
23 :: else -> skip;
24 fi;
25 tmp = in_interrupt;
26 in_interrupt = tmp + 1;
27 old_gp_idle = (grace_period_state == GP_IDLE);
28 assert(!old_gp_idle || grace_period_state != GP_DONE);
29 tmp = in_interrupt;
30 in_interrupt = tmp - 1;
31 if
32 :: rcu_update_flag != 0 ->
33 tmp = rcu_update_flag;
34 rcu_update_flag = tmp - 1;
35 if
36 :: rcu_update_flag == 0 ->
37 tmp = dynticks_progress_counter;
38 dynticks_progress_counter = tmp + 1;
39 :: else -> skip;
40 fi;
41 :: else -> skip;
42 fi;
43 atomic {
44 i++;
45 in_dyntick_nmi = 0;
46 }
47 od;
48 dyntick_nmi_done = 1;
49 }
Of course, the fact that we have NMIs requires adjustments in
the other components.
For example, the
1 #define EXECUTE_MAINLINE(label, stmt) \
2 label: skip; \
3 atomic { \
4 if \
5 :: in_dyntick_irq || in_dyntick_nmi -> goto label; \
6 :: else -> stmt; \
7 fi; \
8 } \
We will also need to introduce an
1 #define EXECUTE_IRQ(label, stmt) \
2 label: skip; \
3 atomic { \
4 if \
5 :: in_dyntick_nmi -> goto label; \
6 :: else -> stmt; \
7 fi; \
8 } \
It is further necessary to convert
1 proctype dyntick_irq()
2 {
3 byte tmp;
4 byte i = 0;
5 byte j = 0;
6 bit old_gp_idle;
7 bit outermost;
8
9 do
10 :: i >= MAX_DYNTICK_LOOP_IRQ && j >= MAX_DYNTICK_LOOP_IRQ -> break;
11 :: i < MAX_DYNTICK_LOOP_IRQ ->
12 atomic {
13 outermost = (in_dyntick_irq == 0);
14 in_dyntick_irq = 1;
15 }
16 stmt1: skip;
17 atomic {
18 if
19 :: in_dyntick_nmi -> goto stmt1;
20 :: !in_dyntick_nmi && rcu_update_flag ->
21 goto stmt1_then;
22 :: else -> goto stmt1_else;
23 fi;
24 }
25 stmt1_then: skip;
26 EXECUTE_IRQ(stmt1_1, tmp = rcu_update_flag)
27 EXECUTE_IRQ(stmt1_2, rcu_update_flag = tmp + 1)
28 stmt1_else: skip;
29 stmt2: skip; atomic {
30 if
31 :: in_dyntick_nmi -> goto stmt2;
32 :: !in_dyntick_nmi &&
33 !in_interrupt &&
34 (dynticks_progress_counter & 1) == 0 ->
35 goto stmt2_then;
36 :: else -> goto stmt2_else;
37 fi;
38 }
39 stmt2_then: skip;
40 EXECUTE_IRQ(stmt2_1, tmp = dynticks_progress_counter)
41 EXECUTE_IRQ(stmt2_2, dynticks_progress_counter = tmp + 1)
42 EXECUTE_IRQ(stmt2_3, tmp = rcu_update_flag)
43 EXECUTE_IRQ(stmt2_4, rcu_update_flag = tmp + 1)
44 stmt2_else: skip;
45 EXECUTE_IRQ(stmt3, tmp = in_interrupt)
46 EXECUTE_IRQ(stmt4, in_interrupt = tmp + 1)
47 stmt5: skip;
48 atomic {
49 if
50 :: in_dyntick_nmi -> goto stmt4;
51 :: !in_dyntick_nmi && outermost ->
52 old_gp_idle = (grace_period_state == GP_IDLE);
53 :: else -> skip;
54 fi;
55 }
56 i++;
57 :: j < i ->
58 stmt6: skip;
59 atomic {
60 if
61 :: in_dyntick_nmi -> goto stmt6;
62 :: !in_dyntick_nmi && j + 1 == i ->
63 assert(!old_gp_idle || grace_period_state != GP_DONE);
64 :: else -> skip;
65 fi;
66 }
67 EXECUTE_IRQ(stmt7, tmp = in_interrupt);
68 EXECUTE_IRQ(stmt8, in_interrupt = tmp - 1);
69
70 stmt9: skip;
71 atomic {
72 if
73 :: in_dyntick_nmi -> goto stmt9;
74 :: !in_dyntick_nmi && rcu_update_flag != 0 ->
75 goto stmt9_then;
76 :: else -> goto stmt9_else;
77 fi;
78 }
79 stmt9_then: skip;
80 EXECUTE_IRQ(stmt9_1, tmp = rcu_update_flag)
81 EXECUTE_IRQ(stmt9_2, rcu_update_flag = tmp - 1)
82 stmt9_3: skip;
83 atomic {
84 if
85 :: in_dyntick_nmi -> goto stmt9_3;
86 :: !in_dyntick_nmi && rcu_update_flag == 0 ->
87 goto stmt9_3_then;
88 :: else -> goto stmt9_3_else;
89 fi;
90 }
91 stmt9_3_then: skip;
92 EXECUTE_IRQ(stmt9_3_1, tmp = dynticks_progress_counter)
93 EXECUTE_IRQ(stmt9_3_2, dynticks_progress_counter = tmp + 1)
94 stmt9_3_else:
95 stmt9_else: skip;
96 atomic {
97 j++;
98 in_dyntick_irq = (i != j);
99 }
100 od;
101 dyntick_irq_done = 1;
102 }
Note that we have open-coded the “if” statements
(for example, lines 16-28).
In addition, statements that process strictly local state
(such as line 56) need not exclude Finally,
1 proctype grace_period()
2 {
3 byte curr;
4 byte snap;
5 bit shouldexit;
6
7 grace_period_state = GP_IDLE;
8 atomic {
9 printf("MAX_DYNTICK_LOOP_NOHZ = %d\n", MAX_DYNTICK_LOOP_NOHZ);
10 printf("MAX_DYNTICK_LOOP_IRQ = %d\n", MAX_DYNTICK_LOOP_IRQ);
11 printf("MAX_DYNTICK_LOOP_NMI = %d\n", MAX_DYNTICK_LOOP_NMI);
12 shouldexit = 0;
13 snap = dynticks_progress_counter;
14 grace_period_state = GP_WAITING;
15 }
16 do
17 :: 1 ->
18 atomic {
19 assert(!shouldexit);
20 shouldexit = dyntick_nohz_done &&
21 dyntick_irq_done &&
22 dyntick_nmi_done;
23 curr = dynticks_progress_counter;
24 if
25 :: (curr == snap) && ((curr & 1) == 0) ->
26 break;
27 :: (curr - snap) > 2 || (curr & 1) == 0 ->
28 break;
29 :: else -> skip;
30 fi;
31 }
32 od;
33 grace_period_state = GP_DONE;
34 grace_period_state = GP_IDLE;
35 atomic {
36 shouldexit = 0;
37 snap = dynticks_progress_counter;
38 grace_period_state = GP_WAITING;
39 }
40 do
41 :: 1 ->
42 atomic {
43 assert(!shouldexit);
44 shouldexit = dyntick_nohz_done &&
45 dyntick_irq_done &&
46 dyntick_nmi_done;
47 curr = dynticks_progress_counter;
48 if
49 :: (curr == snap) && ((curr & 1) == 0) ->
50 break;
51 :: (curr != snap) ->
52 break;
53 :: else -> skip;
54 fi;
55 }
56 od;
57 grace_period_state = GP_DONE;
58 }
We have added the
Quick Quiz 13:
Do you always write your code in this painfully incremental manner???
The model results in a correct validation with several hundred million states, passing without errors. ConclusionsThis effort provided some lessons (re)learned: Promela and spin can validate interrupt/NMI-handler interactions. Documenting code can help locate bugs. In this case, the documentation effort located this bug. Validate your code early, often, and up to the point of destruction. This effort located one subtle bug that might have been quite difficult to test or debug. Always validate your validation code. The usual way to do this is to insert a deliberate bug and verify that the validation code catches it. Of course, if the validation code fails to catch this bug, you may also need to verify the bug itself, and so on, recursing infinitely. However, if you find yourself in this position, getting a good night's sleep can be an extremely effective debugging technique. Finally, if AcknowledgmentsWe are indebted to Andrew Theurer, who maintains the large-memory machine that ran the full test. We all owe a debt of gratitude to Vara Prasad for his help in rendering this article human-readable. |
Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds
Powered by Rackspace Managed Hosting.