
Hierarchical RCU - state schematic

Posted Nov 6, 2008 18:49 UTC (Thu) by ds2horner (subscriber, #13438)
Parent article: Hierarchical RCU

In the state-machine schematic, "Mark CPU as being in Extended QS" from "CPU Offline" immediately returns to checking whether "all CPUs passed through QS".

However, the "Mark CPU as being in Extended QS" path for a CPU in dyntick-idle still gets a reschedule.

Is a dyntick-idle CPU actually a "hold out"?
I thought it was in Extended QS and no further checking/scheduling was required.

So is the arrow to "Send Resched IPI ..." incorrect, or should this be another Quick (or perhaps not-so-quick) Quiz?


Hierarchical RCU - state schematic

Posted Nov 6, 2008 19:49 UTC (Thu) by PaulMcKenney (subscriber, #9624)

The trick with that part of the diagram is that the code is actually dealing with groups of CPUs. So a given group of CPUs might have some that are in dyntick-idle state and others that have somehow avoided passing through a quiescent state, despite the fact that they are online and running. We send a reschedule IPI only to these latter CPUs, not to the dyntick-idle CPUs.

Now that you mention it, this might indeed be a good quick-quiz candidate. :-)

Hierarchical RCU - state schematic

Posted Nov 7, 2008 1:55 UTC (Fri) by ds2horner (subscriber, #13438)

I think the leveling confusion in the diagram starts with the exit from the "wait for QS" state.

"CPU passes through QS" and "CPU offline" are considered singleton events, but "GP too long" is considered a bulk event, with all blocking CPUs being addressed.

This, and the explanation of how the check of the dyntick state is initiated, clarified for me (indeed, corrected a misconception I had) that the dyntick counters are not captured at the beginning of each GP, but only after a "timeout" (a reasonable optimization for a low-occurrence event).

Speaking of the timeout:
in "Detect a Too-Long Grace Period", the statement that the "record_gp_stall_check_time() function records the time and also a timestamp set three seconds into the future" seemed excessive. I believe jiffies was meant (not seconds), which would be consistent with the later reference: "A two-jiffies offset helps ensure that CPUs report on themselves when possible".

Each article clarifies details I missed before.

Hierarchical RCU - state schematic

Posted Nov 7, 2008 5:52 UTC (Fri) by PaulMcKenney (subscriber, #9624)

Glad it helped!

There are indeed two levels of tardiness. Reschedule IPIs are sent to hold-out CPUs after three jiffies which, in the absence of kernel bugs, should end the grace period. These reschedule IPIs are considered "normal" rather than "errors".

There is a separate "error" path enabled by CONFIG_RCU_CPU_STALL_DETECTOR that does in fact have a three-second timeout (recently upped to 10 seconds based on the fact that three-second stalls appear to be present in boot-up code). There is also a three-second-and-two-jiffies timeout as part of this "error" path so that CPUs will normally report on themselves rather than being reported on by others, given that self-reported stack traces are usually more reliable.

Hierarchical RCU - unification suggestion

Posted Nov 7, 2008 15:50 UTC (Fri) by ds2horner (subscriber, #13438)

Thank you again.

And now that I believe I understand the process adequately, I have a suggestion.

Why not unify the CPU hotplug and "dyntick suspend" status tracking by using the same counter as dyntick for both?

If CPU offline/online events advanced the counter (now renamed rcu_cpu_awake, with the LSB matching the CPU state), the even/odd state would satisfy both scenarios.

And I am not advocating this solely for the purpose of reducing the state variables and reducing code (if not complexity):
conceptually, an offline CPU (a powered-down engine which cannot receive interrupts) is equivalent to a dyntick-idle CPU (in a powered-down state) that receives no interrupts during the grace period.

I realize there are, of course, other reasons to track online/offline CPUs, and that the offline check is immediately available.

However, the unification would allow a simpler "no offline" option for this intended "dyntick replacement" for Classic RCU,
and a simpler "no NO_HZ" option that would be useful for virtual CPUs.

I know - "show me the code!".

OK, reading your post
"v6 scalable classic RCU implementation" ("Tue, 23 Sep 2008 16:53:40 -0700")
indicates the magnitude of code present to handle CPU-offline activity.
Concerns like moving outstanding RCU work from the forced-offline CPU to the current CPU make offlining distinct.

However, it appears to be performed (at least in this version of the code) just before sending the tardy CPUs the reschedule IPI. Which, back to the schematic, makes it part of the "GP too long" path?

And reading your further post "rcu-state" ("Mon, 27 Oct 2008 20:52:01 +0100"),
it looks like this version uses a unifying indicator, "rcu_cpumode", still with two distinct substates, kept separate from RCU_CPUMODE_PERIODIC, which requires notification each grace period.

How does one keep up?

Hierarchical RCU - unification suggestion

Posted Nov 8, 2008 1:03 UTC (Sat) by PaulMcKenney (subscriber, #9624)

Looks like I should have had yet another Quick Quiz on unifying dyntick-idle and offline detection.

First, it might well turn out to be the right thing to do. However, the challenges include the following:

  1. As you say,
  2. As you say, although CPUs are not allowed to go into dyntick-idle state while they have RCU callbacks pending, CPUs -are- allowed to go offline in this state. This means that the code to move their callbacks is still required. We cannot simply let them silently go offline.
  3. Present-day systems often run with NR_CPUS much larger than the actual number of CPUs, so the unified approach could waste time scanning CPUs that never will exist. (There are workarounds for this.)
  4. CPUs get onlined one at a time, so RCU needs to handle onlining.
  5. A busy system with offlined CPUs would always take three ticks to figure out that the offlined CPUs were never going to respond.
  6. Switching into and out of dyntick-idle state can happen extremely frequently, so we cannot treat a dyntick-idle CPU as if it was offline due to the high overhead of offlining CPUs.
Under normal circumstances, offline CPUs are not included in the bitmasks indicating which CPUs need to be waited on, so that normally RCU never waits on offline CPUs.

However, there are race conditions that can occur in the online and offline processes that can result in RCU believing that a given CPU is online when it is in fact offline. Therefore, if RCU sees that the grace period is extending longer than expected (jiffies, not seconds), it will check to see if some of the CPUs that it is waiting on are offline. This situation corrects itself after a few grace periods: RCU will get back in sync with which CPUs really are offline. So the offline-CPU checks invoked from force_quiescent_state() are only there to handle rare race conditions. Again, under normal circumstances, RCU never waits on offline CPUs.

At this point, when the code is still just a patch, and therefore subject to change, the only way I can see to keep up is to ask questions. Which you are doing. :-)
