A full task-isolation mode for the kernel
Some applications require guaranteed access to the CPU without even brief interruptions; realtime systems and high-bandwidth networking applications with user-space drivers can fall into that category. Linux currently provides some support for CPU isolation (moving everything but the critical task off of one or more CPUs), but it is an imperfect solution that is still subject to occasional interruptions. Work has been continuing in the community to improve the kernel's CPU-isolation capabilities, notably with improvements to the nohz (tickless) mode, but it is not finished yet. Recently, Alex Belits submitted a patch set (based on work by Chris Metcalf in 2015) that introduces a completely predictable environment for Linux applications, as long as they do not need any kernel services.
Nohz and task isolation
Currently, the nohz mode in Linux provides partial task isolation. It decreases the number of interrupts that a CPU receives; for example, the clock-tick interrupt can be disabled on CPUs that are running only a single task. However, nohz does not guarantee that there will be no interruptions; the running task can still be interrupted by page faults (careful design of the application can avoid those) or by delayed workqueues. The advantage of this mode is that tasks can run regular code, including system calls; the additional overhead is limited to the system-call entry and exit paths.
For some applications, the lack of absolute guarantees from nohz may cause problems. One example is high-performance, user-space network drivers that have only a small number of CPU cycles in which to handle each packet; for those, an interrupt and its handling may cause a significant delay and could use up the entire time available. Realtime operating systems (RTOSes) can provide the needed guarantees, but they support only a limited range of hardware; the authors of the patch set feel that it is less work to develop and maintain interrupt-free applications on Linux than to support an RTOS alongside it, as Belits has explained.
In these times, even embedded systems often contain a number of cores, and system designers are adding more of them for tasks requiring predictability, as Belits explained further.
The kernel currently has a couple of features meant to make it possible to run applications without interruptions: nohz (described above) and CPU isolation (or "isolcpus"). The latter feature isolates one or more CPUs — making them unavailable to the scheduler and only accessible to a process via an explicitly set affinity — so that any processes running there need not compete with the rest of the workload for CPU time. These features reduce interruptions on the isolated CPUs, but do not fully eliminate them; task isolation is an attempt to finish the job by removing all interruptions. A process that enters the isolation mode will be able to run in user space with no interference from the kernel or other processes.
Configuring and activating task isolation
The authors assume that isolation is not needed in kernel space or during the task's initialization phase. A task enters the isolation mode at some point in time and stays in this mode until it leaves the isolation on its own, performs some action that causes the isolation to be broken, or receives a signal that was directed to it.
The kernel needs to be compiled with the CONFIG_TASK_ISOLATION configuration option enabled and then booted with the same options as for nohz mode with CPU isolation:
isolcpus=nohz,domain,CPULIST
where nohz disables the timer tick on the specified CPUs, domain removes those CPUs from the scheduling algorithms, and CPULIST is the list of CPUs to which the isolation options apply. Optionally, the task_isolation_debug kernel command-line option causes a stack backtrace to be generated when a task loses isolation.
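For example, to dedicate CPUs 2-5 to isolated tasks (the CPU list here is purely illustrative) and to enable the debugging backtraces, the kernel could be booted with:

    isolcpus=nohz,domain,2-5 task_isolation_debug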
When a task has finished its initialization, it can activate isolation using the PR_TASK_ISOLATION operation provided by the prctl() system call. This operation may fail for either permanent or temporary reasons. An example of a permanent error is when the task is set up to run on a CPU where isolation is not enabled; in this case, entering isolation mode is simply not possible. Temporary errors are indicated by the EAGAIN error code; one example is when delayed workqueues could not be stopped. In such cases, the task may retry the operation, as it may succeed the next time.
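As an illustration, a minimal sketch (in C; not taken from the patch set) of entering isolation with retries on temporary failures might look like this. The PR_* constants come from the patched kernel headers; the values below are placeholders for illustration only.

    #include <sys/prctl.h>
    #include <sched.h>
    #include <errno.h>

    /* Provided by the patched kernel headers; placeholder values here,
       for illustration only. */
    #ifndef PR_SET_TASK_ISOLATION
    #define PR_SET_TASK_ISOLATION     48
    #define PR_TASK_ISOLATION_ENABLE  (1 << 0)
    #endif

    /* Try to enter isolation, retrying while the kernel reports a
       temporary condition (EAGAIN), such as delayed workqueues that
       could not yet be stopped.  Returns 0 on success, -1 on a
       permanent error (for example, running on a non-isolated CPU). */
    static int enter_isolation(void)
    {
        for (;;) {
            if (prctl(PR_SET_TASK_ISOLATION,
                      PR_TASK_ISOLATION_ENABLE, 0, 0, 0) == 0)
                return 0;
            if (errno != EAGAIN)
                return -1;
            sched_yield();      /* give the kernel a chance to quiesce */
        }
    }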
In the prctl() call, the developer may also configure the signal to be sent to the task when it loses isolation. This is done with the PR_TASK_ISOLATION_SET_SIG() macro, which is passed the signal to send. The call then becomes similar to this example:
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);
Here, the process has requested the receipt of a SIGUSR1 signal, rather than the default SIGKILL, should it lose isolation.
Losing isolation
The task will lose isolation if it enters kernel space as the result of a system call, a page fault, an exception, or an interrupt. The (fatal by default) signal will be sent when this happens, with a couple of exceptions: a prctl() call to turn off isolation will not generate the signal, and neither will exit() or exit_group(), since those calls cause the task to exit and thus end its isolation anyway.
When the task loses isolation by any means other than the system calls listed above, it will receive a signal, SIGKILL by default, which terminates the task. The signal can be changed, as shown in the prctl() example above, if the application prefers to catch it instead; this can be useful, for example, if an application wants to log information about the lost isolation before exiting, or to attempt to rerun the code without isolation guarantees.
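A minimal sketch of such a handler (in C; the handler and the message it logs are illustrative, not part of the patch set) could look like this:

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    /* Handler for the isolation-lost signal requested with
       PR_TASK_ISOLATION_SET_SIG(SIGUSR1); only async-signal-safe
       functions are used here. */
    static void isolation_lost(int sig)
    {
        static const char msg[] = "isolation lost, continuing without guarantees\n";

        (void)sig;
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
    }

    /* Install the handler before entering isolation. */
    static void setup_isolation_signal(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = isolation_lost;
        sigaction(SIGUSR1, &sa, NULL);
    }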
The task can enter and leave isolation whenever it desires. To leave isolation without receiving a signal, it should call:
prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
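Putting those pieces together, a task that occasionally needs a kernel service can drop isolation around the system call and then try to re-enter it; a minimal sketch, reusing the illustrative enter_isolation() helper from above:

    /* Leave isolation cleanly (no signal is generated for this prctl()),
       perform whatever work needs kernel services, then try to become
       isolated again.  do_kernel_work() is a stand-in for the caller's
       own code. */
    static int with_kernel_services(void (*do_kernel_work)(void))
    {
        prctl(PR_SET_TASK_ISOLATION, 0, 0, 0, 0);
        do_kernel_work();               /* system calls are allowed here */
        return enter_isolation();       /* may fail; the caller must check */
    }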
The internals
When a process calls prctl() to enable task isolation, it is marked with the TIF_TASK_ISOLATION flag in the kernel. The main part of the job of setting up task isolation, though, is done when returning from the prctl(). When the kernel returns to user space and sees the TIF_TASK_ISOLATION flag set, it arranges for the task not to be interrupted in the future by disabling any events that might interrupt the isolated CPU(s). In the current patches, it disables the scheduler's clock tick and the vmstat delayed work, and it drains pages out of the per-CPU pagevecs to avoid later inter-processor interrupts (IPIs) asking for that to be done. More isolation actions may be added in the future.
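In rough pseudocode (a simplified sketch of the sequence described above, not the patch's actual code; the helpers, other than the flag, are invented stand-ins), the work done on that return path looks something like this:

    /* Simplified sketch of the exit-to-user-space work for an isolated
       task; the helper names below are illustrative only. */
    if (test_thread_flag(TIF_TASK_ISOLATION)) {
        stop_scheduler_tick();      /* no more clock-tick interrupts */
        quiesce_vmstat_work();      /* cancel the vmstat delayed work */
        drain_local_pagevecs();     /* avoid later drain IPIs */
        /* from here on, the task runs in user space undisturbed */
    }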
This isolation work is more straightforward in the current version than it was in the 2015 patch set. Since then, Linux has gained the ability to offload timer ticks from the isolated CPUs to so-called "housekeeping" CPUs (all of those not in the CPU list given to the isolcpus option). That removes the need for special handling of pending timers on CPUs before they can be isolated.
The patch set also adds diagnostics on the non-isolated CPUs. If the kernel finds itself about to interrupt an isolated CPU, for example by sending an IPI or a TLB-flush request, it will generate diagnostics on the interrupting CPU: a warning in the kernel log by default, but a stack dump is also possible. An interrupt that is not handled by Linux, such as a hypervisor interrupt, can end up causing a reschedule IPI to be sent to an isolated CPU, which in turn generates the signal notifying the isolated task that it has lost isolation. With regard to that problem, Frédéric Weisbecker wondered whether support for hypervisors is even necessary, but no conclusion has been reached on this topic.
The task-isolation mode requires changes in the architecture-specific code; the patch set includes implementations for x86, arm, and arm64. An architecture needs to define HAVE_ARCH_TASK_ISOLATION and the new TIF_TASK_ISOLATION task flag. It must change its interrupt and page-fault entry routines to call task_isolation_interrupt() so that any isolated task will exit isolation; the reschedule-IPI handler should call task_isolation_remote() for the same purpose. The system-call code should invoke task_isolation_syscall() to check whether the call is allowed. Finally, when exiting to user space, the architecture code should call task_isolation_check_run_cleanup() to run any pending cleanup, and task_isolation_start() if the isolation flag is set for the current task.
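As a rough illustration of where those hooks end up (the enclosing functions and the hook signatures are simplified stand-ins, not code from any real architecture or from the patch set):

    /* Simplified stand-ins for an architecture's entry code. */

    void arch_interrupt_entry(void)
    {
        task_isolation_interrupt();         /* isolated task will exit isolation */
        /* ... normal interrupt handling ... */
    }

    void arch_syscall_entry(int nr)
    {
        if (task_isolation_syscall(nr) < 0)
            return;                         /* call refused for an isolated task */
        /* ... normal system-call dispatch ... */
    }

    void arch_exit_to_user_mode(void)
    {
        task_isolation_check_run_cleanup(); /* run any pending cleanup */
        if (test_thread_flag(TIF_TASK_ISOLATION))
            task_isolation_start();         /* begin full isolation */
    }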
Apart from the changes in the architecture-specific code, adding the isolation feature caused several changes in other kernel subsystems. For example, in the network code, flush_all_backlogs() will enqueue work only on non-isolated CPUs. The trace ring buffer behaves on isolated CPUs in a similar way to offline ones — any updates will be done when the task exits isolation. Another change in the isolation mode is that kernel jobs are scheduled on housekeeping CPUs only. This includes tasks like probing for PCIe devices. Finally, kick_all_cpus_sync() has been modified to avoid scheduling interrupts on CPUs with isolated tasks. Weisbecker did not agree with this approach and listed a number of race conditions that may happen between this function and the task entering isolation. He suggested fixing the callers instead.
Summary
The patch set has received favorable initial reviews and the feature seems to be of interest to developers. There are still some unresolved comments to be addressed, and some patches have not yet received a review. The patch set changes some basic kernel functionality in subtle ways, so there will surely be questions about how the feature has been tested and, of course, about possible regressions. When those issues are resolved, it will likely be included in one of the upcoming kernel releases.
Index entries for this article
Kernel: Scheduler/CPU isolation
GuestArticles: Rybczynska, Marta
Posted Apr 6, 2020 21:43 UTC (Mon)
by ncm (guest, #165)
But instead of the isolated process needing to call prctl(), it should happen automatically, for any process running on (and, by inference, bound to) an isolated CPU, after no system calls / page faults have occurred for a few milliseconds. The prctl() call should be needed only if that wouldn't be soon enough.
Also, "nohz" and "domain" should be _the_default_ on any isolcpu. I am not going to isolate a core, yet still want a load of crap interruptions sent to it. I isolated it for reasons. If I want interruptions I can ask for them.
Files that are mapped to an isolated process image (and the file descriptor closed) should never, ever cause the process to be blocked, even if the kernel decides it is time to copy changes in mapped memory to the spinny disk image. If tearing would be a problem, it is the process's problem to solve.
Finally, taskset should be able to designate, _all_by itself_, that the core the process is bound to is to be fully isolated. This business of needing to reserve isolcpus at boot time is nonsense.
Posted Apr 6, 2020 23:04 UTC (Mon)
by martin.langhoff (subscriber, #61417)
Ideally this feature evolves towards essentially being as automagic as possible...
Posted Apr 6, 2020 23:18 UTC (Mon)
by ncm (guest, #165)
Posted Apr 6, 2020 23:15 UTC (Mon)
by f18m (guest, #133856)
However I'm unsure whether having the kernel deciding that a taskset should be "fully isolated" after a few milliseconds of zero-system-calls...
Anyway, this patch set would be greatly useful to e.g. DPDK applications, which most often are already using the isolcpus and nohz options.
I'd also love to have this RTOS-like mode available as a post-boot option somewhere (maybe a sysctl setting?) rather than being forced to create scripts that must interact with the bootloader (GRUBv1, GRUBv2, etc) to deploy a new Linux boot option... moreover the reboot required to apply this change may not be acceptable in some contexts...
Posted Apr 7, 2020 1:13 UTC (Tue)
by flussence (guest, #85566)
Posted Apr 7, 2020 0:07 UTC (Tue)
by erkki (guest, #124843)
Posted Apr 7, 2020 1:44 UTC (Tue)
by ncm (guest, #165)
The core that is busy writing out the page necessarily generates contention on the isolated core's cache bus, as dirty cache lines are copied out of it, but no TLB shenanigans ought to be needed.
Posted Apr 7, 2020 3:18 UTC (Tue)
by luto (guest, #39314)
Then CPU B starts writeback. Subsequently, CPU A writes to the page again.
Without a TLB flush, the kernel has no way to know that the page has been dirtied again.
Posted Apr 7, 2020 10:07 UTC (Tue)
by ncm (guest, #165)
If I wanted it clean, I would do something else and take the hit. The whole point of isolcpus is not to stall. If it takes a stall to get magickal feature X, it means I don't want magickal X. Just give me whatever approximation to X can be done without stalling.
Isolcpus: not want stalls. What is not clear?
Posted Apr 7, 2020 22:35 UTC (Tue)
by luto (guest, #39314)
Posted Apr 7, 2020 23:45 UTC (Tue)
by ncm (guest, #165)
In practice, though, a mapped stats file will *always* be dirty, no matter how frequently it is flushed. Maybe what we need is an mmap flag to tell the kernel that it should always assume a given mapping is dirty, and should skip the dance of checking. At least mmap flags are discoverable, unlike (e.g.) ioctls.
Mapping from an unbacked fs and copying from there works, but is a deployment problem. The user wants the stats file where they want it, and it is not easy to discover whether the place they want it is backed.
Posted Apr 8, 2020 2:01 UTC (Wed)
by luto (guest, #39314)
Years ago, I worked a little bit on reducing the stalls from writing to a recently written-back mmapped file. I made some progress but never upstreamed it. Some day I'll finish the job.
Posted Apr 9, 2020 0:30 UTC (Thu)
by erkki (guest, #124843)
Posted Apr 7, 2020 1:09 UTC (Tue)
by gus3 (guest, #61103)
Scenario: a multi-core system with N cores. What's to stop a process from forking N times, then each process isolating itself? Perhaps the Nth isolation attempt will fail, but now, you have N-1 isolated processes, and the last core is saturated as if running on a single-core system.
Does this now provide a new opportunity to use Meltdown/Spectre-style exploits against N-1 non-isolated processes?
Posted Apr 7, 2020 1:34 UTC (Tue)
by ncm (guest, #165)
Systems that configure isolated cores are rarely shared between organizations, although that will probably change as it becomes increasingly impractical to run without.
Posted Apr 7, 2020 2:18 UTC (Tue)
by ncm (guest, #165)
Using a defaults-to-ignore signal would be more compatible with an eventual goal of automatic task isolation for programs that spin. If you want to drop core if your program violates isolation, a handler is the right way to make it happen. We don't need another.
Posted Apr 7, 2020 13:07 UTC (Tue)
by caliloo (subscriber, #50055)
Posted Apr 7, 2020 14:52 UTC (Tue)
by liralon (guest, #124426)
It should be quite common for a fast path running as an isolated task to use a system call to wake up some slow path. For example, GCP Andromeda (https://www.usenix.org/system/files/conference/nsdi18/nsd...) uses system calls to wake up the coprocessor thread or to set an irqfd (to raise a virtual interrupt to the guest).
I would have expected instead, that the fatal signal will be raised to the isolated task when the kernel will reach some code that is about to block the task. E.g. wait for completion of some I/O request or sleep until some eventfd is set.
Posted Apr 7, 2020 17:03 UTC (Tue)
by josh (subscriber, #17465)
Posted Apr 12, 2020 16:57 UTC (Sun)
by dave4444 (subscriber, #127523)
Looks like some good progress here, but what about events that may (or may not) be out of the control of the kernel? Such as:
SMM/SMI/NMI on that CPU: this may not be preventable, but could it be detected?
ECC errors can cause very unpredictable slowdowns (especially correctable ones).
Some applications require more than just a CPU core's resources to themselves; memory contention (L3 and beyond), common I/O paths, etc. can produce slow or even starved applications. Another app on another core can hog L3 and DRAM bandwidth to the detriment of others, for example.
Posted Apr 15, 2020 18:43 UTC (Wed)
by jithu83 (guest, #134793)
The X86_CPU_RESCTRL kernel config option does provide some fine-grained control over memory bandwidth, L2/L3 cache partitioning/locking, etc. It is available only on certain newer processors and requires additional effort to correctly provision these resources to the appropriate processes.
Posted Aug 11, 2020 0:46 UTC (Tue)
by rezete22 (guest, #140723)
Any update on this patch set? Has it been pushed upstream? Which version of Linux works with it?
R/