LWN: Comments on "A full task-isolation mode for the kernel" https://lwn.net/Articles/816298/ This is a special feed containing comments posted to the individual LWN article titled "A full task-isolation mode for the kernel". en-us Thu, 02 Oct 2025 16:38:40 +0000 Thu, 02 Oct 2025 16:38:40 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net A full task-isolation mode for the kernel https://lwn.net/Articles/828493/ https://lwn.net/Articles/828493/ rezete22 <div class="FormattedComment"> Hello <br> <p> Any update on this patchset? Is it pushed upstream? Which version of Linux works with this patchset?<br> <p> R/<br> </div> Tue, 11 Aug 2020 00:46:54 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/817600/ https://lwn.net/Articles/817600/ jithu83 <div class="FormattedComment"> <font class="QuotedText">&gt; Some applications require more than just a CPU core's resources to itself; memory contention (L3 or beyond), common IO paths, etc. can produce slow or even starved applications. Another app on another core can hog and consume L3 and DRAM bandwidth to the detriment of others, for example.</font><br> <p> <p> The X86_CPU_RESCTRL kernel config option does provide some fine-grained control over memory bandwidth, L2/L3 cache partitioning/locking, etc. This is available only on certain newer processors, and requires additional effort to correctly provision these resources to the appropriate processes.<br> </div> Wed, 15 Apr 2020 18:43:25 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/817334/ https://lwn.net/Articles/817334/ dave4444 <div class="FormattedComment"> <p> Looks like some good progress here, but what about events that may (or may not) be out of the control of the kernel? 
Such as:<br> <p> SMM/SMI/NMI on that CPU: this may not be preventable, but could it be detected?<br> <p> ECC errors can cause very unpredictable slowdowns (especially correctable ones).<br> <p> Some applications require more than just a CPU core's resources to itself; memory contention (L3 or beyond), common IO paths, etc. can produce slow or even starved applications. Another app on another core can hog and consume L3 and DRAM bandwidth to the detriment of others, for example.<br> <p> <p> </div> Sun, 12 Apr 2020 16:57:48 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/817087/ https://lwn.net/Articles/817087/ erkki <div class="FormattedComment"> Right, that makes sense. At least the kernel could allow for writeback without guaranteeing per-page atomicity. In that case the writeback CPU and the isolated CPU can operate on the page concurrently. <br> </div> Thu, 09 Apr 2020 00:30:35 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/817025/ https://lwn.net/Articles/817025/ luto <div class="FormattedComment"> You could have a little daemon that periodically copies the file from tmpfs to the real backing store and have the isolated code write to tmpfs.<br> <p> Years ago, I worked a little bit on reducing the stalls from writing to a recently written-back mmapped file.  I made some progress but never upstreamed it.  Some day I'll finish the job.<br> <p> </div> Wed, 08 Apr 2020 02:01:17 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/817024/ https://lwn.net/Articles/817024/ ncm <div class="FormattedComment"> Ok, thank you. It sounds like the multi-millisecond stall is not necessary, but flushing the TLB entry for the page is.<br> <p> In practice, though, a mapped stats file will *always* be dirty, no matter how frequently it is flushed. Maybe what we need is an mmap flag to tell the kernel that it should always assume a given mapping is dirty, and should skip the dance of checking. 
At least mmap flags are discoverable, unlike (e.g.) ioctls.<br> <p> Mapping from an unbacked fs and copying from there works, but is a deployment problem. The user wants the stats file where they want it, and it is not easy to discover whether the place they want it is backed.<br> <p> <p> </div> Tue, 07 Apr 2020 23:45:33 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/817023/ https://lwn.net/Articles/817023/ luto <div class="FormattedComment"> In your model, if an isolcpus task maps a file writably, the kernel will have to write back every mapped page every 30 seconds or so regardless of whether it’s written. Or, at least, every page that has ever been written. I don’t think this is wise.<br> </div> Tue, 07 Apr 2020 22:35:29 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/817009/ https://lwn.net/Articles/817009/ josh <div class="FormattedComment"> What would it take to eliminate the periodic wakeups for vmstats entirely, not just on isolated CPUs but in general? That seems like something the kernel could account for incrementally as it performs operations, and then just sum up when something wants the statistics.<br> </div> Tue, 07 Apr 2020 17:03:06 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/817004/ https://lwn.net/Articles/817004/ liralon <div class="FormattedComment"> It seems to me that raising a fatal signal to the isolated task (which will kill it) when the task invokes a system call is an issue.<br> <p> It should be quite common for a fast path running as an isolated task to use a system call to wake up some slow path. 
For example, GCP Andromeda (<a href="https://www.usenix.org/system/files/conference/nsdi18/nsdi18-dalton.pdf">https://www.usenix.org/system/files/conference/nsdi18/nsd...</a>) uses system calls to wake up the coprocessor thread or set an irqfd (to raise a virtual interrupt to the guest).<br> <p> I would have expected instead that the fatal signal would be raised to the isolated task when the kernel reaches code that is about to block the task. E.g., waiting for completion of some I/O request, or sleeping until some eventfd is set.<br> </div> Tue, 07 Apr 2020 14:52:35 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816949/ https://lwn.net/Articles/816949/ caliloo <div class="FormattedComment"> I wonder how this interacts with the new system calls that can be placed on a buffer ring; sounds to me like you’d be able to place system calls without leaving isolation, which is kind of nice.<br> </div> Tue, 07 Apr 2020 13:07:36 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816948/ https://lwn.net/Articles/816948/ ncm <div class="FormattedComment"> There is no reasonable expectation of coherence anyway, because there is no synchronization available. It will get dirty again, and get copied out again, later.<br> <p> If I wanted it clean, I would do something else and take the hit. The whole point of isolcpus is not to stall. If it takes a stall to get magickal feature X, it means I don't want magickal X. Just give me whatever approximation to X can be done without stalling.<br> <p> Isolcpus: not want stalls. What is not clear?<br> </div> Tue, 07 Apr 2020 10:07:35 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816944/ https://lwn.net/Articles/816944/ luto <div class="FormattedComment"> I’m afraid this can’t really be done. Suppose CPU A runs an isolated task and writes to a mapped page. Now CPU A has a writable dirty TLB entry for the page.<br> <p> Then CPU B starts writeback. 
Subsequently, CPU A writes to the page again.<br> <p> Without a TLB flush, the kernel has no way to know that the page has been dirtied again.<br> </div> Tue, 07 Apr 2020 03:18:22 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816943/ https://lwn.net/Articles/816943/ ncm <div class="FormattedComment"> Also, why use SIGKILL, and then provide a back-door way to change it to some other signal, instead of using one of the numerous other defaults-to-terminate, or even defaults-to-ignore, signals? That seems like complexity for the sake of complexity. Even inventing a new signal number for the purpose would be simpler.<br> <p> Using a defaults-to-ignore signal would be more compatible with an eventual goal of automatic task isolation for programs that spin. If you want to drop core if your program violates isolation, a handler is the right way to make it happen. We don't need another.<br> </div> Tue, 07 Apr 2020 02:18:49 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816941/ https://lwn.net/Articles/816941/ ncm <div class="FormattedComment"> TLB shootdowns are the least of the problem. I see multi-millisecond pauses caused by the behavior in question. 
Such files have to live in a tmpfs, shmfs, or the like, and snapshots of the stats normally kept in them have to be produced by copying to another file in the same fs, and then out, to prevent pathological stalls.<br> <p> The core that is busy writing out the page necessarily generates contention on the isolated core's cache bus, as dirty cache lines are copied out of it, but no TLB shenanigans ought to be needed.<br> </div> Tue, 07 Apr 2020 01:44:24 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816940/ https://lwn.net/Articles/816940/ ncm <div class="FormattedComment"> Claiming a core for isolation would reasonably require a capability normally provided only to root.<br> <p> Systems that configure isolated cores are rarely shared between organizations, although that will probably change as it becomes increasingly impractical to run without.<br> </div> Tue, 07 Apr 2020 01:34:52 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816939/ https://lwn.net/Articles/816939/ flussence <div class="FormattedComment"> “Soft” isolation is still useful for HPC work, where you'd like to have something stronger than SCHED_BATCH, but nobody's going to end up taken away in an ambulance if it only gets 99.99% of the CPU.<br> </div> Tue, 07 Apr 2020 01:13:17 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816938/ https://lwn.net/Articles/816938/ gus3 <div class="FormattedComment"> This could be an exploit waiting to happen.<br> <p> Scenario: a multi-core system with N cores. What's to stop a process from forking N times, then each process isolating itself? 
Perhaps the Nth isolation attempt will fail, but now you have N-1 isolated processes, and the last core is saturated as if running on a single-core system.<br> <p> Does this now provide a new opportunity to use Meltdown/Spectre-style exploits against N-1 non-isolated processes?<br> </div> Tue, 07 Apr 2020 01:09:20 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816936/ https://lwn.net/Articles/816936/ erkki <div class="FormattedComment"> Preventing TLB shootdowns due to writeback of file-backed mmaps would be really interesting. Maybe through a new mmap flag?<br> </div> Tue, 07 Apr 2020 00:07:10 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816933/ https://lwn.net/Articles/816933/ ncm <div class="FormattedComment"> OK, but sensible defaults for isolcpus is no more work than crazy defaults.<br> <p> </div> Mon, 06 Apr 2020 23:18:47 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816932/ https://lwn.net/Articles/816932/ f18m <div class="FormattedComment"> I fully agree with ncm: what's the sense of isolating a CPU core other than the need to use it without any sort of interrupt, to emulate a real-time OS?<br> <p> However, I'm unsure about having the kernel decide that a taskset should be "fully isolated" after a few milliseconds of zero system calls... <br> <p> Anyway, this patch set would be greatly useful to e.g. DPDK applications, which are most often using the isolcpus and nohz options already.<br> Looking forward to it!<br> <p> I'd also love to have this RTOS-like mode as a post-boot option somewhere (maybe a sysctl setting?) rather than being forced to create scripts that must interact with the bootloader (GRUBv1, GRUBv2, etc.) to deploy a new Linux boot option... 
moreover the reboot required to apply this change may not be acceptable in some contexts...<br> <p> </div> Mon, 06 Apr 2020 23:15:33 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816929/ https://lwn.net/Articles/816929/ martin.langhoff <div class="FormattedComment"> Gotta walk before you run, but long term I completely agree.<br> <p> Ideally this feature evolves towards essentially being as automagic as possible... <br> </div> Mon, 06 Apr 2020 23:04:35 +0000 A full task-isolation mode for the kernel https://lwn.net/Articles/816923/ https://lwn.net/Articles/816923/ ncm <div class="FormattedComment"> About time.<br> <p> But instead of the isolated process needing to call prctl(), it should happen automatically, for any process running on (and, by inference, bound to) an isolated CPU, after no system calls / page faults have occurred for a few milliseconds. The prctl() call should be needed only if that wouldn't be soon enough.<br> <p> Also, "nohz" and "domain" should be _the_default_ on any isolcpu. I am not going to isolate a core, yet still want a load of crap interruptions sent to it. I isolated it for reasons. If I want interruptions I can ask for them.<br> <p> Files that are mapped to an isolated process image (and the file descriptor closed) should never, ever cause the process to be blocked, even if the kernel decides it is time to copy changes in mapped memory to the spinny disk image. If tearing would be a problem, it is the process's problem to solve.<br> <p> Finally, taskset should be able to designate, _all_by itself_, that the core the process is bound to is to be fully isolated. This business of needing to reserve isolcpus at boot time is nonsense.<br> </div> Mon, 06 Apr 2020 21:43:58 +0000