
Core scheduling

By Jonathan Corbet
February 28, 2019
Kernel developers are used to having to defend their work when posting it to the mailing lists, so when a longtime kernel developer describes their own work as "expensive and nasty", one tends to wonder what is going on. The patch set in question is core scheduling from Peter Zijlstra. It is intended to make simultaneous multithreading (SMT) usable on systems where cache-based side channels are a concern, but even its author is far from convinced that it should actually become part of the kernel.

SMT increases performance by turning one physical CPU into two virtual CPUs that share the hardware; while one is waiting for data from memory, the other can be executing. Sharing a processor this closely has led to security issues and concerns for years, and many security-conscious users disable SMT entirely. The disclosure of the L1 terminal fault vulnerability in 2018 did not improve the situation; for many, SMT simply isn't worth the risks it brings with it.

But performance matters too, so there is interest in finding ways to make SMT safe (or safer, at least) to use in environments with users who do not trust each other. The coscheduling patch set posted last September was one attempt to solve this problem, but it did not get far and has not been reposted. One obstacle to this patch set was almost certainly its complexity; it operated at every level of the scheduling domain hierarchy, and thus addressed more than just the SMT problem.

Zijlstra's patch set is focused on scheduling at the core level only, meaning that it is intended to address SMT concerns but not to control higher-level groups of physical processors as a unit. Conceptually, it is simple enough. On kernels where core scheduling is enabled, a core_cookie field is added to the task structure; it is an unsigned long value. These cookies are used to define the trust boundaries; two processes with the same cookie value trust each other and can be allowed to run simultaneously on the same core.

By default, all processes have a cookie value of zero, placing them all in the same trust group. Control groups are used to manage cookie values via a new tag field added to the CPU controller. By placing a group of processes into their own group and setting tag appropriately, the system administrator can ensure that this group will not share a core with any processes outside of the group.
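As a sketch of how an administrator might use this interface: the file name (cpu.tag) and the hierarchy layout below are assumptions about the patch set's cgroup-v1 CPU controller, not a stable kernel ABI.

```shell
# Create a group for mutually-trusting but otherwise untrusted processes.
mkdir /sys/fs/cgroup/cpu/untrusted

# Setting the tag gives every task in the group its own shared cookie.
echo 1 > /sys/fs/cgroup/cpu/untrusted/cpu.tag

# Tasks moved into the group will never share an SMT core with
# tasks outside it.
echo $PID > /sys/fs/cgroup/cpu/untrusted/tasks
```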

Underneath, of course, there is some complexity involved in making all of this work. In current kernels, each CPU in an SMT core schedules independently of the others, but that cannot happen when core scheduling is enabled; scheduling decisions must now take into account what is happening elsewhere in the core. So when one CPU in a core calls into the scheduler, it must evaluate the situation for all CPUs in that core. The highest-priority process eligible to run on any CPU in that core is chosen; if it has a cookie value that excludes processes currently running on other CPUs, those processes must be kicked out to ensure that no unwanted sharing takes place. Other, lower-priority processes might replace these outcasts, but only if they have the right cookie value.

The CPU scheduler works hard to avoid moving processes between distant CPUs in an attempt to maximize cache effectiveness. Load balancing kicks in occasionally to even out the load on the system as a whole. The calculation changes a bit, though, when core scheduling is in use; moving a process is more likely to make sense if that process can run on a CPU that would otherwise sit idle, even if the moved process leaves a hot cache behind. Thus, if an SMT CPU is forced idle due to cookie exclusion, a new load balancing algorithm will look across other cores for a process with a matching cookie to move onto the idle CPU.

The patch set has seen a fair amount of discussion. Greg Kerr, representing Chrome OS, questioned the control-group interface. Making changes to control groups is a privileged operation, but he would like for unprivileged processes to be able to set their own cookies. To that end, he proposed an API based on prctl() calls. Zijlstra replied that the interface issues can be worked out later; first it's necessary to get everything working as desired.

Whether that can be done remains to be seen. As Linus Torvalds pointed out, performance is the key consideration here. Core scheduling only makes sense if it provides better throughput than simply running with SMT disabled, so the decision on whether to merge it depends heavily on benchmark results. There is not a lot of data available yet; it seems that perhaps it works better on certain types of virtualized loads (those that only rarely exit back to the hypervisor) and worse on others. Subhra Mazumdar also reported a performance regression on database loads.

Thus, even if core scheduling is eventually accepted, it seems unlikely to be something that everybody turns on. But it may yet be a tool that proves useful for some environments where there is a need to get the most out of the hardware, but where strong isolation between users is also needed. The process of finishing this work and figuring out if it justifies the costs seems likely to take a while in any case; this sort of major surgery to the CPU scheduler is not done lightly, even when its developer doesn't claim to "hate it with a passion". So security-conscious users seem likely to be without an alternative to disabling SMT for some time yet.

Index entries for this article
KernelScheduler/Core scheduling



Core scheduling

Posted Feb 28, 2019 18:12 UTC (Thu) by josh (subscriber, #17465) [Link] (6 responses)

(Note: Greg Kerr suggested an API based on prctl, not ptrace.)

Rather than using cgroups for this, what about making the default "processes that can ptrace each other can share a core"? (To a first approximation, that's "processes running as the same user and group".) Inventing a new mechanism that allows finer-grained usage of cookies seems like a waste; we already have mechanisms for processes to isolate themselves from each other, and we just need those mechanisms to help with side-channel attacks.

Core scheduling

Posted Mar 1, 2019 8:14 UTC (Fri) by diconico07 (guest, #117416) [Link] (2 responses)

"Processes that can ptrace each other" is likely to be an empty set with Yama set to restricted ptrace (i.e a process can only ptrace its children). I'd rather suggest that when you want to isolate processes you run them in different PID namespaces, so if I were to chose an already existing value to discriminate "processes that can run on the same core" I'd take "processes that are in the same PID namespaces"

Core scheduling

Posted Mar 1, 2019 16:39 UTC (Fri) by epa (subscriber, #39769) [Link] (1 responses)

If a process can only ptrace its own children, it should only be able to do side channel attacks on its own children.

Core scheduling

Posted Mar 2, 2019 10:43 UTC (Sat) by diconico07 (guest, #117416) [Link]

But with ptrace there is a clear direction: as you say, a parent should only be able to mount side-channel attacks on its children. This kind of attack works in both directions, though, so you can't prevent a child from attacking its parent. Moreover, checking whether two processes are related is much slower than comparing two values, such as the suggested tag or a PID-namespace identifier.

Core scheduling

Posted Mar 1, 2019 14:22 UTC (Fri) by droundy (subscriber, #4559) [Link]

I agree that this would at least be great to have as an option. Less tweaking required in order to get security!

Core scheduling

Posted Mar 1, 2019 15:18 UTC (Fri) by smurf (subscriber, #17840) [Link] (1 responses)

Checking the user+group (which one? neither effective nor real UID+GID is sufficient when you have a setuid process you want to protect) is significantly more expensive than checking a single cookie for equality. In the scheduler, that matters.

Core scheduling

Posted May 15, 2019 14:04 UTC (Wed) by riel (subscriber, #3142) [Link]

It might be possible to automatically create and set cookies based on things like UID + GID, in the same way that CONFIG_SCHED_AUTOGROUP automatically creates cgroups based on UID. That way the enforcement code only ever checks the single cookie.

At this stage, it is hard to say exactly what the policy should look like.

Core scheduling

Posted Apr 23, 2019 7:34 UTC (Tue) by erlong (guest, #130062) [Link]

In the current release, the cookie is strictly matched between SMT siblings in one core; is the cost of rescheduling between siblings too expensive or not? Might the cookie later be matched softly between siblings, so that one cookie can "pardon" another?


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds