Exposing concurrency bugs with a custom scheduler
Jake Hillion gave a presentation at FOSDEM about using sched_ext, the BPF scheduling framework that was introduced in kernel version 6.12, to help find elusive concurrency problems. In collaboration with Johannes Bechberger, he has built a scheduler that can reveal theoretically possible but unobserved concurrency bugs in test code in a few minutes. Since their scheduler only relies on mainline kernel features, it can theoretically be applied to any application that runs on Linux — although there are a number of caveats since the project is still in its early days.
Bechberger, who unfortunately could not be present for the talk, is an OpenJDK developer. Since Java has its own concurrency model that OpenJDK is responsible for upholding, Bechberger often has to spend time debugging nasty concurrency problems. After wrestling with one such bug, wasting a lot of time trying to reproduce it, he came up with the idea of making a scheduler that deliberately scheduled a process "badly" in order to try to make misbehavior more likely, and therefore easier to debug.
Hillion works at Meta, mainly on custom Linux schedulers. Meta uses a sched_ext scheduler in production, which is well-engineered, tuned to their specific applications, and reliable, he said. The scheduler that Hillion and Bechberger have built, concurrency-fuzz-scheduler, is the exact opposite of that.
A bad scheduler
To illustrate how a scheduler could expose bugs in an application, Hillion gave an extremely simplified example of the kind of workload that some applications have: a producing thread produces items of work, and a consuming thread consumes them. The items are passed between the threads using a queue. But, each item has an expiration time, after which trying to process it causes an error.
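That kind of workload can be sketched in a few lines of Java. The class and parameter names below are invented for illustration; this is not Hillion's actual demo code, just a minimal model of a producer, a consumer, and items that expire if the consumer falls too far behind:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the workload from the talk: a producer hands
// items to a consumer through a queue, and each item has a deadline.
public class ExpiringQueueDemo {
    record Item(long deadlineNanos) {
        boolean expired() { return System.nanoTime() > deadlineNanos; }
    }

    // Run one producer and one consumer; return how many items expired
    // before the consumer got to them.
    public static int runOnce(int items, long ttlMillis) {
        BlockingQueue<Item> queue = new ArrayBlockingQueue<>(items);
        AtomicInteger expired = new AtomicInteger();

        Thread producer = new Thread(() -> {
            for (int i = 0; i < items; i++) {
                // Each item must be processed within ttlMillis.
                queue.offer(new Item(System.nanoTime() + ttlMillis * 1_000_000L));
            }
        });
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < items; i++) {
                try {
                    if (queue.take().expired()) {
                        // Processing a stale item would be an error in
                        // the real application; here we just count it.
                        expired.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    return;
                }
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return expired.get();
    }
}
```

With a generous time-to-live and a fair scheduler, `runOnce(100, 1000)` should report zero expired items; shrink the deadline, or delay the consumer thread, and expirations start to appear.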
This model may seem oversimplified, but it describes applications Hillion has worked on. He specifically mentioned the case of a large C++ application where someone had copied a pointer to some shared data, instead of copying the data itself, without correctly validating that the data would remain alive long enough. In that application, as long as the thread that was using the copied pointer was scheduled often enough, it would run before the data was deleted, and things would appear to work. When the system came under heavy load, the processing thread couldn't keep up, and it started accessing freed memory.
Hillion and Bechberger's scheduler does this deliberately: when a thread is ready to run, instead of assigning it to a CPU right away, their scheduler will put it to sleep for a random amount of time based on some configurable parameters. This gives threads that would ordinarily not have any problem keeping up with each other a chance to let one thread race ahead and overwhelm the other. It also ensures that threads run in a random order, which can help expose bugs as well. In a way, it is the fact that Linux's default scheduler is so good that makes these problems hard to reproduce, Hillion explained. EEVDF is too fair, and only actually lets one thread overwhelm the other when the machine is already overloaded, and even then rarely.
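The core of the idea can be illustrated in userspace Java. The class and parameter names here are made up for illustration; the real scheduler injects its delays from kernel-side BPF code rather than with `Thread.sleep()`:

```java
import java.util.Random;

// Userspace illustration of the scheduler's central trick: before a
// runnable task is allowed to proceed, inject a random delay drawn
// from a configurable range.
public class RandomDelay {
    private final Random rng;
    private final long minDelayMillis;
    private final long maxDelayMillis;

    public RandomDelay(long seed, long minDelayMillis, long maxDelayMillis) {
        this.rng = new Random(seed);   // a fixed seed makes runs reproducible
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    // Pick a delay uniformly from [minDelayMillis, maxDelayMillis).
    public long nextDelayMillis() {
        return minDelayMillis
            + (long) (rng.nextDouble() * (maxDelayMillis - minDelayMillis));
    }

    // Stall the current thread before it continues with its work,
    // giving other threads a chance to race ahead of it.
    public void throttle() throws InterruptedException {
        Thread.sleep(nextDelayMillis());
    }
}
```

Calling `throttle()` at the start of each consumer iteration in the earlier producer/consumer model is enough to make expirations appear quickly, which is essentially what the scheduler does transparently for every thread it manages.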
The idea of running threads in a random order is not new. There are plenty of specialized concurrency testing tools that do something similar. But concurrency-fuzz-scheduler is not nearly as fine-grained as such tools usually are. It won't help deterministically find violations of a particular memory model, for example. It will find the kinds of gross logic errors that are common in real-life applications, however.
Hillion showed a recording of a Java application implementing his example running under the scheduler. In a little less than a minute, the application crashed; without the scheduler, they could not reproduce the crash on that machine, despite leaving it running for several days.
Early days
Eventually, Hillion would like concurrency-fuzz-scheduler to be able to run on a production machine, increasing the probability of finding bugs without otherwise impacting a workload's performance too much. For now, however, tasks run with the scheduler end up running many times more slowly than they otherwise would, because of the added delays. So currently the scheduler is more suited to use on an individual developer's workstation while trying to reproduce intermittent bugs.
Hillion expects to be able to fix this drawback with a bit more work, however. His job at Meta has taught him that writing a performant scheduler for large machines with sched_ext is a bit tricky. The simplest possible scheduler (which he showed the code for in his talk) is actually quite easy to implement: newly runnable tasks are added to a global queue. When a CPU is about to go idle, it pulls a task from the global queue.
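The structure of that simplest scheduler can be modeled in ordinary Java. This is a userspace model of the design, not the actual BPF code from the talk; the class and method names are invented, loosely mirroring sched_ext's enqueue and dispatch callbacks:

```java
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Model of the simplest sched_ext design: one global FIFO of runnable
// tasks that every CPU pulls from when it is about to go idle.
public class GlobalQueueModel {
    private final Queue<String> runQueue = new ConcurrentLinkedQueue<>();

    // Called when a task becomes runnable.
    public void enqueue(String task) {
        runQueue.add(task);
    }

    // Called when a CPU is about to go idle.  Every CPU contends on the
    // same queue here, which is exactly why this design does not scale
    // to large machines.
    public Optional<String> dispatch() {
        return Optional.ofNullable(runQueue.poll());
    }
}
```

Tasks come out in the order they became runnable, and an idle CPU with nothing to pull simply gets an empty result.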
This kind of simple sched_ext demo is nearly obligatory when anyone gives a talk on the topic; the simple C and Rust examples in the scx repository have been presented dozens of times. Hillion's example code, however, was written in Java — as is concurrency-fuzz-scheduler. One of Bechberger's other projects has been working to make it possible to compile Java to BPF bytecode for use in the kernel. Unfortunately, Hillion has not done as much work on that part of things, and didn't dedicate much time to talking about it.
In any case, the simple sched_ext example that has been used to demonstrate the project, while safe and functional for small systems, does not scale to larger ones. The use of a single global queue imposes synchronization costs between CPUs whenever they need to switch tasks. On systems with dozens of CPUs, especially across separate sockets, those costs can become severe enough to trigger the kernel's softlockup detector and force a reboot, Hillion said.
He explained that sched_ext has a watchdog that is supposed to kick out misbehaving schedulers and switch back to EEVDF. But if a large enough machine has a scheduler that relies on rapidly updating global state, the hardware cost of synchronizing memory between different cores can actually delay things enough that the system won't recover before the softlockup detector trips. In practice, this is not much of an issue. Things only bog down enough to cause problems when all the cores of a large machine are constantly contending for the same global queue; giving each core a local queue and transferring tasks in batches or putting cores to sleep when there aren't enough tasks completely avoids the problem.
concurrency-fuzz-scheduler, however, ends up deliberately introducing serious delays into the tasks it schedules, often running only a single thread at a time and leaving many CPUs idle. Then, when it does run another thread, it may well do so on another CPU to try to expose synchronization problems. No well-behaving scheduler would normally leave so many CPUs idle, or, if it did, it would at least let them go to sleep instead of having them constantly contend for the scheduler's global data structures. Together, the deliberate delays and cross-CPU communication can make concurrency-fuzz-scheduler time out and trigger the softlockup detector on larger systems. So while the scheduler is entirely suitable for use on a programmer's computer, it's not yet in a state where it makes sense to deploy to production (or even to sufficiently large test systems).
Hillion said that the next steps for the project are making it more reliable — scaling it to larger machines, trying to come up with ways to still cause as many problems with less performance overhead, and adding a seed for deterministic randomness.
Questions
After the talk, one audience member asked whether the scheduler did anything intelligent, such as looking at an application's memory to decide which threads to schedule. Hillion said that it currently does not, but that he wants to look into enabling more communication between an application and the scheduler. Meta's production scheduler lets applications give the scheduler hints about how they should be run, and concurrency-fuzz-scheduler could potentially do the same thing. It should be possible to do things like preempt a task specifically when it is in the middle of a critical section or holding a lock; that just hasn't been implemented yet.
Another person asked about other sources of erratic behavior, such as simulated additional memory latency. Hillion agreed that this would be useful for catching additional bugs, but said that interfering with processes in that way is much harder than just writing a scheduler. There are some things a scheduler can do, such as moving a task to another CPU core to force it out of cache, he said, but that's about it. The scheduler would need a lot of additional work to be able to interfere with tasks as they run, instead of only when their time slice expires.
After taking several related questions, Hillion said that he was excited to see so many people having interesting scheduling-related ideas. He works on schedulers professionally, and Bechberger was interested in schedulers as well. But part of the promise of sched_ext was that it would open up scheduler development to more people, and he's excited to see whether that means people who aren't as deeply involved in the space can actually try out some of these ideas.
Hillion and Bechberger's concurrency-fuzz-scheduler is a promising example of a sched_ext scheduler that does more than act as a testbed for scheduling ideas. It is a practical tool for shaking more bugs out of applications, something that is always helpful in a world where software only grows more complex.
[Unfortunately, LWN couldn't send anyone in person to FOSDEM this year. But thanks to the great video of the talk, we are able to report on it.]
| Index entries for this article | |
|---|---|
| Conference | FOSDEM/2025 |
