
Exposing concurrency bugs with a custom scheduler

By Daroc Alden
February 5, 2025

FOSDEM

Jake Hillion gave a presentation at FOSDEM about using sched_ext, the BPF scheduling framework that was introduced in kernel version 6.12, to help find elusive concurrency problems. In collaboration with Johannes Bechberger, he has built a scheduler that can reveal theoretically possible but unobserved concurrency bugs in test code in a few minutes. Since their scheduler relies only on mainline kernel features, it can theoretically be applied to any application that runs on Linux — although there are a number of caveats since the project is still in its early days.

Bechberger, who unfortunately could not be present for the talk, is an OpenJDK developer. Since Java has its own concurrency model that OpenJDK is responsible for upholding, Bechberger often has to spend time debugging nasty concurrency problems. After wrestling with one such bug and wasting a lot of time trying to reproduce it, he came up with the idea of making a scheduler that deliberately scheduled a process "badly" to try to make misbehavior more likely, and therefore easier to debug.

Hillion works at Meta, mainly on custom Linux schedulers. Meta uses a sched_ext scheduler in production, which is well-engineered, tuned to their specific applications, and reliable, he said. The scheduler that Hillion and Bechberger have built, concurrency-fuzz-scheduler, is the exact opposite of that.

A bad scheduler

To illustrate how a scheduler could expose bugs in an application, Hillion gave an extremely simplified example of the kind of workload that some applications have: a producing thread produces items of work, and a consuming thread consumes them. The items are passed between the threads using a queue. But each item has an expiration time, after which trying to process it causes an error.
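
Hillion's demo application was written in Java, but the model is easy to sketch in C. The following is a minimal, hypothetical version (the 10ms expiration time and queue size are invented for illustration); under a fair scheduler, the consumer keeps up and the abort() is never reached:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define QUEUE_SIZE 1024
    #define EXPIRY_NS  10000000L    /* each item expires 10ms after creation */

    /* A bounded queue of expiration deadlines, one per work item. */
    static struct {
        long deadline[QUEUE_SIZE];
        int head, tail;
        pthread_mutex_t lock;
        pthread_cond_t nonempty, nonfull;
    } q = { .lock = PTHREAD_MUTEX_INITIALIZER,
            .nonempty = PTHREAD_COND_INITIALIZER,
            .nonfull = PTHREAD_COND_INITIALIZER };

    static long now_ns(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000L + ts.tv_nsec;
    }

    static void *producer(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&q.lock);
            while ((q.tail + 1) % QUEUE_SIZE == q.head)
                pthread_cond_wait(&q.nonfull, &q.lock);
            q.deadline[q.tail] = now_ns() + EXPIRY_NS;
            q.tail = (q.tail + 1) % QUEUE_SIZE;
            pthread_cond_signal(&q.nonempty);
            pthread_mutex_unlock(&q.lock);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (;;) {
            long deadline;

            pthread_mutex_lock(&q.lock);
            while (q.head == q.tail)
                pthread_cond_wait(&q.nonempty, &q.lock);
            deadline = q.deadline[q.head];
            q.head = (q.head + 1) % QUEUE_SIZE;
            pthread_cond_signal(&q.nonfull);
            pthread_mutex_unlock(&q.lock);

            /* Processing an expired item is an error; under normal
               scheduling this "never" happens. */
            if (now_ns() > deadline) {
                fprintf(stderr, "item expired before processing\n");
                abort();
            }
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;

        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        return 0;
    }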

This model may seem oversimplified, but it describes applications Hillion has worked on. He specifically mentioned the case of a large C++ application where someone had copied a pointer to some shared data, instead of copying the data, without correctly validating that the data would remain alive long enough. In that application, as long as the thread that was using the copied pointer was scheduled often enough, it would run before the data was deleted, and things would appear to work. When the system came under heavy load, the processing thread couldn't keep up, and it started accessing freed memory.
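
In C, the pattern looks roughly like the following hypothetical reconstruction (not code from the application in question); the usleep() stands in for the data's real lifetime:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static char *shared;

    static void *worker(void *arg)
    {
        char *copied = shared;    /* BUG: copies the pointer, not the data */

        (void)arg;
        /* If this thread is scheduled promptly, it runs before the
           free() below and appears to work; under load, it reads
           freed memory instead. */
        printf("%s\n", copied);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        shared = strdup("some shared data");
        pthread_create(&t, NULL, worker, NULL);
        usleep(1000);        /* the data's lifetime */
        free(shared);        /* deleted while the worker may still run */
        pthread_join(t, NULL);
        return 0;
    }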

Hillion and Bechberger's scheduler creates exactly this kind of pathological timing on purpose: when a thread is ready to run, instead of assigning it to a CPU right away, the scheduler puts it to sleep for a random amount of time based on some configurable parameters. This gives one thread the chance to race ahead of and overwhelm another thread that would ordinarily have no trouble keeping up. It also ensures that threads run in a random order, which can help expose bugs as well. In a way, it is the fact that Linux's default scheduler is so good that makes these problems hard to reproduce, Hillion explained: EEVDF is too fair, and only lets one thread overwhelm another when the machine is already overloaded, and even then only rarely.
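
The core policy is simple enough to sketch in a few lines of C. This is illustrative only; the real scheduler is written in Java and compiled to BPF, and the parameter names here are hypothetical:

    #include <stdio.h>
    #include <stdlib.h>

    struct fuzz_config {
        unsigned long min_delay_us;    /* shortest artificial delay */
        unsigned long max_delay_us;    /* longest artificial delay */
    };

    /* Called when a task becomes runnable: pick how long to hold it
       off-CPU before it may be queued for dispatch. */
    static unsigned long runnable_delay(const struct fuzz_config *cfg)
    {
        unsigned long range = cfg->max_delay_us - cfg->min_delay_us + 1;

        return cfg->min_delay_us + (unsigned long)rand() % range;
    }

    int main(void)
    {
        struct fuzz_config cfg = { .min_delay_us = 100, .max_delay_us = 50000 };

        for (int i = 0; i < 5; i++)
            printf("hold next runnable task for %luus\n",
                   runnable_delay(&cfg));
        return 0;
    }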

The idea of running threads in a random order is not new. There are plenty of specialized concurrency testing tools that do something similar. But concurrency-fuzz-scheduler is not nearly as fine-grained as such tools usually are. It won't help deterministically find violations of a particular memory model, for example. It will find the kinds of gross logic errors that are common in real-life applications, however.

Hillion showed a recording of a Java application implementing his example running while being managed by the scheduler. In a little less than a minute, the application crashed — a failure that they could not reproduce on that machine without the scheduler, despite leaving it running for several days.

Early days

Eventually, Hillion would like concurrency-fuzz-scheduler to be able to run on a production machine, increasing the probability of finding bugs without otherwise impacting a workload's performance too much. For now, however, tasks run with the scheduler end up running many times more slowly than they otherwise would, because of the added delays. So currently the scheduler is more suited to use on an individual developer's workstation while trying to reproduce intermittent bugs.

Hillion expects to be able to fix this drawback with a bit more work, however. His job at Meta has taught him that writing a performant scheduler for large machines with sched_ext is a bit tricky. The simplest possible scheduler (which he showed the code for in his talk) is actually quite easy to implement: newly runnable tasks are added to a global queue. When a CPU is about to go idle, it pulls a task from the global queue.
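
That design fits in a handful of lines of BPF C. Here is a trimmed sketch along the lines of the example in the kernel's sched_ext documentation, using the 6.12-era kfunc names (some have since been renamed) and omitting the user-space loader:

    #include <scx/common.bpf.h>

    char _license[] SEC("license") = "GPL";

    /* Put every newly runnable task on the built-in global queue. */
    void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
    {
        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
    }

    SEC(".struct_ops.link")
    struct sched_ext_ops simple_ops = {
        .enqueue = (void *)simple_enqueue,
        .name    = "simple",
    };

No dispatch callback is needed here: a CPU whose local queue is empty automatically pulls the next task from the global queue.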

This kind of simple sched_ext demo is nearly obligatory when anyone gives a talk on the topic; the simple C and Rust examples in the scx repository have been presented dozens of times. Hillion's example code, however, was written in Java — as is concurrency-fuzz-scheduler. One of Bechberger's other projects has been working to make it possible to compile Java to BPF bytecode for use in the kernel. Unfortunately, Hillion has not done as much work on that part of things, and didn't dedicate much time to talking about it.

In any case, the simple sched_ext example that has been used to demonstrate the project, while safe and functional for small systems, does not scale to larger systems. The use of a single global queue imposes synchronization costs between different CPUs whenever they need to switch tasks. For systems with dozens of CPUs, especially on separate sockets, these synchronization costs can sometimes get so bad that the system actually gets slow enough to trigger the kernel's softlockup detector and force a reboot, Hillion said.

He explained that sched_ext has a watchdog that is supposed to kick out misbehaving schedulers and switch back to EEVDF. But if a large enough machine has a scheduler that relies on rapidly updating global state, the hardware cost of synchronizing memory between different cores can actually delay things enough that the system won't recover before the softlockup detector trips. In practice, this is not much of an issue. Things only bog down enough to cause problems when all the cores of a large machine are constantly contending for the same global queue; giving each core a local queue and transferring tasks in batches or putting cores to sleep when there aren't enough tasks completely avoids the problem.

concurrency-fuzz-scheduler, however, ends up deliberately introducing serious delays into the tasks it schedules, often running only a single thread at a time and leaving many CPUs idle. Then, when it does run another thread, it may well do so on another CPU to try to expose synchronization problems. No well-behaved scheduler would normally leave so many CPUs idle, or, if it did, it would at least let them go to sleep instead of having them constantly contend for the scheduler's global data structures. Together, the deliberate delays and cross-CPU communication can make concurrency-fuzz-scheduler time out and trigger the softlockup detector on larger systems. So while the scheduler is entirely suitable for use on a programmer's computer, it's not yet in a state where it makes sense to deploy to production (or even to sufficiently large test systems).

Hillion said that the next steps for the project are making it more reliable — scaling it to larger machines, trying to come up with ways to still cause as many problems with less performance overhead, and adding a seed for deterministic randomness.

Questions

After the talk, one audience member asked whether the scheduler did anything intelligent, such as looking at an application's memory to decide which threads to schedule. Hillion said that it currently does not, but that he wants to look into the possibility of enabling more communication between an application and the scheduler. Meta's production scheduler lets applications give the scheduler hints about how they should be run, and potentially concurrency-fuzz-scheduler could do the same thing. It should be possible to do things like preempt a task specifically when it's in the middle of a critical section, or holding a lock — that just hasn't been implemented yet.

Another person asked about other sources of erratic behavior, such as simulating additional memory latency. Hillion agreed that would be useful for catching additional bugs, but said that interfering with processes like that is much harder than just writing a scheduler. There are some things that can be done, such as switching a task to another CPU core to force it out of cache, he said, but that's about it. The scheduler would need a lot of additional work to be able to actually interfere with tasks as they run, instead of only when their time slice expires.

After taking several related questions, Hillion said that he was excited to see so many people having interesting scheduling-related ideas. He works on schedulers professionally, and Bechberger was interested in schedulers as well. But part of the promise of sched_ext was that it would open up scheduler development to more people, and he's excited to see whether that means people who aren't as deeply involved in the space can actually try out some of these ideas.

Hillion and Bechberger's concurrency-fuzz-scheduler is a promising example of a sched_ext scheduler that does something more than act as a testbed for scheduling ideas. It is a practical tool for shaking more bugs out of applications — something that is always helpful, in a world where software only grows more complex.

[Unfortunately, LWN couldn't send anyone in person to FOSDEM this year. But thanks to the great video of the talk, we are able to report on it.]


Index entries for this article
Conference: FOSDEM/2025



ThreadSanitiser

Posted Feb 6, 2025 0:51 UTC (Thu) by milesrout (subscriber, #126894) [Link] (3 responses)

Does anyone know how this compares with the approach taken by TSan?

ThreadSanitiser

Posted Feb 6, 2025 2:28 UTC (Thu) by roc (subscriber, #30627) [Link] (2 responses)

It's very different.

This kind of approach randomizes scheduling in clever ways to try to trigger bugs. rr chaos mode is a similar kind of approach and there are many others.

TSAN and other race detectors try to find inadequate synchronization, e.g. two CPU cores racing to access the same memory location, where at least one is a write, with no synchronization preventing the race. A TSAN report is therefore not necessarily an application bug, although it usually is. But the advantage of TSAN is that you can report issues without actually having to trigger the precise schedule that exposes the bug.

ThreadSanitiser

Posted Feb 6, 2025 15:04 UTC (Thu) by parttimenerd (guest, #175795) [Link] (1 responses)

(co-speaker of the talk here) You're correct. The goals of the two tools are also different. My scheduler runs programs with random/erratic scheduling orders, but without interfering with the program's execution itself. The scheduler can, therefore, find only larger races or broken inter-dependencies between two threads. A basic example of such an interdependency is given in the Queue example (https://github.com/parttimenerd/concurrency-fuzz-schedule...) that I used for the talk.

The use case of this tool is essentially to run it during integration testing or long fuzzing sessions.

> rr chaos mode is a similar kind of approach and there are many others.

Yes, but rr chaos mode is far slower. One of the ideas is to combine the custom scheduler with the other techniques.

Another goal of the custom scheduler was to show what is possible with sched-ext, allowing us to implement a custom scheduling policy in a matter of hours.

ThreadSanitiser

Posted Feb 7, 2025 10:20 UTC (Fri) by roc (subscriber, #30627) [Link]

> Yes, but rr chaos mode is far slower. One of the ideas is to combine the custom scheduler with the other techniques.

Tradeoffs! rr chaos mode has the advantage that you get a fully reproducible recording of the execution which makes it really easy to debug. AFAICT from the description in this article, this sched_ext approach has no guarantee that you can reproduce a bug reliably after you've caught it once.

Race with softlockup detector

Posted Feb 6, 2025 10:02 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

> The hardware cost of synchronizing memory between different cores can actually delay things enough that the system won't recover before the softlockup detector trips.

So there's a race condition between the scheduler and other parts of the kernel.

Perhaps when introducing a delay to scheduling, it needs to schedule a dummy user-space process to run for part of that time, so that the CPU does not appear idle and the system is not stuck in kernel mode for too long?

Race with softlockup detector

Posted Feb 7, 2025 5:14 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

As an SRE, I would be very unhappy with a solution along those lines. If you want to put something like this into production, I would expect it to have quite low overhead, because hardware and electricity are expensive and carry significant opportunity costs. And if you have low overhead, then the kernel's softlockup detection should not have a problem with you in the first place.

I wouldn't necessarily object to doing this sort of thing in a non-prod CI or testing setup, because that's (usually) much smaller and O(1) in terms of incoming QPS (i.e. if you have twice as many incoming requests, you do not need to turn up twice as many test machines, but you do need twice as many prod machines to serve those requests).

Nice to have an open-source tool with this functionality

Posted Feb 6, 2025 11:11 UTC (Thu) by dottedmag (subscriber, #18590) [Link]

This is so helpful!

I remember a start-up company from several years ago that tried to market a similar userspace tool, but it failed to get any traction and didn't release any source code after closing its doors.

Very cool

Posted Feb 6, 2025 14:33 UTC (Thu) by chris_se (subscriber, #99706) [Link]

This is extremely cool. It will make it much easier to see whether the synchronization abstractions in one's own codebase (such as an SPSC queue or similar) are actually correct, because unit tests may now trigger problematic behavior that they previously missed for being too simple. I'll definitely take a look at that project. Thanks for reporting on this! (I was at FOSDEM, but because there's SO MUCH going on there, I didn't even have this on my radar.)

This looks incredibly promising

Posted Feb 6, 2025 20:00 UTC (Thu) by marcH (subscriber, #57642) [Link]

This looks incredibly promising, well done!

I haven't debugged that many race conditions in my career, but every time I did, _testing_ it and proving it required temporarily making it worse, generally by inserting some delay in the "right place". The obvious, massive challenge: finding that right place meant that I had a decent understanding of the race and that I was _already_ very close to solving the problem - after an incredibly long and difficult debug session!

Please correct me if I'm wrong, but it sounds like this can, in some cases, skip most of that debug effort. Amazing.

It does not matter if this method is not perfect and can't catch all races (as other methods try to do). ANY time saved debugging any race condition(s) is a huge win.

A bit of a "+1 / me too" post, sorry, but hey, it's not like we have too much positive stuff on the Internet these days.

In userspace with glibc hooks

Posted Feb 14, 2025 16:34 UTC (Fri) by AlexeyMilovidov (guest, #176032) [Link] (1 responses)

It is possible to do this entirely in userspace (without a custom scheduler).

See the implementation here: https://github.com/ClickHouse/ClickHouse/blob/master/src/...

It adds random sleeps:

- before and after mutex locks and unlocks;
- at random places, using a signal handler for SIGALRM.

It is named ThreadFuzzer (not to be confused with ThreadSanitizer - by the way, they combine nicely together).

In our practice, it works and significantly improves the detection of concurrency bugs.
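
A minimal LD_PRELOAD sketch of the mutex-hooking half of this idea (not ClickHouse's actual code) looks something like this; build it with "gcc -shared -fPIC fuzz.c -o fuzz.so -ldl" and run the target with LD_PRELOAD=./fuzz.so:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Wrap pthread_mutex_lock() and occasionally sleep for a random
       few microseconds before taking the lock, perturbing the timing
       of critical sections. (rand() is used for brevity; a real tool
       would use a signal-safe PRNG.) */
    int pthread_mutex_lock(pthread_mutex_t *mutex)
    {
        static int (*real_lock)(pthread_mutex_t *);

        if (!real_lock)
            real_lock = (int (*)(pthread_mutex_t *))
                dlsym(RTLD_NEXT, "pthread_mutex_lock");

        if (rand() % 16 == 0)
            usleep(rand() % 100);

        return real_lock(mutex);
    }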

In userspace with glibc hooks

Posted Feb 25, 2025 11:48 UTC (Tue) by parttimenerd (guest, #175795) [Link]

Thanks for the suggestion. It would be great to combine both approaches. Doing all of the fuzzing in the kernel has the advantage of not modifying (or hooking into) the application under test. Is there any more information out there about how this tool has helped and how it can be used? I would be happy to compare the two approaches.

Additional material

Posted Feb 25, 2025 11:50 UTC (Tue) by parttimenerd (guest, #175795) [Link]

I just published a blog post that covers the implementation of the scheduler and how it came to be. You can find it at https://mostlynerdless.de/blog/2025/02/25/helle-ebpf-conc...


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds