Diagnosing workqueues
There are many mechanisms for deferred work in the Linux kernel. One of them, workqueues, has seen increasing use as part of the move away from software interrupts. Alison Chaiken gave a talk at SCALE about how they compare to software interrupts, the new challenges they pose for system administrators, and what tools are available to kernel developers wishing to diagnose problems with workqueues as they become increasingly prevalent.
Background on software interrupts
Software interrupts are a mechanism that allows Linux to split the work done by interrupt handlers into two parts. The interrupt handler invoked by the hardware does the minimum amount of work, and then raises a software interrupt that the kernel runs later to do the actual work. This reduces the time spent in the registered interrupt handler, which helps ensure that new hardware interrupts are serviced promptly.
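As a rough sketch of that split, a driver's hardware interrupt handler might look something like the following; the device and handler names are hypothetical, and most real drivers raise software interrupts indirectly, through subsystem helpers such as the networking core's napi_schedule(), rather than calling raise_softirq() themselves:

```c
#include <linux/interrupt.h>

/* Hypothetical "top half": it runs under tight constraints, so it only
 * acknowledges the device and defers the real work. */
static irqreturn_t mydev_irq(int irq, void *dev_id)
{
	/* ... read and clear the device's interrupt status ... */

	/* Ask the kernel to run the deferred work soon. */
	raise_softirq(NET_RX_SOFTIRQ);

	return IRQ_HANDLED;
}
```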
![Alison Chaiken](https://static.lwn.net/images/2024/Chaiken-fuzz.png)
Chaiken explained that when a hardware interrupt raises a software interrupt, there are two possible cases. When no software interrupt is already running on the CPU, the new software interrupt can start running immediately. When a software interrupt is already running on the CPU, the new interrupt is enqueued to be handled later — even if the new interrupt would actually be higher priority than the currently running one. There are ten different kinds of software interrupt, and each kind has a specific priority. Chaiken showed a list of these priorities, and remarked that even without knowing anything else about the design of software interrupts, seeing network interrupts listed above timer interrupts might make people "feel some foreboding".
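The kinds and their priorities come from an enum in include/linux/interrupt.h, where a lower value means a higher priority; note that the two networking entries outrank both the high-resolution-timer and RCU entries:

```c
/* From include/linux/interrupt.h (recent kernels): lower values run first. */
enum {
	HI_SOFTIRQ = 0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	IRQ_POLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,	/* preferably the last entry */
	NR_SOFTIRQS
};
```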
These priority inversions are a problem on their own, because they contribute to latency and jitter for high-priority tasks, but the priority system also introduces other problems. The lowest priority interrupts are part of the kernel's read-copy-update (RCU) system. Chaiken called the RCU system "basically the kernel's garbage collector". This means that not servicing interrupts fast enough can actually cause the kernel to run out of memory.
On the other hand, spending too much time servicing software interrupts can disrupt latency-sensitive operations such as audio processing. A common problem reported to kernel maintainers is a software interrupt that runs too long and refuses to yield, effectively tying up a core.
To balance these two problems, there are two heuristic limits used to balance latency against fairness. MAX_SOFTIRQ_TIME is the maximum time that a software interrupt is allowed to run; it is set to 2ms. MAX_SOFTIRQ_RESTART is the maximum number of times that a software interrupt that is itself interrupted by something else will be restarted; it is set to ten attempts. Unfortunately, these parameters are hard-coded and built into the kernel. They were supposedly set to good values via experimentation, but Linux runs on so many different kinds of device that no setting could be optimal for all of them. "No one has the nerve to change them", she said, which is "not a great situation". She summed up the problems with software interrupts by saying that they "are not the most beloved feature of the kernel" and that there have already been several attempts to get rid of them across many versions of the kernel.
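Both limits are defined near the top of kernel/softirq.c; since they are preprocessor constants, changing them means rebuilding the kernel:

```c
/* From kernel/softirq.c: hard-coded heuristics, not run-time tunables. */
#define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
#define MAX_SOFTIRQ_RESTART 10
```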
But progress removing software interrupts is slow. Despite those efforts, there are still 250 call sites of local_bh_disable() — a function which Chaiken called "the villain of this part of the talk". local_bh_disable() prevents software interrupts from being run on a particular CPU. In practice, however, it functions as a lock to protect data structures from being concurrently accessed by software-interrupt handlers. One audience member asked which resources were guarded by the bottom half lock. Chaiken responded that "no one actually knows" because the calls are spread throughout the kernel.
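A typical use looks like the following sketch, in which the per-CPU counter being protected is hypothetical; between the two calls, no software-interrupt handler can run on the local CPU, which is what makes the pattern act like a lock:

```c
#include <linux/bottom_half.h>
#include <linux/percpu.h>

/* Hypothetical per-CPU data shared with a softirq handler. */
static DEFINE_PER_CPU(unsigned long, mydev_packets);

static void mydev_count_packet(void)
{
	local_bh_disable();		/* softirqs now blocked on this CPU */
	this_cpu_inc(mydev_packets);
	local_bh_enable();		/* pending softirqs may run again */
}
```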
Even worse, software interrupts are largely opaque, because they run in an interrupt context — just like hardware interrupts do. They don't have access to many kernel facilities — such as debug logging. "You can't be printing from interrupt handlers". There are a few ways to get visibility, but they're cumbersome compared to the functionality available to the rest of the kernel.
Even though software interrupts are difficult to work with, there are some observability tools. Chaiken did a demo on her laptop — "On which I am running a kernel which no sane person would use on a computer used for a presentation" — showing how to use the stackcount program to get stack traces for all the software interrupts currently running.
Increasingly, there has been a push to move some of the work done by software interrupts to the workqueue mechanism, which Chaiken called "just an all-around better design".
Workqueues
Workqueues have existed in the kernel for a long time, but they have recently seen a lot of new functionality added. "The hardest part of this presentation has been that workqueues have changed so much in the last 18 months I've had trouble keeping up".
Workqueues are a generic way for drivers and other kernel components to schedule deferred work. In theory, each workqueue is associated with a single component, which can add whatever work to the queue it likes; in practice, much of the kernel uses shared workqueues that are not specific to any component. Each workqueue is also associated with a pool of kernel threads that service items from that queue.
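In kernel code, the basic API looks roughly like this minimal, hypothetical module; schedule_work() puts an item on the shared system workqueue, while alloc_workqueue() creates a dedicated queue:

```c
#include <linux/module.h>
#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
	pr_info("deferred work running in a kworker thread\n");
}
static DECLARE_WORK(my_work, my_work_fn);

static struct workqueue_struct *my_wq;

static int __init my_init(void)
{
	/* The shared system workqueue is fine for occasional work... */
	schedule_work(&my_work);

	/* ...or a component can create its own queue. */
	my_wq = alloc_workqueue("my_wq", 0, 0);
	return my_wq ? 0 : -ENOMEM;
}

static void __exit my_exit(void)
{
	flush_work(&my_work);
	destroy_workqueue(my_wq);
}

module_init(my_init);
module_exit(my_exit);
MODULE_LICENSE("GPL");
```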
By default, Linux creates two worker pools per CPU: one normal priority and one high priority. These pools contain dedicated worker threads, which the kernel spawns and removes as demand requires. Because the pools are managed automatically, an administrator who runs into a problem with a misbehaving workqueue item cannot solve it by changing the priority of a worker or pinning it to a separate core. As more functionality moves over to workqueues, such problems and bug reports will undoubtedly become more common.
The proper way to change what happens with items in workqueues is to use the "workqueue API that manages work" as opposed to managing the workers directly. Chaiken showed a demonstration of how this could be done. She picked out a workqueue and showed that it was running on a particular pool that was also servicing many other workqueues. Then she changed the priority of the workqueue itself, and showed that this had caused the workqueue to change to a different worker pool — one that matched its new attributes. In response to an audience question, she clarified that "the kernel will just create new work pools, if there is no work pool that matches a work queue."
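Her demonstration used the run-time interface, but a workqueue's attributes can also be requested when it is created; in this hypothetical kernel-side sketch, the WQ_HIGHPRI flag steers the queue's work to the high-priority per-CPU pool:

```c
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *hi_wq;

static int mydrv_setup_wq(void)	/* hypothetical helper */
{
	/* WQ_HIGHPRI selects the per-CPU high-priority worker pool;
	 * WQ_UNBOUND would instead select an unbound pool whose nice
	 * value and CPU mask can be changed at run time. */
	hi_wq = alloc_workqueue("mydrv_hipri", WQ_HIGHPRI, 0);
	return hi_wq ? 0 : -ENOMEM;
}
```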
"Treatment of affinity of workqueues has really improved in recent kernels", she remarked. Since pinning individual workers to CPU cores is not possible, recent kernels allow the user to change the CPU affinity of the workqueues themselves. The addition of features like this mean that workqueues in general have gotten substantially more useful over the last 18 months, which Chaiken called a "march of progress".
She also demonstrated the much more flexible tracing and debugging capabilities available with workqueues, using the LGPL-licensed drgn debugger with the workqueue-specific debugging scripts shipped in the kernel's tools/workqueue/ directory. wq_dump.py shows the current workqueue configuration, including which worker pools exist and how they are arranged between cores. wq_monitor.py shows the behavior of workqueues in real time, which can be helpful for diagnosing problems with how work is scheduled.
Workqueues also show up under the sysfs filesystem in /sys/devices/virtual/workqueue, which can be a quick way to get information on a workqueue without breaking out a debugger. Only workqueues configured with the WQ_SYSFS flag appear there, so Chaiken noted that "if a workqueue is giving you heartburn, one of the things you can do is make a tiny kernel patch" to enable the flag.
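Such a patch is usually a one-flag change to a driver's alloc_workqueue() call; a hypothetical before-and-after:

```c
/* Before: the queue is invisible in sysfs. */
wq = alloc_workqueue("mydrv", WQ_UNBOUND, 0);

/* After: it appears under /sys/devices/virtual/workqueue/mydrv, with
 * writable "nice" and "cpumask" attributes (for unbound queues). */
wq = alloc_workqueue("mydrv", WQ_UNBOUND | WQ_SYSFS, 0);
```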
Finally, workqueue workers run in process context instead of interrupt context — meaning that many of the kernel's normal debugging facilities, such as trace points, debug logs, etc., are available when an item from a workqueue is being processed.
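Concretely, a work function can do things that would be bugs in a software-interrupt handler, as in this hypothetical sketch:

```c
#include <linux/delay.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

static void my_work_fn(struct work_struct *work)
{
	/* Sleeping is allowed in process context; in a softirq
	 * handler it would be a bug. */
	msleep(10);

	/* The ordinary logging and tracing facilities work normally. */
	pr_debug("deferred work ran\n");
}
```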
In the Q&A after the talk, one audience member asked what resources they could use to learn more about workqueues. Chaiken responded that "the documentation for workqueues is excellent". "You can learn a lot by just reading the kernel's entry documentation, and using these tools." She also provided a link to her slides, which themselves contain many links to the resources she referenced while putting together the talk.
Another audience member asked whether there were existing tools that could migrate work between pools based on observed latency. Chaiken responded that "a lot of this stuff is so new that people haven't really grokked it yet", but also warned that anyone creating a tool like that would "really need [...] tests which characterize your workload and its performance".
Readers who wish to dive into more of the details can find a recording of Chaiken's talk here. Her talk left me with the impression that workqueues promise to be easier to manage and debug than software interrupts. Despite these benefits, there are downsides to workqueues — such as increased latency — that are hard to mitigate. It will be a long time before software interrupts can be completely eliminated, and switching — when so many different parts of the kernel use software interrupts — will certainly be painful. Kernel developers and system administrators alike will require a good working knowledge of workqueues, but that knowledge is readily available in the form of documentation and new tools.
| Index entries for this article | |
| --- | --- |
| Conference | Southern California Linux Expo/2024 |
Posted Apr 10, 2024 12:38 UTC (Wed)
by abatters (✭ supporter ✭, #6932)
printk() works from any context (https://lwn.net/Articles/800946/), so I'm not sure what this means. Maybe that it is just a bad idea to routinely printk() from interrupt handlers for performance reasons?
Posted Apr 11, 2024 4:28 UTC (Thu)
by alison (subscriber, #63752)
Posted Apr 12, 2024 1:43 UTC (Fri)
by nevets (subscriber, #11875)
That's why I created trace_printk() that writes into the tracing ring buffer and takes less than a microsecond to do so.