The end of tasklets
One context where deferred execution is often needed is interrupt handlers. An interrupt diverts a CPU from whatever it was doing at the time and must be handled as quickly as possible; sleeping in an interrupt handler is not even remotely an option. So interrupt handlers typically just make a note of what needs to be done, then arrange for the actual work to be done in a more forgiving context. There are several options for this deferral:
- Threaded interrupt handlers. This mechanism, which originated in the realtime tree, was merged into the 2.6.30 release in 2009; it causes the bulk of a driver's interrupt handler to be run in a separate kernel thread. Threaded handlers, since they are running in process context, are allowed to sleep; the system administrator can also adjust their priority if need be.
- Workqueues were first added during the 2.5 development series and have been extensively enhanced since then. A driver can create a work item, containing a function pointer and some data, and submit it to a workqueue. At some future time, that function will be called with the provided data; again, this call will happen in process context. There are various types of workqueues with different performance characteristics, and subsystems can create their own private workqueues if need be.
- Software interrupts (or "bottom halves"). This mechanism is among the oldest in the kernel; it takes its inspiration from earlier Unix systems. A software interrupt is a dedicated handler that runs, usually immediately after completion of a hardware interrupt or before a return to user space, in atomic context. There has been a desire to remove this mechanism for years, since it can create surprising latencies in the kernel, but it persists; adding a new (direct) user of software interrupts would encounter significant opposition. See this article for more information on software interrupts.
- Tasklets. Like workqueues, tasklets are a way to arrange for a function to be called at a future time. In this case, though, the tasklet function is called from a software interrupt, and it runs in atomic context. Tasklets have been around since the 2.3 development series; they, too, have been on the chopping block for many years, but no such effort has succeeded to date.
Threaded interrupt handlers and workqueues are seen as the preferred mechanisms for deferred work in modern kernel code, but the other APIs have proved hard to phase out. Tasklets, in particular, remain because they offer lower latency than workqueues which, since they must go through the CPU scheduler, can take longer to execute a deferred-work item.
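The pattern shared by all of these mechanisms (note the work in the handler, run it later in another context) can be sketched in user space. What follows is a toy model in plain C, not kernel code: `struct work_item`, `wq_submit()`, and `wq_run_pending()` are hypothetical names invented for this illustration, and the model has no locking, no scheduler, and no real interrupt context.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model only: these names are hypothetical, not kernel APIs. */
struct work_item {
    void (*func)(void *data);   /* function to call later */
    void *data;                 /* argument handed to func */
    struct work_item *next;     /* link in the pending list */
};

struct work_queue {
    struct work_item *head;     /* oldest pending item */
    struct work_item **tail;    /* where the next item gets linked */
};

static void wq_init(struct work_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* What the "interrupt handler" does: record the work and return at once. */
static void wq_submit(struct work_queue *q, struct work_item *w,
                      void (*func)(void *), void *data)
{
    assert(func != NULL);
    w->func = func;
    w->data = data;
    w->next = NULL;
    *q->tail = w;
    q->tail = &w->next;
}

/*
 * What the "more forgiving context" does: run everything that is pending,
 * in submission order. Returns the number of items executed.
 */
static int wq_run_pending(struct work_queue *q)
{
    int ran = 0;

    while (q->head != NULL) {
        struct work_item *w = q->head;

        /* Unlink before calling, so the item is not touched afterwards. */
        q->head = w->next;
        if (q->head == NULL)
            q->tail = &q->head;
        w->func(w->data);
        ran++;
    }
    return ran;
}

/* Example deferred function: bump a counter passed in as data. */
static void bump_counter(void *data)
{
    *(int *)data += 1;
}
```

In this model, a caller would submit items with `wq_submit(&q, &item, bump_counter, &counter)` and nothing would happen until some later call to `wq_run_pending(&q)`. The real mechanisms differ mainly in which context plays the role of that later call: a kernel thread, a worker managed by the scheduler, or a software interrupt.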
Mikulas Patocka recently encountered a problem with the tasklet API. A tasklet is defined by struct tasklet_struct, which contains the address of the callback function and related information. The tasklet subsystem needs to be able to manipulate that structure, and may do so after the tasklet function has completed its execution and returned. This can be a problem if the tasklet function itself wants to free that structure, as might happen for a one-shot tasklet that will not be called again. The tasklet subsystem could end up writing to a structure that has been freed and allocated for another use, with predictably unpleasant consequences.
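The hazard comes down to the order of operations in the dispatcher. The sketch below is a user-space model, not the tasklet code itself; `struct deferred`, `run_one_shot()`, and the other names are hypothetical. It shows the ordering a one-shot variant would have to guarantee: the dispatcher finishes all of its bookkeeping on the structure before invoking the callback and never dereferences the pointer again, so the callback is free to release the memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* User-space model only: hypothetical names, not the kernel's tasklet API. */
struct deferred {
    void (*func)(struct deferred *self);  /* callback to run once */
    bool scheduled;                       /* dispatcher bookkeeping */
    int *result;                          /* where the callback reports */
};

/*
 * Safe ordering for a one-shot item: copy what is needed, finish all
 * bookkeeping, then call. After func() returns, d may already be freed.
 * A dispatcher that instead updated d->scheduled after the call would be
 * writing to freed memory whenever the callback frees its own structure.
 */
static void run_one_shot(struct deferred *d)
{
    void (*func)(struct deferred *) = d->func;

    assert(func != NULL);
    d->scheduled = false;   /* last access to *d by the dispatcher */
    func(d);                /* the callback is allowed to free d */
    /* d must be treated as a dangling pointer from here on */
}

/* A one-shot callback that does its work and then frees itself. */
static void one_shot_cb(struct deferred *self)
{
    *self->result = 42;
    free(self);
}

/* Allocate a one-shot item; returns NULL if allocation fails. */
static struct deferred *make_one_shot(int *result)
{
    struct deferred *d = malloc(sizeof(*d));

    if (d == NULL)
        return NULL;
    d->func = one_shot_cb;
    d->scheduled = true;
    d->result = result;
    return d;
}
```

The design choice is simply that ownership of the structure passes to the callback at the moment it is invoked; anything the dispatcher needs afterwards has to be copied out beforehand.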
Patocka sought to fix this problem by adding a new "one-shot" tasklet variant, where the tasklet subsystem would promise to not touch the tasklet_struct structure after the tasklet itself runs. Linus Torvalds, though, did not like that patch; he said that tasklets just should not be used in that way. Workqueues are better designed, he said, and are better for that use case — except for the extra latency they can impose. So, he suggested, the right approach might be a new type of workqueue:
> I think if we introduced a workqueue that worked more like a tasklet - in that it's run in softirq context - but doesn't have the interface mistakes of tasklets, a number of existing workqueue users might decide that that is exactly what they want.
Tejun Heo, the workqueue maintainer, ran with that idea; the result was this patch series adding a new workqueue type, WQ_BH, with the semantics that Torvalds described. A work item submitted to a WQ_BH workqueue will be run quickly, in atomic context, on the same CPU.
Interestingly, these work items are run out of a tasklet — for now. Fearing priority-inversion problems between WQ_BH work items and existing tasklets, Heo chose to leave the tasklet subsystem in control. The patch series converts a number of tasklet users over to the new workqueue type, though, and the plan is clearly to convert the rest over time. That may take a while; there are well over 500 tasklet users in the kernel. Once that conversion is complete, though, it will be possible to run WQ_BH workqueues directly from a software interrupt and remove the tasklet API entirely.
This implementation, of course, still leaves software interrupts in place; removing that subsystem will be a job for another day. Using software interrupts led to a complaint from Sebastian Andrzej Siewior, who would rather see tasklet users moved to threaded interrupt handlers or regular workqueues. But, as Heo answered, that doesn't help the cases where the shortest latency is required. It seems there may always be a place for a deferred-work mechanism that does not require scheduling, as much as the realtime developers would like to avoid it.
Heo has the patch series marked as targeted at the 6.9 kernel release, meaning that it would need to be ready for the merge window in mid-March. That is relatively quick for a significant new feature like this, but it is using a well-established kernel API to edge out a subsystem that developers have wanted to get rid of for years. So there is a reasonable chance that this particular work may not be deferred past the next kernel cycle.
Posted Feb 5, 2024 15:43 UTC (Mon) by kees (subscriber, #27264)
Posted Feb 6, 2024 7:36 UTC (Tue) by epa (subscriber, #39769)
Posted Feb 6, 2024 13:24 UTC (Tue) by syang (subscriber, #141053)
Posted Feb 5, 2024 17:58 UTC (Mon) by willy (subscriber, #9762)
Posted Feb 5, 2024 20:33 UTC (Mon) by rgb (guest, #57129)
Posted Feb 6, 2024 2:33 UTC (Tue) by chexo4 (subscriber, #169500)
Posted Feb 5, 2024 22:04 UTC (Mon) by klossner (subscriber, #30046)
Posted Feb 6, 2024 0:14 UTC (Tue) by bertschingert (subscriber, #160729)
Posted Feb 6, 2024 8:36 UTC (Tue) by intelfx (subscriber, #130118)
How so, though? If you imagine the work that needs to be done in response to an interrupt as a function written in an imperative language, with control flow from top to down, then the "top half" is the part that is executed first — in the actual interrupt handler — and the bottom half is executed second (using some sort of a deferred execution mechanism).
Posted Feb 6, 2024 10:05 UTC (Tue) by mathstuf (subscriber, #69389)
Not saying that this is right, but I've never found mnemonics like these easier to remember than the raw fact itself. For example, the I vs O in the power symbol is best remembered as 1 for "on" and 0 for "off", but I kept mixing that up with the idea that | is an open wire ("off") and O is a closed circuit ("on").
Posted Feb 6, 2024 15:24 UTC (Tue) by intelfx (subscriber, #130118)
Posted Feb 6, 2024 18:03 UTC (Tue) by antiphase (subscriber, #111993)
higher/lower are comparative and don't indicate an absolute position, although you can argue that top and bottom are relative to one another as well.
Ah, English.
Posted Feb 6, 2024 21:41 UTC (Tue) by intelfx (subscriber, #130118)
Posted Feb 11, 2024 6:32 UTC (Sun) by ssmith32 (subscriber, #72404)
Posted Feb 6, 2024 14:12 UTC (Tue) by bertschingert (subscriber, #160729)
Posted Feb 6, 2024 15:24 UTC (Tue) by intelfx (subscriber, #130118)
Posted Feb 6, 2024 18:10 UTC (Tue) by willy (subscriber, #9762)
Posted Feb 6, 2024 21:51 UTC (Tue) by amarao (guest, #87073)
... (Missing xkcd about using intuitive 4d navigation abstractions here)
Posted Feb 6, 2024 3:38 UTC (Tue) by shemminger (subscriber, #5739)
It was before my involvement in Linux, so don't know the exact details but heard that was the motivation to add softirq so that network SYN packets would get processed without expensive context switch to application.
CPU vs network speeds have changed in 25 years, so the tradeoffs made then may be different now.
Posted Feb 6, 2024 4:59 UTC (Tue) by iabervon (subscriber, #722)
Also, it's now possible to have other user space tasks that are more important and urgent than your web server performance, so it's worth being technically able to preempt it if necessary even if you normally wouldn't.
Posted Feb 6, 2024 15:40 UTC (Tue) by Wol (subscriber, #4433)
MS had noticed that the linux stack bottle-necked on a single CPU, even on a multi-CPU machine. So they improved their stack to use all four CPUs on some monster machine, and successfully flooded 4 Gigabit network cards.
They rapidly learnt not to try that sort of stunt. I can't remember how quick the community responded but, as with any perceived bugs in linux, there were fixes for the bottleneck within a day or so, and a proper solution went live within about a week. MS was left trailing in the dust ...
Cheers,
Wol
Posted Feb 6, 2024 18:07 UTC (Tue) by willy (subscriber, #9762)
But this is, was and always has been the game. I was part of it when I worked for Intel on Linux in the mid-2000s. Team A would work on Benchmark B and produce a result that beat Linux. So we'd take a look at what bottlenecks Benchmark B had on Linux, eliminate one, rerun the benchmark. Repeat until we beat Team A. Send patches upstream. Team A would typically come back to us a month or two later with an improved result and we'd repeat until either we or Team A lost interest.
Competition is healthy, and as long as Benchmark B represents a real customer workload (and you're actually eliminating bottlenecks, not putting in special hacks for Benchmark B), this is a win for everybody. The downside of Linux basically making every other kernel irrelevant is that we've lost that impetus.
See also LLVM vs GCC, Firefox vs Chrome, etc, etc.
Posted Feb 7, 2024 10:54 UTC (Wed) by farnz (subscriber, #17727)
To some extent, the old practice of multiple competing forks of Linux provided the competition: you had a choice between the Torvalds, Cox, Kolivas and other kernels, where the Torvalds fork was the blessed version and the Cox, Kolivas and other forks made different compromises. The fact that, for some workloads, Cox or Kolivas was "better" than Torvalds (but not for others) then provided the impetus to work out whether there was a way to do better than all the forks, including Torvalds' fork.
To a limited degree, we've seen this with EEVDF; there were the latency-nice patches floating around (and efforts made to get latency-nice to fit in with the design of the CFS scheduler), which triggered investigation into alternatives to CFS, and then inspired Zijlstra to implement EEVDF in a way that matched or beat CFS while also providing latency-nice.
Posted Feb 7, 2024 11:04 UTC (Wed) by Wol (subscriber, #4433)
And this is why Torvalds is such a good manager (and steward). He's not attached to his version, and he actively encourages these short-lived forks precisely to find the best way to do things. Which he then shamelessly appropriates :-)
Cheers,
Wol
Posted Feb 6, 2024 5:38 UTC (Tue) by alison (subscriber, #63752)
https://github.com/KSPP/linux/issues/94
I hope it gets completed this time! :)
Upside-down halves
Linus inverted the bottle. In classic BSD Unix and its successors, the bottom half is the interrupt handler and the top half is the thread to which work is deferred. I have to look up the terms to avoid confusion. See, e.g., UNIX Programmer's Supplementary Documents Volume 2 (PS2), 4.3 Berkeley Software Distribution.
Like taking an elevator to the top.
