The end of tasklets

By Jonathan Corbet
February 5, 2024

A common problem in kernel development is controlling when a specific task should be done. Kernel code often executes in contexts where some actions (sleeping, for example, or calling into filesystems) are not possible. Other actions, while possible, may prevent the kernel from taking care of a more important task in a timely manner. The kernel community has developed a number of deferred-execution mechanisms designed to ensure that every task is handled at the right time. One of those mechanisms, tasklets, has been eyed for removal for years; that removal might just happen in the near future.

One context where deferred execution is often needed is interrupt handlers. An interrupt diverts a CPU from whatever it was doing at the time and must be handled as quickly as possible; sleeping in an interrupt handler is not even remotely an option. So interrupt handlers typically just make a note of what needs to be done, then arrange for the actual work to be done in a more forgiving context. There are several options for this deferral:

Threaded interrupt handlers. This mechanism, which originated in the realtime tree, was merged into the 2.6.30 release in 2009; it causes the bulk of a driver's interrupt handler to be run in a separate kernel thread. Threaded handlers, since they are running in process context, are allowed to sleep; the system administrator can also adjust their priority if need be.
Workqueues were first added during the 2.5 development series and have been extensively enhanced since then. A driver can create a work item, containing a function pointer and some data, and submit it to a workqueue. At some future time, that function will be called with the provided data; again, this call will happen in process context. There are various types of workqueues with different performance characteristics, and subsystems can create their own private workqueues if need be.
Software interrupts (or "bottom halves"). This mechanism is among the oldest in the kernel; it takes its inspiration from earlier Unix systems. A software interrupt is a dedicated handler that runs, usually immediately after completion of a hardware interrupt or before a return to user space, in atomic context. There has been a desire to remove this mechanism for years, since it can create surprising latencies in the kernel, but it persists; adding a new (direct) user of software interrupts would encounter significant opposition. See this article for more information on software interrupts.
Tasklets. Like workqueues, tasklets are a way to arrange for a function to be called at a future time. In this case, though, the tasklet function is called from a software interrupt, and it runs in atomic context. Tasklets have been around since the 2.3 development series; they, too, have been on the chopping block for many years, but no such effort has succeeded to date.
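
The common thread in all four mechanisms is that the interrupt handler only records that work is needed, and a callback runs later with the data it was given. As a rough illustration, here is a minimal userspace model of that pattern. It is a sketch, not kernel code: struct work_item, struct deferred_queue, queue_init(), queue_submit(), queue_run(), and count_event() are all hypothetical names invented for this example, and the model ignores locking, per-CPU queues, and execution context entirely.

```c
/* Userspace model only: every name here is hypothetical, not a kernel API. */
#include <assert.h>
#include <stddef.h>

/* A work item pairs a callback with the data it will receive. */
struct work_item {
    void (*func)(void *data);
    void *data;
    struct work_item *next;
};

/* A FIFO of pending items, standing in for a workqueue or tasklet list. */
struct deferred_queue {
    struct work_item *head;
    struct work_item **tail;   /* points at the link to fill in next */
};

void queue_init(struct deferred_queue *q)
{
    q->head = NULL;
    q->tail = &q->head;
}

/* What the interrupt handler does: cheaply note that work is needed. */
void queue_submit(struct deferred_queue *q, struct work_item *w)
{
    w->next = NULL;
    *q->tail = w;
    q->tail = &w->next;
}

/* What happens later, in a more forgiving context. Returns items run. */
int queue_run(struct deferred_queue *q)
{
    int ran = 0;
    struct work_item *w = q->head;

    queue_init(q);             /* detach the list so callbacks may resubmit */
    while (w) {
        struct work_item *next = w->next;  /* read before the callback runs */

        w->func(w->data);
        w = next;
        ran++;
    }
    return ran;
}

/* Example callback: counts the events it was asked to process. */
void count_event(void *data)
{
    (*(int *)data)++;
}
```

In this model, submitting two items that point count_event() at the same counter leaves the counter untouched until queue_run() is called; what distinguishes the real mechanisms is when, and in which context, that later call happens.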

Threaded interrupt handlers and workqueues are seen as the preferred mechanisms for deferred work in modern kernel code, but the other APIs have proved hard to phase out. Tasklets, in particular, remain because they offer lower latency than workqueues which, since they must go through the CPU scheduler, can take longer to execute a deferred-work item.

Mikulas Patocka recently encountered a problem with the tasklet API. A tasklet is defined by struct tasklet_struct, which contains the address of the callback function and related information. The tasklet subsystem needs to be able to manipulate that structure, and may do so after the tasklet function has completed its execution and returned. This can be a problem if the tasklet function itself wants to free that structure, as might happen for a one-shot tasklet that will not be called again. The tasklet subsystem could end up writing to a structure that has been freed and allocated for another use, with predictably unpleasant consequences.
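
The ordering problem can be made concrete with a small userspace model. This is a sketch under stated assumptions rather than the kernel's tasklet implementation: struct model_tasklet and all of the functions below are hypothetical, and the model only marks the structure as released instead of actually freeing it, so that the late write can be counted safely.

```c
/* Userspace model only: hypothetical names, not the kernel's tasklet code. */
#include <assert.h>
#include <stdbool.h>

struct model_tasklet {
    bool busy;                             /* bookkeeping owned by the subsystem */
    void (*func)(struct model_tasklet *t); /* the deferred callback */
    bool released;                         /* set when the callback "frees" t */
    int touched_after_release;             /* hazard counter for the model */
};

/* Mark the tasklet as pending, as a scheduling call would. */
void model_schedule(struct model_tasklet *t)
{
    t->busy = true;
}

/* Subsystem bookkeeping write. With real memory, doing this after the
 * structure was freed would be a use-after-free; the model just counts it. */
void subsystem_clear_busy(struct model_tasklet *t)
{
    if (t->released)
        t->touched_after_release++;
    t->busy = false;
}

/* Problematic ordering: the structure is updated after the callback returns. */
void run_regular(struct model_tasklet *t)
{
    t->func(t);
    subsystem_clear_busy(t);               /* too late if func released t */
}

/* One-shot ordering: finish all bookkeeping first, then never touch t again. */
void run_one_shot(struct model_tasklet *t)
{
    void (*func)(struct model_tasklet *) = t->func;

    subsystem_clear_busy(t);
    func(t);                               /* last use of t by the subsystem */
}

/* A one-shot callback that gives up ownership of its own structure. */
void release_self(struct model_tasklet *t)
{
    t->released = true;
}
```

Running release_self() through run_regular() records one write after release, while run_one_shot() records none; that difference is the promise a one-shot variant would have to make.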

Patocka sought to fix this problem by adding a new "one-shot" tasklet variant, where the tasklet subsystem would promise to not touch the tasklet_struct structure after the tasklet itself runs. Linus Torvalds, though, did not like that patch; he said that tasklets just should not be used in that way. Workqueues are better designed, he said, and are better for that use case — except for the extra latency they can impose. So, he suggested, the right approach might be a new type of workqueue:

I think if we introduced a workqueue that worked more like a tasklet - in that it's run in softirq context - but doesn't have the interface mistakes of tasklets, a number of existing workqueue users might decide that that is exactly what they want.

Tejun Heo, the workqueue maintainer, ran with that idea; the result was this patch series adding a new workqueue type, WQ_BH, with the semantics that Torvalds described. A work item submitted to a WQ_BH workqueue will be run quickly, in atomic context, on the same CPU.

Interestingly, these work items are run out of a tasklet — for now. Fearing priority-inversion problems between WQ_BH work items and existing tasklets, Heo chose to leave the tasklet subsystem in control. The patch series converts a number of tasklet users over to the new workqueue type, though, and the plan is clearly to convert the rest over time. That may take a while; there are well over 500 tasklet users in the kernel. Once that conversion is complete, though, it will be possible to run WQ_BH workqueues directly from a software interrupt and remove the tasklet API entirely.

This implementation, of course, still leaves software interrupts in place; removing that subsystem will be a job for another day. Using software interrupts led to a complaint from Sebastian Andrzej Siewior, who would rather see tasklet users moved to threaded interrupt handlers or regular workqueues. But, as Heo answered, that doesn't help the cases where the shortest latency is required. It seems there may always be a place for a deferred-work mechanism that does not require scheduling, as much as the realtime developers would like to avoid it.

Heo has the patch series marked as targeted at the 6.9 kernel release, meaning that it would need to be ready for the merge window in mid-March. That is relatively quick for a significant new feature like this, but it is using a well-established kernel API to edge out a subsystem that developers have wanted to get rid of for years. So there is a reasonable chance that this particular work may not be deferred past the next kernel cycle.


The end of tasklets

Posted Feb 5, 2024 15:43 UTC (Mon) by kees (subscriber, #27264) [Link] (3 responses)

We've been trying to get rid of tasklets for more than 5 years...
https://github.com/KSPP/linux/issues/94
I hope it gets completed this time! :)

The end of tasklets

Posted Feb 5, 2024 17:56 UTC (Mon) by kees (subscriber, #27264) [Link]

Actually way earlier than that...
https://lwn.net/Articles/239633/

The end of tasklets

Posted Feb 6, 2024 7:36 UTC (Tue) by epa (subscriber, #39769) [Link] (1 responses)

So you mean the work keeps getting deferred?

The end of tasklets

Posted Feb 6, 2024 13:24 UTC (Tue) by syang (subscriber, #141053) [Link]

I guess we're, uh... deferring their execution from the kernel.

The end of tasklets

Posted Feb 5, 2024 17:58 UTC (Mon) by willy (subscriber, #9762) [Link] (2 responses)

You missed timers! IIRC one of the proposals for killing tasklets was to make them a timer that expired immediately.

The end of tasklets

Posted Feb 5, 2024 20:33 UTC (Mon) by rgb (guest, #57129) [Link]

Why did that idea not get anywhere?

The end of tasklets

Posted Feb 6, 2024 2:33 UTC (Tue) by chexo4 (subscriber, #169500) [Link]

Wait, so how exactly would that have worked?

Upside-down halves

Posted Feb 5, 2024 22:04 UTC (Mon) by klossner (subscriber, #30046) [Link] (11 responses)

Linus inverted the bottle. In classic BSD Unix and its successors, the bottom half is the interrupt handler and the top half is the thread to which work is deferred. I have to look up the terms to avoid confusion. See e.g. UNIX Programmer's Supplementary Documents Volume 2 (PS2) 4.3 Berkeley Software Distribution

Upside-down halves

Posted Feb 6, 2024 0:14 UTC (Tue) by bertschingert (subscriber, #160729) [Link] (8 responses)

I always thought the Linux usage of top/bottom half was confusing and counterintuitive. It's nice to know I'm maybe not the only one who thought the other way around makes more sense.

Upside-down halves

Posted Feb 6, 2024 8:36 UTC (Tue) by intelfx (subscriber, #130118) [Link] (7 responses)

> I always thought the Linux usage of top/bottom half was confusing and counterintuitive

How so, though? If you imagine the work that needs to be done in response to an interrupt as a function written in an imperative language, with control flow from top to bottom, then the "top half" is the part that is executed first (in the actual interrupt handler) and the "bottom half" is executed second (using some sort of deferred-execution mechanism).

Upside-down halves

Posted Feb 6, 2024 10:05 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (4 responses)

Another way to think about it is that interrupts are "closer to the hardware" which is "lower level". Once you come back "up" from there, you can do that deferred work.

Not saying that this is right, but I've never found mnemonics like these easier to remember than the raw fact itself. For example, the I vs O in the power symbol is actually better remembered as 1 for "on" and 0 for "off" but I kept getting it mixed up with | is an open wire ("off") and a O is a closed circuit ("on").

Upside-down halves

Posted Feb 6, 2024 15:24 UTC (Tue) by intelfx (subscriber, #130118) [Link] (3 responses)

Hm, that's curious. I wonder if this is due to me not being a native speaker — for me, top/bottom and higher/lower (level) are completely disparate pairs of words, with little to no semantic overlap.

Upside-down halves

Posted Feb 6, 2024 18:03 UTC (Tue) by antiphase (subscriber, #111993) [Link] (1 responses)

top is synonymous with highest or uppermost, as in they are superlative-like; similarly (but reversed) for bottom/lowest/lowermost.

higher/lower are comparative and don't indicate an absolute position, although you can argue that top and bottom are relative to one another as well.

Ah, English.

Upside-down halves

Posted Feb 6, 2024 21:41 UTC (Tue) by intelfx (subscriber, #130118) [Link]

That seems correct, but I was rather making a different point. I learned these words in different contexts, and so it simply doesn't "click" in my brain that top/bottom is "almost like higher/lower". These words are "in different parts of my brain", that's why I never experienced the confusion described in the sibling thread.

Upside-down halves

Posted Feb 11, 2024 6:32 UTC (Sun) by ssmith32 (subscriber, #72404) [Link]

The top is the highest level?
Like taking an elevator to the top.

Upside-down halves

Posted Feb 6, 2024 14:12 UTC (Tue) by bertschingert (subscriber, #160729) [Link] (1 responses)

That actually does make sense, and I hadn't thought about it that way before. But my brain landed on and stuck with the "closer to the hardware = lower level" approach. After all, in OS diagrams, the hardware is usually at the bottom of the diagram, not the top.

Upside-down halves

Posted Feb 6, 2024 15:24 UTC (Tue) by intelfx (subscriber, #130118) [Link]

In fact, I believe this is actually the original intention behind this naming. (Also, see my reply to @mathstuf regarding higher/lower level.)

Upside-down halves

Posted Feb 6, 2024 18:10 UTC (Tue) by willy (subscriber, #9762) [Link]

Hm, I thought in Linux the "top half" was the part that submitted work, and the "bottom half" was the deferred interrupt processing. I've been around Linux over 25 years, and I've never seen anyone call the interrupt handler the "top half". I've seen plenty of code call it the "DPC" (Windows terminology, I believe)

Upside-down halves

Posted Feb 6, 2024 21:51 UTC (Tue) by amarao (guest, #87073) [Link]

It's very easy to understand where up and down are. Up is the same as the left-most, or the same as the front. Everyone can look at a byte and say where it begins and where it ends.

... (Missing xkcd about using intuitive 4d navigation abstractions here)

The end of tasklets

Posted Feb 6, 2024 3:38 UTC (Tue) by shemminger (subscriber, #5739) [Link] (5 responses)

The rationale for tasklets (and softirq) goes all the way back to the networking overhaul that took place in 2.4 development. Back in 1999, there was a Microsoft-sponsored web benchmark from Mindcraft that showed that Windows NT and IIS were faster than Linux and Apache. This was fixed by DavidM by tuning and fixing the networking stack.

It was before my involvement in Linux, so I don't know the exact details, but I heard that this was the motivation for adding softirq: network SYN packets could then be processed without an expensive context switch to the application.

CPU vs network speeds have changed in 25 years, so the tradeoffs made then may be different now.

The end of tasklets

Posted Feb 6, 2024 4:59 UTC (Tue) by iabervon (subscriber, #722) [Link]

It looks like they're not changing how it runs, just how you write the code and how it's tracked in memory. If anything, context switches have gotten more expensive relative to everything else since 1999, but that doesn't mean that the method for avoiding them can't benefit from more recent API design lessons.

Also, it's now possible to have other user space tasks that are more important and urgent than your web server performance, so it's worth being technically able to preempt it if necessary even if you normally wouldn't.

The end of tasklets

Posted Feb 6, 2024 15:40 UTC (Tue) by Wol (subscriber, #4433) [Link] (3 responses)

> Back in 1999, there was a Microsoft sponsored web benchmark from Mindcraft that showed that Windows NT and IIS were faster than Linux and Apache. This was fixed by DavidM by tuning and fixing the networking stack.

MS had noticed that the linux stack bottle-necked on a single CPU, even on a multi-CPU machine. So they improved their stack to use all four CPUs on some monster machine, and successfully flooded 4 Gigabit network cards.

They rapidly learnt not to try that sort of stunt. I can't remember how quickly the community responded but, as with any perceived bugs in linux, there were fixes for the bottleneck within a day or so, and a proper solution went live within about a week. MS was left trailing in the dust ...

Cheers,
Wol

The end of tasklets

Posted Feb 6, 2024 18:07 UTC (Tue) by willy (subscriber, #9762) [Link] (2 responses)

Oh, muffin. It was 4x100Mbit. If you used a more natural 1Gbit card, Linux came out on top.

But this is, was and always has been the game. I was part of it when I worked for Intel on Linux in the mid-2000s. Team A would work on Benchmark B and produce a result that beat Linux. So we'd take a look at what bottlenecks Benchmark B had on Linux, eliminate one, rerun the benchmark. Repeat until we beat Team A. Send patches upstream. Team A would typically come back to us a month or two later with an improved result and we'd repeat until either we or Team A lost interest.

Competition is healthy, and as long as Benchmark B represents a real customer workload (and you're actually eliminating bottlenecks, not putting in special hacks for Benchmark B), this is a win for everybody. The downside of Linux basically making every other kernel irrelevant is that we've lost that impetus.

See also LLVM vs GCC, Firefox vs Chrome, etc, etc.

The end of tasklets

Posted Feb 7, 2024 10:54 UTC (Wed) by farnz (subscriber, #17727) [Link] (1 responses)

To some extent, the old practice of maintaining multiple competing forks of Linux provided the competition: you had a choice between the Torvalds, Cox, Kolivas and other kernels, where the Torvalds fork was the blessed version and the other forks made different compromises. And then, the fact that for some workloads Cox or Kolivas was "better" than Torvalds (but not for others) provided the impetus to work out whether there was a way to do better than all the forks, including Torvalds's.

To a limited degree, we've seen this with EEVDF; there were the latency-nice patches floating around (and efforts made to get latency-nice to fit in with the design of the CFS scheduler), which triggered investigation into alternatives to CFS, and then inspired Zijlstra to implement EEVDF in a way that matched or beat CFS while also providing latency-nice.

The end of tasklets

Posted Feb 7, 2024 11:04 UTC (Wed) by Wol (subscriber, #4433) [Link]

> which triggered investigation into alternatives to CFS, and then inspired Zijlstra to implement EEVDF in a way that matched or beat CFS while also providing latency-nice.

And this is why Torvalds is such a good manager (and steward). He's not attached to his version, and he actively encourages these short-lived forks precisely to find the best way to do things. Which he then shamelessly appropriates :-)

Cheers,
Wol

The end of tasklets

Posted Feb 6, 2024 5:38 UTC (Tue) by alison (subscriber, #63752) [Link]

The threaded sysfs attributes take NET_RX handling mostly away from softirqs. The rcu_nocbs kernel cmdline parameter takes RCU handling out of softirqs. In the last year, as previously noted by LWN, Frederic Weisbecker tried and failed (again) to pull out timer softirqs. If the tasklet (and presumably also HI) softirqs could come out, that would leave ksoftirqd truly diminished. Getting rid of softirqs altogether is still a dream, but not as implausible as it once seemed.