
JDK 21 released

Posted Sep 20, 2023 13:32 UTC (Wed) by MarcB (guest, #101804)
In reply to: JDK 21 released by vasvir
Parent article: JDK 21 released

I think there is some disillusionment with async APIs: async applications can quickly become hard to understand or debug.
One thread per request (or connection) seems to be a much simpler model, but it can have too much overhead in a 1:1 implementation.



JDK 21 released

Posted Sep 20, 2023 14:29 UTC (Wed) by paulj (subscriber, #341) [Link] (13 responses)

At best, it's swapping state overhead for compute overheads. I.e., if the OS thread carries more state than is needed, then you take the compute hit of syncing the subset of state required in user-space to/from the running thread as and when your userspace scheduler swaps lwts on the running thread.

There are issues of preemption, blocking, etc.

I wonder what kind of state a user-space lwt scheduler would /not/ need that the OS does, that gives wins here?

Intuitively, a problem-domain-specific work-task:OS-thread multiplexer seems like it could give scalability wins, but I'm less convinced about M:N lwt:OS threads - if that were such an easy win, why did Linux and Solaris go for 1:1 after much deliberation over, and (in the case of Solaris) experience with, M:N models?

JDK 21 released

Posted Sep 20, 2023 16:07 UTC (Wed) by farnz (subscriber, #17727) [Link] (3 responses)

The only bit of state that the kernel scheduler must deal with, but a user scheduler can potentially ignore, is the CPU context (registers etc).

However, back when threads first became a significant part of OSes, kernels had a lot of state associated with each thread. This made M:N models look very attractive, since the user scheduler was being built up with the bare minimum of state, not being cut down from the process scheduler (where the cost of switching MMU setups was high enough to hide the cost of handling excess state).

JDK 21 released

Posted Sep 21, 2023 10:19 UTC (Thu) by paulj (subscriber, #341) [Link] (2 responses)

struct task_struct on Linux is actually huge, with the CPU state in a smaller (but still largish) struct thread_struct embedded at the end. A lot of the stuff in task_struct seems related to debugging systems - in-kernel tracing mechanisms, and also user-space ptrace. Then there is all kinds of book-keeping, plus the references needed for security and resource-allocation/separation stuff.

A good chunk of the kernel sub-system and book-keeping could be saved by LWTs I guess. But... then you lose stuff like... NUMA balancing, separation between threads (e.g. resources via cgroups, security via LSM). However, you could argue: Who on earth would design a process that relied on different traditional threads within the process having different resource or security contexts?

I guess the argument could be made that the common case of traditional threads (i.e. shared memory, shared resources, shared everything bar a few contexts like CPU regs, stack) should be handled specially and made much lighter-weight in the kernel. I.e. a thread should be able to run with just the stuff from thread_struct - it shouldn't need a whole task_struct for each thread?

In the context of this article, Java and its M:N LWTs, it is of course not trying to present a full Linux or POSIX thread interface to user-space, so that's a much "easier" job.

JDK 21 released

Posted Sep 21, 2023 11:08 UTC (Thu) by MarcB (guest, #101804) [Link] (1 responses)

> A good chunk of the kernel sub-system and book-keeping could be saved by LWTs I guess. But... then you lose stuff
> like... NUMA balancing, separation between threads (e.g. resources via cgroups, security via LSM). However, you
> could argue: Who on earth would design a process that relied on different traditional threads within the process
> having different resource or security contexts?

Exactly, the idea here explicitly is not to replace "real" threads. It is intended as an alternative to asynchronous APIs that would all be used from a single OS thread anyway.

The primary goal, according to https://openjdk.org/jeps/444, is to "Enable server applications written in the simple thread-per-request style to scale with near-optimal hardware utilization."
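For concreteness, here is a minimal sketch of the thread-per-request style JEP 444 targets, using the virtual-thread executor that shipped in JDK 21. The method name, task body, and counts are illustrative, not from the JEP:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class PerRequestDemo {
    // Submits n blocking "requests", one virtual thread each, and returns
    // how many completed. Blocking in a task parks the virtual thread,
    // not the carrier OS thread.
    static int runRequests(int n) {
        AtomicInteger done = new AtomicInteger();
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < n; i++) {
                pool.submit(() -> {
                    // Stand-in for blocking request handling (I/O, DB call, ...).
                    try { Thread.sleep(5); } catch (InterruptedException e) { }
                    done.incrementAndGet();
                });
            }
        } // close() waits for all submitted tasks to finish
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runRequests(10_000) + " requests handled");
    }
}
```

The code is written exactly as a 1:1 thread-per-request server would be; only the executor factory differs.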

JDK 21 released

Posted Sep 21, 2023 13:14 UTC (Thu) by paulj (subscriber, #341) [Link]

Yeah, could make sense for Java.

The other interesting question here is to ask whether the kernel should have much lighter threads. Full tasks come with an _immense_ amount of baggage, for separation/security/etc. It'd be nice to have in-kernel light-weight threads that just do the minimum needed for share-everything-bar-CPU-and-stack with POSIX thread APIs.

JDK 21 released

Posted Sep 20, 2023 21:12 UTC (Wed) by dcoutts (guest, #5387) [Link] (1 responses)

> I wonder what kind of state a user-space lwt scheduler would /not/ need that the OS does, that gives wins here?

There are examples out there. Notably, GHC (the primary Haskell compiler) supports light-weight pre-emptive threads which are much cheaper than OS threads, both in memory and context switching time. It has a simple thread scheduler in the RTS. I'm not sure exactly what state it saves vs an OS thread mainly because I'm not sure exactly what state a Linux thread carries, but a GHC thread is just a small struct and a small initial stack (less than 4k).

This means one can get great performance from the simple classic server design of one thread per client. 10k or 100k such threads are perfectly reasonable.
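Java's virtual threads make the same claim, and it is easy to check. This is a sketch (names are mine) that spawns a large number of virtual threads on JDK 21 and waits for them all; the same loop with platform threads would exhaust memory or take far longer:

```java
import java.util.concurrent.CountDownLatch;

public class ManyThreads {
    // Spawns n virtual threads and blocks until every one has run.
    // Each virtual thread starts with a small, growable stack, so
    // n = 100_000 is routine.
    static int runAll(int n) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(n);
        for (int i = 0; i < n; i++) {
            Thread.ofVirtual().start(latch::countDown);
        }
        latch.await(); // returns only when the count reaches zero
        return n - (int) latch.getCount();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(100_000) + " virtual threads ran");
    }
}
```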

The main difficulty with libc M:N schedulers is pre-emption: usually it being expensive or complex or both. There's quite a bit of academic research that looks at solutions involving compiler support for inserting cheaper pre-emption points. This is the approach that GHC takes too.

For example: https://dl.acm.org/doi/abs/10.1145/2400682.2400695

And also advocated by: https://www.usenix.org/legacy/events/hotos03/tech/full_pa...

JDK 21 released

Posted Sep 21, 2023 1:51 UTC (Thu) by wahern (subscriber, #37304) [Link]

> The main difficulty with libc M:N schedulers is pre-emption: usually it being expensive or complex or both.

Relatedly, O(1) event polling APIs like kqueue, epoll, and Solaris Ports[1] didn't exist back when M:N threading was being debated. Those APIs don't solve pre-emption, but without such APIs the ability to spawn hundreds or thousands of threads in a network server didn't make any sense--a scheduler relying on select/poll had little chance of "C10K" scaling.[2][3] Such O(1) event interfaces were a necessary, if crude, facility for user-land scheduling to be meaningfully competitive with 1:1 threads for any real-world workloads. The fact that they composed well (i.e. they're built around a special file descriptor, rather than some global facility or state) also made it easier for user-land developers, including language implementors, to experiment.
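In Java, those O(1) polling APIs surface through java.nio.channels.Selector, which on Linux is epoll-backed by default. A minimal readiness-polling sketch (method name and timeout are mine):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class SelectorSketch {
    // One selector multiplexes readiness across every registered channel;
    // a user-land scheduler can park thousands of tasks behind one such call.
    static int pollOnce(long timeoutMillis) throws IOException {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("localhost", 0));
            server.configureBlocking(false);       // required before register()
            server.register(selector, SelectionKey.OP_ACCEPT);
            // Returns the number of channels that became ready; with no
            // client ever connecting, that is 0 after the timeout expires.
            return selector.select(timeoutMillis);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("ready channels: " + pollOnce(100));
    }
}
```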

NetBSD, like Solaris before it, did end up with scheduler activations to support its M:N threading model, but I assume it was too little, too late. And AFAIU its event reporting was based on signals (similar to SIGIO/SIGPOLL), which effectively meant it was an interface only NetBSD system developers could iterate on and make use of. (I'm also not sure what types of events it reported. Did it even resolve the network connection scaling problem? Was Solaris' /dev/poll interface an offshoot of its scheduler activation framework?)

[1] Solaris 7 had /dev/poll, but M:N threading predated it. /dev/poll wasn't a public API until Solaris 8, yet by Solaris 9 M:N threading was already deprecated.

[2] SIGIO/SIGPOLL was a thing, but POSIX signal semantics are simply too cumbersome. Interrupts vs polling for software interfaces was and remains an entire debate unto itself, ongoing since the 1970s (1960s?).

[3] poll hinting, where the kernel optimistically installed persistent event watchers, also existed, at least experimentally (see https://web.archive.org/web/20020226200329/http://www.hum...), but it was more of a band-aid and marginal fix. I think some BSDs effectively ended up implementing this later as part of refactoring their VFS layer and poll/select around the internal kevent (kqueue) system.

JDK 21 released

Posted Sep 21, 2023 5:35 UTC (Thu) by joib (subscriber, #8541) [Link] (6 responses)

Yes, it seems that a "full" POSIX C threads API using an M:N model bogs down in issues with preemption, blocking syscalls, etc., which implementations then try to solve with scheduler activations and whatnot. And in the end it turned out a simple 1:1 model was the better one.

But it seems M:N threading is more successful in a runtime for higher level languages, where you can convert blocking I/O into epoll/io_uring/etc. behind the scenes, you can do things like split stacks or movable stacks, and maybe you can make do with co-operative multithreading in the user-level scheduler rather than supporting full preemption, etc.

JDK 21 released

Posted Sep 21, 2023 12:42 UTC (Thu) by ms-tg (subscriber, #89231) [Link] (5 responses)

> But it seems M:N threading is more successful in a runtime for higher level languages, where you can convert blocking I/O into epoll/io_uring/etc. behind the scenes, you can do things like split stacks or movable stacks, and maybe you can make do with co-operative multithreading in the user-level scheduler rather than supporting full preemption, etc.

I’m curious how this will go in Java - do they plan to instrument *every* existing IO API with inherently blocking semantics as you describe, so that transparently and with no code changes, the “blocking” IO can occur at runtime as a paused virtual thread which transparently wakes up when the IO is ready?

In the Ruby world there were many evolutions of this model, now there are IO hooks in a core Fiber Scheduler interface: https://docs.ruby-lang.org/en/3.2/Fiber/Scheduler.html

My understanding is that in Rust, Python, Typescript and Javascript, no such unification of blocking code semantics with non-blocking IO via lightweight virtual threads is currently offered - meaning pretty much every IO interface needs to be duplicated in a way that signals async readiness in order to power async/await or straight callback code patterns, is that accurate?

From the Ruby world it felt

JDK 21 released

Posted Sep 23, 2023 6:03 UTC (Sat) by znix (subscriber, #159961) [Link] (3 responses)

> the “blocking” IO can occur at runtime as a paused virtual thread which transparently wakes up when the IO is ready?

Yep, this is how it works. It's supposed to be plug-and-play, with no code changes required.

Java has had a pretty powerful IO API - New I/O or 'NIO' - that's supported both synchronous and asynchronous modes for ages, across both network and file IO.

My understanding is that most of the other IO APIs have been slowly rewritten in Java (eg by JEP 353) to internally use NIO, so that adding Loom support to NIO makes all the other APIs also support Loom.

This works because FFI is relatively rarely used in Java, so the problem of a native function in a library making a blocking syscall is a relatively small one - there's a few database APIs that do this, but AFAIK they're mostly being rewritten in pure Java.
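The plug-and-play claim is easy to demonstrate: the code below uses only the classic blocking java.io socket API, unchanged, and runs it on a virtual thread. This is a sketch under my own naming; the blocking accept() and readLine() park the virtual thread rather than an OS thread:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ArrayBlockingQueue;

public class TransparentBlocking {
    // Accepts one connection on a virtual thread using ordinary blocking
    // streams, and returns the first line received from the client.
    static String receiveOne() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            var result = new ArrayBlockingQueue<String>(1);
            Thread handler = Thread.ofVirtual().start(() -> {
                try (Socket s = server.accept();   // parks the virtual thread
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()))) {
                    result.add(in.readLine());     // also parks, not blocks
                } catch (IOException e) { throw new UncheckedIOException(e); }
            });
            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.getOutputStream().write("hello\n".getBytes());
                client.getOutputStream().flush();
            }
            handler.join();
            return result.poll();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("received: " + receiveOne());
    }
}
```

Nothing in the handler mentions virtual threads or NIO; the rewrite of java.io on top of NIO is what makes the parking happen underneath.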

JDK 21 released

Posted Sep 23, 2023 21:09 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Java has had a pretty powerful IO API - New I/O or 'NIO' - that's supported both synchronous and asynchronous modes for ages, across both network and file IO.

NIO has been present since 2002 or so, and it was almost completely useless. Pretty much nothing in the Java ecosystem supported it. It also turned out to be kinda useless for lightweight threads, so that the NIO core had to be rewritten to support them.

Also, NIO doesn't support async file operations on Linux (I haven't checked Solaris or Windows).

JDK 21 released

Posted Sep 23, 2023 23:28 UTC (Sat) by znix (subscriber, #159961) [Link] (1 responses)

> It also turned out to be kinda useless for lightweight threads, so that the NIO core had to be rewritten to support them.

Oh, interesting! I didn't realise that.

> Also, NIO doesn't support async file operations on Linux (I haven't checked Solaris or Windows).

I thought it did on Solaris, but I may very well be mistaken.

JDK 21 released

Posted Sep 25, 2023 3:38 UTC (Mon) by ssmith32 (subscriber, #72404) [Link]

You may have been thinking of the NIO.2 additions, which, unlike the initial release of NIO, do support async IO, and are quite powerful.

So technically, your original statement is correct: NIO does support a powerful API and async I/O, but it didn't at first. Once Java 7 rolled around, they added a bunch of functionality to NIO which, lumped together, was referred to as NIO.2 at the time.

Nowadays, everyone just calls it NIO.

JDK 21 released

Posted Sep 23, 2023 20:58 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

> I’m curious how this will go in Java - do they plan to instrument *every* existing IO API

Basically. Java doesn't have a lot of IO interfaces, so it's not such a huge task. The way the JDK handles this is similar to Go: each time a potentially blocking operation happens, the virtual thread is "pinned" to the system thread, so that the Java lightweight thread scheduler knows not to wait for it to become available.

For example, here's the code that handles file reading: https://github.com/openjdk/jdk/blob/master/src/java.base/... This basically means that any file IO will still require thread-per-operation.

Network IO is special, so it has hooks into the lightweight scheduler. For example, network blocking reads ultimately end up here: https://github.com/openjdk/jdk/blob/a2391a92cd09630cc3c46... The code transparently yields to the scheduler in case of lightweight threads, or just does a blocking wait if it's started from a real thread.

One bad thing is proliferation of special-casing. For example, JDK developers really wanted to support thread cancellation. But this means that threads might leak network connections, so they are automatically closed if this happens: https://github.com/openjdk/jdk/blob/a2391a92cd09630cc3c46... This already leads to some problems where the code actually wants to handle interrupts.
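The interrupt-closes-the-connection behavior described here is actually NIO's long-standing InterruptibleChannel contract, which the virtual-thread work leans on. A small sketch (names and timings are mine) showing that interrupting a thread blocked in read() closes the whole channel, not just the pending read:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class InterruptCloses {
    // Blocks a thread in SocketChannel.read(), interrupts it, and reports
    // whether the channel is still open afterwards (it will not be).
    static boolean channelSurvivesInterrupt() throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("localhost", 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel peer = server.accept()) {   // peer never writes
                Thread reader = new Thread(() -> {
                    try {
                        client.read(ByteBuffer.allocate(1)); // blocks indefinitely
                    } catch (ClosedByInterruptException expected) {
                        // the interrupt closed the channel out from under the read
                    } catch (IOException ignored) { }
                });
                reader.start();
                Thread.sleep(200);   // let the read block
                reader.interrupt();  // closes the channel, per InterruptibleChannel
                reader.join();
                return client.isOpen();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("channel open after interrupt: " + channelSurvivesInterrupt());
    }
}
```

Code that wants to treat an interrupt as a cancellable wait, rather than connection teardown, has to work around exactly this.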

JDK 21 released

Posted Sep 25, 2023 2:47 UTC (Mon) by ssmith32 (subscriber, #72404) [Link]

Yes. That was explicitly mentioned in the JEP:

https://openjdk.org/jeps/444

"

In the asynchronous style, each stage of a request might execute on a different thread, and every thread runs stages belonging to different requests in an interleaved fashion. This has deep implications for understanding program behavior: Stack traces provide no usable context, debuggers cannot step through request-handling logic, and profilers cannot associate an operation's cost with its caller. Composing lambda expressions is manageable when using Java's stream API to process data in a short pipeline but problematic when all of the request-handling code in an application must be written in this way. This programming style is at odds with the Java Platform because the application's unit of concurrency — the asynchronous pipeline — is no longer the platform's unit of concurrency"


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds