LWN: Comments on "Microsoft Research: A fork() in the road" https://lwn.net/Articles/785430/ This is a special feed containing comments posted to the individual LWN article titled "Microsoft Research: A fork() in the road". Microsoft research: A fork() in the road https://lwn.net/Articles/858571/ https://lwn.net/Articles/858571/ immibis <div class="FormattedComment"> Note that the researchers are not talking about CreateProcess() specifically, but CreateProcess-style APIs in general, compared to fork-style APIs in general.<br> </div> Mon, 07 Jun 2021 16:45:52 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/857770/ https://lwn.net/Articles/857770/ Cyberax <div class="FormattedComment"> Java doesn&#x27;t really handle isolation well. Threads can leak, the heap is shared, etc.<br> </div> Tue, 01 Jun 2021 01:42:01 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/857758/ https://lwn.net/Articles/857758/ immibis <div class="FormattedComment"> I recall already seeing this approach. But it has so many moving parts compared to just telling the kernel to do what you want.<br> <p> What if opening /proc/self/fd fails because too many FDs are open? Okay, then you just close FD 0. But you actually need that one. So close FD 3 instead. You&#x27;re closing all the FDs, right - so it doesn&#x27;t matter if you close one prematurely?<br> <p> What if FD 3 is on your do-not-close list? Okay, just pick the lowest number that isn&#x27;t.<br> <p> What if there are too many FDs and they&#x27;re all really high numbers? Scan the whole 32-bit or 64-bit FD space until you manage to close one, then open /proc/self/fd? (they can be higher than your RLIMIT_NOFILE, if RLIMIT_NOFILE was set to a larger number in the past)<br> <p> What if your RLIMIT_NOFILE is zero? Then you can&#x27;t open /proc/self/fd. But there is nothing to close. But will you detect that and succeed instead of failing?<br> <p> Actually, there could be open FDs from before RLIMIT_NOFILE was set to zero. Will you temporarily increase it, so you can open /proc/self/fd?<br> <p> What if /proc isn&#x27;t mounted? 
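For concreteness, a minimal sketch of the two approaches under discussion: the /proc/self/fd walk, which inherits every failure mode listed above, and the single "just tell the kernel" call that Linux later grew in the form of close_range(2) (kernel 5.9 and later, with a glibc wrapper since 2.34). This is an illustration rather than hardened code; most error handling is omitted.
<pre>
/* Close every fd >= lowfd.  Two variants: the /proc walk and the direct
 * kernel call.  Assumes Linux >= 5.9 and glibc >= 2.34 for close_range(). */
#define _GNU_SOURCE
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* The workaround: needs a mounted /proc, a spare slot in the fd table
 * for the directory stream, and a nonzero RLIMIT_NOFILE. */
static int close_from_procfs(int lowfd)
{
    DIR *d = opendir("/proc/self/fd");
    if (d == NULL)
        return -1;                    /* /proc missing, fd table full, ... */
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')
            continue;                 /* skip "." and ".." */
        int fd = atoi(e->d_name);
        if (fd >= lowfd && fd != dirfd(d))
            close(fd);
    }
    closedir(d);
    return 0;
}

/* The "just tell the kernel" version: one call, no /proc, no spare fd. */
static int close_all_from(int lowfd)
{
    return close_range(lowfd, ~0U, 0);
}
</pre>
glibc 2.34 also exposes the same kernel operation as closefrom(3).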
This is actually very likely to come up, IF your code is ever used in a program that creates containers, or perhaps even just from a rescue shell.<br> <p> Wouldn&#x27;t it be great if you could *just tell the kernel to do the thing you want it to do*?<br> </div> Mon, 31 May 2021 17:44:34 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/857757/ https://lwn.net/Articles/857757/ immibis <div class="FormattedComment"> As wahern has already stated:<br> <p> <font class="QuotedText">&gt; The paper recommends a cross-process operation primitive, not something like CreateProcess or pthread_spawn, which will always fall far short of the ability to execute arbitrary code.</font><br> <p> It recommends that if you want to redirect a file descriptor, for example, you should just be able to &quot;remote-control&quot; the child process to issue that call, before you unsuspend it.<br> </div> Mon, 31 May 2021 17:38:06 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/857756/ https://lwn.net/Articles/857756/ immibis <div class="FormattedComment"> Java has perfectly functional language-level isolation primitives, and although not everything in the standard library is well-behaved, most things are - no different from the C library, really.<br> <p> There is generally no good reason you should split your Java app into multiple processes just because the OS demands it. Half the point of Java is to shield you from such things, is it not? If you want to split up your app into multiple cooperating modules - as you should - you can do that within the one process.<br> </div> Mon, 31 May 2021 17:35:30 +0000 Microsoft Research: A fork() in the road https://lwn.net/Articles/786717/ https://lwn.net/Articles/786717/ nix <div class="FormattedComment"> Even overcommit-shy swap-space-happy Solaris has overcommit for the main stack of a process. (I'll admit to not entirely understanding why overcommit would be desirable for the main stack but not thread stacks...)<br> </div> Thu, 25 Apr 2019 11:03:14 +0000 Microsoft Research: A fork() in the road https://lwn.net/Articles/786491/ https://lwn.net/Articles/786491/ tao <div class="FormattedComment"> AIX has SIGDANGER though.<br> </div> Mon, 22 Apr 2019 20:48:14 +0000 Microsoft Research: A fork() in the road https://lwn.net/Articles/786170/ https://lwn.net/Articles/786170/ BenHutchings <div class="FormattedComment"> At least AIX and FreeBSD also have overcommit.<br> </div> Wed, 17 Apr 2019 16:41:34 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/786072/ https://lwn.net/Articles/786072/ gfernandes <div class="FormattedComment"> I do actually, work on very large, in memory cache, Java applications. And guess what?<br> <p> We're now _breaking it ALL up_ into microservices, throwing out all the large in memory caches, even moving databases to Mongo or PGSQL.<br> <p> *ecree* is right.<br> <p> Gigantic monoliths are no excuse for poor software design.<br> </div> Tue, 16 Apr 2019 07:08:34 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785981/ https://lwn.net/Articles/785981/ farnz <p>Not really - the paper says that in practical terms, <tt>fork</tt> isn't a good API, and while <tt>posix_spawn</tt> looks better in theory, it practically becomes a mess to use. 
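To make that trade-off concrete, here is a small sketch (not taken from the paper) of the same job done both ways in C: run a command with its standard output redirected to a file. The file name is arbitrary and error handling is omitted.
<pre>
/* Both variants run the given argv with stdout redirected to out.txt. */
#include <fcntl.h>
#include <spawn.h>
#include <unistd.h>

extern char **environ;

/* fork-and-clean-up: arbitrary code runs in the child before exec. */
static pid_t run_with_fork(char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        dup2(fd, STDOUT_FILENO);
        close(fd);
        execvp(argv[0], argv);
        _exit(127);                               /* exec failed */
    }
    return pid;
}

/* posix_spawn: the same setup expressed as a list of "file actions". */
static pid_t run_with_spawn(char *const argv[])
{
    pid_t pid;
    posix_spawn_file_actions_t fa;
    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_addopen(&fa, STDOUT_FILENO, "out.txt",
                                     O_WRONLY | O_CREAT | O_TRUNC, 0644);
    posix_spawnp(&pid, argv[0], &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    return pid;
}
</pre>
The file-actions vocabulary covers open, close and dup2 (plus chdir in newer revisions); anything outside it needs another spawn attribute or flag, which is the "101 flags" problem described in the rest of this comment.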
<p>The paper is more of an academic opinion piece; it sets out why <tt>fork</tt> causes issues, why <tt>posix_spawn</tt> and friends aren't enough better to be worth the effort of a wholesale rewrite of software, and asserts that it should be possible to produce a better API given that, in theory, <tt>spawn</tt>-type APIs are easier for OS developers to implement. <p>Within the bounds of academia, this sort of paper serves to legitimise research into better APIs; someone has asserted with examples that existing APIs are imperfect, and now future researchers interested in process creation APIs have something they can use as a reference when they justify spending time on the "solved" problem of <tt>spawn</tt> versus <tt>fork</tt> APIs. Maybe the answer will turn out to be that <tt>posix_spawn</tt> and <tt>fork</tt> are both local maxima, and the only way to do better is a radical rethink of process design; maybe some bright spark will demonstrate that there is a better API we can use if we step away from the existing ones. <p>The key point is that we don't have good data on better alternatives to the current "spawn with 101 flags to inherit the right bits of the world" and "fork then clean up" APIs; the paper says we need to work out what the "something other" should look like, because "fork and clean up" is easy for the user, but sets various design choices for the kernel (and requires certain hardware support to be performant - we get CoW very cheaply with modern MMUs, but at the expense of requiring MMUs for an OS kernel, not just MPUs), while "spawn" is easy for the kernel, but leads to huge complexity for the user as they have to handle 101 flags to get the "right" environment in the spawned process. Mon, 15 Apr 2019 08:41:45 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785971/ https://lwn.net/Articles/785971/ joncb <div class="FormattedComment"> <font class="QuotedText">&gt; There's no "something other" on Linux.</font><br> <p> Don't you think this is putting the cart before the horse just a little bit then? Surely creating a "something other" should take precedence over advocating for developers to stop using the one tool they have for this basic task?<br> <p> <p> </div> Mon, 15 Apr 2019 06:02:28 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785965/ https://lwn.net/Articles/785965/ neilbrown <div class="FormattedComment"> <font class="QuotedText">&gt; There's no "something other" on Linux.</font><br> <p> Couldn't you open a socket and send a dbus message to systemd to ask it to run some service for you ??<br> Of course, if you don't like systemd, just write a dedicated server which does whatever you want done.<br> <p> </div> Sun, 14 Apr 2019 23:49:18 +0000 O_CLOFORK https://lwn.net/Articles/785962/ https://lwn.net/Articles/785962/ magfr <div class="FormattedComment"> By the way, O_CLOFORK shows up once in a while (2011 and then again 2017) and apparently it exists on AIX, *BSD, Solaris and MacOS, but I see no actual rejections of it; it just peters out. 
Is there any interest in it?<br> </div> Sun, 14 Apr 2019 22:40:34 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785944/ https://lwn.net/Articles/785944/ zlynx <div class="FormattedComment"> Large memory processes like Java should use "vfork()" or "clone()" instead of fork.<br> <p> Even with overcommit turned on, trying to fork a 10 GB Java process can fail because it exceeds the heuristic.<br> <p> With overcommit disabled, which is how I run my Linux servers, it will definitely fail.<br> <p> Luckily we have vfork which was designed for exactly this problem. It doesn't duplicate the process memory, not even CoW. With a bit of care to not overwrite important memory in the parent process, it works very well to launch new child processes.<br> <p> So "vfork()" is "something other" because it is like fork, but isn't actually fork.<br> </div> Sun, 14 Apr 2019 00:45:51 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785943/ https://lwn.net/Articles/785943/ Cyberax <div class="FormattedComment"> There's no "something other" on Linux.<br> </div> Sat, 13 Apr 2019 23:54:48 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785941/ https://lwn.net/Articles/785941/ joncb <div class="FormattedComment"> <font class="QuotedText">&gt; They both use fork (more precisely, clone) on Linux. There's no way to avoid it, and this is one of the problems.</font><br> <p> I assume you really don't mean "No way to avoid it" here because if there's literally "no way" then this whole exercise is just shouting into the void.<br> <p> In particular, i'm thinking you (and i specify you because yours is the use case here) write a patch for openJDK that re-implements ProcessBuilder to use something other than fork when calling start(). From your comments on this story that should be very doable. You submit that patch to openJDK and make your case. Regardless of whether it is accepted or not, you can now run openJDK secure in the knowledge that your application is using this faster/safer/cleaner/whatever alternative.<br> <p> In my travails doing an informal survey of how languages fork i came across an interesting python issue about moving to posix_spawn. It looks like it's stalled for technical compatibility reasons ( <a href="https://bugs.python.org/issue35823">https://bugs.python.org/issue35823</a> ). The part stating that libc "may be more than a decade behind in enterprise Linux distros" shows where bigger problems lie.<br> </div> Sat, 13 Apr 2019 23:52:57 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785913/ https://lwn.net/Articles/785913/ farnz <p>I disagree; I think that <tt>fork</tt> isn't a particularly tasteful interface (although I understand why it was implemented that way for the PDP-7, which had no virtual memory); it conflates three primitive operations in a multiprocess virtual memory OS: <ol> <li>Create a new schedulable task. <li>Create a new virtual address space. <li>Clone one virtual address space into another, making copies whenever necessary to ensure that neither side of the clone is surprised by unexpected sharing of memory. </ol> <p>Now, I don't see the problem with conflating the first two options (a "spawn" operation, if you will); they are both simple operations, and you simplify the OS if each schedulable task has a unique virtual address space (we can call this combination a "process" 😀). 
The last, however, is a complex operation on any OS that lets you have shared memory (rather than doing what early UNIX did, and swapping entire processes to disk in order to switch to another process), and shouldn't be conflated with the first two. <p>On the other hand, <tt>exec</tt> is a deeply tasteful interface; it says that process environment setup has a lot of details, and there will be more details in future, so don't try to enumerate them all via a <tt>CreateProcess</tt>-like interface; instead, just let the user run arbitrary code in their new process to set up the world, and then replace the running code with the code that wants that environment. <p>With full hindsight on 50-odd years of hardware and software evolution, including the creation of dynamic linking, I'd prefer to see a <tt>spawn</tt>+<tt>exec</tt> pair. <tt>spawn</tt> takes an image to spawn, plus an "overlay" section that gets copied (CoW) to the program interpreter (in the sense that <tt>ld.so</tt> is an interpreter, not in the sense that Python is an interpreter) to be used in the dynamic linking phase. For statically linked programs, the program interpreter is part of the main program image, and thus gets the overlay section as input. This lets you send state down to the new process, and then have it used to set up the new process environment; the new process might turn out to be a simple helper that just sets up the environment and calls <tt>exec</tt>, of course (and, indeed, as you get a new section, you can have both helper and original process live in the same image, using the contents of the <tt>spawn</tt> section to distinguish "executed fresh" from "spawned ready to exec a new process"). Sat, 13 Apr 2019 12:08:30 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785909/ https://lwn.net/Articles/785909/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt; The whole point of Java is to detach yourself from these low level concerns.</font><br> In this particular case I was running a GPU-based optimizer in a separate process. It was kinda crashy (drivers...), so isolating it was a good idea. Heck, it even used pipe-based interaction. How much more Unixy can you get?<br> <p> <font class="QuotedText">&gt; Indeed, a very quick search suggests that to create a helper process you should either use Runtime.Exec or ProcessBuilder (haven't really touched Java in a good decade so that is probably misleading in the nuances). While i wouldn't be surprised if one of the implementations involves a fork under the covers there's no reason it couldn't be anything else that guarantees the expected semantics.</font><br> They both use fork (more precisely, clone) on Linux. There's no way to avoid it, and this is one of the problems.<br> </div> Sat, 13 Apr 2019 09:23:06 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785905/ https://lwn.net/Articles/785905/ joncb <div class="FormattedComment"> The whole point of Java is to detach yourself from these low level concerns.<br> <p> Indeed, a very quick search suggests that to create a helper process you should either use Runtime.Exec or ProcessBuilder (haven't really touched Java in a good decade so that is probably misleading in the nuances). 
While i wouldn't be surprised if one of the implementations involves a fork under the covers there's no reason it couldn't be anything else that guarantees the expected semantics.<br> <p> The difference, of course, between C/C++ and Java/C# is that the former are languages that are expected to execute (more or less) directly on top of the current system whereas the latter are expected to present a virtual facade across such. Therefore i would expect C to have access to fork() where it is available whereas i would not expect Java or C# to do so. Golang is a weird blending of the two where some things are more C like and somethings are not, low level fork access apparently being one of the nots. Rust appears to have fork but has some hefty safety warnings on it.<br> </div> Sat, 13 Apr 2019 07:26:13 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785897/ https://lwn.net/Articles/785897/ foom <div class="FormattedComment"> Umm? You don't need to use a new flag unless you're using a new syscall.<br> <p> These flags are all for different APIs that can open a new file descriptor. If you're using fanotify_init, you use FAN_CLOEXEC with it. If you're using open, you use O_CLOEXEC, etc. <br> </div> Sat, 13 Apr 2019 02:09:29 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785883/ https://lwn.net/Articles/785883/ warrax <div class="FormattedComment"> (I suspect no one will read this, but...)<br> <p> <font class="QuotedText">&gt; It's actually all the other calls that need various forms CLOEXEC and preparation which makes mess, but that's semantics. A quick grep of /usr/include shows this:</font><br> O_CLOEXEC, FD_CLOEXEC, EFD_CLOEXEC, EPOLL_CLOEXEC, F_DUPFD_CLOEXEC, IN_CLOEXEC, MFD_CLOEXEC, SFD_CLOEXEC, SOCK_CLOEXEC, TFD_CLOEXEC, DRM_CLOEXEC, FAN_CLOEXEC, UDMABUF_FLAGS_CLOEXEC <br> <p> The implicitness around all of this means that an application *CANNOT* be future-proof. Every time one of these flags got/gets added there's a new failure mode for an application written to the old API.<br> <p> (I.e. an application cannot -- by definition -- know which *_CLOEXEC flag will be needed in future.)<br> <p> "Clone shit" is *not* by any means a reasonably specification of behavior.<br> </div> Fri, 12 Apr 2019 22:36:55 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785880/ https://lwn.net/Articles/785880/ roc <div class="FormattedComment"> As a maintainer of rr, perhaps the heaviest ptrace() user ever: you're not wrong.<br> </div> Fri, 12 Apr 2019 22:14:34 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785879/ https://lwn.net/Articles/785879/ roc <div class="FormattedComment"> <font class="QuotedText">&gt; The problem mappings are those which have a _separate_ mapping in the child, which is actually the private ones; shared mappings remain mapped in the child but without COW (I think?)</font><br> <p> That's correct.<br> <p> <font class="QuotedText">&gt; and there's no kind of M_CLOFORK mapping that just isn't mapped in the child at all (which is what my brain late last night said private meant).</font><br> <p> That's true. Though there is madvise(MADV_DONTFORK) which gives you similar functionality.<br> <p> <font class="QuotedText">&gt; &gt; Shared anonymous mappings sometimes need COW too</font><br> <font class="QuotedText">&gt; Why? If it's a shared mapping, then writes by the child should be visible in the parent and vice-versa, so both processes can map the same page and no need to COW. 
What am I missing?</font><br> <p> As discussed in the paper that spawned this thread, sometimes fork() is used to create checkpoints of process state (e.g. rr and Redis do this). COW makes this extremely efficient for MAP_PRIVATE pages, which is great, but it doesn&#x27;t work with MAP_SHARED pages, so rr (not sure about Redis) has to eagerly copy them into the checkpoint. This is bad.<br> <p> The MAP_PRIVATE/MAP_SHARED model is too inflexible. It would be better to have a model where you can create memory objects backed by files or anonymous memory, and then explicitly COW-clone them (and of course map those objects into your address space, pass them to other processes, etc). The Fuchsia documentation isn&#x27;t great but it seems to have this kind of API. This would require the kernel to manage a tree of COW-clones for each memory object, but that isn&#x27;t very different to today where Unix kernels have to manage a tree of COW-clones of process address spaces.<br> </div> Fri, 12 Apr 2019 22:12:47 +0000 Microsoft Research: A fork() in the road https://lwn.net/Articles/785878/ https://lwn.net/Articles/785878/ rweikusat2 <div class="FormattedComment"> As I&#x27;ve now actually read a part of this ... ehh ... interesting piece of text, one glaring inaccuracy would be the &#x27;overcommit&#x27; bit. AFAIK, ever since the introduction of some virtual memory support in UNIX (with BSD for VAX), systems have slavishly emulated the 7th edition fork behaviour of "allocate enough swap space to store the entire new process on fork" because That&#x27;s How It Is To Be Done! (eg, McKusick simply formulates this as demand). Apparently, fork didn&#x27;t "encourage memory overcommit" in any system supporting it except on Linux (which - certainly coincidentally - is probably going to be the only non-Windows system the intended audience of this paper might have encountered, hence, they hopefully won&#x27;t spot this --- fingers crossed).<br> <p> Memory overcommit on fork is indeed sensible but that&#x27;s a Linux innovation. The default behaviour of 7th edition emulation forks when not enough swap space can be reserved could be described as "suicide out of fear of death": No one knows how much of the inherited address space will need to be copied; this entirely arbitrary limit thus prefers "guaranteed failure now" over "possible success in future", even though "guaranteed failure now" mode obviously cannot guarantee that neither of the two forked processes will end up failing due to an out of memory situation encountered in a future memory allocation.<br> <p> </div> Fri, 12 Apr 2019 21:36:24 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785852/ https://lwn.net/Articles/785852/ MatejLach <div class="FormattedComment"> You articulated my feelings about the systemd hate more accurately than I could. It seems that as time goes on, everything seems to be remembered more fondly (not just true for sysvinit; it happens with movies, presidential approval ratings, etc.). <br> <p> One thing that many people also miss is that systemd&#x27;s a &#x27;service manager&#x27;, therefore its work doesn&#x27;t stop once your services are up and running. Now I know many would argue that&#x27;s a downside, but the reality is, the alternative is to get the same set of functionality via a patchwork of variable-quality scripts on top of a &#x27;simpler&#x27; init system.<br> <p> Also, complaints about logind are funny, because nobody was apparently willing to do equivalent maintenance work (consolekit etc.), so yeah.<br> <p> Anyway, it&#x27;s getting a bit ranty, but the point still stands. 
<br> </div> Fri, 12 Apr 2019 19:21:40 +0000
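A minimal sketch of the madvise(MADV_DONTFORK) escape hatch mentioned in the copy-on-write discussion a few comments above, assuming Linux (the flag has been there since 2.6.16): a region marked this way simply does not appear in the child after fork(), so it is neither copied nor COW-shared. Error handling is omitted.
<pre>
/* Sketch: carve out an anonymous region that fork() will not propagate. */
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1024 * 1024;                   /* 1 MiB, arbitrary */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(buf, len, MADV_DONTFORK);           /* do not map this in children */
    strcpy(buf, "parent-only data");

    pid_t pid = fork();
    if (pid == 0) {
        /* Touching buf here would fault: the mapping does not exist in
         * the child at all, unlike an ordinary COW private mapping. */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("%s\n", buf);                        /* still intact in the parent */
    munmap(buf, len);
    return 0;
}
</pre>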
Microsoft research: A fork() in the road https://lwn.net/Articles/785869/ https://lwn.net/Articles/785869/ dufkaf I was answering to "How a multi-threaded application can do a fork in a sane manner?". One can fork as much as needed to create separate child processes first and then possibly create threads in each of those (which I guess is not a problem?) but why fork an already multithreaded process (instead of creating an additional thread) if not for the exec? Fri, 12 Apr 2019 19:09:35 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785866/ https://lwn.net/Articles/785866/ HelloWorld <div class="FormattedComment"> <font class="QuotedText">&gt; The quoted passage is very ambiguous then</font><br> No it&#x27;s not, it is crystal clear. The only thing muddying the waters here is people&#x27;s interpretation. <br> </div> Fri, 12 Apr 2019 18:45:45 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785861/ https://lwn.net/Articles/785861/ rweikusat2 <blockquote> The first Unix versions were written in assembly. Unfortunately, PDP-s became unavailable otherwise Unix fans would have still be extolling the virtues of it. </blockquote> The original PDP-7 implementation was written in machine language for want of any other choice. Ditto for parts of the original PDP-11 implementation. Nevertheless, <blockquote> We all wanted to create interesting software more easily. Using assembler was dreary enough that B, despite its performance problems, had been supplemented by a small library of useful service routines and was being used for more and more new programs. </blockquote> [D. Ritchie, <em>The Development of the C Language</em>] <p> and <blockquote> By early 1973, the essentials of modern C were complete. The language and compiler were strong enough to permit us to rewrite the Unix kernel for the PDP-11 in C during the summer of that year. (Thompson had made a brief attempt to produce a system coded in an early version of C--before structures--in 1972, but gave up the effort.) </blockquote> [p. 16] <p> There was indeed an OS written in PDP-10 machine language whose fans keep extolling its virtues to this day: the MIT AI lab <em>Incompatible Timesharing System</em> (with PCLSRing being 'the virtue') but that's something different. Fri, 12 Apr 2019 18:41:18 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785860/ https://lwn.net/Articles/785860/ Cyberax <div class="FormattedComment"> Why is that? Java can use helper utilities just like everything else. There&#x27;s also Golang that suffers from the same issues.<br> </div> Fri, 12 Apr 2019 18:18:51 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785810/ https://lwn.net/Articles/785810/ joncb <div class="FormattedComment"> I feel like if you&#x27;re worrying about fork() while working in Java then something has gone horribly wrong. <br> I could be wrong, I don&#x27;t know your workload, but I feel like Java and fork are not meant to be friends. 
<br> <p> </div> Fri, 12 Apr 2019 14:12:22 +0000
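On the question a few comments up about forking an already-multithreaded process only to exec: a minimal sketch of the one pattern POSIX actually blesses for that case, assuming the helper is named by a full path (the PATH-searching exec variants are not on the async-signal-safe list).
<pre>
/* Between fork() and execv() in a multithreaded parent, the child may
 * only rely on async-signal-safe calls: only the forking thread exists
 * there, and any lock another thread held at fork time stays held
 * forever, so malloc(), stdio and friends can deadlock. */
#include <unistd.h>

static pid_t spawn_helper(char *const argv[], int logfd)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* child: async-signal-safe territory only */
        dup2(logfd, STDOUT_FILENO);
        dup2(logfd, STDERR_FILENO);
        execv(argv[0], argv);   /* argv[0] must be a full path */
        _exit(127);             /* exec failed; skip atexit handlers */
    }
    return pid;                 /* parent: child pid, or -1 on failure */
}
</pre>
Anything beyond redirecting a few descriptors has to happen either before the fork or after the exec, which is essentially the constraint being debated in this thread.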
Microsoft research: A fork() in the road https://lwn.net/Articles/785802/ https://lwn.net/Articles/785802/ tao <div class="FormattedComment"> Ah, you mean it works much better than the alternative, but there&#x27;s a small, rabid group that seems convinced otherwise, screams very loudly, and yet cannot really agree with each other on what the "better" alternative solution would be, except that everyone seems convinced that things were better in the mythical "before".<br> <p> Yes, your simile is rather apt.<br> </div> Fri, 12 Apr 2019 11:24:59 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785799/ https://lwn.net/Articles/785799/ ecree <div class="FormattedComment"> <font class="QuotedText">&gt; Private file mappings need COW.</font><br> <p> Yeah, I was getting my terminology a bit confused last night.<br> <p> What I was trying to say was that malloc() memory &#x27;normally&#x27; needs COW and file mappings &#x27;normally&#x27; don&#x27;t.<br> <p> The problem mappings are those which have a _separate_ mapping in the child, which is actually the private ones; shared mappings remain mapped in the child but without COW (I think?), and there&#x27;s no kind of M_CLOFORK mapping that just isn&#x27;t mapped in the child at all (which is what my brain late last night said private meant).<br> <p> <font class="QuotedText">&gt; Shared anonymous mappings sometimes need COW too</font><br> <p> Why? If it&#x27;s a shared mapping, then writes by the child should be visible in the parent and vice-versa, so both processes can map the same page and there&#x27;s no need to COW. What am I missing?<br> </div> Fri, 12 Apr 2019 10:35:04 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785798/ https://lwn.net/Articles/785798/ ecree <div class="FormattedComment"> Note where I said "incautiously import ideas from another system and everything falls apart".<br> <p> If I wanted to be maximally inflammatory, I would say that in the analogy, the EU represents systemd. But let&#x27;s not go down that rabbit hole.<br> </div> Fri, 12 Apr 2019 10:27:39 +0000 Microsoft Research: A fork() in the road https://lwn.net/Articles/785786/ https://lwn.net/Articles/785786/ eru <i>&gt;Or perhaps not, if Google simply replaces Linux with Fuchsia in Android and everybody is forced to write to that API.</i> <p> The mobile app developers probably would not even notice that change of kernel, especially since Google would work hard to minimize its visible effects on interfaces, for backward-compatibility reasons. Aren&#x27;t Android apps mostly written in Java or some other higher-level language anyway? Fri, 12 Apr 2019 06:53:30 +0000 Microsoft Research: A fork() in the road https://lwn.net/Articles/785780/ https://lwn.net/Articles/785780/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt; One last point. Removing fork+exec from UNIX (really Linux these days) is a fools errand.</font><br> Sometimes they are necessary. Any realistic removal plan would require decades of transition time, though. Or perhaps not, if Google simply replaces Linux with Fuchsia in Android and everybody is forced to write to that API.<br> <p> <font class="QuotedText">&gt; I don&#x27;t mean to dump all over the authors. But this piece is an opinion piece, if not a gripe session, not a research report. 
Cygwin under their research provides a crappy fork() performance, primarily because of the impedance mis-match between the UNIX model and their model running over Windows.</font><br> Windows actually supports pretty performant fork() in its kernel. It's used in the new Linux subsystem for Windows and before that it was used in UNIX Services for Windows. It suffers from the overcommit problem, but otherwise it's enough to run most ported Unix apps.<br> </div> Fri, 12 Apr 2019 05:01:46 +0000 Microsoft Research: A fork() in the road https://lwn.net/Articles/785749/ https://lwn.net/Articles/785749/ lieb <div class="FormattedComment"> I did a close read of this article and the comments, particularly some of the missed points.<br> <p> I started off with Tenex (1.34) and moved to UNIX (V6 w/ Univ Ill NCP (Arpanet)) so I've seen a bit. I had to dig out my old TENEX docs to refresh my memory... It has been a while since I wrote a JSYS FORK. But I'm not bragging my age but rather, there needs to be some longer perspective in this discussion.<br> <p> As the authors hinted, TENEX did have a FORK that would would do everything in one go; create the fork, populate it with an image, and start it. There were other variations on the theme such as an equivalent of vfork() but it was not all that great. It was pretty advanced for its day but in the end, it didn't really do much other than introduce a real VM with functioning pagefaults. Its filesystem wasn't much better than its progeny VFAT. For example, with its rigid process architecture, there was no way to do "foocmd &lt; bits.in &gt;stuff.out&amp;". Things like threading were also awkward at best. The UNIX model of fork() + exec() solved a lot of those problems. It worked for two reasons. First, a pid is a global object. I can create a process, tell someone else its pid and they can play with it. A fork was and still is cheap(er) and the model of orphaning a pid to the init proc made backgrounding trivial. That does not seem like much but it is when you try to make a network daemon or batch system and don't have it. Second, splitting the two has real power - and with real power there is also risks (and bugs).<br> <p> A TENEX fork, just like CreateProcess was a single shot. Once it is gone into the system it is gone. Sure, you can manipulate it some but now you have one proc fiddling with another with all the race conditions it implies as shown by the complexity of ptrace(). CreateProcess solves this problem with its mountain of API args. However...<br> <p> Back with V6, fork() and exec() were pretty simple. But they already had some things that TENEX didn't have. That power was and still is what goes on between the child's return of fork() and its subsequent exec(). In those days, we didn't have too much to do other than some close() and open() calls, usually redirecting one or more of 0,1,2 and the closing of random other files (very rare but possible), and maybe setuid(), setgid(). There were no capabilities or shared anything to clean up. Since those days the number of things to be policed from the old environment before the new one got launched has grown. For example, execve() popped up because we got this thing called "environment" in V7 which sometimes required a scrubbing of the env vector. It only got better and worse. The authors rightfully note this growth of features and the resource costs that go with them. The costs are there no matter what the model used for creating/managing/destroying them. You will have to do those actions somewhere. 
The only question is where.<br> <p> They also criticize all the close-on-exec stuff scattered about in the various kernel subsystems that use an fd. Point well taken. Then again, this is still territory where one really needs to know more than just garden variety algorithms. And, where else would you handle things such as this? The kernel doesn't know the significance of one open fd from another and how would you construct an API extension for clone() to handle such an open ended requirement? This is something that only the app has knowledge of the full context. Therefore, it is the app's responsibility. You have a choice to either do it somewhere in the app (the Linux choice) or in a system API somewhere. You cannot fob it off somewhere else.<br> <p> Consider the following issues on allocated resources, most of them open "files". There was a time when a proc could only have 16 open files. BSD moved that to 20. They then implemented the select() call. Their argument list included bit masks for the fd's of interest in the API. It seemed like a good idea at the time. Besides, what app would have more than 10-12 files open at any one time? Well, guess what. In the early 90's, having only 4k open files in a database or pthread app was a limit. This forced the new select2() syscall because the API had to change to handle a variable length bitmask but the old select() was cast in concrete. The AltaVista webserver, which I maintained for a while back then, blew that number bigtime. The select() syscall was no longer a good idea and select2 with huge holes in the bitmap was only marginally better. Hint, even if all you kept open most of the time were 0,1,2 and a few fds that you had hanging around after a flurry of file stuff, you still could have an fd &gt; 4096... Bit masks were no longer a good idea. Eventually poll/epoll and friends took over. The lesson here is that systems must evolve to address the continuing stream of new requirements.<br> <p> This is where CreateProcess() and its API comes in. All those arguments are there for a purpose and are fixed for all time unless you are really into pain. But is that all that will be needed/wanted? History shows that no, it isn't. The things that must be manipulated when a process/fork gets created will grow in size. There are three choices for this, the system remains static for all time, you hack the API one more time or have an interim period in the child's code where all this can happen prior to launching the new image with exec(). UNIX chose the last and we benefit from that choice.<br> <p> The creation of a process/thread in any OS is always tricky. There are timing/race conditions to consider, security vulnerabilities to deal with, etc., etc. This is not code for kiddies. But just as the original exec() -&gt; execve() and fork() -&gt; vfork() issues, new capabilities stretch the limits of the process model and, from experience, adding a new syscall, painful as it is, is better than extending an existing API into uncharted territory. Therefore, having that interim period in the child's initialization has real value. This is where one deals with the future, right there after the child's return from fork(). Only the app knows what privs and resources should be freed before letting the next image take over after exec. Only the app knows that whether a particular resource that it needs must be closed on exec. So have the app do it. The kernel doesn't know enough to do the right thing even if it cared. 
This is one place in the app codebase where such things matter. It has to be carefully written and debugged. It takes skill to do it. It also has to be done someplace. Launching a proc in any system involves two phases; first, clean up inherited "stuff" and, second, initialize a new, safe new environment for the child before it invokes main(). There are three places to do it. You can do it in this interim code before exec(); you can do it in crt0 or ld.so; or you can do it in the kernel. Take your pick. The interim code is safe in that it is no worse than the rest of the app and fits the need. Doing it in crt0 or ld.so adds special and unique app requirements to a standard/common bit of system runtime. Don't even think about the kernel. Once in the kernel ABI, always in the kernel ABI, and for very good reasons. Hint, they bring pitchforks to these change proposal meetings...<br> <p> The problems with threads and the mixing of threads with process creation (fork+exec) are real. They are two very different beasts and don't get along all that well. We argued about that back in the cde_threads days and things did not get easier just because we changed the name to pthreads. But pthreads itself, and any of its offshoots such as LWP are not much better. Pthreads does a reasonable job at forking a process but one successfully does such things by abiding by a set of design rules just a little less complex than a EULA. And that is so for a reason. I look at pthread and its mix of mutex+condition vars as being a little more safe than writing the whole thing in ASM, which I did on TENEX, i.e. there is nothing to enforce any of those mostly documented rules and design patterns. There is little help and no guard rails in this model which is why, 30 years later, there are not all that many of us who can really grok this stuff and even then, getting critical sections and lock optimization right is hard work. A pthread call is, after all, just another function call... Coverity et al have to do some real back breaking work (magic) to make sense of the rubbish shoved into it enough to report a reasonable error. Java doesn't offer much in this space either even if it has some threading "primitives" in the language (more or less). C++ is worse, little better than C + pthreads. The closest I've seen in modern languages is the golang model, mainly because it has concurrency (and the constraints necessary to keep it "safe") built into the language itself where the compiler and analysis tools can see what is going on. Also note that they use a "concurrency" model, not a multi-threading model (See Pike's numerous blogs). All the magic is in semantic pass(es) of the compiler and the runtime well out of the reach of the app programmer.<br> <p> If we look at the fork() implementation in the Linux kernel, we find that fork and vfork are just wrappers around the full blown clone call, all of them calling _do_fork(). The pthread_create() lib call uses clone directly. This is also why pthread_spawn() is faster in their graph. It is a properly clamped down clone() followed by quick lib resources cleanup followed by exec(). This was a smart move when NPTL entered the kernel in 2.6. The kernel doesn't care if a task is a thread or a proc; it just does its scheduling and resources thing. Only the app cares and the library does a respectable job with the rest. Note that its options to COW or share a restricted set of objects is limited to just the things that user code can't manage. 
Hint: why don't they use clone() instead in their runtime?<br> <p> The authors made some comments on how, without fork+exec, they could do really cool stuff like load and relocate another process in the same address space. Why would anyone want to do such a thing, other than fool around with some academic notion? Memory management, even the pre-VM segment management in the PDP-11, is a very good thing. The reason is simple. If you can't address the object (in memory or the kernel) you can't piddle all over it. It is bad enough when a pthread goes rogue and stomps on things. Why would you import an unknown quantity like an arbitrary executable into your address space? That is an attack surface bigger than the flight deck of the Carl Vinson. One can escape any language runtime into ASM and once there, all bets are off. In other words, so what if you an load and randomly relocate multiple copies of a DLL/SO. I submit that is a feature in search of a problem to solve. If you want to do such things, use a VM or container and let the hypervisor keep you out of mischief. If you don't want fork, use a unikernel in a VM and get on with it. The realtime gadget people do it all the time with bare iron things like Arduinos and MIPS SOCs.<br> <p> One last point. Removing fork+exec from UNIX (really Linux these days) is a fools errand. There is one very big reason why anyone would care, other than an academic exercise in woulda-coulda-been semantics. There is a massive amount of code out there that runs inside a UNIX model and it does so for a very simple engineering and operational reason. As bad as it is, it is still, on the whole better than all the OS models it displaced. I mentioned at the top that my first system was TENEX. I also worked on the DECSystem-20, a commercialized version of that OS before I moved exclusively to UNIX/Linux. Those were good systems that did cool things but most of us who left them behind had good reasons to move on. Those systems and all the other "proprietary" systems are now but memories to talk about over beers with other retired hackers. Anyone remember the DG Eclipse? AOS/VS had some really cool features, such as a built in threading model, that were way ahead of their time. But where is DG, or DEC or even Sun now? Most of the UNIX systems are gone leaving only {Free,Net,Open}BSD still chugging away. All those systems have been replaced by a standard system that does its job very well and it happens to be UNIX. Linux has evolved over the years but the core similarities and model are still closer to UNIX V6 than any of the other long gone OS designs. It is the standard OS just like the electrical outlets you get down at Home Depot are standard. Imagine the chaos that would return to metal fabrication if instead of using metric or "English" sizes, one chose their own arbitrary dimensions for thread sizes for fasteners. Having two complete set of tools, one metric and one SAE is pain enough which is why every country (other than the USA) is now almost completely metric. The same applies to current OS ABI/API standards. That massive amount of code only really happened when those of us who had to build real systems stopped arguing and accepted the one system we could all agree (at least in principle) could do the job and we could share in common without a lot of legal/financial friction. The world converged even more so on Linux because of the same reason. 
People who want to build big, complex systems or who want to build handheld things like smartphones by the billion just want something on top of the iron that they could depend on rather than re-invent. Even Microsoft has figured this out. There is no money in maintaining a proprietary OS anymore other than to support an Office suite that is, itself moving off the desktop and into "the Cloud". Unlike Linux where the development model scales to fill the staffing requirement because everyone and anyone who needs it can contribute their expertise, all of the Windows system specialists who really understand how the guts of the thing works are proprietary need-to-know box on the Microsoft payroll which is why Windows/N, N=1-&gt;inf is really in maintenance mode. That group is a "cost center" that can't grow because it would eat the engineering budget alive while providing little more than a support layer underneath their Office products (the real cash cow). Their next new thing, where their dev money is being spent these days, is Azure which is a service whose profitability is based on simple usage scaling not feature development. And yes, most of the VMs and containers they run have fork() somewhere in the runtime.<br> <p> I don't mean to dump all over the authors. But this piece is an opinion piece, if not a gripe session, not a research report. Cygwin under their research provides a crappy fork() performance, primarily because of the impedance mis-match between the UNIX model and their model running over Windows. So what else is new. My son has solved that problem. He's given up on using things like git on Windows and is tired of the self inflicted incompatibilities in Mac/OS (old python et al). He now has a Windows/10 machine for company stuff like Outlook and runs Fedora 29 in a VM to do his development work which does the deed just fine. When the authors and the users of their OS paradigm have enough code to double the size of github and Sourceforge, maybe then their argument would make sense. Otherwise, this is much about nothing and wishing for unicorns. 
(Lots of) code that works beats elegant designs that don&#x27;t (yet) every time.<br> <p> Sorry for being a grumpy old hacker.<br> </div> Fri, 12 Apr 2019 04:18:21 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785775/ https://lwn.net/Articles/785775/ roc <div class="FormattedComment"> Private file mappings need COW.<br> <p> Shared anonymous mappings sometimes need COW too, but you just can&#x27;t have that in Linux/POSIX.<br> </div> Fri, 12 Apr 2019 03:46:58 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785763/ https://lwn.net/Articles/785763/ pabs <div class="FormattedComment"> Hmm, clone/vfork don&#x27;t sound like what I would expect a posix_spawn kernel API to be like.<br> </div> Thu, 11 Apr 2019 23:20:20 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785758/ https://lwn.net/Articles/785758/ mpr22 <div class="FormattedComment"> *looks at British politics*<br> <p> You know, your analogy says some pretty unflattering things about Unix.<br> </div> Thu, 11 Apr 2019 22:59:26 +0000 Microsoft research: A fork() in the road https://lwn.net/Articles/785755/ https://lwn.net/Articles/785755/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt; Unless of course you&#x27;d rather have a spawn() function that takes as an argument a BPF program that sets up the child environment before the new process image is executed ;)</font><br> Why is a server that allows you to seamlessly share complex graphs of objects badly designed? Designing something as multiple processes is not at all better in itself.<br> <p> <font class="QuotedText">&gt; If that were true, Unix systems would still be written in B.</font><br> The first Unix versions were written in assembly. Unfortunately, PDP-s became unavailable; otherwise Unix fans would still be extolling its virtues.<br> </div> Thu, 11 Apr 2019 22:47:22 +0000