Microsoft Research: A fork() in the road [LWN.net]

Microsoft research: A fork() in the road

Posted Apr 10, 2019 12:51 UTC (Wed) by fhuberts (guest, #64683) [Link] (48 responses)

A rather conspicuous claim from a man working for an OS vendor that doesn't have that call.

Personally I think fork is a nice enough call and actually an advantage to have. For example, Git rather suffers in performance on Windows because of the lack of that call (amongst other things).

Microsoft research: A fork() in the road

Posted Apr 10, 2019 16:39 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (37 responses)

NT has had fork for decades. (NtCreateProcess can be given a source address space.)

These researchers happen to be right. fork requires overcommit, and overcommit is the enemy of guaranteed forward progress.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 19:38 UTC (Thu) by simcop2387 (subscriber, #101710) [Link] (36 responses)

Performant fork requires overcommit and copy on write. Fork itself doesn't need that, as long as you copy the entire memory space of the process when you fork.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 20:02 UTC (Thu) by ecree (guest, #95790) [Link] (35 responses)

Arguably, performant fork() doesn't need overcommit either. If you have enough RAM, you can reserve pages at fork() and release them at exec(), without having to actually populate those pages except as-needed for COW. You could even stall fork() calls elsewhere in the system, rather than immediately returning -ENOMEM, if the system thinks its memory pressure is due only to such short-term reservations.

This only leads to problems in the case where you have a single-process behemoth with huge amounts of writable anonymous pages; also known as a badly-designed program. As long as userland developers are following proper Unix philosophy (in this case, multiprogramming), fork() can remain performant even without overcommitting memory. (And if you're _not_ doing multiprogramming, and are happy to have a single fat process, then you won't want to run subprocesses anyway, so you won't be calling fork(). It's only the ugly half-way compromises that have a problem.)

Microsoft research: A fork() in the road

Posted Apr 11, 2019 20:05 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (34 responses)

> This only leads to problems in the case where you have a single-process behemoth with huge amounts of writable anonymous pages; also known as a badly-designed program.
Like, say, an application server with a large amount of cached data?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 20:39 UTC (Thu) by ecree (guest, #95790) [Link] (33 responses)

Separate your driver program (which handles forking of new processes) from your data-crunching (which has the large anonymous shared mappings), and all is well.

And do note that it's only the _anonymous shared_ mappings that are a problem; file-backed mappings don't require COW, and nor do private anonymous mappings. Your "large amount of cached data" could have been stored in memory allocated with mmap(MAP_PRIVATE | MAP_ANON), instead of regular malloc(), and then it wouldn't show up in the child after fork().

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:25 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (22 responses)

> Separate your driver program (which handles forking of new processes) from your data-crunching (which has the large anonymous shared mappings), and all is well.
Oh wow. So we need a persistent daemon that does RPC to simply launch processes efficiently?

And you're arguing that Unix is well designed?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:40 UTC (Thu) by ecree (guest, #95790) [Link] (21 responses)

> Oh wow. So we need a persistent daemon that does RPC to simply launch processes efficiently?
There doesn't need to be any RPC involved, the data-cruncher can & should be a child process forked from the driver. The communication between the two might be sockets, or shm, but it could be as simple as the cruncher receiving jobs on stdin and shipping notifications on stdout.
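A minimal sketch of that layout, in Python for brevity: the driver forks the cruncher while it is still small, then feeds it jobs over one pipe and reads notifications from another. The function names (`run_cruncher`, `start_cruncher`, `square_all`) and the one-integer-per-line protocol are invented for illustration and do not come from any real server.

```python
import os

def run_cruncher(jobs_fd, results_fd):
    """Child loop: read one integer per line, reply with its square."""
    with os.fdopen(jobs_fd, "r") as jobs, os.fdopen(results_fd, "w") as results:
        for line in jobs:
            results.write(f"{int(line) ** 2}\n")
            results.flush()

def start_cruncher():
    """Fork the cruncher and return (pid, job writer, result reader)."""
    job_r, job_w = os.pipe()
    res_r, res_w = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: becomes the cruncher
        try:
            os.close(job_w)
            os.close(res_r)
            run_cruncher(job_r, res_w)
        finally:
            os._exit(0)               # never fall back into the driver's code
    os.close(job_r)                   # parent: keep only its own pipe ends
    os.close(res_w)
    return pid, os.fdopen(job_w, "w"), os.fdopen(res_r, "r")

def square_all(numbers):
    """Driver side: send each job, wait for the matching notification."""
    pid, jobs, results = start_cruncher()
    answers = []
    for n in numbers:
        jobs.write(f"{n}\n")
        jobs.flush()
        answers.append(int(results.readline()))
    jobs.close()                      # EOF tells the cruncher to stop
    results.close()
    os.waitpid(pid, 0)
    return answers
```

Calling `square_all([2, 3, 4])` round-trips three jobs through the child. The point of the design is that whatever large state the cruncher builds up lives only in the child, so the driver stays cheap to fork from.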

Modularity is a virtue.

Besides, I'm not arguing that fork() has to be the _only_ way to launch processes; it's entirely OK to _also_ have a spawn()-like interface for the 'simple case' where you don't want to juggle fds, ulimits, creds, etc., as long as fork() is still supported for the hard cases. And there's always vfork()...
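For comparison, here is roughly what the 'simple case' looks like through a spawn-style call. This sketch uses Python's `os.posix_spawn` wrapper (Python 3.8+); the helper name `spawn_and_capture` is made up for the example, and how the C library implements the call underneath (fork, vfork or clone) is left to the platform.

```python
import os
import sys

def spawn_and_capture(code):
    """Run `python -c code` via posix_spawn and capture its stdout."""
    r, w = os.pipe()
    pid = os.posix_spawn(
        sys.executable,
        [sys.executable, "-c", code],
        dict(os.environ),
        # Declarative child setup: make the pipe's write end the child's stdout.
        file_actions=[(os.POSIX_SPAWN_DUP2, w, 1)],
    )
    os.close(w)                       # parent keeps only the read end
    with os.fdopen(r, "rb") as out:
        output = out.read()           # read until the child closes stdout
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), output
```

Everything the child needs has to be expressed as a `file_actions` entry or a keyword argument, which is exactly the trade-off being argued about: fine for redirecting a descriptor, increasingly awkward as the setup grows.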

> And you're arguing that Unix is well designed?
Yes. It is.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:47 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (20 responses)

> There doesn't need to be any RPC involved, the data-cruncher can & should be a child process forked from the driver.
I'm sorry. Have you ever worked with Java? Typically you have a server that runs some kind of service. It's a single process - it makes sharing data between requests very easy.

This single process can be very large, tens of gigabytes in size. Modern JVMs are quite efficient at managing large heaps, so this is desirable.

Now you need to launch a helper process. If you use fork()+exec then you're looking at duplicating the entire working set of the application server.

> Yes. It is.
Nope. We have examples of better-designed APIs now.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 22:04 UTC (Thu) by ecree (guest, #95790) [Link] (8 responses)

> Have you ever worked with Java?

Not when there was any alternative.

> It's a single process - it makes sharing data between requests very easy.

Fun fact: you can share memory between distinct processes, by any of several means.

Also, I'm not suggesting spinning off a separate process to handle each request (the xinetd model); just splitting up the workload into separate processes doing different aspects of the job. Do one thing well.

> This single process can be very large, tens of gigabytes in size. Modern JVMs are quite efficient at managing large heaps, so this is desirable.

Your definition of "desirable" clearly differs from mine.

> Now you need to launch a helper process. If you use fork()+exec then you're looking at duplicating the entire working set of the application server.

I know that. Which is but one of the many reasons you shouldn't build a gigantic monolithic application server in the first place.

The Unix system philosophy is like the Westminster system of government. Take any one part of it in isolation, and it looks obviously silly; incautiously import ideas from another system and everything falls apart. But the whole thing, when put together and kept intact, thrums along beautifully and achieves world domination.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 22:10 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

> Fun fact: you can share memory between distinct processes, by any of several means.
Fun fact: none of them allow you to transparently share complex data, with automatic garbage collection.

> Also, I'm not suggesting spinning off a separate process to handle each request (the xinetd model); just splitting up the workload into separate processes doing different aspects of the job. Do one thing well.
So how about being able to run processes without requiring ugly workarounds? Or is this a part that doesn't need to be done well?

> I know that. Which is but one of the many reasons you shouldn't build a gigantic monolithic application server in the first place.
So you're confirming the authors' statement - you have to build your whole system around deficiencies of fork().

> The Unix system philosophy is like the Westminster system of government. Take any one part of it in isolation, and it looks obviously silly; incautiously import ideas from another system and everything falls apart. But the whole thing, when put together and kept intact, thrums along beautifully and achieves world domination.
No. The Unix philosophy is to get something working ASAP and then just objectify it as the epitome of creation, whether it's bad or not.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 22:42 UTC (Thu) by ecree (guest, #95790) [Link] (2 responses)

> So you're confirming the authors' statement - you have to build your whole system around deficiencies of fork().

No; you have to build your system in ways that are already the Right Thing _for other reasons_.

fork()'s "deficiencies" are only deficient for software that is _already badly designed_ before fork() enters the picture.

> The Unix philosophy is to get something working ASAP and then just objectify it as the epitome of creation

If that were true, Unix systems would still be written in B.

The developers of Research Unix at Bell Labs weren't averse to experimenting with changes to the system. They merely avoided changes which, while superficially attractive, did more harm than good. They had 'engineering taste' — which is really the ability to intuit the deeper consequences and ramifications of a design decision.

And the Unix design, as continued by Plan 9 and Linux, continues to evolve (/proc, /sys, entirely new kinds of fds), but always guided by the Unix philosophy.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 22:47 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> Unless of course you'd rather have a spawn() function that takes as an argument a BPF program that sets up the child environment before the new process image is executed ;)
Why is a server that allows you to seamlessly share complex graphs of objects badly designed? Designing something as multiple processes is not at all better in itself.

> If that were true, Unix systems would still be written in B.
The first Unix versions were written in assembly. Unfortunately, PDP-s became unavailable, otherwise Unix fans would still be extolling the virtues of it.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 18:41 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link]

> The first Unix versions were written in assembly. Unfortunately, PDP-s became unavailable, otherwise Unix fans would still be extolling the virtues of it.

The original PDP-7 implementation was written in machine language for want of any other choice. Ditto for parts of the original PDP-11 implementation. Nevertheless,

> We all wanted to create interesting software more easily. Using assembler was dreary enough that B, despite its performance problems, had been supplemented by a small library of useful service routines and was being used for more and more new programs.

[D. Ritchie, The Development of the C Language]

and

> By early 1973, the essentials of modern C were complete. The language and compiler were strong enough to permit us to rewrite the Unix kernel for the PDP-11 in C during the summer of that year. (Thompson had made a brief attempt to produce a system coded in an early version of C--before structures--in 1972, but gave up the effort.)

[p. 16]

There was indeed an OS written in PDP-10 machine language whose fans keep extolling its virtues to this day: the MIT AI Lab's Incompatible Timesharing System (with PCLSRing being 'the virtue'), but that's something different.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 22:59 UTC (Thu) by mpr22 (subscriber, #60784) [Link] (3 responses)

*looks at British politics*

You know, your analogy says some pretty unflattering things about Unix.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 10:27 UTC (Fri) by ecree (guest, #95790) [Link] (2 responses)

Note where I said "incautiously import ideas from another system and everything falls apart".

If I wanted to be maximally inflammatory, I would say that in the analogy, the EU represents systemd. But let's not go down that rabbithole.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 11:24 UTC (Fri) by tao (subscriber, #17563) [Link] (1 responses)

Ah, you mean works much better than the alternative, but there's a rabid small group that seems convinced otherwise that screams very loudly, but cannot really agree with each other on what the alternative "better" solution would be, except that everyone seems convinced that things were better in the mythical "before".

Yes, your simile is rather apt.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 19:21 UTC (Fri) by MatejLach (guest, #84942) [Link]

You articulated my feelings about the systemd hate more accurately than I could. It seems that as time goes on, everything seems to be remembered more fondly (not just true for sysvinit, it happens with movies, president approval ratings etc.).

One thing that many people also miss, is that systemd's a 'service manager', therefore its work doesn't stop once your services are up and running. Now I know many would argue that's a downside, but the reality is, the alternative is to get the same set of functionality via a patchwork of variable-quality scripts on top of a 'simpler' init system.

Also, complaints about logind are funny, because nobody was apparently willing to do equivalent maintenance work, (consolekit etc.), so yeah.

Anyway, it's getting a bit ranty, but the point still stands.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 14:12 UTC (Fri) by joncb (guest, #128491) [Link] (9 responses)

I feel like if you're worrying about fork() while working in Java then something has gone horribly wrong.
I could be wrong, i don't know your workload, but I feel like Java and fork are not meant to be friends.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 18:18 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

Why is that? Java can use helper utilities just like everything else. There's also Golang that suffers from the same issues.

Microsoft research: A fork() in the road

Posted Apr 13, 2019 7:26 UTC (Sat) by joncb (guest, #128491) [Link] (7 responses)

The whole point of Java is to detach yourself from these low level concerns.

Indeed, a very quick search suggests that to create a helper process you should either use Runtime.exec or ProcessBuilder (haven't really touched Java in a good decade so that is probably misleading in the nuances). While i wouldn't be surprised if one of the implementations involves a fork under the covers there's no reason it couldn't be anything else that guarantees the expected semantics.

The difference, of course, between C/C++ and Java/C# is that the former are languages that are expected to execute (more or less) directly on top of the current system whereas the latter are expected to present a virtual facade across such. Therefore i would expect C to have access to fork() where it is available whereas i would not expect Java or C# to do so. Golang is a weird blending of the two where some things are more C like and some things are not, low level fork access apparently being one of the nots. Rust appears to have fork but has some hefty safety warnings on it.

Microsoft research: A fork() in the road

Posted Apr 13, 2019 9:23 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> The whole point of Java is to detach yourself from these low level concerns.
In this particular case I was running a GPU-based optimizer in a separate process. It was kinda crashy (drivers...), so isolating it was a good idea. Heck, it even used pipe-based interaction. How much more Unixy can you get?

> Indeed, a very quick search suggests that to create a helper process you should either use Runtime.Exec or ProcessBuilder (haven't really touched Java in a good decade so that is probably misleading in the nuances). While i wouldn't be surprised if one of the implementations involves a fork under the covers there's no reason it couldn't be anything else that guarantees the expected semantics.
They both use fork (more precisely, clone) on Linux. There's no way to avoid it, and this is one of the problems.

Microsoft research: A fork() in the road

Posted Apr 13, 2019 23:52 UTC (Sat) by joncb (guest, #128491) [Link] (5 responses)

> They both use fork (more precisely, clone) on Linux. There's no way to avoid it, and this is one of the problems.

I assume you really don't mean "No way to avoid it" here because if there's literally "no way" then this whole exercise is just shouting into the void.

In particular, i'm thinking you (and i specify you because yours is the use case here) write a patch for openJDK that re-implements ProcessBuilder to use something other than fork when calling start(). From your comments on this story that should be very doable. You submit that patch to openJDK and make your case. Regardless of whether it is accepted or not, you can now run openJDK secure in the knowledge that your application is using this faster/safer/cleaner/whatever alternative.

In my travails doing an informal survey of how languages fork i came across an interesting python issue about moving to posix_spawn. It looks like it's stalled for technical compatibility reasons ( https://bugs.python.org/issue35823 ). The part stating that libc "may be more than a decade behind in enterprise Linux distros" shows where bigger problems lie.

Microsoft research: A fork() in the road

Posted Apr 13, 2019 23:54 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

There's no "something other" on Linux.

Microsoft research: A fork() in the road

Posted Apr 14, 2019 0:45 UTC (Sun) by zlynx (guest, #2285) [Link]

Large memory processes like Java should use "vfork()" or "clone()" instead of fork.

Even with overcommit turned on, trying to fork a 10 GB Java process can fail because it exceeds the heuristic.

With overcommit disabled, which is how I run my Linux servers, it will definitely fail.
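For anyone checking their own machine: on Linux the policy in question is the `vm.overcommit_memory` sysctl, where 0 is the default heuristic, 1 means always overcommit and 2 means strict accounting (no overcommit). A small sketch for reading it, with a made-up helper name and the path as an overridable default:

```python
OVERCOMMIT_MODES = {0: "heuristic", 1: "always", 2: "never"}

def overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Return the kernel's overcommit policy as a short label."""
    with open(path) as f:
        return OVERCOMMIT_MODES[int(f.read().strip())]
```

On a box configured the way described above, `overcommit_mode()` would report "never".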

Luckily we have vfork which was designed for exactly this problem. It doesn't duplicate the process memory, not even CoW. With a bit of care to not overwrite important memory in the parent process, it works very well to launch new child processes.

So "vfork()" is "something other" because it is like fork, but isn't actually fork.

Microsoft research: A fork() in the road

Posted Apr 14, 2019 23:49 UTC (Sun) by neilbrown (subscriber, #359) [Link]

> There's no "something other" on Linux.

Couldn't you open a socket and send a dbus message to systemd to ask it to run some service for you ??
Of course, if you don't like systemd, just write a dedicated server which does whatever you want done.

Microsoft research: A fork() in the road

Posted Apr 15, 2019 6:02 UTC (Mon) by joncb (guest, #128491) [Link] (1 responses)

> There's no "something other" on Linux.

Don't you think this is putting the cart before the horse just a little bit then? Surely creating a "something other" should take precedence to advocating for developers to stop using the one tool they have for this basic task?

Microsoft research: A fork() in the road

Posted Apr 15, 2019 8:41 UTC (Mon) by farnz (subscriber, #17727) [Link]

Not really - the paper says that in practical terms, fork isn't a good API, and while posix_spawn looks better in theory, it practically becomes a mess to use.

The paper is more of an academic opinion piece; it sets out why fork causes issues, why posix_spawn and friends aren't enough better to be worth the effort of a wholesale rewrite of software, and asserts that it should be possible to produce a better API given that, in theory, spawn-type APIs are easier for OS developers to implement.

Within the bounds of academia, this sort of paper serves to legitimise research into better APIs; someone has asserted with examples that existing APIs are imperfect, and now future researchers interested in process creation APIs have something they can use as a reference when they justify spending time on the "solved" problem of spawn versus fork APIs. Maybe the answer will turn out to be that posix_spawn and fork are both local maximums, and the only way to do better is a radical rethink of process design; maybe some bright spark will demonstrate that there is a better API we can use if we step away from the existing ones.

Key is that we don't have good data on better alternatives to the current "spawn with 101 flags to inherit the right bits of the world" and "fork then clean up" APIs; the paper says we need to work out what the "something other" should look like, because "fork and clean up" is easy for the user, but sets various design choices for the kernel (and requires certain hardware support to be performant - we get CoW very cheaply with modern MMUs, but at the expense of requiring MMUs for an OS kernel, not just MPUs), while "spawn" is easy for the kernel, but leads to huge complexity for the user as they have to handle 101 flags to get the "right" environment in the spawned process.

Microsoft research: A fork() in the road

Posted Apr 16, 2019 7:08 UTC (Tue) by gfernandes (subscriber, #119910) [Link]

I do, actually, work on very large Java applications with in-memory caches. And guess what?

We're now _breaking it ALL up_ into microservices, throwing out all the large in-memory caches, even moving databases to Mongo or PGSQL.

*ecree* is right.

Gigantic monoliths are no excuse for poor software design.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:28 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (6 responses)

> And do note that it's only the _anonymous shared_ mappings that are a problem; file-backed mappings don't require COW, and nor do private anonymous mappings. Your "large amount of cached data" could have been stored in memory allocated with mmap(MAP_PRIVATE | MAP_ANON), instead of regular malloc(), and then it wouldn't show up in the child after fork().
Typically this kind of stuff is stored in native data structures and so it doesn't have anything to do with files. You also can't typically control the allocations made by the JVM or your language runtime.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:52 UTC (Thu) by ecree (guest, #95790) [Link] (3 responses)

> Typically this kind of stuff is stored in native data structures and so it doesn't have anything to do with files.

I know, that's why you use MAP_ANON. Do pay attention ;)

> You also can't typically control the allocations made by the JVM or your language runtime.

I very nearly said something about "the problem with most application servers is they're written in Java", but I held back. Maybe I shouldn't've.

Language runtimes ought to provide mechanisms for allocating objects in private memory, if they're intended to be used for big programs that want child processes. Indeed, if they're going to be written around a spawn()ish view of the world, then objects allocated from user code won't need to be visible post-fork(), so such objects could just be allocated private by default.

C gives you that control, through the aforementioned mmap(), and it's probably even possible (I haven't tried it) to patch your libc to make malloc default-private.

An even more fine-grained system might be tagged allocations, where the fork()-analogue (probably clone()) could specify which tags it wanted to copy into the child. But probably no-one's ever needed that, else there would have been a serious attempt to implement it.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:54 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Language runtimes ought to provide mechanisms for allocating objects in private memory, if they're intended to be used for big programs that want child processes. Indeed, if they're going to be written around a spawn()ish view of the world, then objects allocated from user code won't need to be visible post-fork(), so such objects could just be allocated private by default.
Well, they don't. Go, Python, Java, C# all use simple private mappings.

Why _should_ they be designed around fork()?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 22:29 UTC (Thu) by ecree (guest, #95790) [Link] (1 responses)

> Why _should_ they be designed around fork()?

Because fork() is necessary to allow complex control of child environment without excessive API surface (spawn() functions with 42 arguments, etc.). So it needs to be supported.
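To make the argument concrete, here is a minimal sketch of the fork-then-configure pattern, in Python for brevity. The helper name `run_prepared` and its parameters are invented for the example; the point is only that the child uses ordinary calls (`chdir`, `setrlimit`, `dup2`) on itself before `exec`, so no dedicated spawn argument is needed for each knob.

```python
import os
import resource

def run_prepared(argv, cwd, max_open_files):
    """Fork, adjust the child's environment with ordinary calls, then exec."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.chdir(cwd)                                  # working directory
            resource.setrlimit(resource.RLIMIT_NOFILE,     # per-process ulimit
                               (max_open_files, max_open_files))
            os.dup2(w, 1)                                  # stdout -> pipe
            os.execv(argv[0], argv)                        # replaces the image
        finally:
            os._exit(127)                                  # reached only on failure
    os.close(w)
    with os.fdopen(r, "rb") as out:
        output = out.read()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), output
```

Adding another piece of setup (credentials, signal masks, namespaces) is one more line between `fork()` and `execv()`, with no change to any API. The cost is everything else discussed in this thread.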

Unless of course you'd rather have a spawn() function that takes as an argument a BPF program that sets up the child environment before the new process image is executed ;)

Microsoft research: A fork() in the road

Posted Apr 11, 2019 22:33 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> Because fork() is necessary to allow complex control of child environment without excessive API surface (spawn() functions with 42 arguments, etc.). So it needs to be supported.
As demonstrated by Zircon in Fuchsia, it's not necessary. You can download the Fuchsia SDK yourself and check it, it's available right now as a counter-example to your point.

> Unless of course you'd rather have a spawn() function that takes as an argument a BPF program that sets up the child environment before the new process image is executed ;)
No, I would have a family of process-management functions that accept the target process handle as a parameter and ability to create suspended processes.

Microsoft research: A fork() in the road

Posted May 31, 2021 17:35 UTC (Mon) by immibis (subscriber, #105511) [Link] (1 responses)

Java has perfectly functional language-level isolation primitives, and although not everything in the standard library is well-behaved, most things are - no different from the C library, really.

There is generally no good reason you should split your Java app into multiple processes just because the OS demands it. Half the point of Java is to shield you from such things, is it not? If you want to split up your app into multiple cooperating modules - as you should - you can do that within the one process.

Microsoft research: A fork() in the road

Posted Jun 1, 2021 1:42 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Java doesn't really handle isolation well. Threads can leak, the heap is shared, etc.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 3:46 UTC (Fri) by roc (subscriber, #30627) [Link] (2 responses)

Private file mappings need COW.

Shared anonymous mappings sometimes need COW too, but you just can't have that in Linux/POSIX.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 10:35 UTC (Fri) by ecree (guest, #95790) [Link] (1 responses)

> Private file mappings need COW.

Yeah I was getting my terminology a bit confused last night.

What I was trying to say was that malloc() memory 'normally' needs COW and file mappings 'normally' don't.

The problem mappings are those which have a _separate_ mapping in the child, which is actually the private ones; shared mappings remain mapped in the child but without COW (I think?), and there's no kind of M_CLOFORK mapping that just isn't mapped in the child at all (which is what my brain late last night said private meant).

> Shared anonymous mappings sometimes need COW too

Why? If it's a shared mapping, then writes by the child should be visible in the parent and vice-versa, so both processes can map the same page and no need to COW. What am I missing?

Microsoft research: A fork() in the road

Posted Apr 12, 2019 22:12 UTC (Fri) by roc (subscriber, #30627) [Link]

> The problem mappings are those which have a _separate_ mapping in the child, which is actually the private ones; shared mappings remain mapped in the child but without COW (I think?)

That's correct.
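A quick way to see the difference, as a Python sketch (the helper name `after_child_write` is invented for the example): map one anonymous page, fork, let the child overwrite it, and look at what the parent sees afterwards.

```python
import mmap
import os

def after_child_write(flags):
    """Return the parent's view of a page after a forked child writes to it."""
    region = mmap.mmap(-1, mmap.PAGESIZE, flags=flags | mmap.MAP_ANONYMOUS)
    region[:5] = b"start"
    pid = os.fork()
    if pid == 0:
        try:
            region[:5] = b"child"     # private: hits the child's COW copy only
        finally:
            os._exit(0)
    os.waitpid(pid, 0)
    seen = bytes(region[:5])
    region.close()
    return seen
```

With `mmap.MAP_SHARED` the parent reads back `b"child"`, because both processes map the same page. With `mmap.MAP_PRIVATE` it still reads `b"start"`, because the child's write went to its own copy.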

> and there's no kind of M_CLOFORK mapping that just isn't mapped in the child at all (which is what my brain late last night said private meant).

That's true. Though there is madvise(MADV_DONTFORK) which gives you similar functionality.
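A rough Linux-only sketch of what that looks like in practice, using Python's `mmap.madvise` (Python 3.8+). The helper names are invented for the example, and it assumes that touching an address that is not mapped in the child kills it with SIGSEGV.

```python
import mmap
import os
import signal

def child_can_read(mapping):
    """Fork and report whether the child survived reading the first byte."""
    pid = os.fork()
    if pid == 0:
        try:
            mapping[0]                # faults if the range is absent in the child
        finally:
            os._exit(0)
    _, status = os.waitpid(pid, 0)
    crashed = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGSEGV
    return not crashed

def dontfork_demo():
    """Compare a normal private mapping with one marked MADV_DONTFORK."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    kept = mmap.mmap(-1, mmap.PAGESIZE, flags=flags)
    dropped = mmap.mmap(-1, mmap.PAGESIZE, flags=flags)
    dropped.madvise(mmap.MADV_DONTFORK)   # exclude this range from children
    return child_can_read(kept), child_can_read(dropped)
```

The ordinary mapping is readable in the child, while the `MADV_DONTFORK` one is simply missing from the child's address space, so it never needs to be copied or accounted for at fork time.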

> > Shared anonymous mappings sometimes need COW too
> Why? If it's a shared mapping, then writes by the child should be visible in the parent and vice-versa, so both processes can map the same page and no need to COW. What am I missing?

As discussed in the paper that spawned this thread, sometimes fork() is used to create checkpoints of process state (e.g. rr and Redis do this). COW makes this extremely efficient for MAP_PRIVATE pages, which is great, but it doesn't work with MAP_SHARED pages, so rr (not sure about Redis) has to eagerly copy them into the checkpoint. This is bad.

The MAP_PRIVATE/MAP_SHARED model is too inflexible. It would be better to have a model where you can create memory objects backed by files or anonymous memory, and then explicitly COW-clone them (and of course map those objects into your address space, pass them to other processes, etc). The Fuchsia documentation isn't great but it seems to have this kind of API. This would require the kernel to manage a tree of COW-clones for each memory object, but that isn't very different to today where Unix kernels have to manage a tree of COW-clones of process address spaces.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 18:38 UTC (Wed) by sjfriedl (✭ supporter ✭, #10111) [Link]

> A rather conspicuous claim from a man working for an OS vendor that doesn't have that call.

From the paper:

> Ironically, the NT kernel natively supports fork; only the Win32 API on which Cygwin depends does not

Microsoft research: A fork() in the road

Posted Apr 10, 2019 18:47 UTC (Wed) by thoughtpolice (subscriber, #87455) [Link] (8 responses)

> A rather conspicuous claim from a man working for an OS vendor that doesn't have that call.

Three of the (four) authors are not from Microsoft Research. You're going to be surprised when you find out what it is the majority of MSR employees do (hint: it's research, and a lot of it uses Linux.)

> Personally I think fork is a nice enough call and actually an advantage to have. For example, Git rather suffers in performance on Windows because of the lack of that call (amongst other things).

This is just a flippant response. The paper's argument has nothing to do with whether fork is "nice enough" or "easy to use"; you can use `posix_spawn` for that. It argues that the contract implementations of `fork()` must provide is a large burden on the design of new systems, and one worth reconsidering. For example, the section that outlines the design of K42, an object-capability system, describes how its design was significantly burdened by general `fork()` semantics, because suddenly you have to start talking about what state objects can be in after they are cloned, rather than when they are created fresh. Similarly, because `CreateProcess` actually spawns a separate process (vs fork), it can work in environments that have a single address space -- for example, by porting their APIs to different runtimes, you can support single-address-space multi-process SGX enclaves with the same API, same with unikernels, etc. (Section 5). If you actually treat the *process* as the object of work (the object which your vocabulary is designed around) then it's easy to see how this works vs fork. Even if fork is "nice" to use for users, it has significant design ramifications throughout the system.

Also, `CreateProcess` on Windows isn't "expensive" because mapping in a new address space and running main() is expensive, or because Windows is just mysteriously terrible at it and nobody ever cared. It's expensive because actually initializing user contexts in a Windows process is expensive (that requires kernel locks, context switches to system services, thread/stack initialization, etc.), but this is a wholly separate design issue. Windows having technical debt is no reason to believe a well-implemented spawn mechanism can't have excellent performance (and given the chance to start fresh, if you wanted a high-level spawning API, you'd almost certainly try to *ensure* it has good performance from the get-go).

The paper never even makes the argument "`CreateProcess` in Windows is better and easier to use than `fork`" (which some other people here have read somewhere, and something that would be pretty hard to argue anyway), merely that the actual underlying design distinction -- an API revolving around spawning actual processes as a kind of first class object in the system without imposing semantic requirements on the memory subsystem -- is perhaps a better one, moving forward. The arguments seem pretty good, to me. I could live without `fork()` if it meant I got something in return, and it seems I do get some things in return.

It also comes across as a good example of where higher level, domain-specific APIs for users give more flexibility than a low level one when considering the overall system design. A higher level API necessarily gives more degrees of freedom in the implementation which, in turn, helps isolate users from underlying implementation details. You can of course go too far here, but if you capture the domain properly, then you can have the advantages of implementation freedom combined with the exact control you need.

Here's a similar discussion I had recently: if you want to write an efficient C program (energy efficient, time efficient, whatever), compared to some existing one, it will never be enough to just throw a new compiler or better optimizations at said existing program -- you cannot choose better instructions or register allocate your way out of it, etc. That will be peanuts compared to real gains you can have. Real optimization comes from choosing a different design, different algorithms and different memory layout -- the kind of decisions that are impossible for the C compiler to make for you while preserving semantics. But tiling, memory locality optimizations are much more common and practical in more restricted settings if you take away a few degrees of freedom from the user, and give it back to the computer -- the Halide image compiler is an example of this. So it is not a problem, or even a "failure", of individual *technical tools* that you are using (calling it a "failure" is not based on any technical understanding of the problem, but on the deep, social, human desire to assign blame, in order to rationalize and reduce complex interactive failure into singular causes.) It is an impedance mismatch in the *abstract language* you are using to communicate with the machine, in turn, restricting the degrees of freedom the computer has for response. Language problem, not a technical one.

Generally I enjoy LWN but most people here have (IMO) failed to level any actual substantial criticisms of the paper (beyond made up ones in their head) before dismissing it offhand, which is sad because it's pretty well written and easy to approach. You don't have to have .patch files in hand every time you want to take some basic idea and run with it and see where it might lead; and in fact, such a demand hampers actual progress more than anything, but that's another discussion...

Microsoft research: A fork() in the road

Posted Apr 10, 2019 19:36 UTC (Wed) by smoogen (subscriber, #97) [Link]

Thank you for the long form explanation on the items. It helped clarify some questions I had.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 21:42 UTC (Wed) by nix (subscriber, #2304) [Link] (5 responses)

Yeah. The problem with dropping fork() is really twofold: you lose the ability to create multiple long-running processes that share file descriptions, signal masks etc but not MM (this is quite rare but important when it happens), but more importantly you lose the ability to execute arbitrary code to set up the child between fork() and exec().

*This* is a killer, because it means you are constrained to whatever process-setup code the people who specified the replacement (posix_spawn(), say) happened to think of, and you can't add to it because in a system without fork() you cannot implement your own spawn replacement, but have to rely on whatever limitations the one in the OS happened to provide. The fact that the posix_spawn() API family is already a horrible tentacular monster and is *still growing* and that it is trivial to generate scenarios it cannot handle suggests that this is a rather serious limitation, and a limitation that bites real code. (Even the scenarios it can handle are really hard to read because it has to handle so many cases that it really wants to be a programming language but isn't.)

Microsoft research: A fork() in the road

Posted Apr 10, 2019 22:37 UTC (Wed) by wahern (subscriber, #37304) [Link] (2 responses)

The paper recommends a cross-process operation primitive, not something like CreateProcess or posix_spawn, which will always fall far short of the ability to execute arbitrary code.

But Unix *has* cross-process operations in ptrace. Nobody is really clamoring to use that interface to build a better fork replacement because fork *already* represents a compromise between how much complexity to put into the kernel and how much complexity to put into userspace, and additionally how costly (mostly in complexity, not performance) the implementation must be. It shouldn't be shocking that those responsible for the kernel side are complaining about having to put in so much work; nor should it be shocking that they heavily discount the cost to user space by shifting the remainder of the burden to them.

Taken to its logical end, the paper's argument basically mirrors the arguments for microkernels. And while I think microkernels are great and am eagerly awaiting an excuse to put seL4 to some use, fork+exec is sufficiently flexible and performant to have ushered in the age of containers and other more complex process management strategies.

While cross-process operations would be more powerful we can't underestimate the cost necessary in building the stack of software that would be necessary to bring the promise to reality. It's the same inconvenient truth as with microkernels. fork+exec is just too good enough, whether by accident or design.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 22:43 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

The problem with ptrace() is that it's scary. I won't use it for general purpose software. It's also pretty slow, since nobody cared too much to optimize it.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 22:14 UTC (Fri) by roc (subscriber, #30627) [Link]

As a maintainer of rr, perhaps the heaviest ptrace() user ever: you're not wrong.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 22:45 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> Yeah. The problem with dropping fork() is really twofold: you lose the ability to create multiple long-running processes that share file descriptions, signal masks etc but not MM

The most important part of this is sharing of FDs, and Linux could use something like DuplicateHandle from NT: https://docs.microsoft.com/en-us/windows/desktop/api/hand...

But again, this needs an API that has a process handle as a first-class object.

Microsoft research: A fork() in the road

Posted May 31, 2021 17:38 UTC (Mon) by immibis (subscriber, #105511) [Link]

As wahern has already stated:

> The paper recommends a cross-process operation primitive, not something like CreateProcess or posix_spawn, which will always fall far short of the ability to execute arbitrary code.

It recommends that if you want to redirect a file descriptor, for example, you should just be able to "remote-control" the child process to issue that call, before you unsuspend it.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 13:35 UTC (Thu) by roblucid (guest, #48964) [Link]

Implementers love not supporting stuff that is a "burden", but fork(2) allows parent and child to share information, or the parent to set up environment state that the child has no idea about, e.g. shell redirection. It opened up a whole number of possibilities, without coding them into every programme.
For users/application programmers that flexibility is a major feature, a huge increase in the power of the system.
Over-commit is an optimisation feature which, like networks, can cause temporary or sudden failure of operations that are apparently guaranteed by the OS. It was bolted on in POSIX-like OSes. I ran machines which actually needed all of virtual memory backed by swap space, and applications which were careless paid a performance penalty or caused themselves to fail very often due to insufficient virtual memory. So the argument that applications used huge sparsely used blocks of memory was moot; that just didn't fly!
Perhaps programmes that require over-commit to run should have to accept penalties, i.e. being classed as likely memory hogs and prime candidates for being "frozen" to disk in the event the system is under pressure. I'd rather have the applications that use over-commit pay a price in complexity, using some library-provided sparse data structure, rather than every single simple program ever written for the system!

Microsoft research: A fork() in the road

Posted Apr 10, 2019 13:01 UTC (Wed) by naptastic (guest, #60139) [Link]

s/research paper/editorial dressed up as a research paper/;

<3

Microsoft research: A fork() in the road

Posted Apr 10, 2019 13:17 UTC (Wed) by mm7323 (subscriber, #87386) [Link] (15 responses)

Personally I think fork() is a beautiful function. It is incredibly simple in use, but very powerful. fork() + exec() is an elegant pairing.

Windows CreateProcess() takes 10 or so direct parameters, of which some are structures of yet more parameters again. Yuck.

Some of the complaints in the paper seem misplaced too (fork is slow, doesn't scale, forces memory overcommit). fork() has a certain purpose and trying to use it for performance is something which we already know doesn't work that well - that's why webservers went through design iterations of pre-forking, using threads, async IO or some combination of techniques.

Suggesting fork() is insecure possibly has some truth, as inheriting most things by default (except other threads) is the opposite of what would be safest, but it's too late to change. Perhaps a new part() call that inherits little could be made, but the knock-on is then making APIs for everything to opt into inheritance.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 13:54 UTC (Wed) by warrax (subscriber, #103205) [Link] (14 responses)

False dichotomy. CreateProcess() is ugly because of extreme backward-compat requirements, there's nothing inherent in the idea which would make such a function ugly.

Fork() is absurdly hard to use in multithreaded cases because of the implicit forking of resource handles of various kinds. It's also really hard to do error handling around it correctly(!). In languages like C it's certainly *possible* to use it correctly though it's rare to see people get the edge cases right, but try asking any language runtime implementors how much fun they had having to work around its nightmarish semantics.

There's nothing elegant about fork() at all -- it overconstrains implementations by way of its semantics for a use case which is almost always going to be "start a new subprocess /usr/bin/something with arguments X, Y, Z". It's a complete mess because of its implicitness. Explicit over implicit all the way!

Microsoft research: A fork() in the road

Posted Apr 10, 2019 14:15 UTC (Wed) by smoogen (subscriber, #97) [Link] (1 responses)

I think the issue is that for people who don't deal with multi-threading (which is still a lot of coding) fork is the hammer-rock which does the job that the coder wants. It is simple, brutal and gets it done. Most people are happy with hammer-rocks because you get something completed, you get to take out some aggression safely, and you can go focus on some other problem.

The problem is that when you get to threaded screws, the hammer no longer works well, and you end up with splintered walls and bashed fingers. So you upgrade your toolbox with better hammers, and maybe some screwdrivers. You might even go with a toolbox with no hammer in it (aka Windows). You quickly find that all of the remaining tools still have enough rusty bits that every program still gives you a bad case of tetanus, gangrene and blood poisoning while trying to deal with threads.

In the end, papers do not fix things, especially papers which do not give code which clearly shows a better solution. They may provoke people to think about building better tools, but even then it takes multiple generations of people who are happy with their rocks to retire before you even get claw hammers or Phillips head screwdrivers. And even then you will find that someone stuck a nice sharp rusty point on the #1 Phillips head and the fix was to keep wrapping it in indirection duct-tape until it only pokes you now and then.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 19:33 UTC (Wed) by smoogen (subscriber, #97) [Link]

And thanks for the clarification above on fork in Windows.. I learned on the WinABI that Cygwin uses so was used to not having 'fork'. So in my case, I was definitely using a subset of the tools but should have looked to see if the bigger toolbox had it before I wrote anything more.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 14:26 UTC (Wed) by mm7323 (subscriber, #87386) [Link] (9 responses)

> False dichotomy. CreateProcess() is ugly because of extreme backward-compat requirements, there's nothing inherent in the idea which would make such a function ugly.

Of course, you could say the weaknesses of fork() remain because of the backward-compat requirement too. But fork() + exec() is still a prettier API that's easy to teach.

> Fork() is absurdly hard to use in multithreaded cases

Yes, fork() and threads don't really mix well, as you won't normally know what state other threads were in, and so what their memory state was at the point of fork(), unless you acquire locks first as you would for any normal case of accessing another thread's state. Threads don't mix well with other things too, e.g. mixing locks, condition variables and poll()/select()/epoll(). Certainly in C and similar languages, use of threads requires forethought about the overall program structure and care over data ownership.

> almost always going to be "start a new subprocess /usr/bin/something with arguments X, Y, Z"

posix_spawn() may be what you are after then, though its man page says that it only offers a subset of fork() + exec() functionality. Forking servers are quite a common use case too, and again, are simple to teach and understand, even if the inherit-by-default semantics can present a booby trap.
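For the "start a new subprocess with arguments X, Y, Z" case, a minimal posix_spawn sketch in C might look like the following. The helper name `spawn_and_wait` and the choice to send the child's stdout to /dev/null are assumptions made for illustration; note that the redirection has to be expressed as a "file action" object because no user code runs in the child before the new program starts.

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: run `sh -c cmd` with stdout sent to /dev/null.
 * Returns the child's exit status, or -1 on failure. */
int spawn_and_wait(const char *cmd)
{
    char *argv[] = { "sh", "-c", (char *)cmd, NULL };
    posix_spawn_file_actions_t actions;
    pid_t pid;

    if (posix_spawn_file_actions_init(&actions) != 0)
        return -1;

    /* Declarative replacement for "open() + dup2() in the child". */
    int rc = posix_spawn_file_actions_addopen(&actions, 1, "/dev/null",
                                              O_WRONLY, 0);
    if (rc == 0)
        rc = posix_spawnp(&pid, "sh", &actions, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&actions);
    if (rc != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Anything the file-actions and attribute objects cannot describe simply cannot be done this way, which is the subset limitation mentioned above.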

> It's a complete mess because of its implicitness.

It's actually all the other calls that need various forms of CLOEXEC and preparation which make a mess, but that's semantics. A quick grep of /usr/include shows this:


O_CLOEXEC, FD_CLOEXEC, EFD_CLOEXEC, EPOLL_CLOEXEC, F_DUPFD_CLOEXEC, IN_CLOEXEC, MFD_CLOEXEC, SFD_CLOEXEC, SOCK_CLOEXEC, TFD_CLOEXEC, DRM_CLOEXEC, FAN_CLOEXEC, UDMABUF_FLAGS_CLOEXEC

Microsoft research: A fork() in the road

Posted Apr 10, 2019 15:15 UTC (Wed) by barryascott (subscriber, #80640) [Link] (4 responses)

> It's actually all the other calls that need various forms CLOEXEC and preparation which makes mess, but that's semantics. A quick grep of /usr/include shows this:
>
> O_CLOEXEC, FD_CLOEXEC, EFD_CLOEXEC, EPOLL_CLOEXEC, F_DUPFD_CLOEXEC, IN_CLOEXEC, MFD_CLOEXEC, SFD_CLOEXEC, SOCK_CLOEXEC, TFD_CLOEXEC, DRM_CLOEXEC, FAN_CLOEXEC, UDMABUF_FLAGS_CLOEXEC

But that is the point the paper makes. Because of the fork() design *everything* else has to work around the limitations.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 15:25 UTC (Wed) by mm7323 (subscriber, #87386) [Link] (3 responses)

fork() existed long before things like timerfd(), epoll() and so on. And yet these systems decided that CLOEXEC should still be an opt-in extra step, rather than the [sensible] default.

It would have been a bold decision to make new sub-systems set CLOEXEC by default, and it is perhaps only obvious now, with hindsight, that such a default would have been saner. But it's not the fault of fork() that it came first.

And still there is no better suggestion of a replacement or upgrade.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 2:10 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

A better answer might have been to introduce close_and_exec() which closes open file descriptors (etc) apart from those explicitly passed in to keep open.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 5:32 UTC (Thu) by mm7323 (subscriber, #87386) [Link] (1 responses)

I've actually implemented something like that in the past - you can scan /proc/self/fd and generally call close() on what you find there.

One thing this can break is Valgrind, which creates some high-numbered descriptors above the normal ulimit(). By testing the ulimit you can avoid closing these. Other libraries and tools may not be so lucky though, so an O_NOCLOEXEC may be better, and it's actually an O_NOCLOFORK that would be best.
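The /proc scan described above could be sketched roughly as follows in C. This is an assumption-laden illustration, not the code the commenter actually wrote: the name `close_fds_except` is made up, it is Linux-specific, it leaves descriptors 0 to 2 alone, and it does no ulimit test, so it would hit exactly the Valgrind problem just mentioned.

```c
#include <dirent.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

static bool in_list(int fd, const int *keep, size_t nkeep)
{
    for (size_t i = 0; i < nkeep; i++)
        if (keep[i] == fd)
            return true;
    return false;
}

/* Hypothetical helper: close every descriptor above stderr that is not
 * listed in keep[]. Returns the number of descriptors closed, or -1 if
 * /proc/self/fd cannot be opened (e.g. /proc is not mounted). */
int close_fds_except(const int *keep, size_t nkeep)
{
    DIR *dir = opendir("/proc/self/fd");
    if (!dir)
        return -1;

    int self = dirfd(dir); /* the scan's own descriptor must survive */
    int closed = 0;
    struct dirent *ent;

    while ((ent = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(ent->d_name, &end, 10);
        if (end == ent->d_name || *end != '\0')
            continue; /* skips "." and ".." */
        if (fd <= STDERR_FILENO || fd == self ||
            in_list((int)fd, keep, nkeep))
            continue;
        if (close((int)fd) == 0)
            closed++;
    }
    closedir(dir);
    return closed;
}
```

A caller would typically run this in the child between fork() and exec(), passing the descriptors the new program is meant to inherit.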

Microsoft research: A fork() in the road

Posted May 31, 2021 17:44 UTC (Mon) by immibis (subscriber, #105511) [Link]

I recall already seeing this approach. But it has so many moving parts compared to just telling the kernel to do what you want.

What if opening /proc/self/fd fails because too many FDs are open? Okay, then you just close FD 0. But you actually need that one. So close FD 3 instead. You're closing all the FDs, right - so it doesn't matter if you close one prematurely?

What if FD 3 is on your do-not-close list? Okay, just pick the lowest number that isn't.

What if there are too many FDs and they're all really high numbers? Scan the whole 32-bit or 64-bit FD space until you manage to close one, then open /proc/self/fd? (They can be higher than your RLIMIT_NOFILE, if RLIMIT_NOFILE was set to a larger number in the past.)

What if your RLIMIT_NOFILE is zero? Then you can't open /proc/self/fd. But there is nothing to close. But will you detect that and succeed instead of failing?

Actually, there could be open FDs from before RLIMIT_NOFILE was set to zero. Will you temporarily increase it, so you can open /proc/self/fd?

What if /proc isn't mounted? This is actually very likely to come up, IF your code is ever used in a program that creates containers, or perhaps even just from a rescue shell.

Wouldn't it be great if you could *just tell the kernel to do the thing you want it to do*?

Microsoft research: A fork() in the road

Posted Apr 10, 2019 15:53 UTC (Wed) by sjfriedl (✭ supporter ✭, #10111) [Link] (1 responses)

> But fork() + exec() is still a prettier API that's easy to teach.

It's only easier to teach if you ignore the hard parts.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 6:42 UTC (Thu) by nilsmeyer (guest, #122604) [Link]

> It's only easier to teach if you ignore the hard parts.

Isn't that how teaching works, at least in the beginning?

Microsoft research: A fork() in the road

Posted Apr 12, 2019 22:36 UTC (Fri) by warrax (subscriber, #103205) [Link] (1 responses)

(I suspect no one will read this, but...)

> It's actually all the other calls that need various forms CLOEXEC and preparation which makes mess, but that's semantics. A quick grep of /usr/include shows this:
> O_CLOEXEC, FD_CLOEXEC, EFD_CLOEXEC, EPOLL_CLOEXEC, F_DUPFD_CLOEXEC, IN_CLOEXEC, MFD_CLOEXEC, SFD_CLOEXEC, SOCK_CLOEXEC, TFD_CLOEXEC, DRM_CLOEXEC, FAN_CLOEXEC, UDMABUF_FLAGS_CLOEXEC

The implicitness around all of this means that an application *CANNOT* be future-proof. Every time one of these flags got/gets added there's a new failure mode for an application written to the old API.

(I.e. an application cannot -- by definition -- know which *_CLOEXEC flag will be needed in future.)

"Clone shit" is *not* by any means a reasonable specification of behavior.

Microsoft research: A fork() in the road

Posted Apr 13, 2019 2:09 UTC (Sat) by foom (subscriber, #14868) [Link]

Umm? You don't need to use a new flag unless you're using a new syscall.

These flags are all for different APIs that can open a new file descriptor. If you're using fanotify_init, you use FAN_CLOEXEC with it. If you're using open, you use O_CLOEXEC, etc.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 22:45 UTC (Wed) by wahern (subscriber, #37304) [Link] (1 responses)

> there's nothing inherent in the idea which would make such a function ugly.

I think the authors would disagree with you. They claim (rightly) that CreateProcess and posix_spawn are fundamentally incapable of the expressiveness necessary of a core primitive.

Read fairly their claim is that both CreateProcess and fork+exec suck. The fork+exec model is more expressive and powerful, CreateProcess less of a burden on the kernel and more performant. Their preferred alternative is cross-process operations, though as I mention elsethread the pros and cons basically mirror the debate regarding microkernels, IMO, and unsurprisingly (from the perspective of operating system researchers busily writing experimental kernels) substantially shifts the complexity burden to user space software.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 22:53 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> substantially shifts the complexity burden to user space software.

Does it? Something like 90% of cases are simple fork-exec and they can migrate to posix_spawn() as is, even given its deficiencies.

Everything else can probably be expressed much simpler with a newer sane process API.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 13:32 UTC (Wed) by ibukanov (subscriber, #3942) [Link] (13 responses)

The paper argues that fork encourages overcommit of memory, as if it is a feature that one should avoid. But overcommit is pretty much a necessity in VM environments or on modern desktops with memory compression, and cannot be used as an argument against fork.

The real fork drawback is that it does not have sane semantics in multi-threaded programs, and using it with threads that share memory does more harm than good. But fork in single-threaded applications that use it for computational workers works nicely and may even lead to better CPU cache utilization.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 21:00 UTC (Wed) by rweikusat2 (subscriber, #117920) [Link] (12 responses)

Ehh ... sorry but that's a pthreads drawback. The people who designed that came from the "fork useless step in front of exec!!1" camp and imagined their baby supplanting all other uses of fork. That's an ancient holy war clotted into a specification.

NB: I'm not going to read a paper presenting decades-old VMS 'design [irr]rationales' as 'new research'.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 7:23 UTC (Thu) by ibukanov (subscriber, #3942) [Link] (11 responses)

How can a multi-threaded application do a fork in a sane manner? I mean the real fork, not the kind that is followed almost immediately by exec, for which suspending all other threads is OK.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 10:31 UTC (Thu) by dufkaf (guest, #10358) [Link] (10 responses)

And why should a multithreaded app do it if not for the exec? For all other cases it can just create a new thread?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 15:33 UTC (Thu) by ibukanov (subscriber, #3942) [Link] (2 responses)

But if the only point of fork is to call exec, then why should it exist in the first place? What is necessary is an API to create a new suspended process, set it up in any necessary way and start it. CreateProcess and posix_spawn tried to provide a single function both to set up and start the process. The result was bad and awkward. But it does not mean that fork is necessary.

But my point is that for single-threaded applications fork has clear semantics. For example, to spawn a computation, prepare all the data in memory in the parent process, fork, compute and send the results back using a pipe or shared memory. Works nicely.
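The single-threaded "prepare, fork, compute, send results back over a pipe" pattern can be sketched in C like this. The function name `sum_in_child` and the choice of summing an array are placeholders for whatever computation is actually wanted.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: sum data[0..n) in a forked child and return the
 * result to the parent through a pipe. Returns 0 on success, -1 on error. */
int sum_in_child(const long *data, size_t n, long *result)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: the prepared data is already in its (copy-on-write)
         * address space, so nothing needs to be serialised or sent. */
        close(fds[0]);
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += data[i];
        ssize_t w = write(fds[1], &sum, sizeof sum);
        _exit(w == (ssize_t)sizeof sum ? 0 : 1);
    }

    /* Parent: read the answer, then reap the child. */
    close(fds[1]);
    ssize_t r = read(fds[0], result, sizeof *result);
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (r != (ssize_t)sizeof *result ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return 0;
}
```

Because the parent is single-threaded at the time of the fork, none of the lock-state problems discussed elsewhere in this thread arise.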

Microsoft research: A fork() in the road

Posted Apr 11, 2019 17:04 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link]

> But if the only point of fork is to call exec, then why should it exist in the first place?

Even assuming this was true (and it isn't), the original fork use-case would come to mind here: Execute a command in a background process instead of in the current one.

A use-case for 'exec in same process': In-place update of a running program. The currently running instance serializes and records its current state somehow and then execs itself, causing the updated program file to be loaded. The new instance then restores the serialized state and continues where the previous one left off.

Microsoft research: A fork() in the road

Posted Apr 12, 2019 19:09 UTC (Fri) by dufkaf (guest, #10358) [Link]

I was answering "How a multi-threaded application can do a fork in a sane manner?". One can fork as much as needed to create separate child processes first and then possibly create threads in each of those (which I guess is not a problem?), but why fork an already multithreaded process (instead of creating an additional thread) if not for the exec?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 16:58 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (6 responses)

Simple example: I have a program here which is supposed to capture stdout and stderr output of another program and forward that to syslog. In order to guarantee a sensible process hierarchy (the other command should be the child of the parent process of the program, not the log forwarder), this program forks, executes the other command in the original process and provides the capture-and-forward via two threads running in the forked process (because this was easier to implement than using I/O multiplexing).

Microsoft research: A fork() in the road

Posted Apr 11, 2019 17:18 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

Why shouldn't the log forwarder be a child?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 17:35 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (4 responses)

Simple answer: Because the original parent might want the exit status of the payload command and not that of the log forwarder.

More complicated extension: For my usual use-case, the parent of the log forwarder will be a program which monitors another program, restarts that if it terminates unexpectedly and provides facilities for reliable termination of the other program and for reliably sending signals to it (feels like wrong grammar ...). For this to work, it needs to be the parent of the payload process.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 17:51 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Uhm, then why fork()?

In your wrapper just spawn a log process, passing your stdin/stdout to it. Then exec() the payload.

No fork() required.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 18:03 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link] (2 responses)

There's no "wrapper" here. The example I used is a command supposed to log stdout and stderr of another command. And the code of this command has to run in some process, just not in the original one. Besides "this could be implemented in another way" is not an argument. Everything can always be implemented in another way.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 18:07 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

So your code will be cleaner with spawn(), it will work faster (no VM cloning), have less overhead (no need for overcommit) but fork+exec() is better?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 19:07 UTC (Thu) by rweikusat2 (subscriber, #117920) [Link]

I have absolutely no idea what you're writing about.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 15:04 UTC (Wed) by xl2784 (guest, #131031) [Link] (10 responses)

Can someone explain this case? That would be great help:)
“
A simple but common case is one thread doing memory allocation and holding a heap lock, while another thread forks. Any attempt to allocate memory in the child (and thus acquire the same lock) will immediately deadlock waiting for an unlock operation that will never happen.
”

Microsoft research: A fork() in the road

Posted Apr 10, 2019 15:58 UTC (Wed) by metan (subscriber, #74107) [Link] (9 responses)

There is a lock that guards access to the malloc data structures. The race looks like this:

* Thread A calls malloc()
* Thread A acquires malloc() lock
* Thread B calls fork()
* Thread A releases malloc() lock

Now the child of the Thread B ends up with malloc locked for eternity and any attempt to allocate memory will end up with a deadlock there.
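The race can be reproduced deterministically with an ordinary mutex standing in for the allocator's internal lock. The sketch below is an illustration under assumptions: `heap_lock` is not glibc's real lock, the semaphores exist only to force the interleaving listed above, and calling pthread functions in the child of a multithreaded process is, strictly speaking, outside what POSIX promises, although it behaves as shown on Linux.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in */
static sem_t lock_taken;   /* thread A -> main: "I hold the lock"   */
static sem_t may_release;  /* main -> thread A: "you may unlock now" */

static void *thread_a(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&heap_lock);   /* "Thread A acquires malloc() lock" */
    sem_post(&lock_taken);
    sem_wait(&may_release);
    pthread_mutex_unlock(&heap_lock); /* "Thread A releases malloc() lock" */
    return NULL;
}

/* Returns 1 if the child found the lock held, 0 if it was free, -1 on error. */
int child_sees_lock_held(void)
{
    pthread_t tid;

    if (sem_init(&lock_taken, 0, 0) != 0 || sem_init(&may_release, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, thread_a, NULL) != 0)
        return -1;
    sem_wait(&lock_taken);            /* wait until thread A owns the lock */

    pid_t pid = fork();               /* "Thread B calls fork()" */
    if (pid == 0) {
        /* Only the forking thread exists in the child. The mutex was
         * copied in its locked state and its owner does not exist here,
         * so a blocking lock() would wait forever; trylock shows that
         * without hanging. */
        _exit(pthread_mutex_trylock(&heap_lock) == EBUSY ? 1 : 0);
    }

    sem_post(&may_release);           /* the parent carries on normally */
    pthread_join(tid, NULL);
    sem_destroy(&lock_taken);
    sem_destroy(&may_release);
    if (pid < 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Replace the trylock with a plain `pthread_mutex_lock()` and the child never exits, which is the deadlock the paper describes.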

Microsoft research: A fork() in the road

Posted Apr 10, 2019 16:04 UTC (Wed) by xl2784 (guest, #131031) [Link]

That helps a lot. Thanks.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 16:14 UTC (Wed) by mm7323 (subscriber, #87386) [Link] (7 responses)

The solution is to make the memory allocator thread safe (which it already must be in this example) and to use pthread_atfork() to make it fork() safe.

int pthread_atfork(void (*prepare)(void), void (*parent)(void), void (*child)(void));

prepare() should do something like take locks for exclusive access on the malloc() area (potentially blocking until exclusive access is guaranteed), then returning to allow the fork() to proceed. parent() can drop the locks again in the original process & thread, while child() can replace any locks with new ones specific to the child.

Of course, glibc implements both malloc() and pthread_atfork() so can use internal mechanisms to achieve the same, but it's still there for others if needed on other resources and you really have a design that calls for fork() and threads.
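A minimal sketch of the prepare/parent/child scheme for a resource of your own might look like this in C. Everything named here (`table_lock`, the worker thread, the helper functions) is hypothetical. The child handler simply unlocks a default (non-error-checking) mutex, which is the simpler variant of "replace the locks in the child"; it relies on the child's only thread being a copy of the thread that took the lock in prepare().

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int stop_worker;

/* Runs in the forking thread before fork(): wait for exclusive access. */
static void prepare(void)   { pthread_mutex_lock(&table_lock); }
/* Runs after fork() in the parent: resume normal service. */
static void in_parent(void) { pthread_mutex_unlock(&table_lock); }
/* Runs after fork() in the child: its copy of the lock is held by the
 * one thread that exists there, so it can release it. */
static void in_child(void)  { pthread_mutex_unlock(&table_lock); }

static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&stop_worker)) {
        pthread_mutex_lock(&table_lock);
        pthread_mutex_unlock(&table_lock);
        usleep(100);
    }
    return NULL;
}

int install_fork_handlers(void)
{
    return pthread_atfork(prepare, in_parent, in_child);
}

/* Forks while another thread hammers table_lock. Returns 0 if the child
 * could take the lock, 1 if it could not, -1 on error. */
int fork_with_busy_worker(void)
{
    pthread_t tid;

    atomic_store(&stop_worker, 0);
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0)
        _exit(pthread_mutex_trylock(&table_lock) == 0 ? 0 : 1);

    atomic_store(&stop_worker, 1);
    pthread_join(tid, NULL);
    if (pid < 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Without the handlers installed, the child's result depends on whether the worker happened to hold the lock at the instant of the fork.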

Microsoft research: A fork() in the road

Posted Apr 10, 2019 18:29 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (6 responses)

That'd be nice, but it is not always so easy. For example, error checking mutexes are fundamentally incompatible with pthread_atfork: if you release the mutex in the child, it will fail due to the thread id having changed since the time the mutex was created. If you reinitialize it, then it's technically undefined behavior.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 20:08 UTC (Wed) by mm7323 (subscriber, #87386) [Link] (5 responses)

Nope. It's because of things like error checking mutexes that pthread_atfork() is needed.

It's not so much 'reinitialise mutexes in the child' as 'create new mutexes for the new process and replace any references to mutexes from the parent' that needs to happen in the child() call.

Sorry if I wrote reinitialise and threw you previously.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 22:25 UTC (Wed) by pbonzini (subscriber, #60935) [Link] (4 responses)

> but more of a 'create new mutexes for the new process and replace any references to mutexes from the parent' that needs to happen in the child() call.

Still not enough, as "attempting to destroy a locked mutex results in undefined behaviour" (from http://pubs.opengroup.org/onlinepubs/007908799/xsh/pthrea...).

Microsoft research: A fork() in the road

Posted Apr 11, 2019 5:38 UTC (Thu) by mm7323 (subscriber, #87386) [Link] (3 responses)

The child process doesn't touch the inherited mutexes - they don't belong to it, and if on shared memory (via mmap() or shmat() etc...) could interfere with the parent.

So after fork(), when there is only one thread in the child, it just creates its own new locks and synchronisation primitives and off it goes. No reinitialisation or destroying is needed in the child.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 11:58 UTC (Thu) by pbonzini (subscriber, #60935) [Link] (2 responses)

So it leaks the memory they were part of? You cannot free() without a previous pthread_mutex_destroy.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:05 UTC (Thu) by mm7323 (subscriber, #87386) [Link] (1 responses)

It can be done. And the fact is that glibc, bionic, musl and others all implement thread-safe memory allocators that don't fall apart after fork(), even if mixing threads and fork() is a bad idea. You can search the sources of those projects for the exact examples of how they each make it work, but I assure you it works.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:34 UTC (Thu) by pbonzini (subscriber, #60935) [Link]

Sure---they just don't use error checking mutexes. Normal mutexes can be unlocked in the atfork child callback.

But the very fact that the interaction between atfork and error checking mutexes is completely undocumented is a sign that it is not a great API.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 15:27 UTC (Wed) by dullfire (guest, #111432) [Link]

I don't think the paper authors even knew what they were talking about.
One of the many ... dubious claims is "fork() breaks Buffered I/O"... so does read()/write().

Or in other words... if you're going to use functions from multiple abstraction layers (in this case libc vs thin-libc syscall wrappers) then it's on you to properly manage them.
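For readers unfamiliar with the buffered I/O issue being argued about, here is a small Python sketch of it. The helper name write_then_fork is made up for illustration; the point is only that a userspace buffer that has not been flushed is copied by fork(), so both processes eventually write it out.

```python
import os
import tempfile


def write_then_fork(flush_first):
    """Write one line through a buffered file object, fork, let both
    processes close the file, and return what ended up on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    f = open(path, "w")          # buffered: the line stays in userspace for now
    f.write("hello\n")
    if flush_first:
        f.flush()                # push the data to the kernel before forking
    pid = os.fork()
    if pid == 0:
        try:
            f.close()            # flushes the child's copy of the buffer, if any
        finally:
            os._exit(0)
    os.waitpid(pid, 0)
    f.close()                    # flushes the parent's copy of the buffer, if any
    with open(path) as g:
        data = g.read()
    os.unlink(path)
    return data
```

Without the flush the file ends up containing the line twice; with it, once. That is the "manage your abstraction layers" discipline the comment refers to.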

I'm not sure fork() is an inspired design; however, the assertions at the beginning of the paper don't fill me with confidence about the authors.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 16:04 UTC (Wed) by flussence (guest, #85566) [Link]

fork() is pretty fiddly once your software becomes complicated enough that you have to care about the details, I don't disagree with that part. Maybe stop writing programs that suck as hard as modern webpages.

It's also six characters long and good enough for the rest of the world. If this also-ran open-core company can't/won't build something more compelling, then all of this is just intellectual onanism and whining over something they don't have the brains to implement efficiently.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 16:55 UTC (Wed) by magfr (subscriber, #16052) [Link]

I think the paper was well thought out and the references I looked at are good; I did in particular enjoy the BeOS process model. One critique of the paper is that the Abstract and solution (7) chapters are the weakest parts, so I would recommend reading the whole thing.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 17:34 UTC (Wed) by evad (subscriber, #60553) [Link] (4 responses)

I don't understand why the paper is arguing fork must be removed? Surely the way forward is to document using posix_spawn() more frequently? There will still be cases where fork() makes sense, and others where posix_spawn() is better suited.

I can only assume Microsoft wants fork() to be removed from the kernel so it's easier for them to support Linux apps on Windows. Otherwise why ask for its removal? Why not just educate people on the alternatives?

A very confusing paper, and very much an editorial rather than a research document.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 18:25 UTC (Wed) by randomguy3 (subscriber, #71063) [Link] (3 responses)

I think viewing this paper as coming from "Microsoft" misses the point - Microsoft Research is not just some academic PR wing of the company.

The paper gives several motivations, but I reckon the primary one comes from the authors' work as OS researchers interested in making new research operating systems. Currently, fork() usage is so prevalent in UNIX software that they are faced with implementing fork() (which they claim - I see no reason to doubt their experience in this area - infects the entire OS design) or have an OS that can't run huge amounts of existing software out there (removing a valuable testing resource and possible adoption path for the OS).

It's notable that S7 only suggests the OS might be rewritten to not have fork() as a core syscall after the most important software (however you want to define that, I guess) has been rewritten to avoid it. They're a little vague on how either part of that process would happen, but the purpose of the paper is just to convince people that it should be done, not set out a plan for achieving it.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 20:16 UTC (Wed) by mm7323 (subscriber, #87386) [Link]

uClinux (Linux for systems without an MMU) coped pretty well without fork(), though it does have the more restrictive vfork().

Microsoft research: A fork() in the road

Posted Apr 10, 2019 22:34 UTC (Wed) by evad (subscriber, #60553) [Link] (1 responses)

So just build/design/use an OS without it? Why does Linux have to remove it?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 18:24 UTC (Thu) by mrshiny (guest, #4266) [Link]

Because of inertia. Linux is a huge source of software and that software relies on fork(). So if Linux software can be convinced to get rid of fork (whether the kernel does or not), then that software can become portable to newer, maybe better kernels that never had to deal with fork's complications.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 17:50 UTC (Wed) by Liskni_si (subscriber, #91943) [Link]

Back in 2017 I was looking for that missing copy-on-write API the paper talks about, for fast cloning/forking of virtual machines (implemented in and patented by VMware [1], implemented in Xen [2], but not available in QEMU/KVM). The typical use case for that is testing complex systems with expensive initialization (e.g. MySQL test suite [3]). My idea was to reuse most of QEMU snapshotting and live migration infrastructure and just add an optimization that detects when a VM is being migrated (cloned) to another QEMU instance on the same host over a UNIX socket and send the memory snapshot as a file descriptor to memfd (or something like that) which would be a copy-on-write clone of the source VM's memory. And I thought that using reflink on tmpfs might be the perfect API to obtain that copy-on-write clone.

Vlastimil Babka pointed me to https://lwn.net/Articles/717950/ which I understood to mean that it's not at all straightforward to implement and some larger refactorings would be necessary, and that's certainly beyond my ability. So I'm wondering if we're any closer to being able to add such functionality today, and whether others think my idea of reflink on tmpfs is good or bad.

[1]: https://blogs.vmware.com/consulting/2016/09/anatomy-insta...
[2]: http://www.cs.toronto.edu/~brudno/public/pdf/lagar2011sno...
[3]: http://www.cs.toronto.edu/~sahil/suneja-hotcloud14.pdf

Microsoft research: A fork() in the road

Posted Apr 10, 2019 17:58 UTC (Wed) by alogghe (subscriber, #6661) [Link]

I think Fuchsia has a very different approach; it would be good to see a writeup and comparison to fork.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 18:40 UTC (Wed) by randomguy3 (subscriber, #71063) [Link] (1 responses)

An interesting piece. Having dealt with spawning processes in a cross-platform multi-threaded application as part of my day job, I am very sympathetic to the complaints of these researchers (although I'll admit I don't care that much about the difficulties fork() poses for implementing microkernel systems...).

CreateProcess() certainly has its faults (some of which it shares with fork(), such as not defaulting to CLOEXEC), but it's a lot easier to get right than fork()+exec() - the constraints on what you can do after fork() are easy to forget, and hard to even know when there are multiple threads around.

posix_spawn is a good idea, but suffers from several shortcomings (some of which are mentioned in the paper), including poor error returns, some missing basic features (like working directory) and an inherently racy approach to fd inheritance in a multithreaded environment.
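As a point of reference for the posix_spawn discussion, this is roughly what a spawn-style call with explicit file actions looks like through Python's os.posix_spawn() wrapper (available since Python 3.8). The helper spawn_and_capture and the DEMO_ARGV constant are illustrative names, not part of any library.

```python
import os
import sys

# Hypothetical demo command: run the current interpreter with a one-liner.
DEMO_ARGV = [sys.executable, "-c", "print('hello from child')"]


def spawn_and_capture(argv):
    """Start argv[0] with posix_spawn(), redirecting the child's stdout
    into a pipe, and return (exit_status, captured_bytes)."""
    r, w = os.pipe()
    file_actions = [
        (os.POSIX_SPAWN_DUP2, w, 1),   # child's stdout becomes the pipe
        (os.POSIX_SPAWN_CLOSE, r),     # child has no use for the read end
    ]
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    os.close(w)                        # parent keeps only the read end
    with os.fdopen(r, "rb") as out:
        data = out.read()              # returns at EOF, i.e. when the child is done
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), data
```

Everything the child should look like has to be expressed up front as a list of actions, which is where limitations such as the lack of a standard way to set the working directory come from.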

Microsoft research: A fork() in the road

Posted Jun 7, 2021 16:45 UTC (Mon) by immibis (subscriber, #105511) [Link]

Note that the researchers are not talking about CreateProcess() specifically, but CreateProcess-style APIs in general, compared to fork-style APIs in general.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 19:15 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

One interesting method is to create a process in a quiescent state and then just poke it from the parent process until it's ready. Then just start it.

This neatly avoids all the complications of forking and memory overcommit.

That's what Fuchsia does, btw.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 20:01 UTC (Wed) by roc (subscriber, #30627) [Link]

The paper mentions this under "Cross-process operations".

Microsoft research: A fork() in the road

Posted Apr 11, 2019 15:26 UTC (Thu) by sbaugh (guest, #103291) [Link]

I'm hesitant to comment here because it's not done, but I've been working on an implementation for Linux of cross-process operations, so that inchoate processes can be created and manipulated from other processes.

The implementation (as some other comments speculate about) is as a userspace stub which receives syscalls to execute over some transport, and sends their results back. I use a pair of file descriptors, but other transports could be implemented too.
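To make the stub idea concrete, here is a toy sketch in Python. It is not the commenter's implementation: the protocol (one JSON line per request), the tiny whitelist of operations, and the names STUB and StubProcess are all assumptions made for illustration, and the stub here is an ordinary Python child rather than a minimal native one.

```python
import json
import subprocess
import sys

# Source of the hypothetical stub: it reads requests from stdin, runs the
# named operation in ITS OWN process, and writes the result to stdout.
STUB = r'''
import json, os, sys
OPS = {"getpid": os.getpid, "chdir": os.chdir, "getcwd": os.getcwd}
for line in sys.stdin:
    name, args = json.loads(line)
    try:
        reply = {"ok": OPS[name](*args)}
    except Exception as exc:
        reply = {"err": repr(exc)}
    print(json.dumps(reply), flush=True)
'''


class StubProcess:
    """Handle for a not-yet-started 'real' process that we can poke."""

    def __init__(self):
        self._proc = subprocess.Popen(
            [sys.executable, "-c", STUB],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def call(self, name, *args):
        """Ask the stub to perform one operation and return its result."""
        self._proc.stdin.write(json.dumps([name, args]) + "\n")
        self._proc.stdin.flush()
        reply = json.loads(self._proc.stdout.readline())
        if "err" in reply:
            raise RuntimeError(reply["err"])
        return reply["ok"]

    def close(self):
        """End the session; the stub exits when its stdin reaches EOF."""
        self._proc.stdin.close()
        code = self._proc.wait()
        self._proc.stdout.close()
        return code
```

A caller would do something like child = StubProcess(); child.call("chdir", "/"); child.call("getcwd"), with every operation naming the process it acts on. Because the transport is just a byte stream, the same requests could in principle travel over a network connection, as the comment further down notes.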

The issue with ptrace is not just that it's hard to use, not just that it's slow, but also that there can only be one ptracer at a time. A program that used ptrace in normal operation to manipulate its children would be much less compatible with strace, gdb, and other tools. That's not workable for a general purpose API.

Furthermore, ptrace puts limits on what kind of transport can be used between the stub and the main process. It would be nice to use shared memory to send syscall instructions to the stub, to improve performance when much setup must be done. As it stands, with a pipe used for transport, this API is actually network transparent; this could allow for some interesting novel APIs for starting and manipulating processes on different hosts.

The hardest part has been the need to create new abstractions that use this new way of executing syscalls. I couldn't think of an acceptable and performant way to reuse existing functions which implicitly make syscalls in the current process, in this new world where syscalls are done in the explicit context of some arbitrary process handle. So a fair bit of reinvention has been required to support explicitly specifying the process to operate on.

Another difficulty is the book-keeping of resources (file descriptors, paths, pointers) across multiple processes. Treating file descriptors as ints is difficult to keep straight when working with multiple file descriptor tables across multiple processes, where the same int might refer to different file descriptors in different processes. So I've had to develop multiple layers of abstractions for user programs which manipulate other processes: one layer which works with raw int file descriptors, and other layers on top of it which work with file descriptors as a combination of an int and the fd table it is valid within. Similar abstractions are needed for other resources as well.

It's so far very expressive and powerful. It's been surprisingly easy to adapt my development to this new way of spawning and manipulating processes. I definitely think that cross-process operations (more generally, explicitly specifying the thing to act on in all syscalls, instead of implicitly working on the current process or whatever) are the right design for operating systems; it's much more expressive than both the posix_spawn style and the fork style.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 19:38 UTC (Wed) by patrakov (subscriber, #97174) [Link] (3 responses)

I found this phrase in the report:

"""
...found 1304 Ubuntu packages (7.2% of the total) calling fork, compared to only 41 uses of the more modern posix_spawn(). Fork is used by almost every Unix shell, major web and database servers (e.g., Apache, PostgreSQL, and Oracle), Google Chrome, the Redis key-value store, and even Node.js.
"""

That's outright manipulation of the available facts, good enough to be included in propaganda textbooks. For starters, fork() is used not only in a way immediately followed by exec(). E.g., Redis uses fork() as a method to obtain a consistent snapshot of the database in memory, without running a separate executable. While indeed not every use of fork() is justified, the authors could at least not mix examples convertible and not convertible to posix_spawn().
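The snapshot use of fork() mentioned here is easy to show in miniature. The following Python sketch is only an illustration of the pattern, not of how Redis itself is written; snapshot_via_fork and demo_snapshot are hypothetical names.

```python
import json
import os
import tempfile


def snapshot_via_fork(state, path):
    """Fork a child that writes its (frozen) copy of `state` to `path`.
    Returns the child's pid; the parent may keep mutating `state`."""
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            with open(path, "w") as f:
                json.dump(state, f)
            status = 0
        finally:
            os._exit(status)
    return pid


def demo_snapshot():
    """Take a snapshot, mutate the live state, and return both."""
    state = {"n": 1}
    fd, path = tempfile.mkstemp()
    os.close(fd)
    pid = snapshot_via_fork(state, path)
    state["n"] = 2               # does not affect the child's copy
    os.waitpid(pid, 0)
    with open(path) as f:
        saved = json.load(f)
    os.unlink(path)
    return saved, state
```

There is no exec() anywhere in this pattern, which is why it cannot be converted to posix_spawn().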

Microsoft research: A fork() in the road

Posted Apr 10, 2019 20:05 UTC (Wed) by roc (subscriber, #30627) [Link] (2 responses)

The paper explicitly mentions using fork() for snapshotting in Redis, in section 6. They're not trying to hide anything here, they simply observe that these packages use fork(). No need to be so hostile.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 6:48 UTC (Thu) by nilsmeyer (guest, #122604) [Link] (1 responses)

The quoted passage is very ambiguous then and should probably have been left out - or, time permitting, do an analysis of the cases where posix_spawn() should have been preferred over fork().

Microsoft research: A fork() in the road

Posted Apr 12, 2019 18:45 UTC (Fri) by HelloWorld (guest, #56129) [Link]

> The quoted passage is very ambiguous then
No it's not, it is crystal clear. The only thing muddying the waters here is people's interpretation.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 20:28 UTC (Wed) by roc (subscriber, #30627) [Link] (6 responses)

Dismissing this work because it's from Microsoft is lazy. Engage with the arguments, because they're strong.

I've always found fork() much better than CreateProcess() because it's simply untenable for a single function call to set up every aspect of the child process, and the authors acknowledge this in section 6. Their suggestion to get around that is to have system calls that let you change the state of the child from the parent. Unfortunately that creates its own problems --- richer system call API surface with more potential races and security issues (though those races also exist for "modify own process" system calls in the presence of threads).

Another possible approach that they don't mention would be to replace fork() with a spawn() function that can handle common cases, but still support exec(). Then for complicated cases not handled by spawn(), you would spawn() a helper binary that communicates with the parent to complete setting up the child before exec()ing the real binary. Then again, Linux execve() is *also* a big problem, requiring all kernel resources to specify what happens when an execve() occurs, and also having problems with multithreaded processes. (The section of the ptrace() man page on execve() of multithreaded processes is very scary.) So maybe the way to go is to eliminate both fork() and exec() and have spawn() start execution in a standard userspace stub like ld.so, which supports a standard protocol for communicating with the parent to set up the child process environment before entering the real binary.

The paper would have been stronger if they had also discussed issues with execve() and talked about the benefits of eliminating *both* fork() and execve() in favour of spawn().

One thing that worries me about marginalizing or removing fork() is that a COW memory snapshot system call is still very needed to implement rr replay and other things. fork() being that syscall is great for us because it's so commonly used, it's guaranteed to work efficiently and well. Then again, a dedicated COW snapshot call could potentially eliminate some of the problems with using fork() for this, e.g. the fact that shared memory segments aren't copied.

Microsoft research: A fork() in the road

Posted Apr 10, 2019 21:20 UTC (Wed) by roc (subscriber, #30627) [Link] (4 responses)

> So maybe the way to go is to eliminate both fork() and exec() and have spawn() start execution in a standard userspace stub like ld.so, which supports a standard protocol for communicating with the parent to set up the child process environment before entering the real binary.

Even better, have the spawn() syscall specify an arbitrary executable to do the job of ld.so, and then pass the real executable to it as one of the parameters you send over IPC. Then you can do whatever you want to set up the process even if the standard component doesn't support it.

Of course the elephant in the room is that even if everyone in the world agrees that fork() should go, getting rid of it in the application software people care about is a very long-term project whose benefits would take a long time to be realised. Perhaps all the more reason to start working on a transition now, initially by designing, implementing and deploying the replacement APIs.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 4:08 UTC (Thu) by dw (subscriber, #12017) [Link] (3 responses)

I don't think it's necessary to have any specially formatted stub, just a call like:

create_process(
    /* helper program */
    const char *path,
    int argc, 
    const char *argv[],
    const char *env[],
    /* fds count */
    int fdc,
    /* mapped in new process 0..fdc */
    int fds[]
);

Where libc might ship a static (and therefore almost invisibly fast) helper to interpret the contents of e.g. a memfd passed in on a known descriptor; the helper would initially implement the posix_spawn calls. An even simpler option might dump the FD mapping arguments for a fixed behaviour of starting the program with only one fd connected to a UNIX socket, with FD passing used as desired to communicate additional objects to the helper.

The goal as with yours of course is to avoid putting any of this policy in the kernel again if it could be practically avoided, and making it simple to iterate the userspace helper without changing any OS interface or even having to wait for libc
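The "FD passing" mechanism referred to above is SCM_RIGHTS over an AF_UNIX socket. A minimal Python sketch follows, assuming Python 3.9+ for socket.send_fds() and socket.recv_fds(); to stay short it sends the descriptor across a socketpair within one process, and the function names are made up for illustration.

```python
import os
import socket


def pass_fd_over_socketpair(fd):
    """Send `fd` over an AF_UNIX socketpair and return the descriptor
    number that arrives on the other end (a new fd, same open file)."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with a, b:
        socket.send_fds(a, [b"fd"], [fd])
        _msg, fds, _flags, _addr = socket.recv_fds(b, 1024, 1)
    return fds[0]


def demo_fd_passing():
    """Pass the write end of a pipe and check that the received
    descriptor writes into the same pipe."""
    r, w = os.pipe()
    w2 = pass_fd_over_socketpair(w)
    os.write(w2, b"x")
    data = os.read(r, 1)
    for fd in (r, w, w2):
        os.close(fd)
    return w2 != w, data
```

Between a parent and a helper the two socket ends would live in different processes, but the send and receive calls are the same.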

Microsoft research: A fork() in the road

Posted Apr 11, 2019 15:42 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

Explicitly forwarding fds sounds nice, but wouldn't that interfere with being able to communicate with a tool deeper in a pipeline?

exec 4>logfile
some_command | analysis --log-file=/proc/self/fd/4 | transform | moreanalysis --log-file=/proc/self/fd/4

This also could affect something which does intermediate shell scripts before launching the real binary or things like `git` fork/exec'ing into a non-builtin subcommand.

Maybe this is just a silly use case and not worth all the CLOEXEC stuff. Though there is some set of tools I saw around where everything was done via "set one thing and then exec the next" for everything from environment modification to dropping privileges. That might have more of an issue there.

Maybe the better default is to have everything be CLOEXEC by default, but once something is not CLOEXEC, it sticks around after an exec transitively (stdin/stdout/stderr would default to being not-CLOEXEC)? Of course, this is a much more expansive API change.
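One existing data point for "CLOEXEC by default": since Python 3.4 (PEP 446), file descriptors created by Python are non-inheritable unless the program opts in. A short sketch, with inheritable_flags as a made-up helper name:

```python
import os


def inheritable_flags():
    """Return the inheritable flag of a fresh pipe fd before and after
    explicitly opting in to inheritance."""
    r, w = os.pipe()
    try:
        before = os.get_inheritable(w)   # False: close-on-exec is the default
        os.set_inheritable(w, True)      # explicit opt-in for this one fd
        after = os.get_inheritable(w)
    finally:
        os.close(r)
        os.close(w)
    return before, after
```

This is a language-runtime policy layered over the kernel's default, not a change to exec() itself, so it does not give the transitive "sticks around after an exec" behaviour suggested above.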

Microsoft research: A fork() in the road

Posted Apr 11, 2019 19:52 UTC (Thu) by roc (subscriber, #30627) [Link] (1 responses)

> Explicitly forwarding fds sounds nice, but wouldn't that interfere with being able to communicate with a tool deeper in a pipeline?

It's unclear what you're relying on here. Are you making use of the behaviour that the default behaviour of inheriting file descriptors through fork/exec allows you to smuggle fds from your process to its grandchildren?

If so, that is indeed fundamentally incompatible with the desire to inheriting capabilities by default. There are a few ways to work around it. One is to add features to specific processes (e.g. shells) to notify them of fds that they should pass forward into spawned children. Another is to make that a library feature so that when you create a process you can pass in a set of inheritable fds, and make your library spawn function have an option that lets you opt into forwarding those fds. Another would be to stop relying on inheritance and use something else like AF_UNIX sockets to communicate with the grandchildren.
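The "library spawn function with an explicit set of inheritable fds" option already exists in some runtimes. As one example, Python's subprocess module closes everything except stdin/stdout/stderr by default and forwards only what is listed in pass_fds. The helper name child_writes_to below is hypothetical.

```python
import os
import subprocess
import sys


def child_writes_to(fd_passed):
    """Spawn a child that tries to write to the parent's pipe by fd
    number; return (child_exit_code, bytes_received_by_parent)."""
    r, w = os.pipe()
    code = f"import os; os.write({w}, b'ok')"
    proc = subprocess.run(
        [sys.executable, "-c", code],
        pass_fds=(w,) if fd_passed else (),   # explicit forwarding list
        stderr=subprocess.DEVNULL,            # hide the child's traceback
    )
    os.close(w)
    data = os.read(r, 16)                     # b"" if nothing was written
    os.close(r)
    return proc.returncode, data
```

With the fd listed, the child writes successfully; without it, the descriptor does not exist in the child and nothing arrives. Forwarding to grandchildren would require each intermediate process to pass the fd along in the same explicit way.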

Microsoft research: A fork() in the road

Posted Apr 11, 2019 19:52 UTC (Thu) by roc (subscriber, #30627) [Link]

"desire to AVOID"

Microsoft research: A fork() in the road

Posted Apr 10, 2019 21:48 UTC (Wed) by nix (subscriber, #2304) [Link]

Their suggestion to get around that is to have system calls that let you change the state of the child from the parent.

This is a very nice suggestion, indeed -- you can tell I've been through the fire, because my first thought was "if they're nice to use, we could replace most of ptrace() with them! yeaaaaahhhhhh!". (I mean, yes, PTRACE_SEIZE is far nicer than the old model, but it's still a horrible syscall to use, though much of its horror has to do with unrelated problems like signal handling that only apply to processes that have started running...)

Microsoft research: A fork() in the road

Posted Apr 11, 2019 4:24 UTC (Thu) by joncb (guest, #128491) [Link] (6 responses)

I question the assertion "that fork’s continued existence as a first-class OS primitive holds back systems research".

There's no OS police that will come knock down your door and arrest you if you don't implement fork in your research system. People create all kinds of research OSes that make all kinds of oddball decisions all the time, unikernels being a specific example of a system that (generally) doesn't implement fork (or certainly doesn't implement fork the same way as everyone else). Sure if you want a POSIX compatibility layer then you need to implement fork because fork is a required part of POSIX but if your hope is to change that then don't be surprised if people tell you to go to hell for wanting to inconvenience the roughly 20M software developers and 4B software users who rely on fork semantics on a daily basis to reduce the amount of work that maybe 10K of OS researchers will have to do in those rarish circumstances where they need to.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 6:22 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> developers and 4B software users who rely on fork semantics on a daily basis to reduce the amount of work that maybe 10K of OS researchers will have to do in those rarish circumstances where they need to
Most of the fork() stuff is used for simple fork+exec. Which is totally dumb.

A special checkpoint/restore functionality for people who NEED it explicitly (see: Redis) would be much better.

After all, why should this very niche use dictate the design of the OS?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 6:55 UTC (Thu) by nilsmeyer (guest, #122604) [Link] (1 responses)

> Most of the fork() stuff is used for simple fork+exec. Which is totally dumb.

Is that conjecture or can you back up that claim with data? That might make for an interesting research project.

> A special checkpoint/restore functionality for people who NEED it explicitly (see: Redis) would be much better.

That's an interesting idea, is there an implementation of that in any OS? The problem is as long as you don't have a critical mass of systems implementing the new semantics you'll still have to use fork() and then the question quickly becomes whether or not it's worthwhile to cover other cases.

> After all, why should this very niche use dictate the design of the OS?

Compatibility with existing software.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 7:32 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> Is that conjecture or can you back up that claim with data?
I actually did this some time ago during one of the older flame wars about process API. I traced the fork() call and the exec syscall, and checked the difference over the course of several hours. They matched within 5%.

> That's an interesting idea, is there an implementation of that in any OS? The problem is as long as you don't have a critical mass of systems implementing the new semantics you'll still have to use fork() and then the question quickly becomes whether or not it's worthwhile to cover other cases.
Fuchsia doesn't have it as an easy-to-use built-in (yet?), but it's implementable through its core API: https://fuchsia.googlesource.com/zircon/+/HEAD/docs/sysca...

You can clone your VMAs with CoW semantics: https://fuchsia.googlesource.com/zircon/+/HEAD/docs/sysca... and map them into a new process if needed or use however else you want.

> Compatibility with existing software.
fork() can be implemented on top of checkpoint()/restore() mechanisms as a fallback. Perhaps with some efficiency hit.

A huge amount of software is already ported to Windows which doesn't have fork() support, so it's unlikely that porting to a new API will be an insurmountable problem.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 9:18 UTC (Thu) by mm7323 (subscriber, #87386) [Link] (2 responses)

Perhaps the paper's authors are just aggravated that there aren't any free and open high quality user spaces that don't heavily rely on fork(), and thus if they want to push a toy research OS further, they have to provide fork(), even if that is counter to other OS design goals.

Had fork() & exec() been higher level operations with more descriptive parameters, like CreateProcess(), they would perhaps have been able to implement wrappers through different research OS primitives or mechanisms and then quickly gain access to a rich user-space environment supporting real-world workloads.

Sucks to be them.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 15:30 UTC (Thu) by beagnach (guest, #32987) [Link] (1 responses)

Really infantile comment.

The core argument is that fork() forces undesirable design choices into every layer of any system that implements it, Linux being affected by this as much as any "toy research OS". The authors' point is that we have become so accustomed to fork()/exec() being the "natural" way to handle process creation that we have become oblivious to the compromises entailed.

I get that reading and understanding a 6 page article by heavyweight computer researchers is hard work, but still... this is LWN, which is valued for the quality of its reporting and technical discussion.

If the best you can come up with is "Sucks to be them" then why not head on over to slashdot (assuming it still exists).

Microsoft research: A fork() in the road

Posted Apr 11, 2019 21:53 UTC (Thu) by mm7323 (subscriber, #87386) [Link]

Actually, guest, I think it is your ad hominem response that should be over at Slashdot.

For the benefit of others, I apologise if my choice of wording is inflammatory, but I don't think it is incorrect. As the top level comment says, there is no 'fork() police' forcing every OS to implement those semantics. Kernels are free to go a different route, and Linux could add alternative process creation syscalls if desirable. But to date fork() is entrenched in userspace software and to make a relevant and _practical_ kernel you need a reasonably fully featured userspace from somewhere before you can run meaningful workloads and claim you have anything but a toy.

So I think the paper may be born out of the frustration research kernel developers see when faced with some of the following choices:

1) Implement fork() with its semantics and pitfalls, make little to no new research in that area.
2) Try and port packages to their kernel to produce a usable userspace.
3) Climb the mountain of creating a brand new userspace for the research kernel.

Because of the nuances of the fork() + exec() API, item 2 isn't just porting a libc compatible runtime - instead you need to be looking at every call site to fork() and/or exec() *and* other syscalls, and then patch them in each package. And then potentially maintain those patches if the research kernel is to stay up to date. It's a lot of work, and I dare say not the most interesting work for kernel researchers to be undertaking, and almost completely secondary to actual kernel research.

This is where I think Fuchsia/Zircon has a real advantage. While Bionic supports both fork() and exec(), its requirement is most likely confined and most of Android 'userspace' is up in the JVM anyway. Within Android Java there are methods that may classically result in fork() + exec() (e.g. Runtime.exec() and ProcessBuilder().start()), but these are high enough up that they don't require exact fork() semantics and so may more easily be converted to different primitives on a new kernel model, benefiting file descriptor and memory abstractions too.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 6:35 UTC (Thu) by pabs (subscriber, #43278) [Link] (4 responses)

Does anyone know which OSes support posix_spawn (or similar semantics)? Does Linux?

Microsoft research: A fork() in the road

Posted Apr 11, 2019 7:55 UTC (Thu) by chatcannon (subscriber, #122400) [Link] (3 responses)

> Does anyone know which OSes support posix_spawn (or similar semantics)? Does Linux?

So far as I can tell from the man page, posix_spawn() on Linux is implemented by the libc and uses the fork() and exec() system calls internally.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 9:49 UTC (Thu) by jwilk (subscriber, #63328) [Link] (2 responses)

Since glibc 2.24, it uses clone() with CLONE_VM+CLONE_VFORK:
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=9f...

Microsoft research: A fork() in the road

Posted Apr 11, 2019 16:02 UTC (Thu) by magfr (subscriber, #16052) [Link]

As is noted in the paper. Reference 38 is to that libc bug.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 23:20 UTC (Thu) by pabs (subscriber, #43278) [Link]

Hmm, clone/vfork don't sound like what I would expect a posix_spawn kernel API would be like.

Microsoft research: A fork() in the road

Posted Apr 11, 2019 20:23 UTC (Thu) by ecree (guest, #95790) [Link] (1 responses)

It's curious how everyone always claims that Unix is beholden to "Worse is Better", but as soon as it makes a design decision that prioritises interface complexity over implementation complexity, suddenly it's a knuckle-dragging Neanderthal that "holds back systems research".

The authors' criticisms seem to revolve mostly around "fork plays badly with $modern_os_feature", but usually the fault lies with $modern_os_feature (threading being the most obvious example).

The snark about goto in section 7 demonstrates that the authors have exactly the kind of ivory-tower attitude to which Unix has always been opposed; a sensible programming course would start bottom-up, with assembly, to impart the crucial concepts of a computer's execution model; and assembly does indeed begin with goto (although it calls it jmp).

fork()/exec() is an example of brilliant taste by the original inventors of Unix; and taste is like jazz: if you have to ask what it is, you ain't never gonna know.

Microsoft research: A fork() in the road

Posted Apr 13, 2019 12:08 UTC (Sat) by farnz (subscriber, #17727) [Link]

I disagree; I think that fork isn't a particularly tasteful interface (although I understand why it was implemented that way for the PDP-7, which had no virtual memory); it conflates three primitive operations in a multiprocess virtual memory OS:

Create a new schedulable task.
Create a new virtual address space.
Clone one virtual address space into another, making copies whenever necessary to ensure that neither side of the clone is surprised by unexpected sharing of memory.

Now, I don't see the problem with conflating the first two options (a "spawn" operation, if you will); they are both simple operations, and you simplify the OS if each schedulable task has a unique virtual address space (we can call this combination a "process" 😀). The last, however, is a complex operation on any OS that lets you have shared memory (rather than doing what early UNIX did, and swapping entire processes to disk in order to switch to another process), and shouldn't be conflated with the first two.

On the other hand, exec is a deeply tasteful interface; it says that process environment setup has a lot of details, and there will be more details in future, so don't try to enumerate them all via a CreateProcess-like interface; instead, just let the user run arbitrary code in their new process to set up the world, and then replace the running code with the code that wants that environment.

With full hindsight on 50-odd years of hardware and software evolution, including the creation of dynamic linking, I'd prefer to see a spawn+exec pair. spawn takes an image to spawn, plus an "overlay" section that gets copied (CoW) to the program interpreter (in the sense that ld.so is an interpreter, not in the sense that Python is an interpreter) to be used in the dynamic linking phase. For statically linked programs, the program interpreter is part of the main program image, and thus gets the overlay section as input. This lets you send state down to the new process, and then have it used to set up the new process environment; the new process might turn out to be a simple helper that just sets up the environment and calls exec, of course (and, indeed, as you get a new section, you can have both helper and original process live in the same image, using the contents of the spawn section to distinguish "executed fresh" from "spawned ready to exec a new process").

Microsoft Research: A fork() in the road

Posted Apr 11, 2019 21:37 UTC (Thu) by gnu_lorien (subscriber, #44036) [Link]

It was a bit of a revelation to read this paper. My production experiences with fork() have never matched up with the pedestal the call has been put on. Most of my searching and reading followed a lot of the commentary here where they asserted I needed to redesign my entire program to fit fork(). This was just never an option and that advice never really guided me to posix_spawn() or clone() for the times in which those are the only features I need. I appreciate having a resource for how to solve those problems better.

Microsoft Research: A fork() in the road

Posted Apr 12, 2019 4:18 UTC (Fri) by lieb (guest, #42749) [Link] (2 responses)

I did a close read of this article and the comments, particularly some of the missed points.

I started off with Tenex (1.34) and moved to UNIX (V6 w/ Univ Ill NCP (Arpanet)) so I've seen a bit. I had to dig out my old TENEX docs to refresh my memory... It has been a while since I wrote a JSYS FORK. I'm not bragging about my age; rather, there needs to be some longer perspective in this discussion.

As the authors hinted, TENEX did have a FORK that would do everything in one go: create the fork, populate it with an image, and start it. There were other variations on the theme such as an equivalent of vfork() but it was not all that great. It was pretty advanced for its day but in the end, it didn't really do much other than introduce a real VM with functioning pagefaults. Its filesystem wasn't much better than its progeny VFAT. For example, with its rigid process architecture, there was no way to do "foocmd < bits.in >stuff.out&". Things like threading were also awkward at best. The UNIX model of fork() + exec() solved a lot of those problems. It worked for two reasons. First, a pid is a global object. I can create a process, tell someone else its pid and they can play with it. A fork was and still is cheap(er) and the model of orphaning a pid to the init proc made backgrounding trivial. That does not seem like much but it is when you try to make a network daemon or batch system and don't have it. Second, splitting the two has real power - and with real power there are also risks (and bugs).

A TENEX fork, just like CreateProcess, was a single shot. Once it has gone into the system, it is gone. Sure, you can manipulate it some but now you have one proc fiddling with another with all the race conditions that implies, as shown by the complexity of ptrace(). CreateProcess solves this problem with its mountain of API args. However...

Back with V6, fork() and exec() were pretty simple. But they already had some things that TENEX didn't have. That power was and still is what goes on between the child's return of fork() and its subsequent exec(). In those days, we didn't have too much to do other than some close() and open() calls, usually redirecting one or more of 0,1,2 and the closing of random other files (very rare but possible), and maybe setuid(), setgid(). There were no capabilities or shared anything to clean up. Since those days the number of things to be policed from the old environment before the new one got launched has grown. For example, execve() popped up because we got this thing called "environment" in V7 which sometimes required a scrubbing of the env vector. It only got better and worse. The authors rightfully note this growth of features and the resource costs that go with them. The costs are there no matter what the model used for creating/managing/destroying them. You will have to do those actions somewhere. The only question is where.
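To make that interim period concrete, here is a minimal sketch in Python, whose os module wraps the same syscalls thinly. The function name run_redirected and the file name are hypothetical, chosen only for illustration; the child does the moral equivalent of ">stuff.out" between fork() and exec().

```python
import os
import sys
import tempfile


def run_redirected(argv, out_path):
    """Fork, redirect the child's stdout to out_path, then exec argv."""
    pid = os.fork()
    if pid == 0:
        # Child: the "interim" code between fork() and exec().
        try:
            fd = os.open(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            os.dup2(fd, 1)   # the file becomes fd 1 (stdout)
            os.close(fd)     # drop the now-redundant descriptor
            os.execvp(argv[0], argv)
        finally:
            # Only reached if open/dup2/exec failed; never return into
            # the parent's code from the child.
            os._exit(127)
    # Parent: wait for the child and report its exit code.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "stuff.out")
        code = run_redirected([sys.executable, "-c", "print('hello')"], out)
        with open(out) as f:
            print(code, repr(f.read()))
```

Anything else the app needs to scrub (extra descriptors, uid/gid, environment) would go in the same place as the dup2() call, which is the point being made above.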

They also criticize all the close-on-exec stuff scattered about in the various kernel subsystems that use an fd. Point well taken. Then again, this is still territory where one really needs to know more than just garden variety algorithms. And, where else would you handle things such as this? The kernel doesn't know the significance of one open fd from another and how would you construct an API extension for clone() to handle such an open ended requirement? This is something for which only the app has the full context. Therefore, it is the app's responsibility. You have a choice to either do it somewhere in the app (the Linux choice) or in a system API somewhere. You cannot fob it off somewhere else.
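As a small illustration of the app making that per-fd decision, the sketch below uses Python, where descriptors are created with close-on-exec set by default and os.set_inheritable() clears the flag. The helper name fd_survives_exec is hypothetical; it forks, execs a tiny child, and reports whether the given descriptor is still open there.

```python
import os
import sys

# Child program: fstat() raises (non-zero exit) if the fd was closed on exec.
CHECK = "import os, sys; os.fstat(int(sys.argv[1]))"


def fd_survives_exec(fd):
    """Return True if fd is still open in a child after fork() + exec()."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(sys.executable, [sys.executable, "-c", CHECK, str(fd)])
        finally:
            os._exit(127)  # only reached if exec itself failed
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    r, w = os.pipe()               # created with close-on-exec set
    print(fd_survives_exec(r))     # the kernel closes it at exec
    os.set_inheritable(r, True)    # the app decides this one is wanted
    print(fd_survives_exec(r))     # now it is passed through
    os.close(r)
    os.close(w)
```

The kernel only enforces the flag; which descriptors get it is knowledge that lives in the app.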

Consider the following issues on allocated resources, most of them open "files". There was a time when a proc could only have 16 open files. BSD moved that to 20. They then implemented the select() call. Their argument list included bit masks for the fd's of interest in the API. It seemed like a good idea at the time. Besides, what app would have more than 10-12 files open at any one time? Well, guess what. In the early 90's, having only 4k open files in a database or pthread app was a limit. This forced the new select2() syscall because the API had to change to handle a variable length bitmask but the old select() was cast in concrete. The AltaVista webserver, which I maintained for a while back then, blew that number bigtime. The select() syscall was no longer a good idea and select2 with huge holes in the bitmap was only marginally better. Hint, even if all you kept open most of the time were 0,1,2 and a few fds that you had hanging around after a flurry of file stuff, you still could have an fd > 4096... Bit masks were no longer a good idea. Eventually poll/epoll and friends took over. The lesson here is that systems must evolve to address the continuing stream of new requirements.
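For readers who have not hit the bitmask problem themselves: select() addresses a descriptor by its bit position in a fixed-size set, while poll() takes a list of descriptor numbers, so a large fd value costs nothing. A minimal sketch using Python's select.poll wrapper follows; wait_readable is a hypothetical helper name.

```python
import os
import select


def wait_readable(fd, timeout_ms):
    """Return True if fd becomes readable within timeout_ms, using poll()."""
    poller = select.poll()
    # Interest is registered by fd number, not by a bit in a fixed-size mask.
    poller.register(fd, select.POLLIN)
    return any(event & select.POLLIN for _, event in poller.poll(timeout_ms))


if __name__ == "__main__":
    r, w = os.pipe()
    print(wait_readable(r, 0))      # nothing written yet
    os.write(w, b"x")
    print(wait_readable(r, 1000))   # data is pending
    os.close(r)
    os.close(w)
```

By contrast, CPython's select.select() refuses descriptors at or above the platform's FD_SETSIZE, which is the same fixed-mask limit surfacing in a wrapper.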

This is where CreateProcess() and its API come in. All those arguments are there for a purpose and are fixed for all time unless you are really into pain. But is that all that will be needed/wanted? History shows that no, it isn't. The things that must be manipulated when a process/fork gets created will grow in size. There are three choices for this: the system remains static for all time, you hack the API one more time, or you have an interim period in the child's code where all this can happen prior to launching the new image with exec(). UNIX chose the last and we benefit from that choice.

The creation of a process/thread in any OS is always tricky. There are timing/race conditions to consider, security vulnerabilities to deal with, etc., etc. This is not code for kiddies. But just as with the original exec() -> execve() and fork() -> vfork() issues, new capabilities stretch the limits of the process model and, from experience, adding a new syscall, painful as it is, is better than extending an existing API into uncharted territory. Therefore, having that interim period in the child's initialization has real value. This is where one deals with the future, right there after the child's return from fork(). Only the app knows what privs and resources should be freed before letting the next image take over after exec. Only the app knows whether a particular resource that it needs must be closed on exec. So have the app do it. The kernel doesn't know enough to do the right thing even if it cared. This is one place in the app codebase where such things matter. It has to be carefully written and debugged. It takes skill to do it. It also has to be done someplace. Launching a proc in any system involves two phases: first, clean up inherited "stuff" and, second, initialize a new, safe environment for the child before it invokes main(). There are three places to do it. You can do it in this interim code before exec(); you can do it in crt0 or ld.so; or you can do it in the kernel. Take your pick. The interim code is safe in that it is no worse than the rest of the app and fits the need. Doing it in crt0 or ld.so adds special and unique app requirements to a standard/common bit of system runtime. Don't even think about the kernel. Once in the kernel ABI, always in the kernel ABI, and for very good reasons. Hint, they bring pitchforks to these change proposal meetings...

The problems with threads and the mixing of threads with process creation (fork+exec) are real. They are two very different beasts and don't get along all that well. We argued about that back in the cde_threads days and things did not get easier just because we changed the name to pthreads. But pthreads itself, and any of its offshoots such as LWP are not much better. Pthreads does a reasonable job at forking a process but one successfully does such things by abiding by a set of design rules just a little less complex than a EULA. And that is so for a reason. I look at pthread and its mix of mutex+condition vars as being a little more safe than writing the whole thing in ASM, which I did on TENEX, i.e. there is nothing to enforce any of those mostly documented rules and design patterns. There is little help and no guard rails in this model which is why, 30 years later, there are not all that many of us who can really grok this stuff and even then, getting critical sections and lock optimization right is hard work. A pthread call is, after all, just another function call... Coverity et al have to do some real back breaking work (magic) to make sense of the rubbish shoved into it enough to report a reasonable error. Java doesn't offer much in this space either even if it has some threading "primitives" in the language (more or less). C++ is worse, little better than C + pthreads. The closest I've seen in modern languages is the golang model, mainly because it has concurrency (and the constraints necessary to keep it "safe") built into the language itself where the compiler and analysis tools can see what is going on. Also note that they use a "concurrency" model, not a multi-threading model (See Pike's numerous blogs). All the magic is in semantic pass(es) of the compiler and the runtime well out of the reach of the app programmer.

If we look at the fork() implementation in the Linux kernel, we find that fork and vfork are just wrappers around the full blown clone call, all of them calling _do_fork(). The pthread_create() lib call uses clone directly. This is also why posix_spawn() is faster in their graph. It is a properly clamped down clone() followed by quick lib resources cleanup followed by exec(). This was a smart move when NPTL entered the kernel in 2.6. The kernel doesn't care if a task is a thread or a proc; it just does its scheduling and resources thing. Only the app cares and the library does a respectable job with the rest. Note that its options to COW or share a restricted set of objects are limited to just the things that user code can't manage. Hint: why don't they use clone() instead in their runtime?
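For comparison, this is roughly what the posix_spawn() route looks like from user code, sketched with Python's os.posix_spawn wrapper (available on POSIX platforms that provide the call). The function name spawn_redirected is hypothetical. Note that the child-side setup is a declarative list drawn from a fixed menu of file actions rather than arbitrary code run in the child.

```python
import os
import sys
import tempfile


def spawn_redirected(argv, out_path):
    """Spawn argv with stdout redirected to out_path; return its exit code."""
    file_actions = [
        # Open out_path in the child as fd 1 (stdout).
        (os.POSIX_SPAWN_OPEN, 1, out_path,
         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644),
    ]
    # argv[0] must be a full path here; posix_spawnp would search PATH.
    pid = os.posix_spawn(argv[0], argv, os.environ, file_actions=file_actions)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        out = os.path.join(d, "stuff.out")
        code = spawn_redirected([sys.executable, "-c", "print('hello')"], out)
        with open(out) as f:
            print(code, repr(f.read()))
```

Whatever is not on that menu still has to be handled somewhere, which is the trade-off argued over in this thread.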

The authors made some comments on how, without fork+exec, they could do really cool stuff like load and relocate another process in the same address space. Why would anyone want to do such a thing, other than fool around with some academic notion? Memory management, even the pre-VM segment management in the PDP-11, is a very good thing. The reason is simple. If you can't address the object (in memory or the kernel) you can't piddle all over it. It is bad enough when a pthread goes rogue and stomps on things. Why would you import an unknown quantity like an arbitrary executable into your address space? That is an attack surface bigger than the flight deck of the Carl Vinson. One can escape any language runtime into ASM and once there, all bets are off. In other words, so what if you can load and randomly relocate multiple copies of a DLL/SO? I submit that is a feature in search of a problem to solve. If you want to do such things, use a VM or container and let the hypervisor keep you out of mischief. If you don't want fork, use a unikernel in a VM and get on with it. The realtime gadget people do it all the time with bare iron things like Arduinos and MIPS SOCs.

One last point. Removing fork+exec from UNIX (really Linux these days) is a fool's errand. There is one very big reason why anyone would care, other than an academic exercise in woulda-coulda-been semantics. There is a massive amount of code out there that runs inside a UNIX model and it does so for a very simple engineering and operational reason. As bad as it is, it is still, on the whole, better than all the OS models it displaced. I mentioned at the top that my first system was TENEX. I also worked on the DECSystem-20, a commercialized version of that OS before I moved exclusively to UNIX/Linux. Those were good systems that did cool things but most of us who left them behind had good reasons to move on. Those systems and all the other "proprietary" systems are now but memories to talk about over beers with other retired hackers. Anyone remember the DG Eclipse? AOS/VS had some really cool features, such as a built in threading model, that were way ahead of their time. But where is DG, or DEC or even Sun now? Most of the UNIX systems are gone leaving only {Free,Net,Open}BSD still chugging away. All those systems have been replaced by a standard system that does its job very well and it happens to be UNIX. Linux has evolved over the years but the core similarities and model are still closer to UNIX V6 than any of the other long gone OS designs. It is the standard OS just like the electrical outlets you get down at Home Depot are standard. Imagine the chaos that would return to metal fabrication if instead of using metric or "English" sizes, one chose their own arbitrary dimensions for thread sizes for fasteners. Having two complete sets of tools, one metric and one SAE, is pain enough, which is why every country (other than the USA) is now almost completely metric. The same applies to current OS ABI/API standards.
That massive amount of code only really happened when those of us who had to build real systems stopped arguing and accepted the one system we could all agree (at least in principle) could do the job and we could share in common without a lot of legal/financial friction. The world converged even more so on Linux for the same reason. People who want to build big, complex systems or who want to build handheld things like smartphones by the billion just want something on top of the iron that they could depend on rather than re-invent. Even Microsoft has figured this out. There is no money in maintaining a proprietary OS anymore other than to support an Office suite that is itself moving off the desktop and into "the Cloud". Unlike Linux where the development model scales to fill the staffing requirement because everyone and anyone who needs it can contribute their expertise, all of the Windows system specialists who really understand how the guts of the thing work are in a proprietary need-to-know box on the Microsoft payroll, which is why Windows/N, N=1->inf is really in maintenance mode. That group is a "cost center" that can't grow because it would eat the engineering budget alive while providing little more than a support layer underneath their Office products (the real cash cow). Their next new thing, where their dev money is being spent these days, is Azure which is a service whose profitability is based on simple usage scaling not feature development. And yes, most of the VMs and containers they run have fork() somewhere in the runtime.

I don't mean to dump all over the authors. But this piece is an opinion piece, if not a gripe session, not a research report. Cygwin under their research provides crappy fork() performance, primarily because of the impedance mismatch between the UNIX model and their model running over Windows. So what else is new? My son has solved that problem. He's given up on using things like git on Windows and is tired of the self-inflicted incompatibilities in Mac/OS (old python et al). He now has a Windows/10 machine for company stuff like Outlook and runs Fedora 29 in a VM to do his development work which does the deed just fine. When the authors and the users of their OS paradigm have enough code to double the size of github and Sourceforge, maybe then their argument would make sense. Otherwise, this is much ado about nothing and wishing for unicorns. (Lots of) code that works beats elegant designs that don't (yet) every time.

Sorry for being a grumpy old hacker.

Microsoft Research: A fork() in the road

Posted Apr 12, 2019 5:01 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

> One last point. Removing fork+exec from UNIX (really Linux these days) is a fools errand.
Sometimes they are necessary. Any realistic removal plan would require decades of transition time, though. Or perhaps not, if Google simply replaces Linux with Fuchsia in Android and everybody is forced to write to that API.

> I don't mean to dump all over the authors. But this piece is an opinion piece, if not a gripe session, not a research report. Cygwin under their research provides a crappy fork() performance, primarily because of the impedance mis-match between the UNIX model and their model running over Windows.
Windows actually supports pretty performant fork() in its kernel. It's used in the new Windows Subsystem for Linux and before that it was used in Windows Services for UNIX. It suffers from the overcommit problem, but otherwise it's enough to run most ported Unix apps.

Microsoft Research: A fork() in the road

Posted Apr 12, 2019 6:53 UTC (Fri) by eru (subscriber, #2753) [Link]

>Or perhaps not, if Google simply replaces Linux with Fuchsia in Android and everybody is forced to write to that API.

The mobile app developers probably would not even notice that change of kernel, especially since Google would work hard to minimize its visible effects on interfaces, for backward-compatibility reasons. Aren't Android apps mostly written in Java or some other higher-level language anyway?

Microsoft Research: A fork() in the road

Posted Apr 12, 2019 21:36 UTC (Fri) by rweikusat2 (subscriber, #117920) [Link] (3 responses)

As I've now actually read a part of this ... ehh ... interesting piece of text, one glaring inaccuracy would be the 'overcommit' bit. AFAIK, ever since the introduction of some virtual memory support in UNIX (with BSD for VAX), systems have slavishly emulated the 7th edition fork behaviour of "allocate enough swap space to store the entire new process on fork" because That's How It Is To Be Done! (eg, McKusick simply formulates this as demand). Apparently, fork didn't "encourage memory overcommit" in any system supporting it except on Linux (which - certainly coincidentally - is probably going to be the only non-Windows system the intended audience of this paper might have encountered, hence, they hopefully won't spot this --- fingers crossed).

Memory overcommit on fork is indeed sensible but that's a Linux innovation. The default behaviour of 7th edition emulation forks when not enough swap space can be reserved could be described as "suicide out of fear of death": No one knows how much of the inherited address space will need to be copied, so this entirely arbitrary limit prefers "guaranteed failure now" over "possible success in future", even though "guaranteed failure now"-mode obviously cannot guarantee that neither of the two forked processes will end up failing due to an out of memory situation encountered in a future memory allocation.

Microsoft Research: A fork() in the road

Posted Apr 17, 2019 16:41 UTC (Wed) by BenHutchings (subscriber, #37955) [Link] (2 responses)

At least AIX and FreeBSD also have overcommit.

Microsoft Research: A fork() in the road

Posted Apr 22, 2019 20:48 UTC (Mon) by tao (subscriber, #17563) [Link]

AIX has SIGDANGER though.

Microsoft Research: A fork() in the road

Posted Apr 25, 2019 11:03 UTC (Thu) by nix (subscriber, #2304) [Link]

Even overcommit-shy swap-space-happy Solaris has overcommit for the main stack of a process. (I'll admit to not entirely understanding why overcommit would be desirable for the main stack but not thread stacks...)

O_CLOFORK

Posted Apr 14, 2019 22:40 UTC (Sun) by magfr (subscriber, #16052) [Link]

By the way, O_CLOFORK shows up once in a while (2011 and then again 2017) and apparently it exists on AIX, *BSD, Solaris and MacOS but I see no actual rejections of it, it just peters out. Is there any interest in it?