Microsoft Research: A fork() in the road
"As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter." The discussion of better alternatives is limited, though.
Posted Apr 10, 2019 12:51 UTC (Wed)
by fhuberts (guest, #64683)
[Link] (48 responses)
Personally I think fork is a nice enough call and actually an advantage to have. For example, Git rather suffers in performance on Windows because of the lack of that call (amongst other things).
Posted Apr 10, 2019 16:39 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (37 responses)
These researchers happen to be right. fork requires overcommit, and overcommit is the enemy of guaranteed forward progress.
Posted Apr 11, 2019 19:38 UTC (Thu)
by simcop2387 (subscriber, #101710)
[Link] (36 responses)
Posted Apr 11, 2019 20:02 UTC (Thu)
by ecree (guest, #95790)
[Link] (35 responses)
This only leads to problems in the case where you have a single-process behemoth with huge amounts of writable anonymous pages; also known as a badly-designed program. As long as userland developers are following proper Unix philosophy (in this case, multiprogramming), fork() can remain performant even without overcommitting memory. (And if you're _not_ doing multiprogramming, and are happy to have a single fat process, then you won't want to run subprocesses anyway, so you won't be calling fork(). It's only the ugly half-way compromises that have a problem.)
Posted Apr 11, 2019 20:05 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (34 responses)
Posted Apr 11, 2019 20:39 UTC (Thu)
by ecree (guest, #95790)
[Link] (33 responses)
And do note that it's only the _anonymous shared_ mappings that are a problem; file-backed mappings don't require COW, and nor do private anonymous mappings. Your "large amount of cached data" could have been stored in memory allocated with mmap(MAP_PRIVATE | MAP_ANON), instead of regular malloc(), and then it wouldn't show up in the child after fork().
Posted Apr 11, 2019 21:25 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (22 responses)
And you're arguing that Unix is well designed?
Posted Apr 11, 2019 21:40 UTC (Thu)
by ecree (guest, #95790)
[Link] (21 responses)
Modularity is a virtue.
Besides, I'm not arguing that fork() has to be the _only_ way to launch processes; it's entirely OK to _also_ have a spawn()-like interface for the 'simple case' where you don't want to juggle fds, ulimits, creds, etc., as long as fork() is still supported for the hard cases. And there's always vfork()...
> And you're arguing that Unix is well designed?
Posted Apr 11, 2019 21:47 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (20 responses)
This single process can be very large, tens of gigabytes in size. Modern JVMs are quite efficient at managing large heaps, so this is desirable.
Now you need to launch a helper process. If you use fork()+exec then you're looking at duplicating the entire working set of the application server.
> Yes. It is.
Posted Apr 11, 2019 22:04 UTC (Thu)
by ecree (guest, #95790)
[Link] (8 responses)
Not when there was any alternative.
> It's a single process - it makes sharing data between requests very easy.
Fun fact: you can share memory between distinct processes, by any of several means.
Also, I'm not suggesting spinning off a separate process to handle each request (the xinetd model); just splitting up the workload into separate processes doing different aspects of the job. Do one thing well.
> This single process can be very large, tens of gigabytes in size. Modern JVMs are quite efficient at managing large heaps, so this is desirable.
Your definition of "desirable" clearly differs from mine.
> Now you need to launch a helper process. If you use fork()+exec then you're looking at duplicating the entire working set of the application server.
I know that. Which is but one of the many reasons you shouldn't build a gigantic monolithic application server in the first place.
The Unix system philosophy is like the Westminster system of government. Take any one part of it in isolation, and it looks obviously silly; incautiously import ideas from another system and everything falls apart. But the whole thing, when put together and kept intact, thrums along beautifully and achieves world domination.
Posted Apr 11, 2019 22:10 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
> Also, I'm not suggesting spinning off a separate process to handle each request (the xinetd model); just splitting up the workload into separate processes doing different aspects of the job. Do one thing well.
> I know that. Which is but one of the many reasons you shouldn't build a gigantic monolithic application server in the first place.
> The Unix system philosophy is like the Westminster system of government. Take any one part of it in isolation, and it looks obviously silly; incautiously import ideas from another system and everything falls apart. But the whole thing, when put together and kept intact, thrums along beautifully and achieves world domination.
Posted Apr 11, 2019 22:42 UTC (Thu)
by ecree (guest, #95790)
[Link] (2 responses)
No; you have to build your system in ways that are already the Right Thing _for other reasons_.
fork()'s "deficiencies" are only deficient for software that is _already badly designed_ before fork() enters the picture.
> The Unix philosophy is to get something working ASAP and then just objectify it as the epitome of creation
If that were true, Unix systems would still be written in B.
The developers of Research Unix at Bell Labs weren't averse to experimenting with changes to the system. They merely avoided changes which, while superficially attractive, did more harm than good. They had 'engineering taste' — which is really the ability to intuit the deeper consequences and ramifications of a design decision.
And the Unix design, as continued by Plan 9 and Linux, continues to evolve (/proc, /sys, entirely new kinds of fds), but always guided by the Unix philosophy.
Posted Apr 11, 2019 22:47 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
> If that were true, Unix systems would still be written in B.
Posted Apr 12, 2019 18:41 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link]
There was indeed an OS written in PDP-10 machine language whose fans keep extolling its virtues until today: the MIT AI lab Incompatible Timesharing System (with PCLSRing being 'the virtue'), but that's something different.
Posted Apr 11, 2019 22:59 UTC (Thu)
by mpr22 (subscriber, #60784)
[Link] (3 responses)
You know, your analogy says some pretty unflattering things about Unix.
Posted Apr 12, 2019 10:27 UTC (Fri)
by ecree (guest, #95790)
[Link] (2 responses)
If I wanted to be maximally inflammatory, I would say that in the analogy, the EU represents systemd. But let's not go down that rabbithole.
Posted Apr 12, 2019 11:24 UTC (Fri)
by tao (subscriber, #17563)
[Link] (1 responses)
Yes, your simile is rather apt.
Posted Apr 12, 2019 19:21 UTC (Fri)
by MatejLach (guest, #84942)
[Link]
One thing that many people miss is that systemd is a 'service manager', so its work doesn't stop once your services are up and running. Now I know many would argue that's a downside, but the reality is, the alternative is to get the same set of functionality via a patchwork of variable-quality scripts on top of a 'simpler' init system.
Also, complaints about logind are funny, because nobody was apparently willing to do equivalent maintenance work, (consolekit etc.), so yeah.
Anyway, it's getting a bit ranty, but the point still stands.
Posted Apr 12, 2019 14:12 UTC (Fri)
by joncb (guest, #128491)
[Link] (9 responses)
Posted Apr 12, 2019 18:18 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
Posted Apr 13, 2019 7:26 UTC (Sat)
by joncb (guest, #128491)
[Link] (7 responses)
Indeed, a very quick search suggests that to create a helper process you should use either Runtime.exec or ProcessBuilder (I haven't really touched Java in a good decade, so that is probably misleading in the nuances). While I wouldn't be surprised if one of the implementations involves a fork under the covers, there's no reason it couldn't be anything else that guarantees the expected semantics.
The difference, of course, between C/C++ and Java/C# is that the former are languages expected to execute (more or less) directly on top of the current system, whereas the latter are expected to present a virtual facade across it. Therefore I would expect C to have access to fork() where it is available, whereas I would not expect Java or C# to have it. Golang is a weird blend of the two, where some things are more C-like and some things are not; low-level fork access is apparently one of the nots. Rust appears to have fork, but with some hefty safety warnings on it.
Posted Apr 13, 2019 9:23 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
> Indeed, a very quick search suggests that to create a helper process you should either use Runtime.Exec or ProcessBuilder (haven't really touched Java in a good decade so that is probably misleading in the nuances). While i wouldn't be surprised if one of the implementations involves a fork under the covers there's no reason it couldn't be anything else that guarantees the expected semantics.
Posted Apr 13, 2019 23:52 UTC (Sat)
by joncb (guest, #128491)
[Link] (5 responses)
I assume you really don't mean "No way to avoid it" here because if there's literally "no way" then this whole exercise is just shouting into the void.
In particular, I'm thinking you (and I specify you because yours is the use case here) write a patch for openJDK that re-implements ProcessBuilder to use something other than fork when calling start(). From your comments on this story that should be very doable. You submit that patch to openJDK and make your case. Regardless of whether it is accepted or not, you can now run openJDK secure in the knowledge that your application is using this faster/safer/cleaner/whatever alternative.
In my travails doing an informal survey of how languages fork, I came across an interesting Python issue about moving to posix_spawn. It looks like it's stalled for technical compatibility reasons ( https://bugs.python.org/issue35823 ). The part stating that libc "may be more than a decade behind in enterprise Linux distros" shows where bigger problems lie.
Posted Apr 13, 2019 23:54 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Posted Apr 14, 2019 0:45 UTC (Sun)
by zlynx (guest, #2285)
[Link]
Even with overcommit turned on, trying to fork a 10 GB Java process can fail because it exceeds the heuristic.
With overcommit disabled, which is how I run my Linux servers, it will definitely fail.
Luckily we have vfork which was designed for exactly this problem. It doesn't duplicate the process memory, not even CoW. With a bit of care to not overwrite important memory in the parent process, it works very well to launch new child processes.
So "vfork()" is "something other" because it is like fork, but isn't actually fork.
Posted Apr 14, 2019 23:49 UTC (Sun)
by neilbrown (subscriber, #359)
[Link]
Couldn't you open a socket and send a dbus message to systemd to ask it to run some service for you ??
Posted Apr 15, 2019 6:02 UTC (Mon)
by joncb (guest, #128491)
[Link] (1 responses)
Don't you think this is putting the cart before the horse just a little bit then? Surely creating a "something other" should take precedence to advocating for developers to stop using the one tool they have for this basic task?
Posted Apr 15, 2019 8:41 UTC (Mon)
by farnz (subscriber, #17727)
[Link]
Not really. The paper says that, in practical terms, fork isn't a good API, and that while posix_spawn looks better in theory, in practice it becomes a mess to use.
The paper is more of an academic opinion piece; it sets out why fork causes issues, why posix_spawn and friends aren't enough better to be worth the effort of a wholesale rewrite of software, and asserts that it should be possible to produce a better API given that, in theory, spawn-type APIs are easier for OS developers to implement.
Within the bounds of academia, this sort of paper serves to legitimise research into better APIs; someone has asserted with examples that existing APIs are imperfect, and now future researchers interested in process creation APIs have something they can use as a reference when they justify spending time on the "solved" problem of spawn versus fork APIs. Maybe the answer will turn out to be that posix_spawn and fork are both local maxima, and the only way to do better is a radical rethink of process design; maybe some bright spark will demonstrate that there is a better API we can use if we step away from the existing ones.
The key point is that we don't have good data on better alternatives to the current "spawn with 101 flags to inherit the right bits of the world" and "fork then clean up" APIs; the paper says we need to work out what the "something other" should look like, because "fork and clean up" is easy for the user, but sets various design choices for the kernel (and requires certain hardware support to be performant: we get CoW very cheaply with modern MMUs, but at the expense of requiring MMUs for an OS kernel, not just MPUs), while "spawn" is easy for the kernel, but leads to huge complexity for the user, who has to handle 101 flags to get the "right" environment in the spawned process.
Posted Apr 16, 2019 7:08 UTC (Tue)
by gfernandes (subscriber, #119910)
[Link]
We're now _breaking it ALL up_ into microservices, throwing out all the large in memory caches, even moving databases to Mongo or PGSQL.
*ecree* is right.
Gigantic monoliths are no excuse for poor software design.
Posted Apr 11, 2019 21:28 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (6 responses)
Posted Apr 11, 2019 21:52 UTC (Thu)
by ecree (guest, #95790)
[Link] (3 responses)
I know, that's why you use MAP_ANON. Do pay attention ;)
> You also can't typically control the allocations made by the JVM or your language runtime.
I very nearly said something about "the problem with most application servers is they're written in Java", but I held back. Maybe I shouldn't've.
Language runtimes ought to provide mechanisms for allocating objects in private memory, if they're intended to be used for big programs that want child processes. Indeed, if they're going to be written around a spawn()ish view of the world, then objects allocated from user code won't need to be visible post-fork(), so such objects could just be allocated private by default.
C gives you that control, through the aforementioned mmap(), and it's probably even possible (I haven't tried it) to patch your libc to make malloc default-private.
An even more fine-grained system might be tagged allocations, where the fork()-analogue (probably clone()) could specify which tags it wanted to copy into the child. But probably no-one's ever needed that, else there would have been a serious attempt to implement it.
Posted Apr 11, 2019 21:54 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Why _should_ they be designed around fork()?
Posted Apr 11, 2019 22:29 UTC (Thu)
by ecree (guest, #95790)
[Link] (1 responses)
Because fork() is necessary to allow complex control of child environment without excessive API surface (spawn() functions with 42 arguments, etc.). So it needs to be supported.
Unless of course you'd rather have a spawn() function that takes as an argument a BPF program that sets up the child environment before the new process image is executed ;)
Posted Apr 11, 2019 22:33 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> Unless of course you'd rather have a spawn() function that takes as an argument a BPF program that sets up the child environment before the new process image is executed ;)
Posted May 31, 2021 17:35 UTC (Mon)
by immibis (subscriber, #105511)
[Link] (1 responses)
There is generally no good reason you should split your Java app into multiple processes just because the OS demands it. Half the point of Java is to shield you from such things, is it not? If you want to split up your app into multiple cooperating modules - as you should - you can do that within the one process.
Posted Jun 1, 2021 1:42 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Apr 12, 2019 3:46 UTC (Fri)
by roc (subscriber, #30627)
[Link] (2 responses)
Shared anonymous mappings sometimes need COW too, but you just can't have that in Linux/POSIX.
Posted Apr 12, 2019 10:35 UTC (Fri)
by ecree (guest, #95790)
[Link] (1 responses)
Yeah I was getting my terminology a bit confused last night.
What I was trying to say was that malloc() memory 'normally' needs COW and file mappings 'normally' don't.
The problem mappings are those which have a _separate_ mapping in the child, which is actually the private ones; shared mappings remain mapped in the child but without COW (I think?), and there's no kind of M_CLOFORK mapping that just isn't mapped in the child at all (which is what my brain late last night said private meant).
> Shared anonymous mappings sometimes need COW too
Why? If it's a shared mapping, then writes by the child should be visible in the parent and vice-versa, so both processes can map the same page and no need to COW. What am I missing?
Posted Apr 12, 2019 22:12 UTC (Fri)
by roc (subscriber, #30627)
[Link]
That's correct.
> and there's no kind of M_CLOFORK mapping that just isn't mapped in the child at all (which is what my brain late last night said private meant).
That's true. Though there is madvise(MADV_DONTFORK) which gives you similar functionality.
> > Shared anonymous mappings sometimes need COW too
As discussed in the paper that spawned this thread, sometimes fork() is used to create checkpoints of process state (e.g. rr and Redis do this). COW makes this extremely efficient for MAP_PRIVATE pages, which is great, but it doesn't work with MAP_SHARED pages, so rr (not sure about Redis) has to eagerly copy them into the checkpoint. This is bad.
The MAP_PRIVATE/MAP_SHARED model is too inflexible. It would be better to have a model where you can create memory objects backed by files or anonymous memory, and then explicitly COW-clone them (and of course map those objects into your address space, pass them to other processes, etc). The Fuchsia documentation isn't great but it seems to have this kind of API. This would require the kernel to manage a tree of COW-clones for each memory object, but that isn't very different to today where Unix kernels have to manage a tree of COW-clones of process address spaces.
Posted Apr 10, 2019 18:38 UTC (Wed)
by sjfriedl (✭ supporter ✭, #10111)
[Link]
From the paper:
> Ironically, the NT kernel natively supports fork; only the Win32 API on which Cygwin depends does not
Posted Apr 10, 2019 18:47 UTC (Wed)
by thoughtpolice (subscriber, #87455)
[Link] (8 responses)
Three of the (four) authors are not from Microsoft Research. You're going to be surprised when you find out what it is the majority of MSR employees do (hint: it's research, and a lot of it uses Linux.)
> Personally I think fork is a nice enough call and actually an advantage to have. For example, Git rather suffers in performance on Window because of the lack of that call (amongst other things).
This is just a flippant response; the paper's entire argument has nothing to do with whether it's "nice enough" or "easy to use" (you can use `posix_spawn` for that). It's arguing that the contract implementations of `fork()` must provide is a large burden on the design of new systems, and worth reconsidering. For example, the paper outlines the design of K42, an object-capability system whose design was significantly burdened by general `fork()` semantics: suddenly you have to specify what state objects can be in after they are cloned, rather than only when they are created fresh. Similarly, because `CreateProcess` actually spawns a separate process (vs. forking), it can work in environments that have single address spaces; by porting their APIs to different runtimes, you can support single-address-space multi-process SGX enclaves with the same API, and likewise Unikernels (Section 5). If you treat the *process* as the object of work (the object your vocabulary is designed around), it's easy to see how this works, versus fork. Even if fork is "nice" for users, it has significant design ramifications throughout the system.
Also, `CreateProcess` on Windows isn't expensive because mapping in a new address space and running main() is inherently costly, or because Windows is just weirdly, mysteriously terrible at it and nobody ever cared; it's expensive because initializing user contexts in a Windows process is expensive (kernel locks, context switches to system services, thread/stack initialization, etc.), and that is a wholly separate design issue. There's no reason to believe a well-implemented spawn mechanism can't have excellent performance just because Windows has technical debt (and given the chance to start fresh, if you wanted a high-level spawning API, you'd almost certainly try to *ensure* it has good performance from the get-go.)
The paper never even makes the argument "`CreateProcess` in Windows is better and easier to use than `fork`" (which some other people here have read somewhere, and something that would be pretty hard to argue anyway), merely that the actual underlying design distinction -- an API revolving around spawning actual processes as a kind of first class object in the system without imposing semantic requirements on the memory subsystem -- is perhaps a better one, moving forward. The arguments seem pretty good, to me. I could live without `fork()` if it meant I got something in return, and it seems I do get some things in return.
It also comes across as a good example of where higher level, domain-specific APIs for users give more flexibility than a low level one when considering the overall system design. A higher level API necessarily gives more degrees of freedom in the implementation which, in turn, helps isolate users from underlying implementation details. You can of course go too far here, but if you capture the domain properly, then you can have the advantages of implementation freedom combined with the exact control you need.
Here's a similar discussion I had recently: if you want to write an efficient C program (energy efficient, time efficient, whatever), compared to some existing one, it will never be enough to just throw a new compiler or better optimizations at said existing program -- you cannot choose better instructions or register allocate your way out of it, etc. That will be peanuts compared to real gains you can have. Real optimization comes from choosing a different design, different algorithms and different memory layout -- the kind of decisions that are impossible for the C compiler to make for you while preserving semantics. But tiling, memory locality optimizations are much more common and practical in more restricted settings if you take away a few degrees of freedom from the user, and give it back to the computer -- the Halide image compiler is an example of this. So it is not a problem, or even a "failure", of individual *technical tools* that you are using (calling it a "failure" is not based on any technical understanding of the problem, but on the deep, social, human desire to assign blame, in order to rationalize and reduce complex interactive failure into singular causes.) It is an impedance mismatch in the *abstract language* you are using to communicate with the machine, in turn, restricting the degrees of freedom the computer has for response. Language problem, not a technical one.
Generally I enjoy LWN but most people here have (IMO) failed to level any actual substantial criticisms of the paper (beyond made up ones in their head) before dismissing it offhand, which is sad because it's pretty well written and easy to approach. You don't have to have .patch files in hand every time you want to take some basic idea and run with it and see where it might lead; and in fact, such a demand hampers actual progress more than anything, but that's another discussion...
Posted Apr 10, 2019 19:36 UTC (Wed)
by smoogen (subscriber, #97)
[Link]
Posted Apr 10, 2019 21:42 UTC (Wed)
by nix (subscriber, #2304)
[Link] (5 responses)
*This* is a killer, because it means you are constrained to whatever process-setup code the people who specified the replacement (posix_spawn(), say) happened to think of, and you can't add to it because in a system without fork() you cannot implement your own spawn replacement, but have to rely on whatever limitations the one in the OS happened to provide. The fact that the posix_spawn() API family is already a horrible tentacular monster and is *still growing* and that it is trivial to generate scenarios it cannot handle suggests that this is a rather serious limitation, and a limitation that bites real code. (Even the scenarios it can handle are really hard to read because it has to handle so many cases that it really wants to be a programming language but isn't.)
Posted Apr 10, 2019 22:37 UTC (Wed)
by wahern (subscriber, #37304)
[Link] (2 responses)
But Unix *has* cross-process operations in ptrace. Nobody is really clamoring to use that interface to build a better fork replacement because fork *already* represents a compromise between how much complexity to put into the kernel and how much complexity to put into userspace, and additionally how costly (mostly in complexity, not performance) the implementation must be. It shouldn't be shocking that those responsible for the kernel side are complaining about having to put in so much work; nor should it be shocking that they heavily discount the cost to user space by shifting the remainder of the burden to them.
Taken to its logical end the paper's argument basically mirrors the same arguments as for microkernels. And while I think microkernels are great and am eagerly waiting an excuse to put seL4 to some use, fork+exec is sufficiently flexible and performant to have ushered in the age of containers and other more complex process management strategies.
While cross-process operations would be more powerful, we shouldn't underestimate the cost of building the stack of software necessary to bring the promise to reality. It's the same inconvenient truth as with microkernels. fork+exec is just too good enough, whether by accident or design.
Posted Apr 10, 2019 22:43 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Apr 12, 2019 22:14 UTC (Fri)
by roc (subscriber, #30627)
[Link]
Posted Apr 10, 2019 22:45 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
But again, this needs an API that has a process handle as a first-class object.
Posted May 31, 2021 17:38 UTC (Mon)
by immibis (subscriber, #105511)
[Link]
> The paper recommends a cross-process operation primitive, not something like CreateProcess or pthread_spawn, which will always fall far short of the ability to execute arbitrary code.
It recommends that if you want to redirect a file descriptor, for example, you should just be able to "remote-control" the child process to issue that call, before you unsuspend it.
Posted Apr 11, 2019 13:35 UTC (Thu)
by roblucid (guest, #48964)
[Link]
Posted Apr 10, 2019 13:01 UTC (Wed)
by naptastic (guest, #60139)
[Link]
<3
Posted Apr 10, 2019 13:17 UTC (Wed)
by mm7323 (subscriber, #87386)
[Link] (15 responses)
Windows CreateProcess() takes 10 or so direct parameters, of which some are structures of yet more parameters again. Yuck.
Some of the complaints in the paper seem misplaced too (fork is slow, doesn't scale, forces memory overcommit). fork() has a certain purpose and trying to use it for performance is something which we already know doesn't work that well - that's why webservers went through design iterations of pre-forking, using threads, async IO or some combination of techniques.
Suggesting fork() is insecure possibly has some truth, as inheriting most things by default (except other threads) is the opposite of what would be safest, but it's too late to change. Perhaps a new part() call that inherits little could be made, but the knock-on effect is then making APIs for everything to opt into inheritance.
Posted Apr 10, 2019 13:54 UTC (Wed)
by warrax (subscriber, #103205)
[Link] (14 responses)
fork() is absurdly hard to use in multithreaded cases because of the implicit forking of resource handles of various kinds. It's also really hard to do error handling around it correctly(!). In languages like C it's certainly *possible* to use it correctly, though it's rare to see people get the edge cases right; but try asking any language-runtime implementors how much fun they had working around its nightmarish semantics.
There's nothing elegant about fork() at all -- it overconstrains implementations by way of its semantics for a use case which is almost always going to be "start a new subprocess /usr/bin/something with arguments X, Y, Z". It's a complete mess because of its implicitness. Explicit over implicit all the way!
Posted Apr 10, 2019 14:15 UTC (Wed)
by smoogen (subscriber, #97)
[Link] (1 responses)
The problem is that when you get to threaded screws, the hammer no longer works well, and you end up with splintered walls and bashed fingers. So you upgrade your toolbox with better hammers, and maybe some screwdrivers. You might even go with a toolbox with no hammer in it (aka Windows). You quickly find that the remaining tools still have enough rusty bits that every program gives you a bad case of tetanus, gangrene, and blood poisoning while trying to deal with threads.
In the end, papers do not fix things, especially papers which do not give code that clearly shows a better solution. They may provoke people to think about building better tools, but even then it takes multiple generations of people who are happy with their rocks to retire before you even get claw hammers or Phillips-head screwdrivers. And even then you will find that someone stuck a nice sharp rusty point on the #1 Phillips head, and the fix was to keep wrapping it in indirection duct-tape until it only pokes you now and then.
Posted Apr 10, 2019 19:33 UTC (Wed)
by smoogen (subscriber, #97)
[Link]
Posted Apr 10, 2019 14:26 UTC (Wed)
by mm7323 (subscriber, #87386)
[Link] (9 responses)
Of course, you could say the weaknesses of fork() remain because of the backward-compat requirement too. But fork() + exec() is still a prettier API that's easy to teach.
Yes, fork() and threads don't really mix well, as you won't normally know what state the other threads, and hence their memory, were in at the point of fork(), unless you acquire locks first, as you would for any normal access to another thread's state. Threads don't mix well with other things too, e.g. mixing locks, condition variables, and poll()/select()/epoll(). Certainly in C and similar languages, use of threads requires forethought about the overall program structure and care over data ownership.
posix_spawn() may be what you are after then, though its man page says that it offers only a subset of fork() + exec() functionality. Forking servers are quite a common use case too, and again, are simple to teach and understand, even if the inherit-by-default semantics can present a booby trap.
It's actually all the other calls, which need various forms of CLOEXEC and preparation, that make the mess; but that's semantics. A quick grep of /usr/include shows this:
Posted Apr 10, 2019 15:15 UTC (Wed)
by barryascott (subscriber, #80640)
[Link] (4 responses)
But that is the point the paper makes. Because of the fork() design *everything* else has to work around the limitations.
Posted Apr 10, 2019 15:25 UTC (Wed)
by mm7323 (subscriber, #87386)
[Link] (3 responses)
It would have been a bold decision to make new sub-systems implicitly set CLOEXEC by default, and it is perhaps only now, with hindsight, obvious that such a choice would have been saner; but it's not the fault of fork() that it came first.
And still there is no better suggestion of a replacement or upgrade.
Posted Apr 11, 2019 2:10 UTC (Thu)
by epa (subscriber, #39769)
[Link] (2 responses)
Posted Apr 11, 2019 5:32 UTC (Thu)
by mm7323 (subscriber, #87386)
[Link] (1 responses)
One thing this can break is Valgrind, which creates some high-numbered descriptors above the normal ulimit(). By testing the ulimit you can avoid closing these. Other libraries and tools may not be so lucky though, so an O_NOCLOEXEC may be better, and it's actually an O_NOCLOFORK that would be best.
Posted May 31, 2021 17:44 UTC (Mon)
by immibis (subscriber, #105511)
[Link]
What if opening /proc/self/fd fails because too many FDs are open? Okay, then you just close FD 0. But you actually need that one. So close FD 3 instead. You're closing all the FDs, right - so it doesn't matter if you close one prematurely?
What if FD 3 is on your do-not-close list? Okay, just pick the lowest number that isn't.
What if there are too many FDs and they're all really high numbers? Scan the whole 32-bit or 64-bit FD space until you manage to close one, then open /proc/self/fd? (They can be higher than your RLIMIT_NOFILE, if RLIMIT_NOFILE was set to a larger number in the past.)
What if your RLIMIT_NOFILE is zero? Then you can't open /proc/self/fd. But there is nothing to close. Will you detect that and succeed instead of failing?
Actually, there could be open FDs from before RLIMIT_NOFILE was set to zero. Will you temporarily increase it, so you can open /proc/self/fd?
What if /proc isn't mounted? This is actually very likely to come up if your code is ever used in a program that creates containers, or perhaps even just from a rescue shell.
Wouldn't it be great if you could *just tell the kernel to do the thing you want it to do*?
Posted Apr 10, 2019 15:53 UTC (Wed)
by sjfriedl (✭ supporter ✭, #10111)
[Link] (1 responses)
It's only easier to teach if you ignore the hard parts.
Posted Apr 11, 2019 6:42 UTC (Thu)
by nilsmeyer (guest, #122604)
[Link]
Isn't that how teaching works, at least in the beginning?
Posted Apr 12, 2019 22:36 UTC (Fri)
by warrax (subscriber, #103205)
[Link] (1 responses)
> It's actually all the other calls that need various forms CLOEXEC and preparation which makes mess, but that's semantics. A quick grep of /usr/include shows this:
The implicitness around all of this means that an application *CANNOT* be future-proof. Every time one of these flags got/gets added there's a new failure mode for an application written to the old API.
(I.e. an application cannot -- by definition -- know which *_CLOEXEC flag will be needed in future.)
"Clone shit" is *not* by any means a reasonable specification of behavior.
Posted Apr 13, 2019 2:09 UTC (Sat)
by foom (subscriber, #14868)
[Link]
These flags are all for different APIs that can open a new file descriptor. If you're using fanotify_init, you use FAN_CLOEXEC with it. If you're using open, you use O_CLOEXEC, etc.
Posted Apr 10, 2019 22:45 UTC (Wed)
by wahern (subscriber, #37304)
[Link] (1 responses)
I think the authors would disagree with you. They claim (rightly) that CreateProcess and posix_spawn are fundamentally incapable of the expressiveness required of a core primitive.
Read fairly, their claim is that both CreateProcess and fork+exec suck. The fork+exec model is more expressive and powerful; CreateProcess is less of a burden on the kernel and more performant. Their preferred alternative is cross-process operations, though as I mention elsethread the pros and cons basically mirror the debate regarding microkernels, IMO, and unsurprisingly (from the perspective of operating-system researchers busily writing experimental kernels) it substantially shifts the complexity burden to user-space software.
Posted Apr 10, 2019 22:53 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Everything else can probably be expressed much more simply with a newer, sane process API.
Posted Apr 10, 2019 13:32 UTC (Wed)
by ibukanov (subscriber, #3942)
[Link] (13 responses)
The real fork drawback is that it does not have sane semantics in multi-threaded programs, and using it together with threads and shared memory does more harm than good. But fork in single-threaded applications that use it for computational workers works nicely and may even lead to better CPU cache utilization.
Posted Apr 10, 2019 21:00 UTC (Wed)
by rweikusat2 (subscriber, #117920)
[Link] (12 responses)
NB: I'm not going to read a paper presenting decades-old VMS 'design [ir]rationales' as 'new research'.
Posted Apr 11, 2019 7:23 UTC (Thu)
by ibukanov (subscriber, #3942)
[Link] (11 responses)
Posted Apr 11, 2019 10:31 UTC (Thu)
by dufkaf (guest, #10358)
[Link] (10 responses)
Posted Apr 11, 2019 15:33 UTC (Thu)
by ibukanov (subscriber, #3942)
[Link] (2 responses)
But my point is that for single-threaded applications fork has clear semantics. For example, to spawn a computation: prepare all data in memory in the parent process, fork, compute, and send the results back using a pipe or shared memory. Works nicely.
Posted Apr 11, 2019 17:04 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link]
Even assuming this was true (and it isn't), the original fork use-case would come to mind here: Execute a command in a background process instead of in the current one.
A use-case for 'exec in same process': In-place update of a running program. The currently running instance serializes and records its current state somehow and then execs itself, causing the updated program file to be loaded. The new instance then restores the serialized state and continues where the previous one left off.
Posted Apr 12, 2019 19:09 UTC (Fri)
by dufkaf (guest, #10358)
[Link]
Posted Apr 11, 2019 16:58 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (6 responses)
Posted Apr 11, 2019 17:18 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
Posted Apr 11, 2019 17:35 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (4 responses)
More complicated extension: For my usual use-case, the parent of the log forwarder will be a program which monitors another program, restarts it if it terminates unexpectedly, and provides facilities for reliably terminating the other program and reliably sending signals to it. For this to work, it needs to be the parent of the payload process.
Posted Apr 11, 2019 17:51 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
In your wrapper just spawn a log process, passing your stdin/stdout to it. Then exec() the payload.
No fork() required.
Posted Apr 11, 2019 18:03 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link] (2 responses)
Posted Apr 11, 2019 18:07 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
Posted Apr 11, 2019 19:07 UTC (Thu)
by rweikusat2 (subscriber, #117920)
[Link]
Posted Apr 10, 2019 15:04 UTC (Wed)
by xl2784 (guest, #131031)
[Link] (10 responses)
Posted Apr 10, 2019 15:58 UTC (Wed)
by metan (subscriber, #74107)
[Link] (9 responses)
* Thread A calls malloc() and takes the allocator's internal lock
* Thread B calls fork() while that lock is still held
Now the child created by thread B's fork() ends up with malloc locked for eternity, and any attempt to allocate memory there will end up in a deadlock.
Posted Apr 10, 2019 16:04 UTC (Wed)
by xl2784 (guest, #131031)
[Link]
Posted Apr 10, 2019 16:14 UTC (Wed)
by mm7323 (subscriber, #87386)
[Link] (7 responses)
int pthread_atfork(void (*prepare)(void), void (*parent)(void), void (*child)(void));
prepare() should do something like take locks for exclusive access on the malloc() area (potentially blocking until exclusive access is guaranteed), then returning to allow the fork() to proceed. parent() can drop the locks again in the original process & thread, while child() can replace any locks with new ones specific to the child.
Of course, glibc implements both malloc() and pthread_atfork() so can use internal mechanisms to achieve the same, but it's still there for others if needed on other resources and you really have a design that calls for fork() and threads.
Posted Apr 10, 2019 18:29 UTC (Wed)
by pbonzini (subscriber, #60935)
[Link] (6 responses)
Posted Apr 10, 2019 20:08 UTC (Wed)
by mm7323 (subscriber, #87386)
[Link] (5 responses)
It's not so much 'reinitialise mutexes in the child', but more of a 'create new mutexes for the new process and replace any references to mutexes from the parent' that needs to happen in the child() call.
Sorry if I wrote reinitialise and threw you previously.
Posted Apr 10, 2019 22:25 UTC (Wed)
by pbonzini (subscriber, #60935)
[Link] (4 responses)
Still not enough, as "attempting to destroy a locked mutex results in undefined behaviour" (from http://pubs.opengroup.org/onlinepubs/007908799/xsh/pthrea...).
Posted Apr 11, 2019 5:38 UTC (Thu)
by mm7323 (subscriber, #87386)
[Link] (3 responses)
So after fork(), when there is only one thread in the child, it just creates its own new locks and synchronisation primitives and off it goes. No reinitialisation or destroying is needed in the child.
Posted Apr 11, 2019 11:58 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link] (2 responses)
Posted Apr 11, 2019 21:05 UTC (Thu)
by mm7323 (subscriber, #87386)
[Link] (1 responses)
Posted Apr 11, 2019 21:34 UTC (Thu)
by pbonzini (subscriber, #60935)
[Link]
But the very fact that the interaction between atfork and error-checking mutexes is completely undocumented is a sign that it is not a great API.
Posted Apr 10, 2019 15:27 UTC (Wed)
by dullfire (guest, #111432)
[Link]
Or in other words... if you're gonna use functions from multiple abstraction layers (in this case libc vs thin libc syscall wrappers) then it's on you to properly manage them.
I'm not sure fork() is an inspired design, however the assertions at the beginning of the paper don't fill me with confidence about the authors
Posted Apr 10, 2019 16:04 UTC (Wed)
by flussence (guest, #85566)
[Link]
It's also six characters long and good enough for the rest of the world. If this also-ran open-core company can't/won't build something more compelling, then all of this is just intellectual onanism and whining over something they don't have the brains to implement efficiently.
Posted Apr 10, 2019 16:55 UTC (Wed)
by magfr (subscriber, #16052)
[Link]
Posted Apr 10, 2019 17:34 UTC (Wed)
by evad (subscriber, #60553)
[Link] (4 responses)
I can only assume Microsoft wants fork() to be removed from the kernel so it's easier for them to support Linux apps on Windows. Otherwise why ask for its removal? Why not just educate people on the alternatives?
A very confusing paper, and very much an editorial rather than a research document.
Posted Apr 10, 2019 18:25 UTC (Wed)
by randomguy3 (subscriber, #71063)
[Link] (3 responses)
The paper gives several motivations, but I reckon the primary one comes from the authors' work as OS researchers interested in making new research operating systems. Currently, fork() usage is so prevalent in UNIX software that they are faced with either implementing fork() (which they claim - and I see no reason to doubt their experience in this area - infects the entire OS design) or having an OS that can't run the huge amount of existing software out there (removing a valuable testing resource and possible adoption path for the OS).
It's notable that S7 only suggests the OS might be rewritten to not have fork() as a core syscall after the most important software (however you want to define that, I guess) has been rewritten to avoid it. They're a little vague on how either part of that process would happen, but the purpose of the paper is just to convince people that it should be done, not set out a plan for achieving it.
Posted Apr 10, 2019 20:16 UTC (Wed)
by mm7323 (subscriber, #87386)
[Link]
Posted Apr 10, 2019 22:34 UTC (Wed)
by evad (subscriber, #60553)
[Link] (1 responses)
Posted Apr 11, 2019 18:24 UTC (Thu)
by mrshiny (guest, #4266)
[Link]
Posted Apr 10, 2019 17:50 UTC (Wed)
by Liskni_si (subscriber, #91943)
[Link]
Vlastimil Babka pointed me to https://lwn.net/Articles/717950/ which I understood to mean that it's not at all straightforward to implement and some larger refactorings would be necessary, and that's certainly beyond my ability. So I'm wondering if we're any closer to being able to add such functionality today, and whether others think my idea of reflink on tmpfs is good or bad.
[1]: https://blogs.vmware.com/consulting/2016/09/anatomy-insta...
Posted Apr 10, 2019 17:58 UTC (Wed)
by alogghe (subscriber, #6661)
[Link]
Posted Apr 10, 2019 18:40 UTC (Wed)
by randomguy3 (subscriber, #71063)
[Link] (1 responses)
An interesting piece. Having dealt with spawning processes in a cross-platform multi-threaded application as part of my day job, I am very sympathetic to the complaints of these researchers (although I'll admit I don't care that much about the difficulties fork() poses for implementing microkernel systems...). CreateProcess() certainly has its faults (some of which it shares with fork(), such as not defaulting to CLOEXEC), but it's a lot easier to get right than fork()+exec() - the constraints on what you can do after fork() are easy to forget, and hard to even know when there are multiple threads around. posix_spawn is a good idea, but suffers from several shortcomings (some of which are mentioned in the paper), including poor error returns, some missing basic features (like setting the working directory) and an inherently racy approach to fd inheritance in a multithreaded environment.
Posted Jun 7, 2021 16:45 UTC (Mon)
by immibis (subscriber, #105511)
[Link]
Posted Apr 10, 2019 19:15 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
This neatly avoids all the complications of forking and memory overcommit.
That's what Fuchsia does, btw.
Posted Apr 10, 2019 20:01 UTC (Wed)
by roc (subscriber, #30627)
[Link]
Posted Apr 11, 2019 15:26 UTC (Thu)
by sbaugh (guest, #103291)
[Link]
The implementation (as some other comments speculate about) is as a userspace stub which receives syscalls to execute over some transport, and sends their results back. I use a pair of file descriptors, but other transports could be implemented too.

The issue with ptrace is not just that it's hard to use, not just that it's slow, but also that there can only be one ptracer at a time. A program that used ptrace in normal operation to manipulate its children would be much less compatible with strace, gdb, and other tools. That's not workable for a general-purpose API. Furthermore, ptrace puts limits on what kind of transport can be used between the stub and the main process. It would be nice to use shared memory to send syscall instructions to the stub, to improve performance when much setup must be done. As it stands, with a pipe used for transport, this API is actually network-transparent; this could allow for some interesting novel APIs for starting and manipulating processes on different hosts.

The hardest part has been the need to create new abstractions that use this new way of executing syscalls. I couldn't think of an acceptable and performant way to reuse existing functions which implicitly make syscalls in the current process, in this new world where syscalls are done in the explicit context of some arbitrary process handle. So a fair bit of reinvention has been required to support explicitly specifying the process to operate on.

Another difficulty is the book-keeping of resources (file descriptors, paths, pointers) across multiple processes. Treating file descriptors as ints is difficult to keep straight when working with multiple file descriptor tables across multiple processes, where the same int might refer to different file descriptors in different processes. So I've had to develop multiple layers of abstractions for user programs which manipulate other processes: one layer which works with raw int file descriptors, and other layers on top of it which work with file descriptors as a combination of an int and the fd table it is valid within. Similar abstractions are needed for other resources as well.

It's so far very expressive and powerful. It's been surprisingly easy to adapt my development to this new way of spawning and manipulating processes. I definitely think that cross-process operations (more generally, explicitly specifying the thing to act on in all syscalls, instead of implicitly working on the current process or whatever) are the right design for operating systems; it's much more expressive than both the posix_spawn style and the fork style.
Posted Apr 10, 2019 19:38 UTC (Wed)
by patrakov (subscriber, #97174)
[Link] (3 responses)
"""
That's outright manipulation of the available facts, good enough to be included in propaganda textbooks. For starters, fork() is used not only in the pattern where it is immediately followed by exec(). E.g., Redis uses fork() as a method to obtain a consistent snapshot of the database in memory, without running a separate executable. While indeed not every use of fork() is justified, the authors could at least have refrained from mixing examples that are convertible to posix_spawn() with those that are not.
Posted Apr 10, 2019 20:05 UTC (Wed)
by roc (subscriber, #30627)
[Link] (2 responses)
Posted Apr 11, 2019 6:48 UTC (Thu)
by nilsmeyer (guest, #122604)
[Link] (1 responses)
Posted Apr 12, 2019 18:45 UTC (Fri)
by HelloWorld (guest, #56129)
[Link]
Posted Apr 10, 2019 20:28 UTC (Wed)
by roc (subscriber, #30627)
[Link] (6 responses)
I've always found fork() much better than CreateProcess() because it's simply untenable for a single function call to set up every aspect of the child process, and the authors acknowledge this in section 6. Their suggestion to get around that is to have system calls that let you change the state of the child from the parent. Unfortunately that creates its own problems --- richer system call API surface with more potential races and security issues (though those races also exist for "modify own process" system calls in the presence of threads).
Another possible approach that they don't mention would be to replace fork() with a spawn() function that can handle common cases, but still support exec(). Then for complicated cases not handled by spawn(), you would spawn() a helper binary that communicates with the parent to complete setting up the child before exec()ing the real binary. Then again, Linux execve() is *also* a big problem, requiring all kernel resources to specify what happens when an execve() occurs, and also having problems with multithreaded processes. (The section of the ptrace() man page on execve() of multithreaded processes is very scary.) So maybe the way to go is to eliminate both fork() and exec() and have spawn() start execution in a standard userspace stub like ld.so, which supports a standard protocol for communicating with the parent to set up the child process environment before entering the real binary.
The paper would have been stronger if they had also discussed issues with execve() and talked about the benefits of eliminating *both* fork() and execve() in favour of spawn().
One thing that worries me about marginalizing or removing fork() is that a COW memory snapshot system call is still very much needed to implement rr replay and other things. fork() being that syscall is great for us, because it's so commonly used that it's guaranteed to work efficiently and well. Then again, a dedicated COW snapshot call could potentially eliminate some of the problems with using fork() for this, e.g. the fact that shared memory segments aren't copied.
Posted Apr 10, 2019 21:20 UTC (Wed)
by roc (subscriber, #30627)
[Link] (4 responses)
Even better, have the spawn() syscall specify an arbitrary executable to do the job of ld.so, and then pass the real executable to it as one of the parameters you send over IPC. Then you can do whatever you want to set up the process even if the standard component doesn't support it.
Of course the elephant in the room is that even if everyone in the world agrees that fork() should go, getting rid of it in the application software people care about is a very long-term project whose benefits would take a long time to be realised. Perhaps all the more reason to start working on a transition now, initially by designing, implementing and deploying the replacement APIs.
Posted Apr 11, 2019 4:08 UTC (Thu)
by dw (subscriber, #12017)
[Link] (3 responses)
Where libc might ship a static (and therefore almost invisibly fast) helper to interpret the contents of e.g. a memfd passed in on a known descriptor, the helper would initially implement the posix_spawn calls. An even simpler option might drop the FD-mapping arguments in favour of a fixed behaviour of starting the program with only one fd connected to a UNIX socket, with FD passing used as desired to communicate additional objects to the helper.
The goal, as with yours, is of course to avoid putting any of this policy in the kernel again if it can be practically avoided, and to make it simple to iterate on the userspace helper without changing any OS interface or even having to wait for libc.
Posted Apr 11, 2019 15:42 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (2 responses)
exec 4>logfile
This also could affect something which does intermediate shell scripts before launching the real binary or things like `git` fork/exec'ing into a non-builtin subcommand.
Maybe this is just a silly use case and not worth all the CLOEXEC stuff. Though there is some set of tools I saw around where everything was done via "set one thing and then exec the next" for everything from environment modification to dropping privileges. That might have more of an issue there.
Maybe the better default is to have everything be CLOEXEC by default, but once something is not CLOEXEC, it sticks around after an exec transitively (stdin/stdout/stderr would default to being not-CLOEXEC)? Of course, this is a much more expansive API change.
Posted Apr 11, 2019 19:52 UTC (Thu)
by roc (subscriber, #30627)
[Link] (1 responses)
It's unclear what you're relying on here. Are you relying on the default inheritance of file descriptors through fork/exec to smuggle fds from your process to its grandchildren?
If so, that is indeed fundamentally incompatible with the desire to stop inheriting capabilities by default. There are a few ways to work around it. One is to add features to specific processes (e.g. shells) to notify them of fds that they should pass forward into spawned children. Another is to make that a library feature, so that when you create a process you can pass in a set of inheritable fds, and your library's spawn function has an option that lets you opt into forwarding those fds. Another would be to stop relying on inheritance and use something else, like AF_UNIX sockets, to communicate with the grandchildren.
Posted Apr 11, 2019 19:52 UTC (Thu)
by roc (subscriber, #30627)
[Link]
Posted Apr 10, 2019 21:48 UTC (Wed)
by nix (subscriber, #2304)
[Link]
Posted Apr 11, 2019 4:24 UTC (Thu)
by joncb (guest, #128491)
[Link] (6 responses)
There's no OS police that will come knock down your door and arrest you if you don't implement fork in your research system. People create all kinds of research OSes that make all kinds of oddball decisions all the time, unikernels being a specific example of systems that (generally) don't implement fork (or certainly don't implement it the same way as everyone else). Sure, if you want a POSIX compatibility layer then you need to implement fork, because fork is a required part of POSIX. But if your hope is to change that, then don't be surprised if people tell you to go to hell for wanting to inconvenience the roughly 20M software developers and 4B software users who rely on fork semantics on a daily basis, just to reduce the amount of work that maybe 10K OS researchers have to do in those rarish circumstances where they need to.
Posted Apr 11, 2019 6:22 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
A special checkpoint/restore functionality for people who NEED it explicitly (see: Redis) would be much better.
After all, why should this very niche use dictate the design of the OS?
Posted Apr 11, 2019 6:55 UTC (Thu)
by nilsmeyer (guest, #122604)
[Link] (1 responses)
Is that conjecture or can you back up that claim with data? That might make for an interesting research project.
> A special checkpoint/restore functionality for people who NEED it explicitly (see: Redis) would be much better.
That's an interesting idea, is there an implementation of that in any OS? The problem is as long as you don't have a critical mass of systems implementing the new semantics you'll still have to use fork() and then the question quickly becomes whether or not it's worthwhile to cover other cases.
> After all, why should this very niche use dictate the design of the OS?
Compatibility with existing software.
Posted Apr 11, 2019 7:32 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link]
> That's an interesting idea, is there an implementation of that in any OS? The problem is as long as you don't have a critical mass of systems implementing the new semantics you'll still have to use fork() and then the question quickly becomes whether or not it's worthwhile to cover other cases.
You can clone your VMAs with CoW semantics: https://fuchsia.googlesource.com/zircon/+/HEAD/docs/sysca... and map them into a new process if needed or use however else you want.
> Compatibility with existing software.
A huge amount of software is already ported to Windows which doesn't have fork() support, so it's unlikely that porting to a new API will be an insurmountable problem.
Posted Apr 11, 2019 9:18 UTC (Thu)
by mm7323 (subscriber, #87386)
[Link] (2 responses)
Had fork() & exec() been higher level operations with more descriptive parameters, like CreateProcess(), they would perhaps have been able to implement wrappers through different research OS primitives or mechanisms and then quickly gain access to a rich user-space environment supporting real-world workloads.
Sucks to be them.
Posted Apr 11, 2019 15:30 UTC (Thu)
by beagnach (guest, #32987)
[Link] (1 responses)
The core argument is that fork() forces undesirable design choices into every layer of any system that implements it, Linux being affected by this as much as any "toy research OS". The authors' point is that we have become so accustomed to fork()/exec() being the "natural" way to handle process creation that we have become oblivious to the compromises entailed.
I get that reading and understanding a 6 page article by heavyweight computer researchers is hard work, but still... this is LWN, which is valued for the quality of its reporting and technical discussion.
If the best you can come up with is "Sucks to be them" then why not head on over to slashdot (assuming it still exists).
Posted Apr 11, 2019 21:53 UTC (Thu)
by mm7323 (subscriber, #87386)
[Link]
For the benefit of others, I apologise if my choice of wording is inflammatory, but I don't think it is incorrect. As the top level comment says, there is no 'fork() police' forcing every OS to implement those semantics. Kernels are free to go a different route, and Linux could add alternative process creation syscalls if desirable. But to date fork() is entrenched in userspace software and to make a relevant and _practical_ kernel you need a reasonably fully featured userspace from somewhere before you can run meaningful workloads and claim you have anything but a toy.
So I think the paper may be born out of the frustration research kernel developers see when faced with some of the following choices:
1) Implement fork() with its semantics and pitfalls, and do little to no new research in that area.
2) Skip fork() and instead port existing userspace software to the research kernel's own primitives.
Because of the nuances of the fork() + exec() API, item 2 isn't just porting a libc-compatible runtime - instead you need to be looking at every call site of fork() and/or exec() *and* other syscalls, and then patch them in each package. And then potentially maintain those patches if the research kernel is to stay up to date. It's a lot of work, I dare say not the most interesting work for kernel researchers to be undertaking, and almost completely secondary to actual kernel research.
This is where I think Fuchsia/Zircon has a real advantage. While Bionic supports both fork() and exec(), their use is most likely confined, and most of Android 'userspace' is up in the JVM anyway. Within Android Java there are methods that may classically result in fork() + exec() (e.g. Runtime.exec() and ProcessBuilder().start()), but these are high enough up that they don't require exact fork() semantics, and so may more easily be converted to different primitives on a new kernel model, benefiting file descriptor and memory abstractions too.
Posted Apr 11, 2019 6:35 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (4 responses)
Posted Apr 11, 2019 7:55 UTC (Thu)
by chatcannon (subscriber, #122400)
[Link] (3 responses)
So far as I can tell from the man page, posix_spawn() on Linux is implemented by the libc and uses the fork() and exec() system calls internally.
Posted Apr 11, 2019 9:49 UTC (Thu)
by jwilk (subscriber, #63328)
[Link] (2 responses)
Posted Apr 11, 2019 16:02 UTC (Thu)
by magfr (subscriber, #16052)
[Link]
Posted Apr 11, 2019 23:20 UTC (Thu)
by pabs (subscriber, #43278)
[Link]
Posted Apr 11, 2019 20:23 UTC (Thu)
by ecree (guest, #95790)
[Link] (1 responses)
The authors' criticisms seem to revolve mostly around "fork plays badly with $modern_os_feature", but usually the fault lies with $modern_os_feature (threading being the most obvious example).
The snark about goto in section 7 demonstrates that the authors have exactly the kind of ivory-tower attitude to which Unix has always been opposed; a sensible programming course would start bottom-up, with assembly, to impart the crucial concepts of a computer's execution model; and assembly does indeed begin with goto (although it calls it jmp).
fork()/exec() is an example of brilliant taste by the original inventors of Unix; and taste is like jazz: if you have to ask what it is, you ain't never gonna know.
Posted Apr 13, 2019 12:08 UTC (Sat)
by farnz (subscriber, #17727)
[Link]
I disagree; I think that fork isn't a particularly tasteful interface (although I understand why it was implemented that way for the PDP-7, which had no virtual memory); it conflates three primitive operations in a multiprocess virtual memory OS: creating a new schedulable task, creating a new virtual address space, and duplicating the contents of the current address space into the new one.
Now, I don't see the problem with conflating the first two options (a "spawn" operation, if you will); they are both simple operations, and you simplify the OS if each schedulable task has a unique virtual address space (we can call this combination a "process" 😀). The last, however, is a complex operation on any OS that lets you have shared memory (rather than doing what early UNIX did, and swapping entire processes to disk in order to switch to another process), and shouldn't be conflated with the first two.
On the other hand, exec is a deeply tasteful interface; it says that process environment setup has a lot of details, and there will be more details in future, so don't try to enumerate them all via a CreateProcess-like interface; instead, just let the user run arbitrary code in their new process to set up the world, and then replace the running code with the code that wants that environment.
With full hindsight on 50-odd years of hardware and software evolution, including the creation of dynamic linking, I'd prefer to see a spawn+exec pair. spawn takes an image to spawn, plus an "overlay" section that gets copied (CoW) to the program interpreter (in the sense that ld.so is an interpreter, not in the sense that Python is an interpreter) to be used in the dynamic linking phase. For statically linked programs, the program interpreter is part of the main program image, and thus gets the overlay section as input. This lets you send state down to the new process, and then have it used to set up the new process environment; the new process might turn out to be a simple helper that just sets up the environment and calls exec, of course (and, indeed, as you get a new section, you can have both helper and original process live in the same image, using the contents of the spawn section to distinguish "executed fresh" from "spawned ready to exec a new process").
Posted Apr 11, 2019 21:37 UTC (Thu)
by gnu_lorien (subscriber, #44036)
[Link]
Posted Apr 12, 2019 4:18 UTC (Fri)
by lieb (guest, #42749)
[Link] (2 responses)
I started off with Tenex (1.34) and moved to UNIX (V6 w/ Univ Ill NCP (Arpanet)), so I've seen a bit. I had to dig out my old TENEX docs to refresh my memory... It has been a while since I wrote a JSYS FORK. I'm not bragging about my age; rather, there needs to be some longer perspective in this discussion.
As the authors hinted, TENEX did have a FORK that would do everything in one go; create the fork, populate it with an image, and start it. There were other variations on the theme, such as an equivalent of vfork(), but it was not all that great. It was pretty advanced for its day but in the end, it didn't really do much other than introduce a real VM with functioning page faults. Its filesystem wasn't much better than its progeny VFAT. For example, with its rigid process architecture, there was no way to do "foocmd < bits.in >stuff.out&". Things like threading were also awkward at best. The UNIX model of fork() + exec() solved a lot of those problems. It worked for two reasons. First, a pid is a global object. I can create a process, tell someone else its pid, and they can play with it. A fork was and still is cheap(er), and the model of orphaning a pid to the init proc made backgrounding trivial. That does not seem like much but it is when you try to make a network daemon or batch system and don't have it. Second, splitting the two has real power - and with real power there are also risks (and bugs).
A TENEX fork, just like CreateProcess was a single shot. Once it is gone into the system it is gone. Sure, you can manipulate it some but now you have one proc fiddling with another with all the race conditions it implies as shown by the complexity of ptrace(). CreateProcess solves this problem with its mountain of API args. However...
Back with V6, fork() and exec() were pretty simple. But they already had some things that TENEX didn't have. That power was, and still is, what goes on between the child's return from fork() and its subsequent exec(). In those days, we didn't have too much to do other than some close() and open() calls, usually redirecting one or more of 0, 1, 2, the closing of random other files (very rare but possible), and maybe setuid(), setgid(). There were no capabilities or shared anything to clean up. Since those days, the number of things to be policed from the old environment before the new one gets launched has grown. For example, execve() popped up because we got this thing called "environment" in V7, which sometimes required a scrubbing of the env vector. It only got better and worse. The authors rightfully note this growth of features and the resource costs that go with them. The costs are there no matter what model is used for creating/managing/destroying them. You will have to do those actions somewhere. The only question is where.
They also criticize all the close-on-exec stuff scattered about in the various kernel subsystems that use an fd. Point well taken. Then again, this is still territory where one really needs to know more than just garden-variety algorithms. And where else would you handle things such as this? The kernel doesn't know the significance of one open fd versus another, and how would you construct an API extension for clone() to handle such an open-ended requirement? This is something where only the app has knowledge of the full context. Therefore, it is the app's responsibility. You have a choice to either do it somewhere in the app (the Linux choice) or in a system API somewhere. You cannot fob it off somewhere else.
Consider the following issues on allocated resources, most of them open "files". There was a time when a proc could only have 16 open files. BSD moved that to 20. They then implemented the select() call. Its argument list included bit masks for the fds of interest. It seemed like a good idea at the time. Besides, what app would have more than 10-12 files open at any one time? Well, guess what. In the early 90's, having only 4k open files in a database or pthread app was a limit. This forced the new select2() syscall, because the API had to change to handle a variable-length bitmask but the old select() was cast in concrete. The AltaVista webserver, which I maintained for a while back then, blew that number bigtime. The select() syscall was no longer a good idea, and select2 with huge holes in the bitmap was only marginally better. Hint: even if all you kept open most of the time were 0, 1, 2 and a few fds that you had hanging around after a flurry of file stuff, you could still have an fd > 4096... Bit masks were no longer a good idea. Eventually poll/epoll and friends took over. The lesson here is that systems must evolve to address the continuing stream of new requirements.
This is where CreateProcess() and its API come in. All those arguments are there for a purpose and are fixed for all time unless you are really into pain. But is that all that will be needed/wanted? History shows that no, it isn't. The things that must be manipulated when a process/fork gets created will grow in number. There are three choices: the system remains static for all time, you hack the API one more time, or you have an interim period in the child's code where all this can happen prior to launching the new image with exec(). UNIX chose the last and we benefit from that choice.
The creation of a process/thread in any OS is always tricky. There are timing/race conditions to consider, security vulnerabilities to deal with, etc., etc. This is not code for kiddies. But just as with the original exec() -> execve() and fork() -> vfork() issues, new capabilities stretch the limits of the process model and, from experience, adding a new syscall, painful as it is, is better than extending an existing API into uncharted territory. Therefore, having that interim period in the child's initialization has real value. This is where one deals with the future, right there after the child's return from fork(). Only the app knows what privs and resources should be freed before letting the next image take over after exec. Only the app knows whether a particular resource that it needs must be closed on exec. So have the app do it. The kernel doesn't know enough to do the right thing even if it cared. This is one place in the app codebase where such things matter. It has to be carefully written and debugged. It takes skill to do it. It also has to be done someplace. Launching a proc in any system involves two phases: first, clean up inherited "stuff" and, second, initialize a safe new environment for the child before it invokes main(). There are three places to do it. You can do it in this interim code before exec(); you can do it in crt0 or ld.so; or you can do it in the kernel. Take your pick. The interim code is safe in that it is no worse than the rest of the app and fits the need. Doing it in crt0 or ld.so adds special and unique app requirements to a standard/common bit of system runtime. Don't even think about the kernel. Once in the kernel ABI, always in the kernel ABI, and for very good reasons. Hint: they bring pitchforks to these change proposal meetings...
The problems with threads and the mixing of threads with process creation (fork+exec) are real. They are two very different beasts and don't get along all that well. We argued about that back in the DCE threads days, and things did not get easier just because we changed the name to pthreads. But pthreads itself, and any of its offshoots such as LWP, are not much better. Pthreads does a reasonable job at forking a process, but one successfully does such things by abiding by a set of design rules just a little less complex than a EULA. And that is so for a reason. I look at pthreads and its mix of mutex + condition vars as being a little safer than writing the whole thing in ASM, which I did on TENEX; i.e., there is nothing to enforce any of those mostly documented rules and design patterns. There is little help and no guard rails in this model, which is why, 30 years later, there are not all that many of us who can really grok this stuff, and even then, getting critical sections and lock optimization right is hard work. A pthread call is, after all, just another function call... Coverity et al have to do some real back-breaking work (magic) to make sense of the rubbish shoved into it enough to report a reasonable error. Java doesn't offer much in this space either, even if it has some threading "primitives" in the language (more or less). C++ is worse, little better than C + pthreads. The closest I've seen in modern languages is the golang model, mainly because it has concurrency (and the constraints necessary to keep it "safe") built into the language itself, where the compiler and analysis tools can see what is going on. Also note that they use a "concurrency" model, not a multi-threading model (see Pike's numerous blogs). All the magic is in semantic pass(es) of the compiler and the runtime, well out of the reach of the app programmer.
If we look at the fork() implementation in the Linux kernel, we find that fork and vfork are just wrappers around the full-blown clone call, all of them calling _do_fork(). The pthread_create() lib call uses clone directly. This is also why posix_spawn() is faster in their graph. It is a properly clamped-down clone() followed by quick lib resource cleanup followed by exec(). This was a smart move when NPTL entered the kernel in 2.6. The kernel doesn't care if a task is a thread or a proc; it just does its scheduling and resources thing. Only the app cares, and the library does a respectable job with the rest. Note that its options to COW or share a restricted set of objects are limited to just the things that user code can't manage. Hint: why don't they use clone() instead in their runtime?
The authors made some comments on how, without fork+exec, they could do really cool stuff like load and relocate another process in the same address space. Why would anyone want to do such a thing, other than to fool around with some academic notion? Memory management, even the pre-VM segment management in the PDP-11, is a very good thing. The reason is simple. If you can't address the object (in memory or the kernel) you can't piddle all over it. It is bad enough when a pthread goes rogue and stomps on things. Why would you import an unknown quantity like an arbitrary executable into your address space? That is an attack surface bigger than the flight deck of the Carl Vinson. One can escape any language runtime into ASM, and once there, all bets are off. In other words, so what if you can load and randomly relocate multiple copies of a DLL/SO? I submit that is a feature in search of a problem to solve. If you want to do such things, use a VM or container and let the hypervisor keep you out of mischief. If you don't want fork, use a unikernel in a VM and get on with it. The realtime gadget people do it all the time with bare-iron things like Arduinos and MIPS SOCs.
One last point. Removing fork+exec from UNIX (really Linux these days) is a fool's errand. There is one very big reason why anyone would care, other than an academic exercise in woulda-coulda-been semantics. There is a massive amount of code out there that runs inside a UNIX model, and it does so for a very simple engineering and operational reason. As bad as it is, it is still, on the whole, better than all the OS models it displaced. I mentioned at the top that my first system was TENEX. I also worked on the DECSystem-20, a commercialized version of that OS, before I moved exclusively to UNIX/Linux. Those were good systems that did cool things, but most of us who left them behind had good reasons to move on. Those systems and all the other "proprietary" systems are now but memories to talk about over beers with other retired hackers. Anyone remember the DG Eclipse? AOS/VS had some really cool features, such as a built-in threading model, that were way ahead of their time. But where is DG, or DEC, or even Sun now? Most of the UNIX systems are gone, leaving only {Free,Net,Open}BSD still chugging away. All those systems have been replaced by a standard system that does its job very well, and it happens to be UNIX. Linux has evolved over the years, but the core similarities and model are still closer to UNIX V6 than any of the other long-gone OS designs. It is the standard OS, just like the electrical outlets you get down at Home Depot are standard. Imagine the chaos that would return to metal fabrication if, instead of using metric or "English" sizes, everyone chose their own arbitrary thread dimensions for fasteners. Having two complete sets of tools, one metric and one SAE, is pain enough, which is why every country (other than the USA) is now almost completely metric. The same applies to current OS ABI/API standards.
That massive amount of code only really happened when those of us who had to build real systems stopped arguing and accepted the one system we could all agree (at least in principle) could do the job and could share in common without a lot of legal/financial friction. The world converged even more on Linux for the same reason. People who want to build big, complex systems, or who want to build handheld things like smartphones by the billion, just want something on top of the iron that they can depend on rather than re-invent. Even Microsoft has figured this out. There is no money in maintaining a proprietary OS anymore, other than to support an Office suite that is itself moving off the desktop and into "the Cloud". Unlike Linux, where the development model scales to fill the staffing requirement because everyone and anyone who needs it can contribute their expertise, all of the Windows system specialists who really understand how the guts of the thing work are in a proprietary, need-to-know box on the Microsoft payroll, which is why Windows/N, N=1->inf, is really in maintenance mode. That group is a "cost center" that can't grow, because it would eat the engineering budget alive while providing little more than a support layer underneath their Office products (the real cash cow). Their next new thing, where their dev money is being spent these days, is Azure, which is a service whose profitability is based on simple usage scaling, not feature development. And yes, most of the VMs and containers they run have fork() somewhere in the runtime.
I don't mean to dump all over the authors. But this piece is an opinion piece, if not a gripe session, not a research report. Cygwin, in their research, provides crappy fork() performance, primarily because of the impedance mismatch between the UNIX model and their model running over Windows. So what else is new. My son has solved that problem. He's given up on using things like git on Windows and is tired of the self-inflicted incompatibilities in Mac/OS (old python et al). He now has a Windows/10 machine for company stuff like Outlook and runs Fedora 29 in a VM to do his development work, which does the deed just fine. When the authors and the users of their OS paradigm have enough code to double the size of GitHub and SourceForge, maybe then their argument would make sense. Otherwise, this is much ado about nothing and wishing for unicorns. (Lots of) code that works beats elegant designs that don't (yet) every time.
Sorry for being a grumpy old hacker.
Posted Apr 12, 2019 5:01 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (1 responses)
> I don't mean to dump all over the authors. But this piece is an opinion piece, if not a gripe session, not a research report. Cygwin, in their research, provides crappy fork() performance, primarily because of the impedance mismatch between the UNIX model and their model running over Windows.
Posted Apr 12, 2019 6:53 UTC (Fri)
by eru (subscriber, #2753)
[Link]
The mobile app developers probably would not even notice that change of kernel, especially since Google would work hard to minimize its visible effects on interfaces, for backward-compatibility reasons. Aren't Android apps mostly written in Java or some other higher-level language anyway?
Posted Apr 12, 2019 21:36 UTC (Fri)
by rweikusat2 (subscriber, #117920)
[Link] (3 responses)
Memory overcommit on fork is indeed sensible, but that's a Linux innovation. The default behaviour of 7th-edition-style forks, failing when not enough swap space can be reserved, could be described as "suicide out of fear of death": no one knows how much of the inherited address space will actually need to be copied, so this entirely arbitrary limit prefers "guaranteed failure now" over "possible success in the future", even though "guaranteed failure now" mode obviously cannot guarantee that neither of the two forked processes will end up failing due to an out-of-memory situation encountered in a future memory allocation.
Posted Apr 17, 2019 16:41 UTC (Wed)
by BenHutchings (subscriber, #37955)
[Link] (2 responses)
Posted Apr 22, 2019 20:48 UTC (Mon)
by tao (subscriber, #17563)
[Link]
Posted Apr 25, 2019 11:03 UTC (Thu)
by nix (subscriber, #2304)
[Link]
Posted Apr 14, 2019 22:40 UTC (Sun)
by magfr (subscriber, #16052)
[Link]
Microsoft research: A fork() in the road
Like, say, an application server with a large amount of cached data?
Microsoft research: A fork() in the road
Oh wow. So we need a persistent daemon that does RPC to simply launch processes efficiently?
Microsoft research: A fork() in the road
There doesn't need to be any RPC involved, the data-cruncher can & should be a child process forked from the driver. The communication between the two might be sockets, or shm, but it could be as simple as the cruncher receiving jobs on stdin and shipping notifications on stdout.
Yes. It is.
Microsoft research: A fork() in the road
I'm sorry. Have you ever worked with Java? Typically you have a server that runs some kind of service. It's a single process - it makes sharing data between requests very easy.
Nope. We have examples of better-designed APIs now.
Microsoft research: A fork() in the road
Fun fact: none of them allow you to transparently share complex data, with automatic garbage collection.
So how about being able to run processes without requiring ugly workarounds? Or is this a part that doesn't need to be done well?
So you're confirming the authors' statement - you have to build your whole system around deficiencies of fork().
No. The Unix philosophy is to get something working ASAP and then just objectify it as the epitome of creation, whether it's bad or not.
Microsoft research: A fork() in the road
Why is a server that allows you to seamlessly share complex graphs of objects badly designed? Designing something as multiple processes is not in itself better.
The first Unix versions were written in assembly. Unfortunately, PDPs became unavailable; otherwise Unix fans would still be extolling its virtues.
Microsoft research: A fork() in the road
> The first Unix versions were written in assembly. Unfortunately, PDP-s became unavailable otherwise Unix fans would have still be extolling the virtues of it.
The original PDP-7 implementation was written in machine language for want of any other choice. Ditto for parts of the original PDP-11 implementation. Nevertheless,
We all wanted to create interesting software more easily. Using assembler was dreary enough that B, despite its
performance problems, had been supplemented by a small library of useful service routines and was
being used for more and more new programs.
[D. Ritchie, The Development of the C Language]
By early 1973, the essentials of modern C were complete. The language and compiler were
strong enough to permit us to rewrite the Unix kernel for the PDP-11 in C during the summer of
that year. (Thompson had made a brief attempt to produce a system coded in an early version of
C--before structures--in 1972, but gave up the effort.)
[p. 16]
Microsoft research: A fork() in the road
I could be wrong, I don't know your workload, but I feel like Java and fork are not meant to be friends.
Microsoft research: A fork() in the road
In this particular case I was running a GPU-based optimizer in a separate process. It was kinda crashy (drivers...), so isolating it was a good idea. Heck, it even used pipe-based interaction. How much more Unixy can you get?
They both use fork (more precisely, clone) on Linux. There's no way to avoid it, and this is one of the problems.
Microsoft research: A fork() in the road
Of course, if you don't like systemd, just write a dedicated server which does whatever you want done.
Microsoft research: A fork() in the road
Typically this kind of stuff is stored in native data structures, so it doesn't have anything to do with files. You also typically can't control the allocations made by the JVM or your language runtime.
Microsoft research: A fork() in the road
Well, they don't. Go, Python, Java, C# all use simple private mappings.
Microsoft research: A fork() in the road
As demonstrated by Zircon in Fuchsia, it's not necessary. You can download the Fuchsia SDK yourself and check it, it's available right now as a counter-example to your point.
No, I would have a family of process-management functions that accept the target process handle as a parameter and ability to create suspended processes.
Microsoft research: A fork() in the road
> Why? If it's a shared mapping, then writes by the child should be visible in the parent and vice-versa, so both processes can map the same page and no need to COW. What am I missing?
Microsoft research: A fork() in the road
The most important part of this is sharing of FDs, and Linux could use something like DuplicateHandle from NT: https://docs.microsoft.com/en-us/windows/desktop/api/hand...
Microsoft research: A fork() in the road
For users/application programmers that flexibility is a major feature, a huge increase in the power of the system.
Over-commit is an optimisation feature which, like networks, can cause temporary or sudden failure of operations apparently guaranteed by the OS. It was bolted on in POSIX-like OSes. I ran machines which actually needed all of virtual memory backed by swap space, and applications which were careless paid a performance penalty or caused themselves to fail very often due to insufficient virtual memory. So the argument that applications used huge, sparsely used blocks of memory was moot; that just didn't fly!
Perhaps programs that require over-commit to run should have to accept penalties, i.e. being classed as likely memory hogs and prime candidates for being "frozen" to disk in the event the system is under pressure. I'd rather have the applications that use over-commit pay a price in complexity, using some library-provided sparse data structure, rather than every single simple program ever written for the system!
Microsoft research: A fork() in the road
False dichotomy. CreateProcess() is ugly because of extreme backward-compat requirements, there's nothing inherent in the idea which would make such a function ugly.
Fork() is absurdly hard to use in multithreaded cases
almost always going to be "start a new subprocess /usr/bin/something with arguments X, Y, Z"
It's a complete mess because of its implicitness.
O_CLOEXEC, FD_CLOEXEC, EFD_CLOEXEC, EPOLL_CLOEXEC, F_DUPFD_CLOEXEC, IN_CLOEXEC, MFD_CLOEXEC, SFD_CLOEXEC, SOCK_CLOEXEC, TFD_CLOEXEC, DRM_CLOEXEC, FAN_CLOEXEC, UDMABUF_FLAGS_CLOEXEC
Microsoft research: A fork() in the road
>
> O_CLOEXEC, FD_CLOEXEC, EFD_CLOEXEC, EPOLL_CLOEXEC, F_DUPFD_CLOEXEC, IN_CLOEXEC, MFD_CLOEXEC, SFD_CLOEXEC, SOCK_CLOEXEC, TFD_CLOEXEC, DRM_CLOEXEC, FAN_CLOEXEC, UDMABUF_FLAGS_CLOEXEC
Microsoft research: A fork() in the road
O_CLOEXEC, FD_CLOEXEC, EFD_CLOEXEC, EPOLL_CLOEXEC, F_DUPFD_CLOEXEC, IN_CLOEXEC, MFD_CLOEXEC, SFD_CLOEXEC, SOCK_CLOEXEC, TFD_CLOEXEC, DRM_CLOEXEC, FAN_CLOEXEC, UDMABUF_FLAGS_CLOEXEC
Microsoft research: A fork() in the road
Does it? Something like 90% of cases are simple fork-exec and they can migrate to posix_spawn() as is, even given its deficiencies.
Microsoft research: A fork() in the road
And why should a multithreaded app do it if not for the exec? For all other cases it can just create a new thread.
Microsoft research: A fork() in the road
I was answering "How can a multi-threaded application do a fork in a sane manner?". One can fork as much as needed to create separate child processes first and then possibly create threads in each of those (which I guess is not a problem?), but why fork an already multithreaded process (instead of creating an additional thread) if not for the exec?
Microsoft research: A fork() in the road
> A simple but common case is one thread doing memory allocation and holding a heap lock, while another thread forks. Any attempt to allocate memory in the child (and thus acquire the same lock) will immediately deadlock waiting for an unlock operation that will never happen.
Microsoft research: A fork() in the road
* Thread A acquires malloc() lock
* Thread B calls fork()
* Thread A releases malloc() lock
Microsoft research: A fork() in the road
One of the many ... dubious claims is "fork() breaks Buffered I/O"... so does read()/write().
Microsoft research: A fork() in the road
[2]: http://www.cs.toronto.edu/~brudno/public/pdf/lagar2011sno...
[3]: http://www.cs.toronto.edu/~sahil/suneja-hotcloud14.pdf
Microsoft research: A fork() in the road
I'm hesitant to comment here because it's not done, but I've been working on an implementation for Linux of cross-process operations, so that inchoate processes can be created and manipulated from other processes.
Microsoft research: A fork() in the road
> ...found 1304 Ubuntu packages (7.2% of the total) calling fork, compared to only 41 uses of the more modern posix_spawn(). Fork is used by almost every Unix shell, major web and database servers (e.g., Apache, PostgreSQL, and Oracle), Google Chrome, the Redis key-value store, and even Node.js.
Microsoft research: A fork() in the road
No it's not, it is crystal clear. The only thing muddying the waters here is people's interpretation.
Microsoft research: A fork() in the road
I don't think it's necessary to have any specially formatted stub, just a call like:
Microsoft research: A fork() in the road
create_process(
/* helper program */
const char *path,
int argc,
const char *argv[],
const char *env[],
/* fd count */
int fdc,
/* mapped into the new process as fds 0..fdc-1 */
int fds[]
);
Microsoft research: A fork() in the road
some_command | analysis --log-file=/proc/self/fd/4 | transform | moreanalysis --log-file=/proc/self/fd/4
Microsoft research: A fork() in the road
Their suggestion to get around that is to have system calls that let you change the state of the child from the parent.
This is a very nice suggestion, indeed -- you can tell I've been through the fire, because my first thought was "if they're nice to use, we could replace most of ptrace() with them! yeaaaaahhhhhh!". (I mean, yes, PTRACE_SEIZE is far nicer than the old model, but it's still a horrible syscall to use, though much of its horror has to do with unrelated problems like signal handling that only apply to processes that have started running...)
Microsoft research: A fork() in the road
Most of the fork() stuff is used for simple fork+exec. Which is totally dumb.
Microsoft research: A fork() in the road
I actually did this some time ago during one of the older flame wars about process API. I traced the fork() call and the exec syscall, and checked the difference over the course of several hours. They matched within 5%.
Fuchsia doesn't have it as an easy-to-use built-in (yet?), but it's implementable through its core API: https://fuchsia.googlesource.com/zircon/+/HEAD/docs/sysca...
fork() can be implemented on top of checkpoint()/restore() mechanisms as a fallback. Perhaps with some efficiency hit.
Microsoft research: A fork() in the road
1) Try and port packages to their kernel to produce a usable userspace.
2) Climb the mountain of creating a brand new userspace for the research kernel.
Microsoft research: A fork() in the road
https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=9f...
Microsoft research: A fork() in the road
Microsoft Research: A fork() in the road
Sometimes they are necessary. Any realistic removal plan would require decades of transition time, though. Or perhaps not, if Google simply replaces Linux with Fuchsia in Android and everybody is forced to write to that API.
Windows actually supports pretty performant fork() in its kernel. It's used in the new Linux subsystem for Windows and before that it was used in UNIX Services for Windows. It suffers from the overcommit problem, but otherwise it's enough to run most ported Unix apps.
>Or perhaps not, if Google simply replaces Linux with Fuchsia in Android and everybody is forced to write to that API.
Microsoft Research: A fork() in the road
O_CLOFORK