LWN: Comments on "Cloning into a control group" https://lwn.net/Articles/807882/ This is a special feed containing comments posted to the individual LWN article titled "Cloning into a control group". en-us Sat, 04 Oct 2025 07:23:28 +0000 Sat, 04 Oct 2025 07:23:28 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Cloning into a control group https://lwn.net/Articles/808723/ https://lwn.net/Articles/808723/ notriddle <div class="FormattedComment"> <a href="https://docs.microsoft.com/en-us/windows/win32/api/shellapi/ns-shellapi-notifyicondataw">https://docs.microsoft.com/en-us/windows/win32/api/shella...</a><br> <p> The cbSize member has the size of the whole struct (it's embedded as the first member, instead of being a separate param, but that's the only difference).<br> </div> Sun, 05 Jan 2020 20:10:22 +0000 Cloning into a control group https://lwn.net/Articles/808121/ https://lwn.net/Articles/808121/ NYKevin <div class="FormattedComment"> <font class="QuotedText">&gt; If you’re using control groups as a mechanism to control potentially malicious users, then any opportunity to consume uncharged resources is a potential problem.</font><br> <p> Ideally, the child process's (E)UID is not set to the untrusted value until after the child is in the desired cgroup (and you've set up whatever other trust boundaries, rlimits, etc. you consider appropriate). But, assuming job configuration is a process rather than an event, then something has to happen last, so I sympathize with this being a nontrivial problem. Doing things in the right order can be very complicated.<br> <p> (There's also the potential for a malicious user to DoS the manager by asking it to create a lot of stub processes faster than the manager can get around to accounting them properly... but that can and should be throttled in userspace long before clone() gets called anyway. clone() is a very expensive syscall compared to updating a token bucket data structure. 
Alternatively, you might maintain a pool of half-configured stub processes, and when you run out, you stop accepting new jobs for a little while until the pool refills.)<br> </div> Thu, 26 Dec 2019 17:27:43 +0000 Syscall https://lwn.net/Articles/808093/ https://lwn.net/Articles/808093/ johill <div class="FormattedComment"> Arguably, that's kinda what netlink does. You send a message to some virtual destination, and it's not ASN.1, but ...<br> </div> Tue, 24 Dec 2019 19:55:37 +0000 Syscall https://lwn.net/Articles/808092/ https://lwn.net/Articles/808092/ hmh <div class="FormattedComment"> Hmm, I have yet to see an ASN.1 implementation that was not exploitable on its first release :(<br> <p> It is that much of a hell to fully implement safely and correctly.<br> </div> Tue, 24 Dec 2019 18:02:09 +0000 Cloning into a control group https://lwn.net/Articles/808075/ https://lwn.net/Articles/808075/ dezgeg <div class="FormattedComment"> <font class="QuotedText">&gt; It's elegant, but it locks you into an only-increase backwards compatible ABI.</font><br> <p> Which is exactly what the Linux ABI is, in 90% of the cases.<br> <p> <font class="QuotedText">&gt; However, compat in userspace can just be a library; compat checking when a binary is mmap'd for execution can load the missing pieces or reject it as broken.</font><br> <p> How would this work for, say, a statically linked Go binary?<br> <p> <font class="QuotedText">&gt; stripping down the available interfaces for attack (on the shared containers at your cloud provider) is the winning reason to do this.</font><br> <p> I would really like to see a concrete example (including full code for both the kernelspace implementing the interface and example userspace code for how to actually use it) that shows how this explicit versioning is supposedly better than the current implementation.<br> <p> For example, assume hypothetically that this clone3() interface gains two more separate features that require adding one or more fields to
struct clone_args. Then we realize this clone-to-cgroup feature was a mistake, no one uses it etc. and it is the rare case when a feature can be actually removed.<br> <p> For this current size-based implementation, once all the code that actually implements the feature is removed, I would expect the u64 cgroup; field in clone_args struct stays as is and all that is added is basically<br> <p> if (args.cgroup)<br> return -EINVAL;<br> <p> So, how would a version-based interface compare to these two lines of code? Let's not forget that this "compat checking when a binary is mmap'd for execution can load the missing pieces" does not currently exist in Linux. It is common code not related to clone3() so it should be compared to the 19-line copy_struct_from_user() function behind this size-based approach.<br> </div> Tue, 24 Dec 2019 13:56:22 +0000 Cloning into a control group https://lwn.net/Articles/808068/ https://lwn.net/Articles/808068/ k3ninho <div class="FormattedComment"> <font class="QuotedText">&gt;The reason it's done this way is that it solves the problem of compatibility in a *much* nicer way than an explicit version field.</font><br> It's elegant, but it locks you into an only-increase backwards compatible ABI. <br> <p> The elegance also plays well for engineers writing programs, you don't have to recall which verbs to conjugate and if the order is subject-verb-object or subject-object-verb.<br> <p> However, compat in userspace can just be a library; compat checking when a binary is mmap'd for execution can load the missing pieces or reject it as broken; the entire history of Linux is in git; stripping down the available interfaces for attack (on the shared containers at your cloud provider) is the winning reason to do this.<br> <p> K3n.<br> </div> Tue, 24 Dec 2019 11:59:53 +0000 Cloning into a control group https://lwn.net/Articles/808063/ https://lwn.net/Articles/808063/ tych0 <div class="FormattedComment"> So does everyone's favorite, the bpf syscall.
And since that's all anyone will use in 10 years, Linux's API will be totally clean soon!<br> </div> Tue, 24 Dec 2019 02:47:16 +0000 Cloning into a control group https://lwn.net/Articles/808062/ https://lwn.net/Articles/808062/ cyphar <div class="FormattedComment"> It also allows us to handle compatibility much more gracefully, without userspace having to do extra work -- see [1] for an explanation.<br> <p> [1]: <a href="https://lwn.net/Articles/808061/">https://lwn.net/Articles/808061/</a><br> </div> Tue, 24 Dec 2019 02:00:12 +0000 Cloning into a control group https://lwn.net/Articles/808061/ https://lwn.net/Articles/808061/ cyphar <div class="FormattedComment"> The reason it's done this way is that it solves the problem of compatibility in a *much* nicer way than an explicit version field. By using the size of the structure as the effective "version" you can get both forwards and backwards compatibility without userspace needing to do any extra work. The full explanation of the rules is given in the doc-comment for copy_struct_from_user()[1] but the basic idea is:<br> <p> * If ksize == usize, just copy the struct verbatim.<br> <p> * If ksize &lt; usize, userspace is newer than the kernel. We know what the first ksize bytes mean (because we only append to extensible structures) but the trailing (usize - ksize) bytes aren't known. But, we guarantee that any new fields will have the semantics that they are a no-op if their value is zero (the most obvious example is flag fields). Thus if the trailing (usize - ksize) bytes are zero then the newer userspace program didn't request any new features, and we can just use the ksize byte struct. If there are non-zero bytes you get -E2BIG.<br> <p> * If ksize &gt; usize, kernel is newer than userspace.
Because we only append to structures when they're extended, and the zero-value is a no-op -- we zero-fill the trailing (ksize - usize) bytes of the kernel struct (because userspace doesn't know about the extensions, we can assume they don't want them).<br> <p> If you were to just have a version field, you couldn't handle the "ksize &lt; usize" case correctly (you would always have to give some kind of error because you don't know which fields are valid) -- and newer programs would need to implement backwards-compatibility in userspace. In addition, the "ksize &gt; usize" implementation would likely be more complicated than "copy_from_user and memset", and couldn't really be made generic as in copy_struct_from_user().<br> <p> All-in-all, this is a much nicer solution to the problem. And note that clone3(2) is not the only syscall which does this -- perf_event_open(2), sched_setattr(2), and openat2(2) all work in the same way. sched_getattr(2) is somewhat similar though its semantics are a little bit ... fruity. It turns out even Windows does this for some syscalls.<br> <p> [1]: <a href="https://elixir.bootlin.com/linux/v5.4.6/source/include/linux/uaccess.h#L284">https://elixir.bootlin.com/linux/v5.4.6/source/include/li...</a><br> </div> Tue, 24 Dec 2019 01:58:58 +0000 Cloning into a control group https://lwn.net/Articles/808057/ https://lwn.net/Articles/808057/ quotemstr <div class="FormattedComment"> CLONE_STOPPED should just be the same thing as an atomic process create and PTRACE_SEIZE. There are no signal issues that don't already exist.<br> </div> Mon, 23 Dec 2019 22:59:53 +0000 Cloning into a control group https://lwn.net/Articles/808043/ https://lwn.net/Articles/808043/ dskoll <p>Oh also, if the size is supplied, then it's easy for the kernel to figure out how much needs to be copied from userspace. Otherwise, it'd have to copy a little bit, read the version, and then copy the rest.
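<p>For readers following the size-as-version discussion in this thread, the three cases (equal sizes, newer userspace, newer kernel) can be modelled in plain userspace C. This is only a sketch under stated assumptions: <tt>copy_struct_model()</tt> is a hypothetical name, it works on ordinary buffers with memcpy() instead of copying from user memory, and it is not the kernel's copy_struct_from_user() code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace model of the size-as-version rules.
 * dst/ksize: the struct as the "kernel" knows it.
 * src/usize: the struct as the caller filled it in.
 * Returns 0 on success, -E2BIG if the caller set fields we do not know.
 */
static int copy_struct_model(void *dst, size_t ksize,
                             const void *src, size_t usize)
{
    size_t size = ksize < usize ? ksize : usize;          /* common prefix */
    size_t rest = (ksize > usize ? ksize : usize) - size; /* size difference */

    if (usize < ksize) {
        /* Older caller: fields it does not know about become zero (no-op). */
        memset((char *)dst + size, 0, rest);
    } else if (usize > ksize) {
        /* Newer caller: accept only if every unknown trailing byte is zero. */
        const unsigned char *tail = (const unsigned char *)src + size;
        for (size_t i = 0; i < rest; i++)
            if (tail[i] != 0)
                return -E2BIG;
    }
    memcpy(dst, src, size);
    return 0;
}
```

<p>The point of the design is that neither side needs a table of versions: the only contract is "append only, and zero means not requested".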
Mon, 23 Dec 2019 20:41:27 +0000 Cloning into a control group https://lwn.net/Articles/808042/ https://lwn.net/Articles/808042/ dskoll <p>My guess is that it eliminates a line of code. We use:</p> <p><tt>syscall(&amp;foo, sizeof(foo), ...);</tt> <p>Rather than: <p><tt>foo.version = FOO_VERSION;<br>syscall(&amp;foo, ...);</tt> <p>Also, I guess fields would only ever be added to the end of the structure (you'd never want to remove fields if you're relying on the size to tell you what's there), so the size is unambiguous. Mon, 23 Dec 2019 20:40:05 +0000 Cloning into a control group https://lwn.net/Articles/808041/ https://lwn.net/Articles/808041/ chfisher <div class="FormattedComment"> Perhaps it has been discussed elsewhere, but I have never understood why a syscall that has an associated structure does not have a version field at the beginning of that structure. This makes it easy to determine exactly the features/layout of the structure, without having to guess based on the size of the structure. <br> </div> Mon, 23 Dec 2019 19:31:19 +0000 Cloning into a control group https://lwn.net/Articles/807992/ https://lwn.net/Articles/807992/ jiiksteri <div class="FormattedComment"> <font class="QuotedText">&gt; It seems CLONE_STOPPED existed at one point, but it was deprecated and removed. I haven't been able to find the reason for that.</font><br> <p> I suppose you already found this: <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bdff746a3915f109bd13730b6847e33e17e91ed3">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...</a><br> <p> which says they were running out of clone() flag bits, and CLONE_STOPPED wasn't used for anything except NPTL debugging where it wasn't reliable either. So it was easy to remove.<br> <p> Can't seem to immediately find any discussion about that, but if I had to guess it's about signals.
It's always about signals :)<br> <p> </div> Mon, 23 Dec 2019 12:57:24 +0000 Cloning into a control group https://lwn.net/Articles/807989/ https://lwn.net/Articles/807989/ Cyberax <div class="FormattedComment"> A slightly better solution is to have a spawn() syscall that would exec a binary and create a suspended process. No need for the intermediate clone().<br> </div> Mon, 23 Dec 2019 07:43:42 +0000 Cloning into a control group https://lwn.net/Articles/807978/ https://lwn.net/Articles/807978/ felix.s It seems <code>CLONE_STOPPED</code> existed at one point, but it was deprecated and removed. I haven't been able to find the reason for that. Mon, 23 Dec 2019 06:33:27 +0000 Cloning into a control group https://lwn.net/Articles/807986/ https://lwn.net/Articles/807986/ sbaugh <div class="FormattedComment"> For whatever it's worth, I've developed that approach in <a href="https://github.com/catern/rsyscall">https://github.com/catern/rsyscall</a><br> </div> Sun, 22 Dec 2019 22:33:33 +0000 Cloning into a control group https://lwn.net/Articles/807983/ https://lwn.net/Articles/807983/ Cyberax <div class="FormattedComment"> Yes, you can do it. Another alternative is to do a fork() and then wait for some kind of "go ahead" signal in the forked child (sent by the parent when the CG setup is finished).<br> </div> Sun, 22 Dec 2019 21:40:19 +0000 Cloning into a control group https://lwn.net/Articles/807982/ https://lwn.net/Articles/807982/ nivedita76 <div class="FormattedComment"> clone used to have this functionality, but it was removed.<br> <p> CLONE_STOPPED (since Linux 2.6.0)<br> If CLONE_STOPPED is set, then the child is initially stopped (as though it was sent a SIGSTOP signal), and must be resumed by<br> sending it a SIGCONT signal.<br> <p> This flag was deprecated from Linux 2.6.25 onward, and was removed altogether in Linux 2.6.38. Since then, the kernel silently<br> ignores it without error. 
Starting with Linux 4.6, the same bit was reused for the CLONE_NEWCGROUP flag.<br> </div> Sun, 22 Dec 2019 19:46:05 +0000 Cloning into a control group https://lwn.net/Articles/807981/ https://lwn.net/Articles/807981/ Paf <div class="FormattedComment"> Hmm. Perhaps because - however briefly and however small an amount of resources this consumes - this process is hanging around not in the desired state. If you’re using control groups as a mechanism to control potentially malicious users, then any opportunity to consume uncharged resources is a potential problem.<br> <p> If you were to sort of try to extend this to solve the problem, I just don’t - in the end - see any real difference between “process is created with all these properties set” and “process lives in magic state while we set properties”, except that property setting is now spread out across a bunch of calls, with potential races or other issues you now have to think about.<br> <p> It doesn’t seem impossible to have this sort of API, though. In the end you’re trading off the complexity I highlighted above against increasingly complex clone() (or whatever) calls. My feeling is I like it less, though it really does seem to be a matter of taste.<br> <p> This current change is also definitely much *smaller* than that would be, which is a plus.<br> </div> Sun, 22 Dec 2019 19:36:00 +0000 Cloning into a control group https://lwn.net/Articles/807980/ https://lwn.net/Articles/807980/ Paf <div class="FormattedComment"> I believe the issue is likely that this process still - however briefly, and however small this wrapper is - is not in the correct cgroup, so at startup its resources are not tracked accordingly. 
So, exactly the problem this modification is intended to solve.<br> </div> Sun, 22 Dec 2019 19:30:28 +0000 Syscall https://lwn.net/Articles/807979/ https://lwn.net/Articles/807979/ stephen.pollei <div class="FormattedComment"> In theory you could do everything with one syscall that takes two arguments: a pointer and a size. It just has to point to an ASN.1 BER-encoded request ;-)<br> </div> Sun, 22 Dec 2019 19:30:14 +0000 Cloning into a control group https://lwn.net/Articles/807976/ https://lwn.net/Articles/807976/ NYKevin <p>This is probably a stupid question, but I would be gratified if anyone more knowledgeable than I could answer it. <p>Why can't you just do this in userspace? Write the following wrapper program:
<pre>
// #includes, feature test macros, etc. omitted for brevity.
int main(int argc, char *argv[]){
    if(argc &lt; 2){
        fprintf(stderr, "Usage: %s PATH [ARGS...]\n", argv[0]);
        // print more information, probably?
        return EXIT_FAILURE;
    }
    if(raise(SIGSTOP)){
        perror("raise");
        return EXIT_FAILURE;
    }
    // Parent process sets cgroup, then sends SIGCONT
    // argv[1] doubles as the path and as the new program's argv[0].
    execv(argv[1], &amp;argv[1]);
    perror("execv");
    return EXIT_FAILURE;
}
</pre>
<p>...then, you "just" tell the manager program to launch this wrapper instead of the original program. Sure, it's less convenient, but does the kernel really need a new feature to avoid writing an extra ~10 line C program? <p>Again, I'm sure someone else has already thought of this, so I'd appreciate it if someone could point out the problem that I am failing to see.
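<p>For completeness, the manager side of this stop-then-continue handshake might look roughly like the sketch below. It is an illustration under stated assumptions, not anyone's actual manager: <tt>launch_stopped()</tt> and the <tt>configure_fn</tt> hook are hypothetical names, the child stops itself directly after fork() rather than through a separate wrapper binary, and the hook stands in for whatever cgroup setup the manager performs. Note that the child still exists briefly outside the target cgroup, which is the gap raised elsewhere in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical hook: the manager would place `pid` into its cgroup here.
 * Returns 0 on success. */
typedef int (*configure_fn)(pid_t pid);

/* Start `path` stopped, let `configure` act on it, then resume it.
 * Returns the child's exit status, or -1 on any failure. */
static int launch_stopped(const char *path, char *const argv[],
                          configure_fn configure)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        raise(SIGSTOP);      /* park until the parent sends SIGCONT */
        execv(path, argv);
        _exit(127);          /* only reached if execv() failed */
    }

    /* WUNTRACED makes waitpid() report the stop, not just termination. */
    if (waitpid(pid, &status, WUNTRACED) < 0)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;           /* child terminated early and is already reaped */

    if ((configure && configure(pid) != 0) || kill(pid, SIGCONT) != 0) {
        kill(pid, SIGKILL);  /* do not leave a stopped child behind */
        waitpid(pid, &status, 0);
        return -1;
    }

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

<p>A caller would pass the real program path, its argument vector, and a hook that performs the cgroup move; passing NULL for the hook skips that step.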
Sun, 22 Dec 2019 18:56:29 +0000 Cloning into a control group https://lwn.net/Articles/807956/ https://lwn.net/Articles/807956/ Karellen <blockquote>That means, for example, that a process might run briefly before being placed into a group where its resource usage can be accounted for properly.</blockquote> <p>It seems like the more unixy solution to this problem would be to add a flag to <em>clone2(2)</em> to tell the kernel to create the process in a frozen state, so that other process-manipulating syscalls can suitably frob the process before it is explicitly started, rather than starting down the route of adding all the options (either via explicit parameters, or via pointers to structs containing them) to the <em><s>CreateProcess</s>clone()</em> family of calls. <p>I wonder if that option was considered, and if so why it was rejected? Sun, 22 Dec 2019 10:56:37 +0000 Cloning into a control group https://lwn.net/Articles/807948/ https://lwn.net/Articles/807948/ Cyberax <div class="FormattedComment"> Most Windows syscalls/ioctls also operate this way - the caller packs the information in one packet and sends it to the kernel. So the kernel doesn't need to directly read the userspace data at all.<br> </div> Sat, 21 Dec 2019 21:20:09 +0000 Cloning into a control group https://lwn.net/Articles/807941/ https://lwn.net/Articles/807941/ ibukanov <div class="FormattedComment"> See, for example, <a href="https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualqueryex">https://docs.microsoft.com/en-us/windows/win32/api/memory...</a><br> </div> Sat, 21 Dec 2019 17:11:53 +0000 Cloning into a control group https://lwn.net/Articles/807936/ https://lwn.net/Articles/807936/ cyphar <div class="FormattedComment"> Do you have a good example of such a Windows API (I looked for a while in [1] but couldn't find a good example)? 
It would be quite useful when arguing for this type of syscall design for newer syscalls.<br> <p> [1]: <a href="https://docs.microsoft.com/en-us/windows/win32/api/_base/index">https://docs.microsoft.com/en-us/windows/win32/api/_base/...</a><br> </div> Sat, 21 Dec 2019 14:08:11 +0000 Cloning into a control group https://lwn.net/Articles/807932/ https://lwn.net/Articles/807932/ ibukanov <div class="FormattedComment"> Windows has been using a structure plus its size for at least 25 years. Linux is finally catching up with this very straightforward way to support future syscall extensions while preserving binary compatibility.<br> <p> </div> Sat, 21 Dec 2019 13:02:09 +0000