
Resource limits in user namespaces

By Jonathan Corbet
January 18, 2021
User namespaces provide a number of interesting challenges for the kernel. They give a user the illusion of owning the system, but must still operate within the restrictions that apply outside of the namespace. Resource limits represent one type of restriction that, it seems, is proving too restrictive for some users. This patch set from Alexey Gladkov attempts to address the problem by way of a not-entirely-obvious approach.

Consider the following use case, as stated in the patch series. Some user wants to run a service that is known not to fork within a container. As a way of constraining that service, the user sets the resource limit for the number of processes to one, explicitly preventing the process from forking. That limit is global, though, so if this user tries to run two containers with that service, the second one will exceed the limit and fail to start. As a result, our user becomes depressed and considers a career change to goat farming.
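As a rough illustration of the setup described above, a process can lower its own process limit with the standard getrlimit() and setrlimit() calls. The helper name set_soft_nproc() is hypothetical. This sketch also assumes that only the soft limit is changed so that it can be raised again later, whereas a service that really wants to forbid forking would presumably lower the hard limit as well.

```c
#include <sys/resource.h>

/* Hypothetical helper: lower only the soft RLIMIT_NPROC of the calling
 * process to `n`, leaving the hard limit untouched so that the soft
 * limit can be raised again later.
 * Returns 0 on success, -1 on error (errno is set by the libc call). */
int set_soft_nproc(rlim_t n)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NPROC, &rl) != 0)
        return -1;
    rl.rlim_cur = n;
    return setrlimit(RLIMIT_NPROC, &rl);
}
```

Calling set_soft_nproc(1) before exec-ing the service gives the configuration from the use case. The call itself succeeds regardless of how many processes the user already has; the limit only comes into play at the point where the kernel compares it against the count.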

Clearly, what is needed is a way to make at least some resource limits apply on a per-container basis; then each container could run its service with the process limit set to one and everybody will be happy (except perhaps the goats). One could readily imagine a couple of ways to do this:

  • Turn the resource limits that apply globally (many are per-process now) into limits that can also be set within a user namespace. The global limit would still apply, but lower limits could be set within a namespace to get the desired effect.
  • Create a new control-group controller to manage resource limits in a hierarchical manner. This kind of control, after all, is just what control groups were created for.

Gladkov's patch set, though, takes neither of those approaches. Instead, this patch set moves a number of global resource-usage counters (processes, pending signals, pages locked in memory, bytes in message queues) into the ucounts structure associated with user namespaces. That makes the tracking of the use of these resources specific to each namespace.

User namespaces are arranged hierarchically up to the "initial namespace" at the root, and there is a ucounts structure allocated for each. The resource-usage counts are managed all the way up the hierarchy. So, if a process creates a new process within a user namespace, the process count in that namespace will be incremented, but so will the counts in any higher-level namespaces. The resource limit (which is still global) is checked at every level in the hierarchy; exceeding the limit at any level is cause to block an operation.

If one is slow and undercaffeinated like your editor, one might wonder how this is supposed to work. Yes, each user namespace will now have its own count for resources like processes. If the global limit is set to one, each user namespace can contain one process without exceeding the limit at that level. But the counts propagate upward; if both namespaces have a common parent, then the limit will be exceeded at that level and our user is left no happier than before.

A look at the test code provided with the patch set gives an answer. In the test program, the "server" processes are created by root before changing user and group IDs and moving into a separate user namespace. The parent namespace thus belongs to root and is not subject to any resource limits set after the user-ID change. It all works as long as one's use case matches this pattern.
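The upward propagation of counts, and the way a root-owned parent sidesteps the limit, can be modeled in a few lines of user-space C. This is only a toy sketch: struct toy_ns, toy_charge(), and the exempt flag are hypothetical names, not taken from the patch set, and the flag is a simplification standing in for "this level is not subject to the limit being checked".

```c
#include <stddef.h>

/* Toy stand-in for one level of the user-namespace hierarchy.
 * All names here are hypothetical, not the kernel's. */
struct toy_ns {
    struct toy_ns *parent;   /* NULL for the initial namespace */
    long nproc;              /* processes charged at this level */
    int exempt;              /* nonzero: level is not subject to the limit */
};

/* Charge one process to `ns` and to every ancestor, checking `limit`
 * at each non-exempt level. Returns 0 on success; on failure returns
 * -1 after undoing the increments made so far. */
int toy_charge(struct toy_ns *ns, long limit)
{
    struct toy_ns *cur, *undo;

    for (cur = ns; cur != NULL; cur = cur->parent) {
        if (!cur->exempt && cur->nproc + 1 > limit)
            goto fail;
        cur->nproc++;
    }
    return 0;

fail:
    /* Roll back the levels that were already incremented. */
    for (undo = ns; undo != cur; undo = undo->parent)
        undo->nproc--;
    return -1;
}
```

With two sibling namespaces under a common, non-exempt parent and a limit of one, the first charge succeeds and the second fails at the parent level, which is the puzzle described above. Marking the parent exempt, as in the root-owned case, lets both succeed.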

Still, one might wonder why the other approaches weren't taken. Having the limits propagate downward (rather than counts propagating upward) would seem to address this problem as well in a more flexible way that doesn't require root privileges. In fact, Linus Torvalds asked why this approach wasn't taken in response to a previous version of the patch set. Eric Biederman answered that the limit approach "needs to work as well", but then reiterated the use case without really clarifying why the count-based approach is needed.

Using control groups for this purpose was discussed back in 2015. At that time, control-group maintainer Tejun Heo rejected the idea, calling it "pretty silly". He continued:

In general, I'm pretty strongly against adding controllers for things which aren't fundamental resources in the system. What's next? Open files? Pipe buffer? Number of flocks? Number of session leaders or program groups?

If you want to prevent a certain class of jobs from exhausting a given resource, protecting that resource is the obvious thing to do.

That particular conversation went fairly badly downhill from there, but this specific outcome has stood over time: control-group controllers are not used for control of resource limits within containers.

For users who are facing this problem now, the only apparent solution is Gladkov's patch set. Whether these patches are merged will, however, depend on whether the rest of the kernel community thinks that this approach is the correct one. That conversation has not yet happened, and may depend on a clearer description of the semantics of this change (and its motivation) being posted first. Resource limits within containers are a problem that has remained unsolved for years; it may take longer yet to get to the real solution.

Update: as explained in the comments, resource limits are already per-process, so nothing has to be done on that side to make them adjustable on a per-container basis. The counts used to enforce those limits, though, are global, causing the sort of interference described above. So the proposed solution — making the counts local while still aggregating them upward — appears to make sense.
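The distinction between a per-process limit and a shared count can be made concrete with a small user-space model. The structures and the toy_may_fork() helper below are hypothetical simplifications, not the kernel's definitions; they only assume that the check compares a count shared by every task of a given user against the limit held by the individual task.

```c
/* Toy model: hypothetical, simplified stand-ins for kernel structures. */
struct toy_user {            /* shared by every task with the same user */
    long processes;          /* the count */
};

struct toy_task {
    struct toy_user *user;   /* where the count lives */
    long rlimit_nproc;       /* the limit, private to the task */
};

/* Nonzero if `t` may create one more process: the shared count is
 * compared against the task's own limit. */
int toy_may_fork(const struct toy_task *t)
{
    return t->user->processes < t->rlimit_nproc;
}
```

If one container's service has already pushed the shared count to one, a task in a second container with its limit set to one fails this check even though its own container is empty. Giving that task a count of its own, which is roughly what moving the counters into a per-namespace structure achieves, makes the same check pass.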

Index entries for this article
Kernel: Namespaces/User namespaces
Kernel: Resource limits



Resource limits in user namespaces

Posted Jan 18, 2021 20:31 UTC (Mon) by nickodell (subscriber, #125165) [Link] (1 responses)

ulimit supports per-user process limits, right? Is there some reason why you couldn't create a user namespace, and set two process limits on two users within that namespace?

Resource limits in user namespaces

Posted Jan 19, 2021 6:59 UTC (Tue) by cyphar (subscriber, #110703) [Link]

Yes, ulimits are per-user but two processes with the same kuid in two different user namespaces (in other words, they map to the same underlying user) will have the same limit because the limit is enforced per-kuid (it's not linked to the user namespace you're in). This is a problem because some container runtimes reuse the same mapping for different containers, causing resource exhaustion between containers (and isolated containers a-la LXD have their own issues -- namely a fair number of programs expect to be able to create users with very large uids).

Resource limits in user namespaces

Posted Jan 18, 2021 21:11 UTC (Mon) by johannbg (guest, #65743) [Link] (6 responses)

Let's all take a deep breath and consider a career change to goat farming. What a simple life it would be...

Resource limits in user namespaces

Posted Jan 19, 2021 3:11 UTC (Tue) by gus3 (guest, #61103) [Link]

Until the goat smacks your bum with its forehead and knocks you into the watering trough.

Resource limits in user namespaces

Posted Jan 19, 2021 3:42 UTC (Tue) by atai (subscriber, #10977) [Link] (3 responses)

managing goats--a big challenge

Resource limits in user namespaces

Posted Jan 19, 2021 12:57 UTC (Tue) by k3ninho (subscriber, #50375) [Link] (2 responses)

I have considerable experience* herding cats. How hard can it be to adapt to goats?

*: if not success

K3n. :-D

Resource limits in user namespaces

Posted Jan 19, 2021 13:35 UTC (Tue) by pizza (subscriber, #46) [Link] (1 responses)

> I have considerable experience* herding cats. How hard can it be to adapt to goats?

That depends; do goats like Tuna?

Resource limits in user namespaces

Posted Jan 24, 2021 18:55 UTC (Sun) by gutschke (subscriber, #27910) [Link]

Goats eat pretty much everything else. I don't see why they wouldn't eat tuna as well.

Resource limits in user namespaces

Posted Jan 22, 2021 17:05 UTC (Fri) by nix (subscriber, #2304) [Link]

There is at least one sheep farmer working on Linux stuff. It is mostly visible to the rest of us in increased annoyance around lambing time :)

Resource limits in user namespaces

Posted Jan 19, 2021 4:33 UTC (Tue) by ebiederm (subscriber, #35028) [Link] (12 responses)

What was unclear about my reply?

If today the situation is that you set RLIMIT_NPROC == 1 and your service does not start, even though it has no processes in your user namespace, how can you possibly fix that without changing how the count works? AKA by making a per-user, per-user_namespace count?

Resource limits in user namespaces

Posted Jan 19, 2021 9:39 UTC (Tue) by izbyshev (subscriber, #107996) [Link]

Yes, I wondered about that too while reading the article. Clearly, making counts per-user-namespace is a prerequisite for making resource limits per-user-namespace, so I don't understand why the article described Gladkov's patchset as something orthogonal.

Counts v. limits

Posted Jan 19, 2021 15:09 UTC (Tue) by corbet (editor, #1) [Link] (4 responses)

Yes, I guess I see why you need to change the count infrastructure. Where my confusion comes in is why the limits aren't made per-user-namespace as well. It seems that would create far more straightforward semantics and the possibility for control without root involvement.

Counts v. limits

Posted Jan 19, 2021 21:33 UTC (Tue) by ebiederm (subscriber, #35028) [Link] (3 responses)

We are talking rlimits so the limits are fundamentally per-process.

What moving to ucounts does is it captures the per-process limit value at the
time of user namespace creation. Then when the counts are updated the
outer limit is checked, along with the per-process rlimit counts.

There is no need here for any root involvement.

``root'' can get involved if you want to modify the limits that were captured at user namespace creation. Those limits should be exposed as sysctls. In most cases
it should be safe to ignore them.

Counts v. limits

Posted Jan 19, 2021 21:56 UTC (Tue) by corbet (editor, #1) [Link] (2 responses)

Um....I thought the whole point of this exercise was that some limits are not per-process...? That's why a process in one container prevents the creation of a process in another? How can NPROC be per-process?

I'm clearly missing something here.

Counts v. limits

Posted Jan 19, 2021 22:06 UTC (Tue) by ebiederm (subscriber, #35028) [Link] (1 responses)

The counts are not per process. The limits are per process.

That is starting with struct task_struct *task. The counts
for the problematic rlimits live in:

task->cred->user->{processes, sigpending, mq_bytes, locked_vm}

The limits for the problematic rlimits live in:

task->signal->rlim[RLIMIT_NNNN];

Counts v. limits

Posted Jan 19, 2021 22:29 UTC (Tue) by corbet (editor, #1) [Link]

Ah right, that's the part I wasn't fully on top of. I've stuck an addendum onto the article.

Resource limits in user namespaces

Posted Jan 19, 2021 18:07 UTC (Tue) by nivedita76 (guest, #121790) [Link] (5 responses)

I'm a little confused by the problem -- it seems to me that it is rather easy to work around for RLIMIT_NPROC == 1 case.

i.e. instead of launching the service by doing setrlimit() as root, then fork(), setuid(), execve(); can't you do fork(), setuid(), setrlimit(), execve()? This should be fine for the "prevent fork()" situation, no?

Resource limits in user namespaces

Posted Jan 19, 2021 18:53 UTC (Tue) by nivedita76 (guest, #121790) [Link] (3 responses)

Hm I also don't get the fix. If the counts are hierarchical, why doesn't the setuid() call for the second run fail because there is already one process running for that user in the root namespace?

Resource limits in user namespaces

Posted Jan 19, 2021 21:27 UTC (Tue) by ebiederm (subscriber, #35028) [Link] (2 responses)

RLIMIT_NPROC is some large number in the parent namespace. So the limit check on the parent namespace passes. Only in the 2 containers is RLIMIT_NPROC == 1.

Resource limits in user namespaces

Posted Jan 20, 2021 22:39 UTC (Wed) by nivedita76 (guest, #121790) [Link] (1 responses)

I'm confused. Isn't there only one RLIMIT_NPROC for a given process? i.e. I thought that the limits are per-process, and the counts, which used to be per-user are changing to per-user/per-namespace?

Resource limits in user namespaces

Posted Jan 20, 2021 22:51 UTC (Wed) by nivedita76 (guest, #121790) [Link]

Agh, this article is very confusing. This patch does make the limits be per-namespace too, not just the counts?

Resource limits in user namespaces

Posted Jan 19, 2021 21:24 UTC (Tue) by ebiederm (subscriber, #35028) [Link]

The problem statement for containers is make existing code work. So adding work-arounds for existing code is not a serious option.

Furthermore, the example of a service setting RLIMIT_NPROC==1, while real,
is just a motivating example. It shows how a process/container can legitimately tighten its rlimits in a useful way, as well as being a case where it is easy to see why it fails when that happens.

Resource limits in user namespaces

Posted Jan 19, 2021 19:00 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

> In general, I'm pretty strongly against adding controllers for things which aren't fundamental resources in the system. What's next? Open files? Pipe buffer? Number of flocks? Number of session leaders or program groups?
Why not ALL of these?

This would at least unify all the various resource limits that currently exist in a kind of weird fashion.


Copyright © 2021, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds