The 2012 Kernel Summit memcg/mm minisummit
started with a session titled "Accounting other than user pages", but the
slot was mainly dedicated to the topic of improved kernel-memory accounting
for memory cgroups. Glauber Costa has been working on kernel-memory
accounting for some time. He began with a basic overview of the
motivations for memory cgroups and how they can be used to limit memory
usage of a group of processes within a container by accounting for all the
pages used by user space. Glauber would like to see better accounting of
certain workloads that have low user-memory usage but high kernel-memory
usage (see this LWN article for some
background). The problem is that, in the current kernel implementation, a
single memory control group can occupy a lot of kernel memory without being
held responsible for it. This can produce global memory pressure that
pushes other cgroups to swap or even triggers the OOM killer. The goal is
thus to better track usage of kernel memory, in order to prevent memory
cgroups from causing these kinds of problems.
Glauber mentioned earlier work he has
done to track kernel memory used for TCP data. He is now turning his focus
to other kinds of kernel memory that can possibly "explode". His overall
goal is to allow the use of OpenVZ with upstream kernels, rather than
patched kernels. A use case of particular interest is hosting providers
that use OpenVZ-based containers and heavily overcommit resources. In such
scenarios, the kernel can't trust containers to cooperate with the rest of
the system in order to keep the system alive. Thus, accurate accounting of
resource usage within containers is very important.
Glauber would like to account for general kernel memory usage,
including usage of slab objects and kernel stack pages. To begin with, he
is trying to merge support for tracking per-task kernel stack usage. (One
reason for tracking kernel stack usage is that it could provide a mechanism
to mitigate fork bombs
and cap their effect inside a cgroup.) The rationale for starting out by
tracking kernel stack usage is that it would be a reasonably non-invasive
change (unlike tracking slab usage), but would still demonstrate all the
memory cgroups changes that would be required for later, larger
memory-accounting changes; this smaller piece of work would also serve to
show what the user interface (i.e., the configuration files in the cgroup
directories) would look like.
Glauber went on to discuss the additional changes necessary to track
usage of slab
allocations. His description was quite detailed, but, very broadly
speaking, slab pages allocated for a cgroup would be for the exclusive use
of that cgroup. In essence, every cgroup would get its own slab cache for a
certain object type so that pages allocated to a cache are always
exclusively used by one memory cgroup. Slab metadata would be copied over
to a cgroup-specific slab. Thus, for example, if a dentry was touched, this
would lazily trigger the creation of a per-cgroup dentry cache for that
cgroup. From that point on, all dentries allocated by that cgroup would
come not from the global dentry cache, but rather from a cgroup-specific
cache. A side effect is that each slab page is filled with objects from a
single cgroup. The intention of the proposed changes is to prevent unrelated
cgroups from pinning each other's memory; that is, to prevent the situation
where one cgroup, B, forces an unrelated cgroup, A, to
keep memory resident that A is being charged for. For example, if
cgroups A and B have both allocated objects in a slab
page they share, and A is forced to shrink due to hitting its
limits, then it might try freeing all of its slab objects, but still fail
to free the slab page it is being charged for.
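The pinning problem can be illustrated with a small user-space model. This is only a sketch, not kernel code: the function name, the tiny `OBJECTS_PER_PAGE` value, and the strictly sequential page-filling policy are all assumptions made for the example.

```python
from collections import defaultdict

OBJECTS_PER_PAGE = 4  # assumed toy value; real slab pages hold many more objects


def stuck_pages(allocs, shrinking_cgroup, per_cgroup_caches):
    """Count pages the shrinking cgroup cannot release after freeing its objects.

    allocs: sequence of cgroup names, one entry per allocated object, in order.
    per_cgroup_caches: if True, each cgroup fills pages from its own cache;
                       if False, all cgroups fill pages from one global cache.
    """
    pages = defaultdict(list)  # page id -> owning cgroup of each object on it
    fill = defaultdict(int)    # cache name -> objects placed so far
    for owner in allocs:
        cache = owner if per_cgroup_caches else "global"
        page_id = (cache, fill[cache] // OBJECTS_PER_PAGE)
        pages[page_id].append(owner)
        fill[cache] += 1

    # A page can be returned only when no objects remain on it, so a page
    # that held the shrinking cgroup's objects stays resident as long as
    # any other cgroup still has an object there.
    stuck = 0
    for owners in pages.values():
        had_own = shrinking_cgroup in owners
        still_used = any(o != shrinking_cgroup for o in owners)
        if had_own and still_used:
            stuck += 1
    return stuck
```

With interleaved allocations such as `["A", "B"] * 4`, the shared-cache model leaves A unable to release either of the two pages it touched, while the per-cgroup model leaves none stuck.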
Someone asked specifically about page-table pages, noting that it's
easy to create an adverse workload that allocates many page tables while
using very little user memory. Glauber felt that it would be easy to track
these pages using a memory-cgroup-specific GFP (Get Free Page) flag.
Glauber's current proposal does selective accounting of kernel memory
allocation: only explicitly annotated allocations are tracked. Peter
Zijlstra asked: why not account for every kernel memory allocation by
default? Glauber pointed out that there would be a performance cost with
this, and he wants to avoid a situation where the cost of cgroups is very
high. In some respects, charging cgroups by default would be easier to
deal with, but it would be too slow to be useful.
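The idea of selective accounting can be sketched as follows. This is a toy model rather than the proposed kernel interface: the flag names and values are hypothetical, since the session report does not name the actual GFP flag.

```python
GFP_KERNEL_MODEL = 0x1   # hypothetical flag value, for this model only
GFP_ACCOUNT_MODEL = 0x2  # hypothetical "charge this to the cgroup" annotation


class Cgroup:
    def __init__(self):
        self.kmem_usage = 0    # bytes charged to this cgroup
        self.charge_calls = 0  # how often the accounting path was taken


def alloc(cgroup, nbytes, flags):
    """Model an allocation; charge the cgroup only if the call site opted in."""
    if flags & GFP_ACCOUNT_MODEL:
        cgroup.kmem_usage += nbytes
        cgroup.charge_calls += 1
    return bytearray(nbytes)
```

Only call sites that pass the annotation pay the accounting cost; every other allocation skips the charge path entirely, which is the performance argument Glauber made.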
Some participants in the room asked that the group re-discuss
kernel-plus-user memory accounting. This was discussed at the April 2012
LSF/MM meeting, but it was apparent that there was still some uncertainty
as to whether it was the correct approach.
Roughly, the details are as follows. There is a page counter that
tracks how many pages have been allocated for exclusive use by the kernel
(the kernel page counter, memory.kmem.usage_in_bytes) and a
counter that counts both kernel and user page allocations (the kernel+user
page counter, memory.usage_in_bytes). In conjunction with two
corresponding limits that can be set by the system administrator, this
implementation enables three different use cases:
- memory.kmem.limit_in_bytes == unlimited: This
configuration says that the user is not interested in kernel-memory
accounting at all, so only user memory is limited and accounted. This is
the default, and it is what the memory controller has provided until now.
- memory.kmem.limit_in_bytes < memory.limit_in_bytes:
This configuration enables fine-grained control over how much
kernel memory can be allocated. Depending on the amount of user
memory in use, the kmem limit is likely to be hit first, at which point
kernel allocations would fail. Users of this setting should be aware that the
amount of kernel memory used by a given workload can differ between kernel releases.
- memory.kmem.limit_in_bytes >= memory.limit_in_bytes:
This configuration allows capping both the kernel and the user memory
without any details about how much memory is used for each of them. The
user is just interested in the sum of both.
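The interaction between the two counters and the two limits can be sketched with a small model. This is a simplification under stated assumptions: the class and method names are invented for the example, `None` stands in for "unlimited", and a charge that would exceed a limit simply fails, whereas a real kernel would first attempt reclaim.

```python
class MemcgModel:
    def __init__(self, limit, kmem_limit=None):
        self.limit = limit            # models memory.limit_in_bytes
        self.kmem_limit = kmem_limit  # models memory.kmem.limit_in_bytes
        self.usage = 0                # models memory.usage_in_bytes (kernel + user)
        self.kmem_usage = 0           # models memory.kmem.usage_in_bytes

    def mode(self):
        """Name which of the three use cases this configuration selects."""
        if self.kmem_limit is None:
            return "user-only"
        if self.kmem_limit < self.limit:
            return "kmem-capped"
        return "combined"

    def charge_user(self, nbytes):
        if self.usage + nbytes > self.limit:
            return False
        self.usage += nbytes
        return True

    def charge_kernel(self, nbytes):
        if self.kmem_limit is None:
            return True  # kernel memory is neither accounted nor limited
        if self.kmem_usage + nbytes > self.kmem_limit:
            return False
        if self.usage + nbytes > self.limit:
            return False
        # A kernel charge is counted in both counters.
        self.kmem_usage += nbytes
        self.usage += nbytes
        return True
```

In the "combined" case the kmem limit can never be reached before the overall limit, so only the sum of kernel and user memory matters; in the "kmem-capped" case kernel charges can fail while user memory still has headroom.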
In Glauber's opinion, having the two integrated counters as described
above is simpler than having separate exclusive user and kernel counters.
However, Michal Hocko pointed out that this makes OOM decisions difficult
(i.e., the kernel may make wrong decisions, such that a program that
is not using a lot of memory can be killed because of heavy kernel memory
usage). Glauber defended his approach on the basis that the current
OOM-killing decisions have the same problem, and if his approach makes
these problems easier to trigger, then they are more likely to be found and
fixed.
There were no fundamental objections to Glauber's patch series adding support for accounting of
kernel stack pages. However, three things were requested:
- Documentation of the use cases and how the counters are to be
interpreted by the system administrator. It will be up to other interested
users to object if their use cases are not addressed.
- Incorporation of feedback on the patches, including issues with
naming and minor inconsistencies in the API. The patch series still needs
"a bit of love".
- A demonstration of the behavior in a NUMA environment in order to
determine whether the implementation has problems with CPU cache-line
bouncing. If a single counter is shared by an entire cgroup and processes
in that cgroup run on multiple nodes, then the cache line holding the counter has to "bounce" between them on
each write to keep the data coherent. It would be desirable for Glauber to test
in advance how kernel performance is affected when processes are running on
multiple nodes. Potentially, every kernel allocation updates a counter
shared between NUMA nodes, and this would scale very badly. It's worth
noting that NUMA is not the key factor here, but checking the behavior in a
NUMA environment is a good test because NUMA makes the effects of
cache-line bouncing much more visible. (Glauber has subsequently posted a round of benchmarks.)
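The scaling concern in the last item can be made concrete by counting writes to shared state. The sketch below is a user-space model only; it cannot measure real cache behavior. The batched variant is one common mitigation and is an assumption of this example, not something the session is reported to have proposed; all class names are invented.

```python
class SharedCounter:
    """Every update writes the shared value (one potential cache-line bounce)."""

    def __init__(self):
        self.value = 0
        self.shared_writes = 0

    def add(self, cpu, nbytes):
        self.value += nbytes
        self.shared_writes += 1


class BatchedCounter:
    """Each CPU accumulates locally and flushes to the shared value in batches."""

    def __init__(self, ncpus, batch):
        self.value = 0
        self.shared_writes = 0
        self.batch = batch
        self.local = [0] * ncpus  # per-CPU deltas not yet flushed

    def add(self, cpu, nbytes):
        self.local[cpu] += nbytes
        if abs(self.local[cpu]) >= self.batch:
            self.value += self.local[cpu]
            self.local[cpu] = 0
            self.shared_writes += 1

    def exact(self):
        # Exact total requires visiting every CPU's local delta.
        return self.value + sum(self.local)
```

The batched counter touches shared state far less often, but its shared value lags the true total by up to one batch per CPU, which matters when the counter is used to enforce a limit.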