Two approaches to kernel memory usage accounting
Given that, it should not be surprising that a patch set adding the ability to track and limit kernel memory use exists. What may be a little more surprising is the fact that two independent patch sets exist, each of which adds that feature in its own way. Both were posted for consideration in late February.
The first set was posted by Glauber Costa, the author of the related per-cgroup TCP buffer limits controller. Glauber's patch works at the slab allocator level; only the SLUB allocator is supported at this time. With this approach, developers must explicitly mark a slab cache for usage tracking with this interface:
```c
struct memcg_cache_struct {
    int index;
    struct kmem_cache *cache;
    int (*shrink_fn)(struct shrinker *shrink, struct shrink_control *sc);
    struct shrinker shrink;
};

void register_memcg_cache(struct memcg_cache_struct *cache);
```
Once a slab cache has been passed to register_memcg_cache(), it is essentially split into an array of parallel caches, one belonging to each control group managed by the memory controller. With some added infrastructure, each of these per-cgroup slab caches is able to track how much memory has been allocated from it; this information can be used to cause allocations to fail should the control group's limits be exceeded. More usefully, the controller can, when limits are exceeded, call the shrink_fn() associated with the cache; that function's job is to find memory to free, bringing the control group back below its limit.
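The accounting behavior described here can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not code from the patch set: the names `tracked_cache`, `percg_cache`, `tracked_alloc()` and `shrink_half()` are hypothetical, the number of groups is fixed, and the shrinker simply reports a number of bytes it claims to have freed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CGROUPS 4   /* hypothetical fixed number of control groups */

/* Userspace stand-in for one per-cgroup copy of a slab cache. */
struct percg_cache {
    size_t usage;   /* bytes currently charged to this group */
    size_t limit;   /* the group's kernel memory limit, in bytes */
};

/* Stand-in for a registered cache: one parallel cache per group. */
struct tracked_cache {
    size_t object_size;
    struct percg_cache groups[MAX_CGROUPS];
    /* Hypothetical reclaim hook; returns the number of bytes it freed. */
    size_t (*shrink_fn)(struct percg_cache *pc, size_t want);
};

/* Charge one object to group cg. Returns 0 on success, -1 on failure. */
static int tracked_alloc(struct tracked_cache *tc, int cg)
{
    struct percg_cache *pc;

    assert(cg >= 0 && cg < MAX_CGROUPS);
    pc = &tc->groups[cg];

    /* Over the limit: give the shrinker a chance to free memory first. */
    if (pc->usage + tc->object_size > pc->limit && tc->shrink_fn) {
        size_t over = pc->usage + tc->object_size - pc->limit;
        size_t freed = tc->shrink_fn(pc, over);

        pc->usage -= (freed < pc->usage) ? freed : pc->usage;
    }

    /* Still over the limit: the allocation fails for this group only. */
    if (pc->usage + tc->object_size > pc->limit)
        return -1;

    pc->usage += tc->object_size;
    return 0;
}

/* Example shrinker: pretends up to half of the charged bytes are reclaimable. */
static size_t shrink_half(struct percg_cache *pc, size_t want)
{
    size_t reclaimable = pc->usage / 2;

    return (want < reclaimable) ? want : reclaimable;
}
```

In this model, a group that hits its limit either fails the allocation or, if a shrinker is present, reclaims some of its own memory first; other groups' counters are never touched.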
Glauber's patch set includes a sample implementation for the dentry cache. When a control group creates enough dentries to run past its limits, the shrinker function can clean some of them up. That may slow down processes in the affected control group, but it should prevent a dentry-intensive process from affecting processes in other control groups.
The second patch set comes from Suleiman Souhlal. Here, too, the slab allocator is the focus point for memory allocation tracking, but this patch works with the "slab" allocator instead of SLUB. One other significant difference with Suleiman's patch is that it tracks allocations from all caches, rather than just those explicitly marked for such tracking. There is a new __GFP_NOACCOUNT flag to explicitly prevent tracking, but, as a whole, it's an opt-out system rather than opt-in. One might argue that, if tracking kernel memory usage is important, one should track all of it. But, as Suleiman acknowledges, the ability to track allocations from all caches "is also the main source of complexity in the patchset".
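The difference between the two policies can be reduced to a pair of predicates. The sketch below is illustrative only: `SKETCH_GFP_NOACCOUNT` is a made-up bit value standing in for the real flag (whose value is not given here), and `sketch_cache`, `accounted_opt_in()` and `accounted_opt_out()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit value; the real flag's value is not given in the article. */
#define SKETCH_GFP_NOACCOUNT 0x1u

struct sketch_cache {
    bool registered;  /* opt-in model: set by explicit registration */
};

/* Opt-in: only caches that were explicitly registered are accounted. */
static bool accounted_opt_in(const struct sketch_cache *c)
{
    return c->registered;
}

/* Opt-out: every allocation is accounted unless the caller passes the flag. */
static bool accounted_opt_out(unsigned int gfp_flags)
{
    return !(gfp_flags & SKETCH_GFP_NOACCOUNT);
}
```

Under the opt-in policy, an unmodified cache is never accounted; under the opt-out policy, an unmodified call site always is.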
Under this scheme, slab caches operate as usual until an allocation is made from a specific cache while under the control of a specific cgroup. At that point, the cache is automatically split into per-cgroup caches without the intervention (or knowledge) of the caller. Of course, this splitting requires taking locks and allocating memory - activities that can have inconvenient results if the system is running in an atomic context at the time. In such situations, the splitting of the cache will be pushed off into a workqueue while the immediate allocation is satisfied from the pre-split cache. Much of the complexity in Suleiman's patch set comes from this magic splitting that works regardless of the calling context.
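The deferred splitting can be modeled in userspace as a small state machine. This is a sketch under assumptions, not the patch's code: `lazy_cache`, `lazy_alloc()` and `run_deferred_splits()` are hypothetical names, the `atomic` argument stands in for the kernel's knowledge of the calling context, and a plain function call stands in for the workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CGROUPS 4   /* hypothetical fixed number of control groups */

struct lazy_cache {
    bool split_done[MAX_CGROUPS];     /* per-cgroup cache exists */
    bool split_pending[MAX_CGROUPS];  /* creation queued for later */
    size_t parent_allocs;             /* served from the pre-split cache */
    size_t child_allocs[MAX_CGROUPS]; /* served from a per-cgroup cache */
};

/* Allocate one object on behalf of group cg. */
static void lazy_alloc(struct lazy_cache *lc, int cg, bool atomic)
{
    assert(cg >= 0 && cg < MAX_CGROUPS);

    if (!lc->split_done[cg]) {
        if (atomic) {
            /*
             * Splitting needs locks and memory allocation, which are
             * unsafe here: queue the split, use the original cache.
             */
            lc->split_pending[cg] = true;
            lc->parent_allocs++;
            return;
        }
        /* Non-atomic context: safe to split immediately. */
        lc->split_done[cg] = true;
        lc->split_pending[cg] = false;
    }
    lc->child_allocs[cg]++;
}

/* Stand-in for the workqueue handler that completes deferred splits. */
static void run_deferred_splits(struct lazy_cache *lc)
{
    for (int cg = 0; cg < MAX_CGROUPS; cg++) {
        if (lc->split_pending[cg]) {
            lc->split_done[cg] = true;
            lc->split_pending[cg] = false;
        }
    }
}
```

The caller never sees the split; the only visible effect in this model is that a few early allocations made in atomic context are served from the shared, pre-split cache.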
There is no shrinker interface in this patch set, though that is clearly planned for the future.
When a control group is deleted, both implementations shift the accounting up to the parent group. That operation, too, can involve some complexity; the processes that performed the allocation may, like their control group, be gone when the allocations are finally freed. Glauber's patch does no tracking for the root control group; as a result of that decision (and some careful programming), the cost of the kernel memory tracking feature is almost zero if it is not actually being used. Suleiman's patch does track usage for the root cgroup, but that behavior can be disabled with a kernel configuration option.
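Shifting the accounting to the parent might look roughly like the following userspace sketch. It assumes, purely for illustration, that each group keeps its own counter that does not already include its children's usage; `acct_group`, `reparent_on_delete()` and `uncharge()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct acct_group {
    struct acct_group *parent;  /* NULL for the topmost group */
    size_t kmem_usage;          /* bytes charged to this group alone */
};

/* On deletion, move outstanding charges to the parent; return the new owner. */
static struct acct_group *reparent_on_delete(struct acct_group *g)
{
    struct acct_group *p = g->parent;

    assert(p != NULL);  /* the topmost group is never deleted in this sketch */
    p->kmem_usage += g->kmem_usage;
    g->kmem_usage = 0;
    return p;
}

/* A later free is uncharged from whichever group now owns the object. */
static void uncharge(struct acct_group *owner, size_t bytes)
{
    assert(owner->kmem_usage >= bytes);
    owner->kmem_usage -= bytes;
}
```

The point of the model is that objects outliving their group still have somewhere to be uncharged from when they are eventually freed.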
Neither patch appears to be ready for merging into the mainline prior to
the 3.5 development cycle - and, probably, not even then. There are a lot
of details to be worked out, the mechanism needs to work with both slab and
SLUB (at least), and, somehow, the two patch sets need to turn into a
single solution. The two developers are talking to each other and express
interest in working together, but there will almost certainly need to be
guidance from others before the two patches can be combined. If users of
this feature feel that tracking allocations from all slab caches is
important, then, clearly, whatever is merged will need to have that
feature. If, instead, picking a few large users is sufficient, then a
solution requiring the explicit marking of caches to be tracked will do.
Thus far, there has not been a whole lot of input from people other than
the two developers; until that happens, it will be hard to know which
approach will win out in the end.
