Controlling memory use in containers
An important part of a complete container implementation is resource control; it is hard to maintain the fiction of a separate machine for each container if one of those containers is hogging the entire system. Extensive resource management patches have received a chilly reception in the past, but a properly done implementation based on the process containers framework might just make it in. The CFS group scheduling patch can be seen as one type of container-based resource management. But there is far more than just the CPU to worry about.
One of the most contended resources on many systems is core memory. A container which grows without bound and forces other containers out to swap will lead to widespread grumbling on the linux-kernel list. In an effort to avoid this unfortunate occurrence, Balbir Singh and Pavel Emelianov have been working on a container-based memory controller implementation. This patch is now in its fourth iteration.
The patch starts with a simple "resource counter" abstraction which is meant to be used with containers. It will work with any resource which can be described with simple, integer values for the maximum allowed and current usage. Methods are provided to enable hooking these counters into container objects and allowing them to be queried and set from user space.
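The description suggests an interface along the following lines. This is only a sketch based on the article's wording; the names (struct res_counter, res_counter_charge(), and so on) and the locking details are assumptions, not the patch's actual API.

    #include <linux/spinlock.h>
    #include <linux/errno.h>

    /*
     * Minimal sketch of the "resource counter" abstraction described
     * above: an integer limit, an integer usage count, and a lock.
     * Names and details are illustrative, not taken from the patch.
     */
    struct res_counter {
            unsigned long usage;    /* current consumption of the resource */
            unsigned long limit;    /* maximum allowed consumption */
            spinlock_t lock;        /* protects usage and limit */
    };

    /* Charge 'nr' units against the counter; fail if the limit would be exceeded. */
    static int res_counter_charge(struct res_counter *cnt, unsigned long nr)
    {
            int ret = 0;

            spin_lock(&cnt->lock);
            if (cnt->usage + nr > cnt->limit)
                    ret = -ENOMEM;
            else
                    cnt->usage += nr;
            spin_unlock(&cnt->lock);
            return ret;
    }

    /* Give 'nr' units back when the resource is released. */
    static void res_counter_uncharge(struct res_counter *cnt, unsigned long nr)
    {
            spin_lock(&cnt->lock);
            cnt->usage -= nr;
            spin_unlock(&cnt->lock);
    }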
These counters are pressed into service to monitor the memory use by each container. Memory use can be thought of as the current resident set: the number of resident pages which processes within the container have mapped into their virtual address spaces. Unlike some previous patches, though, the current memory controller also tries to track page cache usage. So a program which is very small, but which brings a lot of data from the filesystem into the page cache, will be seen as using a lot of memory.
To track per-container page usage, the memory controller must hook into the low-level page cache and reclaim code. It must also have a place to store information about which container each page is charged to. To that end, a new structure is created:
    struct meta_page {
            struct list_head lru;           /* per-container LRU list */
            struct page *page;
            struct mem_container *mem_container;
            atomic_t ref_cnt;
    };
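To see how these pieces fit together, here is a rough sketch of what the charging path might look like when a page is faulted in or added to the page cache. Everything beyond the structure shown above - the field names cont->counter, cont->lru_list, and page->meta_page, and the function itself - is an assumption made for illustration, building on the res_counter sketch earlier; it is not code from the patch.

    /*
     * Hypothetical charging path: account a newly-added page to its
     * container, allocate the meta_page, and put it on the container's
     * LRU list.  Field and helper names are assumptions; locking of the
     * container's LRU list is omitted for brevity.
     */
    static int mem_container_charge(struct page *page, struct mem_container *cont)
    {
            struct meta_page *mp;

            /* One page's worth of charge; on failure the caller must reclaim. */
            if (res_counter_charge(&cont->counter, 1))
                    return -ENOMEM;

            mp = kmalloc(sizeof(*mp), GFP_KERNEL);
            if (!mp) {
                    res_counter_uncharge(&cont->counter, 1);
                    return -ENOMEM;
            }
            mp->page = page;
            mp->mem_container = cont;
            atomic_set(&mp->ref_cnt, 1);
            page->meta_page = mp;   /* back pointer in struct page, described below */
            list_add_tail(&mp->lru, &cont->lru_list);  /* container-specific LRU */
            return 0;
    }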
Global memory management is handled by way of two least-recently-used (LRU) lists, the hope being that the pages which have been unused for the longest are the safest ones to evict when memory gets tight. Once containers are added to the mix, though, global management is not enough. So the meta_page structure allows each page to be put onto a separate, container-specific LRU list. When a process within a container brings in a page and causes the container to bump up against its memory limit, the kernel must, if it is to enforce that limit, push some of the container's other pages out. When that situation arises, the container-specific LRU list is traversed to find reclaimable pages belonging to the container without having to pass over unrelated memory.
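In rough terms, the per-container reclaim pass described here might look something like the following. Again this is only a sketch; container_under_limit(), try_to_reclaim_page(), and mem_container_uncharge() are hypothetical stand-ins for the real reclaim machinery.

    /*
     * Hypothetical per-container reclaim: walk the container's own LRU
     * list of meta_page entries and push pages out until usage drops
     * back under the limit.  Locking and the actual page-reclaim code
     * are omitted; the helpers are stand-ins.
     */
    static void mem_container_reclaim(struct mem_container *cont)
    {
            struct meta_page *mp, *next;

            list_for_each_entry_safe(mp, next, &cont->lru_list, lru) {
                    if (container_under_limit(cont))
                            break;
                    /* Unmap and free the page if possible, then uncharge it. */
                    if (try_to_reclaim_page(mp->page))
                            mem_container_uncharge(cont, mp);
            }
    }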
The page structure in the global memory map gains a pointer to the associated meta_page structure. There is also a new page flag allocated for locking that structure. There is no meta_page structure for kernel-specific pages, but one is created for every user-space or page cache page - even for processes which have not explicitly been assigned to a container (those processes are implicitly placed in a default, global container). There is, thus, a significant cost associated with the memory controller - the addition of five pointers (one in struct page, four in struct meta_page) and one atomic_t for every active page in the system can only hurt.
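As a rough back-of-the-envelope figure (assuming 64-bit pointers, a four-byte atomic_t, and 4096-byte pages - none of which the patch dictates), that overhead works out to:

       8 bytes   pointer added to struct page
    + 16 bytes   struct list_head lru
    +  8 bytes   struct page *page
    +  8 bytes   struct mem_container *mem_container
    +  4 bytes   atomic_t ref_cnt
    ----------
      44 bytes (48 once the structure is padded) per 4096-byte page,
      or a bit over 1% of memory.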
With this mechanism in place, the kernel is able to implement basic memory usage control for containers. One little issue remains: what happens when the kernel is unable to force a container's memory usage below its limit? In that case, the dreaded out-of-memory killer comes into play; there is a special version of the OOM killer which restricts its predations to a single container. So, in this situation, some process will die, but other containers should be unaffected.
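A container-constrained OOM pass could be as simple as the sketch below, which only considers tasks belonging to the offending container as victims. The helpers used here - task_in_container(), container_badness(), and oom_kill_container_task() - are assumptions standing in for the real OOM-killer code.

    /*
     * Hypothetical container-constrained OOM killer: pick the "worst"
     * task from the over-limit container and kill it, leaving tasks in
     * other containers alone.  All helper names are assumptions.
     */
    static void mem_container_out_of_memory(struct mem_container *cont)
    {
            struct task_struct *p, *victim = NULL;
            unsigned long worst = 0;

            rcu_read_lock();
            for_each_process(p) {
                    unsigned long points;

                    if (!task_in_container(p, cont))
                            continue;
                    points = container_badness(p);  /* hypothetical victim scoring */
                    if (points > worst) {
                            worst = points;
                            victim = p;
                    }
            }
            if (victim)
                    oom_kill_container_task(victim);  /* hypothetical kill helper */
            rcu_read_unlock();
    }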
One interesting aspect of the problem which appears to not have been completely addressed is pages which are used by processes in more than one container. Many shared libraries will fall into this category, but much page cache usage could as well. The current code charges a page to the first container which makes use of it. Should the page be chosen to be evicted, it will be unmapped from all containers; if a different container then faults the page in, that container will be charged for it going forward. So, over time, the reclaim mechanism may well cause the charging of shared pages to be spread across the containers on the system - or to accumulate in a single, unlimited container, should one exist. Determining whether real problems could result from this mechanism will require extensive testing with a variety of workloads, and, one suspects, that effort has barely begun.
For now we have a memory controller framework which appears to be capable
of doing the core job, which is a good start. It is clearly applicable to
the general container problem, but might just prove useful in other
situations as well. A system administrator might not want to implement
full-blown containers, but might be interested in, for example, keeping
filesystem-intensive background jobs (updatedb, backups, etc.)
within certain memory bounds. Users could put a boundary around, say,
OpenOffice.org to keep it from pushing everything else of interest out of
memory. There would seem to be enough value here to justify the inclusion
of this patch - though a bit of work may be required first.
Index entries for this article
    Kernel: Containers
    Kernel: Memory management/Control groups
Controlling memory use in containers
Posted Aug 2, 2007 2:58 UTC (Thu) by Nick (guest, #15060)

struct meta_page is a horrible name! struct page is already "meta-page", and meta_page isn't really metadata about the struct page, it is just more metadata about the page which is broken out so as to minimise overhead.

Controlling memory use in containers
Posted Aug 2, 2007 6:12 UTC (Thu) by balbir_singh (subscriber, #34142)

Yes, I agree, that page is indeed metadata. But the idea behind meta_page is that it is metadata associated with the page. Do you think a direct name like container or meta_container sounds better? I was hoping that we could use meta_page to extend struct page to things beyond just containers in the future. That would require some refactoring, of course.

Controlling memory use in containers
Posted Aug 2, 2007 8:14 UTC (Thu) by Nick (guest, #15060)

Heh, this is a very appropriate forum for discussing patches ;)

Some more direct name like struct page_container sounds nicer to me, yes. I wouldn't worry about how to add different types of extra page data yet. It is going to be yet more costly, so I would just tackle that if/when it comes up:

    eg. struct page_extra {
            struct page_container *pc;
            struct something_else *blah;
    };

While you're here: do you think you could look at not using up a page flag for locking, please? As you're adding a new pointer in struct page, then you might be able to use the low bit or two of that guy as flags. This even gives you the advantage that you can use a non-atomic store to unlock, once I get my bitops patches in, as long as the word isn't subject to concurrent modifications when it is locked.

Controlling memory use in containers
Posted Aug 2, 2007 9:48 UTC (Thu) by balbir_singh (subscriber, #34142)

I find this forum very interesting as well :-)

I am in agreement on the naming convention, and using the low bit for locking sounds like a good idea! I hope no-one complains about readability. I'll explore your suggestions further.

Controlling memory use in containers
Posted Aug 2, 2007 16:22 UTC (Thu) by iabervon (subscriber, #722)

I think container_page is better than page_container, since it's information for the container subsystem about pages, and not a struct that contains pages.

Controlling memory use in containers
Posted Aug 3, 2007 6:01 UTC (Fri) by balbir_singh (subscriber, #34142)

That naming convention would make sense if we were referring to the page from the container. Since we do it in reverse (from page to container), page_container sounds better.

Why struct meta_page?
Posted Aug 2, 2007 11:39 UTC (Thu) by i3839 (guest, #31386)

What's the reason for using a separate struct meta_page, instead of adding lru and mem_container to struct page? That would save two pointers per page.

(BTW, according to http://article.gmane.org/gmane.linux.kernel.containers/224 the ref_cnt field is gone now.)

Why struct meta_page?
Posted Aug 3, 2007 6:00 UTC (Fri) by balbir_singh (subscriber, #34142)

There's an OLS paper, Challenges with the memory controller, that describes some of the challenges we face as we design our memory controller. Why we don't directly extend struct page is explained in the paper (hint: the number of struct page(s) is directly proportional to the memory in the system).

Why struct meta_page?
Posted Aug 3, 2007 14:35 UTC (Fri) by i3839 (guest, #31386)

Thanks for the pointer, interesting paper. It explains a lot, but it didn't convince me yet. You started with an unmodified struct page. Then it turned out adding a pointer back to the struct meta_page was really worth it, so you did. Now you end up with:

    /*
     * A meta page is associated with every page descriptor. The meta page
     * helps us identify information about the container
     */
    struct meta_page {
            struct list_head lru;           /* per container LRU list */
            struct page *page;
            struct mem_container *mem_container;
    };

and a struct meta_page *meta_page; in struct page.

Those are 5 pointers, of which two are used to link a meta_page and its corresponding page to each other, which is a very big overhead and probably also makes the code more complex. So the current method uses (1 * nr_pages + 4 * A) * sizeof(pointer) memory, where 'A' is the number of pages associated with a meta page. Getting rid of struct meta_page and embedding lru and *mem_container directly in struct page will use 3 * nr_pages * sizeof(pointer). Using that we can calculate which method is more memory efficient under which circumstances:

    1 * nr_pages + 4 * A < 3 * nr_pages
    -> 4 * A < 2 * nr_pages
    -> A < nr_pages/2

Assuming that 'A' equals the number of active pages, the conclusion is that a separate struct meta_page is only better when less than half of all memory is active. But:

1) That seems like an uncommon condition, so in general it's expected that it will be much more than that (e.g. on my system the ratio is 1/5 in favour of active pages, and the few other systems I checked also have many more active pages than inactive ones).

2) Optimizing memory usage only makes sense when there is a memory shortage, so to see which approach is most effective they should be compared under memory pressure. If the number of active pages is low, it can be expected that the memory pressure is low too.

What am I missing?

Why struct meta_page?
Posted Aug 4, 2007 9:30 UTC (Sat) by balbir_singh (subscriber, #34142)

Your calculation seems accurate, but remember not everyone wants to use the memory controller. For non-users of the feature, we could later add a boot option and the overhead would be (with the config enabled)

    nr_pages * sizeof(pointer)

without this, the overhead would clearly be

    3 * nr_pages * sizeof(pointer)

I need to double check this, but the sizeof struct page is currently close to being aligned in one cacheline; continuous and uncontrolled extensions can hurt in the long run (maybe even now).

Why struct meta_page?
Posted Aug 4, 2007 13:20 UTC (Sat) by i3839 (guest, #31386)

Currently it's 56 bytes for 64-bit systems and 32 bytes for 32-bit ones, so you're right that adding pointers is bad, but that's also true for the one pointer you add (less so for 64-bit systems though).

Can't you reuse the struct list_head lru in struct page for the container instead of using two lists? That would save two pointers. If you need to do global reclaim you can do it per container, ordered on how "full" they are.

Also, is the struct mem_container* pointer really needed? Don't you know which container is involved from the context it's used in? And the few times you really need it, can't you walk the lru list to find it out? (E.g. add struct mem_container to the lru list and distinguish it by setting a bit in the lru pointer or some other trick.)

Just throwing ideas at you.

Why struct meta_page?
Posted Aug 6, 2007 17:26 UTC (Mon) by i3839 (guest, #31386)

If the above method of reusing the current lru isn't possible, an alternative would be to keep around a shadow array of struct meta_page, assuming currently an array of struct pages is used starting at a known location. Then you can calculate the index of the other structure from the memory address of either structure, also making those two pointers redundant, with the added benefit that struct page size isn't increased and the memory is only allocated when containers are used.

Why struct meta_page?
Posted Aug 3, 2007 14:37 UTC (Fri) by i3839 (guest, #31386)

Unrelated to the above, your patch has

    +#define PG_metapage 21 /* Used for checking if a meta_page */
                            /* is associated with a page */

but the flag only seems to be used for locking, not for checking anything like the comment says?

Why struct meta_page?
Posted Aug 4, 2007 9:30 UTC (Sat) by balbir_singh (subscriber, #34142)

Yes, it is used for locking. The idea is to atomically check if a meta_page is already associated with a page.

Controlling memory use in containers
Posted Aug 28, 2007 15:56 UTC (Tue) by kolyshkin (guest, #34342)

Note that the memory controller (at version 7) was added yesterday to the -mm tree:

http://marc.info/?l=linux-mm-commits&m=11882515781238...
http://marc.info/?l=linux-mm-commits&m=11882515790507...
http://marc.info/?l=linux-mm-commits&m=11882515813242...
http://marc.info/?l=linux-mm-commits&m=11882515781212...
http://marc.info/?l=linux-mm-commits&m=11882515802353...
http://marc.info/?l=linux-mm-commits&m=11882515810973...
http://marc.info/?l=linux-mm-commits&m=11882515831480...
http://marc.info/?l=linux-mm-commits&m=11882515842951...
http://marc.info/?l=linux-mm-commits&m=11882515950451...
http://marc.info/?l=linux-mm-commits&m=11882515843200...