The implementation language for the Linux kernel is C. That choice makes a
great deal of sense; C does a good job of staying out of the way and
letting programmers control exactly what is happening. Anybody who does
any significant amount of C programming, however, eventually ends up
chasing down memory leaks. Since C forces programmers to track every block
of allocated memory and clean up their own messes, things occasionally slip
through the cracks. Memory leaks can be a problem in applications,
especially those which run for a long time - ask any Firefox user. But
kernel memory leaks are worse; every time the kernel drops a piece of
memory, it is gone until the next boot. A system with a serious kernel
memory leak will quickly become unusable.
Tracking down memory leaks can be painful work. When a proprietary memory
allocation tracking tool became available for SunOS many years ago, your
editor had no qualms about spending thousands of his employer's dollars to
license it; the payback time was quite short. In current times, Linux
users can employ a free tool like valgrind (version 3.2.0 was released on
June 8) to track down user-space memory leaks. But valgrind does not work
on a running kernel. (Some work has been done on running User-mode Linux
under valgrind, but sometimes one simply has to debug the real kernel.)
As the kernel developers rely more heavily on automated tools for finding
bugs, the creation of a kernel memory leak detector is an obvious next
step. Catalin Marinas has taken that step with a kernel memory leak detector
patch series. This code, if accepted into the kernel, should help to
eliminate another big class of errors.
Catalin's patch functions much like a mark-and-sweep garbage collector. The
first step is to track every memory allocation in the system; to that end,
the patch instruments the slab allocator. Every block allocated from a
slab (which will include allocations from kmalloc()) is recorded in a
radix tree; along with a pointer to the block, the stored information
includes the block size and a
stack trace identifying where the block was allocated. When blocks are
freed, their corresponding entries are removed from the radix tree.
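The bookkeeping described above can be sketched in user-space C. This is only an illustration, not code from the patch: a fixed-size array stands in for the radix tree, a string label stands in for the saved stack trace, and the names track_alloc(), track_free(), track_lookup(), and struct tracked_block are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 1024        /* arbitrary capacity for this sketch */

/* One record per live allocation (hypothetical layout). */
struct tracked_block {
    uintptr_t   start;          /* address of the block; 0 marks a free slot */
    size_t      size;           /* block size in bytes */
    const char *site;           /* stand-in for the saved stack trace */
};

static struct tracked_block tracked[MAX_TRACKED];

/* Find the record for the block starting exactly at addr, or NULL. */
struct tracked_block *track_lookup(uintptr_t addr)
{
    if (addr == 0)
        return NULL;
    for (size_t i = 0; i < MAX_TRACKED; i++)
        if (tracked[i].start == addr)
            return &tracked[i];
    return NULL;
}

/* Allocation path: remember the new block.
 * Returns 0 on success, -1 if the table is full. */
int track_alloc(const void *ptr, size_t size, const char *site)
{
    assert(ptr != NULL);
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].start == 0) {
            tracked[i].start = (uintptr_t)ptr;
            tracked[i].size = size;
            tracked[i].site = site;
            return 0;
        }
    }
    return -1;
}

/* Free path: forget the block.
 * Returns 0 on success, -1 if the block was never tracked. */
int track_free(const void *ptr)
{
    struct tracked_block *b = track_lookup((uintptr_t)ptr);

    if (b == NULL)
        return -1;
    b->start = 0;
    return 0;
}
```

A linear search is used here only to keep the sketch short; the point of a tree keyed by address is that lookups stay cheap even with a very large number of live allocations.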
During normal system operation, this radix tree just sits there. Should
somebody ask about memory leaks (by reading
/sys/kernel/debug/memleak), the detection algorithm swings into
action. The steps performed are:
- A big list is created holding every outstanding memory allocation in
the system. This list is called the "white" list; everything on it is
considered to be a possible memory leak.
- Various parts of memory are scanned for pointers which match the
allocated blocks; every time such a pointer is found, the block is
moved to the "gray" list of memory which is still reachable, and thus
not leaked. The initial scan includes the kernel's static data areas,
each process's kernel stack, and each processor's per-CPU variable area.
- The first scan finds all memory referenced directly from static
memory, but kernel data structures are more complicated than that.
So, each block which has been put onto the gray list is scanned as
well. Most of these blocks will be structures allocated from a slab
cache, and they may contain pointers to other structures. So each
block is scanned, paying attention to that block's remembered size.
Any blocks pointed to from within the block are moved over to the gray
list and scanned in turn.
There is, of course, a provision for remembering which blocks have
been scanned and avoiding infinite loops.
- Once all blocks on the gray list have been scanned, every block of
memory reachable by the kernel has been located. Anything remaining
on the white list is considered to be leaked, and the relevant
information is sent back to user space.
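The white/gray scan in the steps above can be modeled in a few dozen lines of user-space C. This is a sketch under stated assumptions, not the patch's code: the caller supplies a single "roots" buffer in place of static data, stacks, and per-CPU areas, only pointers to the start of a block are recognized, and struct detector, detector_add(), scan_region(), and detect_leaks() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLOCKS 64           /* arbitrary capacity for this sketch */

enum color { WHITE, GRAY };     /* WHITE: possible leak, GRAY: reachable */

struct block {
    uintptr_t  start;           /* address of the allocation */
    size_t     size;            /* remembered size in bytes */
    enum color color;
};

struct detector {
    struct block blocks[MAX_BLOCKS];
    size_t       nblocks;
    size_t       gray[MAX_BLOCKS];  /* indices of gray blocks, in discovery order */
    size_t       ngray;
};

/* Register an outstanding allocation; it starts out white. */
void detector_add(struct detector *d, const void *ptr, size_t size)
{
    assert(d->nblocks < MAX_BLOCKS);
    d->blocks[d->nblocks].start = (uintptr_t)ptr;
    d->blocks[d->nblocks].size = size;
    d->blocks[d->nblocks].color = WHITE;
    d->nblocks++;
}

/* Read the region one pointer-sized word at a time; any word equal to the
 * start of a white block moves that block to the gray list. */
void scan_region(struct detector *d, const void *mem, size_t len)
{
    const unsigned char *p = mem;

    for (size_t off = 0; off + sizeof(uintptr_t) <= len; off += sizeof(uintptr_t)) {
        uintptr_t word;

        memcpy(&word, p + off, sizeof word);
        for (size_t i = 0; i < d->nblocks; i++) {
            struct block *b = &d->blocks[i];

            if (b->color == WHITE && b->start == word) {
                b->color = GRAY;
                d->gray[d->ngray++] = i;
            }
        }
    }
}

/* Scan the roots, then every gray block in turn. A block turns gray at
 * most once, so cycles terminate. Returns the number of blocks left white. */
size_t detect_leaks(struct detector *d, const void *roots, size_t roots_len)
{
    for (size_t i = 0; i < d->nblocks; i++)
        d->blocks[i].color = WHITE;
    d->ngray = 0;

    scan_region(d, roots, roots_len);
    for (size_t next = 0; next < d->ngray; next++) {
        const struct block *b = &d->blocks[d->gray[next]];

        scan_region(d, (const void *)b->start, b->size);
    }
    return d->nblocks - d->ngray;
}
```

Because the scan treats any matching word as a pointer, a stale or coincidental value in scanned memory keeps a block off the leak report in this model too.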
In the real world, things get complicated, so the leak detector is not
quite as simple as described above. One situation which had to be
addressed is the case where the kernel keeps a pointer to the interior of
a block of memory, rather than to the beginning. This happens frequently;
many kernel structures are located by way of an embedded list_head
structure or kobject, for example. As a way of locating these blocks, the
memory leak detector records uses of the container_of() macro; in
particular, it remembers the size of the block and the offset to the
embedded structure. When a block of a given size is allocated, the
detector records "alias" addresses for any possible embedded structures. A
pointer to one of those aliases is considered to be equivalent to a pointer
to the beginning of the block.
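A small user-space sketch may make the alias idea more concrete. It assumes a table of (block size, offset) pairs learned from container_of() uses; the names alias_record(), points_to_block(), struct alias_rule, and struct widget are hypothetical, and the container_of() shown is a simplified user-space equivalent of the kernel macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical structure that is located through its embedded list_head. */
struct widget {
    int              id;
    struct list_head link;
};

#define MAX_ALIASES 16          /* arbitrary capacity for this sketch */

/* One (block size, member offset) pair learned from a container_of() use. */
struct alias_rule {
    size_t block_size;
    size_t offset;
};

static struct alias_rule rules[MAX_ALIASES];
static size_t nrules;

/* Remember that blocks of block_size may be referenced at start + offset. */
void alias_record(size_t block_size, size_t offset)
{
    for (size_t i = 0; i < nrules; i++)
        if (rules[i].block_size == block_size && rules[i].offset == offset)
            return;             /* already known */
    assert(nrules < MAX_ALIASES);
    rules[nrules].block_size = block_size;
    rules[nrules].offset = offset;
    nrules++;
}

/* Nonzero if ptr refers to the block at start, either directly or through
 * one of the alias offsets recorded for blocks of this size. */
int points_to_block(uintptr_t ptr, uintptr_t start, size_t size)
{
    if (ptr == start)
        return 1;
    for (size_t i = 0; i < nrules; i++)
        if (rules[i].block_size == size && ptr == start + rules[i].offset)
            return 1;
    return 0;
}
```

Keying the rules on block size alone means that two unrelated structures of the same size share aliases; that is a deliberate simplification here and may let an unrelated value count as a reference.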
There are various other special cases which must be handled. For example,
memory obtained from vmalloc() will be pointed to by the memory
allocation code itself, but might still be leaked. In other cases, memory
is allocated which cannot be found by the scanning algorithm; a number of
special annotations are added to the kernel to suppress the resulting false
positive reports. The detector can also be fooled by pointers which are
left behind in disused memory, or by random data which happens to look like
a pointer to an allocated block; in these cases, false negatives will
result.
Even with these problems, the situation is better than before: many
memory leaks can now be found. Ingo Molnar, however, has a vision of a more ambitious scheme wherein
type information for every allocated block would be retained. Among other
things, this information would allow the scanning to be restricted to parts
of the block known to contain pointers; that should speed the process and
reduce false negatives. Since type information would be available, each scanned
pointer could be checked to ensure that it points to a block of the correct
type, adding another level of checking to the kernel. Implementing all of
this looks like a big task, however; even Ingo may need a couple of days to
get it done.