As anyone who has looked at kernel programming will have noticed, there
are two basic memory allocation modes in Linux. One of those, which comes
down to get_free_pages() in the end, allocates one or more physically
contiguous pages which are in the kernel's main virtual address space
(except for high memory pages, of course). Most other memory allocation
mechanisms, including the slab allocator and kmalloc(), are built on top
of get_free_pages(). In the other corner is vmalloc(), which allocates
virtually contiguous (but physically dispersed) pages in a separate
virtual address space. vmalloc() is relatively slow, but it can perform
large allocations that look contiguous to the kernel. It is thus used,
for example, to allocate space for code from loadable modules.

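The difference between the two modes can be sketched with a toy user-space model. This is not kernel code: the frame array, NFRAMES, and the functions alloc_contiguous() and alloc_scattered() are hypothetical names invented for illustration. The model only shows why a request for adjacent page frames can fail on fragmented memory while a request for any free frames still succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NFRAMES 16  /* hypothetical size of the toy "physical memory" */

/* in_use[i] is true when page frame i is taken. */
static bool in_use[NFRAMES];

/* Mark every even-numbered frame as used, so no two free frames are adjacent. */
static void fragment_memory(void)
{
    for (size_t i = 0; i < NFRAMES; i++)
        in_use[i] = (i % 2 == 0);
}

/*
 * Model of a get_free_pages()-style request: claim n adjacent free frames.
 * Returns the index of the first frame, or -1 if no such run exists.
 */
static int alloc_contiguous(size_t n)
{
    size_t run = 0;

    if (n == 0)
        return -1;
    for (size_t i = 0; i < NFRAMES; i++) {
        run = in_use[i] ? 0 : run + 1;
        if (run == n) {
            size_t start = i + 1 - n;
            for (size_t j = start; j <= i; j++)
                in_use[j] = true;
            return (int)start;
        }
    }
    return -1;
}

/*
 * Model of a vmalloc()-style request: claim any n free frames and record
 * them in frames[], which plays the role of the page table mapping.
 * Returns true on success; claims nothing and returns false otherwise.
 */
static bool alloc_scattered(size_t n, size_t frames[])
{
    size_t free_count = 0, found = 0;

    assert(frames != NULL);
    for (size_t i = 0; i < NFRAMES; i++)
        if (!in_use[i])
            free_count++;
    if (free_count < n)
        return false;
    for (size_t i = 0; i < NFRAMES && found < n; i++) {
        if (!in_use[i]) {
            in_use[i] = true;
            frames[found++] = i;
        }
    }
    return true;
}
```

After fragment_memory(), a call to alloc_contiguous(4) fails even though eight frames are free, while alloc_scattered(4, frames) succeeds by collecting frames that are not adjacent. The real allocators are far more involved, but the tradeoff has the same shape.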
Erik Jacobson recently found the limits of
kmalloc() while querying /proc/interrupts on a very
large system. The code implementing /proc/interrupts attempts to
allocate a buffer for its output; the size of that buffer is dependent on
the number of processors on the system. On big systems, the required
buffer is large and the allocation fails. Erik therefore submitted a fix
that uses vmalloc() to allocate the memory instead.

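A rough feel for the problem can be had from a few lines of arithmetic. The sketch below is a simplified model, not the kernel's actual sizing code: it assumes the buffer grows linearly with the processor count, and the column width, the number of interrupt lines, and the 128KB ceiling on a single contiguous allocation are all illustrative assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All three constants are illustrative assumptions, not kernel values. */
#define BYTES_PER_CPU_COLUMN 11u             /* assumed width of one per-CPU counter */
#define IRQ_LINES            256u            /* assumed number of interrupt lines */
#define CONTIG_LIMIT         (128u * 1024u)  /* assumed ceiling for one contiguous allocation */

/* Estimated size of a buffer holding the whole table for ncpus processors. */
static size_t interrupts_buf_size(size_t ncpus)
{
    assert(ncpus > 0);
    return (size_t)IRQ_LINES * ncpus * BYTES_PER_CPU_COLUMN;
}

/* True when the estimated buffer fits under the assumed contiguous limit. */
static bool fits_contiguous(size_t ncpus)
{
    return interrupts_buf_size(ncpus) <= CONTIG_LIMIT;
}
```

Under these assumptions a 4-processor system needs about 11KB, which is no trouble at all, while a 512-processor system needs well over a megabyte and blows through the limit.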
Linus didn't like it. He pointed out that
the seq_file interface should
be used instead. Indeed, /proc/interrupts fits naturally into the
sort of output seq_file is intended to create, and doing things that way
can eliminate the need to allocate a large buffer at all. But Linus also
clarified his thoughts on when vmalloc() should be used:

    There are basically no valid new uses of it. There's a few valid
    legacy users (I think the file descriptor array), and there are
    some drivers that use it (which is crap, but drivers are drivers),
    and it's _really_ valid only for modules. Nothing else.

That should be sufficiently clear for most readers; perhaps an entry on
vmalloc() needs to be added to the coding style document.
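The reason a seq_file-style design avoids the large buffer is that output is produced one record at a time. The following is a user-space imitation of that idea only; it does not use the kernel's seq_file API, and struct irq_row, show_row(), emit_table(), NCPUS, and ROW_BUF are hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define NCPUS   4    /* hypothetical processor count */
#define ROW_BUF 128  /* one small buffer, reused for every row */

/* Hypothetical record: one interrupt line with a counter per CPU. */
struct irq_row {
    int irq;
    unsigned long counts[NCPUS];
};

/*
 * Format a single row into buf. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if the row does not fit.
 */
static int show_row(const struct irq_row *row, char *buf, size_t len)
{
    int used;

    assert(row != NULL && buf != NULL && len > 0);
    used = snprintf(buf, len, "%3d:", row->irq);
    if (used < 0 || (size_t)used >= len)
        return -1;
    for (size_t cpu = 0; cpu < NCPUS; cpu++) {
        int n = snprintf(buf + used, len - (size_t)used, " %10lu",
                         row->counts[cpu]);
        if (n < 0 || (size_t)n >= len - (size_t)used)
            return -1;
        used += n;
    }
    if ((size_t)used + 1 >= len)  /* need room for '\n' and NUL */
        return -1;
    buf[used++] = '\n';
    buf[used] = '\0';
    return used;
}

/*
 * Walk the rows one at a time, reusing one small buffer, so memory use
 * does not grow with the number of rows. Returns the total number of
 * bytes emitted, or -1 if a row could not be formatted.
 */
static long emit_table(const struct irq_row *rows, size_t nrows, FILE *out)
{
    char buf[ROW_BUF];
    long total = 0;

    for (size_t i = 0; i < nrows; i++) {
        int n = show_row(&rows[i], buf, sizeof buf);
        if (n < 0)
            return -1;
        fwrite(buf, 1, (size_t)n, out);
        total += n;
    }
    return total;
}
```

However many rows there are, emit_table() never needs more than ROW_BUF bytes of scratch space. In this sketch the row buffer still grows with NCPUS, so it shows only the per-record half of the idea.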
There are a few reasons for this stance. Every call to vmalloc()
requires page table tweaking and translation buffer flushes, so it will be
slow. Space from vmalloc() lies outside of the regular kernel
range, which is (on most architectures) covered by a single, large page
table entry, so extra translation buffer slots are required to access it.
And, on many architectures, the amount of virtual space set aside for
vmalloc() is relatively small. For all of these reasons, use of
vmalloc() is discouraged, and patches containing vmalloc() calls are
increasingly unlikely to make it into the kernel.
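The translation buffer cost is easy to put into numbers. The sketch below assumes 4KB small pages and a 4MB large page, which are illustrative figures (they match 32-bit x86 without PAE, but other architectures differ); entries_needed() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

#define KIB 1024ul
#define MIB (1024ul * KIB)

/*
 * Number of translation entries needed to map `region` bytes using pages
 * of `page_size` bytes, rounding up to whole pages.
 */
static unsigned long entries_needed(unsigned long region, unsigned long page_size)
{
    assert(page_size > 0);
    return (region + page_size - 1) / page_size;
}
```

With these figures, entries_needed(4 * MIB, 4 * MIB) is 1, while entries_needed(4 * MIB, 4 * KIB) is 1024: the same amount of memory that the regular kernel range covers with a single large entry takes a thousand small-page translations when it comes from vmalloc() space.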