One of the many changes rolled into the 2.5.45 kernel was the "hot-n-cold
pages" patch from Martin Bligh, Andrew Morton, and others. It's a
conceptually simple change that shows how far one has to go to deal with
the realities of modern system architecture.
One generally thinks of a system's RAM as being the fastest place to keep
data. But main memory is slow relative to the processor; the real speed
comes from working out of
the onboard cache in the processor itself. Much effort has, over the
years, gone into trying to optimize the kernel's cache behavior and
avoiding the need to go to main memory. The new page allocation system is
just another step in that direction.
The processor cache holds copies of recently accessed memory. The
kernel often has a good idea of which pages have seen recent accesses and
are thus likely to be present in cache. The hot-n-cold patch tries to take
advantage of that information by adding two per-CPU free page lists (for
each memory zone). When a processor frees a page that is suspected to be
"hot" (i.e. represented in that processor's cache), it gets pushed onto the
hot list; others go onto the cold list. The lists have high and low
limits; after all, if the hot list grows larger than the processor's cache,
the chances of those pages actually being hot start to get pretty small.
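Concretely, each CPU gets, for each zone, a pair of lists managed through
a small structure carrying the watermarks and a batch size. A sketch in
the spirit of the 2.5.45 data structures (the real ones live in
include/linux/mmzone.h; the comments here are illustrative):

    #include <linux/list.h>

    struct per_cpu_pages {
        int count;              /* pages currently on the list */
        int low;                /* refill from the buddy allocator below this */
        int high;               /* drain back to the buddy allocator above this */
        int batch;              /* how many pages to move at a time */
        struct list_head list;  /* the pages themselves */
    };

    struct per_cpu_pageset {
        struct per_cpu_pages pcp[2];  /* 0: hot, 1: cold */
    };

When count climbs past high, a batch of pages goes back to the main
allocator; when it falls below low, a batch is pulled in.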
When the kernel needs a page of memory, the new allocator
normally tries to get that page from the processor's hot list. Even if the
page is simply going to be overwritten, it's still better to use a
cache-warm page. Interestingly, though, there are times when it makes
sense to use a cold page instead. If the page is to be used for DMA read
operations, it will be filled by the device performing the operation and
the cache will be invalidated anyway. So 2.5.45 includes a new
GFP_COLD page allocation flag for the situations where using a
cold page makes more sense.
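In driver code, that might look something like the following sketch; the
in-tree spelling of the flag is __GFP_COLD, and alloc_dma_read_page() is
just an illustrative wrapper:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Illustrative helper: get a page to be filled by a DMA read. */
    static struct page *alloc_dma_read_page(void)
    {
        /*
         * The device will overwrite the whole page, so a cache-warm
         * page buys nothing; ask the allocator for a cold one.
         */
        return alloc_page(GFP_KERNEL | __GFP_COLD);
    }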
The use of per-CPU page lists also cuts down on lock contention, which
helps performance in its own right. When pages must be moved between the
hot/cold lists and the main memory allocator, they are transferred in
multi-page chunks, further reducing locking traffic and speeding things up.
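In outline, the free path could then look like this sketch. The function
name echoes the kernel's free_hot_cold_page(), but this_cpu_pcp() and
free_pages_bulk() are stand-ins for whatever the real code uses to find
the local list and to return a chunk of pages to the buddy allocator:

    #include <linux/mm.h>
    #include <linux/mmzone.h>

    /* Stand-ins for this sketch: locate this CPU's hot or cold list,
     * and hand a chunk of pages back to the buddy allocator. */
    struct per_cpu_pages *this_cpu_pcp(struct zone *zone, int cold);
    void free_pages_bulk(struct zone *zone, struct list_head *list, int count);

    static void free_hot_cold_page(struct page *page, int cold)
    {
        struct zone *zone = page_zone(page);
        struct per_cpu_pages *pcp = this_cpu_pcp(zone, cold);

        list_add(&page->lru, &pcp->list);
        pcp->count++;
        if (pcp->count > pcp->high) {
            /* One lock acquisition covers a whole batch of pages. */
            free_pages_bulk(zone, &pcp->list, pcp->batch);
            pcp->count -= pcp->batch;
        }
    }

The allocation side mirrors this: when the local list runs dry, a batch
of pages is pulled from the buddy allocator under a single lock.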
Andrew Morton has benchmarked this patch and included a number of results
with one of the patchsets. Performance benefits range from a mere 1-2% on
the all-important kernel compilation time to 12% on the SDET test. That
was enough, apparently, to convince Linus.