By Jonathan Corbet
May 14, 2008
Memory management changes (for the x86 architecture) have caused surprises
for a few kernel developers. As these issues have been worked out, it has
become clear that not everybody understands how memory caching works on
contemporary systems. In an attempt to bring some clarity, Arjan van de
Ven wrote up some notes and sent them to your editor, who has now worked
them into this article. Thanks to Arjan for putting this information
together - all the useful stuff found below came from him.
As readers of What every
programmer should know about memory will have learned, the caching
mechanisms used by contemporary processors are crucial to the performance
of the system. Memory is slow; without caching, systems will run
much slower. There are situations where caching is detrimental,
though, so the hardware must provide mechanisms which allow control
over caching for specific ranges of memory. With 2.6.26, Linux is
(rather belatedly) starting to catch up with the current state of the art
on x86 hardware; that, in turn, is bringing some changes to how caching is
managed.
It is good to start with a definition of the terms being used. If a piece
of memory is cachable, that means:
- The processor is allowed to read that memory into its cache at
any time. It may choose to do so regardless of whether the
currently-executing program is interested in reading that memory.
Reads of cachable memory can happen in response to speculative
execution, explicit prefetching, or a number of other reasons. The
CPU can then hold the contents of this memory in its cache for an
arbitrary period of time, subject only to an explicit request to
release the cache line from elsewhere in the system.
- The CPU is allowed to write the contents of its cache back to memory
at any time, again regardless of what any running program might choose
to do. Memory which has never been changed by the program might be
rewritten, or writes done by a program may be held in the cache for an
arbitrary period of time. The CPU need not have read an entire cache
line before writing that line back.
What this all means is that, if the processor sees a memory range as
cachable, it must be possible to (almost) entirely disconnect the
operations on the underlying device from what the program thinks it is
doing. Cachable memory must always be readable without side effects.
Writes have to be idempotent (writing the same value to the same location
several times has the same effect as writing it once),
ordering-independent, and size-independent. There must be no side effects
from writing back a value which was read from the same location. In
practice, this means that what sits behind a cachable address range must be
normal memory - though there are some other cases.
If, instead, an address range is uncachable, every read and write
operation generated by software will go directly to the underlying device,
bypassing the CPU's caches. The one exception is with writes to I/O memory
on a PCI bus; in this case, the PCI hardware is allowed to buffer and
combine write operations. Writes are not reordered with reads, though,
which is why a read from I/O memory is often used in drivers for PCI
devices as a sort of write barrier.
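To make that idiom concrete, here is a minimal sketch of the read-back
pattern; the device, register offset, and function names are invented for
the example, but writel() and readl() are the standard I/O memory
accessors:

    #include <linux/io.h>

    #define DEMO_CONTROL  0x10    /* hypothetical register offset */

    /* Hedged sketch: kick off a device operation, then force any
     * buffered (posted) PCI writes to reach the hardware. */
    static void demo_start_device(void __iomem *regs)
    {
        /* This write may linger in PCI bridge buffers for a while... */
        writel(0x1, regs + DEMO_CONTROL);

        /* ...but a read from the same device cannot pass it, so once
         * readl() returns, the write has been performed. */
        (void) readl(regs + DEMO_CONTROL);
    }

The extra read costs a full trip to the device, but it is the portable way
for a driver to know that a buffered write has actually completed.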
A variant form of uncached access is write combining. For read
operations, write-combined memory is the same as uncachable memory. The
hardware is, however, allowed to buffer consecutive write operations and
execute them as a smaller series of larger I/O operations. The main user
of this mode is video memory, which often sees sequential writes and which
offers significant performance improvements when those writes are
combined.
The important thing is to use the right cache mode for each memory range.
Failure to make ordinary memory cachable can lead to terrible performance.
Enabling caching on I/O memory can cause strange hardware behavior,
corrupted data, and is probably implicated in global warming. So the CPU
and the hardware behind a given address must agree on caching.
Traditionally, caching has been controlled with a CPU feature called
"memory type range registers," or MTRRs. Each processor has a finite set
of MTRRs, each of which controls a range of the physical address space.
The BIOS sets up at least some of the MTRRs before booting the operating
system; some others may be available for tweaking later on. But MTRRs are
somewhat inflexible, subject to the BIOS not being buggy, and are limited
in number.
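To give a sense of how drivers have traditionally requested a specific
caching mode, the sketch below shows the classic MTRR-based way of getting
write combining for a frame buffer. The function names and the use of BAR 1
are invented for the example; mtrr_add() and mtrr_del() are the
long-standing interfaces, and a driver must be prepared for the request to
fail when the limited supply of registers has run out:

    #include <linux/pci.h>
    #include <asm/mtrr.h>

    /* Hedged sketch: the pre-PAT way of getting write combining for a
     * frame buffer, by consuming one of the limited MTRRs. */
    static int demo_setup_fb_mtrr(struct pci_dev *pdev)
    {
        int mtrr = mtrr_add(pci_resource_start(pdev, 1),
                            pci_resource_len(pdev, 1),
                            MTRR_TYPE_WRCOMB, 1);

        /* A negative return means no register was available; the
         * driver then has to live with uncached access. */
        return mtrr;
    }

    static void demo_release_fb_mtrr(struct pci_dev *pdev, int mtrr)
    {
        if (mtrr >= 0)
            mtrr_del(mtrr, pci_resource_start(pdev, 1),
                     pci_resource_len(pdev, 1));
    }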
In more recent times, CPU vendors have added a concept known as "page
attribute tables," or PAT. PAT, essentially, is a set of bits stored in
the page table entries which control how the CPU does caching for each
page. The PAT bits are more flexible and, since they live in the page
table entries, they are difficult to run out of. They are also completely
under the control of the operating system instead of the BIOS. The only
problem is that Linux doesn't support PAT on the x86 architecture, despite
the fact that the hardware has had this capability for some years.
The lack of PAT support is due to a few things, not the least of which has
been problematic support on the hardware side. Processors have stabilized
over the years, though, to the point that it is possible to create a
reasonable whitelist of CPU families known to actually work with PAT.
There have also been challenges on the kernel side; when multiple page
table entries refer to the same physical page (a common occurrence), all of
the page table entries must use the same caching mode. Even a brief window
with inconsistent caching can be enough to bring down the system. But the
code on the kernel side has finally been worked into shape; as a
result, PAT support was merged for the 2.6.26 kernel. Your editor is
typing this on a PAT-enabled system with no ill effects - so far.
On most systems, the BIOS will set MTRRs so that regular memory is cachable
and I/O memory is not. The operating system can then complicate the
situation with the PAT bits. In general, when there is a conflict between the MTRR
and PAT settings, the setting with the lower level of caching prevails.
The one exception appears to be when one says "uncachable" and the other
enables write combining; in that case, write combining will be used. So
the operating system, through its management of the PAT bits, can make a
couple of effective changes:
- Uncached memory can have write combining turned on. As noted above,
this mode is most useful for video memory.
- Normal memory can be made uncached. This mode can also be useful for
video memory; in this case, though, the memory involved is normal RAM
which is also accessed by the video card.
Linux device drivers must map I/O memory before accessing it; the function
which performs this task is ioremap(). Traditionally,
ioremap() made no specific changes to the cachability of the
remapped range; it just took whatever the BIOS had set up. In practice,
that meant that I/O memory would be uncachable, which is almost always what
the driver writer wanted. There is a separate ioremap_nocache()
variant for cases where the author wants to be explicit, but use of that
interface has always been rare.
In 2.6.26, ioremap() was changed to map the memory uncached at all
times. That created a couple of surprises in cases where, as it happens,
the memory range involved had been cachable before and that was what the
code needed. As of 2.6.26, such code will break until the call is changed
to use the new ioremap_cache() interface instead. There is also an
ioremap_wc() function for cases where a write-combined mapping is
needed.
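As an illustration of how the distinction looks in driver code, here is a
hedged sketch of a mapping routine for a hypothetical device with a
register BAR and a frame buffer BAR; the structure and BAR numbers are
invented for the example, while ioremap_nocache(), ioremap_wc(), and
iounmap() are the real interfaces:

    #include <linux/pci.h>
    #include <linux/io.h>

    /* Hypothetical device-private structure for the sketch. */
    struct demo_dev {
        void __iomem *regs;   /* control registers: must be uncached */
        void __iomem *fb;     /* frame buffer: write combining is fine */
    };

    static int demo_map_bars(struct pci_dev *pdev, struct demo_dev *demo)
    {
        /* BAR 0: device registers. As of 2.6.26, plain ioremap() is
         * uncached too, but ioremap_nocache() states the intent. */
        demo->regs = ioremap_nocache(pci_resource_start(pdev, 0),
                                     pci_resource_len(pdev, 0));
        if (!demo->regs)
            return -ENOMEM;

        /* BAR 1: video memory; sequential writes benefit from the
         * write-combined mapping provided by ioremap_wc(). */
        demo->fb = ioremap_wc(pci_resource_start(pdev, 1),
                              pci_resource_len(pdev, 1));
        if (!demo->fb) {
            iounmap(demo->regs);
            return -ENOMEM;
        }
        return 0;
    }

Code which, instead, depends on a cachable mapping must now say so
explicitly with ioremap_cache().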
It is also possible to manipulate the PAT entries for an address range
explicitly:
int set_memory_uc(unsigned long addr, int numpages);
int set_memory_wc(unsigned long addr, int numpages);
int set_memory_wb(unsigned long addr, int numpages);
These functions will set the given pages to uncachable, write-combining, or
writeback (cachable), respectively. Needless to say, anybody using these
functions should have a firm grasp of exactly what they are doing or
unpleasant results are certain.
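For memory which the kernel itself has allocated, the sketch below shows
how a range might be switched to write combining and restored before being
freed; the function names and the buffer size are hypothetical, and the
addresses passed to set_memory_wc() and set_memory_wb() are kernel virtual
addresses with the range measured in pages:

    #include <linux/gfp.h>
    #include <asm/cacheflush.h>    /* set_memory_*() on x86 */

    #define DEMO_ORDER  4    /* 16 pages; arbitrary for the example */

    /* Hedged sketch: a buffer shared with a device which prefers
     * write-combined access, such as RAM scanned out by a video card. */
    static unsigned long demo_alloc_wc_buffer(void)
    {
        unsigned long addr = __get_free_pages(GFP_KERNEL, DEMO_ORDER);

        if (!addr)
            return 0;
        /* Switch the kernel mapping of these pages to write combining. */
        if (set_memory_wc(addr, 1 << DEMO_ORDER)) {
            free_pages(addr, DEMO_ORDER);
            return 0;
        }
        return addr;
    }

    static void demo_free_wc_buffer(unsigned long addr)
    {
        /* Restore normal writeback caching before freeing the pages. */
        set_memory_wb(addr, 1 << DEMO_ORDER);
        free_pages(addr, DEMO_ORDER);
    }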