Getting a handle on caching

By Jonathan Corbet
May 14, 2008
Memory management changes (for the x86 architecture) have caused surprises for a few kernel developers. As these issues have been worked out, it has become clear that not everybody understands how memory caching works on contemporary systems. In an attempt to bring some clarity, Arjan van de Ven wrote up some notes and sent them to your editor, who has now worked them into this article. Thanks to Arjan for putting this information together - all the useful stuff found below came from him.

As readers of What every programmer should know about memory will have learned, the caching mechanisms used by contemporary processors are crucial to the performance of the system. Memory is slow; without caching, systems will run much slower. There are situations where caching is detrimental, though, so the hardware must provide mechanisms which allow caching to be controlled for specific ranges of memory. With 2.6.26, Linux is (rather belatedly) starting to catch up with the current state of the art on x86 hardware; that, in turn, is bringing some changes to how caching is managed.

It is good to start with a definition of the terms being used. If a piece of memory is cachable, that means:

  • The processor is allowed to read that memory into its cache at any time. It may choose to do so regardless of whether the currently-executing program is interested in reading that memory. Reads of cachable memory can happen in response to speculative execution, explicit prefetching, or a number of other reasons. The CPU can then hold the contents of this memory in its cache for an arbitrary period of time, subject only to an explicit request to release the cache line from elsewhere in the system.

  • The CPU is allowed to write the contents of its cache back to memory at any time, again regardless of what any running program might choose to do. Memory which has never been changed by the program might be rewritten, or writes done by a program may be held in the cache for an arbitrary period of time. The CPU need not have read an entire cache line before writing that line back.

What this all means is that, if the processor sees a memory range as cachable, it must be possible to (almost) entirely disconnect the operations on the underlying device from what the program thinks it is doing. Cachable memory must always be readable without side effects. Writes have to be idempotent (writing the same value to the same location several times has the same effect as writing it once), ordering-independent, and size-independent. There must be no side effects from writing back a value which was read from the same location. In practice, this means that what sits behind a cachable address range must be normal memory - though there are some other cases.

If, instead, an address range is uncachable, every read and write operation generated by software will go directly to the underlying device, bypassing the CPU's caches. The one exception is with writes to I/O memory on a PCI bus; in this case, the PCI hardware is allowed to buffer and combine write operations. Writes are not reordered with reads, though, which is why a read from I/O memory is often used in drivers for PCI devices as a sort of write barrier.
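
As a rough sketch of that pattern (the regs pointer and the CTRL_REG, STATUS_REG, and DMA_START values are invented for illustration; regs is assumed to come from an earlier ioremap() of the device's register BAR):

    writel(DMA_START, regs + CTRL_REG);   /* posted write; the bus may buffer it */
    (void) readl(regs + STATUS_REG);      /* the read forces buffered writes out */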

A variant form of uncached access is write combining. For read operations, write-combined memory is the same as uncachable memory. The hardware is, however, allowed to buffer consecutive write operations and execute them as a smaller series of larger I/O operations. The main user of this mode is video memory, which often sees sequential writes and which offers significant performance improvements when those writes are combined.

The important thing is to use the right cache mode for each memory range. Failure to make ordinary memory cachable can lead to terrible performance. Enabling caching on I/O memory can cause strange hardware behavior and corrupted data, and is probably implicated in global warming. So the CPU and the hardware behind a given address must agree on caching.

Traditionally, caching has been controlled with a CPU feature called "memory type range registers," or MTRRs. Each processor has a finite set of MTRRs, each of which controls a range of the physical address space. The BIOS sets up at least some of the MTRRs before booting the operating system; some others may be available for tweaking later on. But MTRRs are somewhat inflexible, dependent on the BIOS not being buggy, and limited in number.

In more recent times, CPU vendors have added a concept known as "page attribute tables," or PAT. PAT, essentially, is a set of bits stored in the page table entries which control how the CPU does caching for each page. The PAT bits are more flexible and, since they live in the page table entries, they are difficult to run out of. They are also completely under the control of the operating system instead of the BIOS. The only problem is that Linux doesn't support PAT on the x86 architecture, despite the fact that the hardware has had this capability for some years.

The lack of PAT support is due to a few things, not the least of which has been problematic support on the hardware side. Processors have stabilized over the years, though, to the point that it is possible to create a reasonable whitelist of CPU families known to actually work with PAT. There have also been challenges on the kernel side; when multiple page table entries refer to the same physical page (a common occurrence), all of the page table entries must use the same caching mode. Even a brief window with inconsistent caching can be enough to bring down the system. But the code on the kernel side has finally been worked into shape; as a result, PAT support was merged for the 2.6.26 kernel. Your editor is typing this on a PAT-enabled system with no ill effects - so far.

On most systems, the BIOS will set MTRRs so that regular memory is cachable and I/O memory is not. The processor can then complicate the situation with the PAT bits. In general, when there is a conflict between the MTRR and PAT settings, the setting with the lower level of caching prevails. The one exception appears to be when one says "uncachable" and the other enables write combining; in that case, write combining will be used. So the CPU, through the management of the PAT bits, can make a couple of effective changes:

  • Uncached memory can have write combining turned on. As noted above, this mode is most useful for video memory.

  • Normal memory can be made uncached. This mode can also be useful for video memory; in this case, though, the memory involved is normal RAM which is also accessed by the video card.

Linux device drivers must map I/O memory before accessing it; the function which performs this task is ioremap(). Traditionally, ioremap() made no specific changes to the cachability of the remapped range; it just took whatever the BIOS had set up. In practice, that meant that I/O memory would be uncachable, which is almost always what the driver writer wanted. There is a separate ioremap_nocache() variant for cases where the author wants to be explicit, but use of that interface has always been rare.

In 2.6.26, ioremap() was changed to map the memory uncached at all times. That created a couple of surprises in cases where, as it happens, the memory range involved had been cachable before and that was what the code needed. As of 2.6.26, such code will break until the call is changed to use the new ioremap_cache() interface instead. There is also an ioremap_wc() function for cases where a write-combined mapping is needed.
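
A quick sketch of how the variants line up (the physical addresses, lengths, and variable names here are placeholders, not a real driver's):

    void __iomem *regs, *fb;

    regs = ioremap(bar_phys, bar_len);    /* always uncached as of 2.6.26 */
    fb   = ioremap_wc(fb_phys, fb_len);   /* write-combining; e.g. video memory */
    /* Code which depended on the old, BIOS-determined behavior and needs a
       cachable mapping must now say so with ioremap_cache() instead. */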

It is also possible to manipulate the PAT entries for an address range explicitly:

    int set_memory_uc(unsigned long addr, int numpages);
    int set_memory_wc(unsigned long addr, int numpages);
    int set_memory_wb(unsigned long addr, int numpages);

These functions will set the given pages to uncachable, write-combining, or writeback (cachable), respectively. Needless to say, anybody using these functions should have a firm grasp of exactly what they are doing or unpleasant results are certain.
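
As an illustration only (buf and nr_pages are placeholders; the address passed in is a kernel virtual address, and the underlying memory must be page-aligned), a driver might switch a buffer to write-combining and back like this:

    unsigned long addr = (unsigned long) buf;

    if (set_memory_wc(addr, nr_pages))
        goto fail;                        /* the change can fail; check it */
    /* ... stream data through the buffer ... */
    set_memory_wb(addr, nr_pages);        /* restore normal caching before freeing */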



Getting a handle on caching

Posted May 15, 2008 3:18 UTC (Thu) by NCunningham (guest, #6457)

Typo: s/global warning/global warming

Regards,

Nigel

Getting a handle on caching

Posted May 15, 2008 8:58 UTC (Thu) by eskild (subscriber, #1556)

Lovely article. Thanks.

Getting a handle on caching

Posted May 15, 2008 10:33 UTC (Thu) by sergey (guest, #31763)

Will PAT usage yield better performance in general, only for specific applications, or not at all?

Getting a handle on caching

Posted May 15, 2008 13:50 UTC (Thu) by arjan (subscriber, #36785)

PAT isn't about performance per se.

Until now, the kernel uses MTRR and while that's a tad slow, it's done once during X startup, and pretty much static afterwards... (and "tad slow" is not human-noticeable at all, it's way below that)

The real problem is that the number of MTRR entries is very limited (8 on Intel CPUs for example); this is in fact so few that BIOSes increasingly will need/use all 8 just to map the existing system, so there are no new ones available for X in such a case..... PAT is the only option then to get write-combinable memory.

So *if* you hit the "8 limit", it's a big performance thing, but if you don't hit the limit... you'll not notice the difference.

Getting a handle on caching

Posted May 15, 2008 19:43 UTC (Thu) by ebiederm (subscriber, #35028)

MTRRs can be set up in an overlapping mode where UC MTRRs trump WB MTRRs. In that case, even if there are spare MTRRs, the only way we can get WC memory regions is through the use of PAT.

BIOSes do that even on boxes where it isn't strictly necessary to avoid going over the 8 MTRR limit.

Other details.

WC (write combining) technically allows read combining as well. Just none of the current instructions take advantage of that fact.

Where PAT really wins is if you have several high performance cards in your system that want to take advantage of WC - which they can't do today because of the limitation in MTRR resources.

WC, when the hardware takes advantage of it, can be a real performance win. In one extreme example I saw a specialized low latency network card generate line rate 8Gbps traffic directly from the CPU without touching memory. So it is very nice when you can get it.

Eric

Getting a handle on caching

Posted May 15, 2008 20:07 UTC (Thu) by iabervon (subscriber, #722)

That's all quite sensible, but what's this "uc minus" thing that keeps showing up in charts and comments?

Getting a handle on caching

Posted May 15, 2008 21:47 UTC (Thu) by arjan (subscriber, #36785)

"UC minus" is basically "Uncached unless something else says otherwise".

So if PAT says "UC minus" and the MTRR says "Write Combining" the result is "Write combining".

However, if PAT had said "strong UC" and the MTRR had said "Write combining", the result would have been "Uncached".


Getting a handle on caching

Posted May 15, 2008 21:55 UTC (Thu) by arjan (subscriber, #36785)

(just to clarify; if PAT says "UC minus" and the MTRR says "Cached" then the outcome is still "uncached"; so it's really "Uncached unless the MTRR says write combining")

Getting a handle on caching

Posted May 16, 2008 18:13 UTC (Fri) by giraffedata (subscriber, #1954)

The article seems to say that one of the weaknesses of MTRRs is that they're controlled by BIOS. But I don't see that -- there's nothing inherent in MTRRs that makes the BIOS set them; Linux could set them if it wanted to, as easily as it sets PAT bits.

Rather, you'd have to say it's a weakness of PAT that only the OS can control them; BIOS isn't an option.

Getting a handle on caching

Posted May 20, 2008 22:20 UTC (Tue) by arjan (subscriber, #36785)

The weakness of MTRR is that there are only 7 or 8 of them. And a whole bunch are initially set up by the BIOS (usually 6 or 7, sometimes more)..... and those are both needed and untouchable. (While you *can* change them, it then breaks SMM mode and suspend/resume etc etc.)

Getting a handle on caching

Posted May 21, 2008 2:40 UTC (Wed) by giraffedata (subscriber, #1954)

The weakness of MTRR is that there are only 7 or 8 of them. And a whole bunch are initially set up by the BIOS (usually 6 or 7, sometimes more)..... and those are both needed and untouchable.

Again, it's confusing to say that the fact some are initially set up by the BIOS is a problem. The fact that they're needed and untouchable is the problem, and it would be regardless of who set them up. So if you really want to explain why MTRRs are in short supply, you have to give an idea of what those 6 or 7 are for.

Are SMM mode and suspend/resume etc. etc. BIOS functions? If so, I guess the clear way to say it is that a whole bunch are used by the BIOS.

BTW, I found on my system, I had to free up some MTRRs by combining ranges, making a few more uncacheable pages than there would ideally be. A big reason all 8 were used is that each one must cover a power of two size block of memory, so it takes 5 registers to describe 31 pages of memory, but only one register to describe 32 pages.

idempotent writes

Posted May 16, 2008 18:18 UTC (Fri) by giraffedata (subscriber, #1954)

Writes have to be idempotent (writing the same value to the same location several times has the same effect as writing it once).

Though that's a common usage, that's not really what idempotent means. In this context, idempotent means the write has no effect. So the goal is for the second write to be idempotent, but not the first.

"Repeatable" is a better word for something that can be done several times with the same effect as doing it once.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds