
Large pages, large blocks, and large problems

Large pages, large blocks, and large problems

Posted Sep 19, 2007 16:22 UTC (Wed) by james (subscriber, #1325)
Parent article: Large pages, large blocks, and large problems

Some of Linus' thoughts (presumably) can be found at Real World Technologies (and associated thread):

[The performance cost of] Page table handling stays pretty constant - you basically get TLB misses proportionately to your data size, which means that the more TLB misses you get, the more data cache misses you get!

So realistically, TLB costs are never going to grow in any unbounded kind of manner - they are always limited by (and generally much smaller than) the D$ costs! There are loads that are more TLB-intensive than others (and loads that are more D$ intensive, of course!), but in the end, TLB's aren't the problem.

Unless the CPU micro-architecture is unbalanced, of course. There have certainly been uarchs that increased the cache size a lot without increasing the TLB size. Now, of course they'll be TLB-limited! But that's not really a fundamental issue, it's just an unbalanced design.

and
You want good cache behavior if you have 256GB of memory, or your performance will suck. It's that easy. And if you have good locality in the D$, then the TLB's will work fine.



Large pages, large blocks, and large problems

Posted Sep 19, 2007 16:51 UTC (Wed) by avik (guest, #704) [Link]

It's an oversimplification. Suppose you have a large-memory machine and
you're doing completely random access. With 4KB pages, you'll get a tlb
miss and two cache misses (one for the data and one for the pte). With
large pages, the page tables can all be cached and you only take a tlb
miss and a single cache miss.

So for this contrived workload, you get a ~2X speedup by using large
pages. Obviously real workloads will get less speedup, but it is still
significant.

Another way of stating this is that large pages don't just increase the
coverage of the tlb, they also increase the pagetable coverage of the
data cache, which can be much more significant.
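
A rough sketch of the arithmetic behind this argument, not a measurement. It assumes 8-byte leaf page-table entries and 2 MiB large pages (as on x86-64), a 16 GiB randomly accessed working set, and a 4 MiB cache. The function names are made up for illustration, and the model ignores upper page-table levels and the TLB itself.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

PTE_BYTES = 8  # assumption: 8-byte leaf entries, as on x86-64


def leaf_table_bytes(mapped_bytes, page_bytes):
    """Bytes of leaf page-table entries needed to map mapped_bytes."""
    return (mapped_bytes // page_bytes) * PTE_BYTES


def cache_misses_per_access(mapped_bytes, page_bytes, cache_bytes):
    """Crude model of uniformly random access over memory far larger than the cache.

    The data access always misses; the page-table entry misses too
    unless the whole leaf table fits in the cache.
    """
    data_miss = 1
    pte_miss = 1 if leaf_table_bytes(mapped_bytes, page_bytes) > cache_bytes else 0
    return data_miss + pte_miss


small = cache_misses_per_access(16 * GIB, 4 * KIB, 4 * MIB)
large = cache_misses_per_access(16 * GIB, 2 * MIB, 4 * MIB)

print(leaf_table_bytes(16 * GIB, 4 * KIB) // MIB, "MiB of PTEs with 4 KiB pages")
print(leaf_table_bytes(16 * GIB, 2 * MIB) // KIB, "KiB of PTEs with 2 MiB pages")
print(small, "vs", large, "cache misses per access")
```

Under these assumptions the 4 KiB mapping needs 32 MiB of leaf entries, far more than the cache holds, while the 2 MiB mapping needs only 64 KiB, which is where the roughly 2X figure for the contrived workload comes from.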

Large pages, large blocks, and large problems

Posted Sep 20, 2007 0:52 UTC (Thu) by sayler (guest, #3164) [Link]

Yes, *but* you have a commensurate increase in the cost (in silicon area, latency, validation) to support large-page-TLB entries. IIRC (from the above RWT thread) modern Intel processors support only 2 TLB entries for large pages!

In other words, you may lose some performance by going to large pages because of resource constraints.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 5:51 UTC (Thu) by avik (guest, #704) [Link]

I'm talking about current hardware, not proposing changes to hardware.

The workload I described will gain a 2X boost from using large pages
regardless of the tlb's allocation of large and small pages. And while it
is not a real-life workload, others have demonstrated nice performance
improvements with large pages.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 17:17 UTC (Fri) by vonbrand (subscriber, #4458) [Link]

The cost is the same once the data is in RAM. Loading and storing large pages is costlier. It is not at all that simple.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 19:48 UTC (Fri) by avik (guest, #704) [Link]

My comment (and Linus' remarks) is talking about large pages, not large
blocks. The assumption is that this is a memory workload, not an I/O
workload. There's no disk I/O involved.

The story repeats itself

Posted Sep 19, 2007 17:00 UTC (Wed) by khim (subscriber, #9252) [Link]

Deja vu: the same thing happened when the question of more than 4GB of memory in a 32-bit system was raised. Of course it's insane to try to handle 16GiB or 32GiB with a 32-bit CPU. Of course the proper solution is to switch to a 64-bit CPU. But when the most popular arch is 32-bit and users need huge-memory systems, what can you do?

Here we have the same situation. Most CPUs are balanced (PPC, Athlon64, etc.), but there is one vendor which sells two CPU architectures that are seriously unbalanced. Can we safely ignore this obscure vendor and these crippled CPUs? Not when the vendor is called Intel and the CPUs are Pentium 4 and Core 2. They both only have 128-items TLB (enough for 512KiB with 4KiB pages) and 4MiB of cache (at least in some models). That's an 8x difference! Yes, this is insane; yes, this is a problem of CPU design. Yet when it's the most popular vendor and the "crippled architecture" is the most popular CPU from said vendor, you cannot just ignore the problem and hope that it'll go away.
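
The numbers in the paragraph above can be checked with a few lines of arithmetic. This is only a sketch: `tlb_reach` is a made-up helper, and the last line assumes, hypothetically, that all 128 entries could hold 4 MiB pages, which other comments in this thread dispute for real hardware.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_bytes):
    """Bytes of address space the TLB can map at once (best case)."""
    return entries * page_bytes


cache_bytes = 4 * MIB

reach_4k = tlb_reach(128, 4 * KIB)
print(reach_4k // KIB, "KiB reach with 4 KiB pages")       # 512
print(cache_bytes // reach_4k, "x cache size vs. reach")   # 8

# Hypothetical: every one of the 128 entries holds a 4 MiB page.
reach_4m = tlb_reach(128, 4 * MIB)
print(reach_4m // MIB, "MiB reach with 4 MiB pages")       # 512
```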

There are rumors that Vista SP1 will use 4MB pages to speed up I/O...

The story repeats itself

Posted Sep 19, 2007 17:23 UTC (Wed) by proski (subscriber, #104) [Link]

Isn't Core 2 64-bit? That problem is going away. And 16 GB of RAM costs significantly more than a CPU and a motherboard that can support it properly, and it has always been like that.

The same applies to the cache. Changing the CPU will cost a fraction of the memory it's supposed to support.

The story repeats itself

Posted Sep 19, 2007 17:38 UTC (Wed) by khim (subscriber, #9252) [Link]

Core 2 is 64bit only, but people needed 16GiB of RAM years ago when x86-compatible 64bit CPUs were just a project. Thus PAE support was added to Linux.

The same goes for the TLB today: maybe someday we'll have a truly balanced architecture, but today we don't. And it's not clear that we'll have a balanced architecture tomorrow: the TLB must be fast (or else it's useless), but it's hard to build a cache that is both large and fast. Of course it's possible to use a 2-level TLB (like AMD does today), but that's not some minor modification; it's possible that we'll be forced to wait a few years for Intel's new design. And all these years Linux will be worse than Windows... not a good position...

The story repeats itself

Posted Sep 19, 2007 22:44 UTC (Wed) by proski (subscriber, #104) [Link]

I don't see any references to Windows in the story.

The story repeats itself

Posted Sep 20, 2007 0:52 UTC (Thu) by sayler (guest, #3164) [Link]

"Core 2 is 64bit only"

No.

Oops

Posted Sep 20, 2007 21:09 UTC (Thu) by khim (subscriber, #9252) [Link]

Perils of editing. Initially I wanted to say that only "Core 2" is 64bit while Pentium 4 (except the latest Prescott models), Pentium 3 and so on are 32bit. The phrase was too cumbersome, so I removed everything, but the word "only" was left behind after editing...

The story repeats itself

Posted Sep 27, 2007 13:31 UTC (Thu) by anton (guest, #25547) [Link]

[...] Pentium 4 and Core 2. They both only have 128-items TLB (enough for 512KiB with 4KiB pages)
Note that this assumes that each page is fully utilized with hot data, i.e., the best case. That's rarely the case and that's why cache lines are smaller than a page. So, in the worst case, 128 TLB entries cover only 128 cache lines, i.e., with 64-byte L1 cache lines, 8KB. I have no data on typical cases, but if half the page contains hot data, a 128-entry TLB will cover only 256KB of data in the caches. You can then hope that your working set is smaller, or you will see TLB thrashing.

If page utilization is a problem, larger pages will probably have limited benefit: the relative utilization (hot data/page size) will probably go down. However, as long as the absolute utilization (hot data/page) goes up, larger pages will still be useful in terms of reducing TLB misses; they do have other costs, though.
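
The coverage figures above, and the relative-versus-absolute utilization point, can be sketched as follows. The 64 KiB page with 8 KiB of hot data is a hypothetical example chosen for illustration, not a measured case, and `hot_coverage` is a made-up helper.

```python
KIB = 1024
TLB_ENTRIES = 128


def hot_coverage(entries, hot_bytes_per_page):
    """Hot data reachable through the TLB if each mapped page holds this much."""
    return entries * hot_bytes_per_page


worst = hot_coverage(TLB_ENTRIES, 64)        # one 64-byte cache line per page
half = hot_coverage(TLB_ENTRIES, 2 * KIB)    # half of each 4 KiB page is hot
best = hot_coverage(TLB_ENTRIES, 4 * KIB)    # every byte of each page is hot
print(worst // KIB, half // KIB, best // KIB, "KiB")

# Hypothetical larger page: relative utilization drops, absolute goes up.
small_page, small_hot = 4 * KIB, 2 * KIB
big_page, big_hot = 64 * KIB, 8 * KIB
print(small_hot / small_page, "->", big_hot / big_page, "relative utilization")
print(hot_coverage(TLB_ENTRIES, small_hot) // KIB, "->",
      hot_coverage(TLB_ENTRIES, big_hot) // KIB, "KiB covered")
```

In this made-up case the fraction of each page that is hot falls from one half to one eighth, yet the hot data covered by the same 128 entries still grows fourfold.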

Large pages, large blocks, and large problems

Posted Sep 19, 2007 18:39 UTC (Wed) by eSk (subscriber, #45221) [Link]

I've never understood Linus' aversion to general support for superpages. He always seems to argue that it does not matter (for performance or otherwise). However, at OSDI '02 Navarro published a very thorough analysis of how superpage support helped performance by increasing TLB coverage [1]. The page sizes used were 8KB, 64KB, 512KB, and 4MB and the whole thing was implemented in FreeBSD. He also described the techniques used for promoting and demoting superpages (i.e., automatically converting between different page sizes as needed). He didn't even talk about other advantages of using superpages (e.g., for supporting large blocks), but the results he obtained were still impressive. BTW, another side effect of using superpages is that you tend to get better cache utilization because you automatically get a cache coloring effect by using the larger page sizes.

[1] Navarro et al., "Practical, transparent operating system support for superpages", OSDI '02.
http://www.usenix.org/events/osdi2002/tech/full_papers/na...

Large pages, large blocks, and large problems

Posted Sep 20, 2007 0:54 UTC (Thu) by sayler (guest, #3164) [Link]

"This system is implemented in FreeBSD on the Alpha architecture," The Alpha has (had?) good support for higher-order page allocation at the hardware level that is not currently present in current-gen Intel and AMD chips.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 11:13 UTC (Thu) by eSk (subscriber, #45221) [Link]

Sure. Running these experiments on any x86-lineage chip would obviously not work (except perhaps in a simulator) because of lacking the wider range of page sizes. The point I was trying to make is that Linus argues strongly against the usefulness of any page size above 4K/8K, except for some special cases where very large superpages are used explicitly.

Large pages, large blocks, and large problems

Posted Sep 27, 2007 13:43 UTC (Thu) by anton (guest, #25547) [Link]

Running these experiments on any x86-lineage chip would obviously not work (except perhaps in a simulator) because of lacking the wider range of page sizes.
On a machine that has plenty of memory for what's running on it, a variation of the Navarro approach might still be useful even on an IA32/AMD64 machine. The OS would rarely feel enough memory pressure to consider splitting a large page.

Also, since Linux also supports architectures that have finer-grained page size steps, it should not look just at what IA32/AMD64 support. And once a popular OS like Linux supports it, the hardware designers at Intel and AMD can better justify adding additional page sizes to their hardware.

Large pages, large blocks, and large problems

Posted Sep 20, 2007 13:42 UTC (Thu) by zlynx (subscriber, #2285) [Link]

Intel's Itanium supports page sizes in all powers of 2 from 4K to 256M. That's a current-gen Intel chip.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 14:58 UTC (Fri) by jamesh (guest, #1159) [Link]

It may be present generation, but any software developer who made decisions based on the assumption of increased adoption of Itanium is crazy.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 17:21 UTC (Fri) by vonbrand (subscriber, #4458) [Link]

Current generation Intel chips are clones of AMD64.

Large pages, large blocks, and large problems

Posted Sep 21, 2007 18:33 UTC (Fri) by zlynx (subscriber, #2285) [Link]

Not really, or they would have included more of the good parts like the IOMMU and memory controller.

Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds