LWN: Comments on "Memory part 2: CPU caches" https://lwn.net/Articles/252125/ This is a special feed containing comments posted to the individual LWN article titled "Memory part 2: CPU caches". en-us Mon, 06 Oct 2025 00:22:15 +0000 Mon, 06 Oct 2025 00:22:15 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Memory part 2: CPU caches https://lwn.net/Articles/817333/ https://lwn.net/Articles/817333/ mpr22 <div class="FormattedComment"> The first access to each data element, in principle, costs 200 cycles instead of 15 cycles.<br> <p> So the cost is (100 * 200) + (100 * 99 * 15) = 20,000 + 148,500 = 168,500 cycles.<br> </div> Sun, 12 Apr 2020 14:17:47 +0000 Memory part 2: CPU caches https://lwn.net/Articles/817325/ https://lwn.net/Articles/817325/ remicmacs <div class="FormattedComment"> Excuse me, I know this is a rather old article and I probably won't get any answer, but something bugs me.<br> <p> I'm reading through this great paper and I have to regularly refer back to this section. So far so good.<br> <p> But I keep being stumped by this:<br> <p> <font class="QuotedText">&gt; Assume access to main memory takes 200 cycles and access to the cache memory take 15 cycles.</font><br> <font class="QuotedText">&gt; Then code using 100 data elements 100 times each will spend 2,000,000 cycles on memory operations if there is no cache and only 168,500 cycles if all data can be cached.</font><br> <font class="QuotedText">&gt; That is an improvement of 91.5%. </font><br> <p> 100 * 100 * 15 = 168500 ?<br> <p> I keep looking in this section for implied costs that I forgot to take into account, but I can't seem to find any. Is this just an error?<br> <p> Thanks for your insight<br> </div> Sun, 12 Apr 2020 14:04:12 +0000 Memory part 2: CPU caches https://lwn.net/Articles/803182/ https://lwn.net/Articles/803182/ timesir <div class="FormattedComment"> Could you show your test code?<br> </div> Sat, 26 Oct 2019 08:35:51 +0000 Memory part 2: CPU caches https://lwn.net/Articles/256607/ https://lwn.net/Articles/256607/ mas4042 <div class="FormattedComment"><pre> Fig 3.26 doesn't make any sense to me. I'm supposed to believe that the Core2 can sustain 16B/cycle read bandwidth out to a working set of 512M? Let's assume it was a 2 GHz part to make the math easy. To sustain 16B/clock would require 32 GB/sec bandwidth to main memory. What am I missing? </pre></div> Wed, 31 Oct 2007 17:11:35 +0000 Not Windows specific https://lwn.net/Articles/254029/ https://lwn.net/Articles/254029/ renox Uhm, the next architecture clash I expect to happen is not what you're waiting for (a new architecture gaining some marketshare) but low power x86 cores gaining over ARM in the 'intelligent phone' space.<br> <p> Plus even free software has portability issues: the OLPC doesn't use an x86 for nothing..<br> <p> <p> <p> Thu, 11 Oct 2007 19:51:19 +0000 Memory part 2: CPU caches -- factual error https://lwn.net/Articles/253923/ https://lwn.net/Articles/253923/ arcticwolf Don't try to make sense of Drepper - doing so will only get you verbally abused. (Hmm, looking at the front page for this - well, last, for you - week, I wonder how many women there are in glibc development. 
Having lurked on the libc-alpha mailing list for a while, Drepper seems, to me, to be exhibiting a perfect example of the kind of attitude that we-the-community should try to get rid of as much as possible.)<br> Thu, 11 Oct 2007 07:47:43 +0000 Memory part 2: CPU caches https://lwn.net/Articles/253827/ https://lwn.net/Articles/253827/ dlang one of the huge advantages of hyperthreading is a result of the treatment of CPU registers. if you schedule two threads on one core you have to save and restore all the registers (possibly as far as main memory, depending on what your second thread does to your cache)<br> <p> but with hyperthreading each virtual core has its own set of registers, this provides a drastic speedup when switching from one task to another (under ideal situations)<br> <p> in the real world it all depends on how the different threads compete for cache space and memory I/O. if you are building a dedicated compute cluster (and some high-end graphics workstations fit in this category) you can tune for this and get really good speedups, if you are running a mixed hodgepodge of stuff you are far more likely to hit the problem cases.<br> Wed, 10 Oct 2007 11:40:33 +0000 Memory part 2: CPU caches https://lwn.net/Articles/253805/ https://lwn.net/Articles/253805/ ncm I stand corrected. Now, if only processes could be rated according to how much of each fragment of the CPU they tend to use, they might be paired with others that favor using the corresponding other fragments.<br> <p> Unfortunately the mix changes radically from one millisecond to the next. For example, slack UI code may get very busy rendering outline fonts.<br> <p> Still, I am now inspired to try turning on HT on my boxes and see how it goes.<br> Wed, 10 Oct 2007 00:48:37 +0000 Reality check on PPC/Cell https://lwn.net/Articles/253793/ https://lwn.net/Articles/253793/ jmorris42 <font class="QuotedText">&gt; PPC/Cell gives much better performance/price than x86(-64),</font><br> <font class="QuotedText">&gt; and will only become more so, and we are there now.</font><br> <p> No it doesn't. Tell me where I can buy this mythical PPC/Cell with better price/performance and I'll withdraw the objection.<br> <p> Hint: I'm not talking about a Playstation3 or an IBM Cell based blade. Both are far too restrictive environments to be 'general purpose'. The PS3 is crippled consumer electronics with a dumb 2D framebuffer, little RAM, low end storage and no high speed IO. A blade is only useful in very high density server farms or compute clusters.<br> <p> Show me where I can buy an actual desktop PC or 1/2U Server based on a Cell or PPC that exhibits better price/performance on real world workloads. For a desktop that would be 3D modeling, heavy numerical crunching (but not cluster workloads) for engineering or some such CPU intensive workload. A web browser and office suite no longer matters, ANY current production CPU should be more than enough for that. For a server workload take your pick, web server workloads (LAMP or JSP), database, file server, whatever.<br> <p> The reality is only a few niche players produce PPC/Cell hardware and the price simply isn't competitive with x86 or x86-64 because of that. Yes, a Cell has some advantages vs x86 hardware which existed when it was introduced, but Cell is still basically the same and the x86 world has been churning newer faster chips with more and more cores. 
This is what kills every interesting new arch.<br> <p> Tue, 09 Oct 2007 20:51:31 +0000 x86(-64) adoption driven by performance/price, but is already second-best https://lwn.net/Articles/253781/ https://lwn.net/Articles/253781/ hazelsct <font class="QuotedText">&gt; BTW: Debian is thinking about abandonment of the 11 primary architectures system - obscure architectures often break the only pair that matters (x86 and x86-64) so they'll eventually exclude most of them from "tier 1" support list...</font><br> <p> Check your facts: this was proposed a couple of years ago, and ten of the eleven arches (all but m68k) satisfied all of the "tier 1" requirements for official release in etch. And where did you get the idea that obscure architectures "break" x86 or x86-64? The only burden they place on the project is mirror space, and mirror admins can exclude them if they wish.<br> <p> <font class="QuotedText">&gt; Other distributions already gave non-x86 architectures "second class citizens" status...</font><br> <p> So what? Eleven arches build 95% of the Debian repository, so we have support for nearly all of the 19,000 Debian packages everywhere. PPC/Cell gives much better performance/price than x86(-64), and will only become more so, and we are there now. And Ubuntu *added* SPARC last year, so they're going in the opposite direction, toward more architectures. Perhaps they see the writing on the high-performance wall?<br> Tue, 09 Oct 2007 19:50:22 +0000 Memory part 2: CPU caches https://lwn.net/Articles/253665/ https://lwn.net/Articles/253665/ nix That's `arxiv', I suspect.<br> <p> And, of course, the publishers end up holding the money, which is why <br> they're so vehemently against open access.<br> Mon, 08 Oct 2007 21:30:39 +0000 Memory part 2: CPU caches https://lwn.net/Articles/253584/ https://lwn.net/Articles/253584/ dmh2000 For a plain old programmer, the difference between the 3 cache types, fully associative, direct mapped and set associative was somewhat obscured by the hardware details (comparators etc). correct me if I am wrong, from a circuit ignorant software guy, the 3 cache types are like:<br> <p> a fully associative cache is like an unsorted list. you have to search the list one way or another to find an entry matching your tag (T), or if it is there at all. you can use various search strategies but no matter what you will be pretty slow.<br> <p> a direct mapped cache is like an array, indexed by S. you index into the array, which is quick, and compare the value in the array element to T to see if you have a hit. if the T in the array slot doesn't match the T you are looking for, you have to evict the current value. <br> <p> a set associative cache is like a hash table where S is the hash and the table is indexed by an array of size 2**S, The array is sized so there is a unique S for every slot in the table. You index into the table using S (just like direct mapped), but the array element, instead of being a tag value, is a pointer to a list containing a small number of tags. then you search that list (like fully associative) to find the tag or not. but since the list is short, the search penalty is not prohibitive.<br> <p> since the caches are implemented in circuits instead of software, the searching can have better parallelism than a software implementation at the expense of more transistors. 
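To make the analogies concrete, here is a minimal software sketch of the set-associative lookup described above. The geometry (64-byte lines, 64 sets, 8 ways) and the names are made up for illustration and not taken from any real CPU; the hardware does the per-set tag compares in parallel rather than in a loop.<br>
<pre>
/* Software analogy of a set-associative cache lookup.
   Hypothetical geometry: 64-byte lines, 64 sets, 8 ways. */
#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

#define LINE_SIZE 64u   /* bytes per cache line                    */
#define NUM_SETS  64u   /* the "array" indexed by S                */
#define NUM_WAYS   8u   /* the short per-set list searched for T   */

struct line { bool valid; uint64_t tag; uint8_t data[LINE_SIZE]; };
static struct line cache[NUM_SETS][NUM_WAYS];

/* Direct-mapped is the special case NUM_WAYS == 1 (index, one compare);
   fully associative is NUM_SETS == 1 (compare against every line). */
static bool lookup(uint64_t addr, uint8_t *out)
{
    uint64_t set = (addr / LINE_SIZE) % NUM_SETS;   /* the index S */
    uint64_t tag = (addr / LINE_SIZE) / NUM_SETS;   /* the tag   T */

    for (unsigned way = 0; way < NUM_WAYS; way++) {
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *out = cache[set][way].data[addr % LINE_SIZE];
            return true;                            /* cache hit   */
        }
    }
    return false;   /* miss: hardware would fetch the line from the
                       next level and evict one way of this set     */
}
</pre>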
<br> Mon, 08 Oct 2007 16:38:17 +0000 Memory part 2: CPU caches https://lwn.net/Articles/253532/ https://lwn.net/Articles/253532/ dps If you want something significantly beyond this, especially if SSE is of interest, then I suggest that you might see what you can find about emmerald (a fast SSE-using matrix-matrix multiplication code). I have a 2001 journal article which discusses maximising the use of L0 cache (aka registers), L1 cache and L2 cache. Minimising TLB misses is also discussed.<br> <p> It also cites ATLAS which probably merits further investigation if you want to know about cache-efficient dense matrix multiplication. As sparse matrices are my current (real work) interest I have not investigated further.<br> <p> Fortunately for me Bodleian (library) reader's tickets are valid for life, and students get one :-) Many people have seen Duke Humphrey's library, which is part of the Bodleian library, albeit possibly not with that name attached.<br> <p> Emmerald is available at <a href="http://csl.anu.edu.au/~daa/reserach.html">http://csl.anu.edu.au/~daa/reserach.html</a>. I do not know whether there are copies of articles there too. I got mine from an electronic journal. Anyone that does ask the author for a reprint should *not* say I sent them.<br> <p> FYI some journals will sell you reprints for large amounts of money. You should only pursue this option as a last resort. Free copies for anyone are sometimes available at institutional websites or axvir. You might also be able to buttonhole a friendly academic :-)<br> <p> <p> BTW journals get both the articles and editorial for no charge, so I wonder who ends up holding the money. Most journals have significant subscription charges, mostly paid for by institutions, and some have per page charges for those publishing too.<br> <p> Sun, 07 Oct 2007 16:15:59 +0000 Memory part 2: CPU caches -- factual error https://lwn.net/Articles/253436/ https://lwn.net/Articles/253436/ giraffedata <p> What the article is biased toward isn't desktop computers or PCs (and the latter is ambiguous; sometimes it means personal computer; other times it means architectures descendant of the IBM PC). The bias is toward general purpose computers. Everything seems to be fully applicable to a typical web server, for example. Fri, 05 Oct 2007 21:16:58 +0000 Funny how you've mentioned Ultra-high performance systems https://lwn.net/Articles/253402/ https://lwn.net/Articles/253402/ khim <p>Ten years ago very few "ultra-high performance systems" were using x86-based CPUs. Today... They <b>own</b> this space: 63% of systems from TOP500 are using Intel Xeons and AMD Opterons! The next biggest contender is PowerPC - and it's down to 17% already. Why? People need ultra-fast CPUs <b>not</b> to brag about them but to solve real tasks. And x86 makes it easier.</p> <p>Note: these are custom-built systems designed to run custom-built software on top of Linux (usually). So the usual excuse that "it's just horrible Microsoft's OS that is driving back adoption of non-x86 CPUs" will not fly.</p> <p>BTW: Debian is thinking about abandonment of the 11 primary architectures system - obscure architectures often break the only pair that matters (x86 and x86-64) so they'll eventually exclude most of them from the "tier 1" support list...</p> <p>Other distributions already gave non-x86 architectures "second class citizens" status...</p> Fri, 05 Oct 2007 19:24:16 +0000 Of course we will https://lwn.net/Articles/253378/ https://lwn.net/Articles/253378/ hazelsct You forgot about libraries. 
It's easy to rip-and-replace a ray tracing engine, (non)linear system solver, etc. for one written in Fortress with a similar interface, and still keep the same expensive C++, Java, or Python high-level framework.<br> <p> As for your example, what still uses GTK+ 1.2, aside from XMMS 1 (XMMS 2 doesn't) and groach? If this were a real performance or other problem, we would have a 1.2-compatible wrapper over 2.0.<br> Fri, 05 Oct 2007 17:45:46 +0000 Wipe the Windows out of your eyes https://lwn.net/Articles/253374/ https://lwn.net/Articles/253374/ hazelsct <i>...this statement ignores the complete economic picture: it's not just the hardware that matters, but the total stack that is in a computing solution. Nobody buys just the raw hardware - the software is an inextractable part of the equation.</i><p> This is why free software is superior, *all* of our software already runs on *all* the hardware (<i>e.g.</i> 95%+ of Debian builds on all eleven arches), and very well at that! "Programmability" is bunk, all modern CPU architectures are programmable, we're just the only ones bothering to program all of them -- or rather, the only ones open enough to let people adapt our programs to all of them.<p> So when Sun starts selling its 8-core 8-threads/core Sparcs next year, Solaris, Linux, and maybe one or two BSDs will be there, Windoze and MockOS will not. GNOME, KDE, Compiz Fusion, Croquet, Looking Glass, Second Life, Sugar, etc. will be there; Aero, Carbon, and any 3-D environment they dream up will not. We will have better servers, and better immersive 3-D performance, the games will follow, and it will be over for MS and Apple.<p> Okay, maybe it won't happen quite that fast. But follow the trends: we were first on Itanium (but who cares?), first on AMD64 (and AMD credits us with driving that platform's success), our software is all over ARM where only a minuscule fraction of Microsoft's is (on "Pocket PC" or "Windows Mobile" or whatever they're calling it now). When the next CPU architecture breakthrough comes, we'll be there and they'll need to play catch-up, again -- hopefully the hardware won't wait for Microsoft this time.<p> Speaking of which, this is also how DEC fumbled the marketing of Alpha -- they waited for MS to have NT ready before releasing it, and lost more than a year in that wait. They should have bypassed NT, released earlier with VMS and OSF/1 (or Digital Unix, etc.), dominated the workstation/server market, then used Linux on the low end to ramp up the volume and stay faster than Intel.<p> MS is darned lucky that modern x86-compatible CPUs run 64-bit code somewhat fast with low power consumption. Otherwise Linux on Alpha/Itanic/PPC64 on the high end and ARM on the low end would have eaten their lunch. But then, we're doing that *now* on handhelds and smart phones, we're on Cell and they're not, and we'll beat them to the 64-way Sparcs!<p> Resistance is futile. World domination is inevitable. Fri, 05 Oct 2007 17:27:08 +0000 No disagreement here https://lwn.net/Articles/253277/ https://lwn.net/Articles/253277/ filker0 Embedded systems may outnumber general purpose PCs, but I doubt that any single platform <br> outnumbers them on its own. Also, far fewer programmers ever have a chance to program one. 
<br> Not all programmers have to know how to deal with systems with 4 different types of RAM, <br> or demand paged high speed static RAM that is paged from a larger SDRAM, that in turn is paged <br> from NOR or NAND Flash by a separate microprocessor that implements a predictive pre-fetch. <br> Each platform is a special case.<br> <p> A game engine such as the one you describe provides a virtual machine, and makes a heck of a <br> lot of sense. All you have to port, as you said, is the VM. (Not all VMs use byte-codes, after all).<br> <p> My current project (I'm the low-level platform guy) involves a lot of cache performance <br> optimization in the application level code -- aligning data on cache line boundaries, use of burst <br> DMA to do memory-to-memory transfers in parallel with continued code execution, and explicit <br> cache loading and flushes. But in our system, everything is deterministic (it has to be by the <br> rules of our industry). Determinism is extremely hard on a pipelined RISC architecture, and when <br> you add cache to the picture, it becomes almost impossible. In our case, though we need to <br> squeeze every drop of performance that we can, that comes second to it always taking the same <br> amount of time to do a specific operation.<br> <p> Most programmers don't have to know the kind of cache details that game console and some <br> other embedded programmers (avionics, in my case) do. Still, I think it's good that more <br> programmers understand the concepts and techniques for improving cache performance in a <br> general multi-programming environment.<br> Fri, 05 Oct 2007 01:10:00 +0000 Wipe the gunk out of your eyes https://lwn.net/Articles/253273/ https://lwn.net/Articles/253273/ nlucas You don't have to go that far. You're forgetting Intel tried to shift the architecture on the 64-bit switch, which got smashed by the AMD x86-64 "compatible" mode.<br> Fri, 05 Oct 2007 00:02:58 +0000 Memory part 2: CPU caches https://lwn.net/Articles/253251/ https://lwn.net/Articles/253251/ jzbiciak <P>Actually, hyperthreading treats ALUs as an <I>underutilized</I> resource, and task scheduling latency as the benchmark. That is, one task might be busy chasing pointer chains and taking branches and cache misses, and not totally making use of the ALUs. (Think "most UI type code.") Another task might be streaming data in a tight compute kernel, scheduling its data ahead of time with prefetch instructions. It will have reasonable cache performance (due to the prefetches), and will happily use the bulk of the ALU bandwidth.</P> <P>In this scenario, the CPU hog will tend to use its entire timeslice. The other task, which is perhaps interacting with the user, may block, sleep and wake up to go handle minor things like moving the mouse pointer around, blinking the cursor, handling keystrokes, etc. In a single-threaded machine, that interactive task would need to preempt the CPU hog directly, go do its thing, and then let the hog back onto the CPU. In a hyperthreaded environment, there's potentially a shorter latency to waking the interactive task, and both can proceed in parallel.</P> <P>That's at least one of the "ideal" cases. Another is when one CPU thread is blocked talking to slow hardware (e.g. direct CPU accesses to I/O registers and the like). The other can continue to make progress.</P> <P>Granted, there are many workloads that don't look like these. 
Those which cause the cache to fall apart by thrashing it definitely look worse on an HT machine.</P> Thu, 04 Oct 2007 22:10:30 +0000 Of course we will https://lwn.net/Articles/253250/ https://lwn.net/Articles/253250/ jzbiciak Right, but will new, highly parallel programs be developed in the same languages? I think it's acceptable to say that old programs don't benefit as much from new features.<br> <p> The requirement to be compatible does not necessarily include the requirement to provide peak performance.<br> Thu, 04 Oct 2007 21:54:20 +0000 Show us some code? https://lwn.net/Articles/253198/ https://lwn.net/Articles/253198/ asamardzic How exactly is the number of cycles per operation calculated for Fig 3.4 and following? It would be interesting to see the code at least for this simple benchmark at this point...<br> Thu, 04 Oct 2007 19:11:11 +0000 Excellent information, but a bit weighty for the front page... https://lwn.net/Articles/253186/ https://lwn.net/Articles/253186/ amikins Well, knowing they're last will help. I definitely don't mind having these available as a resource, but usually when I'm going over the weekly I'm skimming more than reading, so I can note what I want to read in-depth later. :)<br> <p> Thanks for the prompt reply. Your attention to detail and the needs of your readers (even when they don't agree fully with mine) are why I still keep a subscription up as much as possible, after all these years.<br> <p> Thu, 04 Oct 2007 17:58:09 +0000 Excellent information, but a bit weighty for the front page... https://lwn.net/Articles/253160/ https://lwn.net/Articles/253160/ corbet I had thought about it, but we get complaints when we move things off the weekly pages too. So we'll probably keep them inline, but they are always the last item on the page for easy skipping should you want to do so. The next couple of segments are also shorter, to your editor's relief. Thu, 04 Oct 2007 16:53:11 +0000 Excellent information, but a bit weighty for the front page... https://lwn.net/Articles/253158/ https://lwn.net/Articles/253158/ amikins Is there any chance future installments could be linked to on a separate subscriber-only page, instead of taking up a huge chunk of the front page? Having to scroll past all of that while looking for more text on the front page is a bit trying.<br> <p> Thu, 04 Oct 2007 16:49:43 +0000 Hot? https://lwn.net/Articles/253102/ https://lwn.net/Articles/253102/ ekj Actually, that's not true. Boltzmann's constant sets an absolute, physical, lower limit on the amount of power that is needed for causing a permanent lasting state-change. (such as flipping a single bit)<br> <p> Granted, that limit is *very* low. But it's not zero. I calculated some time back (if you're sufficiently interested, google it) that if we continue doubling computing-power we'll run up against this hard physical limit in approximately 15-20 years. <br> <p> That's a long time in computing. But it's not forever. It's short enough that most of us will get to experience it.<br> <p> Oh yeah, I'm aware of reversible computing. I just don't think that'll go anywhere. I'd be happy to be proven wrong. 
<br> Thu, 04 Oct 2007 13:14:57 +0000 Memory part 2: CPU caches https://lwn.net/Articles/253097/ https://lwn.net/Articles/253097/ smitty_one_each I'd be highly interested in buying a hardcopy of this work, for convenience and to compensate the author's effort.<br> Also, something in a wiki-ish format, to support reader annotations (if not edits against the text proper), would be an interesting experiment for this work.<br> Thu, 04 Oct 2007 12:44:52 +0000 Wipe the gunk out of your eyes https://lwn.net/Articles/253024/ https://lwn.net/Articles/253024/ mingo <blockquote> Yes, we're stuck with the i86 lineage, and we will suffer performance ceilings because of it. </blockquote> <p> Yes - but this statement ignores the complete economic picture: it's not just the hardware that matters, but the total stack that is in a computing solution. Nobody buys just the raw hardware - the software is an inextractable part of the equation. <p> So once you take the cost of writing and supporting software into account too, applied to general computing problems that computers are used for, you will realize that today the hardware is not the main limiting factor but humans are. Most software is running a few orders of magnitude slower than it could run on the hardware. <p> The platform that is slightly slower but offers superior programmability - especially for something as hard to grok for humans as parallelism and/or cache coherency - will continue to be the most economic choice in all mainstream uses of computers. (and will continue to use the resources it earns from the mainstream to chip away on most of the remaining niches) <p> The trick is to maintain the right balance between programmability of a platform and raw physical performance - and the x86 space has done that pretty well over the years. (combined with the fact that the x86 instruction set has become a de-facto bytecode - so RISC never had a chance even technologically.) <p> (If only performance mattered then customers would buy hardware by sending complete chip designs and hard disk images to manufacturers, optimized to their business problem, which would then be assembled from scratch. There would be no uniform 'instruction set' per se. We are still decades (or more) away from being able to transform arbitrary, abstract business problems into actual hardware in such a pervasive way, without having any common platform layered between them. The moment you have even just a single persistent layer between the business problem and the hardware (be that layer controlled by the customer or by a manufacturer or by the market), platform thinking takes over and raw performance takes a backseat.) <p> Thu, 04 Oct 2007 06:43:00 +0000 Embedded is a special case https://lwn.net/Articles/253018/ https://lwn.net/Articles/253018/ tshow <font class="QuotedText">&gt; Embedded systems are a special case and all of the rules change.</font><br> <p> That's fair enough, but there are an awful lot of game machines and embedded systems out there; more than there are PCs, if you count game systems, cellphones, PDAs, set-top boxes, the control systems in cars...<br> <p> Our game engine deals with tightly-coupled address-mapped memory on all the platforms it supports; on platforms that don't actually have such memory (PCs, mostly), we fake it with a block of normal memory. We've built our engine as an OS (and support libraries) for games; the idea being that a game will compile on any platform that the engine supports with minimal resorting to #ifdef. 
You *can* write fast platform-agnostic game code that crosses (very different) platforms.<br> <p> A whole lot of the techniques that I'm sure this series of articles is going to delve into (walking memory in address-order whenever possible, aligning data structures to (ideally) machine word size, (hopefully) cache line size or (at worst) hardware page size, keeping transitions across page boundaries to a minimum, unrolling loops is no longer a good idea, strategies for preventing icache misses...) are just as applicable to embedded systems as they are to PCs. Arguably more so; caches on embedded systems and game systems tend to be significantly smaller than on PCs, so the cost of cache misses is that much higher.<br> <p> With relatively little effort and a little discussion of the wider realms beyond the beige (or black, or possibly silvery; your mileage may vary) desktop space heater, this could be a significantly more useful treatise.<br> <p> Thu, 04 Oct 2007 06:00:52 +0000 Embedded is a special case https://lwn.net/Articles/252997/ https://lwn.net/Articles/252997/ filker0 Embedded systems are a special case and all of the rules change. Set top boxes, such as game <br> machines, are special purpose platforms. General coding techniques used in your typical <br> application tend to be architecture and platform agnostic; a game is written knowing exactly what <br> kind of hardware environment it's going to get. Embedded apps often manage their own cache, <br> too. I know the ones I'm working on right now do.<br> Thu, 04 Oct 2007 03:07:46 +0000 Another good article for the x86 crowd https://lwn.net/Articles/252992/ https://lwn.net/Articles/252992/ filker0 This article has a lot of good information, much of which I was completely unfamiliar with until <br> now. It seldom, however, talks about non-x86 architectures. I've been doing a lot of PowerPC <br> work over the past few years, and the cache implementation is somewhat different. (It also differs <br> between PowerPC families). I've been using systems that have L1i, L1d, L2i and L2d caches. <br> That's right, the L2 is still divided between i and d. If you want to do self-modifying code, you <br> have to explicitly invalidate the instruction cache over that region or you have a good chance of <br> getting the old instructions if there were previously instructions in those locations.<br> <p> The discussion of how the cache tags/lines/sets are managed is pretty close, but the 32 bit PPCs <br> in the 7450 line such as the 7448 have three address layers: virtual, effective, and physical. <br> The 32 bit virtual address maps to an effective address that is 54 bits (I think) wide, which is then <br> mapped to a 36 bit physical address, which is then passed to the system controller. The cache is <br> associated with the effective address, so if two tasks are sharing the same data at different <br> virtual addresses that map to the same effective address, and that address is in a cacheable <br> region, you don't end up with two copies of the data in the cache. There are a lot of other <br> variations on other architectures. The PPC7450 series also provides very lightweight advisory <br> instructions that give hints to the cache controller to pre-fetch data before the instructions that <br> need that data are reached. These instructions get serviced out of order, and (if I read the <br> documentation correctly) do not occupy a space in the pipeline.<br> <p> Quite a few other things covered don't apply to non-x86 style systems. 
This is not, in itself, a <br> failure on the part of the author, though he ought to make it explicit that he's only covering the <br> Intel/AMD/Cyrix/VIA world, not the PPC, SPARC, or Alpha.<br> <p> The only thing I was disappointed in is that he appears to have skipped write-through vs. write-<br> back cache strategies and cache locking. Some cache systems give you a choice of how to <br> handle writes (made by the OS, not the application), and some give the system the ability to lock <br> a range of addresses into i or d cache (L1 and/or L2). At least some of the AMCC PPC440 models <br> (I think all) allow some or all of L2 cache to be used as static RAM or L2 cache.<br> <p> Overall, it's a good article. I will be emailing my typographical comments, as requested.<br> Thu, 04 Oct 2007 02:51:04 +0000 Wipe the gunk out of your eyes https://lwn.net/Articles/252964/ https://lwn.net/Articles/252964/ filker0 <blockquote>We're hearing this tune for a quarter-century: in the next few years x86 will be hopelessly outmatched and replaced by the fad-of-the-day (RISC, VLIW, EPIC, etc). These predictions stubbornly fail to materialize. The fact is: unless you can show a drastic increase in speed (not percents, times) - no one will bother. Even if you will show a drastic increase in speed - you'll only manage to grab a tiny niche if you can not run existing tasks as well as x86-solution (see Itanic vs Opteron). So any solution which has any hope of winning must include x86. It can include specialized instructions and cores (for multi-core CPUs) which can only be used by specialized software, but if it's not x86-compatible at all - it's a non-starter. </blockquote> <p> It already happened. Years ago. The Alpha RISC processor could outrun (at a given clock speed and generation of fabrication) anything else on most typical loads. The x86 architecture is, indeed, going to be around for a long time; this has been perpetuated in no small part by Microsoft and their line of operating systems that have depended so much on backward compatibility. Yes, NT was available for the Alpha, but I know of very few apps that were made available for it. DEC fumbled the marketing of the Alpha, Compaq sold it off to Intel, and Intel, having the choice of continuing its development at the expense of the (inferior in my opinion) Itanium, decided to do no further development and End-of-Life it as quickly as contractually possible. <p> Yes, we're stuck with the i86 lineage, and we will suffer performance ceilings because of it. Ultra-high performance systems will use something else, but the average application end user won't move so long as the dominant OS is bound to Intel and sufficient commercial apps aren't available on the other platforms. The Mac could as easily move from the Intel chips to some other CPU in the future -- Apple has enough foresight to provide for fat apps (OS/2 did this, too). Linux is available on just about everything with a register, a stack, and more than 1MB of RAM. For the most part, Open Software can run on anything. <p> But for most of you, yes, the already hopelessly outmatched x86 line is the chain around your ankle for the next decade. Wed, 03 Oct 2007 23:02:23 +0000 Of course we will https://lwn.net/Articles/252932/ https://lwn.net/Articles/252932/ khim <p>Big programs are developed in 10-15 <b>years</b>, so we can be pretty sure programs developed today and even yesterday will be used on 32-CPU cores. Sure, some parts can (and will) be replaced, but a lot of code <b>and binaries</b> will be reused. 
A lot of distributions <b>still</b> include GTK+ 1.2.x because not all applications are rewritten yet - and GTK+ 2.0 is over five years old. And it's just an <b>upgrade</b> (not even a replacement) of <b>one</b> library (not a change of language).</p> <p>We will have 32-core CPUs in five years - do you really believe that someone will abandon billions of lines of existing code by then?</p> Wed, 03 Oct 2007 20:24:20 +0000 Should disclose: x86 specific https://lwn.net/Articles/252833/ https://lwn.net/Articles/252833/ BenHutchings All SMP systems (well, all of them that run Linux) implement some form of cache coherency because without that, synchronisation requires an expensive cache flush. The potential for inconsistency comes mainly from reordering in load and store buffers.<br> <p> x86 and x86-64 actually aren't sequentially-consistent, because this would result in a huge performance hit. They implement "processor consistency" which means loads can pass stores but no other reordering is allowed (except for some special instructions). Or to put it another way, loads have an acquire barrier and stores have a release barrier. Implementations can issue loads to the bus out of order, but will invalidate early loads if necessary to achieve the same effect as if all loads were done in order.<br> <p> Explicit memory barrier instructions may be necessary or useful even on x86 and x86-64. But ideally programmers will use portable locking or lockless abstractions instead.<br> Wed, 03 Oct 2007 15:59:04 +0000 Good diagnosis, wrong conclusion https://lwn.net/Articles/252773/ https://lwn.net/Articles/252773/ njs <font class="QuotedText">&gt;More to the point, when 32-core PPCs are several times faster than 32-core x86es, you will naturally become more interested in how to program them reliably.</font><br> <p> But by that point, we won't be using languages with shared-everything concurrency models.<br> <p> ...Please?<br> Wed, 03 Oct 2007 07:11:02 +0000 Yawn. https://lwn.net/Articles/252771/ https://lwn.net/Articles/252771/ khim <p>We're hearing this tune for a quarter-century: in the next few years x86 will be hopelessly outmatched and replaced by the fad-of-the-day (RISC, VLIW, EPIC, etc). These predictions stubbornly fail to materialize. The fact is: unless you can show a <b>drastic</b> increase in speed (not percents, times) - no one will bother. Even if you <b>will</b> show a drastic increase in speed - you'll only manage to grab a tiny niche if you can not run existing tasks as well as x86-solution (see Itanic vs Opteron). So any solution which has any hope of winning <b>must</b> include x86. It can include specialized instructions and cores (for multi-core CPUs) which can only be used by specialized software, but if it's not x86-compatible at all - it's a non-starter.</p> <p>Of course you can win some new, specialized market (ARM did this for mobile applications), but you can not push x86 from servers and desktops. And niches tend to evaporate over time. The only big one PPC occupies today is game consoles - and few programmers interact with them...</p> Wed, 03 Oct 2007 05:47:36 +0000 Small correction https://lwn.net/Articles/252764/ https://lwn.net/Articles/252764/ k8to He knows.<br> <p> The thing about memory is that latency kills your processing throughput. If the CPU needs the data and it isn't available then the thread stalls until it becomes available. 
There's various techniques which can sometimes hide this problem, which is what this section of course is largely about.<br> <p> In the context of making memory which does require cache (the topic of the post), the speed that matters is necessarily latency.<br> Wed, 03 Oct 2007 02:36:03 +0000 Should disclose: x86 specific https://lwn.net/Articles/252762/ https://lwn.net/Articles/252762/ ncm I'm sorry, I should have pointed out this quote from the article: "All processors are supposed to see the same memory content <i>at all times.</i>". (My emphasis.) <p>I agree that coding with memory barriers (etc.!) is a big subject, and beyond the scope of this installment. It would have sufficed, though, to mention that (and where) it is a matter for concern, and why. Wed, 03 Oct 2007 01:48:57 +0000 Should disclose: x86 specific https://lwn.net/Articles/252758/ https://lwn.net/Articles/252758/ mikov I think that you may be confusing cache coherency with memory consistency. Although they are obviously related, in the context of the article the latter is not important. <br> <p> To the best of my knowledge, the description in the article applies to all cache coherent systems, including the ones listed in your previous post. It has nothing to do with memory consistency, which is an issue mostly internal to the CPU.<br> <p> I am very possibly wrong, of course - I am not a hardware system designer - so I am glad to discuss it. Can you describe how the cache/memory behavior in an Alpha (for example; or any other weak consistency system) differs from the article ?<br> <p> Wed, 03 Oct 2007 01:14:26 +0000 Memory part 2: CPU caches https://lwn.net/Articles/252757/ https://lwn.net/Articles/252757/ ncm There are lots of differences, but extra L1 cache pressure is an important one. Another is competition for memory bus bandwidth.<br> <p> Hyperthreading treats the ALUs as the scarce resource, and sacrifices cache capacity and memory bandwidth to grant more of such access. For those (much more common) workloads already limited by cache size and memory bandwidth, this seems like a really bad idea, but there are a few workloads where it's not. To cater to those workloads, the extra cost is just a bit of extra scheduling logic and a bunch of extra hidden registers.<br> <p> If it could be turned on and off automatically according to whether it helps, we wouldn't need to pay it any attention. That it can't is a problem, because we don't have any good place to put the logic to turn it on and off.<br> Wed, 03 Oct 2007 01:01:30 +0000
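Following up on the "Should disclose: x86 specific" subthread above: below is a minimal sketch of the kind of portable acquire/release abstraction BenHutchings mentions, written with C11 atomics (which postdate these comments and are used here purely for illustration; the variable names are made up). On x86/x86-64 the release store and acquire load compile to plain moves, matching the processor-consistency rules described; on more weakly ordered architectures the compiler emits the required barrier instructions.
<pre>
/* Publish data from one thread to another without a lock.
   The release store pairs with the acquire load, so a consumer
   that observes ready == 1 is guaranteed to also see payload == 42. */
#include &lt;stdatomic.h&gt;

static int payload;            /* ordinary data                 */
static atomic_int ready;       /* flag that publishes the data  */

void producer(void)
{
    payload = 42;                                            /* 1: write data */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* 2: publish    */
}

int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                      /* spin until the flag is published */
    return payload;            /* ordered after the acquire load   */
}
</pre>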