Should disclose: x86 specific
Posted Oct 2, 2007 9:42 UTC (Tue) by ncm
Parent article: Memory part 2: CPU caches
It should have been noted in the text that much of the description of multi-cache interaction is specific to x86 and similarly "sequentially consistent" architectures. Most modern architectures are not sequentially consistent, and threaded programs must be extremely careful whenever one thread depends on data written by another thread becoming visible in the order in which it was written. "Modern", in this context, includes Alpha, PPC, Itanium, and (sometimes) SPARC, but not x86 (Intel or AMD) or MIPS. The consequence of the requirement to maintain sequential consistency is poor performance and/or horrifyingly complex cache interaction machinery on machines with more than (about) four CPUs, so we can expect to see more non-x86 multi-core chips in use soon.
As an example of an unfortunate interaction, consider a simple case with only two threads. One thread fills a buffer and then sets a flag indicating its contents are ready. A second thread periodically checks the flag, and when it finds the flag set it uses the contents of the buffer. Without sequential consistency, the second thread might see old contents of the buffer because they have not yet been spilled from the cache of the processor executing the first thread.
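The two-thread handoff just described can be sketched in C++11, which postdates this comment. Everything here is an assumption for illustration: the names (`buffer`, `ready`, `produce`, `consume`, `run_handoff`) are hypothetical, and `std::atomic` with release/acquire ordering is just one way to state the ordering requirement, not something the parent article prescribes. The point is only that the flag store and the flag load are where the ordering has to be requested.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical names throughout; none of them come from the article.
constexpr std::size_t kSize = 1024;
static int buffer[kSize];
static std::atomic<bool> ready{false};

// First thread: fill the buffer, then set the flag.
void produce() {
    for (std::size_t i = 0; i < kSize; ++i) {
        buffer[i] = static_cast<int>(i);
    }
    // Release store: the buffer writes above must be visible to any
    // thread that observes ready == true through an acquire load.
    ready.store(true, std::memory_order_release);
}

// Second thread: poll the flag, then use the buffer contents.
long consume() {
    // Acquire load: pairs with the release store in produce().
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    long sum = 0;
    for (std::size_t i = 0; i < kSize; ++i) {
        sum += buffer[i];
    }
    return sum;
}

// Runs one writer and one reader and returns what the reader saw.
long run_handoff() {
    ready.store(false, std::memory_order_relaxed);
    long sum = 0;
    std::thread reader([&sum] { sum = consume(); });
    std::thread writer(produce);
    writer.join();
    reader.join();
    assert(ready.load());  // the writer has published by now
    return sum;
}
```

Calling `run_handoff()` returns 523776, the sum of 0 through 1023, on any conforming implementation. If both operations on `ready` were weakened to `std::memory_order_relaxed`, the language would no longer promise that the reader sees the filled buffer, which is exactly the failure described above.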
To protect against memory write-order visibility problems on non-x86 machines, your threads must arrange to execute "memory barrier" instructions at appropriate points. The simplest way to do this portably is to use mutex lock and unlock operations. The locking itself may be entirely superfluous as synchronization between the threads (although you might find uses for it too), but the lock and unlock operations execute memory barrier instructions as a side effect, and those may be needed to ensure that the second thread actually sees all the bytes written by the first at the time that it looks. You must hope, too, that your compiler knows not to optimize by moving assignments across the barrier event.
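As a rough sketch of the mutex approach, here is the same kind of buffer-and-flag handoff using `std::mutex` from C++11 (an assumption; in C you would reach for `pthread_mutex_lock` and `pthread_mutex_unlock` instead). All the names (`shared_buf`, `filled`, `guard`, `fill_then_flag`, `wait_then_read`, `run_locked_handoff`) are made up for illustration. The mutex protects only the flag; the buffer is ordered by the barriers that come along with unlock and lock.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>

// Hypothetical names; this is an illustration, not anyone's real API.
constexpr std::size_t kLen = 1024;
static int shared_buf[kLen];
static bool filled = false;   // plain bool, only touched while holding guard
static std::mutex guard;

// Writer: fill the buffer first, then set the flag under the mutex.
void fill_then_flag() {
    for (std::size_t i = 0; i < kLen; ++i) {
        shared_buf[i] = 1;
    }
    std::lock_guard<std::mutex> lock(guard);
    filled = true;
}   // the unlock here publishes the buffer writes along with the flag

// Reader: check the flag under the mutex, then read the buffer.
long wait_then_read() {
    for (;;) {
        bool seen;
        {
            std::lock_guard<std::mutex> lock(guard);
            seen = filled;
        }
        if (seen) {
            break;  // our lock came after the writer's unlock
        }
        std::this_thread::yield();
    }
    long sum = 0;
    for (std::size_t i = 0; i < kLen; ++i) {
        sum += shared_buf[i];
    }
    return sum;
}

// Runs one writer and one reader and returns what the reader saw.
long run_locked_handoff() {
    {
        std::lock_guard<std::mutex> lock(guard);
        filled = false;
    }
    long sum = 0;
    std::thread reader([&sum] { sum = wait_then_read(); });
    std::thread writer(fill_then_flag);
    writer.join();
    reader.join();
    return sum;
}
```

Here `run_locked_handoff()` returns 1024. Since lock and unlock are opaque library calls with defined ordering semantics, the compiler may not move the buffer assignments past the unlock either, which takes care of the compiler half of the problem.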
Extant machines on which this sort of care is needed include Apple dual-G5 and any multi-CPU Itanium. The need to pay attention to this sort of detail in multi-threaded coding will increase when other architectures not bound by x86's sequential consistency model begin to trounce x86 performance as the number of cores that fit on a chip grows. Failure to get these details right will often mean randomly occurring unreproducible bugs. Conventional testing may never detect the bugs; they might be identified only through inspection of the code.