
I/O space write barriers

Some platforms, it seems, have an interesting property: writes to I/O memory space from multiple processors may be reordered before reaching the device. Even when the device registers are protected by a lock (pretty much necessary to keep multiple processors from writing simultaneously and confusing the device), writes issued by one CPU can arrive at the device before those from another CPU which held the lock, and issued its writes, first. The Itanium architecture in particular behaves this way, though others may as well.

The answer, according to Jesse Barnes, is the addition of a new type of memory barrier to force the ordering of writes to the device. Jesse's patch adds a new function, mmiowb(), which implements this barrier. He has also updated the qla1280 driver to make use of it.

Authors of PCI drivers are accustomed to coding a different sort of barrier: reading from a device register to ensure that all writes have actually been posted to the device. mmiowb() is a different, lighter-weight mechanism. After a call to mmiowb(), writes might still not have reached the device. Writes are not forced out; they just have their ordering with respect to subsequent writes guaranteed. In many situations, that sort of guarantee is all that is needed.
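
As a rough illustration of where such a barrier would sit, here is a minimal user-space sketch in C. It is not taken from Jesse's patch or from the qla1280 driver: struct fake_dev, dev_lock(), dev_unlock(), writel_stub(), mmiowb_stub(), and queue_command() are all hypothetical stand-ins, and the "device" is just an array in memory. The assumption being illustrated is that the barrier goes after the last register write and before the lock is released, so that the next lock holder's writes cannot pass these ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device with a few memory-mapped registers. */
struct fake_dev {
    uint32_t regs[4];   /* pretend register file */
    int lock_held;      /* models the spinlock protecting the registers */
    int barriers;       /* number of write barriers issued so far */
};

/* Stand-ins for taking and releasing the register lock. */
static void dev_lock(struct fake_dev *d)   { d->lock_held = 1; }
static void dev_unlock(struct fake_dev *d) { d->lock_held = 0; }

/* Stand-in for a posted write to a device register. */
static void writel_stub(uint32_t val, struct fake_dev *d, size_t reg)
{
    assert(d->lock_held);   /* register writes happen under the lock */
    d->regs[reg] = val;
}

/*
 * Stand-in for the new barrier. The real mmiowb() would order the
 * preceding I/O writes against later ones; this stub only counts calls
 * and checks that it is invoked while the lock is still held.
 */
static void mmiowb_stub(struct fake_dev *d)
{
    assert(d->lock_held);
    d->barriers++;
}

/* Hypothetical command submission path in a driver. */
static void queue_command(struct fake_dev *d, uint32_t addr, uint32_t len)
{
    dev_lock(d);
    writel_stub(addr, d, 0);    /* command address */
    writel_stub(len, d, 1);     /* command length */
    writel_stub(1, d, 2);       /* ring the doorbell */
    mmiowb_stub(d);             /* order the writes above before unlocking */
    dev_unlock(d);
}
```

A driver which truly needs its writes to have reached the device (before freeing a buffer, say) would still fall back on the traditional read from a device register; the barrier only constrains ordering.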



I/O space write barriers

Posted Sep 23, 2004 17:35 UTC (Thu) by jsbarnes (guest, #4096)

> The Itanium architecture in particular behaves this way, though others
> may as well.

According to Grant Grundler, ia64 architectures other than sn2 don't have
this behavior. However, SGI Challenge, Origin, and Altix machines do
exhibit posted write reordering from different CPUs, and I suspect some
other large NUMA machines do as well. For those, it's faster to simply
issue a write barrier than it is to do a full I/O read from the target
bus. See http://www.finux.org/Reprints/Reprint-Bryant-OLS2004.pdf, under
"I/O changes for Altix" for details and performance numbers (this is a
paper I co-authored with some other SGI engineers to describe changes we
made to Linux for the Altix platform), in particular the section on
'Ordering posted writes efficiently'. Quick highlights:

regular PIO read    5940 ns
relaxed PIO read    2619 ns (also described in the paper)
sn_mmiob()          1610 ns

(Note that sn_mmiob() was the sn2 specific implementation of mmiowb() and
is renamed in the patch I posted.)

And the above measurements were just taken from a small system. On a
large system, with a PIO read from a distant device, I'd expect the new
barrier to shave off far more than 1000 ns from the time.

Jesse

Copyright © 2004, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds