Memory-mapped I/O barriers

[Posted September 10, 2006 by corbet]

Paul Mackerras recently reported a subtle bug. The tg3 Ethernet driver, like many other network drivers, operates on a set of buffer descriptors stored in the host system's memory. These descriptors describe the buffers which are available for incoming network packets; when a packet arrives, the interface picks the next descriptor on the list, stuffs the data there, then tells the processor that the packet is available. The reported bug works like this: the processor makes some changes to this descriptor data structure, then does a write to a memory-mapped I/O (MMIO) register to tell the device to start I/O. The device, however, receives this MMIO write before the data written to main memory arrives at its final destination, and thus operates on old data. When this happens, correct operation is, to say the least, unlikely.

Bugs resulting from the reordering of memory operations can be some of the most subtle and difficult-to-find problems. A developer can stare at the code for hours without realizing that what is actually happening, deep down within the system's hardware, does not quite match the code as it appears to be written. The incorrect behavior can happen infrequently and be impossible to reproduce in any easy way.

The solution for this kind of problem is usually to add some sort of memory barrier in situations where the ordering of operations matters. The sort of barrier most familiar to device driver writers may well be the classic rule: MMIO writes to I/O memory hosted on a PCI bus cannot be considered to be complete until a read has been done from that memory range. So drivers often have a pattern where many registers are set with values describing an I/O operation, but a read is done before the final write which sets the "go" bit. Without that read, which functions as a sort of MMIO barrier, the device could take off using older values and make a mess of things.

The tg3 bug illustrates a slightly different sort of problem, however: there is no guaranteed ordering between writes to regular memory and writes to a memory-mapped I/O range. So Paul's question was: should an MMIO write be redefined to be strictly ordered with respect to preceding writes to regular memory? On a number of architectures (including the i386), the hardware orders things nicely now, but on others (Paul is working with PowerPC64), there are no such guarantees. Redefining the MMIO write operations (iowrite32(), writel(), etc.) to add the necessary barriers on the relevant architectures could make a number of potential bugs go away.

Linus didn't like the idea, stating that it was too expensive. Memory barriers can stall the processor for long periods of time, so it is nice to leave them out when they are not truly needed. So, Linus says, the preferred approach is to require the programmer to put in an explicit barrier operation when one is needed.

There are some problems with this approach, however. One of those is that the kernel does not currently implement a barrier designed to force ordering between regular and MMIO memory operations. There is mmiowb(), but its real purpose is to enforce ordering between MMIO operations only. So Linus mentioned the possibility of creating new barriers with names like mem_to_io_barrier() to bring about the desired ordering in this situation.

Alternatively, the MMIO operations could be redefined to contain a barrier before the MMIO access happens. That would fix the tg3 bug without adding any extra cost, but it would come at the cost of removing the barrier that is currently placed after the operation. This is the solution that Paul favors:

I suspect the best thing at this point is to move the sync in writeX() before the store, as you suggest, and add an "eieio" before the load in readX(). That does mean that we are then relying on driver writers putting in the mmiowb() between a writeX() and a spin_unlock, but at least that is documented.

This approach brought out a different objection from David Miller (and others), however:

Driver authors will not get these memory barriers right, you can say they will because it will be "documented" but that does not change reality which is that driver folks will get simple interfaces right but these memory barriers are relatively advanced concepts, which they thus will get wrong half the time

David would rather see things work correctly in the simple scenario, even if the run-time expense is higher. As others have mentioned, one can always implement no-barrier versions of the MMIO primitives for performance-minded developers who (think they) know what they are doing.

The case mentioned by Paul above - putting in a call to mmiowb() between the last MMIO write operation and a spin_unlock() call - would be the biggest concern. Spinlocks are used to keep multiple processors (or, in a preemptive scenario, multiple processes on a single processor) from mixing up operations to the same device. But a spinlock lives in regular memory, so it is possible that the unlock operation could succeed (allowing another process to access the MMIO region) before the previous process's MMIO writes complete. That is why mmiowb() is called for - but it does look like the sort of thing that driver authors will have a hard time remembering.

An alternative suggested by Alan Cox is the creation of a new pair of spinlock operations: spin_lock_io() and spin_unlock_io(). They would be explicitly defined to protect operations on MMIO regions, and would contain the requisite barriers. If device drivers could be trained to use these locking operations (and driver writers often can be trained - just feed them beer when they do something right), they would not have to remember to insert barriers.

There's a couple of problems here too, however. There are already a number of variations on the spin_lock() operation; adding another option will expand the number of locking calls considerably. Code which calls functions while holding locks must already be aware of the called functions' locking needs, and that awareness will be made more complicated as well. So Linus would much rather avoid this approach and just require the use of explicit barriers.

Yet another approach - the one which might just be adopted in the end - is to redefine and expand the set of MMIO accessor functions. In this scenario, as described by Benjamin Herrenschmidt, the existing functions (writel(), etc.) would be made fully ordered - even though that might well slow them down some. All drivers using those functions would continue to work - and some might have rare, subtle bugs fixed in the process.

For most drivers, the above functions will be adequate - memory barriers around MMIO operations will not materially affect performance most of the time. There are exceptions, however. For situations where the barriers are unnecessary and hurtful, a new set of accessors with names like __writel() or __iowrite32() would be defined. These functions would ensure that MMIO operations are seen by the peripheral device in the order issued by the processor, but no other guarantees would be made. When these primitives are used, the programmer is responsible for inserting barriers in cases where ordering between MMIO and regular memory operations is important.

Finally, for developers who truly want to live on the edge, a set of functions with names like __raw_writel() has been proposed. These accessors would provide no ordering guarantees at all and would not concern themselves with issues like byte swapping. They are one small step above issuing I/O operations directly in assembly. Benjamin's proposal also brings back the idea of creating a new set of memory barriers for specific situations. Thus, io_to_io_barrier() would ensure ordering between MMIO operations; it would be useful in conjunction with the "raw" operations described above. Other barriers would deal with ordering between MMIO and regular memory operations in various ways; see Benjamin's post for the full set.

There have been a number of suggestions for changes to this proposal, but no real opposition to the general idea. So, in the end, that may be just how it works out - though expect this discussion to return in the future. When the topic is one of the trickiest areas of kernel programming on contemporary hardware, easy and final solutions will likely be hard to come by.

Index entries for this article
Kernel	Memory barriers
Kernel	mmiowb()
Kernel	Race conditions