Paul Mackerras recently reported
bug. The tg3 Ethernet driver, like many other network drivers, operates on
a set of buffer descriptors stored in the host system's memory. These
descriptors describe the buffers which are available for incoming network
packets; when a packet arrives, the interface picks the next descriptor on
the list, stuffs the data there, then tells the processor that the packet
is available. The reported bug works like this: the processor makes some
changes to this descriptor data structure, then does a write to a
memory-mapped I/O (MMIO) register to tell the device to start I/O. The
device, however, receives this MMIO write before the data written to main
memory arrives at its final destination, and thus operates on old data.
When this happens, correct operation is, to say the least, unlikely.
Bugs resulting from the reordering of memory operations can be some of the
most subtle and difficult-to-find problems. A developer can stare at the
code for hours without realizing that what is actually happening, deep down
within the system's hardware, does not quite match the code as it appears
to be written. The incorrect behavior can happen infrequently and be
impossible to reproduce in any easy way.
The solution for this kind of problem is usually to add some sort of
memory barrier in situations where the ordering of operations matters. The
sort of barrier most familiar to device driver writers may well be the
classic rule: MMIO writes to I/O memory hosted on a PCI bus cannot be
considered to be complete until a read has been done from that memory
range. So drivers often have a pattern where many registers are set with
values describing an I/O operation, but a read is done before the final
write which sets the "go" bit. Without that read, which functions as a
sort of MMIO barrier, the device could take
off using older values and make a mess of things.
The tg3 bug illustrates a slightly different sort of problem, however:
there is no guaranteed ordering between writes to regular memory and writes
to a memory-mapped I/O range. So Paul's question was: should an MMIO write
be redefined to be strictly ordered with respect to preceding writes to
regular memory? On a number of architectures (including the i386), the
hardware orders things nicely now, but on others (Paul is working with
PowerPC64), there are no such guarantees. Redefining the MMIO write
operations (iowrite32(), writel(), etc.) to add the
necessary barriers on the relevant architectures could make a number of
potential bugs go away.
Linus didn't like the idea, stating that it
was too expensive. Memory barriers can stall the processor for long
periods of time, so it is nice to leave them out when they are not truly
needed. So, Linus says, the preferred approach is to require the
programmer to put in an explicit barrier operation when one is needed.
There are some problems with this approach, however. One of those is that
the kernel does not currently implement a barrier designed to force
ordering between regular and MMIO memory operations. There is
mmiowb(), but its real purpose is to enforce ordering between MMIO
operations only. So Linus mentioned the possibility of creating new
barriers with names like mem_to_io_barrier() to bring about the
desired ordering in this situation.
Alternatively, the MMIO operations could be redefined to contain a barrier
before the MMIO access happens. That would fix the tg3 bug without adding
any extra cost, but it would come at the cost of removing the barrier that
is currently placed after the operation. This is the solution that
I suspect the best thing at this point is to move the sync in
writeX() before the store, as you suggest, and add an "eieio"
before the load in readX(). That does mean that we are then
relying on driver writers putting in the mmiowb() between a
writeX() and a spin_unlock, but at least that is documented.
This approach brought out a different
objection from David Miller (and others), however:
Driver authors will not get these memory barriers right, you can
say they will because it will be "documented" but that does not
change reality which is that driver folks will get simple
interfaces right but these memory barriers are relatively advanced
concepts, which they thus will get wrong half the time
David would rather see things work correctly in the simple scenario, even
if the run-time expense is higher. As others have mentioned, one can
always implement no-barrier versions of the MMIO primitives for
performance-minded developers who (think they) know what they are doing.
The case mentioned by Paul above - putting in a call to mmiowb()
between the last MMIO write operation and a spin_unlock() call -
would be the biggest concern. Spinlocks are used to keep multiple
processors (or, in a preemptive scenario, multiple processes on a single
processor) from mixing up operations to the same device. But a spinlock
lives in regular memory, so it is possible that the unlock operation could
succeed (allowing another process to access the MMIO region) before the
previous process's MMIO writes complete. That is why mmiowb() is
called for - but it does look like the sort of thing that driver authors
will have a hard time remembering.
An alternative suggested by Alan Cox is the
creation of a new pair of spinlock operations: spin_lock_io() and
spin_unlock_io(). They would be explicitly defined to protect
operations on MMIO regions, and would contain the requisite barriers. If
device drivers could be trained to use these locking operations (and driver
writers often can be trained - just feed them beer when they do something
right), they would not have to remember to insert barriers.
There's a couple of problems here too, however. There are already a number
of variations on the spin_lock() operation; adding another option
will expand the number of locking calls considerably. Code which calls
functions while holding locks must already be aware of the called
functions' locking needs, and that awareness will be made more complicated
as well. So Linus would much rather avoid this
approach and just require the use of explicit barriers.
Yet another approach - the one which might just be adopted in the end - is
to redefine and expand the set of MMIO accessor functions. In this
scenario, as described by
Benjamin Herrenschmidt, the existing functions (writel(), etc.)
would be made fully ordered - even though that might well slow them down
some. All drivers using those functions would continue to work - and some
might have rare, subtle bugs fixed in the process.
For most drivers, the above functions will be adequate - memory barriers
around MMIO operations will not materially affect performance most of the
time. There are exceptions, however. For situations where the barriers
are unnecessary and hurtful, a new set of accessors with names like
__writel() or __iowrite32() would be defined. These
functions would ensure that MMIO operations are seen by the peripheral
device in the order issued by the processor, but no other guarantees would
be made. When these primitives are used, the programmer is responsible for
inserting barriers in cases where ordering between MMIO and regular memory
operations is important.
Finally, for developers who truly want to live on the edge, a set of
functions with names like __raw_writel() has been proposed. These
accessors would provide no ordering guarantees at all and would not concern
themselves with issues like byte swapping. They are one small step above
issuing I/O operations directly in assembly.
Benjamin's proposal also brings back the idea of creating a new set of
memory barriers for specific situations. Thus, io_to_io_barrier()
would ensure ordering between MMIO operations; it would be useful in
conjunction with the "raw" operations described above. Other barriers
would deal with ordering between MMIO and regular memory operations in
various ways; see Benjamin's post for the full set.
There have been a number of suggestions for changes to this proposal, but
no real opposition to the general idea. So, in the end, that may be just
how it works out - though expect this discussion to return in the future.
When the topic is one of the trickiest areas of kernel programming on
contemporary hardware, easy and final solutions will likely be hard to come
to post comments)