I guess it would depend on the cache protocol too, wouldn't it?
In a MESI system, you would end up bouncing lines (S => M on the first writer and S => I on the others, followed by M => S and I => S once the others read the line back). In an ESI system (write-through caches with no notion of "modified"), you'd get something similar.
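In a MOESI system such as the Athlon's, I believe you reduce the cost rather than eliminate it. As far as I know, the first write still invalidates the other sharers (S => M on the writer, S => I on the rest), but when another CPU reads the line back, the writer goes M => O and supplies the dirty data cache-to-cache, so nothing has to be written back to memory first.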
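To make the comparison concrete, here is a rough toy model of a single cache line under textbook MESI and under an invalidate-based MOESI (the flavor AMD documents, as far as I know, not an update-broadcast variant). It only tracks states and counts events; it says nothing about real latencies. The names `write`, `read` and `ping_pong` are made up for this sketch, and the assumption that MESI writes a dirty line back to memory on a remote read is the textbook behavior, which real implementations may avoid.

```python
from collections import Counter


def write(states, cpu, protocol, stats):
    """CPU `cpu` writes the line: every other valid copy is invalidated."""
    for other, state in enumerate(states):
        if other == cpu or state == "I":
            continue
        # Assumption: textbook MESI flushes a remote dirty copy to memory,
        # while MOESI hands the dirty data over cache-to-cache.
        if state == "M" and protocol == "MESI":
            stats["writebacks"] += 1
        states[other] = "I"
        stats["invalidations"] += 1
    states[cpu] = "M"


def read(states, cpu, protocol, stats):
    """CPU `cpu` reads the line; only a miss (state I) changes anything."""
    if states[cpu] != "I":
        return
    stats["read_misses"] += 1
    shared = False
    for other, state in enumerate(states):
        if other == cpu or state == "I":
            continue
        shared = True
        if state == "M":
            if protocol == "MOESI":
                states[other] = "O"  # keeps the dirty data, stays responsible for it
            else:
                states[other] = "S"  # MESI: memory must be updated first
                stats["writebacks"] += 1
        elif state == "E":
            states[other] = "S"
    states[cpu] = "S" if shared else "E"


def ping_pong(protocol, rounds=3, ncpus=2):
    """CPU 0 writes, every other CPU reads the line back, repeated `rounds` times."""
    states = ["S"] * ncpus
    stats = Counter()
    for _ in range(rounds):
        write(states, 0, protocol, stats)
        for cpu in range(1, ncpus):
            read(states, cpu, protocol, stats)
    return states, stats


if __name__ == "__main__":
    for protocol in ("MESI", "MOESI"):
        states, stats = ping_pong(protocol)
        print(protocol, states, dict(stats))
```

In this model both protocols see the same number of invalidations and read misses, so the line still bounces either way; the difference is that the MOESI run never counts a memory write-back. How much that is worth on real hardware is exactly the part that has to be measured.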
From that, I'd say it's rather important to measure on multiple architectures, since the tradeoffs will vary.