write(), thread safety, and POSIX
[Posted April 18, 2006 by corbet]
Dan Bonachea recently
reported a problem.
It seems that he has a program where multiple threads are simultaneously
writing to the same file descriptor. Occasionally, some of that output
disappears - overwritten by other threads. Random loss of output data is
not generally considered to be a desirable sort of behavior, and,
says Dan, POSIX requires that
write()
calls be thread-safe. So he would like to see this behavior fixed.
Andrew Morton quickly pointed out the
source of this behavior. Consider how write() is currently
implemented:
asmlinkage ssize_t sys_write(unsigned int fd, const char __user *buf,
size_t count)
{
struct file *file;
ssize_t ret = -EBADF;
int fput_needed;
file = fget_light(fd, &fput_needed);
if (file) {
loff_t pos = file_pos_read(file);
ret = vfs_write(file, buf, count, &pos);
file_pos_write(file, pos);
fput_light(file, fput_needed);
}
return ret;
}
There is no locking around this function, so it is possible for two (or
more) threads performing simultaneous writes to obtain the same value for
pos. They will each then write their data to the same file
position, and the thread which writes last wins.
Putting some sort of lock (using the inode lock, perhaps) around the entire
function would solve the problem and make write() calls
thread-safe. The cost of this solution would be high, however: an extra
layer of locking when almost no application actually needs it. Serializing
write() operations in this way would also rule out simultaneous
writes to the same file - a capability which can be useful to some
applications.
So some developers have questioned whether this behavior should be fixed at
all. It is not something which causes problems for over 99.9% of applications,
and, for those which need to be able to perform this sort of simultaneous
write, there are other options available. These include user-space locking
or using the O_APPEND option. So, it is asked, why add
unnecessary overhead to the kernel?
Linus responds that it is a "quality of implementation" issue, and that if
there is a low-cost way of getting the system to behave the way users would
like, it might as well be done. His proposal is to apply a lock to the file
position in particular. His patch adds a f_pos_lock mutex to the
file structure and uses that lock to serialize uses of and changes
to the file position. This change will have the effect of serializing
calls to write(), while leaving other forms (asynchronous I/O,
pwrite()) unserialized.
The patch has not drawn a lot of comments, and it has not been merged as of
this writing. Its ultimate fate will probably depend on whether avoiding
races in this obscure case is truly seen to be worth the additional cost
imposed on all users.
(
Log in to post comments)