
O_EXCL over NFS: Don't!!! (Repost in HTML)

Posted Sep 21, 2007 20:48 UTC (Fri) by AnswerGuy (guest, #1256)
Parent article: Exploiting symlinks and tmpfiles

While the original article did not cover this particular topic, I'd like to remind everyone that the desired semantics of open(..., O_EXCL|O_CREAT) are NOT supported over NFS (at least as late as version 3 of that protocol).

Quoting from the open(2) man page:

   O_EXCL When used with O_CREAT, if the file already  exists  it  is  an
	      error  and the open will fail. In this context, a symbolic link
	      exists, regardless of where its points to.  O_EXCL is broken on
	      NFS  file	 systems,  programs  which  rely on it for performing
	      locking tasks will contain a race condition.  The solution  for
	      performing  atomic file locking using a lockfile is to create a
	      unique file on the same fs (e.g.,	 incorporating	hostname  and
	      pid),  use  link(2)  to  make a link to the lockfile. If link()
	      returns 0, the lock is successful.  Otherwise, use  stat(2)  on
	      the  unique file to check if its link count has increased to 2,
	      in which case the lock is also successful.

A portion of the link(2) man page explains why an fstat() is necessary if the link() fails. Basically, there are conditions where the NFS server's RPC (success) response could fail to reach the client even after the link was created. The subsequent fstat() on the originally opened file descriptor can detect cases where the link() erroneously returned an error.
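The recipe quoted from open(2) can be sketched roughly as follows. This is a minimal illustration of the non-hostile case only, not a vetted implementation: the function name link_lock() and the "lockpath.hostname.pid" naming scheme are my own assumptions, and it follows the man page in using stat() on the unique file name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch of the lockfile recipe from open(2); link_lock() is a
 * hypothetical name. Returns 0 if the lock was taken, -1 otherwise.
 * Release the lock later with unlink(lockpath).
 */
int link_lock(const char *lockpath)
{
    char host[256], uniq[4096];
    struct stat st;
    int fd, locked;

    assert(lockpath != NULL);

    /* Build a unique name next to the lockfile (same filesystem),
     * incorporating hostname and pid as the man page suggests. */
    if (gethostname(host, sizeof host) != 0)
        return -1;
    host[sizeof host - 1] = '\0';
    if ((size_t)snprintf(uniq, sizeof uniq, "%s.%s.%ld",
                         lockpath, host, (long)getpid()) >= sizeof uniq)
        return -1;

    /* Create the unique file. O_EXCL is not relied upon here;
     * the name is meant to be unique by construction. */
    fd = open(uniq, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    close(fd);

    /* link() returning 0 means we hold the lock. */
    locked = (link(uniq, lockpath) == 0);

    /* If link() reported failure, the reply may simply have been lost:
     * a link count of 2 on the unique file means the link was made. */
    if (!locked && stat(uniq, &st) == 0 && st.st_nlink == 2)
        locked = 1;

    /* The unique name is no longer needed either way. */
    unlink(uniq);
    return locked ? 0 : -1;
}
```

A caller would invoke link_lock("/some/nfs/dir/app.lock") (a made-up path) and treat a return of 0 as holding the lock. Removing the unique name at the end also drops the lockfile's link count back to 1, so a later attempt from the same host and pid cannot mistake an old link for a new one.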

I should add that the use of stat() would be sufficient for cases where one is concerned about inadvertent race conditions, but I think that fstat() is required for situations where one must defend against potentially hostile processes with write access to the directory in which all this locking is taking place. In other words, the advice in the man page only covers the non-hostile case (suitable for non-SUID/non-SGID use in a directory which allows neither group nor world write access).

In cases where security is a consideration I think we have to unconditionally perform an fstat() on the originally opened file descriptor. Otherwise we are vulnerable to an unlink() and re-creation race. A stat() checks whatever file/inode is on the underlying filesystem under that name at the time the call is performed, so the link (directory entry) that's present is resolved to an inode during that call. An fstat() checks the inode that was originally opened (synchronizing the vnode to the underlying inode). In the case where an unlink() was slipped in between the open() and the link(), the name points at a new and different inode. (The original inode may be, at that point, anonymous, in which case the target of our successfully called link() is also pointing at the new, now compromised, inode.)
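To make the distinction concrete, here is a rough sketch of the fstat() variant. It keeps the descriptor from the original open(), ignores what link() reports, and takes its verdict from fstat() on that descriptor. The function name link_lock_checked() is hypothetical, and the extra lstat() comparison of device and inode numbers is my own addition rather than anything from the man page; I make no claim that this is sufficient against a determined attacker.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Sketch only; link_lock_checked() is a hypothetical name.
 * uniqpath must be a caller-chosen unique name in the same directory
 * as lockpath. Returns 0 if the lock was taken, -1 otherwise.
 */
int link_lock_checked(const char *lockpath, const char *uniqpath)
{
    struct stat fd_st, name_st;
    int fd, ok = 0;

    assert(lockpath != NULL && uniqpath != NULL);

    /* Keep this descriptor open: it pins the inode we created. */
    fd = open(uniqpath, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* The return value is deliberately ignored; over NFS it can be a
     * false failure, and with a hostile process it can be a false
     * success (the name may have been swapped underneath us). */
    (void)link(uniqpath, lockpath);

    /* fstat() examines the inode we opened, not whatever the name
     * currently resolves to. If the file was unlinked and recreated,
     * our inode will not have a link count of 2. */
    if (fstat(fd, &fd_st) == 0 && fd_st.st_nlink == 2 &&
        /* Assumed extra check: the lock name must be our inode. */
        lstat(lockpath, &name_st) == 0 &&
        name_st.st_dev == fd_st.st_dev &&
        name_st.st_ino == fd_st.st_ino)
        ok = 1;

    unlink(uniqpath);
    close(fd);
    return ok ? 0 : -1;
}
```

If another process unlinks and recreates the unique file between the open() and the link(), the link() call may well succeed, but the inode behind fd is left with a link count of 0, so the function reports failure instead of trusting the substituted file.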

Of course I'm just speculating here ... reasoning things out from my understanding. I'm not an expert in secure programming and I can't cite any canonical sources.

I have personally seen that open(..., O_EXCL ...) is NOT supported on NFS. So that's not speculation. I've read hearsay that it's supposed to work under NFSv3 ... but I haven't seen a convincing, credible statement to that effect. I don't know if it's "intended" to be supported and if there are buggy NFSv3 implementations that fail to achieve this. In short, I would recommend a more conservative approach for the foreseeable future.

I will forward this comment along to David Wheeler and suggest that he review it and consider adding anything he considers worthy and appropriate to his HOWTO on the topic ... and I would welcome any comments from others with deeper expertise. I'd be particularly interested in pointers to any stress testing harness which could be deployed to a few hundred clients to beat the tar out of any code which is supposed to be doing such things correctly. (My first test case would be the venerable old lockfile utility that ships with the procmail package. My next one would be an internally used utility that my employers are trying to fix as I write this.)

[Of course I realize the essential futility of trying to prove that a given piece of code doesn't have any race conditions. You can never be sure of that via any form of black-box testing. However, I do want to be able to definitively demonstrate, in a reproducible fashion, when a program is failing to be race-free. I've proposed a crude design for such a harness: it makes a "contest" composed of processes which each create qmail "lock free" styled results files, then busy-wait on a starting sentinel (which I call the starting gun and implement as touch $LOCKDIR/BANG). Then they all contend for the lock; all the losing contestants post their results to their private files, renaming those to *.done and exiting. The winner waits, holding the lock, until all the other contestants are "done" and then tallies up the results, searching for any other processes which also claim to be "winners." There's some additional timeout handling. Any case where there appears to be more than one "winning contestant" means that the locking semantics being tested are definitely broken. Cases with a single winner are inconclusive (an underlying race condition could simply have been missed, as is always the case with races). Timeouts resulting from "losing contestants" who fail to complete are indications of unreliability among the client systems, the networking infrastructure, or the filers, but they say nothing about the locking semantics under test. Anyone who is interested in more details of my proposed test harness is welcome to contact me (I'll monitor this thread), and anyone who sees potential flaws or can suggest code which has already robustly implemented something like this is especially encouraged to do so.]
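For illustration, a much-reduced, single-host version of such a contest might look like the sketch below. It is not the proposed harness: it replaces the per-contestant results files and *.done renames with child exit statuses, which only works when every contestant is a child of one parent on one machine. The names run_contest(), excl_lock(), broken_lock() and the 10-second timeout are all assumptions of mine.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* A lock attempt under test: returns 0 if the caller won the lock. */
typedef int (*try_lock_fn)(const char *lockpath);

/* Candidate under test in this sketch: plain O_CREAT|O_EXCL. */
int excl_lock(const char *lockpath)
{
    int fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;
    close(fd);
    return 0;
}

/* Deliberately broken "lock" that always claims success, used to
 * confirm that the harness really can report several winners. */
int broken_lock(const char *lockpath)
{
    (void)lockpath;
    return 0;
}

/* Run one contest with n contestants in dir. Returns the number of
 * contestants that claimed the lock, or -1 on a setup failure. */
int run_contest(const char *dir, int n, try_lock_fn try_lock)
{
    char bang[4096], lock[4096];
    int i, status, started = 0, winners = 0;
    pid_t pid;

    assert(dir != NULL && n > 0 && try_lock != NULL);
    if ((size_t)snprintf(bang, sizeof bang, "%s/BANG", dir) >= sizeof bang ||
        (size_t)snprintf(lock, sizeof lock, "%s/LOCK", dir) >= sizeof lock)
        return -1;
    unlink(bang);
    unlink(lock);

    for (i = 0; i < n; i++) {
        pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* Contestant: busy-wait for the starting gun, with a
             * timeout so a failed start cannot hang forever. */
            time_t deadline = time(NULL) + 10;
            while (access(bang, F_OK) != 0)
                if (time(NULL) > deadline)
                    _exit(2);                 /* timed out: no verdict */
            _exit(try_lock(lock) == 0 ? 0 : 1); /* 0 = winner, 1 = loser */
        }
        started++;
    }

    /* Fire the starting gun: the equivalent of touch $LOCKDIR/BANG. */
    close(open(bang, O_WRONLY | O_CREAT, 0644));

    /* Tally the results from the contestants' exit statuses. */
    for (i = 0; i < started; i++)
        if (wait(&status) > 0 && WIFEXITED(status) &&
            WEXITSTATUS(status) == 0)
            winners++;

    unlink(bang);
    unlink(lock);
    return started == n ? winners : -1;
}
```

A result greater than 1 shows that the lock under test is broken; a result of exactly 1 is inconclusive, for the reasons given above. A harness for hundreds of NFS clients would still need the file-based result reporting, since exit statuses do not cross machines.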

Jim "The AnswerGuy" Dennis
My apologies for posting this twice; given the complexity of commentary, I'd intended for it to be posted in HTML for easier reading. I hope John or someone on the LWN team will delete the earlier copy of this



O_EXCL over NFS: Don't!!! (Repost in HTML)

Posted Sep 27, 2007 22:39 UTC (Thu) by cras (guest, #7000) [Link]

"I've read hearsay that it's supposed to work under NFSv3 ... but I haven't seen a convincing, credible statement to that effect."

My NFS tester shows that it at least appears to work with Linux, Solaris and FreeBSD: http://www.dovecot.org/list/dovecot/2007-July/024102.html. Looking at the Linux 2.6 sources, it doesn't look like the kernel tries to implement a racy O_EXCL check on the client side (fs/nfs/nfs3proc.c, nfs3_proc_create()), so the test's results should be correct. I don't know if other OSes do that. I guess it would be nice to have a better O_EXCL tester which tries to catch race conditions.

