O_EXCL over NFS: Don't!!! (Repost in HTML)
Posted Sep 21, 2007 20:48 UTC (Fri) by AnswerGuy (guest, #1256)Parent article: Exploiting symlinks and tmpfiles
While the original article did not cover this particular topic, I'd like to remind everyone that the desired semantics of open(..., O_EXCL|O_CREAT) are NOT supported over NFS (at least through version 3 of the protocol).
Quoting from the open(2) man page:
O_EXCL When used with O_CREAT, if the file already exists it is an error and the open will fail. In this context, a symbolic link exists, regardless of where it points to. O_EXCL is broken on NFS file systems; programs which rely on it for performing locking tasks will contain a race condition. The solution for performing atomic file locking using a lockfile is to create a unique file on the same fs (e.g., incorporating hostname and pid), use link(2) to make a link to the lockfile. If link() returns 0, the lock is successful. Otherwise, use stat(2) on the unique file to check if its link count has increased to 2, in which case the lock is also successful.
A portion of the link(2) man page provides details on why the stat() is necessary if the link() fails. (Basically, there are conditions where the NFS server's RPC (success) response can fail to reach the client even after the link was created. A subsequent stat() on the unique file can detect cases where link() erroneously returned an error.)
I should add that the use of stat() would be sufficient for cases where one is concerned about inadvertent race conditions --- but I think that fstat() is required for situations where one must defend against potentially hostile processes with write access to the directory in which all this locking is taking place. In other words, the advice in the man page only covers the non-hostile case (suitable for non-SUID/non-SGID use in a directory which allows neither group nor world write access).
In cases where security is a consideration, I think we have to unconditionally perform an fstat() on the originally opened file descriptor. Otherwise we are vulnerable to an unlink() and recreation race. A stat() checks whichever file/inode the name resolves to on the underlying filesystem at the time the call is performed; the link that's present is resolved to an inode during that call. An fstat() checks against the inode that was originally opened (synchronizing the vnode with the underlying inode). If an unlink() and re-creation were slipped in after the open(), then the name points at a new and different inode. (The original inode may be, at that point, anonymous; in which case the target of our successfully called link() is also pointing at the new, now compromised, inode.)
Of course I'm just speculating here ... reasoning things out from my understanding. I'm not an expert in secure programming and I can't cite any canonical sources.
I have personally seen that open(...,O_EXCL...) is NOT supported on NFS. So that's not speculation. I've read hearsay that it's supposed to work under NFSv3 ... but I haven't seen a convincing, credible statement to that effect. I don't know if it's "intended" to be supported and whether there are buggy NFSv3 implementations that fail to achieve this. In short, I would recommend a more conservative approach for the foreseeable future.
I will forward this comment along to David Wheeler and suggest that he review it and consider adding anything he considers worthy and appropriate to his HOWTO on the topic ... and I would welcome any comments from others with deeper expertise. I'd be particularly interested in pointers to any stress testing harness which could be deployed to a few hundred clients to beat the tar out of any code which is supposed to be doing such things correctly. (My first test case would be the venerable old lockfile utility that ships with the procmail package. My next one would be an internally used utility that my employers are trying to fix as I write this.)
[Of course I realize the essential futility of trying to prove that a given piece of code doesn't have any race conditions. You can never be sure of that via any form of black-box testing. However, I do want to be able to definitively demonstrate, in a reproducible fashion, when a program is failing to be race-free. I've proposed a crude design for such a harness; it runs a "contest" comprised of processes which each create qmail-style "lock free" results files, then busy-wait on a starting sentinel (which I call the starting gun and implement as touch $LOCKDIR/BANG). Then they all contend for the lock; all the losing contestants post their results to their private files, renaming those to *.done and exiting. The winner waits, holding the lock, until all the other contestants are "done" and then tallies up the results, searching for any other processes which claim to also be "winners." There's some additional timeout handling. Any case where there appears to be more than one "winning contestant" means that the locking semantics being tested are definitely broken. Cases with a single winner are inconclusive (an underlying race condition could simply have been missed, as is always the case with races). Timeouts resulting from "losing contestants" who fail to complete are indications of unreliability among the client systems, the networking infrastructure, or the filers --- but they say nothing about the locking semantics under test. Anyone who is interested in more details of my proposed test harness is welcome to contact me (I'll monitor this thread), and anyone who sees potential flaws or can suggest code which has already robustly implemented something like this is especially encouraged to do so.]
Jim "The AnswerGuy" Dennis

My apologies for posting this twice; given the complexity of the commentary, I'd intended for it to be posted in HTML for easier reading. I hope John or someone on the LWN team will delete the earlier copy of this.
