Poettering: Revisiting how we put together Linux systems

Posted Sep 2, 2014 20:23 UTC (Tue) by mezcalero (subscriber, #45103)
In reply to: Poettering: Revisiting how we put together Linux systems by nix
Parent article: Poettering: Revisiting how we put together Linux systems

NFS locking on Linux is a complete disaster. For example, the Linux NFS client implicitly forwards BSD locks made on an NFS share to the server, where the kernel picks them up as POSIX locks. Hence: if you lock a file with both BSD and POSIX locks on a local file system, that works fine, they don't conflict. If you do the same on NFS, you get a deadlock.

And yeah, this happens, because BSD locks are per-fd and hence actually usable. POSIX locks are per-process, which makes them very hard to use (especially as *any* close() of any fd on the file by that process drops the lock implicitly), but then again they support byte-range locking. Hence people end up using both intermixed quite frequently, maybe not on purpose, but certainly in real life.
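
The close() pitfall is easy to reproduce. The following is only a sketch, not anything from the kernel or glibc: the helper names are made up for illustration, and it assumes a POSIX system with Python's fcntl module, where fcntl.lockf() is implemented on top of fcntl() record locks. A forked child stands in for "another process" and checks whether the parent's lock is still in place.

```python
import fcntl
import os
import tempfile


def other_process_can_lock(path):
    """Fork a child and report whether it can take a POSIX write lock on path."""
    pid = os.fork()
    if pid == 0:
        # Child: POSIX locks are not inherited across fork(), so this is a
        # genuinely separate lock owner.
        status = 1
        try:
            fd = os.open(path, os.O_RDWR)
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            status = 0  # acquired: the parent no longer holds its lock
        except OSError:
            status = 1  # refused: the parent still holds its lock
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0


def demo_close_drops_posix_lock():
    """Return (lockable_before, lockable_after) as seen by another process."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd1 = os.open(tmp.name, os.O_RDWR)
        fcntl.lockf(fd1, fcntl.LOCK_EX)  # POSIX lock, owned by this process
        before = other_process_can_lock(tmp.name)

        # Some unrelated code opens and closes the same file...
        fd2 = os.open(tmp.name, os.O_RDWR)
        os.close(fd2)  # ...and the lock taken through fd1 is silently gone.
        after = other_process_can_lock(tmp.name)

        os.close(fd1)
        return before, after


if __name__ == "__main__":
    print(demo_close_drops_posix_lock())
```

On a local file system this should report (False, True): the other process is refused while the lock is held, and gets in as soon as the unrelated fd is closed, even though fd1 is still open.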

So yeah, file locking is awful on Linux anyway, and it's particularly bad on NFS.

Poettering: Revisiting how we put together Linux systems

Posted Sep 2, 2014 21:36 UTC (Tue) by bfields (subscriber, #19510)

"And yeah, this happens, because BSD locks are per-fd and hence actually usable."

For what it's worth, note Jeff Layton's File-private POSIX locks have been merged now.
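
For the curious, here is a rough sketch of what using them looks like. It makes several assumptions: the interface was merged under the name "open file description locks" (F_OFD_SETLK, Linux 3.15 and later), Python 3.9 or later exposes that constant, and struct flock has the 64-bit Linux layout packed below. The helper names are invented for this example.

```python
import errno
import fcntl
import os
import struct
import tempfile


def ofd_write_lock(fd, start, length):
    """Try a non-blocking OFD write lock on a byte range; True on success."""
    # Assumed struct flock layout (64-bit Linux): short l_type, short l_whence,
    # off_t l_start, off_t l_len, pid_t l_pid. l_pid must be 0 for OFD locks.
    request = struct.pack("hhqqi", fcntl.F_WRLCK, os.SEEK_SET, start, length, 0)
    try:
        fcntl.fcntl(fd, fcntl.F_OFD_SETLK, request)
        return True
    except OSError as exc:
        if exc.errno in (errno.EAGAIN, errno.EACCES):
            return False  # a conflicting lock is held elsewhere
        raise


def demo_ofd_locks():
    """Return (overlapping_ok, disjoint_ok), or None if OFD locks are unavailable."""
    if not hasattr(fcntl, "F_OFD_SETLK"):
        return None
    with tempfile.NamedTemporaryFile() as tmp:
        fd1 = os.open(tmp.name, os.O_RDWR)
        fd2 = os.open(tmp.name, os.O_RDWR)
        try:
            if not ofd_write_lock(fd1, 0, 10):
                return None
            # Same process, different open file description: these conflict,
            # which is what makes the locks usable from threads.
            overlapping_ok = ofd_write_lock(fd2, 0, 10)
            # Byte ranges still work as with classic POSIX locks.
            disjoint_ok = ofd_write_lock(fd2, 10, 10)
            return overlapping_ok, disjoint_ok
        except OSError as exc:
            if exc.errno == errno.EINVAL:
                return None  # kernel too old for OFD locks
            raise
        finally:
            os.close(fd1)
            os.close(fd2)


if __name__ == "__main__":
    print(demo_ofd_locks())
```

Where the locks are supported, the overlapping request through the second descriptor should be refused and the disjoint one granted, so you keep per-fd ownership and byte ranges at the same time.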

Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 19:30 UTC (Wed) by ermo (subscriber, #86690)

I was going to naïvely ask why POSIX hadn't adopted the superior (in the context of e.g. NFS) BSD-style locking and then you posted that link, which contains this little gem at the very top:

"File-private POSIX locks are an attempt to take elements of both BSD-style and POSIX locks and combine them into a more threading-friendly file locking API."

Sounds like the above is just what the doctor ordered?

Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 23:07 UTC (Wed) by nix (subscriber, #2304)

It doesn't really help. The problem is that local fses have multiple locks which do not conflict with each other, but the NFS protocol has only one way to signal a lock to the remote end. So there's a trilemma: either you use that for all lock types (and suddenly they conflict remotely where they did not locally), or you don't signal one lock type at all (and suddenly you have things not locking at all remotely where they did locally), or you use a protocol extension, which has horrible compatibility problems.

I don't see a way to solve this without a new protocol revision :(

Poettering: Revisiting how we put together Linux systems

Posted Sep 4, 2014 14:03 UTC (Thu) by foom (subscriber, #14868)

It does help, because it removes any reason to use the BSD lock API (at least on Linux with a new enough kernel). Before that addition, the POSIX lock programming model was so broken that nobody sane would ever *want* to use it.

Yet, on Linux, local POSIX locks interoperate properly with POSIX locks via NFS, so, if software all switches to using POSIX locks, it'll work properly when used both locally and remotely at the same time.

Of course, very often, nothing is ever running on the NFS server that touches the exported data (or at least, nothing that needs to lock it) -- the NFS server is *just* a fileserver. In such an environment, using BSD locks over NFS on Linux works properly too.

Poettering: Revisiting how we put together Linux systems

Posted Sep 5, 2014 0:44 UTC (Fri) by mezcalero (subscriber, #45103)

I think a big step forward would actually be if the NFS implementations were honest and returned a clean error when they cannot actually provide correct locking. But that's not what happens, and you have no way to figure out what is going on on a file system...

Just pretending that locking works, even if it doesn't, and returning success to apps is really the worst thing to do...

Poettering: Revisiting how we put together Linux systems

Posted Sep 8, 2014 16:12 UTC (Mon) by nix (subscriber, #2304)

You're suggesting erroring if a lock of one type is held on a file when an attempt is made to take out a lock of the other type? I suspect this is the only possible fix, if you can call it a fix. Now we just have to hope that programs check for errors from the locking functions! But of course they will, everyone checks for errors religiously :P

Poettering: Revisiting how we put together Linux systems

Posted Sep 8, 2014 18:56 UTC (Mon) by bfields (subscriber, #19510)

That wouldn't help. I think he's suggesting just returning -ENOLCK to BSD locks unconditionally. I agree that that's cleanest but in practice I suspect it would break a lot of existing setups.

I suppose you could make it yet another mount option and then advocate making it the default. Or just add NFS protocol support for BSD locks if it's really a priority; it doesn't seem like it should be that hard.
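
Either way, it only helps applications that actually look at the result. As a sketch of what that means on the application side (the function name and the reason strings are invented for this example, and the ENOLCK branch covers the hypothetical behavior discussed above, not something the Linux NFS client is known to do for BSD locks today):

```python
import errno
import fcntl


def try_flock(fd):
    """Attempt a non-blocking exclusive BSD lock; return (acquired, reason)."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True, "locked"
    except OSError as exc:
        if exc.errno == errno.EWOULDBLOCK:
            # Normal contention: somebody else holds the lock right now.
            return False, "held elsewhere"
        if exc.errno == errno.ENOLCK:
            # The file system says it cannot lock at all. Do not carry on
            # as if the file were protected.
            return False, "locking unavailable"
        raise
```

A caller can then treat "held elsewhere" as a reason to retry later and "locking unavailable" as a reason to refuse to run on that file system, instead of lumping both together or ignoring them.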

Poettering: Revisiting how we put together Linux systems

Posted Sep 9, 2014 13:56 UTC (Tue) by nix (subscriber, #2304)

"That wouldn't help. I think he's suggesting just returning -ENOLCK to BSD locks unconditionally. I agree that that's cleanest but in practice I suspect it would break a lot of existing setups."

Given how awful POSIX locks are (until you have a very recent kernel and glibc 2.20), and how sane people therefore avoided using the bloody things, I'd say it would break almost every setup relying on locking over NFS at all. A very bad idea.

Poettering: Revisiting how we put together Linux systems

Posted Sep 9, 2014 14:43 UTC (Tue) by foom (subscriber, #14868)

I don't think he was suggesting that, but that's actually what BSD does with BSD/POSIX locks.
A BSD lock will block a POSIX lock, and vice versa. (At least that's what happens locally; no idea what the BSDs' NFS clients do.)

Linux also had that behavior a long time ago IIRC. Not sure why it changed, that was before I paid attention.

Poettering: Revisiting how we put together Linux systems

Posted Sep 9, 2014 15:27 UTC (Tue) by bfields (subscriber, #19510)

Huh. A FreeBSD man page agrees with you:

https://www.freebsd.org/cgi/man.cgi?query=flock&sektion=2
"If a file is locked by a process through flock(), any record within the file will be seen as locked from the viewpoint of another process using fcntl(2) or lockf(3), and vice versa."

The recent Linux flock(2) man page suggests the Linux behavior was an attempt to match BSD behavior that has since changed?:

http://man7.org/linux/man-pages/man2/flock.2.html
"Since kernel 2.0, flock() is implemented as a system call in its own right rather than being emulated in the GNU C library as a call to fcntl(2). This yields classical BSD semantics: there is no interaction between the types of lock placed by flock() and fcntl(2), and flock() does not detect deadlock. (Note, however, that on some modern BSDs, flock() and fcntl(2) locks do interact with one another.)"

Strange. In any case, changing the local Linux behavior is probably out of the question at this point.
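
If you want to know which of the two behaviors a given system and file system has, a small probe is enough. This is a sketch under assumptions: the function name is made up, fcntl.lockf() is taken to be a wrapper around fcntl() record locks (as the Python documentation describes it), and the probe holds both locks from a single process through two separate open() calls. I have not checked that a single-process probe is sufficient on the BSDs.

```python
import errno
import fcntl
import os
import tempfile


def flock_and_fcntl_interact():
    """Return True if an fcntl() lock is refused while a flock() lock is held."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd1 = os.open(tmp.name, os.O_RDWR)
        fd2 = os.open(tmp.name, os.O_RDWR)
        try:
            fcntl.flock(fd1, fcntl.LOCK_EX)  # BSD lock on the first descriptor
            try:
                # POSIX record lock on the second descriptor, non-blocking.
                fcntl.lockf(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                if exc.errno in (errno.EAGAIN, errno.EACCES):
                    return True  # the two lock types see each other
                raise
            return False  # the two lock types are independent
        finally:
            os.close(fd1)
            os.close(fd2)


if __name__ == "__main__":
    print(flock_and_fcntl_interact())
```

Per the flock(2) text quoted above, a local Linux file system should report False. Point the temporary file at an NFS mount (for instance through the TMPDIR environment variable) to see what happens there, ideally with a timeout around it.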

Poettering: Revisiting how we put together Linux systems

Posted Sep 3, 2014 23:05 UTC (Wed) by nix (subscriber, #2304)

Oh yes, I can see how that's problematic -- though TBH it sounds like a bug (the server should export BSD locks as BSD locks, though I can understand the protocol difficulties in doing so). The fact remains that it can't be that common: it has never happened to me at all, and everything I do, I do over NFS.

What I really want -- and still seems not to exist -- is something that gives you the POSIXness of local filesystems (and things like ceph, IIRC) while retaining the 'just take a local filesystem tree, possibly constituting one or many or parts of local filesystems, and export them to other machines' property of NFS: i.e., not needing to make a new filesystem or move things around madly on the local machine just in order to export the fs. I know, this property is really hard to retain due to the need to make unique inums on the remote machine without exhausting local state, and NFS doesn't quite get it right -- but it would be very nice if it could be done.

Poettering: Revisiting how we put together Linux systems

Posted Sep 4, 2014 15:09 UTC (Thu) by bfields (subscriber, #19510)

"What I really want -- and still seems not to exist -- is something that gives you the POSIXness of local filesystems"

What exactly are you missing?

"not needing to make a new filesystem or move things around madly on the local machine just in order to export the fs. I know, this property is really hard to retain due to the need to make unique inums on the remote machine without exhausting local state"

I'm not sure I understand that description of the problem. The problem I'm aware of is just that it's difficult to determine given a filehandle whether the object pointed to by that filehandle is exported or not.

"NFS doesn't quite get it right"

Specifically, if you export a subtree of a filesystem then it's possible for someone with a custom NFS client and access to the network to access things outside that subtree by guessing filehandles.

Poettering: Revisiting how we put together Linux systems

Posted Sep 8, 2014 15:55 UTC (Mon) by nix (subscriber, #2304)

On the POSIXness side of things, I'd like the atomicity guarantees you get from a local fs, rather than having just rename() be atomic; I'd like to not have to deal with silly-rename leaving spew all over my disks that it is hard to figure out when it is safe to clean up; I'd like the same ACL system on the local and the remote filesystems rather than its being mapped through a crazy system designed to be interoperable with Windows... oh, and decent performance would be nice (like NFSv4 allegedly has, though I haven't yet managed to get NFSv4 to work -- haven't tried hard enough, I think its requirements for strong authentication are getting in my way).

Clearly NFS can't do all this: silly-rename and the rest are intrinsic to (the way NFS has chosen to do) statelessness. So I guess we need something else.

As for the not-quite-rightness of NFS's lovely ability to just ad-hoc export things, I have seen spurious but persistent -ESTALEs from nested exports and exports crossing host filesystems in the last year or two, and am still carrying round a horrific patch to make them go away (I was going to submit it, but a) it's horrific and b) I have to retest and make sure it's actually still needed: the underlying bug may have been fixed).

Poettering: Revisiting how we put together Linux systems

Posted Sep 8, 2014 16:30 UTC (Mon) by rleigh (guest, #14622)

With NFSv4 and ZFS, ACLs are propagated to client systems just fine (ZFS stores NFSv4 ACLs natively on disk), at least for a combination of FreeBSD server and client. With a FreeBSD server and Linux client, NFSv4 ACL support isn't working for me, though the standard ownership and perms work correctly. I put this down to the Linux NFS client being less sophisticated and/or buggy, but I can't rule out some configuration issue.

Poettering: Revisiting how we put together Linux systems

Posted Sep 8, 2014 18:49 UTC (Mon) by bfields (subscriber, #19510)

"With a FreeBSD server and Linux client, NFSv4 ACL support isn't working for me, though the standard ownership and perms work correctly. I put this down to the Linux NFS client being less sophisticated and/or buggy, but I can't rule out some configuration issue."

The actual kernel client code is pretty trivial, so the bug's probably either in the FreeBSD server or the client-side nfs4-acl-tools. Please report the problem.

Poettering: Revisiting how we put together Linux systems

Posted Sep 8, 2014 18:46 UTC (Mon) by bfields (subscriber, #19510)

"I think its requirements for strong authentication are getting in my way"

The spec does require that it be implemented, but you're not required to use it. If you're using NFS between two recent Linux boxes then you're likely already using NFSv4. (It has been the default since RHEL6, for example.)

"silly-rename and the rest are intrinsic"

See the discussion of OPEN4_RESULT_PRESERVE_UNLINKED in RFC 5661. It hasn't been implemented. I don't expect it's hard, so it will probably get done at some point depending on the priority, after which you'll no longer see sillyrenames between updated 4.1 clients and servers.

"spurious but persistent -ESTALEs from nested exports and exports crossing host filesystems"

Do let us know what you figure out (linux-nfs@vger.kernel.org, or your distro).