File-private POSIX locks

February 19, 2014

This article was contributed by Jeffrey T. Layton

File locks are heavily used by applications such as databases and file servers. Whenever you have multiple programs accessing files at the same time there is always the potential for data corruption or other bugs unless that access is carefully synchronized. File locks solve that problem, but the existing implementation can be difficult to use, especially for multi-threaded programs. File-private POSIX locks are an attempt to take elements of both BSD-style and POSIX locks and combine them into a more threading-friendly file locking API.

Multiple writers attempting to change a file at the same time can clobber each other's changes. In addition, an update to a file may need to be done in more than one place. If another thread of execution sees only part of the update, it may trigger bugs in the program.

File locks are generally available in two flavors: read (also known as shared) locks and write (also known as exclusive) locks. Multiple read locks can be given out for a portion of a file, but only one write lock can be handed out at any given time, and only if no other read or write lock for that region has been set. While file locks on some operating systems are mandatory, on Unix-like systems locking is generally advisory. Advisory locks are like stoplights — they only work if everyone pays attention to them.

One of the primary mechanisms to handle file locking is the one specified by the POSIX standard. POSIX defines a file-locking standard that allows the ability to lock arbitrary byte ranges in a file for read or write. Unfortunately, they have a couple of serious problems that make them unsuitable for use by modern applications.

The problems with POSIX locking

Whenever a program attempts to acquire a lock, that lock is either granted or denied based on whether there is already a conflicting lock set over the given range. If there is no conflicting lock present, the lock will be granted. If there is then it will be denied.

Classic POSIX lock requests from the same process never conflict with one another. When a request for a lock comes in that would conflict with an existing lock that the process set previously, the kernel treats it as a request to modify the existing lock. Thus, classic POSIX locks are useless for synchronization between threads within the same process. Given the prevalence of threaded applications in modern computing, this renders POSIX locks fairly useless as a synchronization mechanism.

More troublingly, the standard states that all locks held by a process are dropped any time the process closes any file descriptor that corresponds to the locked file, even if those locks were made using a still-open file descriptor. It is this detail that catches most programmers by surprise as it requires that a program take extra care not to close a file descriptor until it is certain that locks held on that file are able to be dropped.

That's not always a simple question to answer. If a program opens two different links of a hardlinked file, takes a lock on one file descriptor and then closes the other, that lock is implicitly dropped even though the file descriptor on which the lock was originally acquired remains open.

This is a particular problem for applications that use complex libraries that do file access. It's common to have a library routine that opens a file, reads or writes to it, and then closes it again, without the calling application ever being aware that has occurred. If the application happens to be holding a lock on the file when that occurs, it can lose that lock without ever being aware of it. That sort of behavior can lead to silent data corruption, and loss of developer sanity. Jeremy Allison has an excellent writeup of this problem and of how such a broken standard came into being (see the section entitled "First Implementation Past the Post").

There is however, another competing (or complementary) file locking standard that has its roots in BSD Unix. These locks (which are manipulated via the flock() system call) have more sane semantics. Whereas POSIX locks are owned by the process, BSD locks are owned by the open file. If a process opens a file twice and tries to set exclusive locks on both, the second one will be denied. Thus, BSD locks are usable as a synchronization mechanism between threads as long as each thread has a separate opened file. Note that cloning a file descriptor with dup() is not sufficient since that simply takes a reference to the same opened file.

Also, BSD locks are only released when the last reference to the open file on which they were acquired is closed. Thus if a program opens a file, takes a lock on it and then uses dup() to duplicate the file descriptor, the lock will only be released automatically when both file descriptors are closed.

The only real problem with BSD locks is that they are whole-file locks. POSIX locks, on the other hand can operate on arbitrary byte ranges within a file. While whole-file locks are useful (and indeed, many applications just lock entire files even with POSIX locks), they are not sufficient for many cases. Applications such as databases need granular locking in order to allow for better parallelism.

File-private POSIX locks

I will assert that what is needed is a new type of lock that is a hybrid of the two — a byte-range lock that has BSD-like semantics for inheritance across fork() and on close(). Furthermore, since there is a large legacy codebase of programs that use "classic" POSIX locks, these new locks need to be aware of the classic locks so that programs using the new locks will interoperate correctly with those applications.

Classic POSIX locks are manipulated using a set of command values passed to the fcntl() system call:

F_GETLK - test whether a lock is able to be applied
F_SETLK - attempt to set a lock. Return error if unable to do so
F_SETLKW - attempt to set a lock and block until able to do so

These commands are accompanied by a pointer to a binary struct flock argument that looks something like this:

    struct flock {
        short int l_type;   /* Type of lock: F_RDLCK, F_WRLCK, or F_UNLCK.  */
        short int l_whence; /* Where `l_start' is relative to (like `lseek').  */
        off_t l_start;      /* Offset where the lock begins.  */
        off_t l_len;        /* Size of the locked area; zero means until EOF.  */
        pid_t l_pid;        /* Process holding the lock. (F_GETLK only) */
    };

Similarly, file-private POSIX locks are manipulated with a similar set of commands, this time appended with 'P':

F_GETLKP - test whether a lock is able to be applied
F_SETLKP - attempt to set a file-private lock
F_SETLKPW - attempt to set a file-private lock and block until able to do so

The new commands should look very familiar to those used to working with classic POSIX locks and they take the same struct flock argument. The only real difference between file-private and classic POSIX locks is their "ownership". Classic POSIX locks are owned by the process whereas file-private POSIX locks are owned by the opened file.

Using file-private POSIX locks

It is currently necessary to define the _GNU_SOURCE preprocessor macro in order to get the new command definitions as file-private locks are not yet part of POSIX. Using file-private locks is very similar to using classic POSIX locks. In many cases, one can simply replace the command value with the file-private equivalent. There are subtle differences, however.

Since one of the most troublesome aspects of classic POSIX locks is their behavior on close, there should be no surprise that file-private locks behave differently. File-private locks are only released automatically when the last reference to the open file is released.

It's tempting to then consider file-private locks to be "owned" by the file descriptor, but that's not technically true. If a file descriptor is cloned via dup(), the kernel will simply take an extra reference to the open file and assign it to a new slot in the open file descriptor table. File-private locks set on a cloned file descriptor will not conflict with locks set on the original file descriptor. The kernel will treat such a lock request as a request to modify the existing lock. Furthermore, file-private locks set using either file descriptor would only be released automatically once both file descriptors are closed, though one can always release a lock manually with an F_UNLCK request.

Interaction across fork() is very similar. When fork() is called, the kernel takes an extra reference to each open file and assigns it to the same slot in the new process's file descriptor table. Locks set by either process on the same open file would not conflict with one another, and would only be automatically released once both processes have closed it.

Classic and file-private locks will always conflict with one another, even when used in the same process and/or on the same file descriptor. I don't expect that many programs will mix the two, but given the pain that undefined behaviors can cause I think it's prudent to declare that explicitly.

Whither `F_GETLK`?

F_GETLK would probably have been better named F_TESTLK. While it does technically fetch the existing status of a locked range, its real purpose is to allow one to test whether a given lock request could be set without actually setting it. If there happens to be a conflicting lock already set within that range, the kernel will overwrite the struct flock with information about that lock and set l_pid to the value of the process that owns that lock.

The l_pid field is a bit of a dilemma for file-private locks. File-private locks are not owned by processes. A file descriptor could have been inherited across a fork(), so the l_pid value is somewhat meaningless if the conflicting lock is a file-private one. Still, when a program using classic POSIX locks calls F_GETLK, we do need to put something in the l_pid field. That something is -1.

This precedent comes from BSD. On Linux, POSIX and BSD locks operate in a completely different namespace. On BSD, however, they operate in the same namespace and, thus, will conflict with each other. If a program holds a BSD lock on a file, and another does a F_GETLK request against it, the BSD kernel will set the l_pid to -1. Since portable programs already need to contend with such behavior, using the same behavior for file-private locks seems like a reasonable choice.

Using file-private locks with threads

It's common for modern applications to use a threading model instead of forking to create a new thread of execution. This is problematic with classic POSIX locks. They are associated with a process, so locks acquired by threads within the same process cannot conflict.

With file-private locks however we can circumvent that restriction by giving each thread its own open file. Here's an example (note that I've left out proper error handling for the sake of brevity):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <pthread.h>

    #define FILENAME    "/tmp/foo"
    #define NUM_THREADS 3
    #define ITERATIONS  5

    void *
    thread_start(void *arg)
    {
        int i, fd, len;
        long tid = (long)arg;
        char buf[256];
        struct flock lck = {
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 1,
        };

        fd = open(FILENAME, O_RDWR|O_CREAT, 0666);

        for (i = 0; i < ITERATIONS; i++) {
            lck.l_type = F_WRLCK;
            fcntl(fd, F_SETLKPW, &lck);

            len = sprintf(buf, "%d: tid=%ld fd=%d\n", i, tid, fd);

            lseek(fd, 0, SEEK_END);
            write(fd, buf, len);
            fsync(fd);

            lck.l_type = F_UNLCK;
            fcntl(fd, F_SETLKP, &lck);

            usleep(1);
        } 
        pthread_exit(NULL);
    }

    int
    main(int argc, char **argv)
    {
        long i;
        pthread_t threads[NUM_THREADS];

        truncate(FILENAME, 0);

        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&threads[i], NULL, thread_start, (void *)i);

        pthread_exit(NULL);
        return 0;
    }

This example spawns three threads and has each do five iterations of appending to a file. Access to that file is serialized via file-private locks. If we compile and run the above program, we end up with /tmp/foo that has 15 lines in it.

If we, however, were to replace the F_SETLKP and F_SETLKPW commands with their classic POSIX lock equivalents, the locking essentially becomes a noop since it is all done within the context of the same process. That leads to data corruption (missing lines) as some threads race in and overwrite the data from others.

Conclusion

File-private locks can solve many of the problems experienced with classic POSIX locks, but programmers intending to use them should take heed of the differences.

Developers from several projects including Samba, NFS Ganesha, SQLite, and OpenJDK have expressed interest in using file-private locks since they help simplify the code for many of their use cases, and help eliminate data corruption issues that can occur when files are closed.

The kernel patchset is available in the linux-kernel mailing list posting, or via the linux-next branch of my git tree. I plan to keep that branch updated with the latest version until it gets merged into mainline kernels. The kernel patches are currently being pulled into the linux-next tree as well. Anyone using linux-next kernels can use these now. There is also a (fairly trivial) GNU C library (glibc) patchset which implements the definitions needed to access these locks.

I'm currently aiming to have the kernel patches merged into mainline in v3.15, and glibc patches to add the new command definitions along with an update to the glibc manual should hopefully be merged soon afterward. Assuming that these patches are merged, I also intend to submit an update to the POSIX specification to make these a formal part of POSIX. I have opened a request to have it considered. There have already been some helpful suggestions and The Austin Group (who oversees POSIX) seems receptive to the general idea.

Hopefully other operating systems will follow suit and implement these as well so that programmers dealing with those platforms can reap the same benefits.

Index entries for this article
Kernel	Filesystems/POSIX locks
GuestArticles	Layton, Jeffrey

File-private POSIX locks

Posted Feb 20, 2014 17:10 UTC (Thu) by bfields (subscriber, #19510) [Link] (12 responses)

"With file-private locks however we can circumvent that restriction by giving each thread its own open file."

Are there cases where we'd want to use file locking to coordinate access to a file that we only know by file descriptor?

For example, you can open a file, unlink it, fork, and then use posix locks to coordinate parent and child, but that's something file-private locks can't do. (Unless I'm misunderstanding something.)

File-private POSIX locks

Posted Feb 21, 2014 14:40 UTC (Fri) by jlayton (subscriber, #31672) [Link] (11 responses)

Well, the simple workaround is to have the parent open the file twice, unlink it and then fork. Designate one fd for the parent and one for the child.

I'm not sure what we can do about the case where all we have is a fd though. Do we have an actual use-case for that?

If so, then one possibility would be to add something like a fd_to_handle() syscall and then use open_by_handle_at() to reopen it.

File-private POSIX locks

Posted Feb 21, 2014 15:13 UTC (Fri) by foom (subscriber, #14868) [Link] (1 responses)

You can already reopen a file given an fd by calling open on "/proc/self/fd/%d"

File-private POSIX locks

Posted Feb 21, 2014 15:46 UTC (Fri) by bfields (subscriber, #19510) [Link]

I forgot about that. OK, that would be simpler!

File-private POSIX locks

Posted Feb 21, 2014 15:38 UTC (Fri) by bfields (subscriber, #19510) [Link] (8 responses)

Do we have an actual use-case for that?

I don't.

If so, then one possibility would be to add something like a fd_to_handle() syscall

Oh, good idea, looks like we already have that--it should work to do

  ret = name_to_handle_at(fd, NULL, &handle, &mnt_id, AT_EMPTY_PATH)
  if (!ret)
    newfd = open_by_handle_at(fd, &handle, O_RDWR);

(untested.) So that gives a way to create arbitrary "owners" for the purpose of file lock conflicts, good.

I wonder why there isn't some way to do an AT_EMPTY_PATH open? Then you could do it with one syscall

  newfd = openat(fd,..., AT_EMPTY_PATH)

and there'd be no need to require an exportable filesystem or involve the filehandle code.

File-private POSIX locks

Posted Feb 21, 2014 16:04 UTC (Fri) by jlayton (subscriber, #31672) [Link] (7 responses)

Hah, I had forgotten about /proc/pid/fd too...thanks, foom!

An AT_EMPTY_PATH open sounds like a neat idea, but as always we'd have to think through the security implications.

What happens if you open the file, drop privs such that it would no longer be reachable by lookup and then do an AT_EMPTY_PATH open? Is that bad juju?

File-private POSIX locks

Posted Feb 21, 2014 16:09 UTC (Fri) by bfields (subscriber, #19510) [Link] (6 responses)

What happens if you open the file, drop privs such that it would no longer be reachable by lookup and then do an AT_EMPTY_PATH open? Is that bad juju?

Commit f0cc6ffb8ce8961db587e5072168cac0cbc25f05 on the comparable problem for link is interesting:

    Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink"
    
    This reverts commit bb2314b47996491bbc5add73633905c3120b6268.
    
    It wasn't necessarily wrong per se, but we're still busily discussing
    the exact details of this all, so I'm going to revert it for now.
    
    It's true that you can already do flink() through /proc and that flink()
    isn't new.  But as Brad Spengler points out, some secure environments do
    not mount proc, and flink adds a new interface that can avoid path
    lookup of the source for those kinds of environments.
    
    We may re-do this (and even mark it for stable backporting back in 3.11
    and possibly earlier) once the whole discussion about the interface is done.

Looks like an unanswered question?

File-private POSIX locks

Posted Feb 21, 2014 16:39 UTC (Fri) by jlayton (subscriber, #31672) [Link] (5 responses)

Yeah, I'm not sure. I guess it's usually best to err on the side of caution (and that's what that patch does). For now it's probably best to simply suggest reopening /proc/pid/fd in that sort of situation.

We're wandering off into theoretical territory here anyway...

File-private POSIX locks

Posted Feb 22, 2014 9:46 UTC (Sat) by neilbrown (subscriber, #359) [Link] (4 responses)

Hi Jeff,
It's great to see that this might actually be happening at last!

In Jeremy Allison's writeup that you linked he observes:

> A simple amendment to the original primitive allowing a user-defined "locking context" (like a process id) to be entered in the struct flock structure used to define the lock would have fixed this problem

"this problem" being much the same as you and Bruce are discussing above. i.e. rather than requiring separate opens, you could require 'l_pid' to be set either to 0 or a negative number. Each negative number would be a separate locking sub-context within the context of a given file open. (positive numbers of course being reserved for posix lock pids).

Did you give any thought to implementing something like this?

(even if not, it might be worth rejecting setlk requests with l_pid!=0 as a future-proofing measure).

File-private POSIX locks

Posted Feb 23, 2014 13:16 UTC (Sun) by jlayton (subscriber, #31672) [Link] (3 responses)

Overloading the l_pid's meaning sounds pretty difficult to implement in practice and would lead to convoluted semantics:

Suppose I lock two adjacent bytes in two F_SETLK requests and set the l_pid in both to '-1'. I then go and do an F_UNLCK request that encompasses both bytes (while setting l_pid to 0). Which ones get released?

I think that we could implement something like different lock contexts within a single fd, but that would probably require a new API. At the very least we'd need a "struct flock2" that has an opaque l_owner field or something.

One of the other things that the Samba and Ganesha devs have said they would like is an asynchronous locking API. Something that allows you to set a blocking lock and be called back when it's free, leaving the calling thread free to go off and do other things. That's a much more complex problem to handle.

I did consider both, but I think there's value in simplicity with this sort of thing. Maintaining extra fds is not terribly difficult in most cases, and having a 1:1 lock context to fd ratio is simple to conceptualize.

Note that this work does not preclude us adding a new API onto the file locking layer in the future that does these things.

That said, rejecting F_SETLKP[W] requests with a non-zero l_pid sounds like a good idea. I may respin this again and add that.

One of my main goals here is to leave as little undefined behavior as possible since that's almost always a recipe for trouble later.

File-private POSIX locks

Posted Feb 23, 2014 13:49 UTC (Sun) by jlayton (subscriber, #31672) [Link]

Hah, sorry -- I shouldn't post before I've had my morning coffee. I misunderstood what you were asking, so let me try again...

Ok, so in your above example the answer is that no locks would be released. So yeah, we could (in principle) overload the l_pid field with a lock context like you suggest.

Still, that seems like one of those subtle things that would be easy to mess up. What might be better is to simply add a "struct flock2" that adds a separate l_owner value.

I guess the question is whether the extra complexity of juggling opaque contexts is really any better than juggling separate fds. You would avoid having to do extra open()s, but using file locking at all sort of implies that you're dealing with long-lived fds anyway, so would it be worthwhile?

I still think if we want to add something like the l_owner field, it'd be best to do so in the context of a new API altogether that allows you to do this asynchronously as well.

File-private POSIX locks

Posted Mar 4, 2014 18:17 UTC (Tue) by sethml (guest, #8471) [Link] (1 responses)

One of the other things that the Samba and Ganesha devs have said they would like is an asynchronous locking API. Something that allows you to set a blocking lock and be called back when it's free, leaving the calling thread free to go off and do other things. That's a much more complex problem to handle.

Isn't this the type of problem that poll() and select() solve? Provide a way to convert a file range lock into an FD, and then you can poll() for taking the lock or other things. This is one thing I feel the Win32 api got right and POSIX got wrong: in Win32 there's a HANDLE type for mutexes, open files, and generally anything you might want to block for, and a variety of WaitForHandles() type calls. In POSIX, if a thread wants to block until either of two mutexes is available, or either a mutex or a file read is available, you're out of luck. Or you have to play crazy games with sending a single byte through a pipe to the thread.

File-private POSIX locks

Posted Mar 25, 2014 16:35 UTC (Tue) by jlayton (subscriber, #31672) [Link]

That's an interesting idea. I'm certainly not opposed to a new API for driving file-private locks, but that can (and should, IMO) live in conjunction with the fcntl() interface. I think adding such an interface is a separate project however.

File-private POSIX locks

Posted Feb 22, 2014 22:26 UTC (Sat) by MegabytePhreak (guest, #60945) [Link] (3 responses)

It isn't clear to me why the obvious approach of having the lock being linked to the FD isn't used. (As opposed to here where the lock is linked to the FD and any copies such as made by dup).
What is gained by the more complicated approach? It seems to violate the principle of least surprise,

File-private POSIX locks

Posted Feb 22, 2014 23:03 UTC (Sat) by Jonno (subscriber, #49613) [Link]

> It isn't clear to me why the obvious approach of having the lock being linked to the FD isn't used. (As opposed to here where the lock is linked to the FD and any copies such as made by dup).
From my understanding copies made by dup and copies made by fork are indistinguishable, so tying file-private locks to the fd rather than to the open file table entry would make it unusable in multi-process scenarios.

File-private POSIX locks

Posted Feb 22, 2014 23:37 UTC (Sat) by foom (subscriber, #14868) [Link]

Tying anything to an FD is just broken conceptually. An FD is supposed to be purely a refcounted handle to a "file description" -- an open file. You should be able to renumber fds, copy, close, etc to your heart's content without it causing any other side effects, until you close the last one.

The only exception to that right now is POSIX locks -- which is one of the ways it's completely stupidly broken.

File-private POSIX locks

Posted Feb 23, 2014 13:22 UTC (Sun) by jlayton (subscriber, #31672) [Link]

As others have responded, an "fd" is really just a slot in the opened file table that points to an opened file object. It's very difficult to tie *anything* to that slot, and I think that trying to do so would lead to a mess similar to how classic POSIX locks work today.

One of the main reasons for settling on these semantics is that we already have experience with them. BSD (flock()) locks work like that today.

Will interface be back-filled in glibc?

Posted Feb 24, 2014 19:15 UTC (Mon) by ms-tg (subscriber, #89231) [Link] (2 responses)

Will it be possible for later versions of glibc to support this interface via backfilling the feature on older linux kernels?

Will interface be back-filled in glibc?

Posted Feb 24, 2014 20:09 UTC (Mon) by jlayton (subscriber, #31672) [Link]

I'm not sure I understand the question, but I'll try to answer it anyway.

The glibc patches are really quite trivial -- they just add the F_*LKP definitions. All of the "real work" is done by the kernel. The patches would need to be backported to the older kernel.

The good news is that it is possible to determine whether these new cmd values are supported at runtime. fcntl() always returns -EINVAL if passed a cmd value that it doesn't recognize.

Will interface be back-filled in glibc?

Posted Mar 2, 2014 19:34 UTC (Sun) by ChrisDolan (guest, #41017) [Link]

ms-tg, I'd say no. There's no way for userspace to emulate this feature in older kernels without deep out-of-band userspace cooperation.

now available as "Open File Description (OFD) Locks"

Posted Jan 5, 2015 21:11 UTC (Mon) by zack (subscriber, #7062) [Link]

According to http://austingroupbugs.net/view.php?id=768 (note 0002508), these locks have been merged into Linux v3.15 are are documented in the glibc manual at http://www.gnu.org/software/libc/manual/html_mono/libc.ht...

Pretty amazing!

Open File Description Locks

Posted Apr 15, 2015 15:43 UTC (Wed) by jlayton (subscriber, #31672) [Link]

This comment should be considered an "errata" for this article. After this article was written, later discussions pointed out the "file-private" was really too ambiguous. The name we finally settled on was "open file description locks" (aka OFD locks).

The constants were also similarly renamed:

# define F_OFD_GETLK 36
# define F_OFD_SETLK 37
# define F_OFD_SETLKW 38

...it should also be noted that glibc currently requires that __GNU_SOURCE be defined before including fcntl.h in order to get these macros.

When trying to set a lock, the API now requires that l_pid value in the flock struct be set to 0. This is in response to Neil Brown's suggestion, in the event that we decide to overload the meaning of that field in the future.

Other than the above, the facility basically works as this article describes.

Those folks wanting to use this facility should probably refer to a recent version of the glibc manual which now has a section on OFD locks, and examples of how to use them.

File-private POSIX locks

The problems with POSIX locking

File-private POSIX locks

Using file-private POSIX locks

Whither F_GETLK?

Using file-private locks with threads

Conclusion

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

File-private POSIX locks

Will interface be back-filled in glibc?

Will interface be back-filled in glibc?

Will interface be back-filled in glibc?

now available as "Open File Description (OFD) Locks"

Open File Description Locks

Whither `F_GETLK`?