LWN: Comments on "A virtual filesystem locking surprise" https://lwn.net/Articles/939389/ This is a special feed containing comments posted to the individual LWN article titled "A virtual filesystem locking surprise". en-us Fri, 03 Oct 2025 16:41:20 +0000 Fri, 03 Oct 2025 16:41:20 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net A virtual filesystem locking surprise https://lwn.net/Articles/941347/ https://lwn.net/Articles/941347/ dxin <div class="FormattedComment"> Isn't mutex_lock already heavily optimized for the uncontended case? E.g. mutex_trylock_fast. Is it still necessary to find a case to skip the locking like this?<br> </div> Sat, 12 Aug 2023 04:34:42 +0000 RCU https://lwn.net/Articles/940284/ https://lwn.net/Articles/940284/ corbet ...or look in <a href="https://lwn.net/Kernel/Index/#Read-copy-update">the RCU section</a> of the LWN kernel index. Thu, 03 Aug 2023 13:13:03 +0000 General solution https://lwn.net/Articles/940259/ https://lwn.net/Articles/940259/ paulj <div class="FormattedComment"> There's a wealth of articles on RCU here on LWN too, from the RCU author, e.g.: <a href="https://lwn.net/Articles/263130/">https://lwn.net/Articles/263130/</a><br> <p> Just google for "LWN RCU" in your favourite search engine, e.g. 
DuckDuckGo.<br> </div> Thu, 03 Aug 2023 10:31:41 +0000 General solution https://lwn.net/Articles/940235/ https://lwn.net/Articles/940235/ NYKevin <div class="FormattedComment"> Link to documentation for people (like me) who are unfamiliar with the term "RCU": <a href="https://www.kernel.org/doc/Documentation/RCU/rcu.txt">https://www.kernel.org/doc/Documentation/RCU/rcu.txt</a><br> </div> Thu, 03 Aug 2023 08:14:11 +0000 General solution https://lwn.net/Articles/939860/ https://lwn.net/Articles/939860/ josh <div class="FormattedComment"> That seems like a potentially useful structure, as long as people acquiring the second reference to something are willing to wait the (potentially considerable) amount of time for an RCU grace period to expire.<br> <p> That might be a reasonable amount of overhead for pidfd_getfd, for instance.<br> </div> Mon, 31 Jul 2023 22:44:05 +0000 A virtual filesystem locking surprise https://lwn.net/Articles/939852/ https://lwn.net/Articles/939852/ pbonzini <div class="FormattedComment"> Functions such as fdget_pos() return a bunch of flags for later use in fdput() and fdput_pos(). One such flag is FDPUT_POS_UNLOCK, which directs fdput_pos() to release the mutex.<br> </div> Mon, 31 Jul 2023 21:13:48 +0000 A virtual filesystem locking surprise https://lwn.net/Articles/939835/ https://lwn.net/Articles/939835/ stevie-oh <div class="FormattedComment"> My understanding is that this is (well, *was*) ostensibly impossible, because the requirement is twofold: there is only one reference to this file descriptor, _and_ the process that has that reference is single-threaded.<br> <p> The logic goes like this:<br> 1. Only one reference exists to this file descriptor<br> 2. The reference belongs to a process with only one thread<br> 3. Therefore, right now, there is only one thread that can access or manipulate this file descriptor<br> 4. 
Right now, that thread is busy executing this function, which means it can't conflict with anything.<br> <p> The problem, then, is that io_uring and pidfd_getfd violate the validity of the leap from 2-&gt;3. pidfd_getfd would do what you mentioned: it allows the reference count to be incremented by a thread from another process. io_uring, on the other hand, seems to do work on threads that don't get "counted" for #2.<br> </div> Mon, 31 Jul 2023 19:03:55 +0000 General solution https://lwn.net/Articles/939822/ https://lwn.net/Articles/939822/ calumapplepie <div class="FormattedComment"> IMO, not acquiring a lock just because the reference count is *currently* one is a pretty nasty anti-pattern, but is also a really useful strategy in a lot of cases. It's nasty because it always creates a race window where a process can acquire a second reference to $PROTECTED_STRUCTURE and then see the lockless access succeed anyway. It's useful because there are just so many cases where most objects will be doing something single-threaded but the occasional object might be used concurrently.<br> <p> Proposal: someone rigs up something clever, as a general solution that can be applied to all the various objects in the kernel, via a little struct that can be put inside a larger object, call it 'rcutex'. It tracks whether or not a user can assume that there is only a single reference to an object; when you go to acquire the lock, you simply check if the bit is set, and if so you can go ahead with the assumption that you have an exclusive reference. When you go to make a new reference to an object containing an rcutex, you clear the single-threaded bit, wait an RCU grace period if it was set, and then you can proceed knowing that there are no ongoing lockless accesses.<br> <p> Now, I'm not Paul McKenney, so take this with a grain of salt, but I think it should be possible to use the grace periods of RCU, or a similar construct, to accomplish this. 
The details are beyond me, however; do we make the rcutex some sort of RCU-protected pointer, or will we need a non-RCU solution? This will obviously slow down the adding-references-to-existing-objects case, due to the need to wait a grace period; a half dozen ways to amortize that cost spring to mind, both through heuristics and through trying to avoid actually waiting out the period in a time-critical path. Some caution is needed for the case of two references being added at the same time; but two people clearing a bit still results in a cleared bit*<br> *probably.<br> </div> Mon, 31 Jul 2023 18:27:27 +0000 A virtual filesystem locking surprise https://lwn.net/Articles/939818/ https://lwn.net/Articles/939818/ wsy <div class="FormattedComment"> <span class="QuotedText">&gt; if (file_count(file) &gt; 1)</span><br> <span class="QuotedText">&gt; mutex_lock(&amp;file-&gt;f_pos_lock);</span><br> <p> What happens if the reference count changes between the check and the lock?<br> </div> Mon, 31 Jul 2023 17:10:32 +0000 A virtual filesystem locking surprise https://lwn.net/Articles/939809/ https://lwn.net/Articles/939809/ brauner <div class="FormattedComment"> <span class="QuotedText">&gt; That will impose a performance cost, but nobody seems to have been worried enough about it to actually measure it.</span><br> <p> Actually, I did request performance measurements from Intel's lkp-tests even before I sent that patch, but only received them after it was applied.<br> <p> To quote from that (private) mail:<br> <p> "we've already been merging it into our<br> so-called hourly kernels which are distributed to our machine pool for<br> various performance tests which we supported.<br> <p> so far, we didn't capture any performance change caused by this branch.<br> <p> in order to avoid missing, we aslo decided to run some performance tests<br> directly upon this branch [...]<br> to see if it could cause any performance change comparing to v6.5-rc1.<br> <p> firstly we want to 
check stress-ng jobs with HDD such like:<br> stress-ng-class-filesystem<br> stress-ng-class-io<br> stress-ng-class-os<br> stress-ng-class-vm-stack<br> stress-ng-os-1-thread<br> upon our Ice Lake and Cascade Lake test machines."<br> <p> </div> Mon, 31 Jul 2023 15:14:09 +0000