
Deferring mtime and ctime updates


Posted Aug 22, 2013 11:23 UTC (Thu) by jlayton (subscriber, #31672)
Parent article: Deferring mtime and ctime updates

Yes, great article!

I'm not convinced that the NFS problem is really a huge issue. mmap and NFS have a long and problematic history together...

The answer for people trying to enforce cache coherency across multiple clients has always been "use POSIX locking". If we simply have nfsd force the c/mtime update whenever a lock is released, then that may be enough to keep that in check.

Another idea might be to somehow allow the c/mtime to be updated in memory without requiring that to be flushed to disk until flush_cmtime() is called. knfsd could then report the in-memory time on GETATTR calls. That could be problematic though if the host crashes, I guess. The client could see c/mtime move backward. For the Linux client, that's not such a huge problem -- it watches for the c/mtime to change and that change doesn't necessarily need to move toward the future. Other clients might not like it though.
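A toy user-space model of that idea (the struct and helper names below are invented for illustration and are not kernel interfaces): writes touch only the in-memory time that a GETATTR would report, the flush to disk is deferred, and a crash makes the reported time move backward.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the deferred-flush idea above.  The names here are
 * invented for this sketch; they are not kernel interfaces.
 */
struct mock_inode {
	uint64_t mtime_mem;	/* what knfsd would report on GETATTR */
	uint64_t mtime_disk;	/* what survives a host crash */
};

/* A page is dirtied: update only the cheap in-memory copy. */
static void note_write(struct mock_inode *ino, uint64_t now)
{
	ino->mtime_mem = now;
}

/* The deferred flush_cmtime()-style flush to stable storage. */
static void flush_times(struct mock_inode *ino)
{
	ino->mtime_disk = ino->mtime_mem;
}

/* After an unclean shutdown, only the on-disk copy remains. */
static void crash_and_recover(struct mock_inode *ino)
{
	ino->mtime_mem = ino->mtime_disk;
}
```

Running the sequence write, flush, write, crash leaves the reported mtime older than what a client may already have seen, which is exactly the backward jump described above.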



Deferring mtime and ctime updates

Posted Aug 22, 2013 15:28 UTC (Thu) by bfields (subscriber, #19510) [Link]

Yeah, I don't think it's worth trying for perfect consistency between a local application using mmap and a remote NFS client.

"The answer for people trying to enforce cache-coherency across multiple clients has always been "use POSIX locking". If we simply have nfsd force the c/mtime update whenever a lock is released then that may be enough to keep that in check."

That might not be a bad idea.

For ordinary NFS clients the requirement that "times must be updated between any write to a page and either the next msync() call or the writeback of the data in question" may be all we need to guarantee that clients will see updates when they should, as they should normally be committing data before unlocking or opening. (Though I wonder about the case where they hold a write delegation.)

"that change doesn't necessarily need to move toward the future."

Though in the v4 case there's a chance the change attribute could repeat a value after reboot. I wonder if the filesystem could somehow add N to all change attributes after an unclean shutdown, for some big enough N.

Deferring mtime and ctime updates

Posted Aug 22, 2013 20:26 UTC (Thu) by luto (subscriber, #39314) [Link]

The trouble with forcing early cmtime updates is that we don't really know whether there's a dirty pte without walking all dirty pages.

(Given that the kernel doesn't currently support clean+writable ptes for shared mappings, this could be hacked around by tracking the number of writable ptes, but this is IMO gross. The other reason I don't want to do that is because I want the system to have a chance of working on writable XIP devices and on some magic future kernel that does support clean+writable.)

It would be possible to write a function to collect dirty ptes for an entire address_space much faster than calling page_mkclean on every page.

Deferring mtime and ctime updates

Posted Aug 23, 2013 12:08 UTC (Fri) by etienne (guest, #25256) [Link]

Maybe the other trouble with forcing early cmtime updates is that you do not know whether anyone will ever ask for the result; i.e., the amount of work to maintain cmtime should be reduced to a minimum, even if that means more work when someone calls stat() on the file.
Maybe we could even scan both versions (on disk and in memory) looking for differences when a stat() of an mmap'ed file is done...
I wonder if anyone has measured the cost of managing a "page written" bit by setting the page read-only and faulting at the first write (on Intel/AMD), and, on ARM, of also managing a "page accessed" bit by setting the page non-accessible and faulting at the first access... that is a lot of faults (lots of pages), and each one may slow down the faulting process (cache and branch-target-buffer pollution, out-of-order CPU stalls)...
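For what it's worth, the shape of that per-page cost can be sketched from user space with the same trick: map a page read-only and let the fault handler record the first write and re-enable write access. This is only an illustration of the mechanism (mprotect() from a signal handler is not formally async-signal-safe, though it works on Linux, and none of this is kernel code):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static size_t page_size;
static volatile sig_atomic_t page_dirtied;

/* The first write to the read-only page lands here; record the
 * "page written" event and re-enable writes so the faulting
 * instruction can be restarted. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
	char *addr = (char *)info->si_addr;

	(void)sig; (void)ctx;
	if (addr < page || addr >= page + page_size)
		_exit(1);	/* a real, unrelated crash */
	page_dirtied = 1;	/* here one would mark "cmtime update needed" */
	mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int track_first_write(void)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = on_write_fault;
	sa.sa_flags = SA_SIGINFO;
	page_size = (size_t)sysconf(_SC_PAGESIZE);
	page = mmap(NULL, page_size, PROT_READ,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED || sigaction(SIGSEGV, &sa, NULL) != 0)
		return -1;
	page[0] = 'x';		/* one fault here ... */
	page[1] = 'y';		/* ... then full speed */
	return page_dirtied ? 0 : -1;
}
```

Only the first write per page pays the fault; all subsequent writes run at full speed until the page is marked clean and read-only again.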

Deferring mtime and ctime updates

Posted Aug 23, 2013 17:33 UTC (Fri) by dlang (subscriber, #313) [Link]

think about a multi-GB video file that's accessed via NFS, then look at your proposal to scan the entire file again and see if it still seems reasonable.

Deferring mtime and ctime updates

Posted Aug 23, 2013 11:55 UTC (Fri) by jlayton (subscriber, #31672) [Link]

> Though in the v4 case there's a chance the change attribute could repeat a
> value after reboot.

Is that really a problem though? IIRC, the change_attribute is supposed to be opaque. The client is only supposed to care if it's different from the last one it saw, not necessarily that it has gone forward.

Oh but...I suppose you could get bitten if you saw the change attribute transition from (for instance) 3->4 and then server reboots without committing that to disk. It then comes back again and does another 3->4 transition with a different set of data. Client then sees change attribute is "still" at 4 and doesn't purge the cache.

In that case...yeah -- maybe adding some sort of random offset to the change_attribute that's generated at boot time might make sense.

Deferring mtime and ctime updates

Posted Aug 23, 2013 13:41 UTC (Fri) by bfields (subscriber, #19510) [Link]

Yeah, exactly. At the time of a crash the in-memory change attribute may be well ahead of the on-disk one, and when the client resends the uncommitted data after boot it probably doesn't send exactly the same number and sequence of write RPCs, so as the server processes those resends it could reuse old change attributes for different data.

I don't know if the problem would be easy to hit in practice.

For a fix: we'd rather not invalidate all caches on every boot. We can't track which inodes are affected, as that would require a disk write before allowing any pages to be dirtied. And especially if there's a possibility of multiple reboots and network partitions, I don't think we can even know which boots are affected (maybe this is a boot after a clean shutdown, but we still have a back-in-time change attribute left over from a previous crash).

Maybe a simple fix would be: instead of making the change attribute a simple 64-bit counter, instead put current unix time in the top 32 bits and a counter in the bottom 32 bits. Print a warning and congratulations to the log the first time anyone manages to sustain more than 4 billion writes in a second....
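A sketch of that layout (the helper names are invented here; in the kernel this would live wherever i_version is bumped today). Putting the time in the high half guarantees that any attribute minted in a later second sorts above everything minted earlier, including before a crash:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the change attribute as suggested above: Unix time in the
 * top 32 bits, a per-second write counter in the bottom 32 bits.
 * The function names are invented for this sketch. */
static uint64_t make_change_attr(uint32_t unix_time, uint32_t counter)
{
	return ((uint64_t)unix_time << 32) | counter;
}

static uint32_t change_attr_seconds(uint64_t attr)
{
	return (uint32_t)(attr >> 32);
}

static uint32_t change_attr_counter(uint64_t attr)
{
	return (uint32_t)attr;
}
```

So a rebooted server cannot accidentally reuse a pre-crash value, short of sustaining more than 2^32 writes within one second, the case the log message would celebrate.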

Deferring mtime and ctime updates

Posted Aug 23, 2013 17:34 UTC (Fri) by dlang (subscriber, #313) [Link]

keep in mind that the in-memory version should only be used if you are NOT exporting the file via NFS.

So you don't have to worry about clients after a reboot.

Deferring mtime and ctime updates

Posted Aug 23, 2013 18:45 UTC (Fri) by bfields (subscriber, #19510) [Link]

You're suggesting ensuring that any pending ctime/mtime/change attribute updates be committed to disk before responding to an nfs stat? I'm not sure that's practical.

Deferring mtime and ctime updates

Posted Aug 24, 2013 1:31 UTC (Sat) by dlang (subscriber, #313) [Link]

remember that the NFS spec requires that any writes to a NFS volume must be safe on disk before the write completes.

This requires an fsync after every write, which absolutely kills performance (unless you avoid ext3 and have NVRAM or a battery-backed cache to write to); updating the attribute at the same time seems to be required by the standard.

Now, many people configure their systems outside the standard, accepting less data reliability in the name of performance, but if you are trying to provide all of the NFS guarantees, you need to update the timestamp after every write.

This is why it's a _really_ bad idea to put things like Apache logs on NFS, unless you have a server with a lot of NVRAM to buffer your writes, and even then it's questionable.

Deferring mtime and ctime updates

Posted Aug 24, 2013 2:18 UTC (Sat) by mjg59 (subscriber, #23239) [Link]

I… think that reminding the maintainer of the kernel NFS server how NFS works might be a touch unnecessary.

Deferring mtime and ctime updates

Posted Aug 24, 2013 2:28 UTC (Sat) by dlang (subscriber, #313) [Link]

could be (and for the record, I didn't recognize who he was), but I've seen people manage to miss obvious things in their area of expertise before (and I've done it myself)

If I'm wrong about my understanding of what NFS requires, I'm interested in being corrected; I'll learn something and be in a better position to set up or troubleshoot in the future.

David Lang

Deferring mtime and ctime updates

Posted Aug 24, 2013 19:45 UTC (Sat) by bfields (subscriber, #19510) [Link]

No problem, I can overlook the obvious....

But as jlayton says, what you describe is not the typical case for NFS since v3, and reverting to NFSv2-like behavior would be a significant regression in some common cases.

And on a quick check.... I think the Linux v4 client, as one example, does request the change attribute on every write (assuming it doesn't hold a delegation), so the server would be forcing a commit to disk on every write.

Deferring mtime and ctime updates

Posted Aug 24, 2013 20:11 UTC (Sat) by dlang (subscriber, #313) [Link]

Ok, I wasn't aware that newer versions of NFS had relaxed the standard. (I've been dealing with NFS for a while, but for the last 10 years or so it's either been with home-grade machines that I didn't expect great performance from, or with EMC/NetApp high-end devices that include a lot of NVRAM to handle writes quickly anyway.)

just so I can see whether I've got the use cases right, my understanding is that we have the following cases:

1. no NFS: ctime and mtime updates can be deferred

2. NFSv2 in use: all writes are synchronous and ctime/mtime updates should be as well.

3. NFSv3+ in use: writes can be delayed (which should include ctime/mtime updates), unless the client says they can't, in which case NFSv2 rules apply

It seems to me that a mount option like relctime or relmtime, where the timestamp gets written out when the file is closed/munmapped, when an fsync is done, or sooner if the kernel feels like it, should work (assuming NFS does flushes).

The only gap I can see is if the writes to the file are being done locally (via mmap, for example); then the writes may not be visible to NFS clients immediately. But if this is a mount option like relatime is, people who care about this case just don't use the mount option and get the old (slower but reliable) mode.

Deferring mtime and ctime updates

Posted Aug 24, 2013 11:23 UTC (Sat) by jlayton (subscriber, #31672) [Link]

> remember that the NFS spec requires that any writes to a NFS volume must
> be safe on disk before the write completes. This requires a fsync after
> every write, which absolutely kills performance (unless you avoid ext3 and
> you have NVRAM or battery backed cache to write to), updating the
> attribute at the same time seems to be required by the standard.

That was true for NFSv2, but NFSv3 and later allow you to do UNSTABLE writes. Those don't need to be written to stable storage until the client issues a COMMIT (though the server is free to write them out earlier if it needs to). Most clients (Linux's included) will use UNSTABLE writes for the bulk of the writes they do. STABLE (NFSv2-ish) writes are still used in some cases, but only where we deem it more efficient to do it that way.
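To illustrate the distinction: the three stability levels below are the actual WRITE arguments defined by RFC 1813, while the toy server model wrapped around them is invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* WRITE stability levels, as defined by RFC 1813 (NFSv3). */
enum stable_how { UNSTABLE = 0, DATA_SYNC = 1, FILE_SYNC = 2 };

/* Toy server model (not nfsd code): UNSTABLE data may sit in memory
 * until the client's COMMIT; FILE_SYNC data must be on stable
 * storage before the reply goes out. */
struct toy_server {
	size_t pending;		/* bytes buffered in memory only */
	size_t stable;		/* bytes safe on stable storage */
};

static void srv_write(struct toy_server *s, size_t len, enum stable_how how)
{
	if (how == UNSTABLE)
		s->pending += len;	/* reply before the disk I/O */
	else
		s->stable += len;	/* disk I/O before the reply */
}

static void srv_commit(struct toy_server *s)
{
	s->stable += s->pending;	/* flush everything buffered */
	s->pending = 0;
}
```

The common pattern on the wire is a burst of UNSTABLE writes followed by a single COMMIT, which is why NFSv3+ servers don't need the per-write fsync described above.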

Deferring mtime and ctime updates

Posted Aug 24, 2013 11:28 UTC (Sat) by jlayton (subscriber, #31672) [Link]

> Maybe a simple fix would be: instead of making the change attribute a
> simple 64-bit counter, instead put current unix time in the top 32 bits
> and a counter in the bottom 32 bits. Print a warning and congratulations
> to the log the first time anyone manages to sustain more than 4 billion
> writes in a second....

I suspect it wouldn't be too hard to hit that mark ;)

This scheme might work, but you'd still have the same problem that all caches would end up invalidated when the server reboots. You're quite correct that that *is* a problem that can crush an NFS server if it has a lot of clients dealing with large files. I do think we'll need to come up with some scheme that avoids that.

Deferring mtime and ctime updates

Posted Aug 24, 2013 19:59 UTC (Sat) by bfields (subscriber, #19510) [Link]

> I suspect it wouldn't be too hard to hit that mark ;)

I'm not so sure, but actually this is really only a problem if the counter wraps around *and* a client's two successive stats manage to hit the same value each time through, which sounds pretty unlikely.

> This scheme might work, but you'd still have the same problem that all caches would end up invalidated when the server reboots.

I'm suggesting replacing inode_inc_version by something that does this instead of just i_version++. So existing change attributes wouldn't change on reboot. It'd just ensure that when we write the file again, we choose a genuinely new change attribute and not one we might have used on the previous boot.

Deferring mtime and ctime updates

Posted Aug 26, 2013 16:25 UTC (Mon) by raven667 (subscriber, #5198) [Link]

> manage to hit the same value each time through, which sounds pretty unlikely

Just throwing this out there because I'm not an expert on this, but it might also be useful to look at it from a security perspective: how could someone intentionally cause this to fail? If there is a thing which can fail, then someone is going to try very hard to make it fail, just to screw up your system if they can.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds