|
|
Log in / Subscribe / Register

A filesystem "change journal" and other topics

By Jake Edge
June 4, 2018

LSFMM

At the 2017 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Amir Goldstein presented his work on adding a superblock watch mechanism to provide a scalable way to notify applications of changes in a filesystem. At the 2018 edition of LSFMM, he was back to discuss adding NTFS-like change journals to the kernel in support of backup solutions of various sorts. As a second topic for the session, he also wanted to discuss doing more performance-regression testing for filesystems.

Goldstein said he is working on getting the superblock watch feature merged. It works well and is used in production by his employer, CTERA Networks, but there is a need to get information about filesystem changes even after a crash. Jan Kara suggested that what was wanted was an indication of which files had changed since the last time the filesystem changes were queried; Goldstein agreed.

[Amir Goldstein]

NTFS has a change journal and he is working on something similar for his company's products that is based on overlayfs. Changes that are made to the filesystem go into the overlay. For his use case, he does not need high accuracy, false positives are not a serious problem; it is sufficient to simply know that something may have changed. His application will then scan the directory to determine what, if anything, has actually changed.

Kent Overstreet asked about using filesystem sequence numbers, but Goldstein said that not all filesystems have them. He is going to continue with his overlayfs-based plan, but would rather see a generic API that could be used by others.

Dave Chinner wondered if the inode version (i_version) field could be used. Goldstein said that he wants to be able to query the filesystem to get all of the changes that have happened since a particular point in time. Josef Bacik said that Btrfs has that feature. Goldstein said that both Btrfs and dm-thin (thin provisioning) will provide a list of blocks that have changed.

Simply scanning the inode array at mount time to find updates since the last query would be easy, but could take a long time, Ted Ts'o said. Chinner said the real underlying problem is missed notifications—filesystem changes that the application is not notified about. If that problem is solved, there is no need to scan after a crash.

Goldstein is using a journal of the fsnotify event stream to reconstruct lost events in the event of a crash. But Chinner is worried about missing change notifications because the fsnotify event does not make it into the journal before the crash. Kara suggested a new fsnotify event that would indicate the intent to change; it would be journaled before the actual change. Since false positives are not a problem, if the actual change does not happen (and the fsnotify event is not actually generated), everything will still work.

Kara said that FreeBSD has a facility that provides something similar to the NTFS change journal. The API for that is already established and might provide inspiration for the Linux API. Goldstein said that he already has a way to solve his immediate problem; he has lots of ideas for additional features if he gets the time to work on them.

Filesystem performance regressions

Goldstein then shifted gears; he would like to see more filesystem performance-regression testing and wanted to discuss that. Bacik said that some performance tests have been merged into xfstests recently and asked for more. He has created a way to get fio data dumped in JSON that can be pulled into a SQLite database for doing comparisons.

Overstreet suggested that those tests should be run automatically; if it has to be done manually, it won't happen. But Bacik said he has been focused on just getting it running; he wondered if it would be more valuable to run performance tests every time or only when the developer wants to look at performance numbers.

Chinner said that the performance testing is really only meaningful for him when he is doing A/B testing. Otherwise, various runs of the test suite might have different debug settings (e.g. lockdep), so the results would not be comparable. In order for the runs to be meaningful, they have to be done in a controlled and consistent environment.

Al Viro wondered how much variability was being seen between test runs. In his testing he has seen lots of variability, which makes it even harder to compare the results between different kernel versions. The allowable variability before flagging a regression is defined in the tests, Bacik said; it is around 2% or so. Right now, the output from every run is stored in a database, but it is fairly rudimentary, he said.

Kara said that MMTests gathers similar kinds of data. He has found that averages are not particularly useful because the data is so noisy run to run, especially if the difference between the kernels is large. Average plus standard deviation is a reasonable starting point. He is not opposed to incorporating something simple into xfstests, but is concerned that more complex tests just make the run-to-run variability so high that it makes it hard to find where an observed regression is coming from.

Bacik said that Facebook rebases its kernels yearly and he would like to have a simple test to be sure that the performance hasn't gone down radically. He wants discrete tests that won't show a lot of variability. But he is not trying to catch small performance losses with these tests. He said that the tests that are there now are "better than nothing" and that nothing is what was there before. He asked again for more tests. He also asked that Ts'o and Chinner run the performance tests for ext4 and XFS, as he is doing for Btrfs.


Index entries for this article
KernelFilesystems
ConferenceStorage, Filesystem, and Memory-Management Summit/2018


to post comments

A filesystem "change journal" and other topics

Posted Jun 7, 2018 8:31 UTC (Thu) by scabrero (guest, #124899) [Link] (9 responses)

The change journal may be potentially useful for a future DFS-Replication server for Samba. To implement the server side of the protocol it is necessary to generate a data structure and store it somewhere each time a file is created/modified/deleted, which is not a problem if samba is on the path (eg. a client through SMB), but the "change journal" would help to notice about changes made by local users even if samba is not running.

A filesystem "change journal" and other topics

Posted Jun 8, 2018 15:13 UTC (Fri) by nix (subscriber, #2304) [Link] (8 responses)

If such a thing is ever possible, I hope to use it to implement a new "continuous" bup indexer, which streams all changes immediately to the backup medium as long as they are low-volume (as a new 'backup journal' data structure suitable only for wholesale replay: large volumes of changes get throttled to avoid flooding the system with dirty writes to a slow backup medium, but the knowledge that they happened is preserved), then, when you want to do a bup run, it uses the info on the complete (unthrottled) set of changes since the last such run to update the bup index very rapidly (deleting the "backup journal" at the same time).

This should give you the best of all worlds: everything you do except for, say, very large software updates or DB transactions (which get throttled) is instantly dropped onto the backup device so that you can recover from disk failures to exactly the failed state by doing a bup restore and then a backup journal replay, and actual bup runs are very fast because all that is needed is an index update and bup run, with no need for a filesystem walk. When bup moves to a sqlite index this should reduce even full bup runs to a few seconds plus half a minute or so for the bloom filter update, particularly given that all the modified data is likely still in the disk cache unless I/O load was very heavy.

But this all requires a way to tell what has changed (not only locally, thanks, I want to know if my NFS server gets stuff modified by a client, too, so the server can back it up, yes I know NFS makes this nice and hard: saying "NFSv4 only" is acceptable to me).

A filesystem "change journal" and other topics

Posted Jun 10, 2018 14:37 UTC (Sun) by bfields (subscriber, #19510) [Link] (1 responses)

"But this all requires a way to tell what has changed (not only locally, thanks, I want to know if my NFS server gets stuff modified by a client, too, so the server can back it up, yes I know NFS makes this nice and hard: saying "NFSv4 only" is acceptable to me)."

knfsd isn't really *that* special a filesystem user. If the filesystem supported a feature like this, I don't see why we'd have any trouble making it work properly for exported filesystems, and I doubt the protocol would matter.

A filesystem "change journal" and other topics

Posted Jun 11, 2018 5:49 UTC (Mon) by nix (subscriber, #2304) [Link]

I guess it *used* to be special because it was the only thing which had to open files by inode number. Now we have open_by_handle_at() this problem has to be solved in any case, even for local filesystems :)

A filesystem "change journal" and other topics

Posted Jun 14, 2018 6:04 UTC (Thu) by njs (subscriber, #40338) [Link] (5 responses)

This is how the (proprietary) CrashPlan backup system works... except of course it can't use the yet-to-be-created change journal, so it has to register an inotify watch on every file in the system. Which is why all my systems have an override file in /etc/sysctl.d/ with some ridiculous setting like:

fs.inotify.max_user_watches = 4194304

It's absolutely worth it to get continuous, fine-grained backup snapshots, but clearly Linux is not supporting this as well as it could.

A filesystem "change journal" and other topics

Posted Jun 19, 2018 19:33 UTC (Tue) by nix (subscriber, #2304) [Link] (4 responses)

I'm fairly sure that won't work if some of the changes are made by NFS clients. (The inotify watch gets fired on the client, not on the server, but it's on the server you'd want it for a backup system.)

Needless to say I won't be competing with CrashPlan because a) I won't have a wizzy GUI and b) I won't write something that slows the system to an utterly unusable crawl (or that's what the Windows version does, anyway, which I am forced to use on the work laptop I never turn on). Oh also CrashPlan actually exists. Code that doesn't exist has minimal effect on system performance!

A filesystem "change journal" and other topics

Posted Jun 20, 2018 14:27 UTC (Wed) by bfields (subscriber, #19510) [Link] (3 responses)

"I'm fairly sure that won't work if some of the changes are made by NFS clients. (The inotify watch gets fired on the client, not on the server, but it's on the server you'd want it for a backup system.)"

If you've observed this, please file a bug report against the NFS server.

Sure, a write that's still in the client's cache won't result in a notification on the server until it's flushed back to the server, but that will happen eventually (barring overwrites, a client crash before sync, etc.--but I think a backup application will be OK with those exceptions).

But any change that reaches the exported filesystem should definitely result in a notification.

Client-side notifications are much harder as they'd need some (very chatty) protocol to learn about changes made by anyone other than themselves.

A filesystem "change journal" and other topics

Posted Jul 8, 2018 15:09 UTC (Sun) by nix (subscriber, #2304) [Link] (2 responses)

> "I'm fairly sure that won't work if some of the changes are made by NFS clients. (The inotify watch gets fired on the client, not on the server, but it's on the server you'd want it for a backup system.)"
> If you've observed this, please file a bug report against the NFS server.

What, this works?! It didn't used to, but mind you that was many years ago. I assumed it still didn't work because I mentioned it here and someone -- perhaps Al Viro? -- asked astoundedly why I would ever expect changes on the client to fire watches on the server. I decided that inotify was therefore useless-by-design for more reasons than I thought it was, and never looked again.

If that's actually been considered a bug and fixed at some point in the last decade or so, then the only remaining problem is the scalability one, and for that one presumably just needs a better API. (Well, that and a way to get the right subset of change notifications back out to clients, but *that* is a problem that can obviously be solved entirely in userspace.)

A filesystem "change journal" and other topics

Posted Jul 9, 2018 13:46 UTC (Mon) by bfields (subscriber, #19510) [Link] (1 responses)

I'll admit I've never tested it!

But it should have worked since inotify was introduced. It's not really even up to knfsd, the notification hooks are mostly in lower level vfs code that knfsd calls. (And of course userspace servers like Ganesha couldn't avoid notifying even if they wanted to.)

Before inotify was introduced, there was dnotify, was was a poor fit since as I understand it only allowed directories. Sending the notification for a write to the parent directory is a little strange to start off with, for knfsd it's often impossible. Nevertheless looking through the history I see Neil Brown added code that attempts to do this on read and write in early 2004. Even on dnotify I expect notifications for other operations worked since it was introduced.

The harder problem again is notifying one client about changes that occurred on another. Yeah, in theory I guess you could do that with a separate protocol implemented in userspace. Sounds painful to me.

A filesystem "change journal" and other topics

Posted Jul 27, 2018 13:38 UTC (Fri) by nix (subscriber, #2304) [Link]

The harder problem again is notifying one client about changes that occurred on another. Yeah, in theory I guess you could do that with a separate protocol implemented in userspace. Sounds painful to me.
I think that's the only sensible approach. Whatever the client is, it would have to mirror the NFS server layout, with a dispatcher/broadcaster on the NFS server that inspected the NFS mount points and rebroadcast inotify notifications to the appropriate client: every client would then pretend to be something like gamin so that glib etc would Just Work with it.

This is much simpler than the alternative if the NFS server did not see changes done on clients: that would need every client to broadcast inotify requests it saw to the server, which then bounced them to every client, and that is getting seriously ugly, really.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds