
A filesystem for virtualization

By Jake Edge
May 14, 2019

LSFMM

A new filesystem aimed at sharing host filesystems with KVM guests, virtio-fs, was the topic of a session led by Miklos Szeredi at the 2019 Linux Storage, Filesystem, and Memory-Management Summit. The existing solution, which is based on the 9P filesystem from Plan 9, has some shortcomings, he said. Virtio-fs is a prototype that uses the Filesystem in Userspace (FUSE) interface.

The existing 9P-based filesystem does not provide local filesystem semantics and is "pretty slow", Szeredi said. The FUSE-based virtio-fs (RFC patches) is performing "much better". One of the ideas behind the new filesystem is to share the page cache between the host and guests, so there would be no data duplication for multiple guests accessing the same files from the host filesystem.

There are still some areas that need work, however. Metadata and the directory entry cache (dcache) cannot be shared, because data structures cannot be shared between the host and guests. There are two ways to handle that. Either there can be a round trip from the guest to the host for each operation to ensure the coherence of the metadata cache and dcache, or the guest can cache that information and somehow revalidate the cache on each operation without going to the host kernel.

The question is what the best solution would be, he said. For example, if a file has changed on the host, the modification time is updated and a stat() on the guest should indicate that. There have been some discussions on how to get notifications from the host kernel to the guest; the notifications would be propagated via a ring buffer in memory. When the guest caches an inode, it could tell the host that it wants notifications for that inode. When it gets a notification, the guest can revalidate its cache. If the ring buffer overflows for some reason, the guest will need to revalidate all of its caches.
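The notification scheme can be pictured with a toy model. This is only a sketch under stated assumptions: the class names (`NotificationRing`, `GuestCache`), the fixed capacity, and the policy of dropping events once the ring is full are all hypothetical and are not taken from the virtio-fs patches.

```python
from collections import deque


class NotificationRing:
    """Bounded queue of inode numbers that changed on the host (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()
        self.watched = set()
        self.overflowed = False

    def watch(self, inode):
        # Guest side: ask for notifications about an inode it has cached.
        self.watched.add(inode)

    def push(self, inode):
        # Host side: record a change, or note that an event was lost.
        if inode not in self.watched:
            return
        if len(self.entries) >= self.capacity:
            self.overflowed = True
        else:
            self.entries.append(inode)

    def drain(self):
        # Guest side: take all pending events and reset the overflow flag.
        events, overflowed = list(self.entries), self.overflowed
        self.entries.clear()
        self.overflowed = False
        return events, overflowed


class GuestCache:
    """Tracks which cached inodes the guest may still trust (hypothetical)."""

    def __init__(self, ring):
        self.ring = ring
        self.valid = set()

    def cache_inode(self, inode):
        self.valid.add(inode)
        self.ring.watch(inode)

    def process_notifications(self):
        events, overflowed = self.ring.drain()
        if overflowed:
            # Some events were lost, so nothing in the cache can be trusted.
            self.valid.clear()
        else:
            self.valid.difference_update(events)

    def needs_revalidation(self, inode):
        return inode not in self.valid
```

In this model a single lost event forces the guest to treat its whole cache as stale, which is the cost of keeping the host side non-blocking.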

Amir Goldstein asked if that mechanism could also be used by Samba to implement its own dcache. Trond Myklebust said that what Szeredi was talking about was an asynchronous notification mechanism, while Samba needs something synchronous. The problem with doing synchronous notifications, Szeredi said, is that the guest should not be able to block operations in the host kernel.

Another topic is POSIX file locking, he said. It is difficult to write a user-space filesystem that allows POSIX locking to work consistently with the host filesystem. The kernel NFS server (knfsd) uses kernel-internal functions to do its locking, but he is not sure what user-space NFS servers do.

The traditional way to handle that is with a user-space lock manager that takes the standard POSIX locks as needed, Myklebust said. Szeredi asked if it would make sense to add a kernel interface for the kernel-internal locking used by knfsd. Boaz Harrosh said that the Ganesha NFS server had a similar problem; it used open file description locks (OFD locks), which put the lock on the struct file so that multiple threads can use the locks successfully, unlike POSIX locks.

Szeredi said the idea was to have POSIX locks that work across guests and the host. Steve French said that Samba also uses OFD locks, which is what he recommended. They have easier semantics, in part because they don't get lost when another descriptor for the same file is closed, as POSIX locks do. It is a solution that was added partly for NFS, he said. Szeredi said that it sounded like the conclusion is that it is not worth it to make a new kernel interface for POSIX locks.
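The difference in semantics can be seen from user space. The following is a minimal sketch, not code from Samba or Ganesha. It assumes a 64-bit Linux system, where `struct flock` can be packed as `"hhqqi"`, and it falls back to the value 37 for `F_OFD_SETLK` when the Python `fcntl` module does not export the constant. The helper names `try_ofd_write_lock()` and `ofd_lock_demo()` are made up for the illustration.

```python
import fcntl
import os
import struct
import tempfile

# Exported by Python 3.9+ on Linux; 37 is the value in the Linux headers.
F_OFD_SETLK = getattr(fcntl, "F_OFD_SETLK", 37)


def try_ofd_write_lock(fd):
    """Try a non-blocking whole-file OFD write lock; return True on success."""
    # Assumed 64-bit Linux layout: l_type, l_whence, l_start, l_len, l_pid.
    # l_pid must be zero for OFD lock requests.
    request = struct.pack("hhqqi", fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
    try:
        fcntl.fcntl(fd, F_OFD_SETLK, request)
        return True
    except OSError:
        return False


def ofd_lock_demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "locked.dat")
        # Three separate open() calls give three open file descriptions.
        fds = [os.open(path, os.O_RDWR | os.O_CREAT, 0o600) for _ in range(3)]
        try:
            got_first = try_ofd_write_lock(fds[0])
            # Same process, different description: the lock conflicts.
            conflict_seen = not try_ofd_write_lock(fds[1])
            # Closing another descriptor does not drop the first lock.
            os.close(fds.pop(1))
            still_held = not try_ofd_write_lock(fds[1])
        finally:
            for fd in fds:
                os.close(fd)
    return got_first, conflict_seen, still_held
```

With classic POSIX locks, the second request would succeed because the lock belongs to the process, and closing any descriptor for the file would release the process's locks on it.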

Another area that needs attention is the handling of the ctime and mtime timestamps stored for files. They record the time of the last metadata update (ctime) and file data update (mtime). If writes to the file are going to a shared page cache, it will cause the timestamps to be updated on the host filesystem, but only sometimes. That could lead to inconsistencies with the guests' metadata caches.

He is thinking about adding a flag to open() to turn off the updating of these timestamps, which would partially solve the problem. XFS already has a flag like this, but it is not exported to user space. That kind of flag may well have security implications, he said. Goldstein said that he thought the flag was added for Data Management API (DMAPI) support in XFS so that it could make changes to files without updating the timestamps. But DMAPI has been deprecated for XFS, which is probably why the flag is not exported.

The worry about such a flag is that changes can be made to a file's contents without anyone noticing, Myklebust said. That is why it was not added to POSIX, he believes. The solution to the problem is to implement a proper version field that gets exported from the inode.
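A toy model shows why a version field addresses the concern. It is purely illustrative: the `ToyInode` class and its `update_times` parameter are hypothetical and do not correspond to any kernel interface.

```python
class ToyInode:
    """Minimal stand-in for an inode with timestamps and a change counter."""

    def __init__(self):
        self.data = b""
        self.mtime = 0
        self.ctime = 0
        self.version = 0

    def write(self, data, now, update_times=True):
        self.data = data
        if update_times:
            # Normal write: both timestamps move forward.
            self.mtime = now
            self.ctime = now
        # The counter is bumped on every change, even an "invisible" one.
        self.version += 1


def changed_since(inode, seen_mtime, seen_ctime, seen_version):
    """Return (noticed_by_timestamps, noticed_by_version)."""
    by_times = (inode.mtime, inode.ctime) != (seen_mtime, seen_ctime)
    by_version = inode.version != seen_version
    return by_times, by_version
```

A write made with `update_times=False` leaves both timestamps untouched, so a cache that compares timestamps misses it, while a cache that compares the version still sees the change.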


Index entries for this article
Kernel: Filesystems/Virtualization
Kernel: Virtualization/virtio
Conference: Storage, Filesystem, and Memory-Management Summit/2019


A filesystem for virtualization

Posted May 15, 2019 9:16 UTC (Wed) by eru (subscriber, #2753) [Link] (5 responses)

A very welcome addition. The 9p file system sucks even worse than noted: it does not support symlinks properly, which at one time caused me hours of wasted time when trying to build a project in a VM that mounted a host directory via 9p (just had to quit trying that, and switched to using docker).

A filesystem for virtualization

Posted May 15, 2019 10:19 UTC (Wed) by grawity (subscriber, #80596) [Link] (4 responses)

Symlink support was supposed to be added by 9p2000.L, or does nobody implement that?

A filesystem for virtualization

Posted May 15, 2019 21:23 UTC (Wed) by dezgeg (subscriber, #92243) [Link] (3 responses)

It is implemented by the Linux kernel as 9p client + QEMU as 9p server combo. Symlinks indeed work correctly.

Offhand I know at least these things are broken in that combo though:
- Creating a file with open(O_CREAT|O_RDWR) but giving a mode that doesn't allow write access fails. This is observable in practice by rsync/cp of files with 0444 mode failing.
- Creating an xattr with zero length value fails, because in the protocol this is interpreted as xattr deletion.
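For reference, the first pattern in the list above works on a local Linux filesystem, where the mode passed to open() applies to later opens, not to the descriptor being created. The sketch below shows the expected local behavior; the function name `create_read_only_and_write()` and the file name are made up for the illustration.

```python
import os
import stat
import tempfile


def create_read_only_and_write(directory, payload):
    """Create a file with mode 0444 via a read-write open, then write to it."""
    path = os.path.join(directory, "readonly.txt")
    # O_EXCL guarantees the file is newly created by this call.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o444)
    try:
        # The new descriptor is writable even though the mode is read-only.
        written = os.write(fd, payload)
    finally:
        os.close(fd)
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return written, mode
```

On a local filesystem the write succeeds and the resulting file has no write permission bits set; tools such as rsync and cp rely on exactly this when copying read-only files.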

A filesystem for virtualization

Posted May 16, 2019 9:07 UTC (Thu) by eru (subscriber, #2753) [Link]

> It is implemented by the Linux kernel as 9p client + QEMU as 9p server combo

Just the combination that did not work for me. Some forms of symlinking did, but not all (I think giving the link target as a relative reference failed, but it was a couple of years ago, so I might remember inaccurately). I did not investigate this further, after figuring out why my build runs failed in an odd way.

A filesystem for virtualization

Posted May 17, 2019 8:30 UTC (Fri) by vdanjean (subscriber, #1552) [Link] (1 responses)

> Creating a file with open(O_CREAT|O_RDWR) but giving a mode that doesn't allow write access fails. This is observable in practice by rsync/cp of files with 0444 mode failing.

This is not specific to 9p. I observed the same problem with a Kerberos NFSv4 config. And git is using this pattern... My client runs the latest Linux kernel, but I do not have access to the server (probably a CentOS but I do not know its kernel version). I ended up writing a small library that intercepts such 'open' calls and splits them into separate system calls. I know I lose atomicity, but I gain a working git on this NFS mount. If needed, the code is here: https://gitlab.inria.fr/NGS/nfs-workaround

A filesystem for virtualization

Posted May 17, 2019 13:38 UTC (Fri) by bfields (subscriber, #19510) [Link]

That's a bug--unfortunately, probably a bug in your server of unknown version. (You can verify who's at fault by running wireshark and seeing exactly where it's failing). NFS has always supported write opens that create read-only files.

(Basically NFS servers allow the owner of a file to override permissions and leave enforcement to the client in these cases. It's a minimal loss of security (since the owner could change the permissions anyway) to get better compatibility with local filesystem behavior.)

A filesystem for virtualization

Posted May 17, 2019 0:55 UTC (Fri) by kmeyer (subscriber, #50720) [Link] (2 responses)

> The worry about such a flag is that changes can be made to a file's contents without anyone noticing, Myklebust said.

Can't any user that can modify a file already set the mtime arbitrarily (under ordinary unix permissions)? (I would expect SELinux or ACLs / MAC policy can restrict this in some way.) I would assume open() with the "suppress mtime/atime change" flag would cause open() to EPERM or EACCES if the user does not have that capability per security policy, making the concern moot?

A filesystem for virtualization

Posted May 17, 2019 15:42 UTC (Fri) by nybble41 (subscriber, #55106) [Link] (1 responses)

> Can't any user that can modify a file already set the mtime arbitrarily (under ordinary unix permissions)?

Setting the mtime is a metadata change which forces the ctime to be updated, so the change would still be noticed. The proposed flag would allow updates to the file's content without any change in mtime *or* ctime.
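This behavior is easy to observe on Linux, where `st_ctime` is the inode change time. A minimal sketch follows; the helper name `backdate_mtime()` is made up for the illustration.

```python
import os
import tempfile


def backdate_mtime(path, fake_mtime_ns):
    """Set an arbitrary mtime and return the stat results before and after."""
    before = os.stat(path)
    # Keep the access time as it was; only the modification time is forged.
    os.utime(path, ns=(before.st_atime_ns, fake_mtime_ns))
    after = os.stat(path)
    return before, after
```

After the call, the mtime reports the forged value, but the ctime reflects the moment of the utime() call, so the tampering remains visible to anything that checks ctime.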

A filesystem for virtualization

Posted May 17, 2019 16:03 UTC (Fri) by kmeyer (subscriber, #50720) [Link]

I see, thanks.

Taking a step back, I guess I’m not sure how the proposed open mode would be used by userspace NFS/Samba. The article is a bit light on details there.

A filesystem for virtualization

Posted May 17, 2019 2:48 UTC (Fri) by dgc (subscriber, #6611) [Link]

Yes, the FMODE_NOCMTIME flag is used on XFS by both xfs_fsr (online defragmentation) to move data around the filesystem without applications noticing it and by xfsdump for when it is pulling data out of the filesystem during backups.

It was also used by HSM applications that used DMAPI, but the invisible IO had nothing to do with the DMAPI interface, i.e., the HSMs used the same mechanism as xfs_fsr to move data in/out of the filesystem (to/from tape) without any user-visible file data or metadata modification. So while DMAPI is no longer in use, the filesystem utilities still use this flag for moving data around the filesystem without leaving traces that users and applications may get upset about...

-Dave.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds