The two sides of reflink()

By Jonathan Corbet
May 5, 2009

One of the discussions your editor missed at the recent Linux Storage and Filesystem workshop covered the proposed reflink() system call. Fortunately, the filesystem developers have now filled in the relevant information with a detailed email exchange, complete with patches. We now have a proposed system call which has created more open questions than answers. The creation of a new core system call requires a lot of thought, so a close look at these questions would seem to be called for.

The proposed system calls are pretty simple:

    int reflink(const char *oldname, const char *newname);

    int reflinkat(int old_dir_fd, const char *oldname,
                  int new_dir_fd, const char *newname, int flags);

These system calls function much like link() and linkat() but with an important exception: rather than create a new link pointing to an existing inode, they create a new inode which happens to share the same disk blocks as the existing file. So, at the conclusion of a reflink() call, newname looks very much like a copy of oldname, but the actual data blocks have not been duplicated. The files are copy-on-write, though, meaning that a write to either file will cause some or all of the blocks to be duplicated. A change to one of the files will thus not be visible in the other file. In a sense, a reflink() call behaves like a low-cost file copy operation, though how copy-like it will be remains to be seen.
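The copy-on-write behavior described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the names `toy_file`, `toy_reflink()` and `toy_write_byte()` are invented for illustration, the 4096-byte block size is arbitrary, and no real filesystem is claimed to work this way internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096   /* assumed block size for this toy model */
#define MAX_BLOCKS 8      /* toy limit on file length, in blocks */

/* A data block plus a count of how many files point at it. */
struct block {
    int refs;
    char data[BLOCK_SIZE];
};

/* A toy "inode": nothing more than an array of block pointers. */
struct toy_file {
    size_t nblocks;
    struct block *blocks[MAX_BLOCKS];
};

/* Create a file made of nblocks zero-filled, unshared blocks. */
static int toy_create(struct toy_file *f, size_t nblocks)
{
    assert(nblocks <= MAX_BLOCKS);
    f->nblocks = 0;
    for (size_t i = 0; i < nblocks; i++) {
        struct block *b = calloc(1, sizeof *b);
        if (b == NULL)
            return -1;
        b->refs = 1;
        f->blocks[f->nblocks++] = b;
    }
    return 0;
}

/* "reflink": the new file shares every block; no data is copied. */
static void toy_reflink(const struct toy_file *old, struct toy_file *copy)
{
    copy->nblocks = old->nblocks;
    for (size_t i = 0; i < old->nblocks; i++) {
        copy->blocks[i] = old->blocks[i];
        copy->blocks[i]->refs++;
    }
}

/* Write one byte. Returns how many blocks had to be duplicated
 * (0 or 1), or -1 if memory ran out. */
static int toy_write_byte(struct toy_file *f, size_t offset, char value)
{
    size_t idx = offset / BLOCK_SIZE;
    int copied = 0;

    assert(idx < f->nblocks);
    struct block *b = f->blocks[idx];
    if (b->refs > 1) {
        /* Shared block: give this file a private copy first. */
        struct block *priv = malloc(sizeof *priv);
        if (priv == NULL)
            return -1;
        memcpy(priv->data, b->data, BLOCK_SIZE);
        priv->refs = 1;
        b->refs--;
        f->blocks[idx] = priv;
        b = priv;
        copied = 1;
    }
    b->data[offset % BLOCK_SIZE] = value;
    return copied;
}

/* Read one byte back. */
static char toy_read_byte(const struct toy_file *f, size_t offset)
{
    assert(offset / BLOCK_SIZE < f->nblocks);
    return f->blocks[offset / BLOCK_SIZE]->data[offset % BLOCK_SIZE];
}
```

In this model a write to the new file duplicates only the block it touches; the original file keeps its old contents, and all untouched blocks stay shared.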

The first question to arise was: does the kernel really need to provide both the reflink() and reflinkat() system calls? Most of the other *at() calls are paired with the non-at versions because the latter came first. Since Unix-like systems have had link() for a long time, it cannot be removed without breaking applications. So linkat() had to go in as a separate call. But reflink() is new, so it can just as easily be implemented in the C library as a wrapper around reflinkat(). That is how things will probably be done in the end.
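The wrapper idea can be shown with the existing calls. The sketch below expresses link() in terms of linkat(); a C library reflink() would presumably wrap the proposed reflinkat() in the same way. The function name `link_via_linkat()` is made up for this example, and reflinkat() itself is not used because it is only a proposal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>    /* AT_FDCWD, open() */
#include <unistd.h>   /* linkat() */

/* Behaves like link(oldname, newname): both names are resolved
 * relative to the current working directory, and no flags are set. */
static int link_via_linkat(const char *oldname, const char *newname)
{
    assert(oldname != NULL && newname != NULL);
    return linkat(AT_FDCWD, oldname, AT_FDCWD, newname, 0);
}
```

Passing AT_FDCWD for both directory descriptors is what makes the *at() form equivalent to the older call, which is why the non-at version does not need to be a system call of its own.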

The deeper discussion, though, reveals that there are two fundamentally different views of how this system call should work. Joel Becker, who posted the reflink() patch, sees it as a new variant of the link() system call. Others, though, would like it to behave more like a file copy operation. If you see reflink() as being a type of link(), then certain implications emerge:

The reflink-as-link view requires that the new file have (to the greatest extent possible) the same metadata as the old one; in particular, it must have (at the end of the reflink() system call) the same owner, just like hard links do.
Creating low-level snapshots of filesystems (or portions thereof) is straightforward and easy. Reflinked files look just like the originals; in particular, they have (mostly) the same metadata and can share the same security context.
Disk quotas are a problem. Should a reflinked file count against the owner's disk quota, even though little or no extra storage is actually used? If so, the system must take extra steps to keep users from creating a reflink to a file they do not own; otherwise one user could exhaust another user's quota. If, instead, storage is charged against the quota of the user who created the reflink, complicated structures will be needed to track usage associated with files owned by others.
What happens if the new file's metadata - permissions or owner - are changed? In some scenarios, depending on the underlying filesystem implementation, it seems that a metadata change could require a copy-on-write of the whole file. That would turn a command like chmod into a rather heavy-weight operation.

On the other hand, if a reflink is like making a copy, the situation changes somewhat:

Security becomes a rather more complicated affair. Making a hard link doesn't require messing with SELinux security contexts, but a reflink-as-copy would require that. Permission checks (again, including security module checks) would have to become more elaborate; it would have to be clear that the user making the reflink had read access to the file.
The quota problem goes away. If a reflink is essentially a copy, then the resulting link should be owned by the user who creates it, rather than the owner of the original file. The only course which makes sense is to charge both users for the full size of the file. There are no concerns about one user exhausting another's disk quota, and there are no real difficulties with keeping disk usage information current.
Metadata changes are handled naturally, since the files are completely separate from each other.
Reflinks are no longer true snapshots; they will not work to represent the state of the filesystem at a given time. For a user whose real interest is low-level snapshotting, reflink-as-copy will not work.

On the other hand, reflink-as-copy could be used in a lot of other interesting situations; the cp command could create reflinks by default when the destination file is on the same filesystem. That would turn "cp -r" into a fast and efficient operation. They could also be used to share files between virtualized guests.
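A cp-like tool would need to share blocks when it can and copy bytes when it cannot. The sketch below is not the reflinkat() call proposed in this article; it assumes a later Linux kernel that provides the FICLONE ioctl in <linux/fs.h>, and it falls back to an ordinary copy whenever that ioctl is missing or fails (for example, because the two files are on different filesystems). The helper names `copy_bytes()` and `clone_or_copy()` are invented for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>   /* FICLONE, on kernels that have it */
#endif

/* Fallback path: copy the data the slow way. */
static int copy_bytes(int src, int dst)
{
    char buf[65536];
    ssize_t n;

    while ((n = read(src, buf, sizeof buf)) > 0) {
        char *p = buf;
        while (n > 0) {
            ssize_t w = write(dst, p, (size_t)n);
            if (w < 0)
                return -1;
            p += w;
            n -= w;
        }
    }
    return n < 0 ? -1 : 0;
}

/* Returns 1 if the blocks were shared, 0 if the data was copied,
 * and -1 on error. */
static int clone_or_copy(const char *src_path, const char *dst_path)
{
    int result = -1;

    assert(src_path != NULL && dst_path != NULL);
    int src = open(src_path, O_RDONLY);
    if (src < 0)
        return -1;
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (dst < 0) {
        close(src);
        return -1;
    }
#ifdef FICLONE
    if (ioctl(dst, FICLONE, src) == 0)
        result = 1;
#endif
    if (result != 1)
        result = (copy_bytes(src, dst) == 0) ? 0 : -1;
    close(src);
    if (close(dst) != 0)
        result = -1;
    return result;
}
```

Either way the caller ends up with a destination file holding the same data, so the tool's behavior stays the same on filesystems that cannot share blocks.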

What it comes down to is that there are real uses for both the reflink-as-link and reflink-as-copy modes of operation. So the right solution may well be to implement both modes. The flags parameter to reflinkat() can be used to distinguish between the two. Implementing both behaviors will complicate the implementation somewhat, and it muddies up what is otherwise a conceptually clean system call. But that's what happens, sometimes, when designs encounter the real world.
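How a flags argument might select between the two behaviors can be sketched as a small policy function. Everything here is hypothetical: the article does not name any flag values, so `REFLINK_SKETCH_AS_COPY` and `sketch_reflink_owner()` are invented, and refusing link-style reflinks by non-owners is just one of the quota options mentioned above, not a settled rule.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical flag: ask for copy-like rather than link-like behavior. */
#define REFLINK_SKETCH_AS_COPY 0x1

/*
 * Decide who would own the new inode.
 * Returns 0 and stores the owner in *new_owner, or -1 if this
 * sketch's policy would refuse the request.
 */
static int sketch_reflink_owner(uid_t file_owner, uid_t caller,
                                int caller_can_read, int flags,
                                uid_t *new_owner)
{
    assert(new_owner != NULL);

    if (flags & REFLINK_SKETCH_AS_COPY) {
        /* Copy-like: needs read access; the caller owns the result
         * and is charged for it. */
        if (!caller_can_read)
            return -1;
        *new_owner = caller;
        return 0;
    }

    /* Link-like: the owner is preserved. Assumed policy: only the
     * owner may do this, so nobody can eat another user's quota. */
    if (caller != file_owner)
        return -1;
    *new_owner = file_owner;
    return 0;
}
```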

The two sides of reflink()

Posted May 5, 2009 21:12 UTC (Tue) by flewellyn (subscriber, #5047) [Link] (16 responses)

I'm a little fuzzy on what benefit such a system call has in the first place. Has this been covered previously?

The two sides of reflink()

Posted May 5, 2009 21:21 UTC (Tue) by martinfick (subscriber, #4455) [Link] (8 responses)

Copy on write has huge benefits in space savings, not just in disk space, but more importantly in memory, particularly for virtualized systems. For example, the vserver project already implements a solution to this which allows many virtual servers to share the same files securely. This means that if you have 1000 servers running the same copy of apache, not only can you have only one copy on disk, but the kernel will also only keep one copy in memory (of the shared stuff like program text, of course). While you could achieve a similar sharing with hard links, this would be less secure since a breach in one system would allow the file to be modified in all the other systems. With COW, this is avoided.

The two sides of reflink()

Posted May 5, 2009 21:29 UTC (Tue) by flewellyn (subscriber, #5047) [Link]

I see. That IS beneficial. Thanks very much.

The two sides of reflink()

Posted May 6, 2009 14:25 UTC (Wed) by wilreichert (guest, #17680) [Link] (2 responses)

How is this different from deduplication at the filesystem level?

The two sides of reflink()

Posted May 6, 2009 15:09 UTC (Wed) by dlang (guest, #313) [Link]

it sounds like it's one mechanism to use for deduplication.

The two sides of reflink()

Posted May 6, 2009 17:03 UTC (Wed) by elanthis (guest, #6227) [Link]

To the filesystem, a cp isn't a copy -- it's one process reading from one file and writing to another. Figuring out that that is supposed to be a copy is very non-trivial and expensive, especially when taking into account metadata operations which aren't part of the regular file stream. I'm not sure it's even plausible to do without a second pass, e.g. a "combine files" daemon, which would still just be extra overhead.

If on the other hand cp says "this is a copy" to the kernel then the filesystem can just do the right thing. Of course, other applications will need to be modified to take advantage of the new feature, but such is the truth of most progress.

The two sides of reflink()

Posted May 7, 2009 21:03 UTC (Thu) by anton (subscriber, #25547) [Link] (3 responses)

For shared stuff like program text, all servers could use the same binaries (through mount, mount -t bind, or hard links), so that's not a good justification for reflinks, either (and if you don't trust the other servers not to write to the file, why would you trust them with access to the device at all?). Writable files that would mostly or completely be the same on both VMs would be a better example, but no concrete example comes to my mind.

The two sides of reflink()

Posted May 7, 2009 21:13 UTC (Thu) by martinfick (subscriber, #4455) [Link] (2 responses)

"why would you trust them with access to the device at all?"

You don't, usually the host system mounts a portion of the filesystem into a separate chroot for each guest server. The guests typically then have a limited root capability that does not include making device nodes so they really do not have access to the device, only the filesystem.

The two sides of reflink()

Posted May 10, 2009 18:42 UTC (Sun) by anton (subscriber, #25547) [Link] (1 responses)

"The guests typically then have a limited root capability that does not include making device nodes so they really do not have access to the device, only the filesystem."

With the limits on the root capabilities, the binaries can surely be made read-only even for the guest roots, so no reflinks are needed for the binaries.

The two sides of reflink()

Posted May 10, 2009 19:09 UTC (Sun) by martinfick (subscriber, #4455) [Link]

"The guests typically then have a limited root capability that does not include making device nodes so they really do not have access to the device, only the filesystem."
"With the limits on the root capabilities, the binaries can surely be made read-only even for the guest roots, so no reflinks are needed for the binaries."

Sure, but if you make the binaries read only you no longer have independent guest systems that can be administered without knowledge of the host or other guests. In other words, if I now want to upgrade the apache server in one guest, I can't since the binary is read only to my guest root user. With COW, no problem, as a guest admin I do not even know that my apache binary is shared with others. It is only relevant to the host (the host unifies the various guest binaries, not the guest).

The two sides of reflink()

Posted May 6, 2009 14:33 UTC (Wed) by MarkWilliamson (guest, #30166) [Link] (6 responses)

Some more possible uses:

Those folks who are fortunate enough to have their home directory on a netapp filer
have for years been able to "cd ~/.snapshot/" and find a special directory of
historical versions of their files. These are stored efficiently because of the nature of
the WAFL filesystem. With reflink, it would be possible to create a lightweight
version of historical snapshotting: you'd have a daemon run every night (for
instance) and recursively reflink the current state of all your files into a directory
tree at ~/.old-versions/<date>/ - then, if you ever needed to go back to an old
version of a file you could just look in there.

With reflinks this would be very fast and would not use up loads of disk space
(though there would still be quota concerns). It would make time-machine or Netapp
.snapshot-like functionality easy to implement efficiently on single disk systems.
Probably the most quoted reference for stuff like this is the Elephant research
filesystem, about which there are a number of decent research papers.

Another use that I've seen mentioned is the ability to make checkouts / clones in
modern version control systems go faster and be more lightweight in terms of disk
storage - for instance, cloning a git repository could transparently share all the
underlying data (including the working directory!) using reflinks. Similar tricks being
possible for the other VCSes.

Finally, you could probably have a daemon that rummages around the system, finds
identical files and unifies them on disk using reflink in order to save space.
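The first job of such a daemon is to confirm that two files really do hold the same bytes before asking the filesystem to share them. A minimal sketch of that comparison follows; the name `files_identical()` is an assumption, and a real tool would probably compare sizes or checksums first to avoid reading every pair of files in full.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if both files have identical contents, 0 if they
 * differ, and -1 if either file cannot be opened or read. */
static int files_identical(const char *a, const char *b)
{
    char buf_a[4096], buf_b[4096];
    int result = 1;

    assert(a != NULL && b != NULL);
    FILE *fa = fopen(a, "rb");
    if (fa == NULL)
        return -1;
    FILE *fb = fopen(b, "rb");
    if (fb == NULL) {
        fclose(fa);
        return -1;
    }

    for (;;) {
        size_t na = fread(buf_a, 1, sizeof buf_a, fa);
        size_t nb = fread(buf_b, 1, sizeof buf_b, fb);
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
            result = 0;          /* different length or contents */
            break;
        }
        if (na == 0)
            break;               /* both files ended together */
    }
    if (ferror(fa) || ferror(fb))
        result = -1;
    fclose(fa);
    fclose(fb);
    return result;
}
```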

Loads of cool stuff :-)

The two sides of reflink()

Posted May 6, 2009 14:34 UTC (Wed) by MarkWilliamson (guest, #30166) [Link]

Ugh, what happened to my line endings? :-( Maybe my browser did something evil ...
somehow.

The two sides of reflink()

Posted May 6, 2009 16:38 UTC (Wed) by cdarroch (subscriber, #26812) [Link] (4 responses)

Yes -- that .snapshot directory is incredibly convenient. Deleted a file by accident? No problem; there's an hourly backup in .snapshot. Rogue program deleted 1 TB of data overnight? Just reach into .snapshot and pull it all out again. Having equivalent functionality on non-NetApp hardware would be awfully nice.

The two sides of reflink()

Posted May 6, 2009 16:59 UTC (Wed) by MarkWilliamson (guest, #30166) [Link] (1 responses)

Indeed.

rdiff-backup (http://www.gnu.org/savannah-checkouts/non-gnu/rdiff-backup/) gives
somewhat similar snapshotting convenience but you have to interact with it through a
command line app. Also, it does use up extra space (although if you're backing up to
another machine / another drive for redundancy then that's just fine!).

archfs (http://code.google.com/p/archfs/) provides a Fuse interface to browse rdiff-backup
repositories. Last time I tried it it wasn't really suitable for large repositories but this may
have been fixed since then. rdiff-backup's page on related info has some other solutions:
http://www.gnu.org/savannah-checkouts/non-gnu/rdiff-backu...

.snapshot is a very nice user interface to have to old revisions.

The two sides of reflink()

Posted May 9, 2009 5:07 UTC (Sat) by TRS-80 (guest, #1804) [Link]

rdiffWeb is a nice web interface to rdiff-backup. At work we're using rdiff-backup for weekly snapshots to complement our nightly amanda tape, a 1TB drive lasted us a year.

Line endings - make sure you select HTML not plain text, as the latter doesn't do wrapping for some reason.

The two sides of reflink()

Posted May 11, 2009 0:21 UTC (Mon) by vonbrand (subscriber, #4458) [Link] (1 responses)

Please don't.

I suffered through DOS's "you can undelete files whenever you fatfingered DEL". Most of the time it worked, but Murphy's Law ensured that when you really needed to get something back, it would usually be gone for good. Unix' idea of "rm is final" is harsh, but you learn not to misplace stuff in the first place. Makes for a better experience in the long run.

The two sides of reflink()

Posted May 11, 2009 1:10 UTC (Mon) by MarkWilliamson (guest, #30166) [Link]

Netapp's .snapshot and the similar functionality reflinks can provide will give you semantics similar to a backup (a version of the file from a particular point in time, which will stay there until your backup regime removes it as too old). So it's a big improvement on DOS's "maybe you'll be able to grab the data back before the space is recycled by the filesystem". So it should at least have reliable, predictable semantics for things like accidental deletion.

Although in practice it's going to get used to undo rm occasionally, it seems to me only sensible to have something like this available so I'm able to roll back important documents and settings to previous states if I make the wrong modification, or if some program barfs over everything and corrupts things.

Users will probably have to be repeatedly reminded that, yes, they do need an independent backup on another disk somewhere because reflinks won't save you if your computer explodes. But most folks don't do proper backups *anyhow*, so I doubt it'll make that aspect of user behaviour much worse!

The two sides of reflink()

Posted May 5, 2009 21:43 UTC (Tue) by JoeBuck (subscriber, #2330) [Link]

Since the other "at" calls ("linkat", etc.) are shared with Solaris, I hope that there will be discussions with Solaris and BSD people about these calls, to help people write portable software going forward.

The two sides of reflink()

Posted May 5, 2009 21:47 UTC (Tue) by martinfick (subscriber, #4455) [Link] (18 responses)

Wouldn't such a modification be filesystem specific? Would this require all supported linux filesystems to be patched? What happens when say an EXT3 partition with files like this is used on a kernel which does not support this and one of the copies is written to?

The two sides of reflink()

Posted May 5, 2009 23:21 UTC (Tue) by adj (subscriber, #7401) [Link] (17 responses)

Not having read the patch (or any relevant standard document lately), but getting back ENOSYS in errno seems like a reasonable result. But I'm probably wrong.

The two sides of reflink()

Posted May 5, 2009 23:31 UTC (Tue) by martinfick (subscriber, #4455) [Link] (15 responses)

Uh, perhaps you are thinking that I asked what would happen if attempting to use reflink() on an unsupported FS or a kernel which does not support it. I was asking what would happen if someone tried to modify a file with a kernel that does not understand this block sharing (assuming it was created by one that does)? Would it end up simply overwriting the blocks on disk that belong to both copies effectively making it not COW, but just a plain hard link?

The two sides of reflink()

Posted May 6, 2009 0:05 UTC (Wed) by adj (subscriber, #7401) [Link]

Yeah, I did miss that. I'm going to have to imagine that a reflink-supporting ext3 filesystem would have a new feature bit set in the superblock. And hopefully would not be mountable on a non-reflink-supporting kernel. Two inodes sharing the same data blocks isn't something that any traditional UNIXy filesystem is going to understand. MS-DOS type filesystems surely don't support it either (cross linked files should sound familiar to anyone who used a DOS system in the 1980s or early 1990s.)

The two sides of reflink()

Posted May 6, 2009 0:09 UTC (Wed) by clugstj (subscriber, #4020) [Link] (4 responses)

How would you get into this situation? The call is implemented in the file system. If the file system doesn't support it, the file isn't shared.

The two sides of reflink()

Posted May 6, 2009 0:14 UTC (Wed) by martinfick (subscriber, #4455) [Link] (3 responses)

Yes, but I can run multiple kernels on the same machine (at different times). So if I create the shared file with a modern spiffy new kernel which supports this feature and then boot an older kernel and write to one of the files, what happens?

The two sides of reflink()

Posted May 6, 2009 0:49 UTC (Wed) by JoeBuck (subscriber, #2330) [Link] (1 responses)

You'd need flags in the filesystem that would prevent the filesystem from being mounted by a kernel that lacks the needed feature.

The two sides of reflink()

Posted May 6, 2009 5:59 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

Specifically the ext filesystem family contains two sets of bit flags for this purpose. Each bit represents a feature which a particular implementation may or may not be aware of. Implementations are supposed to check one set before attempting to mount the filesystem at all, and another set in addition if the mount is read-write.

It also contains per-inode flags, so that implementations can be warned that they're missing a feature needed to read or update a particular file, in this case the implementation should fail open() for that file.
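The mount-time check described above might look roughly like this. It is a sketch only: the struct, the function name and the `SKETCH_FEATURE_REFLINK` bit are invented for illustration and do not correspond to real ext2/3/4 feature numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit; not a real ext feature number. */
#define SKETCH_FEATURE_REFLINK 0x8000u

/* The two flag sets as they might be stored in a superblock. */
struct sketch_super {
    uint32_t incompat;   /* must be understood to mount at all */
    uint32_t ro_compat;  /* must be understood to mount read-write */
};

enum mount_decision { MOUNT_REFUSE, MOUNT_READ_ONLY, MOUNT_READ_WRITE };

/* known_* are the feature bits this implementation understands. */
static enum mount_decision check_features(const struct sketch_super *sb,
                                          uint32_t known_incompat,
                                          uint32_t known_ro_compat)
{
    assert(sb != NULL);
    if (sb->incompat & ~known_incompat)
        return MOUNT_REFUSE;       /* unknown feature: do not touch */
    if (sb->ro_compat & ~known_ro_compat)
        return MOUNT_READ_ONLY;    /* safe to read, not to write */
    return MOUNT_READ_WRITE;
}
```

An older kernel that has never heard of the reflink bit would then refuse the mount (or refuse to write) rather than scribble over blocks shared by two inodes.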

Of course poor quality implementations from third parties may be missing some or all of these checks. Fortunately the worst implementation I'm aware of as of this moment is read-only, so any problems only occur when reading files with that implementation and if/when they reboot into Linux everything is fine again.

The two sides of reflink()

Posted May 6, 2009 19:31 UTC (Wed) by clugstj (subscriber, #4020) [Link]

You are f***ed. I would suggest you don't do something this crazy. It would not be trivial to add a feature like this to a file system and assure that the older version knows about it.

a reflink would be a new type of inode

Posted May 6, 2009 3:00 UTC (Wed) by xoddam (guest, #2322) [Link] (8 responses)

Hard links are not distinct inodes (as reflinks must be); rather they are multiple directory entries pointing to a single inode. Symlinks are (in most posixish filesystems) a special kind of inode. Reflinks will be yet another special kind of inode; if the filesystem code does not recognise an inode type it will return an error when you attempt to open it (maybe -ENOSYS, I'm not sure).

You could also specify flags in the superblock as others have suggested, so as to prevent a filesystem with reflinks from being mounted at all by a kernel which does not support them.

a reflink would be a new type of inode

Posted May 6, 2009 8:55 UTC (Wed) by epa (subscriber, #39769) [Link] (7 responses)

There seems to be some asymmetry. If you make another hard link to a file, then the two links are equal in status and you can't see which is the original. But a reflink is to be a special inode type and different somehow from the original version of the file.

a reflink would be a new type of inode

Posted May 6, 2009 9:46 UTC (Wed) by xoddam (guest, #2322) [Link]

Yes, like a symlink.

a reflink would be a new type of inode

Posted May 6, 2009 12:24 UTC (Wed) by corbet (editor, #1) [Link] (5 responses)

A reflink would be a new type of inode only in so far as the filesystem must track the fact that it has blocks shared with another inode. There is no difference, though, between an inode created by a reflink and the file's original inode; they both become reflink inodes. In Btrfs, I believe, things are even less different; the tracking of the shared blocks is done at the extent level.

a reflink would be a new type of inode

Posted May 6, 2009 16:07 UTC (Wed) by masoncl (subscriber, #47138) [Link] (2 responses)

It is more accurate to say (for both btrfs and ocfs2) that the result of the reflink is an entirely new file. It has a known starting point (the contents and permissions of the original).

The two files can be changed independently without affecting each other. One could be deleted, truncated, expanded, chmoded, have new acls set, etc.

The actual block sharing is just an implementation detail...it could be implemented as a lazy copy for example.

a reflink would be a new type of inode

Posted May 6, 2009 19:40 UTC (Wed) by adj (subscriber, #7401) [Link] (1 responses)

That leaves the "link" part of the interface name sounding terribly misleading.

Surely there's a better name for a make-a-copy-of-this-inode-and-all-its-data-and-maybe-do-some-cool-COW-magic system call. It's too bad that the "dup" family of system call names is already used for something with a completely separate meaning.

madcow()?

Posted May 6, 2009 21:47 UTC (Wed) by AnswerGuy (guest, #1256) [Link]

magically-allocated-data-copy-on-write ... :)

a reflink would be a new type of inode

Posted May 7, 2009 10:12 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

"A reflink would be a new type of inode only in so far as the filesystem must track the fact that it has blocks shared with another inode. There is no difference, though, between an inode created by a reflink and the file's original inode; they both become reflink inodes."

Ah, so it is symmetric, somewhat like making a hard link with 'ln'.

If one of the two reflink inodes is then removed, does the other one revert back to being a normal file? If not, is there any difference between a lone reflink inode and a normal one? Couldn't all files be reflinks?

a reflink would be a new type of inode

Posted May 7, 2009 22:59 UTC (Thu) by dlang (guest, #313) [Link]

I believe that one of the features of a reflink is that it tells you what else it's linked with, so that you can find it to break the COW

if this isn't the case, it should be, for just this reason.

as I understand things, if a file is changed it may break the linking entirely (copying the entire file), or it may break the link partially (still sharing the common parts of the file, but with the differences being separated) at the option of the filesystem

The two sides of reflink()

Posted May 9, 2009 14:25 UTC (Sat) by sdbrady (guest, #56894) [Link]

Hmm, surely you'd expect ENOTSUP if the filesystem doesn't support the operation, and ENOSYS if the kernel doesn't support it?
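A caller could tell the two cases apart by looking at errno after a failed call. The sketch below assumes the convention suggested in this comment; which error codes a reflink call would actually return is not settled anywhere in this discussion, and the enum and function names are made up.

```c
#include <assert.h>
#include <errno.h>

enum reflink_failure {
    FAIL_NO_SYSCALL,      /* the kernel does not implement the call */
    FAIL_FS_UNSUPPORTED,  /* the filesystem cannot do the operation */
    FAIL_OTHER            /* permissions, missing file, and so on */
};

/* Map an errno value from a failed call onto the cases above. */
static enum reflink_failure classify_failure(int err)
{
    assert(err != 0);
    if (err == ENOSYS)
        return FAIL_NO_SYSCALL;
    /* ENOTSUP and EOPNOTSUPP have the same value on Linux, so they
     * are compared with || instead of being two switch cases. */
    if (err == ENOTSUP || err == EOPNOTSUPP)
        return FAIL_FS_UNSUPPORTED;
    return FAIL_OTHER;
}
```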

The two sides of reflink()

Posted May 5, 2009 22:34 UTC (Tue) by ikm (guest, #493) [Link] (2 responses)

What seems to be an omission in this article is some background information on why this was called for in the first place. From the article, it looks like this syscall came completely out of the blue and that's just it, but I'm sure there's some background, and that kind of stuff is always nice to read and know.

The two sides of reflink()

Posted May 5, 2009 22:55 UTC (Tue) by JoeBuck (subscriber, #2330) [Link] (1 responses)

Read the article again; the uses are described (snapshots, more efficient copying).

The two sides of reflink()

Posted May 5, 2009 23:02 UTC (Tue) by ikm (guest, #493) [Link]

I wasn't referring to the uses, but rather to a history of the whole thing.

Pity userspace on where this is done on a simple minded filesystem

Posted May 5, 2009 23:36 UTC (Tue) by adj (subscriber, #7401) [Link] (6 responses)


  reflink("myold200Gbytefile", "myreflinked200Gbytefile");
  myfd = open("myreflinked200Gbytefile", (O_WRONLY|O_APPEND|O_LARGEFILE));
  res = write(myfd, "hereismyshortlittlemessage", 27);
  /*
   * wait, like _forever_ while i just added 200Gbytes to the amount
   * of space used on this filesystem.  Not to mention getting back
   * ENOSPC, because, even though there were 100Gbytes available
   * before this 27 byte long write, now there are none.  reflinks
   * sound neat, but sometimes have unexpected teeth.
   */

Re: Pity userspace on where this is done on a simple minded filesystem

Posted May 6, 2009 0:00 UTC (Wed) by daney (guest, #24551) [Link] (4 responses)

On your filesystem that has 200Gbyte blocks you would indeed wait a while.

On my filesystem with somewhat smaller blocks presumably the wait would be shorter.

Re: Pity userspace on where this is done on a simple minded filesystem

Posted May 6, 2009 0:12 UTC (Wed) by adj (subscriber, #7401) [Link] (3 responses)

Where does my filesystem get 200Gbyte blocks? Per the article, "a write to either file will cause some or all of the blocks to be duplicated." I have to think the bookkeeping involved in the former would be simpler and is more likely to be implemented as a prototype in a non-OCFS2 filesystem. At least in the short term.

Re: Pity userspace on where this is done on a simple minded filesystem

Posted May 6, 2009 0:17 UTC (Wed) by adj (subscriber, #7401) [Link] (1 responses)

s/former/latter/ in the above.

Re: Pity userspace on where this is done on a simple minded filesystem

Posted May 6, 2009 0:57 UTC (Wed) by JoeBuck (subscriber, #2330) [Link]

But appending (to logs, for example) is such a common special case that it would be likely to be supported at an early stage, so that you have an efficient representation when you take a snapshot of a filesystem, and then you let it go on, appending lots of stuff to your log files, blocks belonging to a common prefix can be shared.

Re: Pity userspace on where this is done on a simple minded filesystem

Posted May 6, 2009 1:01 UTC (Wed) by corbet (editor, #1) [Link]

btrfs, at least, has COW wired pretty deeply into it. So it will only copy blocks which have actually been changed. Results on other filesystems may vary.
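The difference between the two strategies in this thread comes down to simple arithmetic. The sketch below estimates how many bytes must be duplicated for one write under each policy; the function names and the 4096-byte block size used in the test values are assumptions for illustration, not measurements of any filesystem.

```c
#include <assert.h>
#include <stdint.h>

/* Whole-file policy: any write duplicates the entire file. */
static uint64_t cow_bytes_whole_file(uint64_t file_size)
{
    return file_size;
}

/* Per-block policy: only existing blocks overlapped by the write
 * [offset, offset + len) are duplicated. Blocks that lie entirely
 * beyond the old end of file are new and need no copy. */
static uint64_t cow_bytes_per_block(uint64_t file_size, uint64_t offset,
                                    uint64_t len, uint64_t block_size)
{
    assert(block_size > 0);
    if (len == 0)
        return 0;

    uint64_t nblocks = (file_size + block_size - 1) / block_size;
    uint64_t first = offset / block_size;
    uint64_t last = (offset + len - 1) / block_size;

    if (first >= nblocks)
        return 0;                 /* write starts past the last block */
    if (last >= nblocks)
        last = nblocks - 1;
    return (last - first + 1) * block_size;
}
```

With these assumptions, a 27-byte append to a 200GB file costs the whole 200GB under the first policy and at most one block under the second, which is the gap the comments above are arguing about.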

Pity userspace on where this is done on a simple minded filesystem

Posted May 6, 2009 18:49 UTC (Wed) by jzbiciak (guest, #5246) [Link]

Simple answer: Don't implement this on simple minded filesystems. :-) Save it for filesystems that COW effectively (Btrfs), or can be taught to do so (ext4). I doubt anyone's planning to implement this over xiafs or vfat.

Slightly fancier answer: Perhaps this is another thing to "ulimit"? Return EPERM or EFBIG (file is too big) if the limit's exceeded.

Any introspection into reflinked files?

Posted May 6, 2009 2:12 UTC (Wed) by knobunc (guest, #4678) [Link]

Will there be any way to tell if a reflink is still viable? I can tell if two files are hardlinked by looking at the inodes.

But can I tell if two files are reflinked? Can I tell if one has changed after the fact? If the smarter COW semantics that only copy necessary blocks in a changed file is used, is there any way to tell which portions of the file are common, or at least what percentage?

Can you reflink something more than once... I assume so. And if so, how does that impact the above questions.

Thanks, -ben
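For comparison, the hard-link check mentioned in this comment is easy to write with stat(). The helper name `are_hardlinked()` is invented here; nothing in the discussion says an equivalent test would exist for reflinked files, since those have distinct inodes.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Returns 1 if both paths name the same inode on the same device,
 * 0 if they do not, and -1 if either path cannot be examined. */
static int are_hardlinked(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

The device number has to be compared as well, because inode numbers are only unique within one filesystem.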

The two sides of reflink()

Posted May 6, 2009 7:34 UTC (Wed) by butlerm (subscriber, #13312) [Link] (18 responses)

It seems to me that "reflink" is an extraordinarily bad name for something
that from a user point of view does not appear to make any kind of link at
all.

The reasonable options are full copy semantics or (read-only) snapshot
semantics. A writable non-link "link" that you can't change the owner of
doesn't seem like a very useful construct to me.

The general idea here is excellent of course. However, I suggest this system
call would make a lot more sense if it were named "fclone" (or something like
that) with full copy semantics. It should preserve the owner, permissions
and data to begin with, and then the caller should be able to change all
three after the fact. Sort of like "fork" or "clone".

The two sides of reflink()

Posted May 6, 2009 13:22 UTC (Wed) by Cato (guest, #7643) [Link]

Definitely some sort of "clone" name would be better - it's rather like forking a process as you say.

The two sides of reflink()

Posted May 9, 2009 21:34 UTC (Sat) by giraffedata (guest, #1954) [Link] (16 responses)

"It should preserve the owner, permissions and data to begin with, and then the caller should be able to change all three after the fact."

The caller should be able to change the owner of the file? Or change permissions on a file, whether he owns it or not? These are not normal Unix things, except for the superuser.

That's why there's a decision to be made (or not -- has anyone considered doing one of each?)

I would use a reflink that makes new metadata (and I can then copy whatever metadata I can, like cp --archive, if I like). I wouldn't use one that clones the metadata -- that function is better served by a snapshot function that does a whole directory tree at once.

The two sides of reflink()

Posted May 9, 2009 23:31 UTC (Sat) by butlerm (subscriber, #13312) [Link] (15 responses)

Assuming they have the necessary privileges to do so, obviously.

The two sides of reflink()

Posted May 10, 2009 1:45 UTC (Sun) by giraffedata (guest, #1954) [Link] (14 responses)

"Assuming they have the necessary privileges to do so, obviously."

Well not obviously, since that assumption leaves the situation equally weird.

But now that you've said that's what you have in mind, maybe you can elaborate. Would an unprivileged person be able to use reflink? What would happen if he did it on a file he doesn't own? Would it be possible for someone to create a file he can't access? One whose space is charged to someone else?

The two sides of reflink()

Posted May 10, 2009 10:30 UTC (Sun) by nix (subscriber, #2304) [Link] (12 responses)

It already is possible. Create a directory readable/executable only by
yourself; hardlink someone else's file into it; wait for that other person
to delete it. Now you've stolen that person's quota.

The two sides of reflink()

Posted May 10, 2009 18:03 UTC (Sun) by giraffedata (guest, #1954) [Link] (11 responses)

Yes, and you misspoke. The other person didn't delete the file because no one can delete a file. The system deletes one automatically when it's no longer accessible. The space charging problem is one of the many reasons this innovative Unix concept should actually be scrapped. Along with the related concepts that directories are kernel level things, and you can't give a file a name.

The two sides of reflink()

Posted May 10, 2009 18:36 UTC (Sun) by nix (subscriber, #2304) [Link] (2 responses)

I suspect that if you actually tried to scrap link() et al, a million MTA
authors would try to kill you.

(I'd be rather annoyed, too: I use hardlinks all the time.)

The two sides of reflink()

Posted May 11, 2009 5:50 UTC (Mon) by giraffedata (guest, #1954) [Link] (1 responses)

I suspect that if you actually tried to scrap link() et al, a million MTA authors would try to kill you.

Well, I wouldn't scrap link() et al -- I'd just move them out of the kernel and add the ability to explicitly create and delete files independent of directory links.

The two sides of reflink()

Posted May 11, 2009 6:10 UTC (Mon) by nix (subscriber, #2304) [Link]

We already have the ability to create and delete files independently of
directory links: mkstemp(). What you can't do is easily create them
outside of /tmp, or link them to names at a later date.
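
A minimal sketch of that pattern, with hypothetical helper names. Note that mkstemp() does briefly create a directory entry; the nameless state comes from the unlink() that follows it.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a temporary file from `template_path` (which must end in
 * "XXXXXX"), immediately removes its name, and returns the open
 * descriptor. Returns -1 on error. */
int open_nameless_file(char *template_path)
{
    assert(template_path != NULL);

    int fd = mkstemp(template_path);    /* creates a name and opens it */
    if (fd < 0)
        return -1;
    if (unlink(template_path) != 0) {   /* drop the only name */
        close(fd);
        return -1;
    }
    return fd;   /* still readable and writable until the last close */
}

/* Link count of an open descriptor, or -1 on error. */
long fd_link_count(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? (long)st.st_nlink : -1;
}
```

On a typical local Linux filesystem fd_link_count() should report zero for such a descriptor; the file goes away when the descriptor is closed.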

The two sides of reflink()

Posted May 11, 2009 5:42 UTC (Mon) by butlerm (subscriber, #13312) [Link] (7 responses)

Personally, I would rather not have to reboot every time I installed or
updated virtually any piece of system software. That would be the direct
consequence of discarding the directory entry / inode distinction in Unix -
to regress to the reboot happy world of Win32.

The two sides of reflink()

Posted May 11, 2009 5:57 UTC (Mon) by giraffedata (guest, #1954) [Link] (6 responses)

That would be the direct consequence of discarding the directory entry / inode distinction in Unix -

But what I described makes the distinction even larger. Today directory entries and inodes are tied together tightly by the kernel.

But I'm curious about how this affects having to reboot when you update system software.

The two sides of reflink()

Posted May 11, 2009 6:09 UTC (Mon) by nix (subscriber, #2304) [Link] (5 responses)

What? Directory entries and inodes aren't tied together in the fs model at
all, except that each directory entry increases i_nlink in the
corresponding inode by one. Reflinks simply would ensure that i_nlink was
*at least* one but would not increment it (probably by maintaining a
separate i_reflink count), and the semantics of unlink() would change to
ensure that a reflink()/unlink() sequence had the same (no) effect on link
count as link()/unlink().

You could no longer rely on unlink() decrementing i_nlink, but I don't
know of *anything* that depends on this (some things doubtless do but it
can't be common).

It breaks updating running software because that involves unlinking files
that are in use, and because the update process generally consists of
creating a file with a temporary name, filling it out, and rename()ing it
over the original (that's an implicit unlink right there, and it does not
fail). If you break that you break every package manager on the face of
the earth.
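
The update sequence described above might look roughly like this. It is a sketch, not what any particular package manager does; write_file(), replace_file() and the temporary name are all invented for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Writes `text` to `path`, replacing any previous content.
 * Returns 0 on success, -1 on error. */
int write_file(const char *path, const char *text)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    int ok = write(fd, text, len) == (ssize_t)len;
    return (close(fd) == 0 && ok) ? 0 : -1;
}

/* Replaces `target` with `new_text`: write a temporary file, then
 * rename() it over the original. `tmp` is a hypothetical temporary
 * name that must live on the same filesystem as `target`. */
int replace_file(const char *target, const char *tmp, const char *new_text)
{
    assert(target != NULL && tmp != NULL && new_text != NULL);

    if (write_file(tmp, new_text) != 0)
        return -1;
    return rename(tmp, target);   /* implicit unlink of the old inode */
}
```

A process that opened the old file before the rename() keeps reading the old contents through its descriptor, which is why running programs survive the update.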

The two sides of reflink()

Posted May 11, 2009 7:37 UTC (Mon) by giraffedata (guest, #1954) [Link] (3 responses)

What? Directory entries and inodes aren't tied together in the fs model at all, except that each directory entry increases i_nlink in the corresponding inode by one.

That's a pretty tight bond, especially since i_nlink controls when the inode/file gets deleted. Also, you can't make the kernel create an inode without also creating a directory entry, and except temporarily, an inode cannot exist without at least one directory entry associated with it. Those are the bonds that it would be nice to get away from, as pretty much every OS except Unix does.

Reflinks simply would ...

We must be talking about different things. I was just talking about what Unix should do instead of what it always has (as a fundamental design point) done. Nothing to do with reflinks. And I'm also not claiming it would be compatible with any existing Unix application, but I do believe every application could be done at least as well with a kernel without automatic file deletion and directories.

The two sides of reflink()

Posted May 11, 2009 18:32 UTC (Mon) by nix (subscriber, #2304) [Link] (2 responses)

*And directories*? You're dreaming. Directories are in practice essential
for scalability. If they weren't in the kernel, they'd need to be in some
userspace library (ew).

The two sides of reflink()

Posted May 12, 2009 1:22 UTC (Tue) by giraffedata (guest, #1954) [Link] (1 responses)

If they weren't in the kernel, they'd need to be in some userspace library (ew).

They work better in user space -- there's more flexibility there and the basic concept of a directory has nothing to do with resource allocation between users, which is what the kernel is for. Many OSes do them outside the kernel. The only reason they have to be in the kernel in Unix is that the kernel deletes files implicitly based on directory references. And as I've been saying, we'd be better off without that.

The two sides of reflink()

Posted May 12, 2009 19:59 UTC (Tue) by nix (subscriber, #2304) [Link]

Putting directories outside the kernel also means that a whole pile of
things POSIX guarantees become, as near as I can tell, impossible to provide.
I can't see any way to keep cross-directory rename() atomic, for instance.

Also it's a grotesque security hole: now you can't keep stuff secret by
hiding it in unreadable directories anymore.

Periodically there are proposals to introduce an open()-by-inode-number
syscall. They are always shot down. I don't know what sort of system
you're thinking of, but it isn't Unix.

(And if you're going to go that route, make the inums 1024 bits long and
bingo, you've got a capability-based system.)

The two sides of reflink()

Posted May 11, 2009 15:38 UTC (Mon) by butlerm (subscriber, #13312) [Link]

There is no practical way for a filesystem to implement "reflinks" such that
the reflink shares the same inode. The ownership, permissions, and file data
of both the original file and the new file all have to be modifiable
independently. To make any sense, they would also need separate inode
numbers.

The two sides of reflink()

Posted May 11, 2009 5:34 UTC (Mon) by butlerm (subscriber, #13312) [Link]

You raise an excellent point. The useful implementation of "reflink" would
have semantics as a file copy. Since an unprivileged user cannot change the
ownership of an existing file, a general purpose implementation *must* be
able to change the ownership of the new file to that of the current user in
the process.

Otherwise you would get a highly restricted operation that would only be
useful to unprivileged users for making efficient copies of files they
already own.

The two sides of reflink()

Posted May 6, 2009 8:15 UTC (Wed) by amw (subscriber, #29081) [Link] (9 responses)

Why should my reflinked file be counted against my quota at all? I'm not using any more storage (at least not initially).

The two sides of reflink()

Posted May 6, 2009 8:52 UTC (Wed) by lmb (subscriber, #39048) [Link] (1 responses)

The answer to the quota question points out that this call is only the first step.

There is, of course, also a need for reflinkstat() or whatever it is going to be called - one needs to find out how many blocks are part of a specific COW, and also reflinkdiff(), which enumerates the (meta-)data blocks which differ from the link target (needed for efficient rsync/backup).

Then, it also becomes possible to use quota to account just the difference, i.e. the actual space used by the reflink.

A further complication arises when we look at using reflink() on directories, which would of course also be quite desirable (for snapshots, multi-use chroots etc). That will be an interesting direction to explore ;-)

The two sides of reflink()

Posted May 9, 2009 23:52 UTC (Sat) by butlerm (subscriber, #13312) [Link]

ZFS (for example) uses a COW design where it isn't remotely practical to
figure out which files / snapshots share which blocks in a reasonable amount
of time. I imagine BTRFS is similar.

However both aforementioned filesystems store block checksums, so perhaps a
more practical means would be to add an interface that returns the block
checksums of a file range (if not the block offsets) to user space, and
generate a candidate duplication list from that.

The two sides of reflink()

Posted May 6, 2009 9:02 UTC (Wed) by epa (subscriber, #39769) [Link] (3 responses)

Because if you make a reflink to a file, and your quota is unaffected, you might later find that when simply changing the contents of the reflink (without making it any larger) your quota is exhausted. Userspace really doesn't expect that a seemingly harmless operation like changing one byte in the middle of a file could suddenly exhaust quota or free disk space.

That said, it makes no sense to account disk quota conservatively while lying about the amount of free space really available. The two should be treated the same, so if reflinking a large file has no effect on the reported free space, it shouldn't cost quota either.

The two sides of reflink()

Posted May 6, 2009 9:22 UTC (Wed) by nix (subscriber, #2304) [Link] (1 responses)

Any userspace that does not expect write() in the middle of a file to
potentially fail with -ENOSPC is broken. Such write()s can fail even now
thanks to sparse files. It is true that currently userspace can rely on
the second write() in a ftell()/write()/fseek()/write() sequence not
failing, but this seems a rather thin thing to rely on, to me.
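
The sparse-file case can be seen with a few lines of C. This is a sketch with a made-up function name; it assumes st_blocks is counted in 512-byte units, as on Linux.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates (or truncates) `path` as a file of `size` bytes without
 * writing any data, then reports its apparent size and the bytes
 * actually allocated. Returns 0 on success, -1 on error. */
int make_sparse(const char *path, off_t size, off_t *apparent, off_t *allocated)
{
    assert(path != NULL && apparent != NULL && allocated != NULL && size >= 0);

    struct stat st;
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) != 0 || fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    *apparent = st.st_size;
    *allocated = (off_t)st.st_blocks * 512;   /* assumes 512-byte units */
    return 0;
}
```

On a filesystem that supports holes, the allocated figure stays far below the apparent size, so a later write() into the middle has to allocate blocks and can fail for lack of space.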

The two sides of reflink()

Posted May 6, 2009 10:42 UTC (Wed) by epa (subscriber, #39769) [Link]

Thanks for pointing this out. So if currently, creating a 10 gigabyte sparse file does not subtract 10 gigs from your quota nor from the free space report, giving the possibility that writing to the middle of an existing file can run out of space, then making a reflink to a 10 gig file should be treated the same way. There is precedent.

The two sides of reflink()

Posted May 6, 2009 12:33 UTC (Wed) by vonbrand (subscriber, #4458) [Link]

Even worse, if I reflink() a file of yours, and you then change it (or delete it, whatever), suddenly my quota usage goes up without any action on my part.

quota behaviour of reflink()

Posted May 6, 2009 11:21 UTC (Wed) by pjm (guest, #2080) [Link] (1 responses)

One issue is what happens when one of the copies is removed, especially if there's more than one owner involved. Presumably the result must be that removing a file can increase someone else's quota usage. One wonders then what should happen if that other user has already exhausted their quota, and what bugs may be triggered by not expecting whichever policy is chosen.

The reasons for different quota behaviour would be if it results in different (and more desirable) user behaviour, or if it helps system administrators choose better quota limits, i.e. if it results in less frequent filesystem-full situations for a given amount of user productivity.

How much are quotas used these days, and for what uses? Can people comment on the usefulness of different quota policies in the context of specific use cases?

As to whether different quota behaviour would result in different user behaviour (e.g. encouraging taking steps for files to be reflink'ed rather than copied), I wonder how many quota'd users would have the necessary knowledge for it to change their behaviour.

quota behaviour of reflink()

Posted May 6, 2009 12:47 UTC (Wed) by utoddl (guest, #1232) [Link]

Interesting points, but remember: quota can be impacted by unrelated actions by other processes owned by the same or other users at any time, so user space needs to respond to space-related issues regardless of what quotas existed "moments ago". In fact, quota can be affected when the user takes no actions at all; the admin can change quotas, file systems can be resized, etc., and a user who was not over quota may suddenly be so even with no changes to his files.

Space-reporting tools can report variations of (1) actual space used by extant allocated blocks (modulo sparse files), (2) space free, (3) space that would be used in the case where a naive copy were made to another file system -- all of which are valid and different numbers. The "simple" question of how much storage is used is in fact complicated, our desire for simple answers notwithstanding.

The two sides of reflink()

Posted May 16, 2009 0:37 UTC (Sat) by efexis (guest, #26355) [Link]

There's lots of discussion of a similar issue known as "over committing", and also in the memory limiting cgroup code, where I think a lot can be learnt on different solutions to the problem.

One example - if you share somebody's file and it doesn't count as your own space, then the original owner deletes their copy, the space is now solely allocated to you, and so should count against your quota? This act could push you way over your quota; what if the file alone is bigger than your complete quota?

What if two people are sharing a file that you delete? Should what's on your quota be divided equally amongst the two of them? If you want to be fair, surely the thing to do when you share a file from someone else is add half of its size to your own quota, and remove half of it from theirs. If you share the file, you should share the cost?

If it goes straight onto your quota when you share it, you simply don't have to worry about any of these - but at the same time you lose certain benefits of sharing the data. With VMs, over-committing can often be specified as a %, maybe a similar option for sharing files... sharing a file could save you a certain % of its size from your quota, which then means if you suddenly become the sole owner, you only have added the remaining % rather than the full 100%. The % chosen would be linked to the average reference count of all blocks that you own, as this shows the likelihood of any block being or becoming solely yours.
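
The two charging schemes floated here are simple arithmetic. A sketch, with invented function names and no claim about how any real quota code works:

```c
#include <assert.h>
#include <stdint.h>

/* Equal split: each of `sharers` owners is charged the same part of
 * the file (integer division, so any remainder is dropped). */
uint64_t equal_share_charge(uint64_t size_bytes, unsigned sharers)
{
    assert(sharers > 0);
    return size_bytes / sharers;
}

/* Discounted charge: a sharer is charged the file size minus
 * `discount_pct` percent. Becoming the sole owner later would add
 * only the discounted part, not the full size. */
uint64_t discounted_charge(uint64_t size_bytes, unsigned discount_pct)
{
    assert(discount_pct <= 100);
    unsigned keep = 100 - discount_pct;
    /* Split the multiplication to avoid overflow on huge sizes. */
    return size_bytes / 100 * keep + size_bytes % 100 * keep / 100;
}
```

With a 25% discount, sharing a 1000-byte file charges 750 bytes up front, and becoming the sole owner later adds the remaining 250.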

COW links?

Posted May 6, 2009 12:08 UTC (Wed) by arnd (subscriber, #8866) [Link] (1 responses)

Interesting to see this come back years after the similar concept of COW
links was shot down. The interface is rather different, but the effective
use cases are practically the same.

COW links were last covered five years ago, in
http://lwn.net/Articles/77972/.

COW links?

Posted May 6, 2009 12:25 UTC (Wed) by corbet (editor, #1) [Link]

I meant to work that in, but didn't... cowlinks were a similar idea, but a different implementation. They were really links, while "reflinks" create a new inode. So there's quite a bit of difference in how they work.

The two sides of reflink()

Posted May 6, 2009 16:03 UTC (Wed) by iabervon (subscriber, #722) [Link] (1 responses)

I don't see why the reflink-as-copy idea wouldn't work for snapshotting. Sure, a snapshot of a file wouldn't include the metadata, but a snapshot of the directory containing a file would presumably copy the metadata (and would presumably require all sorts of interesting authorizations so that users can't squirrel away copies of setuid binaries and wait for security holes to be found in them). The difference in design seems to me only to affect where exactly the boundary between the part of the filesystem that gets snapshotted and the part that doesn't is for a particular choice of arguments.

Of course, being able to copy directories with this sort of mechanism is probably trickier to implement, but any good snapshot mechanism would have to handle this sort of thing.

The two sides of reflink()

Posted May 6, 2009 17:44 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

Most metadata is part of the inode, not the directory entry, so snapshotting the directory wouldn't have any effect on the metadata (apart from the filename). In the reflink-as-copy scenario the owner, group, permissions (e.g. setuid), and the like would be tracked separately for each reflink. Presumably the limitations concerning these attributes would be the same as for creating new files; e.g. the reflink would be created with the owner, group, and umask of the current user, and only the data would be shared (COW).

Given their small average size, it probably makes more sense to just copy the directories themselves outright rather than implementing them as COW. Recursive copies would be similar to "cp -dlR", just with reflink() in place of link().

There are better ways to snapshotting

Posted May 7, 2009 20:54 UTC (Thu) by anton (subscriber, #25547) [Link] (3 responses)

Many posters seem to think of snapshotting as a good use of reflinks, but snapshots should be created on the whole file system at once, atomically (so you get a snapshot that is consistent across files). File systems like btrfs have this functionality, and one can also use LVM for that (with a little fs support), so snapshots don't seem to be a good use of, or justification for, reflink().

There are better ways to snapshotting

Posted May 10, 2009 0:05 UTC (Sun) by butlerm (subscriber, #13312) [Link] (2 responses)

One of the advantages of ZFS style snapshotting is it is a constant time
operation that is far more efficient than walking a directory tree and
creating thousands of new inodes. It doesn't work for making a (writable)
snapshot of a snapshot, however, and that is a very important use case for
virtualization, for example.

Copy-on-write filesystems could use a good interface (like this one) to make
efficient *copies* of files in the same (extended) filesystem. That is
something that ZFS and BTRFS both appear to lack, and which for portability
reasons is easy to downgrade to an ordinary file copy.
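
The "downgrade to an ordinary file copy" idea can be sketched as follows. The reflink() call as proposed is not assumed to exist here; instead the sketch tries the Linux FICLONE ioctl, an interface that is not mentioned anywhere in this discussion and is used purely as a stand-in for a block-sharing request. The function names are invented, and EINTR handling is left out.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>   /* FICLONE, where the headers provide it */
#endif

/* Plain read/write copy from `in` to `out`. Returns 0 on success. */
int copy_bytes(int in, int out)
{
    char buf[65536];
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0) {
        ssize_t done = 0;
        while (done < n) {
            ssize_t w = write(out, buf + done, (size_t)(n - done));
            if (w <= 0)
                return -1;
            done += w;
        }
    }
    return n == 0 ? 0 : -1;
}

/* Copies `src` to `dst`, sharing blocks when the filesystem can.
 * Returns 1 if the data was shared, 0 if it was copied, -1 on error. */
int clone_or_copy(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);

    int result = -1;
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (out < 0) {
        close(in);
        return -1;
    }
#ifdef FICLONE
    if (ioctl(out, FICLONE, in) == 0)   /* assumption: stand-in for reflink */
        result = 1;
#endif
    if (result != 1)                    /* not supported here: ordinary copy */
        result = copy_bytes(in, out) == 0 ? 0 : -1;
    close(in);
    if (close(out) != 0)
        result = -1;
    return result;
}
```

Either way the destination is a new file created by the caller, so it gets the caller's ownership rather than the source file's.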

There are better ways to snapshotting

Posted May 10, 2009 18:36 UTC (Sun) by anton (subscriber, #25547) [Link] (1 responses)

One of the advantages of ZFS style snapshotting is it is a constant time operation that is far more efficient than walking a directory tree and creating thousands of new inodes. It doesn't work for making a (writable) snapshot of a snapshot, however

That may be a limitation of ZFS, but it's not a necessary limitation. LLFS can do it, and nothing I have read about Btrfs indicates that it cannot do it.

Copy-on-write filesystems could use a good interface (like this one) to make efficient *copies* of files in the same (extended) filesystem.

Yes, implementing cp more efficiently seems to be the main benefit I see from this system call.

There are better ways to snapshotting

Posted May 11, 2009 5:22 UTC (Mon) by butlerm (subscriber, #13312) [Link]

Netapp does writable snapshots of writable snapshots. The advantage of the
ZFS scheme is you can have an unlimited number of snapshots, where with
Netapp these days I believe you get 256.

The ZFS scheme is based on logical sequence numbers, Netapp uses block maps.
I understand that BTRFS allows nested writable snapshots, but that comes at a
general performance cost that ZFS doesn't have.

different API

Posted May 14, 2009 21:12 UTC (Thu) by mslusarz (guest, #58587) [Link]

It would be more general and powerful to reflink (or clone or ...) only part of a file. The API could probably look like splice:

long splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

So don't add a new syscall. Implement splice for two descriptors on one filesystem!