One of the discussions your editor missed at the recent Linux Storage and Filesystem workshop covered the proposed reflink() system call. Fortunately, the filesystem developers have now filled in the relevant information with a detailed email exchange, complete with patches. We now have a proposed system call which has created more open questions than answers. The creation of a new core system call requires a lot of thought, so a close look at these questions would seem to be called for.
The proposed system calls are pretty simple:
int reflink(const char *oldname, const char *newname);
int reflinkat(int old_dir_fd, const char *oldname,
              int new_dir_fd, const char *newname, int flags);
These system calls function much like link() and linkat() but with an important exception: rather than create a new link pointing to an existing inode, they create a new inode which happens to share the same disk blocks as the existing file. So, at the conclusion of a reflink() call, newname looks very much like a copy of oldname, but the actual data blocks have not been duplicated. The files are copy-on-write, though, meaning that a write to either file will cause some or all of the blocks to be duplicated. A change to one of the files will thus not be visible in the other file. In a sense, a reflink() call behaves like a low-cost file copy operation, though how copy-like it will be remains to be seen.
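The copy-on-write behavior described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the names toy_file, toy_create(), toy_reflink(), and toy_write() are invented for illustration and do not come from the posted patch, and a real filesystem tracks shared blocks or extents on disk rather than with malloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NBLOCKS 4
#define TOY_BLKSIZE 8

/* A data block with a reference count (hypothetical toy structure). */
struct toy_block {
    int refs;
    char data[TOY_BLKSIZE];
};

/* A "file" is just an inode-like table of block pointers. */
struct toy_file {
    struct toy_block *blocks[TOY_NBLOCKS];
};

/* Create a file whose blocks are all filled with the byte 'fill'. */
static int toy_create(struct toy_file *f, char fill)
{
    for (int i = 0; i < TOY_NBLOCKS; i++) {
        struct toy_block *b = malloc(sizeof(*b));
        if (b == NULL) {
            while (i-- > 0)
                free(f->blocks[i]);
            return -1;
        }
        b->refs = 1;
        memset(b->data, fill, TOY_BLKSIZE);
        f->blocks[i] = b;
    }
    return 0;
}

/* "Reflink": the new file gets its own table but shares every block. */
static void toy_reflink(const struct toy_file *oldf, struct toy_file *newf)
{
    for (int i = 0; i < TOY_NBLOCKS; i++) {
        newf->blocks[i] = oldf->blocks[i];
        newf->blocks[i]->refs++;
    }
}

/* Write one byte; duplicate the block first if it is still shared. */
static int toy_write(struct toy_file *f, int blk, int off, char c)
{
    assert(blk >= 0 && blk < TOY_NBLOCKS && off >= 0 && off < TOY_BLKSIZE);
    struct toy_block *b = f->blocks[blk];
    if (b->refs > 1) {
        struct toy_block *copy = malloc(sizeof(*copy));
        if (copy == NULL)
            return -1;          /* toy analogue of running out of space */
        memcpy(copy->data, b->data, TOY_BLKSIZE);
        copy->refs = 1;
        b->refs--;
        f->blocks[blk] = copy;
        b = copy;
    }
    b->data[off] = c;
    return 0;
}
```

After toy_reflink(&a, &b), a write through toy_write(&b, 0, 0, 'y') gives b a private copy of block 0 while the other three blocks stay shared, and the change is not visible through a.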
The first question to arise was: does the kernel really need to provide both the reflink() and reflinkat() system calls? Most of the other *at() calls are paired with the non-at versions because the latter came first. Since Unix-like systems have had link() for a long time, it cannot be removed without breaking applications. So linkat() had to go in as a separate call. But reflink() is new, so it can just as easily be implemented in the C library as a wrapper around reflinkat(). That is how things will probably be done in the end.
The deeper discussion, though, reveals that there are two fundamentally different views of how this system call should work. Joel Becker, who posted the reflink() patch, sees it as a new variant of the link() system call. Others, though, would like it to behave more like a file copy operation. If you see reflink() as being a type of link(), then certain implications emerge:
On the other hand, if a reflink is like making a copy, the situation changes somewhat:
On the other hand, reflink-as-copy could be used in a lot of other interesting situations; the cp command could create reflinks by default when the destination file is on the same filesystem. That would turn "cp -r" into a fast and efficient operation. They could also be used to share files between virtualized guests.
What it comes down to is that there are real uses for both the reflink-as-link and reflink-as-copy modes of operation. So the right solution may well be to implement both modes. The flags parameter to reflinkat() can be used to distinguish between the two. Implementing both behaviors will complicate the implementation somewhat, and it muddies up what is otherwise a conceptually clean system call. But that's what happens, sometimes, when designs encounter the real world.
The two sides of reflink()
Posted May 5, 2009 21:12 UTC (Tue) by flewellyn (subscriber, #5047) [Link]
The two sides of reflink()
Posted May 5, 2009 21:21 UTC (Tue) by martinfick (subscriber, #4455) [Link]
The two sides of reflink()
Posted May 5, 2009 21:29 UTC (Tue) by flewellyn (subscriber, #5047) [Link]
The two sides of reflink()
Posted May 6, 2009 14:25 UTC (Wed) by wilreichert (subscriber, #17680) [Link]
The two sides of reflink()
Posted May 6, 2009 15:09 UTC (Wed) by dlang (subscriber, #313) [Link]
The two sides of reflink()
Posted May 6, 2009 17:03 UTC (Wed) by elanthis (guest, #6227) [Link]
If on the other hand cp says "this is a copy" to the kernel then the filesystem can just do the right thing. Of course, other applications will need to be modified to take advantage of the new feature, but such is the truth of most progress.
The two sides of reflink()
Posted May 7, 2009 21:03 UTC (Thu) by anton (subscriber, #25547) [Link]
For shared stuff like program text, all servers could use the same binaries (through mount, mount -t bind, or hard links), so that's not a good justification for reflinks, either (and if you don't trust the other servers not to write to the file, why would you trust them with access to the device at all?). Writable files that would mostly or completely be the same on both VMs would be a better example, but no concrete example comes to my mind.
The two sides of reflink()
Posted May 7, 2009 21:13 UTC (Thu) by martinfick (subscriber, #4455) [Link]
You don't; usually the host system mounts a portion of the filesystem into a separate chroot for each guest server. The guests typically then have a limited root capability that does not include making device nodes, so they really do not have access to the device, only the filesystem.
The two sides of reflink()
Posted May 10, 2009 18:42 UTC (Sun) by anton (subscriber, #25547) [Link]
The guests typically then have a limited root capability that does not include making device nodes so they really do not have access to the device, only the filesystem.
With the limits on the root capabilities, the binaries can surely be made read-only even for the guest roots, so no reflinks are needed for the binaries.
The two sides of reflink()
Posted May 10, 2009 19:09 UTC (Sun) by martinfick (subscriber, #4455) [Link]
Sure, but if you make the binaries read only you no longer have independent guest systems that can be administered without knowledge of the host or other guests. In other words, if I now want to upgrade the apache server in one guest, I can't since the binary is read only to my guest root user. With COW, no problem, as a guest admin I do not even know that my apache binary is shared with others. It is only relevant to the host (the host unifies the various guest binaries, not the guest).
The two sides of reflink()
Posted May 6, 2009 14:33 UTC (Wed) by MarkWilliamson (subscriber, #30166) [Link]
Those folks who are fortunate enough to have their home directory on a netapp filer
have for years been able to "cd ~/.snapshot/" and find a special directory of
historical versions of their files. These are stored efficiently because of the nature of
the WAFL filesystem. With reflink, it would be possible to create a lightweight
version of historical snapshotting: you'd have a daemon run every night (for
instance) and recursively reflink the current state of all your files into a directory
tree at ~/.old-versions/<date>/ - then, if you ever needed to go back to an old
version of a file you could just look in there.
With reflinks this would be very fast and would not use up loads of disk space
(though there would still be quota concerns). It would make time-machine or Netapp
.snapshot-like functionality easy to implement efficiently on single disk systems.
Probably the most quoted reference for stuff like this is the Elephant research
filesystem, about which there are a number of decent research papers.
Another use that I've seen mentioned is the ability to make checkouts / clones in
modern version control systems go faster and be more lightweight in terms of disk
storage - for instance, cloning a git repository could transparently share all the
underlying data (including the working directory!) using reflinks. Similar tricks
would be possible for the other VCSes.
Finally, you could probably have a daemon that rummages around the system, finds
identical files and unifies them on disk using reflink in order to save space.
Loads of cool stuff :-)
The two sides of reflink()
Posted May 6, 2009 14:34 UTC (Wed) by MarkWilliamson (subscriber, #30166) [Link]
The two sides of reflink()
Posted May 6, 2009 16:38 UTC (Wed) by cdarroch (guest, #26812) [Link]
The two sides of reflink()
Posted May 6, 2009 16:59 UTC (Wed) by MarkWilliamson (subscriber, #30166) [Link]
rdiff-backup (http://www.gnu.org/savannah-checkouts/non-gnu/rdiff-backup/) gives
somewhat similar snapshotting convenience but you have to interact with it through a
command line app. Also, it does use up extra space (although if you're backing up to
another machine / another drive for redundancy then that's just fine!).
archfs (http://code.google.com/p/archfs/) provides a Fuse interface to browse rdiff-backup
repositories. Last time I tried it it wasn't really suitable for large repositories but this may
have been fixed since then. rdiff-backup's page on related info has some other solutions:
http://www.gnu.org/savannah-checkouts/non-gnu/rdiff-backu...
.snapshot is a very nice user interface to have to old revisions.
The two sides of reflink()
Posted May 9, 2009 5:07 UTC (Sat) by TRS-80 (subscriber, #1804) [Link]
rdiffWeb is a nice web interface to rdiff-backup. At work we're using rdiff-backup for weekly snapshots to complement our nightly amanda tape; a 1TB drive lasted us a year.
Line endings - make sure you select HTML not plain text, as the latter doesn't do wrapping for some reason.
The two sides of reflink()
Posted May 11, 2009 0:21 UTC (Mon) by vonbrand (guest, #4458) [Link]
Please don't.
I suffered through DOS's "you can undelete files whenever you fatfingered DEL". Most of the time it worked, but Murphy's Law ensured that when you really needed to get something back, it would usually be gone for good.
Unix' idea of "rm is final" is harsh, but you learn not to misplace stuff in the first place. Makes for a better experience in the long run.
The two sides of reflink()
Posted May 11, 2009 1:10 UTC (Mon) by MarkWilliamson (subscriber, #30166) [Link]
Although in practice it's going to get used to undo rm occasionally, it seems to me only sensible to have something like this available so I'm able to roll back important documents and settings to previous states if I make the wrong modification, or if some program barfs over everything and corrupts things.
Users will probably have to be repeatedly reminded that, yes, they do need an independent backup on another disk somewhere because reflinks won't save you if your computer explodes. But most folks don't do proper backups *anyhow*, so I doubt it'll make that aspect of user behaviour much worse!
The two sides of reflink()
Posted May 5, 2009 21:43 UTC (Tue) by JoeBuck (guest, #2330) [Link]
Since the other "at" calls ("linkat", etc.) are shared with Solaris, I hope that there will be discussions with Solaris and BSD people about these calls, to help people write portable software going forward.
The two sides of reflink()
Posted May 5, 2009 21:47 UTC (Tue) by martinfick (subscriber, #4455) [Link]
The two sides of reflink()
Posted May 5, 2009 23:21 UTC (Tue) by adj (subscriber, #7401) [Link]
The two sides of reflink()
Posted May 5, 2009 23:31 UTC (Tue) by martinfick (subscriber, #4455) [Link]
The two sides of reflink()
Posted May 6, 2009 0:05 UTC (Wed) by adj (subscriber, #7401) [Link]
The two sides of reflink()
Posted May 6, 2009 0:09 UTC (Wed) by clugstj (subscriber, #4020) [Link]
The two sides of reflink()
Posted May 6, 2009 0:14 UTC (Wed) by martinfick (subscriber, #4455) [Link]
The two sides of reflink()
Posted May 6, 2009 0:49 UTC (Wed) by JoeBuck (guest, #2330) [Link]
You'd need flags in the filesystem that would prevent the filesystem from being mounted by a kernel that lacks the needed feature.
The two sides of reflink()
Posted May 6, 2009 5:59 UTC (Wed) by tialaramex (subscriber, #21167) [Link]
It also contains per-inode flags, so that implementations can be warned that they're missing a feature needed to read or update a particular file; in this case the implementation should fail open() for that file.
Of course poor quality implementations from third parties may be missing some or all of these checks. Fortunately the worst implementation I'm aware of as of this moment is read-only, so any problems only occur when reading files with that implementation and if/when they reboot into Linux everything is fine again.
The two sides of reflink()
Posted May 6, 2009 19:31 UTC (Wed) by clugstj (subscriber, #4020) [Link]
a reflink would be a new type of inode
Posted May 6, 2009 3:00 UTC (Wed) by xoddam (subscriber, #2322) [Link]
You could also specify flags in the superblock as others have suggested, so as to prevent a filesystem with reflinks from being mounted at all by a kernel which does not support them.
a reflink would be a new type of inode
Posted May 6, 2009 8:55 UTC (Wed) by epa (subscriber, #39769) [Link]
a reflink would be a new type of inode
Posted May 6, 2009 9:46 UTC (Wed) by xoddam (subscriber, #2322) [Link]
a reflink would be a new type of inode
Posted May 6, 2009 12:24 UTC (Wed) by corbet (editor, #1) [Link]
A reflink would be a new type of inode only in so far as the filesystem must track the fact that it has blocks shared with another inode. There is no difference, though, between an inode created by a reflink and the file's original inode; they both become reflink inodes. In Btrfs, I believe, things are even less different; the tracking of the shared blocks is done at the extent level.
a reflink would be a new type of inode
Posted May 6, 2009 16:07 UTC (Wed) by masoncl (subscriber, #47138) [Link]
The two files can be changed independently without affecting each other. One could be deleted, truncated, expanded, chmoded, have new acls set, etc.
The actual block sharing is just an implementation detail...it could be implemented as a lazy copy for example.
a reflink would be a new type of inode
Posted May 6, 2009 19:40 UTC (Wed) by adj (subscriber, #7401) [Link]
Surely there's a better name for a make-a-copy-of-this-inode-and-all-its-data-and-maybe-do-some-cool-COW-magic system call. It's too bad that the "dup" family of system call names is already used for something with a completely separate meaning.
madcow()?
Posted May 6, 2009 21:47 UTC (Wed) by AnswerGuy (subscriber, #1256) [Link]
magically-allocated-data-copy-on-write ... :)
a reflink would be a new type of inode
Posted May 7, 2009 10:12 UTC (Thu) by epa (subscriber, #39769) [Link]
A reflink would be a new type of inode only in so far as the filesystem must track the fact that it has blocks shared with another inode. There is no difference, though, between an inode created by a reflink and the file's original inode; they both become reflink inodes.
Ah, so it is symmetric, somewhat like making a hard link with 'ln'.
If one of the two reflink inodes is then removed, does the other one revert back to being a normal file? If not, is there any difference between a lone reflink inode and a normal one? Couldn't all files be reflinks?
a reflink would be a new type of inode
Posted May 7, 2009 22:59 UTC (Thu) by dlang (subscriber, #313) [Link]
if this isn't the case, it should be, for just this reason.
as I understand things, if a file is changed it may break the linking entirely (copying the entire file), or it may break the link partially (still sharing the common parts of the file, but with the differences being separated) at the option of the filesystem
The two sides of reflink()
Posted May 9, 2009 14:25 UTC (Sat) by sdbrady (guest, #56894) [Link]
The two sides of reflink()
Posted May 5, 2009 22:34 UTC (Tue) by ikm (subscriber, #493) [Link]
The two sides of reflink()
Posted May 5, 2009 22:55 UTC (Tue) by JoeBuck (guest, #2330) [Link]
Read the article again; the uses are described (snapshots, more efficient copying).
The two sides of reflink()
Posted May 5, 2009 23:02 UTC (Tue) by ikm (subscriber, #493) [Link]
Pity userspace on where this is done on a simple minded filesystem
Posted May 5, 2009 23:36 UTC (Tue) by adj (subscriber, #7401) [Link]
reflink("myold200Gbytefile", "myreflinked200Gbytefile");
myfd = open("myreflinked200Gbytefile", (O_WRONLY|O_APPEND|O_LARGEFILE));
res = write(myfd, "hereismyshortlittlemessage", 26);
/*
* wait, like _forever_ while i just added 200Gbytes to the amount
* of space used on this filesystem. Not to mention getting back
* ENOSPC, because, even though there were 100Gbytes available
* before this 26 byte long write, now there are none. reflinks
* sound neat, but sometimes have unexpected teeth.
*/
Re: Pity userspace on where this is done on a simple minded filesystem
Posted May 6, 2009 0:00 UTC (Wed) by daney (subscriber, #24551) [Link]
On my filesystem with somewhat smaller blocks presumably the wait would be shorter.
Re: Pity userspace on where this is done on a simple minded filesystem
Posted May 6, 2009 0:12 UTC (Wed) by adj (subscriber, #7401) [Link]
Re: Pity userspace on where this is done on a simple minded filesystem
Posted May 6, 2009 0:17 UTC (Wed) by adj (subscriber, #7401) [Link]
Re: Pity userspace on where this is done on a simple minded filesystem
Posted May 6, 2009 0:57 UTC (Wed) by JoeBuck (guest, #2330) [Link]
But appending (to logs, for example) is such a common special case that it would likely be supported at an early stage. That way, when you take a snapshot of a filesystem and then let it go on appending lots of stuff to your log files, the blocks belonging to the common prefix can still be shared.
Re: Pity userspace on where this is done on a simple minded filesystem
Posted May 6, 2009 1:01 UTC (Wed) by corbet (editor, #1) [Link]
btrfs, at least, has COW wired pretty deeply into it. So it will only copy blocks which have actually been changed. Results on other filesystems may vary.
Pity userspace on where this is done on a simple minded filesystem
Posted May 6, 2009 18:49 UTC (Wed) by jzbiciak (subscriber, #5246) [Link]
Slightly fancier answer: Perhaps this is another thing to "ulimit"? Return EPERM or EFBIG (file is too big) if the limit's exceeded.
Any introspection into reflinked files?
Posted May 6, 2009 2:12 UTC (Wed) by knobunc (subscriber, #4678) [Link]
But can I tell if two files are reflinked? Can I tell if one has changed after the fact? If the smarter COW semantics that only copy the necessary blocks of a changed file are used, is there any way to tell which portions of the file are common, or at least what percentage?
Can you reflink something more than once? I assume so. And if so, how does that impact the above questions?
Thanks, -ben
The two sides of reflink()
Posted May 6, 2009 7:34 UTC (Wed) by butlerm (guest, #13312) [Link]
The reasonable options are full copy semantics or (read-only) snapshot
semantics. A writable non-link "link" that you can't change the owner of
doesn't seem like a very useful construct to me.
The general idea here is excellent of course. However, I suggest this system
call would make a lot more sense if it were named "fclone" (or something like
that) with full copy semantics. It should preserve the owner, permissions
and data to begin with, and then the caller should be able to change all
three after the fact. Sort of like "fork" or "clone".
The two sides of reflink()
Posted May 6, 2009 13:22 UTC (Wed) by Cato (subscriber, #7643) [Link]
The two sides of reflink()
Posted May 9, 2009 21:34 UTC (Sat) by giraffedata (subscriber, #1954) [Link]
It should preserve the owner, permissions and data to begin with, and then the caller should be able to change all three after the fact.
The caller should be able to change the owner of the file? Or change permissions on a file, whether he owns it or not? These are not normal Unix things, except for the superuser.
That's why there's a decision to be made (or not -- has anyone considered doing one of each?)
I would use a reflink that makes new metadata (and I can then copy whatever metadata I can, like cp --archive, if I like). I wouldn't use one that clones the metadata -- that function is better served by a snapshot function that does a whole directory tree at once.
The two sides of reflink()
Posted May 9, 2009 23:31 UTC (Sat) by butlerm (guest, #13312) [Link]
The two sides of reflink()
Posted May 10, 2009 1:45 UTC (Sun) by giraffedata (subscriber, #1954) [Link]
Assuming they have the necessary privileges to do so, obviously.
Well not obviously, since that assumption leaves the situation equally weird.
But now that you've said that's what you have in mind, maybe you can elaborate. Would an unprivileged person be able to use reflink? What would happen if he did it on a file he doesn't own? Would it be possible for someone to create a file he can't access? One whose space is charged to someone else?
The two sides of reflink()
Posted May 10, 2009 10:30 UTC (Sun) by nix (subscriber, #2304) [Link]
The two sides of reflink()
Posted May 10, 2009 18:03 UTC (Sun) by giraffedata (subscriber, #1954) [Link]
Yes, and you misspoke. The other person didn't delete the file because no one can delete a file. The system deletes one automatically when it's no longer accessible. The space charging problem is one of the many reasons this innovative Unix concept should actually be scrapped. Along with the related concepts that directories are kernel level things, and you can't give a file a name.
The two sides of reflink()
Posted May 10, 2009 18:36 UTC (Sun) by nix (subscriber, #2304) [Link]
(I'd be rather annoyed, too: I use hardlinks all the time.)
The two sides of reflink()
Posted May 11, 2009 5:50 UTC (Mon) by giraffedata (subscriber, #1954) [Link]
I suspect that if you actually tried to scrap link() et al, a million MTA authors would try to kill you.
Well, I wouldn't scrap link() et al -- I'd just move them out of the kernel and add the ability to explicitly create and delete files independent of directory links.
The two sides of reflink()
Posted May 11, 2009 6:10 UTC (Mon) by nix (subscriber, #2304) [Link]
The two sides of reflink()
Posted May 11, 2009 5:42 UTC (Mon) by butlerm (guest, #13312) [Link]
The two sides of reflink()
Posted May 11, 2009 5:57 UTC (Mon) by giraffedata (subscriber, #1954) [Link]
That would be the direct consequence of discarding the directory entry / inode distinction in Unix -
But what I described makes the distinction even larger. Today directory entries and inodes are tied together tightly by the kernel.
But I'm curious about how this affects having to reboot when you update system software.
The two sides of reflink()
Posted May 11, 2009 6:09 UTC (Mon) by nix (subscriber, #2304) [Link]
You could no longer rely on unlink() decrementing i_nlink, but I don't
know of *anything* that depends on this (some things doubtless do but it
can't be common).
It breaks updating running software because that involves unlinking files
that are in use, and because the update process generally consists of
creating a file with a temporary name, filling it out, and rename()ing it
over the original (that's an implicit unlink right there, and it does not
fail). If you break that you break every package manager on the face of
the earth.
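The update pattern described above (temporary file, fill it out, rename() over the original) can be sketched as follows. The helper name replace_file() and the ".tmp" suffix are choices made for this example only; real package managers also handle permissions, fsync(), and unique temporary names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Replace the file at 'path' with 'contents' by writing a temporary
 * file next to it and rename()ing that over the original. Returns 0
 * on success, -1 on failure.
 */
static int replace_file(const char *path, const char *contents)
{
    assert(path != NULL && contents != NULL);

    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    FILE *fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;

    size_t len = strlen(contents);
    if (fwrite(contents, 1, len, fp) != len) {
        fclose(fp);
        unlink(tmp);
        return -1;
    }
    if (fclose(fp) != 0) {
        unlink(tmp);
        return -1;
    }

    /* The implicit unlink: the old inode loses this name here. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

A process that opened the old file before the rename() keeps reading the old contents through its descriptor, while any later open() of the same name sees the new file.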
The two sides of reflink()
Posted May 11, 2009 7:37 UTC (Mon) by giraffedata (subscriber, #1954) [Link]
What? Directory entries and inodes aren't tied together in the fs model at all, except that each directory entry increases i_nlink in the corresponding inode by one.
That's a pretty tight bond, especially since i_nlink controls when the inode/file gets deleted. Also, you can't make the kernel create an inode without also creating a directory entry, and except temporarily, an inode cannot exist without at least one directory entry associated with it. Those are the bonds that it would be nice to get away from, as pretty much every OS except Unix does.
Reflinks simply would ...
We must be talking about different things. I was just talking about what Unix should do instead of what it always has (as a fundamental design point) done. Nothing to do with reflinks. And I'm also not claiming it would be compatible with any existing Unix application, but I do believe every application could be done at least as well with a kernel without automatic file deletion and directories.
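For readers who want to see the link-count bond discussed in this exchange from user space, here is a small sketch using only standard POSIX calls. The helper names link_count() and create_empty() are made up for this example; st_nlink is the user-visible counterpart of the kernel's i_nlink.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the link count of 'path', or -1 if it cannot be stat()ed. */
static long link_count(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}

/* Create an empty file at 'path'; returns 0 on success, -1 on failure. */
static int create_empty(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    return close(fd);
}
```

Calling create_empty("a"), then link("a", "b"), then unlink("a") takes the count reported by link_count() from 1 to 2 and back to 1; the file itself only goes away once the last name (and the last open descriptor) is gone.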
The two sides of reflink()
Posted May 11, 2009 18:32 UTC (Mon) by nix (subscriber, #2304) [Link]
The two sides of reflink()
Posted May 12, 2009 1:22 UTC (Tue) by giraffedata (subscriber, #1954) [Link]
If they weren't in the kernel, they'd need to be in some userspace library (ew).
They work better in user space -- there's more flexibility there and the basic concept of a directory has nothing to do with resource allocation between users, which is what the kernel is for. Many OSes do them outside the kernel. The only reason they have to be in the kernel in Unix is that the kernel deletes files implicitly based on directory references. And as I've been saying, we'd be better off without that.
The two sides of reflink()
Posted May 12, 2009 19:59 UTC (Tue) by nix (subscriber, #2304) [Link]
Also it's a grotesque security hole: now you can't keep stuff secret by
hiding it in unreadable directories anymore.
Periodically there are proposals to introduce an open()-by-inode-number
syscall. They are always shot down. I don't know what sort of system
you're thinking of, but it isn't Unix.
(And if you're going to go that route, make the inums 1024 bits long and
bingo, you've got a capability-based system.)
The two sides of reflink()
Posted May 11, 2009 15:38 UTC (Mon) by butlerm (guest, #13312) [Link]
The two sides of reflink()
Posted May 11, 2009 5:34 UTC (Mon) by butlerm (guest, #13312) [Link]
Otherwise you would get a highly restricted operation that would only be
useful to unprivileged users for making efficient copies of files they
already own.
The two sides of reflink()
Posted May 6, 2009 8:15 UTC (Wed) by amw (subscriber, #29081) [Link]
The two sides of reflink()
Posted May 6, 2009 8:52 UTC (Wed) by lmb (subscriber, #39048) [Link]
There is, of course, also a need for reflinkstat() or whatever it is going to be called - one needs to find out how many blocks are part of a specific COW, and also reflinkdiff(), which enumerates the (meta-)data blocks which differ from the link target (needed for efficient rsync/backup).
Then, it also becomes possible to use quota to account just the difference, i.e. the actual space used by the reflink.
A further complication arises when we look at using reflink() on directories, which would of course also be quite desirable (for snapshots, multi-use chroots etc). That will be an interesting direction to explore ;-)
The two sides of reflink()
Posted May 9, 2009 23:52 UTC (Sat) by butlerm (guest, #13312) [Link]
However both aforementioned filesystems store block checksums, so perhaps a
more practical means would be to add an interface that returns the block
checksums of a file range (if not the block offsets) to user space, and
generate a candidate duplication list from that.
The two sides of reflink()
Posted May 6, 2009 9:02 UTC (Wed) by epa (subscriber, #39769) [Link]
That said, it makes no sense to account disk quota conservatively while lying about the amount of free space really available. The two should be treated the same, so if reflinking a large file has no effect on the reported free space, it shouldn't cost quota either.
The two sides of reflink()
Posted May 6, 2009 9:22 UTC (Wed) by nix (subscriber, #2304) [Link]
The two sides of reflink()
Posted May 6, 2009 10:42 UTC (Wed) by epa (subscriber, #39769) [Link]
The two sides of reflink()
Posted May 6, 2009 12:33 UTC (Wed) by vonbrand (guest, #4458) [Link]
Even worse, if I reflink() a file of yours, and you then change it (or delete it, whatever), suddenly my quota goes up without any action on my part.
quota behaviour of reflink()
Posted May 6, 2009 11:21 UTC (Wed) by pjm (subscriber, #2080) [Link]
The reasons for different quota behaviour would be if it results in different (and more desirable) user behaviour, or if it helps system administrators choose better quota limits, i.e. if it results in less frequent filesystem-full situations for a given amount of user productivity.
How much are quotas used these days, and for what uses ? Can people comment on the usefulness of different quota policies in the context of specific use cases?
As to whether different quota behaviour would result in different user behaviour (e.g. encouraging taking steps for files to be reflink'ed rather than copied), I wonder how many quota'd users would have the necessary knowledge for it to change their behaviour.
quota behaviour of reflink()
Posted May 6, 2009 12:47 UTC (Wed) by utoddl (subscriber, #1232) [Link]
Interesting points, but remember: quota can be impacted by unrelated actions by other processes owned by the same or other users at any time, so user space needs to respond to space-related issues regardless of what quotas existed "moments ago". In fact, quota can be affected when the user takes no actions at all; the admin can change quotas, file systems can be resized, etc., and a user who was not over quota may suddenly be so even with no changes to his files.
Space-reporting tools can report variations of (1) actual space used by extant allocated blocks (modulo sparse files), (2) space free, (3) space that would be used in the case where a naive copy were made to another file system -- all of which are valid and different numbers. The "simple" question of how much storage is used is in fact complicated, our desire for simple answers notwithstanding.
The two sides of reflink()
Posted May 16, 2009 0:37 UTC (Sat) by efexis (guest, #26355) [Link]
One example - if you share somebody's file and it doesn't count as your own space, then the original owner deletes their copy, the space is now solely allocated to you, and so should count against your quota? This act could push you way over your quota; what if the file alone is bigger than your complete quota?
What if two people are sharing a file that you delete? Should what's on your quota be divided equally amongst the two of them? If you want to be fair, surely the thing to do when you share a file from someone else is add half of its size to your own quota, and remove half of it from theirs. If you share the file, you should share the cost?
If it goes straight onto your quota when you share it, you simply don't have to worry about any of these - but you also do at the same time lose certain benefits of sharing the data. With VMs, over-committing can often be specified as a %, maybe a similar option for sharing files... sharing a file could save you a certain % of its size from your quota, which then means if you suddenly become the sole owner, you only have added the remaining % rather than the full 100%. The % chosen would be linked to the average reference count of all blocks that you own, as this shows the likelihood of any block being or becoming solely yours.
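The "share the cost" idea above comes down to simple arithmetic, sketched here under assumptions of my own: an equal split among all sharers, rounded up, with the invented helper names quota_share() and quota_jump(). No filesystem is known to account quota this way.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes charged to each owner when a file of 'size' bytes is shared
 * equally by 'sharers' owners. Rounds up so that the charges never
 * add up to less than the real space used.
 */
static uint64_t quota_share(uint64_t size, unsigned sharers)
{
    assert(sharers > 0);
    return size / sharers + (size % sharers != 0);
}

/*
 * How much each remaining owner's charge grows when one of
 * 'sharers' owners deletes their copy.
 */
static uint64_t quota_jump(uint64_t size, unsigned sharers)
{
    assert(sharers > 1);
    return quota_share(size, sharers - 1) - quota_share(size, sharers);
}
```

With a 100-byte file and two sharers, each is charged 50 bytes, and the survivor's charge jumps by another 50 when the other owner deletes their copy, which is exactly the surprise described in the first paragraph of this comment.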
COW links?
Posted May 6, 2009 12:08 UTC (Wed) by arnd (subscriber, #8866) [Link]
COW links were last covered five years ago, in
http://lwn.net/Articles/77972/.
COW links?
Posted May 6, 2009 12:25 UTC (Wed) by corbet (editor, #1) [Link]
I meant to work that in, but didn't... cowlinks were a similar idea, but a different implementation. They were really links, while "reflinks" create a new inode. So there's quite a bit of difference in how they work.
The two sides of reflink()
Posted May 6, 2009 16:03 UTC (Wed) by iabervon (subscriber, #722) [Link]
Of course, being able to copy directories with this sort of mechanism is probably trickier to implement, but any good snapshot mechanism would have to handle this sort of thing.
The two sides of reflink()
Posted May 6, 2009 17:44 UTC (Wed) by nybble41 (subscriber, #55106) [Link]
Given their small average size, it probably makes more sense to just copy the directories themselves outright rather than implementing them as COW. Recursive copies would be similar to "cp -dlR", just with reflink() in place of link().
There are better ways of snapshotting
Posted May 7, 2009 20:54 UTC (Thu) by anton (subscriber, #25547) [Link]
Many posters seem to think of snapshotting as a good use of reflinks, but snapshots should be created on the whole file system at once, atomically (so you get a snapshot that is consistent across files). File systems like btrfs have this functionality, and one can also use LVM for that (with a little fs support), so snapshots don't seem to be a good use and justification of reflink().
There are better ways of snapshotting
Posted May 10, 2009 0:05 UTC (Sun) by butlerm (guest, #13312) [Link]
Copy-on-write filesystems could use a good interface (like this one) to make
efficient *copies* of files in the same (extended) filesystem. That is
something that ZFS and BTRFS both appear to lack, and which for portability
reasons is easy to downgrade to an ordinary file copy.
There are better ways of snapshotting
Posted May 10, 2009 18:36 UTC (Sun) by anton (subscriber, #25547) [Link]
One of the advantages of ZFS style snapshotting is it is a constant time operation that is far more efficient than walking a directory tree and creating thousands of new inodes. It doesn't work for making a (writable) snapshot of a snapshot, however
That may be a limitation of ZFS, but it's not a necessary limitation. LLFS can do it, and nothing I have read about Btrfs indicates that it cannot do it.
Copy-on-write filesystems could use a good interface (like this one) to make efficient *copies* of files in the same (extended) filesystem.
Yes, implementing cp more efficiently seems to be the main benefit I see from this system call.
There are better ways of snapshotting
Posted May 11, 2009 5:22 UTC (Mon) by butlerm (guest, #13312) [Link]
The ZFS scheme is based on logical sequence numbers; NetApp uses block maps.
I understand that BTRFS allows nested writable snapshots, but that comes at a
general performance cost that ZFS doesn't have.
different API
Posted May 14, 2009 21:12 UTC (Thu) by mslusarz (subscriber, #58587) [Link]
long splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);
So don't add a new syscall. Implement splice for two descriptors on one filesystem!
Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds