The two sides of reflink()
The proposed system calls are pretty simple:
    int reflink(const char *oldname, const char *newname);
    int reflinkat(int old_dir_fd, const char *oldname,
                  int new_dir_fd, const char *newname, int flags);
These system calls function much like link() and linkat() but with an important exception: rather than create a new link pointing to an existing inode, they create a new inode which happens to share the same disk blocks as the existing file. So, at the conclusion of a reflink() call, newname looks very much like a copy of oldname, but the actual data blocks have not been duplicated. The files are copy-on-write, though, meaning that a write to either file will cause some or all of the blocks to be duplicated. A change to one of the files will thus not be visible in the other file. In a sense, a reflink() call behaves like a low-cost file copy operation, though how copy-like it will be remains to be seen.
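For readers who want to experiment with these semantics, here is a minimal sketch. It does not use the proposed reflink() call; instead it assumes a Linux system whose headers provide the FICLONE ioctl (kernels from 4.5 on), which asks the filesystem to make the destination share the source's blocks copy-on-write. The helper name clone_or_copy() and the fall-back-to-byte-copy policy are choices made for this illustration only.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONE */

/*
 * Create newname as a copy-on-write clone of oldname if the filesystem
 * supports it; otherwise fall back to an ordinary byte copy.
 * Returns 1 if blocks are shared, 0 if a plain copy was made, -1 on error.
 */
int clone_or_copy(const char *oldname, const char *newname)
{
    int src = open(oldname, O_RDONLY);
    if (src < 0)
        return -1;

    int dst = open(newname, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (dst < 0) {
        close(src);
        return -1;
    }

    int result = 1;
    if (ioctl(dst, FICLONE, src) != 0) {
        /* Any failure (no support, different filesystems, ...) means
         * we copy the data the slow way instead. */
        char buf[65536];
        ssize_t n;

        result = 0;
        while (result == 0 && (n = read(src, buf, sizeof buf)) > 0) {
            char *p = buf;
            while (n > 0) {
                ssize_t w = write(dst, p, (size_t)n);
                if (w < 0) {
                    result = -1;
                    break;
                }
                p += w;
                n -= w;
            }
        }
        if (result == 0 && n < 0)
            result = -1;
    }

    close(src);
    if (close(dst) != 0)
        result = -1;
    return result;
}
```

Either way, the two names refer to separate inodes afterward: a later write to one is not visible through the other, which is the property that distinguishes this operation from link().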
The first question to arise was: does the kernel really need to provide both the reflink() and reflinkat() system calls? Most of the other *at() calls are paired with the non-at versions because the latter came first. Since Unix-like systems have had link() for a long time, it cannot be removed without breaking applications. So linkat() had to go in as a separate call. But reflink() is new, so it can just as easily be implemented in the C library as a wrapper around reflinkat(). That is how things will probably be done in the end.
The deeper discussion, though, reveals that there are two fundamentally different views of how this system call should work. Joel Becker, who posted the reflink() patch, sees it as a new variant of the link() system call. Others, though, would like it to behave more like a file copy operation. If you see reflink() as being a type of link(), then certain implications emerge:
- The reflink-as-link view requires that the new file have (to the
greatest extent possible) the same metadata as the old one; in
particular, it must have (at the end of the reflink() system
call) the same owner, just like hard links do.
- Creating low-level snapshots of filesystems (or portions thereof) is
straightforward and easy. Reflinked files look just like the
originals; in particular, they have (mostly) the same metadata and can
share the same security context.
- Disk quotas are a problem. Should a reflinked file count against the
owner's disk quota, even though little or no extra storage is actually
used? If so, the system must take extra steps to keep users from
creating a reflink to a file they do not own; otherwise one user could
exhaust another user's quota. If, instead, storage is charged against
the quota of the user who created the reflink, complicated structures
will be needed to track usage associated with files owned by others.
- What happens if the new file's metadata - permissions or owner - are changed? In some scenarios, depending on the underlying filesystem implementation, it seems that a metadata change could require a copy-on-write of the whole file. That would turn a command like chmod into a rather heavy-weight operation.
On the other hand, if a reflink is like making a copy, the situation changes somewhat:
- Security becomes a rather more complicated affair. Making a hard link
doesn't require messing with SELinux security contexts, but a
reflink-as-copy would require that. Permission checks (again,
including security module checks) would have to become more
elaborate; it would have to be clear that the user making the reflink
had read access to the file.
- The quota problem goes away. If a reflink is essentially a copy, then
the resulting link should be owned by the user who creates it, rather
than the owner of the original file. The only course which makes
sense is to charge both users for the full size of the file. There
are no concerns about one user exhausting another's disk quota, and
there are no real difficulties with keeping disk usage information
current.
- Metadata changes are handled naturally, since the files are completely
separate from each other.
- Reflinks are no longer true snapshots; they will not work to represent the state of the filesystem at a given time. For a user whose real interest is low-level snapshotting, reflink-as-copy will not work.
On the other hand, reflink-as-copy could be used in a lot of other interesting situations; the cp command could create reflinks by default when the destination file is on the same filesystem. That would turn "cp -r" into a fast and efficient operation. They could also be used to share files between virtualized guests.
What it comes down to is that there are real uses for both the
reflink-as-link and reflink-as-copy modes of operation. So the right
solution may well be to implement both modes. The flags parameter
to reflinkat() can be used to distinguish between the two.
Implementing both behaviors will complicate the implementation somewhat,
and it muddies up what is otherwise a conceptually clean system call. But
that's what happens, sometimes, when designs encounter the real world.
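How a flag might select between the two behaviors can be sketched in a few lines. Everything here is an assumption made for illustration: the flag names, the sketch_meta structure, the rule that only the owner (or root) may make a link-style reflink, and the choice to carry the permission bits over in copy mode. The article's discussion fixes only the broad outline: link mode preserves the owner, copy mode gives the new file to the caller and requires read access.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical flag values; no names had been settled for reflinkat(). */
#define SKETCH_REFLINK_AS_LINK 0x0
#define SKETCH_REFLINK_AS_COPY 0x1

/* The small slice of inode metadata this sketch cares about. */
struct sketch_meta {
    uid_t  owner;
    mode_t mode;
};

/*
 * Decide the metadata of the new inode.
 * Returns 0 on success, or -1 with errno set if the request is refused.
 */
int sketch_reflink_meta(const struct sketch_meta *old, uid_t caller,
                        bool caller_can_read, int flags,
                        struct sketch_meta *new_meta)
{
    if (flags & SKETCH_REFLINK_AS_COPY) {
        /* Copy mode: the caller must be able to read the data and
         * becomes the owner of the new file. */
        if (!caller_can_read) {
            errno = EACCES;
            return -1;
        }
        new_meta->owner = caller;
        new_meta->mode = old->mode;   /* assumption: mode carried over */
        return 0;
    }

    /* Link mode: metadata is preserved exactly. To keep one user from
     * running up another's quota, this sketch refuses non-owners. */
    if (caller != 0 && caller != old->owner) {
        errno = EPERM;
        return -1;
    }
    *new_meta = *old;
    return 0;
}
```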
Posted May 5, 2009 21:12 UTC (Tue)
by flewellyn (subscriber, #5047)
[Link] (16 responses)
Posted May 5, 2009 21:21 UTC (Tue)
by martinfick (subscriber, #4455)
[Link] (8 responses)
Posted May 5, 2009 21:29 UTC (Tue)
by flewellyn (subscriber, #5047)
[Link]
Posted May 6, 2009 14:25 UTC (Wed)
by wilreichert (guest, #17680)
[Link] (2 responses)
Posted May 6, 2009 15:09 UTC (Wed)
by dlang (guest, #313)
[Link]
Posted May 6, 2009 17:03 UTC (Wed)
by elanthis (guest, #6227)
[Link]
If on the other hand cp says "this is a copy" to the kernel then the filesystem can just do the right thing. Of course, other applications will need to be modified to take advantage of the new feature, but such is the truth of most progress.
Posted May 7, 2009 21:03 UTC (Thu)
by anton (subscriber, #25547)
[Link] (3 responses)
Posted May 7, 2009 21:13 UTC (Thu)
by martinfick (subscriber, #4455)
[Link] (2 responses)
You don't; usually the host system mounts a portion of the filesystem into a separate chroot for each guest server. The guests typically then have a limited root capability that does not include making device nodes, so they really do not have access to the device, only the filesystem.
Posted May 10, 2009 18:42 UTC (Sun)
by anton (subscriber, #25547)
[Link] (1 responses)
Posted May 10, 2009 19:09 UTC (Sun)
by martinfick (subscriber, #4455)
[Link]
Sure, but if you make the binaries read only you no longer have
independent guest systems that can be administered without knowledge of
the host or other guests. In other words, if I now want to upgrade the
apache server in one guest, I can't since the binary is read only to my
guest root user. With COW, no problem, as a guest admin I do not even
know that my apache binary is shared with others. It is only relevant to
the host (the host unifies the various guest binaries, not the guest).
Posted May 6, 2009 14:33 UTC (Wed)
by MarkWilliamson (guest, #30166)
[Link] (6 responses)
Those folks who are fortunate enough to have their home directory on a netapp filer
With reflinks this would be very fast and would not use up loads of disk space
Another use that I've seen mentioned is the ability to make checkouts / clones in
Finally, you could probably have a daemon that rummages around the system, finds
Loads of cool stuff :-)
Posted May 6, 2009 14:34 UTC (Wed)
by MarkWilliamson (guest, #30166)
[Link]
Posted May 6, 2009 16:38 UTC (Wed)
by cdarroch (subscriber, #26812)
[Link] (4 responses)
Posted May 6, 2009 16:59 UTC (Wed)
by MarkWilliamson (guest, #30166)
[Link] (1 responses)
rdiff-backup (http://www.gnu.org/savannah-checkouts/non-gnu/rdiff-backup/) gives
archfs (http://code.google.com/p/archfs/) provides a Fuse interface to browse rdiff-backup
.snapshot is a very nice user interface to have to old revisions.
Posted May 9, 2009 5:07 UTC (Sat)
by TRS-80 (guest, #1804)
[Link]
Line endings - make sure you select HTML not plain text, as the latter doesn't do wrapping for some reason.
Posted May 11, 2009 0:21 UTC (Mon)
by vonbrand (subscriber, #4458)
[Link] (1 responses)
Please don't.
I suffered through DOS's "you can undelete files whenever you fatfingered DEL". Most of the time it worked, but Murphy's Law ensured that when you really needed to get something back, it would usually be gone for good.
Unix' idea of "
Posted May 11, 2009 1:10 UTC (Mon)
by MarkWilliamson (guest, #30166)
[Link]
Although in practice it's going to get used to undo rm occasionally, it seems to me only sensible to have something like this available so I'm able to roll back important documents and settings to previous states if I make the wrong modification, or if some program barfs over everything and corrupts things.
Users will probably have to be repeatedly reminded that, yes, they do need an independent backup on another disk somewhere because reflinks won't save you if your computer explodes. But most folks don't do proper backups *anyhow*, so I doubt it'll make that aspect of user behaviour much worse!
Posted May 5, 2009 21:43 UTC (Tue)
by JoeBuck (subscriber, #2330)
[Link]
Posted May 5, 2009 21:47 UTC (Tue)
by martinfick (subscriber, #4455)
[Link] (18 responses)
Posted May 5, 2009 23:21 UTC (Tue)
by adj (subscriber, #7401)
[Link] (17 responses)
Posted May 5, 2009 23:31 UTC (Tue)
by martinfick (subscriber, #4455)
[Link] (15 responses)
Posted May 6, 2009 0:05 UTC (Wed)
by adj (subscriber, #7401)
[Link]
Posted May 6, 2009 0:09 UTC (Wed)
by clugstj (subscriber, #4020)
[Link] (4 responses)
Posted May 6, 2009 0:14 UTC (Wed)
by martinfick (subscriber, #4455)
[Link] (3 responses)
Posted May 6, 2009 0:49 UTC (Wed)
by JoeBuck (subscriber, #2330)
[Link] (1 responses)
Posted May 6, 2009 5:59 UTC (Wed)
by tialaramex (subscriber, #21167)
[Link]
It also contains per-inode flags, so that implementations can be warned that they're missing a feature needed to read or update a particular file, in this case the implementation should fail open() for that file.
Of course poor quality implementations from third parties may be missing some or all of these checks. Fortunately the worst implementation I'm aware of as of this moment is read-only, so any problems only occur when reading files with that implementation and if/when they reboot into Linux everything is fine again.
Posted May 6, 2009 19:31 UTC (Wed)
by clugstj (subscriber, #4020)
[Link]
Posted May 6, 2009 3:00 UTC (Wed)
by xoddam (guest, #2322)
[Link] (8 responses)
You could also specify flags in the superblock as others have suggested, so as to prevent a filesystem with reflinks from being mounted at all by a kernel which does not support them.
Posted May 6, 2009 8:55 UTC (Wed)
by epa (subscriber, #39769)
[Link] (7 responses)
Posted May 6, 2009 9:46 UTC (Wed)
by xoddam (guest, #2322)
[Link]
Posted May 6, 2009 12:24 UTC (Wed)
by corbet (editor, #1)
[Link] (5 responses)
Posted May 6, 2009 16:07 UTC (Wed)
by masoncl (subscriber, #47138)
[Link] (2 responses)
The two files can be changed independently without affecting each other. One could be deleted, truncated, expanded, chmoded, have new acls set, etc.
The actual block sharing is just an implementation detail...it could be implemented as a lazy copy for example.
Posted May 6, 2009 19:40 UTC (Wed)
by adj (subscriber, #7401)
[Link] (1 responses)
Surely there's a better name for a make-a-copy-of-this-inode-and-all-its-data-and-maybe-do-some-cool-COW-magic system call. It's too bad that the "dup" family of system call names is already used for something with a completely separate meaning.
Posted May 6, 2009 21:47 UTC (Wed)
by AnswerGuy (guest, #1256)
[Link]
magically-allocated-data-copy-on-write ... :)
Posted May 7, 2009 10:12 UTC (Thu)
by epa (subscriber, #39769)
[Link] (1 responses)
If one of the two reflink inodes is then removed, does the other one revert back to being a normal file? If not, is there any difference between a lone reflink inode and a normal one? Couldn't all files be reflinks?
Posted May 7, 2009 22:59 UTC (Thu)
by dlang (guest, #313)
[Link]
if this isn't the case, it should be, for just this reason.
as I understand things, if a file is changed it may break the linking entirely (copying the entire file), or it may break the link partially (still sharing the common parts of the file, but with the differences being separated) at the option of the filesystem
Posted May 9, 2009 14:25 UTC (Sat)
by sdbrady (guest, #56894)
[Link]
Posted May 5, 2009 22:34 UTC (Tue)
by ikm (guest, #493)
[Link] (2 responses)
Posted May 5, 2009 22:55 UTC (Tue)
by JoeBuck (subscriber, #2330)
[Link] (1 responses)
Posted May 5, 2009 23:02 UTC (Tue)
by ikm (guest, #493)
[Link]
Posted May 5, 2009 23:36 UTC (Tue)
by adj (subscriber, #7401)
[Link] (6 responses)
Posted May 6, 2009 0:00 UTC (Wed)
by daney (guest, #24551)
[Link] (4 responses)
On my filesystem with somewhat smaller blocks presumably the wait would be shorter.
Posted May 6, 2009 0:12 UTC (Wed)
by adj (subscriber, #7401)
[Link] (3 responses)
Posted May 6, 2009 0:17 UTC (Wed)
by adj (subscriber, #7401)
[Link] (1 responses)
Posted May 6, 2009 0:57 UTC (Wed)
by JoeBuck (subscriber, #2330)
[Link]
Posted May 6, 2009 1:01 UTC (Wed)
by corbet (editor, #1)
[Link]
Posted May 6, 2009 18:49 UTC (Wed)
by jzbiciak (guest, #5246)
[Link]
Slightly fancier answer: Perhaps this is another thing to "ulimit"? Return EPERM or EFBIG (file is too big) if the limit's exceeded.
Posted May 6, 2009 2:12 UTC (Wed)
by knobunc (guest, #4678)
[Link]
But can I tell if two files are reflinked? Can I tell if one has changed after the fact? If the smarter COW semantics that only copy necessary blocks in a changed file is used, is there any way to tell which portions of the file are common, or at least what percentage?
Can you reflink something more than once... I assume so. And if so, how does that impact the above questions.
Thanks, -ben
Posted May 6, 2009 7:34 UTC (Wed)
by butlerm (subscriber, #13312)
[Link] (18 responses)
The reasonable options are full copy semantics or (read-only) snapshot
The general idea here is excellent of course. However, I suggest this system
Posted May 6, 2009 13:22 UTC (Wed)
by Cato (guest, #7643)
[Link]
Posted May 9, 2009 21:34 UTC (Sat)
by giraffedata (guest, #1954)
[Link] (16 responses)
The caller should be able to change the owner of the file? Or change permissions on a file, whether he owns it or not? These are not normal Unix things, except for the superuser.
That's why there's a decision to be made (or not -- has anyone considered doing one of each?)
I would use a reflink that makes new metadata (and I can then copy whatever metadata I can, like cp --archive, if I like). I wouldn't use one that clones the metadata -- that function is better served by a snapshot function that does a whole directory tree at once.
Posted May 9, 2009 23:31 UTC (Sat)
by butlerm (subscriber, #13312)
[Link] (15 responses)
Posted May 10, 2009 1:45 UTC (Sun)
by giraffedata (guest, #1954)
[Link] (14 responses)
Well not obviously, since that assumption leaves the situation equally weird.
But now that you've said that's what you have in mind, maybe you can elaborate. Would an unprivileged person be able to use reflink? What would happen if he did it on a file he doesn't own? Would it be possible for someone to create a file he can't access? One whose space is charged to someone else?
Posted May 10, 2009 10:30 UTC (Sun)
by nix (subscriber, #2304)
[Link] (12 responses)
Posted May 10, 2009 18:03 UTC (Sun)
by giraffedata (guest, #1954)
[Link] (11 responses)
Posted May 10, 2009 18:36 UTC (Sun)
by nix (subscriber, #2304)
[Link] (2 responses)
(I'd be rather annoyed, too: I use hardlinks all the time.)
Posted May 11, 2009 5:50 UTC (Mon)
by giraffedata (guest, #1954)
[Link] (1 responses)
Well, I wouldn't scrap link() et al -- I'd just move them out of the kernel and add the ability to explicitly create and delete files independent of directory links.
Posted May 11, 2009 6:10 UTC (Mon)
by nix (subscriber, #2304)
[Link]
Posted May 11, 2009 5:42 UTC (Mon)
by butlerm (subscriber, #13312)
[Link] (7 responses)
Posted May 11, 2009 5:57 UTC (Mon)
by giraffedata (guest, #1954)
[Link] (6 responses)
But what I described makes the distinction even larger. Today directory entries and inodes are tied together tightly by the kernel.
But I'm curious about how this affects having to reboot when you update system software.
Posted May 11, 2009 6:09 UTC (Mon)
by nix (subscriber, #2304)
[Link] (5 responses)
You could no longer rely on unlink() decrementing i_nlink, but I don't
It breaks updating running software because that involves unlinking files
Posted May 11, 2009 7:37 UTC (Mon)
by giraffedata (guest, #1954)
[Link] (3 responses)
That's a pretty tight bond, especially since i_nlink controls when the inode/file gets deleted. Also, you can't make the kernel create an inode without also creating a directory entry, and except temporarily, an inode cannot exist without at least one directory entry associated with it. Those are the bonds that it would be nice to get away from, as pretty much every OS except Unix does.
We must be talking about different things. I was just talking about what Unix should do instead of what it always has (as a fundamental design point) done. Nothing to do with reflinks. And I'm also not claiming it would be compatible with any existing Unix application, but I do believe every application could be done at least as well with a kernel without automatic file deletion and directories.
Posted May 11, 2009 18:32 UTC (Mon)
by nix (subscriber, #2304)
[Link] (2 responses)
Posted May 12, 2009 1:22 UTC (Tue)
by giraffedata (guest, #1954)
[Link] (1 responses)
They work better in user space -- there's more flexibility there and the basic concept of a directory has nothing to do with resource allocation between users, which is what the kernel is for. Many OSes do them outside the kernel. The only reason they have to be in the kernel in Unix is that the kernel deletes files implicitly based on directory references. And as I've been saying, we'd be better off without that.
Posted May 12, 2009 19:59 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Also it's a grotesque security hole: now you can't keep stuff secret by
Periodically there are proposals to introduce an open()-by-inode-number
(And if you're going to go that route, make the inums 1024 bits long and
Posted May 11, 2009 15:38 UTC (Mon)
by butlerm (subscriber, #13312)
[Link]
Posted May 11, 2009 5:34 UTC (Mon)
by butlerm (subscriber, #13312)
[Link]
Otherwise you would get a highly restricted operation that would only be
Posted May 6, 2009 8:15 UTC (Wed)
by amw (subscriber, #29081)
[Link] (9 responses)
Posted May 6, 2009 8:52 UTC (Wed)
by lmb (subscriber, #39048)
[Link] (1 responses)
There is, of course, also a need for reflinkstat() or whatever it is going to be called - one needs to find out how many blocks are part of a specific COW, and also reflinkdiff(), which enumerates the (meta-)data blocks which differ from the link target (needed for efficient rsync/backup).
Then, it also becomes possible to use quota to account just the difference, i.e. the actual space used by the reflink.
A further complication arises when we look at using reflink() on directories, which would of course also be quite desirable (for snapshots, multi-use chroots etc). That will be an interesting direction to explore ;-)
Posted May 9, 2009 23:52 UTC (Sat)
by butlerm (subscriber, #13312)
[Link]
However both aforementioned filesystems store block checksums, so perhaps a
Posted May 6, 2009 9:02 UTC (Wed)
by epa (subscriber, #39769)
[Link] (3 responses)
That said, it makes no sense to account disk quota conservatively while lying about the amount of free space really available. The two should be treated the same, so if reflinking a large file has no effect on the reported free space, it shouldn't cost quota either.
Posted May 6, 2009 9:22 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted May 6, 2009 10:42 UTC (Wed)
by epa (subscriber, #39769)
[Link]
Posted May 6, 2009 12:33 UTC (Wed)
by vonbrand (subscriber, #4458)
[Link]
Even worse, if I reflink() a file of yours, and you then change it (or delete it, whatever), suddenly my quota goes up without any action on my part.
Posted May 6, 2009 11:21 UTC (Wed)
by pjm (guest, #2080)
[Link] (1 responses)
The reasons for different quota behaviour would be if it results in different (and more desirable) user behaviour, or if it helps system administrators choose better quota limits, i.e. if it results in less frequent filesystem-full situations for a given amount of user productivity.
How much are quotas used these days, and for what uses? Can people comment on the usefulness of different quota policies in the context of specific use cases?
As to whether different quota behaviour would result in different user behaviour (e.g. encouraging taking steps for files to be reflink'ed rather than copied), I wonder how many quota'd users would have the necessary knowledge for it to change their behaviour.
Posted May 6, 2009 12:47 UTC (Wed)
by utoddl (guest, #1232)
[Link]
Space-reporting tools can report variations of (1) actual space used by extant allocated blocks (modulo sparse files), (2) space free, (3) space that would be used in the case where a naive copy were made to another file system -- all of which are valid and different numbers. The "simple" question of how much storage is used is in fact complicated, our desire for simple answers notwithstanding.
Posted May 16, 2009 0:37 UTC (Sat)
by efexis (guest, #26355)
[Link]
One example - if you share somebody's file and it doesn't count as your own space, then the original owner deletes their copy, the space is now solely allocated to you, and so should count against your quota? This act could push you way over your quota; what if the file alone is bigger than your complete quota?
What if two people are sharing a file that you delete? Should what's on your quota be divided equally amongst the two of them? If you want to be fair, surely the thing to do when you share a file from someone else is add half of its size to your own quota, and remove half of it from theirs. If you share the file, you should share the cost?
If it goes straight onto your quota when you share it, you simply don't have to worry about any of these - but you also do at the same time lose certain benefits of sharing the data. With VMs, over-committing can often be specified as a %, maybe a similar option for sharing files... sharing a file could save you a certain % of its size from your quota, which then means if you suddenly become the sole owner, you only have added the remaining % rather than the full 100%. The % chosen would be linked to the average reference count of all blocks that you own, as this shows the likelihood of any block being or becoming solely yours.
Posted May 6, 2009 12:08 UTC (Wed)
by arnd (subscriber, #8866)
[Link] (1 responses)
COW links were last covered five years ago, in
Posted May 6, 2009 12:25 UTC (Wed)
by corbet (editor, #1)
[Link]
Posted May 6, 2009 16:03 UTC (Wed)
by iabervon (subscriber, #722)
[Link] (1 responses)
Of course, being able to copy directories with this sort of mechanism is probably trickier to implement, but any good snapshot mechanism would have to handle this sort of thing.
Posted May 6, 2009 17:44 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link]
Given their small average size, it probably makes more sense to just copy the directories themselves outright rather than implementing them as COW. Recursive copies would be similar to "cp -dlR", just with reflink() in place of link().
Posted May 7, 2009 20:54 UTC (Thu)
by anton (subscriber, #25547)
[Link] (3 responses)
Posted May 10, 2009 0:05 UTC (Sun)
by butlerm (subscriber, #13312)
[Link] (2 responses)
Copy-on-write filesystems could use a good interface (like this one) to make
Posted May 10, 2009 18:36 UTC (Sun)
by anton (subscriber, #25547)
[Link] (1 responses)
Posted May 11, 2009 5:22 UTC (Mon)
by butlerm (subscriber, #13312)
[Link]
The ZFS scheme is based on logical sequence numbers, Netapp uses block maps.
Posted May 14, 2009 21:12 UTC (Thu)
by mslusarz (guest, #58587)
[Link]
long splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);
So don't add a new syscall. Implement splice for two descriptors on one filesystem!
The two sides of reflink()
For shared stuff like program text, all servers could use the same
binaries (through mount, mount -t bind, or hard links), so that's not
a good justification for reflinks, either (and if you don't trust the
other servers not to write to the file, why would you trust them with
access to the device at all?). Writable files that would mostly or
completely be the same on both VMs would be a better example, but no
concrete example comes to my mind.
The two sides of reflink()
The guests typically then have a limited root capability
that does not include making device nodes so they really do not have
access to the device, only the filesystem.
With the limits on the root capabilities, the binaries can surely be
made read-only even for the guest roots, so no reflinks are needed for
the binaries.
The two sides of reflink()
The guests typically then have a limited root capability that does not
include making device nodes so they really do not have access to the
device, only the filesystem.
With the limits on the root capabilities, the binaries can surely be made
read-only even for the guest roots, so no reflinks are needed for the
binaries.
The two sides of reflink()
have for years been able to "cd ~/.snapshot/" and find a special directory of
historical versions of their files. These are stored efficiently because of the nature of
the WAFL filesystem. With reflink, it would be possible to create a lightweight
version of historical snapshotting: you'd have a daemon run every night (for
instance) and recursively reflink the current state of all your files into a directory
tree at ~/.old-versions/<date>/ - then, if you ever needed to go back to an old
version of a file you could just look in there.
(though there would still be quota concerns). It would make time-machine or Netapp
.snapshot-like functionality easy to implement efficiently on single disk systems.
Probably the most quoted reference for stuff like this is the Elephant research
filesystem, about which there are a number of decent research papers.
modern version control systems go faster and be more lightweight in terms of disk
storage - for instance, cloning a git repository could transparently share all the
underlying data (including the working directory!) using reflinks. Similar tricks being
possible for the other VCSes.
identical files and unifies them on disk using reflink in order to save space.
The two sides of reflink()
somehow.
The two sides of reflink()
somewhat similar snapshotting convenience but you have to interact with it through a
command line app. Also, it does use up extra space (although if you're backing up to
another machine / another drive for redundancy then that's just fine!).
repositories. Last time I tried it it wasn't really suitable for large repositories but this may
have been fixed since then. rdiff-backup's page on related info has some other solutions:
http://www.gnu.org/savannah-checkouts/non-gnu/rdiff-backu...
rdiffWeb is a nice web interface to rdiff-backup. At work we're using rdiff-backup for weekly snapshots to complement our nightly amanda tape, a 1TB drive lasted us a year.
The two sides of reflink()
rm
is final" is harsh, but you learn not to misplace stuff in the first place. Makes for a better experience in the long run.
The two sides of reflink()
Since the other "at" calls ("linkat", etc.) are shared with Solaris, I hope that there will be discussions with Solaris and BSD people about these calls, to help people write portable software going forward.
The two sides of reflink()
You'd need flags in the filesystem that would prevent the filesystem from being mounted by a kernel that lacks the needed feature.
The two sides of reflink()
a reflink would be a new type of inode
A reflink would be a new type of inode only in so far as the filesystem must track the fact that it has blocks shared with another inode. There is no difference, though, between an inode created by a reflink and the file's original inode; they both become reflink inodes. In Btrfs, I believe, things are even less different; the tracking of the shared blocks is done at the extent level.
a reflink would be a new type of inode
madcow()?
a reflink would be a new type of inode
A reflink would be a new type of inode only in so far as the filesystem must track the fact that it has blocks shared with another inode. There is no difference, though, between an inode created by a reflink and the file's original inode; they both become reflink inodes.
Ah, so it is symmetric, somewhat like making a hard link with 'ln'.
a reflink would be a new type of inode
The two sides of reflink()
Read the article again; the uses are described (snapshots, more efficient copying).
The two sides of reflink()
Pity userspace on where this is done on a simple minded filesystem
reflink("myold200Gbytefile", "myreflinked200Gbytefile");
myfd = open("myreflinked200Gbytefile", (O_WRONLY|O_APPEND|O_LARGEFILE));
res = write(myfd, "hereismyshortlittlemessage", 27);
/*
* wait, like _forever_ while i just added 200Gbytes to the amount
* of space used on this filesystem. Not to mention getting back
* ENOSPC, because, even though there were 100Gbytes available
* before this 27 byte long write, now there are none. reflinks
* sound neat, but sometimes have unexpected teeth.
*/
Re: Pity userspace on where this is done on a simple minded filesystem
But appending (to logs, for example) is such a common special case that it would be likely to be supported at an early stage, so that you have an efficient representation when you take a snapshot of a filesystem, and then you let it go on, appending lots of stuff to your log files, blocks belonging to a common prefix can be shared.
Re: Pity userspace on where this is done on a simple minded filesystem
btrfs, at least, has COW wired pretty deeply into it. So it will only copy blocks which have
actually been changed. Results on other filesystems may vary.
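The gap between the two behaviors discussed in this thread can be put in numbers. The sketch below is plain arithmetic, not filesystem code: it assumes a 200 GiB file and 4 KiB blocks (neither figure is stated in the thread) and counts how many blocks a write would force the filesystem to duplicate under block-level copy-on-write versus a naive copy of the whole file. The function names are made up for this example.

```c
#include <stdint.h>

/* Number of blocks touched by a write of len bytes starting at offset;
 * this is what a block-level copy-on-write filesystem has to duplicate. */
uint64_t sketch_blocks_spanned(uint64_t offset, uint64_t len,
                               uint64_t block_size)
{
    if (len == 0)
        return 0;
    return (offset + len - 1) / block_size - offset / block_size + 1;
}

/* Number of blocks in a file of the given size; this is what a
 * simple-minded implementation duplicates on the first write. */
uint64_t sketch_blocks_in_file(uint64_t size, uint64_t block_size)
{
    return (size + block_size - 1) / block_size;
}
```

Under those assumptions, appending 27 bytes to the end of the 200 GiB file touches a single 4 KiB block, while copying the whole file first means duplicating 52,428,800 blocks.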
Re: Pity userspace on where this is done on a simple minded filesystem
Pity userspace on where this is done on a simple minded filesystem
Any introspection into reflinked files?
The two sides of reflink()
that from a user point of view does not appear to make any kind of link at
all.
semantics. A writable non-link "link" that you can't change the owner of
doesn't seem like a very useful construct to me.
call would make a lot more sense if it were named "fclone" (or something like
that) with full copy semantics. It should preserve the owner, permissions
and data to begin with, and then the caller should be able to change all
three after the fact. Sort of like "fork" or "clone".
The two sides of reflink()
It should preserve the owner, permissions
and data to begin with, and then the caller should be able to change all
three after the fact.
The two sides of reflink()
Assuming they have the necessary privileges to do so, obviously.
The two sides of reflink()
yourself; hardlink someone else's file into it; wait for that other person
to delete it. Now you've stolen that person's quota.
Yes, and you misspoke. The other person didn't delete the file because no one can delete a file. The system deletes one automatically when it's no longer accessible. The space charging problem is one of the many reasons this innovative Unix concept should actually be scrapped. Along with the related concepts that directories are kernel level things, and you can't give a file a name.
The two sides of reflink()
authors would try to kill you.
The two sides of reflink()
I suspect that if you actually tried to scrap link() et al, a million MTA
authors would try to kill you.
The two sides of reflink()
directory links: mkstemp(). What you can't do is easily create them
outside of /tmp, or link them to names at a later date.
The two sides of reflink()
updated virtually any piece of system software. That would be the direct
consequence of discarding the directory entry / inode distinction in Unix -
to regress to the reboot happy world of Win32.
The two sides of reflink()
That would be the direct
consequence of discarding the directory entry / inode distinction in Unix -
The two sides of reflink()
all, except that each directory entry increases i_nlink in the
corresponding inode by one. Reflinks simply would ensure that i_nlink was
*at least* one but would not increment it (probably by maintaining a
separate i_reflink count), and the semantics of unlink() would change to
ensure that a reflink()/unlink() sequence had the same (no) effect on link
count as link()/unlink().
know of *anything* that depends on this (some things doubtless do but it
can't be common).
that are in use, and because the update process generally consists of
creating a file with a temporary name, filling it out, and rename()ing it
over the original (that's an implicit unlink right there, and it does not
fail). If you break that you break every package manager on the face of
the earth.
The two sides of reflink()
What? Directory entries and inodes aren't tied together in the fs model at
all, except that each directory entry increases i_nlink in the
corresponding inode by one.
Reflinks simply would ...
The two sides of reflink()
for scalability. If they weren't in the kernel, they'd need to be in some
userspace library (ew).
The two sides of reflink()
If they weren't in the kernel, they'd need to be in some
userspace library (ew).
The two sides of reflink()
things POSIX guarantees become, as near as I can tell, impossible to provide.
I can't see any way to keep cross-directory rename() atomic, for instance.
hiding it in unreadable directories anymore.
syscall. They are always shot down. I don't know what sort of system
you're thinking of, but it isn't Unix.
bingo, you've got a capability-based system.)
The two sides of reflink()
the reflink shares the same inode. The ownership, permissions, and file data
of both the original file and the new file all have to be modifiable
independently. To make any sense, they would also need separate inode
numbers.
The two sides of reflink()
have semantics as a file copy. Since an unprivileged user cannot change the
ownership of an existing file, a general purpose implementation *must* be
able to change the ownership of the new file to that of the current user in
the process.
useful to unprivileged users for making efficient copies of files they
already own.
The two sides of reflink()
figure out which files / snapshots share which blocks in a reasonable amount
of time. I imagine BTRFS is similar.
more practical means would be to add an interface that returns the block
checksums of a file range (if not the block offsets) to user space, and
generate a candidate duplication list from that.
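No such checksum interface is described in this thread, so the following is only a user-space stand-in for the idea: it reads two files itself, checksums each block, and counts the same-offset blocks whose checksums agree. The FNV-1a hash, the 4096-byte BLOCK_SIZE, and the name count_candidate_blocks() are all assumptions made for illustration; a filesystem would expose whatever checksum it already stores.

```c
#define _XOPEN_SOURCE 700   /* for pread() under strict -std= modes */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE 4096     /* assumed block size */

/* 64-bit FNV-1a; a stand-in for the checksum a filesystem might keep. */
static uint64_t fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */

    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;              /* FNV prime */
    }
    return h;
}

/* Read up to one block at `off`; returns bytes read (0 at EOF) or -1. */
static ssize_t read_block(int fd, unsigned char *buf, off_t off)
{
    size_t got = 0;

    while (got < BLOCK_SIZE) {
        ssize_t n = pread(fd, buf + got, BLOCK_SIZE - got, off + (off_t)got);
        if (n < 0)
            return -1;
        if (n == 0)
            break;
        got += (size_t)n;
    }
    return (ssize_t)got;
}

/*
 * Count same-offset blocks with equal checksums. These are candidates
 * only: equal checksums do not prove equal (or shared) blocks.
 * Returns the count, or -1 on a read error.
 */
long count_candidate_blocks(int fd_a, int fd_b)
{
    unsigned char a[BLOCK_SIZE], b[BLOCK_SIZE];
    long matches = 0;

    for (off_t off = 0; ; off += BLOCK_SIZE) {
        ssize_t na = read_block(fd_a, a, off);
        ssize_t nb = read_block(fd_b, b, off);

        if (na < 0 || nb < 0)
            return -1;
        if (na == 0 || nb == 0)
            break;
        if (na == nb && fnv1a(a, (size_t)na) == fnv1a(b, (size_t)nb))
            matches++;
    }
    return matches;
}
```

Reading every block is exactly the cost the commenter wants to avoid; the point of a kernel interface would be to hand back checksums the filesystem has already computed, leaving only the comparison to user space.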
The two sides of reflink()
potentially fail with -ENOSPC is broken. Such write()s can fail even now
thanks to sparse files. It is true that currently userspace can rely on
the second write() in a ftell()/write()/fseek()/write() sequence not
failing, but this seems a rather thin thing to rely on, to me.
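The sparse-file case can be observed from user space. The sketch below is a heuristic only: it assumes that st_blocks counts 512-byte units, as it does on Linux (POSIX leaves the unit unspecified), and looks_sparse() is a hypothetical helper name. Compression or inline data can produce the same result on some filesystems.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: returns 1 if the file behind fd has fewer bytes
 * allocated on disk than its length, 0 if not, -1 if fstat() fails.
 * Assumes st_blocks is counted in 512-byte units (true on Linux).
 */
int looks_sparse(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return (long long)st.st_blocks * 512 < (long long)st.st_size;
}
```

When this returns 1, a write() inside the existing length of the file may still have to allocate blocks, so it can fail with ENOSPC even though the file does not grow.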
The two sides of reflink()
quota behaviour of reflink()
Interesting points, but remember: quota can be impacted by unrelated actions by other processes owned by the same or other users at any time, so user space needs to respond to space-related issues regardless of what quotas existed "moments ago". In fact, quota can be affected when the user takes no actions at all; the admin can change quotas, file systems can be resized, etc., and a user who was not over quota may suddenly be so even with no changes to his files.
quota behaviour of reflink()
The two sides of reflink()
COW links?
links was shot down. The interface is rather different, but the effective
use cases are practically the same.
http://lwn.net/Articles/77972/.
I meant to work that in, but didn't... cowlinks were a similar idea, but a different implementation. They were really links, while "reflinks" create a new inode. So there's quite a bit of difference in how they work.
COW links?
The two sides of reflink()
Many posters seem to think of snapshotting as a good use of reflinks,
but snapshots should be created on the whole file system at once,
atomically (so you get a snapshot that is consistent across files).
File systems like btrfs have this functionality, and one can also use
LVM for that (with a little fs support), so snapshots don't seem to be
a good use and justification of reflink().
There are better ways to snapshotting
operation that is far more efficient than walking a directory tree and
creating thousands of new inodes. It doesn't work for making a (writable)
snapshot of a snapshot, however, and that is a very important use case for
virtualization, for example.
efficient *copies* of files in the same (extended) filesystem. That is
something that ZFS and BTRFS both appear to lack, and which for portability
reasons is easy to downgrade to an ordinary file copy.
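A minimal sketch of that downgrade path follows. It assumes a present-day Linux, where block sharing is requested through the FICLONE ioctl from <linux/fs.h> rather than through the reflink() call discussed in the article; clone_or_copy() is a hypothetical helper name. Consistent with the copy view argued for in this thread, the new file is created by the caller and therefore owned by the caller.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>   /* FICLONE; assumes reasonably recent Linux headers */

/* Plain read/write copy loop; returns 0 on success, -1 with errno set. */
static int copy_bytes(int src, int dst)
{
    char buf[65536];
    ssize_t n;

    while ((n = read(src, buf, sizeof buf)) != 0) {
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(dst, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
    }
    return 0;
}

/*
 * Hypothetical helper: try to share blocks, fall back to a real copy.
 * Returns 1 if cloned, 0 if copied, -1 on error.
 */
int clone_or_copy(const char *oldname, const char *newname)
{
    int result = -1;
    int src, dst;

    src = open(oldname, O_RDONLY);
    if (src < 0)
        return -1;
    dst = open(newname, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (dst < 0) {
        close(src);
        return -1;
    }

    if (ioctl(dst, FICLONE, src) == 0)
        result = 1;                 /* blocks shared, copy-on-write */
    else if (copy_bytes(src, dst) == 0)
        result = 0;                 /* filesystem declined: ordinary copy */

    close(src);
    if (close(dst) < 0)
        result = -1;
    if (result < 0)
        unlink(newname);            /* do not leave a partial file behind */
    return result;
}
```

The caller sees the same file contents either way; only the return value reveals whether the blocks are shared, which is also why the space and ENOSPC behaviour debated earlier in the thread can differ between the two outcomes.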
There are better ways to snapshotting
One of the advantages of ZFS style snapshotting is it is a constant time
operation that is far more efficient than walking a directory tree and
creating thousands of new inodes. It doesn't work for making a (writable)
snapshot of a snapshot, however
That may be a limitation of ZFS, but it's not a necessary limitation.
LLFS can do it, and nothing I have read about Btrfs indicates that it
cannot do it.
Copy-on-write filesystems could use a good interface (like this one) to make
efficient *copies* of files in the same (extended) filesystem.
Yes, implementing cp more efficiently seems to be the main benefit I
see from this system call.
There are better ways to snapshotting
ZFS scheme is you can have an unlimited number of snapshots, whereas with
Netapp these days I believe you get 256.
I understand that BTRFS allows nested writable snapshots, but that comes at a
general performance cost that ZFS doesn't have.
different API