
XFS: the filesystem of the future?

Posted Mar 5, 2012 14:34 UTC (Mon) by XTF (guest, #83255)
In reply to: XFS: the filesystem of the future? by dlang
Parent article: XFS: the filesystem of the future?

> depending on how much you want to write, and what your definition of atomic is, what you are asking for may not be possible.

Why not?
At the hardware level it's quite simple: write the new data to new blocks, then write the meta-data.

> it also depends on what you mean about 'tons of regressions'

Assuming the target is not a symlink to a different volume
Assuming you are allowed to create the tmp file
Assuming you are allowed to overwrite an existing file having the same name as your tmp file
Assuming it's ok to reset meta-data, like file owner, permissions, acls, creation timestamp, etc.
Assuming the performance regression due to fsync is ok (request was for atomic, not durable)

> fsync temp file (may require fsync of directory)

How does one check that requirement in a portable way?

> if you want to be sure the change has taken place, fsync directory.
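
For reference, the directory fsync mentioned here can be sketched in Python as follows. This is a minimal sketch, not a portable answer to the question above: it assumes a POSIX system where a directory can be opened read-only and passed to fsync (Linux allows this; other platforms may raise an error). The helper name `fsync_dir` is made up for illustration.

```python
import os

def fsync_dir(path):
    """Fsync the directory that contains `path` and return that directory.

    Assumption: the platform lets us open a directory with O_RDONLY and
    call fsync on the descriptor. Where it does not, OSError is raised.
    """
    parent = os.path.dirname(os.path.abspath(path))
    fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(fd)  # flush the directory itself, i.e. its entries
    finally:
        os.close(fd)
    return parent
```

A caller would run this after the rename, passing the destination path.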

> no, this isn't "real code", but translating it into your language of choice is not that hard.

But it's far from trivial either.

> There are a lot of programs out there that do this right today.

But there are even more that don't. There's also no tool to detect them (AFAIK).

> lookup the lwn.net article on safely saving data from a few months back

It also failed to address any of the assumptions / regressions.



XFS: the filesystem of the future?

Posted Mar 5, 2012 20:34 UTC (Mon) by dlang (subscriber, #313) [Link]

> At the hardware level it's quite simple: write the new data to new blocks, then write the meta-data.

writing the data to the new blocks may not be atomic.

writing the metadata to disk may not be atomic.

If your writes to disk can't be atomic, how can the entire transaction?

some of the data you write to a file may be visible before the metadata gets changed, which would make the change overall not be atomic.

XFS: the filesystem of the future?

Posted Mar 5, 2012 21:04 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

- Write new data to disk blocks which are unallocated on disk, but allocated in memory.
- Allocate new blocks on disk.
- Force in-order I/O (e.g. flush the disk cache).
- Atomically update inode to point to new data blocks, in memory and on disk.
- Force in-order I/O (e.g. flush the disk cache).
- De-allocate old data blocks on disk.

Between the "Allocate new blocks on disk" and "De-allocate old data blocks on disk" steps there are potentially two versions of the file data, only one of which is connected to a real file. A basic fsck tool or journal feature can clean up the disconnected version in the event that the process is interrupted.

This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC. A copy-on-write filesystem could implement O_ATOMIC efficiently without truncation, but not all filesystems have the necessary flexibility in the on-disk format for copy-on-write.
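
The ordering argument above can be illustrated with a toy in-memory model. This is only a sketch under loud assumptions: `ToyDisk`, its single "inode" (a plain list of block numbers) and the `crash_before` parameter are all invented for illustration, the cache flushes are not modelled, and no real filesystem is structured this way. It shows that a crash at any point leaves the file with either the complete old contents or the complete new contents.

```python
class ToyDisk:
    """In-memory stand-in for a disk: numbered blocks, an allocation set,
    and one 'inode' that is just a list of block numbers."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.allocated = set()
        self.inode = []  # block list of the single file

    def read_file(self):
        return b"".join(self.blocks[n] for n in self.inode)

    def atomic_replace(self, chunks, crash_before=None):
        """Replace the whole file; optionally 'crash' before a named step."""
        free = [n for n in range(len(self.blocks)) if n not in self.allocated]
        new = free[:len(chunks)]
        if len(new) < len(chunks):
            raise OSError("not enough free blocks")
        old = list(self.inode)

        def write_data():
            for n, chunk in zip(new, chunks):
                self.blocks[n] = chunk

        def allocate():
            self.allocated.update(new)

        def switch_inode():
            self.inode = new  # the only step that must be atomic

        def free_old():
            self.allocated.difference_update(old)

        steps = [("write", write_data), ("allocate", allocate),
                 ("switch", switch_inode), ("free", free_old)]
        for name, step in steps:
            if name == crash_before:
                return False  # simulated crash: later steps never run
            step()
        return True
```

Blocks that were written or allocated before a simulated crash stay orphaned in this model, which corresponds to the cleanup work left to fsck or the journal.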

XFS: the filesystem of the future?

Posted Mar 5, 2012 21:36 UTC (Mon) by dlang (subscriber, #313) [Link]

The process you list will badly fragment the file on disk, destroying performance (unless you are re-writing the entire file, in which case just writing a temporary file and renaming it will work).

It's also not clear that this is what the original poster was looking for when he said "Linux devs should really provide a proper solution (like O_ATOMIC) instead of blaming app devs for not doing the impossible."

I've been trying to make two points in this thread

1. it's not obvious what the "proper solution" that the kernel should provide looks like

2. it's not impossible to do this today (since there are many classes of programs that do this)

XFS: the filesystem of the future?

Posted Mar 5, 2012 22:45 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

>> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.
> The process you list will badly fragment the file on disk, destroying performance (***unless you are re-writing the entire file***, in which case just writing a temporary file and renaming it will work)

Emphasis added. Yes, I assumed the entire file was being rewritten. If not, one could add an online defragmentation step after the metadata update, though online defragmentation introduces atomicity issues of its own.

Creating and renaming a temporary file has its own issues, which have already been mentioned, particularly relating to symlinks and hard links. Even for ordinary files, it's not guaranteed that there exists a directory on the same filesystem where you have permission to create a temporary file. Bind mounts (which can apply to individual files) are another potential sore spot. How much work should applications be expected to do just to find a place to put their temporary file such that rename() can be guaranteed atomic?

An O_ATOMIC option to open() would ensure that you are really replacing the original file, and that the temporary space comes from the same filesystem.

1) No, there are no obvious solutions yet, but there have been several reasonable proposals.

2) Current applications can do atomic replacement in common but limited circumstances using rename(). They all depend on creating a temporary file on the same filesystem, which assumes both that you can locate that filesystem (see: bind mounts) and that you can create new files there. They also tend to break hard links, which may or may not be a desired behavior. There is no general, straightforward solution to the problem of atomically replacing the data associated with a specific inode.
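
As a rough illustration of the "locate that filesystem" problem, here is a sketch of what an application can check from user space. The function names are hypothetical. Comparing `st_dev` is only a heuristic: on Linux, rename() fails with EXDEV across different mount points even when both are mounts of the same filesystem, so matching device numbers do not prove that the rename will succeed (bind mounts are exactly that case).

```python
import os

def temp_dir_candidate(target):
    """Directory of the file that `target` finally resolves to.

    Symlinks are followed, so the temp file would be created next to the
    real file rather than next to the link.
    """
    return os.path.dirname(os.path.realpath(target))

def same_device(path_a, path_b):
    """Heuristic: True if both paths report the same device number.

    Necessary for rename() to work, but not sufficient (see bind mounts).
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

Even when both checks pass, the application still needs permission to create a file in the candidate directory.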

XFS: the filesystem of the future?

Posted Mar 5, 2012 23:37 UTC (Mon) by dlang (subscriber, #313) [Link]

> it's not guaranteed that there exists a directory on the same filesystem where you have permission to create a temporary file.

this sounds like a red herring to me.

If you aren't allowed to create a new file in the directory of the file, are you sure you have permission to overwrite the file you are trying to modify?

as for the symlink/hardlink 'issue', in my sysadmin experience, more problems are caused by editors that modify files in place (not using temp files and renaming them) than by breaking links. Editors that break links when modifying a file are referred to as 'well behaved' in this area.

XFS: the filesystem of the future?

Posted Mar 6, 2012 0:28 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> If you aren't allowed to create a new file in the directory of the file, are you sure you have permission to overwrite the file you are trying to modify?

These are orthogonal permissions. To overwrite the file you need write permission on the file. To create a new file in the same directory you need write permission on the directory. It's easily possible to have one without the other:

root# mkdir a
root# touch a/b
root# chown user a/b

user$ echo test > a/b # no error
user$ touch a/c
touch: cannot touch `a/c': Permission denied
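
The same distinction can be checked programmatically. A minimal sketch, with a made-up function name: `os.access` tests against the real user ID and is only advisory (the answer can change before the file is actually opened, and a privileged user passes both checks), so treat the result as a hint rather than a guarantee.

```python
import os

def replacement_options(path):
    """Report which update strategies the current user could attempt."""
    directory = os.path.dirname(os.path.abspath(path))
    return {
        # Overwriting in place needs write permission on the file itself.
        "overwrite_in_place": os.access(path, os.W_OK),
        # Temp file + rename needs write and search permission on the directory.
        "temp_file_and_rename": os.access(directory, os.W_OK | os.X_OK),
    }
```

In the shell session above, this would report the first option as available and the second as unavailable for `a/b` when run as `user`.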

> as for the symlink/hardlink 'issue', in my sysadmin experience, more problems are caused by editors that modify files in place (not using temp files and renaming them) than by breaking links.

Obviously that would remain an option. The point is that it should be an *option*, alongside the ability to atomically update a file in place. Symlinks are often used to add version control over configuration files (without putting the entire home / etc directory in the repository), while bind mounts are more often used with namespaces and chroot environments. Usually you want the latter to be read-only, but if the file is read/write it makes sense to allow it to be updated without breaking the link. (You can't just delete or rename over a bind target, either; it has to be unmounted.)

XFS: the filesystem of the future?

Posted Mar 5, 2012 22:32 UTC (Mon) by XTF (guest, #83255) [Link]

> The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

They don't and that's not how atomicity is guaranteed. Atomicity is guaranteed via the journal.

> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.

Not really, actually. You merely have to ensure that the old state / blocks remain valid, so you have to do all writes to new blocks.

XFS: the filesystem of the future?

Posted Mar 5, 2012 22:55 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

>> The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

> They don't and that's not how atomicity is guaranteed. Atomicity is guaranteed via the journal.

I can only assume you have a particular filesystem in mind. It is possible to arrange for inodes (or at least the data block portions) to fit within one sector, and to have atomic metadata updates without a journal. If you have a journal, great; atomic updates shouldn't be a problem. However, this system can also be retrofitted onto filesystems which do not support journals.

>> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.

> Not really, actually. You merely have to ensure that the old state / blocks remain valid, so you have to do all writes to new blocks.

True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.

XFS: the filesystem of the future?

Posted Mar 6, 2012 0:10 UTC (Tue) by XTF (guest, #83255) [Link]

> I can only assume you have a particular filesystem in mind

Not really

> True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.

What do you mean by full copy-on-write semantics?
What on-disk structure changes would be required to do this in ext4 for example?

XFS: the filesystem of the future?

Posted Mar 6, 2012 0:49 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

>> I can only assume you have a particular filesystem in mind
> Not really

In that case, why did you say that on-disk inodes do not fit within one physical sector? That is filesystem-specific, and I certainly know of some where the full inode size is less than or equal to 512 bytes; ext2 is at least capable of being configured that way.

>> True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.
> What do you mean by full copy-on-write semantics?
> What on-disk structure changes would be required to do this in ext4 for example?

Perhaps none. I didn't mean to imply that it was impossible to implement atomic replacement of partial files without changing the on-disk structure; I simply hadn't proved to myself that it could be done easily. In retrospect it probably could be done, though you would run into the aforementioned fragmentation issues common to most C.O.W. filesystems.

The biggest complication is not on disk but in memory; you would need to modify the filesystem code to account for the shared data blocks, of which there may be as many alternate versions as there are O_ATOMIC file descriptors--unless O_ATOMIC is exclusive, of course.

XFS: the filesystem of the future?

Posted Mar 6, 2012 12:55 UTC (Tue) by XTF (guest, #83255) [Link]

> In that case, why did you say that on-disk inodes do not fit within one physical sector?

You're right, inodes typically do fit in a sector.
What I wanted to say is that a meta-data transaction usually involves multiple parts / sectors. Providing consistency guarantees after a crash is hard without a journal.
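
To make the journal point concrete, here is a toy sketch of the commit-record idea: a transaction that touches several sectors is applied during recovery only if its commit record made it to the log. Everything here is an assumption for illustration (the record layout, the function names, the use of a CRC); real journals such as those in ext4 or XFS are far more involved.

```python
import zlib

def journal_write(journal, updates):
    """Append one transaction: a record per sector, then a commit record."""
    payload = repr(sorted(updates.items())).encode()
    for sector, value in sorted(updates.items()):
        journal.append(("update", sector, value))
    journal.append(("commit", zlib.crc32(payload)))

def recover(journal, sectors):
    """Replay only transactions whose commit record is present and matches."""
    pending = {}
    for record in journal:
        if record[0] == "update":
            pending[record[1]] = record[2]
        elif record[0] == "commit":
            payload = repr(sorted(pending.items())).encode()
            if zlib.crc32(payload) == record[1]:
                sectors.update(pending)  # all sectors change together
            pending = {}
    return sectors  # updates without a commit record are dropped
```

Truncating the log anywhere before the commit record leaves every sector at its old value, which is the all-or-nothing behaviour the multi-sector update needs.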

> unless O_ATOMIC is exclusive, of course.

Having a reader and a writer or multiple writers at the same time is always problematic.

XFS: the filesystem of the future?

Posted Mar 6, 2012 21:36 UTC (Tue) by dgc (subscriber, #6611) [Link]

> - Write new data to disk blocks which are unallocated on disk, but
> allocated in memory.

open(tmpfile)
write(tmpfile)

> - Allocate new blocks on disk.
> - Force in-order I/O (e.g. flush the disk cache).

fsync(tmpfile)

> - Atomically update inode to point to new data blocks, in memory and
> on disk.
> - Force in-order I/O (e.g. flush the disk cache).
> - De-allocate old data blocks on disk.

rename(tmpfile, destination)
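
The four calls above might look like this in Python. This is a sketch, not a general solution: the function name is made up, the temp file is assumed to be creatable in the destination's own directory, and the result is a new inode, so ownership, mode, ACLs and hard links of the old file are not preserved.

```python
import os
import tempfile

def replace_via_rename(destination, data):
    """open / write / fsync / rename, with the temp file next to the target."""
    directory = os.path.dirname(os.path.abspath(destination))
    # mkstemp picks an unused name (mode 0600), so no existing file is clobbered.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data reaches disk before the rename
        os.replace(tmp_path, destination)  # rename(2) on POSIX systems
    except BaseException:
        # Best-effort cleanup of the temp file if anything failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Readers of `destination` see either the old file or the new one, never a partially written mix.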

It seems that there are lots of people with ideas of how to "improve" overwrites but few of those people really understand the mechanisms that filesystems already provide via the POSIX interface. If you ever wonder why filesystem developers are a little bit sick of this topic, your post is a perfect example.

Dave.

XFS: the filesystem of the future?

Posted Mar 6, 2012 21:43 UTC (Tue) by dlang (subscriber, #313) [Link]

per the other messages in this thread, the OP is concerned about rather esoteric situations where

the file is a bind mount

the program doesn't have permission to make a new file in the directory (only to modify an existing one)

where the file is a special link (symlink, multiple hardlinks, bind mounted, etc) and the 'right thing' is to maintain those links instead of breaking them to have the modified version be a local copy

these situations do exist, but they are rather rare and the 'right thing' to do is not always an obvious and consistent thing.

XFS: the filesystem of the future?

Posted Mar 6, 2012 22:09 UTC (Tue) by XTF (guest, #83255) [Link]

> about rather esoteric situations where

Since when is not resetting meta-data (like file owner, permissions, acls, creation timestamp, etc) an esoteric situation?

Each case by itself might be insignificant, but all cases together are not, IMO:
Assuming the target is not a symlink to a different volume
Assuming you are allowed to create the tmp file
Assuming you are allowed to overwrite an existing file having the same name as your tmp file
Assuming it's ok to reset meta-data, like file owner, permissions, acls, creation timestamp, etc.
Assuming the performance regression due to fsync is ok (request was for atomic, not durable)

XFS: the filesystem of the future?

Posted Mar 7, 2012 11:19 UTC (Wed) by dgc (subscriber, #6611) [Link]

> the OP is concerned about rather esoteric situations

I know. But what was described was a specific set of block and disk cache manipulations. It wasn't a set of requirements; all I was doing was pointing out that, if I treat it as a set of requirements, it can be implemented with the existing POSIX API.

> These situations do exist, but they are rather rare and the 'right
> thing' to do is not always an obvious and consistent thing.

And that is precisely why the problem can't be solved by some new magic filesystem operation until the desired behaviour is defined and specified. The weird corner cases that make it hard for userspace also make it just as hard to implement it in the kernel. I see lots of people pointing out why using rename is hard, but their underlying assumption is that moving all this into the kernel will solve all these problems. Well, it doesn't. For example: how does moving the overwrite into the kernel make it any easier to decide whether we should break hard links or not?

IOWs, the issue here is that nobody has defined what the operations being asked for are supposed to do, the use cases, the expected behaviour, constraints, etc.

Before asking kernel developers to do something the problem space needs to be scoped and specified. For the people that want this functionality in the kernel: write a man page for the syscall. Refine it. Implement it as a library function and see if you can use it effectively for your use cases. Work out all the kinks in the syscall API. Ask linux-fsdevel for comments and whether it can be implemented. Go back to step 1 with all the review comments you receive, then continue looping until there's consensus and somebody implements the syscall with support from a filesystem or two. Once the syscall is done, over time more filesystems will then implement support and then maybe 5 years down the track we can assume this functionality is always available.

But what really needs to happen first is that someone asking for this new kernel functionality steps up and takes responsibility for driving this process.....

Gesta non verba.

XFS: the filesystem of the future?

Posted Mar 7, 2012 16:54 UTC (Wed) by XTF (guest, #83255) [Link]

> but their underlying assumption is that moving all this into the kernel will solve all these problems. Well, it doesn't.

Why not? Some of the issues should be easy to avoid in kernel space.

> For example: how does moving the overwrite into the kernel make it any easier to decide whether we should break hard links or not?

What does the non-atomic case do? I'd do the same in the atomic case.

> IOWs, the issue here is that nobody has defined what the operations being asked for are supposed to do, the use cases, the expected behaviour, constraints, etc.

My request was quite clear: Could any of the fsync advocates post real code that does the atomic variant of open, write, close?
Isn't that quite well-defined?

> Before asking kernel developers to do something the problem space needs to be scoped and specified

Before that, we should agree that these are valid issues / assumptions / regressions.

> But what really needs to happen first is that someone asking for this new kernel functionality steps up and takes responsibility for driving this process.....

That'd be great, but at the moment most (FS) devs (and others) just declare these non-issues and refuse to discuss them.

XFS: the filesystem of the future?

Posted Mar 6, 2012 22:00 UTC (Tue) by XTF (guest, #83255) [Link]

> It seems that there are lots of people with ideas of how to "improve" overwrites but few of those people really understand the mechanisms that filesystems already provide via the POSIX interface. If you ever wonder why filesystem developers are a little bit sick of this topic, your post is a perfect example.

Other people appear to have trouble reading. ;)
Your code was posted already and didn't suffice.

XFS: the filesystem of the future?

Posted Mar 6, 2012 22:05 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> open(tmpfile)
> write(tmpfile)
> fsync(tmpfile)
> rename(tmpfile, destination)

Now find a way to do that *without* the need for a temporary file, and you might have something relevant to contribute to the thread.

A temporary file is not always an acceptable option; it presumes that you know of a directory in which you can create a temporary file guaranteed to be on the same filesystem as the file that it's replacing, so that rename() can be implemented atomically, and that either there are no hard links to the file or that those links should be broken by the rename(). Moreover, the rename() method resets portions of the security context of the original file, including ownership and security labels, which you can't restore without superuser capabilities.
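
A sketch of how an application might at least detect these side effects before choosing the rename() route. The function names are hypothetical, only a few of the cases are covered, and ACLs and security labels are not examined at all. Restoring a different owner on the new file would need privileges the caller often lacks, which is the point made above.

```python
import os
import stat

def rename_side_effects(path):
    """List what replacing `path` via temp file + rename() would change."""
    effects = []
    if os.path.islink(path):
        effects.append("symlink would be replaced by a regular file")
    st = os.stat(path)  # follows symlinks
    if st.st_nlink > 1:
        effects.append("other hard links would keep the old contents")
    if st.st_uid != os.geteuid():
        effects.append("file owner would change to the current user")
    return effects

def copy_mode(src, tmp_path):
    """Carry the permission bits over to the temp file.

    Owner, ACLs and security labels are deliberately not handled here.
    """
    os.chmod(tmp_path, stat.S_IMODE(os.stat(src).st_mode))
```

An empty list only means none of these particular problems were detected, not that the replacement is equivalent to an in-place update.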

This thread exists because people who do know plenty about the POSIX interfaces also know that they don't provide a general solution.

XFS: the filesystem of the future?

Posted Mar 5, 2012 22:18 UTC (Mon) by XTF (guest, #83255) [Link]

> writing the data to the new blocks may not be atomic.

It doesn't need to be

> writing the metadata to disk may not be atomic.

That's why you use a journal

> If your writes to disk can't be atomic, how can the entire transaction?

Heard of TCP? It creates a reliable connection over an unreliable network.
Or databases? Atomic transactions on non-atomic disks are very possible.

> some of the data you write to a file may be visible before the metadata gets changed, which would make the change overall not be atomic.

No, because you write the new data to *new* blocks. Blocks not referenced by any file yet.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds