The 3.11 merge window closes

Posted Jul 17, 2013 9:27 UTC (Wed) by Karellen (subscriber, #67644)
In reply to: The 3.11 merge window closes by epa
Parent article: The 3.11 merge window closes

"that sounds like a better way to write files in general. No more pesky partially-written files if something goes wrong;"

You can do that already by creating a file with the "wrong" name (e.g. "config.cfg.new") and calling rename(2) when you're finished. e.g.

rename("config.cfg.new", "config.cfg");
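
For illustration, the whole write-then-rename sequence might look like the sketch below. The helper name replace_file() and the ".new" suffix are assumptions made for this example, and the sketch deliberately leaves out any fsync(), so it only addresses partially-written files, not crash durability.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: write `data` to "<path>.new", then rename it
 * over `path`. Readers see either the old file or the complete new one.
 * Returns 0 on success, -1 on failure. */
static int replace_file(const char *path, const char *data)
{
    char tmp[4096];
    int n;
    FILE *f;

    assert(path != NULL && data != NULL);

    n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                      /* path too long */

    f = fopen(tmp, "w");
    if (f == NULL)
        return -1;

    if (fputs(data, f) == EOF) {
        fclose(f);
        remove(tmp);                    /* do not leave a partial temp file */
        return -1;
    }
    if (fclose(f) == EOF) {             /* flush errors show up here */
        remove(tmp);
        return -1;
    }

    /* rename(2) replaces any existing `path` in one step */
    if (rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}
```

A caller would use it as replace_file("config.cfg", "key=value\n").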

"instead open with O_TMPFILE and atomically link it with a certain name once you're finished."

Unfortunately, there currently is no way to create a new link to a file for which you only have a file handle.

There are some cases in which it wouldn't make sense, such as trying to link a socket fd into a filesystem, or creating a link in a mountpoint where the file does not reside, but a hypothetical fdlink() could return EXDEV as per rename() in that case. In any case, no-one's implemented it yet.

"The piece that's missing is an atomic link-and-unlink operation where you link a file into a directory with a given name, at the same time unlinking any file that was previously there with that name"

rename(2) already does this.

"(or even renaming the existing file atomically)."

You can do this with link(2), by e.g.

link("config.cfg", "config.cfg.old");
rename("config.cfg.new", "config.cfg");
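
As a rough sketch of those two calls with error handling, assuming a hypothetical helper named install_with_backup(): note that link(2) fails with EEXIST if the backup name is already taken, so the sketch removes a stale backup first. That extra step is an addition for the example, not something the comment above describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: keep the current file reachable as `backup`,
 * then move `newpath` over `path`. Returns 0 on success, -1 on failure. */
static int install_with_backup(const char *newpath, const char *path,
                               const char *backup)
{
    assert(newpath != NULL && path != NULL && backup != NULL);

    /* link() will not overwrite, so drop any previous backup first */
    if (unlink(backup) != 0 && errno != ENOENT)
        return -1;

    /* Give the current file a second name; ENOENT just means there is
     * no current file to back up yet. */
    if (link(path, backup) != 0 && errno != ENOENT)
        return -1;

    /* Replace `path`; the old contents stay reachable via `backup`. */
    return rename(newpath, path);
}
```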


to post comments

The 3.11 merge window closes

Posted Jul 17, 2013 10:12 UTC (Wed) by epa (subscriber, #39769) [Link] (2 responses)

"Unfortunately, there currently is no way to create a new link to a file for which you only have a file handle."

I thought the linkat trick described by Al Viro in the grandparent comment would achieve that.
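
For readers who have not seen it, the trick is roughly the following, sketched after the pattern documented in open(2). The helper name write_then_link() is an assumption; the sketch also assumes a kernel and filesystem with O_TMPFILE support and a mounted /proc. Because linkat(2) fails with EEXIST when the target name exists, this does not by itself replace an existing file.

```c
#define _GNU_SOURCE             /* for O_TMPFILE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create an unnamed file in `dir`, fill it, and
 * only then give it the name `path`. Returns 0 on success, -1 on failure. */
static int write_then_link(const char *dir, const char *path,
                           const char *data)
{
    char proc[64];
    size_t len;
    int fd, rc, saved;

    assert(dir != NULL && path != NULL && data != NULL);

    /* The file has no name yet; it vanishes if we exit before linking. */
    fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0)
        return -1;              /* fails where O_TMPFILE is unsupported */

    len = strlen(data);
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;              /* short writes treated as failure here */
    }

    /* Link the open file into the namespace via its /proc entry. */
    snprintf(proc, sizeof proc, "/proc/self/fd/%d", fd);
    rc = linkat(AT_FDCWD, proc, AT_FDCWD, path, AT_SYMLINK_FOLLOW);

    saved = errno;              /* keep linkat's errno across close() */
    close(fd);
    errno = saved;
    return rc;
}
```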

I believe that rename is not fully atomic. As the manual page says, "However, when overwriting there will probably be a window in which both oldpath and newpath refer to the file being renamed." It's also not atomic over NFS (though enhancements to the NFS protocol may be out of scope for Linux kernel discussions).

However, it is atomic in the looser sense that the filename at any moment links to either the old file or the new one. That may be good enough for many applications.

The 3.11 merge window closes

Posted Jul 17, 2013 10:40 UTC (Wed) by Karellen (subscriber, #67644) [Link]

"I thought the linkat trick described by Al Viro in the grandparent comment would achieve that."

Doh! That'll teach me to skim some messages.

Wow. That's incredibly neat. Hadn't thought of/seen that before. Thanks.

The 3.11 merge window closes

Posted Jul 17, 2013 17:22 UTC (Wed) by dlang (guest, #313) [Link]

If the file doesn't currently have a filename, then there is no window where there are multiple paths to the file (not that having multiple paths to a file should be any problem; that's a 'normal' case on a Unix filesystem).

As long as the target path always exists, and always points at either the old or the new file, you should be in good shape.

Now, to be crash safe, you need to fsync the file before doing the rename, and you need to not be using ext3, which has such horrid fsync behavior.

The 3.11 merge window closes

Posted Jul 17, 2013 10:43 UTC (Wed) by mjg59 (subscriber, #23239) [Link] (8 responses)

You've missed the bit where you have to fsync() the file because POSIX says the metadata can be updated before the data hits the disk and you can end up with an empty file, except you shouldn't do that because ext3 will then write out the entire journal and you'll block for several seconds.

The 3.11 merge window closes

Posted Jul 17, 2013 23:35 UTC (Wed) by Karellen (subscriber, #67644) [Link] (2 responses)

"You've missed the bit where you have to fsync()..."

I've seen that argument before, but it's always confused me. Surely that's only wanted as protection against an unexpected system crash/failure? Except - I didn't think that POSIX made any guarantees at all in that event. I thought your OS was "allowed" to overwrite your partition tables and FS journals completely in the event of a crash and still be POSIX-compliant.

(If not, how does POSIX expect to guarantee otherwise, unless POSIX compliance requires the absence of certain classes of bugs?)

Looking at the rationale section of POSIX fsync[0] documentation, fsync() is allowed to be the null operation, or to not cause data to actually be written, and that fsync() correctness could be considered a QoI issue.

However, the Open Group website documentation is the closest thing I have to the actual POSIX spec. If there is another section somewhere dealing with the general problem of compliance in the face of bugs/power outages which is more enlightening, I would welcome a link to it, or a quote from it.

(FWIW, I think that Linux writing metadata before data is a poor QoI decision, and that the filesystem devs should strive to do otherwise, no matter what POSIX allows. However, IANA Kernel/FS developer, and am not properly informed on how hard, impractical or pessimal that might be.)

[0] http://pubs.opengroup.org/onlinepubs/009695399/functions/...

The 3.11 merge window closes

Posted Jul 18, 2013 0:25 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

If we weren't concerned about such things, ext4 wouldn't default to committing every 5 seconds. We can't guarantee that systems will remain powered or that software is perfect, and so if your expectation is that your code will always either read the new file or the old file and never see an intermediate representation then you probably want to handle the not-uncommon case of the device ceasing to function at around the same time.

The meaning of fsync

Posted Jul 20, 2013 17:56 UTC (Sat) by giraffedata (guest, #1954) [Link]

"I thought your OS was "allowed" to overwrite your partition tables and FS journals completely in the event of a crash and still be POSIX-compliant."

The thing about adherence to any standard is that one specifies the very adherence with myriad conditions, most of them implied. So POSIX doesn't say, "if the system crashes, a read doesn't have to get back the same data that was written." Rather, the system designer says, "the system is POSIX-compliant as long as the system never crashes." And as I said, that condition is usually not actually spoken. There are tons of similar conditions: the superuser does not write directly to the disk; the disk drive never makes a mistake; cosmic rays don't change magnetic state; etc.

Of course, designers do whatever they can to reduce the conditions; few systems today are offered on a "if the power ever goes out, nothing in the POSIX standard applies" basis.

Fsync drives us into the awkward territory of robustness. Robustness is a system's ability to work when it is broken. That contradiction in terms is why any specification of fsync is bound to be fuzzy. It's like saying, "I will pay you back by Tuesday. If I don't, ..."

The 3.11 merge window closes

Posted Jul 18, 2013 14:06 UTC (Thu) by Tobu (subscriber, #24111) [Link] (4 responses)

The unavailability of good (O_PONIES) semantics continues to amaze me.

The only option right now seems to be a combination of f(data)sync and deferred threads; but introducing threads has a nasty engineering cost.

The last I've seen of these issues (on the XFS list), maintainers were willing to take a new flag (don't know if that's possible; the VFS seems misdesigned to ignore new flags, see O_TMPFILE above) or a new VFS syscall that might be progressively implemented.

The 3.11 merge window closes

Posted Jul 18, 2013 14:23 UTC (Thu) by viro (subscriber, #7872) [Link] (3 responses)

FWIW, it's much worse than any VFS design flaws; those we could deal with, but there's fuck-all we can do about existing userland ABI. Which, for open(2), had been "ignore any unknown bits", all the way since the very beginning. And yes, it *is* a sucky ABI. Unknown values in flags, etc., should be rejected with "I don't know what you are talking about", rather than silently ignored. Sadly, open(2) is one of the places where that principle had been violated. What's really sad is that openat(2) did not add such validation, even though it had happened much more recently.

See the talk by Michael Kerrisk re ABI suckitude a while ago - this is a prime example of such ;-/

The 3.11 merge window closes

Posted Jul 18, 2013 15:18 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

I really wish that Linux would stop piling flags upon flags to existing syscalls. The man page for open() already looks like one of those 'interactive novels'.

Just create a new syscall, say open2(), with a better-designed ABI. Old programs can still use open() and new ones can use the new syscall to get new features.

The 3.11 merge window closes

Posted Jul 18, 2013 19:58 UTC (Thu) by paulj (subscriber, #341) [Link]

And if a new syscall is added, at least divide the flag space up into "Don't care if not implemented" and "Mandatory, error if not implemented".

The 3.11 merge window closes

Posted Jul 21, 2013 22:05 UTC (Sun) by nix (subscriber, #2304) [Link]

Oh, great. Because one of the things that's really *nice* about the Windows API is its profusion of Foo and FooEx and FooExEx calls.

No, what you do if you really think programs will care is introduce a new open2(), wire it to a new version of open() in glibc, change the values of all the O_* constants in glibc (but *not* the kernel) to some new value range that doesn't intersect the old, and have glibc redirect all calls using any old flag values to the old open() and all new ones to open2(), mapping the 'new' flag values in the userspace API to the kernel values (probably by subtracting a constant). You can also expose the old flags under new names, OBS_EXCL and the like. That way, old apps get the old syscall, new ones get the new syscall, and new apps that really, really want the old semantics can get them.

If you thought it mattered that much, and really needed to do it, that's how you'd do it. No uglifying programs with horrible open2() nonsense. (Yes, you need a new glibc version to use this, but you need a new glibc to use any new syscall *anyway*.)

