
Quotes of the fortnight

Basically, file systems are not databases, and databases are not file systems. There seems to be an unfortunate tendency for application programmers to want to use file systems as databases, and they suffer as a result.
-- Ted Ts'o

We simply do not add kernel facilities that can produce a stack overflow, memory corruption and triple fault if a rare debug statement triggers in an IRQ context by accident.
-- Ingo Molnar

There is absolutely _no_ excuse for an endless loop in kernel mode. Certainly not "the other end is incompetent".

EVERYBODY is incompetent sometimes. That just means that you must never trust the other end too much. You can't say "we require the server to be sane in order not to lock up".

-- Linus Torvalds

Quotes of the fortnight

Posted Jan 6, 2011 7:22 UTC (Thu) by butlerm (subscriber, #13312) [Link]

> You may *say* that you don't care which version of the file you get after a rename, but only that one or the other is valid, but what if some other program reads from the file, gets the second file, and sends out a network message saying the rename was successful, but then a crash happens and the rename is undone?

This is a red herring. The system crashed. fsync was not called. Now the only question is how best to pick up the pieces. Arguing that it would be better to have a corrupted file than the previous version of the file because some connected system somewhere has been sent an unconfirmed report of an uncommitted operation is silly.

Quotes of the fortnight

Posted Jan 6, 2011 9:34 UTC (Thu) by Nick (guest, #15060) [Link]

That's just trying to fit a square peg into a round hole. POSIX does not have atomic-but-not-durable file overwrite API. The filesystem does not try to be a database, but a much simpler building block (that you can build your own transactional layer upon if you want).

If you want a new version of a file's data without durability, create it under a new name. If you crash before you make it durable, you can delete it. If you subsequently require durability then you can fsync and then make a note of it (eg. with a rename or lockfile) so after a crash you know it is valid data. That's simple, works on all posix systems, doesn't require the use of any database library, and doesn't require other code, doesn't require a complex special case new API to be supported by the kernel, and best of all it can work with multiple files and directory hierarchies unlike atomic-overwrite-one-file.
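
The sequence described above (write under a new name, fsync, then rename into place) could be sketched in Python roughly as follows. This is only an illustration for POSIX systems: the helper name `replace_durably` and the `.new` suffix are assumptions made for the example, not anything proposed in the thread.

```python
import os


def replace_durably(path: str, data: bytes) -> None:
    """Write data under a new name, make it durable, then rename it over path."""
    tmp_path = path + ".new"  # hypothetical naming convention for the new copy
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # the new file's data is durable before it gets the real name
    finally:
        os.close(fd)
    os.rename(tmp_path, path)  # atomic namespace operation
    # Sync the containing directory so the rename itself is durable too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the process or machine dies before the rename, a leftover `.new` file can simply be deleted on the next start, which is the cleanup step mentioned above.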

Quotes of the fortnight

Posted Jan 6, 2011 11:06 UTC (Thu) by epa (subscriber, #39769) [Link]

> POSIX does not have atomic-but-not-durable file overwrite API.
Does POSIX have atomic-but-not-durable anything in the file system?

Quotes of the fortnight

Posted Jan 6, 2011 13:55 UTC (Thu) by Nick (guest, #15060) [Link]

Namespace operations. creat/unlink/rename.

Quotes of the fortnight

Posted Jan 6, 2011 12:42 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

We're not obliged to limit ourselves to what POSIX guarantees. Write-then-rename can be atomic but not durable without violating POSIX. What good reason is there for a filesystem not to behave this way?

(Any argument that you should have to fsync() to get this behaviour should also include a full description of how to handle the same situation on ext3)

Quotes of the fortnight

Posted Jan 6, 2011 14:17 UTC (Thu) by Nick (guest, #15060) [Link]

The good reason is high complexity and low benefit, as explained in the thread.

But that's just my (and some others') opinions. If a developer wants to change something, it is up to them to justify it. If you can show a significant improvement for real apps that can not be achieved with alternatives that outweighs the complexity of the patch, you have a very good chance to get people interested and get it merged.

Quotes of the fortnight

Posted Jan 6, 2011 14:59 UTC (Thu) by mjg59 (subscriber, #23239) [Link]

The benefit is that you get the behaviour that filesystem users actually want, without making ext3-based systems unusable. That doesn't sound like low benefit.

Quotes of the fortnight

Posted Jan 7, 2011 3:49 UTC (Fri) by Nick (guest, #15060) [Link]

Please read the thread. "Users" don't want any particular API, they want apps to work properly. If developers want an atomic write of new file data without the slowdown of fsync, they can achieve it in many ways; the traditional one is writing to a new name in the namespace.

Quotes of the fortnight

Posted Jan 7, 2011 15:44 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

Since POSIX provides no guarantee of whether a file's ever hit disk (except fsync(), which we've already established isn't usable for this kind of situation), using new names doesn't work terribly well. I start instance 1 of the app and close it - in the process, config.new gets created. I restart it. It finds config.new and copies it onto config and at that point the system crashes. Oops! Now I potentially have neither copy of the file again. So I can't just retain one additional copy of the file. The alternative is to create config.datestamp, but there's no way to know when I can garbage collect old versions (even checking against uptime doesn't work, because the filesystem may be on a remote server).

The users of a filesystem are the people who have to write working applications that use it. The only end-users who use the filesystem directly are sysadmins using things like btrfs's snapshot support. I don't understand the intense reluctance on the part of certain parts of the filesystem community to accept that a filesystem that doesn't give application developers what they want isn't a useful general purpose filesystem.

Quotes of the fortnight

Posted Jan 7, 2011 16:57 UTC (Fri) by Nick (guest, #15060) [Link]

No, we have not established that fsync is not usable. It is exactly what to use if you want a guarantee of data integrity. So the rest of your example is nonsense.

Quotes of the fortnight

Posted Jan 7, 2011 17:17 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

If fsync() were usable on ext3 we wouldn't be having this conversation. In several common cases (like, applications actually responding to the user), it's not.

Quotes of the fortnight

Posted Jan 7, 2011 17:32 UTC (Fri) by Nick (guest, #15060) [Link]

You know what I'll say the correct solution to that is, don't you? Improve the filesystem so that everybody benefits with the existing APIs (as opposed to creating a more complex and non standard API because your simple one is too slow).

Quotes of the fortnight

Posted Jan 7, 2011 21:29 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

Which would be a fine solution, if there were any way to update the huge number of machines currently running ext3. But there isn't, so application developers who want to have their code be usable on them can't rely on fsync() (and nor can they make this behaviour conditional on ext3, because statfs returns the same magic for ext2, 3 and 4). So, yeah, compatibility with what is perhaps the most widespread POSIX filesystem out there (with the possible exception of HFS+) means not being able to use fsync() unless it's really vital that you do so.

Quotes of the fortnight

Posted Jan 7, 2011 21:32 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

if those systems aren't going to be updated so that they could run a new version of ext3, what makes you think that they will be updated to new versions of the software?

besides, aren't people claiming that the filesystems should all be changed to provide new semantics? these old systems aren't going to be updated to the new version of ext3 that would have these new semantics either.

Quotes of the fortnight

Posted Jan 7, 2011 21:37 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

ext3 in data=ordered has these semantics, and people tend to install new versions of userspace applications fairly regularly.

Quotes of the fortnight

Posted Jan 7, 2011 22:12 UTC (Fri) by neilbrown (subscriber, #359) [Link]

> because statfs returns the same magic for ext2, 3 and 4

oohhhh.. you're right. That's fairly horrible.

One could still parse /proc/self/mountinfo - key off st_dev and extract the fs type as a string. And we could wrap that up and put it in a library call called "safe_fsync".
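
A rough Python sketch of that lookup, assuming the mountinfo line layout documented for Linux (device as `major:minor` in the third field, and the filesystem type, source and per-superblock options after a lone `-` separator). The function names are hypothetical, and the sketch simply returns the first mount that matches the device.

```python
import os
from typing import Optional, Tuple


def fs_type_for_dev(mountinfo_text: str, dev: int) -> Optional[Tuple[str, str]]:
    """Return (fs_type, super_options) for the first mount whose device is dev."""
    wanted = f"{os.major(dev)}:{os.minor(dev)}"
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # The six fixed fields come first; optional fields end at a lone "-".
            sep = fields.index("-", 6)
        except ValueError:
            continue  # not a well-formed mountinfo line
        if fields[2] != wanted or len(fields) <= sep + 1:
            continue
        fs_type = fields[sep + 1]
        super_opts = fields[sep + 3] if len(fields) > sep + 3 else ""
        return fs_type, super_opts
    return None


def fs_type_for_path(path: str) -> Optional[Tuple[str, str]]:
    """Key off st_dev of path and look it up in /proc/self/mountinfo (Linux only)."""
    with open("/proc/self/mountinfo") as f:
        return fs_type_for_dev(f.read(), os.stat(path).st_dev)
```

The options string is returned alongside the type because mount options can change what the right behaviour is; whether a given kernel lists a default option there is not something this sketch checks.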

Horrible, but possible.

Quotes of the fortnight

Posted Jan 7, 2011 22:22 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

And also parse the value of the data= line, because the Right Thing varies based on that...

Quotes of the fortnight

Posted Jan 8, 2011 3:07 UTC (Sat) by Nick (guest, #15060) [Link]

fsync isn't broken (ok it is but due to code bugs which are being fixed, not by design), it's just slow on some filesystems. This is a pity but there really isn't an alternative. You can't rely on some atomic semantics that will be filesystem and OS specific, and you can't reduce fsync below integrity requirements.

You always should only use fsync when it is needed, no more and no less. If that's still too costly then you're stuck. You could provide an option for an unsafe mode for users, which is a pretty common option and helpful for any filesystem where the user doesn't need the integrity.

Quotes of the fortnight

Posted Jan 7, 2011 18:55 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

so isn't the answer to fix the bug with fsync in ext3 rather than create new commands and semantics that you then expect every other filesystem on every unix variant to implement?

Quotes of the fortnight

Posted Jan 6, 2011 17:37 UTC (Thu) by butlerm (subscriber, #13312) [Link]

> POSIX does not have atomic-but-not-durable file overwrite API

Setting aside the point that POSIX doesn't have anything to say about whether your filesystem even exists after an unclean shutdown, that is not what anyone has been asking for. No new APIs and no performance killing inefficiencies are required for a filesystem to exercise a modicum of intelligence about how it recovers from a system crash.

It is hardly unreasonable to expect a filesystem to keep a few extra bits of information so that after a system crash data and the metadata come back in much more consistent state than would otherwise be the case. That means if a rename has changed the association of a name from an inode with committed data to an inode with unwritten data, preserving the association of the name with the good inode is a much more rational recovery scenario than preserving the association of the name with the bad one.

Quotes of the fortnight

Posted Jan 6, 2011 17:50 UTC (Thu) by butlerm (subscriber, #13312) [Link]

> that is not what anyone has been asking for

Sorry, I spoke too hastily on that point. Those asking for a new API should ask for more intelligent crash recovery instead, not some sort of API or system level all filesystems guarantee.

Quotes of the fortnight

Posted Jan 7, 2011 3:52 UTC (Fri) by Nick (guest, #15060) [Link]

Why would you put some kind of ill defined, complex, and special case "intelligence" in the filesystem, when the app can do it just fine? This isn't necessarily a rhetorical question, but one that nobody has been able to answer. If you can come up with good use cases and numbers, etc. then by all means post it to the mailing list.

Quotes of the fortnight

Posted Jan 7, 2011 4:30 UTC (Fri) by foom (subscriber, #14868) [Link]

Because I (a user of the system) NEVER want the filesystem to give me a truncated file after a crash when the file would never have appeared truncated if the system kept running. Given the choice, not losing the file is of course preferable to losing the file. Even if I'm just using "mv" in a shell, I *still* don't want the filesystem to eat my data. Or should "mv" do an fsync too?

You can argue about how much /relative/ worth that has compared to the additional complexity in the filesystem journal recovery code, but the use case ("don't eat my data if you don't have to") is certainly plain.

Quotes of the fortnight

Posted Jan 7, 2011 6:36 UTC (Fri) by Nick (guest, #15060) [Link]

Lots of other users of the POSIX file API do not want the complexity that more complex semantics like atomicity would bring. You can build more complex semantics on top of the simpler ones, but you can't take complexity out of the stack, so the right way is to have a simpler API at the bottom.

I don't see what good your example is. mv is atomic regarding the namespace operation, and it will give you the same durability semantics of the data that it had before the mv.

So if you have a stable-on-disk file, and you mv it, then if the system crashes you get the same data on disk under either the new or the old name. If you had an unstable file, then you would possibly get data with no particular guarantees at either the new or the old name. Why would mv need an fsync?

I could see an fsync command being useful for shell or makefile code, that currently has to rely on sync, but that's not a problem with the syscall API.

And it's not about trying to eat your data, it is about providing a reasonable API that caters to performance but can be built on top of. The kernel still tries pretty hard to get data stable in a timely manner, so even if you're editing your critical data with echo and cat, you should be fine after a crash in most cases.

Quotes of the fortnight

Posted Jan 7, 2011 7:22 UTC (Fri) by foom (subscriber, #14868) [Link]

The example is the traditional one: you mv a file "Y" on top of another file "X".

When you crash and recover, it might turn out the namespace operations (unlink/rename) had been written to perm storage, but the new file "Y" never was. So you end up with an empty file "X", even though at no time, if the system hadn't crashed, would the file named "X" have been empty.

As I understand things, ext3/4 already special cases this, but I think does so by auto-fsync'ing the new file, rather than making the recovery process smarter. (right? I've kinda lost track).

Anyways, the point here was that you (the application or a filesystem special-case) usually don't *need* to force fsync the new file and take the speed hit. Rather, the filesystem "simply" needs to realize, during recovery at mount-time, that this situation occurred, and rewind the rename so that "X" points to the old inode again. Thus, making available a version of the file that actually has the user's data (as of some point in time) vs the one that doesn't.

I have no opinion on whether such a thing would be too hard to fully-specify and code to be worth it or not. But it's certainly a sane idea in the abstract.

Quotes of the fortnight

Posted Jan 7, 2011 7:53 UTC (Fri) by Nick (guest, #15060) [Link]

Why on earth would you do that without making Y durable with fsync and/or without backing up X elsewhere? Really, "don't do that then".

Quotes of the fortnight

Posted Jan 7, 2011 10:01 UTC (Fri) by epa (subscriber, #39769) [Link]

> Why on earth would you do that without making Y durable with fsync and/or without backing up X elsewhere?
Because you have the idea (perhaps an illusion, but one that the system has traditionally tried to foster) that operations happen sequentially.

The first operation is writing the new file Y. The second operation is moving Y to X. You might reasonably expect a system crash during the first operation: in that case file Y will be corrupt, but X will be untouched. What you don't expect is for the file system to somehow reorder the operations so that the later command (the rename) is complete even though the earlier one is not. If the system crashes at this point then you end up with a filesystem state that does not correspond to any of the states the filesystem should pass through when doing these two operations in sequence.
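
To make the difference concrete, here is a toy Python model (not real filesystem code; the operation names and the `apply_ops` helper are invented for illustration) that lists the states a crash can leave behind when the two operations reach disk strictly in order, versus when any subset of them may have reached disk.

```python
from itertools import combinations

OPS = ["write_Y", "rename_Y_to_X"]  # the order the application issued them in


def apply_ops(persisted):
    """Return the namespace seen after a crash, given which operations hit disk."""
    # Starting point: X holds the old contents, Y exists but has no data on disk.
    files = {"X": "old", "Y": ""}
    if "write_Y" in persisted:
        files["Y"] = "new"
    if "rename_Y_to_X" in persisted:
        files["X"] = files.pop("Y")
    return files


# Sequential semantics: a crash can only expose a prefix of the operation list.
sequential_states = [apply_ops(OPS[:n]) for n in range(len(OPS) + 1)]

# Reordered writeback: any subset of the operations may have reached disk.
reordered_states = [
    apply_ops(subset)
    for n in range(len(OPS) + 1)
    for subset in combinations(OPS, n)
]

# States that no sequential execution ever passes through.
unexpected = [s for s in reordered_states if s not in sequential_states]
```

In this model the only entry in `unexpected` is the one where the rename is on disk but the data is not, so the name "X" refers to an empty file.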

File systems should of course be able to reorder operations to get best performance out of the disk. But (rightly or wrongly) users expect them to preserve the same semantics as if the commands had been executed serially.

Quotes of the fortnight

Posted Jan 7, 2011 11:58 UTC (Fri) by Nick (guest, #15060) [Link]

Well I'm not sure that the posix API has ever tried to give that impression of crash safety. Properly written and portable applications of course would not make such assumptions, and so those would not need to switch over to assuming any different semantics.

Yes it would be a nice behaviour and probably implemented if it were free. It is not, and solutions exist. So that leaves us with somebody investing enough in the idea to come up with a patch and demonstrating an improvement.

Quotes of the fortnight

Posted Jan 7, 2011 14:48 UTC (Fri) by foom (subscriber, #14868) [Link]

> Well I'm not sure that the posix API has ever tried to give that impression of crash safety.

Of course it hasn't ever guaranteed it. Doesn't mean users don't expect it (they do). As has been repeatedly pointed out, POSIX doesn't even prohibit you from just destroying the entire filesystem if the OS crashes... Luckily nobody seems to think that's an acceptable thing for the implementation to do very often, though. ;)

Quotes of the fortnight

Posted Jan 7, 2011 16:37 UTC (Fri) by Nick (guest, #15060) [Link]

It means that apps do not expect some kind of atomic sequential consistency, as was claimed.

Quotes of the fortnight

Posted Jan 7, 2011 16:00 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

As far as I can tell, POSIX allows filesystem recovery after a crash to replace every one of my files with a giant ASCII middle finger, even if they've been fsync()ed. Whether anyone would choose to use that filesystem is left as an exercise for the reader.

Talking about whether POSIX guarantees behaviour or not is unhelpful. It doesn't. Everyone knows that. It's impossible to write an application which makes no assumptions above POSIX while also supporting any kind of atomicity or durability (fsync() isn't guaranteed to push stuff out of drive cache, for instance). The only reasonable approach is to accept that we need to provide functional guarantees above POSIX, and if that means that applications aren't portable then so be it.

Quotes of the fortnight

Posted Jan 7, 2011 16:52 UTC (Fri) by Nick (guest, #15060) [Link]

I don't think that's really a correct reading of POSIX at all, actually, and certainly implementors don't agree with you. Also, POSIX talks at length about I/O file integrity completion, which is clearly intended for durability and you have to weasel the words to come up with that interpretation. It would also be redundant if it then said that anything is allowed after a crash.

So you're just derailing the discussion with pointless nitpicking.

Quotes of the fortnight

Posted Jan 7, 2011 17:21 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

Implementors don't agree because it would be a ridiculous way to implement the specification. It's an absurd example, but my point is simple - people absolutely rely upon implementation details that aren't guaranteed by the specification, and saying that we shouldn't extend those guarantees further because people might rely on them on other operating systems and then get bitten is absurd. If I put a drive with 32MB of cache on Linux 1.2 I suspect that fsync() isn't going to give me what I want either, so there's clearly limits to how concerned we are with portability.

Quotes of the fortnight

Posted Jan 8, 2011 3:16 UTC (Sat) by Nick (guest, #15060) [Link]

And it would be a ridiculous way to weasel word the specification. The specification calls for data transfer to be completed to the filesystem from cache. If you weasel it to say that the "transfer" refers to transfer from OS filesystem cache until it hits the other end of the PCI bus, then you'd really get what you deserve. The specification fairly obviously intends for it to reach powerfail safe storage.

Write caches certainly do need to be dealt with, and if they haven't in the past it has been due to bugs in the code, but apps can't code to all versions of all bugs on all OSes they run on.

If people say that posix doesn't guarantee fsync to do anything, they'd probably be talking about systems which are defined not to have synchronized io. These are not general purpose systems like we are talking about.

Quotes of the fortnight

Posted Jan 8, 2011 4:05 UTC (Sat) by mjg59 (subscriber, #23239) [Link]

Your definition of "General purpose system" appears to exclude MacOS and FreeBSD (both #define _POSIX_SYNCHRONIZED_IO as -1), so I don't think it's an especially useful one. If I'm writing for portability, I can't depend upon fsync() doing anything. So I have to assume behaviour above and beyond what the spec defines, otherwise we're back to the ASCII middle finger.
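
For what it's worth, a program can also ask at run time what the system reports for this option. The sketch below uses Python's `os.sysconf`; the wrapper name is hypothetical, and it assumes the platform's Python exposes the `SC_SYNCHRONIZED_IO` name, returning `None` when it does not. It only reports what the system claims and says nothing about what fsync() actually does.

```python
import os
from typing import Optional


def synchronized_io_level() -> Optional[int]:
    """Return the value reported for _POSIX_SYNCHRONIZED_IO, or None if unknown."""
    name = "SC_SYNCHRONIZED_IO"
    # sysconf_names only exists on Unix-like platforms.
    if name not in getattr(os, "sysconf_names", {}):
        return None
    try:
        # -1 means the option is not supported; a positive value is a POSIX version.
        return os.sysconf(name)
    except (ValueError, OSError):
        return None
```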

Quotes of the fortnight

Posted Jan 8, 2011 4:22 UTC (Sat) by Nick (guest, #15060) [Link]

On operating systems that don't implement data integrity guarantees with the POSIX data integrity APIs, yes you can't use the posix APIs to implement data integrity. On the ones that do, you can, without assuming anything beyond what posix gives you.

A new, non standard behavior wouldn't seem to help anything in this regard.

Quotes of the fortnight

Posted Jan 8, 2011 4:28 UTC (Sat) by mjg59 (subscriber, #23239) [Link]

...which means that, from a desktop perspective, the only way I can guarantee data integrity is by targeting Linux, which means that portability is entirely irrelevant.

Quotes of the fortnight

Posted Jan 8, 2011 4:58 UTC (Sat) by Nick (guest, #15060) [Link]

Um, but then you *already have* the solution for Linux, so why does Linux desperately need something more? Old installations are not an argument because they won't benefit from new code.

For OSX you need a wc flush fcntl, not sure about Windows. But none of that is an argument for adding more API requirements for Linux to have to support.
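
As a sketch of what that looks like from an application, assuming the fcntl in question is macOS's `F_FULLFSYNC` (the comment above does not name it), a portable helper might try that first and fall back to plain fsync elsewhere. The helper name is made up for the example.

```python
import fcntl
import os


def flush_to_stable_storage(fd: int) -> None:
    """Ask for the strongest flush available for fd on this platform."""
    # F_FULLFSYNC is only defined where the OS provides it (macOS).
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None)
    if full_fsync is not None:
        try:
            fcntl.fcntl(fd, full_fsync)  # also asks the drive to flush its write cache
            return
        except OSError:
            pass  # not supported for this file; fall back to fsync below
    os.fsync(fd)
```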

Quotes of the fortnight

Posted Jan 8, 2011 14:45 UTC (Sat) by mjg59 (subscriber, #23239) [Link]

I'm saying that the entire "We can't extend Linux because then apps won't be portable" argument is irrelevant. Developers already have to depend on behaviour outside base POSIX, so portability is not an issue. Which means that Linux can offer behaviour not guaranteed by POSIX without it resulting in applications that are any less portable than they currently are. As an application developer, there's currently no way for me to write a usable application that has atomic-but-not-durable characteristics that will run on both existing Linux deployments and future Linux deployments. That's a problem, and pointing at POSIX doesn't provide a solution.

Quotes of the fortnight

Posted Jan 9, 2011 3:57 UTC (Sun) by Nick (guest, #15060) [Link]

Of course, but there has to be a good use case and cost/benefit tradeoff.

People who say "filesystem apis 'appear' to offer sequential integrity or atomicity, so that's what a good implementation should do" etc obviously don't know the filesystem apis because that is not what they offer, they never made an appearance to, and in fact they explicitly don't offer that, which is considered a feature in many cases.

Quotes of the fortnight

Posted Jan 11, 2011 21:08 UTC (Tue) by roelofs (guest, #2599) [Link]

> As far as I can tell, POSIX allows filesystem recovery after a crash to replace every one of my files with a giant ASCII middle finger, even if they've been fsync()ed.

Ah, of course: "See Figure 1." One of my favorites. :-)

Greg

Quotes of the fortnight

Posted Jan 17, 2011 8:21 UTC (Mon) by dpotapov (guest, #46495) [Link]

> during recovery at mount-time, that this situation occurred
> and rewind the rename so that "X" points to the old inode again.

and you can get a corrupted file system, because now the old inode may contain information of another file or directory... So, just rewinding references will not work.

The fact is if your old data are overwritten and the new data have not been flushed to the disk, there is nothing any "intelligent" recovery can do other than to truncate the file.

An alternative of having all data being written to the journal has significant performance penalties...

Quotes of the fortnight

Posted Jan 7, 2011 7:46 UTC (Fri) by butlerm (subscriber, #13312) [Link]

> Lots of other users of the POSIX file API do not want the complexity that more complex semantics like atomicity would bring.

There is no need to change the POSIX semantics, nor any need to change the guarantees the POSIX API provides under Linux. There is no need for atomic file writes in particular. What would be helpful, however, is for filesystems to do a better job of restoring to a consistent state after a crash. The system crashed, you can't get perfection. However, you can easily undo renames where the name now points to corrupted data.

The idea is simple. Log all renames in the journal. When doing a rename, if the replacement inode data has not been committed yet, hold on to the old inode until it has. If the system crashes, replay the journal, then for each pending rename that points to a new inode with data that did not get written, undo the rename such that the name points to the old inode, then deref and discard the inode with the unwritten data as appropriate.
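
As a toy illustration of that recovery rule (an in-memory model only, nothing like real journal code; `RenameRecord`, `recover` and the inode numbers are invented for the example), assume the journal has already been replayed into `directory` and that `written` is the set of inodes whose data is known to have reached disk:

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class RenameRecord:
    name: str       # directory entry that was replaced
    old_inode: int  # inode the name pointed to before the rename
    new_inode: int  # inode the name points to after the rename


def recover(directory: Dict[str, int],
            journal: List[RenameRecord],
            written: Set[int]) -> Dict[str, int]:
    """Undo logged renames whose replacement inode never had its data written."""
    recovered = dict(directory)
    # Walk newest-first so a chain of replacements unwinds one step at a time.
    for rec in reversed(journal):
        points_at_new = recovered.get(rec.name) == rec.new_inode
        if points_at_new and rec.new_inode not in written:
            recovered[rec.name] = rec.old_inode
    return recovered
```

The sketch leaves out the "deref and discard" step and everything about how a real filesystem would know which inodes count as written.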

Very low overhead, in many cases considerably less overhead than having the filesystem start write out of the replacement inode data immediately, in a half baked attempt at accomplishing the same thing.

Quotes of the fortnight

Posted Jan 7, 2011 7:55 UTC (Fri) by Nick (guest, #15060) [Link]

Write a patch that takes care of all that, how it interacts with subsequent (infinitely many) renames, tracks unwritten data, new writes, fsyncs, etc. And then we can start talking about portability and actual usefulness and performance data on real apps.

Quotes of the fortnight

Posted Jan 8, 2011 22:51 UTC (Sat) by butlerm (subscriber, #13312) [Link]

Certainly there are implementation issues here, although not obviously any significant performance ones.

I am not familiar with how current implementations determine that the data associated with a file wasn't written out (so that it knows to truncate the file instead of making random data on the disk visible), but I imagine one would use journal entries indicating that the region implied by the inode data size at first or last close has been written to disk at least once.

In the general case of a series of rename replacements, it would be ideal to back track to the last good (i.e. written) inode. In practice there is no point in temporarily holding on to more than two inodes in a replacement chain at a time. To be any use you have temporarily to hold on to at least one - preferably one where writeout has completed.

Suppose someone does rename replacements of the same name in rapid fire. The first inode you want to temporarily hold onto is one that has already made it completely to disk (but has pending replacements). The second inode you ideally want to hold onto is one that has been replaced by a third, but has not finished writeout yet. The normative, third inode is the one that has not yet been replaced, and has also not finished writeout yet. A fourth inode may be in the process of preparation, but the fs doesn't know about the relationship, because a rename has not yet occurred.

So in an ideal implementation, logically speaking inode 1 is discarded whenever the candidate inode (inode 2) completes writeout. Whenever there is a rename replacement with an inode that has not completed writeout, the replacement becomes the new inode 3. But as the new "current" inode (inode 3) may never be written out (it might not even start), if there isn't an inode 2 the previous inode 3 becomes the new candidate and continues writeout to completion, and is promoted to inode 1.

So with a typical 30 second delay with typical file sizes, in a rapid series of replacements at most one inode in the series will be under writeout at any given time. When that inode finishes writeout, it becomes the new committed inode, and the current inode becomes the new candidate. Other possible candidates are discarded, because there is no point in having more than one candidate (known to the fs) under writeout at once.

When the series of replacements stops, the current inode eventually becomes the candidate inode, and then the committed one. If the system suffers an unclean shutdown in the middle of this process the latest committed inode in the series is selected, the name is changed to point back to it, and the later inodes in the series that have no other references are discarded.

That is how I would do it anyway. A simpler implementation would only temporarily hold onto one inode at a time, with the disadvantage that the committed inode might become rather old if there is an ongoing series of rename replacements that never complete writeout before being replaced by another one. To keep the committed inode up to date in that case, a candidate replacement inode should be selected if necessary and allowed to complete writeout before being discarded.
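
One possible reading of the committed/candidate/current bookkeeping, written as a small Python state machine. It is a simplification and the class and method names are hypothetical: here the candidate is simply "the one inode under writeout", and a new current inode is promoted to candidate immediately whenever nothing else is being written out.

```python
from typing import Optional


class ReplacementChain:
    """Tracks which inode a name falls back to after a crash."""

    def __init__(self, committed: int) -> None:
        self.committed = committed            # data fully on disk; crash target
        self.candidate: Optional[int] = None  # the single inode under writeout
        self.current = committed              # what the name points to right now

    def rename_over(self, new_inode: int) -> None:
        """The name is replaced by new_inode, whose data is not yet on disk."""
        self.current = new_inode
        if self.candidate is None:
            self.candidate = new_inode  # nothing else under writeout: start this one
        # Otherwise the existing candidate keeps going; skipped inodes are dropped.

    def writeout_done(self) -> None:
        """The candidate's data has reached disk."""
        if self.candidate is None:
            return
        self.committed = self.candidate
        # The current inode becomes the next candidate unless it was just committed.
        self.candidate = self.current if self.current != self.committed else None

    def target_after_crash(self) -> int:
        """Recovery points the name back at the latest committed inode."""
        return self.committed
```

With three rapid replacements (inodes 2, 3, 4 over a committed inode 1), only inode 2 and later inode 4 are ever written out in this model; inode 3 is skipped, matching the "at most one candidate" idea above.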

Quotes of the fortnight

Posted Jan 7, 2011 6:37 UTC (Fri) by dlang (✭ supporter ✭, #313) [Link]

you can get this today, but the problem is that people are not willing to sacrifice the performance that this entails.

systems used to be as safe as you want, but in the interests of speed under the normal (non-failure) condition people started trading off the reliability under failure conditions. the operating systems that didn't make these trade-offs were abandoned by users, who care FAR more about the things that they do hundreds of times a day than they do about the reduced likelihood of corruption when they do have a failure (which can be months to years apart)

Quotes of the fortnight

Posted Jan 8, 2011 17:20 UTC (Sat) by oak (subscriber, #2786) [Link]

> were abandoned by users, who care FAR more about the things that they do hundreds of times a day

But when those "faster operating systems" do trash their important data, I'm pretty sure such users will noisily flame those operating systems to crispy crap, think of their designers as idiots and avoid both quite a while in the future.

It depends how likely all this is, compared to the speed advantages.

Quotes of the fortnight

Posted Jan 8, 2011 19:49 UTC (Sat) by dlang (✭ supporter ✭, #313) [Link]

they may flame, but they continue to use them.

_EVERY_ current desktop/server OS has made this decision.

Even on Windows where crashes are far more common, users accept this because even there crashes are rare enough compared to the times that the performance benefits are seen that they accept it.

In fact, this has extended beyond the OS programmers and is in the firmware of the hard drives that you buy. Every hard drive you buy will (by default) report that the data has been saved when it's only been written to the cache on the drive, if you then lose power that data is not on disk yet. The disk also re-orders the writes, so you can have the exact same problems (write a new file, rename it, and you end up with neither the old nor the new file) simply due to the drive caching and re-ordering things. Some drives have been found to even lie to you when you tell them to turn off this mode (cached writes), or so the scuttlebutt goes in the enterprise database circles.

This is even the case with the new (very expensive) SSD drives. There are a handful of manufacturers who are targeting enterprise systems where they are willing to sacrifice performance for reliability and pay a lot more in the process where they are starting to implement power storage on the drives themselves so that you can get the data written to disk even in a crash/powerfail situation.

File system as database

Posted Jan 11, 2011 14:24 UTC (Tue) by job (guest, #670) [Link]

Almost every mail server uses the file system as a database. Many of the popular version control systems, such as git, do as well. File system developers may not like it but it's a good idea and would be even better if they optimized for it (many small files).

It works well and has quite well known semantics when you need a simple key/value blob store (that is, until recently). The repair tools are mature and cover most corner cases as well as one can hope. Backup is trivial. Partitioning and clustering are straightforward. Performance is excellent in trivial but common use cases.
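To make the "simple key/value blob store" idea concrete, here is a minimal sketch of one file per key under a directory. The class name `DirStore`, its methods, and the `.tmp-` prefix are hypothetical names chosen for illustration; this is not how any particular mail server or version control system lays out its data.

```python
import os
import tempfile


class DirStore:
    """Hypothetical key/value blob store: one file per key under a root directory."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Restrict keys to simple names so they cannot escape the root
        # or collide with the hidden temporary files used by put().
        if not key or "/" in key or key.startswith("."):
            raise ValueError("invalid key: %r" % (key,))
        return os.path.join(self.root, key)

    def put(self, key, blob):
        final = self._path(key)
        # Write under a temporary name, flush it to disk, then rename it
        # over the final name so readers see the old blob or the new one.
        fd, tmp = tempfile.mkstemp(dir=self.root, prefix=".tmp-")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(blob)
                f.flush()
                os.fsync(f.fileno())
            os.rename(tmp, final)
        except BaseException:
            # Clean up the temporary file if anything failed before the rename.
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise

    def get(self, key):
        with open(self._path(key), "rb") as f:
            return f.read()

    def keys(self):
        # Hidden names are temporary files, not stored keys.
        return sorted(n for n in os.listdir(self.root) if not n.startswith("."))
```

Backup of such a store is a plain directory copy, which is part of the appeal described above.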

I wouldn't want git to require a "proper" database just because the file system developers thought it was theoretically the way to go. No matter what they say, fsck can do a much better job than fsck and a database repair tool unaware of each other. If I absolutely have to use a proper database I'd like it to offer simple open/read/write semantics and work on a raw disk device (and I'd like to call it a "file system").

File system as database

Posted Jan 17, 2011 8:25 UTC (Mon) by dpotapov (guest, #46495) [Link]

> I wouldn't want git to require a "proper" database just because the
> file system developers thought it was theoretically the way to go.

git uses fsync(), and it works fine on ext3 and other filesystems, and git is the fastest VCS out there. So the fsync() interface _does_ work if it is used properly, and the only people who have a problem with it are those who do not really know what they want.
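For readers who have not seen the pattern spelled out, a minimal sketch of replacing a file with fsync() used "properly" might look like the following. It assumes a POSIX system where a directory can be opened and fsync'ed; the function name `replace_durably` and the `.new` suffix are illustrative only, and this is not a description of git's internals.

```python
import os


def replace_durably(path, data):
    """Replace `path` with `data`, returning only after the new contents
    and the rename have been handed to fsync()."""
    tmp = path + ".new"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the new file's data to stable storage
        # before the old name is made to point at it.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically switch the name over to the new file.
    os.rename(tmp, path)

    # fsync the containing directory so the rename itself is durable.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Whether the drive honours the flush is a separate question, as discussed earlier in the thread; the sketch only covers what the application can ask for.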

If you know what you want, there is no problem organizing efficient and reliable data storage for your particular use case. But if you want reliable storage that will work in _general_ (not being limited to a few use cases) then it is called a database. And databases are inherently slower than filesystems just because they have to deal with the general case.

Quotes of the fortnight

Posted Jan 17, 2011 10:44 UTC (Mon) by andrejp (guest, #47396) [Link]

Wherever the EXT developers got the crazy idea that a) the filesystem is not a database (which is to say, organized storage of data), and b) performance in a filesystem is more important than data is beyond me. It's just ludicrous.

I won't even comment on the first one beyond the observation that if a filesystem developer is claiming that the filesystem is not a database, the said developer should probably be doing something other than coding filesystems. Just get a dictionary and get the meaning of the word 'database' straightened out.

The second one is a matter of priorities in the design. Metadata (that is, data about data) has *no* relevance in a filesystem unless the *data* that it's about is actually there. If the data is not there, the only thing that metadata can say about the data is "data is not there", in which case said metadata might as well not exist, since its absence provides the exact same information. The idea that in a filesystem metadata is more important than data, or that the filesystem should actually do *anything at all* about metadata if data hasn't been written yet, is just crazy. *All* performance considerations in a filesystem design are very much secondary to the fact that the purpose of a filesystem is to store *data*.

Quotes of the fortnight

Posted Jan 17, 2011 20:24 UTC (Mon) by jthill (guest, #56558) [Link]

Where manual recovery after system failures takes less time than I'd have lost waiting for any approximation of database-quality ACID, I'd like the faster option available please. Particularly if it's lots faster.

Often enough the right way to recover is to just redo whatever was in flight. I don't want to pay for database overhead on a build. If somebody trips over a power cord the right thing to do is plug it back in and type 'git clean -dfx; make'.

The filesystem has no way of knowing a priori which is which. A filesystem that can't make the go-fast situations go fast is broken.

Which I see is pretty much what Ted Ts'o said in the first place.

What the poster he was responding to wants is ... I dunno what to call it. The specific sequence is write, close, rename. He wants to be sure an fsync is effectively done before the link is rewritten, but he doesn't want to wait for any of it to complete or even to put any hurry on getting the data written.

rename-with-ext3-ordered-data-semantics-for-a-file-that-isn't-important-enough-to-fsync.

According to Documentation/filesystems/ext4.txt, ext4 spells that 'auto_da_alloc' and git blame says that's been on by default since 2.6.30:

auto_da_alloc(*)	Many broken applications don't use fsync() when 
noauto_da_alloc		replacing existing files via patterns such as
			fd = open("foo.new")/write(fd,..)/close(fd)/
			rename("foo.new", "foo"), or worse yet,
			fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).
			If auto_da_alloc is enabled, ext4 will detect
			the replace-via-rename and replace-via-truncate
			patterns and force that any delayed allocation
			blocks are allocated such that at the next
			journal commit, in the default data=ordered
			mode, the data blocks of the new file are forced
			to disk before the rename() operation is
			committed.  This provides roughly the same level
			of guarantees as ext3, and avoids the
			"zero-length" problem that can happen when a
			system crashes before the delayed allocation
			blocks are forced to disk.
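The two patterns named in the excerpt can be compared with a small sketch. It does not simulate a crash or delayed allocation; it only shows what a concurrent reader sees partway through each update, which is one reason the truncate pattern is called "worse yet". The helper names and the scratch-file layout are assumptions for illustration, and neither pattern calls fsync(), mirroring the excerpt.

```python
import os
import tempfile


def size_seen_mid_update(update):
    """Run one update pattern on a scratch file and return the size of the
    target as seen after the update has opened its file but before it has
    written any new data."""
    d = tempfile.mkdtemp()
    target = os.path.join(d, "foo")
    with open(target, "wb") as f:
        f.write(b"old contents")          # 12 bytes of "old" data
    seen = []
    update(target, b"new contents", lambda: seen.append(os.path.getsize(target)))
    return seen[0]


def replace_via_truncate(target, data, probe):
    # fd = open("foo", O_TRUNC)/write(fd,..)/close(fd)
    fd = os.open(target, os.O_WRONLY | os.O_TRUNC)
    probe()                               # the old data is already gone here
    os.write(fd, data)
    os.close(fd)


def replace_via_rename(target, data, probe):
    # fd = open("foo.new")/write(fd,..)/close(fd)/rename("foo.new", "foo")
    tmp = target + ".new"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    probe()                               # the target still holds the old data
    os.write(fd, data)
    os.close(fd)
    os.rename(tmp, target)


print(size_seen_mid_update(replace_via_truncate))  # 0
print(size_seen_mid_update(replace_via_rename))    # 12
```

What survives an actual crash still depends on the filesystem and its mount options; an application that needs the new data to be durable would add an fsync() before the rename.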

Quotes of the fortnight

Posted Jan 18, 2011 0:04 UTC (Tue) by jrn (subscriber, #64214) [Link]

> rename-with-ext3-ordered-data-semantics-for-a-file-that-isn't-important-enough-to-fsync.
>
> According to Documentation/filesystems/ext4.txt, ext4 spells that 'auto_da_alloc' and git blame says that's been on by default since 2.6.30:

That is not enough to avoid 0-length files on power failure (it makes the race shorter but does not eliminate it). See https://bugzilla.kernel.org/show_bug.cgi?id=15910

-- a happy sync_file_range user

Copyright © 2011, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds