
Better than POSIX?

By Jonathan Corbet
March 17, 2009
One might well think that, at this point, there has been sufficient discussion of the design decisions built into the ext4 "delayed allocation" feature and the user-space implications of those decisions. And perhaps that is true, but there should be room for a summary of the relevant issues. The key question has little to do with the details of filesystem design, and a lot to do with the kind of API that the Linux kernel should present to its user-space processes.

As has been well covered (and discussed) elsewhere, the delayed allocation feature found in the ext4 filesystem - and most other contemporary filesystems as well - has some real benefits for system performance. In many cases, delayed allocation can avoid the allocation of space on the physical medium (along with the associated I/O) entirely. For longer-lived data, delayed allocation allows the filesystem to optimize the placement of the data blocks, making subsequent accesses faster. But delayed allocation can, should the system crash, lead to the loss of the data for which space has not yet been allocated. Any filesystem may lose data if the system is unexpectedly yanked out from underneath it, but the changes in ext4 can lead to data loss in situations that, with ext3, appeared to be robust. This change looks much like a regression to many users.

Many electrons have been expended to justify the new, more uncertain ext4 situation. The POSIX specification says that no persistence is guaranteed for data which has not been explicitly sent to the media with the fsync() call. Applications which lose data on ext4 are not using the filesystem correctly and need to be fixed. The real problem is users running proprietary kernel modules which cause their systems to crash in the first place. And so on. All of these statements are true, at least to an extent.

But one might also argue that they are irrelevant.

Your editor recently became aware that Simson Garfinkel's Unix Hater's Handbook [PDF] is available online. To say that this book is an aggravating read is an understatement; much of it seems like childish poking at Unix by somebody who wishes that VMS (or some other proprietary system) had taken over the world. It's full of text like:

The traditional Unix file system is a grotesque hack that, over the years, has been enshrined as a "standard" by virtue of its widespread use. Indeed, after years of indoctrination and brainwashing, people now accept Unix's flaws as desired features. It's like a cancer victim's immune system enshrining the carcinoma cell as ideal because the body is so good at making them.

But behind the silly rhetoric are some real points that anybody concerned with the value of Unix-like systems should hear. Among them is the "worse is better" notion expressed by Richard Gabriel in 1991 - the year the Linux kernel was born. This charge holds that Unix developers will choose implementation simplicity over correctness at the lower levels, even if it leads to application complexity (and lack of robustness) at the higher levels. The ability of a write() system call to succeed partially is given as an example; it forces every write() call to be coded within a loop which retries the operation until the kernel gets around to finishing the job. Developers who skip that loop are left with an application which works most of the time, but which can fail silently in unexpected circumstances. It is far better, these people say, to solve the problem once at the kernel level so that applications can be simpler and more robust.
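The complaint is easy to make concrete. A minimal sketch of the retry loop in question might look like the following (the helper name is ours; the pattern will be familiar to anybody who has programmed against the raw system-call interface):

    #include <errno.h>
    #include <unistd.h>

    /* Write the whole buffer, retrying on short writes and EINTR.  Every
     * caller must carry some version of this loop, because write() is
     * allowed to succeed only partially. */
    static ssize_t full_write(int fd, const char *buf, size_t count)
    {
        size_t done = 0;

        while (done < count) {
            ssize_t n = write(fd, buf + done, count - done);

            if (n < 0) {
                if (errno == EINTR)
                    continue;       /* interrupted before anything was written */
                return -1;          /* a real error */
            }
            done += (size_t)n;      /* partial success: go back for the rest */
        }
        return (ssize_t)done;
    }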

The ext4 situation can be seen as similar: any application developer who wants to be sure that data has made it to persistent storage must take extra care to inform the kernel that, yes, that data really does matter. Developers who skip that step will have applications which work - almost all the time. One could well argue that, again, the kernel should take the responsibility of ensuring correctness, freeing application developers from the need to worry about it.
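In concrete terms, that extra care is the familiar write-a-temporary-file dance. A minimal sketch (the helper and path names are illustrative, error handling is abbreviated, and the write should really use a loop like the one above) might look like:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Replace "path" with "data" as carefully as the POSIX interface allows. */
    static int save_file(const char *dir, const char *path,
                         const char *tmppath, const char *data)
    {
        int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, strlen(data)) < 0 ||  /* new contents            */
            fsync(fd) < 0 ||                      /* ...forced to the media  */
            close(fd) < 0)
            return -1;
        if (rename(tmppath, path) < 0)            /* atomic replacement of the name */
            return -1;

        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0)
            return -1;
        int ret = fsync(dfd);                     /* make the rename itself durable */
        close(dfd);
        return ret;
    }

The final fsync() on the containing directory is the step that makes the new name itself durable; it is exactly the sort of detail that is easy to forget, which is the point.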

The ext3 filesystem makes no such guarantees but, due to the way its features interact, it provides something close to a persistence guarantee in most situations. An ext3 filesystem running under a default configuration will normally lose no more than five seconds' worth of work in a crash and, importantly, it is not prone to the creation of zero-length files in common scenarios. The ext4 filesystem withdrew that implicit guarantee; unpleasant surprises for users followed.

Now the ext4 developers are faced with a choice. They could stand by their changes, claiming that the loss of robustness is justified by increased performance and POSIX compliance. They could say that buggy applications need to be fixed, even if it turns out that very large numbers of applications need fixing. Or, instead, they could conclude that Linux should provide a higher level of reliability, regardless of how diligent any specific application developers might have been and regardless of what the standards say.

It should be said that the first choice is not entirely unreasonable. POSIX forms a sort of contract between user space and the kernel. When the kernel fails to provide POSIX-specified behavior, application developers are the first to complain. So perhaps they should not object when the kernel insists that they, too, live up to their end of the bargain. One could argue that applications which have been written according to the rules should not take a performance hit to make life easier for the rest. Besides, this is free software; it would not take that long to fix up the worst offenders.

But fixing this kind of problem is a classic case of whack-a-mole: application developers will continually reintroduce similar bugs. The kernel developers have been very clear that they do not feel bound by POSIX when the standard is seen to make no sense. So POSIX certainly does not compel them to provide a lower level of filesystem data robustness than application developers would like to have. There is a case to be made that this is a situation where the Linux kernel, in the interest of greater robustness throughout the system, should go beyond POSIX.

The good news, of course, is that this has already happened. There is a set of patches queued for 2.6.30 which will provide ext3-like behavior in many of the situations that have created trouble for early ext4 users. Beyond that, the ext4 developers are considering an "allocate-on-commit" mount option which would force the completion of delayed allocations when the associated metadata is committed to disk, thus restoring ext3 semantics almost completely. Chances are good that distributors would enable such an option by default. There would be a performance penalty, but ext4 should still perform better than ext3, and one should not underestimate the performance costs associated with lost data.

In summary: the ext4 developers - like Linux developers in general - do care about their users. They may complain a bit about sloppy application developers, standards compliance, and proprietary kernel modules, but they'll do the right thing in the end.

One should also remember that ext4 is still a very young filesystem; it's not surprising that a few rough edges remain in places. It is unlikely that we have seen the last of them.

As a related issue, it has been suggested that the real problem is with the POSIX API, which does not make the expression of atomicity and durability requirements easy or natural. It is time, some say, to create an extended (or totally new) API which handles these issues better. That may well be true, but this is easier said than done. There are, of course, the difficulties in designing a new API to last for the next few decades; one assumes that we are up to that challenge. But will anybody use it? Consider Linus Torvalds's response to another suggestion for an API extension:

Over the years, we've done lots of nice "extended functionality" stuff. Nobody ever uses them. The only thing that gets used is the standard stuff that everybody else does too.

Application developers will be naturally apprehensive about using Linux-only interfaces. It is not clear that designing a new API which will gain acceptance beyond Linux is feasible at this time.

Your editor also points out, hesitantly, that Hans Reiser had designed - and implemented - a number of features intended to allow applications to use small files in a robust manner for the reiser4 filesystem. Interest in accepting those features was quite low even before Hans left the scene. There were a lot of reasons for this, including nervousness about a single-filesystem implementation and nervousness about dealing with Hans, but the addition of non-POSIX extensions was problematic in its own right (see this article for coverage of this discussion in 2004).

The real answer is probably not new APIs. It is probably a matter of building our filesystems to provide "good enough" robustness as a default, with much stronger guarantees available to developers who are willing to do the extra coding work. Such changes may come hard to filesystem hackers who have worked to create the fastest filesystem possible. But they will happen anyway; Linux is, in the end, written by and for its users.



Better than POSIX?

Posted Mar 17, 2009 15:42 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (5 responses)

Thank you for a balanced discussion of one of the most inflammatory technical (as opposed to legal or social) issues in recent memory. The "worse is better" approach is not unique to unixland, and in the days of small computers with limited resources, counting on applications to not do silly things made sense.

But I firmly believe that today the kernel can provide a great deal of additional robustness at practically no performance cost. An "ordered rename" is a no-brainer. Not only do existing applications suddenly do the right thing, but also an "ordered rename" allows application developers to inform the kernel of constraints that are simply impossible to express when applications are required to fsync before every useful rename.

Some people say that a non-ordered rename gives the kernel more freedom to optimize. That view is a red herring: with a non-ordered rename, applications must fsync before the rename to have sensible semantics. So really, the choice isn't between an ordered and a non-ordered rename, but between fsync-unordered_rename and ordered_rename; the latter actually gives the kernel greater latitude in optimizing IO.

An ordered rename would be either neutral or beneficial in every real-world situation. Here's my challenge to anyone reading this: come up with a non-contrived scenario in which an ordered rename (i.e., an implicit write barrier) would be harmful.

Better than POSIX?

Posted Mar 17, 2009 17:28 UTC (Tue) by ms (subscriber, #41272) [Link] (4 responses)

Is no one doing transactional filesystems? Surely that's ultimately where we're going to end up. Behavior is well defined and understood for databases, so why don't we just adopt them? Apologies if I'm just repeating other people's thoughts from elsewhere - I'm not trying to extend the flames to this very nice and balanced piece.

Transactional filesystems

Posted Mar 17, 2009 17:56 UTC (Tue) by butlerm (subscriber, #13312) [Link] (3 responses)

Doing this efficiently requires the adoption of more filesystem internal
transaction processing techniques - meta data undo in particular.

However, "transactional filesystem" usually refers to a much more
complicated setup that allows an arbitrary number of data and meta data
operations to commit or rollback as a group. That is an order of magnitude
more complicated than what would be required to efficiently preserve the
atomicity of file rename operations, for example.

That might well be standard a couple of decades down the road - Google
Transactional NTFS for an example - but for now the ability to provide
atomicity without durability would be a major improvement that has a
fraction of the complexity.

Transactional filesystems

Posted Mar 18, 2009 15:19 UTC (Wed) by alvherre (subscriber, #18730) [Link] (2 responses)

> Doing this efficiently requires the adoption of more filesystem internal
> transaction processing techniques - meta data undo in particular.

Not so. PostgreSQL, for example, implements transactional semantics without needing metadata undo.

Transactional filesystems

Posted Mar 20, 2009 20:54 UTC (Fri) by butlerm (subscriber, #13312) [Link] (1 responses)

PostgreSQL is not a filesystem. If it was pretending to be one, it would
accomplish what it does by "meta-data" undo, where the meta data of the
higher level filesystem was native to PostgreSQL (i.e. stored in table
rows), as opposed to the completely separate and irrelevant meta data of
the lower level filesystem PostgreSQL was running on top of.

So of course you can implement meta data undo on any filesystem you please,
just as long as the meta data you are referring to is not the meta data of
the filesystem itself.

Transactional filesystems

Posted Mar 21, 2009 18:00 UTC (Sat) by nix (subscriber, #2304) [Link]

I don't see why you can't use MVCC for filesystems, since MVCC could be
regarded as a means of mixing the undo log into the data store itself,
eliminating it as a separate entity.

(As for vacuuming, do it incrementally and the data volume pretty much
doesn't matter; you just work through it bit by bit. PostgreSQL scales to
silly data volumes already...)

Better than POSIX?

Posted Mar 17, 2009 15:56 UTC (Tue) by ssam (guest, #46587) [Link] (9 responses)

what seems unanswered is if you can overwrite an existing file, in such a way that you either have the old or the new version, without forcing the physical write to disk.

i think a lot of people are willing to risk recent changes to a file to get performance gains and power saving. but no one wants to risk completely losing a file just because you wrote to it recently.

in the sequence
1) open foo.new
2) write to foo.new
3) close foo.new
4) move foo.new to foo
all that needs to happen is that step 4 does not hit the disk before step 2 has.

it seems that step 2 gets delayed, because that gives performance/power-usage gains, but step 4 happens quickly.

Better than POSIX?

Posted Mar 17, 2009 19:51 UTC (Tue) by dlang (guest, #313) [Link] (7 responses)

no, currently there is no API in existence for any filesystem (including ext3) that will _guarantee_ that you have either the old or the new file on disk after a crash with no possibility of garbage (even excluding things that damage the drives themselves)

the probability of having problems varies from filesystem to filesystem, from mount option to mount option within a single filesystem, and is even dependent on kernel tuning parameters that can be set in /etc/sysctl.conf

if you want that guarantee you need to force a write to disk (and make sure that your disk drive doesn't cache or reorder the writes)

it would be nice if there was a write barrier API that lets you insert an ordering requirement from userspace without forcing a write to disk, but at this point in time I don't believe that such an API exists.
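to illustrate what I mean (this is purely hypothetical - no such call exists, so the only stand-in a program has today is a real flush, which buys ordering *and* durability, more than was asked for):

    #include <unistd.h>

    /* HYPOTHETICAL: Linux has no userspace write-barrier call.  The closest
     * a program can get to "order my earlier writes before my later ones"
     * is to pay for an actual flush, which is exactly the cost a true
     * barrier would avoid. */
    static int write_barrier(int fd)
    {
        return fdatasync(fd);
    }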

people were relatively happy with the ~5 second vulnerability window in ext3, but are getting bit by the ~60 second vulnerability window that ext4 defaulted to.

Ok, that's a reasonable problem, and adjusting ext4 to use a smaller window (possibly even by default) is a reasonable thing. The idea that Ted is working on - to have ext4 honor the 'how much time am I willing to lose' parameter that was introduced for laptop mode - is a good one, in that it's clear what you are adjusting, and it lets you tie all similar types of thing to the one parameter (as opposed to configuring different aspects in different places)

but shortening this time will cost performance. I doubt that the ext4 developers selected the perfect time (I doubt that there is a single perfect time to select), and having it as an easy tunable will let people plot out what the performance/reliability curve looks like. it may be that going from 5 seconds to 10 seconds gains 80% of the performance that going from 5 seconds to 60 seconds gives you (for common desktop workloads ;-) and distros may choose to move the default there.

the problem in this situation is that people are mistakenly believing that this vulnerability didn't exist before. It did; it just wasn't as large a window, so fewer people were hitting it.
Also, they didn't have any other experiences to compare it to. If ext3 had also lost data in similar situations, people would be used to 'if the system crashes and your app didn't do fsync you will lose data modified in the last couple of minutes' and consider that reasonable.

Better than POSIX?

Posted Mar 17, 2009 21:16 UTC (Tue) by foom (subscriber, #14868) [Link] (1 responses)

The problem isn't that the time-window is bigger, it's that even if you crashed within the 5sec
window in ext3, you are almost guaranteed to end up with the old version of the file.

If you crashed within a 60sec window in (pre-fix) ext4, you were almost guaranteed to end up with
a 0-byte file. That's a rather stark difference.

Of course there's no absolute guarantees...if your kernel is crashing particularly hard, the filesystem
driver could go berserk and write 99 red balloons all over the disk. Nothing in the design of the FS
could prevent that...but that's *extraordinarily* rare.

Better than POSIX?

Posted Mar 19, 2009 10:44 UTC (Thu) by Los__D (guest, #15263) [Link]

berserk and write 99 red balloons all over the disk

What a fantastic idea for an easter egg... :)

"What, bug? No no no, it was just a joke. Your data is... Well, buried in balloons."

Better than POSIX?

Posted Mar 17, 2009 21:44 UTC (Tue) by man_ls (guest, #15091) [Link] (3 responses)

people were relatively happy with the ~5 second vulnerability window in ext3, but are getting bit by the ~60 second vulnerability window that ext4 defaulted to.
Well, if the 5-second window had left my systems vulnerable to data corruption (in the form of zero-length files) I know I would not have been happy. After all that 5-second window happens every 5 seconds, so you are virtually guaranteed to run into the problem if it exists, whatever the window size.

I remember my horror after finding out that XFS had lost my data, and I read about XFS devs hiding behind the standard "journalling filesystems only make guarantees about the metadata". "The metadata? fsck the metadata! What I want is my wretched data back!" Now it is all coming back in waves.

Better than POSIX?

Posted Mar 18, 2009 2:40 UTC (Wed) by dlang (guest, #313) [Link] (2 responses)

that 5 second window did leave you vulnerable to data corruption (in many ways). you just got lucky

Better than POSIX?

Posted Mar 18, 2009 4:12 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

It wasn't really as bad as it sounds. ext3 also doesn't have delayed allocation. Someone on the Ubuntu bug list posted some testcases and really couldn't make ext3 fail at all. (ext4 fell flat.)

Better than POSIX?

Posted Mar 18, 2009 7:28 UTC (Wed) by man_ls (guest, #15091) [Link]

I know my data is vulnerable in many ways. I would just like the most common cases to be addressed. It is not luck. According to the discussions, everyone who has had a crash with (first) XFS or (now) ext4 is getting inconsistent states, while no heavy-duty ext3 users have reportedly seen this kind of corruption. Maybe others, yes, but no zero-length files -- which is the issue under discussion.

Better than POSIX?

Posted Mar 19, 2009 18:33 UTC (Thu) by anton (subscriber, #25547) [Link]

no, currently there is no API in existence for any filesystem (including ext3) that will _guarantee_ that you have either the old or the new file on disk after a crash with no possibility of garbage (even excluding things that damage the drives themselves)
I am pretty sure that there are file systems around that give such guarantees. This was certainly one of the goals of LinLogFS and LLFS; ok, these filesystems have not come out of the development stage (yet), but I would be surprised if there were no others (ZFS?).

Concerning an API, I don't see a need for an additional API, it's just a matter of whether the file system gives consistency guarantees in case of a crash when using the existing file API, and if so, which ones.

Do you really NEED such an API?

Posted Mar 17, 2009 21:30 UTC (Tue) by khim (subscriber, #9252) [Link]

Ted gave an interesting answer to this request which I really like: why would your program need such an API? Your system needs such an API - and it's not that hard to implement...

Yes, it'll probably violate POSIX's letter but it'll give much better results in practice.

Better than POSIX?

Posted Mar 17, 2009 16:04 UTC (Tue) by xav (guest, #18536) [Link] (15 responses)

Maybe the answer is a new set of guarantees for Linux's POSIX API, e.g. an overwriting rename() will leave either the old or the new version on disk, atomically.

Better than POSIX?

Posted Mar 17, 2009 17:01 UTC (Tue) by drag (guest, #31333) [Link] (4 responses)

A simple way to say it would be:

In the actual physical file system image committed to disk, don't name partially written files.

------

That would pretty much get what everybody wants. I suppose it's much more complicated than that, of course. I'll take "good enough" any day.

Better than POSIX?

Posted Mar 17, 2009 17:52 UTC (Tue) by vonbrand (subscriber, #4458) [Link] (3 responses)

"Don't name partially written files" would mean that nothing has a name until the file is closed, and the file has to disappear whenever it is opened for writing... I'd take the current mess situation any time in the face of that.

Better than POSIX?

Posted Mar 17, 2009 23:50 UTC (Tue) by drag (guest, #31333) [Link] (2 responses)

Who said it has to disappear when being re-opened? It got finished writing once and thus had a name. The fact that it was edited again doesn't change that. :)

It certainly will solve the write() then rename() issue. :)

And as I recall, I remember hearing about file system designers deliberately zeroing out files for various reasons.

Postponing flush until close()?

Posted Mar 18, 2009 23:17 UTC (Wed) by xoddam (subscriber, #2322) [Link] (1 responses)

Am I to understand that you're proposing that any changes to a file (metadata and data blocks alike) not be flushed to disk until close()?

That doesn't really sound like a good way to enhance recoverability. For applications that keep large files (eg. caches) open for a long time and update them piecemeal, it sounds like sheer madness.

Applications that truncate existing data before rewriting it are asking for trouble, though I appreciate a filesystem that doesn't exacerbate the race condition by promptly truncating the inode but delaying the flush of the new data blocks for several seconds. Ted has already fixed that particular issue heuristically by delaying truncation until it is time to flush the data. Flushing *early* on close() couldn't hurt integrity but could hurt performance quite a bit.
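To make the point concrete, the risky pattern looks roughly like this (names are illustrative only): the old contents are gone as soon as the O_TRUNC reaches the disk, while the new contents may sit in the page cache for many seconds afterwards.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* The truncate-then-rewrite anti-pattern: a crash in the window between
     * the truncate hitting the disk and the new data being flushed leaves
     * an empty (or partial) file behind. */
    static int rewrite_in_place(const char *path, const char *data)
    {
        int fd = open(path, O_WRONLY | O_TRUNC);    /* old data discarded here  */
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, data, strlen(data));  /* new data not yet on disk */
        close(fd);
        return n < 0 ? -1 : 0;
    }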

Postponing flush until close()?

Posted Mar 19, 2009 23:34 UTC (Thu) by xoddam (subscriber, #2322) [Link]

Sorry, I just read the patch. If a file has been opened with O_TRUNC then it will indeed be flushed (early) when closed. The race condition still exists of course, but flushing on close will keep the risky interval relatively short in the vast majority of cases.

Maybe not

Posted Mar 17, 2009 21:40 UTC (Tue) by man_ls (guest, #15091) [Link] (2 responses)

Maybe the answer is a new set of guarantees for Linux's POSIX API, e.g. an overwriting rename() will leave either the old or the new version on disk, atomically.
Why? As has been pointed out, ext2 is perfectly fine for many applications, and it would never be Linux-POSIX-compliant this way. For example, in data centers with 3-way redundant power supplies and redundant storage, or on temporary filesystems.

Do you really think it is better to force everyone to comply with a new standard than trying to convince ext4 developers to do the (obvious) right thing?

ext2, Bruté?

Posted Mar 18, 2009 22:57 UTC (Wed) by xoddam (subscriber, #2322) [Link] (1 responses)

I can think of no reason at all why a guarantee of this sort should not be considered desirable for any filesystems that try to ease crash recovery. It may be out of ext2's reach (because its code does not impose a strict partial ordering on disk writes), but it should be achievable as an enhancement to any journaling, log-structured or soft-update filesystem.

ext2, Bruté?

Posted Mar 18, 2009 23:25 UTC (Wed) by man_ls (guest, #15091) [Link]

It is by all means desirable. The proper place for such a standard might be debated though. I have always understood that POSIX is a standard for compatibility, e.g. Wikipedia says:
POSIX or "Portable Operating System Interface for Unix"[1] is the collective name of a family of related standards specified by the IEEE to define the application programming interface (API), along with shell and utilities interfaces for software compatible with variants of the Unix operating system, although the standard can apply to any operating system.
So I don't know if a standard for reliable file systems would fit in.

Better than POSIX?

Posted Mar 17, 2009 22:30 UTC (Tue) by dhess (guest, #7827) [Link] (6 responses)

Maybe the answer is a new set of guarantees for Linux's POSIX API, e.g. an overwriting rename() will leave either the old or the new version on disk, atomically.

Yeah, I've come to a similar conclusion. Perhaps the rename() semantics alone is sufficient. It's simple enough conceptually that it might be relatively easy to get other operating systems to adopt the new semantics, too, at least for the filesystems that can support it. And it sounds like there's already a quite common belief amongst application developers that all filesystems behave this way, anyway.

In a previous life, I worked on memory ordering models in CPUs and chipsets. During this recent ext4 hubbub, it dawned on me that the issues with ordering and atomicity in high-performance filesystem design may be isomorphic to memory ordering. Even if that's not strictly true, there's probably a lot to be learned by filesystem designers and API writers from modern CPU memory ordering models, in any case, because memory ordering is a well-explored space by this point in the history of computer engineering; and I don't just mean the technical semantics, either, but the whole social aspect, too, i.e., how to balance good performance with software complexity, how much of that complexity to expose to application programmers, who often have neither the time nor the background to understand all of the tradeoffs, let alone dot all the "i"s and cross all the "t"s, etc. Anyway, changing rename's semantics as you suggest would be the equivalent of a "release store" in memory ordering terms, and seems to be exactly the right kind of tradeoff in this situation.
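For readers who have not met the term, a minimal C11 sketch of a release store (names are illustrative): the release/acquire pair guarantees that the data write is visible before the flag, much as the proposed rename() semantics would guarantee that the data blocks are on disk before the new name appears.

    #include <stdatomic.h>

    static int payload;              /* the "file contents" */
    static atomic_int ready;         /* the "rename": publication of the data */

    void publisher(void)
    {
        payload = 42;                                            /* write the data  */
        atomic_store_explicit(&ready, 1, memory_order_release);  /* then publish it */
    }

    int consumer(void)
    {
        if (atomic_load_explicit(&ready, memory_order_acquire))  /* saw the flag...  */
            return payload;                                      /* ...sees the data */
        return -1;
    }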

Better than POSIX?

Posted Mar 17, 2009 23:04 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (5 responses)

Thanks for that comment --- it's amazing how much knowledge we're rediscovering in computing. It's almost as if we're coming out of some kind of dark age.

One thing that struck me was a comment on a Slashdot story about a "breakthrough" in data center energy optimization. The comment showed that the problem of deciding when to boot up additional servers to meet demand was isomorphic to the problem of steam boiler management --- right down to the start-up and constant energy costs --- and that the problem had already been thoroughly addressed in literature from the turn of the last century.

Better than POSIX?

Posted Mar 17, 2009 23:52 UTC (Tue) by dhess (guest, #7827) [Link]

Hmm, that is interesting! I'll file the steam boiler analogy away for later use.

Alan Kay never misses an opportunity to point out that our field has a terrible track record when it comes to learning from, or even being aware of, our history, let alone that of other related fields. There's a lot of unfortunate "rediscovering" of knowledge in computer science and engineering. (I'm as guilty of it as anyone.) I think it's a good habit to consider Alan's admonishment when we're faced with challenges or seeking solutions to problems, so I guess I'll follow his lead by mentioning it here :)

Better than POSIX?

Posted Mar 17, 2009 23:57 UTC (Tue) by rahvin (guest, #16953) [Link] (3 responses)

That's because FOSS has opened up the gates to allow technology experience and knowledge to flow around society instead of being trapped behind corporate copyright, trade secrets and patents. What was once a segregated system, where you could only learn from those who worked with you directly, has turned into a system where the experts from every company provide wisdom and training to the new kids at every company. The corporate policies that once locked the experts who built the foundation we are trying to build upon behind the corporate veil have been blasted apart, and that expertise is now being shared to everyone's benefit.

Imagine for how many years the wheel was reinvented over and over again at hundreds of companies as people relearned how to code something the proper way for a certain scenario. It's scary to think how fast we could have developed software if it had been FOSS all along instead of corporations each trying to slit each other's throats. In the case of computer information systems, the sharing of code accelerates the total technology much faster than the private corporate system of the past ever has. Of course this isn't always true. Niche software will probably always need the economic support closed systems provide, even if it divides the efforts among a few companies who reinvent each other's innovations.

Better than POSIX?

Posted Mar 26, 2009 10:56 UTC (Thu) by massimiliano (subscriber, #3048) [Link] (2 responses)

That is one explanation.

Another one is that we don't have an "engineering culture" in software development.

I mean, software developers are not necessarily engineers, so they rarely know about issues like steam boiler management.
But most importantly, a software developer is seldom trained to think at an engineering level. I remember that when I studied for my degree, I was taught about power plants, engines, turbines, cooling plants, pipes, heat dissipators... none of which has anything to do directly with software development. But after a few years of studying those systems it becomes obvious that there are lots of analogies between them, and very often the mathematical models that describe them are the same.
The teachers themselves pointed this out every time they could, and they did it on purpose, to teach us to recognize the patterns.

Now, I'm not claiming engineers are necessarily better than others in this sense. I know many guys who quit college and they are better than me in understanding aspects of different technologies.

What I'm claiming is that very often people reinvent the wheel not because the previous wheel was a secret, but because they do not have this "engineering culture" of knowing different kinds of wheels in advance, and being able to understand correctly in which ways they are similar and when they are relevant.

And without going to different disciplines, how many software developers have a good "culture" about the basic concepts needed in their job, like recurrent algorithms and patterns?
I mean, how many actually tried to read Donald Knuth's books (or similar ones), or at least consult them when appropriate? There are lots of answers already published, but we continue reinventing them anyway...

My 2c,
Massimiliano

Better than POSIX?

Posted Mar 27, 2009 1:01 UTC (Fri) by nix (subscriber, #2304) [Link] (1 responses)

HEAR HEAR.

I'd estimate that I spend 20% of my time at work ripping out people's
buggy broken slow reimplementations of wheels and replacing them with a
wheel that uses twenty-to-forty-year-old techniques to do the same thing
faster and more reliably.

(And do the reimplementations stop? No! I ditched a chained hash table
implementation today which had a stupid bug which led to every element
landing in the same bucket. Obviously it was too hard to look
in "include/hash.h" to find that there was already a hash table in the
system with a better API...)

I mean it's not as if computers are bad at searching for things, but half
the people I work with are tentative and reluctant to just grep for a few
plausible terms to see if they can avoid reinventing the damn wheel yet
again.

Pride

Posted Mar 27, 2009 5:49 UTC (Fri) by quotemstr (subscriber, #45331) [Link]

One explanation behind all these square wheels is the phase every programmer goes through during which he overestimates his abilities, has no sense of scale, and lacks a sense of robustness. In short, he's proud, ignorant, and dangerous. He believes that libraries are bloated and slow, and that he can out-perform standard implementations. He optimizes prematurely, avoids function calls, abuses the ternary operator, and doesn't use a profiler.

Eventually, these programmers grow up, but in the meantime, they've written a significant amount of horrible code. I've seen this pattern again and again. As the parent mentioned, software developers have no "engineering culture." I imagine that in more established engineering disciplines, students have the above attitude beaten out of them before they graduate.

Better than POSIX?

Posted Mar 17, 2009 16:25 UTC (Tue) by jeleinweber (subscriber, #8326) [Link] (13 responses)

Meanwhile the *BSD crew went to soft updates (non-barrier asynchronous write order dependencies) 10 years ago, and think the Linux filesystem people are crazy for not doing it too. Ted's patches are maybe 25% of the way to what Dr. McKusick described back in 1999. Maybe we should do the other 75% ...

Better than POSIX?

Posted Mar 17, 2009 17:03 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (7 responses)

According to http://www.scs.cs.nyu.edu/V22.0480-003/notes/l10d.txt, soft updates doesn't provide atomic rename. (See "limitations".) Can someone who knows more about soft updates elaborate on this point?

Better than POSIX?

Posted Mar 18, 2009 0:22 UTC (Wed) by bojan (subscriber, #14302) [Link] (6 responses)

Gosh! A possible file system in another Unix OS that doesn't do the magic? Run for the hills folks! ;-)

Better than POSIX?

Posted Mar 18, 2009 0:36 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (5 responses)

Well, what's confusing is that it should, as I understand soft-updates --- it maintains a dependency graph of required metadata updates, and writes to disk in such a way that at each step, the system is consistent. I don't see why such a scheme wouldn't automatically give you at least a metadata-level atomic rename, so I was asking for clarification on what those notes meant.

Besides --- UFS users don't complain about zero-length orphans, so the filesystem must be doing something right. :-)

Better than POSIX?

Posted Mar 18, 2009 0:49 UTC (Wed) by bojan (subscriber, #14302) [Link]

Key word: possible.

I'm just poking fun, sorry ;-)

Better than POSIX?

Posted Mar 19, 2009 17:19 UTC (Thu) by anton (subscriber, #25547) [Link] (3 responses)

Well, what's confusing is that it should, as I understand soft-updates --- it maintains a dependency graph of required metadata updates, and writes to disk in such a way that at each step, the system is consistent. I don't see why such a scheme wouldn't automatically give you at least a metadata-level atomic rename, so I was asking for clarification on what those notes meant.
Soft updates is an enhancement of a file system that does update-in-place. Certain things are hard or impossible to do in that setting. In particular, rename requires updating two different places, so if you update one place after the other (as you have to do with in-place updates), there is a time span when one update has happened, but not the other.

You can use a journal for an in-place update system to work around this: write the rename operation atomically into the journal, then perform the individual updates, then clear that journal entry. In crash recovery replay the outstanding journal entries.
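A toy, application-level illustration of that journal idea (the "rename" here is done the old, non-atomic way - link then unlink - standing in for the filesystem's two separate on-disk updates; a real filesystem does this with its own journal blocks, not a side file):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int journaled_move(const char *from, const char *to)
    {
        char rec[512];
        int jfd = open("move.journal", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (jfd < 0)
            return -1;
        snprintf(rec, sizeof(rec), "move %s %s\n", from, to);
        if (write(jfd, rec, strlen(rec)) < 0 || fsync(jfd) < 0) {
            close(jfd);
            return -1;                   /* could not make the intent durable */
        }
        close(jfd);

        if (link(from, to) < 0 && errno != EEXIST)   /* update place one */
            return -1;
        if (unlink(from) < 0 && errno != ENOENT)     /* update place two */
            return -1;

        return unlink("move.journal");               /* clear the journal entry */
    }

    /* Crash recovery, run at startup: if move.journal exists, re-run the link
     * and unlink steps it records; both are idempotent as written above. */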

Or you can use a copy-on-write file system: there you write the changes to some free space, and when you have reached a consistent state, commit the changes by rewriting the root of the file system.

Besides --- UFS users don't complain about zero-length orphans
Wrong!

I don't think OSF/1 aka Digital Unix aka Tru64 Unix ever had soft updates.

Posted Mar 23, 2009 6:01 UTC (Mon) by xoddam (subscriber, #2322) [Link] (2 responses)

I think the OP was speculating about the utility of UFS as currently supplied with most free BSDs, which therefore accounts for the vast majority of contemporary UFS users.

Soft updates were first implemented on top of 4.4BSD (but weren't part of the distribution), first described in a paper in 1994 and first widely distributed as part of FreeBSD 4.0 in 2000.

Your complaint (linked above, not dated) seems to be about UFS on the DEC distribution of OSF/1 for its Alpha machines. OSF/1 had the option of traditional BSD-style UFS (synchronous metadata with async data) or the journaling filesystem AdvFS. (You initially had to pay extra for AdvFS. It was GPLd last year and is on SourceForge).

So if you used UFS on Tru64 or whatever it was called at the time, you didn't have soft updates like you do on today's *BSD. You had synchronous metadata updates with asynchronous data writes. OSF/1 never adopted soft updates because it already had a good journaling solution years beforehand.

I don't think OSF/1 aka Digital Unix aka Tru64 Unix ever had soft updates.

Posted Mar 23, 2009 9:45 UTC (Mon) by anton (subscriber, #25547) [Link] (1 responses)

Yes, my complaint was about UFS with synchronous metadata updates, not with soft updates (indeed I mention soft updates as a correct solution). My complaint just refutes the no-complaints claim of the OP, and weakens his insinuation that UFS with soft updates still zeroes files and we just don't hear about it.

soft update not providing true atomic rename?

Posted Mar 23, 2009 11:31 UTC (Mon) by xoddam (subscriber, #2322) [Link]

Methinks "insinuation" is a bit strong. If anything, Daniel was suggesting that UFS-with-soft-updates gets the balance right with respect to users' data not going missing.

The (sketchy) lecture notes linked in quotemstr's first post end with two points of limitations of soft updates:

>Limitations of soft updates:
> * Not as general as logging, e.g., no atomic rename
> * Metadata updates might proceed out of order
> - create /dir1/a than /dir2/b
> - after crash /dir2/b might exist but not /dir1/a

As has been explained elsewhere, rename on early Unix implementations was a two- or three-step process: unlink any existing file at the target path, hard-link the file at the source path to the target path, and finally unlink the source path. The target path would briefly be absent (as observable by concurrent processes, nothing to do with crash recovery) and the source file was briefly visible at two locations. Atomic replace was introduced to avoid the two obvious race conditions: a reader might find nothing at the target location, or another updater might open the source path with O_TRUNC and thereby inadvertently truncate the target file.
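In code, that early sequence amounts to something like the sketch below (illustrative only); each step is separately visible to other processes and is a separate point where a crash can leave the namespace half-updated.

    #include <errno.h>
    #include <unistd.h>

    /* rename() before an atomic rename() existed: three separate steps. */
    static int old_style_rename(const char *src, const char *dst)
    {
        if (unlink(dst) < 0 && errno != ENOENT)
            return -1;              /* 1: the target name is briefly absent  */
        if (link(src, dst) < 0)
            return -1;              /* 2: the file is briefly visible twice  */
        return unlink(src);         /* 3: the source name is finally removed */
    }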

It's possible (pure speculation on my part here, I haven't actually used soft updates or read the code) that the atomicity not provided by UFS-with-soft-updates amounts to a chance that the source file may end up linked at both the source and target paths. This is likely if the rename is moving the file from one directory to another, as the two directories are likely to be stored at different locations on disk.

Updating the target directory first, with atomic replacement of an existing file at that path but delayed deletion of the source path, satisfies the important condition that the target file never disappear, while violating the letter of the atomic rename contract.

The race condition risking truncation of target file when the source path is opened O_TRUNC may still exist, but only after a crash as the atomic semantics pertain on a running system.

Maybe there are no complaints from BSD users because they're so l33t they never crash: they don't run out of battery or kick out power cords or use unstable video and wireless drivers. Or maybe, just maybe, soft updates are as cool as they say ;-)

Better than POSIX?

Posted Mar 18, 2009 2:28 UTC (Wed) by vaurora (guest, #38407) [Link] (3 responses)

Soft updates are brilliant - so brilliant that I'd only want to support a file system using them if I could first clone a bunch of Kirk McKusicks to maintain it.

Better than POSIX?

Posted Mar 18, 2009 2:39 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (1 responses)

Why? The concept isn't that complicated.

soft updates

Posted Mar 18, 2009 5:15 UTC (Wed) by xoddam (subscriber, #2322) [Link]

The concept isn't complicated, but the implementation is fragile.

Better than POSIX?

Posted Mar 19, 2009 9:58 UTC (Thu) by joern (guest, #22392) [Link]

Assuming an FFS-style filesystem I agree with you. To my knowledge I have done the only other implementation of soft updates. And it hardly resembles the original because 95% or more of the complexity doesn't exist. Put another way, I could only pull it off by first changing the rules (aka cheating).

soft updates [was: Better than POSIX?]

Posted Mar 26, 2009 13:11 UTC (Thu) by doran (guest, #57557) [Link]

For your information, NetBSD just replaced soft updates with journaling.

Soft updates does not understand hardware write-back caches; NetBSD's journaling does. Given that nearly all commodity systems have write-back caches at some level in the I/O path nowadays, and need them for good performance, this is important.

Soft updates was quite complicated code with integration issues, making it very difficult to fix some of the bugs with it. This was mainly down to its structure, which was suited to commercialization (soft updates was initially a commercial offering). It was not necessarily tailored to the BSD systems.

Additionally, journaling gives some nice benefits like no fsck, and fully atomic rename. No fsck is important to NetBSD for the embedded role, and elsewhere is a much bigger story than it was when soft updates came about due to the explosion in storage volume size.

Better than POSIX?

Posted Mar 17, 2009 17:22 UTC (Tue) by SteveKnodle (guest, #7853) [Link] (6 responses)

This discussion completely misrepresents what POSIX conformance means.
The POSIX standards represent an essentially "minimal" set of requirements,
so that any application written to the POSIX standard will run on a wide
variety of computers. VMS has a "POSIX" API, and I believe VM370 does as well.
Each OS adds additional properties that developers rely on when writing
applications that require specific behavior. The SysVR4 interface standards
are one example. Ext4 needs to obey POSIX requirements, and provide a
predictable, dependable behavior (beyond POSIX) that developers can use
to write useful programs.

Better than POSIX?

Posted Mar 17, 2009 19:01 UTC (Tue) by christian.convey (guest, #39159) [Link] (5 responses)

It seems to me that a POSIX-compliant operating system must provide at least the POSIX API.

But a POSIX-compliant program should require no more than the POSIX API.

Better than POSIX?

Posted Mar 18, 2009 2:48 UTC (Wed) by zlynx (guest, #2285) [Link] (4 responses)

That's a very good point.

So no matter what ext4 does in this case, the programs with the problems are still not POSIX-compliant.

They should either fsync, or adopt the other common pattern of renaming the target file to a
backup with the tilde before renaming the new file into place. Then have recovery code to pick up
the tilde file if there's a problem with the regular one.
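Roughly like this (names are illustrative, and the recovery side is reduced to a comment):

    #include <stdio.h>

    /* Keep the previous version around as "<path>~" so recovery code can fall
     * back to it if the freshly renamed file turns out to be damaged. */
    static int replace_with_backup(const char *newpath, const char *path)
    {
        char backup[4096];

        snprintf(backup, sizeof(backup), "%s~", path);
        if (rename(path, backup) != 0)      /* old version becomes the tilde file */
            return -1;
        return rename(newpath, path);       /* new version takes the real name */
    }

    /* Recovery at startup: if "path" is missing or fails its sanity checks,
     * rename "<path>~" back into place and use it instead. */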

Better than POSIX?

Posted Mar 18, 2009 4:09 UTC (Wed) by bojan (subscriber, #14302) [Link]

Precisely.

Better than POSIX?

Posted Mar 18, 2009 4:11 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (2 responses)

Huh? If a program is truly POSIX-compliant, it can't make any assumptions about what happens after a crash. That's undefined by POSIX. There's no sense in playing games with backup files because the backups are undefined by POSIX too!

Really, all a program that depends on nothing more than POSIX can do is take advantage of the atomic rename support on a running system and hope for the best on a crash. POSIX guarantees nothing else.

Clearly, this is not an acceptable state of affairs, so we make further assumptions about the exact behavior of fsync and friends. But these assumptions go beyond POSIX.

Now, an ordered rename makes a whole lot more sense than a rename that only preserves contents after an fsync. But don't pretend that either alternative is mandated by POSIX. This whole damn problem has nothing to do with POSIX, so stop bringing it up. (And that means you too, bojan.)

Better than POSIX?

Posted Mar 18, 2009 5:26 UTC (Wed) by bojan (subscriber, #14302) [Link] (1 responses)

> And that means you too, bojan.

Sorry. I'll bring up whatever I see fit, whether you like that or not. But, by all means, don't listen to me and don't reply to my comments. Ignore me - that's OK. But at least do try to understand what Ted's saying.

> If a program is truly POSIX-compliant, it can't make any assumptions about what happens after a crash.

No, it cannot make assumptions. It can make preparations, as best it can (which are defined in the standard), to have data on disk. These preparations are called fsync(). Or, it can be smart and create little tiny backup files with fsync() beforehand and then be fast and keep renaming in the hope that the system doesn't crash all the time.

Either intentionally or otherwise, you keep misinterpreting what POSIX does or does not define. POSIX defines that fsync() is your best bet on having the data on disk. It doesn't define anything about rename() having ordered characteristics. It also doesn't define anything about the situation after the crash.

Out of this, you are giving people advice that the best thing to do is to go with undefined behaviour of rename() if you want your data on disk after the crash. The mind boggles...

Better than POSIX?

Posted Mar 18, 2009 14:22 UTC (Wed) by nlucas (guest, #33793) [Link]

Even if you go the POSIX way, many cheap disk drives lie to the system, saying the data is on the disk platter when it's still in their internal buffers. So even that cannot guarantee that the operation is atomic.

An anecdotal case some years ago was Windows 98 corrupting the disk on "modern" PCs, because the hardware was so much faster than the disk's flushing. When Windows was shut down it would kill the power before the disk finished flushing its write buffers, corrupting the file system.

The only solution was to add some wait time before killing the power.

Better than POSIX?

Posted Mar 17, 2009 17:33 UTC (Tue) by forbdonut (subscriber, #21577) [Link] (3 responses)

I don't understand one point. We keep saying that ext3 had an "implicit
guarantee" that data blocks will hit the disk before meta-data. I don't
understand why that's "implicit." It seems like the definition of data=ordered
mode says exactly that?

In particular, what does data=ordered mode actually mean in ext4 with delayed
allocation?

This mode is more like data=pseudo-writeback, i.e. it's some new writeback mode
that ext3 didn't have.

It feels like the suggested alloc-on-commit mode should be called
data=ordered.

Am I missing something?? Thanks!

Better than POSIX?

Posted Mar 18, 2009 0:26 UTC (Wed) by brugolsky (guest, #28) [Link] (2 responses)

Since Ted has emphasized that data=ordered is about security, not integrity, I think the point is that as long as blocks have not been allocated, there is no risk of exposure of stale data. Hence the security guarantee of ext4 data=ordered is equivalent to the security guarantee in ext3.

But frankly, since the dawn of ext3 in the 2.2.x kernel series, I've always considered the fact that it didn't leave garbage files around on my laptop (with its dodgy IDE chipset) to be its major benefit. And I never really seriously considered using the other journalling filesystems that only preserve meta-data integrity. So I am unhappy with the choice of names for the options. I'd rather that the default be "data=delayed", and "data=ordered" refer to the allocate-on-commit behavior.

Better than POSIX?

Posted Mar 18, 2009 22:18 UTC (Wed) by forbdonut (subscriber, #21577) [Link] (1 responses)

I don't completely buy Ted's "data=ordered is only for security" argument. In particular, the documentation (in man pages / kernel docs / even the **ext4 docs** themselves) makes no mention of security.

data=ordered isn't advertised as a security feature. It claims to order data writes with the associated meta-data.

Better than POSIX?

Posted Mar 27, 2009 6:00 UTC (Fri) by Duncan (guest, #6647) [Link]

FWIW, you (and I) and Linus agree. This whole thing has come up yet again
in one of the 2.6.29 announcement reply subthreads, and Linus calls the
failure to honor data=ordered (thus implying that in Linus's opinion, it WAS
a failure to honor it, no matter what various others say about it being
about security only and that it was thus honored) "idiotic" in one reply,
and in another reply says essentially the same thing using a different
choice of words.

BTW, it's worth noting that to long time observers, Linus and thus the
LKML in general has at least three definite levels of "idiotic". Yes,
this is "idiotic", but Linus hasn't yet advanced to calling it
the "smoking crack" level of "idiotic" that he has been known to resort to
in other instances. OTOH, this would seem to be beyond the "brown paper
bag" level of "idiotic", so called because that's what the person making
the mistake wants to wear since he's now embarrassed to be seen in public.
The "brown paper bag" level of "idiotic" is the level that once aware, the
person who made the mistake owns up to it and does NOT defend, but
rather "resorts to the brown paper bag", and in fact, many such "brown
bag" level of mistakes are discovered and fixed by the person that made
them in the first place. This is beyond that since the person making
the "mistake" has been and continues to defend it as "correct", thus
reaching at minimum the "idiotic" level.

Anyway, based on Linus own posts to the post-2.6.29-announcement thread,
one gets the strong impression that somewhere along the line, Linus would
love to get a patch that makes data=ordered mean just that once again,
that delayed allocation or no delayed allocation, if data=ordered, the
data will be written before the metadata that covers it. Personally, I'd
suggest the current default then be christened "data=screwed", altho
data=delayed or some such is the more likely "acceptable" alternative.

Duncan

How we learn APIs

Posted Mar 17, 2009 17:34 UTC (Tue) by christian.convey (guest, #39159) [Link] (7 responses)

I think this goes to one weakness we have regarding writing robust code: most of us learn APIs by experimentation and seeing how a program performs when we've written it in a certain way. This issue with the POSIX file APIs plays right into our weaknesses as developers, because we easily draw the wrong conclusions about our correct use of the API when our programs appear to work.

Unless we can get programmers to learn APIs differently, I don't see how we can avoid providing them with a more helpful API if we want them to write robust programs.

Maybe what we really need is some user-space library that provides the more robust guarantees, but that works across many different file systems and operating systems. That could get us past the concerns about writing ReiserFS-only code or Linux-only code.

How we learn APIs

Posted Mar 17, 2009 18:19 UTC (Tue) by droundy (subscriber, #4559) [Link] (3 responses)

Unless we can get programmers to learn APIs differently, I don't see how we can avoid providing them with a more helpful API if we want them to write robust programs.

I've been wondering whether we might be able to quantitatively describe a "robust" implementation of the POSIX API in terms of the published API in a way that is relatively easy to convey to its users. For instance, I'd like to know that if my application is written so as to behave correctly (as understood by myself and my users) in any scenario in which any subset of processes are abruptly killed, and the connection to the disk is severed at any time--assuming that the disk is always in a "correct" state at the time it is severed, in the sense of reflecting all IO preceding that point, and that all IO after that point has no effect on the disk, and fsync fails after that point--then my application is also robust in the event of a system crash. This is a relatively simple criterion of *application* robustness, which then could define *file system* robustness.

As an application developer, that's what I'd like to have assurance of. In the extremely common case that we don't care if data has actually hit disk (i.e. our application will work fine on a RAM disk), the latter scenario of severed disks is irrelevant, and we just want to know that the guarantees that POSIX provides with regard to the running system are preserved in case of crash. And that's something I can reason about. And it's even something I can test my application for without installing special non-robust file systems and pulling the plug on the computer.

How we learn APIs

Posted Mar 19, 2009 17:35 UTC (Thu) by anton (subscriber, #25547) [Link] (2 responses)

What you describe sounds very similar to what in-order semantics provides.

How we learn APIs

Posted Mar 20, 2009 13:53 UTC (Fri) by droundy (subscriber, #4559) [Link] (1 responses)

Yes, that exactly describes the semantics I'd like--and the page even gives the same reasoning for it. It seems so obvious that this should be the goal of a file system!

(Although I can see an appeal to relaxing the in-order constraint for IO from different processes... one ought to be able in principle to do that while maintaining in-order semantics if one were to examine locks---and information flow in general---to ensure that you didn't reorder IO that could be causally related.)

How we learn APIs

Posted Mar 20, 2009 18:36 UTC (Fri) by anton (subscriber, #25547) [Link]

Although I can see an appeal to relaxing the in-order constraint for IO from different processes
Well, given that Unix applications often create lots of processes that interact in lots of interesting ways (think shell scripts), I think it's just too hard to find out when two operations are independent. I also don't think that this relaxation buys much if the in-order constraint is implemented efficiently (by combining a large batch of operations and committing them together).

How we learn APIs

Posted Mar 17, 2009 20:04 UTC (Tue) by nix (subscriber, #2304) [Link] (1 responses)

The only programming community I know that actually works as you suggest
is the Ada community, and that's because their systems tend to be 'system
fails, people die' sorts of embedded systems (or 'system fails, N-million-
dollar missile falls out of the sky'). That tends to breed paranoia and a
desire to actually know the damn standard before you write code.

Everywhere else, it's a minority who even know the standard *exists*, let
alone reference it regularly, and it's a very small minority who've
actually read the whole thing.

I, too, wish this was not true (I spend much too much of my time at work
cleaning up after these bozos), but I can see no useful way to fix it. In
a corporate setting mandatory code reviews with a flunk-too-often-and-
you're-fired rule might work, but I wouldn't want to work there: morale
would be appalling.

How we learn APIs

Posted Mar 17, 2009 21:44 UTC (Tue) by man_ls (guest, #15091) [Link]

And there is a reason that critical code can be as much as 10 times as expensive as regular business code. Working that way (learning your standard by heart before starting to work) can be much more expensive than the usual corporate style of "write, then fix". On the other hand it is better to write correct code than to "write, then fix". There is a sweet spot somewhere in between.

Learning APIs by writing unit tests?

Posted Mar 18, 2009 11:25 UTC (Wed) by emk (subscriber, #1128) [Link]

Over the years, I've become a bit cynical about standards and documentation. When I rely on a standard, I find that nobody implements it correctly. (Witness, oh, every C++ compiler from 1992 to 2002.) When I rely on documentation, I find that it's a pack of useless lies. (The worst offenders: Hardware databooks, which often have no bearing on reality whatsoever.)

So now I take a different approach: I write lots of unit tests, and I set up buildbots for any platform I need to support.

This leads to lots of interesting discoveries. For example, under Windows Vista, pretty much any file system operation can fail for no apparent reason, and it may need to be retried until it works.

There are two exceptions to this rule: Security, and data integrity. You can't ensure either with unit tests. You must also understand both the official documentation and the reality of common implementations. (My nastiest surprise so far: There are snprintf implementations on some legacy Unix platforms that ignore the buffer size parameter, exposing every caller of snprintf to an overflow attack. If you rely on snprintf on old Unix platforms, it might be worth writing a unit test that checks to see if it works.)
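
A check like that can be tiny. Below is a minimal sketch of such a unit test, assuming only C99 snprintf() behaviour; the buffer sizes and guard value are arbitrary.

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Verify that snprintf() honours its size argument: write into a small
   window of a guarded buffer, then check that the output was truncated,
   NUL-terminated, and that nothing beyond the window was touched. */
int main(void)
{
    char buf[16];
    memset(buf, 0x7f, sizeof buf);                /* guard pattern */

    int n = snprintf(buf, 8, "%s", "0123456789"); /* needs 10 chars + NUL */

    assert(strcmp(buf, "0123456") == 0);          /* 7 chars + NUL fit in 8 bytes */
    for (size_t i = 8; i < sizeof buf; i++)
        assert(buf[i] == 0x7f);                   /* guard bytes untouched */
    /* C99 says n is the length that would have been written (10);
       some older implementations return -1 instead. */
    assert(n == 10 || n == -1);

    puts("snprintf respects its size argument");
    return 0;
}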

Please make us proud of ext4

Posted Mar 17, 2009 17:49 UTC (Tue) by sbergman27 (guest, #10767) [Link] (1 responses)

"allocate-on-commit" as the default, with a separate option to allow less reliable, but more performant behavior (perhaps alloc=benchmarks) would exhibit the traditional, and time-tested, extX philosophy that has worked so well. The ext3 devs were steadfast in insisting that data=ordered be the default for ext3, even though data=writeback is generally a performance win.

Please make us proud of ext4

Posted Mar 19, 2009 2:39 UTC (Thu) by mjr (guest, #6979) [Link]

Yes, make us proud, by actually using the filesystem's capabilities and not further discouraging the proper use of fsync (not overuse, such as firefox is/has been prone to...) by keeping it unnecessarily slow (which alloc-on-commit will do).

The kludges for alloc-on-replace that will apparently be enabled by default should deal with the real, problematic cases of loss of pre-existing data quite nicely, and I do think having those as defaults is a good, reasonable compromise. Those who want to be extra-paranoid with either nodelalloc or alloc-on-commit, well, sure, the option is/will be there, but I don't think these (might I say) somewhat obsolescent modes should be the default.

'course, distros shouldn't use ext4 by default yet anyway, for conservativeness' sake. Maybe starting next year ;]

Better than POSIX?

Posted Mar 17, 2009 17:50 UTC (Tue) by iabervon (subscriber, #722) [Link] (1 responses)

POSIX makes no claims that, in the event of a system crash, your filesystem will be recoverable at all, or continue to contain anything in particular. If you fsync() before rename(), that will evidently satisfy ext4, but it doesn't particularly matter to POSIX; there's no reason to think that the filesystem doesn't handle rename() by atomically changing how it responds to processes, but replaces the on-disk directory entry by a 0-length inode immediately and only writes the replacement one later (limited by when you call fsync() on the directory). All that fsync() will ensure is that your new data is on your disk somewhere. This means that you would then be able to slog through /dev/sda1 and find your file contents somewhere, not that any particular filename lookup will find it for you after a system crash.

I don't see anything in POSIX to suggest that there's anything you can do, in general, to avoid having a window in which the on-disk mapping of names to contents is undesirable. Using fsync() here is an implementation-specific hack: it is an operation POSIX does define, and it happens to update the data that ext4 cares about.
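
For readers trying to follow the debate, here is a minimal sketch in C of the belt-and-braces sequence several commenters describe: fsync() the new file before rename(), and fsync() the containing directory afterwards if the rename itself needs to survive a crash. The paths are placeholders and error checking is omitted for brevity; whether all of this is actually required is exactly what is being argued about.

#include <fcntl.h>
#include <unistd.h>

static void replace_config(void)
{
    int fd = open("config.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    const char buf[] = "setting=1\n";
    write(fd, buf, sizeof buf - 1);   /* new contents */
    fsync(fd);                        /* push the new file's data to the media */
    close(fd);

    rename("config.new", "config");   /* atomically switch the name */

    int dirfd = open(".", O_RDONLY | O_DIRECTORY);
    fsync(dirfd);                     /* make the rename itself durable */
    close(dirfd);
}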

Better than POSIX?

Posted Mar 17, 2009 18:33 UTC (Tue) by ssam (guest, #46587) [Link]

So a perfectly good POSIX filesystem could zero the whole disk after a system crash.

But that would be bad. Hence there is a journal, and tools like fsck, to make sure that a system crash usually does not harm most of your data.

The journal is meant to ensure that after a crash the filesystem can be recovered to a valid state, without having to sync after each write.

So ext3 is POSIX plus stuff to make a filesystem safe and useful.

Better than POSIX?

Posted Mar 17, 2009 17:53 UTC (Tue) by walters (subscriber, #7396) [Link] (3 responses)

The problem with the POSIX interface (and raw filesystem APIs in general) is that they're too general. Is it designed for Postgres? Is it designed for preference store backends? Is it designed for my IRC client to store conversation logs (with search)? You can't do all of these simultaneously and well.

Basically I think the filesystem should be designed for #1 and #2, and what we really need is better userspace libraries; particularly for desktop applications. If those *libraries* happen to use kernel-specific interfaces for optimization, that makes sense.

For example, I'm pretty sure one could put part of sqlite's atomic commit in the kernel.

The other argument against trying to solve this in the kernel is that very few programs use the raw libc/POSIX interface; they're using the standard C++ library, Java, Python, Qt, GLib/Gio etc. So any changes have to be made at those levels.

Better than POSIX?

Posted Mar 17, 2009 18:03 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

For example, I'm pretty sure one could put part of sqlite's atomic commit in the kernel.
Your ideas (like most ideas) have been brought up in the past. Down that road you propose lay record-oriented filesystems. There's a reason every major filesystem today works with generic files: the alternative is absolutely horrible.
The other argument against trying to solve this in the kernel is that very few programs use the raw libc/POSIX interface; they're using the standard C++ library, Java, Python, Qt, GLib/Gio etc. So any changes have to be made at those levels.
These interfaces are as generic as the POSIX ones, and putting data integrity logic there will not alleviate any of the concerns that have already been raised. If you want to make file-format-specific optimizations, you have to modify file-format-specific libraries --- and most applications just roll their own. Fixing the problem in the kernel solves all these issues elegantly, and at once.

Better than POSIX?

Posted Mar 17, 2009 18:31 UTC (Tue) by butlerm (subscriber, #13312) [Link]

That SQLite explanation you quoted is an excellent example. It is worth
noting however that high performance databases can commit a durable
transaction as soon as the necessary journal update has made it to
persistent storage. Nothing else needs to be touched.

SQLite on the other hand, appears to do a checkpoint (i.e. finish the
updates to all the actual data files) as part of every commit. That is not
necessary in a good design. There is nothing fundamental about filesystem
or database design that requires anything except the journal to be
synchronously updated on transaction commit. If durability is not
required, not even that.

Better than POSIX?

Posted Mar 26, 2009 16:42 UTC (Thu) by muwlgr (guest, #35359) [Link]

There was a nice DBMS, Btrieve from Novell, with similar rollback facilities. I used it under DOS and liked it in the past.

Better than POSIX?

Posted Mar 17, 2009 18:19 UTC (Tue) by rsidd (subscriber, #2582) [Link] (6 responses)

Your characterisation of the Unix Haters Handbook as "childish" itself seems childish. First, it was written in the days when Unix itself was proprietary (Linux and BSD were around, but barely), so there wouldn't have been much harm in another proprietary system winning (and VMS is only one of many systems the authors seem to prefer; ITS and Lisp machines were others). Second, it is alarming how much of the book rings true even today. This quote from page 19 resonates in the present context: "Unix programmers have a criminally lax attitude toward error reporting and checking. Many programs don’t bother to see if all of the bytes in their output file can be written to disk. Some don’t even bother to see if their output file has been created. Nevertheless, these programs are sure to delete their input files when they are finished." (This is a problem because of the "worse is better" philosophy of Unix. By the way, Gabriel's article is a later appendix here, not part of the original handbook.) And most of the chapter on filesystems is very true even today; practically all of it is true on ext2, which was the de facto Linux filesystem until recently.

Better than POSIX?

Posted Mar 17, 2009 19:33 UTC (Tue) by ncm (guest, #165) [Link] (5 responses)

At the risk of being accused myself, I will point out that the accusation above is at least equally childish. The Unix Haters Handbook made no bones about its shallowness or its sour-grapes attitude, and neither should we. Implicit in any criticism of an established target is the awareness that, were the tables turned, one's favored alternative has (or would have, had it survived and evolved) plenty to complain about. That said, a Windows Haters Handbook at a similar level of detail would be much, much longer and could be decidedly less subjective about the flaws, and therefore less fun. The difficulty of maintaining the right balance is demonstrated by how badly all UHH's imitators have done.

It's amusing to me that Gabriel's example of interruptible read() and write() system calls utterly fails to illustrate his point. The interface complexity of these system calls can be completely hidden by simple C library functions, and the code to implement them is much, much shorter than would be needed in the kernel to re-run the interrupted calls. Furthermore, a user program might well prefer to bypass such help, e.g. in case it could do something else useful before performing more I/O.

That said, Apollo's Aegis, evolved from Unix's predecessor Multics, had a much better interface to read(): it took a pointer to a buffer, but normally would return a pointer into the OS's own file buffer, usually avoiding a copy operation. Unixen didn't typically have a nonblocking read until remarkably recently, and it still doesn't tend to work well with disk files.

Better than POSIX?

Posted Mar 17, 2009 21:38 UTC (Tue) by bcopeland (subscriber, #51750) [Link] (3 responses)

> It's amusing to me that Gabriel's example of interruptible read() and
> write() system calls utterly fails to illustrate his point.

I don't see how the example undermines Gabriel's point. As far as I can tell,
he was very much arguing in favor of worse-is-better, that having the best
interface for the sake of it is brain-damaged, and so the Unix solution of
EAGAIN was superior to the MIT way, even though the interface was arguably
"worse."

Better than POSIX?

Posted Mar 18, 2009 6:23 UTC (Wed) by rsidd (subscriber, #2582) [Link] (1 responses)

He was not arguing that this is intrinsically better. He was arguing that worse-is-better spreads, the way viruses spread.

Someone else above made the point that worse-is-better was quite reasonable in the 1970s and 1980s when computers were extremely slow and people were willing to sacrifice stability for speed. (Lisp machines, I believe, literally took all morning to boot; and garbage collection was time for a coffee break. At the other extreme, completely unprotected operating systems like CP/M and MS-DOS, that let programmers do pretty much anything they liked, managed to have useful applications like WordStar that were as smooth and interactive as today's word processors. Unix machines lay somewhere in the middle.)

Also, programming was an arcane art and OS designers were willing to trust application designers to "do the right thing" (and if they didn't, the consequences were immediately noticeable).

Today's computers are a few orders of magnitude faster, but are still running operating systems built on assumptions that ceased to be valid nearly two decades ago.

Better than POSIX?

Posted Mar 18, 2009 6:58 UTC (Wed) by mgb (guest, #3226) [Link]

In the 80's a Lisp machine would start in the few seconds it took the monitor to warm up because it was almost always started from what we would now call suspend-to-disk. Loading a new image could take half an hour or more but you only did this a few times per year.

Garbage collections caused zero delay, as incremental garbage collection was supported in microcode.

Just like today, if you needed a coffee break in the 80's you had to find something huge to compile.

Better than POSIX?

Posted Mar 19, 2009 23:11 UTC (Thu) by jschrod (subscriber, #1646) [Link]

No, Dick doesn't argue in favor of worse-is-better.

He states that it is an observable fact that the worse-is-better school has a better adoption rate than the do-it-right-the-first-time school; for several reasons, only some of them technical. And that fact makes him sad, because he is very much a member of the do-it-right-the-first-time school -- you just have to read his report about CLOS to recognize that, even if you have never spoken to him personally. (I have, so this is not hearsay.)

Prerequisite for UHH

Posted Mar 19, 2009 18:15 UTC (Thu) by dmarti (subscriber, #11625) [Link]

You couldn't have a UHH without all the boasts about the elegance of Unix from otherwise mostly sensible people. Microsoft's OS products have been driven by backwards compatibility, business side demands for lock-in, and stripping out features to please antitrust regulators -- nobody is saying they're elegant, so no point writing a "hater's" handbook to rebut the claim.

Better than POSIX?

Posted Mar 17, 2009 18:20 UTC (Tue) by sf_alpha (guest, #40328) [Link] (17 responses)

Good and balanced view.

I think it is also the duty of application developers to ensure their applications work on POSIX if they are writing UNIX applications.

There are many filesystems out there that implement delayed allocation. Although those filesystems are not the default filesystem for Linux, we expect applications to work regardless of which STABLE filesystem and operating system they are used on.

Use of the allocate-on-commit mount option to provide ext3-like behavior is a workaround that gives applications time to migrate, as is the patch to commit on rename. But again, application developers keep ignoring the POSIX specification and compliance. If they want to write a Linux-only application, there is no problem with relying on Linux or ext3 functionality. But I am sure that most applications are not intended for use only on Linux and a few filesystems.

So if allocate-on-commit is the default behavior, we get non-portable (and buggy) applications in exchange.

Better than POSIX?

Posted Mar 17, 2009 19:52 UTC (Tue) by aleXXX (subscriber, #2742) [Link] (1 responses)

> I think it is also the duty of application developers to ensure their
> applications work on POSIX if they are writing UNIX applications.

Sure, sounds reasonable. But, how to do that ? Most people have one box around, usually running Linux. How should they/we test that our software works fine everywhere ? Ok, one can install a virtual machine and run e.g. FreeBSD or Solaris on it. Not sure how many people do this in their spare time. At least for me it's quite at the top of my TODO.

E.g. Kitware has nightly builds and testing for basically all their software on basically all operating systems, i.e. Linux, Windows, OSX, AIX, HP-UX, Solaris, FreeBSD, QNX and others (here's the current dashboard for CMake: http://www.cdash.org/CDash/index.php?project=CMake) But setting this up is quite some effort, you need to find people to host these builds for you. Not every small project can afford this.

Alex

Better than POSIX?

Posted Mar 18, 2009 5:45 UTC (Wed) by sf_alpha (guest, #40328) [Link]

OK. I did not mean testing on every system, just paying attention to POSIX to ensure data integrity and that applications work as expected on most systems.

If an application is designed to be used only on Linux with ext3, that is OK: just ignore this problem and rely on ext3's robustness. The only drawbacks are that the application is not portable and still MAY LOSE DATA ON A CRASH.

Better than POSIX?

Posted Mar 17, 2009 20:17 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (1 responses)

I think it is also the duty of application developers to ensure their applications work on POSIX if they are writing UNIX applications.
No. I don't support every POSIX system, and it's not my responsibility to do that. I'll decide where I want my application to fall on the spectrum between simplicity and portability, not you.

Better than POSIX?

Posted Mar 17, 2009 22:29 UTC (Tue) by nix (subscriber, #2304) [Link]

Ah, but just use gnulib and your program will work everywhere, including
POSIX environments like mingw. ;}}}}

Better than POSIX?

Posted Mar 17, 2009 22:21 UTC (Tue) by man_ls (guest, #15091) [Link] (11 responses)

I think it is also the duty of application developers to ensure their applications work on POSIX if they are writing UNIX applications.
Those applications do work on POSIX systems. They also happen to leave hordes of little empty files after a crash, something which, as has been argued ad nauseam, is a POSIX-compliant way of dealing with a crash. It is also POSIX-compliant to make the user hunt these little buggers and remove them, or to provide valid contents. It seems that it is even POSIX-compliant to zero the whole disk on crash, something which these applications kindly refrain from doing. See? nobody is ignoring POSIX specification and compliance.

Just joking. You are right that people should respect the spec, but I think that POSIX compliance is not the problem here. In fact POSIX is just a red herring that Ted Ts'o threw in the way to make the hounds lose the scent. Apparently he failed, but he left the hounds half-crazed and biting each other for a long time.

Better than POSIX?

Posted Mar 18, 2009 0:45 UTC (Wed) by bojan (subscriber, #14302) [Link] (4 responses)

> They also happen to leave hordes of little empty files after a crash, something which, as has been argued ad nauseam, is a POSIX-compliant way of dealing with a crash.

I know you are joking here, but these files are not empty because of some evil "POSIX compliant" way of dealing with a crash by the FS or kernel. They are empty because they were never committed to the disk by running fsync().

So, it is not that POSIX compliant FS decided to _remove_ that data upon crash. It was never _explicitly_ placed there in the first place, so it cannot possibly be there after the crash.

Sure, you could have a very rare situation where you do run fsync() and you get a zero length file, for all sorts of reasons (usually hardware and kernel driver issues). This is not the case here, however.

> POSIX is just a red herring

I have to disagree here and agree with Ted. The manual page of close is very specific and says:

> A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)

So, if you want to have _any_ guarantee that you will see your data on disk _now_, you had better fsync().

The manual page of rename(2) is similarly clear on what is being atomic - just the directory entries. Sure, ignore at your peril.

And, finally, the documentation of fsync() is also crystal clear that one is allowed to run it independently on a directory and on a file. Which means that users themselves are allowed to do this separately (and they do). So does the kernel.

Sure, Ted is being gentle to everyone with bugs in their apps, which is fair enough. But, I'll bet $5 they'll hit the same thing on another Unix-like OS in the future, at which point all this screaming will happen again and people writing buggy code will accuse FS writers of being at fault.

In the meantime, it is easy to take backups of configuration files rarely and restore them if real config files are broken. And it doesn't require running fsync() all the time.

3 out of 4?

Posted Mar 18, 2009 7:22 UTC (Wed) by man_ls (guest, #15091) [Link] (3 responses)

You seem to think that:
  • if you repeat the same red herring long enough then it becomes the truth,
  • you can ignore the distinctions thrown in your face once and again,
  • and that the last person to reply wins the discussion.
Just curious. Do lurkers support you in email?

3 out of 4?

Posted Mar 18, 2009 8:46 UTC (Wed) by bojan (subscriber, #14302) [Link]

Very funny :-)

Look, it is completely up to you to believe what you want of course. I will do the same. OK?

3 out of 4?

Posted Mar 20, 2009 5:02 UTC (Fri) by k8to (guest, #15413) [Link] (1 responses)

I'm personally in agreement.

The applications are expecting behavior POSIX does not provide.
The applications should stop expecting this.

It's fine to use a pattern that doesn't request the data be on disk, but you should write the app to deal with the lack of the data being on disk.

This is what I've done many times, in my own software authoring.

Congratulations

Posted Mar 20, 2009 11:52 UTC (Fri) by man_ls (guest, #15091) [Link]

I am sure you also run your Bash console in POSIX mode, never use ls with long options, and only use cp with the four-and-a-half POSIX options. Congratulations. The rest of the world is not so spartan.

I hope I don't have to use your software that fsync()s after every file operation. Mistaking durability for atomicity can have dire consequences both for durability and for atomicity.

Besides, you are not a lurker and this is not email.

Better than POSIX?

Posted Mar 18, 2009 2:03 UTC (Wed) by ras (subscriber, #33059) [Link] (5 responses)

> You are right that people should respect the spec, but I think that POSIX compliance is not the problem here.

It strikes me as odd that an open source OS uses a non-free spec to define its operations. Doesn't it strike anybody else as odd that we have a whole pile of people here arguing about compliance to a spec they most likely haven't seen? I see statements like "ensure your app only relies on stuff in POSIX". Perfectly good advice, except how is your typical open source developer meant to do that when he can't get access to the bloody thing?

That aside, I gather (since I have not been able to get a copy of POSIX myself) that POSIX doesn't offer much to programmers who want to ensure some combination of consistency and durability. This sort of stuff is a basic requirement if you want to produce a reliable application. The furor here is an indication of just how basic it is. Yet even if you did have access to the spec, I gather it doesn't spell out how to do this. So programmers have learnt a bunch of ad hoc heuristics, like "to get consistency without the slowdowns caused by durability, use open();write();close();rename()". Then we get accused of "not adhering to the spec" when the next version of the FS doesn't implement the heuristic. Give us a break!

Ted's suggestion that you should be using sqlite if you want to write out a few hundred bytes of text reliably is on one level almost a joke. I presume he suggested it because the sqlite authors have taken the time to learn all the heuristics to get data on the disc reliably. Given it _is_ so hard to figure out all those heuristics for the various file systems your application could find itself running on, I guess it is a reasonable suggestion. Unfortunately, as the firefox programmers found out, it doesn't always work. Yeah, sqlite got the data onto the disc reliably, but only by using fsync(), which killed performance on some platforms. Given you probably don't care whether your latest browsing history hits the disc in 5 minutes' time, it is a great illustration of why programmers are so fond of "open();write();close();rename()".

From talking to a MySql developer, I gather the situation is even worse than most people posting here realise. Not only does the rename() trick not work, it turns out just about anything beyond fdatasync() doesn't work. For example, you might expect that appending to a file would be fairly safe. Well, apparently not, according to POSIX. He said that if you append to a file, there is a chance on a POSIX system that the entire file could be truncated if you crash at the wrong moment. The only way to guarantee a file can't be corrupted by a write is to ensure you don't affect the metadata (think block allocations), i.e. always write to pre-allocated blocks. Need to extend your 100GB database? Well then you have to copy it, write zeros to the extra space at the end to ensure it isn't sparsely allocated, then use the fsync(); rename() trick.

And that should be a joke. Pity it isn't. Given that filesystems aren't going to implement ACID, we need a set of primitives we can use to build up our own implementations of ACID. Fast, simple things, along the lines of the CPU instruction "Test Bit and Set", which is there so assembly programmers can implement all sorts of complex locking schemes on top of it. And we need them defined in a spec that we can actually access - unlike POSIX.

Given that ain't going to happen, Ted's only way out of this is to publish such a document for his filesystems - the ext* series. Just a series of HOWTOs would be a good start - HOWTO extend a large file reliably, HOWTO get consistent data written to disc (i.e. impose ordering on writes) without the slowdowns of unwanted sync()s, HOWTO ensure a rename() of a file you don't have open has hit the disc. Nothing fancy. Just the basic operations applications are expected to implement reliably every day on his file systems.

Better than POSIX?

Posted Mar 18, 2009 4:35 UTC (Wed) by butlerm (subscriber, #13312) [Link]

Most of the POSIX specs are online these days. Google "POSIX IEEE".

Better than POSIX?

Posted Mar 18, 2009 13:17 UTC (Wed) by RobSeace (subscriber, #4435) [Link]

The actual POSIX specs may not be available anywhere for free, but the Single Unix Specs are, and they are essentially a superset of POSIX, and probably what most people really mean when they say "POSIX" these days...

Better than POSIX?

Posted Mar 18, 2009 15:41 UTC (Wed) by markh (subscriber, #33984) [Link]

POSIX.1-2008 is available here.
POSIX and SUS have merged, and are now the same thing. The link in the last comment, and the first google link, point to the older 2004 edition.

Better than POSIX?

Posted Mar 18, 2009 20:27 UTC (Wed) by sb (subscriber, #191) [Link]

> It strikes me as odd that an open source OS uses a non-free spec to define its operations. Doesn't it strike anybody else as odd that we have a whole pile of people here arguing about compliance to a spec they most likely haven't seen? I see statements like "ensure your app only relies on stuff in POSIX". Perfectly good advice, except how is your typical open source developer meant to do that when he can't get access to the bloody thing?

Read the online specifications, particularly the System Interfaces volume.

Read also the Linux manpage for the system call you are using. It will say which standards the implementation adheres to, and how it departs from those standards.

On a Debian system, install "manpages-dev" and "manpages-posix-dev". For most system interfaces, you will then have the Linux implementation in section 2 and the POSIX spec in section "3posix".

Better than POSIX?

Posted Mar 19, 2009 0:31 UTC (Thu) by ras (subscriber, #33059) [Link]

Thanks to everybody pointing out POSIX is actually available nowadays - with links even.

Times have apparently changed, and it is a big improvement. The last time I went searching for POSIX was when I was referred to it by some man page which just said it implemented "POSIX regexes", and I got completely pissed off when I discovered that the manual entry for a supposedly free library referred me to a non-free spec.

Better than POSIX?

Posted Mar 19, 2009 18:43 UTC (Thu) by anton (subscriber, #25547) [Link]

So if allocate-on-commit is the default behavior, we get non-portable (and buggy) applications in exchange.
You might get a few more applications that sync before renaming, but that does not make them any more portable or bug-free. If the OS crashes, that's not an application bug nor a portability problem. If the user uses a file system that gives no crash consistency guarantees (e.g., ext4), that's not an application bug or portability problem. A user using such a file system should just back up frequently and be prepared to restore from backup in case of a crash. Application programming doesn't have anything to do with it.

Why EXT4?

Posted Mar 17, 2009 19:37 UTC (Tue) by gmaxwell (guest, #30048) [Link] (8 responses)

Why is this being cast as an EXT4 issue? XFS has the same behaviour, as does ReiserFS.

(…and yes, I've seen KDE broken after an unplanned shutdown on XFS but there was never any confusion or doubt on my mind that it was a KDE problem)

Why EXT4?

Posted Mar 17, 2009 20:15 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (2 responses)

XFS is a fast, well-written, feature-packed, mature filesystem. Ever wonder why it isn't used more?

Why EXT4?

Posted Mar 18, 2009 1:08 UTC (Wed) by bojan (subscriber, #14302) [Link]

Sure. Here are some:

1. It was written by SGI, so most distros don't have internal resources to support it. So, they ship other FSes as default.

2. Other FSes were historically the default, so upgrading to XFS is not that straightforward.

3. Some people unjustly accused this FS of losing data, so it is taking a long time for the FS to get its reputation back. Even if all those that did the accusing published public retractions of their unfounded accusations, it would take a long time.

Why EXT4?

Posted Mar 18, 2009 11:15 UTC (Wed) by arekm (guest, #4846) [Link]

I have used XFS everywhere (laptop, busy web/database servers, desktops) for years, and the only major problem I had was the NULLing of files on crashes (fortunately this problem was fixed in 2.6.22+).

Otherwise I'm very happy with xfs and I'll continue to use it.

Why EXT4?

Posted Mar 17, 2009 20:26 UTC (Tue) by sbergman27 (guest, #10767) [Link] (3 responses)

To the average user, it's a "Linux Problem". As in "Linux is obviously unstable crap". I know of no distros which ever planned to use XFS or Reiser4 as the default filesystem. Ext4 is presumably destined to become the default on most distros.

Users who get hoist with that particular petard have clearly done it to themselves. "XFS does it too" is really academic.

Why EXT4?

Posted Mar 17, 2009 20:47 UTC (Tue) by gmaxwell (guest, #30048) [Link] (2 responses)

I'm not aware of any major distro defaulting to XFS, but there are a great many people using it (it has been the most reasonable choice for production filesystems >2TB, as ext3 can't grow that big), and the major distros all offer XFS as an install option today.

I don't disagree with the position that the end result is a "linux (distribution) problem" to the end user. But the problem is not the filesystem or the kernel. The problem is the dependence on zillions of tiny dot files (or a registry, *cough*, gconf) which absolutely can't be allowed to become corrupt. Even if EXT4 provides the EXT3 behaviour, there will always be opportunities for these files to become corrupted (software/hardware failure, cosmic ray, etc.), and the fact that such a failure frequently results in an inability to even log in is simply unacceptable. Quite arguably, the file system which demonstrates its corner-case behaviour more frequently is preferable, since it means that developers will be more aware of and more likely to address these situations.

Why EXT4?

Posted Mar 17, 2009 22:27 UTC (Tue) by sbergman27 (guest, #10767) [Link]

Well, eventually Linus is going to speak up on this topic. So I think I'll just wait for him to say what he thinks about reliability regressions, made in the name of performance optimizations, in what is supposed to be Linux's next standard filesystem. I predict that it won't be pretty.

Why EXT4?

Posted Mar 18, 2009 19:58 UTC (Wed) by chad.netzer (subscriber, #4257) [Link]

The ext3 filesystem size limit is 8TB (I just built one this week, in fact), derived from 2**31 4K blocks. The 2TB limit is the old-style DOS partition table limit, derived from 2**32 512-byte sectors. Ext4 has an exabyte filesystem size limit, and GPT partitioning (or LVM/RAID with multiple DOS partitions) gets around the 2TB limit. Just FYI.

Why EXT4?

Posted Mar 27, 2009 7:20 UTC (Fri) by Duncan (guest, #6647) [Link]

Actually, I think (IOW, I'll take a decent reference saying otherwise and
learn, but based on what I've read and experience) you're wrong about
current reiserfs, altho it /used/ to have the problems you mention. AFAIK
reiserfs' default behavior is very much like ext3's default behavior in
this regard. Both of them use data=ordered by default now, and have for
quite some time. (Reiserfs data=ordered was added back at 2.6.6, according
to the best I can google, and ordered became the default either then or
shortly thereafter.) Just as ext3, reiserfs apparently doesn't have
delayed allocation, and the default 5-second-metadata-flush (which with
ordered and due to the security implications Ted Tso mentions, means data
gets flushed every five seconds too, before the metadata) applies to both,
too. Thus ext3 and reiserfs should have the same general level of
stability now, and post-data=ordered, that has certainly been my
observation -- reiserfs has been incredibly stable for me.

IOW, AFAICT, the bad reiserfs rep originates in the pre-2.6.6 era, quite
some time ago now, and it's actually a quite stable and mature fs now.
That has certainly been my experience, both bad back then, with corrupt or
zeroed files at boot after a crash pre-data=ordered, and impressively
stable, now and for several years, post-data=ordered.

Again, if you have references otherwise, I'm willing to be corrected, but
I believe reiserfs is actually reasonably similar to ext3 in this regard,
and has been for some years, now.

Duncan

Better get the basics straight

Posted Mar 17, 2009 21:12 UTC (Tue) by job (guest, #670) [Link] (8 responses)

This discussion is made harder by statements like: "any application developer who wants to be sure that data has made it to persistent storage". That is far from the question here. Delayed allocation is good, for performance reasons. But delaying data more than the corresponding meta data creates problems with atomicity.

I would have appreciated it if the article mentioned sed in-place editing or shell scripts using mktemp && mv as examples of potential problems. After all, no one expects to be forced to call sync after that, do they? Calling most shells examples of broken user space applications is a bit of a stretch.

Better get the basics straight

Posted Mar 17, 2009 22:42 UTC (Tue) by butlerm (subscriber, #13312) [Link] (6 responses)

I agree. I am coming around to the position that all self respecting
filesystems should provide ordered (but not necessarily durable) renames by
default.

Then, as suggested by a commenter elsewhere, we could add something like a
"rename_unordered" function for those relatively unusual cases where a user
is willing to risk severe data loss to get better performance. In addition,
for portability reasons, we should have an option so that an application
can discover whether or not an fsync is required to get ordered renames.
fsync is very expensive when you don't need (synchronous) durability
semantics.


Better get the basics straight

Posted Mar 17, 2009 22:53 UTC (Tue) by quotemstr (subscriber, #45331) [Link] (5 responses)

POSIX has a mechanism to do precisely that:
SYNOPSIS
  long fpathconf(int fildes, int name);
  long pathconf(const char *path, int name);

DESCRIPTION

The fpathconf() and pathconf() functions shall determine the current value of a configurable limit or option (variable) that is associated with a file or directory.

I don't see a problem with adding platform-specific values for name: if a conscientious application asks for _LINUX_SAFE_RENAME on a system that doesn't even know it exists, pathconf will just return -1 and the application will say, "oh, okay. I need to use fsync."

If the POSIX people ever get around to standardizing safe semantics for rename, then Linux's pathconf can just support both the original and the standard name.
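
To illustrate, a conscientious application might probe for the (purely hypothetical) _LINUX_SAFE_RENAME name roughly like this; the constant does not exist on any real system today, so the #ifdef lets the sketch compile anywhere and fall back to the conservative answer.

#include <errno.h>
#include <unistd.h>

/* Return 1 if the application should fsync() before rename() to get
   ordered-rename behaviour on the filesystem holding 'path', 0 if the
   filesystem promises ordered renames anyway. */
static int need_fsync_before_rename(const char *path)
{
#ifdef _LINUX_SAFE_RENAME
    errno = 0;
    long v = pathconf(path, _LINUX_SAFE_RENAME);
    if (v > 0)
        return 0;     /* filesystem advertises ordered renames */
#endif
    (void)path;
    return 1;         /* unknown option or old system: fsync() first */
}

An application written this way could skip the fsync() (and its latency) on filesystems that advertise the stronger guarantee, while staying correct everywhere else.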

Better get the basics straight

Posted Mar 18, 2009 1:00 UTC (Wed) by bojan (subscriber, #14302) [Link] (2 responses)

You'll probably need this on a per-FS basis, because you could mount an FS on Linux that doesn't support this, although the system may generally support it.

Better get the basics straight

Posted Mar 18, 2009 1:04 UTC (Wed) by quotemstr (subscriber, #45331) [Link] (1 responses)

You'll probably need this on a per-FS basis

Of course it's per-filesystem. That's the whole point of using pathconf instead of sysconf.

Better get the basics straight

Posted Mar 18, 2009 1:11 UTC (Wed) by bojan (subscriber, #14302) [Link]

OOPS! My bad, sorry :-(

fpathconf

Posted Mar 18, 2009 4:41 UTC (Wed) by butlerm (subscriber, #13312) [Link]

Thanks for posting that, that is very helpful.

Better get the basics straight

Posted Mar 19, 2009 7:27 UTC (Thu) by job (guest, #670) [Link]

That would be one way of dealing with different filesystem semantics.

I would hope that the operation "write data to file" gets less complex, not more. There is already a little dance of calls to be made (Ted writes about it). If we add logic on the application level to handle that some filesystems expect fsync on the directory, some on only the file and some manage without, it becomes even more so. In tens of thousands of applications.

But this is only vaguely related to the data ordering issue. In an interactive program, or where performance is critical, you may not want to wait until data is committed to disk. Latency kills.

sed and shell

Posted Mar 18, 2009 5:13 UTC (Wed) by xoddam (subscriber, #2322) [Link]

Indeed. I use and maintain scripts using mktemp and mv on a regular basis. Some of them *also* use sed -i (shudder).

This is *the* way to achieve atomicity on Unix. It always was.

We didn't use to have journaling filesystems and we never used to expect anything at all to work after a crash. Crashes happened and they often meant hours of work, possibly reinstalling everything.

To discover that ext3 data=ordered just kept on working after a crash was a real eye-opener for me. I never realised before that such robustness was possible (just like I never realised prior to Linux that I could afford my own Unix box) and I am not about to relinquish it! I'm sure I speak for many users here. It *was* an unexpected nice-to-have when we first got it; but it has become a solid requirement.

Delayed allocation without an implicit write barrier before renaming a newly-written file virtually guarantees data loss after a crash with existing applications. It is therefore a regression from the status quo, albeit to something somewhat better than the status of a few years back.

Kudos and thanks to Ted for implementing this must-have write barrier (and also for making it less likely that unsafe truncate-and-write hits the inevitable race condition, though IMO he's quite right that applications doing that are broken).

I just wish he wouldn't keep insisting that fsync at application level is the right way to achieve what we want.

POSIX and fsync have nothing to do with it (any journaling filesystem provides much more than POSIX), nor do application authors who forgot to think about recovery after an OS crash. A journaling filesystem *can* guarantee atomic renames, so it *should*, for the sake of users' sanity, not for the sake of a standards document.

Standards can be updated to follow best practice. They often do.

Library function please?

Posted Mar 17, 2009 21:35 UTC (Tue) by dmarti (subscriber, #11625) [Link]

Can we have a "write-data-to-file" library function to do this stuff?

Pass it a pathname, a pointer to a block of data, a length, and some flags, and it comes back with success if it opened, wrote, did whatever black magic you have to do, and closed the file.

The library could also include the corresponding read function, so you could have a flag to tell the writer, "don't bother rewriting with identical contents."
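
A rough sketch of what such a helper could look like, built on the write-to-a-temporary-file-then-rename() sequence discussed throughout this thread; the function name, the temporary-name scheme, and the flags are made up for illustration, and the read-back/compare feature is left out.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write 'len' bytes to 'path' so that, after a crash, a reader sees either
   the old contents or the new contents, never a truncated mix.
   Returns 0 on success, -1 on error. */
int write_file_atomically(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp.%ld", path, (long)getpid());

    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0)
        return -1;

    const char *p = data;
    while (len > 0) {                 /* write() may be partial */
        ssize_t n = write(fd, p, len);
        if (n <= 0) { close(fd); unlink(tmp); return -1; }
        p += n;
        len -= (size_t)n;
    }

    if (fsync(fd) < 0) { close(fd); unlink(tmp); return -1; }  /* data on media */
    if (close(fd) < 0) { unlink(tmp); return -1; }

    if (rename(tmp, path) < 0) { unlink(tmp); return -1; }     /* atomic switch */
    return 0;
}

A caller that also needs the rename itself to be durable, and not merely ordered, would additionally open and fsync() the containing directory, as noted elsewhere in this thread.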

Add new mode letter to fopen()

Posted Mar 17, 2009 23:42 UTC (Tue) by ikm (guest, #493) [Link] (1 responses)

Personally, I'd propose an additional mode letter to the standard fopen() libc function. Surely most people here know the "b" flag which is ignored everywhere except in Windows? Why not do the same thing for atomic updates?

E.g. suppose we call it "f". Then

FILE * f = fopen( "precious_config", "wf" );
...
fwrite(...);
...
fclose( f );

That flag would mean that the original file is to be kept untouched until the stream is closed. Instead, a replacement inode is to be allocated on the same media, getting all new changes. At the moment the stream is closed, the new inode is to atomically replace the original one. So you'd get either the old version or the new version.

Programs that would use the new flag would still be portable, since that flag would just be ignored if not known. Yes, they won't be using rename tricks in that case, but I doubt most people care that much anyway. At least, I don't do rename tricks, but I'd happily use the new flag if there were one.

In glibc, the whole thing could be implemented either by using a special flag to the underlying open syscall if one exists, or by just utilizing the rename trick internally otherwise.

Add new mode letter to fopen()

Posted Mar 18, 2009 0:17 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

FILE * f = fopen( "precious_config", "wf" );
All we need is a "t" flag. :-)

In all seriousness, I think that's the wrong level of abstraction for this functionality. A C FILE* returned by fopen should correspond to only one underlying file, and that file shouldn't magically change its name.

Besides, a flag to open won't work because the kernel would have to spool modifications to that file until commit, and that would not only be very complex, but could open up all kind of denial-of-service attacks.

I wouldn't mind, however, a libc function that encapsulated the very common open-write-close-rename sequence. I wouldn't even mind some function that returned a FILE*. I just don't think that function should be called fopen, and I don't think that FILE* should be closed with fclose -- think of something more like popen.

Better than POSIX?

Posted Mar 17, 2009 23:46 UTC (Tue) by clugstj (subscriber, #4020) [Link]

All that is actually needed is for the metadata updates to be kept in sync with the data updates. Then the write-new-file-rename-over-old-file trick will reliably leave you with the new or the old file again. Getting data/metadata out of sync seems to me to be a bad idea even without the possibility of crashing.

It may not be required by POSIX, but it is a reasonable way for a filesystem to behave.

Better than POSIX?

Posted Mar 18, 2009 0:40 UTC (Wed) by ikm (guest, #493) [Link]

Jon, I'm loving the book. Feels like a fantasy setting of some sort, with daemons, wizards, sttys and all the other jrrt. Makes a nice toilet reading.

Nice summary

Posted Mar 18, 2009 3:53 UTC (Wed) by bojan (subscriber, #14302) [Link] (12 responses)

> It is probably a matter of building our filesystems to provide "good enough" robustness as a default, with much stronger guarantees available to developers who are willing to do the extra coding work.

Which is the euphemism for "we'll have workarounds for your bugs" and "people that know will fix their apps".

Nice summary

Posted Mar 18, 2009 5:36 UTC (Wed) by xoddam (subscriber, #2322) [Link] (11 responses)

Please stop insisting that applications that haven't considered recovery from kernel or hardware failure are buggy or broken. That didn't use to be possible at all, and POSIX certainly never guaranteed it.

Application writers have long used rename precisely and only to achieve atomicity; it's the only atomic operation the API provides (simulating atomicity with locks and a series of synchronous flushes is a very different matter). As long as the kernel and fs are up, POSIX guarantees that data writes which precede the rename will be visible to all readers after the rename. Post-crash, nothing used to be guaranteed.

This isn't about application developers coding to an API, it's about users wanting reasonable behaviour from their computers even when they abuse them by kicking out power cords or installing proprietary device drivers.

Ext3 has given us a new, much-better-than-POSIX standard of data recoverability. It's a mere implementation detail that it does this in part by effectively preserving the order of operations that POSIX mandates down to the disk level.

Delayed allocation without a write barrier before renames of newly-written files practically guarantees data loss in this extremely common use-case, so it's a regression.

The regression now has been fixed (thanks Ted). No hacking of applications required.

Nice summary

Posted Mar 18, 2009 6:05 UTC (Wed) by bojan (subscriber, #14302) [Link] (10 responses)

> Please stop insisting that applications that haven't considered recovery from kernel or hardware failure are buggy or broken.

Sorry, I didn't come up with that. I think that would be... Ted ;-)

Know your place

Posted Mar 18, 2009 8:21 UTC (Wed) by xoddam (subscriber, #2322) [Link] (9 responses)

Appeal to authority, eh?

Please, bojan and tytso alike, cease and desist from saying applications are broken, when users have given a clear requirement for a new filesystem: that it not lose data as a matter of course, when the status quo would preserve it.

Know your place

Posted Mar 18, 2009 8:55 UTC (Wed) by bojan (subscriber, #14302) [Link] (1 responses)

> Appeal to authority, eh?

I don't know. I think a person that created the file system may know a thing or two about POSIX. I did actually go and check and he did appear to be right. But, that's obviously not good enough for you (or you may know of some interpretation we cannot grasp - it is possible). I'm OK with that.

> Please, bojan and tytso alike, cease and desist from saying applications are broken, when users have given a clear requirement for a new filesystem: that it not lose data as a matter of course, when the status quo would preserve it.

I have no intention of doing that (unless LWN editors throw me out). Likewise, you can say what you please.

Ted, being a pragmatic person, already did put workarounds in place, so users will be happy.

Know your place

Posted Mar 30, 2009 12:37 UTC (Mon) by forthy (guest, #1525) [Link]

> I think a person that created the file system may know a thing or two about POSIX.

It's not, and I repeat in bold: NOT about POSIX. It is about reasonable behavior. Ordered data has been implemented in ReiserFS and XFS, which both had the reputation of being unstable and prone to eat files before. This is a quality of implementation issue, not a standard issue. Maybe we would need a better standard for file systems, so that quality of implementation is reasonable by default, but that's a different topic. If you insist that your way-below-average quality of implementation is "perfectly valid", you are anal-retentive.

I think Ted T'so should read the GNU Coding Standards. What is written there is mandatory for a core component of the GNU project (which the Linux kernel is, regardless if it's officially part of the GNU project). The point in question here is section 4.1:

The GNU Project regards standards published by other organizations as suggestions, not orders. We consider those standards, but we do not “obey” them. In developing a GNU program, you should implement an outside standard's specifications when that makes the GNU system better overall in an objective sense. When it doesn't, you shouldn't.

What Ted has implemented was a behavior which is standard, but makes his file system worse, because it has inconvenient side-effects on robustness in case of a crash. In shorter words: It sucks. And the GNU Coding Standards clearly say: If the standard sucks, don't follow it.

it's about the crashes!!!

Posted Mar 18, 2009 17:24 UTC (Wed) by pflugstad (subscriber, #224) [Link] (6 responses)

that it not lose data as a matter of course
Okay, I had to post on this one. The thing that EVERYONE seems to be forgetting is that these problems only occur when you have crashes - I.e. bad hardware or buggy drivers. This is not a case of lose data as a matter of course, it's a case of the whole freakin system crashing badly. This is a situation which happens VERY rarely.

Honestly, has anyone here, NOT running binary closed source drivers, experienced a crash in a distro provided kernel in what, the last 12 months or longer? Heck, even a bleeding edge (but not -RC) kernel.

Didn't think so. Now, please refrain from hyperbolic statements like that.

I realize that Ted pointed this out in his initial emails, and while it's still not good for the system-level behavior to change like this, this is a case of an ultra-bleeding-edge kernel, an ALPHA distro release, etc. These are not common users in any sense of the word "common".

it's about the crashes!!!

Posted Mar 18, 2009 20:02 UTC (Wed) by zeekec (subscriber, #2414) [Link]

Just let me say that I agree with Ted's opinion that the applications are in error, not ext4.

> Honestly, has anyone here, NOT running binary closed source drivers, experienced a crash in a distro provided kernel in what, the last 12 months or longer? Heck, even a bleeding edge (but not -RC) kernel.

Actually, yes I have. I run Gentoo unstable at home, and I am currently having issues with the 2.6.28 kernel and Xorg's intel drivers. All open source. So it does happen. (But I'm running Gentoo unstable and expect it!)

it's about the crashes!!!

Posted Mar 18, 2009 22:45 UTC (Wed) by xoddam (subscriber, #2322) [Link] (3 responses)

Your point is entirely correct, as far as it goes, and it validates my position.

The purpose of a journaling filesystem is *only* to ease and speed the task of recovery after an unclean shutdown. I can't emphasise this point strongly enough.

it's about the crashes!!!

Posted Mar 19, 2009 0:35 UTC (Thu) by butlerm (subscriber, #13312) [Link] (2 responses)

The purpose of a journaling filesystem is *only* to ease and speed the
task of recovery after an unclean shutdown.

That is not quite correct. The primary purpose of journaling in typical
journaling filesystems is to preserve metadata integrity. Filesystem
repair tools cannot repair metadata that has never been written.

The secondary purpose of journaling is to loosen ordering restrictions on
meta data updates. Assuming you want your filesystem to be there after an
unclean shutdown, that is a major advantage.

Finally, journaling filesystems are not metaphysically prohibited from
using their journals to do other useful things, such as store meta-data
undo information, for example.

it's about the crashes!!!

Posted Mar 19, 2009 5:56 UTC (Thu) by xoddam (subscriber, #2322) [Link] (1 responses)

Metaphysics aside, surely these primary and secondary purposes you describe themselves have the ultimate goal of saving end users the trouble of cleaning up a mess after an unclean shutdown?

it's about the crashes!!!

Posted Mar 20, 2009 21:17 UTC (Fri) by butlerm (subscriber, #13312) [Link]

Yes. The primary goal of journaling is to make the filesystem more robust
so that manual intervention after a system crash is minimized.

it's about the crashes!!!

Posted Mar 19, 2009 23:25 UTC (Thu) by jschrod (subscriber, #1646) [Link]

If you take your own argument seriously, you don't need any journaled file system -- after all, the only reason to use journaling is to get better behaviour after a crash.

That said, yes, I had many kernel crashes at the start of this year, using SUSE and no proprietary modules. It took a long time to identify the piece of hardware that caused it. (It was the video card.) I have another system where usage of ionice causes hard lockups of the whole system, reproducibly. E.g., running updatedb with ionice. I have never identified the culprit here and finally put it in the closet; my time was worth more than the price of a new system.

Benchmarks?

Posted Mar 18, 2009 15:03 UTC (Wed) by sbergman27 (guest, #10767) [Link] (1 responses)

So, does anyone have links to some benchmarks that show the performance benefits that this whole delayed allocation issue has been about? Something that shows the difference between the default and nodelalloc?

Benchmarks?

Posted Mar 19, 2009 13:32 UTC (Thu) by sbergman27 (guest, #10767) [Link]

All this discussion about performance/data integrity tradeoffs and no one even has a benchmark to point to? How very willing we are to give up our data integrity these days!

Fsck to the rescue?

Posted Mar 19, 2009 12:50 UTC (Thu) by ssam (guest, #46587) [Link]

Could fsck be the solution?

If the system crashes/loses power between the rename hitting the disk and the new data actually being written, then there is a very good chance that the old data is still on the disk.

There are programs that can find deleted files and undelete them. If ext4 left some clue in the journal as to which blocks were freed during a rename, then maybe fsck, or the journal-replaying code, could go and find those blocks containing the old file and relink them.

ext4 like ext3 - the wrong way

Posted Mar 19, 2009 18:06 UTC (Thu) by zmi (guest, #4829) [Link] (2 responses)

Wonderful. There is a new filesystem (ext4) which behaves like most other
modern filesystems such as XFS, reiserfs, and reiser4, in that on a crash it
trashes data. But just because this one's name is similar to ext3, people
want it to behave the same.

The problem is that if it continues like ext3, applications will never get
fixed. But harddisks already use 32MB cache, people use on-board RAID
controllers. Imagine you have a RAID over 4 hard disks with 32MB cache each
- on a power outage, you lose up to 128MB of data just from the disks;
there's nothing any filesystem can do about it (short of turning off the disk
write caches). There's the *absolutely false* assumption that your data is
safe once you fclose(). But without a fsync, it's not.

In the end it would have been better to fix applications. Otherwise other
filesystems, and newer ones with even more advanced features, will always
be told to "eat your data", while really it's not their fault.

What you currently have is this: use ext3/ext4, or otherwise you can lose
data. Not because the other filesystems are crap, but because application
developers don't (need to) care. They use ext3/4 and people will continue
to say that it's much better. The typical half-truth we often see in IT,
and that often leads to bad systems. Think of computers controlling
equipment like trains, cars, or atomic reactors. "Oh, that one melted
because the filesystem ate its data" is not a good answer after all. A
good application would have checked it. Seems like more and more people
believe that if it works 95% of the time, that's enough. Until that
computer-controlled system fails because of a computer problem. Then it
should have been working 100% of the time, right?

(Yes, I really feel nervous that we start to trust computers where you
really can't)

ext4 like ext3 - the wrong way

Posted Mar 19, 2009 23:54 UTC (Thu) by xoddam (subscriber, #2322) [Link]

Duh. We know that data in volatile memory disappears when hardware fails or operating systems crash. That's fine, we accept the risk of losing a few minutes' browsing history when we abuse our computers by daisy-chaining power strips or letting batteries run low on a long commute or running experimental wifi drivers.

We use a journaling filesystem so that *after* our hardware or our kernel fails -- which is more or less likely depending on the way we use them, but everyone is at risk -- whatever made it to nonvolatile storage is in a sane, recoverable state. NOT MISSING. That way we *only* lose a few seconds' or minutes' work.

[ Or browsing history, some people call that work :-) ]

ext4 like ext3 - the wrong way

Posted Mar 21, 2009 17:03 UTC (Sat) by nix (subscriber, #2304) [Link]

I just bought a machine with an onboard hardware RAID controller. The
additional price for the battery backup was less than 2% of the machine's
cost. If you can't afford that, you probably shouldn't be using RAID at
all (certainly not hardware RAID: md will work anywhere but makes no bones
about possible data loss if a crash happens during writing).

Better than POSIX?

Posted Mar 20, 2009 22:53 UTC (Fri) by spitzak (guest, #4593) [Link]

I absolutely agree that EXT4 should make atomic rename() work, despite the fact that POSIX does not require it.

I would also like to see "atomic create" with a new flag to open(). This would make a hidden (i.e. the inode is not in any directory) file that you can write to. When you close() it, it does the equivalent of the atomic rename() to the filename given to open(). If the program crashes then the file disappears, as it is not linked to by any directory. Calling fsync() on the file would do the rename that close does at that point, and from then on it would act like a normal file. There may also be a way to abort the file, closing it such that it disappears with no effect.

This will give the atomic operation that everybody is trying to get, without the need to generate a temp filename, without leaving temp files on the disk in case of a crash, and possibly with far better performance as the filesystem could defer the effects of close.

Furthermore, I think there would not even need to be a new flag. Instead, the flags O_CREAT|O_WRONLY|O_TRUNC (the ones creat() uses) would trigger this behavior. This is almost certainly the behavior anybody calling creat() is expecting anyway.


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds