Better than POSIX?
As has been well covered (and discussed) elsewhere, the delayed allocation feature found in the ext4 filesystem - and most other contemporary filesystems as well - has some real benefits for system performance. In many cases, delayed allocation can avoid the allocation of space on the physical medium (along with the associated I/O) entirely. For longer-lived data, delayed allocation allows the filesystem to optimize the placement of the data blocks, making subsequent accesses faster. But delayed allocation can, should the system crash, lead to the loss of the data for which space has not yet been allocated. Any filesystem may lose data if the system is unexpectedly yanked out from underneath it, but the changes in ext4 can lead to data loss in situations that, with ext3, appeared to be robust. This change looks much like a regression to many users.
Many electrons have been expended to justify the new, more uncertain ext4 situation. The POSIX specification says that no persistence is guaranteed for data which has not been explicitly sent to the media with the fsync() call. Applications which lose data on ext4 are not using the filesystem correctly and need to be fixed. The real problem is users running proprietary kernel modules which cause their systems to crash in the first place. And so on. All of these statements are true, at least to an extent.
But one might also argue that they are irrelevant.
Your editor recently became aware that Simson Garfinkel's Unix-Haters Handbook [PDF] is available online. To say that this book is an aggravating read is an understatement; much of it seems like childish poking at Unix by somebody who wishes that VMS (or some other proprietary system) had taken over the world. It's full of text like:
But behind the silly rhetoric are some real points that anybody concerned with the value of Unix-like systems should hear. Among them are the "worse is better" notion expressed by Richard Gabriel in 1991 - the year the Linux kernel was born. This charge states that Unix developers will choose implementation simplicity over correctness at the lower levels, even if it leads to application complexity (and lack of robustness) at the higher levels. The ability of a write() system call to succeed partially is given as an example; it forces every write() call to be coded within a loop which retries the operation until the kernel gets around to finishing the job. Developers who cut corners like that are left with an application which works most of the time, but which can fail silently in unexpected circumstances. It is far better, these people say, to solve the problem once at the kernel level so that applications can be simpler and more robust.
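As a concrete illustration of that application-side burden (a minimal sketch, not taken from the book): a careful caller must wrap write() in a loop that handles both partial writes and interruption.

    #include <errno.h>
    #include <unistd.h>

    /* Write the whole buffer, retrying on partial writes and EINTR.
       Returns 0 on success, -1 (with errno set) on a real error. */
    static int full_write(int fd, const char *buf, size_t count)
    {
        while (count > 0) {
            ssize_t n = write(fd, buf, count);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted; just retry */
                return -1;      /* a genuine error */
            }
            buf += n;           /* partial write: advance and loop */
            count -= n;
        }
        return 0;
    }

Every application that writes data needs something like this; the "worse is better" argument is that the kernel could simply have finished the job itself.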
The ext4 situation can be seen as similar: any application developer who wants to be sure that data has made it to persistent storage must take extra care to inform the kernel that, yes, that data really does matter. Developers who skip that step will have applications which work - almost all the time. One could well argue that, again, the kernel should take the responsibility of ensuring correctness, freeing application developers from the need to worry about it.
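The "extra care" in question is the write-to-temporary-then-rename dance with explicit synchronization. A hedged sketch of the pattern (file names are illustrative; error-path cleanup is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Replace "config" so that, even across a crash, a reader sees
       either the complete old contents or the complete new ones. */
    int replace_config(const char *data, size_t len)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len ||  /* see full_write() above */
            fsync(fd) < 0 ||        /* push the data to the medium first */
            close(fd) < 0 ||
            rename("config.tmp", "config") < 0)  /* then publish it atomically */
            return -1;

        /* Strictly speaking, the rename itself is durable only once the
           containing directory has been synced as well. */
        int dirfd = open(".", O_RDONLY);
        if (dirfd < 0)
            return -1;
        int ret = fsync(dirfd);
        close(dirfd);
        return ret;
    }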
The ext3 filesystem made no such guarantees, but, due to the way its features interact, ext3 provides something close to a persistence guarantee in most situations. An ext3 filesystem running under a default configuration will normally lose no more than five seconds worth of work in a crash, and, importantly, it is not prone to the creation of zero-length files in common scenarios. The ext4 filesystem withdrew that implicit guarantee; unpleasant surprises for users followed.
Now the ext4 developers are faced with a choice. They could stand by their changes, claiming that the loss of robustness is justified by increased performance and POSIX compliance. They could say that buggy applications need to be fixed, even if it turns out that very large numbers of applications need fixing. Or, instead, they could conclude that Linux should provide a higher level of reliability, regardless of how diligent any specific application developers might have been and regardless of what the standards say.
It should be said that the first choice is not entirely unreasonable. POSIX forms a sort of contract between user space and the kernel. When the kernel fails to provide POSIX-specified behavior, application developers are the first to complain. So perhaps they should not object when the kernel insists that they, too, live up to their end of the bargain. One could argue that applications which have been written according to the rules should not take a performance hit to make life easier for the rest. Besides, this is free software; it would not take that long to fix up the worst offenders.
But fixing this kind of problem is a classic case of whack-a-mole: application developers will continually reintroduce similar bugs. The kernel developers have been very clear that they do not feel bound by POSIX when the standard is seen to make no sense. So POSIX certainly does not compel them to provide a lower level of filesystem data robustness than application developers would like to have. There is a case to be made that this is a situation where the Linux kernel, in the interest of greater robustness throughout the system, should go beyond POSIX.
The good news, of course, is that this has already happened. There is a set of patches queued for 2.6.30 which will provide ext3-like behavior in many of the situations that have created trouble for early ext4 users. Beyond that, the ext4 developers are considering an "allocate-on-commit" mount option which would force the completion of delayed allocations when the associated metadata is committed to disk, thus restoring ext3 semantics almost completely. Chances are good that distributors would enable such an option by default. There would be a performance penalty, but ext4 should still perform better than ext3, and one should not underestimate the performance costs associated with lost data.
In summary: the ext4 developers - like Linux developers in general - do care about their users. They may complain a bit about sloppy application developers, standards compliance, and proprietary kernel modules, but they'll do the right thing in the end.
One should also remember that ext4 is still a very young filesystem; it's not surprising that a few rough edges remain in places. It is unlikely that we have seen the last of them.
As a related issue, it has been suggested that the real problem is with the POSIX API, which does not make the expression of atomicity and durability requirements easy or natural. It is time, some say, to create an extended (or totally new) API which handles these issues better. That may well be true, but this is easier said than done. There are, of course, the difficulties in designing a new API to last for the next few decades; one assumes that we are up to that challenge. But will anybody use it? Consider Linus Torvalds's response to another suggestion for an API extension:
Application developers will be naturally apprehensive about using Linux-only interfaces. It is not clear that designing a new API which will gain acceptance beyond Linux is feasible at this time.
Your editor also points out, hesitantly, that Hans Reiser had designed - and implemented - all kinds of features intended to allow applications to use small files in a robust manner for the reiser4 filesystem. Interest in accepting those features was quite low even before Hans left the scene. There were a lot of reasons for this, including nervousness about a single-filesystem implementation and about dealing with Hans, but the addition of non-POSIX extensions was problematic in its own right (see this article for coverage of this discussion in 2004).
The real answer is probably not new APIs. It is probably a matter of building our filesystems to provide "good enough" robustness as a default, with much stronger guarantees available to developers who are willing to do the extra coding work. Such changes may come hard to filesystem hackers who have worked to create the fastest filesystem possible. But they will happen anyway; Linux is, in the end, written by and for its users.
Posted Mar 17, 2009 15:42 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (5 responses)
But I firmly believe that today the kernel can provide a great deal of additional robustness at practically no performance cost. An "ordered rename" is a no-brainer. Not only do existing applications suddenly do the right thing, but an "ordered rename" also allows application developers to inform the kernel of ordering constraints that are simply impossible to express when applications are required to force everything to disk with fsync() instead.
Posted Mar 17, 2009 17:28 UTC (Tue)
by ms (subscriber, #41272)
[Link] (4 responses)
Posted Mar 17, 2009 17:56 UTC (Tue)
by butlerm (subscriber, #13312)
[Link] (3 responses)
However, "transactional filesystem" usually refers to a much more
That might well be standard a couple of decades down the road - Google
Posted Mar 18, 2009 15:19 UTC (Wed)
by alvherre (subscriber, #18730)
[Link] (2 responses)
Not so. PostgreSQL, for example, implements transactional semantics without needing metadata undo.
Posted Mar 20, 2009 20:54 UTC (Fri)
by butlerm (subscriber, #13312)
[Link] (1 responses)
So of course you can implement meta data undo on any filesystem you please,
Posted Mar 21, 2009 18:00 UTC (Sat)
by nix (subscriber, #2304)
[Link]
(As for vacuuming, do it incrementally and the data volume pretty much
Posted Mar 17, 2009 15:56 UTC (Tue)
by ssam (guest, #46587)
[Link] (9 responses)
i think a lot of people are willing to risk recent changes to a file to get performance gains and power saving. but no one wants to risk completely losing a file just because you wrote to it recently.
in the sequence (1) open(), (2) write(), (3) close(), (4) rename()
it seems that 2 gets delayed, because that gives performance/powerusage gains, but 4 happens quickly.
Posted Mar 17, 2009 19:51 UTC (Tue)
by dlang (guest, #313)
[Link] (7 responses)
the probability of having problems varies from filesystem to filesystem, from mount option to mount option within a single filesystem, and is even dependent on kernel tuning parameters that can be set in /etc/sysctl.conf
if you want that guarantee you need to force a write to disk (and make sure that your disk drive doesn't cache or reorder the writes)
it would be nice if there was a write barrier API that let you insert an ordering requirement from userspace without forcing a write to disk, but at this point in time I don't believe that such an API exists.
people were relatively happy with the ~5 second vulnerability window in ext3, but are getting bit by the ~60 second vulnerability window that ext4 defaulted to.
Ok, that's a reasonable problem, and adjusting ext4 to use a smaller window (possibly even by default) is a reasonable thing. the idea that Ted is working on to have ext4 honor the 'how much time am I willing to lose' parameter that was introduced for laptop mode is a good idea in that it's clear what you are adjusting, and it lets you tie all similar types of thing to the one parameter (as opposed to configuring different aspects in different places)
but shortening this time will cost performance. I doubt that the ext4 developers selected the perfect time (I doubt that there is a single perfect time to select), and having it as an easy tunable will let people plot out what the performance/reliability curve looks like. it may be that going from 5 seconds to 10 seconds gains 80% of the performance that going from 5 seconds to 60 seconds gives you (for common desktop workloads ;-) and distros may choose to move the default there.
the problem in this situation is that people are mistakenly believing that this vulnerability didn't exist before. it did, it just wasn't as large a window, so fewer people were hitting it.
Posted Mar 17, 2009 21:16 UTC (Tue)
by foom (subscriber, #14868)
[Link] (1 responses)
If you crashed within a 60sec window in (pre-fix) ext4, you were almost guaranteed to end up with zero-length files.
Of course there's no absolute guarantees...if your kernel is crashing particularly hard, the filesystem could go berserk and write 99 red balloons all over the disk.
Posted Mar 19, 2009 10:44 UTC (Thu)
by Los__D (guest, #15263)
[Link]
> berserk and write 99 red balloons all over the disk
What a fantastic idea for an easter egg... :)
"What, bug? No no no, it was just a joke. Your data is... Well, buried in balloons."
Posted Mar 17, 2009 21:44 UTC (Tue)
by man_ls (guest, #15091)
[Link] (3 responses)
I remember my horror after finding out that XFS had lost my data, and I read about XFS devs hiding behind the standard "journalling filesystems only make guarantees about the metadata". "The metadata? fsck the metadata! What I want is my wretched data back!" Now it is all coming back in waves.
Posted Mar 18, 2009 2:40 UTC (Wed)
by dlang (guest, #313)
[Link] (2 responses)
Posted Mar 18, 2009 4:12 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link]
Posted Mar 18, 2009 7:28 UTC (Wed)
by man_ls (guest, #15091)
[Link]
Posted Mar 19, 2009 18:33 UTC (Thu)
by anton (subscriber, #25547)
[Link]
Concerning an API, I don't see a need for an additional API, it's just a
matter of whether the file system gives consistency guarantees in case
of a crash when using the existing file API, and if so, which ones.
Posted Mar 17, 2009 21:30 UTC (Tue)
by khim (subscriber, #9252)
[Link]
Ted gave an interesting answer to this request which I really like: why would your program need such an API? Your system needs such an API - and it's not that hard to implement... Yes, it'll probably violate POSIX's letter, but it'll give much better results in practice.
Posted Mar 17, 2009 16:04 UTC (Tue)
by xav (guest, #18536)
[Link] (15 responses)
Posted Mar 17, 2009 17:01 UTC (Tue)
by drag (guest, #31333)
[Link] (4 responses)
In the actual physical file system image committed to disk; Don't name partially written files.
------
That would pretty much get what everybody wants. I suppose it's much more complicated than that, of course. I'll take "good enough" any day.
Posted Mar 17, 2009 17:52 UTC (Tue)
by vonbrand (subscriber, #4458)
[Link] (3 responses)
"Don't name partially written files" would mean that nothing has a name until the file is closed, and the file has to disappear whenever it is opened for writing... I'd take the current
Posted Mar 17, 2009 23:50 UTC (Tue)
by drag (guest, #31333)
[Link] (2 responses)
It certainly will solve the write() then rename() issue. :)
And I recall hearing about file system designers deliberately zeroing out files for various reasons.
Posted Mar 18, 2009 23:17 UTC (Wed)
by xoddam (subscriber, #2322)
[Link] (1 responses)
That doesn't really sound like a good way to enhance recoverability. For applications that keep large files (e.g. caches) open for a long time and update them piecemeal, it sounds like sheer madness.
Applications that truncate existing data before rewriting it are asking for trouble, though I appreciate a filesystem that doesn't exacerbate the race condition by promptly truncating the inode but delaying the flush of the new data blocks for several seconds. Ted has already fixed that particular issue heuristically by delaying truncation until it is time to flush the data. Flushing *early* on close() couldn't hurt integrity but could hurt performance quite a bit.
Posted Mar 19, 2009 23:34 UTC (Thu)
by xoddam (subscriber, #2322)
[Link]
Posted Mar 17, 2009 21:40 UTC (Tue)
by man_ls (guest, #15091)
[Link] (2 responses)
Do you really think it is better to force everyone to comply with a new standard than trying to convince ext4 developers to do the (obvious) right thing?
Posted Mar 18, 2009 22:57 UTC (Wed)
by xoddam (subscriber, #2322)
[Link] (1 responses)
Posted Mar 18, 2009 23:25 UTC (Wed)
by man_ls (guest, #15091)
[Link]
Posted Mar 17, 2009 22:30 UTC (Tue)
by dhess (guest, #7827)
[Link] (6 responses)
Yeah, I've come to a similar conclusion. Perhaps the rename() semantics alone is sufficient. It's simple enough conceptually that it might be relatively easy to get other operating systems to adopt the new semantics, too, at least for the filesystems that can support it. And it sounds like there's already a quite common belief amongst application developers that all filesystems behave this way, anyway.
In a previous life, I worked on memory ordering models in CPUs and chipsets. During this recent ext4 hubbub, it dawned on me that the issues with ordering and atomicity in high-performance filesystem design may be isomorphic to memory ordering. Even if that's not strictly true, there's probably a lot to be learned by filesystem designers and API writers from modern CPU memory ordering models, in any case, because memory ordering is a well-explored space by this point in the history of computer engineering; and I don't just mean the technical semantics, either, but the whole social aspect, too, i.e., how to balance good performance with software complexity, how much of that complexity to expose to application programmers, who often have neither the time nor the background to understand all of the tradeoffs, let alone dot all the "i"s and cross all the "t"s, etc. Anyway, changing rename's semantics as you suggest would be the equivalent of a "release store" in memory ordering terms, and seems to be exactly the right kind of tradeoff in this situation.
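For readers who know memory models better than filesystems, the analogy can be made concrete with C11 atomics (an illustrative sketch of the parallel, not anything from the ext4 patches): the data writes play the role of plain stores, and the rename plays the role of a store-release that must not be reordered ahead of them.

    #include <stdatomic.h>

    int payload;              /* plays the role of the file's data blocks */
    atomic_int published;     /* plays the role of the directory entry    */

    void writer(void)
    {
        payload = 42;         /* ordinary stores: the write() calls */
        /* Release store: the stores above may not be reordered past it,
           just as data must reach the disk no later than the rename. */
        atomic_store_explicit(&published, 1, memory_order_release);
    }

    int reader(void)
    {
        /* Acquire load: if we observe the "rename", we must also
           observe the data it was meant to publish. */
        if (atomic_load_explicit(&published, memory_order_acquire))
            return payload;   /* guaranteed to be 42, never garbage */
        return -1;            /* not yet published: old state, not corruption */
    }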
Posted Mar 17, 2009 23:04 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (5 responses)
One thing that struck me was a comment on a Slashdot story about a "breakthrough" in data center energy optimization. The comment showed that the problem of deciding when to boot up additional servers to meet demand was isomorphic to the problem of steam boiler management --- right down to the start-up and constant energy costs --- and that the problem had already been thoroughly addressed in literature from the turn of the last century.
Posted Mar 17, 2009 23:52 UTC (Tue)
by dhess (guest, #7827)
[Link]
Alan Kay never misses an opportunity to point out that our field has a terrible track record when it comes to learning from, or even being aware of, our history, let alone that of other related fields. There's a lot of unfortunate "rediscovering" of knowledge in computer science and engineering. (I'm as guilty of it as anyone.) I think it's a good habit to consider Alan's admonishment when we're faced with challenges or seeking solutions to problems, so I guess I'll follow his lead by mentioning it here :)
Posted Mar 17, 2009 23:57 UTC (Tue)
by rahvin (guest, #16953)
[Link] (3 responses)
Imagine for how many years the wheel was reinvented over and over again at hundreds of companies as people relearned how to code something the proper way for a certain scenario. It's scary to think how fast we could have developed software if it had been FOSS all along instead of corporations each trying to slit each other's throats. In the case of computer information systems the sharing of code accelerates the total technology much faster than the private corporate system of the past ever did. Of course this isn't always true. Niche software will probably always need the economic support closed systems provide even if it divides the efforts among a few companies who reinvent each other's innovations.
Posted Mar 26, 2009 10:56 UTC (Thu)
by massimiliano (subscriber, #3048)
[Link] (2 responses)
That is one explanation.
Another one is that we don't have an "engineering culture" in software development.
I mean, software developers are not necessarily engineers, so they rarely know about issues like steam boiler management.
Now, I'm not claiming engineers are necessarily better than others in this sense. I know many guys who quit college and they are better than me in understanding aspects of different technologies.
What I'm claiming is that very often people reinvent the wheel not because the previous wheel was a secret, but because they do not have this "engineering culture" of knowing different kinds of wheels in advance, and being able to understand correctly in which ways they are similar and when they are relevant.
And without going to different disciplines, how many software developers have a good "culture" about the basic concepts needed in their job, like recurrent algorithms and patterns?
My 2c,
Posted Mar 27, 2009 1:01 UTC (Fri)
by nix (subscriber, #2304)
[Link] (1 responses)
I'd estimate that I spend 20% of my time at work ripping out people's
(And do the reimplementations stop? No! I ditched a chained hash table
I mean it's not as if computers are bad at searching for things, but half
Posted Mar 27, 2009 5:49 UTC (Fri)
by quotemstr (subscriber, #45331)
[Link]
Eventually, these programmers grow up, but in the meantime, they've written a significant amount of horrible code. I've seen this pattern again and again. As the parent mentioned, software developers have no "engineering culture." I imagine that in more established engineering disciplines, students have the above attitude beaten out of them before they graduate.
Posted Mar 17, 2009 16:25 UTC (Tue)
by jeleinweber (subscriber, #8326)
[Link] (13 responses)
Posted Mar 17, 2009 17:03 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (7 responses)
Posted Mar 18, 2009 0:22 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (6 responses)
Posted Mar 18, 2009 0:36 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (5 responses)
Besides --- UFS users don't complain about zero-length orphans, so the filesystem must be doing something right. :-)
Posted Mar 18, 2009 0:49 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
I'm just poking fun, sorry ;-)
Posted Mar 19, 2009 17:19 UTC (Thu)
by anton (subscriber, #25547)
[Link] (3 responses)
You can use a journal for an in-place update system to work around
this: write the rename operation atomically into the journal, then
perform the individual updates, then clear that journal entry. In
crash recovery replay the outstanding journal entries.
Or you can use a copy-on-write file system: there you write the
changes to some free space, and when you have reached a consistent
state, commit the changes by rewriting the root of the file system.
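A user-space analog of the journal-based recipe above might look like this (a minimal sketch: it assumes a single appended record can be made durable with fsync(), and the journal file name is illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Log the intent, make it durable, then act on it. Crash recovery
       would re-execute any record still present in the journal; replaying
       an already-completed rename just fails with ENOENT, harmlessly. */
    int journaled_rename(const char *src, const char *dst)
    {
        char rec[512];
        int n = snprintf(rec, sizeof rec, "rename %s %s\n", src, dst);

        int jfd = open("journal", O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (jfd < 0)
            return -1;
        if (write(jfd, rec, n) != n || fsync(jfd) < 0) { /* 1: durable intent */
            close(jfd);
            return -1;
        }
        close(jfd);

        if (rename(src, dst) < 0)    /* 2: the update itself */
            return -1;

        return truncate("journal", 0);  /* 3: retire the journal entry */
    }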
Posted Mar 23, 2009 6:01 UTC (Mon)
by xoddam (subscriber, #2322)
[Link] (2 responses)
Soft updates were first implemented on top of 4.4BSD (but weren't part of the distribution), first described in a paper in 1994 and first widely distributed as part of FreeBSD 4.0 in 2000.
Your complaint (linked above, not dated) seems to be about UFS on the DEC distribution of OSF/1 for its Alpha machines. OSF/1 had the option of traditional BSD-style UFS (synchronous metadata with async data) or the journaling filesystem AdvFS. (You initially had to pay extra for AdvFS. It was GPLd last year and is on SourceForge).
So if you used UFS on Tru64 or whatever it was called at the time, you didn't have soft updates like you do on today's *BSD. You had synchronous metadata updates with asynchronous data writes. OSF/1 never adopted soft updates because it already had a good journaling solution years beforehand.
Posted Mar 23, 2009 9:45 UTC (Mon)
by anton (subscriber, #25547)
[Link] (1 responses)
Posted Mar 23, 2009 11:31 UTC (Mon)
by xoddam (subscriber, #2322)
[Link]
The (sketchy) lecture notes linked in quotemstr's first post end with two points of limitations of soft updates:
>Limitations of soft updates:
As has been explained elsewhere, rename on early Unix implementations was a two- or three-step process: unlink any existing file at the target path, hard-link the file at the source path to the target path, and finally unlink the source path. The target path would briefly be absent (as observable by concurrent processes, nothing to do with crash recovery) and the source file was briefly visible at two locations. Atomic replace was introduced to avoid the two obvious race conditions: a reader might find nothing at the target location, or another updater might open the source path with O_TRUNC and thereby inadvertently truncate the target file.
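In code, that historical sequence amounts to something like the following (a reconstruction for illustration, not any particular Unix's source):

    #include <unistd.h>

    /* Pre-rename(2) emulation: three separate, individually visible steps. */
    int old_rename(const char *src, const char *dst)
    {
        unlink(dst);              /* 1: the target briefly vanishes       */
        if (link(src, dst) < 0)   /* 2: the file briefly has two names    */
            return -1;
        return unlink(src);       /* 3: the source name finally goes away */
    }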
It's possible (pure speculation on my part here, I haven't actually used soft updates or read the code) that the atomicity not provided by UFS-with-soft-updates amounts to a chance that the source file may end up linked at both the source and target paths. This is likely if the rename is moving the file from one directory to another, as the two directories are likely to be stored at different locations on disk.
Updating the target directory first, with atomic replacement of an existing file at that path but delayed deletion of the source path, satisfies the important condition that the target file never disappear, while violating the letter of the atomic rename contract.
The race condition risking truncation of the target file when the source path is opened with O_TRUNC may still exist, but only after a crash, as the atomic semantics hold on a running system.
Maybe there are no complaints from BSD users because they're so l33t they never crash: they don't run out of battery or kick out power cords or use unstable video and wireless drivers. Or maybe, just maybe, soft updates are as cool as they say ;-)
Posted Mar 18, 2009 2:28 UTC (Wed)
by vaurora (guest, #38407)
[Link] (3 responses)
Posted Mar 18, 2009 2:39 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Posted Mar 18, 2009 5:15 UTC (Wed)
by xoddam (subscriber, #2322)
[Link]
Posted Mar 19, 2009 9:58 UTC (Thu)
by joern (guest, #22392)
[Link]
Posted Mar 26, 2009 13:11 UTC (Thu)
by doran (guest, #57557)
[Link]
Soft updates does not understand hardware write-back caches; NetBSD's journaling does. Given that nearly all commodity systems have write-back caches at some level in the I/O path nowadays, and need them for good performance, this is important.
Soft updates was quite complicated code with integration issues, making it very difficult to fix some of the bugs with it. This was mainly down to its structure, which was suited to commercialization (soft updates was initially a commercial offering). It was not necessarily tailored to the BSD systems.
Additionally, journaling gives some nice benefits like no fsck, and fully atomic rename. No fsck is important to NetBSD for the embedded role, and elsewhere is a much bigger story than it was when soft updates came about due to the explosion in storage volume size.
Posted Mar 17, 2009 17:22 UTC (Tue)
by SteveKnodle (guest, #7853)
[Link] (6 responses)
Posted Mar 17, 2009 19:01 UTC (Tue)
by christian.convey (guest, #39159)
[Link] (5 responses)
It seems to me that a POSIX-compliant operating system must provide at least the POSIX API.
But a POSIX-compliant program should require no more than the POSIX API.
Posted Mar 18, 2009 2:48 UTC (Wed)
by zlynx (guest, #2285)
[Link] (4 responses)
So no matter what ext4 does in this case, the programs with the problems are still not POSIX-compliant.
They should either fsync, or adopt the other common pattern of renaming the target file to a backup name before replacing it.
Posted Mar 18, 2009 4:09 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Mar 18, 2009 4:11 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Really, all a program that depends on nothing more than POSIX can do is take advantage of the atomic rename support on a running system and hope for the best on a crash. POSIX guarantees nothing else.
Clearly, this is not an acceptable state of affairs, so we make further assumptions about the exact behavior of fsync and friends. But these assumptions go beyond POSIX.
Now, an ordered rename makes a whole lot more sense than a rename that only preserves contents after an fsync. But don't pretend that either alternative is mandated by POSIX. This whole damn problem has nothing to do with POSIX, so stop bringing it up. (And that means you too, bojan.)
Posted Mar 18, 2009 5:26 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (1 responses)
Sorry. I'll bring up whatever I see fit, whether you like that or not. But, by all means, don't listen to me and don't reply to my comments. Ignore me - that's OK. But at least do try to understand what Ted's saying.
> If a program is truly POSIX-compliant, it can't make any assumptions about what happens after a crash.
No, it cannot make assumptions. It can make preparations as best it can (which are defined in the standard) to have the data on disk. These preparations are called fsync(). Or, it can be smart and create little tiny backup files with fsync() beforehand and then be fast and keep renaming in the hope that the system doesn't crash all the time.
Either intentionally or otherwise, you keep misinterpreting what POSIX does or does not define. POSIX defines that fsync() is your best bet on having the data on disk. It doesn't define anything about rename() having ordered characteristics. It also doesn't define anything about the situation after the crash.
Out of this, you are giving people advice that the best thing to do is to go with undefined behaviour of rename() if you want your data on disk after the crash. The mind boggles...
Posted Mar 18, 2009 14:22 UTC (Wed)
by nlucas (guest, #33793)
[Link]
An anecdotal case some years ago was Windows 98 corrupting the disk on "modern" PCs, because the hardware was so much faster than the disk flushing. When Windows was shut down, it would kill the power before the disk finished flushing its buffered writes, corrupting the file system.
The only solution was to add some wait time before killing the power.
Posted Mar 17, 2009 17:33 UTC (Tue)
by forbdonut (subscriber, #21577)
[Link] (3 responses)
In particular, what does data=ordered mode actually mean in ext4 with delayed allocation?
This mode is more of a data=pseudo-writeback, i.e. it's some new writeback mode.
It feels like the suggested alloc-on-commit mode should be called data=ordered instead.
Am I missing something?? Thanks!
Posted Mar 18, 2009 0:26 UTC (Wed)
by brugolsky (guest, #28)
[Link] (2 responses)
But frankly, since the dawn of ext3 in the 2.2.x kernel series, I've always considered the fact that it didn't leave garbage files around on my laptop (with its dodgy IDE chipset) to be its major benefit. And I never really seriously considered using the other journalling filesystems that only preserve meta-data integrity. So I am unhappy with the choice of names for the options. I'd rather that the default be "data=delayed", and "data=ordered" refer to the allocate-on-commit behavior.
Posted Mar 18, 2009 22:18 UTC (Wed)
by forbdonut (subscriber, #21577)
[Link] (1 responses)
data=ordered isn't advertised as a security feature. It claims to make data ordered with respect to the associated meta-data.
Posted Mar 27, 2009 6:00 UTC (Fri)
by Duncan (guest, #6647)
[Link]
BTW, it's worth noting that to long time observers, Linus and thus the
Anyway, based on Linus own posts to the post-2.6.29-announcement thread,
Duncan
Posted Mar 17, 2009 17:34 UTC (Tue)
by christian.convey (guest, #39159)
[Link] (7 responses)
Unless we can get programmers to learn APIs differently, I don't see how we can avoid providing them with a more helpful API if we want them to write robust programs.
Maybe what we really need is some user-space library that provides the more robust guarantees, but that works across many different file systems and operating systems. That could get us past the concerns about writing ReiserFS-only code or Linux-only code.
Posted Mar 17, 2009 18:19 UTC (Tue)
by droundy (subscriber, #4559)
[Link] (3 responses)
I've been wondering whether we might be able to quantitatively describe a "robust" implementation of the POSIX API in a way that is relatively easy to convey to its users. For instance, suppose my application behaves correctly (as understood by myself and my users) in any scenario in which some subset of processes is abruptly killed and the connection to the disk is severed at an arbitrary time - assuming that the disk is in a "correct" state at the moment it is severed, in the sense of reflecting all I/O preceding that point; that all I/O after that point has no effect on the disk; and that fsync fails after that point. I'd like the assurance that such an application is also robust in the event of a system crash. This is a relatively simple criterion of *application* robustness, which could then define *file system* robustness.
As an application developer, that's what I'd like to have assurance of. In the extremely common case that we don't care if data has actually hit disk (i.e. our application will work fine on a RAM disk), the latter scenario of severed disks is irrelevant, and we just want to know that the guarantees that POSIX provides with regard to the running system are preserved in case of crash. And that's something I can reason about. And it's even something I can test my application for without installing special non-robust file systems and pulling the plug on the computer.
Posted Mar 19, 2009 17:35 UTC (Thu)
by anton (subscriber, #25547)
[Link] (2 responses)
Posted Mar 20, 2009 13:53 UTC (Fri)
by droundy (subscriber, #4559)
[Link] (1 responses)
(Although I can see an appeal to relaxing the in-order constraint for IO from different processes... one ought to be able in principle to do that while maintaining in-order semantics if one were to examine locks---and information flow in general---to ensure that you didn't reorder IO that could be causally related.)
Posted Mar 20, 2009 18:36 UTC (Fri)
by anton (subscriber, #25547)
[Link]
Posted Mar 17, 2009 20:04 UTC (Tue)
by nix (subscriber, #2304)
[Link] (1 responses)
Everywhere else, it's a minority who even know the standard *exists*, let
I, too, wish this was not true (I spend much too much of my time at work
Posted Mar 17, 2009 21:44 UTC (Tue)
by man_ls (guest, #15091)
[Link]
Posted Mar 18, 2009 11:25 UTC (Wed)
by emk (subscriber, #1128)
[Link]
So now I take a different approach: I write lots of unit tests, and I set up buildbots for any platform I need to support.
This leads to lots of interesting discoveries. For example, under Windows Vista, pretty much any file system operation can fail for no apparent reason, and it may need to be retried until it works.
There are two exceptions to this rule: Security, and data integrity. You can't ensure either with unit tests. You must also understand both the official documentation and the reality of common implementations. (My nastiest surprise so far: There are snprintf implementations on some legacy Unix platforms that ignore the buffer size parameter, exposing every caller of snprintf to an overflow attack. If you rely on snprintf on old Unix platforms, it might be worth writing a unit test that checks to see if it works.)
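Such a test is tiny. A conforming C99 snprintf() truncates its output and returns the length it would have written, so a sketch of the check being suggested could be:

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Fails loudly on platforms whose snprintf ignores the size argument. */
    int main(void)
    {
        char buf[4];
        memset(buf, 'x', sizeof buf);
        int n = snprintf(buf, sizeof buf, "%s", "hello");
        assert(n == 5);                    /* the length it wanted to write */
        assert(strcmp(buf, "hel") == 0);   /* truncated and NUL-terminated  */
        return 0;
    }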
Posted Mar 17, 2009 17:49 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (1 responses)
Posted Mar 19, 2009 2:39 UTC (Thu)
by mjr (guest, #6979)
[Link]
The kludges for alloc-on-replace that will apparently be enabled by default should deal with the real, problematic cases of loss of pre-existing data quite nicely, and I do think having those as defaults is a good, reasonable compromise. Those who want to be extra-paranoid with either nodelalloc or alloc-on-commit, well, sure, the option is/will be there, but I don't think these might I say somewhat obsolescent modes should be the default.
'course, distros shouldn't use ext4 by default yet anyway, for conservativeness' sake. Maybe starting next year ;]
Posted Mar 17, 2009 17:50 UTC (Tue)
by iabervon (subscriber, #722)
[Link] (1 responses)
I don't see anything in POSIX to suggest that there's anything you can do in general to avoid having a window in which the on-disk mapping of names to contents is undesirable. Using fsync() is an implementation-specific hack which uses something POSIX defines to update the data that ext4 happens to care about.
Posted Mar 17, 2009 18:33 UTC (Tue)
by ssam (guest, #46587)
[Link]
but that would be bad. hence there is a journal, and tools like fsck to make sure that a system crash usually does not harm most of your data.
the journal is meant to mean that after a crash the filesystem can be recovered to a valid state, without having to sync after each write.
so ext3 is POSIX plus stuff to make a filesystem safe and useful.
Posted Mar 17, 2009 17:53 UTC (Tue)
by walters (subscriber, #7396)
[Link] (3 responses)
The problem with the POSIX interface (and raw filesystem APIs in general) is that they're too general. Is it designed for Postgres? Is it designed for preference store backends? Is it designed for my IRC client to store conversation logs (with search)? You can't do all of these simultaneously and well.
Basically I think the filesystem should be designed for the first two cases, and what we really need is better userspace libraries; particularly for desktop applications. If those *libraries* happen to use kernel-specific interfaces for optimization, that makes sense.
For example, I'm pretty sure one could put part of sqlite's atomic commit in the kernel.
The other argument against trying to solve this in the kernel is that very few programs use the raw libc/POSIX interface; they're using the standard C++ library, Java, Python, Qt, GLib/Gio etc. So any changes have to be made at those levels.
Posted Mar 17, 2009 18:03 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link]
Posted Mar 17, 2009 18:31 UTC (Tue)
by butlerm (subscriber, #13312)
[Link]
SQLite on the other hand, appears to do a checkpoint (i.e. finish the
Posted Mar 26, 2009 16:42 UTC (Thu)
by muwlgr (guest, #35359)
[Link]
Posted Mar 17, 2009 18:19 UTC (Tue)
by rsidd (subscriber, #2582)
[Link] (6 responses)
Posted Mar 17, 2009 19:33 UTC (Tue)
by ncm (guest, #165)
[Link] (5 responses)
It's amusing to me that Gabriel's example of interruptible read() and write() system calls utterly fails to illustrate his point. The interface complexity of these system calls can be completely hidden by simple C library functions, and the code to implement them is much, much shorter than would be needed in the kernel to re-run the interrupted calls. Furthermore, a user program might well prefer to bypass such help, e.g. in case it could do something else useful before performing more I/O.
That said, Apollo's Aegis, evolved from Unix's predecessor Multics, had a much better interface to read(): it took a pointer to a buffer, but normally would return a pointer into the OS's own file buffer, usually avoiding a copy operation. Unixen didn't typically have a nonblocking read until remarkably recently, and it still doesn't tend to work well with disk files.
Posted Mar 17, 2009 21:38 UTC (Tue)
by bcopeland (subscriber, #51750)
[Link] (3 responses)
I don't see how the example undermines Gabriel's point. As far as I can tell,
Posted Mar 18, 2009 6:23 UTC (Wed)
by rsidd (subscriber, #2582)
[Link] (1 responses)
Someone else above made the point that worse-is-better was quite reasonable in the 1970s and 1980s when computers were extremely slow and people were willing to sacrifice stability for speed. (Lisp machines, I believe, literally took all morning to boot; and garbage collection was time for a coffee break. At the other extreme, completely unprotected operating systems like CP/M and MS-DOS, that let programmers do pretty much anything they liked, managed to have useful applications like WordStar that were as smooth and interactive as today's word processors. Unix machines lay somewhere in the middle.)
Also, programming was an arcane art and OS designers were willing to trust application designers to "do the right thing" (and if they didn't, the consequences were immediately noticeable).
Today's computers are a few orders of magnitude faster, but are still running operating systems built on assumptions that ceased to be valid nearly two decades ago.
Posted Mar 18, 2009 6:58 UTC (Wed)
by mgb (guest, #3226)
[Link]
Garbage collections caused zero delay, as incremental garbage collection was supported in microcode.
Just like today, if you needed a coffee break in the 80's you had to find something huge to compile.
Posted Mar 19, 2009 23:11 UTC (Thu)
by jschrod (subscriber, #1646)
[Link]
He states that it is an observable fact that the worse-is-better school has a better adoption rate than the do-it-right-the-first-time school; for several reasons, only some of them technical. And that fact makes him sad, because he is very much a member of the do-it-right-the-first-time school -- you just have to read his report about CLOS to recognize that, even if you have never spoken to him personally. (I have, so this is not hearsay.)
Posted Mar 19, 2009 18:15 UTC (Thu)
by dmarti (subscriber, #11625)
[Link]
Posted Mar 17, 2009 18:20 UTC (Tue)
by sf_alpha (guest, #40328)
[Link] (17 responses)
I think it is also the duty of application developers to ensure their applications work on POSIX if they are writing UNIX applications.
There are many filesystems out there that implement delayed allocation. Although those filesystems are not the default filesystem for Linux, we expect applications to work regardless of which STABLE filesystem and operating system is used.
Using the allocate-on-commit mount option to provide ext3-like behavior is a workaround that gives applications time to migrate, as is the patch to commit on rename. But again, application developers keep ignoring the POSIX specification and compliance. If they want to write a Linux-only application, there is no problem with relying on Linux or ext3 functionality. But I am sure that most applications are not intended to be used only on Linux and a few filesystems.
So if allocate-on-commit is the default behavior, we get non-portable (and buggy) applications in exchange.
Posted Mar 17, 2009 19:52 UTC (Tue)
by aleXXX (subscriber, #2742)
[Link] (1 responses)
Sure, sounds reasonable.
But, how to do that ?
Most people have one box around, usually running Linux.
How should they/we test that our software works fine everywhere ?
Ok, one can install a virtual machine and run e.g. FreeBSD or Solaris on
it. Not sure how many people do this in their spare time. At least for me
it's quite at the top of my TODO.
E.g. Kitware has nightly builds and
testing for
basically all their software on basically all operating systems, i.e.
Linux, Windows, OSX, AIX, HP-UX, Solaris, FreeBSD, QNX and others (here's
the current dashboard for CMake: http://www.cdash.org/CDash/index.php?project=CMake)
But setting this up is quite some effort, you need to find people to host
these builds for you. Not every small project can afford this.
Alex
Posted Mar 18, 2009 5:45 UTC (Wed)
by sf_alpha (guest, #40328)
[Link]
If an application is designed to be used only on Linux with ext3, that is OK: just ignore this problem and rely on ext3's robustness. The only drawbacks are that the application is not portable and still MAY LOSE DATA ON A CRASH.
Posted Mar 17, 2009 20:17 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Posted Mar 17, 2009 22:29 UTC (Tue)
by nix (subscriber, #2304)
[Link]
Posted Mar 17, 2009 22:21 UTC (Tue)
by man_ls (guest, #15091)
[Link] (11 responses)
Just joking. You are right that people should respect the spec, but I think that POSIX compliance is not the problem here. In fact POSIX is just a red herring that Ted Ts'o threw in the way to make the hounds lose the scent. Apparently he failed, but he left the hounds half-crazed and biting each other for a long time.
Posted Mar 18, 2009 0:45 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (4 responses)
I know you are joking here, but these files are not empty because of some evil "POSIX compliant" way of dealing with a crash by the FS or kernel. They are empty because they were never committed to the disk by running fsync().
So, it is not that POSIX compliant FS decided to _remove_ that data upon crash. It was never _explicitly_ placed there in the first place, so it cannot possibly be there after the crash.
Sure, you could have a very rare situation where you do run fsync() and you get a zero length file, for all sorts of reasons (usually hardware and kernel driver issues). This is not the case here, however.
> POSIX is just a red herring
I have to disagree here and agree with Ted. The manual page of close is very specific and says:
> A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a file system to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)
So, if you want to have _any_ guarantee that you will see your data on disk _now_, you had better fsync().
The manual page of rename(2) is similarly clear on what is being atomic - just the directory entries. Sure, ignore at your peril.
And, finally, the documentation of fsync() is also crystal clear that one is allowed to run it independently on a directory and on a file. Which means that users themselves are allowed to do this separately (and they do). So does the kernel.
Sure, Ted is being gentle to everyone with bugs in their apps, which is fair enough. But, I'll bet $5 they'll hit the same thing on another Unix-like OS in the future, at which point all this screaming will happen again and people writing buggy code will accuse FS writers that it's their fault.
In the meantime, it is easy to take occasional backups of configuration files and restore them if the real config files are broken. And it doesn't require running fsync() all the time.
Posted Mar 18, 2009 7:22 UTC (Wed)
by man_ls (guest, #15091)
[Link] (3 responses)
Posted Mar 18, 2009 8:46 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Look, it is completely up to you to believe what you want of course. I will do the same. OK?
Posted Mar 20, 2009 5:02 UTC (Fri)
by k8to (guest, #15413)
[Link] (1 responses)
The applications are expecting behavior POSIX does not provide.
It's fine to use a pattern that doesn't request the data be on disk, but you should write the app to deal with the lack of the data being on disk.
This is what I've done many times, in my own software authoring.
Posted Mar 20, 2009 11:52 UTC (Fri)
by man_ls (guest, #15091)
[Link]
I hope I don't have to use your software that fsync()s after every file operation. Mistaking durability for atomicity can have dire consequences both for durability and for atomicity.
Besides, you are not a lurker and this is not email.
Posted Mar 18, 2009 2:03 UTC (Wed)
by ras (subscriber, #33059)
[Link] (5 responses)
It strikes me as odd that an open source OS uses a non-free spec to define its operations. Doesn't it strike anybody else as odd that we have a whole pile of people here arguing about compliance to a spec they most likely haven't seen? I see statements like "ensure your app only relies on stuff in POSIX". Perfectly good advice, except how is your typical open source developer meant to do that when he can't get access to the bloody thing?
That aside, I gather (since I have not been able to get a copy of POSIX myself) that POSIX doesn't offer much to programmers who want to ensure some combination of consistency and durability. This sort of stuff is a basic requirement if you want to produce a reliable application. The furor here is an indication of just how basic it is. Yet even if you did have access to the spec, I gather it doesn't spell out how to do this. So programmers have learnt a bunch of ad hoc heuristics, like "to get consistency without the slowdowns caused by durability, use open();write();close();rename()". Then we get accused of "not adhering to the spec" when the next version of the FS doesn't implement the heuristic. Give us a break!
Ted's suggestion that you should be using sqlite if you want to write out a few hundred bytes of text reliably is on one level almost a joke. I presume he suggested it because the sqlite authors have taken the time to learn all the heuristics to get data on the disc reliably. Given it _is_ so hard to figure out all those heuristics for the various file systems your application could find itself running on, I guess it is a reasonable suggestion. Unfortunately, as the firefox programmers found out, it doesn't always work. Yeah, sqlite got the data onto the disc reliably, but only by using fsync(), which killed performance on some platforms. Given you probably don't care if your latest browsing history hit the disc in 5 minutes time, it is a great illustration of why programmers are so fond of "open();write();close();rename()".
From talking to a MySql developer, I gather the situation is even worse than most posting here realise. Not only does the rename() trick not work, it turns out just about anything beyond fdatasync() doesn't work. For example, you might expect that appending to a file would be fairly safe. Well, not so, apparently, according to POSIX. He said that if you append to a file, there is a chance on a POSIX system that the entire file could be truncated if you crash at the wrong moment. The only way to guarantee a file can't be corrupted by a write is to ensure you don't affect the metadata (think block allocations) - i.e. always write to pre-allocated blocks. Need to extend your 100Gb database? Well then you have to copy it, write zeros to the extra space at the end to ensure it isn't sparsely allocated, then use the fsync(); rename() trick.
And that should be a joke. Pity it isn't. Given that filesystems aren't going to implement ACID, we need a set of primitives we can use to build up our own implementations of ACID. Fast, simple things, along the lines of the CPU instruction "Test Bit and Set", which is there so assembly programmers can implement all sorts of complex locking schemes on top of it. And we need them defined in a spec that we can actually access - unlike POSIX.
Given that ain't going to happen, Ted's only way out of this is to publish such a document for his filesystems - the ext* series. Just a series of HOWTOs would be a good start - HOWTO extend a large file reliably, HOWTO get consistent data written to disc (i.e. impose ordering on writes) without the slowdowns of unwanted sync()s, HOWTO ensure a rename() for a file you don't have open has hit the disc. Nothing fancy. Just the basic operations applications are expected to perform reliably every day on his file systems.
Posted Mar 18, 2009 4:35 UTC (Wed)
by butlerm (subscriber, #13312)
[Link]
Posted Mar 18, 2009 13:17 UTC (Wed)
by RobSeace (subscriber, #4435)
[Link]
Posted Mar 18, 2009 15:41 UTC (Wed)
by markh (subscriber, #33984)
[Link]
Posted Mar 18, 2009 20:27 UTC (Wed)
by sb (subscriber, #191)
[Link]
Read the online specifications, particularly the
System Interfaces volume.
Read also the Linux manpage for the system call you are using. It will say
which standards the implementation adheres to, and how it departs from those
standards.
On a Debian system, install "manpages-dev" and "manpages-posix-dev". For most
system interfaces, you will then have the Linux implementation in section 2 and
the POSIX spec in section "3posix".
Posted Mar 19, 2009 0:31 UTC (Thu)
by ras (subscriber, #33059)
[Link]
Times have apparently changed, and it is a big improvement. The last time I went search for POSIX was when I was referred to it by some man page which just said it implemented "POSIX regex's", and got completely pissed off when I discovered the manual entry for a supposedly free library referred me to a non-free spec.
Posted Mar 19, 2009 18:43 UTC (Thu)
by anton (subscriber, #25547)
[Link]
Posted Mar 17, 2009 19:37 UTC (Tue)
by gmaxwell (guest, #30048)
[Link] (8 responses)
(And yes, I've seen KDE broken after an unplanned shutdown on XFS, but there was never any confusion or doubt in my mind that it was a KDE problem.)
Posted Mar 17, 2009 20:15 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (2 responses)
Posted Mar 18, 2009 1:08 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
1. It was written by SGI, so most distros don't have internal resources to support it. So, they ship other FSes as default.
2. Other FSes were historically the default, so upgrading to XFS is not that straightforward.
3. Some people unjustly accused this FS of losing data, so it is taking a long time for the FS to get its reputation back. Even if all those that did the accusing published public retractions of their unfounded accusations, it would take a long time.
Posted Mar 18, 2009 11:15 UTC (Wed)
by arekm (guest, #4846)
[Link]
Otherwise I'm very happy with xfs and I'll continue to use it.
Posted Mar 17, 2009 20:26 UTC (Tue)
by sbergman27 (guest, #10767)
[Link] (3 responses)
Users who get hoist with that particular petard have clearly done it to themselves. "XFS does it too" is really academic.
Posted Mar 17, 2009 20:47 UTC (Tue)
by gmaxwell (guest, #30048)
[Link] (2 responses)
I don't disagree with the position that the end result is a "linux (distribution) problem" to the end user. But the problem is not the filesystem or the kernel. The problem is the dependence on zillions of tiny dot files (or a registry, *cough*, gconf) which absolutely can't be corrupt. Even if EXT4 provides the EXT3 behaviour there will always be opportunities for these files to become corrupted (software/hardware failure, cosmic ray, etc) and that the failure frequently results in an inability to even login is simply unacceptable. Quite arguably the file system which demonstrates its corner case behaviour more frequently is preferable since it means that developers will be more aware and more likely to address these situations.
Posted Mar 17, 2009 22:27 UTC (Tue)
by sbergman27 (guest, #10767)
[Link]
Posted Mar 18, 2009 19:58 UTC (Wed)
by chad.netzer (subscriber, #4257)
[Link]
Posted Mar 27, 2009 7:20 UTC (Fri)
by Duncan (guest, #6647)
[Link]
IOW, AFAICT, the bad reiserfs rep originates in the pre-2.6.6 era, quite
Again, if you have references otherwise, I'm willing to be corrected, but
Duncan
Posted Mar 17, 2009 21:12 UTC (Tue)
by job (guest, #670)
[Link] (8 responses)
I would have appreciated it if the article had mentioned sed in-place editing or shell scripts using mktemp && mv as examples of potential problems. After all, no one expects to be forced to call sync after that, do they? Calling most shells examples of broken user space applications is a bit of a stretch.
Posted Mar 17, 2009 22:42 UTC (Tue)
by butlerm (subscriber, #13312)
[Link] (6 responses)
Then, as suggested by a commenter elsewhere, we could add something like a
Posted Mar 17, 2009 22:53 UTC (Tue)
by quotemstr (subscriber, #45331)
[Link] (5 responses)
I don't see a problem with adding platform-specific values for
If the POSIX people ever get around to standardizing safe semantics for
Posted Mar 18, 2009 1:00 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (2 responses)
Posted Mar 18, 2009 1:04 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link] (1 responses)
Of course it's per-filesystem. That's the whole point of using
Posted Mar 18, 2009 1:11 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Mar 18, 2009 4:41 UTC (Wed)
by butlerm (subscriber, #13312)
[Link]
Posted Mar 19, 2009 7:27 UTC (Thu)
by job (guest, #670)
[Link]
I would hope that the operation "write data to file" gets less complex, not more. There is already a little dance of calls to be made (Ted writes about it). If we add logic on the application level to handle that some filesystems expect fsync on the directory, some on only the file and some manage without, it becomes even more so. In tens of thousands of applications.
But this is only vaguely related to the data ordering issue. In an interactive program or where performance is critical you may not want to wait until data is commited to disk. Latency kills.
Posted Mar 18, 2009 5:13 UTC (Wed)
by xoddam (subscriber, #2322)
[Link]
This is *the* way to achieve atomicity on Unix. It always was.
We didn't use to have journaling filesystems and we never used to expect anything at all to work after a crash. Crashes happened and they often meant hours of work, possibly reinstalling everything.
To discover that ext3 data=ordered just kept on working after a crash was a real eye-opener for me. I never realised before that such robustness was possible (just like I never realised prior to Linux that I could afford my own Unix box) and I am not about to relinquish it! I'm sure I speak for many users here. It *was* an unexpected nice-to-have when we first got it; but it has become a solid requirement.
Delayed allocation without an implicit write barrier before renaming a newly-written file virtually guarantees data loss after a crash with existing applications. It is therefore a regression from the status quo, albeit to something somewhat better than the status of a few years back.
Kudos and thanks to Ted for implementing this must-have write barrier (and also for improving the chances that unsafe truncate-and-write is less likely to hit the inevitable race condition, though IMO he's quite right that applications doing that are broken).
I just wish he wouldn't keep insisting that fsync at application level is the right way to achieve what we want.
POSIX and fsync have nothing to do with it (any journaling filesystem provides much more than POSIX), nor do application authors who forgot to think about recovery after an OS crash. A journaling filesystem *can* guarantee atomic renames, so it *should*, for the sake of users' sanity, not for the sake of a standards document.
Standards can be updated to follow best practice. They often do.
Posted Mar 17, 2009 21:35 UTC (Tue)
by dmarti (subscriber, #11625)
[Link]
How about a library function? Pass it a pathname, a pointer to a block of data, a length, and some flags, and it comes back with success if it opened, wrote, did whatever black magic you have to do, and closed the file.
The library could also include the corresponding read function, so you could have a flag to tell the writer, "don't bother rewriting with identical contents."
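The "don't bother rewriting with identical contents" flag could be backed by a comparison pass along these lines. The helper name is made up, and a real library would also want to consider metadata and read errors; this only shows the byte-for-byte check:

/* Hypothetical helper for the flag suggested above: returns 1 if `path`
   already contains exactly the bytes buf[0..len), so the caller can skip
   the rewrite (and with it the rename and the disk traffic). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int file_matches(const char *path, const void *buf, size_t len)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;
    char *cur = malloc(len ? len : 1);
    if (!cur) {
        fclose(f);
        return 0;
    }
    size_t got = fread(cur, 1, len, f);
    /* same length (no trailing bytes) and same contents */
    int same = got == len && fgetc(f) == EOF && memcmp(cur, buf, len) == 0;
    free(cur);
    fclose(f);
    return same;
}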
Posted Mar 17, 2009 23:42 UTC (Tue)
by ikm (guest, #493)
[Link] (1 responses)
E.g. suppose we call it "f". Then
FILE * f = fopen( "precious_config", "wf" );
...
fwrite(...);
...
fclose( f );
That flag would mean that the original file is to be kept untouched until the stream is closed. Instead, a replacement inode is to be allocated on the same media, getting all new changes. At the moment stream is closed, the new inode is to atomically replace the original one. So you'd either get the old version, or the new version.
Programs that would use the new flag would still be portable, since the flag would just be ignored where it isn't known. Yes, they won't be using rename tricks in that case, but I doubt most people care that much anyway. At least, I don't do rename tricks, but I'd happily use the new flag if there was one.
In glibc, the whole thing could be implemented either by using a special flag to the underlying open syscall if one exists, or by just utilizing the rename trick internally otherwise.
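A sketch of what that rename-trick emulation might look like, one level above the C library. Every name here is hypothetical (no such glibc interface exists), and a real implementation would live inside libc rather than on top of it:

/* Hypothetical emulation of the proposed mode letter: buffer writes in a
   temp file, fsync it, and atomically rename() it over the target on close. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct atomic_file { FILE *f; char *tmp; char *dst; };

struct atomic_file *afopen(const char *path)
{
    struct atomic_file *af = calloc(1, sizeof *af);
    if (!af)
        return NULL;
    size_t n = strlen(path) + 32;
    if (!(af->tmp = malloc(n)) || !(af->dst = strdup(path))) {
        free(af->tmp); free(af->dst); free(af);
        return NULL;
    }
    snprintf(af->tmp, n, "%s.tmp.%ld", path, (long)getpid());
    if (!(af->f = fopen(af->tmp, "w"))) {
        free(af->tmp); free(af->dst); free(af);
        return NULL;
    }
    return af;
}

int afclose(struct atomic_file *af)
{
    int rc = (fflush(af->f) != 0 || fsync(fileno(af->f)) != 0) ? -1 : 0;
    if (fclose(af->f) != 0)
        rc = -1;
    if (rc == 0)
        rc = rename(af->tmp, af->dst);  /* old or new contents, never empty */
    else
        unlink(af->tmp);                /* on failure the original survives */
    free(af->tmp); free(af->dst); free(af);
    return rc;
}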
Posted Mar 18, 2009 0:17 UTC (Wed)
by quotemstr (subscriber, #45331)
[Link]
In all seriousness, I think that's the wrong level of abstraction for this functionality. A C FILE* returned by fopen should correspond to only one underlying file, and that file shouldn't magically change its name. Besides, a flag to open won't work because the kernel would have to spool modifications to that file until commit, and that would not only be very complex, but could open up all kinds of denial-of-service attacks.
I wouldn't mind, however, a libc function that encapsulated the very common open-write-close-rename sequence. I wouldn't even mind some function that returned a FILE*. I just don't think that function should be called fopen, and I don't think that FILE* should be closed with fclose -- think of something more like popen.
Posted Mar 17, 2009 23:46 UTC (Tue)
by clugstj (subscriber, #4020)
[Link]
It may not be required by POSIX, but it is a reasonable way for a filesystem to behave.
Posted Mar 18, 2009 3:53 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (12 responses)
Which is the euphemism for "we'll have workarounds for your bugs" and "people that know will fix their apps".
Posted Mar 18, 2009 5:36 UTC (Wed)
by xoddam (subscriber, #2322)
[Link] (11 responses)
Application writers have long used rename precisely and only to achieve atomicity; it's the only atomic operation the API provides (simulating atomicity with locks and a series of synchronous flushes is a very different matter). As long as the kernel and fs are up, POSIX guarantees that data writes which precede the rename will be visible to all readers after the rename. Post-crash, nothing used to be guaranteed.
This isn't about application developers coding to an API, it's about users wanting reasonable behaviour from their computers even when they abuse them by kicking out power cords or installing proprietary device drivers.
Ext3 has given us a new, much-better-than-POSIX standard of data recoverability. It's a mere implementation detail that it does this in part by effectively preserving the order of operations that POSIX mandates down to the disk level.
Delayed allocation without a write barrier before renames of newly-written files practically guarantees data loss in this extremely common use-case, so it's a regression.
The regression now has been fixed (thanks Ted). No hacking of applications required.
Posted Mar 18, 2009 6:05 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (10 responses)
Sorry, I didn't come up with that. I think that would be... Ted ;-)
Posted Mar 18, 2009 8:21 UTC (Wed)
by xoddam (subscriber, #2322)
[Link] (9 responses)
Please, bojan and tytso alike, cease and desist from saying applications are broken, when users have given a clear requirement for a new filesystem: that it not lose data as a matter of course, when the status quo would preserve it.
Posted Mar 18, 2009 8:55 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (1 responses)
I don't know. I think a person that created the file system may know a thing or two about POSIX. I did actually go and check, and he did appear to be right. But that's obviously not good enough for you (or you may know of some interpretation we cannot grasp - it is possible). I'm OK with that.
> Please, bojan and tytso alike, cease and desist from saying applications are broken, when users have given a clear requirement for a new filesystem: that it not lose data as a matter of course, when the status quo would preserve it.
I have no intention of doing that (unless LWN editors throw me out). Likewise, you can say what you please.
Ted, being a pragmatic person, already did put workarounds in place, so users will be happy.
Posted Mar 30, 2009 12:37 UTC (Mon)
by forthy (guest, #1525)
[Link]
> I think a person that created the file system may know a thing or two about POSIX.

It's not, and I repeat in bold: NOT about POSIX. It is about reasonable behavior. Ordered data has been implemented in ReiserFS and XFS, which both had the reputation of being unstable and prone to eat files before. This is a quality-of-implementation issue, not a standards issue. Maybe we would need a better standard for file systems, so that quality of implementation is reasonable by default, but that's a different topic. If you insist that your way-below-average quality of implementation is "perfectly valid", you are anal-retentive.

I think Ted Ts'o should read the GNU Coding Standards. What is written there is mandatory for a core component of the GNU project (which the Linux kernel is, regardless of whether it's officially part of the GNU project). The point in question here is section 4.1:

> The GNU Project regards standards published by other organizations as suggestions, not orders. We consider those standards, but we do not "obey" them. In developing a GNU program, you should implement an outside standard's specifications when that makes the GNU system better overall in an objective sense. When it doesn't, you shouldn't.

What Ted has implemented was a behavior which is standard, but makes his file system worse, because it has inconvenient side-effects on robustness in case of a crash. In shorter words: it sucks. And the GNU Coding Standards clearly say: if the standard sucks, don't follow it.
Posted Mar 18, 2009 17:24 UTC (Wed)
by pflugstad (subscriber, #224)
[Link] (6 responses)
> that it not lose data as a matter of course

Okay, I had to post on this one. The thing that EVERYONE seems to be forgetting is that these problems only occur when you have crashes - i.e. bad hardware or buggy drivers. This is not a case of losing data as a matter of course; it's a case of the whole freakin' system crashing badly, which happens VERY rarely.
Honestly, has anyone here, NOT running binary closed-source drivers, experienced a crash in a distro-provided kernel in, what, the last 12 months or longer? Heck, even a bleeding-edge (but not -rc) kernel.
Didn't think so. Now, please refrain from hyperbolic statements like that.
I realize that Ted pointed this out in his initial emails, and while it's still not good for the system-level behavior to change like this, this is a case of an ultra-bleeding-edge kernel, an ALPHA distro release, etc. These are not common users in any sense of the word "common".
Posted Mar 18, 2009 20:02 UTC (Wed)
by zeekec (subscriber, #2414)
[Link]
> Honestly, has anyone here, NOT running binary closed source drivers, experienced a crash in a distro provided kernel in what, the last 12 months or longer? Heck, even a bleeding edge (but not -RC) kernel.
Actually, yes I have. I run Gentoo unstable at home, and I am currently having issues with the 2.6.28 kernel and Xorg's intel drivers. All open source. So it does happen. (But I'm running Gentoo unstable and expect it!)
Posted Mar 18, 2009 22:45 UTC (Wed)
by xoddam (subscriber, #2322)
[Link] (3 responses)
The purpose of a journaling filesystem is *only* to ease and speed the task of recovery after an unclean shutdown. I can't emphasise this point strongly enough.
Posted Mar 19, 2009 0:35 UTC (Thu)
by butlerm (subscriber, #13312)
[Link] (2 responses)
That is not quite correct. The primary purpose of journaling in typical journaling filesystems is to preserve metadata integrity. Filesystem repair tools cannot repair metadata that has never been written.
The secondary purpose of journaling is to loosen ordering restrictions on meta data updates. Assuming you want your filesystem to be there after an unclean shutdown, that is a major advantage.
Finally, journaling filesystems are not metaphysically prohibited from using their journals to do other useful things, such as store meta-data undo information, for example.
Posted Mar 19, 2009 23:25 UTC (Thu)
by jschrod (subscriber, #1646)
[Link]
That said, yes, I had many kernel crashes at the start of this year, using SUSE and no proprietary modules. It took a long time to identify the piece of hardware that caused it. (It was the video card.) I have another system where use of ionice causes hard lockups of the whole system, reproducibly - e.g., running updatedb with ionice. I never identified the culprit there and finally put the machine in the closet; my time was worth more than the price of a new system.
Posted Mar 19, 2009 12:50 UTC (Thu)
by ssam (guest, #46587)
[Link]
If the system crashes or loses power between the rename hitting the disk and the new data actually being written, then there is a very good chance that the old data is still on the disk.
There are programs that can find deleted files and undelete them. If ext4 left some clue in the journal as to which blocks were freed during a rename, then maybe fsck or the journal-replaying code could go and find those blocks containing the old file, and relink them.
Posted Mar 19, 2009 18:06 UTC (Thu)
by zmi (guest, #4829)
[Link] (2 responses)
ext4 behaves like other modern filesystems - XFS, reiserfs, reiser4 - in that on a crash it thrashes data. But just because this one's name is similar to ext3, people want it to behave the same.

The problem is that if it continues like ext3, applications will never get fixed. But harddisks already use 32MB cache, and people use on-board RAID controllers. Imagine you have a RAID over 4 hard disks with 32MB cache each - on a power outage, you lose up to 128MB of data just from the disks, and there's nothing any filesystem can do about it (short of turning off the disk write caches). There's the *absolutely false* assumption that your data is safe once you fclose(). But without an fsync, it's not.

In the end it would have been better to fix applications. Otherwise other filesystems, and newer ones with even more advanced features, will always be told to "eat your data", while really it's not their fault.

What you currently have is this: use ext3/ext4, or otherwise you can lose data. Not because the other filesystems are crap, but because application developers don't (need to) care. They use ext3/4 and people will continue to say that it's much better. The typical half-truth we often see in IT, and that often leads to bad systems. Think of computers controlling equipment like trains, cars, or atomic reactors. "Oh, that one melted because the filesystem ate its data" is not a good answer after all. A good application would have checked it. Seems like more and more people believe that if it works 95% of the time, that's enough. Until that computer-controlled equipment fails because of a computer problem. Then it should have been working 100% of the time, right?

(Yes, I really feel nervous that we start to trust computers where you really can't.)
Posted Mar 19, 2009 23:54 UTC (Thu)
by xoddam (subscriber, #2322)
[Link]
We use a journaling filesystem so that *after* our hardware or our kernel fails -- which is more or less likely depending on the way we use them, but everyone is at risk -- whatever made it to nonvolatile storage is in a sane, recoverable state. NOT MISSING. That way we *only* lose a few seconds' or minutes' work.
[ Or browsing history, some people call that work :-) ]
Posted Mar 21, 2009 17:03 UTC (Sat)
by nix (subscriber, #2304)
[Link]
The additional price for the battery backup was less than 2% of the machine's cost. If you can't afford that, you probably shouldn't be using RAID at all (certainly not hardware RAID: md will work anywhere, but makes no bones about possible data loss if a crash happens during writing).
Posted Mar 20, 2009 22:53 UTC (Fri)
by spitzak (guest, #4593)
[Link]
I would also like to see "atomic create" with a new flag to open(). This would make a hidden file (i.e. the inode is not in any directory) that you can write to. When you close() it, it does the equivalent of the atomic rename() to the filename given to open(). If the program crashes, then the file disappears, as it is not linked to by any directory. Calling fsync() on the file would do the rename that close does at that point, and from then on it would act like a normal file. There might also be a way to abort the file, closing it such that it disappears with no effect.
This would give the atomic operation that everybody is trying to get, without the need to generate a temp filename, without leaving temp files on the disk in case of a crash, and possibly with far better performance, as the filesystem could defer the effects of close.
Furthermore I think there would not even need to be a new flag. Instead the flags O_CREAT|O_WRONLY|O_TRUNC (the ones creat() uses) could trigger this behavior. This is almost certainly the behavior anybody calling creat() is expecting anyway.
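For what it's worth, much later Linux kernels grew something very close to this idea: O_TMPFILE (since 3.11) creates an inode with no directory entry, and linkat() gives it a name only once the contents are complete. A sketch, with the caveats that the filesystem must support O_TMPFILE, /proc must be mounted, and linkat(), unlike rename(), will not replace an existing name:

#define _GNU_SOURCE               /* for O_TMPFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int atomic_create(const char *dir, const char *name,
                  const void *buf, size_t len)
{
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0666);
    if (fd < 0)
        return -1;                /* kernel or filesystem lacks O_TMPFILE */
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);                /* the unnamed file simply vanishes */
        return -1;
    }
    char proc[64], path[4096];
    snprintf(proc, sizeof proc, "/proc/self/fd/%d", fd);
    snprintf(path, sizeof path, "%s/%s", dir, name);
    int rc = linkat(AT_FDCWD, proc, AT_FDCWD, path, AT_SYMLINK_FOLLOW);
    close(fd);
    return rc;
}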
Thank you for a balanced discussion of one of the most inflammatory technical (as opposed to legal or social) issues in recent memory. The "worse is better" approach is not unique to unixland, and in the days of small computers with limited resources, counting on applications to not do silly things made sense.
Better than POSIX?
Without an ordering guarantee, an application must fsync before every useful rename.
Some argue that a non-ordered rename gives the kernel more freedom to optimize. That view is a red herring: with a non-ordered rename, applications must fsync before the rename to have sensible semantics. So really, the choice isn't between an ordered and a non-ordered rename, but between fsync-plus-unordered-rename and ordered-rename; the latter actually gives the kernel greater latitude in optimizing IO.
I expect an ordered rename to be either neutral or beneficial in every real-world situation. Here's my challenge to anyone reading this: come up with a non-contrived scenario in which an ordered rename (i.e., an implicit write barrier) would be harmful.
Transactional filesystems
Most filesystems make very limited use of transaction processing techniques - meta data undo in particular. A fully transactional filesystem would be a complicated setup that allows an arbitrary number of data and meta data operations to commit or rollback as a group. That is an order of magnitude more complicated than what would be required to efficiently preserve the atomicity of file rename operations, for example.
Someday we may get fully general transactions - see Transactional NTFS for an example - but for now the ability to provide atomicity without durability would be a major improvement that has a fraction of the complexity.
Transactional filesystems
> transaction processing techniques - meta data undo in particular.
What PostgreSQL does could be described as "meta-data" undo only if the meta data of the higher level filesystem were native to PostgreSQL (i.e. stored in table rows), as opposed to the completely separate and irrelevant meta data of the lower level filesystem PostgreSQL was running on top of. And that's fine, just as long as the meta data you are referring to is not the meta data of the filesystem itself.
Transactional filesystems
PostgreSQL's MVCC storage can be regarded as a means of mixing the undo log into the data store itself, eliminating it as a separate entity. (The size of the store doesn't matter; you just work through it bit by bit. PostgreSQL scales to silly data volumes already...)
Better than POSIX?
1) open foo.new
2) write to foo.new
3) close foo.new
4) move foo.new to foo
all that needs to happen is that step 4 does not hit the disk before step 2 has.
Better than POSIX?
Also, they didn't have any other experiences to compare it to. If ext3 had also lost data in similar situations, people would be used to "if the system crashes and your app didn't do fsync, you will lose data modified in the last couple of minutes" and consider that reasonable.
Better than POSIX?
If the system crashes during the commit window in ext3, you are almost guaranteed to end up with the old version of the file. With delayed allocation, you are almost guaranteed to end up with a 0-byte file. That's a rather stark difference.
Sure, a buggy driver could go berserk and write 99 red balloons all over the disk. Nothing in the design of the FS could prevent that... but that's *extraordinarily* rare.
Better than POSIX?
> people were relatively happy with the ~5 second vulnerability window in ext3, but are getting bit by the ~60 second vulnerability window that ext4 defaulted to.
Well, if the 5-second window had left my systems vulnerable to data corruption (in the form of zero-length files) I know I would not have been happy. After all that 5-second window happens every 5 seconds, so you are virtually guaranteed to run into the problem if it exists, whatever the window size.
Better than POSIX?
I know my data is vulnerable in many ways. I would just like the most common cases to be addressed. It is not luck. According to the discussions, everyone who has had a crash with (first) XFS or (now) ext4 is getting inconsistent states, while no heavy-duty ext3 users have reportedly seen this kind of corruption. Maybe others, yes, but no zero-length files -- which is the issue under discussion.
Better than POSIX?
> no, currently there is no API in existence for any filesystem (including ext3) that will _guarantee_ that you have either the old or the new file on disk after a crash, with no possibility of garbage (even excluding things that damage the drives themselves)
I am pretty sure that there are file systems around that give such guarantees. This was certainly one of the goals of LinLogFS and LLFS; ok, these filesystems have not come out of the development stage (yet), but I would be surprised if there were no others (ZFS?).
Do you really NEED such an API?
Better than POSIX?
mess situation any time in the face of that.
Postponing flush until close()?
Maybe not
> Maybe the answer is a new set of guarantees for Linux's POSIX API, e.g. an overwriting rename() will either leave the old or the new version on disk, atomically.
Why? As has been pointed out, ext2 is perfectly fine for many applications - for example in data centers with 3-way redundant power supplies and redundant storage, or for temporary filesystems - and it would never be Linux-POSIX-compliant this way.
ext2, Bruté?
It is by all means desirable. The proper place for such a standard might be debated though. I have always understood that POSIX is a standard for compatibility, e.g. Wikipedia says:
> POSIX or "Portable Operating System Interface for Unix"[1] is the collective name of a family of related standards specified by the IEEE to define the application programming interface (API), along with shell and utilities interfaces for software compatible with variants of the Unix operating system, although the standard can apply to any operating system.
So I don't know if a standard for reliable file systems would fit in.
Better than POSIX?
> Maybe the answer is a new set of guarantees for Linux's POSIX API, e.g. an overwriting rename() will either leave the old or the new version on disk, atomically.
Thanks for that comment --- it's amazing how much knowledge we're rediscovering in computing. It's almost as if we're coming out of some kind of dark age.
Better than POSIX?
But most importantly, a software developer is seldom trained to think at an engineering level. When I studied for my degree, I was taught about power plants, engines, turbines, cooling plants, pipes, dissipators... none of which has anything to do directly with software development. But after a few years of studying those systems it becomes obvious that there are lots of analogies between them, and very often the mathematical models that describe them are the same.
The teachers themselves pointed this out every time they could, and they did it on purpose, to teach us to recognize the patterns.
I mean, how many actually tried to read Donald Knuth's books (or similar ones), or at least consult them when appropriate?
There are lots of answers already published, but we continue reinventing them anyway...
Massimiliano
Better than POSIX?
Much of my work lately has consisted of ripping out buggy broken slow reimplementations of wheels and replacing them with a wheel that uses twenty-to-forty-year-old techniques to do the same thing faster and more reliably.
(I found a hash table implementation today which had a stupid bug which led to every element landing in the same bucket. Obviously it was too hard to look in "include/hash.h" to find that there was already a hash table in the system with a better API...)
Yet the people I work with are tentative and reluctant to just grep for a few plausible terms to see if they can avoid reinventing the damn wheel yet again.
Pride
Meanwhile the *BSD crew went to soft updates (non-barrier asynchronous write order dependencies) 10 years ago, and think the Linux filesystem people are crazy for not doing it too. Ted's patches are maybe 25% of the way to what Dr. McKusick described back in 1999. Maybe we should do the other 75%...
Better than POSIX?
According to http://www.scs.cs.nyu.edu/V22.0480-003/notes/l10d.txt, soft updates doesn't provide atomic rename. (See "limitations".) Can someone who knows more about soft updates elaborate on this point?
Better than POSIX?
Well, what's confusing is that it should, as I understand soft-updates --- it maintains a dependency graph of required metadata updates, and writes to disk in such a way that at each step, the system is consistent. I don't see why such a scheme wouldn't automatically give you at least a metadata-level atomic rename, so I was asking for clarification on what those notes meant.
Better than POSIX?
> Well, what's confusing is that it should, as I understand soft-updates --- it maintains a dependency graph of required metadata updates, and writes to disk in such a way that at each step, the system is consistent. I don't see why such a scheme wouldn't automatically give you at least a metadata-level atomic rename, so I was asking for clarification on what those notes meant.
Soft updates is an enhancement of a file system that does update-in-place. Certain things are hard or impossible to do in that setting. In particular, rename requires updating two different places, so if you update one place after the other (as you have to do with in-place updates), there is a time span when one update has happened, but not the other.
Besides --- UFS users don't complain about zero-length orphans
Wrong!
I don't think OSF/1 aka Digital Unix aka Tru64 Unix ever had soft updates.
Yes, my complaint was about UFS with synchronous metadata updates, not with soft updates (indeed I mention soft updates as a correct solution). My complaint just refutes the no-complaints claim of the OP, and weakens his insinuation that UFS with soft updates still zeroes files and we just don't hear about it.
soft update not providing true atomic rename?
> * Not as general as logging, e.g., no atomic rename
> * Metadata updates might proceed out of order
> - create /dir1/a then /dir2/b
> - after crash /dir2/b might exist but not /dir1/a
Better than POSIX?
The POSIX standards represent an essentially "minimal" set of requirements, so that any application written to the POSIX standard will run on a wide variety of computers. VMS has a "posix" API, and I believe VM370 does also. Each OS adds additional properties that developers rely on when writing applications that require specific behavior. The SysVR4 interface standards are one example. Ext4 needs to obey POSIX requirements, and provide predictable, dependable behavior (beyond POSIX) that developers can use to write useful programs.
Better than POSIX?
You can do that and still be POSIX-compliant. Keep a backup with the tilde before renaming the new file into place. Then have recovery code to pick up the tilde file if there's a problem with the regular one.
Better than POSIX?
Huh? If a program is truly POSIX-compliant, it can't make any assumptions about what happens after a crash. That's undefined by POSIX. There's no sense in playing games with backup files because the backups are undefined by POSIX too!
Better than POSIX?
guarantee" that data blocks will hit the disk before meta-data. I don't
understand why that's "implicit." It seems like the definition of data=ordered
mode says exactly that?
allocation?
that ext3 didn't have.
data=ordered.
Better than POSIX?
This came up in one of the 2.6.29 announcement reply subthreads, and Linus calls the failure to honor data=ordered (thus implying that, in Linus's opinion, it WAS a failure to honor it, no matter what various others say about it being about security only and thus honored) "idiotic" in one reply, and in another reply says essentially the same thing using a different choice of description.

LKML in general has at least three definite levels of "idiotic". Yes, this is "idiotic", but Linus hasn't yet advanced to calling it the "smoking crack" level of "idiotic" that he has been known to resort to in other instances. OTOH, this would seem to be beyond the "brown paper bag" level of "idiotic", so called because that's what the person making the mistake wants to wear, since he's now embarrassed to be seen in public. The "brown paper bag" level of "idiotic" is the level at which, once aware, the person who made the mistake owns up to it and does NOT defend it, but rather "resorts to the brown paper bag"; in fact, many such "brown bag" mistakes are discovered and fixed by the person that made them in the first place. This is beyond that, since the person making the "mistake" has been and continues to defend it as "correct", thus reaching at minimum the "idiotic" level.

One gets the strong impression that somewhere along the line, Linus would love to get a patch that makes data=ordered mean just that once again: that, delayed allocation or no delayed allocation, if data=ordered, the data will be written before the metadata that covers it. Personally, I'd suggest the current default then be christened "data=screwed", altho data=delayed or some such is the more likely "acceptable" alternative.
How we learn APIs
Unless we can get programmers to learn APIs differently, I don't see how we can avoid providing them with a more helpful API if we want them to write robust programs.
How we learn APIs
What you describe sounds very similar to what in-order semantics provides.
How we learn APIs
> Although I can see an appeal to relaxing the in-order constraint for IO from different processes

Well, given that Unix applications often create lots of processes that interact in lots of interesting ways (think shell scripts), I think it's just too hard to find out when two operations are independent. I also don't think that this relaxation buys much if the in-order constraint is implemented efficiently (by combining a large batch of operations and committing them together).
How we learn APIs
About the only developer community that actually knows its standards is the Ada community, and that's because their systems tend to be "system fails, people die" sorts of embedded systems (or "system fails, N-million-dollar missile falls out of the sky"). That tends to breed paranoia and a desire to actually know the damn standard before you write code.
Most programmers have never even skimmed the standards they code against, let alone reference them regularly, and it's a very small minority who've actually read the whole thing.
It's a real problem (I've spent a lot of time cleaning up after these bozos), but I can see no useful way to fix it. In a corporate setting, mandatory code reviews with a flunk-too-often-and-you're-fired rule might work, but I wouldn't want to work there: morale would be appalling.
And there is a reason that critical code can be as much as 10 times as expensive as regular business code. Working that way (learning your standard by heart before starting to work) can be much more expensive than the usual corporate style of "write, then fix". On the other hand it is better to write correct code than to "write, then fix". There is a sweet spot somewhere in between.
Better than POSIX?
> For example, I'm pretty sure one could put part of sqlite's atomic commit in the kernel.

Your ideas (like most ideas) have been brought up in the past. Down the road you propose lie record-oriented filesystems. There's a reason every major filesystem today works with generic files: the alternative is absolutely horrible.

> The other argument against trying to solve this in the kernel is that very few programs use the raw libc/POSIX interface; they're using the standard C++ library, Java, Python, Qt, GLib/Gio etc. So any changes have to be made at those levels.

Their interfaces are as generic as the POSIX ones, and putting data integrity logic there will not alleviate any of the concerns that have already been raised. If you want to make file-format-specific optimizations, you have to modify file-format-specific libraries --- and most applications just roll their own. Fixing the problem in the kernel solves all these issues elegantly, and at once.
Better than POSIX?
It is worth noting however that high performance databases can commit a durable transaction as soon as the necessary journal update has made it to persistent storage. Nothing else needs to be touched.
It is often assumed that a commit requires synchronously flushing everything (including updates to all the actual data files) as part of every commit. That is not necessary in a good design. There is nothing fundamental about filesystem or database design that requires anything except the journal to be synchronously updated on transaction commit. If durability is not required, not even that.
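In code, the point is simply that commit only has to force the journal record out. A toy sketch, with names that belong to no real system:

#include <unistd.h>

/* Toy write-ahead-log commit: only the appended journal record needs a
   synchronous flush; the data files themselves are updated lazily and
   replayed from the journal after a crash. */
int commit(int journal_fd, const void *redo_rec, size_t len)
{
    if (write(journal_fd, redo_rec, len) != (ssize_t)len)
        return -1;
    /* once this returns, the transaction is durable */
    return fsync(journal_fd);
}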
Better than POSIX?
> write() system calls utterly fails to illustrate his point.

On the contrary: he was very much arguing in favor of worse-is-better - that having the best interface for the sake of it is brain-damaged, and so the Unix solution of EAGAIN was superior to the MIT way, even though the interface was arguably "worse."
Better than POSIX?
You couldn't have a UHH without all the boasts about the elegance of Unix from otherwise mostly sensible people. Microsoft's OS products have been driven by backwards compatibility, business side demands for lock-in, and stripping out features to please antitrust regulators -- nobody is saying they're elegant, so no point writing a "hater's" handbook to rebut the claim.
Better than POSIX?
> I think it is also the duty of application developers to ensure their applications work on POSIX if they are writing UNIX applications.
No. I don't support every POSIX system, and it's not my responsibility to do that. I'll decide where I want my application to fall on the spectrum between simplicity and portability, not you.
Better than POSIX?
POSIX environments like mingw. ;}}}}
Better than POSIX?
> I think it is also the duty of application developers to ensure their applications work on POSIX if they are writing UNIX applications.
Those applications do work on POSIX systems. They also happen to leave hordes of little empty files after a crash, something which, as has been argued ad nauseam, is a POSIX-compliant way of dealing with a crash. It is also POSIX-compliant to make the user hunt these little buggers down and remove them, or to provide valid contents. It seems that it is even POSIX-compliant to zero the whole disk on a crash, something which these applications kindly refrain from doing. See? Nobody is ignoring POSIX specification and compliance.
Better than POSIX?
You seem to think that:
3 out of 4?
Just curious. Do lurkers support you in email?
3 out of 4?
> The applications should stop expecting this.
I am sure you also run your Bash console in POSIX mode, never use ls with long options, and only use cp with the four-and-a-half POSIX options. Congratulations. The rest of the world is not so spartan.
Better than POSIX?
The actual POSIX specs may not be available anywhere for free, but the Single Unix Specs are, and they are essentially a superset of POSIX, and probably what most people really mean when they say "POSIX" these days...
Better than POSIX?
POSIX.1-2008 is available here.
Better than POSIX?
POSIX and SUS have merged, and are now the same thing. The link in the last comment, and the first google link, point to the older 2004 edition.
> It strikes me as odd that an open source OS uses a non-free spec to define its operations.

Doesn't it strike anybody else as odd that we have a whole pile of people here arguing about compliance with a spec they most likely haven't seen? I see statements like "ensure your app only relies on stuff in POSIX". Perfectly good advice, except how is your typical open source developer meant to do that when he can't get access to the bloody thing?
Better than POSIX?
> So if allocate-on-commit is default behavior, we get non-portable (and bugged) applications as an exchange.

You might get a few more applications that sync before renaming, but that does not make them any more portable or bug-free. If the OS crashes, that's not an application bug nor a portability problem. If the user uses a file system that gives no crash consistency guarantees (e.g., ext4), that's not an application bug or portability problem either. A user of such a file system should just back up frequently and be prepared to restore from backup in case of a crash. Application programming doesn't have anything to do with it.
Add new mode letter to fopen()
> FILE * f = fopen( "precious_config", "wf" );
All we need is a "t" flag. :-)