
Shared pain

Posted Jan 20, 2012 22:59 UTC (Fri) by zlynx (subscriber, #2285)
In reply to: Shared pain by gwolf
Parent article: XFS: the filesystem of the future?

EXT4 acts the same way, creating zero-filled files when run in writeback journaling mode. I run mine this way, although I do make sure that auto_da_alloc is turned on so that data is flushed when replacing a file via rename().

I'd much rather have the performance.
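For reference, the replace-via-rename pattern that auto_da_alloc is meant to catch can be sketched like this (a minimal Python illustration; the helper name is invented, and error handling is trimmed):

```python
import os
import tempfile

def replace_via_rename(path, data):
    """Replace path with data so a crash leaves either the old or the
    new contents, never a zero-length file. (Illustrative helper.)"""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # push the new contents to disk first
    os.rename(tmp, path)          # atomic replacement on POSIX
    dirfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dirfd)           # make the rename itself durable
    finally:
        os.close(dirfd)
```

With auto_da_alloc, ext4 heuristically flushes the data before committing the rename even when the application skips the explicit fsync() step.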



Shared pain

Posted Jan 21, 2012 12:45 UTC (Sat) by dany (subscriber, #18902) [Link]

> I'd much rather have the performance.

It's fine that you would, but would your employer/customers also prefer performance over reliability? There is a reason the default ext4 mount mode in RHEL is ordered.

Shared pain

Posted Jan 21, 2012 17:15 UTC (Sat) by ricwheeler (subscriber, #4980) [Link]

You don't need to give up reliability for performance in either ext4 or xfs.

Eric and I are both with the Red Hat file system team (as is Dave Chinner) and we would not be supporting XFS if it were not solid and reliable as well as high-performance.

What you do need, as Eric mentioned, is to keep your box properly configured and to have applications that do the right things to persist data.

Jeff Moyer (also a Red Hat file system person) wrote up a nice article for LWN a few months back on best practices for data integrity.

Shared pain - article link

Posted Jan 22, 2012 21:01 UTC (Sun) by ndye (guest, #9947) [Link]

> Jeff Moyer (also a Red Hat file system person) wrote up a nice article for LWN a few months back on best practices for data integrity.

This article, I presume?

Shared pain

Posted Jan 28, 2012 23:16 UTC (Sat) by sbergman27 (guest, #10767) [Link]

So you can either use Ext4 mounted with the nodelalloc option, or Ext3 mounted data=ordered, and sleep well at night. Or... you can use XFS and commission a code audit for every piece of important software you run, specifically checking for proper fsync usage. Cross your fingers, hoping the auditors didn't miss anything, and try to sleep well at night.

Shared pain

Posted Jan 29, 2012 0:44 UTC (Sun) by dlang (subscriber, #313) [Link]

you may sleep well at night, but it will be the sleep of someone who has been fooled about the reliability of their data.

skipping fsync is not safe on any filesystem that's not mounted with the sync option

this is true for every OS

Shared pain

Posted Jan 29, 2012 1:53 UTC (Sun) by sbergman27 (guest, #10767) [Link]

And again, you are glossing over the matter of the relative likelihood of data loss with various filesystems. It suits your purposes to turn it into a black and white issue. "Ext3 can lose your data too!", you cry.

Well, both my mattress and J.P. Morgan *could* lose my money. So putting my money either place represents equal risk. If I understand you correctly, you are saying that Ext3 mounted data=ordered, Ext4 mounted with the defaults, and XFS mounted with the defaults, all represent equal risk to our data because any one of them *could* conceivably lose our data.

Again, I'm not buying it.

Shared pain

Posted Jan 30, 2012 21:49 UTC (Mon) by dlang (subscriber, #313) [Link]

replying to multiple comments in one reply

I am not saying that the risk is equal, I am disputing the statement that ext3 is rock solid and won't lose your data without you needing to do anything.

Ext3 is one of the worst possible filesystems to use if you really care about your data not getting lost (and therefore implement the fsync dance to make sure you don't lose data), because its fsync performance is so horrid.

The applications are not keeping the data in buffers 'as long as they can', they are keeping the data in buffers for long enough to be able to optimize disk activity.

The first is foolishly risking data for no benefit, the second is taking a risk for a direct benefit. These are very different things. Ext3 also keeps data in buffers and delays writing it out in the name of performance. Every filesystem available on every OS does so by default; they may differ in how long they will buffer the data, and what order they write things out, but they all buffer the data unless you explicitly mount the filesystem with the sync option to tell it not to.

You say that people will always pick reliability over performance, but time and time again this is shown to not be the case. As pointed out by another poster, MySQL grew almost entirely on its "performance over reliability" approach, only after it was huge did they start pushing reliability. The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

This extends beyond the computing field. There are no fields where the objective is to eliminate every risk, no matter what the cost. It is always a matter of balancing the risk with the costs of eliminating them, or with the benefits of accepting the risk.

That's the problem...

Posted Jan 30, 2012 22:54 UTC (Mon) by khim (subscriber, #9252) [Link]

In theory, theory and practice are the same. In practice, they are not.

> Ext3 is one of the worst possible filesystems to use if you really care about your data not getting lost (and therefore implement the fsync dance to make sure you don't lose data), because its fsync performance is so horrid.

Right, but that's the problem: most developers don't care about these things (most just don't think about the problem at all, others just hope it all will work... somehow). Most users do. Thus we have a strange fact: in theory ext3 is the worst possible FS from "lost data" POV, in practice it's one of the best.

That's the problem...

Posted Jan 30, 2012 23:16 UTC (Mon) by dlang (subscriber, #313) [Link]

trust me, users trying to run high performance software that implements data safety (databases, mail servers, etc) care about this problem as well.

For other developers, the fact that fsync performance is so horrible on the default filesystem for many distros has trained a generation of programmers to NOT use fsync (because it kills performance in ways that users complain about)

That's the problem...

Posted Feb 2, 2012 3:45 UTC (Thu) by tconnors (guest, #60528) [Link]

> For other developers, the fact that fsync performance is so horrible on the default filesystem for many distros has trained a generation of programmers to NOT use fsync (because it kills performance in ways that users complain about)

Then there's the fact that fsync will spin up your disks if you were trying to keep them spun down (to the point where on laptops, I try to use 30 minute journal commit times, and manually invoke sync when I absolutely want something committed). I don't want or need an absolute guarantee that the new file has hit the disk consistent with metadata. I want an absolute guarantee that /either/ the new file or the old file is there, consistent with the relevant metadata. ext3 did this. It's damn obvious what rename() means - there should be no need for every developer to go through all code in existence and change the semantics of code that used to work well *in practice*. XFS loses files every time power fails *in practice*. If I need to compare to backup *every time* power fails, then I might as well be writing all my data to volatile RAM and do away with spinning rust altogether, because that's all that XFS is good for.

Another pathological (but instructive) case...

Posted Jan 30, 2012 23:03 UTC (Mon) by khim (subscriber, #9252) [Link]

Similar story happens with USB sticks: most users believe FAT (on Windows) is super-safe, NTFS (on Windows) is horrible - and Linux is awful no matter what. Why? Delayed write defaults. FAT on Windows is tuned to flush everything on ANY close(2) call. NTFS works awfully slow in this mode thus it uses more aggressive caching. And on Linux caching is always on.

And users just snatch USB stick the very millisecond program window is closed (well... most do... the cautious ones wait one or two seconds). They feel it's their unalienable right. In these circumstances suddenly the oldest and the most awful filesystem of them all becomes the clear winner!

Exactly because "I care about my data" does not automatically imply "I'll do what I'm told to do to keep it".

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 4, 2012 13:08 UTC (Sat) by Wol (guest, #4433) [Link]

Don't get me started ... :-)

I work with Pick (a noSQL db), and a LOT of the reliability problems that ACID is meant to fix, just *can't* *happen* in Pick.

Well, they can if the database was badly designed, but you can get similar pain in relational databases too...

Relational is a lovely mathematical design - I would use it to design a database without a second thought - but I would then convert that design to NF2 (non-first-normal-form) and implement it in Pick. Because 90% of ACID's benefits would then be redundant, and the database would be so much faster too.

You've heard my war-story of a Pentium90/Pick combo outperforming an Oracle/twinXeon800, I'm sure ...

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 4, 2012 13:27 UTC (Sat) by gioele (subscriber, #61675) [Link]

> I work with Pick (a noSQL db), and a LOT of the reliability problems that ACID is meant to fix, just *can't* *happen* in Pick.

This is getting off-topic, but could you explain which kinds of reliability problems that ACID DBs are meant to fix cannot happen in Pick, and why?

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 7, 2012 23:20 UTC (Tue) by Wol (guest, #4433) [Link]

Okay, let's do a data analysis. In Pick, you would do an EAR.

Look for what I call "real world primary keys" - an invoice has a number, a person has a name. Now that we've got a primary key, we work out all the attributes that belong to that key. We can now do a relational analysis on those attributes. (forget that real-world primary keys aren't always unique and you might have to create a GUID etc.)

With an invoice, in relational you'll end up with a bunch of rows spread across several tables for each invoice. IN PRACTICE, ACID is mostly used to make sure all those rows pass successfully through the database from application to disk and back again.

In Pick, however, you then coalesce all those tables (2-dimensional arrays) together into one n-dimensional Pick FILE. And you coalesce all those rows together into one Pick RECORD. With the result that there is no need for the database to make sure the transaction is atomic. All the data is held as a single atom in the application, and is passed through the database to and from disk as a single atom.
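As a hypothetical illustration of that coalescing (all names invented), here is the invoice example held as one Pick-style record rather than rows in several tables:

```python
# Hypothetical sketch: an invoice held as one nested record (one Pick
# FILE, one RECORD) instead of rows spread across an "invoices" header
# table and an "invoice_lines" table.
invoice_record = {
    "invoice_no": "INV-1042",        # real-world primary key
    "customer": "ACME Ltd",
    "lines": [                       # multivalued attribute: the whole
        {"sku": "A1", "qty": 3},     # line-items table folded into
        {"sku": "B7", "qty": 1},     # the record itself
    ],
}

# Storing or fetching the invoice is one operation on one atom; the
# relational equivalent touches one row per line item plus the header.
```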

That's also why Pick smokes relational for speed - access any attribute of an object, and all attributes get pulled into cache. Try doing that for a complex object with relational !!! :-) (Plus Pick is self-optimising, and when optimised it takes, on average, just over one disk seek per primary key to find the data it's looking for on disk!)

The problem I see with relational is it is MATHS and, to reference Einstein, therefore has no basis in reality. Pick is based on solid engineering, and when "helped" with relational theory really flies. Relational practice actually FORBIDS a lot of powerful optimising techniques.

And if designed properly, a Pick database is normalised therefore it can look like a relational database, only superfast. I always compare Pick and relational to C and Pascal. Pick and C give you all the rope you need to seriously shoot yourself in the foot. Relational and Pascal have so many safety catches, it's damn hard to actually do any real work.

(And because foreign keys are attributes of a primary key, you can also trap errors in the application. For example, the client's key is a mandatory element of an invoice so it belongs in the invoice. Okay, it's the app's job to make sure the record isn't filed without a client, whereas in relational you can leave it to the DB, but in Pick it's easy enough to add a business layer between the app and the DB that does this sort of thing.)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 7, 2012 23:26 UTC (Tue) by Wol (guest, #4433) [Link]

Just to add, look at it this way ...

In relational, attributes that are tied together can be spread (indeed, for a complex object MUST be spread) across multiple tables and rows. Column X in table Y is meaningless without column A in table B.

In Pick, that data would share the same primary key, and would be stored in one FILE, in one RECORD. Delete the primary key and both cells vanish together. Create the primary key, and both cells appear waiting to be filled.

As far as Pick is concerned, a RECORD is a single atom to be passed through from disk to app and vice versa. From what I can make out, in relational you can't even guarantee a row is a single atom!

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 0:45 UTC (Wed) by dlang (subscriber, #313) [Link]

this has nothing to do with the ACID guarantees. the ACID guarantees are about what happens when you start modifying the datastore, specifically what happens if the modification doesn't complete (including the system crashing in the middle of an update)

ACID is the sequence of modifying the file on disk so that you have either the new data or the old data at all times, and if the application says that the transaction is done, there's no way for it to disappear.

what you are talking about with Pick is a way to bundle related things together so that it's easier to be consistent. That doesn't mean that writing your record will happen in an atomic manner on the filesystem (if the record is large enough, this won't be an atomic action)

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 14:52 UTC (Wed) by Wol (guest, #4433) [Link]

So ACID is there to cope with the fact that, what the DB sees as a single transaction, the operating system and filestore doesn't, so it sits between the database and OS, and makes sure that the multiple success/fails returned by the OS are returned to the database as a single success or fail.

AND THAT IS MY POINT. In Pick, 90% of the time, this is unnecessary and wasteful, because what is a single transaction as seen by the DB is also a single transaction as seen by the OS and disk subsystem!

I agree with you, if the record is too big, it won't go onto the file store in one piece, but the point is that with a relational DB you can pretty much guarantee that, in practice, a transaction will never go onto the filestore in one piece. So ACID is needed. But in Pick it's unusual for it NOT to go on the filesystem in one piece. So most of the time ACID is an unnecessary complexity.

I'm ignoring file system failures like XFS/ext4 zeroing out your table in a crash :-) because I don't see how ACID can protect against the OS trashing your data :-)

What you need to do is realise that ACID sits between the database and disk. As you say, it guarantees that the database in memory is (a) consistent, and (b) accurately represented on disk. And because, *in* *the* *real* *world*, pretty much any change in a relational database requires multiple changes in multiple places on disk, ACID is a necessity.

But in the real world, most changes in a Pick database only involve a *single* change in a *single* place on disk to ensure consistency. So a "write successful" from the OS is all that's needed to provide a "good enough" implementation of ACID. (And if the OS lies to your ACID layer, you're sunk even if you've got ACID. See all the other posts in this thread about disks lying to the OS!)

(This has other side effects. Yes, a Pick database can get into an inconsistent state. But that inconsistent state MIRRORS REALITY. A Pick database can lose the connection between a person and his house. Or a car and its owner. But in reality a person can lose their house. A car can lose its owner. It's far too easy to assume in a relational database that everyone has a home, and next thing you can't put some poor vagrant in your database when you need to ...)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 22:17 UTC (Wed) by dlang (subscriber, #313) [Link]

you don't seem to understand that writes to the filesystem are not atomic in just about every case, let alone dealing with the rest of ACID

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 0:17 UTC (Thu) by Wol (guest, #4433) [Link]

Writes to the file system where? At the db/OS interface? At the OS/disk interface?

Because if it's at the OS/disk interface, what the heck is ACID doing in the database? It can't provide ANY guarantees, because it's too remote from the action.

And if it's at the db/OS interface, well as far as Pick is concerned, most transactions are near-enough atomic that the overhead isn't worth the cost (that was my comment about "90% of the time").

Your relational bias is clouding your thinking (although Pick might be clouding mine :-) But just because relational cannot do atomic transactions to disk doesn't mean Pick can't. As far as Pick is concerned, that transaction is atomic right up to the point that the OS code actually puts the data onto the disk. And if the OS screws that up, ACID isn't going to save you ...

Think of a "begin transaction" / "end transaction" pair. It's almost impossible for that transaction to truly be atomic in a relational database - you will invariably need to update multiple rows. In Pick, it's more than possible for that transaction to be truly atomic at the point where the db hands it over to the OS. ACID enforces atomicity between the OS and the db. Pick doesn't need it.

What guarantees does ACID provide over and above data consistency? Because a well-designed Pick app guarantees "if it's there it's consistent". And if the OS screws up and corrupts it, neither Pick nor ACID will save you.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 0:54 UTC (Thu) by dlang (subscriber, #313) [Link]

ACID has nothing to do with relational algebra

ACID is a feature that SQL databases have had, but you don't need to abandon SQL to abandon ACID and you don't need to have SQL to have ACID

Berkeley DB is ACID but not SQL; MySQL was SQL but not ACID with the default table types for many years.

ACID involves the database application doing a lot of stuff to provide the ACID guarantees to users by using the features of the OS and hardware. If the OS/hardware lies to the database application about when something is actually completed then the database cannot provide ACID guarantees.

It appears that you have an odd interpretation of what ACID means, so let's review.

Atomicity

A transaction is either completely implemented or not implemented at all. For changes to a single record this is relatively easy to do, but if a transaction involves changing multiple records (subtract $10 from account A and add $10 to account B) it's not as simple as atomically writing one record. Remember that even a single write() call in C is not guaranteed to be atomic (it's not even guaranteed to succeed fully; you may write part of the buffer and not the rest)
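The point about write() is easy to see in code; a minimal sketch of the retry loop a careful writer needs (helper name invented):

```python
import os

def write_all(fd, data):
    """POSIX write(2) may transfer fewer bytes than requested; keep
    writing until the whole buffer has been handed to the kernel.
    (Even then the data is only in the page cache - durability still
    requires fsync.)"""
    view = memoryview(data)
    while len(view):
        written = os.write(fd, view)
        view = view[written:]
```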

Consistency

this says that at any point in time the database will be consistent, by whatever rules the database chooses to enforce. Berkeley DB has very trivial consistency checks, the records must all be complete. Many SQL databases have far more complex consistency requirements (foreign keys, triggers, etc)

Isolation

This says that one transaction cannot affect another transaction happening at the same time

Durability

This says that once a transaction is reported to succeed, then nothing, including a system crash at that instant (but excluding something writing over the file on disk), will cause the transaction to be lost

What you are describing about Pick makes me think that it has very loose consistency and isolation requirements, but to get Atomicity and Durability the database needs to be very careful about how it writes changes.

It cannot overwrite an existing record (because the write may not complete), and it must issue appropriate system calls (fsync and similar) to the OS, and watch for the appropriate results, to know when the data has actually been written to disk and will not change.

It's getting this last part done that really differentiates similar database engines from each other. There are many approaches to doing this and they all have their performance trade-offs. If you are willing to risk your data by relaxing these requirements a database becomes trivial to implement and is faster by several orders of magnitude.
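The durability half of that can be sketched in a few lines (assumed helper, invented name; real engines add checksums, O_DSYNC, write barriers, and much more):

```python
import os

def durable_append(path, record):
    """Append one record and fsync before reporting success: once the
    caller sees True, the record should survive a crash (modulo drives
    that lie about their write caches, as discussed in this thread)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)   # the durability point of the "transaction"
    finally:
        os.close(fd)
    return True
```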

note how the only SQL concept that is involved here is the concept of a transaction in changing the data.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 20:22 UTC (Thu) by Wol (guest, #4433) [Link]

Yup. I am being far looser in my requirements for ACID for Pick, but the reason is that Pick is far more ACID by accident than relational.

Atomic: as I said, a transaction in relational will pretty much inevitably be split across multiple, often many, tables. In Pick, all dependent attributes (excluding foreign-key links) will be updated as a single transaction right down to the file-system layer. So, as an example, if I have separate FILEs for people and buildings, it's possible I'll corrupt "where someone lives" if I update the person and fail to create the building, but I won't have inconsistent person or building data.

Consistency: IF designed properly, a Pick database should be consistent within entities: all the data associated with an individual "real world primary key". Relations between entities could get corrupted, but that *should* be solved with good programming practice - in my example above, "lives at" is an attribute of person, so you update building then person.

Isolation: I don't quite understand that, so I won't comment.

Durability: Well, when I tried to write a Pick engine, my first reaction to actually writing FILEs to disk was "copy on write seems pretty easy...". And there comes a point where you have to take the OS on trust.

So I think my premise still stands - a LOT of the requirement for ACID is actually *caused* by the rigorous separation demanded by relational between the application and the database. By allowing the application to know about (and work with) the underlying database structure you can get all the advantages of relational's rigorous analysis, all the advantages of a strong ACID setup, and all the advantages of noSQL's speed. But it depends on having decent programmers (cue my previous comment about Pick and C giving you all the rope you need ...)

And one of the reasons I wanted to write that Pick db engine was so I could put in - as *optional* components - loads of stuff that enforced relational constraints to try and rein in the less-competent programmers! I want a Modula-2 sort of Pick, that by default protects you from yourself, but where the protections can be turned off.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 20:36 UTC (Thu) by dlang (subscriber, #313) [Link]

atomic, your scheme won't work if you need to make changes to two records (the ever popular "subtract $10 from account A, add $10 to account B" example)

consistency, what if part of your updates get to disk and other parts don't? what if the OS (or drive) re-orders your updates so that the write to the record for person happens before the write to building?

As far as durability goes, if you don't tell the OS to flush its buffers (which is what fsync does), then in a crash you have no idea what may have made it to disk and what didn't.
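The two-account transfer is exactly the case transactions exist for; a minimal sketch using sqlite3 purely for illustration (schema and helper invented):

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move amount between two rows in one transaction: either both
    updates are committed or neither is."""
    with conn:  # BEGIN ... COMMIT, rolled back on any exception
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("A", 100), ("B", 100)])
transfer(conn, "A", "B", 10)
```

A crash between the two UPDATEs rolls the first one back on recovery, which is precisely what the journal-style write discipline described above buys you.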

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 10, 2012 16:17 UTC (Fri) by Wol (guest, #4433) [Link]

The ever popular "subtract $10, add $10" ...

Well, if you define the transaction as an entity, then it gets written to its own FILE. If the system crashes then you get a discrepancy that will show up in an audit. It makes sense to define it as an entity - it has its own "primary key" ie "time X at teller Y". Okay, you'll argue that I have to run an integrity check after a crash (true) while you don't, but I can probably integrity-check the entire database in the time it takes you to scan one big table :-)

Consistency? Journalling a transaction? Easily done.

And yes, your point about flushing buffers is good, but that really should be the OS's problem, not the app (database) sitting on top. Yes I know, I used the word *should* ...

Look at it from an economic standpoint :-) If my database (on equivalent hardware) is ten times faster than yours, and I can run an integrity check after a crash without impinging on my users, and I can guarantee to repair my database in hours, which is the economic choice?

Marketing 101 - proudly announce your weaknesses as a strength. The chances of a crash occurring at the "wrong moment" and corrupting your database are much higher with SQL, because any given task will typically require between 10s and 100s more transactions between the db and OS than Pick. So SQL needs ACID. With Pick, the chances of a crash happening at the wrong moment and corrupting data are much, much lower. So expensive strong ACID actually has a prohibitive cost. Especially if you can get 90% of the benefits for 10% of the effort.

I'm not saying ACID isn't a good thing. It's just that the cost/benefit equation for Pick says strong ACID isn't worth it - because the benefits are just SO much less. (Like query optimisers. Pick doesn't have an optimiser because it's pretty much a dead cert the optimiser will save less than it costs!)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 10, 2012 18:43 UTC (Fri) by dlang (subscriber, #313) [Link]

so that means that you don't have any value anywhere in your database that says "this is the amount of money in account A", instead you have to search all transactions by all tellers to find out how much money is in account A

that doesn't sound like a performance win to me.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 11, 2012 2:30 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, git works exactly the same way. Is it fast enough for you?

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 11, 2012 5:48 UTC (Sat) by dlang (subscriber, #313) [Link]

what gives you reasonable performance for a version control system with a few updates per minute is nowhere close to being reasonable for something that measures its transaction rate in thousands per second.

besides, git tends to keep the most recent version of a file uncompressed, it's only when the files are combined into packs that things need to be reconstructed, and even there git only lets the chains get so long.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 11, 2012 13:44 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

git/svn/... store intermediate versions of the source code, so that applying all patches becomes O(log N) instead of O(N). But that's just an optimization.

NoSQL systems work in a similar way - they can store the 'tip' of the data, so that they don't have to reapply all the patches all the time. However, the latest data view can be rebuilt if required.
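The "cached tip, replayable log" idea can be sketched in a few lines (toy patch format invented for illustration):

```python
from functools import reduce

# Toy patch log: each entry sets one key. The durable truth is the log;
# the "tip" is just a cached view that can always be rebuilt by replay.
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 5)]

def apply_patch(state, patch):
    op, key, value = patch
    assert op == "set"              # only one op in this toy format
    new_state = dict(state)
    new_state[key] = value
    return new_state

tip = reduce(apply_patch, log, {})  # full replay: O(N) in patches
# Snapshotting every k patches (much as git caps its delta chains)
# bounds replay cost without changing the log's role as the record.
```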

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 12, 2012 15:57 UTC (Sun) by nix (subscriber, #2304) [Link]

Actually, even the most recent stuff is compressed. It just might not be deltified in terms of other blobs (which is what you meant, I know).

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 12, 2012 18:29 UTC (Sun) by dlang (subscriber, #313) [Link]

yes, everything stored in git is compressed, but it only gets deltafied when it gets packed.

and it's frequently faster to read a compressed file and uncompress it than it is to read the uncompressed equivalent (especially for highly compressible text like code or logs), I've done benchmarks on this within the last year or so

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 12, 2012 13:38 UTC (Sun) by Wol (guest, #4433) [Link]

Okay, it would need a little bit of coding, but I'd do the following ...

Each month, when you run end-of-month statements, you save that info. When you update an account you keep a running total.

If the system crashes you then do "set corruptaccount = true where last-month plus transactions-this-month does not equal running balance". At which point you can do a brute-force integrity check on those accounts.

(If I've got a 3rd state of that flag, undefined, I can even bring my database back on line immediately I've run a "set corruptaccount to undefined" command!)

And in Pick, that query will FLY! If I've got a massive terabyte database that's crashed, it's quite likely going to take a couple of hours to reboot the OS (I just rebooted our server at work - 15-20 mins to come up including disk checks etc). What's another hour running an integrity check on the data? And I can bring my database back on line immediately that query (and others like it) has completed. Tough luck on the customer whose account has been locked ... but 99% of my customers can have normal service resume quickly.

Thing is, I now *know* after a crash that my data is safe, I'm not trusting the database company and the hardware. And if my system is so much faster than yours, once the system is back I can clear the backlog faster than you can. Plus, even if ACID saves your data, I've got so much less data in flight and at risk.

But this seems to be mirroring the other debate :-) the moan about "fsync and rename" was that fsync was guaranteeing (at major cost) far more than necessary. The programmer wanted consistency, but the only way he could get it was to use fsync, which charged a high price for durability. If I really need ACID I can use BEGIN/END TRANSACTION in Pick. But 99% of the time I don't need it, and can get 90% of its benefits with 10% of its cost, just by being careful about how I program. At the end of the day, Pick gives me moderate ACID pretty much by default. Why should I have to pay the (high) price for strong ACID when 90% of the time, it is of no benefit whatsoever? (And how many SQL programmers actually use BEGIN/END TRANSACTION, even when they should?)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 14:08 UTC (Wed) by nix (subscriber, #2304) [Link]

From what I can make out, in relational you can't even guarantee a row is a single atom!
Well, the relational algebra does not discuss storage at all, and does not stipulate where relations might reside on permanent storage (nor *which* might: you could perfectly well store join results permanently for all it cares).

But in practice, in SQL... just try INSERTing half a row. You can't. Atomicity at the row level is guaranteed. I hate SQL, but at least it does this right.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 15:00 UTC (Wed) by Wol (guest, #4433) [Link]

Well, the relational algebra does not discuss storage at all, and does not stipulate where relations might reside on permanent storage

Which is exactly my beef with relational databases. C&D FORBID you from telling the database where relations should be stored for efficiency. But in REALITY it is highly probable that, if you access one attribute associated with my primary key, you will want to access others. But it's a complete gamble retrieving the same attribute associated with other primary keys. Because Pick guarantees (by accident, admittedly) that all attributes are stored in the same atom as the primary key they describe, all those attributes you are statistically most likely to want are exactly the attributes that coincidentally get retrieved together.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 15:05 UTC (Wed) by Wol (guest, #4433) [Link]

Atomicity at the row level IN THE DATABASE is guaranteed, yes.

What I meant was it's not guaranteed at the physical level in the datastore. Two cells in the same row could be stored in completely different "buckets" in the database - for example, if the data is stored in an index with a pointer from the row. I know that probably doesn't happen, but if the guy who designed the database engine thinks it's more efficient there's nothing stopping him.

So even if you the database programmer *think* an operation should be atomic right down to the disk, there's no guarantee.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 8, 2012 21:53 UTC (Wed) by nix (subscriber, #2304) [Link]

It happens quite a lot, increasingly often now that databases are lifting the horrible restrictions many of them had on the total amount of data stored per row (Oracle and MySQL had limits low enough that you could hit them in real systems quite easily).

If it matters that data is written to the disk atomically, you have already lost, because *nothing* is written to the disk atomically, not least because you invariably have to update metadata, and secondly because no disk will guarantee what happens to partial writes in the case of power failure. So, as long as you have to keep a journal or a writeahead log to deal with that, why not allow arbitrarily large amounts of data to appear to be written atomically? Hence, transactions.

It is true that programs that truly use transactions are relatively rare: in one of my least proud moments I accidentally changed the rollback operation in one fairly major financial system to do a commit and it was a year before anyone noticed. However, when you *do* have code that uses transactions, the effect on the code complexity and volume can be dramatic. As a completely random example, I've written backtracking searchers that relied on rollback in about 200 lines before, because I could rely on the database's transaction system to do nearly all the work for me.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 20:30 UTC (Thu) by Wol (guest, #4433) [Link]

It happens quite a lot, increasingly often now that databases are lifting the horrible restrictions many of them had on the total amount of data stored per row (Oracle and MySQL had limits low enough that you could hit them in real systems quite easily).

Sorry, I have to laugh here. It's taken Pick quite a while to get rid of the 32K limit, but that limit does date from the age of the dinosaur when computers typically came with 4K of core ...

And no limit on the size of individual items, or the number of items in a FILE.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 9, 2012 20:38 UTC (Thu) by dlang (subscriber, #313) [Link]

if a single item is larger than the track size of a drive, it is physically impossible for the write to be atomic. You don't need to get this large to run into problems though; any write larger than a block runs the possibility of being split across different tracks (or in a RAID setup, across different drives). If you don't tell the filesystem that you care about this, the filesystem will write these blocks in whatever order is most efficient for it.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Posted Feb 10, 2012 16:25 UTC (Fri) by Wol (guest, #4433) [Link]

:-)

Look at the comment you're replying to :-) In early Pick systems I believe it was possible for a single item to be larger than available memory ...

Okay, it laid the original systems wide open to serious problems if something went wrong, but as far as users were concerned Pick systems didn't have disk. It was just "permanent memory". And Pick was designed to "store all its data in ram and treat the disk as a huge virtual memory". I believe they usually got round any problem by flushing changes from core to disk as fast as possible, so in a crash they could just restore state from disk.

Cheers,
Wol

Shared pain

Posted Feb 2, 2012 1:06 UTC (Thu) by darrint (guest, #673) [Link]

"Well, Both my mattress and J.P. Morgan *could* lose my money. So putting my money either place represents equal risk."

I'm not sure how that metaphor is supposed to work. A few years ago when the U.S. government passed new credit reforms, they were delayed by several months at the request of the targeted banks, supposedly to give them time to carefully update their big and very important computer systems. The reality was that the banks used those several months to burn and pillage the assets of people like me. I was in debt, my own stupidity of course, and I probably lost over a thousand dollars in fees alone due to a few financial institutions playing shenanigans with the ragged edges of my account terms.

At the time I thought wistfully of how much more secure my money would be if I could just collect my pay in cash and drive it to my house.

Shared pain

Posted Feb 2, 2012 14:08 UTC (Thu) by rwmj (subscriber, #5474) [Link]

You were in debt ...

Does your house loan you money? What does this negative money look like that you keep under your mattress?

Shared pain

Posted Jan 29, 2012 2:02 UTC (Sun) by sbergman27 (guest, #10767) [Link]

"I'd much rather have the performance."

Why? And have you actually measured the "performance" that you are sacrificing reliability for? I'd certainly want to quantify what I was getting in return for the reduced reliability. In testing I've done, I couldn't tell the difference between data=ordered and data=writeback with ext4. Or for that matter, between default and nodelalloc.

Even for a single user desktop, I just don't see that the trade-off is a win. YMMV, I suppose. But I would encourage you to run some objective tests.

Shared pain

Posted Feb 2, 2012 2:44 UTC (Thu) by tconnors (guest, #60528) [Link]

> I'd much rather have the performance.

Really? Why bother writing your data to disk at all then? RAM is *super* fast! Me, when I call rename(), I damn well expect either the previous file or the current version has hit the platter and is consistent with metadata. ext3 does this, and ext4 does this with some tweaks. rename() has been implied by programmers for generations to mean an atomic barrier.

XFS has never done this. That is why I don't use XFS. Because they have a damn stubborn following that insists that the perfectly reasonable semantic of close();rename(); is "wrong wrong wrongity wrong, burn you evil data hater!"

Shared pain

Posted Feb 2, 2012 22:13 UTC (Thu) by dlang (subscriber, #313) [Link]

no, rename is an atomic barrier from the point of view of your software on the running machine.

however if the filesystem does not get unmounted cleanly, all guarantees are off. This has always been the case in Unix.

Shared pain

Posted Feb 2, 2012 23:27 UTC (Thu) by khim (subscriber, #9252) [Link]

Unix is dead, sorry. On Linux you have filesystems which were simply unreliable across crashes (ext, ext2, etc) or which guarantee atomicity across reboots (ext3, ext4, btrfs). Oh, and there is XFS, too - looks like its developers finally understood that filesystems exist to support applications, not the other way around (even if XFS fanbois didn't)

P.S. Yes, ext4 and btrfs also had the problem under discussion. But they were quickly fixed.

Shared pain

Posted Feb 2, 2012 22:50 UTC (Thu) by dgc (subscriber, #6611) [Link]

> rename() has been implied by programmers for generations to mean an
> atomic barrier

Not true. rename is atomic, but it is not a barrier and never has implied that one exists. rename() has been around for 3 times longer than ext3, so I don't really see how ext3 behaviour can possibly be what generations of programmers expect to see....

Indeed, ext3 has unique rename behaviour as a side effect of data=ordered mode - it flushes the data before flushing the metadata, and so appears to give rename "barrier" semantics. It's the exception, not the rule.

> XFS has never done this. That is why I don't use XFS.

Using data=writeback mode on ext3 makes it behave just like XFS. So ext3 is just as bad as XFS - you shouldn't use ext3 either! :P

> they have a damn stubborn following that insists that the perfectly
> reasonable semantic of close();rename(); is "wrong wrong wrongity wrong,
> burn you evil data hater!"

That's a bit harsh.

There's many good reasons for not doing this - lots of applications don't need or want barrier semantics to rename, or are cross platform and can't rely on implementation specific behaviours for data safety. e.g. rsync is a heavy user of rename, but adding barrier semantics to the way it uses rename would slow it down substantially. Further, rsync doesn't need barrier semantics to guarantee that data has been copied and safely overwritten - it's written to be safe with current rename behaviour because it is both operating system and filesystem independent.

There have also been good arguments put forward for making this change, such as from Val Aurora (who I also quoted in my talk):

http://lwn.net/Articles/351422/

However, no-one has ever followed up on such discussions with patches to the VFS to make this a standard behaviour that you can rely on all linux filesystems to support. I'm certainly not opposed to such changes if the consensus is that this is what we should be doing - I might argue to maintain the status quo (e.g. because rsync performance is extremely important for backups on large filesystems) but that doesn't mean I don't see or understand the benefits of such a change.

Indeed, adding a new rename syscall with the desired semantics, rather than changing the existing one, is a compromise everyone would agree with. Perhaps you could write patches to propose this, seeing as you seem to care about such things?

Dave.

Shared pain

Posted Feb 2, 2012 23:41 UTC (Thu) by khim (subscriber, #9252) [Link]

rename is atomic, but it is not a barrier and never has implied that one exists. rename() has been around for 3 times longer than ext3, so I don't really see how ext3 behaviour can possibly be what generations of programmers expect to see....

Easy: most currently active programmers have never seen a Unix with a journalling FS but without the ability to safely use rename across reboots. Actually they very much insist on such ability - and it looks like the XFS developers are trying to provide the capability. But it's not clear if you can trust them: clearly they value POSIX compatibility and benchmarks more than the needs of real users (who need working applications, after all; filesystem needs are just a minor implementation detail for them).

Indeed, ext3 has unique rename behaviour as a side effect of data=ordered mode - it flushes the data before flushing the metadata, and so appears to give rename "barrier" semantics. It's the exception, not the rule.

When "exception" happens in 90% cases it becomes a new rule - it's as simple as that.

However, no-one has ever followed up on such discussions with patches to the VFS to make this a standard behaviour that you can rely on all linux filesystems to support.

That's because we already have a solution: don't use XFS and you are golden. OSes exist to support applications - as you've succinctly shown above with the rsync example. The only problem: I'm not all that concerned with rsync speed. I need mundane things: fast compilation (solved with gobs of RAM and an SSD; the filesystem is a minor issue after that), reliable work with a bunch of desktop applications (which don't issue fsync(2) before rename(2), obviously). Since I already have a solution I don't see why I should push the patches. If you want to advocate XFS - then you must fix its problems. I'm happy with ext3/ext4 (which may contain bugs but which at least don't try to play the "all your apps are broken, you should just fix them" card).

Shared pain

Posted Feb 3, 2012 4:42 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> rename() has been around for 3 times longer than ext3, so I don't really see how ext3 behaviour can possibly be what generations of programmers expect to see

I'm going to go out on a limb and say that there are more people familiar with the expected ext3 behavior than the total number of people who have ever run UNIX, so I do think that ext3-like behavior is what programmers in general expect these days.

Shared pain

Posted Feb 3, 2012 5:01 UTC (Fri) by neilbrown (subscriber, #359) [Link]

This doesn't change the fact that the ext3 behaviour is a mistake, was not designed, was never universal and so should not be seen as a desirable standard.

Yes, there is room for improvement - there always is. Copying a mistake because it has some good features is not a wise move.

As Dave said - if there is a problem, let's fix it properly.

(and yes, my beard is gray (or heading that way)).

Shared pain

Posted Feb 3, 2012 5:16 UTC (Fri) by raven667 (subscriber, #5198) [Link]

That the behavior was created by accident doesn't mean it's not a good idea, or that it hasn't become a de-facto standard expectation. Why else would there have been so much noise with ext4?

Shared pain

Posted Feb 3, 2012 5:25 UTC (Fri) by dlang (subscriber, #313) [Link]

so you are saying that because ext3 gives better behavior for people who don't code carefully, its behavior is the gold standard, even though there is still room for data loss, and the same ext3 mistake that gave you the better reliability if you are careless also gives you horrid performance if you try to be careful and make sure your data is really safe.

if you could get the advantages without the drawbacks, of course it would be nice, but the same flaw in the ext3 logic that gives you one also gives you the other.

Shared pain

Posted Feb 3, 2012 5:49 UTC (Fri) by raven667 (subscriber, #5198) [Link]

> ext3 gives better behavior for people who don't code carefully, its behavior is the gold standard

It's not even about coding carefully; doing the "correct" thing is not even possible in many of the use cases which are protected by the default ext3 behavior, such as atomically updating a file from a program which is not in C, like a shell script. I learned, along with many admins, to use the atomic rename behavior to implement "safe" updates, which may have been a misunderstanding at the time but can now be considered the new requirement.

At the time this issue was discovered with ext4 there was a frank exchange of ideas and the realization that the expected rename behavior is beneficial to overall reliability and we should make it work properly. I'd be interested in seeing this kind of thing handled at the VFS layer so that the behavior is consistent across all filesystems; that sounds like a great idea.

Shared pain

Posted Feb 6, 2012 23:33 UTC (Mon) by dlang (subscriber, #313) [Link]

the rename behavior was only 'usually safe' without fsyncs (like from scripts), and you could always have a script call 'sync' (a sledgehammer to swat a fly, yes, but in the case of ext3 it generated the same disk I/O that an fsync on an individual file would)

yes, we can look at changing the standard, but the way to do that is to talk about changing the standard, not to insist that the behavior of one filesystem is the only 'correct' way to do things and that all filesystem developers don't care about your data.

Shared pain

Posted Feb 7, 2012 23:47 UTC (Tue) by Wol (guest, #4433) [Link]

I think this argument has long been hashed out, but the point is the unwanted behaviour is pathological.

And IT IS LOGICALLY IMPOSSIBLE if the computer actually does what the programmer asked it to. THAT is the problem - the computer ends up in an "impossible" state.

And if it is logically impossible to end up there, at least in the programmer's mind, it is also logically impossible to make allowances for it and fix the system!

The state, as per the program's world view, is
(a) old file exists
(b) new file is written
(c) new file replaces old file

If the computer crashes in the middle of this we "magically" end up in state (d) old file is full of zeroes.

How do you program to fix a state that it is not logically possible to get to? In such a way as the program is actually guaranteed to work properly and portably?

Cheers,
Wol

Shared pain

Posted Feb 7, 2012 23:53 UTC (Tue) by neilbrown (subscriber, #359) [Link]

Seems like the programmer has an incorrect model of the world.

Writing to a file has never made the data safe in the event of a crash. fsync is needed for that.

If the programmer did not issue 'fsync' but still expected the data to be safe after a crash, then the programmer made a programming error. It really is that simple.

Incorrectly written programs often produce pathological behaviour - it shouldn't surprise you.

Shared pain

Posted Feb 8, 2012 3:41 UTC (Wed) by mjg59 (subscriber, #23239) [Link]

I appreciate that ordering has never been guaranteed by POSIX, but let's limit it to the actual argument rather than an obvious straw man. The desired behaviour was never for a rename to guarantee that the new data had hit disk. The desired behaviour was for it to be guaranteed that *either* the old data or the new data be present. fsync provides guarantees above and beyond that which weren't required in this particular use case. It's unhelpful to simply tell application developers that they should always fsync when we've just spent the best part of a decade using a filesystem that crippled any application that did.

Shared pain

Posted Feb 8, 2012 4:08 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> we've just spent the best part of a decade using a filesystem that crippled any application that did.

That's the heart of the matter to me.... but now XFS - a filesystem that didn't cripple correct applications - is getting a hard time because it doesn't follow the lead of a filesystem that did.

And yes, I know, technical excellence doesn't determine market success, and even the best contender must adapt or die when faced with an ill-informed market. So maybe XFS should adopt the extX model for rename even though it hurts performance in some cases - because if it doesn't people might choose not to use it - and who wants to be the best filesystem that nobody uses (though XFS is a long way from that fate).

So I'm just being a lone voice trying to teach the history and show people that the feature they like so much was originally a mistake, that the programs that use it are actually incorrect (or at least non-portable), and that maybe there are hidden costs in the thing they keep asking for.

I don't expect to be particularly successful, but that is no justification for being silent.

Shared pain

Posted Feb 8, 2012 12:38 UTC (Wed) by mjg59 (subscriber, #23239) [Link]

Arguing "The specification allows us to do this" isn't something that convinces the people who consume your code. Arguing "Our design makes it difficult" is more convincing, but implies that your design stage ignored your users. "We made this tradeoff for these reasons" is something that people understand, but isn't something I've seen clearly articulated in most of these discussions. It just usually ends up with strawman arguments about well how did you expect this stuff to end up on disk when you didn't fsync, which just makes people feel like you don't even care about pretending to understand what they're actually asking for.

(Abstract you throughout)

Shared pain

Posted Feb 8, 2012 13:24 UTC (Wed) by nye (guest, #51576) [Link]

>That's the heart of the matter to me.

Then you have misunderstood the nature of the problem.

The problem is that there are cases when atomicity is required but durability is not so important. With ext3 (et al.) it is possible to get one without the other, but with XFS (et al.) atomicity can only be gained as a side-effect of durability, which is more expensive.

Thus, ext3 provides a feature which XFS does not - one which filesystem developers, as a rule, don't seem to care about, but application developers, as a rule, do. The characterisation of anyone who actually cares for that feature as 'ill-informed' is grating, even offensive to many.

General addendum, not targeted at you specifically: falling back to the observation that XFS's behaviour is POSIX-compliant is pointless because - though true - it is vacuous. In fact POSIX doesn't specify anything in the case of power loss or system crashes, hence it would be perfectly legal for a POSIX-compliant filesystem to fill your hard drive with pictures of LOLcats.

Shared pain

Posted Feb 8, 2012 22:29 UTC (Wed) by dlang (subscriber, #313) [Link]

and with ext3 it's not possible to get durability without a huge performance impact

with any filesystem you have atomic renames IF THE SYSTEM DOESN'T CRASH before the data is written out, that's what the POSIX standard provides.

ext3 gains its 'atomic renames' as a side effect of a bug: it can't figure out what data belongs to what, so if it's trying to make sure something gets written out it must write out ALL pending data, no matter what the data is part of. That made it so that if you are journaling the rename, all the writes prior to that had to get written out first (making the rename 'safe'), but the side effect is that all other pending writes, anywhere in the filesystem, also had to be written out, and that could cause 10s of seconds of delay.

for the casual user, you argue that this is "good enough", but anyone who actually wants durability, not merely atomicity in the face of a crash, has serious problems.

ext4 has a different enough design that they can order the rename after the write of the contents of THAT ONE file, so they can provide some added safety at relatively little cost

you also need to be aware that without the durability, you can still have corrupted files in ext3 after a crash, all it takes is any application that modifies a file in place, including just appending to the end of the file

Shared pain

Posted Feb 8, 2012 19:48 UTC (Wed) by Wol (guest, #4433) [Link]

Lets just say that governments (and businesses) have wasted billions throwing away applications where the application met the spec but in practice was unfit for purpose.

And a filesystem that throws away user data IS unfit for purpose. After all, what was the point of journalling? To improve boot times after a crash and get the system back into production quicker. If you need to do a data integrity check on top of your filesystem check, you've just made your reboot times far WORSE - a day or two would not be atypical after a crash!

Cheers,
Wol

Shared pain

Posted Feb 8, 2012 20:51 UTC (Wed) by raven667 (subscriber, #5198) [Link]

The hyperbole is getting a little out of control. Journaled filesystems have traditionally only journaled the metadata so any file data in-flight at the time of a crash would be lost and corruption would be the result. Pre-journaling any filesystem with a write cache would be susceptible to losing in-flight data and corrupting metadata leading to long fsck times after crash to repair the damage. All filesystems lose data in those circumstances, that doesn't mean that all filesystems are unfit for any purpose or that computers are fundamentally unfit for any purpose. The current state of the art is to be safer with regular data writes, even to the point of checksumming everything, that's nice but the world didn't end when this wasn't the case.

Shared pain

Posted Feb 8, 2012 15:13 UTC (Wed) by Wol (guest, #4433) [Link]

"fsync is needed for that"

And what is the poor programmer to do if he doesn't have access to fsync?

Or what are the poor lusers supposed to do as their system grinds to a halt with all the disk io as programs hang waiting for the disk?

Following the spec is not an end in itself. Getting the work done is the end. And if the spec HINDERS people getting the work done, then it's the spec that needs to change, not the people.

THAT is why Linux is so successful. Linus is an engineer. He understands that. "DO NOT UNDER ANY CIRCUMSTANCES WHATSOEVER break userspace" is the mantra he lives by. And filesystems eating your data while making everything *appear* okay is one of the most appalling breaches of faith by the computer that it could commit!

Cheers,
Wol

Shared pain

Posted Feb 9, 2012 1:26 UTC (Thu) by dlang (subscriber, #313) [Link]

> And what is the poor programmer to do if he doesn't have access to fsync?

use a language that gives them access to data integrity tools like fsync.

for shell scripts, either write a fsync wrapper, or use the sync command (which does exactly the same as fsync on ext3)

> Or what are the poor lusers supposed to do as their system grinds to a halt with all the disk io as programs hang waiting for the disk?

use a better filesystem that doesn't have such horrible performance problems with applications that try and be careful about their data.

> Following the spec is not an end in itself.

True, but what you are asking for is for the spec to be changed, no matter how much it harms people who do follow the spec (application programmers and users who care about durability)

There is no filesystem that you can choose to use that will not lose data if the system crashes. If you are expecting something different, you need to change your expectation.

Shared pain

Posted Feb 9, 2012 7:39 UTC (Thu) by khim (subscriber, #9252) [Link]

Somehow you've forgotten about the most sane alternative:
Remove XFS from all the computers and use sane filesystems (extX, btrfs when it'll be more stable) exclusively.

In a battle between applications and filesystems applications win 10 times out of 10 because without applications filesystems are pointless (and applications are pointless without the user's data).

The whole discussion just highlights that XFS is categorically, absolutely, totally unsuitable for use as a general-purpose FS. And when you don't care about data integrity then ext4 without journalling is actually faster (see Google datacenters, for example).

True, but what you are asking for is for the spec to be changed, no matter how much it harms people who do follow the spec

Yes.

application programmers and users who care about durability

Applications don't follow the spec. When they do they are punished and fixed. Thus users who care about durability need to use filesystems which work correctly given the existing applications.

Is it fair? No. It's a classic vicious cycle. But said cycle is a fact of life. Ignore it at your peril.

I, for one, have a strict policy to never use XFS and to not even consider bugs which cannot be reproduced with other filesystems. Exactly because XFS developers think specs trump reality for some reason.

There is no filesystem that you can choose to use that will not loose data if the system crashes. If you are expecting something different, you need to change your expectation.

That's irrelevant. True, the loss of data in the case of a system crash is unavoidable. I don't care if the window I've opened right before a crash is reopened in Firefox or not. I understand that spinning rust is slow and can lose such info. But if windows which were opened an hour before that are lost because XFS replaced the saved state file with zeros, then such a filesystem is useless in practice. Long ago XFS was prone to such data loss even if fsync was used and the data was "saved" to disk days before the crash. After a lot of work it looks like the XFS developers fixed this issue, but now they are stuck with the next step: atomic rename. It should be implemented for the FS to be suitable for real-world applications. There are even some hints that XFS has implemented it, but as long as XFS developers exhibit this "specs are important, real applications aren't" pathological thinking it's way too dangerous to even try to use XFS.

Shared pain

Posted Feb 9, 2012 9:12 UTC (Thu) by dlang (subscriber, #313) [Link]

if you use applications that follow the specs (for example, just about every database, or mailserver), then XFS/ext4/btrfs/etc are very reliable.

what you seem to be saying is that these classes of programs should be forced to use filesystems that give them huge performance penalties to accommodate other programs that are more careless, so that those careless programs lose less data (not no data loss, just less)

Shared pain

Posted Feb 9, 2012 9:19 UTC (Thu) by dlang (subscriber, #313) [Link]

by the way, I've done benchmarks on applications that do the proper fsync dance needed for the data to actually be safe (durable, not just atomic filesystem renames that may or may not get written to disk), and even on an otherwise idle system ext3 was at least 2x slower; and if you have other disk activity going on at the same time, the problem only goes up (if you have another process writing large amounts of data, the performance difference for your critical app can easily be 40x slower on ext3)

Shared pain

Posted Feb 9, 2012 17:37 UTC (Thu) by khim (subscriber, #9252) [Link]

Exactly. This is part of the very simple proof sequence.

Fact 1: any application which calls fsync is very slow in ext3. You've just observed it.
Conclusion: most applications don't call fsync.
Fact 2: most systems out there are either "small" (where a lot of applications share one partition) or huge (where reliability of filesystem does not matter because there are other ways to keep data around like GFS).
Conclusion: any real-world filesystem needs to support all the applications which are "wrong" and don't call fsync, too.
Fact 3: XFS does not provide these guarantees (and tries to cover it with POSIX, etc).
Conclusion: XFS? Fuhgeddaboudit.

Yes, it's not fair to XFS. No, I don't think being fair is guaranteed in real world.

Shared pain

Posted Feb 9, 2012 19:26 UTC (Thu) by dlang (subscriber, #313) [Link]

sorry, on my systems I'm not willing to tolerate a 50x slowdown just to make badly written apps be a little less likely to be confused after a power outage.

and I think that advocating that you have the right to make this choice for everyone else is going _way_ too far.

when I have applications that lose config data after a problem happens (which isn't always a system crash; apps that have this sort of problem usually have it after the application crashes as well), my solution is backups of the config (ideally into something efficient like git), not crippling the rest of the system to band-aid the bad app.

Shared pain

Posted Feb 9, 2012 20:44 UTC (Thu) by Wol (guest, #4433) [Link]

And what is "badly written" about an app that expects the computer to do what was asked of it?

I know changing things around for the sake of it doesn't matter when everything goes right, but if I tell the computer "do this, *then* that, *followed* by the other", well, if I told an employee to do it and they did things in the wrong order and screwed things up as a *direct* *result* of messing with the order, they'd get the sack.

The only reason we're in this mess, is because the computer is NOT doing what the programmer asked. It thinks it knows better. And it screws up as a result.

And the fix isn't that hard - just make sure you flush the data before the metadata (or journal the data too), which is pretty much (a) sensible, and (b) what every user would want if they knew enough to care.

Cheers,
Wol

Shared pain

Posted Feb 9, 2012 20:52 UTC (Thu) by dlang (subscriber, #313) [Link]

it is badly written because you did not tell the computer that you wanted to make sure that the data was written to the drive in a particular order.

If the system does not crash, the view of the filesystem presented to the user is absolutely consistent, and the rename is atomic.

The problem is that there are a lot of 'odd' situations that you can have where data is written to a file while it is being renamed that make it non-trivial to "do the right thing" because the system is having to guess at what the "right thing" is for this situation.

try running a system with every filesystem mounted with the sync option; that will force the computer to do exactly what the application programmers told it to do, writing all data exactly when they tell it to, even if this means writing the same disk sector hundreds of times as small writes happen. The result will be unusable.

so you don't _really_ want the computer doing exactly what the programmer tells it to; you only want it to do so some of the time, not the rest of the time.

Shared pain

Posted Feb 9, 2012 21:13 UTC (Thu) by khim (subscriber, #9252) [Link]

> so you don't _really_ want the computer doing exactly what the programmer tells it to, you only want it to do so some of the time, not the rest of the time.

Sure. YMMV, as I've already noted. A good filesystem for USB sticks must flush on close(2). A good general-purpose filesystem must guarantee rename(2) atomicity in the face of a system crash.

You can use whatever you want for your own system; that's your choice. But when the question is about a replacement for extX… that's another thing entirely. To recommend a filesystem which likes to eat users' data is simply irresponsible.

Shared pain

Posted Feb 14, 2012 16:16 UTC (Tue) by nye (guest, #51576) [Link]

> when I have applications that lose config data after a problem happens (which isn't always a system crash, apps that have this sort of problem usually have it after the application crashes as well)

That can't possibly be the case. You must be talking about applications which do something like truncate+rewrite, which is entirely orthogonal to the discussion (and is pretty clearly a bug).

I suspect you haven't understood the issue at hand.

Shared pain

Posted Feb 9, 2012 17:25 UTC (Thu) by khim (subscriber, #9252) [Link]

> What you seem to be saying is that these classes of programs should be forced to use filesystems that give them huge performance penalties to accommodate other programs that are more careless, so that those careless programs lose less data

In a word: yes.

> not no data loss, just less

Always and forever. No matter what filesystem you are using, your data is toast in the case of a RAID failure or a lightning strike. This means that we always talk about probabilities.

This leads us to a detailed explanation of the aforementioned phenomenon: in most cases you cannot afford dedicated partitions for your database or mailserver, and in this world a filesystem without suitable reliability guarantees (like atomic rename in the crash case, without fsync) is pointless. When your system grows it becomes a good idea to dedicate a server just to be a mailserver or just to be a database server. But the window of opportunity is quite small, because when you go beyond a handful of servers you need to develop plans which will keep your business alive in the face of a hard crash (HDD failure, etc.). And if you've designed your system for such a case, then all these journalling efforts in a filesystem are just useless overhead (see Google, which switched from ext2 to ext4 without a journal).

I'm not saying XFS is always useless. No, there exist cases where you can use it effectively. But these cases are rare, thus XFS will always be undertested. And this, in turn, usually means you should stick with extX/btrfs.

Shared pain

Posted Feb 3, 2012 10:19 UTC (Fri) by khim (subscriber, #9252) [Link]

> Yes, there is room for improvement - there always is. Copying a mistake because it has some good features is not a wise move.

This depends on your goal, actually. If your goal is something theoretically sound, then no, it's not a wise move. If your goal is the creation of something which will actually be used by real users, then it's the only possible move.

> (and yes, my beard is gray (or heading that way)).

My beard is not yet gray, but I've been around long enough to see where the guys who made the "wise move" ended up. I must admit that they created really TRULY nice exhibits for the Computer History Museum. Meanwhile the creations of the "unwise" guys are used for real work.

If your implementation unintentionally introduced some property and people started depending on it, it's the end of the story: you are doomed to support said property forever. If you want to keep those people, obviously. If your goal is just to create something nice for the sake of art or science, then the situation is different, of course.

This is a basic fact of life, and it's truly sad to see that so many Linux developers (especially the desktop guys) don't understand it.


Copyright © 2017, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds