skipping fsync is not safe on any filesystem that's not mounted -sync
this is true for every OS
Posted Jan 29, 2012 1:53 UTC (Sun) by sbergman27 (guest, #10767)
Well, both my mattress and J.P. Morgan *could* lose my money, so putting my money in either place represents equal risk. If I understand you correctly, you are saying that Ext3 mounted data=ordered, Ext4 mounted with the defaults, and XFS mounted with the defaults all represent equal risk to our data, because any one of them *could* conceivably lose our data.
Again, I'm not buying it.
Posted Jan 30, 2012 21:49 UTC (Mon) by dlang (✭ supporter ✭, #313)
I am not saying that the risk is equal; I am disputing the statement that ext3 is rock solid and won't lose your data without you needing to do anything.
Ext3 is one of the worst possible filesystems to use if you really care about your data not getting lost (and therefore implement the fsync dance to make sure you don't lose data), because its fsync performance is so horrid.
The applications are not keeping the data in buffers 'as long as they can', they are keeping the data in buffers for long enough to be able to optimize disk activity.
The first is foolishly risking data for no benefit, the second is taking a risk for a direct benefit. These are very different things. Ext3 also keeps data in buffers and delays writing it out in the name of performance. Every filesystem available on every OS does so by default; they may differ in how long they will buffer the data and in what order they write things out, but they all buffer the data unless you explicitly mount the filesystem with the sync option to tell it not to.
You say that people will always pick reliability over performance, but time and time again this is shown to not be the case. As pointed out by another poster, MySQL grew almost entirely on its "performance over reliability" approach; only after it was huge did they start pushing reliability. The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.
This extends beyond the computing field. There is no field where the objective is to eliminate every risk, no matter what the cost. It is always a matter of balancing the risks against the cost of eliminating them, or against the benefits of accepting them.
That's the problem...
Posted Jan 30, 2012 22:54 UTC (Mon) by khim (subscriber, #9252)
In theory, theory and practice are the same. In practice, they are not.
Ext3 is one of the worst possible filesystems to use if you really care about your data not getting lost (and therefore implement the fsync dance to make sure you don't lose data), because its fsync performance is so horrid.
Right, but that's the problem: most developers don't care about these things (most just don't think about the problem at all, others just hope it all will work... somehow). Most users do. Thus we have a strange fact: in theory ext3 is the worst possible FS from "lost data" POV, in practice it's one of the best.
Posted Jan 30, 2012 23:16 UTC (Mon) by dlang (✭ supporter ✭, #313)
For other developers, the fact that fsync performance is so horrible on the default filesystem for many distros has trained a generation of programmers to NOT use fsync (because it kills performance in ways that users complain about)
Posted Feb 2, 2012 3:45 UTC (Thu) by tconnors (guest, #60528)
Then there's the fact that fsync will spin up your disks if you were trying to keep them spun down (to the point where on laptops, I try to use 30 minute journal commit times, and manually invoke sync when I absolutely want something committed). I don't want or need an absolute guarantee that the new file has hit the disk consistent with metadata. I want an absolute guarantee that /either/ the new file or the old file is there, consistent with the relevant metadata. ext3 did this. It's damn obvious what rename() means - there should be no need for every developer to go through all code in existence and change the semantics of code that used to work well *in practice*. XFS loses files every time power fails *in practice*. If I need to compare to backup *every time* power fails, then I might as well be writing all my data to volatile RAM and do away with spinning rust altogether, because that's all that XFS is good for.
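The old-file-or-new-file guarantee being asked for here is what the write-then-rename() dance provides on POSIX systems. A minimal sketch in Python (the helper name atomic_replace is my own, not from the thread):

```python
import os
import tempfile

def atomic_replace(path, data):
    # Write the new contents to a temporary file in the same directory,
    # force them to stable storage with fsync, then rename() over the
    # target.  rename() is atomic on POSIX, so a reader (or a crash)
    # sees either the complete old file or the complete new one.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # the contested, disk-spinning step
        os.rename(tmp, path)       # atomic replacement
    except BaseException:
        os.unlink(tmp)
        raise
```

Dropping the os.fsync() call is exactly the trade-off under discussion: the rename stays atomic, but after a power failure the "new" file may be empty or partial because its data never left the page cache.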
Another pathological (but instructive) case...
Posted Jan 30, 2012 23:03 UTC (Mon) by khim (subscriber, #9252)
Similar story happens with USB sticks: most users believe FAT (on Windows) is super-safe, NTFS (on Windows) is horrible - and Linux is awful no matter what. Why? Delayed write defaults. FAT on Windows is tuned to flush everything on ANY close(2) call. NTFS works awfully slow in this mode thus it uses more aggressive caching. And on Linux caching is always on.
And users just snatch USB stick the very millisecond program window is closed (well... most do... the cautious ones wait one or two seconds). They feel it's their unalienable right. In these circumstances suddenly the oldest and the most awful filesystem of them all becomes the clear winner!
Exactly because "I care about my data" does not automatically imply "I'll do what I'm told to do to keep it".
The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.
Posted Feb 4, 2012 13:08 UTC (Sat) by Wol (guest, #4433)
I work with Pick (a noSQL db), and a LOT of the reliability problems that ACID is meant to fix, just *can't* *happen* in Pick.
Well, they can if the database was badly designed, but you can get similar pain in relational databases too...
Relational is a lovely mathematical design - I would use it to design a database without a second thought - but I would then convert that design to NF2 (non-first-normal-form) and implement it in Pick. Because 90% of ACID's benefits would then be redundant, and the database would be so much faster too.
You've heard my war-story of a Pentium90/Pick combo outperforming an Oracle/twinXeon800, I'm sure ...
Posted Feb 4, 2012 13:27 UTC (Sat) by gioele (subscriber, #61675)
This is getting off-topic, but could you explain which kinds of reliability problems that ACID DBs are meant to fix cannot happen in Pick, and why?
Posted Feb 7, 2012 23:20 UTC (Tue) by Wol (guest, #4433)
Look for what I call "real world primary keys" - an invoice has a number, a person has a name. Now we've got a primary key, we work out all the attributes that belong to that key. We can now do a relational analysis on those attributes. (forget that real-world primary keys aren't always unique and you might have to create a GUID etc.)
With an invoice, in relational you'll end up with a bunch of rows spread across several tables for each invoice. IN PRACTICE, ACID is mostly used to make sure all those rows pass successfully through the database from application to disk and back again.
In Pick, however, you then coalesce all those tables (2-dimensional arrays) together into one n-dimensional Pick FILE. And you coalesce all those rows together into one Pick RECORD. With the result that there is no need for the database to make sure the transaction is atomic. All the data is held as a single atom in the application, and is passed through the database to and from disk as a single atom.
That's also why Pick smokes relational for speed - access any attribute of an object, and all attributes get pulled into cache. Try doing that for a complex object with relational !!! :-) (Plus Pick is self-optimising, and when optimised it takes, on average, just over one disk seek per primary key to find the data it's looking for on disk!)
The problem I see with relational is it is MATHS and, to reference Einstein, therefore has no basis in reality. Pick is based on solid engineering, and when "helped" with relational theory really flies. Relational practice actually FORBIDS a lot of powerful optimising techniques.
And if designed properly, a Pick database is normalised therefore it can look like a relational database, only superfast. I always compare Pick and relational to C and Pascal. Pick and C give you all the rope you need to seriously shoot yourself in the foot. Relational and Pascal have so many safety catches, it's damn hard to actually do any real work.
(And because foreign keys are attributes of a primary key, you can also trap errors in the application. For example, the client's key is a mandatory element of an invoice so it belongs in the invoice. Okay, it's the app's job to make sure the record isn't filed without a client, whereas in relational you can leave it to the DB, but in Pick it's easy enough to add a business layer between the app and the DB that does this sort of thing.)
Posted Feb 7, 2012 23:26 UTC (Tue) by Wol (guest, #4433)
In relational, attributes that are tied together can be spread (indeed, for a complex object MUST be spread) across multiple tables and rows. Column X in table Y is meaningless without column A in table B.
In Pick, that data would share the same primary key, and would be stored in one FILE, in one RECORD. Delete the primary key and both cells vanish together. Create the primary key, and both cells appear waiting to be filled.
As far as Pick is concerned, a RECORD is a single atom to be passed through from disk to app and vice versa. From what I can make out, in relational you can't even guarantee a row is a single atom!
Posted Feb 8, 2012 0:45 UTC (Wed) by dlang (✭ supporter ✭, #313)
ACID is about modifying the files on disk in a sequence that ensures you have either the new data or the old data at all times, and that once the application is told the transaction is done, there is no way for it to disappear.
what you are talking about with Pick is a way to bundle related things together so that it's easier to be consistent. That doesn't mean that writing your record will happen in an atomic manner on the filesystem (if the record is large enough, this won't be an atomic action)
Posted Feb 8, 2012 14:52 UTC (Wed) by Wol (guest, #4433)
AND THAT IS MY POINT. In Pick, 90% of the time, this is unnecessary and wasteful, because what is a single transaction as seen by the DB is also a single transaction as seen by the OS and disk subsystem!
I agree with you, if the record is too big, it won't go onto the file store in one piece, but the point is that with a relational DB you can pretty much guarantee that, in practice, a transaction will never go onto the filestore in one piece. So ACID is needed. But in Pick it's unusual for it NOT to go on the filesystem in one piece. So most of the time ACID is an unnecessary complexity.
I'm ignoring file system failures like XFS/ext4 zeroing out your table in a crash :-) because I don't see how ACID can protect against the OS trashing your data :-)
What you need to do is realise that ACID sits between the database and disk. As you say, it guarantees that the database in memory is (a) consistent, and (b) accurately represented on disk. And because, *in* *the* *real* *world*, pretty much any change in a relational database requires multiple changes in multiple places on disk, ACID is a necessity.
But in the real world, most changes in a Pick database only involve a *single* change in a *single* place on disk to ensure consistency. So a "write successful" from the OS is all that's needed to provide a "good enough" implementation of ACID. (And if the OS lies to your ACID layer, you're sunk even if you've got ACID. See all the other posts in this thread about disks lying to the OS!)
(This has other side effects. Yes, a Pick database can get into an inconsistent state. But that inconsistent state MIRRORS REALITY. A Pick database can lose the connection between a person and his house. Or a car and its owner. But in reality a person can lose their house. A car can lose its owner. It's far too easy to assume in a relational database that everyone has a home, and next thing you can't put some poor vagrant in your database when you need to ...)
Posted Feb 8, 2012 22:17 UTC (Wed) by dlang (✭ supporter ✭, #313)
Posted Feb 9, 2012 0:17 UTC (Thu) by Wol (guest, #4433)
Because if it's at the OS/disk interface, what the heck is ACID doing in the database? It can't provide ANY guarantees, because it's too remote from the action.
And if it's at the db/OS interface, well as far as Pick is concerned, most transactions are near-enough atomic that the overhead isn't worth the cost (that was my comment about "90% of the time").
Your relational bias is clouding your thinking (although Pick might be clouding mine :-) But just because relational cannot do atomic transactions to disk doesn't mean Pick can't. As far as Pick is concerned, that transaction is atomic right up to the point that the OS code actually puts the data onto the disk. And if the OS screws that up, ACID isn't going to save you ...
Think of a "begin transaction" / "end transaction" pair. It's almost impossible for that transaction to truly be atomic in a relational database - you will invariably need to update multiple rows. In Pick, it's more than possible for that transaction to be truly atomic at the point where the db hands it over to the OS. ACID enforces atomicity between the OS and the db. Pick doesn't need it.
What guarantees does ACID provide over and above data consistency? Because a well-designed Pick app guarantees "if it's there it's consistent". And if the OS screws up and corrupts it, neither Pick nor ACID will save you.
Posted Feb 9, 2012 0:54 UTC (Thu) by dlang (✭ supporter ✭, #313)
ACID is a feature that SQL databases have had, but you don't need to abandon SQL to abandon ACID and you don't need to have SQL to have ACID
Berkeley DB is ACID, but not SQL, MySQL was SQL but not ACID with the default table types for many years.
ACID involves the database application doing a lot of stuff to provide the ACID guarantees to users by using the features of the OS and hardware. If the OS/hardware lies to the database application about when something is actually completed then the database cannot provide ACID guarantees.
It appears that you have an odd interpretation of what ACID means, so let's review:
Atomicity: a transaction is either completely applied or not applied at all. For changes to a single record this is relatively easy to do, but if a transaction involves changing multiple records (subtract $10 from account A and add $10 to account B) it's not as simple as atomically writing one record. Remember that even a single write() call in C is not guaranteed to be atomic (it's not even guaranteed to succeed fully; you may be able to write part of the data and not the rest).
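The short-write caveat applies in Python just as in C: os.write() may report fewer bytes written than requested, so robust code loops over the remainder. A small sketch (write_all is my own name for the idiom):

```python
import os

def write_all(fd, data):
    # os.write() may perform a partial write and return the count of
    # bytes it actually wrote; keep writing the remainder until every
    # byte is out (errors propagate as OSError).
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]
```

Even with the loop, nothing here is atomic or durable: a crash mid-loop leaves a partially written record on disk, which is precisely why a database layers journals or write-ahead logs on top of plain writes.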
Consistency: this says that at any point in time the database will be consistent, by whatever rules the database chooses to enforce. Berkeley DB has very trivial consistency checks (the records must all be complete); many SQL databases have far more complex consistency requirements (foreign keys, triggers, etc.).
Isolation: this says that one transaction cannot affect another transaction happening at the same time.
Durability: this says that once a transaction is reported to succeed, nothing, including a system crash at that instant (but excluding something writing over the file on disk), will cause the transaction to be lost.
What you are describing about Pick makes me think that it has very loose consistency and isolation requirements, but to get atomicity and durability the database needs to be very careful about how it writes changes.
It cannot overwrite an existing record (because the write may not complete), and it must issue appropriate system calls (fsync and similar) to the OS, and watch for the appropriate results, to know when the data has actually been written to disk and will not change.
It's getting this last part done that really differentiates similar database engines from each other. There are many approaches to doing this and they all have their performance trade-offs. If you are willing to risk your data by relaxing these requirements a database becomes trivial to implement and is faster by several orders of magnitude.
note how the only SQL concept that is involved here is the concept of a transaction in changing the data.
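The subtract-from-A, add-to-B example above is easy to demonstrate with Python's built-in sqlite3 module (the table layout here is my own illustration, not anything from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("A", 100), ("B", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    # The two UPDATEs form one transaction: both are applied, or neither
    # is.  Using the connection as a context manager commits on success
    # and rolls back if either statement raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src))
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst))

transfer(conn, "A", "B", 10)
```

As dlang notes, the transaction concept is independent of SQL itself; the same all-or-nothing grouping could be offered by any storage engine.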
Posted Feb 9, 2012 20:22 UTC (Thu) by Wol (guest, #4433)
Atomic: as I said, a transaction in a relational database will pretty much inevitably be split across multiple, often many, tables. In Pick, all dependent attributes (excluding foreign-key links) will be updated as a single transaction right down to the file-system layer. So, as an example, if I have separate FILEs for people and buildings, it's possible I'll corrupt "where someone lives" if I update the person and fail to create the building, but I won't have inconsistent person or building data.
Consistency: IF designed properly, a Pick database should be consistent within entities. All data associated with an individual "real world primary key". Relations between entities could get corrupted, but that *should* be solved with good programming practice - in my example above, "lives at" is an attribute of person, so you update building then person.
Isolation: I don't quite understand that, so I won't comment.
Durability: Well, when I tried to write a Pick engine, my first reaction to actually writing FILEs to disk was "copy on write seems pretty easy...". And there comes a point where you have to take the OS on trust.
So I think my premise still stands - a LOT of the requirement for ACID is actually *caused* by the rigorous separation demanded by relational between the application and the database. By allowing the application to know about (and work with) the underlying database structure you can get all the advantages of relational's rigorous analysis, all the advantages of a strong ACID setup, and all the advantages of noSQL's speed. But it depends on having decent programmers (cue my previous comment about Pick and C giving you all the rope you need ...)
And one of the reasons I wanted to write that Pick db engine was so I could put in - as *optional* components - loads of stuff that enforced relational constraints to try and rein in the less-competent programmers! I want a Modula-2 sort of Pick, that by default protects you from yourself, but where the protections can be turned off.
Posted Feb 9, 2012 20:36 UTC (Thu) by dlang (✭ supporter ✭, #313)
consistency, what if part of your updates get to disk and other parts don't? what if the OS (or drive) re-orders your updates so that the write to the record for person happens before the write to building?
As far as durability goes, if you don't tell the OS to flush its buffers (which is what fsync does), then after a crash you have no idea what may have made it to disk and what didn't.
Posted Feb 10, 2012 16:17 UTC (Fri) by Wol (guest, #4433)
Well, if you define the transaction as an entity, then it gets written to its own FILE. If the system crashes then you get a discrepancy that will show up in an audit. It makes sense to define it as an entity - it has its own "primary key" ie "time X at teller Y". Okay, you'll argue that I have to run an integrity check after a crash (true) while you don't, but I can probably integrity-check the entire database in the time it takes you to scan one big table :-)
Consistency? Journalling a transaction? Easily done.
And yes, your point about flushing buffers is good, but that really should be the OS's problem, not the app (database) sitting on top. Yes I know, I used the word *should* ...
Look at it from an economic standpoint :-) If my database (on equivalent hardware) is ten times faster than yours, and I can run an integrity check after a crash without impinging on my users, and I can guarantee to repair my database in hours, which is the economic choice?
Marketing 101 - proudly announce your weaknesses as a strength. The chances of a crash occurring at the "wrong moment" and corrupting your database are much higher with SQL, because any given task will typically require tens to hundreds of times more transactions between the db and the OS than Pick. So SQL needs ACID. With Pick, the chances of a crash happening at the wrong moment and corrupting data are much, much lower, so expensive strong ACID actually has a prohibitive cost - especially if you can get 90% of the benefits for 10% of the effort.
I'm not saying ACID isn't a good thing. It's just that the cost/benefit equation for Pick says strong ACID isn't worth it - because the benefits are just SO much less. (Like query optimisers. Pick doesn't have an optimiser because it's pretty much a dead cert that the optimiser will save less than it costs!)
Posted Feb 10, 2012 18:43 UTC (Fri) by dlang (✭ supporter ✭, #313)
that doesn't sound like a performance win to me.
Posted Feb 11, 2012 2:30 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
Posted Feb 11, 2012 5:48 UTC (Sat) by dlang (✭ supporter ✭, #313)
besides, git tends to keep the most recent version of a file uncompressed, it's only when the files are combined into packs that things need to be reconstructed, and even there git only lets the chains get so long.
Posted Feb 11, 2012 13:44 UTC (Sat) by Cyberax (✭ supporter ✭, #52523)
NoSQL systems work in a similar way - they can store the 'tip' of the data, so that they don't have to reapply all the patches all the time. However, the latest data view can be rebuilt if required.
Posted Feb 12, 2012 15:57 UTC (Sun) by nix (subscriber, #2304)
Posted Feb 12, 2012 18:29 UTC (Sun) by dlang (✭ supporter ✭, #313)
and it's frequently faster to read a compressed file and uncompress it than it is to read the uncompressed equivalent (especially for highly compressible text like code or logs), I've done benchmarks on this within the last year or so
Posted Feb 12, 2012 13:38 UTC (Sun) by Wol (guest, #4433)
Each month, when you run end-of-month statements, you save that info. When you update an account you keep a running total.
If the system crashes you then do "set corruptaccount = true where last-month plus transactions-this-month does not equal running balance". At which point you can do a brute-force integrity check on those accounts.
(If I've got a 3rd state of that flag, undefined, I can even bring my database back on line immediately after I've run a "set corruptaccount to undefined" command!)
And in Pick, that query will FLY! If I've got a massive terabyte database that's crashed, it's quite likely going to take a couple of hours to reboot the OS (I just rebooted our server at work - 15-20 mins to come up, including disk checks etc). What's another hour running an integrity check on the data? And I can bring my database back on line as soon as that query (and others like it) has completed. Tough luck on the customer whose account has been locked ... but 99% of my customers can have normal service resume quickly.
Thing is, I now *know* after a crash that my data is safe, I'm not trusting the database company and the hardware. And if my system is so much faster than yours, once the system is back I can clear the backlog faster than you can. Plus, even if ACID saves your data, I've got so much less data in flight and at risk.
But this seems to be mirroring the other debate :-) the moan about "fsync and rename" was that fsync was guaranteeing (at major cost) far more than necessary. The programmer wanted consistency, but the only way he could get it was to use fsync, which charged a high price for durability. If I really need ACID I can use BEGIN/END TRANSACTION in Pick. But 99% of the time I don't need it, and can get 90% of its benefits with 10% of its cost, just by being careful about how I program. At the end of the day, Pick gives me moderate ACID pretty much by default. Why should I have to pay the (high) price for strong ACID when 90% of the time, it is of no benefit whatsoever? (And how many SQL programmers actually use BEGIN/END TRANSACTION, even when they should?)
Posted Feb 8, 2012 14:08 UTC (Wed) by nix (subscriber, #2304)
From what I can make out, in relational you can't even guarantee a row is a single atom!
But in practice, in SQL... just try INSERTing half a row. You can't. Atomicity at the row level is guaranteed. I hate SQL, but at least it does this right.
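nix's point is easy to verify: a statement that supplies only half a row fails as a whole, leaving nothing behind. A quick sqlite3 demonstration (the invoices table is my own example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER NOT NULL, client TEXT NOT NULL)")

# "INSERTing half a row": the missing NOT NULL column makes the whole
# statement fail, so no partial row is ever stored.
try:
    conn.execute("INSERT INTO invoices (id) VALUES (1)")
except sqlite3.IntegrityError:
    pass   # expected: NOT NULL constraint failed

rows = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```

The table stays empty: the failed INSERT is all-or-nothing at the row level, regardless of how the engine lays the columns out on disk.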
Posted Feb 8, 2012 15:00 UTC (Wed) by Wol (guest, #4433)
Well, the relational algebra does not discuss storage at all, and does not stipulate where relations might reside on permanent storage
Which is exactly my beef with relational databases. C&D FORBID you from telling the database where relations should be stored for efficiency. But in REALITY it is highly probable that, if you access one attribute associated with my primary key, you will want to access others. But it's a complete gamble retrieving the same attribute associated with other primary keys. Because Pick guarantees (by accident, admittedly) that all attributes are stored in the same atom as the primary key they describe, all those attributes you are statistically most likely to want are exactly the attributes that coincidentally get retrieved together.
Posted Feb 8, 2012 15:05 UTC (Wed) by Wol (guest, #4433)
What I meant was it's not guaranteed at the physical level in the datastore. Two cells in the same row could be stored in completely different "buckets" in the database, for example the data is stored in an index with a pointer from the row. I know that probably doesn't happen but if the guy who designed the database engine thinks it's more efficient there's nothing stopping him.
So even if you the database programmer *think* an operation should be atomic right down to the disk, there's no guarantee.
Posted Feb 8, 2012 21:53 UTC (Wed) by nix (subscriber, #2304)
If it matters that data is written to the disk atomically, you have already lost, because *nothing* is written to the disk atomically, not least because you invariably have to update metadata, and secondly because no disk will guarantee what happens to partial writes in the case of power failure. So, as long as you have to keep a journal or a writeahead log to deal with that, why not allow arbitrarily large amounts of data to appear to be written atomically? Hence, transactions.
It is true that programs that truly use transactions are relatively rare: in one of my least proud moments I accidentally changed the rollback operation in one fairly major financial system to do a commit and it was a year before anyone noticed. However, when you *do* have code that uses transactions, the effect on the code complexity and volume can be dramatic. As a completely random example, I've written backtracking searchers that relied on rollback in about 200 lines before, because I could rely on the database's transaction system to do nearly all the work for me.
Posted Feb 9, 2012 20:30 UTC (Thu) by Wol (guest, #4433)
It happens quite a lot, increasingly often now that databases are lifting the horrible restrictions many of them had on the total amount of data stored per row (Oracle and MySQL had limits low enough that you could hit them in real systems quite easily).
Sorry, I have to laugh here. It's taken Pick quite a while to get rid of the 32K limit, but that limit does date from the age of the dinosaur when computers typically came with 4K of core ...
And no limit on the size of individual items, or the number of items in a FILE.
Posted Feb 9, 2012 20:38 UTC (Thu) by dlang (✭ supporter ✭, #313)
Posted Feb 10, 2012 16:25 UTC (Fri) by Wol (guest, #4433)
Look at the comment you're replying to :-) In early Pick systems I believe it was possible for a single item to be larger than available memory ...
Okay, it laid the original systems wide open to serious problems if something went wrong, but as far as users were concerned Pick systems didn't have disk. It was just "permanent memory". And Pick was designed to "store all its data in ram and treat the disk as a huge virtual memory". I believe they usually got round any problem by flushing changes from core to disk as fast as possible, so in a crash they could just restore state from disk.
Posted Feb 2, 2012 1:06 UTC (Thu) by darrint (guest, #673)
I'm not sure how that metaphor is supposed to work. A few years ago when the U.S. government passed new credit reforms they were time delayed by several months at the request of the targeted banks, supposedly to give them time to carefully update their big and very important computer systems. The reality was the banks used those several months to burn and pillage the assets of people like me. I was in debt, my own stupidity of course, and I probably lost over a thousand dollars in fees alone due to a few financial institutions playing shenanigans with the ragged edges of my account terms.
At the time I thought wistfully of how much more secure my money would be if I could just collect my pay in cash and drive it to my house.
Posted Feb 2, 2012 14:08 UTC (Thu) by rwmj (subscriber, #5474)
Does your house loan you money? What does this negative money look like that you keep under your mattress?
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds