LWN: Comments on "XFS: the filesystem of the future?"

XFS: the filesystem of the future?

vcmohans — Fri, 30 Mar 2018 07:19:51 +0000

I am trying to get the best performance when searching for a string in 12 different files with 12 dedicated threads. For that I have configured 120GB of two Samsung SSDs in a RAID0 (striping) configuration with the XFS file system. But the result is the same as with the EXT4 file system. Why has the performance not improved?

XFS: the filesystem of the future?

XTF — Wed, 07 Mar 2012 16:54:09 +0000

> but their underlying assumption is that moving all this into the kernel will solve all these problems. Well, it doesn't.

Why not? Some of the issues should be easy to avoid in kernel space.

> For example: how does moving the overwrite into the kernel make it any easier to decide whether we should break hard links or not?

What does the non-atomic case do? I'd do the same in the atomic case.

> IOWs, the issue here is that nobody has defined what the operations being asked for are supposed to do, the use cases, the expected behaviour, constraints, etc.

My request was quite clear: Could any of the fsync advocates post real code that does the atomic variant of open, write, close?
Isn't that quite well-defined?

> Before asking kernel developers to do something the problem space needs to be scoped and specified

Before that, we should agree that these are valid issues / assumptions / regressions.

> But what really needs to happen first is that someone asking for this new kernel functionality steps up and takes responsibility for driving this process.....

That'd be great, but at the moment most (FS) devs (and others) just declare these non-issues and refuse to discuss.

XFS: the filesystem of the future?

dgc — Wed, 07 Mar 2012 11:19:29 +0000

> the OP is concerned about rather esoteric situations

I know. But what was described was a specific set of block and disk cache manipulations. It wasn't a set of requirements; all I was doing is pointing out if I treated it as a set of requirements then it can be implemented with the existing POSIX API.

> These situations do exist, but they are rather rare and the 'right
> thing' to do is not always an obvious and consistent thing.

And that is precisely why the problem can't be solved by some new magic filesystem operation until the desired behaviour is defined and specified. The weird corner cases that make it hard for userspace also make it just as hard to implement it in the kernel. I see lots of people pointing out why using rename is hard, but their underlying assumption is that moving all this into the kernel will solve all these problems. Well, it doesn't. For example: how does moving the overwrite into the kernel make it any easier to decide whether we should break hard links or not?

IOWs, the issue here is that nobody has defined what the operations being asked for are supposed to do, the use cases, the expected behaviour, constraints, etc.

Before asking kernel developers to do something the problem space needs to be scoped and specified. For the people that want this functionality in the kernel: write a man page for the syscall. Refine it. Implement it as a library function and see if you can use it effectively for your use cases. Work out all the kinks in the syscall API. Ask linux-fsdevel for comments and if it can be implemented. Go back to step 1 with all the review comments you receive, then continue looping until there's consensus and somebody implements the syscall with support from a filesystem or two. Once the syscall is done, over time more filesystems will then implement support and then maybe 5 years down the track we can assume this functionality is always available.

But what really needs to happen first is that someone asking for this new kernel functionality steps up and takes responsibility for driving this process.....

Gesta non verba.

XFS: the filesystem of the future?

XTF — Tue, 06 Mar 2012 22:09:33 +0000

> about rather esoteric situations where

Since when is not resetting meta-data (like file owner, permissions, acls, creation timestamp, etc) an esoteric situation?

Each case by itself might be insignificant, but all cases together are not, IMO:
Assuming the target is not a symlink to a different volume
Assuming you are allowed to create the tmp file
Assuming you are allowed to overwrite an existing file having the same name as your tmp file
Assuming it's ok to reset meta-data, like file owner, permissions, acls, creation timestamp, etc.
Assuming the performance regression due to fsync is ok (request was for atomic, not durable)

XFS: the filesystem of the future?

nybble41 — Tue, 06 Mar 2012 22:05:09 +0000

> open(tmpfile)
> write(tmpfile)
> fsync(tmpfile)
> rename(tmpfile, destination)

Now find a way to do that *without* the need for a temporary file, and you might have something relevant to contribute to the thread.

A temporary file is not always an acceptable option; it presumes that you know of a directory in which you can create a temporary file guaranteed to be on the same filesystem as the file that it's replacing, so that rename() can be implemented atomically, and that either there are no hard links to the file or that those links should be broken by the rename(). Moreover, the rename() method resets portions of the security context of the original file, including ownership and security labels, which you can't restore without superuser capabilities.
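The hard-link point can be seen with a small experiment. The following is a minimal Python sketch, not anything posted in this thread: the helper names are made up for illustration, and it assumes a POSIX filesystem that supports hard links. It replaces a file via temp-file-plus-rename and then looks at what happened to a second name for the same inode.

```python
import os
import tempfile


def replace_via_rename(path, data):
    """Write `data` to a temp file next to `path`, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    os.rename(tmp_path, path)


def demo_hard_link_break():
    """Return what a second hard link sees after the first name is replaced."""
    workdir = tempfile.mkdtemp()
    original = os.path.join(workdir, "original")
    alias = os.path.join(workdir, "alias")
    with open(original, "wb") as f:
        f.write(b"v1")
    os.link(original, alias)  # two names, one inode
    inode_before = os.stat(original).st_ino

    replace_via_rename(original, b"v2")

    with open(alias, "rb") as f:
        alias_data = f.read()
    return {
        "inode_changed": os.stat(original).st_ino != inode_before,
        "alias_data": alias_data,  # the alias still holds the old contents
        "alias_links": os.stat(alias).st_nlink,
    }
```

After the rename, "original" refers to a new inode while "alias" keeps the old one, so the two names no longer share data. Whether that is the desired outcome is exactly the open question in this thread.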

This thread exists because people who do know plenty about the POSIX interfaces also know that they don't provide a general solution.

XFS: the filesystem of the future?

XTF — Tue, 06 Mar 2012 22:00:14 +0000

> It seems that there are lots of people with ideas of how to "improve" overwrites but few of those people really understand the mechanisms that filesystems already provide via the POSIX interface. If you ever wonder why filesystem developers are a little bit sick of this topic, your post is a perfect example.

Other people appear to have trouble reading. ;)
Your code was posted already and didn't suffice.

XFS: the filesystem of the future?

dlang — Tue, 06 Mar 2012 21:43:41 +0000

per the other messages in this thread, the OP is concerned about rather esoteric situations where

the file is a bind mount

the program doesn't have permission to make a new file in the directory (only to modify an existing one)

where the file is a special link (symlink, multiple hardlinks, bind mounted, etc) and the 'right thing' is to maintain those links instead of breaking them to have the modified version be a local copy

these situations do exist, but they are rather rare and the 'right thing' to do is not always an obvious and consistent thing.

XFS: the filesystem of the future?

dgc — Tue, 06 Mar 2012 21:36:25 +0000

> - Write new data to disk blocks which are unallocated on disk, but
> allocated in memory.

open(tmpfile)
write(tmpfile)

> - Allocate new blocks on disk.
> - Force in-order I/O (e.g. flush the disk cache).

fsync(tmpfile)

> - Atomically update inode to point to new data blocks, in memory and
> on disk.
> - Force in-order I/O (e.g. flush the disk cache).
> - De-allocate old data blocks on disk.

rename(tmpfile, destination)

It seems that there are lots of people with ideas of how to "improve" overwrites but few of those people really understand the mechanisms that filesystems already provide via the POSIX interface. If you ever wonder why filesystem developers are a little bit sick of this topic, your post is a perfect example.

Dave.

XFS: the filesystem of the future?

XTF — Tue, 06 Mar 2012 12:55:48 +0000

> In that case, why did you say that on-disk inodes do not fit within one physical sector?

You're right, inodes typically do fit in a sector.
What I wanted to say is that a meta-data transaction usually involves multiple parts / sectors. Providing consistency guarantees after a crash is hard without a journal.

> unless O_ATOMIC is exclusive, of course.

Having a reader and a writer or multiple writers at the same time is always problematic.

XFS: the filesystem of the future?

nybble41 — Tue, 06 Mar 2012 00:49:03 +0000

>> I can only assume you have a particular filesystem in mind
> Not really

In that case, why did you say that on-disk inodes do not fit within one physical sector? That is filesystem-specific, and I certainly know of some where the full inode size is less than or equal to 512 bytes; ext2 is at least capable of being configured that way.

>> True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.
> What do you mean by full copy-on-write semantics?
> What on-disk structure changes would be required to do this in ext4 for example?

Perhaps none. I didn't mean to imply that it was impossible to implement atomic replacement of partial files without changing the on-disk structure; I simply hadn't proved to myself that it could be done easily. In retrospect it probably could be done, though you would run into the aforementioned fragmentation issues common to most C.O.W. filesystems.

The biggest complication is not on disk but in memory; you would need to modify the filesystem code to account for the shared data blocks, of which there may be as many alternate versions as there are O_ATOMIC file descriptors--unless O_ATOMIC is exclusive, of course.

XFS: the filesystem of the future?

nybble41 — Tue, 06 Mar 2012 00:28:57 +0000

> If you aren't allowed to create a new file in the directory of the file, are you sure you have permission to overwrite the file you are trying to modify?

These are orthogonal permissions. To overwrite the file you need write permission on the file. To create a new file in the same directory you need write permission on the directory. It's easily possible to have one without the other:

root# mkdir a
root# touch a/b
root# chown user a/b

user$ echo test > a/b # no error
user$ touch a/c
touch: cannot touch `a/c': Permission denied

> as for the symlink/hardlink 'issue', in my sysadmin experience, more problems are caused by editors that modify files in place (not using temp files and renaming them) than by breaking links.

Obviously that would remain an option. The point is that it should be an *option*, alongside the ability to atomically update a file in place. Symlinks are often used to add version control over configuration files (without putting the entire home / etc directory in the repository), while bind mounts are more often used with namespaces and chroot environments. Usually you want the latter to be read-only, but if the file is read/write it makes sense to allow it to be updated without breaking the link. (You can't just delete or rename over a bind target, either; it has to be unmounted.)

XFS: the filesystem of the future?

XTF — Tue, 06 Mar 2012 00:10:14 +0000

> I can only assume you have a particular filesystem in mind

Not really

> True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.

What do you mean by full copy-on-write semantics?
What on-disk structure changes would be required to do this in ext4 for example?

XFS: the filesystem of the future?

dlang — Mon, 05 Mar 2012 23:37:02 +0000

> it's not guaranteed that there exists a directory on the same filesystem where you have permission to create a temporary file.

this sounds like a red herring to me.

If you aren't allowed to create a new file in the directory of the file, are you sure you have permission to overwrite the file you are trying to modify?

as for the symlink/hardlink 'issue', in my sysadmin experience, more problems are caused by editors that modify files in place (not using temp files and renaming them) than by breaking links. Editors that break links when modifying a file are referred to as 'well behaved' in this area.

XFS: the filesystem of the future?

nybble41 — Mon, 05 Mar 2012 22:55:21 +0000

>> The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

> They don't and that's not how atomicity is guaranteed. Atomicity is guaranteed via the journal.

I can only assume you have a particular filesystem in mind. It is possible to arrange for inodes (or at least the data block portions) to fit within one sector, and to have atomic metadata updates without a journal. If you have a journal, great; atomic updates shouldn't be a problem. However, this system can also be retrofitted onto filesystems which do not support journals.

>> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.

> Not really, actually. You merely have to ensure that the old state / blocks remain valid, so you have to do all writes to new blocks.

True, if you want to implement full copy-on-write semantics. I was going for a simpler approach which can be implemented by almost any filesystem with no on-disk structure changes.

XFS: the filesystem of the future?

nybble41 — Mon, 05 Mar 2012 22:45:50 +0000

>> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.
> The process you list will badly fragment the file on disk, destroying performance (***unless you are re-writing the entire file***, in which case just writing a temporary file and renaming it will work)

Emphasis added. Yes, I assumed the entire file was being rewritten. If not, one could add an online defragmentation step after the metadata update, though online defragmentation introduces atomicity issues of its own.

Creating and renaming a temporary file has its own issues, which have already been mentioned, particularly relating to symlinks and hard links. Even for ordinary files, it's not guaranteed that there exists a directory on the same filesystem where you have permission to create a temporary file. Bind mounts (which can apply to individual files) are another potential sore spot. How much work should applications be expected to do just to find a place to put their temporary file such that rename() can be guaranteed atomic?

An O_ATOMIC option to open() would ensure that you are really replacing the original file, and that the temporary space comes from the same filesystem.

1) No, there are no obvious solutions yet, but there have been several reasonable proposals.

2) Current applications can do atomic replacement in common but limited circumstances using rename(). They all depend on creating a temporary file on the same filesystem, which assumes both that you can locate that filesystem (see: bind mounts) and that you can create new files there. They also tend to break hard links, which may or may not be a desired behavior. There is no general, straightforward solution to the problem of atomically replacing the data associated with a specific inode.
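To illustrate how much guessing an application has to do here, this is a rough Python sketch of a pre-flight check for the rename() approach. The function names are hypothetical, and the check is necessary rather than sufficient: two paths can report the same st_dev and still be on different mounts (bind mounts), in which case rename() can still fail.

```python
import os
import tempfile


def same_device(path_a, path_b):
    """Return True if both paths report the same st_dev (symlinks are followed)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def temp_dir_candidate(target):
    """Return the directory holding `target` if it looks usable for a temp file, else None."""
    directory = os.path.dirname(os.path.abspath(target))
    if not same_device(directory, target):
        # e.g. the target is a symlink to another volume
        return None
    if not os.access(directory, os.W_OK | os.X_OK):
        # we may write the file itself but not create siblings
        return None
    return directory
```

A None result means the application has to fall back to overwriting in place or go looking for another directory, which is the extra work being asked about above.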

XFS: the filesystem of the future?

XTF — Mon, 05 Mar 2012 22:32:42 +0000

> The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

They don't and that's not how atomicity is guaranteed. Atomicity is guaranteed via the journal.

> This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC.

Not really, actually. You merely have to ensure that the old state / blocks remain valid, so you have to do all writes to new blocks.

XFS: the filesystem of the future?

XTF — Mon, 05 Mar 2012 22:18:04 +0000

> writing the data to the new blocks may not be atomic.

It doesn't need to be

> writing the metadata to disk may not be atomic.

That's why you use a journal

> If your writes to disk can't be atomic, how can the entire transaction?

Heard of TCP? It creates a reliable connection over an unreliable network.
Or databases? Atomic transactions on unatomic disks are very possible.

> some of the data you write to a file may be visible before the metadata gets changed, which would make the change overall not be atomic.

No, because you write the new data to *new* blocks. Blocks not referenced by any file yet.

XFS: the filesystem of the future?

dlang — Mon, 05 Mar 2012 21:36:00 +0000

The process you list will badly fragment the file on disk, destroying performance (unless you are re-writing the entire file, in which case just writing a temporary file and renaming it will work)

It's also not clear that this is what the original poster was looking for when he said "Linux devs should really provide a proper solution (like O_ATOMIC) instead of blaming app devs for not doing the impossible."

I've been trying to make two points in this thread

1. it's not obvious what the "proper solution" that the kernel should provide looks like

2. it's not impossible to do this today (since there are many classes of programs that do this)

XFS: the filesystem of the future?

nybble41 — Mon, 05 Mar 2012 21:04:54 +0000

The only part that really needs to be atomic is the metadata update. That's not usually a problem so long as your on-disk inodes, or at least those fields relating to top-level data block allocation, fit within one physical sector:

- Write new data to disk blocks which are unallocated on disk, but allocated in memory.
- Allocate new blocks on disk.
- Force in-order I/O (e.g. flush the disk cache).
- Atomically update inode to point to new data blocks, in memory and on disk.
- Force in-order I/O (e.g. flush the disk cache).
- De-allocate old data blocks on disk.

Between the "Allocate on disk" and "De-allocate on disk" steps there are potentially two versions of the file data, only one of which is connected to a real file. A basic fsck tool or journal feature can clean up the disconnected version in the event that the process is interrupted.
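As a toy illustration of the sequence above, here is a small in-memory Python model. It is only a sketch: ToyDisk and its methods are invented for this example, the "atomic" inode update is modelled as a single assignment, and nothing here touches a real disk or a real filesystem format.

```python
class ToyDisk:
    """In-memory stand-in for a disk: block number -> bytes, plus one inode's block list."""

    def __init__(self, data_blocks):
        self.blocks = dict(enumerate(data_blocks))
        self.inode = list(self.blocks)  # blocks the file currently points to
        self.next_free = len(self.blocks)

    def read_file(self):
        return b"".join(self.blocks[n] for n in self.inode)

    def atomic_rewrite(self, new_blocks, crash_before_commit=False):
        # Write the new data to blocks that no file references yet.
        new_ids = []
        for chunk in new_blocks:
            self.blocks[self.next_free] = chunk
            new_ids.append(self.next_free)
            self.next_free += 1
        if crash_before_commit:
            return  # simulated crash: the inode still points at the old blocks
        # The single metadata update, modelled as one assignment.
        old_ids, self.inode = self.inode, new_ids
        # De-allocate the old data blocks.
        for n in old_ids:
            del self.blocks[n]

    def fsck(self):
        """Drop blocks not referenced by the inode and return how many were dropped."""
        orphans = [n for n in self.blocks if n not in self.inode]
        for n in orphans:
            del self.blocks[n]
        return len(orphans)
```

With crash_before_commit=True the file still reads as the old version and fsck() reclaims the orphaned new blocks; without it, readers see only the new version.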

This does assume _complete_ replacement, i.e. O_ATOMIC implies O_TRUNC. A copy-on-write filesystem could implement O_ATOMIC efficiently without truncation, but not all filesystems have the necessary flexibility in the on-disk format for copy-on-write.

XFS: the filesystem of the future?

dlang — Mon, 05 Mar 2012 20:34:56 +0000

> At the hardware level it's quite simple: write the new data to new blocks, then write the meta-data.

writing the data to the new blocks may not be atomic.

writing the metadata to disk may not be atomic.

If your writes to disk can't be atomic, how can the entire transaction?

some of the data you write to a file may be visible before the metadata gets changed, which would make the change overall not be atomic.

XFS: the filesystem of the future?

XTF — Mon, 05 Mar 2012 14:34:44 +0000

> depending on how much you want to write, and what your definition of atomic is, what you are asking for may not be possible.

Why not?
At the hardware level it's quite simple: write the new data to new blocks, then write the meta-data.

> it also depends on what you mean about 'tons of regressions'

Assuming the target is not a symlink to a different volume
Assuming you are allowed to create the tmp file
Assuming you are allowed to overwrite an existing file having the same name as your tmp file
Assuming it's ok to reset meta-data, like file owner, permissions, acls, creation timestamp, etc.
Assuming the performance regression due to fsync is ok (request was for atomic, not durable)

> fsync temp file (may require fsync of directory)

How does one check that requirement in a portable way?

> if you want to be sure the change has taken place, fsync directory.

> no, this isn't "real code", but translating it into your language of choice is not that hard.

But it's far from trivial either.

> There are a lot of programs out there that do this right today.

But there are even more ones that don't. There's also no tool to detect these ones (AFAIK).

> lookup the lwn.net article on safely saving data from a few months back

It also failed to address any of the assumptions / regressions.

XFS: the filesystem of the future?

dlang — Fri, 02 Mar 2012 23:33:09 +0000

doing an open, write, close on an existing file that is a hardlink may not be able to be atomic (depending on the size of the write, the hardware, etc) in any case.

so I think that your requirement results in error O_PONY

While something along the lines of what you are talking about may be able to be made to work in some more limited cases, arguing that it's a requirement because the temp-file approach doesn't work in all cases is a bad argument.

In fact, in the cases where the file is a symlink or hardlink, it's questionable as to what the 'right' thing to do is.

In some cases you should follow the link and replace the 'master' file, but in other cases you should not. Arguably, breaking the links is the safer thing to do (the system doesn't know what the effect of the changes is on the other things accessing the file), but there's no question that sometimes you wish it did something different.

XFS: the filesystem of the future?

mathstuf — Fri, 02 Mar 2012 22:18:29 +0000

The problem is that if the target is a symlink or a hardlink, it gets clobbered by this process and the link is destroyed. Applications that I use that do this are harder to use with my dotfiles system because of it (finch, gnupg, and others as well). What needs to happen is that the target file has readlink() done to its path to get the *actual* file that is wanted before this process starts.
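A minimal Python sketch of that idea, for the symlink case only. It uses os.path.realpath() rather than a single readlink() call, so that chains of symlinks are resolved too; the function name is made up for illustration. It does nothing for hard links, which cannot be "followed" this way.

```python
import os
import tempfile


def replace_following_symlinks(path, data):
    """Atomically replace the file that `path` ultimately refers to."""
    real_path = os.path.realpath(path)  # resolves every symlink in the chain
    directory = os.path.dirname(real_path)
    # The temp file goes next to the real file, not next to the symlink.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.rename(tmp_path, real_path)
```

The symlink itself is never touched, so a dotfiles setup that links into a repository keeps working. This still assumes you may create files in the real file's directory.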

XFS: the filesystem of the future?

dlang — Fri, 02 Mar 2012 21:29:37 +0000

>Could any of the fsync advocates post real code that does the atomic variant of open, write, close?
>
> Hint: it's not possible without tons of regressions.

depending on how much you want to write, and what your definition of atomic is, what you are asking for may not be possible.

it also depends on what you mean about 'tons of regressions'

but what you can do is

open temp file
write to temp file
fsync temp file (may require fsync of directory)
close temp file
mv temp file to name of real file
if you want to be sure the change has taken place, fsync directory.
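For reference, a minimal Python sketch of those steps. It assumes a POSIX system where a directory can be opened and fsync'ed; the function name is made up; and it deliberately ignores the cases argued about elsewhere in this thread (symlinks, hard links, ownership and permissions of the new file).

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the contents of `path` with `data` via a temp file and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so rename() stays on one filesystem.
    # Note: mkstemp creates the file with mode 0600, not the target's mode.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # fsync needs the open descriptor, so it happens before close.
            os.fsync(tmp.fileno())
        os.rename(tmp_path, path)
    except BaseException:
        # Best-effort cleanup; after a successful rename tmp_path is gone.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Optional: make the rename itself durable by syncing the directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Usage is a single call such as atomic_replace("settings.conf", b"key=value\n"). Readers see either the old file or the new one, never a partial write.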

if you are doing this in a shell script, you have to do 'sync' instead of fsync (on ext3, the result is the same, on other filesystems fsync is significantly faster)

no, this isn't "real code", but translating it into your language of choice is not that hard.

There are a lot of programs out there that do this right today. Most mail, nntp, and database apps do this right because their users are unwilling to lose data, and they need to work across different flavors of Unix.

for the longer answer, lookup the lwn.net article on safely saving data from a few months back

XFS: the filesystem of the future?

XTF — Thu, 01 Mar 2012 15:35:24 +0000

Could any of the fsync advocates post real code that does the atomic variant of open, write, close?

Hint: it's not possible without tons of regressions.

Linux devs should really provide a proper solution (like O_ATOMIC) instead of blaming app devs for not doing the impossible.

Shared pain

nye — Tue, 14 Feb 2012 16:16:59 +0000

>when I have applications that lose config data after a problem happens (which isn't always a system crash, apps that have this sort of problem usually have it after the application crashes as well)

That can't possibly be the case. You must be talking about applications which do something like truncate+rewrite, which is entirely orthogonal to the discussion (and is pretty clearly a bug).

I suspect you haven't understood the issue at hand.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

dlang — Sun, 12 Feb 2012 18:29:02 +0000

yes, everything stored in git is compressed, but it only gets deltafied when it gets packed.

and it's frequently faster to read a compressed file and uncompress it than it is to read the uncompressed equivalent (especially for highly compressible text like code or logs), I've done benchmarks on this within the last year or so

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

nix — Sun, 12 Feb 2012 15:57:20 +0000

Actually, even the most recent stuff is compressed. It just might not be deltified in terms of other blobs (which is what you meant, I know).

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Wol — Sun, 12 Feb 2012 13:38:16 +0000

Okay, it would need a little bit of coding, but I'd do the following ...

Each month, when you run end-of-month statements, you save that info. When you update an account you keep a running total.

If the system crashes you then do "set corruptaccount = true where last-month plus transactions-this-month does not equal running balance". At which point you can do a brute force integrity check on those accounts.

(If I've got a 3rd state of that flag, undefined, I can even bring my database back on line immediately I've run a "set corruptaccount to undefined" command!)
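The check itself is simple enough to sketch in a few lines of Python. This is only an illustration of the query described above, not Pick code; the record layout (last_month, transactions, running_balance) is hypothetical, and amounts are integers (cents) to avoid rounding noise.

```python
def find_corrupt_accounts(accounts):
    """Return ids whose running balance disagrees with last statement + transactions."""
    corrupt = []
    for account_id, record in accounts.items():
        expected = record["last_month"] + sum(record["transactions"])
        if expected != record["running_balance"]:
            corrupt.append(account_id)
    return corrupt
```

Only the accounts this returns need the brute force check; everything else can go back into service right away.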

And in Pick, that query will FLY! If I've got a massive terabyte database that's crashed, it's quite likely going to take a couple of hours to reboot the OS (I just rebooted our server at work - 15-20 mins to come up including disk checks etc). What's another hour running an integrity check on the data? And I can bring my database back on line immediately that query (and others like it) have completed. Tough luck on the customer whose account has been locked ... but 99% of my customers can have normal service resume quickly.

Thing is, I now *know* after a crash that my data is safe, I'm not trusting the database company and the hardware. And if my system is so much faster than yours, once the system is back I can clear the backlog faster than you can. Plus, even if ACID saves your data, I've got so much less data in flight and at risk.

But this seems to be mirroring the other debate :-) the moan about "fsync and rename" was that fsync was guaranteeing (at major cost) far more than necessary. The programmer wanted consistency, but the only way he could get it was to use fsync, which charged a high price for durability. If I really need ACID I can use BEGIN/END TRANSACTION in Pick. But 99% of the time I don't need it, and can get 90% of its benefits with 10% of its cost, just by being careful about how I program. At the end of the day, Pick gives me moderate ACID pretty much by default. Why should I have to pay the (high) price for strong ACID when 90% of the time, it is of no benefit whatsoever? (And how many SQL programmers actually use BEGIN/END TRANSACTION, even when they should?)

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Cyberax — Sat, 11 Feb 2012 13:44:15 +0000

git/svn/... store intermediate versions of the source code, so that applying all patches becomes O(log N) instead of O(N). But that's just an optimization.

NoSQL systems work in a similar way - they can store the 'tip' of the data, so that they don't have to reapply all the patches all the time. However, the latest data view can be rebuilt if required.
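A rough Python sketch of the "tip plus patches" idea, using account balances as the data. The structure is made up for illustration (a snapshot dict with a sequence number, and a log of (seq, delta) pairs) and is not how any particular NoSQL system stores things.

```python
def current_balance(snapshot, log):
    """Rebuild the latest value from a stored 'tip' plus the entries logged after it."""
    balance = snapshot["balance"]
    for seq, delta in log:
        if seq > snapshot["seq"]:  # entries up to the snapshot are already included
            balance += delta
    return balance


def take_snapshot(snapshot, log):
    """Fold the log into a new 'tip' so that later reads replay fewer entries."""
    last_seq = max([snapshot["seq"]] + [seq for seq, _ in log])
    return {"seq": last_seq, "balance": current_balance(snapshot, log)}
```

Starting from an empty snapshot this replays the whole log, which is the slow path. With a recent snapshot only the tail of the log is applied, and the snapshot can always be thrown away and rebuilt.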

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

dlang — Sat, 11 Feb 2012 05:48:49 +0000

what gives you reasonable performance for a version control system with a few updates per minute is nowhere close to being reasonable for something that measures its transaction rate in thousands per second.

besides, git tends to keep the most recent version of a file uncompressed, it's only when the files are combined into packs that things need to be reconstructed, and even there git only lets the chains get so long.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Cyberax — Sat, 11 Feb 2012 02:30:36 +0000

Well, git works exactly the same way. Is it fast enough for you?

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

dlang — Fri, 10 Feb 2012 18:43:29 +0000

so that means that you don't have any value anywhere in your database that says "this is the amount of money in account A", instead you have to search all transactions by all tellers to find out how much money is in account A

that doesn't sound like a performance win to me.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Wol — Fri, 10 Feb 2012 16:25:25 +0000

:-)

Look at the comment you're replying to :-) In early Pick systems I believe it was possible for a single item to be larger than available memory ...

Okay, it laid the original systems wide open to serious problems if something went wrong, but as far as users were concerned Pick systems didn't have disk. It was just "permanent memory". And Pick was designed to "store all its data in ram and treat the disk as a huge virtual memory". I believe they usually got round any problem by flushing changes from core to disk as fast as possible, so in a crash they could just restore state from disk.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

Wol — Fri, 10 Feb 2012 16:17:28 +0000

The ever popular "subtract $10, add $10" ...

Well, if you define the transaction as an entity, then it gets written to its own FILE. If the system crashes then you get a discrepancy that will show up in an audit. It makes sense to define it as an entity - it has its own "primary key" ie "time X at teller Y". Okay, you'll argue that I have to run an integrity check after a crash (true) while you don't, but I can probably integrity-check the entire database in the time it takes you to scan one big table :-)

Consistency? Journalling a transaction? Easily done.

And yes, your point about flushing buffers is good, but that really should be the OS's problem, not the app (database) sitting on top. Yes I know, I used the word *should* ...

Look at it from an economic standpoint :-) If my database (on equivalent hardware) is ten times faster than yours, and I can run an integrity check after a crash without impinging on my users, and I can guarantee to repair my database in hours, which is the economic choice?

Marketing 101 - proudly announce your weaknesses as a strength. The chances of a crash occurring at the "wrong moment" and corrupting your database are much higher with SQL, because any given task will typically require between 10s and 100s more transactions between the db and OS than Pick. So SQL needs ACID. With Pick, the chances of a crash happening at the wrong moment and corrupting data are much, much lower. So expensive strong ACID actually has a prohibitive cost. Especially if you can get 90% of the benefits for 10% of the effort.

I'm not saying ACID isn't a good thing. It's just that the cost/benefit equation for Pick says strong ACID isn't worth it - because the benefits are just SO much less. (Like query optimisers. Pick doesn't have an optimiser because it's pretty much a dead cert the optimiser will save less than it costs!)

Cheers,
Wol

Shared pain

khim — Thu, 09 Feb 2012 21:13:13 +0000

> so you don't _really_ want the computer doing exactly what the programmer tells it to, you only want it to do so some of the time, not the rest of the time.

Sure. YMMV as I've already noted. Good filesystem for USB sticks must flush on close(2) call. Good general purpose filesystem must guarantee rename(2) atomicity in the face of system crash.

You can use whatever you want for your own system - it's your choice. But when the question is about a replacement for extX… it's another thing entirely. To recommend a filesystem which likes to eat users' data is simply irresponsible.

Shared pain

dlang — Thu, 09 Feb 2012 20:52:23 +0000

it is badly written because you did not tell the computer that you wanted to make sure that the data was written to the drive in a particular order.

If the system does not crash, the view of the filesystem presented to the user is absolutely consistent, and the rename is atomic.

The problem is that there are a lot of 'odd' situations that you can have where data is written to a file while it is being renamed that make it non-trivial to "do the right thing" because the system is having to guess at what the "right thing" is for this situation.

try running a system with every filesystem mounted with the sync option, that will force the computer to do exactly what the application programmers told it to do, writing all data exactly when they tell it to, even if this means writing the same disk sector hundreds of times as small writes happen. The result will be un-usable.

so you don't _really_ want the computer doing exactly what the programmer tells it to, you only want it to do so some of the time, not the rest of the time.

Shared pain

Wol — Thu, 09 Feb 2012 20:44:43 +0000

And what is "badly written" about an app that expects the computer to do what was asked of it?

I know changing things around for the sake of it doesn't matter when everything goes right, but if I tell the computer "do this, *then* that, *followed* by the other", well, if I told an employee to do it and they did things in the wrong order and screwed things up as a *direct* *result* of messing with the order, they'd get the sack.

The only reason we're in this mess, is because the computer is NOT doing what the programmer asked. It thinks it knows better. And it screws up as a result.

And the fix isn't that hard - just make sure you flush the data before the metadata (or journal the data too), which is pretty much (a) sensible, and (b) what every user would want if they knew enough to care.

Cheers,
Wol

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

dlang — Thu, 09 Feb 2012 20:38:47 +0000

if a single item is larger than the track size of a drive, it is physically impossible for the write to be atomic. You don't need to get this large to run in to problems though, any write larger than a block runs the possibility of being split across different tracks (or in a RAID setup, across different drives). If you don't tell the filesystem that you care about this, the filesystem will write these blocks in whatever order is most efficient for it.

The entire noSQL family of servers is based on relaxing the reliability constraints of the classic ACID protections that SQL databases provided.

dlang — Thu, 09 Feb 2012 20:36:38 +0000

atomic, your scheme won't work if you need to make changes to two records (the ever popular "subtract $10 from account A, add $10 to account B" example)

consistency, what if part of your updates get to disk and other parts don't? what if the OS (or drive) re-orders your updates so that the write to the record for person happens before the write to building?

As far as durability goes, if you don't tell the OS to flush its buffers (which is what fsync does), then in a crash you have no idea what may have made it to disk and what didn't.
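To make the two-record case concrete, here is a minimal Python sketch using the standard sqlite3 module. The table layout and function names are invented for the example, and it uses an in-memory database, so it demonstrates atomicity only and says nothing about durability.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two accounts, A and B."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` between two accounts inside one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        if fail_midway:
            raise RuntimeError("simulated crash between the two updates")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))


def balances(conn):
    """Return a {name: balance} dict for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

If the failure is injected between the two updates, the subtraction is rolled back and both balances are unchanged. Without the transaction, account A would be short $10 with nothing credited to B.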