
Supporting transactions in btrfs

By Jonathan Corbet
November 11, 2009
Much of the fuss involving fsync() and crash robustness over the last year was really about how applications can get transactional semantics out of filesystem operations. An application developer often wants to see a set of operations either succeed or fail as a unit, without the possibility of partially-completed operations. Providing ways for applications to get that behavior can be a challenge, though.

Btrfs has tried to make this capability available to user space by way of the BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END ioctl() calls. These operate as a pair of system calls, with any operations between the two being treated as a transaction within the filesystem. There are some real problems with this approach, though: if something fails, or if the application never quite gets around to ending the transaction, things will eventually come to a halt. That is why the btrfs code includes this comment:

There are many ways the trans_start and trans_end ioctls can lead to deadlocks. They should only be used by applications that basically own the machine, and have a very in depth understanding of all the possible deadlocks and enospc problems.

It is, in other words, a dangerous capability which cannot be made generally available.

Sage Weil has posted a patch taking a rather different approach to the problem. The key idea is to avoid the problem of never-completed transactions by encapsulating the entire thing within a single system call. The result is a new BTRFS_IOC_USERTRANS command for ioctl(); chances are it will require a bit of work yet, but it could be the base for user-space transactions in the future.

This command takes a structure which looks something like the following:

    struct btrfs_ioctl_usertrans {
        __u64 num_ops;
        struct btrfs_ioctl_usertrans_op *ops_ptr;
        __u64 num_fds;
        __u64 data_bytes, metadata_ops;
        __u64 flags;
        __u64 ops_completed;
    };

The ops_ptr argument points to an array of num_ops individual operations to complete:

    struct btrfs_ioctl_usertrans_op {
        __u64 op;
        __s64 args[5];
        __s64 rval;
        __u64 flags;
    };

Here, op describes an individual operation. It can be BTRFS_IOC_UT_OP_OPEN (open a file), BTRFS_IOC_UT_OP_CLOSE (close a file), BTRFS_IOC_UT_OP_PWRITE (write to a file), BTRFS_IOC_UT_OP_UNLINK (unlink a file), BTRFS_IOC_UT_OP_LINK (make a link to a file), BTRFS_IOC_UT_OP_MKDIR (create a directory), BTRFS_IOC_UT_OP_RMDIR (remove a directory), BTRFS_IOC_UT_OP_TRUNCATE (truncate a file), BTRFS_IOC_UT_OP_SETXATTR (set an extended attribute), BTRFS_IOC_UT_OP_REMOVEXATTR (remove an extended attribute), or BTRFS_IOC_UT_OP_CLONERANGE (copy a range of data from one file to another). For each operation, the args field contains a set of arguments similar to what one would see for the corresponding system call. One interesting difference is that there are no hard-coded file descriptor numbers; instead, the transaction gets a new file descriptor table and all operations work with indexes into that table. Essentially, transactions work within a file descriptor space separated from that used by the calling process.
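As an illustration, the sketch below fills in three operations (open, write, close) that all refer to slot 0 of the transaction's private descriptor table. The numeric values of the op codes and flags, the order of the entries in args, and the build_write_ops() helper are all assumptions made for this example; the patch itself defines the real layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the structure above, with <stdint.h> types for portability. */
struct btrfs_ioctl_usertrans_op {
    uint64_t op;
    int64_t  args[5];
    int64_t  rval;
    uint64_t flags;
};

/* Hypothetical numeric values; only the names come from the patch. */
enum {
    BTRFS_IOC_UT_OP_OPEN   = 1,
    BTRFS_IOC_UT_OP_CLOSE  = 2,
    BTRFS_IOC_UT_OP_PWRITE = 3,
};
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE (1u << 0)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT (1u << 2)

/* Pointers travel in the signed 64-bit args slots. */
static int64_t ptr_arg(const void *p) { return (int64_t)(intptr_t)p; }

/*
 * Fill ops[0..2] with open + pwrite + close.  Assumed argument order:
 *   OPEN:   path, open flags, mode, table slot to store the descriptor in
 *   PWRITE: table slot, buffer, length, file offset
 *   CLOSE:  table slot
 * Returns the number of operations written.
 */
size_t build_write_ops(struct btrfs_ioctl_usertrans_op ops[3],
                       const char *path, int open_flags, int mode,
                       const void *buf, size_t len)
{
    assert(ops != NULL && path != NULL && buf != NULL);
    const int64_t slot = 0;   /* index into the transaction's own fd table */

    ops[0] = (struct btrfs_ioctl_usertrans_op){
        .op = BTRFS_IOC_UT_OP_OPEN,
        .args = { ptr_arg(path), open_flags, mode, slot },
        .rval = 0,
        .flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT,  /* fail if result < 0 */
    };
    ops[1] = (struct btrfs_ioctl_usertrans_op){
        .op = BTRFS_IOC_UT_OP_PWRITE,
        .args = { slot, ptr_arg(buf), (int64_t)len, 0 },
        .rval = (int64_t)len,
        .flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE,  /* fail on short write */
    };
    ops[2] = (struct btrfs_ioctl_usertrans_op){
        .op = BTRFS_IOC_UT_OP_CLOSE,
        .args = { slot },
        .rval = 0,
        .flags = BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE,  /* fail if result != 0 */
    };
    return 3;
}
```

The resulting array would be handed to the kernel through ops_ptr, with num_ops set to 3 and num_fds set to 1.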

The flags field describes how the return value from each operation should be interpreted. It can contain any of: BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE, BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ, BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT, BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT, BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE, and BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE. In each case, the flag causes the return value to be compared against the passed-in rval field; if the comparison is successful, the transaction will fail.
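The comparison rule can be sketched as a small predicate. The bit values assigned to the flags here are placeholders chosen for the example, and ut_op_failed() is a hypothetical name; only the flag names and the "fail when the comparison holds" rule come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit assignments; the real values are defined by the patch. */
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE  (1u << 0)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ  (1u << 1)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT  (1u << 2)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT  (1u << 3)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE (1u << 4)
#define BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE (1u << 5)

/*
 * Return true if an operation that returned ret should fail the
 * transaction, given the op's flags and its caller-supplied rval.
 * Each flag names a comparison of ret against rval; if any requested
 * comparison holds, the operation counts as failed.
 */
bool ut_op_failed(uint64_t flags, int64_t ret, int64_t rval)
{
    /* Only the six comparison bits are meaningful here. */
    assert((flags & ~(uint64_t)0x3f) == 0);

    if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_NE)  && ret != rval) return true;
    if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_EQ)  && ret == rval) return true;
    if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LT)  && ret <  rval) return true;
    if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GT)  && ret >  rval) return true;
    if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_LTE) && ret <= rval) return true;
    if ((flags & BTRFS_IOC_UT_OP_FLAG_FAIL_ON_GTE) && ret >= rval) return true;
    return false;
}
```

So an open that must succeed would use FAIL_ON_LT with rval set to 0, while a write that must be complete would use FAIL_ON_NE with rval set to the requested length. With no flags set, the return value is never treated as a failure.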

What happens if the transaction fails? The partially-completed transaction will not be rolled back; btrfs, not being a database, is not really set up to do that. Instead, the number of successfully-completed operations will be passed back to user space. Optionally, the application can provide the BTRFS_IOC_UT_FLAG_WEDGEONFAIL flag, causing btrfs to leave the transaction open, locking up the filesystem until the system is rebooted. This may seem like a rather antisocial approach to transaction atomicity, but, if failure is both highly unlikely and highly problematic, that might be the right thing to do.

A patch like this raises a lot of questions. The first obstacle may be the fact that it requires exporting a number of system call implementations to modules, a change which has been resisted in the past. Kernel code need not normally call functions like sys_mkdir(), but this patch does exactly that. Calling system call implementations directly can be a bit tricky on some architectures, and there are good reasons for not making these functions available to modules in general.

Another problem is that the filesystem has no real way of determining whether a transaction will succeed before jumping into it; the best it can do is reserve some metadata space in accordance with an estimate provided by the application. If transactions are allowed to complete partially, they are no longer transactions. But the alternative of locking up the system can only leave people wondering if there isn't a better way.

Then, there is a question which was raised on the list: btrfs provides cheap snapshots, why not use them to implement transactions? Using a snapshot would take advantage of existing functionality and should make proper rollback possible. The problem would appear to be performance: btrfs snapshots are not quite that cheap, especially when one considers the need to exclude other activity on the filesystem while the transaction is active. So Chris Mason has suggested that the old standby - write-ahead logging - would be preferable because it will perform much better. But, he thinks, the multi-operation ioctl() could maybe perform better yet.

Finally, there would appear to be some real similarities between this API and the rather more general syslets mechanism. Syslets have been on the back burner for some time now, but they could come back forward if they seemed like a good way to solve this problem.

Clearly, like much of btrfs, this new ioctl() is a work in progress. If it gets into the mainline, it will be likely to have changed quite a bit on the way. But the problem this patch is trying to solve is real; it's clearly an issue which is worth thinking about.



Supporting transactions in btrfs

Posted Nov 12, 2009 11:01 UTC (Thu) by etienne_lorrain@yahoo.fr (guest, #38022)

About the fsync() problem, I wonder if tagging every write operation with a sequence number (one sequence number per device) would be a good solution:
The device driver can re-order any write with the same sequence number.
Each time an fsync() is done, it increments the sequence number of that device by 1.
Maybe a function call wait_until_current_sequence_number_reached(device);
Wouldn't that be a good and simple compromise?

Supporting transactions in btrfs

Posted Nov 12, 2009 16:30 UTC (Thu) by dtlin (✭ supporter ✭, #36537)

Say you have a text editor which only wants to fsync one file and a
background task which is continuously writing lots of files.

How do you propose to implement fsync such that when your text editor
requests it, only that one file and not too much else is flushed?

Supporting transactions in btrfs

Posted Nov 12, 2009 23:27 UTC (Thu) by masoncl (subscriber, #47138)

Btrfs actually already does this. When you fsync a file the metadata is written to a dedicated logging tree, and that file's data blocks go to disk along with the metadata for the dedicated tree.

The end result is that we only write data and metadata for the one file we're sending to fsync.

Transactions, yes please!

Posted Nov 12, 2009 12:25 UTC (Thu) by walles (guest, #954)

And rollbacks, yes please!

If the machine just lost power, trying to return a number telling the user how far the "transaction" got doesn't help that much...

Supporting transactions in btrfs

Posted Nov 12, 2009 12:27 UTC (Thu) by mpee (subscriber, #37530)

Wow, that's one pretty gross ioctl parameter.

Color me puzzled

Posted Nov 12, 2009 13:58 UTC (Thu) by ebiederm (subscriber, #35028)

Aren't transactions with rollbacks just:

- Snapshot the fs.
- Perform operations.
- Snapshot the fs (aka commit), or rollback to the previous snapshot.

Then the code that needs to have a consistent view of what is
correct looks at the snapshot.

Color me puzzled

Posted Nov 14, 2009 1:30 UTC (Sat) by giraffedata (subscriber, #1954)

"Aren't transactions with rollbacks just: snapshot the fs; perform operations; snapshot the fs (aka commit), or rollback to the previous snapshot."

If there's only one transaction at a time going on. But if there are other filesystem updates happening while the transaction runs, you don't want to roll those back because the one transaction failed.

Heavyweight transactions!

Posted Nov 12, 2009 17:08 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

It would be nice to have heavyweight transactions.

I.e. I start "apt-get dist-upgrade", it makes a snapshot of the hard drive, stops all I/O operations from other processes, and does the upgrade. If it fails, then it will just roll back to the previous snapshot.

It would be nice to avoid locking other processes, but braindeadness of Unix filesystems seems to require it :(

Heavyweight transactions!

Posted Nov 12, 2009 20:00 UTC (Thu) by doogie (subscriber, #2445)

Won't work; let's say you stop that background postgres daemon, so it is no longer doing any I/O. But the upgrade then needs to stop the daemon, upgrade its files, then start it again.

Heavyweight transactions!

Posted Nov 12, 2009 21:38 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

Not a problem:
1) You can stop postfix before upgrade starts.
2) You can stop postfix _inside_ the transaction. It might lead to lost data, though.

Supporting transactions in btrfs

Posted Nov 12, 2009 20:53 UTC (Thu) by jimparis (subscriber, #38647)

Gross!

If the problem is that the application might die without BTRFS_IOC_TRANS_END, can't we just come up with an alternate mechanism that doesn't rely on doing things in pairs?

For example, open a special file called "/btrfs-transaction".
As long as it's open by your process, subsequent operations from that process are part of a transaction. Those operations can still fail and return errors. But even if your app dies rudely, the file descriptor will get closed, and the transaction ends.

Supporting transactions in btrfs

Posted Nov 14, 2009 12:46 UTC (Sat) by magnus (subscriber, #34778)

I'm trying to make some sense out of all this.

Basically for a "transaction" we have five system phases/states:

1. Before the user commands the transaction
2. While the commanding user "thinks" the transaction is in progress
3. For the commanding user, the transaction has finished, but the data has not been written to disk.
4. The transaction is being written to disk
5. Transaction finished and committed to disk.

For ACID databases, states 3 and 4 are guaranteed by the DBMS never to occur. Other users see the old version in states 1-2 and the new version in state 5. In case of a system crash in states 1-2, the old version will remain.

For UNIX in general, the only guarantee we have is that in states 3-5, no user on the system will see the old version of the file. The state after a crash in states 2-4 is undefined.

The guarantees we would like to add are:
- In case of a system crash during state 2/3, the old version will be on disk.
- In case of a system crash during state 4, the old or new version will be on disk.
- If the application crashes/segfaults/gets killed etc during state 2, the old version will be on disk.

The things that are left open to allow for optimization are:
- What other users see during state 2
- Which version (old or new) is left on disk after a crash in state 4.

Please correct me if I'm wrong.

Supporting transactions in btrfs

Posted Nov 14, 2009 12:59 UTC (Sat) by magnus (subscriber, #34778)

Oops, forgot the concurrency aspect of it.

What should happen if two transactions are started on the same data simultaneously? Is this a case that this API wants to cover?

Supporting transactions in btrfs

Posted Nov 24, 2009 7:11 UTC (Tue) by gadnio (guest, #30187)

Modern DBs usually solve this issue with a per-record locking mechanism.

Consider the following situation:
* We have files f1, and f2 somewhere in the system.

A)
1. Transaction T1 opens f1 for writing, locking it.
2. Transaction T2 opens f1 for writing. Since it's already locked, T2 waits for T1 to finish and then tries again (mutex).

In case this scenario is not welcome, an explicit lock mechanism exists (SQL SELECT FOR UPDATE), which can be told to throw errors when the locking fails. The same situation, replayed, will look like this:
1. Transaction T1 opens f1 for writing, locking it.
2. Transaction T2 tries to lock f1. The lock fails. T2 handles the error gracefully.

B)
This is more dangerous:
1. Transaction T1 opens f1 for writing.
2. Transaction T2 opens f2 for writing.
3. Transaction T2 opens f1 for writing.
4. Transaction T1 opens f2 for writing.

This leads to deadlock, since each transaction is holding the lock on the other's data, preventing both itself and the other transaction from finishing. To solve this, a transaction arbiter is needed, which fires a 'deadlock' error in both transactions, rolling them both back.

For a file system to implement just that, well... we'll have to maintain a huge subset of the ACID database support (the rollback segments, etc.), which, to be done properly for a filesystem holding tons of files of different sizes, will be a huge pain. Just consider the case when both f1 and f2 are around 300GB in size, which is not uncommon these days. And what about fsck?

Moreover, I think the concept of an in-kernel ACID database is ill-conceived: whenever one wants ACID, one stores one's data in an ACID database and that's all. For all of us who would suffer the downsides of it, I think it's not worth it.

BR,
Hristo

Supporting transactions in btrfs

Posted Nov 14, 2009 14:49 UTC (Sat) by intgr (subscriber, #39733)

Why, instead of trying to work with other file systems to come up with a universal API, are btrfs developers insisting on reinventing their own wheels for everything? Filesystem-level RAID, snapshot API, and now transaction API.

Supporting transactions in btrfs

Posted Nov 20, 2009 9:23 UTC (Fri) by forthy (guest, #1525)

The interface reminds me of one of the proposed syslet asynchronous I/O APIs, where you send a list of syscalls to the kernel (to be processed asynchronously in the background). I thought this was a cool idea (not new, active message passing is decades old, but not well understood by most developers ;-), and it was a pity that it didn't make it (due to not being well understood). For transactions, I don't think it's the right way.

Chris Mason works at Oracle. He should ask some peers how to implement transactions properly. Locking is not a proper way to implement transactions. What you do is: you fork the file system (i.e. create a writable snapshot and bind the calling process to that snapshot), and when you are done, you try to merge it. If the merge is successful, continue; if not, abort (and tell the caller, which can try again). If the system crashes during the transaction, file system repair will purge that snapshot. Unlike a git merge, a transaction should abort when a file that was written to during the transaction had other writers outside. The atomic part here is only the merge window, and this merge has complete information available, and especially only has to update metadata.

This doesn't need a special syslet-like ioctl; the only thing you need to add is the btrfs_merge() call, since creating writable snapshots is already there. I'd still like to have syslets, but please with the complete set of kernel calls and for asynchronous I/O and such.

Supporting transactions in btrfs

Posted Nov 15, 2009 17:41 UTC (Sun) by anton (guest, #25547)

Maybe I missed it, but I have not seen a statement of crash consistency guarantees for Btrfs. I would like to see that before transactions, and with good guarantees, because nearly all applications and users will benefit from that, not just a few.

For those who missed what I consider good crash consistency guarantees, here it is: The after-crash state represents one of the logical states of the file system before the crash; not necessarily the latest, but preferably only a few seconds old (configurable). Concerning fsync(), I would like at least the option to have that guarantee even in the presence of fsync() (yes, enabling this option will make fsync() slow in some cases, but it protects against applications that use fsync() in the wrong order).

Transactions would be cool if they can be done easily. But it looks like they cannot be done easily on the file system level, so maybe this should be left to specialized applications.

Supporting transactions in btrfs

Posted Nov 20, 2009 13:07 UTC (Fri) by basdebakker (guest, #60977)

For the much discussed case of rewriting configuration files, we don't need full transactions. All we need is a barrier call to make sure the file contents are written before the rename.
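For reference, here is a sketch of the configuration-file pattern being discussed. Since no barrier call of this kind exists, the sketch uses fsync() as a stronger (and slower) stand-in to get the file contents on disk before the rename. The helper names replace_file_durably() and replace_file_text() and the ".tmp" suffix are choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Close fd (if open), remove the temporary file, and report failure
 * while preserving the errno of the original error. */
static int fail_and_cleanup(int fd, const char *tmp)
{
    int saved = errno;
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    errno = saved;
    return -1;
}

/* Replace path with len bytes of data: write a temporary file, flush it
 * with fsync(), then rename() it over the original.  Returns 0 or -1. */
int replace_file_durably(const char *path, const char *data, size_t len)
{
    assert(path != NULL && data != NULL);

    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t off = 0;
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return fail_and_cleanup(fd, tmp);
        }
        off += (size_t)w;
    }

    /* Stand-in for the missing barrier: contents reach disk first. */
    if (fsync(fd) < 0)
        return fail_and_cleanup(fd, tmp);
    if (close(fd) < 0)
        return fail_and_cleanup(-1, tmp);

    /* rename() atomically switches the name to the new contents. */
    if (rename(tmp, path) < 0)
        return fail_and_cleanup(-1, tmp);
    return 0;
}

/* Convenience wrapper for NUL-terminated text. */
int replace_file_text(const char *path, const char *text)
{
    return replace_file_durably(path, text, strlen(text));
}
```

A true barrier would only need to order the data writes before the rename, without forcing them out immediately; fsync() gives that ordering at the cost of waiting for the disk.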

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds