Supporting transactions in btrfs

Posted Nov 14, 2009 12:46 UTC (Sat) by magnus (subscriber, #34778)
Parent article: Supporting transactions in btrfs

I'm trying to make some sense out of all this.

Basically for a "transaction" we have five system phases/states:

1. Before the user commands the transaction
2. While the commanding user "thinks" the transaction is in progress
3. For the commanding user, the transaction has finished, but the data has not been written to disk.
4. The transaction is being written to disk
5. Transaction finished and committed to disk.

For ACID databases, the DBMS guarantees that states 3 and 4 never occur. Other users see the old version in states 1-2 and the new version in state 5. If the system crashes in states 1-2, the old version remains.

For UNIX in general, the only guarantee we have is that in states 3-5, no user on the system will see the old version of the file. The state after a crash in state 2-4 is undefined.

The guarantees we would like to add are:
- In case of a system crash during state 2/3, the old version will be on disk.
- In case of a system crash during state 4, the old or new version will be on disk.
- If the application crashes/segfaults/gets killed etc during state 2, the old version will be on disk.
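
For a single file, the usual userspace approximation of these guarantees is to write the new version to a temporary file, fsync it, and rename it over the old one. The sketch below is not the btrfs transaction API discussed in the article; the function name `atomic_replace` is made up for illustration, and how a crash around the rename plays out still depends on the filesystem.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the contents of `path` so readers see the old or the new bytes."""
    directory = os.path.dirname(os.path.abspath(path))
    # State 2: build the new version next to the old one. If the application
    # is killed here, the old file is untouched (a stray temp file may remain).
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # new contents reach the disk first
        # The switch itself: rename() atomically replaces the old name.
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # State 4 -> 5: fsync the directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

This only covers one file at a time; it says nothing about updating several files as one unit, which is the harder case the article is about.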

The things that are left open to allow for optimization are:
- What other users see during state 2
- Which version (old or new) is left on disk after a crash in state 4.

Please correct me if I'm wrong.



Supporting transactions in btrfs

Posted Nov 14, 2009 12:59 UTC (Sat) by magnus (subscriber, #34778) [Link]

Oops, forgot the concurrency aspect of it.

What should happen if two transactions are started on the same data simultaneously? Is this a case that this API wants to cover?

Supporting transactions in btrfs

Posted Nov 24, 2009 7:11 UTC (Tue) by gadnio (guest, #30187) [Link]

Modern DBs usually solve this issue with a per-record locking mechanism.

Consider the following situation:
* We have files f1 and f2 somewhere in the system.

A)
1. Transaction T1 opens f1 for writing, locking it.
2. Transaction T2 opens f1 for writing. Since it's already locked, T2 waits for T1 to finish and then tries again (mutex).

If waiting like this is not acceptable, an explicit lock mechanism exists (SQL's SELECT ... FOR UPDATE), which can be told to throw an error when the locking fails. The same situation, replayed, will look like this:
1. Transaction T1 opens f1 for writing, locking it.
2. Transaction T2 tries to lock f1. The lock fails. T2 handles the error gracefully.
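
The second variant of scenario A can be approximated on a plain UNIX file with a non-blocking advisory lock. This is only a sketch using `flock`, not a transaction API: the helper names `try_lock` and `demo` are invented here, and advisory locks only work if every writer cooperates.

```python
import fcntl
import tempfile


def try_lock(f):
    """Try to take an exclusive advisory lock; return False instead of waiting."""
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        # Someone else holds the lock: the caller can handle this gracefully.
        return False


def demo():
    """Replay scenario A with two independent opens of the same file f1."""
    with tempfile.NamedTemporaryFile() as f1:
        t1 = open(f1.name, "r+b")  # stands in for transaction T1
        t2 = open(f1.name, "r+b")  # stands in for transaction T2
        try:
            first = try_lock(t1)             # T1 locks f1
            second = try_lock(t2)            # T2 fails immediately
            fcntl.flock(t1, fcntl.LOCK_UN)   # T1 finishes
            retry = try_lock(t2)             # T2 tries again
        finally:
            t1.close()
            t2.close()
    return first, second, retry
```

Dropping `LOCK_NB` gives the first variant instead, where T2 simply blocks until T1 releases the lock.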

B)
This is more dangerous:
1. Transaction T1 opens f1 for writing.
2. Transaction T2 opens f2 for writing.
3. Transaction T2 opens f1 for writing.
4. Transaction T1 opens f2 for writing.

This leads to a deadlock, since each transaction holds the lock on the data the other one needs, preventing both from finishing. To solve this, a transaction arbiter is needed, which fires a 'deadlock' error in both transactions, rolling them both back.
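
One way such an arbiter could spot scenario B is to follow the "waits for" chain from each blocked transaction and see whether it loops back. The sketch below assumes each file has at most one lock holder and each transaction waits for at most one file; `find_deadlock` and both dictionaries are hypothetical names, not part of any real database or kernel interface.

```python
def find_deadlock(holders, waiting):
    """Return the transactions that form a wait cycle, or [] if there is none.

    holders: maps a file name to the transaction holding its lock.
    waiting: maps a blocked transaction to the file it is waiting for.
    """
    for start in waiting:
        seen = []
        txn = start
        # Walk: blocked transaction -> file it wants -> holder of that file.
        while txn in waiting and txn not in seen:
            seen.append(txn)
            txn = holders.get(waiting[txn])
        if txn in seen:
            # We came back to a transaction already on the chain: a cycle.
            return seen[seen.index(txn):]
    return []


# Scenario B: T1 holds f1 and wants f2, T2 holds f2 and wants f1.
holders = {"f1": "T1", "f2": "T2"}
waiting = {"T2": "f1", "T1": "f2"}
victims = find_deadlock(holders, waiting)
```

Here `victims` contains both T1 and T2, which the arbiter would then roll back. Scenario A, where T2 merely waits for T1, produces an empty list.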

For a file system to implement just that, well... we'd have to maintain a large part of what an ACID database needs (rollback segments, etc.), which, done properly for a filesystem holding tons of files of different sizes, would be a huge pain. Just consider the case when both f1 and f2 are around 300 GB, which is not uncommon these days. And what about fsck?

Moreover, I consider the concept of an in-kernel ACID database ill-conceived: whenever one wants ACID, one stores one's data in an ACID database, and that's all. Given that all of us would have to suffer the downsides, I think it's not worth it.

BR,
Hristo


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds