I manage transactional systems for a living (first IBM MQ, now relational databases including
Sybase, DB2 and Oracle). I am sorry to say the current filesystem approach is still somewhat
optimistic, even in the presence of barriers; and that commercial transactional systems do
additional work that ext3/ext4 do not yet do, but that can be required for some disk failures.
Sadly, implementing this correctly may make the filesystems even slower.
One issue has to do with disk failures (say on a power failure) and partial writes. If
transaction A is committed and uses only part of a disk block, then transaction B cannot safely
be written to the same disk block - a power failure could leave the block partially
written and lose the data from the already committed transaction A. On the other hand, always using a
new disk block for each transaction would eat up log space really quickly. One solution is to
use ping-pong blocks (three are required if transactions can span a block).
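To make the ping-pong idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular database or filesystem does it: the block layout, the names (LogTail, encode_block, decode_block) and the torn_at parameter used to simulate a torn write are all made up for the example. It covers only the two-slot case where the whole log tail fits in one block.

```python
import struct
import zlib

BLOCK_SIZE = 64                      # tiny block for illustration only
CRC = struct.Struct("<I")            # CRC32 over metadata + payload
META = struct.Struct("<QI")          # sequence number, payload length
CAPACITY = BLOCK_SIZE - CRC.size - META.size


def encode_block(seq, payload):
    """Pack the log tail with a sequence number and a checksum."""
    if len(payload) > CAPACITY:
        raise ValueError("tail no longer fits in one block")
    body = META.pack(seq, len(payload)) + payload
    return (CRC.pack(zlib.crc32(body)) + body).ljust(BLOCK_SIZE, b"\0")


def decode_block(raw):
    """Return (seq, payload), or None if the block is torn or unwritten."""
    (crc,) = CRC.unpack_from(raw, 0)
    seq, length = META.unpack_from(raw, CRC.size)
    if seq == 0 or length > CAPACITY:
        return None
    body = raw[CRC.size:CRC.size + META.size + length]
    if zlib.crc32(body) != crc:
        return None
    return seq, body[META.size:]


class LogTail:
    """Partially filled last log block, written to alternating slots."""

    def __init__(self, slots=2):
        self.slots = [bytes(BLOCK_SIZE)] * slots
        self.seq = 0
        self.payload = b""

    def commit(self, record, torn_at=None):
        """Append a record and write the tail to the next slot in turn.

        torn_at (hypothetical, for simulation) cuts the write short after
        that many bytes, leaving the rest of the old slot contents in place.
        """
        self.payload += record
        self.seq += 1
        index = self.seq % len(self.slots)
        block = encode_block(self.seq, self.payload)
        if torn_at is not None:
            block = block[:torn_at] + self.slots[index][torn_at:]
        self.slots[index] = block

    def recover(self):
        """Return the payload of the newest slot that passes its checksum."""
        valid = [d for d in map(decode_block, self.slots) if d is not None]
        return max(valid, key=lambda d: d[0])[1] if valid else b""
```

With two slots, a torn write of a third commit overwrites only the older copy, so recovery still finds the previous commit intact. With slots=1 (rewriting the same block in place) the same torn write destroys the only copy of the already committed data.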
Another issue has to do with when transactions are written to disk. As the article indicates,
delaying writes may help if multiple concurrent and independent transactions can be written to
disk at the same time; but waiting to write data to disk is bad if one application is doing
most of the writes, as it would not run as quickly as possible. Some databases allow this to
be tuned on the fly (e.g. mincommit in DB2), but that is not desirable for a filesystem; the
kernel should use heuristics to figure this out as a workload is running.
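As a rough illustration of the kind of heuristic I mean, the sketch below keeps a moving average of how many distinct writers each log flush covered, and only delays a commit when recent flushes have actually been batching work from several writers. The class name, the thresholds and the delay formula are all assumptions picked for the example; this is not what DB2 or any kernel does.

```python
class CommitBatcher:
    """Decide whether to hold a commit briefly to batch it with others."""

    def __init__(self, max_delay_ms=2.0, smoothing=0.25):
        self.max_delay_ms = max_delay_ms
        self.smoothing = smoothing
        self.avg_writers = 1.0   # moving average of distinct writers per flush

    def record_flush(self, writer_ids):
        """Feed in the writers whose commits one flush carried to disk."""
        distinct = len(set(writer_ids))
        self.avg_writers += self.smoothing * (distinct - self.avg_writers)

    def delay_ms(self):
        """Return how long to wait before flushing the next commit."""
        if self.avg_writers < 1.5:
            # Effectively a single writer: waiting would only slow it down.
            return 0.0
        # Several concurrent writers: wait a little, up to a fixed cap.
        return min(self.max_delay_ms, 0.5 * (self.avg_writers - 1.0))
```

A single application doing all the writes keeps the average at one writer per flush, so its commits are never delayed; once several independent writers show up in the same flushes, the delay grows towards the cap.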
Next, you want to carefully tune the size of the log buffer in memory and the total size of
the transaction log on disk (i.e. when do we wrap round). In databases you also care about
how much log space a single transaction can use and how much other concurrent log activity
(from other transactions) may occur between activity and commit, but until we expose
filesystem transactions to userspace we can safely ignore this.
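The wrap-round question comes down to simple space accounting, sketched below. The names are hypothetical and the model is deliberately simplified: positions are absolute byte counts, and a record that straddles the end of the log is not split or padded as a real implementation would have to do.

```python
class CircularLog:
    """Track used space in a fixed-size log that wraps round."""

    def __init__(self, size):
        self.size = size
        self.head = 0   # total bytes ever appended
        self.tail = 0   # total bytes released by checkpoints

    def used(self):
        return self.head - self.tail

    def append(self, nbytes):
        """Reserve nbytes; return the on-disk offset, or None if full."""
        if nbytes > self.size - self.used():
            return None          # caller must checkpoint before retrying
        offset = self.head % self.size
        self.head += nbytes
        return offset

    def checkpoint(self, upto):
        """Release log space up to the absolute position upto."""
        self.tail = max(self.tail, min(upto, self.head))
```

A log that is too small for the workload shows up here as append returning None often, which forces a checkpoint in the commit path.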
Summary: filesystem implementors ought to talk to database implementors. I'm sure both groups
can teach each other a lot, but in this area databases are still quite a bit ahead of