Barriers and journaling filesystems

Posted May 21, 2008 20:37 UTC (Wed) by hpp (subscriber, #4756)
Parent article: Barriers and journaling filesystems

I manage transactional systems for a living (first IBM MQ, now relational databases including
Sybase, DB2 and Oracle). I am sorry to say the current filesystem approach is still somewhat
optimistic, even in the presence of barriers; and that commercial transactional systems do
additional work that ext3/ext4 do not yet do, but that can be required for some disk failures.
Sadly, implementing this correctly may make the filesystems even slower.

One issue has to do with disk failures (say on a power failure) and a partial write. If
transaction A is committed and uses a partial disk block, then transaction B cannot safely be
written to the same disk block - a power failure could lead to the block being partially
written and losing the data from committed transaction A. On the other hand, always using a
new disk block for each transaction would eat up log space really quickly. One solution is to
use ping-pong blocks (three are required if transactions can span a block).
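
A minimal sketch of the two-slot ping-pong idea, as an illustration only and not any particular
database's on-disk format. The block size, header layout, CRC32 check and the PingPongTail name
are all assumptions made for the example; the "disk" is just two in-memory buffers, and the
three-slot case for transactions that span a block is not covered.

```python
import struct
import zlib

BLOCK_SIZE = 512                    # assumed atomic-write unit (hypothetical)
HEADER = struct.Struct("<QII")      # sequence number, payload length, CRC32
MAX_PAYLOAD = BLOCK_SIZE - HEADER.size


def encode_block(seq, payload):
    """Build one self-describing block: header plus zero-padded payload."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload does not fit in one block")
    crc = zlib.crc32(struct.pack("<Q", seq) + payload)
    return (HEADER.pack(seq, len(payload), crc) + payload).ljust(BLOCK_SIZE, b"\0")


def decode_block(raw):
    """Return (seq, payload) if the block is intact, else None."""
    seq, length, crc = HEADER.unpack_from(raw)
    if length > MAX_PAYLOAD:
        return None
    payload = bytes(raw[HEADER.size:HEADER.size + length])
    if zlib.crc32(struct.pack("<Q", seq) + payload) != crc:
        return None
    return seq, payload


class PingPongTail:
    """Partially filled log tail kept in two alternating slots (simulated disk)."""

    def __init__(self):
        self.slots = [bytearray(BLOCK_SIZE), bytearray(BLOCK_SIZE)]
        self.seq = 0
        self.payload = b""

    def commit(self, record):
        """Append a record, writing the new tail to the slot that does NOT
        hold the previous commit, so a torn write cannot damage it."""
        new_payload = self.payload + record
        block = encode_block(self.seq + 1, new_payload)
        self.seq += 1
        self.payload = new_payload
        self.slots[self.seq % 2][:] = block

    def recover(self):
        """Pick the intact slot with the highest sequence number."""
        good = [d for d in map(decode_block, self.slots) if d is not None]
        return max(good)[1] if good else b""
```

If the write of transaction B's copy is torn, recovery falls back to the other slot, which
still holds everything up to and including transaction A.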

Another issue has to do with when transactions are written to disk. As the article indicates,
delaying writes may help if multiple concurrent and independent transactions can be written to
disk at the same time; but waiting to write data to disk is bad if one application is doing
most of the writes, as it would not run as quickly as possible. Some databases allow this to
be tuned on the fly (e.g. mincommit in DB2), but that is not desirable for a filesystem; the
kernel should use heuristics to figure this out as a workload is running.
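
One hypothetical heuristic of that kind, sketched under heavy simplification: hold a commit
back only while other transactions are still open, since a lone writer gains nothing from
waiting. The CommitBatcher class, its max_batch knob and the flush callback are invented for
this example; a real implementation would also need a timeout and would block the committing
thread until the flush finishes.

```python
class CommitBatcher:
    """Toy group-commit policy: delay a flush only while other transactions are open."""

    def __init__(self, flush, max_batch=8):
        self.flush = flush            # callable taking a list of records to write durably
        self.max_batch = max_batch    # upper bound on records held back (hypothetical knob)
        self.pending = []
        self.open_transactions = 0

    def begin(self):
        self.open_transactions += 1

    def commit(self, record):
        self.open_transactions -= 1
        self.pending.append(record)
        # Flush at once if nobody else is about to commit, or if the batch is full.
        if self.open_transactions == 0 or len(self.pending) >= self.max_batch:
            batch, self.pending = self.pending, []
            self.flush(batch)


flushes = []
batcher = CommitBatcher(flushes.append)

# One application committing serially: every commit is flushed immediately.
for record in ("a1", "a2"):
    batcher.begin()
    batcher.commit(record)

# Two overlapping transactions: their commits share a single flush.
batcher.begin()
batcher.begin()
batcher.commit("b1")
batcher.commit("c1")

print(flushes)   # [['a1'], ['a2'], ['b1', 'c1']]
```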

Next, you want to carefully tune the size of the log buffer in memory and the total size of
the transaction log on disk (i.e. when do we wrap round). In databases you also care about
how much log space a single transaction can use and how much other concurrent log activity
(from other transactions) may occur between activity and commit, but until we expose
filesystem transactions to userspace we can safely ignore this.
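
To make the wrap-around constraint concrete, here is a rough sketch of the space accounting
for a fixed-size circular log. The CircularLog and LogFull names and the byte-granular
interface are assumptions for illustration; real logs work in blocks and track far more
state.

```python
class LogFull(Exception):
    """Raised when an append would overwrite records not yet checkpointed."""


class CircularLog:
    """Space accounting for a fixed-size on-disk log that wraps around."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.head = 0   # total bytes ever appended
        self.tail = 0   # total bytes whose effects are checkpointed (reusable)

    def free(self):
        return self.capacity - (self.head - self.tail)

    def append(self, nbytes):
        """Reserve nbytes and return the offset within the log where they start."""
        if nbytes > self.free():
            raise LogFull("checkpoint needed before the log can wrap")
        offset = self.head % self.capacity
        self.head += nbytes
        return offset

    def checkpoint(self, upto):
        """Mark everything before absolute position upto as safe to overwrite."""
        self.tail = max(self.tail, min(upto, self.head))
```

With capacity 100, two 60-byte appends in a row fail on the second one until a checkpoint
frees the first 60 bytes; after that the second append starts at offset 60 and wraps. The
same arithmetic bounds how much log a single transaction can ever use.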

Summary: filesystem implementors ought to talk to database implementors. I'm sure both groups
can teach each other a lot, but in this area databases are still quite a bit ahead of
ext3/ext4.

Barriers and journaling filesystems

Posted May 22, 2008 10:06 UTC (Thu) by Fats (guest, #14882)

"Summary: filesystem implementors ought to talk to database implementors. I'm sure both groups can teach each other a lot, but in this area databases are still quite a bit ahead of ext3/ext4."

The purpose of a journal is not to be sure that everything is written to disk when you do a write. It's to be sure that the file system is always in a consistent state, so you don't need a very expensive fsck and don't risk losing data other than what was being written. If you need to be sure that something is written to disk, you have to call fsync() in your code.
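
As a minimal sketch of what that looks like from user space (the durable_append helper and the file name are made up for the example; os.write and os.fsync are the standard Python wrappers around write(2) and fsync(2)):

```python
import os
import tempfile


def durable_append(path, data):
    """Append data and return only after fsync() reports it flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)   # may write fewer bytes than asked
            view = view[written:]
        os.fsync(fd)                       # block until the kernel has pushed it out
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "app.log")     # hypothetical file name
    durable_append(log_path, b"record 1\n")
    durable_append(log_path, b"record 2\n")
    with open(log_path, "rb") as f:
        contents = f.read()
```

This only helps as far as the layers below honour the flush; a drive that acknowledges writes still sitting in its volatile cache defeats it.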

greets,
Staf.

Barriers and journaling filesystems

Posted May 24, 2008 9:14 UTC (Sat) by Xman (guest, #10620)

fsync *still* isn't going to help you much if I/O reordering is allowed.

Barriers and journaling filesystems

Posted May 24, 2008 9:30 UTC (Sat) by Fats (guest, #14882)

AFAIK fsync explicitly tells the hard drive to perform all outstanding IOs and then returns.
So, of course, if your hard drive lies to you, you are screwed.
Don't know if LVM is broken here also.

Barriers and journaling filesystems

Posted May 24, 2008 18:00 UTC (Sat) by Xman (guest, #10620)

fsync will block until the outstanding requests have been synced to disk, but it doesn't
guarantee that subsequent I/Os to the same fd won't also get completed, potentially ahead of
the I/Os submitted prior to the fsync. In fact it can't make such guarantees without
functioning barriers.

Barriers and journaling filesystems

Posted May 24, 2008 19:48 UTC (Sat) by Fats (guest, #14882)

Sure, my comment was in response to hpp, and what I wanted to say is that userland code has to
take care of transactions as defined in relational databases, and that fsync is the tool to
use for this.
A journaled file system only takes care that the file system stays in a consistent state, so
that no expensive fsck with possible loss of data is needed. Files open for writing may lose
some of the most recently written data if no fsync was performed. To keep the file system
consistent, barriers are used to guarantee a certain order of the writes. This limited
guarantee allows file systems to be faster than relational databases.

greets,
Staf.

Barriers and journaling filesystems

Posted May 23, 2008 17:05 UTC (Fri) by jlokier (guest, #52227)

Journalling commits are more like "asynchronous delayed commit" from a database point of view,
when fsync() isn't used. They protect the integrity of the filesystem structure itself, and are
not used for application-level transactional changes. Sometimes that weaker kind of commit is
fine, and the performance gain is large.

fsync() makes it more like a standard database commit, where the data is supposed to be secure
before the call returns.

This is one area where traditional databases can learn from filesystems. There are some
things where you don't actually need a database to commit quickly - that can take as long as
it needs to batch and optimise I/O. All you need then is consistent rollback. For example,
databases which hold cached calculations are like this.

Your point about partial writes on power failure and not using overlapping blocks (will
sectors do?) is valid, and I would like to know more about what the database professionals
have discovered about what exactly is and isn't safe. For example, can failure to write
sector N+1 corrupt sector N written a long time ago? Is the "failure block size" larger than
a single sector when doing O_DIRECT (when that really works)? Is it larger than a
filesystem/blockdev block size when not using O_DIRECT? What's the reason Oracle uses
different journal block sizes on different OSes?

I think the filesystem implementors do know about that effect. Journal entries are finished
with a commit block, to isolate the commit into its own block, which is not touched by the
next transaction. I think your two/three ping-pong blocks correspond to the journal's finite
wraparound length on a filesystem - do say more if that's not so.
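
A toy model of the commit-block idea, to make the recovery rule explicit. This is a sketch
and not the ext3/ext4 journal format: the tuple layout, the function names and the CRC32
stored in the commit block are all assumptions for the example. The journal is a plain list,
and a crash is simulated by dropping its last entry.

```python
import zlib


def write_transaction(journal, txid, blocks):
    """Append a transaction's data blocks, then a separate commit block."""
    for data in blocks:
        journal.append(("data", txid, data))
    checksum = zlib.crc32(b"".join(blocks))
    journal.append(("commit", txid, checksum))


def replay(journal):
    """Return the blocks of every transaction whose commit block is present
    and whose checksum matches; anything else is treated as never committed."""
    pending = {}
    committed = []
    for kind, txid, value in journal:
        if kind == "data":
            pending.setdefault(txid, []).append(value)
        elif kind == "commit":
            blocks = pending.pop(txid, [])
            if zlib.crc32(b"".join(blocks)) == value:
                committed.extend(blocks)
    return committed


journal = []
write_transaction(journal, 1, [b"x", b"y"])
write_transaction(journal, 2, [b"z"])
crashed = journal[:-1]          # power lost before transaction 2's commit block
```

Replaying the crashed journal yields only transaction 1's blocks; because the commit block
is never shared with the next transaction, writing transaction 2 cannot tear it.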