Solving the ext3 latency problem

Posted Apr 19, 2009 4:27 UTC (Sun) by tytso (subscriber, #9993)
In reply to: Solving the ext3 latency problem by chad.netzer
Parent article: Solving the ext3 latency problem

OK, so there are multiple issues when people talk about "safety" and data=writeback. First of all, ext3. In the security dimension, with data=writeback, after a crash where the filesystem was not cleanly unmounted, files that were written right before the crash might contain uninitialized data. This data could be from another user, although on a single-user system this is obviously not very likely. How serious you consider this really depends on how paranoid you are. Even on a single-user system, this data might contain information that you don't want to send out publicly, and if you don't notice that a file contains something other than what you expected before you send it to someone else, there is a potential for a security exposure. Obviously, though, on a single-user system this is much less of an issue than on a timesharing system.

In the data loss department, if you have an application that didn't use fsync(), and the system crashes, with data=writeback there is the chance of data loss. In 2.6.30, Linus accepted patches which cause an implied flush operation when a heuristic detects an application trying to replace an existing file via the replace-via-truncate and replace-via-rename patterns. This largely reduces the problems for non-fsync-using applications. It doesn't solve the problem for a freshly written file, but then the system could just as easily have crashed five seconds earlier.
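For applications that would rather not depend on that heuristic, here is a minimal sketch of the replace-via-rename pattern with an explicit fsync(). The function name replace_file_durably is made up for illustration, and the sketch assumes a POSIX system where a directory can be opened and fsync()ed.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffer to the kernel
            os.fsync(tmp.fileno())   # ask the kernel to push the data to disk
        os.replace(tmp_path, path)   # atomic rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp() creates the file readable only by its owner, so a real application would probably also want to copy the permissions of the file being replaced.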

OK, so how does ext4 change things? By default ext4 on modern kernels (ignoring the technology preview on RHEL 5 and Fedora 10) performs delayed allocation. This means that the data blocks are not allocated right away when you write the file, but only when they are forced out, either explicitly via fsync(), or via the page writeback algorithms in the VM, which will tend to push things out after 30-45 seconds (ignoring laptop mode) and perhaps sooner if the system is short on memory.
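The timing of that page writeback is governed by VM tunables under /proc/sys/vm. As a rough sketch (the helper names are made up for illustration, and the files only exist on Linux), this reads dirty_expire_centisecs, the age at which dirty data becomes eligible for writeout, and dirty_writeback_centisecs, the interval at which the flusher wakes up, both expressed in hundredths of a second:

```python
from pathlib import Path

VM_DIR = Path("/proc/sys/vm")


def centisecs_to_seconds(text):
    """Convert a /proc/sys/vm value in hundredths of a second to seconds."""
    return int(text.strip()) / 100.0


def read_writeback_timers(vm_dir=VM_DIR):
    """Return the dirty-page timers in seconds, or None where a file is unreadable."""
    timers = {}
    for name in ("dirty_expire_centisecs", "dirty_writeback_centisecs"):
        try:
            timers[name] = centisecs_to_seconds((vm_dir / name).read_text())
        except (OSError, ValueError):
            timers[name] = None  # not Linux, or the tunable is missing
    return timers


if __name__ == "__main__":
    for name, seconds in read_writeback_timers().items():
        print(name, seconds)
```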

In the security dimension, what this means is that even in data=writeback mode, in general on a crash the file will be truncated or zero-length instead of containing uninitialized data. In ext4 with delayed allocation and data=writeback, there *is* a very tiny race condition: if a transaction closes right after the pdflush daemon allocates the filesystem block but before it has a chance to trigger the page writeback, you might end up with uninitialized garbage. This chance is very small, but it is non-zero. In this case, ext4 data=ordered will force the write to disk, so it is technically safer in the security dimension, although this race is very hard to exploit, and it is very rarely hit in practice. (This is also why the overhead of data=ordered relative to data=writeback is much less for ext4, thanks to delayed allocation; the two modes are still not identical, however!)

In the safety against applications that don't use fsync department, as of 2.6.30, ext4 will always do an implied allocation and flush for data=ordered and data=writeback. So there is no real difference here between data=ordered and data=writeback.

The bottom line is that while there is some performance benefit in going with data=writeback with ext4, the differences between data=ordered and data=writeback are much smaller with ext4, in both the cost and benefit dimensions.
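To check which mode a mounted filesystem is actually using, one option is to look for the data= option in /proc/mounts. This is only a sketch: the function names are made up for illustration, and it assumes the kernel lists the data= option for the mount, which may not hold for every kernel version (a missing option is reported as None, meaning the filesystem default applies).

```python
def journal_data_mode(mounts_line):
    """Return the data= journalling mode from one /proc/mounts line, or None."""
    fields = mounts_line.split()
    if len(fields) < 4:
        return None
    for option in fields[3].split(","):
        if option.startswith("data="):
            return option[len("data="):]
    return None  # mode not listed; the filesystem default applies


def ext_data_modes(mounts_text):
    """Map mount point -> data mode for every ext3/ext4 entry in mounts_text."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] in ("ext3", "ext4"):
            modes[fields[1]] = journal_data_mode(line)
    return modes
```

On a Linux system this could be called as ext_data_modes(open("/proc/mounts").read()).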

Chris Mason is also working on a data=guarded mode, which will cause files to be truncated (much like delayed allocation) on a crash with ext3. I will look into porting this mode into ext4, if it proves to be enough of a performance advantage for ext4 over data=ordered, and yet providing a tiny bit more safety than data=writeback. It's not clear to me that it will be worth it for ext4, however.

I hope this helps answer the questions about ext3 versus ext4, and data=ordered versus data=writeback.

Regards,

Ted.

Solving the ext3 latency problem

Posted Apr 19, 2009 8:07 UTC (Sun) by sitaram (guest, #5959)

Thank you...

I'm one of those people for whom the security aspect is far more important (*) than data loss -- data loss can happen for so many other reasons that one should have a good, reliable, backup regime anyway, so one more reason doesn't bother me.

So ext3: people with my mindset should stick with data=ordered. (I don't see guarded as being too useful for ext3 -- we'll probably have switched to ext4 by the time guarded becomes mainstream).

Ext4: I think I'll stick with ordered here too. If the overhead has been much reduced by delayed alloc, it correspondingly reduces the main advantage of writeback too :-) I'd rather err on the side of security when the difference is minor.

Although collectively we like choice, and we *need* choice, when it comes to actual usage, we have to rationally reduce the many choices available into one and say "*this* is what we will use"!

Thanks once again for jumping in and helping with that!

Sitaram

(*) My home desktop is used by my kids also, for instance -- so it *is* a multi-user machine in the old traditional sense. The work machine runs email and office apps as one user, and my web browser and IRC as another user (simultaneously), so -- while both users are still me -- it too is multi user in the sense of wanting to keep two disparate sets of files separate.

Solving the ext3 latency problem

Posted Apr 19, 2009 22:35 UTC (Sun) by bojan (subscriber, #14302)

Thank you kindly for your detailed reply.

Solving the ext3 latency problem

Posted Apr 19, 2009 22:50 UTC (Sun) by bojan (subscriber, #14302)

> Chris Mason is also working on a data=guarded mode, which will cause files to be truncated (much like delayed allocation) on a crash with ext3. I will look into porting this mode into ext4, if it proves to be enough of a performance advantage for ext4 over data=ordered, and yet providing a tiny bit more safety than data=writeback. It's not clear to me that it will be worth it for ext4, however.

Unless there is a significant performance penalty by updating metadata only after the data has been written, instead of having another mode, this is probably how writeback mode should work.

Solving the ext3 latency problem

Posted Apr 20, 2009 0:29 UTC (Mon) by tytso (subscriber, #9993)

> Unless there is a significant performance penalty by updating metadata only after the data has been written, instead of having another mode, this is probably how writeback mode should work.

(Note that data=guarded is only deferring the update of i_size, and not any other form of metadata.)

We'll have to benchmark it and see. It does mean that i_size gets updated more, and so that means that the inode has to get updated as blocks are staged out to disk, so that means some extra writes to the journal and inode table. I don't think it should be noticeable, at least for most workloads, since it should be lost in the noise of the data block I/O, but it is extra seeks and extra writes.
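To make the i_size point concrete, here is a deliberately simplified toy model. It is not how ext3 or the data=guarded patches are implemented; the class, its fields and the "stale" placeholder are all invented for illustration. It only shows the idea that if the on-disk i_size advances only after a block's data has been written, a crash leaves a shorter file rather than one exposing old block contents.

```python
STALE = "stale"  # placeholder for old, possibly another user's, block contents


class ToyFile:
    """Toy model of one file; not how ext3 is actually implemented."""

    def __init__(self, guarded):
        self.guarded = guarded
        self.blocks = []        # what is physically on disk for each allocated block
        self.disk_i_size = 0    # i_size in the on-disk inode, counted in blocks
        self.pending = []       # (block index, data) not yet written to disk

    def append(self, data):
        """Allocate a block for `data`; the data itself is written later."""
        self.blocks.append(STALE)
        self.pending.append((len(self.blocks) - 1, data))
        if not self.guarded:
            # writeback-style: the size update may reach disk before the data
            self.disk_i_size = len(self.blocks)

    def write_back_one(self):
        """Write the oldest pending block to disk."""
        index, data = self.pending.pop(0)
        self.blocks[index] = data
        if self.guarded:
            # guarded-style: i_size advances only once the data is on disk
            self.disk_i_size = index + 1

    def contents_after_crash(self):
        """What a reader sees after a crash: blocks up to the on-disk i_size."""
        return self.blocks[: self.disk_i_size]


if __name__ == "__main__":
    for guarded in (False, True):
        f = ToyFile(guarded)
        f.append("A")
        f.append("B")
        f.write_back_one()  # only "A" reaches disk before the crash
        print("guarded" if guarded else "writeback", f.contents_after_crash())
```

In the toy guarded case the inode is touched once per block written out, which corresponds to the extra journal and inode table writes mentioned above.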

Solving the ext3 latency problem

Posted Apr 20, 2009 3:43 UTC (Mon) by bojan (subscriber, #14302)

Thanks for the explanation.

I guess if i_size could be updated just once, when all the blocks are pushed out, then this would be even less of a problem. But, then again, I have no idea how this actually works inside the code, so this suggestion is probably naive.