
Ts'o: Delayed allocation and the zero-length file problem


Posted Mar 13, 2009 20:44 UTC (Fri) by tialaramex (subscriber, #21167)
Parent article: Ts'o: Delayed allocation and the zero-length file problem

A lot of commenters take a position to which the only reasonable reply is "disable delayed allocation". If you insist that everything should appear as if it happened in order, then by definition delaying allocation is incompatible with your desires.

If you're in that camp, you need to get out and start campaigning for programmers to use fallocate() more, because without it you're giving up a lot of performance to keep your ordering constraint. With fallocate() the allocation step can be brought forward, which avoids the performance loss. At the very least, file downloaders (e.g. curl, or the one in Firefox) and basic utilities like /bin/cp and the built-in copy that modern mv implementations use when crossing filesystems need to fallocate(), or you'll fragment just as badly as in ext3 and perhaps worse (since now the maintainers assume you have delayed allocation protecting you).
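To make the suggestion concrete, here is a minimal sketch of what preallocating might look like in a copy or download tool that knows the final size up front. The helper name preallocate() is hypothetical, and the fallback to posix_fallocate() when the filesystem does not support fallocate() is my assumption about what a careful utility would want, not something any of the tools named above are known to do.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: reserve `size` bytes for `fd` before any data
 * is written, so the allocator sees the whole request at once.
 * Returns 0 on success or an errno value on failure.
 */
static int preallocate(int fd, off_t size)
{
    assert(size > 0);

    /* Linux-specific call; mode 0 allocates blocks and extends the
     * file size to offset + len if the file is currently shorter. */
    if (fallocate(fd, 0, 0, size) == 0)
        return 0;

    /* Anything other than "not supported here" is a real error. */
    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return errno;

    /* Portable fallback; it returns the error number directly
     * rather than setting errno. */
    return posix_fallocate(fd, 0, size);
}
```

A caller would open the destination, call preallocate(fd, expected_size), and then write as usual. Since mode 0 also sets the file size, a tool that might be interrupted would have to truncate the file back to what was actually written.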



Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 13, 2009 22:24 UTC (Fri) by MisterIO (guest, #36192) [Link]

Fragmentation IMO is not such a big problem, because the online defragmenter will solve or mitigate it.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 1:10 UTC (Sat) by tialaramex (subscriber, #21167) [Link]

1. Create 400MB file, allocating as you go, resulting in hundreds of fragments
2. Run defragmenter on file to collect fragments into contiguous areas

is crazy. It's so crazy that ext4's default behaviour waits as long as possible to allocate in order to avoid this scenario, which causes the "bug" that got Ubuntu testers in such a tizzy. The online defragmenter, if and when it arrives in mainline, is a workaround, not a fix. You don't want to make it part of your daily routine, so most likely what you'll actually do is live with the reduced performance, all so that some utility developers can avoid writing a few lines of code.

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 1:40 UTC (Sat) by MisterIO (guest, #36192) [Link]

Couldn't it be run as a cron job on the whole fs, daily for example?

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 14, 2009 4:19 UTC (Sat) by nlucas (subscriber, #33793) [Link]

The fallocate() man page says it's only supported since kernel 2.6.23. That is a VERY recent kernel version. In a year or two maybe I will look at that page again. For now it's just too soon to care.

fallocate fdatasync sync_file_range

Posted Mar 14, 2009 8:30 UTC (Sat) by stephen_pollei (guest, #23348) [Link]

Yes, fallocate is a good thing for a programmer to know. tytso has mentioned that sqlite most likely should have used fdatasync and fallocate. He also mentioned that fsync wouldn't really have been a problem even in ext3 with data=ordered mode if it were called in a thread. I also think sync_file_range(), fiemap and msync() are good things to know about. I can kind of see how something like mincore() that returned more information that would be in the page tables might be nice, so you could check whether a page you scheduled for writeout is still dirty or not.

I don't think any of these things would help the case of many small text files being replaced by a rename though -- you need a fsync() to flush the metadata of the filesize increasing, I assume.

a) open and read file ~/.kde/foo/bar/baz
b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT, 0666)
c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
d) sync_file_range or msync to schedule but not wait for stuff to hit disk --- this is optional
e) close(fd)
f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") 
g) wait for the stuff to hit the disk somehow
h) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
I think a lot of the time, not being in such a rush to clobber the old data and keeping both around for a while might work just fine. Heck, keep a few versions around to roll back to and lazily garbage collect them once you can see they have gone stale. I could be totally wrong though -- just brain-storming.
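Steps b) through h) could be sketched in C roughly as follows. The helper name replace_file() and the fixed-size path buffers are hypothetical. Using fsync() for step g) is my assumption, since the list leaves the waiting mechanism open; because fsync() needs the descriptor, the sketch does the wait before the close() of step e) rather than after it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to path.new, keep the old
 * file as path~, then move the new file into place.
 * Returns 0 on success, -1 on error. */
static int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096], bak[4096];
    int n;

    assert(path != NULL && data != NULL);
    n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    n = snprintf(bak, sizeof bak, "%s~", path);
    if (n < 0 || (size_t)n >= sizeof bak)
        return -1;

    /* b) create the new file next to the old one */
    int fd = open(tmp, O_WRONLY | O_TRUNC | O_CREAT, 0666);
    if (fd < 0)
        return -1;

    /* c) write() may be short or interrupted, so loop */
    size_t off = 0;
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp);
            return -1;
        }
        off += (size_t)w;
    }

    /* d) optional: start writeback without waiting; the result is
     * ignored because this is only a hint */
    (void)sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

    /* g) assumption: fsync() is the "wait for it to hit the disk" */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }

    /* e) */
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* f) keep the old version; ENOENT just means there was none */
    if (rename(path, bak) != 0 && errno != ENOENT)
        return -1;

    /* h) move the new contents into place */
    return rename(tmp, path) == 0 ? 0 : -1;
}
```

Note that between f) and h) nothing exists under the original name, so a crash or a concurrent reader in that window sees only the backup and the .new file.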

Ts'o: Delayed allocation and the zero-length file problem

Posted Mar 20, 2009 15:04 UTC (Fri) by anton (guest, #25547) [Link]

"If you insist that everything should appear as if it happened in order, then by definition delaying allocation is incompatible with your desires."

By what definition? It's perfectly possible to delay allocation (and writing) as long as desired, and also delay all the other operations that are supposed to happen afterwards (in particular the rename of the same file) until after the allocation and writing have happened. Delayed allocation is a red herring here; the problem is the ordering of the writes.

LinLogFS, which had the goal of providing good crash recovery, did implement delayed allocation.
