Garrett: ext4, application expectations and power management
Posted Mar 16, 2009 3:36 UTC (Mon) by neilbrown (subscriber, #359)
Parent article: Garrett: ext4, application expectations and power management
The main point of the article seems to be something about power management (hence the title). Forcing a 'sync' on 'rename' implies the drive has to be written to before each rename. If instead the filesystem imposes an ordering between the flush and the rename, but doesn't necessarily hurry either of them along, then you get the guarantees which (it is claimed) application writers want, without the power costs that Matthew is (justifiably) concerned about.
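For concreteness, the application-side pattern being argued about looks roughly like the following. This is only a sketch in Python; the function name `replace_file_durably` is made up for illustration. The `os.fsync` call is the step that forces the drive to be written immediately. Under the ordering-only behaviour described above, an application could drop that call and still expect to find either the old or the new contents after a crash, but that is an assumption about the filesystem, not something POSIX promises.

```python
import os
import tempfile


def replace_file_durably(path, data):
    """Replace `path` with `data` using write-to-temp, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # This is the expensive step: it forces the data to disk now.
            os.fsync(tmp.fileno())
        # On POSIX systems, rename atomically replaces the old name.
        os.rename(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Note that the sketch does not fsync the containing directory, so it guarantees that the new name never points at incomplete data, not that the rename itself has reached the disk.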
In contrast, the quoted footnote is a somewhat aggressive way of saying "let's sit down and develop an API for telling the filesystem that a given collection of files should be optimised for 'database-like access'", which means (I think) "expect small files, don't worry about hard links or differing access modes, etc".
In response to the first point, I agree that it might be nice, but I don't envy the various filesystem designers the task of implementing it. Ordering considerations are fairly fundamental to the design of a journaling system. Adding extra requirements at the last minute would be quite non-trivial. If we as a community really want stronger ordering rules than POSIX provides, then we should really have a broad and open discussion about that, rather than ranting about some recently-apparent breakage.
In response to the second, we again need open and constructive conversation. Supporting "lots of small files" and still allowing hard links and chmod and extended attributes would be a significant challenge for a filesystem. I suspect that the easiest approach would be to treat the files in a directory in a "database-like" way until some operation is attempted which doesn't fit, and then move that file out of the "database". For example: store file contents inside the directory until the file exceeds 512 bytes, or a hard link is created, or it is renamed to a different directory, or a chmod/chown is performed.
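As a toy illustration of that rule, here is an in-memory model in Python. Everything in it is hypothetical (the class `ToyDirectory`, its method names, and the choice to never move a file back inline); it models no real filesystem and only shows the "inline until something doesn't fit" bookkeeping. Renames across directories and chown are left out for brevity.

```python
INLINE_LIMIT = 512  # bytes; the example threshold from the text


class ToyDirectory:
    """In-memory model: small files live 'inside' the directory."""

    def __init__(self):
        self.inline = {}   # name -> contents stored in the directory itself
        self.regular = {}  # name -> contents stored as an ordinary file

    def _evict(self, name):
        # Move a file out of the "database" part; no-op if already out.
        if name in self.inline:
            self.regular[name] = self.inline.pop(name)

    def write(self, name, data):
        # A file that has been evicted, or that is too big, stays regular.
        if name in self.regular or len(data) > INLINE_LIMIT:
            self.inline.pop(name, None)
            self.regular[name] = data
        else:
            self.inline[name] = data

    def chmod(self, name, mode):
        # Differing access modes do not fit the database model.
        # (The mode itself is not tracked in this toy.)
        self._evict(name)

    def link(self, name, new_name):
        # Hard links do not fit either, so evict before linking.
        self._evict(name)
        self.regular[new_name] = self.regular[name]
```

For example, after `d = ToyDirectory()` and `d.write("a", b"x" * 10)` the file sits in `d.inline`; calling `d.chmod("a", 0o600)` moves it to `d.regular`, where it then stays.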
For this to be truly useful there would need to be general agreement about what operations are allowed to "break" the database. Hence the need for an API. The API doesn't need to mean new syscalls or new fcntl calls. It just needs to be an agreement between filesystem developers and application developers.
The overlap between these two considerations (power-friendly data integrity and small-file optimisation) is the question of how to provide transaction semantics across a set of small files. One idea that occurs to me is to allow file locking to be applied to directories. If an application takes an exclusive lock on a directory, then we could arrange that no changes made are externally visible until the lock is voluntarily released. If the lock is released by application-exit or system-crash, then the contents of the directory remain unchanged. If any operation is attempted on a locked directory which would break the "it is a database" property, that operation is disallowed.
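For comparison, this is roughly what is available today. On Linux, `flock()` can already be taken on a directory opened read-only, but the lock is purely advisory: it coordinates cooperating processes and provides none of the proposed semantics (changes remain visible immediately and are not rolled back on exit or crash). A minimal Python sketch, with `directory_lock` being a made-up helper name:

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def directory_lock(path):
    """Hold an exclusive advisory flock on a directory for the with-block."""
    # Directories cannot be opened for writing, so open read-only.
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        # Closing the descriptor releases the lock.
        os.close(fd)
```

An application would wrap its updates in `with directory_lock("/some/dir"):`. The gap between this and the idea above is exactly the part the filesystem would have to supply: hiding the changes until release and discarding them on an involuntary release.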
I wonder if that could be made to work... and if it would actually be useful. It would certainly be a challenge to export some of this via NFS :-)
