
Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 3:36 UTC (Mon) by neilbrown (subscriber, #359)
Parent article: Garrett: ext4, application expectations and power management

The bulk of that post seems to be saying something quite different from (though not incompatible with) the footnote that has been quoted.

The main point of the article seems to be something about power management (hence the title). Forcing a 'sync' on 'rename' implies the drive has to be written to before each rename. If instead the filesystem imposes an ordering between the flush and the rename, but doesn't necessarily hurry either of them along, then you get the guarantees which (it is claimed) application writers want, without the power costs that Matthew is (justifiably) concerned about.
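For reference, the application-side pattern this debate revolves around can be sketched as below. This is a minimal illustration, not anything from the article: the function name `atomic_write` and the `durable` flag are hypothetical. With `durable=True` the application pays for an explicit flush to disk before every rename; with `durable=False` it relies on the filesystem to order the data write ahead of the rename, which is the behaviour being asked for and which POSIX does not promise.

```python
import os
import tempfile


def atomic_write(path, data, durable=True):
    """Replace `path` with `data` so readers see the old or the new contents.

    `durable` is a hypothetical knob for this sketch: when True we force the
    data to disk before renaming; when False we leave ordering to the
    filesystem, which POSIX does not guarantee.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays within one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                 # push Python's buffer to the kernel
            if durable:
                os.fsync(f.fileno())  # ask the kernel to write it to disk
        os.replace(tmp, path)         # atomic rename over the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

The power-management concern is about the `os.fsync` call: it wakes the drive on every update, whereas an ordering guarantee alone would let the filesystem batch the work.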

In contrast, the quoted footnote is a somewhat aggressive way of saying "let's sit down and develop an API for telling the filesystem that a given collection of files should be optimised for 'database-like access'", which means (I think) "expect small files, don't worry about hard links or differing access modes, etc".

In response to the first point, I agree that it might be nice, but I don't envy the various filesystem designers the task of implementing it. Ordering considerations are fairly fundamental to the design of a journaling system. Adding extra requirements at the last minute would be quite non-trivial. If we as a community really want stronger ordering rules than POSIX provides, then we should really have a broad and open discussion about that, rather than ranting about some recently-apparent breakage.

In response to the second, we again need open and constructive conversation. Supporting "lots of small files" and still allowing hard links and chmod and extended attributes would be a significant challenge for a filesystem. I suspect that the easiest approach would be to use a "database-like" approach for files in a directory until some operation is attempted which doesn't fit, and then move that file out of the "database". For example, store file contents inside the directory until the file exceeds 512 bytes, or a hard link is created, or it is renamed to a different directory, or a chmod/chown is performed.
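The "keep it in the database until something doesn't fit" policy can be illustrated with a toy in-memory model. Everything here is hypothetical (the class `ToyDirectory`, the operation names, the two dictionaries standing in for on-disk layouts); only the 512-byte threshold and the list of evicting operations come from the paragraph above.

```python
INLINE_LIMIT = 512  # bytes; the threshold suggested in the comment

# Operations that, per the comment, no longer fit the "database" model.
# The names are made up for this sketch.
EVICTING_OPS = {"link", "rename_out", "chmod", "chown"}


class ToyDirectory:
    """Toy model: small plain files live 'inside' the directory."""

    def __init__(self):
        self.inline = {}    # name -> bytes stored inside the directory
        self.external = {}  # name -> bytes stored as an ordinary file

    def write(self, name, data):
        # Once a file has left the database it stays out, and any file
        # that grows past the limit is moved out.
        if name in self.external or len(data) > INLINE_LIMIT:
            self.inline.pop(name, None)
            self.external[name] = data
        else:
            self.inline[name] = data

    def apply(self, name, op):
        # A non-database operation evicts the file; others change nothing.
        if op in EVICTING_OPS and name in self.inline:
            self.external[name] = self.inline.pop(name)
```

Whether eviction should be one-way, as modelled here, is exactly the kind of detail the agreement between filesystem and application developers would have to settle.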

For this to be truly useful there would need to be general agreement about what operations are allowed to "break" the database. Hence the need for an API. The API doesn't need to mean new syscalls or new fcntl calls. It just needs to be an agreement between filesystem developers and application developers.

The overlap between these two considerations (power-friendly data integrity and small-file optimisation) is the question of how to provide transaction semantics across a set of small files. One idea that occurs to me is to allow file locking to be applied to directories. If an application takes an exclusive lock on a directory, then we could arrange that no changes made are externally visible until the lock is voluntarily released. If the lock is released by application-exit or system-crash, then the contents of the directory remain unchanged. If any operation is attempted on a locked directory which would break the "it is a database" property, that operation is disallowed.
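As a point of comparison, Linux already lets a process take an advisory `flock` lock on a directory file descriptor. The sketch below (the helper name `locked_directory` is hypothetical) shows only that existing mechanism. It excludes other cooperating lockers and nothing more: it does not hide changes from other processes, and it does not roll anything back if the holder exits or the system crashes, so it provides none of the transactional behaviour proposed above.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def locked_directory(path):
    """Hold an exclusive advisory lock on a directory for the `with` body.

    Only other processes that also call flock() on the directory are
    excluded; ordinary reads and writes in it are not affected.
    """
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

A cooperating application would wrap its updates in `with locked_directory("/some/dir"):`; the path here is only a placeholder.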

I wonder if that could be made to work... and if it would actually be useful. It would certainly be a challenge to export some of this via NFS :-)



Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 4:47 UTC (Mon) by flewellyn (subscriber, #5047) [Link] (8 responses)

Or, we could use filesystems like, well, filesystems. And if we want databases, we use databases.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 4:58 UTC (Mon) by neilbrown (subscriber, #359) [Link] (5 responses)

What would you suggest is the key difference between the two, which would allow people to decide which to use in a particular situation? In my view they have fairly independent strengths, and if we could unify them that would be useful.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 9:49 UTC (Mon) by mjthayer (guest, #39183) [Link]

I think that a database lends itself better to indexing than most filesystems, notwithstanding indexing daemons running in the background and holding millions of inotify fds.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 17:56 UTC (Mon) by flewellyn (subscriber, #5047) [Link] (2 responses)

That's a fair question. I think the main issue is the level of abstraction: databases are more abstract than filesystems, and depend on properly working filesystems for their correct operation. A database is one particular application of data storage, while a filesystem is a general mechanism for data storage.

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 14:46 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

Databases depend on filesystems for their correct operation?

What about native Pick, where the database IS the filesystem?

Or Oracle, where it's configured to use raw partitions for data storage?

Cheers,
Wol

Garrett: ext4, application expectations and power management

Posted Mar 17, 2009 17:58 UTC (Tue) by flewellyn (subscriber, #5047) [Link]

I'd argue that those databases implement their own filesystems.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 18:44 UTC (Mon) by larryr (guest, #4030) [Link]

What I would like is good support for using the filesystem as an efficient mechanism for a persistent hierarchical collection of named values, where the names are the filenames and the values are the file contents, similar to sysfs. I think a lot of people are using sqlite or dbm files for this because using filesystem operations takes too long.
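The usage model described here, names as paths and values as file contents, can be sketched in a few lines. The function names `put`, `get` and `keys` are hypothetical, and the sketch assumes trusted keys (no `..` components, and no key that is also a prefix of another key). It makes no claim about speed, which is the actual complaint.

```python
import os


def put(root, key, value):
    """Store `value` (bytes) under a slash-separated `key` below `root`."""
    path = os.path.join(root, *key.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(value)


def get(root, key):
    """Return the bytes stored under `key`."""
    with open(os.path.join(root, *key.split("/")), "rb") as f:
        return f.read()


def keys(root):
    """Yield every stored key, using '/' as the separator."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            yield rel.replace(os.sep, "/")
```

Each `put` costs a directory lookup, an open, a write and a close, which is the per-value overhead that pushes people towards a single sqlite or dbm file.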

Larry@Riedel.org

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 14:57 UTC (Mon) by flammon (guest, #807) [Link] (1 responses)

It's too bad Hans is in jail, because I think that Reiser4 was designed to address the many-small-files problem with something called block suballocation (http://en.wikipedia.org/wiki/Block_suballocation). Maybe we can salvage a few ideas from Reiser4.

Garrett: ext4, application expectations and power management

Posted Mar 16, 2009 18:30 UTC (Mon) by job (guest, #670) [Link]

Btrfs has that feature as well, according to documentation.

