
The 2006 Linux Filesystems Workshop (Part III)

[Editor's note: this is the third and final page of Valerie Henson's report from the 2006 Linux Filesystems Workshop.]

Day Two: Ideas

The second day of the workshop began with the discovery that the bulb in the projector had burned out, which was fine with us as it further discouraged the use of mind-numbing slideware. We attributed part of the amazing productivity of the workshop to the lack of a projector and the fact that most of us also lacked any form of network connectivity from the conference room, forcing us to pay attention to what was going on instead of reading our email.

The second day of the workshop was devoted to talking about interesting file system architecture ideas. We were fortunate to have experts in most of the major Linux file systems (ext2/3, reiserfs, XFS, JFS) present to share their experiences, as well as people with expertise in other file systems (OCFS2, Lustre, ZFS), and of course, our token I/O subsystem representative (SCSI).

First, a quick review of our goals. We want a file system that can easily detect and correct disk corruption without taking the file system offline for hours or days. We especially don't want to lose the entire file system or large segments thereof when a relatively small part of the disk is corrupted. File system repair should take less time than restoring the file system from backup; in fact, we should design our file system layout with fsck performance in mind, rather than just normal file system performance. These goals can be summarized as "repair-driven file system design" - designing our file system to be quickly and easily repaired from the beginning, rather than bolting it on afterward.

We also want to avoid double writes, reallocating blocks on every write, and overly complex implementations which are difficult to understand and prone to bugs. We want to efficiently store small files but otherwise are willing to trade disk space for greater reliability. We can take advantage of disk track caches and focus on reducing seeks instead of laying out data exactly sequentially.

Chunkfs and continuation inodes

Chunkfs is an idea that Arjan van de Ven and Val Henson came up with in order to combat the coming fsck crunch. It is similar to an idea for improving Lustre at the distributed file system level currently in development at Cluster File Systems, Inc. In order to explain it, we'll start with a version of the idea that didn't work.

In early 2006, Val Henson began working on an idea to reduce fsck time in ext2. Ext2 is a wonderful file system - fast, reliable, simple - but it has the irritating requirement of a full fsck after a system crash. Her initial idea was to mark individual block groups as dirty or clean, depending on whether they were being modified, and then modify fsck to not bother checking clean block groups, thereby reducing average fsck time. Initial measurements showed the divide-and-conquer approach to be promising: at most 25% of block groups were dirty at any given time under a heavy file system load, and most block groups were clean more than 90% of the time under a normal development laptop workload. However, this scheme failed because file system references are not contained within block groups. If fsck wants to know whether the block group allocation bitmap is correct for a particular block, it still has to build the block bitmap by traversing the entire file system, since any inode in any block group can reference any block in any block group. Instead, she implemented a file system-wide dirty bit indicating whether the entire file system was dirty, allowing the system to skip fsck if it crashed while the file system was clean. (For details, see the OLS talk on reducing fsck time on ext2.)
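A minimal sketch of the surviving idea - the file-system-wide dirty flag - is shown below; the structure and field names are invented for illustration and are not taken from the ext2 patches:

    /* Tiny sketch of a file-system-wide dirty flag: set it in the
     * superblock before the first modification after mount, clear it on
     * clean unmount, and let fsck skip the check when the flag is clear.
     * Names are illustrative only. */
    #include <stdbool.h>

    struct superblock {
        bool dirty;                 /* any modification in flight? */
        /* ... the rest of the superblock ... */
    };

    /* Called by fsck at boot time. */
    static bool needs_full_check(const struct superblock *sb)
    {
        return sb->dirty;           /* clean at crash time => skip the check */
    }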

Arjan van de Ven got to wondering: what if we divided the file system into many small file systems (chunks), each with its own dirty bit, so that fsck time would be more predictable? This quickly evolved into a scheme for making a collection of small file systems (with any form of consistency guarantee - dirty bit not required) look like one big file system to the user, while still being individually repairable. The basic insight is to allow files and directories to span file systems by creating "continuation inodes" in other file systems. For example, if a file grows too large for its chunk, we create a continuation inode in another chunk and point to it. The new inode has a back pointer to the original inode. For a directory that has a link to a file in another chunk, we allocate a continuation inode for the directory in the new chunk and allocate the directory entry for that link in the same chunk as the file. This has the downside that the number of hard links to a file is limited by the free space in the file's chunk; however, large numbers of hard links are uncommon. This problem can be mitigated by reserving space for a minimum number of hard links per inode created. Another option is to allocate a continuation inode for the file being linked to in the same chunk as the directory, and chain the various hard links together using a linked list to handle the case of removing the original link.

The important invariants are that blocks are only referenced by inodes inside the chunk, and any inodes with cross-chunk references have explicit back pointers to the parent inode. Fsck only has to check the local inodes to repair the block bitmap, and the local inodes plus the inode back pointers to check inode reference counts. A memorable quote from this discussion was: "If you want to check it, you have to chunk it."
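As a rough illustration of these invariants, here is one way the on-disk structures might look; the names and layout are invented for this sketch and are not taken from any actual chunkfs code:

    /* Illustrative sketch only: shows how continuation inodes and back
     * pointers might tie chunks together while keeping block references
     * local to a chunk. */
    #include <stdint.h>
    #include <stdbool.h>

    struct chunk_iref {             /* cross-chunk inode reference */
        uint32_t chunk;             /* which chunk the inode lives in */
        uint32_t ino;               /* inode number inside that chunk */
    };

    struct chunk_inode {
        uint32_t ino;
        uint64_t size;
        uint32_t nlink;
        bool     is_continuation;   /* true if this inode continues another */
        struct chunk_iref cont;     /* next continuation, {0,0} if none */
        struct chunk_iref back;     /* back pointer to the parent inode,
                                       valid only for continuation inodes */
        uint32_t blocks[12];        /* block pointers: local to this chunk only */
    };

    /* Per-chunk fsck invariant: a continuation inode must point back at a
     * parent that in turn points forward to it.  Checking this needs only
     * the local chunk plus one remote inode lookup via the back pointer. */
    static bool check_continuation(const struct chunk_inode *cont,
                                   const struct chunk_inode *parent,
                                   uint32_t cont_chunk)
    {
        if (!cont->is_continuation)
            return true;            /* nothing to verify */
        return parent->cont.chunk == cont_chunk &&
               parent->cont.ino   == cont->ino;
    }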

Ideally, fsck on chunkfs would be implemented on a per-chunk, on-line, on-demand basis. If a chunk suffers an I/O error or the file system discovers a checksum mismatch, that chunk can be taken offline individually and checked with less impact on the rest of the file system (the directory hierarchy has to be taken into account here; it may be reasonable to replicate directories near the root of the file system or allow independent mounts of files in particular chunks).

Chunkfs has many useful properties. Chunks can be created with different inode ratios and sizes. Files can be easily segregated into chunks based on size, reducing fragmentation. One possibility is that files start out in a small-file chunk and, when they grow larger, a continuation inode is allocated in a large-file chunk. This might be useful for workloads that only access the first few bytes of a file, such as the "file" command and some image viewers. With chunkfs, 32-bit block pointers are enough, since it is unlikely we will want chunks in the terabyte size range. Chunkfs puts few limitations on the implementation of the file system inside each chunk and might possibly be implemented as a generic layer usable with any local file system. Growing file systems is trivial - add new chunks - and shrinking is simplified, since data can be moved one chunk at a time.

Overall, chunkfs was one of the rare ideas that looked better the more we thought about it. We will be discussing it more on the Linux fsdevel list in the near future.

Doublefs

Doublefs is an idea that Theodore Ts'o and Val Henson came up with a couple of years ago, trying to get the consistency benefits of copy-on-write file systems without the pain of constantly allocating new blocks and figuring out space accounting for updates. Since disk space is cheap, why not pre-allocate two copies of all metadata, including directory entry blocks? When updating the file system, overwrite only one copy of the metadata, marking it with a transaction number. When the full set of updates is on disk, go back and update the second, older copy of the metadata at your leisure. Half-updated metadata is tracked by some form of dirty map or list on disk, so if the system crashes in the middle of an update, we can go back and clean it up, by copying either the old or the new copy of the metadata over the second copy, depending on when we crashed. Disagreements between the copies are settled by checking the transaction id and a checksum.

One nice property of this system is that the two copies do not need to be located next to or even near each other, since the writes are temporally separated. This increases the chance that only one copy is corrupted by an I/O error. Another nice property is that the second copy need only be written in order to reduce recovery time on reboot (and lower the chance of the good copy being corrupted before the second copy is updated). Zach Brown and Aaron Grier put some effort into prototyping doublefs.
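A toy sketch of the recovery rule described above follows, assuming two on-disk copies per metadata block, each carrying a transaction id and a checksum; the structure layout and the checksum function are placeholders, not from any doublefs prototype:

    /* Minimal sketch of doublefs recovery: keep whichever valid copy is
     * newer and propagate it over the other. */
    #include <stdint.h>
    #include <string.h>

    struct meta_copy {
        uint64_t txid;              /* transaction that last wrote this copy */
        uint32_t csum;              /* checksum over the payload */
        uint8_t  payload[4096 - 12];
    };

    static uint32_t csum(const uint8_t *p, size_t n)
    {
        uint32_t c = 0;             /* toy checksum, stands in for a real one */
        while (n--)
            c = c * 31 + *p++;
        return c;
    }

    static int copy_valid(const struct meta_copy *c)
    {
        return c->csum == csum(c->payload, sizeof(c->payload));
    }

    /* After a crash: if only one copy is valid, or both are valid but
     * differ, copy the newer valid one over the other. */
    static void doublefs_recover(struct meta_copy *a, struct meta_copy *b)
    {
        int av = copy_valid(a), bv = copy_valid(b);

        if (av && (!bv || a->txid > b->txid))
            memcpy(b, a, sizeof(*a));
        else if (bv)
            memcpy(a, b, sizeof(*b));
        /* neither copy valid: genuine loss, fall through to deeper repair */
    }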

Inodes in directory entries

Another major idea discussed at the workshop is making the directory entry the primary repository of file metadata rather than a separate inode, as suggested by Linus Torvalds. Currently, the directory entry contains only two important pieces of information: the name of the link to the file, and the inode number of the file. It might make sense to include the inode in the directory entry as well. A paper describing one implementation of this scheme is Embedded Inodes and Explicit Grouping: Exploiting Disk Bandwidth for Small Files, by Greg Ganger, et alia.

One difficulty is what happens when a file is created in one directory, another hard link is created from another directory, and the original link is removed. The inode must then be migrated to the new directory, via a linked list of some sort. This means that a particular inode number is not located in a static place, complicating anything that must look up files by inode number. Currently, only NFS servers need to look up files by inode number, due to the implementation of NFS file handles. Also, we may want to change the inode number when we move it to another directory, for internal file system architecture reasons. One solution is to have a user-visible inode number that stays the same over the lifetime of the file, and an internal inode number which can change.

Putting inodes in the directory entry opens the way to easily include a small amount of file data in the directory entry. Up to a certain threshold, file data could be packed into the directory entry along with the inode. This will greatly reduce the on-disk fragmentation caused by partly-used data blocks. Having all the file information readable with one I/O will improve the performance of many workloads, since a cold cache read of file data normally requires three dependent I/O's: first the lookup of the directory entry, then the read of the inode, then the read of the file data the inode points to. The major concern with packing file data is that the implementation tends to be bug-prone and difficult to get right, due to the need to correctly repack the data if the file size changes. However, we observe that very small files are usually created and deleted atomically and rarely modified in place. One rule that might work is to only pack the file data into the directory entry when it is first created (using delayed allocation to avoid creating the file until the data is written). If the file changes or grows, immediately move the data out of the directory entry - and never move it back. Christoph Hellwig suggested that this might be easy to prototype in XFS.
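For illustration, a directory entry with an embedded inode and a small inline-data area might look something like the following; the field names and the 256-byte threshold are invented for this sketch and do not describe any real on-disk format:

    /* Sketch of a directory entry carrying an embedded inode plus a small
     * inline-data area, following the pack-at-create, spill-on-growth rule. */
    #include <stdint.h>
    #include <string.h>

    #define INLINE_MAX 256

    struct embedded_inode {
        uint32_t ino_user;          /* user-visible inode number (stable) */
        uint32_t mode, uid, gid;
        uint64_t size;
        uint64_t mtime;
    };

    struct fat_dirent {
        uint16_t rec_len;           /* total on-disk length of this entry */
        uint8_t  name_len;
        uint16_t inline_len;        /* 0 once the data has been spilled out */
        struct embedded_inode inode;
        char     name[255];
        uint8_t  inline_data[INLINE_MAX];
    };

    /* Pack data inline only at create time; once a file grows past the
     * threshold it moves out to regular blocks and never moves back. */
    static int dirent_create_inline(struct fat_dirent *de,
                                    const void *data, size_t len)
    {
        if (len > INLINE_MAX)
            return -1;              /* too big: use ordinary data blocks */
        memcpy(de->inline_data, data, len);
        de->inline_len = (uint16_t)len;
        de->inode.size = len;
        return 0;
    }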

The performance implications of including inodes and/or file data in-line with the directory entry are complex. One fundamental file system performance trade-off is the performance of readdir() versus stat() versus cat on a large number of files. If all you are doing is listing all the entries in a directory, you want that information to be packed as tightly as possible, so you don't have to read a bunch of extraneous data off the disk. However, if you commonly read the directory entry and then immediately stat the file, reading the inode, then it would make sense to include the inode in-line with the directory entry, even though it slows down the performance of the pure readdir workload. The same argument goes for a stat followed by a read of the file data - including some amount of file data in-line can speed up the stat followed by cat workload, at the expense of slowing down the pure stat workload.

One major point is that the readdir-only workload is actually extremely uncommon. You may think you are only doing a readdir when you type "ls", but most systems ship with color ls enabled, which automatically checks file permissions and the like in order to visually distinguish different kinds of files, like executables and hard links. From a performance standpoint, including the inode in-line with the directory entry looks like a sensible trade-off.

The second trade-off, which you can call stat-vs.-cat, is less obvious. How often do we stat files and how often do we cat files? While working on ZFS, we did some performance measurements and found that increasing the size of inodes enough to include a few hundred bytes of data severely impacted the performance of the stat workload, because it decreased the density of the data we cared about and required us to read many times more data off disk than necessary. In the end, the most efficient size of inodes depends on the file system workload, and will have to be carefully researched.

One suggestion that came out of the in-line inode discussion is adding flags to the stat() system call to request only a subset of the data usually returned in the stat structure. This could allow significant performance optimizations, since calculating some of the values in the stat structure can be expensive.
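To make the idea concrete, here is a hypothetical "stat with a field mask" interface, emulated in user space on top of the ordinary stat() call; no such system call exists and the flag names are invented - in a real implementation the mask would be passed down to the file system so it could skip computing the expensive fields:

    /* Hypothetical interface sketch: request only some stat fields. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    #define WANT_MODE  0x01
    #define WANT_SIZE  0x02
    #define WANT_NLINK 0x04

    struct partial_stat {
        uint32_t returned;          /* which WANT_* fields are actually valid */
        mode_t   mode;
        off_t    size;
        nlink_t  nlink;
    };

    /* Emulation only: a real file system would use `want` to avoid the
     * expensive work, rather than computing everything and discarding it. */
    static int stat_flags(const char *path, uint32_t want,
                          struct partial_stat *out)
    {
        struct stat st;

        if (stat(path, &st) != 0)
            return -1;
        memset(out, 0, sizeof(*out));
        if (want & WANT_MODE)  { out->mode  = st.st_mode;  out->returned |= WANT_MODE; }
        if (want & WANT_SIZE)  { out->size  = st.st_size;  out->returned |= WANT_SIZE; }
        if (want & WANT_NLINK) { out->nlink = st.st_nlink; out->returned |= WANT_NLINK; }
        return 0;
    }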

Allocation

We discussed several techniques for improving allocation of file data on disk. A naive allocation implementation allocates data block by block as it is written, with no knowledge about how large the file will eventually be. This worsens disk fragmentation by mixing large files with small files, and by making it harder to reserve large contiguous ranges of blocks for larger files. What we need is some way to help predict how large a file will be before we begin allocating blocks.

The simplest technique is delayed allocation - waiting some time after a write occurs to see if more data is written. Another solution, used in BSD's implementation of FFS for several years now, is block reallocation. When a file is written to, the entire file may be moved, including blocks previously allocated elsewhere on disk, to a better location. Another technique is the POSIX fallocate() function, which tells the file system to allocate N bytes of disk space for this file, which is useful for programs like tar and package managers which know exactly how large a file will be. One of the criticisms of fallocate() is that it is not as useful for programs that are only guessing at the total length of the file. If the program allocates more space than it needs, it must explicitly truncate the file. A flag to fallocate() that did this automatically would be convenient.
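As a small example of the exact-size case, a tar-like program that knows the final file size can preallocate the space up front with posix_fallocate() before writing the data:

    /* Preallocate space for a file whose final size is known in advance. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <file> <size-in-bytes>\n", argv[0]);
            return 1;
        }

        off_t size = (off_t)strtoll(argv[2], NULL, 10);
        int fd = open(argv[1], O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Tell the file system how big the file will be, so it can pick a
         * contiguous range of blocks up front. */
        int err = posix_fallocate(fd, 0, size);
        if (err != 0) {
            fprintf(stderr, "posix_fallocate: error %d\n", err);
            return 1;
        }

        /* ... write exactly `size` bytes of real data here ... */
        close(fd);
        return 0;
    }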

Much of the discussion centered around getting hints from the application about how files will be created and used. One point of view is that applications are already giving us hints about file size, permissions, and other attributes: they are called "file names." If a file is named "*.log", it's a pretty fair guess it will start out small, grow slowly, become very large, and be append-only. If a file has the string "lock" in its name, it is likely to be zero length, frequently stat'd, and deleted in the future. If it is named "*.mp3", it is probably going to be written once, read sequentially many times, and several megabytes long. Other file attributes, such as permissions, give valuable hints for predicting the future of a file. Daniel Ellard, et al., wrote several papers on this topic, The Utility of File Names [PDF] and Attribute-based Prediction of File Properties [PDF]. The second paper describes a system which automatically deduces correlations between file name patterns and file attributes and future file activity, given file system activity traces, so you don't have to manually update your file name-based optimizations when, for example, a new audio file format becomes popular.
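A toy version of such name-based hints might look like the following; the patterns and hint names are made up here for illustration, whereas the cited papers derive the rules automatically from traces:

    /* Map a file name to an expected access pattern (illustrative only). */
    #include <string.h>

    enum alloc_hint {
        HINT_NONE,
        HINT_APPEND_GROWS_LARGE,    /* e.g. log files */
        HINT_ZERO_LEN_SHORT_LIVED,  /* e.g. lock files */
        HINT_WRITE_ONCE_LARGE,      /* e.g. media files */
    };

    static enum alloc_hint hint_for_name(const char *name)
    {
        const char *dot = strrchr(name, '.');

        if (dot && strcmp(dot, ".log") == 0)
            return HINT_APPEND_GROWS_LARGE;
        if (strstr(name, "lock"))
            return HINT_ZERO_LEN_SHORT_LIVED;
        if (dot && (strcmp(dot, ".mp3") == 0 || strcmp(dot, ".ogg") == 0))
            return HINT_WRITE_ONCE_LARGE;
        return HINT_NONE;
    }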

Many of these allocation policies could be implemented at the VFS layer and optionally used by file systems, rather than being implemented in individual file systems. Many people would like to see the ext3 reservation code, written by Mingming Cao, available in some kind of portable form for use by other file systems.

One change in our assumptions that affects allocation is the fact that we have to do scrubbing in order to find and repair errors before they become unrecoverable. Since we are required to pass over the file system anyway, we may be able to do other things while we are there, such as continuously defragmenting the file system. One technique that could significantly lessen the cost of scrubbing and defragmentation is called freeblock scheduling. The idea is that while the disk is servicing important "foreground" requests, the read head is frequently idle during seeks or while waiting for the disk to rotate to the desired data. If we have a queue of outstanding "background" requests, we can read those blocks as we happen to pass over them while servicing foreground requests, causing little or no degradation in the performance of the foreground requests. One drawback of this system is that it works best with detailed low-level knowledge of the disk hardware; however, the ideas are still useful for reducing the cost of background disk maintenance tasks.
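The selection rule at the heart of freeblock scheduling can be sketched roughly as follows, ignoring the real drive geometry that a serious implementation would need; the structures are invented for illustration:

    /* Opportunistically service a queued background read that lies between
     * the current head position and the next foreground target. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct bg_request {
        uint64_t block;
        bool     done;
    };

    static struct bg_request bg_queue[128];
    static int bg_count;

    static struct bg_request *pick_freeblock(uint64_t head, uint64_t target)
    {
        uint64_t lo = head < target ? head : target;
        uint64_t hi = head < target ? target : head;

        for (int i = 0; i < bg_count; i++)
            if (!bg_queue[i].done &&
                bg_queue[i].block >= lo && bg_queue[i].block <= hi)
                return &bg_queue[i];
        return NULL;                /* nothing on the way: just do the seek */
    }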

Near-term needs for Linux file systems

We devoted part of the afternoon of day two of the workshop to talking about features Linux file systems need in the next year. High on the list was any method of mitigating the fact that file system repair is extremely slow. Reducing the amount of data that needs to be checked by marking some parts of the metadata as unused or clean would help, as in the approach of progressively marking ext2/3 inodes in use as needed. Many implementations of fsck are highly optimized, but there may be more room for improvement. Simply having an estimate of how long file system repair will take lets administrators decide whether to attempt repair or simply restore from backup.

Another short-term approach is to strategically add checksums and error handling to existing file systems. One example is the IRON ext3 work [PDF], done at the University of Wisconsin-Madison. This led to a discussion of the difficulty of transferring code written by researchers into the mainline open source development tree. One suggestion was to get researchers to focus on producing data that can be used by mainline developers to produce code, since getting code accepted is generally irrelevant to the average file system researcher's career.

Next on the list was forced unmount support, the ability to forcibly unmount a file system, even if it has open files. This is a feature of several existing file systems and is useful for unmounting file systems without shutting down the entire system. A large part of forced unmount can be implemented at the VFS layer (e.g., replacing the VFS ops of open files with a set of VFS ops that always return EIO), but specific file systems will still need some testing and work to be able to unmount safely in the face of errors. Forced unmount can probably be implemented with a few months of work and be available for production use within a year. This feature must go hand-in-hand with better testing with I/O errors; otherwise our effort to isolate file system errors will not be particularly successful.
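A simplified user-space sketch of that ops-swapping trick follows: open files hold a pointer to an operations table, and a forced unmount points them at a table whose methods all return EIO. The structures below are stand-ins for the kernel's file_operations, not actual kernel code:

    /* Model of "replace the ops of open files with ops that return EIO". */
    #include <errno.h>
    #include <stddef.h>
    #include <sys/types.h>

    struct open_file;

    struct file_ops {
        ssize_t (*read)(struct open_file *, void *, size_t);
        ssize_t (*write)(struct open_file *, const void *, size_t);
    };

    struct open_file {
        const struct file_ops *ops;
        /* ... per-file state ... */
    };

    static ssize_t dead_read(struct open_file *f, void *buf, size_t n)
    {
        (void)f; (void)buf; (void)n;
        return -EIO;
    }

    static ssize_t dead_write(struct open_file *f, const void *buf, size_t n)
    {
        (void)f; (void)buf; (void)n;
        return -EIO;
    }

    static const struct file_ops dead_ops = { dead_read, dead_write };

    /* On forced unmount, every open file on the victim file system gets
     * pointed at the dead ops table; subsequent I/O fails cleanly. */
    static void force_umount_file(struct open_file *f)
    {
        f->ops = &dead_ops;
    }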

Another issue is the fact that file system performance is usually only tested on fairly young, unfragmented file systems. The file systems development community should work with researchers on better ways of creating aged file systems quickly. One idea was to insert plenty of sync() calls during accelerated replay of file system traces, so that the file system can't use smart temporal allocation policies that would be unavailable in a truly aged file system. The latest research on replaying file system traces is by Ningning Zhu, et alia, in their FAST '05 paper, Scalable and Accurate Trace Replay for File Server Evaluation [PDF]. A good paper on aging file systems is File System Aging - Increasing the Relevance of File System Benchmarks [PDF] by Keith Smith, et alia.
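A skeleton of the sync-heavy replay loop might look like this, assuming a hypothetical replay_op() helper that applies one trace record (create, write, delete, and so on) to the file system under test:

    /* Accelerated trace replay with periodic forced syncs, so delayed
     * allocation cannot quietly cancel out work that hit the disk in the
     * original trace. */
    #include <unistd.h>

    struct trace_op;                            /* one record from the trace */
    int replay_op(const struct trace_op *op);   /* hypothetical: apply it */

    #define SYNC_EVERY 64

    void replay_trace(const struct trace_op *ops, int nops)
    {
        for (int i = 0; i < nops; i++) {
            replay_op(&ops[i]);
            if ((i + 1) % SYNC_EVERY == 0)
                sync();             /* force delayed work out to disk */
        }
    }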

One of the difficulties with multiple file systems on a system is the amount of memory used for file system journals, since it is usually allocated on a per-file system basis without taking into account global memory usage for file system data. Some method of coordinating memory use might be helpful. In general, file systems developers tend to optimize for file system performance without fully taking into account overall system performance.

Day Three: Plans

The last day of the workshop was devoted to planning the next steps for making the next generation of Linux file systems a reality - in other words, we were assigning work. It was also Saturday morning after two grueling days and nights of non-stop file system discussion. Miraculously, nearly all the workshop participants returned for the final day.

First on our to-do list is convincing the Linux community, including companies, to work together on developing new file systems. We need to educate the community about the coming fsck time crunch and the necessity to start developing new file systems now, since file systems generally take about 5 years to go from concept to production use. Workshop participants are working on slide decks, collecting requirements, and writing articles, in the hopes of influencing the community and companies to devote resources to Linux file system development.

We need to continue working on existing file systems such as ext2/3, reiserfs, and XFS so that they can bridge the gap while new file systems are in development. We are organizing meetings between hardware vendors, researchers, and file systems developers to share data and ideas. A follow-up Linux file systems workshop is tentatively planned to be co-located with the 5th USENIX FAST conference in San Jose, CA in February 2007. We have nothing formal planned for Ottawa Linux Symposium, but you might want to attend Val Henson's talk on reducing fsck time on ext2, as it will cover other topics in Linux file systems.

When it comes to code, we have several ideas to prototype:

  • Forced unmount support
  • Variable inode size and effect on stat vs. cat workloads
  • Progressive inode table initialization in ext3
  • Chunkfs prototype using existing file systems
  • Doublefs prototype
  • Inode and file data in the directory entry
  • Freeblock scheduling
  • Adding flags to stat() to request a subset of the stat data
If you or your organization is interested in working on any of these ideas or getting involved in file systems development in any other way, email the Linux fsdevel list.

The formal workshop was over at noon, but several hardy souls accepted an invitation to attend the Linux Beer Summit, organized by Kristen Carlson Accardi (docking station maintainer) and sponsored by Google. File system discussions continued well into the evening over several excellent home brewed beers, and Inaky Perez-Gonzalez (Ultra-wide-band maintainer) supplied exciting low-speed tractor rides for all.




The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 8:21 UTC (Thu) by dvrabel (subscriber, #9500) [Link] (1 responses)

"Her initial idea was to mark individual block groups as dirty or clean, depending on whether it was being modified, and then modify fsck to not bother checking dirty block groups,"

Shouldn't this read "...not bother checking clean block groups"?

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 22:52 UTC (Thu) by vaurora (guest, #38407) [Link]

Good catch, yes - not bother checking CLEAN block groups.

Interesting work - and some ideas for the future

Posted Jul 6, 2006 9:06 UTC (Thu) by ayeomans (guest, #1848) [Link] (8 responses)

Thanks for the great report on the workshop. It's great to see that innovation is still continuing.
Something struck me, though, in that the thrust is towards performance and reliability improvements at the data level. I'm also interested in work at higher levels, to add more features for the user.
Let me give some examples:
  • Versioning file system - ages ago, systems such as RSX-11M and VMS had versioning filesystems that kept incremental copies of data. Now that disk space is much cheaper, it makes increased sense to do this again. Even if systems were damaged by that mythical Linux virus which tried to overwrite files, a versioning filesystem would provide protection.
  • Note this can be done on a file block basis - no need to copy all data in a file, just make a copy of the allocation map. This is pretty close to some of the current journalled filesystems, and should be possible to be combined by providing the user with access to earlier file versions.
  • Synchronisation-friendly filesystem. Mobile devices increase the demand for synchronised portions of filesystems. By maintaining a "sync point map", i.e. a list of files/blocks modified after a nominated sync point, it becomes a fast process to identify portions that need copying. Again, this is pretty close to current JFS facilities.
  • De-duplicating filesystem. More useful at enterprise level or for multi-computer backups. Identify duplicates of files - probably by a crypto checksum calculated during or just after writing. Then only store one logical copy of any file. Makes backups faster and reduces disk space requirements. (Think of the number of operating system files held in common across computer networks. Or the mass-mailed .ppt^h^h^h.odp presentation files.)
  • Enhanced metadata that gets preserved with files. Including some kind of data origin (e.g. internet download from url), also classification. So that backup and security decisions can be made automatically, including auto-encryption of confidential files when transferring out the system.
  • Auto-zip/unzip views - allowing a collection of files to appear like a single zip file, or vice-versa. Currently done at presentation layers, but allowing the filesystem to handle it can make the facilities available to all apps. Even the enhanced metadata mentioned above could be handled as if it were a component of a zip file. Probably a cleaner way to present resource forks and alternate data streams.

Interesting work - and some ideas for the future

Posted Jul 6, 2006 9:55 UTC (Thu) by nix (subscriber, #2304) [Link] (2 responses)

I have a number of ideas on the versioning filesystem front which are congealing into a design. Eventually people will stop giving me new ideas faster than I can figure out how to use them :) the tree-of-blocks-CoW stuff is a critical part of it, of course.

The block allocation stuff isn't the hard part (we can start with something naive and get more complex later on): the optimizers are the hard part, because we *really* want to share as much data as possible, not least because my chosen semantics for versioning involve a new version on every close() of an open-for-writing file and on every update of a directory. This also means we need automatic expiry.

(It's also versioning both by name and by inum --- essential if it's to be useful with most text editors --- which interacts interestingly with hard links and the permissions system.)

(Because it's very much a prototype and because the data structures are not simple I'm initially prototyping the thing inside PostgreSQL, with access via FUSE. Figuring out how to turn the whole thing into a more conventional filesystem can wait. What's that about typical impractical researcher's mentality? OK, OK, I admit it ;) )

Interesting work - and some ideas for the future

Posted Jul 13, 2006 19:36 UTC (Thu) by ringerc (subscriber, #3071) [Link] (1 responses)

Versioning FSs get problematic when you consider that they're most useful in areas like network file servers and home directories. Lots of programs like to drop large amounts of crap in these places, much of which should not be versioned. Consider program scratch files, thumbnail DBs, etc. Identifying these files and avoiding versioning them would be extremely useful.

Ageing out versions sounds like a good idea. You might also want to look into a winnowing process where fewer versions are kept further back in time, rather than just using a strict version count or time limit. I refer to something akin to the way round robin databases work - you keep one file from a month ago, one for each prior week, one every day for this week, and one every hour (assuming of course that the file was actually modified in each period). This helps reduce the "damn, I save every five minutes so my versions only go back an hour" problem.

Being able to mount a read-only view of the FS frozen at a point in time would be an incredibly nice interface to the versions that wouldn't require any special tools. User says "Yeah, I deleted it some time today" ... so I just:
mount -o version_date=`date -d 'yesterday' +%Y-%m-%d-%H-%M-%S` /path/to/device /path/to/samba/share/yesterdays_files

and the user can just access `yesterday's files' with samba. For that matter, imagine a rolling view:

mount -o version_age=24h ....

so the mount shows files as they were 24h ago, updating (roughly, presumably in chunks) with time. Stupid idea? Probably. Useless? Probably. Fixed point-in-time views that were mountable would be amazingly useful though ... like a coarse continual snapshot.

Interesting work - and some ideas for the future

Posted Jul 15, 2006 11:23 UTC (Sat) by nix (subscriber, #2304) [Link]

An avoidance list will be mandatory: probably one global one and one per-directory (non-inherited). We don't want to version vim .swp files, for starters :)

The winnowing idea sounds excellent: I'll incorporate it if I can figure out a user interface! (It sounds like the sort of thing which should be controlled by a minimal cutoff date and some sort of logarithmic or exponential parameter.)

As for the read-only view, well, the current design already has the ability to roll individual files, directories, and trees backwards and forwards in time. It's not stupid at all. :)

Oh, and it's not read-only. If you write to a file (or modify a directory) which is rolled back into the past, you get a branch.

I think I've managed to conform to POSIX everywhere: I'm in the middle of verifying this at the moment. From the point of view of apps which have files open when someone rolls them back, it just looks like someone's done a truncate or a big write... the whole point of this is POSIX conformance with version control layered on top: I don't want to turn out yet *another* versioning system which doesn't support hard links, symlinks, or permissions!

Interesting work - and some ideas for the future

Posted Jul 7, 2006 14:16 UTC (Fri) by dion (guest, #2764) [Link] (4 responses)

The de-duplication feature sounds a lot like hardlinks and that can be done today.

The tricky part is the copy-on-write needed when you start modifying the file.

de-duplication in filesystems

Posted Jul 8, 2006 17:44 UTC (Sat) by giraffedata (guest, #1954) [Link] (3 responses)

The tricky part of de-duplication is identifying the duplicate files.

Users today create multiple copies of files because it's easier than sharing. The idea of de-duplication is that the users maintain that ease, but get the benefits of sharing because the system stores only one copy anyhow.

The copy on write technology is pretty much the same as is used today for snapshot copies. But the identification of duplicate files (or, in some proposals, blocks) is something I have yet to see done with demonstrable gain.

de-duplication in filesystems

Posted Jul 10, 2006 18:58 UTC (Mon) by martinfick (subscriber, #4455) [Link] (2 responses)

Check out the vserver work on vhashify.

de-duplication in filesystems

Posted Jul 15, 2006 11:26 UTC (Sat) by nix (subscriber, #2304) [Link] (1 responses)

That's sort of similar, except I'm trying to work on the block level. The hardest part is arranging to detect cases, where, say, someone has a big text file and inserts one byte at the front of it: the rest should still be detected as a duplicate, even if the original file and the new file are not version-related (in which case detecting the duplicate is feasible), but doing that for arbitrary unrelated files without storing ridiculously many hashes is tricky. (More generally, modifications that are not multiples of a block size should not cause unmodified portions of duplicated files to be un-duplicated.)

de-duplication in filesystems

Posted Jul 22, 2006 3:43 UTC (Sat) by JumpJoe (guest, #39288) [Link]

Not sure what level the deduplication is being done however:
www.datadomain.com

Other companies are doing deduplication above the filesystem layer (CAS)

http://searchstorage.techtarget.com/originalContent/0,289...

Yes, it would be great to have a compression/deduplication built into a filesystem.

ZFS

Posted Jul 6, 2006 9:24 UTC (Thu) by jec (subscriber, #5803) [Link] (7 responses)

If only Sun would donate ZFS as GPL for the Linux kernel...

ZFS

Posted Jul 6, 2006 14:12 UTC (Thu) by vmole (guest, #111) [Link]

Seems unlikely, as it's probably the only Solaris technology of interest to non-enterprise users. (That is: When I get around to building a fileserver for the house, I'm going to be very tempted by Solaris because of ZFS, despite years of built-up dislike.)

ZFS

Posted Jul 7, 2006 1:49 UTC (Fri) by kirkengaard (guest, #15022) [Link]

If only we would stop sighing over Sun's non-offering.

ZFS

Posted Jul 8, 2006 7:59 UTC (Sat) by paulj (subscriber, #341) [Link] (3 responses)

Sun are extremely unlikely to release ZFS under GPLv2.

That said, it is likely that CDDL and GPLv3 code could be combined, judging by the current GPLv3 drafts at least. Just have to wait and see (and persuade Linus to reconsider the GPLv2-only licence).

There are people working on ZFS for non-Solaris systems, iirc there are efforts to do ports to BSD and Linux FUSE.

Relicensing

Posted Jul 13, 2006 19:39 UTC (Thu) by ringerc (subscriber, #3071) [Link] (1 responses)

"and persuade Linus to reconsder the GPLv2 only license"

He can't. He could relicense _his_ contributions under GPLv3, but he can't relicense the whole kernel, as no one person (him or otherwise) holds the copyright, and the license currently states that it is GPLv2 ONLY. Relicensing the whole kernel would require the consent of every contributor (or their estate), which is awfully unlikely to happen, and insanely impractical. How would you find them all? Heck, how would you find out who they all were?

Relicensing

Posted Sep 25, 2006 9:36 UTC (Mon) by forthy (guest, #1525) [Link]

He can. He wrote the GPLv2-only statement as an interpretative comment on the license himself in the 2.4.0-pre versions, without asking anyone, so he can remove it as well. It's his right as a redistributor of GPL code to choose the license, when it is originally under GPLv2-or-later, and it is his right to change his choice. If anyone wants his code in Linux to be GPLv2-only, he has to mark it as such himself. AFAIK, nobody ever did (while quite a lot clarified that they really meant GPLv2-or-later).

Note that Linus does not have the right to change the GPL. All he can do is add comments to it. The GPLv2 is GPLv2-or-later by default, unless explicitly stated otherwise, and you get the license from the original author. However, each redistributor can only be forced to do what he's obliged to under the license version he chose.

ZFS

Posted Jul 19, 2006 12:06 UTC (Wed) by csamuel (✭ supporter ✭, #2624) [Link]

The Google Summer of Code project is sponsoring the port of ZFS to Linux FUSE, the developer already has a ZFS FUSE Wiki and SVN instructions and a blog.

ZFS

Posted Jul 8, 2006 10:48 UTC (Sat) by skissane (subscriber, #38675) [Link]

Even if the licensing conditions are such that ZFS can never go into Linux... the fact that the source code is freely available means that it should be much easier for someone to implement a ZFS-style filesystem of their own, since they have a production-quality working example to refer to... thus, the open sourcing of ZFS makes it more likely that Linux will get ZFS-style technology, even if not by means of ZFS per se....

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 11:45 UTC (Thu) by NAR (subscriber, #1313) [Link] (1 responses)

Another technique is the POSIX fallocate() function, which tells the file system to allocate N bytes of disk space for this file, which is useful for programs like tar and package managers which know exactly how large a file will be.

There is another class of applications that know the file size beforehand - download managers, FTP clients, browsers, P2P tools, etc. that get the file size from the network protocol.

Bye,NAR

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 14:19 UTC (Fri) by dion (guest, #2764) [Link]

... don't forget cp, it usually has a good idea about the filesize.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 15:26 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (14 responses)

The continuation-inode idea is quite cute!!!

One question, though: is there some portion of the failure rate that is a function of the number of filesystems? For example, are there more superblocks and free-block maps to be corrupted?

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 17:45 UTC (Thu) by piman (guest, #8957) [Link] (1 responses)

IANA filesystem developer, but it would seem to me the answer is no. The two causes of disk corruption are bugs (in the driver, kernel, filesystem, etc) or hardware failure (disk or memory). In the case of a disk failure, you lose whatever data was corrupted regardless of whether you use one large filesystem or many small ones. In the case of memory failure or bugs, with many small filesystems you will only corrupt the ones being written to. So the many small filesystems approach offers an advantage here as well.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 20:40 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Two more failure modes: system crash (which loses whatever writes were in flight but not completed) and point-media failures on the disk platter.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 18:51 UTC (Thu) by arjan (subscriber, #36785) [Link] (9 responses)

so far in the analysis we haven't found any reason.

One of the key things is that any of the "key" data for a filesystem (superblock and such) can be duplicated many times, since it's tiny as percentage of the fs, and constant size.

fwiw a document about chunkfs (work in progress) is at
http://www.fenrus.org/chunkfs.txt

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 0:14 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (8 responses)

Great document, interesting stuff!!! A few questions, as always...

(1) In the discussion of hard-linking, it looked to me as though directories with links get replicated in the directory's chunk and in the hard-link destination's chunk. Is this the case, or am I confused (and on second reading, it looks like only the directory entries linking to this chunk get replicated)? If that is the case, is there some sort of mutex that covers all chunks to allow both replicas of the directory to be updated atomically?

(2) Is rename still atomic? In other words, is a single task guaranteed that if it sees the new name, a subsequent lookup won't see the old name and conversely?

(3) Is unlink still atomic?

(4) Does dcache know about the continuation inodes? (Can't see why it would need to, but had to ask...)

(5) Stupid question (just like the others, but this time I am admitting it up front!) -- for a multichunk file, why have the overhead information in the chunks that are fully consumed by their segment of the file? Why not just mark the chunk as being entirely data, and have some notation that indicates that the entire chunk is an extent? And is this enough heresy for one question, or should I try harder next time? ;-)

(6) Is intra-chunk compatibility with ext2/3 a goal?

(7) I am a bit concerned about the following sequence of events: (a) chunk zero is half full, with lots of smallish logfiles. (b) a large file is created, and starts in chunk 0 (perhaps one of the logfiles is wildly expanding or something). (c) the large file fills chunk 0 and expands to chunk 1 with a continuation inode. (d) each and every logfile expands, finds no space, and each thus needs its own continuation inode, violating the assumption that continuation inodes are rare. Or did I miss something here? If I am not too confused, one approach would be to detect large files and make them -only- have continuation inodes, with -no- data stored in a chunk shared with other files. How to detect this? Sorry, no clue!!!

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 6:35 UTC (Fri) by arjan (subscriber, #36785) [Link] (7 responses)

(1) Not so much replicated. If you think of a directory as a file-like linear stream (I know that's too simple, but readdir() makes it so sort of), what you'd do for a hardlink is make a continuation inode for that stream in the chunk that the file resides in, and continue that stream in this chunk, at least for the one dentry of the hardlink. So there is no duplication/replication, it's continuation.

(2) that's no different from what is currently the case

(3) same

(4) No... it's a pure internal thing

(5) that's not a stupid question; the one thing I've not written up is that in principle, each chunk could have its own on-disk format variant. The "entire chunk is one file" variant was already on my list; another one is "lots of small files".

(6) you mean ext2/3 layout within a chunk? Not a goal right now, although the plan is for the prototype to reuse ext2 for this. I don't want to be tied down to exact ext2 format beforehand though.

(7) there is something needed there yes. the entire thing needs a quite good allocation strategy, probably including delayed allocation etc

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 15:26 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (6 responses)

(1) OK, so in a sense, directories split across chunks in exactly the same way that files do, but for different reasons. Files split across chunks because they are large, while directories split across chunks because of the location of the primary inodes (or whatever a non-continuation inode is called) of the files within a given directory. No replication. So one area that will require careful attention would be the performance of reading directories that had been split across chunks.

(2,3) Good to hear that rename and unlink are still atomic! I bet I am not the only one who feels this way. ;-)

(4) Also good to hear!

(5) Having specialized chunks could be a very good thing, though the administrative tooling will have to be -very- good at automatically handling the differences between chunks. Otherwise sysadmins will choke on it.

(6) OK to be incompatible, but my guess is that it will be very important to be able to easily migrate from existing filesystems to chunkfs. One good thing about current huge and cheap disks is that just migrating the data from one area of disk to another is much more palatable than it would have been a few decades ago.

(7) Good point -- I suppose that in the extreme case, a delayed allocation scheme might be able to figure out that the file is large enough to have whole chunks dedicated to its data.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 21:24 UTC (Fri) by arjan (subscriber, #36785) [Link] (5 responses)

for your (1) point... the good news is that hardlinks are relatively rare....

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 9, 2006 20:12 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (4 responses)

Good point on hard links being relatively rare (ditto mv and rename, I would guess). But the split directories they engender would persist. With N chunks, you have only 1/N chance of spontaneous repair, resulting in increasing directory splitting over time, even with low rates of ln, mv, and rename. So my guess is that there would need to be the equivalent of a defragmenter for long-lived filesystems. (A rechunker???)

I suppose that one way to do this would be to hold some chunks aside, and to periodically re-chunk into these chunks, sort of like generational garbage collection.

But perhaps see how it does in tests and benchmarks?

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 11, 2006 23:13 UTC (Tue) by dlang (guest, #313) [Link] (3 responses)

this is an area where changing the basic chunk size could have a huge effect.

the discussion was to split the disk into 1G chunks, but that can result in a LOT of chunks (3000+ on my home server :-). changing the chunk size can drastically reduce the number of chunks needed, and therefore the number of potential places for the directories to get copied.

In addition, it would be possible to make a reasonably sized chunk that holds the beginning of every file (say the first few K, potentially with a blocksize less than 4K) and have all directories exist on that chunk, then only files that are larger would exist in the other chunks (this would also do wonders for things like updatedb that want to scan all files)

this master chunk would be absolutely critical, so it would need to be backed up or mirrored (but it's easy enough to make the first chunk be on a raid, even if it's just mirroring to another spot on the same drive)

this sounds like something that will spawn endless variations and tinkering once the basic capabilities are in place.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 13, 2006 18:15 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Is there an influence on concurrency? Can operations local to a chunk proceed entirely independently of other chunks? Per-chunk log for local operation, with global log for cross-chunk operations? Now -that- should be trivial to implement, right??? Just a small matter of software... ;-)

But, yes, 3,000 chunks does seem a bit of a pain to manage -- at least unless there is some way of automatically taking care of them. But do the chunks really need to be the same size? Could the filesystem allocate and free them, sort of like super-extents that contain metadata?

If we explore all the options, will it get done before 2020? Even if we don't explore all the options, will it get done before 2020? ;-)

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 23, 2006 19:12 UTC (Sun) by rapsys (guest, #39313) [Link] (1 responses)

I was wondering if you might add the following features to your scheme:
- mark every fifth chunk as an md5 control sum (a sort of RAID-5 over chunks)
- blacklist chunks that produce one or two errors (because, for example, they have a weak magnetic surface)

I think these two ideas are interesting, since I have read articles about independent read/write heads in future hard disks to improve concurrent reads and writes. I read, three or four years ago, about video hard disk recorders for TV (though HDTV and DRM may have killed that improvement?).

The point is that if we can have three to five heads on a hard disk, why not reserve one head's surface for backup control sums? (This would not kill performance if the heads are independent, since disk bandwidth is far greater than that of the mechanical parts.)

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 23, 2006 19:40 UTC (Sun) by rapsys (guest, #39313) [Link]

Hmm, I forgot something.

The interest of the previous idea is that if your data has been physically corrupted (an overpowered write, etc.) in another chunk, the (group-of-)block checksum will be wrong. If the control-sum chunk is available and valid (see below), the kernel issues an EAGAIN/ERESTORING to the application and regenerates the data from the control-sum chunk. That would let you avoid losing a movie or music file just because one stupid block in the middle has been corrupted :'(

The more interesting thing is that the hard disk will remap the problematic magnetic area somewhere else, because you rewrite to the same place and it is marked as bad by the drive. (That is an assumption that needs to hold every time and that drive manufacturers NEED to respect, perhaps after a few write/read test cycles scheduled by the firmware on that spot.)

The only problem I see is if an overpowered write crosses the chunk boundary. The control sum would then be corrupted and could mislead the kernel in a birthday-paradox case.

So we will also need (group-of-)block checksums for the control-sum chunk itself. (Will it fit in the space implied by the equation x(data) + 1(control sums)?)

I made a few assumptions:
- reads/writes to that chunk are not too costly (independent heads, etc.)
- control sums are not a CPU killer (maybe the drive has a special feature to do that job)
- control sums of chunks are not a CPU killer either (though we do increase the CPU time consumed for each write to disk)

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 6, 2006 23:11 UTC (Thu) by vaurora (guest, #38407) [Link] (1 responses)

To expand on Arjan's reply, it's not obvious that it would increase the probability of any failure. Sure, there are a larger number of individual bitmaps, but in terms of bits on disk, they are still the same size and shouldn't have an increased likelihood of suffering an I/O error. More superblocks is interesting because they are fixed size per file system; on the other hand, most modern file systems already heavily replicate the superblock. What does seem to be true is that this scheme will limit the effect of any individual failure, as long as we are smart about handling loss of path components.

We definitely appreciate criticism, as we would like to figure out (possibly fatal) errors BEFORE implementing anything. So if you have any more ideas about how this will fail, let us know and hopefully we can figure something out.

The 2006 Linux Filesystems Workshop (Part III)

Posted Jul 7, 2006 0:20 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

My original question came from considering a large file spread across multiple chunks, so that loss of any of these chunks loses part of the file. So any fixed probability of chunk loss adds up (approximately, anyway). On the freelist, I agree with you, and it does seem that you can be more aggressive about replicating superblocks to reduce the probability of superblock loss (but thereby slowing superblock updates).

Idle question, but I couldn't resist asking. ;-)

Obtaining real-world aged file systems

Posted Jul 6, 2006 19:38 UTC (Thu) by wilck (guest, #29844) [Link] (7 responses)

This is actually an idea of a colleague of mine.

Wrt the problem of obtaining real-world aged file systems, we wondered if it would be worthwhile to create a "file system content scrambler" - a tool that would duplicate a file system exactly 1:1, but overwrite all real content - data, file and directory names - with artificial or random data.

The cloned file system would have the same properties as the original real-world file system without revealing possibly sensitive information to the person analyzing or doing benchmarks.

Such a tool could be used to gather file systems that would exhibit real-world fragmentation etc.

Perhaps not an idea for the truly paranoid...

Obtaining real-world aged file systems

Posted Jul 7, 2006 2:46 UTC (Fri) by dlang (guest, #313) [Link] (1 responses)

the problem with aged file systems is that you may have a file system (say ext2) that has been in use for years, with lots of stuff added at different times.

now you have a new filesystem (newfs) that you want to test.

if you just copy everything from the old filesystem to the new one you end up with an optimally laid out newfs filesystem, since all the files were created at one time, usually with each file being created in one operation (no fragmentation). The resulting performance is vastly different than if the same contents had been put there the same way they were put on the original ext2fs filesystem.

so the next step is to not record the filesystem, but record the operations on the filesystem (each write, delete, create, etc.) and replay those against the new filesystem.

but since modern filesystems delay allocations for several seconds when something is done to them, the replays end up with many of the operations canceling out in memory (never hitting disk), and so the result still doesn't match.

the work-around for this is to do lots of sync's to force the filesystem to actually perform the writes to disk instead of short-circuiting them.

Obtaining real-world aged file systems

Posted Jul 7, 2006 2:53 UTC (Fri) by dlang (guest, #313) [Link]

One thought that hit me just after posting the last comment,

now that there is the new timer base in the kernel, and we are nearing usability with the 'tickless' patches, how about setting up a special kernel with a custom timesource that's driven by the process writing the aged filesystem?

it knows the timestamp for all the original filesystem actions, and it can find out all outstanding timers. After each filesystem action, have it set a timer for the time of the next filesystem action and then use the tickless capabilities to advance time up to that point (stepping through each of the other timers on the way). This way the modified kernel and its filesystem code would think that the replay took place in the same real time as the original actions being replayed, and any delayed actions will take place appropriately (benefiting things if the delay would have helped in the original, but happening between actions if it wouldn't have)

as long as the kernel isn't busy doing other stuff at the same time, this replay should be very fast, and it avoids the haphazard benefits of trying to insert lots of sync calls.

David Lang

Filesystems aged like sharp cheese

Posted Jul 7, 2006 15:27 UTC (Fri) by Max.Hyre (subscriber, #1054) [Link] (3 responses)

[Darn it, someone else comes up with the same idea while I'm writing my comment. I'll post this anyway, in hopes it's got some useful additions.]
Another issue is the fact that file system performance is usually only tested on fairly young, unfragmented file systems. The file systems development community should work with researchers on better ways of creating aged file systems quickly.
Seems to me the best way to create aged filesystems quickly is to have them already prepared, waiting to be used. Ask people to send in backups of live old systems, and create a library of them---you need a few, go pick them up from the filesystem store.

Obviously this has two drawbacks: it's a bit much to have a decent number of examples when each one is multiple terabytes, and people would probably object to passing their data out for world-wide use.

We could address both by having the donors use a program which saves only the metadata including its position on disk. Remove all the file contents, rename the files to something original like `1', `2', ..., and the result should be both a good bit smaller and free of private information. The restore program (not a standard restore, which arranges things neatly as it goes) would put the metadata in the same location on the new disk, thus giving the same fragmentation, ordering of filenames in directories, &c.

Of course some organizations [say, the NSA] might consider even file sizes and directory structures, to say nothing of timestamps, too sensitive to release. Think traffic analysis. Actually, timestamps could probably be omitted, too.

In fact, the dehydrated FS need only contain that metadata needed for the testing, and everything else omitted, to be faked at reconstitution time.

Need a file system? Pull one off the shelf, reconstitute by doing a special restore which writes all data as nulls (or doesn't even write it---just take whatever happens to be lying around in the data blocks), generate new values for incidental metadata, such as timestamps, and voila! You're ready to go.

The skeletal systems could even be classified by characteristics: need to test against a ton of small files? Use one of these. Want ugly fragmentation? Use one of those.

Of course, there would need to be a large set to draw from, to avoid optimizing FSes for a fixed, though realistic, set of instances. New old ones would be continually solicited to avoid ossification. On the other hand, they could be used like PRNGs: when you want to test different implementations against the same data, you've got it.

Or has this already been put into place, and I just haven't noticed?

/And/ another thing...

Posted Jul 7, 2006 15:40 UTC (Fri) by Max.Hyre (subscriber, #1054) [Link] (1 responses)

It just occurred to me that renaming files to numbers could mess things up if the directory structure depends on the filename lengths. So the new name would have to be the same length as the original (leading zeroes?), which means trouble when the original name is shorter than the new name....

The fix is left as an exercise for the reader. :-)

/And/ another thing...

Posted Jul 15, 2006 11:41 UTC (Sat) by nix (subscriber, #2304) [Link]

The rename might well need to come up with something which *hashes* to the same value as the original. Good luck making *that* work for more than one file a year. :)

Filesystems aged like sharp cheese

Posted Jul 20, 2006 23:49 UTC (Thu) by efexis (guest, #26355) [Link]

The problem is that when you want to test a new feature, such as a new algorithm for deciding where to place new blocks on the disk, the filesystem has to be created using this code. Grabbing an old filesystem that wasn't created using this code is completely useless.

You need to replay all the actions, all the file creates, writes, moves, deletes etc, in an order they would actually happen, to see the result.

Obtaining real-world aged file systems

Posted Jul 7, 2006 20:26 UTC (Fri) by pimlott (guest, #1535) [Link]

Isn't this what e2image does?

Other error rates that need to be looked at.

Posted Jul 6, 2006 21:03 UTC (Thu) by smoogen (subscriber, #97) [Link] (4 responses)

Actually, I was wondering whether another issue a filesystem might need to look at is the integrity of the other data on the disk. The FSuCK checks and fixes the integrity of the metadata on a disk, but does it check the integrity of the data inside a file? And should a filesystem worry about this, given the growing percentage of 'noise' that is showing up in the filesystem?

[A silly idea comes to mind: break the filesystem into several virtual layers. The lowest layer sort of RAIDs the reads/writes across the disk (versus across disks). The next layer deals with the standard filesystem actions, and the top layer interfaces with VFS. This might have some value at a time when users have 2+ TB of disk space in their home computers, are storing their collections of movies and so on, and just want to make sure a video doesn't crash out because some bits errored out after a while.]

Other error rates that need to be looked at.

Posted Jul 6, 2006 23:01 UTC (Thu) by vaurora (guest, #38407) [Link] (3 responses)

The integrity of data can be checked as well, if we have checksums for it. There are a lot of options for doing this. I like the following two best:

- Checksum in the indirect block pointing to the data block
- Optional per-file checksum updated on close, invalidated by a write or writable mmap() - gets gross with large write-mostly files like logs
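
As a sketch of the first option (the structures and the crc32()/read_block() helpers below are placeholders, not any real filesystem's format), each entry in the indirect block carries a checksum alongside the block pointer, and reads verify it before returning data:

    /*
     * Placeholder structures, not any real filesystem's on-disk format:
     * each indirect-block entry pairs a block pointer with the checksum
     * computed when the block was written, and reads verify it.
     */
    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK_SIZE 4096

    struct indirect_entry {
        uint64_t block;     /* physical block number of the data block */
        uint32_t csum;      /* checksum recorded at write time */
    };

    /* Assumed helpers, not shown here. */
    extern uint32_t crc32(const void *buf, size_t len);
    extern int read_block(uint64_t block, void *buf);

    /* Returns 0 on success, -1 on I/O error or checksum mismatch. */
    int read_checked(const struct indirect_entry *e, void *buf)
    {
        if (read_block(e->block, buf))
            return -1;
        if (crc32(buf, BLOCK_SIZE) != e->csum)
            return -1;      /* corruption detected: try a mirror, report, ... */
        return 0;
    }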

As for your ideas about layers, check out the architecture of ZFS:

http://www.opensolaris.org/os/community/zfs/source/

At the workshop, we talked a little bit about creating a more useful volume manager interface (i.e., not one that looks exactly like a dumb disk). We also talked about working with manufacturers to improve the interfaces to dumb disks in ways that file systems find useful.

Other error rates that need to be looked at.

Posted Jul 8, 2006 11:27 UTC (Sat) by job (guest, #670) [Link]

I am intrigued by the ZFS design. What are the pros and cons of moving DM-type functionality into the FS layer?

Other error rates that need to be looked at.

Posted Jul 8, 2006 17:51 UTC (Sat) by giraffedata (guest, #1954) [Link] (1 responses)

Several times in the article and comments, I've seen the implication that FSCK corrects errors -- at least metadata errors -- in a filesystem. But FSCK as I understand it hardly does that at all. It corrects inconsistencies, and sometimes it corrects an inconsistency by reconstructing a lost piece of metadata from redundant information.

But there isn't that much redundancy in e.g. ext2, is there? If a disk error or system crash causes a filesystem to lose a file's inode, FSCK "corrects" that error by deleting all the file's blocks too, right?

As a consistency restorer, there's nothing FSCK can do with file data -- from the filesystem perspective, the data is always consistent no matter how much it gets corrupted.

Other error rates that need to be looked at.

Posted Sep 5, 2006 21:48 UTC (Tue) by anton (subscriber, #25547) [Link]

"from the filesystem perspective, the data is always consistent no matter how much it gets corrupted."
Yes, many file system designers only care about metadata consistency. Note how doublefs is suggested as a replacement for copy-on-write filesystems, even though doublefs only duplicates the metadata; with update-in-place for data, I don't really see an efficient way to guarantee that the data is consistent.

common loads; RAM cache and I/O `nice'; freeblock scheduling

Posted Jul 13, 2006 18:49 UTC (Thu) by ringerc (subscriber, #3071) [Link]

Thinking about the loads I see on the servers at work, I have several quite distinct patterns:

- Small random I/O done on small files, mildly biased toward reads. Continuous. (e.g. Cyrus mail spools: maildir-like one-file-per-message storage, plus indexing and header caches).
- Archival storage of large and medium files (images, user documents, etc). Strong read bias, low load. Data is _generally_ added and not further modified.
- Working storage for users - the generic "file server" case of small & medium files, some of which sit around forever while others are quite hot. This includes user home directories.
- System storage (libraries, executables, etc). Insanely strong bias toward reads; needs more priority for RAM cache than it gets. It's INCREDIBLY annoying having libraries and binaries shoved out of cache because `tar' is reading some file for a backup that'll never be looked at again.

Additionally, all this gets backed up.

One thing I'd love to see - I cannot possibly emphasise this enough - is a sort of `nice' facility for disk I/O. Backups are the most obvious use case - backing up a live server is a miserable bastard of a job, as the whole server pigs while the backup runs. Not all servers have load patterns with dead times where one can afford this, and with disk snapshots (like LVM provides) there shouldn't be any need to stop services to back up the system. The ability for tools like `tar' and `cp' to inform the OS and FS that:
- The data they're working with is no more likely to be read/written again in the near future than any other data on the disk, so it should not be cached in RAM at the expense of anything else; and
- Other requests should have priority over requests by this program

would be INSANELY valuable.
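
Much of this can be approximated with interfaces Linux already has. As a rough illustration (not something proposed at the workshop), a backup-style copy loop could drop itself into the idle I/O class with ioprio_set() and release the page cache behind itself with posix_fadvise():

    /*
     * A rough sketch with existing interfaces (not a workshop proposal):
     * ioprio_set() drops this process into the idle I/O class so that other
     * requests win, and posix_fadvise(POSIX_FADV_DONTNEED) releases the page
     * cache behind the copy so hot data isn't pushed out.  The IOPRIO_*
     * values below mirror the kernel's include/linux/ioprio.h; glibc has no
     * wrapper for ioprio_set(), hence the raw syscall().
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define IOPRIO_WHO_PROCESS  1
    #define IOPRIO_CLASS_IDLE   3
    #define IOPRIO_CLASS_SHIFT  13

    static void copy_politely(int in_fd, int out_fd)
    {
        char buf[64 * 1024];
        off_t done = 0;
        ssize_t n;

        /* Only service this process's I/O when the disk is otherwise idle. */
        syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0 /* this process */,
                IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT);

        while ((n = read(in_fd, buf, sizeof buf)) > 0) {
            if (write(out_fd, buf, (size_t)n) != n)
                break;
            /* We'll never reread this range: let the kernel drop it. */
            posix_fadvise(in_fd, done, n, POSIX_FADV_DONTNEED);
            done += n;
        }
    }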

[WARNING: The ideas of someone totally uninformed about a subject but too dumb to keep his mouth shut follow. I'm mentioning them just on the off chance one is useful _because_ I'm not used to the thought patterns of the field.]

The freeblock scheduling tricks mentioned sound like something that might benefit from extensions to the disk interface. Consider NCQ disks - the disk can be given a queue of requests to service in the optimal order. Wouldn't it be interesting if these could be prioritised (even just to the extent of 'normal' and 'only service if it won't impact normal reads'), so that the disk could opportunistically service the low-priority requests if it was working in the right area anyway?

The downside is that I suspect the disk would need a very large queue for low priority requests given the chances of it actually passing over the right block in any sane period of time. I'm not sure though ... track caching could be very handy for low priority reads, for example. It'd be harder for writes, since the disk would need to actually have the data to write when it got the chance. You'd probably need better knowledge of the disk's layout - again, perhaps the disk protocol could be extended to give some more information about layout and about optimal read/write patterns.

Another thing that might be nice would be to be able to ask the disk what was in its cache, and do reads that succeed only if data can be read from cache. After all, bus bandwidth is cheap - essentially free compared to seeks, and still very cheap compared to actual disk reads.


Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds