The 2006 Linux Filesystems Workshop (Part II)

[Editor's note: this is the second page of Valerie Henson's report from the 2006 Linux Filesystems Workshop.]

Day One: Data

The first day of the workshop was devoted to reviewing the data about hardware, existing file system design, and the problems our users are facing. We began with introductions of the workshop participants:

  • Val Henson, Intel - ZFS developer
  • Zach Brown, Oracle - OCFS2 developer
  • Arjan van de Ven, Intel - former distribution kernel maintainer and all around Linux hacker
  • Andreas Dilger, Cluster File Systems - Lustre developer
  • James Bottomley, SteelEye - Linux SCSI maintainer
  • Chris Mason, SuSE - Reiserfs developer
  • Christoph Hellwig, representing himself - XFS, JFS, VxFS, and all-around file system developer
  • Mark Fasheh, Oracle - OCFS2 developer
  • Ric Wheeler, EMC2 - Storage system architect
  • Theodore Ts'o, IBM - ext2/3 developer
  • Mingming Cao, IBM - ext3 developer
  • Felix Blyakher, SGI - XFS developer
In addition to the above, Linus Torvalds dropped in on the second day of the workshop.

File system repair

We began by spelling out the coming fsck time crunch and the increasing frequency of unrecoverable I/O errors, as described in the introduction of this article. First, let's review what fsck, or the File System ChecK program, actually does. Fsck can be divided into three main tasks: recovery, repair, and refresh of the file system.

Sidebar: How to pronounce fsck

How do you tell where a file systems developer is from? Listen to how they pronounce fsck! Here are some common pronunciations and their sources.
  • "F-S-C-K" (Finland, Florida, California, France (in French), Germany (in German))
  • "fisk" (California, Portland)
  • "F-suck" (New Mexico, Michigan)
  • "F-sock" (Ireland)
  • "F-sack" (Unknown)
  • "F-S check" (Berkeley)
  • "Arrrgh, again?" (Dave Jones)
  • [censored] (attributed to Al Viro)

Recovery: Fsck was originally written to recover from an unclean unmount of the file system, perhaps caused by a system crash or loss of connection to the disk. Half-finished in-progress updates have to be cleaned up in order to make the file system consistent again. For example, a block may be referenced by an indirect block, but not marked as allocated in the block bitmap. Journaling file systems formalized the process of recovering in this case, occasionally moving the recovery code out of the fsck program itself. For example, XFS separated fsck into in-kernel journal replay code and a utility called xfs_repair.

Repair: A natural extension of fsck is to repair file system inconsistency caused by hardware error, software bugs, or other data corruption. The errors are less predictable than in the pure recovery case, but many commonly encountered errors (such as loss of the data in a particular block or errors in reference counts) can be corrected. In order to discover and correct these errors, fsck must read all the metadata in the entire file system.

Refresh: The final use of fsck is to traverse the entire file system looking for latent errors and attempting to fix them before they become too bad. This is the purpose of the various ext2/3 fsck timeouts (by default, fsck runs every time the file system has been mounted a fixed number of times, or if it has been more than a certain number of days since it was last checked). This use of fsck is a lot more acceptable if it can be done in the background.
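The mount-count and elapsed-time triggers described above can be sketched in a few lines. This is only an illustration of the decision, not the logic of any real fsck; the `CheckState` fields and the `needs_full_check` function are hypothetical names chosen for this example, and treating zero as "disabled" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class CheckState:
    """Hypothetical bookkeeping a file system might keep in its superblock."""
    mount_count: int       # mounts since the last full check
    max_mount_count: int   # force a check after this many mounts (<= 0: disabled)
    last_check: int        # time of the last full check, seconds since the epoch
    check_interval: int    # maximum seconds between checks (0: disabled)


def needs_full_check(state: CheckState, now: int) -> bool:
    """Return True if either the mount-count or the time trigger has fired."""
    if state.max_mount_count > 0 and state.mount_count >= state.max_mount_count:
        return True
    if state.check_interval > 0 and now - state.last_check >= state.check_interval:
        return True
    return False
```

For example, `needs_full_check(CheckState(30, 30, 0, 0), now=100)` fires on the mount count alone, while a state with both limits disabled never requests a check.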

Most file systems have concentrated on speeding up the recovery task of fsck, but the repair and refresh tasks are becoming more and more important. Today, repairing file system corruption can make all of the file system data inaccessible for hours or days, and hardware trends are only increasing fsck time. In many ways, data that you can't read is as bad as data loss - which is worse, losing a file and restoring from backup after a few hours, or corrupting your file system and being unable to read any of your files for several days?

One workaround, in use today, is to split the disk up into many smaller file systems, which can be repaired individually. Unfortunately, this results in greater administrative overhead and more frequent ENOSPC errors (out of space errors), since files and directories can't span file systems. Worse, an I/O error on one file system can sometimes cause the entire kernel to hang or panic, resulting in all of the data being unavailable anyway.

A frequently suggested workaround is to further optimize the fsck program itself. Perhaps the bottleneck is in the fsck code itself, and all we need to do is add some prefetching and block reordering, and spruce up the algorithms a bit. However, much of the obvious fsck optimization work has already been done, especially for ext2 and ext3. For this to be a long-term solution, we would need to continue improving the performance of fsck by a factor of 2 every year or so to keep up with the gap between disk capacity, bandwidth, and seek time - an unsustainable goal.

Our conclusion from this data is that file systems need to support partial, concurrent, on-line file system repair. File systems also need to be written to deal with frequent I/O errors without panicking or hanging the system - and need to be tested this way. One recommendation was that people should keep bad disks, as there's nothing like the real thing for testing your error handling.

Disks and errors

The types of errors disks make affect file system design as well. We are familiar with errors that happen from something physical going wrong with the media, resulting in bit flips. These days, the disk has internal checksums that detect this kind of error. When the checksum doesn't match, the disk retries the I/O itself several different ways before giving up and returning an error. Generally, it is pointless for any OS-level software to retry an I/O that resulted in the error, since by the point the OS sees an error, it is very likely to be unrecoverable.

However, another class of errors can't be detected or fixed by the disk's internal checksums. The terms for these kinds of errors are "high-flying writes" and "over-powered seeks", "phantom writes" and "misdirected reads and writes." Basically, the drive either fails to write data but claims it did (in a "high-flying write," the head is too far above the platter to actually flip the bits), or it completes the I/O, simply on the wrong block ("over-powered seek" means that the head went too far and is positioned over the wrong track when the I/O starts). The internal checksums (indeed, any checksum embedded within the block it is checksumming) can't detect this error, because the checksum matches the data in that location - it is simply old data or the wrong data. As a result, the OS needs to have checksums at a higher level as well. Sun's ZFS takes the approach of putting checksums in the indirect blocks, which solves most of the problems with misdirected or phantom I/O's without imposing an additional performance penalty to read a separate checksum block. Checksums like this will solve many of the panic/hang problems that come from trusting corrupted file system metadata.
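To make the argument concrete, here is a small simulation of why a checksum stored inside a block cannot catch a phantom or misdirected write, while a checksum kept in the parent block can. It is a toy model under stated assumptions: the `Disk` class, its `phantom` and `misdirect_to` fault switches, and the `simulate` function are all invented for this sketch, and CRC32 merely stands in for whatever checksum a real file system would use.

```python
import zlib


def checksum(data: bytes) -> int:
    return zlib.crc32(data)


def seal(payload: bytes) -> bytes:
    """Append a checksum to the block itself (an embedded checksum)."""
    return payload + checksum(payload).to_bytes(4, "big")


def embedded_ok(block: bytes) -> bool:
    """Verify the checksum embedded in the block."""
    payload, stored = block[:-4], int.from_bytes(block[-4:], "big")
    return checksum(payload) == stored


class Disk:
    """Toy disk that can silently drop or misdirect a write."""

    def __init__(self, nblocks: int):
        self.blocks = [b""] * nblocks

    def write(self, n, data, phantom=False, misdirect_to=None):
        if phantom:
            return  # the drive reports success but writes nothing
        self.blocks[n if misdirect_to is None else misdirect_to] = data

    def read(self, n):
        return self.blocks[n]


def simulate(fault: str):
    """Return (embedded check passes, parent check passes) after an update."""
    disk = Disk(nblocks=4)
    old, new = seal(b"old contents"), seal(b"new contents")
    disk.write(2, old)
    if fault == "phantom":
        disk.write(2, new, phantom=True)
    elif fault == "misdirected":
        disk.write(2, new, misdirect_to=3)
    else:
        disk.write(2, new)
    parent_checksum = checksum(new)  # the parent records what the child should be
    got = disk.read(2)
    return embedded_ok(got), checksum(got) == parent_checksum
```

In both fault cases the stale block still passes its own embedded check, because old data is internally consistent; only the comparison against the parent's copy of the checksum notices that the wrong contents came back.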

Another interesting class of errors is the result of failures in an underlying RAID device or volume manager. Often failures will occur in stripe sizes - 64KB of data ends up zeroed, for example. Some systems actually byte stripe data, which results in single bytes of lost data at regular intervals.

Since I/O errors are common, it makes sense for file systems to replicate important data - the more important the data is, the more copies should exist. Many file systems already make multiple copies of the file system superblock; this is an extension of that concept to other important data, such as block group summary information. However, to make copies effective, the data must be "scrubbed" - regularly read to discover whether a copy has gone bad, and then rewritten with the good data before the second copy also goes bad. A talk on data preservation by Mary Baker really brings this home with a model used to predict mean time to data loss with and without scrubbing. In the system she modeled, mean time to data loss went from less than a year without scrubbing to as much as 16 years with scrubbing, depending on the frequency of scrubbing (see slide 45 in the presentation).

Even with two copies of data, data loss is surprisingly common due to the birthday paradox, exemplified by the question "How many people do you need to have a 50% chance that two or more of them share a birthday?" The answer is only 23, because the chance of the same date coming up twice grows approximately in proportion to the square of the number of people. Similarly, it is surprisingly likely that, given a system that makes two copies of all data, both copies of some piece of data will go bad. The "birthday surprise," when several disk drives in the same lot fail within a few days of each other due to similar defects in manufacturing, also contributes to surprisingly high data loss even with multiple copies.
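The birthday figure is easy to check. The sketch below computes the exact probability under the usual simplifying assumptions (365 equally likely birthdays, no leap years) and counts the number of pairs, which is where the roughly quadratic growth comes from. The function names are made up for this example.

```python
def p_shared_birthday(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


def pair_count(people: int) -> int:
    """Number of distinct pairs; grows roughly as the square of `people`."""
    return people * (people - 1) // 2
```

With 22 people the probability is still below one half; with 23 it crosses it, because there are already 253 pairs that could collide. The analogy to storage is that the number of ways two copies of the same data can both go bad grows with the number of replicated items, not just with the failure rate of a single copy.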

One problem implementing file systems is the wide variation in quality and features available on different disk drive lines. However, just as the disk drive industry is consolidating at the corporate level (think Seagate-Maxtor-Quantum), it is also consolidating at the level of disk drive lines. Economies of scale encourage production of fewer different kinds of disks, possibly with minor differences such as firmware version. This is positive for the low end of disks, since large enterprise customers are buying low-end disks and insisting on better quality.

Another change in disk hardware is in caching and reordering of I/O's. Back when FFS was originally designed, the maximum bandwidth was only possible when reading every other block on disk, since the disk would rotate one block underneath the head before the OS could submit another I/O (hence the superblock field specifying "rotational delay" in many implementations of FFS). Then disks began supporting multiple outstanding I/O's, and both disks and OS's did readahead of one or more blocks, making pure sequential layout the fastest game in town. Now disks read whole tracks (on the order of 256KB) into the internal disk memory (on the order of 8MB), which makes any I/O's that occur within the same track or group of nearby tracks behave, practically speaking, as if they were sequential on-disk. Optimal layout has gone from every-other-block to sequential to "nearby is good enough."

We discussed the effect of having some form of flash or non-volatile memory available either on the disk or in the main system. This is part of what makes Network Appliance's file servers so speedy - they cache writes on flash and return success quickly, instead of having to wait until the write hits disk. The easiest way to take advantage of flash is to use it as a write cache for blocks - physical journaling on a separate device. One suggestion is to use the flash for metadata and store data on disk, but that makes flash a single point of failure, the loss of which is almost equivalent to losing the entire file system, and also constrains the ratio of metadata to data in the system. Designing a file system that assumes the presence of OS-visible flash somewhere on the system seems like a risky proposition, given that flash has been available for many years but not integrated as a standard system component. Concerns about write speed and number of write cycles before failure (order of 100,000 to 1,000,000) also limit file system dependency on flash. Overall, a new file system should be able to take advantage of flash but not require its presence.

A recurring theme of our discussions was the desire to broaden the interface between the operating system and the disk drive. Even the most basic disk drive is a small computer now, let alone hardware RAID devices or multi-terabyte SAN servers, and can easily handle more sophisticated requests than "write block 37." This is too broad a topic to cover here, but the key step we need to take is getting file systems architects and disk manufacturers together to talk about what interfaces would be useful.

Lessons learned from existing file systems

We discussed some lessons learned from existing file system implementations. One of the primary lessons is that file system repair tools need to reliably distinguish between metadata and things that simply look like metadata. One advantage of having a fixed number of inodes in a static location is that it's hard to mistake other data for an inode, and consequently much easier to repair the file system. Identifying potential inodes solely by an embedded magic number is a recipe for failure: consider the case of a file containing a loopback-created file system image - the file data will look like inodes if we depend only on embedded magic numbers. Metadata should be identified by some kind of out-of-band mechanism, such as block group descriptors or headers of some sort. When it comes to file system repair, static metadata location has some serious advantages to offset its disadvantages (occasional wasted space or insufficient inodes).
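The loopback-image problem can be shown with a toy block array. In this sketch, which uses an invented 4-byte magic value and invented helper names, a repair tool that trusts magic numbers alone picks up an inode belonging to a file system image stored as ordinary file data, while a tool that consults an out-of-band list of inode locations does not.

```python
MAGIC = b"INOD"  # hypothetical inode magic number, made up for this sketch


def scan_by_magic(blocks):
    """Treat every block that starts with the magic number as an inode."""
    return [i for i, blk in enumerate(blocks) if blk.startswith(MAGIC)]


def scan_by_table(blocks, inode_blocks):
    """Only consider blocks that out-of-band metadata says hold inodes."""
    return [i for i in sorted(inode_blocks) if blocks[i].startswith(MAGIC)]


def build_example():
    # An inner file system image, stored as plain file data in the outer one.
    inner_image = [b"SUPRinner", MAGIC + b"inner-1", b"some file data"]
    # Outer file system: a descriptor block, two real inode blocks, then data.
    outer = [b"DESCouter", MAGIC + b"outer-1", MAGIC + b"outer-2"] + inner_image
    inode_blocks = {1, 2}  # what the outer descriptor says about itself
    return outer, inode_blocks
```

On this example `scan_by_magic` reports blocks 1, 2 and 4, wrongly adopting the inner image's inode, whereas `scan_by_table` reports only blocks 1 and 2.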

Another lesson learned is that file system generation numbers help distinguish between metadata and data that was once metadata in a previous incarnation of this file system on this device. If the metadata is static, it can be zeroed on file system creation, but this makes file system creation expensive and slow. One technique to speed up ext2/3 file system creation time without adding a generation number is to add a parameter to the block group information describing how much of the inode table has been initialized, so that fsck knows to ignore the rest of the inode table. An "initialized up to here" marker block adds a layer of redundancy and safety to this scheme.

Often a major goal of file system design is to reduce the on-disk size of metadata. However, several studies over the last decade have shown that many disks are only half-full. Given that redundancy of metadata helps with file system repair, it seems like a reasonable trade-off to use more on-disk space for metadata to reduce the overall chance of data loss.

One exception to the rule of disk-space-is-cheap is small files. Many file systems are inefficient at storing small files, yet small files make up the majority of files in many file systems. We heard about a large government customer that wants to store millions of files on the order of 1KB in size; another person pointed out that many source code files are smaller than 4KB. Several of us checked the file sizes on our laptop file systems and found the following results:

File size    % of total files
<= 1 KB      25-30%
<= 2 KB      40-45%
<= 3 KB      50-60%
<= 4 KB      55-65%

In other words, about one quarter of the files took up less than 1KB, and more than half were less than 4KB. In every case, we could find some sort of explanation for why there were so many small files, but when everything is a special case in the same way, it becomes the common case. Indeed, current application programming styles seem to only be increasing the number of small files; examples include Gnome's habit of storing one configuration item per file. With a file system block size of 4KB, easily a quarter of the stored files are wasting more space than they are using. This is similar to what the implementers of FFS found back in 1984; using a 4KB block size wasted about 45% of the disk - which is why they implemented 1KB fragments (which remain unimplemented in ext2 and ext3). New file systems must be designed to efficiently store small files.
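The space cost of whole-block allocation is simple arithmetic. The sketch below rounds each file up to a whole number of blocks and reports the slack; the function names and the sample file sizes are invented for illustration and are not measurements from the workshop.

```python
def allocated_bytes(size: int, block_size: int = 4096) -> int:
    """Space consumed when a file must occupy a whole number of blocks."""
    return -(-size // block_size) * block_size  # ceiling division


def slack(size: int, block_size: int = 4096) -> int:
    """Bytes allocated to the file but not used by it."""
    return allocated_bytes(size, block_size) - size


def wastes_more_than_it_uses(size: int, block_size: int = 4096) -> bool:
    return slack(size, block_size) > size


# Made-up sample of file sizes in bytes, purely for illustration.
SAMPLE_SIZES = [300, 1000, 1800, 3000, 5000, 20000]


def total_slack(sizes, block_size: int = 4096) -> int:
    return sum(slack(s, block_size) for s in sizes)
```

A 1KB file in a 4KB block leaves 3KB unused, so any file smaller than half a block wastes more space than it uses. Shrinking the allocation unit to 1KB, as FFS fragments do, removes the slack for that file entirely.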

One existing solution is to pack small files together into one block. However, one of the problems with this solution is that it ends up rewriting data belonging to files which would otherwise be read-only, increasing the likelihood of corruption. It also tends to be difficult to implement and bug-prone.

Major file system architectures

We briefly reviewed the major file system architectures to summarize what worked and did not work in each case. Simple FFS-style file systems such as the original Berkeley FFS and ext2 have the advantages of simplicity, high performance, and easy repair and recovery of data, but they require a full fsck every time the system crashes, have no data consistency guarantees, and no formal defenses against disk corruption.

The most popular modern file system architecture is the journaling file system, such as ext3, reiserfs, and many others. Journaling file systems added a log of recent transactions to the file system, which are written sequentially to a reserved location on disk. The main file system is not modified until the complete transaction is written to the log. This allows fast recovery from an unexpected crash, as we can replay the log and complete any half-finished operations. The problems with journaling include the double write problem (each operation must be written to disk twice, once to the log, once to the final location), and various performance problems stemming from limited, contended journal space. Also, journaling makes no improvement in the disk corruption case.
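The log-then-replay idea can be sketched with an in-memory model. This is a deliberately simplified illustration, not how ext3 or reiserfs implement journaling: the `JournaledStore` class, its record format, and the `crash_before_commit` switch are assumptions made for this example.

```python
class JournaledStore:
    """Toy journal: updates reach the main area only via committed log records."""

    def __init__(self):
        self.journal = []  # sequential log of (kind, txid, block, data) records
        self.main = {}     # block number -> data: the "final location"

    def log_transaction(self, txid, updates, crash_before_commit=False):
        """First write: append every update to the log, then a commit record."""
        for block, data in updates.items():
            self.journal.append(("write", txid, block, data))
        if crash_before_commit:
            return  # simulate a crash: the transaction is left half-logged
        self.journal.append(("commit", txid, None, None))

    def replay(self):
        """Second write: copy committed updates to their final locations."""
        committed = {txid for kind, txid, _, _ in self.journal if kind == "commit"}
        for kind, txid, block, data in self.journal:
            if kind == "write" and txid in committed:
                self.main[block] = data
        self.journal.clear()


def demo():
    store = JournaledStore()
    store.log_transaction(1, {1: "a", 2: "b"})
    store.log_transaction(2, {2: "x", 3: "y"}, crash_before_commit=True)
    store.replay()  # what recovery does after the crash
    return store.main
```

After replay, only the committed transaction is visible: the half-logged one is discarded as a unit. The sketch also makes the double write visible, since every committed block is stored once in `journal` and once in `main`.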

Log-structured file systems caused a great stir in the research file systems community but never made the leap into major production use. The insight behind log-structured file systems is first, writing out updates as a log turns a group of random write I/O's into one large streaming write, which is much more efficient, and second, once we have written the transaction to a log, why do we need to write a second time? The file system is essentially a giant sequential transaction log, with updates appended to the end. Data is never overwritten in place - log structured file systems are one kind of copy-on-write (COW) file system. The main problem with log-structured file systems is that they require large free segments of disk space, created by a "cleaner" thread. Allocated blocks must be moved out of partially free segments and into other segments. The overhead of the cleaner thread turns out to be quite high, despite many years of research on cleaner optimization. Also, calculating the amount of free space needed to complete an update (even a write to a previously allocated block) turns out to be hard, since a COW file system cannot free the old copy of a block until the new copy is written, and the number of block copies required to complete an operation is unpredictable. Finally, forced reallocation of blocks requires that a "good" allocation decision be made on every write, whereas update-in-place file systems need only make a good allocation decision once.

Soft updates was a refinement to Berkeley FFS which preserved the on-disk format while removing the need to run fsck on the file system before it could be mounted after a crash. Soft updates carefully orders updates to the file system so that if the system crashes at any time, the file system is consistent with the exception that some blocks and inodes are "leaked" - marked allocated when they are free. A background fsck, run on a file system snapshot, finds these unreferenced blocks and marks them free again. The downside of soft updates is mainly that it is extremely complex to understand and implement, and each file system operation requires its own specially designed update code. To our knowledge, there is only one implementation of soft updates in existence.

The most recent trend in file systems architecture is the copy-on-write (COW) file system, as typified by WAFL (Write Anywhere File Layout, Network Appliance's internal file system), and ZFS (the new Solaris file system). These file systems are constructed as a tree of blocks. Every time a block is updated, a new block is allocated and the chain of block pointers pointing to it is updated - also causing copies of those blocks. When a consistent set of updates is written to disk, the root block is atomically updated to point to the new tree of blocks, which includes up-to-date allocation information. This structure makes snapshots extremely simple to implement and centralizes file system consistency code. The disadvantages are similar to some of the disadvantages of log-structured file systems - forced reallocation on every write and uncertainty about how much space is needed to complete an update. Also, good synchronous performance requires an added journal of some sort, complicating the implementation.
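The path-copying update described above can be sketched as follows. This is a toy model, not the WAFL or ZFS design: the `CowStore` class and its methods are hypothetical, interior blocks are plain tuples of child block ids, and the single assignment to `root` stands in for the atomic root-block update.

```python
class CowStore:
    """Toy copy-on-write tree: blocks are never overwritten once allocated."""

    def __init__(self):
        self.blocks = {}   # block id -> content (leaf data or tuple of child ids)
        self.next_id = 0
        self.root = None   # the only pointer that is ever updated in place

    def alloc(self, content):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = content
        return block_id

    def read(self, root, path):
        """Follow child indexes in `path` down from `root` to a leaf."""
        node = self.blocks[root]
        for index in path:
            node = self.blocks[node[index]]
        return node

    def _copy_path(self, block_id, path, data):
        if not path:
            return self.alloc(data)  # new leaf holding the new data
        children = list(self.blocks[block_id])
        children[path[0]] = self._copy_path(children[path[0]], path[1:], data)
        return self.alloc(tuple(children))  # new copy of the interior block

    def update(self, path, data):
        new_root = self._copy_path(self.root, path, data)
        self.root = new_root  # the atomic switch to the new tree
        return new_root


def build_example():
    store = CowStore()
    leaves = [store.alloc(x) for x in ("a", "b", "c", "d")]
    left = store.alloc((leaves[0], leaves[1]))
    right = store.alloc((leaves[2], leaves[3]))
    store.root = store.alloc((left, right))
    return store
```

Updating one leaf in this two-level tree allocates three new blocks, one per level, and leaves the old root intact, so keeping the old root id around is all it takes to have a snapshot. Untouched subtrees are shared between the old and new trees.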

Overall, previous file systems focused on solving the problem of repairing the file system in the case of system crash, and considered actual data corruption to be rare enough to handle with a full fsck. File system repair was not a major consideration in the design of the on-disk layout. Saving on-disk space was a major goal early on, but is less important now, except for the case of small files, which still make up a large portion of stored data.

Thus ended the first day of the workshop, in doom and gloom. Only the hope of the second day of the workshop and the promised brainstorming session kept our spirits up.

[ Continue on to Page 3 ]


The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 5, 2006 23:09 UTC (Wed) by Rakshasa (guest, #14732) [Link]

Isn't it more interesting to know the ratio of wasted space due to 4KB blocks, rather than the ratio of small files? In the general case those small files probably won't come anywhere near the size of that DVD iso you have lying around.

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 19, 2006 20:33 UTC (Wed) by bronson (subscriber, #4806) [Link]

This is a very good point. If small files waste 700 MB on my laptop's 80 GB hard drive, I could hardly care less. I have yet to see any filesystem properly implement a special-case for small files. It's always either unreliable or slow.

The government institution from the article that wants to store tons of 1KB files should probably use a database instead.

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 4:25 UTC (Thu) by zooko (subscriber, #2589) [Link]

Was Hans Reiser invited?

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 8:19 UTC (Thu) by alonso (subscriber, #2828) [Link]

It's a very strange decision to exclude from the discussion a fs that implements a lot of the features they discussed.

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 9:28 UTC (Thu) by nix (subscriber, #2304) [Link]

Well...
One of the primary lessons is that file system repair tools need to reliably distinguish between metadata and things that simply look like metadata.
I propose we call this the 'reiserfsck lesson'. Does reiserfsv4 still have this, um, minor design flaw? (`Minor' in much the same way that the Pacific Ocean is `slightly damp', IMNSHO, but then I make massive use of loopback fsen so I may be biased).

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 15:24 UTC (Thu) by alonso (subscriber, #2828) [Link]

Can you explain this point to me better, please?
Thank you!

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 22:01 UTC (Thu) by adobriyan (guest, #30858) [Link]

From: Theodore Ts'o

You've obviously never kept several dozen reiserfs filesystem images
(for use with Xen or User-Mode Linux) on a reiserfs filesystem, and
then had a hardware failure bad enough that the fsck had to try to
rebuild the b-tree, I take it?
From: Hans Reiser

That is fixed in V4.  Until people start to use V4 they should compress
their V3 backup images that they store on V3, or store them on separate
partitions.  I regret that fixing it without a disk format change was
not possible.
Looks like fsck.reiserfs can liberate reiserfs filesystem image out of loopback prison and glue it to the main one.

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 7, 2006 18:31 UTC (Fri) by Tet (subscriber, #5433) [Link]

I propose we call this the 'reiserfsck lesson'.

:-) I couldn't have put it better myself...

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 11:49 UTC (Thu) by grmd (subscriber, #4391) [Link]

Perhaps Hans wasn't available? Perhaps Chris Mason put forward a ReiserFS viewpoint?

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 16:58 UTC (Thu) by zooko (subscriber, #2589) [Link]

I'm sorry -- I didn't mean to sound adversarial. I was just curious if Hans Reiser was invited, and if he was why he didn't attend.

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 20, 2006 13:40 UTC (Thu) by Duncan (guest, #6647) [Link]

You may not see this at this late date, but others will. Said as someone
running a 100% reiserfs system, semi-patiently waiting for reiser4, so I'm
certainly not anti-Hans-Reiser in any way, but as a realist, Hans' style
simply doesn't work well at such conferences.

He's extremely bright, possibly more so than many/most/all regular kernel
or fs hackers. However, like many extremely bright people, he's also
rather less socially skilled than most, particularly among "peers", as
he's used to being so much brighter than anyone else around that no one
can stand up to his force of logic, which means he's used to always
being "right" even when he's not, and to always getting his way virtually
without question, simply because no one else has the ability or resources
to question him on his level. As a result, he simply doesn't have the
social negotiation skills that most of us lesser mortals end up developing
because there's always someone around who can show us up.

When he gets on LKML, flames nearly always ensue, with little good coming
directly out of it. Many kernel hackers now simply refuse to be involved
at all, which is a shame as it has hampered what nearly all involved
realize /could/ be a major advance in Linux filesystems. OTOH, no one
blames them because it ultimately ends up being a matter of personal
sanity and a defense mechanism staying away from him. Go read some of the
exchanges and the insults he has thrown around and tell me you can
honestly disagree.

What has actually happened, the progress reiser4 /has/ made towards
integration, has very often been the result of Hans' employees quietly
working things out with the rest of the Linux community pretty
much /despite/ Hans' fireworks and presence, rather than /because/ of it.

There's also some serious history involved. After reiserfs (reiser3) was
included in the kernel, Hans and Namesys basically took off and abandoned
it. Their argument, not entirely unreasonable, was that it's now stable
and in bugfix mode, and you don't add new features to a stable
version, /particularly/ when the software in question is a filesystem, and
people's data is at stake if those new features add bugs that risk that
data!

OTOH, the kernel is a living/breathing/changing code collection, with
various components not always entirely stable at the same time. As the
kernel changed and the rest of its major filesystems got extended
attribute support, and data=ordered and data=journaled support for
reiserfs matured and was tested enough to be added to the mainline kernel,
Hans Reiser and Namesys were nowhere around to add it, and refused to do so
pointing to the stable thing, regardless of the fact that it was needed to
keep up with the rest of the kernel.

The result was Chris Mason and other kernel hackers had to take up the
slack, and end up maintaining code that had effectively been "dumped" on
them, and that had been accepted before it met the normal kernel coding
conventions, making it far harder to maintain. The result of that is that
the kernel hackers are being FAR harder on reiser4 than they were on
reiserfs, knowing from experience that soon after it gets into the kernel,
Hans Reiser and Namesys may well be off developing reiser5 or whatever the
next big thing is, leaving the kernel maintainers to cope the best they
can. No /wonder/ they are demanding the new code fit the coding
conventions and style of the rest of the kernel, this time! To a man
used to being right, not only because he very often is, but because he has
few peers, few that even come /close/ to being able to challenge him, this
seems a rather nasty rebuke.

Back to the conference, however. It was Chris Mason that finally
shepherded the data=ordered and data=journaled updates of reiserfs into
the mainline kernel, Chris Mason that handled much of the extended
attribute reiserfs work, and Chris Mason that has been pretty much the
point man on reiserfs, and likely will be the point man on reiser4 as
well, in terms of working with the other kernel filesystems and the rest
of the kernel. Unlike Hans Reiser, he has demonstrated his ability to
work with others, including the rest of the core kernel and file systems
teams.

Thus, even if Hans Reiser would have had a valuable viewpoint and valuable
knowledge to contribute, and that he certainly WOULD have, from a
practical viewpoint, were he to attend the conference, far less would have
likely actually gotten hashed out. Far better that Chris Mason represent
reiserfs/reiser4, and something actually get done, than Hans, and only
fighting and flaming and bitter recriminations result.

Hans is a very gifted man, very good at what he does, developing file
systems, and at least acceptably good running a company (I can't rightly
judge /how/ good, but the company is still in business, and still making
payroll, so it's not /bad/). However, he's the wrong guy to have at a
kernel file systems conference if you want to get anything done. That's
just the way he is. Let's appreciate him and help him contribute where
he's best, and I'm certainly very happy to use his filesystems, but
please, keep him away from those conferences! =8^O

Duncan

The 2006 Linux Filesystems Workshop (Part II)

Posted Aug 28, 2006 16:04 UTC (Mon) by zooko (subscriber, #2589) [Link]

For the record, it appears to me (after doing some investigating) that Hans Reiser was deliberately not invited because the organizers and/or some of the attendees didn't want to have his presence at the meeting.

Regards,

Zooko

GConf storage

Posted Jul 6, 2006 4:59 UTC (Thu) by jamesh (subscriber, #1159) [Link]

Note that there is a modification of the GConf storage format that stores subtrees of the GConf db as single files. This was written to reduce the number of small files created, due to the disk wastage with common file systems. I'm not sure if it is enabled by default for new users though.

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 8:53 UTC (Thu) by nix (subscriber, #2304) [Link]

Utterly fascinating stuff, thank you.

(Aside: for some time I wondered about the efficiency effects of reading whole tracks into the on-disk memory; then it occurred to me that of course there is no efficiency reduction. That data is passing under the head *anyway*, and the disk has to read it in order to find the correct sector, and reading it doesn't cost anything; so why not cache it to assist future requests?)

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 12:29 UTC (Thu) by sveinrn (subscriber, #2827) [Link]

Where I work, we are collecting really expensive data that has to be stored on online disks for eternity. And thinking through how to protect the data from long term corruption has been a really interesting exercise. We ended up with a separate database storing md5 checksums of every file and then scanning the files regularly, restoring from backup any files that do not match the checksum.

But the most interesting suggestion that came up, was creating a raid system based on the Hamming code. So that first we would have a number of data disks. Then a number of disks storing the parity bits of the Hamming code. And on top of that, two parity disks for RAID6. As far as I can see, a scheme like this will protect against most of the error scenarios described in this article. It will of course require a great number of parity disks, and writing data will be extremely slow. Also, a battery backed write cache will be essential. But for archive-systems with only a few daily write operations I think it could work. Does anybody know if a system like this has ever been tried?

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 6, 2006 23:18 UTC (Thu) by valhenson (subscriber, #38407) [Link]

You might find NetApp's latest and greatest RAID stuff interesting:

"Row-Diagonal Parity for Double Disk Failure Correction"

Peter Corbett, Bob English, Atul Goel, Tomislav Grcanac, Steven Kleiman, James Leong, and Sunitha Sankar, Network Appliance, Inc.

http://www.usenix.org/events/fast04/tech/corbett.html

Awarded best paper at FAST '04... trying to remember if I've read it myself though, since I saw the talk.

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 7, 2006 9:10 UTC (Fri) by sveinrn (subscriber, #2827) [Link]

I have looked at it. But as far as I can see, this scheme does not protect properly against '"high-flying writes" and "over-powered seeks", "phantom writes" and "misdirected reads and writes."'

The parity schemes used in RAID are good for recreating data that is known to be missing. Error correcting codes (including the Hamming code) are good for correcting data that is readable but corrupt. So my first thought was that with RAID6 on top of the Hamming code we would be protected from 2 failed disks in combination with 1 corrupt disk. But that is clearly wrong. So a better idea could be one of the more modern codes, for example the Reed-Solomon code. That also removes the need for RAID6 parity on top. (The Reed-Solomon code is also especially effective when one knows where the error is, i.e. a missing disk.)
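The difference between the two cases can be sketched with plain XOR parity (the single-parity case, as in RAID 4/5). The helper names below are hypothetical. A block that is known to be missing can be rebuilt, but when a block is silently corrupt the parity check only says that something in the stripe is wrong, not which block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(blocks, parity, missing_index):
    """Recreate the block at missing_index from the survivors plus parity."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_blocks(survivors + [parity])


def parity_matches(blocks, parity):
    """True if the stripe is consistent; False gives no hint about which block is bad."""
    return xor_blocks(blocks) == parity
```

With one parity block there is a single syndrome per byte, so every data block is an equally plausible culprit. Locating the error needs either an external hint (a drive error, a checksum) or more redundancy.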

Pronouncing fsck

Posted Jul 6, 2006 12:41 UTC (Thu) by Webexcess (subscriber, #197) [Link]

What, no one else says "F-sick"?

Pronouncing fsck

Posted Jul 6, 2006 13:38 UTC (Thu) by pj (subscriber, #4506) [Link]

nah, it's "fisk" or "fizz-chick"

Pronouncing fsck

Posted Jul 6, 2006 14:08 UTC (Thu) by nix (subscriber, #2304) [Link]

For me it's fss-ck.

Pronouncing fsck

Posted Jul 6, 2006 23:07 UTC (Thu) by mepr (guest, #4819) [Link]

f-s-check, f.s.c.k., or f-suck

Pronouncing fsck

Posted Jul 20, 2006 12:54 UTC (Thu) by Duncan (guest, #6647) [Link]

f-s-check here too, unless I'm feeling sarcastic or pithy, or reading The
Register's BOFH series or the like, in which case in the great Unix
tradition it's "fusck", and in "No backups? You are /really/ fuscked!"

Duncan

I never pronounce it

Posted Jul 7, 2006 14:06 UTC (Fri) by Max.Hyre (subscriber, #1054) [Link]

:-)

Since I never discuss it orally, the question doesn't arise. Reading, my mind just notes ``oh, that thing'', and keeps on going.

Pronouncing fsck

Posted Jul 8, 2006 9:46 UTC (Sat) by skissane (subscriber, #38675) [Link]

Don't worry, you are not alone; I say f-sick too...

small file efficiency

Posted Jul 7, 2006 2:01 UTC (Fri) by brian (subscriber, #6517) [Link]

I store a large number of small files (100s of 1000s of Maildirs). My concern regarding small files is speed rather than space.

Yes, using a minimum 4KB per email is a shame, but quick access to all those spams is far more important. Quick deletes are also important...

Paper that seems to go with Baker's presentation

Posted Jul 7, 2006 3:33 UTC (Fri) by maney (subscriber, #12630) [Link]

I have a real love/hate thing going with these sorts of slide shows. On the one hand, a good set of slides is better than nothing for those who couldn't see the talk. OTOH, they rarely do more than give you the 10,000 foot overview, and almost always leave you wondering about all the interesting details. In this case it appears there's an easily accessible paper that covers much of the same material, at least WRT the failure model:

http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf

A couple of fixes?

Posted Jul 7, 2006 14:59 UTC (Fri) by Max.Hyre (subscriber, #1054) [Link]

It seems to me that the disk controller hardware could fix two of the problems. These have to be done at the controller level, for speed and because the OS has no idea what the disk's real geometry is.

First, how about dual heads to allow read-after-write, as tapes have had since the dawn of time? That'll catch the high-flying writes, and blocks with oxide problems.

Second, for over-powered seeks, I suspect you need to get below the filesystem to the real format, the one the OS never sees. IIRC, waaay back when I was paying attention to disk details, the track had an ID actually written in it. You could check that to make sure you were in the right place.

Of course, this has a couple of drawbacks:

  • it would require the track number in each block, so you don't have to wait for the start of the track to come around before you know where you are, thus losing some capacity (but who cares?)
  • it would still add a delay of one full rotation every Nth write, where N = # blocks/track, because that's how often the first block you get is the one you want to write. Do they already have block numbers (instead of some physical position sensor, which sounds pretty squirrely to me)? If so, you've already got this problem, so it isn't a factor.
It might be worth it, depending on the frequency of overshoot (or undershoot?) errors, and how often the write load overpowers the cache's ability to hide the delays.

A couple of fixes?

Posted Jul 13, 2006 22:01 UTC (Thu) by dfsmith (guest, #20302) [Link]

In modern disk drives, at the innermost and outermost slider positions, the read head is +/- 10 tracks away from the write head. (Unfortunately 1-cosine!=0.)

To correct DNW (did-not-write) you would have to waste about 1.5 revs to reposition and read. (Depends on write length.)

To correct WWP (wrote-wrong-place) you have to first read all tracks where you might accidentally write, do the write, then verify the write and the other tracks.

Oddly enough, people buy preferentially on performance.
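For a rough sense of what 1.5 wasted revolutions costs, here is a back-of-the-envelope sketch. The spindle speeds are assumptions picked for illustration, not figures from the comment above, and the function names are made up.

```python
def revolution_ms(rpm):
    """Time for one platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def verify_penalty_ms(rpm, revs=1.5):
    """Extra time per write if a read-back verify costs `revs` revolutions."""
    return revs * revolution_ms(rpm)
```

At an assumed 7,200 RPM a revolution takes roughly 8.3 ms, so the verify adds about 12.5 ms to each write; at 15,000 RPM it is about 6 ms. Either way the penalty is on the same order as a seek, which is why it shows up so clearly in benchmarks.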

A couple of fixes?

Posted Jul 13, 2006 22:10 UTC (Thu) by dfsmith (guest, #20302) [Link]

Oops, that should be sin!=0. (The read head is a finite distance away from the write head.)

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 8, 2006 10:28 UTC (Sat) by neilbrown (subscriber, #359) [Link]

I was surprised by the mention of 'flash' for write-behind caching of
writes.

Certainly some form of non-volatile memory is a good idea, but I was
under the impression that writing to flash was quite slow.

Is flash a serious suggestion for caching writes or was it just
used as a simple term for non-volatile RAM??

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 9, 2006 0:01 UTC (Sun) by nlucas (subscriber, #33793) [Link]

I was surprised by this too, especially because I wasn't aware there were flash memories without that limit on the number of writes to the same place.

I suppose this limit is getting higher over time, but isn't it still too low for this kind of thing?

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 8, 2006 11:04 UTC (Sat) by job (subscriber, #670) [Link]

On my busy mail and news servers I've found that ReiserFS has much higher performance than other file systems. I suspect that this has to do with the tail packing which is discussed in this article. This is a great feature and I would hope it doesn't get left out of future file systems because of NIH issues or more hypothetical damage scenarios.

The 2006 Linux Filesystems Workshop (Part II)

Posted Jul 15, 2006 16:44 UTC (Sat) by nix (subscriber, #2304) [Link]

Tail packing reduces performance unless files packed together are generally accessed together.

What it increases is *storage efficiency*.

Encoding sector number with each sector

Posted Dec 26, 2006 0:07 UTC (Tue) by AdamRichter (guest, #11129) [Link]

If disk manufacturers would encode the sector number with every sector
that they write and check it on each read, that would simplify recovery
from misdirected writes, where an attempt to write sector X actually
results in a write to sector Y.

With this change, an attempt subsequently to read sector Y would result
in an error notification rather than returning the data that was intended
for sector X. An attempt to read sector X would still return the old
data in sector X that was supposed to have been overwritten, with no
error notification.
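The behaviour described in the two paragraphs above can be sketched with a toy model. `TaggedDisk` and its `lands_on` parameter are hypothetical, invented only to simulate a misdirected write; nothing here reflects how any real drive firmware works.

```python
class TaggedDisk:
    """Toy disk where every sector stores the sector number the writer intended."""

    def __init__(self, n_sectors):
        # Each slot holds (tag, data); a freshly formatted slot is tagged with its own number.
        self.slots = [(i, b"") for i in range(n_sectors)]

    def write(self, sector, data, lands_on=None):
        """Write data for `sector`; `lands_on` simulates the head ending up elsewhere."""
        target = sector if lands_on is None else lands_on
        self.slots[target] = (sector, data)

    def read(self, sector):
        """Return the data, or raise if the stored tag does not match the address."""
        tag, data = self.slots[sector]
        if tag != sector:
            raise OSError(f"sector {sector} holds data tagged for sector {tag}")
        return data
```

A write intended for sector 3 that lands on sector 5 makes a later read of sector 5 fail loudly, while a read of sector 3 still quietly returns the stale contents, which is exactly the asymmetry the comment points out.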

It would not surprise me if disk drives already do this.

The advantage here is that a RAID system can automatically correct sector
Y when it is read, because it will be explicitly notified of the error,
even though sector Y is not known from the I/O logs.

As for sector X, its address is known from the I/O history, so it can be
checked by whatever mechanism would be used to check for the "high flying
writes" problem where the data simply was not written to disk at all.
For example, completion of the write might not be signalled to the
operating system until the head had been moved away, and the data then
reread. As another example, a raid-6 system has multiple parity
sectors and therefore can identify the source of bad data even when no
drive is indicating an error, so it could adopt a policy of checking data
this way from all sectors being reread for the first time, and,
after power failure, all sectors might start with this status unless
there is some log being stored, or perhaps a non-volatile log might be
used on some other device to allow a more efficient recovery policy.

At least for the examples above, implementing these policies at a virtual
device level like RAID has the advantage of providing the benefit to
essentially all existing disk file systems as well as non-filesystem
users of block devices.
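As a sketch of how two parity blocks can point at the bad sector when no drive reports an error: the code below uses the usual P (XOR) and Q (weighted sum over GF(2^8)) construction. The field polynomial 0x11d with generator 2 is an assumption chosen for the example, and all function names are made up. It handles only the case where exactly one block in the stripe is silently wrong.

```python
# Log/antilog tables for GF(2^8), polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d), generator 2.
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]  # duplicated so LOG[a] + LOG[b] never needs a modulo


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def pq(data):
    """Return (P, Q) for a list of equal-length data blocks (at most 255 of them)."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data):
        coeff = EXP[i]  # 2**i in the field
        for k, byte in enumerate(block):
            p[k] ^= byte
            q[k] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def locate_corruption(data, p, q):
    """Return the index of the one bad data block, 'P', 'Q', or None if consistent."""
    p_now, q_now = pq(data)
    for k in range(len(p)):
        sp = p[k] ^ p_now[k]  # equals the error value e if a data block is bad
        sq = q[k] ^ q_now[k]  # equals 2**z * e if data block z is bad
        if sp and sq:
            return (LOG[sq] - LOG[sp]) % 255
        if sp:
            return "P"
        if sq:
            return "Q"
    return None
```

Once the index z is known, the bad byte can be repaired by XORing it with the P syndrome. If two or more blocks are wrong at once, the index computed this way is meaningless, so a real policy would still want a checksum or a second opinion before rewriting anything.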

The 2006 Linux Filesystems Workshop (Part II)

Posted Jun 26, 2007 11:27 UTC (Tue) by razb (subscriber, #43424) [Link]

In Israel, fsck is pronounced like this: F-sick

Copyright © 2006, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds