
What ever happened to chunkfs?

June 17, 2009

This article was contributed by Valerie Aurora

"What ever happened to chunkfs?" This is a question I hear every few months, and I always answer the same way, "Chunkfs works, the overhead is reasonable, and it is only practical if it is part of the file system design from the beginning, not tacked on after the fact. I just need to write up the paper summarizing all the data." Thanks to your benevolent LWN editor, you are now reading that paper.

Background

Before we describe chunkfs, we should first talk about the problems that led to its creation. Chunkfs was motivated by the growing difficulty of reading and processing all the data necessary to check an entire file system. As the capacity of storage grows, the capacity of the rest of the system to read and check that data is not keeping pace with that growth. As a result, the time to check a "normal" file system grows, rather than staying constant as it would if the memory, bandwidth, seek time, etc. of the system grew in proportion to its storage capacity. These differential rates of growth in hardware - like the differences between RAM capacity, bandwidth, and latency - are part of what keeps systems research interesting.

To understand the change in time to check and repair a file system, a useful analogy is to compare the growth of my library (yes, some people still own books) with the growth of my ability to read, organize, and find the books in it. As a child, my books all fit on one shelf. I could find the book I wanted in a matter of seconds and read every single one of my books in a week. As an adult, my books take up several bookshelves (in addition to collecting on any flat surface). I can read many times faster than I could when I was a kid, and organize my books better, but finding a book can take several minutes, and reading them all would take me a year or more. If I were trying to find a twenty dollar bill I left in a book, it would take me several hours to riffle through all the books to find it. Even though I am a better reader now, my library grew faster than my ability to read or search it. Similarly, computers are faster than ever, but storage capacity grew even faster.

The consequence of this phenomenon that first caught my attention was the enormous disparity between seek time and capacity of disks. I calculated a projected change in fsck time given these predictions [PDF] from a storage industry giant in 2006:

Improvement in disk performance, 2006 - 2013

    Capacity:    16.0x
    Bandwidth:    5.0x
    Seek time:    1.2x

    => at least 10x increase in fsck time!
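
As a quick sanity check on that last line, here is the back-of-the-envelope arithmetic behind it (a reading of the table above, not numbers taken from the roadmap itself):

    # Projected 2006-2013 improvement factors from the table above.
    capacity  = 16.0   # amount of data to check grows 16x
    bandwidth =  5.0   # sequential read rate grows 5x
    seek      =  1.2   # random I/O rate grows only 1.2x

    print(capacity / bandwidth)  # ~3.2x  if fsck were purely bandwidth-bound
    print(capacity / seek)       # ~13.3x if fsck were purely seek-bound
    # fsck on a real file system is a mix, but heavily seek-bound,
    # hence "at least 10x".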

Since then, commodity flash-based storage has finally become a practical reality. Solid-state disks (SSDs) have much faster "seeks", somewhat improved bandwidth, and drastically reduced capacity compared to disks, so the performance of fsck ought to be extremely good. (I am unable to find any measurements of fsck time on an SSD, but I would estimate it to be on the order of seconds. Readers?) However, SSDs don't solve all our problems. First, SSDs are an element in the cache hierarchy of a system, layered between system RAM and disk, not a complete replacement for disks. The sweet spot for SSDs will continue to expand, but it will be many years or decades before disks are completely eliminated - look at the history of tape. It's easy for a laptop user to wave their hands and say, "Disks are obsolete," but try telling that to Google, your mail server sysadmin, or even your MythTV box.

Second, remember that the whole reason we care about fsck in the first place is that file systems get corrupted. One important source of file system corruption is failures in the storage hardware itself. Ask any SSD manufacturer about the frequency of corruption on their hardware and they will inundate you with equations, lab measurements, and simulations showing that the flash in their SSDs will never wear out in "normal" use - but they also won't reveal the details of their wear-leveling algorithms or the workloads they used to make their predictions. As a result, we get surprises in real-world use, both in performance and in reliability. When it comes to failure rates, we don't have good statistics yet, so I have to fall back on my experience and that of my kernel hacker friends. What we see in practice is that SSDs corrupt more often and more quickly than disks, and prefer to devour the tastiest, most crucial parts of the file system. You can trust the hardware manufacturers when they wave their hands and tell you not to worry your pretty little head, but I put more weight on the money I've made consulting for people with corrupted SSDs.

So, if corruption is a concern, an important benefit of the chunkfs approach is that file system corruption is usually limited to a small part of the file system, regardless of the source of the corruption. Several other repair-driven file system principles - such as duplicating and checksumming important data - make recovery of data after corruption much more likely. So if you don't care about fsck time because you'll be using an SSD for the rest of your life, you might still read on because you care about getting your data back from your SSD.

The conclusion I came to is that file systems should be designed with fast, reliable check and repair as a goal, from the ground up. I outline some useful methods for achieving this goal in a short paper: Repair-driven File System Design [PDF].

Chunkfs design

Chunkfs is the unmusical name for a file system architecture designed under the assumption that the file system will be corrupted at some point. Therefore the on-disk format is optimized not only for run-time performance but also for fast, reliable file system check and repair. The basic concept behind chunkfs is that a single logical file system is constructed out of multiple individual file systems (chunks), each of which can be checked and repaired (fscked) individually.

[chunkfs]

This is great and all, but now we have a hundred little file systems, with the namespace and disk space fragmented amongst them - hardly an improvement. For example, if we have a 100GB file system divided into 100 1GB chunks, and we want to create a 2GB file, it won't fit in a single chunk - it has to somehow be shared across multiple chunks. So how do we glue all the chunks back together again so we can share the namespace and disk space while preserving the ability to check and repair each chunk individually? What we want is a way to connect a file or directory in one chunk to a file or directory in another chunk in such a way that the connection can be quickly and easily checked and repaired without a full fsck of the rest of the chunk. The solution is something we named a "continuation inode". A continuation inode "continues" a file or directory into another chunk.

[file growth]

You'll notice that there are two arrows in the picture to the left, one pointing from the original inode to the continuation inode, and another pointing back from the continuation inode to the original inode. When checking the second chunk, you can check the validity of the continuation inode quickly, by using the back pointer to look up the original inode in its chunk. This is a check of a "cross-chunk reference" - any file system metadata that makes a connection between data in two different chunks. Cross-chunk references must always have forward and back pointers so that they can be verified starting from either chunk.
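
To make the bidirectional check concrete, here is a toy sketch in Python; the structures and field names below are invented for illustration and are not chunkfs's actual on-disk format:

    # Toy model of a cross-chunk reference; illustrative only.
    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass(frozen=True)
    class InodeRef:
        chunk: int          # chunk number
        ino: int            # inode number within that chunk

    @dataclass
    class Inode:
        ref: InodeRef
        forward: Optional[InodeRef] = None   # continuation in another chunk, if any

    @dataclass
    class Continuation:
        ref: InodeRef       # where this continuation inode lives
        back: InodeRef      # pointer back to the inode it continues

    def check_continuation(cont: Continuation,
                           lookup: Callable[[InodeRef], Optional[Inode]]) -> bool:
        # Verify one cross-chunk reference from the continuation's side:
        # only the chunk named by cont.back has to be consulted, not the
        # whole file system.
        origin = lookup(cont.back)
        return origin is not None and origin.forward == cont.ref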

Cross-chunk references must satisfy one other requirement: You must be able to quickly find all cross-references from chunk A to arbitrary chunk B, without searching or checking the entire chunk A. To see why this must be, we'll look at an example. Chunk A has an inode X which is continued into chunk B. What if chunk A was corrupted in such a manner that inode X was lost entirely? We would finish checking and repairing chunk A, and it would be internally self-consistent, but chunk B would still have a continuation inode for X that is now impossible to look up or delete - an orphan continuation inode. So as the second step in checking chunk A, we must then find all references to chunk A from chunk B (and all the other chunks) and check that they are in agreement. If we couldn't, we would have to search every other chunk in the file system to check a single chunk - and that's not quick.
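
In other words, each chunk needs to keep an index of its cross-chunk references keyed by the other chunk involved. Continuing the toy sketch above (again, illustrative only, not the real implementation):

    # outgoing: for one chunk, a dict mapping {other chunk number: list of
    # Continuation objects in this chunk that connect to that other chunk}.
    # lookup(ref) returns the origin Inode, or None if repair deleted it.
    def find_orphan_continuations(outgoing, damaged_chunk, lookup):
        return [cont for cont in outgoing.get(damaged_chunk, [])
                if lookup(cont.back) is None]

    # After the damaged chunk has been repaired internally, every other chunk
    # consults only its outgoing[damaged_chunk] list - a handful of entries,
    # not the whole chunk.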

A chunkfs-style file system can be checked and repaired incrementally and online, whereas most file systems must be checked all at once while the file system is offline. Some exceptions to this general rule do exist, such as the BSD FFS snapshot-based online check, and an online self-healing mechanism for NTFS, but in general these facilities are hacked on after the fact and are severely limited in scope. For example, if the BSD online check actually finds a problem, the file system must be unmounted and repaired offline, all at once, in the usual way, including a rerun of the checking stage of fsck.

The chunkfs design requires solutions to several knotty problems we don't have room to cover in this article, but you can read about them in our 2006 Hot Topics in Dependability paper: Chunkfs: Using divide-and-conquer to improve file system reliability and repair [PDF].

Measurements

Repair-driven file system design, chunkfs in particular, sounds like a neat idea, but as a file system developer, I've had a lot of neat ideas that turned out to be impossible to implement, or had too much overhead to be practical. The questions I needed to answer about chunkfs were: Can it be done? If it can be done, will the overhead of continuation inodes outweigh the benefit? In particular, we need to balance the time it takes to check an individual chunk with the time it takes to check its connections to other chunks (making sure that the forward and back pointers of the continuation inodes agree with their partners in the other chunk). First, let's take a closer look at file system check time on chunkfs.

The time to check and repair a file system with one damaged chunk is the sum of two different components, the time to check one chunk internally and the time to check the cross references to and from this chunk.

T_fs = T_chunk + T_cross

The time to check one chunk is a function of the size of the chunk, and the size of the chunk is the total size of the file system divided by the number of chunks.

T_chunk = f(size_chunk)

size_chunk = size_fs / n_chunks

The time to check the cross-chunk references to and from this chunk depends on the number of those cross-references. The exact number of cross-chunk references will vary, but in general larger chunks will have fewer cross-references and smaller chunks will have more - that is, cross-chunk check time grows as the number of chunks grows.

T_cross = g(n_chunks)

So, per-chunk check time gets smaller as we divide the file system into more chunks, and at the same time the cross-chunk check time grows larger. Additionally, the extra disk space taken up by continuation inodes grows as the number of chunks grows, as does the overhead of looking up and following continuation inodes during normal operation. We want to find a sweet spot where the sum of the time to check a chunk and its cross-references is minimized, while keeping the runtime overhead of the continuation inodes small.
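
To show the shape of that trade-off, here is a toy cost model with invented constants (none of these numbers come from the chunkfs measurements):

    # Per-chunk check time shrinks as chunks multiply; cross-reference check
    # time grows. All constants below are assumptions for illustration.
    def total_check_time(n_chunks,
                         fs_size_gb=1000,
                         secs_per_gb=2.0,      # assumed internal check cost
                         secs_per_ref=0.01,    # assumed cost per cross-reference
                         refs_per_chunk=5):    # assumed references touching a chunk
        t_chunk = secs_per_gb * fs_size_gb / n_chunks
        t_cross = secs_per_ref * refs_per_chunk * n_chunks
        return t_chunk + t_cross

    best = min(range(1, 2001), key=total_check_time)
    print(best, round(total_check_time(best), 1))  # the sweet spot for this toy model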

Does such a sweet spot exist? The answer depends on the layout of files and directories on disk in real life. If cross-chunk references are extremely common, then the overhead of continuation inodes will outweigh the benefits. We came up with an easy way to estimate the number of cross-chunk references in a chunkfs file system: Take a "real" in-use ext3 file system and for each file, measure the number of block groups containing data from that file. Then, for each directory, measure the number of block groups containing files from that directory. If we add up the number of block groups less one for all the files and directories, we'll get the number of cross-chunk references in a similar chunkfs file system with chunks the size of the ext3 file system's block groups (details here).

Karuna Sagar wrote a tool to measure these cross-references, called cref, and I added some code to do a worst-case estimate of the time to check the cross-references (assuming one seek for every cross-reference). The results were encouraging; assuming disk hardware progresses as predicted, the average cross-chunk reference checking time would be about 5 seconds in 2013, and the worst case would be about 160 seconds (about 2.5 minutes). This is with a 1GB chunk size, so the time to check the chunk itself would be a few seconds. This estimate is worst-case in another way: the ext3 allocator is in no way optimized to reduce cross-chunk references. A chunkfs-style file system would have a more suitable allocator.
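
A minimal sketch of the estimation method (the block-group data below is made up; cref gathers the real numbers from an in-use ext3 file system):

    # For each file and directory, the set of block groups it touches.
    # Each group beyond the first implies one cross-chunk reference in a
    # chunkfs whose chunks match ext3's block groups.
    file_groups = {
        "/home/val/thesis.pdf":  {12, 13},   # file data spans two block groups
        "/home/val/music/a.ogg": {40},
    }
    dir_groups = {
        "/home/val": {12, 40},               # its files live in two groups
    }

    def cross_refs(*maps):
        return sum(len(groups) - 1 for m in maps for groups in m.values())

    n = cross_refs(file_groups, dir_groups)
    seek_seconds = 0.008                     # assume one ~8ms seek per reference
    print(n, "cross-references,", n * seek_seconds, "seconds worst case")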

Implementations

Chunkfs was prototyped three times. The first prototype, written by Amit Gud for his master's thesis [PDF] in 2006-2007, implemented chunkfs as modifications to the ext2 driver in FUSE. Note that the design of chunkfs described in the thesis is out of date in some respects; see the chunkfs paper [PDF] for the most recent version. He also ported this implementation to the kernel in mid-2007, just for kicks. The results were encouraging. Our main concern was that continuation inodes would proliferate wildly, overwhelming any benefits. Instead, files with continuation inodes were uncommon - 2.5% of all files in the file systems - and no individual file had more than 10 continuation inodes. The test workload included some simulation of aging by periodically deleting files while filling the test file system. (It would be interesting to see the results from using Impressions [PDF], a very sophisticated tool for generating realistic file system images.)

These implementations were good first steps, but they were based on an earlier version of the chunkfs design, before we had solved some important problems. In these implementations, any chunk that had been written to since the file system was mounted had to be fscked after a crash. Given that most of the file system is idle in common usage, this reduced check time by about two-thirds in the test cases, but we are looking for a factor of 100, not a factor of 3. They also lacked a quick way to locate all of the references into a particular damaged chunk, so they only checked the references leading out of the chunk. They used an old version of the solution for hard links, which would allow hard link creation to fail if the target's chunk ran out of space, instead of growing a continuation for the target into a chunk that did have space. In short, these prototypes were promising, but lacked the key feature: drastically reduced file system repair time.

In mid-2007, I decided to write a prototype of chunkfs as a layered file system, similar to unionfs or ecryptfs, that was completely independent of the client file system used in each chunk. Continuation inodes are implemented as regular files, using extended attributes to store the continuation-related metadata (the forward and back pointers, and the offset and length of the file data stored in this continuation inode). When the file data exceeds an arbitrary size (40K in my prototype), a continuation inode is allocated in another chunk and the data past that point in the file is stored in that inode. All of the continuation inodes emanating from a particular chunk are kept in one directory so that they can be quickly scanned during the cross-chunk pass of fsck.
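
From user space, that bookkeeping might look something like the following sketch; the extended attribute names are invented here, not taken from the prototype:

    # Linux-only sketch: store continuation metadata as extended attributes
    # on ordinary files. The "user.chunkfs.*" names are illustrative.
    import os

    CONTINUATION_THRESHOLD = 40 * 1024   # the prototype continued files past ~40K

    def link_continuation(origin_path, cont_path, offset, length):
        os.setxattr(origin_path, "user.chunkfs.forward", cont_path.encode())
        os.setxattr(cont_path, "user.chunkfs.back", origin_path.encode())
        os.setxattr(cont_path, "user.chunkfs.offset", str(offset).encode())
        os.setxattr(cont_path, "user.chunkfs.length", str(length).encode())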

To test that the prototype could recover from file system corruption in one chunk without checking the entire file system, I implemented fsck as a simple shell script. First, it fscks the damaged chunk by running the ext2 fsck on it. This may end up deleting or moving arbitrary files in the chunk, which could make it out of sync with the other chunks. Second, it mounts the now-repaired file systems and reads their /chunks directories to find all connections to the damaged chunk and consistency check them. If it finds an orphaned continuation - a continuation whose origin inode in the damaged chunk was destroyed or lost - then it deletes that continuation inode.
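
A rough Python rendering of the same flow is below; the device paths, mount points, and /chunks details are invented for illustration, not copied from the prototype:

    # Repair one damaged chunk, then prune orphaned continuations elsewhere.
    import os
    import subprocess

    CHUNKS = ["/dev/mapper/chunk0", "/dev/mapper/chunk1", "/dev/mapper/chunk2"]

    def repair(damaged):
        # Step 1: ordinary per-chunk fsck on the damaged chunk only.
        subprocess.run(["fsck.ext2", "-y", CHUNKS[damaged]], check=True)

        # Step 2: scan every chunk's /chunks directory for continuations whose
        # origin lived in the damaged chunk and was lost during the repair.
        for i, dev in enumerate(CHUNKS):
            mnt = f"/mnt/chunk{i}"
            os.makedirs(mnt, exist_ok=True)
            subprocess.run(["mount", dev, mnt], check=True)
            cont_dir = os.path.join(mnt, "chunks")
            if not os.path.isdir(cont_dir):
                continue
            for name in os.listdir(cont_dir):
                cont = os.path.join(cont_dir, name)
                back = os.getxattr(cont, "user.chunkfs.back").decode()
                if back.startswith(f"/mnt/chunk{damaged}/") and not os.path.exists(back):
                    os.unlink(cont)          # orphaned continuation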

The fsck script is deficient in one particular aspect: it checks all the chunks, because I didn't write the code to mark a chunk as damaged. This is a difficult problem in general; sometimes we know that a chunk has been damaged because the disk gave us an IO error, or we found an inconsistency while using the file system, and then we mark the file system as corrupted and needing fsck. But plenty of corruption is silent - how can we figure out which chunk was silently corrupted? We can't, but we can quietly "scrub" chunks by fscking them in the background. Currently, ext2/3/4 triggers a paranoia fsck every N mounts or M days since the last check. Unfortunately, this introduces an unexpected delay of minutes or hours at boot time; if you're lucky, you can go have a cup of coffee while it finishes; if you're unlucky, you'll be figuring out how to disable it while everyone who has come to your talk about improvements in Linux usability watches you. (N.B.: Reboot and add "fastboot" to the kernel command line.) With chunkfs, we could run fsck on a few chunks at every boot, adding a few seconds to every boot but avoiding occasional long delays. We could also fsck inactive chunks in the background while the file system is in use.
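
A minimal sketch of the few-chunks-per-boot idea, assuming a small state file that remembers where the previous scrub pass stopped (the path and counts are made up):

    # Round-robin background scrub: check a few chunks each boot, resuming
    # where the last pass left off. Everything here is illustrative.
    import json
    import os

    STATE_FILE = "/var/lib/chunkfs/scrub.json"
    N_CHUNKS = 100
    PER_BOOT = 3

    def next_chunks_to_scrub():
        start = 0
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                start = json.load(f).get("next", 0)
        todo = [(start + i) % N_CHUNKS for i in range(PER_BOOT)]
        with open(STATE_FILE, "w") as f:
            json.dump({"next": (start + PER_BOOT) % N_CHUNKS}, f)
        return todo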

Results

I gave a talk at LCA in January 2008 about chunkfs which included a live demo of the final chunkfs prototype in action: creating files, continuing them into the next chunk, deliberately damaging a chunk, and repairing the file system. Because I carefully prepared a typescript-based fallback in case things went sideways, the demo went perfectly. A video of the talk [Ogg Theora] is available.

Note that I have never been able to bring myself to watch this talk, so I don't know if it contains any embarrassing mistakes. If you want, you can follow along during the demo section with the annotated output from the demo.

The collection-of-file-systems approach ended up being more complex than I had hoped. The prototype used ext2 as the per-chunk file system - a reasonable choice for some throw-away code, but not workable in production. However, once you switch to any reliable file system, you end up with one journal per chunk, which is quite a lot of overhead. In addition, it seemed likely we'd need another journal to recover from crashes during cross-chunk operations. Another source of complexity was the use of a layered file system approach - basically, the chunkfs layer pretends to be a client file system when it's talking to the VFS, and pretends to be the VFS when it's talking to the per-chunk file system. Since none of this is officially supported, we end up with a long list of hacks and workarounds, and it's easy to run into surprise reference counting or locking bugs. Overall, the approach worked for a prototype, but it didn't seem worth the investment it would take to make it production quality, especially with the advent of btrfs.

Future work

When I began working on chunkfs, the future of Linux file systems development looked bleak. The chunkfs in-kernel prototype looked like it might be the only practical way to get repair-driven design into a Linux file system. All that has changed; file system development is a popular, well-funded activity and Linux has a promising next-generation file system, btrfs, which implements many principles of repair-driven file system design, including checksums, magic numbers, and redundant metadata. Chunkfs has served its purpose as a demonstration of the power of the repair-driven design approach and I have no further development plans for it.

Conclusions

The three chunkfs prototypes and our estimates of cross-chunk references using real-world file systems showed that the chunkfs architecture works as advertised. The prototypes also convinced us that it would be difficult to retrofit existing journaling file systems to the chunkfs architecture. Features that make file system check and repair fast and reliable work best when designed into the architecture of the file system from the beginning. Btrfs is an example of a file system designed from the ground up with the goal of fast, reliable check and repair.

Credits

Many people and organizations generously contributed to the design and prototyping of chunkfs. Arjan van de Ven was the co-inventor of the original chunkfs architecture. Theodore Ts'o and Zach Brown gave invaluable advice and criticism while we were thrashing out the details. The participants of the Hot Topics in Dependability Workshop gave us valuable feedback and encouragement. Karuna Sagar wrote the cross-reference measuring tool that gave us the confidence to go forward with an implementation. Amit Gud wrote two prototypes while a graduate student at Kansas State University. The development of the third prototype was funded at various points by Intel, EMC, and VAH Consulting. Finally, too many people to list made valuable contributions during discussions about chunkfs on mailing lists, and I greatly appreciate their help.




Maybe btrfs has no fsck,

Posted Jun 17, 2009 13:41 UTC (Wed) by qu1j0t3 (guest, #25786) [Link] (16 responses)

But there are several other mature filesystems which avoid it on an unexpected shutdown -
reiser3fs, just-released NILFS2, and of course the magnum opus, ZFS.

Maybe btrfs has no fsck,

Posted Jun 17, 2009 14:21 UTC (Wed) by anselm (subscriber, #2796) [Link] (14 responses)

Not having to do an fsck after an unclean shutdown is not the same as not being able to do an fsck at all even if one wanted to. File systems can become corrupted for various reasons other than sudden system crashes.

Maybe it's just me, but I don't think I'd be happy with a file system for which there is no reasonable fsck program.

Maybe btrfs has no fsck,

Posted Jun 17, 2009 17:54 UTC (Wed) by drag (guest, #31333) [Link] (13 responses)

Yes. Reiserfs needs a fsck... whether or not it provides one is kinda irrelevant. I think that a fsck for reiserfs is provided, but I am not sure. Maybe only for later versions. Never used it much myself.

One of the strengths of Ext3 over XFS and Reiserfs is its fsck. The journalling features of XFS and Reiserfs only protect the filesystem (aka metadata) from corruption; they do not help protect your actual data or detect problems with your data. For that you need to do fsck for Ext3.

Maybe btrfs has no fsck,

Posted Jun 17, 2009 21:36 UTC (Wed) by anselm (subscriber, #2796) [Link] (7 responses)

What I've heard about the Reiserfs fsck is that it will, among other issues, mistake a file system superblock in the middle of a partition for the start of a whole new file system and get thoroughly confused. With virtualisation in widespread use and people keeping file system images in files this is more of a problem than it used to be when Reiserfs was new.

If this is actually true, it is one more reason to bury Reiserfs deeply, with a wooden stake driven through its heart.

Maybe btrfs has no fsck,

Posted Jun 18, 2009 8:46 UTC (Thu) by pcampe (guest, #28223) [Link]

I vaguely remember the problem you have described, something like a report from a key kernel developer; I am not sure it was ReiserFS the file system, nor if a workaround has been implemented.

Maybe btrfs has no fsck,

Posted Jun 18, 2009 11:33 UTC (Thu) by nye (subscriber, #51576) [Link] (5 responses)

That was basically FUD. In reiserfsck there's an option to rebuild the filesystem entirely - you're essentially telling it 'this filesystem is trashed; just look through the disk for anything that might be a valid filesystem structure and cobble it together if you can'.

The 'problem' of mistaking a superblock in some image you have somewhere for the start of a new fs is *exactly what you asked it to do*, so those who complained so loudly about it really have nobody to blame but themselves.

It'll be funny if it was not so sad

Posted Jun 18, 2009 12:16 UTC (Thu) by khim (subscriber, #9252) [Link] (4 responses)

The 'problem' of mistaking a superblock in some image you have somewhere for the start of a new fs is *exactly what you asked it to do*, so those who complained so loudly about it really have nobody to blame but themselves.

Yup. They made one mistake, but that mistake was grave: they assumed they could trust reiserfs. I've seen a few cases where a tiny number of bad blocks killed reiserfs completely: nothing except this "cobble it together if you can" option worked, and the "cobbled together" filesystem was a mess (because there were some virtual machine images on that filesystem). Note: this is exactly the type of corruption SSDs show in the real world. If reiserfs's "gentle" fsck does not work and the "last resort" approach is unusable, then the only solution is to switch to another filesystem...

It'll be funny if it was not so sad

Posted Jun 18, 2009 16:29 UTC (Thu) by nye (subscriber, #51576) [Link] (3 responses)

Meh, the only FS I've ever lost data to - aside from understandably unrecoverable data on damaged disk blocks - was XFS[0], and I used reiser3 extensively until a couple of years ago when it was obvious that ship had sailed.

Perhaps I merely got lucky with my disks not failing in exactly the wrong way, but there's always going to be an anecdote for everything.

[0](it's certainly the only FS that I actively despise, and would unreservedly recommend against in all circumstances)

The problem is: disks DO fail and reiserfs is TOTALLY unready

Posted Jun 19, 2009 10:58 UTC (Fri) by khim (subscriber, #9252) [Link] (1 responses)

Meh, the only FS I've ever lost data to - aside from understandably unrecoverable data on damaged disk blocks

But bad blocks DO exist in real life - you cannot just ignore them! The reiserfs design NEVER considered this facet of life: if you have ONE bad block in the wrong place, you are screwed 100%. When the sector size is 512 bytes and the HDD is 2TiB, the loss of 0.0000000003% of your data means 100% of your stuff is lost. This is not even funny.

XFS is also not a good idea (I was bitten by it too) - but there we have a bad implementation, not a bad design. An implementation can be fixed; a design mistake is unfixable.

The problem is: disks DO fail and reiserfs is TOTALLY unready

Posted Jun 22, 2009 16:22 UTC (Mon) by nye (subscriber, #51576) [Link]

>But the bad blocks DO exist in real life - you can not just ignore them!

I certainly don't disagree; I was referring to the data actually on the damaged part of the disk, which I wouldn't reasonably expect to be able to recover without great expense.

It'll be funny if it was not so sad

Posted Jun 21, 2009 6:47 UTC (Sun) by cventers (guest, #31465) [Link]

I was a reiser3 user for quite a while and often sung its praises. A few
years into that stretch, my PC started having stability problems which I
tracked to bad RAM.

I caught the bad RAM pretty quickly, and considered myself lucky that I
hadn't obviously lost any big chunks of data... I had seen the ReiserFS
journal check making some noise in dmesg but everything seemed to work.

However, replacing the RAM didn't solve the stability problem. The nature
of the problem changed... it became a random system freeze. At the time, I
didn't realize that I had a new problem - hidden filesystem corruption.

After a couple of big scares with "md" after the system had randomly
frozen, I made a full backup of the filesystem. I continued using the
computer, but the stability problem seemed to be getting worse. I
installed a brand new monster power supply and over the course of the next
month or two I burned a lot of money replacing the rest of the system,
thoroughly confused that I hadn't nailed the problem. (Mockingly, it often
seemed that replacing a part would make the problem go away for a day or
two, leading me to believe I'd fixed it until it slapped me in the face in
the middle of my work yet again.)

My full filesystem backup became handy after I was unable to bring the
filesystem online one time. reiserfsck made lots of noise about problems
with my data and was unable to repair it. I was frustrated to have lost a
month's worth of data, but thrilled that I had a backup at all.

Sadly, I lost the filesystem a few more times and burned even more time
and money on the computer before I realized that with all the hardware
having been replaced, I needed to consider what I had considered to be the
unlikely cause: the software. I became suspicious of reiserfs. This time,
rather than restoring again from my old reiserfs image, I made an ext3
partition, mounted the reiserfs image read-only and migrated.

My system never froze again.

I don't know enough about the reiserfs design to know how plausible my
hypothesis is, but it seems that the bad RAM I dealt with a long time ago
had led to a reiserfs filesystem which was "doomed". I assume the bad RAM
provided the initial corruption, some sort of corruption that made the
reiserfs kernel code fall on its face. Sometimes, the system accessed the
"wrong" bit of corrupted data and the kernel would panic or hang somewhere
inside reiserfs, spreading the corruption in the process.

There's a shocking bit of irony in this particular failure mode. Because
the backup I always restored from was a reiserfs image taken with dd, the
only way I was ever going to escape the crashes and repeated loss of my
data was to abandon reiserfs.

Maybe btrfs has no fsck,

Posted Jun 18, 2009 11:37 UTC (Thu) by viiru (subscriber, #53129) [Link] (1 responses)

> One of the strengths of Ext3 over XFS and Reiserfs is it's fsck. The
> journalling features of XFS and Reiserfs only protect the filesystem (aka
> metadata) from corruption, it does not help protect your actual data or
> detect problems with your data. For that you need to do fsck for Ext3.

Well, actually XFS has both the ability to recover from an unclean shutdown without fsck, and a full-featured repair tool. Don't be confused by the fact that the fsck.xfs tool is essentially /bin/true. The repair tool exists, but it goes by the name of xfs_repair.

I've been using XFS on Linux in production on most of my machines for the past six years or so, and have needed to run xfs_repair twice. Haven't lost any files, either.

The "it eats your filez"-reputation of XFS has been greatly exaggerated.

If only

Posted Jun 18, 2009 12:20 UTC (Thu) by khim (subscriber, #9252) [Link]

The "it eats your filez"-reputation of XFS has been greatly exaggerated.

I've had a rock-solid way to reproduce this effect: run a bittorrent client on a 100% full filesystem. Sure, this is not a nice thing to do to a filesystem (and currently btrfs does not handle this case all that well), but stuff happens. If I cannot trust my filesystem in such conditions, how can I trust it at all?

It looks like the XFS problems are in the past, but trust is easy to lose and hard to resurrect - now I'm firmly in the ext3 camp.

Maybe btrfs has no fsck,

Posted Jun 23, 2009 6:18 UTC (Tue) by nix (subscriber, #2304) [Link] (2 responses)

Journals do not protect the filesystem metadata from corruption. They
protect it from being in an inconsistent state after a crash (i.e. part of
the metadata written, other parts not written).

If a kernel bug, cosmic-ray-induced bitflip, or transient drive bug
corrupts the filesystem you still need a fsck. And in the end, that *will*
happen, even with PCIe and ECCRAM: after all, the *CPU* doesn't checksum
everything inside itself, and ECCRAM can't detect all possible failure
modes.

Maybe btrfs has no fsck,

Posted Jun 24, 2009 19:19 UTC (Wed) by salimma (subscriber, #34460) [Link] (1 responses)

That's why Btrfs, ZFS and (I think) Dragonfly BSD's HammerFS have checksums for each block.

Maybe btrfs has no fsck,

Posted Jun 26, 2009 11:16 UTC (Fri) by mangoo (guest, #32602) [Link]

How would that help if I, for example, copy one block and its checksum into another area of the disk? Essentially, the block will be valid (checksum matches), but the filesystem will be corrupted.

In a virtual environment, it's not so hard to make such a mistake: just accidentally mount the filesystem twice (i.e. from a guest and a host), and two different kernels will write correct blocks all over, each one corrupting the filesystem.

Maybe btrfs has no fsck,

Posted Jun 18, 2009 17:23 UTC (Thu) by vaurora (guest, #38407) [Link]

Oops, forgot to link to my fsck article:

http://lwn.net/Articles/248180/

Basically, when people say "file system X doesn't need fsck," what they usually mean is that it can recover from a crash without running a program to check and repair the entire file system. (It may have to replay a few entries from the log but that's it.) Every file system still needs a "fsck" to check and repair the file system when it gets corrupted, it's just that old file systems like ext2 let the file system get corrupted at every crash.

btrfs has a rudimentary file system check. The plan is for a full-featured check and repair.

What ever happened to chunkfs?

Posted Jun 17, 2009 16:17 UTC (Wed) by philipstorry (subscriber, #45926) [Link]

An excellent article - well written and on a very interesting topic.

I also think it's worth thanking Valerie for stopping development. It can be a difficult decision to do so, and takes bravery and good judgement. Chunkfs, especially as a VFS, would likely have become one huge furball of hacks. To accept that it had served its purpose and that time could be better spent elsewhere looks like the right thing.

Now all I have to do is wait for btrfs to bring me all of this. I have resorted to manual mirroring via scheduled rsyncs, which balances all the pros and cons nicely for me. I'd never stop doing mirroring in some form, but to be able to trust hardware (or any lower level that includes the fs) mirroring would be nice for fileservers.

So, thanks again Valerie and all involved with chunkfs. Good to know that the future will be better! :-)

RAID rebuild

Posted Jun 18, 2009 7:30 UTC (Thu) by rbuchmann (guest, #52862) [Link] (4 responses)

With magnetic disk drives it's not uncommon that a read error (marking a disk as faulty) will go away after a write (due to sector reallocation).

So RAID rebuilds happen from time to time.

A similar solution to chunkfs for fast RAID rebuild is this:

- take two or more disk drives
- partition them in smaller chunks (say 50GB or less)
- build RAID(1+) across the chunks of different drives

This will make RAID rebuilds necessary only for the "damaged" chunks. And it already helped me a few times.

RAID rebuild

Posted Jun 20, 2009 21:29 UTC (Sat) by anton (subscriber, #25547) [Link]

With magnetic disk drives it's not uncommon that a read error (marking a disk as faulty) will go away after a write (due to sector reallocation).
That's the theory, and it's quite plausible, if there are spare blocks on the disk, but I have seen several drives (from different manufacturers) with read errors that were also write errors, and none where the error went away by writing. And it's not that these drives had run out of spare blocks or something; the errors apparently were caused by the head running amok in unusual power supply conditions.

RAID rebuild

Posted Jun 22, 2009 7:05 UTC (Mon) by neilbrown (subscriber, #359) [Link] (2 responses)

It sounds to me like you need to discover write-intent bitmaps.

Such a bitmap is effectively a set of 'dirty' bits, one for each chunk of the array (and you can choose the chunk size).

So if you set the chunk size to 50GB (I would probably set it a bit smaller) you get the same functionality as you describe, only with much less hassle.

So just create a raid1 or - if you have more than 2 drives - raid10, and

 
  mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=1000000
and you will be much happier.

RAID rebuild

Posted Jun 23, 2009 9:51 UTC (Tue) by rbuchmann (guest, #52862) [Link] (1 responses)

What happens if a drive is marked faulty during a read? To my understanding the write-intent bit is not set then, so the broken chunk would not be rewritten?

RAID rebuild

Posted Jun 23, 2009 11:08 UTC (Tue) by neilbrown (subscriber, #359) [Link]

A drive is not marked faulty due to a read error (unless the array is degraded ... and even then it probably shouldn't be.... I should fix that).

If md gets a read error when the array is not degraded, it generates the data from elsewhere and tries to write it out. If the write fails, then the drive is marked faulty.

It has not always been that way, but it has for a couple of years.

Now that I think about it, there is probably still room for improvement. If it is kicked due to a write error, and it was a cable error, it would be nice if we could re-add the device and it would recover based on the bitmap. I'll add that to my list....

(sorry, I didn't read the first part of your comment properly before - I only read the second half and was responding to that. I should learn to read better ;-)

Seek times

Posted Jun 18, 2009 14:10 UTC (Thu) by tajyrink (subscriber, #2750) [Link] (1 responses)

I'd say the (random) seek time estimate of a 1.2x speedup is an overestimate. During the last ca. 15 years, I'd say the whole increase has been not much above 0% in the normal case (7200 rpm hard drives), with the possible exception of later command queueing techniques speeding up even random seeks a bit. I remember 850MB Quantum Fireballs having pretty speedy seek times.

So one way of approximating is that during 1994-2009:
1000x capacity
20x faster transfer rate
1.1x faster seek times

And I think application developers still don't generally understand that anything seeking the hard drive is killing performance. Just look at the login times of popular desktop environments.

Seek times

Posted Jun 18, 2009 17:43 UTC (Thu) by vaurora (guest, #38407) [Link]

I'm just quoting Seagate's roadmap on the change in seek time. :) Not going to argue with them.

A couple years back a friend resurrected an ancient Amiga (1993 era) and ran the included disk performance tests. Seek times were less than twice that of the very latest modern laptop hard drive. You can read the numbers, but having the actual hardware there in front of me made it truly sink in.

I'm sorry to see this abandoned

Posted Jun 18, 2009 22:26 UTC (Thu) by dlang (guest, #313) [Link] (2 responses)

I was actually getting ready to track this down.

I don't see the need for a journal on each chunk as being a critical problem. It does mean that the plan of having 1G chunks doesn't work, but for many larger filesystems that was already questionable (I try to split my _files_ into 1G chunks if I can, but if they are < 10g I may not try too hard)

I have a use case where I want to have a ~140 TB array that will essentially be a single directory of files, I was thinking in terms of chunk sizes in the multi-TB size, at that point the need to have a separate journal per chunk isn't that bad in terms of overhead

yes, 1G chunks with many small files can be reasonable, but in most cases where you are talking about extremely large logical drives, you aren't dealing with text files, you are dealing with large files as well.

the fact that this would have allowed filesystem checks of a chunk while the logical drive is online would have been a very useful thing.

I'm sorry to see this abandoned

Posted Jun 20, 2009 3:04 UTC (Sat) by vaurora (guest, #38407) [Link] (1 responses)

The code is out there. :)

I'm sorry to see this abandoned

Posted Jun 20, 2009 7:30 UTC (Sat) by dlang (guest, #313) [Link]

yeah, but instead of just encouraging the author to get it into the mainline (in part by being able to provide real-life use-cases and production use), now I have to track down another programmer to learn the code and try to push it into mainline.

a much harder task (but still a possible one, as you point out ;-)

What ever happened to chunkfs?

Posted Jun 19, 2009 3:51 UTC (Fri) by k8to (guest, #15413) [Link] (6 responses)

I think tape actually is obsolete.

Tape and disk cost a *very* similar amount per megabyte now. There are capacity points where things swing one way or the other, depending upon controllers needed and so on, but disk is pretty close.

Until you realize that tape is much less reliable, which means that tape backup requires multiple storage passes. In addition, the extremely poor access times of tape mean that you'll have to back up more data more often, which means the total capacity used for tape will be higher than disk backup.

Tape is now more expensive than disk.

This doesn't even get into the massive savings you get from the manageability and performance of disk, so you don't have to spend (so much) money managing the physical requirements of the storage mechanism.

The only advantage of tape is that it's easier to move around in situations where you might drop it. This means it's easier moving data offsite or for protecting it from things like vigorous earthquakes.

But yes, it took decades for tape to fall off the cliff more or less entirely. Disk will be the same way.

What ever happened to chunkfs?

Posted Jun 19, 2009 7:09 UTC (Fri) by chad.netzer (subscriber, #4257) [Link]

I've never had an LTO3 tape that I wasn't able to restore successfully from. I *have* had Seagate
1TB drives that required special measures (firmware upgrade and multiple power cycles) to be able
to read from, and most people seem to have hard drive horror stories from one manufacturer or
another. Drives in RAIDs can be very reliable (barring correlated failures), but linear tape
technologies are also *very* reliable and fast. I don't think tape is obsolete by any stretch, and it
complements rotary drives as a backup medium. Rotating head tape technology for critical data
storage is an abomination against God, however.

What ever happened to chunkfs?

Posted Jun 19, 2009 8:42 UTC (Fri) by dlang (guest, #313) [Link]

When was the last time you saw a disk drive changer? In large data environments tape is still very much in use. The backups are staged to a disk array, and then streamed to tape from there.

What ever happened to chunkfs?

Posted Jun 23, 2009 3:48 UTC (Tue) by drs (guest, #16570) [Link] (3 responses)

Tape may be obsolete in the shallow end of the pool where you play.

Show me a production Petabyte-class data storage system that's entirely disk-based, with an exponential growth curve; then we can talk about obsolescence.

When you have data holdings that grow from 20TB to 1.5PB in 4 years,
(we do!) it's just not sane to do it any other way than tape. The cost of
media might be the same up front: the ongoing costs swing *way* toward
tape, for both environmentals and reliability.

What ever happened to chunkfs?

Posted Jun 25, 2009 19:53 UTC (Thu) by roelofs (guest, #2599) [Link] (2 responses)

Show me a production Petabyte-class data storage system that's entirely disk-based, with an exponential growth curve; then we can talk about obsolescence.

Your friendly neighborhood search engine? (Of course, that's less about data-storage than data-retrieval, but it's generally kind of hard to retrieve it if it isn't stored somewhere first...)

Greg

What ever happened to chunkfs?

Posted Jun 26, 2009 10:15 UTC (Fri) by Duncan (guest, #6647) [Link] (1 responses)

Probably not "search engine", at least in the conventional Internet search
engine sense. They surely index a lot of data, but probably store less of
it, and wouldn't need long-term backups of most of it. After all, the
data that was indexed for searching should for the most part be still
there on the net to reindex, a process that likely wouldn't take much
longer than restoring a backup anyway, and regardless, by the time they
finished the restore, the data would be stale, so a live re-index is going
to be more effective anyway. (Of course this says nothing about the other
non-search services such entities provide, many of which WILL need
backups.)

Rather, these huge petabyte class storage systems occur, based on my
reading on the topic, in a handful of "write mostly" situations. Of
course the "write mostly" bit is a given, since at that kind of data
volume, once past a certain point, it's a given that reading back the data
for further processing simply isn't going to be done for at least the
greatest portion thereof.

The archetypical example would be the various movie studios, now primarily
on digital media for many years. As storage capacities grew, the not only
shot/generated and processed all that data digitally, but stored it, and
not just the theater and "studio cut" editions, but the products generated
at each step of the process. Of course consumer resolutions are growing
as well, and production resolutions are multiple times that at multiple
bits more of color depth, as well. And where it's entirely or primarily
CGI, as the technology grows, so does the detail and data size of the
generated product. So they have several factors all growing at
(literally, for a number of them) geometric rates. But the advantage is
that just as they can remaster music, and just as they've been "restoring"
the movie classics for some time, now it's all digital, and with
a "simple" (which isn't so simple given the mountain of data, even with a
good index) retrieval of the original from the archived backups, they can
remaster from the bit-perfect originals.

AFAIK that was the original usage for petabyte class storage, and could
well be the first usage for exabyte class systems as well, but now that
petabyte is more and more possible and within reach of the "common"
corporation or government entity, it's actually coming to be required by
law there as well. Sarbanes-Oxley had the effect of requiring logging of
vast amounts of information for many US companies. Many readers here are
also no doubt familiar with the various ISP and etc logging requirements
many nations have legislated or tried to, with more lining up to try.
Over time that's going to add up to petabytes of information, and indeed,
many in the technical community have made the connection between drive
sizes outgrowing the needs of a typical consumer and lobbying for passage
of the various mandatory data-logging initiatives, alleging it's no
accident these laws are being passed just as drives get big enough most
ordinary consumers no longer need to get bigger ones every couple years.

Obama's electronic medical records legislation will certainly add to this,
tho many medical entities likely already electronically archive vast
quantities of information for defensive legal purposes, if nothing else.

Then there's of course usage such as that of the Internet Archive, which
would certainly need backups, tho until I just checked wikipedia, I had no
idea what their data usage was (3 petabytes as of 2009, growing at 100
terabytes/mo, as compared to 12 terabytes/mo growth in 2003, this is
apparently for the Wayback Machine alone, not including their other
archives).

Similarly (tho it is said to be smaller than the IA) with the Library of
Congress, and various other similar sites. See the Similar Projects
section of the Wikipedia Internet Archive entry for one list of such
sites.

Then there's the various social sites, tho based on the myspace image
archive torrent (17 GB, I torrented a copy) from a year or so ago, they
likely range in the terabytes, not petabytes.

But think of someone social-video based, like youtube. Even tho they're
not archiving the per-product level of data the studios are archiving, and
what they /are/ archiving is heavily compressed, they're getting content
from a vastly LARGER number of submitters, and must surely be petabyte
class by now (it's hard to believe it was founded only about 4 years ago,
2005-02, first video 2005-04). Wikipedia was no help on storage capacity,
and a quick google isn't helping much either (45 terabytes in 2006...
great help that is for 2009), but I do see figures of 10, scratch that,
15, scratch that, 20 hours of video uploaded /per/ /minute/! Even at the
compression rates they use, that's a LOT of video and therefore a LOT of
storage.

Duncan

What ever happened to chunkfs?

Posted Jun 26, 2009 22:54 UTC (Fri) by roelofs (guest, #2599) [Link]

Probably not "search engine", at least in the conventional Internet search engine sense.

ObDisclosure: I work for one...

They surely index a lot of data, but probably store less of it, ...

Hard to index it if you don't store it. ;-) Life isn't just an inverted index, after all; you need to be able to generate dynamic summaries on the fly.

... and wouldn't need long-term backups of most of it. After all, the data that was indexed for searching should for the most part be still there on the net to reindex, a process that likely wouldn't take much longer than restoring a backup anyway, and regardless, by the time they finished the restore, the data would be stale, so a live re-index is going to be more effective anyway.

That's true as far as it goes, but we're not talking about long-term backups, either. Search engines are more about robustness--think replication and failover and low (sub-second) latencies. How much data depends on which part you're talking about (tracked [webmap] vs. crawled vs. indexed), but when the document count ranges from dozens to hundreds of billions, the node count ranges from tens of thousands to hundreds of thousands (as reported by Google quite a few years ago), and the failure rate is dozens to thousands of nodes per day (also reported by Google not too long ago, IIRC), you can probably see where disk-based petabyte storage might come into play and why recrawling isn't a realistic option for point failures.

Greg

SSD vs. HD failure

Posted Jun 19, 2009 4:43 UTC (Fri) by djao (guest, #4263) [Link] (1 responses)

There's a good bit of FUD in the middle of the article when it talks about SSDs. Of course, it is certainly true that overused individual disk sectors in SSDs tend to fail. However, a fairly large percentage of these failures occur on write, meaning that you know something went wrong, but your old data is still there. In such situations, the corruption resilience properties of chunkfs are not really that useful.

If we now turn our attention to regular hard drives with mechanical platters, one of the common ways that such drives can fail involves the entire drive dying at once. (This can happen to SSDs too, but much more rarely.) Chunkfs won't really help very much in this case either. I should also add that mechanical failure of mechanical drives is more common than SSD failure. Moreover, when a mechanical drive fails for whatever reason, it almost always fails on read instead of (or in addition to) on write, meaning that you've lost old data. So, even though SSDs aren't perfect, they're still better than the alternatives if your goal is reliability. And even though the chunkfs concept is handy for a large class of disk failures, you're still better off with ext3 on an SSD than chunkfs (or any FS really) on a mechanical drive.

SSD vs. HD failure

Posted Jun 30, 2009 14:15 UTC (Tue) by jond (subscriber, #37669) [Link]

When combatting FUD with counter-claims, it's nice to provide references. I would be very interested to see some stats to support "[entire drive dying at once with ] SSDs too, but much more rarely." and "mechanical failure of mechanical drives is more common than SSD failure" in particular.

