
ext3 metaclustering

By Jonathan Corbet
January 16, 2008
The ext3 filesystem uses the classic Unix block pointer method for keeping track of the blocks in each file. For a given file, the on-disk inode structure contains space for twelve block numbers; they point to the first twelve blocks in the file - the first 48KB of space. If the file is larger than that, a 13th pointer contains the address of the first indirect block; this block contains another 1024 (on a 4K-block filesystem) block pointers. Should that not suffice, there's a 14th pointer for the double-indirect block - each entry in that block is the address of an indirect block. And if even that is not enough, there's a 15th entry pointing to a triple-indirect block full of pointers to double-indirect blocks.
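
As a rough, back-of-the-envelope illustration (shell arithmetic, not anything taken from the ext3 code), here is how much data that pointer scheme can address with 4KB blocks; the actual ext3 file-size limit is lower for unrelated reasons:

 BLOCK=4096                       # filesystem block size in bytes
 PTRS=$((BLOCK / 4))              # 1024 four-byte block numbers per indirect block
 BLOCKS=$((12 + PTRS + PTRS*PTRS + PTRS*PTRS*PTRS))
 echo "$((BLOCKS * BLOCK / (1024*1024*1024))) GiB addressable"   # roughly 4TB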

This is a very efficient representation for small files - the kinds of files Unix systems typically held, once upon a time. In current times, when one can forget about that directory full of DVD images and never even notice the lost space, it does not work quite as well - there is a lot of overhead for all of those individual block pointers, and a large data structure to manage. That is why removing a large file on an ext3 filesystem can take a long time - the system has to chase down all of those indirect blocks, which, in turn, forces a lot of disk activity and head seeks. For this reason, contemporary filesystems tend to use extent-based mechanisms to associate blocks with files, but that is not really an option for ext3.

An additional problem with all those indirect blocks is that filesystem checkers must locate and verify them all. That, again, causes a lot of head seeking and makes fsck run slowly. Slow filesystem checking was the motivation behind this patch from Abhishek Rai, which attempts to improve fsck performance on filesystems with a lot of indirect blocks.

The approach taken is relatively simple: the patch just tries to group indirect block allocations together on the disk. The current ext3 code allocates indirect blocks as they are needed to account for data blocks being added to the file; they are usually placed adjacent to those data blocks. One might think that this placement would speed subsequent accesses to the file, but that is not necessarily so; the reading or writing of the indirect block will tend to happen at a different time than operations on the data blocks. What this placement does accomplish, though, is the distribution of the indirect blocks all over the disk. So a process which must examine all of the indirect blocks associated with a file will force the disk to do a lot of head seeks.
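
To put a rough number on that (an illustrative shell calculation, not taken from the patch): a single 1GB file on a 4KB-block filesystem drags along roughly 256 single-indirect blocks, each one a potential seek for fsck or rm when they are scattered across the disk:

 BLOCK=4096; PTRS=$((BLOCK / 4))
 DATA=$(( (1024*1024*1024) / BLOCK ))       # 262144 data blocks in a 1GB file
 echo $(( (DATA + PTRS - 1) / PTRS ))       # ~256 single-indirect blocks, plus a double-indirect block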

The "metaclustering" approach works by reserving a set of contiguous blocks at the end of each block group. Whenever an indirect block is needed, the filesystem tries to get one from this dedicated area first. The end result is that all of the indirect blocks are located next to each other. Should somebody need to read a number of those blocks without being interested in the contents of the data blocks, they can grab them all quickly with minimal seeking. Filesystem checkers, as it happens, need to do exactly that - as does the file removal process. The patch did not come with benchmarks, but the speedup that comes from the elimination of all those seeks should be significant.

Even so, Andrew Morton questioned the need for this patch, worrying that its benefits do not justify the risks that come with modifying an established, heavily-used filesystem:

In any decent environment, people will fsck their ext3 filesystems during planned downtime, and the benefit of reducing that downtime from 6 hours/machine to 2 hours/machine is probably fairly small, given that there is no service interruption.

Others disagreed, though, noting that it's the unplanned filesystem checks which are often the most time-critical. That includes the delightful "maximal mount count" boot-time check which, in your editor's experience, always happens when one is trying to get set up to give a talk somewhere. So this patch might just find eventual acceptance - it should be relatively low-risk and does not require any on-disk format changes. This is a filesystem patch, though, so nobody will be in any hurry to get it into the mainline before a lot of testing and review has been done.



ext3 metaclustering

Posted Jan 17, 2008 12:33 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

The maximal mount count check is optional. It exists because Linux has traditionally run on
lots of cheap PCs people lashed together in basements, and because people like running release
candidates, random patches they found on the Internet and so-on. If you're pretty confident
that your hardware works and your filesystem driver works, these checks are surplus to
requirements.

Many operating systems don't perform such routine checks. Some don't even have the tools to
perform them if you wanted to. So using tune2fs to disable the periodic checks isn't
unreasonable.

And you have backups, yes?

ext3 metaclustering

Posted Jan 17, 2008 13:42 UTC (Thu) by glm (subscriber, #45719) [Link]

There is another (less radical) workaround against an unwanted "maximal mount count" boot-time
check: simply interrupt the check with Ctrl+C. This will defer the check until the next
boot, hopefully on a less time-critical occasion.

ext3 metaclustering

Posted Jan 17, 2008 15:20 UTC (Thu) by sbergman27 (guest, #10767) [Link]

On all the distros I have tried, ctrl-c doesn't stop it.  Nothing stops it.  Not even if you
are in a situation where a 30 minute forced fsck is *really* embarrassing.  And it is "opt
out".  So one must remember to turn it off with tune2fs or expect it to kick in as a surprise
when one can least afford it.

ext3 metaclustering

Posted Jan 17, 2008 16:04 UTC (Thu) by nix (subscriber, #2304) [Link]

That's because signals aren't delivered from the initial console by default, and the program
which fixes that (getty) doesn't run until a long time after fsck runs.

ext3 metaclustering

Posted Jan 17, 2008 17:39 UTC (Thu) by bronson (subscriber, #4806) [Link]

So, how does one stop it?

I suppose the right time to run the automatic fsck is when the volume is being unmounted at
shutdown.  I don't mind at all if the computer wants to chug along happily for 20 minutes and
then power itself off.  I sure as heck mind if it happens at startup and prevents me from
using the computer for 1/2 hour in the morning!

Any thoughts?  Should I file an Ubuntu feature request?

ext3 metaclustering

Posted Jan 17, 2008 20:29 UTC (Thu) by jzbiciak (✭ supporter ✭, #5246) [Link]

Hmmm... does putting an "stty sane" early in the boot scripts make it work?

ext3 metaclustering

Posted Jan 22, 2008 0:09 UTC (Tue) by fergal (subscriber, #602) [Link]

This was just recently discussed on ubuntu-devel-discuss; the thread should be in the
archives.

ext3 metaclustering

Posted Jan 18, 2008 18:16 UTC (Fri) by ranmachan (subscriber, #21283) [Link]

On Debian, the filesystem check can be stopped with Ctrl+C, at least for filesystems other
than /. I'm not sure about /, since it _may_ be different and my / is usually small enough to
just wait for fsck to finish. /home can be quite annoyingly long though and gets the Ctrl+C
treatment sometimes.

ext3 metaclustering

Posted Jan 17, 2008 16:23 UTC (Thu) by Velmont (guest, #46433) [Link]

sudo tune2fs -c 0 /dev/sda1

Ahh... Feels good. I'll never be embarrassed ever again.

ext3 metaclustering

Posted Jan 17, 2008 17:26 UTC (Thu) by sbergman27 (guest, #10767) [Link]

Yes, you can now plan your important public presentation scheduled for July 15, 2008 with
confidence.

(Hint! Hint!)

ext3 metaclustering

Posted Jan 17, 2008 17:30 UTC (Thu) by bronson (subscriber, #4806) [Link]

... except when a tiny inconsistency spreads and ends up corrupting half of your partition
(the half with the presentation on it of course).

Oh, if only you'd run periodic fscks!  The corruption would have been caught early and fixed
without you ever knowing about it.   :-P

ext3 metaclustering

Posted Jan 17, 2008 18:05 UTC (Thu) by sbergman27 (guest, #10767) [Link]

I'd like to see some proof that this really happens in the real world.  That such spreading of
a tiny corruption to destroy one's whole file system is a real concern for real people.  Sure,
ext3 can, rarely, experience serious corruption.  But how do we know whether or not it started
out as a tiny problem?  Until it's proven, it sounds like hearsay to me.

ext3 metaclustering

Posted Jan 17, 2008 22:43 UTC (Thu) by rfunk (subscriber, #4054) [Link]

Disks deteriorate underneath the filesystem.  Usually it's age causing 
increasingly large bad spots.  The filesystem may have no bugs, but things 
can happen to the disk to corrupt the filesystem.  I'd rather know about 
that.

ext3 metaclustering

Posted Jan 17, 2008 23:15 UTC (Thu) by sbergman27 (guest, #10767) [Link]

Then have "badblocks" run independently of e2fsck at 3AM once a week or so?  Instead of
forcing the user and conference attendees to sit through a mandatory hour-long e2fsck at the
whim of the machine, hoping that the e2fsck happens to catch what badblocks is far better
designed to catch?

ext3 metaclustering

Posted Jan 17, 2008 23:31 UTC (Thu) by rfunk (subscriber, #4054) [Link]

Hour-long e2fsck?

1.  Partition your disk -- root, /var, /usr, /home on different filesystems.
2.  If you ever do enable the automatic checks, set the mounts-before-check count on 
each filesystem to be a different prime number.  That way multiple filesystems almost 
never get checked at the same time.
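
For example (device names and values are purely illustrative):

 tune2fs -c 17 /dev/sda1   # /
 tune2fs -c 19 /dev/sda5   # /var
 tune2fs -c 23 /dev/sda6   # /usr
 tune2fs -c 29 /dev/sda7   # /home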

I've never had an fsck on a non-server system (which seems to be the topic here) go 
anywhere near an hour.  Maybe five minutes at most.

In my experience, badblocks is far far slower than e2fsck.

And running anything automatically at 3am generally isn't an option on 
conference-presentation laptops.

ext3 metaclustering

Posted Jan 18, 2008 14:56 UTC (Fri) by fatrat (subscriber, #1518) [Link]


Not sure that partitions help here. If we are talking about a personal box/laptop, /home is the
only thing I care about and it'll have all the disk space as well.

ext3 metaclustering

Posted Jan 18, 2008 15:06 UTC (Fri) by rfunk (subscriber, #4054) [Link]

OK, then you won't mind if I rm -rf /usr on your machine.  :-)

Try:  du -shc /var /usr /home
(There's also the root stuff not in those, but it's harder to measure that.)
You may be surprised at how much is in /var and /usr.

ext3 metaclustering

Posted Jan 18, 2008 15:19 UTC (Fri) by fatrat (subscriber, #1518) [Link]

My home dir contains ~82GB. Compared to that, /usr and /var don't contain a lot (under 10GB).
I'm sure most people are similar, hence my comment.

ext3 metaclustering

Posted Jan 18, 2008 15:45 UTC (Fri) by rfunk (subscriber, #4054) [Link]

10GB is still a big important chunk of disk, whether the rest is 20GB or 82GB.  Checking 
it separately *will* speed up each check, and separating it into a separate filesystem will 
make sure that errors on one part won't mess up the other part.

(Come to think of it, I suspect that the fsck speed is more dependent on number of files 
than data size, though I don't know for sure.)

ext3 metaclustering

Posted Jan 19, 2008 22:22 UTC (Sat) by Frej (subscriber, #4165) [Link]

Partitioning is fixing the symptoms, not the problem. 

Multiple fscks

Posted Jan 30, 2008 3:28 UTC (Wed) by Max.Hyre (subscriber, #1054) [Link]

> [S]et the mounts-before-check count on each filesystem to be a different prime number. That way multiple filesystems almost never get checked at the same time.

Even better is setting the mount-count limit to the same number on all filesystems, then using tune2fs to set the starting count to a different value on each.
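
Concretely, that might look like this (devices and numbers are just an example; -C sets the current mount count):

 tune2fs -c 30 -C 0  /dev/sda1
 tune2fs -c 30 -C 10 /dev/sda5
 tune2fs -c 30 -C 20 /dev/sda6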

Voila! Never a multiple fsck.

ext3 metaclustering

Posted Jan 17, 2008 23:19 UTC (Thu) by magila (subscriber, #49627) [Link]

Disks these days are pretty good at hiding bad sectors from the host. If it gets bad enough
that the OS starts seeing bad data then the drive is probably on its last legs and will soon
fail completely. In any case monitoring the SMART logs will usually catch a drive that is
gradually degrading without the frustrating fsck delays.

ext3 metaclustering

Posted Jan 17, 2008 23:33 UTC (Thu) by rfunk (subscriber, #4054) [Link]

True, but how many people monitor SMART logs on a laptop, or even a desktop?
More to the point, how many of the people disabling the auto-fsck monitor their SMART 
logs?

ext3 metaclustering

Posted Jan 18, 2008 22:00 UTC (Fri) by nix (subscriber, #2304) [Link]

smartd can send you emails when things go suspiciously wrong.

ext3 metaclustering

Posted Jan 18, 2008 22:05 UTC (Fri) by rfunk (subscriber, #4054) [Link]

True.  How many people have system-level email working properly on their laptops, and 
are able to get such emails?

ext3 metaclustering

Posted Jan 18, 2008 22:14 UTC (Fri) by nix (subscriber, #2304) [Link]

Um, anyone competent? All sorts of other email, some security-important, 
gets sent by various daemons and shouldn't just be binned or ignored... of 
course a lot of people aren't competent :/

ext3 metaclustering

Posted Jan 18, 2008 22:20 UTC (Fri) by rfunk (subscriber, #4054) [Link]

My programmer coworkers have enough trouble with the task, and they're techies.  
Forget about the non-techie user that is adopting Linux more and more.

Everyone sets up their GUI mail program, and totally ignores the system-level MTA 
(sendmail/postfix/exim).  They just never get those emails.

(Sysadmin types being the exception, of course, but they're few and far between these 
days.)

ext3 metaclustering

Posted Jan 19, 2008 18:35 UTC (Sat) by raxyx (subscriber, #50026) [Link]

So THAT's what these MTAs are for. Cool.


> Everyone sets up their GUI mail program, and totally ignores the system-level MTA 
> (sendmail/postfix/exim).  They just never get those emails.

Full ack on that. On some of my Debian machines, during the boot sequence, the thing that
takes the most time to get loaded is exim4, so one day I got fed up with it and removed it;
I didn't notice any difference afterwards. I guess I'm going to rethink that move :-)

lightweight MTAs for outgoing mail only

Posted Jan 19, 2008 20:08 UTC (Sat) by liamh (subscriber, #4872) [Link]

I have taken to removing exim4 and installing either ssmtp or nullmailer
 aptitude install ssmtp exim4- exim4-base- exim4-config- exim4-daemon-light-
Just enough MTA to get the word out.  Since few people want/need a full MTA, this seems like
it should be the default.  But I don't do SMART disk monitoring; a few years back I tried it and
it led to some unreliable system behavior.

ext3 metaclustering

Posted Jan 19, 2008 1:56 UTC (Sat) by cortana (subscriber, #24596) [Link]

Well, Debian configures smartd to both mail root and display a notification on the desktops of
currently-logged-in users. :)

ext3 metaclustering

Posted Jan 17, 2008 23:56 UTC (Thu) by dberkholz (subscriber, #23346) [Link]

Google published a paper fairly recently on a large study of disk failures. As I recall, they
found that SMART logs were not reliable indicators.

ext3 metaclustering

Posted Jan 18, 2008 4:12 UTC (Fri) by magila (subscriber, #49627) [Link]

Notice I said gradually degrading. SMART won't help in the event of a catastrophic mechanical
failure, which is what most of the unanticipated failures in the Google study probably were.
Fsck doesn't help in that case either though. It's only the kinds of failures that cause a
slow accumulation of bad sectors that fsck would matter for, and those are the kinds of
failures that SMART is piratically guaranteed to catch.

ext3 metaclustering

Posted Jan 18, 2008 8:51 UTC (Fri) by njs (guest, #40338) [Link]

piratically... guaranteed...?

ext3 metaclustering

Posted Jan 18, 2008 22:03 UTC (Fri) by nix (subscriber, #2304) [Link]

That's SMArrrT for you.

Using fsck to defend against disk failures?

Posted Jan 27, 2008 15:45 UTC (Sun) by anton (guest, #25547) [Link]

That, the "spreading inconsistency" theory, and some other things I have read by people writing about fsck are failure types that I have never seen or read a first-hand report of, so I guess they are just myths or a perverted form of wishful thinking.

The kinds of disk failures I have seen have always been different. In particular, even if a drive developed a bad block, it recognized that itself (very slowly) and returned an error rather than wrong data. I'm not sure if fsck programs are up to dealing with a bad block of this kind in the metadata, but if a drive has a bad block, that's certainly a good time to replace the drive and restore the data from backup. Or, if you run RAID 1 or RAID 5, you just need to replace the drive (and make it known to the RAID driver).

Moreover, even if a disk drive deteriorates over time, that's more likely to hit the data first rather than the meta-data. But fsck checks only some kinds of errors in the meta-data, so if fsck is your defense against bad blocks, you don't value your data at all. Making a backup is more likely to unveil bad blocks than fsck (also in data), and has obvious additional benefits.

Finally, a good way (much better than fsck) to test the drive for bad blocks is "smartctl -t long", even though I am sceptical about the predictive capabilities of SMART.

Overall, I am very sceptical about the value of fsck for dealing with hardware failures, and a little bit less sceptical about its value when dealing with software failures (but I think I have not been bitten by a file system bug yet); in many cases (especially the hardware ones) we have to restore from backup anyway.

Using fsck to defend against disk failures?

Posted Jan 27, 2008 16:32 UTC (Sun) by nix (subscriber, #2304) [Link]

My mum's ancient 486 laptop had a really strange disk failure this 
Christmas. It started with a single bad sector, but then within about 
fifteen minutes one third of the sectors on the disk (in contiguous runs 
of varying length) were returning, not bad sectors, but `sector not 
found', i.e. the drive couldn't even find the sector address markers.

What I suspect may have happened, based on my extensive lack of experience 
in hard drive design, is that all the G forces the head assembly is 
exposed to whenever a seek happens had over time twisted the head reading 
the farthest side of whichever platter didn't contain the servo track out 
of true, so that when the servo track said it was over track X, the 
topmost heads were actually midway between tracks or something like that. 
In that position they couldn't read the sector addresses, couldn't find 
any data, and whoompfh, goodbye data.

(I've never heard of this failure mode anywhere else, and perhaps it was 
something different, but still, it was very strange. Disks *can* go mostly 
bad all at once. It's just rare.)

Disk failures

Posted Jan 27, 2008 21:58 UTC (Sun) by anton (guest, #25547) [Link]

Disk drives have not used servo tracks for a long time, because one could no longer align all the heads precisely enough (e.g., because of thermal expansion). Instead, servo information exists on each platter, interspersed in some way with the data. I don't know when this change happened; a 15+-year old disk (486 generation) might still have a servo track. But couldn't the symptoms also be explained by the failure of just one of the heads?

Disk failures

Posted Jan 27, 2008 22:55 UTC (Sun) by nix (subscriber, #2304) [Link]

I said it was a prehistoric system, and indeed anything more modern than 
about, what, 1991 won't have this problem.

I'm not sure if a head failure could cause a failure to find sector 
address markers: I'm not sure if you could even distinguish the two cases 
without digging into the drive. (As I said, my expertise in hard drive 
engineering is notable mainly by its absence.)

It's just that heads are solid-state, and solid-state stuff doesn't die 
all that often, while the head assembly itself is being wrenched all over 
the place: simple bending could explain this, I think.

ext3 metaclustering

Posted Jan 18, 2008 0:24 UTC (Fri) by iabervon (subscriber, #722) [Link]

These days, it doesn't make much sense to use -c periodic checking, since disk data errors are
unlikely to be associated with mounting. It makes a lot more sense to use -i periodic
checking, which you can schedule for some time when you're not giving a presentation.
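
With tune2fs that might look something like this (the interval and device are only examples):

 tune2fs -c 0 -i 3m /dev/sda1   # disable mount-count checks; time-based check roughly every three months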

Actually, it would make most sense to do it at shutdown, some time when the system is plugged in and
you're going to bed, controlled with cron/anacron for noticing the need to check it and
shutdown scripts to identify that it's appropriate. Obviously, there's practically no chance
that the periodic check would actually happen to trigger on the first mount after disk
corruption occurs, and it's more likely that corruption would happen during a write (and thus,
while it's mounted) anyway.

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds