A better btrfs

Posted Jan 18, 2008 3:05 UTC (Fri) by Cato (subscriber, #7643)
In reply to: A better btrfs by Felix.Braun
Parent article: A better btrfs

The one ZFS feature I really want is checksumming of all disk blocks, which can detect disk
failures, bad cables, controller failures, etc.  As disks climb towards 1 TB being an average
size, the error rate per disk becomes surprisingly high...
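
To put a rough number on "surprisingly high", here is a minimal sketch. It assumes an unrecoverable read error rate of one per 10^14 bits read (a figure often quoted on drive datasheets, not something stated in this thread) and independent errors; the function name and the sample sizes are only illustrative.

```python
import math

def p_read_error(capacity_bytes, ber=1e-14):
    """Chance of at least one unrecoverable error when reading the whole disk once.

    ber is an ASSUMED per-bit error rate, not a measured value.
    """
    bits = capacity_bytes * 8
    # 1 - (1 - ber)**bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-ber))

for gb in (100, 250, 1000):
    print(f"{gb:5d} GB: {p_read_error(gb * 10**9):.2%}")
```

Under that assumption the chance of hitting an error on one full read grows almost linearly with capacity, which is the per-disk effect described above.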


A better btrfs

Posted Jan 18, 2008 23:25 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

> As disks climb towards 1 TB being an average size, the error rate per disk becomes surprisingly high...

But who cares about error rate per disk?

Error rate per system administrator or per year or per gigabyte seems more interesting.

A better btrfs

Posted Jan 19, 2008 0:41 UTC (Sat) by nix (subscriber, #2304) [Link]

Hm, the former of those is interesting. `Why yes, you *can* reduce your 
error rate: just hire more sysadmins!'

Pair admining? Test driven administration? XA (eXtreme Admin'ing)?

Posted Jan 25, 2008 21:00 UTC (Fri) by AnswerGuy (subscriber, #1256) [Link]

 "Just" hiring more sysadmins clearly wouldn't reduce your error rate.  By itself it would
almost certainly *increase* the number of systems administration errors.

 However, I have been mulling over the application of agile development methods to systems
administration.  (In fact I even gave a brief talk on that idea at last year's LinuxWorld in
San Francisco.)

 Consider some of the possibilities:

 Test driven administration:  

 Configure monitoring for that new server before you configure the server.  Alarms should go
off; response procedures should be executed ... a service window should be scheduled (with an
estimated date of completion), which should defer further alarms from that source.  (The same
applies to each service that's to be deployed).  Now you know that the monitoring is doing
something useful.  When the monitoring shows the service "going green" then you know you have
configured the service correctly, at least as far as the monitoring system can tell (i.e. DNS
or other directory services, IP addressing, routing, etc).  (If you find a corner case ---
where monitoring gives a false "green" status --- try to improve the monitoring to more
closely model a service's *correct* functionality).
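
 The "red first, then green" idea can be sketched in a few lines of Python.  This is only an illustration: the check is a bare TCP connect (exactly the kind of shallow check that can give a false "green"), and `deploy` is a hypothetical callable standing in for whatever actually configures the service.

```python
import socket

def service_is_green(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (a deliberately shallow check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def deploy_test_first(host, port, deploy):
    """Insist on a failing check before deployment and a passing one after it."""
    if service_is_green(host, port):
        # A check that is green before anything is deployed proves nothing.
        raise RuntimeError("check is already green before deployment")
    deploy()  # hypothetical: configure and start the service
    if not service_is_green(host, port):
        raise RuntimeError("service deployed but monitoring is still red")
    return True
```

 A real monitoring system would also handle the alarm and service-window bookkeeping described above; the sketch covers only the ordering.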

 Integrate imaging and system restoration.  Image a system, configure it, back up the
configuration and initial (test) data, then create a new imaging profile to facilitate
automated re-imaging of the system with automated restore of the configuration and data.  Then
wipe the system and re-image it using that profile.  Repeat until the system's complete
configuration and data are restored automatically.  THEN put the system into production.
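
 As a rough sketch, the wipe/re-image/verify cycle is just a loop.  All four callables below are hypothetical hooks into whatever imaging and backup tools a site uses; nothing here names a real tool.

```python
def rehearse_restore(reimage, restore, verify, fix_profile, max_rounds=5):
    """Wipe and restore repeatedly until verify() reports nothing missing.

    reimage, restore, verify and fix_profile are hypothetical hooks;
    verify() returns a list of problems found after the restore.
    Returns the number of rounds it took to get a clean restore.
    """
    for round_no in range(1, max_rounds + 1):
        reimage()               # wipe the machine and lay down the image
        restore()               # automated restore of configuration and data
        problems = verify()     # e.g. missing files, services not running
        if not problems:
            return round_no     # only now is the system ready for production
        fix_profile(problems)   # update the imaging profile, then go round again
    raise RuntimeError(f"still not restorable after {max_rounds} rounds")
```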

 There are a number of other ideas in a similar vein.  One of them is that we might want to
institute a policy ... for critical production servers ... of having our admins work in pairs
(perhaps over a shared GNU screen session) where one of the admins types each command, then
the other confirms that it's safe/correct and hits [Enter] when they both concur.  (Better
admins among us have learned to pause before hitting [Enter] when working "live" on mission
critical servers ... take a deep breath ... re-read that command ... perhaps try the "echo" or
"--dry-run" version of it first ... consider the risks ... and *THEN* (maybe) hit [Enter].
But even the best of us gets in a hurry, gets flustered or tired, or just experiences cognitive
hiccoughs).
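
A shared screen session needs no code at all, but the "one types, the other approves" rule can also be sketched as a small wrapper.  This is an illustration only: `confirm_and_run` is a made-up name, and the approval step (the second admin retypes the command's first word) is just one assumed way of forcing a deliberate pause.

```python
import shlex
import subprocess

def confirm_and_run(command, confirm=input, runner=subprocess.run):
    """Show the command, and run it only if a second person approves it."""
    argv = shlex.split(command)
    if not argv:
        raise ValueError("empty command")
    print("About to run:", shlex.join(argv))
    typed = confirm("Second admin: retype the first word to approve: ")
    if typed.strip() != argv[0]:
        print("Not confirmed; nothing was run.")
        return None
    # check=True makes a failing command raise instead of passing silently
    return runner(argv, check=True)
```

Passing a different `runner` (one that only prints, say) gives a crude dry-run mode for commands that lack their own.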

(In my case I was an electrician for years before embarking on my IT career --- working with
potentially live wiring offers similar lessons with potentially lethal and immediately painful
consequences for any lapse in due care!  And yes, despite all that I did occasionally get
zapped!)

JimD

Pair admining? Test driven administration? XA (eXtreme Admin'ing)?

Posted Jan 26, 2008 3:42 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

> "Just" hiring more sysadmins clearly wouldn't reduce your error rate.

It would if the error rate you're using is disk errors per year per sysadmin, which is what we were talking about.

It underscores the point that there are lots of error rates you can define, and you have to pay attention to your denominators.

Nonetheless, your ideas about reducing errors per something by improving system administration methods are interesting.

A better btrfs

Posted Jan 19, 2008 12:19 UTC (Sat) by Cato (subscriber, #7643) [Link]

I care about error rate per disk - if each disk is very likely to have a bad block at any
time, as seems more likely to be the case with today's larger disks, then you start to really
need RAID, block checksumming, etc, simply to avoid losing data.

I believe that people are storing more and more data on a given system, and the fact that
p(error on this system) is going up should be of concern.

A better btrfs

Posted Jan 19, 2008 17:54 UTC (Sat) by giraffedata (subscriber, #1954) [Link]

> if each disk is very likely to have a bad block at any time, as seems more likely to be the case with today's larger disks, then you start to really need RAID, block checksumming, etc, simply to avoid losing data.

That risk would be the same if you had 10 disks, each with one tenth the data and one tenth the error rate.

> I believe that people are storing more and more data on a given system, and the fact that p(error on this system) is going up should be of concern.

Now you're talking about error rate per system, not per disk.

And I'm not convinced that's important either. Spreading data out across 10 systems doesn't make the data loss hurt any less.
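
The ten-disk comparison above is easy to check with a little arithmetic. A minimal sketch, assuming independent errors and a made-up 8% chance of an error on the one big disk over some fixed period (the number is purely illustrative):

```python
def p_any_error(p_per_disk, n_disks):
    """Chance that at least one of n independent disks has an error."""
    return 1 - (1 - p_per_disk) ** n_disks

P_BIG = 0.08  # ASSUMED chance of an error on the single large disk

one_big = p_any_error(P_BIG, 1)
ten_small = p_any_error(P_BIG / 10, 10)

# The expected number of errors is identical by construction
expected_big = 1 * P_BIG
expected_small = 10 * (P_BIG / 10)

print(f"one big disk:    {one_big:.4f}")
print(f"ten small disks: {ten_small:.4f}")
```

The expected number of errors is the same in both layouts, and the chance of seeing at least one differs only slightly, so the per-disk figure by itself says little about the risk to the data.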

A better btrfs

Posted Jan 20, 2008 18:08 UTC (Sun) by Cato (subscriber, #7643) [Link]

Error rate per system is a better metric as you say - your original post said 'error rate per
system administrator' which was a bit confusing.

A better btrfs

Posted Jan 20, 2008 20:57 UTC (Sun) by giraffedata (subscriber, #1954) [Link]

I think error rate per system is more useful than error rate per disk, but as I said, even error rate per system isn't terribly useful. Error rate per system administrator is considerably more useful.

If a major cost of disk errors is a system administrator having to replace a disk, restore from backup, recreate data, etc., then you care how many times a year the system administrator has to do that. Consolidating data from two systems onto one doubles your error rate per system, but doesn't mean you have to increase your RAID redundancy and such because the error rate per system administrator is still the same. On the other hand, piling a terabyte of movies onto the systems managed by a system administrator increases that error rate and might require some new method of dealing with the errors.
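
The effect of the denominator can be shown with a toy calculation. This is a sketch under a loud assumption: errors are taken to scale linearly with stored data at a made-up rate of 0.5 errors per terabyte per year, and the `Site` class is invented for the example.

```python
from dataclasses import dataclass

ERRORS_PER_TB_YEAR = 0.5  # ASSUMED rate, for illustration only

@dataclass
class Site:
    systems: int
    tb_per_system: float
    admins: int

def rates(site):
    """Yearly error rates under two different denominators."""
    total = site.systems * site.tb_per_system * ERRORS_PER_TB_YEAR
    return {"per_system": total / site.systems, "per_admin": total / site.admins}

two_systems = Site(systems=2, tb_per_system=1.0, admins=1)
consolidated = Site(systems=1, tb_per_system=2.0, admins=1)
plus_movies = Site(systems=1, tb_per_system=3.0, admins=1)

for name, site in [("two systems", two_systems),
                   ("consolidated", consolidated),
                   ("plus 1 TB of movies", plus_movies)]:
    print(name, rates(site))
```

Consolidation doubles the per-system rate while leaving the per-admin rate alone; adding another terabyte raises both.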

Error-rate / disk

Posted Jan 30, 2008 3:11 UTC (Wed) by Max.Hyre (subscriber, #1054) [Link]

When the error requires taking the disk offline to fix it, I care. Until I can access all of a Tbyte disk in the time it takes to access, say, a 200 Gbyte one, the downtime per spindle will be greater, and what you really care about is the downtime, for any problem.*

It's bad enough waiting for a fsck on a 100 GB partition.


* Well, data loss figures in there somewhere, but lost data typically hurts a few users, whereas being offline affects everyone.
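
For a sense of scale, here is a rough sketch of how long one full pass over a disk takes. The 70 MB/s sustained throughput is an assumed figure, and a real fsck depends on metadata and seek patterns more than on raw capacity, so treat this as a floor for any repair that has to touch the whole surface.

```python
def full_scan_hours(capacity_gb, mb_per_s=70.0):
    """Hours to read the whole disk once at an ASSUMED sustained throughput."""
    seconds = capacity_gb * 1000 / mb_per_s
    return seconds / 3600

for gb in (200, 1000):
    print(f"{gb:5d} GB: {full_scan_hours(gb):.1f} h")
```

At a fixed transfer rate the time scales directly with capacity, so the 1 Tbyte disk is offline five times as long as the 200 Gbyte one.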

Copyright © 2008, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.