On the vger.kernel.org outage
There was a hard disk failure yesterday on one of the RAID arrays that was bad enough to take the system down. The disk controller would subsequently hang during reboot attempts with the failed disk attached. Folks at the co-location facility are actively trying to bring it back up and rebuild the array. With luck the linux-kernel firehose will resume flowing soon.
Update: It looks like vger has come back up.
RAID array
What model of RAID array is this? It looks like one to avoid if the failure of a single drive
brought down the array.
RAID array
RAID is not about resistance to every sort of partial failure we can imagine; it's about ensuring that upon a disk failure you will not lose your data. If one disk becomes 99% slower but still works, there's no reason it would be marked faulty, but in practice the service is no longer assured. However, having someone over there replace it fixes the problem.
RAID array
That said, RAID arrays that fail when a specific device is plugged in are awfully suspect. Signal problems on the port should be a soft failure, not a hard one. That's the whole point of spending lots of money on server hardware. If it were OK for a drive to take down the system, they could have been running on a $900 box from Walmart. All a file server box really needs is a ton of RAM; disk bandwidth needn't enter into it. Didn't I see a report a while back that kernel.org was, in fact, serving everything out of cache anyway?

All of which, really, just goes down as evidence for my long-held opinion that hardware-level solutions for reliability never work. Reliability can only be achieved at the software level via full redundancy.
"Reliability can only be achieved at the software level."
Instead of a mailing list, a Usenet-like discussion system on top of git?
RAID array
Amusingly, the original vision for monotone actually had NNTP as the primary intended network
transport.
RAID array
RAID is nowhere near as reliable in practice as it appears on paper. One example: you frequently find, after a single disk (A) has failed, that another disk (B) "fails during recovery". In reality, some sectors on disk B failed first, but on sectors that never got read or written, so the bad sectors are only picked up during RAID recovery, which entails reading the whole of disk B and the other disks.

See http://www.nber.org/sys-admin/linux-nas-raid.html for coverage of some of these issues and how NetApp and others would cover them. Personally I would only use ZFS, NetApp or something similar that does a lot of media scrubbing, sector checksumming etc., in addition to basic RAID.
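A back-of-the-envelope sketch of why this bites so often during rebuilds. The numbers are my own illustration, assuming the commonly quoted unrecoverable-read-error (URE) rate of one per 10^14 bits read; real drives vary, so treat the output as indicative only:

    # Chance of hitting at least one unrecoverable read error (URE)
    # while reading an entire surviving disk during a RAID rebuild.
    # Assumes the often-quoted rate of 1 URE per 1e14 bits read.
    def rebuild_ure_probability(disk_bytes, ure_per_bit=1e-14):
        bits_read = disk_bytes * 8
        # Probability that every bit reads back cleanly, then invert.
        return 1 - (1 - ure_per_bit) ** bits_read

    for size_gb in (250, 500, 1000):
        p = rebuild_ure_probability(size_gb * 1e9)
        print("%4d GB disk: %.1f%% chance of a URE during rebuild"
              % (size_gb, p * 100))

For a 1 TB disk that works out to roughly an 8% chance of tripping over a latent bad sector during a full-disk rebuild, which is why scrubbing the array before a disk actually dies matters so much.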
RAID array
We use 3ware controllers at work, which can be set up to automatically and periodically verify that no sectors have failed. That feature has already saved us once.
RAID array
"Personally I would only use ZFS, NetApp or something similar that does a lot of media
scrubbing, sector checksumming etc, in addition to basic RAID."
Would that not be an admission of defeat for Linux?
RAID array
Yes, it means I could currently only use Linux for storage if I went for ZFS/FUSE. This is a significant weakness in the Linux story. Perhaps btrfs will go some way towards fixing this, but overall it seems that Linux RAID and the associated sector-checksumming filesystems are a long way behind where they need to be.

I don't think this stuff is particularly hard to do, but it does require some focused work on btrfs and on RAID improvements, and there are patent risks due to the Sun/NetApp patent lawsuits, of course.
RAID array
That is a very common feature, and most certainly not limited to 3ware! I have Areca, Adaptec, and LSI Logic RAID cards that can all do the same.
RAID array
Linux software RAID (md) can also check an array for problems; if you install mdadm, Debian offers to run a cron job every month to do that. Coupled with SMART self-tests, you're pretty safe.
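For reference, Debian's monthly checkarray job just pokes md's sysfs interface; a minimal sketch of the same idea (assuming an array named md0 and root privileges):

    # Kick off a consistency check of /dev/md0 and report mismatches,
    # mimicking what Debian's monthly checkarray cron job does.
    # Assumes the array is named md0 and that we run as root.
    MD = "/sys/block/md0/md"

    with open(MD + "/sync_action", "w") as f:
        f.write("check\n")   # start a background read-and-compare pass

    # ... wait for sync_action to report "idle" again, then:
    with open(MD + "/mismatch_cnt") as f:
        print("mismatch_cnt: " + f.read().strip())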
RAID array
If you want to be pretty safe, you'll need to test every *day*, not once a month. I know this
from experience.
RAID array
This sounds like a useful media scrubbing feature. However, sector checksumming seems like it
requires filesystem changes, and I've only seen that proposed in btrfs, which is in early
development stages.
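As a toy illustration of what per-sector checksums buy you on top of plain mirroring (my own sketch, not how btrfs or ZFS actually lay things out):

    import zlib

    BLOCK = 4096

    def write_block(data):
        # Store a checksum alongside each copy; a real filesystem keeps
        # the checksum in metadata, away from the data block itself.
        csum = zlib.crc32(data) & 0xffffffff
        return [(data, csum), (data, csum)]   # two mirror copies

    def read_block(copies):
        # Unlike plain RAID1, we can tell WHICH copy is bad and
        # return (and repair from) the good one.
        for data, csum in copies:
            if zlib.crc32(data) & 0xffffffff == csum:
                return data
        raise IOError("both copies corrupt")

    copies = write_block(b"x" * BLOCK)
    # Silently corrupt one copy, as a flaky disk or cable would:
    copies[0] = (b"garbage" + copies[0][0][7:], copies[0][1])
    assert read_block(copies) == b"x" * BLOCK

Plain RAID1 has no way to know which of two differing copies is correct; the checksum breaks that tie, which is the whole point of the feature.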
RAID array
There are many ways that an electrical failure of a drive can knock out a computer. Think of what would happen if it shorts out the 5V or 12V power going into the drive, for an obvious example.

In practice it's normally not that bad, but with a SCSI array in particular you don't normally have a separate cable for each drive; you have them connected to a bus. If a drive is bad it can cause the bus to lock up when the drive's address is probed.

I would also be interested in the specifics of the hardware, but this sort of failure mode is not uncommon, and the number of systems that actually implement protection against all of them is extremely small.
RAID array
At least for RAID1 and RAID10, you should use two channels (and, depending on what is doing the mirroring, two controllers) and two enclosures, so that each half of a mirror pair is on a separate enclosure and channel.

An electrical failure can still knock out the entire system, as the buses are not optical, but the usual failure modes for drives nowadays at worst lock up the SCSI bus (and usually they don't get that bad). Shorts are not common IME.

Anyway, with SAS and SATA, which are point-to-point links, the use of separate enclosures (due to the power feed) and controllers becomes the real point.
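A sketch of the layout being described, using Linux md RAID10 (device names are hypothetical; md's default "near" layout mirrors adjacent devices in the order given, so interleaving the two enclosures puts each mirror half on separate hardware):

    # Hypothetical device names: sda/sdb sit in enclosure A (controller 1),
    # sdc/sdd in enclosure B (controller 2).
    enclosure_a = ["/dev/sda", "/dev/sdb"]
    enclosure_b = ["/dev/sdc", "/dev/sdd"]

    # Interleave so each adjacent pair (= one mirror) spans both enclosures.
    devices = [d for pair in zip(enclosure_a, enclosure_b) for d in pair]

    print("mdadm --create /dev/md0 --level=10 --raid-devices=%d %s"
          % (len(devices), " ".join(devices)))
    # -> mdadm --create /dev/md0 --level=10 --raid-devices=4
    #    /dev/sda /dev/sdc /dev/sdb /dev/sdd

With that ordering, losing a whole enclosure (or controller) still leaves one complete copy of the data.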
One point of note: Linux is not good for reliability here unless you can remove the "IRQ handler killer" from the picture (e.g. by using extremely good controllers that don't do stupid things, and only one device per IRQ line). You have been warned. A lot of failure modes cause weirdness on the controllers and may cause spurious IRQs. If the kernel disables the IRQ line, there goes everything tied to that line. This is especially bad on SATA.
On the vger.kernel.org outage
Why is it that http://cacti.kernel.org/graph_view.php?action=tree&tr... requires a login?
On the vger.kernel.org outage
The firehose is flowing now. Looks like it's time to get back to work.
On the vger.kernel.org outage
It hosts neither archives nor NNTP, so why is a storage drive failure such a problem? Swap?
On the vger.kernel.org outage
I'd guess the subscriber lists (as used by the mailing list software) are on the array. It's also necessary to write messages to disk during delivery, because some recipients' mail servers won't respond before it's necessary to acknowledge the messages from the senders. But the subscriber lists are the reason you'd need previously stored data to make the mailing lists work (aside from the eventual delivery of messages that were in progress when the array crashed).
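In other words, the standard store-and-forward step every MTA performs. A minimal sketch of why the disk has to be involved before the sender gets its OK (the spool path and function name are mine, not any particular MTA's):

    import os

    SPOOL = "/var/spool/mail-queue"   # hypothetical spool directory

    def accept_message(msg_id, body):
        # The sending server is waiting for our "250 OK". We may only
        # send it once the message is safely on stable storage, because
        # the slow recipient servers will be contacted much later.
        path = os.path.join(SPOOL, msg_id)
        with open(path, "wb") as f:
            f.write(body)
            f.flush()
            os.fsync(f.fileno())      # must survive a crash or power loss
        return "250 OK"               # the sender may now forget the message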
On the vger.kernel.org outage
The problem actually wasn't disk or RAID related, but filesystem related, after a kernel oops and panic. JFYI.
On the vger.kernel.org outage
This is something of a concern too. It would be interesting to know the filesystem used, how it was configured (e.g. for ext3, was barrier=1 in effect to ensure that data=ordered works properly with drive-level write caching), the kernel version, etc.

A kernel oops shouldn't cause a very long filesystem recovery, unless perhaps the disks are so huge that a full fsck takes a very long time. But then a journalling filesystem should prevent such long fscks anyway...
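For the curious, checking whether your own ext3 mounts have barriers requested is easy to script. A sketch, with the caveat that ext3 of this era generally defaulted to barriers off, so the option only shows up if someone asked for it:

    # Report ext3 mounts and whether barrier=1 appears in their options.
    with open("/proc/mounts") as f:
        for line in f:
            dev, mountpoint, fstype, opts = line.split()[:4]
            if fstype == "ext3":
                barriers = "barrier=1" in opts.split(",")
                print("%s on %s: barriers %s"
                      % (dev, mountpoint, "on" if barriers else "OFF (default)"))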