On the vger.kernel.org outage
There was a hard disk failure yesterday on one of the RAID arrays that was bad enough to take the system down. The disk controller would subsequently hang during reboot attempts with the failed disk attached. Folks at the co-location facility are actively trying to bring it back up and rebuild the array. With luck the linux-kernel firehose will resume flowing soon.
Update: It looks like vger has come back up.
RAID array
What model of RAID array is this? It looks like one to avoid if the failure of a single drive
brought down the array.
RAID array
RAID is not about resistance to every sort of partial failure we can imagine; it's about ensuring that upon a disk failure you will not lose your data. If one disk becomes 99% slower but still works, there's no reason it would be marked faulty, but in practice the service is no longer assured. However, having someone over there replace it fixes the problem.
RAID array
That said, RAID arrays that fail when a specific device is plugged in are awfully suspect. Signal problems on the port should be a soft failure, not a hard one. That's the whole point of spending lots of money on server hardware. If it were OK for a drive to take down the system, they could have been running on a $900 box from Walmart. All a file server box really needs is a ton of RAM; disk bandwidth needn't enter into it. Didn't I see a report a while back that kernel.org was, in fact, serving everything out of cache anyway?

All of which, really, just goes down as evidence for my long-held opinion that hardware-level solutions for reliability never work. Reliability can only be achieved at the software level via full redundancy.
"Reliability can only be achieved at the software level."
Instead of a mailing list, a Usenet-like discussion system on top of git?
RAID array
Amusingly, the original vision for monotone actually had NNTP as the primary intended network
transport.
RAID array
RAID is nowhere near as reliable in practice as it appears on paper. One example: you frequently find, after a single disk (A) has failed, that another disk (B) "fails during recovery". In reality, some sectors on disk B failed first, but on sectors that never got read or written, so the bad sectors are only picked up during RAID recovery, which entails reading the whole of disk B and the other disks.

See http://www.nber.org/sys-admin/linux-nas-raid.html for coverage of some of these issues and how NetApp and others would cover them. Personally I would only use ZFS, NetApp or something similar that does a lot of media scrubbing, sector checksumming etc., in addition to basic RAID.
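A back-of-the-envelope sketch of why this bites so often during rebuilds. The numbers are my own illustration, assuming the commonly quoted unrecoverable-read-error (URE) rate of one per 10^14 bits read; real drives vary, so treat the output as indicative only:

    # Chance of hitting at least one unrecoverable read error (URE)
    # while reading an entire surviving disk during a RAID rebuild.
    # Assumes the often-quoted rate of 1 URE per 1e14 bits read.
    def rebuild_ure_probability(disk_bytes, ure_per_bit=1e-14):
        bits_read = disk_bytes * 8
        # Probability that every bit reads back cleanly, then invert.
        return 1 - (1 - ure_per_bit) ** bits_read

    for size_gb in (250, 500, 1000):
        p = rebuild_ure_probability(size_gb * 1e9)
        print("%4d GB disk: %.1f%% chance of a URE during rebuild"
              % (size_gb, p * 100))

For a 1 TB disk that works out to roughly an 8% chance of tripping over a latent bad sector during a full-disk rebuild, which is why scrubbing the array before a disk actually dies matters so much.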
RAID array
We use 3ware controllers at work, which can be set up to automatically and periodically verify that no sectors have failed. That feature has already saved us once.
RAID array
"Personally I would only use ZFS, NetApp or something similar that does a lot of media
scrubbing, sector checksumming etc, in addition to basic RAID."
Would that not be an admission of defeat for Linux?
RAID array
Yes, it means I could currently only use Linux for storage if I went for ZFS/FUSE. This is a significant weakness in the Linux story. Perhaps btrfs will go some way towards fixing this, but overall it seems that Linux RAID and the associated sector-checksumming filesystems are a long way behind where they need to be.

I don't think this stuff is particularly hard to do, but it does require some focused work on btrfs and on RAID improvements, and there are patent risks due to the Sun/NetApp patent lawsuits, of course.
RAID array
That is a very common feature, and most certainly not limited to 3ware! I have Areca, Adaptec, and LSI Logic RAID cards that can all do the same.
RAID array
Linux software RAID (md) can also check an array for problems; if you install mdadm, Debian offers to run a cron job every month to do that. Coupled with SMART self-tests, you're pretty safe.
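For reference, Debian's monthly checkarray job just pokes md's sysfs interface; a minimal sketch of the same idea (assuming an array named md0 and root privileges):

    # Kick off a consistency check of /dev/md0 and report mismatches,
    # mimicking what Debian's monthly checkarray cron job does.
    # Assumes the array is named md0 and that we run as root.
    MD = "/sys/block/md0/md"

    with open(MD + "/sync_action", "w") as f:
        f.write("check\n")   # start a background read-and-compare pass

    # ... wait for sync_action to report "idle" again, then:
    with open(MD + "/mismatch_cnt") as f:
        print("mismatch_cnt: " + f.read().strip())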
RAID array
If you want to be pretty safe, you'll need to test every *day*, not once a month. I know this
from experience.
RAID array
This sounds like a useful media scrubbing feature. However, sector checksumming seems like it
requires filesystem changes, and I've only seen that proposed in btrfs, which is in early
development stages.
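As a toy illustration of what per-sector checksums buy you on top of plain mirroring (my own sketch, not how btrfs or ZFS actually lay things out):

    import zlib

    BLOCK = 4096

    def write_block(data):
        # Store a checksum alongside each copy; a real filesystem keeps
        # the checksum in metadata, away from the data block itself.
        csum = zlib.crc32(data) & 0xffffffff
        return [(data, csum), (data, csum)]   # two mirror copies

    def read_block(copies):
        # Unlike plain RAID1, we can tell WHICH copy is bad and
        # return (and repair from) the good one.
        for data, csum in copies:
            if zlib.crc32(data) & 0xffffffff == csum:
                return data
        raise IOError("both copies corrupt")

    copies = write_block(b"x" * BLOCK)
    # Silently corrupt one copy, as a flaky disk or cable would:
    copies[0] = (b"garbage" + copies[0][0][7:], copies[0][1])
    assert read_block(copies) == b"x" * BLOCK

Plain RAID1 has no way to know which of two differing copies is correct; the checksum breaks that tie, which is the whole point of the feature.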
RAID array
There are many ways that an electrical failure of a drive can knock out a computer. Think of what would happen if it shorts out the 5V or 12V power going into the drive, for an obvious example.

In practice it's normally not that bad, but with a SCSI array in particular you don't normally have a separate cable for each drive; you have them connected to a bus. If a drive is bad it can cause the bus to lock up when the drive's address is probed.

I would also be interested in the specifics of the hardware, but this sort of failure mode is not uncommon, and the number of systems that actually implement protection against all of them is extremely small.
RAID array
At least for RAID1 and RAID10, you should use two channels (and, depending on what is doing the mirroring, two controllers) and two enclosures, so that each half of a mirror pair is on a separate enclosure and channel.

An electrical failure can still knock out the entire system, as the buses are not optical, but the usual failure modes for drives nowadays at worst lock up the SCSI bus (and usually they don't get that bad). Shorts are not common IME.

Anyway, with SAS and SATA, which are point-to-point links, the use of separate enclosures (due to the power feed) and controllers becomes the real point.
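A sketch of the layout being described, using Linux md RAID10 (device names are hypothetical; md's default "near" layout mirrors adjacent devices in the order given, so interleaving the two enclosures puts each mirror half on separate hardware):

    # Hypothetical device names: sda/sdb sit in enclosure A (controller 1),
    # sdc/sdd in enclosure B (controller 2).
    enclosure_a = ["/dev/sda", "/dev/sdb"]
    enclosure_b = ["/dev/sdc", "/dev/sdd"]

    # Interleave so each adjacent pair (= one mirror) spans both enclosures.
    devices = [d for pair in zip(enclosure_a, enclosure_b) for d in pair]

    print("mdadm --create /dev/md0 --level=10 --raid-devices=%d %s"
          % (len(devices), " ".join(devices)))
    # -> mdadm --create /dev/md0 --level=10 --raid-devices=4
    #    /dev/sda /dev/sdc /dev/sdb /dev/sdd

With that ordering, losing a whole enclosure (or controller) still leaves one complete copy of the data.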
One point of note: Linux is not good for reliability here unless you can remove the "IRQ handler killer" from the picture (e.g. by using extremely good controllers that don't do stupid things, and only one device per IRQ line). You have been warned. A lot of failure modes cause weirdness on the controllers and may cause spurious IRQs. If the kernel disables the IRQ line, there goes everything tied to that line. This is especially bad on SATA.
On the vger.kernel.org outage
Why is it that http://cacti.kernel.org/graph_view.php?action=tree&tr... requires a login?
On the vger.kernel.org outage
The firehose is flowing now. Looks like it's time to get back to work.
On the vger.kernel.org outage
It hosts neither archives nor NNTP, so why is a storage drive failure such a problem? Swap?
On the vger.kernel.org outage
I'd guess the subscriber lists (as used by the mailing list software) are on the array. It's also necessary to write messages to disk during delivery, because some recipients' mail servers won't respond before it's necessary to acknowledge the messages from the senders. But the subscriber lists are the reason you'd need previously stored data to make the mailing lists work (aside from the eventual delivery of messages that were in progress when the array crashed).
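In other words, the standard store-and-forward step every MTA performs. A minimal sketch of why the disk has to be involved before the sender gets its OK (the spool path and function name are mine, not any particular MTA's):

    import os

    SPOOL = "/var/spool/mail-queue"   # hypothetical spool directory

    def accept_message(msg_id, body):
        # The sending server is waiting for our "250 OK". We may only
        # send it once the message is safely on stable storage, because
        # the slow recipient servers will be contacted much later.
        path = os.path.join(SPOOL, msg_id)
        with open(path, "wb") as f:
            f.write(body)
            f.flush()
            os.fsync(f.fileno())      # must survive a crash or power loss
        return "250 OK"               # the sender may now forget the message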
On the vger.kernel.org outage
The problem actually wasn't disk or RAID related, but filesystem related, after a kernel oops and panic. JFYI.
On the vger.kernel.org outage
This is something of a concern too. It would be interesting to know the filesystem used, how it was configured (e.g. for ext3, was barrier=1 in effect to ensure that data=ordered works properly with drive-level write caching), the kernel version, etc.

A kernel oops shouldn't cause a very long filesystem recovery, unless perhaps the disks are so huge that a full fsck takes a very long time. But then a journalling filesystem should prevent such long fscks anyway...
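For the curious, checking whether your own ext3 mounts have barriers requested is easy to script. A sketch, with the caveat that ext3 of this era generally defaulted to barriers off, so the option only shows up if someone asked for it:

    # Report ext3 mounts and whether barrier=1 appears in their options.
    with open("/proc/mounts") as f:
        for line in f:
            dev, mountpoint, fstype, opts = line.split()[:4]
            if fstype == "ext3":
                barriers = "barrier=1" in opts.split(",")
                print("%s on %s: barriers %s"
                      % (dev, mountpoint, "on" if barriers else "OFF (default)"))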