
savannah.gnu.org status

The server that hosts GNU and other projects, savannah.gnu.org, had a catastrophic failure over the weekend. A message on the savannah-users mailing list describes the problems with a RAID array, which eventually led to filesystem corruption. This took out the source code repositories for the projects hosted there. "The last backup was performed while RAID was buggy, and lots of files were reported missing, in particular for CVS/SVN/Git/Hg. Hence the last backup is incomplete. [...] And, our last full backup from tape is from end of April. Normally tape backups are more recent, but there were independent backup issues. We've not discussed since in detail as we're focusing on recovering the data asap." Status updates are available via fsfstatus on identi.ca. (Thanks to David P. Reed.)

savannah.gnu.org status

Posted Jun 1, 2009 23:47 UTC (Mon) by jcm (subscriber, #18262) [Link]

The question is, does "while RAID was buggy" mean that the RAID failed and the box became "buggy", or does it mean that someone decided it might be a good idea to throw a random kernel on a production box?

throwing a random kernel on a production box

Posted Jun 2, 2009 1:00 UTC (Tue) by pr1268 (subscriber, #24648) [Link]

> ...throw a random kernel on a production box

Maybe they just don't want to admit that. (Of course, if this happened to me I wouldn't want to admit it either, instead putting a brown paper bag on and hiding under a rock until the mess blows over. ;) )

Pretty standard kernel

Posted Jun 2, 2009 20:03 UTC (Tue) by man_ls (subscriber, #15091) [Link]

Nope, AFAICT the kernel is a standard vserver lenny kernel. I believe that it was the hardware that failed.

Pretty standard kernel

Posted Jun 3, 2009 10:39 UTC (Wed) by dwmw2 (subscriber, #2063) [Link]

Hm, I seem to recall ditching vserver from OLPC because we couldn't get it to work reliably. It was doing really strange things with the file system. On hearing that someone is using vserver and seeing file system corruption, that would always be my first suspicion.

I'm shocked that you speak of a 'standard vserver lenny kernel' — Debian just went down in my estimation if they're shipping that to users.

If it's really RAID issues, that just highlights the brokenness of the RAID model. You want to build your redundancy into the file system, as btrfs does, not do it at a lower level that doesn't even understand what it's replicating.

Pretty standard kernel

Posted Jun 3, 2009 20:44 UTC (Wed) by man_ls (subscriber, #15091) [Link]

Sure they are distributing it. It seems that it was a disk hardware failure though; maybe it was compounded with hidden bugs to result in data loss. Pretty scary stuff.

Argument for DVCS

Posted Jun 2, 2009 1:15 UTC (Tue) by proski (subscriber, #104) [Link]

This failure, regrettable as it is, could be used to demonstrate the advantage of distributed version control systems. With a DVCS, developers have the whole repository. Even if no data is recovered on the server, it can be recovered from a developer's copy.

Argument for DVCS

Posted Jun 2, 2009 8:40 UTC (Tue) by danpb (subscriber, #4831) [Link]

Indeed, those projects using CVS/SVN have likely lost more than a month's worth of history, while those using HG/GIT can restore their full history from any user who cloned the repository. While it is not an excuse for skipping traditional backups, the safety net provided by the distributed history is an excellent reason for using GIT/HG. Hopefully this will encourage those still holding out to finally ditch CVS/SVN.

Argument for DVCS

Posted Jun 2, 2009 10:05 UTC (Tue) by jengelh (subscriber, #33263) [Link]

I am just *so* waiting for code.google.com's svn to go down in the same manner :-> just because rumour has it half of the users hit by the savannah downing won't learn from the incident.

Argument for DVCS

Posted Jun 2, 2009 13:39 UTC (Tue) by drag (subscriber, #31333) [Link]

> While it is not a excuse for skipping traditional backups,

A DVCS like git used in a moderately popular project is a hell of a lot better than traditional backups, actually.

Say your code gets downloaded by 100 users interested in looking at your project... each one of them has a full copy of everything. How many backup software products that you know of make 100 full backups with full history, and not only keep a hash of your backup but also sha1-hash each individual file?

To do a backup all you have to do is an occasional 'git pull' from some machine and you're done. That's it. Simple as pie, and it's something you'll do half a dozen times a day if you're busy... how many backup programs make sure that they are fully synced up every time you want to make a change?

It's something that is so simple and so effective that it's pretty much just done correctly by accident... it would be more difficult to do it incorrectly. Lazy for the win.
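The per-file hashing mentioned above can be illustrated with a small sketch. It assumes a repository using Git's SHA-1 object format, in which a file's object ID is the SHA-1 of a short "blob" header followed by the content. The helper names `git_blob_id` and `is_intact` are made up for this illustration and are not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a file with this content.

    Git hashes the header b"blob <size>\\0" followed by the raw bytes.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def is_intact(content: bytes, expected_id: str) -> bool:
    """Hypothetical check: does this content still match its recorded ID?"""
    return git_blob_id(content) == expected_id


if __name__ == "__main__":
    # The ID of an empty file is a well-known constant in SHA-1 repositories.
    print(git_blob_id(b""))
```

Because every clone carries these IDs, a copy whose content no longer matches its recorded ID can be detected rather than silently trusted.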

Argument for DVCS

Posted Jun 2, 2009 15:01 UTC (Tue) by jamesh (guest, #1159) [Link]

Source code isn't the only thing sites like Savannah host. This also affects things like project bug trackers which don't have that sort of protection.

Distributed bug trackers are not really at the point where they'd be useful to many projects yet.

Argument for DVCS

Posted Jun 2, 2009 16:24 UTC (Tue) by joey (subscriber, #328) [Link]

Unless you have significant hook scripts in the repository. Git doesn't clone those with the rest of the repo, and it can be easy to forget to find a way to check them in or otherwise back them up.

Some of the control files present in the bare repo also are not cloned (config, description), which could also slightly suck to deal with.
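One way to cover the pieces that a clone leaves behind is to archive them separately. The following is only a sketch under stated assumptions: the function name `archive_repo_metadata` and the paths in the usage note are hypothetical, and the list of entries simply mirrors the ones named in this comment (hooks, config, description).

```python
import tarfile
from pathlib import Path
from typing import List

# Entries inside a bare repository that 'git clone' does not copy,
# as named in the comment above.
UNCLONED = ("hooks", "config", "description")


def archive_repo_metadata(bare_repo: str, dest_tar: str) -> List[str]:
    """Write the uncloned entries of bare_repo into a gzip'd tarball.

    Returns the names that were actually found and archived.
    """
    repo = Path(bare_repo)
    saved = []
    with tarfile.open(dest_tar, "w:gz") as tar:
        for name in UNCLONED:
            path = repo / name
            if path.exists():
                # Directories (such as hooks/) are added recursively.
                tar.add(str(path), arcname=name)
                saved.append(name)
    return saved
```

A call such as `archive_repo_metadata("/srv/git/project.git", "/backup/project-meta.tar.gz")` (both paths invented for the example) could run from cron next to whatever mirrors the repository itself.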

Argument for DVCS

Posted Jun 2, 2009 17:42 UTC (Tue) by flewellyn (subscriber, #5047) [Link]

The truth of the statement: "Linus Torvalds doesn't need backups. He just uploads his files and lets the world mirror them."

Argument for DVCS

Posted Jun 3, 2009 5:10 UTC (Wed) by danielbaumann (subscriber, #38804) [Link]

ftr, the original quote is:

“Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)”

Argument for DVCS

Posted Jun 3, 2009 21:37 UTC (Wed) by job (guest, #670) [Link]

Git-like systems can have bugs too, just as file systems and RAID controllers can. If you rely on git for backups, you can't recover from a bug in it that silently corrupts your data. It's best to archive checked-out data at regular intervals as well.

savannah.gnu.org status

Posted Jun 2, 2009 2:44 UTC (Tue) by quotemstr (subscriber, #45331) [Link]

Ouch. I anticipate quite a few "uhh, recommit from working copy" messages in the future.

savannah.gnu.org status

Posted Jun 2, 2009 5:04 UTC (Tue) by dkk (subscriber, #50184) [Link]

"Savannah now has 12 mirrors covering 8 countries and 4 continents!"

couldn't they just use one of those for the recovery?

savannah.gnu.org status

Posted Jun 2, 2009 7:14 UTC (Tue) by madhatter (subscriber, #4665) [Link]

> The last backup was performed while RAID was buggy

it's likely that the mirrors were also then sync'ed from the buggy RAID, and contain beautiful copies of the same crap data that's on the master. poor guys.

savannah.gnu.org status

Posted Jun 2, 2009 7:17 UTC (Tue) by Kit (guest, #55925) [Link]

From the sound of it, their servers didn't just die, they first started corrupting the data, which likely resulted in the corrupted data getting mirrored to the other servers... Unfortunately, a really sucky situation to find yourself in.

savannah.gnu.org status

Posted Jun 2, 2009 9:26 UTC (Tue) by zdzichu (subscriber, #17118) [Link]

Looks like a case against proprietary RAID solutions and filesystems that are not self-healing.

savannah.gnu.org status

Posted Jun 2, 2009 10:04 UTC (Tue) by rodgerd (guest, #58896) [Link]

Or perhaps not.

savannah.gnu.org status

Posted Jun 2, 2009 11:56 UTC (Tue) by niner (subscriber, #26151) [Link]

More likely a case for daily incremental backups with a long history,
as no hardware or filesystem is completely secure from such failures.
There are bugs everywhere.

savannah.gnu.org status

Posted Jun 2, 2009 13:25 UTC (Tue) by skvidal (subscriber, #3094) [Link]

+ many.

12 months to 2 years of one-month granularity backups <- tape
2 months of 1-day granularity backups <- tape
1 week of twice-daily backups <- disk

much profit
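The tiered schedule above can be written down as a pruning rule. This is a rough sketch, not anyone's actual backup configuration: it assumes snapshots are taken at 00:00 and 12:00, that "daily" means the 00:00 snapshot, that "monthly" means the 00:00 snapshot on the first of the month, and it uses 60 days and 730 days for "2 months" and "2 years". The name `keep_snapshot` is invented for the example.

```python
from datetime import datetime, timedelta


def keep_snapshot(taken: datetime, now: datetime) -> bool:
    """Decide whether a snapshot taken at `taken` should still be kept."""
    age = now - taken
    if age <= timedelta(days=7):
        return True                                  # twice-daily tier (disk)
    if age <= timedelta(days=60):
        return taken.hour == 0                       # daily tier (tape)
    if age <= timedelta(days=730):
        return taken.hour == 0 and taken.day == 1    # monthly tier (tape)
    return False                                     # older than two years


if __name__ == "__main__":
    now = datetime(2009, 6, 2, 12)
    for taken in (datetime(2009, 6, 1, 12), datetime(2009, 5, 1, 12),
                  datetime(2009, 5, 1, 0), datetime(2008, 9, 1, 0)):
        print(taken.isoformat(), keep_snapshot(taken, now))
```

With a rule like this, corruption that goes unnoticed for a few weeks still leaves older daily and monthly copies to fall back on.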

Mirrors and backups

Posted Jun 2, 2009 20:09 UTC (Tue) by man_ls (subscriber, #15091) [Link]

> couldn't they just use one of those for the recovery?

Just for files released by projects, which is what is mirrored. For the rest (database, websites, source code repositories) the mirrors have no information at all.

savannah.gnu.org status

Posted Jun 2, 2009 6:03 UTC (Tue) by stock (guest, #5849) [Link]

SATA RAID should be fairly reliable. what might be a problem is huge
disks like 1 or 2 TB and above of which the firmware is unreliable.
This was reported in January. Not only did e.g. SuSE 11.0 by default
switch off fsck for ext3, doing a manual ext3 fsck resulted in errors
using large partitions, like 20 TB.

But here's where it gets interesting :

"check your drives firmware"
http://stx.lithium.com/stx/board/message?board.id=ata_dri...

Where Seagate Barracuda's (a 'Product of Thailand') get a SIGSEGV
during a firmware flash update.
Strangely enough, when i try to partition huge SATA disks of 1 TB or
above i get the same error using good old fdisk. Some say that only
gnupart is the tool to use.

So either there's something rotten with firmware in disks from the Far
East, or ext3 still contains an undetected bug which only shows up with
huge partitions.

Robert


savannah.gnu.org status

Posted Jun 2, 2009 11:06 UTC (Tue) by nix (subscriber, #2304) [Link]

I suspect your fdisk errors appear at the 2TB limit. That's the point at
which old-style DOS filesystems (the only type fdisk can manipulate) burst
their seams.

parted can create GPT partition tables, which have no such limit (well,
their limit is dramatically higher).

savannah.gnu.org status

Posted Jun 3, 2009 21:33 UTC (Wed) by job (guest, #670) [Link]

It's not the filesystem that can't go beyond 2TB, it's the partition table itself that can't. BSD slices, LVM, and GPT can all go beyond that but they can be hard to boot from with BIOS.
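The 2TB figure follows from the layout of a DOS (MBR) partition table entry, which stores the starting sector and the sector count as 32-bit values. A short sketch of the arithmetic, assuming the usual 512-byte sectors (the function name is made up for the example):

```python
SECTOR_SIZE = 512      # bytes per sector; an assumption, 4096-byte sectors differ
LBA_FIELD_BITS = 32    # width of the start and length fields in an MBR entry


def mbr_max_bytes(sector_size: int = SECTOR_SIZE) -> int:
    """Largest size expressible by a 32-bit sector count."""
    return (2 ** LBA_FIELD_BITS) * sector_size


if __name__ == "__main__":
    limit = mbr_max_bytes()
    print(limit)            # 2199023255552
    print(limit / 2 ** 40)  # 2.0, i.e. 2 TiB
```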

savannah.gnu.org status

Posted Jun 3, 2009 22:31 UTC (Wed) by nix (subscriber, #2304) [Link]

Er. Yeah. I meant 'partition', really (although DOS FSen implode long
before that, nobody uses those for anything serious these days). Posting
on hayfever drugs makes me a brainfart-prone idiot.

hayfever drugs

Posted Jun 4, 2009 8:12 UTC (Thu) by xoddam (subscriber, #2322) [Link]

That excuse is starting to wear a little bit thin, nix!

hayfever drugs

Posted Jun 4, 2009 17:11 UTC (Thu) by spender (subscriber, #23067) [Link]

hayfever drugs

Posted Jun 4, 2009 20:12 UTC (Thu) by nix (subscriber, #2304) [Link]

Sorry, it's hard to get rid of it (OK, impossible as far as I know).

(I'm stunned by Brad actually hunting down past comments saying similar
things: does he have nothing useful to do at all?! Perhaps his google-fu
is simply stronger than mine.)

savannah.gnu.org status

Posted Jun 2, 2009 12:28 UTC (Tue) by hensema (guest, #980) [Link]

I wonder what RAID system (software or hardware, and if hardware, what hardware) they use.

savannah.gnu.org status

Posted Jun 2, 2009 16:38 UTC (Tue) by dlang (✭ supporter ✭, #313) [Link]

according to a link higher in this thread they got bit by a significant ZFS bug and that is what corrupted the system.

ZFS might be a red herring

Posted Jun 2, 2009 16:46 UTC (Tue) by grantingram (guest, #18390) [Link]

I think the link about ZFS has little to do with Savannah. It refers to the "Joyeur Community" and is dated 16th of January 2008. (About 18 months old)

Savannah appears to have failed this weekend and I have no idea if they are using ZFS or not.

ZFS might be a red herring

Posted Jun 4, 2009 7:52 UTC (Thu) by rodgerd (guest, #58896) [Link]

Yeah. The link I provided was a rebuttal of a poster's suggestion that there's an automatic safety in open source, checksumming filesystem and RAID layers. Sorry for any confusion.

savannah.gnu.org status

Posted Jun 2, 2009 13:33 UTC (Tue) by utoddl (subscriber, #1232) [Link]

Software wants to be free of course, but mostly,
  software wants to be backed up!

savannah.gnu.org status

Posted Jun 3, 2009 11:32 UTC (Wed) by dark (subscriber, #8483) [Link]

My experience is that software doesn't want to be backed up. It struggles mightily, and even sneaks off while you're not looking. It's like giving medicine to a cat.

savannah.gnu.org status

Posted Jun 2, 2009 16:42 UTC (Tue) by stock (guest, #5849) [Link]

Well, it's not really Linux's fdisk which fails, it is mkfs.ext3 which
dies when making large partitions using ALL of the available cylinders.
It is however also related to kernel version and the ext2 utils you
have installed.

[jackson:root]:(~)# mkfs.ext3 -V
mke2fs 1.35 (28-Feb-2004)
Using EXT2FS Library version 1.35
[jackson:root]:(~)# uname -a
Linux jackson.stokkie.net 2.6.15 #2 SMP PREEMPT Wed Aug 27 06:44:09 CEST
2008 x86_64 AMD Opteron(tm) Processor 246 unknown GNU/Linux
[jackson:root]:(~)#

Ok i know this is old gear, but which versions are known to solve all
this mess? Here's how i had to partition a 500 GB disk using the above
Linux kernel and mkfs.ext3 :

Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1       54960   441466168+  83  Linux

So i could only go as far as 54960 cylinders instead of the 60801
cylinders available.
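The fdisk figures above can be cross-checked with a little arithmetic. The sketch below assumes the old DOS-compatible layout, in which the first partition skips the first track (63 sectors) of the disk; that assumption is not stated in the post, though it does reproduce the "441466168+" block count shown. The function name is invented for the example.

```python
HEADS, SECTORS_PER_TRACK, SECTOR_BYTES = 255, 63, 512
SECTORS_PER_CYL = HEADS * SECTORS_PER_TRACK   # 16065, as in "Units" above


def partition_sectors(start_cyl: int, end_cyl: int) -> int:
    """Sectors in a partition spanning start_cyl..end_cyl (1-based, inclusive).

    Assumes a partition starting at cylinder 1 leaves the first track free.
    """
    sectors = (end_cyl - start_cyl + 1) * SECTORS_PER_CYL
    if start_cyl == 1:
        sectors -= SECTORS_PER_TRACK
    return sectors


if __name__ == "__main__":
    used = partition_sectors(1, 54960)
    # fdisk reports 1 KiB blocks; the trailing "+" marks a leftover sector.
    print(used // 2, used % 2)                              # 441466168 1
    unused_cyls = 60801 - 54960
    print(unused_cyls * SECTORS_PER_CYL * SECTOR_BYTES)     # bytes left unused
```

By this reckoning the 5841 unused cylinders amount to roughly 48 GB of the disk.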

savannah.gnu.org status

Posted Jun 2, 2009 17:29 UTC (Tue) by nix (subscriber, #2304) [Link]

Upgrade. That's an absurdly ancient version of e2fsprogs (2004!), older
than support for filesystems of that size in ext2. Numerous improvements
for large filesystems have gone in since then.

e2fsprogs 1.41.5 has no problem with it.

savannah.gnu.org status

Posted Jun 3, 2009 8:04 UTC (Wed) by wingo (subscriber, #26929) [Link]

It seems they did find a May 27th backup. That's good.

savannah had administrators?

Posted Jun 3, 2009 16:28 UTC (Wed) by bkoz (guest, #4027) [Link]

News to me.

That this happened is hardly a surprise. The FSF seems to have mostly ignored savannah for years.

I've had simple requests in to create skills for GNU contributors (things like "knows doxygen") for upwards of 3-4 years without a comment. Anticipated time to fix: 5 minutes. There is still no "GCC" group on savannah to join. Etc., etc., etc.

Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds