
Brown: A Nasty md/raid bug

Neil Brown has written a blog post about a nasty RAID bug in some versions of the Linux kernel. "The bug only fires when you shutdown/poweroff/reboot the machine. While the machine remains up the bug is completely inactive. So you will only notice the bug when you boot up again. The effect of the bug is to erase important information from the metadata that is stored on the disk drives. In particular the level, chunksize, number of devices, data_offset and role of each device in the array are erased ... and probably some other information too. This means that if you know those details you can recover your data, but if you don't, it will be harder. Hence the "mdadm -E" command suggested earlier."

Brown: A Nasty md/raid bug

Posted Jun 18, 2012 22:21 UTC (Mon) by xorbe (subscriber, #3165) [Link]

The best solution sounds like: upgrade kernel, sync, hit reset button!

Brown: A Nasty md/raid bug

Posted Jun 18, 2012 22:48 UTC (Mon) by alvieboy (subscriber, #51617) [Link]

I guess we can scan for the relevant information, even if most metadata is lost.

Even if you have to do some trial and error, it can be automated. With a few attempts you can eventually derive all the needed values, such as stripe size and offsets.
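To make the trial-and-error idea concrete, here is a minimal sketch that only enumerates candidate layouts to try. The function name `candidate_layouts`, the parameter choices and the device names are assumptions for illustration and do not come from the article. Checking each candidate (for example, by assembling a read-only copy and looking for a valid filesystem) is deliberately left out, since it depends on the setup.

```python
from itertools import permutations, product


def candidate_layouts(levels, chunk_sizes_kib, devices):
    """Yield every combination of RAID level, chunk size and device order.

    This only enumerates guesses; it never touches a disk. How a guess is
    validated is an assumption left to the reader.
    """
    for level, chunk, order in product(levels, chunk_sizes_kib, permutations(devices)):
        yield {"level": level, "chunk_kib": chunk, "devices": list(order)}


# Hypothetical example: one level, two chunk sizes, three devices.
layouts = list(candidate_layouts([5], [64, 512], ["/dev/sda1", "/dev/sdb1", "/dev/sdc1"]))
```

The search space grows with the factorial of the device count, so any detail that is still known (the level, say, or the role of one device) cuts the number of attempts considerably.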

Neil, however, put a lot of effort into explaining what was going on, who could be affected, and how to overcome the problem. I wish most bug fixes came with this sort of "tutorial" and explanation of the issue. Kudos to him.

Alvie

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 6:36 UTC (Tue) by Alterego (guest, #55989) [Link]

@xorbe: You should read the article; that would prevent you from posting a misleading comment.

The article says:
If you decide to upgrade your kernel, you should do so *carefully*. Remember that the bug triggers on shutdown/reboot so you aren't safe until the new kernel is running.

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 7:15 UTC (Tue) by AngryChris (subscriber, #74783) [Link]

How is his comment misleading? It's triggered on reboot/shutdown. If you upgrade the kernel, sync your filesystems, and then hit the reset button, the machine never "shuts down" from an operating perspective. The nasty bug that's hit when rebooting is bypassed.

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 10:47 UTC (Tue) by epa (subscriber, #39769) [Link]

Unfortunately lots of machines these days have a soft reset button that triggers /sbin/shutdown. Even the so-called power switch does not cut power to the machine. You have to reach around the back of the computer and yank the power cord, or remove the battery.

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 11:14 UTC (Tue) by hummassa (subscriber, #307) [Link]

One can press and hold the power button; usually it has the intended effect of just powering off the device... I don't know of any hardware where it does not work. A "sync; poweroff -f" should work too.

Magic SysRq Keys

Posted Jun 19, 2012 11:55 UTC (Tue) by k3ninho (subscriber, #50375) [Link]

Please don't do the following on your machine without expecting it to sync the disks (wait for it to complete) and immediately reboot:
alt-sysrq-s; alt-sysrq-b.

Source: http://en.wikipedia.org/wiki/Magic_SysRq_key

K3n.

Magic SysRq Keys

Posted Jun 19, 2012 12:27 UTC (Tue) by hummassa (subscriber, #307) [Link]

I usually do ctrl-alt-f1, sysrq-6 (to see the messages), sysrq-s (wait for the all sync message to pop up), sysrq-u (so no processes try to write to the disks again after the last sync), another sysrq-s (should pop the message quickly this time), sysrq-o, wait one minute or so, turn the machine on again (the last part is kind of superstitious but I feel all warm and fuzzy inside knowing that if the power fails, I have seen the machine boot from zero last time, so it "ought" to work).
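For machines without a keyboard attached, the same keys can be sent by writing single characters to /proc/sysrq-trigger as root. The following is a minimal sketch of that idea; the function name `send_sysrq`, the fixed pause and the restricted key table are assumptions for illustration. The pause is a crude stand-in for watching the console until the sync message appears.

```python
import time

# Subset of Magic SysRq keys used in the sequence described above.
SYSRQ_KEYS = {
    "s": "sync all mounted filesystems",
    "u": "remount all filesystems read-only",
    "o": "power off",
    "b": "reboot immediately",
}


def send_sysrq(keys, trigger_path="/proc/sysrq-trigger", pause_s=5.0):
    """Write each key to the trigger file, one write per key."""
    # Validate everything first so a typo cannot leave the sequence half done.
    for key in keys:
        if key not in SYSRQ_KEYS:
            raise ValueError(f"unsupported SysRq key: {key!r}")
    for key in keys:
        # Each key is sent in its own write to the trigger file.
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
        # Assumed delay; it does not confirm that the sync has finished.
        time.sleep(pause_s)


# Hypothetical usage, as root: send_sysrq("suso") powers the machine off.
```

Passing a different `trigger_path` lets the sequence be rehearsed against an ordinary file before it is pointed at the real trigger.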

Magic SysRq Keys

Posted Jun 20, 2012 7:31 UTC (Wed) by jezuch (subscriber, #52988) [Link]

> I usually do ctrl-alt-f1, sysrq-6 (to see the messages), sysrq-s (wait for the all sync message to pop up) (...)

I saw a mnemonic for this sequence somewhere:
Raising Skinny Elephants Is [So] Utterly Boring

(The sysrq-s part I saw recommended after sysrq-r or after sysrq-i so, just for safety, I do it in both places ;) )

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 11:26 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

You can use the "Raising Elephants Is So Utterly Boring" SysRq combination without the E, I and U letters.

Soft reset buttons

Posted Jun 22, 2012 17:07 UTC (Fri) by giraffedata (subscriber, #1954) [Link]

> Unfortunately lots of machines these days have a soft reset button that triggers /sbin/shutdown.

Are you sure? Have you seen this? I know the soft power down from the power switch is the norm, but I've never seen the reset button do this. Its main purpose is not to involve the operating system (unlike with power control, if you wanted to restart via the OS, you would probably have used the keyboard rather than a paperclip reset button).

Soft reset buttons

Posted Jun 28, 2012 10:54 UTC (Thu) by epa (subscriber, #39769) [Link]

I might indeed be getting confused between soft-power and reset.

Brown: A Nasty md/raid bug

Posted Jun 20, 2012 22:46 UTC (Wed) by ttonino (guest, #4073) [Link]

Well... http://neil.brown.name/blog/20120615073245 tells me that the bug only happens if the array is NOT running on shutdown, as the array state gets overwritten from memory (which in that case has the non-running state).

So, the chance of this happening all by itself is pretty low. That said, Brown gives some good advice about saving the metadata yourself.
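As a rough illustration of saving the metadata yourself, the sketch below stores the output of "mdadm -E" for each member device in a text file. The function name `save_md_metadata`, the output directory and the file naming are assumptions, not something taken from the blog post. The output directory should live somewhere other than the array being described.

```python
import subprocess
from pathlib import Path


def save_md_metadata(devices, out_dir, run=subprocess.run):
    """Save the output of `mdadm -E <device>` for each device into out_dir.

    `run` is injectable so the function can be exercised without mdadm.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for dev in devices:
        # -E (--examine) prints the md superblock found on a member device.
        result = run(["mdadm", "-E", dev], capture_output=True, text=True, check=False)
        # "/dev/sda1" becomes "dev_sda1.txt".
        target = out_dir / (dev.strip("/").replace("/", "_") + ".txt")
        target.write_text(result.stdout)
        saved.append(target)
    return saved


# Hypothetical usage, as root:
# save_md_metadata(["/dev/sda1", "/dev/sdb1"], "/root/md-metadata")
```

With those files on hand, the level, chunk size, device count, data offset and device roles are available even if the on-disk copies are erased.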

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 3:28 UTC (Tue) by whacker (guest, #55546) [Link]

Why do I have a weird sense of deja vu?

google doesn't say....

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 6:38 UTC (Tue) by Adi (guest, #52678) [Link]

Hah! Finally!
I was also one of the users that got hit by this bug.
It caused my 'reliable' RAID 1 array to disappear many times.
Somehow, I think this happened more often when I had my NTFS partition (which was also on a RAID array) mounted by NTFS-3G.

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 8:32 UTC (Tue) by dany (subscriber, #18902) [Link]

As I understood it, there was a time when the shutdown procedure made md arrays read-only, and even after that point some write I/Os were still issued... Why not fix this? You wouldn't have any of those problems.

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 11:27 UTC (Tue) by spender (subscriber, #23067) [Link]

In fact, all the boring normal bugs are _way_ more important, just because
there's a lot more of them. I don't think some spectacular RAID bug
should be glorified or cared about as being any more "special" than a
random spectacular crash due to bad locking.

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 14:26 UTC (Tue) by nix (subscriber, #2304) [Link]

This bug can render all your data instantly inaccessible at once, and it does it at powerdown so there's no way to recover before you restart, *and* you were using RAID so you thought you were safer.

Much more than most bugs, this one seems likely to give someone a horrible shock in the morning. Most crash bugs don't do that: you restart and you're OK.

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 14:32 UTC (Tue) by spender (subscriber, #23067) [Link]

I'm glad you agree it's a stupid comment!

You'll love this one then:
https://lkml.org/lkml/2008/7/15/296

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 14:53 UTC (Tue) by niner (subscriber, #26151) [Link]

IOW it's just an elaborate trolling attempt

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 15:27 UTC (Tue) by nix (subscriber, #2304) [Link]

We should expect no more from spender. He can't control himself. :P

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 13:40 UTC (Tue) by engla (guest, #47454) [Link]

Isn't it time to fix root filesystem issues once and for all (enter ramfs, unmount everything cleanly)?

Brown: A Nasty md/raid bug

Posted Jun 19, 2012 15:38 UTC (Tue) by drag (guest, #31333) [Link]

Systemd can umount things cleanly at shutdown, I believe.


Copyright © 2012, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds