Ext4 data corruption trouble
Posted Oct 24, 2012 20:19 UTC (Wed) by nix (subscriber, #2304)
I suspect /usr/src survived simply because, in the 3.6.3 session when I wrote to it, it happened to get cleanly unmounted rather than uncleanly. (Why /var gets hit so much more often than other dirty filesystems, I still am not quite clear on. I suspect something using /var is not dying when it should when I shut my system down, holding /var open so it never gets unmounted properly and the bug always hits it.)
Posted Oct 24, 2012 20:50 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
Posted Oct 24, 2012 21:02 UTC (Wed) by nix (subscriber, #2304)
In this situation, you cannot unmount local filesystems until you've killed everything that may have them as current directories -- but you can't kill local processes either, because that'll render your loopback-NFS mount, and everything underneath it including the local mount, inaccessible, and you don't learn this until your umount on reboot stalls indefinitely, which is tough if you're hundreds of miles away and working remotely and this is your main fileserver.
So I resorted to lazy-unmounting everything. But this unfortunately means that umount returns before the *successful* umounts are complete. So I sleep for a bit after umounting... but not necessarily long enough. This is all really gross and unclean and disgusting, but it's been working for many years so I forgot it existed and never tried to make it less gross.
(Aside: this may well go wrong even if I sleep for ages, in which case all that is needed to trigger this is a non-unmounted fs, not an fs halfway through unmounting. I'll be testing that next.)
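The lazy-unmount-then-sleep hack described above might look roughly like this (an illustrative sketch only; the helper reads /proc/mounts-style input on stdin, and the sleep duration is made up, not taken from any real script):

```shell
#!/bin/sh
# Sketch of a lazy-unmount shutdown sequence.

# Print mount points to detach, children before parents, skipping /.
# Reverse lexicographic sort works because a mount point always sorts
# after its own prefix, so /a/b comes out before /a.
mounts_to_detach() {
    awk '$2 != "/" { print $2 }' | sort -r
}

shutdown_unmount() {
    mounts_to_detach < /proc/mounts | while read -r mnt; do
        # umount -l returns immediately; the real unmount happens
        # later, whenever the filesystem finally becomes idle.
        umount -l "$mnt" 2>/dev/null
    done
    # Hope the in-flight unmounts complete before the reboot; there
    # is no reliable way to wait for them, which is exactly the
    # problem being described.
    sleep 5
}
```

The sleep is the weak point: it papers over the race without closing it.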
Posted Oct 24, 2012 21:26 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)
Posted Oct 24, 2012 21:36 UTC (Wed) by nix (subscriber, #2304)
Raw umount(8) does a toposort unmount as well. It is not enough.
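For context, the toposort just means unmounting the most deeply nested mount points first, so children go away before their parents. A sketch of that ordering (not the actual umount(8) implementation):

```shell
# Order mount points for unmounting: deepest paths first. Reads one
# mount point per line on stdin, writes them in unmount order.
unmount_order() {
    while read -r mnt; do
        # Prefix each path with its nesting depth (slash count).
        printf '%d %s\n' "$(printf '%s' "$mnt" | tr -cd / | wc -c)" "$mnt"
    done | sort -rn | cut -d' ' -f2-
}
```

As the comment says, ordering alone is not enough: a mount pinned open by a process, or hidden in another namespace, defeats any such loop.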
Posted Oct 24, 2012 22:00 UTC (Wed) by tomegun (subscriber, #56697)
In that case systemd will jump back to the initramfs on shutdown, and the initramfs will then try to kill or unmount whatever processes and mounts remain in the rootfs.
Posted Oct 25, 2012 11:16 UTC (Thu) by nix (subscriber, #2304)
Worse yet, what if you have processes in other PID namespaces, holding open filesystems in other filesystem namespaces? The initramfs can't even see them! *No* umount loop can fix that. I hate adding new syscalls, but I really do think we need a new 'unmount the world' syscall which can cross such boundaries :(
Posted Oct 25, 2012 12:50 UTC (Thu) by rleigh (subscriber, #14622)
Posted Oct 25, 2012 13:31 UTC (Thu) by nix (subscriber, #2304)
Posted Oct 24, 2012 23:34 UTC (Wed) by nix (subscriber, #2304)
OK, it turns out that you need to do rather crazy things to make this go wrong -- and if you hit it at the wrong moment, 3.6.1 is vulnerable too, and quite possibly every Linux version ever. To wit, you need to disconnect the block device or reboot *during* the umount. This may well be an illegitimate thing to do, but it is unfortunately also quite *easy* to do if you pull out a USB key.
Worse yet, if you umount -l a filesystem, it becomes dangerous to *ever* reboot, because as far as I can tell there is no way to know when the lazy umount switches from 'not yet umounted, mount point still in use, safe to reboot' to 'umount in progress, rebooting is disastrous'.
I still haven't found a way to safely unmount all filesystems if you have local filesystems nested underneath NFS filesystems (where the NFS filesystems may require userspace daemons to be running in order to unmount, and the local filesystems generally require userspace daemons to be dead in order to unmount).
It may work to kill everything whose cwd is not / or which has a terminal, then unmount NFS and local filesystems in succession until you can make no more progress -- but it seems appallingly complicated and grotty, and will break as soon as some daemon holds a file open on a non-root filesystem. What's worse, it leads to shutdown locking up if a remote NFS server is unresponsive, which is the whole reason why I started using lazy umount at shutdown in the first place!
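A sketch of that "unmount until no more progress" approach (illustrative only: the kill step is omitted, and as noted, a hung NFS server will still stall each individual umount):

```shell
# Repeatedly try to unmount everything except /, deepest mounts
# first, until a full pass makes no progress. Nested local and NFS
# mounts then come apart in whatever order happens to succeed.
unmount_until_stuck() {
    while :; do
        progress=0
        for mnt in $(awk '$2 != "/" { print $2 }' /proc/mounts | sort -r); do
            if umount "$mnt" 2>/dev/null; then
                progress=1
            fi
        done
        # Stop once nothing more can be unmounted.
        [ "$progress" -eq 1 ] || break
    done
}
```

Each pass can unblock the next (a local fs under an NFS mount becomes reachable once the NFS mount goes), but any daemon holding a file open stops the whole scheme cold.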
Posted Oct 25, 2012 0:06 UTC (Thu) by Kioob (subscriber, #56482)
I was thinking about a problem with DRBD, then I saw this news... so... I don't know. Is it really the only way to trigger that problem?
Posted Oct 25, 2012 0:31 UTC (Thu) by nix (subscriber, #2304)
I have speculated on ways to fix this for good, though they require a new syscall, a new userspace utility, changes to shutdown scripts, that others on l-k agree my idea is not utterly insane, and for me to bother to implement all of this. The latter is questionable, given the number of things I mean to do that I never get around to. :)
Posted Oct 25, 2012 1:14 UTC (Thu) by luto (subscriber, #39314)
If you want to cleanly unmount everything, presumably you want (a) revoke and (b) unmount-the-$!%@-fs-even-if-it's-in-use. (I'd like both of these.)
If you want to know when filesystems are gone, maybe you want to separate the processes of mounting things into the FS hierarchy from loading a driver for an FS. Then you could force-remove-from-hierarchy (roughly equivalent to umount -l) and separately wait until the FS is no longer loaded (which has nothing to do with the hierarchy).
If you want your system to be reliable, the bug needs fixing.
Posted Oct 25, 2012 1:25 UTC (Thu) by dlang (✭ supporter ✭, #313)
1. you can't even try to unmount a filesystem if it's mounted under another filesystem that you can't reach
mount /dev/sda /
mount remote:/something /something
mount /dev/sdb /something/else
now if remote goes down, you have no way of cleanly unmounting /dev/sdb
2. even solving for #1, namespaces cause problems, because with namespaces it is now impossible for any one script to unmount everything, or even to find what pids need to be killed in all the pid namespaces to make a filesystem idle so that it can be unmounted.
Posted Oct 25, 2012 1:30 UTC (Thu) by ewen (subscriber, #4772)
However finding all the file systems in the face of many PID/filesystem name spaces is still non-trivial.
Posted Oct 25, 2012 1:56 UTC (Thu) by nix (subscriber, #2304)
I had no idea you could use remounting (plus, presumably, readonly remounting) on raw devices like that. That might work rather well in my case: all my devices are in one LVM VG, so I can just do a readonly remount on /dev/$vgname/*.
But in the general case, including PID and fs namespaces, that's really not going to work, indeed.
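What the LVM trick might look like (a sketch; the single volume group and the /dev/$vgname device layout are assumptions about this particular setup, and mount(8) resolves the mount point from the device node itself for remounts):

```shell
# Sketch: remount every logical volume in one VG read-only, so that
# even filesystems that cannot be unmounted stop taking writes.
remount_vg_readonly() {
    vgname=$1
    for dev in /dev/"$vgname"/*; do
        # mount accepts the device node for a remount and looks up
        # the corresponding mount point in /proc/mounts.
        mount -o remount,ro "$dev" 2>/dev/null ||
            echo "failed to remount $dev read-only" >&2
    done
}
```

A read-only remount flushes dirty data and stops further writes, which leaves the on-disk state consistent even if the reboot then yanks the device out from under a still-mounted filesystem.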
Posted Oct 25, 2012 3:50 UTC (Thu) by ewen (subscriber, #4772)
Posted Oct 25, 2012 1:26 UTC (Thu) by ewen (subscriber, #4772)
(there are others earlier/later, but they mostly only make sense in context.)
ObTopic: possibly there may be an ordering of write operations which ensures that the journal close off/journal replay is idempotent (ie, okay to do twice), but it would appear that EXT4 in some kernel versions either doesn't currently have that for some actions or doesn't have sufficient barriers to ensure the writes hit stable storage in that order. So there seems to be a (small) window of block writing vulnerability during the EXT4 unmounting. (Compare with, eg, the FreeBSD Soft Updates file system operation ordering -- http://en.wikipedia.org/wiki/Soft_updates.)
Posted Oct 25, 2012 2:00 UTC (Thu) by nix (subscriber, #2304)
But now there's a wave of self-sustaining accidental lies spreading across the net, unwarrantedly damaging the reputation of ext4, and I started it without wanting to.
It's times like this when I start to understand why some companies have closed bug trackers.
Posted Oct 25, 2012 9:31 UTC (Thu) by man_ls (subscriber, #15091)
Remember the stupid "neutrinos faster than light" news, where all media outlets were reporting that Einstein had been rebutted and that we were close to time travel? In the end it was all a faulty hardware connection; the original results were corrected and the speed-of-light paradigm came out stronger than ever. In that case it was a few hundred scientists signing the original paper that started the wildfire, instead of checking and rechecking everything for a few months before publishing such a fundamental result. I hope they are widely discredited now, all 170 of them (I am not joking now, either about the figure or the malice).
So in a few days the bug will be pinned to a very specific and uninteresting condition, and ext4 will come out stronger than ever. One data point: I have seen no corruption with 3.6.3, but then I am never rebooting while unmounting. Now I will be unmounting with extra care :)
Posted Oct 25, 2012 13:33 UTC (Thu) by nix (subscriber, #2304)
LWN's coverage of this was much much better, emphasising the unclear and under-investigation nature of the thing.
A few things
Posted Oct 25, 2012 14:03 UTC (Thu) by man_ls (subscriber, #15091)
Also, hundreds of names on a paper may be standard practice, but it is ridiculous. Somebody should compute something like the Einstein index but dividing each result by the number of collaborators.
Finally, it appears from the Wikipedia article that the Gran Sasso scientists had sat on their results for six months before publishing them. Even though I called for exactly that embargo in my post, the fact that they actually observed it somehow only makes it worse -- but then life is unfair.
Posted Oct 25, 2012 10:17 UTC (Thu) by cesarb (subscriber, #6266)
Data corruption/loss is scary. Even more than most security problems (a really bad security problem will be used by some joker to erase your data, so a really bad security problem is equivalent to data corruption/loss).
If the data corruption/loss affects the most used and stable filesystem in the Linux world, the steps to reproduce sound reasonably easy to hit by chance (just reboot twice quickly), and the data loss is believed to be prevented by just not upgrading/downgrading a minor point release, it is natural human behavior to want EVERYONE to know RIGHT NOW, so people will not upgrade/will downgrade until it is safe. Thus the posts on every widely read Linux-related news media people could find.
Even now with the problem being shown to happen in less common situations, and with it being suspected of being older than 3.6.1, I would say 3.6.3 is burned, and people will not touch it with a 3-meter pole until 3.6.4 is out. Even if 3.6.4 has no ext4-related patches at all.
Posted Oct 25, 2012 16:30 UTC (Thu) by nix (subscriber, #2304)
I did try to work on it only outside working hours, but it's sometimes hard to concentrate on anything else when your filesystems are at risk, so I fear it did compromise my productivity at other times. So, thank you, Elena. :)
Posted Oct 25, 2012 17:25 UTC (Thu) by cesarb (subscriber, #6266)
Did your boss at Oracle tell you to try btrfs instead? ;-)
Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds