A filesystem corruption bug breaks loose
On November 13, Claude Heiland-Allen created a bug report about a filesystem corruption problem with the 4.19.1 kernel; other users soon joined in with reports of their own. Initially, the problem was thought to be in the ext4 filesystem, since that is what the affected users were running. Tracking it down took a few weeks, though, because few developers were able to reproduce it. There were some attempts at using bisection to find the commit that caused the problem, but they proved to be worse than useless: they identified the wrong commits and sent developers chasing false leads.
It took until December 4 for Lukáš Krejčí to correctly bisect the problem down to a block-layer change. Commit 6ce3dd6eec, added during the 4.19 merge window, optimized the handling of requests in the multiqueue block layer. If there is no I/O scheduler in use, and if the hardware queue is not full, this patch causes new I/O requests to be placed directly into the hardware queue, shorting out a bunch of unnecessary processing. It's a harmless-seeming change that should make I/O go a little faster.
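The shape of that fast path is easy to sketch. What follows is a toy model in ordinary C, not the kernel code; all of the types and names are invented for illustration, but the decision it makes mirrors the one described above:

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the direct-issue fast path: with no I/O scheduler
 * attached and room in the hardware queue, a request bypasses the
 * staging queues (where merging happens) and goes straight to the
 * driver. */
struct hw_queue {
	bool has_scheduler;
	int depth;		/* hardware queue capacity */
	int in_flight;		/* requests currently queued */
};

static void submit(struct hw_queue *q, int sector)
{
	if (!q->has_scheduler && q->in_flight < q->depth) {
		q->in_flight++;
		printf("sector %d: issued directly to the hardware queue\n",
		       sector);
	} else {
		printf("sector %d: staged for later dispatch (and merging)\n",
		       sector);
	}
}

int main(void)
{
	struct hw_queue q = { .has_scheduler = false, .depth = 2 };

	submit(&q, 0);
	submit(&q, 8);
	submit(&q, 16);		/* queue full: falls back to staging */
	return 0;
}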
Things can go bad, though, if the low-level driver for the block device is unable to actually execute that request. This is most likely to happen as the result of a resource shortage — memory, perhaps, or something related to the hardware itself. In that case, the driver will return a soft failure, causing the I/O request to be requeued for another attempt later. While that request sits in the queue, the block layer may merge it with other requests for adjacent blocks, which should be fine. If, however, the low-level driver has already done some of the setup for the request, such as creating scatter/gather DMA mappings, those mappings may not be updated to match the larger, merged request. That results in only part of the request being executed by the hardware, with bad effects on the data involved.
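That failure mode is easier to see in miniature. Here is a contrived, self-contained model (the names are invented, and the cached driver state is reduced to a sector count standing in for a scatter/gather mapping): the first attempt sizes the driver's state for the original request, the request grows by merging while it waits to be retried, and the hardware then transfers only the originally mapped portion:

#include <stdbool.h>
#include <stdio.h>

struct request {
	int sectors;		/* current size of the request */
	int mapped_sectors;	/* driver state from the first attempt */
	bool prepared;
};

/* Stand-in for a driver's ->queue_rq(): prepare once, then either
 * fail softly (requeue) or hand the prepared mapping to the hardware. */
static bool driver_queue_rq(struct request *rq, bool have_resources)
{
	if (!rq->prepared) {
		rq->mapped_sectors = rq->sectors; /* e.g. build the DMA mappings */
		rq->prepared = true;
	}
	if (!have_resources)
		return false;	/* soft failure: request will be requeued */

	printf("hardware transfers %d of %d sectors\n",
	       rq->mapped_sectors, rq->sectors);
	return true;
}

int main(void)
{
	struct request rq = { .sectors = 8 };

	driver_queue_rq(&rq, false);	/* first try fails; state is kept */
	rq.sectors += 8;		/* merged with an adjacent request */
	driver_queue_rq(&rq, true);	/* retried with the stale mapping */
	return 0;
}

The second attempt reports a transfer of eight of sixteen sectors; in the real bug, the unmapped half of the merged request simply never reached the device.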
The problem was partially fixed with this commit, but one more change was required to address a new problem introduced by the first. Both fixes were included in the 4.20-rc6 release; they also found their way into 4.19.8. The original patch was never selected for backporting to older stable kernels, so those were not affected.
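Stated abstractly, the invariant the fixes restore is that a request the driver has already prepared must never grow through merging. The sketch below (again with invented names, not the actual patch) enforces that rule directly at merge time; the real fixes get the same effect by requeuing a failed direct-issue request onto the dispatch list, which the merging code never touches:

#include <stdbool.h>
#include <stdio.h>

struct request {
	int sectors;
	bool prepared;		/* driver has built state for this request */
};

/* Refuse to merge any request whose driver-side setup already exists. */
static bool try_merge(struct request *rq, int extra_sectors)
{
	if (rq->prepared)
		return false;	/* merging would leave the setup stale */
	rq->sectors += extra_sectors;
	return true;
}

int main(void)
{
	struct request rq = { .sectors = 8, .prepared = true };

	if (!try_merge(&rq, 8))
		printf("merge refused; request stays at %d sectors\n",
		       rq.sectors);
	return 0;
}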
How this happened
Naturally, some developers are wondering how a problem like this could have made it into a final kernel release without having been noticed. Surely automated testing should have been able to find such a bug? Not all parts of the kernel have exhaustive automated testing regimes (to put it charitably), but the block layer is better than most. That is a function of the severity of bugs at that layer (which can often cause data loss), but also of the relative ease of testing that code. There are few hardware-specific issues to deal with, so testing can usually be done in virtual machines.
This particular case, though, turned up a hole or two in the testing regime. It required a filesystem on a device configured to use no I/O scheduler at all (a relatively rare configuration on the affected filesystems), and that device needed to run into resource limitations that would cause it to temporarily fail requests. The driver for the device also needed to store state in the request structure and use that state in subsequent retries. Finally, the multiqueue block layer had to be in use, something that only became the default for SCSI devices in 4.19. That last change is unlikely to have been picked up by many developers or testing setups, since kernel configuration files tend to be carried forward from one release to the next.
As a result of all those factors, nobody doing automated testing of the block layer reported this particular problem, and developers found themselves unable to reproduce it once it came to light. Ubuntu users who install from the kernel-ppa repository, instead, got a shiny new configuration file with their bleeding-edge kernel and were thus exposed to the problem. This sort of occurrence is one of the reasons why developers tend to like to remove configuration options; more options create more configurations that must be tested. In this case, block maintainer Jens Axboe has said that he will be writing a new test for this particular issue. He also noted that "this is the first corruption issue we've had (blk-mq or otherwise) in any kernel in the storage stack in decades as far as I can remember".
Doing better next time
There have been some suggestions that the kernel community should have done more to protect users once the problem was discovered. It is not clear that there is a whole lot that could have been done, though. Arguably there should be a mechanism to inform users that a given kernel might have a serious issue and should not be used, but the community lacks such a mechanism. This particular issue was, in fact, less visible than many since it was not discussed on the mailing lists. The only people who were aware of it were those who were watching the kernel bugzilla — or who were busy restoring their filesystems from backups.
Laura Abbott noted that the problem left Fedora users in an awkward position: they could either run a 4.19 kernel that might mangle their data or run the 4.18 kernel, which was no longer being supported. Perhaps, she said, there should be a way to respond to problems like this?
Willy Tarreau responded that he ensures that the previous long-term stable release works on the systems he supports for just this reason. Dropping back to 4.14 is unlikely to be a pleasing alternative for many users, but it's not clear that something better will come along. Some people would surely like it if the previous release (4.18 in this case) were maintained for longer but, as stable maintainer Greg Kroah-Hartman put it, that "isn't going to happen"; the resources to do that simply are not available.
Kroah-Hartman did have one suggestion for situations like this: tell him and the other stable maintainers about it. In this case, he was only informed, by accident, shortly before the bug was tracked down and fixed. There is not much that could have been done even if he had known sooner, since nobody knew what the origin of the problem was. But keeping the stable maintainers in the loop regarding serious problems that have appeared in stable kernels can only help to get the fixes out more quickly.
One other aspect of this bug is that, depending on how one looks at it, it could be seen as resulting from either of two different underlying issues. One is that the multiqueue block layer is arguably still not sufficiently mature, so it is still turning up severe bugs. The other is that maintaining two independent block subsystems is putting a strain on the development community and letting bugs get through. One's point of view may well affect how one views the prospect of the legacy block API being removed in the next merge window, which is the current plan. Ted Ts'o let it be known that this idea "is not filling me with a lot of joy and gladness". But for many others, the time to make this transition is long past and, in any case, the number of devices that can only use the multiqueue API is growing quickly.
The good news is that problems of this severity are rare and, when they do happen, they get the full attention of the developers involved. Some early adopters were burned, which is never a good thing, but the vast majority of users will never be affected by this issue. Some testing holes have been identified that will hopefully be closed in the near future. But no amount of testing will ever reveal all of the bugs in the system; occasionally a serious one will escape and bite users. With luck and effort, the number and severity of such events can be minimized, but they are not going to be entirely eliminated anytime soon.
Index entries for this article
Kernel | Block layer
Kernel | Development model/Stable tree
Posted Dec 10, 2018 20:57 UTC (Mon)
by maniax (subscriber, #4509)
[Link]
And, not sure if that counts, but I've seen patches fixing somewhat silent data corruption in ext4 if a cgroup's memory runs out.
I really hope there's an answer to the testing problem. Maybe instead of bitcoin people can donate their CPU for linux kernel tests...
git bisect considered dangerous
Posted Dec 10, 2018 21:09 UTC (Mon)
by abatters (✭ supporter ✭, #6932)
[Link] (1 responses)
git bisect considered dangerous
Posted Dec 10, 2018 21:38 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
A **block layer** corruption bug breaks loose
Posted Dec 10, 2018 22:12 UTC (Mon)
by zblaxell (subscriber, #26385)
[Link] (5 responses)
There's the paradox of kernel development: Too busy cutting wood to design and build safe and fast autonomous chainsaws.
The headline is very misleading--this is a block layer bug, not a filesystem bug. Everyone's affected by the former, while only people who picked the wrong filesystem are affected by the latter.
Posted Dec 11, 2018 14:00 UTC (Tue)
by matthias (subscriber, #94967)
[Link] (4 responses)
And to see whether you are affected or not, you have to read the article anyway, because not everyone is affected by this bug in the block layer. Only those who picked the wrong queueing discipline (multiqueue), the wrong driver (SCSI), and the wrong I/O scheduler (none) are affected. I do not see why this is so much different from picking a wrong filesystem.
If everyone were affected by this bug, it would have been found much earlier.
Posted Dec 11, 2018 14:37 UTC (Tue)
by zdzichu (subscriber, #17118)
[Link] (1 responses)
Posted Dec 11, 2018 19:11 UTC (Tue)
by flussence (guest, #85566)
[Link]
(I'll just be over here in my corner, running BFQ…)
Posted Dec 11, 2018 16:30 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
Posted Dec 15, 2018 22:32 UTC (Sat)
by giraffedata (guest, #1954)
[Link]
This is a victim of our sloppy use of terminology. We commonly use "filesystem" 3 ways. The purest definition of filesystem is a set of bits that can be interpreted as files, as in "the filesystem that contains my paper is on this disk drive." But people normally also use the word to refer to a filesystem type, as in "FAT16 is the most portable filesystem" and also to refer to a filesystem driver ("This filesystem has bugs").
The headline refers to a true filesystem; OP took it to mean filesystem driver.
There are many more examples in technical talk of words commonly used in multiple ways (usually as both an object and a class of those objects) where the reader is expected to discern from context which one it is. "command" is a great example. I wish we did this less.
Posted Dec 10, 2018 22:18 UTC (Mon)
by cornelio (guest, #117499)
[Link] (3 responses)
Posted Dec 11, 2018 12:17 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (2 responses)
Posted Dec 11, 2018 14:25 UTC (Tue)
by Baughn (subscriber, #124425)
[Link] (1 responses)
How well do non-mirrored RAID modes work?
Posted Dec 11, 2018 20:24 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link]
Parity RAID now survives data corruption and non-timeout IO errors. It will happily--and correctly--fix itself even if you corrupt one random byte on every block of two disks in a RAID6 array (except nodatacow files, which will be corrupted because nodatacow intentionally trades detection or repair of data corruption for performance). I haven't tried it with real failing disks, though, and I lack the imagination and contempt for data integrity required to try to replicate drive firmware bugs in a VM.
That said, there are still hundreds of other bugs, and there are still plenty of opportunities to improve performance in btrfs.
Posted Dec 10, 2018 23:02 UTC (Mon)
by mangix (guest, #126006)
[Link] (3 responses)
This seems to affect users of NVMe devices, as those seem to use the "none" scheduler and blk-mq by default.
This is also not the first filesystem corruption issue that I've seen. I saw a recent one introduced in kernel 4.8 for very specific MIPS devices (MIPS 1004KC, probably the only one actually). It was an L1 cache bug where the cache on the second core was not being cleared. After some uptime, data corruption ensued. ext4 managed to get corrupted very fast. btrfs less so. It even spammed dmesg like crazy. But it also was not immune.
Fixed and backported to 4.9
https://git.kernel.org/pub/scm/linux/kernel/git/next/linu...
Posted Dec 10, 2018 23:05 UTC (Mon)
by mangix (guest, #126006)
[Link]
Posted Dec 10, 2018 23:20 UTC (Mon)
by axboe (subscriber, #904)
[Link] (1 responses)
No, as the article mentions, only a subset of SCSI was affected. To hit this issue, you need:
1) Driver that can hit a BUSY condition on dispatch. SCSI fits that bill.
2) Driver that retains state over a BUSY condition. Only SCSI fits that bill.
3) Device that hits this condition outside of normal queue full. Only ATA fits that bill, with some commands not being queued.
4) Device NOT using an IO scheduler
5) Lots of bad luck
All of these conditions have to be met, or you cannot hit it.
Posted Dec 11, 2018 0:46 UTC (Tue)
by mangix (guest, #126006)
[Link]
Posted Dec 11, 2018 1:21 UTC (Tue)
by ken (subscriber, #625)
[Link] (10 responses)
All important files were on NFS, so not a huge problem, but I found out that it no longer works to run fsck on boot in Ubuntu. So now you're forced to boot from a USB live system just to run fsck on root.
That upset me more than the actual filesystem bug.
it no longer works to run fsck on boot
Posted Dec 11, 2018 1:28 UTC (Tue)
by pr1268 (guest, #24648)
[Link] (3 responses)
Was that because of this bug?
Posted Dec 11, 2018 1:46 UTC (Tue)
by ken (subscriber, #625)
[Link] (2 responses)
Tried the good old touch /forcefsck, but that creates a warning:
Please pass 'fsck.mode=force' on the kernel command line
That did not help either; I still got the
"warning: mounting fs with errors, running e2fsck is recommended"
message. That only stopped once I ran fsck manually from a live USB boot.
I could not find any trace that fsck was run at all on boot.
Posted Dec 11, 2018 12:59 UTC (Tue)
by mgedmin (subscriber, #34497)
[Link] (1 responses)
Perhaps fsck -f -a was unable to fix the corruption automatically? In which case you needed to pass fsck.repair=yes as well, to change that command to fsck -f -y. I'm not sure where that's documented, I discovered it by grepping through the initramfs shell scripts in /usr/share/initramfs-tools.
Anyway, booting a livecd and running fsck interactively sounds like a good plan, where you can tell what's going on without digging through a maze of little scripts in /usr/share/initramfs-tools.
Posted Dec 20, 2018 17:07 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link]
These command-line parameters were first implemented by systemd and are documented in systemd-fsck(8). Since initramfs-tools took over running fsck on / and /usr, it is meant to support the same command-line parameters that control fsck in either initscripts or systemd.
Posted Dec 11, 2018 19:30 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link] (5 responses)
The old Unix way of running fsck on a live root filesystem so that it will be potentially modifying its own program text (or block maps thereof) was, at best, a workaround for not having a usable initramfs subsystem to run fsck from. That era ended 15 years ago on Linux, and even earlier on Solaris and other commercial Unixes.
I'm surprised Linux distros that boot with initramfs today still try to fsck after / is mounted.
Posted Dec 11, 2018 20:16 UTC (Tue)
by lkundrak (subscriber, #43452)
[Link] (3 responses)
Uh? A fresh Fedora install:
[root@nedofet lkundrak]# lsinitrd |grep -i fsck
-rwxr-xr-x 1 root root 28512 Oct 25 09:14 usr/lib/systemd/systemd-fsck
-rw-r--r-- 1 root root 671 Oct 25 09:14 usr/lib/systemd/system/systemd-fsck@.service
-rwxr-xr-x 2 root root 0 Jul 13 04:35 usr/sbin/e2fsck
-rwxr-xr-x 1 root root 55952 Jul 16 13:51 usr/sbin/fsck
-rwxr-xr-x 2 root root 339424 Jul 13 04:35 usr/sbin/fsck.ext4
[root@nedofet lkundrak]#
Posted Dec 11, 2018 20:18 UTC (Tue)
by zblaxell (subscriber, #26385)
[Link] (2 responses)
Posted Dec 12, 2018 3:45 UTC (Wed)
by nybble41 (subscriber, #55106)
[Link] (1 responses)
[~]$ lsinitramfs /boot/initrd.img-4.18.0-2-amd64 | fgrep fsck
usr/sbin/e2fsck
usr/sbin/fsck
usr/sbin/fsck.ext4
usr/sbin/reiserfsck
Posted Dec 20, 2018 17:10 UTC (Thu)
by BenHutchings (subscriber, #37955)
[Link]
Posted Dec 14, 2018 23:41 UTC (Fri)
by nix (subscriber, #2304)
[Link]
Posted Dec 11, 2018 16:30 UTC (Tue)
by tbm (subscriber, #7049)
[Link] (2 responses)
What I find disappointing is that Lukáš Krejčí wasn't credited at all in the commit message. He did great work finding a setup that allowed the developer to reproduce and identify the issue. QA people need more recognition.
(To be fair, there is at least a Tested-by tag, but not for Lukáš.)
Posted Dec 12, 2018 1:51 UTC (Wed)
by axboe (subscriber, #904)
[Link]
I totally agree, and I'll take the blame for that. For what it's worth, he's credited in the fio reproducer I made based on his setup description.
Posted Dec 12, 2018 22:17 UTC (Wed)
by masoncl (subscriber, #47138)
[Link]
We can't change git history, but we'll work out a way to send him something.