
A filesystem corruption bug breaks loose

By Jonathan Corbet
December 10, 2018
Kernel bugs can have all kinds of unfortunate consequences, from inconvenient crashes to nasty security vulnerabilities. Some of the most feared bugs, though, are those that corrupt data in filesystems. The losses imposed on users can be severe, and the resulting problems may not be noticed for a long time, making recovery difficult. Filesystem developers, knowing that they will have to face their users in the real world, go to considerable effort to prevent this kind of bug from finding its way into a released kernel. A recent failure in that regard raises a number of interesting questions about how kernel development is done.

On November 13, Claude Heiland-Allan created a bug report about a filesystem corruption problem with the 4.19.1 kernel; other users joined in with reports of their own. Initially, the problem was thought to be in the ext4 filesystem, since that is what the affected users were using. Tracking the problem down took a few weeks, though, because few developers were able to reproduce the problem. There were some attempts at using bisection to find the commit that caused the problem, but they proved to be worse than useless, as they identified the wrong commits and caused developers to waste time on false leads.

It took until December 4 for Lukáš Krejčí to correctly bisect the problem down to a block-layer change. Commit 6ce3dd6eec, added during the 4.19 merge window, optimized the handling of requests in the multiqueue block layer. If there is no I/O scheduler in use, and if the hardware queue is not full, this patch causes new I/O requests to be placed directly into the hardware queue, shorting out a bunch of unnecessary processing. It's a harmless-seeming change that should make I/O go a little faster.
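
The shortcut can be pictured with a small, self-contained C sketch; the structures and names below are invented for illustration and are not the kernel's actual blk-mq code:

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of an I/O request and a hardware queue. */
    struct request { long sector; int nr_sectors; };
    struct hw_queue { int depth; int in_flight; };

    /* Older path: the request goes through the insertion machinery
       and waits in a software queue before reaching the hardware. */
    static void insert_for_later(struct hw_queue *hq, struct request *rq)
    {
        printf("sector %ld: queued for later dispatch\n", rq->sector);
        hq->in_flight++;
    }

    /* The 4.19 change: with no I/O scheduler and room in the hardware
       queue, hand the request straight to the driver. */
    static void submit(struct hw_queue *hq, struct request *rq, bool have_sched)
    {
        if (!have_sched && hq->in_flight < hq->depth) {
            printf("sector %ld: issued directly to hardware\n", rq->sector);
            hq->in_flight++;
            return;
        }
        insert_for_later(hq, rq);
    }

    int main(void)
    {
        struct hw_queue hq = { .depth = 2, .in_flight = 0 };
        struct request r1 = { .sector = 0, .nr_sectors = 8 };
        struct request r2 = { .sector = 8, .nr_sectors = 8 };

        submit(&hq, &r1, false);   /* bypasses the software queues */
        submit(&hq, &r2, false);
        return 0;
    }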

Things can go bad, though, if the low-level driver for the block device is unable to actually execute that request. This is most likely to happen as the result of a resource shortage — memory, perhaps, or something related to the hardware itself. In that case, the driver will return a soft failure, causing the I/O request to be requeued for another attempt later. While that request sits in the queue, the block layer may merge it with other requests for adjacent blocks, which should be fine. If, however, the low-level driver has already done some of the setup for the request, such as creating scatter/gather DMA mappings, those mappings may not be updated to match the larger, merged request. That results in only part of the request being executed by the hardware, with bad effects on the data involved.
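
A simplified, standalone sketch of that sequence (again with invented names; a real driver keeps this state in its own per-request data) shows how a mapping prepared for the original request can be left behind when the request grows:

    #include <stdio.h>

    /* Invented structures for illustration only. */
    struct request {
        int nr_sectors;      /* current size; may grow if merged */
        int mapped_sectors;  /* what the driver set up on first dispatch */
        int prepared;        /* driver has already built its mapping */
    };

    /* First dispatch attempt: the driver prepares its scatter/gather
       state, then hits a transient resource shortage and gives the
       request back to the block layer. */
    static int driver_dispatch(struct request *rq)
    {
        if (!rq->prepared) {
            rq->mapped_sectors = rq->nr_sectors;
            rq->prepared = 1;
        }
        return -1;  /* BUSY: requeue and retry later */
    }

    /* While requeued, the block layer merges an adjacent request in. */
    static void merge(struct request *rq, int extra_sectors)
    {
        rq->nr_sectors += extra_sectors;
        /* Bug: nothing updates rq->mapped_sectors. */
    }

    /* Retry: the hardware transfers only what was mapped the first time. */
    static void driver_retry(const struct request *rq)
    {
        printf("request is %d sectors, hardware will transfer %d\n",
               rq->nr_sectors, rq->mapped_sectors);
    }

    int main(void)
    {
        struct request rq = { .nr_sectors = 8 };

        driver_dispatch(&rq);  /* mapping built for 8 sectors, returns BUSY */
        merge(&rq, 8);         /* request grows to 16 sectors while queued */
        driver_retry(&rq);     /* only 8 of the 16 sectors actually move */
        return 0;
    }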

The problem was partially addressed with this commit, but a second fix was required to resolve a new problem introduced by the first. Both fixes were included in the 4.20-rc6 release; they also found their way into 4.19.8. The original patch was never selected for backporting to older stable kernels, so those were not affected.

How this happened

Naturally, some developers are wondering how a problem like this could have made it into a final kernel release without having been noticed. Surely automated testing should have been able to find such a bug? Not all parts of the kernel have exhaustive automated testing regimes (to put it charitably), but the block layer is better than most. That is a function of the severity of bugs at that layer (which can often cause data loss), but also of the relative ease of testing that code. There are few hardware-specific issues to deal with, so testing can usually be done in virtual machines.

This particular case, though, turned up a hole or two in the testing regime. It required a filesystem on a device configured to use no I/O scheduler at all (a relatively rare configuration on the affected filesystems), and that device needed to run into resource limitations that would cause it to temporarily fail requests. The driver for the device also needed to store state in the request structure and use that state in subsequent retries. Finally, the multiqueue block layer must be in use, which only happens by default for SCSI devices as of 4.19. That last change is unlikely to have been picked up by many developers or testing setups, since kernel configuration files tend to be carried forward from one release to the next.
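
Whether a given device is in the risky configuration can be seen in sysfs, where the active I/O scheduler is shown in brackets (for example "[none] mq-deadline kyber" on a multiqueue device with no scheduler selected). A minimal check, assuming a device named sda, might look like:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");

        if (!f) {
            perror("fopen");
            return 1;
        }
        /* The scheduler currently in use appears in square brackets. */
        if (fgets(line, sizeof(line), f)) {
            if (strstr(line, "[none]"))
                printf("no I/O scheduler in use: %s", line);
            else
                printf("scheduler setting: %s", line);
        }
        fclose(f);
        return 0;
    }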

As a result of all those factors, nobody doing automated testing of the block layer reported this particular problem, and developers found themselves unable to reproduce it once it came to light. Ubuntu users who install from the kernel-ppa repository, instead, got a shiny new configuration file with their bleeding-edge kernel and were thus exposed to the problem. This sort of occurrence is one of the reasons why developers tend to like to remove configuration options; more options create more configurations that must be tested. In this case, block maintainer Jens Axboe has said that he will be writing a new test for this particular issue. He also noted that "this is the first corruption issue we've had (blk-mq or otherwise) in any kernel in the storage stack in decades as far as I can remember".

Doing better next time

There have been some suggestions that the kernel community should have done more to protect users once the problem was discovered. It is not clear that there is a whole lot that could have been done, though. Arguably there should be a mechanism to inform users that a given kernel might have a serious issue and should not be used, but the community lacks such a mechanism. This particular issue was, in fact, less visible than many since it was not discussed on the mailing lists. The only people who were aware of it were those who were watching the kernel bugzilla — or who were busy restoring their filesystems from backups.

Laura Abbott noted that the problem left Fedora users in an awkward position: they could either run a 4.19 kernel that might mangle their data or run the 4.18 kernel, which was no longer being supported. Perhaps, she said, there should be a way to respond to problems like this?

I'm wondering if there's anything we can do to make things easier on kernel consumers. Bugs will certainly happen but it really makes it hard to push the "always run the latest stable" narrative if there isn't a good fallback when things go seriously wrong.

Willy Tarreau responded that he ensures that the previous long-term stable release works on the systems he supports for just this reason. Dropping back to 4.14 is unlikely to be a pleasing alternative for many users, but it's not clear that something better will come along. Some people would surely like it if the previous release (4.18 in this case) were maintained for longer but, as stable maintainer Greg Kroah-Hartman put it, that "isn't going to happen"; the resources to do that simply are not available.

Kroah-Hartman did have one suggestion for situations like this: tell him and the other stable maintainers about it. In this case, he was only informed, by accident, shortly before the bug was tracked down and fixed. There is not much that could have been done even if he had known sooner, since nobody knew what the origin of the problem was. But keeping the stable maintainers in the loop regarding serious problems that have appeared in stable kernels can only help to get the fixes out more quickly.

One other aspect of this bug is that, depending on how one looks at it, it could be seen as resulting from either of two different underlying issues. One is that the multiqueue block layer is arguably still not sufficiently mature, so it is turning up with severe bugs. The other is that maintaining two independent block subsystems is putting a strain on the system and letting bugs get through. One's point of view may well affect how one views the prospect of the legacy block API being removed in the next merge window, which is the current plan. Ted Ts'o let it be known that this idea "is not filling me with a lot of joy and gladness". But for many others, the time to make this transition is long past and, in any case, the number of devices that can only use the multiqueue API is growing quickly.

The good news is that problems of this severity are rare and, when they do happen, they get the full attention of the developers involved. Some early adopters were burned, which is never a good thing, but the vast majority of users will never be affected by this issue. Some testing holes have been identified that will hopefully be closed in the near future. But no amount of testing will ever reveal all of the bugs in the system; occasionally a serious one will escape and bite users. With luck and effort, the number and severity of such events can be minimized, but they are not going to be entirely eliminated anytime in the near future.

Index entries for this article
Kernel: Block layer
Kernel: Development model/Stable tree



A filesystem corruption bug breaks loose

Posted Dec 10, 2018 20:57 UTC (Mon) by maniax (subscriber, #4509) [Link]

As for this being the only corruption in the stack in the last decade, there was one more that stayed in for a few years. I opened a bug for Ubuntu for it at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796542 after a customer of ours hit it; we tracked it down and found that SUSE had found the same issue three months earlier and had pushed a patch into mainline. The issue was introduced in v4.10-rc1 with 72ecad22d9f198aafee64218512e02ffa7818671.

And, not sure if that counts, but I've seen patches fixing somewhat silent data corruption in ext4 if a cgroup's memory runs out.

I really hope there's an answer to the testing problem. Maybe instead of bitcoin people can donate their CPU for linux kernel tests...

git bisect considered dangerous

Posted Dec 10, 2018 21:09 UTC (Mon) by abatters (✭ supporter ✭, #6932) [Link] (1 responses)

So the bug is fixed upstream - great. But anyone doing a git bisect on an unrelated bug can encounter it again. An expert user who has read this article and remembers it could avoid it, but... time passes, people forget, and not everyone reads every lwn article. What can be done to make git bisect safer?

git bisect considered dangerous

Posted Dec 10, 2018 21:38 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

I'd like the ability to give a list of patches to cherry-pick onto every step of the bisect. I've done this before by using a `git bisect run` which applies the diff, runs the test, and then unapplies the diff (since bisect doesn't like local changes). Conflicts are annoying to deal with though. If the project could specify that a patch fixes something introduced in a patch, bisect could determine whether it needs to be applied automatically by checking the ancestry of the current commit. Gathering the list of potential fix commits is the hard part there (though conflicts are still an issue, how/where to store this "fixes" information might be hard). Commit trailers work well for other things though.

A **block layer** corruption bug breaks loose

Posted Dec 10, 2018 22:12 UTC (Mon) by zblaxell (subscriber, #26385) [Link] (5 responses)

> One is that the multiqueue block layer is arguably still not sufficiently mature, so it is turning up with severe bugs. The other is that maintaining two independent block subsystems is putting a strain on the system and letting bugs get through.

There's the paradox of kernel development: Too busy cutting wood to design and build safe and fast autonomous chainsaws.

The headline is very misleading--this is a block layer bug, not a filesystem bug. Everyone's affected by the former, while only people who picked the wrong filesystem are affected by the latter.

A **block layer** corruption bug breaks loose

Posted Dec 11, 2018 14:00 UTC (Tue) by matthias (subscriber, #94967) [Link] (4 responses)

The headline says filesystem corruption bug, i.e., a bug that corrupts filesystems. It does not say where the bug is located in the code or io stack.

And to see whether you are affected or not, you have to read the article anyways. Because not everyone is affected by this bug in the block layer. Only the ones who picked the wrong queueing discipline (multi queue), the wrong driver (scsi) and the wrong io scheduler (none). I do not see why this is so much different from picking a wrong filesystem.

If everyone were affected by this bug, it would have been found much earlier.

A **block layer** corruption bug breaks loose

Posted Dec 11, 2018 14:37 UTC (Tue) by zdzichu (subscriber, #17118) [Link] (1 responses)

You hardly "pick" mq, none and scsi. Those are the defaults nowadays.

A **block layer** corruption bug breaks loose

Posted Dec 11, 2018 19:11 UTC (Tue) by flussence (guest, #85566) [Link]

I thought Kyber was the default for mq?

(I'll just be over here in my corner, running BFQ…)

A **block layer** corruption bug breaks loose

Posted Dec 11, 2018 16:30 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

A block-layer bug would also affect non-filesystem things, like swap and raw partitions used by VMs.

A **block layer** corruption bug breaks loose

Posted Dec 15, 2018 22:32 UTC (Sat) by giraffedata (guest, #1954) [Link]

This is a victim of our sloppy use of terminology. We commonly use "filesystem" 3 ways. The purest definition of filesystem is a set of bits that can be interpreted as files, as in "the filesystem that contains my paper is on this disk drive." But people normally also use the word to refer to a filesystem type, as in "FAT16 is the most portable filesystem" and also to refer to a filesystem driver ("This filesystem has bugs").

The headline refers to a true filesystem; OP took it to mean filesystem driver.

There are many more examples in technical talk of words commonly used in multiple ways (usually as both an object and a class of those objects) where the reader is expected to discern from context which one it is. "command" is a great example. I wish we did this less.

A filesystem corruption bug breaks loose

Posted Dec 10, 2018 22:18 UTC (Mon) by cornelio (guest, #117499) [Link] (3 responses)

Speaking about bugs ... btrfs.

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 12:17 UTC (Tue) by jezuch (subscriber, #52988) [Link] (2 responses)

Yes? What about it? (I'm using it so I'd like to know. Haven't had problems for quite a while.)

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 14:25 UTC (Tue) by Baughn (subscriber, #124425) [Link] (1 responses)

You never have to manually rebalance it?

How well do non-mirrored RAID modes work?

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 20:24 UTC (Tue) by zblaxell (subscriber, #26385) [Link]

You still need to balance one data block group every now and then, just enough to keep one chunk unallocated on each drive. A partial or full rebalance is required after some array layout changes (and always will be--the best outcome would be to automatically start a balance when that occurs).

Parity RAID now survives data corruption and non-timeout IO errors. It will happily--and correctly--fix itself even if you corrupt one random byte on every block of two disks in a RAID6 array (except nodatacow files, which will be corrupted because nodatacow intentionally trades detection or repair of data corruption for performance). I haven't tried it with real failing disks, though, and I lack the imagination and contempt for data integrity required to try to replicate drive firmware bugs in a VM.

That said, there are still hundreds of other bugs, and there are still plenty of opportunities to improve performance in btrfs.

A filesystem corruption bug breaks loose

Posted Dec 10, 2018 23:02 UTC (Mon) by mangix (guest, #126006) [Link] (3 responses)

A few thoughts,

This seems to affect users of NVME devices, as those seem to use a scheduler of none and use blk-mq by default.

This is also not the first filesystem corruption issue that I've seen. I saw a recent one introduced in kernel 4.8 for very specific MIPS devices (MIPS 1004KC, probably the only one actually). It was an L1 cache bug where the cache on the second core was not being cleared. After some uptime, data corruption ensued. ext4 managed to get corrupted very fast. btrfs less so. It even spammed dmesg like crazy. But it also was not immune.

Fixed and backported to 4.9

https://git.kernel.org/pub/scm/linux/kernel/git/next/linu...

A filesystem corruption bug breaks loose

Posted Dec 10, 2018 23:05 UTC (Mon) by mangix (guest, #126006) [Link]

Let's not even mention the data loss that resulted from this (bad btrfs repair options).

A filesystem corruption bug breaks loose

Posted Dec 10, 2018 23:20 UTC (Mon) by axboe (subscriber, #904) [Link] (1 responses)

> This seems to affect users of NVME devices, as those seem to use a scheduler of none and use blk-mq by default.

No, as the article mentions, only a subset of SCSI was affected. To hit this issue, you need:

1) Driver that can hit a BUSY condition on dispatch. SCSI fits that bill.
2) Driver that retains state over a BUSY condition. Only SCSI fits that bill.
3) Device that hits this condition outside of normal queue full. Only ATA fits that bill, with some commands not being queued.
4) Device NOT using an IO scheduler
5) Lots of bad luck

All of these conditions have to be met, or you cannot hit it.

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 0:46 UTC (Tue) by mangix (guest, #126006) [Link]

Got it. Thanks for the clarification.

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 1:21 UTC (Tue) by ken (subscriber, #625) [Link] (10 responses)

I was one of the unlucky ones hitting this on ubuntu 18.10.

All important files were on NFS, so not a huge problem, but I found out that it no longer works to run fsck on boot in Ubuntu. So now you're forced to boot from a USB live system just to do fsck on root.

That upset me more than the actual filesystem bug.

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 1:28 UTC (Tue) by pr1268 (guest, #24648) [Link] (3 responses)

it no longer works to run fsck on boot

Was that because of this bug?

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 1:46 UTC (Tue) by ken (subscriber, #625) [Link] (2 responses)

No, I was booting a 4.12 kernel when I tried to fix the drive.

I tried the good old touch /forcefsck, but that creates a warning:
Please pass 'fsck.mode=force' on the kernel command

But that did not help; I still got the
"warning: mounting fs with errors, running e2fsck is recommended"

That only stopped once I ran fsck manually from a live USB boot.

I could not find any trace that fsck was run at all on boot.

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 12:59 UTC (Tue) by mgedmin (subscriber, #34497) [Link] (1 responses)

You can check /run/initramfs/fsck.log for a full log of fsck execution from the initramfs.

Perhaps fsck -f -a was unable to fix the corruption automatically? In which case you needed to pass fsck.repair=yes as well, to change that command to fsck -f -y. I'm not sure where that's documented, I discovered it by grepping through the initramfs shell scripts in /usr/share/initramfs-tools.

Anyway, booting a livecd and running fsck interactively sounds like a good plan, where you can tell what's going on without digging through a maze of little scripts in /usr/share/initramfs-tools.

A filesystem corruption bug breaks loose

Posted Dec 20, 2018 17:07 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

These command-line parameters were first implemented by systemd and are documented in systemd-fsck(8). Since initramfs-tools took over running fsck on / and /usr, it is meant to support the same command-line parameters that control fsck in either initscripts or systemd.

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 19:30 UTC (Tue) by zblaxell (subscriber, #26385) [Link] (5 responses)

I copied e2fsck into initramfs years ago. Also dropbear (to diagnose and repair the root filesystem remotely), mkfs and rsync (in case the diagnosis and repair does not end well, and a restore from backups over the network is required).

The old Unix way of running fsck on a live root filesystem so that it will be potentially modifying its own program text (or block maps thereof) was, at best, a workaround for not having a usable initramfs subsystem to run fsck from. That era ended 15 years ago on Linux, and even earlier on Solaris and other commercial Unixes.

I'm surprised Linux distros that boot with initramfs today still try to fsck after / is mounted.

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 20:16 UTC (Tue) by lkundrak (subscriber, #43452) [Link] (3 responses)

Uh? A fresh Fedora install:
[root@nedofet lkundrak]# lsinitrd |grep -i fsck
-rwxr-xr-x   1 root     root        28512 Oct 25 09:14 usr/lib/systemd/systemd-fsck
-rw-r--r--   1 root     root          671 Oct 25 09:14 usr/lib/systemd/system/systemd-fsck@.service
-rwxr-xr-x   2 root     root            0 Jul 13 04:35 usr/sbin/e2fsck
-rwxr-xr-x   1 root     root        55952 Jul 16 13:51 usr/sbin/fsck
-rwxr-xr-x   2 root     root       339424 Jul 13 04:35 usr/sbin/fsck.ext4
[root@nedofet lkundrak]#

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 20:18 UTC (Tue) by zblaxell (subscriber, #26385) [Link] (2 responses)

OK, that's one...

A filesystem corruption bug breaks loose

Posted Dec 12, 2018 3:45 UTC (Wed) by nybble41 (subscriber, #55106) [Link] (1 responses)

Debian also includes e2fsck in the initramfs:

[~]$ lsinitramfs /boot/initrd.img-4.18.0-2-amd64 | fgrep fsck
usr/sbin/e2fsck
usr/sbin/fsck
usr/sbin/fsck.ext4
usr/sbin/reiserfsck

A filesystem corruption bug breaks loose

Posted Dec 20, 2018 17:10 UTC (Thu) by BenHutchings (subscriber, #37955) [Link]

Indeed, we've been doing this by default since Debian 8 "jessie". But if you build a custom kernel with no initramfs then of course fsck gets run in the old way.

A filesystem corruption bug breaks loose

Posted Dec 14, 2018 23:41 UTC (Fri) by nix (subscriber, #2304) [Link]

Presumably you also install iproute2, or you'll have trouble bringing the network up to rsync over it (unless it's already up for netconsole or something, I suppose).

A filesystem corruption bug breaks loose

Posted Dec 11, 2018 16:30 UTC (Tue) by tbm (subscriber, #7049) [Link] (2 responses)

This is a great illustration of how bugs get fixed in open source.

What I find disappointing is that Lukáš Krejčí wasn't credited at all in the commit message. He did great work finding a setup that allowed the developer to reproduce and identify the issue. QA people need more recognition.

(To be fair, there is at least a Tested-by but not Lukáš)

A filesystem corruption bug breaks loose

Posted Dec 12, 2018 1:51 UTC (Wed) by axboe (subscriber, #904) [Link]

> What I find disappointing is that Lukáš Krejčí wasn't credited at all in the commit message

I totally agree, and I'll take the blame for that. For what it's worth, he's credited in the fio reproducer I made based on his setup description.

A filesystem corruption bug breaks loose

Posted Dec 12, 2018 22:17 UTC (Wed) by masoncl (subscriber, #47138) [Link]

This is a great point. A lot of us were racing pretty hard to find a good repro, but I really don't think I would have nailed the recipe myself.

We can't change git history, but we'll work out a way to send him something.


Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds