A kernel unit-testing framework

Posted Mar 4, 2019 4:51 UTC (Mon) by Cyberax (✭ supporter ✭, #52523)
In reply to: A kernel unit-testing framework by Cyberax
Parent article: A kernel unit-testing framework

To add, I actually would LOVE to see a device behavior emulator that lets you build a programmatic model of a device, with DMA, timed interrupts, and so on. It would be especially useful for testing error-recovery code that right now might never even get run.
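As a rough illustration of what such a programmatic model might look like, here is a toy userspace sketch in Python. Everything in it is hypothetical (the `ModelDevice` class, the `fail_next_dma` switch, the interrupt names, and the `driver_write` retry loop); it only shows the shape of the idea: DMA into a memory buffer, interrupts that fire after a set number of ticks, and an error path that a test can trigger on demand.

```python
import heapq


class ModelDevice:
    """Toy device model: DMA into a memory buffer plus timed interrupts."""

    def __init__(self, memory):
        self.memory = memory          # bytearray standing in for guest RAM
        self.now = 0                  # current simulated time, in ticks
        self.fail_next_dma = False    # test hook: force the next DMA to fail
        self._events = []             # heap of (due_tick, seq, irq, action)
        self._seq = 0                 # tie-breaker so the heap never compares actions

    def _schedule(self, latency, irq, action=None):
        heapq.heappush(self._events, (self.now + latency, self._seq, irq, action))
        self._seq += 1

    def start_dma(self, addr, data, latency):
        """Start a transfer; the outcome is reported by an interrupt later."""
        out_of_range = addr < 0 or addr + len(data) > len(self.memory)
        if self.fail_next_dma or out_of_range:
            self.fail_next_dma = False
            self._schedule(latency, "dma_error")
            return

        def complete():
            # Memory is only written when the transfer completes.
            self.memory[addr:addr + len(data)] = data

        self._schedule(latency, "dma_done", complete)

    def advance(self, ticks):
        """Move simulated time forward and return the interrupts that fired."""
        self.now += ticks
        fired = []
        while self._events and self._events[0][0] <= self.now:
            _, _, irq, action = heapq.heappop(self._events)
            if action is not None:
                action()
            fired.append(irq)
        return fired


def driver_write(dev, addr, data, max_retries=2):
    """Hypothetical driver path under test: retry a DMA that reports an error."""
    for attempt in range(max_retries + 1):
        dev.start_dma(addr, data, latency=5)
        if "dma_done" in dev.advance(5):
            return attempt            # number of retries that were needed
    raise IOError("DMA failed after retries")
```

A test can then set `dev.fail_next_dma = True` before calling `driver_write()` and check that the retry path ran and the data still landed in memory.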

A kernel unit-testing framework

Posted Mar 4, 2019 12:46 UTC (Mon) by geert (subscriber, #98403) [Link] (1 responses)

Testing DMA recovery is easy: hack the DMAC to let dmaengine_prep_slave_*() fail every Nth invocation.

Perhaps we need a kernel command-line option to enable that at the generic dmaengine level?
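The "fail every Nth invocation" idea can be sketched outside the kernel in a few lines of Python. This is not dmaengine code: `EveryNthFailure`, `prep_transfer`, and `submit_with_retry` are hypothetical names, and returning `None` merely stands in for a preparation call that hands back no descriptor.

```python
class EveryNthFailure:
    """Wrap a callable so that every Nth invocation fails by returning None."""

    def __init__(self, func, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.func = func
        self.n = n
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls % self.n == 0:
            return None               # injected failure
        return self.func(*args, **kwargs)


def prep_transfer(length):
    """Hypothetical stand-in for a descriptor-preparation call."""
    return {"length": length}


def submit_with_retry(prep, length, max_attempts=3):
    """Recovery path under test: retry when preparation fails."""
    for attempt in range(1, max_attempts + 1):
        desc = prep(length)
        if desc is not None:
            return desc, attempt
    raise RuntimeError("could not prepare transfer")
```

Wrapping the real call this way forces the recovery loop in `submit_with_retry()` to run on a predictable schedule, which is the point of the hack described above.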

A kernel unit-testing framework

Posted Mar 4, 2019 20:11 UTC (Mon) by dezgeg (subscriber, #92243) [Link]

There is already standard infrastructure for fault injection which could be used: https://lwn.net/Articles/209257/

I once had a similar idea for injecting failures into USB transmissions (inspired by a kernel crash in the USB hub code that would occur if the device was unplugged at a precise moment), but sadly I didn't implement it.

A kernel unit-testing framework

Posted Mar 4, 2019 17:04 UTC (Mon) by gps (subscriber, #45638) [Link]

Indeed. It also allows people who don't have the hardware in question to make meaningful changes with higher confidence.

A kernel unit-testing framework

Posted Mar 4, 2019 19:48 UTC (Mon) by roc (subscriber, #30627) [Link] (3 responses)

Do it in QEMU.

A kernel unit-testing framework

Posted Mar 5, 2019 14:14 UTC (Tue) by pm215 (subscriber, #98099) [Link] (2 responses)

This is essentially asking for double the work to be done for every driver. (My rough rule of thumb is that a device model is about the same amount of work as writing a driver -- assuming you have the specs for the device at all...) It also risks ending up with a QEMU model and a Linux driver that have the inverse of each other's bugs, neatly cancelling out. (I have actually seen this with a PCI controller driver for an Arm devboard -- the kernel code didn't actually work on the real hardware for more than one PCI card, but everybody was testing against QEMU...)
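A contrived Python sketch of the cancelling-bugs risk, with entirely made-up names and register values (this is not the PCI controller case mentioned above): the driver and the model share the same wrong idea of which bit enables the device, so the test against the model passes while the "real" device never turns on.

```python
ENABLE_BIT_PER_SPEC = 0x1   # hypothetical: what the real hardware expects
ENABLE_BIT_AS_CODED = 0x2   # the same mistake, made in both driver and model


class Device:
    """Minimal device: one register whose enable bit is set by enable_mask."""

    def __init__(self, enable_mask):
        self.enable_mask = enable_mask
        self.enabled = False

    def write_reg(self, value):
        self.enabled = bool(value & self.enable_mask)


def driver_enable(write_reg):
    """Driver code under test; it sets the wrong bit."""
    write_reg(ENABLE_BIT_AS_CODED)


buggy_model = Device(ENABLE_BIT_AS_CODED)   # model written from the driver
real_hw = Device(ENABLE_BIT_PER_SPEC)       # stand-in for actual hardware

driver_enable(buggy_model.write_reg)
driver_enable(real_hw.write_reg)
```

After running this, `buggy_model.enabled` is true and `real_hw.enabled` is false: testing only against the model would never reveal the problem.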

A kernel unit-testing framework

Posted Mar 5, 2019 19:08 UTC (Tue) by roc (subscriber, #30627) [Link] (1 responses)

Spending double the development effort to have reasonable (not perfect) automated tests isn't outrageous. It's in the right ballpark for projects I've worked on like Firefox and rr. Under the right conditions that spend pays for itself pretty easily.

The "right conditions" include the software living long enough for tests written today to pay off in the future, and bugs in deployed releases being costly because you have a lot of users or your software does important things or bugs found in the field are difficult to debug remotely.

Ironically the work I'm doing on improving debugging makes writing good tests slightly less important!

Tests are never perfect. I can see that device models diverging from hardware would be a problem. But it also seems to me that you could engineer around some of the problems, e.g. have a testing framework that *by default* tests for multiple instances of each hardware element, hotplugging of each hardware element, randomization of interrupt delays, etc.
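A minimal sketch of the "randomize by default" idea, in Python and with hypothetical names throughout (`run_scenario`, `CompletionTracker`, `stress`): each run uses several device instances, draws interrupt delays from a seeded generator so that failures are reproducible, and checks that the driver-side state ends up consistent whatever order the interrupts arrive in.

```python
import random


def run_scenario(seed, instances=2, max_delay=10):
    """Return the order in which per-instance completion interrupts arrive."""
    rng = random.Random(seed)         # seeded, so a failing run can be replayed
    delays = {dev_id: rng.randint(0, max_delay) for dev_id in range(instances)}
    # Earlier delay fires first; ties are broken by instance number.
    return sorted(delays, key=lambda dev_id: (delays[dev_id], dev_id))


class CompletionTracker:
    """Hypothetical driver-side state that must cope with any arrival order."""

    def __init__(self, instances):
        self.pending = set(range(instances))

    def handle_irq(self, dev_id):
        self.pending.discard(dev_id)

    def all_done(self):
        return not self.pending


def stress(seeds, instances=2):
    """Run the scenario under many seeds and report which orderings occurred."""
    orders_seen = set()
    for seed in seeds:
        tracker = CompletionTracker(instances)
        order = run_scenario(seed, instances)
        for dev_id in order:
            tracker.handle_irq(dev_id)
        assert tracker.all_done(), f"seed {seed} left work pending"
        orders_seen.add(tuple(order))
    return orders_seen
```

Calling `stress(range(200))` exercises both arrival orders for two instances, and any failure message names the seed needed to reproduce it.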

A kernel unit-testing framework

Posted Mar 9, 2019 1:14 UTC (Sat) by nix (subscriber, #2304) [Link]

> Spending double the development effort to have reasonable (not perfect) automated tests isn't outrageous. It's in the right ballpark for projects I've worked on like Firefox and rr. Under the right conditions that spend pays for itself pretty easily.

In glibc, which is very much following the 'everything should have tests dammit' policy (and long has), the tradeoff is sometimes much higher: it can easily take five times longer to write a decent test for some bugfixes than to fix the bug, even (sometimes especially!) when the bug is a real monster to find.

Linux would probably have terrible threading despite NPTL if Uli hadn't written a massive heap of tests for NPTL at the same time to make sure that the damn thing actually worked and did not regress. More than one bug I've looked at in the past which came down to one missed assembler instruction that triggered problems only in ludicrously obscure slowpath cases was tickled by one or more of those tests...