Automated tests

Posted Mar 9, 2021 21:33 UTC (Tue) by corbet (editor, #1)
In reply to: Linux 5.12's very bad, double ungood day by daenzer
Parent article: Linux 5.12's very bad, double ungood day

One could certainly write an automated test to catch this, but it would not be easy. You'd need to set up a machine with a swap file, drive it into memory pressure with a lot of dirty anonymous pages, then somehow verify that none of the swap traffic went astray. That means comparing the entire block device (including free space) with what you expect it to be, or mapping out the swap file, picking the swap traffic out of a blktrace stream, and ensuring that each page goes to the right place.

Certainly doable, but this would not be a fast-running test.
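The second approach reduces to a containment check once the data is in hand. Here is a minimal sketch, assuming the swap file's extents and the swap write events have already been extracted into plain (start_sector, sector_count) tuples. The function name and tuple layout are hypothetical, and the tracing and parsing steps are not shown.

```python
from bisect import bisect_right


def stray_writes(extents, writes):
    """Return the writes that are not fully contained in the swap file.

    extents: iterable of (start_sector, sector_count) covering the swap file.
    writes:  iterable of (start_sector, sector_count), one per swap write.
    Both layouts are assumptions made for this sketch.
    """
    # Merge touching or overlapping extents so that a write spanning two
    # physically adjacent extents is not flagged by mistake.
    merged = []
    for start, length in sorted(extents):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            prev_start, prev_len = merged[-1]
            merged[-1] = (prev_start, max(prev_len, start + length - prev_start))
        else:
            merged.append((start, length))

    starts = [start for start, _ in merged]
    stray = []
    for w_start, w_len in writes:
        # Find the last extent that begins at or before the write.
        i = bisect_right(starts, w_start) - 1
        if i < 0:
            stray.append((w_start, w_len))
            continue
        e_start, e_len = merged[i]
        if w_start + w_len > e_start + e_len:
            stray.append((w_start, w_len))
    return stray
```

Any non-empty result means that some swap traffic landed outside the file.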

Automated tests

Posted Mar 10, 2021 2:49 UTC (Wed) by roc (subscriber, #30627)

You could make it run pretty fast by having the test generate a virtual machine image that is just big enough, i.e. a minimal amount of memory and a minimal-sized block device. Lots of tests could potentially benefit from this.

You'd have to write block device verification code to check the free space and the contents of all files, but that code could be useful for detecting all kinds of bugs.
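As a rough sketch of the simplest form of such verification, assuming the test keeps a copy of the small image from before the run and knows which blocks are allowed to change (the swap file's own blocks, say), one could diff the two images block by block. The function names and the "allowed" set are hypothetical.

```python
def changed_blocks(before_path, after_path, block_size=4096):
    """Yield the indices of blocks that differ between two images."""
    with open(before_path, "rb") as before, open(after_path, "rb") as after:
        index = 0
        while True:
            old = before.read(block_size)
            new = after.read(block_size)
            if not old and not new:
                break
            if old != new:
                yield index
            index += 1


def unexpected_changes(before_path, after_path, allowed, block_size=4096):
    """Return changed block indices that are not in the 'allowed' set."""
    return [
        i
        for i in changed_blocks(before_path, after_path, block_size)
        if i not in allowed
    ]
```

With a minimal-sized image, a full comparison like this stays cheap, and the same helper could back tests that have nothing to do with swap.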

One thing about automated testing is that once you bite the bullet and start creating infrastructure for things that look hard to test, you make it easier to test all kinds of things and people are much more willing to write tests for all kinds of things as part of their normal development. So the sooner you create such infrastructure, the better.

Automated tests

Posted Mar 10, 2021 2:52 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

> One could certainly write an automated test to catch this, but it would not be easy.

I was one of the people responsible for getting EC2 instances to hibernate. We used files for hibernation and actually found that the kernel had been broken for YEARS with file hibernation (it required a reboot for the hibernation target setting to take effect).

We also had a test for this very issue. It created an EC2 instance with a small disk (~2GB) and limited RAM (512MB). The test program created a swap file and then filled the disk to capacity with pseudo-random numbers (by creating a file and writing to it). It then allocated enough pseudo-random data in memory to force at least some of it out to swap.

The test would then hibernate, thaw, and checksum both the disk and the data in RAM to check for corruption.

The tests ran in about 2 minutes.
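The fill-and-checksum part of a test like this can be sketched in user space. This is not the original test program. It is a scaled-down illustration that assumes a seeded generator, so the expected checksum is known before the hibernate/thaw cycle, which is not shown. All names are hypothetical.

```python
import hashlib
import os
import random

CHUNK = 1 << 20  # write and read in 1 MiB pieces


def pseudo_random_chunks(seed, total_bytes, chunk=CHUNK):
    """Yield reproducible pseudo-random bytes, total_bytes in all."""
    rng = random.Random(seed)
    remaining = total_bytes
    while remaining > 0:
        n = min(chunk, remaining)
        yield rng.getrandbits(8 * n).to_bytes(n, "little")
        remaining -= n


def fill_file(path, seed, total_bytes):
    """Write pseudo-random data to path and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "wb") as f:
        for block in pseudo_random_chunks(seed, total_bytes):
            f.write(block)
            digest.update(block)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches the disk
    return digest.hexdigest()


def checksum_file(path):
    """Re-read path and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()
```

A test would call fill_file() before hibernating and compare its result with checksum_file() after the thaw. The same generator can fill a large in-memory buffer to create the swap pressure.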

Automated tests

Posted Mar 10, 2021 22:12 UTC (Wed) by sjj (guest, #2020)

Interesting, I never thought about hibernation in AWS. I haven’t thought about hibernation in years, since it was the unreliable thing you had to do on laptops of the day.

Curious what the use case for it in AWS is.

Hibernation

Posted Mar 10, 2021 22:18 UTC (Wed) by corbet (editor, #1)

That work, and the use case behind it, were discussed in this OSPM article from last May.

Automated tests

Posted Mar 10, 2021 6:42 UTC (Wed) by marcH (subscriber, #57642)

> You'd need to set up a machine with a swap file,

As this is an apparently common configuration, you'd expect anyone modifying swap code to grant that configuration some reasonable amount of test time.

> drive it into memory pressure with a lot of dirty anonymous pages,

Also called "swapping"?

> then somehow verify that none of the swap traffic went astray.

Unless you're re-installing your entire system every few tests, there's a good chance you will soon notice that something somewhere has gone terribly wrong, even when not verifying every byte on the disk. This is apparently how the bug was found, and relatively quickly, by people who were not even testing swap but other things. The perfect that does not get done is the enemy of the good that does, and this is especially true with chronically underdeveloped validation.

Automated tests

Posted Mar 11, 2021 19:56 UTC (Thu) by thumperward (guest, #34368)

For an integration test that would specifically have caught this issue, sure. But given the assumption that swap_page_sector() could be called on a swap file, a unit test that called swap_page_sector() on a swap file with a given input and verified that the file contained the right bytes in the right order afterwards is something that could well have existed and caught said bug before the refactoring inadvertently exposed it.
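To illustrate the shape of such a test, here is a user-space toy model. It is not the kernel function and not kernel test code. It assumes a swap file is described by a list of (start_page, nr_pages, start_block) extents and that the sector for a page must be derived from the extent containing it. The function names and the example values are hypothetical.

```python
PAGE_SHIFT = 12    # 4 KiB pages (assumption for this sketch)
SECTOR_SHIFT = 9   # 512-byte sectors


def page_sector(extents, page_offset):
    """Map a swap page offset to a disk sector through the extent list."""
    for start_page, nr_pages, start_block in extents:
        if start_page <= page_offset < start_page + nr_pages:
            block = start_block + (page_offset - start_page)
            return block << (PAGE_SHIFT - SECTOR_SHIFT)
    raise ValueError(f"page offset {page_offset} is not in any extent")


def naive_page_sector(page_offset):
    """Only valid when swap is a whole device with an identity mapping."""
    return page_offset << (PAGE_SHIFT - SECTOR_SHIFT)


def test_file_backed_mapping():
    # A file whose second extent sits far from the first on disk.
    extents = [(0, 4, 100), (4, 4, 500)]
    assert page_sector(extents, 0) == 100 * 8
    assert page_sector(extents, 3) == 103 * 8
    assert page_sector(extents, 4) == 500 * 8
    # The identity mapping would send the same page somewhere else.
    assert naive_page_sector(4) != page_sector(extents, 4)
```

A test of this form fails as soon as the mapping ignores the extents, without needing memory pressure or a full disk comparison.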