Storage testing
Ted Ts'o led a discussion on storage testing and, in particular, on his experience getting blktests running for his test environment, in a combined storage and filesystem session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit. He has been adding more testing to his automated test platform, including blktests, and he would like to see more people running storage tests. The idea of his session was to see what could be done to help that cause.
There are two test areas that he has recently been working on: NFS testing and blktests. His employer, Google, is rolling out cloud kernels for its customers that enable NFS, so he thought it would be "a nice touch" to actually test NFS. He said that one good outcome of his investigation into running xfstests for NFS was discovering an NFS wiki page that described the configuration and expected failures for xfstests. He effusively thanked whoever wrote that page, which he found to be invaluable. He thinks that developers for other filesystems should do something similar if they want others to run their tests.
He has also recently been running blktests to track down a problem that manifested itself as an ext4 regression in xfstests. It turned out to be a problem in the SCSI multiqueue (mq) code, but he thought it would be nice to be able to pinpoint whether future problems were block-layer problems or ext4 problems, so he has been integrating blktests into his test suite. Ts'o said that he realized blktests is a relatively new package, so the problems he ran into are likely to get better before long; much of what he related was feedback on the package and its documentation.
One of the biggest problems with blktests is that it is not obvious which tests are actually succeeding or failing. He put up a list of the tests that he thinks are failing, but he is not a block-layer specialist, so it can be hard for him to figure out what went wrong. Some failures were lockdep reports that would seem to be kernel problems, but others may have been bugs in his setup; it was quite difficult to determine which.
For example, the NVMe tests were particularly sensitive to the version of the nvme-cli tool being used; he found that the bleeding-edge, not-yet-released version of nvme-cli was needed to make some of the tests succeed. Beyond that, the required kernel configuration is not spelled out anywhere. Blktests needs a number of kernel features to be built as modules or tests will fail, but it is not clear which ones; by trial and error, he found that 38 modules were needed to make most of the tests succeed.
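As a purely illustrative sketch (these module options are plausible candidates for such a configuration, not Ts'o's actual 38-entry list), the kind of kernel-configuration fragment involved might look like:

    # Hypothetical fragment of a blktests-friendly kernel configuration;
    # each feature is built as a module so that tests can load and unload it.
    CONFIG_BLK_DEV_LOOP=m
    CONFIG_BLK_DEV_NULL_BLK=m
    CONFIG_SCSI_DEBUG=m
    CONFIG_BLK_DEV_NVME=m
    CONFIG_NVME_TARGET=m
    CONFIG_NVME_TARGET_LOOP=m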
He plans to put his kernel configuration into xfstests so that others can use that as a starting point. It would be good to keep that up to date, Ts'o said. As these kinds of things get documented, it will make it easier for more people to run blktests. The code for his test setup is still in an alpha state, but he plans to clean it up and make it available; it is "getting pretty good" in terms of passing most of the blktests at this point.
It is in the interests of kernel developers to get more people (and automated systems) running blktests, he said, as it will save time for the kernel developers; the way to make that happen is to find these kinds of barriers and eliminate them. Having test-appliance images that he can hand off to others, so that they can run their own tests on their own patches, makes his life easier.
Ric Wheeler asked how many different device types were being tested as part of this effort, but Ts'o said that the NVMe and SCSI blktests do much of their testing using loopback. There are also tests that will use the virtual hardware that VMs provide. Wheeler said that there is value to testing physical devices that is distinct from testing virtual devices in a VM. Ts'o agreed that more hardware testing would be good, but it depends on having access to real hardware; he is testing on his laptop and would rather not risk that disk.
Blktests maintainer Omar Sandoval said that the goal of blktests is to test software, not hardware, which is why the loopback devices are used. Some tests will need real hardware, while others will use the hardware if it is available and fall back to virtual devices or loopback if not. Wheeler noted that the drivers are not being tested if real hardware is not used.
The idea behind this effort is to lower the barriers to entry so that anyone can test to see that they did not break the core, Chris Mason said. The 0-Day model, where people get notified if their proposed changes break the tests, is the right one to use, he said. That way, the maintainer does not have to ask people to run the tests themselves.
Ts'o agreed that there should be a core set of tests that get run in that manner, but his current tests take 18-20 hours to run, which is not realistic for 0-Day or similar efforts. For those, some basic tests make sense. His plan is to ask people who are submitting ext4 patches to run the full set themselves before he considers them for merging.
Wheeler said that there should be some device-mapper tests added to blktests as well. Sandoval said that the device-mapper developers have plans to add their tests, but that has not happened yet. Damien Le Moal agreed that specific device-mapper tests would be useful, but it is relatively straightforward to switch out a regular block device for a device-mapper target and run the regular tests. It is a matter of test configuration, not changing the test set; having a set of standard configurations for these different options would be nice, he said.
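A minimal sketch of that kind of configuration swap, with illustrative device names that are not from the session, might look like:

    # Wrap an ordinary block device in a trivial device-mapper linear
    # target, then point blktests at the dm device instead.
    dev=/dev/sdb                       # example device
    size=$(blockdev --getsz "$dev")    # size in 512-byte sectors
    dmsetup create blktests-linear --table "0 $size linear $dev 0"

    # From the top of a blktests checkout: blktests reads its device
    # list from the 'config' file in that directory.
    echo 'TEST_DEVS=(/dev/mapper/blktests-linear)' > config
    ./check block                      # run the generic block tests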
Ts'o said that he has a similar situation with his ext4 encryption and NFSv3 tests; there is some setup and teardown that needs to be done around the blktests run. There is an interesting philosophical question of whether that should be done in blktests itself or by using a wrapper script; xfstests uses the wrapper-script approach and that may be fine for blktests as well. The important thing is to ensure that others do not have to figure all of that out in order to simply run the tests. Le Moal said that he had done some similar work on setup and teardown; he suggested that they work together to see what commonalities can be found.
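The wrapper-script approach amounts to something like the following sketch; the setup and teardown functions here are placeholders, not anyone's actual scripts:

    #!/bin/bash
    # Hypothetical wrapper around a blktests run: do the environment
    # setup, guarantee teardown, and let the test exit status propagate.
    set -e
    setup_environment() { :; }      # placeholder: create devices, mount, export, ...
    teardown_environment() { :; }   # placeholder: undo whatever setup did

    setup_environment
    trap teardown_environment EXIT
    cd /path/to/blktests            # illustrative path
    ./check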
The complexities of setting up the user-space environment were also discussed. Luis Chamberlain noted that his oscheck project, which was also brought up in the previous session, has to handle various distribution and application version requirements. He is using Ansible to manage all of that.
Ts'o said that he builds a chroot() environment based on Debian that has all of the different pieces that he needs; it is used in various places, including on Android devices. There are some environments where he needs to run blktests, but the Bash version installed there is too old for blktests; his solution is to do it all in a chroot() environment. That also allows him to build his own versions of things like dmsetup and nvme-cli as needed.
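A rough sketch of that approach, assuming debootstrap and illustrative paths rather than Ts'o's actual setup, is to carry a known-good user space alongside whatever kernel is being tested:

    # Build a minimal Debian user space that brings its own bash, fio,
    # and other tools, independent of what the host has installed.
    debootstrap --variant=minbase stable /srv/blktests-root http://deb.debian.org/debian
    chroot /srv/blktests-root apt-get install -y bash fio util-linux

    # blktests needs /proc, /sys, and /dev visible inside the chroot.
    for fs in proc sys dev; do
        mount --bind /$fs /srv/blktests-root/$fs
    done
    cp -a ~/blktests /srv/blktests-root/blktests
    chroot /srv/blktests-root /bin/bash -c 'cd /blktests && ./check'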
Ts'o uses Google Compute Engine for his tests, but Chamberlain would like to support other cloud options (e.g. Microsoft Azure) as well as non-cloud environments on other operating systems (e.g. Windows, macOS). He is planning to use Vagrant to help solve that problem and is looking for others who would like to collaborate on that. Ts'o said that he believes the problem is mostly solved once you have the chroot() environment; there is still some work to get that into a VM or test appliance, but that is relatively minor. For his purposes, once it works with KVM, he is done, but he does realize that others have different requirements.
Index entries for this article

Kernel: Development tools/blktests
Kernel: Regression testing
Conference: Storage, Filesystem, and Memory-Management Summit/2019
Posted May 28, 2019 21:04 UTC (Tue)
by roc (subscriber, #30627)
[Link] (3 responses)
Posted May 28, 2019 21:24 UTC (Tue)
by tytso (subscriber, #9993)
[Link] (2 responses)
Seriously, while it would be nice if there was One True kernel testing system, it's just not going to happen. And that's because there is a huge amount of special infrastructure which is needed. File system testing requires using block devices which you can reformat; it also requires being able to run the same set of tests against different file systems and different configurations (options to mkfs, mount options, etc.)
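As a minimal sketch of what "different configurations" means in practice for xfstests, using its sectioned local.config (device paths here are illustrative), each section can then be run with "./check -s ext4_4k":

    [ext4_4k]
    FSTYP=ext4
    TEST_DEV=/dev/vdb
    TEST_DIR=/mnt/test
    SCRATCH_DEV=/dev/vdc
    SCRATCH_MNT=/mnt/scratch
    MKFS_OPTIONS="-b 4096"

    [ext4_1k]
    FSTYP=ext4
    TEST_DEV=/dev/vdb
    TEST_DIR=/mnt/test
    SCRATCH_DEV=/dev/vdc
    SCRATCH_MNT=/mnt/scratch
    MKFS_OPTIONS="-b 1024"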
The Intel i915 tests fundamentally require direct access to hardware --- it's not something you can emulate, and in fact you need a hardware library of different i915 video cards / chipsets in order to really do a good job testing the device driver.
And networking tests often require a pair of machines with different types of networks between the two.
Good luck trying to unify it all.
Finally, note that there are different types of testing infrastructure. There is the test suite itself, and there is how you run the test suite in a turnkey environment. That test runner tends to be highly test-environment specific. For example, gce-xfstests will pluck a kernel out of the developer's build tree and upload it to Google Cloud Storage. It will then start a VM and pass a URL to the kernel in the VM metadata. The VM will then kexec to the kernel under test, start the tests, and, when they are complete, e-mail the results to the developer. From the developer's perspective, it's dead simple: "make ; gce-xfstests smoke". Done.
And if you're using a set of test hardware shared across 100 software engineers, using a custom hardware-reservation system (both IBM and Google had such a setup, and naturally they were completely different), you'll need a different way of running tests. And that is always going to be very specific to the software team's test environment as set up by their test engineers, which is why there will always be a large number of test harnesses.
Posted May 29, 2019 3:52 UTC (Wed)
by roc (subscriber, #30627)
[Link] (1 responses)
Sure, but the software and services infrastructure for writing tests, running tests, processing test results, and reporting those results could be shared with lots of other kinds of tests.
> And networking tests often require a pair of machines with different types of networks between the two.
Ditto. (And presumably networking tests for everything above OSI level 2 can be virtualized to run on a single machine, even a single kernel.)
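A minimal illustration of that point, with arbitrary names and addresses: two network namespaces joined by a veth pair can stand in for a pair of machines on a single kernel.

    # Two "hosts" on one kernel, connected by a virtual link.
    ip netns add host-a
    ip netns add host-b
    ip link add veth-a type veth peer name veth-b
    ip link set veth-a netns host-a
    ip link set veth-b netns host-b
    ip -n host-a addr add 192.0.2.1/24 dev veth-a
    ip -n host-b addr add 192.0.2.2/24 dev veth-b
    ip -n host-a link set veth-a up
    ip -n host-b link set veth-b up
    ip netns exec host-a ping -c 1 192.0.2.2   # traffic between the two "hosts"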
> Good luck trying to unify it all.
Unifying things after they're up and running is hard. Sharing stuff that already exists instead of creating new infrastructure is easier. Given that the kernel's upstream testing is totally inadequate currently, there's an opportunity here :-).
> Finally, note that there are different types of testing infrastructure. There is the test suite itself, and how you run the test suite in a turn key environment.
Yes, I can see that you want drivers for spawning test kernels on different clouds. They can exist in a world where other testing infrastructure is shared.
Surely you want a world where someone can run all the different kernel test suites (that don't require special hardware), against some chosen kernel version, on the cloud of their choice. That would demand a shared "spawn test kernel" interface that the different suites all use, wouldn't it?
Posted May 29, 2019 23:03 UTC (Wed)
by tytso (subscriber, #9993)
[Link]
I assume you're talking about kselftests, which is the self-testing infrastructure that is included as part of the kernel sources? It has a very different purpose compared to other test suites. One of its goals is to keep the total test time of all of the tests to 20 (twenty) minutes. That's not a lot of time, even if a single file system were to hog all of it.

Before I send a pull request to Linus, I run about 20 VM-hours' worth of regression tests for ext4. It's sharded across multiple VMs which get launched in parallel, but that kind of testing is simply not going to be accepted into kselftests. Which is fine; it has a very different goal, which is to be a quick "smoke test" for the kernel. You'd have to ask the kselftest maintainer if they were interested in taking it in a broader direction, and adding some of the support that would be needed to allow tests to be sharded across multiple VMs. One of the things that xfstests has, but which kselftests does not, is the option of writing the test results in XML, using the Junit format:
<testcase classname="xfstests.global" name="generic/402" time="1">
  <skipped message="no kernel support for y2038 sysfs switch"/>
</testcase>

This allows me to reuse some Junit Python libraries to coalesce multiple XML report files and generate statistics like this:

ext4/4k: 464 tests, 43 skipped, 4307 seconds
ext4/1k: 473 tests, 1 failures, 55 skipped, 4820 seconds
    Failures: generic/383
ext4/ext3: 525 tests, 1 failures, 108 skipped, 6619 seconds
    Failures: ext4/023
ext4/encrypt: 533 tests, 125 skipped, 2612 seconds
ext4/nojournal: 522 tests, 2 failures, 104 skipped, 3814 seconds
    Failures: ext4/301 generic/530
ext4/ext3conv: 463 tests, 1 failures, 43 skipped, 4045 seconds
    Failures: generic/347
ext4/adv: 469 tests, 3 failures, 50 skipped, 4055 seconds
    Failures: ext4/032 generic/399 generic/477
ext4/dioread_nolock: 463 tests, 43 skipped, 4234 seconds
ext4/data_journal: 510 tests, 4 failures, 92 skipped, 4688 seconds
    Failures: generic/051 generic/371 generic/475 generic/537
ext4/bigalloc: 445 tests, 50 skipped, 4824 seconds
ext4/bigalloc_1k: 458 tests, 1 failures, 64 skipped, 3753 seconds
    Failures: generic/383
Totals: 4548 tests, 777 skipped, 13 failures, 0 errors, 47529s
This is an example of something which one test infrastructure has, that other testing harnesses don't have. So while it would be "nice" to have one test framework that rules them all, that can work on multiple different cloud hosting services, there are lots of things that are "nice". I'd like to have enough money to fly around in a Private Jet so I didn't have to deal with the TSA; and then I'd like to be rich enough to buy carbon offsets so I wouldn't feel guilty flying around all over the place in a Private Jet. Unfortunately, I don't have the resources to do that any time in the foreseeable future. :-)
The question is who is going to fund that effort, and does it really make sense to ask developers to stop writing tests until this magical unicorn test harness exists? And then we have to ask the question which test infrastructure do we use as the base, and are the maintainers of that test infrastructure interested in adding all of the hair to add support for all of these features that we might "want" to have.
Posted May 28, 2019 21:08 UTC (Tue)
by jhoblitt (subscriber, #77733)
[Link] (2 responses)
Posted May 28, 2019 21:37 UTC (Tue)
by tytso (subscriber, #9993)
[Link] (1 responses)
Setting up all of the qemu configuration to run the storage testing is where the real value lies. For example, this is what "kvm-xfstests smoke" runs:
ionice -n 5 /usr/bin/kvm -boot order=c -net none -machine type=pc,accel=kvm:tcg -cpu host -drive file=/usr/projects/xfstests-bld/build-64/kvm-xfstests/test-appliance/root_fs.img,if=virtio,snapshot=on -drive file=/dev/lambda/test-4k,cache=none,if=virtio,format=raw,aio=native -drive file=/dev/lambda/scratch,cache=none,if=virtio,format=raw,aio=native -drive file=/dev/lambda/test-1k,cache=none,if=virtio,format=raw,aio=native -drive file=/dev/lambda/scratch2,cache=none,if=virtio,format=raw,aio=native -drive file=/dev/lambda/scratch3,cache=none,if=virtio,format=raw,aio=native -drive file=/dev/lambda/results,cache=none,if=virtio,format=raw,aio=native -drive file=/tmp/xfstests-cli.VpexZxAo/kvm-vdh,if=virtio,format=raw -vga none -nographic -smp 2 -m 2048 -fsdev local,id=v_tmp,path=/tmp/kvm-xfstests-tytso,security_model=none -device virtio-9p-pci,fsdev=v_tmp,mount_tag=v_tmp -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -serial mon:stdio -monitor telnet:localhost:7498,server,nowait -serial telnet:localhost:7500,server,nowait -serial telnet:localhost:7501,server,nowait -serial telnet:localhost:7502,server,nowait -gdb tcp:localhost:7499 --kernel /build/ext4-64/arch/x86/boot/bzImage --append quiet loglevel=0 root=/dev/vda console=ttyS0,115200 fstestcfg=4k fstestset=-g,quick fstestopt=aex fstesttz=America/New_York fstesttyp=ext4 fstestapi=1.5
... and where the root_fs.img can be downloaded here[1], and built from scratch using directions here[2].
[1] https://www.kernel.org/pub/linux/kernel/people/tytso/kvm-...
[2] https://github.com/tytso/xfstests-bld/blob/master/Documen...
Changing kernel versions is just a matter of pointing qemu at the kernel in the build tree: --kernel /build/ext4-64/arch/x86/boot/bzImage
And why bother with a docker image when you can just use a qemu image file: -drive file=/usr/projects/xfstests-bld/build-64/kvm-xfstests/test-appliance/root_fs.img,if=virtio,snapshot=on
Docker doesn't help you with any of the rest, which includes setting up storage devices that should be used for testing. So why use Docker?
Posted May 29, 2019 10:37 UTC (Wed)
by unixbhaskar (guest, #44758)
[Link]
Any container mechanism is certainly not built with this kind of stuff in mind, nor does it greatly help for this purpose.