A backdoor in xz
Posted Mar 31, 2024 14:06 UTC (Sun) by nix (subscriber, #2304)
In reply to: A backdoor in xz by himi
Parent article: A backdoor in xz
I'd say that in the specific case of xz, it should be generating corrupted test files etc using *xz itself*, and then buggering them using existing tooling. If it had done that routinely it would be obvious that a giant binary lump claiming to be a corrupted xz file was up to no good.
The hard part is testing *non*-corrupted files. You can't reliably construct *those* with the built xz, because if the built xz is buggy it's going to generate a messed-up file rather than the non-corrupted file that is desired (and you presumably want to detect cases where the compressor and decompressor are buggered-up in the same way). Not sure how to fix that: you can't use the xz already on the system because it might be similarly buggered-up...
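The first half of the suggestion above — generate a valid file with the tool itself, then deliberately damage it with existing tooling — can be sketched roughly as follows. This is a hypothetical illustration using Python's stdlib `lzma` module (which wraps liblzma) as a stand-in for the `xz` command-line tool; the offset chosen for corruption is arbitrary:

```python
import lzma

# Generate a valid .xz stream from known plaintext, then derive a
# corrupted variant by flipping one byte -- so the "bad" test input is
# produced by tooling at test time, not checked in as an opaque binary.
plaintext = b"Hello, xz test corpus!\n" * 100

valid = lzma.compress(plaintext)  # stand-in for running `xz` itself

corrupted = bytearray(valid)
corrupted[12] ^= 0xFF             # flip a byte in the block header
corrupted = bytes(corrupted)

# The pristine stream must round-trip...
assert lzma.decompress(valid) == plaintext

# ...and the deliberately damaged one must be rejected.
try:
    lzma.decompress(corrupted)
    raise AssertionError("corrupted stream was accepted")
except lzma.LZMAError:
    pass
```

Because every region of an xz stream is covered by a checksum (CRC32 on the headers and index, the configured check on the payload), a single flipped byte anywhere should be caught by the decoder.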
Posted Mar 31, 2024 19:05 UTC (Sun)
by cesarb (subscriber, #6266)
[Link] (4 responses)
> You can't reliably construct *those* with the built xz, because if the built xz is buggy it's going to generate a messed-up file [...] and you presumably want to detect cases where the compressor and decompressor are buggered-up in the same way
That is easy to avoid, you just need to add an "expected hash" to check that the compressor generated the correct data for the decompressor to test. But that doesn't help when you want to test that the decompressor remains compatible with data generated by *older* versions of the compressor (which the current code no longer generates), or even alternative compressor implementations. There's a lot of flexibility in compressor output, so it's not unusual for it to change.
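The "expected hash" guard described above might look something like this minimal sketch (hypothetical; using Python's stdlib `lzma` and `hashlib` rather than the actual xz test harness). The digest of the compressed output is recorded once at generation time and re-checked at test time, so a compressor and decompressor that are broken in matching ways cannot silently vouch for each other:

```python
import hashlib
import lzma

plaintext = b"round-trip me\n" * 50

# At generation time: compress once and record the output's digest.
compressed = lzma.compress(plaintext, preset=6)
expected_hash = hashlib.sha256(compressed).hexdigest()

# At test time: re-run the compressor and require byte-identical
# output before handing it to the decompressor under test.
candidate = lzma.compress(plaintext, preset=6)
assert hashlib.sha256(candidate).hexdigest() == expected_hash

# Only then does a successful round-trip prove anything.
assert lzma.decompress(candidate) == plaintext
```

Note that this pins one implementation's exact output, which is precisely the limitation raised above: a newer liblzma that legitimately encodes differently would break the pinned hash even though its output is still valid.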
Posted Mar 31, 2024 19:41 UTC (Sun)
by nix (subscriber, #2304)
[Link]
In a similar area, for a long time we tried to make the libctf and ld.ctf testsuites in binutils work the same way the DWARF ones mostly do: describe a binary CTF dict via a .S file and just assemble it. But a mixture of there being no cross-arch-portable way to represent arbitrary binary data in .S files (.long, .4byte, etc. are all arch-dependent) and a desire to always test the latest CTF format except in a few error cases meant that nearly all of the tests end up compiling a .c file with -gctf just to get its CTF, then running the tests on that. It turned out to be much nicer to work with, and much easier to see what the tests were actually *doing* as well: .S files are a pretty opaque way to describe a test input! (For corrupted CTF tests, the same problem arises, though at only a few dozen bytes each they're hardly likely to be malicious. There's just no room.)
Posted Apr 1, 2024 6:52 UTC (Mon)
by himi (subscriber, #340)
[Link] (2 responses)
This is mostly what I was thinking of - have a human-readable language designed to specify the binary formats, and use that to generate the test data. It would work for both good and bad test data, though - rather than simply copying "known bad" test files for testing, you'd pull them apart and write a spec for the test case that generated the same kind of bad data. Doing that would actually specify the badness in great detail, far more so than simply having a binary test file that happens to break things; it could also help drive highly targeted fuzzing whenever particular failures were identified.
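A toy version of such a badness spec might look like the sketch below. Everything here is hypothetical — the spec format, the `generate` helper, and the use of Python's stdlib `lzma` in place of real xz tooling — but it shows the key property: the corruption is documented in the spec rather than hidden inside an opaque checked-in binary:

```python
import lzma

# A hypothetical human-readable "badness spec": instead of committing
# an opaque corrupted binary, describe exactly what is wrong with it.
BAD_MAGIC_SPEC = {
    "description": "xz stream whose first magic byte is wrong",
    "base_plaintext": b"spec-generated test input\n" * 20,
    "mutations": [
        {"offset": 0, "xor": 0x01},  # 0xFD -> 0xFC: breaks the magic
    ],
}

def generate(spec):
    """Build the corrupted test file the spec describes."""
    data = bytearray(lzma.compress(spec["base_plaintext"]))
    for m in spec["mutations"]:
        data[m["offset"]] ^= m["xor"]
    return bytes(data)

bad_file = generate(BAD_MAGIC_SPEC)
assert bad_file[0] == 0xFC  # magic deliberately broken, as documented

try:
    lzma.decompress(bad_file)
    raise AssertionError("broken magic was accepted")
except lzma.LZMAError:
    pass
```

A reviewer can read the spec and see the full extent of the intended badness, and the same mutation descriptions could seed targeted fuzzing around the failure.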
The biggest problem I see is that for something like a compression format, the difference in complexity between a specification of the binary format and its possible contents and an actual implementation of the compression algorithm probably isn't all that big, and the specification is certainly going to be subject to as many bugs as the code it's supposed to be testing. This is why I'd hesitate to suggest trying to make this sort of thing into any kind of generalised "best practice" - there are probably going to be too many cases where the payoff just isn't worth the effort required.
That said, a compression algorithm and its on-disk format are probably just about the worst case scenario - I expect a lot of software that reads or writes a binary format would have much less difficulty creating this kind of test data spec. And in a more generalised sense the idea of generating test data rather than simply embedding it in the repository is something that's probably worth encouraging, along with tooling that would support it . . .
Posted Apr 1, 2024 12:48 UTC (Mon)
by pizza (subscriber, #46)
[Link] (1 responses)
So... you have this human-readable language generate a malicious file that contains the payload for an exploit. What have you gained here?
The problem is that the binary data is "hostile", not "how the binary data was generated".
Posted Apr 2, 2024 3:16 UTC (Tue)
by himi (subscriber, #340)
[Link]
As I've said, even if this idea might work in some cases I suspect it wouldn't be viable for xz or similar, simply because of the nature of compression algorithms. But if you can make "has unexplained binary files as part of the test data set" into a code smell that gets people to take a closer look, then you're raising the bar for sneaking a malicious payload into a repository.