A backdoor in xz

Posted Apr 1, 2024 6:52 UTC (Mon) by himi (subscriber, #340)
In reply to: A backdoor in xz by cesarb
Parent article: A backdoor in xz

> In my opinion, it's the opposite: non-corrupted files could be described using a custom human-readable assembly-like language which represents the primitives allowed in the compression format (something like "LITERAL 03 AA BB CC TREE ..."), while corrupted files could be crazy files found by users which happened to cause older versions of the software to misbehave, and which were added as a regression test.

This is mostly what I was thinking of - have a human-readable language designed to specify the binary formats, and use that to generate the test data. It would work for both good and bad test data, though - rather than simply copying "known bad" test files for testing, you'd pull them apart and write a spec for the test case that generated the same kind of bad data. Doing that would actually specify the badness in great detail, far more so than simply having a binary test file that happens to break things; it could also help drive highly targeted fuzzing whenever particular failures were identified.
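
To make that concrete, here's a minimal sketch in Python of what such a spec language might look like. Everything in it is an assumption made up for illustration: the directive names (BYTES, ASCII, U32LE, REPEAT), the assemble() function, and the "TST1" container format are all hypothetical, and a real format would need far richer primitives.

```python
import struct


def assemble(spec: str) -> bytes:
    """Turn a line-oriented, human-readable spec into bytes.

    Hypothetical directives (invented for this sketch):
      BYTES hh hh ...   raw bytes given in hex
      ASCII text        literal ASCII text (cannot contain '#')
      U32LE n           32-bit little-endian unsigned integer
      REPEAT n hh       n copies of the byte hh
    """
    out = bytearray()
    for lineno, raw in enumerate(spec.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        op, _, arg = line.partition(" ")
        arg = arg.strip()
        if op == "BYTES":
            out += bytes.fromhex(arg)
        elif op == "ASCII":
            out += arg.encode("ascii")
        elif op == "U32LE":
            out += struct.pack("<I", int(arg, 0))
        elif op == "REPEAT":
            count, value = arg.split()
            out += bytes([int(value, 16)]) * int(count, 0)
        else:
            raise ValueError(f"line {lineno}: unknown directive {op!r}")
    return bytes(out)


# A "known bad" test case for a made-up container format: the spec itself
# documents what is wrong with the data, which a bare binary file cannot do.
TRUNCATED_PAYLOAD = """
ASCII TST1          # magic number of the hypothetical format
U32LE 16            # header claims 16 payload bytes...
BYTES de ad be ef   # ...but only 4 follow
"""

if __name__ == "__main__":
    print(assemble(TRUNCATED_PAYLOAD).hex(" "))
```

A reviewer can read the three lines of TRUNCATED_PAYLOAD and see exactly which property of the file is broken, and the test suite regenerates the bytes at build time instead of shipping them.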

The biggest problem I see is that, for something like a compression format, the difference in complexity between a specification of the binary format and its possible contents and an actual implementation of the compression algorithm probably isn't all that big, and the specification is certainly going to be subject to as many bugs as the code it's supposed to be testing. This is why I'd hesitate to suggest trying to make this sort of thing into any kind of generalised "best practice" - there are probably going to be too many cases where the payoff just isn't worth the effort required.

That said, a compression algorithm and its on-disk format are probably just about the worst-case scenario - I expect a lot of software that reads or writes a binary format would have much less difficulty creating this kind of test data spec. And in a more generalised sense, the idea of generating test data rather than simply embedding it in the repository is something that's probably worth encouraging, along with tooling that would support it...

Posted Apr 1, 2024 12:48 UTC (Mon) by pizza (subscriber, #46)

> This is mostly what I was thinking of - have a human-readable language designed to specify the binary formats, and use that to generate the test data.

So... you have this human-readable language generate a malicious file that contains the payload for an exploit. What have you gained here?

The problem is that the binary data is "hostile", not "how the binary data was generated".

Posted Apr 2, 2024 3:16 UTC (Tue) by himi (subscriber, #340)

If we can't simply trust that the contents of the repository aren't malicious, we need to be able to independently verify that they're not malicious. The point of a human-readable specification of the binary data is to try to make it possible for a human to read it and say "the binary version of this spec won't be malicious". Is that possible? In the general case I don't think so, but I'm pretty sure there are going to be a good number of cases where it /is/ possible, for someone with the requisite knowledge of the format. I'm suggesting that where it's possible, it might be worth trying to do, to mitigate at least some of the risks posed by malicious binary blobs in source repositories.

As I've said, even if this idea might work in some cases I suspect it wouldn't be viable for xz or similar, simply because of the nature of compression algorithms. But if you can make "has undefined binary files as part of the test data set" into a code smell that gets people to take a closer look then you're raising the bar for sneaking a malicious payload into a repository.
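
The "code smell" part could be automated fairly cheaply. Here's a rough sketch in Python of a check that walks a test-data directory and flags binary files with no accompanying spec. The convention that a file foo.bin is explained by a sibling foo.bin.spec, and the names looks_binary() and unexplained_blobs(), are assumptions invented for this example; the binary detection is a crude heuristic (a NUL byte or invalid UTF-8), not a robust classifier.

```python
from pathlib import Path


def looks_binary(path: Path) -> bool:
    """Crude heuristic: treat a file as binary if it has a NUL byte
    or does not decode as UTF-8."""
    data = path.read_bytes()
    if b"\0" in data:
        return True
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def unexplained_blobs(root: Path, spec_suffix: str = ".spec") -> list[Path]:
    """Return binary files under root that have no sibling '<name>.spec'.

    The '.spec' naming convention is hypothetical, chosen for this sketch.
    """
    flagged = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix == spec_suffix:
            continue
        spec = path.with_name(path.name + spec_suffix)
        if looks_binary(path) and not spec.exists():
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    import sys

    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("tests")
    for blob in unexplained_blobs(root):
        print(f"unexplained binary test file: {blob}")
```

A check like this doesn't prove anything about the flagged files, and it says nothing about whether a spec actually matches its blob; all it does is point a reviewer at the places that deserve a closer look.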