
A few relevant quotes

Posted Mar 31, 2024 19:43 UTC (Sun) by rra (subscriber, #99804)
In reply to: A few relevant quotes by pizza
Parent article: A few relevant quotes

> What exactly does "vet contributors" in this context mean, anyway?

Elsewhere in this discussion, people have been thinking like US defense contractors and their idea of vetting, but I think this specific example points to a different type of vetting that's also tedious but much less prone to mistaking geopolitics for trustworthiness: detailed code inspection and reproduction.

One of the critical moments in this exploit came when the "test files" were committed. Vetting may look like asking questions: where did these come from? How did you generate them? Please provide detailed instructions so that I can regenerate them and make sure they match. Let's check the scripts used to generate them into the Git repository. Etc. That's a type of vetting, and it's exactly the type of vetting that overworked maintainers have a hard time doing.
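The kind of vetting described above amounts to reproducibility: any committed test file should come with a generator script so a reviewer can regenerate it and byte-compare. A minimal sketch of what such a checked-in generator might look like (the file's contents and purpose here are invented for illustration; only the xz magic bytes are real):

```python
# Hypothetical generator for a "corrupt header" test file. Because it is
# fully deterministic, any reviewer can rerun it and compare the output
# against the copy committed to the repository.
import hashlib

def generate_bad_header_testfile() -> bytes:
    """Deterministically build a test file with a broken stream header."""
    magic = bytes([0xFD, 0x37, 0x7A, 0x58, 0x5A, 0x00])  # real xz magic
    corrupt_flags = b"\xff\xff"  # deliberately invalid stream flags
    padding = bytes(64)          # fixed-size zero padding
    return magic + corrupt_flags + padding

def sha256_hex(data: bytes) -> str:
    """Digest a reviewer can compare against the committed file's digest."""
    return hashlib.sha256(data).hexdigest()

if __name__ == "__main__":
    blob = generate_bad_header_testfile()
    print(len(blob), sha256_hex(blob))
```

If the script lives in the repository next to the test file, "where did this come from?" has a mechanical answer instead of relying on trust.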

Another critical moment is when the new maintainer did their first release. There, checking the first release would just move the problem; they'd inject the backdoor in their second release, etc. But the tooling to verify that a release is a correct representation of the Git tree was absent. Ideally someone should write it. That's not a very fun program to write, but a very useful program for the community to have. (I realize that the other approach is to move away from tarball releases; I don't mean to open that debate here, I'm just giving an example and other release methods will have other examples.)
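A sketch of the missing tooling the comment asks for: compare a release tarball against a reference archive (in practice, the output of `git archive` for the release tag) and report anything added, removed, or modified. The function names and the diff categories are my own invention, not an existing tool:

```python
# Compare two tar archives member-by-member and classify discrepancies.
# In real use, `git_tar` would be produced by `git archive <release-tag>`
# and `release_tar` would be the published tarball.
import hashlib
import io
import tarfile

def tar_digests(data: bytes) -> dict[str, str]:
    """Map each regular file in a tar archive to its SHA-256 digest."""
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(data)) as tar:
        for member in tar.getmembers():
            if member.isfile():
                f = tar.extractfile(member)
                digests[member.name] = hashlib.sha256(f.read()).hexdigest()
    return digests

def diff_release(release_tar: bytes, git_tar: bytes) -> dict[str, list[str]]:
    """Report files only in the release, only in git, or differing."""
    rel, git = tar_digests(release_tar), tar_digests(git_tar)
    return {
        "only_in_release": sorted(rel.keys() - git.keys()),
        "only_in_git": sorted(git.keys() - rel.keys()),
        "modified": sorted(n for n in rel.keys() & git.keys()
                           if rel[n] != git[n]),
    }
```

A real tool would also have to account for legitimately generated files (configure scripts and the like), which is exactly why this is "not a very fun program to write".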



A few relevant quotes

Posted Mar 31, 2024 20:50 UTC (Sun) by pizza (subscriber, #46) [Link] (24 responses)

> Elsewhere in this discussion, people have been thinking like US defense contractors and their idea of vetting, but I think this specific example points to a different type of vetting that's also tedious but much less prone to mistaking geopolitics for trustworthiness: detailed code inspection and reproduction.

That's moving the goalposts from "vetting contributors" to "vetting contributions", which are very much not the same thing.

We can't solve the problem of not having enough suitably-skilled (and now, -trusted) people to handle the expected workload by requiring *more* work from the existing people.

And if you bring more people onboard to spread the workload, how long do you carefully vet them before they are treated as "trusted"? Apparently 2.5 years is no longer enough.

A few relevant quotes

Posted Mar 31, 2024 21:16 UTC (Sun) by rra (subscriber, #99804) [Link] (23 responses)

> That's moving the goalposts from "vetting contributors" to "vetting contributions", which are very much not the same thing.

They're not the same thing, but they're certainly closely related.

Are you going to be able to detect someone who carefully behaves entirely aboveboard for years before springing their trap? No. But neither will background checks, citizenship checks, and all of the other institutional machinery of governments reliably detect them, and those have a much higher social cost and exclude all sorts of people who are entirely trustworthy, which I think was part of your original point.

This specific attacker was not careful enough to survive careful vetting of contributions. This is a problem that we could have potentially solved, and it's worth imagining a world in which it would be possible to do that. Would they have altered strategies if that vetting was in place? Yes, probably, but this sort of thing is hard and you're creating more opportunities for them to trip up.

> Apparently 2.5 years is no longer enough.

This contributor did several detectable things way earlier than 2.5 years in. There's the pressure from sock puppets, the mysterious test files, commit messages for commits that do not do what the commit message claims they do, etc. Would I have caught those things? Probably not, but I might in the future.

Sure, I too can imagine an attacker that didn't do any of those things, but I really dislike both throwing up our hands and saying this is impossible or going down a path that inevitably leads to saying only people with certain passports can ever be trusted. There are concrete things that we can do that still involve judging people by their actions. Are they perfect? No, nothing in security will ever be perfect. Are they sustainable ways to make an attacker's life harder? I think so, maybe.

A few relevant quotes

Posted Mar 31, 2024 21:47 UTC (Sun) by mb (subscriber, #50428) [Link]

>Sure, I too can imagine an attacker that didn't do any of those things, but I really dislike both throwing up our hands and saying this is impossible

Yes, I agree. This certainly taught me lessons about how to do reviews.

A few relevant quotes

Posted Mar 31, 2024 22:37 UTC (Sun) by pizza (subscriber, #46) [Link] (21 responses)

> This contributor did several detectable things way earlier than 2.5 years in. There's the pressure from sock puppets, the mysterious test files,

My GitHub presence is nearly nonexistent, and nearly all of it is filing bug tickets or commenting on pull requests that I have an interest in. Does that make me a sock puppet? (Meanwhile, my first public contribution to the Linux kernel came with accusations of using a pseudonym because my legal name "sounds silly".)

...You only know them to be "sock puppets" _after the fact_, because everyone has to start _somewhere_ - and even then, there are entire software development ecosystems beyond the confines of GitHub or any other public repository.

Incidentally, I've also committed "mysterious" test files. They're not mysterious to _me_, but they are to pretty much everyone outside of my project. Does that make my actions questionable? Or is it a matter of "you have to have <this much> domain expertise to meaningfully participate"?

The difference between "legit" and "hostile" is a matter of _intent_, not one of _appearance_, because on the first few layers they appear to be the same thing.

> Sure, I too can imagine an attacker that didn't do any of those things, but I really dislike both throwing up our hands and saying this is impossible

It's not that this is "impossible", it's that you need to be *very clear* about what you're actually trying to protect yourself against or otherwise accomplish, and what the costs (both individual and collective) will be -- and only then can one determine if the cost is "worth it". Because often enough, the rational answer is "nope" -- on both the individual and collective levels.

I'd hate to see the collective F/OSS community lose its collective mind over this and self-immolate in the name of "security".

A few relevant quotes

Posted Apr 1, 2024 11:43 UTC (Mon) by jafd (subscriber, #129642) [Link] (20 responses)

> My GitHub presence is nearly nonexistent, and nearly all of it is filing bug tickets or commenting on pull requests that I have an interest in. Does that make me a sock puppet?

If you get into a project out of the blue and start campaigning to get its maintainer replaced because there haven't been releases made in some arbitrary interval of time or because whatever, along with several other users that haven't been in the project until that moment, then very likely yes, it does.

A few relevant quotes

Posted Apr 1, 2024 12:41 UTC (Mon) by pizza (subscriber, #46) [Link] (19 responses)

> then very likely yes, it does.

The ill intent here was only visible some time (years?) after the fact. At the time, there's no way to tell the difference.

...What you are calling "sock puppetry" here is what others call "drive-by contributions"; i.e., supposedly the entire point of putting a project on the likes of GitHub to begin with.

(There are numerous projects that I've only ever interacted with as a "drive-by": through the course of my employment I found bugs or had issues and was able to get permission [1] to contribute something back. After that particular task (or employer), I never interacted with those projects again.)

[1] Permission I only obtained through dogged insistence; the bureaucracy defaults to "hell no", for $reasons.

A few relevant quotes

Posted Apr 2, 2024 21:53 UTC (Tue) by jafd (subscriber, #129642) [Link] (18 responses)

> ...What you are calling "Sock puppetry" here is what others call "drive-by contributions"; ie supposedly the entire point of putting a project on the likes of github to begin with.

Making demands, whining that, quote, "the community desires more", and insisting that the project's governance needs to be changed is a negative contribution, if anything.

A few relevant quotes

Posted Apr 2, 2024 22:32 UTC (Tue) by pizza (subscriber, #46) [Link] (17 responses)

> Making demands, whining how, quoting, "the community desires more", and that the project's governance needs to be changed is a negative contribution, if anything.

I don't disagree, but the more widely used the project becomes, the more likely that this entitlement represents the (overwhelming) norm.

A few relevant quotes

Posted Apr 3, 2024 11:07 UTC (Wed) by farnz (subscriber, #17727) [Link]

And arguably, that's the core problem; instead of seeing open source contributors (including maintainers) as gift-givers whose generosity is not guaranteed, people see them as obligated[1] to provide software that meets everyone's needs. If we can't get that entitlement under control, we will continue to have problems where malicious people can "take over" a project and backdoor it, simply because they promise to deliver what the users are demanding.

This problem also applies in the proprietary world; paying for binaries is not enough to ensure that the entity you're paying is not malicious. And it's within the budget of a nation-state attacker to get a few people into key points within a big company so that they can backdoor the binaries you buy; it's also possible for an attacker to outright purchase a small company that produces binaries you use specifically to backdoor them.

[1] I once contributed a small change to X.org's Xserver, targeting my then-employer's use case; I had two competitors to my employer ask me to change what I'd done to suit them. One was polite and went to the upstream mailing lists upon request, and took my advice on alternatives to my patch, and what they could do to implement what they needed. The other one was not so polite, and complained to my employer that I was refusing to spend my working time on a competitor's needs. Needless to say, I got no pushback from my employer on refusing to work on improving a competitor's product in my working time.

A few relevant quotes

Posted Apr 3, 2024 15:21 UTC (Wed) by paulj (subscriber, #341) [Link] (15 responses)

Something is broken with the norms then.

I find it interesting that, just a while ago in the context of what happened to Rust's Actix, I and some others were discussing the toxicity and entitlement present in the expectations and demands a good chunk of users seem to think they can hang on creators and maintainers of free software. And now related issues (i.e., the lack of appreciation by the wider world for free software creators and maintainers, and how that opened up this social engineering channel) turn out to underlie what would have been one of the worst security issues in a long while - if not for the luck of Andres Freund noticing.

As a former maintainer, I'm sorry, but the attitudes of many around Free Software are seriously off. Entitlement is ingrained in many, even some of the most well-meaning. It is a significant cultural issue.

A few relevant quotes

Posted Apr 3, 2024 15:35 UTC (Wed) by rra (subscriber, #99804) [Link] (13 responses)

> As a former maintainer, I'm sorry, but the attitudes of many around Free Software are seriously off.

Yes. This is the part of the compromise that I was the most struck by, and that I keep thinking about. I kind of knew this already, but reading back through the messages and watching this social engineering happen crystallized it for me.

People use free software as if they were the consumers of a product and treat maintainers as if they were companies producing substandard equipment. Not all people, not a majority of people, but a persistent minority. And heaven help you as a maintainer if you decide that you would like to take the software that you give away for free on the Internet in a direction that you like better but that some user views as an unwanted change. The carping and nasty insults and raging entitlement can go on for literally years, despite the fact that any one of those people could fork the code and do whatever they would like with it.

There is a deep rot in our culture, and it's very off-putting.

A few relevant quotes

Posted Apr 3, 2024 15:58 UTC (Wed) by paulj (subscriber, #341) [Link] (12 responses)

If I were to do it all again, one thing I'd change would be to get rid of public forum bugtrackers, email lists, etc. I'd just supply an email address, and/or a web chat, for anyone wishing to provide comments or report issues or provide patches.

The scolds and the trolls thrive on having a public, an audience. It's the (implied) audience that allows one's sense of entitlement to override the fact that you're talking to a human who has given you something, and to scold them and demand things instead. Trolls don't always need an audience, but it definitely encourages them.

Get rid of the audience, make it a clear 1:1 communication instead, and I think the problem would be significantly abated.

So... if I end up maintaining something useful again, that's what I'll try next time.

(And there are large and core Free Software projects that work on that "email a private address" basis for comments/patches).

A few relevant quotes

Posted Apr 3, 2024 16:32 UTC (Wed) by farnz (subscriber, #17727) [Link] (11 responses)

You end up with a double-edged sword. Doing things in private means that it's easy to cut off the trolls, but it also means that you don't get drive-by assistance with someone who's well-meaning but not good at expressing themselves.

I'd be interested to hear how your experiment goes; my experience suggests that it won't go well, because I've had people be as entitled in private mails as I have seen in public - going as far as e-mailing my employer to request that I be fired for refusing to spend my working time changing a patch I sent upstream to suit their needs, not just my employer's needs.

Fortunately for me, they ran a competitor of my then-employer, so when my boss asked me what was going on, I just pointed him at the e-mail domain. He went to their website, and understood why I'd refused to help out - top thing on their site was a hit-piece against our product.

A few relevant quotes

Posted Apr 4, 2024 15:30 UTC (Thu) by paulj (subscriber, #341) [Link] (10 responses)

Perhaps you lose assistance. Though you could still ask others to read a given request, if assistance might be useful.

As for your other point: that gets into corporate and industry politics. That is, IME, another thing beyond "mere" trolling and entitlement. That's a whole other level!

My view here would be that, as far as possible, you try to avoid having multiple corporates be competitors while also trying to "collaborate" on an open-source project. So try to arrange it so that developers of the project are not at competing entities, and that corporate users are not developers of the project. That is, have one entity that can accept resources in some way (be that via donations or support contracts - the latter is probably a lot easier for users to justify) and distribute those resources to the developers in some fair way. Many structures are possible here - unincorporated association, co-op, non-profit corporate, etc. The developers should control it, though. Also, it should not be a charity - a non-profit is fine, but charitable (or equivalent) tax status is not (charity status is hard to get in the UK and Ireland, I think, and heavily regulated - but unfortunately it's relatively easy in the USA via 501(c)(3)).

[To any young Free Software hackers reading: if you're involved in some project that has an entity/trust/association/corporate/foundation managing your donations, and you _don't_ have *full* transparency into _what_ its income is and _how_ it is being distributed, on a *timely basis* (not "2 years after the current fiscal year, whenever ProPublica manage to acquire the /meagre/ IRS filing", but more like weekly), then _get_ that transparency and make sure everything is correct and fair. Do *not* simply take it on trust that the 50+ year old FOSS svengalis who run the board and appoint the executive (if not one of themselves) are working in your interests. They may or may not look after you, but they will /certainly/ look after themselves when it comes to any salaries. If you cannot get that transparency, that is a _major red flag_!! Please heed this warning, from someone who was once unfortunately naive about this.]

If you already have a rat's nest of corporates who each sell the FOSS project's code in some way, are already jostling for position, and are looking for ways to offload their own maintenance costs onto others whenever possible, then you may be stuck with the politics. You could try setting up a trade association, with rules, to take up some tasks and be funded, but even if that helps a bit, you're probably stuck with corporate politics. :( I'd be looking for another job at that point. ;)

That's what happened with the project I was on. It started out as a normal community kind of thing, with individual maintainers. As it got wider adoption and recognition, we eventually started to attract some corporates and "svengali" types. We also started to get more and more patches that were about offloading some corporates pain points onto the (unpaid) maintainers - without any benefit to the code or community.

E.g., patches to add "APIs" into the GPL code - hooks all over the place, then exported over some RPC (custom, JSON, Cap'n Proto, whatever). But we never got code that built on those APIs and actually did something useful; generally we never even got code to exercise and test the exported API. Or we got code to further abstract sets of APIs, e.g. by adding some kind of extra context - again, never accompanied by code to actually make use of it, never even code to exercise it. Sometimes there would be /promises/ of such code in the future, but... ha!

Clearly, these companies had proprietary software/solutions they were selling, which relied on this GPL software. And they wanted to get rid of the continuous maintenance hassle to them of keeping their changes for their hooks synced with upstream. Possibly also the legal risk of their proprietary software relying on GPL software this way, at least for RPC API hooks. If they could upstream their patches, that maintenance burden would be offloaded to upstream - yay! - and they'd also be able to point to the acceptance of the RPC API changes as showing legitimacy for that (for such patches at least). Double yay!

I know others in this thread say the job of a maintainer is to say "No", but when you do that, once you've reached this "infested by shady corporates" stage, what will happen is the shady corporates start playing power politics. And they do not play nice at all.

I could continue... but this is already long.

There are some very shitty people out there.

A few relevant quotes

Posted Apr 4, 2024 16:00 UTC (Thu) by paulj (subscriber, #341) [Link]

Maybe I should write a blog post one day, going through all the nasty politics and shady shit [501(c)(3)s mixing private consulting business with the 501(c)(3) stuff? Blocking patches to try to pressure corps for funding? Etc., etc.] that went on. I don't know. On the one hand, it was eye-opening, and maybe it'd be entertaining or even informative for others; on the other hand, I neither want to waste my time remembering their shit nor make myself appear a bitter crank. I don't know.

A few relevant quotes

Posted Apr 4, 2024 17:52 UTC (Thu) by farnz (subscriber, #17727) [Link] (8 responses)

My other point was a lot simpler; IME, you get the same entitlement issues over private e-mail as you get on public forums. It's just less visible, because it's inherently private unless the target of the request publishes it (because the origin doesn't).

A few relevant quotes

Posted Apr 5, 2024 11:08 UTC (Fri) by paulj (subscriber, #341) [Link] (7 responses)

Maybe.

I think there is a difference between the entitlement of users being ungratefully demanding of maintainers - getting the "who owes whom what" context 180 degrees backwards in thinking the maintainer owes them something - and developers from different competing corporates jostling over contributions to the same project. In the latter, the nastiness is more due to the entities being competitors: if one can leverage the naivety, or the feeling for Free Software principles, of a competitor's developer to extract more free work out of them to their benefit, they'll do it.

I guess it's a sense of entitlement, or at least exploiting a general culture of entitlement in FOSS to try to extract free work from a competitor. But it's a competitive behaviour.

The first case is just private individuals being shitty - perhaps unwittingly, because this culture is so pervasive. The second case is people working for corporates, exploiting that culture (perhaps unconsciously) for competitive gain.

A few relevant quotes

Posted Apr 5, 2024 11:24 UTC (Fri) by farnz (subscriber, #17727) [Link] (6 responses)

I've seen both of the things you describe in my e-mail inbox when contributing to big projects (not maintaining, just contributing), including people attempting to threaten me for implementing something differently to the way they want it done and asserting that I should do it their way because they work for a big corporate, and I should redo my contribution to match the BigCo way.

I don't think you can escape the shittiness of some people; the only choice you have is whether they display that shittiness in public, or whether it's done in private. And the advantage of it being done in public is that you can alert their BigCo's "press relations" team to the behaviour of the shitty people, which is often sufficiently bad that BigCo's PR team will take action to get it under control for fear that it gets picked up as an example of how BigCo behaves.

A few relevant quotes

Posted Apr 5, 2024 11:49 UTC (Fri) by paulj (subscriber, #341) [Link] (5 responses)

I've had the kind of corporate power politics you describe happen to me too: leveraging industry contacts to get my boss's name, and then using corporate "partnerships" at (minimum) director level - if not C*O level [it was certainly at the behest of C*O level on their side] - to get a director of my employer's BU to talk to my boss (a director in my BU), to threaten me.

People are shitty.

The private entitlement stuff - I think private comms would alleviate a lot of that.

The shitty corporate power politics - which become ingrained in certain people who work at certain shitty corps [cough, company from San Francisco with a bridge logo, cough] - you can't fix that. Those are shitty people who enjoy playing power games.

A few relevant quotes

Posted Apr 5, 2024 11:53 UTC (Fri) by paulj (subscriber, #341) [Link]

And to be clear, the shitty corp-power-politics people were not working for said bridge-logo company. But many of them had previously worked for it.

I never realised the significance of that until a colleague of mine, unrelated to my issues with said people, complained to me about how difficult/nasty it can be to work with people from the bridge-logo company, because it is notorious for horrendous, cut-throat internal power politics - which he told me he thought were largely due to its no-mercy stack-ranking system, where you must keep making regular promotion progress or you're out.

A few relevant quotes

Posted Apr 5, 2024 12:04 UTC (Fri) by farnz (subscriber, #17727) [Link] (3 responses)

My direct experience is that people are shittier and more entitled in private comms to someone outside the business than in public, because they know that they can leverage their contacts in their business to dismiss your issue with them in private as "he's forging the e-mail because he wants me to get in trouble - I would never risk bringing the company into disrepute like that".

In public, they say things that are relatively manageable, because they know that if they do the sort of leverage that you've had, they'll have to justify the stuff they said in public. In private, they can be as abusive and shitty as they like, because they can lie their way out of trouble, and have your boss come down on you twice over - once for not doing what the shitty person wants you to, and once for lying to get the shitty person into trouble.

Having all communications be public alleviates the worst of it, because the truly shitty people out there know that they cannot lie their way out of trouble if communications are public - and thus that they'll lose at the power games because their boss will lose at their power games if PR are having to say "we need to be ready to deal with this before the press get hold of it".

A few relevant quotes

Posted Apr 5, 2024 12:40 UTC (Fri) by paulj (subscriber, #341) [Link] (1 responses)

Completely agree. If you're in that "corporate power politics" scenario, the threats will be more overt in private.

I'm just drawing a line between the "private individuals" context and the "competing corporates" context. I'm saying the /former/ likely is fixable with private comms.

I agree it won't fix things in the latter context. I don't know how to fix the latter context; my intention is to avoid being in that scenario again. If I were a maintainer, I would try to discourage other businesses from building their business around something I maintained, because it would ultimately result in pain for me.

The latter context is a complex topic.

The former is a lot more tractable though.

A few relevant quotes

Posted Apr 5, 2024 13:22 UTC (Fri) by farnz (subscriber, #17727) [Link]

IME, private individuals aren't shitty in comms in public where they think they're talking to a person, not a company, and they're as shitty or worse in private if they think they're dealing with a company, not a person.

Basically, my experience tells me that the only problem you fix by moving comms to private instead of public is that of corporate PR flacks asking you to remove corporate attribution of shitty comments, while you create new problems of people expecting to get away with being shitty because they can blame you for everything.

A few relevant quotes

Posted Apr 5, 2024 12:43 UTC (Fri) by paulj (subscriber, #341) [Link]

Oh, I agree with your points that public comms can be useful to damp down/thwart the nastier politics in the "competing corporates" context.

Indeed, very good public comms are probably essential to it. You need to get all the agendas and interests teased out and specified, try to identify the common interests as much as possible, and try to set clear lines for competition. Good comms and negotiation are needed for this.

Entitlement issues with Free Software

Posted Apr 3, 2024 16:08 UTC (Wed) by farnz (subscriber, #17727) [Link]

I think that part of it is that people have lost sight of the intention of the "no warranty"[1] clause from Free Software licences; since similar clauses are rife in the proprietary software world, they don't trigger an "oh, this isn't the normal buyer/vendor relationship" reaction. Instead, because that sort of clause is common in every licence, people assume it's just software boilerplate, and that their relationship to a Free Software provider is exactly the same as their relationship with someone who charges them $10,000/year/seat for licences for a piece of software.

In fact, though, there's an essential difference - Microsoft are existentially threatened if their customers decide en-masse that they're not paying for Windows licences in future, but are instead going to use Debian Linux for free, while Debian does not lose anything if a non-contributing user says they're going to pay for a Windows licence instead of downloading Debian for free.

And once you've got that difference in your head, you realise that the value a user provides to Free Software is in contributions - good quality bug reports, documentation that makes it easier for others to use Free Software, helping out on forums and the like, or even helping with the code itself, and not simply using the software. This means that a threat to not use this software, but to use a fork or something different isn't a big deal to the supplier of Free Software.

[1] Like the following from some BSD licence variants:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

A few relevant quotes

Posted Mar 31, 2024 22:13 UTC (Sun) by calumapplepie (guest, #143655) [Link] (2 responses)

> One of the critical moments in this exploit came when the "test files" were committed.

The whole point of having co-maintainers is that you don't have to inspect every line they commit.

> Let's check the scripts used to generate them into the Git repository.

This would be a good practice; you never know when you'll need more test files. But, again, it adds a significant amount of work, and a file might have been generated in a fundamentally unreproducible way. It could've come from a fuzzer, a user, the maintainer just poking at things, etc.; documenting provenance would be good. But who wants to say "thanks for your free work, now do some more"? Who wants to be told that?

> But the tooling to verify that a release is a correct representation of the Git tree was absent.

That's because, in many cases, release tarballs aren't just the output of "git archive". It's actually somewhat of a problem, and the folks at Debian are talking about mitigations for it. This is hardly the first time that someone getting clever with adding files not found in the VCS has been a problem (self-plug for my great suspender article), but it remains an issue. Tarballs are often generated to be one-stop shops for users, bundling dependencies, etc.: "Download this file, verify it matches what I signed, and be confident that it will just work." It's a philosophy that fades in and out of favor; distributions work to check the work, but it can be quite hard (see the conversation Russ was having).

A few relevant quotes

Posted Apr 1, 2024 7:16 UTC (Mon) by pabs (subscriber, #43278) [Link] (1 responses)

Autotools needs to be changed so that the primary tarball is the output of `git archive`, with additional secondary tarballs - for people who really can't install autotools yet - that are completely optional.

A few relevant quotes

Posted Apr 1, 2024 10:23 UTC (Mon) by ballombe (subscriber, #9523) [Link]

Autotools first needs to stop breaking backward compatibility in minor releases.
Apple needs to stop shipping a 20-year-old version of bison.
Developers should not use cmake features only available in the latest version.
Red Hat should not have removed perl from the default install.
etc. etc.

There are reasons why people ship pregenerated files, even though that makes diffs
ugly...

A few relevant quotes

Posted Apr 1, 2024 7:13 UTC (Mon) by pabs (subscriber, #43278) [Link] (14 responses)

We need to stop using incomprehensible pre-generated test files. On the GNU Poke IRC channel, there was an idea to write a blog post about how to stop storing pre-generated binary files in git and instead generate them from source code written in the GNU Poke language, so they are at least somewhat understandable with a text editor. Hopefully this post gets written so I can add it to this Debian wiki page.

https://wiki.debian.org/AutoGeneratedFiles

A few relevant quotes

Posted Apr 1, 2024 16:03 UTC (Mon) by emk (subscriber, #1128) [Link] (13 responses)

That link doesn't actually mention GNU Poke anywhere?

Anyway, GNU Poke is probably reasonable for data files that are built up in sensible ways. But a key part of my test infrastructure for certain libraries are fuzzer-generated binary blobs. These tend to be maximally "evil" misinterpretations of an input format, generated via guided search of billions of possible inputs. At some point, each of these blobs caused a crash or an overflow. And I keep them around to detect regressions, or to seed future fuzzing runs. There's nothing nice or sensible in these files; they break the core assumptions of a format. And almost none of them have ever been analyzed in depth by a human.
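The regression use of such a corpus can be sketched in a few lines of Python; `parse` and the corpus directory are placeholders, not any real project's API:

```python
import os

def replay_corpus(corpus_dir, parse):
    """Feed every stored blob to `parse`; return the names of blobs that
    still crash (raise an exception), i.e. old bugs that have come back."""
    failures = []
    for name in sorted(os.listdir(corpus_dir)):
        with open(os.path.join(corpus_dir, name), "rb") as f:
            blob = f.read()
        try:
            parse(blob)
        except Exception:
            failures.append(name)
    return failures
```

A CI job would fail the build whenever `replay_corpus` returns a non-empty list; note that nothing in this check needs the blobs to be human-readable.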

Or for another example, if I want to test speech recognition, I'm probably going to want a few seconds of spoken audio somewhere in my test suite.

I'm actually wrestling with this more, recently: A lot of machine learning and AI-based tools really need "test sets" containing a significant amount of real-world data, combined with minimum expected performance metrics. Figuring out how to manage this in the open source world has been tricky at times. These data sets may be too big to distribute as part of a release. Which leaves fun options like "git submodule", plus the issue of building non-proprietary test sets.

A few relevant quotes

Posted Apr 1, 2024 16:39 UTC (Mon) by nix (subscriber, #2304) [Link] (2 responses)

Quite. Also, I'm not sure how poke would have helped -- you can build anything up in it, so it just adds a not-terribly-high wall for the attacker to get over, disguising the actually-an-exploit lzma stream as something else innocent. (It would at least have been obvious that it wasn't *random*, but then in that case the attacker wouldn't have described it as random, but as a corrupted lzma archive perhaps obtained from somewhere else and thus not trivially reproducible, what a pity!)

A few relevant quotes

Posted Apr 2, 2024 4:11 UTC (Tue) by pabs (subscriber, #43278) [Link] (1 responses)

The test files in the xz case were claimed to have been solely generated by a script from a particular random seed; the other folks in the project should have required that script and seed to be public.
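Had the script and seed been published, the claim would have been mechanically checkable by anyone. A toy sketch in Python; the real generator was never made public, so `generate_test_blob` here is purely hypothetical:

```python
import random

def generate_test_blob(seed, size=1024):
    """Deterministically regenerate the claimed test file from its seed."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(size))

def verify_provenance(committed_blob, seed, size=1024):
    """True iff the committed bytes match a fresh regeneration from the seed."""
    return committed_blob == generate_test_blob(seed, size)
```

Any byte smuggled into the committed file would make the regeneration check fail, which is exactly what makes the "just a random seed" claim falsifiable.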

A few relevant quotes

Posted Apr 2, 2024 12:42 UTC (Tue) by pizza (subscriber, #46) [Link]

> the other folks in the project should have required that to be public.

"Other folks" == "one other person" in this case. Which I suspect is the overwhelmingly second-most-common case (ie other than having one person in total)

The whole point of having co-maintainers is so you can spread the work around, not create more for yourself.

A few relevant quotes

Posted Apr 2, 2024 4:09 UTC (Tue) by pabs (subscriber, #43278) [Link] (9 responses)

As I said, the poke blog post hasn't been written yet.

Sure, poke wouldn't be useful for some cases, but for the fuzzer generated blobs you could minimise them down to just the bits that trigger the bug and then convert that to poke. Those test files also don't have to be part of the source code tree. For the file format libraries I maintain, most of the test files aren't redistributable anyway, because they came from some bug reporter in private, or in public but not licensed.

Most of the data inputs for ML/AI are not licensed, just scraped from somewhere, so not redistributable even if they were small enough. There are some exceptions, some datasets are under proprietary commercial licenses, and there are libre datasets but they are rare. This is worth a read:

https://salsa.debian.org/deeplearning-team/ml-policy

A few relevant quotes

Posted Apr 2, 2024 9:27 UTC (Tue) by farnz (subscriber, #17727) [Link] (8 responses)

> Sure, poke wouldn't be useful for some cases, but for the fuzzer generated blobs you could minimise them down to just the bits that trigger the bug and then convert that to poke. Those test files also don't have to be part of the source code tree. For the file format libraries I maintain, most of the test files aren't redistributable anyway, because they came from some bug reporter in private, or in public but not licensed.

The other direction with fuzzer blobs (at least with some fuzzers) is to provide instructions for recreating the fuzzer blob on your own system. This goes well with fuzzers that have a guided mode; I can provide guidance to the fuzzer, and you can run the guided fuzzer against the known buggy version of code to get the same blob yourself, thus allowing you to be confident that the blob came from the fuzzer.

A few relevant quotes

Posted Apr 2, 2024 10:52 UTC (Tue) by excors (subscriber, #95769) [Link] (2 responses)

If a project started requiring all fuzzer blobs to come with instructions on how to install the exact version of the fuzzing software and reproduce the output, and requiring that a trustworthy maintainer review and reproduce every blob before merging the change, I suspect the main consequences would be:

1) Many people would stop contributing fuzzer-based regression tests, because it's too much of a hassle, so the project would be at greater risk of security vulnerabilities from untested regressions.
2) Attackers would simply hide their payload in a logo.png instead, because nobody would think to check that for hidden data. Or they'd hide it in the middle of a 200KB file of Korean translations, which non-Korean-reading reviewers wouldn't notice, or in any other file that's similarly impractical to review.

(In fact, Jia Tan did contribute an xz-logo.png recently, and updates to the translations. I don't see anything suspicious about them, but I couldn't be certain...)

A few relevant quotes

Posted Apr 2, 2024 11:39 UTC (Tue) by ms (subscriber, #41272) [Link]

Yes, exactly. And how could you tell? Steganography allows data to be stored within an image (or any other file - not necessarily even binary) without noticeably altering it. Similarly, I wouldn't be at all surprised if there have been contests to build PNGs or JPEGs that can also be executed and calculate pi, or Fibonacci sequences - it's exactly the sort of thing people hold contests for. This whole idea of banning inscrutable files is a complete non-starter - *everything* is inscrutable if you don't know what you're looking for.

A few relevant quotes

Posted Apr 2, 2024 13:09 UTC (Tue) by farnz (subscriber, #17727) [Link]

The fundamental point I'm aiming for is proof of good faith where reasonable. If you make a claim ("this blob is fuzzer output"), can you provide a reasonable proof that it is actually fuzzer output, and not malicious data?

E.g. if I claim that this is "fuzzer output", and that you can regenerate it by running AFL with this block of guidance data against this revision, you can quickly confirm that I'm not lying, because while it may have taken me 6 CPU-months to find that blob, with the guidance data you can get AFL to reproduce it in seconds. If I say it's my own work, there's nothing I can do to prove good faith.

A few relevant quotes

Posted Apr 2, 2024 11:41 UTC (Tue) by emk (subscriber, #1128) [Link] (4 responses)

You wrote: "The other direction with fuzzer blobs (at least with some fuzzers) is to provide instructions for recreating the fuzzer blob on your own system."

I mean, I certainly couldn't (and wouldn't) recreate the fuzzer blobs on my own system. To find those blobs, I rented some expensive multi-core monster in the cloud, and I ran it for days. We're definitely looking at over $100 of CPU time here. If someone is going to try to reproduce that, I'd prefer they look for new bugs!

But I do try to purge my largeish test fixtures from distributed packages, even though that means my packages don't match git.

(I fear that supply-chain attacks may be intrinsically difficult to protect against, if you assume that some of your adversaries are nation-states playing the long game. Let's not forget all the tales of companies hiring people who do not exist, whose address turned out to be an empty house rented in advance for cash. Or of national intelligence agencies intercepting shipped packages and modifying the hardware prior to delivery. And it's not like the CIA kept out Aldrich Ames, either. So I think our first steps here should be to add multiple layers of mitigations and checks, aimed at greatly reducing the risk of less sophisticated attackers succeeding. I would love it if we began by removing autotools and M4 from all the key parts of the ecosystem. And maybe started keeping databases of identities with commit rights and upload rights for critical base packages - much of which could be automatically figured out from public information.)

A few relevant quotes

Posted Apr 2, 2024 12:17 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

> I mean, I certainly couldn't (and wouldn't) recreate the fuzzer blobs on my own system. To find those blobs, I rented some expensive multi-core monster in the cloud, and I ran it for days. We're definitely looking at over $100 of CPU time here. If someone is going to try to reproduce that, I'd prefer they look for new bugs!

If you're going this direction, you're using a fuzzer where, having found a blob, it's trivial to tell the user how to recreate the blob via the fuzzer; in effect, you're providing the fuzzer guidance input that gets it to find the blob extremely quickly as a proof that the blob was found via the fuzzer, and not by hand, even though finding out what guidance is needed is going to take a long time and cost a lot of money.

And fundamentally, this is what it all comes down to - proving that you're acting in good faith, and not an attacker. An attacker can't provide evidence that they created the blob via a fuzzer; you can provide a guidance string from the fuzzer that has it generate the same blob that took you days to find, but taking seconds to find it because the guidance gets it down the right path immediately.

A few relevant quotes

Posted Apr 2, 2024 16:23 UTC (Tue) by kleptog (subscriber, #1183) [Link] (1 responses)

> If you're going this direction, you're using a fuzzer where, having found a blob, it's trivial to tell the user how to recreate the blob via the fuzzer; in effect, you're providing the fuzzer guidance input that gets it to find the blob extremely quickly as a proof that the blob was found via the fuzzer, and not by hand, even though finding out what guidance is needed is going to take a long time and cost a lot of money.

That's only true if you have a program that actually fails, and there's no guarantee that a committed version fails in the way the test case is testing. I suppose you could add an assert to the code that is specifically defined so it fails during fuzzing at the specific test case. So the test case is really: does the fuzzer find the "flag" with the specified input?

However, if you're using the input to guide the fuzzer, all you will prove is that the input is a possible way to trigger the bug. You can't prove that the test case wasn't modified afterwards to include malicious code in a way that still triggers the bug. To do that you'd need some kind of unicity test: that the test case is the lexicographically earliest test case that could possibly trigger this path. I'm not sure if that's a typical output of fuzzers. And I'm not sure if the added effort is worth it (unless the tools can be improved to the point where it is no extra effort).

A few relevant quotes

Posted Apr 2, 2024 17:26 UTC (Tue) by farnz (subscriber, #17727) [Link]

The blob output by the fuzzer is the test input; you have a separate input for the fuzzer that's your "certificate" that the fuzzer found the test input blob, and that it's not been modified since the fuzzer created it. If you modify it afterwards, then when I attempt to regenerate the test input using the fuzzer and the certificate input, I'll see that your certificate and the test input don't match.

You'd therefore only run the fuzzer if you wanted to see it reproduce the blob - you're after checking that this certificate matches up with the blob, and raising the alarm if there's a mismatch. Mostly, you'd just trust the committed blob, even though you have instructions for reproducing it - but the idea is that it raises the bar for an attacker, since they have to be concerned that at any point, a random passer-by could decide to try to reproduce the blob, discover that the checked-in blob differs from the reproduction instructions, and raise the alarm.
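A toy model of that check, with `run_guided_fuzzer` standing in for replaying real fuzzer guidance (e.g. an AFL queue entry); this is a sketch of the idea, not a real fuzzer API:

```python
import hashlib

def run_guided_fuzzer(guidance):
    """Toy stand-in: derive the blob deterministically from the guidance.
    A real fuzzer would replay its mutation schedule against the target."""
    return hashlib.sha256(guidance).digest() * 4

def certificate_matches(committed_blob, guidance):
    """Raise the alarm when the checked-in blob differs from what the
    guidance (the 'certificate') reproduces."""
    return run_guided_fuzzer(guidance) == committed_blob
```

The verification is cheap precisely because the guidance encodes the expensive search; any post-hoc edit to the committed blob breaks the match.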

A few relevant quotes

Posted Apr 2, 2024 12:36 UTC (Tue) by pizza (subscriber, #46) [Link]

> And maybe started keeping databases of identities with commit rights and upload rights for critical base packages

This is a complete non-starter (for all but the largest/well-funded projects) in many jurisdictions, due to the massive legal/regulatory requirements it triggers.

A few relevant quotes

Posted Apr 1, 2024 15:10 UTC (Mon) by welinder (guest, #4699) [Link] (2 responses)

There are very good reasons why a tool like xz will have essentially random garbage files as test files. A number of them will be files that someone has had problems with in the past.

If I want payload.o in, I could do...

1. Create obfuscated.o from payload.o
2. Take valid.xz and truncate it at, say, 64k, creating truncated.xz
3. Concatenate truncated.xz and obfuscated.o into evil.xz
4. Have someone file a "xz cannot unpack this file" report.
5. Analyze the bug report and say "file is damaged -- it seems to have been overwritten with random garbage halfway in"
6. Commit evil.xz as a test case with reference to the bug

This will not stand out in any way. And there are endless variations of the above -- enough to smuggle a herd of elephants into the repository.
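Steps 2 and 3 above really are that simple; a sketch in Python (the 64k cut point and the file contents are illustrative):

```python
def build_truncated_carrier(valid_xz, payload, cut=64 * 1024):
    """Truncate a valid archive (step 2) and append an arbitrary payload
    (step 3); the result reads as a file overwritten with garbage partway
    through, which is what an honest bug report would look like too."""
    truncated = valid_xz[:cut]
    return truncated + payload
```

Nothing about the output distinguishes it from a genuinely corrupted file, which is the point of the comment: reviewing the bytes cannot catch this.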

Broken binary files are common. They are often created by unknown broken software or even hardware (usb sticks, for example). You can't vet that.

A few relevant quotes

Posted Apr 1, 2024 15:25 UTC (Mon) by farnz (subscriber, #17727) [Link]

And as a significant case, I might well check in evil.xz because I'd found a way to make xz fail in an interesting fashion using the damaged file.

For example, I have a test case in my current job which looks at a block of binary data (in a known format) and verifies that our protocol decoder correctly fails because it runs out of data to decode rather than getting stuck because an inner field claims to need more data than the size of the data block, and we know the data block's size from a header.

A few relevant quotes

Posted Apr 1, 2024 15:39 UTC (Mon) by dezgeg (subscriber, #92243) [Link]

One option would be to not allow any binaries in the main git repo, but have them elsewhere (like git-lfs, or just a separate git repo) that wouldn't be checked out during compilation, only for 'make check' (and not allow make check to make further changes to the output .deb/.rpm etc. artifacts).

I guess the next problem is that some binaries do end up in the final .debs/.rpms (like images for icons) where stuff can still be hidden (and steganography could probably be used on plaintext test cases as well). But maybe that sort of rule would still help, as libraries or pure CLI tools don't need to include images.
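A rule like that could be backed by a small CI check; a sketch that flags files by the NUL-byte test (roughly the heuristic git itself uses to classify a file as binary when diffing) - the function name is mine:

```python
import os

def find_binary_files(root):
    """Return paths under `root` that look binary: a NUL byte somewhere in
    the first 8 KiB, roughly git's own binary-detection heuristic."""
    flagged = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d != ".git"]  # skip git metadata
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                if b"\x00" in f.read(8192):
                    flagged.append(path)
    return sorted(flagged)
```

A pre-merge hook failing on a non-empty result would enforce "binaries live in the test repo, not here" - though, as noted above, it does nothing against data hidden in files that pass as text.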


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds