Adding package information to ELF objects
While it is often relatively straightforward to determine what package provided a binary that is misbehaving—crashing for instance—on Fedora and other Linux distributions, there are situations where it may be harder to do so. A feature recently proposed for Fedora 36—currently scheduled for the end of April 2022—would embed information into the binaries themselves to show where they came from. It is part of a multi-distribution effort to standardize how this information is stored in the binaries (and the libraries they use) to assist crash-reporting and other tools.
On October 25, Fedora program manager Ben Cotton posted the proposal to the Fedora devel mailing list; it is also available on the wiki. The basic idea is that each ELF object that gets created for an RPM package will get a .note.package ELF section added to it. That section will contain a JSON-formatted description of exactly which RPM it was distributed with. So those binaries will contain information that can tie them directly to the package, even in the absence of RPM metadata on the system.
The facility would be used by the systemd-coredump utility to log package versions when crashes occur. For regular Fedora systems, which normally have the RPM metadata available, there is no large advantage. But for other situations where Fedora-created binaries might be run—and crash—this mechanism would allow administrators and tools to recognize where exactly the binary came from.
The feature was originally proposed back in April for Fedora 35, but was rejected by the Fedora Engineering Steering Committee (FESCo) with an explicit invitation to resubmit it with "a more detailed and understandable 'Benefit to Fedora' section". So the two feature owners, Zbigniew Jędrzejewski-Szmek and Lennart Poettering, added more information to the proposal for the resubmission. The new "Benefit to Fedora" section details why it makes sense for the distribution:
A simple and reliable way to gather information about package versions of programs is added. It enhances, instead of replacing, the existing mechanisms. It is particularly useful when reporting crash dumps, but can also be used for image introspection and [forensics], license checks and version scans on containers, etc.
If we adopt this in Fedora, Fedora leads the way on implementing the standard. Fedora binaries used in any context can be easily recognized. Fedora binaries provide a better basis to build things.
If other distros adopt this, we can introspect and report on those binaries easily within the Fedora context. For example, when somebody is using a container with some programs that originate in the Debian ecosystem, we would be able to identify those programs without tools like `apt` or `dpkg-query`. Core dump [analysis] executed in the Fedora host can easily provide useful information about programs from foreign builds.
When a program crashes, there is already an identifier that can be used: the build ID that is stored in the .note.gnu.build-id ELF section. But that ID is a long hexadecimal string that is not terribly useful to a human. In addition, the ID can only be related back to the RPM it came from by using the RPM database installed on the system or by doing some sort of network query. An example in the proposal shows how the human-readable JSON in .note.package might look instead:
{
    "type": "rpm",
    "name": "hello",
    "version": "0-1.fc35.x86_64",
    "osCpe": "cpe:/o:fedoraproject:fedora:33"
}
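As a rough illustration of how a tool might read such a note, the sketch below decodes the raw bytes of a single ELF note (a 12-byte header holding the name size, descriptor size, and type, followed by the name and descriptor, each padded to four bytes) and parses the descriptor as JSON. The owner string `FDO` and the type value `0xcafe1a7e` are assumptions drawn from the published package-metadata specification rather than from this article; `build_note` and `parse_package_note` are hypothetical helper names, and little-endian encoding is assumed throughout.

```python
import json
import struct

# Assumed values from the package-metadata specification (not stated in this article).
NOTE_OWNER = b"FDO\0"
NOTE_TYPE = 0xCAFE1A7E


def _pad4(n: int) -> int:
    """Round n up to the next multiple of 4, as ELF notes require."""
    return (n + 3) & ~3


def build_note(payload: dict) -> bytes:
    """Hypothetical helper: encode a payload as one little-endian ELF note."""
    desc = json.dumps(payload).encode("utf-8") + b"\0"
    header = struct.pack("<III", len(NOTE_OWNER), len(desc), NOTE_TYPE)
    name = NOTE_OWNER.ljust(_pad4(len(NOTE_OWNER)), b"\0")
    return header + name + desc.ljust(_pad4(len(desc)), b"\0")


def parse_package_note(blob: bytes) -> dict:
    """Decode one little-endian ELF note and return its JSON descriptor."""
    namesz, descsz, ntype = struct.unpack_from("<III", blob, 0)
    name_start = 12
    desc_start = name_start + _pad4(namesz)
    name = blob[name_start:name_start + namesz]
    if name != NOTE_OWNER or ntype != NOTE_TYPE:
        raise ValueError("not a package metadata note")
    desc = blob[desc_start:desc_start + descsz]
    # The descriptor may carry trailing NUL padding; strip it before parsing.
    return json.loads(desc.rstrip(b"\0").decode("utf-8"))


if __name__ == "__main__":
    note = build_note({"type": "rpm", "name": "hello",
                       "version": "0-1.fc35.x86_64",
                       "osCpe": "cpe:/o:fedoraproject:fedora:33"})
    info = parse_package_note(note)
    print(info["name"], info["version"])
```

A real tool would first have to locate the section inside the ELF file (or the core dump) before handing its bytes to a parser like this one; that step is left out here.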
The proposal notes that the "directly motivating use case is display of core dumps". The build ID could be used if the RPM database is present, but, even then, there can be problems. It is not uncommon for the package containing a crashing program to have been upgraded behind the scenes, so the installed binary is different than the running one; other mishaps are also possible, so that correspondence cannot be assured. Also, for crashes that happen in environments without the database, such as an initrd or a sandboxed container, the JSON note can be used to extract the exact version of each component involved in the crash.
For users who build their own packages, once again the human-readable information will be more useful than the build ID, which would need to be maintained in some kind of database to map the ID to the source version. In addition, binaries are sometimes pulled from Fedora for use in other distributions—and vice versa. Being able to easily find out where a binary came from will be useful in those cases too:
Whilst most distributions provide some mechanism to figure out the source build information, those mechanisms vary by distribution and may not be easy to access from a "foreign" system. Such mixing is expected with containers, flatpaks, snaps, Python binary wheels, anaconda packages, and quite often when somebody compiles a binary and puts it up on the web for other people to download.
David Cantrell had a few questions about the proposal. He wondered why Fedora should care about mixing-and-matching its binaries on other systems; if there is no way to reproduce the problem on a vanilla Fedora system, bug reports are not likely to be entirely useful. Poettering said that having the information will be useful because it will help show that the problem happened in a mixed system. That will give Fedora the opportunity to either try to reproduce the bug, perhaps in conjunction with the other distribution, or at least allow it to be "more efficient with 'not caring' for non-fedora issues".
Cantrell also asked whether the "NEVRA" (name-epoch-version-release-architecture) package information is sufficient, because it may not be unique, and wondered if the build ID plus debuginfod servers would be enough. Debian developer Luca Boccassi noted that access to the network is not a given, nor is it desirable from the sandbox that systemd-coredump runs in. Adding the URL for the debuginfod information to the package note is a possibility as well.
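To make the NEVRA acronym concrete, here is a minimal sketch that splits a string of the form `name-[epoch:]version-release.arch` into its five fields. The `Nevra` class and `parse_nevra` function are hypothetical names used only for illustration; the sketch assumes well-formed input and treats a missing epoch as 0, and it should not be read as the way RPM itself parses package names.

```python
from typing import NamedTuple


class Nevra(NamedTuple):
    name: str
    epoch: int
    version: str
    release: str
    arch: str


def parse_nevra(text: str) -> Nevra:
    """Split 'name-[epoch:]version-release.arch' from the right-hand side."""
    rest, arch = text.rsplit(".", 1)        # architecture follows the last dot
    rest, release = rest.rsplit("-", 1)     # release follows the last hyphen
    name, evr = rest.rsplit("-", 1)         # names may themselves contain hyphens
    epoch, sep, version = evr.partition(":")
    if not sep:                             # no epoch given: treat it as 0
        epoch, version = "0", evr
    return Nevra(name, int(epoch), version, release, arch)


if __name__ == "__main__":
    print(parse_nevra("hello-0-1.fc35.x86_64"))
    print(parse_nevra("python3-libs-1:3.10.0-1.fc35.x86_64"))
```

Splitting from the right is what lets package names contain hyphens; the uniqueness question raised in the thread is about whether two different builds could ever share all five fields.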
There is also a privacy issue to consider, Jędrzejewski-Szmek said: "querying debuginfo servers may expose information (about what is running, in what versions, what is crashing, etc.) Thus such queries need to be opt-in and under user control." He also pointed out that the Fedora Koji build system ensures that the NEVRA information is unique for the packages it creates.
Kevin Kofler had a number of objections to the feature, however. In effect, he was objecting to the whole use case of running Fedora binaries in environments where the RPM database was not present. He also claimed that the licenses for the code were being violated when pulling binaries out of RPMs without providing the source code. But the licensing question is largely irrelevant, Jędrzejewski-Szmek said; in some cases there may be a license violation, but in lots of others there is not. The package information in the binary will actually help figure out when there is a problem of that sort, he continued.
Not having an RPM database results in a non-functional Fedora installation, Kofler said: "how can this not be broken?" But lacking an RPM database is rather common, especially in the container world, Daniel P. Berrangé said. It would sometimes be useful to have that database available, but it is not a high priority in container-land, which does not mean the use case is broken: "It is simply a different approach / attitude / tradeoff towards using & maintaining the software stack."
"Bloating" all of the ELF objects in Fedora to support the use case is not reasonable, though, Kofler said. The proposal notes that the overhead is around 200 bytes per ELF object, which results in an increase of 13MB if every object in the distribution had the package information added to it. Since Kofler does not seem to accept that the use case is valid, any increase to support it is unnecessary in his eyes. But Boccassi pointed out that the cost is tiny to support a use case that is prevalent:
[...] it has happened, it is happening and it will keep happening, because for others it is perfectly logical and highly desirable. So one can either stay here and complain all day long that containers are bad and they are all doing them wrong, and if they only listened to reason everything would be just perfect, or one can do something to significantly improve the baseline for everybody at a cost so ridiculously negligible that if the same standard were applied to compiler updates or changing build flags or whatnot nothing would ever, ever change.
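The two overhead figures cited from the proposal can be related with a little arithmetic. The sketch below assumes that 13MB means 13,000,000 bytes and that every object carries exactly 200 bytes of note; both are simplifications, and the resulting object count is only an implied estimate, not a number taken from the proposal.

```python
BYTES_PER_NOTE = 200            # approximate per-object overhead cited in the proposal
TOTAL_OVERHEAD = 13 * 10**6     # 13MB, assumed here to mean decimal megabytes


def implied_object_count(total_bytes: int, per_object: int) -> int:
    """Number of ELF objects implied by a total and a per-object overhead."""
    return total_bytes // per_object


if __name__ == "__main__":
    count = implied_object_count(TOTAL_OVERHEAD, BYTES_PER_NOTE)
    print(f"about {count:,} ELF objects")
```

Under those assumptions the figures correspond to roughly 65,000 ELF objects across the distribution.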
Furthermore, Poettering noted that even vanilla Fedora systems have a piece without an RPM database:
You too run a system with no RPM database – all the time, and that thing still calls itself Fedora: a dracut initrd is exactly that: built from RPMs but without any RPM db.
Thing is, there are different ways to update stuff. rpm/dnf is one thing, dracut image rebuilds is another, containers are typically updated very differently too. rpm is a useful tool (and by embedding rpm meta info into the ELF objects it becomes even stronger), but your assumption that rpm/dnf based updates is the only right way to upgrade stuff is simply neither reality nor even desirable.
A possible security issue with the proposal was also raised by Cantrell. He wondered if there might be problems with using JSON, which has been the source of some security problems in the past. "Of concern to me are encoding formats, size limits or reporting, and structure formats." Poettering said that JSON was chosen because of the "battle-tested parsers" that are already used in systemd and elsewhere. Jędrzejewski-Szmek added that "the implementation in systemd is undergoing continuous fuzzing in oss-fuzz", so there is reason to hope that many of the parser bugs have been found and fixed.
Overall, the reception was largely favorable, though there are some concerns. It is a change with fairly minor effects on binaries (200 bytes hardly seems onerous) that can help in a number of scenarios. It also works toward a cross-distribution effort. Microsoft's CBL-Mariner container distribution has added support for the feature and it will be proposed for Debian; others may well follow suit. It will be up to FESCo, of course, but the objections and concerns do not seem to offset the benefits that it will bring.
Posted Nov 2, 2021 23:47 UTC (Tue)
by Deleted user 129183 (guest, #129183)
[Link] (5 responses)
Why serialize metadata to JSON, anyway? If things like ‘.note.gnu.build-id’ are any indication, ELF sections already provide a way to embed structured, hierarchical metadata, so perhaps a better option would be multiple sections, like ‘.note.package.type’, ‘.note.package.name’, ‘.note.package.version’, etc.
Posted Nov 3, 2021 0:49 UTC (Wed)
by IanKelling (subscriber, #89418)
[Link] (3 responses)
> Using a single field rather than a set of separated notes is more space-efficient. With multiple fields the padding and alignment requirements cause unnecessary overhead.
All those unneeded quotes kind of grind my nerves, but I haven't looked closely at any alternatives, it seems fine.
Posted Nov 3, 2021 11:48 UTC (Wed)
by bluca (subscriber, #118303)
[Link]
Posted Nov 3, 2021 12:08 UTC (Wed)
by eru (subscriber, #2753)
[Link]
Posted Nov 3, 2021 15:44 UTC (Wed)
by Deleted user 129183 (guest, #129183)
[Link]
I guess I should have read everything before commenting…
Posted Nov 3, 2021 0:53 UTC (Wed)
by rustylife (subscriber, #102864)
[Link]
Posted Nov 3, 2021 0:47 UTC (Wed)
by rahulsundaram (subscriber, #21946)
[Link] (4 responses)
Who posted this information isn't really all that pertinent to feature proposals. It's going to be the same person channelling all these changes here anyway. So it's unclear to me why LWN chooses to highlight this over the people involved with the change directly.
Posted Nov 3, 2021 2:16 UTC (Wed)
by mattdm (subscriber, #18)
[Link] (3 responses)
Posted Nov 3, 2021 11:29 UTC (Wed)
by zuki (subscriber, #41808)
[Link] (2 responses)
[1] https://lwn.net/Articles/807829/
Posted Nov 3, 2021 13:21 UTC (Wed)
by jake (editor, #205)
[Link] (1 responses)
yes, I guess it was forgotten. The 'Fedora program manager' bit was meant to indicate that he was posting it in that role, but it would seem that is not entirely clear to everyone. In general, we try to attribute posts, so I did not want to leave out who posted it, but I will *try* to keep the 'on behalf of' bit in mind going forward.
jake
Posted Nov 3, 2021 13:46 UTC (Wed)
by rahulsundaram (subscriber, #21946)
[Link]
Posted Nov 3, 2021 0:50 UTC (Wed)
by IanKelling (subscriber, #89418)
[Link]
Posted Nov 3, 2021 3:09 UTC (Wed)
by jhoblitt (subscriber, #77733)
[Link] (2 responses)
Posted Nov 3, 2021 11:05 UTC (Wed)
by zuki (subscriber, #41808)
[Link] (1 responses)
We went with JSON because it is very well known and there are various parsers available; systemd already has one for varlink. JSON has the advantage that if you extract the note, you can convert it to plain text using something like "readelf …|xxd -rp" or "strings", so you don't even need any format-specific tool to read the note.
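A rough sketch of the "no format-specific tool" point: the code below scans arbitrary bytes for runs of printable ASCII, much as strings(1) does, and keeps any run that parses as a JSON object. The function name `find_json_objects` is hypothetical, the sample bytes are fabricated for illustration and are not real readelf output, and the approach assumes the JSON is stored on a single line.

```python
import json
import re

# Runs of at least four printable ASCII characters, roughly what strings(1) reports.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def find_json_objects(blob: bytes) -> list:
    """Return every printable run in blob that parses as a JSON object."""
    found = []
    for match in PRINTABLE_RUN.finditer(blob):
        text = match.group().decode("ascii")
        start = text.find("{")
        if start < 0:
            continue
        try:
            value = json.loads(text[start:])
        except ValueError:
            continue  # printable, but not JSON
        if isinstance(value, dict):
            found.append(value)
    return found


if __name__ == "__main__":
    # Fabricated bytes standing in for an extracted note; not real readelf output.
    sample = b"\x04\x00\x00\x00FDO\x00" + b'{"type":"rpm","name":"hello"}' + b"\x00\x00\x00"
    print(find_json_objects(sample))
```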
Posted Nov 3, 2021 16:00 UTC (Wed)
by hkario (subscriber, #94864)
[Link]
Posted Nov 3, 2021 3:38 UTC (Wed)
by wtarreau (subscriber, #51152)
[Link] (4 responses)
Posted Nov 3, 2021 14:16 UTC (Wed)
by madscientist (subscriber, #16861)
[Link]
In any event, the problem you discuss is already somewhat solved via the build ID feature which most everyone implements these days: core files now contain unique IDs for the binary and all shared libraries they loaded at runtime. While you still have to go find them (using debuginfod or similar) at least you are 100% confident whether you have the right ones or not.
Posted Nov 4, 2021 9:23 UTC (Thu)
by jezz (subscriber, #59547)
[Link] (2 responses)
echo 0xF > /proc/self/coredump_filter
Posted Nov 4, 2021 15:32 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link]
Posted Nov 4, 2021 15:37 UTC (Thu)
by wtarreau (subscriber, #51152)
[Link]
Thanks for your hint!
Posted Nov 3, 2021 5:00 UTC (Wed)
by pabs (subscriber, #43278)
[Link] (5 responses)
The main thing I've learned from the reproducible builds project is that data about a build should not be stored in the build products for that build, but in the metadata about that build. For eg you shouldn't put the build log in a .rpm but in a .log file beside that RPM, and you shouldn't record the build dependency versions used to build an RPM in the RPM, but in a buildinfo file next to the RPM.
Posted Nov 3, 2021 5:03 UTC (Wed)
by pabs (subscriber, #43278)
[Link]
Posted Nov 3, 2021 5:14 UTC (Wed)
by jhoblitt (subscriber, #77733)
[Link]
Posted Nov 3, 2021 7:43 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Posted Nov 3, 2021 11:46 UTC (Wed)
by bluca (subscriber, #118303)
[Link]
> When is a build reproducible?
> Relevant attributes of the build environment would usually include dependencies and their versions, build configuration flags and environment variables as far as they are used by the build system (eg. the locale).
https://reproducible-builds.org/docs/definition/
If the build environment changes, it is not expected to be able to create the same binary, and that's why it's all recorded in the buildinfo, to be able to reproduce it.
Posted Nov 3, 2021 10:41 UTC (Wed)
by bluca (subscriber, #118303)
[Link]
But even if it did, changing the build toolchain will result in changes in the binary, and that's ok, the installed environment is recorded in the buildinfo for that reason. Reproducible builds are not about minimizing changes in arbitrary ways, they are about being able, given the same input (sources plus toolchain/dependencies), to get the same output. Build deps, compiler, etc are very much part of the input.
Posted Nov 3, 2021 5:05 UTC (Wed)
by pabs (subscriber, #43278)
[Link] (2 responses)
Posted Nov 3, 2021 9:51 UTC (Wed)
by danpb (subscriber, #4831)
[Link]
Posted Nov 3, 2021 11:40 UTC (Wed)
by bluca (subscriber, #118303)
[Link]
Posted Nov 3, 2021 5:46 UTC (Wed)
by pabs (subscriber, #43278)
[Link] (2 responses)
I've recently been thinking about Debian's static linking problem; we
Posted Nov 3, 2021 15:26 UTC (Wed)
by bluca (subscriber, #118303)
[Link] (1 responses)
Posted Nov 3, 2021 15:27 UTC (Wed)
by bluca (subscriber, #118303)
[Link]
Posted Nov 3, 2021 16:01 UTC (Wed)
by ballombe (subscriber, #9523)
[Link] (4 responses)
Posted Nov 3, 2021 17:15 UTC (Wed)
by bluca (subscriber, #118303)
[Link] (2 responses)
Posted Nov 3, 2021 22:40 UTC (Wed)
by ballombe (subscriber, #9523)
[Link] (1 responses)
Posted Nov 5, 2021 7:43 UTC (Fri)
by zuki (subscriber, #41808)
[Link]
Well, the information is opt-in. If you don't want it, just don't put it in. It doesn't "defeat the concept" because the information doesn't have to be present in every build in the world for it to be useful. E.g. I care about Fedora builds, and with this I can distinguish them from every other build; in particular, I'll know that any build without the tag is not from Fedora.
Posted Nov 3, 2021 17:15 UTC (Wed)
by jengelh (guest, #33263)
[Link]
Posted Nov 3, 2021 18:15 UTC (Wed)
by developer122 (guest, #152928)
[Link] (2 responses)
Posted Nov 3, 2021 21:46 UTC (Wed)
by rgmoore (✭ supporter ✭, #75)
[Link]
Dealing with this kind of situation is exactly why they want to tag the ELF header with this information. Tracking down what's happening with buggy software is hard enough; figuring it out when people are trying to fix the problem by swapping software around is that much harder. But tagging the software itself with this kind of origin information will at least make it possible to figure out which exact version of the software is running.
Posted Nov 4, 2021 14:15 UTC (Thu)
by Paf (subscriber, #91811)
[Link]
Posted Nov 3, 2021 23:12 UTC (Wed)
by guillemj (subscriber, #49706)
[Link] (5 responses)
In Debian it also cannot properly encode the true binary package version, given that this can be passed too late in the build process when the objects have already been built, but I guess the "source" version would be good enough, even though highly confusing. This also *does* make builds less reproducible, as binaries that would have been identical between package revisions, or even different upstream versions (if the relevant source didn't change) then are guaranteed to change regardless of the above, which would be a great loss for packaging and QA checks.
The main point is that the support for this embedded information mostly makes sense for distributions (in its more general sense) that will make all historical sources and debugging symbols available for later analysis, otherwise it's just metadata for statistics and cataloging purposes at most. At which point if you already have all those sources and debugging symbols, at least in Debian you already have all the Build-Ids in metaindices files, which you can cheaply mirror and query to back reference the origin, which you might need anyway at some point if you want to download any of those. And then if you have random cores coming your way, for which you have no clue whatsoever of their provenance, well…
Posted Nov 4, 2021 7:32 UTC (Thu)
by zuki (subscriber, #41808)
[Link] (4 responses)
You received extensive replies to your two mails… The main point is that this is primarily *not* about debuginfo. If you want to download debuginfo, build-id is your friend. This is about quickly identifying software origin *before* you get to the step of downloading debuginfo.
> The main point is that the support for this embedded information mostly makes sense for distributions (in its more general sense) that will make all historical sources and debugging symbols available for later analysis, otherwise it's just metadata for statistics and cataloging purposes at most.
Well, yes, and such "statistics and cataloging" are useful. Imagine that you are developing some local software and your 15 in-company users report that 0.0.2.3 crashes, but 0.0.2.1 and 0.0.2.4 don't. You don't need any debuginfo to make use of this. Or that you get reports from your CentOS users that libfoo Alma build crashes…
> In Debian it also cannot properly encode the true binary package version, given that this can be passed too late in the build process when the objects have already been built, but I guess the "source" version would be good enough, even though highly confusing.
I don't do Debian packages myself, but I'm pretty sure this can be figured out. All the necessary information is already there, so it's just a question of arranging steps in the build the right way.
> This also *does* make builds less reproducible, as binaries that would have been identical between package revisions, or even different upstream versions (if the relevant source didn't change) then are guaranteed to change regardless of the above, which would be a great loss for packaging and QA checks.
This is the biggest misunderstanding. "reproducible build" means that you get the identical build output for the identical inputs (source + dependencies + tool versions). Proposed metadata is identical for identical package versions, so it is trivially "reproducible" in the sense of reproducible builds. Reproducible builds don't generally mean that binary objects change less between package versions. Did you maybe want to say that binaries will change more between versions? In fact, nowadays all distro builds are tagged with a build-id, and build-ids change between package versions, so binaries from different package versions are already different. In the linked Change proposal I did an investigation for some packages in Fedora. If you really think there are cases in Debian where the results would be materially different, please do the same. I would love to see those; if necessary we can adjust the proposal then.
But even if there were identical binaries in different package versions, this doesn't matter for QA. QA is always done at the level of whole packages (or even package groups), not individual files. The whole point of QA is to check the package in interaction with other packages and the whole stack of dependencies.
Posted Nov 5, 2021 3:37 UTC (Fri)
by pabs (subscriber, #43278)
[Link] (3 responses)
This is an artificially reduced and simplified definition for reproducible builds. The tool version number doesn't matter at all, what matters is behavior of the tool, which is often identical across versions of the tool. Artificially changing the behavior of the tool (by embedding the version number in the output) with every single change in the version of the tool reduces reproducibility. Likewise for dependencies, changing the contents of a library (by inserting a version number) because a typo was fixed in the changelog reduces reproducibility.
> build-ids change between package versions
This is incorrect, build-ids are meant to be deterministic given identical inputs, and package versions are mostly not inputs to package builds, only source code is.
Posted Nov 5, 2021 7:24 UTC (Fri)
by zuki (subscriber, #41808)
[Link] (2 responses)
You're conflating two things: behaviour and labelling of a package. This proposal has no effect whatsoever on behaviour of tools and their output.
Once again: since this proposal produces predictable output, it is reproducible in the sense of https://reproducible-builds.org/docs/definition/ . If you want to create some further definition, please do so, but give it a different name and a clear explanation.
> This is incorrect, build-ids are meant to be deterministic given identical inputs, and package versions are mostly not inputs to package builds, only source code is.
In practice, build-ids change almost always, e.g. see the investigation in https://fedoraproject.org/wiki/Changes/Package_informatio... .
Posted Nov 5, 2021 8:02 UTC (Fri)
by pabs (subscriber, #43278)
[Link] (1 responses)
If I build version 1 of that package on Debian and on three Debian based distributions that do not change any of the build dependencies of foo. In our current world, /usr/bin/foo will be identical for each of those distributions. Under the Fedora plan, /usr/bin/foo will have different hashes again.
If I have foo.c in package foo version 1 and in version 2 I run clang-format over the code, changing the source hashes. In our current world, /usr/bin/foo will likely be identical for both versions. Under the Fedora plan, /usr/bin/foo will have different hashes again.
The history and philosophy of reproducible builds is a lot more nuanced than the summary on the website makes it out to be. There is some more of that sort of thing in this document:
https://salsa.debian.org/reproducible-builds/specs/buildi...
Posted Nov 5, 2021 9:35 UTC (Fri)
by zuki (subscriber, #41808)
[Link]
> If I have a package foo with versions 1, 2, 3, where each version has foo.c with identical contents. In our current world, /usr/bin/foo will be identical for each version of the package.
Yes, this is theoretically possible. But in practice, at least in the practical examples I have looked into, it doesn't hold.
Trivially, many programs include the package version in output (for purposes of identification), e.g. clang, gcc, qemu, the kernel.
More subtly, in Fedora, binaries include a .gnu_debuglink section that includes the package version:
$ readelf -W -p .gnu_debuglink /usr/bin/true
$ readelf -W -p .gnu_debuglink /usr/bin/false
AFAICT, this section is included in the build-id calculation, so the build ids also vary when the package version changes.
There are a lot of moving parts here… In the examples I looked at, binaries are not "repeatable". But I'll say it once again: if somebody has an example where it is true, please show it. Right now people bring up theoretical considerations which are trivially shown to be false in real packages.
> If I build version 1 of that package on Debian and on three Debian based distributions that do not change any of the build dependencies of foo. In our current world, /usr/bin/foo will be identical for each of those distributions. Under the Fedora plan, /usr/bin/foo will have different hashes again.
I don't know the details of how Debian&derivs build packages, but at least in case of Fedora, those binaries would already be different when rebuilt in a derivative distro, as shown above.
> run clang-format over the code, changing the source hashes. In our current world, /usr/bin/foo will likely be identical for both versions.
For the sake of argument, let's say that the build process is such that you really get identical binaries in this case. Does this have any practical value? It would matter only if people reformat their code and release a new package version. People have better things to do. And if I was a maintainer and I saw release notes that say that comments were reformatted and absolutely no other changes were done, I'd just ignore that version. So please, stop with the theoretical examples and show a case that actually has an iota of practical effect.
Posted Nov 4, 2021 18:36 UTC (Thu)
by flussence (guest, #85566)
[Link]
Posted Nov 5, 2021 3:39 UTC (Fri)
by pabs (subscriber, #43278)
[Link] (9 responses)
Posted Nov 5, 2021 7:29 UTC (Fri)
by zuki (subscriber, #41808)
[Link] (8 responses)
See https://fedoraproject.org/wiki/Changes/Package_informatio...
Posted Nov 5, 2021 7:35 UTC (Fri)
by pabs (subscriber, #43278)
[Link] (7 responses)
Posted Nov 5, 2021 8:11 UTC (Fri)
by pabs (subscriber, #43278)
[Link] (6 responses)
As an example, the Debian project produces live image ISOs. Those images combine binaries from many packages in one file. Each of those ISO images has next to it a file containing the list of binary packages and package versions used to build it. This is the right way to go about solving this problem.
https://cdimage.debian.org/debian-cd/current-live/amd64/i...
Posted Nov 5, 2021 9:04 UTC (Fri)
by zuki (subscriber, #41808)
[Link] (1 responses)
As discussed in the proposal, attaching the information to the ELF files makes it visible in the place where it is very useful: crash dumps. I'd say that having a flat text file somewhere is not as useful for this purpose.
Posted Nov 5, 2021 9:11 UTC (Fri)
by pabs (subscriber, #43278)
[Link]
Posted Nov 5, 2021 9:22 UTC (Fri)
by mjg59 (subscriber, #23239)
[Link] (2 responses)
I can think of one real-world (if corner) case that this probably does trip up, though:
If the fact that these are two separate packages were to result in different embedded data, the signature obviously wouldn't apply.
(This isn't a problem at the moment because the Debian shim-signed source package just contains the signed binaries, but it would be nice to have a world where the builds were reproducible enough to avoid that)
Posted Nov 5, 2021 9:38 UTC (Fri)
by zuki (subscriber, #41808)
[Link]
Posted Nov 5, 2021 12:07 UTC (Fri)
by BenHutchings (subscriber, #37955)
[Link]
Posted Nov 6, 2021 19:54 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link]
In principle, you could use that provenance information for crash dumps. That's how we do it at Google, in fact - we can tell the exact version that was checked into source control, and display the exact line where the faulting instruction happened, because we built the container in the first place, and so we know where everything in it came from. This is one of the benefits* of having a monorepo without (much) branching, as we can just point to one CL number instead of, say, fifty, and it's also one of the reasons** that Bazel makes such a big fuss about exhaustively tracking and declaring your entire dependency hierarchy.
The problems only really arise when a crash dump gets separated from the orchestration system's provenance data, or when the orchestration system's provenance data is inadequate (or when you don't have an orchestration system and are just manually building Docker images from random crap, of course, which is an unfortunately common practice in some shops). You might also have the "my tools suck" problem, where you theoretically have all of the information (provenance data) you need, but converting it into a useful form (a Git hash or version number that upstream can recognize and deal with) is too hard.
* There are also drawbacks, which are irrelevant here, but somebody will bring them up if I don't acknowledge that they exist.
Posted Nov 5, 2021 12:49 UTC (Fri)
by smitty_one_each (subscriber, #28989)
[Link]
Try it out in a sandbox and see if this dog hunts. Given the names pushing the idea, I expect a "yes".
But the idea could prove terrible, or lead to still better approaches.
Experiments: we can do them.
Posted Nov 11, 2021 8:48 UTC (Thu)
by jepsis (subscriber, #130218)
[Link] (4 responses)
Posted Nov 12, 2021 9:49 UTC (Fri)
by bluca (subscriber, #118303)
[Link] (3 responses)
Posted Nov 12, 2021 13:23 UTC (Fri)
by Wol (subscriber, #4433)
[Link] (2 responses)
Linux elf is a container format, so just make one of the objects in it responsible for storing the signatures of the other objects ...
Cheers,
Posted Dec 3, 2021 12:51 UTC (Fri)
by nix (subscriber, #2304)
[Link] (1 responses)
This is all fairly painful to do in GNU ld, and you can expect special-case hacks will be required, just as are needed for build-id. More generally, GNU ld (really bfd) has no dependency relationships between its sections at all. Sections are considered lumps of arbitrary data with relocations applied to them, symtabs, or strtabs, and if you want anything else you need special-case hacks. Just having a section dependent on the contents of the ELF symtab and strtab (.ctf) was... memorable to implement, even though you'd think it would be something ld already needed to do (nope!).
Posted Dec 3, 2021 12:51 UTC (Fri)
by nix (subscriber, #2304)
[Link]
It would also be beneficial to store crash time in elf core, very often you only get core.PID file with original attributes reset, because file has been transferred through multiple file-systems and machines, and there is no way to figure out when the crash occurred. This requires additional emails to customer to clarify the matter and get additional information, which is a waste of time.
[2] https://lwn.net/Articles/852416/
> "Ben Cotton, on behalf of the change owners …", but it seems this was forgotten in recent times.
> A build is reproducible if given the same source code, build environment and build instructions, any party can recreate bit-by-bit identical copies of all specified artifacts.
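As a rough illustration of the "bit-by-bit identical" test in that definition, here is a minimal sketch that compares the artifacts of two independent builds by digest. The function name and the dictionary layout (artifact name mapped to file contents) are assumptions made for this example, not part of any Reproducible Builds tooling.

```python
import hashlib


def differing_artifacts(build_a: dict, build_b: dict) -> list:
    """Return the sorted names of artifacts that are not bit-identical.

    Each argument maps an artifact name to its raw bytes (a hypothetical
    layout chosen for this sketch). An artifact missing from one build
    counts as differing.
    """
    names = set(build_a) | set(build_b)
    mismatched = []
    for name in sorted(names):
        a = build_a.get(name)
        b = build_b.get(name)
        if a is None or b is None:
            mismatched.append(name)
            continue
        # Comparing SHA-256 digests is equivalent to comparing the bytes,
        # and digests are what one would normally publish and exchange.
        if hashlib.sha256(a).digest() != hashlib.sha256(b).digest():
            mismatched.append(name)
    return mismatched


def is_reproducible(build_a: dict, build_b: dict) -> bool:
    """A build pair is reproducible here if no artifact differs."""
    return not differing_artifacts(build_a, build_b)
```

Under this sketch, a single embedded timestamp or build path that varies between the two builds is enough to make `is_reproducible` return false.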
have no systematic tracking of static linking (except for Haskell and
OCaml but only for the in-development release) and so we don't
rebuild statically linked binaries after security issues. Fedora
doesn't have this issue because they just rebuild the world often and
presumably Red Hat does the same. Then I thought about tracing builds
but realised that would not reveal the semantics of the build process.
So I thought about modifying toolchains and build systems to output
semantic information (foo.c converts to foo.o, foo.o is combined with
bar.o into foo.a, foo.a is combined with baz.o into foo.so etc). Then I
thought about adding source hashes to binaries and quickly discovered
the Annobin project via a Red Hat blog post. After a quick experiment
with adding whitespace to a .c file I quickly realised adding source
hashes to the binary is going to change the hashes of the binary for
every build, even if the binary wouldn't change after adding whitespace
to the source. Then I realised that the source hashes form part of what
the Reproducible Builds folks record as the "build info"; the source
package and build-dependency details, except those aren't fine-grained
enough for the static linking problem. Then I figured that Annobin
could record source data outside the binary files, perhaps in the build
info files or in files referenced by them.
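The whitespace experiment described above can be mimicked with a toy model. In this sketch `toy_compile` is only a stand-in for a compiler (it simply collapses whitespace), and all function names are hypothetical; it is meant to show the effect, not how Annobin or any real toolchain works.

```python
import hashlib


def source_hash(source: str) -> str:
    """SHA-256 of the source text exactly as written."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def toy_compile(source: str) -> bytes:
    """Stand-in for a compiler: output does not depend on whitespace."""
    return " ".join(source.split()).encode("utf-8")


def toy_compile_with_hash(source: str) -> bytes:
    """Same toy output, but with the source hash embedded in the result."""
    return toy_compile(source) + b"\0" + source_hash(source).encode("ascii")


original = "int main(void) { return 0; }\n"
reformatted = "int main(void)  {\n  return 0;\n}\n"

# Without the embedded hash the two "binaries" are identical ...
same_without_hash = toy_compile(original) == toy_compile(reformatted)
# ... but embedding the source hash makes them differ.
same_with_hash = toy_compile_with_hash(original) == toy_compile_with_hash(reformatted)
```

Keeping the hash in a separate record next to the binary, as the comment suggests for build info files, leaves the binary itself unchanged across such edits.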
strip
There are situations where you do not want to leak information about your build environment
(which is one of the motivations for reproducible builds).
...
[ 0] true-8.32-31.fc35.x86_64.debug
...
[ 0] false-8.32-31.fc35.x86_64.debug
("directly motivating use case", "second motivating use case", "third motivating use case"), and also
https://fedoraproject.org/wiki/Changes/Package_informatio... , https://fedoraproject.org/wiki/Changes/Package_informatio... .
Having a text file with a list of package versions _somewhere_ is one workaround-slash-solution. Attaching this information directly to the ELF file is another workaround-slash-solution. Both approaches have their advantages and can coexist peacefully.
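To make the "attach it to the ELF file" option a bit more concrete, here is a minimal sketch of how a JSON description could be wrapped in the generic ELF note layout (name size, descriptor size, type, then the 4-byte-padded name and descriptor) and read back. The owner string "FDO" and the type value 0xcafe1a7e are what the packaging-metadata specification is understood to use, but treat them as assumptions here, as are the JSON field names and the little-endian byte order. This only builds the note payload; it does not modify a real ELF file.

```python
import json
import struct

# Assumed constants; check the packaging-metadata specification before use.
NOTE_OWNER = b"FDO\0"
NOTE_TYPE = 0xCAFE1A7E


def pad4(data: bytes) -> bytes:
    """Pad with NUL bytes to a multiple of 4, as ELF notes require."""
    return data + b"\0" * (-len(data) % 4)


def build_note(metadata: dict) -> bytes:
    """Serialize metadata as JSON and wrap it in an ELF note record."""
    desc = json.dumps(metadata).encode("utf-8") + b"\0"
    header = struct.pack("<III", len(NOTE_OWNER), len(desc), NOTE_TYPE)
    return header + pad4(NOTE_OWNER) + pad4(desc)


def parse_note(blob: bytes) -> dict:
    """Inverse of build_note: validate the header and decode the JSON."""
    namesz, descsz, note_type = struct.unpack_from("<III", blob, 0)
    name_start = 12  # three 32-bit header fields
    desc_start = name_start + namesz + (-namesz % 4)
    owner = blob[name_start:name_start + namesz]
    if owner != NOTE_OWNER or note_type != NOTE_TYPE:
        raise ValueError("not a package metadata note")
    desc = blob[desc_start:desc_start + descsz]
    return json.loads(desc.rstrip(b"\0").decode("utf-8"))


# Illustrative metadata only; the field names are assumptions.
example = {"type": "rpm", "name": "coreutils", "version": "8.32-31.fc35"}
note = build_note(example)
```

A crash handler that can read the note from a core file or binary gets the package name and version without consulting any package database, which is the case the text-file approach cannot cover once the binary has left its original system.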
1) Have a shim-unsigned package that produces a binary
2) Upload that shim binary to Microsoft and obtain a signed copy
3) Strip that signature from the binary and add it to a shim-signed package
4) Build shim-signed in an identical environment to shim-unsigned, with the last step being to add the signature
** The main reason is "cache invalidation is hard, and rebuilding the entire universe from scratch is slow." But good provenance data is definitely important too.
Wol
