|
|
Subscribe / Log in / New account

Issue IDs

Issue IDs

Posted Jan 16, 2025 23:57 UTC (Thu) by ewen (subscriber, #4772)
Parent article: The many names of commit 55039832f98c

The irony of trying to use the merkle tree hash of a single “central repo” as the way to identify an issue being fixed… in a distributed version control system :-/

Most other large projects long ago figured out they need (one or more) issue trackers, or some other “tracking ID” to be the link between issues discovered (and the cause of the issues), and their fixes. That gives a stable ID that can be cross referenced, and annotated with more information, without changing the unique ID number.

Here even a (sha256) hash of just the patch diff (not the tree) would be a better choice than an arbitrary merkle tree hash. If the kernel won’t follow modern industry best practice.

Ewen


to post comments

Issue IDs

Posted Jan 17, 2025 8:57 UTC (Fri) by taladar (subscriber, #68407) [Link]

This is a good point, this is essentially an artifact of the email based workflows again.

Taking just the diff

Posted Jan 17, 2025 9:04 UTC (Fri) by epa (subscriber, #39769) [Link] (9 responses)

Maybe git needs the concept of a ‘disembodied commit’ which stores a diff. The SHA of the commit is the SHA of the text diff as you say. But such a commit has no ancestor. If the diff applies cleanly you can ‘merge’ the disembodied commit into your branch, creating a merge commit whose ancestors are the previous state of your branch and the disembodied commit. Unlike cherry-picking, this would keep the same SHA, making it easier to see which branches have received a fix.

Taking just the diff

Posted Jan 17, 2025 11:58 UTC (Fri) by TomH (subscriber, #56149) [Link] (6 responses)

It already has git patch-id which basically does exactly that and computes an ID for a patch. I believe the primary use currently is internal to identify when two commits are essentially the same and help avoid unnecessary conflicts.

Taking just the diff

Posted Jan 17, 2025 13:46 UTC (Fri) by epa (subscriber, #39769) [Link] (5 responses)

So I guess it needs better tooling so you can ask the question "has this patch gone into my branch" just as easily as "is this commit an ancestor of my branch".

Taking just the diff

Posted Jan 17, 2025 13:54 UTC (Fri) by geert (subscriber, #98403) [Link] (4 responses)

The former is much more resource-intensive than the latter.

Taking just the diff

Posted Jan 17, 2025 18:16 UTC (Fri) by k3ninho (subscriber, #50375) [Link] (2 responses)

Am I missing something? "Has this patch gone into my branch?" is "Are the changes in this patch in place?"

The diff has end-result line numbers and expected text, you can test "does this line match this patch?" pretty cheaply. If the line number doesn't match up, the unified diff gives you three lines to find so you can check after and before the three lines following. Maybe the lines get split up, but that's a case that a tool can't work out intent.

K3n.

Taking just the diff

Posted Jan 17, 2025 18:47 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

How do you deal with reverts or subsequent edits to the line?

Taking just the diff

Posted Jan 18, 2025 8:41 UTC (Sat) by epa (subscriber, #39769) [Link]

I was thinking that if you applied a given patch, but then made later changes that reverted it or rearranged or even deleted the code, it would still count as having “gone into” that branch. This is similar to checking whether a commit is an ancestor of your branch — it either is or isn’t, for all that later changes might have modified the effect of that commit.

Taking just the diff

Posted Jan 19, 2025 1:59 UTC (Sun) by Heretic_Blacksheep (guest, #169992) [Link]

I'd rather have a family tree that takes a little time to (automatically) trace across all possible branches than no family tree and no easy way to track changes at all. Computers are supposed to be there to do the work for humans with useful and accurate results, not increase manual cognitive load and time needed to sort through disparate not-quite-linked identifiers.

A single unique ancestral id assigned strictly to committed content in which all children and cousin ids including changes to the ancestor are hard linked is preferable to systems where there can be confusion generated as to when, if, and why content may or may not have been integrated into the target work. Effectively what's needed in the kernel is a system in which all changes have such a unique content ID that can be referenced regardless of the sub project. The current system really doesn't appear to work in all cases at the scale the kernel is working at. And the DRM groups need to stop thinking they're independent of "downstream" kernel, when that's obviously a fiction. They're as dependent on kernel features just as the kernel is dependent on their features.

Taking just the diff

Posted Jan 18, 2025 11:00 UTC (Sat) by em (subscriber, #91304) [Link] (1 responses)

It would still not work. When cherry-picking some adjustments are often needed, changing the hash of the changes.

Taking just the diff

Posted Jan 18, 2025 21:17 UTC (Sat) by epa (subscriber, #39769) [Link]

But that’s exactly what I am talking about. When you merge a branch and resolve conflicts, you can make any changes you want as part of the merge. You could even end up not applying any of the change ostensibly being merged in (perhaps because the affected feature and its source code no longer exist in your branch). But still git will show that you performed a merge and the other branch is now an ancestor of yours. And this is by design.

(Another way to see whether a branch has been merged would be to look at the code and check the changes in the other branch are present in yours. This is what people sometimes had to resort to in older version control systems without change tracking, or which were not distributed, or indeed if they had no VCS at all. Both approaches have their advantages, but often we prefer the cleaner one based on explicitly tracking commit history.)

Now moving from whole branches to individual changes. You could try to work out whether a cherry-pick has been applied by looking at the current state of the code, or going back over the branch history to see if a similar-looking diff has been applied in the past. But as you say that falls down if there were conflicts or for other reasons the patch wasn’t applied exactly.

Instead, I suggest a metadata-based approach where you can create a “disembodied commit” which has a patch to apply but no ancestor commits. Merging this commit into your branch is usually simpler than merging a whole other branch, which may have unrelated changes (unless your programmers are very disciplined about always creating “daggy fixes” where the commit fixing a bug is an immediate child of the one that introduced it). Indeed merging this disembodied commit is the same as cherry-picking it, as far as the code goes. But unlike cherry-picking, the disembodied commit keeps the same SHA, and is now an ancestor of your branch. That would make it easy to ask “has this fix been applied?” without having to match the file contents, and even when the fix is only given as a text diff against some slightly different version of the codebase.

The SHA of the disembodied commit could be the hash of the original set of changes (or textual diff) but that same SHA would still be used even if you had to resolve conflicts on merging. Again, just as happens for regular branches and regular merges.

Issue IDs

Posted Jan 17, 2025 10:59 UTC (Fri) by sima (subscriber, #160698) [Link] (4 responses)

Yup, the cherry-picked from annotations we dump into our cherry-picks in drm are essentially recreating such a stable ID out of thin air. It's not perfect, but it's definitely better than trying to guess by looking at patch title and diff, since as you cherry-pick fixes around especially the diff is rarely an exact match.

It still sucks for the teams that have large internal trees (like amd's display code is shared with windows and firmware), so we still have some discontinuity in tracking changes and bugs that's not reflected in the upstream git log accurately, so ideally the kernel would need to accept an opaque original commit identifier.

I don't think a bug tracker issue would necessary work, because if your initial bugfix is broken most teams just reopen the original issue. Whereas for CVE tracking upstream wants separate IDs, if that broken fix has shown up in any release tag already. So for that purpose tracking commits instead of issue IDs has some benefits too.

Issue IDs

Posted Jan 17, 2025 20:36 UTC (Fri) by ewen (subscriber, #4772) [Link] (3 responses)

Certainly if one were using an issue tracker to get “stable ID for fix commits”, that needed to be unique, there’d need to be some conventions around how to use the bug tracker to derive those stable IDs. Either “you must open a new issue, to get a new stable ID, and have it link back to the closed issue”. Or some kind of “fix sequence number” (eg ISSUEID-SEQ, with SEQ starting at 0 and increased for each attempt at “commits fixing problem”).

But I do agree with your general point that the “stable” kernel accepting an “opaque ID that can be searched for” is the key change required. Rather than a “merkle tree commit hash in exactly their upstream tree” (Linus release), which cannot be known in advance when a fix is made before (the thing being fixed) is “merged to the central repo”.

Ewen

Issue IDs

Posted Jan 17, 2025 22:50 UTC (Fri) by NYKevin (subscriber, #129325) [Link]

I think the average large org has a much more straightforward posture here: They have a single central repository (which might or might not be a DVCS like Git), and whatever identifier that centralized repo assigns is the identifier you use for everything.

Edge cases:

* If a commit has yet to make it into the central repo, then it has no ID, or perhaps only an unstable ID that might change or become invalid in the future. This does not matter, because these commits are considered "unfinished" and should not be used for any serious purpose (other than asking people to review them so that they can become final).
* Once a commit is in the central repo, history rewrites are forbidden (allowing for a few exceptional cases when some sensitive item gets improperly committed). The need to support that exceptional use case is one of the stronger arguments for using a non-DVCS system (which can simply assign numeric commit IDs by fiat rather than having to do this whole Merkle tree business, so you can rewrite history without changing subsequent IDs).
* When something is cherry picked, there is some notion of a "primary" or "original" commit, which will at the very least be mentioned in the commit message of the cherry pick. Good tooling can use this to resolve the original commit when given a cherry pick ID (not every org has good tooling).
* Linear history is usually enforced. When a commit becomes final, it is rebased on top of the intended branch. For some systems like Perforce, this is usually a trivial operation (Perforce's data model is that each individual file is a series of snapshots with associated numeric commit IDs, so all history is inherently linear, and "rebasing" just means "check to make sure that nobody else has edited any of the same files as us").

Obviously this would not work very well for Linux's use case, but it is the sort of thing that the MITRE people were probably expecting when they set some of these rules around stable IDs.

Issue IDs

Posted Jan 17, 2025 23:16 UTC (Fri) by kleptog (subscriber, #1183) [Link] (1 responses)

The patch tracking id doesn't need to be anything complicated. You could for example just generate a random 32 character hex string and store it in the commit message. Then it would automatically survive cherry-picks and everything. You can make a git hook that automatically generates such an id when the commit is created.

Of course, such a string without context is not very useful so we should prefix it with something, perhaps the capital letter I. That would make it easily recognisable and searchable.

Finally, it should not be bare, but it could be added to the commit message under Change-ID or some such.

Oops, now it looks like Gerrit changelog ID and we couldn't possibly do that...

All this talk of diff algorithms and automatic patch matching, the amount of cycles and discussion going in this, all to solve a problem that is trivially solved with a git hook. I don't understand it at all.

We can replace the I with a K and call it Kernel-Change-Id, maybe then people would accept it?

Issue IDs

Posted Jan 24, 2025 22:57 UTC (Fri) by marcH (subscriber, #57642) [Link]

> Oops, now it looks like Gerrit changelog ID and we couldn't possibly do that...

Of course not, because such a dead simple solution comes from lesser people who do not recognize the superiority of an email-based workflow. What were you thinking?

Also, Gerrit genuinely sucks in many ways, so absolutely nothing good can come from it and it must be entirely dismissed.

> All this talk of diff algorithms and automatic patch matching, the amount of cycles and discussion going in this, all to solve a problem that is trivially solved with a git hook. I don't understand it at all.

No one can understand it because it's not rational.

For more bad faith, don't forget to watch the seminal talk "Patches carved in stone tablets". It's a bit old (2016) but already well into the social media age where the universe is split between the "good guys" who do everything right and the "bad guys" who do everything wrong. A complex world made simple at last.

To be fairer and less bitter: DRM seems to _already_ use such a random ID! This random ID just happens to coincide with the SHA of a mutable git commit, with a "cherry-picked from:" prefix pretending that the ID is the SHA of an immutable git commit.

But this ID disguise and attempt to appease the stone tablets gods is backfiring; they have not been fooled...

Issue IDs

Posted Jan 17, 2025 17:01 UTC (Fri) by twmoure (subscriber, #58924) [Link]

The author metadata attached to the commit almost works the way a unique id should - it's preserved by cherry-picks, rebases, etc., and there's already existing flags for 'git commit' etc to say "reset author information" in the case where this default behavior isn't desired (where a commit is being split up, or rewritten to such a large extent that it's no longer the same commit, etc.). The only problem is it isn't actually a unique id, only a name/email/timestamp. But it seems like adding an actual unique id alongside those things that's managed in exactly the same way would be reasonably straightforward and consistent with git's overall design..

Issue IDs

Posted Jan 17, 2025 18:49 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (1 responses)

> Here even a (sha256) hash of just the patch diff (not the tree) would be a better choice than an arbitrary merkle tree hash. If the kernel won’t follow modern industry best practice.

Hmm. I'm not so sure that would work either. There are different diff algorithms in git itself and one might have higher context settings for various reason (e.g., I believe Phabricator wanted "infinite" context for uploaded patches).

Issue IDs

Posted Jan 17, 2025 20:47 UTC (Fri) by ewen (subscriber, #4772) [Link]

If one were generating a diff for the purpose of computing a stable “change ID hash” then obviously the exact diff algorithm, formatting, and diff parameters would need to be fixed, along with the hash algorithm. To ensure it was reproducible over time and different runtime environments.

TomH pointed out elsewhere in the thread that there is an internal “git patch-id” already, in git, that might form the basis of a “hash of just the patch diff”, if the “stable” kernel could be persuaded to accept that as the ID. It’d probably require extra tooling / metadata support to store enough information to enable efficient lookup of commits by the stable “change ID hash” (eg each git tree, or user, storing a cache of “change ID hash” to “merkle tree hash” mapping for direct lookup).

Ewen

Issue IDs

Posted Jan 17, 2025 19:42 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

Could the `Message-Id` of the originating patch be used somehow? For non-email workflows, a link to an MR/PR may suffice as well.

Issue IDs

Posted Jan 22, 2025 21:34 UTC (Wed) by siddh (subscriber, #169663) [Link]

> Here even a (sha256) hash of just the patch diff (not the tree) would be a better choice than an arbitrary merkle tree hash.

It already exists, it's called git-patch-id.

The following commits have the same patch ID:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...


Copyright © 2025, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds