Is the date there?
Posted Dec 14, 2024 15:18 UTC (Sat) by Wol (subscriber, #4433)
Parent article: Facing the Git commit-ID collision catastrophe
So does the number of characters really matter? That'll give you all the information you need.
Cheers,
Wol
Posted Dec 14, 2024 15:38 UTC (Sat)
by randomguy3 (subscriber, #71063)
[Link] (11 responses)
Posted Dec 14, 2024 23:55 UTC (Sat)
by skissane (subscriber, #38675)
[Link] (10 responses)
Git is not the kind of distributed system which needs a highly accurate clock. If someone is talking about a commit with some ID, and there are two commits with that ID – one from this year and one from a decade ago – most likely they mean the one from this year. So even if my clock is 5 minutes off, so long as I know what year it is, I'm fine.
Since there are a lot more commits from years ago than from this year, odds are high that when a collision finally happens, it will be between a recent commit and one much older, so it will be obvious from the timestamps which one is relevant. Yes, it is possible that an ID collision could occur between two recent commits, but that is significantly less likely because historical commits greatly outnumber recent ones.
Also there are other ways to disambiguate commits – if we are talking about the networking subsystem, and one commit with that ID is a networking commit, and the other is touching some unrelated kernel subsystem, we know which one is likely meant.
Posted Dec 16, 2024 10:22 UTC (Mon)
by taladar (subscriber, #68407)
[Link] (9 responses)
Posted Dec 16, 2024 11:42 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (8 responses)
Automated tools, when presented with an ambiguous truncated hash, should simply fail and require a human to disambiguate. The extra information in context should allow a human to look up the truncated hash, see the (likely single-digit) possibilities, and ask the automated tool to retry with the correct full hash.
And tools generating truncated hashes should verify that the truncated hash is locally unambiguous; for example, if I ask git for a 6 character hash, and it determines that 123abc is both the first 6 characters and ambiguous locally, it should give me a longer hash that isn't ambiguous (e.g. 123abc45, to distinguish from 123abcde as the other local option).
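As a rough illustration of both behaviours (a sketch only, not how git implements it), the hypothetical helpers `shortest_unique_prefix` and `resolve_prefix` below work over a plain in-memory list of full hashes, which is an assumption made purely for the example:

```python
class AmbiguousPrefix(Exception):
    """Raised when a prefix matches more than one known hash."""


def shortest_unique_prefix(full_hash, known_hashes, min_len=6):
    """Return the shortest prefix of full_hash (at least min_len characters)
    that no other hash in known_hashes starts with."""
    others = [h for h in known_hashes if h != full_hash]
    for n in range(min_len, len(full_hash) + 1):
        prefix = full_hash[:n]
        if not any(h.startswith(prefix) for h in others):
            return prefix
    # Only reachable if full_hash is shorter than min_len.
    return full_hash


def resolve_prefix(prefix, known_hashes):
    """Return the single full hash matching prefix, or fail loudly."""
    matches = sorted(h for h in known_hashes if h.startswith(prefix))
    if not matches:
        raise LookupError(f"no commit matches {prefix!r}")
    if len(matches) > 1:
        # Refuse to guess: hand the candidates back to a human.
        raise AmbiguousPrefix(prefix, matches)
    return matches[0]


# Two made-up 40-character hashes sharing the first six characters.
a = "123abc45" + "0" * 32
b = "123abcde" + "0" * 32
print(shortest_unique_prefix(a, [a, b]))  # prints 123abc4
```

Here the generator side lengthens the prefix only as far as needed, while the consumer side never picks a winner among several candidates.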
Posted Dec 16, 2024 11:55 UTC (Mon)
by intelfx (subscriber, #130118)
[Link] (6 responses)
I mean, that's how it (Git's own CLI, aka porcelain) works right now. However, there clearly are people who feel it's not adequate.
Posted Dec 16, 2024 11:59 UTC (Mon)
by farnz (subscriber, #17727)
[Link] (4 responses)
Git's porcelain is not the only tool out there that uses or generates truncated hashes; CI systems consume them, and forges generate them in their Web UIs. Hopefully, they all get things right, but the nature of code is that some of these systems will have bugs.
Posted Dec 17, 2024 18:24 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Posted Dec 17, 2024 18:28 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (2 responses)
I would hope that all of these cases are at least considered by forges and CIs, but once you have a human in the loop, they will find ways to do weird things that you did not expect, and did not allow for - and the best a forge or CI can do is fail when I ask it to interact with an ambiguous commit hash.
Posted Dec 18, 2024 14:31 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Yes, humans always make things messy ;) . AFAIK, all APIs would stop something ambiguous at the API surface rather than after figuring out what work needs to be done. And, AFAIK, most APIs want names of refs or forge entities (like PRs) rather than commit IDs. Not that the latter don't exist, but I don't usually find interest in CI results of a specific commit rather than a ref or current PR state (without naming it by one of those ways).
Posted Dec 18, 2024 14:40 UTC (Wed)
by farnz (subscriber, #17727)
[Link]
Note that a commit ID in git qualifies as a refspec, and that's what most APIs actually ask for - something that git can resolve to a commit. And the trouble kicks in when you accept a refspec - because main, refs/heads/main, aef25be35d23 and aef25be35d23ec768eed08bfcf7ca3cf9685bc28 are all valid refspecs in git, but the truncated hash can be ambiguous with other commits, and only the full hash is unambiguous over time (since the others are mutable).
And IME, if you don't run CI on every single commit, people do ask for CI for a specific commit by hash, rather than mutable ref - when something worked 10 commits ago, but fails now, asking CI to fill in the gaps is useful.
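One way a forge or CI could draw that line is to treat only a full-length hash as a stable identifier. The sketch below is hypothetical: the helper name `is_full_commit_id` is invented, and it assumes a SHA-1 repository where full IDs are exactly 40 lowercase hex characters (a SHA-256 repository would need 64).

```python
import re

# Assumption: SHA-1 object names, i.e. exactly 40 lowercase hex digits.
FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def is_full_commit_id(rev):
    """Return True only for strings shaped like a full SHA-1 commit ID.

    Branch names and truncated hashes return False: they may resolve to
    a commit today, but what they resolve to can change later.
    """
    return FULL_SHA1.fullmatch(rev) is not None


for rev in ("main", "aef25be35d23", "aef25be35d23ec768eed08bfcf7ca3cf9685bc28"):
    print(rev, is_full_commit_id(rev))
```

This only checks the shape of the string; whether the commit exists still has to be confirmed against the repository.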
Posted Dec 16, 2024 19:21 UTC (Mon)
by geert (subscriber, #98403)
[Link]
That is how people end up with patches with Fixes:-tags of 13 characters, even with core.abbrev = 12 in their .gitconfig. Next, a strict maintainer will reject those patches, because scripts/checkpatch.pl complains about a hash that is not exactly 12 characters, despite (some parts of) the documentation stating that _at least 12_ is fine.
So that is what the first patch in the series is supposed to fix, but everybody focuses on the second patch ;-)
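To make the "exactly 12" versus "at least 12" distinction concrete, here is a small sketch of an "at least 12" check. It is not the logic in scripts/checkpatch.pl; the helper name `fixes_tag_ok` and the regular expression are assumptions for illustration, based on the usual `Fixes: <hash> ("<subject>")` layout.

```python
import re

# Assumption: accept 12 to 40 hex characters, then the quoted subject.
FIXES_RE = re.compile(r'^Fixes: ([0-9a-f]{12,40}) \("(.+)"\)$')


def fixes_tag_ok(line):
    """Return True if line looks like a Fixes: tag with a hash of at
    least 12 characters."""
    return FIXES_RE.match(line) is not None


# Made-up hashes of 12, 13 and 11 characters respectively.
print(fixes_tag_ok('Fixes: 0123456789ab ("subsys: some change")'))
print(fixes_tag_ok('Fixes: 0123456789abc ("subsys: some change")'))
print(fixes_tag_ok('Fixes: 0123456789a ("subsys: some change")'))
```

A check written this way accepts the 13-character hash that git emits once 12 characters are locally ambiguous, while still rejecting anything shorter than 12.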
Posted Dec 16, 2024 15:45 UTC (Mon)
by taladar (subscriber, #68407)
[Link]
Any place where that could become a problem, i.e. long-term storage of commit hashes, should not use truncated hashes for that reason in my opinion.
Posted Dec 14, 2024 16:16 UTC (Sat)
by khim (subscriber, #9252)
[Link] (3 responses)
Not if you have hundreds of independent git repos (as happens with kernel development) and commits go into different trees. Not if you have many independent repos. P.S. With one, single, “canonical” repo the whole discussion wouldn't even make any sense since you can just hand over unique IDs sequentially.
Posted Dec 16, 2024 7:44 UTC (Mon)
by smurf (subscriber, #17840)
[Link] (2 responses)
The chances that a new hash collides with another new hash in a different tree that happens to *also* not be in Linus' tree are way too small to worry about; using 12-character hashes, you'd need roughly 100'000 commits per release cycle for that to be even remotely likely. (There are 16¹² possible 12-character prefixes, and the square root of that, 16⁶ or about 16.8 million, is roughly the point where the chance of a collision approaches 50%.)
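For anyone who wants to plug in their own numbers, the usual birthday approximation can be sketched as below. It assumes prefixes behave as uniformly random hex strings; the function name `collision_probability` is made up for this example.

```python
import math


def collision_probability(n_commits, hex_chars):
    """Approximate chance that at least two of n_commits share the same
    hex_chars-long prefix, using p ~= 1 - exp(-n(n-1) / (2N))."""
    space = 16 ** hex_chars  # N: number of possible prefixes
    exponent = n_commits * (n_commits - 1) / (2 * space)
    return -math.expm1(-exponent)  # 1 - exp(-x), accurate for small x


# With 12-character prefixes:
print(collision_probability(16 ** 6, 12))   # about 0.39
print(collision_probability(100_000, 12))   # on the order of 1e-5
```

The approximation counts *any* pair sharing a prefix, so it is an upper bound on collisions that anyone would actually notice.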
Posted Dec 17, 2024 9:57 UTC (Tue)
by khim (subscriber, #9252)
[Link] (1 responses)
Is this 100'000 commits or 100'000 unique hashes? Given the fact that each release cycle there are more than 10'000 commits accepted I would suspect that there are significantly more than 100'000 transient commits that don't go into Linus tree (rejected patches, temporary commits into different git trees, etc). And we are talking about discussions on mailing lists and other such things. These don't belong to a single git tree.
Posted Dec 17, 2024 11:52 UTC (Tue)
by smurf (subscriber, #17840)
[Link]
Commits. If you ask git to show something that might be a commit and/or something else, you get a nice list with the candidates, including the (we hope) single commit in question.
> significantly more than 100'000 transient commits that don't go into Linus tree (rejected patches, temporary commits into different git trees, etc).
These are not going to be referenced from short hashes that *are* in Linus' tree.
> And we are talking about discussions on mailing lists and other such things. These don't belong to a single git tree.
The number of commit references on mailing lists, online discussions, other git trees, et al. is significantly smaller than 100'000, which violates the birthday paradox-ish assumption that *any* two commits that share a prefix are a problem.
In other words, the real-life probability of a collision that actually matters to anybody is a lot less than what the BP tells us it might be.
Posted Dec 15, 2024 20:56 UTC (Sun)
by andy_shev (subscriber, #75870)
[Link]
Automated tools should fail on ambiguity
>
> And tools generating truncated hashes should verify that the truncated hash is locally unambiguous; for example, if I ask git for a 6 character hash, and it determines that 123abc is both the first 6 characters and ambiguous locally, it should give me a longer hash that isn't ambiguous (e.g. 123abc45, to distinguish from 123abcde as the other local option).
You do, however, have to cope with users doing things like highlighting the truncated hash and copying it (rather than using the affordances in the UI to get a full hash), taking a screenshot and attaching that to a mail (where the recipient then has to reconstruct the hash from the picture), and pasting a truncated hash into the CI's "request a build of an arbitrary commit" field.
> If you ask for eg 10 characters, and there's a collision, can't git just display enough extra characters to disambiguate, along with the commit date?
> you'd need roughly 100'000 commits per release cycle for that to be even remotely likely