
Facing the Git commit-ID collision catastrophe

By Jonathan Corbet
December 13, 2024
Commits in the Git source-code management system are identified by the SHA-1 hash of their contents — though the specific hash may change someday. The full hash is a 160-bit quantity, normally written as a 40-character hexadecimal string. While those strings are convenient for computers to work with, humans find them to be a bit unwieldy, so it is common to abbreviate the hash values to shorter strings. Geert Uytterhoeven recently proposed increasing the length of those abbreviated hashes as used in the kernel community, but the problem he was working to solve may not be as urgent as it seems.

A hash, of course, is not the same as the data it was calculated from; whenever hashes are used to represent data, there is always the possibility of a collision — when two distinct sets of data generate the same hash value. A 160-bit hash space is large enough that the risk of accidental collisions is essentially zero; the risk of intentional (malicious) collisions is higher, but is still not something that most people worry about — for now. The hash space is large enough that even a relatively small portion of the hash value is still enough to uniquely identify a value. In a small Git repository, a 24-bit (six-digit) hash may suffice; as a repository grows, the number of digits required to unambiguously identify a commit will grow. In all cases, though, the shorter commit IDs are much easier for humans to deal with, and are almost universally used.

The kernel has, for some years now, used a twelve-character (48-bit) hash value in most places where a commit ID is needed. That is the norm for citing commits within changelogs (in Fixes tags, for example), and in email discussions as well. Uytterhoeven expressed a concern that, given the growth of the kernel repository, soon twelve digits will not be enough: "the Birthday Paradox states that collisions of 12-character commit IDs are imminent". He suggested raising the number of digits used to identify commits in the kernel repository to 16 to head off this possibility.

Linus Torvalds, though, made it clear that he did not support this change, for a couple of reasons. The first of those was that, while Git uses hashes to identify the objects in a repository, those objects are not all commits. There are three core object types in Git: blobs, trees, and commits. Blobs hold the actual data that the repository is managing — one blob holds the contents of one file at a given point in the history. Tree objects hold a list of blobs and their places in the filesystem hierarchy; they indicate which files were present in a given revision, and how they were laid out. If the only change between a pair of revisions is the renaming of a single file, the associated tree objects will differ only in that file's name; both will refer to the same blob object for the file's contents. Finally, a commit contains references to a number of objects (the previous commit(s) and the tree) along with metadata like the commit author, date, changelog, and so on.
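As a rough illustration of how an object's contents turn into its ID, the sketch below reproduces the SHA-1 calculation Git performs for a blob: a short header (type, size, and a NUL byte) is hashed together with the raw contents. The function name `git_object_id` is made up for this example; the same scheme applies to trees and commits, whose bodies have their own formats.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to an object of the given type."""
    # Git hashes a short header ("<type> <size>\0") followed by the raw body.
    header = f"{obj_type} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


empty_blob = git_object_id("blob", b"")
hello_blob = git_object_id("blob", b"hello\n")
print(empty_blob)       # full 40-character ID
print(hello_blob[:12])  # a twelve-character abbreviation, as used in the kernel
```

The output should match what `git hash-object` reports for the same contents, which is one way to check the sketch against a real Git installation.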

Torvalds's point was that commits only make up about 1/8 of the total objects in the repository. Even if two objects turn up with the same (shortened) hash, one of those objects is highly likely not to be a commit. Since humans rarely (never, in truth) traffic in blob or tree hashes, any collisions with those hashes are not a problem; it will be clear which one the human was referring to. When dealing with just the commit space, the problem of ambiguous abbreviations appears to be further away:

My tree is at about 1.3M commits, so we're basically an order of magnitude off the point where collisions start being an issue wrt commit IDs.

When just looking at commit IDs, he said, there are no collisions when ten-digit abbreviations are used, so twelve seems safe for a while yet. That is especially true given that, as Torvalds pointed out, the current state was reached after nearly 20 years of use of Git within the kernel project. It will take a fair while yet to close that order-of-magnitude buffer that the kernel still has.
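The birthday-paradox arithmetic behind this discussion can be sketched in a few lines. The example below assumes that abbreviated IDs behave like uniformly random values and uses the standard approximation for the chance that at least two of n objects share an ID; the function name and the chosen commit counts (the 1.3M figure quoted above, and ten times that) are illustrative only.

```python
import math


def collision_probability(n_objects: int, hex_digits: int) -> float:
    """Approximate chance that any two of n_objects share an abbreviated ID."""
    space = 16 ** hex_digits
    # Standard birthday approximation: 1 - exp(-n(n-1) / (2 * space)).
    return -math.expm1(-n_objects * (n_objects - 1) / (2 * space))


for commits in (1_300_000, 13_000_000):
    for digits in (10, 12, 16):
        p = collision_probability(commits, digits)
        print(f"{commits:>10,} commits, {digits} digits: {p:.4%}")
```

This is an estimate of the chance that some pair collides anywhere in the set, which is a stricter condition than the chance that a particular cited commit becomes ambiguous.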

Torvalds's other point, though, was that humans should not normally be quoting abbreviated hashes in isolation anyway. Within the kernel community, there is a strong expectation that commit IDs will be accompanied by the short-form version of the changelog. So rather than just citing 690b0543a813, for example, a developer would write:

    commit 690b0543a813 ("Documentation: fix formatting to make 's' happy")

There are times, Torvalds says, when the hash provided for a commit is incorrect (often because a rebase operation will have caused it to change), but the short changelog can always be used to locate the correct commit in the repository. Tools should support using that extra information; any workflow that relies too heavily on just the commit ID is already broken, he said. Given that even a twelve-digit hash is often "line noise", he was unwilling to make it even worse for a questionable gain.
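A tool that follows this advice might look something like the sketch below, which tries the abbreviated hash first and falls back to the quoted subject line when the hash is stale or ambiguous. Everything here is hypothetical: the `resolve_citation` name, the regular expression, and the in-memory `commits` mapping (full ID to subject) stand in for whatever a real tool would obtain from the repository.

```python
import re

# Matches citations such as: commit 690b0543a813 ("Some subject line")
CITATION = re.compile(r'(?P<id>[0-9a-f]{7,40}) \("(?P<subject>.*)"\)')


def resolve_citation(text, commits):
    """Return the full IDs matching a citation, trying the hash before the subject.

    `commits` maps full 40-character IDs to their subject lines.
    """
    match = CITATION.search(text)
    if match is None:
        return []
    by_id = [full for full in commits if full.startswith(match["id"])]
    if len(by_id) == 1:
        return by_id
    # Hash missing (e.g. after a rebase) or ambiguous: fall back to the subject.
    candidates = by_id or list(commits)
    return [full for full in candidates if commits[full] == match["subject"]]


# Made-up IDs and subjects, for illustration only.
commits = {
    "1111aaaa2222" + "0" * 28: "net: fix a hypothetical bug",
    "3333bbbb4444" + "0" * 28: "docs: fix a hypothetical typo",
}
# The cited hash is stale, but the subject still identifies the commit.
print(resolve_citation('commit deadbeef0000 ("docs: fix a hypothetical typo")', commits))
```

Returning a list rather than a single ID leaves the decision to the caller when the subject, too, matches more than one commit.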

That response brought an abrupt end to the conversation; the proposed patches will not be merged into the mainline. That ending cut off one other aspect of Uytterhoeven's changes, though. Current kernel documentation is inconsistent about whether hashes should be abbreviated to exactly twelve characters, or to at least that many. That inconsistency is far from the biggest problem in the kernel's documentation, but it still seems worth straightening out at some point.




Is the date there?

Posted Dec 14, 2024 15:18 UTC (Sat) by Wol (subscriber, #4433) [Link] (17 responses)

If you ask for eg 10 characters, and there's a collision, can't git just display enough extra characters to disambiguate, along with the commit date? If the hash wasn't initially ambiguous, it obviously refers to the earliest one.

So does the number of characters really matter? That'll give you all the information you need.

Cheers,
Wol

Is the date there?

Posted Dec 14, 2024 15:38 UTC (Sat) by randomguy3 (subscriber, #71063) [Link] (11 responses)

the obvious immediate issue you'd have to solve is what counts as "older" in a distributed system? you can't use the date - even if you could trust the date on commits, there could easily be an older commit with the same hash in another branch that you can't currently see from your local clone (but will be able to when you - in the future - pull from someone else)

Is the date there?

Posted Dec 14, 2024 23:55 UTC (Sat) by skissane (subscriber, #38675) [Link] (10 responses)

> the obvious immediate issue you'd have to solve is what counts as "older" in a distributed system? you can't use the date - even if you could trust the date on commits, there could easily be an older commit with the same hash in another branch that you can't currently see from your local clone (but will be able to when you - in the future - pull from someone else)

Git is not the kind of distributed system which needs a highly accurate clock. If someone is talking about a commit with some ID, and there are two commits with that ID – one from this year and one from a decade ago – most likely they mean the one from this year. So even if my clock is 5 minutes off, so long as I know what year it is, I'm fine.

Since there are a lot more commits from years ago than from this year, odds are high that when a collision finally happens, it will be between a recent commit and one much older, so it will be obvious from the timestamps which one is relevant. Yes, it is possible that an ID collision could occur between two recent commits, but that is significantly less likely because historical commits greatly outnumber recent ones.

Also there are other ways to disambiguate commits – if we are talking about the networking subsystem, and one commit with that ID is a networking commit, and the other is touching some unrelated kernel subsystem, we know which one is likely meant.

Is the date there?

Posted Dec 16, 2024 10:22 UTC (Mon) by taladar (subscriber, #68407) [Link] (9 responses)

That is all nice and well for a human reader but how is automated tooling supposed to make decisions like that?

Automated tools should fail on ambiguity

Posted Dec 16, 2024 11:42 UTC (Mon) by farnz (subscriber, #17727) [Link] (8 responses)

Automated tools, when presented with an ambiguous truncated hash, should simply fail and require a human to disambiguate. The extra information in context should allow a human to look up the truncated hash, see the (likely single-digit) possibilities, and ask the automated tool to retry with the correct full hash.

And tools generating truncated hashes should verify that the truncated hash is locally unambiguous; for example, if I ask git for a 6 character hash, and it determines that 123abc is both the first 6 characters and ambiguous locally, it should give me a longer hash that isn't ambiguous (e.g. 123abc45, to distinguish from 123abcde as the other local option).
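A minimal sketch of both behaviours, assuming the tool has the list of locally known IDs in memory: `resolve` refuses to guess when a prefix is ambiguous, and `abbreviate` lengthens a prefix until it is unique. The function names and the shortened, made-up IDs are for illustration; they are not Git's own implementation.

```python
def resolve(prefix, known_ids):
    """Return the single full ID matching prefix, or raise if none or several match."""
    matches = [full for full in known_ids if full.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matches {len(matches)} objects: {matches}")
    return matches[0]


def abbreviate(full_id, known_ids, min_len=6):
    """Return the shortest prefix of full_id (at least min_len long) that is unique locally."""
    others = [other for other in known_ids if other != full_id]
    for length in range(min_len, len(full_id) + 1):
        prefix = full_id[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    return full_id


ids = ["123abc45", "123abcde", "9f00aa10"]  # shortened, made-up IDs for illustration
print(abbreviate("123abc45", ids))  # 123abc4
print(abbreviate("9f00aa10", ids))  # 9f00aa
```

Note that uniqueness is only checked against the IDs this clone knows about, so a prefix that is unambiguous here may still collide with an object in someone else's repository.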

Automated tools should fail on ambiguity

Posted Dec 16, 2024 11:55 UTC (Mon) by intelfx (subscriber, #130118) [Link] (6 responses)

> Automated tools, when presented with an ambiguous truncated hash, should simply fail and require a human to disambiguate. The extra information in context should allow a human to look up the truncated hash, see the (likely single-digit) possibilities, and ask the automated tool to retry with the correct full hash.
>
> And tools generating truncated hashes should verify that the truncated hash is locally unambiguous; for example, if I ask git for a 6 character hash, and it determines that 123abc is both the first 6 characters and ambiguous locally, it should give me a longer hash that isn't ambiguous (e.g. 123abc45, to distinguish from 123abcde as the other local option).

I mean, that's how it (Git's own CLI, aka porcelain) works right now. However, there clearly are people who feel it's not adequate.

Automated tools should fail on ambiguity

Posted Dec 16, 2024 11:59 UTC (Mon) by farnz (subscriber, #17727) [Link] (4 responses)

Git's porcelain is not the only tool out there that uses or generates truncated hashes; CI systems consume them, and forges generate them in their Web UIs. Hopefully, they all get things right, but the nature of code is that some of these systems will have bugs.

Automated tools should fail on ambiguity

Posted Dec 17, 2024 18:24 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (3 responses)

AFAIK, neither CI systems nor forges communicate solely in truncated hashes. Either they use full hashes or hidden refs (e.g., `refs/pipelines/N` on GitLab-CI). The UI might *render* only truncated hashes, but interacting should get full hashes (cf. this GitLab bugfix to do so: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/146052).

Automated tools should fail on ambiguity

Posted Dec 17, 2024 18:28 UTC (Tue) by farnz (subscriber, #17727) [Link] (2 responses)

You do, however, have to cope with users doing things like highlighting the truncated hash and copying it (rather than using the affordances in the UI to get a full hash), taking a screenshot and attaching that to a mail (where the recipient then has to reconstruct the hash from the picture), and pasting a truncated hash into the CI's "request a build of an arbitrary commit" field.

I would hope that all of these cases are at least considered by forges and CIs, but once you have a human in the loop, they will find ways to do weird things that you did not expect, and did not allow for - and the best a forge or CI can do is fail when I ask it to interact with an ambiguous commit hash.

Automated tools should fail on ambiguity

Posted Dec 18, 2024 14:31 UTC (Wed) by mathstuf (subscriber, #69389) [Link] (1 responses)

> I would hope that all of these cases are at least considered by forges and CIs, but once you have a human in the loop, they will find ways to do weird things that you did not expect, and did not allow for - and the best a forge or CI can do is fail when I ask it to interact with an ambiguous commit hash.

Yes, humans always make things messy ;) . AFAIK, all APIs would stop something ambiguous at the API surface rather than after figuring out what work needs to be done. And, AFAIK, most APIs want names of refs or forge entities (like PRs) rather than commit IDs. Not that the latter don't exist, but I don't usually find interest in CI results of a specific commit rather than a ref or current PR state (without naming it by one of those ways).

Automated tools should fail on ambiguity

Posted Dec 18, 2024 14:40 UTC (Wed) by farnz (subscriber, #17727) [Link]

Note that a commit ID in git qualifies as a refspec, and that's what most APIs actually ask for - something that git can resolve to a commit. And the trouble kicks in when you accept a refspec - because main, refs/heads/main, aef25be35d23 and aef25be35d23ec768eed08bfcf7ca3cf9685bc28 are all valid refspecs in git, but the truncated hash can be ambiguous with other commits, and only the full hash is unambiguous over time (since the others are mutable).

And IME, if you don't run CI on every single commit, people do ask for CI for a specific commit by hash, rather than mutable ref - when something worked 10 commits ago, but fails now, asking CI to fill in the gaps is useful.

Automated tools should fail on ambiguity

Posted Dec 16, 2024 19:21 UTC (Mon) by geert (subscriber, #98403) [Link]

Yes, git knows about all commits (in your repository), and will print a longer hash automatically when needed (for your repository).

That is how people end up with patches with Fixes:-tags of 13 characters, even with core.abbrev = 12 in their .gitconfig. Next, a strict maintainer will reject those patches, because scripts/checkpatch.pl complains about a hash that is not exactly 12 characters, despite (some parts of) the documentation stating that _at least 12_ is fine.

So that is what the first patch in the series is supposed to fix, but everybody focuses on the second patch ;-)

Automated tools should fail on ambiguity

Posted Dec 16, 2024 15:45 UTC (Mon) by taladar (subscriber, #68407) [Link]

When I wrote automated tools I was more thinking of tooling consuming e.g. those Fix annotations in the history of the Linux kernel, i.e. tooling that consumes a large number of the truncated hashes.

Any place where that could become a problem, i.e. long-term storage of commit hashes, should not use truncated hashes for that reason in my opinion.

Is the date there?

Posted Dec 14, 2024 16:16 UTC (Sat) by khim (subscriber, #9252) [Link] (3 responses)

> If you ask for eg 10 characters, and there's a collision, can't git just display enough extra characters to disambiguate, along with the commit date?

Not if you have hundreds of independent git repos (as happens with kernel development) and commits go into different trees.

> If the hash wasn't initially ambiguous, it obviously refers to the earliest one.

Not if you have many independent repos.

P.S. With one, single, “canonical” repo the whole discussion wouldn't even make any sense since you can just hand over unique IDs sequentially.

Is the date there?

Posted Dec 16, 2024 7:44 UTC (Mon) by smurf (subscriber, #17840) [Link] (2 responses)

? We effectively do have a single repo, in that there is a single official git tree.

The chances that a new hash collides with another new hash in a different tree that happens to *also* not be in Linus' tree are way too small to worry about; you'd need roughly 100'000 commits per release cycle for that to be even remotely likely, using 12-character hashes. (16⁶, the square root of the 16¹² possible 12-character values, is about 16.8 million, which is the point where the chance of a collision approaches 50%.)

Is the date there?

Posted Dec 17, 2024 9:57 UTC (Tue) by khim (subscriber, #9252) [Link] (1 responses)

> you'd need roughly 100'000 commits per release cycle for that to be even remotely likely

Is this 100'000 commits or 100'000 unique hashes? Given the fact that each release cycle there are more than 10'000 commits accepted I would suspect that there are significantly more than 100'000 transient commits that don't go into Linus tree (rejected patches, temporary commits into different git trees, etc).

And we are talking about discussions on mailing lists and other such things. These don't belong to a single git tree.

Is the date there?

Posted Dec 17, 2024 11:52 UTC (Tue) by smurf (subscriber, #17840) [Link]

> Is this 100'000 commits or 100'000 unique hashes?

Commits. If you ask git to show something that might be a commit and/or something else, you get a nice list with the candidates, including the (we hope) single commit in question.

> significantly more than 100'000 transient commits that don't go into Linus tree (rejected patches, temporary commits into different git trees, etc).

These are not going to be referenced from short hashes that *are* in Linus' tree.

> And we are talking about discussions on mailing lists and other such things. These don't belong to a single git tree.

The number of commit references on mailing lists, online discussions, other git trees, et al. is significantly smaller than 100'000, which violates the birthday paradox-ish assumption that *any* two commits that share a prefix are a problem.

In other words, the real-life probability of a collision that actually matters to anybody is a lot less than what the BP tells us it might be.

Is the date there?

Posted Dec 15, 2024 20:56 UTC (Sun) by andy_shev (subscriber, #75870) [Link]

I remember seeing collisions when I used 6 or so characters to refer to a commit. `git log` simply listed everything that matched.

Nice!

Posted Dec 14, 2024 17:25 UTC (Sat) by npitre (subscriber, #5680) [Link] (1 responses)

Love the commit used as example. ;-)

Nice!

Posted Dec 16, 2024 12:11 UTC (Mon) by sdalley (subscriber, #18550) [Link]

And the delightful way in which our editor does a send-up of click-baity headlines with tongue coiled firmly-in-cheek!

Including the commit date in the reference

Posted Dec 14, 2024 23:05 UTC (Sat) by alx.manpages (subscriber, #145117) [Link] (3 responses)

The commit subject is usually distinct. However, some commits have a more generic subject, which might not be enough to disambiguate.

Recently, I had to refer to many of the commits in one project I co-maintain, and came up with some notation that's straightforward, and lets me easily see the precise commit I'm referring to:

```
$ cat ~/.gitconfig | grep -A1 'ref ='
ref = show --no-patch --abbrev=12 --date=short --format=format:'%C(auto)%h (%cd,%C(reset) \"%C(white)%s%C(reset)\")'
$ git ref HEAD -3
8821d3ff2dcf (2024-12-09, "lib/fs/readlink/: readlinknul(): Fix return type")
b9d00b64a19f (2024-12-09, "lib/fs/readlink/readlinknul.h: readlinknul(): Silence warning")
205c23bff28f (2024-12-09, "Added option -a for listing active users only, optimized using if aflg,return")
```

That is:

<hash-12-chars> (<commit-date>, "commit-subject")

The combination of 12-char hash and commit date makes it completely unambiguous, and the subject makes it more readable.

A few considerations, compared to `git log --pretty=reference`:
- The commit date is more useful than the author date, because some commits might have been authored years before being finally committed. When looking at the git-log(1), and especially in --graph mode, the dates that will be correlative are the commit dates, which make it easy to search.
- This puts the date before the subject, which makes it easier to find all three fields in a long list of references.
- This includes quotes around the subject, as usual.

So, I now use references like this in commit messages:

Fixes: 419ce14b6f72 (2024-11-01, "lib/fs/readlink/: readlinknul(): Add function")

Including the commit date in the reference

Posted Dec 15, 2024 15:43 UTC (Sun) by mgedmin (subscriber, #34497) [Link] (1 responses)

This is pretty similar to the builtin `git log --pretty=ref`, about which I only learned recently and by accident.

Including the commit date in the reference

Posted Dec 16, 2024 23:06 UTC (Mon) by alx.manpages (subscriber, #145117) [Link]

I actually learned about --pretty=reference when reading the documentation for implementing my `git ref` alias. :)

I didn't like it though. The author date is useless, as it doesn't serve to find a commit in the log. Also, I prefer having quotes around the subject.

Including the commit date in the reference

Posted Dec 16, 2024 23:09 UTC (Mon) by alx.manpages (subscriber, #145117) [Link]

BTW, when dealing with multiple forks of a project, it's interesting to specify which one the commit belongs to. So, for example, I would use:

Fixes: mutt.git b423ebbfa9d2 (1999-01-04, "Make the experimental branch the main trunk.")

I also use this when cherry-picking commits from other projects; that's where this is especially useful.

Full commit hashes

Posted Dec 15, 2024 14:30 UTC (Sun) by pabs (subscriber, #43278) [Link] (3 responses)

I always refer to full commit hashes in commit messages, I wonder how popular that is.

Fixes: commit 74fd8a97e67cb8ab0da073e94a54804a66ca8e40

Full commit hashes

Posted Dec 15, 2024 14:45 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (1 responses)

This works better when forges are involved in rendering such things where, generally, a hover will show context on what it is. However, when just using `git log`, I think the date/subject are very helpful. Of course, nothing is stopping anyone from adding such LSP-like behaviors to `vim-fugitive` or Magit, but they don't exist yet AFAIK. And even that requires smarter tooling than the prevalent default of `| less`.

Full commit hashes

Posted Dec 16, 2024 23:53 UTC (Mon) by pabs (subscriber, #43278) [Link]

Wonder if there is any sane way for `git log` itself to render them, maybe with an option.

There is a hyperlink ANSI escape sequence, and it seems that you can include arbitrary text as the "URL".

On GNOME Terminal at least, that means you can create an underlined sequence, that on mouse hover, will show the commit message summary. Since it strips out tabs, LF and probably more, you can't add the full commit message though.

https://github.com/Alhadis/OSC8-Adoption
https://gist.github.com/egmontkob/eb114294efbcd5adb1944c9...
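For the curious, the escape sequence in question can be sketched as follows. This is a hedged example rather than anything `git log` does today: the `osc8_link` helper is hypothetical, it puts free-form text where a URI is normally expected (as described above), and whether anything shows up on hover depends entirely on the terminal.

```python
def osc8_link(target: str, text: str) -> str:
    """Wrap text in an OSC 8 hyperlink escape sequence pointing at target."""
    # Format: ESC ] 8 ; params ; URI ST  text  ESC ] 8 ; ; ST, where ST is ESC \.
    # Control characters would terminate or corrupt the sequence, so drop them.
    safe_target = "".join(ch for ch in target if ch.isprintable())
    return f"\x1b]8;;{safe_target}\x1b\\{text}\x1b]8;;\x1b\\"


# Hypothetical example: the short hash is displayed, the subject rides along as the "URL".
print(osc8_link("lib/fs/readlink/: readlinknul(): Add function", "419ce14b6f72"))
```

Terminals that do not understand the sequence should simply show the visible text, so the output degrades to a plain abbreviated hash.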

Unfortunately the hover part isn't implemented in Ptyxis, which is the successor to GNOME Console, which was the successor to GNOME Terminal.

KDE Konsole also doesn't enable hyperlink support by default.

https://github.com/Alhadis/OSC8-Adoption/pull/18

Full commit hashes

Posted Dec 15, 2024 17:29 UTC (Sun) by adobriyan (subscriber, #30858) [Link]

>I always refer to full commit hashes in commit messages, I wonder how popular that is.

They are de facto banned in our beloved kernel.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

Using a forge identifier

Posted Dec 15, 2024 19:27 UTC (Sun) by sunshowers (guest, #170655) [Link]

Interestingly, large organizations, with repositories orders of magnitude bigger than the Linux kernel, tend to use other unique identifiers in places like commit messages. The most common one is identifiers issued by whatever forge they're using.

I worked on source control at Facebook/Meta for many years, where every commit had a numerically increasing "Differential ID" associated with it, corresponding to the forge in use. The IDs had deep integration with the source control system, so you could run "hg update D123456" and your working copy would be updated to that revision.

On GitHub, it's common to use the PR number for the same purpose. Gerrit also issues identifiers I believe.

The Jujutsu VCS has change IDs which stay stable across amends and rebases, but they aren't shared across clones so are less useful as common identifiers.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds