Handling messy pull-request diffstats
Subsystem maintainers routinely use git request-pull as part of the process of sending work upstream. Normally, the result includes a list of commits included in the request and a nice diffstat that shows which files will be touched and how much of each will be changed; examples abound on the kernel mailing lists. Occasionally, though, a repository with a relatively complicated development history will yield a massive diffstat containing a great deal of unrelated work. The result looks ugly and obscures what the pull request is actually doing. This document describes what is happening and how to fix things up; it is derived from The Wisdom of Linus Torvalds, which has been posted numerous times over the years (example 1, example 2).
A Git development history proceeds as a series of commits. In a simplified manner, mainline kernel development looks like this:
If one wants to see what has changed between two points, a command like this will do the job:
$ git diff --stat --summary vN-rc2..vN-rc3
Here, there are two clear points in the history (vN-rc2 and vN-rc3); Git will, in a manner of speaking, subtract the commits present at the beginning point from those at the end point and display the resulting differences. The requested operation is unambiguous and easy enough to understand.
When a subsystem maintainer creates a branch and commits changes to it, the result in the simplest case is a history that looks like:
If that maintainer now uses git diff to see what has changed between the mainline branch (let's call it linus) and cN, there are still two clear end points (vN-rc2 and cN), and the result is as expected. So a pull request generated with git request-pull will also be as expected. But now consider a slightly more complex development history:
Our maintainer has created one branch at vN-rc1 and another at vN-rc2; the two were then subsequently merged into c2. Now a pull request generated for cN may end up being messy indeed, and developers often end up wondering why.
What is happening here is that there are no longer two clear end points for the git diff operation to use. The development culminating in cN started in two different places; to generate the diffstat, git diff ends up having to pick one of them and hoping for the best. If the diffstat starts at vN-rc1, it may end up including all of the changes between there and the second beginning point (vN-rc2), which is certainly not what our maintainer had in mind. With all of that extra junk in the diffstat, it may be impossible to tell what actually happened in the changes leading up to cN.
Maintainers often try to resolve this problem by, for example, rebasing the branch or performing another merge with the linus branch, then recreating the pull request. This approach tends not to lead to joy at the receiving end of that pull request; rebasing and/or merging just before pushing upstream is a well-known way to get a grumpy response.
So what is to be done? The best response when confronted with this situation is to indeed to do a merge with the branch you intend your work to be pulled into, but to do it privately, as if it were the source of shame. Create a new, throwaway branch and do the merge there:
The merge operation resolves all of the complications resulting from the multiple beginning points, yielding a coherent result that contains only the differences from the mainline branch. Once again, there are just two end points to use in generating the listing, so it will now be possible to generate a diffstat with the desired information:
$ git diff -C --stat --summary linus..TEMP
Save the output from this command, then simply delete the TEMP branch;
definitely do not expose it to the outside world. Take the saved diffstat
output and edit it into the messy pull request, yielding a result that
shows what is really going on. That request can then be sent upstream.
Index entries for this article | |
---|---|
Kernel | Development tools/Git |
Kernel | Git |
Posted Apr 22, 2022 18:39 UTC (Fri)
by khim (subscriber, #9252)
[Link] (6 responses)
Posted Apr 22, 2022 19:46 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link] (5 responses)
And since all maintainers should be doing a test merge (to check for conflicts, especially the rare semantic ones that cause the build to fail and Linus to shout...) before sending a pull request, you already have the test branch ready and can use it to create the diffstat.
It's true though that, assuming you have the test branch ready, it may make sense for git-request-pull to have an option that uses it.
(FWIW this was one of the most confusing things to learn when I started maintaining KVM...)
Posted Apr 22, 2022 21:19 UTC (Fri)
by epa (subscriber, #39769)
[Link] (4 responses)
And yet, merging master (or whatever upstream) into your development branch seems to be frowned on by Linus and other guardians of git workflow. It does make the final graph more tangled. And you could argue that resolving conflicts is the judgement of the maintainer who accepts the merge request: you don’t necessarily trust some contributor to handle conflicts correctly even if you would be prepared to review and merge code they wrote.
So is there a way git could deal with this? Like a ‘merge but not merge’? If there are any conflicts you have to resolve them, and the way you did it is recorded somehow, but the final content remains unchanged. This ‘prepared to merge’ branch could then be submitted as a merge request, and the maintainer could choose to stick with the conflict resolution you did, or ignore it and resolve from scratch.
I guess in some muddled way I am analogizing to ‘prepare to commit’ in database systems, where any data conflicts are checked now, so if the next operation is a commit it is guaranteed to succeed.
Posted Apr 22, 2022 21:43 UTC (Fri)
by mathstuf (subscriber, #69389)
[Link]
Yes, I agree. It puts your target on the non-first parent path which "twists" the DAG a bit in most renderings. What you can do instead is make a new branch off of the target rc and merge your proposal into it and then submit that with conflicts resolved. This makes it look like `c2` in the later build graphs in the article (which is fine).
Posted Apr 22, 2022 22:40 UTC (Fri)
by pbonzini (subscriber, #60935)
[Link]
Resolving conflicts in Linux is done earlier, using common topic branches that are included by multiple maintainers (or that the maintainers includes in both the for-rcN and the for-next branches).
If the conflicts are not caught earlier, Linus absolutely doesn't care if he has conflicts to solve, in fact he wants to see them because it's his way of detecting places that are not collaborating well or that could be messy during the -rc phase.
Looks like material for a second article... :)
Posted Apr 23, 2022 0:27 UTC (Sat)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
Strictly speaking, you don't need to keep the temp branch around even for the purpose of saving yourself work. You can just use git rerere for that. I would tend to assume there is a reasonable way of exporting the rerere metadata so that you can share it with others, but cursory Googling only turns up weird ideas like "make a dedicated branch," and I have no idea why that should be necessary or useful for pulling already-existing data out of .git.
Posted Apr 23, 2022 11:22 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link]
Posted Apr 22, 2022 22:52 UTC (Fri)
by jgg (subscriber, #55211)
[Link]
This is an example:
https://lore.kernel.org/linux-rdma/20210901234159.GA24219...
(and look at the bottom my script that sorts this out puts a note about the diffstat)
Posted Apr 23, 2022 16:40 UTC (Sat)
by pm215 (subscriber, #98099)
[Link] (1 responses)
Posted Apr 25, 2022 11:06 UTC (Mon)
by k3ninho (subscriber, #50375)
[Link]
Maybe there should be a more-automated way to avoid suggesting things are in conflict for files you didn't edit -- the ones that change due to the repository changing around you -- but that doesn't give a hint as to the importance of changes when you merge a second branch to your non-trunk working set.
K3n.
Posted Apr 23, 2022 21:11 UTC (Sat)
by olof (subscriber, #11729)
[Link]
The normal flow I used was to remove any rerere cache, merge each branch one by one into current HEAD of Linus' tree, writing down any conflict resolution, and then generate the pull requests with a local script that did the diffstat (and log) manually. Worked well for 10+ years so far, with a few manual hiccups when I diff the wrong merge vs git tag.
Posted May 5, 2022 7:50 UTC (Thu)
by jepsis (subscriber, #130218)
[Link] (1 responses)
Posted May 5, 2022 8:28 UTC (Thu)
by geert (subscriber, #98403)
[Link]
It's still a good idea to maintain your own separate rebased branch for cross-checking: merging in a branch, and rebasing your own work on top of that branch should yield the exact same tree.
This sounds like a suspiciously simple algorithm. Why can't Git do that automatically?
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats
Handling messy pull-request diffstats