Handling messy pull-request diffstats

By Jonathan Corbet
April 22, 2022

[Editor's note: the following was written in response to frequent questions on the linux-kernel mailing list; it was pulled into the mainline during the 5.18 merge window.]

Subsystem maintainers routinely use git request-pull as part of the process of sending work upstream. Normally, the result includes a list of commits included in the request and a nice diffstat that shows which files will be touched and how much of each will be changed; examples abound on the kernel mailing lists. Occasionally, though, a repository with a relatively complicated development history will yield a massive diffstat containing a great deal of unrelated work. The result looks ugly and obscures what the pull request is actually doing. This document describes what is happening and how to fix things up; it is derived from The Wisdom of Linus Torvalds, which has been posted numerous times over the years (example 1, example 2).

A Git development history proceeds as a series of commits. In a simplified manner, mainline kernel development looks like this:

[Commit stream]

If one wants to see what has changed between two points, a command like this will do the job:

  $ git diff --stat --summary vN-rc2..vN-rc3

Here, there are two clear points in the history (vN-rc2 and vN-rc3); Git will, in a manner of speaking, subtract the commits present at the beginning point from those at the end point and display the resulting differences. The requested operation is unambiguous and easy enough to understand.

When a subsystem maintainer creates a branch and commits changes to it, the result in the simplest case is a history that looks like:

[Commit stream]

If that maintainer now uses git diff to see what has changed between the mainline branch (let's call it linus) and cN, there are still two clear end points (vN-rc2 and cN), and the result is as expected. So a pull request generated with git request-pull will also be as expected. But now consider a slightly more complex development history:

[Commit stream]

Our maintainer has created one branch at vN-rc1 and another at vN-rc2; the two were then subsequently merged into c2. Now a pull request generated for cN may end up being messy indeed, and developers often end up wondering why.

What is happening here is that there are no longer two clear end points for the git diff operation to use. The development culminating in cN started in two different places; to generate the diffstat, git diff ends up having to pick one of them and hoping for the best. If the diffstat starts at vN-rc1, it may end up including all of the changes between there and the second beginning point (vN-rc2), which is certainly not what our maintainer had in mind. With all of that extra junk in the diffstat, it may be impossible to tell what actually happened in the changes leading up to cN.

Maintainers often try to resolve this problem by, for example, rebasing the branch or performing another merge with the linus branch, then recreating the pull request. This approach tends not to lead to joy at the receiving end of that pull request; rebasing and/or merging just before pushing upstream is a well-known way to get a grumpy response.

So what is to be done? The best response when confronted with this situation is to indeed to do a merge with the branch you intend your work to be pulled into, but to do it privately, as if it were the source of shame. Create a new, throwaway branch and do the merge there:

[Commit stream with merge]

The merge operation resolves all of the complications resulting from the multiple beginning points, yielding a coherent result that contains only the differences from the mainline branch. Once again, there are just two end points to use in generating the listing, so it will now be possible to generate a diffstat with the desired information:

  $ git diff -C --stat --summary linus..TEMP

Save the output from this command, then simply delete the TEMP branch; definitely do not expose it to the outside world. Take the saved diffstat output and edit it into the messy pull request, yielding a result that shows what is really going on. That request can then be sent upstream.

Index entries for this article
Kernel	Development tools/Git
Kernel	Git

Handling messy pull-request diffstats

Posted Apr 22, 2022 18:39 UTC (Fri) by khim (subscriber, #9252) [Link] (6 responses)

This sounds like a suspiciously simple algorithm. Why can't Git do that automatically?

Handling messy pull-request diffstats

Posted Apr 22, 2022 19:46 UTC (Fri) by pbonzini (subscriber, #60935) [Link] (5 responses)

Mostly because there can be conflicts and in that case it is not possible to do it automatically.

And since all maintainers should be doing a test merge (to check for conflicts, especially the rare semantic ones that cause the build to fail and Linus to shout...) before sending a pull request, you already have the test branch ready and can use it to create the diffstat.

It's true though that, assuming you have the test branch ready, it may make sense for git-request-pull to have an option that uses it.

(FWIW this was one of the most confusing things to learn when I started maintaining KVM...)

Handling messy pull-request diffstats

Posted Apr 22, 2022 21:19 UTC (Fri) by epa (subscriber, #39769) [Link] (4 responses)

If there are merge conflicts then when the pull request is finally accepted they will have to be resolved too. That suggests submitting this ‘temp’ branch as the branch to be merged, since the work of conflict resolution has been done once and should not be wasted. That also seems cleaner than using one branch just to fake up a diffstat while what’s actually merged is something else.

And yet, merging master (or whatever upstream) into your development branch seems to be frowned on by Linus and other guardians of git workflow. It does make the final graph more tangled. And you could argue that resolving conflicts is the judgement of the maintainer who accepts the merge request: you don’t necessarily trust some contributor to handle conflicts correctly even if you would be prepared to review and merge code they wrote.

So is there a way git could deal with this? Like a ‘merge but not merge’? If there are any conflicts you have to resolve them, and the way you did it is recorded somehow, but the final content remains unchanged. This ‘prepared to merge’ branch could then be submitted as a merge request, and the maintainer could choose to stick with the conflict resolution you did, or ignore it and resolve from scratch.

I guess in some muddled way I am analogizing to ‘prepare to commit’ in database systems, where any data conflicts are checked now, so if the next operation is a commit it is guaranteed to succeed.

Handling messy pull-request diffstats

Posted Apr 22, 2022 21:43 UTC (Fri) by mathstuf (subscriber, #69389) [Link]

> And yet, merging master (or whatever upstream) into your development branch seems to be frowned on by Linus and other guardians of git workflow.

Yes, I agree. It puts your target on the non-first parent path which "twists" the DAG a bit in most renderings. What you can do instead is make a new branch off of the target rc and merge your proposal into it and then submit that with conflicts resolved. This makes it look like `c2` in the later build graphs in the article (which is fine).

Handling messy pull-request diffstats

Posted Apr 22, 2022 22:40 UTC (Fri) by pbonzini (subscriber, #60935) [Link]

> resolving conflicts is the judgement of the maintainer who accepts the merge request:

Resolving conflicts in Linux is done earlier, using common topic branches that are included by multiple maintainers (or that the maintainers includes in both the for-rcN and the for-next branches).

If the conflicts are not caught earlier, Linus absolutely doesn't care if he has conflicts to solve, in fact he wants to see them because it's his way of detecting places that are not collaborating well or that could be messy during the -rc phase.

Looks like material for a second article... :)

Handling messy pull-request diffstats

Posted Apr 23, 2022 0:27 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (1 responses)

> since the work of conflict resolution has been done once and should not be wasted.

Strictly speaking, you don't need to keep the temp branch around even for the purpose of saving yourself work. You can just use git rerere for that. I would tend to assume there is a reasonable way of exporting the rerere metadata so that you can share it with others, but cursory Googling only turns up weird ideas like "make a dedicated branch," and I have no idea why that should be necessary or useful for pulling already-existing data out of .git.

Handling messy pull-request diffstats

Posted Apr 23, 2022 11:22 UTC (Sat) by mathstuf (subscriber, #69389) [Link]

`rerere` doesn't use the object store (for better or worse). My issue tends to be messing up a resolution and having to remove it. But it's just easier to nuke them all than to figure out which one it is once you have a collection amassed. If they were stored with the other objects, they'd need some more tooling to clean them out more effectively.

Handling messy pull-request diffstats

Posted Apr 22, 2022 22:52 UTC (Fri) by jgg (subscriber, #55211) [Link]

I've been doing this for years.. The other twist Linus shared with me is to actually keep the merged branch in some cases and send it as a 'FYI here is my merge resolution'. I do this when there are conflicts, I'll send the conflict resolution in the pull request and also mention that I have done it in the for-linus-merged tag. Linus says he will pull and compare the for-linus-merged as a sanity check (or perhaps use it sometimes)

This is an example:

https://lore.kernel.org/linux-rdma/20210901234159.GA24219...

(and look at the bottom my script that sorts this out puts a note about the diffstat)

Handling messy pull-request diffstats

Posted Apr 23, 2022 16:40 UTC (Sat) by pm215 (subscriber, #98099) [Link] (1 responses)

I tend to feel like if this complicated explanation and tricks like "do a merge purely to create a diffstat, then throw it away" are necessary, then there's a problem with either the workflow or with the tooling's ability to support that workflow...

Handling messy pull-request diffstats

Posted Apr 25, 2022 11:06 UTC (Mon) by k3ninho (subscriber, #50375) [Link]

The tooling isn't matched to the 'send a patch' workflow of the mailing list -- a merge request wants patch-equivalent state that would apply cleanly like a good happy patch should. That said, it's weird this is a 'trick' given it's how I've come to use git for pull-request mediated development.

Maybe there should be a more-automated way to avoid suggesting things are in conflict for files you didn't edit -- the ones that change due to the repository changing around you -- but that doesn't give a hint as to the importance of changes when you merge a second branch to your non-trunk working set.

K3n.

Handling messy pull-request diffstats

Posted Apr 23, 2022 21:11 UTC (Sat) by olof (subscriber, #11729) [Link]

We've done this for the (ARM) SoC trees almost since the start, due to the complex merge history.

The normal flow I used was to remove any rerere cache, merge each branch one by one into current HEAD of Linus' tree, writing down any conflict resolution, and then generate the pull requests with a local script that did the diffstat (and log) manually. Worked well for 10+ years so far, with a few manual hiccups when I diff the wrong merge vs git tag.

Handling messy pull-request diffstats

Posted May 5, 2022 7:50 UTC (Thu) by jepsis (subscriber, #130218) [Link] (1 responses)

Why not just maintain a separate rebased branch and send a pull request from there?

Handling messy pull-request diffstats

Posted May 5, 2022 8:28 UTC (Thu) by geert (subscriber, #98403) [Link]

Because Linus does not like receiving pull requests for branches that have been rebased recently, as that invalidates any testing done on the pre-rebased branch.

It's still a good idea to maintain your own separate rebased branch for cross-checking: merging in a branch, and rebasing your own work on top of that branch should yield the exact same tree.