"git request-pull" and confusing diffstats

By Jonathan Corbet
October 21, 2019

When a kernel subsystem maintainer has a set of commits to send up the chain toward the mainline, the git request-pull command is usually the right tool for the job. But various maintainers have noticed over the years that this command can sometimes generate confusing results when confronted with anything but the simplest of histories. A brief conversation on the linux-kernel mailing list delved into why this situation comes about and what maintainers can do in response.

While git request-pull is intended to be a general-purpose command, it's no secret that its output is designed for the needs of one specific consumer; it contains almost everything Linus Torvalds needs to know when considering a request to pull a tree into the mainline kernel. This information includes the commit upon which the tree is based, the timestamp for the most recent commit in the tree, the tree to pull the commits from, the maintainer's description of the changes, a list of commits (in the git shortlog format), and a diffstat showing which files will be touched. See a typical pull request to see how all of those elements are worked into a single email.

That example is generated from a relatively straightforward development history that looks something like this:

Generating both the commit log and the diffstat for this history is relatively straightforward, and the pull requests looks exactly as one would expect.

Recently, Will Deacon ran into a more complex situation, though. His tree was initially based in 5.4-rc1, but then required a merge of 5.4-rc2 to obtain the dependencies for a fix. The history ended up looking something like this:

When one runs git request-pull for a tree like this, the commit-log portion will look exactly as expected — it contains the commits in the local tree. The diffstat, though, is likely to reflect a large number of unrelated changes, making the pull request look like a scary beast indeed. In essence, that diffstat will reflect not just the local changes, but also everything that was pulled into the local tree when 5.4-rc2 was merged. That makes it hard, at best, to see what the pull request will actually do.

Deacon was surprised that the commit log was correct while the diffstat was wrong. Torvalds explained it this way:

So logs and diffs are fundamentally different.

A log is an operation on a _set_ of commits (that's the whole point - you don't list the beginning and the end, you list all the commits in between), while a diff is fundamentally an operation on two end-points and shows the code difference between those two points.

And the summary versions of those operations (shortlog and diffstat) are no different.

He went on to say that, when there are only two endpoints and the history is simple, it is not hard for a tool like Git to calculate the difference between them. Throwing another merge into the mix complicates the situation, though, by adding another endpoint. The end result is the useless diffstat included in the pull request.

Deacon resolved the issue by merging the current mainline with the tree to be pulled:

The diffstat could then be generated manually choosing the mainline and the merged tree as the two endpoints; the only differences that will be visible will be those that are not in the current mainline — the changes to be pulled from Deacon's tree, in other words. The clean diffstat was then manually patched into the pull request. The merge itself can then be thrown away; it should not be part of the commit stream sent upstream. As Torvalds explained, performing the merge reduced the diffstat problem back to two simple endpoints, so the result became as one would expect.

Should maintainers perform this sort of dance before sending pull requests upstream? It seems that Torvalds appreciates the effort; he described the result as a "good quality" pull request. He also noted, though, that he often gets pull requests with confusing diffstats and doesn't really have a hard time dealing with them. Still, maintainers who want to be sure that their pull requests are as pleasing to the recipient as possible may want to go to the extra effort.

The best solution, of course, would be to fix git request-pull to do the right thing in this sort of situation. Depending on how complex the merge history is, though, "the right thing" may not always be entirely obvious. It might also, like the merge described above, require changing the repository, which the request-pull command does not currently do. But, as Ingo Molnar noted, it should at least be possible for git request-pull to detect this situation and issue a warning when it arises. Then, at least, developers would not be surprised by a bogus diffstat — something that can easily happen immediately after having sent it upstream.

Index entries for this article
Kernel	Development tools/Git

"git request-pull" and confusing diffstats

Posted Oct 21, 2019 23:02 UTC (Mon) by unixbhaskar (guest, #44758) [Link]

Good one Jon!

Rebase

Posted Oct 22, 2019 6:44 UTC (Tue) by Spack (subscriber, #77556) [Link] (2 responses)

Maybe a noob question, why not rebasing the branch before submitting the pull request?

Rebase

Posted Oct 22, 2019 8:26 UTC (Tue) by famzheng (subscriber, #121411) [Link]

Some downstream people rely on the stable hash of commits once they land in a maintainer's branch, for "git cherry-pick -x".

Rebase

Posted Oct 22, 2019 14:50 UTC (Tue) by cyphar (subscriber, #110703) [Link]

You're in luck -- there's actually a document in Documentation/ that explains why this is a bad idea for maintainers[1]. There are two primary reasons:

1. Maintainers' trees are public, and thus developers base their work on them. Rebasing a public tree causes trouble for downstreams.
2. Rebasing a tree before you send it to Linus results in you sending untested patches (any previous tests run on your tree are invalid).

[1]: https://www.kernel.org/doc/html/latest/maintainer/rebasin...

"git request-pull" and confusing diffstats

Posted Oct 22, 2019 11:23 UTC (Tue) by fsateler (subscriber, #65497) [Link]

I think I'm missing something.

Why isn't `git diff mainline...HEAD` (three dots) used?

"git request-pull" and confusing diffstats

Posted Oct 22, 2019 11:28 UTC (Tue) by jezuch (subscriber, #52988) [Link]

It's possible that in many cases this could be fixed by addition of a single dot:

git diff start...end

Instead of:

git diff start..end

(I learned this only this year, despite being considered a git expert in my area!)

"git request-pull" and confusing diffstats

Posted Oct 22, 2019 15:34 UTC (Tue) by pbonzini (subscriber, #60935) [Link]

I wonder if it's enough to use "git request-pull origin/master" instead of just "git request-pull".

"git request-pull" and confusing diffstats

Posted Oct 22, 2019 17:07 UTC (Tue) by jgg (subscriber, #55211) [Link]

Note that Linus has also said in the past he does not like the merge to mainline as the top of pull request, so if you generate a merge commit don't send it.

Based on a conversation with Linus, what I do is to prepare two branches, one that has the unmerged tree, and one that has a merge commit to latest at the very top. The Pull Request is for the first tree, but the diffstat is replaced with the one from the second. The second is tagged something like 'for-linus-merged' and is mentioned in the PR if there were conflicts to resolve. I usually have minor conflicts.

Linus explained he will use the -merged tag to compare any merge conflict resolutions he makes against the sender's version of those resolutions.

You can't really 'fix' git pull-request because it doesn't have a merged base to do the diff against. Perhaps the best one could do is have it try to automerge and use that if there are no conflicts. Uwe provided an algorithm that does this later in the thread.

The diagram in the article is not quite right. The key thing is that 'C2' was sent and merged by Linus into rc3, and then rc2 was merged on top of a commit that was logically in rc3. This 'going backwards' creates the problem.

It could also have been avoided if rc3 was merged instead of rc2. ie don't merge older tags into your tree.

"git request-pull" and confusing diffstats

Posted Oct 22, 2019 23:54 UTC (Tue) by ceplm (subscriber, #41334) [Link]

I am always fascinated by this utility. It seems so full of opportunity and so little used outside of the kernel world. Did for example anybody managed to make an utility parsing its output in the gitlab issue report (hopefully from a repository on another site) and making a merge request out of it?

Something to help with https://gitlab.com/gitlab-org/gitlab/issues/14116 .

"git request-pull" and confusing diffstats

Posted Oct 26, 2019 7:57 UTC (Sat) by kccqzy (guest, #121854) [Link]

It should be noted that this kind of situation has been known since a very long time and some commercial Git hosting providers do exactly the same dance to show a good diff, namely perform a merge and show that diff. See for example here.

I got used to seeing this kind of diff very quickly. If git request-pull could automate this it would be great. However I do think this is easier in a centralized environment where every change to the mainline can automatically cause the pull request diff to be recompiled.