Stochastic bisection in Git

By Jonathan Corbet
December 10, 2021

Regressions are no fun; among other things, finding the source of a regression among thousands of changes can be a needle-in-the-haystack sort of problem. The git bisect command can help; it is a (relatively) easy way to sift through large numbers of commits to find the one that introduces a regression. When it works well, it can quickly point out the change that causes a specific problem. Bisection is not a perfect tool, though; it can go badly wrong in situations where a bug cannot be reliably reproduced. In an attempt to make bisection more useful in such cases, Jan Kara is proposing to add "stochastic bisection" support to Git.

Bisection looks for problem commits using a binary search. The developer identifies the latest known good commit with git bisect good and the earliest known commit showing the bug with git bisect bad. Git will then find a commit near the midpoint between the two, check out that commit, and wait for the developer to try to reproduce the bug. Another git bisect command is used to mark the commit as "good" or "bad", and the process repeats, dividing the range of commits in half each time, until only one commit remains. That commit must be the one that introduced the bug in question.

This technique can be powerful. A bug introduced in a 12,000-commit kernel merge window can be narrowed to a single commit in 14 bisect cycles, which makes the process of finding the actual bug much easier. But it works less well when dealing with bugs that are difficult to reproduce and which, thus, may not manifest in any given testing cycle. A 14-step bisection is 14 opportunities for the developer to provide an incorrect result, and it only takes one such to throw the entire process off. It is not uncommon to see nonsensical bisection results posted to mailing lists; they are often caused by just this kind of problem.

Stochastic bisection is a way of trying to adapt the bisection process to situations where it is not 100% clear whether a bug is present at a given point in the history. It can't say for sure which commit is responsible for the bug, but it can find the commit that is probably at fault. To use this feature, a developer will supply the desired confidence level (as a number between zero and one) with the --confidence option to git bisect start. The process will then continue until one commit is identified as being the most likely source of the bug with the requested confidence level.

From there, the bisection process starts as usual, picking a commit in the middle of the specified range. With each test, though, the developer will indicate their confidence in the result they report with the same --confidence flag. A commit that exhibits the bug is almost certainly bad, so a command like:

    git bisect bad --confidence 1.0

would be appropriate. Tests that fail to show the bug may be more ambiguous, though; perhaps the race condition that causes the bug to manifest just didn't trigger this time. If the developer suspects that the bug might be present even though the test indicated otherwise, they can specify a lower confidence level.

Kara included an example in the series cover letter showing how this process would work. It starts with a commit history looking like this:

    A--B--C--D-----F-----H--------K
     \     \  \-E-/     /        /
      \     \--------G-/        /
       \------------------I--J-/

The problematic commit is I, but the developer only knows that A works and K occasionally shows the problem. They go through the steps, reporting all results (some of which are incorrect) with a confidence level of 0.7; eventually Git reports that commit I is the likely culprit with the requested confidence level of 0.9.

One thing from the example will jump out at developers who have used bisection in the past. A normal bisection process will not present the same commit to the developer for testing twice; once a verdict has been rendered on any given commit, it will not change. Stochastic bisection, instead, will ask the developer to rerun tests on commits that have been seen before, sometimes multiple times in a row. The result is a rather longer bisection process. Given the 11-commit history shown above, a normal bisection would be done after four steps; the stochastic bisection process, instead, required 14 steps. The relative expansion of a bisection over a larger range of commits is likely to be significantly less, but it will still make the process longer. A longer process is better than one that yields a nonsense result, though.

Perhaps the most substantive comment in response to this posting came from Taylor Blau. Kara's patches try to share as much logic as possible between normal and stochastic bisection, which is a worthy goal. In the process of joining the two, though, Kara caused the bisection points reached in some of the Git tests to change; the new points are just as valid, but they are different, so Kara had to change the tests as well. Blau would rather see the new feature added without any behavior changes for developers who are not using it. Kara responded that he feels that his solution is the right way to do things, but he also expressed a willingness to change things if the Git developers request it. Nobody has given him an answer on that question at this point.

There are other changes that will be needed, of course, as is usually the case for a complex patch set, so there should be at least one revised version posted to the list at some point. Stochastic bisection will not be useful for all developers, but there are clearly situations where a more probabilistic approach is required. It would thus be surprising if some version of this work didn't eventually find its way into the Git mainline.

Stochastic bisection in Git

Posted Dec 10, 2021 17:06 UTC (Fri) by atnot (subscriber, #124910) [Link] (14 responses)

I feel like this could be really useful for smaller performance regressions? Measure the distribution of results for the good and bad commit, then set the confidence value to the probability of the result being part of the good/bad distribution.

Stochastic bisection in Git

Posted Dec 10, 2021 20:05 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (13 responses)

> Measure the distribution of results for the good and bad commit, then set the confidence value to the probability of the result being part of the good/bad distribution.

The latter probability has two possible interpretations, which are both non-ideal:

1. The probability that, if we drew a random sample from the good/bad distribution, it would be at least as extreme/unlikely as the actual result. This is called the p value, and is widely used in (frequentist) statistics, but you have to be careful when using the p value because it is very easy to misinterpret as (2). A small p value may indicate that you can rule out a possibility, but a large p value tells you nothing useful. The same result might plausibly originate from either distribution, if the distributions significantly overlap, and in that case, the p values will be large for both distributions, which is obviously unhelpful. (If they don't overlap much or at all, then you can just use deterministic bisection together with "rerun the test a couple of times" and the whole problem is trivial.)
2. The probability that the random sample was actually drawn from the good/bad distribution. This is obviously a lot closer to what the --confidence flag wants, but the problem with this probability is that it either doesn't exist at all (frequentist interpretation) or exists but must be calculated from a prior probability (Bayesian interpretation), which is a subjective and arbitrary input value.

The reason we need a prior for (2) is complicated, but essentially boils down to this: We need to know which distribution was more likely to begin with. If a commit is just a simple refactor, with no changes in actual behavior, then that commit is much less likely to cause the regression than (say) a commit which completely rewrites the core algorithm, and the Bayesian interpretation says that we need to include that sort of information in our analysis in the form of prior probabilities. The problem is, there is no universally good way to determine such priors, because the whole point of a prior is that you should come up with it before you've actually looked at the performance data (that's why it's called a "prior" - it comes first). The other problem is that you can't just use 50% everywhere, because you have to update your beliefs across the entire chain of commits every time you do a test. So you could start with 50-50 priors for the first commit, but you had better factor in the outcome of the first test to your priors in subsequent tests, or else the posterior probabilities will be invalid. This is annoying, but not particularly hard in terms of actual theory.

Stochastic bisection in Git

Posted Dec 11, 2021 8:30 UTC (Sat) by epa (subscriber, #39769) [Link] (12 responses)

Yes, if a commit is just changing white space or comments, it’s very unlikely to change the behaviour. A commit renaming a variable is only slightly more interesting, as is one that deletes a function entirely (since it clearly wasn’t being called). These can go into prior probabilities for a stochastic test. But they are also useful for ordinary bisection. ‘git bisect’ will often waste time testing a commit which is a trivial change. It will narrow down in the end, of course, but you want your midpoint in binary search to be where there is a half chance of the sought item being before the midpoint. If you don’t know anything about the items you are searching then you just pick the halfway index. But with some prior probabilities you can do better. Give each commit a score (say 1 for a mere white space change, 5 for renaming a function, 50 for changing a for-loop to a while-loop, 500 for some real programming) and ‘git bisect’ picks a midpoint where the sum of scores at or before it equals the sum after it. If the priors are generally reliable, this should mean bisecting in slightly fewer steps.

Stochastic bisection in Git

Posted Dec 14, 2021 13:33 UTC (Tue) by geert (subscriber, #98403) [Link] (11 responses)

What is a "trivial" change that is unlikely to change the behavior? E.g. removing an unused function impacts memory location, and thus may have an impact on cache usage and performance.

The hardest bugs are related to changes that look trivial, non-suspicious, and passed both review and testing.

Note that if you know which driver or subsystem contains the buggy change, you can reduce the search space by passing a path to git bisect. I barely use that, though. I'd rather do a full "dumb and numb" bisect to pinpoint the real issue with (some) certainty, than try to be smart, only to discover my initial guess was wrong, and have to do the full bisect anyway later.

Stochastic bisection in Git

Posted Dec 14, 2021 17:03 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (4 responses)

Indeed. Just yesterday I was bisecting CPython to track down a behavior change introduced in 3.10[1] and with jumping around, incremental builds weren't stable, so I ended up with a lot of `skip` entries before realizing that a clean build was needed (not just a clean install to build against). Future work to say "crap, I gave some bad information" and rewind a few steps would help quite a bit. Another one is "please carry around this patch to anywhere that is jumped to because it needs fixed in order to test my problem, but is not the problem I'm trying to figure out right now". Sometimes you can just leave the diff in the working tree, but I'd rather `git bisect` know what it is for and give me a conflict resolution flow instead of just refusing to jump to the next candidate commit.

(Sure, I could post these to the ML, but they're just as likely to rot as pie-in-the-sky ideas as here :) .)

[1] For anyone curious, there's new initialization behavior where modifying `sys.path` before `Py_Main` is nuked from orbit when `Py_Main(argc, argv)` reinitializes the interpreter state. You now need to give `argc` and `argv` via `PyConfig` and initialize that way and then use `Py_RunMain()` to avoid this. Available from 3.8, but the behavior change basically forces it for 3.10+.

Stochastic bisection in Git

Posted Dec 14, 2021 19:43 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (3 responses)

Much of this functionality already exists:

> Future work to say "crap, I gave some bad information" and rewind a few steps would help quite a bit.

git bisect log and git bisect replay already do exactly that.

> Another one is "please carry around this patch to anywhere that is jumped to because it needs fixed in order to test my problem, but is not the problem I'm trying to figure out right now". Sometimes you can just leave the diff in the working tree, but I'd rather `git bisect` know what it is for and give me a conflict resolution flow instead of just refusing to jump to the next candidate commit.

You can build it yourself out of git bisect run, although built-in functionality would be Nice To Have. But if you've already turned the test into a single command, writing a wrapper script that applies/reverts a fixed patch is not *that* much extra work.

Stochastic bisection in Git

Posted Dec 14, 2021 20:50 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (1 responses)

> git bisect log and git bisect replay already do exactly that.

Oh nifty. I knew of `log`, but had missed `replay`. However, I feel that `git bisect undo [n]` would also be useful (at least to avoid touching files all over the place in the process of restarting since I suspect that the replay is not done purely in-index quite yet[1]).

> You can build it yourself out of git bisect run, although built-in functionality would be Nice To Have. But if you've already turned the test into a single command, writing a wrapper script that applies/reverts a fixed patch is not *that* much extra work.

Sure, but that's generally a "hindsight is 20/20" situation. The problem is usually that such things are discovered in the process of bisecting, so by the time I have a robust reproducer script to give to `git bisect run`, I've already done a bisection…

[1] Willing to be pleasantly surprised though :) .

Stochastic bisection in Git

Posted Dec 14, 2021 21:24 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

You could just use something like this script:

#!/bin/bash

patch -p2 /path/to/file.diff # Replace 2 with the correct number.
git show -s --oneline
echo "Press ^D when finished testing this commit."
bash
echo -n "Was this commit (G)ood, (B)ad, or (S)kip? "
read REPLY
git reset --hard HEAD
case "$REPLY" in
g|good) exit 0;;
b|bad) exit 1;;
s|skip) exit 125;;
*) echo "Unrecognized input, aborting."; exit -1;;
esac

You would still need to handle the "what if patch(1) fails?" case, though.

Stochastic bisection in Git

Posted Dec 23, 2021 17:54 UTC (Thu) by zblaxell (subscriber, #26385) [Link]

'git bisect edit-log', maybe? Something similar to 'git rebase --edit-todo', so we don't all have to write the same shell script that chops off the last line between 'git bisect log' and 'git bisect replay'.

Speaking of rebase, that would be an awesome workflow for git bisect. Check out the commit to be tested, then cherry-pick a list of patches, with quick options to skip or take over manually (or leverage the git-rebase machinery). When a test result is received, rebase the list of patches to the next commit to be tested. Keep the rebase branch heads around in the log, so that later bisects over the same range don't need to have rerere done to them over and over.

I currently do this by having a script check out the revision chosen by bisect in a separate working tree, cherry-pick a stack of patches (and ignore patches that are known to fail if and only if they're not needed), then build and test that instead of what bisect has checked out. Sometimes I need to have different patches on different revision ranges but it doesn't happen often enough to bother automating it.

That's all assuming I don't give up on git bisect completely, and just take a flat list of commits and run the bisection algorithm myself. After implementing the editable log, the cherry-picking, and adding other conveniences like a place to store annotations about test results (admittedly this is most useful for recording statistical data to compute confidence in the results), getting a list of revisions to test and running a search algorithm against them is not that much extra work...and then what's left for git bisect to do?

Stochastic bisection in Git

Posted Dec 14, 2021 19:54 UTC (Tue) by epa (subscriber, #39769) [Link] (5 responses)

By a trivial change I meant one that only changes white space or comments. Slightly less trivial is to rename a function. Actual coding -- even if "obviously correct" -- gets a much higher probability of changing the behaviour. I was talking about getting 'git bisect' to try testing real changes rather than ones that have no effect on the object code produced (once symbols are stripped).

Stochastic bisection in Git

Posted Dec 14, 2021 22:32 UTC (Tue) by nix (subscriber, #2304) [Link] (4 responses)

That's suddenly language specific, though, and git has always tried to avoid that sort of thing in its core mechanics. I guess you'd need a new .gitattributes attribute to allow you to say what sort of thing is considered significant to a given language (most importantly: spaces and tabs are significant to languages like Python).

Stochastic bisection in Git

Posted Dec 15, 2021 10:05 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

Or, in the grand tradition of Unix, invoke an external tool (configurable) to make the judgement call - git already knows about external diff and merge tools (which in theory can be language-specific), so an external tool to identify the "significance" of a change isn't a big stretch.

Stochastic bisection in Git

Posted Dec 15, 2021 16:37 UTC (Wed) by nijhof (subscriber, #4034) [Link] (1 responses)

You could probably already do that today in git bisect run. In the command/script:

   if <this commit is insignificant>;   then
      exit 125 # tell git bisect to skip this commit.
   fi

Stochastic bisection in Git

Posted Dec 15, 2021 16:47 UTC (Wed) by epa (subscriber, #39769) [Link]

Yes, you can skip it altogether, but only if you're 100% absolutely certain it cannot affect the result.

I have lots of commits like 'whitespace' or 'renamed variable' or 'changed from private to public' where I am only 99% sure. So I would like git bisect to mostly disregard them when picking its midpoint to test, but of course, if one of these commits is the only candidate between the current bad and good labels, it should be built and tested.

That will also speed up bisection by making the search tree shallower (assuming the prior probabilities or scores assigned to each commit are broadly reliable). Doing the normal bisection and then skipping an individual commit will give a smaller speedup.

Stochastic bisection in Git

Posted Dec 15, 2021 12:15 UTC (Wed) by epa (subscriber, #39769) [Link]

Yes, you would have to provide an external command to give a "score" to each commit, or add them with git tags, or something.
Git itself can't know that information.

Stochastic bisection in Git

Posted Dec 10, 2021 17:46 UTC (Fri) by MrWim (subscriber, #47432) [Link] (1 responses)

See also BBChop: https://github.com/Ealdwulf/BBChop

> BBChop is like 'git bisect' (or equivalent), but works when your bug is intermittent. That is, it works in the presence of false negatives (when a version happens to work this time even though it contains the bug). It assumes that there are no false positives (in principle, the same approach would work, but adding it may be non-trivial).
> ...
> BBChop is based on Bayesian Search Theory
(http://en.wikipedia.org/wiki/Bayesian_search_theory ).

I've not used BBChop, it's just something that I came across a while ago - but it seems relevant here.

Note also BBChop seems to have a great weakness:

> Unlike 'git bisect', bbchop is not integrated with a version control system. It has hooks to allow it to be integrated with a version control system

Stochastic bisection in Git

Posted Dec 10, 2021 19:33 UTC (Fri) by JoeBuck (subscriber, #2330) [Link]

If the hooks are available for common version control systems, that isn't a weakness, it's a strength (if the same flow can be used with git, mercurial, subversion, Perforce, etc).

Stochastic bisection in Git

Posted Dec 10, 2021 20:33 UTC (Fri) by droundy (subscriber, #4559) [Link]

It would be lovely to have a way to introduce an unknown uncertainty in a result, and have bisect automatically rerun and estimate the uncertainty based on what it sees, under the assumption that all results of the same type have the same uncertainty. Statistically this is possible, and works save the nuisance of having to rerun many times before starting in order to determine the likelihood of a false success.

Stochastic bisection in Git

Posted Dec 10, 2021 21:04 UTC (Fri) by andy_shev (subscriber, #75870) [Link] (1 responses)

When `git bisect bad --confidence 0.1` becomes equal to `git bisect good --confidence 10.0`?

Stochastic bisection in Git

Posted Dec 13, 2021 16:20 UTC (Mon) by jengelh (guest, #33263) [Link]

more like 0.9

Cool idea!

Posted Dec 11, 2021 3:11 UTC (Sat) by david.a.wheeler (subscriber, #72896) [Link]

This is a really cool idea. I can't wait to try it out.

Stochastic bisection in Git

Posted Dec 11, 2021 12:41 UTC (Sat) by fsateler (subscriber, #65497) [Link] (2 responses)

I couldn't figure out how this would work with `git bisect run`. Or is that simply not supported?

Stochastic bisection in Git

Posted Dec 12, 2021 14:19 UTC (Sun) by gray_-_wolf (subscriber, #131074) [Link]

I do not know but would hope so. Especially if the testing cycle is very cheap
compute-time wise, it would make it possible to look even for very rare bugs
and just letting the bisect run over night with low confidence on the good
results..

Stochastic bisection in Git

Posted Dec 14, 2021 13:29 UTC (Tue) by jan.kara (subscriber, #59161) [Link]

In principle this can work but I did not implement it yet - we need some way for the 'run-script' to report the decision confidence to the git bisect algorithm. So once things settle for stochastic bisection, I'll look into this as well.