
Mergiraf: syntax-aware merging for Git

By Daroc Alden
October 31, 2025

The idea of automatic syntax-aware merging in version-control systems goes back to 2005 or earlier, but initial implementations were often language-specific and slow. Mergiraf is a merge-conflict resolver that uses a generic algorithm plus a small amount of language-specific knowledge to solve conflicts that Git's default strategy cannot. The project's contributors have been working on the tool for just under a year, but it already supports 33 languages, including C, Python, Rust, and even SystemVerilog.

Mergiraf was started by Antonin Delpeuch, but several other contributors have stepped up to help, of whom Ada Alakbarova is the most prolific. The project is written in Rust and licensed under version 3 of the GPL.

The default Git merge algorithm ("ort") is primarily line-based. It does include some tree-based logic for merging directories, but changes within a single file are merged on a line-by-line basis. That can lead to situations where two logically separate changes that affect the same line cause a merge conflict.

Consider the following base version:

    void callback(int status);

And then suppose that one person makes the function fallible:

    int callback(int status);

While someone else changes the argument type:

    void callback(long status);

The default merge algorithm can't handle that, because there are conflicting changes to the same line. Syntax-aware merging, however, is based on the syntactical elements of the language, not individual lines. So, for example, Mergiraf can resolve the above conflict like this:

    int callback(long status);

From its point of view, the changes don't actually overlap, because the return type and the argument type are treated as separate, non-overlapping regions. This kind of syntax-aware merging has been bandied about for many years, but the complexity of writing a merge algorithm for syntax trees kept it from really being practical for widespread use. Spork, an implementation of the idea for Java, was released in 2023, showing that it was actually feasible. Mergiraf attempts to extend that Java-specific algorithm to programming (and configuration or markup) languages in general.
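To make the "non-overlapping regions" point concrete, here is a minimal Python sketch of a three-way merge done token by token rather than line by line. It is not Mergiraf's algorithm, which works on tree-sitter syntax trees; it assumes all three versions split into the same number of tokens, and the names tokenize and merge_tokens are hypothetical.

```python
import re


def tokenize(source):
    # Split into words (identifiers, keywords) and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", source)


def merge_tokens(base, left, right):
    """Three-way merge, token by token; assumes equal token counts."""
    b, l, r = tokenize(base), tokenize(left), tokenize(right)
    if not (len(b) == len(l) == len(r)):
        raise ValueError("sketch only handles versions with equal token counts")
    merged = []
    for tb, tl, tr in zip(b, l, r):
        if tl == tr:        # both sides agree (possibly both unchanged)
            merged.append(tl)
        elif tl == tb:      # only the right side changed this token
            merged.append(tr)
        elif tr == tb:      # only the left side changed this token
            merged.append(tl)
        else:               # both sides changed it, in different ways
            raise ValueError(f"conflict: {tl!r} vs {tr!r}")
    return merged


print(merge_tokens("void callback(int status);",
                   "int callback(int status);",
                   "void callback(long status);"))
# prints ['int', 'callback', '(', 'long', 'status', ')', ';']
```

The return type and the argument type land in different token positions, so each position has at most one side that differs from the base. A line-based merge sees only one unit, the whole line, changed on both sides.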

The design

Mergiraf relies on the tree-sitter incremental parsing library to convert individual languages into generic syntax trees where each leaf corresponds to a specific token in the file, and each internal node represents a language construct. However, Mergiraf itself needs relatively little information about each language to work. Instead, it uses a non-language-specific tree-matching algorithm to guide conflict resolution, plus a small amount of language knowledge layered on top. This design is part of the reason that the tool has been adapted to so many different languages.

The Mergiraf algorithm starts by doing a regular line-based merge; if that succeeds, as it often does, then the program doesn't need to resort to the more expensive tree-based merging algorithm. Even if a line-based merge fails, however, it often fails only in a few locations. When parsing the different versions of the file being merged, Mergiraf can mark any parts of the syntax tree that were resolved without conflicts by the line-based merge as not needing changes, allowing it to focus only on the conflicting parts. This provides a substantial speedup, especially for large files.
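The "cheap pass first, expensive pass only where needed" structure can be sketched as follows. This is an illustration under strong assumptions, not Mergiraf's code: the three versions are assumed to have the same number of lines, and a whitespace-separated word merge (the hypothetical word_merge) stands in for the real tree-based merge.

```python
def merge3(base, left, right):
    """Three-way merge of one unit; returns (value, ok)."""
    if left == right:
        return left, True
    if left == base:
        return right, True
    if right == base:
        return left, True
    return None, False


def word_merge(base, left, right):
    """Stand-in for the tree-based merge: merge word by word, or return None."""
    b, l, r = base.split(), left.split(), right.split()
    if not (len(b) == len(l) == len(r)):
        return None
    words = []
    for triple in zip(b, l, r):
        value, ok = merge3(*triple)
        if not ok:
            return None
        words.append(value)
    return " ".join(words)


def merge_file(base, left, right, fallback=word_merge):
    """Line-based pass first; the fallback only sees conflicting lines.

    Assumes equal line counts, which a real line-based merge does not need.
    """
    merged, unresolved = [], []
    for i, (b, l, r) in enumerate(zip(base, left, right)):
        value, ok = merge3(b, l, r)
        if not ok:
            # The cheap pass failed on this line only; try the finer merge.
            value = fallback(b, l, r)
            if value is None:
                unresolved.append(i)
        merged.append(value)
    return merged, unresolved


base = ["#include <a.h>", "void callback(int status);"]
left = ["#include <a.h>", "int callback(int status);"]
right = ["#include <b.h>", "void callback(long status);"]
print(merge_file(base, left, right))
# prints (['#include <b.h>', 'int callback(long status);'], [])
```

Only the second line ever reaches the fallback; the first is settled by the line-based pass because just one side touched it.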

For the remaining parts, the tool uses the GumTree algorithm to find fuzzy matches between the remaining subtrees. Identifying the matches is enough to produce a diff, but it doesn't provide enough information on its own to resolve any conflicts. Next, Mergiraf flattens the syntax tree into a list of facts about how the nodes in the tree are related to each other. These facts are tagged with whether they came from the base, left, or right revision of the merge (i.e., the most recent common ancestor, the commit being merged into, and the commit being merged). Then a new syntax tree is reconstructed from the merged list of facts. If a fact from the base revision conflicts with another fact, it is discarded. If two facts from the left and right revisions disagree, that indicates an actual conflict that Mergiraf cannot resolve.
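The fact-merging step can be illustrated with a small Python sketch. It simplifies heavily: nodes are assumed to have already been matched across revisions and given stable names, and each revision is flattened to a dictionary with hypothetical keys such as ("parent", node) and ("text", node). The encoding Mergiraf actually uses is richer than this.

```python
def merge_facts(base, left, right):
    """Merge three {fact_key: value} dicts; returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(left) | set(right)):
        b, l, r = base.get(key), left.get(key), right.get(key)
        if l == r:
            value = l               # both revisions agree
        elif l == b:
            value = r               # only right differs: the base fact loses
        elif r == b:
            value = l               # only left differs: the base fact loses
        else:
            conflicts.append(key)   # left and right disagree: real conflict
            continue
        if value is not None:       # None means the fact was deleted
            merged[key] = value
    return merged, conflicts


base = {("parent", "helper"): "module_a", ("text", "helper"): "return 1"}
# Left edits the node's contents; right moves the node to another parent.
left = {("parent", "helper"): "module_a", ("text", "helper"): "return 2"}
right = {("parent", "helper"): "module_b", ("text", "helper"): "return 1"}

print(merge_facts(base, left, right))
# prints ({('parent', 'helper'): 'module_b', ('text', 'helper'): 'return 2'}, [])
```

Because the edit and the move touch different facts, neither contradicts the other. If both sides changed ("text", "helper") to different values, that key would be reported in the conflicts list instead.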

The advantage of this approach is that it eliminates the kind of move/edit conflicts that plague the ort algorithm: if one revision edits the internal parts of some part of the program, and the other revision relocates that part of the program, those facts don't contradict one another. On the other hand, if both revisions edit the exact same part of the program, that does represent a real conflict that a human should really look at.

For edits in some languages, though, Mergiraf can use language-specific knowledge to resolve even conflicts like this. For example, consider the following change to a Rust structure:

    // Base version
    struct Foo {
        field1: Bar,
    }

    // Left revision
    struct Foo {
        field1: Bar,
        new_field_left: Baz,
    }

    // Right revision
    struct Foo {
        field1: Bar,
        new_field_right: Quux,
    }

This is a merge conflict because a line-based algorithm couldn't tell which order to add the new lines in — and which order lines appear in a program is usually important. In Rust, however, the compiler is allowed to rearrange structure fields as it sees fit (unless the structure is marked #[repr(C)] or one of the other repr settings — which seems to be a known bug in the current version of Mergiraf). Therefore, this merge conflict can be resolved automatically by putting the lines in any order. The resulting merged program has the same behavior either way. On the other hand, that wouldn't be a correct way to resolve the equivalent merge conflict in C, because, in C, the order of members in a structure can affect the correctness of the program.

When a syntactic element's children can be freely reordered without changing the meaning of the program, Mergiraf calls it a "commutative parent". Part of the language-specific information that Mergiraf needs is a list of which parts of the language are commutative parents, if any. A commutative parent isn't a get-out-of-jail-free card for merge conflicts, though: if two revisions add fields with the same name and different types, for example, that would still be a conflict. In such cases, Mergiraf uses an additional piece of language-specific information to put the conflicting lines close together, so that the resulting conflict markers pinpoint the problem as precisely as possible.
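A rough Python sketch of merging the children of a commutative parent follows. It treats a structure's fields as a {name: type} dictionary so that order carries no meaning; the function name merge_fields and the field name extra are hypothetical, and Mergiraf's real handling works on syntax-tree nodes rather than dictionaries.

```python
def merge_fields(base, left, right):
    """Merge the children of a commutative parent, given as {name: type}."""
    # Base fields first, then additions from left, then additions from right.
    names = list(base)
    for side in (left, right):
        names += [n for n in side if n not in names]

    merged, conflicts = {}, []
    for name in names:
        b, l, r = base.get(name), left.get(name), right.get(name)
        if l == r:
            value = l
        elif l == b:
            value = r
        elif r == b:
            value = l
        else:
            conflicts.append(name)  # same name, different types on each side
            continue
        if value is not None:       # None means the field was removed
            merged[name] = value
    return merged, conflicts


base = {"field1": "Bar"}
left = {"field1": "Bar", "new_field_left": "Baz"}
right = {"field1": "Bar", "new_field_right": "Quux"}
print(merge_fields(base, left, right))
# prints ({'field1': 'Bar', 'new_field_left': 'Baz', 'new_field_right': 'Quux'}, [])

# Both sides add a field with the same name but different types.
print(merge_fields(base, {"field1": "Bar", "extra": "u32"},
                   {"field1": "Bar", "extra": "u64"}))
# prints ({'field1': 'Bar'}, ['extra'])
```

Putting left's additions before right's is an arbitrary choice in this sketch; for a commutative parent either order gives an equivalent program.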

Using it

When I encountered it, Mergiraf's approach sounded promising, but I was curious about how much of a difference it would actually make in real-world use of Git. The Linux kernel repository contains, at the time of writing, 7,415 merge commits that, when replayed using the default merge algorithm, result in conflicts. These are the merge commits that would have had to be fixed by hand, although it's probably an underestimate of the number of merge conflicts that kernel developers have had to deal with. It doesn't include merge conflicts that would have appeared during rebasing, for example, because information about rebases isn't included in the Git history for analysis.

After extracting a list of every merge conflict in the kernel's Git history, I tried using Mergiraf to resolve them. 6,987 still resulted in conflicts, but 428 were resolved successfully. A much larger fraction of merge conflicts were still partially resolved. Should those results generalize, which I think is likely, adopting Mergiraf could reduce the number of merge conflicts requiring manual resolution by a small amount (a little under 6% of the conflicting merges in this sample), which could still save valuable maintainer time.
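For a sense of scale, the figures above work out as follows. This is plain arithmetic on the numbers quoted in the article, not an additional measurement.

```python
total = 7415       # merge commits that conflict under the default algorithm
resolved = 428     # fully resolved by Mergiraf
unresolved = 6987  # still conflicting after Mergiraf

# The two outcomes account for every conflicting merge.
assert resolved + unresolved == total

print(f"fully resolved:    {resolved / total:.1%}")    # prints 5.8%
print(f"still conflicting: {unresolved / total:.1%}")  # prints 94.2%
```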

The tool itself has two interfaces: one that can be run by hand on a file with conflict markers (such as those produced by ort) in order to attempt to resolve conflicts, and one that can be used by Git automatically. Running "mergiraf solve <path>" will read the conflict markers in the given file and attempt to resolve them. Adding this snippet to one's Git configuration and setting the driver as the default in .gitattributes will use Mergiraf as the Git merge driver from the beginning:

    [merge "mergiraf"]
        name = mergiraf
        driver = mergiraf merge --git %O %A %B -s %S -x %X -y %Y -p %P -l %L

When Mergiraf is invoked by Git, the user can review the conflicts that it encountered and how it resolved them by running "mergiraf review". For people who don't have a merge conflict handy, Mergiraf has an example repository containing various kinds of conflicts, in order to show how Mergiraf resolves them. The tool also works with Jujutsu, and likely with other version-control systems, as long as they use the same merge-conflict syntax as Git.

Programmers have gotten along just fine without Mergiraf, so it isn't necessarily something that everyone will want to add to their set of programming tools. But few people enjoy running into merge conflicts, and tools that can help intelligently resolve them — especially the ones that are obvious to a human, and therefore a waste of time to deal with — are an attractive prospect.




Can we trust it?

Posted Oct 31, 2025 23:44 UTC (Fri) by alx.manpages (subscriber, #145117) [Link] (4 responses)

Linux kernel numbers:
> After extracting a list of every merge conflict in the kernel's Git history, I tried using Mergiraf to resolve them. 6,987 still resulted in conflicts, but 428 were resolved successfully.

Did those 428 result in the same results that the human did? Where it differed, was it in a good or neutral way, or did it do something wrong?

---

Rust example:
> Therefore, this merge conflict can be resolved automatically by putting the lines in any order. The resulting merged program has the same behavior either way.

What if those differently named members should actually be the same thing, just that they were written slightly differently? A human would realize and merge them into a single thing. An automated algorithm will not. That could be problematic.

I tend to not trust this level of automagic. Even if sometimes while working with git(1) I think a better algorithm could do it, then I stop and think about how bad it can go, and I'm glad that git(1) is just a stupid content tracker. Stupid tools are great!

Can we trust it?

Posted Nov 1, 2025 3:30 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (2 responses)

> What if those differently named members should actually be the same thing, just that they were written slightly differently? A human would realize and merge them into a single thing. An automated algorithm will not. That could be problematic.

If they have the same name, then this will result in a compile error and somebody will notice eventually.

If they have different names, then you are proposing an evil merge, because whichever one you rename will need to be renamed globally throughout the entire project, and that goes well beyond the scope of conflict resolution.

Can we trust it?

Posted Nov 1, 2025 9:21 UTC (Sat) by alx.manpages (subscriber, #145117) [Link] (1 responses)

> If they have different names, then you are proposing an evil merge, because whichever one you rename will need to be renamed globally throughout the entire project, and that goes well beyond the scope of conflict resolution.

I've been ruminating about such conflicts. I think they occur much more often in rebases than in merges. And indeed, they require global renaming. Consider the case where two patch sets do something similar, but name their things differently. When one is applied, the other needs to be updated to be compatible. This kind of conflict happens relatively often when you hold patches downstream.

I do a lot of rebases, and every now and then I see one of these conflicts that require this or that are otherwise difficult for a human to resolve. The usual thing I do to minimize conflicts when they are large is to abort the rebase, and try again in smaller steps. If I'm rebasing on top of a branch that has several commits that conflict, I'll rebase many times, one commit at a time, so most of the rebases will be conflict-less, and those that conflict will have smaller conflicts that can be resolved more easily.

On the other hand, I've also been ruminating about the dangers of automating this, and maybe it's not so bad. git(1) reports the strategy when it has done a merge, so if it reports that it has resorted to the mergiraf strategy, one could be alerted and review the resulting merge commit more carefully. Or if it's been a rebase, one could use git-range-diff(1) to confirm that everything looks good.

So, maybe it's something acceptable, if humans remember to review what this strategy has done. Maybe I'm too scared, but I still use git-send-email(1) and git-am(1), and not the automagic b4(1). Most of the time, b4(1) could be helpful, but then eventually things happen like the recent issue that Kees Cook had. I tend to prefer simpler tools.

Can we trust it?

Posted Nov 3, 2025 21:05 UTC (Mon) by pabs (subscriber, #43278) [Link]

Can we trust it?

Posted Nov 13, 2025 8:34 UTC (Thu) by bjackman (subscriber, #109548) [Link]

It kinda has mistrust built-in: while I think it's possible to run in yolo-mode (which is surely useful for quick experiments), the main mode has an explicit "review the results" step.

I think the idea is less "automate merges" and more "do the physical toil of resolving trivial conflicts".

Personally I haven't found its success rate is good enough for C code to actually be worth adopting. But I think there's a clear use case for stuff like rebasing branches. I'm forever resolving very boring conflicts like "we both added a new #include to the top of a file" or "we both deleted some code, but the code we both deleted was slightly different". Just taking out the keyboard schlep from that kinda thing is quite attractive.

Not convinced this is desirable

Posted Nov 1, 2025 1:48 UTC (Sat) by SLi (subscriber, #53131) [Link] (19 responses)

I've seen ideas like this pop up over the years, yet I've never been convinced this would actually be a good idea. Another use case often quoted is that of someone having renamed a variable.

The thing is, someone changing the signature of a function or the name of a variable usually signifies more than just that mechanical action. That name or return type didn't just randomly change; it changed for a reason.

In merges, a merge conflict is the safe path of "I'm not sure what to do with this, so why don't you look". I argue that's what we should also do if someone has made one of these changes.

That's not to say that merging couldn't be more intelligent than the current text-based approach. The point is that *some* intelligent judgment is involved in the kind of merge conflicts that these approaches try to solve automatically. Computers may or may not be able to provide that judgment, but I don't think a simple mechanical rule is that.

Not convinced this is desirable

Posted Nov 1, 2025 3:15 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (16 responses)

After any merge, you still want to run the tests. Just because a line-based merge algorithm succeeded does not mean that it is OK.

Consider:

Base:

```
int foo(int bar);

int f(void) {
    return foo(0);
}
```

Left:

```
int frobnitz(int bar);

int f(void) {
    return frobnitz(0);
}
```

Right:

```
int foo(int bar);

int f(void) {
    return foo(0);
}

int g(void) {
    return foo(1);
}
```

Git is going to merge this just fine and end up with:

```
int frobnitz(int bar);

int f(void) {
    return frobnitz(0);
}

int g(void) {
    return foo(1);
}
```

I call these "logical conflicts". The only way to detect them is to run the CI.

So yes, "smart" tools can get it wrong, but it's not like the solution is to trust the "stupid" tool either.

Not convinced this is desirable

Posted Nov 1, 2025 13:05 UTC (Sat) by iabervon (subscriber, #722) [Link] (1 responses)

This is a case where syntax-aware merges could be helpful: one side added a reference to a symbol while the other side changed the declaration of that symbol. The correct resolution may not be obvious, and the obvious resolution (if there is one) may not be correct, but it's feasible to indicate a problem and suggest a possible resolution.

Not convinced this is desirable

Posted Nov 2, 2025 1:54 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

I don't think (m?)any languages can do this kind of thing purely at a syntax level. Consider if `g` lives in a namespace with a different `foo` declared and the naïve resolution *is* correct. Or `foo` is overloaded and `g` wants another member of the overload set (yes, these are crazy, but…have you seen the spectrum of code that exists?). I think the "syntax-based resolution" is an improvement in that you get less needs-manual-resolution work to get to a "resolved" state. You still need to test it either way.

Not convinced this is desirable

Posted Nov 2, 2025 14:01 UTC (Sun) by SLi (subscriber, #53131) [Link] (13 responses)

I would say testing is needed anyway. But how big a portion of projects have 100% test coverage? Line-based merges are risky, but I'd still argue they are _less risky_ than things like this.

I do agree that when carefully reviewed this may save work. I've just seen it in people who I thought know better (me :D) how lazy one can get reviewing small changes that compile and seem to work!

Not convinced this is desirable

Posted Nov 3, 2025 5:30 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (12 responses)

I mean, we can turn this around and say: without 100% test coverage, how do you verify any normal contribution? Why is a merge conflict (regardless of whether solved by line, AST, LLM, or human) not also subject to the same scrutiny?

Not convinced this is desirable

Posted Nov 3, 2025 8:19 UTC (Mon) by SLi (subscriber, #53131) [Link] (11 responses)

Because a merge conflict solved by line merge is more likely to be correct. A change in a function type or a variable name likely signifies a semantic change to that function or variable, which is much more likely to cause incorrect behavior with an automatic merge than a random change to a completely different part of the source file.

Not convinced this is desirable

Posted Nov 3, 2025 11:06 UTC (Mon) by epa (subscriber, #39769) [Link] (1 responses)

I would like it if the compiler provided a bit more help to let me assert that a given change has no observable effect. If I rename a local variable in a compiled language, the resulting object code should be byte-for-byte unchanged -- or at least it *could* be unchanged and still be a valid compilation of the new code. Reordering local declarations (into "reverse Christmas tree" or whatever you prefer) is a change that will typically result in different object code, but doesn't *have* to. In higher-level languages with inheritance, it's normally a no-op to change a declaration to a more specific type or less specific type, provided the code still compiles. Perhaps some kind of optimizer flag "-Onormalize" would omit debug information, inline everything possible, and generally try to ensure that a change with no semantic effect has no effect on the object code. It would not succeed in all cases.

I tend to write a long patch series where half the commits are "no-change changes", another 40% are fairly trivial reshufflings or adding unit tests or comments, and then the real behaviour change comes in one isolated commit (perhaps followed by some more "no-change" cleanups). I do make mistakes and accidentally break something when I thought I was making a pure refactoring. If the compiler could help me, it could also help check the result of a merge. At least some of the merges will be whitespace or whatever, and the human programmer only needs to review those where the compiler can't prove the change was harmless.

Not convinced this is desirable

Posted Nov 3, 2025 11:35 UTC (Mon) by SLi (subscriber, #53131) [Link]

I agree. It's a hard problem in general, but often I find myself hoping that I could give multiple implementations of a snippet of code and say "choose the one you think is fastest, and if you see any difference in behavior, alert me".

But particularly for refactoring I think something like this could be, sometimes, doable (probably most often stymied by assumptions that hold but the compiler does not know about).

Not convinced this is desirable

Posted Nov 3, 2025 14:24 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (8 responses)

> Because a merge conflict solved by line merge is more likely to be correct.

[citation needed]

I've seen enough logical merge conflicts (the code text merged cleanly, but the system is still unhappy) that would require an LSP-like system to detect because the conflicts occurred in different files; something even mergiraf is blind to. To me, syntax-aware merging is more about getting a resolution without human involvement so that more time can be spent on dealing with the fallout of whatever the resolution is.

Not convinced this is desirable

Posted Nov 3, 2025 15:02 UTC (Mon) by Wol (subscriber, #4433) [Link] (5 responses)

Well it's making the assumption (certainly incorrect for C) that lines are syntactic / semantic units!

For example I sometimes put an if/then/else on one line, sometimes on three, and sometimes on five. Okay, I have a very simple rule of thumb how many lines I spread it across, but there's nothing stopping me putting a simple one across five lines instead of one, or a complex one across one line instead of five.

Cheers,
Wol

Not convinced this is desirable

Posted Nov 3, 2025 16:02 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (4 responses)

Many of the projects I work on are collaborative efforts. If I cannot write down the formatting rules in a way tools can enforce them, they are simply an additional review burden. I'd *much* rather a tool do the enforcement rather than saying "no, this is not how we do things" over and over for "cosmetic" things even if it means living somewhere on the slopes of the perfect ideal rather than the summit. For projects where you can feed everyone working on it with one pizza, sure, do your bespoke formatting rules, but for anything beyond that, please add a knob to the relevant formatter and just make it happen so the already scarce review resources can focus on "more important" parts of the review.

Not convinced this is desirable

Posted Nov 3, 2025 17:30 UTC (Mon) by Wol (subscriber, #4433) [Link] (3 responses)

Dare I suggest you read the thread, not just the parent post? A line in C can contain as many logical units as one cares to put there, which was my point. Applying semantic analysis line by line is likely to result in weird answers ...

As for collaborative efforts and tools, for my sins I'm currently programming in VBA. Collaboration is pretty much nil (I'm the SME (Subject Matter Expert)), and I wish I had some tools that would help collaboration!

The sooner we can ditch Excel and have something that "actually works" (tm), the better! :-)

Cheers,
Wol

Not convinced this is desirable

Posted Nov 4, 2025 4:04 UTC (Tue) by mathstuf (subscriber, #69389) [Link] (2 responses)

Hmm? I'm the GP of that comment, so maybe I've just missed your point completely.

A line-based merge conflict is more likely to happen than something that "sees" the code through an AST. Either way, one should test the results of a merge. I see "smarter" conflict resolution as a way to get to that testing faster than having to manually dissect it.

Not convinced this is desirable

Posted Nov 4, 2025 9:59 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

To flesh it out in full, there was the (unsupported) statement that line-based merges were less likely to be error prone than syntax based.

Was it you (mathstuf) that challenged it? I think so. I then agreed with you, pointing out that lines have no syntactical/semantic meaning in C, and used the example of splitting an "if" statement over 1, 3, or 5 lines.

I then got called on my use of "random" formatting of source, hence my comment to look back at the whole thread.

A typical case of - presumably - someone in "unread comments" mode responding to the post in front of them, and not the context in which that post was written. I'm lucky to have a very good memory, so usually if I want to respond to a post out of context I can remember that context, or I get alarm bells ringing that that's not what the poster meant ... I get the impression that sort of memory is not common :-) (which is why I tend to be generous in my quoting of previous posts - to try and prevent that sort of mis-understanding).

Cheers,
Wol

Not convinced this is desirable

Posted Nov 4, 2025 12:38 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

> Was it you (mathstuf) that challenged it? I think so. I then agreed with you, pointing out that lines have no syntactical/semantic meaning in C, and used the example of splitting an "if" statement over 1, 3, or 5 lines.

Ah, sorry. It was not clear that it was agreement; it read as an assertion. Rereading more closely, I now see that "it" is referring to "line-based" rather than "AST". Sorry about that.

> A typical case of - presumably - someone in "unread comments" mode responding to the post in front of them, and not the context in which that post was written.

Indeed, I go through unread comments most of the time.

Not convinced this is desirable

Posted Nov 3, 2025 17:26 UTC (Mon) by SLi (subscriber, #53131) [Link] (1 responses)

I don't think the existence of bad merges when current merge algorithms "succeed" says much about the non-existence of additional bad merges given an algorithm that succeeds more often by clever trickery.

But really the cases which are caught by the compiler or the unit tests are the benign ones; it's not those that I'm worried about.

In general, I think "someone touched this same part; please resolve manually" by a dumb algorithm like line merge is a decent, but not perfect, proxy for this. The practical role of conflicts is not only to say "I couldn't find a way to merge these"—we could use non-context diffs and merges would (almost) always succeed when the diffs are "insert these characters at offset 13245"—but also to alert the user to the fact that their eyeballs are needed.

I'd go further and bet that the likelihood of two changes causing a "logical" merge conflict (say, as a simple model, git/patch is happy but the compiler is not) has an inverse relationship to the distance between the two changes in the file. (Would be interesting to investigate this...)

And that's why I think it's actually in many ways useful that edits to the same lines cause a merge conflict instead of a clever manual solution that is more likely to make the code compile and misbehave.

Not convinced this is desirable

Posted Nov 4, 2025 4:07 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

> I'd go further and bet that the likelihood of two changes causing a "logical" merge conflict (say, as a simple model, git/patch is happy but the compiler is not) has an inverse relationship to the distance between the two changes in the file. (Would be interesting to investigate this...)

Many of the cases I've seen are APIs being reworked (changing parameters, names, or adding deprecation markers) while another topic adds a new instance of the old API. These generally have more tree-based proximity than in-file proximity IME. And the more "core" the API, the farther reaching its usages can be.

Not convinced this is desirable

Posted Nov 1, 2025 7:22 UTC (Sat) by epa (subscriber, #39769) [Link] (1 responses)

For me, this would be most useful for ‘git stash pop’. I’ll often stash my work while I go and make some simple cleanup like renaming a variable, then get back to it. And even if the stash is applied cleanly, you’d still normally look at the changes before committing. I agree that perhaps ‘git merge’ should have an extra review step if the syntax-aware merging was used. It could even put conflict markers into the result to mark some sections as needing attention, despite the merge succeeding.

Not convinced this is desirable

Posted Nov 1, 2025 13:10 UTC (Sat) by pintoch (guest, #180098) [Link]

Hello, I am one of the authors of the tool. Exciting to see it mentioned here, thank you so much for the review! I'm surprised it was so effective on the kernel because it very often falls back on plain line-based merges for C code, due to the difficulty of parsing macros. Perhaps the kernel is well-behaved enough in its usage of preprocessor code.

About whether it's worth it and safe enough, it's perhaps worth mentioning that mergiraf encourages users to review its merge resolutions when they differ from a line-based merge, by displaying something like this (in the output of "git merge" for instance):

"INFO Mergiraf: Solved 2 conflicts. Review with: mergiraf review Parser.java_o0i2JL8B"

There is also the option of invoking mergiraf on a file-per-file basis with the "mergiraf solve" command.
There might be other ways to integrate this into git merge tools (such as meld, kdiff3…), so that the reviewing of its work fits better into people's existing workflows. I'm keen to explore that and welcome suggestions of tools which could be fitting. And of course we're still working on improving the core algorithm, with safety in mind.

Concerning "git stash" support: this is already available. If you install mergiraf as a git merge driver it will also be used for any git command that does merging, including "git revert" or "git stash apply" for instance. I don't think git lets the user configure for which of those commands the merge driver should be used, but maybe that could be added. I've already contributed some improvements to git in this area (without which mergiraf would be much less usable), and I'm very grateful to Junio C Hamano and Phillip Wood from the git project for making that possible.

changelog

Posted Nov 1, 2025 9:57 UTC (Sat) by ballombe (subscriber, #9523) [Link] (1 responses)

One case where this could be useful is for merging changelog entries, where the ordering does not really matter.

changelog

Posted Nov 13, 2025 21:32 UTC (Thu) by joey (guest, #328) [Link]

And dpkg-mergechangelogs can do that for one changelog format. (As you probably know..) It's saved me a good amount of busywork over the years.

Merge for human language documents?

Posted Nov 1, 2025 10:39 UTC (Sat) by marcel.oliver (subscriber, #5441) [Link] (3 responses)

I looked at the list of supported languages/formats, and, to my disappointment, TeX/LaTeX is not there. As I use git routinely for writing (mostly scientific papers, but also pretty much everything else), this is something that would help a lot.

The most problematic part is actually not the LaTeX markup itself (nontrivial changes are infrequent, and math needs to be checked carefully by hand anyway), but the fact that the majority of "modern" text editors seem to prefer super-long lines with display-time line wrap (Overleaf is one of the offenders, despite their otherwise very useful git integration). Thus, a line is often an entire paragraph of text, so that line-based merging produces conflicts that could presumably be resolved automatically if, e.g., different sentences (as terminated by a period) were treated as independent units.

Worse, different editors routinely mess differently with linebreaks and whitespace (removal of trailing space, autofill-mode in Emacs, etc.), creating merge conflicts even when one of the conflicting changes did not actually introduce semantic changes. I presume that a high-profile project like the Linux kernel can just say "don't do this", but in a low-level collaboration, it's better if things "just work".

Thus, I'd expect that basic punctuation aware merge strategies, even without any fancy human language parsing, could lead to a very substantial reduction of the need for manual merging in text-based file formats (.tex, .md, etc.).
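To make the idea concrete, here is a minimal Python sketch of a punctuation-aware three-way merge. It is not how Git or Mergiraf work. It assumes a naive sentence splitter (a sentence ends at `.`, `!`, or `?` followed by whitespace) and a simplified diff3-style merge built on `difflib`. The names `split_sentences` and `merge3` are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def _match_map(base, other):
    """Map each base index to the index of its unchanged copy in `other`."""
    mapping = {}
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


def merge3(base, ours, theirs):
    """Simplified diff3 over sentence lists; returns (merged, conflicts)."""
    in_ours, in_theirs = _match_map(base, ours), _match_map(base, theirs)
    # Sync points are base sentences left untouched by both sides.
    sync = [(i, in_ours[i], in_theirs[i])
            for i in range(len(base)) if i in in_ours and i in in_theirs]
    sync.append((len(base), len(ours), len(theirs)))  # end sentinel

    merged, conflicts = [], []
    pb = po = pt = 0
    for i, o, t in sync:
        chunk_b, chunk_o, chunk_t = base[pb:i], ours[po:o], theirs[pt:t]
        if chunk_o == chunk_b:            # only theirs changed (or nobody did)
            merged += chunk_t
        elif chunk_t == chunk_b or chunk_o == chunk_t:
            merged += chunk_o             # only ours changed, or same change
        else:
            conflicts.append((chunk_b, chunk_o, chunk_t))
            merged += chunk_o             # keep ours as a placeholder
        if i < len(base):
            merged.append(base[i])        # the sync sentence itself
        pb, po, pt = i + 1, o + 1, t + 1
    return merged, conflicts


base = "Alpha is first. Beta is second. Gamma is third. Delta is last."
ours = "Alpha comes first. Beta is second. Gamma is third. Delta is last."
theirs = "Alpha is first. Beta is second. Gamma is third, obviously. Delta is last."

merged, conflicts = merge3(*(split_sentences(t) for t in (base, ours, theirs)))
print(" ".join(merged))
print(conflicts)
```

In this example the whole paragraph is a single line, so a line-based merge would report a conflict, while the sentence-level merge combines both edits. As with line-based merging, edits to directly adjacent sentences still end up in the same chunk and are reported as a conflict by this sketch.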

Merge for human language documents?

Posted Nov 1, 2025 13:54 UTC (Sat) by alx.manpages (subscriber, #145117) [Link] (2 responses)

For human language documents, I recommend using semantic newlines. See the advice we use for manual pages:
$ MANWIDTH=72 man man-pages | sed -n '/Use semantic newlines/,/^$/p'
   Use semantic newlines
     In the source of a manual page, new sentences should be started on
     new lines, long sentences should be split  into  lines  at  clause
     breaks  (commas,  semicolons, colons, and so on), and long clauses
     should be split at phrase boundaries.  This convention,  sometimes
     known as "semantic newlines", makes it easier to see the effect of
     patches, which often operate at the level of individual sentences,
     clauses, or phrases.
See also:
    Hints for Preparing Documents
    
    Most documents go through several versions
    (always more than you expected)
    before they are finally finished.
    Accordingly,
    you should do whatever possible
    to make the job of changing them easy.
    
    First,
    when you do the purely mechanical operations of typing,
    type so subsequent editing will be easy.
    Start each sentence on a new line.
    Make lines short,
    and break lines at natural places,
    such as after commas and semicolons,
    rather than randomly.
    Since most people change documents
    by rewriting phrases and
    adding, deleting and rearranging sentences,
    these precautions simplify any editing you have to do later.
-- Brian W. Kernighan, 1974 [UNIX For Beginners]

(line breaks are my own, but they're close to the original.)

Of course, you need cooperation from your editor. I use vim(1), with all automagic disabled:
set colorcolumn=73,81
set nowrap
set hlsearch
set scrolloff=4
set nu

syntax on

set nocindent
set nosmartindent
set noautoindent
set indentexpr=
filetype indent off
filetype plugin indent off
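For text that was not written this way to begin with, the convention can be roughly approximated by a script. The Python sketch below is only an illustration under simple assumptions: sentences end at `.`, `!`, or `?` followed by whitespace, and only sentences longer than `width` are split further at commas, semicolons, and colons. The function name `semantic_newlines` is made up, and the sketch does not attempt the "phrase boundaries" part of the advice.

```python
import re


def semantic_newlines(paragraph, width=72):
    """Reflow a paragraph: one sentence per line, long ones split at clauses."""
    text = " ".join(paragraph.split())  # collapse existing wrapping
    lines = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if len(sentence) <= width:
            lines.append(sentence)
        else:
            # Too long: break after clause punctuation instead.
            lines.extend(re.split(r"(?<=[,;:])\s+", sentence))
    return "\n".join(lines)


text = (
    "Start each sentence on a new line. Make lines short, and break lines "
    "at natural places, such as after commas and semicolons, rather than "
    "randomly."
)
print(semantic_newlines(text, width=40))
```

Abbreviations such as "e.g." will confuse a splitter this simple, so output like this still needs a human look before it is committed.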

Merge for human language documents?

Posted Nov 2, 2025 11:11 UTC (Sun) by marcel.oliver (subscriber, #5441) [Link]

I agree that all you say are good ideas. My point is that it is often not possible to get everybody to stick to the same set of good ideas, as old habits die hard. So if tooling on the git side of things were more accommodating to poor formatting choices, that would save unnecessary routine work for cleanup...

Merge for human language documents?

Posted Nov 9, 2025 16:00 UTC (Sun) by dkg (subscriber, #55359) [Link]

Semantic line breaks are a great idea, especially if every collaborator in a project is on board.

One way to encourage consensus on this is to point people to a stable and simple reference. I've found sembr.org to be useful in this regard.

Not just #[repr(C)]

Posted Nov 1, 2025 11:16 UTC (Sat) by cesarb (subscriber, #6266) [Link] (3 responses)

> In Rust, however, the compiler is allowed to rearrange structure fields as it sees fit (unless the structure is marked #[repr(C)] or one of the other repr settings [...])

It's not just #[repr(..)] that can be a problem. #[derive(...)] for instance can also depend on field ordering (either built-in ones like Debug or PartialOrd, or custom derives like the ones from serde). Sure, the compiler might reorder the fields in memory, but the order in which the fields are declared in the source code is still important.
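The same hazard exists outside Rust. As an analogy only (this is Python, not Rust, and says nothing about what Mergiraf does), a dataclass with `order=True` compares instances field by field in declaration order, much like a derived `PartialOrd`. Swapping two field declarations silently changes the result of a comparison. The class names below are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(order=True)
class VersionMajorFirst:
    major: int
    minor: int


@dataclass(order=True)
class VersionMinorFirst:
    # Same fields, declared in the opposite order.
    minor: int
    major: int


# Is 1.9 older than 2.0?
a = VersionMajorFirst(major=1, minor=9) < VersionMajorFirst(major=2, minor=0)
b = VersionMinorFirst(major=1, minor=9) < VersionMinorFirst(major=2, minor=0)
print(a, b)  # the two answers differ
```

The generated `repr` also lists fields in declaration order, which is the counterpart of the `Debug` case.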

There's also the case where the whole structure declaration is being passed to a macro (this is a more general version of the custom derive issue), which could depend on the field order even without any special attributes on the structure declaration. At least this is visible in the syntax tree (macro calls in Rust always end with an exclamation point), but it could be far before the structure declaration (just looking at the immediate parent is not enough).

Not just #[repr(C)]

Posted Nov 2, 2025 11:01 UTC (Sun) by garyguo (subscriber, #173367) [Link]

I'd also add that if you're writing unsafe code, you might also be relying on drop order for soundness.

Not just #[repr(C)]

Posted Nov 2, 2025 11:33 UTC (Sun) by xi0n (subscriber, #138144) [Link] (1 responses)

Then there is also the more general case of rustdoc comments in between the fields. If Mergiraf isn’t aware of them, the result might mix up those comments and put them on the wrong fields.

Not just #[repr(C)]

Posted Nov 3, 2025 5:43 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

In Rust, documentation comments are dedicated syntax items, not just "special" comments. They are the same as `#[doc = "…"]` attributes on the following item (or the containing item for `//!`). I believe they are moved with the attached field (just like any other attributes should).

auto handling the "obviously not an actual conflict" conflicts

Posted Nov 2, 2025 14:08 UTC (Sun) by pm215 (subscriber, #98099) [Link] (7 responses)

The merge conflicts that irritate me are not so much the ones that need a lot of language aware cleverness to resolve, but the silly ones like "two people added a new function at the same point in the file, and git thinks there's a conflict over the line with the closing brace from the function before that", or "two different include directives got added next to each other", where a human can look at it and say "this is obviously just a trivial textual conflict".

Some language awareness might also be interesting in patch/diff generation: some patches are unnecessarily hard to read because the diff algorithm has decided that it should consider a "}" or "} else {" line as "unchanged" when what's actually happened is an entire section of code has been rewritten.

In "moon on a stick" territory because so many tools parse patch files and format changes are a non starter, but it would be nice to have a patch syntax for "reindent lines A...B from X spaces to Y spaces": moving a block of code into or out of an if() is hard to review when it's presented as "remove all the existing lines and replace them with these other lines".
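To show what such an operation would have to do, here is a small Python sketch of the "reindent lines A...B from X spaces to Y spaces" step. No such patch syntax exists; the function name `reindent` and its parameters are hypothetical. It treats X as the indentation of the outermost lines in the range and shifts deeper lines by the same amount.

```python
def reindent(lines, first, last, old, new):
    """Shift lines first..last (1-based, inclusive) from `old` to `new` spaces."""
    result = list(lines)
    for i in range(first - 1, last):
        line = result[i]
        if not line.strip():
            continue  # leave blank lines alone
        if not line.startswith(" " * old):
            raise ValueError(f"line {i + 1} is indented less than {old} spaces")
        # Replace the first `old` spaces; deeper nesting is preserved.
        result[i] = " " * new + line[old:]
    return result


before = [
    "if (x) {",
    "    foo();",
    "    if (y) {",
    "        bar();",
    "    }",
    "}",
]
for line in reindent(before, 2, 5, 4, 8):
    print(line)
```

A reviewer would then only need to check the single reindent instruction plus any lines that genuinely changed, rather than a full remove-and-replace hunk.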

auto handling the "obviously not an actual conflict" conflicts

Posted Nov 2, 2025 18:49 UTC (Sun) by willy (subscriber, #9762) [Link] (1 responses)

Judicious use of diff -w can help the human who has to look at the conflict.

auto handling the "obviously not an actual conflict" conflicts

Posted Nov 2, 2025 21:42 UTC (Sun) by excors (subscriber, #95769) [Link]

There's also e.g. `git diff --color-words` which diffs words instead of lines, so it'll ignore changes to line wrapping as well as indentation, and highlight only the words that changed.

Or (probably much better) there's plenty of modern diff tools, like delta (https://dandavison.github.io/delta/), which detect and highlight changes within a line, so you can easily see when a new line is simply an old line plus whitespace (or just as easily see small but significant changes highlighted within a long line, so you won't overlook them when reviewing). Side-by-side mode helps too with trickier patches.

Even the default patch renderers in GitHub/Gerrit/etc are much better than git's default. It seems there's little need for patch format changes, you can just use better tools to view them.

auto handling the "obviously not an actual conflict" conflicts

Posted Nov 3, 2025 5:41 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (3 responses)

> In "moon on a stick" territory because so many tools parse patch files and format changes are a non starter, but it would be nice to have a patch syntax for "reindent lines A...B from X spaces to Y spaces": moving a block of code into or out of an if() is hard to review when it's presented as "remove all the existing lines and replace them with these other lines".

Isn't this just an `ed` script at heart? Sure, spice it up with some vi, vim, or nvim commands at the ready.

And to echo the other replies, here are some aliases:

- prefix `w`: `--color-words` (word diff)
- prefix `ww`: `--color-words=[[:alnum:]_]+|[^[:space:]]` (WORD diff)
- prefix `c`: `--color-words=[^[:space:]]` (character diff)

for each of `diff`, `show`, and `log` (`log --patch`) suffixes.
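For anyone wondering what the word regex changes, here is a rough Python sketch of a word-level diff. It is an approximation built on `difflib`, not Git's implementation. The function name `word_diff` is made up, the `[-...-]` and `{+...+}` markers imitate Git's plain word-diff output, and since Python's `re` has no POSIX classes, `\w+|\S` and `\S` stand in for the WORD and character regexes above.

```python
import re
from difflib import SequenceMatcher


def word_diff(old, new, word_re=r"\S+"):
    """Diff two texts token by token; whitespace between tokens is ignored."""
    a, b = re.findall(word_re, old), re.findall(word_re, new)
    out = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        if i2 > i1:
            out.append("[-" + " ".join(a[i1:i2]) + "-]")  # removed tokens
        if j2 > j1:
            out.append("{+" + " ".join(b[j1:j2]) + "+}")  # added tokens
    return " ".join(out)


old = "if (x) {\n    foo(a, b);\n}"
new = "if (x) {\n  foo(a, c);\n}"
print(word_diff(old, new))             # whitespace-delimited words
print(word_diff(old, new, r"\w+|\S"))  # identifiers and single symbols
```

With the default regex the changed token is `b);`, while the second call narrows the change down to just `b` versus `c`. The indentation change does not show up in either output.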

auto handling the "obviously not an actual conflict" conflicts

Posted Nov 3, 2025 8:17 UTC (Mon) by pm215 (subscriber, #98099) [Link] (2 responses)

Those all assume that you're reviewing a patch by having it actually applied somewhere. At least for my personal workflow, 99% of the time I review patches by reading the email in my email client, so I care about how the change is represented as a patch in the email, not in what I could theoretically do if I went to extra effort to apply it to a suitable git tree and use local tools on it.

auto handling the "obviously not an actual conflict" conflicts

Posted Nov 3, 2025 10:33 UTC (Mon) by excors (subscriber, #95769) [Link]

Some tools like `delta` don't need the patch applied - you can pipe the whole email into delta and it will display the non-diff text verbatim, while parsing and reformatting and highlighting the diffs to be more readable. (Less helpful when you want to reply and add review comments though; maybe email isn't the optimal technology for code review.)

auto handling the "obviously not an actual conflict" conflicts

Posted Nov 3, 2025 14:29 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

It would be nice if one could annotate a commit to indicate whether there should be a "reviewable" diff rendered using some of the flags above. Forges could also benefit from it, but email would have far better bang-for-buck from it since the "UI" is "done" as soon as you figure out how to integrate it into `format-patch`.

auto handling the "obviously not an actual conflict" conflicts

Posted Nov 3, 2025 23:53 UTC (Mon) by intgr (subscriber, #39733) [Link]

> two different include directives got added next to each other

C is less affected by this, but in my experience with newer languages, easily half the merge conflicts occur in import statements. That's because it's the norm to list all individual symbols to be imported from each module, not just the module name (e.g. Python `from sys import argv, stdin, ...`).

In practice these are trivially resolvable by just accepting conflicting lines from both sides and running the auto-formatter or IDE's "cleanup unneeded/duplicated imports" feature. (And if it doesn't resolve correctly, it's almost guaranteed to produce an obvious compile error.)

But it's still a chore, and I've been wishing for tooling to automate this. I'm wary that Mergiraf may be too smart for its own good to perform this function well, if it also tries to merge other less obvious cases.
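The "accept both sides and clean up" routine can be sketched as a set-based three-way merge. The Python below is a minimal illustration and not what Mergiraf or any formatter actually does. It assumes every line has the form `from <module> import <name>, <name>` (no aliases, parentheses, or plain `import` lines), and the function names are made up.

```python
import re
from collections import defaultdict

_FROM_IMPORT = re.compile(r"from\s+(\S+)\s+import\s+(.+)")


def parse_imports(lines):
    """Turn 'from m import a, b' lines into a set of (module, name) pairs."""
    pairs = set()
    for line in lines:
        match = _FROM_IMPORT.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"unsupported import line: {line!r}")
        module, names = match.groups()
        pairs.update((module, name.strip()) for name in names.split(","))
    return pairs


def merge_imports(base, ours, theirs):
    """Keep whatever either side has, except names that a side removed."""
    b, o, t = parse_imports(base), parse_imports(ours), parse_imports(theirs)
    merged = (o | t) - (b - o) - (b - t)
    by_module = defaultdict(list)
    for module, name in sorted(merged):
        by_module[module].append(name)
    return [f"from {m} import {', '.join(ns)}" for m, ns in by_module.items()]


base = ["from sys import argv"]
ours = ["from sys import argv, stdin"]
theirs = ["from sys import argv, stdout", "from os import path"]
for line in merge_imports(base, ours, theirs):
    print(line)
```

Treating the imports as an unordered set is what makes this safe to automate: two additions never conflict, and a removal on one side is honoured unless the other side is the one that added the name.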

Pronounced like “merge” + “giraffe”?

Posted Nov 2, 2025 14:12 UTC (Sun) by ms-tg (subscriber, #89231) [Link]

Noting the giraffe imagery on the linked home page, is it safe to assume that the intended pronunciation of the project name is “merge” + “giraffe”?


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds