
Token-based authorship information from Git

By Jake Edge
August 31, 2016

LinuxCon North America

At LinuxCon North America 2016, Daniel German presented some research that he and others have done to extract more fine-grained authorship information from Git. Instead of the traditional line-based attribution for changes, they took things to the next level by looking at individual tokens in the source code and attributing changes to individuals. This kind of analysis may be helpful in establishing code provenance for copyright and other purposes.

German, who is from the University of Victoria, worked on the project with Kate Stewart of the Linux Foundation and Bram Adams of Polytechnique Montréal. It was a "combination of research plus hacking", he said, and the results were fascinating.

Git and blame

[Daniel German]

Git is in widespread use and not just for software development. Its pervasiveness is "proof enough that it is an excellent tool". It is also a great archival tool for historical research. Each revision gets stored away and can be compared against other versions using diff and similar utilities. That allows users to see some interesting differences between the versions.

The git blame command is often used to determine who changed a particular line or section of a code file. The GitHub web interface to the blame command gives a side-by-side view of authors and the lines they changed, with links to the revisions where the change was made. It is line-based, which is "sufficiently good for most of the tasks we have", German said.

It is now common for authors of text documents to use Git for version control. He uses it when working on papers with his students and other collaborators. All of the advantages that it provides for source code, such as branching, merging, and blaming, can be applied equally well to text, he said.

But Git allows its users to rewrite history; it provides a lot of tools to clean up the commits in a repository before pushing them to the master. He is a software archaeologist, though, so he would like to see more of the raw history. Quoting Indiana Jones, he said that archaeology is a search for facts, not truth. The more facts that are available, the more history can be reconstructed.

So the history in Git is likely to be incomplete, but there is still a lot there; what can be done with that? The line-oriented information is fine for many uses, but perhaps looking more deeply inside the lines of code would provide other insights. To research that idea, the project first looked at Git itself. The Git project is likely to be one of the better users of the Git tool, he said. Plus, there was a certain symmetry to using Git to recursively study its own repository.

Tokenize

The idea was to tokenize the source code to track changes at that level. The first step was to decide what the tokens are. The equality operator (==) is an obvious token, as are function and variable names and the like. Strings and comments might be considered a series of tokens, but the project ultimately decided to treat them as a single token.

[Token-based blame]

He showed an example of a line in the Git source repository that had been originally authored by Linus Torvalds in 2006. It was changed twice in 2014, but those changes simply altered the argument list. Line-based blame only reports the last commit to touch the line, whereas a look at the tokens shows which tokens changed and in which commits (as shown on the left). German said that he wanted to be able to get that level of detail, "whether it is good or not is a different question".

A tokenizer was developed for C code that essentially acts as a filter on an input file, producing the tokenized output, which German called a "view" of the file. Each language needs its own tokenizer; beyond C, there are tokenizers for Java and Python. C++ is harder, he said, since the researchers didn't want to build a full-blown parser for the language.

After early experiments with the Git source, the researchers turned to the Linux kernel. The filter was run on every version of every file that appears in the Linux kernel repository. It is not a difficult thing to do, but it takes a few days to do it. The filter produces a file with one token per line.
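The talk did not show the tokenizer itself, so the following is only a minimal sketch of what such a filter might look like for C-like input. The token grammar, `TOKEN_RE`, `tokenize()`, and `token_view()` are all hypothetical names and choices made for illustration; the one design decision taken from the talk is that strings and comments are each kept as a single token.

```python
from __future__ import annotations

import re

# Hypothetical token grammar for a C-like language. These patterns are
# illustrative assumptions, not the grammar used by the researchers.
TOKEN_RE = re.compile(
    r"""
    /\*.*?\*/                 # block comment, kept as a single token
  | //[^\n]*                  # line comment, kept as a single token
  | "(?:\\.|[^"\\\n])*"       # string literal, kept as a single token
  | '(?:\\.|[^'\\\n])*'       # character literal
  | [A-Za-z_]\w*              # identifiers and keywords
  | \d[\w.]*                  # numbers (matched very loosely)
  | ==|!=|<=|>=|->|\+\+|--|&&|\|\||<<|>>   # two-character operators
  | \S                        # any other single non-blank character
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(source: str) -> list[str]:
    """Return the tokens of `source` in order; whitespace is dropped."""
    return [match.group(0) for match in TOKEN_RE.finditer(source)]


def token_view(source: str) -> str:
    """Render a one-token-per-line view; newlines inside a token are escaped."""
    return "\n".join(tok.replace("\n", "\\n") for tok in tokenize(source)) + "\n"


if __name__ == "__main__":
    print(token_view('if (a == b) /* same */ return "x y";'), end="")
```

Because whitespace never becomes a token in this sketch, reindenting a file leaves its view unchanged.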

Then the token files were matched up with the commits in the kernel repository to create views of those commits that were, naturally, checked into a Git repository that mirrored that of the mainline kernel. The goal was to create a repository with history by token, so that all of the Git (and GitHub) tools can be used to look at that history.
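To illustrate what per-token history makes possible, here is a rough sketch of token-level blame over successive versions of one file. It is not the researchers' pipeline, which stores token views in Git and relies on Git's own tools; this sketch swaps in Python's `difflib.SequenceMatcher` to align token lists. The function name `blame_tokens()` and the commit labels are made up for the example.

```python
from difflib import SequenceMatcher


def blame_tokens(revisions):
    """Attribute each token of the newest revision to the commit that added it.

    `revisions` is a list of (commit_id, tokens) pairs, oldest first.
    Returns a list of (token, commit_id) pairs for the newest revision.
    """
    blame = []        # (token, commit_id) pairs for the previous revision
    prev_tokens = []  # token list of the previous revision
    for commit_id, tokens in revisions:
        matcher = SequenceMatcher(a=prev_tokens, b=tokens, autojunk=False)
        new_blame = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged tokens keep their earlier attribution.
                new_blame.extend(blame[i1:i2])
            else:
                # Inserted or replaced tokens belong to this commit; for a
                # pure deletion j1 == j2, so nothing is added here.
                new_blame.extend((tok, commit_id) for tok in tokens[j1:j2])
        blame = new_blame
        prev_tokens = tokens
    return blame


if __name__ == "__main__":
    history = [
        ("c1", ["int", "f", "(", "int", "a", ")", ";"]),
        ("c2", ["int", "f", "(", "int", "a", ",", "int", "b", ")", ";"]),
    ]
    for token, commit in blame_tokens(history):
        print(commit, token)
```

In this made-up history, the second commit only extends the argument list, so only the three added tokens are attributed to it; a line-based blame would credit the entire line to that commit.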

The kernel project had no version control system until the adoption of BitKeeper in 2001. Some subsystems used CVS, but Linus Torvalds never liked the existing choices, so the net became the repository and changes were sent to him as patches. But there is a repository that Yoann Padioleau put together that covers 0.01 up until 2001. At that point, the BitKeeper era started, which ran until Git was created in 2005. Thomas Gleixner has a Git repository for the BitKeeper period and Torvalds, of course, has a repository from 2005 on.

The earliest repository has low granularity for changes (as it is based on the release tarballs), while the BitKeeper granularity is generally good; the best commit history comes from the Git-era repository, unsurprisingly. Unlike some other repositories (PostgreSQL's, for example), the Linux Git history is not simply made up of squashed feature commits without merges. That allows the history to be followed.

Git allows concatenating these three separate repositories into one, effectively. Git also has some other features that were useful for historical analysis. In particular, the rename detection was valuable for the project, German said.

There is one warning he provided about the data: the "author" in Git terms may not actually be the author of the code. It may be that the Git author is simply passing the code along from elsewhere. In addition, refactoring or moving code around within the tree may credit the wrong person with authorship. Refactoring is an area that needs more research, he said, since even lawyers are not able to decide who holds the copyright for refactored code.

Findings

He then presented some of the findings of the research, which looked at changes up through the 4.7 release. The number of tokens was roughly six times larger than the number of non-blank, non-comment lines of code. The number of people who appear in git blame output for 4.7 is roughly the same either way, though (12,005 by lines, 12,087 by tokens).

[User interface]

He was curious to see what function had remained closest to its 0.01 version. It turns out that skip_atoi() contains the most code from 0.01. That kind of makes sense, he said, since the mechanism to convert a string to an integer hasn't needed much change. He put up a slide (seen on the right) that demonstrated the user interface to look at token-level changes. It lists three commits and the number of tokens changed; each commit has its changes in a different color and you can hover over a change to see what the commit message was.

A file that has not changed a whole lot is ctype.c, which consists of a table that maps character values to their types (white space, digits, letters, etc.). If you look at the line-level git blame output, it would seem that Andre Rosa updated most of the file in 2009, but that is not really the case. It turns out that a const was added and tabs were changed to spaces, which made Rosa the "owner" of much of the file. In fact, the bulk of the file tokens date back to the days before version control for the kernel (called "pre-git" in the slides [SlideShare]). In 1996, though, some additional mappings were added, but the commas are still there from the original file, which calls into question who has the copyright for that piece, he said.
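The effect described for ctype.c can be reproduced on a toy input. The sketch below uses an invented four-line table (not the real ctype.c), a deliberately crude regex as a stand-in tokenizer, and a hypothetical helper `count_new()`. It counts how many lines, and how many tokens, of the new version have no match in the old one.

```python
import re
from difflib import SequenceMatcher


def simple_tokens(text):
    """Crude stand-in tokenizer: runs of word characters, or one non-blank character."""
    return re.findall(r"\w+|\S", text)


def count_new(old_seq, new_seq):
    """Count the items of `new_seq` that are not matched to an item of `old_seq`."""
    matcher = SequenceMatcher(a=old_seq, b=new_seq, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(new_seq) - matched


# Invented example: a `const` is added and tabs are replaced by spaces.
OLD = "unsigned char table[] = {\n\t1,\t2,\n\t3,\t4\n};\n"
NEW = "const unsigned char table[] = {\n    1, 2,\n    3, 4\n};\n"

if __name__ == "__main__":
    print("new lines: ", count_new(OLD.splitlines(), NEW.splitlines()))
    print("new tokens:", count_new(simple_tokens(OLD), simple_tokens(NEW)))
```

On this input, three of the four lines count as new, but only one of the seventeen tokens (the added `const`) does, which is the kind of gap between line-level and token-level "ownership" that the ctype.c example shows.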

At the beginning of the project, he wondered if token-level attribution would make a real difference in the authorship numbers. As it turns out, he was surprised to find that comparing authorship by lines versus tokens does not change things much. "Some people win, some lose", but it all comes out as a wash. He showed a graph of the year of origin of code in Linux as counted by lines and tokens—the graphed points were nearly identical. He also looked only at the kernel directory to see if the core code showed anything different. Again, he was surprised to see that there was essentially no difference.

He produced "top twenty" lists of committers to both Linux and to just the kernel directory by tokens versus lines and the lists looked much the same. There are some differences and a bit of reordering to be sure, but little that stands out. For Linux, the pre-git commits are 5.15% by token, while only 3.81% by line. Two examples did show some differences, though: Joe Perches made 0.64% of changes when measured by lines (number ten on the list), but did not appear in the top twenty (so less than 0.44%) by tokens; on the flip side, Arend van Spriel was number thirteen by tokens (0.6%) but was not present on the lines list (so less than 0.5%). Results for the top twenty committers in the kernel directory showed much the same for tokens versus lines.

German also reported some overall statistics on tokens in commits, which show that the repository is made up of mostly small changes. For non-merge commits that modified .c or .h files, 9.5% added three or fewer tokens and removed three or fewer. Some 7% only removed tokens, while 3.8% added and removed exactly one token. Commits that added or removed up to ten tokens accounted for 22.4%, and half of all commits added or removed up to 60 tokens. On the other hand, two commits added or removed more than one million tokens.

To measure "churn", the researchers took the number of tokens a commit added and subtracted the number it removed. Of the commits, 10% had zero churn, while 48% had a positive churn value of ten or less; 26% had a negative churn.
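As a small worked example of that definition, the sketch below computes churn for a handful of commits and reports the share falling into the three buckets mentioned above. The function name `churn_summary()` and the sample counts are invented for illustration; they are not data from the talk.

```python
def churn_summary(commits):
    """Summarize churn for an iterable of (tokens_added, tokens_removed) pairs.

    Churn for one commit is tokens added minus tokens removed. The result
    maps each bucket name to the fraction of commits that fall into it.
    """
    values = [added - removed for added, removed in commits]
    total = len(values)
    if total == 0:
        return {"zero": 0.0, "positive_up_to_10": 0.0, "negative": 0.0}
    return {
        "zero": sum(v == 0 for v in values) / total,
        "positive_up_to_10": sum(0 < v <= 10 for v in values) / total,
        "negative": sum(v < 0 for v in values) / total,
    }


if __name__ == "__main__":
    # Made-up sample: (tokens added, tokens removed) for four commits.
    sample = [(3, 3), (5, 0), (0, 4), (20, 2)]
    print(churn_summary(sample))
```

Note that a commit can touch many tokens and still have zero churn, as long as it adds exactly as many as it removes.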

He concluded his presentation by saying that the research had shown that there is not much difference between lines and tokens at the large scale. But on the small scale, doing that analysis can provide a more fine-grained view of the evolution of the code.

In answer to some audience questions, German said that there is no reason this feature could not be built into Git itself if that was desirable. As to the intellectual property and legal ramifications of the work, he was a bit non-committal. The output of this work simply adds more information that the courts can work with when deciding those kinds of cases; it is the job of the courts to find the truth, he said.

The code is not currently available. The team plans to release it as open source in the next three to four months. In the meantime, though, for projects that do not have the scale of the kernel, he is willing to process them and make the repositories available to interested projects.

[I would like to thank the Linux Foundation for travel assistance to attend LinuxCon North America in Toronto.]



Token-based authorship information from Git

Posted Sep 1, 2016 7:44 UTC (Thu) by pabs (subscriber, #43278) [Link] (6 responses)

It would definitely be useful to have this feature available from git itself.

Token-based authorship information from Git

Posted Sep 1, 2016 10:38 UTC (Thu) by andrewsh (subscriber, #71043) [Link]

I needed a similar feature in Mercurial last week, when I needed to properly attribute a piece of code from multiple authors (the attribution in the file header wasn’t incorrect, but wasn’t complete either, as it covered the major contributions only). I’d even convert that repository to Git purely to access this feature were it available in Git proper.

Token-based authorship information from Git

Posted Sep 1, 2016 13:34 UTC (Thu) by epa (subscriber, #39769) [Link] (1 responses)

This kind of code analysis doesn't depend much on the details of the particular version control system used; its inputs are really just a list of revisions with their contents. (Adding branches makes things a bit more interesting, but the basic concepts of a merge and a tree of history are common to all version control systems.) So it might be better written as a general tool which you can configure to work with backends for git, svn or whatever.

In fact, I wonder if there is a general wrapper for version control systems to do operations like 'show the list of revisions', 'list the files in this revision' and 'get me the text of the file X in revision Y'. Not so much for interactive use (it's always better to use the native tools) but as an aid to scripting. That said, git is so dominant these days it may make more sense to just use git and import your whole repository into git to analyse it.

Token-based authorship information from Git

Posted Sep 22, 2017 13:10 UTC (Fri) by cjwatson (subscriber, #7322) [Link]

It's not an exact match for that feature set, but vcs would perhaps be a good place to put that kind of thing.

Token-based authorship information from Git

Posted Sep 2, 2016 9:51 UTC (Fri) by jnareb (subscriber, #46500) [Link] (2 responses)

Adding token-based blame to Git would, I think, consist of two steps.

First is modifying blame so it can work on "words", and not only lines, similar to how `git diff --word-diff` works. This should be fairly uncontroversial, except for coming up with a good output format - though in this case `git blame` could require either `--porcelain` or `--incremental`.

Second, if it is not possible to modify the word regexp to separate tokens instead, it would need to include or call a tokenizer (lexer). This would be harder to argue for...

Token-based authorship information from Git

Posted Sep 2, 2016 11:01 UTC (Fri) by karkhaz (subscriber, #99844) [Link] (1 responses)

> it would need to include or call a tokenizer (lexer). This would be harder to argue for...

So actually, I think having a "semantic" git diff/git blame (even normal diff) would be terrific. You could use that, for example, to ignore code changes that are merely refactorings (whitespace and comment changes only). If the parse trees of the two versions are the same, then semantically the code hasn't changed.

Token-based authorship information from Git

Posted Sep 13, 2016 8:53 UTC (Tue) by dakas (guest, #88146) [Link]

git blame -w already helps against blaming indentation.

One thing I'd be interested in would be the basics of their tool chain. I think that just converting each tree in history to token-per-line files (for non-binary files) and committing with a tag reflecting the original commit id should be enough as the basic work horse, allowing you to just employ git blame itself.

Of course, looong files with small changes were particularly slow to work with before my rewrite of git blame's core (I've seen a factor of 5 on repositories containing word lists) 2 years ago or so. So it's sort of a vanity interest whether they actually employed git blame itself for the grunt work or rather coded up something themselves.

Token-based authorship information from Git

Posted Sep 9, 2016 21:19 UTC (Fri) by dfsmith (guest, #20302) [Link] (2 responses)

I'd be interested to find out people's experience on using git with text documents. My experience is that a single word change can ripple to mark the rest of the paragraph. I guess you could use one line per paragraph and rely on the editor's rendering to make it readable, but that's more trouble than it's worth (and doesn't help disentangle the change, either).

Token-based authorship information from Git

Posted Sep 9, 2016 22:41 UTC (Fri) by karkhaz (subscriber, #99844) [Link]

I usually write LaTeX documents with one sentence per line, especially when I'm collaborating with others. Otherwise, as you say, diffing/blaming becomes impossible. It does help to disentangle the change, as a changed sentence will appear as only a single line in a diff, without affecting other sentences in the same paragraph.

It's not much trouble at all. You can configure vim so that, whenever you open a file with .tex as the extension, it does not hard-wrap the text and inserts a newline every time you type a period followed by a space; you can also configure it to scroll up and down by 'on-screen lines' rather than actual newline-separated lines. I suppose equivalent functionality exists for emacs.

Token-based authorship information from Git

Posted Sep 13, 2016 18:36 UTC (Tue) by bfields (subscriber, #19510) [Link]

Git's "--word-diff" option may do what you want. You can use it on any command that displays a diff (git show, git log -p, git diff, etc.).

(I don't think the results can be fed to git-apply, though. I don't know if there's a word-by-word format that can also be used to patch files.)

Token-based authorship information from Git

Posted Jan 13, 2017 3:42 UTC (Fri) by pabs (subscriber, #43278) [Link]

This talk was recorded and there is a video on YouTube:

https://www.youtube.com/watch?v=iXZV5uAYMJI

Token-based authorship information from Git

Posted Jan 13, 2017 13:44 UTC (Fri) by madscientist (subscriber, #16861) [Link] (2 responses)

Any information on the open-sourcing of this work? It's now "three to four months" later so I thought I'd ask although I know these things invariably take longer than expected.

Token-based authorship information from Git

Posted Jan 14, 2017 5:25 UTC (Sat) by pabs (subscriber, #43278) [Link] (1 responses)

I contacted Daniel German, will report back if I get a response.

Token-based authorship information from Git

Posted May 16, 2017 23:41 UTC (Tue) by pabs (subscriber, #43278) [Link]

Daniel German just emailed me to say that they have now done this:

https://github.com/cregit/cregit


Copyright © 2016, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds