GitHub is my copilot
Copilot is a machine-learning system that generates code. Given the beginning of a function or data-structure definition, it attempts to fill in the rest; it can also work from a comment describing the desired functionality. If one believes the testimonials on the Copilot site, it can do a miraculous job of figuring out the developer's intent and providing the needed code. It promises to take some of the grunge work out of development and increase developer productivity. Of course, it can happily generate security vulnerabilities; it also uploads the code you're working on and remembers if you took its suggestions, but that's the world we've built for ourselves.
Machine-learning systems, of course, must be trained on large amounts of data. Happily for GitHub, it just happens to be sitting on a massive pile of code, most of which is under free-software licenses. So the company duly used the code in the publicly available repositories it hosts to train this model; evidently private repositories were not used for this purpose. For now, the result is available as a restricted beta offering; the company plans to turn it into a commercial product going forward.
Copy-and-paste
Looked at one way, GitHub Copilot is the embodiment of a number of aspects of software development that are, perhaps, not fully covered in school:
- Much of a software developer's time is spent cranking out boilerplate code that looks much like a lot of other boilerplate code in circulation. Unsurprisingly, developers do not find being freed of this work to be a distasteful prospect.
- An awful lot of software development is actually done by copying and pasting code. It's tempting to say that this is especially true of contemporary developers, but development worked this way even before the days of Stack Overflow.
- While we like to think that the code we write is original, we are all strongly influenced by code we have seen in the past. Developers who have read a lot of code tend to have many useful patterns at their fingertips.
If much of our work really comes down to copying and pasting at varying degrees of remove, perhaps it makes sense to get the computer to do that work for us when it can.
The use of free software to train Copilot has raised some interesting questions, though. If a machine-learning model has been trained on a particular body of code, is that model a derived work of that code? If so, since GPL-licensed code was used to train the model, the result would also come under the terms of the GPL. If that were true, it would not change much, since GitHub does not appear to have any interest in distributing its model.
But what about the code that Copilot spits out? Is that code, too, a derived work of the code used to train the model? The fact that Copilot occasionally regurgitates verbatim copies of the training code (0.1% of the time, according to the Copilot FAQ) tends to support those who believe that Copilot's output should be seen as a derived work. If this is true, then any code body using Copilot output is in the same situation, which would be a bit of a mess, since it will be derived from multiple bodies of code with conflicting licenses and an endless list of attribution requirements. The derived-work interpretation would make any code developed with Copilot's help entirely undistributable.
The best outcome is unclear
Your editor is not a lawyer and certainly does not wish to play one on the net. That said, there are arguments to be made that Copilot's output should not be seen as a derived work of the code used for training. Certainly GitHub sees it that way; the Copilot FAQ states: "Training machine learning models on publicly available data is considered fair use across the machine learning community". How closely that consideration matches actual copyright law is not entirely clear, but it is an established practice, and precedent is accumulating around it.
More intuitively, one can easily compare Copilot with a seasoned software developer who has seen a lot of code over a long career. The code that developer writes today will surely be influenced by what they have seen in the past, but today's code is not generally seen as being a derived work of yesterday's reading. One could argue that Copilot is doing the same thing; the only difference is that, since it's a computer, it can read vast amounts of code — even PHP code — without going insane.
Former European parliamentarian Julia Reda makes the argument that the code snippets produced by Copilot are not large or complex enough to be considered original, copyrightable works. One might well wonder how much better Copilot has to get before that line will be crossed, but she also claims that "the output of a machine simply does not qualify for copyright protection – it is in the public domain". This argument, if taken to its extreme, suggests that copyrighted work could be put into the public domain by running it through a photocopier. These arguments may hold for now, but it's not clear that they are tenable in the long term.
More interestingly, Reda, along with Matthew Garrett, argues that a derived-work interpretation is not in the interests of the free-software community in any case. Copyleft, they say, is a response to overly strong copyright protection for code, not a reason to make it stronger. As Garrett put it:
The powers that the GPL uses to enforce sharing of code are used by the authors of proprietary software to reduce that sharing. They attempt to forbid us from examining their code to determine how it works - they argue that anyone who does so is tainted, unable to contribute similar code to free software projects in case they produce a derived work of the original. Broadly speaking, the further the definition of a derived work reaches, the greater the power of proprietary software authors.
On the other hand, he continues, systems like Copilot offer the prospect of training models with proprietary code and using the result without worries of being tainted. That, he says, is likely to be a positive outcome for the free-software community.
It seems reasonable to assume that Copilot is not the only machine-learning-based code-synthesis system out there; it is also plausible that these systems will become more capable over time. The copyright issues raised by Copilot seem to be concentrated on free software for now, but they may well expand beyond that realm in the future. What happens now, though, will set precedents for that future; if the free-software community somehow shuts down Copilot over copyright issues, other interests will have a stronger argument for strengthened copyright laws applied to future systems. That power could be used to extend the reach of proprietary software or to shut down machine-learning systems that are beneficial to the community. We should, thus, be careful about what we wish for, lest we actually get it.
