Deep learning and free software
Deep-learning applications typically rely on a trained neural net to accomplish their goal (e.g. photo recognition, automatic translation, or playing go). That neural net uses what is essentially a large collection of weighting numbers that have been empirically determined as part of its training (which generally uses a huge set of training data). A free-software application could use those weights, but there are a number of barriers for users who might want to tweak them for various reasons. A discussion on the debian-devel mailing list recently looked at whether these deep-learning applications can ever truly be considered "free" (as in freedom) because of these pre-computed weights—and the difficulties inherent in changing them.
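To make that concrete, here is a minimal sketch (plain NumPy, with random numbers standing in for real trained weights): the application code is a few lines of arithmetic, while the behavior lives almost entirely in the shipped weights file.

    import numpy as np
    rng = np.random.default_rng(0)

    # Stand-ins for weights a real project would produce in an expensive
    # training run and then ship alongside the code.
    np.savez("model.npz", W1=rng.normal(size=(784, 64)),
             W2=rng.normal(size=(64, 10)))

    m = np.load("model.npz")               # the application merely loads them
    def classify(x):
        h = np.maximum(x @ m["W1"], 0.0)   # hidden layer with ReLU
        return int(np.argmax(h @ m["W2"]))

    print(classify(rng.normal(size=784)))  # the behavior comes from the
                                           # numbers, not these lines of code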
The conversation was started by Zhou Mo ("Lumin"); he is concerned that, even if deep-learning application projects release the weights under a free license, there are questions about how much freedom that really provides. In particular, he noted that training these networks is done using NVIDIA's proprietary cuDNN library that only runs on NVIDIA hardware.
While it might be possible to train (or retrain) these networks using only free software, it is prohibitively expensive in terms of CPU time to do so, he said. So, he asked: "Is GPL-[licensed] pretrained neural network REALLY FREE? Is it really DFSG-compatible?"
Jonas Smedegaard did not think the "100x slower" argument held much water in terms of free-software licensing. Once Mo had clarified some of his thinking, Smedegaard said:
I therefore believe there is no license violation, as long as the code is _possible_ to compile without non-free code (e.g. blobs to activate GPUs) - even if ridiculously expensive in either time or hardware.
He did note that if rebuilding the neural network data was required for releases, there was a practical problem: blocking the build for, say, 100 years would not really be possible. That stretches way beyond even Debian's relatively slow release pace. Theodore Y. Ts'o likened the situation to that of e2fsprogs, which distributes the output from autoconf as well as the input for it; many distributions will simply use the output as newer versions of autoconf may not generate it correctly.
Ian Jackson strongly stated that GPL-licensed neural networks were not truly free, nor are they DFSG compatible in his opinion:
In fact, they are probably not redistributable unless all the training data is supplied, since the GPL's definition of "source code" is the "preferred form for modification". For a pretrained neural network that is the training data.
But there may be other data sets that have similar properties, Russ Allbery said in something of a thought experiment. He hypothesized about a database of astronomical objects where the end product is derived from a huge data set of observations using lots of computation, but the analysis code and perhaps some of the observations are not released. He pointed to genome data as another possible area where this might come up. He wondered whether that kind of data would be compatible with the DFSG. "For a lot of scientific data, reproducing a result data set is not trivial and the concept of 'source' is pretty murky."
Jackson sees things differently, however. The hypothetical NASA database can be changed as needed or wanted, but the weightings of a neural network are not even remotely transparent:
If the user does not like the results given by the neural network, it is not sensibly possible to diagnose and remedy the problem by modifying the weighting tables directly. The user is rendered helpless.
If training data and training software is not provided, they cannot retrain the network even if they choose to buy or rent the hardware.
That argument convinced Allbery, but Russell Stuart dug a little deeper. He noted that the package that Mo mentioned in his initial message, leela-zero, is a reimplementation of the AlphaGo Zero program that has learned to play go at a level beyond that of the best humans. Stuart said that Debian already accepts chess, backgammon, and go programs that he probably could not sensibly modify even if he completely understood the code.
Allbery noted that GNU Backgammon (which he packages for Debian) was built in a similar way to AlphaGo Zero: training a neural network by playing against itself. He thinks the file of weighting information is a reasonable thing to distribute.
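For readers unfamiliar with the technique, here is a toy illustration of self-play training (far simpler than anything AlphaGo Zero or GNU Backgammon does; the game of Nim and the tabular learning rule are stand-ins): the artifact that matters at the end is the learned table, produced entirely from games the program plays against itself.

    import random

    PILE = 10                  # take 1-3 stones; taking the last stone wins
    Q = {}                     # (stones_left, move) -> estimated value

    def best_move(stones, eps=0.1):
        moves = [m for m in (1, 2, 3) if m <= stones]
        if random.random() < eps:              # explore occasionally
            return random.choice(moves)
        return max(moves, key=lambda m: Q.get((stones, m), 0.0))

    for game in range(20000):                  # the program plays itself
        stones, history = PILE, []
        while stones > 0:
            m = best_move(stones)
            history.append((stones, m))
            stones -= m
        reward = 1.0                           # last mover won; credit
        for state in reversed(history):        # alternates going backwards
            Q[state] = Q.get(state, 0.0) + 0.1 * (reward - Q.get(state, 0.0))
            reward = -reward

    print(best_move(10, eps=0.0))  # converges on 2: leave a multiple of 4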
However, Ximin Luo (who filed the "intent to package" (ITP) bug report for adding leela-zero to Debian) pointed out that there is no weight file that comes with leela-zero. There are efforts to generate such a file in a distributed manner among interested users.
He is clearly a bit irritated by the DFSG-suitability question, at least with regard to leela-zero, but it is an important question to (eventually) settle. Deep learning will clearly become more prevalent over time, for good or ill (and Jackson made several points about the ethical problems that can stem from it). How these applications and data sets will be handled by Debian (and other distributions) will have to be worked out, sooner or later.
A separate kind of license for these data sets (training or pre-trained weights), as the Linux Foundation has been working on with the Community Data License Agreement, may help a bit, but won't be any kind of panacea. The license doesn't really change the fundamental computing resources needed to use a covered data set, for example. It is going to come down to a question of what a truly free deep-learning application looks like and what, if anything, users can do to modify it. The application of huge computing resources to problems that have long bedeviled computer scientists is certainly a boon in some areas, but it would seem to be leading away from the democratization of software to a certain extent.
Posted Jul 18, 2018 22:24 UTC (Wed)
by Sesse (subscriber, #53779)
[Link] (3 responses)
I don't see the dependency on cuDNN as too problematic; alternative GPU implementations are a factor of 2 or 3 away, which is hardly impinging on anyone's freedom. It's more interesting that the “compilation” process is rarely deterministic, although I see that as a minor quibble.
PS: Like in backgammon, nearly every top chess program these days, including the GPLv3-ed champion Stockfish (which is in Debian), uses weights optimized through this kind of self-play. Thousands and thousands of core-hours from volunteers have gone into this optimization process. However, I believe the training program (fishtest) is free, too, so I don't see a problem.
Posted Jul 19, 2018 1:12 UTC (Thu)
by brooksmoses (guest, #88422)
[Link] (2 responses)
One difference is that it's not always true that you'd modify things by changing the training program or the input data. In many cases where the input set changes over time, it's common to continually train the model by taking last-week's converged weights as a starting point and iterating on new inputs, and thereby tweaking the weights to make this week's model. It may not be "direct editing", but it is certainly a situation where the preferred form of input for making changes includes the existing weights.
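A minimal sketch of that workflow, with synthetic data standing in for real inputs: last week's converged weights are the starting point, and a short run over the new data merely nudges them.

    import numpy as np
    rng = np.random.default_rng(1)

    # Stand-ins: "last week's" converged weights and this week's new data.
    w = rng.normal(size=3)              # pretend this was loaded from a file
    X = rng.normal(size=(200, 3))       # this week's inputs
    y = X @ np.array([1.0, -2.0, 0.5])  # this week's targets

    for _ in range(200):                # short run over the new data only
        w -= 0.05 * X.T @ (X @ w - y) / len(y)

    print(np.round(w, 2))               # old weights nudged toward new data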
Another difference is that I think the non-determinism of the training process is somewhat less of a quibble than you do. Certainly it's a reason why it's important to include the trained weights with the source code to make something "free"; if you can't fix a bug in the file-input part of a program without changing the computations it makes to analyze the file after it's opened, that's not really in the spirit of software freedom. Similarly, if I have a magic proprietary compiler that analyzes my C++ floating-point math and promotes floats to doubles where needed for numerical stability, and I don't make that compiler free, then giving you my physics-model source code isn't giving you what you need to rebuild it and get the same results.
The fundamental problem, I think, is simply that we don't have the computer science yet to be able to edit the weights directly, and so there _is_ really no ideal starting point for making "small" changes to an existing model (other than retraining from the existing weights). We're basically at a point where important parts of the program are write-only and to edit them we have to recreate them -- and that re-creation can involve human creative input as well as machine time.
Posted Jul 19, 2018 7:39 UTC (Thu)
by Sesse (subscriber, #53779)
[Link] (1 responses)
In the case of Leela, it's all a bit muddied in that it's not even clear exactly what the training data is. The amount of human input is zero, but there are all these games, produced as part of the training, that are used to train the network itself. I'm not sure if they need to be included or not; I'd lean towards not.
Posted Jul 21, 2018 20:53 UTC (Sat)
by k8to (guest, #15413)
[Link]
So this is basically a distributed computing client as well as a Go engine that runs with the results of that distributed computing effort.
From a free software perspective, the major concerns I see for leela-zero are the ownership and reproducibility of the "central infrastructure", so that others could continue this process, or take it in another direction, without the help of the original creators.
That all of this work is going on in public view on GitHub, with the tools exposed, should go a long way towards alleviating practical concerns, but it would be nice if the server tools were packaged as well.
Posted Jul 19, 2018 0:05 UTC (Thu)
by sfeam (subscriber, #2841)
[Link] (1 responses)
It is not really novel for practical use of a program to depend on access to pre-calculated data values. Is the Gnu Scientific Library unfree because it embeds tables of numerical coefficients that would be tedious (but surely possible) to recreate if they were not included with the source?
Posted Jul 19, 2018 0:49 UTC (Thu)
by ejr (subscriber, #51652)
[Link]
Posted Jul 19, 2018 1:29 UTC (Thu)
by brooksmoses (guest, #88422)
[Link] (2 responses)
I've been at a conference where Richard Stallman addressed that issue -- which is to say, he basically punted and said that Free Software was irrelevant to computer hardware because nobody would be able to fab a replacement chip. Personally, I think that's a shortsighted conclusion (and it was particularly ironic that he said this at a conference of people largely funded by DoD contracts, many of whom likely would be in a position to fab their own chips), but I do find it an interesting datapoint. Is the sort of neural-net training data that can only be generated by running something on the equivalent of Google's CloudTPU clusters for zillions of hours a thing that is limited in its ability to be "free" in much the same way that the Xeon in my desktop is limited by the technological inability of anyone but Intel to make one? And, if so, is the right answer actually to say "that's not a problem we can solve with Software Freedom right now," as Stallman did with hardware?
If so, then the question becomes: what can we practically do instead? And, of those options, what do we want to do?
Posted Jul 19, 2018 3:14 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (1 responses)
https://www.fsf.org/blogs/rms/how-to-make-hardware-design...
Posted Jul 19, 2018 7:58 UTC (Thu)
by brooksmoses (guest, #88422)
[Link]
http://www.wired.com/2015/03/need-free-digital-hardware-d...
https://www.wired.com/2015/03/richard-stallman-how-to-mak...
Posted Jul 19, 2018 5:05 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (21 responses)
Training a neural net requires a lot of compute time, but it's definitely not "outside of abilities of a non-profit". A small cluster of 10-20 computers outfitted with GPUs will be able to train models like AlphaZero within a day or so. That's what Mozilla does for their DeepSpeech project.
20 computers with the latest GPUs will cost around $100,000 - not an insignificant amount, but not insurmountable either. And now we have even more AI-specialized hardware accelerators, so the price can be cut even further.
Posted Jul 19, 2018 6:58 UTC (Thu)
by pabs (subscriber, #43278)
[Link] (1 responses)
Posted Jul 19, 2018 23:56 UTC (Thu)
by tedd (subscriber, #74183)
[Link] (15 responses)
Posted Jul 20, 2018 1:49 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link] (14 responses)
Posted Jul 21, 2018 16:47 UTC (Sat)
by epa (subscriber, #39769)
[Link] (13 responses)
Posted Jul 21, 2018 16:53 UTC (Sat)
by sfeam (subscriber, #2841)
[Link] (1 responses)
If we are considering game-playing programs that must respond to time-pressure limits, then full reproducibility might require either identical hardware or a dummied-up clock so that the nominal clock time of all decision points is pre-determined.
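A sketch of that dummied-up clock (illustrative Python, not taken from any actual engine): the search asks an injected clock for the time, so a replay can feed back recorded timestamps and reproduce every decision exactly.

    class ReplayClock:
        """Feeds back pre-recorded timestamps instead of reading real time."""
        def __init__(self, timestamps):
            self._ts = iter(timestamps)
        def now(self):
            return next(self._ts)

    def think(clock, deadline):
        depth = 0
        while clock.now() < deadline:  # the only time-dependence is injected
            depth += 1                 # pretend: search one ply deeper
        return depth

    # Replaying the same recorded clock reproduces the same decision.
    print(think(ReplayClock([0.0, 0.4, 0.8, 1.2]), deadline=1.0))  # always 3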
Posted Jul 21, 2018 17:22 UTC (Sat)
by epa (subscriber, #39769)
[Link]
Posted Jul 21, 2018 20:24 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (10 responses)
Posted Jul 23, 2018 6:46 UTC (Mon)
by epa (subscriber, #39769)
[Link] (9 responses)
Posted Jul 23, 2018 7:35 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (8 responses)
Posted Jul 23, 2018 9:24 UTC (Mon)
by excors (subscriber, #95769)
[Link] (6 responses)
The only references I can find to nondeterminism are about e.g. reductions using atomic add, where the nondeterministic part is just the order in which the parallel threads execute the add instruction, which matters since floating point addition is not associative. But if you imposed some order on the threads then it would go back to being perfectly deterministic.
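That non-associativity is easy to demonstrate even on a CPU (plain Python): reorder the additions and the rounded result changes.

    # (a + b) + c and a + (b + c) differ once rounding is involved.
    a, b, c = 1e16, -1e16, 1.0
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0 -- the 1.0 is absorbed by the huge operand

    # The same effect at scale: summing the same values in another order.
    import random
    xs = [random.random() for _ in range(100000)]
    s1 = sum(xs)
    random.shuffle(xs)
    s2 = sum(xs)
    print(s1 == s2)      # usually False, by a tiny rounding difference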
Posted Jul 23, 2018 9:30 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
> That would seem very surprising - do you have some evidence for that?
Yup.
I don't have examples of when 2+2 = 5, but I haven't searched for them. Usually it boils down to:
1) Trigonometric and transcendental functions are implemented _slightly_ differently between different GPU models or driver versions.
2) Optimizers try to use fused instructions just a little bit differently when compiling shaders on different driver versions.
3) RAM on GPUs is most definitely not using ECC and it's clocked at pretty high frequencies. So you can expect not-really-occasional data corruptions.
Posted Jul 23, 2018 9:53 UTC (Mon)
by epa (subscriber, #39769)
[Link]
Posted Jul 23, 2018 9:55 UTC (Mon)
by excors (subscriber, #95769)
[Link] (2 responses)
RAM corruption would count as hardware nondeterminism, but it looks like Tesla GPUs do have ECC - e.g. https://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kep... says "External DRAM is ECC protected in Tesla K10. Both external and internal memories are ECC protected in Tesla K40, K20X, and K20" (where "internal memories" means caches, register files, etc), so they've been doing it since 2012.
Posted Jul 26, 2018 14:26 UTC (Thu)
by gpu (guest, #125963)
[Link] (1 responses)
1. Non-synchronized GPU atomics (https://docs.nvidia.com/cuda/cuda-c-programming-guide/ind...)
2. Non-deterministic numerical algorithms, e.g. http://papers.nips.cc/paper/4390-hogwild-a-lock-free-appr... (though, this particular example is CPU-specific)
Posted Jul 28, 2018 19:44 UTC (Sat)
by wx (guest, #103979)
[Link]
This is spot on at least for anything using TensorFlow which, sadly, applies to the majority of deep learning research out there. The respective issue trackers on github are full of bug reports about TensorFlow not generating reproducible results. These are usually closed with claims that the use of atomics is strictly required to obtain plausible performance.
Anecdotal evidence from colleagues involved in deep learning research suggests that even if you have access to all source code and training data the resulting networks will often differ wildly if TensorFlow is involved. E.g. it's not uncommon for success rates of a trained classifier to vary from 75% to 90% between different training runs. With that in mind the discussion within Debian is, well, a little off from actual real-world problems.
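For what it's worth, newer TensorFlow releases (2.9 and later, per the TensorFlow documentation) expose an opt-in determinism switch; a brief sketch of its use:

    import tensorflow as tf

    tf.keras.utils.set_random_seed(42)  # seeds Python, NumPy, and TF at once
    tf.config.experimental.enable_op_determinism()

    # From here on, the same code on the same hardware should produce
    # bit-identical results; ops with no deterministic implementation
    # raise an error instead of silently varying (at a performance cost).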
Posted Jul 28, 2018 19:03 UTC (Sat)
by wx (guest, #103979)
[Link]
Unless you come up with code to reproduce such situations I'll have to call this out as an urban legend. I've been doing numerical optimization research (which has to be 100% reproducible) on virtually every high-end Nvidia GPU (consumer and Tesla) released since 2011. I've seen a lot of issues with Nvidia's toolchain but not a single case of the actual hardware misbehaving like that. Correct machine code will generate the exact same results every time you run it on the same GPU when using the same kernel launch parameters.
I'm also not aware of any credible reports by others to the contrary. There was one vague report of a Titan V not producing reproducible results (http://www.theregister.co.uk/2018/03/21/nvidia_titan_v_re...) but that is much more likely to be caused by the microarchitecture changes in Volta. Intra-warp communication now requires explicit synchronization, which can require significant porting effort for existing code bases and is rather tricky to get right.
Posted Jul 23, 2018 16:43 UTC (Mon)
by sfeam (subscriber, #2841)
[Link]
> Yup.
This all seems rather tangential to the original point. Hardware differences or nondeterminism are of interest if you care about 100% reproducibility. As Cyberax points out, that is not usually a concern in training a neural net. The question is, if the application itself does not require 100% reproducibility does this nevertheless affect how it can be licensed? From my point of view the reliability or reproducibility of a program's output is not a property on which "free vs non-free" hinges. If you have the code, you have the input, you have everything needed to build and run the program, and yet the output is nonreproducible, that is not due to a lack of freedom. To claim otherwise opens the door to arguments that a nondeterministic code can never be free, or for that matter that the presence of bugs makes a program non-free.
So in my view no, the presence of pre-calculated values or nondeterministic hardware or code that depends on non-reproducible properties like clock time, none of that affects the free/non-free status of the code itself for the purpose of licensing.
Posted Jul 23, 2018 8:28 UTC (Mon)
by Lennie (subscriber, #49641)
[Link] (1 responses)
Posted Jul 23, 2018 9:06 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link]
The problem is in CUDA or OpenCL implementations - the free stacks basically suck, so everybody uses proprietary drivers.
Posted Jul 28, 2018 18:40 UTC (Sat)
by wx (guest, #103979)
[Link]
Depending on the type of network to be trained that figure is off by an order of magnitude or two. Many networks require a lot of memory per GPU and many GPUs within a single node. Often you will need Tesla branded hardware because consumer cards don't have enough memory. A single DGX-1 (8 GPUs) will set you back by about EUR 100k including "EDU/startup" rebates. A single DGX-2 (16 GPUs) currently runs for about EUR 370k.
> And now we have even more AI-specialized hardware accelerators, so the price can be cut even further.
I guess you're referring to Google's TPUs which, however, are special purpose hardware and have virtually no effect on Nvidia pricing which is stable at extremely high levels. Even several years old used Tesla hardware will sell for little less than the original retail price.
Too much CUDA (was: Deep learning and free software)
Posted Jul 19, 2018 6:22 UTC (Thu)
by rbrito (guest, #66188)
[Link] (2 responses)
Posted Jul 19, 2018 9:07 UTC (Thu)
by Sesse (subscriber, #53779)
[Link]
Posted Jul 19, 2018 11:47 UTC (Thu)
by ssl (guest, #98177)
[Link]
Posted Jul 19, 2018 7:04 UTC (Thu)
by pabs (subscriber, #43278)
[Link]
http://penta.debconf.org/dc12_schedule/events/888.en.html
Some other examples of deep learning issues:
https://bugs.debian.org/699609
https://ffmpeg.org/pipermail/ffmpeg-devel/2018-July/23182...
Posted Jul 19, 2018 7:38 UTC (Thu)
by ernstp (guest, #13694)
[Link] (3 responses)
Posted Jul 26, 2018 13:26 UTC (Thu)
by t-v (guest, #112111)
[Link] (2 responses)
Given that there is ROCm support (in progress) in the usual deep learning libraries, it would be a great step towards having things like DeepSpeech (in one implementation or other) in Debian - something I think would be hugely beneficial to assistant devices etc.
Posted Jul 28, 2018 4:11 UTC (Sat)
by bridgman (guest, #50408)
[Link] (1 responses)
Posted Jul 28, 2018 4:25 UTC (Sat)
by bridgman (guest, #50408)
[Link]
Posted Jul 19, 2018 15:12 UTC (Thu)
by xav (guest, #18536)
[Link]
Posted Jul 24, 2018 2:17 UTC (Tue)
by pr1268 (guest, #24648)
[Link]
Given the discussion (in this article) about computer learning and Go, it's interesting to run across this link: How the Enlightenment Ends. Also somewhat unusual is who the author is...
> I was amazed that a computer could master Go, which is more complex than chess.
> [...]
> The speaker insisted that this ability could not be preprogrammed. His machine, he said, learned to master Go by training itself through practice.