LWN: Comments on "Deep learning and free software" https://lwn.net/Articles/760142/ This is a special feed containing comments posted to the individual LWN article titled "Deep learning and free software". en-us Sat, 30 Aug 2025 13:38:49 +0000 Sat, 30 Aug 2025 13:38:49 +0000 https://www.rssboard.org/rss-specification lwn@lwn.net Let's not exaggerate https://lwn.net/Articles/761176/ https://lwn.net/Articles/761176/ wx <div class="FormattedComment"> <font class="QuotedText">&gt; 1. Non-synchronized GPU atomics</font><br> <p> This is spot on at least for anything using TensorFlow which, sadly, applies to the majority of deep learning research out there. The respective issue trackers on github are full of bug reports about TensorFlow not generating reproducible results. These are usually closed with claims that the use of atomics is strictly required to obtain plausible performance.<br> <p> Anecdotal evidence from colleagues involved in deep learning research suggests that even if you have access to all source code and training data, the resulting networks will often differ wildly if TensorFlow is involved. E.g. it's not uncommon for success rates of a trained classifier to vary from 75% to 90% between different training runs. With that in mind, the discussion within Debian is, well, a little off from actual real-world problems.<br> </div> Sat, 28 Jul 2018 19:44:33 +0000 Let's not exaggerate https://lwn.net/Articles/761175/ https://lwn.net/Articles/761175/ wx <div class="FormattedComment"> <font class="QuotedText">&gt; &gt; As in, you could run a plain register-to-register floating-point arithmetic instruction on the GPU and it would sometimes give different results for the same input (on the same GPU and same compiled code etc)?</font><br> <font class="QuotedText">&gt; Yup.</font><br> <p> Unless you come up with code to reproduce such situations I'll have to call this out as an urban legend. I've been doing numerical optimization research (which has to be 100% reproducible) on virtually every high-end Nvidia GPU (consumer and Tesla) released since 2011. I've seen a lot of issues with Nvidia's toolchain but not a single case of the actual hardware misbehaving like that. Correct machine code will generate the exact same results every time you run it on the same GPU when using the same kernel launch parameters.<br> <p> I'm also not aware of any credible reports by others to the contrary. There was one vague report of a Titan V not producing reproducible results (<a rel="nofollow" href="http://www.theregister.co.uk/2018/03/21/nvidia_titan_v_reproducibility/">http://www.theregister.co.uk/2018/03/21/nvidia_titan_v_re...</a>) but that is much more likely to be caused by the microarchitecture changes in Volta. Intra-warp communication now requires explicit synchronization, which can require significant porting effort for existing code bases and is rather tricky to get right.<br> </div> Sat, 28 Jul 2018 19:03:07 +0000 Let's not exaggerate https://lwn.net/Articles/761174/ https://lwn.net/Articles/761174/ wx <div class="FormattedComment"> <font class="QuotedText">&gt; 20 computers with latest GPUs will cost around $100000</font><br> <p> Depending on the type of network to be trained, that figure is off by an order of magnitude or two. Many networks require a lot of memory per GPU and many GPUs within a single node. Often you will need Tesla-branded hardware because consumer cards don't have enough memory. A single DGX-1 (8 GPUs) will set you back by about EUR 100k including "EDU/startup" rebates.
A single DGX-2 (16 GPUs) currently runs for about EUR 370k.<br> <p> <font class="QuotedText">&gt; And now we have even more AI-specialized hardware accelerators, so the price can be cut even further.</font><br> <p> I guess you're referring to Google's TPUs which, however, are special-purpose hardware and have virtually no effect on Nvidia pricing, which is stable at extremely high levels. Even used Tesla hardware that is several years old will sell for little less than its original retail price.<br> </div> Sat, 28 Jul 2018 18:40:14 +0000 Deep learning and free software https://lwn.net/Articles/761158/ https://lwn.net/Articles/761158/ bridgman <div class="FormattedComment"> Forgot to mention that TensorFlow support on ROCm is now up to version 1.8. IIRC Lumin's concerns were specifically related to AlphaGo Zero, which I believe trains its networks using TensorFlow:<br> <p> <a rel="nofollow" href="https://gpuopen.com/rocm-tensorflow-1-8-release/">https://gpuopen.com/rocm-tensorflow-1-8-release/</a><br> </div> Sat, 28 Jul 2018 04:25:08 +0000 Deep learning and free software https://lwn.net/Articles/761156/ https://lwn.net/Articles/761156/ bridgman <div class="FormattedComment"> The hardware is non-free so you would need to add the amdgpu HW microcode images, but other than that the entire stack is open source. The core kernel code is upstream so should flow into the various Debian branches over time (Vega KFD is probably only in experimental at the moment).<br> </div> Sat, 28 Jul 2018 04:11:34 +0000 Let's not exaggerate https://lwn.net/Articles/760958/ https://lwn.net/Articles/760958/ gpu <div class="FormattedComment"> Two better examples of non-determinism on GPUs would be:<br> <p> 1. Non-synchronized GPU atomics (<a rel="nofollow" href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions">https://docs.nvidia.com/cuda/cuda-c-programming-guide/ind...</a>)<br> <p> 2. Non-deterministic numerical algorithms, e.g. <a rel="nofollow" href="http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent">http://papers.nips.cc/paper/4390-hogwild-a-lock-free-appr...</a> (though this particular example is CPU-specific)<br> </div> Thu, 26 Jul 2018 14:26:17 +0000 Deep learning and free software https://lwn.net/Articles/760938/ https://lwn.net/Articles/760938/ t-v <div class="FormattedComment"> But is it blob-free enough to have in Debian?<br> <p> Given that there is (in progress) ROCm support in the usual deep learning libraries, it would be a great step towards having things like DeepSpeech (in one implementation or other) in Debian - something I think would be hugely beneficial to assistant devices etc.<br> <p> </div> Thu, 26 Jul 2018 13:26:35 +0000 Interesting link somewhat related https://lwn.net/Articles/760623/ https://lwn.net/Articles/760623/ pr1268 <p>Given the discussion (in this article) about computer learning and <a href="https://en.wikipedia.org/wiki/Go_(game)">Go</a>, it's interesting to run across this link: <a href="https://www.theatlantic.com/magazine/archive/2018/06/henry-kissinger-ai-could-mean-the-end-of-human-history/559124/">How the Enlightenment Ends</a>.</p> <blockquote><font class="QuotedText">I was amazed that a computer could master Go, which is more complex than chess.<br/> [...]<br/> The speaker insisted that this ability could not be preprogrammed.
His machine, he said, learned to master Go by training itself through practice.</font></blockquote> <p>Also somewhat unusual is <a href="https://www.henryakissinger.com/">who the author is</a>...</p> Tue, 24 Jul 2018 02:17:17 +0000 Let's not exaggerate https://lwn.net/Articles/760585/ https://lwn.net/Articles/760585/ sfeam This all seems rather tangential to the original point. Hardware differences or nondeterminism are of interest <i>if you care about 100% reproducibility</i>. As Cyberax points out, that is not usually a concern in training a neural net. The question is, if the application itself does not require 100% reproducibility does this nevertheless affect how it can be licensed? From my point of view the reliability or reproducibility of a program's output is not a property on which "free vs non-free" hinges. If you have the code, you have the input, you have everything needed to build and run the program, and yet the output is nonreproducible, that is not due to a lack of freedom. To claim otherwise opens the door to arguments that a nondeterministic code can never be free, or for that matter that the presence of bugs makes a program non-free. <p> So in my view no, the presence of pre-calculated values or nondeterministic hardware or code that depends on non-reproducible properties like clock time, none of that affects the free/non-free status of the code itself for the purpose of licensing. Mon, 23 Jul 2018 16:43:06 +0000 Let's not exaggerate https://lwn.net/Articles/760523/ https://lwn.net/Articles/760523/ excors <div class="FormattedComment"> Differences between GPU models and driver versions seem like a separate issue, not related to the claim that "Hardware itself is also not that trustworthy". It's still important for reproducibility, but not really any different to software on CPUs where different compilers and different libc maths functions will give different results.<br> <p> RAM corruption would count as hardware nondeterminism, but it looks like Tesla GPUs do have ECC - e.g. <a href="https://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kepler-Family-Datasheet.pdf">https://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kep...</a> says "External DRAM is ECC protected in Tesla K10. Both external and internal memories are ECC protected in Tesla K40, K20X, and K20" (where "internal memories" means caches, register files, etc), so they've been doing it since 2012.<br> </div> Mon, 23 Jul 2018 09:55:54 +0000 Let's not exaggerate https://lwn.net/Articles/760522/ https://lwn.net/Articles/760522/ epa <div class="FormattedComment"> I guess it's the second part of your original comment that appears not to fit. "Especially since neural networks often use f16 or even f8 precision." The reasons you describe for nondeterminism don't seem that they would affect smaller floating point types any worse than bigger ones.<br> </div> Mon, 23 Jul 2018 09:53:41 +0000 Let's not exaggerate https://lwn.net/Articles/760520/ https://lwn.net/Articles/760520/ Cyberax <div class="FormattedComment"> <font class="QuotedText">&gt; As in, you could run a plain register-to-register floating-point arithmetic instruction on the GPU and it would sometimes give different results for the same input (on the same GPU and same compiled code etc)?</font><br> Yup.<br> <p> <font class="QuotedText">&gt; That would seem very surprising - do you have some evidence for that?</font><br> I don't have examples of when 2+2 = 5, but I haven't searched for them. 
Usually it boils down to:<br> <p> 1) Trigonometric and transcendental functions are implemented _slightly_ differently between different GPU models or driver versions.<br> 2) Optimizers try to use fused instructions just a little bit differently when compiling shaders on different driver versions.<br> 3) RAM on GPUs is most definitely not using ECCs and it's clocked at pretty high frequencies. So you can expect not-really-occasional data corruptions. <br> </div> Mon, 23 Jul 2018 09:30:33 +0000 Let's not exaggerate https://lwn.net/Articles/760519/ https://lwn.net/Articles/760519/ excors <div class="FormattedComment"> As in, you could run a plain register-to-register floating-point arithmetic instruction on the GPU and it would sometimes give different results for the same input (on the same GPU and same compiled code etc)? That would seem very surprising - do you have some evidence for that?<br> <p> The only references I can find to nondeterminism are about e.g. reductions using atomic add, where the nondeterministic part is just the order in which the parallel threads execute the add instruction, which matters since floating point addition is not associative. But if you imposed some order on the threads then it would go back to being perfectly deterministic.<br> </div> Mon, 23 Jul 2018 09:24:47 +0000 Let's not exaggerate https://lwn.net/Articles/760518/ https://lwn.net/Articles/760518/ Cyberax <div class="FormattedComment"> Orchestrating software is usually not a problem here, there are tons of free software available.<br> <p> The problem is in CUDA or OpenCL implementations - the free stacks basically suck, so everybody uses proprietary drivers.<br> </div> Mon, 23 Jul 2018 09:06:43 +0000 Let's not exaggerate https://lwn.net/Articles/760517/ https://lwn.net/Articles/760517/ Lennie <div class="FormattedComment"> In the article it was mentioned that those clusters don't run free software. And it would take a lot more time without it.<br> </div> Mon, 23 Jul 2018 08:28:20 +0000 Let's not exaggerate https://lwn.net/Articles/760515/ https://lwn.net/Articles/760515/ Cyberax <div class="FormattedComment"> Yes. Hardware is far less trustworthy. I guess you can do multiple runs and compare the results... Neural networks are self-correcting during training, so I’m not really aware of anybody working on it.<br> </div> Mon, 23 Jul 2018 07:35:37 +0000 Let's not exaggerate https://lwn.net/Articles/760509/ https://lwn.net/Articles/760509/ epa <div class="FormattedComment"> The floating point operations provided by an FPU will give approximate results in many cases, but they are deterministic. Is it different for GPU calculations?<br> </div> Mon, 23 Jul 2018 06:46:40 +0000 Deep learning and free software https://lwn.net/Articles/760483/ https://lwn.net/Articles/760483/ k8to <div class="FormattedComment"> leela-zero is effectively a client in a distributed effort to build neural network weights for leela-zero to use to play Go well. Effectively, there's no training data, since this is in the class of problems where the code can self-train without external data by playing itself. 
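<br> <p> Very roughly, the loop looks like this (a purely illustrative sketch with stand-in functions, not leela-zero's actual code):<br> <pre>
import random

def self_play(weights):
    # Stand-in for one game of the network playing itself: returns
    # made-up positions and a made-up result in place of a real game record.
    positions = [random.random() for _ in range(10)]
    result = random.choice([-1, 1])
    return positions, result

def train_on(weights, games):
    # Stand-in for a training pass; a real implementation would fit the
    # network to the (positions, result) pairs from the self-play games.
    return [w + 0.001 * random.uniform(-1, 1) for w in weights]

weights = [0.0] * 100        # start from an untrained network
for step in range(1000):     # no external data: the games are generated on the fly
    games = [self_play(weights) for _ in range(8)]
    weights = train_on(weights, games)
</pre>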
<br> <p> So this is basically a distributed computing client as well as a Go engine that runs with the results of that distributed computing effort.<br> <p> From a free software perspective, the major concerns I see for leela-zero are the ownership and reproducibility of the "central infrastructure" so that others could continue or proceed alternatively in this process without the help of the original creators.<br> <p> That all of that work is going on in the public view on github with the tools exposed should go a long way towards alleviating practical concerns, but it would be nice if the server tools were packaged as well.<br> <p> </div> Sat, 21 Jul 2018 20:53:56 +0000 Let's not exaggerate https://lwn.net/Articles/760481/ https://lwn.net/Articles/760481/ Cyberax <div class="FormattedComment"> There are several sources of non-determinism here, mostly from parallel execution. Hardware itself is also not that trustworthy, especially since neural networks often use f16 or even f8 precision.<br> </div> Sat, 21 Jul 2018 20:24:07 +0000 Let's not exaggerate https://lwn.net/Articles/760480/ https://lwn.net/Articles/760480/ epa <div class="FormattedComment"> Yes, during the training stage the time limit would have to be expressed as a number of computation steps rather than wall clock time. That seems like common sense anyway. <br> </div> Sat, 21 Jul 2018 17:22:03 +0000 Let's not exaggerate https://lwn.net/Articles/760479/ https://lwn.net/Articles/760479/ sfeam If we are considering game-playing programs that must respond to time-pressure limits, then full reproducibility might require either identical hardware or a dummied-up clock so that the nominal clock time of all decision points is pre-determined. Sat, 21 Jul 2018 16:53:49 +0000 Let's not exaggerate https://lwn.net/Articles/760478/ https://lwn.net/Articles/760478/ epa <div class="FormattedComment"> Not deterministic? Surely you can seed the pseudo-random number generator with an agreed value and get reproducible results? Or is there something in the GPU hardware used that makes the output inherently non-deterministic?<br> </div> Sat, 21 Jul 2018 16:47:33 +0000 Let's not exaggerate https://lwn.net/Articles/760402/ https://lwn.net/Articles/760402/ Cyberax <div class="FormattedComment"> You do need powerful GPUs to train models, the gulf in performance between GPUs and CPUs is simply too ridiculous to even consider. 
There's also a problem with trust - deep learning is not deterministic and you can't easily cross-check the results.<br> </div> Fri, 20 Jul 2018 01:49:36 +0000 Let's not exaggerate https://lwn.net/Articles/760400/ https://lwn.net/Articles/760400/ tedd <div class="FormattedComment"> Would it also be feasible to develop a BOINC project for this, to get the community to train the models in a distributed fashion?<br> </div> Thu, 19 Jul 2018 23:56:03 +0000 Deep learning and free software https://lwn.net/Articles/760343/ https://lwn.net/Articles/760343/ xav <div class="FormattedComment"> Isn't a game with "liberated" artwork &amp; code free, even if I can't redraw the artwork myself?<br> </div> Thu, 19 Jul 2018 15:12:57 +0000 Too much CUDA (was: Deep learning and free software) https://lwn.net/Articles/760286/ https://lwn.net/Articles/760286/ ssl <div class="FormattedComment"> Well, the cause of this is simple: CUDA is much easier to program against than OpenCL, simple as that.<br> </div> Thu, 19 Jul 2018 11:47:01 +0000 Too much CUDA (was: Deep learning and free software) https://lwn.net/Articles/760278/ https://lwn.net/Articles/760278/ Sesse <div class="FormattedComment"> Possibly because OpenCL doesn't really work too well.<br> </div> Thu, 19 Jul 2018 09:07:23 +0000 Deep learning and free software https://lwn.net/Articles/760276/ https://lwn.net/Articles/760276/ brooksmoses <div class="FormattedComment"> Oh, good. Thanks for updating me! The conference I was thinking of was probably eight or ten years ago.<br> </div> Thu, 19 Jul 2018 07:58:51 +0000 Deep learning and free software https://lwn.net/Articles/760274/ https://lwn.net/Articles/760274/ Sesse <div class="FormattedComment"> I agree you _also_ need to ship the weights. And if the final weights are e.g. quantized (i.e., inference uses a form different from training), you'd need to ship the original ones, too, for continued training.<br> <p> In the case of Leela, it's all a bit muddied in that it's not even clear exactly what the training data is. The amount of human input is zero, but there are all these games produced as part of the training that are being used to train the network itself.
I'm not sure if they need to be included or not—I'd lean towards not.<br> </div> Thu, 19 Jul 2018 07:39:42 +0000 Deep learning and free software https://lwn.net/Articles/760275/ https://lwn.net/Articles/760275/ ernstp <div class="FormattedComment"> AMD's stuff is moving in the right direction:<br> <p> <a href="https://github.com/ROCmSoftwarePlatform/MIOpen">https://github.com/ROCmSoftwarePlatform/MIOpen</a><br> </div> Thu, 19 Jul 2018 07:38:31 +0000 Let's not exaggerate https://lwn.net/Articles/760273/ https://lwn.net/Articles/760273/ Cyberax <div class="FormattedComment"> <a href="https://voice.mozilla.org/">https://voice.mozilla.org/</a><br> </div> Thu, 19 Jul 2018 07:30:05 +0000 Deep learning and free software https://lwn.net/Articles/760272/ https://lwn.net/Articles/760272/ pabs <div class="FormattedComment"> Debian also discussed this back in 2012 at DebConf:<br> <p> <a href="http://penta.debconf.org/dc12_schedule/events/888.en.html">http://penta.debconf.org/dc12_schedule/events/888.en.html</a><br> <p> Some other examples of deep learning issues:<br> <p> <a href="https://bugs.debian.org/699609">https://bugs.debian.org/699609</a><br> <a href="https://ffmpeg.org/pipermail/ffmpeg-devel/2018-July/231828.html">https://ffmpeg.org/pipermail/ffmpeg-devel/2018-July/23182...</a><br> <p> </div> Thu, 19 Jul 2018 07:04:21 +0000 Let's not exaggerate https://lwn.net/Articles/760270/ https://lwn.net/Articles/760270/ pabs <div class="FormattedComment"> DeepSpeech isn't the greatest example; I read somewhere that some of the source data used for training the model is proprietary.<br> </div> Thu, 19 Jul 2018 06:58:46 +0000 Too much CUDA (was: Deep learning and free software) https://lwn.net/Articles/760260/ https://lwn.net/Articles/760260/ rbrito <div class="FormattedComment"> I know that this is orthogonal to the issue in question, but I just can't help but express my disappointment when I see a lot of "data analysis/big data/machine learning" frameworks supporting only CUDA (and frequently requiring/recommending Linux), yet the more open alternatives (like OpenCL) are totally forgotten...<br> </div> Thu, 19 Jul 2018 06:22:34 +0000 Let's not exaggerate https://lwn.net/Articles/760262/ https://lwn.net/Articles/760262/ Cyberax <div class="FormattedComment"> Let's not exaggerate, please.<br> <p> Training a neural net requires a lot of compute time, but it's definitely not "outside of abilities of a non-profit". A small cluster of 10-20 computers outfitted with GPUs will be able to train models like AlphaZero within a day or so. That's what Mozilla does for their DeepSpeech project.<br> <p> 20 computers with latest GPUs will cost around $100000 - not an insignificant amount, but not insurmountable either.
And now we have even more AI-specialized hardware accelerators, so the price can be cut even further.<br> <p> <p> </div> Thu, 19 Jul 2018 05:05:48 +0000 Deep learning and free software https://lwn.net/Articles/760259/ https://lwn.net/Articles/760259/ pabs <div class="FormattedComment"> Stallman has revised his position on hardware; it is a bit more nuanced now:<br> <p> <a href="https://www.fsf.org/blogs/rms/how-to-make-hardware-designs-free">https://www.fsf.org/blogs/rms/how-to-make-hardware-design...</a><br> <a href="http://www.wired.com/2015/03/need-free-digital-hardware-designs/">http://www.wired.com/2015/03/need-free-digital-hardware-d...</a><br> <a href="https://www.wired.com/2015/03/richard-stallman-how-to-make-hardware-designs-free/">https://www.wired.com/2015/03/richard-stallman-how-to-mak...</a><br> </div> Thu, 19 Jul 2018 03:14:03 +0000 Deep learning and free software https://lwn.net/Articles/760258/ https://lwn.net/Articles/760258/ brooksmoses <div class="FormattedComment"> There's an interesting analogy here, in that this is not the first time that the free-software community has grappled with the idea that having the source code is not sufficient to edit the operation of a computing system without a lot of resources. A similar problem arises with computer hardware: Even if I had all of the source code to the processor in my television (never mind the one in my workstation!), I couldn't rebuild that processor with a bugfix without a few billion dollars in resources.<br> <p> I've been at a conference where Richard Stallman addressed that issue -- which is to say, he basically punted and said that Free Software was irrelevant to computer hardware because nobody would be able to fab a replacement chip. Personally, I think that's a shortsighted conclusion (and it was particularly ironic that he said this at a conference of people largely funded by DoD contracts, many of whom likely would be in a position to fab their own chips), but I do find it an interesting datapoint. Is the sort of neural-net training data that can only be generated by running something on the equivalent of Google's CloudTPU clusters for zillions of hours a thing that is limited in its ability to be "free" in much the same way that the Xeon in my desktop is limited by the technological inability of anyone but Intel to make one? And, if so, is the right answer actually to say "that's not a problem we can solve with Software Freedom right now," as Stallman did with hardware?<br> <p> If so, then the question becomes: What can we practically do instead? And, of those options, what do we want to do?<br> </div> Thu, 19 Jul 2018 01:29:04 +0000 Deep learning and free software https://lwn.net/Articles/760257/ https://lwn.net/Articles/760257/ brooksmoses <div class="FormattedComment"> I'd disagree with that premise -- neural weights are similar to object code, but I think the differences matter.<br> <p> One difference is that it's not always true that you'd modify things by changing the training program or the input data. In many cases where the input set changes over time, it's common to continually train the model by taking last week's converged weights as a starting point and iterating on new inputs, and thereby tweaking the weights to make this week's model.
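<br> <p> In rough code terms, that kind of continued training looks something like this (a purely illustrative PyTorch-style sketch with made-up file names, not any particular project's training script):<br> <pre>
import torch
from torch import nn

# A tiny stand-in model; in practice this is whatever architecture the project uses.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Stand-in for "last week's converged weights" (saved here only so the
# example is self-contained; in practice this file already exists).
torch.save(model.state_dict(), "last_week.pt")

# Continued training starts from the existing weights, not from scratch.
model.load_state_dict(torch.load("last_week.pt"))

opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# "This week's" new inputs -- random placeholders here.
new_batches = [(torch.randn(8, 32), torch.randint(0, 10, (8,))) for _ in range(100)]

for inputs, labels in new_batches:
    opt.zero_grad()
    loss_fn(model(inputs), labels).backward()
    opt.step()

# The tweaked weights become this week's model.
torch.save(model.state_dict(), "this_week.pt")
</pre>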
It may not be "direct editing", but it is certainly a situation where the preferred form of input for making changes includes the existing weights.<br> <p> Another difference is that I think the non-determinism of the training process is somewhat more of a quibble than you do. Certainly it's a reason why it's important to include the trained weights with the source code to make something "free"; if you can't fix a bug in the file-input part of a program without changing the computations it makes to analyze the file after it's opened, that's not really in the spirit of software freedom. A similar case would be if I have a magic proprietary compiler that analyzes my C++ floating-point math and promotes floats to double where needed for numerical stability: if I don't make that compiler free, then giving you my physics-model source code isn't giving you what you need to rebuild it and get the same results.<br> <p> The fundamental problem, I think, is simply that we don't have the computer science yet to be able to edit the weights directly, and so there _is_ really no ideal starting point for making "small" changes to an existing model (other than retraining from the existing weights). We're basically at a point where important parts of the program are write-only and to edit them we have to recreate them -- and that re-creation can involve human creative input as well as machine time.<br> </div> Thu, 19 Jul 2018 01:12:21 +0000 Deep learning and free software https://lwn.net/Articles/760256/ https://lwn.net/Articles/760256/ ejr <div class="FormattedComment"> A modern-day "Table Maker's Dilemma"?<br> </div> Thu, 19 Jul 2018 00:49:13 +0000 Deep learning and free software https://lwn.net/Articles/760252/ https://lwn.net/Articles/760252/ sfeam It is not really novel for practical use of a program to depend on access to pre-calculated data values. Is the GNU Scientific Library unfree because it embeds tables of numerical coefficients that would be tedious (but surely possible) to recreate if they were not included with the source? Thu, 19 Jul 2018 00:05:29 +0000