Let's not exaggerate

Posted Jul 21, 2018 16:47 UTC (Sat) by epa (subscriber, #39769)
In reply to: Let's not exaggerate by Cyberax
Parent article: Deep learning and free software

Not deterministic? Surely you can seed the pseudo-random number generator with an agreed value and get reproducible results? Or is there something in the GPU hardware used that makes the output inherently non-deterministic?
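
For purely CPU-bound, single-threaded code the premise does hold. A minimal sketch of what such seeding looks like in practice, assuming Python with NumPy and PyTorch purely for illustration:

    import random
    import numpy as np
    import torch  # assumed framework; others expose equivalent knobs

    SEED = 1234
    random.seed(SEED)        # Python's built-in PRNG
    np.random.seed(SEED)     # NumPy's legacy global PRNG
    torch.manual_seed(SEED)  # PyTorch PRNGs (CPU and, in recent versions, CUDA)

    # Seeding alone gives bit-identical runs for single-threaded CPU code;
    # the question below is what happens once GPUs and parallelism enter.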


Let's not exaggerate

Posted Jul 21, 2018 16:53 UTC (Sat) by sfeam (subscriber, #2841) [Link] (1 responses)

If we are considering game-playing programs that must respond to time-pressure limits, then full reproducibility might require either identical hardware or a dummied-up clock so that the nominal clock time of all decision points is pre-determined.

Let's not exaggerate

Posted Jul 21, 2018 17:22 UTC (Sat) by epa (subscriber, #39769) [Link]

Yes, during the training stage the time limit would have to be expressed as a number of computation steps rather than wall clock time. That seems like common sense anyway.
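
A rough sketch of the distinction, with purely hypothetical names (Python):

    import time

    NODE_BUDGET = 100_000  # fixed step count: identical on any hardware

    def search_by_steps(expand, root):
        # Expands a fixed number of nodes, so the same seed gives the same
        # result regardless of how fast the machine is.
        frontier, visited = [root], 0
        while frontier and visited < NODE_BUDGET:
            frontier.extend(expand(frontier.pop()))
            visited += 1
        return visited

    def search_by_clock(expand, root, seconds=1.0):
        # Expands until the wall clock runs out: faster hardware explores
        # more nodes, so results vary from machine to machine.
        frontier, visited = [root], 0
        deadline = time.monotonic() + seconds
        while frontier and time.monotonic() < deadline:
            frontier.extend(expand(frontier.pop()))
            visited += 1
        return visited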

Let's not exaggerate

Posted Jul 21, 2018 20:24 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (10 responses)

There are several sources of non-determinism here, mostly from parallel execution. Hardware itself is also not that trustworthy, especially since neural networks often use f16 or even f8 precision.

Let's not exaggerate

Posted Jul 23, 2018 6:46 UTC (Mon) by epa (subscriber, #39769) [Link] (9 responses)

The floating point operations provided by an FPU will give approximate results in many cases, but they are deterministic. Is it different for GPU calculations?

Let's not exaggerate

Posted Jul 23, 2018 7:35 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (8 responses)

Yes. Hardware is far less trustworthy. I guess you can do multiple runs and compare the results... Neural networks are self-correcting during training, so I’m not really aware of anybody working on it.

Let's not exaggerate

Posted Jul 23, 2018 9:24 UTC (Mon) by excors (subscriber, #95769) [Link] (6 responses)

As in, you could run a plain register-to-register floating-point arithmetic instruction on the GPU and it would sometimes give different results for the same input (on the same GPU and same compiled code etc)? That would seem very surprising - do you have some evidence for that?

The only references I can find to nondeterminism are about e.g. reductions using atomic add, where the nondeterministic part is just the order in which the parallel threads execute the add instruction, which matters since floating point addition is not associative. But if you imposed some order on the threads then it would go back to being perfectly deterministic.
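
The non-associativity itself is easy to see from a Python prompt; a small illustration (doubles on the CPU, but the same holds for any IEEE 754 type):

    a, b, c = 1e20, -1e20, 1.0

    # The order in which the additions happen changes the result, because
    # the small value is absorbed by the huge intermediate in one ordering:
    print((a + b) + c)   # 1.0
    print(a + (b + c))   # 0.0

    # Any *fixed* order, however, is repeatable: summing in a deterministic
    # sequence gives the same bits on every run.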

Let's not exaggerate

Posted Jul 23, 2018 9:30 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

> As in, you could run a plain register-to-register floating-point arithmetic instruction on the GPU and it would sometimes give different results for the same input (on the same GPU and same compiled code etc)?
Yup.

> That would seem very surprising - do you have some evidence for that?
I don't have examples of when 2+2 = 5, but I haven't searched for them. Usually it boils down to:

1) Trigonometric and transcendental functions are implemented _slightly_ differently between different GPU models or driver versions.
2) Optimizers apply fused instructions slightly differently when compiling shaders on different driver versions (see the sketch after this list).
3) RAM on GPUs is most definitely not using ECC and is clocked at pretty high frequencies, so you can expect data corruption more often than you might like.
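
Point 2 comes down to rounding: a fused multiply-add rounds once where separate multiply and add instructions round twice. A CPU-side NumPy sketch that only mimics the effect (it is not what a GPU shader compiler actually emits):

    import numpy as np

    a = b = np.float32(1.0000001)
    c = np.float32(-1.0000002)

    # Separate multiply then add: the product is rounded to float32 first.
    separate = np.float32(a * b) + c

    # A fused multiply-add keeps the full product and rounds only once;
    # emulate that here by computing in float64 and rounding at the end.
    fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))

    print(separate, fused)   # 0.0 vs roughly 1.42e-14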

Let's not exaggerate

Posted Jul 23, 2018 9:53 UTC (Mon) by epa (subscriber, #39769) [Link]

I guess it's the second part of your original comment that doesn't seem to fit: "Especially since neural networks often use f16 or even f8 precision." The reasons you describe for nondeterminism don't seem like they would affect smaller floating-point types any worse than bigger ones.

Let's not exaggerate

Posted Jul 23, 2018 9:55 UTC (Mon) by excors (subscriber, #95769) [Link] (2 responses)

Differences between GPU models and driver versions seem like a separate issue, not related to the claim that "Hardware itself is also not that trustworthy". It's still important for reproducibility, but not really any different to software on CPUs where different compilers and different libc maths functions will give different results.

RAM corruption would count as hardware nondeterminism, but it looks like Tesla GPUs do have ECC - e.g. https://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kep... says "External DRAM is ECC protected in Tesla K10. Both external and internal memories are ECC protected in Tesla K40, K20X, and K20" (where "internal memories" means caches, register files, etc), so they've been doing it since 2012.

Let's not exaggerate

Posted Jul 26, 2018 14:26 UTC (Thu) by gpu (guest, #125963) [Link] (1 responses)

Two better examples of non-determinism on GPUs would be:

1. Non-synchronized GPU atomics (https://docs.nvidia.com/cuda/cuda-c-programming-guide/ind...); see the sketch after this list

2. Non-deterministic numerical algorithms, e.g. http://papers.nips.cc/paper/4390-hogwild-a-lock-free-appr... (though this particular example is CPU-specific)
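
The atomics case is essentially the same order-of-summation issue discussed above: each individual add is computed correctly, but the order in which threads land their updates is not fixed. A CPU-side NumPy sketch that mimics the effect by permuting the summation order (illustrative only, no GPU involved):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000).astype(np.float32)

    # Same numbers, different summation order: the two sums typically differ
    # in the last bits, which is exactly what unsynchronized atomic adds do
    # from run to run.
    print(np.sum(x))
    print(np.sum(rng.permutation(x)))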

Let's not exaggerate

Posted Jul 28, 2018 19:44 UTC (Sat) by wx (guest, #103979) [Link]

> 1. Non-synchronized GPU atomics

This is spot on, at least for anything using TensorFlow, which sadly covers the majority of deep learning research out there. The respective issue trackers on GitHub are full of bug reports about TensorFlow not producing reproducible results. These are usually closed with the claim that the use of atomics is strictly required to obtain acceptable performance.

Anecdotal evidence from colleagues involved in deep learning research suggests that even if you have access to all the source code and training data, the resulting networks will often differ wildly if TensorFlow is involved. For example, it's not uncommon for the success rate of a trained classifier to vary between 75% and 90% across different training runs. With that in mind, the discussion within Debian is, well, a little removed from actual real-world problems.
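
For completeness: newer TensorFlow releases do expose switches for this, at a performance cost. A hedged sketch, assuming a reasonably recent TF 2.x (the exact APIs post-date this discussion):

    import tensorflow as tf

    # Seed the Python, NumPy and TensorFlow PRNGs in one call, then ask TF to
    # reject or replace non-deterministic kernels (older releases relied on
    # the TF_DETERMINISTIC_OPS environment variable instead).
    tf.keras.utils.set_random_seed(1234)
    tf.config.experimental.enable_op_determinism()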

Let's not exaggerate

Posted Jul 28, 2018 19:03 UTC (Sat) by wx (guest, #103979) [Link]

> > As in, you could run a plain register-to-register floating-point arithmetic instruction on the GPU and it would sometimes give different results for the same input (on the same GPU and same compiled code etc)?
> Yup.

Unless you come up with code to reproduce such situations, I'll have to call this out as an urban legend. I've been doing numerical optimization research (which has to be 100% reproducible) on virtually every high-end Nvidia GPU (consumer and Tesla) released since 2011. I've seen a lot of issues with Nvidia's toolchain but not a single case of the actual hardware misbehaving like that. Correct machine code will generate the exact same results every time you run it on the same GPU when using the same kernel launch parameters.

I'm also not aware of any credible reports by others to the contrary. There was one vague report of a Titan V not producing reproducible results (http://www.theregister.co.uk/2018/03/21/nvidia_titan_v_re...), but that is much more likely to be caused by the microarchitecture changes in Volta. Intra-warp communication now requires explicit synchronization, which can require significant porting effort for existing code bases and is rather tricky to get right.

Let's not exaggerate

Posted Jul 23, 2018 16:43 UTC (Mon) by sfeam (subscriber, #2841) [Link]

This all seems rather tangential to the original point. Hardware differences or nondeterminism are of interest if you care about 100% reproducibility; as Cyberax points out, that is not usually a concern when training a neural net. The question is: if the application itself does not require 100% reproducibility, does this nevertheless affect how it can be licensed? From my point of view, the reliability or reproducibility of a program's output is not a property on which "free vs non-free" hinges. If you have the code, you have the input, you have everything needed to build and run the program, and yet the output is nonreproducible, that is not due to a lack of freedom. To claim otherwise opens the door to arguments that nondeterministic code can never be free, or for that matter that the presence of bugs makes a program non-free.

So in my view, no: neither pre-calculated values, nor nondeterministic hardware, nor code that depends on non-reproducible properties like clock time affects the free/non-free status of the code itself for the purposes of licensing.

