
Let's not exaggerate

Posted Jul 23, 2018 9:55 UTC (Mon) by excors (subscriber, #95769)
In reply to: Let's not exaggerate by Cyberax
Parent article: Deep learning and free software

Differences between GPU models and driver versions seem like a separate issue, unrelated to the claim that "Hardware itself is also not that trustworthy". They still matter for reproducibility, but they're not really any different from software on CPUs, where different compilers and different libc maths functions will give different results.
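(Purely as illustration, not part of the original comment: a minimal Python sketch of that CPU-side analogue. All three sums below are "correct", but they order the additions differently, and since float addition is not associative the low bits can disagree, much as they do across compilers or maths libraries.)

    import math
    import random

    random.seed(0)
    xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

    naive = sum(xs)                    # plain left-to-right accumulation
    exact = math.fsum(xs)              # Shewchuk's exact summation
    chunked = sum(sum(xs[i:i + 1000])  # chunked, tree-like accumulation
                  for i in range(0, len(xs), 1000))

    # The three results typically agree to many digits but differ
    # in the last few bits.
    print(repr(naive), repr(exact), repr(chunked))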

RAM corruption would count as hardware nondeterminism, but it looks like Tesla GPUs do have ECC. For example, https://www.nvidia.com/content/tesla/pdf/NVIDIA-Tesla-Kep... says "External DRAM is ECC protected in Tesla K10. Both external and internal memories are ECC protected in Tesla K40, K20X, and K20" (where "internal memories" means caches, register files, etc.), so NVIDIA has been doing this since 2012.



Let's not exaggerate

Posted Jul 26, 2018 14:26 UTC (Thu) by gpu (guest, #125963)

Two better examples of non-determinism on GPUs would be:

1. Non-synchronized GPU atomics (https://docs.nvidia.com/cuda/cuda-c-programming-guide/ind...); a sketch of why these are non-deterministic follows this list

2. Non-deterministic numerical algorithms, e.g. http://papers.nips.cc/paper/4390-hogwild-a-lock-free-appr... (though this particular example is CPU-specific)
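As a sketch of point 1 (mine, not the commenter's): the hardware applies each atomic add exactly once, but in no guaranteed order, and float addition is not associative. A pure-Python stand-in, with a shuffle playing the role of the GPU scheduler:

    import random

    vals = [random.uniform(-1e6, 1e6) for _ in range(10_000)]

    totals = set()
    for run in range(10):
        random.shuffle(vals)    # stands in for the GPU's arbitrary
        acc = 0.0               # scheduling of the atomicAdd calls
        for v in vals:
            acc += v            # every add happens exactly once...
        totals.add(acc)         # ...yet the result varies per "run"

    print(len(totals), "distinct totals from 10 runs")

On real hardware the interleaving comes from warp scheduling rather than a shuffle, but the arithmetic effect is the same.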

Let's not exaggerate

Posted Jul 28, 2018 19:44 UTC (Sat) by wx (guest, #103979)

> 1. Non-synchronized GPU atomics

This is spot on, at least for anything using TensorFlow, which, sadly, covers the majority of deep learning research out there. The respective issue trackers on GitHub are full of bug reports about TensorFlow not producing reproducible results. These are usually closed with the claim that atomics are strictly required to achieve reasonable performance.
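A sketch (mine, not anything from TensorFlow's code) of the deterministic alternative those reports usually ask for: give each contribution its own slot and reduce the slots in a fixed order. This restores bitwise reproducibility at the cost of extra memory and an extra pass, which is the trade-off the maintainers cite:

    import random

    vals = [random.uniform(-1e6, 1e6) for _ in range(10_000)]

    totals = set()
    for run in range(10):
        work = list(enumerate(vals))
        random.shuffle(work)           # "threads" still finish in an
        slots = [0.0] * len(vals)      # arbitrary order...
        for i, v in work:
            slots[i] = v               # ...but each writes its own slot
        acc = 0.0
        for s in slots:                # final reduction in a fixed,
            acc += s                   # deterministic index order
        totals.add(acc)

    print(len(totals), "distinct total(s) from 10 runs")  # prints 1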

Anecdotal evidence from colleagues involved in deep learning research suggests that even with access to all source code and training data, the resulting networks will often differ wildly if TensorFlow is involved. For example, it's not uncommon for the success rate of a trained classifier to vary from 75% to 90% between training runs. With that in mind, the discussion within Debian is, well, a little removed from the actual real-world problems.

