
Courtès: What's in a package

Over at the Guix-HPC blog, Ludovic Courtès writes about trying to package the PyTorch machine-learning library for the Guix distribution. Building from source in a user-verifiable manner is part of the philosophy behind Guix, but the effort ran into a number of problems:
The first surprise when starting packaging PyTorch is that, despite being on PyPI, PyTorch is first and foremost a large C++ code base. It does have a setup.py as commonly found in pure Python packages, but that file delegates the bulk of the work to CMake.

The second surprise is that PyTorch bundles (or "vendors", as some would say) source code for no less than 41 dependencies, ranging from small Python and C++ helper libraries to large C++ neural network tools. Like other distributions such as Debian, Guix avoids bundling: we would rather have one Guix package for each of these dependencies. The rationale is manifold, but it boils down to keeping things auditable, reducing resource usage, and making security updates practical.
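
To illustrate the first point in the quote: a setup.py that delegates to CMake usually does little more than assemble and run CMake command lines. The following is a minimal sketch of that pattern, not PyTorch's actual build script. The directory names and the `USE_SYSTEM_LIBS` option are hypothetical; only the `cmake -S/-B` and `cmake --build --parallel` invocations are standard CMake usage.

```python
import os
import subprocess


def cmake_commands(source_dir, build_dir, use_system_libs=False, jobs=None):
    """Return the configure and build command lines for a CMake project."""
    jobs = jobs or os.cpu_count() or 1
    configure = [
        "cmake",
        "-S", source_dir,   # where the top-level CMakeLists.txt lives
        "-B", build_dir,    # out-of-tree build directory
        "-DCMAKE_BUILD_TYPE=Release",
        # Hypothetical project option: prefer distribution-provided
        # libraries over the bundled copies.
        f"-DUSE_SYSTEM_LIBS={'ON' if use_system_libs else 'OFF'}",
    ]
    build = ["cmake", "--build", build_dir, "--parallel", str(jobs)]
    return [configure, build]


def run_build(source_dir, build_dir, **options):
    """Run configure, then build, stopping at the first failure."""
    for command in cmake_commands(source_dir, build_dir, **options):
        subprocess.run(command, check=True)
```

In a real project, a custom setuptools build command would call something like `run_build()`, which is why the Python packaging metadata reveals so little about what actually gets compiled.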




Courtès: What's in a package

Posted Sep 22, 2021 22:01 UTC (Wed) by timrichardson (subscriber, #72836) [Link] (2 responses)

Good luck maintaining that.

Courtès: What's in a package

Posted Sep 23, 2021 17:17 UTC (Thu) by developer122 (guest, #152928) [Link] (1 responses)

That they were surprised about the C++ was the first hint. Just wait until they realize how much proprietary code can be involved in neural network accelerators (hint: it's almost 100% nvidia CUDA). I suspect the only reason they're willing to take this on is because pytorch recently picked up ROCm support which supposedly can work on the open source drivers, well enough. Otherwise it would largely be an academic exercise as you'd only be able to run it on the CPU. Even now it's not clear if the AMD support will take off.

(oh, did I mention all AMD drivers rely on loading and calling into the AtomBIOS ROM stored on the GPU? Yeah, that fight was lost internally between AMD and ATI *years* ago. See: https://www.phoronix.com/forums/forum/linux-graphics-x-or...)

Courtès: What's in a package

Posted Sep 23, 2021 18:03 UTC (Thu) by flussence (guest, #85566) [Link]

The AMD open driver effort withered because it operated like a frat party - we all remember ajax's "bonghits" commit that blanked the radeonhd repo. I'd be bitter too if someone abused root access on a highly visible server to indulge in schoolyard bullying like that against me.

It says a lot about how little cultural progress they've made internally that their latest CPUs didn't even have a real cpufreq driver for 2 years before Valve stepped in.

Courtès: What's in a package

Posted Sep 23, 2021 7:02 UTC (Thu) by LtWorf (subscriber, #124958) [Link] (4 responses)

If pypi was curated and would drop packages that do this thing, the developers wouldn't distribute in this way.

Courtès: What's in a package

Posted Sep 23, 2021 7:41 UTC (Thu) by NAR (subscriber, #1313) [Link] (2 responses)

Yeah, they'd just put a curl ... | bash ... command on their website...

Courtès: What's in a package

Posted Sep 23, 2021 9:43 UTC (Thu) by LtWorf (subscriber, #124958) [Link] (1 responses)

But not being installable as a normal library… It would be much more rare for it to be used as a dependency.

Courtès: What's in a package

Posted Sep 23, 2021 12:18 UTC (Thu) by t-v (subscriber, #112111) [Link]

Have you used it (or the most obvious comparison point, tensorflow) much?

Similar to what NAR suggests, my impression from hanging out on the PyTorch forums is that most people pick whatever version they want and then copy-paste whatever

https://pytorch.org/get-started/locally/

tells them to.
Personally, I doubt that people have PyTorch installed automatically through dependencies that much; it would be hit-or-miss whether it works with their hardware, etc.

From the forums it looks like people are using conda a lot with PyTorch. I don't know if there is a way to distinguish CI-based downloads from human ones in PyPI.

Courtès: What's in a package

Posted Sep 23, 2021 13:00 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

PyPI requiring auditwheel-ok packages that don't depend on anything not in the wheel is a lot of the problem (IME). There's no way for my C++ project that uses HDF5 to use the copy that comes with h5py. Even if there was, how do I make sure I get a *compatible* dependency link at `pip install` time? Multiply this out for umpteen C++ -> {C, C++} dependencies with varying qualities of Python wrappers (e.g., which Qt binding to use when all I need is the C++ API?). Cross-language support where the internal API boundaries matter is where single-language package managers (including cargo) really fall down. For this kind of stuff, I think that `conda` is really the way to go because it actually understands that not everything that is in Python is pure Python. Alas, `pip install` is what everyone asks for…

Courtès: What's in a package

Posted Sep 23, 2021 7:46 UTC (Thu) by t-v (subscriber, #112111) [Link]

As someone with a lot of involvement in PyTorch, maybe some comments from my personal point of view (not speaking for PyTorch):
  • One thing to keep in mind as context is that PyTorch is owned by Facebook and it provides for the lion's share of the development (with NVIDIA, AMD, Microsoft, ... having people working on things particularly dear to them and also Quansight working on things on FB's behalf(?)). They do want traction and community, but in the end PyTorch's priorities are largely informed by Facebook's use-cases.
  • PyTorch tries to take a "batteries included" approach that appears to be appreciated in many contexts. It is similar in complexity to NumPy/SciPy (which also bundle many numerical routines) but has the additional caveat that many of the dependencies (in particular where GPUs are involved) are not that well-supported in the usual distributions and thoroughly change between versions, or have weird bugs in specific versions.
  • Some other things have originated close to PyTorch with PyTorch's use in mind but have been set up as separate projects. That's hardly a worse situation than sticking it into the PyTorch repository as an integral part.
  • The autobuilders for PyTorch run with conda or bespoke docker containers. Personally, I am developing my PyTorch things on a vanilla Debian/unstable machine with the Debian-provided CUDA stack from non-free, so I've tried to keep that working (and I think with reasonable success) when glitches happened for not sticking stuff into /usr/local/cuda.
  • For the limitations that PyPI already has, PyTorch prefers conda (where e.g. many of the NVidia deps are taken from the packages for that) as is. Because PyPI cannot match the matrix of packages they offer (various CUDA versions, AMD ROCm, CPU only) they already self-host many of the packages that people install via pip (in fact 3-4x as many, not counting nightly builds).
  • Many of the bundled dependencies are changing very fast, and releases are only for different versions of Ubuntu (e.g. NVidia seems to do 18.04 at the moment, others are on 20.04), so it would be a hassle for users to track them manually. I have seen them try with external dependencies and move those to bundled. This will be a major headache for packagers, but really, it's as much PyTorch's doing as it is that of the dependencies. Also, it means that PyTorch devs are generally not all that keen to have it packaged in the distributions (I think the Debian dev lumin has gotten less than enthusiastic comments).
  • Some things will be very hard to support out of the box, e.g. NVidia's embedded hardware has a totally different hardware / driver setup to the desktop/server GPUs, so you cannot currently run them with Debian-provided CUDA because it is on an older version.
That said, I can see how packaging PyTorch is a huge task and I personally look forward to the day when most people can just grab it from Debian...
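
The point above about PyPI not matching the package matrix can be sketched as follows. A wheel filename has slots for the Python, ABI, and platform tags, but none for the accelerator variant, so builds that differ only in that respect end up with the same name unless something like a local version label is added (and PEP 440 says such labels should not be used when publishing to a public index). The package name, version, and variant labels here are hypothetical.

```python
def wheel_filename(dist, version, python_tag, abi_tag, platform_tag):
    """Build a wheel filename from the standard tag fields."""
    return f"{dist}-{version}-{python_tag}-{abi_tag}-{platform_tag}.whl"


# Hypothetical accelerator variants of one release.
variants = ["cpu", "cu1", "cu2", "rocm"]

# Without a variant marker, every build maps to the same filename.
plain = {
    wheel_filename("examplepkg", "1.0.0", "cp39", "cp39", "linux_x86_64")
    for _ in variants
}

# With a local version label such as "+cpu", the names are distinct.
labelled = {
    wheel_filename("examplepkg", f"1.0.0+{v}", "cp39", "cp39", "linux_x86_64")
    for v in variants
}

print(len(plain), len(labelled))
```

That collision is one plausible reason for self-hosting the variant builds on a separate index rather than on PyPI itself.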

Courtès: What's in a package

Posted Sep 23, 2021 8:34 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

Another quote from the article:

> Long story short: “unbundling” is often tedious, all the more so in this case. We ended up packaging about ten dependencies that were not already available or were otherwise outdated or incomplete, including big C++ libraries like the XNNPACK and onnx neural network helper libraries.

This is why people vendor things in the first place.

Courtès: What's in a package

Posted Sep 23, 2021 12:47 UTC (Thu) by swilmet (subscriber, #98424) [Link] (5 responses)

(speaking more generally about bundling/vendoring).

Bundling/vendoring a dependency is sometimes done because that dependency is not evolving well, in the direction that we want. So we simply pick up an older version that works well for us, and that's it. It's open source, after all.

Courtès: What's in a package

Posted Sep 23, 2021 13:05 UTC (Thu) by mathstuf (subscriber, #69389) [Link] (2 responses)

The main project(s) I work on do vendor some dependencies, but I try to make sure that they at least behave well. This includes:

- options to use external copies;
- mangling symbols to avoid conflicts when co-existing with the "real" thing in a process;
- mangling library names to avoid runtime loader problems; and
- moving headers to a subdirectory to avoid conflicting with a "real" install.

Of course, there are some that we have patched for our own purposes (with upstreamed PRs, usually merged depending on upstream activity levels) and there's no public version that is viable yet.

I don't like doing it because it's a huge PITA, but when Windows and macOS are target platforms and you depend on lots of external libraries (no, Homebrew and MacPorts are not suitable in the general case for macOS), shipping copies is way *less* work than walking users through how to build with their pet copies (of which there's usually poor uniformity) and the `PATH` shenanigans that usually end up being required.
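
The first item in the list above, an option to use an external copy, has a simple analogue on the Python side: try the externally installed module first and fall back to the bundled one. This is only a sketch of the idea in Python, not the C++/CMake mechanism the comment describes, and the module names in the usage note are hypothetical.

```python
import importlib


def import_preferring_external(external, vendored):
    """Import the external module if available, else the vendored copy.

    `external` is the name of the module as installed by the distribution
    or the user; `vendored` is the dotted path of the bundled copy.
    """
    try:
        return importlib.import_module(external)
    except ImportError:
        # No external copy installed: fall back to the bundled one.
        return importlib.import_module(vendored)
```

A project might call this as `import_preferring_external("somelib", "myproject._vendor.somelib")`. Keeping the vendored copy under the project's own package namespace is also what prevents it from conflicting with a "real" install, which is the Python counterpart of the header and library-name mangling mentioned in the list.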

Courtès: What's in a package

Posted Sep 24, 2021 4:10 UTC (Fri) by pabs (subscriber, #43278) [Link] (1 responses)

For Windows/macOS, wouldn't just shipping the copies in the binary packages be enough? I wouldn't have thought having copies in the source repo would be needed. I believe there are package managers for Windows/macOS too, you could just direct users to installing via those.

Courtès: What's in a package

Posted Sep 24, 2021 11:51 UTC (Fri) by swilmet (subscriber, #98424) [Link]

With my limited experience with the MS Store, yes for Windows you prepare your package the way you want, installing dependencies separately (the version that works best), either with dynamic linking or static linking. No need to copy the source code in the main git repo (in that case).

Courtès: What's in a package

Posted Sep 24, 2021 2:47 UTC (Fri) by JanC_ (guest, #34940) [Link] (1 responses)

But if you fork something, why not maintain the fork separately, probably using a different name, so that it can also be used (and maintained) by others who disagree?

Courtès: What's in a package

Posted Sep 24, 2021 12:01 UTC (Fri) by swilmet (subscriber, #98424) [Link]

Sometimes it's just picking up an older tarball and building it, other times it's creating a small branch in git with a few commits.

And it's definitely possible to make the "light fork" parallel-installable with the main, upstream version, so that Linux distros can install both, see for example:
https://developer.gnome.org/documentation/guidelines/main...
(this is used for instance for the different major versions of GTK, they can co-exist on the same prefix).

Courtès: What's in a package

Posted Sep 23, 2021 13:19 UTC (Thu) by martin.langhoff (guest, #61417) [Link]

This is a typical transition between experimental / not really production-quality / random or brittle (version specific) dependencies, limited testing, etc.

As components mature, they get more users, move a bit slower (as they've accrued more complexity, so moving too fast breaks stuff), and start cleaning up their dependencies, build reproducibility/testability, etc.

The distro packager has an un-enviable role in pushing for a lot of this maturation to happen, often facing antagonism from the developers. But it's a key step. Once upon a time, foundational pieces of today's stack such as MySQL and PostgreSQL were a gnarly mess to package...


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds