unofficial mirror of guix-devel@gnu.org 
 help / color / mirror / code / Atom feed
* PyTorch with ROCm
@ 2024-03-23 23:02 David Elsing
  2024-03-24 15:41 ` Ricardo Wurmus
  2024-03-28 22:21 ` Ludovic Courtès
  0 siblings, 2 replies; 7+ messages in thread
From: David Elsing @ 2024-03-23 23:02 UTC (permalink / raw)
  To: guix-devel; +Cc: ludo, rekado

Hello,

after seeing that ROCm packages [1] are available in the Guix-HPC
channel, I decided to try and package PyTorch 2.2.1 with ROCm 6.0.2.

For this, I first unbundled the (many) remaining dependencies of the
python-pytorch package and updated it to 2.2.1, the patch series for
which can be found here [2,3].

For building ROCm and building the remaining packages, I did not apply
the same quality standard as for python-pytorch and just tried to get it
working at all with ROCM 6.0.2. To reduce the build time, I also only
tested them for gfx1101 as set in the %amdgpu-targets variable in
amd/rocm-base.scm (which needs to be adjusted for other GPUs). Here, it
seemed to work fine on my GPU.

The changes for the ROCm packages are here [4] as a modification of
Guix-HPC. There, the python-pytorch-rocm package in
amd/machine-learning.scm depends on the python-pytorch-avx package in
[2,3]. Both python-pytorch and python-pytorch-avx support AVX2 / AVX-512
instructions, but the latter also has support for fbgemm and nnpack. I
used it over python-pytorch because AVX2 or AVX-512 instructions should
be available on a CPU with PCIe atomics anyway, which ROCm requires.

For some packages, such as composable-kernel, the build time and
memory requirement is already very high when building only for one GPU
architecture, so maybe it would be best to make a separate package for
each architecture?
I'm not sure they can be combined however, as the GPU code is included
in the shared libraries. Thus all dependent packages like
python-pytorch-rocm would need to be built for each architecture as
well, which is a large duplication for the non-GPU parts.

There were a few other issues as well, some of them should probably be
addressed upstream:
- Many tests assume a GPU to be present, so they need to be disabled.
- For several packages (e.g. rocfft), I had to disable the
  validate-runpath? phase, as there was an error when reading ELF
  files. It is however possible that I also disabled it for packages
  where it was not necessary, but it was the case for rocblas at
  least. Here, kernels generated are contained in ELF files, which are
  detected by elf-file? in guix/build/utils.scm, but rejected by
  has-elf-header? in guix/elf.scm, which leads to an error.
- Dependencies of python-tensile copy source files and later copy them
  with shutil.copy, sometimes twice. This leads to permission errors,
  as the permissions in the store are kept, so I patched it to use
  shutil.copyfile instead.
- There were a few errors due to using the GCC 11 system headers with
  rocm-toolchain (which is based on Clang+LLVM). For roctracer,
  replacing std::experimental::filesystem by std::filesystem suffices,
  but for rocthrust, the placement new operator is not found. I
  applied the patch from Gentoo [5], where it is replaced by a simple
  assignment. It looks like UB to me though, even if it happens to
  work. The question is whether this is a bug in libstdc++, clang or
  amdclang++...
- rocMLIR also contains a fork of the LLVM source tree and it is not
  clear at a first glance how exactly it differs from the main ROCm
  fork of LLVM or upstream LLVM.

It would be really great to have these packages in Guix proper, but
first of course the base ROCm packages need to be added after deciding
how to deal with the different architectures. Also, are several ROCm
versions necessary or would only one (the current latest) version
suffice?

Cheers,
David

[1] https://hpc.guix.info/blog/2024/01/hip-and-rocm-come-to-guix/
[2] https://issues.guix.gnu.org/69591
[3] https://codeberg.org/dtelsing/Guix/src/branch/pytorch
[4] https://codeberg.org/dtelsing/Guix-HPC/src/branch/pytorch-rocm
[5] https://gitweb.gentoo.org/repo/gentoo.git/tree/sci-libs/rocThrust/files/rocThrust-4.0-operator_new.patch


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-04-03 22:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-23 23:02 PyTorch with ROCm David Elsing
2024-03-24 15:41 ` Ricardo Wurmus
2024-03-24 18:13   ` David Elsing
2024-03-28 22:21 ` Ludovic Courtès
2024-03-31 22:21   ` David Elsing
2024-04-02 14:00     ` Ludovic Courtès
2024-04-03 22:21       ` David Elsing

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/guix.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).