unofficial mirror of guix-devel@gnu.org
From: David Elsing <david.elsing@posteo.net>
To: guix-devel@gnu.org
Cc: ludo@gnu.org, rekado@elephly.net
Subject: PyTorch with ROCm
Date: Sat, 23 Mar 2024 23:02:53 +0000	[thread overview]
Message-ID: <86msqoeele.fsf@posteo.net> (raw)

Hello,

after seeing that ROCm packages [1] are available in the Guix-HPC
channel, I decided to try and package PyTorch 2.2.1 with ROCm 6.0.2.

For this, I first unbundled the (many) remaining dependencies of the
python-pytorch package and updated it to 2.2.1, the patch series for
which can be found here [2,3].

For the ROCm packages and the remaining dependencies, I did not apply
the same quality standard as for python-pytorch and just tried to get
everything working at all with ROCm 6.0.2. To reduce the build time, I
also only tested for gfx1101, as set in the %amdgpu-targets variable in
amd/rocm-base.scm (which needs to be adjusted for other GPUs). With
that setting, it seemed to work fine on my GPU.

The changes for the ROCm packages are here [4] as a modification of
Guix-HPC. There, the python-pytorch-rocm package in
amd/machine-learning.scm depends on the python-pytorch-avx package in
[2,3]. Both python-pytorch and python-pytorch-avx support AVX2 / AVX-512
instructions, but the latter also has support for fbgemm and nnpack. I
used it over python-pytorch because AVX2 or AVX-512 instructions should
be available anyway on any CPU with PCIe atomics, which ROCm requires.
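As a quick sanity check for that assumption, the CPU feature flags can
be inspected from Python (a hypothetical helper, not part of any of the
packages discussed; Linux-only, since it reads /proc/cpuinfo):

```python
def cpu_flags():
    """Return the feature-flag set of the first CPU in /proc/cpuinfo."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                # "flags : fpu vme ... avx2 ..." -> set of flag names
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("AVX2:", "avx2" in flags, "AVX-512F:", "avx512f" in flags)
```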

For some packages, such as composable-kernel, the build time and memory
requirements are already very high when building for only one GPU
architecture, so maybe it would be best to make a separate package for
each architecture?
I'm not sure the results can be combined, however, as the GPU code is
embedded in the shared libraries. All dependent packages like
python-pytorch-rocm would then need to be built for each architecture as
well, which means a lot of duplication for the non-GPU parts.

There were a few other issues as well, some of which should probably be
addressed upstream:
- Many tests assume a GPU to be present, so they need to be disabled.
- For several packages (e.g. rocfft), I had to disable the
  validate-runpath? phase, as there was an error when reading ELF
  files. It is possible that I also disabled it for packages where it
  was not necessary, but it was necessary for rocblas at least. There,
  the generated kernels are contained in ELF files which are detected
  by elf-file? in guix/build/utils.scm but rejected by has-elf-header?
  in guix/elf.scm, leading to an error.
- Dependencies of python-tensile copy source files and later copy them
  again with shutil.copy, sometimes twice. This leads to permission
  errors, as the read-only permissions of files in the store are
  preserved, so I patched them to use shutil.copyfile instead.
- There were a few errors due to using the GCC 11 system headers with
  rocm-toolchain (which is based on Clang+LLVM). For roctracer,
  replacing std::experimental::filesystem with std::filesystem
  suffices, but for rocthrust, the placement new operator is not
  found. I applied the patch from Gentoo [5], which replaces it with a
  simple assignment. That looks like undefined behavior to me, though,
  even if it happens to work. The question is whether this is a bug in
  libstdc++, Clang, or amdclang++...
- rocMLIR also contains a fork of the LLVM source tree, and it is not
  clear at first glance how exactly it differs from the main ROCm fork
  of LLVM or from upstream LLVM.
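The elf-file? / has-elf-header? mismatch mentioned above can be
sketched roughly as follows (a Python illustration of the general idea
only; the actual predicates are Scheme code in guix/build/utils.scm and
guix/elf.scm, and the exact fields they check may differ):

```python
# A file can pass a magic-bytes check while still failing a stricter
# header check, which is one way the two predicates can disagree.
ELF_MAGIC = b"\x7fELF"

def elf_file_p(data: bytes) -> bool:
    # Like elf-file?: only look at the four magic bytes.
    return data[:4] == ELF_MAGIC

def has_elf_header_p(data: bytes) -> bool:
    # Like has-elf-header?: additionally require a plausible e_ident,
    # e.g. a known ELF class and the current ELF version.
    return (len(data) >= 16
            and data[:4] == ELF_MAGIC
            and data[4] in (1, 2)   # ELFCLASS32 / ELFCLASS64
            and data[6] == 1)       # EV_CURRENT
```

A file accepted by the first predicate but rejected by the second is
exactly the situation that trips up the validate-runpath? phase.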
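The shutil.copy problem mentioned above is easy to reproduce outside of
Tensile: shutil.copy also copies the permission bits, so a read-only
source (as in the store) produces a read-only destination and a second
copy onto it fails, whereas shutil.copyfile copies only the contents.
A minimal sketch (file names are made up):

```python
import os
import shutil
import stat
import tempfile

d = tempfile.mkdtemp()
src = os.path.join(d, "kernel.cpp")
with open(src, "w") as f:
    f.write("// source file\n")
os.chmod(src, 0o444)         # read-only, like a file in the store

dst = os.path.join(d, "copy_with_copy.cpp")
shutil.copy(src, dst)        # copies contents *and* mode bits
mode = stat.S_IMODE(os.stat(dst).st_mode)
# mode is 0o444: dst is read-only too, so copying onto it again
# raises PermissionError (for a non-root user).

dst2 = os.path.join(d, "copy_with_copyfile.cpp")
shutil.copyfile(src, dst2)   # copies contents only; dst2 stays writable
shutil.copyfile(src, dst2)   # a second copy onto dst2 works fine
mode2 = stat.S_IMODE(os.stat(dst2).st_mode)
```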

It would be really great to have these packages in Guix proper, but of
course the base ROCm packages need to be added first, after deciding
how to deal with the different architectures. Also, are several ROCm
versions necessary, or would a single (the current latest) version
suffice?

Cheers,
David

[1] https://hpc.guix.info/blog/2024/01/hip-and-rocm-come-to-guix/
[2] https://issues.guix.gnu.org/69591
[3] https://codeberg.org/dtelsing/Guix/src/branch/pytorch
[4] https://codeberg.org/dtelsing/Guix-HPC/src/branch/pytorch-rocm
[5] https://gitweb.gentoo.org/repo/gentoo.git/tree/sci-libs/rocThrust/files/rocThrust-4.0-operator_new.patch


Thread overview: 7+ messages
2024-03-23 23:02 David Elsing [this message]
2024-03-24 15:41 ` PyTorch with ROCm Ricardo Wurmus
2024-03-24 18:13   ` David Elsing
2024-03-28 22:21 ` Ludovic Courtès
2024-03-31 22:21   ` David Elsing
2024-04-02 14:00     ` Ludovic Courtès
2024-04-03 22:21       ` David Elsing
