From: "Ludovic Courtès" <ludovic.courtes@inria.fr>
To: David Elsing <david.elsing@posteo.net>
Cc: guix-devel@gnu.org,  rekado@elephly.net
Subject: Re: PyTorch with ROCm
Date: Thu, 28 Mar 2024 23:21:19 +0100
Message-ID: <87y1a2j8v4.fsf@gnu.org>
In-Reply-To: <86msqoeele.fsf@posteo.net> (David Elsing's message of "Sat, 23 Mar 2024 23:02:53 +0000")

Hello!

David Elsing <david.elsing@posteo.net> writes:

> after seeing that ROCm packages [1] are available in the Guix-HPC
> channel, I decided to try and package PyTorch 2.2.1 with ROCm 6.0.2.

Nice!

> The changes for the ROCm packages are here [4] as a modification of
> Guix-HPC. There, the python-pytorch-rocm package in
> amd/machine-learning.scm depends on the python-pytorch-avx package in
> [2,3]. Both python-pytorch and python-pytorch-avx support AVX2 / AVX-512
> instructions, but the latter also has support for fbgemm and nnpack. I
> used it over python-pytorch because any CPU with PCIe atomics (which
> ROCm requires) should support AVX2 or AVX-512 instructions anyway.

I’m happy to merge your changes in the ‘guix-hpc’ channel for the time
being (I can create an account for you there if you wish, so you can
open merge requests and the like).  Let me know!

I agree with Ricardo that this should eventually be merged into Guix
proper.  Things are still in flux, and we’d need to check what Kjetil
and Thomas at AMD think, in particular with respect to versions, so
there is no ETA so far.

> For some packages, such as composable-kernel, the build time and
> memory requirements are already very high when building only for one GPU
> architecture, so maybe it would be best to make a separate package for
> each architecture?

Is PyTorch able to build code for several GPU architectures and pick the
right one at run time?  If so, that would seem like the better option to
me, unless doing so is indeed so computationally expensive that it’s not
affordable.
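
For what it’s worth, PyTorch’s ROCm build reads the PYTORCH_ROCM_ARCH
variable, which takes a semicolon-separated list of targets, so a
multi-architecture (“fat”) build can at least be requested from the
package definition.  A minimal, untested sketch; the variant name and
the architecture list below are made up:

  (define-public python-pytorch-rocm/multi-arch
    (package
      (inherit python-pytorch-rocm)
      (arguments
       (substitute-keyword-arguments (package-arguments python-pytorch-rocm)
         ((#:phases phases)
          #~(modify-phases #$phases
              (add-before 'build 'set-rocm-architectures
                (lambda _
                  ;; One code object per listed target; the HIP runtime
                  ;; selects the matching one at load time.
                  (setenv "PYTORCH_ROCM_ARCH"
                          "gfx906;gfx908;gfx90a")))))))))

Each listed target adds its own code object to the libraries, so build
time and library size grow with the length of the list.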

> I'm not sure they can be combined however, as the GPU code is included
> in the shared libraries. Thus all dependent packages like
> python-pytorch-rocm would need to be built for each architecture as
> well, which is a large duplication for the non-GPU parts.

Yeah, but maybe that’s OK if we keep the number of supported GPU
architectures to a minimum?

> There were a few other issues as well; some of them should probably be
> addressed upstream:
> - Many tests assume a GPU to be present, so they need to be disabled.

Yes.  I/we’d like to eventually support that.  (There’d need to be some
annotation in derivations or packages specifying what hardware is
required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
to honor that.)
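
To make that concrete: packages already carry a ‘properties’ alist that
could hold such an annotation.  A purely hypothetical sketch, with an
invented key, since no such mechanism exists today:

  (package
    (inherit python-pytorch-rocm)
    ;; Hypothetical: ‘cuirass remote-worker’ and ‘guix offload’ would
    ;; check this property before dispatching the build.
    (properties `((hardware-required . amd-gpu))))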

> - For several packages (e.g. rocfft), I had to disable the
>   validate-runpath? phase, as there was an error when reading ELF
>   files. It is however possible that I also disabled it for packages
>   where it was not necessary, but it was the case for rocblas at
>   least. Here, kernels generated are contained in ELF files, which are
>   detected by elf-file? in guix/build/utils.scm, but rejected by
>   has-elf-header? in guix/elf.scm, which leads to an error.

Weird.  We’d need to look more closely into the errors you got.
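
For reference, that phase can be disabled per package with a single
keyword, since #:validate-runpath? is honored by gnu-build-system and
the build systems derived from it.  A sketch, with a hypothetical
variant name:

  (define-public rocfft/no-runpath-check
    (package
      (inherit rocfft)
      (arguments
       (substitute-keyword-arguments (package-arguments rocfft)
         ;; Skip the 'validate-runpath' phase, which trips over the
         ;; embedded GPU kernel blobs described above.
         ((#:validate-runpath? _ #t) #f)))))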

[...]

> - There were a few errors due to using the GCC 11 system headers with
>   rocm-toolchain (which is based on Clang+LLVM). For roctracer,
>   replacing std::experimental::filesystem by std::filesystem suffices,
>   but for rocthrust, the placement new operator is not found. I
>   applied the patch from Gentoo [5], where it is replaced by a simple
>   assignment. It looks like UB to me though, even if it happens to
>   work. The question is whether this is a bug in libstdc++, clang or
>   amdclang++...
> - rocMLIR also contains a fork of the LLVM source tree, and it is not
>   clear at first glance how exactly it differs from the main ROCm
>   fork of LLVM or upstream LLVM.

Oh, I just noticed your patch brings in a lot of things beyond PyTorch
itself!  I think there’s some overlap with
<https://gitlab.inria.fr/guix-hpc/guix-hpc/-/merge_requests/38>; we
should synchronize.
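
As for the roctracer fix quoted above: since it is a plain textual
replacement, it could live in a build phase rather than a patch.  A
sketch, with guessed file patterns:

  (add-after 'unpack 'use-standard-filesystem
    (lambda _
      ;; Replace the filesystem TS with its C++17 equivalent, as
      ;; described above.
      (substitute* (find-files "." "\\.(h|hpp|cpp)$")
        (("<experimental/filesystem>") "<filesystem>")
        (("std::experimental::filesystem") "std::filesystem"))))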

Thanks!

Ludo’.


Thread overview: 7+ messages
2024-03-23 23:02 PyTorch with ROCm David Elsing
2024-03-24 15:41 ` Ricardo Wurmus
2024-03-24 18:13   ` David Elsing
2024-03-28 22:21 ` Ludovic Courtès [this message]
2024-03-31 22:21   ` David Elsing
2024-04-02 14:00     ` Ludovic Courtès
2024-04-03 22:21       ` David Elsing
