From: Ludovic Courtès
To: David Elsing
Cc: guix-devel@gnu.org, rekado@elephly.net
Subject: Re: PyTorch with ROCm
Date: Thu, 28 Mar 2024 23:21:19 +0100
Message-ID: <87y1a2j8v4.fsf@gnu.org>
In-Reply-To: <86msqoeele.fsf@posteo.net> (David Elsing's message of "Sat, 23 Mar 2024 23:02:53 +0000")
References: <86msqoeele.fsf@posteo.net>

Hello!

David Elsing writes:

> after seeing that ROCm packages [1] are available in the Guix-HPC
> channel, I decided to try and package PyTorch 2.2.1 with ROCm 6.0.2.

Nice!

> The changes for the ROCm packages are here [4] as a modification of
> Guix-HPC. There, the python-pytorch-rocm package in
> amd/machine-learning.scm depends on the python-pytorch-avx package in
> [2,3]. Both python-pytorch and python-pytorch-avx support AVX2 / AVX-512
> instructions, but the latter also has support for fbgemm and nnpack. I
> used it over python-pytorch because AVX2 or AVX-512 instructions should
> be available on a CPU with PCIe atomics anyway, which ROCm requires.

I’m happy to merge your changes in the ‘guix-hpc’ channel for the time
being (I can create an account for you there if you wish, so you can
create merge requests etc.).  Let me know!

I agree with Ricardo that this should be merged into Guix proper
eventually.  This is still in flux and we’d need to check what Kjetil
and Thomas at AMD think, in particular wrt. versions, so no ETA so far.

> For some packages, such as composable-kernel, the build time and
> memory requirement is already very high when building only for one GPU
> architecture, so maybe it would be best to make a separate package for
> each architecture?

Is PyTorch able to build code for several GPU architectures and pick the
right one at run time?  If so, that would seem like the better option to
me, unless that is indeed so computationally expensive that it’s not
affordable.

> I'm not sure they can be combined however, as the GPU code is included
> in the shared libraries. Thus all dependent packages like
> python-pytorch-rocm would need to be built for each architecture as
> well, which is a large duplication for the non-GPU parts.

Yeah, but maybe that’s OK if we keep the number of supported GPU
architectures to a minimum?

> There were a few other issues as well, some of them should probably be
> addressed upstream:
> - Many tests assume a GPU to be present, so they need to be disabled.

Yes.  I/we’d like to eventually support that.  (There’d need to be some
annotation in derivations or packages specifying what hardware is
required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
to honor that.)

> - For several packages (e.g. rocfft), I had to disable the
>   validate-runpath? phase, as there was an error when reading ELF
>   files. It is however possible that I also disabled it for packages
>   where it was not necessary, but it was the case for rocblas at
>   least. Here, kernels generated are contained in ELF files, which are
>   detected by elf-file? in guix/build/utils.scm, but rejected by
>   has-elf-header? in guix/elf.scm, which leads to an error.

Weird.  We’d need to look more closely into the errors you got.

[...]

> - There were a few errors due to using the GCC 11 system headers with
>   rocm-toolchain (which is based on Clang+LLVM). For roctracer,
>   replacing std::experimental::filesystem by std::filesystem suffices,
>   but for rocthrust, the placement new operator is not found. I
>   applied the patch from Gentoo [5], where it is replaced by a simple
>   assignment. It looks like UB to me though, even if it happens to
>   work. The question is whether this is a bug in libstdc++, clang or
>   amdclang++...
> - rocMLIR also contains a fork of the LLVM source tree and it is not
>   clear at a first glance how exactly it differs from the main ROCm
>   fork of LLVM or upstream LLVM.

Oh, I just noticed that your patch brings a lot of things beyond PyTorch
itself!  I think there’s some overlap with , we should synchronize.

Thanks!

Ludo’.
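
PS: To make the validate-runpath report quoted above more concrete, here
is a minimal Python sketch of how a loose "is this ELF?" test and a
stricter header test can disagree on the same file.  This is not the
Guix code: which fields has-elf-header? actually checks, and whether the
OS ABI byte is what trips it on the rocblas kernels, are assumptions
that would need to be confirmed against the actual errors.  The function
names and the accepted-ABI policy are hypothetical; the constants come
from the ELF specification and the LLVM AMDGPU documentation.

```python
ELF_MAGIC = b"\x7fELF"

# Offsets into the 16-byte e_ident array (ELF specification).
EI_CLASS, EI_DATA, EI_VERSION, EI_OSABI = 4, 5, 6, 7

ELFCLASS64 = 2            # 64-bit objects
ELFDATA2LSB = 1           # little-endian
EV_CURRENT = 1            # current ELF version
ELFOSABI_NONE = 0         # System V
ELFOSABI_GNU = 3          # GNU/Linux
ELFOSABI_AMDGPU_HSA = 64  # AMD GPU code objects (per LLVM AMDGPU docs)


def looks_like_elf(data: bytes) -> bool:
    """Loose check: only compare the four magic bytes."""
    return data[:4] == ELF_MAGIC


def has_strict_elf_header(data: bytes,
                          accepted_osabi=(ELFOSABI_NONE, ELFOSABI_GNU)) -> bool:
    """Stricter check: magic plus class, byte order, version and OS ABI.

    The accepted OS ABI set is a hypothetical policy for illustration.
    """
    if len(data) < 16 or not looks_like_elf(data):
        return False
    return (data[EI_CLASS] == ELFCLASS64
            and data[EI_DATA] == ELFDATA2LSB
            and data[EI_VERSION] == EV_CURRENT
            and data[EI_OSABI] in accepted_osabi)


def make_ident(osabi: int) -> bytes:
    """Build a 16-byte e_ident for a 64-bit little-endian ELF file."""
    return ELF_MAGIC + bytes([ELFCLASS64, ELFDATA2LSB, EV_CURRENT, osabi]) + bytes(8)


host_object = make_ident(ELFOSABI_NONE)
gpu_object = make_ident(ELFOSABI_AMDGPU_HSA)

for name, ident in (("host", host_object), ("gpu", gpu_object)):
    print(name, looks_like_elf(ident), has_strict_elf_header(ident))
```

If the real mismatch is of this kind, a phase that selects files with
the loose test and then parses them with the strict one would fail on
GPU code objects, and skipping such files might be preferable to
disabling the whole phase.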