From: David Elsing
To: guix-devel@gnu.org
Cc: ludo@gnu.org, rekado@elephly.net
Subject: PyTorch with ROCm
Date: Sat, 23 Mar 2024 23:02:53 +0000
Message-ID: <86msqoeele.fsf@posteo.net>

Hello,

after seeing that ROCm packages [1] are available in the Guix-HPC channel, I decided to try and package PyTorch 2.2.1 with ROCm 6.0.2.

For this, I first unbundled the (many) remaining dependencies of the python-pytorch package and updated it to 2.2.1; the patch series can be found here [2,3]. For building ROCm and the remaining packages, I did not apply the same quality standard as for python-pytorch and just tried to get it working at all with ROCm 6.0.2.
To reduce the build time, I also only tested them for gfx1101 as set in the %amdgpu-targets variable in amd/rocm-base.scm (which needs to be adjusted for other GPUs). Here, it seemed to work fine on my GPU.

The changes for the ROCm packages are here [4] as a modification of Guix-HPC. There, the python-pytorch-rocm package in amd/machine-learning.scm depends on the python-pytorch-avx package in [2,3]. Both python-pytorch and python-pytorch-avx support AVX2 / AVX-512 instructions, but the latter also has support for fbgemm and nnpack. I used it over python-pytorch because AVX2 or AVX-512 instructions should be available on a CPU with PCIe atomics anyway, which ROCm requires.

For some packages, such as composable-kernel, the build time and memory requirement are already very high when building for only one GPU architecture, so maybe it would be best to make a separate package for each architecture? I'm not sure they can be combined, however, as the GPU code is included in the shared libraries. Thus all dependent packages like python-pytorch-rocm would need to be built for each architecture as well, which is a large duplication for the non-GPU parts.

There were a few other issues as well, some of which should probably be addressed upstream:

- Many tests assume a GPU to be present, so they need to be disabled.
- For several packages (e.g. rocfft), I had to disable the validate-runpath? phase, as there was an error when reading ELF files. It is possible that I also disabled it for packages where this was not necessary, but it was needed for rocblas at least. Here, the generated kernels are contained in ELF files, which are detected by elf-file? in guix/build/utils.scm, but rejected by has-elf-header? in guix/elf.scm, which leads to an error.
- Dependencies of python-tensile copy source files and later copy them with shutil.copy, sometimes twice. This leads to permission errors, as the permissions in the store are kept, so I patched it to use shutil.copyfile instead.
- There were a few errors due to using the GCC 11 system headers with rocm-toolchain (which is based on Clang+LLVM). For roctracer, replacing std::experimental::filesystem by std::filesystem suffices, but for rocthrust, the placement new operator is not found. I applied the patch from Gentoo [5], where it is replaced by a simple assignment. It looks like UB to me though, even if it happens to work. The question is whether this is a bug in libstdc++, clang or amdclang++...
- rocMLIR also contains a fork of the LLVM source tree and it is not clear at first glance how exactly it differs from the main ROCm fork of LLVM or upstream LLVM.

It would be really great to have these packages in Guix proper, but first of course the base ROCm packages need to be added after deciding how to deal with the different architectures. Also, are several ROCm versions necessary or would only one (the current latest) version suffice?

Cheers,
David

[1] https://hpc.guix.info/blog/2024/01/hip-and-rocm-come-to-guix/
[2] https://issues.guix.gnu.org/69591
[3] https://codeberg.org/dtelsing/Guix/src/branch/pytorch
[4] https://codeberg.org/dtelsing/Guix-HPC/src/branch/pytorch-rocm
[5] https://gitweb.gentoo.org/repo/gentoo.git/tree/sci-libs/rocThrust/files/rocThrust-4.0-operator_new.patch
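
P.S. To illustrate the shutil.copy item in the list above: the following is only a minimal sketch of the difference between the two functions, not the actual patch. The helper name dest_mode_after and the file names are made up for the example, and a file with mode 0444 in a temporary directory stands in for a read-only file in the store. shutil.copy copies the permission bits along with the contents, so the copy is read-only too and a second copy onto the same destination can then fail with a permission error for a non-root user. shutil.copyfile copies only the contents, so the destination gets the default mode and stays writable.

```python
import os
import shutil
import stat
import tempfile


def dest_mode_after(copy_func):
    """Copy a read-only file with copy_func and return the
    permission bits of the resulting destination file."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.txt")  # hypothetical file names
        dst = os.path.join(tmp, "dest.txt")
        with open(src, "w") as f:
            f.write("kernel source\n")
        # Stand-in for a store file: read-only for everyone.
        os.chmod(src, 0o444)
        copy_func(src, dst)
        return stat.S_IMODE(os.stat(dst).st_mode)


if __name__ == "__main__":
    # shutil.copy keeps the read-only mode of the source.
    print("shutil.copy:    ", oct(dest_mode_after(shutil.copy)))
    # shutil.copyfile creates the destination with the default mode
    # (depends on the umask), so the owner can still write to it.
    print("shutil.copyfile:", oct(dest_mode_after(shutil.copyfile)))
```

The sketch only inspects the mode bits instead of actually copying twice, because the second copy would succeed anyway when run as root.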