From: David Elsing
To: Ludovic Courtès
Cc: guix-devel@gnu.org, rekado@elephly.net
Subject: Re: PyTorch with ROCm
In-Reply-To: <87y1a2j8v4.fsf@gnu.org>
References: <86msqoeele.fsf@posteo.net> <87y1a2j8v4.fsf@gnu.org>
Date: Sun, 31 Mar 2024 22:21:26 +0000
Message-ID: <7ymsqe9h5l.fsf@posteo.net>

Hi!

Ludovic Courtès writes:

> I’m happy to merge your changes in the ‘guix-hpc’ channel for the time
> being (I can create you an account there if you wish so you can create
> merge requests etc.).  Let me know!

Ok sure, that sounds good!
I made the packages only for ROCm 6.0.2 so far though.

> I agree with Ricardo that this should be merged into Guix proper
> eventually.  This is still in flux and we’d need to check what Kjetil
> and Thomas at AMD think, in particular wrt. versions, so no ETA so far.

Yes I agree, the ROCm packages are not ready to be merged yet.

> Is PyTorch able to build code for several GPU architectures and pick the
> right one at run time?  If it does, that would seem like the better
> option for me, unless that is indeed so computationally expensive that
> it’s not affordable.

It is the same as for other HIP/ROCm libraries, so the GPU architectures
chosen at build time are all available at runtime and automatically
picked.  For reference, the Arch Linux package for PyTorch [1] enables 12
architectures.  I think the architectures which can be chosen at compile
time also depend on the ROCm version.

>> I'm not sure they can be combined however, as the GPU code is included
>> in the shared libraries.  Thus all dependent packages like
>> python-pytorch-rocm would need to be built for each architecture as
>> well, which is a large duplication for the non-GPU parts.
>
> Yeah, but maybe that’s OK if we keep the number of supported GPU
> architectures to a minimum?

If it's no issue for the build farm, it would probably be good to include
a set of default architectures (the officially supported ones?) like you
suggested, and make it easy to recompile all dependent packages for other
architectures.  Maybe this can be done with a package transformation like
for '--tune'?  IIRC, building composable-kernel for the default
architectures with 16 threads exceeded 32 GB of memory before I cancelled
the build and set it to only one architecture.

>> - Many tests assume a GPU to be present, so they need to be disabled.
>
> Yes.  I/we’d like to eventually support that.
> (There’d need to be some
> annotation in derivations or packages specifying what hardware is
> required, and ‘cuirass remote-worker’, ‘guix offload’, etc. would need
> to honor that.)

That sounds like a good idea.  Could this also include CPU ISA
extensions, such as AVX2 and AVX-512?

>> - For several packages (e.g. rocfft), I had to disable the
>>   validate-runpath? phase, as there was an error when reading ELF
>>   files.  It is however possible that I also disabled it for packages
>>   where it was not necessary, but it was the case for rocblas at
>>   least.  Here, kernels generated are contained in ELF files, which are
>>   detected by elf-file? in guix/build/utils.scm, but rejected by
>>   has-elf-header? in guix/elf.scm, which leads to an error.
>
> Weird.  We’d need to look more closely into the errors you got.

I think the issue is simply that elf-file? just checks the magic bytes,
while has-elf-header? checks the entire header.  If the former returns #t
and the latter #f, an error is raised by parse-elf in guix/elf.scm.  It
seems some ROCm (or tensile?) ELF files have another header format.

> Oh, just noticed your patch bring a lot of things beyond PyTorch itself!
> I think there’s some overlap with
> , we
> should synchronize.

Ah, I did not see this before.  The overlap seems to be tensile,
roctracer and rocblas.  For rocblas, I saw that they set
"-DAMDGPU_TARGETS=gfx1030;gfx90a", probably for testing?

Thank you!
David

[1] https://gitlab.archlinux.org/archlinux/packaging/packages/python-pytorch/-/blob/ae90c1e8bdb99af458ca0a545c5736950a747690/PKGBUILD
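
PS: To make the elf-file? / has-elf-header? mismatch above concrete, here
is a minimal sketch in Python.  It is not the Guile code from Guix: the
function names, the set of accepted OS/ABI values and the sample bytes
are all assumptions for illustration.  It only shows how a file can pass
a magic-bytes test and still fail a stricter check of the identification
bytes, which is the situation where an error would be raised.

```python
ELF_MAGIC = b"\x7fELF"

# Assumed allow-list for the stricter check (ELFOSABI_NONE, ELFOSABI_GNU).
# Which fields the real has-elf-header? inspects is not verified here.
ACCEPTED_OSABI = {0, 3}


def magic_only(data: bytes) -> bool:
    """Loose check: look at the four magic bytes and nothing else."""
    return data[:4] == ELF_MAGIC


def strict_ident(data: bytes) -> bool:
    """Stricter check: magic plus class, byte order, version and OS/ABI."""
    if len(data) < 16 or not magic_only(data):
        return False
    ei_class, ei_data, ei_version, ei_osabi = data[4], data[5], data[6], data[7]
    return (
        ei_class in (1, 2)          # 32-bit or 64-bit
        and ei_data in (1, 2)       # little or big endian
        and ei_version == 1         # EV_CURRENT
        and ei_osabi in ACCEPTED_OSABI
    )


def classify(data: bytes) -> str:
    """Return 'not-elf', 'elf', or 'mismatch' when the two checks disagree."""
    if not magic_only(data):
        return "not-elf"
    if not strict_ident(data):
        return "mismatch"
    return "elf"


def make_ident(osabi: int) -> bytes:
    """Build a hypothetical 16-byte e_ident: 64-bit, little endian, version 1."""
    return ELF_MAGIC + bytes([2, 1, 1, osabi]) + bytes(8)


if __name__ == "__main__":
    print(classify(make_ident(0)))    # ordinary OS/ABI byte
    print(classify(make_ident(64)))   # OS/ABI byte outside the allow-list
    print(classify(b"#!/bin/sh\n"))   # no ELF magic at all
```

A phase that wants to skip such files instead of failing could treat the
"mismatch" case like "not-elf"; whether that is the right fix for
validate-runpath? is a separate question.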