Freeing Machine Learning with ROCm

unofficial mirror of guix-science@gnu.org 

* Freeing Machine Learning with ROCm
@ 2022-04-23  5:14 Zacchaeus Scheffer
  2022-04-25  6:31 ` Lars-Dominik Braun
  0 siblings, 1 reply; 4+ messages in thread
From: Zacchaeus Scheffer @ 2022-04-23  5:14 UTC
  To: guix-science

Hi guix-science,

Basically all computer vision and/or machine learning research is done on
GPUs in PyTorch and/or TensorFlow.  Now, it should be possible to do this
with ROCm drivers on a supported AMD GPU.  However, I'm having trouble
utilizing my GPU with ROCm drivers.  This seems to be due to problems in
the current Guix version, as I was able to utilize the GPU fine on a
different OS.  Based on the fact that many ROCm packages exist in guix, and
that I don't see people complain, it seems it must have worked in the
past.  While I am interested in helping fix this in the current Guix
version (discussed more below), I also think it is important that people be
able to use GPUs now.  This brings me to my first question:

Has anyone been able to run a ROCm compatible GPU on a Guix system using
ROCm drivers?  And, if so, could you provide resources to do so?
(channels.scm with working guix commit, system.scm, home.scm, manifest.scm,
etc.)  Also, if you were able to get PyTorch/TensorFlow to play well on a
GPU, info on that would be welcome too.

So far, I have tried putting the results of "guix search rocm" (minus
procmail) into a manifest (included below) and calling rocminfo (roughly the
AMD equivalent of nvidia-smi).  This gives me:
> ROCk module is loaded
> Unable to open /dev/kfd read-write: No such file or directory
> <my username here> is member of video group
Maybe there is a missing magic udev rule?  I was able to find a thread
somewhere (can't find it now) where they suggested rolling back the kernel
version.  Cross-checking with the ROCm 4.3 install documentation (because
the ROCm version in the Guix repo is 4.3), I saw that the supported Ubuntu
version had kernel version 5.4.*, so I tried downgrading my kernel by
adding:
(kernel (specification->package "linux-libre@5.4.190"))
to my system.scm, reconfiguring, and rebooting.  I also tried similarly
adding all rocm packages to my system.scm.  In every instance, I tried
running as a user (with appropriate groups as indicated by ROCm
documentation) and root.  In all cases, I get the error printed above.
While this problem would seem like a good question for upstream ROCm, they
only officially support a few OSes, so here I am.
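
The rocminfo error above can be narrowed down without any ROCm tooling.
The following Python sketch only checks whether the KFD device node exists
and whether the current user can open it read-write.  The helper name
`describe_kfd` is made up for illustration and is not part of any ROCm
tool.

```python
import os
import stat


def describe_kfd(path="/dev/kfd"):
    """Return a short diagnosis for the KFD device node at `path`.

    `describe_kfd` is a hypothetical helper used for illustration only.
    """
    if not os.path.exists(path):
        # Matches the "No such file or directory" case: the kernel driver
        # never created the node, so group membership is not the problem.
        return "missing"
    if not stat.S_ISCHR(os.stat(path).st_mode):
        # Something exists at the path, but it is not a device node.
        return "not a character device"
    if not os.access(path, os.R_OK | os.W_OK):
        # The node exists but the user lacks permission (groups, udev rules).
        return "no read-write access"
    return "ok"


if __name__ == "__main__":
    print("/dev/kfd:", describe_kfd())
```

A result of `missing` points at the kernel side (driver or firmware),
whereas `no read-write access` would point at groups or udev rules.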

In retrospect, could it maybe be that I can use the card without probing it
with rocminfo?  It would certainly be nice to be able to check the
temperature (especially so I don't have to leave the fan on full blast)
among other things, but maybe that isn't strictly necessary for doing
machine learning on it?

Any suggestions for how to get closer to running GPU-accelerated (ROCm)
PyTorch/TensorFlow on Guix are appreciated.

Thanks,
Zacchaeus

P.S.
The archive for guix-science@gnu.org is pretty sparse.  Should I be posting
this to bug-guix instead?

contents of my manifest.scm mentioned above:
(specifications->manifest
 '("rocm-cmake"
   "rocminfo"
   "rocm-opencl-runtime"
   "rocm-device-libs"
   "rocm-comgr"
   "rocm-bandwidth-test"
   "rocr-runtime"
   "roct-thunk-interface"
   "rocclr"))

* Re: Freeing Machine Learning with ROCm
  2022-04-23  5:14 Freeing Machine Learning with ROCm Zacchaeus Scheffer
@ 2022-04-25  6:31 ` Lars-Dominik Braun
  2022-04-26  4:45   ` Zacchaeus Scheffer
  0 siblings, 1 reply; 4+ messages in thread
From: Lars-Dominik Braun @ 2022-04-25  6:31 UTC
  To: Zacchaeus Scheffer; +Cc: guix-science

Hi Zacchaeus,

I packaged ROCm for Guix.

> Based on the fact that many ROCm packages exist in guix, and
> that I don't see people complain, it seems it must have worked in the
> past.
Indeed, I am using Guix’ darktable and rocm-opencl-runtime packages
for OpenCL-accelerated photo editing. But I’m also doing this on a
foreign distribution with a custom kernel (5.15) – not Guix System.

> > ROCk module is loaded
> > Unable to open /dev/kfd read-write: No such file or directory
> > <my username here> is member of video group
Which GPU are you using? Can you see it with `lspci` and does it have the
`amdgpu` driver attached? Is the firmware loaded (`dmesg | grep amdgpu`,
I’m guessing no, since you use linux-libre)?
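
Whether a driver is attached can also be read straight from sysfs.  This
is a minimal sketch, assuming the usual Linux layout in which a bound PCI
device has a `driver` symlink in its sysfs directory.  The helper name
`bound_driver` is hypothetical, and the PCI address in the example is only
a placeholder to be replaced with the one `lspci` reports.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    sysfs exposes the binding as a `driver` symlink inside the device
    directory; the link is absent when no driver is bound (for example
    after a failed probe).
    """
    link = os.path.join(sysfs_root, pci_address, "driver")
    if not os.path.islink(link):
        return None
    # The link target ends in the driver name, e.g. ".../drivers/amdgpu".
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Placeholder address; use the full domain:bus:device.function form.
    print(bound_driver("0000:03:00.0"))
```

If this prints `None` for the GPU, the amdgpu module may be loaded but did
not bind to the card, which would also explain a missing /dev/kfd.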

> In retrospect, could it maybe be that I can use the card without probing it
> with rocminfo?  It would certainly be nice to be able to check the
> temperature (especially so I don't have to leave the fan on full blast)
> among other things, but maybe that isn't strictly necessary for doing
> machine learning on it?
rocminfo does not show the card’s temperature. You need this[1]
(unpackaged) tool.

Cheers,
Lars

[1] https://github.com/RadeonOpenCompute/rocm_smi_lib/tree/master/python_smi_tools



* Re: Freeing Machine Learning with ROCm
  2022-04-25  6:31 ` Lars-Dominik Braun
@ 2022-04-26  4:45   ` Zacchaeus Scheffer
  2022-04-26  6:24     ` Lars-Dominik Braun
  0 siblings, 1 reply; 4+ messages in thread
From: Zacchaeus Scheffer @ 2022-04-26  4:45 UTC
  To: Lars-Dominik Braun; +Cc: guix-science

>
> > Based on the fact that many ROCm packages exist in guix, and
> > that I don't see people complain, it seems it must have worked in the
> > past.
> Indeed, I am using Guix’ darktable and rocm-opencl-runtime packages
> for OpenCL-accelerated photo editing. But I’m also doing this on a
> foreign distribution with a custom kernel (5.15) – not Guix System.
>
I tried kernel version 5.15 before I tried 5.4.  Is there anything else
special about your kernel version?

> > > ROCk module is loaded
> > > Unable to open /dev/kfd read-write: No such file or directory
> > > <my username here> is member of video group
> Which GPU are you using? Can you see it with `lspci` and does it have the
> `amdgpu` driver attached? Is the firmware loaded (`dmesg | grep amdgpu`,
> I’m guessing no, since you use linux-libre)?
>
I have an AMD Radeon Instinct MI60, one of the few officially supported
GPUs.  `lspci | grep -i amd` gives:
01:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a0
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] Device 14a1
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20
so it seems to be detected. `dmesg | grep amdgpu` gives:
[   12.446826] [drm] amdgpu kernel modesetting enabled.
[   12.485012] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   12.522503] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   12.522538] amdgpu: ATOM BIOS: 113-D1630600-107
[   12.523127] [drm:sdma_v4_0_early_init.cold [amdgpu]] *ERROR* sdma_v4_0: Failed to load firmware "/*(DEBLOBBED)*/"
[   12.523277] [drm:sdma_v4_0_early_init.cold [amdgpu]] *ERROR* Failed to load sdma firmware!
[   12.533887] amdgpu 0000:03:00.0: amdgpu: MEM ECC is active.
[   12.533889] amdgpu 0000:03:00.0: amdgpu: SRAM ECC is active.
[   12.533893] amdgpu 0000:03:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[   12.533902] amdgpu 0000:03:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[   12.533904] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   12.533905] amdgpu 0000:03:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[   12.557543] [drm] amdgpu: 32752M of VRAM memory ready
[   12.557549] [drm] amdgpu: 24018M of GTT memory ready.
[   12.557775] amdgpu 0000:03:00.0: amdgpu: failed to init sos firmware
[   12.557777] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
[   12.557916] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init of IP block <psp> failed -2
[   12.558042] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[   12.558044] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[   12.558047] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[   12.602981] amdgpu: probe of 0000:03:00.0 failed with error -2
[   12.603026] [drm] amdgpu: ttm finalized
So it seems to be partially working, partially not.  That "Fatal error
during GPU init" is pretty discouraging though...  With the way AMD
promoted ROCm as being so open, I was really under the impression that I
would be able to make this work on Guix, albeit with some work on my end,
but you sound skeptical.  Do you think it is possible?
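
The firmware failures in the log above can be picked out mechanically.
This is a rough Python sketch that filters kernel log text for lines
reporting a failed firmware load or init.  The helper name
`firmware_failures` and the exact pattern are assumptions chosen to match
the messages shown here, not an exhaustive list of what amdgpu can print.

```python
import re

# Matches log lines that report a failed firmware load or init,
# e.g. "Failed to load sdma firmware!" or "failed to init sos firmware".
_FIRMWARE_FAILURE = re.compile(r"failed to (?:load|init) .*firmware",
                               re.IGNORECASE)


def firmware_failures(dmesg_text):
    """Return the log lines in `dmesg_text` that report a firmware failure."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if _FIRMWARE_FAILURE.search(line)
    ]


if __name__ == "__main__":
    import sys

    # Reads kernel log text from standard input, e.g. piped from dmesg.
    for line in firmware_failures(sys.stdin.read()):
        print(line)
```

Any hit here means the driver asked for a firmware file that the kernel
could not provide, which happens before /dev/kfd would be created.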

Thanks for your kind response,
Zacchaeus

* Re: Freeing Machine Learning with ROCm
  2022-04-26  4:45   ` Zacchaeus Scheffer
@ 2022-04-26  6:24     ` Lars-Dominik Braun
  0 siblings, 0 replies; 4+ messages in thread
From: Lars-Dominik Braun @ 2022-04-26  6:24 UTC
  To: Zacchaeus Scheffer; +Cc: guix-science

Hi Zacchaeus,

> I tried kernel version 5.15 before I tried 5.4.  Is there anything else
> special about your kernel version?
No, it is pretty much the stock Gentoo kernel.

> [   12.523127] [drm:sdma_v4_0_early_init.cold [amdgpu]] *ERROR* sdma_v4_0: Failed to load firmware "/*(DEBLOBBED)*/"
> [   12.523277] [drm:sdma_v4_0_early_init.cold [amdgpu]] *ERROR* Failed to load sdma firmware!
Yeah, you definitely need firmware. Even my pretty old RX 460 needs
it to do 3D acceleration and OpenCL/Vulkan.

> That "Fatal error
> during GPU init" is pretty discouraging though...  With the way AMD
> promoted ROCm as being so open, I was really under the impression that I
> would be able to make this work on Guix, albeit with some work on my end,
> but you sound skeptical.  Do you think it is possible?
Sure, it might work if you use a nonguix[1] kernel plus firmware – unless
your GPU needs a newer ROCm. In that case we have to figure out how to
upgrade ROCm itself, which is always a little tricky.

Cheers,
Lars

[1] https://gitlab.com/nonguix/nonguix


