unofficial mirror of guix-science@gnu.org 
* Freeing Machine Learning with ROCm
@ 2022-04-23  5:14 Zacchaeus Scheffer
  2022-04-25  6:31 ` Lars-Dominik Braun
  0 siblings, 1 reply; 4+ messages in thread
From: Zacchaeus Scheffer @ 2022-04-23  5:14 UTC (permalink / raw)
  To: guix-science


Hi guix-science,

Basically all computer vision and/or machine learning research is done on
GPUs in PyTorch and/or TensorFlow.  Now, it should be possible to do this
with ROCm drivers on a supported AMD GPU.  However, I'm having trouble
utilizing my GPU with ROCm drivers.  This seems to be due to problems in
the current Guix version, as I was able to utilize the GPU fine on a
different OS.  Based on the fact that many ROCm packages exist in Guix, and
that I don't see people complaining, it seems it must have worked in the
past.  While I am interested in helping fix this in the current Guix
version (discussed more below), I also think it is important that people be
able to use GPUs now.  This brings me to my first question:

Has anyone been able to run a ROCm compatible GPU on a Guix system using
ROCm drivers?  And, if so, could you provide resources to do so?
(channels.scm with working guix commit, system.scm, home.scm, manifest.scm,
etc.)  Also, if you were able to get PyTorch/TensorFlow to play well on a
GPU, info on that would be nice too.

Currently, I have tried putting the results of "guix search rocm" (minus
procmail) into a manifest (included below), and calling rocminfo (AMD
nvidia-smi equivalent-ish).  This gives me:
> ROCk module is loaded
> Unable to open /dev/kfd read-write: No such file or directory
> <my username here> is member of video group
Maybe there is a missing magic udev rule?  I was able to find a thread
somewhere (can't find it now) where they suggested rolling back the kernel
version.  Cross-checking with the ROCm 4.3 install documentation (because
the ROCm version in the Guix repo is 4.3), I saw that the supported Ubuntu
version had kernel version 5.4.*, so I tried downgrading my kernel by
adding:
(kernel (specification->package "linux-libre@5.4.190"))
to my system.scm, reconfiguring, and rebooting.  I also tried similarly
adding all rocm packages to my system.scm.  In every instance, I tried
running as a user (with appropriate groups as indicated by ROCm
documentation) and root.  In all cases, I get the error printed above.
While this problem would seem like a good question for upstream ROCm, they
only officially support a few OSes, so here I am.
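
One way to narrow the error down is to check whether the device node
exists at all, or exists but is not accessible.  "No such file or
directory" would suggest the former, in which case a udev rule that only
adjusts permissions would probably not help.  Below is a minimal Guile
sketch of that check.  The procedure names kfd-status and
describe-kfd-status are hypothetical helpers made up for illustration;
only file-exists?, access?, R_OK and W_OK come from Guile itself.

```scheme
;; Sketch only: classify the state of a device node such as /dev/kfd.
;; `kfd-status' and `describe-kfd-status' are hypothetical helper names.

(define (kfd-status path)
  "Return 'missing, 'no-permission or 'ok for the device node at PATH."
  (cond ((not (file-exists? path)) 'missing)
        ;; access? checks against the real user ID of the process.
        ((not (access? path (logior R_OK W_OK))) 'no-permission)
        (else 'ok)))

(define (describe-kfd-status status)
  "Return a human-readable hint for STATUS."
  (case status
    ((missing)
     "node is missing: the kernel driver probably did not create it")
    ((no-permission)
     "node exists but is not read-write: check groups and udev rules")
    (else
     "node exists and is read-write for this user")))

(display (describe-kfd-status (kfd-status "/dev/kfd")))
(newline)
```

Saved as, say, kfd-check.scm, this can be run with "guile kfd-check.scm"
both as a regular user and as root to compare the two results.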

In retrospect, could it maybe be that I can use the card without probing it
with rocminfo?  It would certainly be nice to be able to check the
temperature (especially so I don't have to leave the fan on full blast)
among other things, but maybe that isn't strictly necessary for doing
machine learning on it?

Any suggestion for how to get closer to getting GPU-accelerated (ROCm)
PyTorch/TensorFlow running on Guix is appreciated.

Thanks,
Zacchaeus

P.S.
The archive for guix-science@gnu.org is pretty sparse.  Should I be posting
this to bug-guix instead?

contents of my manifest.scm mentioned above:
(specifications->manifest
 '("rocm-cmake"
   "rocminfo"
   "rocm-opencl-runtime"
   "rocm-device-libs"
   "rocm-comgr"
   "rocm-bandwidth-test"
   "rocr-runtime"
   "roct-thunk-interface"
   "rocclr"))


end of thread, other threads:[~2022-04-26  6:26 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
2022-04-23  5:14 Freeing Machine Learning with ROCm Zacchaeus Scheffer
2022-04-25  6:31 ` Lars-Dominik Braun
2022-04-26  4:45   ` Zacchaeus Scheffer
2022-04-26  6:24     ` Lars-Dominik Braun
