unofficial mirror of bug-guix@gnu.org 
* bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
@ 2021-02-01  8:55 Ludovic Courtès
  2021-02-01  9:13 ` Efraim Flashner
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Ludovic Courtès @ 2021-02-01  8:55 UTC (permalink / raw)
  To: 46229; +Cc: Florent Pruvost, Greg Hogan

Hello,

We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
in InfiniBand-related routines:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
$ file core.20879 
core.20879: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'IMB-MPI1 PingPong', real uid: 10218, effective uid: 10218, real gid: 11018, effective gid: 11018, execfn: '/gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1', platform: 'x86_64'
$ gdb /gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1 core.20879 
(gdb) bt
#0  0x00007f93b2789e88 in ibv_cmd_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#1  0x00007f93b28c57bb in hfi1_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs/libhfi1verbs-rdmav33.so
#2  0x00007f93b2796331 in ibv_create_cq@@IBVERBS_1.1 ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#3  0x00007f93b27c0a55 in opal_common_verbs_qp_test ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmca_common_verbs.so.40
#4  0x00007f93b27f4e83 in btl_openib_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
#5  0x00007f93b4516aaf in mca_btl_base_select ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libopen-pal.so.40
#6  0x00007f93b29552c2 in mca_bml_r2_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_bml_r2.so
#7  0x00007f93b4b81b54 in mca_bml_base_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#8  0x00007f93b4bc4ef8 in ompi_mpi_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#9  0x00007f93b4b5ee55 in PMPI_Init_thread ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#10 0x0000000000405b55 in main ()
--8<---------------cut here---------------end--------------->8---

Conversely, a pre-upgrade commit works fine:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=c2538db5617032788ac2f140496d00d8107579c8 --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks -- mpiexec -np 2 IMB-MPI1 PingPong
--8<---------------cut here---------------end--------------->8---
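
To compare the two revisions side by side, the commands above can be wrapped in a small helper. This is only a sketch: the function name `pingpong_at` is hypothetical, and it assumes the same package set and benchmark as in the transcripts, without the `--with-debug-info` option.

```shell
# Hypothetical helper: run the PingPong benchmark in a pure environment
# built from the Guix commit given as the first argument.
pingpong_at() {
    guix time-machine --commit="$1" -- \
        environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks -- \
        mpiexec -np 2 IMB-MPI1 PingPong
}
```

Calling `pingpong_at 23a5dcce1d893b8f5c5301ae3c1af863776ed3cf` would then exercise the failing revision, and `pingpong_at c2538db5617032788ac2f140496d00d8107579c8` the working one.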

Does that ring a bell?

Thanks,
Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix.git/commit/?id=c2739c0801ebc5461564e862ce8f08405e2782dc





* bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
  2021-02-01  8:55 bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI Ludovic Courtès
@ 2021-02-01  9:13 ` Efraim Flashner
  2021-02-01 10:13 ` Ludovic Courtès
  2021-02-01 11:10 ` Ludovic Courtès
  2 siblings, 0 replies; 5+ messages in thread
From: Efraim Flashner @ 2021-02-01  9:13 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: Florent Pruvost, 46229, Greg Hogan

I thought I built everything that depended on rdma-core, and
unfortunately I don't have a way to test it. As an actual user of the
package I trust you to revert the change if necessary.

I don't see anything on their mailing list pointing to this, or any
other bugs really.
http://vger.kernel.org/vger-lists.html#linux-rdma

-- 
Efraim Flashner   <efraim@flashner.co.il>   אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted


* bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
  2021-02-01  8:55 bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI Ludovic Courtès
  2021-02-01  9:13 ` Efraim Flashner
@ 2021-02-01 10:13 ` Ludovic Courtès
  2021-02-01 11:10 ` Ludovic Courtès
  2 siblings, 0 replies; 5+ messages in thread
From: Ludovic Courtès @ 2021-02-01 10:13 UTC (permalink / raw)
  To: 46229; +Cc: Florent Pruvost, Greg Hogan

Ludovic Courtès <ludovic.courtes@inria.fr> skribis:

> $ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong

A workaround is to ask Open MPI to ignore the Verbs library with:

  mpiexec --mca btl ^openib …
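
The same exclusion can presumably be set through the environment, since Open MPI reads MCA parameters from `OMPI_MCA_<name>` variables. A minimal sketch; the wrapper name `without_openib` is hypothetical:

```shell
# Hypothetical wrapper: run a command with the openib BTL excluded.
# OMPI_MCA_btl is the environment form of "--mca btl"; the value is
# quoted so that "^" always reaches Open MPI unchanged.
without_openib() {
    OMPI_MCA_btl='^openib' "$@"
}
```

For example, `without_openib mpiexec -np 2 IMB-MPI1 PingPong`. Note that `guix environment --pure` clears the environment, so the variable has to be set inside the environment or kept with its `--preserve` option.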

Ludo’.





* bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
  2021-02-01  8:55 bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI Ludovic Courtès
  2021-02-01  9:13 ` Efraim Flashner
  2021-02-01 10:13 ` Ludovic Courtès
@ 2021-02-01 11:10 ` Ludovic Courtès
  2021-02-01 13:05   ` Ludovic Courtès
  2 siblings, 1 reply; 5+ messages in thread
From: Ludovic Courtès @ 2021-02-01 11:10 UTC (permalink / raw)
  To: 46229; +Cc: Florent Pruvost, Greg Hogan

Ludovic Courtès <ludovic.courtes@inria.fr> skribis:

> mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).

Now with a nicer backtrace:

--8<---------------cut here---------------start------------->8---
(gdb) bt full
#0  attr_optional (attr=0x0) at include/infiniband/cmd_ioctl.h:239
No locals.
#1  ibv_icmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2, channel=channel@entry=0x0, 
    comp_vector=comp_vector@entry=0, flags=flags@entry=0, cq=cq@entry=0x1074c50, link=0x7ffe0a089690)
    at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:63
        cmdb = {{next = 0x7ffe0a089690, next_attr = 0x0, last_attr = 0x0, uhw_in_idx = 255 '\377', 
            uhw_out_idx = 255 '\377', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 0 '\000', 
            buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
              length = 0, object_id = 0, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0, 
              attrs = 0x7ffe0a0895f8}}}
        priv = <optimized out>
        handle = <optimized out>
        async_fd_attr = <optimized out>
        resp_cqe = <optimized out>
        ret = 0
#2  0x00007f9ec83f2e4e in ibv_cmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2, 
    channel=channel@entry=0x0, comp_vector=comp_vector@entry=0, cq=cq@entry=0x1074c50, cmd=cmd@entry=0x0, cmd_size=0, 
    resp=0x7ffe0a089760, resp_size=16) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:137
        __cmdbtotal = 2
        cmdb = {{next = 0x0, next_attr = 0x7ffe0a0896d8, last_attr = 0x7ffe0a0896e8, uhw_in_idx = 255 '\377', 
            uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 2 '\002', 
            buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
              length = 0, object_id = 3, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0, 
              attrs = 0x7ffe0a0896c8}}, {next = 0x100081001, next_attr = 0x7ffe0a089768, last_attr = 0x6e0000005b, 
            uhw_in_idx = 124 '|', uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000', 
            uhw_out_headroom_dwords = 0 '\000', buffer_error = 1 '\001', fallback_require_ex = 1 '\001', 
            fallback_ioctl_only = 1 '\001', hdr = {length = 0, object_id = 0, method_id = 0, num_attrs = 0, 
              reserved1 = 0, driver_id = 17, reserved2 = 15, attrs = 0x7ffe0a089700}}}
        __cmdbdummy = <optimized out>
#3  0x00007f9ec85257bb in hfi1_create_cq (context=0x1074890, cqe=2, channel=0x0, comp_vector=0)
    at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/providers/hfi1verbs/verbs.c:184
        cq = 0x1074c50
        resp = {ibv_resp = {cq_handle = 0, cqe = 0, driver_data = 0x7ffe0a089768}, offset = 0}
        ret = <optimized out>
        size = <optimized out>
#4  0x00007f9ec83fde41 in __ibv_create_cq_1_1 (context=0x1074890, cqe=<optimized out>, cq_context=0x0, channel=0x0, 
    comp_vector=<optimized out>) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/verbs.c:509
        cq = <optimized out>
#5  0x00007f9ec8426a55 in opal_common_verbs_qp_test ()
   from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/libmca_common_verbs.so.40
No symbol table info available.
#6  0x00007f9ec8454e83 in btl_openib_component_init ()
   from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
--8<---------------cut here---------------end--------------->8---

Version 29.2 is good and everything beyond that isn’t.  This has to do
with these rdma-core changes:

--8<---------------cut here---------------start------------->8---
$ git log --oneline v26.4..v33.1 libibverbs/cmd_cq.c
317d8895 verbs: Enhance async FD usage
195c9191 verbs: Introduce verbs_cq for extended CQ operations
90a4d0cc verbs: Extend CQ KABI to get an async FD
--8<---------------cut here---------------end--------------->8---

(The first commit in the list above appeared in v30.)
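
One way to check which release first shipped a given commit is to ask Git for the tags that contain it. A sketch, assuming the release tags sort sensibly by version; `first_tag_with` is a hypothetical name:

```shell
# Hypothetical helper: print the lowest-versioned tag that contains
# the commit given as the first argument.
first_tag_with() {
    git tag --contains "$1" --sort=version:refname | head -n 1
}
```

In an rdma-core checkout this would be called as `first_tag_with 317d8895`, for instance.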

I forgot to mention this happens with Omni-Path hardware:

--8<---------------cut here---------------start------------->8---
$ guix environment --ad-hoc rdma-core -- ibv_devinfo

hca_id: hfi1_0
        transport:                      InfiniBand (0)
        fw_ver:                         1.27.0
        node_guid:                      0011:7509:0107:573e
        sys_image_guid:                 0011:7509:0107:573e
        vendor_id:                      0x1175
        vendor_part_id:                 9456
        hw_ver:                         0x11
        board_id:                       Intel Omni-Path Host Fabric Interface Adapter 100 Series
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               4
                        port_lmc:               0x00
                        link_layer:             InfiniBand

--8<---------------cut here---------------end--------------->8---

Ludo’.





* bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
  2021-02-01 11:10 ` Ludovic Courtès
@ 2021-02-01 13:05   ` Ludovic Courtès
  0 siblings, 0 replies; 5+ messages in thread
From: Ludovic Courtès @ 2021-02-01 13:05 UTC (permalink / raw)
  To: 46229-done; +Cc: Florent Pruvost, Greg Hogan

Good news!  This is fixed by:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=37e997bc7867901dc5eaf9060358dfddacae8dd6

Ludo’.



