From: "Ludovic Courtès" <ludo@gnu.org>
To: 46229@debbugs.gnu.org
Cc: Florent Pruvost <florent.pruvost@inria.fr>,
Greg Hogan <code@greghogan.com>
Subject: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
Date: Mon, 01 Feb 2021 12:10:34 +0100
Message-ID: <87r1m0gi1h.fsf@gnu.org>
In-Reply-To: <87r1m0i2vc.fsf@inria.fr> ("Ludovic Courtès"'s message of "Mon, 01 Feb 2021 09:55:19 +0100")
Ludovic Courtès <ludovic.courtes@inria.fr> wrote:
> mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
Now with a nicer backtrace:
--8<---------------cut here---------------start------------->8---
(gdb) bt full
#0 attr_optional (attr=0x0) at include/infiniband/cmd_ioctl.h:239
No locals.
#1 ibv_icmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2, channel=channel@entry=0x0,
comp_vector=comp_vector@entry=0, flags=flags@entry=0, cq=cq@entry=0x1074c50, link=0x7ffe0a089690)
at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:63
cmdb = {{next = 0x7ffe0a089690, next_attr = 0x0, last_attr = 0x0, uhw_in_idx = 255 '\377',
uhw_out_idx = 255 '\377', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 0 '\000',
buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
length = 0, object_id = 0, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0,
attrs = 0x7ffe0a0895f8}}}
priv = <optimized out>
handle = <optimized out>
async_fd_attr = <optimized out>
resp_cqe = <optimized out>
ret = 0
#2 0x00007f9ec83f2e4e in ibv_cmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2,
channel=channel@entry=0x0, comp_vector=comp_vector@entry=0, cq=cq@entry=0x1074c50, cmd=cmd@entry=0x0, cmd_size=0,
resp=0x7ffe0a089760, resp_size=16) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:137
__cmdbtotal = 2
cmdb = {{next = 0x0, next_attr = 0x7ffe0a0896d8, last_attr = 0x7ffe0a0896e8, uhw_in_idx = 255 '\377',
uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 2 '\002',
buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
length = 0, object_id = 3, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0,
attrs = 0x7ffe0a0896c8}}, {next = 0x100081001, next_attr = 0x7ffe0a089768, last_attr = 0x6e0000005b,
uhw_in_idx = 124 '|', uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000',
uhw_out_headroom_dwords = 0 '\000', buffer_error = 1 '\001', fallback_require_ex = 1 '\001',
fallback_ioctl_only = 1 '\001', hdr = {length = 0, object_id = 0, method_id = 0, num_attrs = 0,
reserved1 = 0, driver_id = 17, reserved2 = 15, attrs = 0x7ffe0a089700}}}
__cmdbdummy = <optimized out>
#3 0x00007f9ec85257bb in hfi1_create_cq (context=0x1074890, cqe=2, channel=0x0, comp_vector=0)
at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/providers/hfi1verbs/verbs.c:184
cq = 0x1074c50
resp = {ibv_resp = {cq_handle = 0, cqe = 0, driver_data = 0x7ffe0a089768}, offset = 0}
ret = <optimized out>
size = <optimized out>
#4 0x00007f9ec83fde41 in __ibv_create_cq_1_1 (context=0x1074890, cqe=<optimized out>, cq_context=0x0, channel=0x0,
comp_vector=<optimized out>) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/verbs.c:509
cq = <optimized out>
#5 0x00007f9ec8426a55 in opal_common_verbs_qp_test ()
from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/libmca_common_verbs.so.40
No symbol table info available.
#6 0x00007f9ec8454e83 in btl_openib_component_init ()
from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
--8<---------------cut here---------------end--------------->8---
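For readers less used to this kind of crash: frame #0 is a small helper called with attr=0x0, so its first access through the pointer faults. Below is a minimal sketch of that shape. The struct, the flag, and both function names are hypothetical stand-ins for illustration, not rdma-core's actual definitions, and the checked variant is only one possible way to avoid the fault, not a proposed fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an attribute record; not the rdma-core type. */
struct attr {
    uint16_t flags;
};

#define FLAG_MANDATORY 0x1u  /* hypothetical flag bit */

/* Unchecked helper: it writes through the pointer, so a NULL argument
 * faults on the first store.  This is the shape seen in frame #0. */
static inline struct attr *mark_optional(struct attr *a)
{
    a->flags &= (uint16_t)~FLAG_MANDATORY;
    return a;
}

/* Defensive variant: hands NULL back to the caller instead of crashing. */
static inline struct attr *mark_optional_checked(struct attr *a)
{
    if (a == NULL)
        return NULL;
    return mark_optional(a);
}
```

With the checked variant the caller still has to handle the NULL result; the sketch only moves the failure from a segfault to an error path.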
Version 29.2 works; everything after it is broken. This has to do
with these rdma-core changes:
--8<---------------cut here---------------start------------->8---
$ git log --oneline v26.4..v33.1 libibverbs/cmd_cq.c
317d8895 verbs: Enhance async FD usage
195c9191 verbs: Introduce verbs_cq for extended CQ operations
90a4d0cc verbs: Extend CQ KABI to get an async FD
--8<---------------cut here---------------end--------------->8---
(The first commit in the list above appeared in v30.)
I forgot to mention that this happens on Omni-Path hardware:
--8<---------------cut here---------------start------------->8---
$ guix environment --ad-hoc rdma-core -- ibv_devinfo
hca_id: hfi1_0
transport: InfiniBand (0)
fw_ver: 1.27.0
node_guid: 0011:7509:0107:573e
sys_image_guid: 0011:7509:0107:573e
vendor_id: 0x1175
vendor_part_id: 9456
hw_ver: 0x11
board_id: Intel Omni-Path Host Fabric Interface Adapter 100 Series
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 4
port_lmc: 0x00
link_layer: InfiniBand
--8<---------------cut here---------------end--------------->8---
Ludo’.