From: Ludovic Courtès
To: 46229@debbugs.gnu.org
Cc: Florent Pruvost, Greg Hogan
Subject: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
Date: Mon, 01 Feb 2021 12:10:34 +0100
Message-ID: <87r1m0gi1h.fsf@gnu.org>
In-Reply-To: <87r1m0i2vc.fsf@inria.fr> (Ludovic Courtès's message of "Mon, 01 Feb 2021 09:55:19 +0100")
References: <87r1m0i2vc.fsf@inria.fr>

Ludovic Courtès wrote:

> mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).

Now with a nicer backtrace:

--8<---------------cut here---------------start------------->8---
(gdb) bt full
#0  attr_optional (attr=0x0) at include/infiniband/cmd_ioctl.h:239
No locals.
#1  ibv_icmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2, channel=channel@entry=0x0,
    comp_vector=comp_vector@entry=0, flags=flags@entry=0, cq=cq@entry=0x1074c50, link=0x7ffe0a089690)
    at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:63
        cmdb = {{next = 0x7ffe0a089690, next_attr = 0x0, last_attr = 0x0, uhw_in_idx = 255 '\377',
            uhw_out_idx = 255 '\377', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 0 '\000',
            buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
              length = 0, object_id = 0, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0,
              attrs = 0x7ffe0a0895f8}}}
        priv = <optimized out>
        handle = <optimized out>
        async_fd_attr = <optimized out>
        resp_cqe = <optimized out>
        ret = 0
#2  0x00007f9ec83f2e4e in ibv_cmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2,
    channel=channel@entry=0x0, comp_vector=comp_vector@entry=0, cq=cq@entry=0x1074c50, cmd=cmd@entry=0x0, cmd_size=0,
    resp=0x7ffe0a089760, resp_size=16) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:137
        __cmdbtotal = 2
        cmdb = {{next = 0x0, next_attr = 0x7ffe0a0896d8, last_attr = 0x7ffe0a0896e8, uhw_in_idx = 255 '\377',
            uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 2 '\002',
            buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
              length = 0, object_id = 3, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0,
              attrs = 0x7ffe0a0896c8}}, {next = 0x100081001, next_attr = 0x7ffe0a089768, last_attr = 0x6e0000005b,
            uhw_in_idx = 124 '|', uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000',
            uhw_out_headroom_dwords = 0 '\000', buffer_error = 1 '\001', fallback_require_ex = 1 '\001',
            fallback_ioctl_only = 1 '\001', hdr = {length = 0, object_id = 0, method_id = 0, num_attrs = 0,
              reserved1 = 0, driver_id = 17, reserved2 = 15, attrs = 0x7ffe0a089700}}}
        __cmdbdummy = <optimized out>
#3  0x00007f9ec85257bb in hfi1_create_cq (context=0x1074890, cqe=2, channel=0x0, comp_vector=0)
    at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/providers/hfi1verbs/verbs.c:184
        cq = 0x1074c50
        resp = {ibv_resp = {cq_handle = 0, cqe = 0, driver_data = 0x7ffe0a089768}, offset = 0}
        ret = <optimized out>
        size = <optimized out>
#4  0x00007f9ec83fde41 in __ibv_create_cq_1_1 (context=0x1074890, cqe=<optimized out>, cq_context=0x0, channel=0x0,
    comp_vector=<optimized out>) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/verbs.c:509
        cq = <optimized out>
#5  0x00007f9ec8426a55 in opal_common_verbs_qp_test ()
   from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/libmca_common_verbs.so.40
No symbol table info available.
#6  0x00007f9ec8454e83 in btl_openib_component_init ()
   from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
--8<---------------cut here---------------end--------------->8---

Version 29.2 is good and everything beyond that isn’t. This has to do
with those rdma-core changes:

--8<---------------cut here---------------start------------->8---
$ git log --oneline v26.4..v33.1 libibverbs/cmd_cq.c
317d8895 verbs: Enhance async FD usage
195c9191 verbs: Introduce verbs_cq for extended CQ operations
90a4d0cc verbs: Extend CQ KABI to get an async FD
--8<---------------cut here---------------end--------------->8---

(The first commit in the list above appeared in v30.)

I forgot to mention this happens with Omni-Path hardware:

--8<---------------cut here---------------start------------->8---
$ guix environment --ad-hoc rdma-core -- ibv_devinfo
hca_id: hfi1_0
        transport:                      InfiniBand (0)
        fw_ver:                         1.27.0
        node_guid:                      0011:7509:0107:573e
        sys_image_guid:                 0011:7509:0107:573e
        vendor_id:                      0x1175
        vendor_part_id:                 9456
        hw_ver:                         0x11
        board_id:                       Intel Omni-Path Host Fabric Interface Adapter 100 Series
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               4
                        port_lmc:               0x00
                        link_layer:             InfiniBand
--8<---------------cut here---------------end--------------->8---

Ludo’.
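
Editorial note, not part of the original message: the observation that 29.2 works and later versions do not lends itself to a bisection over release versions. The sketch below shows one way to find the first failing version with a binary search. It is only an illustration under stated assumptions: the version labels are examples rather than a complete list of rdma-core releases, `first_bad` and `runs_without_segfault` are hypothetical names, and the predicate is a stand-in for actually building the package and running the MPI job.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_good() is False.

    Assumes versions are ordered oldest to newest, the first one is good,
    the last one is bad, and once a version is bad every later one is too.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]


# Illustrative labels only (assumption), not a complete release list.
candidates = ["29.2", "30.0", "31.0", "32.0", "33.1"]


def runs_without_segfault(version: str) -> bool:
    # Stand-in predicate (assumption): in practice this would build that
    # version and run the failing Open MPI program against it.
    return version == "29.2"


print(first_bad(candidates, runs_without_segfault))
```

With the stand-in predicate, the search needs two probes for five candidates. The same loop applies to commits instead of releases, which is what `git bisect` automates.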