On Mon, Feb 01, 2021 at 09:55:19AM +0100, Ludovic Courtès wrote:
> Hello,
>
> We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
> in InfiniBand related routines:
>
> --8<---------------cut here---------------start------------->8---
> $ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf -- environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> $ file core.20879
> core.20879: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'IMB-MPI1 PingPong', real uid: 10218, effective uid: 10218, real gid: 11018, effective gid: 11018, execfn: '/gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1', platform: 'x86_64'
> $ gdb /gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1 core.20879
> (gdb) bt
> #0  0x00007f93b2789e88 in ibv_cmd_create_cq ()
>    from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
> #1  0x00007f93b28c57bb in hfi1_create_cq ()
>    from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs/libhfi1verbs-rdmav33.so
> #2  0x00007f93b2796331 in ibv_create_cq@@IBVERBS_1.1 ()
>    from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
> #3  0x00007f93b27c0a55 in opal_common_verbs_qp_test ()
>    from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmca_common_verbs.so.40
> #4  0x00007f93b27f4e83 in btl_openib_component_init ()
>    from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
> #5  0x00007f93b4516aaf in mca_btl_base_select ()
>    from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libopen-pal.so.40
> #6  0x00007f93b29552c2 in mca_bml_r2_component_init ()
>    from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_bml_r2.so
> #7  0x00007f93b4b81b54 in mca_bml_base_init ()
>    from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
> #8  0x00007f93b4bc4ef8 in ompi_mpi_init ()
>    from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
> #9  0x00007f93b4b5ee55 in PMPI_Init_thread ()
>    from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
> #10 0x0000000000405b55 in main ()
> --8<---------------cut here---------------end--------------->8---
>
> Conversely, a pre-upgrade commit works fine:
>
> --8<---------------cut here---------------start------------->8---
> $ guix time-machine --commit=c2538db5617032788ac2f140496d00d8107579c8 -- environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks -- mpiexec -np 2 IMB-MPI1 PingPong
> --8<---------------cut here---------------end--------------->8---
>
> Does that ring a bell?
>
> Thanks,
> Ludo’.
>
> ¹ https://git.savannah.gnu.org/cgit/guix.git/commit/?id=c2739c0801ebc5461564e862ce8f08405e2782dc
>

I thought I built everything that depended on rdma-core, and
unfortunately I don't have a way to test it. As an actual user of the
package I trust you to revert the change if necessary.

I don't see anything on their mailing list pointing to this, or any
other bugs really.

http://vger.kernel.org/vger-lists.html#linux-rdma

-- 
Efraim Flashner אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D 14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted