From: ludo@gnu.org (Ludovic Courtès)
Subject: bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2
Date: Wed, 09 May 2018 09:17:35 +0200
Message-ID: <87zi19z4a8.fsf@gnu.org>
References: <877exuj58y.fsf@gnu.org> <87d0yo1tie.fsf@gnu.org> <87fu3124nt.fsf@gnu.org> <87d0y5k6sl.fsf@netris.org>
In-Reply-To: <87d0y5k6sl.fsf@netris.org> (Mark H. Weaver's message of "Tue, 08 May 2018 20:32:26 -0400")
To: Mark H Weaver
Cc: Andy Wingo, 28211@debbugs.gnu.org

Hi Mark,

Mark H Weaver wrote:

> ludo@gnu.org (Ludovic Courtès) writes:
>
> [...]
>> Thread 1 (Thread 0x7f6fe6f5d700 (LWP 2856)):
>> #0  0x00007f7019db0d79 in scm_is_pair (x=0x0) at ../libguile/pairs.h:159
>> #1  scm_ilength (sx=) at list.c:190
> [...]
>> What this means is that Thread 1 gets NULL instead of a list as its
>> on-stack argument (vm-engine.c:909 is ‘tail-apply’).
>>
>> How can arguments on the VM stack be zeroed?
>
> I doubt that's what happened, because I expect that each VM stack is
> dedicated to a single hardware thread.  In theory, if a single VM stack
> is used by one thread, and then later used by another thread,
> thread-safety issues on the VM stack could arise in the absence of
> proper thread synchronization.
>
> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.

It could be this, but this particular case is an “embarrassingly
parallel” program where threads work on independent data sets without
any inter-thread communication whatsoever.

What you describe could nevertheless be happening at a lower level,
within libguile, though it’s not clear to me where that could be.

>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.

I thought about it because that’s one way part of the VM stack could be
zeroed out.

>> Then I thought vp->sp might be out-of-sync compared to the local
>> variable ‘sp’, which in turn could cause ‘scm_i_vm_mark_stack’ to not
>> mark a few items on the tip of the stack.  So I did this:
>>
>> diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c
>> index 9509cd643..1136b2271 100644
>> --- a/libguile/vm-engine.c
>> +++ b/libguile/vm-engine.c
>> @@ -151,7 +151,8 @@
>>     code, or otherwise push anything on the stack, you will need to
>>     CACHE_SP afterwards to restore the possibly-changed stack pointer.  */
>>
>> -#define SYNC_IP() vp->ip = (ip)
>> +#define SYNC_IP() \
>> +  do { vp->ip = (ip); vp->sp = (sp); } while (0)
>
> I don't see how a change like this could be useful for any thread safety
> problem.
I witnessed situations where the local ‘sp’ seemed to be different from
‘vp->sp’, though it’s hard to tell because I’m unsure where gcc stores
‘sp’.  Here’s an example:

--8<---------------cut here---------------start------------->8---
(gdb) frame
#16 0x00007fabf30af2ca in vm_regular_engine (thread=0x24e6000, vp=0x22de6c0, registers=0x0, resume=40) at vm-engine.c:785
785	          ret = scm_apply_subr (sp, FRAME_LOCALS_COUNT ());
(gdb) p vp->sp
$5 = (union scm_vm_stack_element *) 0x7fabec158718
(gdb) p (union scm_vm_stack_element *) $r13
$6 = (union scm_vm_stack_element *) 0x7fabec158e30
(gdb) p $6 - $5
$7 = 227
(gdb) p vp->fp
$8 = (union scm_vm_stack_element *) 0x7fabec158730
(gdb) p vp->stack_top
$9 = (union scm_vm_stack_element *) 0x7fabec159000
(gdb) p vp->stack_bottom
$10 = (union scm_vm_stack_element *) 0x7fabec158000
(gdb) p vp->sp_min_since_gc
$11 = (union scm_vm_stack_element *) 0x7fabec158620
(gdb) info registers
rax            0x1                 1
rbx            0xa                 10
rcx            0x28                40
rdx            0x0                 0
rsi            0x23f1920           37689632
rdi            0x24e6000           38690816
rbp            0x22de6c0           0x22de6c0
rsp            0x7fabcce18660      0x7fabcce18660
r8             0x1                 1
r9             0x1                 1
r10            0x100               256
r11            0x23f1920           37689632
r12            0x7fabf330b8c0      140376496191680
r13            0x7fabec158e30      140376376970800
r14            0x7fabf30c6d7c      140376493813116
r15            0x7fabf0fa7f28      140376459083560
rip            0x7fabf30af2ca      0x7fabf30af2ca
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
--8<---------------cut here---------------end--------------->8---

My hypothesis was that such a bug could lead heap elements to be
reclaimed too early.  This is more likely to happen in a multi-threaded
context because one thread could be allocating memory and triggering a
GC while another thread is invoking a subr with an out-of-sync ‘vp->sp’.

Does that make sense?
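
To make the hypothesis above concrete, here is a toy model of it.  It is
only a sketch and not libguile code: ‘struct toy_vm’, ‘toy_mark_stack’ and
‘toy_run’ are hypothetical names invented for illustration, and the model
simply assumes a downward-growing stack whose marker scans from the saved
stack pointer up to the stack top.

```c
#include <stddef.h>

#define STACK_SLOTS 16

/* Toy stand-in for the VM state; NOT libguile's 'struct scm_vm'.  */
struct toy_vm
{
  void *stack[STACK_SLOTS];
  void **stack_top;   /* one past the highest slot; the stack grows down */
  void **sp;          /* saved stack pointer, the only one the marker sees */
};

/* Toy marker: count the non-NULL slots in [vp->sp, vp->stack_top).
   Anything pushed below the saved 'sp' is invisible to it.  */
static size_t
toy_mark_stack (const struct toy_vm *vp)
{
  size_t marked = 0;
  for (void **slot = vp->sp; slot < vp->stack_top; slot++)
    if (*slot != NULL)
      marked++;
  return marked;
}

/* Push four objects through a local, cached 'sp', optionally write it
   back to the saved copy, then run the toy marker.  */
static size_t
toy_run (int sync_before_gc)
{
  static int objects[4];
  struct toy_vm vm = { 0 };
  vm.stack_top = vm.stack + STACK_SLOTS;
  vm.sp = vm.stack_top;

  void **sp = vm.sp;            /* local cached copy of the stack pointer */
  for (int i = 0; i < 4; i++)
    *--sp = &objects[i];        /* pushes go through the local copy only */

  if (sync_before_gc)
    vm.sp = sp;                 /* the extra store added to SYNC_IP above */

  return toy_mark_stack (&vm);
}
```

In this model ‘toy_run (0)’ marks none of the four pushed objects, while
‘toy_run (1)’ marks all four.  Whether the real engine can reach a GC with
the saved pointer that far behind the cached one is exactly the open
question.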
> For now, I would suggest avoiding multi-threaded code in Guix, or at
> least to avoid loading any Scheme code from multiple threads.
>
> How about using multiple processes instead?

We could do that, but with my Guile maintainer hat on (a hat I don’t
wear that often as you might have noticed ;-)) I think it would be nice
to fix the issue.

Thanks,
Ludo’.