From mboxrd@z Thu Jan 1 00:00:00 1970 From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Subject: bug#28211: Stack marking issue in multi-threaded code Date: Fri, 29 Jun 2018 17:03:42 +0200 Message-ID: <87a7rdvdm9.fsf_-_@gnu.org> References: <877exuj58y.fsf@gnu.org> <87d0yo1tie.fsf@gnu.org> <87fu3124nt.fsf@gnu.org> <87d0y5k6sl.fsf@netris.org> <871sel6vnq.fsf@igalia.com> <87fu30dmx3.fsf@netris.org> <87tvrg3q1d.fsf@igalia.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Return-path: Received: from eggs.gnu.org ([2001:4830:134:3::10]:37055) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fYuwQ-0007lm-UK for bug-guix@gnu.org; Fri, 29 Jun 2018 11:04:11 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fYuwM-0004QW-Ma for bug-guix@gnu.org; Fri, 29 Jun 2018 11:04:06 -0400 Received: from debbugs.gnu.org ([208.118.235.43]:60337) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1fYuwM-0004Pj-Hk for bug-guix@gnu.org; Fri, 29 Jun 2018 11:04:02 -0400 Sender: "Debbugs-submit" Resent-Message-ID: In-Reply-To: <87tvrg3q1d.fsf@igalia.com> (Andy Wingo's message of "Thu, 10 May 2018 09:53:18 +0200") List-Id: Bug reports for GNU Guix List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: bug-guix-bounces+gcggb-bug-guix=m.gmane.org@gnu.org Sender: "bug-Guix" To: Andy Wingo Cc: 28211@debbugs.gnu.org --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hey hey, comrades! I have a fix for some (most?) of the crashes we were seeing while running multi-threaded code such as (guix build compile), and, presumably, the grafting code mentioned at the beginning of this bug report, although I haven=E2=80=99t checked yet. So, =E2=80=98scm_i_vm_mark_stack=E2=80=99 marks the stack precisely, but co= ntrary to what I suspected, precise marking is not at fault. Instead, the problem has to do with the fact that some VM instructions change the frame pointer (vp->fp) before they have set up the dynamic link for that new frame. As a consequence, if a stop-the-world GC is triggered after vp->fp has been changed and before its dynamic link has been set, the stack-walking loop in =E2=80=98scm_i_vm_mark_stack=E2=80=99 could stop very early, leavin= g a lot of objects unmarked. The patch below fixes the problem for me. \o/ I=E2=80=99m thinking we could perhaps add a compiler barrier before =E2=80= =98vp->fp =3D new_fp=E2=80=99 statements, but in practice it=E2=80=99s not necessary here (x86_64, gcc 7). Thoughts? I=E2=80=99d like to push this real soon. I=E2=80=99ll also do more testing= on real workloads from Guix, and then I=E2=80=99d like to release 2.2.4, hopefully within a few days. Thank you and thanks Andy for the discussions on IRC! Ludo=E2=80=99, who=E2=80=99s going to party all night long. :-) --=-=-= Content-Type: text/x-patch Content-Disposition: inline diff --git a/libguile/vm-engine.c b/libguile/vm-engine.c index 1aa4e9699..19ff3e498 100644 --- a/libguile/vm-engine.c +++ b/libguile/vm-engine.c @@ -548,7 +548,7 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp, VM_DEFINE_OP (1, call, "call", OP2 (X8_F24, X8_C24)) { scm_t_uint32 proc, nlocals; - union scm_vm_stack_element *old_fp; + union scm_vm_stack_element *old_fp, *new_fp; UNPACK_24 (op, proc); UNPACK_24 (ip[1], nlocals); @@ -556,9 +556,10 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp, PUSH_CONTINUATION_HOOK (); old_fp = vp->fp; - vp->fp = SCM_FRAME_SLOT (old_fp, proc - 1); - SCM_FRAME_SET_DYNAMIC_LINK (vp->fp, old_fp); - SCM_FRAME_SET_RETURN_ADDRESS (vp->fp, ip + 2); + new_fp = SCM_FRAME_SLOT (old_fp, proc - 1); + SCM_FRAME_SET_DYNAMIC_LINK (new_fp, old_fp); + SCM_FRAME_SET_RETURN_ADDRESS (new_fp, ip + 2); + vp->fp = new_fp; RESET_FRAME (nlocals); @@ -586,7 +587,7 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp, { scm_t_uint32 proc, nlocals; scm_t_int32 label; - union scm_vm_stack_element *old_fp; + union scm_vm_stack_element *old_fp, *new_fp; UNPACK_24 (op, proc); UNPACK_24 (ip[1], nlocals); @@ -595,9 +596,10 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp, PUSH_CONTINUATION_HOOK (); old_fp = vp->fp; - vp->fp = SCM_FRAME_SLOT (old_fp, proc - 1); - SCM_FRAME_SET_DYNAMIC_LINK (vp->fp, old_fp); - SCM_FRAME_SET_RETURN_ADDRESS (vp->fp, ip + 3); + new_fp = SCM_FRAME_SLOT (old_fp, proc - 1); + SCM_FRAME_SET_DYNAMIC_LINK (new_fp, old_fp); + SCM_FRAME_SET_RETURN_ADDRESS (new_fp, ip + 3); + vp->fp = new_fp; RESET_FRAME (nlocals); @@ -3893,7 +3895,7 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp, NEXT (1); { - union scm_vm_stack_element *old_fp; + union scm_vm_stack_element *old_fp, *new_fp; size_t old_frame_size = FRAME_LOCALS_COUNT (); SCM proc = scm_i_async_pop (thread); @@ -3907,9 +3909,10 @@ VM_NAME (scm_i_thread *thread, struct scm_vm *vp, handle-interrupts opcode to handle any additional interrupts. */ old_fp = vp->fp; - vp->fp = SCM_FRAME_SLOT (old_fp, old_frame_size + 1); - SCM_FRAME_SET_DYNAMIC_LINK (vp->fp, old_fp); - SCM_FRAME_SET_RETURN_ADDRESS (vp->fp, ip); + new_fp = SCM_FRAME_SLOT (old_fp, old_frame_size + 1); + SCM_FRAME_SET_DYNAMIC_LINK (new_fp, old_fp); + SCM_FRAME_SET_RETURN_ADDRESS (new_fp, ip); + vp->fp = new_fp; SP_SET (0, proc); diff --git a/libguile/vm.c b/libguile/vm.c index c8ec6e1b2..7749159e5 100644 --- a/libguile/vm.c +++ b/libguile/vm.c @@ -1011,6 +1011,18 @@ scm_i_vm_mark_stack (struct scm_vm *vp, struct GC_ms_entry *mark_stack_ptr, slot_map = find_slot_map (SCM_FRAME_RETURN_ADDRESS (fp), &cache); } + size_t extra = 0; + for (; sp < vp->stack_top; sp++) + { + if (GC_is_heap_ptr (sp->as_ptr)) + extra++; + } + if (extra) + { + printf ("%s extra: %zi\n", __func__, extra); + abort (); + } + return_unused_stack_to_os (vp); return mark_stack_ptr; --=-=-=--