From: Ludovic Courtès
Subject: bug#28211: Stack marking issue in multi-threaded code, 2020 edition
Date: Thu, 12 Mar 2020 22:59:11 +0100
Message-ID: <87tv2tp74g.fsf_-_@gnu.org>
In-Reply-To: <87a7rdvdm9.fsf_-_@gnu.org>
To: Andy Wingo
Cc: 28211@debbugs.gnu.org

Hi!

I think I’ve found another race condition involving stack marking, as a
followup to (this time on 3.0.1+, but the code is almost the same.)

‘abort_to_prompt’ does this:

--8<---------------cut here---------------start------------->8---
  fp = vp->stack_top - fp_offset;
  sp = vp->stack_top - sp_offset;

  /* Continuation gets nargs+1 values: the one more is for the cont.
   */
  sp = sp - nargs - 1;

  /* Shuffle abort arguments down to the prompt continuation.  We have
     to be jumping to an older part of the stack.  */
  if (sp < vp->sp)
    abort ();
  sp[nargs].as_scm = cont;
  while (nargs--)
    sp[nargs] = vp->sp[nargs];

  /* Restore VM regs */
  vp->fp = fp;
  vp->sp = sp;
  vp->ip = vra;
--8<---------------cut here---------------end--------------->8---

What if ‘scm_i_vm_mark_stack’ walks the stack right before the ‘vp->fp’
assignment?  It can determine that one of the just-assigned ‘sp[nargs]’
is a dead slot, and thus set it to SCM_UNSPECIFIED.  By the time we set
‘vp->fp’, the stack slot that we just initialized has already been
overwritten by ‘scm_i_vm_mark_stack’.  Down the road, we get something
like:

  Wrong type to apply: #

I believe this is what I’m seeing here (0x7ff7f838dda0 is being set to
SCM_UNSPECIFIED while thread 2 is in ‘abort_to_prompt’):

--8<---------------cut here---------------start------------->8---
(rr) thread 5
[Switching to thread 5 (Thread 24572.24575)]
#0  scm_i_vm_mark_stack (vp=0x7ff7fd820b48, mark_stack_ptr=0x7ff7fc0ebf90,
    mark_stack_limit=0x7ff7fc0fbec0) at vm.c:743
743                   break;
(rr) list
738                   break;
739                 case SLOT_DESC_DEAD:
740                   /* This value may become dead as a result of GC,
741                      so we can't just leave it on the stack.  */
742                   sp->as_scm = SCM_UNSPECIFIED;
743                   break;
744                 }
745             }
746           sp = SCM_FRAME_PREVIOUS_SP (fp);
747           /* Inner frames may have a dead slots map for precise marking.
(rr) p sp->as_scm
$59 = #
(rr) p sp
$60 = (union scm_vm_stack_element *) 0x7ff7f838dda0
(rr) thread 2
[Switching to thread 2 (Thread 24572.24577)]
#0  0x00007ff7fdb7bb36 in __GI___sigsuspend (
    set=set@entry=0x7ff7fe132720 )
    at ../sysdeps/unix/sysv/linux/sigsuspend.c:26
26      ../sysdeps/unix/sysv/linux/sigsuspend.c: No such file or directory.
(rr) frame 4
#4  0x00007ff7fe228f14 in abort_to_prompt (thread=0x7ff7fd820b40,
    saved_mra=) at vm.c:1465
1465        sp[nargs] = vp->sp[nargs];
(rr) p sp
$61 = (union scm_vm_stack_element *) 0x7ff7f838dd90
(rr) p fp
$62 = (union scm_vm_stack_element *) 0x7ff7f838ddb0
(rr) p &sp[2]
$63 = (union scm_vm_stack_element *) 0x7ff7f838dda0
(rr) p vp->sp
$64 = (union scm_vm_stack_element *) 0x7ff7f838dcf0
(rr) p vp->fp
$65 = (union scm_vm_stack_element *) 0x7ff7f838dd08
(rr) p vp->stack_bottom
$66 = (union scm_vm_stack_element *) 0x7ff7f838a000
(rr) p vp->stack_top
$67 = (union scm_vm_stack_element *) 0x7ff7f838e000
--8<---------------cut here---------------end--------------->8---

Comments about this analysis?

How do we fix it?  It’s a bit troubling that this is all lock-free.  A
fix I can think of is to just re-do the sp[nargs] assignments after the
vp->sp etc. assignments.

Thoughts?

Ludo’.
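
To make the suspected interleaving and the tentative fix above easier to
discuss, here is a minimal single-threaded sketch that replays them
deterministically.  Everything in it is hypothetical: ‘toy_vm’,
‘toy_mark’, ‘toy_abort’ and ‘TOY_UNSPECIFIED’ are stand-ins, not
libguile code.  It also assumes that the marker keeps every slot of the
innermost frame as currently published in vp->sp/vp->fp and may
overwrite slots it attributes to older frames.  It does not model real
threads, memory ordering, or whether the source slots stay valid after
the registers are published, so it illustrates the idea rather than
proving the fix correct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t slot_t;
#define TOY_UNSPECIFIED ((slot_t) 0xdead)  /* hypothetical marker value */

struct toy_vm
{
  slot_t stack[16];   /* grows down: higher index = older frame */
  slot_t *sp, *fp;    /* registers as the marker sees them */
};

/* Toy marker: slots inside the published innermost frame [sp, fp) are
   left alone; VICTIM, if outside it, is treated as a dead slot of an
   older frame and overwritten.  */
static void
toy_mark (struct toy_vm *vm, slot_t *victim)
{
  if (victim < vm->sp || victim >= vm->fp)
    *victim = TOY_UNSPECIFIED;
}

/* Same shape as the shuffle in the quoted code.  */
static void
shuffle (slot_t *dst, const slot_t *src, size_t nargs, slot_t cont)
{
  dst[nargs] = cont;
  while (nargs--)
    dst[nargs] = src[nargs];
}

/* Replay the abort and return the final value of the continuation
   slot.  MARKER_INTERLEAVES runs the toy marker right before the
   registers are published; REDO_AFTER_PUBLISH applies the proposed
   fix.  */
slot_t
toy_abort (int marker_interleaves, int redo_after_publish)
{
  struct toy_vm vm = { .stack = { 0 } };
  size_t nargs = 2;
  slot_t cont = 0xc0;

  /* Young frame holding the abort arguments.  */
  vm.sp = &vm.stack[0];
  vm.fp = &vm.stack[3];
  vm.sp[0] = 0xa0;
  vm.sp[1] = 0xa1;

  /* Older frame we are jumping to.  */
  slot_t *fp = &vm.stack[12];
  slot_t *sp = fp - nargs - 1;
  slot_t *old_sp = vm.sp;

  assert (sp >= vm.sp);
  shuffle (sp, old_sp, nargs, cont);

  if (marker_interleaves)
    toy_mark (&vm, &sp[nargs]);   /* vm.sp and vm.fp are still stale */

  vm.fp = fp;
  vm.sp = sp;

  if (redo_after_publish)
    shuffle (sp, old_sp, nargs, cont);

  return sp[nargs];
}
```

In this model, toy_abort (1, 0) returns TOY_UNSPECIFIED, which matches
the overwritten slot seen in the trace, while toy_abort (1, 1) returns
the continuation value again because the second shuffle happens once
the target frame is the published innermost frame.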