From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andy Wingo
Subject: bug#28211: Grafting code triggers GC/thread-safety issue on Guile 2.2.2
Date: Wed, 09 May 2018 11:11:21 +0200
Message-ID: <871sel6vnq.fsf@igalia.com>
References: <877exuj58y.fsf@gnu.org> <87d0yo1tie.fsf@gnu.org> <87fu3124nt.fsf@gnu.org> <87d0y5k6sl.fsf@netris.org>
In-Reply-To: <87d0y5k6sl.fsf@netris.org> (Mark H. Weaver's message of "Tue, 08 May 2018 20:32:26 -0400")
Mime-Version: 1.0
Content-Type: text/plain
List-Id: Bug reports for GNU Guix
To: Mark H Weaver
Cc: 28211@debbugs.gnu.org

On Wed 09 May 2018 02:32, Mark H Weaver writes:

> However, I think it's _far_ more likely that the NULL argument on the
> stack was copied from memory shared by multiple threads without proper
> thread synchronization.

I think this is unlikely on x86 given its total-store-ordering memory
model.  I agree with you about the value of barriers, but I don't think
they are part of this bug that Ludo is seeing.

>> I commented out the MADV_DONTNEED call to be sure, but I can still
>> reproduce the bug.
>
> I strongly doubt that the MADV_DONTNEED is relevant to this issue.

It could be.
It would zero out VM stack frames, and if GC ran while vp->sp was out
of date, it could zero frames that are still live.  However I think
vp->sp is never out of date, so that's probably not it.

The things that can be out of date are the on-heap copy of the IP
(vp->ip) and the local register copy of the sp (sp).  It's more likely
for the local "sp" cache to be out of date -- if we recursed through
Scheme in a call out from the VM, eventually causing stack expansion
and relocation, then on the return forgot to re-cache the sp value,
that could be it.

Similarly, forgetting to set vp->ip before calling out to something
that could GC could cause a problem, because the stack map wouldn't be
right and the precise stack marker could clear a value by mistake.
This only happens for non-innermost frames though; the innermost frame
is marked conservatively.

The rules are: update vp->ip before something that can allocate, and
update local "sp" after returning from a C function that could have
recursively called Scheme.

I did find a couple places in the VM where we forgot to do one of
these, e.g. 07b7490f73fb4a6cb0c9577d082d37c8d9cee7b0 and just now
9a72e212622fa3bd118d7c02c4386601285b3224.  These two patches aren't
shipped yet fwiw.

Andy
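
To make the two rules above concrete, here is a minimal sketch of a toy
interpreter loop that follows them.  It is not Guile's VM: struct
toy_vm, call_out, allocate, the opcodes and toy_demo are all
hypothetical names made up for illustration, and the stack here grows
upward in a malloc'd array purely for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy VM state; not Guile's actual layout. */
struct toy_vm {
  intptr_t *stack;   /* base of the heap-allocated stack */
  size_t capacity;   /* number of slots in the stack */
  intptr_t *sp;      /* on-heap copy of the stack pointer */
  const int *ip;     /* on-heap copy of the instruction pointer */
};

/* Stand-in for a C function that re-enters Scheme: it grows the stack,
   which moves it to a new address and updates vp->sp. */
static void call_out (struct toy_vm *vp)
{
  size_t used = (size_t) (vp->sp - vp->stack);
  size_t new_capacity = vp->capacity * 2;
  intptr_t *fresh = malloc (new_capacity * sizeof *fresh);
  assert (fresh != NULL);
  memcpy (fresh, vp->stack, used * sizeof *fresh);
  free (vp->stack);
  vp->stack = fresh;
  vp->capacity = new_capacity;
  vp->sp = fresh + used;
}

/* Stand-in for an allocation that may trigger GC: it records the ip
   that a collector reading vp->ip would see. */
static const int *ip_seen_by_gc;
static void allocate (struct toy_vm *vp) { ip_seen_by_gc = vp->ip; }

enum { OP_PUSH1, OP_CALL_OUT, OP_ALLOC, OP_HALT };

static intptr_t run (struct toy_vm *vp, const int *code)
{
  const int *ip = code;      /* local cache of the ip */
  intptr_t *sp = vp->sp;     /* local cache of the sp */
  for (;;) {
    switch (*ip) {
    case OP_PUSH1:
      assert (sp < vp->stack + vp->capacity);
      *sp++ = 1;
      vp->sp = sp;           /* vp->sp is kept current at all times */
      ip++;
      break;
    case OP_CALL_OUT:
      vp->ip = ip;           /* rule 1: publish ip before calling out */
      call_out (vp);
      sp = vp->sp;           /* rule 2: re-cache sp, the stack may have moved */
      ip++;
      break;
    case OP_ALLOC:
      vp->ip = ip;           /* rule 1: publish ip before allocating */
      allocate (vp);
      ip++;
      break;
    case OP_HALT: {
      intptr_t total = 0;
      for (intptr_t *p = vp->stack; p < vp->sp; p++)
        total += *p;
      return total;
    }
    default:
      return -1;
    }
  }
}

/* Returns 1 if both pushes survived the relocation and the "GC" saw
   the ip of the allocating instruction. */
int toy_demo (void)
{
  static const int code[] =
    { OP_PUSH1, OP_CALL_OUT, OP_PUSH1, OP_ALLOC, OP_HALT };
  struct toy_vm vm;
  vm.capacity = 4;
  vm.stack = malloc (vm.capacity * sizeof *vm.stack);
  assert (vm.stack != NULL);
  vm.sp = vm.stack;
  vm.ip = code;
  intptr_t total = run (&vm, code);
  int ok = (total == 2) && (ip_seen_by_gc == &code[3]);
  free (vm.stack);
  return ok;
}
```

Calling toy_demo () from a main function exercises both rules.  In this
toy, dropping the "sp = vp->sp" line after call_out would make the
second push write through a pointer into the freed old stack, and
dropping "vp->ip = ip" before allocate would leave the collector
looking at a stale instruction pointer.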