From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Linas Vepstas Newsgroups: gmane.lisp.guile.devel Subject: Re: Now crashing [was Re: guile-2.9.2 and threading Date: Wed, 17 Jul 2019 22:52:47 -0500 Message-ID: References: <87h892ault.fsf@netris.org> <87k1cgwo20.fsf@netris.org> Reply-To: linasvepstas@gmail.com Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000c32d4a058dec8b3c" Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="170048"; mail-complaints-to="usenet@blaine.gmane.org" Cc: Guile Development To: Mark H Weaver Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Thu Jul 18 05:53:13 2019 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1hnxTj-000i4t-Qk for guile-devel@m.gmane.org; Thu, 18 Jul 2019 05:53:12 +0200 Original-Received: from localhost ([::1]:34000 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.86_2) (envelope-from ) id 1hnxTi-0005Vs-Mn for guile-devel@m.gmane.org; Wed, 17 Jul 2019 23:53:10 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:42006) by lists.gnu.org with esmtp (Exim 4.86_2) (envelope-from ) id 1hnxTe-0005Vi-Hv for guile-devel@gnu.org; Wed, 17 Jul 2019 23:53:08 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1hnxTc-0005bp-4u for guile-devel@gnu.org; Wed, 17 Jul 2019 23:53:06 -0400 Original-Received: from mail-lj1-x243.google.com ([2a00:1450:4864:20::243]:45973) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1hnxTa-0005Zs-G7 for guile-devel@gnu.org; Wed, 17 Jul 2019 23:53:04 -0400 Original-Received: by mail-lj1-x243.google.com with SMTP id m23so25743125lje.12 for ; Wed, 17 Jul 2019 20:53:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:reply-to:from:date:message-id :subject:to:cc; bh=oV5Q0Ep5vwcHskl3gAxc3VaqrKB8N2IY0qMBabB5FNs=; b=YYH/lZpi8cwnM/5FYoOKk5PchIz2FOJ9GS7WDT5Q3YbPrEuTVoy94yzRmtucz+/PfH KW5mV+oXSfpQO/+HUSoIhhIWTe7QXjGBdvxbHI2EV+xr+j6YcVOqcH0IAsC5w1gEeTj3 dK5BcdNqMfUcCa/TSZFViiw8zAjjPYr13gRf5BZEiSMliG72MLkRNfyWQbdM9VP1YZda UCe9xduKQmZe9t1auZoKfkDMPyV8toXhfAExRLFAAmLfkTH2DMHNsd+Rj0ju2ZaRmbEG DDaBgx8ZLo/5KHk6ruHCYX2eOMIoMe22R873R79BYJjrjOcRpL/0bUS22PpMU+Riag+J WXWg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to:cc; bh=oV5Q0Ep5vwcHskl3gAxc3VaqrKB8N2IY0qMBabB5FNs=; b=Tku7QNNTW+K6DiTIlnnIEmvMabQKk01Vgo5WkNDBoDe0LWmetpAraJODu98alnrZhT MAQ7KhdyUdgVJuvLQaCm1R/e0bR+hxPluaYEHIZ65Vc5h+bOD0UGpPYNCRhScwovjUJO L6JhiUxLe9KtZC7uLrPg666apgr5TgiAvTwMwuqrM4+4Qw5PJprTVQ1nzyL5IJPmvYoE 09F7xTBFE16DEPM8E5JrPcULyd9948cmOAlD8UCbq6zfGiWdun9u2EYjEqCg7v5FUEZ6 gMtKG1lSBxCpNfwDj5K8VINuwnJUrqwMTcG1aPpNV/uIz3o91j90YlhjHEI/Cy8/u008 jLzw== X-Gm-Message-State: APjAAAXhfa0XCAk9OmmzPvi7JRDZILlfBztMdYzOTi5qGTQOKP6MveUM XScf1Ctpbir79Lgk9lTUe3KIxkGgqdYQK5u/KMA= X-Google-Smtp-Source: APXvYqwH4WnIqRICUmch4M0y7br6W9tqQsVbMrlZscejigOYqjKsA4UOOyV2mu2ayNDam2vpRsVWV0qKSigpwXfjaQI= X-Received: by 2002:a2e:8650:: with SMTP id i16mr23054124ljj.178.1563421980174; Wed, 17 Jul 2019 20:53:00 -0700 (PDT) In-Reply-To: X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2a00:1450:4864:20::243 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Original-Sender: "guile-devel" Xref: news.gmane.org gmane.lisp.guile.devel:20017 Archived-At: --000000000000c32d4a058dec8b3c Content-Type: text/plain; charset="UTF-8" Oh, I get it. I think the bug is this: VM_DEFINE_OP (7, return_values,...) finds some mcode, and calls it. What it found was the emit_get_callee_vcode but it is totally pointless to call this mcode, since we're returning, and not calling. So its just not useful. Worse, it gets called with garbage values, which are then silenced by ignoring the resulting scm_error, and everything appears to run smoothly ... for a while. Until some later time, (millions of calls later), when there is a completely unrelated race condition that causes the scm_error to get tangled and die. The ideal solution would be simply to not call the mcode for get_callee; that would save time and trouble. That's my hypothesis. I tried to test a mock-up of this solution with the patch below, but it is too simplistic t actually work (null pointer-deref.) I con't find a beter solution If you've got a better idea, let me know... -- Linas --- a/libguile/vm-engine.c +++ b/libguile/vm-engine.c @@ -553,6 +553,7 @@ VM_NAME (scm_thread *thread) mcode = SCM_FRAME_MACHINE_RETURN_ADDRESS (old_fp); if (mcode && mcode != scm_jit_return_to_interpreter_trampoline) { + VP->unused = 1; scm_jit_enter_mcode (thread, mcode); CACHE_REGISTER (); NEXT (0); diff --git a/libguile/vm.c b/libguile/vm.c index d7b1788..8e178c7 100644 --- a/libguile/vm.c +++ b/libguile/vm.c @@ -620,6 +620,7 @@ scm_i_vm_prepare_stack (struct scm_vm *vp) vp->compare_result = SCM_F_COMPARE_NONE; vp->engine = vm_default_engine; vp->trace_level = 0; + vp->unused = 0; #define INIT_HOOK(h) vp->h##_hook = SCM_BOOL_F; FOR_EACH_HOOK (INIT_HOOK) #undef INIT_HOOK @@ -1515,6 +1516,7 @@ get_callee_vcode (scm_thread *thread) vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp); + if (vp->unused) { vp->unused = 0; return 0; } scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S", scm_list_1 (proc), scm_list_1 (proc)); } On Wed, Jul 17, 2019 at 8:42 PM Linas Vepstas wrote: > Seem to be narrowing it down ... or at least, I have more details ... > > On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas > wrote: > >> >> >> On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver wrote: >> >>> Hi Linas, >>> >>> > Investigating the crash with good-old printf's in libguile/vm.c >>> produces >>> > a vast ocean of prints ... that should have not been printed, and/or >>> should >>> > have been actual errors, but somehow were not handled by scm_error. >>> > Using today's git pull of master, here's the diff containing a printf: >>> > >>> > --- a/libguile/vm.c >>> > +++ b/libguile/vm.c >>> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); } >>> > >>> > proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline; >>> > SCM_FRAME_LOCAL (vp->fp, 0) = proc; >>> > return SCM_PROGRAM_CODE (proc); >>> > } >>> > >>> > +printf("duuude wrong type to apply!\n" >>> > +"proc=%lx\n" >>> > +"ip=%p\n" >>> > +"sp=%p\n" >>> > +"fp=%p\n" >>> > +"sp_min=%p\n" >>> > +"stack_lim=%p\n", >>> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64, >>> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit); >>> > +fflush(stdout); >>> > + >>> > vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp); >>> > >>> > scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S", >>> > scm_list_1 (proc), scm_list_1 (proc)); >>> > } >>> > >>> > As you can see, shortly after my printf, there should have been an >>> > error report. >>> >>> Not necessarily. Note that what 'scm_error' actually does is to raise >>> an exception. What happens next depends on what exception handlers are >>> installed at the time of the error. >>> >> >> OK, but... when I look at what get_callee_vcode() actually does, it seems >> to be earnestly trying to fish out the location of a callable function >> from the >> frame pointer, and it does so three plausible ways. If those three don't >> work >> out, then it sets the instruction pointer (to the garbage value), >> followed by >> scm_error(Wrong type to apply). This also looks like an earnest, honest >> attempt to report a real error. But lets double-check. >> >> So who calls get_callee_vcode(), and why, and what did they expect to >> happen? >> Well, that's in three places: one in scm_call_n which is a plausible >> place where >> one might expect the instruction pointer to be set to a valid value. Then >> there's two >> places in vm-engine.c -- "tail-call" and "call" both of which one might >> plausibly expect >> to have a valid instruction pointer. I can't imagine any valid scenario >> where anyone >> was expecting get_callee_vcode() to actually fail in the normal course of >> operations. >> > > There is one more place where get_callee_vcode() can get called -- via > the jump_table, > via a call to scm_jit_enter_mcode() which issues the code emitted by > emit_get_callee_vcode > > There are four calls to scm_jit_enter_mcode() The one that immediately > preceeds > the bug is always the one made here, in vm-engine.c: > VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32)) > > Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and > SCM_FRAME_LOCAL(VP->fp, 0), > and they are... fp=0x7fffe000caf8 fpslot=d33b00 (typical) > > the mcode is of course some bytecode that bounces through lightning, and a > few insns > later, it arrives at get_callee_vcode() but now the fp is different, (it > changes by 0x20, > always) and the slot is different: fp=0x7fffe000cad8 and > SCM_FRAME_LOCAL(fp,0) > is 0x32 and the 0x32 triggers the scm_error(). (because 0x32 is not any of > SCM_PROGRAM_P or SCM_STRUCTP or a smob) > > (but also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or > a smob, either... so something got computed along the way ... ) > > That's what I've got so far. Its highly reproducible. Quick to happen. > I'm not sure > what to do next. I guess I need to examine emit_get_callee_vcode() and see > what > it does, and why. Any comments, suggestions would be useful. > > -- Linas > > >> That is, I can't think of any valid reason why anyone would want to >> suppress >> the scm_error(). And even if I could -- calling scm_error() hundreds of >> times >> per second, as fast as possible, does not seem like efficient coding for >> dealing >> with a call to an invalid address. >> >> Anyway I'm trying to track down where the invalid value gets set. No luck >> so far. >> There are 6 or 8 places in vm-engine.c where the frame pointer is set to >> something >> that isn't a pointer (which seems like cheating to me: passing >> non-pointer values >> in something called "pointer" is .. well, knee jerk reaction is that it's >> not wise, but >> there may be a deeper reason.) >> >> >>> >>> > There is no error report... until 5-10 minutes later, when the error >>> > report itself causes a crash. Before then, I get an endless >>> > high-speed spew of prints: >>> >>> It looks like another error is happening within the exception handler. >>> >> >> Well, yes, that also. But given that the instruction pointer contains >> garbage >> its perhaps not entirely surprising... at best, the question is, why >> didn't it fail >> sooner? >> >> -- Linas >> >>> >>> Mark >>> >>> PS: It would be good to pick either 'guile-devel' or 'guile-user' for >>> continuation of this thread. I don't see a reason why it should be >>> sent to both lists. >>> >> >> >> -- >> cassette tapes - analog TV - film cameras - you >> > > > -- > cassette tapes - analog TV - film cameras - you > -- cassette tapes - analog TV - film cameras - you --000000000000c32d4a058dec8b3c Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Oh, I get it.=C2=A0 I think the bug is this:=C2=A0 VM= _DEFINE_OP (7, return_values,...)
finds some mcode, and call= s it.=C2=A0 What it found was the emit_get_callee_vcode
but it is= totally pointless to call this mcode, since we're returning, and not
calling. So its just not useful.

Wors= e, it gets called with garbage values, which are then silenced by ignoring =
the resulting=C2=A0 scm_error, and everything appears to run= smoothly ... for a while.
Until some later time, (millions = of calls later), when there is a completely unrelated
race c= ondition that causes the scm_error to get tangled and die.=C2=A0 The ideal =
solution would be simply to not call the mcode for get_calle= e; that would save
time and trouble.

That's my hypothesis. I tried to test a mock-up of this solution= with the patch
below, but it is too simplistic t actually work (= null pointer-deref.)=C2=A0 I con't find
a beter solution
<= /div>

If you've got a better idea, let me know...

-- Linas

--- a/libguile= /vm-engine.c
+++ b/libguile/vm-engine.c
@@ -553,6 +553,7 @@ VM_NAME (= scm_thread *thread)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0mcode =3D S= CM_FRAME_MACHINE_RETURN_ADDRESS (old_fp);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0if (mcode && mcode !=3D scm_jit_return_to_interpreter_= trampoline)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0{
+ =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0VP->unused =3D 1;
=C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0scm_jit_enter_mcode (threa= d, mcode);
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0CACHE_= REGISTER ();
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0NEXT= (0);
diff --git a/libguile/vm.c b/libguile/vm.c
index d7b1788..8e178= c7 100644
--- a/libguile/vm.c
+++ b/libguile/vm.c
@@ -620,6 +620,7= @@ scm_i_vm_prepare_stack (struct scm_vm *vp)
=C2=A0 =C2=A0vp->compa= re_result =3D SCM_F_COMPARE_NONE;
=C2=A0 =C2=A0vp->engine =3D vm_defa= ult_engine;
=C2=A0 =C2=A0vp->trace_level =3D 0;
+ =C2=A0vp->unu= sed =3D 0;
=C2=A0#define INIT_HOOK(h) vp->h##_hook =3D SCM_BOOL_F;=C2=A0 =C2=A0FOR_EACH_HOOK (INIT_HOOK)
=C2=A0#undef INIT_HOOK
@@ -15= 15,6 +1516,7 @@ get_callee_vcode (scm_thread *thread)
=C2=A0
=C2=A0 = =C2=A0vp->ip =3D SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
=C2=A0=
+ =C2=A0if (vp->unused) { vp->unused =3D 0; return 0; }
=C2=A0= =C2=A0scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S&quo= t;,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 scm_list_1 (proc), = scm_list_1 (proc));
=C2=A0}

On Wed, Jul 17, 2019 at 8:42 PM Lin= as Vepstas <linasvepstas@gmail= .com> wrote:
Seem to be narrowing it down ... or a= t least, I have more details ...

On Wed, Jul 17, 2019 at 4:44 PM Linas V= epstas <lina= svepstas@gmail.com> wrote:


On Wed, Jul 17, 201= 9 at 12:49 PM Mark H Weaver <mhw@netris.org> wrote:
Hi Linas,

> Investigating the crash with good-old printf's in libguile/vm.c pr= oduces
> a vast ocean of prints ... that should have not been printed, and/or s= hould
> have been actual errors, but somehow were not handled by scm_error. > Using today's git pull of master, here's the diff containing a= printf:
>
> --- a/libguile/vm.c
> +++ b/libguile/vm.c
> @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0);= }
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 proc =3D SCM_SMOB_DESCRIPTOR (proc).apply_t= rampoline;
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 SCM_FRAME_LOCAL (vp->fp, 0) =3D proc; >=C2=A0 =C2=A0 =C2=A0 =C2=A0 return SCM_PROGRAM_CODE (proc);
>=C2=A0 =C2=A0 =C2=A0 }
>
> +printf("duuude wrong type to apply!\n"
> +"proc=3D%lx\n"
> +"ip=3D%p\n"
> +"sp=3D%p\n"
> +"fp=3D%p\n"
> +"sp_min=3D%p\n"
> +"stack_lim=3D%p\n",
> +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
> +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack= _limit);
> +fflush(stdout);
> +
>=C2=A0 =C2=A0 vp->ip =3D SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp= );
>
>=C2=A0 =C2=A0 scm_error (scm_arg_type_key, NULL, "Wrong type to ap= ply: ~S",
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0scm_list_1 (proc= ), scm_list_1 (proc));
>=C2=A0 }
>
> As you can see, shortly after my printf, there should have been an
> error report.

Not necessarily.=C2=A0 Note that what 'scm_error' actually does is = to raise
an exception.=C2=A0 What happens next depends on what exception handlers ar= e
installed at the time of the error.

OK,= but... when I look at what get_callee_vcode() actually does, it seems
to be earnestly trying to fish out the location of a callable functio= n from the
frame pointer, and it does so three plausible way= s. If those three don't work
out, then it sets the instructio= n pointer (to the garbage value), followed by
scm_error(Wron= g type to apply). This also looks like an earnest, honest
attempt= to report a real error.=C2=A0 But lets double-check.

So who calls get_callee_vcode(), and why, and what did they expect = to happen?
Well, that's in three places: one in scm_call_n wh= ich is a plausible place where
one might expect the instruct= ion pointer to be set to a valid value. Then there's two
plac= es in vm-engine.c -- "tail-call" and "call" both of whi= ch one might plausibly expect
to have a valid instruction pointer= .=C2=A0 I can't imagine any valid scenario where anyone
was e= xpecting get_callee_vcode() to actually fail in the normal course of operat= ions.=C2=A0

There is= one more place where=C2=A0 get_callee_vcode() can get called -- via the ju= mp_table,
via a call to scm_jit_enter_mcode()=C2=A0 which issues= the code emitted by emit_get_callee_vcode

There a= re four calls to scm_jit_enter_mcode()=C2=A0 The one that immediately prece= eds
the bug is always the one made here, in vm-engine.c:
VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32)= ) =C2=A0

Right before the call to scm_jit_ent= er_mcode(), I can printf VP->fp and
SCM_FRAME_LOCAL(VP-&g= t;fp, 0),
and they are... fp=3D0x7fffe000caf8 fpslot=3Dd33b00 (ty= pical)

the mcode is of course some bytecode th= at bounces through lightning, and a few insns
later, it arrives a= t get_callee_vcode() but now=C2=A0 the fp is different, (it changes by 0x20= ,
always) and the slot is different:=C2=A0 fp=3D0x7fffe000cad8=C2= =A0 and SCM_FRAME_LOCAL(fp,0)
is 0x32 and the 0x32 triggers = the scm_error(). (because 0x32 is not any of
SCM_PROGRAM_P o= r SCM_STRUCTP or a smob)

(but also, the fpslot=3Dd= 33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or
a smob, eit= her... so something got computed along the way ... )

That= 's what I've got so far. Its highly reproducible.=C2=A0 Quick to ha= ppen.=C2=A0 I'm not sure
what to do next. I guess I need to e= xamine emit_get_callee_vcode() and see what
it does, and why= .=C2=A0=C2=A0 Any comments, suggestions would be useful.

-- Linas

=
That is, I can't think of any valid reason why anyone wo= uld want to suppress
the scm_error().=C2=A0 And even if I could -= - calling scm_error() hundreds of times
per second, as fast as po= ssible, does not seem like efficient coding for dealing
with a ca= ll to an invalid address.=C2=A0

Anyway I'm try= ing to track down where the invalid value gets set. No luck so far.
There are 6 or 8 places in vm-engine.c where the frame pointer is set to= something
that isn't a pointer (which seems like cheating to= me: passing non-pointer values
in something called "po= inter" is .. well, knee jerk reaction is that it's not wise, but
there may be a deeper reason.)
=C2=A0

> There is no error report... until 5-10 minutes later, when the error > report itself causes a crash.=C2=A0 Before then, I get an endless
> high-speed spew of prints:

It looks like another error is happening within the exception handler.
<= /blockquote>

Well, yes, that also. But given that the in= struction pointer contains garbage
its perhaps not entirely surpr= ising... at best, the question is, why didn't it fail
sooner?=

-- Linas

=C2=A0 =C2=A0 =C2=A0 =C2=A0Mark

PS: It would be good to pick either 'guile-devel' or 'guile-use= r' for
=C2=A0 =C2=A0 continuation of this thread.=C2=A0 I don't see a reason w= hy it should be
=C2=A0 =C2=A0 sent to both lists.


--
cassette tapes - analog TV - film cameras - you


--
cassette tapes= - analog TV - film cameras - you


--
cassette tapes - analog TV - film cameras = - you
--000000000000c32d4a058dec8b3c--