From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail
From: Linas Vepstas <linasvepstas@gmail.com>
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Now crashing [was Re: guile-2.9.2 and threading]
Date: Sun, 21 Jul 2019 16:10:48 -0500
Message-ID: <CAHrUA37dZZw2wChp5Lj7V0i+UDNo3sJhaViKh0g+n_OcPGXoDA@mail.gmail.com>
References: <CAHrUA35tHiw6huwxC6Rt=dKj4=W7XyDOS61tDtEYu0LF1_AmSQ@mail.gmail.com>
 <87h892ault.fsf@netris.org>
 <CAHrUA35MCaLRaUGLevn2CGZVVxFzgn8VogfPp_3Qnpw1-msNug@mail.gmail.com>
 <CAHrUA35YQgr6a+26Qe-3caN7y0QfmMnppjYnn2tAi5aEjrY-+Q@mail.gmail.com>
 <CAHrUA37DZgyqu7JvC--1sOUfW5NJf2y-OYGi0c5BmZoqOGj4Mw@mail.gmail.com>
 <CAHrUA350B3+vGip+E+HTZ3hkQHMfRMrnB+7-eCgAeKCTqp044Q@mail.gmail.com>
 <CAHrUA37CWvNrmUpAc3CpPeQBV5yTFhnt9egT7OaqfRKyZLTA-g@mail.gmail.com>
 <87k1cgwo20.fsf@netris.org>
 <CAHrUA36vxr+Qnn5uyitmmUOUJ=0LYjeeXjrLZnFQdRiPi6ZHXQ@mail.gmail.com>
 <CAHrUA34ECsTRNFecrmB+yDByZ5xMeq8=mwDjKpeO+Lu6vE8ZfQ@mail.gmail.com>
 <CAHrUA35ke3HgrvqHpK4B26X--Su8KAV0pnfbRrx1gdmCLuODEw@mail.gmail.com>
Reply-To: linasvepstas@gmail.com
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="00000000000079ed09058e3765ae"
Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226";
	logging-data="118088"; mail-complaints-to="usenet@blaine.gmane.org"
Cc: Guile Development <guile-devel@gnu.org>
To: Mark H Weaver <mhw@netris.org>
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Sun Jul 21 23:11:10 2019
Return-path: <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by blaine.gmane.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.89)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1hpJ6s-000UcM-Iy
	for guile-devel@m.gmane.org; Sun, 21 Jul 2019 23:11:10 +0200
Original-Received: from localhost ([::1]:57748 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.86_2)
	(envelope-from <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>)
	id 1hpJ6r-0003mK-Kb
	for guile-devel@m.gmane.org; Sun, 21 Jul 2019 17:11:09 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:54554)
 by lists.gnu.org with esmtp (Exim 4.86_2)
 (envelope-from <linasvepstas@gmail.com>) id 1hpJ6n-0003aw-N1
 for guile-devel@gnu.org; Sun, 21 Jul 2019 17:11:08 -0400
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <linasvepstas@gmail.com>) id 1hpJ6l-0001ZI-9T
 for guile-devel@gnu.org; Sun, 21 Jul 2019 17:11:05 -0400
Original-Received: from mail-lj1-x244.google.com ([2a00:1450:4864:20::244]:32849)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
 (Exim 4.71) (envelope-from <linasvepstas@gmail.com>)
 id 1hpJ6k-0001XG-S4
 for guile-devel@gnu.org; Sun, 21 Jul 2019 17:11:03 -0400
Original-Received: by mail-lj1-x244.google.com with SMTP id h10so35543855ljg.0
 for <guile-devel@gnu.org>; Sun, 21 Jul 2019 14:11:02 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:references:in-reply-to:reply-to:from:date:message-id
 :subject:to:cc;
 bh=u01H4nsip75KdE/gm/lFCvwwZoN7nTbqHqYWqx23yno=;
 b=R0cvUx6Rc8iy+i5pvWoU4285TlSAl0toWyHgGkRRs2mcoAfjQ42UKgAAtJmg0k/vde
 1Ay+HGChTr0RXgt6Pzatp1BX06nsnpl1hkzWk8SiQYu1seFqYBagvSWK5Ns40PkvnuQl
 Ee0xx5TOWIXiDv8Ee9dRkAD6FkBs3mnkclWLADGC2LDdftL8aRoUvbkClUEg5qBioZ9l
 litHyfKaVHp7Q7tZmpzTZGyR91pZ9EUtQlTHEsKtkLHD1X0cYtwyy3vGcGxzPchNZfg8
 Pa8bEgfYsCFsNBwmw8JQnyGlv/aYBID0zWAx+ywQYscmUvHYLM04QsuRNZFDoWVC0lFe
 3KzA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:references:in-reply-to:reply-to
 :from:date:message-id:subject:to:cc;
 bh=u01H4nsip75KdE/gm/lFCvwwZoN7nTbqHqYWqx23yno=;
 b=A6DIQW4todzLXHUyDhryywwgaZ65YSlHBbRmP72/FQAvflQKSMvHDE+5pQXvkWZHpM
 TIH+HQywHMH4amm8Z91ixCClPCR4lSS5H1WYPtrptHjERACoj7U8YOp6fc/9O1YuZ78G
 217d64om8qDldaxHSoasfZ7M4zTw2DW7BCyYj+y+uKiV5w5/GveoFW3YU6lpb+UTuPF3
 rkLjbcgwHXvmBiAiUoh7VmTV5T/ZJLNJLdbSv4QvFSSIpteIKpVwmWC72RYl9spE2LYJ
 UFOveUlzLU7ncwBOiJLs3FaJeviCX1CO3FG0JSPv6aZzCFTIF1S3zrlv82+WCa/v44LB
 XZPA==
X-Gm-Message-State: APjAAAUYYnN9kKWBNoROa2yZnl1UXoaqPe+NAV9ewJbgsf9osrRnDwNo
 nnZ/1OrWNCfy+lQlhz5uEfM7wuzF+JserC5e8iT/UQ==
X-Google-Smtp-Source: APXvYqwg12uecfFHwBdAjVO1AvBp3osA+/YpMjS/Y5NZ5b/rAuJ4MqTYm8ocBP7J0nqCE9JHtQO8ByE5pz2ga4OVIG4=
X-Received: by 2002:a2e:8650:: with SMTP id i16mr34556455ljj.178.1563743460386; 
 Sun, 21 Jul 2019 14:11:00 -0700 (PDT)
In-Reply-To: <CAHrUA35ke3HgrvqHpK4B26X--Su8KAV0pnfbRrx1gdmCLuODEw@mail.gmail.com>
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
 recognized.
X-Received-From: 2a00:1450:4864:20::244
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Developers list for Guile,
 the GNU extensibility library" <guile-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/guile-devel>,
 <mailto:guile-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/guile-devel>
List-Post: <mailto:guile-devel@gnu.org>
List-Help: <mailto:guile-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/guile-devel>,
 <mailto:guile-devel-request@gnu.org?subject=subscribe>
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Original-Sender: "guile-devel" <guile-devel-bounces+guile-devel=m.gmane.org@gnu.org>
Xref: news.gmane.org gmane.lisp.guile.devel:20018
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.devel/20018>

--00000000000079ed09058e3765ae
Content-Type: text/plain; charset="UTF-8"

How utterly embarrassing.  Please ignore most of this verbose and difficult
email chain. Yes, guile-2.9.2 is still crashing, but almost all of my
analysis was wrong. It turns out that my Scheme code was calling `(10)`,
i.e. taking an integer, treating it as a function, and attempting to call
it. So the call to `scm_error` was exactly right. It was invisible to me
because ... it was ignored in my code.
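
As a cross-check on the "integer treated as a function" reading, the raw
slot values quoted further down can be decoded by hand. The sketch below is
plain C with no libguile dependency. It assumes libguile's usual tagging of
small integers (value shifted left two bits, low bits binary 10); that layout
is an assumption taken from libguile's scm.h, not something stated in this
thread. The helper names `word_is_fixnum`, `fixnum_value` and
`describe_slot` are hypothetical, made up for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: a small integer ("fixnum") is stored as (value << 2) + 2,
   so its two low bits are binary 10.  Heap pointers are aligned and have
   low bits 00. */
#define FIXNUM_TAG_MASK UINT64_C(0x3)
#define FIXNUM_TAG      UINT64_C(0x2)

/* Hypothetical helper: nonzero if the raw word carries the fixnum tag. */
static int word_is_fixnum(uint64_t word)
{
    return (word & FIXNUM_TAG_MASK) == FIXNUM_TAG;
}

/* Hypothetical helper: recover the integer from a fixnum-tagged word.
   Right-shifting a negative signed value is implementation-defined in C;
   common compilers shift arithmetically, which is what is wanted here. */
static int64_t fixnum_value(uint64_t word)
{
    assert(word_is_fixnum(word));
    return (int64_t)word >> 2;
}

/* Print how one raw frame-slot value decodes under the assumption above. */
static void describe_slot(uint64_t word)
{
    if (word_is_fixnum(word))
        printf("0x%" PRIx64 ": fixnum %" PRId64 "\n", word, fixnum_value(word));
    else
        printf("0x%" PRIx64 ": not a fixnum\n", word);
}
```

Called from a `main`, `describe_slot(0x32)` reports the 0x32 slot as the
fixnum 12 (a literal 10 would be stored as 0x2a), and
`describe_slot(0xd33b00)` reports a word without the fixnum tag. Under the
stated assumption, that fits an integer ending up in the procedure slot.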

However -- if one does call `scm_error` fairly rapidly, from multiple
threads, one will eventually hit a race condition and get a crash.  I'm not
sure how to create a mini-test-case for this within guile; my code is
creating threads outside of guile, and launching `scm_eval` in each (and
ignoring the resulting error).  This was leading to a crash after 5-10
minutes.

-- Linas

On Wed, Jul 17, 2019 at 10:52 PM Linas Vepstas <linasvepstas@gmail.com>
wrote:

> Oh, I get it.  I think the bug is this:  VM_DEFINE_OP (7, return_values,...)
> finds some mcode, and calls it.  What it found was the code emitted by
> emit_get_callee_vcode, but it is totally pointless to call this mcode,
> since we're returning, and not calling.  So it's just not useful.
>
> Worse, it gets called with garbage values, which are then silenced by
> ignoring the resulting scm_error, and everything appears to run smoothly
> ... for a while.  Until some later time (millions of calls later), when
> there is a completely unrelated race condition that causes the scm_error
> to get tangled and die.  The ideal solution would be simply to not call
> the mcode for get_callee; that would save time and trouble.
>
> That's my hypothesis.  I tried to test a mock-up of this solution with the
> patch below, but it is too simplistic to actually work (null-pointer
> dereference).  I can't find a better solution.
>
> If you've got a better idea, let me know...
>
> -- Linas
>
> --- a/libguile/vm-engine.c
> +++ b/libguile/vm-engine.c
> @@ -553,6 +553,7 @@ VM_NAME (scm_thread *thread)
>            mcode = SCM_FRAME_MACHINE_RETURN_ADDRESS (old_fp);
>            if (mcode && mcode != scm_jit_return_to_interpreter_trampoline)
>              {
> +              VP->unused = 1;
>                scm_jit_enter_mcode (thread, mcode);
>                CACHE_REGISTER ();
>                NEXT (0);
> diff --git a/libguile/vm.c b/libguile/vm.c
> index d7b1788..8e178c7 100644
> --- a/libguile/vm.c
> +++ b/libguile/vm.c
> @@ -620,6 +620,7 @@ scm_i_vm_prepare_stack (struct scm_vm *vp)
>    vp->compare_result = SCM_F_COMPARE_NONE;
>    vp->engine = vm_default_engine;
>    vp->trace_level = 0;
> +  vp->unused = 0;
>  #define INIT_HOOK(h) vp->h##_hook = SCM_BOOL_F;
>    FOR_EACH_HOOK (INIT_HOOK)
>  #undef INIT_HOOK
> @@ -1515,6 +1516,7 @@ get_callee_vcode (scm_thread *thread)
>
>    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>
> +  if (vp->unused) { vp->unused = 0; return 0; }
>    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>               scm_list_1 (proc), scm_list_1 (proc));
>  }
>
> On Wed, Jul 17, 2019 at 8:42 PM Linas Vepstas <linasvepstas@gmail.com>
> wrote:
>
>> Seem to be narrowing it down ... or at least, I have more details ...
>>
>> On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas <linasvepstas@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver <mhw@netris.org> wrote:
>>>
>>>> Hi Linas,
>>>>
>>>> > Investigating the crash with good-old printf's in libguile/vm.c
>>>> > produces a vast ocean of prints ... that should not have been
>>>> > printed, and/or should have been actual errors, but somehow were not
>>>> > handled by scm_error.  Using today's git pull of master, here's the
>>>> > diff containing a printf:
>>>> >
>>>> > --- a/libguile/vm.c
>>>> > +++ b/libguile/vm.c
>>>> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }
>>>> >
>>>> >        proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
>>>> >        SCM_FRAME_LOCAL (vp->fp, 0) = proc;
>>>> >        return SCM_PROGRAM_CODE (proc);
>>>> >      }
>>>> >
>>>> > +printf("duuude wrong type to apply!\n"
>>>> > +"proc=%lx\n"
>>>> > +"ip=%p\n"
>>>> > +"sp=%p\n"
>>>> > +"fp=%p\n"
>>>> > +"sp_min=%p\n"
>>>> > +"stack_lim=%p\n",
>>>> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
>>>> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
>>>> > +fflush(stdout);
>>>> > +
>>>> >    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>>>> >
>>>> >    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>>>> >               scm_list_1 (proc), scm_list_1 (proc));
>>>> >  }
>>>> >
>>>> > As you can see, shortly after my printf, there should have been an
>>>> > error report.
>>>>
>>>> Not necessarily.  Note that what 'scm_error' actually does is to raise
>>>> an exception.  What happens next depends on what exception handlers are
>>>> installed at the time of the error.
>>>>
>>>
>>> OK, but... when I look at what get_callee_vcode() actually does, it seems
>>> to be earnestly trying to fish out the location of a callable function
>>> from the frame pointer, and it does so three plausible ways.  If those
>>> three don't work out, then it sets the instruction pointer (to the
>>> garbage value), followed by scm_error(Wrong type to apply).  This also
>>> looks like an earnest, honest attempt to report a real error.  But let's
>>> double-check.
>>>
>>> So who calls get_callee_vcode(), and why, and what did they expect to
>>> happen?  Well, that's in three places: one in scm_call_n, which is a
>>> plausible place where one might expect the instruction pointer to be set
>>> to a valid value.  Then there are two places in vm-engine.c, "tail-call"
>>> and "call", both of which one might plausibly expect to have a valid
>>> instruction pointer.  I can't imagine any valid scenario where anyone
>>> was expecting get_callee_vcode() to actually fail in the normal course
>>> of operations.
>>>
>>
>> There is one more place where get_callee_vcode() can get called: via the
>> jump_table, via a call to scm_jit_enter_mcode(), which issues the code
>> emitted by emit_get_callee_vcode.
>>
>> There are four calls to scm_jit_enter_mcode().  The one that immediately
>> precedes the bug is always the one made here, in vm-engine.c:
>> VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32))
>>
>> Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and
>> SCM_FRAME_LOCAL(VP->fp, 0), and they are... fp=0x7fffe000caf8
>> fpslot=d33b00 (typical)
>>
>> The mcode is of course some bytecode that bounces through lightning, and
>> a few insns later, it arrives at get_callee_vcode(), but now the fp is
>> different (it changes by 0x20, always) and the slot is different:
>> fp=0x7fffe000cad8 and SCM_FRAME_LOCAL(fp,0) is 0x32, and the 0x32
>> triggers the scm_error() (because 0x32 is not any of SCM_PROGRAM_P or
>> SCM_STRUCTP or a smob).
>>
>> (But also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or
>> a smob, either... so something got computed along the way ...)
>>
>> That's what I've got so far.  It's highly reproducible.  Quick to happen.
>> I'm not sure what to do next.  I guess I need to examine
>> emit_get_callee_vcode() and see what it does, and why.  Any comments or
>> suggestions would be useful.
>>
>> -- Linas
>>
>>
>>> That is, I can't think of any valid reason why anyone would want to
>>> suppress the scm_error().  And even if I could, calling scm_error()
>>> hundreds of times per second, as fast as possible, does not seem like
>>> efficient coding for dealing with a call to an invalid address.
>>>
>>> Anyway, I'm trying to track down where the invalid value gets set.  No
>>> luck so far.  There are 6 or 8 places in vm-engine.c where the frame
>>> pointer is set to something that isn't a pointer (which seems like
>>> cheating to me: passing non-pointer values in something called "pointer"
>>> is ... well, my knee-jerk reaction is that it's not wise, but there may
>>> be a deeper reason.)
>>>
>>>
>>>>
>>>> > There is no error report... until 5-10 minutes later, when the error
>>>> > report itself causes a crash.  Before then, I get an endless
>>>> > high-speed spew of prints:
>>>>
>>>> It looks like another error is happening within the exception handler.
>>>>
>>>
>>> Well, yes, that also.  But given that the instruction pointer contains
>>> garbage, it's perhaps not entirely surprising... at best, the question
>>> is, why didn't it fail sooner?
>>>
>>> -- Linas
>>>
>>>>
>>>>        Mark
>>>>
>>>> PS: It would be good to pick either 'guile-devel' or 'guile-user' for
>>>>     continuation of this thread.  I don't see a reason why it should be
>>>>     sent to both lists.
>>>>


-- 
cassette tapes - analog TV - film cameras - you

--00000000000079ed09058e3765ae--