From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED.blaine.gmane.org!not-for-mail From: Linas Vepstas Newsgroups: gmane.lisp.guile.devel Subject: Re: Now crashing [was Re: guile-2.9.2 and threading Date: Wed, 17 Jul 2019 20:42:13 -0500 Message-ID: References: <87h892ault.fsf@netris.org> <87k1cgwo20.fsf@netris.org> Reply-To: linasvepstas@gmail.com Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="000000000000c07390058deab8e9" Injection-Info: blaine.gmane.org; posting-host="blaine.gmane.org:195.159.176.226"; logging-data="180565"; mail-complaints-to="usenet@blaine.gmane.org" Cc: Guile Development To: Mark H Weaver Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Thu Jul 18 03:42:34 2019 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([209.51.188.17]) by blaine.gmane.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1hnvRK-000kp0-39 for guile-devel@m.gmane.org; Thu, 18 Jul 2019 03:42:34 +0200 Original-Received: from localhost ([::1]:33644 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.86_2) (envelope-from ) id 1hnvRI-0003sl-VQ for guile-devel@m.gmane.org; Wed, 17 Jul 2019 21:42:32 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:34526) by lists.gnu.org with esmtp (Exim 4.86_2) (envelope-from ) id 1hnvRF-0003sc-54 for guile-devel@gnu.org; Wed, 17 Jul 2019 21:42:31 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1hnvRD-00042l-26 for guile-devel@gnu.org; Wed, 17 Jul 2019 21:42:29 -0400 Original-Received: from mail-lj1-x22c.google.com ([2a00:1450:4864:20::22c]:44367) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1hnvRC-00041D-Ij for guile-devel@gnu.org; Wed, 17 Jul 2019 21:42:26 -0400 Original-Received: by mail-lj1-x22c.google.com with SMTP id k18so25564210ljc.11 for ; Wed, 17 Jul 2019 18:42:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:reply-to:from:date:message-id :subject:to:cc; bh=skzVVrXRDwTcIoRIlujJAdPzug2RYr91keNxzvInGdI=; b=slfHcOgCoVoESCLi5It81cTyAfQyalGMY+vcl5EsFsdnkzVjTJqOoP4OVkDhyOCHL5 5kGPPC4jjdTyt3vk+ur4I9Z5LYExnBsRwA2j9UP8Tp1Y7LFkK3hC1LkXEW/jvRkVWnFv IKOSVn2uCNqrac/R6S6ZRa+k4ACpv8JVI1Eznw2Mu9UQYDQ32mXzkZveOdqT0dyu86d+ 4rv83CJVie2LU+beLhQi5m0bcCr6N9/8/KXLMNtV7JiHOIEJiOcYLvCwtxfO3sLpBKj/ 37K8cNEauh+y+hCiMhwf6qGbU+Orehd94hx9s1/xOp+MbJ0JvwG/O3vD6lpRg20T2tbe hoAA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to:cc; bh=skzVVrXRDwTcIoRIlujJAdPzug2RYr91keNxzvInGdI=; b=T+p2d0Mwfw0s9Z4gLWJcsLO1D3FmgMq1zmi/SsXfaykDjxzCw4ye80icgEIz5CxToJ alTsHUcVpZ21ZpGs1tVMswJpx3Y4ghDKpc3ZaIHJOcZTrvZmIeCq/eXz68FWanfQit/j JcA3ge+888uK+qu6IpstTqlkGb+OxAFTFpUXZbbTjuh7YHE+yen8EgBh69rj/sU3C2j6 vOcjnTrMCHRxkgXvUicpT1zWqZ1v5NcU9Wox7TM2sri2uSTlQwStn/jS+3MoRJkWKVH3 kpyss8lzTWcMgejWgpBKmROrzovZYNWjkNhwsKtKJTHk/W0ZN8rAujdjH7rs58sqUYxf 8/Ew== X-Gm-Message-State: APjAAAV49VKs27U1OpGHaRJ7THfI9ysN6K+zZ9qQG9eRE5yxERMNuhrp qfjuYULLEDhqeeFLiparkknMhyGwcaaPQ9GxEbc= X-Google-Smtp-Source: APXvYqwRRilcF/WVGXLbIToTL2jgnA8owKYWctHieklzvM8ICpaDz2FnrZtx2B9FntnXER9VO7vTWeanM8HGs6h8R/k= X-Received: by 2002:a2e:900c:: with SMTP id h12mr21597395ljg.197.1563414145035; Wed, 17 Jul 2019 18:42:25 -0700 (PDT) In-Reply-To: X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2a00:1450:4864:20::22c X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Original-Sender: "guile-devel" Xref: news.gmane.org gmane.lisp.guile.devel:20016 Archived-At: --000000000000c07390058deab8e9 Content-Type: text/plain; charset="UTF-8" Seem to be narrowing it down ... or at least, I have more details ... On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas wrote: > > > On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver wrote: > >> Hi Linas, >> >> > Investigating the crash with good-old printf's in libguile/vm.c produces >> > a vast ocean of prints ... that should have not been printed, and/or >> should >> > have been actual errors, but somehow were not handled by scm_error. >> > Using today's git pull of master, here's the diff containing a printf: >> > >> > --- a/libguile/vm.c >> > +++ b/libguile/vm.c >> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); } >> > >> > proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline; >> > SCM_FRAME_LOCAL (vp->fp, 0) = proc; >> > return SCM_PROGRAM_CODE (proc); >> > } >> > >> > +printf("duuude wrong type to apply!\n" >> > +"proc=%lx\n" >> > +"ip=%p\n" >> > +"sp=%p\n" >> > +"fp=%p\n" >> > +"sp_min=%p\n" >> > +"stack_lim=%p\n", >> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64, >> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit); >> > +fflush(stdout); >> > + >> > vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp); >> > >> > scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S", >> > scm_list_1 (proc), scm_list_1 (proc)); >> > } >> > >> > As you can see, shortly after my printf, there should have been an >> > error report. >> >> Not necessarily. Note that what 'scm_error' actually does is to raise >> an exception. What happens next depends on what exception handlers are >> installed at the time of the error. >> > > OK, but... when I look at what get_callee_vcode() actually does, it seems > to be earnestly trying to fish out the location of a callable function > from the > frame pointer, and it does so three plausible ways. If those three don't > work > out, then it sets the instruction pointer (to the garbage value), followed > by > scm_error(Wrong type to apply). This also looks like an earnest, honest > attempt to report a real error. But lets double-check. > > So who calls get_callee_vcode(), and why, and what did they expect to > happen? > Well, that's in three places: one in scm_call_n which is a plausible place > where > one might expect the instruction pointer to be set to a valid value. Then > there's two > places in vm-engine.c -- "tail-call" and "call" both of which one might > plausibly expect > to have a valid instruction pointer. I can't imagine any valid scenario > where anyone > was expecting get_callee_vcode() to actually fail in the normal course of > operations. > There is one more place where get_callee_vcode() can get called -- via the jump_table, via a call to scm_jit_enter_mcode() which issues the code emitted by emit_get_callee_vcode There are four calls to scm_jit_enter_mcode() The one that immediately preceeds the bug is always the one made here, in vm-engine.c: VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32)) Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and SCM_FRAME_LOCAL(VP->fp, 0), and they are... fp=0x7fffe000caf8 fpslot=d33b00 (typical) the mcode is of course some bytecode that bounces through lightning, and a few insns later, it arrives at get_callee_vcode() but now the fp is different, (it changes by 0x20, always) and the slot is different: fp=0x7fffe000cad8 and SCM_FRAME_LOCAL(fp,0) is 0x32 and the 0x32 triggers the scm_error(). (because 0x32 is not any of SCM_PROGRAM_P or SCM_STRUCTP or a smob) (but also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or a smob, either... so something got computed along the way ... ) That's what I've got so far. Its highly reproducible. Quick to happen. I'm not sure what to do next. I guess I need to examine emit_get_callee_vcode() and see what it does, and why. Any comments, suggestions would be useful. -- Linas > That is, I can't think of any valid reason why anyone would want to > suppress > the scm_error(). And even if I could -- calling scm_error() hundreds of > times > per second, as fast as possible, does not seem like efficient coding for > dealing > with a call to an invalid address. > > Anyway I'm trying to track down where the invalid value gets set. No luck > so far. > There are 6 or 8 places in vm-engine.c where the frame pointer is set to > something > that isn't a pointer (which seems like cheating to me: passing non-pointer > values > in something called "pointer" is .. well, knee jerk reaction is that it's > not wise, but > there may be a deeper reason.) > > >> >> > There is no error report... until 5-10 minutes later, when the error >> > report itself causes a crash. Before then, I get an endless >> > high-speed spew of prints: >> >> It looks like another error is happening within the exception handler. >> > > Well, yes, that also. But given that the instruction pointer contains > garbage > its perhaps not entirely surprising... at best, the question is, why > didn't it fail > sooner? > > -- Linas > >> >> Mark >> >> PS: It would be good to pick either 'guile-devel' or 'guile-user' for >> continuation of this thread. I don't see a reason why it should be >> sent to both lists. >> > > > -- > cassette tapes - analog TV - film cameras - you > -- cassette tapes - analog TV - film cameras - you --000000000000c07390058deab8e9 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Seem to be narrowing it down ... or at le= ast, I have more details ...

On Wed, Jul 17, 2019 at 4:44 PM Linas Vepst= as <linasvepstas@gmail.com= > wrote:


On Wed, Jul 17, 2019 at 12:49 PM Mark H W= eaver <mhw@netris.or= g> wrote:
Hi Linas,

> Investigating the crash with good-old printf's in libguile/vm.c pr= oduces
> a vast ocean of prints ... that should have not been printed, and/or s= hould
> have been actual errors, but somehow were not handled by scm_error. > Using today's git pull of master, here's the diff containing a= printf:
>
> --- a/libguile/vm.c
> +++ b/libguile/vm.c
> @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0);= }
>
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 proc =3D SCM_SMOB_DESCRIPTOR (proc).apply_t= rampoline;
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 SCM_FRAME_LOCAL (vp->fp, 0) =3D proc; >=C2=A0 =C2=A0 =C2=A0 =C2=A0 return SCM_PROGRAM_CODE (proc);
>=C2=A0 =C2=A0 =C2=A0 }
>
> +printf("duuude wrong type to apply!\n"
> +"proc=3D%lx\n"
> +"ip=3D%p\n"
> +"sp=3D%p\n"
> +"fp=3D%p\n"
> +"sp_min=3D%p\n"
> +"stack_lim=3D%p\n",
> +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
> +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack= _limit);
> +fflush(stdout);
> +
>=C2=A0 =C2=A0 vp->ip =3D SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp= );
>
>=C2=A0 =C2=A0 scm_error (scm_arg_type_key, NULL, "Wrong type to ap= ply: ~S",
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0scm_list_1 (proc= ), scm_list_1 (proc));
>=C2=A0 }
>
> As you can see, shortly after my printf, there should have been an
> error report.

Not necessarily.=C2=A0 Note that what 'scm_error' actually does is = to raise
an exception.=C2=A0 What happens next depends on what exception handlers ar= e
installed at the time of the error.

OK,= but... when I look at what get_callee_vcode() actually does, it seems
to be earnestly trying to fish out the location of a callable functio= n from the
frame pointer, and it does so three plausible way= s. If those three don't work
out, then it sets the instructio= n pointer (to the garbage value), followed by
scm_error(Wron= g type to apply). This also looks like an earnest, honest
attempt= to report a real error.=C2=A0 But lets double-check.

So who calls get_callee_vcode(), and why, and what did they expect = to happen?
Well, that's in three places: one in scm_call_n wh= ich is a plausible place where
one might expect the instruct= ion pointer to be set to a valid value. Then there's two
plac= es in vm-engine.c -- "tail-call" and "call" both of whi= ch one might plausibly expect
to have a valid instruction pointer= .=C2=A0 I can't imagine any valid scenario where anyone
was e= xpecting get_callee_vcode() to actually fail in the normal course of operat= ions.=C2=A0

There is= one more place where=C2=A0 get_callee_vcode() can get called -- via the ju= mp_table,
via a call to scm_jit_enter_mcode()=C2=A0 which issues= the code emitted by emit_get_callee_vcode

There a= re four calls to scm_jit_enter_mcode()=C2=A0 The one that immediately prece= eds
the bug is always the one made here, in vm-engine.c:
VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32)= ) =C2=A0

Right before the call to scm_jit_ent= er_mcode(), I can printf VP->fp and
SCM_FRAME_LOCAL(VP-&g= t;fp, 0),
and they are... fp=3D0x7fffe000caf8 fpslot=3Dd33b00 (ty= pical)

the mcode is of course some bytecode th= at bounces through lightning, and a few insns
later, it arrives a= t get_callee_vcode() but now=C2=A0 the fp is different, (it changes by 0x20= ,
always) and the slot is different:=C2=A0 fp=3D0x7fffe000cad8=C2= =A0 and SCM_FRAME_LOCAL(fp,0)
is 0x32 and the 0x32 triggers = the scm_error(). (because 0x32 is not any of
SCM_PROGRAM_P o= r SCM_STRUCTP or a smob)

(but also, the fpslot=3Dd= 33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or
a smob, eit= her... so something got computed along the way ... )

That= 's what I've got so far. Its highly reproducible.=C2=A0 Quick to ha= ppen.=C2=A0 I'm not sure
what to do next. I guess I need to e= xamine emit_get_callee_vcode() and see what
it does, and why= .=C2=A0=C2=A0 Any comments, suggestions would be useful.

-- Linas

=
That is, I can't think of any valid reason why anyone wo= uld want to suppress
the scm_error().=C2=A0 And even if I could -= - calling scm_error() hundreds of times
per second, as fast as po= ssible, does not seem like efficient coding for dealing
with a ca= ll to an invalid address.=C2=A0

Anyway I'm try= ing to track down where the invalid value gets set. No luck so far.
There are 6 or 8 places in vm-engine.c where the frame pointer is set to= something
that isn't a pointer (which seems like cheating to= me: passing non-pointer values
in something called "po= inter" is .. well, knee jerk reaction is that it's not wise, but
there may be a deeper reason.)
=C2=A0

> There is no error report... until 5-10 minutes later, when the error > report itself causes a crash.=C2=A0 Before then, I get an endless
> high-speed spew of prints:

It looks like another error is happening within the exception handler.
<= /blockquote>

Well, yes, that also. But given that the in= struction pointer contains garbage
its perhaps not entirely surpr= ising... at best, the question is, why didn't it fail
sooner?=

-- Linas

=C2=A0 =C2=A0 =C2=A0 =C2=A0Mark

PS: It would be good to pick either 'guile-devel' or 'guile-use= r' for
=C2=A0 =C2=A0 continuation of this thread.=C2=A0 I don't see a reason w= hy it should be
=C2=A0 =C2=A0 sent to both lists.


--
cassette tapes = - analog TV - film cameras - you


--
cassette tapes - analog TV - film cameras = - you
--000000000000c07390058deab8e9--