Seem to be narrowing it down ... or at least, I have more details ...

On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas wrote:

>
>
> On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver wrote:
>
>> Hi Linas,
>>
>> > Investigating the crash with good-old printf's in libguile/vm.c produces
>> > a vast ocean of prints ... that should have not been printed, and/or should
>> > have been actual errors, but somehow were not handled by scm_error.
>> > Using today's git pull of master, here's the diff containing a printf:
>> >
>> > --- a/libguile/vm.c
>> > +++ b/libguile/vm.c
>> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }
>> >
>> >        proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
>> >        SCM_FRAME_LOCAL (vp->fp, 0) = proc;
>> >        return SCM_PROGRAM_CODE (proc);
>> >      }
>> >
>> > +printf("duuude wrong type to apply!\n"
>> > +"proc=%lx\n"
>> > +"ip=%p\n"
>> > +"sp=%p\n"
>> > +"fp=%p\n"
>> > +"sp_min=%p\n"
>> > +"stack_lim=%p\n",
>> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
>> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
>> > +fflush(stdout);
>> > +
>> >    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>> >
>> >    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>> >               scm_list_1 (proc), scm_list_1 (proc));
>> >  }
>> >
>> > As you can see, shortly after my printf, there should have been an
>> > error report.
>>
>> Not necessarily.  Note that what 'scm_error' actually does is to raise
>> an exception.  What happens next depends on what exception handlers are
>> installed at the time of the error.
>>
>
> OK, but... when I look at what get_callee_vcode() actually does, it seems
> to be earnestly trying to fish out the location of a callable function from
> the frame pointer, and it does so three plausible ways. If those three don't
> work out, then it sets the instruction pointer (to the garbage value),
> followed by scm_error(Wrong type to apply).  This also looks like an earnest,
> honest attempt to report a real error.
> But let's double-check.
>
> So who calls get_callee_vcode(), and why, and what did they expect to
> happen? Well, that's in three places: one in scm_call_n which is a plausible
> place where one might expect the instruction pointer to be set to a valid
> value. Then there's two places in vm-engine.c -- "tail-call" and "call" both
> of which one might plausibly expect to have a valid instruction pointer.
> I can't imagine any valid scenario where anyone was expecting
> get_callee_vcode() to actually fail in the normal course of operations.
>

There is one more place where get_callee_vcode() can get called: via the
jump_table, through a call to scm_jit_enter_mcode(), which runs the code
emitted by emit_get_callee_vcode.

There are four calls to scm_jit_enter_mcode(). The one that immediately
precedes the bug is always the one made here, in vm-engine.c:

  VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32))

Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and
SCM_FRAME_LOCAL(VP->fp, 0), and they are...

  fp=0x7fffe000caf8 fpslot=d33b00   (typical)

The mcode is of course some bytecode that bounces through lightning, and a
few insns later, it arrives at get_callee_vcode(). But now the fp is
different (it changes by 0x20, always) and the slot is different:

  fp=0x7fffe000cad8 and SCM_FRAME_LOCAL(fp,0) is 0x32

and the 0x32 triggers the scm_error() (because 0x32 is not any of
SCM_PROGRAM_P or SCM_STRUCTP or a smob).

(But also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or a
smob, either... so something got computed along the way ...)

That's what I've got so far. It's highly reproducible. Quick to happen.
I'm not sure what to do next. I guess I need to examine
emit_get_callee_vcode() and see what it does, and why.

Any comments, suggestions would be useful.

-- Linas

> That is, I can't think of any valid reason why anyone would want to
> suppress the scm_error().
> And even if I could -- calling scm_error() hundreds of times per second,
> as fast as possible, does not seem like efficient coding for dealing
> with a call to an invalid address.
>
> Anyway I'm trying to track down where the invalid value gets set. No luck
> so far. There are 6 or 8 places in vm-engine.c where the frame pointer is
> set to something that isn't a pointer (which seems like cheating to me:
> passing non-pointer values in something called "pointer" is .. well, knee
> jerk reaction is that it's not wise, but there may be a deeper reason.)
>
>
>>
>> > There is no error report... until 5-10 minutes later, when the error
>> > report itself causes a crash.  Before then, I get an endless
>> > high-speed spew of prints:
>>
>> It looks like another error is happening within the exception handler.
>>
>
> Well, yes, that also.  But given that the instruction pointer contains
> garbage it's perhaps not entirely surprising... at best, the question is,
> why didn't it fail sooner?
>
> -- Linas
>
>>
>>      Mark
>>
>> PS: It would be good to pick either 'guile-devel' or 'guile-user' for
>> continuation of this thread.  I don't see a reason why it should be
>> sent to both lists.
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>

--
cassette tapes - analog TV - film cameras - you
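
A quick sanity check on the numbers reported above (the two fp values, and
the slot contents d33b00 and 0x32), as a minimal sketch only. It assumes
8-byte stack slots on a 64-bit build and the usual Guile low-bit tagging
(bit 1 set means a fixnum whose value sits in the upper bits; bits 1 and 2
both clear means a heap object). The helper names are made up for
illustration and are not libguile API.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMPTION: one VM stack slot is 8 bytes (64-bit build). */
#define SLOT_BYTES 8

/* How many stack slots lie between two frame pointer values. */
int64_t fp_delta_slots (uint64_t fp_before, uint64_t fp_after)
{
  return ((int64_t) fp_before - (int64_t) fp_after) / SLOT_BYTES;
}

/* ASSUMPTION: bit 1 set marks a fixnum (hypothetical helper name). */
int looks_like_fixnum (uint64_t word)
{
  return (word & 2) != 0;
}

/* ASSUMPTION: the fixnum payload is the word shifted right by two bits.
   Only meaningful for the small positive values examined here. */
int64_t fixnum_value (uint64_t word)
{
  return (int64_t) (word >> 2);
}

/* ASSUMPTION: bits 1 and 2 clear means "not an immediate", i.e. the word
   could be the address of a heap object. */
int looks_like_heap_object (uint64_t word)
{
  return (word & 6) == 0;
}

#ifdef DEMO_MAIN
int main (void)
{
  uint64_t fp_before = 0x7fffe000caf8u;  /* fp before scm_jit_enter_mcode() */
  uint64_t fp_after = 0x7fffe000cad8u;   /* fp seen in get_callee_vcode() */
  uint64_t slots[] = { 0xd33b00u, 0x32u };

  assert (fp_before - fp_after == 0x20);
  printf ("fp moved by %" PRId64 " slots\n",
          fp_delta_slots (fp_before, fp_after));

  for (size_t i = 0; i < sizeof slots / sizeof slots[0]; i++)
    {
      if (looks_like_fixnum (slots[i]))
        printf ("0x%" PRIx64 ": fixnum %" PRId64 "\n",
                slots[i], fixnum_value (slots[i]));
      else if (looks_like_heap_object (slots[i]))
        printf ("0x%" PRIx64 ": possible heap object\n", slots[i]);
      else
        printf ("0x%" PRIx64 ": some other immediate\n", slots[i]);
    }
  return 0;
}
#endif
```

Built with -DDEMO_MAIN this is a standalone program. If the tagging
assumption holds, the 0x20 shift is four slots, d33b00 is at least shaped
like a heap address, and 0x32 decodes as the small integer 12, which would
mean the callee slot held an ordinary number rather than anything callable.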