How utterly embarrassing. Please ignore most of this verbose and difficult email chain. Yes, guile-2.9.2 is still crashing, but almost all of my analysis was wrong. Turns out that my scheme code was calling `(10)` i.e. taking an integer, and treating it as a function, and attempting to call it. So the call to `scm_error` was exactly right. It was invisible to me because ... it was ignored in my code. However -- if one does call `scm_error` fairly rapidly, from multiple threads, one will eventually hit a race condition and get a crash. I'm not sure how to create a mini-test-case for this within guile; my code is creating threads outside of guile, and launching `scm_eval` in each (and ignoring the resulting error). This was leading to a crash after 5-10 minutes. -- Linas On Wed, Jul 17, 2019 at 10:52 PM Linas Vepstas wrote: > Oh, I get it. I think the bug is this: VM_DEFINE_OP (7, > return_values,...) > finds some mcode, and calls it. What it found was the > emit_get_callee_vcode > but it is totally pointless to call this mcode, since we're returning, and > not > calling. So its just not useful. > > Worse, it gets called with garbage values, which are then silenced by > ignoring > the resulting scm_error, and everything appears to run smoothly ... for a > while. > Until some later time, (millions of calls later), when there is a > completely unrelated > race condition that causes the scm_error to get tangled and die. The > ideal > solution would be simply to not call the mcode for get_callee; that would > save > time and trouble. > > That's my hypothesis. I tried to test a mock-up of this solution with the > patch > below, but it is too simplistic t actually work (null pointer-deref.) I > con't find > a beter solution > > If you've got a better idea, let me know... > > -- Linas > > --- a/libguile/vm-engine.c > +++ b/libguile/vm-engine.c > @@ -553,6 +553,7 @@ VM_NAME (scm_thread *thread) > mcode = SCM_FRAME_MACHINE_RETURN_ADDRESS (old_fp); > if (mcode && mcode != scm_jit_return_to_interpreter_trampoline) > { > + VP->unused = 1; > scm_jit_enter_mcode (thread, mcode); > CACHE_REGISTER (); > NEXT (0); > diff --git a/libguile/vm.c b/libguile/vm.c > index d7b1788..8e178c7 100644 > --- a/libguile/vm.c > +++ b/libguile/vm.c > @@ -620,6 +620,7 @@ scm_i_vm_prepare_stack (struct scm_vm *vp) > vp->compare_result = SCM_F_COMPARE_NONE; > vp->engine = vm_default_engine; > vp->trace_level = 0; > + vp->unused = 0; > #define INIT_HOOK(h) vp->h##_hook = SCM_BOOL_F; > FOR_EACH_HOOK (INIT_HOOK) > #undef INIT_HOOK > @@ -1515,6 +1516,7 @@ get_callee_vcode (scm_thread *thread) > > vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp); > > + if (vp->unused) { vp->unused = 0; return 0; } > scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S", > scm_list_1 (proc), scm_list_1 (proc)); > } > > On Wed, Jul 17, 2019 at 8:42 PM Linas Vepstas > wrote: > >> Seem to be narrowing it down ... or at least, I have more details ... >> >> On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas >> wrote: >> >>> >>> >>> On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver wrote: >>> >>>> Hi Linas, >>>> >>>> > Investigating the crash with good-old printf's in libguile/vm.c >>>> produces >>>> > a vast ocean of prints ... that should have not been printed, and/or >>>> should >>>> > have been actual errors, but somehow were not handled by scm_error. >>>> > Using today's git pull of master, here's the diff containing a printf: >>>> > >>>> > --- a/libguile/vm.c >>>> > +++ b/libguile/vm.c >>>> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); } >>>> > >>>> > proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline; >>>> > SCM_FRAME_LOCAL (vp->fp, 0) = proc; >>>> > return SCM_PROGRAM_CODE (proc); >>>> > } >>>> > >>>> > +printf("duuude wrong type to apply!\n" >>>> > +"proc=%lx\n" >>>> > +"ip=%p\n" >>>> > +"sp=%p\n" >>>> > +"fp=%p\n" >>>> > +"sp_min=%p\n" >>>> > +"stack_lim=%p\n", >>>> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64, >>>> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit); >>>> > +fflush(stdout); >>>> > + >>>> > vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp); >>>> > >>>> > scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S", >>>> > scm_list_1 (proc), scm_list_1 (proc)); >>>> > } >>>> > >>>> > As you can see, shortly after my printf, there should have been an >>>> > error report. >>>> >>>> Not necessarily. Note that what 'scm_error' actually does is to raise >>>> an exception. What happens next depends on what exception handlers are >>>> installed at the time of the error. >>>> >>> >>> OK, but... when I look at what get_callee_vcode() actually does, it seems >>> to be earnestly trying to fish out the location of a callable function >>> from the >>> frame pointer, and it does so three plausible ways. If those three don't >>> work >>> out, then it sets the instruction pointer (to the garbage value), >>> followed by >>> scm_error(Wrong type to apply). This also looks like an earnest, honest >>> attempt to report a real error. But lets double-check. >>> >>> So who calls get_callee_vcode(), and why, and what did they expect to >>> happen? >>> Well, that's in three places: one in scm_call_n which is a plausible >>> place where >>> one might expect the instruction pointer to be set to a valid value. >>> Then there's two >>> places in vm-engine.c -- "tail-call" and "call" both of which one might >>> plausibly expect >>> to have a valid instruction pointer. I can't imagine any valid scenario >>> where anyone >>> was expecting get_callee_vcode() to actually fail in the normal course >>> of operations. >>> >> >> There is one more place where get_callee_vcode() can get called -- via >> the jump_table, >> via a call to scm_jit_enter_mcode() which issues the code emitted by >> emit_get_callee_vcode >> >> There are four calls to scm_jit_enter_mcode() The one that immediately >> preceeds >> the bug is always the one made here, in vm-engine.c: >> VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32)) >> >> Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and >> SCM_FRAME_LOCAL(VP->fp, 0), >> and they are... fp=0x7fffe000caf8 fpslot=d33b00 (typical) >> >> the mcode is of course some bytecode that bounces through lightning, and >> a few insns >> later, it arrives at get_callee_vcode() but now the fp is different, (it >> changes by 0x20, >> always) and the slot is different: fp=0x7fffe000cad8 and >> SCM_FRAME_LOCAL(fp,0) >> is 0x32 and the 0x32 triggers the scm_error(). (because 0x32 is not any >> of >> SCM_PROGRAM_P or SCM_STRUCTP or a smob) >> >> (but also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or >> a smob, either... so something got computed along the way ... ) >> >> That's what I've got so far. Its highly reproducible. Quick to happen. >> I'm not sure >> what to do next. I guess I need to examine emit_get_callee_vcode() and >> see what >> it does, and why. Any comments, suggestions would be useful. >> >> -- Linas >> >> >>> That is, I can't think of any valid reason why anyone would want to >>> suppress >>> the scm_error(). And even if I could -- calling scm_error() hundreds of >>> times >>> per second, as fast as possible, does not seem like efficient coding for >>> dealing >>> with a call to an invalid address. >>> >>> Anyway I'm trying to track down where the invalid value gets set. No >>> luck so far. >>> There are 6 or 8 places in vm-engine.c where the frame pointer is set to >>> something >>> that isn't a pointer (which seems like cheating to me: passing >>> non-pointer values >>> in something called "pointer" is .. well, knee jerk reaction is that >>> it's not wise, but >>> there may be a deeper reason.) >>> >>> >>>> >>>> > There is no error report... until 5-10 minutes later, when the error >>>> > report itself causes a crash. Before then, I get an endless >>>> > high-speed spew of prints: >>>> >>>> It looks like another error is happening within the exception handler. >>>> >>> >>> Well, yes, that also. But given that the instruction pointer contains >>> garbage >>> its perhaps not entirely surprising... at best, the question is, why >>> didn't it fail >>> sooner? >>> >>> -- Linas >>> >>>> >>>> Mark >>>> >>>> PS: It would be good to pick either 'guile-devel' or 'guile-user' for >>>> continuation of this thread. I don't see a reason why it should be >>>> sent to both lists. >>>> >>> >>> >>> -- >>> cassette tapes - analog TV - film cameras - you >>> >> >> >> -- >> cassette tapes - analog TV - film cameras - you >> > > > -- > cassette tapes - analog TV - film cameras - you > -- cassette tapes - analog TV - film cameras - you