Investigating the crash with good-old printf's in libguile/vm.c produces a vast ocean of prints ... that should have not been printed, and/or should have been actual errors, but somehow were not handled by scm_error. Using today's git pull of master, here's the diff containing a printf: --- a/libguile/vm.c +++ b/libguile/vm.c @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); } proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline; SCM_FRAME_LOCAL (vp->fp, 0) = proc; return SCM_PROGRAM_CODE (proc); } +printf("duuude wrong type to apply!\n" +"proc=%lx\n" +"ip=%p\n" +"sp=%p\n" +"fp=%p\n" +"sp_min=%p\n" +"stack_lim=%p\n", +SCM_FRAME_SLOT(vp->fp, 0)->as_u64, +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit); +fflush(stdout); + vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp); scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S", scm_list_1 (proc), scm_list_1 (proc)); } As you can see, shortly after my printf, there should have been an error report. There is no error report... until 5-10 minutes later, when the error report itself causes a crash. Before then, I get an endless high-speed spew of prints: duuude wrong type to apply! proc=2e ip=0x7efcbbd5b69c sp=0x7efcd14589b8 fp=0x7efcd14589c0 sp_min=0x7efcd1458708 stack_lim=0x7efcd1458000 duuude wrong type to apply! proc=36 ip=0x7efcbbd5b69c sp=0x7efcd15ffad0 fp=0x7efcd15ffad8 sp_min=0x7efcd15ff528 stack_lim=0x7efcd15ff000 duuude wrong type to apply! proc=36 ip=0x7efcbbd5b69c sp=0x7efcd14568a0 fp=0x7efcd14568a8 sp_min=0x7efcd1456318 stack_lim=0x7efcd1456000 duuude wrong type to apply! proc=2a ip=0x7efcbbd5b69c sp=0x7efcd15ffb40 fp=0x7efcd15ffb48 sp_min=0x7efcd15ff3d8 stack_lim=0x7efcd15ff000 duuude wrong type to apply! proc=26 ip=0x7efcbbd5b69c sp=0x7efcd15fdc90 fp=0x7efcd15fdc98 sp_min=0x7efcd15fd798 stack_lim=0x7efcd15fd000 duuude wrong type to apply! proc=22 ip=0x7efcbbd5b69c sp=0x7efcd152ba98 fp=0x7efcd152baa0 sp_min=0x7efcd152b888 stack_lim=0x7efcd152b000 duuude wrong type to apply! The addresses all look healthy. I also added a guard-word, looking for memory corruption, but none was found. Next up: decoding the SCM by hand, and figuring out why it's there. -- Linas On Sun, Jul 14, 2019 at 10:03 PM Linas Vepstas wrote: > Exactly the same crash, same stack trace (slightly different line numbers), > with a fresh pull today: > commit 89e28df1c9069dcb65188fe7b3973c333d87d7e2 > Author: Andy Wingo > Date: Thu Jun 20 14:02:05 2019 +0200 > which is the current HEAD on master. > > FWIW, 60-odd guile threads waiting here: > > #0 __lll_lock_wait () at > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 > #1 0x00007f45ff4fbdbd in __GI___pthread_mutex_lock (mutex=mutex@entry > =0x16e1f68) > at ../nptl/pthread_mutex_lock.c:80 > #2 0x00007f45ff7e085f in scm_c_weak_set_lookup (set=, > raw_hash=raw_hash@entry=551753256168943069, > pred=pred@entry=0x7f45ff7d1660 , > closure=closure@entry=0x7f3e44ff7ac0, dflt=dflt@entry=0x4) > at ../../libguile/weak-set.c:760 > #3 0x00007f45ff7d11e9 in lookup_interned_symbol > (raw_hash=551753256168943069, > name=0x2965ca0) at ../../libguile/symbols.c:112 > #4 scm_i_str2symbol (str=0x2965ca0) at ../../libguile/symbols.c:244 > > The parallelism is low because of this one lock. This appears to be the > primary bottleneck for my workload. > > -- Linas > > > On Sun, Jul 14, 2019 at 5:03 PM Linas Vepstas > wrote: > >> Below was for >> guile (GNU Guile) 2.9.2.14-1fb399 >> >> --linas >> >> On Sun, Jul 14, 2019 at 4:59 PM Linas Vepstas >> wrote: >> >>> >>> So, here's my next installment on using guile-2.9.2. The first >>> installment said that I'd piled up CPU-months of guile 2.9.2 experience >>> without any crashes. Well, now, a different workload crashes in minutes. >>> Below is a highly simplified, edited gdb session -- it crashes because it >>> unexpectedly aborts, during an abort(!) because `get_callee_vcode()` >>> failed. Harrumpf. >>> >>> Background: there are 140 threads, half in guile, the other half waiting >>> for guile to finish. Yes, that's too many, but anyways ... 70 threads in >>> guile and one crashed: >>> >>> #2 0x00007f85d3f6ecdb in capture_delimited_continuation ( >>> current_registers=, dynstack=, >>> saved_registers=, saved_mra=, >>> saved_fp=, vp=) at >>> ../../libguile/vm.c:1327 >>> #3 abort_to_prompt (thread=0x15f692dc0, saved_mra=) >>> at ../../libguile/vm.c:1454 >>> >>> Both frames are interesting, because libguile/vm.c:1327 shows >>> if (SCM_FRAME_DYNAMIC_LINK (base_fp) != saved_fp) >>> abort(); >>> hey!? who called this? line 1454 is in the middle of abort_to_prompt () >>> Yow! an unexpected abort during an abort... >>> >>> How did we get here? >>> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420, >>> subr=, message=message@entry=0x1607c9380, >>> args=args@entry=0x15af130e0, data=data@entry=0x15af130f0) >>> at ../../libguile/error.c:90 >>> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0, >>> message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S", >>> args=0x15af130e0, >>> rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62 >>> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0) >>> at ../../libguile/vm.c:1527 >>> >>> and libguile/vm.c:1527 tells me that get_callee_vcode () is very >>> unhappy. But why? I cannot tell .. after that, things peter out in boring >>> stack frames that started with my call scm_c_catch() ... the same seemingly >>> harmless call that is pending in 70 other threads. (the same call that has >>> survived several CPU month of pounding with a different collection of >>> scheme code) >>> >>> My best guess is that the current workload, by unintentionally launching >>> gobs of threads is exposing a race condition that has been hithertho >>> hidden. I don't know how to debug any further. I will try a slightly >>> newer guile shortly, to see if I get lucky. >>> >>> -- Linas >>> >>> p.s. here's the whole stack trace. But really, its boring, except for >>> the above highlights. >>> >>> >>> >>> (gdb) bt >>> #0 0x00007f85d38ef428 in __GI_raise (sig=sig@entry=6) >>> at ../sysdeps/unix/sysv/linux/raise.c:54 >>> #1 0x00007f85d38f102a in __GI_abort () at abort.c:89 >>> #2 0x00007f85d3f6ecdb in capture_delimited_continuation ( >>> current_registers=, dynstack=, >>> saved_registers=, saved_mra=, >>> saved_fp=, vp=) at >>> ../../libguile/vm.c:1327 >>> #3 abort_to_prompt (thread=0x15f692dc0, saved_mra=) >>> at ../../libguile/vm.c:1454 >>> #4 0x00007f85ac539041 in ?? () >>> #5 0x00007f85ac37f040 in ?? () >>> #6 0x00007f85d41e10c0 in jump_table_ () from >>> /usr/local/lib/libguile-3.0.so.0 >>> #7 0x000000015f692dc0 in ?? () >>> #8 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0, >>> mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796 >>> #9 0x00007f85d3f70600 in vm_debug_engine (thread=0x7f85ac539000) >>> at ../../libguile/vm-engine.c:370 >>> #10 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0xe45a20, >>> argv=, >>> nargs=5) at ../../libguile/vm.c:1605 >>> #11 0x00007f85d3eefdcb in scm_apply_0 (proc=0xe45a20, args=0x304) >>> at ../../libguile/eval.c:603 >>> #12 0x00007f85d3ef0a0d in scm_apply_1 (proc=, >>> arg1=arg1@entry=0xdc5420, args=args@entry=0x15af130a0) >>> at ../../libguile/eval.c:609 >>> #13 0x00007f85d3f6c546 in scm_throw (key=key@entry=0xdc5420, >>> args=0x15af130a0) >>> at ../../libguile/throw.c:272 >>> #14 0x00007f85d3f6caf9 in scm_ithrow (key=key@entry=0xdc5420, >>> args=, >>> no_return=no_return@entry=1) at ../../libguile/throw.c:619 >>> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420, >>> subr=, message=message@entry=0x1607c9380, >>> args=args@entry=0x15af130e0, data=data@entry=0x15af130f0) >>> at ../../libguile/error.c:90 >>> ---Type to continue, or q to quit--- >>> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0, >>> message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S", >>> args=0x15af130e0, >>> rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62 >>> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0) >>> at ../../libguile/vm.c:1527 >>> #18 0x00007f85b4314805 in ?? () >>> #19 0x00007f85b428a000 in ?? () >>> #20 0x00007f85d41e10c0 in jump_table_ () from >>> /usr/local/lib/libguile-3.0.so.0 >>> #21 0x000000015f692dc0 in ?? () >>> #22 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0, >>> mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796 >>> #23 0x00007f85d3f70600 in vm_debug_engine (thread=0x2) >>> at ../../libguile/vm-engine.c:370 >>> #24 0x00007f85d3f76db2 in scm_call_n (proc=, >>> argv=argv@entry=0x7f7e85fe2600, nargs=nargs@entry=3) at >>> ../../libguile/vm.c:1605 >>> #25 0x00007f85d3eef97f in scm_call_3 (proc=, >>> arg1=, >>> arg2=, arg3=) at >>> ../../libguile/eval.c:510 >>> #26 0x00007f85d4262b6f in ?? () >>> #27 0x00007f85d4262a80 in ?? () >>> #28 0x00007f85d41e10c0 in jump_table_ () from >>> /usr/local/lib/libguile-3.0.so.0 >>> #29 0x000000015f692dc0 in ?? () >>> #30 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0, >>> mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796 >>> #31 0x00007f85d3f70600 in vm_debug_engine (thread=0x304) >>> at ../../libguile/vm-engine.c:370 >>> #32 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341ee0, >>> argv=argv@entry=0x0, nargs=nargs@entry=0) at >>> ../../libguile/vm.c:1605 >>> #33 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341ee0) >>> at ../../libguile/eval.c:490 >>> #34 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404, >>> thunk=0x15b341ee0, >>> handler=0x15b341ec0, pre_unwind_handler=0x15b341ea0) at >>> ../../libguile/throw.c:146 >>> #35 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler >>> (key=key@entry=0x404, >>> ---Type to continue, or q to quit--- >>> thunk=, handler=, >>> pre_unwind_handler=) at ../../libguile/throw.c:260 >>> #36 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404, >>> body=, >>> body_data=, >>> handler=handler@entry=0x7f85c95d0f00 >>> >> scm_unused_struct*)>, >>> handler_data=handler_data@entry=0x7f7e60000980, >>> pre_unwind_handler=pre_unwind_handler@entry=0x7f85c95d0c40 >>> >> scm_unused_struct*)>, >>> pre_unwind_handler_data=0x7f7e60000980) at ../../libguile/throw.c:385 >>> #37 0x00007f85c95d122a in opencog::SchemeEval::do_eval >>> (this=0x7f7e60000980, >>> expr="(observe-mpg \"The countess, with her loving heart, felt that >>> her children were being ruined, that it was not the count's fault for he >>> could not help being what he was -- that (though he tried to hide "...) >>> at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:590 >>> #38 0x00007f85c95d12aa in opencog::SchemeEval::c_wrap_eval >>> (p=0x7f7e60000980) >>> at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:507 >>> #39 0x00007f85d3eeb47a in c_body (d=0x7f7e85fe2d40) >>> at ../../libguile/continuations.c:430 >>> #40 0x00007f85d4262b6f in ?? () >>> #41 0x00007f85d4262a80 in ?? () >>> #42 0x00007f85d41e10c0 in jump_table_ () from >>> /usr/local/lib/libguile-3.0.so.0 >>> #43 0x000000015f692dc0 in ?? () >>> #44 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0, >>> mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796 >>> #45 0x00007f85d3f70600 in vm_debug_engine (thread=0x304) >>> at ../../libguile/vm-engine.c:370 >>> #46 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341fe0, >>> argv=argv@entry=0x0, nargs=nargs@entry=0) at >>> ../../libguile/vm.c:1605 >>> #47 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341fe0) >>> at ../../libguile/eval.c:490 >>> #48 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404, >>> thunk=0x15b341fe0, >>> ---Type to continue, or q to quit--- >>> handler=0x15b341fc0, pre_unwind_handler=0x15b341fa0) at >>> ../../libguile/throw.c:146 >>> #49 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler >>> (key=key@entry=0x404, >>> thunk=, handler=, >>> pre_unwind_handler=) at ../../libguile/throw.c:260 >>> #50 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404, >>> body=body@entry=0x7f85d3eeb470 , >>> body_data=body_data@entry=0x7f7e85fe2d40, >>> handler=handler@entry=0x7f85d3eeb720 , >>> handler_data=handler_data@entry=0x7f7e85fe2d40, >>> pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580 >>> , >>> pre_unwind_handler_data=0xe174a0) at ../../libguile/throw.c:385 >>> #51 0x00007f85d3eeb9e3 in scm_i_with_continuation_barrier ( >>> body=body@entry=0x7f85d3eeb470 , >>> body_data=body_data@entry=0x7f7e85fe2d40, >>> handler=handler@entry=0x7f85d3eeb720 , >>> handler_data=handler_data@entry=0x7f7e85fe2d40, >>> pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580 >>> , >>> pre_unwind_handler_data=0xe174a0) at >>> ../../libguile/continuations.c:368 >>> #52 0x00007f85d3eebac5 in scm_c_with_continuation_barrier >>> (func=, >>> data=) at ../../libguile/continuations.c:464 >>> #53 0x00007f85d3575127 in GC_call_with_gc_active ( >>> fn=fn@entry=0x7f85d3f6a070 , >>> client_data=client_data@entry=0x7f7e85fe2e20) at >>> ../pthread_support.c:1343 >>> #54 0x00007f85d3f6ac4f in with_guile (base=base@entry=0x7f7e85fe2df0, >>> data=data@entry=0x7f7e85fe2e20) at ../../libguile/threads.c:683 >>> #55 0x00007f85d356f132 in GC_call_with_stack_base ( >>> fn=fn@entry=0x7f85d3f6abb0 , arg=arg@entry >>> =0x7f7e85fe2e20) >>> at ../misc.c:1941 >>> #56 0x00007f85d3f6aff8 in scm_i_with_guile (dynamic_state=>> out>, >>> data=0x7f7e60000980, >>> func=0x7f85c95d1290 ) >>> at ../../libguile/threads.c:698 >>> ---Type to continue, or q to quit--- >>> #57 scm_with_guile ( >>> func=func@entry=0x7f85c95d1290 >>> , >>> data=data@entry=0x7f7e60000980) at ../../libguile/threads.c:704 >>> #58 0x00007f85c95d126e in opencog::SchemeEval::eval_expr >>> (this=0x7f7e60000980, >>> expr=...) at >>> /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:479 >>> #59 0x00007f85bc783439 in opencog::GenericShell::eval_loop >>> (this=0x7f7ef0001e90) >>> at >>> /home/ubuntu/src/opencog/opencog/cogserver/shell/GenericShell.cc:588 >>> #60 0x00007f85c6e5ac80 in ?? () from >>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6 >>> #61 0x00007f85d3c916ba in start_thread (arg=0x7f7e85fe3700) at >>> pthread_create.c:333 >>> #62 0x00007f85d39c141d in clone () at >>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 >>> (gdb) >>> >>> >>> >>> -- >>> cassette tapes - analog TV - film cameras - you >>> >> >> >> -- >> cassette tapes - analog TV - film cameras - you >> > > > -- > cassette tapes - analog TV - film cameras - you > -- cassette tapes - analog TV - film cameras - you