guile-2.9.2 and threading

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

* guile-2.9.2 and threading
@ 2019-06-02 23:25 Linas Vepstas
  2019-06-07  4:26 ` Mark H Weaver
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-06-02 23:25 UTC (permalink / raw)
  To: Guile Development, Guile User

[-- Attachment #1: Type: text/plain, Size: 4854 bytes --]

I'm trying to understand how scm_jit_enter_mcode leads to
scm_timed_lock_mutex ... I want to know who is attempting to lock, and why
... and how to work around this...

Background: par-for-each works poorly for my app, I want to understand why.
Specifically, how par-for-each and n-par-for-each work in guile-2.9.2.
I've got my app, which is a guile/c++ mixture.  I've got a list of 137078
items and a proc I want to apply.  Single-threaded, it takes 530 seconds so
about 137078/530 = 260 items/second. Baseline. So.

(n-proc-for-each 6 proc list)

and I see this: 230% cpu use in top (i.e. two+ active threads) Why only
two? It's always about the same: n-proc-for-each 8, n-proc-for-each 12,
whatever ...  gdb shows that all but two threads are stuck here:

```
#0  pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f343d27bb95 in scm_pthread_cond_wait (cond=<optimized out>,
    mutex=<optimized out>) at ../../libguile/threads.c:1615
#2  0x00007f343d27bd8b in block_self (queue=0x55ef64391cd0,
    mutex=mutex@entry=0x55ef63d21fc0, waittime=waittime@entry=0x0)
    at ../../libguile/threads.c:316
#3  0x00007f343d27bedf in lock_mutex (current_thread=0x55f0a3d42c60,
    waittime=0x0, m=0x55ef63d21fc0, kind=SCM_MUTEX_STANDARD)
    at ../../libguile/threads.c:1031
#4  scm_timed_lock_mutex (mutex=0x55ef64391cc0, timeout=<optimized out>)
    at ../../libguile/threads.c:1092
#5  0x00007f343d56582a in ?? ()
#6  0x00007f34104bdf90 in ?? ()
#7  0x00007f343d4f00a0 in jump_table_ () from
/usr/local/lib/libguile-3.0.so.0
#8  0x000055ef6347b3a8 in ?? ()
#9  0x00007f343d22bf61 in scm_jit_enter_mcode (thread=0x55f0a3d42840,
    mcode=0x55f0a3d42840 "\240)ԣ\360U") at ../../libguile/jit.c:4819
#10 0x00007f343d28089c in vm_debug_engine (thread=0x55f0a3d42840)
    at ../../libguile/vm-engine.c:370
#11 0x00007f343d28707a in scm_call_n (proc=proc@entry=0x55eff19c9f00,
    argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1605
```

The other two threads are happily doing things in my app.

I'm using (ice-9 threads)   I see this:

(define (n-par-for-each n proc . arglists)
  (let ((m (make-mutex))
   (threads '()))
    (do ((i 0 (+ 1 i)))
   ((= i n)
    (for-each join-thread threads))
      (set! threads
       (cons (begin-thread
         (let loop ()
           (lock-mutex m)
           (if (null? (car arglists))
          (unlock-mutex m)
          (let ((args (map car arglists)))
            (set! arglists (map cdr arglists))
            (unlock-mutex m)
            (apply proc args)
            (loop)))))
        threads)))))

Oh, I says to myself: bad bad mutex. Let me write a lock-less loop: it
chops the list into n pieces, each in its own thread. (a bit sloppy, but
adequate):

(define (my-for-each n proc args)
   (define len (length args))
   (define quo (euclidean-quotient len n))
   (define rem (euclidean-remainder len n))
   (define threads '())
   (do ((i 0 (+ 1 i)))
      ((= i n) (for-each join-thread threads))
         (set! threads
            (cons
               (begin-thread
                  (for-each proc (take (drop args (* i quo)) quo)))
               threads)))
   (for-each proc (drop args (* n quo)))
)

Let me go hog-wild: (my-for-each 12 proc list)   (I have a cpu with that
many cores) So... what happens? A little better .. not much.  This time,
gdb shows that there are four threads in my app. Two are stuck here:

#0  __lll_lock_wait () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f343ca69bb5 in __GI___pthread_mutex_lock (
    mutex=mutex@entry=0x7f343d4f0f40 <bytes_until_gc_lock>)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f343d213e20 in scm_gc_register_allocation (size=size@entry=16)
    at ../../libguile/gc.c:591

All the others are stuck in the first stack trace.  Average speedup over
one thread is 1.5x (345 seconds vs 525 seconds) and it burns about 250% CPU
(according to top) to achieve this.

Why aren't there 12 threads in my app? my-for-each is lockless, so where is
the scm_timed_lock_mutex ... scm_jit_enter_mcode coming from?

Using par-for-each results in times that are identical to single-thread
times.  Identical - no speedup, no slow-down. Toy sniff tests show that
par-for-each really does run in parallel, so no clue why its not any faster.

FWIW, guile-2.2 behaves much worse. Running two threads, things got 1.5x
faster.  Running 4 threads, things ran at half-speed of single-threaded.
Running six threads ran maybe 10x or 20x slower than single-threaded.
Clearly, a really bad live-lock situation. I want to get out from under
this.  I'm trying to do big data, and this is a bottleneck.

--linas

-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 5749 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: guile-2.9.2 and threading
  2019-06-02 23:25 guile-2.9.2 and threading Linas Vepstas
@ 2019-06-07  4:26 ` Mark H Weaver
  2019-06-07  5:01   ` Mark H Weaver
  2019-07-09 20:46   ` Linas Vepstas
  0 siblings, 2 replies; 15+ messages in thread
From: Mark H Weaver @ 2019-06-07  4:26 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Guile User, Guile Development

Hi Linas,

Linas Vepstas <linasvepstas@gmail.com> writes:

> I'm trying to understand how scm_jit_enter_mcode leads to
> scm_timed_lock_mutex ...

This simply means that 'lock-mutex' was called from Scheme, and
specifically from Scheme code that has been compiled to machine code by
our JIT compiler.

> I want to know who is attempting to lock, and why ... and how to work
> around this...

You'll need to look at the stack frames on the Scheme stack.  It can be
done from GDB if necessary, but it might be sufficient to use Guile's
debugger.

My first thought was to suggest ",break lock-mutex", but that doesn't
work, presumably because it's a C primitive (although we should fix
that), but I was able to patch in a pure Scheme wrapper for it, which
then allows us to set a Scheme breakpoint:

--8<---------------cut here---------------start------------->8---
mhw@jojen ~/guile-master$ meta/guile
GNU Guile 2.9.2.14-1fb399-dirty
Copyright (C) 1995-2019 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> ,use (ice-9 threads)
scheme@(guile-user)> (define (test n) (par-for-each + (iota n) (iota n)))
scheme@(guile-user)> (define core-lock-mutex lock-mutex)
scheme@(guile-user)> (set! lock-mutex (lambda* (mutex #:optional timeout)
                                        (core-lock-mutex mutex timeout)))
scheme@(guile-user)> ,break lock-mutex
Trap 0: Breakpoint at #<procedure lock-mutex (mutex #:optional timeout)>.
scheme@(guile-user)> (test 10000)
Trap 0: Breakpoint at #<procedure lock-mutex (mutex #:optional timeout)>
Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> ,bt
In current input:
      2:0  2 (test 10000)
In ice-9/threads.scm:
    389:6  1 (_ #<procedure 1307800 at ice-9/boot-9.scm:2698:7 ()>)
In current input:
     4:17  0 (lock-mutex #<mutex 1307f40>)
scheme@(guile-user) [1]> 
--8<---------------cut here---------------end--------------->8---

Hopefully this should be enough to figure out what's calling
'lock-mutex' in your case.

One possibility is that your code is accessing the module table too
often.  Does your code frequently call procedures like 'resolve-module',
'module-variable', 'module-ref', or any of the other procedures
documented in section 6.20.8 (Module System Reflection) of the Guile
manual?

> I'm using (ice-9 threads)   I see this:
>
> (define (n-par-for-each n proc . arglists)
>   (let ((m (make-mutex))
>    (threads '()))
>     (do ((i 0 (+ 1 i)))
>    ((= i n)
>     (for-each join-thread threads))
>       (set! threads
>        (cons (begin-thread
>          (let loop ()
>            (lock-mutex m)
>            (if (null? (car arglists))
>           (unlock-mutex m) 
>           (let ((args (map car arglists)))
>             (set! arglists (map cdr arglists))
>             (unlock-mutex m)
>             (apply proc args)
>             (loop)))))
>         threads)))))
>
> Oh, I says to myself: bad bad mutex. Let me write a lock-less loop: it
> chops the list into n pieces, each in its own thread. (a bit sloppy,
> but adequate):
>
> (define (my-for-each n proc args)
>    (define len (length args))
>    (define quo (euclidean-quotient len n))
>    (define rem (euclidean-remainder len n))
>    (define threads '())
>    (do ((i 0 (+ 1 i)))
>       ((= i n) (for-each join-thread threads))
>          (set! threads
>             (cons
>                (begin-thread
>                   (for-each proc (take (drop args (* i quo)) quo)))
>                threads)))
>    (for-each proc (drop args (* n quo)))
> )
>
> Let me go hog-wild: (my-for-each 12 proc list)   (I have a cpu with that many cores) So... what happens? A little better .. not much.  This time, gdb shows that there are four threads in my app.
> Two are stuck here:
>
> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f343ca69bb5 in __GI___pthread_mutex_lock (
>     mutex=mutex@entry=0x7f343d4f0f40 <bytes_until_gc_lock>)
>     at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f343d213e20 in scm_gc_register_allocation (size=size@entry=16)
>     at ../../libguile/gc.c:591

This is the global GC allocation lock, which might be an issue if your
threads are performing a lot of heap allocation.

      Mark



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: guile-2.9.2 and threading
  2019-06-07  4:26 ` Mark H Weaver
@ 2019-06-07  5:01   ` Mark H Weaver
  2019-07-09 20:46   ` Linas Vepstas
  1 sibling, 0 replies; 15+ messages in thread
From: Mark H Weaver @ 2019-06-07  5:01 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Guile User, Guile Development

Mark H Weaver <mhw@netris.org> writes:

>> Two are stuck here:
>>
>> #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
>> #1  0x00007f343ca69bb5 in __GI___pthread_mutex_lock (
>>     mutex=mutex@entry=0x7f343d4f0f40 <bytes_until_gc_lock>)
>>     at ../nptl/pthread_mutex_lock.c:80
>> #2  0x00007f343d213e20 in scm_gc_register_allocation (size=size@entry=16)
>>     at ../../libguile/gc.c:591
>
> This is the global GC allocation lock, which might be an issue if your
> threads are performing a lot of heap allocation.

Sorry, I spoke too quickly here, although what I wrote is not far from
the truth.  'bytes_until_gc_lock' is used by Guile to protect its global
counter of how much plain 'malloc' memory has been allocated since the
last garbage collection.  Once it reaches a certain threshold, a garbage
collection is forced.

Why do we have this?  Sometimes the GC heap contains a large number of
small collectible objects which reference much larger objects in plain
'malloc' memory.  For example, this can happen if you're creating a lot
of bignums, which are small GC objects pointing to digit data stored in
'malloc' memory.  The garbage collector doesn't trigger because there's
plenty of free memory in the GC heap, and it doesn't know that those
small objects are keeping alive much larger objects elsewhere.

However, if this particular lock is a pain point for you, there are
things we could do to improve this.  One idea that comes to mind is to
keep smaller per-thread counters, which are only added to the global
counter after they reach a certain threshold value.  In this case, we
don't need a precise global count.

        Mark

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: guile-2.9.2 and threading
  2019-06-07  4:26 ` Mark H Weaver
  2019-06-07  5:01   ` Mark H Weaver
@ 2019-07-09 20:46   ` Linas Vepstas
  2019-07-14 21:59     ` Now crashing [was " Linas Vepstas
  1 sibling, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-09 20:46 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile User, Guile Development

[-- Attachment #1: Type: text/plain, Size: 2641 bytes --]

Hi Mark,

Sorry for the late reply; my email client mananged to hide your email where
I won't see it. I need to fix this.

On Thu, Jun 6, 2019 at 11:28 PM Mark H Weaver <mhw@netris.org> wrote:

>
> You'll need to look at the stack frames on the Scheme stack.  It can be
> done from GDB if necessary, but it might be sufficient to use Guile's
> debugger.
>
> My first thought was to suggest ",break lock-mutex", but that doesn't
> work, presumably because it's a C primitive (although we should fix
> that), but I was able to patch in a pure Scheme wrapper for it, which
> then allows us to set a Scheme breakpoint:
>
>
I'll poke around and report back. Meanwhile, some orienting remarks.  Since
last email, I've accumulated several  CPU-months running guile-2.9.2 with
zero crashes and zero hangs. So I like it!  For threading, I see three or
four different behaviors or modes; sometimes it works great, sometimes it
doesn't work at all, and I'm still trying to figure out why.

A works-great example:
(par-for-each (lambda (stuff) (... little bit of scheme + CPU-intensive
C++)) some-precomputed-list)

The "CPU-intensive C++" are calls that take at least a millisecond to run,
sometimes seconds.  The above will very happily use all 24 cores and
deliver a 24x speedup over single-threaded.  Yay!

A works-poorly example:
(par-for-each (lambda (stuff) (.. (fold (lambda () numeric addition after
tiny C++)) some-list) list)

Here, "tiny C++" is something that just enters C++ and leaves almost
immediately; it's used to grab and return a numeric value. This runs as if
it were single-threaded, and delivers performance equivalent to being
single-threaded.  I don't know why; this is what I reported in the earlier
emails.

A doesn't-work example: mostly same as "works-poorly", but performance is
worse-than-single-threaded. Sometimes 2x worse. Why, I don't know. (It does
seem to do a LOT of gc; that might account for all of the slowdown; not
sure. These loops iterate over millions/tens-of-millions of items and can
take an hour to complete...)

The worse-than-single-threaded behavior was actually the norm for
guile-2.2; its no longer the norm (yay!). In guile-2.2, there seemed to be
some kind of livelock, where two threads were 1.5x faster than one, three
threads were 1.2x faster than one, and four threads were slower, and
sometimes one-thousand-fold slower! (but still making forward progress,
i.e. not a deadlock) That era seems to be over, yay!

I'll report on the rest, later, when I get a chance (the compute jobs are
hard to manage, and take an hour to set up)

-- Linas

-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 3437 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Now crashing [was Re: guile-2.9.2 and threading
  2019-07-09 20:46   ` Linas Vepstas
@ 2019-07-14 21:59     ` Linas Vepstas
  2019-07-14 22:03       ` Linas Vepstas
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-14 21:59 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile User, Guile Development

[-- Attachment #1: Type: text/plain, Size: 11099 bytes --]

So, here's my next installment on using guile-2.9.2. The first installment
said that I'd piled up CPU-months of guile 2.9.2 experience without any
crashes. Well, now, a different workload crashes in minutes.  Below is a
highly simplified, edited gdb session -- it crashes because it unexpectedly
aborts, during an abort(!) because `get_callee_vcode()` failed. Harrumpf.

Background: there are 140 threads, half in guile, the other half waiting
for guile to finish. Yes, that's too many, but anyways ... 70 threads in
guile and one crashed:

#2  0x00007f85d3f6ecdb in capture_delimited_continuation (
    current_registers=<optimized out>, dynstack=<optimized out>,
    saved_registers=<optimized out>, saved_mra=<optimized out>,
    saved_fp=<optimized out>, vp=<optimized out>) at
../../libguile/vm.c:1327
#3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
    at ../../libguile/vm.c:1454

Both frames are interesting, because libguile/vm.c:1327 shows
  if (SCM_FRAME_DYNAMIC_LINK (base_fp) != saved_fp)
     abort();
hey!? who called this? line 1454 is in the middle of abort_to_prompt ()
Yow! an unexpected abort during an abort...

How did we get here?
#15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
    subr=<optimized out>, message=message@entry=0x1607c9380,
    args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
    at ../../libguile/error.c:90
#16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
    message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
args=0x15af130e0,
    rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
#17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
    at ../../libguile/vm.c:1527

and libguile/vm.c:1527 tells me that get_callee_vcode () is very unhappy.
But why? I cannot tell .. after that, things peter out in boring stack
frames that started with my call scm_c_catch() ... the same seemingly
harmless call that is pending in 70 other threads. (the same call that has
survived several CPU month of pounding with a different collection of
scheme code)

My best guess is that the current workload, by unintentionally launching
gobs of threads is exposing a race condition that has been hithertho
hidden.  I don't know how to debug any further.  I will try a slightly
newer guile shortly, to see if I get lucky.

-- Linas

p.s. here's the whole stack trace. But really, its boring, except for the
above highlights.



(gdb) bt
#0  0x00007f85d38ef428 in __GI_raise (sig=sig@entry=6)
    at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f85d38f102a in __GI_abort () at abort.c:89
#2  0x00007f85d3f6ecdb in capture_delimited_continuation (
    current_registers=<optimized out>, dynstack=<optimized out>,
    saved_registers=<optimized out>, saved_mra=<optimized out>,
    saved_fp=<optimized out>, vp=<optimized out>) at
../../libguile/vm.c:1327
#3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
    at ../../libguile/vm.c:1454
#4  0x00007f85ac539041 in ?? ()
#5  0x00007f85ac37f040 in ?? ()
#6  0x00007f85d41e10c0 in jump_table_ () from
/usr/local/lib/libguile-3.0.so.0
#7  0x000000015f692dc0 in ?? ()
#8  0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
    mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
#9  0x00007f85d3f70600 in vm_debug_engine (thread=0x7f85ac539000)
    at ../../libguile/vm-engine.c:370
#10 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0xe45a20,
argv=<optimized out>,
    nargs=5) at ../../libguile/vm.c:1605
#11 0x00007f85d3eefdcb in scm_apply_0 (proc=0xe45a20, args=0x304)
    at ../../libguile/eval.c:603
#12 0x00007f85d3ef0a0d in scm_apply_1 (proc=<optimized out>,
    arg1=arg1@entry=0xdc5420, args=args@entry=0x15af130a0)
    at ../../libguile/eval.c:609
#13 0x00007f85d3f6c546 in scm_throw (key=key@entry=0xdc5420,
args=0x15af130a0)
    at ../../libguile/throw.c:272
#14 0x00007f85d3f6caf9 in scm_ithrow (key=key@entry=0xdc5420,
args=<optimized out>,
    no_return=no_return@entry=1) at ../../libguile/throw.c:619
#15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
    subr=<optimized out>, message=message@entry=0x1607c9380,
    args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
    at ../../libguile/error.c:90
---Type <return> to continue, or q <return> to quit---
#16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
    message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
args=0x15af130e0,
    rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
#17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
    at ../../libguile/vm.c:1527
#18 0x00007f85b4314805 in ?? ()
#19 0x00007f85b428a000 in ?? ()
#20 0x00007f85d41e10c0 in jump_table_ () from
/usr/local/lib/libguile-3.0.so.0
#21 0x000000015f692dc0 in ?? ()
#22 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
    mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
#23 0x00007f85d3f70600 in vm_debug_engine (thread=0x2)
    at ../../libguile/vm-engine.c:370
#24 0x00007f85d3f76db2 in scm_call_n (proc=<optimized out>,
    argv=argv@entry=0x7f7e85fe2600, nargs=nargs@entry=3) at
../../libguile/vm.c:1605
#25 0x00007f85d3eef97f in scm_call_3 (proc=<optimized out>, arg1=<optimized
out>,
    arg2=<optimized out>, arg3=<optimized out>) at ../../libguile/eval.c:510
#26 0x00007f85d4262b6f in ?? ()
#27 0x00007f85d4262a80 in ?? ()
#28 0x00007f85d41e10c0 in jump_table_ () from
/usr/local/lib/libguile-3.0.so.0
#29 0x000000015f692dc0 in ?? ()
#30 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
    mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
#31 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
    at ../../libguile/vm-engine.c:370
#32 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341ee0,
    argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1605
#33 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341ee0)
    at ../../libguile/eval.c:490
#34 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404, thunk=0x15b341ee0,
    handler=0x15b341ec0, pre_unwind_handler=0x15b341ea0) at
../../libguile/throw.c:146
#35 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler
(key=key@entry=0x404,

---Type <return> to continue, or q <return> to quit---
    thunk=<optimized out>, handler=<optimized out>,
    pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
#36 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404, body=<optimized
out>,
    body_data=<optimized out>,
    handler=handler@entry=0x7f85c95d0f00
<opencog::SchemeEval::catch_handler_wrapper(void*, scm_unused_struct*,
scm_unused_struct*)>,
    handler_data=handler_data@entry=0x7f7e60000980,
    pre_unwind_handler=pre_unwind_handler@entry=0x7f85c95d0c40
<opencog::SchemeEval::preunwind_handler_wrapper(void*, scm_unused_struct*,
scm_unused_struct*)>,
    pre_unwind_handler_data=0x7f7e60000980) at ../../libguile/throw.c:385
#37 0x00007f85c95d122a in opencog::SchemeEval::do_eval
(this=0x7f7e60000980,
    expr="(observe-mpg \"The countess, with her loving heart, felt that her
children were being ruined, that it was not the count's fault for he could
not help being what he was -- that (though he tried to hide "...)
    at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:590
#38 0x00007f85c95d12aa in opencog::SchemeEval::c_wrap_eval
(p=0x7f7e60000980)
    at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:507
#39 0x00007f85d3eeb47a in c_body (d=0x7f7e85fe2d40)
    at ../../libguile/continuations.c:430
#40 0x00007f85d4262b6f in ?? ()
#41 0x00007f85d4262a80 in ?? ()
#42 0x00007f85d41e10c0 in jump_table_ () from
/usr/local/lib/libguile-3.0.so.0
#43 0x000000015f692dc0 in ?? ()
#44 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
    mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
#45 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
    at ../../libguile/vm-engine.c:370
#46 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341fe0,
    argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1605
#47 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341fe0)
    at ../../libguile/eval.c:490
#48 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404, thunk=0x15b341fe0,
---Type <return> to continue, or q <return> to quit---
    handler=0x15b341fc0, pre_unwind_handler=0x15b341fa0) at
../../libguile/throw.c:146
#49 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler
(key=key@entry=0x404,

    thunk=<optimized out>, handler=<optimized out>,
    pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
#50 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
    body=body@entry=0x7f85d3eeb470 <c_body>,
    body_data=body_data@entry=0x7f7e85fe2d40,
    handler=handler@entry=0x7f85d3eeb720 <c_handler>,
    handler_data=handler_data@entry=0x7f7e85fe2d40,
    pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
<pre_unwind_handler>,
    pre_unwind_handler_data=0xe174a0) at ../../libguile/throw.c:385
#51 0x00007f85d3eeb9e3 in scm_i_with_continuation_barrier (
    body=body@entry=0x7f85d3eeb470 <c_body>,
    body_data=body_data@entry=0x7f7e85fe2d40,
    handler=handler@entry=0x7f85d3eeb720 <c_handler>,
    handler_data=handler_data@entry=0x7f7e85fe2d40,
    pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
<pre_unwind_handler>,
    pre_unwind_handler_data=0xe174a0) at ../../libguile/continuations.c:368
#52 0x00007f85d3eebac5 in scm_c_with_continuation_barrier (func=<optimized
out>,
    data=<optimized out>) at ../../libguile/continuations.c:464
#53 0x00007f85d3575127 in GC_call_with_gc_active (
    fn=fn@entry=0x7f85d3f6a070 <with_guile_trampoline>,
    client_data=client_data@entry=0x7f7e85fe2e20) at
../pthread_support.c:1343
#54 0x00007f85d3f6ac4f in with_guile (base=base@entry=0x7f7e85fe2df0,
    data=data@entry=0x7f7e85fe2e20) at ../../libguile/threads.c:683
#55 0x00007f85d356f132 in GC_call_with_stack_base (
    fn=fn@entry=0x7f85d3f6abb0 <with_guile>, arg=arg@entry=0x7f7e85fe2e20)
    at ../misc.c:1941
#56 0x00007f85d3f6aff8 in scm_i_with_guile (dynamic_state=<optimized out>,
    data=0x7f7e60000980,
    func=0x7f85c95d1290 <opencog::SchemeEval::c_wrap_eval(void*)>)
    at ../../libguile/threads.c:698
---Type <return> to continue, or q <return> to quit---
#57 scm_with_guile (
    func=func@entry=0x7f85c95d1290
<opencog::SchemeEval::c_wrap_eval(void*)>,
    data=data@entry=0x7f7e60000980) at ../../libguile/threads.c:704
#58 0x00007f85c95d126e in opencog::SchemeEval::eval_expr
(this=0x7f7e60000980,
    expr=...) at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:479
#59 0x00007f85bc783439 in opencog::GenericShell::eval_loop
(this=0x7f7ef0001e90)
    at /home/ubuntu/src/opencog/opencog/cogserver/shell/GenericShell.cc:588
#60 0x00007f85c6e5ac80 in ?? () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#61 0x00007f85d3c916ba in start_thread (arg=0x7f7e85fe3700) at
pthread_create.c:333
#62 0x00007f85d39c141d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)



-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 12649 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-14 21:59     ` Now crashing [was " Linas Vepstas
@ 2019-07-14 22:03       ` Linas Vepstas
  2019-07-15  3:03         ` Linas Vepstas
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-14 22:03 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile User, Guile Development

[-- Attachment #1: Type: text/plain, Size: 11747 bytes --]

Below was for
guile (GNU Guile) 2.9.2.14-1fb399

--linas

On Sun, Jul 14, 2019 at 4:59 PM Linas Vepstas <linasvepstas@gmail.com>
wrote:

>
> So, here's my next installment on using guile-2.9.2. The first installment
> said that I'd piled up CPU-months of guile 2.9.2 experience without any
> crashes. Well, now, a different workload crashes in minutes.  Below is a
> highly simplified, edited gdb session -- it crashes because it unexpectedly
> aborts, during an abort(!) because `get_callee_vcode()` failed. Harrumpf.
>
> Background: there are 140 threads, half in guile, the other half waiting
> for guile to finish. Yes, that's too many, but anyways ... 70 threads in
> guile and one crashed:
>
> #2  0x00007f85d3f6ecdb in capture_delimited_continuation (
>     current_registers=<optimized out>, dynstack=<optimized out>,
>     saved_registers=<optimized out>, saved_mra=<optimized out>,
>     saved_fp=<optimized out>, vp=<optimized out>) at
> ../../libguile/vm.c:1327
> #3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
>     at ../../libguile/vm.c:1454
>
> Both frames are interesting, because libguile/vm.c:1327 shows
>   if (SCM_FRAME_DYNAMIC_LINK (base_fp) != saved_fp)
>      abort();
> hey!? who called this? line 1454 is in the middle of abort_to_prompt ()
> Yow! an unexpected abort during an abort...
>
> How did we get here?
> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
>     subr=<optimized out>, message=message@entry=0x1607c9380,
>     args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
>     at ../../libguile/error.c:90
> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
>     message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
> args=0x15af130e0,
>     rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
>     at ../../libguile/vm.c:1527
>
> and libguile/vm.c:1527 tells me that get_callee_vcode () is very unhappy.
> But why? I cannot tell .. after that, things peter out in boring stack
> frames that started with my call scm_c_catch() ... the same seemingly
> harmless call that is pending in 70 other threads. (the same call that has
> survived several CPU month of pounding with a different collection of
> scheme code)
>
> My best guess is that the current workload, by unintentionally launching
> gobs of threads is exposing a race condition that has been hithertho
> hidden.  I don't know how to debug any further.  I will try a slightly
> newer guile shortly, to see if I get lucky.
>
> -- Linas
>
> p.s. here's the whole stack trace. But really, its boring, except for the
> above highlights.
>
>
>
> (gdb) bt
> #0  0x00007f85d38ef428 in __GI_raise (sig=sig@entry=6)
>     at ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x00007f85d38f102a in __GI_abort () at abort.c:89
> #2  0x00007f85d3f6ecdb in capture_delimited_continuation (
>     current_registers=<optimized out>, dynstack=<optimized out>,
>     saved_registers=<optimized out>, saved_mra=<optimized out>,
>     saved_fp=<optimized out>, vp=<optimized out>) at
> ../../libguile/vm.c:1327
> #3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
>     at ../../libguile/vm.c:1454
> #4  0x00007f85ac539041 in ?? ()
> #5  0x00007f85ac37f040 in ?? ()
> #6  0x00007f85d41e10c0 in jump_table_ () from
> /usr/local/lib/libguile-3.0.so.0
> #7  0x000000015f692dc0 in ?? ()
> #8  0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
> #9  0x00007f85d3f70600 in vm_debug_engine (thread=0x7f85ac539000)
>     at ../../libguile/vm-engine.c:370
> #10 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0xe45a20,
> argv=<optimized out>,
>     nargs=5) at ../../libguile/vm.c:1605
> #11 0x00007f85d3eefdcb in scm_apply_0 (proc=0xe45a20, args=0x304)
>     at ../../libguile/eval.c:603
> #12 0x00007f85d3ef0a0d in scm_apply_1 (proc=<optimized out>,
>     arg1=arg1@entry=0xdc5420, args=args@entry=0x15af130a0)
>     at ../../libguile/eval.c:609
> #13 0x00007f85d3f6c546 in scm_throw (key=key@entry=0xdc5420,
> args=0x15af130a0)
>     at ../../libguile/throw.c:272
> #14 0x00007f85d3f6caf9 in scm_ithrow (key=key@entry=0xdc5420,
> args=<optimized out>,
>     no_return=no_return@entry=1) at ../../libguile/throw.c:619
> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
>     subr=<optimized out>, message=message@entry=0x1607c9380,
>     args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
>     at ../../libguile/error.c:90
> ---Type <return> to continue, or q <return> to quit---
> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
>     message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
> args=0x15af130e0,
>     rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
>     at ../../libguile/vm.c:1527
> #18 0x00007f85b4314805 in ?? ()
> #19 0x00007f85b428a000 in ?? ()
> #20 0x00007f85d41e10c0 in jump_table_ () from
> /usr/local/lib/libguile-3.0.so.0
> #21 0x000000015f692dc0 in ?? ()
> #22 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
> #23 0x00007f85d3f70600 in vm_debug_engine (thread=0x2)
>     at ../../libguile/vm-engine.c:370
> #24 0x00007f85d3f76db2 in scm_call_n (proc=<optimized out>,
>     argv=argv@entry=0x7f7e85fe2600, nargs=nargs@entry=3) at
> ../../libguile/vm.c:1605
> #25 0x00007f85d3eef97f in scm_call_3 (proc=<optimized out>,
> arg1=<optimized out>,
>     arg2=<optimized out>, arg3=<optimized out>) at
> ../../libguile/eval.c:510
> #26 0x00007f85d4262b6f in ?? ()
> #27 0x00007f85d4262a80 in ?? ()
> #28 0x00007f85d41e10c0 in jump_table_ () from
> /usr/local/lib/libguile-3.0.so.0
> #29 0x000000015f692dc0 in ?? ()
> #30 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
> #31 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
>     at ../../libguile/vm-engine.c:370
> #32 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341ee0,
>     argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1605
> #33 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341ee0)
>     at ../../libguile/eval.c:490
> #34 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404, thunk=0x15b341ee0,
>     handler=0x15b341ec0, pre_unwind_handler=0x15b341ea0) at
> ../../libguile/throw.c:146
> #35 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler (key=key@entry=0x404,
>
> ---Type <return> to continue, or q <return> to quit---
>     thunk=<optimized out>, handler=<optimized out>,
>     pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
> #36 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
> body=<optimized out>,
>     body_data=<optimized out>,
>     handler=handler@entry=0x7f85c95d0f00
> <opencog::SchemeEval::catch_handler_wrapper(void*, scm_unused_struct*,
> scm_unused_struct*)>,
>     handler_data=handler_data@entry=0x7f7e60000980,
>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85c95d0c40
> <opencog::SchemeEval::preunwind_handler_wrapper(void*, scm_unused_struct*,
> scm_unused_struct*)>,
>     pre_unwind_handler_data=0x7f7e60000980) at ../../libguile/throw.c:385
> #37 0x00007f85c95d122a in opencog::SchemeEval::do_eval
> (this=0x7f7e60000980,
>     expr="(observe-mpg \"The countess, with her loving heart, felt that
> her children were being ruined, that it was not the count's fault for he
> could not help being what he was -- that (though he tried to hide "...)
>     at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:590
> #38 0x00007f85c95d12aa in opencog::SchemeEval::c_wrap_eval
> (p=0x7f7e60000980)
>     at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:507
> #39 0x00007f85d3eeb47a in c_body (d=0x7f7e85fe2d40)
>     at ../../libguile/continuations.c:430
> #40 0x00007f85d4262b6f in ?? ()
> #41 0x00007f85d4262a80 in ?? ()
> #42 0x00007f85d41e10c0 in jump_table_ () from
> /usr/local/lib/libguile-3.0.so.0
> #43 0x000000015f692dc0 in ?? ()
> #44 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
> #45 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
>     at ../../libguile/vm-engine.c:370
> #46 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341fe0,
>     argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1605
> #47 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341fe0)
>     at ../../libguile/eval.c:490
> #48 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404, thunk=0x15b341fe0,
> ---Type <return> to continue, or q <return> to quit---
>     handler=0x15b341fc0, pre_unwind_handler=0x15b341fa0) at
> ../../libguile/throw.c:146
> #49 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler (key=key@entry=0x404,
>
>     thunk=<optimized out>, handler=<optimized out>,
>     pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
> #50 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
>     body=body@entry=0x7f85d3eeb470 <c_body>,
>     body_data=body_data@entry=0x7f7e85fe2d40,
>     handler=handler@entry=0x7f85d3eeb720 <c_handler>,
>     handler_data=handler_data@entry=0x7f7e85fe2d40,
>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
> <pre_unwind_handler>,
>     pre_unwind_handler_data=0xe174a0) at ../../libguile/throw.c:385
> #51 0x00007f85d3eeb9e3 in scm_i_with_continuation_barrier (
>     body=body@entry=0x7f85d3eeb470 <c_body>,
>     body_data=body_data@entry=0x7f7e85fe2d40,
>     handler=handler@entry=0x7f85d3eeb720 <c_handler>,
>     handler_data=handler_data@entry=0x7f7e85fe2d40,
>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
> <pre_unwind_handler>,
>     pre_unwind_handler_data=0xe174a0) at ../../libguile/continuations.c:368
> #52 0x00007f85d3eebac5 in scm_c_with_continuation_barrier (func=<optimized
> out>,
>     data=<optimized out>) at ../../libguile/continuations.c:464
> #53 0x00007f85d3575127 in GC_call_with_gc_active (
>     fn=fn@entry=0x7f85d3f6a070 <with_guile_trampoline>,
>     client_data=client_data@entry=0x7f7e85fe2e20) at
> ../pthread_support.c:1343
> #54 0x00007f85d3f6ac4f in with_guile (base=base@entry=0x7f7e85fe2df0,
>     data=data@entry=0x7f7e85fe2e20) at ../../libguile/threads.c:683
> #55 0x00007f85d356f132 in GC_call_with_stack_base (
>     fn=fn@entry=0x7f85d3f6abb0 <with_guile>, arg=arg@entry=0x7f7e85fe2e20)
>     at ../misc.c:1941
> #56 0x00007f85d3f6aff8 in scm_i_with_guile (dynamic_state=<optimized out>,
>     data=0x7f7e60000980,
>     func=0x7f85c95d1290 <opencog::SchemeEval::c_wrap_eval(void*)>)
>     at ../../libguile/threads.c:698
> ---Type <return> to continue, or q <return> to quit---
> #57 scm_with_guile (
>     func=func@entry=0x7f85c95d1290
> <opencog::SchemeEval::c_wrap_eval(void*)>,
>     data=data@entry=0x7f7e60000980) at ../../libguile/threads.c:704
> #58 0x00007f85c95d126e in opencog::SchemeEval::eval_expr
> (this=0x7f7e60000980,
>     expr=...) at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:479
> #59 0x00007f85bc783439 in opencog::GenericShell::eval_loop
> (this=0x7f7ef0001e90)
>     at /home/ubuntu/src/opencog/opencog/cogserver/shell/GenericShell.cc:588
> #60 0x00007f85c6e5ac80 in ?? () from
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #61 0x00007f85d3c916ba in start_thread (arg=0x7f7e85fe3700) at
> pthread_create.c:333
> #62 0x00007f85d39c141d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> (gdb)
>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>


-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 13292 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-14 22:03       ` Linas Vepstas
@ 2019-07-15  3:03         ` Linas Vepstas
  2019-07-17 16:27           ` Linas Vepstas
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-15  3:03 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile User, Guile Development

[-- Attachment #1: Type: text/plain, Size: 13278 bytes --]

Exactly the same crash, same stack trace (slightly different line numbers),
with a fresh pull today:
    commit 89e28df1c9069dcb65188fe7b3973c333d87d7e2
    Author: Andy Wingo <wingo@pobox.com>
    Date:   Thu Jun 20 14:02:05 2019 +0200
which is the current HEAD on master.

FWIW, 60-odd guile threads waiting here:

#0  __lll_lock_wait () at
../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f45ff4fbdbd in __GI___pthread_mutex_lock (mutex=mutex@entry
=0x16e1f68)
    at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f45ff7e085f in scm_c_weak_set_lookup (set=<optimized out>,
    raw_hash=raw_hash@entry=551753256168943069,
    pred=pred@entry=0x7f45ff7d1660 <string_lookup_predicate_fn>,
    closure=closure@entry=0x7f3e44ff7ac0, dflt=dflt@entry=0x4)
    at ../../libguile/weak-set.c:760
#3  0x00007f45ff7d11e9 in lookup_interned_symbol
(raw_hash=551753256168943069,
    name=0x2965ca0) at ../../libguile/symbols.c:112
#4  scm_i_str2symbol (str=0x2965ca0) at ../../libguile/symbols.c:244

The parallelism is low because of this one lock.  This appears to be the
primary bottleneck for my workload.

-- Linas


On Sun, Jul 14, 2019 at 5:03 PM Linas Vepstas <linasvepstas@gmail.com>
wrote:

> Below was for
> guile (GNU Guile) 2.9.2.14-1fb399
>
> --linas
>
> On Sun, Jul 14, 2019 at 4:59 PM Linas Vepstas <linasvepstas@gmail.com>
> wrote:
>
>>
>> So, here's my next installment on using guile-2.9.2. The first
>> installment said that I'd piled up CPU-months of guile 2.9.2 experience
>> without any crashes. Well, now, a different workload crashes in minutes.
>> Below is a highly simplified, edited gdb session -- it crashes because it
>> unexpectedly aborts, during an abort(!) because `get_callee_vcode()`
>> failed. Harrumpf.
>>
>> Background: there are 140 threads, half in guile, the other half waiting
>> for guile to finish. Yes, that's too many, but anyways ... 70 threads in
>> guile and one crashed:
>>
>> #2  0x00007f85d3f6ecdb in capture_delimited_continuation (
>>     current_registers=<optimized out>, dynstack=<optimized out>,
>>     saved_registers=<optimized out>, saved_mra=<optimized out>,
>>     saved_fp=<optimized out>, vp=<optimized out>) at
>> ../../libguile/vm.c:1327
>> #3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
>>     at ../../libguile/vm.c:1454
>>
>> Both frames are interesting, because libguile/vm.c:1327 shows
>>   if (SCM_FRAME_DYNAMIC_LINK (base_fp) != saved_fp)
>>      abort();
>> hey!? who called this? line 1454 is in the middle of abort_to_prompt ()
>> Yow! an unexpected abort during an abort...
>>
>> How did we get here?
>> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
>>     subr=<optimized out>, message=message@entry=0x1607c9380,
>>     args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
>>     at ../../libguile/error.c:90
>> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
>>     message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
>> args=0x15af130e0,
>>     rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
>> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
>>     at ../../libguile/vm.c:1527
>>
>> and libguile/vm.c:1527 tells me that get_callee_vcode () is very unhappy.
>> But why? I cannot tell .. after that, things peter out in boring stack
>> frames that started with my call scm_c_catch() ... the same seemingly
>> harmless call that is pending in 70 other threads. (the same call that has
>> survived several CPU month of pounding with a different collection of
>> scheme code)
>>
>> My best guess is that the current workload, by unintentionally launching
>> gobs of threads is exposing a race condition that has been hithertho
>> hidden.  I don't know how to debug any further.  I will try a slightly
>> newer guile shortly, to see if I get lucky.
>>
>> -- Linas
>>
>> p.s. here's the whole stack trace. But really, its boring, except for the
>> above highlights.
>>
>>
>>
>> (gdb) bt
>> #0  0x00007f85d38ef428 in __GI_raise (sig=sig@entry=6)
>>     at ../sysdeps/unix/sysv/linux/raise.c:54
>> #1  0x00007f85d38f102a in __GI_abort () at abort.c:89
>> #2  0x00007f85d3f6ecdb in capture_delimited_continuation (
>>     current_registers=<optimized out>, dynstack=<optimized out>,
>>     saved_registers=<optimized out>, saved_mra=<optimized out>,
>>     saved_fp=<optimized out>, vp=<optimized out>) at
>> ../../libguile/vm.c:1327
>> #3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
>>     at ../../libguile/vm.c:1454
>> #4  0x00007f85ac539041 in ?? ()
>> #5  0x00007f85ac37f040 in ?? ()
>> #6  0x00007f85d41e10c0 in jump_table_ () from
>> /usr/local/lib/libguile-3.0.so.0
>> #7  0x000000015f692dc0 in ?? ()
>> #8  0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>> #9  0x00007f85d3f70600 in vm_debug_engine (thread=0x7f85ac539000)
>>     at ../../libguile/vm-engine.c:370
>> #10 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0xe45a20,
>> argv=<optimized out>,
>>     nargs=5) at ../../libguile/vm.c:1605
>> #11 0x00007f85d3eefdcb in scm_apply_0 (proc=0xe45a20, args=0x304)
>>     at ../../libguile/eval.c:603
>> #12 0x00007f85d3ef0a0d in scm_apply_1 (proc=<optimized out>,
>>     arg1=arg1@entry=0xdc5420, args=args@entry=0x15af130a0)
>>     at ../../libguile/eval.c:609
>> #13 0x00007f85d3f6c546 in scm_throw (key=key@entry=0xdc5420,
>> args=0x15af130a0)
>>     at ../../libguile/throw.c:272
>> #14 0x00007f85d3f6caf9 in scm_ithrow (key=key@entry=0xdc5420,
>> args=<optimized out>,
>>     no_return=no_return@entry=1) at ../../libguile/throw.c:619
>> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
>>     subr=<optimized out>, message=message@entry=0x1607c9380,
>>     args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
>>     at ../../libguile/error.c:90
>> ---Type <return> to continue, or q <return> to quit---
>> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
>>     message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
>> args=0x15af130e0,
>>     rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
>> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
>>     at ../../libguile/vm.c:1527
>> #18 0x00007f85b4314805 in ?? ()
>> #19 0x00007f85b428a000 in ?? ()
>> #20 0x00007f85d41e10c0 in jump_table_ () from
>> /usr/local/lib/libguile-3.0.so.0
>> #21 0x000000015f692dc0 in ?? ()
>> #22 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>> #23 0x00007f85d3f70600 in vm_debug_engine (thread=0x2)
>>     at ../../libguile/vm-engine.c:370
>> #24 0x00007f85d3f76db2 in scm_call_n (proc=<optimized out>,
>>     argv=argv@entry=0x7f7e85fe2600, nargs=nargs@entry=3) at
>> ../../libguile/vm.c:1605
>> #25 0x00007f85d3eef97f in scm_call_3 (proc=<optimized out>,
>> arg1=<optimized out>,
>>     arg2=<optimized out>, arg3=<optimized out>) at
>> ../../libguile/eval.c:510
>> #26 0x00007f85d4262b6f in ?? ()
>> #27 0x00007f85d4262a80 in ?? ()
>> #28 0x00007f85d41e10c0 in jump_table_ () from
>> /usr/local/lib/libguile-3.0.so.0
>> #29 0x000000015f692dc0 in ?? ()
>> #30 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>> #31 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
>>     at ../../libguile/vm-engine.c:370
>> #32 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341ee0,
>>     argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1605
>> #33 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341ee0)
>>     at ../../libguile/eval.c:490
>> #34 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404, thunk=0x15b341ee0,
>>     handler=0x15b341ec0, pre_unwind_handler=0x15b341ea0) at
>> ../../libguile/throw.c:146
>> #35 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler (key=key@entry=0x404,
>>
>> ---Type <return> to continue, or q <return> to quit---
>>     thunk=<optimized out>, handler=<optimized out>,
>>     pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
>> #36 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
>> body=<optimized out>,
>>     body_data=<optimized out>,
>>     handler=handler@entry=0x7f85c95d0f00
>> <opencog::SchemeEval::catch_handler_wrapper(void*, scm_unused_struct*,
>> scm_unused_struct*)>,
>>     handler_data=handler_data@entry=0x7f7e60000980,
>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85c95d0c40
>> <opencog::SchemeEval::preunwind_handler_wrapper(void*, scm_unused_struct*,
>> scm_unused_struct*)>,
>>     pre_unwind_handler_data=0x7f7e60000980) at ../../libguile/throw.c:385
>> #37 0x00007f85c95d122a in opencog::SchemeEval::do_eval
>> (this=0x7f7e60000980,
>>     expr="(observe-mpg \"The countess, with her loving heart, felt that
>> her children were being ruined, that it was not the count's fault for he
>> could not help being what he was -- that (though he tried to hide "...)
>>     at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:590
>> #38 0x00007f85c95d12aa in opencog::SchemeEval::c_wrap_eval
>> (p=0x7f7e60000980)
>>     at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:507
>> #39 0x00007f85d3eeb47a in c_body (d=0x7f7e85fe2d40)
>>     at ../../libguile/continuations.c:430
>> #40 0x00007f85d4262b6f in ?? ()
>> #41 0x00007f85d4262a80 in ?? ()
>> #42 0x00007f85d41e10c0 in jump_table_ () from
>> /usr/local/lib/libguile-3.0.so.0
>> #43 0x000000015f692dc0 in ?? ()
>> #44 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>> #45 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
>>     at ../../libguile/vm-engine.c:370
>> #46 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341fe0,
>>     argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1605
>> #47 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341fe0)
>>     at ../../libguile/eval.c:490
>> #48 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404, thunk=0x15b341fe0,
>> ---Type <return> to continue, or q <return> to quit---
>>     handler=0x15b341fc0, pre_unwind_handler=0x15b341fa0) at
>> ../../libguile/throw.c:146
>> #49 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler (key=key@entry=0x404,
>>
>>     thunk=<optimized out>, handler=<optimized out>,
>>     pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
>> #50 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
>>     body=body@entry=0x7f85d3eeb470 <c_body>,
>>     body_data=body_data@entry=0x7f7e85fe2d40,
>>     handler=handler@entry=0x7f85d3eeb720 <c_handler>,
>>     handler_data=handler_data@entry=0x7f7e85fe2d40,
>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
>> <pre_unwind_handler>,
>>     pre_unwind_handler_data=0xe174a0) at ../../libguile/throw.c:385
>> #51 0x00007f85d3eeb9e3 in scm_i_with_continuation_barrier (
>>     body=body@entry=0x7f85d3eeb470 <c_body>,
>>     body_data=body_data@entry=0x7f7e85fe2d40,
>>     handler=handler@entry=0x7f85d3eeb720 <c_handler>,
>>     handler_data=handler_data@entry=0x7f7e85fe2d40,
>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
>> <pre_unwind_handler>,
>>     pre_unwind_handler_data=0xe174a0) at
>> ../../libguile/continuations.c:368
>> #52 0x00007f85d3eebac5 in scm_c_with_continuation_barrier
>> (func=<optimized out>,
>>     data=<optimized out>) at ../../libguile/continuations.c:464
>> #53 0x00007f85d3575127 in GC_call_with_gc_active (
>>     fn=fn@entry=0x7f85d3f6a070 <with_guile_trampoline>,
>>     client_data=client_data@entry=0x7f7e85fe2e20) at
>> ../pthread_support.c:1343
>> #54 0x00007f85d3f6ac4f in with_guile (base=base@entry=0x7f7e85fe2df0,
>>     data=data@entry=0x7f7e85fe2e20) at ../../libguile/threads.c:683
>> #55 0x00007f85d356f132 in GC_call_with_stack_base (
>>     fn=fn@entry=0x7f85d3f6abb0 <with_guile>, arg=arg@entry
>> =0x7f7e85fe2e20)
>>     at ../misc.c:1941
>> #56 0x00007f85d3f6aff8 in scm_i_with_guile (dynamic_state=<optimized
>> out>,
>>     data=0x7f7e60000980,
>>     func=0x7f85c95d1290 <opencog::SchemeEval::c_wrap_eval(void*)>)
>>     at ../../libguile/threads.c:698
>> ---Type <return> to continue, or q <return> to quit---
>> #57 scm_with_guile (
>>     func=func@entry=0x7f85c95d1290
>> <opencog::SchemeEval::c_wrap_eval(void*)>,
>>     data=data@entry=0x7f7e60000980) at ../../libguile/threads.c:704
>> #58 0x00007f85c95d126e in opencog::SchemeEval::eval_expr
>> (this=0x7f7e60000980,
>>     expr=...) at
>> /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:479
>> #59 0x00007f85bc783439 in opencog::GenericShell::eval_loop
>> (this=0x7f7ef0001e90)
>>     at
>> /home/ubuntu/src/opencog/opencog/cogserver/shell/GenericShell.cc:588
>> #60 0x00007f85c6e5ac80 in ?? () from
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> #61 0x00007f85d3c916ba in start_thread (arg=0x7f7e85fe3700) at
>> pthread_create.c:333
>> #62 0x00007f85d39c141d in clone () at
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>> (gdb)
>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>


-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 15285 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-15  3:03         ` Linas Vepstas
@ 2019-07-17 16:27           ` Linas Vepstas
  2019-07-17 17:47             ` Mark H Weaver
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-17 16:27 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile User, Guile Development

[-- Attachment #1: Type: text/plain, Size: 15996 bytes --]

Investigating the crash with good-old printf's in libguile/vm.c produces
a vast ocean of prints ... that should have not been printed, and/or should
have been actual errors, but somehow were not handled by scm_error.
Using today's git pull of master, here's the diff containing a printf:

--- a/libguile/vm.c
+++ b/libguile/vm.c
@@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }

       proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
       SCM_FRAME_LOCAL (vp->fp, 0) = proc;
       return SCM_PROGRAM_CODE (proc);
     }

+printf("duuude wrong type to apply!\n"
+"proc=%lx\n"
+"ip=%p\n"
+"sp=%p\n"
+"fp=%p\n"
+"sp_min=%p\n"
+"stack_lim=%p\n",
+SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
+vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
+fflush(stdout);
+
   vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);

   scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
              scm_list_1 (proc), scm_list_1 (proc));
 }

As you can see, shortly after my printf, there should have been an error
report.
There is no error report... until 5-10 minutes later, when the error report
itself
causes a crash.  Before then, I get an endless high-speed spew of prints:

duuude wrong type to apply!
proc=2e
ip=0x7efcbbd5b69c
sp=0x7efcd14589b8
fp=0x7efcd14589c0
sp_min=0x7efcd1458708
stack_lim=0x7efcd1458000
duuude wrong type to apply!
proc=36
ip=0x7efcbbd5b69c
sp=0x7efcd15ffad0
fp=0x7efcd15ffad8
sp_min=0x7efcd15ff528
stack_lim=0x7efcd15ff000
duuude wrong type to apply!
proc=36
ip=0x7efcbbd5b69c
sp=0x7efcd14568a0
fp=0x7efcd14568a8
sp_min=0x7efcd1456318
stack_lim=0x7efcd1456000
duuude wrong type to apply!
proc=2a
ip=0x7efcbbd5b69c
sp=0x7efcd15ffb40
fp=0x7efcd15ffb48
sp_min=0x7efcd15ff3d8
stack_lim=0x7efcd15ff000
duuude wrong type to apply!
proc=26
ip=0x7efcbbd5b69c
sp=0x7efcd15fdc90
fp=0x7efcd15fdc98
sp_min=0x7efcd15fd798
stack_lim=0x7efcd15fd000
duuude wrong type to apply!
proc=22
ip=0x7efcbbd5b69c
sp=0x7efcd152ba98
fp=0x7efcd152baa0
sp_min=0x7efcd152b888
stack_lim=0x7efcd152b000
duuude wrong type to apply!

The addresses all look healthy. I also added a guard-word, looking
for memory corruption, but none was found. Next up: decoding the
SCM by hand, and figuring out why it's there.

-- Linas

On Sun, Jul 14, 2019 at 10:03 PM Linas Vepstas <linasvepstas@gmail.com>
wrote:

> Exactly the same crash, same stack trace (slightly different line numbers),
> with a fresh pull today:
>     commit 89e28df1c9069dcb65188fe7b3973c333d87d7e2
>     Author: Andy Wingo <wingo@pobox.com>
>     Date:   Thu Jun 20 14:02:05 2019 +0200
> which is the current HEAD on master.
>
> FWIW, 60-odd guile threads waiting here:
>
> #0  __lll_lock_wait () at
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
> #1  0x00007f45ff4fbdbd in __GI___pthread_mutex_lock (mutex=mutex@entry
> =0x16e1f68)
>     at ../nptl/pthread_mutex_lock.c:80
> #2  0x00007f45ff7e085f in scm_c_weak_set_lookup (set=<optimized out>,
>     raw_hash=raw_hash@entry=551753256168943069,
>     pred=pred@entry=0x7f45ff7d1660 <string_lookup_predicate_fn>,
>     closure=closure@entry=0x7f3e44ff7ac0, dflt=dflt@entry=0x4)
>     at ../../libguile/weak-set.c:760
> #3  0x00007f45ff7d11e9 in lookup_interned_symbol
> (raw_hash=551753256168943069,
>     name=0x2965ca0) at ../../libguile/symbols.c:112
> #4  scm_i_str2symbol (str=0x2965ca0) at ../../libguile/symbols.c:244
>
> The parallelism is low because of this one lock.  This appears to be the
> primary bottleneck for my workload.
>
> -- Linas
>
>
> On Sun, Jul 14, 2019 at 5:03 PM Linas Vepstas <linasvepstas@gmail.com>
> wrote:
>
>> Below was for
>> guile (GNU Guile) 2.9.2.14-1fb399
>>
>> --linas
>>
>> On Sun, Jul 14, 2019 at 4:59 PM Linas Vepstas <linasvepstas@gmail.com>
>> wrote:
>>
>>>
>>> So, here's my next installment on using guile-2.9.2. The first
>>> installment said that I'd piled up CPU-months of guile 2.9.2 experience
>>> without any crashes. Well, now, a different workload crashes in minutes.
>>> Below is a highly simplified, edited gdb session -- it crashes because it
>>> unexpectedly aborts, during an abort(!) because `get_callee_vcode()`
>>> failed. Harrumpf.
>>>
>>> Background: there are 140 threads, half in guile, the other half waiting
>>> for guile to finish. Yes, that's too many, but anyways ... 70 threads in
>>> guile and one crashed:
>>>
>>> #2  0x00007f85d3f6ecdb in capture_delimited_continuation (
>>>     current_registers=<optimized out>, dynstack=<optimized out>,
>>>     saved_registers=<optimized out>, saved_mra=<optimized out>,
>>>     saved_fp=<optimized out>, vp=<optimized out>) at
>>> ../../libguile/vm.c:1327
>>> #3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
>>>     at ../../libguile/vm.c:1454
>>>
>>> Both frames are interesting, because libguile/vm.c:1327 shows
>>>   if (SCM_FRAME_DYNAMIC_LINK (base_fp) != saved_fp)
>>>      abort();
>>> hey!? who called this? line 1454 is in the middle of abort_to_prompt ()
>>> Yow! an unexpected abort during an abort...
>>>
>>> How did we get here?
>>> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
>>>     subr=<optimized out>, message=message@entry=0x1607c9380,
>>>     args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
>>>     at ../../libguile/error.c:90
>>> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
>>>     message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
>>> args=0x15af130e0,
>>>     rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
>>> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
>>>     at ../../libguile/vm.c:1527
>>>
>>> and libguile/vm.c:1527 tells me that get_callee_vcode () is very
>>> unhappy. But why? I cannot tell .. after that, things peter out in boring
>>> stack frames that started with my call scm_c_catch() ... the same seemingly
>>> harmless call that is pending in 70 other threads. (the same call that has
>>> survived several CPU month of pounding with a different collection of
>>> scheme code)
>>>
>>> My best guess is that the current workload, by unintentionally launching
>>> gobs of threads is exposing a race condition that has been hithertho
>>> hidden.  I don't know how to debug any further.  I will try a slightly
>>> newer guile shortly, to see if I get lucky.
>>>
>>> -- Linas
>>>
>>> p.s. here's the whole stack trace. But really, its boring, except for
>>> the above highlights.
>>>
>>>
>>>
>>> (gdb) bt
>>> #0  0x00007f85d38ef428 in __GI_raise (sig=sig@entry=6)
>>>     at ../sysdeps/unix/sysv/linux/raise.c:54
>>> #1  0x00007f85d38f102a in __GI_abort () at abort.c:89
>>> #2  0x00007f85d3f6ecdb in capture_delimited_continuation (
>>>     current_registers=<optimized out>, dynstack=<optimized out>,
>>>     saved_registers=<optimized out>, saved_mra=<optimized out>,
>>>     saved_fp=<optimized out>, vp=<optimized out>) at
>>> ../../libguile/vm.c:1327
>>> #3  abort_to_prompt (thread=0x15f692dc0, saved_mra=<optimized out>)
>>>     at ../../libguile/vm.c:1454
>>> #4  0x00007f85ac539041 in ?? ()
>>> #5  0x00007f85ac37f040 in ?? ()
>>> #6  0x00007f85d41e10c0 in jump_table_ () from
>>> /usr/local/lib/libguile-3.0.so.0
>>> #7  0x000000015f692dc0 in ?? ()
>>> #8  0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>>> #9  0x00007f85d3f70600 in vm_debug_engine (thread=0x7f85ac539000)
>>>     at ../../libguile/vm-engine.c:370
>>> #10 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0xe45a20,
>>> argv=<optimized out>,
>>>     nargs=5) at ../../libguile/vm.c:1605
>>> #11 0x00007f85d3eefdcb in scm_apply_0 (proc=0xe45a20, args=0x304)
>>>     at ../../libguile/eval.c:603
>>> #12 0x00007f85d3ef0a0d in scm_apply_1 (proc=<optimized out>,
>>>     arg1=arg1@entry=0xdc5420, args=args@entry=0x15af130a0)
>>>     at ../../libguile/eval.c:609
>>> #13 0x00007f85d3f6c546 in scm_throw (key=key@entry=0xdc5420,
>>> args=0x15af130a0)
>>>     at ../../libguile/throw.c:272
>>> #14 0x00007f85d3f6caf9 in scm_ithrow (key=key@entry=0xdc5420,
>>> args=<optimized out>,
>>>     no_return=no_return@entry=1) at ../../libguile/throw.c:619
>>> #15 0x00007f85d3eedeb5 in scm_error_scm (key=key@entry=0xdc5420,
>>>     subr=<optimized out>, message=message@entry=0x1607c9380,
>>>     args=args@entry=0x15af130e0, data=data@entry=0x15af130f0)
>>>     at ../../libguile/error.c:90
>>> ---Type <return> to continue, or q <return> to quit---
>>> #16 0x00007f85d3eedf4f in scm_error (key=0xdc5420, subr=subr@entry=0x0,
>>>     message=message@entry=0x7f85d3fa228c "Wrong type to apply: ~S",
>>> args=0x15af130e0,
>>>     rest=rest@entry=0x15af130f0) at ../../libguile/error.c:62
>>> #17 0x00007f85d3f6f913 in get_callee_vcode (thread=0x15f692dc0)
>>>     at ../../libguile/vm.c:1527
>>> #18 0x00007f85b4314805 in ?? ()
>>> #19 0x00007f85b428a000 in ?? ()
>>> #20 0x00007f85d41e10c0 in jump_table_ () from
>>> /usr/local/lib/libguile-3.0.so.0
>>> #21 0x000000015f692dc0 in ?? ()
>>> #22 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>>> #23 0x00007f85d3f70600 in vm_debug_engine (thread=0x2)
>>>     at ../../libguile/vm-engine.c:370
>>> #24 0x00007f85d3f76db2 in scm_call_n (proc=<optimized out>,
>>>     argv=argv@entry=0x7f7e85fe2600, nargs=nargs@entry=3) at
>>> ../../libguile/vm.c:1605
>>> #25 0x00007f85d3eef97f in scm_call_3 (proc=<optimized out>,
>>> arg1=<optimized out>,
>>>     arg2=<optimized out>, arg3=<optimized out>) at
>>> ../../libguile/eval.c:510
>>> #26 0x00007f85d4262b6f in ?? ()
>>> #27 0x00007f85d4262a80 in ?? ()
>>> #28 0x00007f85d41e10c0 in jump_table_ () from
>>> /usr/local/lib/libguile-3.0.so.0
>>> #29 0x000000015f692dc0 in ?? ()
>>> #30 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>>> #31 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
>>>     at ../../libguile/vm-engine.c:370
>>> #32 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341ee0,
>>>     argv=argv@entry=0x0, nargs=nargs@entry=0) at
>>> ../../libguile/vm.c:1605
>>> #33 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341ee0)
>>>     at ../../libguile/eval.c:490
>>> #34 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404,
>>> thunk=0x15b341ee0,
>>>     handler=0x15b341ec0, pre_unwind_handler=0x15b341ea0) at
>>> ../../libguile/throw.c:146
>>> #35 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler
>>> (key=key@entry=0x404,
>>> ---Type <return> to continue, or q <return> to quit---
>>>     thunk=<optimized out>, handler=<optimized out>,
>>>     pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
>>> #36 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
>>> body=<optimized out>,
>>>     body_data=<optimized out>,
>>>     handler=handler@entry=0x7f85c95d0f00
>>> <opencog::SchemeEval::catch_handler_wrapper(void*, scm_unused_struct*,
>>> scm_unused_struct*)>,
>>>     handler_data=handler_data@entry=0x7f7e60000980,
>>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85c95d0c40
>>> <opencog::SchemeEval::preunwind_handler_wrapper(void*, scm_unused_struct*,
>>> scm_unused_struct*)>,
>>>     pre_unwind_handler_data=0x7f7e60000980) at ../../libguile/throw.c:385
>>> #37 0x00007f85c95d122a in opencog::SchemeEval::do_eval
>>> (this=0x7f7e60000980,
>>>     expr="(observe-mpg \"The countess, with her loving heart, felt that
>>> her children were being ruined, that it was not the count's fault for he
>>> could not help being what he was -- that (though he tried to hide "...)
>>>     at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:590
>>> #38 0x00007f85c95d12aa in opencog::SchemeEval::c_wrap_eval
>>> (p=0x7f7e60000980)
>>>     at /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:507
>>> #39 0x00007f85d3eeb47a in c_body (d=0x7f7e85fe2d40)
>>>     at ../../libguile/continuations.c:430
>>> #40 0x00007f85d4262b6f in ?? ()
>>> #41 0x00007f85d4262a80 in ?? ()
>>> #42 0x00007f85d41e10c0 in jump_table_ () from
>>> /usr/local/lib/libguile-3.0.so.0
>>> #43 0x000000015f692dc0 in ?? ()
>>> #44 0x00007f85d3f19581 in scm_jit_enter_mcode (thread=0x15f692dc0,
>>>     mcode=0x15f692dc0 "\200\205\267_\001") at ../../libguile/jit.c:4796
>>> #45 0x00007f85d3f70600 in vm_debug_engine (thread=0x304)
>>>     at ../../libguile/vm-engine.c:370
>>> #46 0x00007f85d3f76db2 in scm_call_n (proc=proc@entry=0x15b341fe0,
>>>     argv=argv@entry=0x0, nargs=nargs@entry=0) at
>>> ../../libguile/vm.c:1605
>>> #47 0x00007f85d3eef8d9 in scm_call_0 (proc=proc@entry=0x15b341fe0)
>>>     at ../../libguile/eval.c:490
>>> #48 0x00007f85d3f6c1aa in catch (tag=tag@entry=0x404,
>>> thunk=0x15b341fe0,
>>> ---Type <return> to continue, or q <return> to quit---
>>>     handler=0x15b341fc0, pre_unwind_handler=0x15b341fa0) at
>>> ../../libguile/throw.c:146
>>> #49 0x00007f85d3f6c505 in scm_catch_with_pre_unwind_handler
>>> (key=key@entry=0x404,
>>>     thunk=<optimized out>, handler=<optimized out>,
>>>     pre_unwind_handler=<optimized out>) at ../../libguile/throw.c:260
>>> #50 0x00007f85d3f6c6bf in scm_c_catch (tag=tag@entry=0x404,
>>>     body=body@entry=0x7f85d3eeb470 <c_body>,
>>>     body_data=body_data@entry=0x7f7e85fe2d40,
>>>     handler=handler@entry=0x7f85d3eeb720 <c_handler>,
>>>     handler_data=handler_data@entry=0x7f7e85fe2d40,
>>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
>>> <pre_unwind_handler>,
>>>     pre_unwind_handler_data=0xe174a0) at ../../libguile/throw.c:385
>>> #51 0x00007f85d3eeb9e3 in scm_i_with_continuation_barrier (
>>>     body=body@entry=0x7f85d3eeb470 <c_body>,
>>>     body_data=body_data@entry=0x7f7e85fe2d40,
>>>     handler=handler@entry=0x7f85d3eeb720 <c_handler>,
>>>     handler_data=handler_data@entry=0x7f7e85fe2d40,
>>>     pre_unwind_handler=pre_unwind_handler@entry=0x7f85d3eeb580
>>> <pre_unwind_handler>,
>>>     pre_unwind_handler_data=0xe174a0) at
>>> ../../libguile/continuations.c:368
>>> #52 0x00007f85d3eebac5 in scm_c_with_continuation_barrier
>>> (func=<optimized out>,
>>>     data=<optimized out>) at ../../libguile/continuations.c:464
>>> #53 0x00007f85d3575127 in GC_call_with_gc_active (
>>>     fn=fn@entry=0x7f85d3f6a070 <with_guile_trampoline>,
>>>     client_data=client_data@entry=0x7f7e85fe2e20) at
>>> ../pthread_support.c:1343
>>> #54 0x00007f85d3f6ac4f in with_guile (base=base@entry=0x7f7e85fe2df0,
>>>     data=data@entry=0x7f7e85fe2e20) at ../../libguile/threads.c:683
>>> #55 0x00007f85d356f132 in GC_call_with_stack_base (
>>>     fn=fn@entry=0x7f85d3f6abb0 <with_guile>, arg=arg@entry
>>> =0x7f7e85fe2e20)
>>>     at ../misc.c:1941
>>> #56 0x00007f85d3f6aff8 in scm_i_with_guile (dynamic_state=<optimized
>>> out>,
>>>     data=0x7f7e60000980,
>>>     func=0x7f85c95d1290 <opencog::SchemeEval::c_wrap_eval(void*)>)
>>>     at ../../libguile/threads.c:698
>>> ---Type <return> to continue, or q <return> to quit---
>>> #57 scm_with_guile (
>>>     func=func@entry=0x7f85c95d1290
>>> <opencog::SchemeEval::c_wrap_eval(void*)>,
>>>     data=data@entry=0x7f7e60000980) at ../../libguile/threads.c:704
>>> #58 0x00007f85c95d126e in opencog::SchemeEval::eval_expr
>>> (this=0x7f7e60000980,
>>>     expr=...) at
>>> /home/ubuntu/src/atomspace/opencog/guile/SchemeEval.cc:479
>>> #59 0x00007f85bc783439 in opencog::GenericShell::eval_loop
>>> (this=0x7f7ef0001e90)
>>>     at
>>> /home/ubuntu/src/opencog/opencog/cogserver/shell/GenericShell.cc:588
>>> #60 0x00007f85c6e5ac80 in ?? () from
>>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #61 0x00007f85d3c916ba in start_thread (arg=0x7f7e85fe3700) at
>>> pthread_create.c:333
>>> #62 0x00007f85d39c141d in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
>>> (gdb)
>>>
>>>
>>>
>>> --
>>> cassette tapes - analog TV - film cameras - you
>>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>


-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 18736 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-17 16:27           ` Linas Vepstas
@ 2019-07-17 17:47             ` Mark H Weaver
  2019-07-17 21:44               ` Linas Vepstas
  0 siblings, 1 reply; 15+ messages in thread
From: Mark H Weaver @ 2019-07-17 17:47 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Guile User, Guile Development

Hi Linas,

> Investigating the crash with good-old printf's in libguile/vm.c produces
> a vast ocean of prints ... that should have not been printed, and/or should
> have been actual errors, but somehow were not handled by scm_error.
> Using today's git pull of master, here's the diff containing a printf:
>
> --- a/libguile/vm.c
> +++ b/libguile/vm.c
> @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }
>
>        proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
>        SCM_FRAME_LOCAL (vp->fp, 0) = proc;
>        return SCM_PROGRAM_CODE (proc);
>      }
>
> +printf("duuude wrong type to apply!\n"
> +"proc=%lx\n"
> +"ip=%p\n"
> +"sp=%p\n"
> +"fp=%p\n"
> +"sp_min=%p\n"
> +"stack_lim=%p\n",
> +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
> +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
> +fflush(stdout);
> +
>    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>
>    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>               scm_list_1 (proc), scm_list_1 (proc));
>  }
>
> As you can see, shortly after my printf, there should have been an
> error report.

Not necessarily.  Note that what 'scm_error' actually does is to raise
an exception.  What happens next depends on what exception handlers are
installed at the time of the error.

> There is no error report... until 5-10 minutes later, when the error
> report itself causes a crash.  Before then, I get an endless
> high-speed spew of prints:

It looks like another error is happening within the exception handler.

       Mark

PS: It would be good to pick either 'guile-devel' or 'guile-user' for
    continuation of this thread.  I don't see a reason why it should be
    sent to both lists.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-17 17:47             ` Mark H Weaver
@ 2019-07-17 21:44               ` Linas Vepstas
  2019-07-18  1:42                 ` Linas Vepstas
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-17 21:44 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile Development

[-- Attachment #1: Type: text/plain, Size: 3716 bytes --]

On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver <mhw@netris.org> wrote:

> Hi Linas,
>
> > Investigating the crash with good-old printf's in libguile/vm.c produces
> > a vast ocean of prints ... that should have not been printed, and/or
> should
> > have been actual errors, but somehow were not handled by scm_error.
> > Using today's git pull of master, here's the diff containing a printf:
> >
> > --- a/libguile/vm.c
> > +++ b/libguile/vm.c
> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }
> >
> >        proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
> >        SCM_FRAME_LOCAL (vp->fp, 0) = proc;
> >        return SCM_PROGRAM_CODE (proc);
> >      }
> >
> > +printf("duuude wrong type to apply!\n"
> > +"proc=%lx\n"
> > +"ip=%p\n"
> > +"sp=%p\n"
> > +"fp=%p\n"
> > +"sp_min=%p\n"
> > +"stack_lim=%p\n",
> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
> > +fflush(stdout);
> > +
> >    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
> >
> >    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
> >               scm_list_1 (proc), scm_list_1 (proc));
> >  }
> >
> > As you can see, shortly after my printf, there should have been an
> > error report.
>
> Not necessarily.  Note that what 'scm_error' actually does is to raise
> an exception.  What happens next depends on what exception handlers are
> installed at the time of the error.
>

OK, but... when I look at what get_callee_vcode() actually does, it seems
to be earnestly trying to fish out the location of a callable function from
the
frame pointer, and it does so three plausible ways. If those three don't
work
out, then it sets the instruction pointer (to the garbage value), followed
by
scm_error(Wrong type to apply). This also looks like an earnest, honest
attempt to report a real error.  But lets double-check.

So who calls get_callee_vcode(), and why, and what did they expect to
happen?
Well, that's in three places: one in scm_call_n which is a plausible place
where
one might expect the instruction pointer to be set to a valid value. Then
there's two
places in vm-engine.c -- "tail-call" and "call" both of which one might
plausibly expect
to have a valid instruction pointer.  I can't imagine any valid scenario
where anyone
was expecting get_callee_vcode() to actually fail in the normal course of
operations.

That is, I can't think of any valid reason why anyone would want to suppress
the scm_error().  And even if I could -- calling scm_error() hundreds of
times
per second, as fast as possible, does not seem like efficient coding for
dealing
with a call to an invalid address.

Anyway I'm trying to track down where the invalid value gets set. No luck
so far.
There are 6 or 8 places in vm-engine.c where the frame pointer is set to
something
that isn't a pointer (which seems like cheating to me: passing non-pointer
values
in something called "pointer" is .. well, knee jerk reaction is that it's
not wise, but
there may be a deeper reason.)

>
> > There is no error report... until 5-10 minutes later, when the error
> > report itself causes a crash.  Before then, I get an endless
> > high-speed spew of prints:
>
> It looks like another error is happening within the exception handler.
>

Well, yes, that also. But given that the instruction pointer contains
garbage
its perhaps not entirely surprising... at best, the question is, why didn't
it fail
sooner?

-- Linas

>
>        Mark
>
> PS: It would be good to pick either 'guile-devel' or 'guile-user' for
>     continuation of this thread.  I don't see a reason why it should be
>     sent to both lists.
>

-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 5285 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-17 21:44               ` Linas Vepstas
@ 2019-07-18  1:42                 ` Linas Vepstas
  2019-07-18  3:52                   ` Linas Vepstas
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-18  1:42 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile Development

[-- Attachment #1: Type: text/plain, Size: 5370 bytes --]

Seem to be narrowing it down ... or at least, I have more details ...

On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas <linasvepstas@gmail.com>
wrote:

>
>
> On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver <mhw@netris.org> wrote:
>
>> Hi Linas,
>>
>> > Investigating the crash with good-old printf's in libguile/vm.c produces
>> > a vast ocean of prints ... that should have not been printed, and/or
>> should
>> > have been actual errors, but somehow were not handled by scm_error.
>> > Using today's git pull of master, here's the diff containing a printf:
>> >
>> > --- a/libguile/vm.c
>> > +++ b/libguile/vm.c
>> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }
>> >
>> >        proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
>> >        SCM_FRAME_LOCAL (vp->fp, 0) = proc;
>> >        return SCM_PROGRAM_CODE (proc);
>> >      }
>> >
>> > +printf("duuude wrong type to apply!\n"
>> > +"proc=%lx\n"
>> > +"ip=%p\n"
>> > +"sp=%p\n"
>> > +"fp=%p\n"
>> > +"sp_min=%p\n"
>> > +"stack_lim=%p\n",
>> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
>> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
>> > +fflush(stdout);
>> > +
>> >    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>> >
>> >    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>> >               scm_list_1 (proc), scm_list_1 (proc));
>> >  }
>> >
>> > As you can see, shortly after my printf, there should have been an
>> > error report.
>>
>> Not necessarily.  Note that what 'scm_error' actually does is to raise
>> an exception.  What happens next depends on what exception handlers are
>> installed at the time of the error.
>>
>
> OK, but... when I look at what get_callee_vcode() actually does, it seems
> to be earnestly trying to fish out the location of a callable function
> from the
> frame pointer, and it does so three plausible ways. If those three don't
> work
> out, then it sets the instruction pointer (to the garbage value), followed
> by
> scm_error(Wrong type to apply). This also looks like an earnest, honest
> attempt to report a real error.  But lets double-check.
>
> So who calls get_callee_vcode(), and why, and what did they expect to
> happen?
> Well, that's in three places: one in scm_call_n which is a plausible place
> where
> one might expect the instruction pointer to be set to a valid value. Then
> there's two
> places in vm-engine.c -- "tail-call" and "call" both of which one might
> plausibly expect
> to have a valid instruction pointer.  I can't imagine any valid scenario
> where anyone
> was expecting get_callee_vcode() to actually fail in the normal course of
> operations.
>

There is one more place where  get_callee_vcode() can get called -- via the
jump_table,
via a call to scm_jit_enter_mcode()  which issues the code emitted by
emit_get_callee_vcode

There are four calls to scm_jit_enter_mcode()  The one that immediately
preceeds
the bug is always the one made here, in vm-engine.c:
VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32))

Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and
SCM_FRAME_LOCAL(VP->fp, 0),
and they are... fp=0x7fffe000caf8 fpslot=d33b00 (typical)

the mcode is of course some bytecode that bounces through lightning, and a
few insns
later, it arrives at get_callee_vcode() but now  the fp is different, (it
changes by 0x20,
always) and the slot is different:  fp=0x7fffe000cad8  and
SCM_FRAME_LOCAL(fp,0)
is 0x32 and the 0x32 triggers the scm_error(). (because 0x32 is not any of
SCM_PROGRAM_P or SCM_STRUCTP or a smob)

(but also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or
a smob, either... so something got computed along the way ... )

That's what I've got so far. Its highly reproducible.  Quick to happen.
I'm not sure
what to do next. I guess I need to examine emit_get_callee_vcode() and see
what
it does, and why.   Any comments, suggestions would be useful.

-- Linas


> That is, I can't think of any valid reason why anyone would want to
> suppress
> the scm_error().  And even if I could -- calling scm_error() hundreds of
> times
> per second, as fast as possible, does not seem like efficient coding for
> dealing
> with a call to an invalid address.
>
> Anyway I'm trying to track down where the invalid value gets set. No luck
> so far.
> There are 6 or 8 places in vm-engine.c where the frame pointer is set to
> something
> that isn't a pointer (which seems like cheating to me: passing non-pointer
> values
> in something called "pointer" is .. well, knee jerk reaction is that it's
> not wise, but
> there may be a deeper reason.)
>
>
>>
>> > There is no error report... until 5-10 minutes later, when the error
>> > report itself causes a crash.  Before then, I get an endless
>> > high-speed spew of prints:
>>
>> It looks like another error is happening within the exception handler.
>>
>
> Well, yes, that also. But given that the instruction pointer contains
> garbage
> its perhaps not entirely surprising... at best, the question is, why
> didn't it fail
> sooner?
>
> -- Linas
>
>>
>>        Mark
>>
>> PS: It would be good to pick either 'guile-devel' or 'guile-user' for
>>     continuation of this thread.  I don't see a reason why it should be
>>     sent to both lists.
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>


-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 7792 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-18  1:42                 ` Linas Vepstas
@ 2019-07-18  3:52                   ` Linas Vepstas
  2019-07-21 21:10                     ` Linas Vepstas
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-18  3:52 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile Development

[-- Attachment #1: Type: text/plain, Size: 7706 bytes --]

Oh, I get it.  I think the bug is this:  VM_DEFINE_OP (7,
return_values,...)
finds some mcode, and calls it.  What it found was the emit_get_callee_vcode
but it is totally pointless to call this mcode, since we're returning, and
not
calling. So its just not useful.

Worse, it gets called with garbage values, which are then silenced by
ignoring
the resulting  scm_error, and everything appears to run smoothly ... for a
while.
Until some later time, (millions of calls later), when there is a
completely unrelated
race condition that causes the scm_error to get tangled and die.  The ideal
solution would be simply to not call the mcode for get_callee; that would
save
time and trouble.

That's my hypothesis. I tried to test a mock-up of this solution with the
patch
below, but it is too simplistic t actually work (null pointer-deref.)  I
con't find
a beter solution

If you've got a better idea, let me know...

-- Linas

--- a/libguile/vm-engine.c
+++ b/libguile/vm-engine.c
@@ -553,6 +553,7 @@ VM_NAME (scm_thread *thread)
           mcode = SCM_FRAME_MACHINE_RETURN_ADDRESS (old_fp);
           if (mcode && mcode != scm_jit_return_to_interpreter_trampoline)
             {
+              VP->unused = 1;
               scm_jit_enter_mcode (thread, mcode);
               CACHE_REGISTER ();
               NEXT (0);
diff --git a/libguile/vm.c b/libguile/vm.c
index d7b1788..8e178c7 100644
--- a/libguile/vm.c
+++ b/libguile/vm.c
@@ -620,6 +620,7 @@ scm_i_vm_prepare_stack (struct scm_vm *vp)
   vp->compare_result = SCM_F_COMPARE_NONE;
   vp->engine = vm_default_engine;
   vp->trace_level = 0;
+  vp->unused = 0;
 #define INIT_HOOK(h) vp->h##_hook = SCM_BOOL_F;
   FOR_EACH_HOOK (INIT_HOOK)
 #undef INIT_HOOK
@@ -1515,6 +1516,7 @@ get_callee_vcode (scm_thread *thread)

   vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);

+  if (vp->unused) { vp->unused = 0; return 0; }
   scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
              scm_list_1 (proc), scm_list_1 (proc));
 }

On Wed, Jul 17, 2019 at 8:42 PM Linas Vepstas <linasvepstas@gmail.com>
wrote:

> Seem to be narrowing it down ... or at least, I have more details ...
>
> On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas <linasvepstas@gmail.com>
> wrote:
>
>>
>>
>> On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver <mhw@netris.org> wrote:
>>
>>> Hi Linas,
>>>
>>> > Investigating the crash with good-old printf's in libguile/vm.c
>>> produces
>>> > a vast ocean of prints ... that should have not been printed, and/or
>>> should
>>> > have been actual errors, but somehow were not handled by scm_error.
>>> > Using today's git pull of master, here's the diff containing a printf:
>>> >
>>> > --- a/libguile/vm.c
>>> > +++ b/libguile/vm.c
>>> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }
>>> >
>>> >        proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
>>> >        SCM_FRAME_LOCAL (vp->fp, 0) = proc;
>>> >        return SCM_PROGRAM_CODE (proc);
>>> >      }
>>> >
>>> > +printf("duuude wrong type to apply!\n"
>>> > +"proc=%lx\n"
>>> > +"ip=%p\n"
>>> > +"sp=%p\n"
>>> > +"fp=%p\n"
>>> > +"sp_min=%p\n"
>>> > +"stack_lim=%p\n",
>>> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
>>> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
>>> > +fflush(stdout);
>>> > +
>>> >    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>>> >
>>> >    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>>> >               scm_list_1 (proc), scm_list_1 (proc));
>>> >  }
>>> >
>>> > As you can see, shortly after my printf, there should have been an
>>> > error report.
>>>
>>> Not necessarily.  Note that what 'scm_error' actually does is to raise
>>> an exception.  What happens next depends on what exception handlers are
>>> installed at the time of the error.
>>>
>>
>> OK, but... when I look at what get_callee_vcode() actually does, it seems
>> to be earnestly trying to fish out the location of a callable function
>> from the
>> frame pointer, and it does so three plausible ways. If those three don't
>> work
>> out, then it sets the instruction pointer (to the garbage value),
>> followed by
>> scm_error(Wrong type to apply). This also looks like an earnest, honest
>> attempt to report a real error.  But lets double-check.
>>
>> So who calls get_callee_vcode(), and why, and what did they expect to
>> happen?
>> Well, that's in three places: one in scm_call_n which is a plausible
>> place where
>> one might expect the instruction pointer to be set to a valid value. Then
>> there's two
>> places in vm-engine.c -- "tail-call" and "call" both of which one might
>> plausibly expect
>> to have a valid instruction pointer.  I can't imagine any valid scenario
>> where anyone
>> was expecting get_callee_vcode() to actually fail in the normal course of
>> operations.
>>
>
> There is one more place where  get_callee_vcode() can get called -- via
> the jump_table,
> via a call to scm_jit_enter_mcode()  which issues the code emitted by
> emit_get_callee_vcode
>
> There are four calls to scm_jit_enter_mcode()  The one that immediately
> preceeds
> the bug is always the one made here, in vm-engine.c:
> VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32))
>
> Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and
> SCM_FRAME_LOCAL(VP->fp, 0),
> and they are... fp=0x7fffe000caf8 fpslot=d33b00 (typical)
>
> the mcode is of course some bytecode that bounces through lightning, and a
> few insns
> later, it arrives at get_callee_vcode() but now  the fp is different, (it
> changes by 0x20,
> always) and the slot is different:  fp=0x7fffe000cad8  and
> SCM_FRAME_LOCAL(fp,0)
> is 0x32 and the 0x32 triggers the scm_error(). (because 0x32 is not any of
> SCM_PROGRAM_P or SCM_STRUCTP or a smob)
>
> (but also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or
> a smob, either... so something got computed along the way ... )
>
> That's what I've got so far. Its highly reproducible.  Quick to happen.
> I'm not sure
> what to do next. I guess I need to examine emit_get_callee_vcode() and see
> what
> it does, and why.   Any comments, suggestions would be useful.
>
> -- Linas
>
>
>> That is, I can't think of any valid reason why anyone would want to
>> suppress
>> the scm_error().  And even if I could -- calling scm_error() hundreds of
>> times
>> per second, as fast as possible, does not seem like efficient coding for
>> dealing
>> with a call to an invalid address.
>>
>> Anyway I'm trying to track down where the invalid value gets set. No luck
>> so far.
>> There are 6 or 8 places in vm-engine.c where the frame pointer is set to
>> something
>> that isn't a pointer (which seems like cheating to me: passing
>> non-pointer values
>> in something called "pointer" is .. well, knee jerk reaction is that it's
>> not wise, but
>> there may be a deeper reason.)
>>
>>
>>>
>>> > There is no error report... until 5-10 minutes later, when the error
>>> > report itself causes a crash.  Before then, I get an endless
>>> > high-speed spew of prints:
>>>
>>> It looks like another error is happening within the exception handler.
>>>
>>
>> Well, yes, that also. But given that the instruction pointer contains
>> garbage
>> its perhaps not entirely surprising... at best, the question is, why
>> didn't it fail
>> sooner?
>>
>> -- Linas
>>
>>>
>>>        Mark
>>>
>>> PS: It would be good to pick either 'guile-devel' or 'guile-user' for
>>>     continuation of this thread.  I don't see a reason why it should be
>>>     sent to both lists.
>>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>


-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 10891 bytes --]

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-18  3:52                   ` Linas Vepstas
@ 2019-07-21 21:10                     ` Linas Vepstas
  2019-08-05 18:07                       ` Mark H Weaver
  0 siblings, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2019-07-21 21:10 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile Development

[-- Attachment #1: Type: text/plain, Size: 8918 bytes --]

How utterly embarrassing.  Please ignore most of this verbose and difficult
email chain. Yes, guile-2.9.2 is still crashing, but almost all of my
analysis was wrong. Turns out that my scheme code was calling `(10)` i.e.
taking an integer, and treating it as a function, and attempting to call
it. So the call to `scm_error` was exactly right. It was invisible to me
because ... it was ignored in my code.

However -- if one does call `scm_error` fairly rapidly, from multiple
threads, one will eventually hit a race condition and get a crash.  I'm not
sure how to create a mini-test-case for this within guile; my code is
creating threads outside of guile, and launching `scm_eval` in each (and
ignoring the resulting error).  This was leading to a crash after 5-10
minutes.

-- Linas

On Wed, Jul 17, 2019 at 10:52 PM Linas Vepstas <linasvepstas@gmail.com>
wrote:

> Oh, I get it.  I think the bug is this:  VM_DEFINE_OP (7,
> return_values,...)
> finds some mcode, and calls it.  What it found was the
> emit_get_callee_vcode
> but it is totally pointless to call this mcode, since we're returning, and
> not
> calling. So its just not useful.
>
> Worse, it gets called with garbage values, which are then silenced by
> ignoring
> the resulting  scm_error, and everything appears to run smoothly ... for a
> while.
> Until some later time, (millions of calls later), when there is a
> completely unrelated
> race condition that causes the scm_error to get tangled and die.  The
> ideal
> solution would be simply to not call the mcode for get_callee; that would
> save
> time and trouble.
>
> That's my hypothesis. I tried to test a mock-up of this solution with the
> patch
> below, but it is too simplistic t actually work (null pointer-deref.)  I
> con't find
> a beter solution
>
> If you've got a better idea, let me know...
>
> -- Linas
>
> --- a/libguile/vm-engine.c
> +++ b/libguile/vm-engine.c
> @@ -553,6 +553,7 @@ VM_NAME (scm_thread *thread)
>            mcode = SCM_FRAME_MACHINE_RETURN_ADDRESS (old_fp);
>            if (mcode && mcode != scm_jit_return_to_interpreter_trampoline)
>              {
> +              VP->unused = 1;
>                scm_jit_enter_mcode (thread, mcode);
>                CACHE_REGISTER ();
>                NEXT (0);
> diff --git a/libguile/vm.c b/libguile/vm.c
> index d7b1788..8e178c7 100644
> --- a/libguile/vm.c
> +++ b/libguile/vm.c
> @@ -620,6 +620,7 @@ scm_i_vm_prepare_stack (struct scm_vm *vp)
>    vp->compare_result = SCM_F_COMPARE_NONE;
>    vp->engine = vm_default_engine;
>    vp->trace_level = 0;
> +  vp->unused = 0;
>  #define INIT_HOOK(h) vp->h##_hook = SCM_BOOL_F;
>    FOR_EACH_HOOK (INIT_HOOK)
>  #undef INIT_HOOK
> @@ -1515,6 +1516,7 @@ get_callee_vcode (scm_thread *thread)
>
>    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>
> +  if (vp->unused) { vp->unused = 0; return 0; }
>    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>               scm_list_1 (proc), scm_list_1 (proc));
>  }
>
> On Wed, Jul 17, 2019 at 8:42 PM Linas Vepstas <linasvepstas@gmail.com>
> wrote:
>
>> Seem to be narrowing it down ... or at least, I have more details ...
>>
>> On Wed, Jul 17, 2019 at 4:44 PM Linas Vepstas <linasvepstas@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jul 17, 2019 at 12:49 PM Mark H Weaver <mhw@netris.org> wrote:
>>>
>>>> Hi Linas,
>>>>
>>>> > Investigating the crash with good-old printf's in libguile/vm.c
>>>> produces
>>>> > a vast ocean of prints ... that should have not been printed, and/or
>>>> should
>>>> > have been actual errors, but somehow were not handled by scm_error.
>>>> > Using today's git pull of master, here's the diff containing a printf:
>>>> >
>>>> > --- a/libguile/vm.c
>>>> > +++ b/libguile/vm.c
>>>> > @@ -1514,12 +1514,23 @@ thread->guard); fflush(stdout); assert (0); }
>>>> >
>>>> >        proc = SCM_SMOB_DESCRIPTOR (proc).apply_trampoline;
>>>> >        SCM_FRAME_LOCAL (vp->fp, 0) = proc;
>>>> >        return SCM_PROGRAM_CODE (proc);
>>>> >      }
>>>> >
>>>> > +printf("duuude wrong type to apply!\n"
>>>> > +"proc=%lx\n"
>>>> > +"ip=%p\n"
>>>> > +"sp=%p\n"
>>>> > +"fp=%p\n"
>>>> > +"sp_min=%p\n"
>>>> > +"stack_lim=%p\n",
>>>> > +SCM_FRAME_SLOT(vp->fp, 0)->as_u64,
>>>> > +vp->ip, vp->sp, vp->fp, vp->sp_min_since_gc, vp->stack_limit);
>>>> > +fflush(stdout);
>>>> > +
>>>> >    vp->ip = SCM_FRAME_VIRTUAL_RETURN_ADDRESS (vp->fp);
>>>> >
>>>> >    scm_error (scm_arg_type_key, NULL, "Wrong type to apply: ~S",
>>>> >               scm_list_1 (proc), scm_list_1 (proc));
>>>> >  }
>>>> >
>>>> > As you can see, shortly after my printf, there should have been an
>>>> > error report.
>>>>
>>>> Not necessarily.  Note that what 'scm_error' actually does is to raise
>>>> an exception.  What happens next depends on what exception handlers are
>>>> installed at the time of the error.
>>>>
>>>
>>> OK, but... when I look at what get_callee_vcode() actually does, it seems
>>> to be earnestly trying to fish out the location of a callable function
>>> from the
>>> frame pointer, and it does so three plausible ways. If those three don't
>>> work
>>> out, then it sets the instruction pointer (to the garbage value),
>>> followed by
>>> scm_error(Wrong type to apply). This also looks like an earnest, honest
>>> attempt to report a real error.  But lets double-check.
>>>
>>> So who calls get_callee_vcode(), and why, and what did they expect to
>>> happen?
>>> Well, that's in three places: one in scm_call_n which is a plausible
>>> place where
>>> one might expect the instruction pointer to be set to a valid value.
>>> Then there's two
>>> places in vm-engine.c -- "tail-call" and "call" both of which one might
>>> plausibly expect
>>> to have a valid instruction pointer.  I can't imagine any valid scenario
>>> where anyone
>>> was expecting get_callee_vcode() to actually fail in the normal course
>>> of operations.
>>>
>>
>> There is one more place where  get_callee_vcode() can get called -- via
>> the jump_table,
>> via a call to scm_jit_enter_mcode()  which issues the code emitted by
>> emit_get_callee_vcode
>>
>> There are four calls to scm_jit_enter_mcode()  The one that immediately
>> preceeds
>> the bug is always the one made here, in vm-engine.c:
>> VM_DEFINE_OP (7, return_values, "return-values", OP1 (X32))
>>
>> Right before the call to scm_jit_enter_mcode(), I can printf VP->fp and
>> SCM_FRAME_LOCAL(VP->fp, 0),
>> and they are... fp=0x7fffe000caf8 fpslot=d33b00 (typical)
>>
>> the mcode is of course some bytecode that bounces through lightning, and
>> a few insns
>> later, it arrives at get_callee_vcode() but now  the fp is different, (it
>> changes by 0x20,
>> always) and the slot is different:  fp=0x7fffe000cad8  and
>> SCM_FRAME_LOCAL(fp,0)
>> is 0x32 and the 0x32 triggers the scm_error(). (because 0x32 is not any
>> of
>> SCM_PROGRAM_P or SCM_STRUCTP or a smob)
>>
>> (but also, the fpslot=d33b00 is never a SCM_PROGRAM_P or SCM_STRUCTP or
>> a smob, either... so something got computed along the way ... )
>>
>> That's what I've got so far. Its highly reproducible.  Quick to happen.
>> I'm not sure
>> what to do next. I guess I need to examine emit_get_callee_vcode() and
>> see what
>> it does, and why.   Any comments, suggestions would be useful.
>>
>> -- Linas
>>
>>
>>> That is, I can't think of any valid reason why anyone would want to
>>> suppress
>>> the scm_error().  And even if I could -- calling scm_error() hundreds of
>>> times
>>> per second, as fast as possible, does not seem like efficient coding for
>>> dealing
>>> with a call to an invalid address.
>>>
>>> Anyway I'm trying to track down where the invalid value gets set. No
>>> luck so far.
>>> There are 6 or 8 places in vm-engine.c where the frame pointer is set to
>>> something
>>> that isn't a pointer (which seems like cheating to me: passing
>>> non-pointer values
>>> in something called "pointer" is .. well, knee jerk reaction is that
>>> it's not wise, but
>>> there may be a deeper reason.)
>>>
>>>
>>>>
>>>> > There is no error report... until 5-10 minutes later, when the error
>>>> > report itself causes a crash.  Before then, I get an endless
>>>> > high-speed spew of prints:
>>>>
>>>> It looks like another error is happening within the exception handler.
>>>>
>>>
>>> Well, yes, that also. But given that the instruction pointer contains
>>> garbage
>>> its perhaps not entirely surprising... at best, the question is, why
>>> didn't it fail
>>> sooner?
>>>
>>> -- Linas
>>>
>>>>
>>>>        Mark
>>>>
>>>> PS: It would be good to pick either 'guile-devel' or 'guile-user' for
>>>>     continuation of this thread.  I don't see a reason why it should be
>>>>     sent to both lists.
>>>>
>>>
>>>
>>> --
>>> cassette tapes - analog TV - film cameras - you
>>>
>>
>>
>> --
>> cassette tapes - analog TV - film cameras - you
>>
>
>
> --
> cassette tapes - analog TV - film cameras - you
>


-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 12359 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-07-21 21:10                     ` Linas Vepstas
@ 2019-08-05 18:07                       ` Mark H Weaver
  2019-08-07 16:05                         ` Linas Vepstas
  0 siblings, 1 reply; 15+ messages in thread
From: Mark H Weaver @ 2019-08-05 18:07 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Guile Development

Hi Linas,

Linas Vepstas <linasvepstas@gmail.com> writes:
> However -- if one does call `scm_error` fairly rapidly, from multiple
> threads, one will eventually hit a race condition and get a crash.

If you could produce a small, self-contained example demonstrating this,
it would enable us to investigate further.

      Thanks,
        Mark



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Now crashing [was Re: guile-2.9.2 and threading
  2019-08-05 18:07                       ` Mark H Weaver
@ 2019-08-07 16:05                         ` Linas Vepstas
  0 siblings, 0 replies; 15+ messages in thread
From: Linas Vepstas @ 2019-08-07 16:05 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Guile Development

[-- Attachment #1: Type: text/plain, Size: 623 bytes --]

I will ponder how to do that, it will take at least several weeks; I'll
file it through the normal guile bug report system. -- linas

On Mon, Aug 5, 2019 at 1:07 PM Mark H Weaver <mhw@netris.org> wrote:

> Hi Linas,
>
> Linas Vepstas <linasvepstas@gmail.com> writes:
> > However -- if one does call `scm_error` fairly rapidly, from multiple
> > threads, one will eventually hit a race condition and get a crash.
>
> If you could produce a small, self-contained example demonstrating this,
> it would enable us to investigate further.
>
>       Thanks,
>         Mark
>


-- 
cassette tapes - analog TV - film cameras - you

[-- Attachment #2: Type: text/html, Size: 1108 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2019-08-07 16:05 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-02 23:25 guile-2.9.2 and threading Linas Vepstas
2019-06-07  4:26 ` Mark H Weaver
2019-06-07  5:01   ` Mark H Weaver
2019-07-09 20:46   ` Linas Vepstas
2019-07-14 21:59     ` Now crashing [was " Linas Vepstas
2019-07-14 22:03       ` Linas Vepstas
2019-07-15  3:03         ` Linas Vepstas
2019-07-17 16:27           ` Linas Vepstas
2019-07-17 17:47             ` Mark H Weaver
2019-07-17 21:44               ` Linas Vepstas
2019-07-18  1:42                 ` Linas Vepstas
2019-07-18  3:52                   ` Linas Vepstas
2019-07-21 21:10                     ` Linas Vepstas
2019-08-05 18:07                       ` Mark H Weaver
2019-08-07 16:05                         ` Linas Vepstas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).