I'm trying to understand how scm_jit_enter_mcode leads to scm_timed_lock_mutex. I want to know who is attempting to lock, and why, and how to work around this.

Background: par-for-each works poorly for my app, and I want to understand why. Specifically, how par-for-each and n-par-for-each work in guile-2.9.2. My app is a guile/c++ mixture. I've got a list of 137078 items and a proc I want to apply. Single-threaded, it takes 530 seconds, so about 137078/530 = 260 items/second. That's the baseline.

So: (n-par-for-each 6 proc list), and I see 230% cpu use in top (i.e. two+ active threads). Why only two? It's always about the same: n-par-for-each 8, n-par-for-each 12, whatever. gdb shows that all but two threads are stuck here:

```
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f343d27bb95 in scm_pthread_cond_wait (cond=<optimized out>, mutex=<optimized out>) at ../../libguile/threads.c:1615
#2 0x00007f343d27bd8b in block_self (queue=0x55ef64391cd0, mutex=mutex@entry=0x55ef63d21fc0, waittime=waittime@entry=0x0) at ../../libguile/threads.c:316
#3 0x00007f343d27bedf in lock_mutex (current_thread=0x55f0a3d42c60, waittime=0x0, m=0x55ef63d21fc0, kind=SCM_MUTEX_STANDARD) at ../../libguile/threads.c:1031
#4 scm_timed_lock_mutex (mutex=0x55ef64391cc0, timeout=<optimized out>) at ../../libguile/threads.c:1092
#5 0x00007f343d56582a in ?? ()
#6 0x00007f34104bdf90 in ?? ()
#7 0x00007f343d4f00a0 in jump_table_ () from /usr/local/lib/libguile-3.0.so.0
#8 0x000055ef6347b3a8 in ?? ()
#9 0x00007f343d22bf61 in scm_jit_enter_mcode (thread=0x55f0a3d42840, mcode=0x55f0a3d42840 "\240)ԣ\360U") at ../../libguile/jit.c:4819
#10 0x00007f343d28089c in vm_debug_engine (thread=0x55f0a3d42840) at ../../libguile/vm-engine.c:370
#11 0x00007f343d28707a in scm_call_n (proc=proc@entry=0x55eff19c9f00, argv=argv@entry=0x0, nargs=nargs@entry=0) at ../../libguile/vm.c:1605
```

The other two threads are happily doing things in my app. I'm using (ice-9 threads), and in it I see this:

```
(define (n-par-for-each n proc . arglists)
  (let ((m (make-mutex))
        (threads '()))
    (do ((i 0 (+ 1 i)))
        ((= i n)
         (for-each join-thread threads))
      (set! threads
            (cons (begin-thread
                   (let loop ()
                     (lock-mutex m)
                     (if (null? (car arglists))
                         (unlock-mutex m)
                         (let ((args (map car arglists)))
                           (set! arglists (map cdr arglists))
                           (unlock-mutex m)
                           (apply proc args)
                           (loop)))))
                  threads)))))
```

Oh, I says to myself: bad, bad mutex. Let me write a lock-less loop: it chops the list into n pieces, each in its own thread (a bit sloppy, but adequate):

```
(define (my-for-each n proc args)
  (define len (length args))
  (define quo (euclidean-quotient len n))
  (define rem (euclidean-remainder len n)) ; leftover items, handled below
  (define threads '())
  ;; Each thread gets its own contiguous chunk of quo items.
  (do ((i 0 (+ 1 i)))
      ((= i n) (for-each join-thread threads))
    (set! threads
          (cons (begin-thread
                 (for-each proc (take (drop args (* i quo)) quo)))
                threads)))
  ;; The remainder runs in the calling thread, after the join.
  (for-each proc (drop args (* n quo))))
```

Let me go hog-wild: (my-for-each 12 proc list), since I have a cpu with that many cores. So... what happens? A little better, but not much. This time, gdb shows that there are four threads active in my app. Two are stuck here:

```
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f343ca69bb5 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f343d4f0f40) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f343d213e20 in scm_gc_register_allocation (size=size@entry=16) at ../../libguile/gc.c:591
```

All the others are stuck in the first stack trace.
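A side note on that last trace: it shows scm_gc_register_allocation taking a pthread mutex just to account for a 16-byte allocation, so allocation-heavy procs would serialize on GC bookkeeping no matter what I do at the Scheme level. To put a number on that theory, I could wrap a run in something like this (a sketch, untested; it assumes the 'gc-time-taken entry of (gc-stats) is in internal time units, which might vary across Guile versions):

```
(define (with-gc-report thunk)
  ;; Print wall-clock time next to time spent inside GC.
  (let* ((t0  (get-internal-real-time))
         (gc0 (assq-ref (gc-stats) 'gc-time-taken))  ; assumed key
         (result (thunk))
         (t1  (get-internal-real-time))
         (gc1 (assq-ref (gc-stats) 'gc-time-taken)))
    (format #t "wall: ~a secs   gc: ~a secs~%"
            (exact->inexact (/ (- t1 t0) internal-time-units-per-second))
            (exact->inexact (/ (- gc1 gc0) internal-time-units-per-second)))
    result))

;; e.g. (with-gc-report (lambda () (my-for-each 12 proc big-list)))
```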
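And since the mutex in n-par-for-each is the other suspect, a further experiment would be to hand out work with no mutex at all, using compare-and-swap from (ice-9 atomic). A minimal sketch, untested; n-par-atomic is my own name, not anything in ice-9, and it assumes the item list contains no #f entries:

```
(use-modules (ice-9 atomic) (ice-9 threads))

(define (n-par-atomic n proc lst)
  (define box (make-atomic-box lst))
  ;; Atomically pop the head of the shared list; #f means empty.
  (define (grab!)
    (let loop ((old (atomic-box-ref box)))
      (if (null? old)
          #f
          (let ((seen (atomic-box-compare-and-swap! box old (cdr old))))
            (if (eq? seen old)
                (car old)        ; CAS won: this item is ours
                (loop seen)))))) ; lost the race; retry from the new head
  (let ((threads (map (lambda (i)
                        (begin-thread
                         (let work ()
                           (let ((item (grab!)))
                             (when item
                               (proc item)
                               (work))))))
                      (iota n))))
    (for-each join-thread threads)))
```

If the stalls persist even with this, that would point the finger at the allocator/GC rather than at any Scheme-level locking.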
Anyway: the average speedup over one thread is 1.5x (345 seconds vs 525 seconds), and it burns about 250% CPU (according to top) to achieve it. Why aren't there 12 busy threads in my app? my-for-each is lockless, so where is the scm_timed_lock_mutex ... scm_jit_enter_mcode coming from?

Using par-for-each results in times that are identical to the single-threaded times. Identical: no speedup, no slow-down. Toy sniff tests show that par-for-each really does run in parallel, so no clue why it's not any faster.

FWIW, guile-2.2 behaves much worse. Running two threads, things got 1.5x faster. Running four threads, things ran at half the single-threaded speed. Running six threads was maybe 10x or 20x slower than single-threaded. Clearly, a really bad live-lock situation.

I want to get out from under this. I'm trying to do big data, and this is a bottleneck.

--linas

--
cassette tapes - analog TV - film cameras - you