Hi We sporadically get "mutex not locked" and "mutex not locked by current thread" exceptions on Solaris 10u10 with guile-2.0.11. This problem can be reproduced with following scheme scripts: guile-fmoc-test-mnl.scm (for "mutex not locked") Two threads, one of them (reader-000) waits on a condition variable that nothing will trigger and the other thread (writer-000) locks and unlocks the mutex used with the condition variable. This code causes "mutex not locked" exception with some consistency. Output for this is in guile-fmoc-test-mnl-problem_output.txt (referenced as mnl-problem_output.txt) * owner id for reader-000 is 12593872 * owner id for writer-000 is 12595040 guile-fmoc-test-mnlbct.scm (for "mutex not locked by current thread") Same as guile-fmoc-test-mnl.scm except here writer-000 signals condition variable. This code causes "mutex not locked" and "mutex not locked by current thread" errors. Output for this, showing "mutex not locked by current thread", is in guile-fmoc-test-mnlbct-problem_output.txt (referenced as mnlbct- problem_output.txt) * owner id for reader-000 is 14535648 * owner id for writer-000 is 14536400 To track down this issue we have added some debug printfs (see guile-2.0.11- with-debug.patch). Given that this changes the line numbers I have referenced original line numbers as o:file:line and line numbers with patch as d[:file]: line. Also, these printfs have resulted in some irrelevant output (for internal and verbose logging mutexes) which has been filtered out. There is various scenarios leading to these errors that we have found but all caused by same problem. The a detailed analysis for "mutex not locked by current thread" scenario that can be seen in mnlbct-problem_output.txt is included below and detailed analysis for other scenarios will be shared if required. Scenario from mnlbct-problem_output.txt: 1. [writer-000:14536400] unlocks fat_mutex[14512880] and queues reader-000: 14535648 at o:threads.c:1664 - d:1681 before mnlbct-problem_output.txt:4079 2. [reader-000:14535648] locks fat_mutex[14512880] with fat_mutex_lock at o:threads.c:1394 - d:1401 before mnlbct-problem_output.txt:4080 3. [reader-000:14535648] enters wait-condition-variable and changes fat_mutex [14512880].owner to writer-000:14536400 at o:threads.c:1616 - d:1631 before mnlbct-problem_output.txt:4083 4. [reader-000:14535648] goes into block_self and starts waiting on ptheead condition variable inside fat_mutex_unlock > block_self. This unlocks fat_mutex[14512880].mutex, allowing some other thread to lock fat_mutex [14512880] at o:threads.c:452 - d:456 before mnlbct-problem_output.txt:4083 5. [writer-000:14536400] locks fat_mutex[14512880] with fat_mutex_lock at o:threads.c:1394 - d:1401 before mnlbct-problem_output.txt:4082 6. [reader-000:14535648] spurious wake-up occurs for condition variable which causes block_self to return EINTR to fat_mutex_unlock at o:threads.c:1621 - d:1636 before mnlbct-problem_output.txt:4084 7. [reader-000:14535648] loops and sets fat_mutex[14512880].owner=4 (i.e. not locked) while writer-000:14536400 should still be owner of fat_mutex [14512880]. Since it was spurious wake-up reader-000:14535648 continues to wait for condition to be notified again. at o:threads.c:1616 - d:1631 before mnlbct-problem_output.txt:4086 8. [writer-000:14536400] completes signal-condition-variable before mnlbct-problem_output.txt:4088 9. [reader-000:14535648] now gets actual notification and block_self returns 0. This causes fat_mutex:14512880 to be locked again - which works cos fat_mutex:14512880.owner is 4. This changes fat_mutex:14512880.owner from 4 to 14535648. at o:threads.c:1643 - d:1660 before mnlbct-problem_output.txt:4086 10. [writer-000:14536400] tries to unlock the mutex, this fails though as reader-000:14535648 now owns the mutex - resulting in "mutex not locked by current thread" exception. at o:threads.c:1599 - d:1614 before mnlbct-problem_output.txt:4089 Briefly, for mnl-problem_output.txt: 1. reader-000 locks fat_mutex and unlocks it again as it starts waiting for condition to be notified. 2. writer-000 locks mutex 3. Spurious wake-up occurs for reader-000 which causes reader to change fat_mutex.owner from writer-000 id to 4 and then resumes waiting on condition variable 4. Writer tries to unlock fat_mutex but now owner is 4 and this results in "mutex not locked" exception The cause of these problems seems to be related to fat_mutex_unlock changing fat_mutex.owner inside the while loop that is intended for checking condition variable predicate which is problematic if spurious wake-ups from pthread_cond_wait occur. Spurious wake-ups from pthread_cond_wait seems less common on Linux, which is why we have only been observing the issue on Solaris. It does however look like this problem will occur on any platform when a spurious wake-up does occur. As far as we can tell there is no reason for the fat_mutex.owner assignment to happen inside the loop. It seems more appropriate that this happens only once before the loop and not again. To this extent we moved owner reassignment out of the loop and this seems to have resolved our issues. The patch for this is in guile-2.0.11-with-fmoc_fix.patch. We have ran the test suite with this on Linux and everything passes. There is however other issues with test suite on Solaris that prevents it from completing (both with and without the patch) which needs further investigation. * All files related to this can be found at https://gitlab.com/concurrent-systems/osp-issues-1512/tree/master/guile-fmoc * Source with guile-2.0.11-with-debug.patch can be found at https://gitlab.com/concurrent-systems/guile/tree/v2.0.11-with-debug * Source with guile-2.0.11-with-fmoc_fix.patch can be found at https://gitlab.com/concurrent-systems/guile/tree/v2.0.11-with-fmoc_fix Regards -- Iwan Aucamp