Hi everyone, Thanks for your comments and patience. I've attached a new version of my SRFI-18 patch which I hope addresses the stuff that Ludovic raised. Sorry it's taken so long! In real life, I've been hung up on freeing myself from the entanglements of one full-time job and getting wrapped up in those of another. At any rate: > My comments below are mostly cosmetic. Once you're done with them, > could you please provide a ChangeLog entry? Of course. Please find relevant updates included in the revised patch. > Then we'll need to update the doc of the core API, and also add > documentation of the SRFI module itself (the custom has been to somewhat > duplicate SRFIs in the manual so that we have all documentation in one > place). Sure, I'll do that -- I just wanted to make sure the scope of the patch was well-established before I wrote anything. > `threads.test' doesn't need to be updated because threads don't throw > exceptions, right? Right, sort of: doing 'join-thread' on a thread that has been canceled via the SRFI-18 'terminate-thread!' function will throw a thread-terminated-exception (terminate-thread! installs a special cleanup handler right before exiting that does this). But everything related to cancellation in threads.test uses cancel-thread, which does not cause an exception to be thrown, and none of the other tests involve exception handling. > I don't quite get this one. Could you illustrate the problem > step-by-step with a simple scenario? > > The value of HANDLER in `scm_spawn_thread ()' doesn't seem to affect > critical-section locking. Maybe I should have said the problem lies in really_spawn() -- if data->handler is non-NULL, then really_spawn() enters the thread's body via scm_internal_catch() instead of directly. scm_internal_catch() calls scm_c_catch(), which calls make_jmpbuf(), which enters a critical section, leading to the deadlock. Here's the sequence of events that I was experiencing (surprisingly often): Thread 1, in guile-mode (heap mutex locked) launches Thread 2 Thread 2 enters a critical section, locking the critical section mutex Thread 1, as part of expression evaluation in eval.i.c, attempts to lock the critical section and blocks Thread 2, as part of make_jmpbuf, calls SCM_NEWSMOB2, leading to a call to scm_double_cell, which causes a GC. Thread 2 attempts to lock the heap mutex of all executing threads, but Thread 1's heap mutex will never be unlocked I've subsequently discovered and fixed a separate (quasi-) deadlock, also-preexisting, this time related to a lack of thread safety surrounding the usage of scm_thread_go_to_sleep -- in short, there can be a race on the value of scm_thread_go_to_sleep such that a thread can continue to enter and leave guile-mode even while scm_i_thread_put_to_sleep is trying to put it to sleep for GC. The fix is to require that threads hold the thread_admin_mutex while locking their heap_mutexes in scm_enter_guile, so that they don't inadvertantly re-enter guile-mode during a GC attempt. I can give you an example if you'd like. > Don't do that, we must keep all years. Fixed. > In `block_self ()': > >> @@ -239,6 +265,7 @@ >> err = EINTR; >> t->block_asyncs--; >> scm_i_reset_sleep (t); >> + scm_i_pthread_cleanup_pop (0); >> } > > Why is it needed? I'm not sure what you mean. In a literal sense, I pop the cleanup handler because I installed it a few lines earlier. More generally, the reason I added the handler was because scm_i_pthread_cond_wait is a cancellation point, and scm_i_setup_sleep has just been called -- if the thread is canceled while in cond_wait, the sleep state will never be reset. This may not be a critical issue (or it may be -- I added this code while I was chasing down an unrelated deadlock), but it makes for better thread bookkeeping. >> +extern scm_i_thread *scm_i_signal_delivery_thread; > > Could it go in one of `{threads,scmsigs}.h'? Yes -- I'd meant to include that in the original patch. Sorry! It's in scmsigs.h now. >> +typedef struct { > > Make sure to follow the GCS, i.e., put the opening brace on a line of > its own. Fixed for all occurences. >> - /* Ensure the signal handling thread has been launched, because we might be >> - shutting it down. */ >> - scm_i_ensure_signal_delivery_thread (); >> - >> /* Unblocking the joining threads needs to happen in guile mode >> since the queue is a SCM data structure. */ >> scm_with_guile (do_thread_exit, v); >> >> + /* Ensure the signal handling thread has been launched, because we might be >> + shutting it down. */ >> + scm_i_ensure_signal_delivery_thread (); >> + > > Why does that need to be moved? It doesn't -- thanks for catching that. I was trying to resolve a deadlock related to the one described above that I ultimately fixed by removing the catch handler for the signal delivery thread. > +SCM_DEFINE (scm_join_thread_timed, "join-thread", 1, 2, 0, > > Scheme/C name mismatch. I believe it effectively hides the first > `join-thread', right? Yes, I put it there to override the binding of join-thread. Is that a problem? I do it several other places, too. I wasn't sure how else to maintain binary compatibility for users of scm_join_thread while extending the functionality of the Scheme 'join-thread' function. Would it be better if I removed the first SCM_DEFINE (the one that refers to scm_join_thread()) and replaced it with a normal function declaration for scm_join_thread()? >> +"Suspend execution of the calling thread until the target @var{thread} " > > Indenting would be nicer. You mean indenting the doc strings so that they line up with the parentheses? None of the other functions do this. (And is this documentation actually used to generate anything?) > Remember spaces after periods. :-) Fixed for all occurences. > Open bracket right after `block_self'. Fixed for all occurences. > Too bad we have yet another time API... > ... and another exception API, too. Yeah... such is the nature of SRFIs, though. Anything you want me to change here -- other than the exception handling described below? > We have `*unspecified*' in core Guile, which is better because it > doesn't have any side-effect, so better use it. Ah, excellent -- didn't know that was there! Fixed. (Can we put that in the manual index?) > I'd use pairs or records for exception objects rather than just symbols > since symbols can always be forged. So we'd have, e.g.: I hadn't thought of that! Done, though it took a while. SRFI-18 exception throwing and handling are now pass-thrus to SRFI-34. There was a bit of difficulty in that Guile's implementation of SRFI-34 causes exceptions to get rethrown at the end of the handler, whereas SRFI-18 requires that the handler re-enter the continuation of the called primitive afterwards. I've resolved that by wrapping any installed user handlers such that they store and then apply their caller's continuation. > Type-checking overhead may become a concern if we are to convert values > often, e.g., once after every timed-join trial. OTOH, it's good to have > good error reporting. So... do you have a ruling, one way or the other? For what it's worth, my opinion is that API exposed to the user must always feature input-checking, but I defer to your maintainer wisdom. Regards, Julian