Re: Hang in srfi-18.test

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

From: "Neil Jerram" <neiljerram@googlemail.com>
To: guile-devel <guile-devel@gnu.org>
Subject: Re: Hang in srfi-18.test
Date: Wed, 22 Oct 2008 19:14:09 +0100	[thread overview]
Message-ID: <49dd78620810221114v1e84240fn8104bdc2e189fc68@mail.gmail.com> (raw)
In-Reply-To: <49dd78620810220047xfaffbf5l133bcdcba595c56d@mail.gmail.com>

2008/10/22 Neil Jerram <neiljerram@googlemail.com>:
> 2008/10/18 Neil Jerram <neiljerram@googlemail.com>:
>> Just a heads up...
>>
>> I'm seeing a hang in srfi-18.test, when running make check in master,
>> in the "exception handler installation is thread-safe" test.  It's not
>> 100% reproducible, so looks like there's a race involved.
>>
>> I think the problem is that wait-condition-variable is not actually
>> atomic in the way that it is supposed to be.  It unlocks the mutex,
>> then starts waiting on the cond var.  So it is possible for another
>> thread to lock the same mutex, and signal the cond var, before the
>> wait-condition-variable thread starts waiting.
>>
>> I don't currently have a solution, but I'm working on it...
>
> The attached patch seems to work.  I'll write more to explain later!

So:

In order for wait-condition-variable to be atomic - e.g. in a race
where thread A holds (Scheme-level) mutex M, and calls
(wait-condition-variable C M), and thread B calls (begin (lock-mutex
M) (signal-condition-variable C)) - it needs to call pthread_cond_wait
with the same underlying mutex as is involved in the `lock-mutex'
call.  In terms of the threads.c code, this means that it has to use
M->lock, not C->lock.

block_self() used its mutex arg for two purposes: for protecting
access and changes to the wait queue, and for the pthread_cond_wait
call.  But it wouldn't work reliably to use M->lock to protect C's
wait queue, because in theory two threads can call
(wait-condition-variable C M1) and (wait-condition-variable C M2)
concurrently, with M1 and M2 different.  So we either have to pass
both C->lock and M->lock into block_self(), or use some other mutex to
protect the wait queue.  For this patch, I switched to using the
critical section mutex, because that is a global and so easily
available.  (If that turns out to be a problem for performance, we
could make each queue structure have its own mutex, but there's no
reason to believe yet that it is a problem, because the critical
section mutex isn't used much overall.)

So then we call block_self() with M->lock, and move where M->lock is
unlocked to after the block_self() call, instead of before.

That solves the first hang, but introduces a new one, when a SRFI-18
thread is terminated (`thread-terminate!') between being launched
(`make-thread') and started (`thread-start!').  The problem now is
that pthread_cond_wait is a cancellation point (see man
pthread_cancel), so the pthread_cond_wait call is one of the few
places where a thread-terminate! call can take effect.  If the thread
is cancelled at that point, M->lock ends up still being locked, and
then when do_thread_exit() tries to lock M->lock again, it hangs.

The fix for that is a new `held_mutex' field in scm_i_thread, which is
set to point to the mutex just before a pthread_cond_(timed)wait call,
and set to NULL again afterwards.  If on_thread_exit() finds that
held_mutex is non-NULL, it unlocks that mutex.

A detail is that checking and unlocking held_mutex must be done before
on_thread_exit() calls scm_i_ensure_signal_delivery_thread(), because
the innards of scm_i_ensure_signal_delivery_thread() can do another
pthread_cond_wait() call and so overwrite held_mutex.  But that's OK,
because it's fine for the mutex check and unlock to happen outside
Guile mode.

Lastly, C->lock is then not needed, so I've removed it.

Any thoughts or comments?  (In case not, I guess I'll commit in a
couple of days; let me know if I should wait longer.)

     Neil

next prev parent reply	other threads:[~2008-10-22 18:14 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-10-18  9:46 Hang in srfi-18.test Neil Jerram
2008-10-22  7:47 ` Neil Jerram
2008-10-22 18:14   ` Neil Jerram [this message]
2008-10-24 14:25     ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=49dd78620810221114v1e84240fn8104bdc2e189fc68@mail.gmail.com \
    --to=neiljerram@googlemail.com \
    --cc=guile-devel@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).