From: "Neil Jerram" <neiljerram@googlemail.com>
To: guile-devel <guile-devel@gnu.org>
Subject: Re: Hang in srfi-18.test
Date: Wed, 22 Oct 2008 19:14:09 +0100 [thread overview]
Message-ID: <49dd78620810221114v1e84240fn8104bdc2e189fc68@mail.gmail.com> (raw)
In-Reply-To: <49dd78620810220047xfaffbf5l133bcdcba595c56d@mail.gmail.com>
2008/10/22 Neil Jerram <neiljerram@googlemail.com>:
> 2008/10/18 Neil Jerram <neiljerram@googlemail.com>:
>> Just a heads up...
>>
>> I'm seeing a hang in srfi-18.test, when running make check in master,
>> in the "exception handler installation is thread-safe" test. It's not
>> 100% reproducible, so looks like there's a race involved.
>>
>> I think the problem is that wait-condition-variable is not actually
>> atomic in the way that it is supposed to be. It unlocks the mutex,
>> then starts waiting on the cond var. So it is possible for another
>> thread to lock the same mutex, and signal the cond var, before the
>> wait-condition-variable thread starts waiting.
>>
>> I don't currently have a solution, but I'm working on it...
>
> The attached patch seems to work. I'll write more to explain later!
So:
In order for wait-condition-variable to be atomic - e.g. in a race
where thread A holds (Scheme-level) mutex M, and calls
(wait-condition-variable C M), and thread B calls (begin (lock-mutex
M) (signal-condition-variable C)) - it needs to call pthread_cond_wait
with the same underlying mutex as is involved in the `lock-mutex'
call. In terms of the threads.c code, this means that it has to use
M->lock, not C->lock.
block_self() used its mutex arg for two purposes: for protecting
access and changes to the wait queue, and for the pthread_cond_wait
call. But it wouldn't work reliably to use M->lock to protect C's
wait queue, because in theory two threads can call
(wait-condition-variable C M1) and (wait-condition-variable C M2)
concurrently, with M1 and M2 different. So we either have to pass
both C->lock and M->lock into block_self(), or use some other mutex to
protect the wait queue. For this patch, I switched to using the
critical section mutex, because that is a global and so easily
available. (If that turns out to be a problem for performance, we
could make each queue structure have its own mutex, but there's no
reason to believe yet that it is a problem, because the critical
section mutex isn't used much overall.)
So then we call block_self() with M->lock, and move where M->lock is
unlocked to after the block_self() call, instead of before.
That solves the first hang, but introduces a new one, when a SRFI-18
thread is terminated (`thread-terminate!') between being launched
(`make-thread') and started (`thread-start!'). The problem now is
that pthread_cond_wait is a cancellation point (see man
pthread_cancel), so the pthread_cond_wait call is one of the few
places where a thread-terminate! call can take effect. If the thread
is cancelled at that point, M->lock ends up still being locked, and
then when do_thread_exit() tries to lock M->lock again, it hangs.
The fix for that is a new `held_mutex' field in scm_i_thread, which is
set to point to the mutex just before a pthread_cond_(timed)wait call,
and set to NULL again afterwards. If on_thread_exit() finds that
held_mutex is non-NULL, it unlocks that mutex.
A detail is that checking and unlocking held_mutex must be done before
on_thread_exit() calls scm_i_ensure_signal_delivery_thread(), because
the innards of scm_i_ensure_signal_delivery_thread() can do another
pthread_cond_wait() call and so overwrite held_mutex. But that's OK,
because it's fine for the mutex check and unlock to happen outside
Guile mode.
Lastly, C->lock is then not needed, so I've removed it.
Any thoughts or comments? (In case not, I guess I'll commit in a
couple of days; let me know if I should wait longer.)
Neil
next prev parent reply other threads:[~2008-10-22 18:14 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-10-18 9:46 Hang in srfi-18.test Neil Jerram
2008-10-22 7:47 ` Neil Jerram
2008-10-22 18:14 ` Neil Jerram [this message]
2008-10-24 14:25 ` Ludovic Courtès
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.gnu.org/software/guile/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=49dd78620810221114v1e84240fn8104bdc2e189fc68@mail.gmail.com \
--to=neiljerram@googlemail.com \
--cc=guile-devel@gnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).