From mboxrd@z Thu Jan 1 00:00:00 1970
From: Han-Wen Nienhuys
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Race condition in threading code?
Date: Sat, 30 Aug 2008 22:49:03 -0300
References: <2bc5f8210808161142n2b415569y8499f3efafb4a@mail.gmail.com>
 <87prnu293y.fsf@gnu.org>
 <2bc5f8210808270614s3ddc6e9fued2ed9f95da15303@mail.gmail.com>
 <2bc5f8210808301605v5a6376ffs98b58c848c2f64fa@mail.gmail.com>
In-Reply-To: <2bc5f8210808301605v5a6376ffs98b58c848c2f64fa@mail.gmail.com>
Reply-To: hanwen@xs4all.nl
To: guile-devel@gnu.org
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
User-Agent: Thunderbird 2.0.0.16 (X11/20080723)
List-Id: "Developers list for Guile, the GNU extensibility library"
Xref: news.gmane.org gmane.lisp.guile.devel:7548

Julian Graham wrote:
> Okay, I think I know what the problem is: Part of the SRFI-18 thread
> start / creation process involves contention for a mutex, and there's
> a bug in fat_mutex_lock code that causes the locking thread to
> sometimes miss an unlocking thread's notification that a mutex is
> available.  So it's actually a mutex bug -- specifically, in the loop
> code in fat_mutex_lock that ends with the following snippet:
>
> ...
>       scm_i_pthread_mutex_unlock (&m->lock);
>       SCM_TICK;
>       scm_i_scm_pthread_mutex_lock (&m->lock);
>     }
>   block_self (m->waiting, mutex, &m->lock, timeout);
>
> ...which means that if the loop is entered while the mutex is still
> locked but the owner unlocks it after the locking thread releases the
> administrative lock to run the tick, the locking thread will sleep
> forever because it doesn't re-check the state of the mutex.  I've made
> a small change (blocking before doing the tick instead of after) that
> seems to resolve the issue (so far no lock-ups using Han-Wen's x.test
> for a couple of hours).  There's a patch attached.
>
> (Sorry, should have noticed this earlier; the problem existed before
> the changes I introduced to support SRFI-18...)

Would this also explain the 'corruption' in the evaluator we have been
seeing ("bad bindings at .. ")?

-- 
Han-Wen Nienhuys - hanwen@xs4all.nl - http://www.xs4all.nl/~hanwen
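
For readers following the thread, here is a minimal sketch of the lost-wakeup window described in the quoted message, written against plain POSIX threads. It is not Guile's code and not the attached patch: toy_mutex, toy_mutex_lock, toy_mutex_unlock and demo_handoff are hypothetical names, and sched_yield() merely stands in for the tick. The sketch closes the window by testing the mutex state again after re-taking the administrative lock, which is a different route from the patch's "block before the tick" and is shown only to illustrate why the re-check matters.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Hypothetical stand-in for a "fat" mutex; not Guile's actual struct. */
typedef struct {
    pthread_mutex_t lock;   /* administrative lock guarding the fields below */
    pthread_cond_t  wakeup; /* signalled when the mutex becomes available */
    int             owned;  /* nonzero while some thread holds the mutex */
} toy_mutex;

static void toy_mutex_init(toy_mutex *m)
{
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->wakeup, NULL);
    m->owned = 0;
}

static void toy_mutex_lock(toy_mutex *m)
{
    pthread_mutex_lock(&m->lock);
    while (m->owned) {
        /* Drop the administrative lock briefly, like the tick in the
           quoted loop.  The owner may unlock during this window. */
        pthread_mutex_unlock(&m->lock);
        sched_yield();
        pthread_mutex_lock(&m->lock);

        /* Test the state again before sleeping.  Sleeping here without
           this check is the lost-wakeup case: the signal has already
           been sent and nobody will send another. */
        if (!m->owned)
            break;
        pthread_cond_wait(&m->wakeup, &m->lock);
    }
    m->owned = 1;
    pthread_mutex_unlock(&m->lock);
}

static void toy_mutex_unlock(toy_mutex *m)
{
    pthread_mutex_lock(&m->lock);
    m->owned = 0;
    pthread_cond_signal(&m->wakeup);
    pthread_mutex_unlock(&m->lock);
}

static toy_mutex demo_mutex;
static int demo_counter;

static void *demo_worker(void *arg)
{
    (void) arg;
    toy_mutex_lock(&demo_mutex);   /* contends with the main thread */
    demo_counter++;
    toy_mutex_unlock(&demo_mutex);
    return NULL;
}

/* Hands the mutex from the calling thread to a worker and returns the
   number of increments made while holding it. */
int demo_handoff(void)
{
    pthread_t worker;
    int rc;

    toy_mutex_init(&demo_mutex);
    demo_counter = 0;

    toy_mutex_lock(&demo_mutex);
    rc = pthread_create(&worker, NULL, demo_worker, NULL);
    assert(rc == 0);
    demo_counter++;
    toy_mutex_unlock(&demo_mutex);

    rc = pthread_join(worker, NULL);
    assert(rc == 0);
    (void) rc;
    return demo_counter;
}
```

Calling demo_handoff() from a small main() and linking with the platform's pthread library is enough to try it. Removing the `if (!m->owned) break;` test reproduces the shape of the reported bug, although the hang then depends on scheduling and may take many runs to show up.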