From mboxrd@z Thu Jan 1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: "Julian Graham"
Newsgroups: gmane.lisp.guile.devel
Subject: Re: Race condition in threading code?
Date: Sat, 30 Aug 2008 19:05:17 -0400
Message-ID: <2bc5f8210808301605v5a6376ffs98b58c848c2f64fa@mail.gmail.com>
References: <2bc5f8210808161142n2b415569y8499f3efafb4a@mail.gmail.com>
 <87prnu293y.fsf@gnu.org>
 <2bc5f8210808270614s3ddc6e9fued2ed9f95da15303@mail.gmail.com>
NNTP-Posting-Host: lo.gmane.org
Mime-Version: 1.0
Content-Type: multipart/mixed;
 boundary="----=_Part_38672_14743787.1220137517961"
X-Trace: ger.gmane.org 1220137553 596 80.91.229.12 (30 Aug 2008 23:05:53 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sat, 30 Aug 2008 23:05:53 +0000 (UTC)
Cc: guile-devel@gnu.org
To: "=?ISO-8859-1?Q?Ludovic_Court=E8s?="
Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Sun Aug 31 01:06:47 2008
Return-path:
Envelope-to: guile-devel@m.gmane.org
Original-Received: from lists.gnu.org ([199.232.76.165])
 by lo.gmane.org with esmtp (Exim 4.50) id 1KZZWl-0006s1-34
 for guile-devel@m.gmane.org; Sun, 31 Aug 2008 01:06:43 +0200
Original-Received: from localhost ([127.0.0.1]:43192 helo=lists.gnu.org)
 by lists.gnu.org with esmtp (Exim 4.43) id 1KZZVm-0004Cd-BK
 for guile-devel@m.gmane.org; Sat, 30 Aug 2008 19:05:42 -0400
Original-Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
 id 1KZZVR-000434-Rx
 for guile-devel@gnu.org; Sat, 30 Aug 2008 19:05:21 -0400
Original-Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
 id 1KZZVQ-00042O-9o
 for guile-devel@gnu.org; Sat, 30 Aug 2008 19:05:21 -0400
Original-Received: from [199.232.76.173] (port=36999 helo=monty-python.gnu.org)
 by lists.gnu.org with esmtp (Exim 4.43) id 1KZZVQ-00042H-0J
 for guile-devel@gnu.org; Sat, 30 Aug 2008 19:05:20 -0400
Original-Received: from ug-out-1314.google.com ([66.249.92.172]:16349)
 by monty-python.gnu.org with esmtp (Exim 4.60) (envelope-from )
 id 1KZZVP-0006Uo-AF
 for guile-devel@gnu.org; Sat, 30 Aug 2008 19:05:19 -0400
Original-Received: by ug-out-1314.google.com with SMTP id m2so1916395uge.17
 for ; Sat, 30 Aug 2008 16:05:18 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma;
 h=domainkey-signature:received:received:message-id:date:from:to
 :subject:cc:in-reply-to:mime-version:content-type:references;
 bh=WhXM7xoBZti5hwKPHJsejxGj2opRiHlT3LblOlehKqc=;
 b=N89G5oGVBT4bvFdomh4GKAymV69Ns2GeS+1/i5K2JHhi2lb2a8+zyzEDCx0mlV5UNa
 ZT6oZ3vnXGa66EMzNmvLbTZTXH6iskvIU4krWUdzHE45Ao5BXmhNStybqrMUH5NzFsK3
 iuLZYmC/9VWzPezaK/3cYur72hovazznKowy4=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma;
 h=message-id:date:from:to:subject:cc:in-reply-to:mime-version
 :content-type:references;
 b=GnzUCCLYgJWcitUm3TAKiPBAjouBbc31VJqFQgszf4HI51gaM2Bawy+7iv60eD3Be5
 MV1xCBGYIPrSBGXXI1esvONCl5rHj/n0AjPC/8OhTqyJb9qAzp3xwAMn+sc5LHtwr1ik
 mG768Cfqnx+mU4STnuQq1r0HCii1ayzDhC110=
Original-Received: by 10.67.10.18 with SMTP id n18mr1068173ugi.88.1220137517979;
 Sat, 30 Aug 2008 16:05:17 -0700 (PDT)
Original-Received: by 10.66.237.3 with HTTP; Sat, 30 Aug 2008 16:05:17 -0700 (PDT)
In-Reply-To: <2bc5f8210808270614s3ddc6e9fued2ed9f95da15303@mail.gmail.com>
X-detected-kernel: by monty-python.gnu.org: Linux 2.6 (newer, 2)
X-BeenThere: guile-devel@gnu.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "Developers list for Guile, the GNU extensibility library"
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Original-Sender: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.devel:7547
Archived-At:

------=_Part_38672_14743787.1220137517961
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

Okay, I think I know what the problem is:

Part of the SRFI-18 thread start / creation process involves
contention for a mutex, and there's a bug in fat_mutex_lock code
that causes the locking thread to sometimes miss an unlocking thread's
notification that a mutex is available.  So it's actually a mutex bug
-- specifically, in the loop code in fat_mutex_lock that ends with the
following snippet:

...
          scm_i_pthread_mutex_unlock (&m->lock);
          SCM_TICK;
          scm_i_scm_pthread_mutex_lock (&m->lock);
        }
      block_self (m->waiting, mutex, &m->lock, timeout);

...which means that if the loop is entered while the mutex is still
locked but the owner unlocks it after the locking thread releases the
administrative lock to run the tick, the locking thread will sleep
forever because it doesn't re-check the state of the mutex.

I've made a small change (blocking before doing the tick instead of
after) that seems to resolve the issue (so far no lock-ups using
Han-Wen's x.test for a couple of hours).  There's a patch attached.

(Sorry, should have noticed this earlier; the problem existed before
the changes I introduced to support SRFI-18...)


Regards,
Julian


On Wed, Aug 27, 2008 at 9:14 AM, Julian Graham wrote:
>> I've seen `srfi-18.test' hang from time to time, but not often enough to
>> give me an incentive to nail it down.  :-(  I don't think it relates to
>> Han-Wen's GC changes.
>
>
> Crap, I'm seeing some lockups now, too.  Sorry, guys.  I'm debugging,
> but don't let that stop you from investigating as well.
;) ------=_Part_38672_14743787.1220137517961 Content-Type: text/x-diff; name=0001-Resolve-a-deadlock-caused-by-not-checking-mutex-stat.patch Content-Transfer-Encoding: base64 X-Attachment-Id: f_fkiuc9ur0 Content-Disposition: attachment; filename=0001-Resolve-a-deadlock-caused-by-not-checking-mutex-stat.patch RnJvbSAxMmE1YzdjYTVlMGJlYzkzODZlZTI0YjJlMjMyMGQxYWEwM2U1NWQ1IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKdWxpYW4gR3JhaGFtIDxqdWxpYW5AY291bnR5aGVsbC4obm9u ZSk+CkRhdGU6IFNhdCwgMzAgQXVnIDIwMDggMTk6MDM6MjEgLTA0MDAKU3ViamVjdDogW1BBVENI XSBSZXNvbHZlIGEgZGVhZGxvY2sgY2F1c2VkIGJ5IG5vdCBjaGVja2luZyBtdXRleCBzdGF0ZSBh ZnRlciBjYWxsaW5nCiBgU0NNX1RJQ0snLgoKLS0tCiBsaWJndWlsZS9DaGFuZ2VMb2cgfCAgICA1 ICsrKysrCiBsaWJndWlsZS90aHJlYWRzLmMgfCAgICAyICstCiAyIGZpbGVzIGNoYW5nZWQsIDYg aW5zZXJ0aW9ucygrKSwgMSBkZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9saWJndWlsZS9DaGFu Z2VMb2cgYi9saWJndWlsZS9DaGFuZ2VMb2cKaW5kZXggZThkOTM2Mi4uNDBlMmJiNCAxMDA2NDQK LS0tIGEvbGliZ3VpbGUvQ2hhbmdlTG9nCisrKyBiL2xpYmd1aWxlL0NoYW5nZUxvZwpAQCAtMSwz ICsxLDggQEAKKzIwMDgtMDgtMjkgIEp1bGlhbiBHcmFoYW0gIDxqb29sZWFuQGdtYWlsLmNvbT4K KworCSogdGhyZWFkcy5jIChmYXRfbXV0ZXhfbG9jayk6IFJlc29sdmUgYSBkZWFkbG9jayBjYXVz ZWQgYnkgbm90CisJY2hlY2tpbmcgbXV0ZXggc3RhdGUgYWZ0ZXIgY2FsbGluZyBgU0NNX1RJQ0sn LgkKKwogMjAwOC0wOC0yNyAgTHVkb3ZpYyBDb3VydMOocyAgPGx1ZG9AZ251Lm9yZz4KIAogCUZp eCBidWlsZHMgYC0td2l0aG91dC10aHJlYWRzJy4gIFJlcG9ydGVkIGJ5IEhhbi1XZW4gTmllbmh1 eXMKZGlmZiAtLWdpdCBhL2xpYmd1aWxlL3RocmVhZHMuYyBiL2xpYmd1aWxlL3RocmVhZHMuYwpp bmRleCA3ZTU1ZjNiLi44Njk5ZmQwIDEwMDY0NAotLS0gYS9saWJndWlsZS90aHJlYWRzLmMKKysr IGIvbGliZ3VpbGUvdGhyZWFkcy5jCkBAIC0xMjkyLDExICsxMjkyLDExIEBAIGZhdF9tdXRleF9s b2NrIChTQ00gbXV0ZXgsIHNjbV90X3RpbWVzcGVjICp0aW1lb3V0LCBTQ00gb3duZXIsIGludCAq cmV0KQogCQkgIGJyZWFrOwogCQl9CiAJICAgIH0KKwkgIGJsb2NrX3NlbGYgKG0tPndhaXRpbmcs IG11dGV4LCAmbS0+bG9jaywgdGltZW91dCk7CiAJICBzY21faV9wdGhyZWFkX211dGV4X3VubG9j ayAoJm0tPmxvY2spOwogCSAgU0NNX1RJQ0s7CiAJICBzY21faV9zY21fcHRocmVhZF9tdXRleF9s 
b2NrICgmbS0+bG9jayk7CiAJfQotICAgICAgYmxvY2tfc2VsZiAobS0+d2FpdGluZywgbXV0ZXgs ICZtLT5sb2NrLCB0aW1lb3V0KTsKICAgICB9CiAgIHNjbV9pX3B0aHJlYWRfbXV0ZXhfdW5sb2Nr ICgmbS0+bG9jayk7CiAgIHJldHVybiBlcnI7Ci0tIAoxLjUuNC4zCgo= ------=_Part_38672_14743787.1220137517961--
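
The lost-wakeup pattern described in the message, and the ordering the
patch moves to (block first, tick afterwards, then re-check), can be
sketched outside Guile with plain pthreads.  This is only an
illustration under stated assumptions, not the libguile code: toy_mutex,
toy_tick, toy_mutex_lock, toy_mutex_unlock and toy_contend are
hypothetical names, a condition variable stands in for the m->waiting
queue and block_self, and sched_yield stands in for SCM_TICK.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a fat mutex: `admin' plays the role of the
   administrative lock (m->lock), `wakeup' the role of the wait queue. */
typedef struct {
  pthread_mutex_t admin;
  pthread_cond_t wakeup;
  bool locked;
} toy_mutex;

#define TOY_MUTEX_INITIALIZER \
  { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/* Placeholder for SCM_TICK: work that runs without `admin' held. */
static void toy_tick (void) { sched_yield (); }

void toy_mutex_lock (toy_mutex *m)
{
  pthread_mutex_lock (&m->admin);
  while (m->locked)
    {
      /* Sleep while still holding `admin'; pthread_cond_wait releases it
         and blocks atomically, so an unlock cannot slip in between the
         state check above and the sleep. */
      pthread_cond_wait (&m->wakeup, &m->admin);

      /* Only now drop `admin' for the tick.  The state may change while
         it is dropped, but the loop condition re-checks it under `admin'
         before this thread can sleep again. */
      pthread_mutex_unlock (&m->admin);
      toy_tick ();
      pthread_mutex_lock (&m->admin);
    }
  m->locked = true;
  pthread_mutex_unlock (&m->admin);
}

void toy_mutex_unlock (toy_mutex *m)
{
  pthread_mutex_lock (&m->admin);
  assert (m->locked);
  m->locked = false;
  pthread_cond_signal (&m->wakeup);
  pthread_mutex_unlock (&m->admin);
}

/* Demo driver (also hypothetical): up to 16 workers each bump a shared
   counter `iters' times under the toy mutex. */
struct toy_job { toy_mutex *m; long *counter; int iters; };

static void *toy_worker (void *arg)
{
  struct toy_job *job = arg;
  for (int i = 0; i < job->iters; i++)
    {
      toy_mutex_lock (job->m);
      ++*job->counter;
      toy_mutex_unlock (job->m);
    }
  return NULL;
}

/* Returns the final counter value, or -1 if a thread could not be
   started or `nthreads' is out of range. */
long toy_contend (int nthreads, int iters)
{
  toy_mutex m = TOY_MUTEX_INITIALIZER;
  long counter = 0;
  struct toy_job job = { &m, &counter, iters };
  pthread_t tids[16];
  int started = 0;

  if (nthreads < 0 || nthreads > 16)
    return -1;
  while (started < nthreads
         && pthread_create (&tids[started], NULL, toy_worker, &job) == 0)
    started++;
  for (int i = 0; i < started; i++)
    pthread_join (tids[i], NULL);
  return started == nthreads ? counter : -1;
}
```

Swapping the order inside the loop (drop `admin', tick, re-take `admin',
then wait unconditionally) reproduces the hang described above: the
owner can unlock and signal during the tick, and the waiter then sleeps
on a notification that has already been sent.  The sketch builds with
something like `cc -pthread'; calling toy_contend (4, 10000) exercises
the contended path.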