From: "Linas Vepstas" <linasvepstas@gmail.com>
Reply-To: linasvepstas@gmail.com
To: "Neil Jerram"
Cc: bug-guile@gnu.org
Newsgroups: gmane.lisp.guile.bugs
Subject: Re: [PATCH] Final: thread lock nesting debugging
Date: Thu, 20 Nov 2008 10:03:43 -0600
Message-ID: <3ae3aa420811200803u485a8fa7ia486c281da54497a@mail.gmail.com>
References: <3ae3aa420811162042n4b5d50e3i136a9f7b00262a69@mail.gmail.com>
 <49dd78620811191514m6e73438esd4f1474c9926880d@mail.gmail.com>

2008/11/19 Neil Jerram:
> 2008/11/17 Linas Vepstas:
>> I've been seeing all sorts of deadlocks in guile, and so I wrote a small
>> debugging utility to try to track down the problems.
>
> Interesting patch!
>
> One query; I may be being a bit dumb, I'm only just recovering from a
> bad cold, but anyway... Your patch checks for a thread unlocking
> mutexes in the reverse order that it locked them in (let's call this
> point "A"). But I thought your recent investigations had shown that
> the problem was threads doing locking in inconsistent order, e.g.
> thread 1 locks M1 and then M2, while thread 2 locks M2 and then M1
> (point "B"). Are points "A" and "B" equivalent? (It isn't obvious to
> me if so.)

Hi Neil,

There is (should be) only one lock in guile that is "inconsistent" in
its locking order, and this is the t->heap_mutex lock. My guess is
that valgrind is tripping over this one. I guess I should argue that
this is why one needs a custom patch, instead of using valgrind (which
is otherwise pretty fantastic for memory corruption and the like).

The t->heap_mutex lock is held whenever a thread is "guilified" or is
"in guile mode". Its primary reason for being is to keep the garbage
collector from running until all threads have been halted. (This is
done by scm_i_thread_put_to_sleep.)

After applying my set of patches, the only inconsistent (and
widespread!) lock-ordering problem that I'm seeing stems from the
asymmetric way in which scm_i_scm_pthread_mutex_lock is used to take
a lock and then drop it. If you follow the #define for
scm_i_scm_pthread_mutex_lock, you find that it's of the form:

    drop (thread->heap_mutex)
    take (some lock)
    take (thread->heap_mutex)

whereas the unlock is just:

    drop (some lock)

You can see this, for example, in ports.c line 728:

    scm_i_scm_pthread_mutex_lock (&scm_i_port_table_mutex);
    scm_i_remove_port (port);
    scm_i_pthread_mutex_unlock (&scm_i_port_table_mutex);

To be "correctly" nested, the unlock should have dropped the heap
mutex first, and then reacquired it. I believe that doing this would
be enough to quiet down helgrind (at least for most cases ... what
remains would be interesting).

OK, the above was just facts; below, some random comments which might
be incorrect (reasoning about locks can be deceptive; I've certainly
mis-reasoned several times w.r.t. guile):

-- I had decided that the way the dropping of the lock is done is OK,
and that it would be silly (and a performance hit) to try to "fix" the
unlocking order. For debugging with helgrind, you may want to do this
anyway, but for production it seemed unnecessary. Prod me, and perhaps
I can reconstruct why I came to this conclusion.

-- The reason for dropping the heap_mutex before grabbing the other
lock (for example scm_i_port_table_mutex) is somewhat obscure, but at
one point I decided that this was OK, and arguably correct. As I write
this, I've forgotten why. However, this should be a focus of
attention, and re-thought through. If you are willing to think about
it, prod me and maybe I can say something intelligent. Changing this
would also quiet helgrind (and boost performance). It might be safe;
I need to rethink this.

Anyway, my patch allows for this single occurrence of the
out-of-sequence heap_mutex unlock.

--linas
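
For concreteness, here is a minimal compilable sketch of the asymmetric
lock/unlock pattern described above. It is distilled from the prose of
this mail, not from Guile's actual sources; heap_mutex, sketch_lock,
sketch_unlock_as_is, and sketch_unlock_nested are illustrative names
only, not the real #define of scm_i_scm_pthread_mutex_lock:

    /* Sketch only: the mutex name is borrowed from Guile for
       readability; this is not the actual macro expansion. */
    #include <pthread.h>

    static pthread_mutex_t heap_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* What the lock side effectively does: leave guile mode, block on
       the requested lock, then re-enter guile mode.  On return the
       caller holds both *m and heap_mutex, heap_mutex taken last. */
    static void sketch_lock (pthread_mutex_t *m)
    {
      pthread_mutex_unlock (&heap_mutex);
      pthread_mutex_lock (m);
      pthread_mutex_lock (&heap_mutex);
    }

    /* What the unlock side does today: drop only the inner lock.  To a
       nesting checker, *m is released while heap_mutex (taken after it)
       is still held, i.e. releases happen in non-LIFO order. */
    static void sketch_unlock_as_is (pthread_mutex_t *m)
    {
      pthread_mutex_unlock (m);
    }

    /* The "correctly" nested unlock suggested above: drop the heap
       mutex first, drop the inner lock, then reacquire the heap mutex
       so the thread stays in guile mode. */
    static void sketch_unlock_nested (pthread_mutex_t *m)
    {
      pthread_mutex_unlock (&heap_mutex);
      pthread_mutex_unlock (m);
      pthread_mutex_lock (&heap_mutex);
    }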
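
And for Neil's point "B", the classic lock-order inversion: the
following hypothetical standalone program (m1, m2, thread1, thread2 are
made-up names; this is not Guile code) can deadlock under exactly the
scheduling he describes, each thread holding one mutex and waiting
forever for the other:

    /* Build with: gcc -pthread.  May deadlock, depending on timing. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

    static void *thread1 (void *unused)
    {
      pthread_mutex_lock (&m1);
      pthread_mutex_lock (&m2);   /* deadlocks if thread2 holds m2 and
                                     is waiting on m1 */
      pthread_mutex_unlock (&m2);
      pthread_mutex_unlock (&m1);
      return NULL;
    }

    static void *thread2 (void *unused)
    {
      pthread_mutex_lock (&m2);
      pthread_mutex_lock (&m1);   /* deadlocks if thread1 holds m1 and
                                     is waiting on m2 */
      pthread_mutex_unlock (&m1);
      pthread_mutex_unlock (&m2);
      return NULL;
    }

    int main (void)
    {
      pthread_t t1, t2;
      pthread_create (&t1, NULL, thread1, NULL);
      pthread_create (&t2, NULL, thread2, NULL);
      pthread_join (t1, NULL);
      pthread_join (t2, NULL);
      puts ("lucky scheduling: no deadlock this run");
      return 0;
    }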