From: Peter Brett
Newsgroups: gmane.lisp.guile.user,gmane.comp.cad.geda.devel
Subject: Re: Help needed debugging segfault with Guile 1.8.7
Date: Thu, 11 Nov 2010 14:22:23 +0000
Organization: University of Surrey, Guildford, England
References: <871v6sqbny.fsf@ambire.localdomain>
To: guile-user@gnu.org
Cc: geda-dev@seul.org

Thien-Thi Nguyen writes:

> () Peter Brett
> () Thu, 11 Nov 2010 10:52:41 +0000
>
>>    stupid logic error in some weak ref management code
>
> Could you please describe this error?
>

Sure.

libgeda uses direct management of memory, and the structures used in
its document object model need to be explicitly deleted when finished
with.  I decided to use a Guile smob to represent these structures for
access from Scheme code, with the pointer to the actual structure in
SCM_SMOB_DATA and with the low nibble of SCM_SMOB_FLAGS indicating
which type of DOM structure the smob references.

This would have been sufficient if Scheme code had only been working
with libgeda DOMs created and managed entirely via Scheme code.
However, here Guile is being used simply to provide extensibility to
electronics engineering applications based on libgeda, such as gschem.
It would theoretically be possible for the following sequence of
events to occur:

1. In a Scheme function called from the schematic editor, a transistor
   is instantiated, added to the current page, and also stashed
   somewhere in the Guile environment.

2. A bit later, the user closes the page.  It is destroyed from C
   code, and so is the transistor instance.

3. Finally, a Scheme function is called that unstashes the transistor
   instance, and tries to use it, leading to a segfault.

There were two main design considerations taken into account when
looking for a solution to this problem.  Firstly, I wanted it to be
impossible to make libgeda leak memory from Scheme code, so that doing
something like

  (do ((i 1 (1+ i))) ((> i 1000000)) (make-transistor))

would be safe.  That meant that it had to be possible to destroy DOM
structures from the smob_free() function.  On the other hand, I wanted
to find a solution that avoided adding explicit Guile dependencies to
the core of libgeda (since I hope to split off the Scheme binding into
a separate library at some point).

The solution was to add weak reference facilities to the libgeda DOM
data structures.  A weak reference is added by calling a function
similar to:

  object_weak_ref (OBJECT *object,
                   void (*notify_func)(void *, void *),
                   void *user_data);

The notify_func and user_data are prepended to a singly-linked list in
the OBJECT structure.  When the OBJECT is deleted via the C API, each
entry in the list is alerted by calling:

  notify_func (object, user_data)

The notification function I use for the weak references held by smobs
looks like this:

  static void
  smob_weakref_notify (void *target, void *smob)
  {
    SCM s = (SCM) smob;
    SCM_SET_SMOB_DATA (s, NULL);
  }

I've provided wrapper functions & macros for checking smob validity
(i.e. non-null smob data) before allowing any dereference of
e.g. (OBJECT *) SCM_SMOB_DATA (smob).

To allow garbage collection of DOM element smobs where possible, I use
bit 4 of the SCM_SMOB_FLAGS as a `GC allowed' flag.
Any API functions that put a DOM element in a state where it may be
destroyed from C code rather than the smob_free() function are
required to clear the flag (and vice versa).

So, where was the bug?  When a smob is GC'd, and if the pointer it
contains hasn't already been cleared, smob_free() calls:

  object_weak_unref (SCM_SMOB_DATA (smob), smob_weakref_notify, smob);

Before I fixed the bug, object_weak_unref() contained code that looked
something like this:

  for (iter = weak_refs; iter != NULL; iter = list_next (iter)) {
    struct WeakRef *entry = iter->data;
    if ((entry->notify_func == notify_func) &&
        (entry->user_data != user_data)) { // ERROR: != should be ==
      free (entry);
      iter->data = NULL;
    }
  }
  weak_refs = list_remove_all (weak_refs, NULL);

Now, how does this result in Guile GC freelist corruption?  It requires
two smobs to be created for the same DOM structure S, let's say A and
B.  (This can only occur if S is being managed from C code, so we know
that the `GC allowed' flag will be cleared.)

  smob_addr cell CAR cell CDR
  A 0x1000 0x0
  B 0x1008 0x0

  Weakref user_data for S: 0x1008, 0x1000

Now, smob A is garbage collected.  Because we've told smob_free() that
it's not allowed to destroy S, the smob_free() function calls
object_weak_unref().  Since the latter is broken, it clears the wrong
weak reference.  Now things look like this:

  smob_addr cell CAR cell CDR
  A 0x1000
  B 0x1008 0x0

  Weakref user_data for S: 0x1000

Some time later, S is destroyed from C code, and this results in the
smob_weakref_notify() function described earlier being called thus:

  smob_weakref_notify (, 0x1000)

After S is destroyed, things look like this:

  smob_addr cell CAR cell CDR
  A 0x1000 0x0 (OH NOES)
  B 0x1008 0x0

At this stage, the freelist has become corrupted, and will result in a
segfault in scm_cell() at some indeterminate future time.

With the fix in this commit:

  http://git.gpleda.org/?p=gaf.git;a=commit;h=41ea61b2f156

this memory corruption does not occur.

I hope that explained things reasonably precisely!

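For readers who want to see the bookkeeping in one place, here is a
minimal, self-contained sketch of a weak reference list with the
corrected comparison (== on both fields).  It is not the libgeda
source: the function names object_weak_ref and object_weak_unref
mirror the ones mentioned above, but the types Object, WeakRef,
NotifyFunc and FakeSmob, the helpers object_notify_and_clear and
fake_smob_notify, the int return value, and the hand-rolled list are
all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the structures described in the text. */
typedef void (*NotifyFunc)(void *target, void *user_data);

typedef struct WeakRef {
    NotifyFunc notify_func;
    void *user_data;
    struct WeakRef *next;
} WeakRef;

typedef struct Object {
    WeakRef *weak_refs;              /* singly-linked, newest first */
} Object;

/* Prepend a (notify_func, user_data) pair to the object's list.
 * Returns 0 on success, -1 if allocation fails (an assumption). */
int object_weak_ref(Object *object, NotifyFunc notify_func, void *user_data)
{
    assert(object != NULL);
    WeakRef *entry = malloc(sizeof *entry);
    if (entry == NULL)
        return -1;
    entry->notify_func = notify_func;
    entry->user_data = user_data;
    entry->next = object->weak_refs;
    object->weak_refs = entry;
    return 0;
}

/* Remove every entry whose notify_func AND user_data both match.
 * Using != on user_data here would drop everyone else's entries
 * and keep the caller's, which is the bug described above. */
void object_weak_unref(Object *object, NotifyFunc notify_func, void *user_data)
{
    assert(object != NULL);
    WeakRef **link = &object->weak_refs;
    while (*link != NULL) {
        WeakRef *entry = *link;
        if (entry->notify_func == notify_func &&
            entry->user_data == user_data) {
            *link = entry->next;     /* unlink, then release */
            free(entry);
        } else {
            link = &entry->next;
        }
    }
}

/* What the C-side destructor would do: alert each holder, free the list. */
void object_notify_and_clear(Object *object)
{
    assert(object != NULL);
    WeakRef *entry = object->weak_refs;
    object->weak_refs = NULL;
    while (entry != NULL) {
        WeakRef *next = entry->next;
        entry->notify_func(object, entry->user_data);
        free(entry);
        entry = next;
    }
}

/* Stand-in for a smob cell: only the data word is modelled. */
typedef struct FakeSmob {
    void *data;
} FakeSmob;

/* Analogue of smob_weakref_notify: clear the holder's pointer. */
void fake_smob_notify(void *target, void *smob)
{
    (void) target;
    ((FakeSmob *) smob)->data = NULL;
}
```

With two FakeSmob holders A and B registered on the same Object,
unref'ing A and then calling object_notify_and_clear() should touch
only B.  A's storage, which in the real case may already be back on
the GC freelist, is left alone.
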
Regards,

                               Peter

-- 
Peter Brett
Remote Sensing Research Group
Surrey Space Centre