Re: Help needed debugging segfault with Guile 1.8.7

unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed

From: Peter Brett <peter@peter-b.co.uk>
To: guile-user@gnu.org
Cc: geda-dev@seul.org
Subject: Re: Help needed debugging segfault with Guile 1.8.7
Date: Thu, 11 Nov 2010 14:22:23 +0000	[thread overview]
Message-ID: <we1mwroklz34.fsf@ssclt001.eps.surrey.ac.uk> (raw)
In-Reply-To: 871v6sqbny.fsf@ambire.localdomain

Thien-Thi Nguyen <ttn@gnuvola.org> writes:

> () Peter Brett <peter@peter-b.co.uk>
> () Thu, 11 Nov 2010 10:52:41 +0000
>
>>    stupid logic error in some weak ref management code
>
> Could you please describe this error?
>

Sure.  libgeda uses direct management of memory, and the structures used
in its document object model need to be explicitly deleted when finished
with.  I decided to use a Guile smob to represent these structures for
access from Scheme code, with the pointer to the actual structure in
SCM_SMOB_DATA and with the low nibble of SCM_SMOB_FLAGS indicating which
type of DOM structure the smob references.

This would have been sufficient if Scheme code had only been working
with libgeda DOMs created and managed entirely via Scheme code.
However, here Guile is being used simply to provide extensibility to
electronics engineering applications based on libgeda, such as gschem.
It would theoretically be possible for the following sequence of events
to occur:

 1. In a Scheme function called from the schematic editor, a transistor
    is instantiated, added to the current page, and also stashed
    somewhere in the Guile environment.

 2. A bit later, the user closes the page.  It is destroyed from C code,
    and so is the transistor instance.

 3. Finally, a Scheme function is called that unstashes the transistor
    instance, and tries to use it, leading to a segfault.

There were two main design considerations taken into account when
looking for a solution to this problem.  Firstly, I wanted it to be
impossible to make libgeda leak memory from Scheme code, so that doing
something like

   (do ((i 1 (1+ i)) ((> i 1000000)))
       (make-transistor))

would be safe.  That meant that it had to be possible to destroy DOM
structures from the smob_free() function.  On the other hand, I wanted
to find a solution that avoided adding explicit Guile dependencies to
the core of libgeda (since I hope to split off the Scheme binding into a
separate library at some point).

The solution was to add weak reference facilities to the libgeda DOM
data structures.  A weak reference is added by calling a function
similar to:

  object_weak_ref (OBJECT *object, void (*notify_func)(void *, void *),
                   void *user_data);

The notify_func and user_data are prepended to a singly-linked list in
the OBJECT structure.  When the OBJECT is deleted via the C API, each
entry in the list is alerted by calling:

  notify_func (object, user_data)

The notification function I use for the weak references held by smobs
looks like this:

  static void
  smob_weakref_notify (void *target, void *smob) {
    SCM s = (SCM) smob;
    SCM_SET_SMOB_DATA (s, NULL);
  }

I've provided wrapper functions & macros for checking smob validity
(i.e. non-null smob data) before allowing any dereference of
e.g. (OBJECT *) SCM_SMOB_DATA (smob).

To allow garbage collection of DOM element smobs where possible, I use
bit 4 of the SCM_SMOB_FLAGS as a `GC allowed' flag.  Any API functions
that put a DOM element in a state where it may be destroyed from C code
rather than the smob_free() function are required to clear the flag (and
vice versa).

So, where was the bug?  When a smob is GC'd, and if the pointer it
contains hasn't already been cleared, it calls:

  object_weak_unref (SCM_SMOB_DATA (smob), smob_weakref_notify, smob);

Before I fixed the bug, object_weak_unref() contained code that looked
something like this:

  for (iter = weak_refs; iter != NULL; iter = list_next (iter)) {
    struct WeakRef *entry = iter->data;
    if ((entry->notify_func == notify_func) &&
        (entry->user_data != user_data)) { // ERROR: != should be ==
      free (entry);
      iter->data = NULL;
    }
  }
  weak_refs = list_remove_all (weak_refs, NULL);

Now, how does this result in Guile GC freelist corruption?  It requires
two smobs to be created for the same DOM structure S, let's say A and B.
(This can only occur if S is being managed from C code, so we know that
the `GC allowed' flag will be cleared).

            smob_addr      cell CAR         cell CDR
     A      0x1000         0x0              <ptr to S>
     B      0x1008         0x0              <ptr to S>

     Weakref user_data for S: 0x1008, 0x1000

Now, smob A is garbage collected.  Because we've told smob_free that
it's not allowed to destroy S, the smob_free() function calls
object_weak_unref().  Since the latter is broken, it clears the wrong
weak reference.  Now things look like this:

            smob_addr      cell CAR         cell CDR
     A      0x1000         <magic number>   <ptr to next free cell>
     B      0x1008         0x0              <ptr to S>

     Weakref user_data for S: 0x1000

Some time later, S is destroyed from C code, and this results in the
smob_weakref_notify() function described earlier being called thus:

   smob_weakref_notify (<pointer to S>, 0x1000)

After S is destroyed, things look like this:

            smob_addr      cell CAR         cell CDR
     A      0x1000         <magic number>   0x0 (OH NOES)
     B      0x1008         0x0              <ptr to S>

At this stage, the freelist has become corrupted, and will result in a
segfault in scm_cell() at some indeterminate future time.

With the fix in this commit:

  http://git.gpleda.org/?p=gaf.git;a=commit;h=41ea61b2f156

this memory corruption does not occur.

I hope that explained things reasonably precisely!

Regards,

                                 Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre

next prev parent reply	other threads:[~2010-11-11 14:22 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-11-10 12:43 Help needed debugging segfault with Guile 1.8.7 Peter TB Brett
2010-11-10 21:35 ` Peter TB Brett
2010-11-11 10:52   ` Peter Brett
2010-11-11 12:37     ` Thien-Thi Nguyen
2010-11-11 14:22       ` Peter Brett [this message]
2010-11-28 11:38         ` Neil Jerram
2010-11-28 17:21           ` Linas Vepstas
2010-11-30 19:56             ` Peter TB Brett
2010-12-01 19:48               ` Andy Wingo
2010-11-30 19:43           ` Peter TB Brett
2010-12-01 13:46             ` Ludovic Courtès
2010-12-03  7:52               ` Peter TB Brett
2010-11-11  8:22 ` rixed
2010-11-11  8:33 ` Neil Jerram
2010-11-11 13:30 ` Ludovic Courtès

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/guile/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=we1mwroklz34.fsf@ssclt001.eps.surrey.ac.uk \
    --to=peter@peter-b.co.uk \
    --cc=geda-dev@seul.org \
    --cc=guile-user@gnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).