unofficial mirror of guile-user@gnu.org 
* Help needed debugging segfault with Guile 1.8.7
@ 2010-11-10 12:43 Peter TB Brett
  2010-11-10 21:35 ` Peter TB Brett
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Peter TB Brett @ 2010-11-10 12:43 UTC (permalink / raw)
  To: guile-user


[-- Attachment #1.1: Type: text/plain, Size: 3151 bytes --]

Hi folks,

I've recently been working on greatly expanding the Guile Scheme API to
libgeda, the shared library in the gEDA suite [1].  Unfortunately, I
need your help.

I've added a testsuite for the libgeda Scheme API.  In one commit [2],
the testsuite succeeds.  In the following commit [3], a test which does
not touch any of the changed code starts causing a segfault in the Guile
interpreter.

I am using Guile 1.8.7 on Fedora Linux (guile-1.8.7-5.fc13.i686).  I've
recompiled the package from the official spec file with the
`--enable-debug' flag.  I can reliably reproduce the problem as
follows:


  git clone git://repo.or.cz/geda-gaf/peter-b.git
  cd peter-b
  git checkout -t origin/guile-scheme-api

  ./autogen.sh && ./configure
  make version.h             # This avoids building whole package

  (cd libgeda && make check) # Segfault in t0015-object-complex.scm

  git checkout HEAD^

  (cd libgeda && make check) # Succeeds


What I've tried so far:

1. Running the test under gdb:

     srcdir=libgeda/scheme \
     libtool --mode=execute gdb --args libgeda/shell/geda-shell -q \
       -L libgeda/scheme \
       -s libgeda/scheme/unit-tests/t0015-object-complex.scm

   Works perfectly.

2. Running the test under gdb in a different way:

     srcdir=libgeda/scheme \
     libtool --mode=execute gdb --args libgeda/shell/geda-shell -q \
       -L libgeda/scheme \
       -c '(load (cadr (command-line)))' \
       libgeda/scheme/unit-tests/t0015-object-complex.scm

   Also works perfectly.

3. Running the test under gdb in a third way (this is the same way it's
   run by `make check'):

     srcdir=libgeda/scheme \
     libtool --mode=execute gdb --args libgeda/shell/geda-shell -q \
     -L libgeda/scheme \
     -c '(use-modules (unit-test)) (load (cadr (command-line)))' \
     libgeda/scheme/unit-tests/t0015-object-complex.scm
 
   We can observe the segfault:

     Program received signal SIGSEGV, Segmentation fault.
     scm_cell (e1=0x404) at ../libguile/inline.h:127
     127           *freelist = SCM_FREE_CELL_CDR (*freelist);

   Unfortunately, I've been unable to interpret the backtrace
   (attached).  Can anyone help me out with this?

Valgrind / Memcheck wasn't very helpful either.  Is there a way to make
Guile zero each new heap page it requests before using it in order to
reduce the number of false positives the gc generates in Memcheck?

In fact, *none* of the ways I've found of running the test without the
segfault occurring have given me *any* clue as to how to fix it
"properly".

I'd really appreciate any suggestions that anyone might be able to give
me on figuring out how I've managed to break things.  At the moment, I'm
at a complete loss.

Thanks in advance for any help you can give!

Regards,

                                       Peter
   

[1] http://www.gpleda.org
[2] http://repo.or.cz/w/geda-gaf/peter-b.git/commit/f8b371c8732f
[3] http://repo.or.cz/w/geda-gaf/peter-b.git/commit/c7d44a2507ed

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre


[-- Attachment #1.2: Backtrace of segfault location --]
[-- Type: text/plain, Size: 6089 bytes --]

#0  scm_cell (e1=0x404) at ../libguile/inline.h:127
#1  scm_list_1 (e1=0x404) at list.c:47
#2  0x001b3ee0 in scm_eval_args (l=0xb7f574a0, env=0xb7ef33d0, proc=
    0xb7f5c5e0) at eval.c:2963
#3  0x001ac27a in ceval (x=0xb7f574a8, env=0xb7ef33d0) at eval.c:4597
#4  0x001b3ed6 in scm_eval_args (l=0xb7f57480, env=0xb7ef33d0, proc=
    0xb7f5c5e0) at eval.c:2961
#5  0x001ac27a in ceval (x=0xb7f574f0, env=0xb7ef33d0) at eval.c:4597
#6  0x001abe31 in ceval (x=0xb7f42830, env=0xb7ef3530) at eval.c:4212
#7  0x001b4727 in call_closure_1 (proc=0xb7ef44f8, arg1=0xb7ee9a70)
    at eval.c:5261
#8  0x001af297 in scm_map (proc=0xb7ef44f8, arg1=0xb7ef4638, args=
    0x404) at eval.c:5489
#9  0x001ac96c in ceval (x=0x404, env=0xb7ef4518) at eval.c:4367
#10 0x001abe8f in ceval (x=0xb7fb11b0, env=0xb7ef4518) at eval.c:4342
#11 0x001b4727 in call_closure_1 (proc=0xb7eee8d0, arg1=0xb7ee9aa8)
    at eval.c:5261
#12 0x001af297 in scm_map (proc=0xb7eee8d0, arg1=0xb7eefee8, args=
    0x404) at eval.c:5489
#13 0x001ac96c in ceval (x=0x404, env=0xb7eee8e0) at eval.c:4367
#14 0x001b3ed6 in scm_eval_args (l=0xb7f48b78, env=0xb7eee8e0, proc=
    0xb7f5b250) at eval.c:2961
#15 0x001ac27a in ceval (x=0xb7f48b78, env=0xb7eee8e0) at eval.c:4597
#16 0x001b4727 in call_closure_1 (proc=0xb7eee6d8, arg1=0xb7eee620)
    at eval.c:5261
#17 0x001af297 in scm_map (proc=0xb7eee6d8, arg1=0xb7eee618, args=
    0x404) at eval.c:5489
#18 0x001ac96c in ceval (x=0x404, env=0xb7eee6e8) at eval.c:4367
#19 0x001abe8f in ceval (x=0xb7fac660, env=0xb7eee6e8) at eval.c:4342
#20 0x001abe8f in ceval (x=0xb7faa7f0, env=0xb7eebed0) at eval.c:4342
#21 0x001b4727 in call_closure_1 (proc=0xb7eeaa08, arg1=0xb7eeb6c0)
    at eval.c:5261
#22 0x001af297 in scm_map (proc=0xb7eeaa08, arg1=0xb7eeab30, args=
    0x404) at eval.c:5489
#23 0x001ac96c in ceval (x=0x404, env=0xb7eeaa20) at eval.c:4367
#24 0x001abe8f in ceval (x=0xb7fb11b0, env=0xb7eeaa20) at eval.c:4342
#25 0x001acd40 in ceval (x=<value optimized out>, 
    env=<value optimized out>) at eval.c:3648
#26 0x001b3d0e in scm_call_0 (proc=0xb7ee9798) at eval.c:4666
#27 0x001b678e in apply_thunk (thunk=0xb7ee9798) at fluids.c:400
#28 0x001b696f in scm_c_with_fluid (fluid=0xb7f8f330, value=
    0xb7f53d80, cproc=0x1b6770 <apply_thunk>, cdata=0xb7ee9798)
    at fluids.c:463
#29 0x001b69c6 in scm_with_fluid (fluid=0xb7f8f330, value=
    0xb7f53d80, thunk=0xb7ee9798) at fluids.c:450
#30 0x001ac412 in ceval (x=<value optimized out>, env=0xb7ee97c0)
    at eval.c:4547
#31 0x001abf86 in ceval (x=0xb7ee9b78, env=0xb7ee97f0) at eval.c:4059
#32 0x001b438b in scm_primitive_eval_x (exp=0xb7ee9b78)
    at eval.c:5921
#33 0x001cee83 in scm_primitive_load (filename=0xb7f6d950)
    at load.c:109
#34 0x001acb4e in ceval (x=0x404, env=0xb7ee90e0) at eval.c:4232
#35 0x0019d8bc in scm_start_stack (id=0xb7f66070, exp=0xb7fc77b0, 
    env=0xb7ee90e0) at debug.c:457
#36 0x0019d954 in scm_m_start_stack (exp=<value optimized out>, env=
    0xb7ee90e0) at debug.c:473
#37 0x001ae8ad in scm_apply (proc=0xb7fcf1d0, arg1=0xb7fc77e8, 
    args=<value optimized out>) at eval.c:4882
#38 0x001abf86 in ceval (x=0xb7fc77e8, env=0xb7ee90e0) at eval.c:4059
#39 0x001b3d0e in scm_call_0 (proc=0xb7ee9118) at eval.c:4666
#40 0x001b678e in apply_thunk (thunk=0xb7ee9118) at fluids.c:400
#41 0x001b696f in scm_c_with_fluid (fluid=0x8053ff0, value=0x4, 
    cproc=0x1b6770 <apply_thunk>, cdata=0xb7ee9118) at fluids.c:463
#42 0x001b69c6 in scm_with_fluid (fluid=0x8053ff0, value=0x4, thunk=
    0xb7ee9118) at fluids.c:450
#43 0x001ac412 in ceval (x=<value optimized out>, env=0xb7ee9158)
    at eval.c:4547
#44 0x001b3d0e in scm_call_0 (proc=0xb7ee9298) at eval.c:4666
#45 0x001a241c in scm_dynamic_wind (in_guard=0xb7ee9250, thunk=
    0xb7ee9298, out_guard=0xb7ee9240) at dynwind.c:111
#46 0x001ac412 in ceval (x=<value optimized out>, env=0xb7ee9260)
    at eval.c:4547
#47 0x001b438b in scm_primitive_eval_x (exp=0xb7ee9338)
    at eval.c:5921
#48 0x0020d018 in inner_eval_string (data=0xb7f52178)
    at strports.c:500
#49 0x001b696f in scm_c_with_fluid (fluid=0x804f100, value=
    0xb7f6e5d0, cproc=0x20cff0 <inner_eval_string>, cdata=0xb7f52178)
    at fluids.c:463
#50 0x001d0746 in scm_c_call_with_current_module (module=0xb7f6e5d0, 
    func=0x20cff0 <inner_eval_string>, data=0xb7f52178)
    at modules.c:107
#51 0x0020d247 in scm_eval_string_in_module (string=0xb7f6d930, 
    module=0xb7f6e5d0) at strports.c:527
#52 0x001ac936 in ceval (x=0x404, env=0xb7f52188) at eval.c:4258
#53 0x001acf92 in ceval (x=<value optimized out>, env=0xb7f52188)
    at eval.c:3368
#54 0x001b438b in scm_primitive_eval_x (exp=0xb7f521b8)
    at eval.c:5921
#55 0x001b43e6 in scm_eval_x (exp=0xb7f521b8, module_or_state=
    0xb7f6e5d0) at eval.c:5956
#56 0x08048fdd in shell_main (data=0x0, argc=7, argv=0xbfffed24)
    at shell.c:222
#57 0x001cb3f7 in invoke_main_func (body_data=0xbfffec40)
    at init.c:367
#58 0x0019bf43 in c_body (d=0xbfffebb4) at continuations.c:349
#59 0x002135f4 in scm_c_catch (tag=0x104, body=0x19bf30 <c_body>, 
    body_data=0xbfffebb4, handler=0x19bf60 <c_handler>, handler_data=
    0xbfffebb4, pre_unwind_handler=
    0x212ab0 <scm_handle_by_message_noexit>, pre_unwind_handler_data=
    0x0) at throw.c:203
#60 0x0019c543 in scm_i_with_continuation_barrier (body=
    0x19bf30 <c_body>, body_data=0xbfffebb4, handler=
    0x19bf60 <c_handler>, handler_data=0xbfffebb4, 
    pre_unwind_handler=0x212ab0 <scm_handle_by_message_noexit>, 
    pre_unwind_handler_data=0x0) at continuations.c:325
#61 0x0019c624 in scm_c_with_continuation_barrier (func=
    0x1cb3b0 <invoke_main_func>, data=0xbfffec40)
    at continuations.c:367
#62 0x0021216a in scm_i_with_guile_and_parent (func=
    0x1cb3b0 <invoke_main_func>, data=0xbfffec40, parent=0x0)
    at threads.c:733
#63 0x0021225f in scm_with_guile (func=0x1cb3b0 <invoke_main_func>, 
    data=0xbfffec40) at threads.c:721
#64 0x001cb390 in scm_boot_guile (argc=7, argv=0xbfffed24, main_func=
    0x8048cf0 <shell_main>, closure=0x0) at init.c:350
#65 0x08048ccb in main (argc=7, argv=0xbfffed24) at shell.c:234


* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-10 12:43 Help needed debugging segfault with Guile 1.8.7 Peter TB Brett
@ 2010-11-10 21:35 ` Peter TB Brett
  2010-11-11 10:52   ` Peter Brett
  2010-11-11  8:22 ` rixed
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Peter TB Brett @ 2010-11-10 21:35 UTC (permalink / raw)
  To: guile-user

[-- Attachment #1: Type: text/plain, Size: 1215 bytes --]

Peter TB Brett <peter@peter-b.co.uk> writes:

> I've added a testsuite for the libgeda Scheme API.  In one commit [2],
> the testsuite succeeds.  In the following commit [3], a test which does
> not touch any of the changed code starts causing a segfault in the Guile
> interpreter.
>
> [snip]
>
> Valgrind / Memcheck wasn't very helpful either.  Is there a way to make
> Guile zero each new heap page it requests before using it in order to
> reduce the number of false positives the gc generates in Memcheck?
>
> In fact, *none* of the ways I've found of running the test without the
> segfault occurring have given me *any* clue as to how to fix it
> "properly".
>
> I'd really appreciate any suggestions that anyone might be able to give
> me on figuring out how I've managed to break things.  At the moment, I'm
> at a complete loss.

So far, as far as I can tell the problem is due to freelist corruption
in the garbage collector, and I'm not sufficiently au fait with Guile
internals to be able to diagnose what's going on.  Is this a known
issue?

                              Peter


-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre


* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-10 12:43 Help needed debugging segfault with Guile 1.8.7 Peter TB Brett
  2010-11-10 21:35 ` Peter TB Brett
@ 2010-11-11  8:22 ` rixed
  2010-11-11  8:33 ` Neil Jerram
  2010-11-11 13:30 ` Ludovic Courtès
  3 siblings, 0 replies; 15+ messages in thread
From: rixed @ 2010-11-11  8:22 UTC (permalink / raw)
  To: guile-user

General 2cents:

If you have multiple guile scheme, don't.
If you keep any SCM value across a call to scm_function(), protect it
from the GC.





* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-10 12:43 Help needed debugging segfault with Guile 1.8.7 Peter TB Brett
  2010-11-10 21:35 ` Peter TB Brett
  2010-11-11  8:22 ` rixed
@ 2010-11-11  8:33 ` Neil Jerram
  2010-11-11 13:30 ` Ludovic Courtès
  3 siblings, 0 replies; 15+ messages in thread
From: Neil Jerram @ 2010-11-11  8:33 UTC (permalink / raw)
  To: Peter TB Brett; +Cc: guile-user

Peter TB Brett <peter@peter-b.co.uk> writes:

> I'd really appreciate any suggestions that anyone might be able to give
> me on figuring out how I've managed to break things.  At the moment, I'm
> at a complete loss.

I'm afraid I only have two general ideas.

1. You can use gdb_print and gdb_output to see the Scheme values of SCM
variables.  For example, to see the value of `l', in the top frame of
your backtrace:

(gdb) call gdb_print(l)
(gdb) p gdb_output

2. You can try modifying your script in small ways to see if the problem
disappears.  If it does, that may give a clue about what the problem is.

Regards,
        Neil




* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-10 21:35 ` Peter TB Brett
@ 2010-11-11 10:52   ` Peter Brett
  2010-11-11 12:37     ` Thien-Thi Nguyen
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Brett @ 2010-11-11 10:52 UTC (permalink / raw)
  To: guile-user

Peter TB Brett <peter@peter-b.co.uk> writes:

> Peter TB Brett <peter@peter-b.co.uk> writes:

>> I'd really appreciate any suggestions that anyone might be able to give
>> me on figuring out how I've managed to break things.  At the moment, I'm
>> at a complete loss.
>
> So far, as far as I can tell the problem is due to freelist corruption
> in the garbage collector, and I'm not sufficiently au fait with Guile
> internals to be able to diagnose what's going on.  Is this a known
> issue?

So it turned out to be a stupid logic error in some weak ref management
code that I'd written.  Although in the process of finding it I've
learnt a lot about Guile internals, so it wasn't a complete write-off of
my time.

Note to self: bugs can lie in wait for you for a long time...

Thanks to all who provided debugging tips -- I'll keep them in mind for
the future.

                                  Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre





* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-11 10:52   ` Peter Brett
@ 2010-11-11 12:37     ` Thien-Thi Nguyen
  2010-11-11 14:22       ` Peter Brett
  0 siblings, 1 reply; 15+ messages in thread
From: Thien-Thi Nguyen @ 2010-11-11 12:37 UTC (permalink / raw)
  To: Peter Brett; +Cc: guile-user

() Peter Brett <peter@peter-b.co.uk>
() Thu, 11 Nov 2010 10:52:41 +0000

   stupid logic error in some weak ref management code

Could you please describe this error?




* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-10 12:43 Help needed debugging segfault with Guile 1.8.7 Peter TB Brett
                   ` (2 preceding siblings ...)
  2010-11-11  8:33 ` Neil Jerram
@ 2010-11-11 13:30 ` Ludovic Courtès
  3 siblings, 0 replies; 15+ messages in thread
From: Ludovic Courtès @ 2010-11-11 13:30 UTC (permalink / raw)
  To: guile-user

Hi,

Can you try to compile Guile with CPPFLAGS=-DSCM_DEBUG=1?  It might
report the problem before it hits.

Thanks,
Ludo’.





* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-11 12:37     ` Thien-Thi Nguyen
@ 2010-11-11 14:22       ` Peter Brett
  2010-11-28 11:38         ` Neil Jerram
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Brett @ 2010-11-11 14:22 UTC (permalink / raw)
  To: guile-user; +Cc: geda-dev

Thien-Thi Nguyen <ttn@gnuvola.org> writes:

> () Peter Brett <peter@peter-b.co.uk>
> () Thu, 11 Nov 2010 10:52:41 +0000
>
>>    stupid logic error in some weak ref management code
>
> Could you please describe this error?
>

Sure.  libgeda uses direct management of memory, and the structures used
in its document object model need to be explicitly deleted when finished
with.  I decided to use a Guile smob to represent these structures for
access from Scheme code, with the pointer to the actual structure in
SCM_SMOB_DATA and with the low nibble of SCM_SMOB_FLAGS indicating which
type of DOM structure the smob references.

This would have been sufficient if Scheme code had only been working
with libgeda DOMs created and managed entirely via Scheme code.
However, here Guile is being used simply to provide extensibility to
electronics engineering applications based on libgeda, such as gschem.
It would theoretically be possible for the following sequence of events
to occur:

 1. In a Scheme function called from the schematic editor, a transistor
    is instantiated, added to the current page, and also stashed
    somewhere in the Guile environment.

 2. A bit later, the user closes the page.  It is destroyed from C code,
    and so is the transistor instance.

 3. Finally, a Scheme function is called that unstashes the transistor
    instance, and tries to use it, leading to a segfault.

There were two main design considerations taken into account when
looking for a solution to this problem.  Firstly, I wanted it to be
impossible to make libgeda leak memory from Scheme code, so that doing
something like

   (do ((i 1 (1+ i))) ((> i 1000000))
       (make-transistor))

would be safe.  That meant that it had to be possible to destroy DOM
structures from the smob_free() function.  On the other hand, I wanted
to find a solution that avoided adding explicit Guile dependencies to
the core of libgeda (since I hope to split off the Scheme binding into a
separate library at some point).

The solution was to add weak reference facilities to the libgeda DOM
data structures.  A weak reference is added by calling a function
similar to:

  object_weak_ref (OBJECT *object, void (*notify_func)(void *, void *),
                   void *user_data);

The notify_func and user_data are prepended to a singly-linked list in
the OBJECT structure.  When the OBJECT is deleted via the C API, each
entry in the list is alerted by calling:

  notify_func (object, user_data)
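
A minimal, self-contained sketch of the mechanism just described may help.
It is not libgeda's code: the OBJECT layout, the hand-rolled list, and the
names object_destroy() and clear_pointer() are assumptions made for
illustration.  Only the prepend-then-notify behaviour comes from the
description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; libgeda's real OBJECT and list types differ. */
typedef void (*NotifyFunc)(void *object, void *user_data);

struct WeakRef {
  NotifyFunc notify_func;
  void *user_data;
  struct WeakRef *next;
};

typedef struct {
  struct WeakRef *weak_refs;   /* head of the singly-linked list */
} OBJECT;

/* Prepend a (notify_func, user_data) pair to the object's list. */
void object_weak_ref(OBJECT *object, NotifyFunc notify_func, void *user_data)
{
  struct WeakRef *entry = malloc(sizeof *entry);
  assert(entry != NULL);
  entry->notify_func = notify_func;
  entry->user_data = user_data;
  entry->next = object->weak_refs;
  object->weak_refs = entry;
}

/* Destroy the object from the C side: alert every holder, then free. */
void object_destroy(OBJECT *object)
{
  struct WeakRef *entry = object->weak_refs;
  while (entry != NULL) {
    struct WeakRef *next = entry->next;
    entry->notify_func(object, entry->user_data);
    free(entry);
    entry = next;
  }
  free(object);
}

/* Example holder callback: user_data points at the holder's own pointer,
   which is cleared when the target goes away. */
void clear_pointer(void *object, void *user_data)
{
  (void) object;
  *(void **) user_data = NULL;
}
```

A holder registers with object_weak_ref (obj, clear_pointer, &my_ptr) and
afterwards only has to check my_ptr for NULL before dereferencing it.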

The notification function I use for the weak references held by smobs
looks like this:

  static void
  smob_weakref_notify (void *target, void *smob) {
    SCM s = (SCM) smob;
    SCM_SET_SMOB_DATA (s, NULL);
  }

I've provided wrapper functions & macros for checking smob validity
(i.e. non-null smob data) before allowing any dereference of
e.g. (OBJECT *) SCM_SMOB_DATA (smob).

To allow garbage collection of DOM element smobs where possible, I use
bit 4 of the SCM_SMOB_FLAGS as a `GC allowed' flag.  Any API functions
that put a DOM element in a state where it may be destroyed from C code
rather than the smob_free() function are required to clear the flag (and
vice versa).
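
The flag layout described here (type in the low nibble, `GC allowed' in
bit 4) can be sketched with plain integers.  The macro and function names
below are hypothetical, and the sketch deliberately avoids the Guile smob
macros so that it stands alone; only the bit positions come from the text.

```c
#include <assert.h>

/* Hypothetical names; only the bit layout comes from the description. */
#define DOM_TYPE_MASK   0x0fu        /* low nibble: kind of DOM structure */
#define DOM_GC_ALLOWED  (1u << 4)    /* bit 4: smob_free() may destroy it */

/* Build a flags word from a type code (0..15) and the GC-allowed state. */
unsigned dom_flags_make(unsigned type, int gc_allowed)
{
  assert(type <= DOM_TYPE_MASK);
  return type | (gc_allowed ? DOM_GC_ALLOWED : 0u);
}

/* Extract the type code from the low nibble. */
unsigned dom_flags_type(unsigned flags)
{
  return flags & DOM_TYPE_MASK;
}

/* Non-zero if the collector is allowed to destroy the DOM structure. */
int dom_flags_gc_allowed(unsigned flags)
{
  return (flags & DOM_GC_ALLOWED) != 0;
}

/* Set or clear bit 4, leaving the type nibble untouched. */
unsigned dom_flags_set_gc_allowed(unsigned flags, int allowed)
{
  return allowed ? (flags | DOM_GC_ALLOWED) : (flags & ~DOM_GC_ALLOWED);
}
```

An API function that hands a structure over to C-side ownership would
clear the bit with dom_flags_set_gc_allowed (flags, 0) and store the
result back into the smob.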

So, where was the bug?  When a smob is GC'd, and if the pointer it
contains hasn't already been cleared, it calls:

  object_weak_unref (SCM_SMOB_DATA (smob), smob_weakref_notify, smob);

Before I fixed the bug, object_weak_unref() contained code that looked
something like this:

  for (iter = weak_refs; iter != NULL; iter = list_next (iter)) {
    struct WeakRef *entry = iter->data;
    if ((entry->notify_func == notify_func) &&
        (entry->user_data != user_data)) { // ERROR: != should be ==
      free (entry);
      iter->data = NULL;
    }
  }
  weak_refs = list_remove_all (weak_refs, NULL);

Now, how does this result in Guile GC freelist corruption?  It requires
two smobs to be created for the same DOM structure S, let's say A and B.
(This can only occur if S is being managed from C code, so we know that
the `GC allowed' flag will be cleared).

            smob_addr      cell CAR         cell CDR
     A      0x1000         0x0              <ptr to S>
     B      0x1008         0x0              <ptr to S>

     Weakref user_data for S: 0x1008, 0x1000

Now, smob A is garbage collected.  Because we've told smob_free that
it's not allowed to destroy S, the smob_free() function calls
object_weak_unref().  Since the latter is broken, it clears the wrong
weak reference.  Now things look like this:

            smob_addr      cell CAR         cell CDR
     A      0x1000         <magic number>   <ptr to next free cell>
     B      0x1008         0x0              <ptr to S>

     Weakref user_data for S: 0x1000

Some time later, S is destroyed from C code, and this results in the
smob_weakref_notify() function described earlier being called thus:

   smob_weakref_notify (<pointer to S>, 0x1000)

After S is destroyed, things look like this:

            smob_addr      cell CAR         cell CDR
     A      0x1000         <magic number>   0x0 (OH NOES)
     B      0x1008         0x0              <ptr to S>

At this stage, the freelist has become corrupted, and will result in a
segfault in scm_cell() at some indeterminate future time.
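
The effect of that stray write can be reproduced on a toy free list.  This
is only an illustration under simplifying assumptions: the two-word cell
and the NULL-terminated list below are not Guile's real heap layout, and
all names are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Toy two-word cell; not Guile's real representation. */
struct cell { void *car; void *cdr; };

/* Thread n cells into a free list through their cdr fields. */
struct cell *freelist_init(struct cell *cells, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    cells[i].car = NULL;
    cells[i].cdr = (i + 1 < n) ? (void *) &cells[i + 1] : NULL;
  }
  return n ? cells : NULL;
}

/* Count the cells still reachable from the head of the free list. */
size_t freelist_length(const struct cell *head)
{
  size_t n = 0;
  for (; head != NULL; head = head->cdr)
    n++;
  return n;
}

/* What the stale notification does: overwrite the cdr of a cell that
   the collector has already put back on the free list. */
void stale_notify(struct cell *freed_cell)
{
  freed_cell->cdr = NULL;
}
```

With four free cells, calling stale_notify() on the second one leaves only
two reachable from the head.  The toy list is merely truncated because it
uses NULL as its terminator; an allocator that does not expect a zero link
in that position would presumably follow it and crash, which fits the
backtrace ending in scm_cell().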

With the fix in this commit:

  http://git.gpleda.org/?p=gaf.git;a=commit;h=41ea61b2f156

this memory corruption does not occur.
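
For comparison, here is a sketch of a removal routine with the comparison
the right way round.  It is not the code from the commit above: the list
type and the names weak_refs_prepend(), weak_refs_remove() and
noop_notify() are assumptions, and it unlinks entries in place instead of
using the two-pass approach of the original.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*NotifyFunc)(void *object, void *user_data);

struct WeakRef {
  NotifyFunc notify_func;
  void *user_data;
  struct WeakRef *next;
};

/* Prepend an entry and return the new list head. */
struct WeakRef *weak_refs_prepend(struct WeakRef *head,
                                  NotifyFunc notify_func, void *user_data)
{
  struct WeakRef *entry = malloc(sizeof *entry);
  assert(entry != NULL);
  entry->notify_func = notify_func;
  entry->user_data = user_data;
  entry->next = head;
  return entry;
}

/* Remove every entry matching BOTH notify_func and user_data, and
   return the new list head.  Entries for other smobs are left alone. */
struct WeakRef *weak_refs_remove(struct WeakRef *head,
                                 NotifyFunc notify_func, void *user_data)
{
  struct WeakRef **link = &head;
  while (*link != NULL) {
    struct WeakRef *entry = *link;
    if (entry->notify_func == notify_func &&
        entry->user_data == user_data) {
      *link = entry->next;   /* unlink before freeing */
      free(entry);
    } else {
      link = &entry->next;
    }
  }
  return head;
}

/* Placeholder callback so entries have something to compare against. */
void noop_notify(void *object, void *user_data)
{
  (void) object;
  (void) user_data;
}
```

In the two-smob scenario above, removing the entry for A now leaves the
entry for B in place, so the later notification writes to a live smob
rather than to a freed cell.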

I hope that explained things reasonably precisely!

Regards,

                                 Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre





* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-11 14:22       ` Peter Brett
@ 2010-11-28 11:38         ` Neil Jerram
  2010-11-28 17:21           ` Linas Vepstas
  2010-11-30 19:43           ` Peter TB Brett
  0 siblings, 2 replies; 15+ messages in thread
From: Neil Jerram @ 2010-11-28 11:38 UTC (permalink / raw)
  To: Peter Brett; +Cc: guile-user, geda-dev

Hi Peter,

Thanks for providing such a clear explanation of the problem.  Here are
a few comments.

Peter Brett <peter@peter-b.co.uk> writes:

> Sure.  libgeda uses direct management of memory, and the structures used
> in its document object model need to be explicitly deleted when finished
> with.  I decided to use a Guile smob to represent these structures for
> access from Scheme code, with the pointer to the actual structure in
> SCM_SMOB_DATA and with the low nibble of SCM_SMOB_FLAGS indicating which
> type of DOM structure the smob references.
>
> This would have been sufficient if Scheme code had only been working
> with libgeda DOMs created and managed entirely via Scheme code. [...]

I think your design is similar to what is outlined in the `Extending
Dia' node of the Guile manual.  Were you aware of that doc before
working out your design?  If not, I guess we need to make it more
prominent.  If yes, I'd appreciate any suggestions you have for how it
may be improved.

> So, where was the bug?  When a smob is GC'd, and if the pointer it
> contains hasn't already been cleared, [...]

Now that you've successfully debugged this, is there any general advice
that you would offer for "how to investigate a free list corruption?"  I
would guess not, as corruption is fundamentally a general thing and
has infinite possible causes - but perhaps I'm missing something.

> I hope that explained things reasonably precisely!

Thank you, it certainly did.  To conclude, I'll just note that in the
Guile 2.0 future we won't have such difficult problems, because of using
libgc - which will automatically find active references anywhere in the
whole application.  (And of course I understand that your code still
needs to work with Guile 1.8.x now.)

Regards,
        Neil




* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-28 11:38         ` Neil Jerram
@ 2010-11-28 17:21           ` Linas Vepstas
  2010-11-30 19:56             ` Peter TB Brett
  2010-11-30 19:43           ` Peter TB Brett
  1 sibling, 1 reply; 15+ messages in thread
From: Linas Vepstas @ 2010-11-28 17:21 UTC (permalink / raw)
  To: Neil Jerram; +Cc: guile-user, Peter Brett, geda-dev

[-- Attachment #1: Type: text/plain, Size: 2904 bytes --]

On 28 November 2010 05:38, Neil Jerram <neil@ossau.uklinux.net> wrote:

> Peter Brett <peter@peter-b.co.uk> writes:
>
>> Sure.  libgeda uses direct management of memory, and the structures used
>> in its document object model need to be explicitly deleted when finished
>> with.  I decided to use a Guile smob to represent these structures for
>> access from Scheme code, with the pointer to the actual structure in
>> SCM_SMOB_DATA and with the low nibble of SCM_SMOB_FLAGS indicating which
>> type of DOM structure the smob references.
>>
>> This would have been sufficient if Scheme code had only been working
>> with libgeda DOMs created and managed entirely via Scheme code. [...]
>
> I think your design is similar to what is outlined in the `Extending
> Dia' node of the Guile manual.  Were you aware of that doc before
> working out your design?  If not, I guess we need to make it more
> prominent.  If yes, I'd appreciate any suggestions you have for how it
> may be improved.

Yes, I code almost entirely 'by example', so having a good cookbook is
critical for me. I haven't read 'Extending Dia' before; it's probably more
recent than the last time I set up guile bindings, some 2-3 years ago.
I skimmed it briefly just now.

Several comments on example code:
1) it's typically not possible to wrap the C main(), so having a well-defined
init() that happens some time later would be best.

2) http://www.gnu.org/software/guile/manual/html_node/Dia-Hook.html
is lame.  What I have to do is this:

   SCM rc = scm_c_catch (SCM_BOOL_T,
                         (scm_t_catch_body) scm_c_eval_string,
                         (void *) expr_str,
                         SchemeEval::catch_handler_wrapper, this,
                         SchemeEval::preunwind_handler_wrapper, this);

and my catch_handler and preunwind_handler are fairly involved.

Basically, if you are going to let users enter arbitrary scheme into
your app, they *will* enter malformed, broken expressions, and you
have to deal with these.  Among other things, you have to give them
a clue as to what the error was -- some sort of trace, error reporting.

For me, this was implementing a REPL loop look-alike in my app.
I can't say "work-alike", I think mine *almost* works-alike, but not sure.

It took me a *long* time to figure out I needed the pre-unwind version
of things, and even then, it took me a fair amount of effort to figure
out what to put in there.

Having a section showing how to implement a repl-work-alike loop
in one's app, with reasonable debugging/stack-printing output,
would be nice to have.  Figuring out how to do this on one's
own requires a lot of tenacity.

For the record, I've attached the code I wrote to do the above (and
to multi-thread, which someone later on disabled :-( It's in C++,
sorry about that, don't blame me.)

--linas

[-- Attachment #2: SchemeEval.h --]
[-- Type: text/x-chdr, Size: 3283 bytes --]

/*
 * SchemeEval.h
 *
 * Simple scheme expression evaluator
 * Copyright (c) 2008 Linas Vepstas <linas@linas.org>
 */

#ifndef OPENCOG_SCHEME_EVAL_H
#define OPENCOG_SCHEME_EVAL_H
#ifdef HAVE_GUILE

#include <string>
#include <pthread.h>
#include <libguile.h>
#include <opencog/atomspace/Handle.h>

namespace opencog {

class SchemeEval
{
	private:
		// Initialization stuff
		void init(void);
		static void * c_wrap_init(void *);
		void per_thread_init(void);
		void thread_lock(void);
		void thread_unlock(void);
	
		// destructor stuff
		void finish(void);
		static void * c_wrap_finish(void *);
	
		// Things related to evaluation
		std::string do_eval(const std::string &);
		static void * c_wrap_eval(void *);
		static void * c_wrap_eval_h(void *);
		const std::string *pexpr;
		std::string answer;
		
		std::string input_line;
		bool pending_input;
	
		// straight-up evaluation
		static SCM wrap_scm_eval(void *);
		SCM do_scm_eval(SCM);
		SCM do_scm_eval_str(const std::string &);
	
		// Handle apply
		Handle do_apply(const std::string& func, Handle varargs);
		SCM do_apply_scm(const std::string& func, Handle varargs);
		Handle hargs;
		static void * c_wrap_apply(void *);
		static void * c_wrap_apply_scm(void *);
	
		// Error handling stuff
		SCM error_string_port;
		SCM captured_stack;
		static SCM preunwind_handler_wrapper(void *, SCM, SCM);
		static SCM catch_handler_wrapper(void *, SCM, SCM);
		SCM preunwind_handler(SCM, SCM);
		SCM catch_handler(SCM, SCM);
		bool caught_error;
	
		// printing of basic types
		static std::string prt(SCM);
	
		// output port
		SCM outport;
		SCM saved_outport;
	
		// Make constructor, destructor private; force
		// everyone to use the singleton instance, for now.
		SchemeEval(void);
		~SchemeEval();
		static SchemeEval* singletonInstance;
		
	public:
					
		std::string eval(const std::string &);
		Handle eval_h(const std::string &);
		Handle apply(const std::string& func, Handle varargs);
		std::string apply_generic(const std::string& func, Handle varargs);
	
		bool input_pending(void);
		void clear_pending(void);
		bool eval_error(void);
	
		// Someone thinks that there is some scheme threading bug somewhere,
		// and the current hack around this is to use a singleton instance.
		static SchemeEval& instance(void)
		{
			if (!singletonInstance) 
				singletonInstance = new SchemeEval();
			return *singletonInstance;
		}
};

}

#else /* HAVE_GUILE */

#include <string>
#include <opencog/atomspace/Handle.h>

namespace opencog {

class SchemeEval
{
	private:
		static SchemeEval* singletonInstance;
	public:
		std::string eval(const std::string &s) { return ""; }
		Handle eval_h(const std::string &s) { return Handle::UNDEFINED; }
		Handle apply(const std::string &s, Handle args) {
			return Handle::UNDEFINED; }
		std::string apply_generic(const std::string& f, Handle args) {
			return ""; }
	
		bool input_pending(void) { return false; }
		void clear_pending(void) {}

		// If guile is not installed, then *every* call to eval_error()
		// must report that an error occurred! 
		bool eval_error(void) { return true; }

		static SchemeEval& instance(void)
		{
			if (!singletonInstance) 
				singletonInstance = new SchemeEval();
			return *singletonInstance;
		}
};

}
#endif/* HAVE_GUILE */
#endif /* OPENCOG_SCHEME_EVAL_H */

[-- Attachment #3: SchemeExec.cc --]
[-- Type: text/x-c++src, Size: 1934 bytes --]

/*
 * SchemeExec.cc
 *
 * Execute ExecutionLink's
 * Copyright (c) 2009 Linas Vepstas <linasvepstas@gmail.com>
 */

#ifdef HAVE_GUILE

#include <libguile.h>

#include <opencog/atomspace/Link.h>
#include <opencog/server/CogServer.h>

#include <boost/shared_ptr.hpp>

#include "SchemeEval.h"
#include "SchemeSmob.h"

using namespace opencog;

/**
 * do_apply -- apply named function func to arguments in ListLink
 * It is assumed that varargs is a ListLink, containing a list of
 * atom handles. This list is unpacked, and then the function func
 * is applied to them. If the function returns an atom handle, then
 * this is returned.
 */
Handle SchemeEval::do_apply(const std::string &func, Handle varargs)
{
	// Apply the function to the args
	SCM sresult = do_apply_scm (func, varargs);
	
	// If the result is a handle, return the handle.
	if (!SCM_SMOB_PREDICATE(SchemeSmob::cog_handle_tag, sresult))
	{
		return Handle::UNDEFINED;
	}
	return SchemeSmob::scm_to_handle(sresult);
}

/**
 * do_apply_scm -- apply named function func to arguments in ListLink
 * It is assumed that varargs is a ListLink, containing a list of
 * atom handles. This list is unpacked, and then the function func
 * is applied to them. The SCM value returned by the function is returned.
 */
SCM SchemeEval::do_apply_scm( const std::string& func, Handle varargs )
{
	SCM sfunc = scm_from_locale_symbol(func.c_str());
	SCM expr = SCM_EOL;
	
	// If there were args, pass the args to the function.
	boost::shared_ptr<Link> largs = cogserver().getAtomSpace()->cloneLink(varargs);
	if (largs)
	{
		const std::vector<Handle> &oset = largs->getOutgoingSet();
		
		size_t sz = oset.size();
		for (int i=sz-1; i>=0; i--)
		{
			Handle h = oset[i];
			SCM sh = SchemeSmob::handle_to_scm(h);
			expr = scm_cons(sh, expr);
		}
	}
	expr = scm_cons(sfunc, expr);
	return do_scm_eval(expr);
}

#endif
/* ===================== END OF FILE ============================ */


* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-28 11:38         ` Neil Jerram
  2010-11-28 17:21           ` Linas Vepstas
@ 2010-11-30 19:43           ` Peter TB Brett
  2010-12-01 13:46             ` Ludovic Courtès
  1 sibling, 1 reply; 15+ messages in thread
From: Peter TB Brett @ 2010-11-30 19:43 UTC (permalink / raw)
  To: guile-user

[-- Attachment #1: Type: text/plain, Size: 2398 bytes --]

Neil Jerram <neil@ossau.uklinux.net> writes:

> [snip]
>
> I think your design is similar to what is outlined in the `Extending
> Dia' node of the Guile manual.  Were you aware of that doc before
> working out your design?  If not, I guess we need to make it more
> prominent.  If yes, I'd appreciate any suggestions you have for how it
> may be improved.

I wasn't aware of that document, despite using the Guile Manual every
day!  Could I please suggest that filing it under "Programming Overview"
probably isn't the best place for it?

I think it should be highlighted that adding a Guile-specific field to
dia_shape (or its equivalent) may not always be possible, e.g. if Guile
is just one of several language bindings.
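
When the shared structure cannot grow a binding-specific field, one
possible alternative is for the binding to keep its own side table
that maps each C object to its per-binding data.  The following is
only a minimal sketch under that assumption: `side_table' and its
functions are hypothetical names, not part of Guile, Dia or libgeda,
and the fixed-size linear table stands in for whatever hash table the
application already has.

```c
#include <assert.h>
#include <stddef.h>

#define SIDE_TABLE_MAX 16   /* arbitrary capacity for this sketch */

/* One association: an application object and the data that a single
 * language binding wants to attach to it (for Guile, this would be
 * the smob wrapping the object). */
typedef struct side_entry {
    const void *object;
    void *binding_data;
} side_entry;

typedef struct side_table {
    side_entry entries[SIDE_TABLE_MAX];
    size_t count;
} side_table;

/* Attach or replace the data for an object.  Returns 0 on success,
 * -1 if the table is full. */
static int side_table_set(side_table *t, const void *object, void *data)
{
    assert(object != NULL);
    for (size_t i = 0; i < t->count; i++) {
        if (t->entries[i].object == object) {
            t->entries[i].binding_data = data;
            return 0;
        }
    }
    if (t->count == SIDE_TABLE_MAX)
        return -1;
    t->entries[t->count].object = object;
    t->entries[t->count].binding_data = data;
    t->count++;
    return 0;
}

/* Look up the data for an object; NULL if none is registered. */
static void *side_table_get(const side_table *t, const void *object)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->entries[i].object == object)
            return t->entries[i].binding_data;
    }
    return NULL;
}

/* Forget an object, e.g. when the application destroys it. */
static void side_table_remove(side_table *t, const void *object)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->entries[i].object == object) {
            t->entries[i] = t->entries[t->count - 1];
            t->count--;
            return;
        }
    }
}
```

Each binding would own one such table, so the core structure stays
free of language-specific fields.  The sketch deliberately ignores
garbage collection: a real Guile binding would still have to make sure
the stored values are either protected from the GC or removed from the
table before they are freed.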

>> So, where was the bug?  When a smob is GC'd, and if the pointer it
>> contains hasn't already been cleared, [...]
>
> Now that you've successfully debugged this, is there any general
> advice that you would offer for "how to investigate a free list
> corruption?"  I would guess not, as corruption is fundamentally a
> general thing and has infinite possible causes - but perhaps I'm
> missing something.

One thing that would have been *AWESOME* is if Guile 1.8.x's GC had used
the macros defined in Valgrind's `memcheck.h' (which is BSD licensed
IIRC).  It would make running programs with libguile under Valgrind so
much more useful, and would have *instantly* highlighted what was going
wrong with my code -- it would probably have saved me a couple of days
of beating my head against what turned out to be a really simple bug.
(There's literally no runtime overhead if a program's not being run
under Valgrind).
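
As a rough illustration of the idea (this is not Guile's GC code), a
cell allocator that keeps its free list inside the freed cells could
use the Memcheck client requests roughly as follows.  The `pool' type
and its functions are hypothetical; only the VALGRIND_MAKE_MEM_*
macros come from `memcheck.h'.  The sketch falls back to no-op
definitions when the Valgrind headers are not installed.

```c
#include <assert.h>
#include <stddef.h>

/* Use the real client-request macros when the Valgrind headers are
 * available; otherwise define no-ops so the sketch still builds. */
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define HAVE_MEMCHECK 1
#  endif
#endif
#ifndef HAVE_MEMCHECK
#  define VALGRIND_MAKE_MEM_NOACCESS(addr, len)  ((void)(addr), (void)(len))
#  define VALGRIND_MAKE_MEM_UNDEFINED(addr, len) ((void)(addr), (void)(len))
#  define VALGRIND_MAKE_MEM_DEFINED(addr, len)   ((void)(addr), (void)(len))
#endif

#define POOL_CELLS 4   /* tiny pool, enough for a demonstration */

typedef union cell {
    union cell *next_free;   /* meaningful only while on the free list */
    double payload[2];       /* user data while the cell is live */
} cell;

typedef struct pool {
    cell cells[POOL_CELLS];
    cell *free_list;
} pool;

/* Put every cell on the free list and mark it inaccessible, so that
 * Memcheck reports any access to a cell that has not been allocated. */
static void pool_init(pool *p)
{
    p->free_list = NULL;
    for (size_t i = 0; i < POOL_CELLS; i++) {
        p->cells[i].next_free = p->free_list;
        p->free_list = &p->cells[i];
        VALGRIND_MAKE_MEM_NOACCESS(&p->cells[i], sizeof(cell));
    }
}

/* Pop a cell.  The link lives inside the poisoned cell, so make just
 * the link readable before following it, then hand the whole cell
 * out as addressable but uninitialised memory. */
static void *pool_alloc(pool *p)
{
    cell *c = p->free_list;
    if (c == NULL)
        return NULL;
    VALGRIND_MAKE_MEM_DEFINED(c, sizeof c->next_free);
    p->free_list = c->next_free;
    VALGRIND_MAKE_MEM_UNDEFINED(c, sizeof(cell));
    return c;
}

/* Push a cell back and poison it; a later use through a stale
 * pointer then shows up as an invalid access under Memcheck. */
static void pool_free(pool *p, void *ptr)
{
    cell *c = ptr;
    assert(c != NULL);
    c->next_free = p->free_list;
    p->free_list = c;
    VALGRIND_MAKE_MEM_NOACCESS(c, sizeof(cell));
}
```

With annotations of this kind, a smob whose data still points at a
cell that has already gone back to the free list would be flagged at
the point of the bad access, rather than showing up much later as a
corrupted free list.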

>> I hope that explained things reasonably precisely!
>
> Thank you, it certainly did.  To conclude, I'll just note that in the
> Guile 2.0 future we won't have such difficult problems, because of
> using libgc - which will automatically find active references anywhere
> in the whole application.  (And of course I understand that your code
> still needs to work with Guile 1.8.x now.)

Thanks for the info.  Judging by previous experience, gEDA will need to
support Guile 1.8.x for at least two years after Guile 2.x arrives.
It's probably going to be painful. :-/

                                   Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre


* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-28 17:21           ` Linas Vepstas
@ 2010-11-30 19:56             ` Peter TB Brett
  2010-12-01 19:48               ` Andy Wingo
  0 siblings, 1 reply; 15+ messages in thread
From: Peter TB Brett @ 2010-11-30 19:56 UTC (permalink / raw)
  To: guile-user

[-- Attachment #1: Type: text/plain, Size: 1972 bytes --]

Linas Vepstas <linasvepstas@gmail.com> writes:

> Basically, if you are going to let users enter arbitrary scheme into
> your app, they *will* enter malformed, broken expressions, and you
> have to deal with these.  Among other things, you have to give them
> a clue as to what the error was -- some sort of trace, error reporting.
>
> For me, this was implementing a REPL loop look-alike in my app.
> I can't say "work-alike", I think mine *almost* works-alike, but not sure.
>
> It took me a *long* time to figure out I needed the pre-unwind version
> of things, and even then, it took me a fair amount of effort to figure
> out what to put in there.
>
> Having a section showing how to implement a repl-work-alike loop
> in one's app, with reasonable debugging/stack-printing output,
> would be nice to have.  Figuring out how to do this on one's
> own requires a lot of tenacity.

Guile 2.x would be hugely more attractive as an extension language if
there were some higher-level interfaces for application developers.  In
gEDA, we have to use a ridiculous amount of C code just to set things up
so that user-provided Scheme expressions can't leave the user staring at
a command prompt wondering where the last hour's work disappeared to
(and even that's not as reliable as we'd like).

Ideally, there should be:

 - An easily re-usable/embeddable and well-documented REPL, to aid
   developers in implementing a "Scheme Interaction Window" or similar
   in their apps.

 - A high-level API for passing strings for evaluation, that reports
   back the results or error information in a way that's easy to deal
   with from C.

In a perfect world, these would be part of the same API. ;-) I believe
that these would actually be applicable to the majority of use-cases of
libguile.

Cheers,

                                 Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre


* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-30 19:43           ` Peter TB Brett
@ 2010-12-01 13:46             ` Ludovic Courtès
  2010-12-03  7:52               ` Peter TB Brett
  0 siblings, 1 reply; 15+ messages in thread
From: Ludovic Courtès @ 2010-12-01 13:46 UTC (permalink / raw)
  To: guile-user

Hello,

Peter TB Brett <peter@peter-b.co.uk> writes:

> Thanks for the info.  Judging by previous experience, gEDA will need to
> support Guile 1.8.x for at least two years after Guile 2.x arrives.
> It's probably going to be painful. :-/

Out of curiosity, did you try building/running gEDA with Guile 1.9.x,
the pre-releases of 2.x?

If not, we’d be glad if you could test it and report back on any
difficulties you may encounter.  The 1.9 tarballs are available from
alpha.gnu.org.  The ‘NEWS’ file documents some incompatibility issues.

Thanks,
Ludo’.





* Re: Help needed debugging segfault with Guile 1.8.7
  2010-11-30 19:56             ` Peter TB Brett
@ 2010-12-01 19:48               ` Andy Wingo
  0 siblings, 0 replies; 15+ messages in thread
From: Andy Wingo @ 2010-12-01 19:48 UTC (permalink / raw)
  To: Peter TB Brett; +Cc: guile-user

On Tue 30 Nov 2010 20:56, Peter TB Brett <peter@peter-b.co.uk> writes:

>  - An easily re-usable/embeddable and well-documented REPL, to aid
>    developers in implementing a "Scheme Interaction Window" or similar
>    in their apps.

This is surprisingly difficult to do. A traditional REPL is the bottom
of the main loop of an application (though "application" might be
stretching it); integrating it with other main loops is tricky.

Essentially you need threads. Either pthreads, which bring their own
issues ("what thread is calling my app??? I thought i didn't use
threads?"), coroutines (but since `read' is implemented in C, delimited
continuations that involve the readers are not resumable), or simulated
threads (for example, the gtk repl in guile-gnome that runs recursive
main loops inside soft port readers).

Dunno. Perhaps there are other interfaces that are less repl-like, but
still useful. Suggestions and patches are welcome :-)

>  - A high-level API for passing strings for evaluation, that reports
>    back the results or error information in a way that's easy to deal
>    with from C.

What would this API look like?

Andy
-- 
http://wingolog.org/




* Re: Help needed debugging segfault with Guile 1.8.7
  2010-12-01 13:46             ` Ludovic Courtès
@ 2010-12-03  7:52               ` Peter TB Brett
  0 siblings, 0 replies; 15+ messages in thread
From: Peter TB Brett @ 2010-12-03  7:52 UTC (permalink / raw)
  To: guile-user

[-- Attachment #1: Type: text/plain, Size: 873 bytes --]

ludo@gnu.org (Ludovic Courtès) writes:

> Peter TB Brett <peter@peter-b.co.uk> writes:
>
>> Thanks for the info.  Judging by previous experience, gEDA will need to
>> support Guile 1.8.x for at least two years after Guile 2.x arrives.
>> It's probably going to be painful. :-/
>
> Out of curiosity, did you try building/running gEDA with Guile 1.9.x,
> the pre-releases of 2.x?
>
> If not, we’d be glad if you could test it and report back on any
> difficulties you may encounter.  The 1.9 tarballs are available from
> alpha.gnu.org.  The ‘NEWS’ file documents some incompatibility issues.
>

I'll try and make time to give it a go today.  Is it also at all useful
to try a build against Guile's git master?

                                    Peter

-- 
Peter Brett <peter@peter-b.co.uk>
Remote Sensing Research Group
Surrey Space Centre


Thread overview: 15+ messages
2010-11-10 12:43 Help needed debugging segfault with Guile 1.8.7 Peter TB Brett
2010-11-10 21:35 ` Peter TB Brett
2010-11-11 10:52   ` Peter Brett
2010-11-11 12:37     ` Thien-Thi Nguyen
2010-11-11 14:22       ` Peter Brett
2010-11-28 11:38         ` Neil Jerram
2010-11-28 17:21           ` Linas Vepstas
2010-11-30 19:56             ` Peter TB Brett
2010-12-01 19:48               ` Andy Wingo
2010-11-30 19:43           ` Peter TB Brett
2010-12-01 13:46             ` Ludovic Courtès
2010-12-03  7:52               ` Peter TB Brett
2010-11-11  8:22 ` rixed
2010-11-11  8:33 ` Neil Jerram
2010-11-11 13:30 ` Ludovic Courtès
