Cygwin port of Guile 2.2

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

* Cygwin port of Guile 2.2
@ 2017-04-04 16:45 Mike Gran
  2017-04-04 17:03 ` Eli Zaretskii
  2017-04-14  8:35 ` Andy Wingo
  0 siblings, 2 replies; 16+ messages in thread
From: Mike Gran @ 2017-04-04 16:45 UTC (permalink / raw)
  To: guile-devel

Hello All,

I push a branch that mostly makes Guile 2.2 work on Cygwin.

In the end, I had a few issues.

Cygwin's uselocale is broken and can segrault.  I fell back
to the standard nl-langinfo.  I understand the issue and may
push a patch upstream to Cygwin sometime.

Also Cygwin's langinfo is incomplete with regards to LC
MONETARY.  This also is fixable upstream in Cygwin.

Cygwin requires stability on both ends of a pipe being used
for inter-thread communication.  If one tries to read from a
pipe that is closing on the other end, Cygwin may fault.  I
worked around this by replacing pipes with a
producer/consumer pattern of semaphores and mutex-protected
variables.  A dirty workaround.

Lastly, Cygwin's fork just doesn't play well with Guile's
new GC and threading model.  Despite a copy of days in the
guts of it all, I don't understand the specifics of the
issue.  So, in the branch, I disabled forking altogether,
and by extension, popen and friends.

While I couldn't get fork to work, the other threading
primitives seem fine.

The branch is wip-cygwin-guile-2.2.

Thanks,

Mike Gran

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-04-04 16:45 Cygwin port of Guile 2.2 Mike Gran
@ 2017-04-04 17:03 ` Eli Zaretskii
  2017-04-14  8:35 ` Andy Wingo
  1 sibling, 0 replies; 16+ messages in thread
From: Eli Zaretskii @ 2017-04-04 17:03 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel

> From: "Mike Gran" <spk121@yahoo.com>
> Date: Tue, 04 Apr 2017 09:45:43 -0700
> 
> Lastly, Cygwin's fork just doesn't play well with Guile's
> new GC and threading model.  Despite a copy of days in the
> guts of it all, I don't understand the specifics of the
> issue.  So, in the branch, I disabled forking altogether,
> and by extension, popen and friends.

For popen, you could use the MinGW code, I think.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-04-04 16:45 Cygwin port of Guile 2.2 Mike Gran
  2017-04-04 17:03 ` Eli Zaretskii
@ 2017-04-14  8:35 ` Andy Wingo
  2017-04-14 13:41   ` Derek Upham
  1 sibling, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2017-04-14  8:35 UTC (permalink / raw)
  To: Mike Gran; +Cc: guile-devel

On Tue 04 Apr 2017 18:45, "Mike Gran" <spk121@yahoo.com> writes:

> I push a branch that mostly makes Guile 2.2 work on Cygwin.

Neat.  LGTM with some nits, with the exception of the "(throw
'unresolved)" patch -- the corresponding catch is in
test-suite/test-suite/lib.scm:344.

> Cygwin's uselocale is broken and can segrault.  I fell back
> to the standard nl-langinfo.  I understand the issue and may
> push a patch upstream to Cygwin sometime.

That would be great.  Can you add a comment to this effect to the line
where you force the nl_langinfo path?  It would be nice to be able to
remove this in the future.

> Also Cygwin's langinfo is incomplete with regards to LC
> MONETARY.  This also is fixable upstream in Cygwin.

I didn't recall seeing a patch here but if there was, please note this
situation there.

> Cygwin requires stability on both ends of a pipe being used
> for inter-thread communication.  If one tries to read from a
> pipe that is closing on the other end, Cygwin may fault.  I
> worked around this by replacing pipes with a
> producer/consumer pattern of semaphores and mutex-protected
> variables.  A dirty workaround.

This was the part I liked least, as you can imagine :)  Oh well
though.  I think it is fine as it is, if you add a comment as to why
it's there.  Of course if you can find a solution that works with pipes,
that would be ideal!

> Lastly, Cygwin's fork just doesn't play well with Guile's
> new GC and threading model.  Despite a copy of days in the
> guts of it all, I don't understand the specifics of the
> issue.  So, in the branch, I disabled forking altogether,
> and by extension, popen and friends.

I guess the new thing being the finalizer thread?  In theory before the
fork, Guile will stop the finalizer thread.  If that's not working, that
could be it.

Another option would be to disable fork() but only if Guile is built
with threads.  Actually this is probably better -- I bet there are Guile
people that expect to be able to fork on Cygwin and would be OK if their
Guile had no threads.

Please feel free to push to master once there are adequate comments.
Someone will come later and want to enable fork() on Cygwin with threads
and will need to know why it's disabled and what they could do to fix it
:)

Thanks!

Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-04-14  8:35 ` Andy Wingo
@ 2017-04-14 13:41   ` Derek Upham
  2017-04-17  8:04     ` Andy Wingo
  0 siblings, 1 reply; 16+ messages in thread
From: Derek Upham @ 2017-04-14 13:41 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

Andy Wingo <wingo@pobox.com> writes:

> I guess the new thing being the finalizer thread?  In theory before the
> fork, Guile will stop the finalizer thread.  If that's not working, that
> could be it.
>
> Another option would be to disable fork() but only if Guile is built
> with threads.  Actually this is probably better -- I bet there are Guile
> people that expect to be able to fork on Cygwin and would be OK if their
> Guile had no threads.
>
> Please feel free to push to master once there are adequate comments.
> Someone will come later and want to enable fork() on Cygwin with threads
> and will need to know why it's disabled and what they could do to fix it
> :)

There’s the finalizer thread, but there’s also the signal delivery
thread.  The “sigaction” code ensures that signal delivery thread is
running.  And note that “primitive-fork” is supposed to display a
warning if you are forking with outstanding threads, but it explicitly
ignores the signal delivery thread during its check.

I have found that what actually hangs after a fork are the mutexes
supporting the threads: they are kernel-level resources, referenced by
ID, and end up being shared between parent and child.

I’m baking a fix that wipes out user signals before calling
“primitve-fork”, and then restores them separately in the parent and
child processes.  Doing so gives them different mutexes.  It required
some support in scmsigs.c to:

- Have “restore-signals” return a description of the signal handlers,
  that we can use to restore those handlers.
- Have “restore-signals” close the signal pipe (which kills the signal
  delivery thread).
- When closing the signal pipe, wait for the signal delivery thread to
  exit.

I don’t think there’s any safe way to restore the finalizer thread and
support SCSH-style (begin ...) process forms.  Shutting down the
finalizer thread is the best we can do.

All of this is on GNU/Linux, of course, not Cygwin.

Derek

-- 
Derek Upham
sand@blarg.net

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-04-14 13:41   ` Derek Upham
@ 2017-04-17  8:04     ` Andy Wingo
  2017-04-17 15:05       ` Derek Upham
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2017-04-17  8:04 UTC (permalink / raw)
  To: Derek Upham; +Cc: guile-devel

On Fri 14 Apr 2017 15:41, Derek Upham <sand@blarg.net> writes:

> There’s the finalizer thread, but there’s also the signal delivery
> thread.  The “sigaction” code ensures that signal delivery thread is
> running.  And note that “primitive-fork” is supposed to display a
> warning if you are forking with outstanding threads, but it explicitly
> ignores the signal delivery thread during its check.

Yep, though the signal delivery thread doesn't run unless you have
installed a signal handler.  I think currently the combination of
signals + guile-with-threads + primitive-fork (rather than open-process)
isn't supported.  It warns in 2.2 but the situation was the same in
2.0.  I am not sure how it can be made to work (though I did try!) and
would be interested to see your fix :)

> I have found that what actually hangs after a fork are the mutexes
> supporting the threads: they are kernel-level resources, referenced by
> ID, and end up being shared between parent and child.

Which ones, precisely?

> I don’t think there’s any safe way to restore the finalizer thread and
> support SCSH-style (begin ...) process forms.  Shutting down the
> finalizer thread is the best we can do.

The finalizer thread should be restored as needed, the next time GC
calls notify_finalizers_to_run.

I think also that if you are most interested in a system in which
primitive-fork plays a large role, then probably you want a Guile
without threads (including the GC mark threads).  Threads + fork is not
a recipe for success :)

Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-04-17  8:04     ` Andy Wingo
@ 2017-04-17 15:05       ` Derek Upham
  2017-05-01 20:48         ` Derek Upham
  0 siblings, 1 reply; 16+ messages in thread
From: Derek Upham @ 2017-04-17 15:05 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

Andy Wingo <wingo@pobox.com> writes:

>> I have found that what actually hangs after a fork are the mutexes
>> supporting the threads: they are kernel-level resources, referenced by
>> ID, and end up being shared between parent and child.
>
> Which ones, precisely?

GDB shows the hanging child process waiting on a mutex down in the
BDW-GC garbage collection library.  The parent has the mutex, and is
blocked waiting to read data from the child.

>> I don’t think there’s any safe way to restore the finalizer thread and
>> support SCSH-style (begin ...) process forms.  Shutting down the
>> finalizer thread is the best we can do.
>
> The finalizer thread should be restored as needed, the next time GC
> calls notify_finalizers_to_run.

If you have two, initially identical Guile processes running, each of
which depend on the same external resource needing finalization, how do
you control cleanup in a way that’s safe for both processes?  If either
the parent or the child finalizes, the other process may still be
depending on the external resource.  The correct policy seems to be
“Don’t use finalizers with primitive-fork” in the same way that we say
“Don’t use threads with primitive-fork”.  It’s a problem with
(begin ...) process forms, not with situations where the process execs
another program.

> I think also that if you are most interested in a system in which
> primitive-fork plays a large role, then probably you want a Guile
> without threads (including the GC mark threads).  Threads + fork is not
> a recipe for success :)

Understood.  However, saying “primitive-fork requires a separate,
threadless Guile installation” would be very limiting.  I’d like to be
in a position where Guile will manage any threads that it creates
under the hood; then it’s up to the program owner to guarantee that
there are no other threads running at fork time.  Right now the
under-the-hood threads appear to be:

1. The finalizer thread.
2. The signal delivery thread.

(If you can think of any others, let me know.)

For SCSH-style process forms, this comes up when trying to implement
SCSH’s “early” child process auto-reap mechanism (section 3.4.1 in the
SCSH manual).  The “early” policy works by setting up a handler for
SIGCHLD.  We are guaranteed to have the signal delivery thread in that
situation.  (The “late” policy hooks into the garbage collector, so it
avoids this problem; it can have pathological edge cases for
long-running programs, however.)

Derek

-- 
Derek Upham
sand@blarg.net

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-04-17 15:05       ` Derek Upham
@ 2017-05-01 20:48         ` Derek Upham
  2017-05-02 19:35           ` Andy Wingo
  0 siblings, 1 reply; 16+ messages in thread
From: Derek Upham @ 2017-05-01 20:48 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

Derek Upham <sand@blarg.net> writes:
>> I think also that if you are most interested in a system in which
>> primitive-fork plays a large role, then probably you want a Guile
>> without threads (including the GC mark threads).  Threads + fork is not
>> a recipe for success :)
>
> Understood.  However, saying “primitive-fork requires a separate,
> threadless Guile installation” would be very limiting.  I’d like to be
> in a position where Guile will manage any threads that it creates
> under the hood; then it’s up to the program owner to guarantee that
> there are no other threads running at fork time.  Right now the
> under-the-hood threads appear to be:
>
> 1. The finalizer thread.
> 2. The signal delivery thread.
>
> (If you can think of any others, let me know.)
>
> For SCSH-style process forms, this comes up when trying to implement
> SCSH’s “early” child process auto-reap mechanism (section 3.4.1 in the
> SCSH manual).  The “early” policy works by setting up a handler for
> SIGCHLD.  We are guaranteed to have the signal delivery thread in that
> situation.  (The “late” policy hooks into the garbage collector, so it
> avoids this problem; it can have pathological edge cases for
> long-running programs, however.)

It turns out that, in the current Guile implementation, the thread cleanup code has inherent race conditions.  Running pthread_join() on a thread only guarantees that the thread has returned an exit value.  It doesn’t say anything about whether the thread has finally terminnated (which for Guile means running the thread-local variable cleanup using on_thread_exit()).  If there is a delay in running the thread-local variable cleanup (which happens occasionally), we end up forking a child that doesn’t run the thread-local variable cleanup.  In that case, the child process never updates the global list of threads and doesn’t unregister the thread from the garbage collector.

In these sorts of situations, the standard operating procedure is to extract most of the on_thread_exit() code into its own function, which the parent can call synchronously before the fork.  For safety when doing explicit cleanup, we would:

- Use the thread admin mutex to provide cross-thread protection.
- Use the thread struct “exited” field to prevent double-freeing.

(To make things more interesting, on_thread_exit() can end up shutting down the signal delivery thread.  We will have to handle or avoid a recursive mutex lock.)

However, the on_thread_exit() function has the comment

  /* This handler is executed in non-guile mode. */

at the top.  Can someone explain what that means and what the implications are?  How can I safely invoke the contained code from other contexts?

GDB also shows that Guile is spawning two threads at startup, and those don’t show up in the list of threads known to the system.

Thanks,

Derek

-- 
Derek Upham
sand@blarg.net

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-01 20:48         ` Derek Upham
@ 2017-05-02 19:35           ` Andy Wingo
  2017-05-03  3:18             ` Derek Upham
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2017-05-02 19:35 UTC (permalink / raw)
  To: Derek Upham; +Cc: guile-devel

On Mon 01 May 2017 22:48, Derek Upham <sand@blarg.net> writes:

> Running pthread_join() on a thread only guarantees that the thread has
> returned an exit value.

Would you mind providing a reference please?  It is not that I don't
believe you but I think it's important to know whether this is a bug in
Guile or in the pthreads implementation.

> the on_thread_exit() function has the comment
>
>   /* This handler is executed in non-guile mode. */
>
> at the top.  Can someone explain what that means and what the
> implications are?  How can I safely invoke the contained code from
> other contexts?

https://www.gnu.org/software/guile/manual/html_node/Initialization.html

Basically this code is not automatically traced by GC, it can't
allocate, and it can't call (most) libguile functions.

Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-02 19:35           ` Andy Wingo
@ 2017-05-03  3:18             ` Derek Upham
  2017-05-03  9:24               ` Andy Wingo
  2017-05-04  5:21               ` zv
  0 siblings, 2 replies; 16+ messages in thread
From: Derek Upham @ 2017-05-03  3:18 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

Andy Wingo <wingo@pobox.com> writes:

> On Mon 01 May 2017 22:48, Derek Upham <sand@blarg.net> writes:
>
>> Running pthread_join() on a thread only guarantees that the thread has
>> returned an exit value.
>
> Would you mind providing a reference please?  It is not that I don't
> believe you but I think it's important to know whether this is a bug in
> Guile or in the pthreads implementation.

It’s not explicit, but it’s heavily implied by the pthread_exit(3) man
page:

       The pthread_exit() function terminates the calling thread and
       returns a value via retval that (if the thread is joinable) is
       available to another thread in the same process that calls
       pthread_join(3).

       Any clean-up handlers established by pthread_cleanup_push(3) that
       have not yet been popped, are popped (in the reverse of the order
       in which they were pushed) and executed.  If the thread has any
       thread-specific data, then, after the clean-up handlers have been
       executed, the corresponding destructor functions are called, in
       an unspecified order.

The retval becomes available to anyone joining.  Then the clean-up
handlers run.  Then the thread-specific destructors run.

I think the flaw is that Guile is using the thread-specific destructors
to update global values.  The thread is still effectively “live” (has
visible external effects) until the destructors finish running, even
though it appears to be dead if you were to ask pthreads about it.  If
the destructor were just freeing heap memory (for example), then no one
would notice the delayed actions.

Derek

-- 
Derek Upham
sand@blarg.net

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-03  3:18             ` Derek Upham
@ 2017-05-03  9:24               ` Andy Wingo
  2017-05-03  9:39                 ` szgyg
  2017-05-03 14:21                 ` Derek Upham
  2017-05-04  5:21               ` zv
  1 sibling, 2 replies; 16+ messages in thread
From: Andy Wingo @ 2017-05-03  9:24 UTC (permalink / raw)
  To: Derek Upham; +Cc: guile-devel

On Wed 03 May 2017 05:18, Derek Upham <sand@blarg.net> writes:

> Andy Wingo <wingo@pobox.com> writes:
>
>> On Mon 01 May 2017 22:48, Derek Upham <sand@blarg.net> writes:
>>
>>> Running pthread_join() on a thread only guarantees that the thread has
>>> returned an exit value.
>>
>> Would you mind providing a reference please?  It is not that I don't
>> believe you but I think it's important to know whether this is a bug in
>> Guile or in the pthreads implementation.
>
> It’s not explicit, but it’s heavily implied by the pthread_exit(3) man
> page:

A specific implementation's man page is interesting but not the
specification.  The specification is "The Open Group Base Specifications
Issue 7 IEEE Std 1003.1-2008, 2016 Edition" and can be found here:

  http://pubs.opengroup.org/onlinepubs/9699919799/

Note that pthread_exit() is not the normal way for a Guile thread to
exit.  I thought we removed uses of this but I see that there are a
couple more that need to be removed.  Are you using it?  It's certainly
not one that we suggest.  We don't use pthread_cleanup_push either --
the lifetime of the thread from Guile's POV isn't limited to a dynamic
extent of scm_with_guile, as the thread could be created by the user, as

  https://lists.gnu.org/archive/html/guile-devel/2011-04/msg00133.html

Anyway, I digress.  After looking at the specification of pthread_join
(http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_join.html),
things do not seem clear to me.  Does the thread "terminate" before or
after the pthread_key destructors run?  Given that:

  * pthread_key destructors must run in the thread AFAIU; to be able to
    call pthread_setspecific in a destructor you need a fresh
    or from a fresh new thread, and spawning a new thread to run cleanup
    seems unlikely

  * pthread_key destructors can call any function

I think there's an argument that a thread doesn't "terminate" until its
thread-local key destructors have finished running, and therefore
pthread_join doesn't return until after the key destructors have run.
This is my understanding of what happens from reading NPTL.  Do I
understand correctly that you are on Cygwin?  Could it be a cygwin
pthreads incompatibility?

Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-03  9:24               ` Andy Wingo
@ 2017-05-03  9:39                 ` szgyg
  2017-05-03 14:21                 ` Derek Upham
  1 sibling, 0 replies; 16+ messages in thread
From: szgyg @ 2017-05-03  9:39 UTC (permalink / raw)
  To: guile-devel

On Wed, May 03, 2017 at 11:24:30AM +0200, Andy Wingo wrote:
> On Wed 03 May 2017 05:18, Derek Upham <sand@blarg.net> writes:
> 
> > Andy Wingo <wingo@pobox.com> writes:
> >
> >> On Mon 01 May 2017 22:48, Derek Upham <sand@blarg.net> writes:
> >>
> >>> Running pthread_join() on a thread only guarantees that the thread has
> >>> returned an exit value.
> >>
> >> Would you mind providing a reference please?  It is not that I don't
> >> believe you but I think it's important to know whether this is a bug in
> >> Guile or in the pthreads implementation.
> >
> > It’s not explicit, but it’s heavily implied by the pthread_exit(3) man
> > page:
> 
> A specific implementation's man page is interesting but not the
> specification.  The specification is "The Open Group Base Specifications
> Issue 7 IEEE Std 1003.1-2008, 2016 Edition" and can be found here:
> 
>   http://pubs.opengroup.org/onlinepubs/9699919799/
> 
> Note that pthread_exit() is not the normal way for a Guile thread to
> exit.  I thought we removed uses of this but I see that there are a
> couple more that need to be removed.  Are you using it?  It's certainly
> not one that we suggest.  We don't use pthread_cleanup_push either --
> the lifetime of the thread from Guile's POV isn't limited to a dynamic
> extent of scm_with_guile, as the thread could be created by the user, as
> 
>   https://lists.gnu.org/archive/html/guile-devel/2011-04/msg00133.html
> 
> Anyway, I digress.  After looking at the specification of pthread_join
> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_join.html),
> things do not seem clear to me.  Does the thread "terminate" before or
> after the pthread_key destructors run?  Given that:
> 
>   * pthread_key destructors must run in the thread AFAIU; to be able to
>     call pthread_setspecific in a destructor you need a fresh
>     or from a fresh new thread, and spawning a new thread to run cleanup
>     seems unlikely
> 
>   * pthread_key destructors can call any function
> 
> I think there's an argument that a thread doesn't "terminate" until its
> thread-local key destructors have finished running, and therefore
> pthread_join doesn't return until after the key destructors have run.
> This is my understanding of what happens from reading NPTL.  Do I
> understand correctly that you are on Cygwin?  Could it be a cygwin
> pthreads incompatibility?
> 
> Andy
> 

http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09
2.9.5 Thread Cancellation
Thread Cancellation Cleanup Handlers

#v+
After all cancellation cleanup handlers and thread-specific data
destructors have returned, thread execution is terminated.
#v-

szgyg



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-03  9:24               ` Andy Wingo
  2017-05-03  9:39                 ` szgyg
@ 2017-05-03 14:21                 ` Derek Upham
  2017-05-09 19:08                   ` Andy Wingo
  1 sibling, 1 reply; 16+ messages in thread
From: Derek Upham @ 2017-05-03 14:21 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

Andy Wingo <wingo@pobox.com> writes:
> I think there's an argument that a thread doesn't "terminate" until its
> thread-local key destructors have finished running, and therefore
> pthread_join doesn't return until after the key destructors have run.
> This is my understanding of what happens from reading NPTL.  Do I
> understand correctly that you are on Cygwin?  Could it be a cygwin
> pthreads incompatibility?

GNU/Linux, Debian unstable, 4.9.x kernel.  I can’t get too fancy with the stderr tracing in a multi-threaded environment, but I think I can demonstrate the delay.

Here’s a successful run in GDB:

  Catchpoint 1 (load)
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
  Breakpoint 2 at 0x7ffff7b7937c: file posix.c, line 1239.
  [New Thread 0x7ffff6197700 (LWP 2225)]
  [New Thread 0x7ffff5996700 (LWP 2226)]
  [New Thread 0x7ffff5195700 (LWP 2227)]
  [New Thread 0x7ffff44ff700 (LWP 2228)]
  [New Thread 0x7ffff39fe700 (LWP 2229)]
  checking device "/dev/disk/by-id/ata-WDC_WD1001FALS-00E8B0_WD-WMATV4103981-part1"
  scm_join_thread start
  on_thread_exit start
  scm_join_thread end
  scm_fork #threads (start) = 2
  on_thread_exit end
  on_thread_exit start
  on_thread_exit end
  scm_fork #threads (end) = 1
  [Thread 0x7ffff39fe700 (LWP 2229) exited]
  [Thread 0x7ffff44ff700 (LWP 2228) exited]
  got timestamp string: "1493715602" "/dev/disk/by-id/ata-WDC_WD1001FALS-00E8B0_WD-WMATV4103981-part1"
  using device: "/dev/disk/by-id/ata-WDC_WD1001FALS-00E8B0_WD-WMATV4103981-part1"
  [Thread 0x7ffff5195700 (LWP 2227) exited]
  [Thread 0x7ffff6197700 (LWP 2225) exited]
  [Thread 0x7ffff7fcb740 (LWP 2218) exited]
  [Inferior 1 (process 2218) exited normally]

The scm_join_thread function starts, then on_thread_exit starts, then scm_join_thread ends.  (The underlying join operation completed.)  Notice that we have already gotten into scm_fork before the cleanup finishes.  Those are all for the signal delivery thread.  Between the scm_fork start and end traces, we run the pre-fork finalizer cleanup, which causes the second on_thread_exit pair.

Unsuccessful run:

  Catchpoint 1 (load)
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
  Breakpoint 2 at 0x7ffff7b7937c: file posix.c, line 1239.
  [New Thread 0x7ffff6197700 (LWP 2239)]
  [New Thread 0x7ffff5996700 (LWP 2240)]
  [New Thread 0x7ffff5195700 (LWP 2241)]
  [New Thread 0x7ffff44ff700 (LWP 2242)]
  [New Thread 0x7ffff39fe700 (LWP 2243)]
  checking device "/dev/disk/by-id/ata-WDC_WD1001FALS-00E8B0_WD-WMATV4103981-part1"
  scm_join_thread start
  scm_join_thread end
  scm_fork #threads (start) = 3
  on_thread_exit start
  on_thread_exit end
  scm_fork #threads (end) = 2
  [Thread 0x7ffff44ff700 (LWP 2242) exited]
  on_thread_exit start
  on_thread_exit end
  [Thread 0x7ffff39fe700 (LWP 2243) exited]

  Thread 1 "guile" hit Breakpoint 2, scm_fork () at posix.c:1239
  1239          scm_display

We join on the signal delivery thread and then set scm_i_signal_delivery_thread to NULL.  But we reach the start of scm_fork without any call to on_thread_exit.  The finalizer thread exits and we clean it up.  At that point scm_all_threads reports three threads, because (a) nothing has cleaned up the active thread list, and (b) none of the threads in the list matches scm_i_signal_delivery_thread (which is NULL).  We reach the scm_fork warning block, where I have a breakpoint.  The signal delivery thread cleanup finally happens between the breakpoint detection and the prompt.

Another successful run:

  scm_join_thread start
  on_thread_exit start
  scm_join_thread end
  on_thread_exit end
  scm_fork #threads (start) = 2
  on_thread_exit start
  on_thread_exit end
  scm_fork #threads (end) = 1
  [Thread 0x7ffff39fe700 (LWP 2212) exited]
  [Thread 0x7ffff44ff700 (LWP 2211) exited]

We join before on_thread_exit completes, but on_thread_exit completes before we start forking.

Another successful run:

  scm_join_thread start
  on_thread_exit start
  on_thread_exit end
  scm_join_thread end
  scm_fork #threads (start) = 2
  [Thread 0x7ffff39fe700 (LWP 2198) exited]
  on_thread_exit start
  on_thread_exit end
  scm_fork #threads (end) = 1
  [Thread 0x7ffff44ff700 (LWP 2197) exited]

on_thread_exit completes before we join.  (This is the happy case.)

Two things I just noticed:

1. Even in the successful cases there are delays between on_thread_exit completing and GDB reporting that the thread has exited.  Look at the second successful run (third in the overall list).  That’s a huge time gap.  It’s circumstantial evidence, but it’s suggestive.

2. In the first successful case (first it the overall list), the finalizer cleanup doesn’t start until the signal delivery thread cleanup completes.  There’s nothing at the code level enforcing that.  I wonder whether there is a single, hidden, background thread that coordinates the cleanup.  That’s pure speculation.

Derek

-- 
Derek Upham
sand@blarg.net

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-03  3:18             ` Derek Upham
  2017-05-03  9:24               ` Andy Wingo
@ 2017-05-04  5:21               ` zv
  1 sibling, 0 replies; 16+ messages in thread
From: zv @ 2017-05-04  5:21 UTC (permalink / raw)
  To: guile-devel

On 05/02/2017 08:18 PM, Derek Upham wrote:
> Andy Wingo <wingo@pobox.com> writes:
> 
>> On Mon 01 May 2017 22:48, Derek Upham <sand@blarg.net> writes:
>>
>>> Running pthread_join() on a thread only guarantees that the thread has
>>> returned an exit value.
>>
>> Would you mind providing a reference please?  It is not that I don't
>> believe you but I think it's important to know whether this is a bug in
>> Guile or in the pthreads implementation.
> 
> It’s not explicit, but it’s heavily implied by the pthread_exit(3) man
> page:
> 
>        The pthread_exit() function terminates the calling thread and
>        returns a value via retval that (if the thread is joinable) is
>        available to another thread in the same process that calls
>        pthread_join(3).
> 
>        Any clean-up handlers established by pthread_cleanup_push(3) that
>        have not yet been popped, are popped (in the reverse of the order
>        in which they were pushed) and executed.  If the thread has any
>        thread-specific data, then, after the clean-up handlers have been
>        executed, the corresponding destructor functions are called, in
>        an unspecified order.
> 
> The retval becomes available to anyone joining.  Then the clean-up
> handlers run.  Then the thread-specific destructors run.
> 
> I think the flaw is that Guile is using the thread-specific destructors
> to update global values.  The thread is still effectively “live” (has
> visible external effects) until the destructors finish running, even
> though it appears to be dead if you were to ask pthreads about it.  If
> the destructor were just freeing heap memory (for example), then no one
> would notice the delayed actions.
> 
> Derek
> 

I believe Andy is spot-on here. I'm not familiar with a pthreads implementation
where pthread_join() doesn't wait until the referenced tid is finished (unless
you have reused a tid) (there may be a POSIX specification page dictating these
sort of standards however)





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-03 14:21                 ` Derek Upham
@ 2017-05-09 19:08                   ` Andy Wingo
  2017-05-12 14:13                     ` Derek Upham
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2017-05-09 19:08 UTC (permalink / raw)
  To: Derek Upham; +Cc: guile-devel

On Wed 03 May 2017 16:21, Derek Upham <sand@blarg.net> writes:

>   scm_join_thread start

scm_join_thread isn't actually implemented in terms of
scm_i_pthread_join any more.  Probably that's what's going wrong here --
and probably that should be fixed to ensure that we actually join the
thread.  (Otherwise it would be a memory leak too AFAIU.)  Bcc'ing
bug-guile to create a bug for that.

Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-09 19:08                   ` Andy Wingo
@ 2017-05-12 14:13                     ` Derek Upham
  2017-05-15 20:06                       ` Andy Wingo
  0 siblings, 1 reply; 16+ messages in thread
From: Derek Upham @ 2017-05-12 14:13 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

Andy Wingo <wingo@pobox.com> writes:

> scm_join_thread isn't actually implemented in terms of
> scm_i_pthread_join any more.  Probably that's what's going wrong here --
> and probably that should be fixed to ensure that we actually join the
> thread.  (Otherwise it would be a memory leak too AFAIU.)  Bcc'ing
> bug-guile to create a bug for that.

I noticed that scm_join_thread was calling back into Scheme-land.  Are these statements all correct?

- We are using call-with-new-thread underneath the hood.
- call-with-new-thread is documented to return a Scheme object from a thunk/handler.  Any underlying pthreads should be implementation details.
- The spawned thread sends the Scheme object to the condition variable as soon as the user thunk exits.  Any number of operations can happen afterwards; the thread is still running in Scheme-land at this point, in call-with-new-thread’s wrapping thunk.
- join-thread waits on the condition variable only.

So at the end of join-thread we need to add a call to scm_i_pthread_join (which we implement in threads.c) to ensure that the pthread is completely gone before that join-thread returns.  Is that accurate?  We can also update stop_finalization_thread to use the new scm_i_pthread_join.

Unfortunately, I think the GC threads are going to end up being immovable objects in the path to full process-form support.

Derek

-- 
Derek Upham
sand@blarg.net

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Cygwin port of Guile 2.2
  2017-05-12 14:13                     ` Derek Upham
@ 2017-05-15 20:06                       ` Andy Wingo
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Wingo @ 2017-05-15 20:06 UTC (permalink / raw)
  To: Derek Upham; +Cc: 26858, guile-devel

Greets,

On Fri 12 May 2017 16:13, Derek Upham <sand@blarg.net> writes:

> Andy Wingo <wingo@pobox.com> writes:
>
>> scm_join_thread isn't actually implemented in terms of
>> scm_i_pthread_join any more.  Probably that's what's going wrong here --
>> and probably that should be fixed to ensure that we actually join the
>> thread.  (Otherwise it would be a memory leak too AFAIU.)  Bcc'ing
>> bug-guile to create a bug for that.
>
> I noticed that scm_join_thread was calling back into Scheme-land.  Are these statements all correct?
>
> - We are using call-with-new-thread underneath the hood.

Underneath the hood of what? :)

> - call-with-new-thread is documented to return a Scheme object from a
> thunk/handler.  Any underlying pthreads should be implementation
> details.

Correct.  In practice call-with-new-thread will create a pthread but I
can imagine circumstances in which it might (in the future) spawn an
auxiliary pthread for some reason, and I wouldn't want to rule that out.

> - The spawned thread sends the Scheme object to the condition variable
> as soon as the user thunk exits.  Any number of operations can happen
> afterwards; the thread is still running in Scheme-land at this point,
> in call-with-new-thread’s wrapping thunk.
> - join-thread waits on the condition variable only.

These are implementation details.  They are correct but probably the
implementation should change to do the scm_i_pthread_join and we should
guarantee that after the join, the thread is really dead.  This is bug
26858.

> So at the end of join-thread we need to add a call to
> scm_i_pthread_join (which we implement in threads.c) to ensure that
> the pthread is completely gone before that join-thread returns.  Is
> that accurate?

Well... yes, but we have to ensure that we call scm_i_pthread_join at
most once.  I think calling pthread_join twice on a thread is
undefined.  So there are some gnarlies here.  Need to fix this.

> Unfortunately, I think the GC threads are going to end up being
> immovable objects in the path to full process-form support.

You can disable marker threads with the GC_MARKERS environment variable,
and the finalization thread should come and go as needed.  Probably this
is not a blocker from your POV.  Signal handling is probably the most
serious issue; perhaps we can avoid the thread somehow, since we handle
signals asynchronously anyway..

Andy

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-05-15 20:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-04-04 16:45 Cygwin port of Guile 2.2 Mike Gran
2017-04-04 17:03 ` Eli Zaretskii
2017-04-14  8:35 ` Andy Wingo
2017-04-14 13:41   ` Derek Upham
2017-04-17  8:04     ` Andy Wingo
2017-04-17 15:05       ` Derek Upham
2017-05-01 20:48         ` Derek Upham
2017-05-02 19:35           ` Andy Wingo
2017-05-03  3:18             ` Derek Upham
2017-05-03  9:24               ` Andy Wingo
2017-05-03  9:39                 ` szgyg
2017-05-03 14:21                 ` Derek Upham
2017-05-09 19:08                   ` Andy Wingo
2017-05-12 14:13                     ` Derek Upham
2017-05-15 20:06                       ` Andy Wingo
2017-05-04  5:21               ` zv

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).