From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.org!not-for-mail
From: Mark H Weaver <mhw@netris.org>
Newsgroups: gmane.lisp.guile.bugs
Subject: bug#19883: Smob's mark_smob has become unreliable in Guile 2.x
Date: Sun, 01 Mar 2015 12:02:41 -0500
Message-ID: <87vbik647i.fsf@netris.org>
References: <87twyln70f.fsf@fencepost.gnu.org> <87zj7x5hxr.fsf@netris.org>
	<87385p9ggl.fsf@fencepost.gnu.org>
NNTP-Posting-Host: plane.gmane.org
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: ger.gmane.org 1425229406 17621 80.91.229.3 (1 Mar 2015 17:03:26 GMT)
X-Complaints-To: usenet@ger.gmane.org
NNTP-Posting-Date: Sun, 1 Mar 2015 17:03:26 +0000 (UTC)
Cc: 19883@debbugs.gnu.org
To: David Kastrup <dak@gnu.org>
Original-X-From: bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org Sun Mar 01 18:03:16 2015
Return-path: <bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org>
Envelope-to: guile-bugs@m.gmane.org
Original-Received: from lists.gnu.org ([208.118.235.17])
	by plane.gmane.org with esmtp (Exim 4.69)
	(envelope-from <bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org>)
	id 1YS7Gq-0003Qw-Lm
	for guile-bugs@m.gmane.org; Sun, 01 Mar 2015 18:03:12 +0100
Original-Received: from localhost ([::1]:53298 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org>)
	id 1YS7Gp-0008Ja-EH
	for guile-bugs@m.gmane.org; Sun, 01 Mar 2015 12:03:11 -0500
Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:49660)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1YS7Gk-0008Hj-7g
	for bug-guile@gnu.org; Sun, 01 Mar 2015 12:03:08 -0500
Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1YS7Gg-00075T-UK
	for bug-guile@gnu.org; Sun, 01 Mar 2015 12:03:06 -0500
Original-Received: from debbugs.gnu.org ([140.186.70.43]:58130)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1YS7Gg-00075O-Qq
	for bug-guile@gnu.org; Sun, 01 Mar 2015 12:03:02 -0500
Original-Received: from Debian-debbugs by debbugs.gnu.org with local (Exim 4.80)
	(envelope-from <Debian-debbugs@debbugs.gnu.org>) id 1YS7Gg-0007h3-GH
	for bug-guile@gnu.org; Sun, 01 Mar 2015 12:03:02 -0500
X-Loop: help-debbugs@gnu.org
Resent-From: Mark H Weaver <mhw@netris.org>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
Resent-CC: bug-guile@gnu.org
Resent-Date: Sun, 01 Mar 2015 17:03:02 +0000
Resent-Message-ID: <handler.19883.B19883.142522935929540@debbugs.gnu.org>
Resent-Sender: help-debbugs@gnu.org
X-GNU-PR-Message: followup 19883
X-GNU-PR-Package: guile
X-GNU-PR-Keywords: 
Original-Received: via spool by 19883-submit@debbugs.gnu.org id=B19883.142522935929540
	(code B ref 19883); Sun, 01 Mar 2015 17:03:02 +0000
Original-Received: (at 19883) by debbugs.gnu.org; 1 Mar 2015 17:02:39 +0000
Original-Received: from localhost ([127.0.0.1]:33495 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1YS7GI-0007gN-S2
	for submit@debbugs.gnu.org; Sun, 01 Mar 2015 12:02:39 -0500
Original-Received: from world.peace.net ([50.252.239.5]:60624 ident=hope8)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <mhw@netris.org>) id 1YS7GC-0007gA-I6
	for 19883@debbugs.gnu.org; Sun, 01 Mar 2015 12:02:36 -0500
Original-Received: from [10.1.10.78] (helo=jojen)
	by world.peace.net with esmtpsa (TLS1.0:RSA_AES_128_CBC_SHA1:16)
	(Exim 4.72) (envelope-from <mhw@netris.org>)
	id 1YS7G0-0004fC-Ch; Sun, 01 Mar 2015 12:02:20 -0500
In-Reply-To: <87385p9ggl.fsf@fencepost.gnu.org> (David Kastrup's message of
	"Sun, 01 Mar 2015 11:09:46 +0100")
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.4 (gnu/linux)
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x
X-Received-From: 140.186.70.43
X-BeenThere: bug-guile@gnu.org
List-Id: "Bug reports for GUILE,
	GNU's Ubiquitous Extension Language" <bug-guile.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/bug-guile>,
	<mailto:bug-guile-request@gnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/archive/html/bug-guile>
List-Post: <mailto:bug-guile@gnu.org>
List-Help: <mailto:bug-guile-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/bug-guile>,
	<mailto:bug-guile-request@gnu.org?subject=subscribe>
Errors-To: bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org
Original-Sender: bug-guile-bounces+guile-bugs=m.gmane.org@gnu.org
Xref: news.gmane.org gmane.lisp.guile.bugs:7738
Archived-At: <http://permalink.gmane.org/gmane.lisp.guile.bugs/7738>

David Kastrup <dak@gnu.org> writes:

> Since LilyPond relies on C++ data structures and containers for its
> operation and those are also involved in the mark walks, "can be called
> on arbitrary garbage"

Who are you quoting here?

> But the mark-on-general-garbage problem seems sort of system-immanent to
> conservative garbage collection [...]

Marking arbitrary garbage is certainly not inherent (is that the word
you meant to use?) to conservative garbage collection.  There may be
bugs, but BDWGC is designed to prevent this.  See the "Allocation" and
"Mark phase" sections of <http://www.hboehm.info/gc/gcdescr.html>.  In
particular, from the "Allocation" section:

  The collector uses a two level allocator. A large block is defined to
  be one larger than half of HBLKSIZE, which is a power of 2, typically
  on the order of the page size.

  Large block sizes are rounded up to the next multiple of HBLKSIZE and
  then allocated by GC_allochblk. Recent versions of the collector use
  an approximate best fit algorithm by keeping free lists for several
  large block sizes. The actual implementation of GC_allochblk is
  significantly complicated by black-listing issues (see below).

  Small blocks are allocated in chunks of size HBLKSIZE. Each chunk is
  dedicated to only one object size and kind.

>From the "Mark phase" section:

  We determine whether a candidate pointer is actually the address of a
  heap block. This is done in the following steps: The candidate pointer
  is checked against rough heap bounds. These heap bounds are maintained
  such that all actual heap objects fall between them. In order to
  facilitate black-listing (see below) we also include address regions
  that the heap is likely to expand into. Most non-pointers fail this
  initial test.

  The candidate pointer is divided into two pieces; the most significant
  bits identify a HBLKSIZE-sized page in the address space, and the
  least significant bits specify an offset within that page. (A hardware
  page may actually consist of multiple such pages. HBLKSIZE is usually
  the page size divided by a small power of two.)
  
  The page address part of the candidate pointer is looked up in a
  table. Each table entry contains either 0, indicating that the page is
  not part of the garbage collected heap, a small integer n, indicating
  that the page is part of large object, starting at least n pages back,
  or a pointer to a descriptor for the page. In the first case, the
  candidate pointer i not a true pointer and can be safely ignored. In
  the last two cases, we can obtain a descriptor for the page containing
  the beginning of the object.
  
  The starting address of the referenced object is computed. The page
  descriptor contains the size of the object(s) in that page, the object
  kind, and the necessary mark bits for those objects. The size
  information can be used to map the candidate pointer to the object
  starting address. To accelerate this process, the page header also
  contains a pointer to a precomputed map of page offsets to
  displacements from the beginning of an object. The use of this map
  avoids a potentially slow integer remainder operation in computing the
  object start address.
  
  The mark bit for the target object is checked and set. If the object
  was previously unmarked, the object is pushed on the mark stack. The
  descriptor is read from the page descriptor. (This is computed from
  information GC_obj_kinds when the page is first allocated.)

In Guile, there is a dedicated "kind" for SMOBs that have user-defined
mark functions.  This means that they are segregrated into their own
HBLKSIZE-sized blocks, and so BDWGC is able to reliably determine
whether our general SMOB marker ('smob_mark' in smob.c) should be
called, without relying on some unreliable tag inside the block itself.

So, while it is true that a conservative GC cannot avoid possibly
retaining some garbage that should be freed, it is possible to reliably
avoid calling mark functions on arbitrary garbage, and BDWGC is designed
to do this.

BTW, a few more points to mention: I vaguely recall that you may have
wondered in the past if Boehm GC "parallel marking" means that it is
marking while the mutator (user code) is running.  That is not the case.

BDWGC first stops all user threads (on POSIX I believe this is done by
sending signals to all threads and keeping them in the signal handlers
during GC) and then runs multiple of its own marker threads in parallel.
I guess this means that user-specified mark functions may be called in
multiple threads.

However, this can be avoided by setting the GC_MARKERS environment
variable to "1", and I set this while running your test code to
eliminate this possible complication for now.

Another document worth looking at is:

  http://www.hboehm.info/gc/finalization.html

> However, from the GUILE side the mark_smob calls originate from objects
> detected to be SCM cells, and GUILE should be able to manage areas
> filled with SCM cells such that it can _reliably_ distinguish allocated
> from unallocated cells and there is no such thing as random garbage.

As described above, BDWGC already handles this, and so far I haven't
seen evidence of 'smob_mark' being called on random garbage.  However,
it seems that 'smob_mark' is being called on objects that have already
been finalized.

> When that is the case, it may be sufficient to clear out the tag field
> in the course of calling the free_smob hook, avoiding any subsequent
> mark calls to occur on a cell that has lost its type consistency anyway.
> Again, this might involve race conditions: no idea.

Yes, this sounds like a work-around worth considering.  I'll think about
whether it can be done reliably in Guile and/or LilyPond.

It occurs to me that during times of heavy allocation, it might be
possible for a GC to be triggered while a finalizer is being run.  Since
the object being finalized would be on the stack, it would be considered
live by the GC, even if most or all of the finalizer had already been
run.  It warrants investigation.

Also, while reading the GC design documentation, a particular warning
raised some alarms.  From "Finalization" in gcdescr.html:

  Any objects remaining unmarked at the end of this process are added to
  a queue of objects whose finalizers can be run. Depending on collector
  configuration, finalizers are dequeued and run either implicitly
  during allocation calls, or explicitly in response to a user
  request. (Note that the former is unfortunately both the default and
  not generally safe. If finalizers perform synchronization, it may
  result in deadlocks. Nontrivial finalizers generally need to perform
  synchronization, and thus require a different collector
  configuration.)

and from finalization.html:

  Finalization code may be run anyplace an allocation or other call to
  the collector takes place. In multithreaded programs, finalizers have
  to obey the normal locking conventions to ensure safety. Code run
  directly from finalizers should not acquire locks that may be held
  during allocation. This restriction can be easily circumvented by
  registering a finalizer which enqueues the real action for execution
  in a separate thread.

  In single-threaded code, it is also often easiest to have finalizers
  queue actions, which are then explicitly run during an explicit call
  by the user's program.

So it may be that we need to be careful of this.

Here's more recommended reading on the difficulties of implementing
finalizers properly:

  http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html

Finalizers are a huge can of worms.  I suspect this is why Ludovic was
encouraging you to avoid them entirely, but I agree that we need to find
a way to make them work for when they are needed.

> Because of potential race conditions, a reliable fix needs to be in
> GUILE or BDWGC and cannot be done at the application end.  The best we
> will be able to achieve there is "crashes almost never randomly".

I'm not sure this is true.  It will require more thought.

>> First, some changes that I think might be needed for GC safety:
>>
>> --- smobs.hh.ORIG	2015-02-15 13:35:03.000000000 -0500
>> +++ smobs.hh	2015-02-28 23:42:06.346803668 -0500
>> @@ -270,8 +271,9 @@
>>    }
>>    SCM unprotected_smobify_self ()
>>    {
>> -    self_scm_ = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
>> -    return self_scm_;
>> +    SCM s = Smob_base<Super>::register_ptr (static_cast<Super *> (this));
>> +    self_scm_ = s;
>> +    return s;
>
> Don't see the intended difference here.  It's conceivable that with
> several "volatile" declarations the compiler may generate different code
> that would protect against self_scm_ being modified before the function
> returns.  Is that supposed to help against non-atomic changes of
> self_scm_?

No, the idea is to ensure the existence of a GC-visible copy of the SCM
value until 'scm_gc_protect_object' is called.  For objects allocated in
the malloc/free heap, self_scm_ is not visible to the GC.

>> BTW, one other thing to keep in mind: it will never be enough if the
>> only GC-visible reference to an object is the C++ Smob object.  There
>> must always be a GC-visible reference to the SCM value.
>
> That's why we need the mark functions in LilyPond.

The mark functions can ensure that in your Family class, the children
will be marked even though the parent->child pointers point directly to
the C++ object.  However, whenever the root of the tree is not protected
by 'scm_gc_protect_object', the SCM value of that root must be visible
to the GC, in order to cause the user-specified 'mark' function to be
called in the first place.

It should be noted that for any particular C++ class, another
alternative is to arrange for instances of that class to be allocated
using 'scm_gc_malloc', as Ludovic suggested.  If you do that, then it
would suffice to have a GC-visible pointer to instances of that class,
which might make your life easier.

    Regards,
      Mark