GC thread performance

unofficial mirror of guile-user@gnu.org 
 help / color / mirror / Atom feed

* GC thread performance
@ 2017-11-27 22:40 Hans Åberg
  2017-11-27 23:05 ` Stefan Israelsson Tampe
  0 siblings, 1 reply; 12+ messages in thread
From: Hans Åberg @ 2017-11-27 22:40 UTC (permalink / raw)
  To: Guile User

With the Boehm GC in Guile, does multithread allocations take longer time than the same amount in a single thread?






^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-11-27 22:40 GC thread performance Hans Åberg
@ 2017-11-27 23:05 ` Stefan Israelsson Tampe
  2017-11-27 23:13   ` Hans Åberg
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Israelsson Tampe @ 2017-11-27 23:05 UTC (permalink / raw)
  To: Hans Åberg

There are lists of free memory kept in each thread so usually small allocation is one multithreaded large allocation and then
a bunch of small allocation in thread consuming that large allocation. So for small memory segments I would say that the 
overhead probably is neglible.

On Mon, Nov 27, 2017 at 11:40 PM, Hans Åberg <haberg-1@telia.com <mailto:haberg-1@telia.com>> wrote:
With the Boehm GC in Guile, does multithread allocations take longer time than the same amount in a single thread?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-11-27 23:05 ` Stefan Israelsson Tampe
@ 2017-11-27 23:13   ` Hans Åberg
  2017-11-27 23:23     ` Marko Rauhamaa
  0 siblings, 1 reply; 12+ messages in thread
From: Hans Åberg @ 2017-11-27 23:13 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: Guile User


> On 28 Nov 2017, at 00:05, Stefan Israelsson Tampe <stefan.itampe@gmail.com> wrote:
> 
> There are lists of free memory kept in each thread so usually small allocation is one multithreaded large allocation and then
> a bunch of small allocation in thread consuming that large allocation. So for small memory segments I would say that the 
> overhead probably is neglible.

I saw overhead also for the small allocations, 20-30% maybe. This is in a program that makes a lot of allocations relative other computations. So that made me wonder about Guile.





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-11-27 23:13   ` Hans Åberg
@ 2017-11-27 23:23     ` Marko Rauhamaa
  2017-11-27 23:44       ` Hans Åberg
  0 siblings, 1 reply; 12+ messages in thread
From: Marko Rauhamaa @ 2017-11-27 23:23 UTC (permalink / raw)
  To: Hans Åberg; +Cc: Guile User

Hans Åberg <haberg-1@telia.com>:
> I saw overhead also for the small allocations, 20-30% maybe. This is
> in a program that makes a lot of allocations relative other
> computations. So that made me wonder about Guile.

I don't have an answer to your question although I would imagine GC
cannot be effectively scaled across multiple threads.

I generally steer clear of threads and stick to multiple processes where
parallelism is needed.


Marko



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-11-27 23:23     ` Marko Rauhamaa
@ 2017-11-27 23:44       ` Hans Åberg
  2017-12-01 19:49         ` Linas Vepstas
  0 siblings, 1 reply; 12+ messages in thread
From: Hans Åberg @ 2017-11-27 23:44 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: Guile User



> On 28 Nov 2017, at 00:23, Marko Rauhamaa <marko@pacujo.net> wrote:
> 
> Hans Åberg <haberg-1@telia.com>:
>> I saw overhead also for the small allocations, 20-30% maybe. This is
>> in a program that makes a lot of allocations relative other
>> computations. So that made me wonder about Guile.
> 
> I don't have an answer to your question although I would imagine GC
> cannot be effectively scaled across multiple threads.

Or perhaps special implementation methods are required.

> I generally steer clear of threads and stick to multiple processes where
> parallelism is needed.

Indeed, that was I did: just one crucial point.





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-11-27 23:44       ` Hans Åberg
@ 2017-12-01 19:49         ` Linas Vepstas
  2017-12-01 23:23           ` Hans Åberg
  2017-12-02  8:50           ` Marko Rauhamaa
  0 siblings, 2 replies; 12+ messages in thread
From: Linas Vepstas @ 2017-12-01 19:49 UTC (permalink / raw)
  To: Hans Åberg; +Cc: Guile User

On Mon, Nov 27, 2017 at 5:44 PM, Hans Åberg <haberg-1@telia.com> wrote:

>
>
> > On 28 Nov 2017, at 00:23, Marko Rauhamaa <marko@pacujo.net> wrote:
> >
> > Hans Åberg <haberg-1@telia.com>:
> >> I saw overhead also for the small allocations, 20-30% maybe. This is
> >> in a program that makes a lot of allocations relative other
> >> computations. So that made me wonder about Guile.
>

I cannot speak to GC, but I freuqently encounter situations in guile where
using the parallel constructs e.g. par-for-each, end up running slower than
the single-threaded version. For example, using 2 or 3 threads, I get a 2x
and 3x speedup, great, but using 4x gives a slowdown, often 10x slower than
single-threaded. I try to make sure that the insides of the loop are large
and long-running, so that the cost of creating and scheduling threads is
inconsequential.

I have not attempted to determine the cause of this, but basically, that
entire subsystem needs a careful review and measurement.

--linas

-- 
*"The problem is not that artificial intelligence will get too smart and
take over the world," computer scientist Pedro Domingos writes, "the
problem is that it's too stupid and already has." *

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-12-01 19:49         ` Linas Vepstas
@ 2017-12-01 23:23           ` Hans Åberg
  2017-12-02  8:50           ` Marko Rauhamaa
  1 sibling, 0 replies; 12+ messages in thread
From: Hans Åberg @ 2017-12-01 23:23 UTC (permalink / raw)
  To: linasvepstas; +Cc: Guile User



> On 1 Dec 2017, at 20:49, Linas Vepstas <linasvepstas@gmail.com> wrote:
> 
> On Mon, Nov 27, 2017 at 5:44 PM, Hans Åberg <haberg-1@telia.com> wrote:
> 
>> 
>> 
>>> On 28 Nov 2017, at 00:23, Marko Rauhamaa <marko@pacujo.net> wrote:
>>> 
>>> Hans Åberg <haberg-1@telia.com>:
>>>> I saw overhead also for the small allocations, 20-30% maybe. This is
>>>> in a program that makes a lot of allocations relative other
>>>> computations. So that made me wonder about Guile.
>> 
> 
> I cannot speak to GC, but I freuqently encounter situations in guile where
> using the parallel constructs e.g. par-for-each, end up running slower than
> the single-threaded version. For example, using 2 or 3 threads, I get a 2x
> and 3x speedup, great, but using 4x gives a slowdown, often 10x slower than
> single-threaded. I try to make sure that the insides of the loop are large
> and long-running, so that the cost of creating and scheduling threads is
> inconsequential.

In the program I tested it depends on what I run, but one pattern is that user time goes up about two times when approaching the hardware concurrency, and then stabilizes. System time jumps up thirty times when going from one to two threads, and then further doubles when approaching the hardware concurrency.

> I have not attempted to determine the cause of this, but basically, that
> entire subsystem needs a careful review and measurement.

The GC also collects outside the main thread, so if that part takes up too much system resources, the program itself might too slow down by that.





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-12-01 19:49         ` Linas Vepstas
  2017-12-01 23:23           ` Hans Åberg
@ 2017-12-02  8:50           ` Marko Rauhamaa
  2017-12-02 14:20             ` Hans Åberg
  2017-12-02 14:30             ` Chris Vine
  1 sibling, 2 replies; 12+ messages in thread
From: Marko Rauhamaa @ 2017-12-02  8:50 UTC (permalink / raw)
  To: Linas Vepstas; +Cc: Guile User, Hans Åberg

Linas Vepstas <linasvepstas@gmail.com>:
> I cannot speak to GC, but I freuqently encounter situations in guile
> where using the parallel constructs e.g. par-for-each, end up running
> slower than the single-threaded version. For example, using 2 or 3
> threads, I get a 2x and 3x speedup, great, but using 4x gives a
> slowdown, often 10x slower than single-threaded. I try to make sure
> that the insides of the loop are large and long-running, so that the
> cost of creating and scheduling threads is inconsequential.
>
> I have not attempted to determine the cause of this, but basically,
> that entire subsystem needs a careful review and measurement.

I'll have to speculate, too.

Guile guarantees the consistency of the data model under all
circumstances. Bad synchronization between threads is allowed to cause
unspecified behavior, but it should never trigger a SIGSEGV. In
practice, that means excessive locking: all data access needs to take
place in a critical section.

Linux (at least) has locking primitives that have virtually no overhead
when there is no contention. However, when there is contention, the
context is switched to the kernel and a memory barrier causes the CPU
hardware to flush its memory cache.

The same issue hampers C developers as well, although the newest C
standards explicitly shift to the coder the responsibility for the
consistency of the data model. Thus, in principle, a C programmer can
try clever tricks to eke out performance with multithreading.

CPython has famously grappled with the same performance issues. The
so-called GIL effectively turns every Python program into a
single-threaded process. The point of multithreading is not a gain in
performance. Rather, it is a programming paradigm to multiplex events.

These and other issues with multithreading have caused many of us to
reject multithreading as a paradigm altogether. Use event-driven
programming for multiplexing and multiprocessing for performance.

Marko

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-12-02  8:50           ` Marko Rauhamaa
@ 2017-12-02 14:20             ` Hans Åberg
  2017-12-03 22:15               ` Arne Babenhauserheide
  2017-12-02 14:30             ` Chris Vine
  1 sibling, 1 reply; 12+ messages in thread
From: Hans Åberg @ 2017-12-02 14:20 UTC (permalink / raw)
  To: Marko Rauhamaa; +Cc: Guile User



> On 2 Dec 2017, at 09:50, Marko Rauhamaa <marko@pacujo.net> wrote:
> 
> Guile guarantees the consistency of the data model under all
> circumstances. Bad synchronization between threads is allowed to cause
> unspecified behavior, but it should never trigger a SIGSEGV. In
> practice, that means excessive locking: all data access needs to take
> place in a critical section.

> These and other issues with multithreading have caused many of us to
> reject multithreading as a paradigm altogether. Use event-driven
> programming for multiplexing and multiprocessing for performance.

I see the the expected behavior when turing off the GC altogether, using malloc and free without cleanup: I third of the time on one thread, decreasing up to hardware concurrency.

The Boehm GC sets a lock around all allocations and collections [1], so it might be such excessive locking that slows it down. That pages seems old: I could not see a difference when choosing the options at [2].


1. http://www.hboehm.info/gc/scale.html
2. http://www.hboehm.info/gc/simple_example.html





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-12-02 14:20             ` Hans Åberg
@ 2017-12-03 22:15               ` Arne Babenhauserheide
  2017-12-04  9:38                 ` Hans Åberg
  0 siblings, 1 reply; 12+ messages in thread
From: Arne Babenhauserheide @ 2017-12-03 22:15 UTC (permalink / raw)
  To: Hans Åberg; +Cc: Guile User

[-- Attachment #1: Type: text/plain, Size: 576 bytes --]


Hans Åberg <haberg-1@telia.com> writes:

> I see the the expected behavior when turing off the GC altogether,
> using malloc and free without cleanup: I third of the time on one
> thread, decreasing up to hardware concurrency.

I also saw that - I think we tracked it down to the GC heuristics
miscalculating when to run and running either too often or too seldomly.

I think we didn't get to actually fixing the underlying issue, though we
found that it should be possible.

Best wishes,
Arne
-- 
Unpolitisch sein
heißt politisch sein
ohne es zu merken

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-12-03 22:15               ` Arne Babenhauserheide
@ 2017-12-04  9:38                 ` Hans Åberg
  0 siblings, 0 replies; 12+ messages in thread
From: Hans Åberg @ 2017-12-04  9:38 UTC (permalink / raw)
  To: Arne Babenhauserheide; +Cc: Guile User



> On 3 Dec 2017, at 23:15, Arne Babenhauserheide <arne_bab@web.de> wrote:
> 
> Hans Åberg <haberg-1@telia.com> writes:
> 
>> I see the the expected behavior when turing off the GC altogether,
>> using malloc and free without cleanup: I third of the time on one
>> thread, decreasing up to hardware concurrency.
> 
> I also saw that - I think we tracked it down to the GC heuristics
> miscalculating when to run and running either too often or too seldomly.

I looked at the GC interface for way to make collections more rare, but could not find that.

> I think we didn't get to actually fixing the underlying issue, though we
> found that it should be possible.

The GC mailing list server is down, and has been that for some time it seems, so it is unclear about the state of the package.





^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: GC thread performance
  2017-12-02  8:50           ` Marko Rauhamaa
  2017-12-02 14:20             ` Hans Åberg
@ 2017-12-02 14:30             ` Chris Vine
  1 sibling, 0 replies; 12+ messages in thread
From: Chris Vine @ 2017-12-02 14:30 UTC (permalink / raw)
  To: guile-user

On Sat, 02 Dec 2017 10:50:29 +0200
Marko Rauhamaa <marko@pacujo.net> wrote:

> Linas Vepstas <linasvepstas@gmail.com>:
> > I cannot speak to GC, but I freuqently encounter situations in guile
> > where using the parallel constructs e.g. par-for-each, end up
> > running slower than the single-threaded version. For example, using
> > 2 or 3 threads, I get a 2x and 3x speedup, great, but using 4x
> > gives a slowdown, often 10x slower than single-threaded. I try to
> > make sure that the insides of the loop are large and long-running,
> > so that the cost of creating and scheduling threads is
> > inconsequential.
> >
> > I have not attempted to determine the cause of this, but basically,
> > that entire subsystem needs a careful review and measurement.  
> 
> I'll have to speculate, too.
> 
> Guile guarantees the consistency of the data model under all
> circumstances. Bad synchronization between threads is allowed to cause
> unspecified behavior, but it should never trigger a SIGSEGV. In
> practice, that means excessive locking: all data access needs to take
> place in a critical section.

If you mean that you believe guile carries out significant and excessive
locking to maintain the invariants of its containers and other SCM
objects, I do not think that is right.  I don't think guile generally
does use locking for that: instead it uses the fact that SCM objects
are stored in a type of size uintptr_t and aligned on 8-byte boundaries
to enable pointer tagging (see the file tags.h in the source
distribution).

Guile can therefore leverage the fact that on any platform supported by
guile threads, native pointer/integer types aligned on their natural
size boundary are atomic, with in C11 terms relaxed (ie unsynchronized)
memory ordering.  guile is not going to carry out CAS-style or
acquire/release synchronisation for you: a SCM object will be in a
valid state but that state may be completely different from the one you
expect if you don't synchronize in your own code.

Guile may of course lock its own internal global non-fluid data, but
that is a different point from the one I think you are making.

As I recall, Andy Wingo was reporting in relation to his fibers library
that it started to slow with more than about 8 native threads running.
As I recall, after 8 native threads, significant improvements in speed
failed to occur even with more than 8 processors.  You are also likely
to get a slow down if you run more native threads than you have
processors.

Chris

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-12-04  9:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-11-27 22:40 GC thread performance Hans Åberg
2017-11-27 23:05 ` Stefan Israelsson Tampe
2017-11-27 23:13   ` Hans Åberg
2017-11-27 23:23     ` Marko Rauhamaa
2017-11-27 23:44       ` Hans Åberg
2017-12-01 19:49         ` Linas Vepstas
2017-12-01 23:23           ` Hans Åberg
2017-12-02  8:50           ` Marko Rauhamaa
2017-12-02 14:20             ` Hans Åberg
2017-12-03 22:15               ` Arne Babenhauserheide
2017-12-04  9:38                 ` Hans Åberg
2017-12-02 14:30             ` Chris Vine

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).