[-- Attachment #1: Type: text/plain, Size: 6755 bytes --] Hello! (If you’re in a hurry, there are good news at the bottom.) I noticed that 3.0 (and also 2.2 actually) takes a long time to compile Guix’ gnu/services/mail.scm, which is macro-heavy, producing lots of top-level defines. At -O2 (the default), we have: --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> ,pr (compile-file "gnu/services/mail.scm") % cumulative self time seconds seconds procedure 13.79 19.16 14.46 language/cps/slot-allocation.scm:846:19 11.05 11.58 11.58 language/cps/intmap.scm:396:0:intmap-ref 7.56 12.63 7.92 anon #x10768e0 6.61 7.70 6.92 ice-9/popen.scm:145:0:reap-pipes 5.50 182.23 5.76 language/cps/intset.scm:470:5:visit-branch 4.65 4.87 4.87 system/vm/linker.scm:179:0:string-table-intern! 4.07 5.04 4.26 ice-9/vlist.scm:534:0:vhash-assoc 3.54 3.93 3.71 language/cps/intmap.scm:184:0:intmap-add! 3.28 6.65 3.43 language/cps/intset.scm:270:2:adjoin 2.70 2.82 2.82 language/cps/intset.scm:349:0:intset-ref 1.80 34.84 1.88 language/cps/intmap.scm:247:2:adjoin 1.80 5.93 1.88 language/cps/intset.scm:269:0:intset-add 1.74 18.22 1.83 language/cps/intmap.scm:246:0:intmap-add 1.22 3.27 1.27 language/cps/intset.scm:382:2:visit-node 1.16 2.94 1.22 language/cps/intset.scm:551:2:union 1.11 1.38 1.16 language/cps/intset.scm:204:0:intset-add! 0.74 1281.59 0.78 language/cps/intset.scm:472:5:visit-branch [...] Sample count: 1892 Total time: 104.795540582 seconds (85.091574653 seconds in GC) --8<---------------cut here---------------end--------------->8--- At -O1: --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> ,use(system base optimize) scheme@(guile-user)> ,pr (compile-file "gnu/services/mail.scm" #:opts (optimizations-for-level 1)) % cumulative self time seconds seconds procedure 11.76 129.78 7.60 language/cps/intset.scm:470:5:visit-branch 10.77 6.96 6.96 language/cps/intmap.scm:396:0:intmap-ref 10.43 11.69 6.74 language/cps/slot-allocation.scm:846:19 8.99 7.39 5.81 ice-9/vlist.scm:534:0:vhash-assoc 7.55 4.88 4.88 system/vm/linker.scm:179:0:string-table-intern! 6.44 4.16 4.16 ice-9/popen.scm:145:0:reap-pipes 4.22 2.80 2.72 language/cps/intmap.scm:184:0:intmap-add! 1.89 1.86 1.22 language/cps/slot-allocation.scm:681:17 1.89 1.43 1.22 ice-9/vlist.scm:539:0:vhash-assq 1.78 1.51 1.15 language/cps/slot-allocation.scm:505:17 1.22 1.36 0.79 language/cps/slot-allocation.scm:846:19 1.22 1.08 0.79 language/cps/slot-allocation.scm:505:17 [...] Sample count: 901 Total time: 64.602907835 seconds (55.87541493 seconds in GC) --8<---------------cut here---------------end--------------->8--- language/cps/slot-allocation.scm:846:19 corresponds to: (define (compute-live-slots* slots label live-vars) (intset-fold (lambda (var live) (match (get-slot slots var) (#f live) (slot (add-live-slot slot live)))) ;L846 (intmap-ref live-vars label) 0)) The GC times remain extremely high though, and it’s also coming from ‘compute-live-slots*’: --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> (gcprof (lambda () (compile-file "gnu/services/mail.scm" #:opts (optimizations-for-level 1)))) % cumulative self time seconds seconds procedure 58.14 34.56 34.56 language/cps/slot-allocation.scm:846:19 8.01 4.76 4.76 language/cps/slot-allocation.scm:681:17 8.01 4.76 4.76 language/cps/slot-allocation.scm:505:17 6.98 4.15 4.15 language/cps/slot-allocation.scm:505:17 6.46 3.84 3.84 language/cps/slot-allocation.scm:846:19 1.29 0.77 0.77 anon #x23e88e0 [...] Sample count: 387 Total time: 59.442422179 seconds (50.331193744 seconds in GC) --8<---------------cut here---------------end--------------->8--- (I believe Guile commit 5675e46410c9a24b05ddf58cbe3b998a4c9cad7c and its parent were made to optimize the -O1 case back in 2017¹.) ‘compute-live-slots*’ returns an integer and the allocation comes from line 846, where we allocate a bignum, in this case a verybignum even. And for each bignum, we register a finalizer, which itself takes space. (Time passes…) The patch below (also for 2.2) gives us better timing: --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> (gcprof (lambda () (compile-file "gnu/services/mail.scm" #:opts (optimizations-for-level 1)))) % cumulative self time seconds seconds procedure 18.75 2.49 2.49 anon #x6f58e0 [...] Sample count: 32 Total time: 13.290191232 seconds (4.584969888 seconds in GC) --8<---------------cut here---------------end--------------->8--- … but has the disadvantage that it doesn’t work: ‘numbers.test’ fails badly on bignums. However, it turns out that removing the ‘mp_set_memory_functions’ call works, and the result is: --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> (gcprof (lambda () (compile-file "gnu/services/mail.scm" #:opts (optimizations-for-level 1)))) % cumulative self time seconds seconds procedure 20.00 2.60 2.60 anon #x12578e0 10.00 3.47 1.30 language/cps/intset.scm:270:2:adjoin 6.67 0.87 0.87 ice-9/boot-9.scm:2201:0:%load-announce 6.67 0.87 0.87 anon #x1253160 3.33 146.48 0.43 ice-9/threads.scm:388:4 3.33 1.30 0.43 language/cps/intset.scm:759:8:lp 3.33 0.87 0.43 system/vm/assembler.scm:2854:4:write-die 3.33 0.43 0.43 language/cps/slot-allocation.scm:843:19 3.33 0.43 0.43 language/cps/intmap.scm:167:0:persistent-intmap [...] Sample count: 30 Total time: 13.001181844 seconds (4.278418897 seconds in GC) --8<---------------cut here---------------end--------------->8--- It’s 4.5 times faster than what we have now. Andy, anything against removing that ‘mp_set_memory_functions’ call altogether, or having ‘scm_install_gmp_memory_functions’ default to 0? Thanks, Ludo’. ¹ https://lists.gnu.org/archive/html/guile-devel/2017-10/msg00048.html [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: text/x-patch, Size: 1559 bytes --] diff --git a/libguile/numbers.c b/libguile/numbers.c index d1b463358..cf21a86ca 100644 --- a/libguile/numbers.c +++ b/libguile/numbers.c @@ -1,4 +1,4 @@ -/* Copyright 1995-2016,2018-2019 +/* Copyright 1995-2016,2018-2020 Free Software Foundation, Inc. Portions Copyright 1990-1993 by AT&T Bell Laboratories and Bellcore. @@ -218,16 +218,6 @@ static mpz_t z_negative_one; \f -/* Clear the `mpz_t' embedded in bignum PTR. */ -static void -finalize_bignum (void *ptr, void *data) -{ - SCM bignum; - - bignum = SCM_PACK_POINTER (ptr); - mpz_clear (SCM_I_BIG_MPZ (bignum)); -} - /* The next three functions (custom_libgmp_*) are passed to mp_set_memory_functions (in GMP) so that memory used by the digits themselves is known to the garbage collector. This is needed so @@ -237,19 +227,20 @@ finalize_bignum (void *ptr, void *data) static void * custom_gmp_malloc (size_t alloc_size) { - return scm_malloc (alloc_size); + return scm_gc_malloc_pointerless (alloc_size, "GMP"); } static void * custom_gmp_realloc (void *old_ptr, size_t old_size, size_t new_size) { - return scm_realloc (old_ptr, new_size); + return scm_gc_realloc (old_ptr, old_size, new_size, "GMP"); } static void custom_gmp_free (void *ptr, size_t size) { - free (ptr); + /* Do nothing: all memory allocated by GMP is under GC control and + will be freed when needed. */ } @@ -264,8 +255,6 @@ make_bignum (void) "bignum"); p[0] = scm_tc16_big; - scm_i_set_finalizer (p, finalize_bignum, NULL); - return SCM_PACK (p); }
[-- Attachment #1: Type: text/plain, Size: 1656 bytes --] Hey ho! Ludovic Courtès <ludo@gnu.org> skribis: > … but has the disadvantage that it doesn’t work: ‘numbers.test’ fails > badly on bignums. I think with the excitement I no longer knew what I was saying. So, here’s a revised patch that actually preserves memory management (as in: ‘mpz_t’ are eventually freed), while getting rid of finalizers. Result: --8<---------------cut here---------------start------------->8--- scheme@(guile-user)> ,use(system base optimize) scheme@(guile-user)> ,use(statprof) scheme@(guile-user)> (gcprof (lambda () (compile-file "gnu/services/mail.scm" #:opts (optimizations-for-level 1)))) % cumulative self time seconds seconds procedure 25.00 3.25 3.25 anon #x66e8e0 9.38 2.44 1.22 language/cps/intset.scm:551:2:union 6.25 0.81 0.81 ice-9/boot-9.scm:2201:0:%load-announce 6.25 0.81 0.81 anon #x66e468 3.13 139.83 0.41 ice-9/threads.scm:388:4 3.13 73.57 0.41 language/tree-il/peval.scm:710:2:loop 3.13 4.47 0.41 system/vm/assembler.scm:1201:0:intern-constant 3.13 1.63 0.41 ice-9/psyntax.scm:2964:6:match* 3.13 1.22 0.41 language/cps/intset.scm:270:2:adjoin 3.13 0.81 0.41 language/cps/intmap.scm:184:0:intmap-add! 3.13 0.41 0.41 language/cps/slot-allocation.scm:843:19 [...] Sample count: 32 Total time: 13.007326523 seconds (4.420918875 seconds in GC) --8<---------------cut here---------------end--------------->8--- I’d like to go ahead with this patch if there are no objections! Ludo’. [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #2: Type: text/x-patch, Size: 1813 bytes --] diff --git a/libguile/numbers.c b/libguile/numbers.c index d1b463358..8606780a8 100644 --- a/libguile/numbers.c +++ b/libguile/numbers.c @@ -1,4 +1,4 @@ -/* Copyright 1995-2016,2018-2019 +/* Copyright 1995-2016,2018-2020 Free Software Foundation, Inc. Portions Copyright 1990-1993 by AT&T Bell Laboratories and Bellcore. @@ -218,16 +218,6 @@ static mpz_t z_negative_one; \f -/* Clear the `mpz_t' embedded in bignum PTR. */ -static void -finalize_bignum (void *ptr, void *data) -{ - SCM bignum; - - bignum = SCM_PACK_POINTER (ptr); - mpz_clear (SCM_I_BIG_MPZ (bignum)); -} - /* The next three functions (custom_libgmp_*) are passed to mp_set_memory_functions (in GMP) so that memory used by the digits themselves is known to the garbage collector. This is needed so @@ -237,19 +227,20 @@ finalize_bignum (void *ptr, void *data) static void * custom_gmp_malloc (size_t alloc_size) { - return scm_malloc (alloc_size); + return scm_gc_malloc (alloc_size, "GMP"); } static void * custom_gmp_realloc (void *old_ptr, size_t old_size, size_t new_size) { - return scm_realloc (old_ptr, new_size); + return scm_gc_realloc (old_ptr, old_size, new_size, "GMP"); } static void custom_gmp_free (void *ptr, size_t size) { - free (ptr); + /* Do nothing: all memory allocated by GMP is under GC control and + will be freed when needed. */ } @@ -260,12 +251,10 @@ make_bignum (void) scm_t_bits *p; /* Allocate one word for the type tag and enough room for an `mpz_t'. */ - p = scm_gc_malloc_pointerless (sizeof (scm_t_bits) + sizeof (mpz_t), - "bignum"); + p = scm_gc_malloc (sizeof (scm_t_bits) + sizeof (mpz_t), + "bignum"); p[0] = scm_tc16_big; - scm_i_set_finalizer (p, finalize_bignum, NULL); - return SCM_PACK (p); }
> On 5 Feb 2020, at 17:29, Ludovic Courtès <ludo@gnu.org> wrote:
>
> Hey ho!
>
> Ludovic Courtès <ludo@gnu.org> skribis:
>
>> … but has the disadvantage that it doesn’t work: ‘numbers.test’ fails
>> badly on bignums.
>
> I think with the excitement I no longer knew what I was saying. So,
> here’s a revised patch that actually preserves memory management (as in:
> ‘mpz_t’ are eventually freed), while getting rid of finalizers.
When I tested the Boehm GC in a C++ program, the finalizers did not seem to make much difference. However, if the program is threaded, it puts locks around all GC allocations, which I think may be time consuming. So maybe you should try letting GMP use ordinary malloc, and GC allocations only for the SCM values creating them, deleting when finalized.
Hi :) Nice investigation! Perhaps slot-allocation should track live variables using something that's not bigints, but who knows. On Wed 05 Feb 2020 17:29, Ludovic Courtès <ludo@gnu.org> writes: > /* The next three functions (custom_libgmp_*) are passed to > mp_set_memory_functions (in GMP) so that memory used by the digits > themselves is known to the garbage collector. This is needed so > @@ -237,19 +227,20 @@ finalize_bignum (void *ptr, void *data) > static void * > custom_gmp_malloc (size_t alloc_size) > { > - return scm_malloc (alloc_size); > + return scm_gc_malloc (alloc_size, "GMP"); > } > > static void * > custom_gmp_realloc (void *old_ptr, size_t old_size, size_t new_size) > { > - return scm_realloc (old_ptr, new_size); > + return scm_gc_realloc (old_ptr, old_size, new_size, "GMP"); > } > > static void > custom_gmp_free (void *ptr, size_t size) > { > - free (ptr); > + /* Do nothing: all memory allocated by GMP is under GC control and > + will be freed when needed. */ > } I think this makes sense to me as a short-term fix. The down-side is that limbs can alias Scheme objects. In the long-term I think we should be representing bignums as pointerless objects whose first word is the tag and a word count, followed by inline "limbs" (in the sense of https://gmplib.org/manual/Nomenclature-and-Types.html#Nomenclature-and-Types). Generally we can use the low-level API to work on these (https://gmplib.org/manual/Low_002dlevel-Functions.html#Low_002dlevel-Functions), and if we need to use mpz_t, we can easily create an mpz_t that points to these values. Cheers, Andy
Hi! Andy Wingo <wingo@igalia.com> skribis: > Nice investigation! Perhaps slot-allocation should track live variables > using something that's not bigints, but who knows. Yeah I wondered; it’s not clear whether bitvectors would be more efficient, for instance, although we could make it perhaps locally imperative. > On Wed 05 Feb 2020 17:29, Ludovic Courtès <ludo@gnu.org> writes: > >> /* The next three functions (custom_libgmp_*) are passed to >> mp_set_memory_functions (in GMP) so that memory used by the digits >> themselves is known to the garbage collector. This is needed so >> @@ -237,19 +227,20 @@ finalize_bignum (void *ptr, void *data) >> static void * >> custom_gmp_malloc (size_t alloc_size) >> { >> - return scm_malloc (alloc_size); >> + return scm_gc_malloc (alloc_size, "GMP"); >> } >> >> static void * >> custom_gmp_realloc (void *old_ptr, size_t old_size, size_t new_size) >> { >> - return scm_realloc (old_ptr, new_size); >> + return scm_gc_realloc (old_ptr, old_size, new_size, "GMP"); >> } >> >> static void >> custom_gmp_free (void *ptr, size_t size) >> { >> - free (ptr); >> + /* Do nothing: all memory allocated by GMP is under GC control and >> + will be freed when needed. */ >> } > > I think this makes sense to me as a short-term fix. The down-side is > that limbs can alias Scheme objects. Yes. To my surprise, on a pure bignum microbenchmark, this is counterproductive: --8<---------------cut here---------------start------------->8--- $ guile ~/src/guile-debugging/bignum-finalizers.scm # 3.0.0 clock utime stime cutime cstime gctime 2.42 6.20 0.17 0.00 0.00 5.62 heap size: 2.0 MiB $ /data/src/guile-3.0/meta/guile ~/src/guile-debugging/bignum-finalizers.scm clock utime stime cutime cstime gctime 3.97 10.91 0.15 0.00 0.00 10.60 heap size: 3.0 MiB $ cat ~/src/guile-debugging/bignum-finalizers.scm (use-modules (ice-9 time)) (time (let loop ((n (expt 2 18)) (i 1)) (unless (zero? n) ;; (display ".") (loop (- n 1) (logior 0 (ash i 1)))))) (format #t "heap size: ~a MiB~%" (round (/ (assoc-ref (gc-stats) 'heap-size) (expt 2. 20)))) --8<---------------cut here---------------end--------------->8--- (Here we’re creating ~24 bignums, no more.) I wonder if there’s another part of the story that I’m missing here. Perf report for 3.0.0: --8<---------------cut here---------------start------------->8--- 46.93% guile libgc.so.1.3.6 [.] GC_mark_from 17.61% guile libgc.so.1.3.6 [.] GC_header_cache_miss 9.96% guile libgc.so.1.3.6 [.] GC_add_to_black_list_normal 5.20% guile libgmp.so.10.3.2 [.] __gmpn_lshift_coreisbr 4.13% guile libgc.so.1.3.6 [.] GC_find_header 2.28% guile libgc.so.1.3.6 [.] GC_finalize 2.09% guile libgc.so.1.3.6 [.] GC_base --8<---------------cut here---------------end--------------->8--- With the patch: --8<---------------cut here---------------start------------->8--- 48.40% guile libgc.so.1.3.6 [.] GC_mark_from 17.74% guile libgc.so.1.3.6 [.] GC_header_cache_miss 11.90% guile libgc.so.1.3.6 [.] GC_add_to_black_list_normal 4.45% guile libgc.so.1.3.6 [.] GC_find_header 2.31% guile libgmp.so.10.3.2 [.] __gmpn_lshift_coreisbr 2.30% guile libgc.so.1.3.6 [.] GC_base 1.73% guile libgc.so.1.3.6 [.] GC_finalize --8<---------------cut here---------------end--------------->8--- IOW, the relative part of computations drops from 5% to 2%. Thoughts? > In the long-term I think we should be representing bignums as > pointerless objects whose first word is the tag and a word count, > followed by inline "limbs" (in the sense of > https://gmplib.org/manual/Nomenclature-and-Types.html#Nomenclature-and-Types). > Generally we can use the low-level API to work on these > (https://gmplib.org/manual/Low_002dlevel-Functions.html#Low_002dlevel-Functions), > and if we need to use mpz_t, we can easily create an mpz_t that points > to these values. Yes, that sounds like the right approach longer-term. Note that ‘mpz_t’ is exposed through “numbers.h”, which I guess means we cannot change that in 3.0.x. Thanks, Ludo’.
Hi,
Ludovic Courtès <ludo@gnu.org> skribis:
> To my surprise, on a pure bignum microbenchmark, this is
> counterproductive:
I found out that I was comparing my own Guile build, which was against
the ‘libgc-back-pointers’ package, with that ‘guile-next’ build against
‘libgc’; no wonder mine was slower…
--8<---------------cut here---------------start------------->8---
$ cat ~/src/guile-debugging/bignum-finalizers.scm
(use-modules (ice-9 time))
(time
(let loop ((n (expt 2 18))
(i 1))
(unless (zero? n)
;; (display ".")
(loop (- n 1)
(logior 0 (ash i 1))))))
(format #t "heap size: ~a MiB~%"
(round
(/ (assoc-ref (gc-stats) 'heap-size) (expt 2. 20))))
--8<---------------cut here---------------end--------------->8---
There’s no noticeable difference on the microbenchmark when using the
same libgc both times, and there’s still ~2.5x difference when compiling
‘gnu/services/mail.scm’.
I went ahead and pushed the change as 00fbdfa7345765168e14438eed0b0b8c64c27ab9.
I did a full bootstrap build, which went fine and felt faster, though I
don’t have hard figures to compare to (110m user according to Bash’
‘time’ on my laptop).
Give it a spin and let me know!
Thanks,
Ludo’.