* guile performance - Ackermann function: way slower than emacs, slower still if compiled
@ 2009-08-04  9:28 Ken Raeburn
  2009-08-04 13:52 ` Marijn Schouten (hkBst)
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Ken Raeburn @ 2009-08-04 9:28 UTC (permalink / raw)
To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 5656 bytes --]

I implemented Ackermann's function A(m,n), a recursive function with scary performance characteristics, based on the definition given at wikipedia involving lots of +1 and -1 operations... A(0,n) takes constant time to compute assuming bounded integer sizes, A(1,n) takes linear time, A(2,n) is quadratic, A(3,n) is exponential... A(4,n) takes a really long time if n=1, and for anything higher you can just give up now.  (A(4,2) is in the neighborhood of 2**65536, and A(4,3) is about 2 raised to that power.  And it gets there by adding or subtracting 1 in each invocation.)  Optimizations are possible, like A(1,n)=n+2, A(2,n)=2n+3, A(3,n)=2**(n+3)-3, but I didn't encode those optimizations.

I wrote both elisp and scheme versions; I changed the recursion to an explicitly-managed stack of values, to avoid some of the dynamic-vs-lexical argument binding issues, turn the native stack usage into large numbers of cell allocations and discards, and avoid exceeding the maximum Lisp evaluation depth Emacs permits.  Not surprisingly, it's not too fast, and Emacs spends a lot of time in garbage collection passes even with the byte-compiled version.
My time-test function here counts seconds of real time (on an 8-core Mac without a lot of other stuff going on, so CPU usage is probably very close):

ELISP> (time-test '(ackermann 3 1))
1.6e-05
ELISP> (time-test '(ackermann 3 6))
0.036018
ELISP> (time-test '(ackermann 3 7))
0.16574899999999998
ELISP> (time-test '(ackermann 3 8))
0.6457170000000001
ELISP> (time-test '(ackermann 3 9))
2.482042
ELISP> (time-test '(ackermann 4 0))
1.4e-05
ELISP> (time-test '(ackermann 4 1))
642.746514
ELISP>

With the installed guile-1.8.7 binary and the interpreted Scheme version of the code, performance was worse:

% time guile -l ack.scm -c '(ackermann 3 1)'
0.009u 0.003s 0:00.01 0.0%	0+0k 0+0io 0pf+0w
% time guile -l ack.scm -c '(ackermann 3 6)'
0.409u 0.078s 0:00.52 90.3%	0+0k 0+0io 0pf+0w
% time guile -l ack.scm -c '(ackermann 3 9)'
25.300u 4.540s 0:29.85 99.9%	0+0k 0+0io 0pf+0w
% time guile -l ack.scm -c '(ackermann 4 1)'
6016.150u 1122.109s 1:58:59.32 99.9%	0+0k 8+4io 149pf+0w
%

Yes, that's just shy of two hours for the last one, where Emacs took less than 11 minutes.

(Did I make some silly mistake in translating, that causes the Scheme version to be far less efficient?)

It's not a terribly fair comparison, for various reasons, but moving on...

I built and installed 1.9.1, so I could test with more recent code:

% time guile -l ack.scm -c '(ackermann 3 1)'
0.026u 0.006s 0:00.03 66.6%	0+0k 0+0io 0pf+0w
% time guile -l ack.scm -c '(ackermann 3 6)'
0.761u 0.141s 0:00.90 100.0%	0+0k 0+0io 0pf+0w
% time guile -l ack.scm -c '(ackermann 3 9)'
47.242u 8.644s 0:55.89 99.9%	0+0k 0+0io 0pf+0w
%

Since this behaved worse than 1.8.6, I skipped the 4,1 test.
I tested the compiled version:

% time guile -c '(begin (load-compiled "ack.go") (ackermann 3 1))'
0.012u 0.004s 0:00.01 100.0%	0+0k 0+0io 0pf+0w
% time guile -c '(begin (load-compiled "ack.go") (ackermann 3 6))'
0.772u 0.333s 0:01.10 100.0%	0+0k 0+0io 0pf+0w
% time guile -c '(begin (load-compiled "ack.go") (ackermann 3 9))'
48.728u 22.016s 1:10.75 99.9%	0+0k 0+0io 0pf+0w
%

Not much better than loading the .scm file, and only better for small values of n... in fact, with m=3 and n>=6 the performance seems worse in the precompiled case:

% time guile -l ack.scm -c '(ackermann 3 10)'
181.997u 34.928s 3:36.94 99.9%	0+0k 0+0io 0pf+0w
% time guile -c '(begin (load-compiled "ack.go") (ackermann 3 10))'
196.678u 85.618s 4:42.34 99.9%	0+0k 0+0io 0pf+0w
%

Poking at it a couple of times with Apple's process sampling tool, it does seem to be spending a bit of time in garbage collection -- but scm_leave_guile (called from scm_async_click via scm_pthread_mutex_lock) and getrusage (called from scm_c_get_internal_run_time via mytime) also seem to show up on the stack a significant fraction of the time, and each of those appears to be a comparable if not bigger time sink.

According to the execution timing with the compiled code, GC actually doesn't seem to be a huge fraction of the run time, maybe 8-9%:

scheme@(guile-user)> ,t (ackermann 3 8)
2045
clock utime stime cutime cstime gctime
18.43 12.96  5.47   0.00   0.00   1.66
scheme@(guile-user)> ,t (ackermann 3 9)
4093
clock utime stime cutime cstime gctime
73.47 51.61 21.85   0.00   0.00   5.95
scheme@(guile-user)>

The latter is the case that took under three seconds in Emacs, remember.
Garbage collection is much more of a factor in the non-compiled version:

scheme@(guile-user)> (load "ack.scm")
scheme@(guile-user)> ,t (ackermann 3 8)
2045
clock utime stime cutime cstime gctime
21.97 19.74  2.21   0.00   0.00   7.76
scheme@(guile-user)> ,t (ackermann 3 9)
4093
clock utime stime cutime cstime gctime
86.26 77.37  8.88   0.00   0.00  30.20
scheme@(guile-user)>

It also looks like execution time measurements may add more to the non-compiled execution time, as with the measurements the compiled code was still faster at 3,9.

In retrospect, this is much heavier on the interpretation or execution of some of the simple operations than I intended, and Emacs benefits a lot from opcodes like "add1", and I'm more interested specifically in exercising allocation and GC at the moment, so it's not that useful a test for me after all.  But I thought I'd send this along anyways in case someone was interested.  In particular the fact that the performance of the precompiled version gets worse as n grows might be something to investigate, unless the cause is already known....

Ken

[-- Attachment #2: ack.scm --]
[-- Type: application/octet-stream, Size: 889 bytes --]

(define (ackermann m n)
  (let ((stack (list m)))
    (while (not (null? stack))
      ;; Compute A(top-of-stack, n)
      (set! m (car stack))
      (set! stack (cdr stack))
      (while (not (= m 0))
        (cond ((and (= n 0) (> m 0))
               ;;; (ackermann (- m 1) 1)
               (set! m (- m 1))
               (set! n 1))
              (#t
               ;;; (ackermann (- m 1) (ackermann m (- n 1)))
               ;; Push outer parameter onto the stack and
               ;; compute the inner function.
               (set! stack (cons (- m 1) stack))
               (set! n (- n 1))
               )))
      ;; Now m=0 so result is n+1.
      (set! n (+ n 1))
      ;; Done computing A() this round.
      ))
  n)

(define (time-diff t1 t2)
  (+ (* 1.0e-6 (- (cdr t1) (cdr t2)))
     (- (car t1) (car t2))))

(define (ack-time-test m n)
  (let ((start (gettimeofday)))
    (ackermann m n)
    (display (time-diff (gettimeofday) start))
    (newline)))

[-- Attachment #3: ack.el --]
[-- Type: application/octet-stream, Size: 867 bytes --]

(defun ackermann (m n)
  (let ((stack (list m)))
    (while stack
      ;; Compute A(top-of-stack, n)
      (setq m (car stack)
            stack (cdr stack))
      (while (not (= m 0))
        (cond ((and (= n 0) (> m 0))
               ;;; (ackermann (- m 1) 1)
               (setq m (- m 1)
                     n 1))
              (t
               ;;; (ackermann (- m 1) (ackermann m (- n 1)))
               ;; Push outer parameter onto the stack and
               ;; compute the inner function.
               (setq stack (cons (- m 1) stack)
                     n (- n 1))
               )))
      ;; Now m=0 so result is n+1.
      (setq n (+ n 1))
      ;; Done computing A() this round.
      ))
  n)

(defun time-diff (t1 t2)
  (+ (* 65536.0 (- (nth 0 t1) (nth 0 t2)))
     (* 1.0 (- (nth 1 t1) (nth 1 t2)))
     (* 1.0e-6 (- (nth 2 t1) (nth 2 t2)))))

(defun time-test (exp)
  (let ((start (current-time)))
    (eval exp)
    (time-diff (current-time) start)))

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled 2009-08-04 9:28 guile performance - Ackermann function: way slower than emacs, slower still if compiled Ken Raeburn @ 2009-08-04 13:52 ` Marijn Schouten (hkBst) 2009-08-04 15:56 ` Ken Raeburn 2009-08-05 9:06 ` Andy Wingo ` (2 subsequent siblings) 3 siblings, 1 reply; 16+ messages in thread From: Marijn Schouten (hkBst) @ 2009-08-04 13:52 UTC (permalink / raw) To: Ken Raeburn; +Cc: guile-devel -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Ken Raeburn wrote: > I implemented Ackermann's function A(m,n), a recursive function with > scary performance characteristics, based on the definition given at > wikipedia involving lots of +1 and -1 operations... A(0,n) takes > constant time to compute assuming bounded integer sizes, A(1,n) takes > linear time, A(2,n) is quadratic, A(3,n) is exponential... A(4,n) takes > a really long time if n=1, and for anything higher you can just give up > now. (A(4,2) is in the neighborhood of 2**65536, and A(4,3) is about 2 > raised to that power. And it gets there by adding or subtracting 1 in > each invocation.) Optimizations are possible, like A(1,n)=n+2, > A(2,n)=2n+3, A(3,n)=2**(n+3)-3, but I didn't encode those optimizations. > > I wrote both elisp and scheme versions; I changed the recursion to an > explicitly-managed stack of values, to avoid some of the > dynamic-vs-lexical argument binding issues, turn the native stack usage > into large numbers of cell allocations and discards, and avoid exceeding > the maximum Lisp evaluation depth Emacs permits. Not surprisingly, it's > not too fast, and Emacs spends a lot of time in garbage collection > passes even with the byte-compiled version. 
My time-test function here > counts seconds of real time (on an 8-core Mac without a lot of other > stuff going on, so CPU usage is probably very close): > > ELISP> (time-test '(ackermann 3 1)) > 1.6e-05 > ELISP> (time-test '(ackermann 3 6)) > 0.036018 > ELISP> (time-test '(ackermann 3 7)) > 0.16574899999999998 > ELISP> (time-test '(ackermann 3 8)) > 0.6457170000000001 > ELISP> (time-test '(ackermann 3 9)) > 2.482042 > ELISP> (time-test '(ackermann 4 0)) > 1.4e-05 > ELISP> (time-test '(ackermann 4 1)) > 642.746514 > ELISP> > > With the installed guile-1.8.7 binary and the interpreted Scheme version > of the code, performance was worse: > > % time guile -l ack.scm -c '(ackermann 3 1)' > 0.009u 0.003s 0:00.01 0.0% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 3 6)' > 0.409u 0.078s 0:00.52 90.3% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 3 9)' > 25.300u 4.540s 0:29.85 99.9% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 4 1)' > 6016.150u 1122.109s 1:58:59.32 99.9% 0+0k 8+4io 149pf+0w > % > > Yes, that's just shy of two hours for the last one, where Emacs took > less than 11 minutes. > > (Did I make some silly mistake in translating, that causes the Scheme > version to be far less efficient?) > > It's not a terribly fair comparison, for various reasons, but moving on... > > I built and installed 1.9.1, so I could test with more recent code: > > % time guile -l ack.scm -c '(ackermann 3 1)' > 0.026u 0.006s 0:00.03 66.6% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 3 6)' > 0.761u 0.141s 0:00.90 100.0% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 3 9)' > 47.242u 8.644s 0:55.89 99.9% 0+0k 0+0io 0pf+0w > % > > Since this behaved worse than 1.8.6, I skipped the 4,1 test. 
I tested > the compiled version: > > % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 1))' > 0.012u 0.004s 0:00.01 100.0% 0+0k 0+0io 0pf+0w > % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 6))' > 0.772u 0.333s 0:01.10 100.0% 0+0k 0+0io 0pf+0w > % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 9))' > 48.728u 22.016s 1:10.75 99.9% 0+0k 0+0io 0pf+0w > % > > Not much better than loading the .scm file, and only better for small > values of n... in fact, with m=3 and n>=6 the performance seems worse in > the precompiled case: > > % time guile -l ack.scm -c '(ackermann 3 10)' > 181.997u 34.928s 3:36.94 99.9% 0+0k 0+0io 0pf+0w > % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 10))' > 196.678u 85.618s 4:42.34 99.9% 0+0k 0+0io 0pf+0w > % > > Poking at it a couple of times with Apple's process sampling tool, it > does seem to be spending a bit of time in garbage collection -- but > scm_leave_guile (called from scm_async_click via scm_pthread_mutex_lock) > and getrusage (called from scm_c_get_internal_run_time via mytime) also > seem to show up on the stack a significant fraction of the time, and > each of those appears to be a comparable if not bigger time sink. > > According to the execution timing with the compiled code, GC actually > doesn't seem to be a huge fraction of the run time, maybe 8-9%: > > scheme@(guile-user)> ,t (ackermann 3 8) > 2045 > clock utime stime cutime cstime gctime > 18.43 12.96 5.47 0.00 0.00 1.66 > scheme@(guile-user)> ,t (ackermann 3 9) > 4093 > clock utime stime cutime cstime gctime > 73.47 51.61 21.85 0.00 0.00 5.95 > scheme@(guile-user)> > > The latter is the case that took under three seconds in Emacs, remember. 
> Garbage collection is much more of a factor in the non-compiled version:
>
> scheme@(guile-user)> (load "ack.scm")
> scheme@(guile-user)> ,t (ackermann 3 8)
> 2045
> clock utime stime cutime cstime gctime
> 21.97 19.74  2.21   0.00   0.00   7.76
> scheme@(guile-user)> ,t (ackermann 3 9)
> 4093
> clock utime stime cutime cstime gctime
> 86.26 77.37  8.88   0.00   0.00  30.20
> scheme@(guile-user)>
>
> It also looks like execution time measurements may add more to the
> non-compiled execution time, as with the measurements the compiled code
> was still faster at 3,9.
>
> In retrospect, this is much heavier on the interpretation or execution
> of some of the simple operations than I intended, and Emacs benefits a
> lot from opcodes like "add1", and I'm more interested specifically in
> exercising allocation and GC at the moment, so it's not that useful a
> test for me after all.  But I thought I'd send this along anyways in
> case someone was interested.  In particular the fact that the
> performance of the precompiled version gets worse as n grows might be
> something to investigate, unless the cause is already known....

A more idiomatic way to implement it that also works in other Schemes is:

; $ larceny -- acker.scm -e "(time (pretty-print (A 3 9)))" -e "(quit)"
; $ mzscheme -f acker.scm -e "(time (display (A 3 9))(newline))"
; $ time guile -l acker.scm -c "(display (A 3 9))(newline)"
; $ gsi acker.scm -e "(time (pp (A 3 9)))"

(define (A m n)
  (let loop ((m+ (list m)) (n n))
    (if (null? m+)
        n
        (let loop1 ((m (car m+)) (m* (cdr m+)) (n n))
          (if (= m 0)
              (loop m* (+ n 1))
              (if (= n 0)
                  (loop1 (- m 1) m* 1)
                  (loop1 m (cons (- m 1) m*) (- n 1)) ))))))

On my machine larceny computes (A 3 9) in 83 ms and (A 4 1) in less than 22 seconds.  In guile-1.8.6 this version is also twice as fast as your version for (A 3 9); they run 9s and 18s respectively.  I haven't tried the new guile compiler yet.

Marijn

- --
If you cannot read my mind, then listen to what I say.
Marijn Schouten (hkBst), Gentoo Lisp project, Gentoo ML
<http://www.gentoo.org/proj/en/lisp/>, #gentoo-{lisp,ml} on FreeNode
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkp4ONoACgkQp/VmCx0OL2wCmwCgh4a3N6mifoRKuQXG/eaylTqb
yzQAn0Kb6YRQy6CXyAEcs8t+yBZsHWLa
=6wJR
-----END PGP SIGNATURE-----
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-04 13:52 ` Marijn Schouten (hkBst)
@ 2009-08-04 15:56   ` Ken Raeburn
  0 siblings, 0 replies; 16+ messages in thread
From: Ken Raeburn @ 2009-08-04 15:56 UTC (permalink / raw)
To: Marijn Schouten; +Cc: guile-devel

On Aug 4, 2009, at 09:52, Marijn Schouten (hkBst) wrote:
> A more idiomatic way to implement it that also works in other
> Schemes is:

I know, shame on me for using "set!" in my code. :-)

I was aiming for code very similar between Lisp and Scheme versions, so that (insert some hand-waving here) it might be more reasonable to compare Emacs and Guile performance; named loops wouldn't work for Lisp, tail-recursive functions don't get optimized into loops in Emacs Lisp, etc.  So while loops with explicit setting of new values seemed like a good compromise.  (Then again, like I said, I was aiming for something to exercise the GC system in particular, and failed there anyways.)  It would've been better still if I'd pulled up the elisp branch and compared that against Emacs for processing the same Lisp code....

> On my machine larceny computes (A 3 9) in 83 ms and (A 4 1) in less
> than 22 seconds.  In guile-1.8.6 this version is also twice as fast
> as your version for (A 3 9); they run 9s and 18s respectively.
> I haven't tried the new guile compiler yet.

Thanks!  The compiler geek in me wants to suggest that the compiler should be able to transform away most of the differences and the performance should be nearly the same. :-)  (The compiler geek in me also knows better than to expect it.)

For completeness, the timings I get for A(3,9) with your version, on the same machine I used before:

1.8.7:                8.4s
1.9.1, not compiled: 13.1s
1.9.1, compiled:      2.8s

I also tried A(4,1) in the compiled version, and it took 13m20s.  So the 1.9.1-compiled version of your code, from two data points, appears to be only a bit slower than the byte-compiled Emacs version.
Ken
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-04  9:28 guile performance - Ackermann function: way slower than emacs, slower still if compiled Ken Raeburn
  2009-08-04 13:52 ` Marijn Schouten (hkBst)
@ 2009-08-05  9:06 ` Andy Wingo
  2009-08-05 14:06   ` Ken Raeburn
  2009-08-08 21:52   ` guile performance - Ackermann function: way slower than emacs, slower still if compiled Ludovic Courtès
  2009-08-05 10:42 ` Andy Wingo
  2009-08-05 11:15 ` Andy Wingo
  3 siblings, 2 replies; 16+ messages in thread
From: Andy Wingo @ 2009-08-05 9:06 UTC (permalink / raw)
To: Ken Raeburn; +Cc: guile-devel

Hi Ken,

On Tue 04 Aug 2009 11:28, Ken Raeburn <raeburn@raeburn.org> writes:

> I implemented Ackermann's function A(m,n), a recursive function with
> scary performance characteristics, based on the definition given at
> wikipedia involving lots of +1 and -1 operations...

Guile's VM does not yet have opcodes for +1 and -1.  It probably should, though.  Still, this is a fairly artificial test.

> I wrote both elisp and scheme versions; I changed the recursion to an
> explicitly-managed stack of values, to avoid some of the dynamic-vs-
> lexical argument binding issues, turn the native stack usage into large
> numbers of cell allocations and discards, and avoid exceeding the
> maximum Lisp evaluation depth Emacs permits.

OK now we know we are entering the realm of the artificial benchmark :-)

> Not surprisingly, it's
> not too fast, and Emacs spends a lot of time in garbage collection
> passes even with the byte-compiled version.
My time-test function here > counts seconds of real time (on an 8-core Mac without a lot of other > stuff going on, so CPU usage is probably very close): > > ELISP> (time-test '(ackermann 3 1)) > 1.6e-05 > ELISP> (time-test '(ackermann 3 6)) > 0.036018 > ELISP> (time-test '(ackermann 3 7)) > 0.16574899999999998 > ELISP> (time-test '(ackermann 3 8)) > 0.6457170000000001 > ELISP> (time-test '(ackermann 3 9)) > 2.482042 > ELISP> (time-test '(ackermann 4 0)) > 1.4e-05 > ELISP> (time-test '(ackermann 4 1)) > 642.746514 OK > With the installed guile-1.8.7 binary and the interpreted Scheme version > of the code, performance was worse: > > % time guile -l ack.scm -c '(ackermann 3 1)' > 0.009u 0.003s 0:00.01 0.0% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 3 6)' > 0.409u 0.078s 0:00.52 90.3% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 3 9)' > 25.300u 4.540s 0:29.85 99.9% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 4 1)' > 6016.150u 1122.109s 1:58:59.32 99.9% 0+0k 8+4io 149pf+0w So, you're also timing Guile startup here, which takes about 20 ms. > Yes, that's just shy of two hours for the last one, where Emacs took > less than 11 minutes. Of course, that's comparing compiled code to interpreted code, but that is a pretty bad result, yes. > I built and installed 1.9.1, so I could test with more recent code: > > % time guile -l ack.scm -c '(ackermann 3 1)' > 0.026u 0.006s 0:00.03 66.6% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 3 6)' > 0.761u 0.141s 0:00.90 100.0% 0+0k 0+0io 0pf+0w > % time guile -l ack.scm -c '(ackermann 3 9)' > 47.242u 8.644s 0:55.89 99.9% 0+0k 0+0io 0pf+0w > % > > Since this behaved worse than 1.8.6, I skipped the 4,1 test. Yes, Guile 1.9's interpreter will be slower than 1.8's. 
> I tested > the compiled version: > > % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 1))' > 0.012u 0.004s 0:00.01 100.0% 0+0k 0+0io 0pf+0w > % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 6))' > 0.772u 0.333s 0:01.10 100.0% 0+0k 0+0io 0pf+0w > % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 9))' > 48.728u 22.016s 1:10.75 99.9% 0+0k 0+0io 0pf+0w > % > > Not much better than loading the .scm file, and only better for small > values of n... in fact, with m=3 and n>=6 the performance seems worse > in the precompiled case: > > % time guile -l ack.scm -c '(ackermann 3 10)' > 181.997u 34.928s 3:36.94 99.9% 0+0k 0+0io 0pf+0w > % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 10))' > 196.678u 85.618s 4:42.34 99.9% 0+0k 0+0io 0pf+0w > % > > Poking at it a couple of times with Apple's process sampling tool, it > does seem to be spending a bit of time in garbage collection -- but > scm_leave_guile (called from scm_async_click via > scm_pthread_mutex_lock) and getrusage (called from > scm_c_get_internal_run_time via mytime) also seem to show up on the > stack a significant fraction of the time, and each of those appears to > be a comparable if not bigger time sink. This could be simply that Mac OS has different performance characteristics than Linux. If locking uncontended mutexen or getting the resource usage of a process are expensive, that doesn't sound very good... > According to the execution timing with the compiled code, GC actually > doesn't seem to be a huge fraction of the run time, maybe 8-9%: > > scheme@(guile-user)> ,t (ackermann 3 8) > 2045 > clock utime stime cutime cstime gctime > 18.43 12.96 5.47 0.00 0.00 1.66 > scheme@(guile-user)> ,t (ackermann 3 9) > 4093 > clock utime stime cutime cstime gctime > 73.47 51.61 21.85 0.00 0.00 5.95 > scheme@(guile-user)> > > The latter is the case that took under three seconds in Emacs, > remember. Perhaps the deal is that 1+ is not being inlined. 
It seems to be calling out to the 1+ procedure, which is a primitive, actually making it a bit slower. I'll see what I can do. > In retrospect, this is much heavier on the interpretation or execution > of some of the simple operations than I intended, and Emacs benefits a > lot from opcodes like "add1", and I'm more interested specifically in > exercising allocation and GC at the moment, so it's not that useful a > test for me after all. But I thought I'd send this along anyways in > case someone was interested. In particular the fact that the > performance of the precompiled version gets worse as n grows might be > something to investigate, unless the cause is already known.... Thank you for the report, I'll investigate. Andy -- http://wingolog.org/ ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled 2009-08-05 9:06 ` Andy Wingo @ 2009-08-05 14:06 ` Ken Raeburn 2009-08-06 7:31 ` Ken Raeburn 2009-08-06 16:30 ` entering and leaving guile mode, and GC stack protection (was Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled) Ken Raeburn 2009-08-08 21:52 ` guile performance - Ackermann function: way slower than emacs, slower still if compiled Ludovic Courtès 1 sibling, 2 replies; 16+ messages in thread From: Ken Raeburn @ 2009-08-05 14:06 UTC (permalink / raw) To: Andy Wingo; +Cc: guile-devel On Aug 5, 2009, at 05:06, Andy Wingo wrote: > Guile's VM does not yet have opcodes for +1 and and -1. It probably > should, though. Still, this is a fairly artificial test. Yeah, and it turned out to be useless for what I had wanted to test at the start, but I was still curious to see how it would work out. Based on your analysis of the "while" issue, it turned out to be interesting. :-) > So, you're also timing Guile startup here, which takes about 20 ms. Yes; except for the smallest cases, startup seems to be in the noise. >> Poking at it a couple of times with Apple's process sampling tool, it >> does seem to be spending a bit of time in garbage collection -- but >> scm_leave_guile (called from scm_async_click via >> scm_pthread_mutex_lock) and getrusage (called from >> scm_c_get_internal_run_time via mytime) also seem to show up on the >> stack a significant fraction of the time, and each of those appears >> to >> be a comparable if not bigger time sink. > > This could be simply that Mac OS has different performance > characteristics than Linux. If locking uncontended mutexen or getting > the resource usage of a process are expensive, that doesn't sound very > good... From my experience with profiling code on another project, mutexes definitely appear more expensive on Mac OS X vs Linux/glibc when there's contention. 
I just ran a quick test on a couple of x86 machines, and it does appear that the Mac OS X version of pthread_mutex_lock+pthread_mutex_unlock takes half again as many cycles as the GNU/Linux version, without contention.  (A test with 10000000 iterations of getrusage(RUSAGE_SELF) shows a more drastic difference between OSes -- in terms of cycle counts, more than 3.5x as expensive on Mac OS X as on GNU/Linux.)  But it's not pthread_mutex_lock that's showing up most often, it's sigaltstack and sigprocmask, called by scm_leave_guile.

Some thoughts come to mind, but I don't have much time to work with them right now, maybe later:

(1) In scm_pthread_mutex_lock, we leave and re-enter guile mode so that we don't block the thread while in guile mode.  But we could use pthread_mutex_trylock first, and avoid the costs scm_leave_guile seems to incur on the Mac.  If we can't acquire the lock, it should return immediately, and then we can do the expensive, blocking version.  A quick, hack version of this changed my run time for A(3,8) from 17.5s to 14.5s, saving about 17%; sigaltstack and sigprocmask are still in the picture, because they're called from scm_catch_with_pre_unwind_handler.  I'll work up a nicer patch later.

(2) We're locking a global mutex to check if there's an async item pending for the current thread.  First, why not a per-thread mutex so we don't have to potentially stop *all* threads just to see if the current one has async work to do?  But second, it looks to me like by far the common case will be checking the async queue when it's empty -- not changing it, and not checking it when it's non-empty.  If that's the case, can we use some other mechanism, perhaps atomic queue manipulation operations (like Apple's OSAtomic{Enqueue,Dequeue}) or atomic compare-and-swap, to speed up the common case?  It introduces some portability problems, though, and if pthread_mutex_trylock helps enough maybe it doesn't matter.
(3) My four-year-old comments on scm_enter/leave_guile, recorded in threads.c around line 300, still stand....  Those functions really ought to go away.  At least they're confined to one file, now.  Some of it looks a little messy, but I can probably get rid of some of the uses....

(4) Yeah, yeah, I should run gprof already. :-)

Ken
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-05 14:06 ` Ken Raeburn
@ 2009-08-06  7:31   ` Ken Raeburn
  2009-08-26 22:39     ` Neil Jerram
  0 siblings, 1 reply; 16+ messages in thread
From: Ken Raeburn @ 2009-08-06 7:31 UTC (permalink / raw)
To: guile-devel

On Aug 5, 2009, at 10:06, Ken Raeburn wrote:
> (1) In scm_pthread_mutex_lock, we leave and re-enter guile mode so
> that we don't block the thread while in guile mode.  But we could
> use pthread_mutex_trylock first, and avoid the costs scm_leave_guile
> seems to incur on the Mac.  If we can't acquire the lock, it should
> return immediately, and then we can do the expensive, blocking
> version.  A quick, hack version of this changed my run time for
> A(3,8) from 17.5s to 14.5s, saving about 17%; sigaltstack and
> sigprocmask are still in the picture, because they're called from
> scm_catch_with_pre_unwind_handler.  I'll work up a nicer patch later.

Ah, we already had scm_i_pthread_mutex_trylock lying around; that made things easy.  A second timing test with A(3,9) and this version of the patch (based on 1.9.1) shows the same improvement.

diff --git a/libguile/threads.c b/libguile/threads.c
index 9589336..8458a60 100644
--- a/libguile/threads.c
+++ b/libguile/threads.c
@@ -1826,10 +1826,15 @@ scm_std_select (int nfds,
 int
 scm_pthread_mutex_lock (scm_i_pthread_mutex_t *mutex)
 {
-  scm_t_guile_ticket t = scm_leave_guile ();
-  int res = scm_i_pthread_mutex_lock (mutex);
-  scm_enter_guile (t);
-  return res;
+  if (scm_i_pthread_mutex_trylock (mutex) == 0)
+    return 0;
+  else
+    {
+      scm_t_guile_ticket t = scm_leave_guile ();
+      int res = scm_i_pthread_mutex_lock (mutex);
+      scm_enter_guile (t);
+      return res;
+    }
 }
 
 static void
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-06  7:31 ` Ken Raeburn
@ 2009-08-26 22:39   ` Neil Jerram
  0 siblings, 0 replies; 16+ messages in thread
From: Neil Jerram @ 2009-08-26 22:39 UTC (permalink / raw)
To: Ken Raeburn; +Cc: guile-devel

Ken Raeburn <raeburn@raeburn.org> writes:

> Ah, we already had scm_i_pthread_mutex_trylock lying around; that made
> things easy.
> A second timing test with A(3,9) and this version of the patch (based
> on 1.9.1) shows the same improvement.
>
> diff --git a/libguile/threads.c b/libguile/threads.c

Thanks, I've committed this now.  (Although note that this one will also evaporate when BDW-GC is merged.)

Neil
* entering and leaving guile mode, and GC stack protection (was Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled) 2009-08-05 14:06 ` Ken Raeburn 2009-08-06 7:31 ` Ken Raeburn @ 2009-08-06 16:30 ` Ken Raeburn 2009-08-12 22:06 ` entering and leaving guile mode, and GC stack protection Andy Wingo 1 sibling, 1 reply; 16+ messages in thread From: Ken Raeburn @ 2009-08-06 16:30 UTC (permalink / raw) To: guile-devel [-- Attachment #1: Type: text/plain, Size: 2287 bytes --] On Aug 5, 2009, at 10:06, I wrote: > (3) My four-year-old comments on scm_enter/leave_guile, recorded in > threads.c around line 300, still stand.... Those functions really > ought to go away. At least they're confined to one file, now. Some > of it looks a little messy, but I can probably get rid of some of > the uses.... I've made a bit of progress on this. The basic issue is that the way scm_enter/leave_guile are implemented, they make some bad assumptions about how the stack will be laid out and where SCM values might get stored. The right way to tackle it, I believe, is to only provide functions like scm_without_guile that make callbacks, and get complete control over the transitions both into and out of guile mode so they can better manage the stack layout. It's not just scm_leave_guile() that needs to go away and be replaced by a different interface, but suspend() and anything that calls it without pairing it up nicely with resume(). Functions pairing them up need updating too, but not their interfaces; the code in between suspend and resume just needs to be pulled out into a callback function. So my changes in this first cut are basically: * New function while_suspended takes callback and replaces suspend...resume sequence; used only in scm_i_thread_sleep_for_gc now but scm_without_guile is an obvious additional place, when we can get rid of scm_leave_guile. * New helper function scm_without_guile_int, like scm_without_guile but return type is int. 
Static for now; would it be useful to export? * Replace scm_i_thread_put_to_sleep and _wake_up with scm_i_with_threads_sleeping which takes a callback. Change resize_all_states, scm_i_gc, scm_i_string_writable_chars accordingly. * Use scm_without_guile(_int) in scm_std_select, scm_pthread_mutex_lock (assumes my earlier change!), scm_pthread_cond_wait, scm_pthread_cond_timedwait, scm_i_with_threads_sleeping. There are still unpaired uses of scm_leave_guile and scm_enter_guile that I'm going to have to look at more closely before I can figure out what to do about them. And the problematic suspend() function itself is still there, but once its usage is confined to while_suspended, it'll be easier to clean up. Comments? Ken [-- Attachment #2: guile-patch.diff --] [-- Type: application/octet-stream, Size: 14137 bytes --] diff --git a/libguile/fluids.c b/libguile/fluids.c index bcd04c4..e532e39 100644 --- a/libguile/fluids.c +++ b/libguile/fluids.c @@ -87,6 +87,27 @@ static scm_t_bits tc16_dynamic_state; static SCM all_dynamic_states = SCM_EOL; static SCM all_fluids = SCM_EOL; +static void +update_all_states (void *p) +{ + SCM new_vectors = *(SCM *)p; + SCM state; + + for (state = all_dynamic_states; !scm_is_null (state); + state = DYNAMIC_STATE_NEXT (state)) + { + SCM old_fluids = DYNAMIC_STATE_FLUIDS (state); + SCM new_fluids = SCM_CAR (new_vectors); + size_t i, old_len = SCM_SIMPLE_VECTOR_LENGTH (old_fluids); + + for (i = 0; i < old_len; i++) + SCM_SIMPLE_VECTOR_SET (new_fluids, i, + SCM_SIMPLE_VECTOR_REF (old_fluids, i)); + SET_DYNAMIC_STATE_FLUIDS (state, new_fluids); + new_vectors = SCM_CDR (new_vectors); + } +} + /* Make sure that all states have the right size. This must be called while fluid_admin_mutex is held. 
*/ @@ -112,21 +133,7 @@ resize_all_states () SCM_BOOL_F), new_vectors); - scm_i_thread_put_to_sleep (); - for (state = all_dynamic_states; !scm_is_null (state); - state = DYNAMIC_STATE_NEXT (state)) - { - SCM old_fluids = DYNAMIC_STATE_FLUIDS (state); - SCM new_fluids = SCM_CAR (new_vectors); - size_t i, old_len = SCM_SIMPLE_VECTOR_LENGTH (old_fluids); - - for (i = 0; i < old_len; i++) - SCM_SIMPLE_VECTOR_SET (new_fluids, i, - SCM_SIMPLE_VECTOR_REF (old_fluids, i)); - SET_DYNAMIC_STATE_FLUIDS (state, new_fluids); - new_vectors = SCM_CDR (new_vectors); - } - scm_i_thread_wake_up (); + scm_i_with_threads_sleeping (update_all_states, &new_vectors); } /* This is called during GC, that is, while being single threaded. diff --git a/libguile/gc.c b/libguile/gc.c index b7a3bf0..818aec6 100644 --- a/libguile/gc.c +++ b/libguile/gc.c @@ -561,18 +561,20 @@ scm_check_deprecated_memory_return () long int scm_i_last_marked_cell_count; -/* Must be called while holding scm_i_sweep_mutex. +/* Must be called while holding scm_i_sweep_mutex and with all threads + sleeping. This function is fairly long, but it touches various global variables. To not obscure the side effects on global variables, this function has not been split up. */ -void -scm_i_gc (const char *what) +static void +scm_i_gc_1 (void *p) { unsigned long t_before_gc = 0; - - scm_i_thread_put_to_sleep (); +#ifdef DEBUGINFO + const char *what = *(const char *)p; +#endif scm_c_hook_run (&scm_before_gc_c_hook, 0); @@ -678,8 +680,6 @@ scm_i_gc (const char *what) the time taken. 
*/ scm_gc_time_taken += (scm_c_get_internal_run_time () - t_before_gc); - - scm_i_thread_wake_up (); /* For debugging purposes, you could do scm_i_sweep_all_segments ("debug"), but then the remains of the @@ -687,6 +687,11 @@ scm_i_gc (const char *what) */ } +void +scm_i_gc (const char *what) +{ + scm_i_with_threads_sleeping (scm_i_gc_1, &what); +} \f /* {GC Protection Helper Functions} diff --git a/libguile/strings.c b/libguile/strings.c index 4e21f3e..ab8cd14 100644 --- a/libguile/strings.c +++ b/libguile/strings.c @@ -339,6 +339,21 @@ scm_i_string_chars (SCM str) return STRINGBUF_CHARS (buf) + start; } +struct replace_data { + SCM str, new_buf; + size_t start; +}; + +static void +replace_stringbuf_data (void *p) +{ + struct replace_data *r = p; + + SET_STRING_STRINGBUF (r->str, r->new_buf); + r->start = STRING_START (r->str); + SET_STRING_START (r->str, 0); +} + char * scm_i_string_writable_chars (SCM orig_str) { @@ -357,19 +372,17 @@ scm_i_string_writable_chars (SCM orig_str) size_t len = STRING_LENGTH (str); SCM new_buf; + struct replace_data r; scm_i_pthread_mutex_unlock (&stringbuf_write_mutex); new_buf = make_stringbuf (len); memcpy (STRINGBUF_CHARS (new_buf), STRINGBUF_CHARS (buf) + STRING_START (str), len); - - scm_i_thread_put_to_sleep (); - SET_STRING_STRINGBUF (str, new_buf); - start -= STRING_START (str); - SET_STRING_START (str, 0); - scm_i_thread_wake_up (); - + r.str = str; + r.new_buf = new_buf; + scm_i_with_threads_sleeping (replace_stringbuf_data, &r); + start -= r.start; buf = new_buf; scm_i_pthread_mutex_lock (&stringbuf_write_mutex); diff --git a/libguile/threads.c b/libguile/threads.c index 8458a60..a2a98f3 100644 --- a/libguile/threads.c +++ b/libguile/threads.c @@ -428,6 +428,25 @@ suspend (void) setjmp (t->regs); return t; } + + +#if __GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 1) +/* For paranoia's sake, make sure the stack slot we take the address + of is allocated after (deeper on the stack) than anything from the + parent, by 
not permitting inline expansion and thus possible + rearranging of things within the caller's stack frame. */ +static void +while_suspended (void (*func)(void *, scm_i_thread *), void *data) + __attribute__((noinline)); +#endif + +static void +while_suspended (void (*func)(void *, scm_i_thread *), void *data) +{ + scm_i_thread *t = suspend (); + func (data, t); + resume (t); +} static scm_t_guile_ticket scm_leave_guile () @@ -870,6 +889,20 @@ scm_without_guile (void *(*func)(void *), void *data) return res; } +/* Internal helper function -- like scm_without_guile, but for various + cases in this file where we actually want to return an int, not + a pointer. */ +static int +scm_without_guile_int (int (*func)(void *), void *data) +{ + int res; + scm_t_guile_ticket t; + t = scm_leave_guile (); + res = func (data); + scm_enter_guile (t); + return res; +} + /*** Thread creation */ typedef struct { @@ -1769,6 +1802,28 @@ scm_threads_mark_stacks (void) /*** Select */ +struct select_callback_data { + SELECT_TYPE *readfds, *writefds, *exceptfds; + struct timeval *timeout; + scm_i_thread *t; + int nfds, wakeup_fd, eno; +}; + +static int +select_callback (void *p) +{ + struct select_callback_data *cb = p; + int res; + + FD_SET (cb->wakeup_fd, cb->readfds); + if (cb->wakeup_fd >= cb->nfds) + cb->nfds = cb->wakeup_fd+1; + res = select (cb->nfds, cb->readfds, cb->writefds, cb->exceptfds, cb->timeout); + cb->t->sleep_fd = -1; + cb->eno = errno; + return res; +} + int scm_std_select (int nfds, SELECT_TYPE *readfds, @@ -1777,45 +1832,41 @@ scm_std_select (int nfds, struct timeval *timeout) { fd_set my_readfds; - int res, eno, wakeup_fd; + int res; scm_i_thread *t = SCM_I_CURRENT_THREAD; - scm_t_guile_ticket ticket; + struct select_callback_data cb = { + readfds, writefds, exceptfds, timeout, t, nfds, + }; if (readfds == NULL) { FD_ZERO (&my_readfds); - readfds = &my_readfds; + cb.readfds = &my_readfds; } while (scm_i_setup_sleep (t, SCM_BOOL_F, NULL, t->sleep_pipe[1])) SCM_TICK; - 
wakeup_fd = t->sleep_pipe[0]; - ticket = scm_leave_guile (); - FD_SET (wakeup_fd, readfds); - if (wakeup_fd >= nfds) - nfds = wakeup_fd+1; - res = select (nfds, readfds, writefds, exceptfds, timeout); - t->sleep_fd = -1; - eno = errno; - scm_enter_guile (ticket); + cb.wakeup_fd = t->sleep_pipe[0]; + + res = scm_without_guile_int (select_callback, &cb); scm_i_reset_sleep (t); - if (res > 0 && FD_ISSET (wakeup_fd, readfds)) + if (res > 0 && FD_ISSET (cb.wakeup_fd, cb.readfds)) { char dummy; - full_read (wakeup_fd, &dummy, 1); + full_read (cb.wakeup_fd, &dummy, 1); - FD_CLR (wakeup_fd, readfds); + FD_CLR (cb.wakeup_fd, cb.readfds); res -= 1; if (res == 0) { - eno = EINTR; + cb.eno = EINTR; res = -1; } } - errno = eno; + errno = cb.eno; return res; } @@ -1823,18 +1874,18 @@ scm_std_select (int nfds, #if SCM_USE_PTHREAD_THREADS +static int +mutex_lock_callback (void *p) +{ + return scm_i_pthread_mutex_lock (p); +} + int scm_pthread_mutex_lock (scm_i_pthread_mutex_t *mutex) { if (scm_i_pthread_mutex_trylock (mutex) == 0) return 0; - else - { - scm_t_guile_ticket t = scm_leave_guile (); - int res = scm_i_pthread_mutex_lock (mutex); - scm_enter_guile (t); - return res; - } + return scm_without_guile_int (mutex_lock_callback, mutex); } static void @@ -1850,14 +1901,45 @@ scm_dynwind_pthread_mutex_lock (scm_i_pthread_mutex_t *mutex) scm_dynwind_unwind_handler (do_unlock, mutex, SCM_F_WIND_EXPLICITLY); } +struct cond_wait_callback_data { + scm_i_pthread_cond_t *cond; + scm_i_pthread_mutex_t *mutex; +}; + +static int +cond_wait_callback (void *p) +{ + struct cond_wait_callback_data *cb = p; + int res; + scm_i_thread *t = SCM_I_CURRENT_THREAD; + t->held_mutex = cb->mutex; + res = scm_i_pthread_cond_wait (cb->cond, cb->mutex); + t->held_mutex = NULL; + return res; +} + int scm_pthread_cond_wait (scm_i_pthread_cond_t *cond, scm_i_pthread_mutex_t *mutex) { - scm_t_guile_ticket t = scm_leave_guile (); - ((scm_i_thread *)t)->held_mutex = mutex; - int res = scm_i_pthread_cond_wait 
(cond, mutex); - ((scm_i_thread *)t)->held_mutex = NULL; - scm_enter_guile (t); + struct cond_wait_callback_data cb = { cond, mutex }; + return scm_without_guile_int (cond_wait_callback, &cb); +} + +struct cond_timedwait_callback_data { + scm_i_pthread_cond_t *cond; + scm_i_pthread_mutex_t *mutex; + const scm_t_timespec *wt; +}; + +static int +cond_timedwait_callback (void *p) +{ + struct cond_timedwait_callback_data *cb = p; + int res; + scm_i_thread *t = SCM_I_CURRENT_THREAD; + t->held_mutex = cb->mutex; + res = scm_i_pthread_cond_timedwait (cb->cond, cb->mutex, cb->wt); + t->held_mutex = NULL; return res; } @@ -1866,12 +1948,8 @@ scm_pthread_cond_timedwait (scm_i_pthread_cond_t *cond, scm_i_pthread_mutex_t *mutex, const scm_t_timespec *wt) { - scm_t_guile_ticket t = scm_leave_guile (); - ((scm_i_thread *)t)->held_mutex = mutex; - int res = scm_i_pthread_cond_timedwait (cond, mutex, wt); - ((scm_i_thread *)t)->held_mutex = NULL; - scm_enter_guile (t); - return res; + struct cond_timedwait_callback_data cb = { cond, mutex, wt }; + return scm_without_guile_int (cond_timedwait_callback, &cb); } #endif @@ -1970,25 +2048,6 @@ int scm_i_thread_go_to_sleep; static int threads_initialized_p = 0; void -scm_i_thread_put_to_sleep () -{ - if (threads_initialized_p) - { - scm_i_thread *t; - - scm_leave_guile (); - scm_i_pthread_mutex_lock (&thread_admin_mutex); - - /* Signal all threads to go to sleep - */ - scm_i_thread_go_to_sleep = 1; - for (t = all_threads; t; t = t->next_thread) - scm_i_pthread_mutex_lock (&t->heap_mutex); - scm_i_thread_go_to_sleep = 0; - } -} - -void scm_i_thread_invalidate_freelists () { /* thread_admin_mutex is already locked. 
*/ @@ -1999,34 +2058,68 @@ scm_i_thread_invalidate_freelists () t->clear_freelists_p = 1; } +struct sleeping_callback { + void (*func) (void *); + void *data; +}; + +static void * +sleeping_callback (void *p) +{ + struct sleeping_callback *cb = p; + scm_i_thread *t; + + scm_i_pthread_mutex_lock (&thread_admin_mutex); + + /* Signal all threads to go to sleep + */ + scm_i_thread_go_to_sleep = 1; + for (t = all_threads; t; t = t->next_thread) + scm_i_pthread_mutex_lock (&t->heap_mutex); + scm_i_thread_go_to_sleep = 0; + + cb->func (cb->data); + + scm_i_pthread_cond_broadcast (&wake_up_cond); + for (t = all_threads; t; t = t->next_thread) + scm_i_pthread_mutex_unlock (&t->heap_mutex); + scm_i_pthread_mutex_unlock (&thread_admin_mutex); + + return NULL; +} + void -scm_i_thread_wake_up () +scm_i_with_threads_sleeping (void (*func)(void *), void *data) { if (threads_initialized_p) { - scm_i_thread *t; - - scm_i_pthread_cond_broadcast (&wake_up_cond); - for (t = all_threads; t; t = t->next_thread) - scm_i_pthread_mutex_unlock (&t->heap_mutex); - scm_i_pthread_mutex_unlock (&thread_admin_mutex); - scm_enter_guile ((scm_t_guile_ticket) SCM_I_CURRENT_THREAD); + struct sleeping_callback cb = { func, data }; + scm_without_guile (sleeping_callback, &cb); + } + else + { + func (data); + /* Can func ever cause initialization to happen if it hadn't + been done before? */ + assert (threads_initialized_p == 0); } } -void -scm_i_thread_sleep_for_gc () +static void +sleep_for_gc_helper (void *p, scm_i_thread *t) { - scm_i_thread *t = suspend (); - /* Don't put t->heap_mutex in t->held_mutex here, because if the thread is cancelled during the cond wait, the thread's cleanup function (scm_leave_guile_cleanup) will handle unlocking the heap_mutex, so we don't need to do that again in on_thread_exit. 
*/ scm_i_pthread_cond_wait (&wake_up_cond, &t->heap_mutex); +} - resume (t); +void +scm_i_thread_sleep_for_gc () +{ + while_suspended (sleep_for_gc_helper, NULL); } /* This mutex is used by SCM_CRITICAL_SECTION_START/END. diff --git a/libguile/threads.h b/libguile/threads.h index 32b0ea6..407acb7 100644 --- a/libguile/threads.h +++ b/libguile/threads.h @@ -153,8 +153,8 @@ SCM_INTERNAL void *scm_i_with_guile_and_parent (void *(*func)(void *), extern int scm_i_thread_go_to_sleep; -SCM_INTERNAL void scm_i_thread_put_to_sleep (void); -SCM_INTERNAL void scm_i_thread_wake_up (void); +SCM_INTERNAL void scm_i_with_threads_sleeping (void (*func)(void *), + void *data); SCM_INTERNAL void scm_i_thread_invalidate_freelists (void); void scm_i_thread_sleep_for_gc (void); [-- Attachment #3: Type: text/plain, Size: 1 bytes --] ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: entering and leaving guile mode, and GC stack protection
  2009-08-06 16:30 ` entering and leaving guile mode, and GC stack protection (was Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled) Ken Raeburn
@ 2009-08-12 22:06 ` Andy Wingo
  2009-08-14  7:58 ` Ludovic Courtès
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2009-08-12 22:06 UTC (permalink / raw)
To: Ken Raeburn; +Cc: guile-devel

Hi Ken,

On Thu 06 Aug 2009 18:30, Ken Raeburn <raeburn@raeburn.org> writes:

> On Aug 5, 2009, at 10:06, I wrote:
>> (3) My four-year-old comments on scm_enter/leave_guile, recorded in
>> threads.c around line 300, still stand....  Those functions really
>> ought to go away.  At least they're confined to one file, now.  Some
>> of it looks a little messy, but I can probably get rid of some of the
>> uses....
>
> I've made a bit of progress on this.

The patches look good to me; my only wonder is what relation they have
to the BDW-GC branch Ludovic was working on.  If BDW will land before
2.0, then perhaps all this mess can go away (wishful thinking);
otherwise we should apply it now (after the release).  Ludovic? :)

Cheers,

Andy
--
http://wingolog.org/

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: entering and leaving guile mode, and GC stack protection
  2009-08-12 22:06 ` entering and leaving guile mode, and GC stack protection Andy Wingo
@ 2009-08-14  7:58 ` Ludovic Courtès
  0 siblings, 0 replies; 16+ messages in thread
From: Ludovic Courtès @ 2009-08-14 7:58 UTC (permalink / raw)
To: guile-devel

Hello,

Andy Wingo <wingo@pobox.com> writes:

> On Thu 06 Aug 2009 18:30, Ken Raeburn <raeburn@raeburn.org> writes:
>
>> On Aug 5, 2009, at 10:06, I wrote:
>>> (3) My four-year-old comments on scm_enter/leave_guile, recorded in
>>> threads.c around line 300, still stand....  Those functions really
>>> ought to go away.  At least they're confined to one file, now.  Some
>>> of it looks a little messy, but I can probably get rid of some of the
>>> uses....
>>
>> I've made a bit of progress on this.
>
> The patches look good to me; my only wonder is what relation they have
> to the BDW-GC branch Ludovic was working on.  If BDW will land before
> 2.0, then perhaps all this mess can go away (wishful thinking);
> otherwise we should apply it now (after the release).  Ludovic? :)

Exactly.  I've been meaning to reply to this thread because of this.

These functions don't do much in the BDW-GC branch:
http://git.savannah.gnu.org/cgit/guile.git/tree/libguile/threads.c?h=boehm-demers-weiser-gc#n379

Likewise, `scm_without_guile ()' does a `GC_do_blocking ()', but in many
cases `scm_without_guile ()' is not needed because BDW-GC doesn't rely
on cooperation from all threads to work.

And yes, I do hope to have it part of 2.0, but I haven't taken the time
to update it lately.

Thanks,
Ludo'.

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-05  9:06 ` Andy Wingo
  2009-08-05 14:06 ` Ken Raeburn
@ 2009-08-08 21:52 ` Ludovic Courtès
  2009-08-12 22:28 ` Andy Wingo
  1 sibling, 1 reply; 16+ messages in thread
From: Ludovic Courtès @ 2009-08-08 21:52 UTC (permalink / raw)
To: guile-devel

Hello!

Andy Wingo <wingo@pobox.com> writes:

> Perhaps the deal is that 1+ is not being inlined. It seems to be calling
> out to the 1+ procedure, which is a primitive, actually making it a
> bit slower. I'll see what I can do.

Currently, invoking subrs conses, which is Bad(TM).  Code that keeps
invoking `1+' and `1-' must suffer from this.

Thanks,
Ludo'.

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-08 21:52 ` guile performance - Ackermann function: way slower than emacs, slower still if compiled Ludovic Courtès
@ 2009-08-12 22:28 ` Andy Wingo
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Wingo @ 2009-08-12 22:28 UTC (permalink / raw)
To: Ludovic Courtès; +Cc: guile-devel

Heya,

On Sat 08 Aug 2009 23:52, ludo@gnu.org (Ludovic Courtès) writes:

> Andy Wingo <wingo@pobox.com> writes:
>
>> Perhaps the deal is that 1+ is not being inlined. It seems to be calling
>> out to the 1+ procedure, which is a primitive, actually making it a
>> bit slower. I'll see what I can do.
>
> Currently, invoking subrs conses, which is Bad(TM).

Indeed.

> Code that keeps invoking `1+' and `1-' must suffer from this.

Not any more!  There are add1 and sub1 instructions now.

A
--
http://wingolog.org/

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-04  9:28 guile performance - Ackermann function: way slower than emacs, slower still if compiled Ken Raeburn
  2009-08-04 13:52 ` Marijn Schouten (hkBst)
  2009-08-05  9:06 ` Andy Wingo
@ 2009-08-05 10:42 ` Andy Wingo
  2009-08-06 15:59 ` Andy Wingo
  2009-08-05 11:15 ` Andy Wingo
  3 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2009-08-05 10:42 UTC (permalink / raw)
To: Ken Raeburn; +Cc: guile-devel

Hi,

On Tue 04 Aug 2009 11:28, Ken Raeburn <raeburn@raeburn.org> writes:

> I implemented Ackermann's function A(m,n), a recursive function with
> scary performance characteristics, based on the definition given at
> wikipedia involving lots of +1 and -1 operations...

I have now added 1+ and 1- ops to Guile's VM.  They don't seem to make
a terrible difference for your case however -- I get similar timings
both with Marijn's code and with yours.

One issue is that Guile's `while' macro provides continue and break
constructs, which involve consing up those closures when you enter a
while block.  The compiler should be able to note when those vars are
not referenced, but it does not do so now.  That will have to wait for
our inliner.

> I wrote both elisp and scheme versions; I changed the recursion to an
> explicitly-managed stack of values, to avoid some of the dynamic-vs-
> lexical argument binding issues, turn the native stack usage into large
> numbers of cell allocations and discards, and avoid exceeding the
> maximum Lisp evaluation depth Emacs permits.  Not surprisingly, it's
> not too fast, and Emacs spends a lot of time in garbage collection
> passes even with the byte-compiled version.

Indeed, moving to an explicitly-managed stack tests GC much more than a
recursive solution.
The simple stack formulation of Ackermann's function is faster of
course:

  (define (A m n)
    (if (= m 0)
        (+ n 1)
        (if (= n 0)
            (A (- m 1) 1)
            (A (- m 1) (A m (- n 1))))))

(A 3 9) completes in 1.45 seconds for me.

Curiously, the letrec case takes more time:

  (define (A m n)
    (let A ((m m) (n n))
      (if (= m 0)
          (+ n 1)
          (if (= n 0)
              (A (- m 1) 1)
              (A (- m 1) (A m (- n 1)))))))

It should be faster, because we don't have to deref the toplevel
variable for A, but the compiler needs to be a little smarter.  This
version takes about 1.6 seconds for me.

For comparison, your elisp version computes A(3,9) in 2.9 seconds on
this box.  Apples and oranges, of course, due to the different
implementation strategies.

> ELISP> (time-test '(ackermann 3 9))
> 2.482042

I suppose this is the comparison, then.

> ELISP> (time-test '(ackermann 4 1))
> 642.746514

And the stack-based version overflows for this one.  We need to
implement overflow handlers for the VM stack.

> % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 9))'
> 48.728u 22.016s 1:10.75 99.9% 0+0k 0+0io 0pf+0w

Indeed, I get similar times (44 seconds; presumably some of the
compilation fixes actually had an effect).  I'm running this under
valgrind now, I'll see what pops up.

Very interesting test, thanks.

Cheers,

Andy
--
http://wingolog.org/

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-05 10:42 ` Andy Wingo
@ 2009-08-06 15:59 ` Andy Wingo
  2009-08-07 17:27 ` Andy Wingo
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Wingo @ 2009-08-06 15:59 UTC (permalink / raw)
To: Ken Raeburn; +Cc: guile-devel

Hi,

On Wed 05 Aug 2009 12:42, Andy Wingo <wingo@pobox.com> writes:

> The simple stack formulation of Ackermann's function is faster of
> course:
>
>   (define (A m n)
>     (if (= m 0)
>         (+ n 1)
>         (if (= n 0)
>             (A (- m 1) 1)
>             (A (- m 1) (A m (- n 1))))))
>
> (A 3 9) completes in 1.45 seconds for me.
>
> Curiously, the letrec case takes more time:
>
>   (define (A m n)
>     (let A ((m m) (n n))
>       (if (= m 0)
>           (+ n 1)
>           (if (= n 0)
>               (A (- m 1) 1)
>               (A (- m 1) (A m (- n 1)))))))
>
> It should be faster, because we don't have to deref the toplevel
> variable for A, but the compiler needs to be a little smarter. This
> version takes about 1.6 seconds for me.

I have fixed this issue.  The letrec version now takes slightly less
time than the version that recurses through the toplevel.

>> % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 9))'
>> 48.728u 22.016s 1:10.75 99.9% 0+0k 0+0io 0pf+0w
>
> Indeed, I get similar times (44 seconds; presumably some of the
> compilation fixes actually had an effect).

This one now takes very marginally less time (about a second) -- not
much of an improvement, but I wanted to take care of it before tackling
catch / dynamic-wind.

Thanks,

Andy
--
http://wingolog.org/

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-06 15:59 ` Andy Wingo
@ 2009-08-07 17:27 ` Andy Wingo
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Wingo @ 2009-08-07 17:27 UTC (permalink / raw)
To: Ken Raeburn; +Cc: guile-devel

On Thu 06 Aug 2009 17:59, Andy Wingo <wingo@pobox.com> writes:

> On Wed 05 Aug 2009 12:42, Andy Wingo <wingo@pobox.com> writes:
>
>> The simple stack formulation of Ackermann's function is faster of
>> course:
>>
>>   (define (A m n)
>>     (if (= m 0)
>>         (+ n 1)
>>         (if (= n 0)
>>             (A (- m 1) 1)
>>             (A (- m 1) (A m (- n 1))))))
>>
>> (A 3 9) completes in 1.45 seconds for me.
>>
>> Curiously, the letrec case takes more time:
>>
>>   (define (A m n)
>>     (let A ((m m) (n n))
>>       (if (= m 0)
>>           (+ n 1)
>>           (if (= n 0)
>>               (A (- m 1) 1)
>>               (A (- m 1) (A m (- n 1)))))))
>>
>> It should be faster, because we don't have to deref the toplevel
>> variable for A, but the compiler needs to be a little smarter. This
>> version takes about 1.6 seconds for me.
>
> I have fixed this issue. The letrec version now takes slightly less time
> than the version that recurses through the toplevel.

So, I couldn't stop, being so close -- I went ahead and implemented
mutual tail-recursion as Steele did in the Lambda: The Ultimate GOTO
paper.  It doesn't have any effect in this case, but in Daniel's
tight-loop code it should have an effect.  Here was that code:

  (define (test to)
    (let ((i 0))
      (while (< i to)
        (set! i (1+ i)))))

Daniel said this was the result:

> Tight loop, (test 1000000):
> Scheme: 0.72s

That's with 1 million iterations.  But I just tried it with 10 million
iterations, with current Guile from this afternoon, and got the same
timing.  Daniel can you test again?  This would be a pleasant
surprise :)

It doesn't help the ackermann case, though.  That will come later.
Unfortunately the letrec case is actually taking more time than it used
to before this patch; alack.  It doesn't look like I'll get to
dynamic-wind until next thursday or so.

Until then, happy hacking, and keep up the great bug reports!

A
--
http://wingolog.org/

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled
  2009-08-04  9:28 guile performance - Ackermann function: way slower than emacs, slower still if compiled Ken Raeburn
  ` (2 preceding siblings ...)
  2009-08-05 10:42 ` Andy Wingo
@ 2009-08-05 11:15 ` Andy Wingo
  3 siblings, 0 replies; 16+ messages in thread
From: Andy Wingo @ 2009-08-05 11:15 UTC (permalink / raw)
To: Ken Raeburn; +Cc: guile-devel

Hi Ken,

Replying a lot :)

On Tue 04 Aug 2009 11:28, Ken Raeburn <raeburn@raeburn.org> writes:

> % time guile -c '(begin (load-compiled "ack.go") (ackermann 3 9))'
> 48.728u 22.016s 1:10.75 99.9% 0+0k 0+0io 0pf+0w
> %
>
> Not much better than loading the .scm file, and only better for small
> values of n... in fact, with m=3 and n>=6 the performance seems worse
> in the precompiled case:

I ran this under kcachegrind and it seems that the penalty is leaving
the VM for dispatching to `catch'.  This is related to Daniel's issues
leaving the VM for dynamic-wind.  So it seems that's what I should work
on.  There are a number of other little issues, but this is the big
one.

Andy
--
http://wingolog.org/

^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-08-26 22:39 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz  follow: Atom feed
-- links below jump to the message on this page --
2009-08-04  9:28 guile performance - Ackermann function: way slower than emacs, slower still if compiled Ken Raeburn
2009-08-04 13:52 ` Marijn Schouten (hkBst)
2009-08-04 15:56   ` Ken Raeburn
2009-08-05  9:06 ` Andy Wingo
2009-08-05 14:06   ` Ken Raeburn
2009-08-06  7:31     ` Ken Raeburn
2009-08-26 22:39       ` Neil Jerram
2009-08-06 16:30     ` entering and leaving guile mode, and GC stack protection (was Re: guile performance - Ackermann function: way slower than emacs, slower still if compiled) Ken Raeburn
2009-08-12 22:06       ` entering and leaving guile mode, and GC stack protection Andy Wingo
2009-08-14  7:58         ` Ludovic Courtès
2009-08-08 21:52   ` guile performance - Ackermann function: way slower than emacs, slower still if compiled Ludovic Courtès
2009-08-12 22:28     ` Andy Wingo
2009-08-05 10:42 ` Andy Wingo
2009-08-06 15:59   ` Andy Wingo
2009-08-07 17:27     ` Andy Wingo
2009-08-05 11:15 ` Andy Wingo