Hey all,

I've finally gotten back to profiling Guile, and I'd like to write about what I have found. First, however, I will describe the process I am using.

The subject of investigation
----------------------------

My machine is a PowerBook, 1.5 GHz G4 PPC with 512 MB RAM. Not particularly beefy. I am interested in measuring Guile 1.8 from CVS; I compile with GCC 4.1.2, with CFLAGS='-g -O2'. CFLAGS are very important for Guile, as without -O, scm_is_pair is not inlined.

How I benchmark
---------------

Before beginning, we need benchmarks of the current status. To give context for them, I will also measure Guile 1.6.8 from Debian, as well as Python 2.5.1c1 and SCM 5e3 where applicable.

The benchmarks measure real (wall-clock) time on an otherwise unloaded system running X, since this is the figure that we actually wish to improve. Timing measurements are made with `time' from bash; I export TIMEFORMAT=%R to print out just the real time. For example, to measure Guile's initialization time, I run:

  for i in `seq 10`; do time guile -c 1; done

I stopped powernowd for the duration of the tests so that the CPU runs at full speed, and I record the observations only once the numbers "settle", i.e. with warm caches. I then put the observations into a file like the one attached, which is actually a script that will update a graph. I have attached a graph as well. As you can see, Guile 1.8 is significantly slower than anything out there.

How I profile
-------------

I use valgrind's "callgrind" tool, available from valgrind SVN. This tool simulates execution of the program, recording all instruction fetches as costs. These instruction fetches correspond in a rough way to time; however, you can't trust them completely, so re-benchmarking is always the way to prove that changes have had an effect. I run a test like this:

  valgrind --tool=callgrind --num-callers=100 \
      --base=guile-1.8-tls-callgrind --separate-recs=1 \
      /opt/guile/bin/guile -c 1

This dumps a log into the file guile-1.8-tls-callgrind.PID, in "callgrind format". I use the callgrind-output-gprof script to analyze this file; it can be found at http://wingolog.org/pub/callgrind-output-gprof. (The script itself may be verified against callgrind_annotate or kcachegrind.)

Analysis
--------

Profiling initialization (guile -c 1) of current 1.8 CVS with Ludovic's TLS patch gives the following results, ordered by self percent:

  cumulative     self    total
     percent  percent    calls  file:function
       15.19    12.99     1577  gc-card.c:scm_i_sweep_card[/opt/guile/lib/libguile.so.17.0.1]
       10.27     8.52   144923  ports.c:scm_getc[/opt/guile/lib/libguile.so.17.0.1]
        7.95     7.95   111779  ???:pthread_mutex_lock[/lib/libpthread-2.5.so]
        6.79     6.79   129005  ???:__tls_get_addr[/lib/ld-2.5.so]
        5.22     5.22   111778  ???:__pthread_mutex_unlock_usercnt[/lib/libpthread-2.5.so]
       62.35     3.65    57086  strings.c:scm_c_string_set_x[/opt/guile/lib/libguile.so.17.0.1]
       36.91     3.61    57086  strings.c:scm_i_string_writable_chars[/opt/guile/lib/libguile.so.17.0.1]
       21.77     3.60   128584  gc-mark.c:scm_gc_mark[/opt/guile/lib/libguile.so.17.0.1]
       18.07     3.08    27606  gc-mark.c:scm_gc_mark_dependencies[/opt/guile/lib/libguile.so.17.0.1]
      365.99     2.52    42509  eval.c:ceval[/opt/guile/lib/libguile.so.17.0.1]
      [...]

Cumulative percent measures the total time[1] spent in a function and its callees. It can be misleading; for example, scm_gc_mark calls itself, which adds to its cumulative time figure without actually meaning anything.
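To make that caveat concrete, here is a toy illustration (not Guile code, just the shape of the problem as I understand it):

  /* Toy illustration only: a self-recursive tree walk in the style of
     scm_gc_mark.  Each frame's cumulative cost includes the cost of every
     frame below it, so if the separated recursive entries are folded back
     together by name, the same work gets counted once per level of
     recursion.  That is presumably how ceval ends up with a cumulative
     figure of 365.99% in the table above. */
  struct node { struct node *left, *right; };

  static void
  mark (struct node *n)
  {
    if (!n)
      return;
    mark (n->left);
    mark (n->right);
  }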
The most "real" numbers are the self percent numbers, but if you can convince yourself that a function does not recurse, the cumulative percent numbers are meaningful as well. For example, in this case we see that 62.35% of initialization time is spent in scm_c_string_set_x.

At this point we have two questions: (1) is that plausible, and (2) what calls this function?

To answer (1): yes, it seems that scm_c_string_set_x has the potential to be really, really slow:

  1. First, scm_c_string_set_x calls scm_i_string_writable_chars.
     1.1. scm_i_string_writable_chars locks a global mutex.
     1.2. If the string's buffer is shared (STRINGBUF_SHARED ()), the lock
          is released, the buffer is copied, all threads are put to sleep
          (!), and the lock is retaken.
  2. Next, scm_c_string_set_x sets the char, i.e. buf[i] = c.
  3. Finally, scm_i_string_stop_writing is called, which unlocks the mutex.

Looking at the top entries of the profile, it is plausible that all of them are caused by string-set!: the time in GC because we allocate a new buffer in many of the cases in which we do string-set!, and the mutex operations being on the string write mutex.

At this point, to improve performance, we have two choices: (1) make string-set! cheaper, or (2) avoid string-set!. I do not know how to do (1) in the presence of threads[2]. (2) seems feasible if we look at which functions actually call scm_c_string_set_x. The ones that show up in the profile are all in read.c:

  ./read.c:628:  scm_c_string_set_x (*tok_buf, j, SCM_MAKE_CHAR (c));
  ./read.c:703:  scm_c_string_set_x (*tok_buf, j, SCM_MAKE_CHAR (c));
  ./read.c:766:  scm_c_string_set_x (*tok_buf, j, SCM_MAKE_CHAR (c));

All of these calls use the token buffer API, in which a SCM string is allocated and grown as necessary, and the readers fill the buffer with string-set!. Probably the easiest way to make this faster is to make the token buffer a special kind of SMOB that holds a scm_malloc'd data area, which is assumed to be accessed from only one thread, as is the case in read(). At the end, the data can be stolen from the token buffer into a string via scm_i_take_stringbufn. (A rough sketch of what I have in mind is at the end of this mail.)

[1] Really, the total percent of instruction fetches. Since this corresponds more or less to time, I say time.

[2] I am told that mutable strings are not even part of the r6rs proposal.

Conclusions
-----------

I have done the analysis and am hoping to convince Ludovic to make a patch :-) If we get Guile 1.8 load time down to Python load time, I will be happy. At that point I can go back to optimizing guile-gnome's load time, which currently stands at about 2 seconds.

Cheers,

Andy.
--
http://wingolog.org/
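P.S. To make the token buffer idea a bit more concrete, here is a rough sketch of the kind of thing I mean. It is not working code: the SMOB type registration and mark/free hooks are left out, the names (tok_buf, tok_buf_put, and so on) are made up for the example, and I steal the buffer with the public scm_take_locale_stringn since I haven't checked the exact signature of scm_i_take_stringbufn.

  #include <stddef.h>
  #include <libguile.h>

  /* Sketch only: a token buffer living in plain scm_malloc'd memory,
     touched by exactly one thread, so filling it needs no locking and
     no copy-on-write. */
  struct tok_buf
  {
    char *chars;   /* scm_malloc'd area, owned by this buffer alone */
    size_t len;    /* characters written so far */
    size_t size;   /* allocated size */
  };

  /* Start small; the reader grows the area as needed. */
  static void
  tok_buf_init (struct tok_buf *buf)
  {
    buf->size = 64;
    buf->len = 0;
    buf->chars = scm_malloc (buf->size);
  }

  /* Append one character: no mutex, no thread synchronization. */
  static void
  tok_buf_put (struct tok_buf *buf, char c)
  {
    if (buf->len == buf->size)
      {
        buf->size *= 2;
        buf->chars = scm_realloc (buf->chars, buf->size);
      }
    buf->chars[buf->len++] = c;
  }

  /* When the token is complete, steal the malloc'd area into a string
     instead of filling an SCM string char by char with string-set!. */
  static SCM
  tok_buf_to_string (struct tok_buf *buf)
  {
    SCM str = scm_take_locale_stringn (buf->chars, buf->len);
    buf->chars = NULL;
    buf->len = buf->size = 0;
    return str;
  }

The point is just that the hot path, tok_buf_put, never touches a mutex or wakes the GC; the string machinery only gets involved once per token, when the finished buffer is handed over.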