Hey all,

I've finally gotten back to profiling Guile, and I'd like to write about what I have found. First, however, I will describe the process I am using.

The subject of investigation
----------------------------

My machine is a PowerBook, 1.5 GHz G4 PPC with 512 MB RAM. Not particularly beefy. I am interested in measuring Guile 1.8 from CVS; I compile with GCC 4.1.2, with CFLAGS='-g -O2'. CFLAGS are very important for Guile, as without -O, scm_is_pair is not inlined.

How I benchmark
---------------

Before beginning, we need benchmarks of the current status. To give context for them, I will also measure Guile 1.6.8 from Debian, as well as Python 2.5.1c1 and SCM 5e3 where applicable.

The benchmarks measure real (wall-clock) time on an otherwise unloaded system running X, since this is the figure that we actually wish to improve. Timing measurements are made with `time' from bash; I export TIMEFORMAT=%R to print out just the real time. For example, to measure Guile's initialization time, I run:

  for i in `seq 10`; do time guile -c 1; done

I stopped powernowd for the duration of the tests so that the CPU runs at full speed, and I record the observations only once the numbers "settle", i.e. with warm caches. I then put the observations into a file like the one attached, which is actually a script that will update a graph. I have attached a graph as well. As you can see, Guile 1.8 is significantly slower than anything out there.

How I profile
-------------

I use valgrind's "callgrind" tool, available from valgrind SVN. This tool simulates execution of the program, recording all instruction fetches as costs. These instruction fetches correspond in a rough way to time; however, you can't trust them completely, so re-benchmarking is always the way to prove that changes have had an effect. I run a test like this:

  valgrind --tool=callgrind --num-callers=100 \
      --base=guile-1.8-tls-callgrind --separate-recs=1 \
      /opt/guile/bin/guile -c 1

This dumps a log into the file guile-1.8-tls-callgrind.PID, in "callgrind format". I use the callgrind-output-gprof script to analyze this file; it can be found at http://wingolog.org/pub/callgrind-output-gprof. (The script itself may be verified against callgrind_annotate or kcachegrind.)

Analysis
--------

Profiling initialization (guile -c 1) of current 1.8 CVS with Ludovic's TLS patch gives the following results, ordered by self percent:

  cumulative     self    total
     percent  percent    calls  file:function
       15.19    12.99     1577  gc-card.c:scm_i_sweep_card[/opt/guile/lib/libguile.so.17.0.1]
       10.27     8.52   144923  ports.c:scm_getc[/opt/guile/lib/libguile.so.17.0.1]
        7.95     7.95   111779  ???:pthread_mutex_lock[/lib/libpthread-2.5.so]
        6.79     6.79   129005  ???:__tls_get_addr[/lib/ld-2.5.so]
        5.22     5.22   111778  ???:__pthread_mutex_unlock_usercnt[/lib/libpthread-2.5.so]
       62.35     3.65    57086  strings.c:scm_c_string_set_x[/opt/guile/lib/libguile.so.17.0.1]
       36.91     3.61    57086  strings.c:scm_i_string_writable_chars[/opt/guile/lib/libguile.so.17.0.1]
       21.77     3.60   128584  gc-mark.c:scm_gc_mark[/opt/guile/lib/libguile.so.17.0.1]
       18.07     3.08    27606  gc-mark.c:scm_gc_mark_dependencies[/opt/guile/lib/libguile.so.17.0.1]
      365.99     2.52    42509  eval.c:ceval[/opt/guile/lib/libguile.so.17.0.1]
      [...]

Cumulative percent measures the total time[1] spent in a function and its callees. It can be misleading; for example, scm_gc_mark calls itself, which adds to its cumulative time figure without actually meaning anything.
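To make that caveat concrete, here is a toy illustration (not Guile code, just the shape of the problem as I understand it):

  /* Toy illustration only: a self-recursive tree walk in the style of
     scm_gc_mark.  Each frame's cumulative cost includes the cost of every
     frame below it, so if the separated recursive entries are folded back
     together by name, the same work gets counted once per level of
     recursion.  That is presumably how ceval ends up with a cumulative
     figure of 365.99% in the table above. */
  struct node { struct node *left, *right; };

  static void
  mark (struct node *n)
  {
    if (!n)
      return;
    mark (n->left);
    mark (n->right);
  }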
The most "real" numbers are the self percent numbers, but if you can convince yourself that a function does not recurse, the cumulative percent numbers are meaningful as well. For example, in this case we see that 62.35% of initialization time is spent in scm_c_string_set_x.

At this point we have two questions: (1) is that plausible, and (2) what calls this function?

To answer (1): yes, it seems that scm_c_string_set_x has the potential to be really, really slow:

  1. First, scm_c_string_set_x calls scm_i_string_writable_chars.
     1.1. scm_i_string_writable_chars locks a global mutex.
     1.2. If the string's buffer is shared (STRINGBUF_SHARED ()), the lock
          is released, the buffer is copied, all threads are put to sleep
          (!), and the lock is retaken.
  2. Next, scm_c_string_set_x sets the char, i.e. buf[i] = c.
  3. Finally, scm_i_string_stop_writing is called, which unlocks the mutex.

Looking at the top entries of the profile, it is plausible that all of them are caused by string-set!: the time in GC because we allocate a new buffer in many of the cases in which we do string-set!, and the mutex operations being on the string write mutex.

At this point, to improve performance, we have two choices: (1) make string-set! cheaper, or (2) avoid string-set!. I do not know how to do (1) in the presence of threads[2]. (2) seems feasible if we look at which functions actually call scm_c_string_set_x. The ones that show up in the profile are all in read.c:

  ./read.c:628:  scm_c_string_set_x (*tok_buf, j, SCM_MAKE_CHAR (c));
  ./read.c:703:  scm_c_string_set_x (*tok_buf, j, SCM_MAKE_CHAR (c));
  ./read.c:766:  scm_c_string_set_x (*tok_buf, j, SCM_MAKE_CHAR (c));

All of these calls use the token buffer API, in which a SCM string is allocated and grown as necessary, and the readers fill the buffer with string-set!. Probably the easiest way to make this faster is to make the token buffer a special kind of SMOB that holds a scm_malloc'd data area, which is assumed to be accessed from only one thread, as is the case in read(). At the end, the data can be stolen from the token buffer into a string via scm_i_take_stringbufn. (A rough sketch of what I have in mind is at the end of this mail.)

[1] Really, the total percent of instruction fetches. Since this corresponds more or less to time, I say time.

[2] I am told that mutable strings are not even part of the r6rs proposal.

Conclusions
-----------

I have done the analysis and am hoping to convince Ludovic to make a patch :-) If we get Guile 1.8 load time down to Python load time, I will be happy. At that point I can go back to optimizing guile-gnome's load time, which currently stands at about 2 seconds.

Cheers,

Andy.
--
http://wingolog.org/
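P.S. To make the token buffer idea a bit more concrete, here is a rough sketch of the kind of thing I mean. It is not working code: the SMOB type registration and mark/free hooks are left out, the names (tok_buf, tok_buf_put, and so on) are made up for the example, and I steal the buffer with the public scm_take_locale_stringn since I haven't checked the exact signature of scm_i_take_stringbufn.

  #include <stddef.h>
  #include <libguile.h>

  /* Sketch only: a token buffer living in plain scm_malloc'd memory,
     touched by exactly one thread, so filling it needs no locking and
     no copy-on-write. */
  struct tok_buf
  {
    char *chars;   /* scm_malloc'd area, owned by this buffer alone */
    size_t len;    /* characters written so far */
    size_t size;   /* allocated size */
  };

  /* Start small; the reader grows the area as needed. */
  static void
  tok_buf_init (struct tok_buf *buf)
  {
    buf->size = 64;
    buf->len = 0;
    buf->chars = scm_malloc (buf->size);
  }

  /* Append one character: no mutex, no thread synchronization. */
  static void
  tok_buf_put (struct tok_buf *buf, char c)
  {
    if (buf->len == buf->size)
      {
        buf->size *= 2;
        buf->chars = scm_realloc (buf->chars, buf->size);
      }
    buf->chars[buf->len++] = c;
  }

  /* When the token is complete, steal the malloc'd area into a string
     instead of filling an SCM string char by char with string-set!. */
  static SCM
  tok_buf_to_string (struct tok_buf *buf)
  {
    SCM str = scm_take_locale_stringn (buf->chars, buf->len);
    buf->chars = NULL;
    buf->len = buf->size = 0;
    return str;
  }

The point is just that the hot path, tok_buf_put, never touches a mutex or wakes the GC; the string machinery only gets involved once per token, when the finished buffer is handed over.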