Here is the next version of the immediate strings patch, with further
improvements suggested by Paul. As mentioned before, strings up to 21
bytes on 64-bit and up to 9 bytes on 32-bit can be immediate (the
trailing '\0' is not counted). Note that this code assumes sizeof
(EMACS_INT) is equal to sizeof (void *), so it's not compatible with
WIDE_EMACS_INT. (A rough sketch of the general layout idea appears
after the Benchmark 1 results below.)

Since there were reasonable doubts whether this stuff is practically
useful, I did two benchmarks. The first one was a simple string
allocation benchmark, attached as stringbench.el. The second one was a
compilation of everything in the lisp subdirectory with
byte-force-recompile. Everything was tested with 64-bit executables
and '-Q -batch' command line options.

Configuration:

./configure --prefix=/not/exists --without-sound --without-pop \
  --with-x-toolkit=lucid --without-dbus --without-libotf \
  --without-selinux --without-xft --without-gsettings \
  --without-gnutls --without-rsvg --without-xml2

Compiler: gcc 4.6.1, optimization flags -O3

Old executable size 12855360 bytes, new executable size 12904512 bytes
(0.38% larger code size).

* Benchmark 1, 8 runs for each executable:

-- Old --

33.24user 0.23system 0:33.72elapsed 99%CPU (0avgtext+0avgdata 368268maxresident)k
0inputs+0outputs (0major+112338minor)pagefaults 0swaps
32.29user 0.25system 0:32.77elapsed 99%CPU (0avgtext+0avgdata 338012maxresident)k
0inputs+0outputs (0major+124684minor)pagefaults 0swaps
33.31user 0.24system 0:33.80elapsed 99%CPU (0avgtext+0avgdata 330612maxresident)k
0inputs+0outputs (0major+120164minor)pagefaults 0swaps
33.91user 0.24system 0:34.41elapsed 99%CPU (0avgtext+0avgdata 351588maxresident)k
0inputs+0outputs (0major+125401minor)pagefaults 0swaps
33.17user 0.27system 0:33.69elapsed 99%CPU (0avgtext+0avgdata 331480maxresident)k
0inputs+0outputs (0major+120374minor)pagefaults 0swaps
33.26user 0.31system 0:33.83elapsed 99%CPU (0avgtext+0avgdata 332956maxresident)k
0inputs+0outputs (0major+148027minor)pagefaults 0swaps
33.38user 0.28system 0:33.90elapsed 99%CPU (0avgtext+0avgdata 334400maxresident)k
0inputs+0outputs (0major+133420minor)pagefaults 0swaps
33.13user 0.23system 0:33.61elapsed 99%CPU (0avgtext+0avgdata 331132maxresident)k
0inputs+0outputs (0major+120341minor)pagefaults 0swaps

-- New --

32.59user 0.35system 0:33.18elapsed 99%CPU (0avgtext+0avgdata 332528maxresident)k
0inputs+0outputs (0major+149273minor)pagefaults 0swaps
32.62user 0.31system 0:33.17elapsed 99%CPU (0avgtext+0avgdata 332532maxresident)k
0inputs+0outputs (0major+149274minor)pagefaults 0swaps
32.44user 0.30system 0:32.98elapsed 99%CPU (0avgtext+0avgdata 333696maxresident)k
0inputs+0outputs (0major+145349minor)pagefaults 0swaps
29.29user 0.30system 0:29.80elapsed 99%CPU (0avgtext+0avgdata 366444maxresident)k
0inputs+0outputs (0major+136105minor)pagefaults 0swaps
31.90user 0.33system 0:32.47elapsed 99%CPU (0avgtext+0avgdata 362092maxresident)k
0inputs+0outputs (0major+161330minor)pagefaults 0swaps
34.29user 0.34system 0:34.88elapsed 99%CPU (0avgtext+0avgdata 375636maxresident)k
0inputs+0outputs (0major+160050minor)pagefaults 0swaps
32.64user 0.31system 0:33.20elapsed 99%CPU (0avgtext+0avgdata 336572maxresident)k
0inputs+0outputs (0major+150284minor)pagefaults 0swaps
33.17user 0.27system 0:33.69elapsed 99%CPU (0avgtext+0avgdata 360560maxresident)k
0inputs+0outputs (0major+126406minor)pagefaults 0swaps

-- Results --

Got 2.5% better speed, but ~3.1% larger heap usage. I expected heap
usage to be smaller as well; why isn't it?
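As promised above, here is a minimal sketch of the general idea behind
immediate strings. This is an illustration only, with made-up names,
layout and flag encoding; it is not the code from the patch, and the
real struct Lisp_String layout differs.

/* Illustrative sketch, NOT the actual patch.  A string object normally
   keeps a pointer to separately allocated data; if the data is short
   enough, it can live inline in the space those pointer-sized fields
   would otherwise occupy, so no separate allocation is needed.  */

#include <stdint.h>

struct toy_string
{
  intptr_t size;                /* character count; one spare bit can
                                   flag the immediate case (the sign
                                   bit is used here for illustration) */
  union
  {
    struct
    {
      intptr_t size_byte;       /* byte count for multibyte strings */
      void *intervals;          /* text properties */
      unsigned char *data;      /* separately allocated string data */
    } heap;
    /* Inline storage reusing the three pointer-sized fields above:
       24 bytes on 64-bit, 12 bytes on 32-bit, minus the terminating
       '\0' and whatever bookkeeping is needed -- hence limits in the
       low twenties on 64-bit and single digits on 32-bit.  */
    unsigned char immediate[3 * sizeof (intptr_t)];
  } u;
};

/* Hypothetical accessor: every access to string data has to check
   which representation is in use.  */
static inline unsigned char *
toy_string_data (struct toy_string *s)
{
  return s->size < 0 ? s->u.immediate : s->u.heap.data;
}

The accessor also shows where the extra conditional discussed in the
results below comes from.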
As for why heap usage is not smaller: the old code increments
consing_since_gc by the number of bytes allocated for each new
string's data, but the new code does so only for non-immediate
strings. So the old code triggers GC earlier than the new one, which
gives a smaller peak heap usage. (A rough sketch of this accounting
difference is appended at the end of this message.)

* Benchmark 2, 8 runs for each executable:

-- Old --

91.86user 0.49system 2:27.21elapsed 62%CPU (0avgtext+0avgdata 74736maxresident)k
0inputs+77864outputs (0major+39292minor)pagefaults 0swaps
91.57user 0.54system 2:27.30elapsed 62%CPU (0avgtext+0avgdata 74648maxresident)k
0inputs+78536outputs (0major+38641minor)pagefaults 0swaps
89.58user 0.52system 2:21.93elapsed 63%CPU (0avgtext+0avgdata 74684maxresident)k
0inputs+78536outputs (0major+38903minor)pagefaults 0swaps
91.53user 0.53system 2:25.14elapsed 63%CPU (0avgtext+0avgdata 74612maxresident)k
0inputs+78536outputs (0major+38538minor)pagefaults 0swaps
91.49user 0.56system 2:24.56elapsed 63%CPU (0avgtext+0avgdata 74708maxresident)k
0inputs+78528outputs (0major+38716minor)pagefaults 0swaps
91.77user 0.53system 2:24.01elapsed 64%CPU (0avgtext+0avgdata 74660maxresident)k
0inputs+78536outputs (0major+39164minor)pagefaults 0swaps
91.44user 0.54system 2:27.12elapsed 62%CPU (0avgtext+0avgdata 74728maxresident)k
0inputs+78536outputs (0major+39173minor)pagefaults 0swaps
91.72user 0.50system 2:24.25elapsed 63%CPU (0avgtext+0avgdata 74680maxresident)k
0inputs+78528outputs (0major+39538minor)pagefaults 0swaps

-- New --

89.98user 0.53system 2:22.79elapsed 63%CPU (0avgtext+0avgdata 73440maxresident)k
0inputs+78536outputs (0major+36362minor)pagefaults 0swaps
89.91user 0.51system 2:24.10elapsed 62%CPU (0avgtext+0avgdata 73528maxresident)k
0inputs+78528outputs (0major+36753minor)pagefaults 0swaps
89.85user 0.48system 2:24.74elapsed 62%CPU (0avgtext+0avgdata 73392maxresident)k
0inputs+78536outputs (0major+36745minor)pagefaults 0swaps
90.12user 0.54system 2:22.56elapsed 63%CPU (0avgtext+0avgdata 73440maxresident)k
0inputs+78536outputs (0major+37347minor)pagefaults 0swaps
89.95user 0.53system 2:23.74elapsed 62%CPU (0avgtext+0avgdata 73416maxresident)k
0inputs+78536outputs (0major+37292minor)pagefaults 0swaps
91.26user 0.53system 2:25.64elapsed 63%CPU (0avgtext+0avgdata 73440maxresident)k
0inputs+78536outputs (0major+36782minor)pagefaults 0swaps
90.03user 0.56system 2:25.01elapsed 62%CPU (0avgtext+0avgdata 73376maxresident)k
0inputs+78536outputs (0major+37418minor)pagefaults 0swaps
90.15user 0.54system 2:25.73elapsed 62%CPU (0avgtext+0avgdata 73448maxresident)k
0inputs+78536outputs (0major+37279minor)pagefaults 0swaps

-- Results --

Got ~1.3% better speed and ~1.7% smaller heap usage. Since this
benchmark does a lot of things besides string allocation, the 'later
GC' effect is negligible here.

Obviously, the new string code is more complex and, at first sight,
should be slower: any access to string data involves evaluating a
conditional expression, which puts more pressure on the instruction
cache and branch prediction logic. But the overall improvement may be
explained by better spatial locality and thus better data cache
utilization. A normal string and its data may be allocated far away
from each other, so when a cache line is filled by accessing a member
of Lisp_String, it's very unlikely that the same cache line also
contains the string data; for an immediate string, such a miss should
be quite rare. This could be checked, for example, with valgrind's
cachegrind tool (but I haven't tried that yet).

Dmitry
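And here is the promised rough sketch of the consing_since_gc
accounting difference. The toy_* names and the overall structure are
hypothetical and for illustration only; this is not the allocator code
from the patch. consing_since_gc is the counter Emacs compares against
gc-cons-threshold to decide when to run GC.

#include <stdlib.h>

#define TOY_IMMEDIATE_MAX 21    /* illustrative 64-bit limit */

long consing_since_gc;          /* bytes consed since the last GC */

/* Return a separately allocated data block, or NULL when the string
   data is short enough to be stored inline (immediate case).  */
unsigned char *
toy_allocate_string_data (long nbytes)
{
  if (nbytes <= TOY_IMMEDIATE_MAX)
    /* Immediate string: no separate allocation, so nothing is added
       to consing_since_gc.  The old code counted every string's data
       here, so it reached the GC threshold sooner; the new code GCs
       later, and the peak heap can grow, as seen in Benchmark 1.  */
    return NULL;

  /* Non-immediate string: allocate the data and account for it, just
     as the old code did for every string.  */
  consing_since_gc += nbytes;
  return malloc (nbytes + 1);
}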