I wanted to dig into the CPU versus memory limitation a bit. I switched back to writing to a vector, not touching any buffers, so I could wrap the profile around just the hot section.

CPU + mem results:

  fixed point, byte-compiled:       195 samples    3,890,391 bytes
  fixed point, native-compiled:     250 samples    4,205,335 bytes
  floating point, byte-compiled:    560 samples  221,858,239 bytes
  floating point, native-compiled:  409 samples  221,732,687 bytes

* There is no typo in the native versus byte-compiled fixed-point results. I made a few runs; it is consistently slower when native-compiled.

* I found no combination of `defsubst', `defun', or speed declarations that could get the native-compiled fixed-point version to run faster than the byte-compiled one.

* The floating-point version was only measured on runs that did not trigger GC. This inflates its apparent performance, because the GC cost, about 30%, is not shown.

The fixed-point version is fast considering it has to do more work; if the native compiler didn't manage to slow it down, it could be faster still.

Regarding my previous statements about memory bandwidth being the limiting factor: the memory used is within an order of magnitude of the available random-write bandwidth, on the lower end. If that amount were limiting, the compilation method would not be expected to affect performance much. Since the native-compiled floating-point version is faster, it is likely CPU bound. Consumption is only a proxy for bandwidth anyway; runtimes use bandwidth without new allocation. Better tools are needed to answer this conclusively.

I may continue this exercise. My initial idea for creating clarity is to write a function that just throws away conses holding values that take 1, 10, 100, or 1000 steps to compute. If performance scales inversely with the number of conses but not with the number of steps, we can isolate memory versus computation as the bottleneck for these workloads, and it would provide a benchmark for making the compiler and GC smarter about not generating garbage in the first place.
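To make that concrete, here is a rough sketch of what I have in mind. The names `my/busy-work' and `my/throwaway-conses' are placeholders of my own, and the step counts are only illustrative:

  ;; Sketch of the throwaway-cons benchmark idea; names are hypothetical.

  (defun my/busy-work (steps)
    "Spin for STEPS iterations and return a value, simulating compute cost."
    (let ((acc 0))
      (dotimes (_ steps acc)
        (setq acc (1+ acc)))))

  (defun my/throwaway-conses (n steps)
    "Allocate and discard N conses, each holding a value costing STEPS to compute."
    (dotimes (_ n)
      ;; The cons is dead immediately; only the allocator and GC ever see it.
      (cons (my/busy-work steps) nil)))

  ;; Hold N * STEPS roughly constant while shifting work between
  ;; allocation and computation:
  (benchmark-run 1 (my/throwaway-conses 1000000 1))  ; cons-heavy
  (benchmark-run 1 (my/throwaway-conses 1000 1000))  ; compute-heavy

If the cons-heavy run is the slow one, allocation and GC dominate; if the compute-heavy run is, it is the steps that matter.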