I wanted to dig into the CPU versus memory limitation a bit. I switched back to writing to a vector, not touching any buffers, so I could wrap the profile around just the hot section.

CPU + mem results:

  fixed point, byte-compiled:       195 samples    3,890,391 bytes
  fixed point, native-compiled:     250 samples    4,205,335 bytes
  floating point, byte-compiled:    560 samples  221,858,239 bytes
  floating point, native-compiled:  409 samples  221,732,687 bytes

* There is no typo in the native versus byte-compiled fixed-point results. I made a few runs; it is consistently slower when native-compiled.

* I found no combination of `defsubst', `defun', or speed declarations that could get the native-compiled fixed-point version to run faster than the byte-compiled one.

* The floating-point version was only measured on runs that did not trigger GC. This inflates its apparent performance, because the GC cost, about 30%, is not shown.

The fixed-point version is fast considering it has to do more work; if the native compiler didn't manage to slow it down, it could be faster still.

Regarding my previous statements about memory bandwidth being the limiting factor: the memory used is within an order of magnitude of the available random-write bandwidth, on the lower end. If that amount were limiting, the compilation method would not be expected to affect performance much. Since the native-compiled floating-point version is faster, it is likely CPU bound. Consumption is only a proxy for bandwidth anyway; runtimes use bandwidth without new allocation. Better tools are needed to answer this conclusively.

I may continue this exercise. My initial idea for creating clarity is to write a function that just throws away conses holding values that take 1, 10, 100, or 1000 steps to compute. If performance scales inversely with the number of conses but not with the number of steps, we can isolate memory versus computation as the bottleneck for these workloads, and it would provide a benchmark for making the compiler and GC smarter about not generating garbage in the first place.
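To make that concrete, here is a rough sketch of what I have in mind. The names `my/busy-work' and `my/throwaway-conses' are placeholders of my own, and the step counts are only illustrative:

  ;; Sketch of the throwaway-cons benchmark idea; names are hypothetical.

  (defun my/busy-work (steps)
    "Spin for STEPS iterations and return a value, simulating compute cost."
    (let ((acc 0))
      (dotimes (_ steps acc)
        (setq acc (1+ acc)))))

  (defun my/throwaway-conses (n steps)
    "Allocate and discard N conses, each holding a value costing STEPS to compute."
    (dotimes (_ n)
      ;; The cons is dead immediately; only the allocator and GC ever see it.
      (cons (my/busy-work steps) nil)))

  ;; Hold N * STEPS roughly constant while shifting work between
  ;; allocation and computation:
  (benchmark-run 1 (my/throwaway-conses 1000000 1))  ; cons-heavy
  (benchmark-run 1 (my/throwaway-conses 1000 1000))  ; compute-heavy

If the cons-heavy run is the slow one, allocation and GC dominate; if the compute-heavy run is, it is the steps that matter.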