Hi Mark, interesting thoughts!

The below refers to the (very) simplistic VM compiler I'm working on right now. The current overhead in function calls, function setup, and function return means that it's hard to factor in the lowest level of optimizations. I consider the call overhead so large that function calls are dispatched to gosub routines rather than inlined, i.e. it is very close in spirit to a VM call. On the other hand, the compiler can be improved, and at some point in the future function calls might be so fast (especially for functions where we can prove certain properties) that your ideas become a real boon.

I will do my best to implement some of your ideas in a second rewrite of the compiler. At the first step, though, I want to make sure that e.g. +, -, ash etc. are inlined so that they are fast on fixnums, that branching is done natively, that list processing is fast, and that a large enough set of translations of VM operations is ready so that we can translate most of the code in Guile. This could increase speed at the first step by a factor of maybe 3-5. We can then continue to work from there.

I would also like to say that the current RTL call overhead is less than in the old stable-2.0 versions, so the plain RTL VM will be faster in this respect. Also note that, by the nature of the new VM, a simple compilation might yield less of an advantage than for the stable-2.0 VM. The reason is that it looks like many operations in the RTL VM do more things per operation; that is a boon for its speed, but it also means that we don't gain as much from a native compilation of the RTL VM as from the stable-2.0 VM.

A thought: regarding

  assert_nargs_ee
  reserve_locals
  assert_nargs_ee_locals
  bind_rest
  bind_kwargs

could we not implement this logic in the call instructions?

/Stefan

On Fri, Aug 3, 2012 at 4:29 AM, Mark H Weaver wrote:
> Hi Andy, thanks for the update!
> Exciting times for Guile :)
>
> On 08/02/2012 10:29 AM, Andy Wingo wrote:
>> Instead I'd rather just use Dybvig's suggestion: every call instruction
>> is preceded by an MV return address. For e.g. (values (f)), calling `f'
>> would be:
>>
>>   ...
>>   goto CALL
>> MVRA:
>>   truncate-and-jump RA
>> CALL:
>>   call f
>> RA:
>>   return
>>
>> So the overhead of multiple values in the normal single-value case is
>> one jump per call. When we do native compilation, this cost will be
>> negligible. OTOH for MV returns, we return to a different address than
>> the one on the stack, which will cause a branch misprediction (google
>> "return stack buffers" for more info).
>
> I wonder if it might be better to avoid this branch misprediction by
> always returning to the same address. Upon return, a special register
> would contain N-1, where N is the number of return values. The first few
> return values would also be stored in registers (hopefully at least two),
> and if necessary the remaining values would be stored elsewhere, perhaps
> on the stack or in a list or vector pointed to by another register.
>
> In the common case where a given call site expects a small constant
> number of return values, the compiler could emit a statically-predicted
> conditional branch to verify that N-1 is the expected value (usually
> zero), and then generate code that expects to find the return values in
> the appropriate registers.
>
> On some architectures, it might also make sense for the callee to set
> the processor's "zero?" condition code as if N-1 had been tested, to
> allow for a shorter check in the common single-value case.
>
> Of course, the calling convention can be chosen independently for each
> instruction set architecture / ABI.
>
> What do you think?
>
>     Mark