unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed
* wip-rtl return location
@ 2012-08-02 14:29 Andy Wingo
  2012-08-02 22:21 ` Noah Lavine
  2012-08-03  2:29 ` Mark H Weaver
  0 siblings, 2 replies; 8+ messages in thread
From: Andy Wingo @ 2012-08-02 14:29 UTC (permalink / raw)
  To: guile-devel

Hi,

Some brief thoughts on the wip-rtl branch.  Currently it has this
strange "return location" thing, where it specifies the register(s) to
which to return value(s), and the number of expected values and whether
it expects a rest list or not.  Problem is, this return location is like
a little program that needs to be interpreted at runtime.  Worse, it
seems to assume that return values will have to be passed in memory.

Instead I'd rather just use Dybvig's suggestion: every call instruction
is preceded by an MV return address.  For e.g. (values (f)), calling `f'
would be:

    ...
    goto CALL
MVRA:    
    truncate-and-jump RA
CALL:
    call f
RA:
    return

So the overhead of multiple values in the normal single-value case is
one jump per call.  When we do native compilation, this cost will be
negligible.  OTOH for MV returns, we return to a different address than
the one on the stack, which will cause a branch misprediction (google
"return stack buffers" for more info).  Of course this is not relevant
to the interpreter, because all of these branches are indirect, but it
will be in the future, so it's a good idea to think about these things
now.  With this design, the caller is responsible for handling MV
returns, not the callee.

Anyway, MV return will cause a branch misprediction.  Oh well.  I think
we can live with it.  Single-valued returns are the common case, and
they will be predicted correctly.

So, another thing.  The reason for the previous "return location" design
was because I wanted to have just two registers reserved by the
implementation: the instruction pointer and the frame pointer.  Wanting
an IP is obvious.  It's important to locate frame pointers so that
various pieces of code can walk the stack frames: for example the
delimited continuation code, the backtrace printer, the debugger, etc.
It's possible to just using a stack pointer and use dynamic tables to
find where the frame pointer is, like the x86-64 architecture does (or
-fomit-frame-pointer), but that requires more sophistication on the part
of the runtime, and I don't think we're really ready for that right now.

As I said, I wanted just the IP and the FP.  I didn't want an SP because
it causes so much performance noise in the current VM.  But then I
realized that in the RTL VM, it doesn't need to be accessed frequently,
because more values are addressed against the FP, and we're not pushing
and popping temporaries.  So we can actually keep it around, and it
might not need to be in a register.  It retains its useful
characteristics of allowing variable-sized data to be (temporarily)
allocated on the stack, as in procedure calls or MV returns, and as a
stack delimiter for GC.

In summary:

  - I will remove the "return location" stuff from wip-rtl;

  - All calls will be mv-calls

  - MV returns will return to 1 instruction before the RA

  - All calls will be preceded by a jump over the MVRA

  - Eventually we can remove the MVRA slot from stack frames, because it
    is computable from the RA

  - The stack pointer is back in town!

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: wip-rtl return location
  2012-08-02 14:29 wip-rtl return location Andy Wingo
@ 2012-08-02 22:21 ` Noah Lavine
  2012-08-03  8:06   ` Andy Wingo
  2012-08-03  2:29 ` Mark H Weaver
  1 sibling, 1 reply; 8+ messages in thread
From: Noah Lavine @ 2012-08-02 22:21 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

That sounds interesting, but I have a question - why not make the MVRA
return address immediately after the call, instead of immediately
before it? In the common case when returning to the regular return
address, that would eliminate the extra branch (although it's a very
small branch anyway).

I would guess the reason to put it before is for variable-length
instructions, but you could handle that by reserving enough bytes for
a jump instruction. So it would look like

CALL:
  call f
MVRA:
  jump mv-handler
RA:
  ... rest of function ...

I really don't know if this is better or not. I'm just curious why it
isn't arranged like this.

Thanks,
Noah

On Thu, Aug 2, 2012 at 10:29 AM, Andy Wingo <wingo@pobox.com> wrote:
> Hi,
>
> Some brief thoughts on the wip-rtl branch.  Currently it has this
> strange "return location" thing, where it specifies the register(s) to
> which to return value(s), and the number of expected values and whether
> it expects a rest list or not.  Problem is, this return location is like
> a little program that needs to be interpreted at runtime.  Worse, it
> seems to assume that return values will have to be passed in memory.
>
> Instead I'd rather just use Dybvig's suggestion: every call instruction
> is preceded by an MV return address.  For e.g. (values (f)), calling `f'
> would be:
>
>     ...
>     goto CALL
> MVRA:
>     truncate-and-jump RA
> CALL:
>     call f
> RA:
>     return
>
> So the overhead of multiple values in the normal single-value case is
> one jump per call.  When we do native compilation, this cost will be
> negligible.  OTOH for MV returns, we return to a different address than
> the one on the stack, which will cause a branch misprediction (google
> "return stack buffers" for more info).  Of course this is not relevant
> to the interpreter, because all of these branches are indirect, but it
> will be in the future, so it's a good idea to think about these things
> now.  With this design, the caller is responsible for handling MV
> returns, not the callee.
>
> Anyway, MV return will cause a branch misprediction.  Oh well.  I think
> we can live with it.  Single-valued returns are the common case, and
> they will be predicted correctly.
>
> So, another thing.  The reason for the previous "return location" design
> was because I wanted to have just two registers reserved by the
> implementation: the instruction pointer and the frame pointer.  Wanting
> an IP is obvious.  It's important to locate frame pointers so that
> various pieces of code can walk the stack frames: for example the
> delimited continuation code, the backtrace printer, the debugger, etc.
> It's possible to just using a stack pointer and use dynamic tables to
> find where the frame pointer is, like the x86-64 architecture does (or
> -fomit-frame-pointer), but that requires more sophistication on the part
> of the runtime, and I don't think we're really ready for that right now.
>
> As I said, I wanted just the IP and the FP.  I didn't want an SP because
> it causes so much performance noise in the current VM.  But then I
> realized that in the RTL VM, it doesn't need to be accessed frequently,
> because more values are addressed against the FP, and we're not pushing
> and popping temporaries.  So we can actually keep it around, and it
> might not need to be in a register.  It retains its useful
> characteristics of allowing variable-sized data to be (temporarily)
> allocated on the stack, as in procedure calls or MV returns, and as a
> stack delimiter for GC.
>
> In summary:
>
>   - I will remove the "return location" stuff from wip-rtl;
>
>   - All calls will be mv-calls
>
>   - MV returns will return to 1 instruction before the RA
>
>   - All calls will be preceded by a jump over the MVRA
>
>   - Eventually we can remove the MVRA slot from stack frames, because it
>     is computable from the RA
>
>   - The stack pointer is back in town!
>
> Andy
> --
> http://wingolog.org/
>



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: wip-rtl return location
  2012-08-02 14:29 wip-rtl return location Andy Wingo
  2012-08-02 22:21 ` Noah Lavine
@ 2012-08-03  2:29 ` Mark H Weaver
  2012-08-03  8:24   ` Andy Wingo
  2012-08-03 11:54   ` Stefan Israelsson Tampe
  1 sibling, 2 replies; 8+ messages in thread
From: Mark H Weaver @ 2012-08-03  2:29 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

Hi Andy, thanks for the update!  Exciting times for Guile :)

On 08/02/2012 10:29 AM, Andy Wingo wrote:
> Instead I'd rather just use Dybvig's suggestion: every call instruction
> is preceded by an MV return address.  For e.g. (values (f)), calling `f'
> would be:
>
>      ...
>      goto CALL
> MVRA:
>      truncate-and-jump RA
> CALL:
>      call f
> RA:
>      return
>
> So the overhead of multiple values in the normal single-value case is
> one jump per call.  When we do native compilation, this cost will be
> negligible.  OTOH for MV returns, we return to a different address than
> the one on the stack, which will cause a branch misprediction (google
> "return stack buffers" for more info).

I wonder if it might be better to avoid this branch misprediction by 
always returning to the same address.  Upon return, a special register 
would contain N-1, where N is the number of return values.  The first 
few return values would also be stored in registers (hopefully at least 
two), and if necessary the remaining values would be stored elsewhere, 
perhaps on the stack or in a list or vector pointed to by another register.

In the common case where a given call site expects a small constant 
number of return values, the compiler could emit a statically-predicted 
conditional branch to verify that N-1 is the expected value (usually 
zero), and then generate code that expects to find the return values in 
the appropriate registers.

On some architectures, it might also make sense for the callee to set 
the processor's "zero?" condition code as if N-1 had been tested, to 
allow for a shorter check in the common single-value case.

Of course, the calling convention can be chosen independently for each 
instruction set architecture / ABI.

What do you think?

     Mark




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: wip-rtl return location
  2012-08-02 22:21 ` Noah Lavine
@ 2012-08-03  8:06   ` Andy Wingo
  0 siblings, 0 replies; 8+ messages in thread
From: Andy Wingo @ 2012-08-03  8:06 UTC (permalink / raw)
  To: Noah Lavine; +Cc: guile-devel

Hi Noah,

Thanks for the thoughts.

On Fri 03 Aug 2012 00:21, Noah Lavine <noah.b.lavine@gmail.com> writes:

> That sounds interesting, but I have a question - why not make the MVRA
> return address immediately after the call, instead of immediately
> before it? In the common case when returning to the regular return
> address, that would eliminate the extra branch (although it's a very
> small branch anyway).

It's a good idea; I have it in my local branch.  Putting the MVRA before
the call didn't work out because the call instruction is
variable-length.

> CALL:
>   call f
> MVRA:
>   jump mv-handler
> RA:
>   ... rest of function ...

Yes and for a truncating return:

  CALL:
    call f
  MVRA:
    truncate
  RA:
    ... rest of function ...

Native compilation should do something different though.  But Mark's
mail has more on that, so I'll reply there.

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: wip-rtl return location
  2012-08-03  2:29 ` Mark H Weaver
@ 2012-08-03  8:24   ` Andy Wingo
  2012-08-03 11:54   ` Stefan Israelsson Tampe
  1 sibling, 0 replies; 8+ messages in thread
From: Andy Wingo @ 2012-08-03  8:24 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

Heya Mark,

Thanks for the comments :-)

So the current thing is:

  CALL:
       call f
  MVRA:
       truncate
  RA:
       return

In interpreted code it's fine that we return to a different address, as
they are all indirect jumps anyway.  For compiled code, the "call"
instruction won't be just one instruction -- in RTL it's a
macro-instruction that also handles shuffling the arguments into place,
whereas with native compilation you would handle the shuffling then do
the call.  So for native compilation you could put the MVRA before the
call because the call would have fixed width.

Anyway, on to your point:

On Fri 03 Aug 2012 04:29, Mark H Weaver <mhw@netris.org> writes:

> I wonder if it might be better to avoid this branch misprediction by
> always returning to the same address.  Upon return, a special register
> would contain N-1, where N is the number of return values.  The first
> few return values would also be stored in registers (hopefully at least
> two), and if necessary the remaining values would be stored elsewhere,
> perhaps on the stack or in a list or vector pointed to by another
> register.

It's a good idea for the native-compilation case.  I don't think the
overhead of the conditional jump in interpreted code would be worth it,
though.  Dunno.  Probably wouldn't matter?

> In the common case where a given call site expects a small constant
> number of return values, the compiler could emit a statically-predicted
> conditional branch to verify that N-1 is the expected value (usually
> zero), and then generate code that expects to find the return values in
> the appropriate registers.

But here's the rub, this introduces a conditional branch into the
calling sequence where there was no conditional branch before (for the
single-valued case, which is empirically the majority of cases).

Apparently the intel Core architecture ignores static branch
predictions:

  http://www.agner.org/optimize/microarchitecture.pdf

So it would increase pressure on the branch target buffer.  OTOH the
dynamic predictions would almost always hit, in either case.

> On some architectures, it might also make sense for the callee to set
> the processor's "zero?" condition code as if N-1 had been tested, to
> allow for a shorter check in the common single-value case.
>
> Of course, the calling convention can be chosen independently for each
> instruction set architecture / ABI.
>
> What do you think?

I think it's definitely worth exploring.  I would be OK with it, and
receiving results in registers would be good.

In the context of what we do with the bytecode (as opposed to calling
convention optimizations that we will do with native code), WDYT about
the bytecode calling convention I outlined above?

Cheers,

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: wip-rtl return location
  2012-08-03  2:29 ` Mark H Weaver
  2012-08-03  8:24   ` Andy Wingo
@ 2012-08-03 11:54   ` Stefan Israelsson Tampe
  2012-08-03 12:38     ` Andy Wingo
  1 sibling, 1 reply; 8+ messages in thread
From: Stefan Israelsson Tampe @ 2012-08-03 11:54 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Andy Wingo, guile-devel

[-- Attachment #1: Type: text/plain, Size: 3732 bytes --]

Hi interesting thoughts Mark,

The below refere to to (very) simplistic vm compiler I'm working on right
now.

The current overhead found in function call's, function setup and
function return means that it's hard to factor in the lowest level
of optimizations. I view the call overhead so large that function call's is
dispatched to gosub routines and not inlined. E.g. it is very close in
spirit
to a VM call. On the other hand the compiler can be improved and and some
point in the future function call's might be so fast (especially for
functions where
we can prove features) so that your ideas can be a real boon. I will try to
do my
best though to implement some of your ideas at a second rewrite of the
compiler.
but at the first step I care to make sure to inline for example + - ash
etc. so that
they are fast on fixnums, that the branching is done natively list
processing is
fast and a large enough set of translations of VM operations is ready so
that we
can translate most of the code in guile. This can increase the speed at the
first step
with a factor of maybe 3-5. We can then continue to work from there.

I would also like to say that the current rtl call overhead is less then
the old
stable-2.0 versions so plain rtl VM will be faster in this respect.

Also to note is that, by the nature of the new VM, a simple compilation
might yield
less of an advantage then the stable-2.0 VM. The reason is that it looks
like many
operations in the RTL VM does more things per operations - a boon for it's
speed
because those things  will mean that we don't gain as much on a native
compilation
of the RTL VM as of the stable-2.0 VM.

A though:
assert_nargs_ee
reserve_locals
assert_nargs_ee_locals
bind_rest
bind_kwargs

Could we not implement this logic in the call instructions?

/Stefan

On Fri, Aug 3, 2012 at 4:29 AM, Mark H Weaver <mhw@netris.org> wrote:

> Hi Andy, thanks for the update!  Exciting times for Guile :)
>
>
> On 08/02/2012 10:29 AM, Andy Wingo wrote:
>
>> Instead I'd rather just use Dybvig's suggestion: every call instruction
>> is preceded by an MV return address.  For e.g. (values (f)), calling `f'
>> would be:
>>
>>      ...
>>      goto CALL
>> MVRA:
>>      truncate-and-jump RA
>> CALL:
>>      call f
>> RA:
>>      return
>>
>> So the overhead of multiple values in the normal single-value case is
>> one jump per call.  When we do native compilation, this cost will be
>> negligible.  OTOH for MV returns, we return to a different address than
>> the one on the stack, which will cause a branch misprediction (google
>> "return stack buffers" for more info).
>>
>
> I wonder if it might be better to avoid this branch misprediction by
> always returning to the same address.  Upon return, a special register
> would contain N-1, where N is the number of return values.  The first few
> return values would also be stored in registers (hopefully at least two),
> and if necessary the remaining values would be stored elsewhere, perhaps on
> the stack or in a list or vector pointed to by another register.
>
> In the common case where a given call site expects a small constant number
> of return values, the compiler could emit a statically-predicted
> conditional branch to verify that N-1 is the expected value (usually zero),
> and then generate code that expects to find the return values in the
> appropriate registers.
>
> On some architectures, it might also make sense for the callee to set the
> processor's "zero?" condition code as if N-1 had been tested, to allow for
> a shorter check in the common single-value case.
>
> Of course, the calling convention can be chosen independently for each
> instruction set architecture / ABI.
>
> What do you think?
>
>     Mark
>
>
>

[-- Attachment #2: Type: text/html, Size: 4432 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: wip-rtl return location
  2012-08-03 11:54   ` Stefan Israelsson Tampe
@ 2012-08-03 12:38     ` Andy Wingo
  2012-08-03 13:13       ` Stefan Israelsson Tampe
  0 siblings, 1 reply; 8+ messages in thread
From: Andy Wingo @ 2012-08-03 12:38 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: Mark H Weaver, guile-devel

On Fri 03 Aug 2012 13:54, Stefan Israelsson Tampe <stefan.itampe@gmail.com> writes:

> A though:
> assert_nargs_ee
> reserve_locals
> assert_nargs_ee_locals
> bind_rest
> bind_kwargs
>
> Could we not implement this logic in the call instructions?

This is a characteristic of the callee -- more work is needed if there
are optional/kw args, but in the normal case you just have
assert-nargs-ee/locals which is very cheap, no?

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: wip-rtl return location
  2012-08-03 12:38     ` Andy Wingo
@ 2012-08-03 13:13       ` Stefan Israelsson Tampe
  0 siblings, 0 replies; 8+ messages in thread
From: Stefan Israelsson Tampe @ 2012-08-03 13:13 UTC (permalink / raw)
  To: Andy Wingo; +Cc: Mark H Weaver, guile-devel

[-- Attachment #1: Type: text/plain, Size: 1029 bytes --]

 > in the normal case you just have
> assert-nargs-ee/locals which is very cheap, no?

Sure it's not the slowest of instructions, but in the VM it's an extra jump
and for the Native part it either bloats the code or one has to jump out
to supporting code subs in the VM. Considering the other call overhead
it's maybe a non issue but I think we should time current setup with a
version
where we grovel the callee information from the program datastructure.

/Stefan

On Fri, Aug 3, 2012 at 2:38 PM, Andy Wingo <wingo@pobox.com> wrote:

> On Fri 03 Aug 2012 13:54, Stefan Israelsson Tampe <stefan.itampe@gmail.com>
> writes:
>
> > A though:
> > assert_nargs_ee
> > reserve_locals
> > assert_nargs_ee_locals
> > bind_rest
> > bind_kwargs
> >
> > Could we not implement this logic in the call instructions?
>
> This is a characteristic of the callee -- more work is needed if there
> are optional/kw args, but in the normal case you just have
> assert-nargs-ee/locals which is very cheap, no?
>
> Andy
> --
> http://wingolog.org/
>

[-- Attachment #2: Type: text/html, Size: 1583 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2012-08-03 13:13 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-08-02 14:29 wip-rtl return location Andy Wingo
2012-08-02 22:21 ` Noah Lavine
2012-08-03  8:06   ` Andy Wingo
2012-08-03  2:29 ` Mark H Weaver
2012-08-03  8:24   ` Andy Wingo
2012-08-03 11:54   ` Stefan Israelsson Tampe
2012-08-03 12:38     ` Andy Wingo
2012-08-03 13:13       ` Stefan Israelsson Tampe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).