Register VM WIP

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

* Register VM WIP
@ 2012-05-11 16:19 Andy Wingo
  2012-05-11 20:29 ` Stefan Israelsson Tampe
  2012-05-14 21:09 ` Ludovic Courtès
  0 siblings, 2 replies; 20+ messages in thread
From: Andy Wingo @ 2012-05-11 16:19 UTC (permalink / raw)
  To: guile-devel

Hi all,

This mail announces some very early work on a register VM.  The code is
in wip-rtl ("work in progress, register transfer language".  The latter
bit is something of a misnomer.).  There is not much there yet:
basically just the VM, an assembler, and a disassembler.  Still, it's
interesting, and I thought people might want to hear more about it.

So, the deal: why is it interesting to switch from a stack VM, which is
what we have, to a register VM?  There are three overriding
disadvantages to the current VM.

  1) With our stack VM, instructions are smaller.  They do less, so you
     need more of them.  This increases dispatch cost, which is the
     largest cost of a VM.

  2) On a stack VM, there is a penalty to naming values.  Since the only
     values that are accessible to an instruction are the ones on the
     top of the stack, whenever you want to use more names, you end up
     doing a lot of local-ref / local-set operations.  In contrast an
     instruction for a register VM can address many more operands, so
     there is much less penalty to storing something on the stack.  (The
     penalty is not so much in the storage, but in the pushing and
     popping needed to access it.)

  3) Our stack VM has variable-sized stack frames, so we need to check
     for overflow every time we push a value on the stack.  This is
     quite costly.

The WIP register VM fixes all of these issues.

The basic design of the VM is: 32-bit instruction words, 8-bit opcodes,
variable-length instructions, maximum of 24-bit register addressing, and
static, relocatable allocation of constants.

Also, with the wip-rtl VM there is no stack pointer: locals are
addressed directly via the frame pointer, and the call frame for a
function is of a fixed size.  Indeed the IP and FP are the only state
variables of the VM, which makes it much easier to think about native
compilation, given the scarcity of CPU registers on some architectures.

See vm-engine.c from around line 1000 for a commented set of
instructions.  It's messy in many ways now, but hey.

As far as performance goes, we won't know yet.  But at least for a
simple loop, counting down from a billion, the register VM is a few
times faster than the stack VM.  Still, I would be happy if the general
speedup were on the order of 40%.  We'll see.

Here's that loop in rtl VM:

   (use-modules (system vm rtl))

   (assemble-rtl-program
     0
     '((assert-nargs-ee/locals 1 2)
       (br fix-body)
       loop-head
       (make-short-immediate 2 0)
       (br-if-= 1 2 out)
       (sub1 1 1)
       (br loop-head)
       fix-body
       (mov 1 0)
       (br loop-head)
       out
       (make-short-immediate 0 #t)
       (return 0)))

There are various ways to improve this, but its structure is like what
the stack VM produces.

Compare to the current opcode:

    scheme@(guile-user)> (define (countdown n) (let lp ((n n)) (or (zero? n) (lp (1- n)))))
    scheme@(guile-user)> ,x countdown
    Disassembly of #<procedure countdown (n)>:

       0    (assert-nargs-ee/locals 17)     ;; 1 arg, 2 locals
       2    (br :L186)                      ;; -> 30
       6    (local-ref 1)                   ;; `n'
       8    (make-int8:0)                   ;; 0
       9    (ee?)                           
      10    (local-set 2)                   ;; `t'
      12    (local-ref 2)                   ;; `t'
      14    (br-if-not :L187)               ;; -> 21
      18    (local-ref 2)                   ;; `t'
      20    (return)                        
      21    (local-ref 1)                   ;; `n'
      23    (sub1)                          
      24    (local-set 1)                   ;; `n'
      26    (br :L188)                      ;; -> 6
      30    (local-ref 0)                   ;; `n'
      32    (local-set 1)                   
      34    (br :L188)                      ;; -> 6

OK, time to set down the keyboard; been working far too much on this in
recent days.  I still need to adapt the compiler to produce RTL
bytecode.  I am going to let it sit for a week or two before touching it
again.  Comments welcome.

Regards,

Andy
-- 
http://wingolog.org/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-11 16:19 Register VM WIP Andy Wingo
@ 2012-05-11 20:29 ` Stefan Israelsson Tampe
  2012-05-16 15:01   ` Andy Wingo
  2012-05-14 21:09 ` Ludovic Courtès
  1 sibling, 1 reply; 20+ messages in thread
From: Stefan Israelsson Tampe @ 2012-05-11 20:29 UTC (permalink / raw)
  To: Andy Wingo; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 5358 bytes --]

Hi,

This looks very good. i like the hole approach and this approach has the
potential to address most of the issues I have seen when disassembling
guile-2.0 output. A few notes.

1. What about growing stacks any coments if they will be easier to manage
for this setup. Can one copy the C stack logic?

2. Is there an instruction that does what call does but can be used for
tail call's
when it needs it e.g. the code
 for (n = 0; n < nargs; n++)
        LOCAL_SET (n, old_fp[ip[4 + n]]);
that is missing for the tail code

3. I would appriciate if the frame is always below say 256 SCM:s of the fp
stack limit
that way when preparing tail calling one doesn't usally need to check if
the argument fit's
when issuing a tail call. If you compile a function that tail call more
then 254 (?) arguments
then you can as well check because then be free relative the argument
handling.

4. I think the logic code hook I recently investigated could easily fit
into this VM engine with
using similar techniques as I described in previous mails.

Thanks for your work on this
Stefan


On Fri, May 11, 2012 at 6:19 PM, Andy Wingo <wingo@pobox.com> wrote:

> Hi all,
>
> This mail announces some very early work on a register VM.  The code is
> in wip-rtl ("work in progress, register transfer language".  The latter
> bit is something of a misnomer.).  There is not much there yet:
> basically just the VM, an assembler, and a disassembler.  Still, it's
> interesting, and I thought people might want to hear more about it.
>
> So, the deal: why is it interesting to switch from a stack VM, which is
> what we have, to a register VM?  There are three overriding
> disadvantages to the current VM.
>
>  1) With our stack VM, instructions are smaller.  They do less, so you
>     need more of them.  This increases dispatch cost, which is the
>     largest cost of a VM.
>
>  2) On a stack VM, there is a penalty to naming values.  Since the only
>     values that are accessible to an instruction are the ones on the
>     top of the stack, whenever you want to use more names, you end up
>     doing a lot of local-ref / local-set operations.  In contrast an
>     instruction for a register VM can address many more operands, so
>     there is much less penalty to storing something on the stack.  (The
>     penalty is not so much in the storage, but in the pushing and
>     popping needed to access it.)
>
>  3) Our stack VM has variable-sized stack frames, so we need to check
>     for overflow every time we push a value on the stack.  This is
>     quite costly.
>
> The WIP register VM fixes all of these issues.
>
> The basic design of the VM is: 32-bit instruction words, 8-bit opcodes,
> variable-length instructions, maximum of 24-bit register addressing, and
> static, relocatable allocation of constants.
>
> Also, with the wip-rtl VM there is no stack pointer: locals are
> addressed directly via the frame pointer, and the call frame for a
> function is of a fixed size.  Indeed the IP and FP are the only state
> variables of the VM, which makes it much easier to think about native
> compilation, given the scarcity of CPU registers on some architectures.
>
> See vm-engine.c from around line 1000 for a commented set of
> instructions.  It's messy in many ways now, but hey.
>
> As far as performance goes, we won't know yet.  But at least for a
> simple loop, counting down from a billion, the register VM is a few
> times faster than the stack VM.  Still, I would be happy if the general
> speedup were on the order of 40%.  We'll see.
>
> Here's that loop in rtl VM:
>
>   (use-modules (system vm rtl))
>
>   (assemble-rtl-program
>     0
>     '((assert-nargs-ee/locals 1 2)
>       (br fix-body)
>       loop-head
>       (make-short-immediate 2 0)
>       (br-if-= 1 2 out)
>       (sub1 1 1)
>       (br loop-head)
>       fix-body
>       (mov 1 0)
>       (br loop-head)
>       out
>       (make-short-immediate 0 #t)
>       (return 0)))
>
> There are various ways to improve this, but its structure is like what
> the stack VM produces.
>
> Compare to the current opcode:
>
>    scheme@(guile-user)> (define (countdown n) (let lp ((n n)) (or (zero?
> n) (lp (1- n)))))
>    scheme@(guile-user)> ,x countdown
>    Disassembly of #<procedure countdown (n)>:
>
>       0    (assert-nargs-ee/locals 17)     ;; 1 arg, 2 locals
>       2    (br :L186)                      ;; -> 30
>       6    (local-ref 1)                   ;; `n'
>       8    (make-int8:0)                   ;; 0
>       9    (ee?)
>      10    (local-set 2)                   ;; `t'
>      12    (local-ref 2)                   ;; `t'
>      14    (br-if-not :L187)               ;; -> 21
>      18    (local-ref 2)                   ;; `t'
>      20    (return)
>      21    (local-ref 1)                   ;; `n'
>      23    (sub1)
>      24    (local-set 1)                   ;; `n'
>      26    (br :L188)                      ;; -> 6
>      30    (local-ref 0)                   ;; `n'
>      32    (local-set 1)
>      34    (br :L188)                      ;; -> 6
>
> OK, time to set down the keyboard; been working far too much on this in
> recent days.  I still need to adapt the compiler to produce RTL
> bytecode.  I am going to let it sit for a week or two before touching it
> again.  Comments welcome.
>
> Regards,
>
> Andy
> --
> http://wingolog.org/
>
>

[-- Attachment #2: Type: text/html, Size: 6283 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-11 16:19 Register VM WIP Andy Wingo
  2012-05-11 20:29 ` Stefan Israelsson Tampe
@ 2012-05-14 21:09 ` Ludovic Courtès
  2012-05-14 21:28   ` Andrew Gwozdziewycz
  2012-05-15 18:49   ` Andy Wingo
  1 sibling, 2 replies; 20+ messages in thread
From: Ludovic Courtès @ 2012-05-14 21:09 UTC (permalink / raw)
  To: guile-devel

Hi Andy!

This all looks pretty exciting!  Being able to get rid of all repeated
‘local-{ref,set}’ instructions sounds compelling.  And it does seem to
bring us one step closer to native code.

Presumably the tricky part will be the register allocator, right?

Looking at the ‘countdown’ example, I wonder how much could be achieved
in the stack VM by using well-chosen super-instructions:

       0    (assert-nargs-ee/locals 17)
       2    (br :L186)                      ;; -> 30
       6    (local-ref 1)
       8    (make-int8:0)
       9    (ee?)                           
      10    (local-set 2)   ;;
      12    (local-ref 2)   ;; → use ‘local-set* 2’, which doesn’t pop
      14    (br-if-not :L187)               ;; -> 21
      18    (local-ref 2)
      20    (return)                        
      21    (local-ref 1)   ;;
      23    (sub1)          ;; → use ‘local-sub1 1’              
      24    (local-set 1)   ;;
      26    (br :L188)                      ;; -> 6
      30    (local-ref 0)   ;;
      32    (local-set 1)   ;; → use ‘local-mov 0 1’                   
      34    (br :L188)                      ;; -> 6

This would amount to making some of the instructions like those of a
register VM, but it could be done incrementally.

Anyway, this is inspiring, and promising!

Thanks,
Ludo’.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-14 21:09 ` Ludovic Courtès
@ 2012-05-14 21:28   ` Andrew Gwozdziewycz
  2012-05-15 18:45     ` Andy Wingo
  2012-05-15 18:49   ` Andy Wingo
  1 sibling, 1 reply; 20+ messages in thread
From: Andrew Gwozdziewycz @ 2012-05-14 21:28 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

On Mon, May 14, 2012 at 5:09 PM, Ludovic Courtès <ludo@gnu.org> wrote:
> Hi Andy!
>
> This all looks pretty exciting!  Being able to get rid of all repeated
> ‘local-{ref,set}’ instructions sounds compelling.  And it does seem to
> bring us one step closer to native code.
>
> Presumably the tricky part will be the register allocator, right?

The register based VMs I've seen ignore this issue by allowing for an
infinite set of registers. :)

-- 
http://www.apgwoz.com



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-14 21:28   ` Andrew Gwozdziewycz
@ 2012-05-15 18:45     ` Andy Wingo
  2012-05-16  0:39       ` Noah Lavine
  0 siblings, 1 reply; 20+ messages in thread
From: Andy Wingo @ 2012-05-15 18:45 UTC (permalink / raw)
  To: Andrew Gwozdziewycz; +Cc: Ludovic Courtès, guile-devel

On Mon 14 May 2012 23:28, Andrew Gwozdziewycz <apgwoz@gmail.com> writes:

> On Mon, May 14, 2012 at 5:09 PM, Ludovic Courtès <ludo@gnu.org> wrote:
>>
>> Presumably the tricky part will be the register allocator, right?
>
> The register based VMs I've seen ignore this issue by allowing for an
> infinite set of registers. :)

Indeed, that's the plan :)  The first shot at an allocator will look a
lot like the one in (language tree-il analyze).

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-14 21:09 ` Ludovic Courtès
  2012-05-14 21:28   ` Andrew Gwozdziewycz
@ 2012-05-15 18:49   ` Andy Wingo
  1 sibling, 0 replies; 20+ messages in thread
From: Andy Wingo @ 2012-05-15 18:49 UTC (permalink / raw)
  To: Ludovic Courtès; +Cc: guile-devel

Heya Ludovic,

On Mon 14 May 2012 23:09, ludo@gnu.org (Ludovic Courtès) writes:

>        6    (local-ref 1)
>        8    (make-int8:0)
>        9    (ee?)                           
>       10    (local-set 2)   ;;
>       12    (local-ref 2)   ;; → use ‘local-set* 2’, which doesn’t pop
>       14    (br-if-not :L187)               ;; -> 21

The register VM will do a br-if-= N M ! :LABEL, where ! is a flag
indicating that the test is inverted.  It addresses the operands
directly in registers, no need to push test values.  But yes, there are
places we could win, like local-sub1 and local-mov, as you say.

Regards,

Andy
-- 
http://wingolog.org/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-15 18:45     ` Andy Wingo
@ 2012-05-16  0:39       ` Noah Lavine
  2012-05-16  4:23         ` Mark H Weaver
  2012-05-16  7:10         ` Andy Wingo
  0 siblings, 2 replies; 20+ messages in thread
From: Noah Lavine @ 2012-05-16  0:39 UTC (permalink / raw)
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Hello,

>> The register based VMs I've seen ignore this issue by allowing for an
>> infinite set of registers. :)
>
> Indeed, that's the plan :)  The first shot at an allocator will look a
> lot like the one in (language tree-il analyze).

That was a bit surprising to me. Do you mean that the register pool
will grow and shrink for each function call? Is that why the stack
frames can be fixed-size?

Thanks,
Noah



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16  0:39       ` Noah Lavine
@ 2012-05-16  4:23         ` Mark H Weaver
  2012-05-16  7:15           ` Andy Wingo
  2012-05-16  7:10         ` Andy Wingo
  1 sibling, 1 reply; 20+ messages in thread
From: Mark H Weaver @ 2012-05-16  4:23 UTC (permalink / raw)
  To: Noah Lavine; +Cc: Andy Wingo, Ludovic Courtès, guile-devel

Noah Lavine <noah.b.lavine@gmail.com> writes:
>>> The register based VMs I've seen ignore this issue by allowing for an
>>> infinite set of registers. :)
>>
>> Indeed, that's the plan :)  The first shot at an allocator will look a
>> lot like the one in (language tree-il analyze).
>
> That was a bit surprising to me. Do you mean that the register pool
> will grow and shrink for each function call? Is that why the stack
> frames can be fixed-size?

It's surprising to me for another reason: in order to make the
instructions reasonably compact, only a limited number of bits are
available in each instruction to specify which registers to use.

      Mark



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16  0:39       ` Noah Lavine
  2012-05-16  4:23         ` Mark H Weaver
@ 2012-05-16  7:10         ` Andy Wingo
  1 sibling, 0 replies; 20+ messages in thread
From: Andy Wingo @ 2012-05-16  7:10 UTC (permalink / raw)
  To: Noah Lavine; +Cc: Ludovic Courtès, guile-devel

On Wed 16 May 2012 02:39, Noah Lavine <noah.b.lavine@gmail.com> writes:

> Do you mean that the register pool will grow and shrink for each
> function call? Is that why the stack frames can be fixed-size?

The register pool is the set of locals on the stack.  Registers for one
function are stored in the stack frame.

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16  4:23         ` Mark H Weaver
@ 2012-05-16  7:15           ` Andy Wingo
  2012-05-16 13:44             ` Mark H Weaver
  0 siblings, 1 reply; 20+ messages in thread
From: Andy Wingo @ 2012-05-16  7:15 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

On Wed 16 May 2012 06:23, Mark H Weaver <mhw@netris.org> writes:

> It's surprising to me for another reason: in order to make the
> instructions reasonably compact, only a limited number of bits are
> available in each instruction to specify which registers to use.

It turns out that being reasonably compact isn't terribly important --
more important is the number of opcodes it takes to get something done,
which translates to the number of dispatches.  Have you seen the "direct
threading" VM implementation strategy?  In that case the opcode is not
an index into a jump table, it's a word that encodes the pointer
directly.  So it's a word wide, just for the opcode.  That's what
JavaScriptCore does, for example.  The opcode is a word wide, and each
operand is a word as well.

The design of the wip-rtl VM is to allow 16M registers (24-bit
addressing).  However many instructions can just address 2**8 registers
(8-bit addressing) or 2**12 registers (12-bit addressing).  We will
reserve registers 253 to 255 as temporaries.  If you have so many
registers as to need more than that, then you have to shuffle operands
down into the temporaries.  That's the plan, anyway.

Cheers,

Andy
-- 
http://wingolog.org/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16  7:15           ` Andy Wingo
@ 2012-05-16 13:44             ` Mark H Weaver
  2012-05-16 14:00               ` David Kastrup
                                 ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Mark H Weaver @ 2012-05-16 13:44 UTC (permalink / raw)
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Hi Andy!

Andy Wingo <wingo@pobox.com> writes:
> On Wed 16 May 2012 06:23, Mark H Weaver <mhw@netris.org> writes:
>
>> It's surprising to me for another reason: in order to make the
>> instructions reasonably compact, only a limited number of bits are
>> available in each instruction to specify which registers to use.
>
> It turns out that being reasonably compact isn't terribly important --
> more important is the number of opcodes it takes to get something done,
> which translates to the number of dispatches.  Have you seen the "direct
> threading" VM implementation strategy?  In that case the opcode is not
> an index into a jump table, it's a word that encodes the pointer
> directly.  So it's a word wide, just for the opcode.  That's what
> JavaScriptCore does, for example.  The opcode is a word wide, and each
> operand is a word as well.
>
> The design of the wip-rtl VM is to allow 16M registers (24-bit
> addressing).  However many instructions can just address 2**8 registers
> (8-bit addressing) or 2**12 registers (12-bit addressing).  We will
> reserve registers 253 to 255 as temporaries.  If you have so many
> registers as to need more than that, then you have to shuffle operands
> down into the temporaries.  That's the plan, anyway.

I'm very concerned about this design, for the same reason that I was
concerned about NaN-boxing on 32-bit platforms.  Efficient use of memory
is extremely important on modern architectures, because of the vast (and
increasing) disparity between cache speed and RAM speed.  If you can fit
the active set into the cache, that often makes a profound difference in
the speed of a program.

I agree that with VMs, minimizing the number of dispatches is crucial,
but beyond a certain point, having more registers is not going to save
you any dispatches, because they will almost never be used anyway.
2^12 registers is _far_ beyond that point.

As I wrote before concerning NaN-boxing, I suspect that the reason these
memory-bloated designs are so successful in the JavaScript world is that
they are specifically optimized for use within a modern web browser,
which is already a memory hog anyway.  Therefore, if the language
implementation wastes yet more memory it will hardly be noticed.        

If I were designing this VM, I'd work hard to allow as many loops as
possible to run completely in the cache.  That means that three things
have to fit into the cache together: the VM itself, the user loop code,
and the user data.  IMO, the sum of these three things should be made as
small as possible.

I certainly agree that we should have a generous number of registers,
but I suspect that the sweet spot for a VM is 256, because it enables
more compact dispatching code in the VM, and yet is more than enough to
allow a decent register allocator to generate good code.

That's my educated guess anyway.  Feel free to prove me wrong :)

    Regards,
      Mark

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16 13:44             ` Mark H Weaver
@ 2012-05-16 14:00               ` David Kastrup
  2012-05-16 14:54               ` Noah Lavine
  2012-05-16 14:58               ` Andy Wingo
  2 siblings, 0 replies; 20+ messages in thread
From: David Kastrup @ 2012-05-16 14:00 UTC (permalink / raw)
  To: guile-devel

Mark H Weaver <mhw@netris.org> writes:

> I certainly agree that we should have a generous number of registers,
> but I suspect that the sweet spot for a VM is 256, because it enables
> more compact dispatching code in the VM, and yet is more than enough to
> allow a decent register allocator to generate good code.
>
> That's my educated guess anyway.  Feel free to prove me wrong :)

The counterproof will usually be done by benchmarking, and will even
differ between different processors sharing the same instruction set.

I see two ways out:
a) pick the register size individually for each function, as small as
possible without spillage.  Which makes the whole indistinguishable from
a stack-based VM.
b) don't generate the final bytecode until the code is actually being run.

That means that _if_ code is precompiled, it will be precompiled into
either stack-based VM or some other representation better suited to
compile into code for a certain amount of registers.  Of course, the
threshold to picking actual registers of the available processor and
compiling native code is then not all too large.

-- 
David Kastrup

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16 13:44             ` Mark H Weaver
  2012-05-16 14:00               ` David Kastrup
@ 2012-05-16 14:54               ` Noah Lavine
  2012-05-16 15:05                 ` Andy Wingo
  2012-05-16 20:39                 ` Ludovic Courtès
  2012-05-16 14:58               ` Andy Wingo
  2 siblings, 2 replies; 20+ messages in thread
From: Noah Lavine @ 2012-05-16 14:54 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Andy Wingo, Ludovic Courtès, guile-devel

Hi Mark,

You are thinking along very similar lines to how I used to think. But
I have a different way to think about it that might make it seem
better.

In our current VM, we have two stacks: the local-variable stack, which
has frames for different function calls and is generally what you'd
think of as a stack, and the temporary-variable stack, which is
literally a stack in the sense that you only operate on the top of it.
The temporary-variable stack makes us do a lot of unnecessary work,
because we have to load things from the local-variable stack to the
temporary-variable stack.

I think what Andy is proposing to do is to get rid of the
temporary-variable stack and operate directly on the local-variable
stack. We shouldn't think of these registers as being like machine
registers, and in fact maybe "registers" is not a good name for these
objects. They are really just variables in the topmost stack frame.
This should only reduce memory usage, because the local-variable stack
stays the same and the temporary-variable stack goes away (some
temporaries might move to the local-variable stack, but it can't be
more than were on the temporary-variable stack, so that's still a
win).

The machine I was initially thinking of, and I imagine you were too,
is different. I had imagined a machine where the number of registers
was limited, ideally to the length of a processor cache line, and was
separate from the local-variables stack. In such a machine, the
registers are used as a cache for the local variables, and you get to
deal with all the register allocation problems that a standard
compiler would. That would accomplish the goal of keeping more things
in cache.

The "registers as cache" idea may result in faster code than the
"directly addressing local variables" idea, but it's also more
complicated to implement. So it makes sense to me that we would try
directly addressing local variables first, and maybe later move to
using a fixed-size cache of registers. It also occurs to me that the
RTL intermediate language, which is really just a language for
directly addressing an arbitrary number of local variables, is a
standard compiler intermediate language. So it might be useful to have
that around anyway, because we could more easily feed its output into,
for instance, GCC.

Andy, is this an accurate description of the register VM? And Mark and
everyone else, does it seem better when you look at it this way?

Noah

On Wed, May 16, 2012 at 9:44 AM, Mark H Weaver <mhw@netris.org> wrote:
> Hi Andy!
>
> Andy Wingo <wingo@pobox.com> writes:
>> On Wed 16 May 2012 06:23, Mark H Weaver <mhw@netris.org> writes:
>>
>>> It's surprising to me for another reason: in order to make the
>>> instructions reasonably compact, only a limited number of bits are
>>> available in each instruction to specify which registers to use.
>>
>> It turns out that being reasonably compact isn't terribly important --
>> more important is the number of opcodes it takes to get something done,
>> which translates to the number of dispatches.  Have you seen the "direct
>> threading" VM implementation strategy?  In that case the opcode is not
>> an index into a jump table, it's a word that encodes the pointer
>> directly.  So it's a word wide, just for the opcode.  That's what
>> JavaScriptCore does, for example.  The opcode is a word wide, and each
>> operand is a word as well.
>>
>> The design of the wip-rtl VM is to allow 16M registers (24-bit
>> addressing).  However many instructions can just address 2**8 registers
>> (8-bit addressing) or 2**12 registers (12-bit addressing).  We will
>> reserve registers 253 to 255 as temporaries.  If you have so many
>> registers as to need more than that, then you have to shuffle operands
>> down into the temporaries.  That's the plan, anyway.
>
> I'm very concerned about this design, for the same reason that I was
> concerned about NaN-boxing on 32-bit platforms.  Efficient use of memory
> is extremely important on modern architectures, because of the vast (and
> increasing) disparity between cache speed and RAM speed.  If you can fit
> the active set into the cache, that often makes a profound difference in
> the speed of a program.
>
> I agree that with VMs, minimizing the number of dispatches is crucial,
> but beyond a certain point, having more registers is not going to save
> you any dispatches, because they will almost never be used anyway.
> 2^12 registers is _far_ beyond that point.
>
> As I wrote before concerning NaN-boxing, I suspect that the reason these
> memory-bloated designs are so successful in the JavaScript world is that
> they are specifically optimized for use within a modern web browser,
> which is already a memory hog anyway.  Therefore, if the language
> implementation wastes yet more memory it will hardly be noticed.
>
> If I were designing this VM, I'd work hard to allow as many loops as
> possible to run completely in the cache.  That means that three things
> have to fit into the cache together: the VM itself, the user loop code,
> and the user data.  IMO, the sum of these three things should be made as
> small as possible.
>
> I certainly agree that we should have a generous number of registers,
> but I suspect that the sweet spot for a VM is 256, because it enables
> more compact dispatching code in the VM, and yet is more than enough to
> allow a decent register allocator to generate good code.
>
> That's my educated guess anyway.  Feel free to prove me wrong :)
>
>    Regards,
>      Mark

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16 13:44             ` Mark H Weaver
  2012-05-16 14:00               ` David Kastrup
  2012-05-16 14:54               ` Noah Lavine
@ 2012-05-16 14:58               ` Andy Wingo
  2012-05-16 16:27                 ` Mark H Weaver
  2 siblings, 1 reply; 20+ messages in thread
From: Andy Wingo @ 2012-05-16 14:58 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

Howdy,

On Wed 16 May 2012 15:44, Mark H Weaver <mhw@netris.org> writes:

>> The design of the wip-rtl VM is to allow 16M registers (24-bit
>> addressing).  However many instructions can just address 2**8 registers
>> (8-bit addressing) or 2**12 registers (12-bit addressing).  We will
>> reserve registers 253 to 255 as temporaries.  If you have so many
>> registers as to need more than that, then you have to shuffle operands
>> down into the temporaries.  That's the plan, anyway.
>
> I'm very concerned about this design, for the same reason that I was
> concerned about NaN-boxing on 32-bit platforms.  Efficient use of memory
> is extremely important on modern architectures, because of the vast (and
> increasing) disparity between cache speed and RAM speed.  If you can fit
> the active set into the cache, that often makes a profound difference in
> the speed of a program.
>
> I agree that with VMs, minimizing the number of dispatches is crucial,
> but beyond a certain point, having more registers is not going to save
> you any dispatches, because they will almost never be used anyway.
> 2^12 registers is _far_ beyond that point.

I'm probably not explaining myself clearly.  Here goes.

I willingly grant that 256 registers is usually enough.  But there are
valid reasons to use 2**12 registers: for example in the mov
instruction, if you have an 8-bit opcode, you have 24 bits left.  Using
12 for each operand makes sense.  There are other cases in which you
want to reference 24-bit values, for relative jumps; and even 32-bit
values, to reference constants using relative addressing.  (64 MB is too
small a limit for one compilation unit.  16 GB is fine.)

Likewise I can imagine cases in which you might end up with more than
2**12 active locals, especially in the presence of macros.  In that case
you spill.  But where do you spill?  For Guile, this means spilling to
additional registers, and having to shuffle with long-mov.  Otherwise
you would spill to a vector or something.  The WIP-RTL strategy
adequately captures the edge case while making the normal case fast.

> If I were designing this VM, I'd work hard to allow as many loops as
> possible to run completely in the cache.  That means that three things
> have to fit into the cache together: the VM itself, the user loop code,
> and the user data.  IMO, the sum of these three things should be made as
> small as possible.

Certainly.

> I certainly agree that we should have a generous number of registers,
> but I suspect that the sweet spot for a VM is 256, because it enables
> more compact dispatching code in the VM, and yet is more than enough to
> allow a decent register allocator to generate good code.
>
> That's my educated guess anyway.  Feel free to prove me wrong :)

I will do better: I will prove you right and prove me right at the same
time :)  The instructions in wip-rtl try to stay in one 32-bit unit.  In
that case they have limits, usually 8 bits.  But when they need to
"spill", they will do so on the stack, not on the heap.

Regards,

Andy
-- 
http://wingolog.org/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-11 20:29 ` Stefan Israelsson Tampe
@ 2012-05-16 15:01   ` Andy Wingo
  0 siblings, 0 replies; 20+ messages in thread
From: Andy Wingo @ 2012-05-16 15:01 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: guile-devel

Hi Stefan,

On Fri 11 May 2012 22:29, Stefan Israelsson Tampe <stefan.itampe@gmail.com> writes:

> 1. What about growing stacks any coments if they will be easier to manage
> for this setup. Can one copy the C stack logic?

Having a fixed-size frame means that it's easier to have disjoint
stacks, since a register VM addresses operands relative to the frame
pointer and not the stack pointer.  I hope to be able to decrease our
default stack size, and allow it instead to grow dynamically.

> 2. Is there an instruction that does what call does but can be used for tail call's
> when it needs it e.g. the code
>  for (n = 0; n < nargs; n++)
>         LOCAL_SET (n, old_fp[ip[4 + n]]);
> that is missing for the tail code

This is another advantage of wip-rtl.  In it, the compiler is
responsible for shuffling tail arguments.  It can do a parallel move
possibly without shuffling args to the top of the stack.  Then tail-call
just sets a new procedure and jumps to its entry.

> 3. I would appriciate if the frame is always below say 256 SCM:s of the fp stack limit
> that way when preparing tail calling one doesn't usally need to check if the argument fit's
> when issuing a tail call.

See above :)

> 4. I think the logic code hook I recently investigated could easily fit into this VM engine with
> using similar techniques as I described in previous mails.

I'm still working back through the mails; remind me again if it seems I
overlooked this mail.

Cheers,

Andy
-- 
http://wingolog.org/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16 14:54               ` Noah Lavine
@ 2012-05-16 15:05                 ` Andy Wingo
  2012-05-16 20:39                 ` Ludovic Courtès
  1 sibling, 0 replies; 20+ messages in thread
From: Andy Wingo @ 2012-05-16 15:05 UTC (permalink / raw)
  To: Noah Lavine; +Cc: guile-devel

On Wed 16 May 2012 16:54, Noah Lavine <noah.b.lavine@gmail.com> writes:

> In our current VM, we have two stacks: the local-variable stack, which
> has frames for different function calls and is generally what you'd
> think of as a stack, and the temporary-variable stack, which is
> literally a stack in the sense that you only operate on the top of it.

Nice explanation!  Surely "register vm" is a poor name, if one can have
a variable number of registers -- registers are normally something of
which one has a fixed number.

Cheers,

Andy
-- 
http://wingolog.org/



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16 14:58               ` Andy Wingo
@ 2012-05-16 16:27                 ` Mark H Weaver
  2012-05-16 16:39                   ` Andy Wingo
  0 siblings, 1 reply; 20+ messages in thread
From: Mark H Weaver @ 2012-05-16 16:27 UTC (permalink / raw)
  To: Andy Wingo; +Cc: Ludovic Courtès, guile-devel

Hi Andy,

Andy Wingo <wingo@pobox.com> writes:
> Likewise I can imagine cases in which you might end up with more than
> 2**12 active locals, especially in the presence of macros.  In that case
> you spill.  But where do you spill?

You spill to them to stack of course, which brings me to my next point:
as discussed in chapter 4 of David Kranz's thesis on the Orbit compiler
(highly recommended reading) it is sometimes better to store a local on
the stack, even if you have registers to spare.  The reason is that
registers must be saved and restored for every procedure call, and as we
all know, Scheme has no shortage of those.

What's your plan for saving and restoring such a large register file?

    Best,
     Mark

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16 16:27                 ` Mark H Weaver
@ 2012-05-16 16:39                   ` Andy Wingo
  2012-05-16 18:23                     ` Noah Lavine
  0 siblings, 1 reply; 20+ messages in thread
From: Andy Wingo @ 2012-05-16 16:39 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: Ludovic Courtès, guile-devel

On Wed 16 May 2012 18:27, Mark H Weaver <mhw@netris.org> writes:

> What's your plan for saving and restoring such a large register file?

Here it's clear that I didn't communicate well.  What I am terming a
"register" is a value in a local stack slot.  That's all.  No need to
save and restore, since they're already on the stack.

In a virtual machine, it doesn't make sense to have registers in the
hardware sense, I don't think.

Perhaps it needs a different name than "register virtual machine".

Andy
-- 
http://wingolog.org/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16 16:39                   ` Andy Wingo
@ 2012-05-16 18:23                     ` Noah Lavine
  0 siblings, 0 replies; 20+ messages in thread
From: Noah Lavine @ 2012-05-16 18:23 UTC (permalink / raw)
  To: Andy Wingo; +Cc: Mark H Weaver, Ludovic Courtès, guile-devel

> Perhaps it needs a different name than "register virtual machine".

How about "RTL VM", since it's a virtual machine that interprets RTL?

Or maybe "frame-addressed VM", because the operations address objects
in the current stack frame?

Noah



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Register VM WIP
  2012-05-16 14:54               ` Noah Lavine
  2012-05-16 15:05                 ` Andy Wingo
@ 2012-05-16 20:39                 ` Ludovic Courtès
  1 sibling, 0 replies; 20+ messages in thread
From: Ludovic Courtès @ 2012-05-16 20:39 UTC (permalink / raw)
  To: Noah Lavine; +Cc: Andy Wingo, Mark H Weaver, guile-devel

Hi,

Noah Lavine <noah.b.lavine@gmail.com> skribis:

> I think what Andy is proposing to do is to get rid of the
> temporary-variable stack and operate directly on the local-variable
> stack. We shouldn't think of these registers as being like machine
> registers, and in fact maybe "registers" is not a good name for these
> objects. They are really just variables in the topmost stack frame.

Yeah, I too was confused the first time I heard about “register VMs.”

The key idea is that opcodes encode the offset of the operand they work
on, rather than working only on the last words of the stack (for example,
‘add x y’ instead of ‘local-ref x’ + ‘local-ref y’ + ‘add’ + ‘local-set x’.)

Thanks,
Ludo’.



^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-05-16 20:39 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-05-11 16:19 Register VM WIP Andy Wingo
2012-05-11 20:29 ` Stefan Israelsson Tampe
2012-05-16 15:01   ` Andy Wingo
2012-05-14 21:09 ` Ludovic Courtès
2012-05-14 21:28   ` Andrew Gwozdziewycz
2012-05-15 18:45     ` Andy Wingo
2012-05-16  0:39       ` Noah Lavine
2012-05-16  4:23         ` Mark H Weaver
2012-05-16  7:15           ` Andy Wingo
2012-05-16 13:44             ` Mark H Weaver
2012-05-16 14:00               ` David Kastrup
2012-05-16 14:54               ` Noah Lavine
2012-05-16 15:05                 ` Andy Wingo
2012-05-16 20:39                 ` Ludovic Courtès
2012-05-16 14:58               ` Andy Wingo
2012-05-16 16:27                 ` Mark H Weaver
2012-05-16 16:39                   ` Andy Wingo
2012-05-16 18:23                     ` Noah Lavine
2012-05-16  7:10         ` Andy Wingo
2012-05-15 18:49   ` Andy Wingo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).