thoughts on native code

unofficial mirror of guile-devel@gnu.org 
 help / color / mirror / Atom feed

* thoughts on native code
@ 2012-11-10 14:41 Stefan Israelsson Tampe
  2012-11-10 22:06 ` Stefan Israelsson Tampe
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Israelsson Tampe @ 2012-11-10 14:41 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 2523 bytes --]

Hi all,

After talking with Mark Weaver about his view on native code, I have been
pondering how to best model our needs.

I do have a framework now that translates almost all of the rtl vm directly
to native code and it do shows a speed increase of say 4x compared to
runing a rtl VM. I can also generate rtl code all the way from guile scheme
right now so It's pretty easy to generate test cases. The problem that Mark
point out to is that we need to take care to not blow the instructuction
cache. This is not seen in these simple examples but we need larger code
bases to test out what is actually true. What we can note though is that I
expect the size of the code to blow up with a factor of around 10 compared
to the instruction feed in the rtl code.

One interesting fact is that SBCL does fairly well by basically using the
native instruction as the instruction flow to it's VM. For example if it
can deduce that a + operation works with fixnums it simply compiles that as
a function call to a general + routine e.g. it will do a long jump to the +
routine, do the plus, and longjump back essentially dispatching general
instructions like + * / etc, directly e.g. sbcl do have a virtual machine,
it just don't to table lookup to do the dispatch, but function call's in
stead. If you count longjumps this means that the number of jumps for these
instructions are double that of using the original table lookup methods.
But for calling functions and returning functions the number of longjumps
are the same and moving local variables in place , jumping  is really fast.

Anyway, this method of dispatching would mean a fairly small footprint with
respect to the direct assembler. Another big chunk of code that we can
speedup without to much bloat in the instruction cache is the lookup of
pairs, structs and arrays, the reason is that in many cases we can deduce
at compilation so much that we do not need to check the type all the time
but can safely lookup the needed infromation.

Now is this method fast? well, looking a the sbcl code for calculating 1+ 2
+ 3 + 4 , (disassembling it) I see that it do uses the mechanism above, it
manages to sum 150M terms in one second, that's quite a feat for a VM with
no JIT. The same with the rtl VM is 65M.

Now, sbcl's compiler is quite matured and uses native registers quite well
which explains one of the reasons why the speed. My point is though that we
can model efficiently a VM by call's and using the native instructions and
a instructions flow.

Regards Stefan

[-- Attachment #2: Type: text/html, Size: 2608 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-10 14:41 thoughts on native code Stefan Israelsson Tampe
@ 2012-11-10 22:06 ` Stefan Israelsson Tampe
  2012-11-10 22:49   ` Noah Lavine
  2012-11-11 19:28   ` Stefan Israelsson Tampe
  0 siblings, 2 replies; 12+ messages in thread
From: Stefan Israelsson Tampe @ 2012-11-10 22:06 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 5194 bytes --]

I would like to continue the discussion about native code.

Some facts are,
For example, consider this
(define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+ s i) (+ i
1)))))

The timings for (f 100000000)  ~ (f 100M) is

1) current vm                 : 2.93s
2) rtl                              : 1.67s
3) compressed native     : 1.15s
4) uncompressed native : 0.54s

sbcl = compressed nativ + better register allocations (normal optimization
level) : 0.68s

To note is that for this example the call overhead is close to 5ns per
iteration and meaning that
if we combined 4 with better register handling the potential is to get this
loop to run at 0.2s which means
that the loop has the potential of running 500M iterations in one second
without sacrifying safety and not
have a extraterestial code analyzer. Also to note is that the native code
for the compressed native is smaller then the
rtl code by some factor and if we could make use of registers in a better
way we would end up with even less overhead.

To note is that compressed native is a very simple mechanism to gain some
speed and also improve on memory
usage in the instruction flow, Also the assembler is very simplistic and it
would not be to much hassle to port a new
instruction format to that environment. Also it's probably possible to
handle the complexity of the code in pure C
for the stubs and by compiling them in a special way make sure they output
a format that can be combined
with the meta information in special registers needed to make the execution
of the compiled scheme effective.

This study also shows that there is a clear benefit to be able to use the
computers registers, and I think this is the way
you would like the system to behave in the end. sbcl does this rather
nicely and we could look at their way of doing it.

So, the main question now to you is how to implement the register
allocations? Basic principles of register allocation can be gotten out from
the internet, I'm assure of, but the problem is how to handle the
interaction with the helper stubs. That is
something i'm not sure of yet.

A simple solution would be to assume that the native code have a set of
available registers r1,...,ri and then force the
compilation of the stubs to treat the just like the registers bp, sp, and
bx. I'm sure that this is possible to configure in gcc.

So the task for me right now is to find out more how to do this, if you
have any pointers or ideas, please help out.

Cheers
Stefan

On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> Hi all,
>
> After talking with Mark Weaver about his view on native code, I have been
> pondering how to best model our needs.
>
> I do have a framework now that translates almost all of the rtl vm
> directly to native code and it do shows a speed increase of say 4x compared
> to runing a rtl VM. I can also generate rtl code all the way from guile
> scheme right now so It's pretty easy to generate test cases. The problem
> that Mark point out to is that we need to take care to not blow the
> instructuction cache. This is not seen in these simple examples but we need
> larger code bases to test out what is actually true. What we can note
> though is that I expect the size of the code to blow up with a factor of
> around 10 compared to the instruction feed in the rtl code.
>
> One interesting fact is that SBCL does fairly well by basically using the
> native instruction as the instruction flow to it's VM. For example if it
> can deduce that a + operation works with fixnums it simply compiles that as
> a function call to a general + routine e.g. it will do a long jump to the +
> routine, do the plus, and longjump back essentially dispatching general
> instructions like + * / etc, directly e.g. sbcl do have a virtual machine,
> it just don't to table lookup to do the dispatch, but function call's in
> stead. If you count longjumps this means that the number of jumps for these
> instructions are double that of using the original table lookup methods.
> But for calling functions and returning functions the number of longjumps
> are the same and moving local variables in place , jumping  is really fast.
>
> Anyway, this method of dispatching would mean a fairly small footprint
> with respect to the direct assembler. Another big chunk of code that we can
> speedup without to much bloat in the instruction cache is the lookup of
> pairs, structs and arrays, the reason is that in many cases we can deduce
> at compilation so much that we do not need to check the type all the time
> but can safely lookup the needed infromation.
>
> Now is this method fast? well, looking a the sbcl code for calculating 1+
> 2 + 3 + 4 , (disassembling it) I see that it do uses the mechanism above,
> it manages to sum 150M terms in one second, that's quite a feat for a VM
> with no JIT. The same with the rtl VM is 65M.
>
> Now, sbcl's compiler is quite matured and uses native registers quite well
> which explains one of the reasons why the speed. My point is though that we
> can model efficiently a VM by call's and using the native instructions and
> a instructions flow.
>
> Regards Stefan
>

[-- Attachment #2: Type: text/html, Size: 5657 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-10 22:06 ` Stefan Israelsson Tampe
@ 2012-11-10 22:49   ` Noah Lavine
  2012-11-12 21:50     ` Stefan Israelsson Tampe
  2012-11-11 19:28   ` Stefan Israelsson Tampe
  1 sibling, 1 reply; 12+ messages in thread
From: Noah Lavine @ 2012-11-10 22:49 UTC (permalink / raw)
  To: Stefan Israelsson Tampe; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 6463 bytes --]

Hello,

I assume "compressed native" is the idea you wrote about in your last
email, where we generate native code which is a sequence of function calls
to VM operations.

I really like that idea. As you said, it uses the instruction cache better.
But it also fixes something I was worried about, which is that it's a lot
of work to port an assembler to a new architecture, so we might end up not
supporting many native architectures. But it seems much easier to make an
assembler that only knows how to make call instructions and branches. So we
could support compressed native on lots of architectures, and maybe
uncompressed native only on some.

If you want a quick way to do compressed native with reasonable register
allocation, GNU libjit might work. I used it a couple years ago for a JIT
project that we never fully implemented. I chose it over GNU Lightning
specifically because it did register allocation. It implements a full
assembler, not just calls, which could also be nice later.

Noah



On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> I would like to continue the discussion about native code.
>
> Some facts are,
> For example, consider this
> (define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+ s i) (+ i
> 1)))))
>
> The timings for (f 100000000)  ~ (f 100M) is
>
> 1) current vm                 : 2.93s
> 2) rtl                              : 1.67s
> 3) compressed native     : 1.15s
> 4) uncompressed native : 0.54s
>
> sbcl = compressed nativ + better register allocations (normal optimization
> level) : 0.68s
>
> To note is that for this example the call overhead is close to 5ns per
> iteration and meaning that
> if we combined 4 with better register handling the potential is to get
> this loop to run at 0.2s which means
> that the loop has the potential of running 500M iterations in one second
> without sacrifying safety and not
> have a extraterestial code analyzer. Also to note is that the native code
> for the compressed native is smaller then the
> rtl code by some factor and if we could make use of registers in a better
> way we would end up with even less overhead.
>
> To note is that compressed native is a very simple mechanism to gain some
> speed and also improve on memory
> usage in the instruction flow, Also the assembler is very simplistic and
> it would not be to much hassle to port a new
> instruction format to that environment. Also it's probably possible to
> handle the complexity of the code in pure C
> for the stubs and by compiling them in a special way make sure they output
> a format that can be combined
> with the meta information in special registers needed to make the
> execution of the compiled scheme effective.
>
> This study also shows that there is a clear benefit to be able to use the
> computers registers, and I think this is the way
> you would like the system to behave in the end. sbcl does this rather
> nicely and we could look at their way of doing it.
>
> So, the main question now to you is how to implement the register
> allocations? Basic principles of register allocation can be gotten out from
> the internet, I'm assure of, but the problem is how to handle the
> interaction with the helper stubs. That is
> something i'm not sure of yet.
>
> A simple solution would be to assume that the native code have a set of
> available registers r1,...,ri and then force the
> compilation of the stubs to treat the just like the registers bp, sp, and
> bx. I'm sure that this is possible to configure in gcc.
>
> So the task for me right now is to find out more how to do this, if you
> have any pointers or ideas, please help out.
>
> Cheers
> Stefan
>
>
>
>
>
>
> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
> stefan.itampe@gmail.com> wrote:
>
>> Hi all,
>>
>> After talking with Mark Weaver about his view on native code, I have been
>> pondering how to best model our needs.
>>
>> I do have a framework now that translates almost all of the rtl vm
>> directly to native code and it do shows a speed increase of say 4x compared
>> to runing a rtl VM. I can also generate rtl code all the way from guile
>> scheme right now so It's pretty easy to generate test cases. The problem
>> that Mark point out to is that we need to take care to not blow the
>> instructuction cache. This is not seen in these simple examples but we need
>> larger code bases to test out what is actually true. What we can note
>> though is that I expect the size of the code to blow up with a factor of
>> around 10 compared to the instruction feed in the rtl code.
>>
>> One interesting fact is that SBCL does fairly well by basically using the
>> native instruction as the instruction flow to it's VM. For example if it
>> can deduce that a + operation works with fixnums it simply compiles that as
>> a function call to a general + routine e.g. it will do a long jump to the +
>> routine, do the plus, and longjump back essentially dispatching general
>> instructions like + * / etc, directly e.g. sbcl do have a virtual machine,
>> it just don't to table lookup to do the dispatch, but function call's in
>> stead. If you count longjumps this means that the number of jumps for these
>> instructions are double that of using the original table lookup methods.
>> But for calling functions and returning functions the number of longjumps
>> are the same and moving local variables in place , jumping  is really fast.
>>
>> Anyway, this method of dispatching would mean a fairly small footprint
>> with respect to the direct assembler. Another big chunk of code that we can
>> speedup without to much bloat in the instruction cache is the lookup of
>> pairs, structs and arrays, the reason is that in many cases we can deduce
>> at compilation so much that we do not need to check the type all the time
>> but can safely lookup the needed infromation.
>>
>> Now is this method fast? well, looking a the sbcl code for calculating 1+
>> 2 + 3 + 4 , (disassembling it) I see that it do uses the mechanism above,
>> it manages to sum 150M terms in one second, that's quite a feat for a VM
>> with no JIT. The same with the rtl VM is 65M.
>>
>> Now, sbcl's compiler is quite matured and uses native registers quite
>> well which explains one of the reasons why the speed. My point is though
>> that we can model efficiently a VM by call's and using the native
>> instructions and a instructions flow.
>>
>> Regards Stefan
>>
>
>

[-- Attachment #2: Type: text/html, Size: 7305 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-10 22:49   ` Noah Lavine
@ 2012-11-12 21:50     ` Stefan Israelsson Tampe
  2012-11-15 10:19       ` Sjoerd van Leent Privé
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Israelsson Tampe @ 2012-11-12 21:50 UTC (permalink / raw)
  To: Noah Lavine; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 7354 bytes --]

Thanks for your mail Noah,

Yea libjit is quite interesting. But playing around with an assembler in
scheme I do not want to go back to
C or C++ land. The only problem is that we need a GNU scheme assembler and
right now I use sbcl's assembler
ported to scheme. We could perhaps use weinholts assembler as well in
industria if he could sign papers to make it GNU. For the register
allocation part I would really like to play a little in scheme to explore
the idea you saw from my previous mail in this thread. Again I think it's
natural to have this features in scheme and do not want to mess in C land
too much.

Am I wrong?

Cheers
Stefan


On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine <noah.b.lavine@gmail.com>wrote:

> Hello,
>
> I assume "compressed native" is the idea you wrote about in your last
> email, where we generate native code which is a sequence of function calls
> to VM operations.
>
> I really like that idea. As you said, it uses the instruction cache
> better. But it also fixes something I was worried about, which is that it's
> a lot of work to port an assembler to a new architecture, so we might end
> up not supporting many native architectures. But it seems much easier to
> make an assembler that only knows how to make call instructions and
> branches. So we could support compressed native on lots of architectures,
> and maybe uncompressed native only on some.
>
> If you want a quick way to do compressed native with reasonable register
> allocation, GNU libjit might work. I used it a couple years ago for a JIT
> project that we never fully implemented. I chose it over GNU Lightning
> specifically because it did register allocation. It implements a full
> assembler, not just calls, which could also be nice later.
>
> Noah
>
>
>
> On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe <
> stefan.itampe@gmail.com> wrote:
>
>> I would like to continue the discussion about native code.
>>
>> Some facts are,
>> For example, consider this
>> (define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+ s i) (+ i
>> 1)))))
>>
>> The timings for (f 100000000)  ~ (f 100M) is
>>
>> 1) current vm                 : 2.93s
>> 2) rtl                              : 1.67s
>> 3) compressed native     : 1.15s
>> 4) uncompressed native : 0.54s
>>
>> sbcl = compressed nativ + better register allocations (normal
>> optimization level) : 0.68s
>>
>> To note is that for this example the call overhead is close to 5ns per
>> iteration and meaning that
>> if we combined 4 with better register handling the potential is to get
>> this loop to run at 0.2s which means
>> that the loop has the potential of running 500M iterations in one second
>> without sacrifying safety and not
>> have a extraterestial code analyzer. Also to note is that the native code
>> for the compressed native is smaller then the
>> rtl code by some factor and if we could make use of registers in a better
>> way we would end up with even less overhead.
>>
>> To note is that compressed native is a very simple mechanism to gain some
>> speed and also improve on memory
>> usage in the instruction flow, Also the assembler is very simplistic and
>> it would not be to much hassle to port a new
>> instruction format to that environment. Also it's probably possible to
>> handle the complexity of the code in pure C
>> for the stubs and by compiling them in a special way make sure they
>> output a format that can be combined
>> with the meta information in special registers needed to make the
>> execution of the compiled scheme effective.
>>
>> This study also shows that there is a clear benefit to be able to use the
>> computers registers, and I think this is the way
>> you would like the system to behave in the end. sbcl does this rather
>> nicely and we could look at their way of doing it.
>>
>> So, the main question now to you is how to implement the register
>> allocations? Basic principles of register allocation can be gotten out from
>> the internet, I'm assure of, but the problem is how to handle the
>> interaction with the helper stubs. That is
>> something i'm not sure of yet.
>>
>> A simple solution would be to assume that the native code have a set of
>> available registers r1,...,ri and then force the
>> compilation of the stubs to treat the just like the registers bp, sp, and
>> bx. I'm sure that this is possible to configure in gcc.
>>
>> So the task for me right now is to find out more how to do this, if you
>> have any pointers or ideas, please help out.
>>
>> Cheers
>> Stefan
>>
>>
>>
>>
>>
>>
>> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
>> stefan.itampe@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> After talking with Mark Weaver about his view on native code, I have
>>> been pondering how to best model our needs.
>>>
>>> I do have a framework now that translates almost all of the rtl vm
>>> directly to native code and it do shows a speed increase of say 4x compared
>>> to runing a rtl VM. I can also generate rtl code all the way from guile
>>> scheme right now so It's pretty easy to generate test cases. The problem
>>> that Mark point out to is that we need to take care to not blow the
>>> instructuction cache. This is not seen in these simple examples but we need
>>> larger code bases to test out what is actually true. What we can note
>>> though is that I expect the size of the code to blow up with a factor of
>>> around 10 compared to the instruction feed in the rtl code.
>>>
>>> One interesting fact is that SBCL does fairly well by basically using
>>> the native instruction as the instruction flow to it's VM. For example if
>>> it can deduce that a + operation works with fixnums it simply compiles that
>>> as a function call to a general + routine e.g. it will do a long jump to
>>> the + routine, do the plus, and longjump back essentially dispatching
>>> general instructions like + * / etc, directly e.g. sbcl do have a virtual
>>> machine, it just don't to table lookup to do the dispatch, but function
>>> call's in stead. If you count longjumps this means that the number of jumps
>>> for these instructions are double that of using the original table lookup
>>> methods. But for calling functions and returning functions the number of
>>> longjumps are the same and moving local variables in place , jumping  is
>>> really fast.
>>>
>>> Anyway, this method of dispatching would mean a fairly small footprint
>>> with respect to the direct assembler. Another big chunk of code that we can
>>> speedup without to much bloat in the instruction cache is the lookup of
>>> pairs, structs and arrays, the reason is that in many cases we can deduce
>>> at compilation so much that we do not need to check the type all the time
>>> but can safely lookup the needed infromation.
>>>
>>> Now is this method fast? well, looking a the sbcl code for calculating
>>> 1+ 2 + 3 + 4 , (disassembling it) I see that it do uses the mechanism
>>> above, it manages to sum 150M terms in one second, that's quite a feat for
>>> a VM with no JIT. The same with the rtl VM is 65M.
>>>
>>> Now, sbcl's compiler is quite matured and uses native registers quite
>>> well which explains one of the reasons why the speed. My point is though
>>> that we can model efficiently a VM by call's and using the native
>>> instructions and a instructions flow.
>>>
>>> Regards Stefan
>>>
>>
>>
>

[-- Attachment #2: Type: text/html, Size: 8458 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-12 21:50     ` Stefan Israelsson Tampe
@ 2012-11-15 10:19       ` Sjoerd van Leent Privé
  2012-11-15 16:04         ` Stefan Israelsson Tampe
                           ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Sjoerd van Leent Privé @ 2012-11-15 10:19 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 9525 bytes --]

Hi Stefan,

Just my idea about an assembler in Scheme. Sounds interesting. If it's 
done properly, it can be very promising to use scheme itself to directly 
emit machine instructions. This would also be interesting for meta 
compilation in the future (think of aiding GCC).

So you are thinking about an assembler for x86? Perhaps I can help out 
on this one. I would like to do this part, as I haven't been able to aid 
on other parts besides voicing my ideas (anyways, I am on embedded 
development these days.)

The only discussion is the syntax I believe, I mean, should it be AT&T 
like, Intel like, or leave this domain and do something new. I would go 
for instructions like this (using macros):

(let ((target :x686))
   (assemble target
    ((mov long 100 EAX)
     (mov long 200 EBX)
     (add long EBX EAX))))

Giving back the native machine code instructions. Perhaps special 
constructions can be made to return partially complete instructions 
(such as missing labels or calls to guile procedures...)

Sjoerd


On 11/12/2012 10:50 PM, Stefan Israelsson Tampe wrote:
> Thanks for your mail Noah,
>
> Yea libjit is quite interesting. But playing around with an assembler 
> in scheme I do not want to go back to
> C or C++ land. The only problem is that we need a GNU scheme assembler 
> and right now I use sbcl's assembler
> ported to scheme. We could perhaps use weinholts assembler as well in 
> industria if he could sign papers to make it GNU. For the register 
> allocation part I would really like to play a little in scheme to 
> explore the idea you saw from my previous mail in this thread. Again I 
> think it's natural to have this features in scheme and do not want to 
> mess in C land too much.
>
> Am I wrong?
>
> Cheers
> Stefan
>
>
> On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine <noah.b.lavine@gmail.com 
> <mailto:noah.b.lavine@gmail.com>> wrote:
>
>     Hello,
>
>     I assume "compressed native" is the idea you wrote about in your
>     last email, where we generate native code which is a sequence of
>     function calls to VM operations.
>
>     I really like that idea. As you said, it uses the instruction
>     cache better. But it also fixes something I was worried about,
>     which is that it's a lot of work to port an assembler to a new
>     architecture, so we might end up not supporting many native
>     architectures. But it seems much easier to make an assembler that
>     only knows how to make call instructions and branches. So we could
>     support compressed native on lots of architectures, and maybe
>     uncompressed native only on some.
>
>     If you want a quick way to do compressed native with reasonable
>     register allocation, GNU libjit might work. I used it a couple
>     years ago for a JIT project that we never fully implemented. I
>     chose it over GNU Lightning specifically because it did register
>     allocation. It implements a full assembler, not just calls, which
>     could also be nice later.
>
>     Noah
>
>
>
>     On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe
>     <stefan.itampe@gmail.com <mailto:stefan.itampe@gmail.com>> wrote:
>
>         I would like to continue the discussion about native code.
>
>         Some facts are,
>         For example, consider this
>         (define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+
>         s i) (+ i 1)))))
>
>         The timings for (f 100000000)  ~ (f 100M) is
>
>         1) current vm                 : 2.93s
>         2) rtl                              : 1.67s
>         3) compressed native     : 1.15s
>         4) uncompressed native : 0.54s
>
>         sbcl = compressed nativ + better register allocations (normal
>         optimization level) : 0.68s
>
>         To note is that for this example the call overhead is close to
>         5ns per iteration and meaning that
>         if we combined 4 with better register handling the potential
>         is to get this loop to run at 0.2s which means
>         that the loop has the potential of running 500M iterations in
>         one second without sacrifying safety and not
>         have a extraterestial code analyzer. Also to note is that the
>         native code for the compressed native is smaller then the
>         rtl code by some factor and if we could make use of registers
>         in a better way we would end up with even less overhead.
>
>         To note is that compressed native is a very simple mechanism
>         to gain some speed and also improve on memory
>         usage in the instruction flow, Also the assembler is very
>         simplistic and it would not be to much hassle to port a new
>         instruction format to that environment. Also it's probably
>         possible to handle the complexity of the code in pure C
>         for the stubs and by compiling them in a special way make sure
>         they output a format that can be combined
>         with the meta information in special registers needed to make
>         the execution of the compiled scheme effective.
>
>         This study also shows that there is a clear benefit to be able
>         to use the computers registers, and I think this is the way
>         you would like the system to behave in the end. sbcl does this
>         rather nicely and we could look at their way of doing it.
>
>         So, the main question now to you is how to implement the
>         register allocations? Basic principles of register allocation
>         can be gotten out from the internet, I'm assure of, but the
>         problem is how to handle the interaction with the helper
>         stubs. That is
>         something i'm not sure of yet.
>
>         A simple solution would be to assume that the native code have
>         a set of available registers r1,...,ri and then force the
>         compilation of the stubs to treat the just like the registers
>         bp, sp, and bx. I'm sure that this is possible to configure in
>         gcc.
>
>         So the task for me right now is to find out more how to do
>         this, if you have any pointers or ideas, please help out.
>
>         Cheers
>         Stefan
>
>
>
>
>
>
>         On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe
>         <stefan.itampe@gmail.com <mailto:stefan.itampe@gmail.com>> wrote:
>
>             Hi all,
>
>             After talking with Mark Weaver about his view on native
>             code, I have been pondering how to best model our needs.
>
>             I do have a framework now that translates almost all of
>             the rtl vm directly to native code and it do shows a speed
>             increase of say 4x compared to runing a rtl VM. I can also
>             generate rtl code all the way from guile scheme right now
>             so It's pretty easy to generate test cases. The problem
>             that Mark point out to is that we need to take care to not
>             blow the instructuction cache. This is not seen in these
>             simple examples but we need larger code bases to test out
>             what is actually true. What we can note though is that I
>             expect the size of the code to blow up with a factor of
>             around 10 compared to the instruction feed in the rtl code.
>
>             One interesting fact is that SBCL does fairly well by
>             basically using the native instruction as the instruction
>             flow to it's VM. For example if it can deduce that a +
>             operation works with fixnums it simply compiles that as a
>             function call to a general + routine e.g. it will do a
>             long jump to the + routine, do the plus, and longjump back
>             essentially dispatching general instructions like + * /
>             etc, directly e.g. sbcl do have a virtual machine, it just
>             don't to table lookup to do the dispatch, but function
>             call's in stead. If you count longjumps this means that
>             the number of jumps for these instructions are double that
>             of using the original table lookup methods. But for
>             calling functions and returning functions the number of
>             longjumps are the same and moving local variables in place
>             , jumping  is really fast.
>
>             Anyway, this method of dispatching would mean a fairly
>             small footprint with respect to the direct assembler.
>             Another big chunk of code that we can speedup without to
>             much bloat in the instruction cache is the lookup of
>             pairs, structs and arrays, the reason is that in many
>             cases we can deduce at compilation so much that we do not
>             need to check the type all the time but can safely lookup
>             the needed infromation.
>
>             Now is this method fast? well, looking a the sbcl code for
>             calculating 1+ 2 + 3 + 4 , (disassembling it) I see that
>             it do uses the mechanism above, it manages to sum 150M
>             terms in one second, that's quite a feat for a VM with no
>             JIT. The same with the rtl VM is 65M.
>
>             Now, sbcl's compiler is quite matured and uses native
>             registers quite well which explains one of the reasons why
>             the speed. My point is though that we can model
>             efficiently a VM by call's and using the native
>             instructions and a instructions flow.
>
>             Regards Stefan
>
>
>
>


[-- Attachment #2: Type: text/html, Size: 16136 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-15 10:19       ` Sjoerd van Leent Privé
@ 2012-11-15 16:04         ` Stefan Israelsson Tampe
  2012-11-15 16:13         ` Stefan Israelsson Tampe
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Stefan Israelsson Tampe @ 2012-11-15 16:04 UTC (permalink / raw)
  To: Sjoerd van Leent Privé; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 9417 bytes --]

Hi Sjoerd,

1. I (mov dst src), this is what's used in the rtl vm code in C and I would
like to keep a match between
the two

2. I wouldl like to have  prefix for instruction to mark those out like
    (inst mov dst src)

3. It would be nice to have a default size of the architecture e.g.
   (inst mov quad dst src)  is equivalent to (inst mov dst src)

4. I prefere to have an evironment like
   (assemble target
      (inst mov b a)



On Thu, Nov 15, 2012 at 11:19 AM, Sjoerd van Leent Privé <
svanleent@gmail.com> wrote:

>  Hi Stefan,
>
> Just my idea about an assembler in Scheme. Sounds interesting. If it's
> done properly, it can be very promising to use scheme itself to directly
> emit machine instructions. This would also be interesting for meta
> compilation in the future (think of aiding GCC).
>
> So you are thinking about an assembler for x86? Perhaps I can help out on
> this one. I would like to do this part, as I haven't been able to aid on
> other parts besides voicing my ideas (anyways, I am on embedded development
> these days.)
>
> The only discussion is the syntax I believe, I mean, should it be AT&T
> like, Intel like, or leave this domain and do something new. I would go for
> instructions like this (using macros):
>
> (let ((target :x686))
>   (assemble target
>    ((mov long 100 EAX)
>     (mov long 200 EBX)
>     (add long EBX EAX))))
>
> Giving back the native machine code instructions. Perhaps special
> constructions can be made to return partially complete instructions (such
> as missing labels or calls to guile procedures...)
>
> Sjoerd
>
>
>
> On 11/12/2012 10:50 PM, Stefan Israelsson Tampe wrote:
>
> Thanks for your mail Noah,
>
> Yea libjit is quite interesting. But playing around with an assembler in
> scheme I do not want to go back to
> C or C++ land. The only problem is that we need a GNU scheme assembler and
> right now I use sbcl's assembler
> ported to scheme. We could perhaps use weinholts assembler as well in
> industria if he could sign papers to make it GNU. For the register
> allocation part I would really like to play a little in scheme to explore
> the idea you saw from my previous mail in this thread. Again I think it's
> natural to have this features in scheme and do not want to mess in C land
> too much.
>
> Am I wrong?
>
> Cheers
> Stefan
>
>
> On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine <noah.b.lavine@gmail.com>wrote:
>
>> Hello,
>>
>>  I assume "compressed native" is the idea you wrote about in your last
>> email, where we generate native code which is a sequence of function calls
>> to VM operations.
>>
>>  I really like that idea. As you said, it uses the instruction cache
>> better. But it also fixes something I was worried about, which is that it's
>> a lot of work to port an assembler to a new architecture, so we might end
>> up not supporting many native architectures. But it seems much easier to
>> make an assembler that only knows how to make call instructions and
>> branches. So we could support compressed native on lots of architectures,
>> and maybe uncompressed native only on some.
>>
>>  If you want a quick way to do compressed native with reasonable
>> register allocation, GNU libjit might work. I used it a couple years ago
>> for a JIT project that we never fully implemented. I chose it over GNU
>> Lightning specifically because it did register allocation. It implements a
>> full assembler, not just calls, which could also be nice later.
>>
>>  Noah
>>
>>
>>
>> On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe <
>> stefan.itampe@gmail.com> wrote:
>>
>>> I would like to continue the discussion about native code.
>>>
>>> Some facts are,
>>> For example, consider this
>>> (define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+ s i) (+ i
>>> 1)))))
>>>
>>> The timings for (f 100000000)  ~ (f 100M) is
>>>
>>> 1) current vm                 : 2.93s
>>> 2) rtl                              : 1.67s
>>> 3) compressed native     : 1.15s
>>> 4) uncompressed native : 0.54s
>>>
>>> sbcl = compressed nativ + better register allocations (normal
>>> optimization level) : 0.68s
>>>
>>> To note is that for this example the call overhead is close to 5ns per
>>> iteration and meaning that
>>> if we combined 4 with better register handling the potential is to get
>>> this loop to run at 0.2s which means
>>> that the loop has the potential of running 500M iterations in one second
>>> without sacrifying safety and not
>>> have a extraterestial code analyzer. Also to note is that the native
>>> code for the compressed native is smaller then the
>>> rtl code by some factor and if we could make use of registers in a
>>> better way we would end up with even less overhead.
>>>
>>> To note is that compressed native is a very simple mechanism to gain
>>> some speed and also improve on memory
>>> usage in the instruction flow, Also the assembler is very simplistic and
>>> it would not be to much hassle to port a new
>>> instruction format to that environment. Also it's probably possible to
>>> handle the complexity of the code in pure C
>>> for the stubs and by compiling them in a special way make sure they
>>> output a format that can be combined
>>> with the meta information in special registers needed to make the
>>> execution of the compiled scheme effective.
>>>
>>> This study also shows that there is a clear benefit to be able to use
>>> the computers registers, and I think this is the way
>>> you would like the system to behave in the end. sbcl does this rather
>>> nicely and we could look at their way of doing it.
>>>
>>> So, the main question now to you is how to implement the register
>>> allocations? Basic principles of register allocation can be gotten out from
>>> the internet, I'm assure of, but the problem is how to handle the
>>> interaction with the helper stubs. That is
>>> something i'm not sure of yet.
>>>
>>> A simple solution would be to assume that the native code have a set of
>>> available registers r1,...,ri and then force the
>>> compilation of the stubs to treat the just like the registers bp, sp,
>>> and bx. I'm sure that this is possible to configure in gcc.
>>>
>>> So the task for me right now is to find out more how to do this, if you
>>> have any pointers or ideas, please help out.
>>>
>>> Cheers
>>> Stefan
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
>>> stefan.itampe@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> After talking with Mark Weaver about his view on native code, I have
>>>> been pondering how to best model our needs.
>>>>
>>>> I do have a framework now that translates almost all of the rtl vm
>>>> directly to native code and it do shows a speed increase of say 4x compared
>>>> to runing a rtl VM. I can also generate rtl code all the way from guile
>>>> scheme right now so It's pretty easy to generate test cases. The problem
>>>> that Mark point out to is that we need to take care to not blow the
>>>> instructuction cache. This is not seen in these simple examples but we need
>>>> larger code bases to test out what is actually true. What we can note
>>>> though is that I expect the size of the code to blow up with a factor of
>>>> around 10 compared to the instruction feed in the rtl code.
>>>>
>>>> One interesting fact is that SBCL does fairly well by basically using
>>>> the native instruction as the instruction flow to it's VM. For example if
>>>> it can deduce that a + operation works with fixnums it simply compiles that
>>>> as a function call to a general + routine e.g. it will do a long jump to
>>>> the + routine, do the plus, and longjump back essentially dispatching
>>>> general instructions like + * / etc, directly e.g. sbcl do have a virtual
>>>> machine, it just don't to table lookup to do the dispatch, but function
>>>> call's in stead. If you count longjumps this means that the number of jumps
>>>> for these instructions are double that of using the original table lookup
>>>> methods. But for calling functions and returning functions the number of
>>>> longjumps are the same and moving local variables in place , jumping  is
>>>> really fast.
>>>>
>>>> Anyway, this method of dispatching would mean a fairly small footprint
>>>> with respect to the direct assembler. Another big chunk of code that we can
>>>> speedup without to much bloat in the instruction cache is the lookup of
>>>> pairs, structs and arrays, the reason is that in many cases we can deduce
>>>> at compilation so much that we do not need to check the type all the time
>>>> but can safely lookup the needed infromation.
>>>>
>>>> Now is this method fast? well, looking a the sbcl code for calculating
>>>> 1+ 2 + 3 + 4 , (disassembling it) I see that it do uses the mechanism
>>>> above, it manages to sum 150M terms in one second, that's quite a feat for
>>>> a VM with no JIT. The same with the rtl VM is 65M.
>>>>
>>>> Now, sbcl's compiler is quite matured and uses native registers quite
>>>> well which explains one of the reasons why the speed. My point is though
>>>> that we can model efficiently a VM by call's and using the native
>>>> instructions and a instructions flow.
>>>>
>>>> Regards Stefan
>>>>
>>>
>>>
>>
>
>

[-- Attachment #2: Type: text/html, Size: 16564 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-15 10:19       ` Sjoerd van Leent Privé
  2012-11-15 16:04         ` Stefan Israelsson Tampe
@ 2012-11-15 16:13         ` Stefan Israelsson Tampe
  2012-11-15 17:50         ` Mark H Weaver
  2012-11-15 22:44         ` Andreas Rottmann
  3 siblings, 0 replies; 12+ messages in thread
From: Stefan Israelsson Tampe @ 2012-11-15 16:13 UTC (permalink / raw)
  To: Sjoerd van Leent Privé; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 10028 bytes --]

Arg, I prematurely send that mail, well here is the continuation

4. I prefere to have an evironment like
   (assemble target
      (inst jmp label:)
      (inst mov b a)
    label:
      (inst mov b c)

This makes the labels stand out and makes for a nice read of the assembler.

You can see how weinholt in industria solves the assembling issue, also MIT
Scheme as an assembler that you can learn from, I was not liking the syntax
of that one though, but You maybe can port that over to guile, this is GNU
and may
be the shortest path to success.

I like the environment in the sbcl assembler, I ported that over in the
aschm repo and you can see a lot of assembler
in the nativ/vm/insts.scm file in that repo on gitorious. We cannot use
that one due to the fact that it's not GNU although
it's open source and the code is somewhat unclean.

It would also be nice if we could concentrate on a restricted set of
instruction support from the beginning and make sure to support more
architectures instead.

What do you think?

/Stefan


On Thu, Nov 15, 2012 at 11:19 AM, Sjoerd van Leent Privé <
svanleent@gmail.com> wrote:

>  Hi Stefan,
>
> Just my idea about an assembler in Scheme. Sounds interesting. If it's
> done properly, it can be very promising to use scheme itself to directly
> emit machine instructions. This would also be interesting for meta
> compilation in the future (think of aiding GCC).
>
> So you are thinking about an assembler for x86? Perhaps I can help out on
> this one. I would like to do this part, as I haven't been able to aid on
> other parts besides voicing my ideas (anyways, I am on embedded development
> these days.)
>
> The only discussion is the syntax I believe, I mean, should it be AT&T
> like, Intel like, or leave this domain and do something new. I would go for
> instructions like this (using macros):
>
> (let ((target :x686))
>   (assemble target
>    ((mov long 100 EAX)
>     (mov long 200 EBX)
>     (add long EBX EAX))))
>
> Giving back the native machine code instructions. Perhaps special
> constructions can be made to return partially complete instructions (such
> as missing labels or calls to guile procedures...)
>
> Sjoerd
>
>
>
> On 11/12/2012 10:50 PM, Stefan Israelsson Tampe wrote:
>
> Thanks for your mail Noah,
>
> Yea libjit is quite interesting. But playing around with an assembler in
> scheme I do not want to go back to
> C or C++ land. The only problem is that we need a GNU scheme assembler and
> right now I use sbcl's assembler
> ported to scheme. We could perhaps use weinholts assembler as well in
> industria if he could sign papers to make it GNU. For the register
> allocation part I would really like to play a little in scheme to explore
> the idea you saw from my previous mail in this thread. Again I think it's
> natural to have this features in scheme and do not want to mess in C land
> too much.
>
> Am I wrong?
>
> Cheers
> Stefan
>
>
> On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine <noah.b.lavine@gmail.com>wrote:
>
>> Hello,
>>
>>  I assume "compressed native" is the idea you wrote about in your last
>> email, where we generate native code which is a sequence of function calls
>> to VM operations.
>>
>>  I really like that idea. As you said, it uses the instruction cache
>> better. But it also fixes something I was worried about, which is that it's
>> a lot of work to port an assembler to a new architecture, so we might end
>> up not supporting many native architectures. But it seems much easier to
>> make an assembler that only knows how to make call instructions and
>> branches. So we could support compressed native on lots of architectures,
>> and maybe uncompressed native only on some.
>>
>>  If you want a quick way to do compressed native with reasonable
>> register allocation, GNU libjit might work. I used it a couple years ago
>> for a JIT project that we never fully implemented. I chose it over GNU
>> Lightning specifically because it did register allocation. It implements a
>> full assembler, not just calls, which could also be nice later.
>>
>>  Noah
>>
>>
>>
>> On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe <
>> stefan.itampe@gmail.com> wrote:
>>
>>> I would like to continue the discussion about native code.
>>>
>>> Some facts are,
>>> For example, consider this
>>> (define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+ s i) (+ i
>>> 1)))))
>>>
>>> The timings for (f 100000000)  ~ (f 100M) is
>>>
>>> 1) current vm                 : 2.93s
>>> 2) rtl                              : 1.67s
>>> 3) compressed native     : 1.15s
>>> 4) uncompressed native : 0.54s
>>>
>>> sbcl = compressed nativ + better register allocations (normal
>>> optimization level) : 0.68s
>>>
>>> To note is that for this example the call overhead is close to 5ns per
>>> iteration and meaning that
>>> if we combined 4 with better register handling the potential is to get
>>> this loop to run at 0.2s which means
>>> that the loop has the potential of running 500M iterations in one second
>>> without sacrifying safety and not
>>> have a extraterestial code analyzer. Also to note is that the native
>>> code for the compressed native is smaller then the
>>> rtl code by some factor and if we could make use of registers in a
>>> better way we would end up with even less overhead.
>>>
>>> To note is that compressed native is a very simple mechanism to gain
>>> some speed and also improve on memory
>>> usage in the instruction flow, Also the assembler is very simplistic and
>>> it would not be to much hassle to port a new
>>> instruction format to that environment. Also it's probably possible to
>>> handle the complexity of the code in pure C
>>> for the stubs and by compiling them in a special way make sure they
>>> output a format that can be combined
>>> with the meta information in special registers needed to make the
>>> execution of the compiled scheme effective.
>>>
>>> This study also shows that there is a clear benefit to be able to use
>>> the computers registers, and I think this is the way
>>> you would like the system to behave in the end. sbcl does this rather
>>> nicely and we could look at their way of doing it.
>>>
>>> So, the main question now to you is how to implement the register
>>> allocations? Basic principles of register allocation can be gotten out from
>>> the internet, I'm assure of, but the problem is how to handle the
>>> interaction with the helper stubs. That is
>>> something i'm not sure of yet.
>>>
>>> A simple solution would be to assume that the native code have a set of
>>> available registers r1,...,ri and then force the
>>> compilation of the stubs to treat the just like the registers bp, sp,
>>> and bx. I'm sure that this is possible to configure in gcc.
>>>
>>> So the task for me right now is to find out more how to do this, if you
>>> have any pointers or ideas, please help out.
>>>
>>> Cheers
>>> Stefan
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
>>> stefan.itampe@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> After talking with Mark Weaver about his view on native code, I have
>>>> been pondering how to best model our needs.
>>>>
>>>> I do have a framework now that translates almost all of the rtl vm
>>>> directly to native code and it do shows a speed increase of say 4x compared
>>>> to runing a rtl VM. I can also generate rtl code all the way from guile
>>>> scheme right now so It's pretty easy to generate test cases. The problem
>>>> that Mark point out to is that we need to take care to not blow the
>>>> instructuction cache. This is not seen in these simple examples but we need
>>>> larger code bases to test out what is actually true. What we can note
>>>> though is that I expect the size of the code to blow up with a factor of
>>>> around 10 compared to the instruction feed in the rtl code.
>>>>
>>>> One interesting fact is that SBCL does fairly well by basically using
>>>> the native instruction as the instruction flow to it's VM. For example if
>>>> it can deduce that a + operation works with fixnums it simply compiles that
>>>> as a function call to a general + routine e.g. it will do a long jump to
>>>> the + routine, do the plus, and longjump back essentially dispatching
>>>> general instructions like + * / etc, directly e.g. sbcl do have a virtual
>>>> machine, it just don't to table lookup to do the dispatch, but function
>>>> call's in stead. If you count longjumps this means that the number of jumps
>>>> for these instructions are double that of using the original table lookup
>>>> methods. But for calling functions and returning functions the number of
>>>> longjumps are the same and moving local variables in place , jumping  is
>>>> really fast.
>>>>
>>>> Anyway, this method of dispatching would mean a fairly small footprint
>>>> with respect to the direct assembler. Another big chunk of code that we can
>>>> speedup without to much bloat in the instruction cache is the lookup of
>>>> pairs, structs and arrays, the reason is that in many cases we can deduce
>>>> at compilation so much that we do not need to check the type all the time
>>>> but can safely lookup the needed infromation.
>>>>
>>>> Now is this method fast? well, looking a the sbcl code for calculating
>>>> 1+ 2 + 3 + 4 , (disassembling it) I see that it do uses the mechanism
>>>> above, it manages to sum 150M terms in one second, that's quite a feat for
>>>> a VM with no JIT. The same with the rtl VM is 65M.
>>>>
>>>> Now, sbcl's compiler is quite matured and uses native registers quite
>>>> well which explains one of the reasons why the speed. My point is though
>>>> that we can model efficiently a VM by call's and using the native
>>>> instructions and a instructions flow.
>>>>
>>>> Regards Stefan
>>>>
>>>
>>>
>>
>
>

[-- Attachment #2: Type: text/html, Size: 17192 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-15 10:19       ` Sjoerd van Leent Privé
  2012-11-15 16:04         ` Stefan Israelsson Tampe
  2012-11-15 16:13         ` Stefan Israelsson Tampe
@ 2012-11-15 17:50         ` Mark H Weaver
  2012-11-15 18:03           ` Stefan Israelsson Tampe
  2012-11-15 22:44         ` Andreas Rottmann
  3 siblings, 1 reply; 12+ messages in thread
From: Mark H Weaver @ 2012-11-15 17:50 UTC (permalink / raw)
  To: Sjoerd van Leent Privé; +Cc: guile-devel

Before anyone spends any more time on this, I want to make it clear that
although I very much appreciate Stefan's pioneering spirit, and some of
his ideas are likely to be incorporated, Stefan's work on native
compilation is an independent project of his, and is unlikely to be
merged into the official Guile.  Andy and I have been planning a
different approach.

      Mark

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-15 17:50         ` Mark H Weaver
@ 2012-11-15 18:03           ` Stefan Israelsson Tampe
  2012-11-15 20:30             ` Ludovic Courtès
  0 siblings, 1 reply; 12+ messages in thread
From: Stefan Israelsson Tampe @ 2012-11-15 18:03 UTC (permalink / raw)
  To: Mark H Weaver; +Cc: guile-devel

[-- Attachment #1: Type: text/plain, Size: 981 bytes --]

Hi,

Yes this is pre-work and what i'm doing is an investigation trying out
things. bare that in mind :-)

For the assembler it can be really good to support one that comes with
guile so I do not find this
work as a research work but as a service work to propose components that
can be included in guile.

Mark, don't you agree on that my higher level research here is
experimental, but that you will need to have
some kind of assembler in the end to have a sane working environment to
output native code?

/Stefan

On Thu, Nov 15, 2012 at 6:50 PM, Mark H Weaver <mhw@netris.org> wrote:

> Before anyone spends any more time on this, I want to make it clear that
> although I very much appreciate Stefan's pioneering spirit, and some of
> his ideas are likely to be incorporated, Stefan's work on native
> compilation is an independent project of his, and is unlikely to be
> merged into the official Guile.  Andy and I have been planning a
> different approach.
>
>       Mark
>

[-- Attachment #2: Type: text/html, Size: 1388 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-15 18:03           ` Stefan Israelsson Tampe
@ 2012-11-15 20:30             ` Ludovic Courtès
  0 siblings, 0 replies; 12+ messages in thread
From: Ludovic Courtès @ 2012-11-15 20:30 UTC (permalink / raw)
  To: guile-devel

Hello,

Regarding the assembler, if I were to actually hack something ;-), I’d
choose Sassy [0], along with Industria’s disassemblers [1].

Ludo’.

[0] http://sassy.sourceforge.net/
[1] http://weinholt.se/industria/




^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-15 10:19       ` Sjoerd van Leent Privé
                           ` (2 preceding siblings ...)
  2012-11-15 17:50         ` Mark H Weaver
@ 2012-11-15 22:44         ` Andreas Rottmann
  3 siblings, 0 replies; 12+ messages in thread
From: Andreas Rottmann @ 2012-11-15 22:44 UTC (permalink / raw)
  To: Sjoerd van Leent Privé; +Cc: guile-devel

Sjoerd van Leent Privé <svanleent@gmail.com> writes:

> Hi Stefan,
>
> Just my idea about an assembler in Scheme. Sounds interesting. If it's
> done properly, it can be very promising to use scheme itself to
> directly emit machine instructions. This would also be interesting for
> meta compilation in the future (think of aiding GCC).
>
> So you are thinking about an assembler for x86? Perhaps I can help out
> on this one. I would like to do this part, as I haven't been able to
> aid on other parts besides voicing my ideas (anyways, I am on embedded
> development these days.)
>
> The only discussion is the syntax I believe, I mean, should it be AT&T
> like, Intel like, or leave this domain and do something new. I would
> go for instructions like this (using macros):
>
> (let ((target :x686))
>   (assemble target
>    ((mov long 100 EAX)
>     (mov long 200 EBX)
>     (add long EBX EAX))))
>
> Giving back the native machine code instructions. Perhaps special
> constructions can be made to return partially complete instructions
> (such as missing labels or calls to guile procedures...)
>
Regarding the assembler: I have a working AVR assembler [0] in my
"avrth" AVR Forth implementation [1]. That assembler also employs this
"partially complete instructions" idea you mentioned, having a symbol
resolution step. It also supports a simple evaluator for Assembly-time
expressions.  Maybe you find it interesting :-).

Here's an example snippet, from the Forth implementation's runtime:

(define-primitive-vocable vocabulary:primitive "1ms" (vm)
  (scheme
   (sleep-seconds 0.001))
  (assembly
   (ldi zl (lo8 (/ cpu-frequency 4000)))
   (ldi zh (hi8 (/ cpu-frequency 4000)))
   (sbiw zl 42) ;internal plus forth kernel overhead
   PFA_1MS1
   (sbiw zl 1)
   (brne PFA_1MS1)))

The assembler code is the `(assembly ...)' part, but above that you can
see the runable documentation, written in Scheme ;-).  You may also have
noticed the expressions used, i.e. `(lo8 ...)' and `(hi8 ...)' -- these
are evaluated at assembly-time by the `assembler-eval' as found in [0].

While I'm probably too time-starved to really help with Guile's
assembler, I'd be interested in having, and maybe working on (if time
permits), a Scheme-based assembler targeting ARM platforms, so I might
chime in this area at some point.

It should be fine to use my code (from a copyright view angle) as a
basis something to be incorporated into Guile (even though that file is
marked GPLv2, not GPLv2+), as long as you take care to eliminate the
actual AVR instruction generation code.  This part (i.e., the section
marked "Code emitters") is mostly originally transcribed from the
assembler [2] written in Forth included in AmForth [3], which is GPLv2
(only, unfortunatly).  However, I'm not even sure if the AmForth
copyright is still applicable to that part of my code, as the code has
been transformed substantially by transcription.  That might well be a
moot point, we probably don't want to target 8-bit microcontrollers as
Guile platforms anyway ;-).

[0] http://rotty.xx.vu/gitweb/?p=scheme/avrth.git;a=blob;f=assembler.sls;hb=HEAD
[1] http://rotty.xx.vu/gitweb/?p=scheme/avrth.git;a=summary
[2] http://amforth.svn.sourceforge.net/viewvc/amforth/trunk/lib/assembler.frt?revision=1301&view=markup
[3] http://amforth.sourceforge.net/

Regards, Rotty
-- 
Andreas Rottmann -- <http://rotty.xx.vu/>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: thoughts on native code
  2012-11-10 22:06 ` Stefan Israelsson Tampe
  2012-11-10 22:49   ` Noah Lavine
@ 2012-11-11 19:28   ` Stefan Israelsson Tampe
  1 sibling, 0 replies; 12+ messages in thread
From: Stefan Israelsson Tampe @ 2012-11-11 19:28 UTC (permalink / raw)
  To: guile-devel

[-- Attachment #1: Type: text/plain, Size: 8255 bytes --]

Hi, hope that i'm not boring you!

Ok, The next effort will be to investigate the issue of cpu register
allocations. Some features noted here is
that 1. some instruction takes vastly more resources than others and we
need to track this, I'm leaning on implementing
a version of the linear scan algorithm. This is how I think.

The first issue is that I love to relate everything to costs. So basically
if we manage to keep a local variable in a register, we would gain
something e.g. g. On the other hand during the intervall of setting and
retrieving the local variable we have performed a set of operations e.g w.
If we find that g / w is large we would decide that we would gain so much
during the interval that we should put it in a register. Or else we do not
bother putting it in a register. Now over to a branching for a branch we
split up the liveness interval in the number of segments between usages,
for such a segment over a branch we would have

[g1, w1] -> { [g2,w2] , [g3,w3] }

Now generally we would like to assign some kind of probability to the
different branches and do something like

F = p  * (g1 + g2) / (w1 + w2) + (1-p) * (g1 + g3) / (w1 + w3)

If we do not have any information we just put p = 1/2. The astute reader
would realize that this mean a potential
geometrical explosion in the number of terms in the sum let's try

F = g1 / w1 + p (g2 / w2) + (1 -p) (g2/w3)

Nah that's not good either, if g1 is zero we would not amortize over w1,
this leads to

g1' = (g1 + L ( p g2/w2 + (1-p) g3 / w3))
F  = g1'/w1

The linear updating will be
p = 1 => g1' = g1 + w1 L* g2/w2

and
F = g1/w1 + L/w1 * g2/w2

How to select L? consider take one w2 for which we think that this
expression should be true to the more theoretically compelling e.g.

F = g1/w + g2/(w1 + w2) = g1/w + L * g2/w2

And  we would then conclude to use L = w / (w1+w), and

F = g1/w1 + w / (w1 + w) * (g2 / w2)

Although i'm not certain this is the best, maybe you see a better solution,
but the nice thing is that we can build up the solution recursively if we
follows this scheme. and for code graph being a tree we could scan the tree
only ones and build
up these numbers. This method can be refined.

Anyhow we can take all usages and sum up a F for the start and move forward
at each segment for the whole liveness
of the local variable. Now do all this for all variables. Then employ a
version of the linear scan algorithm but only select a variable for
register allocation if the corresponding F is above a threshold C. Which
threshold? Well you could  try to vary F, e.g. Let Q(C) = fraction of
allocated registers per unit of time. then select C so that MIN(C s.t. Q(C)
> 0.9).

We could refine this method to try vary C over the graph somewhat.

What do you think?

Happy Mathing!!
/Stefan






On Sat, Nov 10, 2012 at 11:06 PM, Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> I would like to continue the discussion about native code.
>
> Some facts are,
> For example, consider this
> (define (f x) (let loop ((s 0) (i 0)) (if (eq? i x) s (loop (+ s i) (+ i
> 1)))))
>
> The timings for (f 100000000)  ~ (f 100M) is
>
> 1) current vm                 : 2.93s
> 2) rtl                              : 1.67s
> 3) compressed native     : 1.15s
> 4) uncompressed native : 0.54s
>
> sbcl = compressed nativ + better register allocations (normal optimization
> level) : 0.68s
>
> To note is that for this example the call overhead is close to 5ns per
> iteration and meaning that
> if we combined 4 with better register handling the potential is to get
> this loop to run at 0.2s which means
> that the loop has the potential of running 500M iterations in one second
> without sacrifying safety and not
> have a extraterestial code analyzer. Also to note is that the native code
> for the compressed native is smaller then the
> rtl code by some factor and if we could make use of registers in a better
> way we would end up with even less overhead.
>
> To note is that compressed native is a very simple mechanism to gain some
> speed and also improve on memory
> usage in the instruction flow, Also the assembler is very simplistic and
> it would not be to much hassle to port a new
> instruction format to that environment. Also it's probably possible to
> handle the complexity of the code in pure C
> for the stubs and by compiling them in a special way make sure they output
> a format that can be combined
> with the meta information in special registers needed to make the
> execution of the compiled scheme effective.
>
> This study also shows that there is a clear benefit to be able to use the
> computers registers, and I think this is the way
> you would like the system to behave in the end. sbcl does this rather
> nicely and we could look at their way of doing it.
>
> So, the main question now to you is how to implement the register
> allocations? Basic principles of register allocation can be gotten out from
> the internet, I'm assure of, but the problem is how to handle the
> interaction with the helper stubs. That is
> something i'm not sure of yet.
>
> A simple solution would be to assume that the native code have a set of
> available registers r1,...,ri and then force the
> compilation of the stubs to treat the just like the registers bp, sp, and
> bx. I'm sure that this is possible to configure in gcc.
>
> So the task for me right now is to find out more how to do this, if you
> have any pointers or ideas, please help out.
>
> Cheers
> Stefan
>
>
>
>
>
>
> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
> stefan.itampe@gmail.com> wrote:
>
>> Hi all,
>>
>> After talking with Mark Weaver about his view on native code, I have been
>> pondering how to best model our needs.
>>
>> I do have a framework now that translates almost all of the rtl vm
>> directly to native code and it do shows a speed increase of say 4x compared
>> to runing a rtl VM. I can also generate rtl code all the way from guile
>> scheme right now so It's pretty easy to generate test cases. The problem
>> that Mark point out to is that we need to take care to not blow the
>> instructuction cache. This is not seen in these simple examples but we need
>> larger code bases to test out what is actually true. What we can note
>> though is that I expect the size of the code to blow up with a factor of
>> around 10 compared to the instruction feed in the rtl code.
>>
>> One interesting fact is that SBCL does fairly well by basically using the
>> native instruction as the instruction flow to it's VM. For example if it
>> can deduce that a + operation works with fixnums it simply compiles that as
>> a function call to a general + routine e.g. it will do a long jump to the +
>> routine, do the plus, and longjump back essentially dispatching general
>> instructions like + * / etc, directly e.g. sbcl do have a virtual machine,
>> it just don't to table lookup to do the dispatch, but function call's in
>> stead. If you count longjumps this means that the number of jumps for these
>> instructions are double that of using the original table lookup methods.
>> But for calling functions and returning functions the number of longjumps
>> are the same and moving local variables in place , jumping  is really fast.
>>
>> Anyway, this method of dispatching would mean a fairly small footprint
>> with respect to the direct assembler. Another big chunk of code that we can
>> speedup without to much bloat in the instruction cache is the lookup of
>> pairs, structs and arrays, the reason is that in many cases we can deduce
>> at compilation so much that we do not need to check the type all the time
>> but can safely lookup the needed infromation.
>>
>> Now is this method fast? well, looking a the sbcl code for calculating 1+
>> 2 + 3 + 4 , (disassembling it) I see that it do uses the mechanism above,
>> it manages to sum 150M terms in one second, that's quite a feat for a VM
>> with no JIT. The same with the rtl VM is 65M.
>>
>> Now, sbcl's compiler is quite matured and uses native registers quite
>> well which explains one of the reasons why the speed. My point is though
>> that we can model efficiently a VM by call's and using the native
>> instructions and a instructions flow.
>>
>> Regards Stefan
>>
>
>

[-- Attachment #2: Type: text/html, Size: 9189 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-11-15 22:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-10 14:41 thoughts on native code Stefan Israelsson Tampe
2012-11-10 22:06 ` Stefan Israelsson Tampe
2012-11-10 22:49   ` Noah Lavine
2012-11-12 21:50     ` Stefan Israelsson Tampe
2012-11-15 10:19       ` Sjoerd van Leent Privé
2012-11-15 16:04         ` Stefan Israelsson Tampe
2012-11-15 16:13         ` Stefan Israelsson Tampe
2012-11-15 17:50         ` Mark H Weaver
2012-11-15 18:03           ` Stefan Israelsson Tampe
2012-11-15 20:30             ` Ludovic Courtès
2012-11-15 22:44         ` Andreas Rottmann
2012-11-11 19:28   ` Stefan Israelsson Tampe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).