Thanks for your mail, Noah. Yeah, libjit is quite interesting, but after
playing around with an assembler in Scheme I do not want to go back to C
or C++ land. The only problem is that we need a GNU Scheme assembler;
right now I use sbcl's assembler ported to Scheme. We could perhaps use
Weinholt's assembler from Industria as well, if he could sign papers to
make it GNU.

For the register allocation part I would really like to play a little in
Scheme to explore the idea you saw in my previous mail in this thread.
Again, I think it's natural to have these features in Scheme, and I do
not want to mess around in C land too much. Am I wrong?

Cheers
Stefan

On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine wrote:
> Hello,
>
> I assume "compressed native" is the idea you wrote about in your last
> email, where we generate native code which is a sequence of function
> calls to VM operations.
>
> I really like that idea. As you said, it uses the instruction cache
> better. But it also fixes something I was worried about, which is that
> it's a lot of work to port an assembler to a new architecture, so we
> might end up not supporting many native architectures. But it seems
> much easier to make an assembler that only knows how to emit call
> instructions and branches. So we could support compressed native on
> lots of architectures, and maybe uncompressed native only on some.
>
> If you want a quick way to do compressed native with reasonable
> register allocation, GNU libjit might work. I used it a couple of
> years ago for a JIT project that we never fully implemented. I chose
> it over GNU Lightning specifically because it does register
> allocation. It implements a full assembler, not just calls, which
> could also be nice later.
>
> Noah
>
>
> On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe <
> stefan.itampe@gmail.com> wrote:
>
>> I would like to continue the discussion about native code.
>>
>> Some facts first. For example, consider this:
>>
>> (define (f x)
>>   (let loop ((s 0) (i 0))
>>     (if (eq? i x) s (loop (+ s i) (+ i 1)))))
>>
>> The timings for (f 100000000) ~ (f 100M) are:
>>
>> 1) current vm          : 2.93s
>> 2) rtl                 : 1.67s
>> 3) compressed native   : 1.15s
>> 4) uncompressed native : 0.54s
>>
>> sbcl (= compressed native + better register allocation, normal
>> optimization level) : 0.68s
>>
>> Note that for this example the call overhead is close to 5 ns per
>> iteration, meaning that if we combined 4) with better register
>> handling, the potential is to get this loop to run in 0.2 s. That
>> means the loop has the potential of running 500M iterations in one
>> second, without sacrificing safety and without an extraterrestrial
>> code analyzer. Also note that the native code for the compressed
>> native is smaller than the rtl code by some factor, and if we could
>> make better use of registers we would end up with even less overhead.
>>
>> Note also that compressed native is a very simple mechanism to gain
>> some speed and to improve memory usage in the instruction flow. The
>> assembler is very simplistic as well, and it would not be too much
>> hassle to port a new instruction format to that environment. It is
>> also probably possible to handle the complexity of the code in pure C
>> for the stubs and, by compiling them in a special way, make sure they
>> output a format that can be combined with the meta information in
>> special registers needed to make the execution of the compiled Scheme
>> effective.
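>>
>> A minimal sketch of what that could look like, with every name and
>> constant invented for the illustration (this is not the actual
>> framework): the "compiled" procedure is nothing but a run of call
>> instructions into ordinary C stubs. A real assembler would emit the
>> 5-byte direct form (call rel32) whenever a stub lies within +/- 2 GB,
>> which is where the size win over uncompressed native comes from; the
>> 12-byte indirect form below is used only because it works at any
>> address:
>>
>> #include <stdint.h>
>> #include <string.h>
>> #include <stdio.h>
>> #include <sys/mman.h>
>>
>> static long acc;                    /* stand-in for one VM register */
>>
>> static void op_load1 (void) { acc = 1; }
>> static void op_add2  (void) { acc += 2; }
>> static void op_print (void) { printf ("acc = %ld\n", acc); }
>>
>> /* Emit "mov rax, imm64; call rax" targeting the stub fn. */
>> static uint8_t *
>> emit_call (uint8_t *p, void (*fn) (void))
>> {
>>   uint64_t addr = (uint64_t) fn;
>>   p[0] = 0x48; p[1] = 0xb8;         /* mov rax, imm64 */
>>   memcpy (p + 2, &addr, 8);
>>   p[10] = 0xff; p[11] = 0xd0;       /* call rax */
>>   return p + 12;
>> }
>>
>> int
>> main (void)
>> {
>>   /* W+X mapping for brevity; real code would mprotect in stages. */
>>   uint8_t *buf = mmap (NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
>>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>   uint8_t *p = buf;
>>   /* sub rsp, 8: keep the stack 16-byte aligned at each stub entry */
>>   *p++ = 0x48; *p++ = 0x83; *p++ = 0xec; *p++ = 0x08;
>>   p = emit_call (p, op_load1);      /* the whole instruction stream */
>>   p = emit_call (p, op_add2);       /* is just a run of calls       */
>>   p = emit_call (p, op_print);
>>   *p++ = 0x48; *p++ = 0x83; *p++ = 0xc4; *p++ = 0x08; /* add rsp, 8 */
>>   *p = 0xc3;                        /* ret */
>>   ((void (*) (void)) buf) ();       /* prints: acc = 3 */
>>   return 0;
>> }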
>>
>> This study also shows that there is a clear benefit to being able to
>> use the computer's registers, and I think this is the way you would
>> like the system to behave in the end. sbcl does this rather nicely,
>> and we could look at their way of doing it.
>>
>> So, the main question to you now is how to implement the register
>> allocation. The basic principles of register allocation can be found
>> on the internet, I'm sure, but the problem is how to handle the
>> interaction with the helper stubs. That is something I'm not sure of
>> yet.
>>
>> A simple solution would be to assume that the native code has a set
>> of available registers r1, ..., ri, and then force the compilation of
>> the stubs to treat them just like the registers bp, sp, and bx. I'm
>> sure that this is possible to configure in gcc.
>>
>> So the task for me right now is to find out more about how to do
>> this. If you have any pointers or ideas, please help out.
>>
>> Cheers
>> Stefan
>>
>>
>> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
>> stefan.itampe@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> After talking with Mark Weaver about his view on native code, I have
>>> been pondering how best to model our needs.
>>>
>>> I now have a framework that translates almost all of the rtl vm
>>> directly to native code, and it does show a speed increase of about
>>> 4x compared to running the rtl VM. I can also generate rtl code all
>>> the way from Guile Scheme right now, so it's pretty easy to generate
>>> test cases. The problem that Mark points out is that we need to take
>>> care not to blow the instruction cache. This is not seen in these
>>> simple examples, but we need larger code bases to test what is
>>> actually true. What we can note, though, is that I expect the size
>>> of the code to blow up by a factor of around 10 compared to the
>>> instruction feed in the rtl code.
>>>
>>> One interesting fact is that SBCL does fairly well by basically
>>> using the native instructions as the instruction flow of its VM. For
>>> example, if it can deduce that a + operation works on fixnums, it
>>> simply compiles that as a function call to a general + routine, i.e.
>>> it will do a long jump to the + routine, do the plus, and long jump
>>> back, essentially dispatching general instructions like + * / etc.
>>> directly. That is, sbcl does have a virtual machine; it just doesn't
>>> do a table lookup for the dispatch, but uses function calls instead.
>>> If you count long jumps, this means that the number of jumps for
>>> these instructions is double that of the original table lookup
>>> method. But for calling functions and returning from functions the
>>> number of long jumps is the same, and moving local variables in
>>> place and jumping is really fast.
>>>
>>> Anyway, this method of dispatching would mean a fairly small
>>> footprint compared with direct assembler. Another big chunk of code
>>> that we can speed up without too much bloat in the instruction cache
>>> is the lookup of pairs, structs and arrays; the reason is that in
>>> many cases we can deduce so much at compile time that we do not need
>>> to check the type every time, but can safely look up the needed
>>> information.
>>>
>>> Now, is this method fast? Well, looking at the sbcl code for
>>> calculating 1 + 2 + 3 + 4 (disassembling it), I see that it does use
>>> the mechanism above. It manages to sum 150M terms in one second,
>>> which is quite a feat for a VM with no JIT. The same loop with the
>>> rtl VM does 65M.
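>>>
>>> For contrast, here is roughly what each step costs under table-lookup
>>> dispatch. This is a toy sketch with invented opcodes, not the rtl VM;
>>> the point is the per-instruction fetch/decode/indirect-jump work in
>>> the dispatch loop, which is exactly what dispatching through native
>>> call instructions removes:
>>>
>>> #include <stdio.h>
>>>
>>> enum { OP_ADD_S_I, OP_INC_I, OP_BR_IF_NE, OP_HALT };
>>>
>>> int
>>> main (void)
>>> {
>>>   long s = 0, i = 0, x = 100000000;   /* the (f 100M) example */
>>>   static const unsigned char code[] =
>>>     { OP_ADD_S_I, OP_INC_I, OP_BR_IF_NE, OP_HALT };
>>>   int ip = 0;
>>>
>>>   for (;;)
>>>     switch (code[ip++])               /* fetch + decode, every step */
>>>       {
>>>       case OP_ADD_S_I:  s += i; break;   /* s = s + i */
>>>       case OP_INC_I:    i += 1; break;   /* i = i + 1 */
>>>       case OP_BR_IF_NE: if (i != x) ip = 0; break;
>>>       case OP_HALT:     printf ("%ld\n", s); return 0;
>>>       }
>>> }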
>>>
>>> Now, sbcl's compiler is quite mature and uses native registers quite
>>> well, which explains part of the speed difference. My point, though,
>>> is that we can efficiently model a VM with calls, using the native
>>> instructions as the instruction flow.
>>>
>>> Regards
>>> Stefan
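
P.S. On the stub/register interaction question above: one concrete
(untested) possibility is gcc's global register variables together with
its -ffixed-REG option, both of which keep gcc's own allocator away from
chosen registers. A sketch, with invented names and an arbitrary choice
of r13/r14 on x86-64:

/* stubs.c -- compile the VM stubs so they never touch the registers
   reserved for natively compiled Scheme.  Build with:
       gcc -O2 -c stubs.c
   (registers can also be reserved file-wide, without the variable
   declarations, via -ffixed-r13 -ffixed-r14).  */

/* Global register variables: gcc will neither spill these nor use
   r13/r14 for its own temporaries anywhere in this file.  */
register long vm_s asm ("r13");   /* e.g. the accumulator s  */
register long vm_i asm ("r14");   /* e.g. the loop counter i */

/* Stubs still operate on the reserved registers directly. */
void
op_add_s_i (void)
{
  vm_s += vm_i;                   /* s = s + i, no memory traffic */
}

void
op_inc_i (void)
{
  vm_i += 1;                      /* i = i + 1 */
}

Since r13 and r14 are callee-saved in the SysV ABI, even calls from the
stubs into code compiled without these declarations would preserve them,
so a natively compiled loop could keep s and i live in registers across
stub calls.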