Thanks for your mail, Noah. Yeah, libjit is quite interesting, but after
playing around with an assembler in Scheme I do not want to go back to C
or C++ land. The only problem is that we need a GNU Scheme assembler;
right now I use sbcl's assembler ported to Scheme. We could perhaps use
Weinholt's assembler from Industria as well, if he could sign papers to
make it GNU.

For the register allocation part I would really like to play a little in
Scheme to explore the idea you saw in my previous mail in this thread.
Again, I think it's natural to have these features in Scheme, and I do
not want to mess around in C land too much. Am I wrong?

Cheers
Stefan

On Sat, Nov 10, 2012 at 11:49 PM, Noah Lavine wrote:
> Hello,
>
> I assume "compressed native" is the idea you wrote about in your last
> email, where we generate native code which is a sequence of function
> calls to VM operations.
>
> I really like that idea. As you said, it uses the instruction cache
> better. But it also fixes something I was worried about, which is that
> it's a lot of work to port an assembler to a new architecture, so we
> might end up not supporting many native architectures. But it seems
> much easier to make an assembler that only knows how to emit call
> instructions and branches. So we could support compressed native on
> lots of architectures, and maybe uncompressed native only on some.
>
> If you want a quick way to do compressed native with reasonable
> register allocation, GNU libjit might work. I used it a couple of
> years ago for a JIT project that we never fully implemented. I chose
> it over GNU Lightning specifically because it does register
> allocation. It implements a full assembler, not just calls, which
> could also be nice later.
>
> Noah
>
>
> On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe <
> stefan.itampe@gmail.com> wrote:
>
>> I would like to continue the discussion about native code.
>>
>> Some facts first. For example, consider this:
>>
>> (define (f x)
>>   (let loop ((s 0) (i 0))
>>     (if (eq? i x) s (loop (+ s i) (+ i 1)))))
>>
>> The timings for (f 100000000) ~ (f 100M) are:
>>
>> 1) current vm          : 2.93s
>> 2) rtl                 : 1.67s
>> 3) compressed native   : 1.15s
>> 4) uncompressed native : 0.54s
>>
>> sbcl (= compressed native + better register allocation, normal
>> optimization level) : 0.68s
>>
>> Note that for this example the call overhead is close to 5 ns per
>> iteration, meaning that if we combined 4) with better register
>> handling, the potential is to get this loop to run in 0.2 s. That
>> means the loop has the potential of running 500M iterations in one
>> second, without sacrificing safety and without an extraterrestrial
>> code analyzer. Also note that the native code for the compressed
>> native is smaller than the rtl code by some factor, and if we could
>> make better use of registers we would end up with even less overhead.
>>
>> Note also that compressed native is a very simple mechanism to gain
>> some speed and to improve memory usage in the instruction flow. The
>> assembler is very simplistic as well, and it would not be too much
>> hassle to port a new instruction format to that environment. It is
>> also probably possible to handle the complexity of the code in pure C
>> for the stubs and, by compiling them in a special way, make sure they
>> output a format that can be combined with the meta information in
>> special registers needed to make the execution of the compiled Scheme
>> effective.
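>>
>> A minimal sketch of what that could look like, with every name and
>> constant invented for the illustration (this is not the actual
>> framework): the "compiled" procedure is nothing but a run of call
>> instructions into ordinary C stubs. A real assembler would emit the
>> 5-byte direct form (call rel32) whenever a stub lies within +/- 2 GB,
>> which is where the size win over uncompressed native comes from; the
>> 12-byte indirect form below is used only because it works at any
>> address:
>>
>> #include <stdint.h>
>> #include <string.h>
>> #include <stdio.h>
>> #include <sys/mman.h>
>>
>> static long acc;                    /* stand-in for one VM register */
>>
>> static void op_load1 (void) { acc = 1; }
>> static void op_add2  (void) { acc += 2; }
>> static void op_print (void) { printf ("acc = %ld\n", acc); }
>>
>> /* Emit "mov rax, imm64; call rax" targeting the stub fn. */
>> static uint8_t *
>> emit_call (uint8_t *p, void (*fn) (void))
>> {
>>   uint64_t addr = (uint64_t) fn;
>>   p[0] = 0x48; p[1] = 0xb8;         /* mov rax, imm64 */
>>   memcpy (p + 2, &addr, 8);
>>   p[10] = 0xff; p[11] = 0xd0;       /* call rax */
>>   return p + 12;
>> }
>>
>> int
>> main (void)
>> {
>>   /* W+X mapping for brevity; real code would mprotect in stages. */
>>   uint8_t *buf = mmap (NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
>>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>   uint8_t *p = buf;
>>   /* sub rsp, 8: keep the stack 16-byte aligned at each stub entry */
>>   *p++ = 0x48; *p++ = 0x83; *p++ = 0xec; *p++ = 0x08;
>>   p = emit_call (p, op_load1);      /* the whole instruction stream */
>>   p = emit_call (p, op_add2);       /* is just a run of calls       */
>>   p = emit_call (p, op_print);
>>   *p++ = 0x48; *p++ = 0x83; *p++ = 0xc4; *p++ = 0x08; /* add rsp, 8 */
>>   *p = 0xc3;                        /* ret */
>>   ((void (*) (void)) buf) ();       /* prints: acc = 3 */
>>   return 0;
>> }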
>>
>> This study also shows that there is a clear benefit to being able to
>> use the computer's registers, and I think this is the way you would
>> like the system to behave in the end. sbcl does this rather nicely,
>> and we could look at their way of doing it.
>>
>> So, the main question to you now is how to implement the register
>> allocation. The basic principles of register allocation can be found
>> on the internet, I'm sure, but the problem is how to handle the
>> interaction with the helper stubs. That is something I'm not sure of
>> yet.
>>
>> A simple solution would be to assume that the native code has a set
>> of available registers r1, ..., ri, and then force the compilation of
>> the stubs to treat them just like the registers bp, sp, and bx. I'm
>> sure that this is possible to configure in gcc.
>>
>> So the task for me right now is to find out more about how to do
>> this. If you have any pointers or ideas, please help out.
>>
>> Cheers
>> Stefan
>>
>>
>> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
>> stefan.itampe@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> After talking with Mark Weaver about his view on native code, I have
>>> been pondering how best to model our needs.
>>>
>>> I now have a framework that translates almost all of the rtl vm
>>> directly to native code, and it does show a speed increase of about
>>> 4x compared to running the rtl VM. I can also generate rtl code all
>>> the way from Guile Scheme right now, so it's pretty easy to generate
>>> test cases. The problem that Mark points out is that we need to take
>>> care not to blow the instruction cache. This is not seen in these
>>> simple examples, but we need larger code bases to test what is
>>> actually true. What we can note, though, is that I expect the size
>>> of the code to blow up by a factor of around 10 compared to the
>>> instruction feed in the rtl code.
>>>
>>> One interesting fact is that SBCL does fairly well by basically
>>> using the native instructions as the instruction flow of its VM. For
>>> example, if it can deduce that a + operation works on fixnums, it
>>> simply compiles that as a function call to a general + routine, i.e.
>>> it will do a long jump to the + routine, do the plus, and long jump
>>> back, essentially dispatching general instructions like + * / etc.
>>> directly. That is, sbcl does have a virtual machine; it just doesn't
>>> do a table lookup for the dispatch, but uses function calls instead.
>>> If you count long jumps, this means that the number of jumps for
>>> these instructions is double that of the original table lookup
>>> method. But for calling functions and returning from functions the
>>> number of long jumps is the same, and moving local variables in
>>> place and jumping is really fast.
>>>
>>> Anyway, this method of dispatching would mean a fairly small
>>> footprint compared with direct assembler. Another big chunk of code
>>> that we can speed up without too much bloat in the instruction cache
>>> is the lookup of pairs, structs and arrays; the reason is that in
>>> many cases we can deduce so much at compile time that we do not need
>>> to check the type every time, but can safely look up the needed
>>> information.
>>>
>>> Now, is this method fast? Well, looking at the sbcl code for
>>> calculating 1 + 2 + 3 + 4 (disassembling it), I see that it does use
>>> the mechanism above. It manages to sum 150M terms in one second,
>>> which is quite a feat for a VM with no JIT. The same loop with the
>>> rtl VM does 65M.
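>>>
>>> For contrast, here is roughly what each step costs under table-lookup
>>> dispatch. This is a toy sketch with invented opcodes, not the rtl VM;
>>> the point is the per-instruction fetch/decode/indirect-jump work in
>>> the dispatch loop, which is exactly what dispatching through native
>>> call instructions removes:
>>>
>>> #include <stdio.h>
>>>
>>> enum { OP_ADD_S_I, OP_INC_I, OP_BR_IF_NE, OP_HALT };
>>>
>>> int
>>> main (void)
>>> {
>>>   long s = 0, i = 0, x = 100000000;   /* the (f 100M) example */
>>>   static const unsigned char code[] =
>>>     { OP_ADD_S_I, OP_INC_I, OP_BR_IF_NE, OP_HALT };
>>>   int ip = 0;
>>>
>>>   for (;;)
>>>     switch (code[ip++])               /* fetch + decode, every step */
>>>       {
>>>       case OP_ADD_S_I:  s += i; break;   /* s = s + i */
>>>       case OP_INC_I:    i += 1; break;   /* i = i + 1 */
>>>       case OP_BR_IF_NE: if (i != x) ip = 0; break;
>>>       case OP_HALT:     printf ("%ld\n", s); return 0;
>>>       }
>>> }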
>>>
>>> Now, sbcl's compiler is quite mature and uses native registers quite
>>> well, which explains part of the speed difference. My point, though,
>>> is that we can efficiently model a VM with calls, using the native
>>> instructions as the instruction flow.
>>>
>>> Regards
>>> Stefan
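
P.S. On the stub/register interaction question above: one concrete
(untested) possibility is gcc's global register variables together with
its -ffixed-REG option, both of which keep gcc's own allocator away from
chosen registers. A sketch, with invented names and an arbitrary choice
of r13/r14 on x86-64:

/* stubs.c -- compile the VM stubs so they never touch the registers
   reserved for natively compiled Scheme.  Build with:
       gcc -O2 -c stubs.c
   (registers can also be reserved file-wide, without the variable
   declarations, via -ffixed-r13 -ffixed-r14).  */

/* Global register variables: gcc will neither spill these nor use
   r13/r14 for its own temporaries anywhere in this file.  */
register long vm_s asm ("r13");   /* e.g. the accumulator s  */
register long vm_i asm ("r14");   /* e.g. the loop counter i */

/* Stubs still operate on the reserved registers directly. */
void
op_add_s_i (void)
{
  vm_s += vm_i;                   /* s = s + i, no memory traffic */
}

void
op_inc_i (void)
{
  vm_i += 1;                      /* i = i + 1 */
}

Since r13 and r14 are callee-saved in the SysV ABI, even calls from the
stubs into code compiled without these declarations would preserve them,
so a natively compiled loop could keep s and i live in registers across
stub calls.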