From: Noah Lavine
Newsgroups: gmane.lisp.guile.devel
Subject: Re: thoughts on native code
Date: Sat, 10 Nov 2012 17:49:07 -0500
To: Stefan Israelsson Tampe
Cc: guile-devel
Hello,

I assume "compressed native" is the idea you wrote about in your last email, where we generate native code that is a sequence of function calls to VM operations.

I really like that idea. As you said, it uses the instruction cache better. But it also fixes something I was worried about: porting an assembler to a new architecture is a lot of work, so we might end up not supporting many native architectures. An assembler that only knows how to emit call instructions and branches seems much easier, so we could support compressed native on lots of architectures, and perhaps uncompressed native only on some.
If you want a quick way to do compressed native with reasonable register allocation, GNU libjit might work. I used it a couple of years ago for a JIT project that we never fully implemented. I chose it over GNU Lightning specifically because it does register allocation. It implements a full assembler, not just calls, which could also be nice later.

Noah

On Sat, Nov 10, 2012 at 5:06 PM, Stefan Israelsson Tampe <
stefan.itampe@gmail.com> wrote:

> I would like to continue the discussion about native code.
>
> Some facts first. Consider this example:
>
> (define (f x)
>   (let loop ((s 0) (i 0))
>     (if (eq? i x) s (loop (+ s i) (+ i 1)))))
>
> The timings for (f 100000000) ~ (f 100M) are:
>
> 1) current VM          : 2.93s
> 2) rtl                 : 1.67s
> 3) compressed native   : 1.15s
> 4) uncompressed native : 0.54s
>
> sbcl = compressed native + better register allocation (normal
> optimization level) : 0.68s
>
> Note that for this example the call overhead is close to 5 ns per
> iteration, meaning that if we combined 4) with better register handling
> we could potentially get this loop to run in 0.2s. That is, the loop
> could run 500M iterations per second without sacrificing safety and
> without an extraterrestrial code analyzer. Also note that the native
> code for the compressed native is smaller than the rtl code by some
> factor, and if we could make better use of registers we would end up
> with even less overhead.
>
> Compressed native is a very simple mechanism to gain some speed and
> also improve memory usage in the instruction flow. The assembler is
> very simplistic, and it would not be too much hassle to port a new
> instruction format to that environment.
> It is also probably possible to handle the complexity of the code in
> pure C for the stubs, and by compiling them in a special way make sure
> they output a format that can be combined with the meta information in
> the special registers needed to make execution of the compiled Scheme
> effective.
>
> This study also shows that there is a clear benefit in being able to
> use the machine's registers, and I think this is how you would like
> the system to behave in the end. sbcl does this rather nicely, and we
> could look at their way of doing it.
>
> So the main question to you now is how to implement the register
> allocation. The basic principles of register allocation can be found
> on the internet, I'm sure, but the problem is how to handle the
> interaction with the helper stubs. That is something I'm not sure of
> yet.
>
> A simple solution would be to assume that the native code has a set of
> available registers r1, ..., ri, and then force the compilation of the
> stubs to treat them just like the registers bp, sp, and bx. I'm sure
> that this is possible to configure in gcc.
>
> So my task right now is to find out more about how to do this. If you
> have any pointers or ideas, please help out.
>
> Cheers
> Stefan
>
> On Sat, Nov 10, 2012 at 3:41 PM, Stefan Israelsson Tampe <
> stefan.itampe@gmail.com> wrote:
>
>> Hi all,
>>
>> After talking with Mark Weaver about his view on native code, I have
>> been pondering how to best model our needs.
>>
>> I now have a framework that translates almost all of the rtl VM
>> directly to native code, and it does show a speed increase of about
>> 4x compared to running the rtl VM. I can also generate rtl code all
>> the way from Guile Scheme right now, so it's pretty easy to generate
>> test cases. The problem that Mark pointed out is that we need to take
>> care not to blow the instruction cache.
>> This is not seen in these simple examples, but we need larger code
>> bases to test what is actually true. What we can note, though, is
>> that I expect the size of the code to blow up by a factor of around
>> 10 compared to the instruction feed in the rtl code.
>>
>> One interesting fact is that SBCL does fairly well by basically using
>> native instructions as the instruction flow of its VM. For example,
>> if it can deduce that a + operation works on fixnums, it simply
>> compiles that as a function call to a general + routine, i.e. it will
>> do a long jump to the + routine, do the plus, and long-jump back,
>> essentially dispatching general instructions like + * / directly. In
>> other words, sbcl does have a virtual machine; it just doesn't do a
>> table lookup for the dispatch, but uses function calls instead. If
>> you count long jumps, this means the number of jumps for these
>> instructions is double that of the table lookup method. But for
>> calling and returning from functions the number of long jumps is the
>> same, and with local variables kept in place, jumping is really fast.
>>
>> Anyway, this method of dispatching would mean a fairly small
>> footprint with respect to direct assembly. Another big chunk of code
>> that we can speed up without too much bloat in the instruction cache
>> is the lookup of pairs, structs and arrays; the reason is that in
>> many cases we can deduce so much at compilation time that we do not
>> need to check the type every time, but can safely look up the needed
>> information.
>>
>> Now, is this method fast? Well, looking at the sbcl code for
>> calculating 1 + 2 + 3 + 4 (disassembling it), I see that it does use
>> the mechanism above, and it manages to sum 150M terms in one second.
>> That's quite a feat for a VM with no JIT. The same loop with the rtl
>> VM does 65M.
>>
>> Now, sbcl's compiler is quite mature and uses native registers quite
>> well, which explains part of the speed.
>> My point, though, is that we can efficiently model a VM with calls,
>> using native instructions as the instruction flow.
>>
>> Regards Stefan