From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.org!.POSTED!not-for-mail From: Andy Wingo Newsgroups: gmane.lisp.guile.devel Subject: Re: guile 3 update, june 2018 edition Date: Thu, 05 Jul 2018 19:05:02 +0200 Message-ID: <87o9flipfl.fsf@pobox.com> References: <871scq2env.fsf@pobox.com> <87y3euj8aw.fsf@gnu.org> NNTP-Posting-Host: blaine.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Trace: blaine.gmane.org 1530810204 29856 195.159.176.226 (5 Jul 2018 17:03:24 GMT) X-Complaints-To: usenet@blaine.gmane.org NNTP-Posting-Date: Thu, 5 Jul 2018 17:03:24 +0000 (UTC) User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) Cc: guile-devel@gnu.org To: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=) Original-X-From: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Thu Jul 05 19:03:20 2018 Return-path: Envelope-to: guile-devel@m.gmane.org Original-Received: from lists.gnu.org ([208.118.235.17]) by blaine.gmane.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fb7f1-0007Z9-Hs for guile-devel@m.gmane.org; Thu, 05 Jul 2018 19:03:15 +0200 Original-Received: from localhost ([::1]:53913 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fb7h8-0001mS-Qm for guile-devel@m.gmane.org; Thu, 05 Jul 2018 13:05:26 -0400 Original-Received: from eggs.gnu.org ([2001:4830:134:3::10]:44673) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fb7h1-0001kb-8C for guile-devel@gnu.org; Thu, 05 Jul 2018 13:05:21 -0400 Original-Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fb7gv-00040I-VC for guile-devel@gnu.org; Thu, 05 Jul 2018 13:05:19 -0400 Original-Received: from pb-sasl2.pobox.com ([64.147.108.67]:51463 helo=sasl.smtp.pobox.com) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1fb7gv-0003zN-ET; Thu, 05 Jul 2018 13:05:13 -0400 Original-Received: from sasl.smtp.pobox.com (unknown [127.0.0.1]) by pb-sasl2.pobox.com (Postfix) with ESMTP id AC010E12F7; Thu, 5 Jul 2018 13:05:11 -0400 (EDT) DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=pobox.com; h=from:to:cc :subject:references:date:in-reply-to:message-id:mime-version :content-type:content-transfer-encoding; s=sasl; bh=X+QW5jqxXlqe IwGLbTGaC1Dcz4s=; b=gQy+jMXJWyyoNq6aCblaOQmyOyO3Y77wUtoNtOMzClSF +3kDeIrURvYiAGqMJYloW+EjdW8eLrnZPBxeM2TSGThgNily00Skrco7Nx+7COiu JQGdZgxZkGZesAM0rKb2YLmQgDNYvpw3KHBRcxbmpnZuw4FtBWHl8Dx7rjSGBh0= DomainKey-Signature: a=rsa-sha1; c=nofws; d=pobox.com; h=from:to:cc :subject:references:date:in-reply-to:message-id:mime-version :content-type:content-transfer-encoding; q=dns; s=sasl; b=oe/XEJ fYzULpCln308tKfl01S5WsDX8PpBU1LBt4y80yFq3LIFZ5+tI9xY68FsJ2YJE+oq mmTVZ4dzILa79gFwuuvzTLXAsCszpvsPN6L6AFFF9ruRWbzR9Jx3w18X33r/WB7y SMnZSe/6lNsJ980StAWwcDUh8gh/bteJMoH2s= Original-Received: from pb-sasl2.nyi.icgroup.com (unknown [127.0.0.1]) by pb-sasl2.pobox.com (Postfix) with ESMTP id A1D5FE12F6; Thu, 5 Jul 2018 13:05:11 -0400 (EDT) Original-Received: from sparrow (unknown [88.160.190.192]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by pb-sasl2.pobox.com (Postfix) with ESMTPSA id 92013E12F5; Thu, 5 Jul 2018 13:05:10 -0400 (EDT) In-Reply-To: <87y3euj8aw.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Mon, 02 Jul 2018 11:28:23 +0200") X-Pobox-Relay-ID: 938343B6-8075-11E8-B271-4B4D894C8D7C-02397024!pb-sasl2.pobox.com X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-Received-From: 64.147.108.67 X-BeenThere: guile-devel@gnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "Developers list for Guile, the GNU extensibility library" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: guile-devel-bounces+guile-devel=m.gmane.org@gnu.org Original-Sender: "guile-devel" Xref: news.gmane.org gmane.lisp.guile.devel:19584 Archived-At: Hi :) On Mon 02 Jul 2018 11:28, ludo@gnu.org (Ludovic Court=C3=A8s) writes: > Andy Wingo skribis: > >> My current plan is that the frame overhead will still be two slots: the >> saved previous FP, and the saved return address. Right now the return >> address is always a bytecode address. In the future it will be bytecode >> or native code. Guile will keep a runtime routine marking regions of >> native code so it can know if it needs to if an RA is bytecode or native >> code, for debugging reasons; but in most operation, Guile won't need to >> know. The interpreter will tier up to JIT code through an adapter frame >> that will do impedance matching over virtual<->physical addresses. To >> tier down to the interpreter (e.g. when JIT code calls interpreted >> code), the JIT will simply return to the interpreter, which will pick up >> state from the virtual IP, SP, and FP saved in the VM state. > > What will the =E2=80=9Cadapter frame=E2=80=9D look like? Aah, sadly it won't work like this. Somehow I was thinking of an adapter frame on the C stack. However an adapter frame corresponds to a continuation, so it would have to have the life of a continuation, so it would have to be on the VM stack. I don't think I want adapter frames on the VM stack, so I have to scrap this. More below... >> We do walk the stack from Scheme sometimes, notably when making a >> backtrace. So, we'll make the runtime translate the JIT return >> addresses to virtual return addresses in the frame API. To Scheme, it >> will be as if all things were interpreted. > > Currently you can inspect the locals of a stack frame. Will that be > possible with frames corresponding to native code? (I suppose that=E2=80= =99d be > difficult.) Yes, because native code manipulates the VM stack in exactly the same way as bytecode. Eventually we should do register allocation and avoid always writing values to the stack, but that is down the road. >> My current problem is knowing when a callee has JIT code. Say you're in >> JITted function F which calls G. Can you directly jump to G's native >> code, or is G not compiled yet and you need to use the interpreter? I >> haven't solved this yet. "Known calls" that use call-label and similar >> can of course eagerly ensure their callees are JIT-compiled, at >> compilation time. Unknown calls are the problem. I don't know whether >> to consider reserving another word in scm_tc7_program objects for JIT >> code. I have avoided JIT overhead elsewhere and would like to do so >> here as well! > > In the absence of a native code pointer in scm_tc7_program objects, how > will libguile find the native code for a given program? This is a good question and it was not clear to me when I wrote this! I think I have a solution now but it involves memory overhead. Oh well. Firstly, I propose to add a slot to stack frames. Stack frames will now store the saved FP, the virtual return address (vRA), and the machine return address IP (mRA). When in JIT code, a return will check if the mRA is nonzero, and if so jump to that mRA. Otherwise it will return from JIT, and the interpreter should continue. Likewise when doing a function return from the interpreter and the mRA is nonzero, the interpreter should return by entering JIT code to that address. When building an interpreter-only Guile (Guile without JIT) or an AOT-only Guile (doesn't exist currently), we could configure Guile to not reserve this extra stack word. However that would be a different ABI: a .go file built with interpreter-only Guile wouldn't work on Guile-with-JIT, because interpreter-only Guile would think stack frames only need two reserved words, whereas Guile-with-JIT would write three words. To avoid the complication, for 3.0 I think we should just use 3-word frames all the time. So, that's returns. Other kinds of non-local returns like abort-to-prompt, resuming delimited continuations, or calling undelimited continuations would work similarly: the continuation would additionally record an mRA, and resuming would jump there instead, if appropriate. Now, calls. One of the reasons that I wanted to avoid an extra program word was because scm_tc7_program doesn't exist in a one-to-one relationship with code. "Well-known" procedures get compiled by closure optimization to be always called via call-label or tail-call-label -- so some code doesn't have program objects. On the other hand, closures mean that some code has many program objects. So I thought about using side tables indexed by code; or inline "maybe-tier-up-here" instructions, which would reference a code pointer location, that if nonzero, would be the JIT code. However I see now that really we need to optimize for the JIT-to-JIT call case, as by definition that's going to be the hot case. Of course call-label from JIT can do an unconditional jmp. But calling a program object... how do we do this? This is complicated by code pages being read-only, so we don't have space to store a pointer in with the code. I think the answer is, like you say, another word in program objects. JIT code will then do, when it sees a call: if (SCM_PROGRAM_P (x) && SCM_PROGRAM_MCODE (x)) goto *SCM_PROGRAM_MCODE (x); vp->ip =3D SCM_PROGRAM_CODE (x); return; // Handle call in interpreter. which is reasonably direct and terse. I can't think of any other option that would be reasonably fast. Now, we're left with a couple of issues: what to do about call-label in interpreted code? Also: when should we JIT? Let's take the second question first. I think we should JIT based on counters, i.e. each function has an associated counter, and when that counter overflows (or underflows, depending on how we configure it), then we JIT that function. This has the advantage of being deterministic. Another option would be statprof-like profiling, which is cool because it measures real run time, but beside not being terrible portable, it isn't deterministic, so it makes bugs hard to reproduce. Another option would be a side hash table keyed by code address. This is what LuaJIT does but you get collisions and since addresses are randomized, that also makes JIT unpredictable. Usually your counter limit is set depending on how expensive it is to JIT. If JIT were free the counter limit would be 0. Guile's JIT will be cheap, though not free. Additionally some JITs set higher counter values because they do some form of adaptive optimization and they want to see more values. E.g. JavaScriptCore sets its default threshold for its first tier at 500, incrementing by 15 for each call, 10 for each return, and 1 for each loop iteration. But LuaJIT sets the threshold at 56, incrementing by 1 each call and 2 each loop. Weird, right? Anyway that's how it is. We'll have to measure. Note that there are two ways you might want to tier up to JIT from the interpreter -- at function entry, and in loops. Given that JIT code and interpreter code manipulates the stack in the same way, you *can* tier up anywhere -- but those are the places that it's worth checking. Function entry is the essential place. That's when you have the scm_tc7_program in slot 0, if that's how the procedure is compiled, and that's when you can patch the JIT code slot in the program object. The counter itself should be statically allocated in the image rather than existing as part of the scm_tc7_program object, because multiple closures can share code. So, that to me says: * we make a pass in the compiler in a late stage to determine which function entry labels will be ever put in a scm_tc7_program * we add "enter-function" opcodes to the beginning of those functions' code that will increment a statically allocated counter, maybe tier up and patch the program object. * the next call to those functions will see the JIT code in the program and jump in directly For nests of functions that call each other via call-label, I was thinking that these functions should be compiled all at once. But in the case of a big nested function where many things are visible -- the style of programming used in the fastest program -- then there we could cause too much latency, by biting off too much at once. So perhaps there we should use enter-function as well, just with a flag that this function has no closure so there's nothing to patch. If, when compiling, we see a call-label, we can speculatively see if there's JIT code for the callee already, and in that case emit a direct jump; but otherwise we can instead emit an indirect jump. The JIT code pointer would then be not in the scm_tc7_program, but statically allocated in the image. In fact probably we want statically allocated JIT code pointers in the image anyway so that when one closure tiers up, the next closures tier up "for free" on their next call just by checking that pointer. Tier-up sites inside functions are ephemeral: they are typically only entered once. The next time the function is entered it's via JIT, usually. So, no need to cache them, we can just tier up the function (ensuring its corresponding statically allocated JIT pointer is set), then jump to the offset in the JIT code corresponding to that bytecode offset. Probably we can make "handle-interrupts" do this tier-up. OK this mail is long enough!!! Does it make things clearer though? To-do: * adapt compiler and runtime for 3-word stack frames * adapt compiler and runtime for extra word in scm_tc7_program * allocate counter and code pointer for each function * adapt compiler and runtime to insert opcode at program entry (maybe it can run the apply hook!), with flag indicating whether argv[0] might need patching * re-take JIT implementation Cheers, Andy ps. I am thinking of calling bytecode "vcode" in the future, for virtual code, and JIT code "mcode", for machine code. WDYT? :)