From: Andy Wingo <wingo@pobox.com>
To: guile-devel@gnu.org
Subject: guile 3 update
Date: Sun, 22 Oct 2017 12:24:33 +0200
Message-ID: <87shebscny.fsf@pobox.com>
Hi!

I have been thinking about how to get from where we are to good native
compilation.  I think the next thing is instruction explosion, as
described here:

  https://wingolog.org/archives/2016/02/04/guile-compiler-tasks

Currently in Guile's VM there are instructions like vector-ref.  This is
a little silly: there are also instructions to branch on the type of an
object (br-if-tc7 in this case), to get the vector's length, and to do a
branching integer comparison.  Really we should replace vector-ref with
a combination of these test-and-branches, with real control flow in the
function, and then the actual ref should use some more primitive
unchecked memory reference instruction.  Optimization could end up
hoisting everything but the primitive unchecked memory reference, while
preserving safety, which would be a win.  But probably in most cases
optimization wouldn't manage to do this, which would be a lose overall
because you have more instruction dispatch.

Well, this transformation is something we need for native compilation
anyway.  I would accept a patch to do this kind of transformation on the
master branch, after version 2.2.0 has forked.  In theory this would
remove almost all high-level instructions from the VM, making the
bytecode closer to a virtual CPU, and likewise making it easier for the
compiler to emit native code, as it's working at a lower level.

I started looking at this and stumbled when figuring out how to explode
struct-ref.  In Guile 2.2, our "struct" data type is a bit complicated,
and when you go to inline it you realize that it should be simpler.  So
the first thing I did was make some changes to structs.  I landed some
changes in stable-2.2, as described by NEWS:

* New interfaces

** `struct-ref/unboxed' and `struct-set!/unboxed'

These procedures should be used when accessing struct fields with type
`u' (unboxed).  See "Structure Basics" in the manual, for full details.

* New deprecations

** Struct tail arrays deprecated

Guile's structures used to have a facility whereby each instance of a
vtable could contain a variable-length tail array of values.  The length
of the tail array was stored in the structure.  This facility was
originally intended to allow C code to expose raw C structures with
word-sized tail arrays to Scheme.

However, the tail array facility was confusing and didn't work very
well.  It was very rarely used, but it insinuated itself into all
invocations of `make-struct'.  For this reason the clumsily-named
`make-struct/no-tail' procedure can actually be more elegant in actual
use, because it doesn't have a random `0' argument stuck in the middle.

Tail arrays also inhibit optimization by allowing instances to affect
their shapes.  In the absence of tail arrays, all instances of a given
vtable have the same number and kinds of fields.  This uniformity can be
exploited by the runtime and the optimizer.  The presence of tail arrays
makes some of these optimizations more difficult.

Finally, the tail array facility is ad-hoc and does not compose with the
rest of Guile.  If a Guile user wants an array with user-specified
length, it's best to use a vector.  It is clearer in the code, and the
standard optimization techniques will do a good job with it.

For all of these reasons, tail arrays are deprecated in Guile 2.2 and
will be removed from Guile 3.0.  Likewise, `make-struct' /
`scm_make_struct' is deprecated in favor of `make-struct/no-tail' /
`scm_make_struct_no_tail'.  Perhaps one day we will be able to reclaim
the `make-struct' name!
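To make that concrete, here is a small usage sketch of how the new
unboxed accessors and `make-struct/no-tail' fit together.  The "pwuw"
layout and the values are invented for illustration; they are not taken
from the NEWS text.

  ;; A vtable whose instances have one boxed ('p') field and one
  ;; unboxed ('u') field, both writable.
  (define vt (make-vtable "pwuw"))

  ;; Deprecated: `make-struct' threads a tail-array size through the
  ;; argument list -- the random `0' mentioned above:
  ;;   (make-struct vt 0 'hello)
  ;; Preferred:
  (define s (make-struct/no-tail vt 'hello))

  (struct-ref s 0)              ;; => hello
  (struct-set!/unboxed s 1 42)  ;; field 1 has type `u'
  (struct-ref/unboxed s 1)      ;; => 42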
** Struct "self" slots deprecated

It used to be that you could make a structure vtable that had "self"
slots.  Instances of that vtable would have those slots initialized to
the instance itself.  This could be useful in C code, where you might
have a pointer to the data array and want to get the `SCM' handle for
the structure.  However, this was a little-used complication without any
use in Scheme code.  To replace it, just use "p" slots and initialize
those slots manually when creating the instance.

** Struct fields with opaque ("o") protection deprecated

Struct fields are declared with a "protection", meaning read-only ('r'),
read-write ('w'), or opaque ('o').  There is also "hidden" ('h'), which
is read-write but which isn't initialized by arguments passed to
`make-struct/no-tail', but that's a detail.

Opaque struct fields were used to allocate storage in a struct that
could only be accessed by C.  This facility was very rarely used (unused
in Guile itself), but now that we are implementing more and more in
Scheme, it is completely useless.  To enforce permissions on struct
fields, instead layer on an abstraction at a higher level, in the same
way that immutable record fields are simply those which don't have a
setter.

** Using `struct-ref' and `struct-set!' on unboxed fields is deprecated

Use the new `struct-ref/unboxed' and `struct-set!/unboxed' instead.

The master branch then removes support for deprecated behavior, and on
top of that does this:

** By default, GOOPS classes are not redefinable

It used to be that all GOOPS classes were redefinable, at least in
theory.  This facility was supported by an indirection in all "struct"
instances, even though only a subset of structs would need redefinition.

We wanted to remove this indirection, in order to speed up Guile
records, allow immutable Guile records to eventually be described by
classes, and allow for some optimizations in core GOOPS classes that
shouldn't be redefined anyway.

Thus in GOOPS there are now classes that are redefinable and classes
that aren't.  By default, classes created with GOOPS are not
redefinable.  To make a class redefinable, it should be an instance of
`<redefinable-class>'.  See "Redefining a Class" in the manual for more
information.
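As an illustration -- the class and slots here are invented, not part of
the NEWS text -- opting a class back into redefinability goes through
the `<redefinable-class>' metaclass:

  (use-modules (oop goops))

  ;; Not redefinable by default.
  (define-class <point> ()
    (x #:init-value 0)
    (y #:init-value 0))

  ;; Redefinable: instances of <shape> can follow a later redefinition
  ;; of the class.
  (define-class <shape> ()
    (name #:init-keyword #:name)
    #:metaclass <redefinable-class>)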
** Remove "self" field from vtables and "redefined" field from classes

These fields were used as part of the machinery for class redefinition
and are no longer needed.

Anyway, structs are simpler now.

OK, now on to the heart of this mail: instruction explosion.  I have
gone through all instructions in Guile 2.2's VM and written down
"exploded" variants of them.  The goal is to make the VM lower-level:
closer to a CPU instruction set.  The current work moves in that
direction but doesn't quite get there.

Here's an annotated example of how `(vector-ref V I)' might compile.
Note that this instruction receives I as an untagged 64-bit number.

  ;; Compare the 3 lowest bits (the itag) of V to 0b000, and set the
  ;; flags register.  The flags register is internal to the VM, and has
  ;; 3 bits: less-than, equal, and greater-than.  This comparison
  ;; operation will set less-than and greater-than to 0, and set the
  ;; equal bit to 1 if the itag matches the given value, and 0
  ;; otherwise.
  ;;
  ;; Since 000 is the itag of heap objects (non-immediates), this
  ;; comparison is just, "is V a heap object?".
  cmp-itag3 V 000

  ;; If the equal bit is not set, go to the "not-vector" label.  This
  ;; instruction is "Jump if Not Equal", and gets its name from that.
  ;; The available conditional jumps are JL, JLE, JE, JGE, and JG, which
  ;; branch if the less-than, less-than-or-equal, equal,
  ;; greater-than-or-equal, or greater-than bits are set, respectively.
  ;; Then there are the corresponding JNL, JNLE, JNE, JNGE, and JNG
  ;; instructions, which branch if the corresponding bit is not set.
  jne -> not-vector

  ;; Incidentally, this split between comparisons and jumps is a
  ;; departure from 2.2, where there are instructions like br-if-< and
  ;; so on.  This attempt at a 3.0 instruction set will privilege
  ;; orthogonality: there are N types of jumps that one might want to
  ;; make, and M types of comparisons, and it's better to have N + M
  ;; instructions rather than N * M.  Of course some comparisons are
  ;; invertible: (< N M) may be the same as (not (>= N M)), but not all
  ;; are (notably comparisons with NaN).  So, like ARM and x86, the 3.0
  ;; VM adds a flags register.  RISC-V incidentally uses an approach
  ;; more like 2.2, with fused compare-and-branch instructions, but
  ;; sadly their floating-point comparisons write their result to a
  ;; general-purpose register, which you then have to test-and-branch
  ;; on -- so it seems that a flags register may be best.

  ;; OK, now we compare the low bits of the first word of V to the
  ;; vector typecode (scm_tc7_vector).  All heap objects have a type
  ;; code in their low bits.
  cmp/tc7 V vector

  ;; If the tc7 matched, the equal bit of the flags register will be
  ;; set, and unset otherwise.  If it wasn't a vector, branch to the
  ;; error case.
  jne -> not-vector

  ;; If we got here, V is a vector.  To get its length, first we fetch
  ;; its first word, then we right-shift it by 8 bits.
  ;;
  ;; Here we fetch the first word of V and store it to register TMP as
  ;; an unsigned 64-bit integer.  Four things to note:
  ;;
  ;; One, this VM still has unlimited registers, which are basically
  ;; stack slots.  Each function ensures that the stack has enough
  ;; space for its slots, in the same way as Guile 2.2.  I will figure
  ;; out register allocation later.
  ;;
  ;; Two, as in Guile 2.2, the instructions are typed.  This one takes
  ;; a SCM value and a U64 index and writes a U64 value.
  ;;
  ;; Three, this fetch performs no checking.  Its safety should be
  ;; guaranteed by construction.  In this case, safety is ensured
  ;; because the object has an itag of 000, indicating that it's a heap
  ;; object and thus that there is at least one word addressable at the
  ;; pointer encoded by V.
  ;;
  ;; Finally, you may have noted that this fetch is redundant: we
  ;; already had to fetch the first word for the cmp/tc7.  Indeed.
  ;; However, it's easier for the current compiler to remove the tc7
  ;; test if it takes the vector as an argument, instead of taking the
  ;; first word as an argument.  More on this later.
  word-ref TMP V 0

  ;; Right-shift the first word by 8 bits and write the U64 result to
  ;; LENGTH.
  ursh/imm LENGTH TMP 8

  ;; Now compare the U64 value in I (the index) to LENGTH.
  cmp-u64 I LENGTH

  ;; If I was greater than or equal to LENGTH, then error.
  jge -> index out of range for vector-ref

  ;; Otherwise we're good!  The value for index I will be the word
  ;; stored at offset I+1 in V, so add 1 to I and dereference V.
  uadd/imm I* I 1
  scm-ref V I*

Now, some bigger-picture things to note.  One advantage of instruction
explosion is optimizability.  If we know that V is a vector (perhaps
because we already ran vector-length on it, or because we already saw
that (vector? V) => #t, or we saw its constructor, or something like
that), then the compiler can elide the first two checks (respectively
because it knows that the itag3 of a vector is indeed 0, and because the
tc7 of a vector is scm_tc7_vector).  If we know the length of the vector
(or perhaps the minimum length of the vector), the length test may fold
as well, leaving us with a lone memory reference.
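In other words, in the best case the checks all fold away and we are
left with just the index adjustment and the reference itself (and when
I is a compile-time constant, the add can fold away too):

  uadd/imm I* I 1
  scm-ref V I*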
Another advantage is that because this instruction set is more
orthogonal, it's easier to construct specialized variants.  For example,
right now in Guile 2.2's VM there are vector-ref and vector-ref/imm, the
second of which is used when the index is known at compile time.  It's
easy to see how we'd specialize the above instruction sequence to
include an immediate index; small orthogonal instructions allow easier
specialization.

Clearly an exploded instruction sequence is going to mean more
instructions, unless the optimizer has an especially good day ;)  So as
long as we are using a bytecode interpreter, that's going to be a risk.
Even though each bytecode is doing less work, there is a constant
overhead to dispatching each instruction, and this overhead will play a
larger role in the performance of a Guile 3.0 bytecode interpreter.

There are a couple of things that mitigate this overhead, however,
besides the possibility that parts of an exploded instruction sequence
may be elided by the optimizer.  One is that in the near future, instead
of emitting bytecode to be interpreted by a VM running on top of the
CPU, we will emit machine code to be interpreted directly by the CPU.  I
think that's close, although we need to figure out some other issues
(for example, how to link to runtime routines written in C, and how to
keep portability).

However, I think there's also room for a simple template JIT in the very
short term.  Because these instructions correspond directly to machine
instructions, it will be relatively little work to write a simple
template JIT, perhaps using GNU lightning.  I am really concerned about
JIT complexity, but am happy going this way because we already do all
our optimization in the general compiler path, and the JIT would really
only have to emit per-instruction machine code.  Because it doesn't need
to optimize and because it doesn't change the operational model (e.g.
from Scheme's perspective we are still working on bytecodes that are
available, can be parsed, have debugging information, etc.), the
complexity is bounded.  Having a simple JIT would allow us to ensure
that even though we explode the instructions, we're still getting
faster.

I should also say something about GC -- with exploded instructions,
there are more risks.  For example, instead of incrementing the vector
index by one, we could have extracted the base pointer for the vector
(V+1).  However, it's V that's protected by GC, not V+1; maybe having a
loose V+1 reference would not be enough to keep the vector's storage
alive.  So I've been careful to have any instruction that dereferences
memory take a SCM value as an input, assuming that SCM value protects
its memory.
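To sketch the distinction, with made-up raw-pointer opcodes that I am
not actually proposing:

  ;; Bad shape: between these two instructions only a derived interior
  ;; pointer is live, and that may not be enough to keep V's storage
  ;; alive across a GC.
  raw-pointer-of PTR V 1    ;; hypothetical: PTR = address of V's word 1
  word-ref/raw TMP PTR I    ;; hypothetical unchecked raw access

  ;; Good shape (as in the vector-ref expansion above): the access
  ;; itself takes the SCM value V, which the GC knows how to trace.
  uadd/imm I* I 1
  scm-ref V I*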
However, with exploded instructions we might see more intermediate
states in some interrupt.  Of course proper interrupts are handled by
the handle-interrupts opcode, but there are also cross-thread GC
interrupts, which currently work by sending all active threads a SIGPWR
or something like that.  Ideally in the future we could avoid signals
and instead wait for remote threads to come to safe points -- that would
be best.

But until then we have to be ready for functions to be paused and GC'd
at any moment.  I think the strategy of conservatively marking the
innermost frame will continue to work fine, but we have seen some
reports of cross-thread GC problems right now...  I don't know.

OK, that's all the things I can think of.  Over the next month I will be
incrementally adding new instructions, changing the compiler to lower to
the new instructions (probably after closure conversion, though who
knows), and removing old instructions.  Hopefully things remain
manageably fast until we can get the JIT in place.

Thoughts welcome if I have forgotten something.

Cheers :)

Andy