From: Lynn Winebarger
Newsgroups: gmane.emacs.devel
Subject: Re: native compilation units
Date: Sun, 26 Jun 2022 10:14:52 -0400
To: Stefan Monnier
Cc: Andrea Corallo, emacs-devel

On Sat, Jun 25, 2022, 2:12 PM Lynn Winebarger wrote:

>> The part that's incompatible with current semantics of symbols is
>> importing that symbol as an immutable symbolic reference.  Not
>> really a "variable" reference, but as a binding of a symbol to a
>> value in the run-time namespace (or package in CL terminology,
>> although CL did not allow any way to specify what I'm suggesting
>> either, as far as I know).
>>
>> However, that would capture the semantics of ELF shared objects
>> with the text and ro_data segments loaded into memory that is in
>> fact immutable for a userspace program.
>
> It looks to me like the portable dump code/format could be adapted to
> serve the purpose I have in mind here.  What needs to be added is a
> way to limit the scope of the dump so only the appropriate set of
> objects are captured.

I'm going to start with a copy of pdumper.c and pdumper.h renamed to
ndumper (n for namespace).  The pdmp format conceptually organizes the
emacs executable space into a graph with three nodes - an "Emacs
executable" node (or the temacs text and ro sections), "Emacs static"
(sections of the executable loaded into writeable memory), and a "dump"
node, corresponding to heap-allocated objects that were live at the
time of the dump.  The dump node has relocations that can point into
itself or to the emacs executable, and "discardable" relocations for
values instantiated into the "Emacs static".  While the data structure
doesn't require it, the only values saved from the Emacs static data
are symbols, primitive subrs (not native compiled), and the thread
structure for the main thread.
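
To make the three-node picture concrete, here is a minimal Python sketch
of how a reference stored relative to a node might be resolved once the
load address of each node is known.  Everything in it is an assumption
made for illustration: the names Node, Reloc and resolve, the field
layout, and the example addresses are hypothetical and are not taken
from pdumper.c.

```python
from dataclasses import dataclass
from enum import Enum


class Node(Enum):
    EXECUTABLE = "emacs-executable"  # temacs text and read-only sections
    STATIC = "emacs-static"          # writable sections of the executable
    DUMP = "dump"                    # heap objects live at dump time


@dataclass(frozen=True)
class Reloc:
    target: Node       # node the stored reference points into
    offset: int        # offset of the referent within that node
    discardable: bool  # True for values instantiated into Emacs static


def resolve(reloc: Reloc, bases: dict) -> int:
    """Turn a node-relative reference into an absolute address."""
    return bases[reloc.target] + reloc.offset


# Hypothetical load addresses, chosen only for this example.
bases = {Node.EXECUTABLE: 0x400000, Node.STATIC: 0x600000, Node.DUMP: 0x7F0000}

relocs = [
    Reloc(Node.DUMP, 0x10, discardable=False),        # dump -> dump
    Reloc(Node.EXECUTABLE, 0x20, discardable=False),  # dump -> executable
    Reloc(Node.STATIC, 0x30, discardable=True),       # into Emacs static
]
addresses = [resolve(r, bases) for r in relocs]
```

The sketch only distinguishes the three kinds of target described above;
the real format may well encode more relocation types than this.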

There can be cycles between these nodes in the memory graph, but
cutting the edge[s] between the emacs executable and the Emacs static
nodes yields a DAG.
Note, pdumper does not make the partition I'm describing explicitly.
I'm inferring that there must be such a partition.  The discardable
relocations should be ones that instantiate into static data of the
temacs executable.

My plan is to refine the structure of the Emacs process introduced by
pdumper to yield a namespace graph structure with the same property -
cutting the edge from executable to runtime state yields a DAG whose
only root is the emacs executable.

Each ndmp namespace (or module or cl-package) would have its own symbol
table and a unique namespace identifier, with a runtime mapping to the
file backing it (if loaded from a file).

Interned symbols will be extended with three additional properties:
static value, constant value and constant function.  For variables,
scope resolution will be done at compile time:
* Value if not void (undefined), else
* Static value
A constant symbol is referenced by importing a constant symbol, either
from another namespace or a variable in the current namespace's
compile-time environment.  The attempt at run-time to rebind a symbol
bound by an import form will signal an error.  Multiple imports binding
a particular symbol at run-time will effectively cause the shadowing of
an earlier binding by the later binding.  Any sequence of imports and
other forms that would result in the ambiguity of the resolution of a
particular variable at compile time will signal an error.  That is, a
given symbol will have only one associated binding in the namespace
scope during a particular evaluation time (eval, compile,
compile-compile, etc).

A static value binding will be global but not dynamic.  A constant
value binding will result from an export form in an eval-when-compile
form encountered while compiling the source of the ndmp module.
Since static bindings capture the "global" aspect of the current
semantics of special variable bindings, dynamic scope can be safely
restricted to provide thread-local semantics.  Instantiation of a
compiled ndmp object will initialize the bindings to be consistent with
the current semantics of defvar and setq in global scope, as well as
the separation of compile-time and eval-time variable bindings.  [I am
not certain what the exact approach to ensuring that will be].  Note
constant bindings are only created by "importing" from the compile-time
environment through eval-when-compile under the current semantics
model.  This approach simply avoids the beta substitution of
compile-time variable references performed in the current
implementation of eval-when-compile semantics.  Macro expansion is
still available to insert such values directly in forms from the
compile-time environment.

A function symbol will resolve to the function property if not void,
and the constant function property otherwise.

Each ndmp module will explicitly identify the symbols it exports, and
those it imports.  The storage of variable bindings for unexported
symbols will not be directly referenceable from any other namespace.
Constant bindings may be enforced by loading into a read-only page of
memory, a write barrier implemented by the system, or unenforced.  In
other words, attempting to set a constant binding is an error with
unspecified effect.  Additional declarations may be provided to require
the signaling of an error, the enforcement of constancy (without an
error), both, or neither.  The storage of static and constant variables
may or may not be incorporated directly in the symbol object.  For
example, such storage may be allocated using separate hash tables for
static and constant symbol tables to reduce the allocation of space for
variables without a static or constant binding.
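
The resolution rules above (value before static value, function before
constant function, and an error on rebinding a symbol bound by an
import) can be modeled in a few lines of Python.  This is only a sketch
of the proposed semantics, not existing Emacs code: the class and method
names are hypothetical, and it picks the "signal an error" option out of
the enforcement choices listed above.

```python
class Void:
    """Sentinel type for an unbound (void) cell."""


VOID = Void()


class ConstantBindingError(Exception):
    """Raised on an attempt to rebind a symbol bound by an import."""


class Symbol:
    def __init__(self, name):
        self.name = name
        self.value = VOID              # existing value cell
        self.function = VOID           # existing function cell
        self.static_value = VOID       # proposed: global, not dynamic
        self.constant_value = VOID     # proposed: bound by an import
        self.constant_function = VOID  # proposed
        self.imported = False

    def resolve_variable(self):
        # Value if not void, else static value.
        if self.value is not VOID:
            return self.value
        return self.static_value

    def resolve_function(self):
        # Function property if not void, else constant function.
        if self.function is not VOID:
            return self.function
        return self.constant_function

    def import_constant(self, value):
        # A later import shadows an earlier one.
        self.constant_value = value
        self.imported = True

    def set_value(self, value):
        # This sketch assumes the error-signaling enforcement option.
        if self.imported:
            raise ConstantBindingError(self.name)
        self.value = value
```

For instance, a symbol with only a static value resolves to it until an
ordinary value is set, while calling set_value on a symbol bound by
import_constant raises ConstantBindingError.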

When compiling a form that imports a symbol from an ndmp module,
importing in an eval-when-compile context will resolve to the constant
value binding of the symbol, as though the source forms were
concatenated during compilation to have a single compile time
environment.  Otherwise, the resolution will proceed as described
above.

There will be a distinguished ndmp object that contains relocations
instantiated into the Emacs static nodes, serving the baseline function
of pdmp.  There will also be a distinguished ndmp object "ELISP" that
exports all the primitives of Emacs Lisp.  The symbols of this
namespace will be implicitly imported into every ndmp unless overridden
by a special form to be specified.  In this way, a namespace may use an
alternative lisp semantic model, e.g. CL.  Additional forms for
importing symbols from other namespaces remain to be specified.

Ideally the byte code vm would be able to treat an ndmp object as an
extended byte code vector, but the restriction of the byte-codes to
16-bit addressing is problematic.
For 64-bit machines, the ndmp format will restrict the (stored)
addresses to 32 bits, and use the remaining bits of relocs not already
used for administrative purposes as an index into a vector of imported
namespaces in the ndmp file itself, where the 0 value corresponds to an
"un-interned" namespace that is not backed by a (permanent) file.  I
don't know what the split should be in 32-bit systems (without the
wide-int option).  The interpretation of the bits is specific to
file-backed compiled namespaces, so it may restrict the number of
namespace imports in a compiled object without restricting the number
of namespaces imported in the runtime namespace.
Once implemented, this functionality should significantly reduce the
need for a monolithic dump or "redumping" functionality.  Or rather,
"dumping" will be done incrementally.
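
The 64-bit reloc layout described above can be sketched as plain bit
packing.  The split used here (8 administrative bits, 24 bits of
namespace index, 32 bits of address) is purely an assumption for the
example; the text above only fixes the 32-bit address and the meaning
of namespace index 0.  The function names are hypothetical as well.

```python
ADDR_BITS = 32   # stored address width, as stated above
NS_BITS = 24     # hypothetical width of the namespace index
ADMIN_BITS = 8   # hypothetical width of the administrative bits

UNINTERNED_NS = 0  # index 0: namespace not backed by a (permanent) file


def pack_reloc(admin: int, ns_index: int, addr: int) -> int:
    """Pack the three fields into one 64-bit reloc word."""
    if not 0 <= addr < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 32 bits")
    if not 0 <= ns_index < (1 << NS_BITS):
        raise ValueError("too many imported namespaces for this layout")
    if not 0 <= admin < (1 << ADMIN_BITS):
        raise ValueError("administrative field out of range")
    return (admin << (ADDR_BITS + NS_BITS)) | (ns_index << ADDR_BITS) | addr


def unpack_reloc(word: int):
    """Recover (admin, ns_index, addr) from a packed reloc word."""
    addr = word & ((1 << ADDR_BITS) - 1)
    ns_index = (word >> ADDR_BITS) & ((1 << NS_BITS) - 1)
    admin = word >> (ADDR_BITS + NS_BITS)
    return admin, ns_index, addr
```

Under this assumed split, a single compiled object could name at most
2**24 - 1 file-backed imports, which illustrates the point that the
limit applies to the file format and not to the runtime namespace.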

My ultimate goal is to introduce a clean way to express a compiled
object that has multiple code labels, and a mechanism to call or jump
to them directly, so that the expressible control-flow structure of
native and byte compiled code will be equivalent (I believe the
technical term is that there will be a bisimulation between their
operational semantics, but it's been a while).  An initial version
might move in this direction by encoding the namespaces using a
byte-code vector to trampoline to the code-entry points, but this would
not provide a bisimulation.  Eventually, the byte-code VM and compiler
will have to be modified to make full use of ndmp objects as primary
semantic objects without intermediation through byte-code vectors as
currently implemented.

If there's an error in my interpretation of the current implementation
(particularly pdumper), I'd be happy to find out about it now.

As a practical matter, I've been working with the 28.1 source.  Am I
better off continuing with that, or starting from a more recent commit
to the main branch?

Lynn