From mboxrd@z Thu Jan  1 00:00:00 1970
Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail
From: Rocky Bernstein <rocky@gnu.org>
Newsgroups: gmane.emacs.devel
Subject: source-code to xxx mappings (Was: Update 2 on Bytecode Offset
 tracking)
Date: Sat, 1 Aug 2020 01:47:55 -0400
Message-ID: <CANCp2gap7DkNGDJJkZq48fZatup-P41QwOe3khx3ncbvnBXn6A@mail.gmail.com>
Mime-Version: 1.0
Content-Type: multipart/alternative; boundary="00000000000015f37305abca7348"
Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214";
	logging-data="12936"; mail-complaints-to="usenet@ciao.gmane.io"
To: emacs-devel <emacs-devel@gnu.org>
Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Sat Aug 01 07:48:42 2020
Return-path: <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Envelope-to: ged-emacs-devel@m.gmane-mx.org
Original-Received: from lists.gnu.org ([209.51.188.17])
	by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1k1kNu-0003F4-B2
	for ged-emacs-devel@m.gmane-mx.org; Sat, 01 Aug 2020 07:48:42 +0200
Original-Received: from localhost ([::1]:51812 helo=lists1p.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.90_1)
	(envelope-from <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>)
	id 1k1kNt-00013T-D5
	for ged-emacs-devel@m.gmane-mx.org; Sat, 01 Aug 2020 01:48:41 -0400
Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:60168)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <rocky.bernstein@gmail.com>)
 id 1k1kNO-0000d3-G3
 for emacs-devel@gnu.org; Sat, 01 Aug 2020 01:48:10 -0400
Original-Received: from mail-qk1-f182.google.com ([209.85.222.182]:39917)
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128)
 (Exim 4.90_1) (envelope-from <rocky.bernstein@gmail.com>)
 id 1k1kNM-0006BO-6P
 for emacs-devel@gnu.org; Sat, 01 Aug 2020 01:48:10 -0400
Original-Received: by mail-qk1-f182.google.com with SMTP id l6so30873281qkc.6
 for <emacs-devel@gnu.org>; Fri, 31 Jul 2020 22:48:07 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
 bh=7yX8dzRG9nbmQdesbqh9cJZIfE8JA7EXQOEm1dQJ7yQ=;
 b=csB6YejgndkQq8J+vYfl/IJ0oIFu9qlL3M7kK7aHJ2uftkofH8FmNkwVsstmW/hlQX
 1VzMg8Nbv4Tn2qFoWSQ/NNY8zeiMCVce6N0D9YIaQj5yJiyFe7G4555iWCKV6yUDlcXr
 unVtS/4LkoaFl65dmmxXtYYfjNLhagr2nWQUJkv/k2RmyGMIoUqb6CzILIt+sZmbOZKk
 d3j9ChJopVOt9HxI6b1eBgNAM1nufo3dOS4fovvGxEnX66EEepsWdOOi0AejRNdCPIFA
 GgPYoVXiLjH9Qrc3FvOV98tu2QBMhXrsV6gWWTmLw1PMLVbYn/DXm4EZU8NeCfLRRNPq
 7Jfw==
X-Gm-Message-State: AOAM533hJ5PZ/MWCsDt7MOCIWkisj5v1cgYOhYEch5xtjOMw060ox0We
 opRx/oILR8Ugq6bAPE14+Ug9y1TDOQb+7CxweXspYvCKw8k=
X-Google-Smtp-Source: ABdhPJwG0soqE58zfqphuK6spVglppNKm3W7Ykd3q2SpZFOaAr7lqz66LFRTCGDvfDOdY67quNnzs/KP1ivMeXNqvEg=
X-Received: by 2002:a05:620a:144e:: with SMTP id
 i14mr7032445qkl.453.1596260886113; 
 Fri, 31 Jul 2020 22:48:06 -0700 (PDT)
Received-SPF: pass client-ip=209.85.222.182;
 envelope-from=rocky.bernstein@gmail.com; helo=mail-qk1-f182.google.com
X-detected-operating-system: by eggs.gnu.org: First seen = 2020/08/01 01:48:06
X-ACL-Warn: Detected OS   = Linux 2.2.x-3.x [generic] [fuzzy]
X-Spam_score_int: -7
X-Spam_score: -0.8
X-Spam_bar: /
X-Spam_report: (-0.8 / 5.0 requ) BAYES_00=-1.9, FREEMAIL_FORGED_FROMDOMAIN=1,
 FREEMAIL_FROM=0.001, HEADER_FROM_DIFFERENT_DOMAINS=1, HTML_MESSAGE=0.001,
 RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-1, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001, URIBL_BLOCKED=0.001,
 URI_HEX=0.1 autolearn=no autolearn_force=no
X-Spam_action: no action
X-BeenThere: emacs-devel@gnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Emacs development discussions." <emacs-devel.gnu.org>
List-Unsubscribe: <https://lists.gnu.org/mailman/options/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=unsubscribe>
List-Archive: <https://lists.gnu.org/archive/html/emacs-devel>
List-Post: <mailto:emacs-devel@gnu.org>
List-Help: <mailto:emacs-devel-request@gnu.org?subject=help>
List-Subscribe: <https://lists.gnu.org/mailman/listinfo/emacs-devel>,
 <mailto:emacs-devel-request@gnu.org?subject=subscribe>
Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org
Original-Sender: "Emacs-devel"
 <emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org>
Xref: news.gmane.io gmane.emacs.devel:253372
Archived-At: <http://permalink.gmane.org/gmane.emacs.devel/253372>

--00000000000015f37305abca7348
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Thanks for "Update2 on Bytecode Offset tracking), Zack!

With respect to the C patch (bug#42499)
<https://lists.gnu.org/archive/html/bug-gnu-emacs/2020-07/msg00874.html>, I
updated the savannah git branch feature/soc-bytecode-in-traceback-reduced
<http://git.savannah.gnu.org/cgit/emacs.git/commit/?h=3Dfeature/soc-bytecod=
e-in-traceback-reduced&id=3Dbed4004fd334e101a5fb4ee8a3bbf23f4679ba87>
to
include the latest changes. This gives developers an alternative way to
inspect and try out the code. As before, any developers should feel free to
add changes on top of this.

With respect to mapping the source information down to LAP, I think this is
a close-enough demonstration to show that it could be done and roughly
what's involved. There are still a number of details lying in the weeds.

The last missing little piece to have a complete solution of hooking this
into the backtrace is small enough of a gap for developers to easily
imagine and leap over.

Of more concern now is the scalability and maintenance of the code.

So at this point, it is time to switch gears and look at the big and
long-term picture far outside the 2020 Summer of Code.

In the below I give my thoughts on some of the questions raised.

*1. Source-to-Elisp Maps*

*1a. Lisp Reader*

The Lisp reader based on *edebug* I think is fine as it is in terms of
speed for now. As is obvious, getting source information is *optional:* it
is analogous to running  gcc with or without  -ggdb. Perhaps a better
analogy, since we are talking about speed, is with higher levels of
optimization like -O3. So even if this is 5 times slower (and it isn't),
the process could be done on the GNU Emacs Elisp distribution just before
the final release. The benefit would last several months or more, enough to
justify a slowdown in minutes.

*1b. Source-to-Elisp Map Data Structure*

The *information* that is produced, although okay for a proof of concept,
could be improved and made more complete.

As Stefan observed, storing strings in each struct is a bit wasteful and
unnecessary. It really only needs to be stored at the top-level
expression/form with some way in each child to get to the parent. Elsewhere
<https://lists.gnu.org/archive/html/emacs-devel/2020-07/msg00412.html> I
give motivation for storing the string along with buffer offsets.

Also nice in the parent would be the container that the function is found
in, and for any other containers that appear in the subexpressions in the
tree. (Recall a "container" could be a buffer, a file, an archive member of
some sort, or none if the function was built on the fly).

Nice to have at the top level but not strictly necessary, would be some
sort of signature of the text. Since the full string is also at the top
level, computing signatures (for comparison on both ends of the mapping)
can be done on the fly. The cheapest kind of computed signature is simply
the length of the string. At the other end of the computation effort is a
SHA. Recent Python bytecode uses SipHash <https://131002.net/siphash/> whic=
h
is somewhere in between in computation time.

If we want to look up the text of a string and we are not at the top-level,
how do we do that if we magically have a pointer to someplace inside a
source-to-elisp tree structure? There could be a pointer to the top for the
non-top children. However I think it better to have parent pointers. Often
when looking at some source code where there is a problem, you may want to
see the bigger context, and the parent gives that. What about speed in
looking up the location? The number of nesting levels in a program is
generally pretty small, like less than 15. Also, this information is needed
only when there is a problem which is probably rare. In short, I don't
think one needs to obsess over code to string lookup time.

Another possibility is to assume that access must start at the top.

Lastly, a small thing about a name used in the source-map structure: "code"
is used to refer to the source text, a *string*. "code" like "object" is a
bit ambiguous; code could refer to LAP code, bytecode, lambda expressions,
or fully compiled code. My suggestion would be to call it "source-text".

*1c. Source-to-Elisp Map Files*

Just as there are bytecode files, Zach's project produces files with
source-text-elisp mappings in them, and these can be independently produced
and loaded into memory. Discussed above, this is string bloat in this file.
However it is missing additional information that might be nice to have:
some sort of optional signature for the container (file, buffer, ...) that
this was built from.

*2. Code-to-source Maps*

In "Update2", a source string was attached to the LAP instruction or
LAP-instruction proxy. This was for expedience and proof of concept. In the
long term what is wanted is an index into the Source-to-Elisp Map structure=
.

Therefore, in addition to the source-to-elisp map structure, there should
be a way to map a bytecode offset into a position in the Source-to-Elisp
Map File. If this is saved on disk, similar to the way Source-to-Elisp Map
files are done, then the position would be some tree-node index number in
the source-to-elisp tree. When that information is read in, the index node
number can be materialized to a true pointer into the source-to-elisp
structure, if desired. Or as stated above, access might have to always
start from the top-level function.

Note that the overall structure described above is equally valid for other
kinds of code, such as the great work that Andrea Corallo is doing.

For compiled code you would take the Source-to-Elisp map and the location
indexes through to DWARF. A lot of the decoding-to-source-position code
when there is an error I think can also be shared.

*2a. ByteCode Compiler *

Finally, we come to the thorny problem of dealing with the Bytecode
compiler. The code behind Update2 is largely cut and paste which is also
okay for prototyping, but untenable in the longer term. Even as this work
was being done, changes were going on to the bytecode compiler. See
bug#41247
<https://lists.gnu.org/archive/html/bug-gnu-emacs/2020-07/msg00010.html>.

Here is a possible solution based on Lisp's rich history of being about to
introspect and modify code. Basically a transformer function or macro could
be written that takes in a byte-compile function as input and changes that
into something that works with the source-to-elisp structure instead.

Recall that the source-to-elisp structure can be thought of as a "fat cons"
node, but it is implemented in Elisp as opposed to being built-in datatype.

This transformer function might look for function calls that start with the
prefix byte-code- and turn those function-calls into corresponding calls in
the source-map-bytecode- space.

The cons-node functions could also be looked for and replaced with
corresponding wrappers. For example instead of car, source-map-car would be
called.
A source-map-car function just checks to see if the object passed is a
source-to-elisp-structure. If it is, then the source-map structure's sexp
field is used; if not  the underlying car function is called.

A transformer function can also have specific transformations for specific
functions. Or there may be specific functions that are just cut and pasted
as was done in Update2. But this kind of activity would be done keeping in
mind that the underlying functions may change, and with a story for how
such changes can be tracked without too much trouble.


*3. Alists, Plists, and Hashes oh my*

Since arefs are pretty fast, I=E2=80=99m guessing the struct slot accesses =
wouldn=E2=80=99t
> slow down execution too much, but creating so many records certainly uses=
 a
> lot of memory.


I don't think this needs to be worried about for a long time, if ever. Only
when an error is hit, does this information need to be read in it could be
done via some autoload mechanism.

Presumably the number of errors is small enough that just about anything
would do. And since progress basically comes to a halt anyway, it really
doesn't matter if one method is even a few seconds slower than another.
That said, for very small bytecode, alists are probably the fastest.
However since offsets are unique, Plists would work too. And the larger the
number of offsets there are in bytecode, the more hashes are more desirable=
.

But as Donald Knuth has said: premature optimization is the root of all
evil.  I think it sufficient just to have a story or plan for how to speed
such things up when the time comes.

--00000000000015f37305abca7348
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Thanks for &quot;Update2 on Bytecode Offset tracking), Zac=
k!<br><br>With respect to the <a href=3D"https://lists.gnu.org/archive/html=
/bug-gnu-emacs/2020-07/msg00874.html">C patch (bug#42499)</a>, I updated th=
e savannah git branch=C2=A0<a href=3D"http://git.savannah.gnu.org/cgit/emac=
s.git/commit/?h=3Dfeature/soc-bytecode-in-traceback-reduced&amp;id=3Dbed400=
4fd334e101a5fb4ee8a3bbf23f4679ba87">feature/soc-bytecode-in-traceback-reduc=
ed</a>=C2=A0to include the latest changes. This gives developers an alterna=
tive way to inspect and try out the code. As before, any developers should =
feel free to add changes on top of this.<br><br>With respect to mapping the=
 source information down to LAP, I think this is a close-enough demonstrati=
on to show that it could be done and roughly what&#39;s involved. There are=
 still a number of details lying in the weeds.<br><br>The last missing litt=
le piece to have a complete solution of hooking this into the backtrace is =
small enough of a gap for developers to easily imagine and leap over.<br><b=
r>Of more concern now is the scalability and maintenance of the code.<br><b=
r>So at this point, it is time to switch gears and look at the big and long=
-term picture far outside the 2020 Summer of Code.<br><br>In the below I gi=
ve my thoughts on some of the questions raised.<br><br><b>1. Source-to-Elis=
p Maps</b><br><br><i>1a. Lisp Reader</i><br><br>The Lisp reader based on <i=
>edebug</i> I think is fine as it is in terms of speed for now. As is obvio=
us, getting source information is <i>optional:</i> it is=C2=A0analogous to =
running=C2=A0 <font face=3D"monospace">gcc</font> with or without=C2=A0 -<f=
ont face=3D"monospace">ggdb</font>. Perhaps a better analogy, since we are =
talking about speed, is with higher levels of optimization like <font face=
=3D"monospace">-O3</font>. So even if this is 5 times slower (and it isn=
9;t), the process could be done on the GNU Emacs Elisp distribution just be=
fore the final release. The benefit would last=C2=A0several months or more,=
 enough to justify a slowdown in minutes.<br><br><i>1b. Source-to-Elisp Map=
 Data Structure</i><br><br>The <i>information</i> that is produced, althoug=
h okay for a proof of concept, could be improved and made more complete.<br=
><br>As Stefan observed, storing strings in each struct is a bit wasteful a=
nd unnecessary. It really only needs to be stored at the top-level expressi=
on/form with some way in each child to get to the parent. <a href=3D"https:=
//lists.gnu.org/archive/html/emacs-devel/2020-07/msg00412.html">Elsewhere</=
a>=C2=A0I give motivation for storing the string along with buffer offsets.=
=C2=A0<div><br>Also nice in the parent would be the container that the func=
tion is found in, and for any other containers that appear in the subexpres=
sions in the tree. (Recall a &quot;container&quot; could be a buffer, a fil=
e, an archive member of some sort, or none if the function was built on the=
 fly).<br><br>Nice to have at the top level but not strictly necessary, wou=
ld be some sort of signature of the text. Since the full string is also at =
the top level, computing signatures (for comparison=C2=A0on=C2=A0both ends =
of the mapping) can be done on the fly. The cheapest kind of computed signa=
ture is simply the length of the string. At the other end of the computatio=
n effort is a SHA. Recent Python bytecode uses <a href=3D"https://131002.ne=
t/siphash/">SipHash</a>=C2=A0which is somewhere in between in computation t=
ime.<br><br>If we want to look up the text of a string and we are not at th=
e top-level, how do we do that if we magically have a pointer to someplace =
inside a source-to-elisp tree structure? There could be a pointer to the to=
p for the non-top children. However I think it better to have parent pointe=
rs. Often when looking at some source code where there is a problem, you ma=
y want to see the bigger context, and the parent gives that. What about spe=
ed in looking up the location? The number of nesting levels in a program is=
 generally pretty small, like less than 15. Also, this information is neede=
d only when there is a problem which is probably rare. In short, I don&#39;=
t think one needs to obsess over code to string lookup time.</div><div><br>=
</div><div>Another possibility is to assume that access must start at the t=
op.=C2=A0<br><br>Lastly, a small thing about a name used in the source-map =
structure: &quot;code&quot; is used to refer to the source text, a <i>strin=
g</i>. &quot;code&quot; like &quot;object&quot; is a bit ambiguous; code co=
uld refer to LAP code, bytecode, lambda expressions, or fully compiled code=
. My suggestion would be to call it &quot;source-text&quot;.<br><br><i>1c. =
Source-to-Elisp Map Files</i><br><br>Just as there are bytecode files, Zach=
&#39;s project produces files with source-text-elisp mappings in them, and =
these can be independently produced and loaded into memory. Discussed above=
, this is string=C2=A0bloat in this file. However it is missing additional =
information that might be nice to have: some sort of optional signature for=
 the container (file, buffer, ...) that this was built from.<br><br><b>2. C=
ode-to-source Maps</b><br><br>In &quot;Update2&quot;, a source string was a=
ttached to the LAP instruction or LAP-instruction proxy. This was for exped=
ience and proof of concept. In the long term what is wanted is an index int=
o the Source-to-Elisp Map structure.<br><br>Therefore, in addition to the s=
ource-to-elisp map structure, there should be a way to map a bytecode offse=
t into a position in the Source-to-Elisp Map File. If this is saved on disk=
, similar to the way Source-to-Elisp Map files are done, then the position =
would be some tree-node index number in the source-to-elisp tree. When that=
 information is read in, the index node number can be materialized to a tru=
e pointer into the source-to-elisp structure, if desired. Or as stated abov=
e, access might have to always start from the top-level function.<br><br>No=
te that the overall structure described above is equally valid for other ki=
nds of code, such as the great work that Andrea Corallo is doing.<div><br>F=
or compiled code you would take the Source-to-Elisp map and the location in=
dexes through to DWARF. A lot of the decoding-to-source-position code when =
there is an error I think can also be shared.<br><br><i>2a. ByteCode Compil=
er=C2=A0</i></div><div><i><br></i>Finally, we come to the thorny problem of=
 dealing with the Bytecode compiler. The code behind Update2 is largely cut=
 and paste which is also okay for prototyping, but untenable in the longer =
term. Even as this work was being done, changes were going on to the byteco=
de compiler. See <a href=3D"https://lists.gnu.org/archive/html/bug-gnu-emac=
s/2020-07/msg00010.html">bug#41247</a>.=C2=A0</div><div><br>Here is a possi=
ble solution based on Lisp&#39;s rich history of being about to introspect =
and modify code. Basically a transformer function or macro could be written=
 that takes in a byte-compile function as input and changes that into somet=
hing that works with the <font face=3D"monospace">source-to-elisp</font> st=
ructure instead.<br><br></div><div>Recall that the <font face=3D"monospace"=
>source-to-elisp</font> structure can be thought of as a &quot;fat cons&quo=
t; node, but it is implemented in Elisp as opposed to being built-in dataty=
pe.<br><br>This transformer function might look for function calls that sta=
rt with the prefix=C2=A0<font face=3D"monospace">byte-code-</font> and turn=
 those function-calls into corresponding calls in the <font face=3D"monospa=
ce">source-map-bytecode-</font> space.<br><br>The cons-node functions could=
 also be looked for and replaced with corresponding wrappers. For example i=
nstead of <font face=3D"monospace">car</font>, <font face=3D"monospace">sou=
rce-map-car=C2=A0</font>would be called.<br>A <font face=3D"monospace">sour=
ce-map-car</font>=C2=A0function just checks to see if the object passed is =
a source-to-elisp-structure. If it is, then the source-map structure&#39;s =
<font face=3D"monospace">sexp</font> field is used; if not=C2=A0 the underl=
ying <font face=3D"monospace">car </font>function is called.<br><br>A trans=
former function can also have specific transformations for specific functio=
ns. Or there may be specific functions that are just cut and pasted as was =
done in Update2. But this kind of activity would be done keeping in mind th=
at the underlying functions may change, and with a story for how such chang=
es can be tracked without too much trouble.<br><br><br><b>3. Alists, Plists=
, and Hashes oh my</b><br><br><blockquote class=3D"gmail_quote" style=3D"ma=
rgin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:=
1ex">Since arefs are pretty fast, I=E2=80=99m guessing the struct slot acce=
sses wouldn=E2=80=99t slow down execution too much, but creating so many re=
cords certainly uses a lot of memory.</blockquote><br>I don&#39;t think thi=
s needs to be worried about for a long time, if ever. Only when an error is=
 hit, does this information need to be read in it could be done via some au=
toload mechanism.<br><br>Presumably the number of errors is small enough th=
at just about anything would do. And since progress basically comes to a ha=
lt anyway, it really doesn&#39;t matter if one method is even a few seconds=
 slower than another. That said, for very small bytecode, alists are probab=
ly the fastest. However since offsets are unique, Plists would work too. An=
d the larger the number of offsets there are in bytecode, the more hashes a=
re more desirable.<br><br>But as Donald Knuth has said: premature optimizat=
ion is the root of all evil.=C2=A0 I think it sufficient just to have a sto=
ry or plan for how to speed such things up when the time comes.<br></div></=
div></div>

--00000000000015f37305abca7348--