Subject: source-code to xxx mappings (Was: Update 2 on Bytecode Offset tracking)
From: Rocky Bernstein
Date: 2020-08-01  5:47 UTC
To: emacs-devel

Thanks for "Update 2 on Bytecode Offset tracking", Zach!

With respect to the C patch (bug#42499)
<https://lists.gnu.org/archive/html/bug-gnu-emacs/2020-07/msg00874.html>, I
updated the savannah git branch feature/soc-bytecode-in-traceback-reduced
<http://git.savannah.gnu.org/cgit/emacs.git/commit/?h=feature/soc-bytecode-in-traceback-reduced&id=bed4004fd334e101a5fb4ee8a3bbf23f4679ba87>
to
include the latest changes. This gives developers an alternative way to
inspect and try out the code. As before, developers should feel free to
add changes on top of this.

With respect to mapping the source information down to LAP, I think this is
a close-enough demonstration to show that it could be done and roughly
what's involved. There are still a number of details lying in the weeds.

The last missing piece, hooking this into the backtrace, is a small
enough gap that developers can easily imagine how to leap over it.

Of more concern now are the scalability and maintenance of the code.

So at this point, it is time to switch gears and look at the big and
long-term picture far outside the 2020 Summer of Code.

Below, I give my thoughts on some of the questions raised.

*1. Source-to-Elisp Maps*

*1a. Lisp Reader*

The Lisp reader based on *edebug* is, I think, fine as it is in terms of
speed for now. As is obvious, getting source information is *optional*:
it is analogous to running gcc with or without -ggdb. Perhaps a better
analogy, since we are talking about speed, is with higher levels of
optimization like -O3. So even if this were 5 times slower (and it
isn't), the process could be run on the GNU Emacs Elisp distribution
just before the final release. The benefit would last several months or
more, enough to justify a slowdown measured in minutes.

*1b. Source-to-Elisp Map Data Structure*

The *information* that is produced, although okay for a proof of concept,
could be improved and made more complete.

As Stefan observed, storing strings in each struct is a bit wasteful and
unnecessary. The string really only needs to be stored at the top-level
expression/form, with some way for each child to get to its parent.
Elsewhere
<https://lists.gnu.org/archive/html/emacs-devel/2020-07/msg00412.html> I
give the motivation for storing the string along with buffer offsets.

Also nice to have in the parent would be the container that the function
is found in, and likewise for any other containers that appear in the
subexpressions of the tree. (Recall that a "container" could be a
buffer, a file, an archive member of some sort, or none if the function
was built on the fly.)
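
To make the shape concrete, here is a rough sketch of what such a node
might look like; the struct and slot names below (source-map-node, sexp,
parent, source-text, container) are hypothetical and only meant to
illustrate the idea:

    (require 'cl-lib)

    ;; Hypothetical sketch of a node in the source-to-elisp tree.  Only
    ;; the top-level node carries the source text and container;
    ;; children reach them through the parent pointer.
    (cl-defstruct source-map-node
      sexp         ; the Lisp form this node covers
      begin end    ; offsets of the form within the top-level source text
      parent       ; parent node, or nil at the top level
      source-text  ; full source string; non-nil only at the top level
      container)   ; buffer, file, archive member, or nil if built on the fly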

Nice to have at the top level, but not strictly necessary, would be some
sort of signature of the text. Since the full string is also at the top
level, signatures (for comparison on both ends of the mapping) can be
computed on the fly. The cheapest kind of computed signature is simply
the length of the string. At the other end of the computational effort
is a SHA. Recent Python bytecode uses SipHash
<https://131002.net/siphash/>, which is somewhere in between in
computation time.
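
For instance, assuming the hypothetical source-map-node sketch above,
the two ends of that spectrum might look like:

    ;; Cheapest signature: just the length of the string.
    (defun source-map-signature-cheap (node)
      (length (source-map-node-source-text node)))

    ;; Stronger signature using Emacs' built-in `secure-hash'.
    (defun source-map-signature-sha (node)
      (secure-hash 'sha256 (source-map-node-source-text node)))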

If we want to look up the text of a string and we are not at the top
level, how do we do that if we magically have a pointer to someplace
inside a source-to-elisp tree structure? There could be a pointer to the
top in the non-top children. However, I think it better to have parent
pointers. Often when looking at some source code where there is a
problem, you may want to see the bigger context, and the parent gives
that. What about the speed of looking up the location? The number of
nesting levels in a program is generally pretty small, usually less than
15. Also, this information is needed only when there is a problem, which
is probably rare. In short, I don't think one needs to obsess over
code-to-string lookup time.
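
As a sketch of that lookup, again assuming the hypothetical
source-map-node structure: walk the short chain of parent pointers to
the top, then slice out the child's text by its offsets.

    (defun source-map-node-top (node)
      "Return the top-level ancestor of NODE by following parent pointers."
      (while (source-map-node-parent node)
        (setq node (source-map-node-parent node)))
      node)

    (defun source-map-node-text (node)
      "Return the source text covered by NODE, from its top-level ancestor."
      (let ((top (source-map-node-top node)))
        (substring (source-map-node-source-text top)
                   (source-map-node-begin node)
                   (source-map-node-end node))))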

Another possibility is to assume that access must start at the top.

Lastly, a small thing about a name used in the source-map structure:
"code" is used to refer to the source text, a *string*. "code", like
"object", is a bit ambiguous; it could refer to LAP code, bytecode,
lambda expressions, or fully compiled code. My suggestion would be to
call it "source-text".

*1c. Source-to-Elisp Map Files*

Just as there are bytecode files, Zach's project produces files with
source-text-to-elisp mappings in them, and these can be independently
produced and loaded into memory. As discussed above, there is string
bloat in this file. It is also missing additional information that might
be nice to have: some sort of optional signature for the container
(file, buffer, ...) that the mapping was built from.

*2. Code-to-source Maps*

In "Update2", a source string was attached to the LAP instruction or
LAP-instruction proxy. This was for expedience and proof of concept. In the
long term what is wanted is an index into the Source-to-Elisp Map structure.

Therefore, in addition to the source-to-elisp map structure, there should
be a way to map a bytecode offset into a position in the Source-to-Elisp
Map File. If this is saved on disk, similar to the way Source-to-Elisp Map
files are done, then the position would be some tree-node index number in
the source-to-elisp tree. When that information is read in, the index node
number can be materialized to a true pointer into the source-to-elisp
structure, if desired. Or as stated above, access might have to always
start from the top-level function.
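
A rough sketch of that materialization step, with hypothetical names,
assuming the on-disk form is a list of (OFFSET . NODE-INDEX) pairs and
the tree's nodes are also available as a vector indexed by node number:

    ;; Replace each node index in OFFSET-MAP with the corresponding
    ;; node object from the vector NODES.
    (defun source-map-materialize (offset-map nodes)
      (mapcar (lambda (entry)
                (cons (car entry) (aref nodes (cdr entry))))
              offset-map))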

Note that the overall structure described above is equally valid for other
kinds of code, such as the great work that Andrea Corallo is doing.

For compiled code you would carry the Source-to-Elisp map and the
location indexes through to DWARF. I think a lot of the code that
decodes back to a source position when there is an error can also be
shared.

*2a. Bytecode Compiler*

Finally, we come to the thorny problem of dealing with the bytecode
compiler. The code behind Update2 is largely cut and paste, which is
also okay for prototyping but untenable in the longer term. Even as this
work was being done, changes were going on in the bytecode compiler. See
bug#41247
<https://lists.gnu.org/archive/html/bug-gnu-emacs/2020-07/msg00010.html>.

Here is a possible solution based on Lisp's rich history of being able
to introspect and modify code. Basically, a transformer function or
macro could be written that takes a byte-compile function as input and
changes it into something that works with the source-to-elisp structure
instead.

Recall that the source-to-elisp structure can be thought of as a "fat
cons" node, but it is implemented in Elisp as opposed to being a
built-in datatype.

This transformer function might look for function calls that start with
the prefix byte-code- and turn those calls into corresponding calls in
the source-map-bytecode- space.
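
A rough, purely illustrative sketch of such a transformer; the function
name and the exact prefixes here are assumptions, not existing code:

    ;; Recursively rewrite calls to functions whose names start with
    ;; "byte-code-" into calls in the "source-map-bytecode-" space.
    (defun source-map-transform (form)
      (cond
       ((and (consp form) (symbolp (car form))
             (string-prefix-p "byte-code-" (symbol-name (car form))))
        (cons (intern (concat "source-map-bytecode-"
                              (substring (symbol-name (car form))
                                         (length "byte-code-"))))
              (source-map-transform (cdr form))))
       ((consp form)
        (cons (source-map-transform (car form))
              (source-map-transform (cdr form))))
       (t form)))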

The cons-node functions could also be looked for and replaced with
corresponding wrappers. For example, instead of car, source-map-car
would be called. A source-map-car function just checks whether the
object passed is a source-to-elisp structure. If it is, then the
source-map structure's sexp field is used; if not, the underlying car
function is called.
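
A minimal sketch of that wrapper, again assuming the hypothetical
source-map-node structure:

    (defun source-map-car (object)
      "Like `car', but use the sexp field if OBJECT is a source-map node."
      (if (source-map-node-p object)    ; predicate generated by cl-defstruct
          (car (source-map-node-sexp object))
        (car object)))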

A transformer function can also have specific transformations for specific
functions. Or there may be specific functions that are just cut and pasted
as was done in Update2. But this kind of activity would be done keeping in
mind that the underlying functions may change, and with a story for how
such changes can be tracked without too much trouble.


*3. Alists, Plists, and Hashes oh my*

> Since arefs are pretty fast, I’m guessing the struct slot accesses wouldn’t
> slow down execution too much, but creating so many records certainly uses a
> lot of memory.


I don't think this needs to be worried about for a long time, if ever.
This information needs to be read in only when an error is hit, and that
could be done via some autoload mechanism.

Presumably the number of errors is small enough that just about anything
would do. And since progress basically comes to a halt anyway, it really
doesn't matter if one method is even a few seconds slower than another.
That said, for very small bytecode, alists are probably the fastest.
However, since offsets are unique, plists would work too. And the larger
the number of offsets in the bytecode, the more desirable hash tables
become.
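
For concreteness, the two lookup styles might look like this (names
hypothetical):

    ;; Alist lookup: fine for small bytecode.
    (defun source-map-offset-lookup-alist (offset offset-alist)
      (cdr (assq offset offset-alist)))

    ;; Hash-table lookup: better as the number of offsets grows.
    (defun source-map-offset-lookup-hash (offset offset-table)
      (gethash offset offset-table))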

But as Donald Knuth has said: premature optimization is the root of all
evil. I think it is sufficient just to have a story or plan for how to
speed such things up when the time comes.
