From mboxrd@z Thu Jan 1 00:00:00 1970 Path: news.gmane.io!.POSTED.blaine.gmane.org!not-for-mail From: Rocky Bernstein Newsgroups: gmane.emacs.devel Subject: source-code to xxx mappings (Was: Update 2 on Bytecode Offset tracking) Date: Sat, 1 Aug 2020 01:47:55 -0400 Message-ID: Mime-Version: 1.0 Content-Type: multipart/alternative; boundary="00000000000015f37305abca7348" Injection-Info: ciao.gmane.io; posting-host="blaine.gmane.org:116.202.254.214"; logging-data="12936"; mail-complaints-to="usenet@ciao.gmane.io" To: emacs-devel Original-X-From: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Sat Aug 01 07:48:42 2020 Return-path: Envelope-to: ged-emacs-devel@m.gmane-mx.org Original-Received: from lists.gnu.org ([209.51.188.17]) by ciao.gmane.io with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1k1kNu-0003F4-B2 for ged-emacs-devel@m.gmane-mx.org; Sat, 01 Aug 2020 07:48:42 +0200 Original-Received: from localhost ([::1]:51812 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1k1kNt-00013T-D5 for ged-emacs-devel@m.gmane-mx.org; Sat, 01 Aug 2020 01:48:41 -0400 Original-Received: from eggs.gnu.org ([2001:470:142:3::10]:60168) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1k1kNO-0000d3-G3 for emacs-devel@gnu.org; Sat, 01 Aug 2020 01:48:10 -0400 Original-Received: from mail-qk1-f182.google.com ([209.85.222.182]:39917) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1k1kNM-0006BO-6P for emacs-devel@gnu.org; Sat, 01 Aug 2020 01:48:10 -0400 Original-Received: by mail-qk1-f182.google.com with SMTP id l6so30873281qkc.6 for ; Fri, 31 Jul 2020 22:48:07 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=7yX8dzRG9nbmQdesbqh9cJZIfE8JA7EXQOEm1dQJ7yQ=; b=csB6YejgndkQq8J+vYfl/IJ0oIFu9qlL3M7kK7aHJ2uftkofH8FmNkwVsstmW/hlQX 1VzMg8Nbv4Tn2qFoWSQ/NNY8zeiMCVce6N0D9YIaQj5yJiyFe7G4555iWCKV6yUDlcXr unVtS/4LkoaFl65dmmxXtYYfjNLhagr2nWQUJkv/k2RmyGMIoUqb6CzILIt+sZmbOZKk d3j9ChJopVOt9HxI6b1eBgNAM1nufo3dOS4fovvGxEnX66EEepsWdOOi0AejRNdCPIFA GgPYoVXiLjH9Qrc3FvOV98tu2QBMhXrsV6gWWTmLw1PMLVbYn/DXm4EZU8NeCfLRRNPq 7Jfw== X-Gm-Message-State: AOAM533hJ5PZ/MWCsDt7MOCIWkisj5v1cgYOhYEch5xtjOMw060ox0We opRx/oILR8Ugq6bAPE14+Ug9y1TDOQb+7CxweXspYvCKw8k= X-Google-Smtp-Source: ABdhPJwG0soqE58zfqphuK6spVglppNKm3W7Ykd3q2SpZFOaAr7lqz66LFRTCGDvfDOdY67quNnzs/KP1ivMeXNqvEg= X-Received: by 2002:a05:620a:144e:: with SMTP id i14mr7032445qkl.453.1596260886113; Fri, 31 Jul 2020 22:48:06 -0700 (PDT) Received-SPF: pass client-ip=209.85.222.182; envelope-from=rocky.bernstein@gmail.com; helo=mail-qk1-f182.google.com X-detected-operating-system: by eggs.gnu.org: First seen = 2020/08/01 01:48:06 X-ACL-Warn: Detected OS = Linux 2.2.x-3.x [generic] [fuzzy] X-Spam_score_int: -7 X-Spam_score: -0.8 X-Spam_bar: / X-Spam_report: (-0.8 / 5.0 requ) BAYES_00=-1.9, FREEMAIL_FORGED_FROMDOMAIN=1, FREEMAIL_FROM=0.001, HEADER_FROM_DIFFERENT_DOMAINS=1, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-1, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001, URI_HEX=0.1 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: emacs-devel@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Emacs development discussions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: emacs-devel-bounces+ged-emacs-devel=m.gmane-mx.org@gnu.org Original-Sender: "Emacs-devel" Xref: news.gmane.io gmane.emacs.devel:253372 Archived-At: --00000000000015f37305abca7348 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Thanks for "Update2 on Bytecode Offset tracking), Zack! With respect to the C patch (bug#42499) , I updated the savannah git branch feature/soc-bytecode-in-traceback-reduced to include the latest changes. This gives developers an alternative way to inspect and try out the code. As before, any developers should feel free to add changes on top of this. With respect to mapping the source information down to LAP, I think this is a close-enough demonstration to show that it could be done and roughly what's involved. There are still a number of details lying in the weeds. The last missing little piece to have a complete solution of hooking this into the backtrace is small enough of a gap for developers to easily imagine and leap over. Of more concern now is the scalability and maintenance of the code. So at this point, it is time to switch gears and look at the big and long-term picture far outside the 2020 Summer of Code. In the below I give my thoughts on some of the questions raised. *1. Source-to-Elisp Maps* *1a. Lisp Reader* The Lisp reader based on *edebug* I think is fine as it is in terms of speed for now. As is obvious, getting source information is *optional:* it is analogous to running gcc with or without -ggdb. Perhaps a better analogy, since we are talking about speed, is with higher levels of optimization like -O3. So even if this is 5 times slower (and it isn't), the process could be done on the GNU Emacs Elisp distribution just before the final release. The benefit would last several months or more, enough to justify a slowdown in minutes. *1b. Source-to-Elisp Map Data Structure* The *information* that is produced, although okay for a proof of concept, could be improved and made more complete. As Stefan observed, storing strings in each struct is a bit wasteful and unnecessary. It really only needs to be stored at the top-level expression/form with some way in each child to get to the parent. Elsewhere I give motivation for storing the string along with buffer offsets. Also nice in the parent would be the container that the function is found in, and for any other containers that appear in the subexpressions in the tree. (Recall a "container" could be a buffer, a file, an archive member of some sort, or none if the function was built on the fly). Nice to have at the top level but not strictly necessary, would be some sort of signature of the text. Since the full string is also at the top level, computing signatures (for comparison on both ends of the mapping) can be done on the fly. The cheapest kind of computed signature is simply the length of the string. At the other end of the computation effort is a SHA. Recent Python bytecode uses SipHash whic= h is somewhere in between in computation time. If we want to look up the text of a string and we are not at the top-level, how do we do that if we magically have a pointer to someplace inside a source-to-elisp tree structure? There could be a pointer to the top for the non-top children. However I think it better to have parent pointers. Often when looking at some source code where there is a problem, you may want to see the bigger context, and the parent gives that. What about speed in looking up the location? The number of nesting levels in a program is generally pretty small, like less than 15. Also, this information is needed only when there is a problem which is probably rare. In short, I don't think one needs to obsess over code to string lookup time. Another possibility is to assume that access must start at the top. Lastly, a small thing about a name used in the source-map structure: "code" is used to refer to the source text, a *string*. "code" like "object" is a bit ambiguous; code could refer to LAP code, bytecode, lambda expressions, or fully compiled code. My suggestion would be to call it "source-text". *1c. Source-to-Elisp Map Files* Just as there are bytecode files, Zach's project produces files with source-text-elisp mappings in them, and these can be independently produced and loaded into memory. Discussed above, this is string bloat in this file. However it is missing additional information that might be nice to have: some sort of optional signature for the container (file, buffer, ...) that this was built from. *2. Code-to-source Maps* In "Update2", a source string was attached to the LAP instruction or LAP-instruction proxy. This was for expedience and proof of concept. In the long term what is wanted is an index into the Source-to-Elisp Map structure= . Therefore, in addition to the source-to-elisp map structure, there should be a way to map a bytecode offset into a position in the Source-to-Elisp Map File. If this is saved on disk, similar to the way Source-to-Elisp Map files are done, then the position would be some tree-node index number in the source-to-elisp tree. When that information is read in, the index node number can be materialized to a true pointer into the source-to-elisp structure, if desired. Or as stated above, access might have to always start from the top-level function. Note that the overall structure described above is equally valid for other kinds of code, such as the great work that Andrea Corallo is doing. For compiled code you would take the Source-to-Elisp map and the location indexes through to DWARF. A lot of the decoding-to-source-position code when there is an error I think can also be shared. *2a. ByteCode Compiler * Finally, we come to the thorny problem of dealing with the Bytecode compiler. The code behind Update2 is largely cut and paste which is also okay for prototyping, but untenable in the longer term. Even as this work was being done, changes were going on to the bytecode compiler. See bug#41247 . Here is a possible solution based on Lisp's rich history of being about to introspect and modify code. Basically a transformer function or macro could be written that takes in a byte-compile function as input and changes that into something that works with the source-to-elisp structure instead. Recall that the source-to-elisp structure can be thought of as a "fat cons" node, but it is implemented in Elisp as opposed to being built-in datatype. This transformer function might look for function calls that start with the prefix byte-code- and turn those function-calls into corresponding calls in the source-map-bytecode- space. The cons-node functions could also be looked for and replaced with corresponding wrappers. For example instead of car, source-map-car would be called. A source-map-car function just checks to see if the object passed is a source-to-elisp-structure. If it is, then the source-map structure's sexp field is used; if not the underlying car function is called. A transformer function can also have specific transformations for specific functions. Or there may be specific functions that are just cut and pasted as was done in Update2. But this kind of activity would be done keeping in mind that the underlying functions may change, and with a story for how such changes can be tracked without too much trouble. *3. Alists, Plists, and Hashes oh my* Since arefs are pretty fast, I=E2=80=99m guessing the struct slot accesses = wouldn=E2=80=99t > slow down execution too much, but creating so many records certainly uses= a > lot of memory. I don't think this needs to be worried about for a long time, if ever. Only when an error is hit, does this information need to be read in it could be done via some autoload mechanism. Presumably the number of errors is small enough that just about anything would do. And since progress basically comes to a halt anyway, it really doesn't matter if one method is even a few seconds slower than another. That said, for very small bytecode, alists are probably the fastest. However since offsets are unique, Plists would work too. And the larger the number of offsets there are in bytecode, the more hashes are more desirable= . But as Donald Knuth has said: premature optimization is the root of all evil. I think it sufficient just to have a story or plan for how to speed such things up when the time comes. --00000000000015f37305abca7348 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Thanks for "Update2 on Bytecode Offset tracking), Zac= k!

With respect to the C patch (bug#42499), I updated th= e savannah git branch=C2=A0feature/soc-bytecode-in-traceback-reduc= ed=C2=A0to include the latest changes. This gives developers an alterna= tive way to inspect and try out the code. As before, any developers should = feel free to add changes on top of this.

With respect to mapping the= source information down to LAP, I think this is a close-enough demonstrati= on to show that it could be done and roughly what's involved. There are= still a number of details lying in the weeds.

The last missing litt= le piece to have a complete solution of hooking this into the backtrace is = small enough of a gap for developers to easily imagine and leap over.
Of more concern now is the scalability and maintenance of the code.
So at this point, it is time to switch gears and look at the big and long= -term picture far outside the 2020 Summer of Code.

In the below I gi= ve my thoughts on some of the questions raised.

1. Source-to-Elis= p Maps

1a. Lisp Reader

The Lisp reader based on edebug I think is fine as it is in terms of speed for now. As is obvio= us, getting source information is optional: it is=C2=A0analogous to = running=C2=A0 gcc with or without=C2=A0 -ggdb. Perhaps a better analogy, since we are = talking about speed, is with higher levels of optimization like -O3. So even if this is 5 times slower (and it isn= 9;t), the process could be done on the GNU Emacs Elisp distribution just be= fore the final release. The benefit would last=C2=A0several months or more,= enough to justify a slowdown in minutes.

1b. Source-to-Elisp Map= Data Structure

The information that is produced, althoug= h okay for a proof of concept, could be improved and made more complete.
As Stefan observed, storing strings in each struct is a bit wasteful a= nd unnecessary. It really only needs to be stored at the top-level expressi= on/form with some way in each child to get to the parent. Elsewhere=C2=A0I give motivation for storing the string along with buffer offsets.= =C2=A0

Also nice in the parent would be the container that the func= tion is found in, and for any other containers that appear in the subexpres= sions in the tree. (Recall a "container" could be a buffer, a fil= e, an archive member of some sort, or none if the function was built on the= fly).

Nice to have at the top level but not strictly necessary, wou= ld be some sort of signature of the text. Since the full string is also at = the top level, computing signatures (for comparison=C2=A0on=C2=A0both ends = of the mapping) can be done on the fly. The cheapest kind of computed signa= ture is simply the length of the string. At the other end of the computatio= n effort is a SHA. Recent Python bytecode uses
SipHash=C2=A0which is somewhere in between in computation t= ime.

If we want to look up the text of a string and we are not at th= e top-level, how do we do that if we magically have a pointer to someplace = inside a source-to-elisp tree structure? There could be a pointer to the to= p for the non-top children. However I think it better to have parent pointe= rs. Often when looking at some source code where there is a problem, you ma= y want to see the bigger context, and the parent gives that. What about spe= ed in looking up the location? The number of nesting levels in a program is= generally pretty small, like less than 15. Also, this information is neede= d only when there is a problem which is probably rare. In short, I don'= t think one needs to obsess over code to string lookup time.

=
Another possibility is to assume that access must start at the t= op.=C2=A0

Lastly, a small thing about a name used in the source-map = structure: "code" is used to refer to the source text, a strin= g. "code" like "object" is a bit ambiguous; code co= uld refer to LAP code, bytecode, lambda expressions, or fully compiled code= . My suggestion would be to call it "source-text".

1c. = Source-to-Elisp Map Files

Just as there are bytecode files, Zach= 's project produces files with source-text-elisp mappings in them, and = these can be independently produced and loaded into memory. Discussed above= , this is string=C2=A0bloat in this file. However it is missing additional = information that might be nice to have: some sort of optional signature for= the container (file, buffer, ...) that this was built from.

2. C= ode-to-source Maps

In "Update2", a source string was a= ttached to the LAP instruction or LAP-instruction proxy. This was for exped= ience and proof of concept. In the long term what is wanted is an index int= o the Source-to-Elisp Map structure.

Therefore, in addition to the s= ource-to-elisp map structure, there should be a way to map a bytecode offse= t into a position in the Source-to-Elisp Map File. If this is saved on disk= , similar to the way Source-to-Elisp Map files are done, then the position = would be some tree-node index number in the source-to-elisp tree. When that= information is read in, the index node number can be materialized to a tru= e pointer into the source-to-elisp structure, if desired. Or as stated abov= e, access might have to always start from the top-level function.

No= te that the overall structure described above is equally valid for other ki= nds of code, such as the great work that Andrea Corallo is doing.

F= or compiled code you would take the Source-to-Elisp map and the location in= dexes through to DWARF. A lot of the decoding-to-source-position code when = there is an error I think can also be shared.

2a. ByteCode Compil= er=C2=A0

Finally, we come to the thorny problem of= dealing with the Bytecode compiler. The code behind Update2 is largely cut= and paste which is also okay for prototyping, but untenable in the longer = term. Even as this work was being done, changes were going on to the byteco= de compiler. See bug#41247.=C2=A0

Here is a possi= ble solution based on Lisp's rich history of being about to introspect = and modify code. Basically a transformer function or macro could be written= that takes in a byte-compile function as input and changes that into somet= hing that works with the source-to-elisp st= ructure instead.

Recall that the source-to-elisp structure can be thought of as a "fat cons&quo= t; node, but it is implemented in Elisp as opposed to being built-in dataty= pe.

This transformer function might look for function calls that sta= rt with the prefix=C2=A0byte-code- and turn= those function-calls into corresponding calls in the source-map-bytecode- space.

The cons-node functions could= also be looked for and replaced with corresponding wrappers. For example i= nstead of car, sou= rce-map-car=C2=A0would be called.
A sour= ce-map-car=C2=A0function just checks to see if the object passed is = a source-to-elisp-structure. If it is, then the source-map structure's = sexp field is used; if not=C2=A0 the underl= ying car function is called.

A trans= former function can also have specific transformations for specific functio= ns. Or there may be specific functions that are just cut and pasted as was = done in Update2. But this kind of activity would be done keeping in mind th= at the underlying functions may change, and with a story for how such chang= es can be tracked without too much trouble.


3. Alists, Plists= , and Hashes oh my

Since arefs are pretty fast, I=E2=80=99m guessing the struct slot acce= sses wouldn=E2=80=99t slow down execution too much, but creating so many re= cords certainly uses a lot of memory.

I don't think thi= s needs to be worried about for a long time, if ever. Only when an error is= hit, does this information need to be read in it could be done via some au= toload mechanism.

Presumably the number of errors is small enough th= at just about anything would do. And since progress basically comes to a ha= lt anyway, it really doesn't matter if one method is even a few seconds= slower than another. That said, for very small bytecode, alists are probab= ly the fastest. However since offsets are unique, Plists would work too. An= d the larger the number of offsets there are in bytecode, the more hashes a= re more desirable.

But as Donald Knuth has said: premature optimizat= ion is the root of all evil.=C2=A0 I think it sufficient just to have a sto= ry or plan for how to speed such things up when the time comes.
--00000000000015f37305abca7348--