Update 1 on Bytecode Offset tracking

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Update 1 on Bytecode Offset tracking
@ 2020-07-15 23:10 Zach Shaftel
  2020-07-16  3:55 ` Stefan Monnier
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Zach Shaftel @ 2020-07-15 23:10 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/html, Size: 10496 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-15 23:10 Update 1 on Bytecode Offset tracking Zach Shaftel
@ 2020-07-16  3:55 ` Stefan Monnier
  2020-07-16 22:45   ` Zach Shaftel
  2020-07-16  7:25 ` Andrea Corallo via Emacs development discussions.
  2020-07-28 19:19 ` Update 2 " Zach Shaftel
  2 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2020-07-16  3:55 UTC (permalink / raw)
  To: Zach Shaftel; +Cc: emacs-devel

> The second branch saves the offset only before a call. Therefore, the
> traceback on all of the functions other than the current one are
> accurate, but the current one is not accurate if the error happens in
> a byte op.

IIUC this has a negligible performance impact.  The info it provides in
not 100% accurate, but I think it's a "sweet spot": it does provide the
byte-offset info and is cheap enough to be acceptable into `master` with
no real downside.

I'd look at it as a "step" along the way: subsequent steps can be to
make use of that info, or to improve the accuracy of that info.

> The third branch bypasses invoking Ffuncall from within
> exec_byte_code, and instead does essentially the same thing that
> Ffuncall does, right in the Bcall ops.

This would be useful in its own right.
So I suggest you try and get this code into shape for `master` as well.

I expect this will tend to suffer from some amount of code duplication.
Maybe we can avoid it via refactoring, or maybe by "clever" macro
tricks, but if the speedup is important enough, we can probably live
with some amount of duplication.

> All of them print the offset next to function names in the backtrace like this: 
>
>
> Debugger entered--Lisp error: (wrong-type-argument stringp t)
>        string-match(t t nil)
>     13 test-condition-case()
>        load("/home/zach/.repos/bench-compare.el/test/test-debug...")
>     78 byte-recompile-file("/home/zach/.repos/bench-compare.el/test/test-debug..." nil 0 t)
>     35 emacs-lisp-byte-compile-and-load()
>        funcall-interactively(emacs-lisp-byte-compile-and-load)
>        call-interactively(emacs-lisp-byte-compile-and-load record nil)
>    101 command-execute(emacs-lisp-byte-compile-and-load record)

Cool!

> With respect to reporting offsets, using code from edebug we have
> a Lisp-Expression reader that will track source-code locations and
> store the information in a source-map-expression cl-struct.  The code
> in progress is here.

How does the performance of this code compare to that of the "native" `read?
And to put it into perspective, have you looked at the relative
proportion of time spent in `read` during a "typical" byte compilation?

There's no doubt that preserving source code information will slow down
byte-compilation but depending on how slow it gets we may find it's not
"worth it".

> Information currently saved is: 
>
> * The expression itself
> * The exact string that was read
> * Begin and end points of the sexp in the buffer
> * source-map-expression children (for conses and vectors)

Sounds like a lot of information, which in turn implies a potentially
high overhead (e.g. the "exact string" sounds like it might cost O(N²)
in corner cases, yet provides redundant info that can be recovered from
begin+end points).  Note also that while `read` returns a sexp made
exclusively of data coming from a particular buffer, the code after
macro-expansion can include chunks coming from other buffers, so if we
want to keep the same representation of "sexp with extra info" in both
cases, we can't just assume "the buffer".

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-15 23:10 Update 1 on Bytecode Offset tracking Zach Shaftel
  2020-07-16  3:55 ` Stefan Monnier
@ 2020-07-16  7:25 ` Andrea Corallo via Emacs development discussions.
  2020-07-17  0:24   ` Zach Shaftel
  2020-07-28 19:19 ` Update 2 " Zach Shaftel
  2 siblings, 1 reply; 15+ messages in thread
From: Andrea Corallo via Emacs development discussions. @ 2020-07-16  7:25 UTC (permalink / raw)
  To: Zach Shaftel; +Cc: emacs-devel, Rocky Bernstein

Hi Zach and Rocky,

IMO having the exact offset for all functions in the stack except the
last is already a measurable improvement.  Reevaluating the top function
and rerunning is not a huge deal, reevaluating N functions and rerunning
each time trying to figure out what is going wrong and where on the
contrary can be considerably more painful.

Zach Shaftel <zshaftel@gmail.com> writes:

> With respect to reporting offsets, using code from edebug we have a
> Lisp-Expression reader that will track source-code locations and
> store the information in a source-map-expression cl-struct. The code
> in progress is here.
>
> Information currently saved is:
>
>     The expression itself
>     The exact string that was read
>     Begin and end points of the sexp in the buffer
>     source-map-expression children (for conses and vectors)
>
> which can be generated for a whole lisp file with the function
> source-map-file. We are testing this on lots of files such as the
> lisp files in the GNU Emacs distribution. After this is done we will
> try hooking this into the compilation process.

Regarding the reader I fear modifying the C one will be the only way if
we want to have something sufficiently high performance to be used as
default.  That said having one to begin with experimenting is a very
good start.  I guess we'll want to have the 'children' as key of an hash
where the rest is the value.

Thanks you and Rocky for the effort!

  Andrea

-- 
akrl@sdf.org

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-16  3:55 ` Stefan Monnier
@ 2020-07-16 22:45   ` Zach Shaftel
  2020-07-17  3:44     ` Eli Zaretskii
  2020-07-17 16:20     ` Stefan Monnier
  0 siblings, 2 replies; 15+ messages in thread
From: Zach Shaftel @ 2020-07-16 22:45 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Hi Stefan,

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> The second branch saves the offset only before a call. Therefore, the
>> traceback on all of the functions other than the current one are
>> accurate, but the current one is not accurate if the error happens in
>> a byte op.
>
> IIUC this has a negligible performance impact.  The info it provides in
> not 100% accurate, but I think it's a "sweet spot": it does provide the
> byte-offset info and is cheap enough to be acceptable into `master` with
> no real downside.

Great! I just followed up on my copyright assignment as I still haven't
finished that process. I don't know whether this could be exempt or if
Rocky's assignment is sufficient, but hopefully I will hear back from
copyright-clerk soon.

> I'd look at it as a "step" along the way: subsequent steps can be to
> make use of that info, or to improve the accuracy of that info.

Absolutely, it would be great to have that in place as a basis for
further improvement.

>> The third branch bypasses invoking Ffuncall from within
>> exec_byte_code, and instead does essentially the same thing that
>> Ffuncall does, right in the Bcall ops.
>
> This would be useful in its own right.
> So I suggest you try and get this code into shape for `master` as well.

I will definitely continue work on this.

> I expect this will tend to suffer from some amount of code duplication.
> Maybe we can avoid it via refactoring, or maybe by "clever" macro
> tricks, but if the speedup is important enough, we can probably live
> with some amount of duplication.

That seems to be the case. I'll keep looking to see if there's any low
hanging fruit in terms of splitting up the funcall logic without slowing
things down. More testing is necessary, but if a moderate chunk of
duplicated code is acceptable then there may not be as much work needed
on that branch as I had thought.

>> All of them print the offset next to function names in the backtrace like this:
>>
>>
>> Debugger entered--Lisp error: (wrong-type-argument stringp t)
>>        string-match(t t nil)
>>     13 test-condition-case()
>>        load("/home/zach/.repos/bench-compare.el/test/test-debug...")
>>     78 byte-recompile-file("/home/zach/.repos/bench-compare.el/test/test-debug..." nil 0 t)
>>     35 emacs-lisp-byte-compile-and-load()
>>        funcall-interactively(emacs-lisp-byte-compile-and-load)
>>        call-interactively(emacs-lisp-byte-compile-and-load record nil)
>>    101 command-execute(emacs-lisp-byte-compile-and-load record)
>
> Cool!
>
>> With respect to reporting offsets, using code from edebug we have
>> a Lisp-Expression reader that will track source-code locations and
>> store the information in a source-map-expression cl-struct.  The code
>> in progress is here.
>
> How does the performance of this code compare to that of the "native" `read?

Rough tests indicate it's about three times slower. Running it on all
274 files in the `lisp` directory of the GNU Emacs sources takes ~11-12
seconds (after removing the string from the struct and not
pretty-printing). A similar function which just calls `read` takes ~4
seconds. There are probably ways to further improve the performance of
`source-map-read`, but I don't know much more speed can realistically be
gained.

> And to put it into perspective, have you looked at the relative
> proportion of time spent in `read` during a "typical" byte compilation?

I have not yet, but I'll evaluate that and keep it in mind.

> There's no doubt that preserving source code information will slow down
> byte-compilation but depending on how slow it gets we may find it's not
> "worth it".
>
>> Information currently saved is:
>>
>> * The expression itself
>> * The exact string that was read
>> * Begin and end points of the sexp in the buffer
>> * source-map-expression children (for conses and vectors)
>
> Sounds like a lot of information, which in turn implies a potentially
> high overhead (e.g. the "exact string" sounds like it might cost O(N²)
> in corner cases, yet provides redundant info that can be recovered from
> begin+end points).

Removing the string did improve performance, but not by as much as I
expected. The function that constructs the tree of "children" may be
slower than it needs to be, so I'll look into improving that. It may not
be necessary to create the children for vectors since they're constants
(outside of backquote, at least).

>Note also that while `read` returns a sexp made exclusively of data
>coming from a particular buffer, the code after macro-expansion can
>include chunks coming from other buffers, so if we want to keep the
>same representation of "sexp with extra info" in both cases, we can't
>just assume "the buffer".

Yes, and it won't be easy to maintain the read locations across
macroexpansion, byte-opt and cconv. It's tough to say at this point how
much the final product will slow down compilation, but I suspect it will
be significant.

>         Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-16  7:25 ` Andrea Corallo via Emacs development discussions.
@ 2020-07-17  0:24   ` Zach Shaftel
  2020-07-17 13:47     ` Rocky Bernstein
  0 siblings, 1 reply; 15+ messages in thread
From: Zach Shaftel @ 2020-07-17  0:24 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: Rocky Bernstein, emacs-devel

Hi Andrea,

Andrea Corallo <akrl@sdf.org> writes:

> Hi Zach and Rocky,
>
> IMO having the exact offset for all functions in the stack except the
> last is already a measurable improvement.  Reevaluating the top function
> and rerunning is not a huge deal, reevaluating N functions and rerunning
> each time trying to figure out what is going wrong and where on the
> contrary can be considerably more painful.

Absolutely, and since this deficiency arises when errors are signalled
from byte ops other than `Bcall`, many backtraces still present accurate
offsets top to bottom.

> Zach Shaftel <zshaftel@gmail.com> writes:
>
>> With respect to reporting offsets, using code from edebug we have a
>> Lisp-Expression reader that will track source-code locations and
>> store the information in a source-map-expression cl-struct. The code
>> in progress is here.
>>
>> Information currently saved is:
>>
>>     The expression itself
>>     The exact string that was read
>>     Begin and end points of the sexp in the buffer
>>     source-map-expression children (for conses and vectors)
>>
>> which can be generated for a whole lisp file with the function
>> source-map-file. We are testing this on lots of files such as the
>> lisp files in the GNU Emacs distribution. After this is done we will
>> try hooking this into the compilation process.
>
> Regarding the reader I fear modifying the C one will be the only way if
> we want to have something sufficiently high performance to be used as
> default.

I suspect you're right that processing and saving this information in
Lisp won't be fast enough to maintain reasonable compilation time.
Perhaps it could be made available as an option, in the form of
alternate byte-compile functions for example, but that would be an
unfortunate compromise. Down the line, perhaps the native compiler could
change that.

> That said having one to begin with experimenting is a very
> good start.  I guess we'll want to have the 'children' as key of an hash
> where the rest is the value.

That's the plan!

> Thanks you and Rocky for the effort!
>
>   Andrea




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-16 22:45   ` Zach Shaftel
@ 2020-07-17  3:44     ` Eli Zaretskii
  2020-07-17 16:20     ` Stefan Monnier
  1 sibling, 0 replies; 15+ messages in thread
From: Eli Zaretskii @ 2020-07-17  3:44 UTC (permalink / raw)
  To: Zach Shaftel; +Cc: monnier, emacs-devel

> From: Zach Shaftel <zshaftel@gmail.com>
> Date: Thu, 16 Jul 2020 18:45:00 -0400
> Cc: emacs-devel@gnu.org
> 
> I just followed up on my copyright assignment as I still haven't
> finished that process.

Let me know if I can help in any way.

> I don't know whether this could be exempt or if Rocky's assignment
> is sufficient, but hopefully I will hear back from copyright-clerk
> soon.

If you don't hear from the in a week or two, ping them and CC me.

Thanks.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-17  0:24   ` Zach Shaftel
@ 2020-07-17 13:47     ` Rocky Bernstein
  0 siblings, 0 replies; 15+ messages in thread
From: Rocky Bernstein @ 2020-07-17 13:47 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 2847 bytes --]

ON Wed, 15 Jul 2020 23:55:19 -0400 Stefan Monnier wrote:

Sounds like a lot of information, which in turn implies a potentially
> high overhead (e.g. the "exact string" sounds like it might cost O(N²)
> in corner cases, yet provides redundant info that can be recovered from
> begin+end points). Note also that while `read` returns a sexp made
> exclusively of data coming from a particular buffer, the code after
> macro-expansion can include chunks coming from other buffers, so if we
> want to keep the same representation of "sexp with extra info" in both
> cases, we can't just assume "the buffer".

Yes, when I last looked, yes, there is bloat in the way source mappings are
done. But let me explain:

As a Google Summer of Code project, the project has always been been a bit
behind. So the approach I had been taking was that if something is usable
for now, go with it and move onto other uncharted territory. In other
words, get something out,  complete what remains and *only then* go back
and iterate on the parts that need improving.  The C changes were little
bit different because of the (necessarily) long lead time to get things
into master and because one can't put something inefficient into the core.

The source-code string is needed in the source map only at the top-level.
(Oddly the member name for this is "code"). I had suggested that offsets
should be relative to the beginning of the function, and the function node
would have the position from the beginning of the container (e.g. file)
that it is in. However this isn't a big deal, since conversions are easily
done.

As for handling bits of S-expressions that represent the conglomeration of
a number of containers/files, that's pretty easily handled inside the
structure. I am not totally clear about how the container information is
determined. I imagine some of it would be noticed in the parameters when
the macro is defined, and some of each time the macro is expanded. But once
it is determined that certain S-expressions go with certain containers, it
is trivial to add it to a source-map object

One cool thing about having the source string stored in the sourcemap
object (whether just at the top-level of in more places) is that in
tracebacks is that exact information can be given without searching around.
In fact, the source code may have *never* existed inside a file and this
still works.

Another great thing about this is that it can tolerate mismatches between
the Elisp compiled and the Elisp that is have available. If there were
changes outside the toplevel object but not inside the object, then it is
pretty easy to detect and correct for this. Even if the discrepency is
inside the object, the differences are also easiliy detected. Adjusting is
a little more difficult, but still doable.

[-- Attachment #2: Type: text/html, Size: 3769 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-16 22:45   ` Zach Shaftel
  2020-07-17  3:44     ` Eli Zaretskii
@ 2020-07-17 16:20     ` Stefan Monnier
  2020-07-17 20:19       ` Zach Shaftel
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2020-07-17 16:20 UTC (permalink / raw)
  To: Zach Shaftel; +Cc: emacs-devel

> Great! I just followed up on my copyright assignment as I still haven't
> finished that process. I don't know whether this could be exempt or if
> Rocky's assignment is sufficient, but hopefully I will hear back from
> copyright-clerk soon.

While waiting for the paperwork to go through, you can prepare the patch
and we can start discussing it.

> That seems to be the case.  I'll keep looking to see if there's any low
> hanging fruit in terms of splitting up the funcall logic without slowing
> things down.  More testing is necessary, but if a moderate chunk of
> duplicated code is acceptable then there may not be as much work needed
> on that branch as I had thought.

It's a tradeoff, so it's hard to say what is acceptable without seeing
the actual path along with the corresponding measurements of the
performance impact.

> Rough tests indicate it's about three times slower.

Wow!  That's a lot less than I expected.  That makes it quite usable.
This said, we'll probably still want to merge the feature into the
C code simply to avoid the duplication (I expect that the Edebug reader
has never been 100% faithful and that it has probably diverged over
time).

> Removing the string did improve performance, but not by as much as I
> expected. The function that constructs the tree of "children" may be
> slower than it needs to be, so I'll look into improving that. It may not
> be necessary to create the children for vectors since they're constants
> (outside of backquote, at least).

I think vectors are rare enough that the performance benefit of special
casing them is not worth the downside of losing source-location info
when it'd be beneficial.

> Yes, and it won't be easy to maintain the read locations across
> macroexpansion, byte-opt and cconv.

For byte-opt and cconv it's mostly a question of labour.

For macros, OTOH, it's really fundamentally hard (or impossible, in
general).  We could/should introduce some new way to define macros which
knows about "source code annotated with locations".
There's a lot of work on Scheme macros we could leverage for that.

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-17 16:20     ` Stefan Monnier
@ 2020-07-17 20:19       ` Zach Shaftel
  2020-07-17 22:08         ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Zach Shaftel @ 2020-07-17 20:19 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel

Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Great! I just followed up on my copyright assignment as I still haven't
>> finished that process. I don't know whether this could be exempt or if
>> Rocky's assignment is sufficient, but hopefully I will hear back from
>> copyright-clerk soon.
>
> While waiting for the paperwork to go through, you can prepare the patch
> and we can start discussing it.

Sure, does that just mean the 'git format-patch -1' emailed to
bug-gnu-emacs@gnu.org, as mentioned in CONTRIBUTE? If that's the gist of
it then I can do that shortly.

>> That seems to be the case.  I'll keep looking to see if there's any low
>> hanging fruit in terms of splitting up the funcall logic without slowing
>> things down.  More testing is necessary, but if a moderate chunk of
>> duplicated code is acceptable then there may not be as much work needed
>> on that branch as I had thought.
>
> It's a tradeoff, so it's hard to say what is acceptable without seeing
> the actual path along with the corresponding measurements of the
> performance impact.
>
>> Rough tests indicate it's about three times slower.
>
> Wow!  That's a lot less than I expected.  That makes it quite usable.

I was able to speed that function up to the point that it's about the
same as one using `read`. Those functions are doing a whole lot of IO
(reading and writing hundreds of files) so it's not really a fair
comparison. I've done more tests with functions that just read a whole
buffer, collecting what they read into a list. In a 9600 line file with
just over 500 sexps, the `read` version took about ~.02-.04 seconds
(according to `benchmark-run-compiled`), and the `source-map-read`
version took ~.08 seconds when it didn't GC, but unlike with `read` it
did cause a GC 10-20% of the time.

> This said, we'll probably still want to merge the feature into the
> C code simply to avoid the duplication (I expect that the Edebug reader
> has never been 100% faithful and that it has probably diverged over
> time).

I already had a couple minor issues with edebug's reader, though they
were easily remedied. It's pretty hard to hook into it since it relies
heavily on dynamic binding. So I agree edebug is not the ideal candidate
for a final product.

>> Removing the string did improve performance, but not by as much as I
>> expected. The function that constructs the tree of "children" may be
>> slower than it needs to be, so I'll look into improving that. It may not
>> be necessary to create the children for vectors since they're constants
>> (outside of backquote, at least).
>
> I think vectors are rare enough that the performance benefit of special
> casing them is not worth the downside of losing source-location info
> when it'd be beneficial.
>
>> Yes, and it won't be easy to maintain the read locations across
>> macroexpansion, byte-opt and cconv.
>
> For byte-opt and cconv it's mostly a question of labour.
>
> For macros, OTOH, it's really fundamentally hard (or impossible, in
> general).

Helmut Eller mentioned before that most macros do use at least some of
the original code in their expansion. A crude test with a pair of
hash-tables indicates 120/594 of the conses in the definition of
`cl-typep` remained after `macroexpand-all`ing. That's not much, but
considering how macro-laden the function is, that doesn't seem totally
impossible to work with.

> We could/should introduce some new way to define macros which
> knows about "source code annotated with locations".

I've wondered about this too but don't know what the right approach
would be. I doubt anyone would want to use something like
macro-cons/list/append etc. functions, and somehow modifying the
behavior of the real list operations during macroexpansion seems
impractical. I'm sure someone can come up with a good approach to this.

> There's a lot of work on Scheme macros we could leverage for that.

Interesting, so far I've had some difficulty finding documentation about
how other Lisps track source locations. If you know where I can find
some more details about that I'd love to take a look.

-Zach

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-17 20:19       ` Zach Shaftel
@ 2020-07-17 22:08         ` Stefan Monnier
  2020-07-18 21:41           ` Zach Shaftel
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2020-07-17 22:08 UTC (permalink / raw)
  To: Zach Shaftel; +Cc: emacs-devel

>> While waiting for the paperwork to go through, you can prepare the patch
>> and we can start discussing it.
> Sure, does that just mean the 'git format-patch -1' emailed to
> bug-gnu-emacs@gnu.org, as mentioned in CONTRIBUTE? If that's the gist of
> it then I can do that shortly.

Pretty much, yes.  You can add some text to give extra background on the
design, the motivation for some of the choices, or ask questions about
particular details, but that's not indispensable.

You can also send an email that just refers to a branch in emacs.git.
But for the discussion to work well, it's usually better to make sure
this branch is "small" so people aren't discouraged to read the large
diff ;-)

> I was able to speed that function up to the point that it's about the
> same as one using `read`. Those functions are doing a whole lot of IO
> (reading and writing hundreds of files) so it's not really a fair
> comparison. I've done more tests with functions that just read a whole
> buffer, collecting what they read into a list. In a 9600 line file with
> just over 500 sexps, the `read` version took about ~.02-.04 seconds
> (according to `benchmark-run-compiled`), and the `source-map-read`
> version took ~.08 seconds when it didn't GC, but unlike with `read` it
> did cause a GC 10-20% of the time.

IME when the time is in the sub-second range the measurements are very
imprecise, so better measure the time to repeat the same `read` N times
so the total time is a few seconds (and since it's the same `read`,
it won't suffer from extra IO overhead).

>> For macros, OTOH, it's really fundamentally hard (or impossible, in
>> general).
> Helmut Eller mentioned before that most macros do use at least some of
> the original code in their expansion.

We can definitely hope to use some heuristics that will preserve "most"
source info for "most" existing macros, yes.
But it's still a fundamentally impossible problem in general ;-)

>> We could/should introduce some new way to define macros which
>> knows about "source code annotated with locations".
> I've wondered about this too but don't know what the right approach
> would be.

The first step is to define a `defmacro2` which works like `defmacro`
but is defined to take as arguments (and to return) annotated-sexps
instead of "bare sexps".  It'll be less convenient to use, but

In Scheme "annotated sexps" are called "syntax objects".

> I doubt anyone would want to use something like macro-cons/list/append
> etc. functions,

Scheme avoids the problem by defining additional higher-level layers,
where macros are defined in a more restrictive way using templates, so
for most macros the programmer doesn't need to use care very much about
the difference between bare sexps and syntax objects.

The main motivation for it was hygiene (the framework takes care of
adding the needed `gensym`s where applicable) rather than tracking
source-location, but fundamentally the issue is the same: an AST node is
not just some random sexp.

IOW "code and data aren't quite the same, after all" ;-)

See for example `syntax-case` https://www.gnu.org/software/guile/manual/html_node/Syntax-Case.html
Note that Scheme uses the #' notation for syntax objects.  Adapting the
example for `when` to an Elisp syntax could look like:

    (defmacro2 when (form)
      (elisp-case form
        ((_ test e e* ...) (elisp (if test (progn e e* ...))))))

[ Where I used `elisp` instead of Scheme's `syntax` since we already use
  the prefix "syntax-" for things related to syntax-tables.  ]

Notice how it's `elisp-case` which extracts `test`, `e`, and `e*` and
then it's `syntax` which builds the new chunk of code, so all the
replacement of `car` with `elisp-car` can be hidden within the definition
of `elisp-case` and `elisp`.

>> There's a lot of work on Scheme macros we could leverage for that.
> Interesting, so far I've had some difficulty finding documentation about
> how other Lisps track source locations.

It's not really discussed, but the distinction between "sexp" and
"syntax object" is the key.  It's largely not discussed because Scheme
macros have never officially included the equivalent of `defmacro`
operating on raw sexps, so they've never really had to deal with the
issue (tho Gambit does provide a `define-macro` which operates like our
`defmacro` but it's rarely used so Gambit just punts on the
source-location issue in that case).

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-17 22:08         ` Stefan Monnier
@ 2020-07-18 21:41           ` Zach Shaftel
  2020-07-19  2:34             ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Zach Shaftel @ 2020-07-18 21:41 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel


Stefan Monnier <monnier@iro.umontreal.ca> writes:

>>> While waiting for the paperwork to go through, you can prepare the patch
>>> and we can start discussing it.
>> Sure, does that just mean the 'git format-patch -1' emailed to
>> bug-gnu-emacs@gnu.org, as mentioned in CONTRIBUTE? If that's the gist of
>> it then I can do that shortly.
>
> Pretty much, yes.  You can add some text to give extra background on the
> design, the motivation for some of the choices, or ask questions about
> particular details, but that's not indispensable.
>
> You can also send an email that just refers to a branch in emacs.git.
> But for the discussion to work well, it's usually better to make sure
> this branch is "small" so people aren't discouraged to read the large
> diff ;-)

Sounds good, I'll proceed with that once I make sure there are no issues
with that branch.

>> I was able to speed that function up to the point that it's about the
>> same as one using `read`. Those functions are doing a whole lot of IO
>> (reading and writing hundreds of files) so it's not really a fair
>> comparison. I've done more tests with functions that just read a whole
>> buffer, collecting what they read into a list. In a 9600 line file with
>> just over 500 sexps, the `read` version took about ~.02-.04 seconds
>> (according to `benchmark-run-compiled`), and the `source-map-read`
>> version took ~.08 seconds when it didn't GC, but unlike with `read` it
>> did cause a GC 10-20% of the time.
>
> IME when the time is in the sub-second range the measurements are very
> imprecise, so better measure the time to repeat the same `read` N times
> so the total time is a few seconds (and since it's the same `read`,
> it won't suffer from extra IO overhead).

Sure, I'll do some more exhaustive testing. So far though, the results
aren't great, the biggest issue being memory usage. The
`source-map-read` can GC over 5 times more often than `read`. Obviously
edebug isn't the answer. I could start trying to simplify and adapt the
useful bits of edebug for the source-map reader directly, but I think
it's more sensible to accept that a real implementation will have to be
in C and this reader will just remain a prototype.

>>> For macros, OTOH, it's really fundamentally hard (or impossible, in
>>> general).
>> Helmut Eller mentioned before that most macros do use at least some of
>> the original code in their expansion.
>
> We can definitely hope to use some heuristics that will preserve "most"
> source info for "most" existing macros, yes.
> But it's still a fundamentally impossible problem in general ;-)
>
>>> We could/should introduce some new way to define macros which
>>> knows about "source code annotated with locations".
>> I've wondered about this too but don't know what the right approach
>> would be.
>
> The first step is to define a `defmacro2` which works like `defmacro`
> but is defined to take as arguments (and to return) annotated-sexps
> instead of "bare sexps".  It'll be less convenient to use, but
>
> In Scheme "annotated sexps" are called "syntax objects".
>
>> I doubt anyone would want to use something like macro-cons/list/append
>> etc. functions,
>
> Scheme avoids the problem by defining additional higher-level layers,
> where macros are defined in a more restrictive way using templates, so
> for most macros the programmer doesn't need to use care very much about
> the difference between bare sexps and syntax objects.
>
> The main motivation for it was hygiene (the framework takes care of
> adding the needed `gensym`s where applicable) rather than tracking
> source-location, but fundamentally the issue is the same: an AST node is
> not just some random sexp.
>
> IOW "code and data aren't quite the same, after all" ;-)
>
> See for example `syntax-case` https://www.gnu.org/software/guile/manual/html_node/Syntax-Case.html
> Note that Scheme uses the #' notation for syntax objects.  Adapting the
> example for `when` to an Elisp syntax could look like:
>
>     (defmacro2 when (form)
>       (elisp-case form
>         ((_ test e e* ...) (elisp (if test (progn e e* ...))))))
>
> [ Where I used `elisp` instead of Scheme's `syntax` since we already use
>   the prefix "syntax-" for things related to syntax-tables.  ]
>
> Notice how it's `elisp-case` which extracts `test`, `e`, and `e*` and
> then it's `syntax` which builds the new chunk of code, so all the
> replacement of `car` with `elisp-car` can be hidden within the definition
> of `elisp-case` and `elisp`.

Aha, I had never even considered hygienic macros in Elisp (nor had I
recognized how trivial it is to track their source-code). That would be
an amazing development for Emacs Lisp, but is certainly a huge
undertaking, not something I could fit into the GSoC timeline. I know
that it has been done in Common Lisp (by Pascal Costanza), but I believe
that implementation serves the sole purpose of capture avoidance and
doesn't abstract syntax. For Emacs I assume this would have to be done
in C, but I do wonder if an Elisp implementation would be possible.

>>> There's a lot of work on Scheme macros we could leverage for that.
>> Interesting, so far I've had some difficulty finding documentation about
>> how other Lisps track source locations.
>
> It's not really discussed, but the distinction between "sexp" and
> "syntax object" is the key.  It's largely not discussed because Scheme
> macros have never officially included the equivalent of `defmacro`
> operating on raw sexps, so they've never really had to deal with the
> issue (tho Gambit does provide a `define-macro` which operates like our
> `defmacro` but it's rarely used so Gambit just punts on the
> source-location issue in that case).

Doing the similar thing in Elisp -- relegating source location tracking
to code using only a specialized kind of macro, hygienic or otherwise --
would of course be a major loss, since it would take years for that new
paradigm to become commonplace.

-Zach




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-18 21:41           ` Zach Shaftel
@ 2020-07-19  2:34             ` Stefan Monnier
  2020-07-21  0:28               ` Zach Shaftel
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Monnier @ 2020-07-19  2:34 UTC (permalink / raw)
  To: Zach Shaftel; +Cc: emacs-devel

> Sure, I'll do some more exhaustive testing. So far though, the results
> aren't great, the biggest issue being memory usage. The
> `source-map-read` can GC over 5 times more often than `read`.

Sounds fine for a prototype.

> I think it's more sensible to accept that a real implementation will
> have to be in C and this reader will just remain a prototype.

Indeed.

> Aha, I had never even considered hygienic macros in Elisp (nor had I
> recognized how trivial it is to track their source-code).  That would be
> an amazing development for Emacs Lisp, but is certainly a huge
> undertaking, not something I could fit into the GSoC timeline.

No, I'm just discussing what the longer-run might look like.

> I know that it has been done in Common Lisp (by Pascal Costanza), but
> I believe that implementation serves the sole purpose of capture
> avoidance and doesn't abstract syntax. For Emacs I assume this would
> have to be done in C, but I do wonder if an Elisp implementation would
> be possible.

I haven't thought very much about it, but I can't see any reason why it
would need to be done in C, no (tho I wouldn't be surprised if it could
benefit from a bit of help from the C side, of course).

> Doing the similar thing in Elisp -- relegating source location tracking
> to code using only a specialized kind of macro, hygienic or otherwise --
> would of course be a major loss, since it would take years for that new
> paradigm to become commonplace.

Indeed, we'll need some fallback heuristic for all the existing
`defmacro`s.

Part of the issue is "tracking source location" but another important
part is to take the annotated source code and "de-annotate" it
(recursively) to pass it to the macro, since the macro expects
a raw sexp.

That's why we've been thinking about annotated representations
which are "transparent" (i.e. can be used as if they weren't annotated).
Either using "fat cons-cells" or using "fat symbols" or storing the
annotations in an eq-hash-table.

Another way to attack the problem is to rely on the Edebug spec: you can
refrain from de-annotating all the parts marked as `form` or `body` (as
long as the annotations themselves look sufficiently like normal code,
at least).

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-19  2:34             ` Stefan Monnier
@ 2020-07-21  0:28               ` Zach Shaftel
  2020-07-21  2:51                 ` Stefan Monnier
  0 siblings, 1 reply; 15+ messages in thread
From: Zach Shaftel @ 2020-07-21  0:28 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: emacs-devel


Stefan Monnier <monnier@iro.umontreal.ca> writes:

>> Sure, I'll do some more exhaustive testing. So far though, the results
>> aren't great, the biggest issue being memory usage. The
>> `source-map-read` can GC over 5 times more often than `read`.
>
> Sounds fine for a prototype.
>
>> I think it's more sensible to accept that a real implementation will
>> have to be in C and this reader will just remain a prototype.
>
> Indeed.
>
>> Aha, I had never even considered hygienic macros in Elisp (nor had I
>> recognized how trivial it is to track their source-code).  That would be
>> an amazing development for Emacs Lisp, but is certainly a huge
>> undertaking, not something I could fit into the GSoC timeline.
>
> No, I'm just discussing what the longer-run might look like.
>
>> I know that it has been done in Common Lisp (by Pascal Costanza), but
>> I believe that implementation serves the sole purpose of capture
>> avoidance and doesn't abstract syntax. For Emacs I assume this would
>> have to be done in C, but I do wonder if an Elisp implementation would
>> be possible.
>
> I haven't thought very much about it, but I can't see any reason why it
> would need to be done in C, no (tho I wouldn't be surprised if it could
> benefit from a bit of help from the C side, of course).

This is my gut feeling as well, but in the few discussions I've seen
about implementing hygiene in an unhygienic macro system people suggest
that it's nigh impossible without rewriting core parts of the language.
I haven't seen a convincing defense of that argument in any of those
discussions though.

>> Doing the similar thing in Elisp -- relegating source location tracking
>> to code using only a specialized kind of macro, hygienic or otherwise --
>> would of course be a major loss, since it would take years for that new
>> paradigm to become commonplace.
>
> Indeed, we'll need some fallback heuristic for all the existing
> `defmacro`s.
>
> Part of the issue is "tracking source location" but another important
> part is to take the annotated source code and "de-annotate" it
> (recursively) to pass it to the macro, since the macro expects
> a raw sexp.
>
> That's why we've been thinking about annotated representations
> which are "transparent" (i.e. can be used as if they weren't annotated).
> Either using "fat cons-cells" or using "fat symbols" or storing the
> annotations in an eq-hash-table.

A hash-table seems like the most straightforward approach, which I'm
working on now. Doing something like in the scratch/accurate-warning-pos
branch adds a whole lot of complexity to both C and Lisp, and toggling
byte-compilation versions of subrs feels clunky to me (though that's
obviously just a prototype). A hash-table of conses will hopefully be
enough, with or without the `source-map` stuff.

> Another way to attack the problem is to rely on the Edebug spec: you can
> refrain from de-annotating all the parts marked as `form` or `body` (as
> long as the annotations themselves look sufficiently like normal code,
> at least).

Interesting, that's not something I had thought about. I suspect flawed
edebug specs are common enough that this can't be relied upon.


-Zach



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Update 1 on Bytecode Offset tracking
  2020-07-21  0:28               ` Zach Shaftel
@ 2020-07-21  2:51                 ` Stefan Monnier
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Monnier @ 2020-07-21  2:51 UTC (permalink / raw)
  To: Zach Shaftel; +Cc: emacs-devel

> A hash-table seems like the most straightforward approach, which I'm
> working on now. Doing something like in the scratch/accurate-warning-pos
> branch adds a whole lot of complexity to both C and Lisp, and toggling
> byte-compilation versions of subrs feels clunky to me (though that's
> obviously just a prototype). A hash-table of conses will hopefully be
> enough, with or without the `source-map` stuff.

Preliminary measurements suggested that the hash-table way would be too slow.
But I guess it depends on the granularity.
Maybe if we only put such annotations on the "head-cons-cells", it might
be cheap enough.

> Interesting, that's not something I had thought about.  I suspect flawed
> edebug specs are common enough that this can't be relied upon.

Actually, I don't think so: since Edebug rewrites all the parts marked
as `form` or `body` with significant changes, such errors are rather
unlikely.  The more frequent problems will be when the Edebug spec is
missing or failed to mark the relevant parts as `form` or `body`, in
which case we'd not only lose the annotations (which costs us extra work
to go and remove the annotations).  But it's a "safe" form of failure
(a kind of "graceful degradation").

        Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Update 2 on Bytecode Offset tracking
  2020-07-15 23:10 Update 1 on Bytecode Offset tracking Zach Shaftel
  2020-07-16  3:55 ` Stefan Monnier
  2020-07-16  7:25 ` Andrea Corallo via Emacs development discussions.
@ 2020-07-28 19:19 ` Zach Shaftel
  2 siblings, 0 replies; 15+ messages in thread
From: Zach Shaftel @ 2020-07-28 19:19 UTC (permalink / raw)
  To: emacs-devel

[-- Attachment #1: Type: text/html, Size: 12296 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2020-07-28 19:19 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-07-15 23:10 Update 1 on Bytecode Offset tracking Zach Shaftel
2020-07-16  3:55 ` Stefan Monnier
2020-07-16 22:45   ` Zach Shaftel
2020-07-17  3:44     ` Eli Zaretskii
2020-07-17 16:20     ` Stefan Monnier
2020-07-17 20:19       ` Zach Shaftel
2020-07-17 22:08         ` Stefan Monnier
2020-07-18 21:41           ` Zach Shaftel
2020-07-19  2:34             ` Stefan Monnier
2020-07-21  0:28               ` Zach Shaftel
2020-07-21  2:51                 ` Stefan Monnier
2020-07-16  7:25 ` Andrea Corallo via Emacs development discussions.
2020-07-17  0:24   ` Zach Shaftel
2020-07-17 13:47     ` Rocky Bernstein
2020-07-28 19:19 ` Update 2 " Zach Shaftel

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).