Thanks for looking over the changes and making suggestions.

On Sat, Jul 11, 2020 at 11:29 AM Tom Tromey wrote:

> >>>>> "Zach" == Zach Shaftel writes:
>
> Zach> Rocky has pushed the code to Savannah.  In the
> Zach> feature/zach-soc-funcall-from-bytecode branch
> [...]
> Zach> But the code on that branch is just a rough proof-of-concept and
> Zach> isn't 100% accurate.  Doing it right would require a lot of
> Zach> refactoring that could ultimately lead to a dead end.
>
> Could you say more specifically what is wrong on that branch?  What
> kind of refactoring is needed?
>
> I think other interpreters out there take this "self-call" approach,
> and it seems like a decent idea to try in Emacs.

Zach: the code added to exec_byte_code in that branch is almost
identical to the contents of Ffuncall, with the exception of the
special case for lexical bytecode.  That part is just a copy of parts
of funcall_lambda and fetch_and_exec_byte_code.  So far that doesn't
seem to be a problem for performance or functionality, but the code
duplication is obviously not ideal, and I don't see any way to avoid it
without some big changes to the whole funcall call graph.  That said,
Zach would love to have this in the bytecode interpreter, so he would
gladly make those changes if this ends up being the best way forward.
However, unless it is necessary now, it would probably be done after
the Summer of Code project, when we hope to have more of the complete
solution done.

> Zach> In the feature/zach-soc-bytecode-in-traceback branch, the offset
> Zach> is stored in the thread_state struct.  Prior to the most recent
> Zach> commit, the offset was then stashed in the backtrace specbinding
> Zach> frame from record_in_backtrace, so each bytecode call in the
> Zach> backtrace buffer was annotated with the offset when the error
> Zach> occurred.  This is the ultimate goal, but the current
> Zach> implementation is flawed and a significant source of slowdown.
> I took a brief look at it.  One thing I noticed is that the
> assignment is done in NEXT:
>
> +#define NEXT UPDATE_OFFSET; goto *(targets[op = FETCH])
>
> However, it seems to me that there are cases where this is not needed
> -- any bytecode that cannot possibly cause an exception doesn't need
> to record the offset.  For example:
>
>         CASE (Bvarref6):
>           op = FETCH;
>         varref:
>           {
>             Lisp_Object v1 = vectorp[op], v2;
>             if (!SYMBOLP (v1)
>                 || XSYMBOL (v1)->u.s.redirect != SYMBOL_PLAINVAL
>                 || (v2 = SYMBOL_VAL (XSYMBOL (v1)), EQ (v2, Qunbound)))
>               v2 = Fsymbol_value (v1);
>             PUSH (v2);
>             NEXT;
>           }
>
> Here, I think the updating only has to be done before the call to
> Fsymbol_value.
>
> Or:
>
>         CASE (Bgotoifnil):
>           {
>             Lisp_Object v1 = POP;
>             op = FETCH2;
>             if (NILP (v1))
>               goto op_branch;
>             NEXT;
>           }
>
> This opcode doesn't need to update the offset at all.
>
> This matters because the change is introducing an extra store into
> performance-sensitive code.

Absolutely correct.  There is about a 7% slowdown, and we've tried
various approaches, some that you've mentioned below and some that were
mentioned by Andrea Corallo -- in particular, storing bytestr_data in
the frame.  As a fallback position Andrea also suggested giving up on
the most recent offset and keeping offsets only in the backtrace.  We
think this is bad, because the most recent position is the one that is
most desired.  No matter what is done, there is going to be a loss of
performance unless we gain it back somewhere else, e.g. via a
bytecode-to-bytecode call, which is going to be complicated.

*Please, developers, we need some feedback here.*

1. What would a maximum acceptable slowdown be?

2. If we update the offset only where needed, there are going to be a
   massive number of changes, which will make review harder.  Will that
   diminish the chances of this being merged into master?

Note that if the code had saved offsets from the start (or had been
written less well), a 7% slowdown probably wouldn't be an issue.
If you were to take your other favorite programming language that
reports bytecode offsets and announce that you can speed up the
interpreter by 7% but lose backtrace locations, I suspect the community
wouldn't go for that.

> I had a couple of other ideas for how to micro-optimize this store.
> I don't know if they will work, but you might try them.
>
> The updating looks like this:
>
> +#define UPDATE_OFFSET (backtrace_byte_offset = pc - bytestr_data)
>
> If you also stored bytestr_data in the structure, then instead of
> doing the subtraction in the hot loop, you could do the subtraction
> when computing the location -- which happens more rarely, and which is
> not performance-sensitive.

Andrea suggested this too.  We will do this.

> Also, as you said, backtrace_byte_offset actually references a field
> in the thread state:
>
> +  /* The offset of the current op of the byte-code function being
> +     executed.  */
> +  int m_backtrace_byte_offset;
> +#define backtrace_byte_offset (current_thread->m_backtrace_byte_offset)
>
> You might try changing this to be a pointer, and have it point to a
> local variable that is updated by UPDATE_OFFSET.  This might be a bit
> faster due to cache effects.

That's a great idea; we will try it out.

> Note that you do *not* want to have it point to the actual "pc"
> variable -- if you take the address of "pc", the compiler will
> probably not put it in a register.  I mean, you might try it, but I
> would expect a performance loss in this case.

We saw this firsthand when we kept the pointers in the specbinding
frame.  I'm glad you clarified what was going on, because we had been
puzzled.

> Tom

Rocky and Zach