Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)

unofficial mirror of emacs-devel@gnu.org 
 help / color / mirror / code / Atom feed

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
@ 2020-03-31 17:07 Tuấn Anh Nguyễn
  2020-03-31 17:50 ` Eli Zaretskii
  0 siblings, 1 reply; 43+ messages in thread
From: Tuấn Anh Nguyễn @ 2020-03-31 17:07 UTC (permalink / raw)
  To: emacs-devel

> In any case, I hope that passing the buffer to tree-sitter doesn't
> involve marshalling the entire buffer text via a function call as a
> huge string, or some such.  We should instead request that tree-sitter
> exposes an API through which we could give it direct access to buffer
> text as 2 parts, before and after the gap, like we do with regex
> code.  Otherwise this will be a bottleneck in the long run, not unlike
> the problem we have with LSP.

It does support parsing through direct access. Which is why I wanted
dynamic modules to have direct access to buffer text.

>> How large is "very large" here?
>
> xdisp.c comes to mind, obviously.

On my machine, a 3.39 GHz Intel Core i7:

    (0.150791 0 0.0) ; 1 full parse
    (2.142236 5 0.6105190000000107) ; 10 full parses
    (0.015423 0 0.0) ; incremental parsing, after typing 1 character

--
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-03-31 17:07 Reliable after-change-functions (via: Using incremental parsing in Emacs) Tuấn Anh Nguyễn
@ 2020-03-31 17:50 ` Eli Zaretskii
  2020-04-01  6:17   ` Tuấn Anh Nguyễn
  0 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2020-03-31 17:50 UTC (permalink / raw)
  To: Tuấn Anh Nguyễn; +Cc: emacs-devel

> From: Tuấn Anh Nguyễn <ubolonton@gmail.com>
> Date: Wed, 1 Apr 2020 00:07:27 +0700
> 
> > xdisp.c comes to mind, obviously.
> 
> On my machine, a 3.39 GHz Intel Core i7:
> 
>     (0.150791 0 0.0) ; 1 full parse

How did you submit xdisp.c to the parser?

In any case, IIUC, the first time a buffer needs to be displayed, we
need to wait for these 150 msec?  That's annoyingly long (and I
suspect in real Emacs usage will be significantly longer, due to
memory allocation, encoding, etc.).



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-03-31 17:50 ` Eli Zaretskii
@ 2020-04-01  6:17   ` Tuấn Anh Nguyễn
  2020-04-01 13:26     ` Eli Zaretskii
  0 siblings, 1 reply; 43+ messages in thread
From: Tuấn Anh Nguyễn @ 2020-04-01  6:17 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1220 bytes --]

On Wed, Apr 1, 2020 at 12:50 AM Eli Zaretskii <eliz@gnu.org> wrote:

> > From: Tuấn Anh Nguyễn <ubolonton@gmail.com>
> > Date: Wed, 1 Apr 2020 00:07:27 +0700
> >
> > > xdisp.c comes to mind, obviously.
> >
> > On my machine, a 3.39 GHz Intel Core i7:
> >
> >     (0.150791 0 0.0) ; 1 full parse
>
> How did you submit xdisp.c to the parser?
>

    (with-current-buffer "xdisp.c"
      (let ((language (tree-sitter-require 'c))
            (parser (ts-make-parser)))
        (ts-set-language parser language)
        (garbage-collect)
        (message "%s" (benchmark-run (ts-parse parser #'ts-buffer-input
nil)))))


> In any case, IIUC, the first time a buffer needs to be displayed, we
> need to wait for these 150 msec?  That's annoyingly long (and I
> suspect in real Emacs usage will be significantly longer, due to
> memory allocation, encoding, etc.).
>

Real usage with "xdisp.c":

    (define-advice tree-sitter--do-parse (:around (f &rest args) benchmark)
      (message "%s" (benchmark-run (apply f args))))

    (0.257998 1 0.13326100000000096)

So yes, direct access to buffer's text from dynamic modules would be nice.

--
Tuấn-Anh Nguyễn
Software Engineer

[-- Attachment #2: Type: text/html, Size: 2127 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01  6:17   ` Tuấn Anh Nguyễn
@ 2020-04-01 13:26     ` Eli Zaretskii
  2020-04-01 15:47       ` Jorge Javier Araya Navarro
  2020-04-01 17:55       ` Tuấn-Anh Nguyễn
  0 siblings, 2 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-01 13:26 UTC (permalink / raw)
  To: Tuấn Anh Nguyễn; +Cc: emacs-devel

> From: Tuấn Anh Nguyễn <ubolonton@gmail.com>
> Date: Wed, 1 Apr 2020 13:17:42 +0700
> Cc: emacs-devel@gnu.org
> 
> Real usage with "xdisp.c":
> 
>     (define-advice tree-sitter--do-parse (:around (f &rest args) benchmark)
>       (message "%s" (benchmark-run (apply f args))))
> 
>     (0.257998 1 0.13326100000000096)

And that is even without encoding the buffer text, IIUC what the
package does.

> So yes, direct access to buffer's text from dynamic modules would be nice.

Did you consider using the API where an application can provide a
function to return text at a given offset?  Such a function could be
relatively easily implemented for Emacs.

Btw, what do you do with the tree returned by the tree-sitter parser?
store it in some buffer-local variable?  If so, how much memory does
such a tree take, and when, if ever, is that memory released?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 13:26     ` Eli Zaretskii
@ 2020-04-01 15:47       ` Jorge Javier Araya Navarro
  2020-04-01 16:07         ` Eli Zaretskii
  2020-04-01 17:55       ` Tuấn-Anh Nguyễn
  1 sibling, 1 reply; 43+ messages in thread
From: Jorge Javier Araya Navarro @ 2020-04-01 15:47 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Tuấn Anh Nguyễn, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 1326 bytes --]

> Did you consider using the API where an application can provide a
function to return text at a given offset?  Such a function could be
relatively easily implemented for Emacs.

But why not just allow access to buffers for dynamic modules, otherwise
what would be the point of dynamic modules?

El mié., 1 de abr. de 2020 a la(s) 07:26, Eli Zaretskii (eliz@gnu.org)
escribió:

> > From: Tuấn Anh Nguyễn <ubolonton@gmail.com>
> > Date: Wed, 1 Apr 2020 13:17:42 +0700
> > Cc: emacs-devel@gnu.org
> >
> > Real usage with "xdisp.c":
> >
> >     (define-advice tree-sitter--do-parse (:around (f &rest args)
> benchmark)
> >       (message "%s" (benchmark-run (apply f args))))
> >
> >     (0.257998 1 0.13326100000000096)
>
> And that is even without encoding the buffer text, IIUC what the
> package does.
>
> > So yes, direct access to buffer's text from dynamic modules would be
> nice.
>
> Did you consider using the API where an application can provide a
> function to return text at a given offset?  Such a function could be
> relatively easily implemented for Emacs.
>
> Btw, what do you do with the tree returned by the tree-sitter parser?
> store it in some buffer-local variable?  If so, how much memory does
> such a tree take, and when, if ever, is that memory released?
>
>

[-- Attachment #2: Type: text/html, Size: 2038 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 15:47       ` Jorge Javier Araya Navarro
@ 2020-04-01 16:07         ` Eli Zaretskii
  0 siblings, 0 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-01 16:07 UTC (permalink / raw)
  To: Jorge Javier Araya Navarro; +Cc: ubolonton, emacs-devel

> From: Jorge Javier Araya Navarro <jorge@esavara.cr>
> Date: Wed, 1 Apr 2020 09:47:48 -0600
> Cc: Tuấn Anh Nguyễn <ubolonton@gmail.com>, 
> 	emacs-devel@gnu.org
> 
> > Did you consider using the API where an application can provide a function to return text at a given offset? 
> Such a function could be relatively easily implemented for Emacs.
> 
> But why not just allow access to buffers for dynamic modules, otherwise what would be the point of dynamic
> modules?

These two are orthogonal issues: if we allow such access from modules,
will this particular module use it, and if so, how?



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 13:26     ` Eli Zaretskii
  2020-04-01 15:47       ` Jorge Javier Araya Navarro
@ 2020-04-01 17:55       ` Tuấn-Anh Nguyễn
  2020-04-01 19:33         ` Eli Zaretskii
  1 sibling, 1 reply; 43+ messages in thread
From: Tuấn-Anh Nguyễn @ 2020-04-01 17:55 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On Wed, Apr 1, 2020 at 8:26 PM Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Tuấn Anh Nguyễn <ubolonton@gmail.com>
> > Date: Wed, 1 Apr 2020 13:17:42 +0700
> > Cc: emacs-devel@gnu.org
> >
> > Real usage with "xdisp.c":
> >
> >     (define-advice tree-sitter--do-parse (:around (f &rest args) benchmark)
> >       (message "%s" (benchmark-run (apply f args))))
> >
> >     (0.257998 1 0.13326100000000096)
>
> And that is even without encoding the buffer text, IIUC what the
> package does.
>
> > So yes, direct access to buffer's text from dynamic modules would be nice.
>
> Did you consider using the API where an application can provide a
> function to return text at a given offset?  Such a function could be
> relatively easily implemented for Emacs.
>

I don't understand what you mean. Below I'll explain how it works
currently.

`ts-parse' uses the Tree-sitter's API that consumes text in chunks:

    TSTree *ts_parser_parse(
      TSParser *self,
      const TSTree *old_tree,
      TSInput input
    );

    typedef struct {
      void *payload;
      const char *(*read)(
        void *payload,
        uint32_t byte_offset,
        TSPoint position,
        uint32_t *bytes_read
      );
      TSInputEncoding encoding;
    } TSInput;

Because dynamic modules don't have direct access to buffer text,
`ts-parse' uses the module function `copy_string_contents', and exposes
this interface:

    (ts-parse PARSER INPUT-FUNCTION OLD-TREE)

Here INPUT-FUNCTION must return a chunk of the buffer text, starting
from the given byte offset, as a Lisp string. `ts-buffer-input' is one
such function.

So:

1. Chunks of the buffer text are copied into Lisp strings, through
   `buffer-substring-no-properties'.
2. These Lisp strings are copied into buffers of null-terminated utf-8
   bytes, through `copy_string_contents'.
3. All these temporary Lisp strings create GC pressure. In the xdisp.c
   example, it was 100ms for GC, in addition to 150ms for parsing.
4. emacs-module-rs has an automatic, blanket workaround for this bug
   https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31238. The workaround
   involves pairs of `make_global_ref' and `free_global_ref' calls, on
   all "suspected" `emacs_value's.

#4 can be avoided if emacs-module-rs allows selectively disabling the
blanket workaround. It's band-aid on top of band-aid, but at least it's
workable.

#3 can probably be alleviated by increasing the chunk size.

However, they are consequences of #1 and #2. If dynamic modules have
direct access to the buffer text, none of the above is an issue.

Such direct access can be enabled by something like this:

    char* (*access_buffer_text) (emacs_env *env,
                                 emacs_value buffer,
                                 ptrdiff_t byte_offset,
                                 ptrdiff_t *size_inout);

Of course, such an API would require extensive documentation on how it
must be used, to ensure safety and correctness.

> Btw, what do you do with the tree returned by the tree-sitter parser?
> store it in some buffer-local variable?  If so, how much memory does
> such a tree take, and when, if ever, is that memory released?
>

It's stored in a buffer-local variable. I haven't measured the memory
they take. Memory is released when the tree object is garbage-collected
(it's a `user-ptr').

--
Tuấn-Anh Nguyễn
Software Engineer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 17:55       ` Tuấn-Anh Nguyễn
@ 2020-04-01 19:33         ` Eli Zaretskii
  2020-04-01 23:38           ` Stephen Leake
  2020-04-02  4:21           ` Tuấn-Anh Nguyễn
  0 siblings, 2 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-01 19:33 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn; +Cc: emacs-devel

> From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> Date: Thu, 2 Apr 2020 00:55:45 +0700
> Cc: emacs-devel@gnu.org
> 
> > Did you consider using the API where an application can provide a
> > function to return text at a given offset?  Such a function could be
> > relatively easily implemented for Emacs.
> >
> 
> I don't understand what you mean. Below I'll explain how it works
> currently.  [...]  If dynamic modules have direct access to the
> buffer text, none of the above is an issue.
> 
> Such direct access can be enabled by something like this:
> 
>     char* (*access_buffer_text) (emacs_env *env,
>                                  emacs_value buffer,
>                                  ptrdiff_t byte_offset,
>                                  ptrdiff_t *size_inout);
> 
> Of course, such an API would require extensive documentation on how it
> must be used, to ensure safety and correctness.

I think you are moving too fast, and keep the current implementation
in sight too much.

What I suggest is to step back and see how such direct access, if it
were available, could be used with tree-sitter.  Let's forget about
modules for a moment and consider tree-sitter linked with Emacs and
capable of calling any C function in core.  How would you use that?

Buffer text is not exactly UTF-8, it's a superset of UTF-8.  So one
question to answer is what to do with byte sequences that are not
valid UTF-8.  Any suggestions or ideas?  How does tree-sitter handle
invalid byte sequences in general?

Also, direct access to buffer text generally means we must make sure
GC never runs as long as pointers to buffer text are lying around.
Can any Lisp run between calls to the reader function that the
tree-sitter parser calls to access the buffer text?  If so, we need to
take care of that issue.

Next, I'm still asking whether parsing the whole buffer when it is
first created is necessary.  Can we pass to the parser just a small
chunk (say, 500 bytes) of the buffer around the window-full to be
displayed next?  If this presents problems, what are those problems?

IOW, the issue with exposing access to buffer text to modules is IMO
secondary.  My suggestion is first to figure out how to do this stuff
efficiently from within Emacs itself, as if the module interface were
not part of the equation.  We can add that aspect back later.

And yes, doing this by consing strings is not a good idea, it will
slow things down and cause a lot of GC.  It is best avoided.  Thus my
questions above.

> > Btw, what do you do with the tree returned by the tree-sitter parser?
> > store it in some buffer-local variable?  If so, how much memory does
> > such a tree take, and when, if ever, is that memory released?
> >
> 
> It's stored in a buffer-local variable. I haven't measured the memory
> they take. Memory is released when the tree object is garbage-collected
> (it's a `user-ptr').

So if I have many hundreds of buffers, I could have such a tree in
each one of them indefinitely?  Perhaps that's one more design issue
to consider, given that the parsing is so fast.  Similar to what we do
with image and face caches -- we flush them from time to time, to keep
the memory footprint in check.  So a buffer that was not current more
than some time interval ago could have its tree GCed.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 19:33         ` Eli Zaretskii
@ 2020-04-01 23:38           ` Stephen Leake
  2020-04-02  0:25             ` Stephen Leake
                               ` (3 more replies)
  2020-04-02  4:21           ` Tuấn-Anh Nguyễn
  1 sibling, 4 replies; 43+ messages in thread
From: Stephen Leake @ 2020-04-01 23:38 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

> Also, direct access to buffer text generally means we must make sure
> GC never runs as long as pointers to buffer text are lying around.
> Can any Lisp run between calls to the reader function that the
> tree-sitter parser calls to access the buffer text?  

If the parser copies the text into an internal buffer, that reader
function should only be called once per call to the parser. Parsers used
to based around small buffers that would read in a file a chunk at a
time, but that is not necessary on any machine that can run Emacs. Since
Emacs has the entire file in memory, the parser can too.

However, if we are really trying to avoid copying text (which is very
premature optimization), then the reader function will be called many
times during parsing (to fetch each word), and possibly during the
grammar actions (to compute indent or face).

> Next, I'm still asking whether parsing the whole buffer when it is
> first created is necessary.  

To some extent, that depends on the language.

The parser must be able to complete a parse, to generate a complete
syntax tree. I'll assume no error correction for a moment; more below.

In C or C++ body files, "a complete parse" is typically one variable or
function declaration. So if Emacs can reliably find the beginning and
end of those declarations, it could pass just the ones containing the
region of interest to the parser. Tree-sitter (if it supports this at
all, or is modified to) would end up with a forest of small parse trees,
rather than one large one. They might get merged if large chunks of text
are parsed together.

In Ada and Java, and most C++ header files, "a complete parse" is a
file; it contains an Ada package spec or body, or a Java or C++ class,
or a C++ namespace.

There are many "small languages" for which "a complete parse" is similar
to a "statement". Bash shell, for example. They could pass just the
statement, but only if Emacs can reliably find the start and end (not
always easy).

It is also possible to modify the language grammar to allow smaller
pieces of code to be a complete parse; ada-mode does this, making a
single declaration or statement "a complete parse", in order to support
"partial parse". That can easily lead to errors in indent, since the
indent of the start of the text portion is unknown (ada-mode simply
assumes it is correct in the buffer).

Another reason to allow smaller code chunks to be a complete parse is to
allow parsing the code fragments that appear in grammar actions; the
ELPA package wisitoken-grammar-mode uses this for wisitoken grammar
files with Ada actions.

In sum, the short answer is "yes, you must parse the whole file, unless
your language is particularly simple".

Since we need to support the worst case, we should assume the whole file
must be parsed at least once.

> Can we pass to the parser just a small chunk (say, 500 bytes) of the
> buffer around the window-full to be displayed next? If this presents
> problems, what are those problems?

In wisi, the error correction code will fill in the missing text so a
complete parse is possible. Since some of that is guesses, the results
may not be very good. Tree-sitter also has error correction; I'm not
clear how good it is.

> IOW, the issue with exposing access to buffer text to modules is IMO
> secondary.  

yes, because copying text is fast compared to everything else going on.

> My suggestion is first to figure out how to do this stuff efficiently
> from within Emacs itself, as if the module interface were not part of
> the equation. We can add that aspect back later.

There are two times the wisi code that wraps the parser needs access to
the buffer; first to copy the text, second to add text properties
(faces, indent values, navigation markers). There are usually many text
properties output by each parse.

The positions and values of the text properties are computed by
functions that run after the complete syntax tree has been produced. In
wisi, those functions are added directly in the grammar source file
(where they are called "post-parse grammar actions"). In tree-sitter, I
assume they are called from some mode-author-written code that traverses
the syntax tree (wisi provides that internally). Except I see below that
the emacs tree-sitter package stores the syntax tree in the buffer.

One option here is to try to standardize on an elisp representation of a
syntax tree, and have both the wisi and tree-sitter parsers provide
that. Then the grammar actions could be implemented in elisp. I suspect
that would be very slow; elisp is just not good at traversing large
complex data structures. That is not just my bias showing (I _much_
prefer doing as much as possible in Ada); I first wrote the ada-mode
parser and grammar actions in elisp, and then did a complete rewrite in
Ada, gaining significant speed. Although I never considered passing the
syntax tree to elisp as a single object, so maybe that could work well.

There is no universal standard for representing "a syntax tree". In
wisi, the tree is directly produced by the LR shift and reduce
operations, and thus is very close the the grammar expressed in BNF. I
don't know what the tree-sitter parse tree looks like. AdaCore provides
a parser similar in purpose to the wisi parsers
(https://github.com/AdaCore/libadalang), that also does more of what an
Ada compiler does (which could allow even better font-lock and navigation).
To support those additional operations, the syntax tree is quite
different from the ada-mode one.

In general, each parser library, and even each grammar author, will have
different representations for the syntax tree.

So if we want to support different parsers, I think it is best to define
the Emacs "parser API" as "give text to parser; accept text properties
from parser".

LSP (via eglot) provides other things the parser can return; code
completion menus, for example. And for indent and face, it returns
formatted text with markdown. I plan to translate that to text
properties to integrate LSP into wisi. Whether LSP requires a full
initial parse is up to the LSP server author (LSP itself provides both
"here's the full text" and "here's partial text" messages); they have
the same considerations discussed above.

> And yes, doing this by consing strings is not a good idea, it will
> slow things down and cause a lot of GC.  It is best avoided.  Thus my
> questions above.

I'm not sure how "convert syntax tree to elisp" compares to "consing
strings". I would certainly expect it to cause a lot of GC.

>> > Btw, what do you do with the tree returned by the tree-sitter parser?
>> > store it in some buffer-local variable?  If so, how much memory does
>> > such a tree take, and when, if ever, is that memory released?
>> >
>> 
>> It's stored in a buffer-local variable. I haven't measured the memory
>> they take. Memory is released when the tree object is garbage-collected
>> (it's a `user-ptr').

Is it an elisp structure (or accesible from elisp)? Have you written
code that traverses it to provide faces and indentation?

-- 
-- Stephe

PS; I have the beginnings of a migraine while typing this, so some of it
may not make sense. Sigh.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 23:38           ` Stephen Leake
@ 2020-04-02  0:25             ` Stephen Leake
  2020-04-02  2:46             ` Stefan Monnier
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 43+ messages in thread
From: Stephen Leake @ 2020-04-02  0:25 UTC (permalink / raw)
  To: emacs-devel

I looked at the tree-sitter source in git-hub
(https://github.com/ubolonton/emacs-tree-sitter) and the tree-sitter doc
that points to (https://tree-sitter.github.io/tree-sitter/using-parsers)

Stephen Leake <stephen_leake@stephe-leake.org> writes:

>>> > Btw, what do you do with the tree returned by the tree-sitter parser?
>>> > store it in some buffer-local variable?  If so, how much memory does
>>> > such a tree take, and when, if ever, is that memory released?
>>> >
>>> 
>>> It's stored in a buffer-local variable. I haven't measured the memory
>>> they take. Memory is released when the tree object is garbage-collected
>>> (it's a `user-ptr').
>

> Is it an elisp structure (or accesible from elisp)? 

It's a Rust structure; there is an emacs module providing elisp access
to it (things like "find syntax tree node at point", "get parent node",
"get node text").

The syntax tree is a "concrete syntax tree"; it should be quite close to
the wisi syntax tree.

> Have you written code that traverses it to provide faces and
> indentation?

Not in that repository.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 23:38           ` Stephen Leake
  2020-04-02  0:25             ` Stephen Leake
@ 2020-04-02  2:46             ` Stefan Monnier
  2020-04-02  4:36               ` Tuấn-Anh Nguyễn
  2020-04-02 14:44               ` Eli Zaretskii
  2020-04-02  5:21             ` Tuấn-Anh Nguyễn
  2020-04-02 14:36             ` Eli Zaretskii
  3 siblings, 2 replies; 43+ messages in thread
From: Stefan Monnier @ 2020-04-02  2:46 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> In C or C++ body files, "a complete parse" is typically one variable or
> function declaration. So if Emacs can reliably find the beginning and
> end of those declarations,

IIUC, a large part of CC-mode's trouble is exactly the need to find
somewhat reliably a position vaguely like "the beginning of
a declaration".

It's very much a non-trivial problem (and in the general case to
properly handle all possible comments you need to start parsing from
point-min).

>> And yes, doing this by consing strings is not a good idea, it will
>> slow things down and cause a lot of GC.  It is best avoided.  Thus my
>> questions above.
> I'm not sure how "convert syntax tree to elisp" compares to "consing
> strings". I would certainly expect it to cause a lot of GC.

If the GC is the worry, we can use a function which encodes the
buffer using a given coding-system and returns a malloc'd array of bytes.

>>> It's stored in a buffer-local variable. I haven't measured the memory
>>> they take. Memory is released when the tree object is garbage-collected
>>> (it's a `user-ptr').
> Is it an elisp structure (or accesible from elisp)? Have you written
> code that traverses it to provide faces and indentation?

According to https://github.com/tree-sitter/tree-sitter/issues/222 the
parse tree takes around 10 times the size of the source text.  At least
that's for tree-sitter's own parse-tree; not sure how that relates to
emacs-tree-sitter's yet.


        Stefan




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 19:33         ` Eli Zaretskii
  2020-04-01 23:38           ` Stephen Leake
@ 2020-04-02  4:21           ` Tuấn-Anh Nguyễn
  2020-04-02  5:19             ` Jorge Javier Araya Navarro
                               ` (3 more replies)
  1 sibling, 4 replies; 43+ messages in thread
From: Tuấn-Anh Nguyễn @ 2020-04-02  4:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On Thu, Apr 2, 2020 at 2:33 AM Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> > Date: Thu, 2 Apr 2020 00:55:45 +0700
> > Cc: emacs-devel@gnu.org
> >
> > > Did you consider using the API where an application can provide a
> > > function to return text at a given offset?  Such a function could be
> > > relatively easily implemented for Emacs.
> > >
> >
> > I don't understand what you mean. Below I'll explain how it works
> > currently.  [...]  If dynamic modules have direct access to the
> > buffer text, none of the above is an issue.
> >
> > Such direct access can be enabled by something like this:
> >
> >     char* (*access_buffer_text) (emacs_env *env,
> >                                  emacs_value buffer,
> >                                  ptrdiff_t byte_offset,
> >                                  ptrdiff_t *size_inout);
> >
> > Of course, such an API would require extensive documentation on how it
> > must be used, to ensure safety and correctness.
>
> I think you are moving too fast, and keep the current implementation
> in sight too much.
>

I'm actually moving too slow here. I have thought about this part quite
a bit, but I'm currently focusing on other things, partially because
this is not painful bottleneck.

> What I suggest is to step back and see how such direct access, if it
> were available, could be used with tree-sitter.  Let's forget about
> modules for a moment and consider tree-sitter linked with Emacs and
> capable of calling any C function in core.  How would you use that?
>
> Buffer text is not exactly UTF-8, it's a superset of UTF-8.  So one
> question to answer is what to do with byte sequences that are not
> valid UTF-8.  Any suggestions or ideas?  How does tree-sitter handle
> invalid byte sequences in general?
>

I haven't checked yet. It will probably bail out, which is usually the
desired behavior. The tree-sitter's author is likely open to making this
behavior configurable here, though. Alternatively, the direct access
function can offer different behaviors: as-is, bail-out, skip-over, or
null-out (tree-sitter will skip over null bytes, IIRC).

> Also, direct access to buffer text generally means we must make sure
> GC never runs as long as pointers to buffer text are lying around.
> Can any Lisp run between calls to the reader function that the
> tree-sitter parser calls to access the buffer text?  If so, we need to
> take care of that issue.
>

With direct access, no Lisp code will be run between these calls.

> Next, I'm still asking whether parsing the whole buffer when it is
> first created is necessary.  Can we pass to the parser just a small
> chunk (say, 500 bytes) of the buffer around the window-full to be
> displayed next?  If this presents problems, what are those problems?
>

In principle (not in tree-sitter ATM), and in very specific cases, yes.
IMO that's the wrong focus on a premature optimization anyway. As others
noted, even in the pathological case of xdisp.c, the performance is
acceptable. Also keep in mind that syntax highlighting is just one
application. Other use cases usually want a full parse tree.

If we really want to tackle this issue, there are other approaches to
consider, e.g. background parsing, or parsing up until a time limit, and
resume parsing when Emacs is idle. Tree-sitter's API supports the
latter.

But again, both thought exercises and my usage so far point to this
being a non-issue.

> IOW, the issue with exposing access to buffer text to modules is IMO
> secondary.  My suggestion is first to figure out how to do this stuff
> efficiently from within Emacs itself, as if the module interface were
> not part of the equation.  We can add that aspect back later.
>

My opinion is that it's better to experiment with this kind of stuff
out-of-core. It can move forward faster that way, allowing more lessons
to be learned. Real lessons, involving real-world use cases, not thought
exercises.

In a somewhat similar vein, writing emacs-tree-sitter highlighted real
issues with dynamic modules, which I'm going to write up sometime.

> And yes, doing this by consing strings is not a good idea, it will
> slow things down and cause a lot of GC.  It is best avoided.  Thus my
> questions above.
>
> > > Btw, what do you do with the tree returned by the tree-sitter parser?
> > > store it in some buffer-local variable?  If so, how much memory does
> > > such a tree take, and when, if ever, is that memory released?
> > >
> >
> > It's stored in a buffer-local variable. I haven't measured the memory
> > they take. Memory is released when the tree object is garbage-collected
> > (it's a `user-ptr').
>
> So if I have many hundreds of buffers, I could have such a tree in
> each one of them indefinitely?  Perhaps that's one more design issue
> to consider, given that the parsing is so fast.  Similar to what we do
> with image and face caches -- we flush them from time to time, to keep
> the memory footprint in check.  So a buffer that was not current more
> than some time interval ago could have its tree GCed.
>

That can work. Alternatively, tree-sitter can add support for "folding"
subtrees, as Stefan suggested.

--
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02  2:46             ` Stefan Monnier
@ 2020-04-02  4:36               ` Tuấn-Anh Nguyễn
  2020-04-02 14:44               ` Eli Zaretskii
  1 sibling, 0 replies; 43+ messages in thread
From: Tuấn-Anh Nguyễn @ 2020-04-02  4:36 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: Stephen Leake, emacs-devel

> If the GC is the worry, we can use a function which encodes the
> buffer using a given coding-system and returns a malloc'd array of bytes.
>

If we are talking about a function exposed to dynamic modules, then we
will also need to expose another function to free that byte array,
because the dynamic module may use a different allocator. It's probably
better to ask the caller to prepare that array, like what
`copy_string_contents' does.

> >>> It's stored in a buffer-local variable. I haven't measured the memory
> >>> they take. Memory is released when the tree object is garbage-collected
> >>> (it's a `user-ptr').
> > Is it an elisp structure (or accesible from elisp)? Have you written
> > code that traverses it to provide faces and indentation?
>
> According to https://github.com/tree-sitter/tree-sitter/issues/222 the
> parse tree takes around 10 times the size of the source text.  At least
> that's for tree-sitter's own parse-tree; not sure how that relates to
> emacs-tree-sitter's yet.
>

emacs-tree-sitter adds 16 bytes for reference counting and 8 bytes for
checking concurrent modifications (because nodes are also exposed to
Lisp as objects). That's negligible I think.

--
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02  4:21           ` Tuấn-Anh Nguyễn
@ 2020-04-02  5:19             ` Jorge Javier Araya Navarro
  2020-04-02  9:29               ` Stephen Leake
  2020-04-02 10:37             ` Andrea Corallo
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Jorge Javier Araya Navarro @ 2020-04-02  5:19 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn; +Cc: Eli Zaretskii, emacs-devel

[-- Attachment #1: Type: text/plain, Size: 5897 bytes --]

>  Also keep in mind that syntax highlighting is just one
application. Other use cases usually want a full parse tree.

like indentation, or so I think 🤔, but indentation may be one of those use
cases.

El mié., 1 de abril de 2020 22:22, Tuấn-Anh Nguyễn <ubolonton@gmail.com>
escribió:

> On Thu, Apr 2, 2020 at 2:33 AM Eli Zaretskii <eliz@gnu.org> wrote:
> >
> > > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> > > Date: Thu, 2 Apr 2020 00:55:45 +0700
> > > Cc: emacs-devel@gnu.org
> > >
> > > > Did you consider using the API where an application can provide a
> > > > function to return text at a given offset?  Such a function could be
> > > > relatively easily implemented for Emacs.
> > > >
> > >
> > > I don't understand what you mean. Below I'll explain how it works
> > > currently.  [...]  If dynamic modules have direct access to the
> > > buffer text, none of the above is an issue.
> > >
> > > Such direct access can be enabled by something like this:
> > >
> > >     char* (*access_buffer_text) (emacs_env *env,
> > >                                  emacs_value buffer,
> > >                                  ptrdiff_t byte_offset,
> > >                                  ptrdiff_t *size_inout);
> > >
> > > Of course, such an API would require extensive documentation on how it
> > > must be used, to ensure safety and correctness.
> >
> > I think you are moving too fast, and keep the current implementation
> > in sight too much.
> >
>
> I'm actually moving too slow here. I have thought about this part quite
> a bit, but I'm currently focusing on other things, partially because
> this is not painful bottleneck.
>
> > What I suggest is to step back and see how such direct access, if it
> > were available, could be used with tree-sitter.  Let's forget about
> > modules for a moment and consider tree-sitter linked with Emacs and
> > capable of calling any C function in core.  How would you use that?
> >
> > Buffer text is not exactly UTF-8, it's a superset of UTF-8.  So one
> > question to answer is what to do with byte sequences that are not
> > valid UTF-8.  Any suggestions or ideas?  How does tree-sitter handle
> > invalid byte sequences in general?
> >
>
> I haven't checked yet. It will probably bail out, which is usually the
> desired behavior. The tree-sitter's author is likely open to making this
> behavior configurable here, though. Alternatively, the direct access
> function can offer different behaviors: as-is, bail-out, skip-over, or
> null-out (tree-sitter will skip over null bytes, IIRC).
>
> > Also, direct access to buffer text generally means we must make sure
> > GC never runs as long as pointers to buffer text are lying around.
> > Can any Lisp run between calls to the reader function that the
> > tree-sitter parser calls to access the buffer text?  If so, we need to
> > take care of that issue.
> >
>
> With direct access, no Lisp code will be run between these calls.
>
> > Next, I'm still asking whether parsing the whole buffer when it is
> > first created is necessary.  Can we pass to the parser just a small
> > chunk (say, 500 bytes) of the buffer around the window-full to be
> > displayed next?  If this presents problems, what are those problems?
> >
>
> In principle (not in tree-sitter ATM), and in very specific cases, yes.
> IMO that's the wrong focus on a premature optimization anyway. As others
> noted, even in the pathological case of xdisp.c, the performance is
> acceptable. Also keep in mind that syntax highlighting is just one
> application. Other use cases usually want a full parse tree.
>
> If we really want to tackle this issue, there are other approaches to
> consider, e.g. background parsing, or parsing up until a time limit, and
> resume parsing when Emacs is idle. Tree-sitter's API supports the
> latter.
>
> But again, both thought exercises and my usage so far point to this
> being a non-issue.
>
> > IOW, the issue with exposing access to buffer text to modules is IMO
> > secondary.  My suggestion is first to figure out how to do this stuff
> > efficiently from within Emacs itself, as if the module interface were
> > not part of the equation.  We can add that aspect back later.
> >
>
> My opinion is that it's better to experiment with this kind of stuff
> out-of-core. It can move forward faster that way, allowing more lessons
> to be learned. Real lessons, involving real-world use cases, not thought
> exercises.
>
> In a somewhat similar vein, writing emacs-tree-sitter highlighted real
> issues with dynamic modules, which I'm going to write up sometime.
>
> > And yes, doing this by consing strings is not a good idea, it will
> > slow things down and cause a lot of GC.  It is best avoided.  Thus my
> > questions above.
> >
> > > > Btw, what do you do with the tree returned by the tree-sitter parser?
> > > > store it in some buffer-local variable?  If so, how much memory does
> > > > such a tree take, and when, if ever, is that memory released?
> > > >
> > >
> > > It's stored in a buffer-local variable. I haven't measured the memory
> > > they take. Memory is released when the tree object is garbage-collected
> > > (it's a `user-ptr').
> >
> > So if I have many hundreds of buffers, I could have such a tree in
> > each one of them indefinitely?  Perhaps that's one more design issue
> > to consider, given that the parsing is so fast.  Similar to what we do
> > with image and face caches -- we flush them from time to time, to keep
> > the memory footprint in check.  So a buffer that was not current more
> > than some time interval ago could have its tree GCed.
> >
>
> That can work. Alternatively, tree-sitter can add support for "folding"
> subtrees, as Stefan suggested.
>
> --
> Tuấn-Anh Nguyễn
> Software Engineer
>
>

[-- Attachment #2: Type: text/html, Size: 7465 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 23:38           ` Stephen Leake
  2020-04-02  0:25             ` Stephen Leake
  2020-04-02  2:46             ` Stefan Monnier
@ 2020-04-02  5:21             ` Tuấn-Anh Nguyễn
  2020-04-02  9:24               ` [SPAM UNSURE] " Stephen Leake
  2020-04-02 14:36             ` Eli Zaretskii
  3 siblings, 1 reply; 43+ messages in thread
From: Tuấn-Anh Nguyễn @ 2020-04-02  5:21 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> > My suggestion is first to figure out how to do this stuff efficiently
> > from within Emacs itself, as if the module interface were not part of
> > the equation. We can add that aspect back later.
>
> There are two times the wisi code that wraps the parser needs access to
> the buffer; first to copy the text, second to add text properties
> (faces, indent values, navigation markers). There are usually many text
> properties output by each parse.
>
> The positions and values of the text properties are computed by
> functions that run after the complete syntax tree has been produced. In
> wisi, those functions are added directly in the grammar source file
> (where they are called "post-parse grammar actions"). In tree-sitter, I
> assume they are called from some mode-author-written code that traverses
> the syntax tree (wisi provides that internally). Except I see below that
> the emacs tree-sitter package stores the syntax tree in the buffer.
>

The preferred approach with tree-sitter is querying:
https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries

--
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02  5:21             ` Tuấn-Anh Nguyễn
@ 2020-04-02  9:24               ` Stephen Leake
  0 siblings, 0 replies; 43+ messages in thread
From: Stephen Leake @ 2020-04-02  9:24 UTC (permalink / raw)
  To: emacs-devel

Tuấn-Anh Nguyễn <ubolonton@gmail.com> writes:

>> > My suggestion is first to figure out how to do this stuff efficiently
>> > from within Emacs itself, as if the module interface were not part of
>> > the equation. We can add that aspect back later.
>>
>> There are two times the wisi code that wraps the parser needs access to
>> the buffer; first to copy the text, second to add text properties
>> (faces, indent values, navigation markers). There are usually many text
>> properties output by each parse.
>>
>> The positions and values of the text properties are computed by
>> functions that run after the complete syntax tree has been produced. In
>> wisi, those functions are added directly in the grammar source file
>> (where they are called "post-parse grammar actions"). In tree-sitter, I
>> assume they are called from some mode-author-written code that traverses
>> the syntax tree (wisi provides that internally). Except I see below that
>> the emacs tree-sitter package stores the syntax tree in the buffer.
>>
>
> The preferred approach with tree-sitter is querying:
> https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries

For access to the syntax tree, yes. There must be code somewhere that
computes face, indent and navigation (and code completion, ...). That
code will build on top of the syntax tree access; it could be in Rust
(in the module) or in elisp (using the module functions).

Or in C linked directly into Emacs, as Eli suggests. But I don't think
he meant that as an actual implementation approach, just as a design
approach.

The wisi Ada code that computes the text properties accesses the syntax
tree more directly, but that's just an implementation detail.

I think it makes sense at this point to try to merge wisi and
emacs-tree-sitter. There are several approaches:

1. rewrite the wisi grammar actions in elisp, using the
emacs-tree-sitter module functions to access the syntax tree.

2. rewrite the wisi grammar actions in Rust, using Rust functions to
access the syntax tree

3. rewrite the emacs-tree-sitter module in Ada, using an Ada binding to
the Tree-Sitter C API. Then the Emacs module would provide the current
wisi Ada code, modified to work with a Tree-Sitter parser.

4. There would also be value in doing an independent design and
implementation of code to compute face, indent and navigation using the
tree-sitter syntax tree; there might be a better approach than what wisi
does.

1 is probably the quickest path to getting something working, but 2 or 3
will probably provide faster execution time. Ideally we'd do all three
(or four) and get some good metrics.

After doing one of the above, we must still write the calls to the
grammar actions for each language of interest. In wisi, this is done by
adding grammar actions to the grammar source code; for example, here is
the indent action for the Ada 'if then end if' statement:

if_statement
  : IF expression_opt THEN sequence_of_statements_opt END IF SEMICOLON
    %((wisi-indent-action [nil [(wisi-hanging% ada-indent-broken (* 2 ada-indent-broken))
                              ada-indent-broken]
                             nil
                             [ada-indent ada-indent] nil nil nil]))%

There is one lisp form for each token in the grammar production.

IF is not indented by this action; it is indented by the enclosing Ada
statement.

The conditional expression is indented by wisi-hanging; comments within
the expression (assuming it is multi-line) are indented by
ada-indent-broken. wisi-hanging takes care of indenting the second line
in a long expression.

THEN, END IF SEMICOLON are not indented. The statements in the true
branch are indented by ada-indent.

If ada-indent is 3, ada-broken-indent 2, this produces:

     if a or
        b
        -- a comment
    then
       statement_1;
       statement_2;
       --  another comment
    end if;

In the upstream development repository for the wisi package
(https://savannah.nongnu.org/projects/ada-mode/), there is a user guide
to the grammar actions. I can provide an html or info version on
request. (or get my act together and do another wisi/ada-mode release).

In Tree-Sitter, the calls to the grammar actions are written in code
that traverses the syntax tree; this would be a higher level elisp or
Rust function, for 1 and 2 above.

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02  5:19             ` Jorge Javier Araya Navarro
@ 2020-04-02  9:29               ` Stephen Leake
  0 siblings, 0 replies; 43+ messages in thread
From: Stephen Leake @ 2020-04-02  9:29 UTC (permalink / raw)
  To: emacs-devel

Jorge Javier Araya Navarro <jorge@esavara.cr> writes:

>>  Also keep in mind that syntax highlighting is just one
> application. Other use cases usually want a full parse tree.
>
> like indentation, or so I think 🤔, but indentation may be one of those use
> cases.

To correctly compute indentation for Ada code, you need to parse the
full file initially. After than, indent-region to indent edited code
fits nicely with incremental parse.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02  4:21           ` Tuấn-Anh Nguyễn
  2020-04-02  5:19             ` Jorge Javier Araya Navarro
@ 2020-04-02 10:37             ` Andrea Corallo
  2020-04-02 11:14               ` Tuấn-Anh Nguyễn
  2020-04-02 13:02             ` Stefan Monnier
  2020-04-02 15:02             ` Eli Zaretskii
  3 siblings, 1 reply; 43+ messages in thread
From: Andrea Corallo @ 2020-04-02 10:37 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn; +Cc: Eli Zaretskii, emacs-devel

Tuấn-Anh Nguyễn <ubolonton@gmail.com> writes:

> In principle (not in tree-sitter ATM), and in very specific cases, yes.
> IMO that's the wrong focus on a premature optimization anyway. As others
> noted, even in the pathological case of xdisp.c, the performance is
> acceptable. 

Please do not assume xdisp.c is the worst case scenario, I can testify
it is not :)

Andrea

-- 
akrl@sdf.org



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02 10:37             ` Andrea Corallo
@ 2020-04-02 11:14               ` Tuấn-Anh Nguyễn
  0 siblings, 0 replies; 43+ messages in thread
From: Tuấn-Anh Nguyễn @ 2020-04-02 11:14 UTC (permalink / raw)
  To: Andrea Corallo; +Cc: Eli Zaretskii, emacs-devel

On Thu, Apr 2, 2020 at 5:37 PM Andrea Corallo <akrl@sdf.org> wrote:
>
> Tuấn-Anh Nguyễn <ubolonton@gmail.com> writes:
>
> > In principle (not in tree-sitter ATM), and in very specific cases, yes.
> > IMO that's the wrong focus on a premature optimization anyway. As others
> > noted, even in the pathological case of xdisp.c, the performance is
> > acceptable.
>
> Please do not assume xdisp.c is the worst case scenario, I can testify
> it is not :)
>

Fair enough.

--
Tuấn-Anh Nguyễn
Software Engineer



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02  4:21           ` Tuấn-Anh Nguyễn
  2020-04-02  5:19             ` Jorge Javier Araya Navarro
  2020-04-02 10:37             ` Andrea Corallo
@ 2020-04-02 13:02             ` Stefan Monnier
  2020-04-02 15:06               ` Eli Zaretskii
  2020-04-02 15:02             ` Eli Zaretskii
  3 siblings, 1 reply; 43+ messages in thread
From: Stefan Monnier @ 2020-04-02 13:02 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn; +Cc: Eli Zaretskii, emacs-devel

> If we really want to tackle this issue, there are other approaches to
> consider, e.g. background parsing, or parsing up until a time limit, and
> resume parsing when Emacs is idle. Tree-sitter's API supports the
> latter.

Emacs is in dire need to exploit multiple cores.  It would be very
natural to run tree-parser's initial parse asynchronously in a separate
thread.  This requires to pass tree-parser a *copy* of the
buffer's text.


        Stefan




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-01 23:38           ` Stephen Leake
                               ` (2 preceding siblings ...)
  2020-04-02  5:21             ` Tuấn-Anh Nguyễn
@ 2020-04-02 14:36             ` Eli Zaretskii
  2020-04-03  2:27               ` Stephen Leake
  3 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-02 14:36 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Date: Wed, 01 Apr 2020 15:38:26 -0800
> 
> Eli Zaretskii <eliz@gnu.org> writes:
> 
> > Also, direct access to buffer text generally means we must make sure
> > GC never runs as long as pointers to buffer text are lying around.
> > Can any Lisp run between calls to the reader function that the
> > tree-sitter parser calls to access the buffer text?  
> 
> If the parser copies the text into an internal buffer, that reader
> function should only be called once per call to the parser.

Such copying is not really scalable, and IMO should be avoided.
During active editing, redisplay runs very frequently, and having to
copy portions of the buffer, let alone all of it, each time, which
necessarily requires memory allocation, consing of Lisp objects, etc.,
will produce significant memory pressure, expensive heap
allocations/deallocations, and a lot of GC.  Recall that on many
modern platforms Emacs doesn't really return memory to the system,
which means we risk increasing the memory footprint, and create
system-wide memory pressure.  It isn't a catastrophe, but we should
try to avoid it if possible.

> Since Emacs has the entire file in memory, the parser can too.

Having the file twice or more in memory is worse than having it only
once.

> However, if we are really trying to avoid copying text (which is very
> premature optimization)

I don't think it's premature.

> In sum, the short answer is "yes, you must parse the whole file, unless
> your language is particularly simple".

Funny, my conclusion from reading your detailed description was
entirely different.

> > IOW, the issue with exposing access to buffer text to modules is IMO
> > secondary.  
> 
> yes, because copying text is fast compared to everything else going on.

That wasn't my motivation when I wrote that.

> In general, each parser library, and even each grammar author, will have
> different representations for the syntax tree.
> 
> So if we want to support different parsers, I think it is best to define
> the Emacs "parser API" as "give text to parser; accept text properties
> from parser".

Yes, something like that.  It's probably enough to accept a list of
regions with syntactic attributes.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02  2:46             ` Stefan Monnier
  2020-04-02  4:36               ` Tuấn-Anh Nguyễn
@ 2020-04-02 14:44               ` Eli Zaretskii
  2020-04-02 15:19                 ` Stefan Monnier
  2020-04-03  2:49                 ` [SPAM UNSURE] " Stephen Leake
  1 sibling, 2 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-02 14:44 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: stephen_leake, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Date: Wed, 01 Apr 2020 22:46:18 -0400
> Cc: emacs-devel <emacs-devel@gnu.org>
> 
> If the GC is the worry, we can use a function which encodes the
> buffer using a given coding-system and returns a malloc'd array of bytes.

I think we should try to avoid both copying and encoding the text we
send to the parser.  Both operations are expensive and require memory
allocation.

> According to https://github.com/tree-sitter/tree-sitter/issues/222 the
> parse tree takes around 10 times the size of the source text.

Yes, that's another reason why it might make sense to "forget" trees
of buffers that were not displayed for a long time.  But this is an
optimization that can be added later without any significant changes
in the design.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02  4:21           ` Tuấn-Anh Nguyễn
                               ` (2 preceding siblings ...)
  2020-04-02 13:02             ` Stefan Monnier
@ 2020-04-02 15:02             ` Eli Zaretskii
  2020-04-03 14:34               ` Tuấn-Anh Nguyễn
  3 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-02 15:02 UTC (permalink / raw)
  To: Tuấn-Anh Nguyễn; +Cc: emacs-devel

> From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> Date: Thu, 2 Apr 2020 11:21:49 +0700
> Cc: emacs-devel@gnu.org
> 
> > Buffer text is not exactly UTF-8, it's a superset of UTF-8.  So one
> > question to answer is what to do with byte sequences that are not
> > valid UTF-8.  Any suggestions or ideas?  How does tree-sitter handle
> > invalid byte sequences in general?
> >
> 
> I haven't checked yet. It will probably bail out, which is usually the
> desired behavior.

"Bail out" meaning that this breaks the parse?  I'd be surprised if
that was what happens in these cases.  But if it does, we will need to
replace such sequences by the likes of U+FFFD in the reader function
we provide.

> With direct access, no Lisp code will be run between these calls.

Then this issue is taken care of.

> > Next, I'm still asking whether parsing the whole buffer when it is
> > first created is necessary.  Can we pass to the parser just a small
> > chunk (say, 500 bytes) of the buffer around the window-full to be
> > displayed next?  If this presents problems, what are those problems?
> >
> 
> In principle (not in tree-sitter ATM), and in very specific cases, yes.
> IMO that's the wrong focus on a premature optimization anyway.

I tried to explain elsewhere why I don't think this is premature.

> As others noted, even in the pathological case of xdisp.c, the
> performance is acceptable.

xdisp.c is not a pathological case for me, I edit it very frequently.
More importantly, this scales poorly.

> Also keep in mind that syntax highlighting is just one
> application. Other use cases usually want a full parse tree.

Other applications have different restrictions and requirements, so
trying to satisfy all of them at once might not be the best way.

> If we really want to tackle this issue, there are other approaches to
> consider, e.g. background parsing, or parsing up until a time limit, and
> resume parsing when Emacs is idle. Tree-sitter's API supports the
> latter.

JIT-lock already supports background fontification (see
jit-lock-stealth-time), so using such parsers from jit-lock gives that
to you at almost no cost.

> > IOW, the issue with exposing access to buffer text to modules is IMO
> > secondary.  My suggestion is first to figure out how to do this stuff
> > efficiently from within Emacs itself, as if the module interface were
> > not part of the equation.  We can add that aspect back later.
> >
> 
> My opinion is that it's better to experiment with this kind of stuff
> out-of-core. It can move forward faster that way, allowing more lessons
> to be learned. Real lessons, involving real-world use cases, not thought
> exercises.

I'm talking about trying different design ideas.  It is best to do
that without being limited by what modules can and cannot do.
Building a hacked version of Emacs to test those ideas doesn't
necessarily contradict the desire to collect real-life experience.

IOW, I suggest to test alternative design ideas that are not based on
copying portions of the buffer via Lisp strings.  If those ideas are
workable (and I think they are), they will support a more scalable
implementation that exerts less memory pressure on Emacs and on the
host system.

HTH

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02 13:02             ` Stefan Monnier
@ 2020-04-02 15:06               ` Eli Zaretskii
  0 siblings, 0 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-02 15:06 UTC (permalink / raw)
  To: Stefan Monnier; +Cc: ubolonton, emacs-devel

> From: Stefan Monnier <monnier@iro.umontreal.ca>
> Cc: Eli Zaretskii <eliz@gnu.org>,  emacs-devel@gnu.org
> Date: Thu, 02 Apr 2020 09:02:41 -0400
> 
> > If we really want to tackle this issue, there are other approaches to
> > consider, e.g. background parsing, or parsing up until a time limit, and
> > resume parsing when Emacs is idle. Tree-sitter's API supports the
> > latter.
> 
> Emacs is in dire need to exploit multiple cores.

True.

> It would be very natural to run tree-parser's initial parse
> asynchronously in a separate thread.  This requires to pass
> tree-parser a *copy* of the buffer's text.

This also raises a lot of issues and problems of its own, of which
copying the buffer is the least one.  We don't yet have any example of
such asynchronous processing, so this feature will have to be the
first that does it, and will then have to resolve the issues in
addition to doing its main job.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02 14:44               ` Eli Zaretskii
@ 2020-04-02 15:19                 ` Stefan Monnier
  2020-04-03  2:49                 ` [SPAM UNSURE] " Stephen Leake
  1 sibling, 0 replies; 43+ messages in thread
From: Stefan Monnier @ 2020-04-02 15:19 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen_leake, emacs-devel

> I think we should try to avoid both copying and encoding the text we
> send to the parser.  Both operations are expensive and require memory
> allocation.

I think both operations are cheap enough relatively to the actual
parsing that it is not indispensable to avoid them: maybe it will be
worth the effort, but maybe not.

In any case, it's a minor implementation detail that can easily be
changed in the future without impacting the rest of the code.

So, I think it falls squarely in the realm of premature optimization.

>> According to https://github.com/tree-sitter/tree-sitter/issues/222 the
>> parse tree takes around 10 times the size of the source text.
> Yes, that's another reason why it might make sense to "forget" trees
> of buffers that were not displayed for a long time.

Agreed, tho I wouldn't word it that way: parse trees are not needed for
redisplay and can be used for things that don't relate to redisplay
(e.g. navigation, indentation, ...).

> But this is an optimization that can be added later without any
> significant changes in the design.

Agreed as well.

        Stefan

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02 14:36             ` Eli Zaretskii
@ 2020-04-03  2:27               ` Stephen Leake
  2020-04-03  7:43                 ` Eli Zaretskii
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen Leake @ 2020-04-03  2:27 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Date: Wed, 01 Apr 2020 15:38:26 -0800
>> 
>> Eli Zaretskii <eliz@gnu.org> writes:
>> 
>> > Also, direct access to buffer text generally means we must make sure
>> > GC never runs as long as pointers to buffer text are lying around.
>> > Can any Lisp run between calls to the reader function that the
>> > tree-sitter parser calls to access the buffer text?  
>> 
>> If the parser copies the text into an internal buffer, that reader
>> function should only be called once per call to the parser.
>
> Such copying is not really scalable, and IMO should be avoided.
> During active editing, redisplay runs very frequently, and having to
> copy portions of the buffer, let alone all of it, each time, which
> necessarily requires memory allocation, consing of Lisp objects, etc.,
> will produce significant memory pressure, expensive heap
> allocations/deallocations, and a lot of GC.  Recall that on many
> modern platforms Emacs doesn't really return memory to the system,
> which means we risk increasing the memory footprint, and create
> system-wide memory pressure.  It isn't a catastrophe, but we should
> try to avoid it if possible.

Ok. I know very little about the internal storage of text in Emacs.
There is at least two strings with a gap at the current edit point; if
we pass a simple pointer to tree-sitter, it will have to handle the gap.
You mention "consing of Lisp objects" above, which says to me that the
text is stored in a more complex structure. How can we provide direct
access of that to tree-sitter?

Avoid _all_ copying is impossible; the parser must store the contents of
each token in some way. Typically that is done by storing
pointers/indices into the text buffer that contains the entire text.

>> In sum, the short answer is "yes, you must parse the whole file, unless
>> your language is particularly simple".
>
> Funny, my conclusion from reading your detailed description was
> entirely different.

I need more than that to respond in a helpful way.

>> In general, each parser library, and even each grammar author, will have
>> different representations for the syntax tree.
>> 
>> So if we want to support different parsers, I think it is best to define
>> the Emacs "parser API" as "give text to parser; accept text properties
>> from parser".
>
> Yes, something like that.  It's probably enough to accept a list of
> regions with syntactic attributes.

Ok, good.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02 14:44               ` Eli Zaretskii
  2020-04-02 15:19                 ` Stefan Monnier
@ 2020-04-03  2:49                 ` Stephen Leake
  2020-04-03  7:47                   ` Eli Zaretskii
  2020-04-03  8:11                   ` Robert Pluim
  1 sibling, 2 replies; 43+ messages in thread
From: Stephen Leake @ 2020-04-03  2:49 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stefan Monnier <monnier@iro.umontreal.ca>
>> Date: Wed, 01 Apr 2020 22:46:18 -0400
>> Cc: emacs-devel <emacs-devel@gnu.org>
>> 
>> If the GC is the worry, we can use a function which encodes the
>> buffer using a given coding-system and returns a malloc'd array of bytes.
>
> I think we should try to avoid both copying and encoding the text we
> send to the parser.  Both operations are expensive and require memory
> allocation.

I don't understand what the alternative is. The parser imposes the
reasonable requirement that the input text be utf-8 (or possibly some
other standard format). Emacs raw buffer text is not utf-8, so we must
do some encoding.

If we try to pass a plain pointer to a point in the Emacs internal
buffer, there is no way to do that encoding.

It would be possible to change the lexer in the parser to accept Emacs
raw buffer format, but I don't think you are proposing that.

-- 
-- Stephe

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03  2:27               ` Stephen Leake
@ 2020-04-03  7:43                 ` Eli Zaretskii
  2020-04-03 17:45                   ` Stephen Leake
  0 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-03  7:43 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Date: Thu, 02 Apr 2020 18:27:59 -0800
> 
> > Such copying is not really scalable, and IMO should be avoided.
> > During active editing, redisplay runs very frequently, and having to
> > copy portions of the buffer, let alone all of it, each time, which
> > necessarily requires memory allocation, consing of Lisp objects, etc.,
> > will produce significant memory pressure, expensive heap
> > allocations/deallocations, and a lot of GC.  Recall that on many
> > modern platforms Emacs doesn't really return memory to the system,
> > which means we risk increasing the memory footprint, and create
> > system-wide memory pressure.  It isn't a catastrophe, but we should
> > try to avoid it if possible.
> 
> Ok. I know very little about the internal storage of text in Emacs.
> There is at least two strings with a gap at the current edit point; if
> we pass a simple pointer to tree-sitter, it will have to handle the gap.

Tree-sitter allows the application to define a "reader" function that
it will then call to access buffer text.  That function should cope
with the gap.

> You mention "consing of Lisp objects" above, which says to me that the
> text is stored in a more complex structure.

I meant the consing that is necessary to make a buffer-substring that
will be passed to the parser.

> How can we provide direct access of that to tree-sitter?

See above: by writing our function to access buffer text.

> Avoid _all_ copying is impossible; the parser must store the contents of
> each token in some way. Typically that is done by storing
> pointers/indices into the text buffer that contains the entire text.

I don't think tree-sitter does that, because the text it gets is
ephemeral.  If we pass it a buffer-substring, it's a temporary string
which will be GCed after it's used; if we pass it pointers to buffer
text, those pointers can be invalid after GC, because GC can relocate
buffer text to a different memory region.

They definitely do copy portions of the text they get for internal
processing purposes, but I doubt that they duplicate all of it,
because that would not be scalable to huge buffers.  And in any case,
any copying we do would be _in_addition_ to what tree-sitter does
internally.

> >> In sum, the short answer is "yes, you must parse the whole file, unless
> >> your language is particularly simple".
> >
> > Funny, my conclusion from reading your detailed description was
> > entirely different.
> 
> I need more than that to respond in a helpful way.

Well, you said:

> To some extent, that depends on the language.

and then went on to describing how each language might _not_ need a
full parse in many cases.  Thus the conclusion sounded a bit radical
to me.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03  2:49                 ` [SPAM UNSURE] " Stephen Leake
@ 2020-04-03  7:47                   ` Eli Zaretskii
  2020-04-03 18:11                     ` Stephen Leake
  2020-04-03  8:11                   ` Robert Pluim
  1 sibling, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-03  7:47 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Date: Thu, 02 Apr 2020 18:49:07 -0800
> 
> > I think we should try to avoid both copying and encoding the text we
> > send to the parser.  Both operations are expensive and require memory
> > allocation.
> 
> I don't understand what the alternative is. The parser imposes the
> reasonable requirement that the input text be utf-8 (or possibly some
> other standard format). Emacs raw buffer text is not utf-8, so we must
> do some encoding.

Emacs represents buffer text as a superset of UTF-8, with the
violations of strict UTF-8 being very rare in buffers that hold
program sources.  The function we can provide that lets tree-sitter
access buffer text can cope with those violations, if it turns out
that tree-sitter cannot do that by itself (which frankly, I'd expect
it to be able to do).



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03  2:49                 ` [SPAM UNSURE] " Stephen Leake
  2020-04-03  7:47                   ` Eli Zaretskii
@ 2020-04-03  8:11                   ` Robert Pluim
  2020-04-03 11:00                     ` Eli Zaretskii
  1 sibling, 1 reply; 43+ messages in thread
From: Robert Pluim @ 2020-04-03  8:11 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

>>>>> On Thu, 02 Apr 2020 18:49:07 -0800, Stephen Leake <stephen_leake@stephe-leake.org> said:

    Stephen> Eli Zaretskii <eliz@gnu.org> writes:
    >>> From: Stefan Monnier <monnier@iro.umontreal.ca>
    >>> Date: Wed, 01 Apr 2020 22:46:18 -0400
    >>> Cc: emacs-devel <emacs-devel@gnu.org>
    >>> 
    >>> If the GC is the worry, we can use a function which encodes the
    >>> buffer using a given coding-system and returns a malloc'd array of bytes.
    >> 
    >> I think we should try to avoid both copying and encoding the text we
    >> send to the parser.  Both operations are expensive and require memory
    >> allocation.

    Stephen> I don't understand what the alternative is. The parser imposes the
    Stephen> reasonable requirement that the input text be utf-8 (or possibly some
    Stephen> other standard format). Emacs raw buffer text is not utf-8, so we must
    Stephen> do some encoding.

Itʼs pretty close, apart from raw bytes. How much of an imposition
would it be in practice to say 'source code must not contain raw bytes'?

    Stephen> If we try to pass a plain pointer to a point in the Emacs internal
    Stephen> buffer, there is no way to do that encoding.

As pointed out elsewhere, you'd have to take the gap into account, so
it would be two pointers and two lengths to describe the entire buffer
text.

Robert



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03  8:11                   ` Robert Pluim
@ 2020-04-03 11:00                     ` Eli Zaretskii
  2020-04-03 11:09                       ` Robert Pluim
  2020-04-03 11:21                       ` John Yates
  0 siblings, 2 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-03 11:00 UTC (permalink / raw)
  To: Robert Pluim; +Cc: stephen_leake, emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Date: Fri, 03 Apr 2020 10:11:07 +0200
> Cc: emacs-devel <emacs-devel@gnu.org>
> 
>     Stephen> I don't understand what the alternative is. The parser imposes the
>     Stephen> reasonable requirement that the input text be utf-8 (or possibly some
>     Stephen> other standard format). Emacs raw buffer text is not utf-8, so we must
>     Stephen> do some encoding.
> 
> Itʼs pretty close, apart from raw bytes.

Not only raw bytes: also some characters that cannot be unified with
Unicode.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03 11:00                     ` Eli Zaretskii
@ 2020-04-03 11:09                       ` Robert Pluim
  2020-04-03 12:44                         ` Eli Zaretskii
  2020-04-03 11:21                       ` John Yates
  1 sibling, 1 reply; 43+ messages in thread
From: Robert Pluim @ 2020-04-03 11:09 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: stephen_leake, emacs-devel

>>>>> On Fri, 03 Apr 2020 14:00:06 +0300, Eli Zaretskii <eliz@gnu.org> said:

    >> From: Robert Pluim <rpluim@gmail.com>
    >> Date: Fri, 03 Apr 2020 10:11:07 +0200
    >> Cc: emacs-devel <emacs-devel@gnu.org>
    >> 
    Stephen> I don't understand what the alternative is. The parser imposes the
    Stephen> reasonable requirement that the input text be utf-8 (or possibly some
    Stephen> other standard format). Emacs raw buffer text is not utf-8, so we must
    Stephen> do some encoding.
    >> 
    >> Itʼs pretty close, apart from raw bytes.

    Eli> Not only raw bytes: also some characters that cannot be unified with
    Eli> Unicode.

And again: how likely are those characters to be in source code?

Robert



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03 11:00                     ` Eli Zaretskii
  2020-04-03 11:09                       ` Robert Pluim
@ 2020-04-03 11:21                       ` John Yates
  2020-04-03 12:50                         ` Eli Zaretskii
  1 sibling, 1 reply; 43+ messages in thread
From: John Yates @ 2020-04-03 11:21 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Robert Pluim, Stephen Leake, Emacs developers

On Fri, Apr 3, 2020 at 7:01 AM Eli Zaretskii <eliz@gnu.org> wrote:
> Not only raw bytes: also some characters that cannot be unified with
> Unicode.

How are the semantics of such characters queried?
E.g. can such characters participate in case conversion?

/john



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03 11:09                       ` Robert Pluim
@ 2020-04-03 12:44                         ` Eli Zaretskii
  0 siblings, 0 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-03 12:44 UTC (permalink / raw)
  To: Robert Pluim; +Cc: stephen_leake, emacs-devel

> From: Robert Pluim <rpluim@gmail.com>
> Cc: stephen_leake@stephe-leake.org,  emacs-devel@gnu.org
> Date: Fri, 03 Apr 2020 13:09:34 +0200
> 
>     Eli> Not only raw bytes: also some characters that cannot be unified with
>     Eli> Unicode.
> 
> And again: how likely are those characters to be in source code?

Probably not too likely, but we need to have a solution for when they
do happen.

I guess a useful first step would be for someone to find out what does
tree-sitter do when it encounters such byte sequences.  We can then
devise a solution that will produce good results, preferably without
the need to ask tree-sitter developers to cater to these situations.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03 11:21                       ` John Yates
@ 2020-04-03 12:50                         ` Eli Zaretskii
  0 siblings, 0 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-03 12:50 UTC (permalink / raw)
  To: John Yates; +Cc: rpluim, stephen_leake, emacs-devel

> From: John Yates <john@yates-sheets.org>
> Date: Fri, 3 Apr 2020 07:21:28 -0400
> Cc: Robert Pluim <rpluim@gmail.com>, Stephen Leake <stephen_leake@stephe-leake.org>, 
> 	Emacs developers <emacs-devel@gnu.org>
> 
> On Fri, Apr 3, 2020 at 7:01 AM Eli Zaretskii <eliz@gnu.org> wrote:
> > Not only raw bytes: also some characters that cannot be unified with
> > Unicode.
> 
> How are the semantics of such characters queried?

I don't think I understand well enough what you mean by "semantics" in
this context.  Pleased elaborate: what else is in the semantics except
case conversions that you mention below?

> E.g. can such characters participate in case conversion?

Yes, of course.  Emacs converts letter-case by consulting the case
tables, see the node "Case Tables" in the ELisp manual.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-02 15:02             ` Eli Zaretskii
@ 2020-04-03 14:34               ` Tuấn-Anh Nguyễn
  0 siblings, 0 replies; 43+ messages in thread
From: Tuấn-Anh Nguyễn @ 2020-04-03 14:34 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: emacs-devel

On Thu, Apr 2, 2020 at 10:02 PM Eli Zaretskii <eliz@gnu.org> wrote:
>
> > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com>
> > Date: Thu, 2 Apr 2020 11:21:49 +0700
> > Cc: emacs-devel@gnu.org
> >
> > > Buffer text is not exactly UTF-8, it's a superset of UTF-8.  So one
> > > question to answer is what to do with byte sequences that are not
> > > valid UTF-8.  Any suggestions or ideas?  How does tree-sitter handle
> > > invalid byte sequences in general?
> > >
> >
> > I haven't checked yet. It will probably bail out, which is usually the
> > desired behavior.
>
> "Bail out" meaning that this breaks the parse?  I'd be surprised if
> that was what happens in these cases.  But if it does, we will need to
> replace such sequences by the likes of U+FFFD in the reader function
> we provide.
>

Agreed. I'll try checking its behavior on this.

> > > IOW, the issue with exposing access to buffer text to modules is IMO
> > > secondary.  My suggestion is first to figure out how to do this stuff
> > > efficiently from within Emacs itself, as if the module interface were
> > > not part of the equation.  We can add that aspect back later.
> > >
> >
> > My opinion is that it's better to experiment with this kind of stuff
> > out-of-core. It can move forward faster that way, allowing more lessons
> > to be learned. Real lessons, involving real-world use cases, not thought
> > exercises.
>
> I'm talking about trying different design ideas.  It is best to do
> that without being limited by what modules can and cannot do.
> Building a hacked version of Emacs to test those ideas doesn't
> necessarily contradict the desire to collect real-life experience.
>
> IOW, I suggest to test alternative design ideas that are not based on
> copying portions of the buffer via Lisp strings.  If those ideas are
> workable (and I think they are), they will support a more scalable
> implementation that exerts less memory pressure on Emacs and on the
> host system.
>
> HTH
>

Yeah, I agree that going through Lisp strings for this is sub-optimal.
When I have time to come back to this part, I'll hack up my local Emacs
to allow dynamic modules to access buffer texts directly, to test out
the idea.

--
Tuấn-Anh Nguyễn
Software Engineer

P.S. Sorry Gmail messed up my first reply.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03  7:43                 ` Eli Zaretskii
@ 2020-04-03 17:45                   ` Stephen Leake
  2020-04-03 18:31                     ` Eli Zaretskii
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen Leake @ 2020-04-03 17:45 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Date: Thu, 02 Apr 2020 18:27:59 -0800
>> 
>> > Such copying is not really scalable, and IMO should be avoided.
>> > During active editing, redisplay runs very frequently, and having to
>> > copy portions of the buffer, let alone all of it, each time, which
>> > necessarily requires memory allocation, consing of Lisp objects, etc.,
>> > will produce significant memory pressure, expensive heap
>> > allocations/deallocations, and a lot of GC.  Recall that on many
>> > modern platforms Emacs doesn't really return memory to the system,
>> > which means we risk increasing the memory footprint, and create
>> > system-wide memory pressure.  It isn't a catastrophe, but we should
>> > try to avoid it if possible.
>> 
>> Ok. I know very little about the internal storage of text in Emacs.
>> There is at least two strings with a gap at the current edit point; if
>> we pass a simple pointer to tree-sitter, it will have to handle the gap.
>
> Tree-sitter allows the application to define a "reader" function that
> it will then call to access buffer text.  That function should cope
> with the gap.

and also with the encoding, which you did not address. I don't see how
that is different from the C level buffer-substring. Certainly there
should be a module function buffer-substring that is as efficient as possible.

>> You mention "consing of Lisp objects" above, which says to me that the
>> text is stored in a more complex structure.
>
> I meant the consing that is necessary to make a buffer-substring that
> will be passed to the parser.

Since are are calling the parser from C (if it is linked into Emacs, or
in a module), I still don't understand. Does C code have to cons to
create a string? It will have to allocate if the requested range is not
contiguous in the buffer.

>> Avoid _all_ copying is impossible; the parser must store the contents of
>> each token in some way. Typically that is done by storing
>> pointers/indices into the text buffer that contains the entire text.
>
> I don't think tree-sitter does that, because the text it gets is
> ephemeral.  If we pass it a buffer-substring, it's a temporary string
> which will be GCed after it's used; if we pass it pointers to buffer
> text, those pointers can be invalid after GC, because GC can relocate
> buffer text to a different memory region.

Hmm.
https://tree-sitter.github.io/tree-sitter/using-parsers#providing-the-code
says:

    Syntax nodes store their position in the source code both in terms
    of raw bytes and row/column coordinates

In the case of passing a pointer to a string (or buffer, etc), those
positions are relative to that original buffer. So the Emacs buffer is
serving as the parse buffer. Ok, that avoids any copying.

If we pass a buffer-substring to the parser, we are then responsible for
mapping positions relative to the substring into positions relative to
the full buffer. wisi delegates that to the parser; it can pass
start-char-pos and start-byte-pos to the parser along with a string.


>> >> In sum, the short answer is "yes, you must parse the whole file, unless
>> >> your language is particularly simple".
>> >
>> > Funny, my conclusion from reading your detailed description was
>> > entirely different.
>> 
>> I need more than that to respond in a helpful way.
>
> Well, you said:
>
>> To some extent, that depends on the language.
>
> and then went on to describing how each language might _not_ need a
> full parse in many cases.  Thus the conclusion sounded a bit radical
> to me.

Ok, we are putting different spins on what "particularly simple" means.

A more neutral phrasing would be:

    Some languages require parsing the whole file, some do not.
    
-- 
-- Stephe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03  7:47                   ` Eli Zaretskii
@ 2020-04-03 18:11                     ` Stephen Leake
  2020-04-03 18:46                       ` Eli Zaretskii
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen Leake @ 2020-04-03 18:11 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Date: Thu, 02 Apr 2020 18:49:07 -0800
>>
>> > I think we should try to avoid both copying and encoding the text we
>> > send to the parser.  Both operations are expensive and require memory
>> > allocation.
>>
>> I don't understand what the alternative is. The parser imposes the
>> reasonable requirement that the input text be utf-8 (or possibly some
>> other standard format). Emacs raw buffer text is not utf-8, so we must
>> do some encoding.
>
> Emacs represents buffer text as a superset of UTF-8, with the
> violations of strict UTF-8 being very rare in buffers that hold
> program sources.  The function we can provide that lets tree-sitter
> access buffer text can cope with those violations,

Ok. "cope with those violations" = "do some encoding".

We can avoid copying _if_ the encoding does not change character
positions, or somehow preserves positions, for example with an auxiliary
table of changes due to encoding.

Coping with violations in the lexer would make it much easier to avoid
changing character positions; it is easy to simply ignore bytes there.

wisi makes it easy to implement this in the lexer (because it uses
re2c), although currently there is no way to make that language-specific
(that would be an enhancement).

https://tree-sitter.github.io/tree-sitter/creating-parsers#external-scanners
describes the facility for enhancing the tree-sitter lexer (aka
scanner). That is not convenient for handling this issue, so we'd have
to request (and or provide) an enhancement.

We cannot avoid encoding (either in the read function provided to
tree-sitter, or in the tree-sitter lexer), but the encoding may be very
simple and efficient.

--
-- Stephe

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03 17:45                   ` Stephen Leake
@ 2020-04-03 18:31                     ` Eli Zaretskii
  2020-04-04  0:04                       ` Stephen Leake
  0 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-03 18:31 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Date: Fri, 03 Apr 2020 09:45:44 -0800
> 
> > Tree-sitter allows the application to define a "reader" function that
> > it will then call to access buffer text.  That function should cope
> > with the gap.
> 
> and also with the encoding, which you did not address.

I mentioned that in another message: I don't think encoding is
necessary in this case.

> I don't see how that is different from the C level
> buffer-substring. Certainly there should be a module function
> buffer-substring that is as efficient as possible.

If modules are allowed direct access to buffer text, then it's indeed
not different.  But the alternative that was discussed was different.
May I suggest that you look at the code of the module which triggered
this?

> >> You mention "consing of Lisp objects" above, which says to me that the
> >> text is stored in a more complex structure.
> >
> > I meant the consing that is necessary to make a buffer-substring that
> > will be passed to the parser.
> 
> Since are are calling the parser from C (if it is linked into Emacs, or
> in a module), I still don't understand. Does C code have to cons to
> create a string?

If course.  How else do you get a UTF-8 encoded string to pass to the
parser as a copy of buffer text?

> > I don't think tree-sitter does that, because the text it gets is
> > ephemeral.  If we pass it a buffer-substring, it's a temporary string
> > which will be GCed after it's used; if we pass it pointers to buffer
> > text, those pointers can be invalid after GC, because GC can relocate
> > buffer text to a different memory region.
> 
> Hmm.
> https://tree-sitter.github.io/tree-sitter/using-parsers#providing-the-code
> says:
> 
>     Syntax nodes store their position in the source code both in terms
>     of raw bytes and row/column coordinates

Positions are okay; 'char *' pointers to buffer or string text are
not.



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03 18:11                     ` Stephen Leake
@ 2020-04-03 18:46                       ` Eli Zaretskii
  2020-04-04  0:05                         ` Stephen Leake
  0 siblings, 1 reply; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-03 18:46 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Date: Fri, 03 Apr 2020 10:11:05 -0800
> 
> > Emacs represents buffer text as a superset of UTF-8, with the
> > violations of strict UTF-8 being very rare in buffers that hold
> > program sources.  The function we can provide that lets tree-sitter
> > access buffer text can cope with those violations,
> 
> Ok. "cope with those violations" = "do some encoding".

If we use "encoding" terminology for this, it will be confusing and
will cause misunderstandings.  "Conversion" is better, IMO.  Some
sequences may need to be converted when feeding them to tree-sitter.

But I think tree-sitter should be able to cope with this itself.  It
is unreasonable to expect strict UTF-8 from all applications.  Maybe
I'm dreaming, but ISTR there is (or was) an issue on their issue
tracker about this.

> We cannot avoid encoding (either in the read function provided to
> tree-sitter, or in the tree-sitter lexer), but the encoding may be very
> simple and efficient.

Once again, please reserve "encoding" to the likes of
encode-coding-region or code_convert_string, to avoid confusion.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03 18:31                     ` Eli Zaretskii
@ 2020-04-04  0:04                       ` Stephen Leake
  2020-04-04  7:13                         ` Eli Zaretskii
  0 siblings, 1 reply; 43+ messages in thread
From: Stephen Leake @ 2020-04-04  0:04 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> 
>> >> You mention "consing of Lisp objects" above, which says to me that the
>> >> text is stored in a more complex structure.
>> >
>> > I meant the consing that is necessary to make a buffer-substring that
>> > will be passed to the parser.
>> 
>> Since are are calling the parser from C (if it is linked into Emacs, or
>> in a module), I still don't understand. Does C code have to cons to
>> create a string?
>
> If course.  How else do you get a UTF-8 encoded string to pass to the
> parser as a copy of buffer text?

malloc and memcpy. I guess that's what you mean by "cons"; I was
assuming you meant the actual elisp function.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-03 18:46                       ` Eli Zaretskii
@ 2020-04-04  0:05                         ` Stephen Leake
  0 siblings, 0 replies; 43+ messages in thread
From: Stephen Leake @ 2020-04-04  0:05 UTC (permalink / raw)
  To: emacs-devel

Eli Zaretskii <eliz@gnu.org> writes:

>> From: Stephen Leake <stephen_leake@stephe-leake.org>
>> Date: Fri, 03 Apr 2020 10:11:05 -0800
>> 
>> > Emacs represents buffer text as a superset of UTF-8, with the
>> > violations of strict UTF-8 being very rare in buffers that hold
>> > program sources.  The function we can provide that lets tree-sitter
>> > access buffer text can cope with those violations,
>> 
>> Ok. "cope with those violations" = "do some encoding".
>
> If we use "encoding" terminology for this, it will be confusing and
> will cause misunderstandings.  

Yeah, I realized that after I posted this.

-- 
-- Stephe



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs)
  2020-04-04  0:04                       ` Stephen Leake
@ 2020-04-04  7:13                         ` Eli Zaretskii
  0 siblings, 0 replies; 43+ messages in thread
From: Eli Zaretskii @ 2020-04-04  7:13 UTC (permalink / raw)
  To: Stephen Leake; +Cc: emacs-devel

> From: Stephen Leake <stephen_leake@stephe-leake.org>
> Date: Fri, 03 Apr 2020 16:04:04 -0800
> 
> >> Since are are calling the parser from C (if it is linked into Emacs, or
> >> in a module), I still don't understand. Does C code have to cons to
> >> create a string?
> >
> > If course.  How else do you get a UTF-8 encoded string to pass to the
> > parser as a copy of buffer text?
> 
> malloc and memcpy.

How do you know how much memory to allocate?  And memcpy doesn't cut
it, because you forgot the encoding step.

You could, of course, take the low-level encoding code from coding.c
and make your own high-level functions that don't work with Lisp
objects.  But (a) why bother doing that? and (b) I think you will
quickly find out that this is a non-trivial job, since coding.c
"knows", to the lowest level, that it's dealing with Lisp objects
(buffers or strings), so you'd need pretty much to rewrite everything.

It's no accident that the Cygwin port uses the Lisp string machinery
even when it needs to convert strings from UTF-16 (see from_unicode),
even though it basically needs to convert C strings.

> I guess that's what you mean by "cons"; I was assuming you meant the
> actual elisp function.

No, I meant "consing" as in "make a Lisp string", then encode it
(which makes another Lisp string).

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2020-04-04  7:13 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-31 17:07 Reliable after-change-functions (via: Using incremental parsing in Emacs) Tuấn Anh Nguyễn
2020-03-31 17:50 ` Eli Zaretskii
2020-04-01  6:17   ` Tuấn Anh Nguyễn
2020-04-01 13:26     ` Eli Zaretskii
2020-04-01 15:47       ` Jorge Javier Araya Navarro
2020-04-01 16:07         ` Eli Zaretskii
2020-04-01 17:55       ` Tuấn-Anh Nguyễn
2020-04-01 19:33         ` Eli Zaretskii
2020-04-01 23:38           ` Stephen Leake
2020-04-02  0:25             ` Stephen Leake
2020-04-02  2:46             ` Stefan Monnier
2020-04-02  4:36               ` Tuấn-Anh Nguyễn
2020-04-02 14:44               ` Eli Zaretskii
2020-04-02 15:19                 ` Stefan Monnier
2020-04-03  2:49                 ` [SPAM UNSURE] " Stephen Leake
2020-04-03  7:47                   ` Eli Zaretskii
2020-04-03 18:11                     ` Stephen Leake
2020-04-03 18:46                       ` Eli Zaretskii
2020-04-04  0:05                         ` Stephen Leake
2020-04-03  8:11                   ` Robert Pluim
2020-04-03 11:00                     ` Eli Zaretskii
2020-04-03 11:09                       ` Robert Pluim
2020-04-03 12:44                         ` Eli Zaretskii
2020-04-03 11:21                       ` John Yates
2020-04-03 12:50                         ` Eli Zaretskii
2020-04-02  5:21             ` Tuấn-Anh Nguyễn
2020-04-02  9:24               ` [SPAM UNSURE] " Stephen Leake
2020-04-02 14:36             ` Eli Zaretskii
2020-04-03  2:27               ` Stephen Leake
2020-04-03  7:43                 ` Eli Zaretskii
2020-04-03 17:45                   ` Stephen Leake
2020-04-03 18:31                     ` Eli Zaretskii
2020-04-04  0:04                       ` Stephen Leake
2020-04-04  7:13                         ` Eli Zaretskii
2020-04-02  4:21           ` Tuấn-Anh Nguyễn
2020-04-02  5:19             ` Jorge Javier Araya Navarro
2020-04-02  9:29               ` Stephen Leake
2020-04-02 10:37             ` Andrea Corallo
2020-04-02 11:14               ` Tuấn-Anh Nguyễn
2020-04-02 13:02             ` Stefan Monnier
2020-04-02 15:06               ` Eli Zaretskii
2020-04-02 15:02             ` Eli Zaretskii
2020-04-03 14:34               ` Tuấn-Anh Nguyễn

Code repositories for project(s) associated with this public inbox

	https://git.savannah.gnu.org/cgit/emacs.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).