* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) @ 2020-03-31 17:07 Tuấn Anh Nguyễn 2020-03-31 17:50 ` Eli Zaretskii 0 siblings, 1 reply; 43+ messages in thread From: Tuấn Anh Nguyễn @ 2020-03-31 17:07 UTC (permalink / raw) To: emacs-devel > In any case, I hope that passing the buffer to tree-sitter doesn't > involve marshalling the entire buffer text via a function call as a > huge string, or some such. We should instead request that tree-sitter > exposes an API through which we could give it direct access to buffer > text as 2 parts, before and after the gap, like we do with regex > code. Otherwise this will be a bottleneck in the long run, not unlike > the problem we have with LSP. It does support parsing through direct access. Which is why I wanted dynamic modules to have direct access to buffer text. >> How large is "very large" here? > > xdisp.c comes to mind, obviously. On my machine, a 3.39 GHz Intel Core i7: (0.150791 0 0.0) ; 1 full parse (2.142236 5 0.6105190000000107) ; 10 full parses (0.015423 0 0.0) ; incremental parsing, after typing 1 character -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-03-31 17:07 Reliable after-change-functions (via: Using incremental parsing in Emacs) Tuấn Anh Nguyễn @ 2020-03-31 17:50 ` Eli Zaretskii 2020-04-01 6:17 ` Tuấn Anh Nguyễn 0 siblings, 1 reply; 43+ messages in thread From: Eli Zaretskii @ 2020-03-31 17:50 UTC (permalink / raw) To: Tuấn Anh Nguyễn; +Cc: emacs-devel > From: Tuấn Anh Nguyễn <ubolonton@gmail.com> > Date: Wed, 1 Apr 2020 00:07:27 +0700 > > > xdisp.c comes to mind, obviously. > > On my machine, a 3.39 GHz Intel Core i7: > > (0.150791 0 0.0) ; 1 full parse How did you submit xdisp.c to the parser? In any case, IIUC, the first time a buffer needs to be displayed, we need to wait for these 150 msec? That's annoyingly long (and I suspect in real Emacs usage will be significantly longer, due to memory allocation, encoding, etc.). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-03-31 17:50 ` Eli Zaretskii @ 2020-04-01 6:17 ` Tuấn Anh Nguyễn 2020-04-01 13:26 ` Eli Zaretskii 0 siblings, 1 reply; 43+ messages in thread From: Tuấn Anh Nguyễn @ 2020-04-01 6:17 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel [-- Attachment #1: Type: text/plain, Size: 1220 bytes --] On Wed, Apr 1, 2020 at 12:50 AM Eli Zaretskii <eliz@gnu.org> wrote: > > From: Tuấn Anh Nguyễn <ubolonton@gmail.com> > > Date: Wed, 1 Apr 2020 00:07:27 +0700 > > > > > xdisp.c comes to mind, obviously. > > > > On my machine, a 3.39 GHz Intel Core i7: > > > > (0.150791 0 0.0) ; 1 full parse > > How did you submit xdisp.c to the parser? > (with-current-buffer "xdisp.c" (let ((language (tree-sitter-require 'c)) (parser (ts-make-parser))) (ts-set-language parser language) (garbage-collect) (message "%s" (benchmark-run (ts-parse parser #'ts-buffer-input nil))))) > In any case, IIUC, the first time a buffer needs to be displayed, we > need to wait for these 150 msec? That's annoyingly long (and I > suspect in real Emacs usage will be significantly longer, due to > memory allocation, encoding, etc.). > Real usage with "xdisp.c": (define-advice tree-sitter--do-parse (:around (f &rest args) benchmark) (message "%s" (benchmark-run (apply f args)))) (0.257998 1 0.13326100000000096) So yes, direct access to buffer's text from dynamic modules would be nice. -- Tuấn-Anh Nguyễn Software Engineer [-- Attachment #2: Type: text/html, Size: 2127 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 6:17 ` Tuấn Anh Nguyễn @ 2020-04-01 13:26 ` Eli Zaretskii 2020-04-01 15:47 ` Jorge Javier Araya Navarro 2020-04-01 17:55 ` Tuấn-Anh Nguyễn 0 siblings, 2 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-01 13:26 UTC (permalink / raw) To: Tuấn Anh Nguyễn; +Cc: emacs-devel > From: Tuấn Anh Nguyễn <ubolonton@gmail.com> > Date: Wed, 1 Apr 2020 13:17:42 +0700 > Cc: emacs-devel@gnu.org > > Real usage with "xdisp.c": > > (define-advice tree-sitter--do-parse (:around (f &rest args) benchmark) > (message "%s" (benchmark-run (apply f args)))) > > (0.257998 1 0.13326100000000096) And that is even without encoding the buffer text, IIUC what the package does. > So yes, direct access to buffer's text from dynamic modules would be nice. Did you consider using the API where an application can provide a function to return text at a given offset? Such a function could be relatively easily implemented for Emacs. Btw, what do you do with the tree returned by the tree-sitter parser? store it in some buffer-local variable? If so, how much memory does such a tree take, and when, if ever, is that memory released? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 13:26 ` Eli Zaretskii @ 2020-04-01 15:47 ` Jorge Javier Araya Navarro 2020-04-01 16:07 ` Eli Zaretskii 2020-04-01 17:55 ` Tuấn-Anh Nguyễn 1 sibling, 1 reply; 43+ messages in thread From: Jorge Javier Araya Navarro @ 2020-04-01 15:47 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Tuấn Anh Nguyễn, emacs-devel [-- Attachment #1: Type: text/plain, Size: 1326 bytes --] > Did you consider using the API where an application can provide a function to return text at a given offset? Such a function could be relatively easily implemented for Emacs. But why not just allow access to buffers for dynamic modules, otherwise what would be the point of dynamic modules? El mié., 1 de abr. de 2020 a la(s) 07:26, Eli Zaretskii (eliz@gnu.org) escribió: > > From: Tuấn Anh Nguyễn <ubolonton@gmail.com> > > Date: Wed, 1 Apr 2020 13:17:42 +0700 > > Cc: emacs-devel@gnu.org > > > > Real usage with "xdisp.c": > > > > (define-advice tree-sitter--do-parse (:around (f &rest args) > benchmark) > > (message "%s" (benchmark-run (apply f args)))) > > > > (0.257998 1 0.13326100000000096) > > And that is even without encoding the buffer text, IIUC what the > package does. > > > So yes, direct access to buffer's text from dynamic modules would be > nice. > > Did you consider using the API where an application can provide a > function to return text at a given offset? Such a function could be > relatively easily implemented for Emacs. > > Btw, what do you do with the tree returned by the tree-sitter parser? > store it in some buffer-local variable? If so, how much memory does > such a tree take, and when, if ever, is that memory released? > > [-- Attachment #2: Type: text/html, Size: 2038 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 15:47 ` Jorge Javier Araya Navarro @ 2020-04-01 16:07 ` Eli Zaretskii 0 siblings, 0 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-01 16:07 UTC (permalink / raw) To: Jorge Javier Araya Navarro; +Cc: ubolonton, emacs-devel > From: Jorge Javier Araya Navarro <jorge@esavara.cr> > Date: Wed, 1 Apr 2020 09:47:48 -0600 > Cc: Tuấn Anh Nguyễn <ubolonton@gmail.com>, > emacs-devel@gnu.org > > > Did you consider using the API where an application can provide a function to return text at a given offset? > Such a function could be relatively easily implemented for Emacs. > > But why not just allow access to buffers for dynamic modules, otherwise what would be the point of dynamic > modules? These two are orthogonal issues: if we allow such access from modules, will this particular module use it, and if so, how? ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 13:26 ` Eli Zaretskii 2020-04-01 15:47 ` Jorge Javier Araya Navarro @ 2020-04-01 17:55 ` Tuấn-Anh Nguyễn 2020-04-01 19:33 ` Eli Zaretskii 1 sibling, 1 reply; 43+ messages in thread From: Tuấn-Anh Nguyễn @ 2020-04-01 17:55 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On Wed, Apr 1, 2020 at 8:26 PM Eli Zaretskii <eliz@gnu.org> wrote: > > > From: Tuấn Anh Nguyễn <ubolonton@gmail.com> > > Date: Wed, 1 Apr 2020 13:17:42 +0700 > > Cc: emacs-devel@gnu.org > > > > Real usage with "xdisp.c": > > > > (define-advice tree-sitter--do-parse (:around (f &rest args) benchmark) > > (message "%s" (benchmark-run (apply f args)))) > > > > (0.257998 1 0.13326100000000096) > > And that is even without encoding the buffer text, IIUC what the > package does. > > > So yes, direct access to buffer's text from dynamic modules would be nice. > > Did you consider using the API where an application can provide a > function to return text at a given offset? Such a function could be > relatively easily implemented for Emacs. > I don't understand what you mean. Below I'll explain how it works currently. `ts-parse' uses the Tree-sitter's API that consumes text in chunks: TSTree *ts_parser_parse( TSParser *self, const TSTree *old_tree, TSInput input ); typedef struct { void *payload; const char *(*read)( void *payload, uint32_t byte_offset, TSPoint position, uint32_t *bytes_read ); TSInputEncoding encoding; } TSInput; Because dynamic modules don't have direct access to buffer text, `ts-parse' uses the module function `copy_string_contents', and exposes this interface: (ts-parse PARSER INPUT-FUNCTION OLD-TREE) Here INPUT-FUNCTION must return a chunk of the buffer text, starting from the given byte offset, as a Lisp string. `ts-buffer-input' is one such function. So: 1. Chunks of the buffer text are copied into Lisp strings, through `buffer-substring-no-properties'. 2. These Lisp strings are copied into buffers of null-terminated utf-8 bytes, through `copy_string_contents'. 3. All these temporary Lisp strings create GC pressure. In the xdisp.c example, it was 100ms for GC, in addition to 150ms for parsing. 4. emacs-module-rs has an automatic, blanket workaround for this bug https://debbugs.gnu.org/cgi/bugreport.cgi?bug=31238. The workaround involves pairs of `make_global_ref' and `free_global_ref' calls, on all "suspected" `emacs_value's. #4 can be avoided if emacs-module-rs allows selectively disabling the blanket workaround. It's band-aid on top of band-aid, but at least it's workable. #3 can probably be alleviated by increasing the chunk size. However, they are consequences of #1 and #2. If dynamic modules have direct access to the buffer text, none of the above is an issue. Such direct access can be enabled by something like this: char* (*access_buffer_text) (emacs_env *env, emacs_value buffer, ptrdiff_t byte_offset, ptrdiff_t *size_inout); Of course, such an API would require extensive documentation on how it must be used, to ensure safety and correctness. > Btw, what do you do with the tree returned by the tree-sitter parser? > store it in some buffer-local variable? If so, how much memory does > such a tree take, and when, if ever, is that memory released? > It's stored in a buffer-local variable. I haven't measured the memory they take. Memory is released when the tree object is garbage-collected (it's a `user-ptr'). -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 17:55 ` Tuấn-Anh Nguyễn @ 2020-04-01 19:33 ` Eli Zaretskii 2020-04-01 23:38 ` Stephen Leake 2020-04-02 4:21 ` Tuấn-Anh Nguyễn 0 siblings, 2 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-01 19:33 UTC (permalink / raw) To: Tuấn-Anh Nguyễn; +Cc: emacs-devel > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > Date: Thu, 2 Apr 2020 00:55:45 +0700 > Cc: emacs-devel@gnu.org > > > Did you consider using the API where an application can provide a > > function to return text at a given offset? Such a function could be > > relatively easily implemented for Emacs. > > > > I don't understand what you mean. Below I'll explain how it works > currently. [...] If dynamic modules have direct access to the > buffer text, none of the above is an issue. > > Such direct access can be enabled by something like this: > > char* (*access_buffer_text) (emacs_env *env, > emacs_value buffer, > ptrdiff_t byte_offset, > ptrdiff_t *size_inout); > > Of course, such an API would require extensive documentation on how it > must be used, to ensure safety and correctness. I think you are moving too fast, and keep the current implementation in sight too much. What I suggest is to step back and see how such direct access, if it were available, could be used with tree-sitter. Let's forget about modules for a moment and consider tree-sitter linked with Emacs and capable of calling any C function in core. How would you use that? Buffer text is not exactly UTF-8, it's a superset of UTF-8. So one question to answer is what to do with byte sequences that are not valid UTF-8. Any suggestions or ideas? How does tree-sitter handle invalid byte sequences in general? Also, direct access to buffer text generally means we must make sure GC never runs as long as pointers to buffer text are lying around. Can any Lisp run between calls to the reader function that the tree-sitter parser calls to access the buffer text? If so, we need to take care of that issue. Next, I'm still asking whether parsing the whole buffer when it is first created is necessary. Can we pass to the parser just a small chunk (say, 500 bytes) of the buffer around the window-full to be displayed next? If this presents problems, what are those problems? IOW, the issue with exposing access to buffer text to modules is IMO secondary. My suggestion is first to figure out how to do this stuff efficiently from within Emacs itself, as if the module interface were not part of the equation. We can add that aspect back later. And yes, doing this by consing strings is not a good idea, it will slow things down and cause a lot of GC. It is best avoided. Thus my questions above. > > Btw, what do you do with the tree returned by the tree-sitter parser? > > store it in some buffer-local variable? If so, how much memory does > > such a tree take, and when, if ever, is that memory released? > > > > It's stored in a buffer-local variable. I haven't measured the memory > they take. Memory is released when the tree object is garbage-collected > (it's a `user-ptr'). So if I have many hundreds of buffers, I could have such a tree in each one of them indefinitely? Perhaps that's one more design issue to consider, given that the parsing is so fast. Similar to what we do with image and face caches -- we flush them from time to time, to keep the memory footprint in check. So a buffer that was not current more than some time interval ago could have its tree GCed. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 19:33 ` Eli Zaretskii @ 2020-04-01 23:38 ` Stephen Leake 2020-04-02 0:25 ` Stephen Leake ` (3 more replies) 2020-04-02 4:21 ` Tuấn-Anh Nguyễn 1 sibling, 4 replies; 43+ messages in thread From: Stephen Leake @ 2020-04-01 23:38 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: > Also, direct access to buffer text generally means we must make sure > GC never runs as long as pointers to buffer text are lying around. > Can any Lisp run between calls to the reader function that the > tree-sitter parser calls to access the buffer text? If the parser copies the text into an internal buffer, that reader function should only be called once per call to the parser. Parsers used to based around small buffers that would read in a file a chunk at a time, but that is not necessary on any machine that can run Emacs. Since Emacs has the entire file in memory, the parser can too. However, if we are really trying to avoid copying text (which is very premature optimization), then the reader function will be called many times during parsing (to fetch each word), and possibly during the grammar actions (to compute indent or face). > Next, I'm still asking whether parsing the whole buffer when it is > first created is necessary. To some extent, that depends on the language. The parser must be able to complete a parse, to generate a complete syntax tree. I'll assume no error correction for a moment; more below. In C or C++ body files, "a complete parse" is typically one variable or function declaration. So if Emacs can reliably find the beginning and end of those declarations, it could pass just the ones containing the region of interest to the parser. Tree-sitter (if it supports this at all, or is modified to) would end up with a forest of small parse trees, rather than one large one. They might get merged if large chunks of text are parsed together. In Ada and Java, and most C++ header files, "a complete parse" is a file; it contains an Ada package spec or body, or a Java or C++ class, or a C++ namespace. There are many "small languages" for which "a complete parse" is similar to a "statement". Bash shell, for example. They could pass just the statement, but only if Emacs can reliably find the start and end (not always easy). It is also possible to modify the language grammar to allow smaller pieces of code to be a complete parse; ada-mode does this, making a single declaration or statement "a complete parse", in order to support "partial parse". That can easily lead to errors in indent, since the indent of the start of the text portion is unknown (ada-mode simply assumes it is correct in the buffer). Another reason to allow smaller code chunks to be a complete parse is to allow parsing the code fragments that appear in grammar actions; the ELPA package wisitoken-grammar-mode uses this for wisitoken grammar files with Ada actions. In sum, the short answer is "yes, you must parse the whole file, unless your language is particularly simple". Since we need to support the worst case, we should assume the whole file must be parsed at least once. > Can we pass to the parser just a small chunk (say, 500 bytes) of the > buffer around the window-full to be displayed next? If this presents > problems, what are those problems? In wisi, the error correction code will fill in the missing text so a complete parse is possible. Since some of that is guesses, the results may not be very good. Tree-sitter also has error correction; I'm not clear how good it is. > IOW, the issue with exposing access to buffer text to modules is IMO > secondary. yes, because copying text is fast compared to everything else going on. > My suggestion is first to figure out how to do this stuff efficiently > from within Emacs itself, as if the module interface were not part of > the equation. We can add that aspect back later. There are two times the wisi code that wraps the parser needs access to the buffer; first to copy the text, second to add text properties (faces, indent values, navigation markers). There are usually many text properties output by each parse. The positions and values of the text properties are computed by functions that run after the complete syntax tree has been produced. In wisi, those functions are added directly in the grammar source file (where they are called "post-parse grammar actions"). In tree-sitter, I assume they are called from some mode-author-written code that traverses the syntax tree (wisi provides that internally). Except I see below that the emacs tree-sitter package stores the syntax tree in the buffer. One option here is to try to standardize on an elisp representation of a syntax tree, and have both the wisi and tree-sitter parsers provide that. Then the grammar actions could be implemented in elisp. I suspect that would be very slow; elisp is just not good at traversing large complex data structures. That is not just my bias showing (I _much_ prefer doing as much as possible in Ada); I first wrote the ada-mode parser and grammar actions in elisp, and then did a complete rewrite in Ada, gaining significant speed. Although I never considered passing the syntax tree to elisp as a single object, so maybe that could work well. There is no universal standard for representing "a syntax tree". In wisi, the tree is directly produced by the LR shift and reduce operations, and thus is very close the the grammar expressed in BNF. I don't know what the tree-sitter parse tree looks like. AdaCore provides a parser similar in purpose to the wisi parsers (https://github.com/AdaCore/libadalang), that also does more of what an Ada compiler does (which could allow even better font-lock and navigation). To support those additional operations, the syntax tree is quite different from the ada-mode one. In general, each parser library, and even each grammar author, will have different representations for the syntax tree. So if we want to support different parsers, I think it is best to define the Emacs "parser API" as "give text to parser; accept text properties from parser". LSP (via eglot) provides other things the parser can return; code completion menus, for example. And for indent and face, it returns formatted text with markdown. I plan to translate that to text properties to integrate LSP into wisi. Whether LSP requires a full initial parse is up to the LSP server author (LSP itself provides both "here's the full text" and "here's partial text" messages); they have the same considerations discussed above. > And yes, doing this by consing strings is not a good idea, it will > slow things down and cause a lot of GC. It is best avoided. Thus my > questions above. I'm not sure how "convert syntax tree to elisp" compares to "consing strings". I would certainly expect it to cause a lot of GC. >> > Btw, what do you do with the tree returned by the tree-sitter parser? >> > store it in some buffer-local variable? If so, how much memory does >> > such a tree take, and when, if ever, is that memory released? >> > >> >> It's stored in a buffer-local variable. I haven't measured the memory >> they take. Memory is released when the tree object is garbage-collected >> (it's a `user-ptr'). Is it an elisp structure (or accesible from elisp)? Have you written code that traverses it to provide faces and indentation? -- -- Stephe PS; I have the beginnings of a migraine while typing this, so some of it may not make sense. Sigh. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 23:38 ` Stephen Leake @ 2020-04-02 0:25 ` Stephen Leake 2020-04-02 2:46 ` Stefan Monnier ` (2 subsequent siblings) 3 siblings, 0 replies; 43+ messages in thread From: Stephen Leake @ 2020-04-02 0:25 UTC (permalink / raw) To: emacs-devel I looked at the tree-sitter source in git-hub (https://github.com/ubolonton/emacs-tree-sitter) and the tree-sitter doc that points to (https://tree-sitter.github.io/tree-sitter/using-parsers) Stephen Leake <stephen_leake@stephe-leake.org> writes: >>> > Btw, what do you do with the tree returned by the tree-sitter parser? >>> > store it in some buffer-local variable? If so, how much memory does >>> > such a tree take, and when, if ever, is that memory released? >>> > >>> >>> It's stored in a buffer-local variable. I haven't measured the memory >>> they take. Memory is released when the tree object is garbage-collected >>> (it's a `user-ptr'). > > Is it an elisp structure (or accesible from elisp)? It's a Rust structure; there is an emacs module providing elisp access to it (things like "find syntax tree node at point", "get parent node", "get node text"). The syntax tree is a "concrete syntax tree"; it should be quite close to the wisi syntax tree. > Have you written code that traverses it to provide faces and > indentation? Not in that repository. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 23:38 ` Stephen Leake 2020-04-02 0:25 ` Stephen Leake @ 2020-04-02 2:46 ` Stefan Monnier 2020-04-02 4:36 ` Tuấn-Anh Nguyễn 2020-04-02 14:44 ` Eli Zaretskii 2020-04-02 5:21 ` Tuấn-Anh Nguyễn 2020-04-02 14:36 ` Eli Zaretskii 3 siblings, 2 replies; 43+ messages in thread From: Stefan Monnier @ 2020-04-02 2:46 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > In C or C++ body files, "a complete parse" is typically one variable or > function declaration. So if Emacs can reliably find the beginning and > end of those declarations, IIUC, a large part of CC-mode's trouble is exactly the need to find somewhat reliably a position vaguely like "the beginning of a declaration". It's very much a non-trivial problem (and in the general case to properly handle all possible comments you need to start parsing from point-min). >> And yes, doing this by consing strings is not a good idea, it will >> slow things down and cause a lot of GC. It is best avoided. Thus my >> questions above. > I'm not sure how "convert syntax tree to elisp" compares to "consing > strings". I would certainly expect it to cause a lot of GC. If the GC is the worry, we can use a function which encodes the buffer using a given coding-system and returns a malloc'd array of bytes. >>> It's stored in a buffer-local variable. I haven't measured the memory >>> they take. Memory is released when the tree object is garbage-collected >>> (it's a `user-ptr'). > Is it an elisp structure (or accesible from elisp)? Have you written > code that traverses it to provide faces and indentation? According to https://github.com/tree-sitter/tree-sitter/issues/222 the parse tree takes around 10 times the size of the source text. At least that's for tree-sitter's own parse-tree; not sure how that relates to emacs-tree-sitter's yet. Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 2:46 ` Stefan Monnier @ 2020-04-02 4:36 ` Tuấn-Anh Nguyễn 2020-04-02 14:44 ` Eli Zaretskii 1 sibling, 0 replies; 43+ messages in thread From: Tuấn-Anh Nguyễn @ 2020-04-02 4:36 UTC (permalink / raw) To: Stefan Monnier; +Cc: Stephen Leake, emacs-devel > If the GC is the worry, we can use a function which encodes the > buffer using a given coding-system and returns a malloc'd array of bytes. > If we are talking about a function exposed to dynamic modules, then we will also need to expose another function to free that byte array, because the dynamic module may use a different allocator. It's probably better to ask the caller to prepare that array, like what `copy_string_contents' does. > >>> It's stored in a buffer-local variable. I haven't measured the memory > >>> they take. Memory is released when the tree object is garbage-collected > >>> (it's a `user-ptr'). > > Is it an elisp structure (or accesible from elisp)? Have you written > > code that traverses it to provide faces and indentation? > > According to https://github.com/tree-sitter/tree-sitter/issues/222 the > parse tree takes around 10 times the size of the source text. At least > that's for tree-sitter's own parse-tree; not sure how that relates to > emacs-tree-sitter's yet. > emacs-tree-sitter adds 16 bytes for reference counting and 8 bytes for checking concurrent modifications (because nodes are also exposed to Lisp as objects). That's negligible I think. -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 2:46 ` Stefan Monnier 2020-04-02 4:36 ` Tuấn-Anh Nguyễn @ 2020-04-02 14:44 ` Eli Zaretskii 2020-04-02 15:19 ` Stefan Monnier 2020-04-03 2:49 ` [SPAM UNSURE] " Stephen Leake 1 sibling, 2 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-02 14:44 UTC (permalink / raw) To: Stefan Monnier; +Cc: stephen_leake, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Date: Wed, 01 Apr 2020 22:46:18 -0400 > Cc: emacs-devel <emacs-devel@gnu.org> > > If the GC is the worry, we can use a function which encodes the > buffer using a given coding-system and returns a malloc'd array of bytes. I think we should try to avoid both copying and encoding the text we send to the parser. Both operations are expensive and require memory allocation. > According to https://github.com/tree-sitter/tree-sitter/issues/222 the > parse tree takes around 10 times the size of the source text. Yes, that's another reason why it might make sense to "forget" trees of buffers that were not displayed for a long time. But this is an optimization that can be added later without any significant changes in the design. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 14:44 ` Eli Zaretskii @ 2020-04-02 15:19 ` Stefan Monnier 2020-04-03 2:49 ` [SPAM UNSURE] " Stephen Leake 1 sibling, 0 replies; 43+ messages in thread From: Stefan Monnier @ 2020-04-02 15:19 UTC (permalink / raw) To: Eli Zaretskii; +Cc: stephen_leake, emacs-devel > I think we should try to avoid both copying and encoding the text we > send to the parser. Both operations are expensive and require memory > allocation. I think both operations are cheap enough relatively to the actual parsing that it is not indispensable to avoid them: maybe it will be worth the effort, but maybe not. In any case, it's a minor implementation detail that can easily be changed in the future without impacting the rest of the code. So, I think it falls squarely in the realm of premature optimization. >> According to https://github.com/tree-sitter/tree-sitter/issues/222 the >> parse tree takes around 10 times the size of the source text. > Yes, that's another reason why it might make sense to "forget" trees > of buffers that were not displayed for a long time. Agreed, tho I wouldn't word it that way: parse trees are not needed for redisplay and can be used for things that don't relate to redisplay (e.g. navigation, indentation, ...). > But this is an optimization that can be added later without any > significant changes in the design. Agreed as well. Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 14:44 ` Eli Zaretskii 2020-04-02 15:19 ` Stefan Monnier @ 2020-04-03 2:49 ` Stephen Leake 2020-04-03 7:47 ` Eli Zaretskii 2020-04-03 8:11 ` Robert Pluim 1 sibling, 2 replies; 43+ messages in thread From: Stephen Leake @ 2020-04-03 2:49 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Stefan Monnier <monnier@iro.umontreal.ca> >> Date: Wed, 01 Apr 2020 22:46:18 -0400 >> Cc: emacs-devel <emacs-devel@gnu.org> >> >> If the GC is the worry, we can use a function which encodes the >> buffer using a given coding-system and returns a malloc'd array of bytes. > > I think we should try to avoid both copying and encoding the text we > send to the parser. Both operations are expensive and require memory > allocation. I don't understand what the alternative is. The parser imposes the reasonable requirement that the input text be utf-8 (or possibly some other standard format). Emacs raw buffer text is not utf-8, so we must do some encoding. If we try to pass a plain pointer to a point in the Emacs internal buffer, there is no way to do that encoding. It would be possible to change the lexer in the parser to accept Emacs raw buffer format, but I don't think you are proposing that. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 2:49 ` [SPAM UNSURE] " Stephen Leake @ 2020-04-03 7:47 ` Eli Zaretskii 2020-04-03 18:11 ` Stephen Leake 2020-04-03 8:11 ` Robert Pluim 1 sibling, 1 reply; 43+ messages in thread From: Eli Zaretskii @ 2020-04-03 7:47 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > From: Stephen Leake <stephen_leake@stephe-leake.org> > Date: Thu, 02 Apr 2020 18:49:07 -0800 > > > I think we should try to avoid both copying and encoding the text we > > send to the parser. Both operations are expensive and require memory > > allocation. > > I don't understand what the alternative is. The parser imposes the > reasonable requirement that the input text be utf-8 (or possibly some > other standard format). Emacs raw buffer text is not utf-8, so we must > do some encoding. Emacs represents buffer text as a superset of UTF-8, with the violations of strict UTF-8 being very rare in buffers that hold program sources. The function we can provide that lets tree-sitter access buffer text can cope with those violations, if it turns out that tree-sitter cannot do that by itself (which frankly, I'd expect it to be able to do). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 7:47 ` Eli Zaretskii @ 2020-04-03 18:11 ` Stephen Leake 2020-04-03 18:46 ` Eli Zaretskii 0 siblings, 1 reply; 43+ messages in thread From: Stephen Leake @ 2020-04-03 18:11 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Stephen Leake <stephen_leake@stephe-leake.org> >> Date: Thu, 02 Apr 2020 18:49:07 -0800 >> >> > I think we should try to avoid both copying and encoding the text we >> > send to the parser. Both operations are expensive and require memory >> > allocation. >> >> I don't understand what the alternative is. The parser imposes the >> reasonable requirement that the input text be utf-8 (or possibly some >> other standard format). Emacs raw buffer text is not utf-8, so we must >> do some encoding. > > Emacs represents buffer text as a superset of UTF-8, with the > violations of strict UTF-8 being very rare in buffers that hold > program sources. The function we can provide that lets tree-sitter > access buffer text can cope with those violations, Ok. "cope with those violations" = "do some encoding". We can avoid copying _if_ the encoding does not change character positions, or somehow preserves positions, for example with an auxiliary table of changes due to encoding. Coping with violations in the lexer would make it much easier to avoid changing character positions; it is easy to simply ignore bytes there. wisi makes it easy to implement this in the lexer (because it uses re2c), although currently there is no way to make that language-specific (that would be an enhancement). https://tree-sitter.github.io/tree-sitter/creating-parsers#external-scanners describes the facility for enhancing the tree-sitter lexer (aka scanner). That is not convenient for handling this issue, so we'd have to request (and or provide) an enhancement. We cannot avoid encoding (either in the read function provided to tree-sitter, or in the tree-sitter lexer), but the encoding may be very simple and efficient. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 18:11 ` Stephen Leake @ 2020-04-03 18:46 ` Eli Zaretskii 2020-04-04 0:05 ` Stephen Leake 0 siblings, 1 reply; 43+ messages in thread From: Eli Zaretskii @ 2020-04-03 18:46 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > From: Stephen Leake <stephen_leake@stephe-leake.org> > Date: Fri, 03 Apr 2020 10:11:05 -0800 > > > Emacs represents buffer text as a superset of UTF-8, with the > > violations of strict UTF-8 being very rare in buffers that hold > > program sources. The function we can provide that lets tree-sitter > > access buffer text can cope with those violations, > > Ok. "cope with those violations" = "do some encoding". If we use "encoding" terminology for this, it will be confusing and will cause misunderstandings. "Conversion" is better, IMO. Some sequences may need to be converted when feeding them to tree-sitter. But I think tree-sitter should be able to cope with this itself. It is unreasonable to expect strict UTF-8 from all applications. Maybe I'm dreaming, but ISTR there is (or was) an issue on their issue tracker about this. > We cannot avoid encoding (either in the read function provided to > tree-sitter, or in the tree-sitter lexer), but the encoding may be very > simple and efficient. Once again, please reserve "encoding" to the likes of encode-coding-region or code_convert_string, to avoid confusion. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 18:46 ` Eli Zaretskii @ 2020-04-04 0:05 ` Stephen Leake 0 siblings, 0 replies; 43+ messages in thread From: Stephen Leake @ 2020-04-04 0:05 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Stephen Leake <stephen_leake@stephe-leake.org> >> Date: Fri, 03 Apr 2020 10:11:05 -0800 >> >> > Emacs represents buffer text as a superset of UTF-8, with the >> > violations of strict UTF-8 being very rare in buffers that hold >> > program sources. The function we can provide that lets tree-sitter >> > access buffer text can cope with those violations, >> >> Ok. "cope with those violations" = "do some encoding". > > If we use "encoding" terminology for this, it will be confusing and > will cause misunderstandings. Yeah, I realized that after I posted this. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 2:49 ` [SPAM UNSURE] " Stephen Leake 2020-04-03 7:47 ` Eli Zaretskii @ 2020-04-03 8:11 ` Robert Pluim 2020-04-03 11:00 ` Eli Zaretskii 1 sibling, 1 reply; 43+ messages in thread From: Robert Pluim @ 2020-04-03 8:11 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel >>>>> On Thu, 02 Apr 2020 18:49:07 -0800, Stephen Leake <stephen_leake@stephe-leake.org> said: Stephen> Eli Zaretskii <eliz@gnu.org> writes: >>> From: Stefan Monnier <monnier@iro.umontreal.ca> >>> Date: Wed, 01 Apr 2020 22:46:18 -0400 >>> Cc: emacs-devel <emacs-devel@gnu.org> >>> >>> If the GC is the worry, we can use a function which encodes the >>> buffer using a given coding-system and returns a malloc'd array of bytes. >> >> I think we should try to avoid both copying and encoding the text we >> send to the parser. Both operations are expensive and require memory >> allocation. Stephen> I don't understand what the alternative is. The parser imposes the Stephen> reasonable requirement that the input text be utf-8 (or possibly some Stephen> other standard format). Emacs raw buffer text is not utf-8, so we must Stephen> do some encoding. Itʼs pretty close, apart from raw bytes. How much of an imposition would it be in practice to say 'source code must not contain raw bytes'? Stephen> If we try to pass a plain pointer to a point in the Emacs internal Stephen> buffer, there is no way to do that encoding. As pointed out elsewhere, you'd have to take the gap into account, so it would be two pointers and two lengths to describe the entire buffer text. Robert ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 8:11 ` Robert Pluim @ 2020-04-03 11:00 ` Eli Zaretskii 2020-04-03 11:09 ` Robert Pluim 2020-04-03 11:21 ` John Yates 0 siblings, 2 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-03 11:00 UTC (permalink / raw) To: Robert Pluim; +Cc: stephen_leake, emacs-devel > From: Robert Pluim <rpluim@gmail.com> > Date: Fri, 03 Apr 2020 10:11:07 +0200 > Cc: emacs-devel <emacs-devel@gnu.org> > > Stephen> I don't understand what the alternative is. The parser imposes the > Stephen> reasonable requirement that the input text be utf-8 (or possibly some > Stephen> other standard format). Emacs raw buffer text is not utf-8, so we must > Stephen> do some encoding. > > Itʼs pretty close, apart from raw bytes. Not only raw bytes: also some characters that cannot be unified with Unicode. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 11:00 ` Eli Zaretskii @ 2020-04-03 11:09 ` Robert Pluim 2020-04-03 12:44 ` Eli Zaretskii 2020-04-03 11:21 ` John Yates 1 sibling, 1 reply; 43+ messages in thread From: Robert Pluim @ 2020-04-03 11:09 UTC (permalink / raw) To: Eli Zaretskii; +Cc: stephen_leake, emacs-devel >>>>> On Fri, 03 Apr 2020 14:00:06 +0300, Eli Zaretskii <eliz@gnu.org> said: >> From: Robert Pluim <rpluim@gmail.com> >> Date: Fri, 03 Apr 2020 10:11:07 +0200 >> Cc: emacs-devel <emacs-devel@gnu.org> >> Stephen> I don't understand what the alternative is. The parser imposes the Stephen> reasonable requirement that the input text be utf-8 (or possibly some Stephen> other standard format). Emacs raw buffer text is not utf-8, so we must Stephen> do some encoding. >> >> Itʼs pretty close, apart from raw bytes. Eli> Not only raw bytes: also some characters that cannot be unified with Eli> Unicode. And again: how likely are those characters to be in source code? Robert ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 11:09 ` Robert Pluim @ 2020-04-03 12:44 ` Eli Zaretskii 0 siblings, 0 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-03 12:44 UTC (permalink / raw) To: Robert Pluim; +Cc: stephen_leake, emacs-devel > From: Robert Pluim <rpluim@gmail.com> > Cc: stephen_leake@stephe-leake.org, emacs-devel@gnu.org > Date: Fri, 03 Apr 2020 13:09:34 +0200 > > Eli> Not only raw bytes: also some characters that cannot be unified with > Eli> Unicode. > > And again: how likely are those characters to be in source code? Probably not too likely, but we need to have a solution for when they do happen. I guess a useful first step would be for someone to find out what does tree-sitter do when it encounters such byte sequences. We can then devise a solution that will produce good results, preferably without the need to ask tree-sitter developers to cater to these situations. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 11:00 ` Eli Zaretskii 2020-04-03 11:09 ` Robert Pluim @ 2020-04-03 11:21 ` John Yates 2020-04-03 12:50 ` Eli Zaretskii 1 sibling, 1 reply; 43+ messages in thread From: John Yates @ 2020-04-03 11:21 UTC (permalink / raw) To: Eli Zaretskii; +Cc: Robert Pluim, Stephen Leake, Emacs developers On Fri, Apr 3, 2020 at 7:01 AM Eli Zaretskii <eliz@gnu.org> wrote: > Not only raw bytes: also some characters that cannot be unified with > Unicode. How are the semantics of such characters queried? E.g. can such characters participate in case conversion? /john ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 11:21 ` John Yates @ 2020-04-03 12:50 ` Eli Zaretskii 0 siblings, 0 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-03 12:50 UTC (permalink / raw) To: John Yates; +Cc: rpluim, stephen_leake, emacs-devel > From: John Yates <john@yates-sheets.org> > Date: Fri, 3 Apr 2020 07:21:28 -0400 > Cc: Robert Pluim <rpluim@gmail.com>, Stephen Leake <stephen_leake@stephe-leake.org>, > Emacs developers <emacs-devel@gnu.org> > > On Fri, Apr 3, 2020 at 7:01 AM Eli Zaretskii <eliz@gnu.org> wrote: > > Not only raw bytes: also some characters that cannot be unified with > > Unicode. > > How are the semantics of such characters queried? I don't think I understand well enough what you mean by "semantics" in this context. Pleased elaborate: what else is in the semantics except case conversions that you mention below? > E.g. can such characters participate in case conversion? Yes, of course. Emacs converts letter-case by consulting the case tables, see the node "Case Tables" in the ELisp manual. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 23:38 ` Stephen Leake 2020-04-02 0:25 ` Stephen Leake 2020-04-02 2:46 ` Stefan Monnier @ 2020-04-02 5:21 ` Tuấn-Anh Nguyễn 2020-04-02 9:24 ` [SPAM UNSURE] " Stephen Leake 2020-04-02 14:36 ` Eli Zaretskii 3 siblings, 1 reply; 43+ messages in thread From: Tuấn-Anh Nguyễn @ 2020-04-02 5:21 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > > My suggestion is first to figure out how to do this stuff efficiently > > from within Emacs itself, as if the module interface were not part of > > the equation. We can add that aspect back later. > > There are two times the wisi code that wraps the parser needs access to > the buffer; first to copy the text, second to add text properties > (faces, indent values, navigation markers). There are usually many text > properties output by each parse. > > The positions and values of the text properties are computed by > functions that run after the complete syntax tree has been produced. In > wisi, those functions are added directly in the grammar source file > (where they are called "post-parse grammar actions"). In tree-sitter, I > assume they are called from some mode-author-written code that traverses > the syntax tree (wisi provides that internally). Except I see below that > the emacs tree-sitter package stores the syntax tree in the buffer. > The preferred approach with tree-sitter is querying: https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [SPAM UNSURE] Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 5:21 ` Tuấn-Anh Nguyễn @ 2020-04-02 9:24 ` Stephen Leake 0 siblings, 0 replies; 43+ messages in thread From: Stephen Leake @ 2020-04-02 9:24 UTC (permalink / raw) To: emacs-devel Tuấn-Anh Nguyễn <ubolonton@gmail.com> writes: >> > My suggestion is first to figure out how to do this stuff efficiently >> > from within Emacs itself, as if the module interface were not part of >> > the equation. We can add that aspect back later. >> >> There are two times the wisi code that wraps the parser needs access to >> the buffer; first to copy the text, second to add text properties >> (faces, indent values, navigation markers). There are usually many text >> properties output by each parse. >> >> The positions and values of the text properties are computed by >> functions that run after the complete syntax tree has been produced. In >> wisi, those functions are added directly in the grammar source file >> (where they are called "post-parse grammar actions"). In tree-sitter, I >> assume they are called from some mode-author-written code that traverses >> the syntax tree (wisi provides that internally). Except I see below that >> the emacs tree-sitter package stores the syntax tree in the buffer. >> > > The preferred approach with tree-sitter is querying: > https://tree-sitter.github.io/tree-sitter/using-parsers#pattern-matching-with-queries For access to the syntax tree, yes. There must be code somewhere that computes face, indent and navigation (and code completion, ...). That code will build on top of the syntax tree access; it could be in Rust (in the module) or in elisp (using the module functions). Or in C linked directly into Emacs, as Eli suggests. But I don't think he meant that as an actual implementation approach, just as a design approach. The wisi Ada code that computes the text properties accesses the syntax tree more directly, but that's just an implementation detail. I think it makes sense at this point to try to merge wisi and emacs-tree-sitter. There are several approaches: 1. rewrite the wisi grammar actions in elisp, using the emacs-tree-sitter module functions to access the syntax tree. 2. rewrite the wisi grammar actions in Rust, using Rust functions to access the syntax tree 3. rewrite the emacs-tree-sitter module in Ada, using an Ada binding to the Tree-Sitter C API. Then the Emacs module would provide the current wisi Ada code, modified to work with a Tree-Sitter parser. 4. There would also be value in doing an independent design and implementation of code to compute face, indent and navigation using the tree-sitter syntax tree; there might be a better approach than what wisi does. 1 is probably the quickest path to getting something working, but 2 or 3 will probably provide faster execution time. Ideally we'd do all three (or four) and get some good metrics. After doing one of the above, we must still write the calls to the grammar actions for each language of interest. In wisi, this is done by adding grammar actions to the grammar source code; for example, here is the indent action for the Ada 'if then end if' statement: if_statement : IF expression_opt THEN sequence_of_statements_opt END IF SEMICOLON %((wisi-indent-action [nil [(wisi-hanging% ada-indent-broken (* 2 ada-indent-broken)) ada-indent-broken] nil [ada-indent ada-indent] nil nil nil]))% There is one lisp form for each token in the grammar production. IF is not indented by this action; it is indented by the enclosing Ada statement. The conditional expression is indented by wisi-hanging; comments within the expression (assuming it is multi-line) are indented by ada-indent-broken. wisi-hanging takes care of indenting the second line in a long expression. THEN, END IF SEMICOLON are not indented. The statements in the true branch are indented by ada-indent. If ada-indent is 3, ada-broken-indent 2, this produces: if a or b -- a comment then statement_1; statement_2; -- another comment end if; In the upstream development repository for the wisi package (https://savannah.nongnu.org/projects/ada-mode/), there is a user guide to the grammar actions. I can provide an html or info version on request. (or get my act together and do another wisi/ada-mode release). In Tree-Sitter, the calls to the grammar actions are written in code that traverses the syntax tree; this would be a higher level elisp or Rust function, for 1 and 2 above. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 23:38 ` Stephen Leake ` (2 preceding siblings ...) 2020-04-02 5:21 ` Tuấn-Anh Nguyễn @ 2020-04-02 14:36 ` Eli Zaretskii 2020-04-03 2:27 ` Stephen Leake 3 siblings, 1 reply; 43+ messages in thread From: Eli Zaretskii @ 2020-04-02 14:36 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > From: Stephen Leake <stephen_leake@stephe-leake.org> > Date: Wed, 01 Apr 2020 15:38:26 -0800 > > Eli Zaretskii <eliz@gnu.org> writes: > > > Also, direct access to buffer text generally means we must make sure > > GC never runs as long as pointers to buffer text are lying around. > > Can any Lisp run between calls to the reader function that the > > tree-sitter parser calls to access the buffer text? > > If the parser copies the text into an internal buffer, that reader > function should only be called once per call to the parser. Such copying is not really scalable, and IMO should be avoided. During active editing, redisplay runs very frequently, and having to copy portions of the buffer, let alone all of it, each time, which necessarily requires memory allocation, consing of Lisp objects, etc., will produce significant memory pressure, expensive heap allocations/deallocations, and a lot of GC. Recall that on many modern platforms Emacs doesn't really return memory to the system, which means we risk increasing the memory footprint, and create system-wide memory pressure. It isn't a catastrophe, but we should try to avoid it if possible. > Since Emacs has the entire file in memory, the parser can too. Having the file twice or more in memory is worse than having it only once. > However, if we are really trying to avoid copying text (which is very > premature optimization) I don't think it's premature. > In sum, the short answer is "yes, you must parse the whole file, unless > your language is particularly simple". Funny, my conclusion from reading your detailed description was entirely different. > > IOW, the issue with exposing access to buffer text to modules is IMO > > secondary. > > yes, because copying text is fast compared to everything else going on. That wasn't my motivation when I wrote that. > In general, each parser library, and even each grammar author, will have > different representations for the syntax tree. > > So if we want to support different parsers, I think it is best to define > the Emacs "parser API" as "give text to parser; accept text properties > from parser". Yes, something like that. It's probably enough to accept a list of regions with syntactic attributes. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 14:36 ` Eli Zaretskii @ 2020-04-03 2:27 ` Stephen Leake 2020-04-03 7:43 ` Eli Zaretskii 0 siblings, 1 reply; 43+ messages in thread From: Stephen Leake @ 2020-04-03 2:27 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Stephen Leake <stephen_leake@stephe-leake.org> >> Date: Wed, 01 Apr 2020 15:38:26 -0800 >> >> Eli Zaretskii <eliz@gnu.org> writes: >> >> > Also, direct access to buffer text generally means we must make sure >> > GC never runs as long as pointers to buffer text are lying around. >> > Can any Lisp run between calls to the reader function that the >> > tree-sitter parser calls to access the buffer text? >> >> If the parser copies the text into an internal buffer, that reader >> function should only be called once per call to the parser. > > Such copying is not really scalable, and IMO should be avoided. > During active editing, redisplay runs very frequently, and having to > copy portions of the buffer, let alone all of it, each time, which > necessarily requires memory allocation, consing of Lisp objects, etc., > will produce significant memory pressure, expensive heap > allocations/deallocations, and a lot of GC. Recall that on many > modern platforms Emacs doesn't really return memory to the system, > which means we risk increasing the memory footprint, and create > system-wide memory pressure. It isn't a catastrophe, but we should > try to avoid it if possible. Ok. I know very little about the internal storage of text in Emacs. There is at least two strings with a gap at the current edit point; if we pass a simple pointer to tree-sitter, it will have to handle the gap. You mention "consing of Lisp objects" above, which says to me that the text is stored in a more complex structure. How can we provide direct access of that to tree-sitter? Avoid _all_ copying is impossible; the parser must store the contents of each token in some way. Typically that is done by storing pointers/indices into the text buffer that contains the entire text. >> In sum, the short answer is "yes, you must parse the whole file, unless >> your language is particularly simple". > > Funny, my conclusion from reading your detailed description was > entirely different. I need more than that to respond in a helpful way. >> In general, each parser library, and even each grammar author, will have >> different representations for the syntax tree. >> >> So if we want to support different parsers, I think it is best to define >> the Emacs "parser API" as "give text to parser; accept text properties >> from parser". > > Yes, something like that. It's probably enough to accept a list of > regions with syntactic attributes. Ok, good. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 2:27 ` Stephen Leake @ 2020-04-03 7:43 ` Eli Zaretskii 2020-04-03 17:45 ` Stephen Leake 0 siblings, 1 reply; 43+ messages in thread From: Eli Zaretskii @ 2020-04-03 7:43 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > From: Stephen Leake <stephen_leake@stephe-leake.org> > Date: Thu, 02 Apr 2020 18:27:59 -0800 > > > Such copying is not really scalable, and IMO should be avoided. > > During active editing, redisplay runs very frequently, and having to > > copy portions of the buffer, let alone all of it, each time, which > > necessarily requires memory allocation, consing of Lisp objects, etc., > > will produce significant memory pressure, expensive heap > > allocations/deallocations, and a lot of GC. Recall that on many > > modern platforms Emacs doesn't really return memory to the system, > > which means we risk increasing the memory footprint, and create > > system-wide memory pressure. It isn't a catastrophe, but we should > > try to avoid it if possible. > > Ok. I know very little about the internal storage of text in Emacs. > There is at least two strings with a gap at the current edit point; if > we pass a simple pointer to tree-sitter, it will have to handle the gap. Tree-sitter allows the application to define a "reader" function that it will then call to access buffer text. That function should cope with the gap. > You mention "consing of Lisp objects" above, which says to me that the > text is stored in a more complex structure. I meant the consing that is necessary to make a buffer-substring that will be passed to the parser. > How can we provide direct access of that to tree-sitter? See above: by writing our function to access buffer text. > Avoid _all_ copying is impossible; the parser must store the contents of > each token in some way. Typically that is done by storing > pointers/indices into the text buffer that contains the entire text. I don't think tree-sitter does that, because the text it gets is ephemeral. If we pass it a buffer-substring, it's a temporary string which will be GCed after it's used; if we pass it pointers to buffer text, those pointers can be invalid after GC, because GC can relocate buffer text to a different memory region. They definitely do copy portions of the text they get for internal processing purposes, but I doubt that they duplicate all of it, because that would not be scalable to huge buffers. And in any case, any copying we do would be _in_addition_ to what tree-sitter does internally. > >> In sum, the short answer is "yes, you must parse the whole file, unless > >> your language is particularly simple". > > > > Funny, my conclusion from reading your detailed description was > > entirely different. > > I need more than that to respond in a helpful way. Well, you said: > To some extent, that depends on the language. and then went on to describing how each language might _not_ need a full parse in many cases. Thus the conclusion sounded a bit radical to me. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 7:43 ` Eli Zaretskii @ 2020-04-03 17:45 ` Stephen Leake 2020-04-03 18:31 ` Eli Zaretskii 0 siblings, 1 reply; 43+ messages in thread From: Stephen Leake @ 2020-04-03 17:45 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Stephen Leake <stephen_leake@stephe-leake.org> >> Date: Thu, 02 Apr 2020 18:27:59 -0800 >> >> > Such copying is not really scalable, and IMO should be avoided. >> > During active editing, redisplay runs very frequently, and having to >> > copy portions of the buffer, let alone all of it, each time, which >> > necessarily requires memory allocation, consing of Lisp objects, etc., >> > will produce significant memory pressure, expensive heap >> > allocations/deallocations, and a lot of GC. Recall that on many >> > modern platforms Emacs doesn't really return memory to the system, >> > which means we risk increasing the memory footprint, and create >> > system-wide memory pressure. It isn't a catastrophe, but we should >> > try to avoid it if possible. >> >> Ok. I know very little about the internal storage of text in Emacs. >> There is at least two strings with a gap at the current edit point; if >> we pass a simple pointer to tree-sitter, it will have to handle the gap. > > Tree-sitter allows the application to define a "reader" function that > it will then call to access buffer text. That function should cope > with the gap. and also with the encoding, which you did not address. I don't see how that is different from the C level buffer-substring. Certainly there should be a module function buffer-substring that is as efficient as possible. >> You mention "consing of Lisp objects" above, which says to me that the >> text is stored in a more complex structure. > > I meant the consing that is necessary to make a buffer-substring that > will be passed to the parser. Since are are calling the parser from C (if it is linked into Emacs, or in a module), I still don't understand. Does C code have to cons to create a string? It will have to allocate if the requested range is not contiguous in the buffer. >> Avoid _all_ copying is impossible; the parser must store the contents of >> each token in some way. Typically that is done by storing >> pointers/indices into the text buffer that contains the entire text. > > I don't think tree-sitter does that, because the text it gets is > ephemeral. If we pass it a buffer-substring, it's a temporary string > which will be GCed after it's used; if we pass it pointers to buffer > text, those pointers can be invalid after GC, because GC can relocate > buffer text to a different memory region. Hmm. https://tree-sitter.github.io/tree-sitter/using-parsers#providing-the-code says: Syntax nodes store their position in the source code both in terms of raw bytes and row/column coordinates In the case of passing a pointer to a string (or buffer, etc), those positions are relative to that original buffer. So the Emacs buffer is serving as the parse buffer. Ok, that avoids any copying. If we pass a buffer-substring to the parser, we are then responsible for mapping positions relative to the substring into positions relative to the full buffer. wisi delegates that to the parser; it can pass start-char-pos and start-byte-pos to the parser along with a string. >> >> In sum, the short answer is "yes, you must parse the whole file, unless >> >> your language is particularly simple". >> > >> > Funny, my conclusion from reading your detailed description was >> > entirely different. >> >> I need more than that to respond in a helpful way. > > Well, you said: > >> To some extent, that depends on the language. > > and then went on to describing how each language might _not_ need a > full parse in many cases. Thus the conclusion sounded a bit radical > to me. Ok, we are putting different spins on what "particularly simple" means. A more neutral phrasing would be: Some languages require parsing the whole file, some do not. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 17:45 ` Stephen Leake @ 2020-04-03 18:31 ` Eli Zaretskii 2020-04-04 0:04 ` Stephen Leake 0 siblings, 1 reply; 43+ messages in thread From: Eli Zaretskii @ 2020-04-03 18:31 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > From: Stephen Leake <stephen_leake@stephe-leake.org> > Date: Fri, 03 Apr 2020 09:45:44 -0800 > > > Tree-sitter allows the application to define a "reader" function that > > it will then call to access buffer text. That function should cope > > with the gap. > > and also with the encoding, which you did not address. I mentioned that in another message: I don't think encoding is necessary in this case. > I don't see how that is different from the C level > buffer-substring. Certainly there should be a module function > buffer-substring that is as efficient as possible. If modules are allowed direct access to buffer text, then it's indeed not different. But the alternative that was discussed was different. May I suggest that you look at the code of the module which triggered this? > >> You mention "consing of Lisp objects" above, which says to me that the > >> text is stored in a more complex structure. > > > > I meant the consing that is necessary to make a buffer-substring that > > will be passed to the parser. > > Since are are calling the parser from C (if it is linked into Emacs, or > in a module), I still don't understand. Does C code have to cons to > create a string? If course. How else do you get a UTF-8 encoded string to pass to the parser as a copy of buffer text? > > I don't think tree-sitter does that, because the text it gets is > > ephemeral. If we pass it a buffer-substring, it's a temporary string > > which will be GCed after it's used; if we pass it pointers to buffer > > text, those pointers can be invalid after GC, because GC can relocate > > buffer text to a different memory region. > > Hmm. > https://tree-sitter.github.io/tree-sitter/using-parsers#providing-the-code > says: > > Syntax nodes store their position in the source code both in terms > of raw bytes and row/column coordinates Positions are okay; 'char *' pointers to buffer or string text are not. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-03 18:31 ` Eli Zaretskii @ 2020-04-04 0:04 ` Stephen Leake 2020-04-04 7:13 ` Eli Zaretskii 0 siblings, 1 reply; 43+ messages in thread From: Stephen Leake @ 2020-04-04 0:04 UTC (permalink / raw) To: emacs-devel Eli Zaretskii <eliz@gnu.org> writes: >> From: Stephen Leake <stephen_leake@stephe-leake.org> >> >> >> You mention "consing of Lisp objects" above, which says to me that the >> >> text is stored in a more complex structure. >> > >> > I meant the consing that is necessary to make a buffer-substring that >> > will be passed to the parser. >> >> Since are are calling the parser from C (if it is linked into Emacs, or >> in a module), I still don't understand. Does C code have to cons to >> create a string? > > If course. How else do you get a UTF-8 encoded string to pass to the > parser as a copy of buffer text? malloc and memcpy. I guess that's what you mean by "cons"; I was assuming you meant the actual elisp function. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-04 0:04 ` Stephen Leake @ 2020-04-04 7:13 ` Eli Zaretskii 0 siblings, 0 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-04 7:13 UTC (permalink / raw) To: Stephen Leake; +Cc: emacs-devel > From: Stephen Leake <stephen_leake@stephe-leake.org> > Date: Fri, 03 Apr 2020 16:04:04 -0800 > > >> Since are are calling the parser from C (if it is linked into Emacs, or > >> in a module), I still don't understand. Does C code have to cons to > >> create a string? > > > > If course. How else do you get a UTF-8 encoded string to pass to the > > parser as a copy of buffer text? > > malloc and memcpy. How do you know how much memory to allocate? And memcpy doesn't cut it, because you forgot the encoding step. You could, of course, take the low-level encoding code from coding.c and make your own high-level functions that don't work with Lisp objects. But (a) why bother doing that? and (b) I think you will quickly find out that this is a non-trivial job, since coding.c "knows", to the lowest level, that it's dealing with Lisp objects (buffers or strings), so you'd need pretty much to rewrite everything. It's no accident that the Cygwin port uses the Lisp string machinery even when it needs to convert strings from UTF-16 (see from_unicode), even though it basically needs to convert C strings. > I guess that's what you mean by "cons"; I was assuming you meant the > actual elisp function. No, I meant "consing" as in "make a Lisp string", then encode it (which makes another Lisp string). ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-01 19:33 ` Eli Zaretskii 2020-04-01 23:38 ` Stephen Leake @ 2020-04-02 4:21 ` Tuấn-Anh Nguyễn 2020-04-02 5:19 ` Jorge Javier Araya Navarro ` (3 more replies) 1 sibling, 4 replies; 43+ messages in thread From: Tuấn-Anh Nguyễn @ 2020-04-02 4:21 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On Thu, Apr 2, 2020 at 2:33 AM Eli Zaretskii <eliz@gnu.org> wrote: > > > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > > Date: Thu, 2 Apr 2020 00:55:45 +0700 > > Cc: emacs-devel@gnu.org > > > > > Did you consider using the API where an application can provide a > > > function to return text at a given offset? Such a function could be > > > relatively easily implemented for Emacs. > > > > > > > I don't understand what you mean. Below I'll explain how it works > > currently. [...] If dynamic modules have direct access to the > > buffer text, none of the above is an issue. > > > > Such direct access can be enabled by something like this: > > > > char* (*access_buffer_text) (emacs_env *env, > > emacs_value buffer, > > ptrdiff_t byte_offset, > > ptrdiff_t *size_inout); > > > > Of course, such an API would require extensive documentation on how it > > must be used, to ensure safety and correctness. > > I think you are moving too fast, and keep the current implementation > in sight too much. > I'm actually moving too slow here. I have thought about this part quite a bit, but I'm currently focusing on other things, partially because this is not painful bottleneck. > What I suggest is to step back and see how such direct access, if it > were available, could be used with tree-sitter. Let's forget about > modules for a moment and consider tree-sitter linked with Emacs and > capable of calling any C function in core. How would you use that? > > Buffer text is not exactly UTF-8, it's a superset of UTF-8. So one > question to answer is what to do with byte sequences that are not > valid UTF-8. Any suggestions or ideas? How does tree-sitter handle > invalid byte sequences in general? > I haven't checked yet. It will probably bail out, which is usually the desired behavior. The tree-sitter's author is likely open to making this behavior configurable here, though. Alternatively, the direct access function can offer different behaviors: as-is, bail-out, skip-over, or null-out (tree-sitter will skip over null bytes, IIRC). > Also, direct access to buffer text generally means we must make sure > GC never runs as long as pointers to buffer text are lying around. > Can any Lisp run between calls to the reader function that the > tree-sitter parser calls to access the buffer text? If so, we need to > take care of that issue. > With direct access, no Lisp code will be run between these calls. > Next, I'm still asking whether parsing the whole buffer when it is > first created is necessary. Can we pass to the parser just a small > chunk (say, 500 bytes) of the buffer around the window-full to be > displayed next? If this presents problems, what are those problems? > In principle (not in tree-sitter ATM), and in very specific cases, yes. IMO that's the wrong focus on a premature optimization anyway. As others noted, even in the pathological case of xdisp.c, the performance is acceptable. Also keep in mind that syntax highlighting is just one application. Other use cases usually want a full parse tree. If we really want to tackle this issue, there are other approaches to consider, e.g. background parsing, or parsing up until a time limit, and resume parsing when Emacs is idle. Tree-sitter's API supports the latter. But again, both thought exercises and my usage so far point to this being a non-issue. > IOW, the issue with exposing access to buffer text to modules is IMO > secondary. My suggestion is first to figure out how to do this stuff > efficiently from within Emacs itself, as if the module interface were > not part of the equation. We can add that aspect back later. > My opinion is that it's better to experiment with this kind of stuff out-of-core. It can move forward faster that way, allowing more lessons to be learned. Real lessons, involving real-world use cases, not thought exercises. In a somewhat similar vein, writing emacs-tree-sitter highlighted real issues with dynamic modules, which I'm going to write up sometime. > And yes, doing this by consing strings is not a good idea, it will > slow things down and cause a lot of GC. It is best avoided. Thus my > questions above. > > > > Btw, what do you do with the tree returned by the tree-sitter parser? > > > store it in some buffer-local variable? If so, how much memory does > > > such a tree take, and when, if ever, is that memory released? > > > > > > > It's stored in a buffer-local variable. I haven't measured the memory > > they take. Memory is released when the tree object is garbage-collected > > (it's a `user-ptr'). > > So if I have many hundreds of buffers, I could have such a tree in > each one of them indefinitely? Perhaps that's one more design issue > to consider, given that the parsing is so fast. Similar to what we do > with image and face caches -- we flush them from time to time, to keep > the memory footprint in check. So a buffer that was not current more > than some time interval ago could have its tree GCed. > That can work. Alternatively, tree-sitter can add support for "folding" subtrees, as Stefan suggested. -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 4:21 ` Tuấn-Anh Nguyễn @ 2020-04-02 5:19 ` Jorge Javier Araya Navarro 2020-04-02 9:29 ` Stephen Leake 2020-04-02 10:37 ` Andrea Corallo ` (2 subsequent siblings) 3 siblings, 1 reply; 43+ messages in thread From: Jorge Javier Araya Navarro @ 2020-04-02 5:19 UTC (permalink / raw) To: Tuấn-Anh Nguyễn; +Cc: Eli Zaretskii, emacs-devel [-- Attachment #1: Type: text/plain, Size: 5897 bytes --] > Also keep in mind that syntax highlighting is just one application. Other use cases usually want a full parse tree. like indentation, or so I think 🤔, but indentation may be one of those use cases. El mié., 1 de abril de 2020 22:22, Tuấn-Anh Nguyễn <ubolonton@gmail.com> escribió: > On Thu, Apr 2, 2020 at 2:33 AM Eli Zaretskii <eliz@gnu.org> wrote: > > > > > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > > > Date: Thu, 2 Apr 2020 00:55:45 +0700 > > > Cc: emacs-devel@gnu.org > > > > > > > Did you consider using the API where an application can provide a > > > > function to return text at a given offset? Such a function could be > > > > relatively easily implemented for Emacs. > > > > > > > > > > I don't understand what you mean. Below I'll explain how it works > > > currently. [...] If dynamic modules have direct access to the > > > buffer text, none of the above is an issue. > > > > > > Such direct access can be enabled by something like this: > > > > > > char* (*access_buffer_text) (emacs_env *env, > > > emacs_value buffer, > > > ptrdiff_t byte_offset, > > > ptrdiff_t *size_inout); > > > > > > Of course, such an API would require extensive documentation on how it > > > must be used, to ensure safety and correctness. > > > > I think you are moving too fast, and keep the current implementation > > in sight too much. > > > > I'm actually moving too slow here. I have thought about this part quite > a bit, but I'm currently focusing on other things, partially because > this is not painful bottleneck. > > > What I suggest is to step back and see how such direct access, if it > > were available, could be used with tree-sitter. Let's forget about > > modules for a moment and consider tree-sitter linked with Emacs and > > capable of calling any C function in core. How would you use that? > > > > Buffer text is not exactly UTF-8, it's a superset of UTF-8. So one > > question to answer is what to do with byte sequences that are not > > valid UTF-8. Any suggestions or ideas? How does tree-sitter handle > > invalid byte sequences in general? > > > > I haven't checked yet. It will probably bail out, which is usually the > desired behavior. The tree-sitter's author is likely open to making this > behavior configurable here, though. Alternatively, the direct access > function can offer different behaviors: as-is, bail-out, skip-over, or > null-out (tree-sitter will skip over null bytes, IIRC). > > > Also, direct access to buffer text generally means we must make sure > > GC never runs as long as pointers to buffer text are lying around. > > Can any Lisp run between calls to the reader function that the > > tree-sitter parser calls to access the buffer text? If so, we need to > > take care of that issue. > > > > With direct access, no Lisp code will be run between these calls. > > > Next, I'm still asking whether parsing the whole buffer when it is > > first created is necessary. Can we pass to the parser just a small > > chunk (say, 500 bytes) of the buffer around the window-full to be > > displayed next? If this presents problems, what are those problems? > > > > In principle (not in tree-sitter ATM), and in very specific cases, yes. > IMO that's the wrong focus on a premature optimization anyway. As others > noted, even in the pathological case of xdisp.c, the performance is > acceptable. Also keep in mind that syntax highlighting is just one > application. Other use cases usually want a full parse tree. > > If we really want to tackle this issue, there are other approaches to > consider, e.g. background parsing, or parsing up until a time limit, and > resume parsing when Emacs is idle. Tree-sitter's API supports the > latter. > > But again, both thought exercises and my usage so far point to this > being a non-issue. > > > IOW, the issue with exposing access to buffer text to modules is IMO > > secondary. My suggestion is first to figure out how to do this stuff > > efficiently from within Emacs itself, as if the module interface were > > not part of the equation. We can add that aspect back later. > > > > My opinion is that it's better to experiment with this kind of stuff > out-of-core. It can move forward faster that way, allowing more lessons > to be learned. Real lessons, involving real-world use cases, not thought > exercises. > > In a somewhat similar vein, writing emacs-tree-sitter highlighted real > issues with dynamic modules, which I'm going to write up sometime. > > > And yes, doing this by consing strings is not a good idea, it will > > slow things down and cause a lot of GC. It is best avoided. Thus my > > questions above. > > > > > > Btw, what do you do with the tree returned by the tree-sitter parser? > > > > store it in some buffer-local variable? If so, how much memory does > > > > such a tree take, and when, if ever, is that memory released? > > > > > > > > > > It's stored in a buffer-local variable. I haven't measured the memory > > > they take. Memory is released when the tree object is garbage-collected > > > (it's a `user-ptr'). > > > > So if I have many hundreds of buffers, I could have such a tree in > > each one of them indefinitely? Perhaps that's one more design issue > > to consider, given that the parsing is so fast. Similar to what we do > > with image and face caches -- we flush them from time to time, to keep > > the memory footprint in check. So a buffer that was not current more > > than some time interval ago could have its tree GCed. > > > > That can work. Alternatively, tree-sitter can add support for "folding" > subtrees, as Stefan suggested. > > -- > Tuấn-Anh Nguyễn > Software Engineer > > [-- Attachment #2: Type: text/html, Size: 7465 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 5:19 ` Jorge Javier Araya Navarro @ 2020-04-02 9:29 ` Stephen Leake 0 siblings, 0 replies; 43+ messages in thread From: Stephen Leake @ 2020-04-02 9:29 UTC (permalink / raw) To: emacs-devel Jorge Javier Araya Navarro <jorge@esavara.cr> writes: >> Also keep in mind that syntax highlighting is just one > application. Other use cases usually want a full parse tree. > > like indentation, or so I think 🤔, but indentation may be one of those use > cases. To correctly compute indentation for Ada code, you need to parse the full file initially. After than, indent-region to indent edited code fits nicely with incremental parse. -- -- Stephe ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 4:21 ` Tuấn-Anh Nguyễn 2020-04-02 5:19 ` Jorge Javier Araya Navarro @ 2020-04-02 10:37 ` Andrea Corallo 2020-04-02 11:14 ` Tuấn-Anh Nguyễn 2020-04-02 13:02 ` Stefan Monnier 2020-04-02 15:02 ` Eli Zaretskii 3 siblings, 1 reply; 43+ messages in thread From: Andrea Corallo @ 2020-04-02 10:37 UTC (permalink / raw) To: Tuấn-Anh Nguyễn; +Cc: Eli Zaretskii, emacs-devel Tuấn-Anh Nguyễn <ubolonton@gmail.com> writes: > In principle (not in tree-sitter ATM), and in very specific cases, yes. > IMO that's the wrong focus on a premature optimization anyway. As others > noted, even in the pathological case of xdisp.c, the performance is > acceptable. Please do not assume xdisp.c is the worst case scenario, I can testify it is not :) Andrea -- akrl@sdf.org ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 10:37 ` Andrea Corallo @ 2020-04-02 11:14 ` Tuấn-Anh Nguyễn 0 siblings, 0 replies; 43+ messages in thread From: Tuấn-Anh Nguyễn @ 2020-04-02 11:14 UTC (permalink / raw) To: Andrea Corallo; +Cc: Eli Zaretskii, emacs-devel On Thu, Apr 2, 2020 at 5:37 PM Andrea Corallo <akrl@sdf.org> wrote: > > Tuấn-Anh Nguyễn <ubolonton@gmail.com> writes: > > > In principle (not in tree-sitter ATM), and in very specific cases, yes. > > IMO that's the wrong focus on a premature optimization anyway. As others > > noted, even in the pathological case of xdisp.c, the performance is > > acceptable. > > Please do not assume xdisp.c is the worst case scenario, I can testify > it is not :) > Fair enough. -- Tuấn-Anh Nguyễn Software Engineer ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 4:21 ` Tuấn-Anh Nguyễn 2020-04-02 5:19 ` Jorge Javier Araya Navarro 2020-04-02 10:37 ` Andrea Corallo @ 2020-04-02 13:02 ` Stefan Monnier 2020-04-02 15:06 ` Eli Zaretskii 2020-04-02 15:02 ` Eli Zaretskii 3 siblings, 1 reply; 43+ messages in thread From: Stefan Monnier @ 2020-04-02 13:02 UTC (permalink / raw) To: Tuấn-Anh Nguyễn; +Cc: Eli Zaretskii, emacs-devel > If we really want to tackle this issue, there are other approaches to > consider, e.g. background parsing, or parsing up until a time limit, and > resume parsing when Emacs is idle. Tree-sitter's API supports the > latter. Emacs is in dire need to exploit multiple cores. It would be very natural to run tree-parser's initial parse asynchronously in a separate thread. This requires to pass tree-parser a *copy* of the buffer's text. Stefan ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 13:02 ` Stefan Monnier @ 2020-04-02 15:06 ` Eli Zaretskii 0 siblings, 0 replies; 43+ messages in thread From: Eli Zaretskii @ 2020-04-02 15:06 UTC (permalink / raw) To: Stefan Monnier; +Cc: ubolonton, emacs-devel > From: Stefan Monnier <monnier@iro.umontreal.ca> > Cc: Eli Zaretskii <eliz@gnu.org>, emacs-devel@gnu.org > Date: Thu, 02 Apr 2020 09:02:41 -0400 > > > If we really want to tackle this issue, there are other approaches to > > consider, e.g. background parsing, or parsing up until a time limit, and > > resume parsing when Emacs is idle. Tree-sitter's API supports the > > latter. > > Emacs is in dire need to exploit multiple cores. True. > It would be very natural to run tree-parser's initial parse > asynchronously in a separate thread. This requires to pass > tree-parser a *copy* of the buffer's text. This also raises a lot of issues and problems of its own, of which copying the buffer is the least one. We don't yet have any example of such asynchronous processing, so this feature will have to be the first that does it, and will then have to resolve the issues in addition to doing its main job. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 4:21 ` Tuấn-Anh Nguyễn ` (2 preceding siblings ...) 2020-04-02 13:02 ` Stefan Monnier @ 2020-04-02 15:02 ` Eli Zaretskii 2020-04-03 14:34 ` Tuấn-Anh Nguyễn 3 siblings, 1 reply; 43+ messages in thread From: Eli Zaretskii @ 2020-04-02 15:02 UTC (permalink / raw) To: Tuấn-Anh Nguyễn; +Cc: emacs-devel > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > Date: Thu, 2 Apr 2020 11:21:49 +0700 > Cc: emacs-devel@gnu.org > > > Buffer text is not exactly UTF-8, it's a superset of UTF-8. So one > > question to answer is what to do with byte sequences that are not > > valid UTF-8. Any suggestions or ideas? How does tree-sitter handle > > invalid byte sequences in general? > > > > I haven't checked yet. It will probably bail out, which is usually the > desired behavior. "Bail out" meaning that this breaks the parse? I'd be surprised if that was what happens in these cases. But if it does, we will need to replace such sequences by the likes of U+FFFD in the reader function we provide. > With direct access, no Lisp code will be run between these calls. Then this issue is taken care of. > > Next, I'm still asking whether parsing the whole buffer when it is > > first created is necessary. Can we pass to the parser just a small > > chunk (say, 500 bytes) of the buffer around the window-full to be > > displayed next? If this presents problems, what are those problems? > > > > In principle (not in tree-sitter ATM), and in very specific cases, yes. > IMO that's the wrong focus on a premature optimization anyway. I tried to explain elsewhere why I don't think this is premature. > As others noted, even in the pathological case of xdisp.c, the > performance is acceptable. xdisp.c is not a pathological case for me, I edit it very frequently. More importantly, this scales poorly. > Also keep in mind that syntax highlighting is just one > application. Other use cases usually want a full parse tree. Other applications have different restrictions and requirements, so trying to satisfy all of them at once might not be the best way. > If we really want to tackle this issue, there are other approaches to > consider, e.g. background parsing, or parsing up until a time limit, and > resume parsing when Emacs is idle. Tree-sitter's API supports the > latter. JIT-lock already supports background fontification (see jit-lock-stealth-time), so using such parsers from jit-lock gives that to you at almost no cost. > > IOW, the issue with exposing access to buffer text to modules is IMO > > secondary. My suggestion is first to figure out how to do this stuff > > efficiently from within Emacs itself, as if the module interface were > > not part of the equation. We can add that aspect back later. > > > > My opinion is that it's better to experiment with this kind of stuff > out-of-core. It can move forward faster that way, allowing more lessons > to be learned. Real lessons, involving real-world use cases, not thought > exercises. I'm talking about trying different design ideas. It is best to do that without being limited by what modules can and cannot do. Building a hacked version of Emacs to test those ideas doesn't necessarily contradict the desire to collect real-life experience. IOW, I suggest to test alternative design ideas that are not based on copying portions of the buffer via Lisp strings. If those ideas are workable (and I think they are), they will support a more scalable implementation that exerts less memory pressure on Emacs and on the host system. HTH ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: Reliable after-change-functions (via: Using incremental parsing in Emacs) 2020-04-02 15:02 ` Eli Zaretskii @ 2020-04-03 14:34 ` Tuấn-Anh Nguyễn 0 siblings, 0 replies; 43+ messages in thread From: Tuấn-Anh Nguyễn @ 2020-04-03 14:34 UTC (permalink / raw) To: Eli Zaretskii; +Cc: emacs-devel On Thu, Apr 2, 2020 at 10:02 PM Eli Zaretskii <eliz@gnu.org> wrote: > > > From: Tuấn-Anh Nguyễn <ubolonton@gmail.com> > > Date: Thu, 2 Apr 2020 11:21:49 +0700 > > Cc: emacs-devel@gnu.org > > > > > Buffer text is not exactly UTF-8, it's a superset of UTF-8. So one > > > question to answer is what to do with byte sequences that are not > > > valid UTF-8. Any suggestions or ideas? How does tree-sitter handle > > > invalid byte sequences in general? > > > > > > > I haven't checked yet. It will probably bail out, which is usually the > > desired behavior. > > "Bail out" meaning that this breaks the parse? I'd be surprised if > that was what happens in these cases. But if it does, we will need to > replace such sequences by the likes of U+FFFD in the reader function > we provide. > Agreed. I'll try checking its behavior on this. > > > IOW, the issue with exposing access to buffer text to modules is IMO > > > secondary. My suggestion is first to figure out how to do this stuff > > > efficiently from within Emacs itself, as if the module interface were > > > not part of the equation. We can add that aspect back later. > > > > > > > My opinion is that it's better to experiment with this kind of stuff > > out-of-core. It can move forward faster that way, allowing more lessons > > to be learned. Real lessons, involving real-world use cases, not thought > > exercises. > > I'm talking about trying different design ideas. It is best to do > that without being limited by what modules can and cannot do. > Building a hacked version of Emacs to test those ideas doesn't > necessarily contradict the desire to collect real-life experience. > > IOW, I suggest to test alternative design ideas that are not based on > copying portions of the buffer via Lisp strings. If those ideas are > workable (and I think they are), they will support a more scalable > implementation that exerts less memory pressure on Emacs and on the > host system. > > HTH > Yeah, I agree that going through Lisp strings for this is sub-optimal. When I have time to come back to this part, I'll hack up my local Emacs to allow dynamic modules to access buffer texts directly, to test out the idea. -- Tuấn-Anh Nguyễn Software Engineer P.S. Sorry Gmail messed up my first reply. ^ permalink raw reply [flat|nested] 43+ messages in thread
end of thread, other threads:[~2020-04-04 7:13 UTC | newest] Thread overview: 43+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2020-03-31 17:07 Reliable after-change-functions (via: Using incremental parsing in Emacs) Tuấn Anh Nguyễn 2020-03-31 17:50 ` Eli Zaretskii 2020-04-01 6:17 ` Tuấn Anh Nguyễn 2020-04-01 13:26 ` Eli Zaretskii 2020-04-01 15:47 ` Jorge Javier Araya Navarro 2020-04-01 16:07 ` Eli Zaretskii 2020-04-01 17:55 ` Tuấn-Anh Nguyễn 2020-04-01 19:33 ` Eli Zaretskii 2020-04-01 23:38 ` Stephen Leake 2020-04-02 0:25 ` Stephen Leake 2020-04-02 2:46 ` Stefan Monnier 2020-04-02 4:36 ` Tuấn-Anh Nguyễn 2020-04-02 14:44 ` Eli Zaretskii 2020-04-02 15:19 ` Stefan Monnier 2020-04-03 2:49 ` [SPAM UNSURE] " Stephen Leake 2020-04-03 7:47 ` Eli Zaretskii 2020-04-03 18:11 ` Stephen Leake 2020-04-03 18:46 ` Eli Zaretskii 2020-04-04 0:05 ` Stephen Leake 2020-04-03 8:11 ` Robert Pluim 2020-04-03 11:00 ` Eli Zaretskii 2020-04-03 11:09 ` Robert Pluim 2020-04-03 12:44 ` Eli Zaretskii 2020-04-03 11:21 ` John Yates 2020-04-03 12:50 ` Eli Zaretskii 2020-04-02 5:21 ` Tuấn-Anh Nguyễn 2020-04-02 9:24 ` [SPAM UNSURE] " Stephen Leake 2020-04-02 14:36 ` Eli Zaretskii 2020-04-03 2:27 ` Stephen Leake 2020-04-03 7:43 ` Eli Zaretskii 2020-04-03 17:45 ` Stephen Leake 2020-04-03 18:31 ` Eli Zaretskii 2020-04-04 0:04 ` Stephen Leake 2020-04-04 7:13 ` Eli Zaretskii 2020-04-02 4:21 ` Tuấn-Anh Nguyễn 2020-04-02 5:19 ` Jorge Javier Araya Navarro 2020-04-02 9:29 ` Stephen Leake 2020-04-02 10:37 ` Andrea Corallo 2020-04-02 11:14 ` Tuấn-Anh Nguyễn 2020-04-02 13:02 ` Stefan Monnier 2020-04-02 15:06 ` Eli Zaretskii 2020-04-02 15:02 ` Eli Zaretskii 2020-04-03 14:34 ` Tuấn-Anh Nguyễn
Code repositories for project(s) associated with this public inbox https://git.savannah.gnu.org/cgit/emacs.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).